Uncertainty-gated autonomy

Autonomy it can prove it earned.

Most agents act on a prompted "I'm confident." This one acts only when its measured uncertainty maps to a calibrated probability above a risk-tiered bar, verifies every real-world outcome, and earns more autonomy from its own verified track record. The model proposes; the math disposes.

The decision

Bold where mistakes are cheap. Humble where they aren't.

For every action, it computes a calibrated P(correct) from its measured uncertainty, then compares it to the bar for that action's risk tier. Below the bar, it gathers more evidence or escalates to a human. Actions that move money, send, delete, or deploy are hard-gated to a human no matter how confident it is: that ceiling is in the math, not in a policy doc.

read

allow ≤ 15% error

Read-only, no side effect. Safe to explore even before it has a track record.

reversible

allow ≤ 5% error

Drafts, sandboxed code, scratch files. Waits for a verified track record.

external

allow ≤ 2% error

Posting, non-destructive API calls. Outward-facing, so a higher bar.

irreversible

human-gated

Money, send, delete, deploy — prepared and previewed, but you approve.
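
The four tiers above can be read as a small decision function. The following is a minimal sketch, not the project's actual code: the names `Tier`, `BARS`, and `gate` are hypothetical, and it assumes each "allow ≤ x% error" bar translates to a minimum calibrated P(correct) of 1 − x, met when P(correct) is at or above that value.

```python
from enum import Enum


class Tier(Enum):
    READ = "read"
    REVERSIBLE = "reversible"
    EXTERNAL = "external"
    IRREVERSIBLE = "irreversible"


# Minimum calibrated P(correct) per tier (1 minus the allowed error).
# None means no probability can clear the tier: a human must approve.
BARS = {
    Tier.READ: 0.85,        # allow <= 15% error
    Tier.REVERSIBLE: 0.95,  # allow <= 5% error
    Tier.EXTERNAL: 0.98,    # allow <= 2% error
    Tier.IRREVERSIBLE: None,
}


def gate(tier: Tier, p_correct: float) -> str:
    """Return 'act', 'escalate', or 'human' for one proposed action."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability in [0, 1]")
    bar = BARS[tier]
    if bar is None:
        # Hard gate: confidence is never consulted for this tier.
        return "human"
    # Below the bar, the caller gathers more evidence or escalates.
    return "act" if p_correct >= bar else "escalate"
```

For example, `gate(Tier.REVERSIBLE, 0.90)` returns `"escalate"`, while `gate(Tier.IRREVERSIBLE, 1.0)` still returns `"human"`. Putting the hard gate in the lookup table rather than in a separate check is one way to make the ceiling impossible to clear by confidence alone.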

A real run

Plan → gate → act → verify.

Every step is recorded: what it proposed, what the gate decided and why, and the verified outcome. From a live run on the demo server — nothing here is mocked.
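
One way to picture a recorded step is as a small immutable record serialized to one JSON line. This is an illustrative sketch only: the `StepRecord` class and its field names are assumptions, and the values are copied from the first step of the run shown on this page.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StepRecord:
    tool: str      # which tool the model proposed to call
    proposal: str  # the concrete action it proposed
    tier: str      # risk tier assigned to the action
    decision: str  # what the gate decided
    reason: str    # why the gate decided it
    outcome: str   # verified real-world result


record = StepRecord(
    tool="shell_read",
    proposal="free -m",
    tier="read",
    decision="act",
    reason="read-tier, monitor measured low uncertainty",
    outcome="verified · exit 0",
)

# One JSON object per line keeps the log append-only and easy to replay.
line = json.dumps(asdict(record), ensure_ascii=False)
print(line)
```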

shell_read · free -m · act

read-tier, monitor measured low uncertainty — safe to run

→ verified · exit 0

shell_read · rm -rf / · blocked

the gate allowed read-tier, but the tool's own whitelist refused a destructive command — defense in depth

→ failed · 'rm' is not in the read-only whitelist

run_python · escalate

reversible-tier code execution while the track record was thin — calibrated P(correct) below the 95% bar

→ escalated to a human (prepared, not executed)

run_python · act · after it earned it

after a clean verified track record, the same action cleared the bar and graduated

→ verified · stdout 1024
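
The blocked `rm -rf /` step shows a second check that runs inside the tool, independent of the gate. A minimal sketch of such a check follows, assuming the tool compares the first word of the command against a fixed set. The function name `check_read_only` and the contents of `READ_ONLY_COMMANDS` are hypothetical; only the error wording is taken from the run above.

```python
import shlex
from typing import List

# Hypothetical whitelist; the real tool's list is not shown on this page.
READ_ONLY_COMMANDS = frozenset({"free", "df", "uptime"})


def check_read_only(command: str) -> List[str]:
    """Return the parsed argv if the command is whitelisted, else raise."""
    argv = shlex.split(command)
    if not argv:
        raise PermissionError("empty command")
    if argv[0] not in READ_ONLY_COMMANDS:
        raise PermissionError(f"'{argv[0]}' is not in the read-only whitelist")
    return argv
```

Here `check_read_only("free -m")` returns `["free", "-m"]`, and `check_read_only("rm -rf /")` raises `PermissionError`. A check like this is only meaningful if the returned argv is then executed without a shell, so that characters such as `;` or `|` are passed as plain arguments rather than interpreted.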

Live · earned bars

How much autonomy the track record supports right now.

Read live from the demo server's flywheel — the durable log of (measured uncertainty, verified outcome) pairs the calibrator is fit on. The bars tighten as it runs.
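
To make "the calibrator is fit on" these pairs concrete, here is a sketch using histogram binning, one simple calibration technique. It is not necessarily the method this system uses. The class name `BinnedCalibrator`, the assumption that uncertainty lies in [0, 1], and the smoothing prior are all illustrative. The prior is chosen purely so that an empty log yields 0.9, which would clear a 15%-error bar but not a 5%-error bar.

```python
from typing import Iterable, Tuple


class BinnedCalibrator:
    """Histogram binning: measured uncertainty in [0, 1] -> P(correct)."""

    def __init__(self, n_bins: int = 5,
                 prior_correct: float = 9.0, prior_wrong: float = 1.0):
        self.n_bins = n_bins
        self.prior_correct = prior_correct  # pseudo-count of successes
        self.prior_wrong = prior_wrong      # pseudo-count of failures
        self.correct = [0] * n_bins
        self.total = [0] * n_bins

    def _bin(self, uncertainty: float) -> int:
        if not 0.0 <= uncertainty <= 1.0:
            raise ValueError("uncertainty must be in [0, 1]")
        return min(int(uncertainty * self.n_bins), self.n_bins - 1)

    def fit(self, pairs: Iterable[Tuple[float, bool]]) -> "BinnedCalibrator":
        """Refit from scratch on (measured uncertainty, verified outcome)."""
        self.correct = [0] * self.n_bins
        self.total = [0] * self.n_bins
        for uncertainty, ok in pairs:
            b = self._bin(uncertainty)
            self.total[b] += 1
            self.correct[b] += int(ok)
        return self

    def p_correct(self, uncertainty: float) -> float:
        """Smoothed empirical success rate of the matching bin."""
        b = self._bin(uncertainty)
        return (self.correct[b] + self.prior_correct) / (
            self.total[b] + self.prior_correct + self.prior_wrong
        )
```

Under these assumptions, a run of verified successes at low uncertainty pushes that bin's estimate above 0.95, while bins with no history stay at the prior and bins with verified failures drop below it.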

Selective-risk coverage per tier — the fraction of its own history it could act on while keeping empirical error under that tier's bar. This is read-only telemetry; the acting endpoint runs only on loopback with a token, and the hard human-gate on money/send/delete/deploy holds regardless.
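
Selective-risk coverage can be computed directly from the logged pairs. The sketch below is one standard way to do it, under stated assumptions: the function name `coverage_at_bar` is hypothetical, lower uncertainty is assumed to mean "act first", and the agent is assumed to act on everything at or below a single uncertainty cutoff.

```python
from typing import Sequence, Tuple


def coverage_at_bar(history: Sequence[Tuple[float, bool]],
                    max_error: float) -> float:
    """Largest fraction of history, taken in order of increasing
    uncertainty, whose empirical error rate stays within max_error."""
    if not history:
        return 0.0
    ordered = sorted(history, key=lambda pair: pair[0])
    errors = 0
    best = 0
    for n, (uncertainty, ok) in enumerate(ordered, start=1):
        errors += 0 if ok else 1
        # A cutoff cannot split items that share the same uncertainty.
        is_cut = n == len(ordered) or ordered[n][0] > uncertainty
        if is_cut and errors <= max_error * n:
            best = n
    return best / len(ordered)


history = [(0.1, True), (0.2, True), (0.3, True),
           (0.4, False), (0.5, True), (0.6, False)]
print(coverage_at_bar(history, 0.15))  # 0.5
```

With this toy history, only the three lowest-uncertainty actions can be covered at a 15% error bar, because including the fourth would raise the empirical error to 25%.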