How Agents Never Sleep works

A walkthrough of one unattended run, start to finish — the mechanism, not a pitch.

Definition

What is unattended agent execution?

Unattended agent execution is a coding agent working through a backlog of tickets with nobody watching each step — no human present to answer a clarifying question. It only works if the agent has a standing rule for what to do the moment it isn't sure: assume and keep going, defer that one decision and move to the next ticket, or stop the whole run. Agents Never Sleep (ANS) is the governance layer that supplies that rule as a durable, enforced contract, so one ambiguous ticket at 2 a.m. never freezes the other thirty-nine.

1. Launch preflight — before any token is spent

A headless run starts through the launcher (bin/ans-run), a deterministic GO/NO-GO gate that runs before the agent CLI boots — because by the time the agent's own preflight checks run, the first tokens are already spent on a run that might be doomed.

The launcher checks, in order:

Config trust: the repo's .claude/agents-never-sleep.json must be explicitly trusted once per user, keyed on its SHA-256. A new or changed config with nobody around to approve it is a NO-GO.
Identity: a configured target user; an unattended run is never left running as root.
Agent selection: a named preset, verified with a real --version probe so flag drift is caught before spend, not after.
Working-tree lock: a non-blocking flock(2), so two simultaneous launches on the same repo always yield exactly one winner. The kernel releases the lock automatically on any crash.
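The config-trust check can be sketched as follows. This is a minimal illustration, not the launcher's actual code: the trust-store file (a JSON object mapping a config path to its approved digest) and the names `config_digest`, `is_trusted`, and `trust` are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def config_digest(config_path: Path) -> str:
    """Return the SHA-256 hex digest of the config file's exact bytes."""
    return hashlib.sha256(config_path.read_bytes()).hexdigest()


def is_trusted(config_path: Path, trust_store: Path) -> bool:
    """True only if this exact config content was approved earlier.

    The trust-store layout is hypothetical: a JSON object mapping the
    resolved config path to the digest a human approved.
    """
    if not trust_store.exists():
        return False  # nothing approved yet: NO-GO
    approved = json.loads(trust_store.read_text())
    return approved.get(str(config_path.resolve())) == config_digest(config_path)


def trust(config_path: Path, trust_store: Path) -> None:
    """Record a human's one-time approval of the current config content."""
    approved = json.loads(trust_store.read_text()) if trust_store.exists() else {}
    approved[str(config_path.resolve())] = config_digest(config_path)
    trust_store.write_text(json.dumps(approved, indent=2))
```

Because the approval is keyed on the digest, any edit to the config changes the hash and the file falls back to untrusted until someone approves it again.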

Exit codes are plain: 0 (started), 64 (NO-GO), 65 (the working tree is already busy). Autonomy flags, meaning the permission mode that lets a detached run proceed without stalling at an approval prompt, are never a default; a preset only becomes launchable once a human has confirmed exactly what that flag grants.
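The lock and the exit codes fit together roughly like this. It is a sketch under stated assumptions: the function names, the lock-file path, and the idea of passing checks as callables are illustrative, while the three exit codes and the non-blocking `flock` come from the description above. It relies on Python's `fcntl` module, so it is Unix-only.

```python
import fcntl
import os

EXIT_STARTED = 0   # run launched
EXIT_NO_GO = 64    # a preflight check failed
EXIT_BUSY = 65     # another run holds the working-tree lock


def try_lock(lock_path: str):
    """Try to take an exclusive, non-blocking flock on lock_path.

    Returns the open file descriptor on success (keep it open for the
    lifetime of the run) or None if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def preflight(checks, lock_path: str):
    """Run the checks in order, then take the lock.

    Returns (exit_code, lock_fd); lock_fd is None unless the run started.
    """
    for check in checks:
        if not check():
            return EXIT_NO_GO, None
    fd = try_lock(lock_path)
    if fd is None:
        return EXIT_BUSY, None
    return EXIT_STARTED, fd
```

The lock lives on the open file descriptor, so when the holding process dies the kernel closes the descriptor and the lock goes with it; no stale lock file needs cleaning up.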

Tickets move through the machine one checkpoint at a time.

2. The per-ticket state machine

Once running, the agent drives a two-command loop against the harness: next asks for one ticket to work, complete records what happened. Each call is a fresh subprocess over durable, atomically-written state — a crash between the two loses nothing, because nothing lives only in memory.
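One common way to get atomically written state is to write a temporary file and rename it over the real one. The sketch below assumes a single JSON state file and a hypothetical state shape (`in_flight`, `outcomes`); the harness's real format is not described here.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Write state so a reader sees either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path: str) -> dict:
    """Load state, or start empty if no run has written any yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"in_flight": None, "outcomes": {}}
```

A crash before `os.replace` leaves the previous state file untouched, which is what lets each `next` or `complete` call start as a fresh subprocess.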

Every ticket ends in exactly one of seven outcome states, chosen so the morning-after response is never ambiguous:

DONE — implemented, deterministic gate green.
DONE_LOW_CONFIDENCE — gate green, but a high-risk diff's delegated review raised concerns, errored, or never ran; needs daylight review.
PARKED_DECISION — one decision deferred to a human; the run kept moving.
PARKED_FOUNDATIONAL — a foundational ambiguity; dependent tickets are quarantined until it's resolved.
BLOCKED_ENV — the environment blocked progress (a git lock, an un-runnable gate) — not the agent's fault.
FAILED_RETRYABLE — a gate caught a bug the diff introduced; the edit was reverted; safe to retry.
FAILED_BUG_IN_AGENT — repeated failures suggest a systematic problem; needs a human look.
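As a data structure, the seven outcomes map naturally onto an enumeration. The state names below are the ones listed above; the `MORNING_ACTION` table is only a paraphrase of those descriptions, added for illustration.

```python
from enum import Enum


class Outcome(Enum):
    DONE = "DONE"
    DONE_LOW_CONFIDENCE = "DONE_LOW_CONFIDENCE"
    PARKED_DECISION = "PARKED_DECISION"
    PARKED_FOUNDATIONAL = "PARKED_FOUNDATIONAL"
    BLOCKED_ENV = "BLOCKED_ENV"
    FAILED_RETRYABLE = "FAILED_RETRYABLE"
    FAILED_BUG_IN_AGENT = "FAILED_BUG_IN_AGENT"


# Illustrative only: one human follow-up per outcome, paraphrased from the list.
MORNING_ACTION = {
    Outcome.DONE: "nothing to do",
    Outcome.DONE_LOW_CONFIDENCE: "review the diff in daylight",
    Outcome.PARKED_DECISION: "make the deferred decision",
    Outcome.PARKED_FOUNDATIONAL: "resolve the ambiguity to release dependent tickets",
    Outcome.BLOCKED_ENV: "fix the environment",
    Outcome.FAILED_RETRYABLE: "retry",
    Outcome.FAILED_BUG_IN_AGENT: "investigate the repeated failures",
}
```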

3. Decide — then implement

Before any file is touched, the harness classifies the ticket's blast radius and decides PROCEED, PARK, or HALT. Only a PROCEED ticket reaches the agent for implementation; a PARK is recorded before any edit, because the decision itself is the work product.

Once a PROCEED ticket is handed over, the agent — the worker in this design — edits exactly that one ticket's files. The harness snapshots the working tree first, so whatever the agent does can be undone.

The gate decides: the stamp only lands when the tests pass.

4. The deterministic gate — the only hard block

A gate is a shell command — your test suite — run after the edit. Exit 0 is green; non-zero is red. This is the only thing in the whole run that can hard-block a ticket; everything else (a delegated model review, a specialist lens) is advisory and can only withhold a trust stamp, never revert work or stop the run.

A red gate is classified, not just reported. A failure the diff clearly introduced triggers a revert to the last green commit and is recorded as FAILED_RETRYABLE. A failure that looks pre-existing, flaky, or environmental downgrades confidence but keeps the work. A gate that can't run at all (a lock, a timeout, a non-interactive prompt it can't answer) is recorded as BLOCKED_ENV, never a silent stop. ANS never deletes or skips a failing test to force green.
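A rough sketch of running a gate and classifying the result follows. The function names, the returned labels (`accept`, `revert`, `keep_downgraded`, `blocked`), and the `red_before_edit` flag are assumptions for illustration; in particular, how the harness decides that a failure is pre-existing or flaky is not specified above, so the sketch reduces it to one boolean.

```python
import shlex
import subprocess


def run_gate(command: str, timeout_s: float) -> str:
    """Run the gate command and return 'green', 'red', or 'blocked'."""
    try:
        result = subprocess.run(
            shlex.split(command),
            stdin=subprocess.DEVNULL,   # never wait on an interactive prompt
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return "blocked"  # timed out, or the command could not be started
    return "green" if result.returncode == 0 else "red"


def classify(gate_status: str, red_before_edit: bool) -> str:
    """Map a gate result to what happens to the ticket's work.

    red_before_edit is a hypothetical input: whether the same gate was
    already failing on the snapshot taken before the agent's edit.
    """
    if gate_status == "green":
        return "accept"
    if gate_status == "blocked":
        return "blocked"            # recorded as BLOCKED_ENV
    if red_before_edit:
        return "keep_downgraded"    # failure predates the diff: keep the work
    return "revert"                 # the diff introduced it: FAILED_RETRYABLE
```

For the ticket format used in this walkthrough, `run_gate("pytest tests/webhooks/ -x", timeout_s=600)` would be a typical call; the timeout value is arbitrary.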

---
id: fix-flaky-webhook-retry
title: Retry webhook delivery on 5xx with backoff
blast_radius: medium      # optional hint — the harness auto-classifies from the diff
gate: pytest tests/webhooks/ -x
---

The webhook sender should retry on 5xx responses with exponential backoff
(max 3 attempts). Cover it with a test that simulates two 500s then a 200.

Illustrative example — a ticket is a Markdown file with an optional YAML front-matter block; the body is the only required part, and gate names the command that decides pass or fail for that ticket.
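Reading such a ticket can be sketched with a few lines of standard-library Python. This is deliberately not a YAML parser: it assumes flat `key: value` lines only, and the name `parse_ticket` is made up for the example.

```python
def parse_ticket(text: str) -> tuple[dict, str]:
    """Split a ticket into (front_matter, body).

    Handles only flat `key: value` lines between two `---` delimiters.
    """
    lines = text.splitlines()
    meta: dict[str, str] = {}
    body_start = 0
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                body_start = i + 1
                break
            key, sep, value = lines[i].partition(":")
            if sep:
                # Drop a trailing " # comment". Naive: this would also
                # truncate a value that legitimately contains " #".
                meta[key.strip()] = value.split(" #", 1)[0].strip()
        else:
            # No closing delimiter: treat the whole file as body.
            meta, body_start = {}, 0
    body = "\n".join(lines[body_start:]).strip()
    if not body:
        raise ValueError("a ticket needs a body")
    return meta, body
```

A ticket with no front matter at all still parses, which matches the rule that the body is the only required part.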

5. ASK / PARK / HALT — the decision points

Three distinct responses to uncertainty, never collapsed into one. ASK is forbidden the moment a run is unattended — there is nobody there to answer, so it is converted to PARK automatically. That leaves two real choices:

PARK — defer this one ticket or decision, and move straight to the next independent ticket. This is normal and healthy, not a stop. A park always records why, the candidate interpretations, and the exact human decision waiting in the morning. Real examples that Hard-PARK: which direction a database migration should go, a change to a public or shared API contract, anything touching a security or tenant-isolation boundary, and anything involving money, billing, or pricing.
HALT — stop the whole run. Reserved for genuinely irreversible danger with no safety net at all — for example, no version control present and none that can be created. HALT is rare by design; most uncertainty is a PARK, not a HALT.

The discipline behind the choice is blast radius: naming, internal structure, log wording, or a choice between two equivalent local implementations can PROCEED (assume, log, keep going, reversibly). Anything with a large blast radius, or anything genuinely unclassifiable, is parked. A wrongly-parked small item costs a five-second morning decision; a wrongly-assumed big one costs a night of wrong work in the wrong direction.
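The decision rule reduces to a small function. In this sketch the topic labels and the `can_snapshot` flag are hypothetical stand-ins for whatever the harness actually inspects; the important property is the default, where anything not recognised as small is parked.

```python
from enum import Enum


class Decision(Enum):
    ASK = "ASK"
    PROCEED = "PROCEED"
    PARK = "PARK"
    HALT = "HALT"


# Hypothetical labels for the small-blast-radius examples in the text.
SMALL_RADIUS = {
    "naming",
    "internal_structure",
    "log_wording",
    "equivalent_local_choice",
}


def decide(topic: str, can_snapshot: bool) -> Decision:
    """Pick a response to an uncertain point in a ticket."""
    if not can_snapshot:
        return Decision.HALT      # no safety net at all: stop the run
    if topic in SMALL_RADIUS:
        return Decision.PROCEED   # assume, log, keep going, reversibly
    return Decision.PARK          # large or unclassifiable: defer to a human


def for_unattended(decision: Decision) -> Decision:
    """Nobody can answer a question at night, so ASK becomes PARK."""
    return Decision.PARK if decision is Decision.ASK else decision
```

Note that the hard-park examples (migrations, public API contracts, security boundaries, money) need no list of their own here: they fall through to PARK simply by not being in `SMALL_RADIUS`.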

The watchdog keeps watch; a frozen run is restarted, resumably.

6. The watchdog — freeze, then a resumable restart

A hang is a different failure than a stop: the process is alive, doing nothing, and no hook watching for a premature stop can see it. A sustained provider-overload wave is the realistic cause — the run is still there, but its heartbeat has gone stale.

The watchdog is a sidecar that runs the unattended command as a child and polls its heartbeat file. When the heartbeat goes stale past a configured threshold, it kills the child and restarts it. Because ANS state is durable, the restart resumes from the last recorded state: the in-flight ticket's partial edits are reverted to its last snapshot, so no completed work is lost and nothing is double-counted. After a capped number of restarts it fires an alert and exits, rather than restarting forever. ans-run wraps every detached launch in the watchdog by default.
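The heartbeat loop might look like the following. It is a simplified sketch: the function names, the use of the file's modification time as the heartbeat, and the return values are assumptions, and the snapshot revert on restart is left out entirely.

```python
import os
import subprocess
import time
from typing import Optional


def is_stale(heartbeat_path: str, stale_after_s: float,
             now: Optional[float] = None) -> bool:
    """True if the heartbeat file is missing or older than the threshold."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(heartbeat_path) > stale_after_s
    except FileNotFoundError:
        return True


def supervise(argv, heartbeat_path, stale_after_s, max_restarts, poll_s=1.0) -> str:
    """Run argv as a child and restart it when its heartbeat goes stale.

    Returns 'exited' if the child finishes on its own, or 'alert' once the
    restart cap is exhausted.
    """
    restarts = 0
    while True:
        # Touch the heartbeat so a fresh child gets a full grace period.
        with open(heartbeat_path, "a"):
            os.utime(heartbeat_path, None)
        child = subprocess.Popen(argv)
        while child.poll() is None:
            if is_stale(heartbeat_path, stale_after_s):
                child.kill()
                child.wait()
                break
            time.sleep(poll_s)
        else:
            return "exited"   # the child ended by itself; nothing to restart
        if restarts >= max_restarts:
            return "alert"    # cap reached: report instead of looping forever
        restarts += 1
```

The child is expected to update the heartbeat file as it makes progress; a process that is alive but silent is indistinguishable from a hang, which is the point of the check.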

The same sidecar also reaps its own leaked child processes (for example, an MCP server the agent spawned and never closed) strictly by parent-chain lineage from the run's own process ID, never by matching a process name, because a name match could kill an unrelated run on the same machine. Honest limit: if the supervisor itself is force-killed, it can't reap after the fact, so this reduces process leakage but does not eliminate it.
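Selecting processes by lineage is a graph walk over parent links. The sketch below assumes a snapshot of the process table is already available as a `pid -> parent pid` mapping; how that snapshot is obtained, and the name `descendants`, are left as assumptions.

```python
def descendants(root_pid: int, parent_of: dict[int, int]) -> set[int]:
    """Return every pid whose parent chain leads back to root_pid."""
    children: dict[int, list[int]] = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)

    found: set[int] = set()
    stack = [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found
```

Two runs of the same agent binary on one machine have identical process names but disjoint lineages, so only the walk from the run's own PID tells them apart.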

7. Drain — running until there's nothing left to do

The agent keeps calling next then complete until next returns a terminal status: DRAINED (the backlog is empty), HALTED (a HALT condition was hit), or LOW_YIELD (a circuit breaker tripped because most recent tickets are parking or blocking rather than completing — a signal that something about the backlog itself needs a human look, not more attempts). Attempt and loop caps force-park any single ticket that would otherwise burn the whole run, so one cursed item can't repeat the exact failure this whole design exists to avoid.
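The drain loop and its circuit breaker can be sketched as below. The callables `next_ticket` and `complete`, the window size, and the yield threshold are all hypothetical; in the real harness the terminal status is returned by `next` itself, which this sketch folds into one function for brevity.

```python
from collections import deque

# Outcomes that count as a ticket actually completing.
COMPLETED = {"DONE", "DONE_LOW_CONFIDENCE"}


def drain(next_ticket, complete, window=5, min_completed=1) -> str:
    """Loop until the backlog is empty, a HALT occurs, or yield collapses.

    next_ticket() returns a ticket id, or None when the backlog is empty.
    complete(ticket) returns the ticket's outcome, or "HALT".
    """
    recent = deque(maxlen=window)  # sliding window of the latest outcomes
    while True:
        ticket = next_ticket()
        if ticket is None:
            return "DRAINED"
        outcome = complete(ticket)
        if outcome == "HALT":
            return "HALTED"
        recent.append(outcome)
        completed = sum(o in COMPLETED for o in recent)
        if len(recent) == window and completed < min_completed:
            return "LOW_YIELD"  # mostly parking or blocking: stop and report
```

With the defaults, five consecutive tickets that all park or block end the run with LOW_YIELD, while a single completion inside the window keeps it going.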

8. The morning report

When the run ends, it writes a single ranked report (night-report.md) rather than leaving you to reconstruct what happened from logs. It states: what's done and trusted, what's done but needs daylight review (a high-risk change whose delegated review didn't clear it), what's parked — each with its candidate interpretations and the exact next action — what's blocked by the environment, and any blind spots: a degraded guarantee, a capability that wasn't available, a credential the run couldn't read. A low-yield night is flagged loudly in the report, so "the run finished" is never mistaken for "the work got done".