Plan Execution

Execute approved plans with coordinated agent teams. Each task gets a dedicated Executor, Reviewer, and Tester that work together through a structured pipeline — planning, implementation, code review, and testing happen automatically.

Team members run as tmux panes — persistent, separately-running sessions that message each other live. This requires teammateMode: "tmux" in ~/.claude/settings.json and launching Claude inside a tmux session (both configured by /uc:setup). Without them, named teammates default to in-process with no panes; plan-execution detects this at startup and stops with a remediation rather than silently running a degraded in-process pipeline.

At startup the Lead also names the tmux window UC::P-NNN::<plan title> so it is identifiable in the status bar. Planning modes set a mode-form name on entry (UC::Feature::<subject>, UC::Debug::<subject>) and upgrade to the plan form once the plan is written — the plan ID takes priority. You can set it by hand any time with /uc:rename-window.

Running a plan

/uc:plan-execution {plan-name}

Before spawning any agents, execution presents a cost estimate and asks for confirmation. You'll see the task count, concurrency level, and estimated token usage.

The task pipeline

Each task in your plan gets a dedicated mini-team that self-coordinates:

Role	Model	What it does
Executor	Opus	Explores codebase, writes implementation plan, implements code, drives the review/test cycle
Reviewer	Sonnet	Reads standards and architecture before code is written, gives plan feedback, performs formal code review
Tester	Sonnet	Tests against product docs and success criteria, writes missing test coverage, validates UI in browser

One plan-wide teammate supports all tasks:

Project Manager (Sonnet) — monitors progress, detects stalls, tracks usage, maintains live execution state files

Knowledge lives per-task in each tasks/task-N/task.md's **Research:** section — a list of pointers to durable files under documentation/technology/research/, populated by planning Stage 2 via /uc:research. Lead reviews research coverage per-task just before spawning each team and fills any gaps. Mid-execution, teammates send QUERY: {question} to Lead for external docs; Lead runs /uc:research, replies with an ANSWER:, and appends the new pointer to the task's task.md so it's durable for re-spawns and future teammates. The distinction between Teammates (the Agent tool in teammate mode — name + run_in_background, stateful, tracked on the team graph) and Subagents (the Agent tool in one-shot mode — foreground, no name, stateless, invisible to the team graph) keeps the team focused on pipeline work while allowing cheap isolated research lookups.

How a task flows

Planning modes have already written tasks/task-N/task.md for every task (authoritative content: description, files, patterns, Research pointers, success criteria, dependencies). The plan README holds a flat task heading index only.
Lead runs its pre-spawn checklist — reviews task.md's Research pointers, runs /uc:research for any gap or staleness, then spawns Executor + Reviewer in parallel (minimal pointer prompts; agents read the task directory on startup).
Reviewer speaks first. After its startup read, Reviewer synthesizes a REVIEWER TAKE from task.md + standards + architecture + patterns and sends it to Executor before Executor plans.
Executor reads task.md and explores the codebase while the Reviewer is still building its take. It blocks on REVIEWER TAKE before writing tasks/task-N/plan.md — waiting time is productive exploration and mental drafting, not idle. Once the take arrives, plan.md is written once, with the take baked in, as a thin execution delta (approach per file, criterion-to-approach mapping by ID, reviewer-take incorporation, risks). It does not restate task.md content.
Executor runs a deviation self-check against task.md. If clean (all files in scope, all criteria mapped, no reviewer-take contradictions) it proceeds directly to implementation — no Lead gate. If the plan deviates, Executor sends ADVICE REQUEST task-N [deviation] and waits for APPROVED (Lead amends task.md and broadcasts) or AMEND.
Executor implements. Reviewer reads files as they're written via progress updates.
The moment code is done — before writing impl.md — Executor signals "code complete" to Lead. Lead lazy-spawns the Tester (which runs its own startup read in parallel) and may pre-spawn the next dependent task's team into a pipeline-wait gate.
Executor writes tasks/task-N/impl.md (delta only — created/modified files, exports, INTEGRATION, GOTCHA), broadcasts via CommunicateTeam, then requests review and test via CommunicateTeamMember with signal backup.
Reviewer and Tester run in parallel. Both must PASS. Verdicts are sent via CommunicateTeamMember (with signal + content files for FAIL details). Executor uses WaitForTeamMember to receive verdicts. If either fails, Executor fixes, updates impl.md, re-broadcasts, and re-submits.
Both pass → task done. Lead sends Implementation approved to any parked pipeline successor, shuts the team down, and fills the slot with the next unblocked task.

An optional ADVICE channel is open throughout for Executor to ask Lead for judgment on complicated problems, deep-reasoning design calls, or knowledge about orchestration context (other tasks in flight, plan history, user intent). ADVICE is non-blocking except for the mandatory [deviation] case.

Concurrency

Multiple tasks run in parallel based on plan size:

Plan size	Concurrent teams
1–3 tasks	1–2
4–8 tasks	2–3
9+ tasks	3–4

Tasks normally spawn when their dependencies are completed and a slot is available. As a pipeline optimization, when an Executor signals "code complete", the Lead may pre-spawn the next dependent task into a planning stage if a concurrency slot is free — that successor researches and plans during the predecessor's review/test window, then begins coding only after the predecessor fully passes.

What you see during execution

Execution state — the PM maintains JSON state files showing task status, active stages, and timing
Escalation questions — if a task exceeds 10 fix cycles or discovers something plan-invalidating, you'll be asked to decide
Completion summary — tasks completed, files modified, decisions made, test results, and follow-up items

Between these, the system runs silently. The Lead doesn't narrate what agents are doing.

Checkpoints and session recovery

Checkpoints save automatically every 3 completed tasks, and on any usage pause. If your session dies mid-execution:

Run the same /uc:plan-execution {plan-name} command again
The system detects the checkpoint and shows you the progress so far
Confirm resume — completed tasks are skipped, incomplete tasks are re-spawned with their context

All progress is preserved in the plan directory: task artifacts, shared notes, and checkpoint files survive across sessions.

Task directory as the team's single source of truth

Every task gets its own directory at documentation/plans/{plan}/tasks/task-N/ containing these files:

task.md — written by the planning mode at Stage 4 (Lead may amend mid-execution). Authoritative task brief: description, files, patterns, **Research:** pointers, success criteria, dependencies. Every team member reads this on startup.
signals.jsonl — append-only JSONL file managed by the execution communication protocol. All agents use CommunicateTeamMember/CommunicateTeam to write signals; each agent arms one persistent tail -F "inbox" monitor at startup that follows this file and wakes it the instant any signal relevant to its role is appended (so WaitForTeamMember is just "yield your turn" — no per-wait monitors, no re-arm churn) — SendMessage is the primary (immediate) channel; the signal file is the durable shared state log (crash recovery + observability) and the authoritative delivery channel, not merely a backup. 21 signal types cover the full pipeline lifecycle. Critical for crash recovery — re-spawned agents infer precise pipeline state from the signal log, which a live-only channel like SendMessage cannot provide.
plan.md — written by the Executor during its Phase 3. A thin execution delta that extends task.md — concrete function/class/signature choices per file, criterion-to-approach mapping by ID, reviewer-take incorporation, risks. Does not restate task.md content.
impl.md — written by the Executor during Phase 4.5 after code complete. An implementation delta — created/modified files with line ranges, exports, INTEGRATION notes for other tasks, GOTCHA notes for library quirks. Does not restate task.md or plan.md content.
take.md — written by the Reviewer before sending the REVIEWER TAKE. Persistent copy of the standards-aware perspective that shapes the Executor's plan.
review-feedback.md — written by the Reviewer on FAIL verdict. Structured failure details (overwritten on each re-review cycle).
test-feedback.md — written by the Tester on FAIL verdict. Structured failure details (overwritten on each re-test cycle).

Spawn prompts carry no inline task content — they're minimal pointers. All team members (including lazy-spawned Tester, pipeline successors, and crash re-spawns) follow the shared task-team-startup protocol to read the task directory on their first action. File writes broadcast FILE-UPDATED task-N/{file}: reason to active teammates so state stays in sync without verbose messages.

Why Reviewer speaks first

Reviewer's unique value is standards / architecture / patterns knowledge — the stuff it reads from documentation/technology/standards/ and documentation/technology/architecture/ that Executor doesn't touch. In the old model, that knowledge only surfaced after Executor had already written a plan, making Reviewer's input an advisory round-trip on work that was already done.

In the new model, Reviewer front-loads that knowledge as a REVIEWER TAKE message sent to Executor before plan.md is written. Executor incorporates it directly into plan.md as a primary input. Three wins:

Parallelizes work. Reviewer reads standards while Executor explores the codebase — no serial bottleneck.
Plan.md gets written once. Executor blocks on the take before writing, so plan.md is authored a single time with the take baked in — no draft/feedback/rewrite cycle.
Lean plan.md. plan.md is a thin extension of task.md + reviewer take, not a multi-section document retro-fit to accommodate teammate concerns.

Reviewer's later role (formal code review on "ready for review") is unchanged. Its upfront voice is new; its backend voice is the same.

Lead plan review is gone — ADVICE and deviation self-check

The old blocking "Lead plan review" gate is removed. Rationale: task.md already encodes the authoritative scope (user-approved at Stage 4), REVIEWER TAKE covers standards/architecture fit, and Executor's plan.md is structurally a thin delta that can't silently expand scope without visible deviation.

Before implementing, Executor runs a deviation self-check:

Every file plan.md proposes to touch appears in task.md's Files list
Every success criterion in task.md has a mapping entry in plan.md
plan.md does not contradict any constraint from the REVIEWER TAKE

If all three pass, Executor proceeds to implementation with zero Lead round-trips. If any fails, Executor sends ADVICE REQUEST task-N [deviation]: {reason} and waits for APPROVED (Lead amends task.md and broadcasts) or AMEND: {instructions}.

Executor can also pull Lead's judgment at any time (planning or implementation) via ADVICE REQUEST task-N [complicated|deep-reasoning|knowledge]: {context + question}. Those cases are non-blocking — Executor decides whether to wait. ADVICE is distinct from QUERY: (external library docs via /uc:research): ADVICE is for Lead's judgment and orchestration context, QUERY is for external knowledge.

Execution communication protocol

SendMessage is unreliable (silent failures, name/ID mismatches, orphaned inboxes — multiple open bugs in Claude Code). All agents use a unified communication protocol with three procedures (named conventions defined by the protocol — not Claude Code tools, so they won't appear in a tool search). Each composes the real primitives SendMessage and Monitor (both loaded via ToolSearch at startup) with an append to signals.jsonl:

CommunicateTeamMember(to, message, signal?, content_file?) — send to one agent. Writes content file (if any), appends signal to signals.jsonl (if any), then attempts SendMessage (best-effort). The ordering is load-bearing: content before signal, signal before SendMessage.
CommunicateTeam(message, signal?, content_file?) — broadcast to all active teammates + Lead. Same ordering, SendMessage to each recipient.
WaitForTeamMember(signal, from?) — wait to receive a signal. SendMessage is the primary wake and is processed immediately; the durable, authoritative channel is a single persistent tail -F inbox monitor each agent arms once at startup, which follows signals.jsonl and wakes the agent the instant any signal relevant to its role is appended (~0 latency, even if SendMessage silently drops). A "wait" is therefore just yield your turn — the agent arms no per-wait monitor and never re-arms, so an agent blocked on two producers (review + test) is woken by that one inbox, not two monitors. There is no per-agent timeout; a counterparty that dies and never appends its signal is caught by the PM's independent stall watch (>10 min silence → STALL to Lead), and Lead records a visible WAIT_TIMEOUT if it can't resolve it — never a silent hang.

21 signal types cover the full pipeline lifecycle, including signals added to close visibility gaps: PLAN_READY (executor wrote plan.md), CODE_COMPLETE (all source code done), ADVICE_RESPONSE (Lead replied to an advice request), and WAIT_TIMEOUT (Lead abandoned a stalled wait after an unresolved PM STALL). Signals may carry an optional one-line note payload (e.g. a verdict).

Token overhead is minimal: parked agents burn zero tokens (the inbox never self-terminates, so there is no idle re-arm churn) and each relevant event is one small re-invocation, ~1K–5K tokens total per task.

Failure handling

Failures are handled internally within each task team:

Reviewer or Tester sends FAIL → Executor fixes → re-submits for review/test
Up to 10 fix cycles before escalating to the user
If a team member crashes, it's re-spawned with full context from task artifacts
If the Lead discovers a plan-invalidating issue, the pipeline pauses until you decide how to proceed

Usage management

Before execution starts, you choose whether to enable extra usage:

Usage mode (asked first): the very first question of execution is whether to respect the limits at all (auto-pause vs. push-through), asked before any usage data is read. Only in auto-pause does the Lead then check current usage via a one-shot usage-monitor.sh status (and not start over an already-hit limit). Set once, never re-asked mid-run.
Push-through: runs at full speed. PM’s monitor suppresses usage emits (stall detection still runs); the Lead never stops for usage.
Auto-pause: PM owns one background script (scripts/usage-monitor.sh watch, via the Monitor tool) — the single source of truth for usage. It emits only on actionable milestones: the critical-stop crossing, the work-can-restart reset (time-authoritative — fires on the known reset time even if usage data went stale while paused), and stalls. The soft limit never interrupts; the Lead enforces it with a status check before each spawn (don’t start new work while soft). On a critical-stop the Lead pauses in-flight work, checkpoints, and arms a self-owned one-shot wake at the known reset time — its guaranteed restart, independent of the PM chain. On reset, agents resume with full context.

Two agents, not three: PM (which owns the usage monitor) and the Lead — no separate watchdog. PM forwards to the Lead only when the chosen mode makes something actionable (a start/stop), so routine usage activity never wakes the Lead. Per-task budget data (usage % at start/end) is tracked in plan.json for the operational report.