# Token Efficiency
Ultra Claude burns more tokens than normal development. That's by design — dedicated Reviewers, Testers, and a shared Knowledge agent increase quality, but they cost context. The framework is built to make that trade-off worthwhile and keep usage under control.
## Cost per plan
Before execution starts, you see a cost estimate:
| Component | Cost | Details |
|---|---|---|
| Per-task pipeline | ~120K tokens | Executor ~80K + Reviewer ~30K + Tester ~10K |
| Knowledge Brief synthesis | ~20K tokens amortized | Lead loops /uc:research over the plan's Tech Stack in Phase 1.8. Cache hits are free; misses spawn a short-lived researcher subagent. |
| Mid-execution knowledge QUERYs | ~1K (hit) / ~15K (miss) | Teammates send QUERY: to Lead; Lead runs /uc:research. Subsequent queries for the same topic hit the cache. |
| Project Manager | ~50K tokens | Observational monitoring, entire execution |
A 5-task plan costs roughly 670K tokens (5 × 120K + 20K + 50K, plus a few mid-execution QUERY misses). The estimate is shown before any agents spawn — you confirm before spending. Second and third plans in the same project benefit more from cache hits, driving the per-plan knowledge cost toward zero over time.
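The arithmetic above can be sketched as a small estimator. This is an illustrative sketch, not the framework's actual code; the constants mirror the table, and the function name and signature are assumptions.

```python
# Hypothetical sketch of the pre-execution cost estimate described above.
# Constants come from the cost table; names are illustrative.

PER_TASK = 120_000        # Executor ~80K + Reviewer ~30K + Tester ~10K
KNOWLEDGE_BRIEF = 20_000  # amortized Phase 1.8 synthesis
PROJECT_MANAGER = 50_000  # observational monitoring, entire execution
QUERY_MISS = 15_000       # mid-execution knowledge QUERY on a cache miss

def estimate_plan_cost(tasks: int, query_misses: int = 0) -> int:
    """Rough token estimate shown before any agents spawn."""
    return tasks * PER_TASK + KNOWLEDGE_BRIEF + PROJECT_MANAGER + query_misses * QUERY_MISS

print(estimate_plan_cost(5))  # 670000, matching the 5-task example
```

Cache hits cost effectively nothing, which is why later plans in the same project trend toward the bare `tasks * PER_TASK` floor.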
## Why it's worth it
Normal Claude Code development is one agent doing everything — writing code, hoping it follows patterns, and trusting it works. Ultra Claude adds dedicated roles:
- Reviewer catches pattern violations and standards drift before code is merged
- Tester validates against product docs, not just "does it compile"
- Cache-first research serves current API docs so agents work with real signatures, not stale training data — and the cache persists across plans, so a second plan touching the same libraries pays almost nothing for knowledge
The extra tokens buy you code that follows your standards, passes real tests, and uses current library APIs. Without them, you spend the same tokens later fixing the bugs that review and testing would have caught.
## How the framework saves tokens
### Model assignment
Only the Executor runs on Opus (the most capable, most expensive model). Every other role — Reviewer, Tester, Project Manager — runs on Sonnet. The researcher subagent (spawned on cache miss) also runs on Sonnet. This concentrates cost where capability matters most: writing code.
### Lazy Tester spawning
The Tester doesn't spawn when the task starts. It's created only after the Executor signals "code complete": the moment all source code is written, before the Executor writes the implementation report. During planning and implementation (often the longest phases), only two agents are active instead of three, saving ~10K tokens of idle context per task. Firing the signal before the implementation-report write also gives the Tester its cold-read window "for free", so it's ready to test almost as soon as the Executor finishes.
### Cache-first research (Lead as Sage)
Research findings are committed as first-class project documentation under documentation/technology/research/, not invisible cache. The /uc:research skill checks a small JSON index via jq before spawning anything — fresh entries (per-file expires field, library 10d / patterns 90d / market 30d / historical frozen) return immediately with zero agent overhead. Cache misses spawn the stateless researcher subagent one-shot via the Task tool; it writes the target file, updates the index, and exits. No persistent knowledge teammate, no per-task duplicate lookups, and the knowledge base grows across plans.
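The freshness check can be sketched as follows. This is a minimal sketch assuming an index shaped like `{"topic": {"path": ..., "expires": ...}}` with ISO-8601 expiry dates and a literal `"frozen"` sentinel for historical entries; the real skill uses `jq`, and these field names and the function signature are assumptions.

```python
# Illustrative cache-first lookup against a JSON research index.
import json
from datetime import datetime, timezone

def cache_lookup(index_path: str, topic: str):
    """Return the cached research file path if the entry is still fresh, else None."""
    with open(index_path) as f:
        index = json.load(f)
    entry = index.get(topic)
    if entry is None:
        return None  # miss: spawn a one-shot researcher subagent
    expires = entry.get("expires")
    if expires == "frozen":
        return entry["path"]  # historical research never expires
    if datetime.fromisoformat(expires) > datetime.now(timezone.utc):
        return entry["path"]  # hit: zero agent overhead
    return None  # stale: re-research and overwrite the file
```

A miss or stale entry is the only path that spawns an agent; everything else resolves from the committed files.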
### Typed spawn distinction
Teammates (TeamCreate) are persistent, stateful, and tracked on the PM dashboard. Subagents (Task tool) are stateless, one-shot, and invisible to the team graph. The researcher is a subagent because it fits the pattern "do this one thing, return the answer, die" — making it a persistent teammate would waste tokens on a lifecycle it doesn't need.
### Ref.tools integration
External library documentation is fetched via Ref.tools MCP at ~500–5K tokens per query. The fallback (raw web search) costs 50K+ tokens for the same information. Combined with the cache, the per-lookup cost on a repeated topic drops to zero.
### Checkpoint recovery
When a session dies mid-execution, checkpoints let you resume without re-running completed tasks. A 10-task plan interrupted after task 5 resumes from task 6 — the 600K tokens already spent on tasks 1–5 are not re-spent.
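The resume logic amounts to persisting completed task IDs and filtering them out on restart. A minimal sketch, assuming checkpoints are a JSON list of completed task IDs; the file format and function names are assumptions, not the framework's actual checkpoint schema.

```python
# Illustrative checkpoint-resume: completed tasks are skipped on restart.
import json
import os

def load_completed(checkpoint_path: str) -> set:
    """Read the set of completed task IDs; an absent file means a fresh run."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return set(json.load(f))

def tasks_to_run(all_tasks: list, checkpoint_path: str) -> list:
    """Filter out tasks whose tokens have already been spent."""
    done = load_completed(checkpoint_path)
    return [t for t in all_tasks if t not in done]
```

With tasks 1–5 checkpointed, a 10-task plan resumes with only tasks 6–10 in the queue.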
### Concurrency limits
Parallel execution is capped at 1–4 task-teams based on plan size. This prevents token burn from too many agents running simultaneously while still allowing parallel progress.
| Plan size | Max concurrent teams |
|---|---|
| 1–3 tasks | 1–2 |
| 4–8 tasks | 2–3 |
| 9+ tasks | 3–4 |
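The table's mapping can be sketched as a simple step function. Since the table gives ranges, this sketch assumes the upper bound of each range is the hard cap; the function name is illustrative.

```python
# Illustrative plan-size -> concurrency cap, taking each range's upper bound.

def max_concurrent_teams(task_count: int) -> int:
    """Cap on simultaneously active task-teams, per the plan-size table."""
    if task_count <= 3:
        return 2
    if task_count <= 8:
        return 3
    return 4
```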
## Usage management
Before execution, you're asked whether to enable extra usage:
**Extra usage enabled**
Development runs as fast as possible with no pauses. Use this when you need results quickly and your Anthropic account has extra usage capacity.
**Extra usage disabled**
A three-agent chain handles usage management at roughly one-sixth the token cost of the old PM polling approach:
- Haiku watchdog ticks every minute (~$0.03/5h) checking two independent rate-limit windows (5h: 80% soft / 90% hard, 7d: 90% soft / 95% hard) and executor staleness. Silent when healthy.
- PM validates watchdog alerts, adds operational context (team count, avg task cost), forwards to Lead with window-qualified recommendations. Event-driven, no cron.
- Lead acts uniformly across windows — on soft, in-flight work finishes and spawning stops until reset; on hard, all teams stop immediately. Multiple blocks (5h + 7d) track independently; Lead resumes only when all windows clear.
Per-task budget data (usage % at start/end of each task) is tracked in plan.json. Multiple pause/resume cycles are supported across rate-limit windows.
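The two-window threshold logic can be sketched as follows. The thresholds come from the text (5h: 80% soft / 90% hard; 7d: 90% soft / 95% hard); the data shapes and function names are assumptions for illustration.

```python
# Illustrative watchdog threshold check across independent rate-limit windows.

THRESHOLDS = {"5h": (0.80, 0.90), "7d": (0.90, 0.95)}  # (soft, hard) fractions

def classify_usage(window: str, used_fraction: float) -> str:
    """'ok', 'soft' (finish in-flight work, stop spawning), or 'hard' (stop all teams)."""
    soft, hard = THRESHOLDS[window]
    if used_fraction >= hard:
        return "hard"
    if used_fraction >= soft:
        return "soft"
    return "ok"

def plan_state(usage: dict) -> str:
    """Windows track independently; the most severe state wins, and execution
    resumes only when every window reports 'ok'."""
    states = [classify_usage(w, frac) for w, frac in usage.items()]
    if "hard" in states:
        return "hard"
    if "soft" in states:
        return "soft"
    return "ok"
```

Because both windows are checked every tick, a 7d soft block can persist long after the 5h window has reset, which is why the Lead waits for all windows to clear.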
Plan during the day, execute overnight. Planning modes (feature, debug, verification) are interactive and use modest tokens. Execution is autonomous and token-heavy. Start a plan during the day when you can discuss scope, then kick off execution before bed — most plans complete within a single rate limit window.
## Token budget by activity
| Activity | Token range | Interactive? |
|---|---|---|
| Feature planning | ~20–50K | Yes — conversation with user |
| Debug investigation | ~30–60K | Yes — hypothesis discussion |
| Verification scan | ~20–40K | Yes — discrepancy review |
| Tech research query | ~0.5–5K | No — automated retrieval |
| Plan execution (per task) | ~120K | No — autonomous |
| Plan execution (overhead) | ~150K | No — PM + Knowledge |
Interactive activities are cheap. Execution is expensive but autonomous. This is why the "plan by day, execute by night" workflow works well.