Working v1 · 2026

a scaffolder for phase-gated, eval-first agent missions

A CLI that scaffolds phase-gated, worktree-isolated, eval-first agent missions, built by an agent team running the discipline it encodes.

Context

The method is the asset. Every build in this portfolio ran the same way: a mission brief written before any code, phases with human gates, disposable workers scoped to their own subtree, reviewers who never grade their own work, and a failing eval that defines done. I had been hand-rolling that structure each time. orchestrate turns it into one command.

What it does

orchestrate init generates a complete mission directory: the brief, a status tracker, an append-only build log, a handoff doc, an invariants constitution, a gate checklist, and a starter eval that fails on purpose. Seven files, every one a contract the next worker implements against.

orchestrate check validates that a mission has every required file and that its eval is real, runnable, and collectible by the test runner. Stdlib only, no dependencies, and the generated output is byte-identical run to run, so a mission is reproducible rather than freshly improvised each time.

How it was built

The part worth seeing: orchestrate was built by the exact method it scaffolds. An orchestrator dispatched scoped Sonnet workers into isolated git worktrees, each owning a non-overlapping part of the tree. Reviews were done by Opus reviewers who wrote none of the code under review. It ran in five gated phases: spec and a failing eval first, then templates, then the CLI, then a worked example wired to the eval, then a clean-clone review gate. Workers did no git; the manager committed at each boundary. The tool that scaffolds agent missions was itself the output of an agent mission, run under the discipline it encodes.
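The non-overlapping ownership rule can be checked mechanically before any worker is dispatched. The sketch below is an illustration only, assuming each worker's scope is declared as a single repository-relative directory; the function name and the scope format are hypothetical and not taken from the project.

```python
from itertools import combinations
from pathlib import PurePosixPath


def overlapping_scopes(scopes: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of workers whose declared subtrees overlap.

    `scopes` maps a worker name to a repository-relative directory
    (an assumed format). Two scopes overlap when they are the same
    directory or one contains the other.
    """
    clashes = []
    for (name_a, dir_a), (name_b, dir_b) in combinations(sorted(scopes.items()), 2):
        path_a, path_b = PurePosixPath(dir_a), PurePosixPath(dir_b)
        if path_a == path_b or path_a in path_b.parents or path_b in path_a.parents:
            clashes.append((name_a, name_b))
    return clashes
```

For example, scopes of `src/templates` and `src/cli` produce no clashes, while `src` and `src/cli` do, because the first contains the second.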

What's proven

Eval-first throughout: the suite was proven failing first, then driven to green, ending at eleven passing tests. The generated output is byte-identical across two runs. A mission produced by init passes check, the tool eating its own cooking. The final gate was a review from a clean clone, and the falsification table came back clean on every criterion: the dogfood round-trip holds, the suite is green from a fresh checkout, output is deterministic, no worker touched git, and nothing outside the declared scope was added. I re-ran the suite and the round-trip first-hand to confirm.
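One way to test the byte-identical claim is to generate the mission twice into separate directories and compare a digest of each tree. The following is a sketch under that assumption, not the project's own test; `tree_digest` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every file under `root` by relative path and content.

    Files are visited in sorted order of their POSIX-style relative path,
    so the digest does not depend on filesystem listing order.
    """
    files = sorted(
        (path.relative_to(root).as_posix(), path)
        for path in root.rglob("*")
        if path.is_file()
    )
    digest = hashlib.sha256()
    for rel, path in files:
        name = rel.encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so file boundaries cannot be confused.
        digest.update(len(name).to_bytes(8, "big") + name)
        digest.update(len(data).to_bytes(8, "big") + data)
    return digest.hexdigest()
```

Two runs are byte-identical when `tree_digest(first_run) == tree_digest(second_run)`. Anything that varies between runs, such as an embedded timestamp or a random identifier, changes the digest.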

It is a brutally scoped v1: two commands, stdlib only, no platform around it. That is the point. The value is not surface area; it is a repeatable method made concrete, with the discipline proven on its own construction.

Demo

orchestrate: scaffolder build · 120x34