Working v1 · 2026

ORCHESTRATE

a scaffolder for phase-gated, eval-first agent missions

A CLI that scaffolds phase-gated, worktree-isolated, eval-first agent missions, built by an agent team running the discipline it encodes.

TLDR

A one-command setup for a disciplined AI build: clear phases, human checkpoints, and a test that has to pass before anything ships.


2 commands do the whole job

11 tests, proven failing first

5 human-gated build phases

0 external dependencies

What it is

Every build in this portfolio followed the same disciplined method: plan first, work in stages, review each stage, and prove it with a test before moving on. orchestrate turns that entire method into a single command, so the next project begins with all of the guardrails already in place.

The work

Context

The method is the asset

Every build in this portfolio ran the same way: a mission brief written before any code, phases with human gates, disposable workers scoped to their own subtree, reviewers who never grade their own work, and a failing eval that defines done. I had been hand-rolling that structure each time. orchestrate turns it into one command.

What it does

Two commands, seven contracts

orchestrate init generates a complete mission directory: the brief, a status tracker, an append-only build log, a handoff doc, an invariants constitution, a gate checklist, and a starter eval that fails on purpose. Seven files, every one a contract the next worker implements against.
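As a rough illustration of the seven-file layout, here is a minimal sketch of what an init step could look like. It assumes Python, and every file name and template body below is hypothetical: the description above names the seven roles, not the actual paths or contents.

```python
from pathlib import Path

# Hypothetical file names and bodies, one per role listed above.
TEMPLATES = {
    "BRIEF.md": "# Mission brief\n",
    "STATUS.md": "# Status\n",
    "BUILD_LOG.md": "# Build log (append-only)\n",
    "HANDOFF.md": "# Handoff\n",
    "INVARIANTS.md": "# Invariants\n",
    "GATES.md": "# Gate checklist\n",
    # The starter eval fails on purpose until someone defines "done".
    "test_eval.py": (
        "def test_mission_done():\n"
        "    assert False, 'replace with the real definition of done'\n"
    ),
}


def scaffold(mission_dir: Path) -> list[Path]:
    """Write every template into mission_dir and return the created paths."""
    mission_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for name in sorted(TEMPLATES):  # fixed order, so runs do not vary
        path = mission_dir / name
        # Writing bytes avoids platform-specific newline translation.
        path.write_bytes(TEMPLATES[name].encode("utf-8"))
        created.append(path)
    return created
```

Calling `scaffold(Path("missions/demo"))` would produce the seven files in one pass. Keeping the templates as static strings with no timestamps or random identifiers is one simple way to keep the output stable between runs.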

orchestrate check validates that a mission has every required file and that its eval is real, runnable, and collectible by the test runner. The tool uses only the standard library, with no external dependencies, and the generated output is byte-identical from run to run, so a mission is reproducible rather than improvised afresh each time.

How it was built

Built by the method it scaffolds

orchestrate was built by the exact method it scaffolds. An orchestrator dispatched scoped Sonnet workers into isolated git worktrees, each owning a non-overlapping part of the tree, and reviews were done by Opus reviewers who wrote none of the code under review.

It ran in five gated phases: spec and a failing eval first, then templates, then the CLI, then a worked example wired to the eval, then a clean-clone review gate. Workers ran no git commands; the manager committed at each phase boundary. The tool that scaffolds agent missions was itself the output of an agent mission.

What's proven

Eval-first, end to end

Eval-first throughout: the suite was proven failing first, then driven to green, ending at eleven passing tests. The generated output is byte-identical across two runs, and a mission produced by init passes check: the tool eats its own cooking.
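One way to test a byte-identical claim like this is to hash two generated directories and compare the digests. The sketch below assumes Python; `tree_digest` is a hypothetical helper written for illustration, not part of orchestrate.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every file's relative path and bytes, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so a renamed file changes the digest.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Scaffolding into two fresh directories and asserting that their digests match would catch any timestamp, random identifier, or ordering difference that leaks into the generated files.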

The final gate was a review from a clean clone, and the falsification table came back clean on every criterion. It is a brutally scoped v1: two commands, stdlib only. The value is not surface area; it is a repeatable method made concrete, with the discipline proven on its own construction.

Demo

Terminal recording: orchestrate, scaffolder build (120x34)