Agent Governance Playbook v1 / Mitch Clarke

TLDR

How I run AI agents like a workforce instead of a feature: permission tiers set by blast radius, gates that fail safe, failure defined before the work starts, and a reviewer who never grades their own work. State lives in files so a crash loses nothing, models are matched to the stakes, and no number ships unless a tool produced it.

Permission tiers
Human gates
Falsification clauses
Writer never reviews
State in files
Real numbers only

I build and govern fleets of AI agents that ship real systems.

My operating principles:

No number exists unless a tool produced it.
Every agent runs behind a human gate.
Define failure before you build.
Whoever builds it does not review it.
Incidents are evidence, not embarrassments.
State lives in files, not memory.

1. Why governance: agents as a workforce, what failure actually costs

Treat agents as staff, not as a feature. The work I do is to design and operate governed multi-agent systems that ship production software unattended (MISSION-SITE.md). Unattended is the load-bearing word. An agent that runs only while a human watches is a demo. An agent that ships while nobody is in the room is a workforce, and a workforce needs governance.

The reason is plain once you have watched it happen. Ungoverned agents fabricate. On the R+S site an early agent invented facts about business credentials, which forced a do-not-fabricate rule across every agent on the project, encoded as "Real, confirmed facts only" (R+S MISSION.md). That is not a one-off. The same failure pattern shows up wherever an agent is asked for a number it does not have. So the rule across the artifacts work is blunt: "The site may not display any number these modules did not produce. That rule is load-bearing for the whole weekend" (MISSION-ARTIFACTS.md).

What does failure cost? The honest answer from the corpus is that failure is defined in advance, condition by condition, rather than priced in hours or dollars. Every mission file carries a falsification clause: a list of specific, testable conditions that mean the work has failed at handoff, regardless of how finished the code looks (MISSION-DOCS.md, MISSION-ARTIFACTS.md, MISSION-SITE.md). An agent inventing a number is not a cosmetic slip. It is named as mission failure (MISSION-ARTIFACTS.md). The point of writing failure down before the work starts is that you can catch it before it ships, not after a customer does.

Governance here is structural, not a vibe check. Three load-bearing structures recur. The writer of a phase is never the thing that reviews it, because when the manager writes the code the structural second-pass review collapses (PdlBldrV1 HANDOFF.md). A crashed agent must not silently block work, which is why the destructive denylist lives in native deny rules that survive a dead agent rather than in the agent's own hook (Orchestra gate-design.md). And a silently failed gate is treated as a design fault, not a runtime accident; the reliable way to enforce the accuracy gates is scoped sub-instances with validated structured returns (DCG HANDOFF.md).

What it costs when this is missing is not abstract. An ungoverned agent does not fail loudly; it ships a confident error, and the cost lands later, as a customer told the wrong thing, a number in a proposal that cannot be backed, work redone after it was called finished. The expensive failures in agent work are the ones that look complete. So the cost is controlled structurally rather than estimated: failure is written down before the work starts, the reviewer is never the writer, and no number reaches a customer that a tool did not produce. A governed workforce is not slower. It is cheaper at the only point that matters, the point where a mistake would otherwise leave the building.

2. Permission tiers: read, write, execute, external actions

Permission is organised by blast radius, not by tool name. The Orchestra gate matrix defines six tiers, numbered 0 to 5 with tier 1 split into search and fetch, adopted by Mitch on 2026-06-13 as the v1 default (gate-design.md). Tier is what a call can damage, not which tool issued it.

Tier	Tools	Default
0, Inspection	Read, Glob, Grep, TodoWrite, read-only MCP	allow, pass-through, never gated
1a, Web search	WebSearch	allow plus log
1b, Web fetch	WebFetch	default-ask
2, File mutation	Edit, Write, MultiEdit, NotebookEdit	allow inside project root, ask outside, deny on the sensitive-path denylist
3, Shell	Bash	default-ask, hard deny-list for destructive patterns
4, Subagent	Task	out of scope for Phase 2, default-ask
5, MCP tools	MCP server tools	default-ask

Source: gate-design.md (gate matrix table).
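
Read as code, the matrix is a lookup from tool name to tier to default decision. What follows is a minimal sketch under stated assumptions, not the contents of gate-design.md or bridge.py: the tool names and defaults come from the table, while the classify helper, its parameters, and the choice to treat every unknown tool as tier 5 are illustrative. Read-only MCP tools would need their own allowlist to land in tier 0, which is left out here.

```python
from pathlib import PurePosixPath

# Tool name -> tier label, following the gate matrix table.
TIER_BY_TOOL = {
    "Read": "0", "Glob": "0", "Grep": "0", "TodoWrite": "0",
    "WebSearch": "1a",
    "WebFetch": "1b",
    "Edit": "2", "Write": "2", "MultiEdit": "2", "NotebookEdit": "2",
    "Bash": "3",
    "Task": "4",
}

# Default decision for every tier whose default does not depend on a path.
DEFAULT_BY_TIER = {"0": "allow", "1a": "allow", "1b": "ask",
                   "3": "ask", "4": "ask", "5": "ask"}


def classify(tool, target_path=None, project_root="/project", sensitive=()):
    """Return (tier, default decision) for one tool call.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    tier = TIER_BY_TOOL.get(tool, "5")  # assumption: anything unknown is an MCP tool
    if tier != "2":
        return tier, DEFAULT_BY_TIER[tier]
    if target_path is None:
        return tier, "ask"  # no path to judge, so hand the call to a human
    path = PurePosixPath(target_path)
    # Sensitive paths are checked first: .git sits inside the project root.
    if any(path.is_relative_to(s) for s in sensitive):
        return tier, "deny"
    if path.is_relative_to(project_root):
        return tier, "allow"
    return tier, "ask"
```

The sketch does not resolve symlinks or ".." segments, so a real gate would normalise the path before comparing it.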

Two tiers never reach the gate at all. The hook matcher scopes which calls even reach the decision point, and Tier 0 plus Task are excluded by the matcher itself (gate-design.md, bridge.py). Inspection has zero blast radius, so gating it would only add friction.

The destructive floor sits below the gate and survives a crash. It is encoded as native permissions.deny rules in the scoped settings file, not in the agent's hook: rm -rf, rm -fr, sudo rm, git reset --hard, git push --force, git push -f (floor.py, DESTRUCTIVE_BASH_FLOOR). Sensitive paths are denied for file-mutation tools: the ssh directory, the claude config directory and claude.json, the gnupg directory, the aws directory, .git, and the system paths etc, System, usr, bin, sbin (floor.py, SENSITIVE_PATH_FLOOR).
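
One way to picture the floor is as a merge into the deny list of the scoped settings file. This is a sketch only. The six patterns and the DESTRUCTIVE_BASH_FLOOR name come from the text; the build_scoped_settings helper is hypothetical, and the rule string syntax shown is an assumption about the settings format, not something the sources here confirm.

```python
import json

# Patterns listed in the text for DESTRUCTIVE_BASH_FLOOR.
DESTRUCTIVE_BASH_FLOOR = (
    "rm -rf", "rm -fr", "sudo rm",
    "git reset --hard", "git push --force", "git push -f",
)


def build_scoped_settings(existing=None):
    """Return a settings dict with the floor merged into permissions.deny.

    Existing deny rules are kept in place: the floor only adds, never removes.
    """
    settings = json.loads(json.dumps(existing or {}))  # cheap deep copy
    deny = settings.setdefault("permissions", {}).setdefault("deny", [])
    for pattern in DESTRUCTIVE_BASH_FLOOR:
        rule = f"Bash({pattern}:*)"  # ASSUMED rule syntax, check the real format
        if rule not in deny:
            deny.append(rule)
    return settings
```

Because the rules live in a file the platform reads, they hold even when the process that wrote them is dead, which is the property the paragraph above depends on.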

The platform enforces the direction of travel. A hook deny is absolute: it beats the skip-permissions flag and auto modes, so the gate is a reliable hard stop (gate-design.md). A native deny rule in turn beats a hook allow, confirmed by a real-TTY fire-test on 2026-06-13, which means tighten-only is enforced by the platform rather than by convention (verified-facts.md item 20, gate-design.md). Tighten-only means v1 may deny or ask where the agent would otherwise allow, but it must not silently loosen past the user's own deny rules (gate-design.md, Key decision #2). One subtlety the corpus is firm on: in an interactive session Orchestra never emits defer. The v1 observe hook returns an empty object, which carries no decision at all. That is distinct from defer, which is ignored in interactive sessions and so would be the wrong thing to return (verified-facts.md, bridge.py).
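
Tighten-only reduces to a small merge rule: of two decisions, the stricter one wins, and no decision from the gate changes nothing. The sketch below assumes a hypothetical tighten_only function and a strictness ordering of allow, ask, deny. It illustrates the rule as described, and is not the logic in bridge.py.

```python
# Strictness order: a higher number is a tighter decision.
STRICTNESS = {"allow": 0, "ask": 1, "deny": 2}


def tighten_only(baseline, gate):
    """Combine the baseline decision with the gate's decision.

    `baseline` stands for what the agent and the user's own rules would decide.
    `gate` may be None, standing in for the empty object the observe hook returns.
    The result is never looser than `baseline`.
    """
    if gate is None:
        return baseline
    return max(baseline, gate, key=STRICTNESS.__getitem__)
```

A gate that answers allow against a baseline of deny still yields deny, which is the platform behaviour the fire-test confirmed.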

Two estate rules sit alongside the matrix. The downloaded-skills directory is read-only and quarantined; applying, executing, copying, or installing anything from it requires asking first, every time, and it is never symlinked or copied into any skills directory (estate CLAUDE.md). Code repositories never go inside the vault; new repos go in Projects only (estate CLAUDE.md). One more operational guard: the API key is unset from the child environment, because if it is set the CLI bills the API account instead of the subscription, so Orchestra deletes it before spawning the child (auth-and-cost.md, instrumentation-bootstrap.md).
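
The API-key guard is a few lines in practice. A minimal sketch, assuming the variable is called ANTHROPIC_API_KEY (the text says only "the API key") and using a hypothetical child_env helper:

```python
import os

# ASSUMED variable name; the source text does not name it.
API_KEY_VAR = "ANTHROPIC_API_KEY"


def child_env(parent=None):
    """Copy the parent environment and drop the API key before spawning the child."""
    env = dict(os.environ if parent is None else parent)
    env.pop(API_KEY_VAR, None)  # absent key is fine, nothing to remove
    return env
```

The returned dict would be passed as the env argument when the child process is started, so the parent's own environment is left untouched.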

This matrix is the v1 permission taxonomy for the estate, not only for Orchestra. The hard estate rules map onto it directly: the downloaded-skills quarantine and the no-repos-in-the-vault rule are tier 2 file-mutation denials, and the destructive-command floor is the tier 3 shell deny-list. Tier is blast radius, whatever the tool or the rule that raised it.

3. Gate catalogue: approval gates (canUseTool), content-accuracy gates, photo and artifact gates, deploy gates

A gate is a checkpoint a call must pass before it proceeds. The corpus has four kinds.

Approval gates

In the Agent SDK the approval mechanism is canUseTool. Orchestra does not use it, because Orchestra's substrate is the real interactive CLI, not the SDK; the correct gate there is the PreToolUse hook (verified-facts.md, section 7). The Phase 2 flow runs like this: the agent is about to run a gated tool and fires PreToolUse; the relay carries the call over a socket to the hooks server, which is the block point; the hooks server marks a pending decision in state; the bridge drives a gate overlay showing tool name, tool input, and context; the user chooses allow, deny, rewrite, or lets normal flow handle it; the hooks server returns the decision JSON back to the agent (gate-design.md).

The timeout policy is a hybrid, resolved on 2026-06-13. If the user does not answer in time, the gate returns ask, handing the call back to the agent's own in-pane prompt. Nothing is silently allowed, nothing is hard-stopped. The timeout is an explicit constant, GATE_TIMEOUT_S = 120.0, in bridge.py (gate-design.md, bridge.py). On teardown the same principle holds: GateBridge.close releases every held call with ask, so no blocked handler thread survives the app (bridge.py, close method). The decision UI has to live in the Orchestra process rather than the hook, because hooks run in non-interactive shells and cannot draw UI or take keyboard input (verified-facts.md, section 4).
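
The timeout and teardown behaviour can be sketched with a threading.Event per held call. GATE_TIMEOUT_S and its value come from the text; the PendingGate and GateBridgeSketch classes are hypothetical stand-ins, not the GateBridge in bridge.py. The point of the sketch is that "ask" is the value a held call gets whenever nobody answers.

```python
import threading

GATE_TIMEOUT_S = 120.0  # constant named in the text


class PendingGate:
    """One held tool call waiting on a human decision."""

    def __init__(self):
        self._answered = threading.Event()
        self._decision = "ask"  # fail-safe default if nobody answers

    def resolve(self, decision):
        # First answer wins, so teardown cannot overwrite a human decision.
        if not self._answered.is_set():
            self._decision = decision
            self._answered.set()

    def wait(self, timeout=GATE_TIMEOUT_S):
        # On timeout the event is never set and the default "ask" is returned.
        self._answered.wait(timeout)
        return self._decision


class GateBridgeSketch:
    """Tracks held calls so teardown can release every one of them."""

    def __init__(self):
        self._pending = []
        self._lock = threading.Lock()

    def hold(self):
        gate = PendingGate()
        with self._lock:
            self._pending.append(gate)
        return gate

    def close(self):
        with self._lock:
            pending, self._pending = self._pending, []
        for gate in pending:
            gate.resolve("ask")  # release with ask, never allow
```

Defaulting the decision to "ask" at construction means a forgotten code path also fails safe, rather than relying on every caller to set it.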

Content-accuracy gates

Accuracy is gated by structure and by grep, not by judgement alone. In DCG the reliable way to enforce the writer/reviewer and content-accuracy gates is scoped sub-instances orchestrated by the manager with validated structured returns (DCG HANDOFF.md). The copy-rule gate is enforced by grep, not vibes; both MISSION-DOCS and the handoff docs say "Enforce by grep before claiming any phase done" (HANDOFF-DOCS.md, STATUS-DOCS.md). On the R+S site that audit is a Definition-of-Done item: grep finds zero em dashes and zero banned phrases across the source tree (R+S MISSION.md). The Design Library held an anti-fabrication review as a gate in Phase 7: no invented statistics, no fake study citations, no specific design decisions attributed to named individual websites (DESIGN LIBRARY BUILDLOG.md).
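
The grep gate is simple enough to show. This is a sketch of the idea, assuming hypothetical scan_text and scan_tree helpers and a two-item banned list invented for the example; the real banned phrases live in the mission files and are not reproduced here.

```python
from pathlib import Path

EM_DASH = "\u2014"
# Hypothetical examples only; the real list is defined in the mission files.
BANNED_PHRASES = ("game-changing", "seamless")


def scan_text(text, banned=BANNED_PHRASES):
    """Return (line number, what was found) for every violation in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if EM_DASH in line:
            hits.append((lineno, "em dash"))
        lowered = line.lower()
        for phrase in banned:
            if phrase.lower() in lowered:
                hits.append((lineno, phrase))
    return hits


def scan_tree(root, suffixes=(".md", ".txt")):
    """Scan every matching file under root; an empty dict means the gate passes."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            hits = scan_text(path.read_text(encoding="utf-8"))
            if hits:
                results[str(path)] = hits
    return results
```

The gate passes only on an empty result, which keeps the check pass/fail rather than a matter of how the copy reads.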

Two examples show how far the separation goes. On WorkflowCalculator the scoring engine was built by one Opus agent and verified by a different Opus agent that wrote 34 hand-calculated tests without ever seeing the engine (WorkflowCalculator STATUS.md). On PdlBldrV1 a verbatim check runs a mechanical character-for-character diff as part of the test command: zero-delta or bust (PdlBldrV1 HANDOFF.md). And some inputs are not gated, they are off-limits: human-only inputs, the hero line and the operating principles, are never written, edited, or improved by agents, because they arrive from Mitch as locked files (MISSION-DOCS.md, HANDOFF-DOCS.md).

Photo and artifact gates

If an artifact does not exist, the page ships without it rather than faking it. The photo gate is stated as: if an asset does not exist, the page ships with text-only artifact links, and a mock is mission failure (MISSION-SITE.md). For Diagnostic Buddy the gate blocks the page outright: artifacts are screenshots plus a short capture of a real diagnostic flow, and with no real capture there is no page (MISSION-DOCS.md). The gate is releasable by the owner, not by an agent: on the R+S site the owner chose to proceed with placeholders that auto-integrate when the real photos are dropped in (R+S handoff STATUS.md). Numbers are gated the same way. Any number on the portfolio site that was not produced by a Ledger or Eval export is a falsification condition (MISSION-SITE.md). If telemetry.json is absent at build, the numbers strip renders nothing; placeholder numbers are mission failure (MISSION-SITE.md).
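
The numbers strip rule, render nothing when the export is absent, might look like this at build time. A sketch assuming a hypothetical load_numbers_strip helper; the schema string and field names are the ones in the telemetry.json v1 schema, and the fields are passed through untouched rather than summed or derived.

```python
import json
from pathlib import Path


def load_numbers_strip(path):
    """Return display values from a real export, or None so the strip renders nothing."""
    file = Path(path)
    if not file.is_file():
        return None  # no export, no numbers: never a placeholder
    data = json.loads(file.read_text(encoding="utf-8"))
    if data.get("schema") != "orchestra.telemetry.v1":
        return None  # unknown schema is treated like a missing file
    totals = data["totals"]
    # Pass exported fields through as-is; nothing is computed here.
    return {
        "sessions": totals["sessions"],
        "input_tokens": totals["input_tokens"],
        "output_tokens": totals["output_tokens"],
    }
```

Returning None instead of zeros matters: a zero is still a number on the page, and it would not have come from a tool.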

Deploy gates

Deploy is a human gate. Phase S10 puts the deploy gate, owned by a human, before production goes live on orchestrator.company (MISSION-SITE.md). The artifacts work adds a human gate on PRICING.json before any currency is printed: Mitch verifies the rates against currently published API pricing and signs verified_by, and if it is unverified, Ledger prints token counts and refuses to print currency (MISSION-ARTIFACTS.md). Mitch also reviews SCHEMA.md before Phase 1 begins, as a hard stop (MISSION-ARTIFACTS.md, STATUS.md). Gates are pass/fail, not "looks done": a phase advances because the adversarial reviewer has checked it against the criteria and every item objectively passes, not because it looks finished (R+S MISSION.md). PdlBldrV1's deploy gate requires the 3f and 3g checks to pass on the production build before deploy, with evidence in the build log (PdlBldrV1 HANDOFF.md). Phases themselves are gated: do not start phase N+1 until N's exit criteria are demonstrably met and reviewed (Orchestra roadmap.md). And the locked content gate is absolute: hero.md and principles.md are written by Mitch, and any agent edit to them is mission failure (MISSION-SITE.md).

This catalogue is the single reference for the four kinds of gate, and the content-accuracy gate is named and defined here as a first-class gate rather than left as an ad-hoc grep habit.

4. Falsification clauses: what they are, the template, one worked example

A falsification clause is a list of specific, testable conditions that define failure in advance. Failure is declared up front, not inferred after the fact from a bad outcome. Every mission file carries one (MISSION-DOCS.md, MISSION-ARTIFACTS.md, MISSION-SITE.md).

The template is a fixed form. It reads: "This mission has failed if any of the following are true at handoff, regardless of how finished the code looks", followed by a numbered list of specific, verifiable conditions (MISSION-ARTIFACTS.md). The phrase "regardless of how finished the code looks" is doing real work. It blocks the most common self-deception in agent output, where polish is mistaken for correctness.

Here is one worked example, the clause governing this very document (MISSION-DOCS.md):

This mission has failed if any of the following are true at handoff:

1. Any banned phrase or em dash survives in any deliverable.
2. Any claim lacks a verifiable artifact, a screenshot, file, link, or log reference.
3. The Playbook exceeds 12 pages or reads like a corporate wiki.
4. The postmortem reads sanitised or hypothetical.
5. Any sentence fails the voice test: would a sharp human operator say this out loud.

Every item is checkable. Item 1 is a grep. Item 2 is an audit table. Item 3 is a page count. Items 4 and 5 are read-throughs against a stated bar. None of them depends on the author's opinion of their own work.

The form scales to numeric precision. The Ledger clause shows it (MISSION-ARTIFACTS.md): the work fails if Ledger aggregates fail manual reconciliation against 3 raw sessions at zero tolerance, if any cost figure uses a rate not human-verified in PRICING.json, if the eval command cannot produce a scorecard from a clean clone with one command, if actual output deviates from the schemas in the file, or if any metric is estimated, interpolated, or invented rather than parsed. The site clause adds its own conditions, including that any number on the site was not produced by a Ledger or Eval export, and that any interactive feature lacks a working static fallback (MISSION-SITE.md). The discipline that goes with the form is stated in the handoff: the clause in MISSION-DOCS defines failure, and agents reread it before claiming anything done (HANDOFF-DOCS.md).
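
The mechanical conditions in a clause can be held as data and evaluated at handoff. A sketch under assumptions: the Condition type, the failed_conditions helper, and the evidence keys are all hypothetical, and the condition texts are shortened from the worked example. The read-through items (sanitised postmortem, voice test) stay with a human reviewer and are not modelled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Condition:
    text: str
    is_true: Callable[[dict], bool]  # True means this failure condition holds


def failed_conditions(clause, evidence):
    """Return the text of every failure condition that holds.

    An empty list means no mechanical condition fired; it does not mean
    the human read-through items have passed.
    """
    return [c.text for c in clause if c.is_true(evidence)]


# Shortened wording; evidence keys are illustrative names, not a real schema.
DOCS_CLAUSE = [
    Condition("Any banned phrase or em dash survives in any deliverable.",
              lambda e: e["banned_hits"] > 0),
    Condition("Any claim lacks a verifiable artifact.",
              lambda e: e["claims_without_artifact"] > 0),
    Condition("The Playbook exceeds 12 pages.",
              lambda e: e["pages"] > 12),
]
```

Each predicate answers "has this failed", not "does this look fine", which keeps the clause in the same direction as the prose form.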

5. Writer-not-reviewer separation

The rule is one line and it is non-negotiable. The writer of a phase never reviews that phase; reviews use Opus, never the writer (HANDOFF.md, operating rule 2).

The reason is structural. The worker/manager split exists so the thing reviewing the code is not the thing that wrote it. When the manager writes the code, the structural second-pass review collapses, because the manager reviews its own work with the same blind spots in both passes (PdlBldrV1 HANDOFF.md). A reviewer that shares the writer's blind spots is not a reviewer, it is a second opinion from the same head.

The reviewer is adversarial by design. On the R+S site the adversarial reviewer red-teams each phase output against the criteria and the Definition of Done, is paid to find what is wrong rather than to agree, and a gate the reviewer cannot fully sign off does not pass (R+S MISSION.md). The line not to cross is a manager writing exploratory or design-bearing code with no second reviewer; tightly enumerated surgical changes are the only exception, and they require disclosure before the fact, not after (PdlBldrV1 HANDOFF.md).

The pattern is everywhere in the corpus. On Orchestra the adversarial reviewer is Opus, never the writer, and reviews numbers and behaviour, not style (MISSION-ARTIFACTS.md). For this document, Phase D4 is the reviewer, not the writer, running a mechanical banned-phrase scan by grep, a claim-evidence audit, and a length check (MISSION-DOCS.md). The site's Phase S9 uses reviewer agents that are never the writers across three passes: banned-phrase and em-dash scan, claim-evidence audit, and motion and responsive QA (MISSION-SITE.md). WorkflowCalculator's final acceptance audit is an independent Opus sub-agent (WorkflowCalculator STATUS.md). The Design Library and the Forgiveness Letter both had a fresh sub-agent, not a builder, review the full screenshot set (Forgiveness Letter BUILDLOG.md, Forgiveness Letter STATUS.md).

The separation also shapes how writing is structured. On the R+S blog, writers return structured data and a single BOSS process writes the files, which avoids parallel write conflicts and gives a single gate, with BOSS owning the deterministic metadata (R+S handoff BUILDLOG.md). For this Playbook, Phase D2 is an Opus worker drafting per the locked outline from sources only, and it is not the Phase D1 writer who mined the sources (MISSION-DOCS.md). The writer never signs off on its own work, at any layer.

6. Crash recovery: HANDOFF.md and file-based state

State lives in files, not in a chat history, so a fresh agent can pick up where a dead one stopped. HANDOFF.md is written so a fresh agent can resume with zero verbal context from Mitch (HANDOFF.md, HANDOFF-DOCS.md). That sentence is the whole design goal. If recovery needs a human to explain what was happening, recovery has failed.

The logging discipline is standard across every mission: STATUS.md is updated at every phase boundary, BUILDLOG.md is appended per phase, and HANDOFF.md is kept current enough that a fresh agent can resume after a crash with zero verbal context (MISSION-ARTIFACTS.md). Three files, three jobs. STATUS.md holds current state, with a phase-tracker table showing phases, roles, and status such as IN PROGRESS, NOT STARTED, or PENDING, and both agents and humans appear in it (R+S MISSION.md, STATUS.md). BUILDLOG.md is append-only, one entry per phase, and evidence lives there (BUILDLOG.md, BUILDLOG-DOCS.md). HANDOFF.md is the resume brief, maintained throughout, not assembled at the end (R+S MISSION.md).

Every HANDOFF.md carries a "How to resume after a crash" section as a standing part of the file. It specifies what file to read first, what state to check, and whether to wait for a human gate or continue (HANDOFF.md, HANDOFF-DOCS.md). The procedure is concrete, not aspirational. DCG's reads: read PLAN.md and this file, inspect the research and source directories, run the typecheck and build, continue from the first pending phase (DCG HANDOFF.md). Orchestra's procedure checks artifact existence first: if SCHEMA.md is absent or incomplete, re-run Phase 0. Recovery is therefore a fixed order: check that the artifacts exist, check the build log for completion evidence, and only then decide whether to wait or proceed (HANDOFF.md).
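
The Orchestra recovery order can be sketched as a function that reads only files. Several things here are assumptions and are marked as such: the resume_decision helper is hypothetical, "incomplete" is approximated as an empty file, the completion marker string is invented, and the human approval of SCHEMA.md is passed in as a flag because the text does not say how it is recorded.

```python
from pathlib import Path


def resume_decision(root, schema_approved=False, phase0_marker="Phase 0 complete"):
    """Decide what a fresh agent does next, using files only.

    Order follows the text: artifact existence, then build log evidence,
    then wait-or-proceed.
    """
    root = Path(root)

    # 1. Artifact existence. ASSUMPTION: an empty file counts as incomplete.
    schema = root / "SCHEMA.md"
    if not schema.is_file() or not schema.read_text(encoding="utf-8").strip():
        return "rerun-phase-0"

    # 2. Completion evidence in the build log. The marker string is hypothetical.
    buildlog = root / "BUILDLOG.md"
    if not buildlog.is_file() or phase0_marker not in buildlog.read_text(encoding="utf-8"):
        return "rerun-phase-0"

    # 3. SCHEMA.md review is a human gate before Phase 1.
    return "proceed-to-phase-1" if schema_approved else "wait-for-human-gate"
```

Nothing in the function consults a chat history or an agent's memory, which is the property that lets a fresh agent run it cold.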

A few rules keep the file-based state honest. Read HANDOFF.md first, every session, no exceptions (PdlBldrV1 HANDOFF.md). When the project state changes (a new commit on main, a new task, a decision made), update the file or the decisions log immediately (PdlBldrV1 HANDOFF.md). When parallel missions share a directory, use suffixed filenames so they do not clobber each other; this Playbook's mission uses the -DOCS suffix and does not write the unsuffixed files that belong to the parallel session (HANDOFF-DOCS.md, STATUS-DOCS.md). And anything that cannot be solved inline escalates: any design question that changes a schema or the CLI surface goes up, it does not get solved inline (MISSION-ARTIFACTS.md, HANDOFF.md).

7. Model routing: architect tier vs worker tier, and when each applies

Models are matched to stakes. The pattern is consistent across the corpus even though it is not written as a single decision tree: the most capable model does the judgement, accuracy, and review work; a cheaper model does the mechanical work; an architect tier handles questions that change the shape of the system.

The artifacts mission states the routing directly. The manager runs on Sonnet and owns STATUS.md. Workers are Sonnet, disposable, one phase each. The adversarial reviewer is Opus and never the writer. Architect escalation goes to the strongest model on hand, Fable at the time and Opus 4.8 after Fable was withdrawn: any design question that changes a schema or the CLI surface goes up, it does not get solved inline (MISSION-ARTIFACTS.md). For this document, Phase D1 source mining is a Sonnet worker, Phase D2 drafting is an Opus worker, and Phase D4 review is the reviewer, not the writer (MISSION-DOCS.md). The site mission uses a boss agent to coordinate, a manager to run phases, disposable Sonnet workers, and reviewers that are never writers, with the design plan handled at the architect tier, Fable then and Opus 4.8 now (MISSION-SITE.md). Workers are disposable, one per phase (HANDOFF.md).

On the highest-stakes work the default flips toward the strongest model. PdlBldrV1 makes this explicit: the manager runs on Opus at all times, workers default to Opus, and Sonnet or Haiku is suggested for a worker only when there is genuinely zero quality gain from Opus, such as appending a known line to a log file or running a single status check. The note is sharp: suggesting Sonnet to save quota is exactly the wrong instinct here (PdlBldrV1 HANDOFF.md). On the R+S work the boss runs on Opus at maximum effort for planning, schema, and review, while mechanical work can drop effort (R+S MISSION.md). The Design Library ran written-library and final-review phases on Opus at high effort for judgement and accuracy-sensitive work, and the scaffold on Sonnet at lower effort (DESIGN LIBRARY STATUS.md). WorkflowCalculator put the scoring engine and its independent tests on Opus at high effort as the highest-stakes reasoning task (WorkflowCalculator BUILDLOG.md). The Forgiveness Letter used Opus for the tone-sensitive emotional core and send sequence, and Sonnet for the spec-driven compose screen (Forgiveness Letter BUILDLOG.md).

When to escalate to the architect tier is enumerated in PdlBldrV1: take it upstream to the boss on architectural ambiguity that affects more than the current task, scope-creep tension, repeated worker failures on the same task, or voice and quality drift persistent across workers (PdlBldrV1 HANDOFF.md).

Stated as one decision tree, which the corpus has only ever implied:

Architecture and schema work, anything that changes the shape of the system: the strongest model available, at high effort. That tier was Fable until 13 June; when Fable was withdrawn mid-build it became Opus 4.8 with no loss, because the tier is defined by capability and effort, not by a model's name. The substitution is the lesson, written up in postmortem 001.
Judgement, accuracy, and review: Opus, and never the writer of the work under review.
Mechanical, well-specified work, one phase at a time: the cheaper tier, Sonnet by default, Haiku only where there is no quality to gain.
Escalate up a tier when a question reaches past the current task, when scope-creep tension appears, when the same task fails repeatedly, or when voice drifts across workers.

Above the tree sits the rule the withdrawal taught: route by capability tier, never by a single pinned model. A tier with no fallback is a single point of failure.
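
Routing by tier rather than by pinned model can be written as two lookups. A sketch with hypothetical names (TIER_FOR_WORK, CANDIDATES, pick_model). The architect ordering, Fable then Opus 4.8, follows the text; the review and worker tiers are listed with the single model the text names, and by the rule above each would want a fallback of its own.

```python
# Kind of work -> capability tier, following the decision tree.
TIER_FOR_WORK = {
    "architecture": "architect",
    "schema": "architect",
    "judgement": "review",
    "accuracy": "review",
    "review": "review",
    "mechanical": "worker",
}

# Ordered candidates per tier: first available wins.
CANDIDATES = {
    "architect": ("Fable", "Opus 4.8"),
    "review": ("Opus",),
    "worker": ("Sonnet",),
}


def pick_model(work, available):
    """Return (tier, model) for a kind of work, given the models currently available."""
    tier = TIER_FOR_WORK[work]
    for model in CANDIDATES[tier]:
        if model in available:
            return tier, model
    # A tier with no available candidate is the single point of failure
    # the rule warns about, so fail loudly instead of silently downgrading.
    raise LookupError(f"no available model for tier {tier!r}")
```

When a model is withdrawn, the caller drops it from the available set and the same call returns the next candidate, with no change to the routing table.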

8. Telemetry as a control: Ledger

Telemetry is a control, not a dashboard. Ledger does per-session, per-project, per-model token and cost aggregation from session JSONL (MISSION-ARTIFACTS.md). The CLI is orchestra ledger with optional since, project, and export flags (MISSION-ARTIFACTS.md).

The control is that Ledger refuses to lie about money. Cost is tokens multiplied by rates from PRICING.json. PRICING.json carries verified_by and verified_date. If the rates are unverified, Ledger prints token counts and refuses to print currency: no silent estimates, ever (MISSION-ARTIFACTS.md). The verification is a human gate: Mitch verifies PRICING.json rates against currently published API pricing and signs verified_by (MISSION-ARTIFACTS.md). The export flag writes telemetry.json, whose schema is pinned with a "fields may be added later, never renamed" constraint (MISSION-ARTIFACTS.md).
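
The currency refusal is a guard in front of the multiplication. A minimal sketch, assuming a hypothetical format_cost helper and a single flat usd_per_token rate for brevity; verified_by and verified_date are the field names from the text, and the real PRICING.json presumably carries rates per model and token type.

```python
def format_cost(tokens, pricing):
    """Return a currency string only when the pricing file is human-verified."""
    verified = bool(pricing.get("verified_by")) and bool(pricing.get("verified_date"))
    if not verified:
        # Unverified rates: print what was measured and nothing else.
        return f"{tokens:,} tokens (pricing unverified, currency withheld)"
    amount = tokens * pricing["usd_per_token"]  # hypothetical single-rate field
    return f"{tokens:,} tokens, USD {amount:.2f}"
```

The rate is present in both cases. What changes the output is the signature, so an unsigned file can never leak a cost figure by accident.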

The telemetry.json v1 schema is load-bearing (MISSION-ARTIFACTS.md):

{
  "schema": "orchestra.telemetry.v1",
  "generated_at": "ISO8601",
  "period": { "from": "ISO8601", "to": "ISO8601" },
  "totals": {
    "sessions": 0, "input_tokens": 0, "output_tokens": 0,
    "cache_read_tokens": 0, "cache_write_tokens": 0,
    "est_cost": { "amount": 0, "currency": "USD", "pricing_verified": false }
  },
  "by_model": [ { "model": "", "sessions": 0, "input_tokens": 0, "output_tokens": 0, "est_cost": 0 } ],
  "by_project": [ { "project": "", "sessions": 0, "tokens_total": 0, "est_cost": 0 } ]
}

The numbers are checked, not trusted. Ledger's Phase 3 review reconciles 3 sessions by an independent hand-count script against Ledger output at zero tolerance, verifies the currency-refusal behaviour when PRICING.json is unverified, and records the evidence in BUILDLOG.md (MISSION-ARTIFACTS.md). The same no-fabrication stance runs through the rest of the telemetry stack. Orchestra's OTel panels show token and cost from real taps and never fabricate: where a window is not in the data, the panel shows tokens and cost and marks the window unavailable until it is sourced (auth-and-cost.md). The subscription usage window, 5-hour and weekly, is not exposed via OpenTelemetry, so it is labelled unavailable, with its source to be determined, until a real source exists (auth-and-cost.md, verified-facts.md). The OTel metric fields that are available are token usage by type (input, output, cache read, and cache creation), notional cost in USD, total active time, per-agent attribution, and current model (auth-and-cost.md). The overarching rule: no invented numbers, no estimated metrics, no placeholder data, parsed or nothing, and the portfolio site may not display any number these modules did not produce (HANDOFF.md).

Built since this was mined. The Ledger now runs over the real corpus: its first export totals 4,609,452,561 tokens across 195 sessions, token-only (telemetry.json). Currency was descoped, so PRICING.json stays unsigned and the refusal to print unverified currency described above is demonstrated rather than theoretical. The anonymised export feeds the portfolio's own numbers strip.

9. Evals as regression safety: Eval

The eval suite is a regression rig for agent configs: golden tasks in, scorecard out (MISSION-ARTIFACTS.md). The CLI is orchestra eval with optional suite, task, and export flags (MISSION-ARTIFACTS.md). It is the safety net that catches a config change quietly making the agent worse.

Tasks are defined in YAML, one file per task in the tasks directory, with fields id, category, description, workspace, agent_config, timeout_s, and checks (MISSION-ARTIFACTS.md). The check types in scope for v1 are file_exists, command_exit_zero, contains with an inverted flag for must-not-contain, and json_valid with optional required_keys (MISSION-ARTIFACTS.md). LLM-judge checks are out of scope for v1 and are noted in the README as roadmap, nothing more (MISSION-ARTIFACTS.md).
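
The four v1 check types fit in one small dispatcher. This is a sketch, not the Eval check engine: the run_check helper and the per-check keys path, command, and text are assumptions, while the type names, the inverted flag, and required_keys come from the text.

```python
import json
import subprocess
from pathlib import Path


def run_check(check, workspace):
    """Evaluate one check against artifacts in the workspace; returns (status, detail)."""
    kind = check["type"]
    root = Path(workspace)
    try:
        if kind == "file_exists":
            ok = (root / check["path"]).is_file()
        elif kind == "command_exit_zero":
            ok = subprocess.run(check["command"], shell=True, cwd=root).returncode == 0
        elif kind == "contains":
            found = check["text"] in (root / check["path"]).read_text(encoding="utf-8")
            # inverted=True turns the check into must-not-contain.
            ok = not found if check.get("inverted", False) else found
        elif kind == "json_valid":
            data = json.loads((root / check["path"]).read_text(encoding="utf-8"))
            ok = all(key in data for key in check.get("required_keys", []))
        else:
            return "error", f"unknown check type: {kind}"
    except json.JSONDecodeError as exc:
        return "fail", f"invalid JSON: {exc}"
    except OSError as exc:
        return "error", str(exc)  # e.g. the file to inspect is missing
    return ("pass" if ok else "fail"), kind
```

Every branch reads a file or an exit code and none reads a transcript, so the same artifacts give the same result on every run.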

Determinism is the design constraint that makes the eval suite trustworthy. Agent outputs vary between runs, so checks assert on artifacts, never on transcript wording, and a task that cannot be checked deterministically does not belong in v1 (MISSION-ARTIFACTS.md). The v1 set is 10 golden tasks across three categories: code (4), docs and copy (3), and ops and files (3). The banned-phrase rule is tested via an inverted contains check in the docs and copy suite, so the eval suite enforces the same copy rule the reviewers do (MISSION-ARTIFACTS.md).

The scorecard.json v1 schema is load-bearing (MISSION-ARTIFACTS.md):

{
  "schema": "orchestra.eval.v1",
  "run_id": "",
  "started_at": "ISO8601",
  "finished_at": "ISO8601",
  "agent_config_hash": "",
  "tasks": [
    { "id": "", "category": "", "status": "pass | fail | error",
      "duration_s": 0,
      "checks": [ { "type": "", "status": "", "detail": "" } ] }
  ],
  "summary": { "total": 0, "passed": 0, "failed": 0, "errored": 0 }
}
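
The summary block of the scorecard is a count over task statuses. A sketch with hypothetical helpers; the roll-up rule in task_status (error beats fail, fail beats pass) is an assumption, since the schema names the statuses but does not say how check results combine into a task result.

```python
from collections import Counter


def task_status(check_statuses):
    """Roll check results up to one task status.

    ASSUMED rule: any errored check makes the task an error,
    otherwise any failed check makes it a fail.
    """
    if "error" in check_statuses:
        return "error"
    if "fail" in check_statuses:
        return "fail"
    return "pass"


def summarise(tasks):
    """Build the summary object from the task list of a scorecard."""
    counts = Counter(task["status"] for task in tasks)
    return {
        "total": len(tasks),
        "passed": counts["pass"],
        "failed": counts["fail"],
        "errored": counts["error"],
    }
```

Because the summary is computed from the task list rather than entered by hand, the headline number cannot drift from the per-task evidence beneath it.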

The rig is tested against itself. Eval's Phase 7 review is a clean-clone test: a fresh checkout, one command, scorecard produced, then the check engine run twice against identical artifacts to confirm identical results (MISSION-ARTIFACTS.md). The pattern is already proven elsewhere. PdlBldrV1 runs a golden regression suite over 25 committed goldens, with a version guard that fails the suite on an engine bump and forces a re-bless (PdlBldrV1 HANDOFF.md). Its honesty contract states "SPICE modelled" on the UI labels and surfaces the simulation source field in the scorecard annotation, restructured as a result of the gate check (PdlBldrV1 HANDOFF.md). On the portfolio site, the numbers strip draws its counts from Eval and Ledger exports only, fed by a dated telemetry.json export as a hard requirement (MISSION-SITE.md).

Built since this was mined. The Eval now runs the ten golden tasks and emits a real scorecard; the showcase run scored 10 of 10 (scorecard.json), with a clean-clone review confirming the rig reproduces from a fresh checkout. The regression net described above is live, not planned.

10. Incidents: response practice and the postmortem habit

Be honest about the state of this section. The corpus has a strong postmortem habit and a thin response practice. I am going to say so rather than pad it.

The postmortem is a formal deliverable, one of the seven in this mission set (MISSION-DOCS.md). Its template is locked: Summary, Timeline, Detection (which gate fired, or why none did), Impact, Root cause, What changed, What could still fail (MISSION-DOCS.md). The Detection field points straight back at the gate catalogue as the first analytical lens; the first question after an incident is which gate should have caught this, and why it did not (MISSION-DOCS.md).

Two rules keep the postmortem real. It is a blameless register: the system gets examined, nobody gets performed at (MISSION-DOCS.md). And no fact may enter the document that Mitch did not supply; Mitch supplies the incident and the raw facts, agents format and tighten only (MISSION-DOCS.md). The bar against fakery is itself a falsification condition: a postmortem that reads sanitised or hypothetical is mission failure (MISSION-DOCS.md). Two standing practices surround incidents. One is the claim-to-artifact audit, a table in BUILDLOG.md covering every deliverable (MISSION-DOCS.md). The other is the decisions log: PdlBldrV1 logs a false-premise finding as decision #39, a misidentified bug corrected by a boss ruling, with the lesson recorded as verify provenance, not just the value (PdlBldrV1 HANDOFF.md).

Two incidents are on the record. The hard-deny case rm -rf against the project temp directory justified the Tier-3 Bash gate (gate-design.md). And the model-withdrawal of 13 June is written up in full under the locked template as postmortem 001: the model every session ran on was pulled mid-build, every session resumed from durable state with no work lost, and the gap it exposed, a single model with no fallback, is named.

The response practice, kept honest, is the one postmortem 001 demonstrates rather than a runbook on paper: detect from the durable signal, not from memory; reground from committed state and the file-based handoffs; substitute the failed component, here the model, and resume; then write the postmortem. This is a one-person estate, so there is no on-call rotation or comms protocol to invent, and pretending otherwise would fail the no-fabrication bar. The practice is small and real: state in files, commit at boundaries, a blameless write-up after.

Closing

This Playbook describes how ORCHESTRATOR.COMPANY governs an agent workforce: by blast-radius permission tiers, gates that fail safe, falsification clauses that define failure before work starts, a writer who never reviews their own work, file-based state that survives a crash, models matched to stakes, telemetry that refuses to invent numbers, an eval suite that catches regressions, and a blameless postmortem habit.

Authored by Mitch Clarke. PrizmaCore.
