Postmortem · 001 · 14 June 2026
Summary
On the afternoon of Saturday 13 June 2026, the model every active session was running on was withdrawn mid-build. Three Claude Code terminals and the coordinating session all stalled within the same minute. No work was lost. Every session resumed from durable state on a different model, and cleanup came down to two stale git locks and a set of uncommitted edits, all recovered intact. The cost was build time, not work.
Timeline
That afternoon, all four sessions were mid-build on claude-fable-5: one finishing a ledger phase, one merging feature branches, one holding at a human gate, and the coordinating session mid-edit. The model was withdrawn, and every session's next turn failed with the same error: the selected model may not exist or may not be available. All four stopped where they stood. The cause was identified quickly as the model, not the work. Each session was switched to Opus 4.8 and resumed from its transcript and state files; the coordinating session rebuilt the true state, cleared the fallout, and the builds carried on.
Detection
Detection was immediate and unambiguous: the model error printed in every terminal at once, with no silent degradation. The harder part was the second-order state: which work had been committed, which was still in flight, and what the stall had left behind. That was read from durable state (the git history and the per-mission state files) rather than taken on trust from any session's memory.
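As an illustration of reading in-flight work from git rather than from a session's recollection, here is a minimal sketch that classifies the output of `git status --porcelain`. It is not the tooling used during the incident. The function and class names are hypothetical, and it handles only the plain v1 "XY path" line format (a rename line keeps its "old -> new" text as the path).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WorkingTreeState:
    """Paths that are not yet safely committed, grouped by where they sit."""
    staged: list[str] = field(default_factory=list)
    unstaged: list[str] = field(default_factory=list)
    untracked: list[str] = field(default_factory=list)

    @property
    def clean(self) -> bool:
        # A clean tree means everything the session produced is in history.
        return not (self.staged or self.unstaged or self.untracked)


def classify_porcelain(output: str) -> WorkingTreeState:
    """Classify `git status --porcelain` (v1) lines of the form 'XY path'.

    X is the index (staged) status and Y is the working-tree status.
    The output must not be stripped: a leading space is a meaningful X value.
    """
    state = WorkingTreeState()
    for line in output.splitlines():
        if len(line) < 4:
            continue  # too short to hold 'XY ' plus a path
        index_flag, worktree_flag, path = line[0], line[1], line[3:]
        if index_flag == "?" and worktree_flag == "?":
            state.untracked.append(path)
            continue
        if index_flag != " ":
            state.staged.append(path)
        if worktree_flag != " ":
            state.unstaged.append(path)
    return state
```

In practice the input would come from something like `subprocess.run(["git", "status", "--porcelain"], capture_output=True, text=True).stdout`, run once per repository before any session is told to resume.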
Impact
No work was lost, and that was design rather than luck. Committed work was safe: the ledger phases were committed, the site's earlier phases were on the main branch, and the parallel feature work was committed on its own branches. State lived in files, so each mission could resume cold from its last clean checkpoint. The fallout was contained to two stale lock files, left by git subprocesses that died with their sessions and blocked the next git operation until cleared, and a set of live coordination edits swept into a git stash and recovered whole. The cost was the time to reground and resume.
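The stale-lock check can be sketched as follows. This is a hedged example, not the documented host-side procedure: the function name and the 60-second age threshold are assumptions, and file age is only a heuristic. It does not prove that the process which created the lock has died, so the sketch reports candidates and deliberately deletes nothing.

```python
from __future__ import annotations

import time
from pathlib import Path
from typing import Optional


def find_stale_locks(
    repo: Path,
    min_age_seconds: float = 60.0,  # assumed threshold, tune to taste
    now: Optional[float] = None,
) -> list[Path]:
    """Return *.lock files under repo/.git at least min_age_seconds old.

    Git takes locks such as .git/index.lock or .git/refs/heads/<name>.lock
    while writing; a process killed mid-write can leave one behind.
    """
    git_dir = repo / ".git"
    if not git_dir.is_dir():
        # Not a repository, or .git is a file (worktree/submodule): skip.
        return []
    current = time.time() if now is None else now
    stale = []
    for lock in git_dir.rglob("*.lock"):
        if lock.is_file() and current - lock.stat().st_mtime >= min_age_seconds:
            stale.append(lock)
    return sorted(stale)
```

A human, or a host-side script, would review the returned paths, confirm that no git process is still running against the repository, and only then remove them.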
Root cause
The estate was a monoculture on one axis: every session depended on a single model, with no fallback configured. A model is an external dependency that can be withdrawn or have its access revoked at any time. When this one was, there was no second model to fail over to, so every session failed together. The trigger was external. The exposure was a design choice: one model, no fallback, every session pinned to it.
What could still fail
A single-model dependency is the open gap. The fix is a configured fallback in the same context class, or a rehearsed fast manual swap. The swap worked here because the replacement model also supported the large context the sessions had grown into; a fallback that did not would have failed in a different way. Stale locks are the second sharp edge: a git subprocess that dies mid-write leaves a lock the sandbox cannot clear, so it needs a host-side removal, a step now documented rather than rediscovered under pressure. And in-flight work is the fragile surface: everything committed survived cleanly, everything uncommitted needed careful recovery. The lesson is the one the estate already runs on. Commit at boundaries, keep state in files, and a model vanishing becomes an interruption instead of a loss.
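The fallback rule described above, a second model in the same context class, can be sketched as a simple ordered selection. Everything here is an assumption for illustration: the names `ModelOption` and `pick_model`, the availability probe, and the model names and context sizes in the usage example are all hypothetical, and none are taken from the estate's configuration or from any vendor's API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str            # hypothetical identifier
    context_tokens: int  # largest context the model accepts (illustrative)


def pick_model(
    preferred: Sequence[ModelOption],
    required_context_tokens: int,
    is_available: Callable[[str], bool],
) -> Optional[ModelOption]:
    """Return the first option, in preference order, that is both available
    and large enough for the context the session has already grown into."""
    for option in preferred:
        if option.context_tokens < required_context_tokens:
            continue  # would fail differently: the transcript would not fit
        if is_available(option.name):
            return option
    return None  # nothing suitable: stop and fall back to a manual swap


# Illustrative usage with made-up names and sizes.
options = [
    ModelOption("primary-model", 1_000_000),
    ModelOption("fallback-small", 200_000),
    ModelOption("fallback-large", 1_000_000),
]
withdrawn = {"primary-model"}
choice = pick_model(options, 600_000, lambda name: name not in withdrawn)
```

The context check comes before the availability probe so that a model too small for the session is never selected, even if it is the only one still reachable. Returning `None` makes the no-fallback case explicit instead of leaving it to surface as a failed turn.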
This was an unplanned live test of the estate's crash-recovery design, and it passed. The work was never at risk. The gap it exposed, a single model with no fallback, is real and worth closing.