A private, local AI receptionist for service businesses
A small-language-model appliance that reads service-business enquiry emails and books them on local hardware, with no cloud and no per-call fee.
Context
Service businesses run on enquiry email, and most of it arrives as mess. "Hot water's dead, can you come Tuesday arvo, 12 Marion Rd, 0411." Triaging that by hand is tedious and it drops leads. The two options on the market are both bad: do it by hand, or pipe your customers' details to a cloud AI that charges per call and takes their data off-site.
SLM4SMB is the third option. It is a small-language-model appliance that sits in a box in the back office, reads the enquiries, works out what the customer wants, and puts it on the calendar. No lead data is sent to a third-party AI, and there is no monthly fee. I built it to prove one thesis: a small local model is good enough for the narrow job of classifying and extracting from these emails, and it can run on cheap or old hardware. If that holds, the cost problem and the privacy problem collapse at once.
What it does
It reads the mail and strips the noise, down to the words the customer actually typed.
It fills in a strict form, not a conversation. The local model is forced to act like a data-entry clerk: is this a booking, what is the number, what service is needed. If it is unsure, it does not guess. It flags the email for a human.
It never does the date maths. If the customer wrote "next Tuesday arvo," the model copies that phrase and a plain, predictable Python routine resolves it against today's date. The model handles fuzzy language; arithmetic is never trusted to it.
It books a tentative event and raises the alarm. Urgent words like "burst pipe" fire an immediate notification so the owner does not miss it.
It remembers where it was. Pull the plug mid-read and it resumes where it stopped. A reply on the same thread updates the original event instead of creating a duplicate.
The decisions that matter
The best call was the split between what is probabilistic and what is exact. The model only extracts date and time phrases verbatim. Every resolution to a real datetime happens in plain, timezone-aware Python. That is the first of ten invariants written before any code, and it is why the date logic is deterministic and fully testable with the model out of the loop.
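A minimal sketch of that split, not the project's actual resolver. The function name, the weekday rule (a named weekday always means its next occurrence strictly after today) and the clock times chosen for "arvo" and friends are all assumptions made for illustration.

```python
import re
from datetime import datetime, time, timedelta, timezone
from typing import Optional

WEEKDAYS = {"monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
            "friday": 4, "saturday": 5, "sunday": 6}

# Hypothetical defaults for vague parts of the day.
DAYPARTS = {"morning": time(9, 0), "arvo": time(13, 0),
            "afternoon": time(13, 0), "evening": time(17, 0)}
DEFAULT_TIME = time(9, 0)  # assumed start time when no day part is given


def resolve_phrase(phrase: str, now: datetime) -> Optional[datetime]:
    """Resolve a verbatim phrase against a timezone-aware `now`.

    Returns None when the phrase names no weekday, so the caller can fall
    back to a "needs scheduling" event instead of guessing.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    words = re.findall(r"[a-z]+", phrase.lower())
    weekday = next((WEEKDAYS[w] for w in words if w in WEEKDAYS), None)
    if weekday is None:
        return None
    # Assumed rule: always 1 to 7 days ahead, never today.
    days_ahead = (weekday - now.weekday() - 1) % 7 + 1
    start = next((DAYPARTS[w] for w in words if w in DAYPARTS), DEFAULT_TIME)
    target = (now + timedelta(days=days_ahead)).date()
    return datetime.combine(target, start, tzinfo=now.tzinfo)


# A fixed +10:00 offset keeps the demo dependency-free; a real deployment
# would more likely pass a zoneinfo.ZoneInfo zone so daylight saving is handled.
NOW = datetime(2024, 6, 5, 10, 0, tzinfo=timezone(timedelta(hours=10)))  # a Wednesday
resolved = resolve_phrase("next Tuesday arvo", NOW)   # Tuesday 11 June 2024, 13:00
unknown = resolve_phrase("whenever suits", NOW)       # None
```

Because the function takes `now` as an argument and never calls the model, every case can be pinned in a unit test with a fixed reference date.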
Extraction is fail-safe by construction: schema-constrained decoding against a typed schema, wrapped in a validate, retry, then needs-review loop that never raises. A lead is never silently dropped: a failed read becomes a review item, and an unresolvable date becomes an all-day "needs scheduling" event rather than vanishing.
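The wrapper around the model call might look roughly like this. It is a sketch under assumptions: the `Lead` fields, the `Outcome` type, and the `generate` callable (standing in for whatever schema-constrained call the local runtime provides) are hypothetical names, and the constrained decoding itself is not shown.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Lead:
    is_booking: bool
    phone: Optional[str]
    service: Optional[str]
    when_phrase: Optional[str]  # copied verbatim; never resolved by the model


@dataclass(frozen=True)
class Outcome:
    status: str                 # "ok" or "needs_review"
    lead: Optional[Lead]
    reason: str = ""


def _validate(raw: str) -> Lead:
    """Parse the model's JSON and check types; raises ValueError on any mismatch."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("is_booking"), bool):
        raise ValueError("is_booking must be a boolean")
    fields = {}
    for key in ("phone", "service", "when_phrase"):
        value = data.get(key)
        if value is not None and not isinstance(value, str):
            raise ValueError(f"{key} must be a string or null")
        fields[key] = value
    return Lead(is_booking=data["is_booking"], **fields)


def extract(email_text: str, generate: Callable[[str], str], attempts: int = 2) -> Outcome:
    """Validate, retry, then hand off to a human. Never raises."""
    reason = "no attempts made"
    for _ in range(attempts):
        try:
            return Outcome("ok", _validate(generate(email_text)))
        except Exception as exc:  # deliberately broad: any failure becomes a review item
            reason = f"{type(exc).__name__}: {exc}"
    return Outcome("needs_review", None, reason)


# One bad reply followed by a good one: the retry recovers.
replies = iter([
    "not json",
    '{"is_booking": true, "phone": null, "service": "hot water", "when_phrase": "Tuesday arvo"}',
])
recovered = extract("Hot water's dead, can you come Tuesday arvo", lambda _: next(replies))

# A model that never produces valid output: the lead becomes a review item.
failed = extract("Hot water's dead, can you come Tuesday arvo", lambda _: "not json")
```

The broad `except` is the point of the design: a crash in the model runtime and a malformed reply both end in the same place, a review item a human will see.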
Idempotency is thread-scoped. The event's identity derives from the thread-root message, so a reply updates the same event and the system does not double-book. Done wrong, that is the bug that silently double-books people in production.
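One way to derive a thread-scoped identity, sketched with assumptions: headers arrive as a plain mapping, the first entry in `References` is treated as the thread root (falling back to `In-Reply-To`, then the message's own `Message-ID`), and the UID format and in-memory `calendar` dict are hypothetical stand-ins for the real store.

```python
import hashlib
from typing import Dict, Mapping


def thread_root_id(headers: Mapping[str, str]) -> str:
    """Best-effort thread root: first References entry, else parent, else self."""
    references = headers.get("References", "").split()
    if references:
        return references[0]
    return headers.get("In-Reply-To", "").strip() or headers["Message-ID"].strip()


def event_uid(headers: Mapping[str, str]) -> str:
    """Stable event identity derived only from the thread root."""
    digest = hashlib.sha256(thread_root_id(headers).encode("utf-8")).hexdigest()
    return f"{digest[:32]}@slm4smb.invalid"  # hypothetical UID format


def upsert_event(calendar: Dict[str, dict], headers: Mapping[str, str], details: dict) -> str:
    """Create the event, or merge new details into the existing one."""
    uid = event_uid(headers)
    calendar[uid] = {**calendar.get(uid, {}), **details}
    return uid


calendar: Dict[str, dict] = {}
first = {"Message-ID": "<root@example.com>"}
reply = {
    "Message-ID": "<reply@example.com>",
    "In-Reply-To": "<root@example.com>",
    "References": "<root@example.com>",
}
uid_a = upsert_event(calendar, first, {"service": "hot water", "when": "Tuesday arvo"})
uid_b = upsert_event(calendar, reply, {"when": "Wednesday morning"})
# Both messages map to the same UID, so the calendar still holds one event.
```

Keying on the root rather than on each message's own ID is what turns a reply into an update instead of a second booking.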
The subtlest piece: the alarm cannot be the calendar. The calendar is the system of record, but a subscribed calendar can take 8 to 24 hours to refresh, useless for an emergency. So urgent notifications run as a separate local channel at roughly five-minute latency, and a notification failure never breaks the booking.
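The failure isolation between the two channels can be sketched as follows. The function names and the injected `write_event` and `notify` callables are assumptions, and of the urgent terms only "burst pipe" comes from this write-up; the others are illustrative.

```python
import logging
from typing import Callable, Tuple

log = logging.getLogger("alerts")

# Only "burst pipe" is taken from the write-up; the rest are illustrative.
URGENT_TERMS = ("burst pipe", "flooding", "gas leak")


def is_urgent(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in URGENT_TERMS)


def book_and_alert(
    text: str,
    write_event: Callable[[str], str],
    notify: Callable[[str], None],
) -> Tuple[str, bool]:
    """Write to the calendar first, then try to alert. Returns (event_id, alerted)."""
    event_id = write_event(text)  # system of record; its errors are not swallowed
    if not is_urgent(text):
        return event_id, False
    try:
        notify(text)
        return event_id, True
    except Exception:
        # A dead notification channel is logged but never undoes the booking.
        log.exception("urgent notification failed for event %s", event_id)
        return event_id, False


booked = []


def fake_calendar(text: str) -> str:
    booked.append(text)
    return f"evt-{len(booked)}"


def broken_notifier(text: str) -> None:
    raise ConnectionError("push service unreachable")


result = book_and_alert("Burst pipe in the kitchen, please hurry", fake_calendar, broken_notifier)
# The event is still booked even though the alert failed.
```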
The privacy claim is deliberately precise: no third-party AI or SaaS touches lead data. Not "data never leaves the network," because the mail still lives on the provider. Saying exactly what is true is the point.
How it was built
One man, one machine, one day. The roles were split hard: I was the only decision-maker, Claude acted as technical director producing prompt blocks rather than code, and a Claude Code orchestrator dispatched scoped worker agents. What to build stayed mine; how to implement was delegated.
Eight phase-gated stages with bounded parallel waves. Workers ran in filesystem-isolated git worktrees, each owning a non-overlapping part of the tree, so merges were conflict-free by construction, not by luck. Every worker saw only its own spec and the contracts it consumed. Every gate was human-approved; no agent widened its own scope. The ten invariants lived in a shared constitution file, copied byte-identically so every agent ran under the same rules. Underneath all of it: eval-first. A failing test suite written before the code, so "done" was a number going green, not a judgment call.
What is proven, and what is not
Sixteen hand-labelled fixtures cover the real spread: clean bookings, emergencies, fuzzy and day-month-year dates, threaded replies, web-form cruft, and a batch of non-bookings. The suite was proven failing first, to confirm the tests test something. The deterministic core was verified with the model swapped out entirely, by injecting canned results, so correctness does not depend on model quality.
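Swapping the model out can be as simple as passing the extractor in as a function. The sketch below is illustrative only: `build_event`, the field names, and the toy resolver are hypothetical, and the canned extractor stands in for the live model.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

Extractor = Callable[[str], dict]
Resolver = Callable[[str, datetime], Optional[datetime]]


def build_event(email_text: str, extractor: Extractor, resolver: Resolver, now: datetime) -> dict:
    """Deterministic core: turn extracted fields into a calendar event."""
    fields = extractor(email_text)
    phrase = fields.get("when_phrase")
    start = resolver(phrase, now) if phrase else None
    if start is None:
        # Unresolvable date: keep the lead as an all-day placeholder.
        return {"title": f"Needs scheduling: {fields['service']}",
                "all_day": True, "date": now.date()}
    return {"title": fields["service"], "all_day": False, "start": start}


def canned_extractor(_email_text: str) -> dict:
    """Stands in for the model: always returns the same fields."""
    return {"service": "hot water repair", "when_phrase": "tomorrow 2pm"}


def toy_resolver(phrase: str, now: datetime) -> Optional[datetime]:
    """Understands exactly one phrase; enough to exercise both branches."""
    if phrase == "tomorrow 2pm":
        return (now + timedelta(days=1)).replace(hour=14, minute=0, second=0, microsecond=0)
    return None


NOW = datetime(2024, 6, 5, 10, 0, tzinfo=timezone.utc)
timed = build_event("ignored", canned_extractor, toy_resolver, NOW)
fallback = build_event(
    "ignored",
    lambda _: {"service": "quote", "when_phrase": "sometime"},
    toy_resolver,
    NOW,
)
```

With the extractor fixed, any failure in a test like this points at the deterministic code, not at the model.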
The eval caught real defects before they could ship. One test exposed that a mail password would have leaked into the logs, a real credential exposure caught before the code existed. On the live model it surfaced a domain gap: the model handled faults but dropped the service field on maintenance and quote enquiries, because the prompt only anchored fault examples. A blind spot, not a crash, and only a real eval finds it.
The marquee result is the orchestration audit. Context isolation was proven from the agents' own tool-call transcripts: the worker that wrote all the date logic resolved every case having logged zero raw-email accesses, and no worker crossed another's subtree. A separate security-review agent reproduced the audit independently. Provable, not claimed.
Final board on Llama 3.2 3B, running on a 2013 CPU-only iMac: booking detection 18 of 18, end to end 18 of 18, field extraction 12 of 12, date resolution 10 of 10. The full grading run took about eight minutes on that machine, which is fine for async email and likely far faster on modern silicon.
This is a validated prototype, not a production-hardened system. It passes a tough hand-built eval on one model on one machine. The dataset is small. The team of agents was directed by me at every step; the achievement is the orchestration and the discipline, not autonomy. Nothing built itself. Those are not apologies. They are the line between a claim that survives scrutiny and one that does not, which is the line this project held.
Evidence