Multi-Agent Orchestration Blueprint: Dispatch/Worker Architecture for AI Agents
The dispatch/worker architecture for coordinating multiple AI agents: Agent Bus messaging, state machines, async inboxes, and 3 reference implementations from a system running 5 agents 24/7.
Running one AI agent is straightforward. Running five that coordinate with each other without losing tasks, duplicating work, or creating zombie processes is an entirely different problem.
This blueprint documents the architecture of a production system that runs 5 autonomous Claude Code sessions continuously: a dispatch coordinator that routes work to specialized workers for content processing, trading operations, infrastructure maintenance, and research.
What you get:
A 12-chapter guide (10,000+ words) with 6 Mermaid architecture diagrams, 3 failure post-mortems, and battle-scar callouts from 60+ days of 24/7 production, covering:
Chapter 1: Why Multi-Agent. When multi-agent is the right call, and when a single agent with a task queue is simpler. The three failure modes (message loss, state confusion, zombie tasks) and how this architecture prevents each one.
Chapter 2: Dispatch/Worker Topology. The coordinator pattern: dispatch receives all incoming work, classifies it by domain, and routes it to workers. Why hierarchy beats flat architectures. How to design worker boundaries that minimize cross-worker chattiness.
Chapter 3: The Agent Bus. A SQLite-backed message router with lease-based delivery. Messages aren't deleted on delivery; they're leased for 30 seconds and removed only when acknowledged. If the receiver crashes before acknowledging, the lease expires and the message returns to the queue automatically. Full protocol spec with message types, lifecycle, and routing patterns.
Chapter 4: The Inbox System. File-based async queues for when agents are offline or tasks arrive from cron jobs: Markdown files with YAML frontmatter, synced between machines via rsync. When to use inboxes vs. the Agent Bus.
Chapter 5: State Machines. Explicit task states (queued, acked, running, done, failed, escalated) with enforcement of valid transitions. Priority queues, retry logic, and stuck-task detection with automatic reassignment.
Chapters 6-8: Three Reference Implementations
- Dispatch/Worker (Python): Coordinator + N workers with file-based inboxes, domain classification, routing, and timeout handling
- Message Bus Hub (TypeScript/Bun): Full SQLite-backed HTTP server with lease/ack protocol, cleanup cron, and a TypeScript client library
- Pipeline Orchestrator (Python): Linear stage-by-stage processing with checkpoint-based recovery so failed pipelines resume from the last successful stage
Chapter 9: Operational Patterns. Health monitoring, graceful shutdown, scaling workers, daily status reports, and JSONL event logging for observability.
Chapter 10: Security and Trust Boundaries. Per-worker permission scoping, message validation, preventing escalation attacks, and the audit trail the bus gives you for free.
Chapter 11: Testing Multi-Agent Systems. Unit testing the bus, dispatch routing tests, end-to-end integration patterns, and load testing that actually catches bugs (not just throughput benchmarks).
Chapter 12: Observability. The event log schema, JSONL parsing patterns with jq, Grafana dashboard layouts, and the 5 alerts actually worth paging on. Plus the cheapest possible observability stack if you're not ready for Grafana yet.
Plus:
- Worker manifest schema for self-registration
- Hub and client configuration reference
- Troubleshooting table for common failure modes
- Migration guide: how to go from single agent to multi-agent incrementally
- Appendix E: three failure post-mortems (the OOM'd dispatch, the 6-hour stuck task, and the Postgres migration we decided not to do)
Who this is for:
- Developers running multiple AI agent sessions that need to coordinate
- Teams building autonomous agent systems with Claude Code, LangChain, or similar frameworks
- Anyone whose agents need reliable task routing, health monitoring, and failure recovery
Who this is NOT for:
- If you only need one agent, this is overengineered for your use case
- If you want a managed orchestration platform: these are patterns for self-hosted infrastructure
- Implementations are in Python and TypeScript only; no other languages are included
Python components: 3.10+, zero dependencies. TypeScript components: Bun 1.0+.