Single-agent LLM systems hit a ceiling fast. I found this out the hard way while building an AI-powered document processing pipeline for a financial services client. The task: ingest unstructured documents, extract structured data, validate it against business rules, flag anomalies, and produce a final report — all autonomously. A single GPT-4o prompt with a complex system message worked in demos. It collapsed at scale.
EMAOS — the Event-driven Multi-Agent Orchestration System — is the framework we built at SandyTech to solve this class of problem. It now powers several of our products.
Before getting into the architecture, it is worth understanding the specific failure modes:
Context window exhaustion. A complex workflow accumulates context fast. By step 8 of a 12-step pipeline, you are feeding the LLM thousands of tokens of history. Quality degrades, costs spike, and you start losing early context due to attention dilution.
Monolithic failure. A single agent either completes the task or it does not. There is no opportunity for specialisation, retry, or partial recovery. One bad tool call can derail the entire workflow.
No separation of concerns. Asking one LLM to simultaneously plan, execute, critique, and remember is like asking a single employee to be simultaneously your strategist, analyst, QA team, and institutional memory. People do not work that way. Neither do LLMs.
Cost control is impossible. With a single agent, you use your most capable (expensive) model for everything, including tasks that a GPT-3.5-tier model could handle perfectly well.
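To make the cost point concrete, here is a back-of-envelope comparison. The per-1K-token prices and step sizes below are illustrative only, not real pricing; the shape of the saving is the point:

```python
# Illustrative prices per 1K tokens -- hypothetical, not current vendor pricing.
PRICES = {"capable": 0.01, "light": 0.0005}

# (step, tokens, needs_capable_model) -- a made-up 5-step pipeline.
steps = [
    ("extract", 4000, True),
    ("validate", 2000, False),
    ("enrich", 3000, False),
    ("critique", 2500, True),
    ("report", 1500, False),
]

def cost_single_model(steps):
    # Monolithic agent: the capable (expensive) model handles every step.
    return sum(tokens / 1000 * PRICES["capable"] for _, tokens, _ in steps)

def cost_routed(steps):
    # Per-step routing: the light model handles whatever it can.
    return sum(
        tokens / 1000 * PRICES["capable" if needs else "light"]
        for _, tokens, needs in steps
    )
```

Even with half the tokens still going to the capable model, routing cuts the bill roughly in half in this toy example.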
┌─────────────────────────────────────────────────────────────┐
│ EMAOS Orchestrator │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Planner │ │ Executor │ │ Critic │ │ Memory │ │
│ │ Agent │ │ Agent │ │ Agent │ │ Agent │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ │
│ │ │ │ │ │
│ ─────┴───────────────┴───────────────┴─────────────┴────── │
│ Azure Event Hubs (Message Bus) │
└─────────────────────────────────────────────────────────────┘
│
┌─────────┴─────────┐
│ State Store │
│ (Redis / CosmosDB)│
└───────────────────┘
Each agent is a stateless, independently deployable process. They communicate exclusively through the message bus. No agent calls another agent directly.
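The agent contract is easy to sketch. The class and bus interface below are illustrative, not the actual EMAOS code: each agent subscribes to a handful of topics, dispatches incoming events to handlers, and publishes results back to the bus.

```python
import json

class BaseAgent:
    """Minimal sketch of the agent contract: consume events from the bus,
    dispatch by type, publish results. The bus interface and handler map
    are assumptions for illustration, not the real EMAOS API."""

    def __init__(self, bus, subscriptions):
        self.bus = bus                      # message-bus client (e.g. an Event Hubs consumer)
        self.subscriptions = subscriptions  # {"step.ready": self.handle_step_ready, ...}

    async def run(self):
        async for raw in self.bus.consume():
            event = json.loads(raw)
            handler = self.subscriptions.get(event["type"])
            if handler:
                await handler(event["payload"])
            # Unknown event types are ignored: agents react only to the
            # topics they subscribe to and never call each other directly.
```

Statelessness falls out of this shape: all durable state lives in the state store, so any agent replica can pick up any event.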
The Planner is the only agent that sees the high-level task description. It uses a capable model (GPT-4o) to decompose the task into a Directed Acyclic Graph (DAG) of subtasks:
```json
{
  "task_id": "task-abc123",
  "steps": [
    { "step_id": "s1", "type": "extract",  "depends_on": [],     "input": "raw_document" },
    { "step_id": "s2", "type": "validate", "depends_on": ["s1"], "input": "s1.output" },
    { "step_id": "s3", "type": "enrich",   "depends_on": ["s2"], "input": "s2.output" },
    { "step_id": "s4", "type": "critique", "depends_on": ["s3"], "input": "s3.output" },
    { "step_id": "s5", "type": "report",   "depends_on": ["s4"], "input": "s4.output" }
  ]
}
```

The Planner publishes this plan to the task.planned topic on Event Hubs and terminates. It does not supervise execution.
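Because the plan drives everything downstream, it pays to sanity-check it before publishing. A minimal sketch against the JSON shape above (the helper names are ours, not EMAOS APIs): one function verifies the depends_on edges form a DAG via Kahn's algorithm, another computes which steps are currently runnable.

```python
def ready_steps(plan, completed):
    """Steps whose dependencies are all completed and which haven't run yet."""
    done = set(completed)
    return [
        s["step_id"]
        for s in plan["steps"]
        if s["step_id"] not in done and all(d in done for d in s["depends_on"])
    ]

def is_dag(plan):
    """Kahn's algorithm: the plan is a valid DAG iff every step can be scheduled."""
    remaining = {s["step_id"]: set(s["depends_on"]) for s in plan["steps"]}
    while remaining:
        runnable = [sid for sid, deps in remaining.items() if not deps]
        if not runnable:
            return False  # a cycle: no remaining step has its dependencies met
        for sid in runnable:
            del remaining[sid]
            for deps in remaining.values():
                deps.discard(sid)
    return True
```

The same `ready_steps` logic is what lets the Executor pick up work as dependencies resolve, rather than walking the pipeline sequentially.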
The Executor subscribes to task.planned and step.ready events. It picks up executable steps (those whose dependencies are all resolved), runs them using a tool-calling LLM, and publishes results to step.completed or step.failed.
The Executor uses lighter-weight models where possible. For a "validate structured data against schema" step, GPT-4o-mini is sufficient. The model selection is specified in the plan by the Planner.
```python
class ExecutorAgent:
    async def handle_step_ready(self, event: StepReadyEvent):
        step = await self.state_store.get_step(event.step_id)
        # The Planner pins a model per step; fall back to the cheap tier.
        model = step.metadata.get("model", "gpt-4o-mini")
        result = await self.llm_client.complete(
            model=model,
            messages=self.build_messages(step),
            tools=self.get_tools_for_step_type(step.type),
        )
        if result.finish_reason == "tool_calls":
            tool_result = await self.execute_tool_calls(result.tool_calls)
            await self.publish("step.completed", {
                "step_id": step.id,
                "output": tool_result,
            })
        else:
            # Every step type is expected to resolve via a tool call;
            # anything else is published as a failure so the retry path can react.
            await self.publish("step.failed", {
                "step_id": step.id,
                "error": "unexpected_finish_reason",
            })
```

After each Executor output, the Critic Agent evaluates quality. This is the step most frameworks skip, and it is the one that makes EMAOS reliable in production.
The Critic uses a rubric defined at task-creation time. For a document extraction task, the rubric might check: "Are all required fields present? Are numeric fields within expected ranges? Does the extracted data match the source document?"
If the Critic scores the output below threshold, it publishes a step.rejected event with specific feedback. The Executor picks this up and retries with the feedback appended as context — a targeted self-correction loop rather than a full task restart.
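The retry mechanics reduce to two small decisions, sketched here with illustrative helpers (MAX_RETRIES and the message wording are assumptions, not EMAOS internals): whether a rejected step gets another attempt, and how the feedback is folded into the Executor's next prompt.

```python
MAX_RETRIES = 3  # illustrative cap; in practice a deployment setting

def should_retry(retry_count: int, max_retries: int = MAX_RETRIES) -> bool:
    """Whether a rejected step gets another attempt or is failed outright."""
    return retry_count <= max_retries

def build_retry_messages(base_messages: list[dict], feedback: str) -> list[dict]:
    """Append the Critic's feedback as a user turn so the next attempt is a
    targeted correction rather than a blind re-roll of the same prompt."""
    return base_messages + [{
        "role": "user",
        "content": (
            "Your previous output was rejected for this reason: "
            f"{feedback}\nProduce a corrected output that addresses the feedback."
        ),
    }]
```

Appending feedback rather than replaying the whole failed transcript keeps the retry prompt small, which matters once retries compound.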
```python
class CriticAgent:
    async def handle_step_completed(self, event: StepCompletedEvent):
        step = await self.state_store.get_step(event.step_id)
        rubric = await self.state_store.get_rubric(step.task_id, step.type)
        evaluation = await self.evaluate(step.output, rubric)
        if evaluation.score >= rubric.threshold:
            await self.publish("step.approved", {"step_id": step.id})
        else:
            # Rejection carries concrete feedback so the Executor can
            # retry with a targeted correction rather than a blind re-run.
            await self.publish("step.rejected", {
                "step_id": step.id,
                "feedback": evaluation.feedback,
                "retry_count": step.retry_count + 1,
            })
```

The Critic is the most expensive agent to run because it needs a capable model — you cannot have a weak model assess a strong one effectively. We accept this cost because it dramatically reduces the failure rate at the task level.
The Memory Agent maintains a persistent, task-scoped context store. Rather than passing growing context arrays between agents (which is what AutoGen and CrewAI do by default), agents query the Memory Agent for relevant context.
Memory is structured in three tiers.
When the Executor starts a step, it queries Memory for "what do I need to know to complete this step type?" rather than being fed the full history. This keeps token counts predictable.
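A toy version of that query pattern, with an illustrative in-memory store and tag scheme (the real Memory Agent sits behind the message bus and a persistent store):

```python
class MemoryAgent:
    """Sketch of task-scoped memory: entries are tagged by the step types
    they are relevant to, and recall returns only those entries.
    The storage and tagging scheme here are illustrative."""

    def __init__(self):
        self.entries = []  # [{"task_id": ..., "tags": set, "content": str}]

    def remember(self, task_id, tags, content):
        self.entries.append({"task_id": task_id, "tags": set(tags), "content": content})

    def recall(self, task_id, step_type, limit=5):
        # Only entries relevant to this step type, newest first, capped --
        # so the Executor's prompt size stays predictable regardless of
        # how long the task has been running.
        relevant = [
            e["content"] for e in reversed(self.entries)
            if e["task_id"] == task_id and step_type in e["tags"]
        ]
        return relevant[:limit]
```

The cap and tag filter are the whole trick: context retrieval is pull-based and bounded, instead of an ever-growing array pushed through every agent.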
We chose Azure Event Hubs over Service Bus as the message bus for EMAOS.
```python
# Publishing an event
import json
import os
from datetime import datetime, timezone

from azure.eventhub import EventData
from azure.eventhub.aio import EventHubProducerClient

# The producer is long-lived. Wrapping each publish in `async with producer:`
# would tear down the AMQP connection after the first event.
producer = EventHubProducerClient.from_connection_string(
    conn_str=os.environ["EVENT_HUB_CONNECTION_STRING"],
    eventhub_name="emaos-events",
)

async def publish(event_type: str, payload: dict):
    event_data = EventData(json.dumps({
        "type": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }))
    event_data.properties = {"event_type": event_type}
    batch = await producer.create_batch()
    batch.add(event_data)
    await producer.send_batch(batch)
```

| Aspect | EMAOS | AutoGen | CrewAI |
|---|---|---|---|
| Agent communication | Event bus (async) | Direct function calls (sync) | Sequential/hierarchical (sync) |
| State management | External store, per-agent | In-memory, shared | In-memory, shared |
| Failure isolation | Per-step retry, critic | Agent-level | Task-level |
| Model selection | Per-step in plan | Per-agent | Per-agent |
| Scalability | Horizontally scalable | Single process | Single process |
| Observability | Event log is the audit trail | Conversation log | Conversation log |
AutoGen and CrewAI are excellent for prototype-scale agentic tasks. EMAOS is designed for production workloads where you need audit trails, cost control, horizontal scaling, and graceful degradation.
At SandyTech, EMAOS now powers several production systems, including the document processing pipeline that motivated it.
If you are building an AI product where a single ChatCompletion call is no longer enough, EMAOS-style thinking is the path forward.