
Why We Build AI Agent Armies, Not AI Tools

Haven Vu, Founder & CEO of Spacetime · 4 min read
Ink-style illustration of a coordinated team of AI agents (an AI army)

TL;DR

AI rollouts stall when you ship chatbots instead of systems. Design AI like a team: narrow roles, orchestration with logs, and locked-down tool access.

Most companies buy an AI tool, plug it in, and wait for magic. Then they wonder why nothing changes.

The problem isn't the model. It’s that they bought a single employee and expected them to run the entire company without a manager, a desk, or a job description. If you want AI to actually ship work, you have to treat it like a team from day one. We call this building an "AI army"—though "squad" might be more accurate.

The "God Mode" Trap

Here is a mistake I see every week. A founder spins up a single agent, gives it access to Slack, Jira, and the production database, and tells it: "You are a helpful assistant. Fix bugs."

This is terrifying.

A single generalist agent is trapped by its own context window. It gets confused. It hallucinates. It tries to be helpful rather than accurate.

I watched one implementation where a "support assistant" hallucinated a 100% refund policy because it read an outdated PDF from 2019 that was still in the vector database. It didn't just promise the money; it drafted the email and nearly sent it to 50 angry customers. The only thing that stopped it was a rate limit error.

That is not a model failure. That is a system design failure.

Specialization Beats Intelligence

Humans learned this centuries ago: division of labor works. You don't hire one person to be your VP of Sales, lead engineer, and janitor. Yet we expect LLMs to do exactly that.

The fix is simple but requires more upfront work. You break the "God Agent" into three distinct roles.

First, the Intake Agent. Its only job is to understand what is being asked. It doesn't solve anything. It just takes the messy, emotional email from a customer and turns it into a structured JSON object. It categorizes. It tags. It cleans.

Second, the Research Agent. This agent reads the clean JSON and goes hunting. It searches only the approved documentation. It finds the specific refund clause for "SaaS subscriptions over $1k." It cites its sources. It produces a brief.

Finally, the Executor Agent. This one is the doer. It takes the brief and performs a single, narrow action. It drafts the reply. It updates the ticket. It doesn't decide policy; it executes the policy found by the researcher.
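Here is a minimal sketch of that three-role split, with each agent reduced to a narrow function. The function names, the ticket schema, and the policy lookup are illustrative stand-ins, not a real SDK; in production each body would be an LLM call constrained to its schema.

```python
def intake_agent(raw_email: str) -> dict:
    """Turn a messy email into a structured ticket. No solving, just cleaning."""
    # Stub: a real version is an LLM call forced to emit exactly this schema.
    return {"category": "refund_request", "amount_usd": 1200,
            "summary": "Refund requested for annual SaaS plan."}

def research_agent(ticket: dict) -> dict:
    """Search only approved docs; return a cited brief, never a decision."""
    approved_docs = {"refund_request": {
        "clause": "SaaS subscriptions over $1k: refundable within 30 days",
        "source": "policy.md#refunds"}}
    found = approved_docs.get(ticket["category"])
    return {"brief": found, "confidence": 0.92 if found else 0.0}

def executor_agent(ticket: dict, research: dict) -> str:
    """One narrow action: draft the reply. Policy comes from the brief, not the model."""
    brief = research["brief"]
    return f"Draft: per {brief['source']}, your request falls under: {brief['clause']}."

ticket = intake_agent("I AM FURIOUS about this charge!!!")
research = research_agent(ticket)
reply = executor_agent(ticket, research)
print(reply)
```

Note that the Executor never sees the raw email, and the Researcher never drafts anything. Each model only carries the context its one job needs.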

When you split the work, you lower the cognitive load on each model. You get better results from smaller, cheaper models because they only have to do one thing well.

The Manager in the Middle

But a team without a manager is just a mob. You need orchestration.

In software terms, this means state. You need a system that remembers where the ticket is. Did the Researcher fail to find a document? The Orchestrator should catch that error and ask a human for help, not just let the Executor hallucinate an answer.

We built a system recently where the Orchestrator forces a "human-in-the-loop" review if the confidence score on the Research step drops below 80%. It’s boring, invisible work. It’s also the difference between a cool demo and a production system that doesn't bankrupt you.
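In code, that gate is just a threshold check between steps. The 80% floor and the escalation hook below reflect our own setup, not a universal constant; tune the floor to your risk tolerance.

```python
CONFIDENCE_FLOOR = 0.80  # our cutoff; yours may differ

def orchestrate(ticket: dict, research_step, execute_step, escalate):
    """Run the pipeline with state; route low-confidence work to a human."""
    brief = research_step(ticket)
    if brief.get("brief") is None or brief["confidence"] < CONFIDENCE_FLOOR:
        return escalate(ticket, brief)  # human-in-the-loop, never a guess
    return execute_step(ticket, brief)

# Stubs to show the control flow on a strong and a weak research result.
ok = orchestrate(
    {"category": "refund_request"},
    research_step=lambda t: {"brief": {"clause": "30-day window"}, "confidence": 0.92},
    execute_step=lambda t, b: "executed",
    escalate=lambda t, b: "sent to human",
)
weak = orchestrate(
    {"category": "unknown"},
    research_step=lambda t: {"brief": None, "confidence": 0.0},
    execute_step=lambda t, b: "executed",
    escalate=lambda t, b: "sent to human",
)
print(ok, "|", weak)  # executed | sent to human
```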

Boring is Good

If your AI system feels exciting, it's probably dangerous. Production AI should feel boring. It should feel like traditional software.

Real reliability comes from constraints. We lock our agents down with rigid allowlists. The Executor agent can call refund_user(id, amount) but it cannot call delete_user(id). It physically doesn't have the API key.
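One way to make the allowlist physical rather than prompt-level: the Executor can only reach the outside world through a tool registry, and anything unregistered raises. The registry pattern here is a sketch (the tool names mirror the ones above); the same effect can come from scoped API keys.

```python
def refund_user(user_id: str, amount: float) -> str:
    return f"refunded {user_id} ${amount:.2f}"

def delete_user(user_id: str) -> str:  # exists in the codebase...
    return f"deleted {user_id}"

EXECUTOR_TOOLS = {"refund_user": refund_user}  # ...but is never registered

def call_tool(name: str, **kwargs) -> str:
    """The only entry point the Executor agent has to the outside world."""
    if name not in EXECUTOR_TOOLS:
        raise PermissionError(f"tool '{name}' is not on the Executor allowlist")
    return EXECUTOR_TOOLS[name](**kwargs)

print(call_tool("refund_user", user_id="u_42", amount=99.0))
```

The prompt can beg for `delete_user` all it wants; the dispatcher has no route to it.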

We also treat logs like a crime scene. Every decision, every tool call, every piece of retrieved context is saved. When—not if—the agent messes up, we can replay the tape. We can see exactly why it thought the user was a premium subscriber. We can patch the logic, add a test case, and redeploy.
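Treating logs as a replayable record can be as simple as appending one structured JSON line per decision. The record shape below is illustrative; the point is that every step is immutable and machine-readable, so "replaying the tape" is just parsing the log back.

```python
import json
import time

AUDIT_LOG: list[str] = []

def log_step(agent: str, action: str, payload: dict) -> None:
    """Append one immutable record per decision, tool call, or retrieval."""
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(),
        "agent": agent,
        "action": action,
        "payload": payload,
    }))

log_step("research", "retrieved_doc", {"source": "policy.md#refunds", "score": 0.92})
log_step("executor", "tool_call", {"tool": "refund_user", "amount": 99.0})

# Replaying the tape: reconstruct exactly what the system saw and did.
replay = [json.loads(line) for line in AUDIT_LOG]
print([record["action"] for record in replay])
```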

Building Your Army

You don't need to hire 50 engineers to do this. You just need to stop thinking about "AI" as a magic box.

Start with one workflow. Maybe it's invoice processing. Maybe it's triaging support tickets.

Don't write a better prompt. Design a better org chart. Define the roles. Limit their access. Force them to show their work.

That is how you move from "playing with AI" to shipping value.

