I once watched a "perfect" support agent promise a full refund to a customer who wasn't eligible.
The logic was sound. The tone was empathetic. Even the reasoning trace looked impeccable. But the retrieval step had pulled a policy document from 2022 that we thought was archived.
The model performed exactly as designed. The system failed. We deployed a probabilistic component without the deterministic guardrails required for finance.
If you are seeing production failures, the temptation is to blame the model's reasoning capabilities. You want to upgrade to a larger model or tweak the prompt. Yet 85% of failures stem from system design flaws.
The Gap Between Demo and Reality
Demos are sterile. You have clean data and expected user inputs. Latency is non-existent.
Production is a street fight.
Users paste a 500-line error log and demand to know why it's broken. Meanwhile, APIs time out or retrieval systems return 15 documents that contradict each other.
Deploying agents without accounting for this mess is gambling. We bet that a probabilistic model will align with deterministic business rules. The house usually wins.
A Stricter System
A stricter system beats a smarter model.
1. Evals are unit tests
Most teams judge their agents by feel. They chat with it for 20 minutes, see that it works, and ship it.
This fails. You wouldn't ship code without unit tests. Shipping non-deterministic agents without evaluation suites is worse.
Build a dataset of 50 real inputs—messy ones, with typos and ambiguity. Define the "pass" criteria for each. Run your agent against this set every single time you change a prompt or a tool. If success drops from 90% to 88%, the deploy stops.
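The harness above can be sketched in a few lines. This is a minimal illustration, not a framework: `run_agent`, the case inputs, and the pass criteria are all stand-ins you would replace with your own agent call and real user messages.

```python
# Minimal eval-suite sketch. `stub_agent` stands in for your real agent
# invocation; the cases and pass criteria are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    user_input: str                    # a real, messy input (typos included)
    passes: Callable[[str], bool]      # the "pass" criterion for this case

def run_suite(agent: Callable[[str], str], cases: list[EvalCase],
              baseline: float = 0.90) -> bool:
    """Run every case; return False (block the deploy) below baseline."""
    passed = sum(1 for c in cases if c.passes(agent(c.user_input)))
    rate = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({rate:.0%})")
    return rate >= baseline

# Two illustrative cases with typos and ambiguity, run against a stub.
cases = [
    EvalCase("wheres my refnd??", lambda out: "refund" in out.lower()),
    EvalCase("cancel it", lambda out: "which order" in out.lower()),
]
stub_agent = lambda text: "Which order do you mean? I can check your refund."
deploy_ok = run_suite(stub_agent, cases)
```

Wire `run_suite` into CI so a prompt change that drops the rate below the baseline fails the build, the same way a broken unit test would.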
2. The Critic Pattern
Asking an agent to "double-check your work" in the same prompt rarely works. It often hallucinates a confirmation of its own error.
Implement a separate "critic" step. This is a second LLM call that sees the first agent's proposed action and the original policy. Its only job is to find violations.
I prompt the critic with a specific persona: "You are a compliance officer. Your job is to find reasons why this action is illegal or against policy. If you find none, output PASS." This catches errors before the user ever sees them.
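Structurally, the critic is just a second call with a narrow contract: it sees only the policy and the proposed action, and anything other than PASS blocks the action. A sketch, with the model call stubbed out (swap `call_llm` for your actual provider client):

```python
# Critic-step sketch. `call_llm` is a placeholder for whatever model
# client you use; here it is stubbed so the sketch runs standalone.
CRITIC_PROMPT = (
    "You are a compliance officer. Your job is to find reasons why this "
    "action is illegal or against policy. If you find none, output PASS.\n\n"
    "Policy:\n{policy}\n\nProposed action:\n{action}"
)

def call_llm(prompt: str) -> str:
    # Stub: a real implementation calls your model provider here.
    return "VIOLATION: refund policy cited is from 2022 and is archived."

def critic_approves(policy: str, proposed_action: str) -> bool:
    """Allow the action only if the critic outputs PASS; otherwise block."""
    verdict = call_llm(CRITIC_PROMPT.format(policy=policy,
                                            action=proposed_action))
    return verdict.strip().upper().startswith("PASS")

blocked = not critic_approves("Refunds require purchase within 30 days.",
                              "Issue full refund per 2022 policy.")
```

The key design choice is that the critic never sees the first agent's reasoning, only the proposed action and the policy, so it cannot be talked into the same mistake.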
3. Track Business Outcomes
Dashboards showing latency and error rates are useless for agents. An agent can successfully return a 200 OK response that completely destroys customer trust.
Track the result. Did the user accept the solution? Did they open a new ticket 10 minutes later? Did the refund actually process in Stripe?
Silence is the worst failure mode. A zombie agent keeps chatting but stops calling tools. It looks alive, but it does nothing.
Reliability is a Constraint Problem
That refund agent never made the same mistake again. We kept the same model and prompt. We added a critic step that explicitly checked policy dates against the current date.
Reliability comes from better constraints.
Sources
- Forbes: 5 AI mistakes that could kill your business in 2025 — Cites Gartner’s AI initiative failure rate.
- ITBench (arXiv): Benchmarking LLM agents for real IT tasks — Reports low task resolution rates for SRE/CISO/FinOps scenarios.
- PYMNTS: AI agents rise, readiness questions remain — Summarizes agent readiness concerns.
- Gradient Flow: 10 things to know about the state of AI agents — Practical notes on debugging and maintenance at scale.
- Shelf.io: The #1 barrier to AI agent success — Data quality and hallucination risk framing.
- TalkToAgent: AI agent deployment pitfalls — Common governance and deployment failure modes.
I reply to all emails if you want to chat:
