Building AI Agents That Actually Work in Production
Everyone is building AI agents. Most of them break within a week of going live. The demo looks flawless — the agent browses the web, writes code, sends emails, calls APIs. Then real users hit it and it loops forever, bills you $400 in a single afternoon, or confidently takes an action that can't be undone.
We've shipped agentic systems into production across fintech, HR automation, and e-commerce. Here's what we've actually learned — not what the framework README tells you.
An agent is only as reliable as its ability to fail gracefully. Build the failure paths before you build the happy path.
1. The architecture decision that matters most: single-agent vs. multi-agent
The first question isn't "which LLM?" It's "does this need one agent or many?" Single-agent systems are easier to debug, have lower latency, and are cheaper. Multi-agent systems make sense when:
- Tasks are genuinely parallelisable (research + writing + fact-checking at the same time)
- Different sub-tasks require different tool access or permission scopes
- You need a critic/verifier pattern — one agent produces output, another validates it
- Context window limits make a single long-running agent impractical
If you can't clearly articulate why you need multiple agents, use one. Coordinating multiple agents carries significant overhead and adds failure surface.
2. Tool design is where most agents fail
Agents are only as good as the tools you give them. In our experience, poorly designed tools cause more production failures than the model itself does. Rules we follow:
Every tool must be idempotent
If an agent calls a tool twice with the same inputs, the second call should return the same result without causing any additional side effects. If your sendEmail tool isn't idempotent, a retry loop will spam your users. Use idempotency keys on every tool that writes.
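One way to picture an idempotency key is a minimal Python sketch like the one below. Everything in it is an assumption made for illustration: the `IdempotentEmailSender` wrapper, the `fake_send` delivery function, and the in-memory dictionary, which stands in for whatever durable store a real system would use.

```python
import hashlib
import json


class IdempotentEmailSender:
    """Wraps a send function so repeated calls with the same inputs send once."""

    def __init__(self, send_fn):
        self._send_fn = send_fn  # the real delivery function (hypothetical here)
        self._results = {}       # idempotency key -> result of the first call

    @staticmethod
    def make_key(to, subject, body):
        # Derive a stable key from the inputs; a caller-supplied key works too.
        payload = json.dumps(
            {"to": to, "subject": subject, "body": body}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def send_email(self, to, subject, body):
        key = self.make_key(to, subject, body)
        if key in self._results:
            # Retry or duplicate call: return the stored result, send nothing.
            return self._results[key]
        result = self._send_fn(to, subject, body)
        self._results[key] = result
        return result


# Demo with a fake delivery function that records what it "sent".
sent = []


def fake_send(to, subject, body):
    sent.append((to, subject))
    return {"status": "sent", "message_id": len(sent)}


sender = IdempotentEmailSender(fake_send)
first = sender.send_email("a@example.com", "Follow-up", "Thanks for the call.")
second = sender.send_email("a@example.com", "Follow-up", "Thanks for the call.")
```

Here the second call returns the stored result and `fake_send` runs only once. A production version would also need the key store to survive restarts and to be shared across workers, which this sketch does not attempt.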
Separate read tools from write tools
Read tools (search, fetch, list) should never have side effects. Write tools (create, update, delete, send) should require explicit confirmation in the agent's reasoning before calling. This single rule prevents most catastrophic agent mistakes.
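The read/write split can be enforced in code rather than left to the prompt. The sketch below is one possible shape, not a prescribed one: `ToolRegistry`, the `writes` flag, the `confirmed` argument, and the demo tools are all hypothetical names, and it assumes the agent loop sets `confirmed=True` only after the model has explicitly stated the write it intends to make.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class ConfirmationRequired(Exception):
    """Raised when a write tool is called without explicit confirmation."""


@dataclass(frozen=True)
class Tool:
    name: str
    fn: Callable[..., Any]
    writes: bool  # True for create/update/delete/send, False for search/fetch/list


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, name: str, fn: Callable[..., Any], writes: bool) -> None:
        self._tools[name] = Tool(name=name, fn=fn, writes=writes)

    def call(self, name: str, confirmed: bool = False, **kwargs: Any) -> Any:
        tool = self._tools[name]
        if tool.writes and not confirmed:
            # Refuse the side effect until the agent loop has seen an explicit
            # statement of intent for this call.
            raise ConfirmationRequired(f"{name} is a write tool and needs confirmation")
        return tool.fn(**kwargs)


# Hypothetical in-memory data and tools for the demo.
orders = {"o1": "shipped", "o2": "pending"}

registry = ToolRegistry()
registry.register("list_orders", lambda: sorted(orders), writes=False)
registry.register("delete_order", lambda order_id: orders.pop(order_id), writes=True)
```

Calling `registry.call("delete_order", order_id="o2")` raises `ConfirmationRequired` and leaves the data untouched, while `registry.call("list_orders")` runs freely. Putting the gate in the registry means a badly worded prompt cannot bypass it.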
Tool descriptions are prompts
The description you write for each tool is part of the model's context. Be specific about edge cases, what the tool can't do, and what to do when the tool returns an error. Vague descriptions cause the model to guess — and guess wrong.
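As an illustration, a tool definition that treats its description as a prompt might look like the following. The tool name, parameter names, and error codes are invented for this example, and the dictionary layout is a generic JSON-Schema-style shape rather than any particular vendor's format.

```python
# Hypothetical tool definition; every name and error code here is illustrative.
SEND_FOLLOW_UP_EMAIL_TOOL = {
    "name": "send_follow_up_email",
    "description": (
        "Send one follow-up email to a single contact after a sales call. "
        "Cannot send attachments and cannot email more than one recipient. "
        "If the contact has no email address on file, the tool returns "
        "{'error': 'no_email'}: do not retry, report the problem to the user. "
        "If the tool returns {'error': 'rate_limited'}, stop and ask the user "
        "before trying again."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "contact_id": {
                "type": "string",
                "description": "ID of the contact, as returned by a lookup tool.",
            },
            "subject": {
                "type": "string",
                "description": "Subject line, plain text.",
            },
            "body": {
                "type": "string",
                "description": "Email body, plain text only.",
            },
        },
        "required": ["contact_id", "subject", "body"],
    },
}
```

The description covers the three things listed above: what the tool does, what it cannot do, and what the model should do for each error it may see.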
3. Loop detection and cost controls are non-negotiable
An agent that loops is an agent that bills you infinitely. Every agentic system needs hard stops:
- Max steps — never more than N tool calls per run. We default to 25 for complex tasks, 10 for simple ones.
- Max tokens per run — cumulative token budget across all steps. Alert at 80%, hard stop at 100%.
- Cycle detection — if the agent calls the same tool with the same inputs twice in a row, stop and return a graceful failure.
- Wall-clock timeout — no agent run should exceed N minutes. Set this at the infrastructure level, not just in code.
Always ask: what happens if this agent runs for 10x longer than expected? If the answer is "we get a very large bill," you haven't built the cost controls yet.
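The four hard stops can live in one small guard object that the agent loop consults before every tool call. This is a sketch under stated assumptions: `RunBudget` and `BudgetExceeded` are hypothetical names, the 100,000-token and 300-second defaults are placeholders rather than recommendations, and the in-code timeout is meant to complement an infrastructure-level timeout, not replace it.

```python
import json
import time


class BudgetExceeded(Exception):
    """Raised when a run hits one of its hard stops."""


class RunBudget:
    def __init__(self, max_steps=25, max_tokens=100_000, max_seconds=300.0,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_tokens = max_tokens    # placeholder value; tune per workload
        self.max_seconds = max_seconds  # placeholder value
        self._clock = clock
        self._started = clock()
        self._last_call = None
        self.steps = 0
        self.tokens = 0
        self.warnings = []

    def before_tool_call(self, tool_name, tool_args, tokens_used):
        """Check every limit before a tool call runs; raise on a hard stop."""
        self.steps += 1
        self.tokens += tokens_used

        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} reached")

        if self.tokens >= self.max_tokens:
            raise BudgetExceeded(f"token budget of {self.max_tokens} exhausted")
        if self.tokens >= 0.8 * self.max_tokens and not self.warnings:
            # Soft alert at 80% of the budget; recorded once per run.
            self.warnings.append(f"80% of token budget used ({self.tokens})")

        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded(f"wall-clock limit of {self.max_seconds}s exceeded")

        # Cycle detection: same tool with the same inputs twice in a row.
        call = (tool_name, json.dumps(tool_args, sort_keys=True))
        if call == self._last_call:
            raise BudgetExceeded(f"repeated call to {tool_name} with identical inputs")
        self._last_call = call


# Demo: the second identical call is stopped before it runs.
budget = RunBudget(max_steps=10, max_tokens=1_000)
budget.before_tool_call("search", {"q": "refund policy"}, tokens_used=100)
try:
    budget.before_tool_call("search", {"q": "refund policy"}, tokens_used=100)
    cycle_stopped = False
except BudgetExceeded:
    cycle_stopped = True
```

The agent loop would catch `BudgetExceeded` and return a graceful failure message instead of continuing. The `clock` parameter exists only so the timeout can be tested without waiting.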
4. Observability for agents is different from observability for APIs
Traditional API monitoring (latency, error rate, status codes) tells you almost nothing useful about an agent. You need trace-level visibility into every step:
- The full input and output of every tool call
- The model's reasoning at each step (if using chain-of-thought or extended thinking)
- Total tokens and cost per run, broken down by step
- Where in the step sequence failures or retries occurred
- The final output and whether a human reviewed or corrected it
We use LangSmith or a custom trace store. The key insight: when an agent produces a bad output, you need to replay the trace and see exactly where it went wrong. Without step-level logging you're debugging in the dark.
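For the custom route, a step-level trace can start as a plain data structure that captures the items listed above. The sketch below is a hypothetical minimal shape (`RunTrace`, `StepRecord`, and their fields are invented for illustration) and is not a description of how LangSmith or any other product stores traces.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, List, Optional


@dataclass
class StepRecord:
    index: int
    tool: str
    tool_input: dict
    tool_output: Any
    reasoning: str            # the model's reasoning for this step, if available
    tokens: int
    cost_usd: float
    error: Optional[str] = None


@dataclass
class RunTrace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: List[StepRecord] = field(default_factory=list)
    final_output: Any = None
    human_reviewed: bool = False

    def record(self, tool, tool_input, tool_output, reasoning="",
               tokens=0, cost_usd=0.0, error=None):
        # Append one step; the index preserves the order of the run.
        self.steps.append(StepRecord(len(self.steps), tool, tool_input,
                                     tool_output, reasoning, tokens,
                                     cost_usd, error))

    def totals(self):
        return {
            "tokens": sum(s.tokens for s in self.steps),
            "cost_usd": sum(s.cost_usd for s in self.steps),
        }

    def first_failure(self):
        # The earliest step that recorded an error, or None if the run was clean.
        return next((s for s in self.steps if s.error), None)

    def to_json(self):
        # Serialise the whole run so it can be stored and replayed later.
        return json.dumps(asdict(self), default=str)


trace = RunTrace()
trace.record("search", {"q": "invoice 1042"}, ["doc-7"],
             reasoning="Need the invoice before replying.", tokens=320, cost_usd=0.001)
trace.record("fetch", {"id": "doc-7"}, None,
             reasoning="Open the matching document.", tokens=210, cost_usd=0.001,
             error="404 not found")
trace.final_output = "Could not find the invoice."
```

With this in place, `trace.first_failure()` points at the step where the run went wrong, and `trace.to_json()` gives a record that can be stored and inspected after the fact.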
5. Human-in-the-loop is an architectural pattern, not a UX afterthought
For any agent that takes irreversible actions — sending emails, making payments, deleting data, publishing content — design a human approval step into the architecture from the start. Not as a patch, but as a first-class component.
The pattern looks like: agent generates an action plan → system persists the plan → human reviews and approves → agent executes. This gives you the speed of automation with the safety of human oversight. The approval step can be progressively removed as confidence in the agent grows.
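The plan, persist, approve, execute sequence maps naturally onto a small state machine. The following is a minimal sketch with hypothetical names (`ApprovalQueue`, `ActionPlan`, `PlanStatus`), and the in-memory dictionary stands in for a durable store such as a database table.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, Optional


class PlanStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


@dataclass
class ActionPlan:
    action: str
    params: Dict[str, Any]
    plan_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: PlanStatus = PlanStatus.PENDING
    reviewer: Optional[str] = None


class ApprovalQueue:
    def __init__(self) -> None:
        self._plans: Dict[str, ActionPlan] = {}  # stand-in for a durable store

    def submit(self, action: str, params: Dict[str, Any]) -> ActionPlan:
        # Step 1 and 2: the agent proposes a plan and the system persists it.
        plan = ActionPlan(action=action, params=params)
        self._plans[plan.plan_id] = plan
        return plan

    def review(self, plan_id: str, reviewer: str, approve: bool) -> None:
        # Step 3: a human approves or rejects the pending plan.
        plan = self._plans[plan_id]
        if plan.status is not PlanStatus.PENDING:
            raise ValueError("plan has already been reviewed")
        plan.status = PlanStatus.APPROVED if approve else PlanStatus.REJECTED
        plan.reviewer = reviewer

    def execute(self, plan_id: str, executor: Callable[[str, Dict[str, Any]], Any]) -> Any:
        # Step 4: only an approved plan may run, and it may run only once.
        plan = self._plans[plan_id]
        if plan.status is not PlanStatus.APPROVED:
            raise PermissionError(f"plan is {plan.status.value}, not approved")
        result = executor(plan.action, plan.params)
        plan.status = PlanStatus.EXECUTED
        return result


# Demo executor that records what it was asked to do.
executed = []


def run_action(action, params):
    executed.append((action, params["to"]))
    return "ok"


queue = ApprovalQueue()
plan = queue.submit("send_email", {"to": "a@example.com"})
```

Calling `queue.execute(plan.plan_id, run_action)` before a review raises `PermissionError`. After `queue.review(plan.plan_id, "reviewer", approve=True)` the same call runs once, and a second attempt is refused because the plan has moved to `EXECUTED`.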
6. Start narrower than you think you should
The most reliable agents we've shipped do one thing very well. The least reliable do many things loosely. Define the exact scope of the first version before you build. "An agent that drafts and sends follow-up emails after sales calls" is a product. "An AI assistant that helps with sales" is a hope.
Expand scope after the first version is stable in production — not before. Each new tool you add multiplies the number of states the agent can reach. Add one tool at a time, re-evaluate reliability, then add the next.
The agents that users trust are the ones that do exactly what they claim, every time. Breadth is the enemy of reliability. Pick a narrow use case and own it completely.