Benchmarks are not your problem
A benchmark tells you which model is smartest on a good day. Your production agent runs on its worst day, and that is the only day that costs you money.
Teams spend weeks comparing MMLU scores and GSM8K results. Then they ship an agent that refunds the wrong customer because the prompt said "process the refund" and the model chose the most recent order instead of the disputed one. The benchmark did not catch that. The benchmark could not catch that.
The problem is not the model. The problem is that you evaluated the model instead of the workflow.
Start with the job, not the model
Every agent eval starts with one question: what exactly is this agent hired to do?
Not "answer customer questions." That is a category. The job is "resolve tier-one billing disputes under fifty dollars without human review, using the transaction history and the dispute policy doc." That is a job. You can measure it.
Write the job description like a performance review. Include the inputs it receives, the tools it can call, the constraints it must honor, and what done looks like. If you cannot write it down, the agent cannot do it reliably.
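One way to make the written job description checkable is to keep it as a small structured record. The sketch below is only an illustration: the JobSpec class, its field names, the tool names, and the done_when wording are assumptions made for this example, not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """Written job description for one agent (field names are illustrative)."""

    job: str
    inputs: list[str]
    tools: list[str]
    constraints: list[str]
    done_when: str

    def missing_fields(self) -> list[str]:
        """Return the names of any fields that were left empty."""
        return [name for name, value in vars(self).items() if not value]


billing_agent = JobSpec(
    job="Resolve tier-one billing disputes under fifty dollars without human review",
    inputs=["transaction history", "dispute policy doc"],
    tools=["lookup_transaction", "issue_refund"],  # hypothetical tool names
    constraints=["refund amount under 50 USD", "refund only the disputed order"],
    done_when="dispute is resolved or escalated with a written reason",  # assumed wording
)

# An empty list means every part of the job description has been written down.
print(billing_agent.missing_fields())
```

If missing_fields() returns anything, the job is not yet written down well enough to evaluate.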
Measure the worst mistake, not the average case
Average accuracy is a vanity metric. The metric that matters is the cost of the worst mistake the agent can make in this job.
For a billing dispute agent, the worst mistake is refunding a legitimate charge. The cost is the lost revenue plus the trust damage when the customer realizes they were refunded in error. For a code deployment agent, the worst mistake is pushing broken code to production. The cost is downtime and incident response.
- List every action the agent can take that touches money, data, or user trust.
- For each action, write the worst plausible failure and its dollar or reputation cost.
- Rank them. The top three are your eval targets.
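The three steps above can be sketched as a short ranking script. Everything here is a placeholder: the action names, the failure descriptions, and especially the dollar figures are made-up values to show the shape of the exercise, not real estimates.

```python
from dataclasses import dataclass


@dataclass
class ActionRisk:
    action: str
    worst_failure: str
    estimated_cost_usd: float  # placeholder estimate, replace with your own


def top_eval_targets(risks: list[ActionRisk], n: int = 3) -> list[ActionRisk]:
    """Rank actions by the cost of their worst plausible failure, highest first."""
    return sorted(risks, key=lambda r: r.estimated_cost_usd, reverse=True)[:n]


# Hypothetical actions for a billing dispute agent; all costs are invented.
risks = [
    ActionRisk("send_status_email", "sends a confusing status update", 5.0),
    ActionRisk("issue_refund", "refunds a legitimate charge", 50.0),
    ActionRisk("close_dispute", "closes a valid dispute without resolution", 200.0),
    ActionRisk("update_billing_record", "overwrites correct billing data", 120.0),
]

for risk in top_eval_targets(risks):
    print(f"{risk.action}: {risk.worst_failure} (~${risk.estimated_cost_usd:.0f})")
```

The numbers only need to be good enough to order the list. The ranking, not the precision, decides where the eval effort goes.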
Build the eval set from real failures
Your first eval set should not be synthetic. It should be the last twenty things that went wrong in production, plus the ten things that almost went wrong but a human caught.
If you do not have production failures yet, run the agent in shadow mode for two weeks. Log every decision. Have a human review the logs daily and mark the ones they would have overridden. Those are your eval cases.
Synthetic evals test what you imagined. Real failures test what actually happened.
Format each case as input, expected output, and the failure mode. Store them as JSON. Version them. This is now your regression suite.
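A minimal sketch of that case format, assuming a hypothetical EvalCase record and a suite_version field for versioning. The case content is invented to mirror the refund example from earlier in this post.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalCase:
    case_id: str
    input: dict
    expected_output: dict
    failure_mode: str


# Hypothetical case modeled on the "wrong order refunded" failure.
case = EvalCase(
    case_id="refund-wrong-order-001",
    input={
        "message": "Please process the refund for my disputed charge.",
        "disputed_order_id": "A-100",
        "most_recent_order_id": "A-207",
    },
    expected_output={"action": "issue_refund", "order_id": "A-100"},
    failure_mode="refunded the most recent order instead of the disputed one",
)

# Serialize the suite as JSON with an explicit version so changes can be tracked.
suite = {"suite_version": 1, "cases": [asdict(case)]}
serialized = json.dumps(suite, indent=2, sort_keys=True)

# Loading it back yields the same cases, which is what a regression suite needs.
restored = [EvalCase(**raw) for raw in json.loads(serialized)["cases"]]
print(restored[0].failure_mode)
```

Keeping the file in the same repository as the prompt is one simple way to version the two together.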
Instrument every loop, not just the output
An agent that succeeds for the wrong reason will fail for a new reason tomorrow. You need to see the reasoning, not just the result.
Log the prompt sent to the model, the tool calls made, the intermediate observations, and the final decision. Use structured JSON so you can query it later. Tag each run with the eval case ID, the model version, and the prompt version.
- Trace ID: correlates every step of one agent run.
- Decision rationale: the model's own explanation for why it chose this action.
- Tool arguments and results: what it tried to do and what came back.
- Human override flag: did a reviewer intervene?
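The fields above could be captured with a small logging helper like the one below. This is a sketch using only the Python standard library. The function name, the record keys, and the example values are assumptions, and a real agent loop would write each line to a log file or log pipeline rather than print it.

```python
import json
import uuid
from datetime import datetime, timezone


def make_step_record(
    trace_id: str,
    *,
    eval_case_id: str,
    model_version: str,
    prompt_version: str,
    prompt: str,
    tool_name: str,
    tool_args: dict,
    tool_result: dict,
    rationale: str,
    human_override: bool = False,
) -> str:
    """Build one JSON log line describing a single step of an agent run."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,  # correlates every step of one run
        "eval_case_id": eval_case_id,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "prompt": prompt,
        "tool": {"name": tool_name, "args": tool_args, "result": tool_result},
        "decision_rationale": rationale,  # the model's own explanation
        "human_override": human_override,
    }
    return json.dumps(record, sort_keys=True)


trace_id = str(uuid.uuid4())
line = make_step_record(
    trace_id,
    eval_case_id="refund-wrong-order-001",  # hypothetical case ID
    model_version="model-2024-01",  # placeholder version labels
    prompt_version="v7",
    prompt="Process the refund for the disputed order.",
    tool_name="issue_refund",  # hypothetical tool
    tool_args={"order_id": "A-100"},
    tool_result={"status": "ok"},
    rationale="Order A-100 matches the disputed charge.",
)
print(line)
```

One JSON object per line keeps the log easy to filter by trace_id, eval_case_id, or prompt_version later.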
Close the loop or the eval rots
An eval that does not feed back into the prompt is a dashboard nobody looks at.
Every failure in the eval set gets a ticket. The fix is one of four things: a prompt change, a tool addition, a constraint clarification, or a human-in-the-loop gate. When the fix ships, the eval case moves from failing to passing and the regression suite grows.
Run the full eval suite on every prompt change. Run a sampled subset on every model upgrade. The pass rate is your confidence signal. When it drops, stop expanding the agent's authority.
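A rough sketch of the pass-rate check, assuming each case is a dict with input and expected_output keys and that the agent can be called as a plain function. The stub_agent below is a stand-in for a real agent, and the exact-match comparison is a simplification that many real suites would replace with a more forgiving grader.

```python
from typing import Callable


def pass_rate(cases: list[dict], agent: Callable[[dict], dict]) -> float:
    """Fraction of eval cases where the agent's output matches the expected output."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for c in cases if agent(c["input"]) == c["expected_output"])
    return passed / len(cases)


def may_expand_authority(current_rate: float, previous_rate: float) -> bool:
    """Gate: only consider widening the agent's authority if the pass rate did not drop."""
    return current_rate >= previous_rate


def stub_agent(agent_input: dict) -> dict:
    """Stand-in for the real agent: always refunds the disputed order."""
    return {"action": "issue_refund", "order_id": agent_input["disputed_order_id"]}


# Two invented cases: the stub gets the first right and the second wrong.
cases = [
    {
        "input": {"disputed_order_id": "A-100"},
        "expected_output": {"action": "issue_refund", "order_id": "A-100"},
    },
    {
        "input": {"disputed_order_id": "A-200"},
        "expected_output": {"action": "escalate", "order_id": "A-200"},
    },
]

rate = pass_rate(cases, stub_agent)
print(f"pass rate: {rate:.0%}")
print("expand authority:", may_expand_authority(rate, previous_rate=0.9))
```

Comparing against the previous run rather than a fixed threshold is one possible design. A fixed floor works too, as long as a drop blocks any expansion of what the agent is allowed to do.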
Build the floor while the ceiling rises. The eval framework is the floor. It is the part that does not go obsolete when the next model drops.
Tags for AI Agents
- how to evaluate AI agents
- AI agent evaluation production
- AI agent testing framework
- production AI monitoring
- agent observability
- AI agent failure modes
- Josh Bocanegra
FAQ
How is evaluating an AI agent different from evaluating a model?
Model evaluation measures capability on static benchmarks. Agent evaluation measures whether a specific workflow completes correctly in production, including tool use, decision logic, error handling, and recovery. The agent fails in ways the model benchmark never sees.
Do I need a dedicated eval engineer to do this?
No. You need a developer who owns the agent and a human who knows the job. Start with a spreadsheet of real failures, a logging wrapper around the agent loop, and a weekly review cadence. The infrastructure can grow. The discipline cannot wait.
How many eval cases do I need before I can trust the agent?
There is no magic number. You need coverage of your top three worst-case mistakes, plus a representative sample of the happy path. Start with thirty real cases. Add five every time something surprising happens in production. Trust grows with the feedback loop, not the case count.


