Overview
Agent evaluation produces a trustworthy answer to “is this version better than the last one?” Without it, every prompt change is a guess and every regression ships quietly.
Build the golden set before the model
Write 30 to 100 high-quality test inputs with verified outputs before you tune anything. The set is the contract.
- Each row has an input, the expected output (or an accept-list), and a slice tag (`easy`, `edge`, `adversarial`, `time-sensitive`).
- Cover the long tail. Half the rows should be cases that already broke in production.
- Treat the set as code. Version it, review changes in PRs, never edit a row to make a failing test pass.
Thirty rows you trust outperform three hundred rows scraped from logs.
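A golden-set row can be as small as three fields. A minimal sketch, stored as JSONL so every change is a one-line diff in a PR; the field names and example inputs here are illustrative, not a standard schema:

```python
import json

# One golden-set row: input, expected output (or accept-list), slice tag.
# Field names and contents are illustrative, not a standard schema.
rows = [
    {"id": "g-001",
     "input": "What is the refund window for digital goods?",
     "expected": ["30-day refund", "30 days"],  # accept-list: any match passes
     "slice": "easy"},
    {"id": "g-002",
     "input": "Refund for an order placed on Feb 30, 2023?",
     "expected": ["Feb 30 does not exist"],
     "slice": "adversarial"},
]

# JSONL keeps diffs line-per-row, so review happens row by row.
with open("golden.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```

Keeping one row per line is what makes “review changes in PRs” practical: an edited expectation shows up as a single changed line.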
Score with LLM-as-judge, then audit the judge
Use a second model as the judge for open-ended outputs, then check the judge against humans on a sample.
- Pick a stronger judge than the model under test. A weak judge agreeing with itself is not a signal.
- Score on a rubric, not a vibe. “Does the answer cite the right source? yes/no.” “Is any factual claim wrong? yes/no.”
- Sample 30 to 50 rows per release for human rating. Compute simple agreement or Cohen’s kappa. If judge-human agreement drops below 0.7, retire the judge or rewrite the rubric.
The judge drifts too. See structured-output for the patterns that keep verdicts parseable.
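For binary yes/no verdicts, Cohen’s kappa needs no library. A minimal sketch, assuming judge and human verdicts are encoded as 1 (pass) and 0 (fail) over the same sampled rows:

```python
def cohens_kappa(judge, human):
    """Cohen's kappa for two binary raters (1 = pass, 0 = fail)."""
    assert len(judge) == len(human)
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement by chance, from each rater's marginal pass rate.
    p_j, p_h = sum(judge) / n, sum(human) / n
    expected = p_j * p_h + (1 - p_j) * (1 - p_h)
    return (observed - expected) / (1 - expected)

# Illustrative sample of 8 audited rows.
judge = [1, 1, 0, 1, 0, 1, 1, 0]
human = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohens_kappa(judge, human)
if kappa < 0.7:
    print(f"kappa={kappa:.2f}: retire the judge or rewrite the rubric")
```

Kappa punishes a judge that agrees only as often as chance would predict, which raw percent agreement hides when most rows pass.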
Never ship without a baseline
Before any prompt change, run the current production version against the full golden set and record its score. That number is the bar.
- Ship only if the new version beats the baseline on the aggregate score and does not regress on any named slice by more than a small margin.
- Tied scores ship the simpler prompt, the cheaper model, or the faster path.
Two runs of the same prompt vary by a few points; treat anything inside that envelope as a tie.
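One way to put a number on that envelope is to bootstrap the aggregate score over the golden set. This is a rough proxy, assuming per-row 0/1 scores; it captures sampling noise over the rows, while true run-to-run variance is best measured by actually rerunning the suite:

```python
import random

def tie_envelope(row_scores, n_boot=2000, seed=0):
    """Bootstrap the aggregate score to estimate its noise.
    Score differences inside ~2 sigma are treated as a tie."""
    rng = random.Random(seed)
    n = len(row_scores)
    means = [sum(rng.choice(row_scores) for _ in range(n)) / n
             for _ in range(n_boot)]
    mean = sum(means) / n_boot
    std = (sum((m - mean) ** 2 for m in means) / n_boot) ** 0.5
    return 2 * std

# Illustrative 50-row golden set where the baseline passes 40 rows.
scores = [1] * 40 + [0] * 10
print(f"tie if |new - baseline| < {tie_envelope(scores):.3f}")
```

With 50 rows the envelope lands around 10 points, which is exactly why thirty trusted rows beat vibes but a few hundred rows sharpen the verdict.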
Split retrieval eval from generation eval in RAG
In a RAG system, a wrong answer can mean retrieval missed the chunk or generation ignored it. Score them separately.
- Retrieval metric: `recall@k` against gold chunk IDs. If recall@5 is 0.4, no prompt edit will save you. See rag.
- Generation metric: correctness of the final answer given the retrieved context, judged against the gold answer.
One combined number hides which half is broken.
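The retrieval half reduces to a few lines. A minimal sketch, assuming each golden row stores the IDs of the chunks that actually contain the answer (the chunk IDs below are hypothetical):

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold chunk IDs that appear in the top-k retrieved."""
    top_k = set(retrieved_ids[:k])
    return sum(1 for g in gold_ids if g in top_k) / len(gold_ids)

# One eval row: gold chunks c12 and c90; retrieval surfaced only c12.
print(recall_at_k(["c12", "c3", "c7", "c41", "c5"], ["c12", "c90"], k=5))  # 0.5
```

Averaged over the golden set, this number tells you whether to fix the retriever or the prompt before you touch either.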
Use pass@k for tasks with many valid outputs
When code, queries, or plans have multiple acceptable answers, score with pass@k: run k samples, count the task as passed if any one is correct.
`pass@1` measures the best-of-one experience the user sees. `pass@5` measures the headroom a reranker or retry buys you.
A gap between the two tells you whether to invest in better sampling or better prompts.
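When you draw n samples per task and c of them pass, the standard unbiased estimator (introduced with the HumanEval code benchmark) avoids the variance of literally running k-sized batches:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per task, 3 correct:
p1 = pass_at_k(10, 3, 1)  # 0.3
p5 = pass_at_k(10, 3, 5)  # ~0.92: large headroom for a reranker or retry
```

Here the 0.3 → 0.92 gap says sampling plus selection would pay off far more than another prompt edit.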
Run the golden set on every prompt change
Eval-driven iteration is the only way to keep gains. Wire the suite into CI; block merges on regressions in named slices.
prompt change → run eval suite → diff vs baseline → block if any slice regresses by > N points or aggregate score drops

Treat prompts like code. A diff without an eval result is not reviewable.
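The block-on-regression step can be sketched as a small gate function. A minimal sketch, assuming per-slice scores in points and an illustrative 2-point margin; in CI the script would exit nonzero when `failures` is non-empty:

```python
MARGIN = 2.0  # max allowed per-slice drop, in points (illustrative)

def gate(baseline, candidate, margin=MARGIN):
    """Return the list of regressions; empty list means the change ships."""
    failures = []
    for slice_name, base_score in baseline.items():
        drop = base_score - candidate.get(slice_name, 0.0)
        if drop > margin:
            failures.append(f"{slice_name}: -{drop:.1f} pts")
    base_agg = sum(baseline.values()) / len(baseline)
    cand_agg = sum(candidate.values()) / len(candidate)
    if cand_agg < base_agg:
        failures.append(f"aggregate: {cand_agg:.1f} < {base_agg:.1f}")
    return failures

# Hypothetical release: better on easy and edge, much worse on adversarial.
baseline = {"easy": 96.0, "edge": 81.0, "adversarial": 74.0}
candidate = {"easy": 97.0, "edge": 83.0, "adversarial": 63.0}
for failure in gate(baseline, candidate):
    print("BLOCKED:", failure)
```

The example is deliberately the trap described below: the candidate wins two slices and still gets blocked, because the adversarial slice regressed past the margin.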
Track a dashboard, not a single number
An aggregate score hides regressions on minority slices. Publish per-slice scores, cost, and latency on a regression dashboard.
- Columns: version, slice, accuracy, judge-agreement, p50 latency, p95 latency, cost per task.
- One row per release, one section per slice.
A model 2 points better on average but 10 points worse on adversarial is a step backward.
Vibes-based shipping is a debt trap
“It feels smarter” is not a release criterion. Shipping on vibes accumulates silent regressions no one can localize once they pile up. If you cannot show the eval diff, you do not ship. See claude-code for the acceptance-criteria pattern and general-principles for the same rule applied to code.
Score cost and latency alongside quality
Cost per task and p95 latency are part of the eval. A version 1 point more accurate at 4x the cost usually loses. See multi-agent for token budgets that surface the trade-off per agent.