Overview

Agent evaluation produces a number you can trust for the question “is this version better than the last one?” Without it, every prompt change is a guess and every regression ships quietly.

Build the golden set before the model

Write 30 to 100 high-quality test inputs with verified outputs before you tune anything. The set is the contract.

  • Each row has an input, the expected output (or an accept-list), and a slice tag (easy, edge, adversarial, time-sensitive).
  • Cover the long tail. Half the rows should be cases that already broke in production.
  • Treat the set as code. Version it, review changes in PRs, never edit a row to make a failing test pass.

Thirty rows you trust outperform three hundred rows scraped from logs.
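
A minimal sketch of one way to store the set, assuming JSONL so rows diff cleanly in review; the field names are illustrative, not a required schema.

    # golden_set.jsonl -- one row per line, e.g.
    # {"id": "gs-001", "slice": "edge", "input": "...", "expected": "...", "accept": null}
    import json
    from dataclasses import dataclass

    @dataclass
    class GoldenRow:
        id: str
        slice: str                       # easy | edge | adversarial | time-sensitive
        input: str
        expected: str | None = None      # single verified output, or
        accept: list[str] | None = None  # an accept-list of valid outputs

    def load_golden_set(path: str) -> list[GoldenRow]:
        with open(path) as f:
            return [GoldenRow(**json.loads(line)) for line in f if line.strip()]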

Score with LLM-as-judge, then audit the judge

Use a second model as the judge for open-ended outputs, then check the judge against humans on a sample.

  • Pick a stronger judge than the model under test. A weak judge agreeing with itself is not a signal.
  • Score on a rubric, not a vibe. “Does the answer cite the right source? yes/no.” “Is any factual claim wrong? yes/no.”
  • Sample 30 to 50 rows per release for human rating. Compute simple agreement or Cohen’s kappa. If judge-human agreement drops below 0.7, retire the judge or rewrite the rubric.

The judge drifts too. See structured-output for the patterns that keep verdicts parseable.
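
A sketch of the audit in Python, assuming paired binary verdicts from the judge and a human on the sampled rows; Cohen’s kappa is computed by hand to keep the example dependency-free, and the labels below are placeholders.

    def cohen_kappa(judge: list[int], human: list[int]) -> float:
        """Agreement on binary (0/1) verdicts, corrected for chance."""
        n = len(judge)
        observed = sum(j == h for j, h in zip(judge, human)) / n
        p_judge, p_human = sum(judge) / n, sum(human) / n
        expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
        return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    # Placeholder labels; in practice they come from the 30-50 row audit sample.
    judge_labels = [1, 1, 0, 1, 0, 1, 1, 0]
    human_labels = [1, 0, 0, 1, 0, 1, 1, 1]
    kappa = cohen_kappa(judge_labels, human_labels)
    if kappa < 0.7:
        print(f"Judge-human agreement {kappa:.2f} below 0.7: rewrite the rubric or swap the judge")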

Never ship without a baseline

Before any prompt change, run the current production version against the full golden set and record its score. That number is the bar.

  • Ship only if the new version beats the baseline on the aggregate score and does not regress on any named slice by more than a small margin.
  • Tied scores ship the simpler prompt, the cheaper model, or the faster path.

Two runs of the same prompt vary by a few points; treat anything inside that envelope as a tie.
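
One possible shape for that gate, assuming per-slice scores plus an aggregate for both runs; the 2-point margins are placeholders to tune against your own noise envelope.

    def gate(baseline: dict[str, float], candidate: dict[str, float],
             noise: float = 2.0, slice_margin: float = 2.0) -> str:
        """Scores keyed by slice name plus an 'aggregate' key, in points."""
        for name, base in baseline.items():
            if name != "aggregate" and candidate.get(name, 0.0) < base - slice_margin:
                return "block"            # a named slice regressed beyond the margin
        delta = candidate["aggregate"] - baseline["aggregate"]
        if delta < -noise:
            return "block"                # aggregate dropped beyond run-to-run noise
        if delta <= noise:
            return "tie"                  # inside the envelope: ship the simpler/cheaper version
        return "ship"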

Split retrieval eval from generation eval in RAG

In a RAG system, a wrong answer can mean retrieval missed the chunk or generation ignored it. Score them separately.

  • Retrieval metric: recall@k against gold chunk IDs. If recall@5 is 0.4, no prompt edit will save you. See rag.
  • Generation metric: correctness of the final answer given the retrieved context, judged against the gold answer.

One combined number hides which half is broken.
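
A sketch of the retrieval half, assuming each golden row carries the IDs of the chunks that actually contain the answer; names are illustrative.

    def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
        """Fraction of gold chunks that appear in the top-k retrieved results."""
        if not gold_ids:
            return 1.0   # nothing to retrieve for this row; skip it or count it as a hit
        return len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)

    # Average over the golden set; if mean recall@5 sits at 0.4, fix retrieval before touching prompts.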

Use pass@k for tasks with many valid outputs

When code, queries, or plans have multiple acceptable answers, score with pass@k: run k samples, count the task as passed if any one is correct.

  • pass@1 measures the best-of-one experience the user sees.
  • pass@5 measures the headroom a reranker or retry buys you.

A gap between the two tells you whether to invest in better sampling or better prompts.
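
A sketch using the standard unbiased pass@k estimator, which lets you draw n samples per task once and estimate any k ≤ n from the count of correct samples c.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate from n samples of which c were correct."""
        if n - c < k:
            return 1.0               # every draw of k samples contains a correct one
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 10 samples per task, 3 correct.
    print(pass_at_k(10, 3, 1))   # ~0.30, the best-of-one experience the user sees
    print(pass_at_k(10, 3, 5))   # ~0.92, the headroom a reranker or retry buys you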

Run the golden set on every prompt change

Eval-driven iteration is the only way to keep gains. Wire the suite into CI; block merges on regressions in named slices.

prompt change → run eval suite → diff vs baseline →
block if any slice regresses by > N points or the aggregate score drops

Treat prompts like code. A diff without an eval result is not reviewable.
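
A sketch of that CI hook, assuming the eval suite writes per-slice scores to JSON; the file names and the N-point margin are placeholders.

    # ci_eval_gate.py -- run after the eval suite; a nonzero exit blocks the merge.
    import json
    import sys

    MARGIN = 2.0   # max allowed per-slice regression, in points

    with open("baseline_scores.json") as f:
        baseline = json.load(f)      # e.g. {"aggregate": 81.0, "edge": 74.0, "adversarial": 66.0}
    with open("candidate_scores.json") as f:
        candidate = json.load(f)

    failures = [
        f"{name}: {baseline[name]:.1f} -> {candidate.get(name, 0.0):.1f}"
        for name in baseline
        if candidate.get(name, 0.0) < baseline[name] - MARGIN
    ]
    if failures:
        print("Eval regression, blocking merge:\n  " + "\n  ".join(failures))
        sys.exit(1)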

Track a dashboard, not a single number

An aggregate score hides regressions on minority slices. Publish per-slice scores, cost, and latency on a regression dashboard.

  • Columns: version, slice, accuracy, judge-agreement, p50 latency, p95 latency, cost per task.
  • One row per release, one section per slice.

A model 2 points better on average but 10 points worse on adversarial is a step backward.
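
One way to keep those rows append-only and diffable, assuming a flat CSV with one record per version-slice pair; the schema simply mirrors the columns above.

    import csv
    import os
    from dataclasses import asdict, dataclass, fields

    @dataclass
    class DashboardRow:
        version: str
        slice: str
        accuracy: float
        judge_agreement: float
        p50_latency_s: float
        p95_latency_s: float
        cost_per_task_usd: float

    def append_rows(path: str, rows: list[DashboardRow]) -> None:
        """Append one record per (version, slice); write the header only for a new file."""
        new_file = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(DashboardRow)])
            if new_file:
                writer.writeheader()
            writer.writerows(asdict(r) for r in rows)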

Vibes-based shipping is a debt trap

“It feels smarter” is not a release criterion. Shipping on vibes accumulates silent regressions that, once they pile up, no one can localize. If you cannot show the eval diff, you do not ship. See claude-code for the acceptance-criteria pattern and general-principles for the same rule applied to code.

Score cost and latency alongside quality

Cost per task and p95 latency are part of the eval. A version 1 point more accurate at 4x the cost usually loses. See multi-agent for token budgets that surface the trade-off per agent.
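
A sketch of computing those columns from per-task records, assuming the eval run logs latency in seconds and cost in dollars for each task; the percentile uses the simple nearest-rank method.

    import math
    from statistics import median

    def p95(values: list[float]) -> float:
        """Nearest-rank 95th percentile."""
        ordered = sorted(values)
        return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

    def summarize(latencies_s: list[float], costs_usd: list[float], accuracy: float) -> dict:
        return {
            "accuracy": accuracy,
            "p50_latency_s": median(latencies_s),
            "p95_latency_s": p95(latencies_s),
            "cost_per_task_usd": sum(costs_usd) / len(costs_usd),
        }

    # Compare versions on all four numbers; a 1-point accuracy gain at 4x the cost usually loses.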