Overview

Agent evaluation produces a number you can trust for the question “is this version better than the last one?” Without it, every prompt change is a guess and every regression ships quietly.

Build the golden set before the model

Write 30 to 100 high-quality test inputs with verified outputs before you tune anything. The set is the contract.

  • Each row has an input, the expected output (or an accept-list), and a slice tag (easy, edge, adversarial, time-sensitive).
  • Cover the long tail. Half the rows should be cases that already broke in production.
  • Treat the set as code. Version it, review changes in PRs, never edit a row to make a failing test pass.

Thirty rows you trust outperform three hundred rows scraped from logs.
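
A minimal sketch of one way to store the set, assuming JSONL so rows diff cleanly in review; the field names are illustrative, not a required schema.

    # golden_set.jsonl -- one row per line, e.g.
    # {"id": "gs-001", "slice": "edge", "input": "...", "expected": "...", "accept": null}
    import json
    from dataclasses import dataclass

    @dataclass
    class GoldenRow:
        id: str
        slice: str                       # easy | edge | adversarial | time-sensitive
        input: str
        expected: str | None = None      # single verified output, or
        accept: list[str] | None = None  # an accept-list of valid outputs

    def load_golden_set(path: str) -> list[GoldenRow]:
        with open(path) as f:
            return [GoldenRow(**json.loads(line)) for line in f if line.strip()]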

Score with LLM-as-judge, then audit the judge

Use a second model as the judge for open-ended outputs, then check the judge against humans on a sample.

  • Pick a stronger judge than the model under test. A weak judge agreeing with itself is not a signal.
  • Score on a rubric, not a vibe. “Does the answer cite the right source? yes/no.” “Is any factual claim wrong? yes/no.”
  • Sample 30 to 50 rows per release for human rating. Compute simple agreement or Cohen’s kappa. If judge-human agreement drops below 0.7, retire the judge or rewrite the rubric.

The judge drifts too. See structured-output for the patterns that keep verdicts parseable.
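
A sketch of the audit in Python, assuming paired binary verdicts from the judge and a human on the sampled rows; Cohen’s kappa is computed by hand to keep the example dependency-free, and the labels below are placeholders.

    def cohen_kappa(judge: list[int], human: list[int]) -> float:
        """Agreement on binary (0/1) verdicts, corrected for chance."""
        n = len(judge)
        observed = sum(j == h for j, h in zip(judge, human)) / n
        p_judge, p_human = sum(judge) / n, sum(human) / n
        expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
        return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    # Placeholder labels; in practice they come from the 30-50 row audit sample.
    judge_labels = [1, 1, 0, 1, 0, 1, 1, 0]
    human_labels = [1, 0, 0, 1, 0, 1, 1, 1]
    kappa = cohen_kappa(judge_labels, human_labels)
    if kappa < 0.7:
        print(f"Judge-human agreement {kappa:.2f} below 0.7: rewrite the rubric or swap the judge")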

Never ship without a baseline

Before any prompt change, run the current production version against the full golden set and record its score. That number is the bar.

  • Ship only if the new version beats the baseline on the aggregate score and does not regress on any named slice by more than a small margin.
  • Tied scores ship the simpler prompt, the cheaper model, or the faster path.

Two runs of the same prompt vary by a few points; treat anything inside that envelope as a tie.
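
One possible shape for that gate, assuming per-slice scores plus an aggregate for both runs; the 2-point margins are placeholders to tune against your own noise envelope.

    def gate(baseline: dict[str, float], candidate: dict[str, float],
             noise: float = 2.0, slice_margin: float = 2.0) -> str:
        """Scores keyed by slice name plus an 'aggregate' key, in points."""
        for name, base in baseline.items():
            if name != "aggregate" and candidate.get(name, 0.0) < base - slice_margin:
                return "block"            # a named slice regressed beyond the margin
        delta = candidate["aggregate"] - baseline["aggregate"]
        if delta < -noise:
            return "block"                # aggregate dropped beyond run-to-run noise
        if delta <= noise:
            return "tie"                  # inside the envelope: ship the simpler/cheaper version
        return "ship"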

Split retrieval eval from generation eval in RAG

In a RAG system, a wrong answer can mean retrieval missed the chunk or generation ignored it. Score them separately.

  • Retrieval metric: recall@k against gold chunk IDs. If recall@5 is 0.4, no prompt edit will save you. See rag.
  • Generation metric: correctness of the final answer given the retrieved context, judged against the gold answer.

One combined number hides which half is broken.
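
A sketch of the retrieval half, assuming each golden row carries the IDs of the chunks that actually contain the answer; names are illustrative.

    def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
        """Fraction of gold chunks that appear in the top-k retrieved results."""
        if not gold_ids:
            return 1.0   # nothing to retrieve for this row; skip it or count it as a hit
        return len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)

    # Average over the golden set; if mean recall@5 sits at 0.4, fix retrieval before touching prompts.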

Use pass@k for tasks with many valid outputs

When code, queries, or plans have multiple acceptable answers, score with pass@k: run k samples, count the task as passed if any one is correct.

  • pass@1 measures the best-of-one experience the user sees.
  • pass@5 measures the headroom a reranker or retry buys you.

A gap between the two tells you whether to invest in better sampling or better prompts.
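
A sketch using the standard unbiased pass@k estimator, which lets you draw n samples per task once and estimate any k ≤ n from the count of correct samples c.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate from n samples of which c were correct."""
        if n - c < k:
            return 1.0               # every draw of k samples contains a correct one
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 10 samples per task, 3 correct.
    print(pass_at_k(10, 3, 1))   # ~0.30, the best-of-one experience the user sees
    print(pass_at_k(10, 3, 5))   # ~0.92, the headroom a reranker or retry buys you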

Run the golden set on every prompt change

Eval-driven iteration is the only way to keep gains. Wire the suite into CI; block merges on regressions in named slices.

prompt change → run eval suite → diff vs baseline →
block if any slice regresses by > N points or the aggregate score drops

Treat prompts like code. A diff without an eval result is not reviewable.
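
A sketch of that CI hook, assuming the eval suite writes per-slice scores to JSON; the file names and the N-point margin are placeholders.

    # ci_eval_gate.py -- run after the eval suite; a nonzero exit blocks the merge.
    import json
    import sys

    MARGIN = 2.0   # max allowed per-slice regression, in points

    with open("baseline_scores.json") as f:
        baseline = json.load(f)      # e.g. {"aggregate": 81.0, "edge": 74.0, "adversarial": 66.0}
    with open("candidate_scores.json") as f:
        candidate = json.load(f)

    failures = [
        f"{name}: {baseline[name]:.1f} -> {candidate.get(name, 0.0):.1f}"
        for name in baseline
        if candidate.get(name, 0.0) < baseline[name] - MARGIN
    ]
    if failures:
        print("Eval regression, blocking merge:\n  " + "\n  ".join(failures))
        sys.exit(1)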

Track a dashboard, not a single number

An aggregate score hides regressions on minority slices. Publish per-slice scores, cost, and latency on a regression dashboard.

  • Columns: version, slice, accuracy, judge-agreement, p50 latency, p95 latency, cost per task.
  • One row per release, one section per slice.

A model 2 points better on average but 10 points worse on adversarial is a step backward.
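
One way to keep those rows append-only and diffable, assuming a flat CSV with one record per version-slice pair; the schema simply mirrors the columns above.

    import csv
    import os
    from dataclasses import asdict, dataclass, fields

    @dataclass
    class DashboardRow:
        version: str
        slice: str
        accuracy: float
        judge_agreement: float
        p50_latency_s: float
        p95_latency_s: float
        cost_per_task_usd: float

    def append_rows(path: str, rows: list[DashboardRow]) -> None:
        """Append one record per (version, slice); write the header only for a new file."""
        new_file = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(DashboardRow)])
            if new_file:
                writer.writeheader()
            writer.writerows(asdict(r) for r in rows)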

Vibes-based shipping is a debt trap

“It feels smarter” is not a release criterion. Shipping on vibes accumulates silent regressions that, once they pile up, no one can localize. If you cannot show the eval diff, you do not ship. See claude-code for the acceptance-criteria pattern and general-principles for the same rule applied to code.

Score cost and latency alongside quality

Cost per task and p95 latency are part of the eval. A version 1 point more accurate at 4x the cost usually loses. See multi-agent for token budgets that surface the trade-off per agent.
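
A sketch of computing those columns from per-task records, assuming the eval run logs latency in seconds and cost in dollars for each task; the percentile uses the simple nearest-rank method.

    import math
    from statistics import median

    def p95(values: list[float]) -> float:
        """Nearest-rank 95th percentile."""
        ordered = sorted(values)
        return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

    def summarize(latencies_s: list[float], costs_usd: list[float], accuracy: float) -> dict:
        return {
            "accuracy": accuracy,
            "p50_latency_s": median(latencies_s),
            "p95_latency_s": p95(latencies_s),
            "cost_per_task_usd": sum(costs_usd) / len(costs_usd),
        }

    # Compare versions on all four numbers; a 1-point accuracy gain at 4x the cost usually loses.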