Overview
Evaluating LLM systems in production means measuring quality continuously with both offline test suites and online metrics, because a system that passed once will drift as inputs change and providers update models. The unit of evaluation is a task with a known-good outcome, scored automatically, gated in CI, and monitored live. This page is the ops view of evaluation; for the agent-quality methods see evaluation and for retrieval-specific metrics see rag-eval. Dated 2026-06; eval tooling is moving quickly.
Build a golden set that mirrors real traffic
Curate a set of representative inputs with expected outputs or acceptance criteria, sampled from production logs, not invented. Include the hard cases: edge inputs, known failure modes, and adversarial prompts. The golden set is the regression suite; keep it under version control and grow it every time production surfaces a new failure. See golden-set and eval-set.
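As a minimal sketch of what a version-controlled golden set could look like, the snippet below stores one case per JSON line and loads it with a duplicate-id check. The `GoldenCase` fields, the category labels, and the `load_golden_set` helper are illustrative assumptions, not a format this page prescribes.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One golden-set entry. All field names here are illustrative assumptions."""
    case_id: str
    category: str      # e.g. "edge", "known_failure", "adversarial"
    input_text: str
    expected: str      # expected output or acceptance criterion
    source: str = "production_log"  # where the case was sampled from


def load_golden_set(jsonl_text: str) -> list:
    """Parse one JSON object per non-empty line and reject duplicate case ids."""
    cases, seen = [], set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between entries
        case = GoldenCase(**json.loads(line))
        if case.case_id in seen:
            raise ValueError(f"duplicate case_id {case.case_id!r} on line {line_no}")
        seen.add(case.case_id)
        cases.append(case)
    return cases
```

A line-per-case text format keeps diffs readable in review, so each new production failure shows up as a single added line.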
Gate releases offline in CI
Run the golden set on every prompt, model, or retrieval change and block the merge if the score regresses. Offline eval is fast and cheap, and repeatable enough to gate on once you fix sampling settings and allow a small tolerance for run-to-run noise. Treat it like a unit-test gate; see prompt-evals and llmops-best-practices.
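The gate itself can be very small. The sketch below assumes the golden-set run has already produced a single candidate score and that a baseline score from the main branch is available; the function names and the default `tolerance` of 0.01 are hypothetical choices, not recommendations.

```python
def gate(baseline: float, candidate: float, tolerance: float = 0.01) -> bool:
    """Return True when the candidate has not regressed beyond the tolerance."""
    return candidate >= baseline - tolerance


def ci_exit_code(baseline: float, candidate: float, tolerance: float = 0.01) -> int:
    """Map the gate to a process exit code: 0 lets the merge proceed, 1 blocks it."""
    if gate(baseline, candidate, tolerance):
        print(f"PASS: candidate {candidate:.3f} vs baseline {baseline:.3f}")
        return 0
    print(f"FAIL: candidate {candidate:.3f} regressed from baseline {baseline:.3f}")
    return 1
```

A CI job would pass the returned value to `sys.exit`, so that a regression fails the pipeline step. The tolerance exists to absorb run-to-run noise; set it from the variance you observe across repeated runs of the same configuration.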
Choose metrics that match the task
Pick the metric the task actually needs.
- Classification or extraction: exact match or F1 against ground truth.
- Retrieval: recall@k, MRR, nDCG; see rag-eval.
- Open-ended generation: rubric scores via LLM-as-judge, plus faithfulness and hallucination-rate.
A single accuracy number hides the failure that matters; track per-category scores.
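For reference, here is a minimal sketch of several of the metrics named above, plus a per-category breakdown. It uses the standard definitions of set-based F1, recall@k, and reciprocal rank (MRR is the mean of reciprocal ranks over queries). The function names and the `(category, score)` row shape are assumptions made for illustration.

```python
from collections import defaultdict


def exact_match(pred: str, gold: str) -> float:
    """1.0 when the strings match after trimming whitespace, else 0.0."""
    return float(pred.strip() == gold.strip())


def set_f1(pred: set, gold: set) -> float:
    """F1 over sets of extracted items."""
    if not pred and not gold:
        return 1.0  # nothing to extract and nothing extracted
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def per_category_mean(rows: list) -> dict:
    """Average scores by category; rows are (category, score) pairs."""
    buckets = defaultdict(list)
    for category, score in rows:
        buckets[category].append(score)
    return {c: sum(s) / len(s) for c, s in buckets.items()}
```

Reporting the output of `per_category_mean` alongside the overall mean is what exposes a small category that has regressed while the aggregate barely moves.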
Calibrate LLM-as-judge against humans
LLM-as-judge scales open-ended scoring, but its scores drift over time and it is biased toward longer answers and toward its own style. Calibrate it on a human-labeled subset, measure judge-to-human agreement, and re-check whenever you change the judge model. An uncalibrated judge gives confident but wrong scores. See llm-as-judge.
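One common way to quantify judge-to-human agreement on categorical labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The page does not name a specific statistic, so treat the choice of kappa here as an assumption. The sketch below is a plain-Python version for two equal-length label lists.

```python
from collections import Counter


def cohen_kappa(judge: list, human: list) -> float:
    """Cohen's kappa between two raters' labels for the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: from each rater's own label frequencies.
    judge_freq, human_freq = Counter(judge), Counter(human)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, with judge labels `["pass", "pass", "fail", "fail"]` and human labels `["pass", "fail", "fail", "fail"]`, raw agreement is 0.75 but kappa is 0.5, because half of that agreement is expected by chance.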
Measure online what offline cannot
Offline evals miss real-user behavior. Track online signals: thumbs up/down, edit and regenerate rates, task completion, escalation to a human, and refusal rate. Sample live traffic back into the golden set. Pair with traces from llm-observability.
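A rough sketch of how those signals might be aggregated from logged responses follows. It assumes each response is logged as a dict with an `id` and optional boolean flags; the flag names and the rule for flagging review candidates are hypothetical.

```python
from collections import Counter

# Hypothetical boolean flags attached to each logged response.
SIGNALS = ("thumbs_down", "regenerate", "edit", "escalation", "refusal")


def online_rates(events: list) -> dict:
    """Share of responses carrying each signal; events are dicts of flags."""
    total = len(events)
    if total == 0:
        return {signal: 0.0 for signal in SIGNALS}
    counts = Counter()
    for event in events:
        for signal in SIGNALS:
            if event.get(signal):
                counts[signal] += 1
    return {signal: counts[signal] / total for signal in SIGNALS}


def flag_for_golden_set(events: list) -> list:
    """Ids of responses with a strong negative signal, as candidates for review."""
    return [e["id"] for e in events if e.get("thumbs_down") or e.get("escalation")]
```

Flagged responses are only candidates: a person still needs to write the expected output or acceptance criterion before a case joins the golden set.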
Catch regressions from model updates
A pinned model can still be deprecated, and a floating alias can start pointing to a different model without notice. Re-run the full golden set whenever you change the model id, and also on a schedule, so a provider-side change cannot regress production unnoticed. Pin model ids and version the eval results next to them.
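The re-run rule can be expressed as a small check over the last stored eval record. This is a sketch under assumptions: the `EvalRecord` fields, the `needs_reeval` helper, and the seven-day default are all hypothetical, and the record is assumed to be stored next to the pinned model id.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class EvalRecord:
    """Result of one full golden-set run. Field names are illustrative."""
    model_id: str            # the pinned model id the run was made against
    golden_set_version: str  # e.g. a commit hash of the golden set
    score: float
    run_date: date


def needs_reeval(last: Optional[EvalRecord], deployed_model_id: str,
                 today: date, max_age_days: int = 7) -> bool:
    """True when there is no record, the model id changed, or the record is stale."""
    if last is None:
        return True
    if last.model_id != deployed_model_id:
        return True
    return today - last.run_date > timedelta(days=max_age_days)
```

The staleness branch is what covers provider-side changes: even if the configured model id is unchanged, the scheduled re-run still fires once the last record is older than the chosen window.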
Pitfalls
- Evaluating only on inputs the system already handles; the golden set must include failures.
- Trusting an uncalibrated LLM judge as ground truth.
- Shipping a model swap with no re-eval.
- Reporting one aggregate score that hides a regressed sub-category.