Overview
RAG eval splits into two questions. Did retrieval find the right context? Did generation use it correctly? A single end-to-end accuracy number hides which half is broken. The rules below build a golden set, score the two halves separately, and wire the suite into CI. For the general agent eval pattern, see evaluation.
Score retrieval and generation separately
Two metrics, two failure modes, two fixes.
- Retrieval: `recall@k` against gold chunk IDs. If `recall@5` is 0.4, no prompt edit saves you. See rag-chunking, embeddings, and rag-reranking.
- Generation: `faithfulness` (did the answer stay in the retrieved context?) and `answer relevance` (does it respond to the question?). Both presume retrieval already worked.
Score independently, then track the joint number for release-blocking.
Build a golden set of 50 to 200 Q&A pairs
A golden set is the contract for “is the new version better than the old one?”
- 50 rows is the floor; 200 is enough for most production systems.
- Each row has: `question`, `gold_chunk_ids[]`, `gold_answer`, `slice_tag`.
- Half should be cases that already broke in production. The other half mixes easy, edge, adversarial, and time-sensitive.
- Treat the set as code. Version it, review changes in PRs, never edit a row to make a failing test pass.
Building the set takes a day; the system pays it back in the first week.
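A sketch of what one row can look like, written here as a Python dict before it is dumped to JSONL; the field names follow the schema above, while the question, chunk IDs, and answer are purely illustrative:

```python
# One golden-set row (illustrative values). Stored one JSON object per line so PR diffs stay readable.
row = {
    "question": "What is the refund window for annual plans?",
    "gold_chunk_ids": ["billing-faq-04", "refund-policy-12"],  # chunks retrieval must surface
    "gold_answer": "Annual plans can be refunded within 30 days of purchase.",
    "slice_tag": "edge",  # easy | edge | adversarial | time-sensitive | multi-hop | keyword-heavy
}
```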
Use recall@k at multiple k values
`recall@k` is the share of queries where the top-k retrieved chunks include at least one gold chunk. Track it at multiple values of k.
- `recall@5`: what the prompt sees after reranking. The headline retrieval number.
- `recall@10`: reranker headroom.
- `recall@50`: candidate-pool ceiling. Low here means the dense or sparse stage is the bottleneck, not the reranker.
`recall@50` = 0.9 but `recall@5` = 0.5 says the reranker is the lever. A flat `recall@50` = 0.5 says the real problem is upstream of the reranker, in the candidate pool.
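A minimal sketch of that computation, assuming each golden-set row carries `gold_chunk_ids` and you have the retriever's ranked chunk IDs per question:

```python
def recall_at_k(rows, retrieved_ids_by_question, ks=(5, 10, 50)):
    """Share of queries whose top-k retrieved chunks include at least one gold chunk."""
    hits = {k: 0 for k in ks}
    for row in rows:
        gold = set(row["gold_chunk_ids"])
        ranked = retrieved_ids_by_question[row["question"]]  # ranked chunk IDs, best first
        for k in ks:
            if gold & set(ranked[:k]):
                hits[k] += 1
    return {k: count / len(rows) for k, count in hits.items()}
```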
Score generation with faithfulness and answer relevance
Once retrieval is right, the generation eval asks two questions.
- Faithfulness: is every factual claim in the answer supported by the retrieved context? Hallucinations show up here. Score with a judge model on a per-claim basis or with the RAGAS rubric.
- Answer relevance: does the answer respond to the question? An on-topic but evasive answer fails this. Score with an LLM judge on a 1 to 5 rubric, or by embedding similarity between question and answer.
Faithfulness without relevance ships safe, useless answers. Relevance without faithfulness ships confident hallucinations. The pair catches both.
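A sketch of the per-claim faithfulness check; `call_judge` stands in for whatever judge-model client you use, and the prompt wording is illustrative rather than the RAGAS rubric:

```python
def faithfulness_score(claims, context, call_judge):
    """Fraction of the answer's factual claims the judge marks as supported by the retrieved context."""
    if not claims:
        return 1.0
    supported = 0
    for claim in claims:
        verdict = call_judge(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            "Is the claim fully supported by the context? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            supported += 1
    return supported / len(claims)
```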
Use LLM-as-judge with skepticism
Judge models drift. Trust them only after they pass audit.
- Pick a stronger judge than the model under test.
- Score on a rubric, not a vibe: "Does the answer cite the right chunk? yes/no."
- Sample 30 to 50 rows per release for human rating. If judge-human agreement drops below 0.7, retire the judge or rewrite the rubric.
See evaluation for the broader LLM-as-judge harness.
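A sketch of that audit, assuming the sampled rows carry both a judge verdict and a human verdict on the same yes/no rubric (field names are placeholders):

```python
def judge_human_agreement(audited_rows, threshold=0.7):
    """Agreement rate between judge and human labels on the audited sample."""
    matches = sum(1 for r in audited_rows if r["judge_verdict"] == r["human_verdict"])
    rate = matches / len(audited_rows)
    return rate, rate >= threshold  # below threshold: retire the judge or rewrite the rubric
```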
Track per-slice scores, not just the aggregate
An average score hides regressions on minority slices. A model 2 points better on average but 10 points worse on adversarial is a step backward.
- Slice tags: `easy`, `edge`, `adversarial`, `time-sensitive`, `multi-hop`, `keyword-heavy`.
- Dashboard columns: version, slice, `recall@5`, faithfulness, answer relevance, p95 latency, cost per query.
- Block merges on slice regressions, even when the aggregate moves up.
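A sketch of the per-slice rollup, assuming each scored row carries its `slice_tag` plus the metric values named in the dashboard columns above:

```python
from collections import defaultdict

def per_slice_means(scored_rows, metrics=("recall_at_5", "faithfulness", "answer_relevance")):
    """One row of averages per slice_tag instead of a single aggregate number."""
    by_slice = defaultdict(list)
    for row in scored_rows:
        by_slice[row["slice_tag"]].append(row)
    return {
        tag: {m: sum(r[m] for r in rows) / len(rows) for m in metrics}
        for tag, rows in by_slice.items()
    }
```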
Wire the suite into CI on every prompt or model change
Eval-driven iteration is the only way RAG gains stick. CI runs the suite; manual checks slip.
- Pipeline: prompt or model change → run eval suite → diff vs baseline → block merge if any slice regresses by more than N points or the aggregate drops.
- Cache embeddings and reranks where deterministic so the suite stays cheap. See embeddings.
- Sample for every PR; run the full set nightly.
- A diff without an eval result is unreviewable.
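One way that gate can look in CI, as a sketch: compare the new per-slice scores against a checked-in baseline and exit nonzero so the merge is blocked (the file names and drop threshold are placeholders for your own setup):

```python
import json
import sys

MAX_SLICE_DROP = 0.02  # "N points" from the rule above, expressed as a fraction

def gate(baseline_path="eval_baseline.json", candidate_path="eval_candidate.json"):
    # Both files hold {slice_tag: {metric_name: score}}
    baseline = json.load(open(baseline_path))
    candidate = json.load(open(candidate_path))
    failures = []
    for slice_tag, metrics in baseline.items():
        for metric, base in metrics.items():
            new = candidate.get(slice_tag, {}).get(metric, 0.0)
            if new < base - MAX_SLICE_DROP:
                failures.append(f"{slice_tag}/{metric}: {base:.3f} -> {new:.3f}")
    if failures:
        print("Blocked: slice regressions\n" + "\n".join(failures))
        sys.exit(1)
    print("Eval gate passed.")

if __name__ == "__main__":
    gate()
```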
Inspect failures, do not just count them
The eval score points at problems; the failure log explains them. Read a sample of failed queries every release.
- Gold chunk not retrieved at all: retrieval failure.
- Gold chunk retrieved but not in the top-k: reranker failure.
- Chunk in the prompt but answer wrong: generation or citation failure. See rag-citations.
Five minutes reading 10 failures beats an hour staring at the aggregate. The numbers tell you which knob; the failures tell you which turn.
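The three buckets map directly onto a triage function; a sketch under the same row schema, where `retrieved_ids` is the full candidate list and `top_k_ids` is what actually reached the prompt:

```python
def triage_failure(row, retrieved_ids, top_k_ids, answer_correct):
    """Bucket a failed query so the log says which stage to fix."""
    gold = set(row["gold_chunk_ids"])
    if not gold & set(retrieved_ids):
        return "retrieval_failure"   # gold chunk never came back at all
    if not gold & set(top_k_ids):
        return "reranker_failure"    # retrieved, but not in the top-k the prompt saw
    if not answer_correct:
        return "generation_failure"  # chunk was in the prompt, answer still wrong
    return "pass"
```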