Overview
RAG eval splits into two questions. Did retrieval find the right context? Did generation use it correctly? A single end-to-end accuracy number hides which half is broken. The rules below build a golden set, score the two halves separately, and wire the suite into CI. For the general agent eval pattern, see evaluation.
Score retrieval and generation separately
Two metrics, two failure modes, two fixes.
- Retrieval: `recall@k` against gold chunk IDs. If `recall@5` is 0.4, no prompt edit saves you. See rag-chunking, embeddings, and rag-reranking.
- Generation: `faithfulness` (did the answer stay in the retrieved context?) and `answer relevance` (does it respond to the question?). Both presume retrieval already worked.
Score independently, then track the joint number for release-blocking.
Build a golden set of 50 to 200 Q&A pairs
A golden set is the contract for “is the new version better than the old one?”
- 50 rows is the floor; 200 is enough for most production systems.
- Each row has: `question`, `gold_chunk_ids[]`, `gold_answer`, `slice_tag`.
- Half should be cases that already broke in production. The other half mixes easy, edge, adversarial, and time-sensitive.
- Treat the set as code. Version it, review changes in PRs, never edit a row to make a failing test pass.
Building the set takes a day; the system pays it back in the first week.
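A sketch of what one row can look like, written here as a Python dict before it is dumped to JSONL; the field names follow the schema above, while the question, chunk IDs, and answer are purely illustrative:

```python
# One golden-set row (illustrative values). Stored one JSON object per line so PR diffs stay readable.
row = {
    "question": "What is the refund window for annual plans?",
    "gold_chunk_ids": ["billing-faq-04", "refund-policy-12"],  # chunks retrieval must surface
    "gold_answer": "Annual plans can be refunded within 30 days of purchase.",
    "slice_tag": "edge",  # easy | edge | adversarial | time-sensitive | multi-hop | keyword-heavy
}
```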
Use recall@k at multiple k values
`recall@k` is the share of queries where the top-k retrieved chunks include at least one gold chunk. Track it at multiple values of k.
- `recall@5`: what the prompt sees after reranking. The headline retrieval number.
- `recall@10`: reranker headroom.
- `recall@50`: candidate-pool ceiling. Low here means the dense or sparse stage is the bottleneck, not the reranker.
`recall@50` = 0.9 but `recall@5` = 0.5 says the reranker is the lever. A flat `recall@50` = 0.5 says the real problem is upstream of the reranker, in the candidate pool.
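A minimal sketch of that computation, assuming each golden-set row carries `gold_chunk_ids` and you have the retriever's ranked chunk IDs per question:

```python
def recall_at_k(rows, retrieved_ids_by_question, ks=(5, 10, 50)):
    """Share of queries whose top-k retrieved chunks include at least one gold chunk."""
    hits = {k: 0 for k in ks}
    for row in rows:
        gold = set(row["gold_chunk_ids"])
        ranked = retrieved_ids_by_question[row["question"]]  # ranked chunk IDs, best first
        for k in ks:
            if gold & set(ranked[:k]):
                hits[k] += 1
    return {k: count / len(rows) for k, count in hits.items()}
```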
Score generation with faithfulness and answer relevance
Once retrieval is right, the generation eval asks two questions.
- Faithfulness: is every factual claim in the answer supported by the retrieved context? Hallucinations show up here. Score with a judge model on a per-claim basis or with the RAGAS rubric.
- Answer relevance: does the answer respond to the question? An on-topic but evasive answer fails this. Score with an LLM judge on a 1 to 5 rubric, or by embedding similarity between question and answer.
Faithfulness without relevance ships safe, useless answers. Relevance without faithfulness ships confident hallucinations. The pair catches both.
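A sketch of the per-claim faithfulness check; `call_judge` stands in for whatever judge-model client you use, and the prompt wording is illustrative rather than the RAGAS rubric:

```python
def faithfulness_score(claims, context, call_judge):
    """Fraction of the answer's factual claims the judge marks as supported by the retrieved context."""
    if not claims:
        return 1.0
    supported = 0
    for claim in claims:
        verdict = call_judge(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            "Is the claim fully supported by the context? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            supported += 1
    return supported / len(claims)
```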
Use LLM-as-judge with skepticism
Judge models drift. Trust them only after they pass audit.
- Pick a stronger judge than the model under test.
- Score on a rubric, not a vibe: "Does the answer cite the right chunk? yes/no."
- Sample 30 to 50 rows per release for human rating. If judge-human agreement drops below 0.7, retire the judge or rewrite the rubric.
See evaluation for the broader LLM-as-judge harness.
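A sketch of that audit, assuming the sampled rows carry both a judge verdict and a human verdict on the same yes/no rubric (field names are placeholders):

```python
def judge_human_agreement(audited_rows, threshold=0.7):
    """Agreement rate between judge and human labels on the audited sample."""
    matches = sum(1 for r in audited_rows if r["judge_verdict"] == r["human_verdict"])
    rate = matches / len(audited_rows)
    return rate, rate >= threshold  # below threshold: retire the judge or rewrite the rubric
```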
Track per-slice scores, not just the aggregate
An average score hides regressions on minority slices. A model 2 points better on average but 10 points worse on adversarial is a step backward.
- Slice tags: `easy`, `edge`, `adversarial`, `time-sensitive`, `multi-hop`, `keyword-heavy`.
- Dashboard columns: version, slice, `recall@5`, faithfulness, answer relevance, p95 latency, cost per query.
- Block merges on slice regressions, even when the aggregate moves up.
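A sketch of the per-slice rollup, assuming each scored row carries its `slice_tag` plus the metric values named in the dashboard columns above:

```python
from collections import defaultdict

def per_slice_means(scored_rows, metrics=("recall_at_5", "faithfulness", "answer_relevance")):
    """One row of averages per slice_tag instead of a single aggregate number."""
    by_slice = defaultdict(list)
    for row in scored_rows:
        by_slice[row["slice_tag"]].append(row)
    return {
        tag: {m: sum(r[m] for r in rows) / len(rows) for m in metrics}
        for tag, rows in by_slice.items()
    }
```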
Wire the suite into CI on every prompt or model change
Eval-driven iteration is the only way RAG gains stick. CI runs the suite; manual checks slip.
- Pipeline: prompt or model change → run eval suite → diff vs baseline → block merge if any slice regresses by more than N points or the aggregate drops.
- Cache embeddings and reranks where deterministic so the suite stays cheap. See embeddings.
- Sample for every PR; run the full set nightly.
- A diff without an eval result is unreviewable.
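One way that gate can look in CI, as a sketch: compare the new per-slice scores against a checked-in baseline and exit nonzero so the merge is blocked (the file names and drop threshold are placeholders for your own setup):

```python
import json
import sys

MAX_SLICE_DROP = 0.02  # "N points" from the rule above, expressed as a fraction

def gate(baseline_path="eval_baseline.json", candidate_path="eval_candidate.json"):
    # Both files hold {slice_tag: {metric_name: score}}
    baseline = json.load(open(baseline_path))
    candidate = json.load(open(candidate_path))
    failures = []
    for slice_tag, metrics in baseline.items():
        for metric, base in metrics.items():
            new = candidate.get(slice_tag, {}).get(metric, 0.0)
            if new < base - MAX_SLICE_DROP:
                failures.append(f"{slice_tag}/{metric}: {base:.3f} -> {new:.3f}")
    if failures:
        print("Blocked: slice regressions\n" + "\n".join(failures))
        sys.exit(1)
    print("Eval gate passed.")

if __name__ == "__main__":
    gate()
```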
Inspect failures, do not just count them
The eval score points at problems; the failure log explains them. Read a sample of failed queries every release.
- Gold chunk not retrieved at all: retrieval failure.
- Gold chunk retrieved but not in the top-k: reranker failure.
- Chunk in the prompt but answer wrong: generation or citation failure. See rag-citations.
Five minutes reading 10 failures beats an hour staring at the aggregate. The numbers tell you which knob; the failures tell you which turn.
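The three buckets map directly onto a triage function; a sketch under the same row schema, where `retrieved_ids` is the full candidate list and `top_k_ids` is what actually reached the prompt:

```python
def triage_failure(row, retrieved_ids, top_k_ids, answer_correct):
    """Bucket a failed query so the log says which stage to fix."""
    gold = set(row["gold_chunk_ids"])
    if not gold & set(retrieved_ids):
        return "retrieval_failure"   # gold chunk never came back at all
    if not gold & set(top_k_ids):
        return "reranker_failure"    # retrieved, but not in the top-k the prompt saw
    if not answer_correct:
        return "generation_failure"  # chunk was in the prompt, answer still wrong
    return "pass"
```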