Overview

Public embedding benchmarks measure average quality across many domains. Your retrieval system operates in one domain. The only evaluation that matters is recall and ranking on your corpus with your queries. Build a golden set, compute MRR and recall@k, and run A/B tests when comparing models or index configurations. For end-to-end RAG evaluation, see rag-eval.

Build a golden set before changing any retrieval parameter

A golden set is a fixed collection of query-document pairs where the correct result is known. Without it, every model swap, dimension change, or threshold adjustment is guesswork.

Minimum golden set composition for production use:

  • 200 to 500 distinct queries.
  • At least one known-relevant document per query; more where multiple documents are relevant.
  • Queries sampled from real user logs where available; synthetic queries for new products.
  • Coverage of edge cases: short queries, multi-hop questions, exact identifiers, rare terms.

# Golden set format
golden = [
    {"query": "how to handle EADDRINUSE", "relevant": ["doc_1234", "doc_5678"]},
    {"query": "Postgres connection pooling", "relevant": ["doc_2291"]},
    # ...
]
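
Keep the set in version control next to the index configuration so every eval runs against the same file. A minimal sketch of persisting it as JSON, assuming the golden_set.json filename used by the CI example later in this section:

import json

# Write the golden set to disk so CI and ad-hoc evals load the same file.
with open("golden_set.json", "w") as f:
    json.dump(golden, f, indent=2)

# Load it back wherever the eval harness runs.
with open("golden_set.json") as f:
    golden = json.load(f)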

Mean Reciprocal Rank measures where the first correct answer appears

MRR@k computes the reciprocal of the rank of the first relevant document in the top-k results, averaged across queries. It rewards systems that surface the best answer highest.

def mrr_at_k(retrieved_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant document in the top k."""
    scores = []
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        score = 0.0
        for rank, doc_id in enumerate(retrieved[:k], start=1):
            if doc_id in relevant:
                score = 1.0 / rank  # reciprocal rank of the first hit
                break
        scores.append(score)  # stays 0.0 if no relevant doc is in the top k
    return sum(scores) / len(scores)

A score of 1.0 means the first result was always correct. A score of 0.5 means the first correct answer was typically at position 2. For retrieval feeding an LLM, MRR@5 is the most actionable metric; if the answer is not in the top 5, the generator is unlikely to cite it.
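
To compute the metric over the golden set, run every query through retrieval and collect the ranked doc IDs. A minimal sketch, where search is a hypothetical stand-in for whatever call returns ranked doc IDs from your index:

# `search` is a hypothetical retrieval call returning ranked doc IDs.
retrieved_lists = [search(item["query"], k=10) for item in golden]
relevant_sets = [set(item["relevant"]) for item in golden]

print(f"MRR@5:  {mrr_at_k(retrieved_lists, relevant_sets, k=5):.3f}")
print(f"MRR@10: {mrr_at_k(retrieved_lists, relevant_sets, k=10):.3f}")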

Recall@k measures coverage, not ranking

Recall@k is the fraction of relevant documents that appear in the top-k results. Use it when completeness matters more than position, such as when a reranker will reorder after retrieval.

def recall_at_k(retrieved_lists, relevant_sets, k=20):
    """Fraction of known-relevant documents found in the top k, averaged over queries."""
    scores = []
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        retrieved_set = set(retrieved[:k])
        hits = len(relevant & retrieved_set)  # relevant docs actually retrieved
        scores.append(hits / len(relevant))
    return sum(scores) / len(scores)

Target recall@20 of 0.85 or higher for a dense-only retriever feeding a reranker. Below 0.7, the reranker cannot save you; the relevant documents were never retrieved. Hybrid search typically improves recall@20 by 5 to 15 percentage points over pure dense. See embeddings-hybrid-search.
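
To check the hybrid-versus-dense claim on your own corpus, score both retrievers against the same golden set. A sketch, where dense_search and hybrid_search are hypothetical stand-ins for the two retrieval paths:

relevant_sets = [set(item["relevant"]) for item in golden]

# dense_search and hybrid_search are hypothetical; swap in your own retrievers.
for name, retriever in [("dense", dense_search), ("hybrid", hybrid_search)]:
    retrieved_lists = [retriever(item["query"], k=20) for item in golden]
    score = recall_at_k(retrieved_lists, relevant_sets, k=20)
    print(f"{name}: recall@20 = {score:.3f}")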

nDCG weights relevance by rank for multi-grade relevance judgments

Normalized Discounted Cumulative Gain (nDCG) handles cases where documents have graded relevance (highly relevant, partially relevant, not relevant) rather than binary. It is more informative than recall@k when multiple documents are relevant but not equally so.

import math

def ndcg_at_k(retrieved, relevance_grades, k=10):
    """nDCG@k for one query; relevance_grades maps doc_id -> graded relevance."""
    dcg = sum(
        relevance_grades.get(doc_id, 0) / math.log2(rank + 2)  # rank is 0-based
        for rank, doc_id in enumerate(retrieved[:k])
    )
    ideal_grades = sorted(relevance_grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_grades))
    return dcg / idcg if idcg > 0 else 0.0

Use nDCG when the corpus has documents of clearly different relevance levels. For simple single-answer retrieval, recall@k and MRR are faster to interpret.
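
A small usage sketch; the grades (2 = highly relevant, 1 = partially relevant) and doc IDs are illustrative:

# Graded judgments for one query: 2 = highly relevant, 1 = partially relevant.
relevance_grades = {"doc_1234": 2, "doc_5678": 1, "doc_9012": 1}
retrieved = ["doc_5678", "doc_1234", "doc_3344", "doc_9012"]

print(f"nDCG@10: {ndcg_at_k(retrieved, relevance_grades, k=10):.3f}")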

A/B test against your retrieval, not against public benchmarks

When switching models, compare on the same golden set, with the same index configuration, the same k, and the same query preprocessing. Change one variable at a time. The evaluation harness should (see the sketch after this list):

  1. Embed the golden set queries with model A and model B.
  2. Build two identical indexes.
  3. Run recall@10 and MRR@10 for both.
  4. Report delta and statistical significance if the golden set is large enough (100+ queries).
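
A sketch of that harness, assuming a hypothetical embed_and_search(model, query, k) that embeds the query with the given model and searches an index built from the same model's document embeddings:

def evaluate(model, golden, k=10):
    # embed_and_search is hypothetical: embed the query with `model`, search
    # the index built from that model's document embeddings, return doc IDs.
    retrieved_lists = [embed_and_search(model, item["query"], k=k) for item in golden]
    relevant_sets = [set(item["relevant"]) for item in golden]
    return {
        "recall": recall_at_k(retrieved_lists, relevant_sets, k=k),
        "mrr": mrr_at_k(retrieved_lists, relevant_sets, k=k),
    }

results_a = evaluate("model-a", golden)
results_b = evaluate("model-b", golden)
print(f"delta recall@10: {results_b['recall'] - results_a['recall']:+.3f}")
print(f"delta MRR@10:    {results_b['mrr'] - results_a['mrr']:+.3f}")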

A 1 percent improvement in recall@10 on a 200-query golden set may not be statistically significant. For small golden sets, run a qualitative review of the per-query disagreements between the two configurations, not just aggregate scores.
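
One way to put an error bar on the delta is a paired bootstrap over per-query scores. A sketch, assuming scores_a and scores_b hold one metric value per golden query (for example, per-query recall@10) in the same order:

import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of bootstrap resamples in which B fails to beat A.

    A small value suggests the observed improvement is unlikely to be noise.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = [b - a for a, b in zip(scores_a, scores_b)]
    worse_or_equal = 0
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            worse_or_equal += 1
    return worse_or_equal / n_resamples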

Automate eval in the deployment pipeline

Treat retrieval quality like a unit test. Run the golden set eval as part of CI before any index rebuild or model upgrade goes to production. Alert on regressions of more than 1 percent in recall@10 or MRR@5.

# In CI after embedding model change
python eval_retrieval.py \
  --golden golden_set.json \
  --model voyage-3-large \
  --dimension 1024 \
  --threshold 0.85
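
One possible shape for that script is a threshold gate that fails the CI job when quality regresses. A sketch; the flags mirror the invocation above, and evaluate is the hypothetical harness sketched earlier:

import argparse, json, sys

parser = argparse.ArgumentParser()
parser.add_argument("--golden", required=True)
parser.add_argument("--model", required=True)
parser.add_argument("--dimension", type=int, default=1024)  # threaded into the embedding call in a full implementation
parser.add_argument("--threshold", type=float, default=0.85)
args = parser.parse_args()

with open(args.golden) as f:
    golden = json.load(f)

results = evaluate(args.model, golden, k=10)
print(json.dumps(results, indent=2))

# Fail the build when recall@10 drops below the configured threshold.
if results["recall"] < args.threshold:
    sys.exit(1)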

For cost at evaluation time, see embeddings-cost-control.