Overview

Retrieval is where most RAG systems fail. The generation step writes confidently from whatever you hand it, so the bar is set by what the retriever returns. The rules below pick the right retrievers, set k correctly, filter before vectors, and add query expansion when the surface form of the question does not match the corpus. For chunk preparation, see rag-chunking. For the rerank step that follows, see rag-reranking.

Run dense and sparse retrieval in parallel

Dense embeddings (Voyage 3, BGE, OpenAI v3) handle paraphrase and semantic match. Sparse retrieval (BM25, Postgres full-text, SPLADE) handles rare keywords, identifiers, and exact phrases. Run both; merge with reciprocal rank fusion.

def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        # Canonical RRF uses 1-based ranks: score = sum of 1 / (k + rank).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

Pure dense retrieval misses EADDRINUSE and pg_dump --jobs=4. Pure BM25 misses “the database that ships with the OS” mapping to SQLite. The hybrid wins both queries. See postgres-full-text-search for the BM25 arm when Postgres is already in the stack.

Retrieve broad, rerank narrow

Start with top_k = 20 to 50 from each retriever, merge to 50, then hand the merged set to a cross-encoder reranker that returns the top 5. See rag-reranking.

  • Dense top-k: 20 to 30. Cheap.
  • Sparse top-k: 20 to 30. Almost free.
  • Reranked pool for the prompt: 3 to 8 chunks.

Sending 5 dense results straight to the prompt skips the reranker’s recovery on borderline matches. Recall at k=5 is brittle; recall at k=50 is forgiving.
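A minimal sketch of the broad-then-narrow shape. The cross-encoder is stubbed as an injected score_fn, and the function names here are illustrative, not a library API:

```python
def hybrid_candidates(dense_ranking, sparse_ranking, pool_size=50, rrf_k=60):
    # Merge the two ranked id lists with reciprocal rank fusion,
    # keeping a broad pool so the reranker has room to recover.
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank)
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:pool_size]

def rerank(query, candidates, score_fn, final_k=5):
    # score_fn stands in for a cross-encoder: (query, doc_id) -> relevance.
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)[:final_k]
```

Note that the rerank step can promote a document that sat near rank 40 in the merged pool; that recovery is exactly what a narrow initial k throws away.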

Filter on metadata before the vector step

Pre-filtering is faster and more correct than post-filtering. A metadata filter that drops 90 percent of the corpus turns a million-vector search into a hundred-thousand-vector search.

results = collection.query(
    query_embeddings=[query_vec],
    # Chroma requires $and when combining multiple metadata conditions,
    # and $gte/$lte compare numbers: store dates as integers
    # (YYYYMMDD or epoch seconds), not strings.
    where={"$and": [
        {"tenant_id": "t_123"},
        {"last_updated": {"$gte": 20260101}},
    ]},
    n_results=50,
)
  • Always filter by tenant_id, lang, doc_type when the query implies one.
  • Date filters belong here for time-sensitive queries.
  • Engines that filter after the ANN search (rather than during it) lose recall. Pick Qdrant, Weaviate, or pgvector when this matters. See rag-vector-databases.

Expand the query when the surface form differs from the corpus

User queries are short and use different vocabulary from the documents. Multi-query expansion generates 3 to 5 paraphrases, retrieves on each, and merges the rankings with RRF.

Original: "how do I make my db faster"
Expansions:
  - "Postgres query performance tuning"
  - "index strategies for slow queries"
  - "EXPLAIN ANALYZE workflow"

A small fast model (Haiku, GPT-4o-mini, Llama 3.1 8B) handles the expansion cheaply. The expansion model is part of the retrieval system; eval it like the rest. See rag-eval.
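As a sketch, with expand_fn and retrieve_fn injected as stand-ins for the expansion model and the retriever (both names are assumptions, not a specific API):

```python
def multi_query_retrieve(query, expand_fn, retrieve_fn, n_expansions=3, rrf_k=60):
    # expand_fn: query -> list of paraphrases (a small, fast model in practice).
    # retrieve_fn: query -> ranked list of doc ids.
    # Always retrieve on the original query too, so a bad expansion
    # cannot make results worse than no expansion at all.
    queries = [query] + expand_fn(query)[:n_expansions]
    scores = {}
    for q in queries:
        for rank, doc_id in enumerate(retrieve_fn(q), start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that show up under several paraphrases accumulate RRF score, which is the behavior you want: agreement across expansions is evidence of relevance.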

Use HyDE when the corpus is hypothetical-answer-shaped

HyDE (hypothetical document embeddings) asks the model to draft a plausible answer, then embeds that answer instead of the question. The synthetic answer lives in the same space as the corpus.

# llm, embed, and vector_store stand in for your model client,
# embedding function, and vector store.
hypothetical = llm.generate(f"Write a one-paragraph answer to: {question}")
vec = embed(hypothetical)
results = vector_store.query(vec, top_k=20)
  • HyDE helps when the corpus is rich and the query is sparse (“explain X,” “what is Y”).
  • HyDE hurts on long, well-formed searches (“Postgres 17 JIT compile flags”). Skip it there.
  • Combine with multi-query: embed the question and the hypothetical answer; union the results.
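The combined variant from the last bullet, sketched with injected stand-in helpers (llm_generate, embed, and query_store are assumptions, not a specific library):

```python
def hyde_plus_question(question, llm_generate, embed, query_store, top_k=20):
    # Retrieve on both the raw question and a hypothetical answer,
    # then union the two ranked lists, preserving order and deduplicating.
    hypothetical = llm_generate(f"Write a one-paragraph answer to: {question}")
    ranked = query_store(embed(question), top_k) + query_store(embed(hypothetical), top_k)
    seen, union = set(), []
    for doc_id in ranked:
        if doc_id not in seen:
            seen.add(doc_id)
            union.append(doc_id)
    return union
```

The union hedges against HyDE's failure mode: if the hypothetical answer drifts, the question's own results are still in the pool for the reranker.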

Tune k against recall@k, not vibes

The right k is the smallest one that gives the reranker enough good candidates. Find it with the eval suite.

  • Sweep k = 5, 10, 20, 50, 100 against the gold set.
  • Pick the k where recall@k plateaus. That is the candidate-pool size; the rerank step trims it.

Picking k by feel ships a system that under-retrieves on hard queries and over-pays on easy ones.
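The sweep is a few lines once the gold set exists. Here retrieved holds one ranked candidate list per query and gold holds the matching gold chunk id per query (names are illustrative):

```python
def recall_at_k(retrieved, gold, k):
    # Fraction of queries whose gold chunk appears in the top-k candidates.
    hits = sum(1 for docs, g in zip(retrieved, gold) if g in docs[:k])
    return hits / len(gold)

def sweep_k(retrieved, gold, ks=(5, 10, 20, 50, 100)):
    # Recall@k for each candidate k; pick the smallest k where the curve plateaus.
    return {k: recall_at_k(retrieved, gold, k) for k in ks}
```

If recall@20 and recall@50 are equal, the extra 30 candidates buy nothing; stop at 20.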

Diagnose recall before precision

When the system answers wrong, check recall@k against the gold set first. If the gold chunk never enters the candidate pool, no reranker or prompt change can fix the answer; only once recall looks healthy is it worth tuning the reranker’s precision or blaming generation.

Diagnose in that order: recall, then precision, then generation. Most “the model is hallucinating” reports are retrieval misses.