Overview

Retrieval-augmented generation works when the retrieval step actually finds the right context. Most RAG failures are retrieval failures, not generation failures; the model writes confidently from whatever you hand it. The rules below cover chunking, retrieval, reranking, and evaluation. For the vector store, see chromadb. For embedding models, see embeddings.

Chunk on semantic boundaries, not fixed character counts

Split documents on headings and paragraph breaks first, then enforce a token budget. A 500-token chunk that starts mid-sentence is worse than a 700-token chunk that respects the section.

  • Target 200 to 800 tokens per chunk. Below 200, you lose context; above 800, retrieval gets fuzzy.
  • Split on # and ## headings first, then blank lines, then code-fence boundaries.
  • Add a 50-token overlap between adjacent chunks so a fact split across a boundary stays retrievable.

Attach metadata to every chunk: source_url, title, heading_path, last_updated. The metadata is half the system.
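
A minimal splitter along these lines, assuming markdown input and a tiktoken tokenizer for the token budget; split_markdown is a name invented here, and code-fence handling and metadata attachment are left out of the sketch:

import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def split_markdown(text, max_tokens=700, overlap=50):
    # Split before #/## headings and on blank lines; greedily pack the
    # resulting blocks into chunks under the token budget.
    blocks = re.split(r"\n(?=#{1,2} )|\n{2,}", text)
    chunks, current, count = [], [], 0
    for block in blocks:
        n = len(enc.encode(block))
        if count + n > max_tokens and current:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap so a
            # fact split across the boundary stays retrievable.
            tail = enc.decode(enc.encode(chunks[-1])[-overlap:])
            current, count = [tail], len(enc.encode(tail))
        current.append(block)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks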

Combine dense embeddings with metadata filters

Pure semantic search is not enough when the query is specific. “Postgres 17 release notes” needs both vector similarity and a metadata filter on version = "17".

results = collection.query(
    query_embeddings=[query_vec],
    # Chroma requires $and when combining conditions, and its range
    # operators compare numbers, so store last_updated as a unix timestamp.
    where={"$and": [
        {"category": "backend"},
        {"last_updated": {"$gte": 1767225600}},  # 2026-01-01 UTC
    ]},
    n_results=50,
)

Use BM25 as a parallel retriever when the query has rare keywords (error codes, identifiers, exact phrases). Merge with reciprocal rank fusion. Pure dense retrieval misses EADDRINUSE; BM25 catches it.
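
A sketch of the fusion step, assuming dense_ids and bm25_ids are the two ranked lists of chunk IDs from the parallel retrievers; k=60 is the constant commonly used for reciprocal rank fusion:

def reciprocal_rank_fusion(result_lists, k=60):
    # result_lists: e.g. [dense_ids, bm25_ids], each a list of chunk IDs
    # in rank order. Score = sum over lists of 1 / (k + rank).
    scores = {}
    for ids in result_lists:
        for rank, chunk_id in enumerate(ids, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = reciprocal_rank_fusion([dense_ids, bm25_ids])[:50]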

Rerank the top 50 down to the top 5

Bi-encoder embeddings are fast but coarse. A cross-encoder reranker is slow but accurate. Use them in series.

  1. Retrieve top 50 by vector similarity.
  2. Score each with a cross-encoder reranker (Cohere Rerank 3, BGE reranker, Voyage rerank-2).
  3. Keep top 5 for the prompt.

The reranker sees the query and chunk together, which catches subtle relevance the embedding missed. Cost stays bounded because you only rerank 50, not the whole corpus.
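
A sketch using the open BGE reranker through sentence-transformers; the model name, query, and candidates variables are assumptions here, and the hosted rerankers follow the same pattern through their own APIs:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

# candidates: the top-50 chunks from the vector store, each with .text
pairs = [(query, chunk.text) for chunk in candidates]
scores = reranker.predict(pairs)

# Keep the 5 highest-scoring chunks for the prompt.
top5 = [chunk for _, chunk in sorted(zip(scores, candidates),
                                     key=lambda x: x[0], reverse=True)[:5]]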

Evaluate retrieval and answer correctness separately

A golden set has two columns: the question and the IDs of the chunks that contain the answer. A second sheet pairs each question with the correct final answer.

  • Retrieval metric: recall@k. Did the top k chunks include at least one chunk from the gold set? Track recall@5, recall@10, recall@50.
  • Answer metric: correctness against the gold answer, scored by a judge model or by exact-match on extracted facts.

Track them separately. If recall@5 is 0.9 but answer correctness is 0.6, the problem is in generation. If recall@5 is 0.4, no prompt engineering will save you.
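
A minimal recall@k loop, assuming a golden_set of rows holding each question and its gold chunk IDs, and a retrieve(question, k) function that returns chunk IDs in rank order; both names are placeholders:

def recall_at_k(golden_set, retrieve, k):
    # golden_set: list of {"question": str, "gold_chunk_ids": set}
    hits = 0
    for row in golden_set:
        retrieved = set(retrieve(row["question"], k))
        if retrieved & row["gold_chunk_ids"]:
            hits += 1
    return hits / len(golden_set)

for k in (5, 10, 50):
    print(f"recall@{k}:", recall_at_k(golden_set, retrieve, k))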

Cite the source in the answer

Pass chunk IDs into the prompt; require the model to cite them inline.

Context:
[1] (source: postgres-17-release-notes.md) ...
[2] (source: backend/postgres.md) ...
 
Question: How do I enable JIT in Postgres 17?
 
Answer with inline citations like [1] or [2] for each claim.

Citations are auditable. A user can click through; a developer can replay the retrieval and see whether the model used the chunk it cited.
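
A sketch of assembling that context block from the reranked chunks, assuming each chunk object carries the source_url metadata attached at indexing time; build_prompt and top5 are names from the sketches above, not a fixed API:

def build_prompt(question, chunks):
    # Number the chunks so the model can cite [1], [2], ... inline.
    context = "\n".join(
        f"[{i}] (source: {chunk.metadata['source_url']}) {chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer with inline citations like [1] or [2] for each claim."
    )

prompt = build_prompt("How do I enable JIT in Postgres 17?", top5)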

Filter by freshness when the topic moves

Stale chunks poison the answer. Attach last_updated to every chunk and filter at query time when the question is time-sensitive (“latest”, “current”, “in 2026”).

  • Hard filter: drop chunks older than N months for time-sensitive queries.
  • Soft filter: rerank with a freshness boost, so older chunks are not excluded but lose ties.

The classifier is upstream: detect time sensitivity from the query, then pick the filter mode.
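
A sketch of that upstream step, assuming last_updated is stored as a unix timestamp so it can be range-filtered; the keyword pattern and the six-month window are illustrative choices, not fixed rules:

import re
import time

TIME_SENSITIVE = re.compile(r"\b(latest|current|now|today|this year|20\d\d)\b", re.I)

def freshness_filter(query, months=6):
    # Hard filter only when the query looks time-sensitive; otherwise
    # return None and let the reranker apply a soft freshness boost.
    if TIME_SENSITIVE.search(query):
        cutoff = time.time() - months * 30 * 24 * 3600
        return {"last_updated": {"$gte": cutoff}}
    return None

where = freshness_filter("What changed in the latest Postgres release?")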

Cache embeddings and retrievals

Re-embedding the same chunk is waste; re-retrieving the same query is waste.

  • Cache embeddings by (model_id, model_version, sha256(text)). See embeddings for cache-key rules.
  • Cache retrievals by (query_text_normalized, filters) with a short TTL.

A warm cache cuts RAG latency in half and embedding spend by an order of magnitude.
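
A sketch of both cache keys with a dict-backed retrieval cache; the query normalization (lowercase, collapsed whitespace) and the five-minute TTL are illustrative defaults:

import hashlib
import json
import re
import time

def embedding_key(model_id, model_version, text):
    # Key by model identity plus content hash, per the rule above.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_id}:{model_version}:{digest}"

def retrieval_key(query_text, filters):
    normalized = re.sub(r"\s+", " ", query_text.strip().lower())
    return f"{normalized}|{json.dumps(filters, sort_keys=True)}"

retrieval_cache = {}  # key -> (expires_at, results)

def cached_retrieve(query_text, filters, retrieve, ttl=300):
    key = retrieval_key(query_text, filters)
    hit = retrieval_cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]
    results = retrieve(query_text, filters)
    retrieval_cache[key] = (time.time() + ttl, results)
    return results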