Overview
Dense retrieval captures semantic meaning. Sparse retrieval (BM25 and full-text indexes) captures exact keyword matches, rare tokens, and identifiers. Neither alone dominates across query types. Hybrid search runs both in parallel and merges the result lists with Reciprocal Rank Fusion (RRF). For retrieval architecture in the broader RAG pipeline, see rag-retrieval.
Run dense and sparse retrievers in parallel, not sequentially
Sequential retrieval (dense first, then keyword filter) discards candidates before the merge. Parallel retrieval collects independent top-k lists from each retriever, then merges. The parallel pattern maximizes recall and lets RRF resolve conflicts by rank, not by pre-filtering.
import asyncio

async def hybrid_search(query, k=20):
    # dense_retrieve and bm25_retrieve are the backend-specific retrievers;
    # both return ranked lists of document IDs.
    dense_task = asyncio.create_task(dense_retrieve(query, k=k))
    sparse_task = asyncio.create_task(bm25_retrieve(query, k=k))
    dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
    return rrf([dense_results, sparse_results], k=60)

Use Reciprocal Rank Fusion to merge result lists
RRF combines multiple ranked lists without needing score calibration. Each document gets a score of 1 / (k + rank) for every list it appears in; scores sum across lists.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

The constant k=60 is the standard default from the original RRF paper. Lower values amplify top-rank documents; higher values flatten differences. Tune k on your golden set if retrieval quality is unsatisfactory, but 60 works well for most corpora.
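As a quick sanity check of the merge behavior, the sketch below (restating rrf with hypothetical document IDs) shows that a document ranked mid-list by both retrievers outscores documents that appear in only one list:

```python
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

dense = ["sem-1", "shared", "sem-2"]   # hypothetical dense top-3
sparse = ["kw-1", "shared", "kw-2"]    # hypothetical BM25 top-3
merged = rrf([dense, sparse])
# "shared" collects 1/(60+1) from each list (~0.0328 total), beating every
# single-list document, whose best possible score is 1/60 (~0.0167).
```

The tie-breaking for documents with equal scores falls back to insertion order, which is arbitrary; the reranking step below resolves those ties properly.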
Hybrid beats pure dense on keyword-heavy and rare-token queries
Dense models generalize well but fail predictably on:
- Exact identifiers: error codes, function names, SKUs, UUIDs.
- Rare proper nouns: product names, people, organizations not well-represented in training.
- Numeric values: dates, prices, version numbers where similarity is meaningless.
- Short queries with one operative word: “EADDRINUSE” means nothing to an embedding model trained on prose.
BM25 handles all of these exactly. For general conceptual queries (“how do I handle port conflicts”), dense dominates. The hybrid covers both cases and rarely underperforms either retriever in isolation. Verify this claim on your own corpus using embeddings-eval.
Apply metadata pre-filters before both retrievers, not after
Pre-filtering by metadata (date range, user ID, document type) is cheaper than post-filtering from a large merged result list. Apply filters at the index level:
-- pgvector example with pre-filter; table and column names illustrative
SELECT id, content
FROM documents
WHERE metadata->>'category' = 'support'
ORDER BY embedding <#> $1::vector
LIMIT 20

Post-filtering a large k list to hit a small target wastes retrieval bandwidth and can produce empty result sets when the metadata filter is selective. See pinecone-vs-pgvector for filter support differences between vector stores.
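The empty-result failure mode is easy to reproduce in miniature. In this toy sketch (scores and metadata invented for illustration), post-filtering a global top-3 returns nothing because no 'support' document ranked high enough, while pre-filtering retrieves from the eligible subset:

```python
corpus = [
    {"id": "a", "category": "billing", "score": 0.95},
    {"id": "b", "category": "billing", "score": 0.93},
    {"id": "c", "category": "billing", "score": 0.91},
    {"id": "d", "category": "support", "score": 0.88},
    {"id": "e", "category": "support", "score": 0.85},
]

def top_k(docs, k):
    return sorted(docs, key=lambda d: d["score"], reverse=True)[:k]

# Post-filter: retrieve top-3 globally, then filter -- empty result,
# because the global top-3 are all 'billing' documents.
post = [d for d in top_k(corpus, 3) if d["category"] == "support"]

# Pre-filter: restrict to the category first, then take top-3.
pre = top_k([d for d in corpus if d["category"] == "support"], 3)
```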
Query routing reduces cost on query categories that do not need hybrid
Not every query benefits from running both retrievers. If query logs show that 60 percent of queries are exact identifier lookups, route those directly to BM25 and skip the embedding call. Route semantic queries to dense-only. Route ambiguous queries to hybrid.
A lightweight query classifier (even a simple keyword pattern) can reduce embedding API calls by 20 to 40 percent in production. Measure the cost before adding complexity; for low-volume applications, always run hybrid.
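A minimal pattern-based router might look like the sketch below. The patterns, thresholds, and route names are illustrative, not a production classifier; tune them against your own query logs:

```python
import re

# Heuristic patterns for identifier-style queries (assumptions, not exhaustive).
IDENTIFIER_PATTERNS = [
    re.compile(r"^[A-Z][A-Z0-9_]{4,}$"),        # error codes like EADDRINUSE
    re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-"),  # UUID prefix
    re.compile(r"\bv?\d+\.\d+\.\d+\b"),         # semantic version numbers
]

def route(query: str) -> str:
    q = query.strip()
    if any(p.search(q) for p in IDENTIFIER_PATTERNS):
        return "sparse"   # exact lookup: skip the embedding call
    if len(q.split()) >= 4:
        return "dense"    # longer conceptual queries
    return "hybrid"       # short and ambiguous: run both retrievers
```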
Re-ranking after hybrid retrieval improves precision
RRF maximizes recall, not precision. After merging to top-20, pass the candidates to a cross-encoder reranker (Cohere Rerank, Voyage Rerank, or a local cross-encoder/ms-marco model) to reorder by relevance. The reranker sees the full query and document pair, not just embeddings. See rag-retrieval for the full retrieval-to-rerank pipeline.
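The rerank step itself reduces to "score each (query, document) pair, sort by score, keep the top n". A sketch with the cross-encoder abstracted behind a callable follows; in production, score_fn would wrap the reranker API or local model, and the token-overlap scorer here is only a stand-in for illustration:

```python
from typing import Callable

def rerank(query: str, docs: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    # Unlike the bi-encoder embeddings used during retrieval, the scorer
    # sees the full (query, document) pair at once.
    scored = [(score_fn(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Stand-in scorer: fraction of query tokens present in the document.
def overlap_score(query: str, doc: str) -> float:
    q_tokens = set(query.lower().split())
    return len(q_tokens & set(doc.lower().split())) / len(q_tokens)

candidates = ["tuning BM25 parameters", "port already in use error", "ranking fusion"]
top = rerank("port in use", candidates, overlap_score, top_n=2)
```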