Overview
Dense retrieval captures semantic meaning. Sparse retrieval (BM25 and full-text indexes) captures exact keyword matches, rare tokens, and identifiers. Neither alone dominates across query types. Hybrid search runs both in parallel and merges the result lists with Reciprocal Rank Fusion (RRF). For retrieval architecture in the broader RAG pipeline, see rag-retrieval.
Run dense and sparse retrievers in parallel, not sequentially
Sequential retrieval (dense first, then keyword filter) discards candidates before the merge. Parallel retrieval collects independent top-k lists from each retriever, then merges. The parallel pattern maximizes recall and lets RRF resolve conflicts by rank, not by pre-filtering.
import asyncio

async def hybrid_search(query, k=20):
    # dense_retrieve and bm25_retrieve are the backend-specific retrievers;
    # both return ranked lists of document IDs.
    dense_task = asyncio.create_task(dense_retrieve(query, k=k))
    sparse_task = asyncio.create_task(bm25_retrieve(query, k=k))
    dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
    return rrf([dense_results, sparse_results], k=60)

Use Reciprocal Rank Fusion to merge result lists
RRF combines multiple ranked lists without needing score calibration. Each document gets a score of 1 / (k + rank) for every list it appears in; scores sum across lists.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

The constant k=60 is the standard default from the original RRF paper. Lower values amplify top-rank documents; higher values flatten differences. Tune k on your golden set if retrieval quality is unsatisfactory, but 60 works well for most corpora.
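As a quick sanity check of the merge behavior, the sketch below (restating rrf with hypothetical document IDs) shows that a document ranked mid-list by both retrievers outscores documents that appear in only one list:

```python
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

dense = ["sem-1", "shared", "sem-2"]   # hypothetical dense top-3
sparse = ["kw-1", "shared", "kw-2"]    # hypothetical BM25 top-3
merged = rrf([dense, sparse])
# "shared" collects 1/(60+1) from each list (~0.0328 total), beating every
# single-list document, whose best possible score is 1/60 (~0.0167).
```

The tie-breaking for documents with equal scores falls back to insertion order, which is arbitrary; the reranking step below resolves those ties properly.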
Hybrid beats pure dense on keyword-heavy and rare-token queries
Dense models generalize well but fail predictably on:
- Exact identifiers: error codes, function names, SKUs, UUIDs.
- Rare proper nouns: product names, people, organizations not well-represented in training.
- Numeric values: dates, prices, version numbers where similarity is meaningless.
- Short queries with one operative word: “EADDRINUSE” means nothing to an embedding model trained on prose.
BM25 handles all of these exactly. For general conceptual queries (“how do I handle port conflicts”), dense dominates. The hybrid covers both cases and rarely underperforms either retriever in isolation. Verify this claim on your own corpus using embeddings-eval.
Apply metadata pre-filters before both retrievers, not after
Pre-filtering by metadata (date range, user ID, document type) is cheaper than post-filtering from a large merged result list. Apply filters at the index level:
-- pgvector example with pre-filter; table and column names illustrative
SELECT id, content
FROM documents
WHERE metadata->>'category' = 'support'
ORDER BY embedding <#> $1::vector
LIMIT 20

Post-filtering a large k list to hit a small target wastes retrieval bandwidth and can produce empty result sets when the metadata filter is selective. See pinecone-vs-pgvector for filter support differences between vector stores.
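The empty-result failure mode is easy to reproduce in miniature. In this toy sketch (scores and metadata invented for illustration), post-filtering a global top-3 returns nothing because no 'support' document ranked high enough, while pre-filtering retrieves from the eligible subset:

```python
corpus = [
    {"id": "a", "category": "billing", "score": 0.95},
    {"id": "b", "category": "billing", "score": 0.93},
    {"id": "c", "category": "billing", "score": 0.91},
    {"id": "d", "category": "support", "score": 0.88},
    {"id": "e", "category": "support", "score": 0.85},
]

def top_k(docs, k):
    return sorted(docs, key=lambda d: d["score"], reverse=True)[:k]

# Post-filter: retrieve top-3 globally, then filter -- empty result,
# because the global top-3 are all 'billing' documents.
post = [d for d in top_k(corpus, 3) if d["category"] == "support"]

# Pre-filter: restrict to the category first, then take top-3.
pre = top_k([d for d in corpus if d["category"] == "support"], 3)
```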
Query routing reduces cost on query categories that do not need hybrid
Not every query benefits from running both retrievers. If query logs show that 60 percent of queries are exact identifier lookups, route those directly to BM25 and skip the embedding call. Route semantic queries to dense-only. Route ambiguous queries to hybrid.
A lightweight query classifier (even a simple keyword pattern) can reduce embedding API calls by 20 to 40 percent in production. Measure the cost before adding complexity; for low-volume applications, always run hybrid.
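A minimal pattern-based router might look like the sketch below. The patterns, thresholds, and route names are illustrative, not a production classifier; tune them against your own query logs:

```python
import re

# Heuristic patterns for identifier-style queries (assumptions, not exhaustive).
IDENTIFIER_PATTERNS = [
    re.compile(r"^[A-Z][A-Z0-9_]{4,}$"),        # error codes like EADDRINUSE
    re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-"),  # UUID prefix
    re.compile(r"\bv?\d+\.\d+\.\d+\b"),         # semantic version numbers
]

def route(query: str) -> str:
    q = query.strip()
    if any(p.search(q) for p in IDENTIFIER_PATTERNS):
        return "sparse"   # exact lookup: skip the embedding call
    if len(q.split()) >= 4:
        return "dense"    # longer conceptual queries
    return "hybrid"       # short and ambiguous: run both retrievers
```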
Re-ranking after hybrid retrieval improves precision
RRF maximizes recall, not precision. After merging to top-20, pass the candidates to a cross-encoder reranker (Cohere Rerank, Voyage Rerank, or a local cross-encoder/ms-marco model) to reorder by relevance. The reranker sees the full query and document pair, not just embeddings. See rag-retrieval for the full retrieval-to-rerank pipeline.
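The rerank step itself reduces to "score each (query, document) pair, sort by score, keep the top n". A sketch with the cross-encoder abstracted behind a callable follows; in production, score_fn would wrap the reranker API or local model, and the token-overlap scorer here is only a stand-in for illustration:

```python
from typing import Callable

def rerank(query: str, docs: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    # Unlike the bi-encoder embeddings used during retrieval, the scorer
    # sees the full (query, document) pair at once.
    scored = [(score_fn(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Stand-in scorer: fraction of query tokens present in the document.
def overlap_score(query: str, doc: str) -> float:
    q_tokens = set(query.lower().split())
    return len(q_tokens & set(doc.lower().split())) / len(q_tokens)

candidates = ["tuning BM25 parameters", "port already in use error", "ranking fusion"]
top = rerank("port in use", candidates, overlap_score, top_n=2)
```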