Overview
Rerankers supply the precision that bi-encoder retrieval cannot. Embedding models encode the query and the chunk independently and compare them with a dot product; cross-encoders read the query and chunk together and score the pair. The pair-aware score is more accurate and slower, which is exactly why reranking sits between retrieval and the prompt. For the retrieval step that feeds the reranker, see rag-retrieval.
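To make the distinction concrete, here is a minimal sketch of the two scoring modes using the sentence-transformers library; the model names are illustrative placeholders, not recommendations.

```python
# Bi-encoder vs. cross-encoder scoring of one (query, chunk) pair.
from sentence_transformers import CrossEncoder, SentenceTransformer

query = "How do I rotate API keys without downtime?"
chunk = "Rotate keys with overlapping validity windows so old and new keys both work during cutover."

# Bi-encoder: embed query and chunk independently, then compare with a dot product.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_vec, c_vec = bi_encoder.encode([query, chunk])
bi_score = float(q_vec @ c_vec)

# Cross-encoder: read query and chunk together and score the pair directly.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pair_score = float(cross_encoder.predict([(query, chunk)])[0])
```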
Retrieve broad, rerank narrow
The pattern is two stages: a cheap recall stage that returns 30 to 100 candidates, then an accurate precision stage that returns the 3 to 8 chunks the prompt actually sees.
```
query
→ dense retrieval top 30
→ sparse retrieval top 30
→ RRF merge to 50
→ cross-encoder rerank to top 5
→ prompt
```

Skipping the rerank step is the most common reason a “well-tuned” RAG system answers wrong on borderline queries. The embedding got close; the reranker would have closed the gap.
Pick a hosted reranker for the easy path
Three production-ready hosted rerankers in 2026:
- Cohere Rerank 3: strong English baseline, multilingual variants, low latency, simple billing.
- Voyage rerank-2: pairs well with Voyage 3 embeddings; best when the embedding side is already Voyage. See embeddings.
- Jina Reranker v2: open-weight option with a hosted API; good price per million tokens.
Run a bake-off on the eval suite before committing. The right reranker is the one that wins on your golden set, not the one with the best benchmark in a paper. See rag-eval.
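The hosted path is a single API call. A minimal sketch with the Cohere Python SDK as one example; the model name and response fields follow Cohere's rerank endpoint, so verify them against the SDK version you install.

```python
# Hosted rerank sketch (Cohere SDK assumed; check the current model name).
import cohere

co = cohere.Client()  # API key via the CO_API_KEY environment variable or api_key=...

def rerank_hosted(query: str, candidates: list[str], top_n: int = 5):
    resp = co.rerank(
        model="rerank-english-v3.0",  # or a multilingual variant
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Results come back sorted by relevance; map indices back to the chunk text.
    return [(candidates[r.index], r.relevance_score) for r in resp.results]
```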
Self-host BGE-Reranker when latency or data residency demands it
BGE-Reranker (bge-reranker-large, bge-reranker-v2-m3) is the open-weight default. Run it on a single GPU; throughput hits hundreds of pairs per second on an A10 or L4.
```python
# Score (query, chunk) pairs with the open-weight BGE cross-encoder.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)
scores = reranker.compute_score([[query, chunk] for chunk in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])  # best first
```

- Pin the model version in your image; reranker weights change.
- Batch pairs aggressively; batched GPU calls cut latency roughly 10x versus scoring pairs one at a time (a quick timing sketch follows below).
Self-host when the corpus is sensitive, the latency budget is tight, or the volume makes a hosted API uneconomic.
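To check the batching claim on your own hardware, here is a rough timing sketch; it reuses the reranker, query, and candidates from the snippet above and compares one bulk call against a per-pair loop.

```python
# Rough comparison of batched vs. unbatched scoring; numbers vary by GPU and model.
import time

t0 = time.perf_counter()
reranker.compute_score([[query, chunk] for chunk in candidates])  # one batched call
batched_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
for chunk in candidates:
    reranker.compute_score([query, chunk])  # one pair per call, no batching
unbatched_ms = (time.perf_counter() - t0) * 1000

print(f"batched: {batched_ms:.0f} ms  unbatched: {unbatched_ms:.0f} ms")
```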
Combine retrievers with reciprocal rank fusion before rerank
When the candidate pool comes from multiple retrievers (dense, BM25, knowledge graph), merge with RRF before reranking. The reranker is expensive; do not score the same chunk twice.
```python
def rrf(rankings, k=60):
    """Merge ranked doc-id lists with reciprocal rank fusion."""
    scores = {}
    for r in rankings:
        for rank, doc_id in enumerate(r, start=1):  # RRF ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:50]
```

RRF is parameter-light (k=60 default), works with any number of retrievers, and does not need score calibration.
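A hedged usage example: two ranked id lists from hypothetical dense and BM25 retrievers, fused into a single candidate order for the reranker.

```python
# Hypothetical ranked doc-id lists from two retrievers, best match first.
dense_hits = ["doc_14", "doc_3", "doc_27", "doc_9"]
bm25_hits = ["doc_3", "doc_41", "doc_14", "doc_8"]

merged = rrf([dense_hits, bm25_hits])
print(merged[:3])  # doc_3 and doc_14 lead because both retrievers agree on them
```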
Budget the reranker on a latency target
Reranking is the slowest hop in the pipeline that is still cheap enough to keep. Set a budget; respect it.
- Hosted Rerank API: 50 to 200 ms for 50 pairs. Plan for the long tail at 500 ms.
- Self-hosted BGE on a GPU: 20 to 100 ms for 50 pairs at FP16, batched.
- CPU reranking: usable for prototypes only.
If the budget is under 150 ms total, drop to top-25 candidates or use a smaller reranker. Do not skip reranking; cut its scope.
Eval-drive the reranker choice
Reranker quality varies by corpus far more than leaderboards suggest. Pick with numbers, not vendor pitch.
- Score recall@5 and nDCG@5 against the gold set on three rerankers.
- Score the same chunks with no reranker as the baseline.
- Pick the reranker with the largest lift at acceptable latency.
A reranker that adds 1 point of accuracy at 4x the cost loses. See rag-eval for the harness.
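A minimal scoring sketch for the bake-off. The inputs are hypothetical: ranked_ids is the chunk-id order one reranker produced for a query, relevant_ids the gold chunks for that query; average both metrics over the query set per reranker and for the no-rerank baseline.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of gold chunks that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """nDCG with binary relevance: gain 1 for a gold chunk, 0 otherwise."""
    dcg = sum(1 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant_ids)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0
```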
Pass enough context, not too much
The reranked top-k goes into the prompt. More context is not always better.
- Start at top 5 chunks. Most generation models lose accuracy past 8 to 10 chunks per query.
- Order chunks by reranker score, highest first. Long-context models still weight the head of the prompt.
- Trim chunks to the relevant section when individual chunks are long; do not paste the full 800 tokens.
The reranker did the hard work. The prompt should reflect that. See rag-citations for how to format the chunks for generation.
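A hedged sketch of the packing step: keep the top five reranked chunks, order them best first, and trim each one to a rough budget. The 300-token cap and the whitespace split are placeholders for whatever trimming and tokenization the stack already uses.

```python
def pack_context(ranked, top_k=5, max_tokens_per_chunk=300):
    """ranked: (chunk_text, score) pairs already sorted best-first by the reranker."""
    parts = []
    for chunk, _score in ranked[:top_k]:
        words = chunk.split()                       # crude stand-in for real tokenization
        parts.append(" ".join(words[:max_tokens_per_chunk]))
    return "\n\n---\n\n".join(parts)                # highest-scored chunk goes first
```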