Cross Encoder Reranking: The Low-Cost Fix for RAG Misses
Cut RAG hallucinations and misses using cross-encoder reranking. Learn optimal rerank depth, caching strategies, and ColBERT tradeoffs for balancing accuracy and throughput.
When your RAG system pulls passages that look relevant but cause the LLM to cite wrong facts or miss important nuances, the problem usually starts with first-stage retrieval. BM25 and dense bi-encoders are great at broad recall, but they really struggle with subtle intent. Things like negations, temporal constraints, or domain-specific phrasing trip them up. A cross-encoder reranker scores the query and passage together, which means it catches those close-but-wrong candidates before they ever reach the LLM. If you want to take your retrieval pipeline even further, check out our guide on retrieval tricks to boost answer accuracy. It covers practical chunking, metadata filtering, and hybrid retrieval strategies that work really well alongside the reranking approaches we're discussing here.
This guide explains why first-stage retrieval fails when you're dealing with nuanced queries. You'll see how cross-encoder reranking delivers that top-rank precision you need, and what latency tradeoff you're accepting. By the end, you'll have a simple decision rule for when to deploy reranking, plus a minimal set of defaults you can use to measure the actual impact.

Why This Matters
The Limits of First-Stage Retrieval
First-stage retrieval optimizes for recall. You're casting a wide net to surface plausible candidates quickly. BM25 matches keywords but completely ignores semantics. Dense bi-encoders like all-MiniLM-L6-v2 encode the query and passage independently, so they miss those fine-grained interactions. Negations, conditional clauses, all that stuff just flies right past them.
How Misaligned Candidates Break Your LLM
Here's what happens in practice. Your top 30 candidates include passages that share vocabulary with the query but actually contradict what it's asking for. Picture your LLM seeing passages like "FDA allows off-label promotion in certain contexts" sitting right next to "FDA prohibits off-label promotion to physicians." The model might cite the wrong one. Or worse, it hedges and gives you this vague, low-confidence answer that helps nobody.
Why Precision at the Top Matters More Than Recall
Precision at top k matters way more than recall. Think about it. If your LLM only sees 5 to 10 passages, just one close-but-wrong document can trigger a hallucination or incorrect citation. Cross-encoder reranking re-scores your top candidates using joint attention over query and passage tokens. It brings the truly relevant passages to the surface and pushes those near misses down where they belong.
The Tradeoff: Latency vs. Accuracy
Now, there's a tradeoff here. Reranking adds about 100 to 200 ms of latency per query when you're dealing with around 30 candidates. If your application can handle that, and you're seeing precision failures like answer correctness below 85 percent or nDCG at 10 below baseline, then reranking is absolutely worth deploying.
How It Works
1. First-stage retrieval is independent and fast
BM25 and bi-encoders score the query and passage separately. BM25 just counts term overlap. Bi-encoders compare precomputed embeddings using cosine similarity. Neither model actually sees the query and passage together, which is the whole problem. They miss those contextual cues. Negations, qualifiers, domain-specific phrasing that completely flips the meaning, all invisible to them.
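To see what "independent" means concretely, here's a minimal sketch using the all-MiniLM-L6-v2 bi-encoder mentioned earlier (the query and passages are illustrative, not from a real corpus). Each text gets its own embedding, and relevance is just cosine similarity between vectors that were computed without ever seeing each other:

from sentence_transformers import SentenceTransformer, util

# Bi-encoder: query and passages are embedded separately, then compared.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

query = "Does the FDA allow off-label promotion of drugs to physicians?"
passages = [
    "FDA prohibits off-label promotion to physicians.",
    "FDA allows off-label promotion in certain contexts.",
]

# Each text is encoded on its own; the model never sees query and passage together.
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
passage_embs = bi_encoder.encode(passages, convert_to_tensor=True)

# Cosine similarity over precomputed embeddings. Both passages share vocabulary
# with the query, so they can land close together despite opposite meanings.
scores = util.cos_sim(query_emb, passage_embs)[0]
for passage, score in zip(passages, scores):
    print(f"{score.item():.3f}  {passage}")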
2. Cross-encoders use joint attention for precision
A cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 does something different. It concatenates the query and passage into a single input: [CLS] query [SEP] passage [SEP]. The transformer attention layers process both together. The model learns token-level interactions, so it catches those subtle mismatches. "FDA prohibits" versus "FDA allows", that kind of thing that independent encoders just don't see.
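Here's the same illustrative query scored with the cross-encoder instead, as a minimal sketch. Each pair goes through a single forward pass, so attention reads both texts at once:

from sentence_transformers import CrossEncoder

# Cross-encoder: each (query, passage) pair is scored jointly in one forward pass.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Does the FDA allow off-label promotion of drugs to physicians?"
pairs = [
    (query, "FDA prohibits off-label promotion to physicians."),
    (query, "FDA allows off-label promotion in certain contexts."),
]

# Higher score means more relevant. Because the model reads query and passage
# together, it can weigh "prohibits" versus "allows" against the question's intent.
scores = reranker.predict(pairs)
for (_, passage), score in zip(pairs, scores):
    print(f"{score:.3f}  {passage}")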
3. A two-stage cascade balances speed and accuracy
Here's how you set it up. Retrieve 100 to 200 candidates with your fast first-stage methods. Could be BM25, dense, hybrid retrieval, whatever works. Then rerank the top 30 to 50 with a cross-encoder. Send the top 5 to 10 reranked passages to the LLM. If you want a hands-on walkthrough of building a production-ready multi-document agent with LlamaIndex, we've got a comprehensive guide. It covers retrieval, reranking, and summarization pipelines, the whole thing.
4. Latency cost is proportional to candidate count
Cross-encoders run a forward pass for each query-passage pair. Reranking 30 candidates takes about 100 to 150 ms on CPU, maybe 30 to 50 ms on a GPU like T4 or better. But if you try to rerank more than 100 candidates, you're looking at over 300 ms. A practical default that I've found works well is reranking the top 30 and returning the top 8. This fits most sub-500ms SLA budgets pretty comfortably.
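Those numbers depend heavily on hardware and batch size, so measure on your own stack before committing to an SLA. A minimal timing sketch, assuming the same MiniLM cross-encoder and synthetic passages:

import time
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "Does the FDA allow off-label promotion of drugs to physicians?"

# Synthetic passages just to exercise the model; swap in real candidates.
passages = [f"Passage {i} about FDA promotion rules." for i in range(100)]

for depth in (10, 30, 50, 100):
    pairs = [(query, p) for p in passages[:depth]]
    start = time.perf_counter()
    reranker.predict(pairs, batch_size=32)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"rerank depth {depth:>3}: {elapsed_ms:.0f} ms")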
What You Should Do
Start with this proven default configuration:
Retrieve 150 candidates using whatever first-stage method you already have. BM25, dense, hybrid retrieval, they're all fine.
Rerank the top 30 with cross-encoder/ms-marco-MiniLM-L-6-v2. This 6-layer MiniLM model gives you a really solid precision and speed tradeoff.
Send the top 8 reranked passages to your LLM prompt.
Set a 200 ms timeout for reranking. If it times out, just fall back to your first-stage top k. There's a sketch of this fallback right after this list.
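Here's a minimal sketch of that timeout fallback. It runs the rerank call in a worker thread so the request can fall back to first-stage order after 200 ms; the rerank placeholder below stands in for a real cross-encoder call like the rerank_candidates helper shown later in this guide.

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout
from typing import List, Tuple

def rerank(query: str, candidates: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    # Placeholder: replace with your cross-encoder reranking call
    # (see rerank_candidates later in this guide).
    return [(pid, 0.0) for pid, _ in candidates]

_executor = ThreadPoolExecutor(max_workers=1)

def rerank_with_fallback(
    query: str,
    candidates: List[Tuple[str, str]],
    timeout_s: float = 0.2,
    top_k: int = 8,
) -> List[str]:
    """Return reranked passage ids, or first-stage order if reranking times out."""
    future = _executor.submit(rerank, query, candidates)
    try:
        scored = future.result(timeout=timeout_s)
        return [pid for pid, _ in scored[:top_k]]
    except FuturesTimeout:
        # Fall back to first-stage ordering. The stray rerank call still finishes
        # in the background; predict() is not cancellable mid-batch.
        return [pid for pid, _ in candidates[:top_k]]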
Measure impact with two metrics:
nDCG at 10 on a labeled eval set with query and gold passage IDs. If your first-stage precision is weak, reranking should lift nDCG by 5 to 15 points. There's a small nDCG sketch right after this list.
Answer correctness using an LLM-as-judge or human evaluation. Track whether reranking actually reduces hallucinations or incorrect citations in your application.
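If you don't already have an nDCG implementation handy, here's a minimal binary-relevance sketch. The eval data below (gold passage IDs and the two rankings) is hypothetical, just to show the before-and-after comparison:

import math
from typing import Dict, List, Set

def ndcg_at_k(ranked_ids: List[str], gold_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k: gold passages count as relevance 1, everything else 0."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, pid in enumerate(ranked_ids[:k])
        if pid in gold_ids
    )
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical eval set: per query, the gold passage ids plus both rankings.
gold: Dict[str, Set[str]] = {"q1": {"doc_3", "doc_7"}}
first_stage = {"q1": ["doc_12", "doc_3", "doc_5", "doc_9", "doc_7"]}
reranked = {"q1": ["doc_3", "doc_7", "doc_12", "doc_5", "doc_9"]}

for qid in gold:
    before = ndcg_at_k(first_stage[qid], gold[qid])
    after = ndcg_at_k(reranked[qid], gold[qid])
    print(f"{qid}: nDCG@10 {before:.3f} -> {after:.3f}")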
One more thing worth knowing: tokenization pitfalls can silently degrade both retrieval and reranking accuracy. Our explainer goes into detail about how invisible characters and Unicode quirks can impact both evaluation and production outcomes.
When to deploy reranking:
Precision at 5 from first-stage retrieval is below 60 percent on your eval set.
You're seeing incorrect or contradictory answers even though the relevant documents definitely exist in your index.
Your application can absorb that 100 to 200 ms of extra latency without breaking SLA.
When to skip reranking:
First-stage retrieval already achieves more than 80 percent precision at 10. This happens sometimes with a small, curated corpus and simple queries.
Your latency budget is under 200 ms end to end and there's no wiggle room.
Queries are keyword-heavy without much semantic nuance. Pure BM25 might actually be enough.
Here's how you integrate reranking into your pipeline:
This example shows how to add a cross-encoder reranker to your RAG pipeline. It retrieves first-stage candidates, batches them for efficient inference, and reranks with a cross-encoder. A separate sketch after the code shows how you might cache rerank scores in Redis to avoid redundant computation.
from typing import List, Tuple
from sentence_transformers import CrossEncoder

def get_first_stage_candidates(query: str, k: int = 150) -> List[Tuple[str, str]]:
    """
    Simulate first-stage retrieval (replace with your BM25/dense retriever).
    Returns list of (passage_id, passage_text).
    """
    return [(f"doc_{i}", f"Passage {i} for '{query}'") for i in range(k)]

def rerank_candidates(
    query: str,
    candidates: List[Tuple[str, str]],
    cross_encoder: CrossEncoder,
    top_n: int = 30,
) -> List[Tuple[str, float]]:
    """
    Rerank candidates using cross-encoder, return top_n scored passages.
    """
    # Score every (query, passage) pair in batches for efficient inference.
    input_pairs = [(query, ptext) for _, ptext in candidates]
    scores = cross_encoder.predict(input_pairs, batch_size=32)
    # Pair each passage id with its score and sort by descending relevance.
    scored = list(zip([pid for pid, _ in candidates], scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]

# Load model once at startup
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Pipeline: retrieve 150, rerank the top 30, keep the top 8
query = "Does FDA allow off-label promotion of drugs to physicians?"
candidates = get_first_stage_candidates(query, k=150)
reranked = rerank_candidates(query, candidates[:30], cross_encoder, top_n=8)

# Use top 8 passages in LLM prompt
top_passages = [pid for pid, _ in reranked]
print(f"Top passages for LLM: {top_passages}")
Conclusion: Key Takeaways
Cross-encoder reranking fixes close-but-wrong retrieval by scoring the query and passage jointly. It catches those nuances that first-stage methods miss. You're trading 100 to 200 ms of latency for measurably better precision at top k, which means fewer hallucinations and incorrect citations.
Deploy reranking when:
First-stage precision at 5 is below 60 percent
Answer correctness is suffering despite good recall
You can afford 100 to 200 ms in your latency budget
Default recipe: Retrieve 150. Rerank the top 30 with ms-marco-MiniLM-L-6-v2. Send the top 8 to the LLM. Measure nDCG at 10 and answer correctness to confirm you're actually seeing impact.
If you want to go further, look into late interaction models like ColBERT, caching strategies, or dynamic rerank depth under SLA pressure. You'll find these in the related explainers on production-scale reranking and retrieval optimization. You could also explore building a Knowledge Graph RAG system with Neo4j and embeddings for those context-rich, grounded answers.