Reranking Isn't Always Better: A Tale of Two RAG Systems
TL;DR: I built two RAG pipelines. In one, adding a reranker dropped accuracy by 7%. In the other, it improved accuracy by 7%. Same technique, opposite results. Here's why — and how to know which outcome you'll get.
The Conventional Wisdom
If you're building a RAG pipeline, you've probably seen this advice: "Add a reranker after initial retrieval to improve precision." The logic is sound — a cross-encoder reads the query and each candidate passage together through a transformer, giving more accurate relevance scores than the initial embedding similarity.
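To make the two-stage shape concrete, here's a minimal pure-Python sketch: stage one ranks passages by cosine similarity between independently computed embeddings (the bi-encoder step), and stage two re-scores a short candidate list with a function that sees the query and passage together. The `score_fn` here is a stand-in for a real cross-encoder forward pass, and all function names are illustrative, not from any library.

```python
import math

def retrieve_top_k(query_vec, passage_vecs, k):
    # Stage 1: bi-encoder retrieval. Query and passages were embedded
    # independently; relevance is approximated by cosine similarity.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    sims = [(cosine(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    sims.sort(reverse=True)
    return [i for _, i in sims[:k]]

def rerank(query, passages, candidate_ids, score_fn):
    # Stage 2: reranking. score_fn reads the (query, passage) pair
    # jointly; in a real system this is a cross-encoder inference.
    return sorted(candidate_ids,
                  key=lambda i: score_fn(query, passages[i]),
                  reverse=True)
```

The key structural point: the reranker only ever sees the candidates stage one hands it, so it can reorder a good list into a worse one but can never recover a passage retrieval missed.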
So I added one. And it made things worse.
Then I built a second RAG system on a different corpus. Added the same reranking step. This time it helped.
What changed?
| System | MRR change | Outcome |
|---|---|---|
| Financial Report Search | -7% | Reranking made it worse |
| Academic Paper Search | +7% | Reranking made it better |
System A: Financial Report Search
The first system retrieves passages from a 160-page annual report. Analysts ask questions like "What was the revenue growth?" and need the right paragraph to surface first.
After testing 11 configurations of chunking, embedding, and retrieval, I landed on a winner: sentence-based chunking at 500 characters with no overlap, using OpenAI's large embedding model and pure vector retrieval. MRR (Mean Reciprocal Rank) was 0.833 — meaning the correct result appeared at position 1 most of the time.
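For reference, MRR is simple to compute: for each query, take the reciprocal of the rank of the first correct result, then average across queries. A minimal implementation (function and parameter names are mine, not from any library):

```python
def mean_reciprocal_rank(ranked_ids_per_query, relevant_id_per_query):
    # Sum 1/rank of the first relevant hit per query; a query with no
    # relevant hit in its ranking contributes 0.
    total = 0.0
    for ranked_ids, relevant_id in zip(ranked_ids_per_query, relevant_id_per_query):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == relevant_id:
                total += 1.0 / rank
                break
    return total / len(relevant_id_per_query)
```

An MRR of 0.833 is consistent with, for example, the right passage ranking first on most queries and second or third on the rest.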
Then I added Cohere's reranker to re-score the top 10 results.
Result:
| Metric | Without Reranker | With Reranker | Delta |
|---|---|---|---|
| MRR | 0.833 | 0.763 | -0.070 |
| NDCG@5 | 0.867 | 0.804 | -0.063 |
| Latency | 1ms | 4,198ms | +4,197ms |
Accuracy dropped 7 percentage points. Latency increased roughly 4,200x. The reranker made everything worse.
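For context on the second metric: NDCG@5 rewards rankings that place relevant passages near the top, discounting each position logarithmically. With binary relevance labels it reduces to a short function; this is a generic sketch, not necessarily the exact scorer behind the numbers above.

```python
import math

def ndcg_at_k(ranked_relevances, k=5):
    # ranked_relevances: 0/1 labels in ranked order. DCG over the top k,
    # normalized by the DCG of an ideal (relevance-sorted) ranking.
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

Unlike MRR, NDCG@5 credits relevant passages anywhere in the top 5, so it dropping alongside MRR means the reranker wasn't just demoting the top hit to second place; it was pushing relevant passages down the list.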
System B: Academic Paper Search
The second system retrieves passages from 1,000 academic papers, evaluated against the Open RAG Benchmark: 3,045 human-authored queries with ground-truth relevance labels.
After testing 15 configurations, I landed on a different winner: fixed-size chunking at 512 characters, MiniLM embeddings, and hybrid retrieval (combining vector search with BM25 keyword matching). MRR was 0.733 and NDCG@5 was 0.747: good, but just under my 0.75 NDCG target.
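Hybrid retrieval needs a way to merge the vector ranking and the BM25 ranking into one list. Reciprocal rank fusion is one common scheme; I'm showing it as an illustration of the idea, not as the exact fusion this system used.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: a list of ranked doc-id lists (e.g., one from vector
    # search, one from BM25). Each list contributes 1/(k + rank) per
    # document; k=60 is the conventional default from the RRF paper.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The appeal of rank-based fusion is that it sidesteps the problem of BM25 scores and cosine similarities living on incompatible scales.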
I added a cross-encoder reranker (ms-marco-MiniLM-L-6-v2) to re-score the top results.
Result:
| Metric | Without Reranker | With Reranker | Delta |
|---|---|---|---|
| MRR | 0.733 | 0.789 | +0.056 |
| NDCG@5 | 0.747 | 0.797 | +0.050 |
| Latency | 110s | 252s | +142s |
Accuracy improved 7 percentage points, closing the gap to my quality targets. Latency more than doubled (110s to 252s), a real cost, but for a research tool where accuracy matters more than speed, that's the right tradeoff.
Why the Difference?
Two factors explain the opposite outcomes:
1. Baseline retrieval quality
In the financial system, the initial retrieval was already excellent. Sentence-based chunking on financial documents produced coherent passages — each sentence is a discrete fact ("Revenue grew 15%", "Operating expenses decreased 8%"). The embedding model captured these well. MRR of 0.833 means the right answer was usually first.
When your baseline is that strong, there's little room to improve — but plenty of room to regress. The reranker second-guessed rankings that were already correct.
In the academic system, the initial retrieval was good but not great. MRR of 0.733 meant about 1 in 4 queries had the correct result below position 1. There was room to improve, and the reranker found it.
The pattern: Reranking helps most when initial retrieval is noisy. When upstream quality is already high, it can hurt. Or put another way: investing in upstream chunking quality pays more dividends than bolting on downstream corrections.
2. Reranker-corpus domain fit
The Cohere reranker (used on financial documents) is a general-purpose model trained on diverse web text. Financial documents have a specific structure — dense with numbers, formal language, section headings that matter. The reranker wasn't trained on this domain and made mistakes a human wouldn't.
The ms-marco reranker (used on academic papers) was trained on MS MARCO, which includes academic and technical text. It aligned better with academic papers than a general web reranker did with financial reports.
The pattern: Rerankers aren't generic. Their training domain affects whether they improve or degrade your results.
How to Know If Reranking Will Help You
Before adding a reranker, ask:
- What's your baseline MRR? If it's above 0.80, you're already ranking correctly most of the time. The reranker is more likely to shuffle correct rankings down than to surface buried gems.
- How close is your reranker's training domain to your corpus? A general-purpose reranker on domain-specific text is a gamble. A domain-matched reranker (or a fine-tuned one) is safer.
- What's your latency budget? Reranking adds a forward pass through a transformer for every candidate. If you're reranking 10 candidates, that's 10 inferences. The accuracy gain may not justify the latency cost for real-time applications.
- How large is your candidate pool? Reranking 10 results is fast. Reranking 100+ gets slow. If your retrieval returns a large candidate set, you'll need to either truncate before reranking or accept significant latency.
- Can you afford to be wrong? Run the experiment. Measure before/after. Reranking is an empirical question, not a theoretical one.
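A minimal harness for that before/after measurement might look like the sketch below. Here `retrieve` and `rerank` are whatever your stack provides; the interfaces are assumptions, not any specific library's API.

```python
def reranking_delta(queries, relevant_ids, retrieve, rerank):
    # retrieve(q) -> ranked list of doc ids.
    # rerank(q, ids) -> the same ids, reordered.
    # Returns (baseline MRR, reranked MRR, delta).
    def mrr(rank_fn):
        total = 0.0
        for q, rel in zip(queries, relevant_ids):
            for rank, doc_id in enumerate(rank_fn(q), start=1):
                if doc_id == rel:
                    total += 1.0 / rank
                    break
        return total / len(queries)

    base = mrr(retrieve)
    reranked = mrr(lambda q: rerank(q, retrieve(q)))
    return base, reranked, reranked - base
```

If the delta is negative on a held-out query set, as it was for my financial system, that's your answer: skip the reranker.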
The Takeaway
Reranking isn't universally good or bad. It's a tool that helps in some contexts and hurts in others. The difference is:
- Baseline quality: high baseline → less room to improve, more risk of regression
- Domain fit: mismatched training domain → more likely to make mistakes
Don't add a reranker because "best practices" say to. Add it because your data shows it helps.
And if your baseline retrieval is already strong, consider the inverse: maybe you don't need a reranker at all. The time you'd spend tuning one might be better spent improving your chunking strategy — get the upstream right, and you won't need downstream corrections.
This finding surprised me. I expected reranking to be a safe "always add it" step. Building both systems and getting opposite results taught me that RAG components are context-dependent. The evaluation methodology transfers across domains; the specific configuration choices don't.