Reranking Isn't Always Better: A Tale of Two RAG Systems

TL;DR: I built two RAG pipelines. In one, adding a reranker dropped accuracy by 7%. In the other, it improved accuracy by 7%. Same technique, opposite results. Here's why — and how to know which outcome you'll get.

The Conventional Wisdom

If you're building a RAG pipeline, you've probably seen this advice: "Add a reranker after initial retrieval to improve precision." The logic is sound — a cross-encoder reads the query and each candidate passage together through a transformer, giving more accurate relevance scores than the initial embedding similarity.
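To make the distinction concrete, here's a toy sketch of the two scoring styles. The stand-in model, vectors, and separator are illustrative, not components from either system:

```python
# Bi-encoder (initial retrieval): query and passage are embedded
# independently; relevance is a similarity between precomputed vectors.
# Fast, but the model never sees query and passage together.
def bi_encoder_score(query_vec, passage_vec):
    return sum(q * p for q, p in zip(query_vec, passage_vec))

# Cross-encoder (reranking): the model reads the (query, passage) pair
# jointly and emits one relevance score. `model` is a stand-in; in practice
# this is a full transformer forward pass per pair, which is why it's slow.
def cross_encoder_score(query, passage, model):
    return model(query + " [SEP] " + passage)

# Rerank: take the top-k candidates from the fast bi-encoder stage,
# then re-score and re-sort them with the cross-encoder.
def rerank(query, candidates, model, top_k=10):
    pool = candidates[:top_k]
    return sorted(pool, key=lambda p: cross_encoder_score(query, p, model),
                  reverse=True)
```

The two-stage shape is the key point: the bi-encoder narrows millions of passages to a handful cheaply, and the cross-encoder spends its expensive per-pair computation only on that handful.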

So I added one. And it made things worse.

Then I built a second RAG system on a different corpus. Added the same reranking step. This time it helped.

What changed?

Financial Report Search: -7% MRR — reranking made it worse.

Academic Paper Search: +7% MRR — reranking made it better.

System A: Financial Report Search

The first system retrieves passages from a 160-page annual report. Analysts ask questions like "What was the revenue growth?" and need the right paragraph to surface first.

After testing 11 configurations of chunking, embedding, and retrieval, I landed on a winner: sentence-based chunking at 500 characters with no overlap, using OpenAI's large embedding model and pure vector retrieval. MRR (Mean Reciprocal Rank) was 0.833 — meaning the correct result appeared at position 1 most of the time.
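As a refresher, MRR averages the reciprocal rank of the first correct result across queries. A quick sketch with illustrative doc IDs:

```python
def mean_reciprocal_rank(results):
    """results: list of (ranked_doc_ids, relevant_doc_id) pairs, one per
    query. Reciprocal rank is 1/position of the first hit, 0 if absent."""
    total = 0.0
    for ranked_ids, relevant_id in results:
        for pos, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == relevant_id:
                total += 1.0 / pos
                break
    return total / len(results)

# Three queries with hits at positions 1, 1, and 2:
# (1 + 1 + 0.5) / 3 ≈ 0.833
runs = [
    (["d3", "d1", "d7"], "d3"),
    (["d9", "d2", "d4"], "d9"),
    (["d5", "d8", "d6"], "d8"),
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.833
```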

Then I added Cohere's reranker to re-score the top 10 results.

Result:

| Metric  | Without Reranker | With Reranker | Delta     |
|---------|------------------|---------------|-----------|
| MRR     | 0.833            | 0.763         | -0.070    |
| NDCG@5  | 0.867            | 0.804         | -0.063    |
| Latency | 1 ms             | 4,198 ms      | +4,197 ms |

Accuracy dropped 7 percentage points. Latency increased 4,000x. The reranker made everything worse.

System B: Academic Paper Search

The second system retrieves passages from 1,000 academic papers. Queries come from the Open RAG Benchmark — 3,045 human-authored questions with ground-truth relevance labels.

After testing 15 configurations, I landed on a different winner: fixed-size chunking at 512 characters, MiniLM embeddings, and hybrid retrieval (combining vector search with BM25 keyword matching). MRR was 0.733 and NDCG@5 was 0.747 — good, but short of my 0.75 NDCG target.
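Hybrid retrieval needs a way to merge the vector and BM25 rankings. Reciprocal rank fusion (RRF) is one common scheme — shown here as an illustrative sketch with made-up doc IDs, not my exact pipeline:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked lists of doc IDs. Each doc scores
    sum(1 / (k + rank)) over the lists it appears in; k=60 is the
    constant from the original RRF formulation."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d2", "d5", "d1"]  # dense-retrieval ranking (illustrative)
bm25_hits = ["d5", "d9", "d2"]    # keyword ranking (illustrative)
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# → ['d5', 'd2', 'd9', 'd1']
```

Note how d5, ranked second by the vector search but first by BM25, wins the fused ranking — documents that both retrievers like float to the top.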

I added a cross-encoder reranker (ms-marco-MiniLM-L-6-v2) to re-score the top results.

Result:

| Metric  | Without Reranker | With Reranker | Delta  |
|---------|------------------|---------------|--------|
| MRR     | 0.733            | 0.789         | +0.056 |
| NDCG@5  | 0.747            | 0.797         | +0.050 |
| Latency | 110 s            | 252 s         | +142 s |

Accuracy improved: MRR rose by 0.056 (a 7.6% relative gain) and NDCG@5 reached 0.797, clearing the 0.75 target. Latency more than doubled — a real cost, but for a research tool where accuracy matters more than speed, that's the right tradeoff.

Why the Difference?

Two factors explain the opposite outcomes:

1. Baseline retrieval quality

In the financial system, the initial retrieval was already excellent. Sentence-based chunking on financial documents produced coherent passages — each sentence is a discrete fact ("Revenue grew 15%", "Operating expenses decreased 8%"). The embedding model captured these well. MRR of 0.833 means the right answer was usually first.

When your baseline is that strong, there's little room to improve — but plenty of room to regress. The reranker second-guessed rankings that were already correct.

In the academic system, the initial retrieval was good but not great. MRR of 0.733 meant about 1 in 4 queries had the correct result below position 1. There was room to improve, and the reranker found it.

The pattern: Reranking helps most when initial retrieval is noisy. When upstream quality is already high, it can hurt. Or put another way: investing in upstream chunking quality pays more dividends than bolting on downstream corrections.

2. Reranker-corpus domain fit

The Cohere reranker (used on financial documents) is a general-purpose model trained on diverse web text. Financial documents have a specific structure — dense with numbers, formal language, section headings that matter. The reranker wasn't trained on this domain and made mistakes a human wouldn't.

The ms-marco reranker (used on academic papers) was trained on MS MARCO, which includes academic and technical text. It aligned better with academic papers than a general web reranker did with financial reports.

The pattern: Rerankers aren't generic. Their training domain affects whether they improve or degrade your results.

How to Know If Reranking Will Help You

Before adding a reranker, ask:

  1. What's your baseline MRR? If it's above 0.80, you're already ranking correctly most of the time. The reranker is more likely to shuffle correct rankings down than to surface buried gems.
  2. How close is your reranker's training domain to your corpus? A general-purpose reranker on domain-specific text is a gamble. A domain-matched reranker (or a fine-tuned one) is safer.
  3. What's your latency budget? Reranking adds a forward pass through a transformer for every candidate. If you're reranking 10 candidates, that's 10 inferences. The accuracy gain may not justify the latency cost for real-time applications.
  4. How large is your candidate pool? Reranking 10 results is fast. Reranking 100+ gets slow. If your retrieval returns a large candidate set, you'll need to either truncate before reranking or accept significant latency.
  5. Can you afford to be wrong? Run the experiment. Measure before/after. Reranking is an empirical question, not a theoretical one.
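Point 5 deserves a concrete shape. A minimal before/after harness — the `retrieve` and `rerank` callables and the labeled queries are placeholders you'd supply from your own pipeline:

```python
def evaluate_mrr(queries, rank_fn):
    """queries: list of (query, relevant_doc_id) pairs.
    rank_fn: maps a query to a ranked list of doc IDs."""
    total = 0.0
    for query, relevant in queries:
        ranked = rank_fn(query)
        # Reciprocal rank of the first hit, 0 if the answer never surfaces.
        total += next((1.0 / pos for pos, d in enumerate(ranked, 1)
                       if d == relevant), 0.0)
    return total / len(queries)

def compare(queries, retrieve, rerank, top_k=10):
    """Measure MRR with and without reranking the top-k candidates."""
    baseline = evaluate_mrr(queries, retrieve)
    reranked = evaluate_mrr(queries, lambda q: rerank(q, retrieve(q)[:top_k]))
    return baseline, reranked  # keep the reranker only if reranked > baseline
```

Run it on a held-out labeled set before shipping either configuration; the two systems above would have given opposite answers, which is exactly the point.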

The Takeaway

Reranking isn't universally good or bad. It's a tool that helps in some contexts and hurts in others. The difference is:

  • Baseline quality: High baseline → less room to improve, more risk of regression
  • Domain fit: Mismatched training domain → more likely to make mistakes

Don't add a reranker because "best practices" say to. Add it because your data shows it helps.

And if your baseline retrieval is already strong, consider the inverse: maybe you don't need a reranker at all. The time you'd spend tuning one might be better spent improving your chunking strategy — get the upstream right, and you won't need downstream corrections.

This finding surprised me. I expected reranking to be a safe "always add it" step. Building both systems and getting opposite results taught me that RAG components are context-dependent. The evaluation methodology transfers across domains; the specific configuration choices don't.