Reciprocal Rank Fusion

The problem

You have two ranked lists from different retrieval systems — say, pgvector cosine similarity and Full-Text Search and ParadeDB BM25 (Best Matching 25 — the standard keyword relevance scoring algorithm used by Elasticsearch, Solr, and ParadeDB) keyword search. Each system returns documents ordered by relevance, but their scores live on completely different scales:

  • Cosine similarity:
  • BM25:

You cannot simply compare or combine these scores. You need a principled way to merge the two ranked lists into a single ranking.

Naive approaches and why they fail

Raw score addition. Just sum the scores from each system. This fails because the scales are incompatible. A BM25 score of 12.7 dwarfs a cosine score of 0.93, so keyword search would dominate regardless of actual relevance.

Score normalization (min-max). Rescale each system’s scores to before adding. This requires knowing the global min and max for each system, is sensitive to outliers (one extreme score compresses all others), and the distribution shapes may still differ (one uniform, one exponential).

Rank averaging. Average each document’s rank across lists. This ignores the magnitude of gaps between ranks. The difference between rank 1 and rank 2 might represent a massive relevance drop, while rank 50 vs 51 are nearly identical — but rank averaging treats both gaps equally.

The RRF formula

Reciprocal Rank Fusion (Cormack, Clarke & Butt, 2009) sidesteps score comparison entirely by working only with ranks:

Where:

  • = set of ranked lists (e.g., vector search and BM25)
  • = rank of document in list (1-indexed)
  • = smoothing constant (typically 60)

0-indexed variant

Some implementations use 0-indexed ranks with the denominator . This is algebraically equivalent when ranks start at 0 instead of 1. Example:

let rrf_score = 1.0 / (k as f64 + rank as f64 + 1.0);

where rank comes from enumerate() (0-indexed).

Why k = 60

The constant dampens the advantage of top ranks.

Without :

With :

This prevents a single retrieval system from dominating the fused result just because it assigned rank 1. The value 60 was empirically chosen in the original paper and has held up well across diverse benchmarks.

The additive property

Documents appearing in multiple lists accumulate scores from each list. This is the key insight: agreement across retrieval methods is a strong relevance signal.

A document ranked 5th in both lists scores:

A document ranked 1st in only one list scores:

The document that both systems agree on wins, even though it was never the top result in either system individually.

Worked example

Two retrieval systems, five results each

Vector search results:

RankDocument
1Doc A
2Doc B
3Doc C
4Doc D
5Doc E

BM25 results:

RankDocument
1Doc C
2Doc F
3Doc A
4Doc G
5Doc B

RRF scores (using , 1-indexed):

DocumentVector contributionBM25 contributionTotal RRF score
Doc C0.03226
Doc A0.03226
Doc B0.03151
Doc F0.01613
Doc D0.01563
Doc E0.01538
Doc G0.01563

Final ranking: C, A, B, F, G/D (tie), E

Documents appearing in both lists (C, A, B) dominate the top, even though none of them were unanimously ranked first.

Implementation pattern

The standard implementation uses a HashMap to accumulate scores:

  1. For each ranked list, iterate over results with their rank index
  2. For each document, add to its accumulated score in the map
  3. Sort all documents by accumulated score descending
  4. Truncate to the desired limit

Implementation note

A typical reciprocal_rank_fusion() function takes the raw row results from both the vector search and BM25 queries, accumulates RRF scores in a HashMap<(Uuid, i32), (f64, String, String)> keyed by (content_id, chunk_index), then sorts and truncates. The BM25 query uses ParadeDB’s @@@ operator and gracefully falls back to empty results if ParadeDB is not installed, making the system degrade to vector-only ranking.

Alternatives

MethodProsCons
Convex combination (a weighted average where and are non-negative and sum to 1)Simple, tunableRequires score normalization and tuning per domain
Learned ranking (LambdaMART — a gradient-boosted decision tree algorithm trained to rank documents by optimising a ranking loss function on labelled query-document pairs)Can capture complex interactionsNeeds training data, expensive to maintain
Cross-encoder reranking (a model that takes the full (query, document) pair as joint input and produces a relevance score — more accurate than bi-encoders that embed query and document independently)Highest qualitySlow (runs full model per doc), impractical for large candidate sets
RRFNo hyperparameters beyond , no training data, no score normalizationCannot weight one system over another without modification

When to reach for something else

RRF treats all retrieval systems equally. If you know that one system is significantly more reliable than another for your domain, a weighted variant or convex combination may outperform RRF. But as a zero-configuration baseline, RRF is hard to beat.

See also