You have two ranked lists from different retrieval systems — say, pgvector cosine similarity and Full-Text Search and ParadeDB BM25 (Best Matching 25 — the standard keyword relevance scoring algorithm used by Elasticsearch, Solr, and ParadeDB) keyword search. Each system returns documents ordered by relevance, but their scores live on completely different scales:
Cosine similarity: ∈[0,1]
BM25: ∈[0,∞)
You cannot simply compare or combine these scores. You need a principled way to merge the two ranked lists into a single ranking.
Naive approaches and why they fail
Raw score addition. Just sum the scores from each system. This fails because the scales are incompatible. A BM25 score of 12.7 dwarfs a cosine score of 0.93, so keyword search would dominate regardless of actual relevance.
Score normalization (min-max). Rescale each system’s scores to [0,1] before adding. This requires knowing the global min and max for each system, is sensitive to outliers (one extreme score compresses all others), and the distribution shapes may still differ (one uniform, one exponential).
Rank averaging. Average each document’s rank across lists. This ignores the magnitude of gaps between ranks. The difference between rank 1 and rank 2 might represent a massive relevance drop, while rank 50 vs 51 are nearly identical — but rank averaging treats both gaps equally.
The RRF formula
Reciprocal Rank Fusion (Cormack, Clarke & Butt, 2009) sidesteps score comparison entirely by working only with ranks:
RRF(d)=∑r∈Rk+rankr(d)1
Where:
R = set of ranked lists (e.g., vector search and BM25)
rankr(d) = rank of document d in list r (1-indexed)
k = smoothing constant (typically 60)
0-indexed variant
Some implementations use 0-indexed ranks with the denominator k+rank+1. This is algebraically equivalent when ranks start at 0 instead of 1. Example:
let rrf_score = 1.0 / (k as f64 + rank as f64 + 1.0);
where rank comes from enumerate() (0-indexed).
Why k = 60
The constant k dampens the advantage of top ranks.
Without k:
11 vs 21=2× difference
With k=60:
611 vs 621≈1.6% difference
This prevents a single retrieval system from dominating the fused result just because it assigned rank 1. The value 60 was empirically chosen in the original paper and has held up well across diverse benchmarks.
The additive property
Documents appearing in multiple lists accumulate scores from each list. This is the key insight: agreement across retrieval methods is a strong relevance signal.
A document ranked 5th in both lists scores:
60+51+60+51=652≈0.0308
A document ranked 1st in only one list scores:
60+11=611≈0.0164
The document that both systems agree on wins, even though it was never the top result in either system individually.
Worked example
Two retrieval systems, five results each
Vector search results:
Rank
Document
1
Doc A
2
Doc B
3
Doc C
4
Doc D
5
Doc E
BM25 results:
Rank
Document
1
Doc C
2
Doc F
3
Doc A
4
Doc G
5
Doc B
RRF scores (using k=60, 1-indexed):
Document
Vector contribution
BM25 contribution
Total RRF score
Doc C
1/63=0.01587
1/61=0.01639
0.03226
Doc A
1/61=0.01639
1/63=0.01587
0.03226
Doc B
1/62=0.01613
1/65=0.01538
0.03151
Doc F
—
1/62=0.01613
0.01613
Doc D
1/64=0.01563
—
0.01563
Doc E
1/65=0.01538
—
0.01538
Doc G
—
1/64=0.01563
0.01563
Final ranking: C, A, B, F, G/D (tie), E
Documents appearing in both lists (C, A, B) dominate the top, even though none of them were unanimously ranked first.
Implementation pattern
The standard implementation uses a HashMap to accumulate scores:
For each ranked list, iterate over results with their rank index
For each document, add 1/(k+rank) to its accumulated score in the map
Sort all documents by accumulated score descending
Truncate to the desired limit
Implementation note
A typical reciprocal_rank_fusion() function takes the raw row results from both the vector search and BM25 queries, accumulates RRF scores in a HashMap<(Uuid, i32), (f64, String, String)> keyed by (content_id, chunk_index), then sorts and truncates. The BM25 query uses ParadeDB’s @@@ operator and gracefully falls back to empty results if ParadeDB is not installed, making the system degrade to vector-only ranking.
Alternatives
Method
Pros
Cons
Convex combinationα⋅s1+(1−α)⋅s2 (a weighted average where α and 1−α are non-negative and sum to 1)
Simple, tunable
Requires score normalization and tuning α per domain
Learned ranking (LambdaMART — a gradient-boosted decision tree algorithm trained to rank documents by optimising a ranking loss function on labelled query-document pairs)
Can capture complex interactions
Needs training data, expensive to maintain
Cross-encoder reranking (a model that takes the full (query, document) pair as joint input and produces a relevance score — more accurate than bi-encoders that embed query and document independently)
Highest quality
Slow (runs full model per doc), impractical for large candidate sets
RRF
No hyperparameters beyond k, no training data, no score normalization
Cannot weight one system over another without modification
When to reach for something else
RRF treats all retrieval systems equally. If you know that one system is significantly more reliable than another for your domain, a weighted variant or convex combination may outperform RRF. But as a zero-configuration baseline, RRF is hard to beat.
See also
pgvector — vector similarity search (one leg of hybrid retrieval)