Full-Text Search and ParadeDB
Postgres Native Full-Text Search
Postgres ships with built-in full-text search based on two core types:
tsvector— a sorted list of lexemes (normalized words) with positional information, produced byto_tsvector('english', 'some text').tsquery— a search predicate composed of lexemes and boolean operators (&,|,!), produced byto_tsquery('english', 'rust & search').
The @@ operator matches a tsvector against a tsquery:
SELECT * FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & search');For performance, create a GIN index on the tsvector column:
CREATE INDEX documents_fts_idx ON documents USING gin(to_tsvector('english', body));Limitations of Native FTS
ts_rankuses raw term frequency (TF) — it counts how many times query terms appear in the document. It does not weight by inverse document frequency (IDF — a measure of how rare a term is across the whole corpus), so common words like “data” score the same as rare, distinctive terms.- No BM25 scoring. No term-frequency saturation. No document-length normalization. (These concepts are explained in the BM25 section below.)
ts_rank_cdadds cover density (proximity of terms) but still lacks IDF.
For serious search relevance, you need BM25.
BM25 Scoring
BM25 (Best Matching 25) is the standard ranking function in information retrieval, used by Elasticsearch, Solr, and now ParadeDB.
Where:
- = term frequency of query term in document
- = document length (number of terms), = average document length across the corpus
- (typically 1.2) = term frequency saturation parameter
- (typically 0.75) = length normalization parameter
- IDF is computed as:
where = total documents and = number of documents containing .
Intuition
Three key ideas make BM25 effective:
- Rare terms matter more (IDF). A term that appears in 5 out of 10,000 documents is far more informative than one appearing in 8,000. The IDF component upweights rare, distinctive terms.
- Diminishing returns on repetition (saturation). If a term appears once, that is strong signal. If it appears 50 times, the additional occurrences add progressively less. The parameter controls how quickly term frequency “saturates” — unlike raw TF which grows linearly, BM25’s numerator is bounded by .
- Longer documents are penalized (length normalization). A 10,000-word document will naturally contain more occurrences of any term than a 100-word document. The parameter scales the effective term frequency by , so a long document must have proportionally more hits to score as well as a short one.
The and defaults (1.2 and 0.75) work well for most corpora. Only tune them if you have domain-specific evidence — e.g., set closer to 0 if document length is not meaningful (all documents are roughly the same size).
ParadeDB
ParadeDB is a Postgres extension that embeds Tantivy — a Rust-based full-text search engine — directly inside Postgres. Tantivy is the Rust equivalent of Apache Lucene (the Java library that powers Elasticsearch and Solr): it manages inverted indexes, tokenization, and BM25 scoring, but runs inside the Postgres process rather than as a separate service. This gives you BM25-scored full-text search without leaving SQL.
Key Operators and Functions
| Syntax | Purpose |
|---|---|
@@@ | BM25 full-text search operator (replaces native @@) |
paradedb.parse(query_text) | Parses a query string into a Tantivy query |
paradedb.score(id) | Returns the BM25 score for a matched row |
Example query:
SELECT id, content, paradedb.score(id)::float8 AS score
FROM content_chunks
WHERE content_chunks @@@ paradedb.parse('hybrid search architecture')
ORDER BY paradedb.score(id)::float8 DESC
LIMIT 20;The
@@@operator is ParadeDB-specific — it routes the query through Tantivy's BM25 engine rather than Postgres's nativetsvector/tsquerypipeline.
The BM25 Index
ParadeDB registers a custom index access method in Postgres:
CREATE INDEX content_chunks_bm25_idx ON content_chunks
USING bm25 (content);
USING bm25is not a native Postgres index type. It is a custom access method registered by the ParadeDB extension. Under the hood, it builds a Tantivy inverted index — the same data structure that powers Elasticsearch — but stored and managed within Postgres's index infrastructure. The extension must be installed for this DDL to succeed.
This index supports:
- Tokenization and stemming — stemming reduces words to their root form so that “running”, “runs”, and “ran” all match a query for “run” — (configurable per-column)
- Positional data for phrase queries
- Fast BM25 scoring without a sequential scan
Hybrid Search
The best retrieval systems combine two complementary signals:
- Vector similarity (semantic) — captures meaning. “automobile” and “car” are close in embedding space even though they share no characters.
- BM25 (lexical) — captures exact keywords. A user searching for “HNSW” expects documents containing that exact acronym, which embeddings may dilute.
Neither alone is sufficient. Vector search misses exact keyword matches; BM25 misses semantic similarity. Combining them produces better recall and precision than either in isolation.
The standard technique for merging two ranked lists is Reciprocal Rank Fusion (RRF):
where is the standard constant. Documents appearing in both lists get additive scores and naturally rise to the top.
RRF is rank-based, not score-based — it does not require normalizing BM25 scores against cosine similarities. This makes it robust and easy to implement.
Practical pattern
A hybrid_search method in a chunk repository implements this two-retriever + RRF pattern.
BM25 leg
SELECT content_id, content_type, content, chunk_index,
paradedb.score(id)::float8 AS score
FROM content_chunks
WHERE content_chunks @@@ paradedb.parse($1)
ORDER BY paradedb.score(id)::float8 DESC
LIMIT $2;The ::float8 cast is necessary because paradedb.score() returns a custom type that sqlx (a Rust async SQL toolkit that provides compile-time checked queries and typed row decoding) cannot directly decode into f64.
Graceful degradation
let bm25_rows = sqlx::query_with(&sql, values)
.fetch_all(&mut *tx)
.await
.unwrap_or_default();unwrap_or_default() means: if ParadeDB is not installed (the query fails because @@@ is an unknown operator), return an empty Vec instead of propagating the error. RRF then falls back to vector-only ranking. This lets the application run in environments without ParadeDB — local dev, CI — without code changes.
Vector leg
The vector similarity leg uses pgvector’s cosine distance operator (<=>), with an HNSW index for approximate nearest neighbour search. The two result sets are then merged via RRF with to produce the final ranking.
See also
- pgvector — vector similarity search, the vector leg of hybrid search
- Reciprocal Rank Fusion — RRF formula and worked example
- Index Types — GIN index that powers native Postgres FTS
- Vector Search and Vector Databases — ANN algorithm theory