Full-Text Search and ParadeDB

Postgres Native Full-Text Search

Postgres ships with built-in full-text search based on two core types:

tsvector — a sorted list of lexemes (normalized words) with positional information, produced by to_tsvector('english', 'some text').
tsquery — a search predicate composed of lexemes and boolean operators (&, |, !), produced by to_tsquery('english', 'rust & search').

The @@ operator matches a tsvector against a tsquery:

SELECT * FROM documents
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & search');

For performance, create a GIN index on the tsvector column:

CREATE INDEX documents_fts_idx ON documents USING gin(to_tsvector('english', body));

Limitations of Native FTS

ts_rank uses raw term frequency (TF) — it counts how many times query terms appear in the document. It does not weight by inverse document frequency (IDF — a measure of how rare a term is across the whole corpus), so common words like “data” score the same as rare, distinctive terms.
No BM25 scoring. No term-frequency saturation. No document-length normalization. (These concepts are explained in the BM25 section below.)
ts_rank_cd adds cover density (proximity of terms) but still lacks IDF.

For serious search relevance, you need BM25.

BM25 Scoring

BM25 (Best Matching 25) is the standard ranking function in information retrieval, used by Elasticsearch, Solr, and now ParadeDB.

$BM25 (D, Q) = \sum_{i = 1}^{n} IDF (q_{i}) \cdot \frac{f ( q _{i} , D ) \cdot ( k _{1} + 1 )}{f ( q _{i} , D ) + k _{1} \cdot ( 1 - b + b \cdot \frac{∣ D ∣}{avgdl} )}$

Where:

$f (q_{i}, D)$ = term frequency of query term $q_{i}$ in document $D$
$∣ D ∣$ = document length (number of terms), $avgdl$ = average document length across the corpus
$k_{1}$ (typically 1.2) = term frequency saturation parameter
$b$ (typically 0.75) = length normalization parameter
IDF is computed as: $IDF (q_{i}) = lo g \frac{N - n ( q _{i} ) + 0.5}{n ( q _{i} ) + 0.5}$

where $N$ = total documents and $n (q_{i})$ = number of documents containing $q_{i}$ .

Intuition

Three key ideas make BM25 effective:

Rare terms matter more (IDF). A term that appears in 5 out of 10,000 documents is far more informative than one appearing in 8,000. The IDF component upweights rare, distinctive terms.
Diminishing returns on repetition (saturation). If a term appears once, that is strong signal. If it appears 50 times, the additional occurrences add progressively less. The $k_{1}$ parameter controls how quickly term frequency “saturates” — unlike raw TF which grows linearly, BM25’s numerator is bounded by $(k_{1} + 1)$ .
Longer documents are penalized (length normalization). A 10,000-word document will naturally contain more occurrences of any term than a 100-word document. The $b$ parameter scales the effective term frequency by $∣ D ∣/ avgdl$ , so a long document must have proportionally more hits to score as well as a short one.

The $k_{1}$ and $b$ defaults (1.2 and 0.75) work well for most corpora. Only tune them if you have domain-specific evidence — e.g., set $b$ closer to 0 if document length is not meaningful (all documents are roughly the same size).

ParadeDB

ParadeDB is a Postgres extension that embeds Tantivy — a Rust-based full-text search engine — directly inside Postgres. Tantivy is the Rust equivalent of Apache Lucene (the Java library that powers Elasticsearch and Solr): it manages inverted indexes, tokenization, and BM25 scoring, but runs inside the Postgres process rather than as a separate service. This gives you BM25-scored full-text search without leaving SQL.

Key Operators and Functions

Syntax	Purpose
`@@@`	BM25 full-text search operator (replaces native `@@`)
`paradedb.parse(query_text)`	Parses a query string into a Tantivy query
`paradedb.score(id)`	Returns the BM25 score for a matched row

Example query:

SELECT id, content, paradedb.score(id)::float8 AS score
FROM content_chunks
WHERE content_chunks @@@ paradedb.parse('hybrid search architecture')
ORDER BY paradedb.score(id)::float8 DESC
LIMIT 20;

The @@@ operator is ParadeDB-specific — it routes the query through Tantivy's BM25 engine rather than Postgres's native tsvector/tsquery pipeline.

The BM25 Index

ParadeDB registers a custom index access method in Postgres:

CREATE INDEX content_chunks_bm25_idx ON content_chunks
    USING bm25 (content);

USING bm25 is not a native Postgres index type. It is a custom access method registered by the ParadeDB extension. Under the hood, it builds a Tantivy inverted index — the same data structure that powers Elasticsearch — but stored and managed within Postgres's index infrastructure. The extension must be installed for this DDL to succeed.

This index supports:

Tokenization and stemming — stemming reduces words to their root form so that “running”, “runs”, and “ran” all match a query for “run” — (configurable per-column)
Positional data for phrase queries
Fast BM25 scoring without a sequential scan

Hybrid Search

The best retrieval systems combine two complementary signals:

Vector similarity (semantic) — captures meaning. “automobile” and “car” are close in embedding space even though they share no characters.
BM25 (lexical) — captures exact keywords. A user searching for “HNSW” expects documents containing that exact acronym, which embeddings may dilute.

Neither alone is sufficient. Vector search misses exact keyword matches; BM25 misses semantic similarity. Combining them produces better recall and precision than either in isolation.

The standard technique for merging two ranked lists is Reciprocal Rank Fusion (RRF):

$RRF (d) = \sum_{r \in rankers} \frac{1}{k + rank _{r} ( d )}$

where $k = 60$ is the standard constant. Documents appearing in both lists get additive scores and naturally rise to the top.

RRF is rank-based, not score-based — it does not require normalizing BM25 scores against cosine similarities. This makes it robust and easy to implement.

Practical pattern

A hybrid_search method in a chunk repository implements this two-retriever + RRF pattern.

BM25 leg

SELECT content_id, content_type, content, chunk_index,
       paradedb.score(id)::float8 AS score
FROM content_chunks
WHERE content_chunks @@@ paradedb.parse($1)
ORDER BY paradedb.score(id)::float8 DESC
LIMIT $2;

The ::float8 cast is necessary because paradedb.score() returns a custom type that sqlx (a Rust async SQL toolkit that provides compile-time checked queries and typed row decoding) cannot directly decode into f64.

Graceful degradation

let bm25_rows = sqlx::query_with(&sql, values)
    .fetch_all(&mut *tx)
    .await
    .unwrap_or_default();

unwrap_or_default() means: if ParadeDB is not installed (the query fails because @@@ is an unknown operator), return an empty Vec instead of propagating the error. RRF then falls back to vector-only ranking. This lets the application run in environments without ParadeDB — local dev, CI — without code changes.

Vector leg

The vector similarity leg uses pgvector’s cosine distance operator (<=>), with an HNSW index for approximate nearest neighbour search. The two result sets are then merged via RRF with $k = 60$ to produce the final ranking.

Edmondo's Vault

Explorer

Full-Text Search and ParadeDB

Full-Text Search and ParadeDB

Postgres Native Full-Text Search

Limitations of Native FTS

BM25 Scoring

Intuition

ParadeDB

Key Operators and Functions

The BM25 Index

Hybrid Search

Practical pattern

BM25 leg

Graceful degradation

Vector leg

See also

Graph View

Table of Contents

Backlinks