Traditional keyword search is tolerant of typos but blind to meaning: a query for “car” won’t find a document about “automobiles”. Vector search solves this by representing both documents and queries as points in a high-dimensional space, where proximity means semantic similarity.

The workflow has three steps:

  1. Embed — run documents and queries through the same model to get fixed-length vectors.
  2. Index — store and organise the vectors so retrieval is fast.
  3. Query — embed the incoming query and find the nearest vectors.

The hard part is step 3 at scale: with millions of documents, how do you find the closest vectors without comparing the query to every single one?

What does “closest” mean? Distance metrics

Before approximating nearest-neighbour search, we need a distance to approximate. Given two vectors :

Euclidean distance (L2) measures straight-line distance. It is sensitive to both direction and magnitude — the natural choice when absolute position in space carries meaning (image features, anomaly detection):

Cosine similarity measures the angle between two vectors, ignoring their length. Most text embedding models output unit-length vectors, making cosine, dot product, and L2 all equivalent on normalised vectors (since when ). This is why cosine dominates for text:

Dot product conflates direction and magnitude — useful when magnitude encodes importance (e.g. learned retrieval models where larger norms signal higher relevance).

Warning

Many libraries (FAISS, Hnswlib) only implement L2 or inner product natively. To use cosine similarity, normalise your vectors to unit length first and then use inner product — the results are equivalent.

Why brute force doesn’t scale

A linear scan over vectors each of dimension costs per query. At and (OpenAI embeddings), that is ~15 billion multiply-adds per query — roughly 15 seconds on a single CPU core. Real systems need sub-10ms latency at thousands of queries per second.

Approximate Nearest Neighbour (ANN) algorithms trade a small margin of error in recall — recall here means the fraction of true nearest neighbours that are actually returned — for orders-of-magnitude speedup by avoiding most of the database during search. The three dominant approaches organise the space differently:

ApproachKey ideaRepresentative algorithms
Hash-based partitioningMap similar vectors to the same hash bucketLSH (Locality Sensitive Hashing)
Cluster-based partitioningAssign vectors to Voronoi cells (the region of space closer to one centroid than any other); search nearby cells onlyIVF (Inverted File Index)
Graph traversalNavigate a proximity graph toward the queryHNSW (Hierarchical Navigable Small Worlds)
QuantizationCompress vectors into compact codes; compute approximate distances on codesScaNN, IVF-PQ (IVF with Product Quantization — a compression technique that splits each vector into sub-vectors and quantizes each independently)

ANN algorithms

Locality Sensitive Hashing (LSH)

The insight: design a hash function such that similar vectors are likely to collide (land in the same bucket) and dissimilar ones are not. Candidates are only the vectors in the query’s bucket.

See notebook for an interactive walkthrough with live controls.

Hash function for cosine similarity. Draw a random vector . It defines a random hyperplane that splits the space into two half-spaces:

Two vectors on the same side of the hyperplane get bit 1; on opposite sides, bit 0. The collision probability for vectors at angle :

One plane is too weak — it puts half the database in each bucket regardless. The solution is to use planes together.

Multi-projection signatures ( planes). Draw independent planes . For each vector , compute a -bit signature (bucket address):

A vector is a candidate only if it has the exact same -bit signature as the query. Since each plane independently requires agreement, the joint collision probability drops to :

  • Near neighbour (, ): — still a candidate half the time
  • Far point (, ): — essentially never retrieved

The planes carve the space into fine-grained regions. Points in the same region are much more likely to be true neighbours than random points.

Multiple tables () — recovering recall. Even at , we miss half the true near neighbours. Fix this with independent sets of planes (hash tables). A pair is a candidate if they collide in at least one table:

With , , : recall .

Tip

Increasing cuts false positives (smaller candidate sets, higher precision). Increasing recovers recall. Tune so the candidate set is small but almost always contains the true nearest neighbours.

Postal code analogy

LSH is like grouping residences by postal code: nearby houses likely share a code, distant ones don’t. Using different coding schemes makes it unlikely that two truly nearby houses are missed by all of them.

IVF (Inverted File Index)

The insight: k-means cluster the database into Voronoi cells. A Voronoi cell is the region of space that is closer to one centroid than to any other — the set of all points for which that centroid wins a nearest-neighbour race. Together, the cells tile the entire vector space without overlap. At query time, only examine the few cells whose centroid is closest to the query.

See notebook for an interactive walkthrough with live controls.

Three phases:

  1. Train. Run k-means on a representative sample to learn centroids.
  2. Index. Assign each vector to its nearest centroid; store an inverted list per centroid.
  3. Search. Find the nearest centroids to the query; exhaustively scan those lists.

The primary knobs:

  • is the standard heuristic. Too few → large lists, slow scan. Too many → vectors near cell boundaries are missed.
  • is the recall vs. speed dial at query time. degrades to brute force; is fastest but misses boundary cases.

IVFFlat stores full vectors (exact distances within each cell). IVF-PQ adds product quantization (PQ) — a compression technique that splits each -dimensional vector into sub-vectors of dimensions each, then quantizes each sub-vector to one of a small codebook of centroids. This reduces storage by 10-100× at some recall cost.

HNSW (Hierarchical Navigable Small Worlds)

The insight: build a multi-layer proximity graph where upper layers provide long-range “express lane” connections and the bottom layer is a dense fine-grained neighbourhood graph. Search is a greedy descent through the layers.

See notebook for an interactive walkthrough with live controls.

Beam search is a bounded-width best-first search. Instead of greedily following only the single best neighbour (which gets stuck in local optima), beam search keeps a frontier of the ef most promising candidates seen so far, expands each of their neighbours, and keeps the best ef of those — repeating until no neighbour beats the current frontier. ef (expansion factor) is the width: larger ef explores more of the graph and finds better results, at proportionally more distance computations.

Graph construction. Each new vector is assigned a maximum layer where in the standard parametrisation. This gives a geometric distribution: each successive layer has as many nodes as the one below it.

LayerProbability (exact )
0 only
1
2
3
4+

The sum is 100% — every node lands somewhere. Nothing is discarded. At each layer up to its assigned maximum, is connected to its nearest neighbours found by beam search with width efConstruction. The result: the top layers are sparse and navigable; the bottom layer is a dense nearest-neighbour graph.

Search. Start at the single entry point on the top layer. Greedily traverse to the locally closest node. Descend layer by layer, then at layer 0 run a beam search of width efSearch to collect the final top- candidates.

Why complexity. The geometric sampling — each node exists on one fewer layer with constant probability — means the expected number of layers is , exactly like a skip list. The top layer has nodes (only the rarest reach it), each lower layer has roughly a constant multiple more. At each layer, greedy traversal visits neighbours before finding a node closer to the query. So total work is layers hops per layer per distance computation . Since is a fixed constant (16–64), this collapses to .

Key parameters:

  • : connections per node. Higher M = better recall, more memory. Typical: 16–64.
  • efConstruction: beam width at build time. Higher = better graph, slower build. Typical: 100–500.
  • efSearch: beam width at query time. The primary recall/latency dial — can change per query without rebuilding.

HNSW has the best recall/speed trade-off of any ANN algorithm but uses the most memory ().

ScaNN (Scalable Approximate Nearest Neighbours)

ScaNN is a quantization-based ANN algorithm from Google. To understand ScaNN, start with what quantization does and why the standard approach falls short for inner product search.

What quantization means here. Storing vectors of dimension as 32-bit floats costs bytes — 6 GB for 1 million 1536-dim vectors. Product quantization (PQ) — the same compression technique used in IVF-PQ — compresses each vector to a compact code:

  1. Split the -dimensional vector into sub-vectors of dimensions each.
  2. For each sub-space, learn a codebook of centroids (typically , so each centroid index fits in 1 byte).
  3. Replace each sub-vector with its nearest centroid’s index.

A 1536-dim float vector compresses to 12 bytes (with , ) — 128× smaller. Let denote the quantized approximation: the vector reconstructed from its centroid indices. The approximation error is .

At query time, approximate distances are computed entirely from byte codes using precomputed lookup tables — much faster than operating on raw floats.

The problem with standard PQ for MIPS. Standard PQ minimises isotropic — equal in all directions, like a sphere of allowable error — reconstruction error:

But for MIPS (Maximum Inner Product Search — finding the vector with the largest dot product with the query, the core operation in recommendation systems), what matters is not the reconstruction error but the inner product error:

This error depends on the direction of the residual relative to the query , not on its magnitude alone. Standard PQ wastes compression budget reducing error in directions that have no impact on ranking.

See notebook for an interactive walkthrough with live controls.

Anisotropic quantization: ScaNN’s core insight. Error parallel to — the direction that high-scoring queries tend to align with — propagates directly into inner product ranking errors. Error perpendicular to barely affects the inner product at all (since when the residual is orthogonal to the query). ScaNN replaces the isotropic loss with an anisotropic (direction-dependent) loss that penalises parallel error more heavily:

where is the component of the error in the direction of , and is the remaining component. Since , the optimiser places the codebook centroids to minimise parallel error first, accepting more perpendicular error in exchange. This yields significantly better ranking quality at the same compression ratio.

The three-phase pipeline. ScaNN is structured like IVF-PQ — partition to reduce candidates, then quantize for fast scoring — but with anisotropic PQ replacing standard PQ in the scoring phase:

  1. Partition (optional): k-means into Voronoi cells. At query time, probe the nearest few cells. This reduces the candidate set from to vectors.
  2. Score: for each candidate vector in the probed cells, compute an approximate inner product using its anisotropic quantized code and precomputed lookup tables. This is fast because it operates on byte codes, not floats.
  3. Rescore: take the top- candidates from scoring (typically a few hundred) and compute their exact inner products against the original float vectors. This corrects ranking errors introduced by quantization.

The rescore phase is what gives ScaNN its high recall: quantization is used as a fast filter, not as the final answer. The cost is for scoring (where via quantization) plus for rescoring — both cheap relative to the brute force cost.

Practical tradeoffs

AlgorithmRecall@10Query latencyMemoryBuild timeNotes
Flat (brute force)100%noneViable for
LSH85-95%fastSimple, streaming-friendly; recall ceiling below graph methods
IVFFlat90-99%requires trainingNeeds representative training data; combine with PQ for memory
HNSW95-99.9%slowBest recall/speed; highest memory; no training needed
ScaNN95-99%+requires trainingBest for MIPS (Maximum Inner Product Search — finding the vector with the largest dot product, used in recommendation systems); via quantization

Always benchmark on your own data. The ANN Benchmarks project is the standard reference.

Where to run it: vector databases

Purpose-built vector databases (Pinecone, Weaviate, Qdrant, Milvus, Chroma) bundle ANN indexing, metadata filtering, and horizontal scaling into a single system. They make sense at billion-scale or when you need sub-millisecond latency at high QPS (Queries Per Second).

For most workloads, an existing database with a vector extension is simpler and sufficient:

  • pgvector — adds vector columns, L2/cosine/IP operators, and IVFFlat + HNSW indexes to PostgreSQL.
  • Elasticsearch / OpenSearch — HNSW on dense vector fields.
  • BigQuery / AlloyDB — built-in vector search for analytics workloads.

Applications

Embeddings together with ANN power:

  • RAG (Retrieval-Augmented Generation) — retrieve relevant document chunks at query time and feed them into an LLM’s context window, grounding responses in real data and reducing hallucination.
  • Semantic search — find documents by meaning rather than keyword overlap.
  • Recommendations — embed users and items into the same space; nearest neighbours become recommendations.
  • Anomaly detection — points far from all cluster centroids are anomalous.
  • Two-stage retrieval — ANN as a fast first-stage candidate generator (billions → thousands); a cross-encoder or Reciprocal Rank Fusion reranker as second stage.

See also