RAG (Retrieval-Augmented Generation)
RAG is an architecture pattern that grounds a large language model’s responses in retrieved documents, which reduces (but does not eliminate) hallucination.
The core idea: instead of relying solely on knowledge baked into the model’s weights during training, a RAG system retrieves relevant documents at query time and feeds them into the model’s context window alongside the user’s question. The model answers using both its pre-trained knowledge and the retrieved content.
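As a minimal sketch of the "feed them into the context window" step, the snippet below assembles a prompt from retrieved chunks and the user's question. The function name `build_prompt`, the chunk fields `source` and `text`, the sample passages, and the instruction wording are all illustrative assumptions, not part of any particular library.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks and the user's question."""
    # Number each chunk so the model can cite passages by index.
    context = "\n\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using the numbered context passages below. "
        "Cite passages by number, and say so if the context is insufficient.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks; a real system would get these from the retriever.
chunks = [
    {"source": "handbook.md", "text": "Refunds are accepted within 30 days of purchase."},
    {"source": "faq.md", "text": "Refunds are issued to the original payment method."},
]
prompt = build_prompt("How long do I have to request a refund?", chunks)
print(prompt)
```

The resulting string is what would be sent to the LLM. Keeping the source label next to each passage is one simple way to make attribution possible in the answer.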
The three-phase pipeline
- Embed and index — split the document corpus into chunks, convert each chunk into an embedding, and store the embeddings in a vector index (e.g. pgvector with an HNSW index). Optionally, also build a keyword index (e.g. BM25 via ParadeDB).
- Retrieve — at query time, embed the question and find the nearest chunks. Retrieval is typically hybrid: vector search for semantic similarity is combined with BM25 for exact keyword matches, and the two ranked lists are merged via Reciprocal Rank Fusion.
- Generate — pass the top-k retrieved chunks as context to the LLM. The model produces an answer that cites or is grounded in those chunks.
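The indexing and retrieval phases above can be sketched in a few lines of pure Python. This is a toy under loud assumptions: `embed` is a term-count vector standing in for a real embedding model, `keyword_rank` is a simple term-overlap score standing in for BM25, and an in-memory dict replaces pgvector and ParadeDB. Only the Reciprocal Rank Fusion step follows the usual formula, score(d) = sum over rankers of 1 / (k + rank), with the commonly used constant k = 60. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # Assumption: a sparse term-count vector stands in for a learned embedding.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_rank(query, index):
    # Rank document ids by cosine similarity to the query vector.
    q = embed(query)
    return sorted(index, key=lambda doc_id: cosine(q, index[doc_id]), reverse=True)


def keyword_rank(query, docs):
    # Assumption: distinct-term overlap stands in for BM25 scoring.
    terms = set(tokenize(query))
    return sorted(
        docs,
        key=lambda doc_id: len(terms & set(tokenize(docs[doc_id]))),
        reverse=True,
    )


def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each ranker contributes 1 / (k + rank) per document.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def retrieve(query, docs, index, top_k=2):
    fused = rrf([vector_rank(query, index), keyword_rank(query, docs)])
    return fused[:top_k]


# Phase 1: embed and index a tiny hypothetical corpus.
docs = {
    "a": "pgvector adds vector similarity search to Postgres",
    "b": "BM25 ranks documents by keyword relevance",
    "c": "Bread needs flour, water, salt and yeast",
}
index = {doc_id: embed(text) for doc_id, text in docs.items()}

# Phase 2: hybrid retrieval; the ids in `top` would be passed to the LLM in phase 3.
top = retrieve("vector search in Postgres", docs, index)
print(top)
```

RRF uses only ranks, not raw scores, so the two retrievers do not need comparable score scales. That is the main reason it is a convenient way to merge vector and keyword results.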
Why it works
LLMs have a knowledge cutoff and a finite context window. RAG replaces the constraint of fixed, baked-in knowledge with retrieval at query time, placing only the passages relevant to the question in the context window. This enables:
- Answers about documents that post-date training
- Attribution: each answer can be traced to the sources it used
- Domain-specific corpora without full fine-tuning
See also
- pgvector — vector similarity search in Postgres
- Full-Text Search and ParadeDB — BM25 keyword search in Postgres
- Reciprocal Rank Fusion — merging ranked lists from multiple retrievers
- Vector Search and Vector Databases — ANN algorithm theory