RAG (Retrieval-Augmented Generation)

RAG is an architecture pattern that grounds a large language model’s responses in documents retrieved at query time, which reduces (though does not eliminate) hallucination.

The core idea: instead of relying solely on knowledge baked into the model’s weights during training, a RAG system retrieves relevant documents at query time and feeds them into the model’s context window alongside the user’s question. The model answers using both its pre-trained knowledge and the retrieved content.
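
To make the core idea concrete, here is a minimal sketch of "retrieve, then place in context". Everything in it is an assumption for illustration: the Chunk type, the retrieve and build_prompt helpers, the sample corpus, and the prompt wording are hypothetical, and the word-overlap scoring is only a stand-in for a real retriever.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical document identifier, used for attribution
    text: str


def _words(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real tokenization."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: list[Chunk], k: int = 2) -> list[Chunk]:
    """Toy retriever: rank chunks by the number of words shared with the question."""
    q = _words(question)
    scored = [(len(q & _words(c.text)), c) for c in corpus]
    # Python's sort is stable, so ties keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep the top k, dropping chunks that share no words at all.
    return [c for score, c in scored[:k] if score > 0]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Place retrieved chunks, tagged with their source, ahead of the question."""
    context = "\n".join(f"[{c.source}] {c.text}" for c in chunks)
    return (
        "Answer using only the context below and cite sources in brackets.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


corpus = [
    Chunk("policy.md", "A refund is accepted within 30 days of purchase."),
    Chunk("shipping.md", "Orders ship within 2 business days."),
    Chunk("faq.md", "Contact support by email for a replacement."),
]
question = "What is the refund window for a purchase?"
prompt = build_prompt(question, retrieve(question, corpus))
```

The resulting prompt string is what would be sent to the model. Tagging each chunk with its source is one simple way to let the answer refer back to where a claim came from.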

The three-phase pipeline

  1. Embed and index: convert a document corpus into embeddings and store them in a vector index (e.g. via pgvector + HNSW). Optionally, also build a keyword index (e.g. BM25 via ParadeDB).
  2. Retrieve: at query time, embed the question and find the nearest documents in the vector index. Retrieval is typically hybrid: vector search for semantic similarity is combined with BM25 for exact keyword matches, and the two result lists are merged via Reciprocal Rank Fusion.
  3. Generate: pass the top-k retrieved chunks to the LLM as context. The model produces an answer that is grounded in those chunks and, ideally, cites them.
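
The merge in step 2 can be sketched as follows. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every result list it appears in, with rank starting at 1. The function name, the document ids, and the two sample result lists are hypothetical. k = 60 is a commonly used default, not a value taken from this document.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the two retrievers, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
top_k = fused[:3]  # the chunks that would be passed on to the generate step
```

Because RRF uses only ranks, it does not need vector distances and BM25 scores to be on comparable scales. Here doc_a and doc_c appear in both lists and therefore rise above doc_b and doc_d, which each appear in only one.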

Why it works

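LLMs have a knowledge cutoff and a finite context window: they cannot know about documents written after training, and an entire corpus rarely fits in a single prompt. RAG replaces fixed, baked-in knowledge with retrieval at query time, placing only the chunks relevant to the question in context. This enables: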

  • Answers about documents that post-date training
  • Attribution: an answer can point to the source chunks it used
  • Domain-specific corpora without full fine-tuning

See also