RAG (Retrieval-Augmented Generation)

RAG is an architecture pattern that grounds a large language model’s responses in retrieved documents, which helps reduce hallucination.

The core idea: instead of relying solely on knowledge baked into the model’s weights during training, a RAG system retrieves relevant documents at query time and feeds them into the model’s context window alongside the user’s question. The model answers using both its pre-trained knowledge and the retrieved content.
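As a minimal sketch of "feeding retrieved documents into the context window", the function below assembles a prompt from already-retrieved chunks and the user's question. The name `build_prompt`, the prompt wording, and the `(source_id, text)` chunk shape are illustrative assumptions, not part of any particular library.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble an LLM prompt from retrieved (source_id, text) chunks.

    Hypothetical helper: the layout and instructions are one possible choice.
    """
    # Number each chunk so the model can cite it as [1], [2], ...
    context = "\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using the numbered context below. "
        "Cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example with a single made-up chunk.
prompt = build_prompt(
    "What is RAG?",
    [("notes.md", "RAG retrieves documents at query time.")],
)
```

Numbering the chunks is what makes attribution possible later: the model can refer to `[1]`, and the caller can map that number back to `notes.md`.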

The three-phase pipeline

1. Embed and index: convert a document corpus into embeddings and store them in a vector index (e.g. pgvector with an HNSW index). Optionally, also build a keyword index (e.g. BM25 via ParadeDB).
2. Retrieve: at query time, embed the question and find the nearest documents. Retrieval is typically hybrid: vector search for semantic similarity is combined with BM25 for exact keyword matches, and the two ranked lists are merged via Reciprocal Rank Fusion.
3. Generate: pass the top-k retrieved chunks to the LLM as context. The model produces an answer that cites, or is at least grounded in, those chunks.
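The retrieve phase can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the two-dimensional embeddings are made up, an exhaustive cosine-similarity scan stands in for an HNSW index, the BM25 ranking is hard-coded rather than computed, and `k = 60` is the constant commonly used with Reciprocal Rank Fusion rather than anything this note prescribes. The function names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec: list[float], index: dict[str, list[float]], top_k: int = 3) -> list[str]:
    """Brute-force nearest neighbours; a real system would use an ANN index."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:top_k]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up embeddings for three documents and one query.
toy_index = {"doc_a": [1.0, 0.0], "doc_b": [0.8, 0.6], "doc_c": [0.0, 1.0]}
vector_hits = vector_search([1.0, 0.1], toy_index)

# Stand-in for a keyword ranking returned by a BM25 index.
bm25_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Here `doc_b` ends up first because it ranks highly in both lists, while `doc_d` still surfaces even though only the keyword side found it. Fusion works on ranks alone, so the vector and BM25 scores never need to be put on a common scale.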

Why it works

LLMs have a knowledge cutoff and a finite context window, so they can neither know about later material nor hold an entire corpus in context. RAG replaces the constraint of baked-in knowledge with retrieval at query time, enabling:

Answers about documents that post-date training
Attribution: which source was used
Domain-specific corpora without full fine-tuning
