Overview

This page is the atomic definition. Full RAG architecture patterns live at rag.

Definition

Retrieval-Augmented Generation (RAG) is an inference-time technique that supplements an LLM’s parametric knowledge (knowledge baked into its weights) with non-parametric knowledge retrieved from an external corpus. A query is converted to an embedding and compared against a vector index of pre-embedded document chunks using vector-similarity (cosine similarity or dot product). The top-k most similar chunks are inserted into the prompt as context, and the model generates a response grounded in those chunks rather than relying solely on what it memorized during training. This reduces hallucination on domain-specific and time-sensitive questions.

Quality depends on retrieval precision: irrelevant chunks waste context-window space and can mislead the model. A reranker is often applied after the initial retrieval to re-score and trim the candidates.

Evaluation requires a ground-truth set of query-answer pairs and metrics such as faithfulness (is the answer supported by the retrieved context?) and answer relevance (does the answer address the question?).
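
As a sketch, the retrieval step reduces to a nearest-neighbor search over the chunk index. The snippet below assumes a caller-supplied embed() function standing in for a real embedding model, and chunk vectors that are L2-normalized so the dot product equals cosine similarity; it is a minimal illustration, not a production implementation.

    import numpy as np

    def retrieve(query: str, chunk_texts: list[str],
                 chunk_vectors: np.ndarray, embed, k: int = 5) -> list[str]:
        # embed() is a hypothetical stand-in for any embedding model;
        # chunk_vectors is an (n_chunks, dim) matrix of pre-embedded,
        # L2-normalized chunks, so dot product == cosine similarity.
        q = embed(query)                      # query -> embedding vector
        scores = chunk_vectors @ q            # similarity score per chunk
        top_k = np.argsort(scores)[::-1][:k]  # indices of the k best scores
        # A reranker would typically re-score and trim these candidates
        # before they are inserted into the prompt.
        return [chunk_texts[i] for i in top_k]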

When it applies

Use RAG when: the knowledge domain changes faster than model retraining cycles allow, the corpus is too large to fit in a context window, or you need the model to cite sources. Do not use RAG as a substitute for fine-tuning when the task requires behavioral change (tone, output format) rather than knowledge injection.

Example

A legal research assistant embeds 50,000 case summaries. A user asks about a 2024 ruling. The query embedding scores highest against the chunk summarizing that case, so it is retrieved. The LLM receives [Context: case summary text] + [Question: ...] and generates a cited answer. Without RAG, the model’s training cutoff would prevent it from knowing about the 2024 ruling.
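
Concretely, the prompt handed to the model might be assembled like the sketch below; the instruction wording and field labels are illustrative assumptions, not a fixed format.

    # Hypothetical prompt assembly; the chunk text and labels are
    # placeholders, not a prescribed template.
    retrieved_chunk = "[case summary text for the 2024 ruling]"
    question = "What did the court hold in the 2024 ruling?"

    prompt = (
        "Answer using only the context below, and cite the case you rely on.\n\n"
        f"Context: {retrieved_chunk}\n\n"
        f"Question: {question}"
    )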

Related terms

  • rag - architecture patterns: chunking strategies, embedding models, hybrid retrieval.
  • embedding - the vector representation that makes similarity search possible.
  • vector-similarity - the distance metric used to find relevant chunks.
  • reranker - improves retrieval precision after the initial embedding search.
  • hallucination - RAG reduces but does not eliminate hallucination.
  • eval-set - a test set of query-answer pairs is needed to measure RAG quality.

Citing this term

See RAG (llmbestpractices.com/glossary/rag).