Overview
This page is the atomic definition. The deep-dive lives at rag.
Definition
Retrieval-augmented generation (RAG) is a pattern that retrieves relevant context from a knowledge base at query time and passes it into the model prompt, grounding the answer in external data instead of relying on the model’s parametric memory. The pipeline has three stages: index (chunk and embed documents into a vector store), retrieve (embed the query and fetch the top-k chunks), and generate (pass the chunks plus the query to the model). RAG reduces hallucinations on factual questions, lets the system answer from documents the model was never trained on, and makes answers auditable through citations to the retrieved sources.
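The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, a plain list stands in for a vector store, and the function names (`build_index`, `retrieve`, `build_prompt`) and the sample chunks are assumptions made for this example. The generate stage stops at building the prompt, since the model call depends on the provider.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Index stage: embed every chunk once and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieve stage: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(chunks, query):
    """Generate stage (prompt only): place the retrieved chunks ahead of the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical sample data for illustration.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Passwords can be reset from the account settings page.",
    "Shipping takes 3 to 5 business days.",
]
index = build_index(chunks)
question = "How do I reset my password?"
top = retrieve(index, question, k=1)
prompt = build_prompt(top, question)
```

In a real system the index is built ahead of time and only the retrieve and generate stages run per query.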
When it applies
Use RAG for question answering over private or fast-changing documents (docs sites, internal wikis, support tickets, codebases). Skip it when the model already answers reliably from its training data, or when the latency added by the retrieval step matters more than the freshness it buys.
Example
A support bot retrieves the top 5 most relevant help-center articles for a user’s question, passes them as context, and asks the model to answer with inline citations to the article IDs. The user clicks a citation to read the source.
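A rough sketch of the citation step in this example, assuming retrieval has already returned the articles. The article fields (`id`, `title`, `body`), the `kb-101` style IDs, the square-bracket citation format, and both function names are hypothetical choices for illustration. The second function checks that every ID cited in an answer belongs to an article that was actually passed as context, which is one simple way to keep citations auditable.

```python
import re


def build_cited_prompt(articles, question):
    """Label each retrieved article with its ID so the model can cite it inline."""
    context = "\n\n".join(
        f"[{a['id']}] {a['title']}\n{a['body']}" for a in articles
    )
    return (
        "Answer the question using only the articles below. "
        "After each claim, cite the supporting article ID in square brackets.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


def unknown_citations(answer, articles):
    """Return cited IDs that do not match any article supplied as context."""
    known = {a["id"] for a in articles}
    cited = re.findall(r"\[([A-Za-z0-9_-]+)\]", answer)
    return [c for c in cited if c not in known]


# Hypothetical retrieved articles and model answer, for illustration only.
articles = [
    {"id": "kb-101", "title": "Reset your password",
     "body": "Open Settings, then choose Reset password."},
    {"id": "kb-205", "title": "Update billing details",
     "body": "Billing details are edited under Account."},
]
cited_prompt = build_cited_prompt(articles, "How do I reset my password?")
answer = "Open Settings and choose Reset password [kb-101]. See also [kb-999]."
bad = unknown_citations(answer, articles)
```

Here `bad` contains `kb-999`, an ID that was never supplied as context, so the bot could drop or flag that citation before showing the answer.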
Related concepts
- rag - the deep-dive with chunking, retrieval, and reranking rules.
- rag-retrieval - the retrieval stage in detail.
- embedding - the vector representation that powers retrieval.
- hallucination - the failure mode RAG most directly mitigates.
- rag-citations - the citation discipline that makes RAG auditable.
Citing this term
See Retrieval-augmented generation (RAG) (llmbestpractices.com/glossary/retrieval-augmented-generation).