Retrieval-augmented generation (RAG)

Overview

This page gives the short, self-contained definition of the term. The deep-dive lives at rag.

Definition

Retrieval-augmented generation (RAG) is a pattern that retrieves relevant context from a knowledge base at query time and passes it into the model prompt, grounding the answer in external data instead of relying on the model’s parametric memory. The pipeline has three stages: index (chunk and embed documents into a vector store), retrieve (embed the query and fetch the top-k most similar chunks), and generate (pass the chunks plus the query to the model). RAG reduces hallucinations on factual questions, lets the system answer from documents the model was never trained on, and makes answers auditable by citing the retrieved sources.
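The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector store, and the generate stage stops at building the prompt because the model call depends on the provider. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(text, size=12):
    """Split a document into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(documents):
    """Index stage: chunk each document and store (doc_id, chunk, vector)."""
    return [(doc_id, c, embed(c)) for doc_id, text in documents.items() for c in chunk(text)]

def retrieve(index, query, k=2):
    """Retrieve stage: embed the query and return the top-k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(doc_id, text) for doc_id, text, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Generate stage (prompt only): combine retrieved chunks with the query."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base with two short documents.
DOCS = {
    "billing": "Refunds are issued within five business days of a cancellation request.",
    "login": "Reset your password from the sign-in page using the emailed link.",
}

index = build_index(DOCS)
top_chunks = retrieve(index, "How do I reset my password?", k=1)
prompt = build_prompt("How do I reset my password?", top_chunks)
```

In a real system the prompt would then be sent to the model; swapping `embed` for a learned embedding model and the list for a vector store changes the quality of retrieval but not the shape of the pipeline.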

When it applies

Use RAG for question answering over private or fast-changing documents (docs sites, internal wikis, support tickets, codebases). Avoid it when the model can already answer reliably from its training data, or when latency matters more than freshness, since retrieval adds a step to every query.

Example

A support bot retrieves the top 5 most relevant help-center articles for a user’s question, passes them as context, and asks the model to answer with inline citations to the article IDs. The user clicks a citation to read the source.
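The citation half of this example could look roughly like the sketch below. It assumes the retriever has already returned ranked `(article_id, text)` pairs; the article IDs, the bracketed citation format, and the sample answer are all hypothetical. Alongside the prompt builder it includes a small check for citations that point at articles the bot never retrieved, which is one simple way to keep citations auditable.

```python
import re

def build_cited_prompt(question, articles, top_k=5):
    """Build a prompt from the top_k retrieved articles, asking for inline citations.

    `articles` is a list of (article_id, text) pairs, already ranked by relevance.
    """
    context = "\n\n".join(f"[{aid}] {text}" for aid, text in articles[:top_k])
    return (
        "Answer the question using only the articles below. "
        "Cite the supporting article ID in square brackets after each claim.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

def cited_ids(answer):
    """Extract bracketed article IDs such as [KB-101] from a model answer."""
    return re.findall(r"\[([A-Za-z0-9_-]+)\]", answer)

def unknown_citations(answer, articles):
    """Return cited IDs that do not match any retrieved article."""
    known = {aid for aid, _ in articles}
    return [aid for aid in cited_ids(answer) if aid not in known]

# Hypothetical retrieved articles and a hypothetical model answer.
articles = [
    ("KB-101", "To reset your password, open Settings and choose Reset password."),
    ("KB-205", "Two-factor authentication can be enabled under Security."),
]
prompt = build_cited_prompt("How do I reset my password?", articles)
sample_answer = (
    "Open Settings and choose Reset password [KB-101]. "
    "The reset link expires after one hour [KB-999]."
)
bad = unknown_citations(sample_answer, articles)
```

Here `bad` flags `KB-999`, a citation that matches no retrieved article, so the bot can drop or regenerate that claim before showing the answer to the user.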

Related concepts

rag - the deep-dive with chunking, retrieval, and reranking rules.
rag-retrieval - the retrieval stage in detail.
embedding - the vector representation that powers retrieval.
hallucination - the failure mode RAG most directly mitigates.
rag-citations - the citation discipline that makes RAG auditable.

Citing this term

See Retrieval-augmented generation (RAG) (llmbestpractices.com/glossary/retrieval-augmented-generation).
