Overview

Retrieval-Augmented Generation (RAG) gives an LLM access to a private knowledge base without fine-tuning. Documents are split into chunks, embedded into vectors, stored in a vector database, and retrieved at query time so the model can cite sources. This guide uses pgvector as the store. The full retrieval strategy lives in rag-retrieval and chunking rules in rag-chunking.

Prerequisites

  • Python 3.11 or newer with a virtual environment active.
  • A Postgres instance with the pgvector extension. Follow configure-pgvector-on-neon or run locally.
  • An embeddings API key. Voyage AI (voyageai) or OpenAI (openai) both work; examples use Voyage.
  • pip install pgvector "psycopg[binary]" voyageai langchain-text-splitters (quote psycopg[binary] so your shell does not expand the brackets)

Steps

1. Choose a chunk strategy

Chunk size is the biggest quality lever in a RAG pipeline. Chunks of roughly 512 tokens balance specificity and context: smaller chunks match queries precisely but lose surrounding context, while larger chunks carry more context but dilute the embedding and rank poorly. Use sentence-aware splitting to avoid cutting mid-sentence.

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # counted in characters by default (length_function=len)
    chunk_overlap=64,  # overlap carries context across chunk boundaries
    separators=["\n\n", "\n", ". ", " "],
)
 
with open("docs/handbook.md") as f:
    raw = f.read()
 
chunks = splitter.split_text(raw)
# Each chunk is a plain string; attach metadata separately.
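
Note that RecursiveCharacterTextSplitter measures chunk_size in characters unless you supply a length function. For token-accurate sizes, the splitter can count with a tokenizer instead. A minimal sketch, assuming tiktoken is installed (pip install tiktoken); cl100k_base is a stand-in encoding, not Voyage's own tokenizer:

token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumption: approximates the embedder's tokenizer
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
token_chunks = token_splitter.split_text(raw)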

See rag-chunking for parent-child chunking, semantic chunking, and when to use each.

2. Embed the chunks

Embed in batches; embedding APIs charge per token and have rate limits.

import voyageai
 
vc = voyageai.Client()  # reads VOYAGE_API_KEY from env
 
BATCH_SIZE = 128
 
def embed_chunks(chunks: list[str]) -> list[list[float]]:
    vectors = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        result = vc.embed(batch, model="voyage-3", input_type="document")
        vectors.extend(result.embeddings)
    return vectors
 
vectors = embed_chunks(chunks)

Use input_type="document" when indexing and input_type="query" at query time; mismatching them degrades recall. See embeddings for model comparison.
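
Embedding calls can also fail transiently when you hit rate limits. A minimal retry sketch with exponential backoff; catching the voyageai client's specific rate-limit exception, if you confirm its name in the package, would be tighter than catching Exception:

import time

def embed_with_retry(batch: list[str], retries: int = 5) -> list[list[float]]:
    # Back off 1s, 2s, 4s, ... between attempts; re-raise on the final failure.
    for attempt in range(retries):
        try:
            return vc.embed(batch, model="voyage-3", input_type="document").embeddings
        except Exception:  # assumption: swap in the client's rate-limit error class
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)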

3. Store in pgvector

Create the table with a vector column whose dimension matches your model (1024 for voyage-3). The HNSW index enables approximate nearest-neighbor search.

CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE chunks (
  id        bigserial PRIMARY KEY,
  source    text NOT NULL,
  content   text NOT NULL,
  embedding vector(1024) NOT NULL
);
 
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
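
For reference, the same index with its build parameters written explicitly (m = 16 and ef_construction = 64 are pgvector's documented defaults), plus the query-time knob that trades latency for recall:

-- Build-time parameters: larger values improve recall but slow the build.
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query-time candidate-list size (default 40); raise it if recall is low.
SET hnsw.ef_search = 100;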

Insert rows with psycopg:

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect(DATABASE_URL)
register_vector(conn)  # register the vector type with this connection (pgvector package)
 
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO chunks (source, content, embedding) VALUES (%s, %s, %s)",
        [(f"handbook.md#{i}", chunk, vec)
         for i, (chunk, vec) in enumerate(zip(chunks, vectors))],
    )
conn.commit()

4. Retrieve top-k at query time

Embed the user query with input_type="query" and run a cosine-distance search.

def retrieve(query: str, k: int = 5) -> list[dict]:
    q_vec = vc.embed([query], model="voyage-3", input_type="query").embeddings[0]
    with conn.cursor(row_factory=psycopg.rows.dict_row) as cur:
        cur.execute(
            """
            SELECT source, content,
                   1 - (embedding <=> %s::vector) AS score
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (q_vec, q_vec, k),
        )
        return cur.fetchall()
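
Metadata filters compose with vector search as ordinary SQL. A sketch restricting results to chunks from one source file; the retrieve_from name and LIKE pattern are illustrative:

def retrieve_from(query: str, source: str, k: int = 5) -> list[dict]:
    # Same search as retrieve(), narrowed by a plain WHERE clause.
    q_vec = vc.embed([query], model="voyage-3", input_type="query").embeddings[0]
    with conn.cursor(row_factory=psycopg.rows.dict_row) as cur:
        cur.execute(
            """
            SELECT source, content, 1 - (embedding <=> %s::vector) AS score
            FROM chunks
            WHERE source LIKE %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (q_vec, f"{source}%", q_vec, k),
        )
        return cur.fetchall()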

Consider adding a reranker after retrieval to improve precision; see rag-reranking.
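
A minimal sketch with Voyage's reranker: over-retrieve candidates, then keep the top k by reranker score. The rerank-2 model name and result fields are assumptions; check the voyageai docs for the current API:

def retrieve_reranked(query: str, k: int = 5, candidates: int = 20) -> list[dict]:
    # Pull a wide candidate set, then let the cross-encoder reorder it.
    results = retrieve(query, k=candidates)
    reranking = vc.rerank(
        query, [r["content"] for r in results], model="rerank-2", top_k=k
    )
    return [results[r.index] for r in reranking.results]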

5. Ground the answer with citations

Inject retrieved chunks into the system prompt and instruct the model to cite sources.

import anthropic
 
client = anthropic.Anthropic()
 
def answer(query: str) -> str:
    results = retrieve(query)
    context = "\n\n".join(
        f"[{r['source']}]\n{r['content']}" for r in results
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=f"Answer using only the sources below. Cite [source] inline.\n\n{context}",
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text

See rag-citations for citation format standards.

Verify it worked

# Assumes the code above is saved as pipeline.py.
python - <<'EOF'
from pipeline import answer
print(answer("What is the onboarding process?"))
EOF
# Expected: an answer with at least one [handbook.md#N] citation.
 
# Check vector count:
psql $DATABASE_URL -c "SELECT count(*) FROM chunks;"

Common errors

  • Retrieval returns irrelevant chunks. The query input_type is wrong or chunks are too large. Re-embed with the correct type or reduce chunk size.
  • psycopg.errors.UndefinedFunction: operator does not exist. The pgvector extension is not installed in this database. Run CREATE EXTENSION vector; in the target database.
  • Embedding dimension mismatch. The vector(N) column dimension must match the model output. voyage-3 outputs 1024; text-embedding-3-small outputs 1536 by default.
  • Slow queries on large tables. Check with EXPLAIN ANALYZE that the query actually uses the HNSW index; the planner only applies it to ORDER BY embedding <=> ... LIMIT queries. Run VACUUM ANALYZE after bulk inserts, and raise hnsw.ef_search if recall rather than latency is the problem.
  • Model hallucinates despite context. The retrieved chunks do not contain the answer. Check recall with retrieve() before blaming the model, as in the snippet below; improve chunking or increase k.
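
A quick recall check: print what retrieval returns for a failing question and eyeball the scores and sources.

for r in retrieve("What is the onboarding process?", k=10):
    print(f"{r['score']:.3f}  {r['source']}  {r['content'][:80]}")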