Overview

Retrieval-Augmented Generation (RAG) gives an LLM access to a private knowledge base without fine-tuning. Documents are split into chunks, embedded into vectors, stored in a vector database, and retrieved at query time so the model can cite sources. This guide uses pgvector as the store. The full retrieval strategy lives in rag-retrieval and chunking rules in rag-chunking.

Prerequisites

  • Python 3.11 or newer with a virtual environment active.
  • A Postgres instance with the pgvector extension. Follow configure-pgvector-on-neon or run locally.
  • An embeddings API key. Voyage AI (voyageai) or OpenAI (openai) both work; examples use Voyage.
  • pip install pgvector "psycopg[binary]" voyageai langchain-text-splitters (quote psycopg[binary] so your shell does not expand the brackets)

Steps

1. Choose a chunk strategy

Chunk size is the biggest quality lever in a RAG pipeline. Chunks of roughly 512 tokens balance specificity and context: smaller chunks match queries precisely but lose surrounding context, while larger chunks carry more context but dilute the embedding and rank poorly. Use sentence-aware splitting to avoid cutting mid-sentence.

from langchain_text_splitters import RecursiveCharacterTextSplitter
 
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # counted in characters by default (length_function=len)
    chunk_overlap=64,  # overlap carries context across chunk boundaries
    separators=["\n\n", "\n", ". ", " "],
)
 
with open("docs/handbook.md") as f:
    raw = f.read()
 
chunks = splitter.split_text(raw)
# Each chunk is a plain string; attach metadata separately.
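
Note that RecursiveCharacterTextSplitter measures chunk_size in characters unless you supply a length function. For token-accurate sizes, the splitter can count with a tokenizer instead. A minimal sketch, assuming tiktoken is installed (pip install tiktoken); cl100k_base is a stand-in encoding, not Voyage's own tokenizer:

token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumption: approximates the embedder's tokenizer
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
token_chunks = token_splitter.split_text(raw)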

See rag-chunking for parent-child chunking, semantic chunking, and when to use each.

2. Embed the chunks

Embed in batches; embedding APIs charge per token and have rate limits.

import voyageai
 
vc = voyageai.Client()  # reads VOYAGE_API_KEY from env
 
BATCH_SIZE = 128
 
def embed_chunks(chunks: list[str]) -> list[list[float]]:
    vectors = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        result = vc.embed(batch, model="voyage-3", input_type="document")
        vectors.extend(result.embeddings)
    return vectors
 
vectors = embed_chunks(chunks)

Use input_type="document" when indexing and input_type="query" at query time; mismatching them degrades recall. See embeddings for model comparison.
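
Embedding calls can also fail transiently when you hit rate limits. A minimal retry sketch with exponential backoff; catching the voyageai client's specific rate-limit exception, if you confirm its name in the package, would be tighter than catching Exception:

import time

def embed_with_retry(batch: list[str], retries: int = 5) -> list[list[float]]:
    # Back off 1s, 2s, 4s, ... between attempts; re-raise on the final failure.
    for attempt in range(retries):
        try:
            return vc.embed(batch, model="voyage-3", input_type="document").embeddings
        except Exception:  # assumption: swap in the client's rate-limit error class
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)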

3. Store in pgvector

Create the table with a vector column whose dimension matches your model (1024 for voyage-3). The HNSW index enables approximate nearest-neighbor search.

CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE chunks (
  id        bigserial PRIMARY KEY,
  source    text NOT NULL,
  content   text NOT NULL,
  embedding vector(1024) NOT NULL
);
 
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
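
For reference, the same index with its build parameters written explicitly (m = 16 and ef_construction = 64 are pgvector's documented defaults), plus the query-time knob that trades latency for recall:

-- Build-time parameters: larger values improve recall but slow the build.
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query-time candidate-list size (default 40); raise it if recall is low.
SET hnsw.ef_search = 100;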

Insert rows with psycopg:

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect(DATABASE_URL)
register_vector(conn)  # register the vector type with this connection (pgvector package)
 
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO chunks (source, content, embedding) VALUES (%s, %s, %s)",
        [(f"handbook.md#{i}", chunk, vec)
         for i, (chunk, vec) in enumerate(zip(chunks, vectors))],
    )
conn.commit()

4. Retrieve top-k at query time

Embed the user query with input_type="query" and run a cosine-distance search.

def retrieve(query: str, k: int = 5) -> list[dict]:
    q_vec = vc.embed([query], model="voyage-3", input_type="query").embeddings[0]
    with conn.cursor(row_factory=psycopg.rows.dict_row) as cur:
        cur.execute(
            """
            SELECT source, content,
                   1 - (embedding <=> %s::vector) AS score
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (q_vec, q_vec, k),
        )
        return cur.fetchall()
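
Metadata filters compose with vector search as ordinary SQL. A sketch restricting results to chunks from one source file; the retrieve_from name and LIKE pattern are illustrative:

def retrieve_from(query: str, source: str, k: int = 5) -> list[dict]:
    # Same search as retrieve(), narrowed by a plain WHERE clause.
    q_vec = vc.embed([query], model="voyage-3", input_type="query").embeddings[0]
    with conn.cursor(row_factory=psycopg.rows.dict_row) as cur:
        cur.execute(
            """
            SELECT source, content, 1 - (embedding <=> %s::vector) AS score
            FROM chunks
            WHERE source LIKE %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (q_vec, f"{source}%", q_vec, k),
        )
        return cur.fetchall()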

Consider adding a reranker after retrieval to improve precision; see rag-reranking.
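
A minimal sketch with Voyage's reranker: over-retrieve candidates, then keep the top k by reranker score. The rerank-2 model name and result fields are assumptions; check the voyageai docs for the current API:

def retrieve_reranked(query: str, k: int = 5, candidates: int = 20) -> list[dict]:
    # Pull a wide candidate set, then let the cross-encoder reorder it.
    results = retrieve(query, k=candidates)
    reranking = vc.rerank(
        query, [r["content"] for r in results], model="rerank-2", top_k=k
    )
    return [results[r.index] for r in reranking.results]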

5. Ground the answer with citations

Inject retrieved chunks into the system prompt and instruct the model to cite sources.

import anthropic
 
client = anthropic.Anthropic()
 
def answer(query: str) -> str:
    results = retrieve(query)
    context = "\n\n".join(
        f"[{r['source']}]\n{r['content']}" for r in results
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=f"Answer using only the sources below. Cite [source] inline.\n\n{context}",
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text

See rag-citations for citation format standards.

Verify it worked

# Assumes the code above is saved as pipeline.py.
python - <<'EOF'
from pipeline import answer
print(answer("What is the onboarding process?"))
EOF
# Expected: an answer with at least one [handbook.md#N] citation.
 
# Check vector count:
psql $DATABASE_URL -c "SELECT count(*) FROM chunks;"

Common errors

  • Retrieval returns irrelevant chunks. The query input_type is wrong or chunks are too large. Re-embed with the correct type or reduce chunk size.
  • psycopg.errors.UndefinedFunction: operator does not exist. The pgvector extension is not installed in this database. Run CREATE EXTENSION vector; in the target database.
  • Embedding dimension mismatch. The vector(N) column dimension must match the model output. voyage-3 outputs 1024; text-embedding-3-small outputs 1536 by default.
  • Slow queries on large tables. Check with EXPLAIN ANALYZE that the query actually uses the HNSW index; the planner only applies it to ORDER BY embedding <=> ... LIMIT queries. Run VACUUM ANALYZE after bulk inserts, and raise hnsw.ef_search if recall rather than latency is the problem.
  • Model hallucinates despite context. The retrieved chunks do not contain the answer. Check recall with retrieve() before blaming the model, as in the snippet below; improve chunking or increase k.
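
A quick recall check: print what retrieval returns for a failing question and eyeball the scores and sources.

for r in retrieve("What is the onboarding process?", k=10):
    print(f"{r['score']:.3f}  {r['source']}  {r['content'][:80]}")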