Overview
Retrieval-Augmented Generation (RAG) gives an LLM access to a private knowledge base without fine-tuning. Documents are split into chunks, embedded into vectors, stored in a vector database, and retrieved at query time so the model can cite sources. This guide uses pgvector as the store. The full retrieval strategy lives in rag-retrieval and chunking rules in rag-chunking.
Prerequisites
- Python 3.11 or newer with a virtual environment active.
- A Postgres instance with the pgvector extension. Follow configure-pgvector-on-neon or run locally.
- An embeddings API key. Voyage AI (`voyageai`) or OpenAI (`openai`) both work; the examples use Voyage.

Install the Python dependencies:

```bash
pip install pgvector "psycopg[binary]" voyageai langchain-text-splitters
```
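Before indexing anything, it is worth confirming the extension is actually enabled in the target database. A quick check (assumes `DATABASE_URL` is set in your environment):

```python
import os

import psycopg

# pg_extension lists installed extensions; no row means pgvector is missing.
with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    row = conn.execute(
        "SELECT extversion FROM pg_extension WHERE extname = 'vector'"
    ).fetchone()

print(f"pgvector {row[0]} installed" if row else "pgvector NOT installed")
```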
Steps
1. Choose a chunk strategy
Chunk size is the biggest quality lever in a RAG pipeline. Chunks of roughly 512 tokens balance specificity and context: smaller chunks match queries precisely but lose surrounding context, while larger chunks preserve context but rank poorly. Use sentence-aware splitting to avoid cutting mid-sentence.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Note: chunk_size is measured in characters by default; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder() for token-based sizing.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)

with open("docs/handbook.md") as f:
    raw = f.read()

chunks = splitter.split_text(raw)
# Each chunk is a plain string; attach metadata separately.
```

See rag-chunking for parent-child chunking, semantic chunking, and when to use each.
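Since `split_text` returns bare strings, you carry the citation metadata yourself. A minimal sketch (the dict shape is an illustrative choice, not a library requirement):

```python
from pathlib import Path

def chunk_with_metadata(path: str) -> list[dict]:
    # Pair each chunk with the source label used for citations later.
    chunks = splitter.split_text(Path(path).read_text())
    return [
        {"source": f"{path}#{i}", "content": chunk}
        for i, chunk in enumerate(chunks)
    ]

records = chunk_with_metadata("docs/handbook.md")
```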
2. Embed the chunks
Embed in batches; embedding APIs charge per token and have rate limits.
```python
import voyageai

vc = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

BATCH_SIZE = 128

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    vectors = []
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        result = vc.embed(batch, model="voyage-3", input_type="document")
        vectors.extend(result.embeddings)
    return vectors

vectors = embed_chunks(chunks)
```

Use `input_type="document"` when indexing and `input_type="query"` at query time; mismatching them degrades recall. See embeddings for a model comparison.
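On large corpora the batch loop can still trip rate limits. A minimal retry sketch with exponential backoff (catching the broad `Exception` is a simplification; narrow it to the client's rate-limit error in production):

```python
import time

def embed_with_retry(batch: list[str], retries: int = 5) -> list[list[float]]:
    for attempt in range(retries):
        try:
            return vc.embed(batch, model="voyage-3", input_type="document").embeddings
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s
```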
3. Store in pgvector
Create the table with a vector column whose dimension matches your model (1024 for `voyage-3`). The HNSW index enables approximate nearest-neighbor search.
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    source    text NOT NULL,
    content   text NOT NULL,
    embedding vector(1024) NOT NULL
);

CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```

Insert rows with psycopg:
```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect(DATABASE_URL)  # your Postgres connection string
register_vector(conn)  # teaches psycopg to send numpy arrays as vector values

with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO chunks (source, content, embedding) VALUES (%s, %s, %s)",
        [
            (f"handbook.md#{i}", chunk, np.array(vec))
            for i, (chunk, vec) in enumerate(zip(chunks, vectors))
        ],
    )
conn.commit()
```
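HNSW indexes build faster when created after the data is loaded, and the build itself can be tuned. A sketch of rebuilding the index with explicit parameters (the values shown are pgvector's defaults; treat them as starting points):

```python
with conn.cursor() as cur:
    # m is the max connections per graph node; ef_construction is the
    # build-time candidate list size. Larger values improve recall at
    # the cost of build time and memory.
    cur.execute("DROP INDEX IF EXISTS chunks_embedding_idx")
    cur.execute(
        "CREATE INDEX chunks_embedding_idx "
        "ON chunks USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64)"
    )
conn.commit()
```

`chunks_embedding_idx` is the name Postgres auto-generates for the unnamed index created in the DDL above.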
4. Retrieve top-k at query time
Embed the user query with `input_type="query"` and run a cosine-distance search.
```python
def retrieve(query: str, k: int = 5) -> list[dict]:
    q_vec = np.array(
        vc.embed([query], model="voyage-3", input_type="query").embeddings[0]
    )
    with conn.cursor(row_factory=psycopg.rows.dict_row) as cur:
        cur.execute(
            """
            SELECT source, content,
                   1 - (embedding <=> %s::vector) AS score
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (q_vec, q_vec, k),
        )
        return cur.fetchall()
```

Consider adding a reranker after retrieval to improve precision; see rag-reranking.
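As a sketch of that follow-up stage, the same Voyage client exposes a rerank endpoint; the model name below is an assumption, so check Voyage's current model list before relying on it:

```python
def retrieve_and_rerank(query: str, k: int = 20, top_k: int = 5) -> list[dict]:
    # Over-fetch with vector search, then keep the reranker's best few.
    candidates = retrieve(query, k=k)
    reranked = vc.rerank(
        query,
        [c["content"] for c in candidates],
        model="rerank-2",  # assumed model name
        top_k=top_k,
    )
    return [candidates[r.index] for r in reranked.results]
```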
5. Ground the answer with citations
Inject retrieved chunks into the system prompt and instruct the model to cite sources.
```python
import anthropic

client = anthropic.Anthropic()

def answer(query: str) -> str:
    results = retrieve(query)
    context = "\n\n".join(
        f"[{r['source']}]\n{r['content']}" for r in results
    )
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=f"Answer using only the sources below. Cite [source] inline.\n\n{context}",
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text
```

See rag-citations for citation format standards.
Verify it worked
```bash
python - <<'EOF'
from pipeline import answer
print(answer("What is the onboarding process?"))
EOF
# Expected: an answer with at least one [handbook.md#N] citation.

# Check the vector count:
psql $DATABASE_URL -c "SELECT count(*) FROM chunks;"
```

Common errors
- Retrieval returns irrelevant chunks. The query `input_type` is wrong or the chunks are too large. Re-embed with the correct type or reduce the chunk size.
- `psycopg.errors.UndefinedFunction: operator does not exist`. The pgvector extension is not installed in this database. Run `CREATE EXTENSION vector;` in the target database.
- Embedding dimension mismatch. The `vector(N)` column dimension must match the model output: `voyage-3` outputs 1024; `text-embedding-3-small` outputs 1536 by default. A quick check is sketched after this list.
- Slow queries on large tables. The HNSW index was created before rows were inserted. Drop and recreate it, or run `VACUUM ANALYZE`.
- Model hallucinates despite context. The retrieved chunks do not contain the answer. Check recall with `retrieve()` before blaming the model; improve chunking or increase `k`.
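For the dimension mismatch case, a minimal check against live data (assumes the `chunks` table already has rows and `vectors` holds freshly embedded output):

```python
with conn.cursor() as cur:
    # vector_dims() reports the dimension of a stored pgvector value.
    cur.execute("SELECT vector_dims(embedding) FROM chunks LIMIT 1")
    stored_dim = cur.fetchone()[0]

model_dim = len(vectors[0])
assert model_dim == stored_dim, (
    f"model emits {model_dim}-d vectors, but stored vectors are {stored_dim}-d"
)
```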