Overview

Chunking decides what the retriever can see. A chunk that splits mid-sentence or strips a heading hides the answer from the embedding model and from the user. The rules below set the boundary policy, the size window, and the metadata payload every chunk needs. For retrieval over those chunks, see rag-retrieval. For the embedding step, see embeddings.

Split on semantic boundaries, not character counts

Walk the document structure first; enforce a token budget second. A 500-token chunk that starts mid-sentence loses to a 700-token chunk that respects the section.

  • Split order for Markdown: #, ##, ###, code-fence boundaries, then blank lines, with sentence boundaries as a last resort. Fences outrank blank lines so a fenced block never splits mid-code (a sketch follows below).
  • For HTML, split on <h1>, <h2>, <section>, <article>. Strip <nav> and <footer> first.
  • For code, split on top-level declarations (function, class, module), not braces. A half-function chunk teaches the model nothing.

Fixed character splits are the wrong default. They survive only because the libraries that ship them are easy to use.
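
A minimal sketch of the Markdown pass. The whitespace token counter is a placeholder (see the tokenizer rule below), and this version does not protect code fences; a production splitter must honor them per the order above.

import re

def split_markdown(text, max_tokens=800, count_tokens=lambda t: len(t.split())):
    # Pass 1: split at #, ##, ### headings, keeping each heading with its body.
    sections = [s for s in re.split(r"(?m)^(?=#{1,3} )", text) if s.strip()]
    chunks = []
    for section in sections:
        if count_tokens(section) <= max_tokens:
            chunks.append(section.strip())
            continue
        # Pass 2: oversized sections fall back to blank-line (paragraph) splits.
        chunks.extend(p.strip() for p in re.split(r"\n\s*\n", section) if p.strip())
    return chunks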

Target 200 to 800 tokens per chunk

Below 200 tokens the chunk loses the context that makes it retrievable; above 800 the embedding gets fuzzy because the vector averages over too many topics.

  • 200 to 400 tokens: short FAQ entries, definitions, code snippets.
  • 400 to 800 tokens: documentation sections, blog paragraphs, API reference entries.
  • Over 800: split. The cost of one more chunk is lower than the cost of a missed retrieval.

Use a real tokenizer (the model’s own, ideally) for the budget. Character heuristics undercount code and overcount English.
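
For OpenAI-family embedders, tiktoken gives the exact count; other models ship their own tokenizers. A sketch:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by text-embedding-3-*

def token_count(text):
    return len(enc.encode(text))

def in_budget(text, lo=200, hi=800):
    return lo <= token_count(text) <= hi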

Overlap adjacent chunks by 10 to 20 percent

Facts that fall on a boundary stay retrievable only when both chunks contain them. A 50 to 100 token overlap is the standard insurance.

def chunk_with_overlap(tokens, size=512, overlap=64):
    # Slide a fixed window; consecutive chunks share `overlap` tokens.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    out = []
    i = 0
    while i < len(tokens):
        out.append(tokens[i : i + size])
        if i + size >= len(tokens):
            break  # the window reached the end; another step would only duplicate
        i += size - overlap
    return out

Overlap above 20 percent is waste. The same fact gets embedded twice and competes with itself at retrieval time.

Attach metadata to every chunk

Metadata is the second axis of retrieval. A chunk without metadata is half a chunk.

  • Required keys: source_url, title, heading_path, chunk_index, last_updated.
  • Optional but useful: lang, author, tenant_id, tags, doc_type.
  • Store the heading path as a string ("Postgres > Indexes > BRIN") so it renders in citations and filters cleanly.
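
One chunk record with the keys above, as a sketch; every value here is illustrative.

chunk = {
    "text": "BRIN indexes store min/max summaries per block range ...",
    "metadata": {
        "source_url": "https://example.com/docs/postgres/indexes",  # hypothetical
        "title": "Postgres Indexing Guide",
        "heading_path": "Postgres > Indexes > BRIN",
        "chunk_index": 12,
        "last_updated": "2024-05-01",
        "lang": "en",            # optional keys start here
        "doc_type": "reference",
    },
}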

The filter step uses metadata before the vector step touches the index. See chromadb for the where clause pattern and postgres-full-text-search for the keyword arm of hybrid retrieval.

Use propositional chunking for dense reference text

When the source is encyclopedia-style (one fact per sentence, no narrative flow), break it into atomic propositions instead of paragraphs. Each chunk becomes a single self-contained claim.

  • Generate propositions with a small model: “Rewrite this paragraph as standalone sentences, each understandable without its neighbors.”
  • Store the original paragraph as a parent chunk; retrieve propositions, return the parent (sketched after this list).
  • Propositional chunks lift recall on fact-extraction queries by 10 to 30 percent over paragraph chunks.
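
The generate-and-link steps, as a sketch; generate stands in for any small-model completion call and is not a real API.

PROMPT = ("Rewrite this paragraph as standalone sentences, "
          "each understandable without its neighbors.")

def propositionize(paragraph, parent_id, generate):
    # One proposition per output line, each linked back to its parent chunk.
    raw = generate(f"{PROMPT}\n\n{paragraph}")
    return [
        {"text": line.strip(), "metadata": {"parent_id": parent_id}}
        for line in raw.splitlines()
        if line.strip()
    ]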

Skip propositional chunking on storytelling or argumentative prose; the flow is the meaning.

Retrieve chunks, return documents when needed

The chunk is the retrieval unit; the document is sometimes the answer unit. Decide per use case.

  • QA over a knowledge base: retrieve chunks, prompt with chunks, cite chunks. See rag-citations.
  • Summarization or “show me the file”: retrieve chunks, return the parent document. The chunk is just the hook.
  • Long-form briefings: retrieve chunks, expand to a fixed window around each (parent paragraph, parent section).

The “small-to-big” pattern (retrieve small, return big) keeps recall high while giving the model enough context to write.
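
A sketch of the pattern; retrieve_chunks and load_parent stand in for your retriever and document store.

def small_to_big(query, retrieve_chunks, load_parent, k=8):
    # Retrieve at chunk granularity, then widen to each hit's parent document.
    seen, parents = set(), []
    for hit in retrieve_chunks(query, k=k):
        parent_id = hit["metadata"]["source_url"]  # reuses a required metadata key
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(load_parent(parent_id))
    return parents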

Re-chunk when the embedding model changes

Chunk size is tied to the embedding model’s context window and training distribution. A new model is a new chunking decision.

  • Re-evaluate the size window when switching from text-embedding-3-small to Voyage 3 or a long-context embedder.
  • Re-ingest the corpus; do not mix chunk sizes in one collection.
  • Record the chunker version in the collection metadata.
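
Recording the version at ingest time, sketched against Chroma's client; the metadata keys are a convention, not an API requirement.

import chromadb

client = chromadb.PersistentClient(path="./index")
collection = client.create_collection(
    name="docs_v2",
    metadata={
        "embedding_model": "text-embedding-3-small",  # illustrative values
        "chunker_version": "semantic-md-1.3",
        "chunk_window_tokens": "200-800",
        "overlap_tokens": 64,
    },
)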

A chunking change without a re-embed is a silent regression. Pair it with the eval suite in rag-eval.