Overview

Dimension is the single biggest lever on index storage and query latency. Matryoshka training lets modern models truncate gracefully without a separate re-training step. The rule is: start at half the model maximum, measure recall on your golden set, then go smaller if cost matters and larger only if recall drops below your threshold. For the evaluation harness, see embeddings-eval.

Matryoshka truncation preserves most quality down to 256 dimensions

Matryoshka Representation Learning (MRL) trains the first N dimensions of the vector to be nearly as informative as the full vector. Models that support it include Voyage 3, Voyage 3 Large, OpenAI text-embedding-3-*, and Nomic Embed v1.5. When you request a smaller dimension from these APIs, you are not truncating post-hoc; the model outputs only the informative prefix.

# Voyage: pass output_dimension at embed time
embeddings = client.embed(texts, model="voyage-3-large", output_dimension=512).embeddings
 
# OpenAI: pass dimensions parameter
resp = oai.embeddings.create(input=texts, model="text-embedding-3-large", dimensions=512)

For models that do not natively support truncation, never truncate by slicing the raw output; retrain or use a different model.
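For an MRL-trained model you host yourself, where there is no dimension parameter to pass, prefix-slicing followed by re-normalization is the legitimate equivalent of the API path above. A minimal NumPy sketch (the function name is illustrative, and it is only valid for Matryoshka-trained models):

```python
import numpy as np

def truncate_mrl(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the informative prefix of an MRL-trained vector, then re-normalize.
    Do NOT use this on models that were not trained with MRL."""
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)
```

The re-normalization is not optional: the prefix of a unit vector is not itself unit-norm, and cosine-as-dot-product retrieval silently degrades without it.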

The 256 / 384 / 512 / 768 / 1024 / 1536 / 3072 decision tree

| Dimension | Storage per million docs | Typical recall loss vs max | Best for |
| --- | --- | --- | --- |
| 256 | ~1 GB (float32) | 3 to 6 percent | Mobile, edge, cost-first |
| 384 | ~1.5 GB | 2 to 4 percent | Lightweight semantic search |
| 512 | ~2 GB | 1 to 2 percent | Most production RAG |
| 768 | ~3 GB | < 1 percent | Code retrieval, technical docs |
| 1024 | ~4 GB | < 0.5 percent | High-precision legal, biomedical |
| 1536 | ~6 GB | Near zero | OpenAI default, balanced |
| 3072 | ~12 GB | Zero (baseline) | Maximum quality, cost tolerated |

Storage figures are approximate for float32; use int8 quantization to cut them by 4x with minimal recall loss on most models.
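The storage column follows directly from docs × dimension × bytes per component. A tiny helper (name illustrative) makes the arithmetic explicit and lets you plug in int8:

```python
def index_storage_bytes(n_docs: int, dim: int, dtype_bytes: int = 4) -> int:
    """Raw vector payload for an index: docs * dims * bytes per component.
    dtype_bytes: 4 for float32, 1 for int8 quantization."""
    return n_docs * dim * dtype_bytes

# 1M docs at 512 dims: ~2 GB in float32, ~0.5 GB in int8
float32_size = index_storage_bytes(1_000_000, 512)          # 2_048_000_000
int8_size = index_storage_bytes(1_000_000, 512, dtype_bytes=1)
```

This is payload only; the ANN graph structure discussed below adds its own overhead on top.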

Apply the halve-and-verify rule before fixing a dimension

Start at the model maximum. Halve the dimension. Measure recall@10 on your golden set. If recall drops less than 1 percent, halve again. Stop when recall drops more than 1 percent or when you reach the minimum acceptable for your use case. This converges in two to three iterations and avoids choosing a dimension based on intuition.

def halve_and_verify(baseline_recall, dims_to_test, embed_fn, eval_fn):
    """Walk dimensions from largest to smallest; return the smallest one
    whose recall@10 stays within 1 percent of the full-dimension baseline."""
    best_dim = None
    for dim in dims_to_test:
        vecs = embed_fn(dim)     # re-embed the corpus at this dimension
        recall = eval_fn(vecs)   # recall@10 of queries against the golden set
        delta = baseline_recall - recall
        print(f"dim={dim}  recall={recall:.3f}  delta={delta:.3f}")
        if delta > 0.01:
            print("Recall dropped more than 1 percent; stop at the previous dimension.")
            break
        best_dim = dim
    return best_dim

Re-normalize after truncation

Truncation breaks the unit-norm property of vectors embedded at full dimension. Always re-normalize after any truncation step. See embeddings-normalization for the standard normalize function and the consequences of skipping this step.
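The canonical helper lives in embeddings-normalization; a minimal NumPy sketch of the operation itself:

```python
import numpy as np

def normalize(vecs: np.ndarray) -> np.ndarray:
    """L2-normalize each row so cosine similarity reduces to a dot product."""
    norms = np.linalg.norm(vecs, axis=-1, keepdims=True)
    return vecs / norms
```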

Index memory scales with dimension in graph-based ANN indexes

HNSW graphs in pgvector and ChromaDB store per-node neighbor links on top of the vector payload. The payload grows linearly with dimension and dominates memory at scale, and every distance computation during graph traversal grows with it too, so a higher dimension raises both RAM and per-query cost. At 3072 dimensions, an HNSW index over 10 million documents can exceed 80 GB of RAM; choosing 512 dimensions cuts the vector payload by 6x, to roughly 13 GB. For storage and index selection, see chromadb and pinecone-vs-pgvector.
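A rough sizing sketch, assuming float32 vectors and about 2·M link slots of 4 bytes per node (M defaults to 16 in pgvector). The link constant is an approximation, not an exact engine internal:

```python
def hnsw_memory_bytes(n_docs: int, dim: int, m: int = 16, dtype_bytes: int = 4) -> int:
    """Back-of-envelope HNSW footprint: vector payload plus neighbor links.
    Assumes ~2*m link slots of 4 bytes per node; upper layers are negligible."""
    vectors = n_docs * dim * dtype_bytes
    links = n_docs * 2 * m * 4
    return vectors + links
```

Plugging in 10 million documents shows the payload dominating: the links term is about 1.3 GB regardless of dimension, while the vectors term swings from ~20 GB at 512 dims to ~123 GB at 3072.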

Mixing dimensions across models is a silent bug

Vectors from text-embedding-3-large at 1536 dimensions are not comparable to Voyage 3 vectors at 1536 dimensions. They live in different geometric spaces. Never mix model outputs in the same index, even when the dimension matches. Tag every stored vector with the model ID and version. See embeddings for the model-version cache key pattern.
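One defensive pattern for the tagging requirement, sketched with illustrative names (the canonical cache-key pattern is in embeddings):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    """A vector plus the identity of the model that produced it."""
    values: tuple
    model_id: str       # e.g. "text-embedding-3-large"
    model_version: str  # pin the exact version, not just the family
    dimension: int

def assert_compatible(a: StoredVector, b: StoredVector) -> None:
    """Refuse to compare vectors from different embedding spaces,
    even when their dimensions happen to match."""
    if (a.model_id, a.model_version, a.dimension) != (b.model_id, b.model_version, b.dimension):
        raise ValueError("vectors come from different embedding spaces")
```

Checking model identity at comparison time turns the silent bug into a loud one.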