Overview
Dimension is the single biggest lever on index storage and query latency. Matryoshka training lets modern models truncate gracefully without a separate re-training step. The rule: baseline recall at the model maximum, halve the dimension and measure recall on your golden set, keep halving while recall holds, and step back up only when recall falls below your threshold. For the evaluation harness, see embeddings-eval.
Matryoshka truncation preserves most quality down to 256 dimensions
Matryoshka Representation Learning (MRL) trains the first N dimensions of the vector to be nearly as informative as the full vector. Models that support it include Voyage 3, Voyage 3 Large, OpenAI text-embedding-3-*, and Nomic Embed v1.5. When you request a smaller dimension from these APIs, you are not truncating post-hoc; the model outputs only the informative prefix.
# Voyage: pass output_dimension at embed time
import voyageai
client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
embeddings = client.embed(texts, model="voyage-3-large", output_dimension=512).embeddings

# OpenAI: pass the dimensions parameter
from openai import OpenAI
oai = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = oai.embeddings.create(input=texts, model="text-embedding-3-large", dimensions=512)

For models that do not natively support truncation, never truncate by slicing the raw output; retrain or use a different model.
The 256 / 384 / 512 / 768 / 1024 / 1536 / 3072 decision tree
| Dimension | Storage per 1M docs (float32) | Typical recall loss vs max | Best for |
|---|---|---|---|
| 256 | ~1 GB | 3 to 6 percent | Mobile, edge, cost-first |
| 384 | ~1.5 GB | 2 to 4 percent | Lightweight semantic search |
| 512 | ~2 GB | 1 to 2 percent | Most production RAG |
| 768 | ~3 GB | < 1 percent | Code retrieval, technical docs |
| 1024 | ~4 GB | < 0.5 percent | High-precision legal, biomedical |
| 1536 | ~6 GB | Near zero | OpenAI default, balanced |
| 3072 | ~12 GB | Zero (baseline) | Maximum quality, cost tolerated |
Storage figures are approximate for float32; use int8 quantization to cut them by 4x with minimal recall loss on most models.
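To sanity-check these figures or project your own corpus, the arithmetic is just docs x dimensions x bytes per value. A minimal sketch (the helper name is illustrative; 4 bytes assumes float32, 1 byte assumes int8):

def index_size_gb(num_docs, dim, bytes_per_value=4):
    # Raw vector payload only: 4 bytes/value for float32, 1 for int8.
    return num_docs * dim * bytes_per_value / 1e9

index_size_gb(1_000_000, 512)                     # ~2.0 GB, matches the table
index_size_gb(1_000_000, 512, bytes_per_value=1)  # ~0.5 GB after int8 quantization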
Apply the halve-and-verify rule before committing to a dimension
Start at the model maximum. Halve the dimension. Measure recall@10 on your golden set. If recall drops less than 1 percent, halve again. Stop when recall drops more than 1 percent or when you reach the minimum acceptable for your use case. This converges in two to three iterations and avoids choosing a dimension based on intuition.
def halve_and_verify(corpus, queries, golden, baseline_recall, dims_to_test, embed_fn, eval_fn):
    """Walk down the candidate dimensions; stop when recall@10 drops more than 1%."""
    chosen = None
    for dim in dims_to_test:  # e.g. [1536, 768, 384] for a 3072-dim model
        vecs = embed_fn(corpus, dimension=dim)
        recall = eval_fn(vecs, queries, golden)
        delta = baseline_recall - recall
        print(f"dim={dim} recall={recall:.3f} delta={delta:.3f}")
        if delta > 0.01:
            break  # the previous dimension was the last acceptable one
        chosen = dim
    return chosen

Re-normalize after truncation
Truncation breaks the unit-norm property of vectors embedded at full dimension. Always re-normalize after any truncation step. See embeddings-normalization for the standard normalize function and the consequences of skipping this step.
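For reference, the step is two lines of numpy. This is a minimal sketch under the usual L2 convention, not the canonical helper from embeddings-normalization:

import numpy as np

def truncate_and_normalize(vecs, dim):
    # Keep the trained prefix, then rescale each row back to unit length.
    out = np.asarray(vecs, dtype=np.float32)[:, :dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)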
Vector payload, not graph links, dominates ANN index memory
HNSW graphs in pgvector and ChromaDB store neighbor links per node, but that link overhead is roughly fixed per node and does not grow with dimension. The vector payload does grow, linearly: at 3072 dimensions in float32, the vectors for 10 million documents occupy roughly 123 GB of RAM before any graph overhead. Choosing 512 dimensions cuts that to roughly 20 GB. For storage and index selection, see chromadb and pinecone-vs-pgvector.
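A rough estimator for capacity planning, assuming float32 payloads and a dimension-independent per-link cost; M=16 and 4-byte links are illustrative HNSW defaults, not exact figures for pgvector or ChromaDB:

def hnsw_memory_gb(num_docs, dim, m=16, link_bytes=4):
    vectors = num_docs * dim * 4           # float32 payload, linear in dimension
    links = num_docs * 2 * m * link_bytes  # ~2M layer-0 links per node, independent of dimension
    return (vectors + links) / 1e9

hnsw_memory_gb(10_000_000, 3072)  # ~124 GB
hnsw_memory_gb(10_000_000, 512)   # ~22 GB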
Mixing dimensions across models is a silent bug
Vectors from text-embedding-3-large at 1536 dimensions are not comparable to Voyage 3 vectors at 1536 dimensions. They live in different geometric spaces. Never mix model outputs in the same index, even when the dimension matches. Tag every stored vector with the model ID and version. See embeddings for the model-version cache key pattern.
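One way to make the tag unavoidable is to build it into the record you write to the index. A sketch; the helper and field names are hypothetical, and the full model-version cache key pattern is in embeddings:

def make_vector_record(doc_id, vector, model_id, model_version):
    # Mismatched spaces should fail loudly at query time, not return garbage silently.
    return {
        "id": doc_id,
        "vector": vector,
        "metadata": {
            "embedding_model": model_id,        # e.g. "text-embedding-3-large"
            "embedding_version": model_version, # bump on any model or dimension change
            "dimension": len(vector),
        },
    }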