Overview
Dimension is the single biggest lever on index storage and query latency. Matryoshka training lets modern models truncate gracefully without a separate re-training step. The rule: baseline recall at the model maximum, halve the dimension and measure recall on your golden set, keep halving while recall holds, and step back up only when recall falls below your threshold. For the evaluation harness, see embeddings-eval.
Matryoshka truncation preserves most quality down to 256 dimensions
Matryoshka Representation Learning (MRL) trains the first N dimensions of the vector to be nearly as informative as the full vector. Models that support it include Voyage 3, Voyage 3 Large, OpenAI text-embedding-3-*, and Nomic Embed v1.5. When you request a smaller dimension from these APIs, you are not truncating post-hoc; the model outputs only the informative prefix.
# Voyage: pass output_dimension at embed time
import voyageai
client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
embeddings = client.embed(texts, model="voyage-3-large", output_dimension=512).embeddings

# OpenAI: pass the dimensions parameter
from openai import OpenAI
oai = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = oai.embeddings.create(input=texts, model="text-embedding-3-large", dimensions=512)

For models that do not natively support truncation, never truncate by slicing the raw output; retrain or use a different model.
The 256 / 384 / 512 / 768 / 1024 / 1536 / 3072 decision tree
| Dimension | Storage per 1M docs (float32) | Typical recall loss vs max | Best for |
|---|---|---|---|
| 256 | ~1 GB | 3 to 6 percent | Mobile, edge, cost-first |
| 384 | ~1.5 GB | 2 to 4 percent | Lightweight semantic search |
| 512 | ~2 GB | 1 to 2 percent | Most production RAG |
| 768 | ~3 GB | < 1 percent | Code retrieval, technical docs |
| 1024 | ~4 GB | < 0.5 percent | High-precision legal, biomedical |
| 1536 | ~6 GB | Near zero | OpenAI default, balanced |
| 3072 | ~12 GB | Zero (baseline) | Maximum quality, cost tolerated |
Storage figures are approximate for float32; use int8 quantization to cut them by 4x with minimal recall loss on most models.
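To sanity-check these figures or project your own corpus, the arithmetic is just docs x dimensions x bytes per value. A minimal sketch (the helper name is illustrative; 4 bytes assumes float32, 1 byte assumes int8):

def index_size_gb(num_docs, dim, bytes_per_value=4):
    # Raw vector payload only: 4 bytes/value for float32, 1 for int8.
    return num_docs * dim * bytes_per_value / 1e9

index_size_gb(1_000_000, 512)                     # ~2.0 GB, matches the table
index_size_gb(1_000_000, 512, bytes_per_value=1)  # ~0.5 GB after int8 quantization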
Apply the halve-and-verify rule before committing to a dimension
Start at the model maximum. Halve the dimension. Measure recall@10 on your golden set. If recall drops less than 1 percent, halve again. Stop when recall drops more than 1 percent or when you reach the minimum acceptable for your use case. This converges in two to three iterations and avoids choosing a dimension based on intuition.
def halve_and_verify(corpus, queries, golden, baseline_recall, dims_to_test, embed_fn, eval_fn):
    """Walk down the candidate dimensions; stop when recall@10 drops more than 1%."""
    chosen = None
    for dim in dims_to_test:  # e.g. [1536, 768, 384] for a 3072-dim model
        vecs = embed_fn(corpus, dimension=dim)
        recall = eval_fn(vecs, queries, golden)
        delta = baseline_recall - recall
        print(f"dim={dim} recall={recall:.3f} delta={delta:.3f}")
        if delta > 0.01:
            break  # the previous dimension was the last acceptable one
        chosen = dim
    return chosen

Re-normalize after truncation
Truncation breaks the unit-norm property of vectors embedded at full dimension. Always re-normalize after any truncation step. See embeddings-normalization for the standard normalize function and the consequences of skipping this step.
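For reference, the step is two lines of numpy. This is a minimal sketch under the usual L2 convention, not the canonical helper from embeddings-normalization:

import numpy as np

def truncate_and_normalize(vecs, dim):
    # Keep the trained prefix, then rescale each row back to unit length.
    out = np.asarray(vecs, dtype=np.float32)[:, :dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)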
Vector payload, not graph links, dominates ANN index memory
HNSW graphs in pgvector and ChromaDB store neighbor links per node, but that link overhead is roughly fixed per node and does not grow with dimension. The vector payload does grow, linearly: at 3072 dimensions in float32, the vectors for 10 million documents occupy roughly 123 GB of RAM before any graph overhead. Choosing 512 dimensions cuts that to roughly 20 GB. For storage and index selection, see chromadb and pinecone-vs-pgvector.
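A rough estimator for capacity planning, assuming float32 payloads and a dimension-independent per-link cost; M=16 and 4-byte links are illustrative HNSW defaults, not exact figures for pgvector or ChromaDB:

def hnsw_memory_gb(num_docs, dim, m=16, link_bytes=4):
    vectors = num_docs * dim * 4           # float32 payload, linear in dimension
    links = num_docs * 2 * m * link_bytes  # ~2M layer-0 links per node, independent of dimension
    return (vectors + links) / 1e9

hnsw_memory_gb(10_000_000, 3072)  # ~124 GB
hnsw_memory_gb(10_000_000, 512)   # ~22 GB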
Mixing dimensions across models is a silent bug
Vectors from text-embedding-3-large at 1536 dimensions are not comparable to Voyage 3 vectors at 1536 dimensions. They live in different geometric spaces. Never mix model outputs in the same index, even when the dimension matches. Tag every stored vector with the model ID and version. See embeddings for the model-version cache key pattern.
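One way to make the tag unavoidable is to build it into the record you write to the index. A sketch; the helper and field names are hypothetical, and the full model-version cache key pattern is in embeddings:

def make_vector_record(doc_id, vector, model_id, model_version):
    # Mismatched spaces should fail loudly at query time, not return garbage silently.
    return {
        "id": doc_id,
        "vector": vector,
        "metadata": {
            "embedding_model": model_id,        # e.g. "text-embedding-3-large"
            "embedding_version": model_version, # bump on any model or dimension change
            "dimension": len(vector),
        },
    }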