Overview
Embeddings map text (or code, or images) to dense vectors so a vector store can answer “what is similar to this.” The choice of model, dimension, and similarity metric defines the quality ceiling of any retrieval system. For how the vectors get used, see rag. For storage, see chromadb.
Pick the model that fits the corpus
The strongest English-and-code default in 2026 is Voyage 3 (or Voyage 3 Large where budget allows). For OpenAI shops, text-embedding-3-large or text-embedding-3-small are the cheaper-but-capable picks. For open-weight, self-hosted use, nomic-embed-text-v1.5 and bge-large-en-v1.5 are the workhorses.
- Mixed English and code: Voyage 3.
- OpenAI-only stack already: text-embedding-3-small for cost, -large for quality.
- Self-hosted, no API: nomic-embed-text-v1.5 (768 dim) or bge-large-en-v1.5 (1024 dim) on ollama or a dedicated embedding server.
- Multilingual: Voyage multilingual-2 or BGE-M3.
Run a small bake-off on your golden set before committing. Embedding quality varies more by domain than benchmark tables suggest.
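The bake-off itself reduces to computing recall@k per candidate model over the golden set. A minimal NumPy sketch, assuming each query has exactly one known-relevant document (the function name and signature are illustrative):

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, relevant_idx, k=10):
    """Fraction of queries whose relevant doc appears in the top-k neighbors."""
    q = np.array(query_vecs, dtype=np.float32)
    d = np.array(doc_vecs, dtype=np.float32)
    # Unit-normalize so the dot product below is cosine similarity
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    scores = q @ d.T
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [rel in row for rel, row in zip(relevant_idx, topk)]
    return sum(hits) / len(hits)
```

Run it once per candidate model on the same queries and documents; the ranking of models on your own corpus is the number that matters, not the leaderboard.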
Use Matryoshka truncation to control dimension
Modern embedding models (Voyage 3, OpenAI v3, Nomic v1.5) are trained so that truncating the vector preserves most of the quality. Pick the dimension that fits your storage and latency budget.
- 256 dim: cheap storage, fastest ANN search, 1 to 3 percent recall loss.
- 512 dim: a common sweet spot.
- 1024 to 1536 dim: maximum quality, biggest index.
```python
# Voyage and OpenAI v3 accept output_dimension
vec = client.embed(texts=[doc], model="voyage-3", output_dimension=512)
```

Truncate, do not zero-pad. Re-normalize after truncation.
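When the provider does not expose an output-dimension parameter, the same truncation can be done client-side. A minimal sketch (the helper name is illustrative; the input vector is stand-in data):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Truncate a Matryoshka-trained embedding and re-normalize to unit length."""
    v = np.asarray(vec, dtype=np.float32)[:dim]
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

full = np.random.rand(1024).astype(np.float32)  # stand-in for a real embedding
small = truncate_embedding(full, 512)
```

The re-normalization step is what keeps cosine and dot product equivalent after truncation.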
Unit-normalize every vector at write and query time
Normalize once when you embed; never trust the model output to be unit-length, and never mix normalized and unnormalized vectors in the same index.
```python
import numpy as np

def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

After normalization, cosine similarity equals dot product, which is faster on every vector engine. The two metrics are mathematically equivalent only on unit vectors; mismatched normalization is a common silent bug.
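A quick check of that equivalence, using the same normalize helper with two arbitrary vectors:

```python
import numpy as np

def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

a = normalize(np.array([1.0, 2.0, 3.0]))
b = normalize(np.array([4.0, 5.0, 6.0]))

dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# On unit vectors the denominator is 1, so cosine and dot coincide
```

On raw, unnormalized vectors the two values diverge, which is exactly the silent bug described above.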
Use cosine for normalized vectors, dot product for speed
Once vectors are normalized, prefer dot product as the metric setting on the index. It is the same answer as cosine, and most engines compute it faster.
- Pinecone, ChromaDB, pgvector: set metric to dot or ip after normalization.
- Faiss: use IndexFlatIP for exact, IndexHNSWFlat with METRIC_INNER_PRODUCT for approximate.
Euclidean distance is rarely the right choice for text embeddings; the magnitudes carry no signal after normalization.
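Exact inner-product search over a small normalized matrix makes the point concrete; this is a NumPy sketch of the search semantics, not a production ANN index:

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 64)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)  # unit-normalize rows

query = rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)

scores = index @ query          # dot product == cosine on unit vectors
top5 = np.argsort(-scores)[:5]  # highest similarity first
```

An ANN engine with metric set to ip returns (approximately) the same ordering, just without scanning every row.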
Batch aggressively for throughput
Embedding APIs charge per call and per token; latency drops sharply when you batch.
- Voyage and OpenAI: batch 64 to 128 inputs per request, up to the model’s max input tokens.
- Local (Ollama, Nomic): batch 8 to 32 depending on GPU memory.
```python
def embed_batch(texts, batch_size=128):
    out = []
    for i in range(0, len(texts), batch_size):
        out.extend(client.embed(texts[i:i+batch_size], model="voyage-3").embeddings)
    return out
```

For ingestion pipelines, parallelize batches with a small worker pool. Throughput typically scales until the API rate limit, then saturates.
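The worker-pool pattern can be sketched as follows; the embed_one_batch stub stands in for the real API client, and the pool size is a starting point to tune against your rate limit:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_one_batch(batch):
    # Stub: replace with the real client.embed(...) call.
    return [[0.0] * 512 for _ in batch]

def embed_parallel(texts, batch_size=128, workers=4):
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    out = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves batch order, so vectors line up with inputs
        for vecs in pool.map(embed_one_batch, batches):
            out.extend(vecs)
    return out
```

Threads suffice here because the work is network-bound; the GIL is released while waiting on the API.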
Key the cache on model, version, and input hash
Re-embedding the same input is pure waste; embeddings are deterministic per model version.
```python
from hashlib import sha256

cache_key = sha256(f"{model_id}:{model_version}:{input_text}".encode()).hexdigest()
```

- Include the model version. When a vendor ships a new revision, the same model name returns different vectors.
- Include input normalization in the hash if you strip whitespace or lowercase the text before embedding.
- Invalidate on model upgrade; never mix versions in the same index.
The cache pays for itself within a single re-ingestion of a corpus.
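A minimal dict-backed cache wrapping that key scheme might look like this; the model identifiers are illustrative, and a real pipeline would back the dict with Redis or disk:

```python
import hashlib

MODEL_ID, MODEL_VERSION = "voyage-3", "2025-01"  # illustrative values
_cache = {}

def cache_key(text):
    raw = f"{MODEL_ID}:{MODEL_VERSION}:{text}".encode()
    return hashlib.sha256(raw).hexdigest()

def embed_cached(text, embed_fn):
    key = cache_key(text)
    if key not in _cache:
        _cache[key] = embed_fn(text)  # only hit the API on a miss
    return _cache[key]
```

Bumping MODEL_VERSION invalidates every entry at once, which is exactly the behavior wanted on a model upgrade.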
Re-embed when the model changes
An index is tied to a model. New model means a full re-embed of the corpus, and either a swap-and-replace or a parallel index during cutover.
- Run the new model alongside the old, write to both, query the old.
- Backfill the new index in batches.
- Flip query traffic when recall on the eval set matches or exceeds the old index.
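The dual-write phase above can be sketched as a small wrapper; the class and method names are hypothetical, and the indexes are plain lists standing in for real vector stores:

```python
class DualWriter:
    """Write to both indexes during migration; read from old until cutover."""

    def __init__(self, old_index, new_index):
        self.old, self.new = old_index, new_index
        self.cut_over = False

    def write(self, vec_old, vec_new):
        # Both indexes stay current throughout the backfill
        self.old.append(vec_old)
        self.new.append(vec_new)

    def read_index(self):
        return self.new if self.cut_over else self.old

    def maybe_cut_over(self, recall_new, recall_old):
        # Flip only when eval recall on the new index matches or beats the old
        if recall_new >= recall_old:
            self.cut_over = True
```

Keeping the flip behind an explicit recall comparison makes the eval step impossible to skip by accident.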
Skipping the eval step is the standard way to ship a regression. The model upgrade is not done until the numbers prove it.