Overview

Embeddings map text (or code, or images) to dense vectors so a vector store can answer “what is similar to this.” The choice of model, dimension, and similarity metric defines the quality ceiling of any retrieval system. For how the vectors get used, see rag. For storage, see chromadb.

Pick the model that fits the corpus

The strongest English-and-code default in 2026 is Voyage 3 (or Voyage 3 Large where budget allows). For OpenAI shops, text-embedding-3-small is the cheap-but-capable pick and text-embedding-3-large the higher-quality option. For open-weight, self-hosted use, nomic-embed-text-v1.5 and bge-large-en-v1.5 are the workhorses.

  • Mixed English and code: Voyage 3.
  • OpenAI-only stack already: text-embedding-3-small for cost, -large for quality.
  • Self-hosted, no API: nomic-embed-text-v1.5 (768 dim) or bge-large-en-v1.5 (1024 dim) on ollama or a dedicated embedding server.
  • Multilingual: Voyage multilingual-2 or BGE-M3.

Run a small bake-off on your golden set before committing. Embedding quality varies more by domain than benchmark tables suggest.
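
A minimal sketch of such a bake-off, scoring recall@k over a hand-labeled golden set; embed_fn, docs, and the (query, relevant_doc_id) pairs are placeholders for your own corpus and candidate models, not part of any vendor SDK:

import numpy as np

def recall_at_k(embed_fn, docs, golden, k=5):
    # docs: {doc_id: text}; golden: [(query, relevant_doc_id), ...]
    doc_ids = list(docs)
    doc_vecs = np.array([embed_fn(docs[d]) for d in doc_ids], dtype=np.float32)
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    hits = 0
    for query, relevant_id in golden:
        q = np.asarray(embed_fn(query), dtype=np.float32)
        q /= np.linalg.norm(q)
        top_k = np.argsort(doc_vecs @ q)[::-1][:k]
        hits += relevant_id in {doc_ids[i] for i in top_k}
    return hits / len(golden)

Run it once per candidate model and compare the numbers on the same golden set.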

Use Matryoshka truncation to control dimension

Modern embedding models (Voyage 3, OpenAI v3, Nomic v1.5) are trained so that truncating the vector preserves most of the quality. Pick the dimension that fits your storage and latency budget.

  • 256 dim: cheap storage, fastest ANN search, 1 to 3 percent recall loss.
  • 512 dim: a common sweet spot between recall and index size.
  • 1024 to 1536 dim: maximum quality, biggest index.
# Voyage accepts output_dimension; the OpenAI v3 models take an equivalent dimensions parameter
vec = client.embed(texts=[doc], model="voyage-3", output_dimension=512)

Truncate, do not zero-pad. Re-normalize after truncation.
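
If the API does not expose a dimension parameter, the same truncation can be done client-side; a minimal sketch, assuming the model was trained with a Matryoshka-style objective (Voyage 3, OpenAI v3, Nomic v1.5):

import numpy as np

def truncate_embedding(vec, dim=512):
    # Keep the leading dimensions (the Matryoshka ordering), then re-normalize.
    v = np.asarray(vec, dtype=np.float32)[:dim]
    n = np.linalg.norm(v)
    return v / n if n > 0 else v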

Unit-normalize every vector at write and query time

Normalize once when you embed; never trust the model output to be unit-length, and never mix normalized and unnormalized vectors in the same index.

import numpy as np

def normalize(v):
    # Scale to unit length; leave zero vectors untouched to avoid dividing by zero.
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

After normalization, cosine similarity equals dot product, which is faster on every vector engine. The two metrics are mathematically equivalent only on unit vectors; mismatched normalization is a common silent bug.
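
A quick sanity check of that equivalence, using the normalize helper above:

a, b = np.random.rand(512), np.random.rand(512)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, np.dot(normalize(a), normalize(b)))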

Use cosine for normalized vectors, dot product for speed

Once vectors are normalized, prefer dot product as the metric setting on the index. It is the same answer as cosine, and most engines compute it faster.

  • Pinecone, ChromaDB, pgvector: set metric to dot or ip after normalization.
  • Faiss: use IndexFlatIP for exact, IndexHNSWFlat with METRIC_INNER_PRODUCT for approximate.

Euclidean distance is rarely the right choice for text embeddings: on unit vectors it produces the same ranking as cosine, and raw embedding magnitudes carry no useful signal anyway.
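
Two concrete configurations, assuming recent chromadb and faiss builds (the collection name and dimension are placeholders):

# ChromaDB: inner-product space for pre-normalized vectors
import chromadb
collection = chromadb.Client().create_collection(
    name="docs", metadata={"hnsw:space": "ip"}
)

# Faiss: exact and approximate inner-product indexes
import faiss
dim = 512
exact = faiss.IndexFlatIP(dim)
approx = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)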

Batch aggressively for throughput

Embedding APIs bill per token and rate-limit per request; per-document latency drops sharply when you batch.

  • Voyage and OpenAI: batch 64 to 128 inputs per request, up to the model’s max input tokens.
  • Local (Ollama, Nomic): batch 8 to 32 depending on GPU memory.
def embed_batch(texts, batch_size=128):
    out = []
    for i in range(0, len(texts), batch_size):
        # One API call per slice of up to batch_size inputs.
        out.extend(client.embed(texts[i:i+batch_size], model="voyage-3").embeddings)
    return out

For ingestion pipelines, parallelize batches with a small worker pool. Throughput typically scales until the API rate limit, then saturates.
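
A minimal worker-pool sketch around the same client, assuming it is thread-safe and that rate limiting is handled elsewhere:

from concurrent.futures import ThreadPoolExecutor

def embed_parallel(texts, batch_size=128, workers=4):
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves batch order, so results line up with the input texts
        results = pool.map(lambda b: client.embed(b, model="voyage-3").embeddings, batches)
    return [vec for batch in results for vec in batch]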

Key the cache on model, version, and input hash

Re-embedding the same input is pure waste; embeddings are deterministic per model version.

from hashlib import sha256
cache_key = sha256(f"{model_id}:{model_version}:{input_text}".encode()).hexdigest()

  • Include the model version. When a vendor ships a new revision, the same model name returns different vectors.
  • Include input normalization in the hash if you strip whitespace or lowercase the text before embedding.
  • Invalidate on model upgrade; never mix versions in the same index.

The cache pays for itself within a single re-ingestion of a corpus.
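
A minimal sketch of the lookup, with an in-memory dict standing in for a persistent key-value store and a placeholder version string:

from hashlib import sha256

_cache = {}  # swap for Redis, SQLite, or a table keyed by the hash

def embed_cached(text, model_id="voyage-3", model_version="v1"):
    key = sha256(f"{model_id}:{model_version}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = client.embed([text], model=model_id).embeddings[0]
    return _cache[key]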

Re-embed when the model changes

An index is tied to a model. New model means a full re-embed of the corpus, and either a swap-and-replace or a parallel index during cutover.

  • Run the new model alongside the old, write to both, query the old.
  • Backfill the new index in batches.
  • Flip query traffic when recall on the eval set matches or exceeds the old index.

Skipping the eval step is the standard way to ship a regression. The model upgrade is not done until the numbers prove it.
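
A sketch of the dual-write phase; embed_old, embed_new, and the upsert/search methods are placeholders for whatever your store actually exposes:

def ingest(doc_id, text, old_index, new_index):
    # Dual-write: the old index stays authoritative while the new one backfills.
    old_index.upsert(doc_id, embed_old(text))
    new_index.upsert(doc_id, embed_new(text))

def retrieve(query, old_index, new_index, cutover=False):
    # Flip cutover only after eval recall on the new index matches or beats the old.
    index, embed = (new_index, embed_new) if cutover else (old_index, embed_old)
    return index.search(embed(query), k=10)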