## Overview
ChromaDB can compute embeddings for you via its built-in embedding functions, or you can precompute vectors externally and pass them in. Each path has different tradeoffs for cost control, model flexibility, and testability. The underlying rule, however, is simple: one collection maps to one embedding model at one version, always. Drift in that contract produces meaningless similarity scores and silent retrieval failures.
## Pass precomputed vectors instead of using the built-in embedding function
The built-in embedding functions (OpenAI, Cohere, HuggingFace) embed text at add time as a side effect. That side effect makes tests expensive, creates implicit API dependencies, and hides per-document embedding cost.
- Compute embeddings in your own pipeline, cache them, then call `collection.add(embeddings=[...])`.
- Pass `embedding_function=None` when creating the collection to disable the built-in.
- Precomputed vectors decouple the embedding step from the storage step, letting you batch, rate-limit, and cache independently.
```python
import chromadb

from my_embed import embed_batch  # Your own embedding pipeline

client = chromadb.PersistentClient(path="./.chroma")
collection = client.get_or_create_collection(
    name="docs_support",
    embedding_function=None,
    metadata={"embedding_model": "voyage-3", "dimension": 1024},
)

texts = ["Refunds take 5 days.", "Contact support at help@example.com."]
vectors = embed_batch(texts, model="voyage-3")
collection.add(
    ids=["chunk-001", "chunk-002"],
    documents=texts,
    embeddings=vectors,
    metadatas=[{"source": "faq"}, {"source": "faq"}],
)
```

## Record the model name and dimension in collection metadata
The collection has no enforcement mechanism for embedding compatibility. Recording the model name in metadata gives every worker a machine-readable contract to check before writing.
```python
MY_MODEL = "voyage-3"  # The model this worker embeds with

collection = client.get_or_create_collection(
    name="docs_support",
    embedding_function=None,
    metadata={
        "embedding_model": "voyage-3",
        "dimension": 1024,
        "distance": "cosine",
    },
)

# Every ingest worker checks before writing
meta = collection.metadata
assert meta["embedding_model"] == MY_MODEL, "Model mismatch - abort ingest"
assert meta["dimension"] == len(vectors[0]), "Dimension mismatch - abort ingest"
```

This check catches the failure mode where a new engineer points a second service at the same collection with a different model and silently corrupts the index.
## Pin the model at a specific version, not just a name

Model providers update weights behind the same name. `voyage-3` today and `voyage-3` in six months may produce incompatible vectors.
- Store `"embedding_model": "voyage-3-2024-11"` (or the exact version string the provider exposes).
- When the provider does not offer version pinning, embed a fixed reference document, hash the resulting vector, and store that hash alongside the model name, as sketched after this list. A changed hash signals a silent update.
- When you upgrade to a new version, create a new collection, re-embed the corpus, evaluate recall, then swap query traffic. See embeddings for the full upgrade protocol.
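A minimal sketch of the fingerprint check, assuming a hypothetical `embed_one` helper that calls your embedding provider:

```python
import hashlib

# Fixed reference text; any stable string works.
REFERENCE_TEXT = "The quick brown fox jumps over the lazy dog."

def embedding_fingerprint(embed_one) -> str:
    """Embed a fixed reference text and hash the resulting vector.

    If the provider silently swaps weights behind the same model name,
    the reference vector changes and so does this hash.
    """
    vec = embed_one(REFERENCE_TEXT)
    # Round to absorb harmless float jitter; a real weight update moves
    # components far more than 1e-6.
    canonical = ",".join(f"{x:.6f}" for x in vec)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Store the fingerprint in collection metadata at creation time; every ingest worker recomputes it and aborts on mismatch, exactly as it does for the model name.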
## Keep dimensions consistent across add and query
ChromaDB will accept a query vector at a different dimension than the indexed vectors and return garbage silently. Enforce dimension at the application layer.
```python
EXPECTED_DIM = collection.metadata.get("dimension", 1024)

def safe_query(collection, query_text, n_results=10, where=None):
    vec = embed_one(query_text)
    assert len(vec) == EXPECTED_DIM, f"Query dim {len(vec)} != index dim {EXPECTED_DIM}"
    return collection.query(
        query_embeddings=[vec],
        n_results=n_results,
        where=where,  # Pass None through; Chroma rejects an empty {} filter
    )
```

Dimension mismatches most often appear when switching between Matryoshka truncation levels (e.g., 512 vs 1024 for Voyage 3). See embeddings-dimensionality for the truncation rules.
## Normalize vectors before storing
ChromaDB’s cosine distance metric normalizes internally, but storing pre-normalized vectors removes that overhead and makes dot product and cosine equivalent, which is faster.
```python
import numpy as np

def normalize(v: list[float]) -> list[float]:
    arr = np.array(v, dtype=np.float32)
    norm = np.linalg.norm(arr)
    # Guard against the zero vector, which has no direction to preserve
    return (arr / norm if norm > 0 else arr).tolist()

vectors = [normalize(v) for v in raw_vectors]
```

Use `hnsw:space=cosine` in collection metadata when storing normalized vectors. Do not mix collections using `l2` space with pre-normalized vectors; the distance metric assumption breaks recall.
## Use the built-in embedding function only in notebooks and prototypes
The built-in functions are a productivity shortcut for exploration. They are not suitable for production because they couple embedding cost to storage calls, obscure error handling, and make rate limiting difficult.
- Acceptable uses: quick notebooks, documentation examples, one-off data exploration.
- Not acceptable: any code path that runs in staging or production.
- The migration path is always the same: extract the embedded text, re-embed externally, and upsert with explicit vectors, as sketched below.
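A minimal migration sketch, assuming a hypothetical prototype collection named `prototype_docs` and the `embed_batch` pipeline from earlier; large corpora should page through `old.get(limit=..., offset=...)` instead of one call:

```python
# Pull the text back out of the prototype collection (ids are always returned)
old = client.get_collection("prototype_docs")
batch = old.get(include=["documents", "metadatas"])

# Re-embed externally and upsert with explicit vectors into a model-pinned collection
new = client.get_or_create_collection(
    name="docs_support",
    embedding_function=None,
    metadata={"embedding_model": "voyage-3", "dimension": 1024},
)
new.upsert(
    ids=batch["ids"],
    documents=batch["documents"],
    embeddings=embed_batch(batch["documents"], model="voyage-3"),
    metadatas=batch["metadatas"],
)
```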