Overview
A ChromaDB collection is the unit of organization: it owns one embedding space, one distance metric, and one metadata schema. Poor collection design compounds every downstream retrieval problem. The rules here cover when to split or merge collections, how to version the metadata schema, and which client mode matches your deployment pattern.
One collection per retrieval task, not per tenant or per document type
The collection boundary is semantic, not organizational. Use it to separate workloads that need different embedding models or distance metrics, not to partition users or sources.
- docs_support, code_snippets, product_catalog: each is a distinct retrieval task with its own embedding model choice.
- Multi-tenant separation belongs in a tenant_id metadata field filtered at query time, not in separate collections.
- Per-tenant collections fragment the index, make bulk analytics hard, and inflate the collection count as customers churn and new ones onboard.
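A minimal sketch of the query-time tenant filter described above. The helper name tenant_where and its extra argument are hypothetical. The $eq and $and operators are ChromaDB's metadata filter syntax, and the commented query call assumes a collection object and a precomputed query_vector.

```python
def tenant_where(tenant_id, extra=None):
    """Build a ChromaDB `where` filter scoped to a single tenant.

    `extra` is an optional dict of additional equality conditions.
    """
    conditions = [{"tenant_id": {"$eq": tenant_id}}]
    for key, value in (extra or {}).items():
        conditions.append({key: {"$eq": value}})
    # A `where` filter is a single top-level expression, so several
    # conditions are combined under $and.
    return conditions[0] if len(conditions) == 1 else {"$and": conditions}


# Usage against a shared collection (not executed here):
# results = collection.query(
#     query_embeddings=[query_vector],
#     n_results=5,
#     where=tenant_where("acme", {"lang": "en"}),
# )
```

Building the filter in one place keeps every query path tenant-scoped, so a missing filter shows up as a missing helper call rather than a silent cross-tenant read.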
import chromadb
client = chromadb.PersistentClient(path="./.chroma")
collection = client.get_or_create_collection(
    name="docs_support",
    metadata={"embedding_model": "voyage-3", "dimension": 1024, "hnsw:space": "cosine"},
)
Store the embedding model name and dimension in the collection metadata. That record prevents silent mismatches when you add a new ingest worker.
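One way an ingest worker could use that record is to compare its own model name and vector length against the collection metadata before writing. This is a sketch only: check_embedding_contract is a hypothetical helper, and the keys embedding_model and dimension are the ones used in the example above.

```python
def check_embedding_contract(collection_metadata, model_name, vector):
    """Raise ValueError if the worker's model or vector length disagrees
    with what the collection metadata records."""
    collection_metadata = collection_metadata or {}
    expected_model = collection_metadata.get("embedding_model")
    expected_dim = collection_metadata.get("dimension")
    if expected_model is not None and expected_model != model_name:
        raise ValueError(
            f"collection expects model {expected_model!r}, worker uses {model_name!r}"
        )
    if expected_dim is not None and expected_dim != len(vector):
        raise ValueError(
            f"collection expects dimension {expected_dim}, got {len(vector)}"
        )


# Usage in an ingest worker (collection.metadata holds the dict set at creation):
# check_embedding_contract(collection.metadata, "voyage-3", embedding)
```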
Define the metadata schema before ingest begins
Metadata is the second axis of every query. Changing it later means re-ingesting the corpus.
- Choose keys before writing a single document: source, tenant_id, created_at, lang, chunk_index, doc_id.
- Assign one type per key and never change it. Once created_at is a Unix timestamp integer, do not mix ISO strings into the same collection.
- Flat values only. ChromaDB does not index into nested objects; put complex payloads in a payload JSON string and parse on the client side.
- Document the schema in a comment or a SCHEMA key in the collection metadata so every ingest script reads the same contract.
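The rules above can be enforced with a small validator that every ingest script calls before writing. This is a sketch under assumptions: the SCHEMA dict, OPTIONAL_KEYS, validate_metadata, and pack_payload are hypothetical names, and the key list mirrors the one in the first bullet.

```python
import json

# Hypothetical schema contract: metadata key -> required Python type.
SCHEMA = {
    "source": str,
    "tenant_id": str,
    "created_at": int,   # Unix timestamp in seconds
    "lang": str,
    "chunk_index": int,
    "doc_id": str,
    "payload": str,      # optional JSON string carrying nested data
}
OPTIONAL_KEYS = {"payload"}


def validate_metadata(meta):
    """Return the metadata dict unchanged, or raise ValueError on a violation."""
    unknown = set(meta) - set(SCHEMA)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    missing = set(SCHEMA) - OPTIONAL_KEYS - set(meta)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key, value in meta.items():
        expected = SCHEMA[key]
        # Exact type comparison: a bool is not accepted where an int is expected.
        if type(value) is not expected:
            raise ValueError(
                f"{key} must be {expected.__name__}, got {type(value).__name__}"
            )
    return meta


def pack_payload(obj):
    """Serialize a nested object into the flat `payload` string field."""
    return json.dumps(obj, sort_keys=True)
```

Readers of the collection would call json.loads on the payload field to recover the nested structure on the client side.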
collection.add(
    ids=["chunk-001"],
    documents=["Refunds are processed within 5 business days."],
    metadatas=[{
        "tenant_id": "acme",
        "source": "zendesk",
        "created_at": 1736294400,
        "lang": "en",
        "chunk_index": 0,
        "doc_id": "ticket-5892",
    }],
    embeddings=[embedding],  # precomputed vector; its length must match the collection's dimension
)
Name collections with a prefix convention
Collection names are global within a ChromaDB instance. Collisions happen when multiple teams or services share an instance.
- Prefix by team or service: support_docs, eng_code, marketing_content.
- Include the environment when dev and prod share an instance: prod_support_docs, dev_support_docs.
- Use snake_case only. ChromaDB validates collection names (minimum length, alphanumeric first and last character, a restricted character set), so names with spaces or special characters are rejected at creation.
- List existing collections before creating a new one: client.list_collections().
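A sketch of the naming convention and the pre-creation check. build_collection_name and name_is_taken are hypothetical helpers, and the env/team/task split is one possible reading of the prefix rule. The regular expression is deliberately stricter than ChromaDB's own name validation.

```python
import re

# Lowercase snake_case segments only; a conservative subset of valid names.
_SEGMENT = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)*$")


def build_collection_name(env, team, task):
    """Join env, team, and task into a name such as prod_support_docs."""
    parts = [env, team, task]
    for part in parts:
        if not _SEGMENT.match(part):
            raise ValueError(f"invalid name segment: {part!r}")
    return "_".join(parts)


def name_is_taken(client, name):
    """Check a client for an existing collection with this name."""
    # list_collections() has returned Collection objects in some chromadb
    # releases and plain name strings in others, so both shapes are handled.
    existing = {getattr(item, "name", item) for item in client.list_collections()}
    return name in existing
```

A startup script could call name_is_taken(client, build_collection_name("prod", "support", "docs")) and log a warning when another service already owns the name.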
Use get_or_create_collection for idempotent startup
Startup code that calls create_collection on every run raises if the collection exists. Use get_or_create_collection everywhere.
collection = client.get_or_create_collection(
    name="docs_support",
    metadata={"embedding_model": "voyage-3"},
    embedding_function=None,  # Disable built-in; pass vectors explicitly
)
Pass embedding_function=None when you compute vectors outside ChromaDB. Left unset, ChromaDB attaches a default embedding function that downloads and runs a model on first use, and the provider-backed embedding functions call third-party APIs automatically; those side effects become a budget leak and a test hazard.
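With the built-in embedding function disabled, tests need some source of vectors. One option, sketched here as an assumption rather than a recommendation from ChromaDB, is a deterministic stand-in that needs no network access. fake_embed is a hypothetical helper and its vectors carry no semantic meaning.

```python
import hashlib
import math


def fake_embed(text, dimension=8):
    """Deterministic stand-in embedding for tests: no network, no model download."""
    values = []
    counter = 0
    while len(values) < dimension:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # Map each digest byte to a float in [-1.0, 1.0].
        values.extend(byte / 127.5 - 1.0 for byte in digest)
        counter += 1
    values = values[:dimension]
    # Normalize to unit length so cosine and dot-product rankings agree.
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]


# Usage in a test (not executed here):
# collection.add(ids=["t-1"], documents=["hello"], embeddings=[fake_embed("hello")])
```

The dimension passed in a test should match whatever the collection under test expects.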
Choose in-memory vs persistent client by lifecycle, not by performance
chromadb.Client() (ephemeral) and chromadb.PersistentClient() are not interchangeable aliases.
- Ephemeral: tests, notebooks, one-off scripts where data lives and dies with the process. No files to clean up.
- Persistent: any workload where documents survive a restart. One worker, one path. Two processes writing to the same path corrupt the index.
- Server mode (HttpClient): multi-process reads, multi-service writes, Docker deployments. See chromadb-persistence for setup.
# Ephemeral - tests only
client = chromadb.EphemeralClient()
# Persistent - single process
client = chromadb.PersistentClient(path="./.chroma")
# Server mode - multi-process
client = chromadb.HttpClient(host="chroma", port=8000)
Reset the collection during development, not in production
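client.delete_collection() followed by client.create_collection() is the correct reset path during development; collection.delete() only removes records from a collection and does not drop the collection itself. Never call the reset sequence in a code path that can run in production.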
- Gate reset behind an explicit env var (RESET_CHROMA=true), not a boolean flag.
- In production, prefer upserts (collection.upsert()) over a reset so existing vectors survive partial re-ingests.
- Track document version in metadata (version field) and upsert by ID; ChromaDB’s IDs are stable handles for that pattern.
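The three rules above can be sketched as a gated reset plus a versioned upsert. The helper names reset_requested, reset_collection, and versioned_upsert are hypothetical. The client and collection arguments are assumed to be a ChromaDB client and collection, and only delete_collection, create_collection, and upsert are called on them.

```python
import os


def reset_requested(environ=None):
    """True only when RESET_CHROMA is set to "true" (case-insensitive)."""
    environ = os.environ if environ is None else environ
    return environ.get("RESET_CHROMA", "").strip().lower() == "true"


def reset_collection(client, name, environ=None, **create_kwargs):
    """Drop and recreate a collection, refusing unless the env gate is open."""
    if not reset_requested(environ):
        raise RuntimeError("reset refused: set RESET_CHROMA=true to allow it")
    # delete_collection raises if the collection does not exist yet.
    client.delete_collection(name)
    return client.create_collection(name=name, **create_kwargs)


def versioned_upsert(collection, ids, documents, embeddings, metadatas, version):
    """Upsert by stable ID, stamping each metadata dict with a version field."""
    stamped = [dict(meta, version=version) for meta in metadatas]
    collection.upsert(
        ids=ids, documents=documents, embeddings=embeddings, metadatas=stamped
    )
    return stamped
```

Because the gate reads an environment mapping rather than a function argument, a production deployment that never sets RESET_CHROMA cannot reach the delete call.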