ChromaDB: Collection Design

Overview

A ChromaDB collection is the unit of organization: it owns one embedding space, one distance metric, and one metadata schema. Poor collection design compounds every downstream retrieval problem. The rules here cover when to split or merge collections, how to version the metadata schema, and which client mode matches your deployment pattern.

One collection per retrieval task, not per tenant or per document type

The collection boundary is semantic, not organizational. Use it to separate workloads that need different embedding models or distance metrics, not to partition users or sources.

docs_support, code_snippets, product_catalog: each is a distinct retrieval task with its own embedding model choice.
Multi-tenant separation belongs in a tenant_id metadata field filtered at query time, not in separate collections.
Per-tenant collections fragment the index, make bulk analytics hard, and make the collection count balloon as customers churn and new ones onboard.

import chromadb
 
client = chromadb.PersistentClient(path="./.chroma")
collection = client.get_or_create_collection(
    name="docs_support",
    metadata={"embedding_model": "voyage-3", "dimension": 1024, "hnsw:space": "cosine"},
)

Store the embedding model name and dimension in the collection metadata. That record prevents silent mismatches when you add a new ingest worker.
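A minimal sketch of how an ingest worker might enforce that record before writing. The helper name check_vector is hypothetical, and it assumes the collection metadata uses the embedding_model and dimension keys shown above.

```python
from typing import Optional, Sequence


def check_vector(collection_metadata: Optional[dict], model: str, vector: Sequence[float]) -> None:
    """Raise ValueError if a vector does not match the contract recorded on the collection."""
    meta = collection_metadata or {}  # a collection created without metadata has none to check
    expected_model = meta.get("embedding_model")
    expected_dim = meta.get("dimension")
    if expected_model is not None and model != expected_model:
        raise ValueError(f"model mismatch: collection expects {expected_model!r}, got {model!r}")
    if expected_dim is not None and len(vector) != expected_dim:
        raise ValueError(f"dimension mismatch: collection expects {expected_dim}, got {len(vector)}")
```

A worker would call check_vector(collection.metadata, "voyage-3", embedding) before each collection.add(), so a misconfigured worker fails loudly instead of writing vectors from the wrong model.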

Define the metadata schema before ingest begins

Metadata is the second axis of every query. Changing it later means re-ingesting the corpus.

Choose keys before writing a single document: source, tenant_id, created_at, lang, chunk_index, doc_id.
Assign one type per key and never change it. Once created_at is a Unix timestamp integer, do not mix ISO strings into the same collection.
Flat values only. ChromaDB does not index into nested objects; put complex payloads in a payload JSON string and parse on the client side.
Document the schema in a comment or a SCHEMA key in the collection metadata so every ingest script reads the same contract.
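One way to make that contract executable is a small validator that every ingest script runs before calling collection.add(). This is a sketch only: the SCHEMA dict and validate_metadata are hypothetical names, and the key-to-type mapping simply mirrors the keys listed above.

```python
# Hypothetical shared contract: one Python type per metadata key.
SCHEMA = {
    "source": str,
    "tenant_id": str,
    "created_at": int,   # Unix timestamp, never an ISO string
    "lang": str,
    "chunk_index": int,
    "doc_id": str,
}


def validate_metadata(meta: dict) -> dict:
    """Return meta unchanged if it matches SCHEMA exactly; raise otherwise."""
    missing = SCHEMA.keys() - meta.keys()
    extra = meta.keys() - SCHEMA.keys()
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={sorted(missing)}, extra={sorted(extra)}")
    for key, expected in SCHEMA.items():
        value = meta[key]
        # bool is a subclass of int in Python; this schema has no boolean keys, so reject it.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"{key!r} must be {expected.__name__}, got {type(value).__name__}")
    return meta
```

Rejecting unknown keys as well as wrong types keeps one-off fields from creeping into the collection unnoticed.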

collection.add(
    ids=["chunk-001"],
    documents=["Refunds are processed within 5 business days."],
    metadatas=[{
        "tenant_id": "acme",
        "source": "zendesk",
        "created_at": 1736294400,
        "lang": "en",
        "chunk_index": 0,
        "doc_id": "ticket-5892",
    }],
    embeddings=[embedding],  # precomputed vector for this chunk; must match the collection's recorded dimension
)
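With tenant_id stored on every chunk, tenant isolation becomes a where filter at query time. The helper below is a hypothetical sketch that builds such a filter using ChromaDB's $eq and $and operators; the name tenant_where and the idea of passing extra filters as keyword arguments are assumptions for illustration.

```python
def tenant_where(tenant_id: str, **filters) -> dict:
    """Build a where clause that always pins the tenant, plus optional equality filters."""
    clauses = [{"tenant_id": {"$eq": tenant_id}}]
    clauses += [{key: {"$eq": value}} for key, value in filters.items()]
    # $and expects at least two clauses, so a lone tenant filter is returned bare.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


# Usage against a collection created with embedding_function=None
# (query_vector is a precomputed embedding of the user's question):
#
# results = collection.query(
#     query_embeddings=[query_vector],
#     n_results=5,
#     where=tenant_where("acme", lang="en"),
# )
```

Routing every query through one helper like this makes it harder to forget the tenant filter in a new code path.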

Name collections with a prefix convention

Collection names are global within a ChromaDB instance. Collisions happen when multiple teams or services share an instance.

Prefix by team or service: support_docs, eng_code, marketing_content.
Include the environment when dev and prod share an instance: prod_support_docs, dev_support_docs.
Use snake_case only. ChromaDB validates collection names and rejects names with disallowed characters, so lowercase letters, digits, and underscores are the safe subset.
List existing collections before creating a new one: client.list_collections().
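The convention can be sketched as a pair of helpers. Both names (collection_name, name_taken) are hypothetical, and the regular expression encodes only the snake_case rule from this section, not ChromaDB's full name validation.

```python
import re
from typing import Optional

# Lowercase words and digits joined by single underscores.
_SNAKE_CASE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def collection_name(team: str, task: str, env: Optional[str] = None) -> str:
    """Join env (optional), team, and task into a prefixed snake_case collection name."""
    name = "_".join(part for part in (env, team, task) if part)
    if not _SNAKE_CASE.fullmatch(name):
        raise ValueError(f"collection name {name!r} is not snake_case")
    return name


def name_taken(client, name: str) -> bool:
    """Check client.list_collections() for an existing collection with this name."""
    # Depending on the ChromaDB release, list_collections() yields Collection
    # objects or plain name strings; getattr handles both shapes.
    existing = {getattr(item, "name", item) for item in client.list_collections()}
    return name in existing
```

For example, collection_name("support", "docs", env="prod") yields prod_support_docs, matching the naming shown above.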

Use `get_or_create_collection` for idempotent startup

Startup code that calls create_collection on every run raises if the collection exists. Use get_or_create_collection everywhere.

collection = client.get_or_create_collection(
    name="docs_support",
    metadata={"embedding_model": "voyage-3"},
    embedding_function=None,  # Disable built-in; pass vectors explicitly
)

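Pass embedding_function=None when you compute vectors outside ChromaDB. Otherwise ChromaDB attaches its default embedding function, which downloads and runs a local model on first use, and provider-backed embedding functions call third-party APIs. Either side effect becomes a budget leak or a test hazard.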

Choose in-memory vs persistent client by lifecycle, not by performance

chromadb.EphemeralClient() (which is also what a bare chromadb.Client() gives you by default) and chromadb.PersistentClient() are not interchangeable aliases.

Ephemeral: tests, notebooks, one-off scripts where data lives and dies with the process. No files to clean up.
Persistent: any workload where documents survive a restart. One worker, one path. Two processes writing to the same path corrupt the index.
Server mode (HttpClient): multi-process reads, multi-service writes, Docker deployments. See chromadb-persistence for setup.

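import chromadb

# Ephemeral - tests, notebooks, one-off scripts; nothing is written to disk
client = chromadb.EphemeralClient()

# Persistent - single process owning one path
client = chromadb.PersistentClient(path="./.chroma")

# Server mode - multi-process; assumes a Chroma server reachable at chroma:8000
client = chromadb.HttpClient(host="chroma", port=8000)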
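To keep the choice tied to the deployment rather than to scattered call sites, the mode can be read from configuration in one place. The sketch below assumes hypothetical environment variables (CHROMA_MODE, CHROMA_PATH, CHROMA_HOST, CHROMA_PORT); none of them are read by ChromaDB itself.

```python
import os
from typing import Mapping

_MODES = {"ephemeral", "persistent", "http"}


def client_mode(environ: Mapping[str, str] = os.environ) -> str:
    """Read the hypothetical CHROMA_MODE variable, defaulting to 'persistent'."""
    mode = environ.get("CHROMA_MODE", "persistent")
    if mode not in _MODES:
        raise ValueError(f"CHROMA_MODE must be one of {sorted(_MODES)}, got {mode!r}")
    return mode


def make_client(environ: Mapping[str, str] = os.environ):
    """Construct the client that matches the configured lifecycle."""
    import chromadb  # imported lazily so client_mode() can be tested without the package

    mode = client_mode(environ)
    if mode == "ephemeral":
        return chromadb.EphemeralClient()
    if mode == "persistent":
        return chromadb.PersistentClient(path=environ.get("CHROMA_PATH", "./.chroma"))
    return chromadb.HttpClient(
        host=environ.get("CHROMA_HOST", "chroma"),
        port=int(environ.get("CHROMA_PORT", "8000")),
    )
```

A test suite would then set CHROMA_MODE=ephemeral, while a Docker deployment sets CHROMA_MODE=http, without any change to application code.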

Reset the collection during development, not in production

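client.delete_collection(name) followed by client.create_collection(name) is the correct reset path during development. collection.delete() only removes items from a collection; it does not drop the collection itself. Never call the reset in a code path that can run in production.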

Gate reset behind an explicit env var (RESET_CHROMA=true), not a boolean flag in code that can be committed in the wrong state.
In production, prefer upsert operations (collection.upsert()) to reset so existing vectors survive partial re-ingests.
Track document version in metadata (version field) and upsert by ID; ChromaDB’s IDs are stable handles for that pattern.
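The rules above can be sketched as two helpers: a reset that refuses to run unless RESET_CHROMA is set, and an upsert that stamps a version into metadata. The function names are hypothetical; the client and collection arguments are assumed to be a ChromaDB client and collection exposing delete_collection, create_collection, and upsert.

```python
import os
from typing import Mapping, Optional, Sequence


def reset_requested(environ: Mapping[str, str] = os.environ) -> bool:
    """Only the literal value 'true' (any case) opts in; unset or anything else does not."""
    return environ.get("RESET_CHROMA", "").lower() == "true"


def reset_collection(client, name: str, metadata: Optional[dict] = None,
                     environ: Mapping[str, str] = os.environ):
    """Development-only reset: drop the collection and recreate it empty."""
    if not reset_requested(environ):
        raise RuntimeError("refusing to reset: RESET_CHROMA is not 'true'")
    client.delete_collection(name)  # raises if the collection does not exist
    return client.create_collection(name=name, metadata=metadata)


def upsert_chunk(collection, chunk_id: str, text: str, vector: Sequence[float],
                 metadata: dict, version: int) -> None:
    """Production path: overwrite one chunk in place, recording its version."""
    collection.upsert(
        ids=[chunk_id],
        documents=[text],
        embeddings=[list(vector)],
        metadatas=[{**metadata, "version": version}],
    )
```

Because the ID stays the same across versions, a partial re-ingest replaces only the chunks it touches and leaves the rest of the collection queryable.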
