Overview

A semantic cache stores LLM responses keyed not on exact string match but on embedding similarity: two queries that mean the same thing resolve to the same cached response. For high-query-volume applications, this can reduce LLM API cost by 30 to 60 percent on domains with repetitive user intent (support bots, FAQs, product search). For general embedding caching to avoid re-embedding the same text, see embeddings-cost-control.
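
The full request path is: embed the incoming query, look it up in the vector store, and either serve the cached response or call the LLM and write the result back. A minimal sketch of that loop, where embed(), call_llm(), and cache_store.insert() are hypothetical stand-ins for your embedding client, LLM client, and store API, and semantic_cache_lookup is defined below:

def get_response(query: str) -> str:
    query_embedding = embed(query)  # hypothetical embedding call
    cached = semantic_cache_lookup(query_embedding, cache_store)
    if cached is not None:
        return cached  # cache hit: no LLM call, no cost
    response = call_llm(query)  # hypothetical LLM call
    cache_store.insert(query_embedding, response)  # assumed store API
    return response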

Set the cosine threshold between 0.92 and 0.97 for most use cases

The similarity threshold determines whether a query is “close enough” to serve the cached response. Too low and unrelated queries share answers; too high and the cache never hits.

CACHE_THRESHOLD = 0.95  # tune per domain
 
def semantic_cache_lookup(query_embedding, cache_store):
    # Nearest-neighbor lookup: serve the cached response only when the
    # closest stored query clears the similarity threshold.
    result = cache_store.query(
        query_embedding,
        top_k=1,
        include_scores=True,
    )
    if result and result[0].score >= CACHE_THRESHOLD:
        return result[0].payload["response"]
    return None  # cache miss: caller falls through to the LLM

Start at 0.95. If false positives appear (wrong answers for similar-but-distinct queries), raise to 0.97. If the hit rate is too low for common paraphrase queries, lower to 0.92. For factual or precise domains (legal, medical, financial), err toward 0.97. For conversational FAQ, 0.92 to 0.94 is sufficient.
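
To pick the value empirically, log the top-1 similarity score on every lookup and inspect the distribution: a visible gap between paraphrase scores and unrelated-query scores tells you where the threshold belongs. A minimal instrumentation sketch, assuming the same cache_store API as above:

score_log = []  # inspect offline to choose the threshold

def semantic_cache_lookup_logged(query_embedding, cache_store):
    result = cache_store.query(query_embedding, top_k=1, include_scores=True)
    if not result:
        return None
    score_log.append(result[0].score)  # record every top-1 score, hit or miss
    if result[0].score >= CACHE_THRESHOLD:
        return result[0].payload["response"]
    return None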

Store the normalized query embedding, the response, and metadata

Cache entries need at least: the normalized embedding, the canonical response text, the model and prompt version that produced it, the creation timestamp, and an expiry or version tag. Without version metadata, cache invalidation after a prompt change is guesswork.

cache_entry = {
    "embedding": l2_normalize(query_embedding).tolist(),
    "response": llm_response_text,
    "prompt_version": "v3.1",
    "model": "claude-sonnet-4-6",
    "created_at": datetime.utcnow().isoformat(),
    "ttl_days": 30,
}

See embeddings-normalization for why the embedding must be normalized before storage.
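
The entry above calls l2_normalize without defining it; a minimal numpy version, for reference:

import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    # Scale to unit length so cosine similarity reduces to a dot product.
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm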

Use pgvector as the cache store for Postgres-native stacks

pgvector supports the similarity query and the metadata filter in one SQL statement. For a Postgres-backed application, this avoids a separate vector service.

-- Create the cache table
CREATE TABLE semantic_cache (
    id BIGSERIAL PRIMARY KEY,
    embedding vector(1536),
    response TEXT NOT NULL,
    prompt_version TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    last_hit TIMESTAMPTZ DEFAULT NOW()  -- refreshed on each hit; drives LRU eviction
);
 
CREATE INDEX ON semantic_cache USING hnsw (embedding vector_cosine_ops);
 
-- Lookup
SELECT response, 1 - (embedding <=> $1::vector) AS score
FROM semantic_cache
WHERE prompt_version = $2
ORDER BY embedding <=> $1::vector
LIMIT 1;

Always include the prompt_version filter in the lookup so a prompt update cannot surface stale responses. For non-Postgres deployments, Chroma supports the same metadata-filtered similarity query (sketched below). For a comparison of storage options, see pinecone-vs-pgvector.
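
A sketch of that lookup against the chromadb Python client; the collection name and the convention of storing the response text as the document are assumptions:

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection(
    "semantic_cache",
    metadata={"hnsw:space": "cosine"},  # use cosine distance
)

def chroma_cache_lookup(query_embedding, prompt_version):
    res = collection.query(
        query_embeddings=[query_embedding],
        n_results=1,
        where={"prompt_version": prompt_version},  # metadata filter
    )
    if res["ids"][0]:
        score = 1 - res["distances"][0][0]  # cosine distance -> similarity
        if score >= CACHE_THRESHOLD:
            return res["documents"][0][0]  # response stored as the document
    return None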

Invalidate cache entries on prompt version change and model change

Cached responses are coupled to the prompt template that generated them. A change in the system prompt, the retrieval context format, or the LLM model version renders cached responses potentially wrong. Do not use TTL alone for invalidation.

Invalidation strategies in order of precision:

  1. Tag every entry with the prompt version and delete entries with the old tag on deploy (sketched after this list).
  2. Namespace the cache by prompt version hash and let old entries expire via TTL.
  3. For high-stakes domains (financial, medical), disable the semantic cache for questions where freshness matters more than cost.
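
A sketch of the first strategy as a deploy-time step against the pgvector table above; the version string is an example:

-- Deploy-time invalidation: keep only entries from the current prompt
DELETE FROM semantic_cache
WHERE prompt_version <> 'v3.1';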

Never mix cached responses from different prompt versions in the same response surface.

Memory and index size grow with distinct query types

A semantic cache is not a bounded structure unless you set an eviction policy. For products with diverse query patterns, the number of cached embeddings grows without bound. Set a max entry count and evict by last-used time (LRU) or by TTL. For pgvector, a periodic DELETE driven by an ORDER BY last_hit ASC LIMIT N subquery is sufficient.

-- Evict the 10,000 least-recently-used entries; run when the table
-- exceeds your cap (e.g. 100,000 rows)
DELETE FROM semantic_cache
WHERE id IN (
    SELECT id FROM semantic_cache
    ORDER BY last_hit ASC
    LIMIT 10000
);
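
For last_hit to drive LRU eviction, refresh it on every cache hit:

-- Run after each cache hit (extend the lookup SELECT to also return id)
UPDATE semantic_cache SET last_hit = NOW() WHERE id = $1;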

Measure cache hit rate and false positive rate, not just cost savings

Track three metrics: hit rate (fraction of queries served from cache), false positive rate (fraction of cache hits that returned the wrong answer, measured via user feedback or automated eval), and cost reduction. A high hit rate with non-zero false positives is a reliability problem. See embeddings-eval for the evaluation setup.
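
A minimal counter sketch of the three metrics; how hits and false positives get reported into it (user feedback, automated eval) is up to your setup, and the per-call cost figure is a placeholder:

from dataclasses import dataclass

@dataclass
class CacheMetrics:
    queries: int = 0
    hits: int = 0
    false_positives: int = 0  # cache hits judged wrong by feedback or eval
    cost_per_llm_call: float = 0.01  # placeholder, not a real price

    def record(self, hit: bool, wrong: bool = False) -> None:
        self.queries += 1
        if hit:
            self.hits += 1
            if wrong:
                self.false_positives += 1

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0

    @property
    def false_positive_rate(self) -> float:
        return self.false_positives / self.hits if self.hits else 0.0

    @property
    def cost_saved(self) -> float:
        return self.hits * self.cost_per_llm_call  # each hit avoids one call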