Overview

Embedding API cost scales with the number of tokens you send, and the asynchronous batch API roughly halves the per-token rate. The fastest wins are: cache every embedding forever (they are deterministic), deduplicate before the API call, and use the batch API where available. For semantic caching of LLM responses rather than raw embeddings, see embeddings-semantic-cache.

Use the batch API; it typically costs 50 percent less

Both Voyage and OpenAI offer asynchronous batch embedding endpoints that cost roughly half the synchronous rate. Batch endpoints accept a file of inputs and return results asynchronously. Use them for all ingestion workloads that are not latency-sensitive.

# OpenAI Batch API for embeddings
import json
from openai import OpenAI
 
client = OpenAI()
 
# Prepare the batch input file; corpus_texts is the list of document or chunk strings to embed
requests = [
    {"custom_id": f"doc-{i}", "method": "POST", "url": "/v1/embeddings",
     "body": {"model": "text-embedding-3-large", "input": text, "dimensions": 1024}}
    for i, text in enumerate(corpus_texts)
]
with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")
 
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h"
)
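
The batch runs asynchronously; fetching results is a second step once it completes. A minimal sketch of the retrieval side, continuing the script above and assuming the output file is JSONL with one result per request:

import time

# Poll until the batch reaches a terminal state (the completion window is up to 24h)
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

# Each output line carries the custom_id and the embedding for that request
output = client.files.content(batch.output_file_id).text
embeddings = {}
for line in output.splitlines():
    result = json.loads(line)
    embeddings[result["custom_id"]] = result["response"]["body"]["data"][0]["embedding"]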

For corpus ingestion, document libraries, and nightly reprocessing, the 50 percent cost reduction adds up quickly. Reserve synchronous embedding calls for real-time query paths.

Cache embeddings forever, keyed on model, version, and input hash

Embeddings are deterministic: the same model version on the same input always returns the same vector. There is no reason to re-embed a document that has not changed.

import hashlib
 
def cache_key(model_id: str, model_version: str, text: str) -> str:
    payload = f"{model_id}:{model_version}:{text}"
    return hashlib.sha256(payload.encode()).hexdigest()

Include the model version in the key because vendors periodically update model weights under the same model name. A cached vector from voyage-3@2024-10-01 is incompatible with one from voyage-3@2025-06-01. When you detect a version change, mark all cached vectors for that model as stale and re-embed.

For the storage layer, a simple key-value store (Redis, DynamoDB, or a SQLite sidecar) is sufficient. Store the normalized vector bytes alongside the key. See embeddings-normalization for why normalized-at-write is the correct pattern.
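
As a sketch of the read-through pattern, here is a minimal SQLite-backed cache that uses cache_key from above; embed is a hypothetical stand-in for your synchronous embedding call, and the vector is stored as raw float32 bytes:

import sqlite3
import numpy as np

db = sqlite3.connect("embedding_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vector BLOB)")

def get_embedding(model_id: str, model_version: str, text: str) -> np.ndarray:
    key = cache_key(model_id, model_version, text)
    row = db.execute("SELECT vector FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return np.frombuffer(row[0], dtype=np.float32)
    vector = embed(text)  # hypothetical API call returning a normalized list[float]
    blob = np.asarray(vector, dtype=np.float32).tobytes()
    db.execute("INSERT OR REPLACE INTO cache (key, vector) VALUES (?, ?)", (key, blob))
    db.commit()
    return np.frombuffer(blob, dtype=np.float32)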

Deduplicate inputs before the API call

Many ingestion pipelines process the same content multiple times: overlapping chunks from sliding-window chunking, duplicate documents across data sources, repeated boilerplate in templates. Deduplicating on the content hash before any embedding call eliminates redundant spend.

import hashlib
 
def deduplicate(texts: list[str]) -> tuple[list[str], dict[int, int]]:
    seen: dict[str, int] = {}
    unique: list[str] = []
    index_map: dict[int, int] = {}  # original index -> unique index
    for i, text in enumerate(texts):
        h = hashlib.sha256(text.encode()).hexdigest()
        if h not in seen:
            seen[h] = len(unique)
            unique.append(text)
        index_map[i] = seen[h]
    return unique, index_map
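
After embedding only the unique texts, expand the results back to the original order via index_map. A short usage sketch, where texts is your list of chunk strings and embed_batch is a hypothetical stand-in for the embedding call:

unique_texts, index_map = deduplicate(texts)
unique_vectors = embed_batch(unique_texts)  # one API call over the unique inputs only (hypothetical helper)
vectors = [unique_vectors[index_map[i]] for i in range(len(texts))]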

On corpora with overlapping chunks (e.g., 256-token chunks with 50-token overlap), deduplication typically reduces unique embedding calls by 15 to 30 percent.

Truncate inputs to the model’s maximum token limit before embedding

Text that exceeds the model’s token limit either fails outright or, with many SDKs, is silently truncated, which means you can pay for the full input while embedding only a prefix. Explicit truncation before the API call avoids both problems and reduces token cost for documents that exceed the limit.

import tiktoken
 
def truncate_to_limit(text: str, model: str, max_tokens: int = 8191) -> str:
    # Default matches the roughly 8K-token input limit of OpenAI embedding models;
    # adjust max_tokens for other providers.
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])

For Voyage models, use the Voyage tokenizer. For BGE and Nomic, use their respective tokenizers. Never assume token count from character count. For dimension-related cost, see embeddings-dimensionality; smaller dimensions mean smaller stored vectors and faster ANN queries, which reduces infra cost independently of API cost.
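
As an illustration, a sketch that routes token counting to the right tokenizer; the model-name mapping is an assumption to adapt to the models you actually deploy (Voyage's own tokenizer is omitted here):

import tiktoken
from transformers import AutoTokenizer

# Illustrative mapping from open-weight model names to their Hugging Face tokenizers
HF_TOKENIZERS = {
    "bge-base-en-v1.5": "BAAI/bge-base-en-v1.5",
    "nomic-embed-text-v1.5": "nomic-ai/nomic-embed-text-v1.5",
}

def count_tokens(text: str, model: str) -> int:
    if model.startswith("text-embedding-"):
        return len(tiktoken.encoding_for_model(model).encode(text))
    tokenizer = AutoTokenizer.from_pretrained(HF_TOKENIZERS[model])
    return len(tokenizer.encode(text, add_special_tokens=False))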

Set budget caps and alert on per-run token counts

Runaway ingestion jobs or retry loops can exhaust monthly quotas in hours. Set an explicit token budget per ingestion run and fail fast when it is exceeded.

import tiktoken

MAX_TOKENS_PER_RUN = 10_000_000  # $1 at typical pricing

# texts_to_embed is the list of strings queued for this ingestion run
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by OpenAI embedding models
total_tokens = sum(len(enc.encode(t)) for t in texts_to_embed)
if total_tokens > MAX_TOKENS_PER_RUN:
    raise RuntimeError(
        f"Budget exceeded: {total_tokens:,} tokens > {MAX_TOKENS_PER_RUN:,} limit."
    )

Track token spend per run in a log or metrics system. Alert on runs that exceed 2x the historical mean. Sudden spikes usually indicate a loop bug or a corpus that changed unexpectedly.
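
A minimal sketch of the alert logic, assuming per-run token counts are appended to a local JSON file (substitute your metrics system; the file name is illustrative):

import json
from pathlib import Path

HISTORY_FILE = Path("embedding_run_history.json")  # illustrative local log of per-run token counts

def record_and_check(total_tokens: int) -> None:
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []
    if history:
        mean = sum(history) / len(history)
        if total_tokens > 2 * mean:
            print(f"ALERT: run used {total_tokens:,} tokens, over 2x the historical mean of {mean:,.0f}")
    history.append(total_tokens)
    HISTORY_FILE.write_text(json.dumps(history))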

Match the model tier to the workload; smaller models cut real-time query cost

Query embedding happens at inference time on every request; corpus embedding happens once at ingest. Query and corpus vectors must come from the same model, so you cannot mix tiers within a single index: choosing a smaller, cheaper model (e.g., text-embedding-3-small or Voyage 3 standard) reduces real-time query cost, but it applies to the corpus as well. Evaluate whether the smaller model's recall is acceptable versus the larger tier before committing a pipeline to it. See embeddings-voyage-vs-openai for the quality versus cost comparison.
Query embedding happens at inference time; corpus embedding happens at ingest time. Using a smaller, cheaper model for queries only (e.g., text-embedding-3-small or Voyage 3 standard) reduces real-time API cost. The trade-off is that the query and corpus embedding must come from the same model. Evaluate whether 3-small recall is acceptable versus 3-large before splitting query and corpus onto different model tiers. See embeddings-voyage-vs-openai for the quality versus cost comparison.