Overview
Cost control on agents is a stack of small habits, not a single switch. Prompt caching, model routing, batch APIs, and observability each shave a multiplier off the bill. The rules below assume you already have an eval suite; without one, see evaluation first, because cheap and wrong is still wrong.
Measure cost per task, not cost per token
Cost per token is a vendor metric. Cost per resolved task is the business metric.
- Sum every token spent to ship one user-visible outcome: retries, judge calls, tool calls, planner overhead.
- Track per slice (easy, hard, adversarial) so a price change does not hide behind a healthy average.
- Cheap models that loop three times are not cheap.
A 10x cheaper model that solves the task half as often costs more, not less.
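A minimal sketch of the accounting, assuming hypothetical per-token prices and a call log where every retry, judge call, and tool call is tagged with the task it served:

```python
from collections import defaultdict

# Hypothetical per-token prices in USD; substitute your provider's rates.
PRICE_IN, PRICE_OUT = 3e-6, 15e-6

def cost_per_resolved_task(calls):
    """calls: dicts with task_id, slice, input_tokens, output_tokens, resolved.

    Charges every call to its task -- retries, judge calls, tool calls,
    planner overhead -- then divides slice spend by resolved tasks.
    """
    spend = defaultdict(float)   # slice -> total USD
    tasks = defaultdict(dict)    # slice -> {task_id: resolved?}
    for c in calls:
        spend[c["slice"]] += c["input_tokens"] * PRICE_IN + c["output_tokens"] * PRICE_OUT
        done = tasks[c["slice"]]
        done[c["task_id"]] = done.get(c["task_id"], False) or c["resolved"]
    return {s: spend[s] / max(1, sum(tasks[s].values())) for s in spend}
```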
Cache the stable parts of the prompt
Prompt caching pays for itself in any system with a long, fixed preamble. Anthropic, OpenAI, and Gemini all expose it.
- Put durable content first: system prompt, tool definitions, schema, few-shot examples. Mark the cache breakpoint after the last stable token.
- Put per-turn user input last so the cached prefix hits on every call.
- Aim for cache hit rate above 80% on production traffic. Watch it as a first-class metric.
Anthropic charges ~10% of input price for cache reads; OpenAI auto-caches; Gemini supports explicit cache objects. The exact discount changes; the rule does not.
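With Anthropic's SDK the breakpoint is an explicit cache_control marker on the last stable block; a minimal sketch, with the preamble and few-shot text abbreviated and the model id illustrative:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM = "You are a support agent for..."        # long, durable preamble
EXAMPLES = "Example 1: ...\nExample 2: ..."      # stable few-shot examples

def ask(user_input: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",        # illustrative model id
        max_tokens=1024,
        system=[
            {"type": "text", "text": SYSTEM},
            # Breakpoint after the last stable token: everything up to and
            # including this block is cached across calls.
            {"type": "text", "text": EXAMPLES,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Per-turn input goes last so the cached prefix hits every time.
        messages=[{"role": "user", "content": user_input}],
    )
```

The response's usage block reports cache_creation_input_tokens and cache_read_input_tokens; the hit-rate metric above falls out of those two counters.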
Route by difficulty, not by default
Pick the smallest model that passes the eval for each request type. Run a router as the first hop.
- Triage with a small fast model (Haiku, GPT-4.1-mini, Gemini Flash). It classifies the request.
- Routine work goes to a mid-tier model (Sonnet, GPT-4.1).
- Hard cases escalate to the frontier (Opus, GPT-5, Gemini Pro) on multi-step reasoning or low-confidence triage output.
Score each tier on the golden set. Promote a request only when the cheaper tier’s accuracy on that slice is below the bar.
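A sketch of the dispatch logic; the model ids, labels, and 0.7 threshold are illustrative, and the threshold should come from the golden set, not from feel:

```python
# Tier table: illustrative model ids, one per request class.
TIERS = {"routine": "claude-sonnet-4-20250514", "hard": "claude-opus-4-20250514"}

def route(request: str, triage) -> str:
    """triage: a small fast model call returning (label, confidence)."""
    label, confidence = triage(request)
    # Low-confidence triage output escalates straight to the frontier tier.
    if label == "hard" or confidence < 0.7:
        return TIERS["hard"]
    return TIERS["routine"]
```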
Use batch APIs for anything not realtime
Most providers offer a batch endpoint at ~50% off with a 24-hour SLA. Embeddings backfills, nightly summarization, eval suites, and bulk classification all belong on batch.
- Anthropic Message Batches, OpenAI Batch, and Gemini batch accept the same prompts as the realtime API.
- Queue jobs that do not need a sub-second response; reserve realtime for user-facing turns.
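A sketch using Anthropic Message Batches; the doc ids, prompt, and model alias are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

# Nightly summarization as a batch: same message params as the realtime API,
# wrapped with a custom_id so results can be matched back to their source.
docs = {"doc-1": "First document text...", "doc-2": "Second document text..."}

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": doc_id,
            "params": {
                "model": "claude-3-5-haiku-latest",   # illustrative alias
                "max_tokens": 256,
                "messages": [{"role": "user", "content": f"Summarize:\n{text}"}],
            },
        }
        for doc_id, text in docs.items()
    ]
)
# Poll batch.id later; results land within the 24-hour window at ~50% off.
```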
Build a fallback ladder
Order providers cheapest to most expensive and try them in series with a quality gate at each step.
1. Local model via Ollama for known-easy patterns (see [[ai-agents/ollama]]).
2. Cheap hosted API (Haiku, Flash, GPT-4.1-mini) for routine tasks.
3. Frontier API (Opus, Gemini Pro, GPT-5) when steps 1 and 2 fail confidence checks.

The ladder works only with a confidence signal at each rung. Without one, you pay for every rung you climb.
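A minimal sketch of the ladder, assuming each rung exposes the confidence signal described above:

```python
def answer_with_ladder(task, rungs, gate=0.8):
    """rungs: (name, call) pairs ordered cheapest to most expensive; each
    call returns (answer, confidence). The 0.8 gate is a placeholder to tune.
    """
    answer, name = None, None
    for name, call in rungs:
        answer, confidence = call(task)
        if confidence >= gate:       # quality gate passed: stop climbing
            return answer, name
    return answer, name              # last rung's output, gate or no gate
```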
Trim context aggressively
Long context is paid for on every turn that does not hit cache. Trim before you send.
- Drop irrelevant tool results once the next step is planned.
- Summarize prior turns into a 200-token recap; discard the raw history.
- For RAG, send the top 5 reranked chunks, not the top 50. See rag.
A 50k-token context that does not need to be 50k tokens is a recurring tax.
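A sketch of the recap step, assuming summarize is any cheap model call that compresses text to a token budget:

```python
def trim_history(messages, summarize, keep_last=2, recap_tokens=200):
    """Collapse old turns into a short recap; keep only the recent tail.

    summarize: assumed helper -- any cheap model call that compresses text.
    """
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    recap = summarize("\n".join(m["content"] for m in old),
                      max_tokens=recap_tokens)
    return [{"role": "user", "content": f"Recap of earlier turns: {recap}"}] + recent
```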
Cache deterministic tool results
Tools that return the same output for the same input should be memoized. Key on (tool_name, serialized_args) with a TTL appropriate to the data.
- Web fetches: minutes to hours.
- Database lookups for slow-changing rows: hours.
- Embedding calls: keyed by (model_id, sha256(text)), near-permanent. See embeddings.
Tool caches cut latency and spend on iterative loops.
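A minimal in-process memoizer keyed as described; swap the dict for Redis or similar once the agent runs across processes:

```python
import hashlib
import json
import time

_cache: dict[str, tuple[float, object]] = {}

def memoized_tool(tool_name, args, call, ttl_seconds):
    """Memoize a deterministic tool on (tool_name, serialized_args) with a TTL."""
    key = hashlib.sha256(
        (tool_name + json.dumps(args, sort_keys=True)).encode()
    ).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < ttl_seconds:
        return hit[1]                          # fresh hit: no tool call made
    result = call(**args)
    _cache[key] = (time.time(), result)
    return result
```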
Cap spend per user and per task
Every entry point needs a budget. No exceptions.
- Per-task cap: hard ceiling on input + output tokens; return a partial result and a budget_exceeded flag.
- Per-user daily cap: throttle or downgrade to a cheaper tier once exceeded.
- Per-org monthly cap: alert at 50%, 80%, 100%; refuse new tasks at 110%.
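A sketch of the per-task rung; the cap and per-token prices are placeholders:

```python
class TaskBudget:
    """Hard per-task ceiling; cap and prices here are placeholders."""

    def __init__(self, cap_usd=0.50, price_in=3e-6, price_out=15e-6):
        self.cap_usd, self.spent = cap_usd, 0.0
        self.price_in, self.price_out = price_in, price_out

    def charge(self, input_tokens, output_tokens):
        self.spent += input_tokens * self.price_in + output_tokens * self.price_out

    @property
    def exceeded(self):
        return self.spent >= self.cap_usd

# In the agent loop, after each model call:
#   budget.charge(usage.input_tokens, usage.output_tokens)
#   if budget.exceeded:
#       return {"result": partial, "budget_exceeded": True}
```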
See iteration caps in multi-agent; the same logic applies to dollars.
Observe cost per session
Billing data lags by a day; in-app metrics do not. Emit cost_usd, input_tokens, output_tokens, and cache_read_tokens per call, then aggregate per session and per task.
- Dashboards: cost per task by slice, cache hit rate, p95 cost per session.
- Alerts: any task type whose cost-per-task doubles week-over-week.
A prompt regression that doubles the bill is invisible without per-session attribution.
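A sketch of the per-call event, with stdout standing in for whatever metrics pipeline you run:

```python
import json
import time

def emit_call_metrics(session_id, task_id, slice_name, usage, cost_usd):
    """One structured event per model call; aggregate per session downstream."""
    print(json.dumps({                      # stdout stands in for your pipeline
        "ts": time.time(),
        "session_id": session_id,
        "task_id": task_id,
        "slice": slice_name,
        "cost_usd": cost_usd,
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cache_read_tokens": usage.get("cache_read_tokens", 0),
    }))
```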