Overview
LLM observability is the practice of instrumenting LLM systems so you can answer two questions: what happened on any given request, and whether quality is drifting over time. It extends classic logs, metrics, and traces with LLM-specific signals: tokens, cost, prompt and model versions, tool-call spans, and quality scores. Without it, a nondeterministic failure cannot be reconstructed after the fact, because rerunning the same request may not fail the same way. This page is the monitoring view; the broad ops pillar is llmops-best-practices, and the measurement methods are in llm-evaluation-in-production. It builds on the general practice described in observability. Dated 2026-06.
Trace every request end to end
Emit one trace per request with a correlation ID spanning the full path: the assembled prompt, the model ID and prompt version, retrieved chunks, each tool call and its result, the response, and the stop reason. For agents, give each loop iteration its own span so you can see which iteration went wrong. This is the foundation; everything else is aggregation over traces. See reliable-agents-in-production and agent-loop.
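A minimal sketch of such a trace record, using only the Python standard library. The Trace and Span classes, their field names, and the "tool:search" span name are assumptions made for illustration, not the schema of any particular tracing library; a real deployment would more likely emit spans through an existing tracing SDK.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One step inside a request: retrieval, a tool call, or an agent loop iteration."""
    name: str
    started_at: float
    ended_at: Optional[float] = None
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    """Hypothetical per-request trace keyed by a correlation ID."""
    model_id: str
    prompt_version: str
    correlation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)
    stop_reason: Optional[str] = None

    def start_span(self, name: str, **attributes: Any) -> Span:
        span = Span(name=name, started_at=time.time(), attributes=dict(attributes))
        self.spans.append(span)
        return span

    def end_span(self, span: Span, **attributes: Any) -> None:
        # Attach results (or errors) when the step finishes.
        span.attributes.update(attributes)
        span.ended_at = time.time()

    def to_json(self) -> str:
        # asdict() recurses into the nested Span dataclasses.
        return json.dumps(asdict(self))


# Example: one tool call inside one request.
trace = Trace(model_id="example-model", prompt_version="v3")
span = trace.start_span("tool:search", query="refund policy")
trace.end_span(span, result_count=2)
trace.stop_reason = "end_turn"
record = json.loads(trace.to_json())
```

Because every span lives under one correlation ID, the serialized record can be stored as a single document and fetched by that ID when a request needs to be investigated.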
Capture LLM-specific metrics
Track the numbers classic APM misses.
- Tokens in and out, and cost, per request, model, and feature.
- Time to first token and total latency; for streamed responses, time to first token matters most, because it is the delay the user actually perceives.
- Cache hit rate on prompt prefixes; see cost-control.
- Tool-call count, tool-error rate, and loop-cap hits per request.
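The metrics above can be sketched as one record per request plus a roll-up by model and feature. Everything named here is an assumption for illustration: the RequestMetrics fields, the summarize helper, and especially the prices in PRICE_PER_MTOK, which are placeholders and not any provider's real rates. The cost formula also ignores any discount for cached tokens.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.0, "output": 15.0}}


@dataclass
class RequestMetrics:
    model: str
    feature: str
    tokens_in: int
    tokens_out: int
    ttft_s: float            # time to first token, in seconds
    total_latency_s: float
    cached_tokens_in: int = 0
    tool_calls: int = 0
    tool_errors: int = 0
    hit_loop_cap: bool = False

    @property
    def cost_usd(self) -> float:
        price = PRICE_PER_MTOK[self.model]
        tokens_cost = self.tokens_in * price["input"] + self.tokens_out * price["output"]
        return tokens_cost / 1_000_000


def summarize(records):
    """Aggregate per-request metrics by (model, feature)."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.model, r.feature)].append(r)

    summary = {}
    for key, rs in groups.items():
        total_in = sum(r.tokens_in for r in rs)
        total_tool_calls = sum(r.tool_calls for r in rs)
        summary[key] = {
            "requests": len(rs),
            "cost_usd": sum(r.cost_usd for r in rs),
            "mean_ttft_s": sum(r.ttft_s for r in rs) / len(rs),
            # Share of input tokens served from the prompt-prefix cache.
            "cache_hit_rate": sum(r.cached_tokens_in for r in rs) / total_in if total_in else 0.0,
            "tool_error_rate": sum(r.tool_errors for r in rs) / total_tool_calls if total_tool_calls else 0.0,
            "loop_cap_hits": sum(r.hit_loop_cap for r in rs),
        }
    return summary
```

Keeping model and feature on every record is what makes the later breakdowns possible; aggregating too early loses the dimension you will want when a cost jump needs explaining.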
Monitor quality, not just uptime
A system can be fully up and quietly wrong. Sample live traffic for quality signals: user thumbs up/down, edit and regenerate rates, refusal rate, escalation rate, and faithfulness or hallucination rate on retrieval-grounded answers. Feed graded samples back into the golden set; see llm-evaluation-in-production.
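One way to sample live traffic is to hash the correlation ID, so the same request is always in or out of the sample, and then compute rates over the sampled events. This is a sketch under assumptions: the should_sample and quality_rates helpers and the event keys ("thumb", "regenerated", "refused", "escalated") are hypothetical, and faithfulness grading is left out because it needs a grader rather than a counter.

```python
import hashlib


def should_sample(correlation_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests (0.0 to 1.0)."""
    digest = hashlib.sha256(correlation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def quality_rates(events):
    """Compute quality signals over a list of sampled event dicts."""
    n = len(events)
    if n == 0:
        return {}
    # Thumbs rate is computed only over requests that received a rating.
    rated = [e for e in events if e.get("thumb") in ("up", "down")]
    return {
        "thumbs_up_rate": sum(e["thumb"] == "up" for e in rated) / len(rated) if rated else None,
        "regenerate_rate": sum(bool(e.get("regenerated")) for e in events) / n,
        "refusal_rate": sum(bool(e.get("refused")) for e in events) / n,
        "escalation_rate": sum(bool(e.get("escalated")) for e in events) / n,
    }
```

Hash-based sampling has the side benefit that every service handling the request makes the same sampling decision without coordinating.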
Log prompts and outputs with privacy in mind
Store inputs and outputs to debug failures, but redact secrets and PII at capture, set retention limits, and respect user consent. Logging raw conversations without controls is a data-protection liability. Keep citation provenance in the trace so a disputed answer can be audited; see rag-citations.
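Redaction at capture can be sketched as a filter that runs before anything reaches the log store. The two patterns below (email addresses and strings shaped like prefixed API keys) are illustrative assumptions and far from exhaustive; the redact and capture helpers are hypothetical names, and a production system would likely need a dedicated PII detector plus retention and consent handling that this sketch omits.

```python
import re

# Illustrative patterns only; real PII and secret detection needs broader coverage.
REDACTION_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SECRET", re.compile(r"\b(?:sk|api|key)[-_][A-Za-z0-9_-]{16,}\b")),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in REDACTION_PATTERNS:
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text


def capture(store: list, correlation_id: str, prompt: str, output: str) -> None:
    """Redact first, then store: the raw text is never written."""
    store.append({
        "correlation_id": correlation_id,
        "prompt": redact(prompt),
        "output": redact(output),
    })
```

The design point is ordering: because capture calls redact before appending, there is no window in which unredacted text sits in storage waiting for a cleanup job.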
Alert on the failures that hide
Set alerts that catch silent degradation, not just exceptions: a spike in loop-cap hits, a rising tool-error rate, a cost-per-request jump, a drop in thumbs-up rate, or a latency regression after a deploy. Tie alerts to the prompt and model version so you can correlate a regression with a change.
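A sketch of such alerting: compare a current window of aggregated stats against a baseline window and report relative changes that cross a threshold, tagging each alert with the prompt and model versions on both sides. The WindowStats fields, the check_alerts helper, and every threshold in RULES are assumptions chosen for illustration; suitable thresholds depend on your traffic and tolerance for noise.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    prompt_version: str
    model_id: str
    loop_cap_hit_rate: float
    tool_error_rate: float
    cost_per_request_usd: float
    thumbs_up_rate: float
    p95_latency_s: float


# Hypothetical rules: (metric, bad direction, relative change that triggers an alert).
RULES = [
    ("loop_cap_hit_rate", "up", 0.50),
    ("tool_error_rate", "up", 0.50),
    ("cost_per_request_usd", "up", 0.25),
    ("thumbs_up_rate", "down", 0.10),
    ("p95_latency_s", "up", 0.30),
]


def check_alerts(baseline: WindowStats, current: WindowStats) -> list:
    alerts = []
    for metric, direction, threshold in RULES:
        before = getattr(baseline, metric)
        after = getattr(current, metric)
        if before <= 0:
            # Relative change is undefined from a zero baseline; a real system
            # would add an absolute threshold for this case.
            continue
        change = (after - before) / before
        fired = change >= threshold if direction == "up" else change <= -threshold
        if fired:
            alerts.append(
                f"{metric} {direction} {change:+.0%} "
                f"(prompt {baseline.prompt_version} -> {current.prompt_version}, "
                f"model {baseline.model_id} -> {current.model_id})"
            )
    return alerts
```

Including both versions in the alert text means the person paged can see immediately whether a prompt or model change coincides with the regression.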
Verification
- Trigger a failing request; confirm its full trace is retrievable by correlation ID.
- Confirm token, cost, and latency dashboards break down by model and feature.
- Confirm PII redaction runs at capture, not after storage.
- Deploy a deliberately worse prompt to staging; confirm a quality alert fires.