LLM observability and monitoring guide

Overview

LLM observability is the practice of instrumenting LLM systems so you can answer two questions: what happened on any given request, and whether quality is drifting over time. It extends classic logs-metrics-traces with LLM-specific signals: tokens, cost, prompt and model versions, tool-call spans, and quality scores. Without it, a nondeterministic failure cannot be reproduced. This page covers the monitoring view; the broader ops pillar is llmops-best-practices, and the measurement methods are in llm-evaluation-in-production. It builds on general practice in observability. Dated 2026-06.

Trace every request end to end

Emit one trace per request with a correlation ID spanning the full path: the assembled prompt, the model id and prompt version, retrieved chunks, each tool call and its result, the response, and the stop reason. For agents, span each loop iteration so you can see where it went wrong. This is the foundation; everything else is aggregation over traces. See reliable-agents-in-production and agent-loop.
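As a rough illustration of the trace shape described above, here is a minimal standard-library sketch. The `Trace` and `Span` classes, their field names, and the example values ("example-model", "prompt-v3", "end_turn") are all hypothetical; a real system would more likely use an existing tracing library and its own schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One step on the request path (hypothetical schema)."""
    name: str
    started_at: float
    ended_at: Optional[float] = None
    attributes: Dict[str, Any] = field(default_factory=dict)

    def end(self, **attributes: Any) -> None:
        # Record results (or errors) when the step finishes.
        self.attributes.update(attributes)
        self.ended_at = time.time()


@dataclass
class Trace:
    """Everything recorded for one request, keyed by correlation ID."""
    correlation_id: str
    model_id: str
    prompt_version: str
    spans: List[Span] = field(default_factory=list)
    stop_reason: Optional[str] = None

    def start_span(self, name: str, **attributes: Any) -> Span:
        span = Span(name=name, started_at=time.time(), attributes=dict(attributes))
        self.spans.append(span)
        return span

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def new_trace(model_id: str, prompt_version: str) -> Trace:
    return Trace(str(uuid.uuid4()), model_id, prompt_version)


# Usage: one agent loop iteration containing one tool call.
trace = new_trace("example-model", "prompt-v3")
step = trace.start_span("agent_iteration", iteration=1)
tool = trace.start_span("tool:search", query="refund policy")
tool.end(result_count=3, error=None)
step.end()
trace.stop_reason = "end_turn"
record = json.loads(trace.to_json())
```

The serialized `record` is what would be written to the trace store; aggregating over such records by `model_id` and `prompt_version` is what the later sections rely on.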

Capture LLM-specific metrics

Track the numbers classic APM misses.

Tokens in and out, and cost, per request, model, and feature.
Time to first token and total latency; when responses stream, time to first token matters most.
Cache hit rate on prompt prefixes; see cost-control.
Tool-call count, tool-error rate, and loop-cap hits per request.
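The token, cost, and latency metrics in this list can be sketched as follows. This is an assumption-laden example: the `PRICES` table holds made-up per-million-token rates, and the record keys (`model`, `feature`, `tokens_in`, `tokens_out`) are hypothetical names, not a required schema.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical prices in USD per million tokens; substitute real provider rates.
PRICES = {"example-model": {"input": 1.00, "output": 4.00}}


def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of one request in USD under the assumed price table."""
    price = PRICES[model]
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000


def timed_stream(
    chunks: Iterable[str], clock: Callable[[], float] = time.monotonic
) -> Tuple[str, Optional[float], float]:
    """Consume a stream; return (text, time to first token, total latency)."""
    start = clock()
    first_token_at: Optional[float] = None
    parts: List[str] = []
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = clock() - start
        parts.append(chunk)
    return "".join(parts), first_token_at, clock() - start


def cost_by(records: List[dict], key: str) -> Dict[str, float]:
    """Sum request cost grouped by a record field such as 'model' or 'feature'."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[r[key]] += request_cost(r["model"], r["tokens_in"], r["tokens_out"])
    return dict(totals)


records = [
    {"model": "example-model", "feature": "search", "tokens_in": 1000, "tokens_out": 500},
    {"model": "example-model", "feature": "chat", "tokens_in": 2000, "tokens_out": 0},
]
per_feature = cost_by(records, "feature")
```

Passing the clock in as a parameter keeps the latency measurement testable without real delays.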

Monitor quality, not just uptime

A system can be fully up and quietly wrong. Sample live traffic for quality signals: user thumbs up/down, edit and regenerate rates, refusal rate, escalation rate, and faithfulness or hallucination rate on retrieval-grounded answers. Feed graded samples back into the golden set; see llm-evaluation-in-production.
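One way to turn sampled traffic into the rates named above is sketched below. The event keys (`thumb`, `regenerated`, `refused`, `escalated`) are hypothetical; faithfulness scoring is left out because it needs a grader, which is covered by the evaluation methods rather than by simple counting.

```python
import random
from typing import Dict, List, Optional


def sample_for_review(trace_ids: List[str], rate: float, seed: Optional[int] = None) -> List[str]:
    """Pick roughly `rate` of the traces for grading."""
    rng = random.Random(seed)
    return [t for t in trace_ids if rng.random() < rate]


def quality_rates(events: List[dict]) -> Dict[str, Optional[float]]:
    """Aggregate per-request feedback events into quality rates."""
    total = len(events)
    if total == 0:
        return {}
    rated = [e for e in events if e.get("thumb") in ("up", "down")]
    ups = sum(1 for e in rated if e["thumb"] == "up")

    def share(flag: str) -> float:
        return sum(1 for e in events if e.get(flag)) / total

    return {
        # Thumbs-up rate is computed over rated requests only.
        "thumbs_up_rate": ups / len(rated) if rated else None,
        "regenerate_rate": share("regenerated"),
        "refusal_rate": share("refused"),
        "escalation_rate": share("escalated"),
    }


events = [
    {"thumb": "up"},
    {"thumb": "down", "regenerated": True},
    {"refused": True},
    {"thumb": "up", "escalated": True},
]
rates = quality_rates(events)
```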

Log prompts and outputs with privacy in mind

Store inputs and outputs to debug failures, but redact secrets and PII at capture, set retention limits, and respect user consent. Logging raw conversations without controls is a data-protection liability. Keep citation provenance in the trace so a disputed answer can be audited; see rag-citations.
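Redaction at capture means the scrubbing happens before anything reaches the log store. The sketch below assumes two illustrative regular expressions; they are far from complete PII or secret detection, and the `capture` function and its record shape are hypothetical.

```python
import re
from typing import List

# Illustrative patterns only; real deployments need broader detection.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b")


def redact(text: str) -> str:
    """Replace matches with placeholders so raw values are never stored."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SECRET.sub("[SECRET]", text)
    return text


def capture(store: List[dict], correlation_id: str, prompt: str, output: str) -> None:
    """Redact first, then write; the store only ever sees scrubbed text."""
    store.append(
        {
            "correlation_id": correlation_id,
            "prompt": redact(prompt),
            "output": redact(output),
        }
    )


log_store: List[dict] = []
capture(
    log_store,
    "req-123",
    "mail bob@example.com with sk-abcdefghijklmnop1234",
    "Sent.",
)
```

Because `capture` is the only write path in this sketch, there is no window in which unredacted text sits in storage.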

Alert on the failures that hide

Set alerts that catch silent degradation, not just exceptions: a spike in loop-cap hits, a rising tool-error rate, a cost-per-request jump, a drop in thumbs-up rate, or a latency regression after a deploy. Tie alerts to the prompt and model version so you can correlate a regression with a change.
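A simple form of such an alert compares a current window against a baseline and attaches the prompt and model version to anything that fires. The `Rule` class, metric names, and thresholds below are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    metric: str
    direction: str              # "up" alerts on increases, "down" on decreases
    max_relative_change: float  # e.g. 0.5 means a 50% change triggers the alert


def check_alerts(
    baseline: Dict[str, float],
    current: Dict[str, float],
    rules: List[Rule],
    model_id: str,
    prompt_version: str,
) -> List[dict]:
    alerts = []
    for rule in rules:
        base, cur = baseline.get(rule.metric), current.get(rule.metric)
        if base is None or cur is None or base == 0:
            continue  # no usable baseline for this metric
        change = (cur - base) / base
        moved = change if rule.direction == "up" else -change
        if moved > rule.max_relative_change:
            alerts.append(
                {
                    "metric": rule.metric,
                    "relative_change": change,
                    # Version tags let a regression be matched to a deploy.
                    "model_id": model_id,
                    "prompt_version": prompt_version,
                }
            )
    return alerts


rules = [
    Rule("tool_error_rate", "up", 0.5),
    Rule("cost_per_request", "up", 0.25),
    Rule("thumbs_up_rate", "down", 0.1),
]
baseline = {"tool_error_rate": 0.02, "cost_per_request": 0.010, "thumbs_up_rate": 0.80}
current = {"tool_error_rate": 0.05, "cost_per_request": 0.0105, "thumbs_up_rate": 0.78}
fired = check_alerts(baseline, current, rules, "example-model", "prompt-v4")
```

In this example only the tool-error rule fires: the rate rose by 150% against a 50% threshold, while cost and thumbs-up rate moved by less than their limits.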

Verification

Trigger a failing request; confirm its full trace is retrievable by correlation ID.
Confirm token, cost, and latency dashboards break down by model and feature.
Confirm PII redaction runs at capture, not after storage.
Deploy a deliberately worse prompt to staging; confirm a quality alert fires.
