Overview
Observability is the ability to answer questions about a running system without shipping a new build. The three pillars are logs, metrics, and traces; pick OpenTelemetry as the wire format, ship structured logs from day one, and wire trace IDs through every request.
Treat logs, metrics, and traces as the three pillars
Each pillar answers a different question. Wire all three.
- Logs: “what happened?” High-cardinality event records, one per interesting thing. Useful for forensic debugging.
- Metrics: “how often, how slow, how many?” Low-cardinality numbers aggregated over time. Cheap to store, fast to dashboard.
- Traces: “where did the time go?” A single request decomposed into spans across services. Useful for latency hunts.
A system with only logs is unmonitorable. A system with only metrics cannot be debugged.
Use OpenTelemetry as the wire format
OpenTelemetry (OTel) is the vendor-neutral standard. Instrument once; route to any backend.
- The OTel SDK (Python, Node, Go, Rust) emits spans, metrics, and structured logs over OTLP.
- The OTel Collector forwards OTLP to Datadog, Honeycomb, Grafana Cloud, Sentry, or a self-hosted stack. Backend swaps are a config change.
- Auto-instrumentation covers the common libraries: HTTP servers, Postgres clients, FastAPI, gRPC. Add manual spans for business operations.
- Avoid vendor SDKs as the primary integration; they lock the wire format and force a rewrite at every backend change.
Ship structured logs from day one
Free-text logs are unindexable. Emit JSON with a stable schema.
```json
{"ts":"2026-05-14T12:00:00Z","level":"info","msg":"order.created","order_id":"ord_123","user_id":"usr_42","trace_id":"a1b2c3","duration_ms":42}
```
- One event per line. No multi-line stack traces broken across log entries; serialize the stack into a string field.
- Use a stable set of top-level fields: `ts`, `level`, `msg`, `trace_id`, `service`, `env`. Add domain fields per event.
- Pick a logger that produces JSON natively: `pino` (Node), `structlog` (Python), `zerolog` (Go). Do not pipe `printf` through `jq`.
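For illustration, a stdlib-only sketch of the same idea — structlog and pino do this natively, so treat this as the shape of the output, not a recommended implementation; the service name and field names are placeholders:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line with stable top-level fields."""
    def format(self, record):
        event = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "service": "checkout",  # hypothetical service name
        }
        # Domain fields ride along via logging's `extra` mechanism.
        event.update(getattr(record, "fields", {}))
        if record.exc_info:
            # Serialize the stack into a single string field, never multi-line output.
            event["stack"] = self.formatException(record.exc_info)
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order.created", extra={"fields": {"order_id": "ord_123", "user_id": "usr_42"}})
```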
Propagate the trace ID through every request
A trace ID is only useful if every log line, metric exemplar, and outbound call carries it.
- Accept `traceparent` headers on inbound HTTP. Generate one if missing.
- Inject the trace ID into the logger context (`structlog.contextvars`, `pino` child loggers) so every line emitted during the request carries it.
- Forward the header on outbound HTTP, gRPC, and queue messages. A trace that dies at a service boundary is half a trace.
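The inbound/outbound handling can be sketched against the W3C `traceparent` format (a stdlib-only sketch; the function names are hypothetical — in practice the OTel SDK's propagators do this for you):

```python
import os
import re

# W3C traceparent: version-traceid-spanid-flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def incoming_trace_id(headers: dict) -> str:
    """Accept a traceparent header on inbound HTTP; generate a trace ID if missing/invalid."""
    m = TRACEPARENT.match(headers.get("traceparent", ""))
    return m.group(1) if m else os.urandom(16).hex()

def outgoing_traceparent(trace_id: str) -> str:
    """Forward the trace on outbound calls, with a fresh span ID for this hop."""
    return f"00-{trace_id}-{os.urandom(8).hex()}-01"
```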
Pick the right metric type
Three primitives cover most cases.
- Counter: monotonically increasing. `http_requests_total`, `orders_created_total`. Rates are derived in the query layer.
- Gauge: a value that moves up and down. `db_connections_in_use`, `queue_depth`.
- Histogram: a distribution. `http_request_duration_ms`. Pre-bucketed; supports percentiles in the backend.
Avoid summaries (client-side percentiles); they cannot be aggregated across instances. Avoid high-cardinality labels (`user_id`, `request_id`); cardinality explodes the metric store.
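To make the three primitives concrete, a toy sketch of their semantics (illustrative only — use a real client such as prometheus_client or the OTel metrics API in production; the bucket bounds are assumptions):

```python
from bisect import bisect_left

class Counter:
    """Monotonic; rates are derived in the query layer, not here."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Moves up and down: connections in use, queue depth."""
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

class Histogram:
    """Pre-bucketed distribution; percentiles are computed in the backend."""
    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500, 1000)):
        self.buckets = buckets
        self.counts = [0] * (len(buckets) + 1)  # last slot is the +Inf bucket
    def observe(self, v):
        # First bucket whose upper bound is >= v (le semantics).
        self.counts[bisect_left(self.buckets, v)] += 1

requests = Counter()
requests.inc()
latency = Histogram()
latency.observe(42)  # lands in the le=50 bucket
```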
Use RED for services, USE for resources
Two canonical dashboards. Build both per service.
- RED (services): Rate (requests per second), Errors (5xx per second), Duration (p50, p95, p99). One row per route or one per service. See `fastapi` for the route-level shape.
- USE (resources): Utilization (CPU, memory percent), Saturation (queue depth, run queue length), Errors (disk errors, OOM kills). One row per host or per `hostinger-vps` instance.
If a dashboard is not RED or USE, ask what question it answers. Most “kitchen sink” dashboards answer none.
Send errors to a dedicated tracker
Error tracking is not log search. Sentry, Rollbar, or GlitchTip group exceptions by fingerprint, deduplicate across replicas, and ping on first occurrence in a release.
- Wire the SDK once at process start. Auto-capture unhandled exceptions; manually capture the handled-but-interesting ones.
- Tag every error with `release`, `env`, and the current `trace_id`.
Set log levels deliberately and sample traces
Verbosity costs money at scale.
- DEBUG: dropped in production. Local dev only.
- INFO: discrete business events (“order.created”, “user.signed_up”). Not request-level chatter.
- WARN: recoverable issues that did not break the request (retry succeeded on second attempt).
- ERROR: failures the user noticed. Pages someone.
Sample traces at 1% head-based for steady traffic. Add tail-based sampling so slow or errored requests are captured at 100%. See `general-principles` for cost-aware telemetry.
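The sampling policy can be sketched as follows (a toy, stdlib-only illustration; real tail-based sampling runs in the OTel Collector, and the latency threshold and function names here are assumptions):

```python
HEAD_SAMPLE_RATE = 0.01  # keep 1% of ordinary traces, decided at the root

def head_sample(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Deterministic head sampling: hash the trace ID so every service
    makes the same keep/drop decision for the same trace."""
    return int(trace_id[-8:], 16) / 0xFFFFFFFF < rate

def tail_keep(duration_ms: float, error: bool, sampled_at_head: bool) -> bool:
    """Tail policy: always keep slow or errored traces, otherwise defer
    to the head decision. 1000 ms is an assumed slowness threshold."""
    return error or duration_ms > 1000 or sampled_at_head
```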