Overview
Observability is the ability to answer questions about a running system without shipping a new build. The three pillars are logs, metrics, and traces; pick OpenTelemetry as the wire format, ship structured logs from day one, and wire trace IDs through every request.
Treat logs, metrics, and traces as the three pillars
Each pillar answers a different question. Wire all three.
- Logs: “what happened?” High-cardinality event records, one per interesting thing. Useful for forensic debugging.
- Metrics: “how often, how slow, how many?” Low-cardinality numbers aggregated over time. Cheap to store, fast to dashboard.
- Traces: “where did the time go?” A single request decomposed into spans across services. Useful for latency hunts.
A system with only logs is unmonitorable. A system with only metrics cannot be debugged.
Use OpenTelemetry as the wire format
OpenTelemetry (OTel) is the vendor-neutral standard. Instrument once; route to any backend.
- The OTel SDK (Python, Node, Go, Rust) emits spans, metrics, and structured logs over OTLP.
- The OTel Collector forwards OTLP to Datadog, Honeycomb, Grafana Cloud, Sentry, or a self-hosted stack. Backend swaps are a config change.
- Auto-instrumentation covers the common libraries: HTTP servers, Postgres clients, FastAPI, gRPC. Add manual spans for business operations.
- Avoid vendor SDKs as the primary integration; they lock the wire format and force a rewrite at every backend change.
Ship structured logs from day one
Free-text logs are unindexable. Emit JSON with a stable schema.
```json
{"ts":"2026-05-14T12:00:00Z","level":"info","msg":"order.created","order_id":"ord_123","user_id":"usr_42","trace_id":"a1b2c3","duration_ms":42}
```
- One event per line. No multi-line stack traces broken across log entries; serialize the stack into a string field.
- Use a stable set of top-level fields: `ts`, `level`, `msg`, `trace_id`, `service`, `env`. Add domain fields per event.
- Pick a logger that produces JSON natively: `pino` (Node), `structlog` (Python), `zerolog` (Go). Do not pipe `printf` through `jq`.
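For illustration, a stdlib-only sketch of the same idea — structlog and pino do this natively, so treat this as the shape of the output, not a recommended implementation; the service name and field names are placeholders:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line with stable top-level fields."""
    def format(self, record):
        event = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "service": "checkout",  # hypothetical service name
        }
        # Domain fields ride along via logging's `extra` mechanism.
        event.update(getattr(record, "fields", {}))
        if record.exc_info:
            # Serialize the stack into a single string field, never multi-line output.
            event["stack"] = self.formatException(record.exc_info)
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order.created", extra={"fields": {"order_id": "ord_123", "user_id": "usr_42"}})
```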
Propagate the trace ID through every request
A trace ID is only useful if every log line, metric exemplar, and outbound call carries it.
- Accept `traceparent` headers on inbound HTTP. Generate one if missing.
- Inject the trace ID into the logger context (`structlog.contextvars`, `pino` child loggers) so every line emitted during the request carries it.
- Forward the header on outbound HTTP, gRPC, and queue messages. A trace that dies at a service boundary is half a trace.
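The inbound/outbound handling can be sketched against the W3C `traceparent` format (a stdlib-only sketch; the function names are hypothetical — in practice the OTel SDK's propagators do this for you):

```python
import os
import re

# W3C traceparent: version-traceid-spanid-flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def incoming_trace_id(headers: dict) -> str:
    """Accept a traceparent header on inbound HTTP; generate a trace ID if missing/invalid."""
    m = TRACEPARENT.match(headers.get("traceparent", ""))
    return m.group(1) if m else os.urandom(16).hex()

def outgoing_traceparent(trace_id: str) -> str:
    """Forward the trace on outbound calls, with a fresh span ID for this hop."""
    return f"00-{trace_id}-{os.urandom(8).hex()}-01"
```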
Pick the right metric type
Three primitives cover most cases.
- Counter: monotonically increasing. `http_requests_total`, `orders_created_total`. Rates are derived in the query layer.
- Gauge: a value that moves up and down. `db_connections_in_use`, `queue_depth`.
- Histogram: a distribution. `http_request_duration_ms`. Pre-bucketed; supports percentiles in the backend.
Avoid summaries (client-side percentiles); they cannot be aggregated across instances. Avoid high-cardinality labels (`user_id`, `request_id`); cardinality explodes the metric store.
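To make the three primitives concrete, a toy sketch of their semantics (illustrative only — use a real client such as prometheus_client or the OTel metrics API in production; the bucket bounds are assumptions):

```python
from bisect import bisect_left

class Counter:
    """Monotonic; rates are derived in the query layer, not here."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Moves up and down: connections in use, queue depth."""
    def __init__(self):
        self.value = 0
    def set(self, v):
        self.value = v

class Histogram:
    """Pre-bucketed distribution; percentiles are computed in the backend."""
    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500, 1000)):
        self.buckets = buckets
        self.counts = [0] * (len(buckets) + 1)  # last slot is the +Inf bucket
    def observe(self, v):
        # First bucket whose upper bound is >= v (le semantics).
        self.counts[bisect_left(self.buckets, v)] += 1

requests = Counter()
requests.inc()
latency = Histogram()
latency.observe(42)  # lands in the le=50 bucket
```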
Use RED for services, USE for resources
Two canonical dashboards. Build both per service.
- RED (services): Rate (requests per second), Errors (5xx per second), Duration (p50, p95, p99). One row per route or one per service. See `fastapi` for the route-level shape.
- USE (resources): Utilization (CPU, memory percent), Saturation (queue depth, run queue length), Errors (disk errors, OOM kills). One row per host or per `hostinger-vps` instance.
If a dashboard is not RED or USE, ask what question it answers. Most “kitchen sink” dashboards answer none.
Send errors to a dedicated tracker
Error tracking is not log search. Sentry, Rollbar, or GlitchTip group exceptions by fingerprint, deduplicate across replicas, and ping on first occurrence in a release.
- Wire the SDK once at process start. Auto-capture unhandled exceptions; manually capture the handled-but-interesting ones.
- Tag every error with `release`, `env`, and the current `trace_id`.
Set log levels deliberately and sample traces
Verbosity costs money at scale.
- DEBUG: dropped in production. Local dev only.
- INFO: discrete business events (“order.created”, “user.signed_up”). Not request-level chatter.
- WARN: recoverable issues that did not break the request (retry succeeded on second attempt).
- ERROR: failures the user noticed. Pages someone.
Sample traces at 1% head-based for steady traffic. Add tail-based sampling so slow or errored requests are captured at 100%. See `general-principles` for cost-aware telemetry.
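The sampling policy can be sketched as follows (a toy, stdlib-only illustration; real tail-based sampling runs in the OTel Collector, and the latency threshold and function names here are assumptions):

```python
HEAD_SAMPLE_RATE = 0.01  # keep 1% of ordinary traces, decided at the root

def head_sample(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Deterministic head sampling: hash the trace ID so every service
    makes the same keep/drop decision for the same trace."""
    return int(trace_id[-8:], 16) / 0xFFFFFFFF < rate

def tail_keep(duration_ms: float, error: bool, sampled_at_head: bool) -> bool:
    """Tail policy: always keep slow or errored traces, otherwise defer
    to the head decision. 1000 ms is an assumed slowness threshold."""
    return error or duration_ms > 1000 or sampled_at_head
```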