Overview

Treat incidents as predictable events with a repeatable playbook, not emergencies improvised from scratch each time.

A good incident response system answers three questions instantly: how bad is this, who owns it right now, and what does everyone else need to know. The sections below define each piece.

Define severity before an incident happens

Assign a severity level the moment an alert fires. Do not wait until you understand the root cause.

Level | Definition | Response target
SEV1 | Full outage, data loss, or security breach affecting all or most users. Revenue or data integrity is at immediate risk. | Page on-call immediately; acknowledge within 5 minutes.
SEV2 | Major feature broken or significant user segment affected. Core workflows are degraded but the service is partially up. | Page on-call; acknowledge within 15 minutes.
SEV3 | Non-critical feature degraded or minor subset of users affected. Workarounds exist; service is mostly functional. | Create a ticket; triage within business hours. No page unless it escalates.

When in doubt, round up. It is cheaper to downgrade a SEV2 that turns out to be a SEV3 than to discover a “SEV3” was silently losing revenue for an hour. This principle comes directly from Google SRE practice: severity describes the incident’s impact, not the team’s confidence about the cause.
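The "round up" rule can be sketched in code. This is a minimal illustration, not a prescribed implementation: the Impact fields and the classify function are hypothetical names, and the mapping from impact to level is a simplified reading of the severity definitions above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means more severe, matching SEV1/SEV2/SEV3.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Impact:
    # Hypothetical triage inputs; adapt them to your own signals.
    full_outage: bool = False
    data_loss_or_breach: bool = False
    core_workflow_degraded: bool = False
    unsure: bool = False  # responder is torn between two levels


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity, rounding up when unsure."""
    if impact.full_outage or impact.data_loss_or_breach:
        level = Severity.SEV1
    elif impact.core_workflow_degraded:
        level = Severity.SEV2
    else:
        level = Severity.SEV3
    # "When in doubt, round up": one level more severe, capped at SEV1.
    if impact.unsure and level > Severity.SEV1:
        level = Severity(level - 1)
    return level
```

Note that the inputs describe impact only. Nothing about the suspected cause feeds the decision, which is the point of the rule.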

Establish an on-call rotation and escalation path

Every production service needs a named on-call engineer at all times. The rotation must be documented, not tribal knowledge.

Escalation path for SEV1 and SEV2:

  1. Alert fires. Primary on-call is paged via PagerDuty or equivalent.
  2. No acknowledgment within 10 minutes (SEV1) or 20 minutes (SEV2): secondary on-call is paged automatically.
  3. No acknowledgment within 30 minutes: engineering manager is notified.
  4. For SEV1 involving data loss or a security event: security lead and executive sponsor are notified in parallel with step 1, not after escalation stalls.

Codify these rules in your alerting tool. Escalation that depends on someone remembering to manually ping a manager will fail under pressure.
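As a rough sketch of what codifying the path could look like, the function below encodes the four escalation steps as data-driven logic. It is not tied to PagerDuty or any other tool's API; the function name, the role strings, and the choice to treat "within N minutes" as "N or more minutes unacknowledged" are assumptions for illustration.

```python
from typing import List


def escalation_targets(
    severity: str,
    minutes_unacked: int,
    data_or_security: bool = False,
) -> List[str]:
    """Return everyone who should have been notified so far."""
    if severity not in ("SEV1", "SEV2"):
        return []  # SEV3 gets a ticket, not a page

    # Step 1: the primary on-call is always paged first.
    targets = ["primary on-call"]

    # Step 4: SEV1 with data loss or a security event notifies
    # these roles in parallel with step 1, not after a stall.
    if severity == "SEV1" and data_or_security:
        targets += ["security lead", "executive sponsor"]

    # Step 2: secondary on-call after 10 (SEV1) or 20 (SEV2) minutes.
    secondary_after = 10 if severity == "SEV1" else 20
    if minutes_unacked >= secondary_after:
        targets.append("secondary on-call")

    # Step 3: engineering manager after 30 minutes.
    if minutes_unacked >= 30:
        targets.append("engineering manager")

    return targets
```

For example, escalation_targets("SEV2", 20) returns the primary and secondary on-call, while the same call with "SEV1" and 5 minutes returns only the primary.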

Open a dedicated incident channel immediately

The moment a SEV1 or SEV2 is declared, open a dedicated Slack channel (or equivalent) named with the date and a short description, such as #inc-2026-06-15-api-outage. All communication flows there.
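Generating the name in code keeps it consistent across incidents. The helper below is a small sketch; the function name and the slug rules (lowercase, runs of non-alphanumerics collapsed to a hyphen) are assumptions chosen to reproduce the example name above.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str) -> str:
    """Build a channel name like inc-YYYY-MM-DD-short-description."""
    # Lowercase, then collapse anything that is not a-z or 0-9 into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"inc-{day.isoformat()}-{slug}"
```

Calling incident_channel_name(date(2026, 6, 15), "API outage") yields inc-2026-06-15-api-outage.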

Assign roles at the start:

  • Incident commander (IC): owns the process, not the fix. The IC delegates investigation tasks, drives decisions, and ensures updates go out on schedule. The IC does not need to be the most senior engineer.
  • Technical lead: owns the investigation and coordinates the fix.
  • Comms lead: writes external updates and keeps the status page current.

Separate the technical channel from customer-facing communication. Engineers need space to be wrong in real time; customers need calm, accurate summaries.

Status update cadence:

  • SEV1: external update every 15 minutes until resolved.
  • SEV2: external update every 30 minutes.
  • Post-resolution: a final “resolved” update with a brief explanation.

Use a public status page (e.g., Statuspage, Instatus) for external updates. Never let customers learn about an outage from social media before the status page is updated.
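The cadence above is easy to enforce with a timer or a bot reminder. The following is a minimal sketch under that assumption; next_update_due and is_overdue are hypothetical helpers, not part of any status page product.

```python
from datetime import datetime, timedelta
from typing import Optional

# External update intervals from the cadence above.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next external update is due, or None if no cadence applies."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None  # e.g. SEV3: no scheduled external updates
    return last_update + interval


def is_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True when the scheduled update time has already passed."""
    due = next_update_due(severity, last_update)
    return due is not None and now > due
```

A reminder job could call is_overdue once a minute and nudge the comms lead in the incident channel when it returns True.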

Follow a “mitigate first” protocol in the first 10 minutes

Restore service before root-causing. The sequence is:

  1. Acknowledge the page. Claim ownership in the incident channel so no one duplicates the response.
  2. Assess scope and severity. Check observability dashboards to answer: what is broken, for how many users, since when. Assign a severity level.
  3. Communicate. Post a one-line status in the incident channel and update the status page. Even “investigating” is better than silence.
  4. Mitigate. Roll back the last deploy, flip a feature flag, redirect traffic, or restore from backup (see hostinger-vps-backups). Take the fastest path to restoring service.
  5. Root-cause after stability. Once users are unblocked, preserve logs and traces via llm-observability or your observability stack, then investigate. Do not let the desire to understand the cause delay the fix.

This order is non-negotiable for SEV1. Customers do not benefit from a perfectly diagnosed outage that lasted 45 extra minutes.

Run a blameless postmortem within 48 hours

Every SEV1 and every SEV2 that required on-call escalation gets a postmortem, held within 48 hours while details are still fresh. SEV3 incidents get a brief written summary if a systemic pattern is suspected.

Blameless means the process assumes every person acted with the best information they had at the time. Blame produces defensiveness; defensiveness produces incomplete timelines; incomplete timelines produce repeated incidents. Atlassian and Google SRE both treat blamelessness as a prerequisite for honest postmortems, not a nicety.

Postmortem structure:

  1. Timeline. Ordered list of events from first alert to full resolution, sourced from logs and tool history, not memory.
  2. Root cause. The technical condition that caused the failure. Use “five whys” to get past the proximate cause. Refer to systems and processes, not individuals.
  3. Impact. Users affected, duration, data or revenue exposure.
  4. What worked. Explicitly note detection speed, tooling, or coordination that held up well. Reinforce those behaviors.
  5. Action items. Each item has an owner, a due date, and a category: mitigative (prevents this specific failure) or preventative (addresses the class of failure). Add unresolved gaps to pre-launch-checklist for future deploys.

Do not close the postmortem until every action item has an owner and a due date. A postmortem with no owners is a document, not a process.
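The closing rule can be checked mechanically. The sketch below assumes action items are tracked as simple records; ActionItem, blocking_items, and can_close are hypothetical names, and treating a postmortem with zero action items as closable is an assumption the text does not settle.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# The two categories named in the postmortem structure.
CATEGORIES = {"mitigative", "preventative"}


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    category: Optional[str] = None


def blocking_items(items: List[ActionItem]) -> List[str]:
    """Describe every item still missing an owner, due date, or valid category."""
    return [
        item.description
        for item in items
        if not item.owner or item.due is None or item.category not in CATEGORIES
    ]


def can_close(items: List[ActionItem]) -> bool:
    """A postmortem may close only when no action item is blocking."""
    # Assumption: an empty list is closable; tighten this if your
    # process requires at least one action item per postmortem.
    return not blocking_items(items)
```

Wiring a check like this into the ticket or document workflow makes the rule hard to skip when the team is tired after an incident.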