Incident response

What to do when something is broken in production.

Severity levels

Level  Meaning                                         Response
SEV1   Customers broadly affected, revenue-impacting   All hands, page on-call, status page update
SEV2   Significant degradation for some customers      On-call owns, status page update
SEV3   Minor degradation, workaround exists            Ticket, fix during business hours
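
If tooling (a bot or a paging script, say) needs the same table, it could be encoded as a simple lookup. This is a minimal sketch only: the names Severity, RESPONSE and response_for are hypothetical, and the response steps are copied from the table rather than from any existing system.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"


# Response steps copied from the severity table; the layout is illustrative.
RESPONSE = {
    Severity.SEV1: ("all hands", "page on-call", "status page update"),
    Severity.SEV2: ("on-call owns", "status page update"),
    Severity.SEV3: ("ticket", "fix during business hours"),
}


def response_for(level: str) -> tuple[str, ...]:
    """Look up the response steps for a level such as "sev2" or "SEV2"."""
    # Severity(...) raises ValueError for anything that is not SEV1-SEV3.
    return RESPONSE[Severity(level.upper())]
```

For example, response_for("sev2") returns the two SEV2 steps, and an unknown level fails loudly instead of silently defaulting to a lower severity.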

First five minutes

  1. Acknowledge in #incidents. Even just "I'm looking."
  2. Declare a severity. It's better to over-declare and downgrade than the reverse.
  3. Open an incident channel (#inc-YYYY-MM-DD-short-name) for SEV1 and SEV2.
  4. Assign roles: Incident Commander, Comms, Scribe. The IC does not debug.
  5. Update the status page if customers are affected.
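
The channel naming pattern in step 3 can be generated rather than typed by hand, which keeps names consistent under pressure. The sketch below assumes a hypothetical helper, incident_channel_name, and assumes the short name should be lowercased with runs of other characters collapsed to single hyphens; neither detail is specified above.

```python
import re
from datetime import date


def incident_channel_name(short_name: str, day: date) -> str:
    """Build a name of the form #inc-YYYY-MM-DD-short-name."""
    # Assumption: lowercase, and collapse anything that is not a-z or 0-9 into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    # date.isoformat() yields YYYY-MM-DD.
    return f"#inc-{day.isoformat()}-{slug}"
```

Called with "Checkout API down" and date(2024, 3, 7), this produces #inc-2024-03-07-checkout-api-down. Which date to pass (local or UTC) is left to the caller.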

While you're in it

  • Over-communicate. Post updates every 15 minutes even if there's nothing new.
  • One Incident Commander at a time. Hand off explicitly: "IC is now @alice."
  • Timestamp everything in the channel. Future-you writing the postmortem will thank you.
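
The 15-minute cadence and the timestamping habit are both easy to support with a small helper. This is a sketch under stated assumptions: update_due and stamped are hypothetical names, timestamps are rendered in UTC (the channel timezone is not specified above), and callers are expected to pass timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone

# Cadence from the guidance above: post an update every 15 minutes.
UPDATE_INTERVAL = timedelta(minutes=15)


def update_due(last_update: datetime, now: datetime) -> bool:
    """Return True once at least 15 minutes have passed since the last update."""
    return now - last_update >= UPDATE_INTERVAL


def stamped(message: str, now: datetime) -> str:
    """Prefix a message with a UTC timestamp (assumes `now` is timezone-aware)."""
    utc_now = now.astimezone(timezone.utc)
    return f"[{utc_now.strftime('%Y-%m-%d %H:%M:%SZ')}] {message}"
```

A bot or a personal script could call update_due on a timer to nudge Comms, and wrap every posted message in stamped so the Scribe's record can be lifted into the postmortem as-is.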

After

Every SEV1 and SEV2 gets a blameless postmortem within 5 business days. Template:

  • What happened — user-visible impact, duration
  • Timeline — in UTC
  • Root cause — the technical cause
  • Contributing factors — what made this possible or worse
  • What went well
  • What didn't
  • Action items — each with an owner and a due date
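
The 5-business-day deadline is easy to miscount across a weekend. The sketch below computes it under assumptions that are not stated above: business days are Monday to Friday, public holidays are ignored, and counting starts on the day after the incident. The function name postmortem_due is hypothetical.

```python
from datetime import date, timedelta


def postmortem_due(incident_day: date, business_days: int = 5) -> date:
    """Return the date that falls `business_days` working days after the incident."""
    day = incident_day
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        # weekday() is 0 for Monday through 6 for Sunday; skip Saturday and Sunday.
        if day.weekday() < 5:
            remaining -= 1
    return day
```

For an incident on Thursday 2024-03-07 this gives Thursday 2024-03-14. Teams that observe holidays would need to extend the weekday check accordingly.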