Incident response
What to do when something is broken in production.
Severity levels
| Level | Meaning | Response |
|---|---|---|
| SEV1 | Customers broadly affected, revenue-impacting | All hands, page on-call, status page update |
| SEV2 | Significant degradation for some customers | On-call owns, status page update |
| SEV3 | Minor degradation, workaround exists | Ticket, fix during business hours |
First five minutes
- Acknowledge in `#incidents`. Even just "I'm looking."
- Declare a severity. It's better to over-declare and downgrade than the reverse.
- Open an incident channel (`#inc-YYYY-MM-DD-short-name`) for SEV1 and SEV2.
- Assign roles: Incident Commander, Comms, Scribe. The IC does not debug.
- Update the status page if customers are affected.
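
If you script channel creation, the `#inc-YYYY-MM-DD-short-name` pattern can be generated rather than typed by hand. The sketch below is illustrative only: the function name `incident_channel_name` and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) are assumptions, not an established convention of this runbook.

```python
import re
from datetime import date


def incident_channel_name(short_name: str, day: date) -> str:
    """Build a channel name following the #inc-YYYY-MM-DD-short-name pattern.

    Hypothetical helper: the slug rule below is an assumption.
    """
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    # date.isoformat() yields YYYY-MM-DD.
    return f"#inc-{day.isoformat()}-{slug}"
```

For example, `incident_channel_name("Checkout API down", date(2024, 3, 5))` gives `#inc-2024-03-05-checkout-api-down`.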
While you're in it
- Over-communicate. Post updates every 15 minutes even if there's nothing new.
- One Incident Commander at a time. Hand off explicitly: "IC is now @alice."
- Timestamp everything in the channel. Future-you writing the postmortem will thank you.
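
The 15-minute cadence and the timestamping habit are easy to support with a small helper. This is a minimal sketch, assuming updates are stamped in UTC (to match the postmortem timeline) and that `format_update` and `next_update_due` are hypothetical names, not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone

# Update cadence taken from the guidance above.
UPDATE_INTERVAL = timedelta(minutes=15)


def format_update(message: str, now: datetime) -> str:
    """Prefix a channel update with a UTC timestamp (assumed format)."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return f"[{stamp}] {message}"


def next_update_due(last_update: datetime) -> datetime:
    """Return when the next update is due, even if there is nothing new to say."""
    return last_update + UPDATE_INTERVAL
```

Passing a timezone-aware `datetime` keeps the output unambiguous when responders are spread across time zones.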
After
Every SEV1 and SEV2 gets a blameless postmortem within 5 business days. Template:
- What happened — user-visible impact, duration
- Timeline — in UTC
- Root cause — the technical cause
- Contributing factors — what made this possible or worse
- What went well
- What didn't
- Action items — each with an owner and a due date
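
The 5-business-day postmortem deadline can be computed instead of counted on a calendar. The sketch below rests on assumptions not stated above: business days are Monday through Friday, holidays are ignored, and counting starts the day after the incident. `postmortem_due` is a hypothetical helper name.

```python
from datetime import date, timedelta


def postmortem_due(incident_day: date, business_days: int = 5) -> date:
    """Return the date that falls `business_days` working days after the incident.

    Assumptions: Monday-Friday working week, no holiday calendar,
    and the incident day itself is not counted.
    """
    current = incident_day
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current
```

With these assumptions, an incident on Friday 2024-03-08 has a postmortem due on Friday 2024-03-15.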