Incident response

What to do when something is broken in production.

Severity levels

Level	Meaning	Response
SEV1	Customers broadly affected, revenue-impacting	All hands, page on-call, status page update
SEV2	Significant degradation for some customers	On-call owns, status page update
SEV3	Minor degradation, workaround exists	Ticket, fix during business hours
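If tooling needs the same table, it can be kept as plain data. The sketch below is only an illustration: the `RESPONSES` mapping and the `needs_status_page_update` helper are hypothetical names, and the action strings are copied from the table rather than from any existing system.

```python
# Severity table as data. Action strings mirror the "Response" column above.
RESPONSES = {
    "SEV1": ["all hands", "page on-call", "status page update"],
    "SEV2": ["on-call owns", "status page update"],
    "SEV3": ["ticket", "fix during business hours"],
}


def needs_status_page_update(severity: str) -> bool:
    """Return True if the table lists a status page update for this level."""
    if severity not in RESPONSES:
        raise ValueError(f"unknown severity: {severity!r}")
    return "status page update" in RESPONSES[severity]


if __name__ == "__main__":
    for level in RESPONSES:
        print(level, needs_status_page_update(level))
```

Keeping the table in one place means a bot or checklist reads the same definitions as the humans do.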

Acknowledge in #incidents, even if it's just "I'm looking."
Declare a severity. It's better to over-declare and downgrade than to under-declare and escalate late.
Open an incident channel (#inc-YYYY-MM-DD-short-name) for SEV1 and SEV2.
Assign roles: Incident Commander, Comms, Scribe. The IC does not debug.
Update the status page if customers are affected.
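The channel naming pattern above can be generated rather than typed by hand. This is a minimal sketch: `incident_channel_name` and `needs_incident_channel` are hypothetical helpers, and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption, since the pattern only says "short-name".

```python
import re
from datetime import date


def needs_incident_channel(severity: str) -> bool:
    """Dedicated channels are opened for SEV1 and SEV2 only."""
    return severity in {"SEV1", "SEV2"}


def incident_channel_name(short_name: str, day: date) -> str:
    """Build a name of the form #inc-YYYY-MM-DD-short-name."""
    # Assumed slug rule: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    # date.isoformat() yields YYYY-MM-DD.
    return f"#inc-{day.isoformat()}-{slug}"


if __name__ == "__main__":
    print(incident_channel_name("API Latency", date(2024, 3, 7)))
    # prints: #inc-2024-03-07-api-latency
```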

Over-communicate. Post updates every 15 minutes even if there's nothing new.
One Incident Commander at a time. Hand off explicitly: "IC is now @alice."
Timestamp everything in the channel. Future-you writing the postmortem will thank you.
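A reminder bot could enforce the 15-minute cadence. The sketch below shows only the timing check and a timestamped log line. `update_overdue` and `log_line` are hypothetical names, and UTC timestamps are an assumption, not something this page specifies.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=15)


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """Return True once 15 minutes or more have passed since the last update."""
    return now - last_update >= UPDATE_INTERVAL


def log_line(message: str, now: datetime) -> str:
    """Prefix a channel message with an ISO 8601 timestamp (UTC assumed)."""
    return f"[{now.astimezone(timezone.utc).isoformat(timespec='seconds')}] {message}"


if __name__ == "__main__":
    last = datetime(2024, 3, 7, 12, 0, tzinfo=timezone.utc)
    now = datetime(2024, 3, 7, 12, 15, tzinfo=timezone.utc)
    if update_overdue(last, now):
        print(log_line("No change. Still investigating.", now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test.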

Every SEV1 and SEV2 gets a blameless postmortem within 5 business days. Template: