Incident response

When something breaks in production, the first goal is to restore service as quickly as possible. Understanding why it broke comes second.

Severity levels

Level	Description	Response target
SEV-1	Production down or data loss	Immediate, all hands
SEV-2	Significant degradation, major feature broken	Within 30 minutes
SEV-3	Minor degradation, workaround available	Within 2 hours
SEV-4	Cosmetic or low-impact issue	Next business day
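As an illustration only, the table above could be encoded as data so that tooling can check response targets. This is a minimal sketch, not an existing tool: the names `SeverityPolicy`, `SEVERITY_POLICIES`, `respond_within`, and `is_response_overdue` are hypothetical, and treating "Immediate" as a zero-length deadline is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    response_target: str  # wording taken from the severity table
    # Hypothetical field: the target as a duration, or None when the
    # table gives a calendar-based target ("Next business day").
    respond_within: Optional[timedelta]


SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(
        "Production down or data loss", "Immediate, all hands", timedelta(0)
    ),
    "SEV-2": SeverityPolicy(
        "Significant degradation, major feature broken",
        "Within 30 minutes",
        timedelta(minutes=30),
    ),
    "SEV-3": SeverityPolicy(
        "Minor degradation, workaround available",
        "Within 2 hours",
        timedelta(hours=2),
    ),
    "SEV-4": SeverityPolicy(
        "Cosmetic or low-impact issue", "Next business day", None
    ),
}


def is_response_overdue(level: str, elapsed: timedelta) -> bool:
    """Return True if `elapsed` exceeds the duration-based target for `level`.

    Levels without a duration-based target (SEV-4 here) are never
    reported as overdue by this simple check.
    """
    limit = SEVERITY_POLICIES[level].respond_within
    if limit is None:
        return False
    return elapsed > limit
```

For example, `is_response_overdue("SEV-2", timedelta(minutes=45))` flags a SEV-2 that nobody has picked up after 45 minutes, while the same delay on a SEV-3 is still within target.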

1. Acknowledge the alert in your alerting tool to signal that you’re on it.
2. Assess severity: is this SEV-1/2 or lower?
3. Open a war room: for SEV-1/2, create a Slack thread in #incidents and invite your on-call partner.
4. Mitigate first: roll back, disable a feature flag, or scale up before diagnosing the root cause.
5. Communicate: post updates to #incidents every 15 minutes until the incident is resolved.
6. Resolve and document: mark the incident resolved and file a postmortem for SEV-1/2.
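The steps above branch on severity (a war room and a postmortem only for SEV-1/2) and set a fixed update cadence. The sketch below shows one way to express that logic. It assumes severities are passed as the integers 1 to 4, and the function names `checklist` and `update_times` are hypothetical, not part of any existing tooling.

```python
from datetime import datetime, timedelta
from typing import List

UPDATE_INTERVAL = timedelta(minutes=15)  # cadence stated in the steps


def checklist(sev: int) -> List[str]:
    """Return the response steps that apply to a given severity (1-4)."""
    if sev not in (1, 2, 3, 4):
        raise ValueError(f"unknown severity: {sev}")
    major = sev <= 2  # SEV-1/2 get a war room and a postmortem
    steps = ["Acknowledge the alert", "Assess severity"]
    if major:
        steps.append("Open a thread in #incidents and invite your on-call partner")
    steps.append("Mitigate: roll back, disable a feature flag, or scale up")
    steps.append("Post updates to #incidents every 15 minutes")
    steps.append("Mark the incident resolved")
    if major:
        steps.append("File a postmortem")
    return steps


def update_times(started: datetime, resolved: datetime) -> List[datetime]:
    """Times at which a status update falls due between start and resolution."""
    due = []
    t = started + UPDATE_INTERVAL
    while t < resolved:
        due.append(t)
        t += UPDATE_INTERVAL
    return due
```

For an incident that starts at 12:00 and is resolved at 12:40, `update_times` yields 12:15 and 12:30; the resolution message itself takes the place of a later update.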


Last modified on May 4, 2026
