Incident Management That Actually Works — From Alert to Post-Mortem
The 4 AM Wake-Up Call
Your phone buzzes at 4 AM. PagerDuty: "Checkout error rate > 5%." You open your laptop, half awake. No runbook. No clear escalation path. Five people join a Slack channel and all ask "what's happening?" simultaneously. Someone runs a query against production that makes things worse. An hour later, the CEO emails asking for an update.
This is incident management without a process. Here's incident management with one.
The Incident Response Framework
Severity Levels
SEV-1 (Critical): Revenue-impacting, customer-facing outage
→ Response time: 15 minutes
→ Who's involved: On-call engineer, engineering manager, comms lead
→ Examples: Checkout down, data breach, complete site outage
→ Communication: Status page update within 15 min
SEV-2 (Major): Significant degradation, partial functionality loss
→ Response time: 30 minutes
→ Who's involved: On-call engineer, relevant team lead
→ Examples: Search broken, slow checkout, payment failures > 2%
→ Communication: Status page update within 30 min
SEV-3 (Minor): Limited impact, workaround available
→ Response time: 4 hours (business hours)
→ Who's involved: On-call engineer
→ Examples: Non-critical feature broken, performance degradation < 20%
→ Communication: Internal only
SEV-4 (Informational): Cosmetic issues, non-impacting
→ Response time: Next business day
→ Who's involved: Owning team
→ Examples: Typo, minor UI glitch, non-critical log errors
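One way to keep these levels out of people's heads at 4 AM is to encode them as data that tooling can read. The following is a minimal sketch, not a prescribed implementation: the numbers come from the table above, while `SeverityPolicy`, `POLICIES`, and `status_page_due` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_time: Optional[timedelta]       # None means "next business day"
    responders: Tuple[str, ...]
    status_page_within: Optional[timedelta]  # None means no public update is required


# Values copied from the severity table; the structure itself is an assumption.
POLICIES = {
    1: SeverityPolicy("SEV-1", timedelta(minutes=15),
                      ("on-call engineer", "engineering manager", "comms lead"),
                      timedelta(minutes=15)),
    2: SeverityPolicy("SEV-2", timedelta(minutes=30),
                      ("on-call engineer", "relevant team lead"),
                      timedelta(minutes=30)),
    # SEV-3 response time counts business hours only; communication is internal.
    3: SeverityPolicy("SEV-3", timedelta(hours=4), ("on-call engineer",), None),
    # The table lists no communication requirement for SEV-4.
    4: SeverityPolicy("SEV-4", None, ("owning team",), None),
}


def status_page_due(severity: int, declared_at: datetime) -> Optional[datetime]:
    """Return the deadline for the first status page update, or None."""
    window = POLICIES[severity].status_page_within
    return None if window is None else declared_at + window
```

With this layout, a SEV-2 declared at 04:02 gets a status page deadline of 04:32, and a SEV-3 gets none.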
The Incident Commander Role
Every SEV-1 and SEV-2 incident needs one person in charge — the Incident Commander (IC):
IC Responsibilities:
→ Coordinate the response (who's doing what)
→ Make decisions when options are unclear
→ Communicate status to stakeholders
→ Decide when to escalate
→ Decide when to declare resolved
IC Does NOT:
→ Debug the issue (that's the responders' job)
→ Write code or run queries
→ Get pulled into technical details
→ Let the incident go 30 minutes without a status update
IC Rotation: Rotate weekly, separate from on-call.
Any senior+ engineer can be IC. It's a coordination skill, not a debugging skill.
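A weekly IC rotation that stays separate from on-call could be sketched as follows. This assumes a fixed roster and a Monday-to-Sunday week; the function name and the roster entries in the usage note are hypothetical.

```python
from datetime import date
from typing import Sequence


def pick_incident_commander(roster: Sequence[str], on_call: str, today: date) -> str:
    """Pick this week's IC from the roster, never the current on-call engineer."""
    eligible = [name for name in roster if name != on_call]
    if not eligible:
        raise ValueError("roster has nobody besides the on-call engineer")
    # toordinal() is 1 for 0001-01-01, which is a Monday, so this index
    # advances once per Monday-to-Sunday week and never resets at year end.
    week_index = (today.toordinal() - 1) // 7
    return eligible[week_index % len(eligible)]
```

Calling `pick_incident_commander(["Sarah", "Alex", "Priya"], on_call="Alex", today=date(2026, 5, 29))` returns the same name for every day of that week and moves to the next eligible person the following Monday. One trade-off of this approach: if the on-call engineer changes mid-week, the eligible list changes too, so a real schedule would probably be stored rather than computed.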
The Incident Timeline
Minute 0: Alert fires
→ On-call engineer acknowledges alert within 5 min
→ On-call assesses severity (SEV-1 through SEV-4)
Minute 5: Incident declared (SEV-1/2)
→ Create incident channel: #inc-2026-05-29-checkout-errors
→ Assign Incident Commander
→ Post initial assessment to channel
Minute 10: Triage
→ What changed? (deployments, config changes, infrastructure)
→ What's the blast radius? (which customers, which features)
→ Is there an obvious fix? (rollback, config revert)
Minute 15: First external communication
→ Status page: "We're investigating elevated error rates on checkout."
→ Internal stakeholder update via Slack
Minute 15-60: Investigation and mitigation
→ Parallel tracks: diagnose root cause AND mitigate impact
→ Mitigation takes priority over a root-cause fix during an active incident
→ Update status page every 15 minutes
Resolution: Incident resolved
→ Status page: "Issue resolved. Checkout is operating normally."
→ Final internal summary posted to incident channel
→ Schedule post-mortem within 48 hours
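Two small pieces of the timeline above are easy to automate: naming the incident channel and noticing when a status update is late. The sketch below assumes the channel convention shown in the example name and the 15-minute update cadence; both helper names are hypothetical.

```python
import re
from datetime import date, datetime, timedelta


def incident_channel_name(day: date, summary: str) -> str:
    """Build a channel name such as #inc-2026-05-29-checkout-errors."""
    # Lowercase the summary and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def status_update_overdue(last_update: datetime, now: datetime,
                          interval: timedelta = timedelta(minutes=15)) -> bool:
    """Return True once more than `interval` has passed since the last update."""
    return now - last_update > interval
```

For example, `incident_channel_name(date(2026, 5, 29), "Checkout errors")` yields `#inc-2026-05-29-checkout-errors`. A bot could call `status_update_overdue` on a timer and nudge the Incident Commander when it returns `True`.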
Runbooks: The Cheat Sheet for 4 AM
Pre-written runbooks for common incidents prevent fumbling:
## Runbook: Checkout Error Rate Spike
### Quick Assessment
1. Check deployment log: did we deploy in the last 2 hours?
→ If yes: rollback first, investigate later
2. Check third-party status: Stripe, Shopify, PayPal
→ If external service down: enable fallback payment method
3. Check database: connection pool, query latency, disk space
→ If DB issue: restart connection pool, scale up if needed
### Common Causes (ranked by frequency)
1. Payment provider outage (40% of incidents)
→ Mitigation: Switch to backup provider
2. Bad deployment (30%)
→ Mitigation: Rollback to last known good
3. Database overload (15%)
→ Mitigation: Kill long-running queries, scale read replicas
4. Infrastructure issue (10%)
→ Mitigation: Check AWS status, failover to backup region
5. DDoS or bot traffic (5%)
→ Mitigation: Enable rate limiting, block suspicious IPs
### Escalation
- If not resolved in 30 min: Page engineering manager
- If revenue loss > $10K: Notify VP Engineering
- If data breach suspected: Notify security team immediately
The Post-Mortem
Post-mortems exist to prevent recurrence, not to assign blame:
## Post-Mortem: Checkout Outage - May 29, 2026
**Duration**: 45 minutes (4:02 AM - 4:47 AM UTC)
**Severity**: SEV-1
**Impact**: 2,340 failed checkout attempts, estimated $47K lost revenue
**Incident Commander**: Sarah K.
### Timeline
- 4:02 AM: Alert fires (checkout error rate > 5%)
- 4:07 AM: On-call acknowledges, begins investigation
- 4:12 AM: IC assigned, incident channel created
- 4:15 AM: Status page updated
- 4:18 AM: Root cause identified (Stripe webhook endpoint returning 500)
- 4:22 AM: Hotfix deployed (bypass failing webhook handler)
- 4:35 AM: Error rate returns to normal
- 4:47 AM: Incident resolved, monitoring confirmed stable
### Root Cause
A deployment at 3:45 AM introduced a type error in the Stripe webhook
handler. The handler crashed on subscription renewal events because
the new code expected a `subscription` field that renewal events don't
include.
### Contributing Factors
1. No integration tests for webhook handler edge cases
2. Deployment at 3:45 AM with no post-deploy monitoring
3. Alert threshold too high (5%) — should have caught it at 2%
### Action Items
| # | Action | Owner | Due Date |
|---|--------|-------|----------|
| 1 | Add integration tests for all webhook event types | Alex | Jun 5 |
| 2 | Lower checkout error alert threshold to 2% | Platform | Jun 2 |
| 3 | Block deployments between 12 AM and 6 AM unless approved | DevOps | Jun 5 |
| 4 | Add post-deploy smoke test for payment flows | Alex | Jun 10 |
### Lessons Learned
- Webhook handlers need tests for every event type, not just the common ones
- Late-night deployments need automated verification, not manual monitoring
- Our alert threshold was set for "something is very broken" not "something is starting to break"
The Rules
Rule 1: Mitigate first, root cause later. Stop the bleeding.
Rule 2: One Incident Commander. Not a committee.
Rule 3: Update stakeholders every 15 minutes. Silence is worse than "still investigating."
Rule 4: Blameless post-mortems. The goal is prevention, not punishment.
Rule 5: Every SEV-1 gets a post-mortem with action items. No exceptions.
Rule 6: Action items have owners and due dates. Track completion.
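Rule 6 is the one most often dropped, so it helps to make overdue items visible. Here is a minimal sketch of tracking completion, assuming action items are kept as simple records; `ActionItem` and `overdue_items` are hypothetical names, and the sample data in the usage note is taken from the post-mortem table with the year assumed to be 2026.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    action: str
    owner: str
    due: date
    done: bool = False


def overdue_items(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]
```

Usage, with the four action items from the example post-mortem:

```python
items = [
    ActionItem("Add integration tests for all webhook event types", "Alex", date(2026, 6, 5)),
    ActionItem("Lower checkout error alert threshold to 2%", "Platform", date(2026, 6, 2)),
    ActionItem("Block late-night deployments without approval", "DevOps", date(2026, 6, 5)),
    ActionItem("Add post-deploy smoke test for payment flows", "Alex", date(2026, 6, 10)),
]
items[1].done = True  # the threshold change has shipped
late = overdue_items(items, today=date(2026, 6, 6))
print([item.owner for item in late])
```

Run on June 6, this reports the two open items that were due June 5, which is the list to bring to the next team review.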
Good incident management isn't about preventing incidents — those will always happen. It's about having a system that minimizes impact, coordinates response, and prevents the same incident from happening twice. Build the playbook before you need it.