Incident Management That Actually Works — From Alert to Post-Mortem

May 29, 2026 · ScaledByDesign

Tags: incident-management, sre, on-call, post-mortem, devops

The 4 AM Wake-Up Call

Your phone buzzes at 4 AM. PagerDuty: "Checkout error rate > 5%." You open your laptop, half awake. No runbook. No clear escalation path. Five people join a Slack channel and all ask "what's happening?" simultaneously. Someone runs a query against production that makes things worse. An hour later, the CEO emails asking for an update.

This is incident management without a process. Here's incident management with one.

The Incident Response Framework

Severity Levels

SEV-1 (Critical): Revenue-impacting, customer-facing outage
  → Response time: 15 minutes
  → Who's involved: On-call engineer, engineering manager, comms lead
  → Examples: Checkout down, data breach, complete site outage
  → Communication: Status page update within 15 min

SEV-2 (Major): Significant degradation, partial functionality loss
  → Response time: 30 minutes
  → Who's involved: On-call engineer, relevant team lead
  → Examples: Search broken, slow checkout, payment failures > 2%
  → Communication: Status page update within 30 min

SEV-3 (Minor): Limited impact, workaround available
  → Response time: 4 hours (business hours)
  → Who's involved: On-call engineer
  → Examples: Non-critical feature broken, performance degradation < 20%
  → Communication: Internal only

SEV-4 (Informational): Cosmetic issues, non-impacting
  → Response time: Next business day
  → Who's involved: Owning team
  → Examples: Typo, minor UI glitch, non-critical log errors
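
One way to keep these levels from drifting is to store them as data that tooling can read. The sketch below is illustrative only: the names `Severity`, `SeverityPolicy`, `POLICIES`, and `status_page_deadline` are assumptions, not part of any real tool. It also ignores the "business hours" qualifier on SEV-3 and treats "next business day" as no fixed window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_time: Optional[timedelta]       # None: next business day
    status_page_within: Optional[timedelta]  # None: no status page update required
    responders: Tuple[str, ...]


# Values copied from the severity table above.
POLICIES = {
    Severity.SEV1: SeverityPolicy(
        "Critical", timedelta(minutes=15), timedelta(minutes=15),
        ("on-call engineer", "engineering manager", "comms lead")),
    Severity.SEV2: SeverityPolicy(
        "Major", timedelta(minutes=30), timedelta(minutes=30),
        ("on-call engineer", "relevant team lead")),
    Severity.SEV3: SeverityPolicy(
        "Minor", timedelta(hours=4), None, ("on-call engineer",)),
    Severity.SEV4: SeverityPolicy(
        "Informational", None, None, ("owning team",)),
}


def status_page_deadline(severity: Severity, alert_time: datetime) -> Optional[datetime]:
    """Return when the first status page update is due, or None if none is required."""
    window = POLICIES[severity].status_page_within
    return alert_time + window if window is not None else None
```

For an alert at 4:02 AM, `status_page_deadline(Severity.SEV1, ...)` gives 4:17 AM, while SEV-3 and SEV-4 return `None` because they are handled internally.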

The Incident Commander Role

Every SEV-1 and SEV-2 incident needs one person in charge — the Incident Commander (IC):

IC Responsibilities:
  → Coordinate the response (who's doing what)
  → Make decisions when options are unclear
  → Communicate status to stakeholders
  → Decide when to escalate
  → Decide when to declare resolved

IC Does NOT:
  → Debug the issue (that's the responders' job)
  → Write code or run queries
  → Get pulled into technical details
  → Let the incident go 15 minutes without a status update

IC Rotation: Rotate weekly, separate from on-call.
Any senior+ engineer can be IC. It's a coordination skill, not a debugging skill.
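
A weekly rotation that stays separate from on-call can be computed rather than maintained by hand. This is a minimal sketch under assumptions: `pick_incident_commander` and the roster names in the usage note are hypothetical, weeks start on Monday, and fairness when the on-call person changes is not handled.

```python
from datetime import date
from typing import Sequence


def pick_incident_commander(roster: Sequence[str], on_call: str, today: date) -> str:
    """Pick this week's IC from the roster, skipping whoever is on call."""
    candidates = [name for name in roster if name != on_call]
    if not candidates:
        raise ValueError("roster needs at least one engineer who is not on call")
    # date(1, 1, 1) is a Monday with ordinal 1, so this index changes every Monday.
    week_index = (today.toordinal() - 1) // 7
    return candidates[week_index % len(candidates)]
```

With a roster of `["Sarah", "Alex", "Priya"]` and Alex on call, the IC alternates between Sarah and Priya each Monday and is never Alex.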

The Incident Timeline

Minute 0: Alert fires
  → On-call engineer acknowledges alert within 5 min
  → On-call assesses severity (SEV-1 through SEV-4)

Minute 5: Incident declared (SEV-1/2)
  → Create incident channel: #inc-2026-05-29-checkout-errors
  → Assign Incident Commander
  → Post initial assessment to channel

Minute 10: Triage
  → What changed? (deployments, config changes, infrastructure)
  → What's the blast radius? (which customers, which features)
  → Is there an obvious fix? (rollback, config revert)

Minute 15: First external communication
  → Status page: "We're investigating elevated error rates on checkout."
  → Internal stakeholder update via Slack

Minute 15-60: Investigation and mitigation
  → Parallel tracks: diagnose root cause AND mitigate impact
  → Mitigation takes priority over a root cause fix during an active incident
  → Update status page every 15 minutes

Resolution: Incident resolved
  → Status page: "Issue resolved. Checkout is operating normally."
  → Final internal summary posted to incident channel
  → Schedule post-mortem within 48 hours
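
Two mechanical parts of this timeline are easy to script: the channel name and the status update cadence. The sketch below assumes hypothetical helper names (`incident_channel_name`, `status_update_times`) and does not call Slack or any status page API. It only produces the strings and timestamps a bot could use.

```python
import re
from datetime import date, datetime, timedelta
from typing import List


def incident_channel_name(day: date, summary: str) -> str:
    """Build a channel name in the #inc-YYYY-MM-DD-short-summary pattern."""
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def status_update_times(alert_time: datetime, resolved_time: datetime,
                        interval: timedelta = timedelta(minutes=15)) -> List[datetime]:
    """List the status page updates due after the alert and before resolution."""
    due = alert_time + interval  # first external communication at minute 15
    updates = []
    while due < resolved_time:
        updates.append(due)
        due += interval
    return updates
```

For an incident that runs from 4:02 AM to 4:47 AM, this yields updates due at 4:17 and 4:32, with the resolution notice covering the end of the incident.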

Runbooks: The Cheat Sheet for 4 AM

Pre-written runbooks for common incidents prevent fumbling:

## Runbook: Checkout Error Rate Spike
 
### Quick Assessment
1. Check deployment log: did we deploy in the last 2 hours?
   → If yes: rollback first, investigate later
2. Check third-party status: Stripe, Shopify, PayPal
   → If external service down: enable fallback payment method
3. Check database: connection pool, query latency, disk space
   → If DB issue: restart connection pool, scale up if needed
 
### Common Causes (ranked by frequency)
1. Payment provider outage (40% of incidents)
   → Mitigation: Switch to backup provider
2. Bad deployment (30%)
   → Mitigation: Rollback to last known good
3. Database overload (15%)
   → Mitigation: Kill long-running queries, scale read replicas
4. Infrastructure issue (10%)
   → Mitigation: Check AWS status, failover to backup region
5. DDoS or bot traffic (5%)
   → Mitigation: Enable rate limiting, block suspicious IPs
 
### Escalation
- If not resolved in 30 min: Page engineering manager
- If revenue loss > $10K: Notify VP Engineering
- If data breach suspected: Notify security team immediately
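
The quick assessment and escalation rules in this runbook are simple enough to express as code, which makes them testable before 4 AM arrives. This is a sketch, not a real integration: the function names, the boolean inputs, and the fallback message when no check matches are all assumptions added for illustration.

```python
from typing import List


def quick_assessment(deployed_last_2h: bool, provider_down: bool, db_unhealthy: bool) -> str:
    """Return the first mitigation that applies, in the runbook's order."""
    if deployed_last_2h:
        return "rollback first, investigate later"
    if provider_down:
        return "enable fallback payment method"
    if db_unhealthy:
        return "restart connection pool, scale up if needed"
    # Not in the runbook: an assumed default when none of the checks match.
    return "no quick fix found, continue investigation"


def escalation_targets(minutes_elapsed: int, revenue_loss_usd: float,
                       breach_suspected: bool, resolved: bool = False) -> List[str]:
    """Return who should be notified under the runbook's escalation rules."""
    targets = []
    if breach_suspected:
        targets.append("security team")        # notify immediately
    if not resolved and minutes_elapsed >= 30:
        targets.append("engineering manager")  # not resolved in 30 min
    if revenue_loss_usd > 10_000:
        targets.append("VP Engineering")       # revenue loss > $10K
    return targets
```

An unresolved incident at minute 45 with $47K in estimated losses returns both the engineering manager and VP Engineering. A suspected breach returns the security team regardless of elapsed time.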

The Post-Mortem

Post-mortems exist to prevent recurrence, not to assign blame:

## Post-Mortem: Checkout Outage - May 29, 2026
 
**Duration**: 45 minutes (4:02 AM - 4:47 AM UTC)
**Severity**: SEV-1
**Impact**: 2,340 failed checkout attempts, estimated $47K lost revenue
**Incident Commander**: Sarah K.
 
### Timeline
- 4:02 AM: Alert fires (checkout error rate > 5%)
- 4:07 AM: On-call acknowledges, begins investigation
- 4:12 AM: IC assigned, incident channel created
- 4:15 AM: Status page updated
- 4:18 AM: Root cause identified (Stripe webhook endpoint returning 500)
- 4:22 AM: Hotfix deployed (bypass failing webhook handler)
- 4:35 AM: Error rate returns to normal
- 4:47 AM: Incident resolved, monitoring confirmed stable
 
### Root Cause
A deployment at 3:45 AM introduced a type error in the Stripe webhook
handler. The handler crashed on subscription renewal events because
the new code expected a `subscription` field that renewal events don't
include.
 
### Contributing Factors
1. No integration tests for webhook handler edge cases
2. Deployment at 3:45 AM with no post-deploy monitoring
3. Alert threshold too high (5%) — should have caught it at 2%
 
### Action Items
| # | Action | Owner | Due Date |
|---|--------|-------|----------|
| 1 | Add integration tests for all webhook event types | Alex | Jun 5 |
| 2 | Lower checkout error alert threshold to 2% | Platform | Jun 2 |
| 3 | Block deployments between 12 AM and 6 AM unless approved | DevOps | Jun 5 |
| 4 | Add post-deploy smoke test for payment flows | Alex | Jun 10 |
 
### Lessons Learned
- Webhook handlers need tests for every event type, not just the common ones
- Late-night deployments need automated verification, not manual monitoring
- Our alert threshold was set for "something is very broken," not "something is starting to break"
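
The timeline in a post-mortem like this one can be turned into response metrics and compared against the framework's targets. The sketch below uses the timestamps from the example timeline. The names `TIMELINE`, `minutes_between`, and `incident_metrics` are hypothetical, and it assumes every event falls on the same day.

```python
from datetime import datetime
from typing import Dict

# Timestamps from the example post-mortem timeline (UTC, same day).
TIMELINE = {
    "alert": "04:02",
    "acknowledged": "04:07",
    "status_page": "04:15",
    "root_cause": "04:18",
    "hotfix": "04:22",
    "recovered": "04:35",
    "resolved": "04:47",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(timeline: Dict[str, str]) -> Dict[str, int]:
    """Minutes from the alert to each later milestone."""
    alert = timeline["alert"]
    return {name: minutes_between(alert, stamp)
            for name, stamp in timeline.items() if name != "alert"}
```

For this incident the result is 5 minutes to acknowledge, 13 to the first status page update, 20 to the hotfix, 33 to recovery, and 45 to resolution. The first two fall within the 5-minute acknowledgement and 15-minute communication targets.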

The Rules

Rule 1: Mitigate first, root cause later. Stop the bleeding.
Rule 2: One Incident Commander. Not a committee.
Rule 3: Update stakeholders every 15 minutes. Silence is worse than "still investigating."
Rule 4: Blameless post-mortems. The goal is prevention, not punishment.
Rule 5: Every SEV-1 gets a post-mortem with action items. No exceptions.
Rule 6: Action items have owners and due dates. Track completion.
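
Rule 6 only works if someone checks completion. A minimal sketch of that check follows, using the action items from the example post-mortem. `ActionItem` and `overdue_items` are hypothetical names, and the year 2026 on the due dates is an assumption taken from the incident date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


# Action items from the example post-mortem; the year is assumed to be 2026.
ACTION_ITEMS = [
    ActionItem("Add integration tests for all webhook event types", "Alex", date(2026, 6, 5)),
    ActionItem("Lower checkout error alert threshold to 2%", "Platform", date(2026, 6, 2)),
    ActionItem("Block deployments between 12 AM and 6 AM unless approved", "DevOps", date(2026, 6, 5)),
    ActionItem("Add post-deploy smoke test for payment flows", "Alex", date(2026, 6, 10)),
]


def overdue_items(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open action items whose due date has passed."""
    return [item for item in items if not item.done and item.due < today]
```

Running `overdue_items(ACTION_ITEMS, date(2026, 6, 6))` in a weekly review would surface every open item due on or before June 5, each with an owner to follow up with.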

Good incident management isn't about preventing incidents — those will always happen. It's about having a system that minimizes impact, coordinates response, and prevents the same incident from happening twice. Build the playbook before you need it.
