Incident Management That Actually Works — From Alert to Post-Mortem

May 29, 2026 · ScaledByDesign
Tags: incident-management, sre, on-call, post-mortem, devops

The 4 AM Wake-Up Call

Your phone buzzes at 4 AM. PagerDuty: "Checkout error rate > 5%." You open your laptop, half awake. No runbook. No clear escalation path. Five people join a Slack channel and all ask "what's happening?" simultaneously. Someone runs a query against production that makes things worse. An hour later, the CEO emails asking for an update.

This is incident management without a process. Here's incident management with one.

The Incident Response Framework

Severity Levels

SEV-1 (Critical): Revenue-impacting, customer-facing outage
  → Response time: 15 minutes
  → Who's involved: On-call engineer, engineering manager, comms lead
  → Examples: Checkout down, data breach, complete site outage
  → Communication: Status page update within 15 min

SEV-2 (Major): Significant degradation, partial functionality loss
  → Response time: 30 minutes
  → Who's involved: On-call engineer, relevant team lead
  → Examples: Search broken, slow checkout, payment failures > 2%
  → Communication: Status page update within 30 min

SEV-3 (Minor): Limited impact, workaround available
  → Response time: 4 hours (business hours)
  → Who's involved: On-call engineer
  → Examples: Non-critical feature broken, performance degradation < 20%
  → Communication: Internal only

SEV-4 (Informational): Cosmetic issues, non-impacting
  → Response time: Next business day
  → Who's involved: Owning team
  → Examples: Typo, minor UI glitch, non-critical log errors
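One way to keep these levels from drifting between the wiki and the paging setup is to encode them as data that tooling can read. The sketch below is a minimal illustration, not a prescribed implementation: `SeverityPolicy`, `POLICIES`, and `requires_status_page` are hypothetical names, and the values are copied from the levels above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    """Hypothetical container for one severity level."""
    name: str
    response_time: Optional[timedelta]       # None means "next business day"
    responders: Tuple[str, ...]
    status_page_within: Optional[timedelta]  # None means no public update


POLICIES = {
    1: SeverityPolicy("Critical", timedelta(minutes=15),
                      ("on-call engineer", "engineering manager", "comms lead"),
                      timedelta(minutes=15)),
    2: SeverityPolicy("Major", timedelta(minutes=30),
                      ("on-call engineer", "relevant team lead"),
                      timedelta(minutes=30)),
    3: SeverityPolicy("Minor", timedelta(hours=4),
                      ("on-call engineer",), None),
    4: SeverityPolicy("Informational", None, ("owning team",), None),
}


def requires_status_page(sev: int) -> bool:
    """SEV-1 and SEV-2 have a public status page deadline; SEV-3/4 do not."""
    return POLICIES[sev].status_page_within is not None
```

Keeping the policy in one structure means a change to a response time is a single reviewed diff rather than an edit in several places.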

The Incident Commander Role

Every SEV-1 and SEV-2 incident needs one person in charge — the Incident Commander (IC):

IC Responsibilities:
  → Coordinate the response (who's doing what)
  → Make decisions when options are unclear
  → Communicate status to stakeholders
  → Decide when to escalate
  → Decide when to declare resolved

IC Does NOT:
  → Debug the issue (that's the responders' job)
  → Write code or run queries
  → Get pulled into technical details
  → Let the incident go more than 15 minutes without a status update

IC Rotation: Rotate weekly, separate from on-call.
Any senior+ engineer can be IC. It's a coordination skill, not a debugging skill.
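A weekly IC rotation that stays separate from on-call can be computed rather than maintained by hand. The following is a rough sketch under assumptions: `ic_for_week` and its roster format are hypothetical, weeks are taken to start on Monday, and a real scheduling tool would also handle vacations and swaps.

```python
from datetime import date
from typing import List, Optional

# 1970-01-05 was a Monday; used as a fixed anchor for counting whole weeks.
_EPOCH_MONDAY = date(1970, 1, 5)


def ic_for_week(roster: List[str], day: date,
                on_call: Optional[str] = None) -> str:
    """Pick the Incident Commander for the Monday-based week containing `day`.

    Rotates through `roster` one person per week. If the pick is also the
    current on-call engineer, the next person in the roster takes the IC
    slot so the two roles stay separate.
    """
    if not roster:
        raise ValueError("roster must not be empty")
    week_index = (day - _EPOCH_MONDAY).days // 7
    position = week_index % len(roster)
    pick = roster[position]
    if pick == on_call and len(roster) > 1:
        pick = roster[(position + 1) % len(roster)]
    return pick
```

Calling `ic_for_week(roster, date.today(), on_call=current_on_call)` at the start of an incident would name the IC without any discussion in the channel.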

The Incident Timeline

Minute 0: Alert fires
  → On-call engineer acknowledges alert within 5 min
  → On-call assesses severity (SEV-1 through SEV-4)

Minute 5: Incident declared (SEV-1/2)
  → Create incident channel: #inc-2026-05-29-checkout-errors
  → Assign Incident Commander
  → Post initial assessment to channel

Minute 10: Triage
  → What changed? (deployments, config changes, infrastructure)
  → What's the blast radius? (which customers, which features)
  → Is there an obvious fix? (rollback, config revert)

Minute 15: First external communication
  → Status page: "We're investigating elevated error rates on checkout."
  → Internal stakeholder update via Slack

Minute 15-60: Investigation and mitigation
  → Parallel tracks: diagnose root cause AND mitigate impact
  → Mitigation takes priority over a root-cause fix during an active incident
  → Update status page every 15 minutes

Resolution: Incident resolved
  → Status page: "Issue resolved. Checkout is operating normally."
  → Final internal summary posted to incident channel
  → Schedule post-mortem within 48 hours
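Two mechanical parts of this timeline are easy to automate: naming the incident channel and noticing when a status update is overdue. The helpers below are a small sketch; the function names are hypothetical, and the 15-minute interval is the update cadence from the timeline above.

```python
import re
from datetime import date, datetime, timedelta

STATUS_UPDATE_INTERVAL = timedelta(minutes=15)


def incident_channel_name(opened_on: date, summary: str) -> str:
    """Build a channel name such as #inc-2026-05-29-checkout-errors."""
    # Lowercase the summary and collapse anything non-alphanumeric to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#inc-{opened_on.isoformat()}-{slug}"


def status_update_overdue(last_update: datetime, now: datetime) -> bool:
    """True when more than 15 minutes have passed since the last update."""
    return now - last_update > STATUS_UPDATE_INTERVAL
```

A bot that calls `status_update_overdue` once a minute and nudges the IC removes one more thing to remember at 4 AM.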

Runbooks: The Cheat Sheet for 4 AM

Pre-written runbooks for common incidents give the on-call engineer a checklist to follow instead of improvising:

## Runbook: Checkout Error Rate Spike
 
### Quick Assessment
1. Check deployment log: did we deploy in the last 2 hours?
   → If yes: rollback first, investigate later
2. Check third-party status: Stripe, Shopify, PayPal
   → If external service down: enable fallback payment method
3. Check database: connection pool, query latency, disk space
   → If DB issue: restart connection pool, scale up if needed
 
### Common Causes (ranked by frequency)
1. Payment provider outage (40% of incidents)
   → Mitigation: Switch to backup provider
2. Bad deployment (30%)
   → Mitigation: Rollback to last known good
3. Database overload (15%)
   → Mitigation: Kill long-running queries, scale read replicas
4. Infrastructure issue (10%)
   → Mitigation: Check AWS status, failover to backup region
5. DDoS or bot traffic (5%)
   → Mitigation: Enable rate limiting, block suspicious IPs
 
### Escalation
- If not resolved in 30 min: Page engineering manager
- If revenue loss > $10K: Notify VP Engineering
- If data breach suspected: Notify security team immediately
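The escalation rules in this runbook are simple enough to express as a function that incident tooling could evaluate on every update. This is an illustrative sketch only: `IncidentState` and `escalation_targets` are hypothetical names, and the thresholds are the ones listed in the runbook.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentState:
    """Hypothetical snapshot of an open incident."""
    minutes_open: int
    resolved: bool
    estimated_revenue_loss_usd: float
    data_breach_suspected: bool


def escalation_targets(state: IncidentState) -> List[str]:
    """Return who should be notified, following the runbook's three rules."""
    targets: List[str] = []
    if state.data_breach_suspected:
        # A suspected breach goes to security immediately, whatever the age.
        targets.append("security team")
    if not state.resolved and state.minutes_open >= 30:
        targets.append("engineering manager")
    if state.estimated_revenue_loss_usd > 10_000:
        targets.append("VP Engineering")
    return targets
```

For example, an unresolved incident at minute 31 with an estimated $47K loss would return both the engineering manager and VP Engineering.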

The Post-Mortem

Post-mortems exist to prevent recurrence, not to assign blame:

## Post-Mortem: Checkout Outage - May 29, 2026
 
**Duration**: 45 minutes (4:02 AM - 4:47 AM UTC)
**Severity**: SEV-1
**Impact**: 2,340 failed checkout attempts, estimated $47K lost revenue
**Incident Commander**: Sarah K.
 
### Timeline
- 4:02 AM: Alert fires (checkout error rate > 5%)
- 4:07 AM: On-call acknowledges, begins investigation
- 4:12 AM: IC assigned, incident channel created
- 4:15 AM: Status page updated
- 4:18 AM: Root cause identified (Stripe webhook endpoint returning 500)
- 4:22 AM: Hotfix deployed (bypass failing webhook handler)
- 4:35 AM: Error rate returns to normal
- 4:47 AM: Incident resolved, monitoring confirmed stable
 
### Root Cause
A deployment at 3:45 AM introduced a type error in the Stripe webhook
handler. The handler crashed on subscription renewal events because
the new code expected a `subscription` field that renewal events don't
include.
 
### Contributing Factors
1. No integration tests for webhook handler edge cases
2. Deployment at 3:45 AM with no post-deploy monitoring
3. Alert threshold too high (5%) — should have caught it at 2%
 
### Action Items
| # | Action | Owner | Due Date |
|---|--------|-------|----------|
| 1 | Add integration tests for all webhook event types | Alex | Jun 5 |
| 2 | Lower checkout error alert threshold to 2% | Platform | Jun 2 |
| 3 | Block deployments between 12 AM and 6 AM unless approved | DevOps | Jun 5 |
| 4 | Add post-deploy smoke test for payment flows | Alex | Jun 10 |
 
### Lessons Learned
- Webhook handlers need tests for every event type, not just the common ones
- Late-night deployments need automated verification, not manual monitoring
- Our alert threshold was set for "something is very broken," not "something is starting to break"
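The class of bug in this post-mortem, a handler that assumes a field exists on every event, can be guarded against with one test case per event type. The sketch below uses simplified, hypothetical event dictionaries and event names (`subscription.created`, `subscription.renewed`); they are assumptions for illustration, not real Stripe payloads or the Stripe SDK.

```python
def handle_webhook(event: dict) -> str:
    """Return a short outcome string instead of raising on unexpected shapes."""
    event_type = event.get("type", "")
    payload = event.get("data", {})

    if event_type == "subscription.created":
        # In this sketch, creation events carry a `subscription` field.
        return f"created:{payload.get('subscription', 'unknown')}"
    if event_type == "subscription.renewed":
        # Renewal events have no `subscription` field here, so the handler
        # reads the customer ID rather than indexing a missing key.
        return f"renewed:{payload.get('customer', 'unknown')}"
    # Unknown event types are acknowledged rather than crashing the endpoint.
    return "ignored"


# One sample payload per event type the handler claims to support.
SAMPLE_EVENTS = [
    ({"type": "subscription.created",
      "data": {"subscription": "sub_1", "customer": "cus_1"}}, "created:sub_1"),
    ({"type": "subscription.renewed",
      "data": {"customer": "cus_1"}}, "renewed:cus_1"),
    ({"type": "something.else", "data": {}}, "ignored"),
]


def test_every_event_type_is_handled() -> None:
    for event, expected in SAMPLE_EVENTS:
        assert handle_webhook(event) == expected
```

Adding a new event type then means adding a row to `SAMPLE_EVENTS`, which makes the "test every event type" action item hard to skip.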

The Rules

Rule 1: Mitigate first, root cause later. Stop the bleeding.
Rule 2: One Incident Commander. Not a committee.
Rule 3: Update stakeholders every 15 minutes. Silence is worse than "still investigating."
Rule 4: Blameless post-mortems. The goal is prevention, not punishment.
Rule 5: Every SEV-1 gets a post-mortem with action items. No exceptions.
Rule 6: Action items have owners and due dates. Track completion.

Good incident management isn't about preventing incidents — those will always happen. It's about having a system that minimizes impact, coordinates response, and prevents the same incident from happening twice. Build the playbook before you need it.
