Building an On-Call Rotation That Doesn't Destroy Your Team
The Burnout Factory
A client's best senior engineer quit after 18 months. Exit interview quote: "I got paged 14 times last month. Three of those were at 3 AM for alerts that didn't need human attention. I can't do this anymore."
The company didn't have an on-call problem. They had an alerting problem disguised as an on-call rotation. Most on-call rotations fail because they page humans for things that machines should handle — or things that can wait until morning.
The Alert Severity Framework
Every alert must fit into one of four categories. If it doesn't, it shouldn't page anyone:
P1 — Page immediately (any hour):
Definition: Revenue impact RIGHT NOW, data loss, security breach
Examples: Payment processing down, database corruption, customer data exposed
Response: Wake someone up. This is what on-call is for.
Target: < 2 per month
P2 — Page during business hours only:
Definition: Degraded service, elevated error rates, approaching capacity
Examples: API latency > 2x normal, error rate > 1%, disk 85% full
Response: Needs attention today, not at 3 AM.
Target: < 5 per week
P3 — Notification (Slack, email):
Definition: Anomaly detected, might need attention
Examples: Unusual traffic pattern, background job delayed, cert expiring in 14 days
Response: Check during next working session.
Target: Monitor and triage weekly
P4 — Log only:
Definition: Informational, auto-recovers
Examples: Single request timeout, brief connection pool spike, auto-scaled up
Response: Review in weekly ops meeting.
The critical rule: if an alert classified as P1 turns out to need immediate human action less than 30% of the times it fires, it's miscategorized. Reclassify it or fix the underlying flakiness. False alarms are the #1 destroyer of on-call morale.
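The routing behavior implied by the four categories can be sketched as a small function. This is illustrative only, not any real alerting tool's API; the names (`route_alert`, `BUSINESS_HOURS`) and the 9-to-5 weekday window are assumptions for the sketch:

```python
from datetime import datetime, time

# Severity levels from the framework above.
P1, P2, P3, P4 = "P1", "P2", "P3", "P4"

# Assumed business-hours window for P2 paging (weekdays, 9am-5pm).
BUSINESS_HOURS = (time(9, 0), time(17, 0))

def route_alert(severity: str, fired_at: datetime) -> str:
    """Decide what an alert does when it fires: page, notify, or log."""
    if severity == P1:
        return "page"                      # wake someone up, any hour
    if severity == P2:
        start, end = BUSINESS_HOURS
        in_hours = start <= fired_at.time() <= end and fired_at.weekday() < 5
        return "page" if in_hours else "defer-to-morning"
    if severity == P3:
        return "notify"                    # Slack/email, next working session
    return "log"                           # P4: review in weekly ops meeting

# A P2 firing at 3 AM on a Tuesday waits until morning; a P1 pages.
print(route_alert("P2", datetime(2024, 3, 5, 3, 0)))   # defer-to-morning
print(route_alert("P1", datetime(2024, 3, 5, 3, 0)))   # page
```

The key design point is that severity alone never decides whether a human gets woken up; severity plus time of day does.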
The Rotation Structure
Team of 6 engineers:
Primary on-call: 1 person, rotates weekly (Mon 9am → Mon 9am)
Secondary on-call: 1 person (previous week's primary), backup only
Schedule:
Week 1: Alice (primary) / Frank (secondary)
Week 2: Bob (primary) / Alice (secondary)
Week 3: Carol (primary) / Bob (secondary)
...
Rules:
→ 6-week rotation cycle = each person is on-call 1 in 6 weeks
→ On-call week = lighter sprint load (50% capacity for project work)
→ After overnight page: next day is optional (work from home or take off)
→ Swap policy: any trade is allowed with 24h notice, no questions asked
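The primary/secondary pairing above can be generated mechanically, which makes it easy to publish a schedule months ahead. A minimal sketch, assuming the six-person team from the example (`make_schedule` is an illustrative name, not a real scheduling library):

```python
def make_schedule(engineers, weeks):
    """Primary rotates weekly; secondary is the previous week's primary."""
    schedule = []
    n = len(engineers)
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week - 1) % n]   # last week's primary backs up
        schedule.append((week + 1, primary, secondary))
    return schedule

team = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]
for week, primary, secondary in make_schedule(team, 3):
    print(f"Week {week}: {primary} (primary) / {secondary} (secondary)")
# Week 1: Alice (primary) / Frank (secondary)
# Week 2: Bob (primary) / Alice (secondary)
# Week 3: Carol (primary) / Bob (secondary)
```

Because the secondary is always the previous primary, whoever backs you up already has fresh context on the week's incidents.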
The Runbook System
Every P1 and P2 alert needs a runbook. Not a novel — a decision tree:
## Alert: Payment Processing Error Rate > 1%
### Quick Diagnosis (< 5 minutes)
1. Check Stripe status page: https://status.stripe.com
→ If Stripe is down: Post in #incidents, nothing we can do, monitor
2. Check our payment service health: `curl https://api.example.com/health/payments`
→ If unhealthy: Restart payment service pods (see step 3)
3. Check recent deployments: `kubectl rollout history deployment/payment-service`
→ If deployed in last 2 hours: Rollback (see step 4)
### Restart Payment Service
`kubectl rollout restart deployment/payment-service -n production`
Wait 2 minutes, then verify: `curl https://api.example.com/health/payments`
### Rollback Last Deployment
`kubectl rollout undo deployment/payment-service -n production`
Verify rollback: `kubectl rollout status deployment/payment-service -n production`
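The quick-diagnosis tree above is simple enough to encode as a pre-check script the on-call engineer can run before touching anything. This is a hedged sketch: the check functions are injected so the logic is testable, and in practice each would wrap the `curl`/`kubectl` commands from the runbook. All names here are illustrative:

```python
def diagnose(stripe_up, payments_healthy, deployed_recently):
    """Walk the runbook's decision tree and return the suggested next action."""
    if not stripe_up():
        return "post-in-#incidents-and-monitor"   # upstream outage, nothing we can do
    if not payments_healthy():
        return "restart-payment-service"          # runbook: Restart Payment Service
    if deployed_recently(hours=2):
        return "rollback-last-deployment"         # runbook: Rollback Last Deployment
    return "escalate"                             # no quick check matched

# Example with stubbed checks: Stripe is up, our payment service is unhealthy.
action = diagnose(
    stripe_up=lambda: True,
    payments_healthy=lambda: False,
    deployed_recently=lambda hours: False,
)
print(action)   # restart-payment-service
```

Even if you never automate the remediation itself, automating the diagnosis keeps a 3 AM responder from skipping a step.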
### Escalation
If none of the above resolves the issue within 15 minutes:
→ Page the payment team tech lead: @jane-smith
→ Post in #incidents with timeline of actions taken
Compensation and Fairness
On-call without compensation is exploitation. Here are the models that work:
Option A: Extra PTO
→ 1 day of PTO per on-call week
→ Additional half-day for any overnight page
→ Simple, works for teams that value flexibility
Option B: Stipend
→ $500-1000 per on-call week (varies by market)
→ Additional $200 per overnight page
→ Works for teams that prefer direct compensation
Option C: Reduced Sprint Load
→ On-call engineer takes 50% sprint commitment
→ Uses remaining time for on-call improvements (fix alerts, write runbooks)
→ This is the best option — it creates a virtuous cycle
Option C is our recommendation because it aligns incentives: the on-call engineer is incentivized to reduce future pages because they'll be on-call again in 6 weeks, and they have dedicated time to do it.
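For teams that go with Option B, the payout is simple arithmetic. A sketch using the article's example figures (the $750 base splits the $500-1000 range; the function name is illustrative):

```python
def oncall_stipend(weeks, overnight_pages, base_per_week=750, per_overnight=200):
    """Weekly stipend plus a bump for each overnight page (Option B figures)."""
    return weeks * base_per_week + overnight_pages * per_overnight

# One on-call week with two overnight pages:
print(oncall_stipend(weeks=1, overnight_pages=2))   # 1150
```

The per-page bump matters: it makes bad weeks cost the company money, which keeps leadership invested in fixing noisy alerts.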
Measuring On-Call Health
Track these metrics monthly:
Alert Metrics:
Total pages per week: Target < 5
P1 pages per month: Target < 2
False positive rate: Target < 20%
Mean time to acknowledge: Target < 5 min
Mean time to resolve: Target < 30 min
People Metrics:
Pages per person per rotation: Target < 3
Overnight pages per person per month: Target < 1
On-call satisfaction (survey): Target > 7/10
On-call opt-out requests: Should be zero
System Metrics:
Alerts auto-resolved: Target > 60%
Runbook coverage: Target > 90% of P1/P2 alerts
Repeat incidents: Should trend to zero
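A monthly review of these numbers can be a one-function script. A minimal sketch that flags metrics exceeding their targets; the metric names and thresholds mirror the tables above, but the function and dictionary shapes are assumptions:

```python
# Targets from the alert and people metrics above (illustrative subset).
TARGETS = {
    "pages_per_week": 5,
    "p1_pages_per_month": 2,
    "false_positive_rate": 0.20,
    "overnight_pages_per_person": 1,
}

def oncall_health_report(metrics):
    """Return only the metrics that exceed their targets."""
    return {
        name: value
        for name, value in metrics.items()
        if name in TARGETS and value > TARGETS[name]
    }

month = {
    "pages_per_week": 7,
    "p1_pages_per_month": 1,
    "false_positive_rate": 0.35,
    "overnight_pages_per_person": 0,
}
print(oncall_health_report(month))
# {'pages_per_week': 7, 'false_positive_rate': 0.35}
```

An empty report means a healthy month; anything in it goes on the weekly ops meeting agenda.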
If overnight pages are above 1 per month per person, you don't need a better on-call rotation. You need better systems. Every overnight page should trigger a blameless postmortem that asks: "How do we prevent this from paging a human at 3 AM ever again?"
On-call is a necessary part of running production systems. But it shouldn't be something your team dreads. Fix the alerts first, write the runbooks, compensate fairly, and keep the rotation light enough that engineers can sustain it long-term.