Building an On-Call Rotation That Doesn't Destroy Your Team
The Burnout Factory
A client's best senior engineer quit after 18 months. Exit interview quote: "I got paged 14 times last month. Three of those were at 3 AM for alerts that didn't need human attention. I can't do this anymore."
The company didn't have an on-call problem. They had an alerting problem disguised as an on-call rotation. Most on-call rotations fail because they page humans for things that machines should handle — or things that can wait until morning.
The Alert Severity Framework
Every alert must fit into one of four categories. If it doesn't, it shouldn't page anyone:
P1 — Page immediately (any hour):
Definition: Revenue impact RIGHT NOW, data loss, security breach
Examples: Payment processing down, database corruption, customer data exposed
Response: Wake someone up. This is what on-call is for.
Target: < 2 per month
P2 — Page during business hours only:
Definition: Degraded service, elevated error rates, approaching capacity
Examples: API latency > 2x normal, error rate > 1%, disk 85% full
Response: Needs attention today, not at 3 AM.
Target: < 5 per week
P3 — Notification (Slack, email):
Definition: Anomaly detected, might need attention
Examples: Unusual traffic pattern, background job delayed, cert expiring in 14 days
Response: Check during next working session.
Target: Monitor and triage weekly
P4 — Log only:
Definition: Informational, auto-recovers
Examples: Single request timeout, brief connection pool spike, auto-scaled up
Response: Review in weekly ops meeting.
The critical rule: if an alert classified as P1 turns out to need immediate human action less than 30% of the times it fires, it's miscategorized. Reclassify it or fix the underlying flakiness. False alarms are the #1 destroyer of on-call morale.
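The routing behavior implied by the four categories can be sketched as a small function. This is illustrative only, not any real alerting tool's API; the names (`route_alert`, `BUSINESS_HOURS`) and the 9-to-5 weekday window are assumptions for the sketch:

```python
from datetime import datetime, time

# Severity levels from the framework above.
P1, P2, P3, P4 = "P1", "P2", "P3", "P4"

# Assumed business-hours window for P2 paging (weekdays, 9am-5pm).
BUSINESS_HOURS = (time(9, 0), time(17, 0))

def route_alert(severity: str, fired_at: datetime) -> str:
    """Decide what an alert does when it fires: page, notify, or log."""
    if severity == P1:
        return "page"                      # wake someone up, any hour
    if severity == P2:
        start, end = BUSINESS_HOURS
        in_hours = start <= fired_at.time() <= end and fired_at.weekday() < 5
        return "page" if in_hours else "defer-to-morning"
    if severity == P3:
        return "notify"                    # Slack/email, next working session
    return "log"                           # P4: review in weekly ops meeting

# A P2 firing at 3 AM on a Tuesday waits until morning; a P1 pages.
print(route_alert("P2", datetime(2024, 3, 5, 3, 0)))   # defer-to-morning
print(route_alert("P1", datetime(2024, 3, 5, 3, 0)))   # page
```

The key design point is that severity alone never decides whether a human gets woken up; severity plus time of day does.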
The Rotation Structure
Team of 6 engineers:
Primary on-call: 1 person, rotates weekly (Mon 9am → Mon 9am)
Secondary on-call: 1 person (previous week's primary), backup only
Schedule:
Week 1: Alice (primary) / Frank (secondary)
Week 2: Bob (primary) / Alice (secondary)
Week 3: Carol (primary) / Bob (secondary)
...
Rules:
→ 6-week rotation cycle = each person is on-call 1 in 6 weeks
→ On-call week = lighter sprint load (50% capacity for project work)
→ After overnight page: next day is optional (work from home or take off)
→ Swap policy: any trade is allowed with 24h notice, no questions asked
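The primary/secondary pairing above can be generated mechanically, which makes it easy to publish a schedule months ahead. A minimal sketch, assuming the six-person team from the example (`make_schedule` is an illustrative name, not a real scheduling library):

```python
def make_schedule(engineers, weeks):
    """Primary rotates weekly; secondary is the previous week's primary."""
    schedule = []
    n = len(engineers)
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week - 1) % n]   # last week's primary backs up
        schedule.append((week + 1, primary, secondary))
    return schedule

team = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]
for week, primary, secondary in make_schedule(team, 3):
    print(f"Week {week}: {primary} (primary) / {secondary} (secondary)")
# Week 1: Alice (primary) / Frank (secondary)
# Week 2: Bob (primary) / Alice (secondary)
# Week 3: Carol (primary) / Bob (secondary)
```

Because the secondary is always the previous primary, whoever backs you up already has fresh context on the week's incidents.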
The Runbook System
Every P1 and P2 alert needs a runbook. Not a novel — a decision tree:
## Alert: Payment Processing Error Rate > 1%
### Quick Diagnosis (< 5 minutes)
1. Check Stripe status page: https://status.stripe.com
→ If Stripe is down: Post in #incidents, nothing we can do, monitor
2. Check our payment service health: `curl https://api.example.com/health/payments`
→ If unhealthy: Restart payment service pods (see step 3)
3. Check recent deployments: `kubectl rollout history deployment/payment-service`
→ If deployed in last 2 hours: Rollback (see step 4)
### Restart Payment Service
`kubectl rollout restart deployment/payment-service -n production`
Wait 2 minutes, then verify: `curl https://api.example.com/health/payments`
### Rollback Last Deployment
`kubectl rollout undo deployment/payment-service -n production`
Verify rollback: `kubectl rollout status deployment/payment-service -n production`
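The quick-diagnosis tree above is simple enough to encode as a pre-check script the on-call engineer can run before touching anything. This is a hedged sketch: the check functions are injected so the logic is testable, and in practice each would wrap the `curl`/`kubectl` commands from the runbook. All names here are illustrative:

```python
def diagnose(stripe_up, payments_healthy, deployed_recently):
    """Walk the runbook's decision tree and return the suggested next action."""
    if not stripe_up():
        return "post-in-#incidents-and-monitor"   # upstream outage, nothing we can do
    if not payments_healthy():
        return "restart-payment-service"          # runbook: Restart Payment Service
    if deployed_recently(hours=2):
        return "rollback-last-deployment"         # runbook: Rollback Last Deployment
    return "escalate"                             # no quick check matched

# Example with stubbed checks: Stripe is up, our payment service is unhealthy.
action = diagnose(
    stripe_up=lambda: True,
    payments_healthy=lambda: False,
    deployed_recently=lambda hours: False,
)
print(action)   # restart-payment-service
```

Even if you never automate the remediation itself, automating the diagnosis keeps a 3 AM responder from skipping a step.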
### Escalation
If none of the above resolves the issue within 15 minutes:
→ Page the payment team tech lead: @jane-smith
→ Post in #incidents with timeline of actions taken
Compensation and Fairness
On-call without compensation is exploitation. Here are the models that work:
Option A: Extra PTO
→ 1 day of PTO per on-call week
→ Additional half-day for any overnight page
→ Simple, works for teams that value flexibility
Option B: Stipend
→ $500-1000 per on-call week (varies by market)
→ Additional $200 per overnight page
→ Works for teams that prefer direct compensation
Option C: Reduced Sprint Load
→ On-call engineer takes 50% sprint commitment
→ Uses remaining time for on-call improvements (fix alerts, write runbooks)
→ This is the best option — it creates a virtuous cycle
Option C is our recommendation because it aligns incentives: the on-call engineer is incentivized to reduce future pages because they'll be on-call again in 6 weeks, and they have dedicated time to do it.
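For teams that go with Option B, the payout is simple arithmetic. A sketch using the article's example figures (the $750 base splits the $500-1000 range; the function name is illustrative):

```python
def oncall_stipend(weeks, overnight_pages, base_per_week=750, per_overnight=200):
    """Weekly stipend plus a bump for each overnight page (Option B figures)."""
    return weeks * base_per_week + overnight_pages * per_overnight

# One on-call week with two overnight pages:
print(oncall_stipend(weeks=1, overnight_pages=2))   # 1150
```

The per-page bump matters: it makes bad weeks cost the company money, which keeps leadership invested in fixing noisy alerts.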
Measuring On-Call Health
Track these metrics monthly:
Alert Metrics:
Total pages per week: Target < 5
P1 pages per month: Target < 2
False positive rate: Target < 20%
Mean time to acknowledge: Target < 5 min
Mean time to resolve: Target < 30 min
People Metrics:
Pages per person per rotation: Target < 3
Overnight pages per person per month: Target < 1
On-call satisfaction (survey): Target > 7/10
On-call opt-out requests: Should be zero
System Metrics:
Alerts auto-resolved: Target > 60%
Runbook coverage: Target > 90% of P1/P2 alerts
Repeat incidents: Should trend to zero
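A monthly review of these numbers can be a one-function script. A minimal sketch that flags metrics exceeding their targets; the metric names and thresholds mirror the tables above, but the function and dictionary shapes are assumptions:

```python
# Targets from the alert and people metrics above (illustrative subset).
TARGETS = {
    "pages_per_week": 5,
    "p1_pages_per_month": 2,
    "false_positive_rate": 0.20,
    "overnight_pages_per_person": 1,
}

def oncall_health_report(metrics):
    """Return only the metrics that exceed their targets."""
    return {
        name: value
        for name, value in metrics.items()
        if name in TARGETS and value > TARGETS[name]
    }

month = {
    "pages_per_week": 7,
    "p1_pages_per_month": 1,
    "false_positive_rate": 0.35,
    "overnight_pages_per_person": 0,
}
print(oncall_health_report(month))
# {'pages_per_week': 7, 'false_positive_rate': 0.35}
```

An empty report means a healthy month; anything in it goes on the weekly ops meeting agenda.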
If overnight pages are above 1 per month per person, you don't need a better on-call rotation. You need better systems. Every overnight page should trigger a blameless postmortem that asks: "How do we prevent this from paging a human at 3 AM ever again?"
On-call is a necessary part of running production systems. But it shouldn't be something your team dreads. Fix the alerts first, write the runbooks, compensate fairly, and keep the rotation light enough that engineers can sustain it long-term.