The On-Call Rotation That Doesn't Burn Out Your Team

December 27, 2025 · ScaledByDesign
on-call · devops · culture · teams

On-Call Is Broken at Most Companies

The standard on-call setup: one engineer carries a pager for a week, gets woken up 3 times, spends the next week exhausted, and quietly starts interviewing. Multiply by 12 months and you have an on-call rotation that's your biggest source of attrition.

On-call doesn't have to be this way. The goal isn't to endure incidents — it's to eliminate them.

The Sustainable On-Call Framework

Structure: The Two-Tier Model

Tier 1: First Responder (Primary On-Call)
  → Responds to all pages within 15 minutes
  → Follows runbook for known issues
  → Escalates to Tier 2 if not resolved in 30 minutes
  → Rotates weekly

Tier 2: Subject Matter Expert (Secondary On-Call)
  → Only engaged if Tier 1 can't resolve
  → Deep expertise in specific systems
  → Available within 30 minutes
  → Rotates monthly (less disruptive)

Why two tiers: Most pages (70-80%) are known issues with documented fixes. Tier 1 handles these with runbooks. Only novel problems reach the expert — which means experts are rarely woken up.
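
For teams that keep their paging policy in code, here is a minimal sketch of the two-tier timing. The `Tier` model and `current_escalation` helper are illustrative assumptions, not any paging tool's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tier:
    name: str
    ack_deadline_min: int   # must respond within this many minutes
    rotation: str           # how often the rotation turns over

# Timings and rotation cadence mirror the two-tier model above.
TIER_1 = Tier("First Responder", ack_deadline_min=15, rotation="weekly")
TIER_2 = Tier("Subject Matter Expert", ack_deadline_min=30, rotation="monthly")

def current_escalation(minutes_since_page: int, resolved: bool) -> Optional[Tier]:
    """Escalate to Tier 2 only if Tier 1 has not resolved the page within 30 minutes."""
    if resolved:
        return None
    return TIER_1 if minutes_since_page < 30 else TIER_2
```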

Rotation Rules

1. Minimum 4 people in the rotation
   → Anything less = on-call every 3 weeks = burnout

2. No back-to-back weeks
   → Minimum 3 weeks between on-call shifts

3. Voluntary swap system
   → Easy to trade shifts (Slack bot, PagerDuty)
   → No questions asked

4. Protected recovery time
   → If paged after midnight: late start next day
   → If paged 3+ times in one night: next day off
   → Non-negotiable

5. On-call compensation
   → Stipend for being on-call ($200-500/week)
   → Additional per-page compensation for after-hours
   → Or equivalent comp time
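
A quick way to sanity-check a roster against rules 1 and 2. The `build_rotation` helper below is an illustrative sketch, not a feature of any scheduling tool:

```python
from datetime import date, timedelta

MIN_ROTATION_SIZE = 4   # rule 1: minimum 4 people
MIN_WEEKS_BETWEEN = 3   # rule 2: minimum 3 weeks between shifts

def build_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Round-robin weekly primary schedule; rejects rosters too small to
    guarantee MIN_WEEKS_BETWEEN weeks off between shifts."""
    if len(engineers) < MIN_ROTATION_SIZE or len(engineers) - 1 < MIN_WEEKS_BETWEEN:
        raise ValueError(f"Need at least {MIN_ROTATION_SIZE} engineers, got {len(engineers)}")
    return [(start + timedelta(weeks=w), engineers[w % len(engineers)]) for w in range(weeks)]

# With 4 people, each engineer gets 3 full weeks off between shifts.
for week_start, engineer in build_rotation(["ana", "ben", "chen", "dee"], date(2026, 1, 5), 6):
    print(week_start, engineer)
```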

The Alert Budget

This is the single most important concept. Set a maximum number of acceptable pages per week:

Alert Budget: 5 pages per on-call week

Week 1: 3 pages (under budget ✅)
Week 2: 7 pages (over budget ❌)
  → Mandatory incident review
  → Team dedicates 20% of next sprint to reducing alerts
  
Week 3: 4 pages (under budget ✅)
Week 4: 8 pages (over budget ❌)
  → Engineering leadership involved
  → Root cause analysis for every page
  → Systemic fix required before next sprint work

The rule: If on-call consistently exceeds the alert budget, it becomes the team's #1 priority — above features, above tech debt, above everything. The alert budget forces the team to fix the systems, not just endure them.
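
A sketch of the budget check itself, assuming you can export page timestamps from your paging tool; the sample timestamps below are made up:

```python
from collections import Counter
from datetime import datetime

ALERT_BUDGET = 5  # maximum acceptable pages per on-call week

def pages_per_week(page_times: list[datetime]) -> dict[str, int]:
    """Count pages per ISO week, keyed like '2026-W03'."""
    iso = (ts.isocalendar() for ts in page_times)
    return dict(Counter(f"{c.year}-W{c.week:02d}" for c in iso))

def over_budget_weeks(page_times: list[datetime], budget: int = ALERT_BUDGET) -> list[str]:
    """Weeks that should trigger a mandatory incident review."""
    return [week for week, count in pages_per_week(page_times).items() if count > budget]

# Example with made-up timestamps: 8 pages in the same week trips the budget.
sample = [datetime(2026, 1, 12, 3, 0), datetime(2026, 1, 14, 2, 30)] * 4
print(over_budget_weeks(sample))  # ['2026-W03']
```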

Reducing Incident Volume

The Post-Incident Review (Blameless)

Every page gets a 15-minute review:

## Incident: [Title]
**Date:** [When]  **Duration:** [How long]  **Severity:** [1-3]
 
### What happened
[2-3 sentences]
 
### Timeline
- HH:MM Alert fired
- HH:MM On-call acknowledged
- HH:MM Root cause identified
- HH:MM Fix deployed
- HH:MM Verified resolved
 
### Root cause
[Technical explanation]
 
### Action items
1. [Prevent recurrence] — Owner: [Name] — Due: [Date]
2. [Improve detection] — Owner: [Name] — Due: [Date]
 
### Classification
- [ ] Known issue (runbook exists but didn't work)
- [ ] New issue (needs new runbook)
- [ ] False alarm (alert needs tuning)
- [ ] Customer-caused (consider rate limiting)

The Incident Categories

Track where your pages come from:

| Category | Target % | If Over Target |
| --- | --- | --- |
| False alarms | < 10% | Fix alert thresholds |
| Known issues with runbooks | < 30% | Automate the fix |
| Infrastructure (DB, cache, DNS) | < 20% | Invest in reliability |
| Application bugs | < 20% | Improve testing |
| External dependencies | < 20% | Add fallbacks |
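
A sketch for tracking category share, assuming each post-incident review records a category string; the targets mirror the table above:

```python
from collections import Counter

# Target share of total pages per category, from the table above.
CATEGORY_TARGETS = {
    "false_alarm": 0.10,
    "known_issue_with_runbook": 0.30,
    "infrastructure": 0.20,
    "application_bug": 0.20,
    "external_dependency": 0.20,
}

def categories_over_target(categories: list[str]) -> dict[str, float]:
    """Return each category whose share of pages exceeds its target."""
    counts = Counter(categories)
    total = len(categories)
    return {
        cat: counts.get(cat, 0) / total
        for cat, target in CATEGORY_TARGETS.items()
        if counts.get(cat, 0) / total > target
    }

# Example: 3 false alarms out of 10 pages (30%) is well over the 10% target.
pages = ["false_alarm"] * 3 + ["known_issue_with_runbook"] * 2 + ["application_bug"] * 5
print(categories_over_target(pages))  # {'false_alarm': 0.3, 'application_bug': 0.5}
```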

The Automation Ladder

For every recurring incident, climb this ladder:

Level 0: Page a human who follows a runbook
Level 1: Page a human + auto-collect diagnostics
Level 2: Auto-remediate + notify human after
Level 3: Auto-remediate + no notification (logged only)
Level 4: Prevent the condition entirely

Goal: Move every incident type up at least one level per quarter
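
Here is a sketch of what Level 2 can look like for one recurring incident type (a backed-up queue). `restart_worker`, `notify_oncall`, and `page_primary` are placeholder stubs for whatever your orchestrator and chat tooling provide:

```python
def restart_worker(service: str) -> bool:
    # Placeholder: call your orchestrator here (e.g. roll the deployment) and return success.
    print(f"[stub] restarting worker for {service}")
    return True

def notify_oncall(message: str) -> None:
    # Placeholder: post to the on-call channel instead of paging anyone.
    print(f"[stub] notify: {message}")

def page_primary(message: str) -> None:
    # Placeholder: fall back to Level 0 and page the primary on-call.
    print(f"[stub] PAGE: {message}")

def handle_backed_up_queue(service: str, queue_depth: int, threshold: int = 10_000) -> None:
    """Level 2: auto-remediate a known incident, then notify a human after the fact."""
    if queue_depth < threshold:
        return  # Level 4 is making sure this condition never occurs at all
    if restart_worker(service):
        notify_oncall(f"{service}: queue depth {queue_depth}, worker restarted automatically")
    else:
        page_primary(f"{service}: auto-restart failed, queue depth {queue_depth}")

handle_backed_up_queue("billing-worker", queue_depth=25_000)
```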

The On-Call Handoff

The handoff between on-call shifts should be a 15-minute meeting:

Outgoing on-call shares:
  1. How many pages this week? (vs alert budget)
  2. Any ongoing issues to watch?
  3. Any new runbooks created or updated?
  4. Any alerts that need tuning?
  5. Anything unusual about the current system state?

Incoming on-call confirms:
  1. PagerDuty/OpsGenie is configured correctly
  2. VPN/access to all systems is working
  3. Runbook index is bookmarked
  4. Escalation contacts are current
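
If the outgoing answers live somewhere queryable, a small sketch can render them into the handoff message; the `HandoffNotes` fields here are assumptions, not any tool's schema:

```python
from dataclasses import dataclass, field

ALERT_BUDGET = 5

@dataclass
class HandoffNotes:
    pages_this_week: int
    ongoing_issues: list[str] = field(default_factory=list)
    runbook_changes: list[str] = field(default_factory=list)
    alerts_to_tune: list[str] = field(default_factory=list)

def handoff_summary(notes: HandoffNotes) -> str:
    """Render the outgoing on-call's answers for the 15-minute handoff meeting."""
    budget_status = "under" if notes.pages_this_week <= ALERT_BUDGET else "OVER"
    return "\n".join([
        f"Pages this week: {notes.pages_this_week} ({budget_status} the budget of {ALERT_BUDGET})",
        "Ongoing issues: " + (", ".join(notes.ongoing_issues) or "none"),
        "Runbooks created/updated: " + (", ".join(notes.runbook_changes) or "none"),
        "Alerts needing tuning: " + (", ".join(notes.alerts_to_tune) or "none"),
    ])

print(handoff_summary(HandoffNotes(pages_this_week=7, ongoing_issues=["elevated 5xx on checkout"])))
```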

The Metrics That Matter

Track monthly and share with the team:

| Metric | Healthy | Needs Work | Critical |
| --- | --- | --- | --- |
| Pages per week | < 5 | 5-10 | > 10 |
| After-hours pages | < 2/week | 2-4/week | > 4/week |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Mean time to resolve | < 30 min | 30-60 min | > 60 min |
| False alarm rate | < 10% | 10-25% | > 25% |
| Repeat incidents | < 20% | 20-40% | > 40% |
| On-call satisfaction | > 7/10 | 5-7/10 | < 5/10 |
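
A sketch of how the time-based metrics can be computed from exported incident records; the `Incident` fields are assumptions about what your paging tool provides:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    fired_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime
    false_alarm: bool = False

def monthly_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Mean time to acknowledge / resolve (in minutes) and the false alarm rate."""
    mtta = mean((i.acknowledged_at - i.fired_at).total_seconds() / 60 for i in incidents)
    mttr = mean((i.resolved_at - i.fired_at).total_seconds() / 60 for i in incidents)
    false_rate = sum(i.false_alarm for i in incidents) / len(incidents)
    return {
        "mtta_min": round(mtta, 1),
        "mttr_min": round(mttr, 1),
        "false_alarm_rate": round(false_rate, 2),
    }
```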

The Cultural Shift

On-call quality is a leading indicator of engineering culture:

  • Good culture: On-call is a shared responsibility. Incidents drive improvements. The team celebrates reducing alert volume.
  • Bad culture: On-call is a punishment. The same incidents recur monthly. Senior engineers find ways to avoid the rotation.

The difference isn't tooling. It's whether leadership treats on-call as something to endure or something to improve.

Make on-call better, and you'll make the product better, the team happier, and retention easier. It's one of the highest-leverage investments an engineering organization can make.
