On-Call Fundamentals
Building sustainable, fair, and effective on-call rotations for your team.
What is On-Call?
On-call is a system where team members take turns being the first responder to production incidents outside of normal working hours. The on-call engineer is responsible for acknowledging alerts, triaging issues, and either resolving them or escalating appropriately.
Done well, on-call provides reliable coverage while distributing the burden fairly. Done poorly, it leads to burnout, attrition, and degraded incident response.
Designing On-Call Rotations
Key Principles
- Fairness: Everyone participates (unless there's a good reason not to)
- Predictability: People know when they're on-call well in advance
- Flexibility: Easy to swap shifts or cover for teammates
- Sustainability: On-call burden doesn't exceed reasonable limits
Weekly Rotation
One person is on-call for an entire week, then hands off to the next person.
Pros: Simple, clear ownership, fewer handoffs
Cons: A full week can be exhausting, single point of failure
Best for: Smaller teams (4-8 people), lower alert volume
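A weekly rotation can be generated with a simple round-robin. The sketch below is illustrative only: the team names, the start date, and the `weekly_rotation` helper are hypothetical, and a real schedule would also need to handle swaps and time off.

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Return (week_start, engineer) pairs, cycling through engineers in order."""
    schedule = []
    for week in range(weeks):
        week_start = start + timedelta(weeks=week)
        schedule.append((week_start, engineers[week % len(engineers)]))
    return schedule

# Hypothetical four-person team; the first shift starts on a Monday.
team = ["alice", "bob", "carol", "dan"]
schedule = weekly_rotation(team, date(2024, 1, 1), 6)
for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)
```

With four people, each engineer is on-call one week in four, and the fifth week wraps back to the first person.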
Follow-the-Sun
Responsibility follows daylight hours across time zones, so no one works overnight.
Pros: No overnight pages, better work-life balance
Cons: Requires a global team, more handoffs
Best for: Distributed teams across time zones
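As a rough illustration of a follow-the-sun handoff, the sketch below maps a UTC hour to the region that holds the pager. The three regions and their 8-hour windows are assumptions chosen so that each window falls roughly in local daytime for offices near UTC+8, UTC+0, and UTC-8; a real team would pick windows that match its own locations.

```python
# Hypothetical handoff table: (region, start_hour_utc, end_hour_utc), end exclusive.
HANDOFFS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]

def region_on_call(hour_utc):
    """Return the region responsible for pages at the given UTC hour (0-23)."""
    for region, start, end in HANDOFFS:
        if start <= hour_utc < end:
            return region
    raise ValueError(f"hour_utc must be in 0-23, got {hour_utc}")

for hour in (3, 8, 23):
    print(hour, region_on_call(hour))
```

Because the windows tile the full 24 hours without overlap, every hour has exactly one owner, and the three handoffs per day are the cost noted above.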
Primary/Secondary
Two people are on-call: the primary receives all alerts, and the secondary is the backup if the primary doesn't respond.
Pros: Redundancy, shared load during major incidents
Cons: More people tied up, secondary might not engage
Best for: High-stakes systems, SEV0-prone environments
Shift-Based
Fixed shifts (e.g., 8am-4pm, 4pm-12am, 12am-8am) with different people covering each.
Pros: Predictable hours, better for high-volume environments
Cons: Requires more people, more shift handoffs
Best for: 24/7 operations, larger teams
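The staffing trade-offs in these patterns ("more people tied up", "requires more people") can be made concrete with rough arithmetic. The sketch below assumes 24/7 coverage split evenly across the team; `avg_on_call_hours_per_week` and its parameters are hypothetical names used only for illustration.

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours of coverage needed for 24/7 on-call

def avg_on_call_hours_per_week(team_size, people_per_slot=1):
    """Average hours each person is on-call per week, assuming an even split."""
    if team_size < people_per_slot:
        raise ValueError("team is too small to fill every slot")
    return HOURS_PER_WEEK * people_per_slot / team_size

print(avg_on_call_hours_per_week(6))                     # one person on-call at a time
print(avg_on_call_hours_per_week(6, people_per_slot=2))  # primary plus secondary
```

For a six-person team this gives 28 hours per person per week with a single on-call and 56 hours when a secondary is added, which is one way to judge whether the rotation has enough people.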
Escalation Policies
An escalation policy defines what happens when the primary on-call doesn't respond. Good escalation policies ensure incidents get handled even when things go wrong.
Example Escalation Policy
- Primary On-Call: Alert sent immediately. 5-minute acknowledgment window.
- Secondary On-Call: If no ack after 5 min, alert secondary. Another 5-minute window.
- Team Lead / Manager: If still no ack after 10 min total, escalate to leadership.
- All-Hands Broadcast: For SEV0, page the entire team if no response within 15 min.
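The example policy can be expressed as data plus a small lookup, which is roughly how paging tools model escalation. This is a minimal sketch under assumptions: the `EscalationStep` type, the `paged_so_far` helper, and the "SEV1" label are hypothetical, and the timings are the ones from the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationStep:
    target: str
    after_minutes: int       # minutes without acknowledgment before this step fires
    sev0_only: bool = False  # True if the step applies only to SEV0 incidents

POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 5),
    EscalationStep("team lead / manager", 10),
    EscalationStep("all-hands broadcast", 15, sev0_only=True),
]

def paged_so_far(minutes_unacked, severity):
    """Return every target that should have been paged by now, in order."""
    return [
        step.target
        for step in POLICY
        if minutes_unacked >= step.after_minutes
        and (severity == "SEV0" or not step.sev0_only)
    ]

print(paged_so_far(7, "SEV1"))
print(paged_so_far(15, "SEV0"))
```

Keeping the policy as a list makes the timings easy to review and change without touching the lookup logic.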
On-Call Compensation
On-call is real work that happens outside of normal hours. Fair compensation acknowledges this.
Common Compensation Models
- Flat stipend: Fixed amount per on-call shift (e.g., $500/week)
- Per-page payment: Additional compensation per incident handled
- Time-off in lieu: Comp time for after-hours work
- Higher base salary: On-call expectation built into compensation
- Combination: Stipend + per-incident bonus
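As a worked example of the combination model, the sketch below adds a per-incident bonus to a flat weekly stipend. The $500 stipend comes from the example above; the $50 bonus and the `weekly_on_call_pay` helper are hypothetical values chosen only to show the arithmetic.

```python
def weekly_on_call_pay(stipend, incidents_handled, per_incident_bonus):
    """Total on-call pay for one week, in whole dollars."""
    if incidents_handled < 0:
        raise ValueError("incidents_handled cannot be negative")
    return stipend + incidents_handled * per_incident_bonus

quiet_week = weekly_on_call_pay(500, 0, 50)  # stipend only
busy_week = weekly_on_call_pay(500, 3, 50)   # stipend plus three incident bonuses
print(quiet_week, busy_week)
```

The stipend pays for availability even in a quiet week, while the bonus scales with how disruptive the shift actually was.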
Legal Considerations
On-call compensation requirements vary by jurisdiction. Some regions require pay for "on-call time" even if no incidents occur. Consult with HR/legal for compliance.
Preventing On-Call Burnout
On-call burnout is a real risk that leads to attrition, mistakes, and degraded incident response. Prevention requires intentional effort.
Warning Signs
- Dreading on-call shifts weeks in advance
- Sleep disruption affecting work quality
- Decreased incident response quality over time
- Team members frequently swapping away from on-call
- High alert volume with many false positives
Prevention Strategies
- Fix noisy alerts: every alert should be actionable
- Tune thresholds to reduce false positives
- Group related alerts to reduce noise
- Track and eliminate toil
- Ensure the rotation includes enough people
- Expand the on-call pool to all appropriate engineers
- Consider hiring for coverage if needed
- Use follow-the-sun if possible
- Pay fairly for on-call time
- Compensate extra for shifts with heavy page volume
- Provide comp time after rough shifts
- Recognize on-call in performance reviews
- Invest in reliability to reduce incidents
- Write better runbooks for faster resolution
- Automate common remediations
- Prioritize fixes for repeat incidents
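One way to find alerts worth tuning is to measure how often each one turned out to be a false positive. The sketch below is an illustration under assumptions: the page log format, the alert names, and the thresholds (`max_false_positive_rate`, `min_pages`) are all hypothetical and would need to be adapted to whatever the alerting tool exports.

```python
from collections import Counter

def noisy_alerts(pages, max_false_positive_rate=0.5, min_pages=3):
    """Return alert names whose false-positive rate exceeds the threshold.

    pages is an iterable of (alert_name, was_actionable) pairs.
    Alerts with fewer than min_pages pages are skipped as too small a sample.
    """
    total = Counter()
    false_positives = Counter()
    for name, was_actionable in pages:
        total[name] += 1
        if not was_actionable:
            false_positives[name] += 1
    return sorted(
        name
        for name, count in total.items()
        if count >= min_pages
        and false_positives[name] / count > max_false_positive_rate
    )

# Hypothetical page log for one shift.
pages = [
    ("disk_full", True), ("cpu_high", False), ("cpu_high", False),
    ("cpu_high", True), ("disk_full", True), ("disk_full", True),
    ("latency", False),
]
print(noisy_alerts(pages))
```

Here only `cpu_high` is flagged: two of its three pages were not actionable, while `latency` has too few pages to judge.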
Best Practices Summary
- Make on-call voluntary where possible — Forced on-call breeds resentment
- Set clear expectations — Response times, escalation, documentation
- Provide good tooling — Fast alerting, clear runbooks, easy escalation
- Hold post-mortems — Learn from incidents to reduce future burden
- Measure and track — Pages per shift, MTTR, repeat incidents
- Listen to feedback — On-call engineers know what's broken
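The metrics named above (pages per shift, MTTR, repeat incidents) can be computed from a simple incident log. This is a minimal sketch with assumptions: the `Incident` record, its fields, and the sample data are hypothetical, each incident is counted as one page, and MTTR is taken as the mean time from open to resolve.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import NamedTuple

class Incident(NamedTuple):
    shift: str         # hypothetical shift label, here an ISO week
    cause: str         # hypothetical root-cause tag
    opened: datetime
    resolved: datetime

def on_call_metrics(incidents):
    """Summarize pages per shift, mean time to resolve, and repeated causes."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    causes = Counter(i.cause for i in incidents)
    return {
        "pages_per_shift": dict(Counter(i.shift for i in incidents)),
        "mttr": total / len(incidents),
        "repeat_causes": sorted(c for c, n in causes.items() if n > 1),
    }

log = [
    Incident("2024-W01", "disk_full", datetime(2024, 1, 2, 3, 0), datetime(2024, 1, 2, 3, 30)),
    Incident("2024-W01", "cert_expired", datetime(2024, 1, 4, 22, 0), datetime(2024, 1, 4, 23, 30)),
    Incident("2024-W02", "disk_full", datetime(2024, 1, 9, 1, 0), datetime(2024, 1, 9, 2, 0)),
]
metrics = on_call_metrics(log)
print(metrics)
```

For this sample log the MTTR is one hour, and `disk_full` shows up as a repeat cause, which makes it a candidate for a prioritized fix.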