A real on-call rotation is a system in which no one is indispensable and everyone is adequately supported.

Imagine this: a critical alert fires at 3 AM. Instead of a single, sleep-deprived engineer fumbling through runbooks, a pre-defined process kicks in. The alert is routed, enriched with context, and if the primary on-call can’t resolve it within minutes, a secondary engineer is automatically engaged. This isn’t about just assigning names to days; it’s about building a resilient, predictable system for incident response.

Here’s how we can structure it:

1. Define Incident Severity and Impact

Not all alerts are created equal. We need a clear, tiered system:

  • SEV-1 (Critical): System-wide outage, significant data loss, security breach. Immediate, all-hands-on-deck response.
  • SEV-2 (Major): Significant degradation of service for a large user segment, or a critical feature is unusable. Requires prompt attention.
  • SEV-3 (Minor): Performance issues affecting a small subset of users, or non-critical feature malfunctions. Can often be addressed during business hours.
  • SEV-4 (Informational): Warning or informational alerts that don’t immediately impact users but require monitoring.

The on-call rotation should primarily focus on SEV-1 and SEV-2 incidents. SEV-3 and SEV-4 might be handled by a separate, less urgent escalation path or even assigned as follow-up tasks.
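As a minimal sketch of how the tiers above could drive routing, the snippet below maps each severity to a destination. The destination names (`page_oncall`, `business_hours_ticket`, `monitoring_log`) are hypothetical labels chosen for illustration, not settings from any particular tool.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity tiers from the list above; a lower number is more urgent."""
    SEV1 = 1  # Critical
    SEV2 = 2  # Major
    SEV3 = 3  # Minor
    SEV4 = 4  # Informational


def route(severity: Severity) -> str:
    """Return a hypothetical routing destination for an alert of this severity."""
    if severity <= Severity.SEV2:
        return "page_oncall"            # SEV-1/SEV-2: page the on-call rotation
    if severity == Severity.SEV3:
        return "business_hours_ticket"  # SEV-3: handle during business hours
    return "monitoring_log"             # SEV-4: record and keep watching


if __name__ == "__main__":
    for sev in Severity:
        print(sev.name, "->", route(sev))
```

Keeping the mapping in one function makes the paging boundary (SEV-2 and above) explicit and easy to revisit when the team changes its policy.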

2. Implement a Multi-Tiered On-Call Structure

A single on-call engineer is a single point of failure. A better model uses primary, secondary, and potentially tertiary responders.

  • Primary On-Call: The first responder for all SEV-1/SEV-2 incidents. Responsible for initial triage, diagnosis, and resolution. They are the "front line."
  • Secondary On-Call: Automatically paged if the primary engineer doesn’t acknowledge the alert within, say, 5 minutes, or if the incident remains unresolved past a predefined time limit (e.g., 15 minutes for SEV-1). They provide backup and can take over if needed.
  • Tertiary/Subject Matter Expert (SME): A pre-defined expert for specific services or components. Paged only if the incident clearly falls within their domain and both primary and secondary are unable to resolve it. This ensures specialized knowledge is available without overwhelming SMEs with general alerts.

Example Configuration (illustrative PagerDuty-like pseudo-syntax, not an actual PagerDuty schema):

service: web-api
escalation_policy:
  - id: primary_oncall
    escalation_delay_minutes: 5 # Primary has 5 mins to acknowledge
    targets:
      - user_id: user_a
  - id: secondary_oncall
    escalation_delay_minutes: 10 # Secondary has 10 mins if primary missed it
    targets:
      - user_id: user_b
  - id: tertiary_sme_db
    escalation_delay_minutes: 15 # Tertiary has 15 mins
    targets:
      - user_id: user_c # Database SME
    conditions:
      service_component: database # Only paged for database incidents
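
To make the escalation timing concrete, here is a small simulation of the policy above. It is a sketch under one stated assumption: each delay is read as the acknowledgement window for that tier, so windows accumulate (secondary at minute 5, database SME at minute 15). The `Tier` class and `paged_tiers` function are hypothetical helpers, not part of any vendor API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    ack_window_minutes: int          # time this tier has before the next one is paged
    component: Optional[str] = None  # if set, the tier is only paged for this component


# Mirrors the example policy; the cumulative reading of the delays is an assumption.
POLICY = [
    Tier("primary_oncall", 5),
    Tier("secondary_oncall", 10),
    Tier("tertiary_sme_db", 15, component="database"),
]


def paged_tiers(minutes_unacknowledged: float, component: str) -> list[str]:
    """Return the tiers paged so far for an alert that nobody has acknowledged."""
    paged = []
    starts_at = 0.0  # minute at which the current tier gets paged
    for tier in POLICY:
        if tier.component is not None and tier.component != component:
            continue  # tier does not apply to this component; it consumes no time
        if minutes_unacknowledged < starts_at:
            break     # this tier's turn has not come yet
        paged.append(tier.name)
        starts_at += tier.ack_window_minutes
    return paged


if __name__ == "__main__":
    for minute in (0, 5, 15):
        print(minute, paged_tiers(minute, component="database"))
```

Running a simulation like this against a proposed policy is a cheap way to check that the worst-case time before an expert is engaged matches what the team expects.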

3. Distribute the Load Fairly and Predictably

Burnout stems from an uneven distribution of on-call burden.

  • Rotation Length: A week-long rotation (Monday to Sunday) is common. Shorter rotations (e.g., 2-3 days) mean each person is on call more often but reduce the impact of any single shift. Longer rotations (e.g., two weeks) mean fewer handoffs and longer breaks between shifts but increase the risk of a single person becoming fatigued or overwhelmed during their shift. Experiment to find what works for your team size and incident volume.
  • Team Size: Aim for a minimum of 4-5 people in the rotation to ensure adequate coverage and prevent any single person from being on-call too frequently.
  • "Follow the Sun" Model: For globally distributed teams, this model assigns on-call based on active working hours, minimizing overnight alerts for any single individual.
  • Opt-Out/Swap System: Allow engineers to easily swap shifts or opt out for pre-approved time off, provided they arrange adequate coverage. This fosters flexibility.
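
A weekly Monday-to-Sunday rotation with a primary and a secondary can be generated with a simple round-robin. The sketch below assumes the secondary is the engineer who becomes primary the following week; that pairing, the function name `weekly_rotation`, and the engineer names are illustrative choices rather than a prescribed method.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], first_monday: date, weeks: int):
    """Return a list of (shift_start, shift_end, primary, secondary) tuples."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must be a Monday")

    schedule = []
    for week in range(weeks):
        start = first_monday + timedelta(weeks=week)
        end = start + timedelta(days=6)  # the Sunday that closes the shift
        primary = engineers[week % len(engineers)]
        # Assumption: next week's primary shadows as this week's secondary.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start, end, primary, secondary))
    return schedule


if __name__ == "__main__":
    team = ["ana", "ben", "chi", "dev"]  # hypothetical names
    for start, end, primary, secondary in weekly_rotation(team, date(2024, 1, 1), 5):
        print(start, "to", end, "primary:", primary, "secondary:", secondary)
```

Swaps can then be handled as edits to the generated list, which keeps the default schedule predictable while still allowing flexibility.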

Example: Fair Distribution Calculation

If you have 10 engineers and a weekly rotation, each engineer is on call approximately 10% of the time, or about five weeks per year. If your incident volume averages 2 SEV-1/SEV-2 incidents per week, the on-call engineer can expect roughly 2 pages during each shift, which averages out to a little under one page per engineer per month. If the per-shift number is significantly higher, the system needs better alert tuning; if shifts come around too often, the rotation needs more people.
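
The arithmetic can be checked with a short sketch. It assumes a weekly rotation with one primary at a time, pages spread evenly across weeks, and every page going to the primary; the function name `oncall_load` is hypothetical.

```python
def oncall_load(engineers: int, pages_per_week: float, weeks_per_year: int = 52) -> dict:
    """Estimate per-engineer load for a weekly, one-primary-at-a-time rotation."""
    if engineers < 1:
        raise ValueError("engineers must be at least 1")
    pages_per_year = pages_per_week * weeks_per_year / engineers
    return {
        "share_of_time": 1 / engineers,          # fraction of weeks spent on call
        "shifts_per_year": weeks_per_year / engineers,
        "pages_per_shift": pages_per_week,       # the primary takes every page that week
        "pages_per_month": pages_per_year / 12,  # averaged over the whole year
    }


if __name__ == "__main__":
    for key, value in oncall_load(engineers=10, pages_per_week=2).items():
        print(f"{key}: {value:.2f}")
```

With 10 engineers and 2 pages per week, this gives 2 pages per on-call week, which averages out to a little under one page per engineer per month. Plugging in your own team size and page volume shows quickly whether the load is reasonable.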

4. Automate Everything Possible

Manual processes are error-prone and slow down incident response.

  • Automated Alerting: Use robust monitoring tools (Prometheus, Datadog, New Relic) integrated with an incident management platform (PagerDuty, Opsgenie, or Splunk On-Call, formerly VictorOps).
  • Automated Triage and Routing: Configure alerts to include runbooks, relevant logs, and system dashboards. If an alert mentions "database latency," it should automatically link to the database monitoring dashboard and relevant troubleshooting steps.
  • Automated Escalation: As shown in the example policy, escalation delays and conditions should be automated.
  • Automated Post-Mortems: Tools can help gather metrics, logs, and timelines automatically after an incident is resolved, speeding up the post-mortem process.
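
As an illustration of automated enrichment, the sketch below attaches a dashboard and runbook link to an alert based on keywords in its summary. The rule table, the `enrich` function, the alert fields, and all URLs are placeholders invented for this example; a real setup would use whatever fields and integrations your monitoring platform provides.

```python
# Hypothetical keyword -> context mapping; the URLs are placeholders.
CONTEXT_RULES = {
    "database latency": {
        "dashboard": "https://dashboards.example.com/database",
        "runbook": "https://runbooks.example.com/database-latency",
    },
    "disk space": {
        "dashboard": "https://dashboards.example.com/storage",
        "runbook": "https://runbooks.example.com/disk-space",
    },
}

DEFAULT_RUNBOOK = "https://runbooks.example.com/general-triage"


def enrich(alert: dict) -> dict:
    """Return a copy of the alert with dashboard and runbook links attached."""
    enriched = dict(alert)  # do not mutate the caller's alert
    summary = alert.get("summary", "").lower()
    for keyword, links in CONTEXT_RULES.items():
        if keyword in summary:
            enriched.update(links)
            break
    else:
        # No rule matched: fall back to a generic triage runbook.
        enriched["runbook"] = DEFAULT_RUNBOOK
    return enriched


if __name__ == "__main__":
    print(enrich({"summary": "High database latency on web-api", "severity": "SEV-2"}))
```

The fallback matters: an alert with no matching rule should still point the responder somewhere, and unmatched alerts are a useful signal that a rule or runbook is missing.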

5. Provide Excellent Runbooks and Documentation

Engineers should never be guessing in the dark.

  • Clear, Concise Runbooks: For every alert, there should be a runbook. It should start with "What is this alert?" then "What is the likely impact?" and finally "Steps to resolve."
  • Accessible Documentation: Ensure all runbooks and system documentation are easily searchable and accessible from the alert itself or from the incident management platform.
  • Live Runbook Testing: Regularly test runbooks by simulating alerts and having engineers walk through the steps. This identifies gaps and outdated information.
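
Runbook coverage can be audited automatically. The sketch below checks that every alert has a runbook and that each runbook contains the three sections suggested above. It assumes runbooks are available as plain text keyed by alert name; the `audit` function and the sample alert names are hypothetical.

```python
REQUIRED_SECTIONS = (
    "What is this alert?",
    "What is the likely impact?",
    "Steps to resolve",
)


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section titles that do not appear in the runbook."""
    lowered = runbook_text.lower()
    return [title for title in REQUIRED_SECTIONS if title.lower() not in lowered]


def audit(alert_names: list[str], runbooks: dict[str, str]) -> dict[str, list[str]]:
    """Map each problematic alert to what is missing; an empty dict means full coverage."""
    problems = {}
    for name in alert_names:
        text = runbooks.get(name)
        if text is None:
            problems[name] = ["<no runbook>"]
            continue
        gaps = missing_sections(text)
        if gaps:
            problems[name] = gaps
    return problems


if __name__ == "__main__":
    runbooks = {
        "db_latency": "What is this alert?\n...\nWhat is the likely impact?\n...\nSteps to resolve\n...",
        "disk_full": "What is this alert?\n...\nSteps to resolve\n...",
    }
    print(audit(["db_latency", "disk_full", "cert_expiry"], runbooks))
```

A check like this only verifies structure, so it complements live runbook testing rather than replacing it.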

6. Foster a Culture of Psychological Safety

Engineers need to feel safe to acknowledge alerts, ask for help, and make mistakes without fear of reprisal.

  • Blameless Post-Mortems: Focus on system failures, not individual errors. The goal is to learn and improve, not to assign blame.
  • Encourage Asking for Help: Make it clear that escalating or asking for assistance is a sign of good judgment, not weakness.
  • Regular Feedback: Solicit feedback from the on-call team about the process, tools, and any pain points.

7. Compensate and Recognize On-Call Work

On-call is a significant responsibility that impacts work-life balance.

  • Financial Compensation: Offer a stipend or extra pay for on-call shifts, especially those involving nights and weekends.
  • Time Off: Provide additional PTO or "comp time" for engineers who have particularly heavy on-call weeks or handle major incidents.
  • Recognition: Publicly acknowledge the contributions of the on-call team during team meetings or company-wide updates.

8. Continuous Improvement

An on-call system is never truly "done." It requires ongoing refinement.

  • Regular Review of Metrics: Track metrics like Mean Time To Acknowledge (MTTA), Mean Time To Resolve (MTTR), number of escalations, and pages per shift (a practical proxy for alert fatigue).
  • Incident Retrospectives: After every SEV-1/SEV-2 incident, conduct a brief retrospective to identify what went well and what could be improved in the on-call process.
  • Alert Tuning: Continuously tune alert thresholds and silence noisy, low-impact alerts to reduce alert fatigue.
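
MTTA and MTTR are straightforward to compute once incident timestamps are exported. The sketch below assumes each incident record carries `triggered`, `acknowledged`, and `resolved` datetimes and measures both metrics from the trigger time; the field names and sample data are illustrative, and some teams measure resolution from acknowledgement instead.

```python
from datetime import datetime
from statistics import mean


def mtta_mttr(incidents: list[dict]) -> tuple[float, float]:
    """Return (MTTA, MTTR) in minutes, both measured from the trigger time."""
    if not incidents:
        raise ValueError("need at least one incident")
    ack_seconds = [(i["acknowledged"] - i["triggered"]).total_seconds() for i in incidents]
    resolve_seconds = [(i["resolved"] - i["triggered"]).total_seconds() for i in incidents]
    return mean(ack_seconds) / 60, mean(resolve_seconds) / 60


if __name__ == "__main__":
    # Made-up sample incidents for demonstration only.
    sample = [
        {
            "triggered": datetime(2024, 1, 1, 3, 0),
            "acknowledged": datetime(2024, 1, 1, 3, 4),
            "resolved": datetime(2024, 1, 1, 3, 30),
        },
        {
            "triggered": datetime(2024, 1, 8, 14, 0),
            "acknowledged": datetime(2024, 1, 8, 14, 6),
            "resolved": datetime(2024, 1, 8, 14, 50),
        },
    ]
    mtta, mttr = mtta_mttr(sample)
    print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Means are sensitive to a single long incident, so it is often worth reviewing the median or the worst case alongside them.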

By implementing these principles, you can build an on-call rotation that is effective, sustainable, and doesn’t lead to the burnout that plagues so many engineering teams. The ultimate goal is a system where incidents are resolved quickly and reliably, without sacrificing the well-being of the engineers responsible for keeping the lights on.

The next logical step after stabilizing your on-call rotation is to implement robust chaos engineering practices to proactively identify weaknesses before they cause real-world incidents.
