Observability & Performance Deep Dive · 5 of 7

On-Call & Incident Response — PagerDuty, Opsgenie & Friends

Alerts without routing are noise. Routing without escalation is a single-point-of-failure pager. PagerDuty (2009) put on-call on the map; Opsgenie (now Atlassian) and newer entrants like incident.io and Rootly compete on developer ergonomics, slick Slack integration, and the chat-driven incident playbook.

On-call rotation · Escalation · Runbooks · Severity levels · Postmortems
Anatomy

What These Tools Actually Do

Basic Concepts

  • Schedule — who's on-call when (primary, secondary, holiday rotations).
  • Service — a logical owner; alerts route here, then to the schedule.
  • Escalation policy — primary unresponsive after N minutes → secondary → manager.
  • Routing rules — match alerts to the right service by labels/tags.
  • Severity levels — SEV1 (full outage), SEV2 (major degradation), SEV3 (minor).
  • Acknowledgement & resolution — explicit lifecycle to prevent dropped pages.
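
The routing and escalation concepts above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the `Service`, `EscalationStep`, `route`, and `who_is_paged` names, the `team` label, and the 0/10/30 minute steps are all assumptions made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EscalationStep:
    target: str          # who gets paged at this step (hypothetical names)
    after_minutes: int   # minutes unacknowledged before this step fires


@dataclass
class Service:
    name: str
    match_labels: dict[str, str]        # routing rule: every label must match
    escalation: list[EscalationStep]    # ordered escalation policy


def route(alert_labels: dict[str, str], services: list[Service]) -> Service | None:
    """Return the first service whose routing labels all match the alert."""
    for svc in services:
        if all(alert_labels.get(k) == v for k, v in svc.match_labels.items()):
            return svc
    return None  # unrouted alert: nobody owns it


def who_is_paged(service: Service, minutes_unacked: int) -> list[str]:
    """Everyone paged so far for an alert that has gone unacknowledged this long."""
    return [s.target for s in service.escalation if s.after_minutes <= minutes_unacked]


# Assumed example data: one service with a three-step escalation policy.
services = [
    Service(
        name="checkout",
        match_labels={"team": "payments"},
        escalation=[
            EscalationStep("primary", 0),
            EscalationStep("secondary", 10),
            EscalationStep("manager", 30),
        ],
    ),
]

svc = route({"team": "payments", "severity": "SEV2"}, services)
paged = who_is_paged(svc, minutes_unacked=12)
print(svc.name, paged)  # checkout ['primary', 'secondary']
```

Acknowledging the alert would stop the walk down the policy; the sketch leaves that lifecycle state out to keep the routing and escalation logic visible.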
Players

Side-By-Side

Tool | Strength | Notable
PagerDuty | The category creator; mature, event-routing rich | AIOps add-ons for noise reduction; pricier per-user.
Opsgenie | Tight Atlassian integration (Jira, Statuspage) | Strong if you're already in the Atlassian stack.
incident.io / Rootly / FireHydrant | Slack-native incident command | "Declare incident" Slack slash command; auto-create channel, ticket, status page.
VictorOps / Splunk On-Call | Splunk-integrated ops | Best when you're already a Splunk shop.
Better Stack / Grafana OnCall | OSS-friendly, simpler pricing | Grafana OnCall is the OSS option that pairs with the rest of the Grafana stack.
Good Alerting

The Healthy Habits

  • Alert on symptoms, not causes. Page on "users can't log in," not "Redis CPU is 90%."
  • Tie alerts to SLOs. Multi-window, multi-burn-rate alerts beat threshold-of-the-week.
  • Every alert links to a runbook. No runbook, no alert.
  • Tune ruthlessly. Each false page is a tax on attention and trust. Track signal-to-noise.
  • One pager, one owner. "Everyone is on-call" → no one is.
  • Sustainable rotation. Compensate, follow-the-sun where possible, no solo on-call for SEV1 services.
  • Blameless postmortems. Focus on the system, not the human; track action items to completion.
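
To make "multi-window, multi-burn-rate" concrete, here is a rough sketch of the check. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a page fires only when both a long and a short window exceed the same threshold, so a brief spike or an already-recovered incident does not page. The function names are hypothetical, and the 14.4 threshold with 1 h / 5 min windows is a commonly cited starting point for a 30-day SLO rather than a value taken from this article; treat it as an assumption to tune.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float) -> bool:
    """Page only if BOTH windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening right now.
    """
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


SLO = 0.999              # assumed target: 99.9% of requests succeed
FAST_BURN = 14.4         # assumed threshold for the 1 h / 5 min window pair

# 2% errors over the last hour and over the last 5 minutes: still burning, page.
ongoing = should_page(0.02, 0.02, SLO, FAST_BURN)

# 2% errors over the last hour, but only 0.05% in the last 5 minutes: recovered, no page.
recovered = should_page(0.02, 0.0005, SLO, FAST_BURN)

print(ongoing, recovered)  # True False
```

A slower pair (for example a lower threshold over longer windows) would typically open a ticket instead of paging; the same `should_page` check applies with different inputs.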
Tradeoffs

What Goes Wrong

  • Alert fatigue kills response. People silence pagers in their sleep — literally.
  • Hero culture. One person who knows everything = ticking bomb. Cross-train.
  • Stale runbooks. Runbooks that lie are worse than no runbooks; review on a cadence.
  • Postmortems without action. If nobody owns the action items, you'll have the same incident again.
  • SEV inflation. If everything's a SEV1, nothing is.
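
One way to catch alert fatigue before it sets in is to measure it. The sketch below assumes each page is recorded with a flag saying whether the responder had to act on it; the record shape, the alert names, and the 0.5 cutoff are illustrative assumptions, not something a particular tool exports in this form.

```python
from collections import Counter
from typing import Iterable


def actionable_ratio(pages: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Per-alert share of pages that needed action.

    pages: (alert_name, was_actionable) pairs, one per page sent.
    """
    total: Counter = Counter()
    useful: Counter = Counter()
    for name, actionable in pages:
        total[name] += 1
        if actionable:
            useful[name] += 1
    return {name: useful[name] / total[name] for name in total}


def noisy_alerts(pages: Iterable[tuple[str, bool]], min_ratio: float = 0.5) -> list[str]:
    """Alerts whose actionable share is below min_ratio: candidates to tune or delete."""
    return sorted(name for name, ratio in actionable_ratio(pages).items()
                  if ratio < min_ratio)


# Assumed sample of last month's pages.
history = [
    ("login-errors", True),
    ("login-errors", True),
    ("redis-cpu", False),
    ("redis-cpu", False),
    ("redis-cpu", True),
]

print(noisy_alerts(history))  # ['redis-cpu']
```

Reviewing this list on a regular cadence gives the "tune ruthlessly" habit a concrete input: each flagged alert gets rewritten against a symptom, downgraded to a ticket, or removed.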