PagerDuty / Opsgenie · Observability & Performance Deep Dive

Anatomy

What These Tools Actually Do

Schedule — who's on-call when (primary, secondary, holiday rotations).
Service — a logical owner; alerts route here, then to the schedule.
Escalation policy — primary unresponsive after N minutes → secondary → manager.
Routing rules — match alerts to the right service by labels/tags.
Severity levels — SEV1 (full outage), SEV2 (major degradation), SEV3 (minor).
Acknowledgement & resolution — explicit lifecycle to prevent dropped pages.
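
The routing and escalation pieces above can be sketched in a few lines of Python. This is an illustrative model only, not the PagerDuty or Opsgenie API: the names `RoutingRule`, `EscalationStep`, `route`, and `current_target`, the label keys, and the 15- and 30-minute delays are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RoutingRule:
    match: dict[str, str]  # labels that must all be present on the alert
    service: str           # hypothetical service name that owns matching alerts


@dataclass
class EscalationStep:
    target: str         # who is paged at this step
    after_minutes: int  # minutes the alert has gone unacknowledged


def route(labels: dict[str, str], rules: list[RoutingRule], default: str) -> str:
    """Return the service of the first rule whose labels all match the alert."""
    for rule in rules:
        if all(labels.get(key) == value for key, value in rule.match.items()):
            return rule.service
    return default  # unmatched alerts fall through to a catch-all service


def current_target(policy: list[EscalationStep], minutes_unacked: int) -> str:
    """Return who should be paged after the alert has gone unacknowledged."""
    steps = sorted(policy, key=lambda step: step.after_minutes)
    target = steps[0].target
    for step in steps:
        if step.after_minutes <= minutes_unacked:
            target = step.target  # the latest step whose delay has elapsed wins
    return target


# Example data (assumed, for illustration only).
RULES = [
    RoutingRule({"team": "identity", "env": "prod"}, "login"),
    RoutingRule({"team": "payments"}, "payments-api"),
]
POLICY = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 15),
    EscalationStep("manager", 30),
]

if __name__ == "__main__":
    alert = {"team": "identity", "env": "prod", "alertname": "LoginErrors"}
    print(route(alert, RULES, default="catch-all"))
    print(current_target(POLICY, minutes_unacked=20))
```

Rules are checked in order and the first match wins, so more specific rules belong first. An acknowledgement would stop the escalation clock; that part of the lifecycle is left out of the sketch.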

Players

Tool	Strength	Notable
PagerDuty	The category creator; mature, with rich event routing	AIOps add-ons for noise reduction; pricier per user.
Opsgenie	Tight Atlassian integration (Jira, Statuspage)	Strong if you're already in the Atlassian stack.
incident.io / Rootly / FireHydrant	Slack-native incident command	"Declare incident" Slack slash command; auto-create channel, ticket, status page.
Splunk On-Call (formerly VictorOps)	Splunk-integrated ops	Best when you're already a Splunk shop.
Better Stack / Grafana OnCall	OSS-friendly, simpler pricing	Grafana OnCall is the OSS option that pairs with the rest of the Grafana stack.

Good Alerting

Alert on symptoms, not causes. Page on "users can't log in," not "Redis CPU is 90%."
Tie alerts to SLOs. Multi-window, multi-burn-rate alerts on error-budget consumption beat hand-tuned static thresholds.
Every alert links to a runbook. No runbook, no alert.
Tune ruthlessly. Each false page is a tax on attention and trust. Track signal-to-noise, for example the share of pages that required action.
One pager, one owner. "Everyone is on-call" → no one is.
Sustainable rotation. Compensate on-call time, use follow-the-sun handoffs where possible, and never leave one person as the sole responder for SEV1 services.
Blameless postmortems. Focus on the system, not the human; track action items to completion.
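
As a rough illustration of the multi-window, multi-burn-rate idea, the sketch below pages only when both a long and a short window are burning error budget faster than a threshold. The window pairs and thresholds (1h/5m at 14.4, 6h/30m at 6) follow the commonly cited Google SRE Workbook example for a 30-day SLO; the function names and the shape of the `error_ratios` input are assumptions, and real values would come from your metrics backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # catches sustained budget burn
    short_window: str  # confirms the burn is still happening right now
    threshold: float   # burn-rate multiple that justifies a page


# Paging rules from the SRE Workbook example; adjust for your own SLO window.
PAGE_RULES = [
    BurnRateRule("1h", "5m", 14.4),
    BurnRateRule("6h", "30m", 6.0),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(error_ratios: dict[str, float], slo_target: float,
                rules: list[BurnRateRule] = PAGE_RULES) -> bool:
    """Page if any rule has both its long and short window over threshold."""
    for rule in rules:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return True
    return False


if __name__ == "__main__":
    slo = 0.999  # assumed 99.9% availability target
    ongoing = {"1h": 0.02, "5m": 0.03, "6h": 0.004, "30m": 0.02}
    recovered = {"1h": 0.02, "5m": 0.0005, "6h": 0.004, "30m": 0.001}
    print(should_page(ongoing, slo))    # long and short windows both hot
    print(should_page(recovered, slo))  # spike already over, short window is calm
```

Requiring the short window to agree is what keeps a spike that has already recovered from paging anyone, while the slower 6h/30m pair still catches a steady, lower-rate burn.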

Tradeoffs

Alert fatigue kills response. People silence pagers in their sleep — literally.
Hero culture. One person who knows everything = ticking bomb. Cross-train.
Stale runbooks. Runbooks that lie are worse than no runbooks; review on a cadence.
Postmortems without action. If nobody owns the action items, you'll have the same incident again.
SEV inflation. If everything's a SEV1, nothing is.
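
Alert fatigue and SEV inflation are both measurable. The sketch below assumes a hypothetical export of past pages in which each page records whether the responder actually had to act; the `Page` fields, the function names, and the 0.5 cutoff are illustrative choices, not part of any tool's API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert: str        # alert name that fired
    severity: str     # "SEV1", "SEV2", or "SEV3"
    actionable: bool  # did the responder need to do anything?


def actionable_ratio(pages: list[Page]) -> dict[str, float]:
    """Per alert, the fraction of pages that required action (signal-to-noise)."""
    total = Counter(page.alert for page in pages)
    useful = Counter(page.alert for page in pages if page.actionable)
    return {alert: useful[alert] / count for alert, count in total.items()}


def severity_share(pages: list[Page]) -> dict[str, float]:
    """Fraction of all pages at each severity; a SEV1-heavy mix hints at inflation."""
    counts = Counter(page.severity for page in pages)
    return {severity: count / len(pages) for severity, count in counts.items()}


def noisy_alerts(pages: list[Page], cutoff: float = 0.5) -> list[str]:
    """Alerts whose actionable ratio falls below the (assumed) cutoff."""
    ratios = actionable_ratio(pages)
    return sorted(alert for alert, ratio in ratios.items() if ratio < cutoff)


# Example history (made up for illustration).
HISTORY = [
    Page("LoginErrors", "SEV1", True),
    Page("LoginErrors", "SEV1", True),
    Page("RedisCpuHigh", "SEV1", False),
    Page("RedisCpuHigh", "SEV1", False),
    Page("RedisCpuHigh", "SEV1", False),
    Page("RedisCpuHigh", "SEV2", True),
]

if __name__ == "__main__":
    print(actionable_ratio(HISTORY))
    print(severity_share(HISTORY))
    print(noisy_alerts(HISTORY))
```

Reviewing these numbers on a regular cadence gives the tuning work a concrete target: alerts under the cutoff are candidates for deletion or demotion to a ticket, and a severity mix dominated by SEV1 suggests the definitions need tightening.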
