Observability Deep Dive

SLIs, SLOs & Error Budgets

"Reliable" is a feeling. Measurably reliable is a discipline. Service Level Objectives turn reliability into a number you can budget, alert on, and trade against feature velocity. The whole point isn't to hit 100% — it's to spend the failure you can afford on the work that matters most.

Quick Facts

The Three Letters

SLI vs SLO vs SLA

  • SLI — Service Level Indicator: a number you measure. "Fraction of redirects served in under 50ms." Usually good events ÷ valid events.
  • SLO — Service Level Objective: the target you set internally. "99.9% of redirects under 50ms over a 30-day window."
  • SLA — Service Level Agreement: the promise you make externally, with consequences (refunds, credits) if you miss it. SLAs are looser than SLOs by design — you set yourself a tighter internal target so you have room to react.
  • Error budget: the gap between 100% and your SLO. With a 99.9% SLO, you have a 0.1% budget — about 43 minutes of failure per 30 days.

Codified by Google's SRE practice. The discipline matters more than the exact numbers.

Why

What SLOs Buy You

  • A shared definition of "working." No more "the site felt slow" arguments — the SLI either was or wasn't met.
  • Alerts that wake you for the right reasons. You alert on budget burn, not on every blip.
  • A lever for the velocity-vs-stability trade. Budget left → ship features. Budget exhausted → freeze and stabilize.
  • A user-facing definition instead of a CPU graph. "99.9% of logins succeed under 1s" maps directly to user pain. "CPU is at 60%" doesn't.
Picking SLIs

A good SLI is a fraction:

SLI = good events ÷ valid events

Define both halves carefully. "Valid events" excludes things you don't want to be on the hook for (synthetic checks, requests from blocked IPs, 4xx caused by the client). "Good events" is whatever you'd defend in a postmortem as "worked."
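
To make the fraction concrete, here is a minimal availability-SLI sketch in Python; the Request shape and the exact exclusion rules are assumptions for illustration, not a standard.

from dataclasses import dataclass

@dataclass
class Request:
    status: int       # HTTP status code
    synthetic: bool   # uptime probes, health checks, etc.

def availability_sli(requests: list[Request]) -> float:
    """good events / valid events, per the definition above."""
    # Valid: exclude synthetic traffic and client-caused 4xx.
    valid = [r for r in requests
             if not r.synthetic and not (400 <= r.status < 500)]
    # Good: anything that isn't a server error.
    good = [r for r in valid if r.status < 500]
    return len(good) / len(valid) if valid else 1.0

# Two good, one 503, one excluded synthetic probe -> SLI = 2/3
reqs = [Request(200, False), Request(200, False),
        Request(503, False), Request(200, True)]
print(f"{availability_sli(reqs):.3f}")  # 0.667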

Common SLI families

Family       | Example                                                         | Where it fits
Availability | Fraction of requests with HTTP < 500.                           | Almost every service.
Latency      | Fraction of requests served in < 50ms.                          | User-facing reads.
Quality      | Fraction of search queries returning > 0 results.               | Search, recommendations, ML services.
Freshness    | Fraction of dashboards reflecting data from the last 5 minutes. | Pipelines, batch, replicas.
Throughput   | Messages processed per minute above floor X.                    | Async pipelines.
Correctness  | Fraction of orders whose total matches sum of line items.      | Critical business logic.

Pick at most 2–3 SLIs per service. More than that and nobody will track them.

Picking the Number

How High Should the SLO Be?

As high as users actually need — and not a tenth of a percent higher.

SLO                    | Allowed downtime / 30 days | Roughly equivalent to
99%                    | ~7h 12m                    | Internal tools, pre-launch betas.
99.5%                  | ~3h 36m                    | B2B SaaS used during business hours.
99.9% ("three nines")  | ~43m 12s                   | Most consumer web products.
99.95%                 | ~21m 36s                   | Critical paths in larger products.
99.99% ("four nines")  | ~4m 19s                    | Payments, identity providers, core infra.
99.999% ("five nines") | ~26s                       | Telecom, regulated finance. Hugely expensive.

Each extra "nine" is roughly 10× the cost. Setting SLOs higher than users need is one of the most expensive mistakes a team can make.
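
The downtime column is simple arithmetic: budget = (1 − SLO) × window. A quick sketch that reproduces the table:

def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget in minutes: (1 - SLO) x window length."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999, 0.99999):
    print(f"{slo:.3%} -> {allowed_downtime_minutes(slo):.1f} min / 30 days")
# 99.900% -> 43.2 min, 99.990% -> 4.3 min, 99.999% -> 0.4 min (~26s), ...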

Error Budgets

The Discipline That Makes SLOs Useful

A 99.9% SLO over 30 days = 0.1% × 30d ≈ 43 minutes of "bad time" you're allowed to spend. That's your error budget.

  • Budget unspent → ship faster. Take risks, deploy mid-day, run experiments.
  • Budget burning → slow down. Freeze risky changes, focus on reliability work, don't ship Friday afternoon.
  • Budget exhausted → stop. No new features until the SLO recovers. Sounds extreme; works.

This is the trade product and engineering should be making consciously. Without an error budget the conversation devolves into "ship faster" vs "be more careful" with no shared language.
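
A minimal sketch of the bookkeeping, assuming you already have the measured SLI for the window:

def budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget left: 1.0 = untouched, 0.0 = exhausted."""
    budget = 1 - slo   # e.g. 0.001 for a 99.9% SLO
    spent = 1 - sli    # observed bad-event fraction so far
    return max(0.0, 1 - spent / budget)

# Measuring 99.95% against a 99.9% SLO: half the budget is gone.
print(budget_remaining(sli=0.9995, slo=0.999))  # 0.5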

Alerting

Burn-Rate Alerts Beat Threshold Alerts

Old-school threshold alerts trip on "error rate > 1% for 5 minutes": too sensitive during low-traffic hours (a handful of errors at midnight clears 1%), too slow during a real outage. Multi-window, multi-burn-rate alerts solve both problems.

  • Burn rate = how fast you're spending the error budget right now. 1× = on track to exhaust the budget exactly at the end of the window. 10× = blowing through it ten times faster.
  • Fast-burn alert (page now): e.g., 14× burn sustained over both a 1-hour and a 5-minute window. Catches major outages quickly with a low false-positive rate.
  • Slow-burn alert (ticket, not page): e.g., 1× burn sustained over both a 6-hour and a 30-minute window. Catches creeping degradation.
  • Requiring both windows in each alert keeps you from paging on a single noisy minute; see the sketch after this list.
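
Below is a minimal sketch of the fast-burn check, assuming you can count bad and total events over each window; the 14× threshold and the 1h/5min windows mirror the example above.

def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed error rate as a multiple of the budgeted rate."""
    return (bad / total) / (1 - slo) if total else 0.0

def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo: float, threshold: float = 14.0) -> bool:
    """Page only if BOTH the 1-hour and 5-minute windows exceed the
    threshold, so a single noisy minute can't wake anyone."""
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)

# 2% errors against a 99.9% SLO = 20x burn in both windows -> page.
print(should_page((1200, 60_000), (100, 5_000), slo=0.999))  # True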

The Google SRE workbook's burn-rate table is the canonical reference here. Most observability platforms (Datadog, Grafana, Honeycomb, Nobl9) ship SLO + burn-rate alerts as a first-class feature.

Pitfalls

Ways Teams Get SLOs Wrong

  • Aspirational SLOs. 99.99% on a service that historically does 99.5%. You'll spend forever in alarm and learn to ignore the alerts.
  • SLOs nobody owns. If the team doesn't sign off, they'll quietly disregard them. Co-own with product.
  • Measuring the wrong thing. CPU is not user pain. Pick SLIs from the user's perspective.
  • Symmetric SLOs across endpoints. Your healthcheck doesn't need 99.99% — your checkout does.
  • No budget policy. "Stop and stabilize" only works if it's pre-agreed with leadership.
  • Including planned maintenance in the SLI. Either exclude it, or accept the budget hit consciously.
  • Forgetting to review. SLOs go stale: what was right last year might be too loose or too strict now.

Worked Example

SLOs for the URL Shortener

SLI 1 — Redirect availability

good_events  = count(http_requests{path="/:code", status < 500})
valid_events = count(http_requests{path="/:code"})
SLI          = good_events / valid_events
SLO          = 99.95% over 30 days  → ~21 minutes of budget

SLI 2 — Redirect latency

good_events  = count(http_requests{path="/:code", duration_ms < 50})
valid_events = count(http_requests{path="/:code"})
SLI          = good_events / valid_events
SLO          = 99% of redirects < 50ms over 30 days

SLI 3 — Click-event freshness

good_events  = count(clicks where visible_at - occurred_at < 5min)
valid_events = count(clicks)
SLI          = good_events / valid_events
SLO          = 99% of clicks visible in stats within 5 minutes
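
As a sketch, the freshness SLI could be computed from (occurred_at, visible_at) pairs; that record shape is an assumption for illustration.

from datetime import datetime, timedelta

FRESHNESS = timedelta(minutes=5)

def freshness_sli(clicks: list[tuple[datetime, datetime]]) -> float:
    """clicks = (occurred_at, visible_at) pairs; fraction visible in time."""
    if not clicks:
        return 1.0
    good = sum(1 for occurred, visible in clicks
               if visible - occurred < FRESHNESS)
    return good / len(clicks)

t0 = datetime(2024, 1, 1, 12, 0)
lags = [(t0, t0 + timedelta(seconds=40)), (t0, t0 + timedelta(minutes=9))]
print(freshness_sli(lags))  # 0.5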

Error-budget policy

  • > 50% budget remaining: normal operations; experiment freely.
  • < 50% budget remaining: code freeze on risky changes; reliability is the next sprint's #1 priority.
  • Budget exhausted: feature freeze. The next deploy must be a reliability fix (the full ladder is sketched in code after this list).
  • Burn-rate page: 14× burn over both the 1h and 5min windows. Burn-rate ticket: 1× burn over both the 6h and 30min windows.
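
Encoded as a sketch, with the thresholds listed above (hypothetical helper, not a standard):

def policy_stage(budget_left: float) -> str:
    """Map remaining-budget fraction to the pre-agreed action."""
    if budget_left <= 0.0:
        return "feature freeze: next deploy must be a reliability fix"
    if budget_left < 0.5:
        return "freeze risky changes; reliability is next sprint's #1"
    return "normal operations: experiment freely"

print(policy_stage(0.8))  # normal operations: experiment freely
print(policy_stage(0.2))  # freeze risky changes; ...
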
Operationalizing

Make It Stick

  1. Pick 2–3 SLIs per service from a user perspective.
  2. Ship them as Prometheus / OTel metrics with stable labels — these become your SLO definitions (a sketch follows this list).
  3. Set conservative initial SLOs based on past 30 days of data, not aspiration.
  4. Build one dashboard per service: SLI value, SLO target, budget remaining, burn rate.
  5. Wire up burn-rate alerts (fast + slow). Delete legacy threshold alerts that overlap.
  6. Write the budget policy down. Get product + leadership agreement.
  7. Review SLOs quarterly. Tighten where you've earned it, loosen where users don't care.
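
For step 2, a minimal instrumentation sketch using the Python prometheus_client library; the metric and label names here are illustrative choices, not a convention:

from prometheus_client import Counter, start_http_server

REDIRECTS = Counter(
    "redirect_requests_total",
    "Redirect requests, labeled for SLI slicing.",
    ["outcome"],  # keep label names stable: they define the SLO
)

def record_redirect(status: int, duration_ms: float) -> None:
    if status >= 500:
        REDIRECTS.labels(outcome="error").inc()
    elif duration_ms >= 50:
        REDIRECTS.labels(outcome="slow").inc()
    else:
        REDIRECTS.labels(outcome="good").inc()

start_http_server(9090)  # expose /metrics for Prometheus to scrape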