"Reliable" is a feeling. Measurably reliable is a discipline. Service Level Objectives turn reliability into a number you can budget, alert on, and trade against feature velocity. The whole point isn't to hit 100% — it's to spend the failure you can afford on the work that matters most.
Codified by Google's SRE practice. The discipline matters more than the exact numbers.
A good SLI is a fraction:
SLI = good events ÷ valid events
Define both halves carefully. "Valid events" excludes things you don't want to be on the hook for (synthetic checks, requests from blocked IPs, 4xx caused by the client). "Good events" is whatever you'd defend in a postmortem as "worked."
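As a sketch of that fraction in code — the event fields and validity rules here are illustrative assumptions, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class Event:
    status: int        # HTTP status code
    synthetic: bool    # health check / uptime probe
    blocked_ip: bool   # request from a blocked IP

def availability_sli(events: list[Event]) -> float:
    # "Valid" excludes traffic you refuse to be on the hook for:
    # synthetic probes, blocked IPs, and client-caused 4xx.
    valid = [e for e in events
             if not e.synthetic and not e.blocked_ip
             and not 400 <= e.status < 500]
    if not valid:
        return 1.0  # no valid traffic means no failures to count
    good = [e for e in valid if e.status < 500]
    return len(good) / len(valid)
```

Note that a client 404 neither helps nor hurts the SLI here: it is simply not a valid event.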
| Family | Example | Where it fits |
|---|---|---|
| Availability | Fraction of requests with HTTP < 500. | Almost every service. |
| Latency | Fraction of requests served in < 50ms. | User-facing reads. |
| Quality | Fraction of search queries returning > 0 results. | Search, recommendations, ML services. |
| Freshness | Fraction of dashboards reflecting data from the last 5 minutes. | Pipelines, batch, replicas. |
| Throughput | Messages processed per minute above floor X. | Async pipelines. |
| Correctness | Fraction of orders whose total matches sum of line items. | Critical business logic. |
Pick at most 2–3 SLIs per service. More than that and nobody will track them.
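Every family in the table reduces to a different "good" predicate over the same event stream. A minimal sketch, with hypothetical event dicts whose field names are made up for illustration:

```python
# Hypothetical request events; field names are illustrative.
requests = [
    {"status": 200, "duration_ms": 12},
    {"status": 200, "duration_ms": 180},
    {"status": 503, "duration_ms": 30},
]

def sli(events, good) -> float:
    """Generic SLI: fraction of events satisfying the 'good' predicate."""
    return sum(1 for e in events if good(e)) / len(events)

availability = sli(requests, lambda e: e["status"] < 500)      # 2 of 3 good
latency      = sli(requests, lambda e: e["duration_ms"] < 50)  # 2 of 3 good
```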
As high as users actually need — and not a tenth of a percent higher.
| SLO | Allowed downtime / 30 days | Roughly equivalent to |
|---|---|---|
| 99% | ~7h 12m | Internal tools, pre-launch betas. |
| 99.5% | ~3h 36m | B2B SaaS used during business hours. |
| 99.9% ("three nines") | ~43m 12s | Most consumer web products. |
| 99.95% | ~21m 36s | Critical paths in larger products. |
| 99.99% ("four nines") | ~4m 19s | Payments, identity providers, core infra. |
| 99.999% | ~26 seconds | Telecom, regulated finance. Hugely expensive. |
Each extra "nine" is roughly 10× the cost. Setting SLOs higher than users need is one of the most expensive mistakes a team can make.
A 99.9% SLO over 30 days = 0.1% × 30d ≈ 43 minutes of "bad time" you're allowed to spend. That's your error budget.
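The budget arithmetic is one line (a sketch; the figures assume a 30-day window):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of 'bad time' allowed per window at a given SLO."""
    return (1 - slo) * window_days * 24 * 60

error_budget_minutes(0.999)   # ≈ 43.2 minutes over 30 days
error_budget_minutes(0.9995)  # ≈ 21.6 minutes
```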
This is the trade product and engineering should be making consciously. Without an error budget the conversation devolves into "ship faster" vs "be more careful" with no shared language.
Old-school alerts trip on thresholds like "error rate > 1% for 5 minutes": too noisy at midnight when traffic is low, too slow to page during a real outage. Multi-window, multi-burn-rate alerts solve both problems.
Burn rate measures how fast you are spending the error budget: 1× means you're on track to exhaust the budget exactly at the end of the window; 10× means you're blowing through it ten times faster. The Google SRE workbook's burn-rate table is the canonical reference here. Most observability platforms (Datadog, Grafana, Honeycomb, Nobl9) ship SLO and burn-rate alerts as a first-class feature.
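A minimal sketch of the multi-window check. The 14.4× threshold and 1h/5m window pair follow the SRE workbook's fast-burn page for a 30-day window; how the error rates are fetched from your metrics backend is assumed away:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' we're failing.
    error_rate: fraction of bad events in the window; (1 - slo): budget fraction."""
    return error_rate / (1 - slo)

def should_page(err_1h: float, err_5m: float, slo: float = 0.999) -> bool:
    # Long window proves the burn is sustained; short window proves it is
    # still happening, so the page resolves quickly once you recover.
    threshold = 14.4  # spends ~2% of a 30-day budget per hour
    return (burn_rate(err_1h, slo) >= threshold
            and burn_rate(err_5m, slo) >= threshold)
```

At a 99.9% SLO, 14.4× burn corresponds to a sustained 1.44% error rate: painful enough to page on, but a brief midnight blip fails the 1-hour window and stays quiet.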
```
# Availability SLI
good_events  = count(http_requests{path="/:code", status < 500})
valid_events = count(http_requests{path="/:code"})
SLI = good_events / valid_events
SLO = 99.95% over 30 days → ~21.6 minutes of budget
```

```
# Latency SLI
good_events  = count(http_requests{path="/:code", duration_ms < 50})
valid_events = count(http_requests{path="/:code"})
SLI = good_events / valid_events
SLO = 99% of redirects < 50ms over 30 days
```

```
# Freshness SLI
good_events  = count(clicks where now() - occurred_at < 5min)
valid_events = count(clicks)
SLI = good_events / valid_events
SLO = 99% of clicks visible in stats within 5 minutes
```