Architecture Deep Dive

Reliability — Surviving Failure

Things break. Disks die, networks partition, vendors have bad days, deploys go wrong. Reliability is the discipline of building systems where those events are absorbed, not amplified — and where recovery is rehearsed, not improvised.

Redundancy · Failover · Blast Radius · Chaos · Runbooks · Postmortems
Quick Facts

The Vocabulary

Words you'll use

  • Availability: fraction of time the system is up. 99.9% ≈ 43 minutes of downtime per month (see the arithmetic sketched after this list).
  • Durability: fraction of stored data not lost. Higher bar than availability.
  • Fault-tolerant: tolerates failures without user-visible impact.
  • Resilient: recovers quickly when impact is unavoidable.
  • RTO: Recovery Time Objective — how long until we're back up.
  • RPO: Recovery Point Objective — how much data we can afford to lose.
  • Blast radius: how much breaks when one thing fails.
  • MTTR / MTTD: mean time to recovery / detection.
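
These targets turn into concrete budgets with simple arithmetic; a minimal sketch, assuming a 30-day month:

```python
# Turn an availability target into a monthly downtime budget.
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    return days * 24 * 60 * (1 - availability)

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.4%} -> {downtime_budget_minutes(slo):5.1f} min / 30 days")
# 99.9000% ->  43.2 min / 30 days
# 99.9500% ->  21.6 min / 30 days
# 99.9900% ->   4.3 min / 30 days
```
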
Failure

Failure Modes You'll See

  • Hardware: disks, NICs, single nodes. Common, mitigated by redundancy.
  • Network: packet loss, DNS hiccups, partitions. Less common, much harder to reason about.
  • Software: bugs, OOMs, deadlocks, regressions on deploy. The most common cause of incidents.
  • Capacity: traffic spike, queue backup, slow downstream. Predictable in theory, surprising in practice.
  • Human: a misconfigured rule, a bad migration, a wrong button. The cause of most large outages.
  • Cloud / vendor: a provider region or service goes down. Rare but newsworthy when it happens.

A reliable system is one that turns each of these from "outage" into "non-event" or "small event."

Redundancy

The Cheapest Reliability Win

  • N+1 instances for every critical component. Two app servers, two database nodes, two of everything.
  • Multi-AZ for cloud workloads. Cheap, transparent, defends against datacenter-level events.
  • Multi-region for the rare strategic case. Expensive, complex, only for systems that can't tolerate a region outage.
  • Stateless app instances are infinitely cloneable. State is the part that requires care.
  • Independent failure domains. Two replicas on the same host, in the same AZ, with the same dependencies aren't redundancy — they're a single point of failure with extra cost.
Patterns

Building Blocks of Resilient Systems

Health checks & load balancers

The LB checks each backend; failing instances are removed from rotation automatically. Pair with separate liveness, readiness, and startup probes when running on Kubernetes — they answer different questions.
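
A minimal sketch of the liveness/readiness split; Flask is used for illustration, and `db_ping` is a hypothetical stand-in for a real dependency check:

```python
from flask import Flask

app = Flask(__name__)

def db_ping() -> bool:
    # Hypothetical: replace with a cheap query against the real database.
    return True

@app.route("/livez")
def livez():
    # Liveness: "is this process alive?" Don't check dependencies here,
    # or a database blip gets every replica restarted at once.
    return "ok", 200

@app.route("/readyz")
def readyz():
    # Readiness: "should this instance receive traffic right now?"
    if not db_ping():
        return "not ready", 503  # the LB pulls us from rotation
    return "ok", 200
```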

Timeouts everywhere

Every network call has a deadline. Without them, a single slow downstream becomes a thread-pool exhaustion outage. Default to short timeouts; lengthen consciously.
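
A minimal sketch with the `requests` library; the endpoint is hypothetical, the explicit (connect, read) deadline is the point:

```python
import requests

try:
    resp = requests.get(
        "https://downstream.internal/items",  # hypothetical endpoint
        timeout=(0.1, 0.5),                   # connect, read (seconds)
    )
    resp.raise_for_status()
except requests.exceptions.Timeout:
    resp = None  # deadline hit: fail fast or fall back, never wait forever
except requests.exceptions.RequestException:
    resp = None  # other transport errors: handle with the same discipline
```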

Retries with exponential backoff & jitter

Retry transient failures. Bounded attempts. Jittered delays so retries don't synchronize and stampede. Distinguish retryable vs terminal errors at the source.
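
A minimal sketch of bounded retries with capped exponential backoff and full jitter; `TransientError` is a hypothetical marker type for retryable failures:

```python
import random
import time

class TransientError(Exception):
    """Retryable failures (timeouts, 503s). Terminal errors such as
    validation failures should use a different type and never retry."""

def retry_with_backoff(call, max_attempts=4, base=0.1, cap=2.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # bounded: give up eventually
            # Full jitter: uniform over [0, min(cap, base * 2^attempt)],
            # so synchronized clients don't retry in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```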

Circuit breakers

After N failures, stop calling the bad downstream for a cool-down window. Lets it recover instead of being beaten while down.
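
A minimal sketch of the idea; production libraries (resilience4j, Polly, pybreaker) add half-open probing, metrics, and per-error-type policies:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = None          # None means closed (calls allowed)

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cool-down over: let a probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # success closes the circuit
        return result
```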

Bulkheads

Isolate resource pools — thread pools, connection pools, rate-limit buckets — so one tenant or feature can't starve another. The shipping metaphor: water in one compartment doesn't sink the whole boat.
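
A minimal sketch using one bounded semaphore per downstream; the pool names and sizes are illustrative:

```python
import threading

POOLS = {
    "checkout": threading.BoundedSemaphore(20),  # critical path: big pool
    "reports":  threading.BoundedSemaphore(5),   # batchy work: small pool
}

def call_with_bulkhead(pool_name, fn):
    pool = POOLS[pool_name]
    if not pool.acquire(timeout=0.05):  # don't queue behind a slow pool
        raise RuntimeError(f"{pool_name} bulkhead full: shedding call")
    try:
        return fn()
    finally:
        pool.release()
```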

Graceful degradation

When something fails, serve a worse but useful response. Search down → show recent items. Recommendations down → show defaults. Some answer is much better than a 500.
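
A minimal sketch with hypothetical helpers; the shape of the fallback matters more than the details:

```python
class SearchDown(Exception):
    pass

def search_backend(query):   # hypothetical; simulates an outage here
    raise SearchDown("search cluster unreachable")

def recent_items(limit):     # hypothetical cheap fallback
    return [f"item-{i}" for i in range(limit)]

def search_results(query):
    try:
        return {"results": search_backend(query)}
    except SearchDown:
        # Degraded mode: a worse-but-useful answer beats a 500.
        return {"results": recent_items(limit=5), "degraded": True}

print(search_results("shoes"))  # falls back to recent items
```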

Idempotency & the outbox pattern

Retries are safe only if handlers are idempotent. The outbox pattern (write event + business state in one DB transaction; publish later) makes "publish or roll back" atomic.
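
A minimal sketch with SQLite standing in for the real database; the single-transaction write is the whole trick:

```python
import sqlite3, uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE links  (id TEXT PRIMARY KEY, url TEXT);
    CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def create_link(link_id, url):
    with db:  # one atomic transaction: both rows commit, or neither does
        db.execute("INSERT INTO links VALUES (?, ?)", (link_id, url))
        db.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()), f'{{"link_created": "{link_id}"}}'),
        )

# A separate worker drains unpublished outbox rows with at-least-once
# delivery, so consumers must dedupe on event_id (idempotency).
create_link("abc123", "https://example.com")
```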

Backpressure & load shedding

When you can't keep up, reject early at the edge. Rate limit per tenant; drop low-priority work first; surface 429s clients can respect.
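
A minimal sketch of a per-tenant sliding window; the limits are illustrative, and a production limiter would live at the edge:

```python
import time
from collections import defaultdict

WINDOW, LIMIT = 1.0, 100        # seconds, requests per window per tenant
hits = defaultdict(list)

def admit(tenant: str) -> bool:
    now = time.monotonic()
    recent = [t for t in hits[tenant] if now - t < WINDOW]
    hits[tenant] = recent
    if len(recent) >= LIMIT:
        return False            # caller returns HTTP 429: shed early
    recent.append(now)
    return True
```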

Blast Radius

Limit What One Failure Can Take Down

  • Cellular architecture / shuffle sharding. Partition users into independent cells. A bad deploy or a noisy tenant is contained to one cell.
  • Independent services for unrelated features. A bug in the analytics path shouldn't take down the redirect path.
  • Asynchronous boundaries. Decouple via queues so downstream pain doesn't propagate upstream.
  • Read/write separation. Reads can keep working when writes can't (and vice versa).
  • Feature flags. Turn off a misbehaving feature without a rollback.
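
A minimal kill-switch sketch; in practice the flag lives in a fast shared store (Redis, a config service) rather than a dict, and `attach_preview` is a hypothetical risky path:

```python
FLAGS = {"preview_fetch": True}    # read per request: flip without a deploy

def attach_preview(link):          # hypothetical risky enrichment
    raise RuntimeError("preview fetcher misbehaving")

def handle_redirect(link):
    if FLAGS.get("preview_fetch", False):
        try:
            attach_preview(link)
        except Exception:
            pass                   # a broken feature never blocks redirects
    return 302, link               # the core path always works

FLAGS["preview_fetch"] = False     # the kill switch: flip, don't redeploy
```
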
Deploy Safety

The #1 Source of Outages

Deployments cause far more incidents than hardware. Treat them with respect.

  • Progressive rollouts: rolling, blue/green, canary. Catch the bad version before it hits everyone.
  • Health gates & auto-rollback. If error rate or latency crosses a threshold during rollout, revert without paging.
  • Decouple deploy from release with feature flags. Code can be in production but disabled.
  • Database changes are special. Expand → migrate → contract; never break running code (sketched after this list).
  • No Friday afternoon deploys when the error budget is tight. The boring rule that prevents weekend pages.
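
A minimal sketch of expand → migrate → contract, with hypothetical table and column names; each step ships as its own deploy:

```python
# Goal: rename links.dest to links.url without breaking running code.

EXPAND = "ALTER TABLE links ADD COLUMN url TEXT"
# Deploy 1: code writes BOTH columns, still reads the old one.

MIGRATE = "UPDATE links SET url = dest WHERE url IS NULL"  # backfill in batches
# Deploy 2: code reads and writes only the new column.

CONTRACT = "ALTER TABLE links DROP COLUMN dest"
# Deploy 3: only after nothing references links.dest anymore.
```
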
Backup & Recovery

The Last Line of Defense

  • Automated, tested backups. Untested backups are wishes. Restore drills monthly.
  • Point-in-time recovery for transactional databases. RPO measured in seconds, not hours.
  • Off-account / off-region copies. A compromised AWS account shouldn't delete its own backups.
  • Immutable / WORM storage for critical backups (object lock). Defends against ransomware.
  • Documented runbooks for restore. Practice them before you need them.
Operating It

Reliability Is a Practice, Not a Property

  • Define SLOs and error budgets. Reliability you can measure is reliability you can prioritize.
  • On-call rotation with clear ownership. The team that builds it operates it.
  • Runbooks for the top 5 alerts on every service. New on-callers shouldn't have to invent the response at 3am.
  • Game days & chaos testing. Inject failures in staging (or carefully in prod) to find latent assumptions.
  • Blameless postmortems. Every incident → write-up → action items → tracked to completion. Without this, the same outage repeats.
  • Reduce toil. Repeated manual interventions are bugs in disguise. Automate or eliminate.
Worked Example

Reliability Plan for the URL Shortener

Targets

  • Redirect availability SLO 99.95% / 30 days.
  • Redirect latency SLO 99% < 50ms / 30 days.
  • RTO ≤ 15 min, RPO ≤ 1 min for link metadata; clicks may lose up to 1 hour in disaster.

Architecture

  • Two AZs; ≥ 2 stateless app instances per AZ behind a load balancer.
  • Postgres primary in AZ-A, hot standby in AZ-B; PITR enabled; daily off-account snapshot.
  • Redis with replication; service degrades to direct DB reads if Redis fails.
  • Click pipeline via queue with DLQ; redirect path is independent of click writes.

Resilience patterns

  • Per-call timeouts to DB (200ms) and Redis (50ms); circuit breakers wrap both.
  • Edge rate limit per IP; per-tenant quotas; load shedding at > 80% saturation.
  • Outbox + idempotent worker for click events; retries with exponential backoff + jitter.
  • Feature flags for risky paths (preview-fetch, abuse rules) — kill switch in < 60s.

Practice

  • Quarterly game day: kill primary DB, observe failover; kill Redis, observe degradation; partition AZs.
  • Monthly restore drill from off-account snapshot.
  • Top-5 runbooks: DB failover, Redis outage, queue backed up, deploy gone wrong, abuse spike.
  • Postmortem template + action-item tracker linked from every incident.
Common Pitfalls
  • "Reliability" without numbers. Without an SLO, "more reliable" is a feeling, not a target.
  • Untested redundancy. Two replicas you've never failed over to are one replica.
  • Retry storms. Naive retries amplify outages. Always add backoff + jitter + bounds.
  • Hidden coupling. "Independent" services that share a single Redis, a single database, a single shared library version.
  • No graceful degradation. A read-only mode beats a 500 every time.
  • Deploys without health gates. A bad deploy with no auto-rollback turns a small bug into a long outage.
  • One-shot fixes, no postmortem. Fix the symptom, miss the cause, watch it recur.
  • Heroics culture. Reliability built on the same three people staying up late doesn't last.