Architecture Deep Dive

Reliability — Surviving Failure

Things break. Disks die, networks partition, vendors have bad days, deploys go wrong. Reliability is the discipline of building systems where those events are absorbed, not amplified — and where recovery is rehearsed, not improvised.

Redundancy · Failover · Blast Radius · Chaos · Runbooks · Postmortems
Quick Facts

The Vocabulary

Words you'll use

  • Availability: fraction of time the system is up. 99.9% ≈ 43 minutes of downtime per month (see the arithmetic sketched after this list).
  • Durability: fraction of stored data not lost. Higher bar than availability.
  • Fault-tolerant: tolerates failures without user-visible impact.
  • Resilient: recovers quickly when impact is unavoidable.
  • RTO: Recovery Time Objective — how long until we're back up.
  • RPO: Recovery Point Objective — how much data we can afford to lose.
  • Blast radius: how much breaks when one thing fails.
  • MTTR / MTTD: mean time to recovery / detection.
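
These targets turn into concrete budgets with simple arithmetic; a minimal sketch, assuming a 30-day month:

```python
# Turn an availability target into a monthly downtime budget.
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    return days * 24 * 60 * (1 - availability)

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.4%} -> {downtime_budget_minutes(slo):5.1f} min / 30 days")
# 99.9000% ->  43.2 min / 30 days
# 99.9500% ->  21.6 min / 30 days
# 99.9900% ->   4.3 min / 30 days
```
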
Failure

Failure Modes You'll See

  • Hardware: disks, NICs, single nodes. Common, mitigated by redundancy.
  • Network: packet loss, DNS hiccups, partitions. Less common, much harder to reason about.
  • Software: bugs, OOMs, deadlocks, regressions on deploy. The most common cause of incidents.
  • Capacity: traffic spike, queue backup, slow downstream. Predictable in theory, surprising in practice.
  • Human: a misconfigured rule, a bad migration, a wrong button. The cause of most large outages.
  • Cloud / vendor: a provider region or service goes down. Rare but newsworthy when it happens.

A reliable system is one that turns each of these from "outage" into "non-event" or "small event."

Redundancy

The Cheapest Reliability Win

  • N+1 instances for every critical component. Two app servers, two database nodes, two of everything.
  • Multi-AZ for cloud workloads. Cheap, transparent, defends against datacenter-level events.
  • Multi-region for the rare strategic case. Expensive, complex, only for systems that can't tolerate a region outage.
  • Stateless app instances are infinitely cloneable. State is the part that requires care.
  • Independent failure domains. Two replicas on the same host, in the same AZ, with the same dependencies aren't redundancy — they're a single point of failure with extra cost.
Patterns

Building Blocks of Resilient Systems

Health checks & load balancers

The LB checks each backend; failing instances are removed from rotation automatically. Pair with separate liveness, readiness, and startup probes when running on Kubernetes — they answer different questions.
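
A minimal sketch of the liveness/readiness split; Flask is used for illustration, and `db_ping` is a hypothetical stand-in for a real dependency check:

```python
from flask import Flask

app = Flask(__name__)

def db_ping() -> bool:
    # Hypothetical: replace with a cheap query against the real database.
    return True

@app.route("/livez")
def livez():
    # Liveness: "is this process alive?" Don't check dependencies here,
    # or a database blip gets every replica restarted at once.
    return "ok", 200

@app.route("/readyz")
def readyz():
    # Readiness: "should this instance receive traffic right now?"
    if not db_ping():
        return "not ready", 503  # the LB pulls us from rotation
    return "ok", 200
```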

Timeouts everywhere

Every network call has a deadline. Without them, a single slow downstream becomes a thread-pool exhaustion outage. Default to short timeouts; lengthen consciously.
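
A minimal sketch with the `requests` library; the endpoint is hypothetical, the explicit (connect, read) deadline is the point:

```python
import requests

try:
    resp = requests.get(
        "https://downstream.internal/items",  # hypothetical endpoint
        timeout=(0.1, 0.5),                   # connect, read (seconds)
    )
    resp.raise_for_status()
except requests.exceptions.Timeout:
    resp = None  # deadline hit: fail fast or fall back, never wait forever
except requests.exceptions.RequestException:
    resp = None  # other transport errors: handle with the same discipline
```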

Retries with exponential backoff & jitter

Retry transient failures. Bounded attempts. Jittered delays so retries don't synchronize and stampede. Distinguish retryable vs terminal errors at the source.
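
A minimal sketch of bounded retries with capped exponential backoff and full jitter; `TransientError` is a hypothetical marker type for retryable failures:

```python
import random
import time

class TransientError(Exception):
    """Retryable failures (timeouts, 503s). Terminal errors such as
    validation failures should use a different type and never retry."""

def retry_with_backoff(call, max_attempts=4, base=0.1, cap=2.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # bounded: give up eventually
            # Full jitter: uniform over [0, min(cap, base * 2^attempt)],
            # so synchronized clients don't retry in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```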

Circuit breakers

After N failures, stop calling the bad downstream for a cool-down window. Lets it recover instead of being beaten while down.
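
A minimal sketch of the idea; production libraries (resilience4j, Polly, pybreaker) add half-open probing, metrics, and per-error-type policies:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = None          # None means closed (calls allowed)

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cool-down over: let a probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # success closes the circuit
        return result
```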

Bulkheads

Isolate resource pools — thread pools, connection pools, rate-limit buckets — so one tenant or feature can't starve another. The shipping metaphor: water in one compartment doesn't sink the whole boat.
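
A minimal sketch using one bounded semaphore per downstream; the pool names and sizes are illustrative:

```python
import threading

POOLS = {
    "checkout": threading.BoundedSemaphore(20),  # critical path: big pool
    "reports":  threading.BoundedSemaphore(5),   # batchy work: small pool
}

def call_with_bulkhead(pool_name, fn):
    pool = POOLS[pool_name]
    if not pool.acquire(timeout=0.05):  # don't queue behind a slow pool
        raise RuntimeError(f"{pool_name} bulkhead full: shedding call")
    try:
        return fn()
    finally:
        pool.release()
```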

Graceful degradation

When something fails, serve a worse but useful response. Search down → show recent items. Recommendations down → show defaults. Some answer is much better than a 500.
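
A minimal sketch with hypothetical helpers; the shape of the fallback matters more than the details:

```python
class SearchDown(Exception):
    pass

def search_backend(query):   # hypothetical; simulates an outage here
    raise SearchDown("search cluster unreachable")

def recent_items(limit):     # hypothetical cheap fallback
    return [f"item-{i}" for i in range(limit)]

def search_results(query):
    try:
        return {"results": search_backend(query)}
    except SearchDown:
        # Degraded mode: a worse-but-useful answer beats a 500.
        return {"results": recent_items(limit=5), "degraded": True}

print(search_results("shoes"))  # falls back to recent items
```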

Idempotency & the outbox pattern

Retries are safe only if handlers are idempotent. The outbox pattern (write event + business state in one DB transaction; publish later) makes "publish or roll back" atomic.
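
A minimal sketch with SQLite standing in for the real database; the single-transaction write is the whole trick:

```python
import sqlite3, uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE links  (id TEXT PRIMARY KEY, url TEXT);
    CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def create_link(link_id, url):
    with db:  # one atomic transaction: both rows commit, or neither does
        db.execute("INSERT INTO links VALUES (?, ?)", (link_id, url))
        db.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()), f'{{"link_created": "{link_id}"}}'),
        )

# A separate worker drains unpublished outbox rows with at-least-once
# delivery, so consumers must dedupe on event_id (idempotency).
create_link("abc123", "https://example.com")
```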

Backpressure & load shedding

When you can't keep up, reject early at the edge. Rate limit per tenant; drop low-priority work first; surface 429s clients can respect.
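
A minimal sketch of a per-tenant sliding window; the limits are illustrative, and a production limiter would live at the edge:

```python
import time
from collections import defaultdict

WINDOW, LIMIT = 1.0, 100        # seconds, requests per window per tenant
hits = defaultdict(list)

def admit(tenant: str) -> bool:
    now = time.monotonic()
    recent = [t for t in hits[tenant] if now - t < WINDOW]
    hits[tenant] = recent
    if len(recent) >= LIMIT:
        return False            # caller returns HTTP 429: shed early
    recent.append(now)
    return True
```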

Blast Radius

Limit What One Failure Can Take Down

  • Cellular architecture / shuffle sharding. Partition users into independent cells. A bad deploy or a noisy tenant is contained to one cell.
  • Independent services for unrelated features. A bug in the analytics path shouldn't take down the redirect path.
  • Asynchronous boundaries. Decouple via queues so downstream pain doesn't propagate upstream.
  • Read/write separation. Reads can keep working when writes can't (and vice versa).
  • Feature flags. Turn off a misbehaving feature without a rollback.
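
A minimal kill-switch sketch; in practice the flag lives in a fast shared store (Redis, a config service) rather than a dict, and `attach_preview` is a hypothetical risky path:

```python
FLAGS = {"preview_fetch": True}    # read per request: flip without a deploy

def attach_preview(link):          # hypothetical risky enrichment
    raise RuntimeError("preview fetcher misbehaving")

def handle_redirect(link):
    if FLAGS.get("preview_fetch", False):
        try:
            attach_preview(link)
        except Exception:
            pass                   # a broken feature never blocks redirects
    return 302, link               # the core path always works

FLAGS["preview_fetch"] = False     # the kill switch: flip, don't redeploy
```
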
Deploy Safety

The #1 Source of Outages

Deployments cause far more incidents than hardware. Treat them with respect.

  • Progressive rollouts: rolling, blue/green, canary. Catch the bad version before it hits everyone.
  • Health gates & auto-rollback. If error rate or latency crosses a threshold during rollout, revert without paging.
  • Decouple deploy from release with feature flags. Code can be in production but disabled.
  • Database changes are special. Expand → migrate → contract; never break running code (sketched after this list).
  • No Friday afternoon deploys when the error budget is tight. The boring rule that prevents weekend pages.
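
A minimal sketch of expand → migrate → contract, with hypothetical table and column names; each step ships as its own deploy:

```python
# Goal: rename links.dest to links.url without breaking running code.

EXPAND = "ALTER TABLE links ADD COLUMN url TEXT"
# Deploy 1: code writes BOTH columns, still reads the old one.

MIGRATE = "UPDATE links SET url = dest WHERE url IS NULL"  # backfill in batches
# Deploy 2: code reads and writes only the new column.

CONTRACT = "ALTER TABLE links DROP COLUMN dest"
# Deploy 3: only after nothing references links.dest anymore.
```
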
Backup & Recovery

The Last Line of Defense

  • Automated, tested backups. Untested backups are wishes. Restore drills monthly.
  • Point-in-time recovery for transactional databases. RPO measured in seconds, not hours.
  • Off-account / off-region copies. A compromised AWS account shouldn't delete its own backups.
  • Immutable / WORM storage for critical backups (object lock). Defends against ransomware.
  • Documented runbooks for restore. Practice them before you need them.
Operating It

Reliability Is a Practice, Not a Property

  • Define SLOs and error budgets. Reliability you can measure is reliability you can prioritize.
  • On-call rotation with clear ownership. The team that builds it operates it.
  • Runbooks for the top 5 alerts on every service. New on-callers shouldn't have to invent the response at 3am.
  • Game days & chaos testing. Inject failures in staging (or carefully in prod) to find latent assumptions.
  • Blameless postmortems. Every incident → write-up → action items → tracked to completion. Without this, the same outage repeats.
  • Reduce toil. Repeated manual interventions are bugs in disguise. Automate or eliminate.
Worked Example

Reliability Plan for the URL Shortener

Targets

  • Redirect availability SLO 99.95% / 30 days.
  • Redirect latency SLO 99% < 50ms / 30 days.
  • RTO ≤ 15 min, RPO ≤ 1 min for link metadata; clicks may lose up to 1 hour in disaster.

Architecture

  • Two AZs; ≥ 2 stateless app instances per AZ behind a load balancer.
  • Postgres primary in AZ-A, hot standby in AZ-B; PITR enabled; daily off-account snapshot.
  • Redis with replication; service degrades to direct DB reads if Redis fails.
  • Click pipeline via queue with DLQ; redirect path is independent of click writes.

Resilience patterns

  • Per-call timeouts to DB (200ms) and Redis (50ms); circuit breakers wrap both.
  • Edge rate limit per IP; per-tenant quotas; load shedding at > 80% saturation.
  • Outbox + idempotent worker for click events; retries with exponential backoff + jitter.
  • Feature flags for risky paths (preview-fetch, abuse rules) — kill switch in < 60s.

Practice

  • Quarterly game day: kill primary DB, observe failover; kill Redis, observe degradation; partition AZs.
  • Monthly restore drill from off-account snapshot.
  • Top-5 runbooks: DB failover, Redis outage, queue backed up, deploy gone wrong, abuse spike.
  • Postmortem template + action-item tracker linked from every incident.
Common Pitfalls
  • "Reliability" without numbers. Without an SLO, "more reliable" is a feeling, not a target.
  • Untested redundancy. Two replicas you've never failed over to are one replica.
  • Retry storms. Naive retries amplify outages. Always add backoff + jitter + bounds.
  • Hidden coupling. "Independent" services that share a single Redis, a single database, a single shared library version.
  • No graceful degradation. A read-only mode beats a 500 every time.
  • Deploys without health gates. A bad deploy with no auto-rollback turns a small bug into a long outage.
  • One-shot fixes, no postmortem. Fix the symptom, miss the cause, watch it recur.
  • Heroics culture. Reliability built on the same three people staying up late doesn't last.