Things break. Disks die, networks partition, vendors have bad days, deploys go wrong. Reliability is the discipline of building systems where those events are absorbed, not amplified — and where recovery is rehearsed, not improvised.
A reliable system is one that turns each of these from "outage" into "non-event" or "small event."
The load balancer checks each backend's health endpoint; failing instances are removed from rotation automatically. Pair this with separate liveness, readiness, and startup probes when running on Kubernetes — they answer different questions.
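A minimal sketch of separate liveness and readiness endpoints, using only the standard library. The paths and the dependency check (`database_reachable`) are illustrative assumptions; substitute whatever your service actually depends on.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def database_reachable() -> bool:
    # Placeholder dependency check; replace with a real connection test or ping.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/livez":
            # Liveness: the process is up and able to answer at all.
            self.send_response(200)
        elif self.path == "/readyz":
            # Readiness: the process can do useful work (dependencies reachable).
            self.send_response(200 if database_reachable() else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

The point of the split: a failed readiness check takes the instance out of rotation, while a failed liveness check restarts it — conflating the two turns a slow dependency into a restart loop.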
Every network call has a deadline. Without one, a single slow downstream becomes a thread-pool exhaustion outage. Default to short timeouts; lengthen consciously.
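A sketch of an explicit deadline on an outbound call; the URL and the 2-second budget are illustrative assumptions, not recommendations for your service.

```python
import socket
import urllib.request

def fetch_profile(user_id: str) -> bytes:
    url = f"https://profiles.internal.example/users/{user_id}"  # hypothetical endpoint
    # The timeout covers connect and read; without it, a hung downstream holds
    # this thread (and its pool slot) indefinitely.
    with urllib.request.urlopen(url, timeout=2.0) as resp:
        return resp.read()

try:
    fetch_profile("42")
except (socket.timeout, TimeoutError, OSError) as exc:
    # Deadline exceeded or connection failure: fail fast and let the caller decide.
    print(f"profile lookup failed fast: {exc}")
```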
Retry transient failures with bounded attempts and jittered delays, so retries don't synchronize and stampede. Distinguish retryable from terminal errors at the source.
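A minimal retry helper with bounded attempts and full jitter. The set of "retryable" exceptions here is an assumption — adapt it to the errors your client library actually raises.

```python
import random
import time

# Transient failures worth retrying; terminal errors (bad request, auth failure,
# ...) are deliberately not listed and propagate immediately.
RETRYABLE = (ConnectionError, TimeoutError)

def call_with_retries(fn, attempts=3, base_delay=0.1, max_delay=2.0):
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # bounded: give up after the last attempt
            # Full jitter: sleep a random amount up to an exponentially growing
            # cap, so synchronized clients don't stampede a recovering downstream.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```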
After N consecutive failures, stop calling the bad downstream for a cool-down window. This lets it recover instead of being hammered while it's down.
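A bare-bones circuit breaker sketch: after the threshold is hit, calls fail fast for a cool-down window, then a trial call is allowed through. The thresholds and the simplified half-open handling are assumptions for illustration.

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("downstream cooling down; failing fast")
            self.opened_at = None  # cool-down elapsed: allow a trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success resets the count
        return result
```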
Isolate resource pools — thread pools, connection pools, rate-limit buckets — so one tenant or feature can't starve another. The shipping metaphor: water in one compartment doesn't sink the whole boat.
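One way to sketch that compartmentalization, assuming a thread-pooled service: a separate bounded executor per feature, so a flood of slow report requests can't consume the workers that serve checkout. The pool names and sizes are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# One compartment per feature; exhausting one pool leaves the others untouched.
POOLS = {
    "checkout": ThreadPoolExecutor(max_workers=16, thread_name_prefix="checkout"),
    "reports": ThreadPoolExecutor(max_workers=4, thread_name_prefix="reports"),
}

def submit(feature: str, fn, *args):
    # Route work to its own bulkhead rather than a single shared pool.
    return POOLS[feature].submit(fn, *args)
```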
When something fails, serve a worse but useful response. Search down → show recent items. Recommendations down → show defaults. Some answer is much better than a 500.
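A fallback sketch for the recommendations case: if the call fails or times out, serve a static default list instead of an error. `fetch_recommendations` stands in for your real client call, and the defaults are placeholders.

```python
DEFAULT_RECOMMENDATIONS = ["bestsellers", "staff-picks", "new-arrivals"]

def recommendations_for(user_id: str, fetch_recommendations) -> list[str]:
    try:
        return fetch_recommendations(user_id)
    except Exception:
        # Degrade: a generic answer beats a 500. Log or alert here so the
        # degradation stays visible even though users never notice it.
        return DEFAULT_RECOMMENDATIONS
```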
Retries are safe only if handlers are idempotent. The outbox pattern (write event + business state in one DB transaction; publish later) makes "publish or roll back" atomic.
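A sketch of the outbox write path, using sqlite3 only for brevity: the order row and its event are inserted in one transaction, so either both commit or neither does. The schema and event shape are illustrative assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(order_id: str, total_cents: int) -> None:
    with conn:  # one transaction: business state and event commit together
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order_created", json.dumps({"order_id": order_id})),
        )

create_order("o-123", 4999)
```

A separate relay then polls `outbox`, publishes unpublished rows to the broker, and marks them published; retries there are safe because consumers are expected to handle the same event more than once.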
When you can't keep up, reject early at the edge. Rate limit per tenant; drop low-priority work first; surface 429s clients can respect.
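A sketch of per-tenant token buckets for shedding load at the edge; the capacity and refill rate are illustrative, and turning a rejection into a 429 with Retry-After is left to your web framework.

```python
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, capacity=10, refill_per_second=5.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # out of tokens: caller should return 429

buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)

def admit(tenant_id: str) -> bool:
    # Reject at the edge, before the request consumes downstream capacity.
    return buckets[tenant_id].allow()
```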
Deployments cause far more incidents than hardware. Treat them with respect.