Distributed systems fail in ways nobody can predict. Chaos engineering is the practice of deliberately injecting failure — killing a node, partitioning the network, slowing a dependency — to verify the system handles it. The goal isn't destruction; it's confidence. The point of breaking things on Tuesday afternoon is so they don't break themselves on Saturday at 3 a.m.
Kill a pod / VM / container. The classic Chaos Monkey behavior. Verifies that orchestration replaces it, in-flight requests don't drop, sticky sessions migrate or recover. The first chaos test most teams run.
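A minimal sketch of that first experiment using the official kubernetes Python client: pick one pod behind a label and delete it, then watch whether a replacement comes up and whether any in-flight requests failed. The namespace and the app=my-service label are placeholders for your own workload.

```python
# Kill one random pod matching a label and let the orchestrator replace it.
# Namespace and label selector are placeholders for your own workload.
import random
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="default", label_selector="app=my-service"
).items
victim = random.choice(pods)
print(f"Killing {victim.metadata.name}")
v1.delete_namespaced_pod(victim.metadata.name, "default")
# Now observe: does a replacement appear? Did any in-flight requests drop?
```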
Drop packets between two services. Add 200ms of latency. Partition a network into halves. Verifies timeouts, retries, circuit breakers actually kick in — not just that they're configured. Most resilience patterns look fine until the network misbehaves; chaos surfaces the ones that don't work.
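One way to do the latency half on a Linux host is tc netem; a sketch below, assuming root access and an interface actually named eth0. Toxiproxy (further down) does the same at the proxy layer without touching the host.

```python
# Add 200ms +/- 50ms of latency to all egress traffic on eth0, then remove it.
# Requires root and the iproute2 "tc" tool; the interface name is an assumption.
import subprocess, time

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd.split(), check=True)

run("tc qdisc add dev eth0 root netem delay 200ms 50ms")
try:
    time.sleep(300)    # observe: do timeouts, retries, circuit breakers actually kick in?
finally:
    run("tc qdisc del dev eth0 root")    # always clean up the fault
```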
Make the database respond at 5x its normal latency. Make a downstream service take 10 seconds. Find the components without timeouts that exhaust their thread pools, retry storms that take down healthy services, queues that fill up because consumers can't keep up.
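The defense this experiment probes for looks roughly like this: every outbound call gets a hard timeout and a small retry budget, so a 10-second dependency costs a couple of seconds, not a worker thread. A sketch with requests; the URL and limits are illustrative.

```python
# Call a slow downstream with a hard timeout and a small, capped retry budget.
# Without the timeout, a 10s dependency pins a worker thread for 10s per request.
import time
import requests

def fetch_with_budget(url, attempts=2, connect_timeout=1.0, read_timeout=2.0):
    last_err = None
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=(connect_timeout, read_timeout))
        except requests.exceptions.RequestException as err:
            last_err = err
            time.sleep(0.2 * (attempt + 1))   # small backoff; no unbounded retry storm
    raise last_err                            # fail fast instead of queueing forever

# fetch_with_budget("http://inventory.internal/items")   # placeholder URL
```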
Fill a disk. Eat memory. Pin CPUs. Verifies you alert before resources run out and degrade gracefully when the alert fires. The class of failure that quietly accumulates and then becomes everyone's emergency at once.
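A crude local stand-in for the real tools (stress-ng, or a Chaos Mesh StressChaos experiment): pin the CPUs, hold some memory, fill a scratch file, then watch which alerts fire. Sizes and the scratch path are placeholders; don't point it at a disk you care about.

```python
# Crude resource pressure: pin CPUs, hold ~1 GiB of memory, write a ~1 GiB scratch file.
import multiprocessing, os

def burn_cpu():
    while True:
        pass                      # 100% of one core

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=burn_cpu, daemon=True)
               for _ in range(os.cpu_count() or 1)]
    for w in workers:
        w.start()

    hog = bytearray(1024 * 1024 * 1024)            # hold ~1 GiB of RAM

    with open("/tmp/chaos_fill.bin", "wb") as f:   # placeholder scratch path
        for _ in range(1024):
            f.write(os.urandom(1024 * 1024))       # write ~1 GiB to disk

    input("Pressure applied; check dashboards and alerts, then press Enter to stop.")
```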
Drop one cloud availability zone. The advanced version: drop one region. Tests multi-region failover, cross-AZ replication, RTO/RPO claims. Big deal to set up; one of the few tests that proves a "we're multi-region" claim is real.
Jump a node's clock ahead by an hour. Verifies token expiry, scheduled jobs, replication lag handling. Subtle bugs hide here.
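A sketch of the application-level check: make the clock injectable, then run the expiry logic as if the node's clock had jumped an hour. The token shape here is a hypothetical stand-in for whatever your auth layer actually issues.

```python
# Verify token-expiry logic survives a clock jumped one hour ahead.
# The token dict is hypothetical; a real system would check a JWT's `exp` claim.
import time

def token_is_valid(token, now=None):
    now = time.time() if now is None else now   # injectable clock for chaos/testing
    return now < token["expires_at"]

token = {"expires_at": time.time() + 30 * 60}   # issued for 30 minutes

assert token_is_valid(token)                               # normal clock: still valid
assert not token_is_valid(token, now=time.time() + 3600)   # clock +1h: must be rejected
```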
At the application layer: throw exceptions on a percentage of calls, return malformed responses, drop messages. Tools: Toxiproxy, Chaos Mesh, application-level fault-injection libraries. The unit-test of resilience patterns.
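A minimal version of what those libraries do, as a sketch: fail a configurable fraction of calls to a function and watch whether the caller's retries and fallbacks hold up. The names and failure rate are illustrative.

```python
# Fail a given fraction of calls to the wrapped function.
import functools, random

def inject_faults(rate=0.1, exc=RuntimeError("injected fault")):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                raise exc                 # simulated dependency failure
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(rate=0.2)                  # 20% of calls blow up
def charge_card(order_id):                # hypothetical business function
    return f"charged {order_id}"
```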
Kubernetes-native chaos (Chaos Mesh, LitmusChaos). Define experiments in YAML; the operator injects pod kills, network drops, IO faults, time skew. Both are open-source CNCF projects. The default if you live on Kubernetes.
Native AWS service. Run experiments that throttle DynamoDB, kill EC2 instances, partition VPCs. Integrates with stop conditions tied to CloudWatch alarms — automatic abort when blast radius grows.
Commercial chaos platform with broad infra coverage and a polished UI. Common in larger enterprises that want a managed experience and approval workflows.
Lightweight TCP proxy for adding latency, dropping connections, slicing bandwidth between two services. Excellent for integration tests and local chaos experiments. Doesn't need cluster privileges.
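A sketch of driving Toxiproxy from an integration test through its HTTP control API (port 8474 by default): route the test's database traffic through a proxy, add a latency toxic, run the test, remove the toxic. The field names below are my reading of the Toxiproxy API; double-check them against the version you run.

```python
# Point the app at 127.0.0.1:21212 instead of the real database, then poison the link.
# Toxiproxy's control API listens on :8474 by default; field names should be verified
# against the Toxiproxy version in use.
import requests

TOXIPROXY = "http://localhost:8474"

requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "postgres",
    "listen": "127.0.0.1:21212",
    "upstream": "db.internal:5432",        # placeholder upstream
}).raise_for_status()

# Add 200ms of latency on responses flowing back to the application.
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "name": "slow_reads",
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 200, "jitter": 50},
}).raise_for_status()

# ... run the integration test against 127.0.0.1:21212 ...

requests.delete(f"{TOXIPROXY}/proxies/postgres/toxics/slow_reads").raise_for_status()
```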
Other commercial / cloud-managed chaos platforms. Pick based on where your infrastructure lives and your appetite for managed vs. self-hosted.
Dev → staging → canary in prod → broader prod. Don't run your first experiment on Friday afternoon against production. Build the muscle (and the tooling) in environments where breakage is cheap.
"Under normal load, p99 latency is < 300ms and error rate is < 0.5%. We hypothesize that killing one of three replicas keeps both metrics within bounds for at least 60 seconds." Now you have something to verify; the experiment isn't theater.
Auto-abort when error rate exceeds X, when latency exceeds Y, when alarm Z fires. The chaos shouldn't outlast the bug it reveals. AWS FIS, Chaos Mesh, and Gremlin all support this.
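Where the platform doesn't provide stop conditions, a watchdog in the experiment script is a workable substitute: poll the guardrail metric and remove the fault the moment it's breached. A self-contained sketch; the metric source, threshold, and cleanup call are all placeholders.

```python
# Abort the experiment automatically when the error rate exceeds the guardrail.
# Metric source, threshold, and the cleanup call are placeholders for your own setup.
import time
import requests

PROM = "http://prometheus.internal:9090"          # placeholder
ERR_QUERY = 'sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))'
GUARDRAIL = 0.02                                   # abort at 2% errors (pick your own X)

def error_rate():
    r = requests.get(f"{PROM}/api/v1/query", params={"query": ERR_QUERY})
    r.raise_for_status()
    return float(r.json()["data"]["result"][0]["value"][1])

def remove_fault():
    # Whatever undoes the injected failure: delete the toxic, revert the netem rule, etc.
    requests.delete("http://localhost:8474/proxies/postgres/toxics/slow_reads")

end = time.time() + 600                            # hard cap on experiment length
while time.time() < end:
    if error_rate() > GUARDRAIL:
        remove_fault()
        raise SystemExit("Stop condition hit: fault removed, experiment aborted.")
    time.sleep(5)
remove_fault()                                     # normal end: clean up anyway
```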
Tell the on-call team. Tell anyone who shares a downstream dependency. Posting in a #chaos channel before each experiment prevents "is this real?" panic and respects the time of people who'd otherwise treat it as a real outage.
A scheduled session — half a day, a full day — with a coordinator, a known scenario, observers measuring response, and a debrief at the end. Famously practiced at Stripe, Google, AWS, Netflix. Way more learning per hour than reading runbooks.
Every weak spot becomes a JIRA ticket: a missing timeout, a too-permissive retry, an unalerted failure mode. Track them as a fix backlog separately from feature work. The whole point is building a list of real resilience improvements.
You're ready when you can see the system's steady state: dashboards for latency, error rate, and throughput; alerts that fire when those metrics drift; and enough logging or tracing to tell which component broke during an experiment.
If you don't have those, fix observability first. Chaos without instrumentation is just an outage.