Test Types Deep Dive · 8 of 8

Chaos Engineering — Break It on Purpose, in Daylight

Distributed systems fail in ways nobody can predict. Chaos engineering is the practice of deliberately injecting failure — killing a node, partitioning the network, slowing a dependency — to verify the system handles it. The goal isn't destruction; it's confidence. The point of breaking things on Tuesday afternoon is so they don't break themselves on Saturday at 3 a.m.

Resilience · Failure Injection · Chaos Monkey · Game Days · SRE

What Chaos Engineering Is

Basic Concepts

  • Origin: Netflix, early 2010s. Chaos Monkey (open-sourced in 2012) randomly killed EC2 instances in production to force engineers to design for failure from day one.
  • The principle: the only way to know your system tolerates a failure is to cause one and watch.
  • Hypothesis-driven: "we believe the system handles X. Let's test it." Not "let's break random things and see."
  • Blast radius matters. Start tiny — one container in dev. Grow only when each step is uneventful. The goal is controlled failure, not the real kind.
  • Game days: scheduled, team-wide chaos exercises with a known scenario, observers, and a clear "abort" plan.
What to Inject

The Common Failure Modes

Instance Failure

Kill a pod / VM / container. The classic Chaos Monkey behavior. Verifies that orchestration replaces it, in-flight requests don't drop, sticky sessions migrate or recover. The first chaos test most teams run.

Network Failure

Drop packets between two services. Add 200ms of latency. Partition the network into halves. Verifies that timeouts, retries, and circuit breakers actually kick in — not just that they're configured. Most resilience patterns look fine until the network misbehaves; chaos surfaces the ones that don't.
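The "actually kick in" part is testable in code as well as in infrastructure. Here is a minimal circuit-breaker sketch — class name, thresholds, and behavior are illustrative, not any particular library's API:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, fails fast while open, and half-opens (allows one trial
    call) after `reset_after` seconds."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

A network-failure experiment is what proves a breaker like this trips under real packet loss, not just in unit tests.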

Dependency Slowdown

Make the database respond at 5x its normal latency. Make a downstream service take 10 seconds. Find the components without timeouts that exhaust their thread pools, retry storms that take down healthy services, queues that fill up because consumers can't keep up.
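The client-side discipline that keeps a slow dependency from becoming a retry storm — bounded attempts, per-call timeouts, jittered backoff — can be sketched like this (every knob here is illustrative):

```python
import random
import time

def call_with_budget(fn, timeout_s=1.0, max_attempts=3, base_backoff_s=0.1):
    """Bounded retries with exponential backoff and full jitter.
    A per-call timeout plus a hard attempt cap keeps one slow
    dependency from multiplying load across every client."""
    for attempt in range(max_attempts):
        try:
            return fn(timeout=timeout_s)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the failure
            # Full jitter spreads retries out instead of synchronizing
            # every client into a thundering herd.
            time.sleep(random.uniform(0, base_backoff_s * 2 ** attempt))
```

A dependency-slowdown experiment is exactly what exposes callers that lack this budget: unbounded retries against a 10-second downstream are where retry storms begin.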

Resource Exhaustion

Fill a disk. Eat memory. Pin CPUs. Verifies that you alert before resources run out and that the system degrades gracefully when the alert fires. This is the class of failure that quietly accumulates and then becomes everyone's emergency at once.

Region / AZ Failure

Drop one cloud availability zone. The advanced version: drop one region. Tests multi-region failover, cross-AZ replication, RTO/RPO claims. Expensive to set up, but one of the few tests that proves a "we're multi-region" claim is real.

Clock Skew & Time Travel

Jump a node's clock ahead by an hour. Verifies token expiry, scheduled jobs, replication lag handling. Subtle bugs hide here.
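A clock-jump experiment exercises logic like skew-tolerant expiry checking. A sketch, with a hypothetical tolerance value and function name:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical skew allowance; real systems tune this to their NTP budget.
CLOCK_SKEW_TOLERANCE = timedelta(minutes=5)

def token_is_valid(expires_at, now=None):
    """Accept a token until its expiry plus a small skew window, so a
    node whose clock drifted slightly ahead doesn't start rejecting
    every live token at once."""
    now = now if now is not None else datetime.now(timezone.utc)
    return now <= expires_at + CLOCK_SKEW_TOLERANCE
```

Jumping a node's clock ahead by a full hour should blow past any reasonable tolerance — and show you which components fail loudly versus which silently misbehave.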

Bad Inputs / Fault Injection at the Code Layer

At the application layer: throw exceptions on a percentage of calls, return malformed responses, drop messages. Tools: Toxiproxy, Chaos Mesh, application-level fault-injection libraries. The unit test of resilience patterns.
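A minimal application-level fault injector can be sketched as a Python decorator — the names and knobs here are illustrative, not any specific library's API:

```python
import functools
import random

def inject_faults(rate, exc=ConnectionError, rng=random.random):
    """Raise `exc` on roughly `rate` fraction of calls to the wrapped
    function. `rng` is injectable so tests can make faults deterministic."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng() < rate:
                raise exc("injected fault")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical downstream call, failing on ~25% of invocations.
@inject_faults(rate=0.25)
def fetch_profile(user_id):
    return {"id": user_id}
```

Wrapping a real client call this way in a test environment shows whether callers handle the exception, retry sensibly, or fall over.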

Tooling

The Modern Stack

Chaos Mesh / LitmusChaos

Kubernetes-native chaos. Define experiments in YAML; the operator injects pod kills, network drops, IO faults, time skew. Both are CNCF projects, open source. The default if you live on Kubernetes.

AWS Fault Injection Simulator (FIS)

Native AWS service. Run experiments that throttle DynamoDB, kill EC2 instances, partition VPCs. Integrates with stop conditions tied to CloudWatch alarms — automatic abort when blast radius grows.

Gremlin

Commercial chaos platform with broad infra coverage and a polished UI. Common in larger enterprises that want a managed experience and approval workflows.

Toxiproxy

Lightweight TCP proxy for adding latency, dropping connections, slicing bandwidth between two services. Excellent for integration tests and local chaos experiments. Doesn't need cluster privileges.

Steadybit, Harness Chaos, Azure Chaos Studio

Other commercial / cloud-managed chaos platforms. Pick based on where your infrastructure lives and your appetite for managed vs. self-hosted.

How to Run It Safely

The Real Discipline

Start in a Lower Environment

Dev → staging → canary in prod → broader prod. Don't run your first experiment on Friday afternoon against production. Build the muscle (and the tooling) in environments where breakage is cheap.

Define a Steady State and a Hypothesis

"Under normal load, p99 latency is < 300ms and error rate is < 0.5%. We hypothesize that killing one of three replicas keeps both metrics within bounds for at least 60 seconds." Now you have something to verify; the experiment isn't theater.

Set Stop Conditions

Auto-abort when error rate exceeds X, when latency exceeds Y, when alarm Z fires. The chaos shouldn't outlast the bug it reveals. AWS FIS, Chaos Mesh, and Gremlin all support this.
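In code, a stop condition is just a guardrail check inside the experiment loop. A sketch with illustrative thresholds:

```python
def should_abort(metrics, max_error_rate=0.02, max_p99_ms=1000):
    """Guardrail: abort the moment either metric breaches its bound.
    Thresholds are illustrative; set them well inside real SLO limits."""
    return (metrics["error_rate"] > max_error_rate
            or metrics["p99_ms"] > max_p99_ms)

def run_with_guardrails(samples, revert):
    """Poll metrics during the experiment; revert immediately on breach."""
    for metrics in samples:
        if should_abort(metrics):
            revert()
            return "aborted"
    return "completed"
```

Managed tools wire the same idea to real alarms (e.g. FIS stop conditions tied to CloudWatch), but the logic is this simple: poll, compare, revert.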

Communicate

Tell the on-call team. Tell anyone who shares a downstream dependency. Posting in a #chaos channel before each experiment prevents "is this real?" panic and respects the time of people who'd otherwise treat it as a real outage.

Game Days

A scheduled session — half a day, a full day — with a coordinator, a known scenario, observers measuring response, and a debrief at the end. Famously practiced at Stripe, Google, AWS, Netflix. Way more learning per hour than reading runbooks.

Capture the Findings

Every weak spot becomes a JIRA ticket: a missing timeout, a too-permissive retry, an unalerted failure mode. Track them as a fix backlog separately from feature work. The whole point is building a list of real resilience improvements.

Decision

When You're Ready for Chaos Engineering

You're ready when:

  • You have observability — metrics, logs, traces — to see what happens during the experiment.
  • You have alerting that catches degradation, not just full outages.
  • You have automated rollback or kill-switch on the chaos itself.
  • The team understands "chaos failed" doesn't mean "the chaos tool broke" — it means it found something.

If you don't have those, fix observability first. Chaos without instrumentation is just an outage.
