Test Types Deep Dive · 5 of 8

Performance & Load Testing — How It Holds Up Under Pressure

Functional tests prove the system works. Performance tests prove it works fast enough, under enough load, for long enough. They catch the bugs nothing else can: the N+1 query that's fine on dev data, the connection pool that exhausts at 200 RPS, the cache that thrashes under realistic traffic. Most production outages are performance failures of code that passed every other test.

Tags: k6 · JMeter · Gatling · Locust · SLO · p99
Quick Facts

The Vocabulary

Basic Concepts

  • Latency: how long one request takes. Always look at percentiles, not averages — p50, p95, p99 (see the sketch just after this list).
  • Throughput: requests per second the system can sustain.
  • Concurrency: how many in-flight requests at once.
  • Saturation: when adding more load stops increasing throughput and starts increasing latency. Past saturation, things break.
  • Apdex / SLO: what counts as "fast enough" — codified, agreed, alerted on.
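
A quick sketch of why percentiles matter, in plain JavaScript (the latency samples are made up for illustration): sort the measurements and read percentiles off the sorted array instead of averaging.

```javascript
// Hypothetical latency samples (ms) from a test run: one slow outlier among normal requests.
const samples = [42, 38, 51, 47, 44, 39, 4900, 41, 45, 43];

// Nearest-rank percentile: sort ascending, index into the sorted array.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const avg = samples.reduce((sum, v) => sum + v, 0) / samples.length;
console.log(`avg=${avg.toFixed(0)}ms p50=${percentile(samples, 50)}ms p99=${percentile(samples, 99)}ms`);
// Prints avg=529ms p50=43ms p99=4900ms: the average is a number no user actually experienced,
// while p50 shows the typical request and p99 exposes the outlier.
```
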
The Family

Five Different Tests, Often Confused

Load Test

Run the system at expected peak traffic for a sustained window. Verify SLOs hold (p99 latency, error rate). The most common kind. The question it answers: "will Black Friday work?"
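
One way to express this is a k6 scenario with an open-model, constant arrival rate; the peak rate, duration, endpoint, and SLO numbers below are assumptions for illustration:

```javascript
import http from 'k6/http';

export const options = {
  scenarios: {
    peak_load: {
      executor: 'constant-arrival-rate', // open model: k6 holds the request rate constant
      rate: 1000,                        // assumed expected peak: 1,000 requests per second
      timeUnit: '1s',
      duration: '30m',                   // sustained window
      preAllocatedVUs: 200,              // VUs k6 may use to sustain the rate
      maxVUs: 1000,
    },
  },
  thresholds: {
    http_req_duration: ['p(99)<500'],    // illustrative SLO: p99 under 500ms
    http_req_failed: ['rate<0.001'],     // error rate under 0.1%
  },
};

export default function () {
  http.get('https://staging.example.test/api/checkout'); // hypothetical endpoint
}
```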

Stress Test

Push past expected peak until something breaks. Find the saturation point and the failure mode. The question: "if traffic doubles unexpectedly, do we degrade gracefully or fall over?" Look for: connection pool exhaustion, OOM, cascading timeouts, retry storms.
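
Sketched the same way, a k6 scenario that ramps the arrival rate past the assumed peak until the failure mode shows itself (all numbers and the endpoint are placeholders):

```javascript
import http from 'k6/http';

export const options = {
  scenarios: {
    stress: {
      executor: 'ramping-arrival-rate',
      startRate: 500,
      timeUnit: '1s',
      preAllocatedVUs: 500,
      maxVUs: 4000,
      stages: [
        { target: 1000, duration: '5m' }, // assumed expected peak
        { target: 2000, duration: '5m' }, // double it
        { target: 4000, duration: '5m' }, // keep pushing until something breaks
      ],
    },
  },
};

export default function () {
  http.get('https://staging.example.test/api/checkout'); // hypothetical endpoint
}
```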

Spike Test

Idle, then sudden burst, then idle again. Catches autoscaling lag, cold caches, JIT compilation hits, connection-pool ramp-up. The question: "what happens when we go from 100 RPS to 5,000 RPS in 30 seconds?"
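
A k6 sketch of that 100-to-5,000 RPS shape (again, numbers and endpoint are assumptions):

```javascript
import http from 'k6/http';

export const options = {
  scenarios: {
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 100,                      // near-idle baseline
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 5000,
      stages: [
        { target: 100, duration: '2m' },   // steady baseline
        { target: 5000, duration: '30s' }, // the spike
        { target: 5000, duration: '3m' },  // hold it while autoscaling and caches catch up
        { target: 100, duration: '2m' },   // recovery
      ],
    },
  },
};

export default function () {
  http.get('https://staging.example.test/api/search'); // hypothetical endpoint
}
```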

Soak / Endurance Test

Run at moderate load for hours or days. Catches memory leaks, resource leaks (file descriptors, DB connections), slowly growing heaps, and log-volume issues. Most leaks don't show up in a 5-minute load test.
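
A k6 soak sketch: moderate, constant load held long enough for leaks to surface (the VU count, duration, and endpoint are assumptions):

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  scenarios: {
    soak: {
      executor: 'constant-vus',
      vus: 100,        // assumed "moderate" concurrency
      duration: '8h',  // long enough for leaks and drift to show
    },
  },
  thresholds: {
    http_req_failed: ['rate<0.001'], // must hold for the whole window, not just the first minutes
  },
};

export default function () {
  http.get('https://staging.example.test/api/feed'); // hypothetical endpoint
  sleep(1);
}
```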

Microbenchmarks

One function, isolated, measured precisely. JMH (Java), BenchmarkDotNet (.NET), Rust's criterion, Go's testing.B. Catches algorithmic regressions and micro-optimizations. Different beast from load tests — beware of confusing one with the other.

Tools

The Modern Stack

k6 — The Default Today

Grafana's open-source load testing tool. Tests in JavaScript, executed by a Go runtime — fast, scriptable, CI-friendly. First-class Prometheus and OTel output. A modern team starting today usually picks k6.
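
A minimal k6 script looks roughly like this (URL and numbers are placeholders); run it with `k6 run script.js`:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 50,          // 50 virtual users
  duration: '1m',   // for one minute
  thresholds: {
    http_req_duration: ['p(95)<300'], // illustrative SLO; a failed threshold fails the run
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://staging.example.test/api/health'); // hypothetical endpoint
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time per virtual user between iterations
}
```

Because the test is plain JavaScript in a file, the same script runs on a laptop, in CI, or on a hosted load generator without changes.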

JMeter

The veteran. Java, GUI-driven (with a CLI for automation). Massive plugin ecosystem. Heavier than k6, but entrenched and stable in many enterprise shops. Apache project, free.

Gatling

Scala-based DSL for scenario scripting. Strong reports out of the box. Popular in JVM-heavy environments. Open source, with a commercial cloud edition.

Locust

Python-based and async; user-behavior scripts are plain Python and quick to write. Distributed mode for high load. Popular when the test team writes Python.

Cloud-Hosted Load Generators

Grafana k6 Cloud, BlazeMeter, Loader.io, AWS Distributed Load Testing. Spin up thousands of geographic load generators on demand. The right call when a single laptop can't generate the load you want.

Discipline

How to Make Performance Tests Useful

Define SLOs Before You Test

"p99 < 500ms at 1,000 RPS sustained, error rate < 0.1%" is a target. "It feels slow" isn't. Without an SLO, every test result is just a number.

Look at Percentiles, Not Averages

An average of 100ms can hide a p99 of 5 seconds. Users feel the tail. Most reporting libraries default to averages — change the defaults to p50/p95/p99 (and p99.9 if you have the volume) before reading anything.
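
If k6 is doing the reporting, one default worth changing is the end-of-test summary, whose trend stats lead with the average and stop at p(95); a sketch:

```javascript
// Make the summary report the tail instead of leading with the average.
export const options = {
  summaryTrendStats: ['p(50)', 'p(95)', 'p(99)', 'p(99.9)', 'max'],
};
```

The same list can also be passed on the command line via --summary-trend-stats.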

Realistic Data and Distribution

Tests against an empty DB lie. Use production-shaped data (anonymized) and production-shaped traffic patterns: 80% reads, 20% writes; 90% small accounts, 10% whales. A perfectly balanced synthetic workload doesn't predict reality.
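
One way to approximate a production-shaped traffic mix in a k6 default function (the 80/20 split, endpoints, and payload are assumptions):

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

export default function () {
  if (Math.random() < 0.8) {
    // ~80% reads: hypothetical product lookup
    http.get('https://staging.example.test/api/products/42');
  } else {
    // ~20% writes: hypothetical add-to-cart
    http.post(
      'https://staging.example.test/api/cart',
      JSON.stringify({ productId: 42, qty: 1 }),
      { headers: { 'Content-Type': 'application/json' } },
    );
  }
  sleep(1);
}
```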

Test in a Production-Shaped Environment

Tests on a 2-vCPU laptop don't predict 16-vCPU prod behavior. If you can't test in prod-equivalent infra, at least test in something that scales linearly — and document the scaling factor you assumed.

Profile When You Find a Bottleneck

"It's slow at 500 RPS" tells you nothing. Profilers — Java Flight Recorder, dotnet-trace, py-spy, pprof, Linux perf — tell you which function is hot and why. Don't optimize before profiling; you'll be wrong.

Watch the Whole System

The bottleneck is rarely the code under test. It's usually the DB, the connection pool, GC pauses, the network, the upstream you forgot to scale. Instrument everything (Prometheus, OTel) and look at the whole picture during the test.

In CI: Smoke, Not Marathon

Don't run a 30-minute load test on every PR. Run a small smoke (60-second test against staging) to catch egregious regressions; do full load on a schedule, before releases, or before traffic events.
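
A smoke script for CI can be tiny; a sketch with deliberately loose thresholds and a hypothetical endpoint, run from the pipeline with `k6 run smoke.js`:

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  vus: 5,
  duration: '60s',
  thresholds: {
    http_req_duration: ['p(95)<1000'], // loose ceiling: catch egregious regressions only
    http_req_failed: ['rate<0.01'],    // the strict SLO is checked in the scheduled full run
  },
};

export default function () {
  const res = http.get('https://staging.example.test/api/health'); // hypothetical endpoint
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

A failed threshold makes k6 exit non-zero, which is what fails the CI job.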

Common Mistakes

The Honest List

  • Coordinated omission. A closed-model generator (a fixed pool of virtual users in a loop) waits for each response before sending the next request, so it sends less load exactly when the system slows down, and the recorded latencies under-represent the bad stretches, which are exactly the ones you care about. Prefer open-model (arrival-rate) load and tools that account for this (k6's arrival-rate executors, recent JMeter, properly-configured Gatling).
  • Testing the load generator. If your generator can't push 1,000 RPS, that's a generator bottleneck, not a system one. Confirm with multiple generators or a hosted service.
  • One-shot tests, no baselines. A number in isolation tells you nothing. Track results over time; alert on regressions.
  • Optimizing what's already fast. p99 = 200ms with a 5-second target — leave it alone. Optimize the slow paths first.
  • Forgetting backpressure. A system that "scales" by buffering 50,000 in-flight requests has just queued up a thundering herd. Test that requests fail fast under saturation, not slowly.