Functional tests prove the system works. Performance tests prove it works fast enough, under enough load, for long enough. They catch the bugs nothing else can: the N+1 query that's fine on dev data, the connection pool that exhausts at 200 RPS, the cache that thrashes under realistic traffic. Most production outages are performance failures of code that passed every other test.
Load testing: run the system at expected peak traffic for a sustained window and verify SLOs hold (p99 latency, error rate). The most common kind. The question it answers: "will Black Friday work?"
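Sizing a load test starts with a back-of-envelope question: how many concurrent virtual users do I need to sustain the target throughput? Little's Law (concurrency = arrival rate × time in system) answers it. A minimal sketch; the function name and parameters are illustrative, not from any tool:

```python
import math

def required_vus(target_rps: float, mean_latency_s: float) -> int:
    """Little's Law: L = lambda * W.
    A closed-loop load generator needs at least this many virtual
    users (each issuing one request at a time) to sustain target_rps
    when the system responds in mean_latency_s on average."""
    return math.ceil(target_rps * mean_latency_s)

# 1,000 RPS at a 250 ms mean latency needs ~250 concurrent VUs
print(required_vus(1000, 0.250))  # -> 250
```

If measured throughput plateaus below the target even as you add VUs, the system, not the generator, is saturated; that plateau is the number the stress test below goes looking for.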
Stress testing: push past expected peak until something breaks. Find the saturation point and the failure mode. The question: "if traffic doubles unexpectedly, do we degrade gracefully or fall over?" Look for: connection pool exhaustion, OOM, cascading timeouts, retry storms.
Spike testing: idle, then sudden burst, then idle again. Catches autoscaling lag, cold caches, JIT warm-up costs, connection-pool ramp-up. The question: "what happens when we go from 100 RPS to 5,000 RPS in 30 seconds?"
Soak testing: run at moderate load for hours or days. Catches memory leaks, resource leaks (file descriptors, DB connections), gradually growing heaps, log-volume issues. Most leaks don't show up in a 5-minute load test.
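The signature a soak test hunts for is memory that keeps growing between two snapshots taken at equal load. A minimal sketch using Python's stdlib `tracemalloc`; the simulated "leaky handler" is invented for illustration:

```python
import tracemalloc

leak = []  # simulated leak: retained state that grows on every "request"

def handle_request():
    leak.append(bytearray(1024))  # 1 KiB retained per call, never freed

tracemalloc.start()
for _ in range(1000):
    handle_request()
snap1 = tracemalloc.take_snapshot()
for _ in range(1000):
    handle_request()
snap2 = tracemalloc.take_snapshot()

# Equal load between snapshots, yet allocations keep growing:
# a steadily positive size_diff is the leak's fingerprint.
top = snap2.compare_to(snap1, "lineno")[0]
print(top.size_diff)  # roughly 1 MiB retained between snapshots
```

In a real soak you'd take snapshots (or read RSS / heap metrics) every few minutes and alert on a monotonic upward trend, not a single delta.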
Microbenchmarks: one function, isolated, measured precisely. JMH (Java), BenchmarkDotNet (.NET), Rust's criterion, Go's testing.B. Catches algorithmic regressions and validates micro-optimizations. A different beast from load tests — beware of confusing one with the other.
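To make the distinction concrete, here is the shape of a microbenchmark in Python's stdlib `timeit` (a rough stand-in for JMH or criterion, without their warm-up and statistical rigor). The two functions compared are invented examples:

```python
import timeit

def linear_sum(n: int) -> int:
    return sum(range(n))        # O(n)

def constant_sum(n: int) -> int:
    return n * (n - 1) // 2     # O(1) closed form

# Repeat and take the minimum: the min is the least noisy estimate of
# "how fast can this go", which is the question a microbenchmark asks.
slow = min(timeit.repeat(lambda: linear_sum(50_000), number=50, repeat=3))
fast = min(timeit.repeat(lambda: constant_sum(50_000), number=50, repeat=3))
print(f"O(n): {slow:.4f}s  O(1): {fast:.4f}s")
```

Note everything a load test has that this doesn't: no concurrency, no network, no database. A function that wins here can still lose in production behind a saturated connection pool.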
k6: Grafana's open-source load testing tool. Tests in JavaScript, executed by a Go runtime — fast, scriptable, CI-friendly. First-class Prometheus and OTel output. A modern team starting today usually picks k6.
Apache JMeter: the veteran. Java, GUI-driven (with a CLI mode). Massive plugin ecosystem. Heavy compared to k6, but a stable fixture in many enterprise shops. An Apache project, free.
Gatling: Scala-based DSL for scenario scripting. Strong reports out of the box. Popular in JVM-heavy environments. Open source, with a commercial cloud edition.
Locust: Python-based, async, with easy-to-write user-behavior scripts. Distributed mode for high load. Popular when the test team writes Python.
Cloud load generation: Grafana k6 Cloud, BlazeMeter, Loader.io, AWS Distributed Load Testing. Spin up thousands of geographically distributed load generators on demand. The right call when a single laptop can't generate the load you need.
"p99 < 500ms at 1,000 RPS sustained, error rate < 0.1%" is a target. "It feels slow" isn't. Without an SLO, every test result is just a number.
An average of 100ms can hide a p99 of 5 seconds. Users feel the tail. Most reporting libraries default to averages — change the defaults to p50/p95/p99 (and p99.9 if you have the volume) before reading anything.
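The average-hides-the-tail effect is easy to demonstrate with synthetic numbers (invented for illustration) and Python's stdlib `statistics`:

```python
from statistics import mean, quantiles

# 99 fast requests and one 5-second outlier: the mean looks fine,
# the tail does not.
latencies_ms = [50.0] * 99 + [5000.0]

p = quantiles(latencies_ms, n=100)  # p[k-1] is the k-th percentile
print(f"mean={mean(latencies_ms):.0f}ms "
      f"p50={p[49]:.0f}ms p99={p[98]:.0f}ms")
```

The mean comes out near 100 ms while p99 is near 5 seconds — exactly the gap described above. This is why dashboards should plot p50/p95/p99 side by side rather than a single "latency" line.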
Tests against an empty DB lie. Use production-shaped data (anonymized) and production-shaped traffic patterns: 80% reads, 20% writes; 90% small accounts, 10% whales. A perfectly balanced synthetic workload doesn't predict reality.
Tests on a 2-vCPU laptop don't predict 16-vCPU prod behavior. If you can't test in prod-equivalent infra, at least test in something that scales linearly — and document the scaling factor you assumed.
"It's slow at 500 RPS" tells you nothing. Profilers — Java Flight Recorder, dotnet-trace, py-spy, pprof, Linux perf — tell you which function is hot and why. Don't optimize before profiling; you'll be wrong.
The bottleneck is rarely the code under test. It's usually the DB, the connection pool, GC pauses, the network, the upstream you forgot to scale. Instrument everything (Prometheus, OTel) and look at the whole picture during the test.
Don't run a 30-minute load test on every PR. Run a small smoke test (a 60-second run against staging) to catch egregious regressions; run full load tests on a schedule, before releases, and before known traffic events.