Architecture Deep Dive

Scalability — Growing Under Load

Scalability is the property of doing more without things breaking — or breaking gracefully when they finally do. Get the basics right (statelessness, the right caching layer, a database that can keep up) and most apps scale further than their builders expect.

Quick Facts

At a Glance

Basic Concepts

  • Vertical scaling (scale up): bigger machine. Simple, limited by what the cloud rents.
  • Horizontal scaling (scale out): more machines. Effectively unlimited; requires statelessness.
  • Throughput: requests per second the system can sustain.
  • Latency: time per request. Flat at low load, then climbs steeply as the system nears saturation.
  • Bottleneck: the resource saturated first — CPU, memory, disk, network, DB, downstream API.
  • Headroom: the gap between current load and capacity. Run at 50% so you can absorb spikes.
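
A quick way to turn throughput and headroom into an instance count, as a back-of-the-envelope sketch where the peak load and per-instance capacity are made-up numbers:

```python
import math

def required_instances(peak_rps: float, per_instance_rps: float, headroom: float = 0.5) -> int:
    """How many instances keep each one at or below the headroom target.

    headroom=0.5 means an instance should sit at ~50% of its measured capacity,
    leaving the other half to absorb spikes.
    """
    usable_rps = per_instance_rps * headroom
    return math.ceil(peak_rps / usable_rps)

# Hypothetical numbers: 12,000 RPS at peak, each instance sustains ~800 RPS in load tests.
print(required_instances(peak_rps=12_000, per_instance_rps=800))  # -> 30
```
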
Vertical vs Horizontal

Two Axes, Different Costs

              | Vertical                 | Horizontal
How           | Bigger CPU/RAM/SSD       | More instances behind a load balancer
Code changes  | None                     | Stateless app, externalized session, distributed cache
Ceiling       | What the cloud sells     | Effectively unlimited (within reason)
Failure mode  | One big machine fails    | Many small failures, easier to absorb
Cost curve    | Linear, then exponential | Roughly linear with operations overhead

Default approach: scale vertically until it's awkward, scale horizontally for headroom and resilience. "Scale horizontally first" is fashionable advice that's wrong for most apps.

Statelessness

The Property That Unlocks Horizontal Scale

  • No in-process session state. Move sessions to Redis, the DB, or signed cookies/JWTs.
  • No local files needed for correctness. Push uploads/downloads to object storage (S3 et al).
  • No sticky load balancing required. Any instance can serve any request.
  • Idempotent operations for retries — important when scale means clients give up and retry.
  • 12-factor is largely about achieving this — config from env, processes are disposable, port-binding, dependency declarations.

A stateless app scales by adding instances. A stateful app scales by suffering.
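
Idempotency is the piece that is easiest to get wrong, so here is a minimal sketch of an idempotency-key check backed by Redis (key names, TTLs, and the create_link stub are hypothetical; assumes the redis-py client):

```python
import json
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def create_link(payload: dict) -> dict:
    # Placeholder for the real write path (INSERT into Postgres, etc.).
    return {"code": "abc123", "url": payload["url"]}

def handle_create_link(idempotency_key: str, payload: dict) -> dict:
    """Process a request at most once per client-supplied idempotency key.

    Because the key lives in Redis rather than in process memory, any
    instance behind the load balancer can serve the retry.
    """
    cache_key = f"idem:{idempotency_key}"

    # Reserve the key only if no instance has seen it yet (SET ... NX).
    reserved = r.set(cache_key, "in-progress", nx=True, ex=24 * 3600)
    if not reserved:
        cached = r.get(cache_key)
        if cached and cached != "in-progress":
            return json.loads(cached)  # replay the original response
        raise RuntimeError("request already in flight; retry later")

    result = create_link(payload)
    r.set(cache_key, json.dumps(result), ex=24 * 3600)
    return result
```
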

Database

Where Scale Actually Hurts

App tier scaling is the easy part. The database is where most apps run into a real ceiling.

  • Read replicas. Route reads to followers, writes to the primary (a routing sketch follows this list). Watch for replication lag.
  • Connection pooling. PgBouncer / RDS Proxy. App-tier pools alone don't scale to hundreds of instances.
  • Indexes & query plans. A missing index can be the difference between 10 and 10,000 RPS. EXPLAIN before scaling out.
  • Materialized views & rollups. Pre-compute expensive aggregates; refresh on a schedule or on writes.
  • Partitioning. Split a huge table by range/list/hash. Native to most modern databases.
  • Sharding. Multiple independent databases keyed by tenant/user. Powerful and operationally heavy — only when you must.
  • Mixed stores. OLTP for writes, OLAP / warehouse for analytics; search engine for full-text. The right tool for the right query.
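
Here is roughly what the read-replica routing above looks like in application code: a sketch assuming psycopg2 with placeholder DSNs; a real setup would add pooling and failover handling.

```python
import itertools
import psycopg2  # assumes psycopg2 is installed; the DSNs below are placeholders

PRIMARY_DSN = "host=db-primary dbname=app user=app"
REPLICA_DSNS = [
    "host=db-replica-1 dbname=app user=app",
    "host=db-replica-2 dbname=app user=app",
]
_replicas = itertools.cycle(REPLICA_DSNS)

def get_connection(readonly: bool):
    """Send reads to a replica (round-robin) and writes to the primary.

    Replica reads can lag the primary slightly: fine for resolving short
    links, not fine for read-your-own-write flows such as "show me the
    link I just created".
    """
    dsn = next(_replicas) if readonly else PRIMARY_DSN
    return psycopg2.connect(dsn)
```
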
Caching & Async

Take Work Off the Hot Path

  • Caching at the edge (CDN), at the app (Redis), and inside processes (LRU). Multiple layers compound; a cache-aside sketch follows this list.
  • Queues for any work that doesn't have to happen during the request — email, analytics, search indexing, notifications.
  • Accept, then process: write the minimal record, return 202 Accepted, and let async workers do the rest.
  • Pre-compute, don't compute on read. Counts, leaderboards, feeds — compute on write or in batch.
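
The cache-aside pattern (which the app-tier caching bullet above and the worked example below both lean on) is short enough to show whole. A sketch assuming the redis-py client, with illustrative key names and TTLs:

```python
import json
import random
import redis  # assumes the redis-py client is installed

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
BASE_TTL = 3600  # one hour

def fetch_link_from_db(code: str) -> dict | None:
    # Placeholder for a SELECT against the primary or a replica.
    return {"code": code, "url": "https://example.com"}

def get_link(code: str) -> dict | None:
    """Cache-aside read: try Redis, fall back to the database, repopulate.

    The TTL is jittered so keys cached at the same time don't all expire
    at the same time and stampede the database.
    """
    key = f"link:{code}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    row = fetch_link_from_db(code)
    if row is not None:
        cache.set(key, json.dumps(row), ex=BASE_TTL + random.randint(0, 300))
    return row
```
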
Backpressure

What to Do When You Can't Keep Up

Saturation is inevitable. The question is what happens next.

  • Rate limit at the edge. Reject early; don't accept work you can't do.
  • Concurrency caps per service to keep tail latency bounded.
  • Bounded queues. Unbounded queues hide problems and cause memory blowups.
  • Load shedding. When health crosses a threshold, drop low-priority traffic before high-priority falls over.
  • Circuit breakers. Stop hammering a struggling downstream — fast-fail until it recovers.
  • Bulkheads. Isolate resource pools so one tenant or feature can't starve the others.
  • Retries with backoff & jitter. Naive retries amplify load — that's how outages turn into incidents.
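
The retry bullet is the one most worth making concrete, since naive retry loops are such a common outage amplifier. A minimal sketch of capped exponential backoff with full jitter (the base delay, cap, and attempt count are arbitrary):

```python
import random
import time

def call_with_retries(fn, max_attempts: int = 5, base: float = 0.1, cap: float = 5.0):
    """Retry fn() with capped exponential backoff and full jitter.

    Full jitter (sleep a random amount up to the backoff ceiling) spreads
    retries out so a struggling downstream isn't hit by synchronized waves.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            ceiling = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))
```
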
Auto-scaling

Right-Sizing Without Watching Dashboards

  • Horizontal Pod Autoscaler (HPA) on Kubernetes — scale on CPU, RPS, queue depth, custom metrics.
  • Cloud-native equivalents: AWS Auto Scaling, GCP Managed Instance Groups, Azure VMSS.
  • Scale up fast, scale down slowly. Cold-starts are expensive; spurious shrinking re-creates the problem.
  • Predictive / scheduled scaling for known patterns (Monday morning login storm, batch windows).
  • Watch the queue, not the CPU. Queue length leads CPU by minutes — better autoscale signal.
  • Quotas matter. Cloud limits, DB connection caps, API rate limits — hit any one and "infinite" scaling stops.
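
The arithmetic behind these autoscalers is simple: scale replicas in proportion to how far the chosen metric is from its target, which is roughly what Kubernetes' HPA does. A sketch of that calculation driven by queue depth (the target and bounds are made-up values):

```python
import math

def desired_replicas(current_replicas: int, queue_depth: int,
                     target_per_replica: int = 100,
                     min_replicas: int = 2, max_replicas: int = 50) -> int:
    """Scale worker replicas proportionally to observed queue depth.

    desired = ceil(current * observed / target), clamped to configured bounds.
    Scale-down should additionally be damped (cooldowns) so brief lulls
    don't shrink the fleet prematurely.
    """
    per_replica = queue_depth / max(current_replicas, 1)
    desired = math.ceil(current_replicas * per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# Hypothetical: 4 workers, 1,200 jobs queued, target 100 jobs per worker -> 12 workers.
print(desired_replicas(current_replicas=4, queue_depth=1200))
```
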
Capacity Testing

Know Your Numbers Before Black Friday

  • Load test with realistic traffic shapes — k6, Locust, Gatling (a Locust sketch follows this list). Steady state and spikes.
  • Identify the first bottleneck at increasing load. Fix it. Repeat. The bottleneck always moves.
  • Saturation tests: push past breaking point. How the system fails matters as much as when.
  • Soak tests: run at 70% load for hours. Memory leaks, slow disk fills, log rotation issues — they only show up over time.
  • Game-day exercises. Schedule a "Black Friday rehearsal" against a prod-shaped environment.
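
As a starting point for the load-test bullet above, a minimal Locust scenario (the paths, weights, and think times are illustrative, not a real traffic model):

```python
from locust import HttpUser, task, between

class ShortenerUser(HttpUser):
    """Rough traffic shape for the URL shortener: mostly resolves, some creates."""

    wait_time = between(0.1, 1.0)  # think time between requests per simulated user

    @task(9)
    def resolve_link(self):
        # Hot path: redirect lookup for a (hypothetical) existing code.
        self.client.get("/abc123", name="/:code")

    @task(1)
    def create_link(self):
        self.client.post("/links", json={"url": "https://example.com/some/long/path"})
```

Run it with `locust -f loadtest.py --host https://staging.example.com`, ramp users until something saturates, and note which resource gave out first.
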
Worked Example

Scaling the URL Shortener

Rough capacity walk for 10× and 100× growth.

Stage                              | Bottleneck                       | Fix
1× — small VM, single Postgres     | None                             | Ship it.
10× reads                          | Postgres CPU on resolves         | Add Redis cache (cache-aside, jittered TTL). Now > 95% of reads never hit Postgres.
100× reads                         | App-tier CPU on TLS & framework  | Run multiple stateless instances behind a load balancer; sessions in Redis; CDN caches 302s for popular codes.
10× writes (link creation bursts)  | Postgres write IOPS, DLQ filling | Tune indexes; partition clicks by month; offload analytics to a queue + worker fleet.
Multi-region                       | Cross-region DB latency          | Read replicas in each region; route reads locally; writes go to the primary region. Accept eventual consistency for stats.
1000× clicks                       | Single Postgres for clicks       | Sample writes (don't store every click); aggregate in a stream processor (Kafka + Flink) into rollup tables; warm tier in an OLAP store.

Notice: each step adds one piece of complexity to remove one bottleneck. Don't ship multi-region sharded analytics on day one.
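
The "sample writes" step in the 1000× row is worth a sketch, because the part people forget is scaling counts back up when reporting (the sample rate here is arbitrary):

```python
import random

SAMPLE_RATE = 0.01  # store roughly 1 in 100 click events

def enqueue_click_event(code: str) -> None:
    # Placeholder for a producer send to the click topic / queue.
    pass

def record_click(code: str) -> None:
    """Probabilistically sample click events instead of persisting every one.

    Write volume drops ~100x; per-link stats stay approximately correct as
    long as reports multiply the stored counts back up by 1 / SAMPLE_RATE.
    """
    if random.random() < SAMPLE_RATE:
        enqueue_click_event(code)

def estimated_clicks(sampled_count: int) -> int:
    return round(sampled_count / SAMPLE_RATE)
```
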

Common Pitfalls
  • Premature distribution. Microservices and sharding before they're needed. The complexity bills before the scale arrives.
  • Hot keys. One celebrity's data, one viral link — normal sharding falls flat. Plan a hot-key strategy.
  • N+1 in disguise. ORMs, GraphQL, REST sub-resources — the single biggest unforced scaling error (see the sketch after this list).
  • Synchronous fan-out. One request triggers ten downstream calls in series. Latency adds up; failures multiply.
  • Retries amplifying outages. A struggling service gets hammered into the ground by clients trying harder.
  • Unbounded everything. Lists, queues, caches, log retention. Always pick a limit; revisit as you grow.
  • Confusing performance with scalability. Fast for one user is not the same as fast for a million.
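
The N+1 pitfall is the one most worth seeing side by side: a sketch using plain SQL through a DB-API cursor, with hypothetical table and column names.

```python
def links_with_clicks_slow(cur, user_id: int) -> list[dict]:
    """N+1: one query for the links, then one more query per link."""
    cur.execute("SELECT id, code FROM links WHERE user_id = %s", (user_id,))
    out = []
    for link_id, code in cur.fetchall():
        cur.execute("SELECT count(*) FROM clicks WHERE link_id = %s", (link_id,))
        (clicks,) = cur.fetchone()
        out.append({"code": code, "clicks": clicks})
    return out

def links_with_clicks(cur, user_id: int) -> list[dict]:
    """Same result in one round trip: join and aggregate in the database."""
    cur.execute(
        """
        SELECT l.code, count(c.id) AS clicks
        FROM links l
        LEFT JOIN clicks c ON c.link_id = l.id
        WHERE l.user_id = %s
        GROUP BY l.code
        """,
        (user_id,),
    )
    return [{"code": code, "clicks": clicks} for code, clicks in cur.fetchall()]
```
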