By the end of this tutorial you'll have a Postgres read replica serving redirect reads, multiple app instances behind a load balancer, three runbooks, a chaos drill report, and a postmortem written from a real (small) failure.
In this module you will: route the GET /:code redirect to a read replica while writes go to the primary; run multiple app instances behind a load balancer; write runbooks in docs/ops/ for the most likely incidents; and run chaos drills, recording a postmortem in docs/postmortems/.

Add to docker-compose.yml:

services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: shortener
      POSTGRES_PASSWORD: dev
      POSTGRES_DB: shortener
    command: ["postgres", "-c", "wal_level=replica", "-c", "max_wal_senders=4", "-c", "hot_standby=on"]
    ports: ["5432:5432"]
    volumes: ["./ops/init-replica.sh:/docker-entrypoint-initdb.d/init.sh", "pgdata:/var/lib/postgresql/data"]
  db-replica:
    image: postgres:16-alpine
    depends_on: [db]
    environment:
      POSTGRES_USER: shortener
      POSTGRES_PASSWORD: dev
      PGPASSWORD: replpass  # lets pg_basebackup authenticate as the replicator user
    command:
      - bash
      - -c
      - |
        until pg_basebackup -h db -U replicator -D /var/lib/postgresql/data -P -R -X stream; do
          sleep 1
        done
        chown -R postgres:postgres /var/lib/postgresql/data
        exec su-exec postgres postgres
    ports: ["5433:5432"]

(Note: the alpine Postgres images ship su-exec rather than gosu.)
Create ops/init-replica.sh:
#!/bin/bash
set -e

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
	CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'replpass';
EOSQL

echo "host replication replicator 0.0.0.0/0 md5" >> "$PGDATA/pg_hba.conf"
In src/db.ts:
import pg from "pg";

export const writePool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
export const readPool = new pg.Pool({
connectionString: process.env.DATABASE_REPLICA_URL ?? process.env.DATABASE_URL
});
Use readPool for the redirect resolve and /stats/:code; use writePool for inserts/deletes/migrations.
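If you'd rather not decide pool-by-pool at each call site, one option is a tiny helper that routes by SQL verb. This is a sketch with a hypothetical name, not part of the tutorial's code — only plain SELECTs are safe on a replica; everything else must hit the primary:

```typescript
// Hypothetical helper: choose a pool based on the statement's leading verb.
// Anything that isn't a plain SELECT goes to the primary (writePool).
export function pickPool(sql: string): "read" | "write" {
  const verb = sql.trimStart().split(/\s+/)[0]?.toUpperCase() ?? "";
  return verb === "SELECT" ? "read" : "write";
}
```

Beware that this heuristic misroutes SELECTs with side effects (e.g. `SELECT nextval(...)`), which is why explicit per-call-site routing, as above, is the simpler default.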
Add to .env:
DATABASE_REPLICA_URL=postgres://shortener:dev@localhost:5433/shortener
Add to docker-compose.yml (here ...common stands for the environment variables both app instances already share, e.g. the database and replica URLs):

  app1: { build: ., environment: { PORT: 3000, ...common }, ports: ["3001:3000"] }
  app2: { build: ., environment: { PORT: 3000, ...common }, ports: ["3002:3000"] }
  caddy:
    image: caddy:alpine
    ports: ["8080:80"]
    volumes: ["./ops/Caddyfile:/etc/caddy/Caddyfile"]
Create ops/Caddyfile:
:80 {
reverse_proxy app1:3000 app2:3000 {
health_uri /health
health_interval 5s
lb_policy round_robin
lb_try_duration 2s
}
}
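What that Caddyfile buys you — round-robin rotation that skips unhealthy upstreams — behaves roughly like this in-memory sketch (a hypothetical class for intuition, not Caddy's actual implementation):

```typescript
// Miniature model of round-robin load balancing with health ejection.
export class RoundRobin {
  private i = 0;
  constructor(
    private upstreams: string[],
    private healthy = new Set(upstreams), // all upstreams start healthy
  ) {}

  markDown(u: string) { this.healthy.delete(u); }   // failed health check
  markUp(u: string) { if (this.upstreams.includes(u)) this.healthy.add(u); }

  // Rotate through upstreams, skipping any currently marked unhealthy.
  next(): string {
    for (let n = 0; n < this.upstreams.length; n++) {
      const u = this.upstreams[this.i % this.upstreams.length];
      this.i++;
      if (this.healthy.has(u)) return u;
    }
    throw new Error("no healthy upstreams");
  }
}
```

The health_interval / health_uri settings control how quickly markDown/markUp happen in the real proxy; lb_try_duration bounds how long a request waits for any healthy upstream.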
docker compose up -d --build app1 app2 caddy
Verify: hitting http://localhost:8080/health repeatedly succeeds. Stop one app (docker compose stop app1); requests still succeed. Restart it, then run the first chaos drill — stop Redis:

docker compose stop redis
# In another terminal: keep hitting /:code
while true; do curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/<code>; sleep 1; done
# Should keep returning 302 — slower but not failing
docker compose start redis
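The reason this drill degrades instead of failing is that the redirect's read path treats the cache as optional. A minimal sketch of that cache-aside fallback, with hypothetical Cache/Db interfaces standing in for your Redis client and read pool:

```typescript
// Hypothetical interfaces standing in for the Redis client and the Postgres read pool.
type Cache = { get(k: string): Promise<string | null>; set(k: string, v: string): Promise<void> };
type Db = { lookup(code: string): Promise<string | null> };

// Cache-aside with the cache treated as optional: a Redis outage degrades to
// "every read hits Postgres" (slower), never to a 5xx.
export async function resolveCode(code: string, cache: Cache, db: Db): Promise<string | null> {
  try {
    const hit = await cache.get(code);
    if (hit !== null) return hit;
  } catch {
    // Cache unreachable: swallow the error and fall through to the database.
  }
  const url = await db.lookup(code);
  if (url !== null) cache.set(code, url).catch(() => {}); // best-effort backfill
  return url;
}
```

The try/catch around the read and the swallowed rejection on the backfill are the two places an outage would otherwise leak into user-facing errors.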
Note observed numbers (latency, error rate) in docs/ops/chaos-redis.md.
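When you record latency, p95 is just a rank in the sorted samples. A quick nearest-rank sketch (an assumed method for the drill notes — monitoring tools often interpolate instead):

```typescript
// Nearest-rank percentile: sort the samples, take the value at rank ceil(p% * n).
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1; // 0-based index
  return sorted[Math.max(0, rank)];
}
```

Collect one sample per curl iteration (e.g. `%{time_total}`) and compare the p95 with Redis up vs. down.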
docker compose kill app1
# Requests still succeed via app2
# Caddy notices app1 is unhealthy and stops routing to it
docker compose start app1
Pause the replica:
docker compose pause db-replica
# Create a new link via primary (write goes to app1/app2 → writePool → db)
# Read it via /:code (read goes to readPool → db-replica)
# Read may 404 until you unpause
Document the user-visible behavior. Decide your policy: should newly-created codes read from primary for the first N seconds? Add a TODO to docs/ops/replica-lag.md.
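If you go with "read newly-created codes from the primary for the first N seconds," the policy can be sketched in a few lines. Names and the N = 5s window are assumptions here, not the tutorial's decision:

```typescript
// Read-your-writes policy sketch: pin reads of recently-created codes to the
// primary until replication has almost certainly caught up.
const RECENT_MS = 5_000; // assumed window; tune to your observed replica lag
const recentlyWritten = new Map<string, number>(); // code -> created-at (ms)

export function noteWrite(code: string, now = Date.now()): void {
  recentlyWritten.set(code, now);
}

export function poolFor(code: string, now = Date.now()): "primary" | "replica" {
  const createdAt = recentlyWritten.get(code);
  if (createdAt !== undefined && now - createdAt < RECENT_MS) return "primary";
  recentlyWritten.delete(code); // window expired; stop tracking
  return "replica";
}
```

Note this only works per-instance; behind the load balancer, the create and the first read may land on different app instances, so a shared store (or a cookie) would be needed for a real guarantee.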
Each under docs/ops/, ~one page each:
- db-down.md — symptoms, immediate actions, escalation, recovery, verification.
- queue-backed-up.md — how to inspect, scale workers, drain DLQ, root-cause.
- cache-cold.md — what user-facing latency looks like, capacity adjustments while warming, when to alert.

For the postmortem, use docs/postmortems/2026-XX-XX-redis-outage.md:
# Redis Outage Drill — 2026-XX-XX

## Summary
Stopped Redis at 14:02 UTC; app continued serving traffic from Postgres. Latency rose from p95=12ms to p95=78ms. No 5xx errors. Total impact window: 2 minutes.

## Timeline
- 14:02 — Redis stopped (planned)
- 14:02 — Caddy detected normal app health; no instance ejection
- 14:03 — p95 latency alert fired (latency SLO consuming budget)
- 14:04 — Redis started; p95 returned to baseline within 30s

## What went well
- Cache-aside fallback worked; no errors.
- Alert fired within 1 min.

## What didn't
- Postgres CPU jumped 4× during the outage; would have hit limits at 5× normal traffic.

## Action items
- [ ] Add per-key in-process LRU as second-tier cache (owner: …, by …)
- [ ] Add k6 capacity test with cache disabled (owner: …, by …)
git checkout -b module-10
git add .
git commit -m "module 10: read replica, multi-instance, chaos drills, runbooks, postmortem"
git push -u origin module-10
To detect a dead instance faster, make health_interval shorter. For production-grade replication and failover, look at repmgr or a managed service — or skip real replication entirely and simulate the read split by pointing two pools at the same DB.