By the end of this tutorial you'll have a Postgres read replica serving redirect reads, multiple app instances behind a load balancer, three runbooks, a chaos drill report, and a postmortem written from a real (small) failure.
In this module you will: route the GET /:code redirect to a read replica while writes go to the primary; run multiple app instances behind a load balancer; write runbooks in docs/ops/ for the most likely incidents; and run chaos drills, recording a postmortem in docs/postmortems/.

Add to docker-compose.yml:

services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: shortener
      POSTGRES_PASSWORD: dev
      POSTGRES_DB: shortener
    command: ["postgres", "-c", "wal_level=replica", "-c", "max_wal_senders=4", "-c", "hot_standby=on"]
    ports: ["5432:5432"]
    volumes: ["./ops/init-replica.sh:/docker-entrypoint-initdb.d/init.sh", "pgdata:/var/lib/postgresql/data"]
  db-replica:
    image: postgres:16-alpine
    depends_on: [db]
    environment:
      POSTGRES_USER: shortener
      POSTGRES_PASSWORD: dev
      PGPASSWORD: replpass  # lets pg_basebackup authenticate as the replicator user
    command:
      - bash
      - -c
      - |
        until pg_basebackup -h db -U replicator -D /var/lib/postgresql/data -P -R -X stream; do
          sleep 1
        done
        chown -R postgres:postgres /var/lib/postgresql/data
        exec su-exec postgres postgres
    ports: ["5433:5432"]

(Note: the alpine Postgres images ship su-exec rather than gosu.)
Create ops/init-replica.sh:
#!/bin/bash
set -e

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
	CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'replpass';
EOSQL

echo "host replication replicator 0.0.0.0/0 md5" >> "$PGDATA/pg_hba.conf"
In src/db.ts:
import pg from "pg";

export const writePool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
export const readPool = new pg.Pool({
connectionString: process.env.DATABASE_REPLICA_URL ?? process.env.DATABASE_URL
});
Use readPool for the redirect resolve and /stats/:code; use writePool for inserts/deletes/migrations.
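If you'd rather not decide pool-by-pool at each call site, one option is a tiny helper that routes by SQL verb. This is a sketch with a hypothetical name, not part of the tutorial's code — only plain SELECTs are safe on a replica; everything else must hit the primary:

```typescript
// Hypothetical helper: choose a pool based on the statement's leading verb.
// Anything that isn't a plain SELECT goes to the primary (writePool).
export function pickPool(sql: string): "read" | "write" {
  const verb = sql.trimStart().split(/\s+/)[0]?.toUpperCase() ?? "";
  return verb === "SELECT" ? "read" : "write";
}
```

Beware that this heuristic misroutes SELECTs with side effects (e.g. `SELECT nextval(...)`), which is why explicit per-call-site routing, as above, is the simpler default.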
Add to .env:
DATABASE_REPLICA_URL=postgres://shortener:dev@localhost:5433/shortener
Add to docker-compose.yml (here ...common stands for the environment variables both app instances already share, e.g. the database and replica URLs):

  app1: { build: ., environment: { PORT: 3000, ...common }, ports: ["3001:3000"] }
  app2: { build: ., environment: { PORT: 3000, ...common }, ports: ["3002:3000"] }
  caddy:
    image: caddy:alpine
    ports: ["8080:80"]
    volumes: ["./ops/Caddyfile:/etc/caddy/Caddyfile"]
Create ops/Caddyfile:
:80 {
reverse_proxy app1:3000 app2:3000 {
health_uri /health
health_interval 5s
lb_policy round_robin
lb_try_duration 2s
}
}
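What that Caddyfile buys you — round-robin rotation that skips unhealthy upstreams — behaves roughly like this in-memory sketch (a hypothetical class for intuition, not Caddy's actual implementation):

```typescript
// Miniature model of round-robin load balancing with health ejection.
export class RoundRobin {
  private i = 0;
  constructor(
    private upstreams: string[],
    private healthy = new Set(upstreams), // all upstreams start healthy
  ) {}

  markDown(u: string) { this.healthy.delete(u); }   // failed health check
  markUp(u: string) { if (this.upstreams.includes(u)) this.healthy.add(u); }

  // Rotate through upstreams, skipping any currently marked unhealthy.
  next(): string {
    for (let n = 0; n < this.upstreams.length; n++) {
      const u = this.upstreams[this.i % this.upstreams.length];
      this.i++;
      if (this.healthy.has(u)) return u;
    }
    throw new Error("no healthy upstreams");
  }
}
```

The health_interval / health_uri settings control how quickly markDown/markUp happen in the real proxy; lb_try_duration bounds how long a request waits for any healthy upstream.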
docker compose up -d --build app1 app2 caddy
Verify: hitting http://localhost:8080/health repeatedly succeeds. Stop one app (docker compose stop app1); requests still succeed. Restart it, then run the first chaos drill — stop Redis:

docker compose stop redis
# In another terminal: keep hitting /:code
while true; do curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/<code>; sleep 1; done
# Should keep returning 302 — slower but not failing
docker compose start redis
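The reason this drill degrades instead of failing is that the redirect's read path treats the cache as optional. A minimal sketch of that cache-aside fallback, with hypothetical Cache/Db interfaces standing in for your Redis client and read pool:

```typescript
// Hypothetical interfaces standing in for the Redis client and the Postgres read pool.
type Cache = { get(k: string): Promise<string | null>; set(k: string, v: string): Promise<void> };
type Db = { lookup(code: string): Promise<string | null> };

// Cache-aside with the cache treated as optional: a Redis outage degrades to
// "every read hits Postgres" (slower), never to a 5xx.
export async function resolveCode(code: string, cache: Cache, db: Db): Promise<string | null> {
  try {
    const hit = await cache.get(code);
    if (hit !== null) return hit;
  } catch {
    // Cache unreachable: swallow the error and fall through to the database.
  }
  const url = await db.lookup(code);
  if (url !== null) cache.set(code, url).catch(() => {}); // best-effort backfill
  return url;
}
```

The try/catch around the read and the swallowed rejection on the backfill are the two places an outage would otherwise leak into user-facing errors.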
Note observed numbers (latency, error rate) in docs/ops/chaos-redis.md.
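When you record latency, p95 is just a rank in the sorted samples. A quick nearest-rank sketch (an assumed method for the drill notes — monitoring tools often interpolate instead):

```typescript
// Nearest-rank percentile: sort the samples, take the value at rank ceil(p% * n).
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1; // 0-based index
  return sorted[Math.max(0, rank)];
}
```

Collect one sample per curl iteration (e.g. `%{time_total}`) and compare the p95 with Redis up vs. down.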
docker compose kill app1
# Requests still succeed via app2
# Caddy notices app1 is unhealthy and stops routing to it
docker compose start app1
Pause the replica:
docker compose pause db-replica
# Create a new link via primary (write goes to app1/app2 → writePool → db)
# Read it via /:code (read goes to readPool → db-replica)
# Read may 404 until you unpause
Document the user-visible behavior. Decide your policy: should newly-created codes read from primary for the first N seconds? Add a TODO to docs/ops/replica-lag.md.
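If you go with "read newly-created codes from the primary for the first N seconds," the policy can be sketched in a few lines. Names and the N = 5s window are assumptions here, not the tutorial's decision:

```typescript
// Read-your-writes policy sketch: pin reads of recently-created codes to the
// primary until replication has almost certainly caught up.
const RECENT_MS = 5_000; // assumed window; tune to your observed replica lag
const recentlyWritten = new Map<string, number>(); // code -> created-at (ms)

export function noteWrite(code: string, now = Date.now()): void {
  recentlyWritten.set(code, now);
}

export function poolFor(code: string, now = Date.now()): "primary" | "replica" {
  const createdAt = recentlyWritten.get(code);
  if (createdAt !== undefined && now - createdAt < RECENT_MS) return "primary";
  recentlyWritten.delete(code); // window expired; stop tracking
  return "replica";
}
```

Note this only works per-instance; behind the load balancer, the create and the first read may land on different app instances, so a shared store (or a cookie) would be needed for a real guarantee.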
Each under docs/ops/, ~one page each:
- db-down.md — symptoms, immediate actions, escalation, recovery, verification.
- queue-backed-up.md — how to inspect, scale workers, drain DLQ, root-cause.
- cache-cold.md — what user-facing latency looks like, capacity adjustments while warming, when to alert.

For the postmortem, use docs/postmortems/2026-XX-XX-redis-outage.md:
# Redis Outage Drill — 2026-XX-XX

## Summary
Stopped Redis at 14:02 UTC; app continued serving traffic from Postgres. Latency rose from p95=12ms to p95=78ms. No 5xx errors. Total impact window: 2 minutes.

## Timeline
- 14:02 — Redis stopped (planned)
- 14:02 — Caddy detected normal app health; no instance ejection
- 14:03 — p95 latency alert fired (latency SLO consuming budget)
- 14:04 — Redis started; p95 returned to baseline within 30s

## What went well
- Cache-aside fallback worked; no errors.
- Alert fired within 1 min.

## What didn't
- Postgres CPU jumped 4× during the outage; would have hit limits at 5× normal traffic.

## Action items
- [ ] Add per-key in-process LRU as second-tier cache (owner: …, by …)
- [ ] Add k6 capacity test with cache disabled (owner: …, by …)
git checkout -b module-10
git add .
git commit -m "module 10: read replica, multi-instance, chaos drills, runbooks, postmortem"
git push -u origin module-10
To detect a dead instance faster, make health_interval shorter. For production-grade replication and failover, look at repmgr or a managed service — or skip real replication entirely and simulate the read split by pointing two pools at the same DB.