Learning Track · 2 of 3

Build a Production-Grade Webhook Delivery Platform

Ten modules that take you from an empty Git repo to a distributed, resilient webhook service that handles millions of events with reliable delivery, exponential-backoff retries, dead-letter queues, and observability. This track exercises async systems, failure modes, and reliability patterns harder than most projects do.

Async systems · Queues · Retries · Dead-letter queues · Reliability · Observability
How to Use This Track

Learning by Shipping Async Systems

Ground rules

  • Embrace queues early. Webhook delivery is fundamentally async. Don't build synchronously and retrofit later.
  • Failure is the baseline. Network timeouts, retries, and dead-letter queues aren't add-ons — they're core design.
  • Observe everything. You can't debug async systems without excellent logging and metrics.
  • Pick a stack and stick with it. Node + BullMQ, Python + Celery, and Go + go-queue are all solid choices.
  • Test failure modes. Write tests that inject delays and failures. Chaos is your friend.

The ten modules

Module 01 · ~2–3 hrs

Foundations & Project Setup

Start with a clear design and a pleasant repo. Webhook delivery is complex — don't let tooling get in the way.

Tasks

  • Create a Git repo with standard config (.editorconfig, .gitignore, linter/formatter).
  • Scaffold an HTTP server. Pick one: Node/Express, Python/FastAPI, Go/Gin, or Rust/Axum.
  • Write a one-page design doc: problem statement, goals, API sketch, architecture diagram, failure modes you'll handle.
  • Create tables: webhooks (registrations), events (immutable log), deliveries (attempts and status).
  • Document your idempotency strategy in the design doc.
Acceptance criteria
  • git clone + one command spins up the server.
  • Design doc committed to /docs/design.md with a sketch of the three main tables.
  • Linter passes. Pre-commit hook blocks failures.
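One way the three tables might look, as a minimal sketch. It assumes SQLite so it runs with nothing installed; the capstone uses Postgres, where the column types would differ. Every column name here, and the UNIQUE (webhook_id, event_id) constraint, is an illustrative assumption rather than part of the track's spec.

```python
import sqlite3

# Hypothetical schema: webhooks (registrations), events (immutable log),
# deliveries (attempts and status). Adjust names and types for your database.
SCHEMA = """
CREATE TABLE webhooks (
    id          INTEGER PRIMARY KEY,
    url         TEXT NOT NULL,
    secret      TEXT NOT NULL,
    event_types TEXT               -- optional filter, stored as JSON text
);
CREATE TABLE events (
    id         INTEGER PRIMARY KEY,
    type       TEXT NOT NULL,
    payload    TEXT NOT NULL,      -- JSON body; rows are never updated
    created_at TEXT NOT NULL
);
CREATE TABLE deliveries (
    id          INTEGER PRIMARY KEY,
    delivery_id TEXT NOT NULL UNIQUE,   -- idempotency key sent to the customer
    webhook_id  INTEGER NOT NULL REFERENCES webhooks(id),
    event_id    INTEGER NOT NULL REFERENCES events(id),
    status      TEXT NOT NULL DEFAULT 'pending',
    attempts    INTEGER NOT NULL DEFAULT 0,
    last_error  TEXT,
    UNIQUE (webhook_id, event_id)       -- at most one delivery row per pair
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all three tables in one script."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The UNIQUE (webhook_id, event_id) constraint is one possible idempotency guard: enqueuing the same event twice for the same webhook fails at the database instead of producing a duplicate delivery.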
Module 02 · ~4–6 hrs

Core Webhook API

The contract: let customers register webhooks and emit events. Get this right before adding async machinery.

Tasks

  • Implement POST /webhooks to register a webhook: URL, secret (for signing), optional filter (event types).
  • Implement POST /events to emit an event: type, timestamp, payload. Store immediately in the events table (immutable log).
  • Validate input: URL format, event type enum, max payload size.
  • Return sensible HTTP status codes (201 on create, 400 on bad input, 404 on missing).
  • Document your event schema with examples (e.g., order.created, user.updated).
Acceptance criteria
  • curl -X POST /webhooks creates a webhook; GET /webhooks/:id returns it.
  • An event sent with curl -X POST /events appears in the events table.
  • Invalid URLs and missing fields are rejected with clear 4xx messages.
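The validation tasks above could be sketched as two framework-independent functions that return a list of problems, which your HTTP layer turns into a 400. The field names (url, secret, event_types, type, timestamp, payload), the contents of ALLOWED_EVENT_TYPES, and the 256 KiB cap are all assumptions for illustration.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse

# Assumptions: the enum holds the track's two example types, and the size
# cap is an arbitrary placeholder.
ALLOWED_EVENT_TYPES = {"order.created", "user.updated"}
MAX_PAYLOAD_BYTES = 256 * 1024


def validate_webhook(body: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the body is valid."""
    errors: List[str] = []
    url = body.get("url")
    try:
        parsed = urlparse(url) if isinstance(url, str) else None
    except ValueError:  # e.g. a malformed IPv6 literal
        parsed = None
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.hostname:
        errors.append("url must be an absolute http(s) URL")

    secret = body.get("secret")
    if not isinstance(secret, str) or not secret:
        errors.append("secret is required")

    event_types = body.get("event_types", [])
    if not isinstance(event_types, list):
        errors.append("event_types must be a list")
    else:
        unknown = [t for t in event_types
                   if not isinstance(t, str) or t not in ALLOWED_EVENT_TYPES]
        if unknown:
            errors.append(f"unknown event types: {unknown}")
    return errors


def validate_event(body: Dict[str, Any], raw_size_bytes: int) -> List[str]:
    """Check type, timestamp, payload, and the size of the raw request body."""
    errors: List[str] = []
    if raw_size_bytes > MAX_PAYLOAD_BYTES:
        errors.append(f"body exceeds {MAX_PAYLOAD_BYTES} bytes")
    event_type = body.get("type")
    if not isinstance(event_type, str) or event_type not in ALLOWED_EVENT_TYPES:
        errors.append("type must be a known event type")
    if not isinstance(body.get("timestamp"), str) or not body.get("timestamp"):
        errors.append("timestamp is required")
    if not isinstance(body.get("payload"), dict):
        errors.append("payload must be a JSON object")
    return errors
```

Returning every problem at once, rather than failing on the first, tends to make the 4xx response more useful to whoever is integrating.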
Module 03 · ~5–7 hrs

Reliable Delivery & Retries

The hard part: when you emit an event, every webhook must eventually get it — even if the customer's server is down for an hour.

Tasks

  • On POST /events, enqueue a delivery job for each matching webhook (use your job queue: BullMQ, Celery, SQS, etc.).
  • A worker processes jobs: POST the event to the webhook URL with a signed JSON body.
  • Implement exponential backoff retries: wait 1s, 4s, 16s, 64s, and 256s before successive retries (each delay is 4x the previous one), then stop after the 5th retry.
  • Implement idempotency: each delivery has a unique delivery_id in the request. Customers should deduplicate on this.
  • Track delivery state: pending, processing, success, failed (permanent).
Acceptance criteria
  • Emit an event and verify a delivery job is enqueued for each webhook.
  • Stop the worker and emit events; restart the worker and deliveries drain successfully.
  • A delivery to a URL that always fails transitions to the failed state once all 5 retries are exhausted.
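The retry schedule above works out to 4^n seconds for the n-th retry (counting from zero). A small sketch of that calculation and the resulting state decision follows; the function names are hypothetical, and how you hand the delay to your queue (BullMQ, Celery, SQS) depends on the stack you picked.

```python
from typing import Optional, Tuple

MAX_RETRIES = 5
BACKOFF_BASE = 4  # 4**0, 4**1, ... gives 1, 4, 16, 64, 256 seconds


def next_retry_delay(retries_so_far: int) -> Optional[int]:
    """Seconds to wait before the next retry, or None when retries are used up."""
    if retries_so_far >= MAX_RETRIES:
        return None
    return BACKOFF_BASE ** retries_so_far


def after_failed_attempt(retries_so_far: int) -> Tuple[str, Optional[int]]:
    """Decide the delivery's next state after an attempt has just failed.

    Returns ("pending", delay) to re-enqueue, or ("failed", None) to give up.
    """
    delay = next_retry_delay(retries_so_far)
    if delay is None:
        return ("failed", None)
    return ("pending", delay)
```

With this shape, the first attempt plus five retries makes six POSTs in total before a delivery is marked failed.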
Module 04 · ~4–6 hrs

Dead Letter Queue & Failure Handling

Some webhooks will fail permanently. Don't lose them — analyze, alert, and allow recovery.

Tasks

  • Create a dead_letter_queue table: deliveries that failed all retries go here.
  • Expose GET /dlq to list failed deliveries (paginated). Include: webhook_id, event, last_error, timestamp.
  • Add POST /dlq/:delivery_id/retry to allow manual retries from the DLQ.
  • Emit a metric/alert on DLQ writes so you catch silent failures.
  • Document your failure categorization: transient (retry) vs permanent (DLQ).
Acceptance criteria
  • A permanently-failed delivery appears in the DLQ.
  • POST /dlq/:delivery_id/retry moves it back to the queue and retries it.
  • You can query the DLQ and see the last error message.
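For the transient-versus-permanent categorization, here is one possible policy as a sketch. The specific choices (retry on 5xx, 408, 429, and on no response at all; treat every other non-2xx status as permanent) are assumptions you should adapt and document, not a rule the track prescribes.

```python
from typing import Optional

# Assumption: these 4xx codes signal "try again later" rather than a bad request.
RETRYABLE_4XX = {408, 429}


def classify_failure(status: Optional[int]) -> str:
    """Classify a failed delivery attempt as "transient" or "permanent".

    status is the HTTP status code, or None when no response arrived
    (timeout, connection refused, DNS failure).
    """
    if status is None:
        return "transient"
    if 200 <= status < 300:
        raise ValueError("2xx responses are successes, not failures")
    if status >= 500 or status in RETRYABLE_4XX:
        return "transient"
    return "permanent"
```

Under this policy a 404 or 410 goes straight to the DLQ without burning retries, while a 503 follows the normal backoff schedule and reaches the DLQ only after the last retry.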
Module 05 · ~5–7 hrs

Testing Async Systems

Async systems are slippery. Test them hard: race conditions, timing, failure injection.

Tasks

  • Unit tests for retry logic: backoff calculation, state transitions, idempotency.
  • Integration tests for delivery: spin up a fake webhook server, emit events, verify POSTs arrive.
  • Chaos tests: inject delays (slow webhook), errors (500, timeout), network partitions. Verify queue drains correctly.
  • Concurrency test: emit 1000 events to 10 webhooks; verify all 10k deliveries complete, both with a single worker and with several workers running concurrently.
  • Coverage threshold: 75% for core delivery logic.
Acceptance criteria
  • Test suite runs in under 1 minute.
  • You can describe race conditions your tests protect against (e.g., DLQ insertion while retrying).
  • A failing webhook doesn't block other deliveries (isolation).
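Failure injection does not need a real network to be useful. The sketch below uses a test double for the customer endpoint and an injected sleep function, so retry tests run instantly. FlakyReceiver and deliver are hypothetical names, and deliver is a deliberately simplified stand-in for your worker, not a production loop.

```python
from typing import Callable, List, Sequence


class FlakyReceiver:
    """Test double for a customer endpoint: fails N times, then returns 200."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success
        self.seen_delivery_ids: List[str] = []

    def __call__(self, delivery_id: str) -> int:
        self.seen_delivery_ids.append(delivery_id)
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            return 500
        return 200


def deliver(send: Callable[[str], int], delivery_id: str,
            delays: Sequence[int], sleep: Callable[[int], None]) -> str:
    """Attempt once, then retry after each delay; return the final state."""
    if 200 <= send(delivery_id) < 300:
        return "success"
    for delay in delays:
        sleep(delay)
        if 200 <= send(delivery_id) < 300:
            return "success"
    return "failed"


def test_recovers_after_two_failures() -> None:
    slept: List[int] = []
    receiver = FlakyReceiver(failures_before_success=2)
    assert deliver(receiver, "d-1", [1, 4, 16, 64, 256], slept.append) == "success"
    assert slept == [1, 4]  # waited only before the two retries
    assert receiver.seen_delivery_ids == ["d-1"] * 3  # same id on every attempt


def test_gives_up_after_all_retries() -> None:
    slept: List[int] = []
    receiver = FlakyReceiver(failures_before_success=100)
    assert deliver(receiver, "d-2", [1, 4, 16, 64, 256], slept.append) == "failed"
    assert slept == [1, 4, 16, 64, 256]
    assert len(receiver.seen_delivery_ids) == 6  # first attempt + 5 retries
```

Passing slept.append as the sleep function records the requested delays instead of waiting, which keeps the suite well under the one-minute target.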
Module 06 · ~4–5 hrs

Performance & Batching

Push throughput. Can you deliver 10k events/sec to 100k webhooks? Batching, parallelism, and backpressure are key.

Tasks

  • Profile the bottleneck: is it database inserts, queue enqueue, or worker throughput?
  • Batch enqueuing: when emitting an event, bulk-insert N delivery jobs (not one-by-one).
  • Run multiple worker processes (or threads). Test with k6 or Locust: measure events/sec, p95 latency.
  • Discuss backpressure: what happens when the queue is full? Do you reject POST /events or buffer?
  • Optimize: connection pooling, query batching, queue batch sizes.
Acceptance criteria
  • Load test shows you can deliver ≥1000 events/sec to 100 webhooks on local hardware.
  • Adding a 2nd worker scales throughput (roughly 2x, accounting for contention).
  • Load test report with graphs committed to /docs/perf.
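Batch enqueuing might look like the sketch below, which inserts delivery rows in chunks using one transaction per chunk. It assumes the deliveries live in a SQL table (SQLite here so the example runs standalone); the trimmed table, the enqueue_deliveries name, and the batch size of 500 are all illustrative. With a broker such as BullMQ or SQS you would use its own bulk API instead.

```python
import sqlite3
import uuid
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def enqueue_deliveries(conn: sqlite3.Connection, event_id: int,
                       webhook_ids: Iterable[int], batch_size: int = 500) -> int:
    """Insert one pending delivery per webhook, batch_size rows per transaction."""
    total = 0
    for chunk in chunked(webhook_ids, batch_size):
        rows = [(str(uuid.uuid4()), webhook_id, event_id) for webhook_id in chunk]
        with conn:  # commits on success, rolls back the whole batch on error
            conn.executemany(
                "INSERT INTO deliveries (delivery_id, webhook_id, event_id, status) "
                "VALUES (?, ?, ?, 'pending')",
                rows,
            )
        total += len(rows)
    return total


# Trimmed stand-in for the real deliveries table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deliveries (delivery_id TEXT PRIMARY KEY, webhook_id INTEGER, "
    "event_id INTEGER, status TEXT)"
)
```

Batch size is a tuning knob: larger batches mean fewer round trips but longer transactions, so it is worth measuring a few values during your load test.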
Module 07 · ~5–7 hrs

Observability & Debugging

Async systems fail silently. Instrument heavily: logs, metrics, traces. Debugging without visibility is hopeless.

Tasks

  • Structured JSON logs with correlation/trace IDs. Log every enqueue, retry, success, and DLQ placement.
  • Metrics via Prometheus: enqueued events/sec, deliveries/sec, retry rate, DLQ growth, queue depth.
  • Distributed tracing with OpenTelemetry: instrument the full path (enqueue → worker → POST).
  • Build a Grafana dashboard: the four golden signals (latency, traffic, errors, saturation) + queue health + DLQ health.
  • Define SLOs: e.g., 99.9% of events delivered within 5 minutes (over 30 days).
Acceptance criteria
  • You can query logs for a single event's journey: enqueue → all retries → success/DLQ.
  • One OpenTelemetry trace shows HTTP POST → response → DLQ decision.
  • SLO doc with error budget committed; dashboard shows compliance.
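A minimal sketch of structured JSON logging with correlation IDs, using only the standard logging module. The field names (stage, event_id, delivery_id) and the log_stage helper are assumptions; a library such as structlog would be a reasonable alternative.

```python
import json
import logging
import sys
from datetime import datetime, timezone
from typing import Any


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name: str = "webhooks") -> logging.Logger:
    """Create a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_stage(logger: logging.Logger, stage: str, *, event_id: str,
              delivery_id: str, **fields: Any) -> None:
    """Log one step of a delivery's journey (enqueue, retry, success, dlq)."""
    logger.info(stage, extra={"fields": {
        "stage": stage, "event_id": event_id, "delivery_id": delivery_id, **fields,
    }})
```

Because every line carries event_id and delivery_id, filtering the logs on one event_id reconstructs that event's journey from enqueue through retries to success or DLQ.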
Module 08 · ~5–7 hrs

Scaling & Multi-Tenant

One tenant's misbehaving webhook shouldn't tank the system. Add isolation, rate limits, and resource fairness.

Tasks

  • Add customer/tenant isolation: separate queue partitions or per-tenant workers.
  • Implement rate limits per customer: max events/sec, max concurrent deliveries, max DLQ size.
  • Add a quotas table. Reject POST /events with 429 if over limit.
  • Run a chaos drill: one tenant's webhook receiver is slow (p99 latency = 30s). Verify it doesn't starve other tenants.
  • Document your resource isolation strategy in the design doc.
Acceptance criteria
  • Tenant A exhausting their quota doesn't affect Tenant B's delivery.
  • GET /quotas/:tenant shows current usage and limits.
  • Rate-limited requests are rejected with 429 and a Retry-After header.
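Per-tenant rate limiting is often done with a token bucket. The sketch below keeps buckets in process memory and takes an injectable clock so it can be tested without sleeping; in a multi-instance deployment the state would need to live somewhere shared, such as Redis. The class name, the handle_emit helper, and the 201 success status are assumptions.

```python
import math
import time
from typing import Callable, Dict, Tuple


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_sec` refill, `burst` capacity."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # tenant -> (tokens, last_seen)

    def check(self, tenant_id: str) -> int:
        """Return 0 if the request is allowed, else whole seconds to wait."""
        now = self.clock()
        tokens, last_seen = self._buckets.get(tenant_id, (float(self.burst), now))
        # Refill for the time elapsed since this tenant's last request.
        tokens = min(float(self.burst), tokens + (now - last_seen) * self.rate)
        if tokens >= 1:
            self._buckets[tenant_id] = (tokens - 1, now)
            return 0
        self._buckets[tenant_id] = (tokens, now)
        return math.ceil((1 - tokens) / self.rate)


def handle_emit(limiter: TenantRateLimiter, tenant_id: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for POST /events under the tenant's limit."""
    wait_seconds = limiter.check(tenant_id)
    if wait_seconds > 0:
        return 429, {"Retry-After": str(wait_seconds)}
    return 201, {}
```

Each tenant has its own bucket, so Tenant A draining theirs has no effect on Tenant B's requests.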
Module 09 · ~4–6 hrs

Security & Compliance

You're posting to customer URLs with sensitive data. Harden it: HMAC signing, SSRF prevention, secrets management.

Tasks

  • HMAC signing: include X-Webhook-Signature: sha256=... in every POST. Customers verify with their secret.
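  • SSRF prevention: block loopback, private, and link-local ranges (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16), which also covers cloud metadata endpoints such as 169.254.169.254.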
  • Move secrets (DB password, signing key) to a secrets manager (Vault, AWS Secrets Manager, GitHub Secrets).
  • Add audit logging: log who created/modified webhooks, DLQ retries, quota changes.
  • Write a simple threat model. Can one customer see another customer's events? Can a signed request be replayed? Can a webhook URL be used for SSRF?
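The signing task can be sketched with the standard hmac module. This assumes the signature covers exactly the raw request body bytes and that the header value has the form sha256=<hex digest>; the function names are hypothetical.

```python
import hashlib
import hmac


def sign_body(secret: str, body: bytes) -> str:
    """Compute the X-Webhook-Signature value for a raw request body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: str, body: bytes, header_value: str) -> bool:
    """Customer-side check: recompute the signature and compare in constant time."""
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

Sign the exact bytes you send, not a re-serialized copy of the JSON, since key order or whitespace differences would change the digest. Note that signing the body alone does not stop replays; one common mitigation is to include a timestamp in the signed material, which would extend the header format described here.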
Acceptance criteria
  • Every webhook POST includes a signed X-Webhook-Signature header.
  • An attempt to register a webhook pointing at 127.0.0.1 or a cloud metadata endpoint is rejected.
  • Threat model committed; key risks + mitigations documented.
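A sketch of the SSRF check using the standard ipaddress module. It resolves the hostname and rejects the URL if any resolved address is non-public. The function names and the injectable resolve parameter (there so tests can avoid real DNS) are assumptions.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlparse


def is_blocked_ip(ip: str) -> bool:
    """True for loopback, private, link-local, and other non-public addresses."""
    addr = ipaddress.ip_address(ip)
    return (not addr.is_global) or addr.is_multicast


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to the IP addresses it currently points at."""
    infos = socket.getaddrinfo(host, None)
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return [info[4][0].split("%")[0] for info in infos]


def is_safe_webhook_url(url: str,
                        resolve: Callable[[str], List[str]] = resolve_host) -> bool:
    """Accept only http(s) URLs whose host resolves solely to public addresses."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        addresses = resolve(parsed.hostname)
    except OSError:  # includes socket.gaierror for unresolvable hosts
        return False
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

Checking only at registration time leaves a gap: a hostname can later be re-pointed at an internal address. Running the same check in the worker just before each POST, and connecting to the address you validated, is a reasonable way to narrow that window.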
Module 10 · ~6–8 hrs

Capstone & Production Readiness

Ship it safely. Deploy, monitor, document, and be ready to debug production issues.

Tasks

  • Write a multi-stage Dockerfile. Use docker-compose.yml for local dev (app + Postgres + Redis + worker).
  • GitHub Actions: lint → test → build → push → deploy.
  • Deploy to production: Fly.io, Render, Railway, or a VM. Set up health checks and auto-rollback.
  • Write three runbooks: queue is backed up, DLQ is growing, worker is stuck.
  • Write a postmortem template and one practice postmortem for a simulated incident.
  • Polish the README: architecture diagram, how to run, how to test, how to deploy.
Acceptance criteria
  • Every PR gets a preview URL; merging to main deploys to production automatically.
  • A bad deploy (health check fails) auto-rolls back.
  • Runbooks + postmortem template committed to /docs.
After the Track

Where to Go Next

  • Stretch goals on the same project: webhook templates, custom transformations, batch delivery optimization, audit log export, webhook testing UI.
  • Read about distributed systems. Study Designing Data-Intensive Applications; sketch how you'd scale to 1B events/day.
  • Try the Image Upload track to practice async file processing at scale.
  • Write about it. Blog post per module, especially on failure modes and chaos testing.