Webhook Delivery Service Track

How to Use This Track

Learning by Shipping Async Systems

Ground rules

Embrace queues early. Webhook delivery is fundamentally async. Don't build synchronously and retrofit later.
Failure is the baseline. Network timeouts, retries, and dead-letter queues aren't add-ons — they're core design.
Observe everything. You can't debug async systems without excellent logging and metrics.
Pick a stack and stick with it. Node + BullMQ, Python + Celery, and Go + go-queue are all solid choices.
Test failure modes. Write tests that inject delays and failures. Chaos is your friend.

The ten modules

MODULE 01

Foundations & Setup

Repo, schema, design doc.

MODULE 02

Core Webhook API

REST API, event schema, registration.

MODULE 03

Reliable Delivery & Retries

Queue, exponential backoff, idempotency.

MODULE 04

Dead Letter Queue

Failure handling, analytics, recovery.

MODULE 05

Testing Async Systems

Unit, integration, chaos tests.

MODULE 06

Performance & Batching

Throughput, batch delivery, load tests.

MODULE 07

Observability & Debugging

Logs, metrics, traces, dashboards.

MODULE 08

Scaling & Multi-Tenant

Isolation, rate limits, resource management.

MODULE 09

Security & Compliance

Secrets, signing, audit logs, SSRF.

MODULE 10

Capstone & Production Readiness

Deploy, runbooks, postmortem, docs.

Module 01 · ~2–3 hrs

Foundations & Project Setup

Start with a clear design and a pleasant repo. Webhook delivery is complex — don't let tooling get in the way.

Tasks

Create a Git repo with standard config (.editorconfig, .gitignore, linter/formatter).
Scaffold an HTTP server. Pick one: Node/Express, Python/FastAPI, Go/Gin, or Rust/Axum.
Write a one-page design doc: problem statement, goals, API sketch, architecture diagram, failure modes you'll handle.
Create tables: webhooks (registrations), events (immutable log), deliveries (attempts and status).
Document your idempotency strategy in the design doc.
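
A minimal sketch of the three tables, using SQLite through Python's standard sqlite3 module purely for illustration. Only the table names come from the task list; every column name, and the UNIQUE constraint on (event_id, webhook_id), is an assumption you should adapt to your own design doc.

```python
import sqlite3

# Column names are illustrative assumptions, not part of the track's spec.
SCHEMA = """
CREATE TABLE webhooks (
    id          INTEGER PRIMARY KEY,
    url         TEXT NOT NULL,
    secret      TEXT NOT NULL,
    event_types TEXT
);
CREATE TABLE events (
    id         INTEGER PRIMARY KEY,
    type       TEXT NOT NULL,
    payload    TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE deliveries (
    id         INTEGER PRIMARY KEY,
    event_id   INTEGER NOT NULL REFERENCES events(id),
    webhook_id INTEGER NOT NULL REFERENCES webhooks(id),
    status     TEXT NOT NULL DEFAULT 'pending',
    attempts   INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    UNIQUE (event_id, webhook_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the webhooks, events, and deliveries tables."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The UNIQUE constraint is one possible building block for idempotency: enqueuing the same event for the same webhook twice fails at the database instead of producing a duplicate delivery.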

Acceptance criteria

git clone + one command spins up the server.
Design doc committed to /docs/design.md with a sketch of the three main tables.
Linter passes. Pre-commit hook blocks failures.

Deep dives: Documentation & Writing · Problem Decomposition · Schema Design

Module 02 · ~4–6 hrs

Core Webhook API

The contract: let customers register webhooks and emit events. Get this right before adding async machinery.

Tasks

Implement POST /webhooks to register a webhook: URL, secret (for signing), optional filter (event types).
Implement POST /events to emit an event: type, timestamp, payload. Store immediately in the events table (immutable log).
Validate input: URL format, event type enum, max payload size.
Return sensible HTTP status codes (201 on create, 400 on bad input, 404 on missing).
Document your event schema with examples (e.g., order.created, user.updated).
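
The validation task might look roughly like this, framework aside. The 64 KiB payload limit, the error message wording, and the function names are assumptions; the two event types are just the examples named above.

```python
import json
from typing import List
from urllib.parse import urlparse

ALLOWED_EVENT_TYPES = {"order.created", "user.updated"}  # example enum
MAX_PAYLOAD_BYTES = 64 * 1024  # assumed limit, pick your own


def validate_webhook_url(url: str) -> List[str]:
    """Return a list of problems with a webhook URL (empty means valid)."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return ["url is not parseable"]
    errors = []
    if parsed.scheme not in ("http", "https"):
        errors.append("url must use http or https")
    if not parsed.hostname:
        errors.append("url must include a host")
    return errors


def validate_event(event: dict) -> List[str]:
    """Return a list of problems with an event body (empty means valid)."""
    errors = []
    for field in ("type", "timestamp", "payload"):
        if field not in event:
            errors.append(f"missing field: {field}")
    if "type" in event:
        event_type = event["type"]
        if not isinstance(event_type, str) or event_type not in ALLOWED_EVENT_TYPES:
            errors.append(f"unknown event type: {event_type!r}")
    if "payload" in event:
        size = len(json.dumps(event["payload"]).encode("utf-8"))
        if size > MAX_PAYLOAD_BYTES:
            errors.append(f"payload too large: {size} bytes")
    return errors
```

Returning a list of messages rather than raising on the first problem makes it easy to send all of them back in a single 400 response.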

Acceptance criteria

curl -X POST /webhooks creates a webhook; GET /webhooks/:id returns it.
An event sent with curl -X POST /events appears in the events table.
Invalid URLs and missing fields are rejected with clear 4xx messages.

Deep dives: REST API Design · Relational Databases

Module 03 · ~5–7 hrs

Reliable Delivery & Retries

The hard part: when you emit an event, every webhook must eventually get it — even if the customer's server is down for an hour.

Tasks

On POST /events, enqueue a delivery job for each matching webhook (use your job queue: BullMQ, Celery, SQS, etc.).
A worker processes jobs: POST the event to the webhook URL with a signed JSON body.
Implement exponential backoff retries: 1s, 4s, 16s, 64s, 256s, then stop (5 retries).
Implement idempotency: include a unique delivery_id in every request, and reuse the same value on each retry of that delivery. Customers should deduplicate on this.
Track delivery state: pending, processing, success, failed (permanent).
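
The retry schedule above is a base-4 exponential: the delay before retry n is 4^(n-1) seconds. A small sketch of the calculation and the resulting state transition follows; the function names are hypothetical, and the states are the ones listed in the task.

```python
from typing import Optional

MAX_RETRIES = 5
BASE_DELAY_SECONDS = 1
MULTIPLIER = 4


def retry_delay(retry_number: int) -> Optional[int]:
    """Seconds to wait before retry N (1-based); None once retries are used up."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    if retry_number > MAX_RETRIES:
        return None
    return BASE_DELAY_SECONDS * MULTIPLIER ** (retry_number - 1)


def next_state(delivered: bool, retries_used: int) -> str:
    """State after an attempt, given how many retries have already run."""
    if delivered:
        return "success"
    if retries_used >= MAX_RETRIES:
        return "failed"
    return "pending"  # waiting for the next scheduled retry
```

The five delays add up to 341 seconds, a little under six minutes. A receiver that stays down longer than that ends in the failed state, so recovery from longer outages depends on what you do with failed deliveries afterwards.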

Acceptance criteria

Emit an event and verify a delivery job is enqueued for each webhook.
Stop the worker and emit events; restart the worker and deliveries drain successfully.
A delivery to a URL that always fails transitions to the failed state after 5 retries.

Deep dives: Message Queues · Idempotency · Resilience & Circuit Breakers

Module 04 · ~4–6 hrs

Dead Letter Queue & Failure Handling

Some webhooks will fail permanently. Don't lose them — analyze, alert, and allow recovery.

Tasks

Create a dead_letter_queue table: deliveries that failed all retries go here.
Expose GET /dlq to list failed deliveries (paginated). Include: webhook_id, event, last_error, timestamp.
Add POST /dlq/:delivery_id/retry to allow manual retries from the DLQ.
Emit a metric/alert on DLQ writes so you catch silent failures.
Document your failure categorization: transient (retry) vs permanent (DLQ).
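
One possible categorization, sketched below, treats missing responses, 5xx, and a few specific 4xx codes as transient, and everything else outside 2xx as permanent. This policy is an assumption for illustration; your design doc should state and justify your own.

```python
from typing import Optional

# 4xx codes that usually mean "try again later" rather than "never".
TRANSIENT_4XX = {408, 425, 429}


def classify_failure(status_code: Optional[int]) -> str:
    """Classify a delivery attempt as 'success', 'transient', or 'permanent'.

    status_code is None when no HTTP response arrived at all
    (timeout, connection refused, DNS failure).
    """
    if status_code is None:
        return "transient"
    if 200 <= status_code < 300:
        return "success"
    if status_code >= 500 or status_code in TRANSIENT_4XX:
        return "transient"
    # Remaining 3xx and 4xx responses: retrying the same request
    # is unlikely to help, so send it straight to the DLQ.
    return "permanent"
```

A permanent result can skip the retry schedule entirely, which keeps hopeless deliveries (for example a 404 from a deleted endpoint) from occupying the queue.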

Acceptance criteria

A permanently-failed delivery appears in the DLQ.
POST /dlq/:delivery_id/retry moves it back to the queue and retries it.
You can query the DLQ and see the last error message.

Deep dives: Event-Driven Patterns · Monitoring & Observability

Module 05 · ~5–7 hrs

Testing Async Systems

Async systems are slippery. Test them hard: race conditions, timing, failure injection.

Tasks

Unit tests for retry logic: backoff calculation, state transitions, idempotency.
Integration tests for delivery: spin up a fake webhook server, emit events, verify POSTs arrive.
Chaos tests: inject delays (slow webhook), errors (500, timeout), network partitions. Verify queue drains correctly.
Concurrency test: emit 1000 events to 10 webhooks; verify all 10k deliveries complete, whether the queue is drained by a single worker or by several workers concurrently.
Coverage threshold: 75% for core delivery logic.
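
Failure injection does not need a real network. The sketch below assumes your delivery code accepts the transport and the sleep function as parameters, so a test can swap in a fake sender and record the delays instead of waiting for them. FlakySender and deliver are hypothetical names, not part of any library.

```python
from typing import Callable, List, Sequence


class FlakySender:
    """Fake transport: returns 500 for the first `failures` calls, then 200."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self, url: str, body: bytes) -> int:
        self.calls += 1
        return 500 if self.calls <= self.failures else 200


def deliver(
    send: Callable[[str, bytes], int],
    url: str,
    body: bytes,
    delays: Sequence[int],
    sleep: Callable[[float], None],
) -> str:
    """Make one initial attempt, then one retry per entry in `delays`."""
    if 200 <= send(url, body) < 300:
        return "success"
    for delay in delays:
        sleep(delay)
        if 200 <= send(url, body) < 300:
            return "success"
    return "failed"


SCHEDULE = [1, 4, 16, 64, 256]


def test_recovers_after_two_failures() -> None:
    slept: List[float] = []
    sender = FlakySender(failures=2)
    assert deliver(sender, "https://example.test/hook", b"{}", SCHEDULE, slept.append) == "success"
    assert sender.calls == 3
    assert slept == [1, 4]


def test_gives_up_after_all_retries() -> None:
    slept: List[float] = []
    sender = FlakySender(failures=100)
    assert deliver(sender, "https://example.test/hook", b"{}", SCHEDULE, slept.append) == "failed"
    assert sender.calls == 6  # 1 initial attempt + 5 retries
    assert slept == SCHEDULE
```

Because no real sleeping happens, both tests finish instantly, which helps keep the whole suite under the one-minute target.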

Acceptance criteria

Test suite runs in under 1 minute.
You can describe race conditions your tests protect against (e.g., DLQ insertion while retrying).
A failing webhook doesn't block other deliveries (isolation).

Deep dives: Unit Testing · Integration Testing · TDD

Module 06 · ~4–5 hrs

Performance & Batching

Push throughput. Can you deliver 10k events/sec to 100k webhooks? Batching, parallelism, and backpressure are key.

Tasks

Profile the bottleneck: is it database inserts, queue enqueue, or worker throughput?
Batch enqueuing: when emitting an event, bulk-insert N delivery jobs (not one-by-one).
Run multiple worker processes (or threads). Test with k6 or Locust: measure events/sec, p95 latency.
Discuss backpressure: what happens when the queue is full? Do you reject POST /events or buffer?
Optimize: connection pooling, query batching, queue batch sizes.
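
For the batch-enqueue task, the idea is to write all delivery rows for one event in a single statement and transaction. A sketch with sqlite3's executemany follows; the table layout and the enqueue_deliveries name are assumptions, and a real queue backend would have its own bulk API.

```python
import sqlite3
from typing import Iterable


def enqueue_deliveries(
    conn: sqlite3.Connection, event_id: int, webhook_ids: Iterable[int]
) -> int:
    """Insert one pending delivery per webhook in a single transaction."""
    rows = [(event_id, webhook_id) for webhook_id in webhook_ids]
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO deliveries (event_id, webhook_id, status) "
            "VALUES (?, ?, 'pending')",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deliveries ("
    "id INTEGER PRIMARY KEY, event_id INTEGER NOT NULL, "
    "webhook_id INTEGER NOT NULL, status TEXT NOT NULL)"
)
```

Calling enqueue_deliveries(conn, 1, range(100)) writes 100 rows with one commit instead of 100, which is usually the first thing to try when profiling points at database inserts.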

Acceptance criteria

Load test shows you can deliver ≥1000 events/sec to 100 webhooks on local hardware.
Adding a 2nd worker scales throughput (roughly 2x, accounting for contention).
Load test report with graphs committed to /docs/perf.

Deep dives: Performance Engineering · Metrics & Monitoring

Module 07 · ~5–7 hrs

Observability & Debugging

Async systems fail silently. Instrument heavily: logs, metrics, traces. Debugging without visibility is hopeless.

Tasks

Structured JSON logs with correlation/trace IDs. Log every enqueue, retry, success, and DLQ placement.
Metrics via Prometheus: enqueued events/sec, deliveries/sec, retry rate, DLQ growth, queue depth.
Distributed tracing with OpenTelemetry: instrument the full path (enqueue → worker → POST).
Build a Grafana dashboard: the four golden signals (latency, traffic, errors, saturation) + queue health + DLQ health.
Define SLOs: e.g., 99.9% of events delivered within 5 minutes (over 30 days).
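
Structured logging can be done with the standard logging module alone. In this sketch the field names (event_id, delivery_id, attempt) and the logger name are assumptions; the point is that every log line is one JSON object carrying the IDs you will later filter on.

```python
import json
import logging
from datetime import datetime, timezone
from typing import TextIO

CONTEXT_FIELDS = ("event_id", "delivery_id", "attempt")  # assumed field names


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in CONTEXT_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def build_logger(stream: TextIO) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("webhook.delivery")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as logger.info("retry scheduled", extra={"event_id": "evt_1", "delivery_id": "dlv_1", "attempt": 2}) then produces one line you can filter by event_id to reconstruct the event's journey.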

Acceptance criteria

You can query logs for a single event's journey: enqueue → all retries → success/DLQ.
One OpenTelemetry trace shows HTTP POST → response → DLQ decision.
SLO doc with error budget committed; dashboard shows compliance.
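
The error budget is simple arithmetic: with a 99.9% target, 0.1% of events in the window may be late or undelivered. A sketch of the calculation, where the function name and return shape are assumptions and Decimal is used to avoid floating-point rounding:

```python
from decimal import Decimal


def error_budget(total_events: int, late_or_failed: int, target: str = "0.999") -> dict:
    """Compare observed misses against the budget implied by the SLO target."""
    allowed = int(total_events * (Decimal(1) - Decimal(target)))
    return {
        "allowed": allowed,
        "used": late_or_failed,
        "remaining": allowed - late_or_failed,
        "compliant": late_or_failed <= allowed,
    }
```

For example, 1,000,000 events in 30 days gives a budget of 1,000 misses; 250 late deliveries leave 750 in reserve, while 1,500 would breach the SLO.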

Deep dives: Logging · Metrics · Tracing · SLOs

Module 08 · ~5–7 hrs

Scaling & Multi-Tenant

One tenant's misbehaving webhook shouldn't tank the system. Add isolation, rate limits, and resource fairness.

Tasks

Add customer/tenant isolation: separate queue partitions or per-tenant workers.
Implement rate limits per customer: max events/sec, max concurrent deliveries, max DLQ size.
Add a quotas table. Reject POST /events with 429 if over limit.
Run a chaos drill: one tenant's webhook receiver is slow (p99 latency = 30s). Verify it doesn't starve other tenants.
Document your resource isolation strategy in the design doc.
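
A token bucket per tenant is one common way to implement the events/sec limit. The sketch below is an in-memory, single-process illustration with hypothetical class names; a multi-process deployment would need shared state (for example in Redis), which is not shown.

```python
import math
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock: Callable[[], float]) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def retry_after(self) -> int:
        """Whole seconds until a token is available (for Retry-After)."""
        return max(0, math.ceil((1 - self.tokens) / self.rate))


class TenantLimiter:
    """Keeps an independent bucket per tenant so limits never interact."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def _bucket(self, tenant_id: str) -> TokenBucket:
        if tenant_id not in self.buckets:
            self.buckets[tenant_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[tenant_id]

    def allow(self, tenant_id: str) -> bool:
        return self._bucket(tenant_id).allow()

    def retry_after(self, tenant_id: str) -> int:
        return self._bucket(tenant_id).retry_after()
```

When allow returns False, the API handler would respond with 429 and use retry_after for the Retry-After header. The injectable clock exists so tests can advance time without sleeping.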

Acceptance criteria

Tenant A exhausting their quota doesn't affect Tenant B's delivery.
GET /quotas/:tenant shows current usage and limits.
Rate-limited requests are rejected with 429 and a Retry-After header.

Deep dives: Scalability · Resilience & Circuit Breakers

Module 09 · ~4–6 hrs

Security & Compliance

You're posting to customer URLs with sensitive data. Harden it: HMAC signing, SSRF prevention, secrets management.

Tasks

HMAC signing: include X-Webhook-Signature: sha256=... in every POST. Customers verify with their secret.
SSRF prevention: block loopback, private, and link-local addresses (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16) and cloud metadata endpoints such as 169.254.169.254.
Move secrets (DB password, signing key) to a secrets manager (Vault, AWS Secrets Manager, GitHub Secrets).
Add audit logging: log who created/modified webhooks, DLQ retries, quota changes.
Write a simple threat model: can one customer see another customer's events? Can a signed request be replayed? Can a webhook URL be used for SSRF?
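
HMAC signing needs only the standard library. This sketch signs the raw request body with the webhook's secret and produces the sha256=... header value described above; the sign and verify function names are assumptions.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Return the X-Webhook-Signature value for a request body."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(secret: bytes, body: bytes, header_value: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, header_value.encode("utf-8"))
```

Sign the exact bytes you send, not a re-serialized copy, or the receiver's digest will not match. A signature over the body alone can be replayed; one common mitigation, not shown here, is to include a timestamp in the signed content and have receivers reject old ones.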

Acceptance criteria

Every webhook POST includes a signed X-Webhook-Signature header.
An attempt to register a webhook pointing at 127.0.0.1 or a metadata endpoint is rejected.
Threat model committed; key risks + mitigations documented.
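
A registration-time SSRF check can be sketched with the standard ipaddress module: resolve the host, then reject the URL if any resolved address is loopback, private, link-local, or otherwise non-public. The function names are hypothetical, and the resolver is injectable so tests do not need DNS.

```python
import ipaddress
import socket
from typing import Callable
from urllib.parse import urlparse


def is_blocked_ip(ip_text: str) -> bool:
    """True for addresses a webhook should never be delivered to."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254 metadata endpoints
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_safe_url(url: str, resolve: Callable = socket.getaddrinfo) -> bool:
    """Resolve the URL's host and require every address to be public."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        return False
    if not host:
        return False
    try:
        infos = resolve(host, None)
    except OSError:
        return False  # unresolvable hosts are rejected too
    # Each getaddrinfo entry ends with a sockaddr tuple; item 0 is the IP.
    return all(not is_blocked_ip(info[4][0]) for info in infos)
```

Checking only at registration is not sufficient on its own, because DNS records can change afterwards. Repeating the check at delivery time, against the address you actually connect to, closes that gap.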

Deep dives: Authentication · OWASP Top 10 · Threat Modeling

Module 10 · ~6–8 hrs

Capstone & Production Readiness

Ship it safely. Deploy, monitor, document, and be ready to debug production issues.

Tasks

Write a multi-stage Dockerfile. Use docker-compose.yml for local dev (app + Postgres + Redis + worker).
GitHub Actions: lint → test → build → push → deploy.
Deploy to production: Fly.io, Render, Railway, or a VM. Set up health checks and auto-rollback.
Write three runbooks: queue is backed up, DLQ is growing, worker is stuck.
Write a postmortem template and one practice postmortem for a simulated incident.
Polish the README: architecture diagram, how to run, how to test, how to deploy.

Acceptance criteria

Every PR gets a preview URL; merging to main deploys to production automatically.
A bad deploy (health check fails) auto-rolls back.
Runbooks + postmortem template committed to /docs.

Deep dives: CI/CD · Containers · Operational Excellence

After the Track

Where to Go Next

Stretch goals on the same project: webhook templates, custom transformations, batch delivery optimization, audit log export, webhook testing UI.
Read about distributed systems. Study Designing Data-Intensive Applications; sketch how you'd scale to 1B events/day.
Try the Image Upload track to practice async file processing at scale.
Write about it. Blog post per module, especially on failure modes and chaos testing.
