Webhook Delivery Tutorial · Module 07 of 10

Observability & Debugging

Async systems fail silently without visibility. In this module you instrument the service with structured JSON logs, Prometheus metrics, and OpenTelemetry traces, then build a Grafana dashboard. By the end, you'll be able to follow a single event through the whole system from its logs.

~5–7 hrs · Advanced · Observability focus
What You'll Have at the End

Definition of Done

  • Structured JSON logs with correlation/trace IDs for every event.
  • Prometheus metrics: events/sec, deliveries/sec, retry rate, queue depth, DLQ size.
  • OpenTelemetry traces covering the full path: enqueue → worker → POST → decision.
  • Grafana dashboard with 4 golden signals + queue health + DLQ health.
  • SLO document: 99.9% of events delivered within 5 minutes (over 30 days).
  • Can query logs for a single event's complete journey.
The Steps

Build It

STEP 1

Add structured logging

Install logger: npm install pino pino-pretty

Create src/logger.ts:

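import pino from 'pino';

// pino writes one JSON object per line by default.
// Pretty-printing is opt-in via LOG_PRETTY=true (an assumed variable name),
// so the log file stays machine-readable JSON.
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  transport: process.env.LOG_PRETTY === 'true'
    ? {
        target: 'pino-pretty',
        options: {
          colorize: true,
          translateTime: 'SYS:standard',
          ignore: 'pid,hostname'
        }
      }
    : undefined
});

export default logger;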

Use throughout codebase with trace IDs:

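import { randomUUID } from 'node:crypto';
import logger from '../logger';

// Generate the trace ID once per event and reuse it in every log line
// for that event; a fresh ID per log call would make correlation impossible.
const traceId = randomUUID();

logger.info({
  traceId,
  type: 'event_received',
  event_type: 'order.created',
  webhook_count: 10
});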
✓ Verify: Logs output as JSON. Can pipe to file: npm run dev 2>&1 | tee app.log.
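A trace ID only helps if every log line for an event carries the same one. The sketch below illustrates that idea without any dependencies. `bindTrace` and the stage names other than `event_received` are hypothetical. With pino itself, the same effect should come from `logger.child({ traceId })`, which returns a logger that adds the bound fields to every line.

```typescript
// Sketch only: bindTrace is a hypothetical helper, not part of pino.
// It mimics a child logger: fixed fields are merged into every line.
type Fields = Record<string, unknown>;

export function bindTrace(traceId: string, write: (line: string) => void) {
  return (type: string, fields: Fields = {}): void => {
    // One JSON object per line, always stamped with the same traceId.
    write(JSON.stringify({ ...fields, time: Date.now(), traceId, type }));
  };
}

// Usage: create the bound logger once per event and hand it to each stage.
const captured: string[] = [];
const log = bindTrace('abc-123', (line) => { captured.push(line); });

log('event_received', { event_type: 'order.created', webhook_count: 10 });
log('delivery_enqueued', { webhook_id: 'wh_1' }); // hypothetical stage name
log('delivery_succeeded', { status: 200 });       // hypothetical stage name
```

Passing the bound logger (or just the trace ID) into the job payload lets the worker continue the same trail after the queue hop.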
STEP 2

Add Prometheus metrics

Install: npm install prom-client

Create src/metrics.ts:

import { Counter, Histogram, Gauge } from 'prom-client';

export const eventsEmitted = new Counter({
  name: 'webhook_events_emitted_total',
  help: 'Total events emitted',
  labelNames: ['type']
});

export const deliveriesAttempted = new Counter({
  name: 'webhook_deliveries_attempted_total',
  help: 'Total delivery attempts',
  labelNames: ['status', 'event_type']
});

export const deliveryLatency = new Histogram({
  name: 'webhook_delivery_latency_seconds',
  help: 'Delivery latency',
  buckets: [0.1, 0.5, 1, 5, 10]
});

export const queueDepth = new Gauge({
  name: 'webhook_queue_depth',
  help: 'Number of pending jobs in queue'
});

export const dlqSize = new Gauge({
  name: 'webhook_dlq_size',
  help: 'Number of failed deliveries in DLQ'
});

Expose metrics endpoint in src/index.ts:

import { collectDefaultMetrics, register } from 'prom-client';

collectDefaultMetrics();

// register.metrics() returns a Promise in current prom-client releases, so await it.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
✓ Verify: curl http://localhost:3000/metrics returns Prometheus format.
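Defining the metrics has no effect until the worker updates them. The sketch below shows one way to record a delivery attempt. `recordDeliveryAttempt` and the `failure` status value are assumptions. The two interfaces describe only the `inc` and `observe` calls that the `Counter` and `Histogram` from src/metrics.ts provide, so the real objects should be usable in place of the fakes.

```typescript
// Structural types covering only the prom-client methods this sketch calls.
interface CounterLike {
  inc(labels: Record<string, string>, value?: number): void;
}
interface HistogramLike {
  observe(value: number): void;
}

// Hypothetical helper: call once per delivery attempt from the worker.
export function recordDeliveryAttempt(
  attempts: CounterLike,
  latency: HistogramLike,
  eventType: string,
  ok: boolean,
  durationMs: number
): void {
  // 'success' matches the dashboard query; 'failure' is an assumed label value.
  attempts.inc({ status: ok ? 'success' : 'failure', event_type: eventType });
  // The histogram is declared in seconds, so convert from milliseconds.
  latency.observe(durationMs / 1000);
}

// Usage with in-memory fakes standing in for deliveriesAttempted and deliveryLatency.
const seenLabels: Record<string, string>[] = [];
const seenSeconds: number[] = [];
const fakeCounter: CounterLike = { inc: (labels) => { seenLabels.push(labels); } };
const fakeHistogram: HistogramLike = { observe: (v) => { seenSeconds.push(v); } };

recordDeliveryAttempt(fakeCounter, fakeHistogram, 'order.created', true, 250);
recordDeliveryAttempt(fakeCounter, fakeHistogram, 'order.created', false, 5000);
```

The gauges work differently: rather than counting events, they are set to the current value, for example by calling `queueDepth.set(n)` whenever the queue length is sampled.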
STEP 3

Add OpenTelemetry tracing

Install: npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/sdk-trace-node @opentelemetry/auto-instrumentations-node

Create src/tracing.ts:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();

export default sdk;

Import in src/index.ts as the very first import, so instrumentation is registered before any other module loads: import './tracing';

✓ Verify: Logs show trace spans (in console output).
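Auto-instrumentation follows a request within one process, but a queue hop between enqueue and worker may not be covered, depending on the queue library. One common approach is to carry the W3C `traceparent` value inside the job payload. The sketch below formats and parses that header by hand to show what travels with the job. The job shape is hypothetical, and in practice the `propagation.inject` and `propagation.extract` functions from `@opentelemetry/api` would normally do this work.

```typescript
// W3C Trace Context "traceparent": version-traceId-parentId-flags, lowercase hex.
export interface TraceParent {
  traceId: string;  // 32 hex characters
  parentId: string; // 16 hex characters
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

export function formatTraceParent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.parentId}-${tp.sampled ? '01' : '00'}`;
}

export function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header);
  if (!m) return null;
  const [, traceId, parentId, flags] = m;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Usage: the producer stores the header in the job; the worker restores it.
const header = formatTraceParent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  parentId: '00f067aa0ba902b7',
  sampled: true
});
const job = { eventType: 'order.created', traceparent: header }; // hypothetical job payload
const restored = parseTraceParent(job.traceparent);
```

Once the worker has the restored context, spans it creates can join the original trace, which is what makes the enqueue → worker → POST → decision path appear as one trace.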
STEP 4

Set up Prometheus + Grafana

Create docker-compose.yml:

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

Create prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'webhook-service'
    static_configs:
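      # Prometheus runs inside Docker, where localhost is the container itself.
      # host.docker.internal reaches the app on the host (Docker Desktop); on Linux,
      # add extra_hosts: ["host.docker.internal:host-gateway"] to the prometheus service.
      - targets: ['host.docker.internal:3000']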
    metrics_path: '/metrics'

Run: docker-compose up

✓ Verify: Prometheus at http://localhost:9090 scrapes metrics. Grafana at http://localhost:3001.
STEP 5

Build Grafana dashboard

In Grafana, create a dashboard with panels:

  • Events/sec: sum(rate(webhook_events_emitted_total[1m]))
  • Delivery success rate: sum(rate(webhook_deliveries_attempted_total{status="success"}[5m])) / sum(rate(webhook_deliveries_attempted_total[5m]))
  • Delivery latency p95: histogram_quantile(0.95, sum by (le) (rate(webhook_delivery_latency_seconds_bucket[5m])))
  • Queue depth: webhook_queue_depth
  • DLQ size: webhook_dlq_size
✓ Verify: Dashboard displays all metrics. Can answer "what was throughput at 14:32?" from the graph.
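The p95 panel is an estimate, not an exact measurement: a Prometheus histogram stores only cumulative bucket counts, and the quantile is interpolated within one bucket. The sketch below approximates that calculation to show why bucket boundaries matter. `estimateQuantile` and the sample counts are made up for illustration, and the function is a simplified model rather than Prometheus's own code.

```typescript
// A cumulative histogram bucket: number of observations <= le.
export interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count
}

// Estimate a quantile (0 < q <= 1) by linear interpolation inside the bucket
// containing the target rank. Buckets must be sorted by le and end with +Inf.
export function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Sample data using the bucket bounds from src/metrics.ts (counts are invented).
const sample: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 80 },
  { le: 1, count: 90 },
  { le: 5, count: 98 },
  { le: 10, count: 100 },
  { le: Infinity, count: 100 }
];
const p95 = estimateQuantile(0.95, sample); // roughly 3.5 seconds
const p50 = estimateQuantile(0.5, sample);  // roughly 0.1 seconds
```

Here the 95th percentile lands in the wide 1 s to 5 s bucket, so the estimate could be off by seconds. Adding bounds near the latencies you care about tightens it.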
STEP 6

Document SLO

Create docs/slo.md:

# SLO — Webhook Delivery Service

## Service Level Objective
99.9% of events are delivered within 5 minutes of emission (over a 30-day window).

## Service Level Indicator
(successful_deliveries_within_5min / total_events) * 100

## Error Budget
30 days = 43,200 minutes
1 minute of full outage = ~0.0023% of the window (~2.3% of the error budget)

With a 99.9% SLO, we have ~43 minutes of error budget per 30-day window.

## Monitoring
Dashboard tracks:
- Events emitted per second
- Delivery success rate
- Delivery latency (p50, p95, p99)
- Queue depth (queued jobs)
- DLQ size (failed deliveries)

Alerts:
- Alert if success rate < 99.8% (consuming error budget)
- Alert if DLQ size > 100 (cascading failures)
✓ Verify: SLO document is committed. Dashboard reflects the SLO.
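The error budget arithmetic in the SLO document can be checked with a few lines of code. This is a minimal sketch. The function names and the event counts in the usage lines are invented for illustration.

```typescript
// Minutes of allowed failure in a window for a given SLO target (e.g. 0.999).
export function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Fraction of the budget already spent, measured in events rather than minutes.
// 1.0 means the budget is exhausted; above 1.0 the SLO is violated.
export function budgetConsumed(
  sloTarget: number,
  totalEvents: number,
  lateOrFailedEvents: number
): number {
  const allowed = (1 - sloTarget) * totalEvents;
  if (allowed === 0) return lateOrFailedEvents > 0 ? Infinity : 0;
  return lateOrFailedEvents / allowed;
}

const budget = errorBudgetMinutes(0.999, 30);     // about 43.2 minutes
const spent = budgetConsumed(0.999, 1000000, 250); // about 0.25 of the budget
```

Because the SLI counts events, the event-based form is the one that matches the SLO definition. The minutes figure is a convenient approximation that assumes traffic is spread evenly over the window.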
STEP 7

Test traceability

Emit an event, query the logs for its trace ID:

curl -X POST http://localhost:3000/events \
  -H 'Content-Type: application/json' \
  -d '{"type":"order.created","payload":{"id":123}}'

# Copy the traceId from the service's log output (abc-123 below is a placeholder)
# Query the logs for this trace; matching on the ID alone works whatever the log format:
grep 'abc-123' app.log

# Should see: enqueue → delivery jobs → worker picks up → POST attempt → success/retry
✓ Verify: Can follow a single event through the entire system from logs.
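When grep output gets noisy, the same check can be scripted. The sketch below reads JSON log lines, keeps those for one trace ID, and returns the stages in time order. `journeyFor` is a hypothetical helper, and the sample lines and stage names are invented. It assumes each line has a numeric `time` field in epoch milliseconds, which is pino's default.

```typescript
// Shape of the fields this sketch reads from each JSON log line.
interface LogEntry {
  time: number;     // epoch milliseconds
  traceId?: string;
  type?: string;
}

// Return the ordered list of log types recorded for one trace ID.
export function journeyFor(lines: string[], traceId: string): string[] {
  const entries: LogEntry[] = [];
  for (const line of lines) {
    try {
      const parsed = JSON.parse(line) as LogEntry;
      if (parsed.traceId === traceId) entries.push(parsed);
    } catch {
      // Skip lines that are not JSON objects, such as span dumps or stack traces.
    }
  }
  return entries
    .sort((a, b) => a.time - b.time)
    .map((e) => e.type ?? 'unknown');
}

// Usage with invented sample lines; in practice, read them from app.log.
const sampleLines = [
  '{"level":30,"time":1700000000300,"traceId":"abc-123","type":"delivery_succeeded"}',
  '{"level":30,"time":1700000000100,"traceId":"abc-123","type":"event_received"}',
  'not json',
  '{"level":30,"time":1700000000150,"traceId":"zzz-999","type":"event_received"}',
  '{"level":30,"time":1700000000200,"traceId":"abc-123","type":"delivery_enqueued"}'
];
const journey = journeyFor(sampleLines, 'abc-123');
```

A missing stage in the returned list points at the component that dropped the event or failed to log it.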
STEP 8

Commit observability work

git add -A
git commit -m "observability: add logging, metrics, tracing, and dashboards

- Structured JSON logging with trace IDs
- Prometheus metrics: events/sec, delivery rate, latency, queue depth
- OpenTelemetry tracing for full request path
- Grafana dashboard with 4 golden signals + queue health
- SLO document: 99.9% delivery within 5 minutes
- Traceability: can follow single event through system"
git push origin main
✓ Verify: git log --oneline shows your commit.
Next Steps

Ready for Module 08?

You now have visibility into your system. Next, you'll make it multi-tenant and isolate resources. Head to Module 08.