Async systems fail silently without visibility. In this module you instrument the webhook service with structured JSON logs, Prometheus metrics, and OpenTelemetry traces, then build a Grafana dashboard. By the end, you'll be able to follow a single event through the system by its trace ID.
Install logger: npm install pino pino-pretty
Create src/logger.ts:
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  transport: {
    target: 'pino-pretty',
    options: {
      colorize: true,
      translateTime: 'SYS:standard',
      ignore: 'pid,hostname'
    }
  }
});

export default logger;
Use throughout codebase with trace IDs:
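import { randomUUID } from 'node:crypto';
import logger from '../logger';

// Generate the trace ID once per incoming event and reuse it on every
// log line for that event, so the lines can be correlated later.
const traceId = randomUUID();

logger.info({
  traceId,
  type: 'event_received',
  event_type: 'order.created',
  webhook_count: 10
});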
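A trace ID is only useful if every log line for the same event carries the same value. One way to avoid passing it through every function call is Node's built-in AsyncLocalStorage. The following is a minimal sketch under that assumption, not the course's implementation; `withTrace`, `currentTraceId`, and `logFields` are hypothetical helper names.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Holds the trace ID for whatever async call chain is currently running.
const traceStore = new AsyncLocalStorage<string>();

// Run `fn` with a trace ID attached; reuse an incoming ID if one is given.
function withTrace<T>(fn: () => T, traceId: string = randomUUID()): T {
  return traceStore.run(traceId, fn);
}

// Returns the active trace ID, or undefined outside of withTrace().
function currentTraceId(): string | undefined {
  return traceStore.getStore();
}

// Merge the active trace ID into the fields passed to the logger.
function logFields(fields: Record<string, unknown>): Record<string, unknown> {
  return { traceId: currentTraceId(), ...fields };
}

// Usage: both lines printed inside the callback carry the same ID.
withTrace(() => {
  console.log(JSON.stringify(logFields({ type: 'event_received' })));
  console.log(JSON.stringify(logFields({ type: 'delivery_enqueued' })));
});
```

With the pino logger from src/logger.ts, the call would look like `logger.info(logFields({ type: 'event_received' }))`. AsyncLocalStorage only covers a single process, so a worker that picks a job off the queue would need the ID stored in the job payload and passed back in as the second argument to `withTrace`.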
npm run dev 2>&1 | tee app.log
Install: npm install prom-client
Create src/metrics.ts:
import { Counter, Histogram, Gauge } from 'prom-client';

export const eventsEmitted = new Counter({
  name: 'webhook_events_emitted_total',
  help: 'Total events emitted',
  labelNames: ['type']
});

export const deliveriesAttempted = new Counter({
  name: 'webhook_deliveries_attempted_total',
  help: 'Total delivery attempts',
  labelNames: ['status', 'event_type']
});

export const deliveryLatency = new Histogram({
  name: 'webhook_delivery_latency_seconds',
  help: 'Delivery latency',
  buckets: [0.1, 0.5, 1, 5, 10]
});

export const queueDepth = new Gauge({
  name: 'webhook_queue_depth',
  help: 'Number of pending jobs in queue'
});

export const dlqSize = new Gauge({
  name: 'webhook_dlq_size',
  help: 'Number of failed deliveries in DLQ'
});
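The `buckets` array on `deliveryLatency` lists upper bounds in seconds. A Prometheus histogram reports cumulative counts: each bucket counts every observation less than or equal to its bound, plus an implicit `+Inf` bucket that counts everything. The sketch below illustrates those semantics in plain TypeScript; it is not prom-client's implementation, and `cumulativeBucketCounts` and the sample latencies are made up for the example.

```typescript
// Upper bounds from deliveryLatency, in seconds.
const buckets = [0.1, 0.5, 1, 5, 10];

// Return cumulative counts per bucket, the way a Prometheus histogram
// reports them: each bucket counts observations <= its upper bound,
// and the implicit +Inf bucket counts every observation.
function cumulativeBucketCounts(
  upperBounds: number[],
  observations: number[]
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const le of upperBounds) {
    counts.set(String(le), observations.filter((v) => v <= le).length);
  }
  counts.set('+Inf', observations.length);
  return counts;
}

// Hypothetical delivery latencies in seconds.
const sample = [0.05, 0.3, 0.3, 2, 12];
console.log(cumulativeBucketCounts(buckets, sample));
// 0.1 => 1, 0.5 => 3, 1 => 3, 5 => 4, 10 => 4, +Inf => 5
```

The 12-second observation appears only in `+Inf`, so percentile estimates cannot resolve anything above the largest finite bound. If this histogram is meant to track end-to-end delivery time against a multi-minute target rather than the duration of a single POST, larger buckets would probably be needed.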
Expose metrics endpoint in src/index.ts:
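import { collectDefaultMetrics, register } from 'prom-client';

// Also export Node.js process metrics (CPU, memory, event loop lag).
collectDefaultMetrics();

// register.metrics() returns a Promise, so the handler must await it.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});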
curl http://localhost:3000/metrics returns the metrics in Prometheus text format.
Install: npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
Create src/tracing.ts:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-node';

// Print finished spans to stdout; swap in another exporter later if needed.
const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();

export default sdk;
Import in src/index.ts: import './tracing'; (first import)
Create docker-compose.yml:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
Create prometheus.yml:
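global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'webhook-service'
    metrics_path: '/metrics'
    static_configs:
      # The app runs on the host, not in the Prometheus container, so
      # 'localhost' would point at the container itself. On Linux, also add
      # extra_hosts: ["host.docker.internal:host-gateway"] to the
      # prometheus service in docker-compose.yml.
      - targets: ['host.docker.internal:3000']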
Run: docker-compose up
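In Grafana (http://localhost:3001, user admin, password admin), add Prometheus as a data source at http://prometheus:9090, then create a dashboard with panels for events emitted per second, delivery success rate, delivery latency (p50, p95, p99), queue depth, and DLQ size.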
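The metrics defined in src/metrics.ts map onto dashboard panels roughly as follows. This is a sketch of possible PromQL expressions, kept as a TypeScript map so the queries live next to the code; the panel titles, the 5-minute rate window, and the `status="success"` label value are assumptions, since the source only names the `status` label, not its values.

```typescript
// Panel title -> PromQL expression. Metric names come from src/metrics.ts;
// the status="success" label value is an assumption about how the
// delivery code labels its attempts.
const panelQueries: Record<string, string> = {
  'Events emitted per second':
    'sum(rate(webhook_events_emitted_total[5m]))',
  'Delivery success rate':
    'sum(rate(webhook_deliveries_attempted_total{status="success"}[5m]))' +
    ' / sum(rate(webhook_deliveries_attempted_total[5m]))',
  'Delivery latency p95 (seconds)':
    'histogram_quantile(0.95, sum by (le) (rate(webhook_delivery_latency_seconds_bucket[5m])))',
  'Queue depth': 'webhook_queue_depth',
  'DLQ size': 'webhook_dlq_size'
};

// Print each query so it can be pasted into a Grafana panel editor.
for (const [title, expr] of Object.entries(panelQueries)) {
  console.log(`${title}: ${expr}`);
}
```

For p50 and p99 panels, the same latency expression can be reused with 0.5 or 0.99 as the first argument to `histogram_quantile`.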
Create docs/slo.md:
# SLO: Webhook Delivery Service

## Service Level Objective
99.9% of events are delivered within 5 minutes of emission (over a 30-day window).

## Service Level Indicator
(successful_deliveries_within_5min / total_events) * 100

## Error Budget
30 days = 43,200 minutes
With a 99.9% SLO, we have ~43 minutes of error budget per month.
1 minute of full outage = ~0.002% of the window, or ~2.3% of the error budget.

## Monitoring
Dashboard tracks:
- Events emitted per second
- Delivery success rate
- Delivery latency (p50, p95, p99)
- Queue depth (queued jobs)
- DLQ size (failed deliveries)

Alerts:
- Alert if success rate < 99.8% (consuming error budget)
- Alert if DLQ size > 100 (cascading failures)
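The error budget figures in the SLO document can be checked with a few lines of arithmetic. The sketch below expresses the budget in minutes of total outage; `budgetConsumed` is a hypothetical helper name.

```typescript
const sloTarget = 0.999;            // 99.9% from the SLO
const windowMinutes = 30 * 24 * 60; // 30-day window = 43,200 minutes

// Minutes of total outage the SLO tolerates per window.
const budgetMinutes = (1 - sloTarget) * windowMinutes;

// Share of the error budget consumed by a given number of bad minutes.
function budgetConsumed(badMinutes: number): number {
  return badMinutes / budgetMinutes;
}

console.log(budgetMinutes.toFixed(1));               // 43.2 minutes
console.log((budgetConsumed(1) * 100).toFixed(1));   // 2.3 (% of the budget)
console.log(((1 / windowMinutes) * 100).toFixed(4)); // 0.0023 (% of the window)
```

Because the indicator is defined per event rather than per minute, the same budget can also be read as 0.1% of all events in the window being allowed to miss the 5-minute target.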
Emit an event, query the logs for its trace ID:
curl -X POST http://localhost:3000/events \
-H 'Content-Type: application/json' \
-d '{"type":"order.created","payload":{"id":123}}'
# Copy the traceId from the response logs
# Query logs for this trace:
grep "abc-123" app.log
# Should see: enqueue → delivery jobs → worker picks up → POST attempt → success/retry
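When the log file holds one JSON object per line (pino's default output when the pino-pretty transport is not configured), the same lookup can be done on parsed records instead of raw text. This is a minimal sketch under that assumption; `filterByTrace` is a hypothetical helper and the sample lines, including the `delivery_succeeded` type, are invented for illustration.

```typescript
// Return the parsed log records whose traceId matches, in file order.
// Lines that are not valid JSON (e.g. pretty-printed output) are skipped.
function filterByTrace(lines: string[], traceId: string): Record<string, unknown>[] {
  const matches: Record<string, unknown>[] = [];
  for (const line of lines) {
    try {
      const record = JSON.parse(line);
      if (record !== null && typeof record === 'object' && record.traceId === traceId) {
        matches.push(record);
      }
    } catch {
      // Not a JSON line; ignore it.
    }
  }
  return matches;
}

// Hypothetical log lines standing in for the contents of app.log.
const sampleLines = [
  '{"level":30,"traceId":"abc-123","type":"event_received"}',
  '{"level":30,"traceId":"zzz-999","type":"event_received"}',
  'not json',
  '{"level":30,"traceId":"abc-123","type":"delivery_succeeded"}'
];

console.log(filterByTrace(sampleLines, 'abc-123').map((r) => r.type));
```

To run it against a real file, the lines could come from `readFileSync('app.log', 'utf8').split('\n')` using `node:fs`.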
git add -A
git commit -m "observability: add logging, metrics, tracing, and dashboards

- Structured JSON logging with trace IDs
- Prometheus metrics: events/sec, delivery rate, latency, queue depth
- OpenTelemetry tracing for full request path
- Grafana dashboard with 4 golden signals + queue health
- SLO document: 99.9% delivery within 5 minutes
- Traceability: can follow single event through system"
git push origin main
git log --oneline shows your commit.
You now have visibility into your system. Next, you'll make it multi-tenant and isolate resources. Head to Module 08.