Webhook Delivery Tutorial · Module 04 of 10

Dead Letter Queue & Failure Handling

Permanently failed deliveries shouldn't disappear. Implement a DLQ, expose failure analytics, and allow manual recovery. By the end, you'll have full visibility into permanent failures.

~4–6 hrs · Intermediate · Observability focus
What You'll Have at the End

Definition of Done

  • GET /dlq endpoint lists all permanently failed deliveries (paginated).
  • POST /dlq/:delivery_id/retry allows manual retries from the DLQ.
  • Failed deliveries are queryable: by webhook ID, by date, with error details.
  • Metrics and alerts fire when deliveries move to the DLQ.
  • Clear documentation of transient vs permanent failure classification.
The Steps

Build It

STEP 1

Create DLQ table schema

Update src/db/migrations.ts to add:

CREATE TABLE IF NOT EXISTS dead_letter_queue (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  delivery_id UUID NOT NULL REFERENCES deliveries(id),
  webhook_id UUID NOT NULL REFERENCES webhooks(id),
  event_id UUID NOT NULL REFERENCES events(id),
  last_error TEXT,
  error_count INT DEFAULT 1,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

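-- IF NOT EXISTS keeps the migration safe to re-run, matching the CREATE TABLE above
CREATE INDEX IF NOT EXISTS idx_dlq_webhook_id ON dead_letter_queue(webhook_id);
CREATE INDEX IF NOT EXISTS idx_dlq_created_at ON dead_letter_queue(created_at);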
✓ Verify: Run the migrations with npm run dev and confirm it completes without errors.
STEP 2

Update the worker to move failed deliveries to DLQ

Modify src/worker/deliveryWorker.ts where permanent failure is detected:

// nextRetryAt === -1 means no retries remain (attempt >= MAX_RETRIES):
if (nextRetryAt === -1) {
  // Move to DLQ
  await pool.query(
    'INSERT INTO dead_letter_queue (delivery_id, webhook_id, event_id, last_error) VALUES ($1, $2, $3, $4)',
    [delivery_id, job.data.webhook_id, job.data.event_id, errorMsg]
  );

  await pool.query(
    'UPDATE deliveries SET status = $1, attempts = $2, last_error = $3, updated_at = NOW() WHERE id = $4',
    ['failed', attempt + 1, errorMsg, delivery_id]
  );

  console.log(`Delivery ${delivery_id} moved to DLQ`);
}
✓ Verify: Trigger permanent failures (5+ retries) and check that the dead_letter_queue table has records.
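The two statements in this step run as separate queries, so a crash between them could leave a DLQ row whose delivery is never marked failed. A minimal sketch of wrapping both in one transaction follows. It assumes a client that exposes query(text, params), such as one checked out from the pool in node-postgres; the TxClient interface, the FailedDelivery shape, and the moveToDlq name are hypothetical and introduced only for illustration.

```typescript
// Hypothetical minimal shape of a database client. The intended argument is a
// node-postgres client checked out with pool.connect().
interface TxClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Hypothetical bundle of the values the worker already has at this point.
interface FailedDelivery {
  deliveryId: string;
  webhookId: string;
  eventId: string;
  attempts: number;
  errorMsg: string;
}

// Insert the DLQ row and mark the delivery failed as one atomic unit.
async function moveToDlq(client: TxClient, d: FailedDelivery): Promise<void> {
  await client.query('BEGIN');
  try {
    await client.query(
      'INSERT INTO dead_letter_queue (delivery_id, webhook_id, event_id, last_error) VALUES ($1, $2, $3, $4)',
      [d.deliveryId, d.webhookId, d.eventId, d.errorMsg]
    );
    await client.query(
      'UPDATE deliveries SET status = $1, attempts = $2, last_error = $3, updated_at = NOW() WHERE id = $4',
      ['failed', d.attempts, d.errorMsg, d.deliveryId]
    );
    await client.query('COMMIT');
  } catch (err) {
    // Undo the partial write, then let the caller see the original error.
    await client.query('ROLLBACK');
    throw err;
  }
}
```

With node-postgres, the caller would check out a client with pool.connect(), pass it to moveToDlq, and call client.release() in a finally block so the connection returns to the pool either way.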
STEP 3

Implement GET /dlq endpoint

Create src/routes/dlq.ts:

import { Router, Request, Response } from 'express';
import pool from '../db/pool';

const router = Router();

// GET /dlq - list failed deliveries
router.get('/', async (req: Request, res: Response) => {
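  // Clamp to 1 so a zero, negative, or non-numeric page never yields a negative OFFSET
  const page = Math.max(1, parseInt(req.query.page as string, 10) || 1);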
  const limit = 50;
  const offset = (page - 1) * limit;

  try {
    const countResult = await pool.query('SELECT COUNT(*) FROM dead_letter_queue');
    // node-postgres returns COUNT(*) (a bigint) as a string, so parse it explicitly
    const total = parseInt(countResult.rows[0].count, 10);

    const result = await pool.query(`
      SELECT dlq.*, w.url as webhook_url, e.type as event_type
      FROM dead_letter_queue dlq
      JOIN webhooks w ON dlq.webhook_id = w.id
      JOIN events e ON dlq.event_id = e.id
      ORDER BY dlq.created_at DESC
      LIMIT $1 OFFSET $2
    `, [limit, offset]);

    res.json({
      data: result.rows,
      pagination: {
        page,
        limit,
        total,
        pages: Math.ceil(total / limit)
      }
    });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: 'Failed to fetch DLQ' });
  }
});

// GET /dlq/webhook/:webhook_id - failures for a specific webhook
router.get('/webhook/:webhook_id', async (req: Request, res: Response) => {
  try {
    const result = await pool.query(`
      SELECT dlq.*, e.type as event_type
      FROM dead_letter_queue dlq
      JOIN events e ON dlq.event_id = e.id
      WHERE dlq.webhook_id = $1
      ORDER BY dlq.created_at DESC
    `, [req.params.webhook_id]);

    res.json(result.rows);
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: 'Failed to fetch webhook DLQ' });
  }
});

export default router;

Add to src/index.ts:

import dlqRouter from './routes/dlq';
app.use('/dlq', dlqRouter);
✓ Verify: curl http://localhost:3000/dlq returns paginated failed deliveries.
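The Definition of Done also asks for filtering by date, which the two routes in this step do not cover. One possible approach is a small helper that builds the WHERE clause from optional query parameters. The buildDlqFilter name, the DlqFilter shape, and the from/to parameter names are assumptions made for this sketch, not part of the tutorial's code.

```typescript
// Hypothetical filter options, e.g. read from req.query in the GET /dlq handler.
interface DlqFilter {
  webhookId?: string;
  from?: string; // inclusive lower bound on created_at, e.g. '2024-01-01'
  to?: string;   // exclusive upper bound on created_at
}

interface SqlFragment {
  where: string;    // '' when no filter is set, otherwise 'WHERE ...'
  params: string[]; // values for $1, $2, ... in placeholder order
}

// Build a parameterized WHERE clause; values never get concatenated into the SQL.
function buildDlqFilter(filter: DlqFilter): SqlFragment {
  const clauses: string[] = [];
  const params: string[] = [];

  if (filter.webhookId) {
    params.push(filter.webhookId);
    clauses.push(`dlq.webhook_id = $${params.length}`);
  }
  if (filter.from) {
    params.push(filter.from);
    clauses.push(`dlq.created_at >= $${params.length}`);
  }
  if (filter.to) {
    params.push(filter.to);
    clauses.push(`dlq.created_at < $${params.length}`);
  }

  return {
    where: clauses.length > 0 ? `WHERE ${clauses.join(' AND ')}` : '',
    params,
  };
}
```

In the handler, the where string would go into both the COUNT(*) query and the SELECT, with the LIMIT and OFFSET placeholders numbered after the filter parameters (params.length + 1 and params.length + 2). The existing idx_dlq_webhook_id and idx_dlq_created_at indexes cover both filter columns.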
STEP 4

Implement POST /dlq/:delivery_id/retry

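Add to src/routes/dlq.ts, keeping the import with the others at the top of the file and the route above export default router: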

import deliveryQueue from '../queue/deliveryQueue';

// POST /dlq/:delivery_id/retry - retry a failed delivery
router.post('/:delivery_id/retry', async (req: Request, res: Response) => {
  try {
    // Get the DLQ entry
    const dlqResult = await pool.query(
      'SELECT * FROM dead_letter_queue WHERE delivery_id = $1',
      [req.params.delivery_id]
    );

    if (dlqResult.rows.length === 0) {
      return res.status(404).json({ error: 'Delivery not found in DLQ' });
    }

    const dlq = dlqResult.rows[0];

    // Get delivery and webhook info
    const deliveryResult = await pool.query(
      'SELECT * FROM deliveries WHERE id = $1',
      [req.params.delivery_id]
    );

    const webhookResult = await pool.query(
      'SELECT * FROM webhooks WHERE id = $1',
      [dlq.webhook_id]
    );

    const eventResult = await pool.query(
      'SELECT * FROM events WHERE id = $1',
      [dlq.event_id]
    );

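    // Check the event too: its payload is read below when re-enqueueing
    if (
      deliveryResult.rows.length === 0 ||
      webhookResult.rows.length === 0 ||
      eventResult.rows.length === 0
    ) {
      return res.status(404).json({ error: 'Delivery, webhook, or event not found' });
    }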

    // Re-enqueue
    await deliveryQueue.add({
      delivery_id: req.params.delivery_id,
      webhook_id: dlq.webhook_id,
      event_id: dlq.event_id,
      webhook_url: webhookResult.rows[0].url,
      webhook_secret: webhookResult.rows[0].secret,
      payload: eventResult.rows[0].payload,
      attempt: 0
    });

    // Update status back to pending
    await pool.query(
      'UPDATE deliveries SET status = $1, attempts = 0, next_retry_at = NULL WHERE id = $2',
      ['pending', req.params.delivery_id]
    );

    // Remove from DLQ
    await pool.query(
      'DELETE FROM dead_letter_queue WHERE delivery_id = $1',
      [req.params.delivery_id]
    );

    res.json({ message: 'Delivery requeued', delivery_id: req.params.delivery_id });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: 'Failed to retry delivery' });
  }
});
✓ Verify: curl -X POST http://localhost:3000/dlq/<delivery_id>/retry (substituting a real delivery ID from the DLQ) sets the delivery back to pending and removes it from the DLQ.
STEP 5

Add DLQ alerts/metrics

Create src/metrics/dlqMetrics.ts:

import pool from '../db/pool';

export async function trackDLQGrowth() {
  const result = await pool.query(
    'SELECT COUNT(*) as count FROM dead_letter_queue'
  );
  const count = parseInt(result.rows[0].count);

  // Log as metric (or push to Prometheus, CloudWatch, etc.)
  console.log(`[METRIC] dlq.size=${count}`);

  if (count > 10) {
    console.warn(`[ALERT] DLQ is growing: ${count} failed deliveries`);
  }

  return count;
}

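// Call this periodically (e.g., every 5 minutes). The timer starts when this module is imported.
// Catch errors here so a failed query doesn't become an unhandled promise rejection.
setInterval(() => {
  trackDLQGrowth().catch((err) => console.error('DLQ metrics check failed', err));
}, 5 * 60 * 1000);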
✓ Verify: Logs show DLQ size metrics periodically.
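The check in this step alerts on absolute size (more than 10 rows), not on growth. To flag growth between samples instead, one option is to compare each reading with the previous one. This is a sketch only; createGrowthDetector and its alertDelta parameter are hypothetical names, and the threshold is something you would tune for your own traffic.

```typescript
// Returns a function that reports how much the DLQ grew since the last sample
// and whether that growth reaches the alert threshold.
function createGrowthDetector(alertDelta: number) {
  let previous: number | null = null;

  return function check(current: number): { delta: number; alert: boolean } {
    // The first sample has nothing to compare against, so it never alerts.
    if (previous === null) {
      previous = current;
      return { delta: 0, alert: false };
    }
    const delta = current - previous;
    previous = current;
    return { delta, alert: delta >= alertDelta };
  };
}
```

A caller could create one detector at startup and feed it the count returned by trackDLQGrowth() on each tick, logging an alert whenever alert is true. A shrinking queue (for example after manual retries) produces a negative delta and no alert.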
STEP 6

Document failure classification

Update docs/design.md with:

## Failure Classification

### Transient Failures
- Network timeout (retry with backoff)
- 5xx server error (retry with backoff)
- 408 Request Timeout or 429 Too Many Requests (retry with backoff)
- Connection refused (retry with backoff)

### Permanent Failures
- 4xx client error (except 408, 429)
- URL unreachable (DNS failure, etc.)
- Max retries exceeded (5 attempts)

Failed deliveries move to the dead-letter queue for manual inspection and recovery.
✓ Verify: Documentation is committed and clear.
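The classification documented in this step can be sketched as a small function. The DeliveryOutcome shape, the classifyFailure name, and the MAX_ATTEMPTS constant are assumptions made for this example; the worker from earlier modules may represent errors differently. The sketch reads "except 408, 429" as meaning those two statuses are retryable, uses the Node.js ENOTFOUND error code to detect DNS failure, and treats anything it does not recognize as transient.

```typescript
type FailureClass = 'transient' | 'permanent';

// Hypothetical description of one failed delivery attempt.
interface DeliveryOutcome {
  status?: number;    // HTTP status, if a response was received
  errorCode?: string; // Node.js error code, if the request failed before a response
  attempt: number;    // attempts made so far, including this one
}

const MAX_ATTEMPTS = 5; // mirrors "Max retries exceeded (5 attempts)"

// Intended to be called only for failed attempts (non-2xx response or network error).
function classifyFailure(o: DeliveryOutcome): FailureClass {
  // Out of attempts: permanent regardless of the cause.
  if (o.attempt >= MAX_ATTEMPTS) return 'permanent';

  if (o.status !== undefined) {
    // 408 and 429 are the retryable exceptions among 4xx responses.
    if (o.status === 408 || o.status === 429) return 'transient';
    if (o.status >= 400 && o.status < 500) return 'permanent';
    if (o.status >= 500) return 'transient';
  }

  // DNS lookup failure: the URL is unreachable, so retrying is unlikely to help.
  if (o.errorCode === 'ENOTFOUND') return 'permanent';

  // Timeouts, refused connections, and anything unrecognized get retried.
  return 'transient';
}
```

Defaulting unknown errors to transient is a design choice: the attempt cap still moves them to the DLQ eventually, whereas defaulting to permanent would skip retries for errors that might have recovered.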
STEP 7

Commit your work

git add -A
git commit -m "feat: implement dead-letter queue for failure handling

- DLQ table for permanently failed deliveries
- GET /dlq endpoint with pagination and filtering
- POST /dlq/:id/retry for manual recovery
- DLQ growth metrics and alerts
- Clear documentation of transient vs permanent failures"
git push origin main
✓ Verify: git log --oneline shows your commit.
Next Steps

Ready for Module 05?

You now have full failure handling and visibility. Next, you'll test this system: unit tests, integration tests, and chaos engineering. Head to Module 05.