Permanently failed deliveries shouldn't disappear. Implement a DLQ, expose failure analytics, and allow manual recovery. By the end, you'll have full visibility into permanent failures.
GET /dlq endpoint lists all permanently failed deliveries (paginated).
POST /dlq/:delivery_id/retry allows manual retries from the DLQ.
Update src/db/migrations.ts to add:
CREATE TABLE IF NOT EXISTS dead_letter_queue (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  delivery_id UUID NOT NULL REFERENCES deliveries(id),
  webhook_id UUID NOT NULL REFERENCES webhooks(id),
  event_id UUID NOT NULL REFERENCES events(id),
  last_error TEXT,
  error_count INT DEFAULT 1,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_dlq_webhook_id ON dead_letter_queue(webhook_id);
CREATE INDEX idx_dlq_created_at ON dead_letter_queue(created_at);
npm run dev completes without errors.
Modify src/worker/deliveryWorker.ts where permanent failure is detected:
// When attempt >= MAX_RETRIES:
if (nextRetryAt === -1) {
  // Move to DLQ
  await pool.query(
    'INSERT INTO dead_letter_queue (delivery_id, webhook_id, event_id, last_error) VALUES ($1, $2, $3, $4)',
    [delivery_id, job.data.webhook_id, job.data.event_id, errorMsg]
  );
  // Mark the delivery itself as failed
  await pool.query(
    'UPDATE deliveries SET status = $1, attempts = $2, last_error = $3, updated_at = NOW() WHERE id = $4',
    ['failed', attempt + 1, errorMsg, delivery_id]
  );
  console.log(`Delivery ${delivery_id} moved to DLQ`);
}
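The two writes above run as separate statements, so a crash between them could leave a DLQ row whose delivery is not yet marked failed. One way to avoid that, sketched here under assumptions, is to run both in a single transaction. The `moveToDlq` helper and the `QueryClient` and `FailedDelivery` types are hypothetical names, not part of the module's code; with node-postgres the client would typically come from `pool.connect()` and be released afterwards.

```typescript
// Minimal shape of a database client (assumption: anything with a pg-style query method).
interface QueryClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Hypothetical input type bundling the values the worker already has.
interface FailedDelivery {
  deliveryId: string;
  webhookId: string;
  eventId: string;
  attempts: number;
  errorMsg: string;
}

// Hypothetical helper: inserts the DLQ row and marks the delivery failed
// inside one transaction, rolling back if either statement throws.
async function moveToDlq(client: QueryClient, d: FailedDelivery): Promise<void> {
  await client.query('BEGIN');
  try {
    await client.query(
      'INSERT INTO dead_letter_queue (delivery_id, webhook_id, event_id, last_error) VALUES ($1, $2, $3, $4)',
      [d.deliveryId, d.webhookId, d.eventId, d.errorMsg]
    );
    await client.query(
      'UPDATE deliveries SET status = $1, attempts = $2, last_error = $3, updated_at = NOW() WHERE id = $4',
      ['failed', d.attempts, d.errorMsg, d.deliveryId]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // let the worker's own error handling see the failure
  }
}
```

Taking the client as a parameter keeps the helper easy to test with a fake that records statements, and leaves connection management to the caller.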
Create src/routes/dlq.ts:
import { Router, Request, Response } from 'express';
import pool from '../db/pool';

const router = Router();

// GET /dlq - list failed deliveries
router.get('/', async (req: Request, res: Response) => {
  // Clamp to page 1 so a zero or negative ?page= cannot produce a negative OFFSET
  const page = Math.max(1, parseInt(req.query.page as string, 10) || 1);
  const limit = 50;
  const offset = (page - 1) * limit;
  try {
    const countResult = await pool.query('SELECT COUNT(*) FROM dead_letter_queue');
    // COUNT(*) comes back as a string, so parse it
    const total = parseInt(countResult.rows[0].count, 10);
    const result = await pool.query(`
      SELECT dlq.*, w.url as webhook_url, e.type as event_type
      FROM dead_letter_queue dlq
      JOIN webhooks w ON dlq.webhook_id = w.id
      JOIN events e ON dlq.event_id = e.id
      ORDER BY dlq.created_at DESC
      LIMIT $1 OFFSET $2
    `, [limit, offset]);
    res.json({
      data: result.rows,
      pagination: {
        page,
        limit,
        total,
        pages: Math.ceil(total / limit)
      }
    });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: 'Failed to fetch DLQ' });
  }
});

// GET /dlq/webhook/:webhook_id - failures for a specific webhook
router.get('/webhook/:webhook_id', async (req: Request, res: Response) => {
  try {
    const result = await pool.query(`
      SELECT dlq.*, e.type as event_type
      FROM dead_letter_queue dlq
      JOIN events e ON dlq.event_id = e.id
      WHERE dlq.webhook_id = $1
      ORDER BY dlq.created_at DESC
    `, [req.params.webhook_id]);
    res.json(result.rows);
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: 'Failed to fetch webhook DLQ' });
  }
});

export default router;
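The pagination arithmetic in the list route can be pulled into small pure functions, which makes the edge cases (missing, non-numeric, or negative `page` values) easy to check without a database. This is only a sketch: `parsePage`, `totalPages`, and `PageParams` are hypothetical names, and the default limit of 50 mirrors the route above.

```typescript
interface PageParams {
  page: number;   // 1-based page number
  limit: number;  // rows per page
  offset: number; // value to pass as SQL OFFSET
}

// Hypothetical helper: turns a raw query-string value into safe paging values.
// Anything that is not a positive integer falls back to page 1.
function parsePage(raw: unknown, limit = 50): PageParams {
  const parsed = typeof raw === 'string' ? parseInt(raw, 10) : NaN;
  const page = Number.isInteger(parsed) && parsed > 0 ? parsed : 1;
  return { page, limit, offset: (page - 1) * limit };
}

// Hypothetical helper: number of pages needed to show `total` rows.
function totalPages(total: number, limit: number): number {
  return Math.ceil(total / limit);
}
```

For example, `parsePage('3')` yields an offset of 100 with the default limit, and `totalPages(101, 50)` is 3 because the last page holds a single row.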
Add to src/index.ts:
import dlqRouter from './routes/dlq';
app.use('/dlq', dlqRouter);
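curl http://localhost:3000/dlq returns paginated failed deliveries.
Add to src/routes/dlq.ts (put the import with the others at the top of the file, and the route above export default router):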
import deliveryQueue from '../queue/deliveryQueue';

// POST /dlq/:delivery_id/retry - retry a failed delivery
router.post('/:delivery_id/retry', async (req: Request, res: Response) => {
  try {
    // Get the DLQ entry
    const dlqResult = await pool.query(
      'SELECT * FROM dead_letter_queue WHERE delivery_id = $1',
      [req.params.delivery_id]
    );
    if (dlqResult.rows.length === 0) {
      return res.status(404).json({ error: 'Delivery not found in DLQ' });
    }
    const dlq = dlqResult.rows[0];

    // Get delivery, webhook, and event info
    const deliveryResult = await pool.query(
      'SELECT * FROM deliveries WHERE id = $1',
      [req.params.delivery_id]
    );
    const webhookResult = await pool.query(
      'SELECT * FROM webhooks WHERE id = $1',
      [dlq.webhook_id]
    );
    const eventResult = await pool.query(
      'SELECT * FROM events WHERE id = $1',
      [dlq.event_id]
    );
    // The event row is needed for the payload below, so check it too
    if (
      deliveryResult.rows.length === 0 ||
      webhookResult.rows.length === 0 ||
      eventResult.rows.length === 0
    ) {
      return res.status(404).json({ error: 'Delivery, webhook, or event not found' });
    }

    // Re-enqueue with a fresh attempt counter
    await deliveryQueue.add({
      delivery_id: req.params.delivery_id,
      webhook_id: dlq.webhook_id,
      event_id: dlq.event_id,
      webhook_url: webhookResult.rows[0].url,
      webhook_secret: webhookResult.rows[0].secret,
      payload: eventResult.rows[0].payload,
      attempt: 0
    });

    // Update status back to pending
    await pool.query(
      'UPDATE deliveries SET status = $1, attempts = 0, next_retry_at = NULL WHERE id = $2',
      ['pending', req.params.delivery_id]
    );

    // Remove from DLQ
    await pool.query(
      'DELETE FROM dead_letter_queue WHERE delivery_id = $1',
      [req.params.delivery_id]
    );

    res.json({ message: 'Delivery requeued', delivery_id: req.params.delivery_id });
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: 'Failed to retry delivery' });
  }
});
curl -X POST http://localhost:3000/dlq/:delivery_id/retry (substituting a real delivery ID for :delivery_id) moves the delivery back to pending and removes it from the DLQ.
Create src/metrics/dlqMetrics.ts:
import pool from '../db/pool';

export async function trackDLQGrowth() {
  const result = await pool.query(
    'SELECT COUNT(*) as count FROM dead_letter_queue'
  );
  // COUNT(*) comes back as a string, so parse it
  const count = parseInt(result.rows[0].count, 10);
  // Log as metric (or push to Prometheus, CloudWatch, etc.)
  console.log(`[METRIC] dlq.size=${count}`);
  if (count > 10) {
    console.warn(`[ALERT] DLQ is growing: ${count} failed deliveries`);
  }
  return count;
}

// Call this periodically (e.g., every 5 minutes).
// Catch errors so a failed query does not become an unhandled promise rejection.
setInterval(() => {
  trackDLQGrowth().catch((err) => console.error('DLQ metric failed', err));
}, 5 * 60 * 1000);
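The metric above reports the current DLQ size; it does not by itself say how quickly the queue is growing. A possible extension, sketched here under assumptions, compares two consecutive samples and alerts on either absolute size or growth rate. The names `DlqSample`, `growthPerMinute`, and `shouldAlert` are hypothetical, the size threshold of 10 mirrors the check above, and the rate threshold of 1 entry per minute is an arbitrary placeholder.

```typescript
// One observation of the DLQ size (hypothetical type).
interface DlqSample {
  count: number;   // rows in dead_letter_queue at sampling time
  takenAt: number; // sampling time in milliseconds since the epoch
}

// Entries added per minute between two samples; 0 if the samples are not in order.
function growthPerMinute(prev: DlqSample, curr: DlqSample): number {
  const minutes = (curr.takenAt - prev.takenAt) / 60000;
  if (minutes <= 0) return 0;
  return (curr.count - prev.count) / minutes;
}

// Alert when the queue is large OR growing quickly.
// maxGrowthPerMinute = 1 is an assumed placeholder, not a recommended value.
function shouldAlert(
  prev: DlqSample,
  curr: DlqSample,
  maxSize = 10,
  maxGrowthPerMinute = 1
): boolean {
  return curr.count > maxSize || growthPerMinute(prev, curr) > maxGrowthPerMinute;
}
```

With samples taken five minutes apart, a jump from 2 to 8 entries is 1.2 per minute, which would trip the rate check even though the size is still under 10.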
Update docs/design.md with:
## Failure Classification

### Transient Failures
- Network timeout (retry with backoff)
- 5xx server error (retry with backoff)
- Connection refused (retry with backoff)

### Permanent Failures
- 4xx client error (except 408, 429)
- URL unreachable (DNS failure, etc.)
- Max retries exceeded (5 attempts)

Failed deliveries move to the dead-letter queue for manual inspection and recovery.
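The classification above can be expressed as a small decision function. The following is a minimal sketch, not the worker's actual logic: `classifyFailure`, `AttemptOutcome`, and `MAX_ATTEMPTS` are hypothetical names, the error codes are the Node.js system error codes for timeouts, refused connections, and failed DNS lookups, and treating unrecognized errors as transient is an assumption.

```typescript
type FailureKind = 'transient' | 'permanent';

// What the worker knows after a failed attempt (hypothetical shape).
interface AttemptOutcome {
  status?: number;    // HTTP status, if a response arrived
  errorCode?: string; // Node.js-style error code, e.g. 'ETIMEDOUT'
  attempt: number;    // attempts made so far, including this one
}

const MAX_ATTEMPTS = 5; // matches "Max retries exceeded (5 attempts)"
const TRANSIENT_ERROR_CODES = new Set(['ETIMEDOUT', 'ECONNREFUSED']);

function classifyFailure(o: AttemptOutcome): FailureKind {
  // Running out of attempts is permanent regardless of the cause.
  if (o.attempt >= MAX_ATTEMPTS) return 'permanent';

  if (o.status !== undefined) {
    if (o.status === 408 || o.status === 429) return 'transient'; // retryable 4xx
    if (o.status >= 500) return 'transient';
    if (o.status >= 400) return 'permanent';
  }

  if (o.errorCode === 'ENOTFOUND') return 'permanent'; // DNS lookup failed
  if (o.errorCode !== undefined && TRANSIENT_ERROR_CODES.has(o.errorCode)) return 'transient';

  // Assumption: anything unrecognized is retried rather than dropped.
  return 'transient';
}
```

Under these rules a 503 on the first attempt is retried, a 404 goes straight to the DLQ, and a timeout on the fifth attempt becomes permanent because the retry budget is spent.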
git add -A
git commit -m "feat: implement dead-letter queue for failure handling

- DLQ table for permanently failed deliveries
- GET /dlq endpoint with pagination and filtering
- POST /dlq/:id/retry for manual recovery
- DLQ growth metrics and alerts
- Clear documentation of transient vs permanent failures"
git push origin main
git log --oneline shows your commit.
You now have full failure handling and visibility. Next, you'll test this system: unit tests, integration tests, and chaos engineering. Head to Module 05.