Architecture Deep Dive

Message Queues — Decoupling Producers from Consumers

A queue is a buffer with semantics. Producers drop work in; consumers pull it out at their own pace. The result is a system that absorbs spikes, survives consumer outages, and stays responsive under load — provided you understand the failure modes you're now opting into.


At a Glance

Basic Concepts

  • Producer: the code that publishes a message.
  • Consumer / worker: the code that processes it.
  • Broker: the system in the middle that holds messages durably (Redis, RabbitMQ, Kafka, SQS…).
  • Queue vs topic: a queue delivers each message to one consumer; a topic broadcasts to many subscribers.
  • Backlog: messages waiting to be processed. Watch this number — it's the most honest signal you have.
  • DLQ (dead-letter queue): a side queue for messages that keep failing — for human inspection.
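To make the vocabulary concrete, here is a minimal producer/consumer sketch, assuming a Redis-backed BullMQ setup (queue name, payload, and connection details are illustrative):

import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Producer: publish a message and move on
const emails = new Queue('emails', { connection });
await emails.add('welcome', { userId: 42 });

// Consumer / worker: pulls from the backlog at its own pace
new Worker('emails', async (job) => {
  console.log('processing', job.name, job.data);       // real work goes here
}, { connection });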
Why

What Queues Are Good For

  • Decoupling latency. A user-facing redirect shouldn't wait on geo-IP enrichment or analytics writes.
  • Absorbing spikes. The producer side can briefly exceed consumer capacity; the queue acts as a buffer.
  • Tolerating outages. If the worker is down, work piles up and drains when it returns — no data lost.
  • Smoothing variable work. Sending an email takes 50–5000ms. Don't make every web request pay that.
  • Fan-out. One event, many independent consumers — analytics, search index, billing, notifications.
  • Scheduled / delayed work. "Send a reminder in 24 hours."
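The delayed-work case in particular is usually a one-liner. A sketch using BullMQ's delay option (queue and job names are illustrative):

import { Queue } from 'bullmq';

const reminders = new Queue('reminders', { connection: { host: 'localhost', port: 6379 } });

// Deliver this job roughly 24 hours from now (delay is in milliseconds)
await reminders.add('send-reminder', { userId: 42 }, { delay: 24 * 60 * 60 * 1000 });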
Choices

Brokers, Briefly

| Broker | Sweet spot | Notes |
|---|---|---|
| Redis (BullMQ, Sidekiq, RQ, Celery+Redis) | App-level job queues, low setup cost. | Already in your stack as a cache; fine for moderate volume; durability is "good enough" with AOF. |
| RabbitMQ | Classic AMQP queues, complex routing. | Mature; flexible exchange/binding model. Single-broker mental model. |
| Kafka / Redpanda | High-throughput event streams, replay. | A log, not a queue. Consumers track their own offsets. Great for analytics, painful as a generic job queue. |
| AWS SQS / GCP Pub/Sub / Azure Service Bus | Managed, near-infinite scale, pay per message. | No ops; visibility timeouts and DLQs built in. Vendor lock-in comes with the convenience. |
| NATS / NATS JetStream | Low-latency messaging with optional persistence. | Lightweight, fast, increasingly popular for microservices. |
| Postgres-as-a-queue (e.g., River, Graphile Worker) | Small/mid teams already on Postgres. | Transactional with your DB; one less moving part. Has a clear scale ceiling, but it's surprisingly high. |

Default for small teams: Postgres-as-a-queue or Redis-backed. Reach for Kafka when you need replay/streaming, not just async jobs.
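If you take the Postgres route, the core mechanism that libraries like River and Graphile Worker build on is FOR UPDATE SKIP LOCKED, which lets many workers poll the same table without blocking each other. A minimal sketch, assuming a jobs(id, status, payload) table and the pg client (handle() is an app-level stand-in):

import pg from 'pg';

const pool = new pg.Pool();

async function claimAndProcessOne() {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // SKIP LOCKED: concurrent workers pass over rows another worker
    // has already claimed instead of blocking on them.
    const { rows } = await client.query(
      `SELECT id, payload FROM jobs
       WHERE status = 'pending'
       ORDER BY id
       LIMIT 1
       FOR UPDATE SKIP LOCKED`);
    if (rows.length === 0) { await client.query('COMMIT'); return false; }

    await handle(rows[0].payload);                     // your handler
    await client.query(
      `UPDATE jobs SET status = 'done' WHERE id = $1`, [rows[0].id]);
    await client.query('COMMIT');                      // claim and status change commit together
    return true;
  } catch (err) {
    await client.query('ROLLBACK');                    // job becomes visible again
    throw err;
  } finally {
    client.release();
  }
}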

Semantics

How Often Does a Message Get Processed?

| Guarantee | Meaning | Reality |
|---|---|---|
| At-most-once | 0 or 1 deliveries. | Cheap. Fine for fire-and-forget telemetry; unsafe for anything you can't lose. |
| At-least-once | 1 or more deliveries. | The pragmatic default. Combine with idempotent handlers and you've got "effectively once". |
| Exactly-once | Exactly 1 delivery. | Distributed-systems folklore. True end-to-end exactly-once is rare and expensive; usually you build it from at-least-once + idempotency. |

The right phrase isn't "exactly-once delivery", it's "exactly-once effect" — and you achieve it by making handlers idempotent, not by tuning the broker.

Idempotency

The Most Important Property of a Worker

If processing the same message twice produces the same outcome as processing it once, you're idempotent. If not, every retry is a potential bug.

  • Use a stable message ID. Producer assigns it; persist it on the consumer side.
  • De-dupe by ID. A unique constraint on processed_messages(message_id) is often enough (see the sketch after this list).
  • Make state changes conditional. UPDATE … WHERE status = 'pending' beats unconditional writes.
  • Outbound side-effects need care. Hitting a third-party API twice is the kind of bug that goes undetected until accounting calls.
  • Use the broker's de-dup window if available (SQS FIFO, Kafka exactly-once producers) — but treat it as defense in depth, not the design.
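A minimal sketch of the first three bullets, assuming Postgres, the pg client, and a processed_messages(message_id) table (table, column, and domain names are illustrative; both statements should run in one transaction):

// De-dupe by ID: the unique constraint does the work
async function handleOnce(client, msg) {
  const res = await client.query(
    `INSERT INTO processed_messages (message_id) VALUES ($1)
     ON CONFLICT (message_id) DO NOTHING`,
    [msg.id]);
  if (res.rowCount === 0) return;                      // duplicate: already processed

  // Conditional state change: a redelivery after success is a no-op
  await client.query(
    `UPDATE orders SET status = 'paid'
     WHERE id = $1 AND status = 'pending'`,
    [msg.orderId]);
}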
Retries

Failing Gracefully

  • Retry transient failures, not bugs. A network blip, a database deadlock, a 503 from a downstream — yes. A NullPointerException on the same input — no, that's a poison message.
  • Exponential backoff with jitter. 1s, 2s, 4s, 8s, ± random. Avoids thundering herds against the same downstream.
  • Bounded attempts. After N tries, send to the DLQ. Don't loop forever.
  • Visibility timeouts (SQS, Redis) keep a message hidden while a worker holds it; if the worker dies, it reappears. Pick a value comfortably larger than your p99 handler time.
  • Distinguish retryable vs terminal errors in your handler. Throwing a NonRetryable error should bypass retries and go straight to DLQ.
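A sketch of that retry loop, with full jitter and a terminal-error escape hatch (NonRetryableError, handle(), and deadLetter() are illustrative names; real job libraries implement most of this for you):

class NonRetryableError extends Error {}

// Full jitter: pick a random delay up to the exponential cap
function backoffMs(attempt, baseMs = 1000, capMs = 60_000) {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);  // ~1s, 2s, 4s, 8s...
}

async function processWithRetries(msg, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await handle(msg);
    } catch (err) {
      if (err instanceof NonRetryableError) break;     // poison message: stop retrying
      if (attempt < maxAttempts - 1)
        await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
  await deadLetter(msg);                               // bounded attempts, then DLQ
}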
DLQs

Where Bad Messages Go to Wait for a Human

  • Every queue should have a DLQ. No exceptions.
  • DLQ depth is an alertable metric — a non-zero DLQ usually means a recent code change broke something.
  • Build a tool to replay DLQ messages after a fix; manual replays from the broker UI don't scale (a minimal replayer is sketched below).
  • Capture the failure reason alongside the message — handler exceptions, downstream errors, timeouts.
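A replayer can be as small as a loop that drains the dead queue back onto the live one. A sketch in BullMQ terms, matching the 'click' / 'click:dead' naming used in the worked example below (treating the DLQ as an ordinary queue is an assumption of this sketch):

import { Queue } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const live = new Queue('click', { connection });
const dead = new Queue('click:dead', { connection });

// Replay after the fix has shipped; remove only after a successful re-enqueue
const jobs = await dead.getJobs(['waiting', 'failed']);
for (const job of jobs) {
  await live.add(job.name, job.data, { attempts: 5 }); // re-enters the normal retry policy
  await job.remove();
}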
Ordering & Concurrency

Pick Two of Three: Order, Throughput, Simplicity

  • Order is expensive. Strict ordering means a single consumer per partition/key. Throughput is bounded by that.
  • Partition by a key (user ID, tenant ID) so that per-key order is preserved while you scale across keys. Kafka, SQS FIFO, and most modern brokers support this (see the sketch after this list).
  • Concurrency caps per consumer — both globally and per key — protect downstreams.
  • Most jobs don't need ordering. Don't pay for it if you don't need it.
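A sketch of keyed publishing with kafkajs (topic, broker address, and the event variable are illustrative). The broker hashes the key to a partition, so one user's events stay ordered while throughput scales across users:

import { Kafka } from 'kafkajs';

const kafka = new Kafka({ brokers: ['localhost:9092'] });
const producer = kafka.producer();
await producer.connect();

await producer.send({
  topic: 'clicks',
  messages: [
    // Same key, same partition, same consumer: per-user order preserved
    { key: String(event.userId), value: JSON.stringify(event) },
  ],
});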
Reliable Publishing

The Outbox Pattern

Publishing to a queue and writing to a database are two separate operations. If the broker is unavailable after the DB commit, you've silently lost the event. The fix:

  1. Write the message into an outbox table in the same DB transaction as your business write.
  2. A separate process reads new outbox rows and publishes them to the broker.
  3. On successful publish, mark the row sent (or delete it).

Now your business write and your "intent to publish" are atomic. The publisher can crash and recover without losing events.
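A sketch of both halves, assuming Postgres, the pg client, and an outbox(id, topic, payload, sent_at) table (publish() stands in for the broker call; error handling is trimmed to the essentials):

// 1) Business write and intent-to-publish commit or roll back together
async function createLink(client, link, event) {
  await client.query('BEGIN');
  await client.query(
    `INSERT INTO links (id, target_url) VALUES ($1, $2)`,
    [link.id, link.targetUrl]);
  await client.query(
    `INSERT INTO outbox (id, topic, payload) VALUES ($1, $2, $3)`,
    [event.id, 'link.created', JSON.stringify(event)]);
  await client.query('COMMIT');
}

// 2) + 3) Relay: publish unsent rows, then mark them sent. A crash between
// publish and UPDATE means a re-publish on restart, which is exactly the
// at-least-once + idempotent-consumer story above.
async function relayOnce(client, publish) {
  await client.query('BEGIN');
  try {
    const { rows } = await client.query(
      `SELECT id, topic, payload FROM outbox
       WHERE sent_at IS NULL ORDER BY id LIMIT 100
       FOR UPDATE SKIP LOCKED`);                       // safe to run several relays
    for (const row of rows) {
      await publish(row.topic, row.payload);
      await client.query(
        `UPDATE outbox SET sent_at = now() WHERE id = $1`, [row.id]);
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}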

Observability

What to Watch

  • Backlog depth per queue — flat or growing?
  • Age of the oldest message — better tail metric than depth alone.
  • Throughput (msgs/sec in vs out) and per-message latency p50/p95/p99.
  • Error rate by error class; retry rate.
  • DLQ size — alert on any non-zero value during steady state.
  • Trace context propagation — pass the OpenTelemetry trace ID through the message so async work shows up in your distributed traces.
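For trace propagation, the usual move is to inject the active context into a field on the message and restore it in the worker. A sketch with @opentelemetry/api (the otel field name and the surrounding queue/job/handle variables are illustrative):

import { context, propagation } from '@opentelemetry/api';

// Producer: serialize the active trace context onto the message
const carrier = {};
propagation.inject(context.active(), carrier);
await queue.add('click', { ...payload, otel: carrier });

// Consumer: restore it so spans created in handle() join the original trace
const parent = propagation.extract(context.active(), job.data.otel);
await context.with(parent, () => handle(job.data));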
Worked Example

The URL Shortener Click Pipeline

Every redirect publishes a click event to a queue. A worker enriches it (geo-IP, UA parsing) and writes to the clicks table. The redirect itself never waits for the write.

// Producer — on the redirect path (BullMQ-style; the `redis` connection and
// the db/enrich helpers are app-level code, sketched here)
import { Queue, Worker } from 'bullmq';
import crypto from 'node:crypto';

const queue = new Queue('click', { connection: redis });

async function redirect(req, res, link) {
  res.redirect(302, link.target_url);                  // user gets out fast

  // Fire-and-forget enqueue (with an outbox in real life)
  await queue.add('click', {
    id:          crypto.randomUUID(),                  // stable message ID
    link_id:     link.id,
    occurred_at: new Date().toISOString(),
    ip:          req.ip,
    ua:          req.get('user-agent'),
    referrer:    req.get('referer') ?? null
  }, { attempts: 5, backoff: { type: 'exponential', delay: 1000 } });
}

// Consumer
const worker = new Worker('click', async (job) => {
  const m = job.data;

  // Idempotency: unique index on (message_id) absorbs duplicates
  const inserted = await db.insertClickIfNew(m.id, m);
  if (!inserted) return;                               // already processed

  const enriched = await enrich(m);                    // geo, UA parsing
  await db.upsertClickEnrichment(m.id, enriched);
}, { connection: redis });

// DLQ alarm: queue 'click:dead' depth > 0 for 5 minutes → page on-call

Notice: redirect latency is unaffected by analytics; duplicates are absorbed; failures retry with backoff and end up in a DLQ if they don't succeed; trace ID would be propagated in a real implementation.

Common Pitfalls

Lessons Paid For in Pages

  • Non-idempotent handlers. One retry charges the customer twice.
  • No DLQ. Poison messages loop forever, eat capacity, hide other work.
  • Visibility timeout shorter than handler time. Two workers process the same message in parallel. Surprise.
  • Synchronous publish in the request path. Broker hiccup → user-facing 500. Use the outbox.
  • Treating Kafka like a queue. No ack-based redelivery, offsets you have to manage, retention windows that drop messages on you.
  • Fan-out without backpressure. One slow consumer can take the whole pipeline down.
  • Logging the whole payload. PII in your log pipeline. Scrub.