Server Responsibilities · 6 of 6

Background Jobs

Anything that doesn't have to finish before the user sees a response: emails, reports, ML scoring, exports, scheduled cleanups. Move it off the request path or pay in latency forever.

Queues · Workers · Sidekiq · Celery · BullMQ · SQS · Cron
Quick Facts

Basic Concepts

  • Producer: the code that enqueues a job.
  • Broker / queue: where the job waits — Redis, RabbitMQ, SQS, Kafka, the database itself.
  • Worker: the long-running process that pops jobs and executes them.
  • Idempotent: safe to run more than once. Every job should be — workers crash, retries happen.
  • Dead-letter queue (DLQ): where jobs that fail repeatedly go to be inspected instead of looping forever.
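
The three roles fit in a few lines of Python, with the stdlib `queue` standing in for the broker (a sketch — in a real system the broker is Redis, RabbitMQ, or SQS, and the worker is a separate process):

```python
import queue
import threading

# The "broker": in real systems this is Redis, RabbitMQ, SQS, etc.
broker = queue.Queue()

def enqueue_welcome_email(user_id):
    """Producer: serializes a small payload and hands it to the broker."""
    broker.put({"type": "welcome_email", "user_id": user_id})

def worker(results):
    """Worker: pops jobs and executes them until the queue is drained."""
    while True:
        try:
            job = broker.get(timeout=0.1)
        except queue.Empty:
            return
        # Dispatch on job type; a real worker would also retry on failure.
        if job["type"] == "welcome_email":
            results.append(f"sent welcome email to user {job['user_id']}")
        broker.task_done()

results = []
for uid in (1, 2, 3):
    enqueue_welcome_email(uid)   # producer side: fast, non-blocking

t = threading.Thread(target=worker, args=(results,))
t.start()
t.join()
```

The request handler only pays for `put`; the slow send happens elsewhere.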
Why

What Belongs in the Background

Slow I/O

Email, SMS, push, third-party API calls — anything that can block for seconds.

Heavy CPU

Image/video processing, PDF generation, ML inference, exports.

Fan-out

One user action → 10,000 downstream notifications. Enqueue many small jobs.
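
A sketch of that fan-out in Python — `enqueue` here is a hypothetical stand-in for whatever your queue client provides:

```python
def chunk(ids, size):
    """Split a large ID list into batches so each job stays small and retriable."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def fan_out_notifications(follower_ids, enqueue, batch_size=100):
    """One user action -> many small jobs. A failed batch retries alone
    instead of re-sending all 10,000 notifications."""
    for batch in chunk(follower_ids, batch_size):
        enqueue({"type": "notify", "user_ids": batch})

jobs = []
fan_out_notifications(list(range(10_000)), jobs.append)
# 10,000 notifications become 100 jobs of 100 IDs each
```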

Scheduled Work

Nightly billing, weekly reports, hourly cache warm-ups, retention deletes.

Retries with Patience

"Webhook failed — try again in 1 min, then 5, then an hour."

Decoupling

"Order placed" event drives email, analytics, fulfillment — without coupling them to the API handler.
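
The decoupling can be sketched as a minimal publish/subscribe registry — in-process here for brevity, though a real system puts a queue between `publish` and the handlers (all names are illustrative):

```python
handlers = {}

def subscribe(event, fn):
    """Each downstream concern registers independently."""
    handlers.setdefault(event, []).append(fn)

def publish(event, payload):
    """The API handler only publishes; it never knows who listens."""
    for fn in handlers.get(event, []):
        fn(payload)

log = []
subscribe("order.placed", lambda o: log.append(f"email receipt for {o['id']}"))
subscribe("order.placed", lambda o: log.append(f"track conversion {o['id']}"))
subscribe("order.placed", lambda o: log.append(f"start fulfillment {o['id']}"))

publish("order.placed", {"id": 7})
```

Adding a fourth consumer later touches zero lines of the order-placement code.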

Landscape

The Major Systems

| Tool | Stack | Sweet spot |
|---|---|---|
| Sidekiq | Ruby + Redis | The Rails default. Fast, mature, simple ops. |
| Celery | Python + Redis/RabbitMQ | Long-time Python standard; lots of moving parts. |
| BullMQ | Node + Redis | Modern Node default; great DX, scheduled & repeatable jobs. |
| Hangfire | .NET + SQL Server / Redis | Built-in dashboard; persists jobs in the DB by default. |
| Quartz | JVM | Enterprise scheduling; rich cron and clustering. |
| Asynq / River | Go + Redis / Postgres | Idiomatic Go; type-safe payloads. |
| SQS / Pub/Sub / Service Bus | Cloud-managed | No broker to operate; plays well with serverless workers. |
| RabbitMQ | Self-hosted | Rich routing, exchanges, priorities — when shape of delivery matters. |
| Kafka | Self-hosted / Confluent / MSK | Event log, not just a queue — replayable, high-throughput. |
| Temporal / Inngest / Trigger.dev | Workflow engines | Long-lived, multi-step workflows with built-in retries & durability. |
| Postgres-as-queue | pg-boss, Oban, River, Graphile Worker | Skip the broker — transactional with your data. |
Mechanics

Designing Reliable Jobs

Delivery Semantics
  • At-most-once: may be lost. Rarely what you want.
  • At-least-once: the practical default — may run twice. Make jobs idempotent.
  • Exactly-once: mostly a marketing claim; achieved by at-least-once delivery + idempotent processing.
Idempotency

Pass an operation_id in the payload. Check-and-record it before doing side effects (insert into a processed_jobs table with a unique index). Second run hits the unique violation and exits cleanly.
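
A minimal sketch of that pattern, with SQLite standing in for your database (table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_jobs (operation_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE charges (operation_id TEXT, amount INTEGER)")

def charge_customer(operation_id, amount):
    """Check-and-record before side effects: the unique index turns a
    second delivery of the same job into a clean no-op."""
    try:
        db.execute("INSERT INTO processed_jobs VALUES (?)", (operation_id,))
    except sqlite3.IntegrityError:
        return "already processed"   # second run hits the unique violation
    db.execute("INSERT INTO charges VALUES (?, ?)", (operation_id, amount))
    db.commit()   # record + side effect commit together
    return "charged"

first = charge_customer("op-123", 999)
second = charge_customer("op-123", 999)   # retry: no double charge
```

The key detail is that the `processed_jobs` insert and the side effect share one transaction, so a crash between them leaves the job retriable.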

Retries & Dead Letters
  • Exponential backoff with jitter — never tight-loop a failing job.
  • Bound max attempts; route to DLQ after the limit.
  • Distinguish retriable errors (network, 503) from permanent ones (validation, 400) and stop retrying the latter.
  • Alert on DLQ growth — it's your signal that something needs human eyes.
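
The first three bullets can be sketched together — exponential backoff with full jitter, plus a split between retriable and permanent errors (all names are illustrative):

```python
import random

class PermanentError(Exception):
    """Validation failures, 4xx responses — retrying won't help."""

def backoff_delay(attempt, base=60, cap=3600):
    """Full jitter: a random delay in [0, min(cap, base * 2^attempt)].
    Jitter spreads retries so failing jobs don't thundering-herd."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def run_with_retries(job, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return job()
        except PermanentError:
            raise                          # straight to the DLQ, no retries
        except Exception:
            delay = backoff_delay(attempt)
            # A real worker would re-enqueue the job with `delay`; sketched here.
    raise RuntimeError("max attempts exceeded -> dead-letter queue")
```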
Scheduling & Cron

Avoid running cron on every app instance — N replicas means N copies of your nightly job. Use a leader-elected scheduler (Hangfire, Quartz cluster, Kubernetes CronJob, BullMQ repeatable jobs) or a managed scheduler (EventBridge, Cloud Scheduler).
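
One way to sketch the leader-election idea by hand is a lease row with a TTL, shown here with SQLite standing in for a shared database (Postgres users often reach for advisory locks instead; all names are illustrative):

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scheduler_lease (name TEXT PRIMARY KEY, expires_at REAL)")

def try_acquire_lease(name, ttl=300):
    """Only one replica wins the INSERT; the TTL lets a crashed leader's
    lease expire instead of blocking the job forever."""
    now = time.time()
    db.execute("DELETE FROM scheduler_lease WHERE name = ? AND expires_at < ?",
               (name, now))
    try:
        db.execute("INSERT INTO scheduler_lease VALUES (?, ?)", (name, now + ttl))
        db.commit()
        return True
    except sqlite3.IntegrityError:
        return False   # another replica already holds the lease

# Simulate three replicas waking up for the same cron tick:
winners = [try_acquire_lease("nightly-billing") for _ in range(3)]
```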

Long-Running Workflows

For jobs that span hours or days (onboarding flows, multi-step refunds, AI pipelines), reach for a workflow engine — Temporal, Inngest, Step Functions, Cadence. They persist state between steps so a worker restart doesn't lose progress.
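
The core durability idea — persist each completed step, skip it on resume — can be sketched by hand (a toy version of what those engines do, with illustrative names):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE workflow_steps (run_id TEXT, step TEXT, "
           "PRIMARY KEY (run_id, step))")

def run_step(run_id, step, fn, log):
    """Record each completed step so a restarted worker skips work
    already done instead of losing progress."""
    done = db.execute("SELECT 1 FROM workflow_steps WHERE run_id=? AND step=?",
                      (run_id, step)).fetchone()
    if done:
        log.append(f"skip {step}")
        return
    fn()
    db.execute("INSERT INTO workflow_steps VALUES (?, ?)", (run_id, step))
    db.commit()
    log.append(f"ran {step}")

def refund_workflow(run_id, log):
    run_step(run_id, "reverse_charge", lambda: None, log)
    run_step(run_id, "restock_item",   lambda: None, log)
    run_step(run_id, "email_customer", lambda: None, log)

log = []
refund_workflow("run-1", log)   # first execution runs every step
refund_workflow("run-1", log)   # a "restarted worker" skips completed steps
```

Each `fn` still needs to be idempotent, since a crash can land between `fn()` and the insert.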

Priority, Concurrency, Fairness
  • Separate queues per priority; never let a flood of low-priority work starve user-facing jobs.
  • Per-tenant rate limits — one big customer's bulk import shouldn't degrade everyone else.
  • Cap worker concurrency for fragile downstreams (legacy SOAP API, single-writer DB).
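
A per-tenant rate limit can be sketched as a token bucket keyed by tenant — an in-process sketch; a real deployment shares this state across workers, typically in Redis:

```python
class TenantRateLimiter:
    """Each tenant gets `rate` jobs/second with bursts up to `burst`,
    so one customer's bulk import can't monopolize the workers."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.state = {}   # tenant -> (tokens, last_timestamp)

    def allow(self, tenant, now):
        """Caller supplies `now` (e.g. time.monotonic())."""
        tokens, last = self.state.get(tenant, (self.burst, now))
        # Refill tokens for elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[tenant] = (tokens - 1, now)
            return True
        self.state[tenant] = (tokens, now)
        return False
```

Jobs that fail the check get re-enqueued with a delay rather than dropped.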
Observability
  • Log job ID, type, attempt, duration. Trace from API request → enqueue → worker.
  • Track queue depth, oldest-job age, and failure rate as first-class SLIs.
  • Alert on age, not just depth — depth can stay flat while jobs starve at the back.
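
Computing those SLIs from enqueue timestamps takes a few lines — note how age exposes a starving job that depth alone would hide (field names are illustrative):

```python
def queue_slis(jobs, now):
    """Depth and oldest-job age from enqueue timestamps (seconds)."""
    depth = len(jobs)
    oldest_age = max((now - j["enqueued_at"] for j in jobs), default=0.0)
    return {"depth": depth, "oldest_job_age_s": oldest_age}

now = 1_000.0
jobs = [{"id": 1, "enqueued_at": 100.0},   # stuck at the back for 900s
        {"id": 2, "enqueued_at": 995.0}]
slis = queue_slis(jobs, now)
# depth looks harmless at 2, but age reveals the starving job
```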
Pitfalls

Common Mistakes

  • Putting domain objects in the payload. Pass IDs, re-fetch in the worker — the object on disk may have changed by the time the worker runs.
  • Doing the DB write and the enqueue separately. If the DB commits and the enqueue fails, the side effect never happens. Use the outbox pattern or a Postgres-backed queue.
  • Workers without graceful shutdown. SIGTERM should drain in-flight jobs; otherwise deploys lose work.
  • Sharing the request DB pool with workers. A worker stampede starves the API. Separate pools, separate instances.
  • Treating the queue as a database. If you find yourself querying it, you want a database with a job table, not a queue.
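
The outbox pattern mentioned above can be sketched with SQLite: the domain write and the enqueue intent share one transaction, and a separate relay drains the outbox into the real queue (all names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "job_type TEXT, payload TEXT, sent INTEGER DEFAULT 0)")

def place_order(order_id):
    """Domain row and enqueue intent commit atomically: either both
    happen or neither does, so the side effect can't be lost."""
    with db:   # sqlite3 connection as context manager == one transaction
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (job_type, payload) VALUES (?, ?)",
                   ("order_confirmation_email", str(order_id)))

def relay(enqueue):
    """A separate relay process drains unsent rows into the real queue."""
    rows = db.execute(
        "SELECT id, job_type, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, job_type, payload in rows:
        enqueue({"type": job_type, "payload": payload})
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(42)
queued = []
relay(queued.append)
```

The relay delivers at-least-once (a crash between `enqueue` and the `UPDATE` re-sends the row), which is exactly why the jobs themselves must be idempotent.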