Server Responsibilities · 6 of 6

Background Jobs

Anything that doesn't have to finish before the user sees a response: emails, reports, ML scoring, exports, scheduled cleanups. Move it off the request path or pay in latency forever.

Queues · Workers · Sidekiq · Celery · BullMQ · SQS · Cron
Quick Facts

Basic Concepts

  • Producer: the code that enqueues a job.
  • Broker / queue: where the job waits — Redis, RabbitMQ, SQS, Kafka, the database itself.
  • Worker: the long-running process that pops jobs and executes them.
  • Idempotent: safe to run more than once. Every job should be — workers crash, retries happen.
  • Dead-letter queue (DLQ): where jobs that fail repeatedly go to be inspected instead of looping forever.
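
The three roles fit in a few lines of Python, with the stdlib `queue` standing in for the broker (a sketch — in a real system the broker is Redis, RabbitMQ, or SQS, and the worker is a separate process):

```python
import queue
import threading

# The "broker": in real systems this is Redis, RabbitMQ, SQS, etc.
broker = queue.Queue()

def enqueue_welcome_email(user_id):
    """Producer: serializes a small payload and hands it to the broker."""
    broker.put({"type": "welcome_email", "user_id": user_id})

def worker(results):
    """Worker: pops jobs and executes them until the queue is drained."""
    while True:
        try:
            job = broker.get(timeout=0.1)
        except queue.Empty:
            return
        # Dispatch on job type; a real worker would also retry on failure.
        if job["type"] == "welcome_email":
            results.append(f"sent welcome email to user {job['user_id']}")
        broker.task_done()

results = []
for uid in (1, 2, 3):
    enqueue_welcome_email(uid)   # producer side: fast, non-blocking

t = threading.Thread(target=worker, args=(results,))
t.start()
t.join()
```

The request handler only pays for `put`; the slow send happens elsewhere.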
Why

What Belongs in the Background

Slow I/O

Email, SMS, push, third-party API calls — anything that can block for seconds.

Heavy CPU

Image/video processing, PDF generation, ML inference, exports.

Fan-out

One user action → 10,000 downstream notifications. Enqueue many small jobs.
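
A sketch of that fan-out in Python — `enqueue` here is a hypothetical stand-in for whatever your queue client provides:

```python
def chunk(ids, size):
    """Split a large ID list into batches so each job stays small and retriable."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def fan_out_notifications(follower_ids, enqueue, batch_size=100):
    """One user action -> many small jobs. A failed batch retries alone
    instead of re-sending all 10,000 notifications."""
    for batch in chunk(follower_ids, batch_size):
        enqueue({"type": "notify", "user_ids": batch})

jobs = []
fan_out_notifications(list(range(10_000)), jobs.append)
# 10,000 notifications become 100 jobs of 100 IDs each
```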

Scheduled Work

Nightly billing, weekly reports, hourly cache warm-ups, retention deletes.

Retries with Patience

"Webhook failed — try again in 1 min, then 5, then an hour."

Decoupling

"Order placed" event drives email, analytics, fulfillment — without coupling them to the API handler.
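
The decoupling can be sketched as a minimal publish/subscribe registry — in-process here for brevity, though a real system puts a queue between `publish` and the handlers (all names are illustrative):

```python
handlers = {}

def subscribe(event, fn):
    """Each downstream concern registers independently."""
    handlers.setdefault(event, []).append(fn)

def publish(event, payload):
    """The API handler only publishes; it never knows who listens."""
    for fn in handlers.get(event, []):
        fn(payload)

log = []
subscribe("order.placed", lambda o: log.append(f"email receipt for {o['id']}"))
subscribe("order.placed", lambda o: log.append(f"track conversion {o['id']}"))
subscribe("order.placed", lambda o: log.append(f"start fulfillment {o['id']}"))

publish("order.placed", {"id": 7})
```

Adding a fourth consumer later touches zero lines of the order-placement code.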

Landscape

The Major Systems

| Tool | Stack | Sweet spot |
|---|---|---|
| Sidekiq | Ruby + Redis | The Rails default. Fast, mature, simple ops. |
| Celery | Python + Redis/RabbitMQ | Long-time Python standard; lots of moving parts. |
| BullMQ | Node + Redis | Modern Node default; great DX, scheduled & repeatable jobs. |
| Hangfire | .NET + SQL Server / Redis | Built-in dashboard; persists jobs in the DB by default. |
| Quartz | JVM | Enterprise scheduling; rich cron and clustering. |
| Asynq / River | Go + Redis / Postgres | Idiomatic Go; type-safe payloads. |
| SQS / Pub/Sub / Service Bus | Cloud-managed | No broker to operate; plays well with serverless workers. |
| RabbitMQ | Self-hosted | Rich routing, exchanges, priorities — when shape of delivery matters. |
| Kafka | Self-hosted / Confluent / MSK | Event log, not just a queue — replayable, high-throughput. |
| Temporal / Inngest / Trigger.dev | Workflow engines | Long-lived, multi-step workflows with built-in retries & durability. |
| Postgres-as-queue | pg-boss, Oban, River, Graphile Worker | Skip the broker — transactional with your data. |
Mechanics

Designing Reliable Jobs

Delivery Semantics
  • At-most-once: may be lost. Rarely what you want.
  • At-least-once: the practical default — may run twice. Make jobs idempotent.
  • Exactly-once: mostly a marketing claim; achieved by at-least-once delivery + idempotent processing.
Idempotency

Pass an operation_id in the payload. Check-and-record it before doing side effects (insert into a processed_jobs table with a unique index). Second run hits the unique violation and exits cleanly.
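
A minimal sketch of that pattern, with SQLite standing in for your database (table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_jobs (operation_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE charges (operation_id TEXT, amount INTEGER)")

def charge_customer(operation_id, amount):
    """Check-and-record before side effects: the unique index turns a
    second delivery of the same job into a clean no-op."""
    try:
        db.execute("INSERT INTO processed_jobs VALUES (?)", (operation_id,))
    except sqlite3.IntegrityError:
        return "already processed"   # second run hits the unique violation
    db.execute("INSERT INTO charges VALUES (?, ?)", (operation_id, amount))
    db.commit()   # record + side effect commit together
    return "charged"

first = charge_customer("op-123", 999)
second = charge_customer("op-123", 999)   # retry: no double charge
```

The key detail is that the `processed_jobs` insert and the side effect share one transaction, so a crash between them leaves the job retriable.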

Retries & Dead Letters
  • Exponential backoff with jitter — never tight-loop a failing job.
  • Bound max attempts; route to DLQ after the limit.
  • Distinguish retriable errors (network, 503) from permanent ones (validation, 400) and stop retrying the latter.
  • Alert on DLQ growth — it's your signal that something needs human eyes.
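
The first three bullets can be sketched together — exponential backoff with full jitter, plus a split between retriable and permanent errors (all names are illustrative):

```python
import random

class PermanentError(Exception):
    """Validation failures, 4xx responses — retrying won't help."""

def backoff_delay(attempt, base=60, cap=3600):
    """Full jitter: a random delay in [0, min(cap, base * 2^attempt)].
    Jitter spreads retries so failing jobs don't thundering-herd."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def run_with_retries(job, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return job()
        except PermanentError:
            raise                          # straight to the DLQ, no retries
        except Exception:
            delay = backoff_delay(attempt)
            # A real worker would re-enqueue the job with `delay`; sketched here.
    raise RuntimeError("max attempts exceeded -> dead-letter queue")
```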
Scheduling & Cron

Avoid running cron on every app instance — N replicas means N copies of your nightly job. Use a leader-elected scheduler (Hangfire, Quartz cluster, Kubernetes CronJob, BullMQ repeatable jobs) or a managed scheduler (EventBridge, Cloud Scheduler).
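
One way to sketch the leader-election idea by hand is a lease row with a TTL, shown here with SQLite standing in for a shared database (Postgres users often reach for advisory locks instead; all names are illustrative):

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scheduler_lease (name TEXT PRIMARY KEY, expires_at REAL)")

def try_acquire_lease(name, ttl=300):
    """Only one replica wins the INSERT; the TTL lets a crashed leader's
    lease expire instead of blocking the job forever."""
    now = time.time()
    db.execute("DELETE FROM scheduler_lease WHERE name = ? AND expires_at < ?",
               (name, now))
    try:
        db.execute("INSERT INTO scheduler_lease VALUES (?, ?)", (name, now + ttl))
        db.commit()
        return True
    except sqlite3.IntegrityError:
        return False   # another replica already holds the lease

# Simulate three replicas waking up for the same cron tick:
winners = [try_acquire_lease("nightly-billing") for _ in range(3)]
```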

Long-Running Workflows

For jobs that span hours or days (onboarding flows, multi-step refunds, AI pipelines), reach for a workflow engine — Temporal, Inngest, Step Functions, Cadence. They persist state between steps so a worker restart doesn't lose progress.
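
The core durability idea — persist each completed step, skip it on resume — can be sketched by hand (a toy version of what those engines do, with illustrative names):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE workflow_steps (run_id TEXT, step TEXT, "
           "PRIMARY KEY (run_id, step))")

def run_step(run_id, step, fn, log):
    """Record each completed step so a restarted worker skips work
    already done instead of losing progress."""
    done = db.execute("SELECT 1 FROM workflow_steps WHERE run_id=? AND step=?",
                      (run_id, step)).fetchone()
    if done:
        log.append(f"skip {step}")
        return
    fn()
    db.execute("INSERT INTO workflow_steps VALUES (?, ?)", (run_id, step))
    db.commit()
    log.append(f"ran {step}")

def refund_workflow(run_id, log):
    run_step(run_id, "reverse_charge", lambda: None, log)
    run_step(run_id, "restock_item",   lambda: None, log)
    run_step(run_id, "email_customer", lambda: None, log)

log = []
refund_workflow("run-1", log)   # first execution runs every step
refund_workflow("run-1", log)   # a "restarted worker" skips completed steps
```

Each `fn` still needs to be idempotent, since a crash can land between `fn()` and the insert.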

Priority, Concurrency, Fairness
  • Separate queues per priority; never let a flood of low-priority work starve user-facing jobs.
  • Per-tenant rate limits — one big customer's bulk import shouldn't degrade everyone else.
  • Cap worker concurrency for fragile downstreams (legacy SOAP API, single-writer DB).
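
A per-tenant rate limit can be sketched as a token bucket keyed by tenant — an in-process sketch; a real deployment shares this state across workers, typically in Redis:

```python
class TenantRateLimiter:
    """Each tenant gets `rate` jobs/second with bursts up to `burst`,
    so one customer's bulk import can't monopolize the workers."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.state = {}   # tenant -> (tokens, last_timestamp)

    def allow(self, tenant, now):
        """Caller supplies `now` (e.g. time.monotonic())."""
        tokens, last = self.state.get(tenant, (self.burst, now))
        # Refill tokens for elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[tenant] = (tokens - 1, now)
            return True
        self.state[tenant] = (tokens, now)
        return False
```

Jobs that fail the check get re-enqueued with a delay rather than dropped.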
Observability
  • Log job ID, type, attempt, duration. Trace from API request → enqueue → worker.
  • Track queue depth, oldest-job age, and failure rate as first-class SLIs.
  • Alert on age, not just depth — depth can stay flat while jobs starve at the back.
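
Computing those SLIs from enqueue timestamps takes a few lines — note how age exposes a starving job that depth alone would hide (field names are illustrative):

```python
def queue_slis(jobs, now):
    """Depth and oldest-job age from enqueue timestamps (seconds)."""
    depth = len(jobs)
    oldest_age = max((now - j["enqueued_at"] for j in jobs), default=0.0)
    return {"depth": depth, "oldest_job_age_s": oldest_age}

now = 1_000.0
jobs = [{"id": 1, "enqueued_at": 100.0},   # stuck at the back for 900s
        {"id": 2, "enqueued_at": 995.0}]
slis = queue_slis(jobs, now)
# depth looks harmless at 2, but age reveals the starving job
```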
Pitfalls

Common Mistakes

  • Putting domain objects in the payload. Pass IDs, re-fetch in the worker — the object on disk may have changed by the time the worker runs.
  • Doing the DB write and the enqueue separately. If the DB commits and the enqueue fails, the side effect never happens. Use the outbox pattern or a Postgres-backed queue.
  • Workers without graceful shutdown. SIGTERM should drain in-flight jobs; otherwise deploys lose work.
  • Sharing the request DB pool with workers. A worker stampede starves the API. Separate pools, separate instances.
  • Treating the queue as a database. If you find yourself querying it, you want a database with a job table, not a queue.
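
The outbox pattern mentioned above can be sketched with SQLite: the domain write and the enqueue intent share one transaction, and a separate relay drains the outbox into the real queue (all names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "job_type TEXT, payload TEXT, sent INTEGER DEFAULT 0)")

def place_order(order_id):
    """Domain row and enqueue intent commit atomically: either both
    happen or neither does, so the side effect can't be lost."""
    with db:   # sqlite3 connection as context manager == one transaction
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (job_type, payload) VALUES (?, ?)",
                   ("order_confirmation_email", str(order_id)))

def relay(enqueue):
    """A separate relay process drains unsent rows into the real queue."""
    rows = db.execute(
        "SELECT id, job_type, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, job_type, payload in rows:
        enqueue({"type": job_type, "payload": payload})
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(42)
queued = []
relay(queued.append)
```

The relay delivers at-least-once (a crash between `enqueue` and the `UPDATE` re-sends the row), which is exactly why the jobs themselves must be idempotent.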