Services emit events when something happens — "OrderPlaced", "PaymentReceived", "UserSignedUp" — onto a shared bus. Other services subscribe to what they care about and react. Producers don't know who consumes; consumers don't know who produced. The result is loosely coupled systems that fan out naturally and leave a perfect audit trail.
The order service doesn't know that emails get sent, analytics gets updated, the warehouse picks the package, and the loyalty system credits points. It just publishes "OrderPlaced". Add a new consumer (fraud detection, recommendation training) without touching the producer.
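The decoupling can be sketched with a toy in-memory bus — a stand-in for a real broker, with hypothetical names; producers publish to a topic name and never see who is subscribed:

```python
from collections import defaultdict

class EventBus:
    """Toy in-memory bus: producers publish to a topic, subscribers react."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer doesn't know (or care) who is listening.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
log = []
bus.subscribe("OrderPlaced", lambda e: log.append(f"email for {e['order_id']}"))
bus.subscribe("OrderPlaced", lambda e: log.append(f"warehouse picks {e['order_id']}"))
bus.publish("OrderPlaced", {"order_id": "A1"})
```

Adding fraud detection is one more `subscribe` call; the order service's `publish` line never changes.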
One event, ten consumers. Each one gets its own copy and processes at its own pace. Slow consumers don't slow down fast ones; they just lag behind on the topic.
The event log is a durable history. New consumer? Replay the topic from the start to backfill. Bug fix? Replay since the bad commit to repair downstream state. With Kafka, retention can be effectively infinite.
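The replay mechanics reduce to a simple idea: the log keeps everything, and each consumer owns its own read offset. A minimal sketch (illustrative names, not any broker's API):

```python
class EventLog:
    """Append-only log; each consumer tracks its own offset, so a
    consumer added later can replay history from the start."""
    def __init__(self):
        self.events = []
        self.offsets = {}  # consumer name -> next offset to read

    def append(self, event):
        self.events.append(event)

    def poll(self, consumer):
        start = self.offsets.get(consumer, 0)
        batch = self.events[start:]
        self.offsets[consumer] = len(self.events)
        return batch

elog = EventLog()
elog.append({"type": "OrderPlaced", "id": 1})
elog.append({"type": "OrderPlaced", "id": 2})
# A consumer added after the fact still sees the full history:
assert [e["id"] for e in elog.poll("new-consumer")] == [1, 2]
assert elog.poll("new-consumer") == []  # caught up; nothing new yet
```

Backfill and bug-fix replay are the same operation: reset a consumer's offset and let it read forward again.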
If the email service is down for an hour, "OrderPlaced" events queue up on its topic. When it comes back, it catches up. The order service never knew there was a problem.
"Place an order, then immediately query the order list" may not show the new order if the read model lags the event. Designs that assume read-after-write consistency break here. Solutions: optimistic UI updates, "your order is processing" states, or a read model with monotonic guarantees.
Across a partitioned topic, global order is impossible. You get order within a partition. Partition by aggregate key (e.g., order ID) so all events for one entity stay in order — but events across different aggregates can interleave.
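Partitioning by aggregate key is just a stable hash of the key modulo the partition count — a sketch, not any specific broker's partitioner:

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    """Stable hash -> partition; every event for one aggregate
    lands on the same partition, preserving its per-entity order."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All events for order "ord-42" go to one partition...
p = partition_for("ord-42")
assert all(partition_for("ord-42") == p for _ in range(3))
# ...but a different order may land anywhere, and may interleave.
```

Note the hash must be stable across processes and restarts (hence `hashlib`, not Python's randomized built-in `hash`), or the same order's events would scatter.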
Most brokers guarantee "at least once", which means consumers will see duplicates. Make consumers idempotent — store the event ID, skip if already processed. "Exactly once" exists in limited forms (Kafka transactions provide it within a single Kafka pipeline), but end-to-end across external systems you still need idempotency.
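An idempotent consumer in miniature — here the seen-IDs set is in memory; in production it would be a table or key-value store checked in the same transaction as the side effect:

```python
processed_ids = set()  # in production: durable storage, not memory

def handle_once(event) -> bool:
    """Safe under at-least-once delivery: redeliveries are no-ops."""
    if event["event_id"] in processed_ids:
        return False  # duplicate, skip
    processed_ids.add(event["event_id"])
    # ... real side effect (send email, credit points) goes here ...
    return True

assert handle_once({"event_id": "e1"}) is True
assert handle_once({"event_id": "e1"}) is False  # redelivery ignored
```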
An event lives in the topic for years; consumers come and go. Yesterday's events must keep deserializing. Use a schema registry (Confluent, Apicurio) and enforce backward-compatible changes only — add optional fields, never rename or remove without versioning.
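What "backward-compatible, add optional fields only" means for a consumer, shown with plain JSON rather than a registry-managed format — the field names are hypothetical:

```python
import json

def decode_order_placed(raw: bytes) -> dict:
    """Tolerant reader: newer optional fields get defaults, so events
    written under yesterday's schema still deserialize today."""
    data = json.loads(raw)
    return {
        "order_id": data["order_id"],   # required since v1
        "coupon": data.get("coupon"),   # added in v2, optional with default
    }

old_event = b'{"order_id": "A1"}'  # written before "coupon" existed
assert decode_order_placed(old_event) == {"order_id": "A1", "coupon": None}
```

Renaming or removing `order_id` would break every old event still sitting in the topic — which is exactly what registry-enforced compatibility checks prevent.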
"What happens after OrderPlaced?" requires grepping every service to see who subscribes. Distributed tracing across event hops is harder than across HTTP — propagate trace IDs in event headers and instrument both sides.
If a service writes to its DB and then publishes an event, what if the publish fails after the DB commits? Inconsistency. The outbox pattern writes the event to an outbox table in the same DB transaction; a separate process reads the outbox and publishes. Solves the dual-write problem.
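The outbox pattern fits in a few lines with SQLite standing in for the service's database — table and function names are illustrative:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id):
    # One local transaction covers both the state change and the event:
    # either both commit or neither does.
    with db:
        db.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("OrderPlaced", json.dumps({"order_id": order_id})))

def relay(publish_fn):
    # A separate relay process polls unpublished rows and publishes them.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish_fn(topic, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))

sent = []
place_order("A1")
relay(lambda topic, event: sent.append((topic, event)))
assert sent == [("OrderPlaced", {"order_id": "A1"})]
```

Note the relay itself can crash between publish and the `UPDATE`, which reintroduces at-least-once delivery — another reason consumers must be idempotent.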
Each service reacts to events independently. No central controller. Simple to build, hard to see — "what's the full checkout flow?" requires walking every service. Best for small flows with 2–3 steps.
A workflow engine (Temporal, AWS Step Functions, Camunda, Netflix Conductor) drives the saga step by step. Visible, testable, with built-in retries and compensation. The right call for long-running multi-step flows like onboarding, refunds, KYC. Slightly more central coupling, much more debuggable.
You can't BEGIN/COMMIT across services. A saga is a sequence of local transactions, each with a compensating action. If step 4 fails, fire compensations for steps 1–3. Implement choreographed (events) or orchestrated (workflow engine).
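The compensation logic is the heart of a saga, and it fits in one function — a minimal orchestrated sketch with made-up step names:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; if one action fails,
    run the compensations for the completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for comp in reversed(done):
                comp()
            return False
        done.append(compensate)
    return True

trace = []

def fail():
    raise RuntimeError("shipping service down")

steps = [
    (lambda: trace.append("reserve stock"), lambda: trace.append("release stock")),
    (lambda: trace.append("charge card"),   lambda: trace.append("refund card")),
    (fail,                                  lambda: None),
]
assert run_saga(steps) is False
assert trace == ["reserve stock", "charge card", "refund card", "release stock"]
```

A workflow engine adds what this sketch lacks: durable state, retries, and timeouts, so the saga survives process crashes mid-flow.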
| Broker | Model | Best For |
|---|---|---|
| Apache Kafka | Append-only log, partitioned, persistent | High-throughput event streaming, replay, analytics pipelines. |
| RabbitMQ | Queue + exchange routing | Task queues, request/reply, complex routing. |
| AWS SNS + SQS | Pub/sub fan-out + queues | Decoupling Lambdas and microservices on AWS. |
| Google Pub/Sub | Managed pub/sub | Cloud-native eventing on GCP. |
| NATS / JetStream | Lightweight messaging + persistent streams | Edge, IoT, microservices needing low overhead. |
| Redis Streams | In-memory log | Small-scale eventing alongside an existing Redis. |
Reach for events when:

- many consumers need to react to one fact, and the producer shouldn't know about any of them
- consumers can lag or be down without blocking the producer
- you want a durable history you can replay to backfill new consumers or repair state
- new consumers should be addable without touching existing services
Don't reach for events when you actually need a synchronous answer ("did the payment go through?"), when consistency must be immediate, or when the team can't yet operate a broker reliably. Sync RPC over HTTP/gRPC is fine — sometimes ideal — for request/response.