Event-Driven Architecture Patterns
Over the last five years, event-driven architecture has moved from emerging pattern to default substrate for cloud-native estates. Apache Kafka, Amazon EventBridge, Azure Event Hubs, Google Cloud Pub/Sub, and the ecosystems around them now offer production-grade operational tooling. The architectural patterns themselves have stabilised, and the operating disciplines that determine success have been documented across many implementations.
This piece is a practitioner catalogue of the patterns that recur and the operational disciplines that distinguish event-driven estates that scale from ones that accumulate operational confusion.
The publish-subscribe pattern, the workhorse
The fundamental event-driven pattern: producers publish events to a topic; consumers subscribe and receive them. The pattern decouples producers from consumers in time, location, and lifecycle.
Where it fits:
- Order events flowing from commerce to fulfilment, billing, marketing automation, analytics
- Document state changes flowing from a document management system to audit, indexing, downstream workflows
- Equipment telemetry flowing from operational systems to monitoring, alerting, predictive maintenance
- Regulatory events flowing from compliance systems to reporting, partner notification, archival
The pattern's strength is consumer addition without producer change. A new analytics workload that needs order events subscribes to the existing topic; the producer is unaware.
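As a minimal sketch of that property, the toy broker below delivers events in-process. The class name `InMemoryBroker`, the topic `commerce.order.placed`, and the function `place_order` are all hypothetical; a real broker would also decouple producer and consumer in time, which this synchronous stand-in does not attempt.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy stand-in for a real broker: each topic maps to subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer names only the topic; it never references a consumer.
        for handler in self._subscribers[topic]:
            handler(event)


broker = InMemoryBroker()
fulfilment_seen: List[dict] = []
analytics_seen: List[dict] = []

broker.subscribe("commerce.order.placed", fulfilment_seen.append)


def place_order(order_id: str) -> None:
    """Producer code. It is not edited when consumers are added."""
    broker.publish("commerce.order.placed", {"order_id": order_id})


place_order("o-1")
# A new analytics workload subscribes later; place_order is untouched.
broker.subscribe("commerce.order.placed", analytics_seen.append)
place_order("o-2")
```

The analytics consumer sees only events published after it subscribed. Whether a late subscriber can also read earlier events depends on the broker's retention and replay features, which this sketch does not model.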
The disciplines that make it work:
- Stable event schemas. The event shape is a contract. Schema evolution follows compatibility rules (typically backward-compatible additions only); a schema registry (Confluent Schema Registry, AWS Glue Schema Registry, Apicurio) enforces them.
- Consumer-side resilience. Consumers handle malformed events, duplicate events, out-of-order events, and consumer-side outages. The producer cannot guarantee any of these are absent; the consumer must be defensive.
- Topic naming conventions. A consistent topic naming pattern (typically <domain>.<entity>.<eventType>) makes the topic landscape navigable.
- Documentation. Each topic has a published schema, an owner, and a description of when events are produced.
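The consumer-side resilience point in the list above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the field names (`event_id`, `entity_id`, `version`, `status`) and the class `OrderStatusConsumer` are hypothetical, and dropping stale events is only appropriate when the consumer keeps latest state rather than a full history.

```python
from typing import Dict, List, Set

# Hypothetical contract: every event must carry these fields.
REQUIRED_FIELDS = {"event_id", "entity_id", "version", "status"}


class OrderStatusConsumer:
    """Keeps the latest status per entity while tolerating bad input."""

    def __init__(self) -> None:
        self.seen_event_ids: Set[str] = set()
        self.latest: Dict[str, dict] = {}
        self.rejected: List[object] = []

    def handle(self, event: object) -> str:
        # Malformed events are set aside rather than crashing the consumer.
        if not isinstance(event, dict) or not REQUIRED_FIELDS.issubset(event):
            self.rejected.append(event)
            return "rejected"
        # Duplicate deliveries are ignored, which makes handling idempotent.
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event["event_id"])
        # Out-of-order events: an older version never overwrites a newer one.
        current = self.latest.get(event["entity_id"])
        if current is not None and event["version"] <= current["version"]:
            return "stale"
        self.latest[event["entity_id"]] = event
        return "applied"


consumer = OrderStatusConsumer()
placed = {"event_id": "e1", "entity_id": "o1", "version": 1, "status": "placed"}
shipped = {"event_id": "e3", "entity_id": "o1", "version": 3, "status": "shipped"}
paid = {"event_id": "e2", "entity_id": "o1", "version": 2, "status": "paid"}

results = [consumer.handle(e) for e in (placed, placed, shipped, paid, {"event_id": "e4"})]
```

In production the set of seen event IDs would need to live in durable storage with an expiry policy; an in-memory set is used here only to keep the sketch short.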
Event sourcing — for the workloads that need it
Event sourcing extends the pattern: the events themselves are the system of record. The current state is computed by replaying events from the start. Any historical state can be reconstructed by replaying to that point.
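A minimal sketch of replay, assuming a hypothetical account ledger whose events carry a monotonically increasing `sequence` number. The names `AccountEvent` and `replay` are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AccountEvent:
    sequence: int  # position in the log
    kind: str      # "deposited" or "withdrawn"
    amount: int    # minor currency units


def replay(events: Iterable[AccountEvent], up_to: Optional[int] = None) -> int:
    """Compute the balance by folding over events, optionally stopping early."""
    balance = 0
    for event in sorted(events, key=lambda e: e.sequence):
        if up_to is not None and event.sequence > up_to:
            break
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return balance


log = [
    AccountEvent(1, "deposited", 100),
    AccountEvent(2, "withdrawn", 30),
    AccountEvent(3, "deposited", 50),
]

current_balance = replay(log)              # state after all events
balance_at_two = replay(log, up_to=2)      # historical state after event 2
```

The log is never edited; a historical state is just the same fold stopped at an earlier point.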
Where it fits:
- Financial systems where the audit trail is the product (every transaction matters; the order of events matters; reconstructing past state is a legal requirement)
- Regulatory systems where decision provenance must be traceable
- Systems with strong temporal querying needs ("what did the customer record look like on Dec 31?")
- Systems where the workflow is naturally event-shaped (state machines, long-running processes with explicit transitions)
Where it does not fit:
- High-volume read-mostly workloads where the cost of event replay outweighs the audit benefit
- Workloads with simple CRUD semantics and no temporal queries
- Workloads where the team's experience with event sourcing is limited (the discipline is genuine; learning on production workloads is expensive)
Event sourcing introduces complexity. The disciplines: schema versioning across years of historical events; snapshot generation to avoid full replay on every read; eventual consistency between the event log and any derived state; rebuild procedures when derived state becomes inconsistent. The estates that benefit from event sourcing accept these costs; the estates that adopt event sourcing because it sounds modern usually regret it.
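Snapshot generation, one of the disciplines listed above, can be sketched as follows. The sketch assumes events are `(sequence, signed_delta)` tuples and that a snapshot records the last sequence number it covers; `Snapshot`, `load_balance`, and `take_snapshot` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Event = Tuple[int, int]  # (sequence, signed delta in minor units)


@dataclass(frozen=True)
class Snapshot:
    last_sequence: int  # highest sequence number folded into the snapshot
    balance: int


def load_balance(events: List[Event], snapshot: Optional[Snapshot] = None) -> int:
    """Start from the snapshot, if any, and apply only the events after it."""
    balance = snapshot.balance if snapshot else 0
    start = snapshot.last_sequence if snapshot else 0
    for sequence, delta in events:
        if sequence > start:
            balance += delta
    return balance


def take_snapshot(events: List[Event]) -> Snapshot:
    """Fold a non-empty prefix of the log into a snapshot."""
    if not events:
        raise ValueError("cannot snapshot an empty log")
    last_sequence = max(sequence for sequence, _ in events)
    return Snapshot(last_sequence=last_sequence, balance=load_balance(events))


events: List[Event] = [(1, 100), (2, -30), (3, 50)]
snapshot = take_snapshot(events[:2])
from_snapshot = load_balance(events, snapshot)
from_full_replay = load_balance(events)
```

Checking that a snapshot-based read agrees with a full replay is a cheap way to detect derived state that has drifted from the log.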
CQRS — separating reads from writes
Command Query Responsibility Segregation: writes go through one model, reads through another. The write side captures the change; the read side serves queries. The two sides are kept consistent through events.
The pattern is common in event-driven estates because event sourcing produces a natural write model. The read models are projections of the event stream into query-optimised forms — denormalised views, search indexes, aggregation tables.
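The split can be sketched with a write model that validates commands and records events, and a read model that projects those events into a query-optimised view. All names here (`OrderWriteModel`, `CustomerSpendView`, the `order_placed` event type) are hypothetical, and the projection runs in-process, whereas a real estate would deliver events through a broker.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class OrderWriteModel:
    """Write side: validates commands and appends events."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def place_order(self, order_id: str, customer_id: str, total: int) -> dict:
        if total <= 0:
            raise ValueError("order total must be positive")
        event = {
            "type": "order_placed",
            "order_id": order_id,
            "customer_id": customer_id,
            "total": total,
        }
        self.events.append(event)
        return event


class CustomerSpendView:
    """Read side: a denormalised projection answering 'spend per customer'."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = defaultdict(int)
        self.order_counts: Dict[str, int] = defaultdict(int)

    def project(self, event: dict) -> None:
        if event["type"] == "order_placed":
            self.totals[event["customer_id"]] += event["total"]
            self.order_counts[event["customer_id"]] += 1

    def rebuild(self, events: Iterable[dict]) -> None:
        # Rebuild from the event stream after the view is lost or corrupted.
        self.totals.clear()
        self.order_counts.clear()
        for event in events:
            self.project(event)


write_model = OrderWriteModel()
view = CustomerSpendView()
for args in (("o1", "c1", 40), ("o2", "c1", 60), ("o3", "c2", 10)):
    view.project(write_model.place_order(*args))

rebuilt = CustomerSpendView()
rebuilt.rebuild(write_model.events)
```

Between the write and the projection the view is stale; that window is the eventual consistency the read side has to tolerate.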
Where CQRS earns its complexity:
- Workloads with very different read and write characteristics (heavy reads, infrequent writes, or vice versa)
- Workloads where multiple read shapes serve different consumers (mobile, web, batch, reporting)
- Workloads where read-model rebuild from events is a useful operational tool
Where CQRS doesn't fit:
- Workloads with simple symmetric read/write patterns
- Teams that have not yet developed the operational maturity for eventual consistency
The pattern has become more accessible with managed event brokers and managed read-store services. The operational complexity is still real; the cost is justified only by genuine workload need.
Choreography vs orchestration
Two ways to coordinate multi-step processes in event-driven estates:
Choreography. Each service listens for events it cares about and emits events when it has done something. There is no central coordinator. The overall process emerges from the participating services' reactions. Highly decoupled; harder to reason about end-to-end.
Orchestration. A central orchestrator (BPM platform, workflow engine, saga coordinator) issues commands to participating services and tracks the overall process state. More coupled; easier to reason about and operate.
Most enterprise estates need both, applied to different workloads:
- Choreography for high-volume, structurally-simple processes (event distribution, data fan-out, simple notifications)
- Orchestration for low-volume, structurally-complex processes (multi-step workflows with compensation, human-in-the-loop, regulatory audit requirements)
A mistake we often see is choosing one for everything. Pure choreography for complex regulatory workflows produces audit nightmares. Pure orchestration for high-volume event distribution produces orchestrator bottlenecks. The estates that scale use both, with explicit decisions about which workload uses which.
The saga pattern for distributed transactions
Long-running transactions across multiple services are a poor fit for traditional two-phase commit: participants hold locks while they wait on the coordinator, so latency becomes prohibitive, and a coordinator failure can leave every participant blocked. The saga pattern is the event-driven alternative: the transaction is split into local steps, and each step has a compensating action that undoes it if a later step fails.
Two saga implementations:
Choreography-based sagas. Each service in the saga listens for events and emits its own events; compensation flows backward through the chain. Highly decoupled; difficult to reason about for sagas with many steps.
Orchestration-based sagas. A saga orchestrator (often AWS Step Functions, Temporal, Camunda, custom code) drives the saga forward, invoking each step and handling compensation. More coupled; far easier to operate.
We default to orchestration-based sagas for any process with more than three steps or any regulatory exposure. The auditability and operational visibility are worth the additional coupling.
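A minimal sketch of an orchestration-based saga, written as plain in-process code rather than against any particular workflow engine. `SagaStep`, `run_saga`, and the order-handling functions are hypothetical; a production orchestrator would also persist the saga state, apply idempotency keys, and enforce an overall timeout.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: List[SagaStep]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns an audit trail of what happened.
    """
    audit: List[str] = []
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            audit.append(f"failed:{step.name}")
            # The failed step did not complete, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
                audit.append(f"compensated:{done.name}")
            return audit
        completed.append(step)
        audit.append(f"completed:{step.name}")
    return audit


reservations: Set[str] = set()
payments: Dict[str, str] = {}


def reserve_stock() -> None:
    reservations.add("o-1")


def release_stock() -> None:
    reservations.discard("o-1")


def capture_payment() -> None:
    payments["o-1"] = "captured"


def refund_payment() -> None:
    payments["o-1"] = "refunded"


def ship_order() -> None:
    raise RuntimeError("carrier unavailable")  # simulated failure


audit = run_saga([
    SagaStep("reserve_stock", reserve_stock, release_stock),
    SagaStep("capture_payment", capture_payment, refund_payment),
    SagaStep("ship_order", ship_order, lambda: None),
])
```

The audit trail is a by-product of central coordination, which is much of why orchestration is easier to operate for regulated processes.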
Common mistakes:
- Compensating actions that aren't actually compensating (e.g., "decrement counter" rather than restoring the original value)
- Compensations that have side effects they shouldn't (e.g., emitting customer notifications during compensation)
- Sagas without idempotency at each step, so retries produce inconsistent state
- Sagas without a defined timeout for the overall process, so failed sagas accumulate as zombie state
The dead-letter discipline
Every event-driven system eventually has events that cannot be processed — malformed, missing dependencies, business-rule violations, transient failures that exceed retry budget. The dead-letter pattern routes these events to a quarantine where they can be inspected and either reprocessed or discarded.
What makes dead-letter discipline work:
- Every consumer has a defined dead-letter destination (topic, queue, store)
- Dead-letter events carry the failure reason and the consumer's state at failure
- Dead-letter inspection is part of operations (a defined runbook, with dashboards showing dead-letter accumulation)
- Reprocessing is supported as a first-class operation (replay from dead-letter to original topic, with the fix applied)
- Aging policies prevent indefinite dead-letter accumulation (events older than X days are archived or discarded with an audit record)
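Several items in the list above can be sketched in a few lines: a retry budget, a dead-letter record that carries the failure reason, and reprocessing as an explicit operation. The names (`DeadLetter`, `consume_with_dead_letter`, `reprocess`) and the `billing` consumer are hypothetical, and a plain list stands in for the dead-letter topic or queue.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DeadLetter:
    event: dict
    reason: str     # why processing failed
    attempts: int   # how many tries were made
    consumer: str   # which consumer gave up


def consume_with_dead_letter(
    event: dict,
    handler: Callable[[dict], None],
    dead_letters: List[DeadLetter],
    consumer: str = "billing",
    max_attempts: int = 3,
) -> bool:
    """Try the handler up to max_attempts times, then quarantine the event."""
    reason = ""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception as exc:
            reason = f"{type(exc).__name__}: {exc}"
    dead_letters.append(DeadLetter(event, reason, max_attempts, consumer))
    return False


def reprocess(dead_letters: List[DeadLetter], handler: Callable[[dict], None]) -> List[DeadLetter]:
    """Replay quarantined events through a fixed handler; return what still fails."""
    remaining: List[DeadLetter] = []
    for letter in dead_letters:
        try:
            handler(letter.event)
        except Exception as exc:
            reason = f"{type(exc).__name__}: {exc}"
            remaining.append(DeadLetter(letter.event, reason, letter.attempts + 1, letter.consumer))
    return remaining


billed: List[int] = []


def strict_handler(event: dict) -> None:
    billed.append(event["amount"])  # raises KeyError when amount is missing


def fixed_handler(event: dict) -> None:
    billed.append(event.get("amount", 0))  # the fix: tolerate a missing amount


dead_letter_store: List[DeadLetter] = []
ok = consume_with_dead_letter({"order_id": "o-1", "amount": 25}, strict_handler, dead_letter_store)
failed = consume_with_dead_letter({"order_id": "o-2"}, strict_handler, dead_letter_store)
still_failing = reprocess(dead_letter_store, fixed_handler)
```

A real consumer would add backoff between attempts and distinguish transient failures from permanent ones; both are omitted here for brevity.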
The estates that operate event-driven systems well treat dead-letter as a primary operational concern. The estates that treat dead-letter as an exception path discover the cost during the first incident that fills the dead-letter store with thousands of events.
Operational disciplines
The disciplines that distinguish event-driven estates that scale:
Schema governance. A schema registry, compatibility rules, versioning policy, deprecation discipline. The schema layer is the contract layer; allowing it to drift breaks every consumer.
Topic governance. Naming conventions, ownership, retention policy. Topics are platform resources; ungoverned topics produce a sprawl that nobody can navigate.
Consumer SLA tracking. For each topic-consumer pair, track the latency from event production to consumer processing. Slow consumers should be visible, and the impact of their slowdown on other consumers (broker pressure, partition lag) should be monitored.
Replay capability. The ability to replay events from a specific point, to a specific consumer, with a specific filter. Replay is part of recovery; estates without good replay are stuck during incidents.
Cost visibility. Event-driven infrastructure can produce surprising bills (broker storage, transit, consumer compute). Per-team cost visibility prevents the cost discussion from being a once-a-year surprise.
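Of the disciplines above, the schema compatibility rule is the easiest to illustrate in code. The sketch below checks an "additions only" policy on a deliberately simplified schema shape (a dict of field name to type and required flag). It is an assumption-laden illustration, not the compatibility algorithm of any schema registry; real registries define their compatibility modes more precisely.

```python
from typing import Dict, List

Schema = Dict[str, dict]  # field name -> {"type": str, "required": bool}


def additive_change_violations(old: Schema, new: Schema) -> List[str]:
    """List the ways `new` breaks an additions-only evolution policy."""
    violations: List[str] = []
    for name in sorted(old):
        if name not in new:
            violations.append(f"removed field: {name}")
        elif new[name]["type"] != old[name]["type"]:
            violations.append(f"changed type: {name}")
        elif new[name]["required"] and not old[name]["required"]:
            violations.append(f"made required: {name}")
    for name in sorted(set(new) - set(old)):
        # New fields must be optional so existing producers stay valid.
        if new[name]["required"]:
            violations.append(f"added required field: {name}")
    return violations


v1: Schema = {
    "order_id": {"type": "string", "required": True},
    "total": {"type": "int", "required": True},
}
v2_ok: Schema = {**v1, "currency": {"type": "string", "required": False}}
v2_bad: Schema = {
    "order_id": {"type": "int", "required": True},
    "coupon": {"type": "string", "required": True},
}

ok_violations = additive_change_violations(v1, v2_ok)
bad_violations = additive_change_violations(v1, v2_bad)
```

Running a check of this kind in the build pipeline, before a schema reaches the registry, catches breaking changes at review time rather than in production.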
What we recommend
For an estate adopting event-driven patterns:
- Pick one workload that genuinely benefits — usually a high-fan-out scenario where multiple consumers want the same data
- Adopt a schema registry from Day 1. The discipline pays back from the first schema change.
- Establish topic naming conventions and document them
- Default to orchestrated sagas for multi-step processes
- Define dead-letter strategy before any consumer goes to production
For an existing event-driven estate showing operational pain:
- Audit schema discipline. Is there a registry? Are compatibility rules enforced? If not, this is the highest-leverage fix.
- Audit the consumer lag picture. Are slow consumers visible? Are their failure modes survivable?
- Audit the dead-letter situation. Is it operationally managed?
- Re-examine processes that use pure choreography. Are some of them suffering from audit and operational opacity that orchestration would fix?
Event-driven architecture is the right substrate for many enterprise workloads. The patterns are mature; the tooling is mature; the operational disciplines are documented. The estates that adopt them with the disciplines in place compound value over years. The estates that adopt the patterns without the disciplines tend to discover the disciplines through painful incidents.