Event-Driven Architecture (Order Processing System)
In modern distributed systems, the traditional request-response model often acts as a bottleneck for high-throughput applications. When a user clicks "Place Order," a synchronous system might attempt to call the payment gateway, inventory service, and shipping provider in a single blocking chain. If the payment gateway experiences a 500ms latency spike, the entire user experience degrades. If the inventory service is down, the order fails. This tight coupling limits scalability and creates a fragile "distributed monolith."
Transitioning to an Event-Driven Architecture (EDA) allows us to decouple these concerns. By treating an order placement as an immutable event—a fact that has happened in the past—we can broadcast this information to interested subscribers. Companies like Uber and Netflix leverage this pattern to handle millions of concurrent events, ensuring that their core systems remain responsive even when downstream services are under heavy load or undergoing maintenance.
In this post, we will design a production-grade Order Processing System using EDA. We will focus on the Transactional Outbox pattern to ensure data consistency, discuss how to handle the CAP theorem's trade-offs, and explore how to scale from a few thousand to millions of users.
System Requirements
Our system must handle the lifecycle of an order from creation to fulfillment. The primary goal is to ensure that no order is ever lost, even if specific components fail.
Capacity Estimation
To design for the right scale, we assume a mid-to-large e-commerce platform.
| Metric | Value |
|---|---|
| Daily Active Users (DAU) | 1,000,000 |
| Average Orders per Day | 100,000 |
| Peak Orders per Minute | 10,000 |
| Average Event Size | 2 KB |
| Daily Data Ingress | ~200 MB (100,000 orders × 2 KB per event) |
High-Level Architecture
We utilize an asynchronous choreography pattern. Instead of a central orchestrator, each service listens for relevant events and performs its logic. We use Apache Kafka as our message backbone due to its high throughput and durability.
Detailed Design: The Transactional Outbox Pattern
A common pitfall in EDA is the "dual write" problem: updating a database and publishing a message to a broker are two separate operations. If the database update succeeds but the broker is down, the system becomes inconsistent. To solve this, we use the Transactional Outbox pattern.
The OrderService saves the order and the event in the same local transaction. A separate "Relay" process polls the outbox table and publishes messages to Kafka.
// OrderService implementation in Go
func (s *OrderService) CreateOrder(ctx context.Context, req OrderRequest) error {
return s.db.Transaction(func(tx *gorm.DB) error {
// 1. Create the Order record
order := Order{
UserID: req.UserID,
Total: req.Total,
Status: "PENDING",
}
if err := tx.Create(&order).Error; err != nil {
return err
}
// 2. Create the Outbox entry in the same transaction
eventPayload, err := json.Marshal(order)
if err != nil {
return err
}
outbox := OutboxEvent{
AggregateType: "ORDER",
AggregateID: order.ID,
Type: "OrderCreated",
Payload: eventPayload,
CreatedAt: time.Now(),
}
return tx.Create(&outbox).Error
})
}

This approach guarantees "at-least-once" delivery. Consumers must be idempotent to handle potential duplicate messages, typically achieved by tracking order_id in a processed_events table.
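The Relay side is a poll-publish-mark loop. Below is a minimal sketch with an in-memory store standing in for the outbox table and a hypothetical Publisher interface in place of a real Kafka producer; a production relay would poll the database (or tail its change log via CDC, e.g. Debezium) instead:

```go
package main

import (
	"fmt"
	"sync"
)

// OutboxEvent mirrors the row written inside the order transaction.
type OutboxEvent struct {
	ID        int
	Type      string
	Payload   []byte
	Processed bool
}

// Publisher abstracts the broker producer (hypothetical interface).
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// MemoryOutbox stands in for the outbox table in this sketch.
type MemoryOutbox struct {
	mu     sync.Mutex
	events []OutboxEvent
}

func (m *MemoryOutbox) Add(ev OutboxEvent) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.events = append(m.events, ev)
}

// FetchUnprocessed is the poll step (SELECT ... WHERE processed = false LIMIT n).
func (m *MemoryOutbox) FetchUnprocessed(limit int) []OutboxEvent {
	m.mu.Lock()
	defer m.mu.Unlock()
	var out []OutboxEvent
	for _, ev := range m.events {
		if !ev.Processed && len(out) < limit {
			out = append(out, ev)
		}
	}
	return out
}

// MarkProcessed flags an event after a successful publish.
func (m *MemoryOutbox) MarkProcessed(id int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for i := range m.events {
		if m.events[i].ID == id {
			m.events[i].Processed = true
		}
	}
}

// RelayOnce performs one poll-publish-mark cycle; in production this runs on a timer.
func RelayOnce(store *MemoryOutbox, pub Publisher) error {
	for _, ev := range store.FetchUnprocessed(100) {
		if err := pub.Publish("orders", ev.Payload); err != nil {
			return err // leave the row unprocessed; it is retried on the next poll
		}
		store.MarkProcessed(ev.ID)
	}
	return nil
}

// stdoutPublisher is a stand-in broker for the example.
type stdoutPublisher struct{}

func (stdoutPublisher) Publish(topic string, payload []byte) error {
	fmt.Printf("publish %s: %s\n", topic, payload)
	return nil
}

func main() {
	store := &MemoryOutbox{}
	store.Add(OutboxEvent{ID: 1, Type: "OrderCreated", Payload: []byte(`{"order":1}`)})
	_ = RelayOnce(store, stdoutPublisher{})
	fmt.Println("unprocessed left:", len(store.FetchUnprocessed(100)))
}
```

Note that a publish failure leaves the row unprocessed, so the relay itself contributes duplicates under retry, which is exactly why consumers must be idempotent.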
Database Schema
We use PostgreSQL for its robust ACID compliance. For the orders table, we use range partitioning by created_at to keep indexes small and performant.
CREATE TABLE orders (
id UUID NOT NULL,
user_id UUID NOT NULL,
total_amount DECIMAL(12,2),
status VARCHAR(20),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id, created_at) -- PostgreSQL requires the partition key in the primary key
) PARTITION BY RANGE (created_at);
CREATE INDEX idx_order_user_id ON orders(user_id);
CREATE INDEX idx_outbox_unprocessed ON outbox_events(id) WHERE processed = false;

Scaling Strategy
To scale from 1,000 to 1,000,000+ users, we move from vertical scaling to horizontal scaling of consumers and partitioning of the message broker.
Scaling Key Components:
- Broker Partitioning: We partition Kafka topics by user_id to ensure that all events for a single user are processed in order, while allowing different users' orders to be processed in parallel across multiple brokers.
- Consumer Groups: By using Kafka Consumer Groups, we can add more instances of the PaymentService as load increases. Kafka automatically rebalances the partitions among the available instances.
- Read Replicas: The OrderAPI can use read-only replicas for order history queries, keeping the primary database free for write-heavy transactions.
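Keying messages by user_id preserves per-user ordering because a given key always hashes to the same partition. A minimal sketch of that routing; Kafka's default partitioner uses murmur2, so FNV-1a here only illustrates the principle:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key to a partition deterministically,
// so every event for one user lands on the same partition and stays ordered.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	// user-1 appears twice and must route to the same partition both times.
	for _, user := range []string{"user-1", "user-2", "user-1"} {
		fmt.Printf("%s -> partition %d\n", user, partitionFor(user, 12))
	}
}
```

A corollary worth noting: changing the partition count reshuffles key-to-partition assignments, so ordering guarantees only hold within a fixed partition layout.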
Failure Modes and Resilience
In a distributed system, failure is inevitable. We must design for partial failures using retries, circuit breakers, and Dead Letter Queues (DLQ).
| Failure Mode | Mitigation Strategy |
|---|---|
| Broker Down | Local outbox storage lets the API stay up; the Relay retries once the broker recovers. |
| Consumer Crash | Committed Kafka offsets let a restarted consumer resume from the last committed position; messages after it are redelivered (at-least-once). |
| Downstream Latency | Circuit breakers (e.g., Resilience4j) trip to prevent cascading resource exhaustion. |
| Poison Pill Message | Move message to Dead Letter Queue (DLQ) after N retries to unblock the partition. |
For example, Stripe uses idempotency keys to ensure that if a network timeout occurs and the client retries the request, the customer is only charged once. We implement similar logic in our PaymentService by checking if a transaction for a specific order_id already exists before calling the gateway.
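That check-before-charge logic can be sketched as follows. The map is a stand-in for a database table with a unique index on order_id, and the types are illustrative, not the post's actual PaymentService:

```go
package main

import (
	"fmt"
	"sync"
)

// PaymentLedger records which orders have already been charged.
// In production this would be a unique index on order_id, not an in-memory map.
type PaymentLedger struct {
	mu      sync.Mutex
	charged map[string]bool
}

func NewPaymentLedger() *PaymentLedger {
	return &PaymentLedger{charged: make(map[string]bool)}
}

// Charge calls the gateway only the first time an order_id is seen;
// redelivered duplicate events become no-ops.
func (l *PaymentLedger) Charge(orderID string, gateway func() error) (bool, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.charged[orderID] {
		return false, nil // duplicate event: customer already charged
	}
	if err := gateway(); err != nil {
		return false, err // gateway failed; safe to retry later
	}
	l.charged[orderID] = true
	return true, nil
}

func main() {
	ledger := NewPaymentLedger()
	gateway := func() error { fmt.Println("charging gateway"); return nil }
	first, _ := ledger.Charge("order-42", gateway)
	second, _ := ledger.Charge("order-42", gateway) // redelivered event
	fmt.Println(first, second) // true false
}
```

With a real database, the same effect comes from inserting the order_id into a processed_events table inside the payment transaction and treating a unique-constraint violation as "already handled".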
Conclusion
Designing an event-driven order processing system requires a shift from immediate consistency to eventual consistency. Guided by the CAP theorem's trade-offs, we prioritize Availability and Partition Tolerance (AP), accepting that reads may briefly lag writes so the system remains responsive during peak traffic.
Key takeaways for a production-grade system include:
- Decoupling: Use choreography to allow services to evolve independently.
- Atomicity: Implement the Transactional Outbox pattern to bridge the gap between your database and message broker.
- Idempotency: Always assume messages will be delivered more than once.
- Observability: Implement correlation IDs across services to trace an order's journey through the event mesh.
While EDA introduces complexity in testing and debugging, the gains in resilience and horizontal scalability make it the gold standard for high-growth distributed systems.
References:
- https://martinfowler.com/articles/201701-event-driven.html (Martin Fowler on event-driven patterns)
- https://microservices.io/patterns/data/transactional-outbox.html (Transactional Outbox pattern)
- https://stripe.com/blog/idempotency (Stripe on idempotency)
- https://eng.uber.com/reliable-reprocessing/ (Uber on reliable reprocessing with Kafka)