Reliability Patterns Every Engineer Should Know
In distributed systems, failure is not an edge case; it is a fundamental property of the environment. As systems scale from single-node prototypes to global infrastructures like those run by Netflix or Uber, a "one-in-a-million" event stops being rare: at sufficient request volume, it happens every few minutes. Staff engineers understand that reliability is not about preventing failure, but about building systems that remain functional despite it.
This shift in mindset—from fault avoidance to fault tolerance—requires a standardized set of reliability patterns. These patterns serve as the architectural guardrails that prevent a localized failure in a minor microservice from cascading into a catastrophic global outage. Whether you are dealing with network partitions, slow database queries, or third-party API downtime, the goal is to minimize the blast radius and maintain a consistent user experience.
Requirements
To illustrate these patterns, we will design a high-availability payment processing system. This system must handle millions of transactions while interacting with external banking gateways that are notoriously unreliable.
Functional and Non-Functional Requirements
Functional requirements:
- Accept payment requests and charge each customer exactly once per request, even when clients retry after timeouts.
- Route transactions through external banking gateways and record every outcome in a durable ledger.
Non-functional requirements:
- Strong consistency for the ledger: no double-spending, even during network partitions.
- Sustained throughput of roughly 1,150 TPS with peaks of 10,000 TPS, and 7-year data retention for compliance.
Capacity Estimation
| Metric | Value |
|---|---|
| Daily Active Users (DAU) | 5 Million |
| Average Transactions per Day | 100 Million |
| Average Transactions per Second (TPS) | ~1,150 |
| Peak Transactions per Second | 10,000 |
| Data Retention | 7 Years (Compliance) |
| Total Storage (Daily) | ~100 GB (1KB per record) |
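These figures follow from straightforward arithmetic: 100 million transactions spread over 86,400 seconds is roughly 1,157 TPS on average, and 100 million 1 KB records is about 100 GB per day. Over the 7-year retention window, that accumulates to roughly 255 TB of raw transaction data, before replication and indexes.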
High-Level Architecture
The architecture follows a decoupled, event-driven approach. By using an API Gateway with built-in rate limiting and a message broker for asynchronous processing, we protect the core ledger from traffic spikes.
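As a rough sketch of that edge layer, the handler below applies a token-bucket rate limit and hands validated requests to a broker interface for asynchronous processing. The PaymentRequest fields, the payments.requested topic, and the Broker abstraction are assumptions for illustration, not a prescribed API.

package main

import (
    "encoding/json"
    "net/http"

    "golang.org/x/time/rate"
)

// PaymentRequest is the payload accepted at the edge; the fields are illustrative.
type PaymentRequest struct {
    IdempotencyKey string `json:"idempotency_key"`
    UserID         string `json:"user_id"`
    AmountCents    int64  `json:"amount_cents"`
}

// Broker abstracts whichever message broker (Kafka, SQS, etc.) decouples the
// gateway from the core ledger.
type Broker interface {
    Publish(topic string, payload []byte) error
}

// NewPaymentHandler sheds traffic above the configured rate before it can reach
// the ledger and enqueues everything else for asynchronous processing.
func NewPaymentHandler(broker Broker, limiter *rate.Limiter) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if !limiter.Allow() {
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }

        var req PaymentRequest
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, "invalid payload", http.StatusBadRequest)
            return
        }

        payload, err := json.Marshal(req)
        if err != nil {
            http.Error(w, "internal error", http.StatusInternalServerError)
            return
        }
        if err := broker.Publish("payments.requested", payload); err != nil {
            // The broker is unavailable; fail fast rather than block the gateway.
            http.Error(w, "temporarily unavailable", http.StatusServiceUnavailable)
            return
        }

        // 202 Accepted: the charge will be processed asynchronously off the queue.
        w.WriteHeader(http.StatusAccepted)
    }
}

A caller could construct the limiter with, for example, rate.NewLimiter(rate.Limit(1150), 10000) to mirror the average and peak figures above.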
Detailed Design: Idempotency and Retries
The most critical reliability pattern in financial systems is Idempotency. Stripe popularized the use of Idempotency-Key headers to ensure that if a client retries a request due to a timeout, the system does not charge the customer twice.
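A minimal sketch of that check, using an in-memory map as a stand-in for the idempotency_keys table defined later, might look like the following; the PaymentResult type and the Charge signature are illustrative.

package main

import (
    "context"
    "sync"
)

// PaymentResult is a hypothetical record of a completed charge.
type PaymentResult struct {
    PaymentID string
    Status    string
}

// IdempotencyStore maps client-supplied idempotency keys to their results.
// A real deployment would back this with the idempotency_keys table below.
type IdempotencyStore struct {
    mu      sync.Mutex
    results map[string]PaymentResult
}

func NewIdempotencyStore() *IdempotencyStore {
    return &IdempotencyStore{results: make(map[string]PaymentResult)}
}

// Charge runs the charge at most once per key: a retried request with the same
// key gets the stored response back instead of charging the customer again.
func (s *IdempotencyStore) Charge(ctx context.Context, key string,
    charge func(context.Context) (PaymentResult, error)) (PaymentResult, error) {

    s.mu.Lock()
    if res, ok := s.results[key]; ok {
        s.mu.Unlock()
        return res, nil // Duplicate request: replay the original outcome.
    }
    s.mu.Unlock()

    res, err := charge(ctx)
    if err != nil {
        return PaymentResult{}, err // Nothing recorded; the client may safely retry.
    }

    s.mu.Lock()
    s.results[key] = res
    s.mu.Unlock()
    return res, nil
}

Note that this in-memory version leaves a window between the lookup and the store where two concurrent requests with the same key could both charge; the unique index on idempotency_keys in the schema below closes that gap by making the insert atomic.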
Below is a Go implementation of a robust retry mechanism with exponential backoff and jitter. Jitter is essential to avoid the "Thundering Herd" problem, where many clients retry at the exact same millisecond, further overwhelming a struggling service.
package main

import (
    "context"
    "fmt"
    "math/rand"
    "time"
)

// Operation is a unit of work that can fail and be safely retried.
type Operation func(ctx context.Context) error

// ExecuteWithRetry runs op up to maxRetries times, backing off exponentially
// with jitter between attempts and respecting context cancellation.
func ExecuteWithRetry(ctx context.Context, op Operation) error {
    const (
        maxRetries = 5
        baseDelay  = 100 * time.Millisecond
        maxDelay   = 10 * time.Second
    )

    var err error
    for i := 0; i < maxRetries; i++ {
        if err = op(ctx); err == nil {
            return nil
        }

        // Calculate exponential backoff: base * 2^attempt.
        backoff := float64(baseDelay) * float64(1<<uint(i))

        // Add jitter (randomness) to prevent synchronized retries.
        jitter := rand.Float64() * 0.5 * backoff

        sleepDuration := time.Duration(backoff + jitter)
        if sleepDuration > maxDelay {
            sleepDuration = maxDelay
        }

        select {
        case <-time.After(sleepDuration):
            fmt.Printf("Retry attempt %d after failure: %v\n", i+1, err)
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("operation failed after %d attempts: %w", maxRetries, err)
}

Database Schema
Reliability at the data layer requires a clear separation between request metadata (idempotency) and the source of truth (ledger). We use PostgreSQL for its ACID compliance and strong consistency.
To handle 100M daily transactions, we partition the PAYMENTS table by created_at (range partitioning) and shard the IDEMPOTENCY_KEYS table by the key itself using a hash-based strategy.
CREATE TABLE payments (
    id UUID NOT NULL,
    user_id UUID NOT NULL,
    amount BIGINT NOT NULL,
    status VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    -- PostgreSQL requires the partition key in any primary key on a partitioned table.
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE INDEX idx_payments_user_id ON payments (user_id);

-- Minimal definition of the key store referenced below; payment_id is illustrative.
CREATE TABLE idempotency_keys (
    key VARCHAR(255) NOT NULL,
    payment_id UUID,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE UNIQUE INDEX idx_idempotency_key ON idempotency_keys (key);

Scaling Strategy
Scaling from 1,000 to 1,000,000 users involves moving from a vertical scaling model to a horizontal, shard-aware architecture. We utilize a "Cell-based Architecture," similar to Amazon’s approach, where the system is divided into isolated islands (cells) to contain failures.
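One common way to implement that routing, sketched below under the assumption of a fixed number of cells, is a stable hash of the user ID, so a given user always lands in the same cell and a failure in one cell stays confined there.

package main

import (
    "fmt"
    "hash/fnv"
)

// cellFor deterministically assigns a user to one of numCells isolated cells,
// so both traffic and failures stay confined to that cell.
func cellFor(userID string, numCells uint32) uint32 {
    h := fnv.New32a()
    h.Write([]byte(userID))
    return h.Sum32() % numCells
}

func main() {
    // With 8 cells (an illustrative number), this user is always routed to the same one.
    fmt.Println(cellFor("user-42", 8))
}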
Failure Modes and Circuit Breakers
The Circuit Breaker pattern prevents a system from repeatedly trying to execute an operation that is likely to fail. This is crucial when calling external APIs like banking gateways. When the error rate exceeds a threshold, the circuit "trips," and subsequent calls fail fast without hitting the network.
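A minimal sketch of the pattern follows; for brevity it trips on a run of consecutive failures rather than the rolling error-rate window described above, and the threshold and cooldown values are illustrative.

package main

import (
    "errors"
    "sync"
    "time"
)

// ErrCircuitOpen is returned immediately while the breaker is open.
var ErrCircuitOpen = errors.New("circuit breaker is open")

// CircuitBreaker fails fast once threshold consecutive failures have been
// observed, and allows a trial call again after cooldown has elapsed.
type CircuitBreaker struct {
    mu        sync.Mutex
    failures  int
    threshold int
    cooldown  time.Duration
    openedAt  time.Time
}

func NewCircuitBreaker(threshold int, cooldown time.Duration) *CircuitBreaker {
    return &CircuitBreaker{threshold: threshold, cooldown: cooldown}
}

// Call runs op unless the circuit is open, in which case it fails fast
// without touching the network at all.
func (cb *CircuitBreaker) Call(op func() error) error {
    cb.mu.Lock()
    if cb.failures >= cb.threshold && time.Since(cb.openedAt) < cb.cooldown {
        cb.mu.Unlock()
        return ErrCircuitOpen
    }
    cb.mu.Unlock()

    err := op()

    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err != nil {
        cb.failures++
        if cb.failures >= cb.threshold {
            cb.openedAt = time.Now() // Trip (or re-trip after a failed trial call).
        }
        return err
    }
    cb.failures = 0 // A success closes the circuit again.
    return nil
}

Wrapping every banking-gateway call in cb.Call keeps a slow or failing dependency from exhausting the payment service's own threads and connections; the table below summarizes this and the related isolation patterns.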
| Pattern | Scenario | Benefit |
|---|---|---|
| Circuit Breaker | External API is down or slow | Prevents resource exhaustion (threads/memory) |
| Bulkhead | One service component is failing | Isolates failure so other components remain functional |
| Dead Letter Queue | Message cannot be processed | Persists failed events for manual replay or debugging |
| Timeout | Network request hangs | Frees up resources by setting a maximum wait time |
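The Timeout row above maps directly onto Go's context package; a small sketch, with an illustrative 2-second budget and a hypothetical gateway URL:

package main

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// callGateway abandons the request once the deadline passes, freeing the
// goroutine and connection instead of letting them hang indefinitely.
func callGateway(parent context.Context) error {
    ctx, cancel := context.WithTimeout(parent, 2*time.Second) // illustrative budget
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet,
        "https://gateway.example.com/v1/health", nil) // hypothetical endpoint
    if err != nil {
        return err
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return fmt.Errorf("gateway call failed or timed out: %w", err)
    }
    defer resp.Body.Close()
    return nil
}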
Conclusion
Designing for reliability means making deliberate trade-offs within the CAP theorem's constraints. In our payment system, we prioritize Consistency and Partition Tolerance (CP) over Availability during network partitions to ensure no double-spending occurs. Idempotency keys bridge the gap between that strict consistency and the need for client retries.
Key takeaways for any staff engineer:
- Assume the network will fail. Use timeouts and retries with jitter for every network call.
- Protect your resources. Use circuit breakers to stop the bleeding and bulkheads to isolate components.
- Ensure idempotency. Every state-changing operation must be idempotent to handle the "at-least-once" delivery nature of distributed systems.
- Observe everything. Reliability is impossible without deep telemetry—metrics, logs, and traces are your eyes in a production crisis.
By embedding these patterns into the DNA of your architecture, you transform a fragile collection of services into a resilient, production-grade distributed system.