Reliability Patterns Every Engineer Should Know

In the world of distributed systems, failure is not an anomaly; it is a fundamental property of the environment. As systems scale from single-node prototypes to global infrastructures like those run by Netflix or Uber, "one-in-a-million" events stop being rare: at billions of requests per day, they occur every few minutes. Staff engineers recognize that reliability is not about preventing failure, but about building systems that remain functional despite it.

This shift in mindset—from fault avoidance to fault tolerance—requires a standardized set of reliability patterns. These patterns serve as the architectural guardrails that prevent a localized failure in a minor microservice from cascading into a catastrophic global outage. Whether you are dealing with network partitions, slow database queries, or third-party API downtime, the goal is to minimize the blast radius and maintain a consistent user experience.

Requirements

To illustrate these patterns, we will design a high-availability payment processing system. This system must handle millions of transactions while interacting with external banking gateways that are notoriously unreliable.

Functional and Non-Functional Requirements

Functionally, the system must accept payment requests, execute them against external banking gateways, and guarantee that a retried request never charges a customer twice. The key non-functional requirements are high availability, strong consistency for the ledger, the ability to absorb peaks of 10,000 transactions per second, and seven years of transaction retention for compliance.

Capacity Estimation

| Metric | Value |
| --- | --- |
| Daily Active Users (DAU) | 5 Million |
| Average Transactions per Day | 100 Million |
| Average Transactions per Second (TPS) | ~1,150 |
| Peak Transactions per Second | 10,000 |
| Data Retention | 7 Years (Compliance) |
| Total Storage (Daily) | ~100 GB (at ~1 KB per record) |

High-Level Architecture

The architecture follows a decoupled, event-driven approach. By using an API Gateway with built-in rate limiting and a message broker for asynchronous processing, we protect the core ledger from traffic spikes.
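
As a minimal sketch of the gateway-side protection, the snippet below uses Go's golang.org/x/time/rate package as a token-bucket rate limiter in front of the payment endpoint. The limit values, route, and handler are illustrative assumptions rather than part of the design above.

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// Hypothetical gateway-wide limit: 1,000 requests per second sustained,
// with a burst of 200 to absorb short spikes (values are illustrative).
var limiter = rate.NewLimiter(rate.Limit(1000), 200)

// RateLimit sheds excess traffic before it can reach the core ledger.
func RateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	// In the full design this handler would publish the request to the message broker.
	mux.Handle("/payments", RateLimit(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})))
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```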

Detailed Design: Idempotency and Retries

The most critical reliability pattern in financial systems is Idempotency. Stripe popularized the use of Idempotency-Key headers to ensure that if a client retries a request due to a timeout, the system does not charge the customer twice.
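
A minimal sketch of the server side of this pattern is shown below, assuming an HTTP middleware that records each Idempotency-Key the first time it is seen and rejects replays. The in-memory sync.Map stands in for the idempotency_keys table introduced later; the names and status codes are illustrative.

```go
package gateway

import (
	"net/http"
	"sync"
)

// seenKeys stands in for the durable idempotency_keys table; a real system
// would also persist the original response so replays can return it verbatim.
var seenKeys sync.Map

// Idempotency ensures a retried request with the same key is not executed twice.
func Idempotency(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("Idempotency-Key")
		if key == "" {
			http.Error(w, "missing Idempotency-Key header", http.StatusBadRequest)
			return
		}
		if _, loaded := seenKeys.LoadOrStore(key, struct{}{}); loaded {
			// Duplicate retry: acknowledge without charging the customer again.
			w.WriteHeader(http.StatusConflict)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

In production, the key check and the charge must happen atomically (for example, via the unique index shown in the schema below); otherwise two concurrent retries could both pass the check.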

Below is a Go implementation of a robust retry mechanism with exponential backoff and jitter. Jitter is essential to avoid the "Thundering Herd" problem, where many clients retry at the exact same millisecond, further overwhelming a struggling service.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

type Operation func(ctx context.Context) error

func ExecuteWithRetry(ctx context.Context, op Operation) error {
	const (
		maxRetries    = 5
		baseDelay     = 100 * time.Millisecond
		maxDelay      = 10 * time.Second
	)

	var lastErr error
	for i := 0; i < maxRetries; i++ {
		lastErr = op(ctx)
		if lastErr == nil {
			return nil
		}

		// Skip the backoff sleep after the final failed attempt.
		if i == maxRetries-1 {
			break
		}

		// Calculate exponential backoff: base * 2^attempt
		backoff := float64(baseDelay) * float64(1<<uint(i))
		
		// Add jitter (randomness) to prevent synchronized retries
		jitter := rand.Float64() * 0.5 * backoff
		sleepDuration := time.Duration(backoff + jitter)

		if sleepDuration > maxDelay {
			sleepDuration = maxDelay
		}

		select {
		case <-time.After(sleepDuration):
			fmt.Printf("Retry attempt %d after failure: %v\n", i+1, lastErr)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("operation failed after %d attempts: %w", maxRetries, lastErr)
}
```
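
Continuing the same file, a hedged usage sketch might look like the following; chargeGateway and the 30-second deadline are hypothetical stand-ins for the real banking-gateway client and timeout policy.

```go
// chargeGateway is a hypothetical call to the external banking gateway.
func chargeGateway(ctx context.Context) error {
	return fmt.Errorf("gateway unavailable") // simulate a transient failure
}

func main() {
	// A context deadline bounds the total time spent across all retries.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := ExecuteWithRetry(ctx, chargeGateway); err != nil {
		fmt.Println("payment failed:", err)
	}
}
```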

Database Schema

Reliability at the data layer requires a clear separation between request metadata (idempotency) and the source of truth (ledger). We use PostgreSQL for its ACID compliance and strong consistency.

To handle 100M daily transactions, we partition the PAYMENTS table by created_at (range partitioning) and shard the IDEMPOTENCY_KEYS table by the key itself using a hash-based strategy.

```sql
CREATE TABLE payments (
    id UUID NOT NULL,
    user_id UUID NOT NULL,
    amount BIGINT NOT NULL,
    status VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
) PARTITION BY RANGE (created_at);

-- Example monthly partition; new ranges are created ahead of time by a maintenance job.
CREATE TABLE payments_2024_01 PARTITION OF payments
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE idempotency_keys (
    key TEXT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_payments_user_id ON payments (user_id);
CREATE UNIQUE INDEX idx_idempotency_key ON idempotency_keys (key);
```

Scaling Strategy

Scaling from 1,000 to 1,000,000 users involves moving from a vertical scaling model to a horizontal, shard-aware architecture. We utilize a "Cell-based Architecture," similar to Amazon’s approach, where the system is divided into isolated islands (cells) to contain failures.
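
A sketch of the routing idea is shown below, assuming users are pinned to cells by a stable hash of their ID; the cell names and count are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Each cell is an isolated deployment of the full payment stack.
var cells = []string{"cell-1", "cell-2", "cell-3"}

// cellFor pins a user to a single cell, so a failure in one cell only
// affects the slice of users it serves.
func cellFor(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return cells[h.Sum32()%uint32(len(cells))]
}

func main() {
	fmt.Println(cellFor("user-42")) // every request for user-42 routes to the same cell
}
```

In practice the assignment usually lives in a routing table rather than a modulo computation, so adding a new cell does not reshuffle existing users.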

Failure Modes and Circuit Breakers

The Circuit Breaker pattern prevents a system from repeatedly trying to execute an operation that is likely to fail. This is crucial when calling external APIs like banking gateways. When the error rate exceeds a threshold, the circuit "trips," and subsequent calls fail fast without hitting the network.
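
A minimal sketch of the pattern is shown below. It trips after a fixed number of consecutive failures rather than a rolling error rate, which keeps the example short; the threshold and cooldown values are assumptions.

```go
package reliability

import (
	"errors"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit breaker is open")

// CircuitBreaker fails fast once `threshold` consecutive failures occur,
// and allows a single trial call through after `cooldown` has elapsed.
type CircuitBreaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openedAt  time.Time
}

func NewCircuitBreaker(threshold int, cooldown time.Duration) *CircuitBreaker {
	return &CircuitBreaker{threshold: threshold, cooldown: cooldown}
}

func (cb *CircuitBreaker) Call(op func() error) error {
	cb.mu.Lock()
	if cb.failures >= cb.threshold && time.Since(cb.openedAt) < cb.cooldown {
		cb.mu.Unlock()
		return ErrCircuitOpen // fail fast without touching the network
	}
	cb.mu.Unlock()

	err := op()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.threshold {
			cb.openedAt = time.Now() // (re)open the circuit
		}
		return err
	}
	cb.failures = 0 // any success closes the circuit again
	return nil
}
```

Production implementations typically track a rolling error rate and model an explicit half-open state, but the fail-fast behaviour is the same.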

| Pattern | Scenario | Benefit |
| --- | --- | --- |
| Circuit Breaker | External API is down or slow | Prevents resource exhaustion (threads/memory) |
| Bulkhead | One service component is failing | Isolates the failure so other components remain functional |
| Dead Letter Queue | Message cannot be processed | Persists failed events for manual replay or debugging |
| Timeout | Network request hangs | Frees up resources by setting a maximum wait time |
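
The bulkhead row above can also be sketched briefly: a buffered channel works as a semaphore that caps concurrent calls to one dependency. The pool size and names are illustrative.

```go
package reliability

import "errors"

var ErrBulkheadFull = errors.New("bulkhead capacity exhausted")

// Bulkhead caps how many calls to one dependency may run concurrently,
// so a slow banking gateway cannot consume every worker in the service.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(size int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, size)}
}

func (b *Bulkhead) Call(op func() error) error {
	select {
	case b.slots <- struct{}{}: // acquire a slot
		defer func() { <-b.slots }() // release it when done
		return op()
	default:
		return ErrBulkheadFull // shed load instead of queueing without bound
	}
}
```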

Conclusion

Designing for reliability is a balancing act within the constraints of the CAP theorem. In our payment system, we prioritize Consistency and Partition Tolerance (CP) over Availability during network partitions to ensure no double-spending occurs. By implementing idempotency keys, we bridge the gap between strict consistency and the need for retries.

Key takeaways for any staff engineer:

  1. Assume the network will fail. Use timeouts and retries with jitter for every network call.
  2. Protect your resources. Use circuit breakers to stop the bleeding and bulkheads to isolate components.
  3. Ensure idempotency. Every state-changing operation must be idempotent to handle the "at-least-once" delivery nature of distributed systems.
  4. Observe everything. Reliability is impossible without deep telemetry—metrics, logs, and traces are your eyes in a production crisis.

By embedding these patterns into the DNA of your architecture, you transform a fragile collection of services into a resilient, production-grade distributed system.
