Active-Active Multi-Region System (Global Traffic Routing)


In the world of high-scale distributed systems, the transition from a single-region architecture to an Active-Active multi-region setup represents a significant engineering milestone. For companies like Netflix, Uber, and Stripe, this isn't just about disaster recovery; it is a fundamental requirement for providing low-latency experiences to a global user base and achieving "five nines" (99.999%) availability. When a single AWS region experiences a "weather event" or a backbone fiber is cut, an Active-Active system ensures that traffic is seamlessly re-routed without user intervention.

Designing such a system requires a departure from traditional monolithic thinking. We must grapple with the speed of light—which imposes a hard limit on data synchronization—and the CAP theorem, which forces us to choose between consistency and availability during network partitions. In an Active-Active configuration, every region is "live," serving traffic and mutating state. This introduces complex challenges in global traffic routing, data replication, and conflict resolution that do not exist in simpler Active-Passive setups.

This post explores the architectural blueprints and operational strategies required to build a production-grade Active-Active system. We will focus on the "Global Traffic Routing" layer, which acts as the entry point for the entire ecosystem, ensuring that users are directed to the healthiest and closest regional deployment.

Requirements

To build a robust multi-region system, we must define clear boundaries for our functional and non-functional requirements.

Capacity Estimation

The following table outlines the scale our design must support, modeled after a global payment or social media platform.

| Metric | Target Value |
| --- | --- |
| Total Global Users | 100 Million |
| Peak Global Requests Per Second (RPS) | 50,000 |
| Number of Regions | 4 (e.g., us-east-1, eu-west-1, ap-southeast-1, sa-east-1) |
| Latency Budget (Routing Layer) | < 20 ms overhead |
| Data Consistency | Eventual consistency (with strong consistency for specific keys) |

High-Level Architecture

The architecture relies on a multi-layered routing strategy. At the edge, we use Anycast DNS or a Global Accelerator to route users to the nearest regional Point of Presence (PoP). Once the request enters a region, a local Load Balancer directs it to the appropriate microservices.

In this model, the "Global Traffic Router" is responsible for continuous health checking. If the US-East region becomes degraded, the DNS layer must detect this and shift traffic to EU-West. This process, known as "Regional Evacuation," is a core pattern used by Netflix to maintain availability during regional outages.
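To make "Regional Evacuation" concrete, here is a minimal sketch of the weight-planning step, assuming a control plane that receives per-region 5xx error rates from health checkers. The planWeights function, RegionStats type, and the 5% threshold are illustrative stand-ins; the resulting weights would be pushed to a real DNS or Global Accelerator API rather than printed.

go
package main

import "fmt"

// RegionStats is a simplified health snapshot, as might be reported
// by a regional probe fleet.
type RegionStats struct {
	Region    string
	ErrorRate float64 // fraction of 5xx responses over the last window
}

// evacuationThreshold is an illustrative value; real systems tune it
// against the cost of false-positive evacuations.
const evacuationThreshold = 0.05

// planWeights returns DNS weights: healthy regions share traffic
// roughly equally, while degraded regions are drained to zero.
func planWeights(stats []RegionStats) map[string]int {
	healthy := 0
	for _, s := range stats {
		if s.ErrorRate < evacuationThreshold {
			healthy++
		}
	}
	weights := make(map[string]int)
	for _, s := range stats {
		if s.ErrorRate < evacuationThreshold && healthy > 0 {
			weights[s.Region] = 100 / healthy
		} else {
			weights[s.Region] = 0 // evacuate: stop sending new traffic
		}
	}
	return weights
}

func main() {
	stats := []RegionStats{
		{"us-east-1", 0.22}, // degraded
		{"eu-west-1", 0.01},
		{"ap-southeast-1", 0.02},
	}
	fmt.Println(planWeights(stats)) // us-east-1 drained to 0
}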

Detailed Design: Global Traffic Router

A critical component is the logic that determines where a request should go. While DNS-based routing is common, many organizations use a proxy-based approach (like Uber’s use of Envoy) for finer control. Below is a conceptual Go implementation of a latency-aware router that could run at the edge.

go
package main

import (
	"fmt"
	"sync"
	"time"
)

// RegionHealth is the router's view of a single region, refreshed by
// background health checkers.
type RegionHealth struct {
	RegionCode string
	Latency    time.Duration
	IsHealthy  bool
}

type GlobalRouter struct {
	mu      sync.RWMutex
	regions map[string]*RegionHealth
}

// NewGlobalRouter initializes the region map; without this,
// UpdateHealth would panic writing to a nil map.
func NewGlobalRouter() *GlobalRouter {
	return &GlobalRouter{regions: make(map[string]*RegionHealth)}
}

// GetBestRegion selects the healthy region with the lowest observed
// latency. An empty string means no region is healthy, which callers
// should treat as a signal to fall back to a static default.
func (gr *GlobalRouter) GetBestRegion() string {
	gr.mu.RLock()
	defer gr.mu.RUnlock()

	var bestRegion string
	minLatency := time.Hour // sentinel: any real measurement is lower

	for code, health := range gr.regions {
		if health.IsHealthy && health.Latency < minLatency {
			minLatency = health.Latency
			bestRegion = code
		}
	}
	return bestRegion
}

// UpdateHealth is called by background health checkers to refresh the
// router's view of a region.
func (gr *GlobalRouter) UpdateHealth(region string, latency time.Duration, healthy bool) {
	gr.mu.Lock()
	defer gr.mu.Unlock()
	gr.regions[region] = &RegionHealth{
		RegionCode: region,
		Latency:    latency,
		IsHealthy:  healthy,
	}
}

func main() {
	router := NewGlobalRouter()
	router.UpdateHealth("us-east-1", 40*time.Millisecond, true)
	router.UpdateHealth("eu-west-1", 25*time.Millisecond, true)
	fmt.Println(router.GetBestRegion()) // eu-west-1
}

This router maintains a local view of global health. In a production environment, this would be integrated with a control plane that aggregates signals from synthetic probes and real-user monitoring (RUM).
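As a minimal sketch of such a synthetic probe, assuming each region exposes a /healthz route on a hypothetical api.example.com domain, the following measures round-trip latency per region; its results would be fed into UpdateHealth above.

go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeRegion issues a synthetic health check against a regional
// endpoint and reports round-trip latency. The URL scheme is
// illustrative; real probes hit a dedicated health route.
func probeRegion(region string) (time.Duration, bool) {
	client := &http.Client{Timeout: 2 * time.Second}
	start := time.Now()
	resp, err := client.Get(fmt.Sprintf("https://%s.api.example.com/healthz", region))
	if err != nil {
		return 0, false // unreachable counts as unhealthy
	}
	defer resp.Body.Close()
	return time.Since(start), resp.StatusCode == http.StatusOK
}

func main() {
	for _, region := range []string{"us-east-1", "eu-west-1"} {
		latency, healthy := probeRegion(region)
		// In the router above, this result would feed UpdateHealth.
		fmt.Printf("%s: latency=%v healthy=%v\n", region, latency, healthy)
	}
}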

Database Schema and Consistency

Data is the hardest part of Active-Active. We use a "Global Database" approach (like CockroachDB or Amazon Aurora Global Database) where data is partitioned by region but accessible globally.

To prevent write conflicts, we employ Regional Pinning. A user’s data is "homed" in a specific region. While any region can technically read the data, writes are ideally routed to the home region to avoid distributed lock contention.
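Here is a sketch of that routing decision, assuming a lookup of the user's home region; in practice this would read the home_region column in the schema below, with a hard-coded map standing in here.

go
package main

import "fmt"

// homeRegionOf would normally be a lookup against the users table's
// home_region column (see the schema below); a map stands in here.
var homeRegionOf = map[string]string{
	"user-123": "us-east-1",
	"user-456": "eu-west-1",
}

// routeWrite decides whether a write can be applied locally or must
// be forwarded to the user's home region.
func routeWrite(userID, localRegion string) string {
	home, ok := homeRegionOf[userID]
	if !ok || home == localRegion {
		return localRegion // apply locally
	}
	return home // forward to the home region to avoid write conflicts
}

func main() {
	fmt.Println(routeWrite("user-456", "us-east-1")) // eu-west-1
}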

SQL Schema with Partitioning:

sql
CREATE TABLE users (
    id UUID NOT NULL,
    email TEXT NOT NULL,
    home_region TEXT NOT NULL,
    data JSONB,
    updated_at TIMESTAMPTZ,
    -- PostgreSQL requires the partition key in any unique constraint,
    -- so global uniqueness of id or email alone must be enforced at
    -- the application layer or via a separate lookup table.
    PRIMARY KEY (id, home_region)
) PARTITION BY LIST (home_region);

CREATE TABLE users_us_east PARTITION OF users FOR VALUES IN ('us-east-1');
CREATE TABLE users_eu_west PARTITION OF users FOR VALUES IN ('eu-west-1');

-- Non-unique index for global lookups by email
CREATE INDEX idx_users_email ON users (email);

Scaling Strategy

Scaling an Active-Active system involves moving from a "Global Monolith" to a "Cell-based Architecture." Each region is divided into isolated cells to minimize the blast radius of internal failures.

As we scale to millions of users, we move away from simple DNS weightings to "Traffic Sharding," where specific user segments are mapped to specific regional cells. This allows us to scale horizontally by adding more regions without increasing the complexity of any single cluster.
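A minimal sketch of that mapping, assuming cells are identified by illustrative names and users are assigned by hashing their ID:

go
package main

import (
	"fmt"
	"hash/fnv"
)

// cells lists region-local cells; the names are illustrative.
var cells = []string{
	"us-east-1-cell-a", "us-east-1-cell-b",
	"eu-west-1-cell-a", "eu-west-1-cell-b",
}

// cellFor maps a user deterministically to a cell, so the same user
// always lands inside the same blast-radius boundary.
func cellFor(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return cells[int(h.Sum32())%len(cells)]
}

func main() {
	fmt.Println(cellFor("user-123"))
	fmt.Println(cellFor("user-456"))
}

Note that the modulo step shown here reshuffles users when the cell count changes; a production system would use consistent hashing or an explicit user-to-cell mapping table to keep assignments stable as cells are added.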

Failure Modes and Resiliency

In an Active-Active system, we must design for "Partial Failure." If the connection between US and EU is severed, both regions must continue to function independently.

When a region is "Evacuated," we utilize Circuit Breakers at the edge. If the US-East regional endpoint's 5xx error rate exceeds a configured threshold, the Global Load Balancer automatically redirects traffic to US-West or EU-West. This requires that every region maintain enough "headroom" (spare capacity) to absorb at least 50% of another region's traffic.
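Here is a minimal sketch of such an edge breaker, tracking the 5xx rate over a fixed window; the threshold and minimum request volume are illustrative, and real implementations (e.g., Envoy's outlier detection) add sliding windows and half-open probing.

go
package main

import (
	"fmt"
	"sync"
)

// CircuitBreaker trips when the 5xx rate over the current window
// exceeds a threshold; a tripped breaker tells the edge to redirect
// traffic to another region.
type CircuitBreaker struct {
	mu        sync.Mutex
	total     int
	failures  int
	threshold float64 // e.g., 0.5 trips at a 50% error rate
	minVolume int     // don't trip on tiny samples
}

// Record tallies one response outcome.
func (cb *CircuitBreaker) Record(is5xx bool) {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.total++
	if is5xx {
		cb.failures++
	}
}

// Open reports whether the region should be evacuated.
func (cb *CircuitBreaker) Open() bool {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if cb.total < cb.minVolume {
		return false
	}
	return float64(cb.failures)/float64(cb.total) > cb.threshold
}

func main() {
	cb := &CircuitBreaker{threshold: 0.5, minVolume: 10}
	for i := 0; i < 20; i++ {
		cb.Record(i%3 != 0) // simulate a ~65% 5xx rate
	}
	fmt.Println("evacuate region:", cb.Open()) // true
}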

Conclusion

Building an Active-Active Multi-Region system is a trade-off between complexity and resilience. By leveraging Global Traffic Routing, we can achieve near-zero downtime and superior performance. However, this comes at the cost of managing data consistency and increased infrastructure spend.

Key patterns to remember:

  1. Prefer Latency-based Routing: Always bring the compute closer to the user.
  2. Embrace Eventual Consistency: Use asynchronous replication for non-critical paths to keep regional latency low.
  3. Implement Regional Pinning: Minimize cross-region writes by homing user data.
  4. Automate Evacuation: Don't wait for a human to flip the switch during an outage; use automated health signals.

By following these principles, you can build a system that not only survives regional disasters but thrives under the pressure of global scale.

References

https://research.google/pubs/pub45855/
https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b
https://aws.amazon.com/builders-library/multi-region-application-architecture/
https://www.cockroachlabs.com/blog/multi-region-serverless-architecture/