Eventually Consistent Systems (CAP & Tradeoffs)
In the world of high-scale distributed systems, the dream of "strong consistency" often collapses under the weight of global latency and the inevitability of network partitions. As staff engineers, we are frequently forced to navigate the CAP theorem (Consistency, Availability, Partition Tolerance), popularly summarized as "pick two." In practice, network partitions are a fact of life in cloud environments like AWS or GCP, so partition tolerance is not optional: when a partition happens, the real choice is between availability (AP) and consistency (CP).
Eventually consistent systems represent the "AP" side of this coin. They prioritize responsiveness and uptime, accepting that different nodes in a cluster may briefly hold divergent versions of the same data. This pattern is not a compromise of quality but a strategic engineering choice. Companies like Amazon (DynamoDB), Netflix (Cassandra), and Uber (Ringpop) rely on eventual consistency to ensure that a user in Tokyo and a user in New York can both access the service even if the trans-Pacific cables are severed.
However, building these systems requires a fundamental shift in how we handle state. We move away from the safety of ACID transactions and toward a world of idempotent operations, conflict resolution, and background reconciliation. This post explores how to design a production-grade notification and activity feed system—a classic use case for eventual consistency—where high availability is non-negotiable.
Requirements
To design a global activity feed (similar to LinkedIn or Facebook), we must handle massive write volumes while ensuring the system remains responsive even during regional outages.
Capacity Estimation
| Metric | Value |
|---|---|
| Daily Active Users (DAU) | 100 Million |
| Average Posts per User/Day | 5 |
| Total Writes per Second (Avg) | ~5,800 TPS |
| Total Writes per Second (Peak) | ~15,000 TPS |
| Read/Write Ratio | 10:1 |
| Storage per Post | 1 KB |
| Daily Storage Growth | ~500 GB |
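The arithmetic behind these figures is simple enough to reproduce. The snippet below does so, with one assumption that is not stated in the table: a peak-to-average factor of roughly 2.5x, which is what the ~15,000 TPS peak implies.

package main

import "fmt"

// Back-of-the-envelope numbers behind the capacity table above.
func main() {
    const (
        dau           = 100_000_000 // daily active users
        postsPerUser  = 5
        bytesPerPost  = 1_000  // ~1 KB per post
        secondsPerDay = 86_400
        peakFactor    = 2.5 // assumed peak-to-average ratio
    )

    writesPerDay := dau * postsPerUser
    avgTPS := writesPerDay / secondsPerDay

    fmt.Printf("avg writes/s:  ~%d\n", avgTPS)                       // 5,787 -> "~5,800"
    fmt.Printf("peak writes/s: ~%.0f\n", float64(avgTPS)*peakFactor) // 14,468 -> "~15,000"
    fmt.Printf("storage/day:   ~%.0f GB\n",
        float64(writesPerDay)*bytesPerPost/1e9) // 500 GB
}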
High-Level Architecture
The architecture follows an asynchronous, event-driven pattern. When a user posts an update, we acknowledge it as soon as it is durably written to a region-local data store and published to a message bus. Background workers then propagate the data to other regions and update the "fan-out" feeds.
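A minimal sketch of that write path follows. The LocalStore and EventBus interfaces, the post-created topic name, and the JSON payload are placeholders for whatever your platform actually uses (for example, a Cassandra table and a Kafka topic); the point is that the client is acknowledged after the in-region write and publish, and everything else happens asynchronously.

package feed

import (
    "context"
    "encoding/json"
    "time"
)

// LocalStore and EventBus are hypothetical stand-ins for the region-local
// database and the cross-region message bus.
type LocalStore interface {
    SavePost(ctx context.Context, p Post) error
}

type EventBus interface {
    Publish(ctx context.Context, topic, key string, payload []byte) error
}

type Post struct {
    ID        string
    AuthorID  string
    Body      string
    CreatedAt time.Time
}

// PublishPost is the hot path: persist in-region, enqueue for asynchronous
// fan-out and cross-region replication, then acknowledge. Followers see the
// post eventually rather than immediately.
func PublishPost(ctx context.Context, store LocalStore, bus EventBus, p Post) error {
    if err := store.SavePost(ctx, p); err != nil {
        return err // fail fast: the client retries with the same post ID
    }
    payload, err := json.Marshal(p)
    if err != nil {
        return err
    }
    // The post ID doubles as an idempotency key, so redelivered events do
    // not create duplicate feed entries downstream.
    return bus.Publish(ctx, "post-created", p.ID, payload)
}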
Detailed Design: Conflict Resolution
In an eventually consistent system, two writes to the same record might happen in different regions simultaneously. To resolve this, we typically use "Last Write Wins" (LWW) or Conflict-free Replicated Data Types (CRDTs).
Below is a Go implementation of a simplified LWW element set. Because membership is decided purely by comparing timestamps, replicas that see the same add and remove events, in any order, converge on the same answer.
package consistency

import (
    "sync"
    "time"
)

// LWWSet is a Last-Write-Wins element set: every add and remove is tagged
// with a timestamp, and membership is decided by comparing the two.
type LWWSet struct {
    mu        sync.RWMutex
    addSet    map[string]int64
    removeSet map[string]int64
}

func NewLWWSet() *LWWSet {
    return &LWWSet{
        addSet:    make(map[string]int64),
        removeSet: make(map[string]int64),
    }
}

// Add records an insert with a high-resolution timestamp. An older
// timestamp never overwrites a newer one.
func (s *LWWSet) Add(value string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if ts := time.Now().UnixNano(); ts > s.addSet[value] {
        s.addSet[value] = ts
    }
}

// Remove marks an element as deleted with a high-resolution timestamp.
func (s *LWWSet) Remove(value string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if ts := time.Now().UnixNano(); ts > s.removeSet[value] {
        s.removeSet[value] = ts
    }
}

// Exists reports whether an element is present under LWW semantics: the
// latest add must be strictly newer than the latest remove, so a timestamp
// tie is resolved in favor of removal.
func (s *LWWSet) Exists(value string) bool {
    s.mu.RLock()
    defer s.mu.RUnlock()
    addTs, inAdd := s.addSet[value]
    if !inAdd {
        return false
    }
    removeTs, inRemove := s.removeSet[value]
    if !inRemove {
        return true
    }
    return addTs > removeTs
}

This logic is crucial for features like "liking" a post or updating a profile. If a user unlikes a post in a low-connectivity environment, the removal must eventually override any delayed "add" events.
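The set above only stamps operations applied locally. For replicas in different regions to converge, each node also has to fold in the state it receives from its peers, keeping the newest timestamp per element. A minimal merge sketch, assuming it lives in the same consistency package and that the snapshot maps arrive over whatever replication transport is already in place:

// Merge folds a snapshot of a remote replica's state into the local set,
// keeping the newest timestamp seen for each element.
func (s *LWWSet) Merge(remoteAdds, remoteRemoves map[string]int64) {
    s.mu.Lock()
    defer s.mu.Unlock()
    for v, ts := range remoteAdds {
        if ts > s.addSet[v] {
            s.addSet[v] = ts
        }
    }
    for v, ts := range remoteRemoves {
        if ts > s.removeSet[v] {
            s.removeSet[v] = ts
        }
    }
}

Because this merge is commutative, associative, and idempotent, replicas can exchange state in any order and any number of times and still end up with identical sets, which is exactly the convergence guarantee eventual consistency relies on.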
Database Schema
For the persistence layer, we use a partitioned NoSQL database like Cassandra or a distributed SQL database like CockroachDB. We partition data by user_id so that everything needed to render a single user's feed is co-located on one partition, avoiding cross-node scatter-gather reads.
The table definition (CQL syntax) uses a composite primary key so the newest feed entries can be fetched with an efficient range query:
-- post_id is part of the clustering key so two posts that share a
-- timestamp do not overwrite each other
CREATE TABLE user_feeds (
    user_id UUID,
    post_id UUID,
    actor_id UUID,
    content_preview TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id DESC);
-- Partitioning by user_id ensures feed generation is a local, single-partition read
CREATE INDEX ON user_feeds (actor_id);
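On the read side, a feed page is then a single range scan over one partition. Below is a sketch of that query using the gocql driver; the LocalOne consistency level and the FeedItem/LatestFeed names are illustrative choices rather than part of the design above.

package feed

import (
    "time"

    "github.com/gocql/gocql"
)

// FeedItem mirrors the columns read from user_feeds.
type FeedItem struct {
    PostID    gocql.UUID
    ActorID   gocql.UUID
    Preview   string
    CreatedAt time.Time
}

// LatestFeed fetches the newest items for one user. The partition key is
// user_id and rows are clustered by created_at DESC, so this is a single
// range read against the user's partition.
func LatestFeed(session *gocql.Session, userID gocql.UUID, limit int) ([]FeedItem, error) {
    q := session.Query(
        `SELECT post_id, actor_id, content_preview, created_at
           FROM user_feeds WHERE user_id = ? LIMIT ?`,
        userID, limit,
    )
    // LocalOne keeps the read in-region; we accept that it may be stale.
    iter := q.Consistency(gocql.LocalOne).Iter()

    var items []FeedItem
    var it FeedItem
    for iter.Scan(&it.PostID, &it.ActorID, &it.Preview, &it.CreatedAt) {
        items = append(items, it)
    }
    return items, iter.Close()
}

A caller would page with something like LatestFeed(session, userID, 50) and render the results directly.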
Scaling Strategy
Scaling from 1,000 to 1,000,000+ users requires moving from a monolithic database to a distributed, multi-region architecture. We use a "cell-based" architecture to limit blast radius (see the routing sketch after the table below).
| Scale | Strategy | Tradeoff |
|---|---|---|
| 1k Users | Single RDS Instance | Simple, but a single point of failure |
| 100k Users | Read Replicas | Improved Reads, but Write Bottleneck |
| 1M+ Users | Sharding & Multi-Region | High Availability, but Eventual Consistency |
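As referenced above, cell routing is ultimately a deterministic mapping from user to cell. The sketch below uses a plain hash over a static cell list for brevity; a production router would normally add a lookup table or consistent hashing so cells can be added without reshuffling every user.

package cells

import "hash/fnv"

// Cell identifies one self-contained deployment: its own API fleet,
// workers, and storage, so an incident stays inside the cell.
type Cell struct {
    ID       string
    Endpoint string
}

// Router maps every user to exactly one cell.
type Router struct {
    cells []Cell // assumed non-empty and stable
}

func NewRouter(cells []Cell) *Router {
    return &Router{cells: cells}
}

// CellFor hashes the user ID onto the cell list, so a user's traffic and
// data always land inside the same blast-radius boundary.
func (r *Router) CellFor(userID string) Cell {
    h := fnv.New32a()
    h.Write([]byte(userID)) // fnv's Write never returns an error
    return r.cells[h.Sum32()%uint32(len(r.cells))]
}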
Failure Modes and Resiliency
In an AP system, we must handle "Partial Failures." If the fan-out service fails, we don't want to lose the update; we want to retry until it succeeds.
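In practice that means wrapping the fan-out call in bounded retries with exponential backoff and parking anything that still fails for background reconciliation instead of dropping it. The deliver callback below is a placeholder for however the worker actually writes follower feeds:

package fanout

import (
    "context"
    "fmt"
    "time"
)

// retryFanout re-attempts a fan-out delivery with exponential backoff.
// Deliveries are keyed by post ID, so replays are idempotent.
func retryFanout(ctx context.Context, deliver func(context.Context) error, maxAttempts int) error {
    backoff := 100 * time.Millisecond
    for attempt := 1; ; attempt++ {
        err := deliver(ctx)
        if err == nil {
            return nil
        }
        if attempt == maxAttempts {
            // Hand off to a dead-letter queue / reconciliation job here.
            return fmt.Errorf("fan-out failed after %d attempts: %w", attempt, err)
        }
        select {
        case <-time.After(backoff):
            backoff *= 2 // exponential backoff before the next attempt
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}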
We also implement a circuit-breaker-style guard at the API gateway. If cross-region replication lag exceeds a threshold (e.g., 5 seconds), the gateway can temporarily switch to a "Read-Local" mode, serving region-local data instead of consulting replicas that are known to be badly behind, or it can shed load by prioritizing specific high-traffic shards.
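A sketch of that switch is below; it assumes some probe feeds it a replication-lag measurement (for example, consumer lag on the replication topic converted to a duration), and that the gateway consults Mode() on each read.

package resiliency

import (
    "sync/atomic"
    "time"
)

// ReadMode controls where feed reads are served from.
type ReadMode int32

const (
    ReadGlobal ReadMode = iota // normal: reads may consult remote replicas
    ReadLocal                  // degraded: serve region-local data only
)

// LagBreaker flips the read path to local-only mode while cross-region
// replication lag exceeds the threshold, and flips back once it recovers.
type LagBreaker struct {
    threshold time.Duration
    mode      atomic.Int32
}

func NewLagBreaker(threshold time.Duration) *LagBreaker {
    return &LagBreaker{threshold: threshold}
}

// Observe records the latest lag measurement from the replication pipeline.
func (b *LagBreaker) Observe(lag time.Duration) {
    if lag > b.threshold {
        b.mode.Store(int32(ReadLocal))
    } else {
        b.mode.Store(int32(ReadGlobal))
    }
}

// Mode is checked by the gateway on each read.
func (b *LagBreaker) Mode() ReadMode {
    return ReadMode(b.mode.Load())
}

A production version would add hysteresis, requiring the lag to stay under the threshold for some window before re-opening, so the gateway does not flap between modes.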
Conclusion
Designing for eventual consistency is a deliberate choice to prioritize availability over immediate global synchronization. By using patterns like LWW sets, idempotent background workers, and regional sharding, we can build systems that remain resilient in the face of network instability. The "trade" in CAP is not about losing data; it is about managing the window of time in which data remains inconsistent. For a global notification system, a synchronization delay measured in hundreds of milliseconds rather than tens is a small price to pay for a service that stays up through regional failures.