How Staff+ Engineers Design Systems
When a Senior Engineer approaches a system design problem, they focus on the "how"—the specific technologies, the schema, and the API endpoints. When a Staff+ Engineer approaches the same problem, they focus on the "why" and the "at what cost." At this level, design is less about picking the "best" tool and more about managing a complex web of trade-offs, operational burdens, and long-term evolution.
In distributed systems, the most expensive mistakes aren't the ones that crash the system today; they are the ones that make the system impossible to change or scale two years from now. This post explores the Staff-level design process through the lens of a Global Distributed Rate Limiter, a critical component used by companies like Stripe and Uber to protect downstream services from cascading failures and malicious traffic.
Designing a rate limiter at scale is a classic study in the CAP theorem. Do you prioritize consistency (ensuring a user never exceeds their limit by even one request) or availability (ensuring the rate limiter doesn't become a single point of failure that blocks legitimate traffic)? A Staff+ engineer knows that in 99% of web-scale applications, high availability and low latency beat perfect consistency.
Requirements and Constraints
The first step in any Staff-level design is defining the "envelope"—the boundaries within which the system must operate. We aren't just building a rate limiter; we are building a multi-tenant, globally distributed guardrail.
To understand the scale, we must perform back-of-the-envelope calculations. If we are protecting a suite of services handling 1 million requests per second (RPS), our rate limiter must handle that same volume with negligible overhead.
| Metric | Value |
|---|---|
| Peak Throughput | 1,000,000 RPS |
| Total Active Users | 100 Million |
| Average Payload Size | 200 Bytes |
| Daily Data Volume | ~17 TB (Logs/Counters) |
| P99 Decision Latency | 5 ms |
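The daily volume row follows directly from the first three: at 1,000,000 RPS with roughly 200 bytes of logged metadata per decision, the system writes about 200 MB/s, and 200 MB/s × 86,400 seconds ≈ 17.3 TB per day, which is where the ~17 TB figure comes from.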
High-Level Architecture
A Staff+ engineer avoids "gold-plating." We start with a hybrid approach: local in-memory caching for speed and a centralized Redis cluster for global state. This mirrors the architecture used by Stripe to handle bursty traffic while maintaining global limits.
In this design, we use a "Sidecar" pattern. The rate-limiting logic lives as a small process alongside the application code or within the Service Mesh (like Envoy). This minimizes network hops. The local in-memory (L1) cache also allows us to "fail open": if the global Redis is unreachable, we fall back to local estimation to keep the system alive.
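As a rough sketch of that layering, the sidecar's hot path might look like the following. The `localCounter` type, the `checkLimit` helper, and the caller-supplied `globalAllow` (standing in for the Redis-backed check developed in the next section) are illustrative assumptions, and the local estimate is deliberately simplified.

```go
package ratelimit

import (
	"context"
	"sync"
)

// localCounter is a hypothetical in-process estimate kept by the sidecar.
// It is deliberately simplified: window reset and eviction are omitted.
type localCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newLocalCounter() *localCounter {
	return &localCounter{counts: make(map[string]int)}
}

func (lc *localCounter) incrAndGet(key string) int {
	lc.mu.Lock()
	defer lc.mu.Unlock()
	lc.counts[key]++
	return lc.counts[key]
}

// checkLimit consults the local estimate first, skipping the network hop for
// keys that are already far over budget, and relies on that same estimate
// ("fail open") when the global check errors out. globalAllow stands in for
// the Redis-backed check developed in the next section.
func checkLimit(ctx context.Context, lc *localCounter, key string, limit int,
	globalAllow func(context.Context, string) (bool, error)) bool {

	local := lc.incrAndGet(key)
	if local > limit*2 {
		return false // clearly over budget locally; no need to ask Redis
	}

	allowed, err := globalAllow(ctx, key)
	if err != nil {
		return local <= limit // global state unreachable: trust the local estimate
	}
	return allowed
}
```

The point of the sketch is the ordering: the global store is the source of truth, but the local estimate is the only thing the sidecar trusts when that store is unreachable.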
Detailed Design: The Sliding Window Algorithm
The core of a production-grade rate limiter is the Sliding Window Log or Sliding Window Counter. Fixed windows suffer from "boundary bursts": a client capped at 100 requests per minute can send 100 requests at 0:59 and another 100 at 1:01, pushing double the allowed traffic through in a couple of seconds. Token Buckets, meanwhile, can be complex to synchronize across nodes.
Here is a Go implementation of a Sliding Window Log that uses a Redis Lua script to guarantee atomicity and keep each check to a single network round-trip.
```go
package ratelimit

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"github.com/redis/go-redis/v9"
)

// RateLimiter checks requests against a Redis-backed sliding window log.
type RateLimiter struct {
	client *redis.Client
}

// Allow reports whether a request for the given key is permitted within the window.
func (rl *RateLimiter) Allow(ctx context.Context, key string, limit int, window time.Duration) (bool, error) {
	now := time.Now().UnixMilli()
	windowStart := now - window.Milliseconds()

	// A unique member avoids collisions when two requests share the same
	// millisecond; the timestamp remains the sorted-set score used for trimming.
	member := fmt.Sprintf("%d-%d", now, rand.Int63())

	// Lua script, executed atomically on the Redis node:
	//   1. Remove entries that have fallen outside the current window.
	//   2. Count the remaining entries.
	//   3. If count < limit, record this request and refresh the key's TTL.
	script := `
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, ARGV[1])
local count = redis.call('ZCARD', KEYS[1])
if count < tonumber(ARGV[2]) then
    redis.call('ZADD', KEYS[1], ARGV[3], ARGV[4])
    redis.call('PEXPIRE', KEYS[1], ARGV[5])
    return 1
else
    return 0
end`

	result, err := rl.client.Eval(ctx, script, []string{key},
		windowStart, limit, now, member, window.Milliseconds()).Int()
	if err != nil {
		return true, err // Fail open on Redis error rather than blocking traffic
	}
	return result == 1, nil
}
```

By using ZREMRANGEBYSCORE and ZCARD, we maintain a precise log of every request within the window, and the PEXPIRE call lets idle keys age out on their own. While this approach is memory-intensive, it provides the highest level of accuracy for critical financial APIs.
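A minimal usage sketch, placed alongside the limiter above; the key prefix and the 100-requests-per-minute policy are illustrative, not part of the design itself:

```go
package ratelimit

import (
	"context"
	"fmt"
	"log"
	"time"
)

// handleRequest shows how the limiter above might gate an incoming call.
// The key prefix and the 100-requests-per-minute policy are illustrative.
func handleRequest(ctx context.Context, rl *RateLimiter, userID string) error {
	allowed, err := rl.Allow(ctx, "rl:user:"+userID, 100, time.Minute)
	if err != nil {
		// Allow has already failed open; record the degradation and keep serving.
		log.Printf("rate limiter degraded: %v", err)
	}
	if !allowed {
		return fmt.Errorf("rate limit exceeded for user %s", userID)
	}
	return nil // admit the request and hand off to the real handler
}
```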
Database Schema and Persistence
For a rate limiter, the "database" is often a combination of an in-memory store for counters and a persistent store for configuration and audit logs.
The SQL schema for the control plane would involve heavy indexing on tenant_id and api_key. To handle global distribution, we use a globally replicated database (like CockroachDB or Amazon Aurora Global Database) for policies, while the high-volume USAGE_LOG table is sharded by user_id across regional ClickHouse clusters for high-throughput analytical writes.
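As an illustration of that control-plane shape, here is a sketch of the policy table embedded as a Go migration constant; the table and column names are assumptions, not a prescribed layout:

```go
package ratelimit

// Illustrative control-plane DDL, embedded as a migration constant. Table and
// column names are assumptions for the sketch, not a prescribed layout.
const createPolicyTable = `
CREATE TABLE rate_limit_policies (
    id           BIGSERIAL    PRIMARY KEY,
    tenant_id    UUID         NOT NULL,
    api_key      TEXT         NOT NULL,
    limit_count  INTEGER      NOT NULL,
    window_ms    INTEGER      NOT NULL DEFAULT 60000,
    updated_at   TIMESTAMPTZ  NOT NULL DEFAULT now()
);

-- Every rate-limit decision resolves (tenant_id, api_key) to a policy, so the
-- hot lookup path gets a composite unique index.
CREATE UNIQUE INDEX idx_policies_tenant_key
    ON rate_limit_policies (tenant_id, api_key);
`
```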
Scaling Strategy
Scaling from 1,000 to 1,000,000 users requires moving from a centralized model to a decentralized, sharded model. Staff+ engineers look for "shared-nothing" architectures where possible.
To scale Redis, we use Consistent Hashing to distribute keys across multiple shards. This prevents "hot keys" where one popular user's traffic overwhelms a single Redis node. We also implement "Batching": instead of incrementing Redis for every request, a node might collect 10 requests and perform a single INCRBY 10 to reduce IOPS.
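For the counter-style variants (fixed or sliding window counters, rather than the sorted-set log above), the batching idea can be sketched as a small in-process flusher; the flush interval and key layout here are assumptions:

```go
package ratelimit

import (
	"context"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
)

// batcher accumulates per-key increments in memory and flushes them to Redis
// as a single INCRBY per key, trading a little accuracy for far fewer IOPS.
type batcher struct {
	mu      sync.Mutex
	pending map[string]int64
	client  *redis.Client
}

func newBatcher(client *redis.Client) *batcher {
	return &batcher{pending: make(map[string]int64), client: client}
}

// Incr records one request locally; nothing touches the network here.
func (b *batcher) Incr(key string) {
	b.mu.Lock()
	b.pending[key]++
	b.mu.Unlock()
}

// flushLoop drains the pending map on every tick (e.g. every 50ms) and issues
// one INCRBY n per key instead of n separate INCR calls.
func (b *batcher) flushLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			b.mu.Lock()
			batch := b.pending
			b.pending = make(map[string]int64)
			b.mu.Unlock()
			for key, n := range batch {
				if err := b.client.IncrBy(ctx, key, n).Err(); err != nil {
					continue // fail open: drop this batch rather than block traffic
				}
			}
		}
	}
}
```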
Failure Modes and Resiliency
The hallmark of Staff-level design is the "Failure Mode and Effects Analysis" (FMEA). We assume every component will fail.
If the Redis cluster experiences a partition, the system enters a Fail-Open state. It is better to allow a few extra requests through than to block all legitimate traffic for all customers. We use a Circuit Breaker (like Hystrix or Resilience4j) to wrap the Redis calls. If the error rate exceeds 5%, the breaker trips, and we rely solely on local in-memory counters until the backend stabilizes.
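A minimal sketch of that policy in Go, using sony/gobreaker as a stand-in for Hystrix/Resilience4j; the 20-call minimum sample, the 10-second open timeout, and the caller-supplied localAllow fallback (for example, the local counter sketched in the architecture section) are illustrative assumptions:

```go
package ratelimit

import (
	"context"
	"time"

	"github.com/sony/gobreaker"
)

// newRedisBreaker trips once at least 20 calls have been observed and more
// than 5% of them failed, then stays open for 10 seconds before probing again.
// The sample size and open timeout are illustrative, not prescribed values.
func newRedisBreaker() *gobreaker.CircuitBreaker {
	return gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    "redis-rate-limiter",
		Timeout: 10 * time.Second,
		ReadyToTrip: func(c gobreaker.Counts) bool {
			return c.Requests >= 20 && float64(c.TotalFailures)/float64(c.Requests) > 0.05
		},
	})
}

// allowWithBreaker routes the Redis-backed check through the breaker and falls
// back to localAllow, a caller-supplied in-memory estimate, whenever the
// breaker is open or the global check returns an error.
func allowWithBreaker(ctx context.Context, cb *gobreaker.CircuitBreaker, rl *RateLimiter,
	key string, limit int, window time.Duration, localAllow func(string) bool) bool {

	res, err := cb.Execute(func() (interface{}, error) {
		ok, err := rl.Allow(ctx, key, limit, window)
		if err != nil {
			return nil, err // Redis failures count toward the breaker's error rate
		}
		return ok, nil
	})
	if err != nil {
		return localAllow(key) // breaker open or Redis down: local counters only
	}
	return res.(bool)
}
```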
Conclusion
Designing a system at the Staff+ level is an exercise in balance. We chose a Sliding Window for accuracy but mitigated its memory cost through Redis TTLs. We chose a centralized Redis for consistency but mitigated its latency through L1 caching and sidecar deployments.
The key takeaways for any distributed system design are:
- Prioritize Availability: In global systems, "failing open" is usually the correct business decision.
- Optimize the Hot Path: Keep the request-response path as lean as possible; move logging and heavy computation to asynchronous background tasks.
- Design for Observability: A system you cannot monitor is a system you cannot manage. Every rate-limit decision should emit a metric.
By focusing on these patterns, you build systems that don't just work today but are resilient enough to handle the unknown demands of tomorrow.
References
- https://research.google/pubs/pub46310/
- https://stripe.com/blog/rate-limiters
- https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_features/global_rate_limiting
- https://aws.amazon.com/builders-library/throttling-distributed-systems/