How Senior Engineers Think About System Design

System design is often misunderstood as the art of drawing boxes and arrows on a whiteboard. However, for staff and senior engineers, the visual diagram is merely the byproduct of a much deeper cognitive process: the rigorous evaluation of trade-offs under specific constraints. When we design systems at scale—whether it is a payment gateway for Stripe or a streaming backbone for Netflix—we are not looking for the "best" solution, but rather the "least problematic" one for the current and future state of the business.

The hallmark of senior-level design is the transition from "how do I make this work?" to "how will this fail, and can we live with that failure?" This mindset requires balancing the CAP theorem (Consistency, Availability, Partition Tolerance) against product requirements. For instance, a social media feed can tolerate eventual consistency (AP), but a double-entry ledger for a financial system must prioritize strong consistency (CP). In this post, we will explore this mental framework by designing a Distributed Task Scheduler capable of handling millions of jobs with high reliability.

Requirements and Constraints

A distributed task scheduler must manage the lifecycle of asynchronous jobs. Senior engineers begin by defining the "blast radius" of the system. If the scheduler fails, does the entire platform halt, or do background tasks simply delay? We must define functional requirements (what it does) and non-functional requirements (how it performs).

To ground the design, we perform a capacity estimation. This ensures the chosen data stores can handle the projected throughput without hitting I/O bottlenecks.

| Metric | Value |
| --- | --- |
| Daily Job Volume | 100,000,000 |
| Average TPS (Write) | ~1,200 |
| Peak TPS (Write) | ~5,000 |
| Average Payload Size | 2 KB |
| Daily Storage Growth | 200 GB |
| Retention Period | 30 Days (6 TB total) |
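
These figures fall out of simple back-of-the-envelope arithmetic: 100,000,000 jobs spread over 86,400 seconds is roughly 1,160 writes per second on average, and we budget peaks at roughly four times that. At 2 KB per payload, 100,000,000 jobs add about 200 GB of new data per day, so a 30-day retention window puts the working set at roughly 6 TB.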

High-Level Architecture

Senior engineers think in layers. We separate the "Ingestion Layer" from the "Execution Layer" to prevent a spike in job submissions from crashing the workers actually performing the tasks. This is a pattern used extensively by Uber in its Cadence orchestration engine (later forked by its creators into Temporal).

In this architecture, we use a relational database for the "Source of Truth" because job metadata requires strict ACID properties. We use a message broker (like Kafka or RabbitMQ) as a buffer to decouple scheduling from execution.
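
To make the layering concrete, here is a minimal sketch of the ingestion path. The Ingestor type, the Publisher interface, and the column names are illustrative assumptions rather than part of any specific framework; whichever broker client is chosen (Kafka, RabbitMQ, or anything else) hides behind Publisher.

```go
import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// Publisher abstracts the broker client so the ingestion layer does not care
// which broker sits behind it.
type Publisher interface {
	Publish(ctx context.Context, key, value []byte) error
}

type Ingestor struct {
	db     *sql.DB
	broker Publisher
}

// SubmitJob writes the job to the relational source of truth first, then enqueues
// a pointer to it. If the publish fails, the row is still 'pending' and a poller
// can re-enqueue it later, so the broker never has to be authoritative.
func (i *Ingestor) SubmitJob(ctx context.Context, id string, userID int64, payload []byte, runAt time.Time) error {
	_, err := i.db.ExecContext(ctx,
		`INSERT INTO jobs (id, user_id, payload, status, scheduled_at)
		 VALUES ($1, $2, $3, 'pending', $4)`,
		id, userID, payload, runAt) // payload is assumed to be JSON-encoded
	if err != nil {
		return fmt.Errorf("persisting job %s: %w", id, err)
	}
	return i.broker.Publish(ctx, []byte(id), payload)
}
```

Writing the row before publishing keeps the database authoritative: a failed publish leaves a pending row that the scheduler's poller can pick up and re-enqueue later.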

Detailed Design: The Idempotency and Lease Pattern

A common pitfall in distributed systems is the "Double Execution" problem. Network partitions are inevitable. A worker might finish a task but fail to send an acknowledgment. To solve this, senior engineers implement Idempotency Keys and Lease Mechanisms.

The following Go snippet demonstrates how a worker "claims" a job using a conditional update (Optimistic Locking). This ensures that even if multiple workers pick up the same job ID from the broker, only one can proceed.

```go
func (w *Worker) ClaimAndExecute(ctx context.Context, jobID string) error {
	// Attempt to acquire a lease with a TTL via a conditional update (optimistic locking).
	// claimQuery: UPDATE jobs SET status='running', worker_id=$1, version = version + 1,
	//                             lease_expires = NOW() + INTERVAL '2 minutes'
	//             WHERE id=$2 AND (status='pending' OR (status='running' AND lease_expires < NOW()))
	res, err := w.db.ExecContext(ctx, claimQuery, w.ID, jobID)
	if err != nil {
		return fmt.Errorf("claiming job %s: %w", jobID, err)
	}
	// Zero rows affected means another worker already holds a valid lease.
	if rows, _ := res.RowsAffected(); rows == 0 {
		return fmt.Errorf("failed to acquire lease for job %s", jobID)
	}

	// Execute the actual task logic
	result := w.taskRegistry[jobID].Run()

	// Finalize the job status
	return w.db.UpdateStatus(jobID, "completed", result)
}
```

This pattern shifts the complexity from the application logic to the database's atomicity guarantees. By using a version column or a lease_expires timestamp, we handle zombie workers that crash mid-execution.
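
For tasks that legitimately run longer than the lease TTL, a renewal loop keeps the lease alive only while the worker is healthy. The sketch below reuses the Worker type from the snippet above; renewQuery, the 30-second cadence, and the 2-minute TTL are illustrative assumptions.

```go
// keepLeaseAlive runs in a goroutine for the duration of the task; cancel its
// context when the task finishes so the lease can expire naturally.
func (w *Worker) keepLeaseAlive(ctx context.Context, jobID string) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // task finished or was cancelled
		case <-ticker.C:
			// renewQuery: UPDATE jobs SET lease_expires = NOW() + INTERVAL '2 minutes'
			//             WHERE id = $1 AND worker_id = $2 AND status = 'running'
			if _, err := w.db.ExecContext(ctx, renewQuery, jobID, w.ID); err != nil {
				return // stop renewing; once the lease lapses, another worker may claim the job
			}
		}
	}
}
```

If the worker crashes outright, renewals stop, the lease expires, and the conditional claim query above lets a healthy worker take over.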

Database Schema and Indexing Strategy

When dealing with 100M+ rows, the schema design must account for query patterns. We need to fetch "all pending jobs scheduled for now" and "all jobs for a specific user."

Partitioning Strategy: To maintain performance, we would partition the JOB table by scheduled_at (time-based partitioning). This allows us to drop old data efficiently (TTL) without expensive DELETE operations that cause vacuuming overhead in Postgres. We would also create a composite index on (status, scheduled_at) to optimize the poller that finds ready-to-run tasks.
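
A possible shape for the table, its indexes, and the poller query is sketched below as Go constants; the exact columns, types, and monthly partition bounds are illustrative assumptions rather than a prescribed schema.

```go
// Illustrative DDL for a time-partitioned jobs table in Postgres.
const createJobsTable = `
CREATE TABLE jobs (
    id            UUID        NOT NULL,
    user_id       BIGINT      NOT NULL,
    status        TEXT        NOT NULL DEFAULT 'pending',
    payload       JSONB,
    version       BIGINT      NOT NULL DEFAULT 0,
    worker_id     TEXT,
    lease_expires TIMESTAMPTZ,
    scheduled_at  TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (id, scheduled_at)           -- the partition key must be part of the PK
) PARTITION BY RANGE (scheduled_at);

CREATE INDEX jobs_poller_idx ON jobs (status, scheduled_at);   -- "what is due now?"
CREATE INDEX jobs_user_idx   ON jobs (user_id, scheduled_at);  -- "all jobs for a user"
`

// Retention becomes partition management: create the next range ahead of time and
// drop expired ranges instead of running bulk DELETEs.
const managePartitions = `
CREATE TABLE jobs_2025_01 PARTITION OF jobs
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
DROP TABLE IF EXISTS jobs_2024_12;
`

// The poller walks the composite index instead of scanning the table.
const pollDueJobsQuery = `
SELECT id FROM jobs
WHERE status = 'pending' AND scheduled_at <= NOW()
ORDER BY scheduled_at
LIMIT 100
`
```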

Scaling Strategy: From 1K to 1M+ Users

Scaling is not just about adding more servers; it is about identifying the next bottleneck. At 1,000 users, a single Postgres instance is sufficient. At 1,000,000 users, database connection limits and lock contention become the primary constraints.

At the 1M+ user mark, we move toward a Cell-based Architecture. Instead of one giant cluster, we deploy multiple independent "cells" of the scheduler. If one cell fails, it only affects a subset of users. This is the strategy used by AWS to ensure global resilience.
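
Routing must be deterministic so a given user always lands in the same cell. The helper below is a minimal sketch using FNV-1a and modulo, which is enough to show the idea; adding a cell reshuffles assignments, so real deployments often prefer consistent hashing or an explicit mapping table.

```go
import "hash/fnv"

// cellFor maps a stable identifier (here a user ID) to one of the deployed cells.
func cellFor(userID string, cells []string) string {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Hash32.Write never returns an error
	return cells[h.Sum32()%uint32(len(cells))]
}
```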

Failure Modes and Resilience

A senior engineer's job is to "design for failure." We assume the network is slow, the database will lag, and workers will run out of memory. We implement a State Machine to manage job transitions and handle retries with exponential backoff.
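
As a concrete illustration of the retry policy, here is a small sketch of the delay calculation with exponential growth and full jitter; the base delay, the cap, and the jitter strategy are illustrative choices, not requirements.

```go
import (
	"math/rand"
	"time"
)

// nextRetryDelay grows the wait exponentially and applies full jitter so that a
// burst of failures does not hammer the database in lockstep.
func nextRetryDelay(attempt int) time.Duration {
	const (
		baseDelay = 2 * time.Second
		maxDelay  = 10 * time.Minute
	)
	delay := baseDelay << uint(attempt) // 2s, 4s, 8s, ...
	if delay <= 0 || delay > maxDelay { // guard against overflow and runaway waits
		delay = maxDelay
	}
	return time.Duration(rand.Int63n(int64(delay))) // pick uniformly in [0, delay)
}
```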

Key Resilience Patterns:

  1. Circuit Breaker: If the downstream service a worker calls is down, stop trying to send jobs to it to prevent cascading failure.
  2. Dead Letter Queue (DLQ): If a job fails $N$ times, move it to a DLQ for manual inspection rather than clogging the system.
  3. Backpressure: If the workers are overwhelmed, the API should return 429 Too Many Requests to the clients rather than accepting more work (a minimal sketch follows this list).
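
The backpressure sketch referenced above: a counting semaphore around the submission handler that sheds load once the in-flight budget (an assumed tuning knob) is exhausted.

```go
import "net/http"

// withBackpressure rejects new submissions with 429 once maxInFlight requests
// are already being processed, instead of buffering unbounded work.
func withBackpressure(maxInFlight int, next http.Handler) http.Handler {
	slots := make(chan struct{}, maxInFlight) // counting semaphore
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}: // acquired a slot
			defer func() { <-slots }()
			next.ServeHTTP(w, r)
		default: // saturated: tell the client to back off
			http.Error(w, "scheduler is at capacity, retry later", http.StatusTooManyRequests)
		}
	})
}
```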

Conclusion

Designing a system at the staff level is an exercise in empathy—empathy for the developers who will maintain it, and empathy for the operators who will wake up when it breaks. We choose Postgres for its reliability, Kafka for its durability, and Go for its concurrency primitives.

The transition from a mid-level to a senior engineer is marked by the realization that there is no perfect architecture. Every decision—whether it is choosing between strong or eventual consistency, or selecting a sharding key—is a trade-off. By focusing on idempotency, clear failure states, and horizontal scalability, we build systems that don't just work today but can evolve for years to come.

References

https://engineering.fb.com/2021/04/05/developer-tools/scaling-mercurial-at-facebook/
https://www.uber.com/blog/cadence-open-source-workflow-engine/
https://netflixtechblog.com/fault-tolerance-in-a-high-volume-distributed-system-91ab4fa2b46b
https://stripe.com/blog/idempotency/