LLM Inference at Scale (ChatGPT-Style Architecture)
Building a production-grade system for Large Language Model (LLM) inference at scale represents a fundamental shift in distributed systems design. Unlike traditional microservices at companies like Uber or Netflix, where bottlenecks are typically I/O bound or database-constrained, LLM architectures are dominated by extreme compute density and memory bandwidth limitations. We are no longer just moving data; we are orchestrating massive matrix multiplications across distributed GPU clusters while maintaining the illusion of a seamless, instantaneous conversation.
In a "ChatGPT-style" architecture, the challenge is twofold: achieving low Time to First Token (TTFT) for a responsive UI and maintaining high throughput (Tokens Per Second) to handle millions of concurrent users. This requires a departure from standard stateless request-response patterns. Instead, we must manage long-lived stateful connections, optimize KV (Key-Value) cache reuse, and implement sophisticated batching strategies that would be overkill for a standard CRUD application.
Designing this system requires a deep understanding of the hardware-software interface. We also revisit the CAP theorem in a new context: the inference path favors availability and partition tolerance, so any healthy replica can serve a request, while conversation history replicated across global regions settles for eventual consistency.
Requirements
To design a system capable of supporting 100M+ monthly active users, we must define strict boundaries for our functional and non-functional requirements.
Capacity Estimation
| Metric | Value |
|---|---|
| Daily Active Users (DAU) | 10 Million |
| Average Requests per User/Day | 5 |
| Average Tokens per Request | 500 |
| Total Tokens per Day | 25 Billion |
| Average Requests per Second (RPS) | ~580 |
| Peak Requests per Second (RPS) | ~2,000 (assuming a 3-4x peak factor) |
| VRAM per 70B Model Instance | ~140 GB (FP16) |
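These figures follow from straightforward arithmetic. The sketch below reproduces them; the ~3.5x peak-to-average factor is an assumption, not a measured value.

```python
# Back-of-envelope capacity check (peak factor is an assumed value).
dau = 10_000_000            # daily active users
requests_per_user = 5
tokens_per_request = 500

daily_requests = dau * requests_per_user            # 50,000,000
daily_tokens = daily_requests * tokens_per_request  # 25,000,000,000 (25B)
avg_rps = daily_requests / 86_400                   # ~580
peak_rps = avg_rps * 3.5                            # ~2,000 with an assumed 3-4x peak factor

print(f"tokens/day={daily_tokens:,} avg_rps={avg_rps:.0f} peak_rps={peak_rps:.0f}")
```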
High-Level Architecture
The architecture is divided into the Orchestration Layer and the Inference Fleet. The Orchestration Layer handles identity, rate limiting, and session state (similar to Stripe’s API gateway), while the Inference Fleet manages the GPU-intensive workloads.
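As a minimal sketch of this split, the gateway below performs the orchestration-layer duties (identity, rate limiting, session state) and only then hands the request to the Inference Fleet. The `rate_limiter` and `session_store` interfaces are assumptions for illustration, not a prescribed API.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ChatRequest:
    user_id: str
    conversation_id: str
    prompt: str

class OrchestrationGateway:
    """Sketch of the Orchestration Layer; the Inference Fleet is reached only via the queue."""

    def __init__(self, rate_limiter, session_store, orchestrator_queue: asyncio.Queue):
        self.rate_limiter = rate_limiter            # assumed interface, e.g. a Redis token bucket
        self.session_store = session_store          # assumed interface for conversation history
        self.orchestrator_queue = orchestrator_queue

    async def handle(self, request: ChatRequest) -> None:
        # Identity and quota checks stay in the orchestration layer.
        if not await self.rate_limiter.allow(request.user_id):
            raise PermissionError("rate limit exceeded")
        # Persist the user turn before inference so history survives a fleet failure.
        await self.session_store.append(request.conversation_id, "user", request.prompt)
        # Hand off to the GPU-backed Inference Fleet.
        await self.orchestrator_queue.put(request)
```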
Detailed Design: The Inference Orchestrator
The core of the system is the Inference Orchestrator. Unlike a simple load balancer, it must be "cache-aware." If a user is in the middle of a conversation, their previous "KV Cache" (the mathematical state of the conversation) might reside on a specific GPU's memory. Routing the next prompt to the same GPU significantly reduces computation time.
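A minimal sketch of cache-aware routing, assuming a simple in-memory map from conversation ID to the GPU node that holds its KV cache (class and field names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class GPUNode:
    node_id: str
    active_requests: int = 0

class CacheAwareRouter:
    """Prefer the node that already holds a conversation's KV cache; otherwise pick the least-loaded node."""

    def __init__(self, nodes: List[GPUNode]):
        self.nodes = {n.node_id: n for n in nodes}
        self.cache_location: Dict[str, str] = {}  # conversation_id -> node_id

    def route(self, conversation_id: str) -> GPUNode:
        # 1. Sticky routing: reuse the node holding the KV cache, if it is still alive.
        node_id = self.cache_location.get(conversation_id)
        if node_id in self.nodes:
            return self.nodes[node_id]
        # 2. Cache miss (new conversation or evicted cache): pick the least-loaded node.
        node = min(self.nodes.values(), key=lambda n: n.active_requests)
        self.cache_location[conversation_id] = node.node_id
        return node
```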
We implement "Continuous Batching" to maximize GPU utilization. Instead of waiting for a batch to complete, we insert new requests into the batch as soon as an existing request finishes a token generation step.
```python
import asyncio
from dataclasses import dataclass
from typing import List

@dataclass
class InferenceRequest:
    request_id: str
    prompt: str
    max_tokens: int
    context_id: str

class ContinuousBatcher:
    def __init__(self, batch_size: int = 16):
        self.batch_size = batch_size
        self.active_requests: List[InferenceRequest] = []
        self.queue = asyncio.Queue()

    async def add_request(self, request: InferenceRequest):
        await self.queue.put(request)

    async def run_loop(self):
        while True:
            # Fill free batch slots with waiting requests
            while len(self.active_requests) < self.batch_size and not self.queue.empty():
                new_req = await self.queue.get()
                self.active_requests.append(new_req)

            if not self.active_requests:
                await asyncio.sleep(0.01)
                continue

            # Simulate one step of the GPU forward pass (one token per active request)
            await self.generate_token_step(self.active_requests)

            # Remove completed requests so their slots free up for the next iteration
            self.active_requests = [r for r in self.active_requests if not self.is_finished(r)]

    async def generate_token_step(self, requests: List[InferenceRequest]):
        # Logic for interacting with CUDA kernels via PyTorch/vLLM
        pass

    def is_finished(self, request: InferenceRequest) -> bool:
        # Check against stop sequences or max_tokens
        return False
```

Database Schema
We use a hybrid approach: PostgreSQL for relational metadata and conversation structures, and a specialized Vector Database (like Pinecone or Milvus) for Retrieval-Augmented Generation (RAG).
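A minimal sketch of the RAG read path, where `embed` and `vector_index.search` stand in for whichever embedding model and vector database is deployed (both are assumed interfaces, not a specific client API):

```python
from typing import List

def build_prompt(question: str, vector_index, embed, top_k: int = 3) -> str:
    """Assemble a grounded prompt; `embed` and `vector_index` are hypothetical stand-ins."""
    query_vector: List[float] = embed(question)               # embed the user question
    chunks = vector_index.search(query_vector, top_k=top_k)   # nearest-neighbor document chunks
    context = "\n".join(chunk.text for chunk in chunks)
    # Retrieved context is prepended so the model grounds its answer in it.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```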
To handle billions of messages, we partition the MESSAGE table by created_at (range partitioning) and sub-partition by user_id (hash partitioning). This ensures that queries for a specific user's chat history remain fast even as the global dataset grows.
```sql
CREATE TABLE messages (
    id UUID NOT NULL,
    conversation_id UUID NOT NULL,
    user_id UUID NOT NULL,
    content TEXT,
    role VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Each range partition can itself be declared PARTITION BY HASH (user_id);
-- PostgreSQL then requires user_id to be part of the primary key as well.

CREATE INDEX idx_messages_user_conv ON messages (user_id, conversation_id);
```
Scaling Strategy
Scaling from 1,000 to 1,000,000+ users requires moving from vertical scaling (bigger GPUs) to horizontal scaling: model parallelism (distributing one model across multiple GPUs) combined with data parallelism (running multiple copies of the model), as sketched after the table below.
| Scale | Strategy | Bottleneck |
|---|---|---|
| 1k Users | Single Node (vLLM) | VRAM Capacity |
| 100k Users | Multi-node Cluster + Redis | Orchestration Latency |
| 1M+ Users | Multi-region + Global LB | KV Cache Locality |
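At the 1M+ tier, each replica is itself sharded across several GPUs (tensor parallelism) while many such replicas run side by side (data parallelism). Below is a minimal sketch using vLLM's offline API; the model name and parallel degree are illustrative, and production traffic would flow through the orchestrator rather than this synchronous call.

```python
from vllm import LLM, SamplingParams

# One data-parallel replica: a 70B-class model sharded across 4 GPUs (tensor parallelism).
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, max_tokens=256)

# Each replica serves its own slice of traffic; adding replicas scales throughput.
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```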
Failure Modes and Resilience
In a distributed LLM system, "failure" isn't always a 500 error. It can be a "hallucination," a timeout due to GPU queue backup, or a CUDA Out-of-Memory (OOM) error. We implement a Circuit Breaker pattern at the Orchestrator level to prevent cascading failures.
- GPU OOM Protection: The orchestrator monitors VRAM usage. If a request exceeds the available context window, the system must gracefully truncate the conversation or offload the KV cache to host memory (CPU), similar to how operating systems use swap space.
- Request Hedging: For premium users, we can send the same request to two different GPU clusters and take the first token response to minimize tail latency (P99).
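As a concrete illustration of request hedging, the sketch below fires the same prompt at several (assumed) cluster client callables and cancels the slower attempts once the first response arrives:

```python
import asyncio

async def hedged_generate(prompt: str, clusters, timeout: float = 30.0) -> str:
    """Send the same prompt to every cluster client and return the first result.

    `clusters` is a list of async callables (hypothetical client stubs), each
    returning the generated text for the prompt.
    """
    tasks = [asyncio.create_task(cluster(prompt)) for cluster in clusters]
    done, pending = await asyncio.wait(tasks, timeout=timeout,
                                       return_when=asyncio.FIRST_COMPLETED)
    # Cancel the slower replicas so they stop consuming GPU time.
    for task in pending:
        task.cancel()
    if not done:
        raise TimeoutError("all hedged requests timed out")
    return done.pop().result()  # re-raises if the fastest attempt failed
```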
Conclusion
Scaling LLM inference is as much about memory management as it is about networking. By implementing cache-aware routing, continuous batching, and tiered storage for conversation history, we can build a system that feels instantaneous to the user while remaining cost-effective for the provider. The key tradeoff remains the balance between batch size (throughput) and latency. Larger batches improve GPU efficiency but increase the time each individual user waits for their token. As the field evolves, the integration of hardware-aware scheduling will become the gold standard for high-performance AI applications.
References
- https://arxiv.org/abs/2309.06180 (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention)
- https://www.uber.com/blog/real-time-machine-learning-at-scale/
- https://stripe.com/blog/how-we-built-it-the-stripe-api
- https://jalammar.github.io/illustrated-gpt2/