LLM Inference at Scale (ChatGPT-Style Architecture)
Building a production-grade system for Large Language Model (LLM) inference at scale represents a fundamental shift in distributed systems design. Unlike traditional microservices at companies like Uber or Netflix, where bottlenecks are typically I/O bound or database-constrained, LLM architectures are dominated by extreme compute density and memory bandwidth limitations. We are no longer just moving data; we are orchestrating massive matrix multiplications across distributed GPU clusters while maintaining the illusion of a seamless, instantaneous conversation.
In a "ChatGPT-style" architecture, the challenge is twofold: achieving low Time to First Token (TTFT) for a responsive UI and maintaining high throughput (Tokens Per Second) to handle millions of concurrent users. This requires a departure from standard stateless request-response patterns. Instead, we must manage long-lived stateful connections, optimize KV (Key-Value) cache reuse, and implement sophisticated batching strategies that would be overkill for a standard CRUD application.
Designing this system requires a deep understanding of the hardware-software interface. We also revisit the CAP theorem in a new context: the inference path favors availability and partition tolerance, so any healthy replica can serve a request, while conversation history replicated across global regions settles for eventual consistency.
Requirements
To design a system capable of supporting 100M+ monthly active users, we must define strict boundaries for our functional and non-functional requirements.
Capacity Estimation
| Metric | Value |
|---|---|
| Daily Active Users (DAU) | 10 Million |
| Average Requests per User/Day | 5 |
| Average Tokens per Request | 500 |
| Total Tokens per Day | 25 Billion |
| Average Requests per Second (RPS) | ~580 |
| Peak Requests per Second (RPS) | ~2,000 (assuming a 3-4x peak factor) |
| VRAM per 70B Model Instance | ~140 GB (FP16) |
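These figures follow from straightforward arithmetic. The sketch below reproduces them; the ~3.5x peak-to-average factor is an assumption, not a measured value.

```python
# Back-of-envelope capacity check (peak factor is an assumed value).
dau = 10_000_000            # daily active users
requests_per_user = 5
tokens_per_request = 500

daily_requests = dau * requests_per_user            # 50,000,000
daily_tokens = daily_requests * tokens_per_request  # 25,000,000,000 (25B)
avg_rps = daily_requests / 86_400                   # ~580
peak_rps = avg_rps * 3.5                            # ~2,000 with an assumed 3-4x peak factor

print(f"tokens/day={daily_tokens:,} avg_rps={avg_rps:.0f} peak_rps={peak_rps:.0f}")
```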
High-Level Architecture
The architecture is divided into the Orchestration Layer and the Inference Fleet. The Orchestration Layer handles identity, rate limiting, and session state (similar to Stripe’s API gateway), while the Inference Fleet manages the GPU-intensive workloads.
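As a minimal sketch of this split, the gateway below performs the orchestration-layer duties (identity, rate limiting, session state) and only then hands the request to the Inference Fleet. The `rate_limiter` and `session_store` interfaces are assumptions for illustration, not a prescribed API.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ChatRequest:
    user_id: str
    conversation_id: str
    prompt: str

class OrchestrationGateway:
    """Sketch of the Orchestration Layer; the Inference Fleet is reached only via the queue."""

    def __init__(self, rate_limiter, session_store, orchestrator_queue: asyncio.Queue):
        self.rate_limiter = rate_limiter            # assumed interface, e.g. a Redis token bucket
        self.session_store = session_store          # assumed interface for conversation history
        self.orchestrator_queue = orchestrator_queue

    async def handle(self, request: ChatRequest) -> None:
        # Identity and quota checks stay in the orchestration layer.
        if not await self.rate_limiter.allow(request.user_id):
            raise PermissionError("rate limit exceeded")
        # Persist the user turn before inference so history survives a fleet failure.
        await self.session_store.append(request.conversation_id, "user", request.prompt)
        # Hand off to the GPU-backed Inference Fleet.
        await self.orchestrator_queue.put(request)
```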
Detailed Design: The Inference Orchestrator
The core of the system is the Inference Orchestrator. Unlike a simple load balancer, it must be "cache-aware." If a user is in the middle of a conversation, their previous "KV Cache" (the mathematical state of the conversation) might reside on a specific GPU's memory. Routing the next prompt to the same GPU significantly reduces computation time.
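A minimal sketch of cache-aware routing, assuming a simple in-memory map from conversation ID to the GPU node that holds its KV cache (class and field names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class GPUNode:
    node_id: str
    active_requests: int = 0

class CacheAwareRouter:
    """Prefer the node that already holds a conversation's KV cache; otherwise pick the least-loaded node."""

    def __init__(self, nodes: List[GPUNode]):
        self.nodes = {n.node_id: n for n in nodes}
        self.cache_location: Dict[str, str] = {}  # conversation_id -> node_id

    def route(self, conversation_id: str) -> GPUNode:
        # 1. Sticky routing: reuse the node holding the KV cache, if it is still alive.
        node_id = self.cache_location.get(conversation_id)
        if node_id in self.nodes:
            return self.nodes[node_id]
        # 2. Cache miss (new conversation or evicted cache): pick the least-loaded node.
        node = min(self.nodes.values(), key=lambda n: n.active_requests)
        self.cache_location[conversation_id] = node.node_id
        return node
```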
We implement "Continuous Batching" to maximize GPU utilization. Instead of waiting for a batch to complete, we insert new requests into the batch as soon as an existing request finishes a token generation step.
```python
import asyncio
from dataclasses import dataclass
from typing import List

@dataclass
class InferenceRequest:
    request_id: str
    prompt: str
    max_tokens: int
    context_id: str

class ContinuousBatcher:
    def __init__(self, batch_size: int = 16):
        self.batch_size = batch_size
        self.active_requests: List[InferenceRequest] = []
        self.queue = asyncio.Queue()

    async def add_request(self, request: InferenceRequest):
        await self.queue.put(request)

    async def run_loop(self):
        while True:
            # Fill free batch slots with waiting requests
            while len(self.active_requests) < self.batch_size and not self.queue.empty():
                new_req = await self.queue.get()
                self.active_requests.append(new_req)

            if not self.active_requests:
                await asyncio.sleep(0.01)
                continue

            # Simulate one step of the GPU forward pass (one token per active request)
            await self.generate_token_step(self.active_requests)

            # Remove completed requests so their slots free up for the next iteration
            self.active_requests = [r for r in self.active_requests if not self.is_finished(r)]

    async def generate_token_step(self, requests: List[InferenceRequest]):
        # Logic for interacting with CUDA kernels via PyTorch/vLLM
        pass

    def is_finished(self, request: InferenceRequest) -> bool:
        # Check against stop sequences or max_tokens
        return False
```

Database Schema
We use a hybrid approach: PostgreSQL for relational metadata and conversation structures, and a specialized Vector Database (like Pinecone or Milvus) for Retrieval-Augmented Generation (RAG).
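A minimal sketch of the RAG read path, where `embed` and `vector_index.search` stand in for whichever embedding model and vector database is deployed (both are assumed interfaces, not a specific client API):

```python
from typing import List

def build_prompt(question: str, vector_index, embed, top_k: int = 3) -> str:
    """Assemble a grounded prompt; `embed` and `vector_index` are hypothetical stand-ins."""
    query_vector: List[float] = embed(question)               # embed the user question
    chunks = vector_index.search(query_vector, top_k=top_k)   # nearest-neighbor document chunks
    context = "\n".join(chunk.text for chunk in chunks)
    # Retrieved context is prepended so the model grounds its answer in it.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```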
To handle billions of messages, we partition the MESSAGE table by created_at (range partitioning) and sub-partition by user_id (hash partitioning). This ensures that queries for a specific user's chat history remain fast even as the global dataset grows.
```sql
CREATE TABLE messages (
    id UUID NOT NULL,
    conversation_id UUID NOT NULL,
    user_id UUID NOT NULL,
    content TEXT,
    role VARCHAR(20),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Each range partition can itself be declared PARTITION BY HASH (user_id);
-- PostgreSQL then requires user_id to be part of the primary key as well.

CREATE INDEX idx_messages_user_conv ON messages (user_id, conversation_id);
```
Scaling Strategy
Scaling from 1,000 to 1,000,000+ users requires moving from vertical scaling (bigger GPUs) to horizontal scaling: model parallelism (distributing one model across multiple GPUs) combined with data parallelism (running multiple copies of the model), as sketched after the table below.
| Scale | Strategy | Bottleneck |
|---|---|---|
| 1k Users | Single Node (vLLM) | VRAM Capacity |
| 100k Users | Multi-node Cluster + Redis | Orchestration Latency |
| 1M+ Users | Multi-region + Global LB | KV Cache Locality |
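At the 1M+ tier, each replica is itself sharded across several GPUs (tensor parallelism) while many such replicas run side by side (data parallelism). Below is a minimal sketch using vLLM's offline API; the model name and parallel degree are illustrative, and production traffic would flow through the orchestrator rather than this synchronous call.

```python
from vllm import LLM, SamplingParams

# One data-parallel replica: a 70B-class model sharded across 4 GPUs (tensor parallelism).
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, max_tokens=256)

# Each replica serves its own slice of traffic; adding replicas scales throughput.
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```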
Failure Modes and Resilience
In a distributed LLM system, "failure" isn't always a 500 error. It can be a "hallucination," a timeout due to GPU queue backup, or a CUDA Out-of-Memory (OOM) error. We implement a Circuit Breaker pattern at the Orchestrator level to prevent cascading failures.
- GPU OOM Protection: The orchestrator monitors VRAM usage. If a request exceeds the available context window, the system must gracefully truncate the conversation or offload the KV cache to host memory (CPU), similar to how operating systems use swap space.
- Request Hedging: For premium users, we can send the same request to two different GPU clusters and take the first token response to minimize tail latency (P99).
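As a concrete illustration of request hedging, the sketch below fires the same prompt at several (assumed) cluster client callables and cancels the slower attempts once the first response arrives:

```python
import asyncio

async def hedged_generate(prompt: str, clusters, timeout: float = 30.0) -> str:
    """Send the same prompt to every cluster client and return the first result.

    `clusters` is a list of async callables (hypothetical client stubs), each
    returning the generated text for the prompt.
    """
    tasks = [asyncio.create_task(cluster(prompt)) for cluster in clusters]
    done, pending = await asyncio.wait(tasks, timeout=timeout,
                                       return_when=asyncio.FIRST_COMPLETED)
    # Cancel the slower replicas so they stop consuming GPU time.
    for task in pending:
        task.cancel()
    if not done:
        raise TimeoutError("all hedged requests timed out")
    return done.pop().result()  # re-raises if the fastest attempt failed
```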
Conclusion
Scaling LLM inference is as much about memory management as it is about networking. By implementing cache-aware routing, continuous batching, and tiered storage for conversation history, we can build a system that feels instantaneous to the user while remaining cost-effective for the provider. The key tradeoff remains the balance between batch size (throughput) and latency. Larger batches improve GPU efficiency but increase the time each individual user waits for their token. As the field evolves, the integration of hardware-aware scheduling will become the gold standard for high-performance AI applications.
References
- https://arxiv.org/abs/2309.06180 (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention)
- https://www.uber.com/blog/real-time-machine-learning-at-scale/
- https://stripe.com/blog/how-we-built-it-the-stripe-api
- https://jalammar.github.io/illustrated-gpt2/