Designing an ML Pipeline for Production (Feature Store + Training + Serving)
In the evolution of a technology company, there is a distinct "Maturity Gap" between a data scientist training a model in a Jupyter notebook and a software engineer deploying a high-availability distributed system. For organizations like Uber, Netflix, and Stripe, the challenge isn't just the model's accuracy; it is the reliability of the data "plumbing" that feeds it. A production ML pipeline is less about the algorithms and more about the distributed systems design patterns that ensure data consistency, low latency, and scalability.
When we move from batch processing to real-time inference, the primary bottleneck shifts from compute to data orchestration. We must solve the "Training-Serving Skew": the phenomenon where the data used to train a model differs from the data available at the moment of prediction. To bridge this gap, we adopt a standardized architecture centered around a Feature Store, a Model Registry, and a scalable Inference Service. This post outlines the design of a system capable of handling thousands of requests per second at peak with sub-100ms latency.
Requirements
To design a production-grade ML pipeline, we must balance the needs of data scientists (who need historical data for training) and engineers (who need low-latency data for serving).
Capacity Estimation
Assuming a mid-sized fintech application (e.g., fraud detection):
| Metric | Estimated Value |
|---|---|
| Daily Active Users (DAU) | 10,000,000 |
| Predictions per User/Day | 5 |
| Total Daily Predictions | 50,000,000 |
| Average RPS | ~600 |
| Peak RPS (10x average) | 6,000 |
| Features per Request | 100 - 500 |
| Feature Store Payload | 2KB - 10KB |
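A quick back-of-the-envelope check of these numbers (a minimal sketch; the ~20 bytes per feature, covering value plus key overhead, is an assumption used to reconcile the payload range):

DAU = 10_000_000
PREDICTIONS_PER_USER = 5
SECONDS_PER_DAY = 86_400

daily_predictions = DAU * PREDICTIONS_PER_USER    # 50,000,000
avg_rps = daily_predictions / SECONDS_PER_DAY     # ~579, rounded to ~600
peak_rps = avg_rps * 10                           # ~6,000 with a 10x peak factor

# Payload: 100-500 features at ~20 bytes each (assumed value + key overhead)
payload_low_kb = 100 * 20 / 1024                  # ~2 KB
payload_high_kb = 500 * 20 / 1024                 # ~10 KB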
High-Level Architecture
The architecture is divided into three planes: the Data Plane (Feature Store), the Control Plane (Training & Registry), and the Inference Plane (Serving).
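Before diving into each plane, here is a minimal sketch of how a single request crosses all three. The interfaces and feature names are illustrative assumptions, not a prescribed API:

from typing import Dict, List, Protocol

class OnlineFeatureStore(Protocol):
    # Data Plane: millisecond point lookups keyed by entity
    def get_online_features(self, entity_id: str, names: List[str]) -> Dict: ...

class ModelRegistry(Protocol):
    # Control Plane: owns versioning and promotion of trained models
    def get_production_model(self, name: str): ...

def handle_request(store: OnlineFeatureStore, registry: ModelRegistry, entity_id: str) -> float:
    # Inference Plane: stateless glue between the other two planes,
    # which is what lets it scale horizontally
    features = store.get_online_features(entity_id, ["txn_count_7d", "avg_amount_30d"])
    model = registry.get_production_model("fraud_detector")
    return model.predict(features)

Keeping the Inference Plane stateless is the key design choice: all state lives in the Feature Store and the Registry, so inference replicas are interchangeable.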
Detailed Design: The Feature Store Abstraction
The Feature Store is the most critical component for preventing skew. It must support two APIs: get_historical_features for training (batch) and get_online_features for inference (point-lookup). Below is a Python-based conceptual implementation of a feature retrieval service that abstracts these complexities.
import redis
import pandas as pd
from typing import Dict, List

class FeatureStoreClient:
    def __init__(self, online_config: Dict, offline_path: str):
        # decode_responses=True returns strings instead of raw bytes
        self.online_provider = redis.Redis(decode_responses=True, **online_config)
        self.offline_path = offline_path

    def get_online_features(self, entity_id: str, feature_names: List[str]) -> Dict:
        """
        Low-latency lookup for real-time inference.
        A single MGET round trip; O(N) in the number of features.
        Missing features come back as None -- the caller decides how to impute.
        """
        keys = [f"feat:{entity_id}:{name}" for name in feature_names]
        values = self.online_provider.mget(keys)
        return dict(zip(feature_names, values))

    def get_historical_features(self, entity_df: pd.DataFrame, features: List[str]) -> pd.DataFrame:
        """
        Point-in-time join to prevent data leakage: for every event, we only
        see feature values known at or before that event's timestamp.
        """
        # In a real system, this would trigger a Spark/Trino job; reading a
        # single Parquet file keeps the example self-contained.
        offline_data = pd.read_parquet(self.offline_path)
        joined_df = pd.merge_asof(
            entity_df.sort_values("timestamp"),
            offline_data.sort_values("timestamp"),
            on="timestamp",
            by="entity_id",
            direction="backward",  # never join a feature from the future
        )
        return joined_df[["entity_id", "timestamp"] + features]

Database Schema
We utilize a hybrid storage approach. The Online Store requires a Key-Value structure for millisecond lookups, while the Offline Store uses a columnar format (Parquet) optimized for analytical scans.
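To keep the two stores consistent, a materialization job typically computes each feature once and writes it to both sinks. Below is a minimal sketch under that assumption (real systems run this as a streaming job, e.g., Flink or Spark Structured Streaming; the key format matches the FeatureStoreClient above):

import time
import redis
import pandas as pd

def materialize_features(batch: pd.DataFrame, online: redis.Redis, offline_dir: str) -> None:
    """Write one computed feature batch to both stores.

    Expects columns: entity_id, feature_name, feature_value, event_timestamp.
    """
    # Offline store: append a new immutable Parquet file for analytical scans
    batch.to_parquet(f"{offline_dir}/batch_{int(time.time())}.parquet", index=False)

    # Online store: upsert only the latest value per (entity, feature) key,
    # pipelined to amortize network round trips
    pipe = online.pipeline(transaction=False)
    for row in batch.itertuples(index=False):
        pipe.set(f"feat:{row.entity_id}:{row.feature_name}", row.feature_value)
    pipe.execute()

Writing both stores from a single computation is what prevents the online and offline values from drifting apart, which is the root cause of training-serving skew.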
SQL Implementation (Offline Store)
For the offline store, we partition by event timestamp range (effectively by date) so that training set generation scans only the relevant time slices:
CREATE TABLE offline_feature_store (
entity_id VARCHAR(255),
feature_name VARCHAR(255),
feature_value DOUBLE PRECISION,
event_timestamp TIMESTAMP,
ingestion_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
PARTITION BY RANGE (event_timestamp);
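-- Partitions must exist before ingestion; an example monthly partition
-- (naming and granularity here are illustrative):
CREATE TABLE offline_feature_store_2024_01
    PARTITION OF offline_feature_store
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');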
CREATE INDEX idx_entity_lookup ON offline_feature_store (entity_id, event_timestamp DESC);

Scaling Strategy
Scaling an ML pipeline requires decoupling feature retrieval from model computation. As traffic grows from 1,000 to 1,000,000 users, the inference service must scale horizontally.
| Component | Scaling Mechanism | Real-World Example |
|---|---|---|
| Inference | Kubernetes HPA (CPU/Custom Metrics) | Netflix (Titus) |
| Online Store | Redis Cluster Sharding / DynamoDB | DoorDash |
| Batch Processing | Spark Dynamic Allocation | Uber (Michelangelo) |
| Stream Processing | Kafka Partitioning | Stripe |
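The Online Store row above relies on spreading the keyspace across shards. Redis Cluster does this internally with hash slots; the helper below is an illustrative sketch of the same idea, showing why key-to-shard assignment must be stable:

import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a feature key to a shard; stable across processes and restarts."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same entity always lands on the same shard, so reads stay local
# as we add inference replicas:
assert shard_for("feat:user_42:txn_count_7d", 16) == shard_for("feat:user_42:txn_count_7d", 16)

We hash with hashlib rather than Python's built-in hash() because the latter is randomized per process, which would break key locality across replicas.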
Failure Modes and Resiliency
ML systems fail in unique ways. Beyond standard 5xx errors, we face "Silent Failures" like model drift or missing features.
- Circuit Breakers: If the Feature Store latency exceeds 50ms, the Inference Service should trip a circuit breaker and return a "safe" default prediction (e.g., a global mean) rather than timing out the entire user request; see the sketch after this list.
- Imputation Logic: If a specific feature is missing due to an upstream pipeline delay, the system should use pre-calculated "Sentinel Values" or the last known value to maintain service availability.
- Model Shadowing: Before promoting a new model from the Registry, we route a percentage of traffic to it in "Shadow Mode." We compare its predictions against the live model without returning them to the user.
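A minimal sketch of the circuit-breaker behavior described above (the thresholds, the global-mean fallback value, and the store/model interfaces are assumptions for illustration):

import time

FALLBACK_SCORE = 0.02        # assumed global mean, the "safe" default prediction
LATENCY_BUDGET_S = 0.050     # 50ms feature-store budget from the bullet above
FAILURE_THRESHOLD = 5        # consecutive slow/failed lookups before opening
COOLDOWN_S = 30.0            # how long the breaker stays open

class FeatureStoreBreaker:
    """Trips after repeated slow or failed feature lookups, then serves a
    safe default prediction instead of timing out the user request."""

    def __init__(self):
        self.failures = 0
        self.opened_at = 0.0

    def _is_open(self) -> bool:
        return (self.failures >= FAILURE_THRESHOLD
                and time.monotonic() - self.opened_at < COOLDOWN_S)

    def predict(self, model, store, entity_id, feature_names) -> float:
        if self._is_open():
            return FALLBACK_SCORE                 # fail fast, stay available
        start = time.monotonic()
        try:
            features = store.get_online_features(entity_id, feature_names)
        except Exception:
            features = None
        if features is None or time.monotonic() - start > LATENCY_BUDGET_S:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.opened_at = time.monotonic()  # (re)open the breaker
            return FALLBACK_SCORE
        self.failures = 0                          # healthy call closes the breaker
        return model.predict(features)

Per-feature imputation would slot in before the latency check: any None values returned by get_online_features are replaced with sentinel or last-known values rather than failing the whole request.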
Conclusion
Designing a production ML pipeline is an exercise in distributed systems trade-offs. While data scientists prioritize model complexity, staff engineers must prioritize the reliability of the feature delivery and the robustness of the serving infrastructure. By implementing a centralized Feature Store, we solve the training-serving skew and provide a unified interface for both batch and real-time data.
The key takeaway for any architect is that the model is only as good as the data it sees at runtime. By treating features as first-class citizens—versioned, monitored, and indexed—we move from fragile ML experiments to resilient, industrial-grade intelligence systems.
References
- https://www.uber.com/blog/michelangelo-machine-learning-platform/
- https://research.facebook.com/publications/fblearner-predictor-internals-of-a-training-and-inference-system/
- https://netflixtechblog.com/open-sourcing-metaflow-a-framework-for-real-life-data-science-3d43c8d4720d
- https://door-dash.github.io/blog/2020-08-31-machine-learning-feature-store/