Designing an ML Pipeline for Production (Feature Store + Training + Serving)

In the evolution of a technology company, there is a distinct "Maturity Gap" between a data scientist training a model in a Jupyter notebook and a software engineer deploying a high-availability distributed system. For organizations like Uber, Netflix, and Stripe, the challenge isn't just the model's accuracy; it is the reliability of the data "plumbing" that feeds it. A production ML pipeline is less about the algorithms and more about the distributed systems design patterns that ensure data consistency, low latency, and scalability.

When we move from batch processing to real-time inference, the primary bottleneck shifts from compute to data orchestration. We must solve the "Training-Serving Skew": the phenomenon where the data used to train a model differs from the data available at the moment of prediction. To bridge this gap, we adopt a standardized architecture centered around a Feature Store, a Model Registry, and a scalable Inference Service. This post outlines the design of a system capable of handling thousands of requests per second at peak with sub-100ms latency.

Requirements

To design a production-grade ML pipeline, we must balance the needs of data scientists (who need historical data for training) and engineers (who need low-latency data for serving).

Capacity Estimation

Assuming a mid-sized fintech application (e.g., fraud detection):

Metric                     | Estimated Value
---------------------------|----------------
Daily Active Users (DAU)   | 10,000,000
Predictions per User/Day   | 5
Total Daily Predictions    | 50,000,000
Average RPS                | ~600
Peak RPS (10x average)     | 6,000
Features per Request       | 100 - 500
Feature Store Payload      | 2KB - 10KB
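
To sanity-check these numbers, here is a quick back-of-the-envelope calculation; the 10x peak factor and the payload midpoint are assumptions, not measurements:

python
# Back-of-the-envelope capacity math for the fraud-detection example above.
DAU = 10_000_000
PREDICTIONS_PER_USER = 5
PEAK_FACTOR = 10            # assumed peak-to-average traffic ratio
AVG_PAYLOAD_BYTES = 6_000   # midpoint of the 2KB-10KB feature payload estimate

daily_predictions = DAU * PREDICTIONS_PER_USER           # 50,000,000
avg_rps = daily_predictions / 86_400                     # ~579, rounded to ~600
peak_rps = avg_rps * PEAK_FACTOR                         # ~5,800, rounded to 6,000
online_read_bandwidth = peak_rps * AVG_PAYLOAD_BYTES     # ~35 MB/s from the online store

print(f"avg RPS ~{avg_rps:.0f}, peak RPS ~{peak_rps:.0f}, "
      f"peak online-store reads ~{online_read_bandwidth / 1e6:.0f} MB/s")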

High-Level Architecture

The architecture is divided into three planes: the Data Plane (Feature Store), the Control Plane (Training & Registry), and the Inference Plane (Serving).
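
To make those boundaries concrete, here is a minimal sketch of the three planes as service interfaces. The names and method signatures are illustrative, not a specific framework's API:

python
from typing import Dict, List, Protocol

class DataPlane(Protocol):
    """Feature Store: serves features to both training jobs and the inference service."""
    def get_online_features(self, entity_id: str, feature_names: List[str]) -> Dict: ...

class ControlPlane(Protocol):
    """Training pipeline + Model Registry: produces, versions, and promotes model artifacts."""
    def promote_model(self, model_name: str, version: str) -> None: ...

class InferencePlane(Protocol):
    """Serving layer: joins a registered model with online features on each request."""
    def predict(self, entity_id: str, model_name: str) -> float: ...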

Detailed Design: The Feature Store Abstraction

The Feature Store is the most critical component for preventing skew. It must support two APIs: get_historical_features for training (batch) and get_online_features for inference (point-lookup). Below is a Python-based conceptual implementation of a feature retrieval service that abstracts these complexities.

python
import redis
import pandas as pd
from typing import List, Dict

class FeatureStoreClient:
    def __init__(self, online_config: Dict, offline_path: str):
        self.online_provider = redis.StrictRedis(**online_config)
        self.offline_path = offline_path

    def get_online_features(self, entity_id: str, feature_names: List[str]) -> Dict:
        """
        Low-latency lookup for real-time inference.
        Complexity: O(N) where N is number of features.
        """
        keys = [f"feat:{entity_id}:{name}" for name in feature_names]
        # MGET returns None for any missing key, and raw bytes unless the
        # connection was created with decode_responses=True.
        values = self.online_provider.mget(keys)
        return dict(zip(feature_names, values))

    def get_historical_features(self, entity_df: pd.DataFrame, features: List[str]) -> pd.DataFrame:
        """
        Point-in-time join to prevent data leakage.
        Ensures that for every event, we only see features from 'timestamp - epsilon'.
        """
        # In a real system, this would trigger a Spark/Trino job
        offline_data = pd.read_parquet(self.offline_path)
        joined_df = pd.merge_asof(
            entity_df.sort_values("timestamp"),
            offline_data.sort_values("timestamp"),
            on="timestamp",
            by="entity_id",
            direction="backward",  # match only feature rows at or before each event timestamp
        )
        return joined_df[features + ["entity_id", "timestamp"]]
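
A hypothetical usage of this client from both sides of the skew problem; the Redis config, Parquet path, and feature names below are placeholders:

python
client = FeatureStoreClient(
    online_config={"host": "localhost", "port": 6379, "decode_responses": True},
    offline_path="data/transaction_features.parquet",  # placeholder path
)

# Training path: point-in-time join against the offline store.
label_events = pd.DataFrame({
    "entity_id": ["user_1", "user_2"],
    "timestamp": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 11:00"]),
})
train_df = client.get_historical_features(label_events, features=["txn_count_7d", "avg_amount_30d"])

# Serving path: the same feature names, fetched as a millisecond key-value lookup.
online_features = client.get_online_features("user_1", ["txn_count_7d", "avg_amount_30d"])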

Database Schema

We utilize a hybrid storage approach. The Online Store requires a Key-Value structure for millisecond lookups, while the Offline Store uses a columnar format (Parquet) optimized for analytical scans.
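
The two stores are kept consistent by a materialization job that periodically copies the latest value of each feature from the offline store into the online store. A minimal sketch, assuming the Redis key layout used by the client above and a Parquet file containing entity_id, event_timestamp, and one column per feature:

python
import pandas as pd
import redis
from typing import List

def materialize_latest(offline_path: str, online: redis.Redis, feature_names: List[str]) -> None:
    """Copy the most recent value of each feature per entity into the online store."""
    df = pd.read_parquet(offline_path)
    latest = (
        df.sort_values("event_timestamp")
          .groupby("entity_id", as_index=False)
          .last()
    )
    pipe = online.pipeline()  # batch writes to avoid one round-trip per key
    for row in latest.itertuples(index=False):
        for name in feature_names:
            pipe.set(f"feat:{row.entity_id}:{name}", getattr(row, name))
    pipe.execute()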

SQL Implementation (Offline Store)

For the offline store, we partition by date and feature group to optimize training set generation:

sql
CREATE TABLE offline_feature_store (
    entity_id VARCHAR(255),
    feature_name VARCHAR(255),
    feature_value DOUBLE PRECISION,
    event_timestamp TIMESTAMP,
    ingestion_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) 
PARTITION BY RANGE (event_timestamp);

CREATE INDEX idx_entity_lookup ON offline_feature_store (entity_id, event_timestamp DESC);
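
On the Parquet side of the offline store, the same idea of partitioning by date and feature group can be applied at write time so that training-set generation only scans the relevant files. A minimal sketch with illustrative column names:

python
import pandas as pd

def write_offline_partition(df: pd.DataFrame, root_path: str) -> None:
    """Append feature rows to the offline store, partitioned by event date and feature group."""
    df = df.assign(event_date=df["event_timestamp"].dt.date.astype(str))
    df.to_parquet(
        root_path,
        partition_cols=["event_date", "feature_group"],  # enables partition pruning during scans
        index=False,
    )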

Scaling Strategy

Scaling an ML pipeline requires decoupling the feature retrieval from the model computation. As traffic grows from 1,000 to 1,000,000 users, the inference service must scale horizontally.

Component          | Scaling Mechanism                    | Real-World Example
-------------------|--------------------------------------|--------------------
Inference          | Kubernetes HPA (CPU/Custom Metrics)  | Netflix (Titus)
Online Store       | Redis Cluster Sharding / DynamoDB    | DoorDash
Batch Processing   | Spark Dynamic Allocation             | Uber (Michelangelo)
Stream Processing  | Kafka Partitioning                   | Stripe
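
The inference tier can be sized directly from the capacity numbers above. A rough sizing sketch; the per-pod throughput and headroom factor are assumptions you would replace with load-test results:

python
import math

PEAK_RPS = 6_000
PER_POD_RPS = 200   # assumed sustainable throughput per inference pod at p99 < 100ms
HEADROOM = 1.5      # assumed buffer for rollouts, zone loss, and traffic spikes

min_replicas = math.ceil((PEAK_RPS / PER_POD_RPS) * HEADROOM)
print(f"HPA maxReplicas should be at least {min_replicas} pods")  # 45 with these assumptions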

Failure Modes and Resiliency

ML systems fail in unique ways. Beyond standard 5xx errors, we face "Silent Failures" like model drift or missing features.

  1. Circuit Breakers: If the Feature Store latency exceeds 50ms, the Inference Service should trip a circuit breaker and return a "safe" default prediction (e.g., a global mean) rather than timing out the entire user request (a sketch of this follows the list).
  2. Imputation Logic: If a specific feature is missing due to an upstream pipeline delay, the system should use pre-calculated "Sentinel Values" or the last known value to maintain service availability.
  3. Model Shadowing: Before promoting a new model from the Registry, we route a percentage of traffic to it in "Shadow Mode." We compare its predictions against the live model without returning them to the user.
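
A minimal sketch of the circuit-breaker and imputation behavior from items 1 and 2; the latency budget comes from the list above, while the defaults, thresholds, and wrapper interface are illustrative:

python
import time
from typing import Dict, List

# Pre-calculated "safe" sentinel values (illustrative, not from a real model).
FEATURE_DEFAULTS = {"txn_count_7d": 0.0, "avg_amount_30d": 0.0}
LATENCY_BUDGET_S = 0.050   # the 50ms feature-store budget from item 1
FAILURE_THRESHOLD = 5      # consecutive slow or failed lookups before tripping
OPEN_SECONDS = 30          # how long the breaker stays open before retrying

class GuardedFeatureLookup:
    """Wraps FeatureStoreClient.get_online_features with a circuit breaker and imputation."""

    def __init__(self, client):
        self.client = client
        self.failures = 0
        self.open_until = 0.0

    def _defaults(self, names: List[str]) -> Dict[str, float]:
        return {n: FEATURE_DEFAULTS.get(n, 0.0) for n in names}

    def get_features(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        now = time.monotonic()
        if now < self.open_until:
            return self._defaults(names)              # breaker open: serve safe defaults
        try:
            start = time.monotonic()
            raw = self.client.get_online_features(entity_id, names)
            slow = (time.monotonic() - start) > LATENCY_BUDGET_S
        except Exception:
            raw, slow = {}, True
        self.failures = self.failures + 1 if slow else 0
        if self.failures >= FAILURE_THRESHOLD:
            self.open_until = now + OPEN_SECONDS      # trip the breaker
        # Imputation: any feature the store could not return falls back to its sentinel.
        return {n: float(raw[n]) if raw.get(n) is not None else FEATURE_DEFAULTS.get(n, 0.0)
                for n in names}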

Conclusion

Designing a production ML pipeline is an exercise in distributed systems trade-offs. While data scientists prioritize model complexity, staff engineers must prioritize the reliability of the feature delivery and the robustness of the serving infrastructure. By implementing a centralized Feature Store, we solve the training-serving skew and provide a unified interface for both batch and real-time data.

The key takeaway for any architect is that the model is only as good as the data it sees at runtime. By treating features as first-class citizens—versioned, monitored, and indexed—we move from fragile ML experiments to resilient, industrial-grade intelligence systems.
