Real-Time Feature Store (Online + Offline Serving)
In the modern ML landscape, the bottleneck for productionizing models has shifted from model architecture to data engineering. Companies like Uber, Netflix, and DoorDash have pioneered the concept of a Feature Store to solve the "training-serving skew"—the discrepancy between data used during model training and data available during real-time inference. Without a unified feature store, data scientists often rewrite complex SQL transformations into Java or Go for production, leading to logic drift and catastrophic model performance degradation.
A real-time feature store acts as a centralized repository that manages the lifecycle of features. It provides a dual interface: an offline view for historical batch processing (model training) and an online view for low-latency lookups (real-time inference). The goal is to ensure that a feature like user_avg_spend_7d is calculated identically whether it is being pulled from a data lake for a training job or from an in-memory cache for a live transaction.
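As a sketch of what a single, shared definition for such a feature might look like (the field names below are illustrative, not tied to any specific feature-store framework):

```python
from dataclasses import dataclass

# Hypothetical feature definition: one declarative object that both the
# batch (training) pipeline and the streaming (serving) pipeline read from,
# so the aggregation logic is never written twice.
@dataclass(frozen=True)
class FeatureDefinition:
    name: str          # e.g. "user_avg_spend_7d"
    entity: str        # the key the feature is joined on, e.g. "user_id"
    source: str        # upstream table or topic
    aggregation: str   # "mean", "sum", "count", ...
    window: str        # e.g. "7d"

user_avg_spend_7d = FeatureDefinition(
    name="user_avg_spend_7d",
    entity="user_id",
    source="transactions",
    aggregation="mean",
    window="7d",
)
```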
Requirements
To build a production-grade system, we must balance the strict latency requirements of online serving with the high-throughput demands of offline processing.
The following capacity estimation assumes a mid-to-large scale deployment similar to a regional payment processor.
| Metric | Estimated Value |
|---|---|
| Total Features | 5,000 - 10,000 |
| Online Query Latency (p99) | < 15ms |
| Throughput (Online) | 100,000 QPS |
| Offline Data Volume | 500 TB - 2 PB |
| Update Frequency | 100ms (Streaming) to 24h (Batch) |
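To make the table concrete, a quick back-of-envelope calculation (with illustrative assumptions for the numbers the table does not pin down) shows why the online store is typically an in-memory key-value cluster rather than a relational database:

```python
# Back-of-envelope sizing; every number below is an illustrative assumption,
# not part of the requirements table above.
active_entities = 50_000_000       # e.g. users + merchants kept "hot" online
features_per_entity = 200          # subset of the 5,000-10,000 features that apply per entity
bytes_per_feature_value = 16       # rough average for key overhead + encoded value

online_bytes = active_entities * features_per_entity * bytes_per_feature_value
print(f"Online store footprint: ~{online_bytes / 1e9:.0f} GB")  # ~160 GB -> fits a Redis cluster

qps = 100_000                      # from the table above
features_per_request = 50          # assumed feature-vector size per inference call
print(f"Feature lookups per second: ~{qps * features_per_request:,}")  # ~5,000,000
```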
High-Level Architecture
The architecture follows a Lambda-style pattern where data flows through two distinct paths but converges at the Serving Layer. The Registry acts as the source of truth for feature definitions.
In this design, streaming data from Kafka is processed by Apache Flink for sub-second feature updates (e.g., click_count_last_5_min). Batch data from S3 is processed via Spark for long-term aggregations. Both paths sink into the Online Store for inference and the Offline Store for historical record-keeping.
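The following sketch shows the shape of those two paths in plain Python (Flink and Spark specifics omitted; the function and field names are illustrative). Both funnel events through one shared sink, which is what keeps the transformation logic identical:

```python
import json
from typing import Any, Callable, Dict, Iterable

# Both ingestion paths push events through the same sink callable, so the
# feature transformation behind it is written exactly once.
FeatureSink = Callable[[str, str, Dict[str, Any]], None]

def streaming_path(messages: Iterable[bytes], sink: FeatureSink) -> None:
    # In production this loop lives inside a Flink job consuming Kafka.
    for raw in messages:
        event = json.loads(raw)
        sink(event["entity_type"], event["entity_id"], event)

def batch_backfill(rows: Iterable[Dict[str, Any]], sink: FeatureSink) -> None:
    # In production this is a Spark job reading historical files from S3.
    for row in rows:
        sink(row["entity_type"], row["entity_id"], row)
```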
Detailed Design
The core of a feature store is the transformation logic. To prevent skew, we define transformations once. Below is a simplified Python implementation of a feature provider interface and a mock transformation engine that handles the dual-write logic.
```python
import abc
from datetime import datetime
from typing import Dict, Any


class FeatureProvider(abc.ABC):
    """Interface implemented by any backend capable of serving feature values."""

    @abc.abstractmethod
    def get_feature(self, entity_id: str) -> Dict[str, Any]:
        pass


class RealTimeFeatureStore:
    """Coordinates transformation and dual writes to the online and offline stores."""

    def __init__(self, online_repo, offline_repo, registry):
        self.online = online_repo
        self.offline = offline_repo
        self.registry = registry

    def ingest_event(self, entity_type: str, entity_id: str, event: Dict[str, Any]):
        # 1. Fetch feature definition from Registry
        feature_def = self.registry.get_definition(entity_type)
        # 2. Apply Transformation (e.g., rolling window sum)
        transformed_val = self._apply_transform(event, feature_def)
        # 3. Dual Write
        timestamp = datetime.utcnow()
        self.online.set(f"{entity_type}:{entity_id}", transformed_val)
        self.offline.append(entity_type, entity_id, transformed_val, timestamp)

    def _apply_transform(self, event, definition):
        # Implementation of the aggregation logic (e.g., sum, mean, count)
        return event.get('value', 0) * definition.get('multiplier', 1)


# Example usage for a 'credit_score' feature
# store.ingest_event('user', 'u123', {'value': 750})
```

This implementation ensures that the _apply_transform logic is encapsulated. In production, this would be a shared library used by both the Flink streaming job and the Spark batch job.
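To see the dual-write path end to end, a minimal usage sketch with in-memory stand-ins (the stand-in class names are illustrative only, not part of any real library) might look like this:

```python
# Minimal in-memory stand-ins, for illustration only; production deployments
# would back these with Redis/DynamoDB, a data-lake table, and the registry service.
class InMemoryOnlineRepo:
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

class InMemoryOfflineRepo:
    def __init__(self):
        self.rows = []

    def append(self, entity_type, entity_id, value, timestamp):
        self.rows.append((entity_type, entity_id, value, timestamp))

class InMemoryRegistry:
    def get_definition(self, entity_type):
        # Pretend every entity type has a pass-through transform.
        return {"multiplier": 1}

store = RealTimeFeatureStore(InMemoryOnlineRepo(), InMemoryOfflineRepo(), InMemoryRegistry())
store.ingest_event("user", "u123", {"value": 750})
print(store.online.data)  # {'user:u123': 750}
```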
Database Schema
The metadata registry is critical for discoverability and versioning. While the online store is often a Key-Value store (Redis or DynamoDB) for O(1) lookups, the metadata requires a relational structure.
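For illustration, a thin online-store adapter backed by Redis might look like the sketch below (assuming the redis-py client; the key layout, JSON serialization, and TTL are arbitrary choices rather than a prescribed design):

```python
import json
import redis  # assumes the redis-py client is installed

class RedisOnlineRepo:
    """Online store adapter: one key per entity/feature-group, values stored as JSON."""

    def __init__(self, host: str = "localhost", port: int = 6379, ttl_seconds: int = 86400):
        self.client = redis.Redis(host=host, port=port)
        self.ttl_seconds = ttl_seconds

    def set(self, key: str, value) -> None:
        # A TTL keeps abandoned entities from accumulating in memory forever.
        self.client.set(key, json.dumps(value), ex=self.ttl_seconds)

    def get(self, key: str):
        raw = self.client.get(key)
        return json.loads(raw) if raw is not None else None
```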
The SQL schema for the metadata registry must support fast lookups for the Feature Service to resolve feature names to physical storage locations.
```sql
CREATE TABLE feature_metadata (
    feature_id   UUID PRIMARY KEY,
    feature_name VARCHAR(255) NOT NULL,
    entity_id    VARCHAR(100) NOT NULL,
    storage_key  VARCHAR(255) NOT NULL,
    data_type    VARCHAR(50)  NOT NULL,
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_entity_feature ON feature_metadata (entity_id, feature_name);

-- Partitioning strategy for the Offline Store (conceptual)
-- PARTITION BY RANGE (event_timestamp)
-- Partitioning by day allows for efficient point-in-time joins during training.
```
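Point-in-time correctness is what makes offline training data trustworthy: each training row may only see feature values that were already available at that row's timestamp. A minimal sketch using pandas (illustrative data; at scale this would run in Spark over the partitioned offline store):

```python
import pandas as pd

# For each label row, pick the latest feature value observed at or before
# the label's timestamp (a point-in-time, or "as-of", join).
labels = pd.DataFrame({
    "entity_id": ["u123", "u123"],
    "event_timestamp": pd.to_datetime(["2024-01-05", "2024-01-20"]),
    "label": [0, 1],
})
features = pd.DataFrame({
    "entity_id": ["u123", "u123", "u123"],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-15"]),
    "user_avg_spend_7d": [42.0, 55.5, 61.2],
})

# merge_asof requires both frames to be sorted by the join key.
training_set = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="entity_id",
    direction="backward",  # only look at feature values from the past
)
print(training_set)
```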
Scaling Strategy
Scaling from 1,000 to 1,000,000+ users requires a transition from a centralized database to a distributed, sharded architecture.
To handle massive scale:
- Online Store Sharding: Use Redis Cluster or DynamoDB with a partition key based on the entity_id. This provides near-linear scaling of throughput.
- Feature Prefetching: For high-traffic entities, pre-calculate features and push them to the online store before the request arrives.
- Local Caching: Use an in-memory L1 cache (like Caffeine in Java) on the Feature Service nodes for the most frequently accessed features to reduce network hops to Redis.
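A rough sketch of that L1-cache idea in Python, using cachetools as a stand-in for Caffeine (class and parameter names are illustrative):

```python
from cachetools import TTLCache  # assumes the cachetools package

class CachedFeatureService:
    """L1 in-process cache in front of the online store to cut network hops."""

    def __init__(self, online_repo, max_entries: int = 100_000, ttl_seconds: float = 5.0):
        self.online_repo = online_repo
        # A short TTL bounds staleness for frequently read, slowly changing features.
        self.l1 = TTLCache(maxsize=max_entries, ttl=ttl_seconds)

    def get_feature(self, key: str):
        if key in self.l1:
            return self.l1[key]
        value = self.online_repo.get(key)
        if value is not None:
            self.l1[key] = value
        return value
```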
Failure Modes and Resilience
In a distributed feature store, the "Online Store" is the most vulnerable component. If it fails, downstream ML models cannot perform inference, potentially breaking critical paths like fraud detection or search ranking.
Key resilience patterns include:
- Circuit Breakers: If the online store is slow, open the circuit and return default feature values or cached "stale" features (a minimal sketch follows this list).
- Graceful Degradation: If a specific feature group is unavailable, the model should be trained to handle "null" inputs or use a simpler version of the model that requires fewer features.
- Write-Ahead Logs (WAL): Ensure all ingestion events are persisted in Kafka before processing. If the Flink job fails, it can replay events to rebuild the online state.
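A deliberately simplified sketch of the circuit-breaker plus default-value fallback (a hand-rolled breaker for illustration; production systems would more likely rely on a resilience library or service-mesh policy):

```python
import time

class OnlineStoreBreaker:
    """Simplified circuit breaker: after repeated failures, serve default feature
    values for a cool-down period instead of hammering the unhealthy online store."""

    def __init__(self, online_repo, default_features, failure_threshold=5, cooldown_seconds=30.0):
        self.online_repo = online_repo
        self.default_features = default_features
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.open_until = 0.0

    def get_features(self, key: str):
        if time.monotonic() < self.open_until:
            return self.default_features              # circuit open: degrade gracefully
        try:
            value = self.online_repo.get(key)
            self.failures = 0                         # success closes the circuit
            return value if value is not None else self.default_features
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open_until = time.monotonic() + self.cooldown_seconds
            return self.default_features
```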
Conclusion
Designing a real-time feature store is an exercise in managing CAP-theorem trade-offs. For online serving, we typically prioritize Availability and Partition Tolerance (AP), accepting eventual consistency for feature updates. However, for the offline store, Consistency is paramount to ensure that training data accurately reflects historical reality.
The key patterns for success are:
- Unified Definition: Define features in a DSL (like YAML or Python) that generates both batch and streaming pipelines.
- Immutability: Treat offline feature data as immutable to simplify point-in-time joins.
- Monitoring: Track "Feature Drift"—if the distribution of a feature in production differs significantly from training, trigger an alert.
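As one illustration of the drift check in the last bullet, a small Population Stability Index (PSI) helper can compare a training-time sample of a feature against a recent production sample (the 0.2 alert threshold below is a common rule of thumb, not a universal constant):

```python
import math
from typing import Sequence

def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               bins: int = 10) -> float:
    """Rough PSI between a training-time sample and a production sample.
    A common heuristic: PSI > 0.2 suggests meaningful drift worth alerting on."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        total = len(values)
        # A small floor avoids division by zero / log(0) for empty buckets.
        return [max(c / total, 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```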
By centralizing data logic, a feature store transforms ML from a series of "one-off" experiments into a scalable, industrial-grade engineering discipline.
References
- https://www.uber.com/blog/michelangelo-machine-learning-platform/
- https://netflixtechblog.com/building-and-scaling-the-netflix-feature-store-9952d7539061
- https://door-dash.engineering/2020/11/19/building-a-gigascale-feature-store-with-redis/
- https://feast.dev/docs/