Feature Store for Real-Time ML Inference (Meta)
In the modern ML lifecycle, the bottleneck has shifted from model architecture to data engineering. At organizations like Meta, Uber, and Netflix, the challenge isn't just training a model with billions of parameters; it is serving features to that model with millisecond latency while ensuring "training-serving skew" does not invalidate the model's predictions. When a user opens Instagram, the recommendation engine must fetch hundreds of features—user history, post metadata, and real-time engagement—in under 50ms.
A Feature Store acts as the interface between raw data and ML models. It provides a centralized repository where features are defined, stored, and served. Without it, data scientists often re-implement in SQL for production the feature logic they originally wrote in Python for training, leading to subtle discrepancies that degrade model performance. This "dual-coding" problem is the primary driver for adopting a unified feature store architecture.
By decoupling feature engineering from model deployment, we enable feature reuse across different teams. If the Ads team builds a sophisticated "user_propensity_score," the Marketplace team should be able to consume it without re-implementing the pipeline. In this post, we will explore the design of a production-grade feature store optimized for real-time inference at Meta-scale.
Requirements
To design a system capable of supporting thousands of models and billions of users, we must categorize our requirements into functional capabilities and operational constraints.
Mindmap: System Requirements
Capacity Estimation
| Metric | Estimated Value |
|---|---|
| Total Entities (Users/Items) | 5 Billion |
| Features per Entity | 500 - 1,000 |
| Online Read Throughput | 1 Million QPS |
| Write Throughput (Streaming) | 500k events/sec |
| Storage (Offline) | 50+ Petabytes |
| Storage (Online) | 100+ Terabytes (RAM/SSD) |
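As a quick sanity check on these numbers, the back-of-envelope sketch below derives the online storage figure from the entity and feature counts; the 40-bytes-per-value average is an assumption for illustration, not a measured figure.

```go
package main

import "fmt"

func main() {
	const (
		entities        = 5e9 // 5 billion users/items
		featuresPerUser = 500 // lower end of the 500-1,000 range
		bytesPerValue   = 40  // assumed average: key overhead + encoded value
	)
	onlineBytes := entities * featuresPerUser * bytesPerValue
	fmt.Printf("Online store: ~%.0f TB\n", onlineBytes/1e12) // ≈ 100 TB, matching the table
}
```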
High-Level Architecture
The architecture follows a "Lambda" style pattern but is optimized for serving. It consists of two main paths: the Offline Path for batch processing and model training, and the Online Path for real-time inference.
The Feature Registry is the source of truth, storing metadata about feature ownership, types, and transformation logic. The Online Store (often using Redis or Meta's ZippyDB) provides low-latency lookups, while the Offline Store (S3/HDFS) maintains historical values for "point-in-time" joins during training.
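To make the Registry's role concrete, the sketch below models a feature definition as a Go type. The specific fields (Owner, ValueType, TTL, Source) are illustrative assumptions rather than Meta's actual schema.

```go
package featurestore

import "time"

// FeatureDefinition is the metadata the Registry holds about a single feature.
// The Registry stores only definitions; feature values live in the Online and
// Offline stores.
type FeatureDefinition struct {
	Name      string        // e.g., "user_propensity_score"
	Owner     string        // owning team, used for discovery and governance
	Entity    string        // key the feature is indexed by, e.g., "user_id"
	ValueType string        // "float64", "int64", "embedding", ...
	TTL       time.Duration // how long an online value is considered fresh
	Source    string        // reference to the batch/streaming transformation
}
```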
Detailed Design
The core logic of a feature store resides in how it handles the "Online Lookup." When a prediction service requests features for user_id: 12345, the system must aggregate features from multiple tables and return a unified vector.
Below is a Go-based implementation of an OnlineProvider. It uses a concurrent fetching strategy to minimize tail latency when gathering features from different storage backends.
```go
package featurestore

import (
	"context"
	"sync"
	"time"
)

// FeatureValue is a single named feature returned to the prediction service.
type FeatureValue struct {
	Name      string
	Value     interface{}
	Timestamp time.Time
}

// OnlineProvider serves point lookups against the online store.
type OnlineProvider struct {
	kvStore map[string]interface{} // simplified stand-in for Redis/ZippyDB
	mu      sync.RWMutex           // guards kvStore
}

// GetFeatures fetches multiple features concurrently. Each feature is looked
// up in its own goroutine so one slow shard does not serialize the request.
func (p *OnlineProvider) GetFeatures(ctx context.Context, entityID string, featureNames []string) ([]FeatureValue, error) {
	results := make([]FeatureValue, len(featureNames))
	var wg sync.WaitGroup
	errChan := make(chan error, len(featureNames))

	for i, fName := range featureNames {
		wg.Add(1)
		go func(index int, name string) {
			defer wg.Done()
			// Simulate a KV store lookup (e.g., Redis GET).
			val, err := p.fetchFromRemote(ctx, entityID, name)
			if err != nil {
				errChan <- err
				return
			}
			// Each goroutine writes to its own index, so no lock is needed here.
			results[index] = FeatureValue{
				Name:      name,
				Value:     val,
				Timestamp: time.Now(),
			}
		}(i, fName)
	}

	wg.Wait()
	close(errChan)
	if len(errChan) > 0 {
		return nil, <-errChan // surface the first error; callers may prefer partial results
	}
	return results, nil
}

// fetchFromRemote is where the Redis/ZippyDB client call would live.
func (p *OnlineProvider) fetchFromRemote(ctx context.Context, entityID, feature string) (interface{}, error) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	if val, ok := p.kvStore[entityID+":"+feature]; ok {
		return val, nil
	}
	return nil, nil // a production client would distinguish "missing" from "error"
}
```

This design ensures that if one storage shard is slow, it doesn't linearly increase the total request time. In production, we would also implement "Request Hedging," where a second request is sent if the first doesn't return within a target latency percentile (e.g., the P95).
Database Schema
The Feature Registry tracks the definition, while the Online Store tracks the values. For the Online Store, we use a flattened Key-Value structure to optimize for point lookups.
Entity Relationship Diagram
The ONLINE_STORE table in a system like Cassandra or DynamoDB would use entity_id as the Partition Key and feature_name as the Sort Key. This allows us to fetch all features for a user in a single disk seek or network round-trip.
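Modeled as a Go type (field names are illustrative assumptions, not the production schema), a row of that table looks like this:

```go
// OnlineRow mirrors one row of the ONLINE_STORE table.
// EntityID is the partition key and FeatureName the sort (clustering) key,
// so every feature for a given entity lives in the same partition.
type OnlineRow struct {
	EntityID    string    // partition key, e.g., "user:12345"
	FeatureName string    // sort key, e.g., "clicks_7d"
	Value       []byte    // serialized feature value
	EventTime   time.Time // when the underlying event occurred
	WrittenAt   time.Time // when the value landed in the online store
}
```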
Scaling Strategy
Scaling from 1,000 to 1 Million+ QPS requires moving from a single-node architecture to a distributed, sharded system. We utilize a "Sidecar" pattern for the Feature SDK to handle local caching and connection pooling.
To handle the "Hot Key" problem (e.g., a viral post being accessed by millions), we implement a multi-tier caching strategy. L1 is a local In-Memory cache (LRU) within the application process, and L2 is a distributed cache like Memcached.
Failure Modes and Resilience
In a distributed system, failures are inevitable. A Feature Store must fail gracefully to avoid taking down the entire prediction pipeline.
- Circuit Breaker: If the Online Store latency exceeds a threshold, the SDK trips a circuit breaker and returns default values (e.g., the global mean for a feature) so the model still produces a prediction, albeit a less accurate one (see the sketch after this list).
- Staleness Monitoring: We track `event_time` vs. `processing_time`. If the gap (lag) exceeds 5 minutes, we flag the features as "stale," allowing the model to decide whether to trust the input.
- Shadow Writes: When deploying a new version of a feature, we perform "Shadow Writes" to the Online Store to validate throughput and latency before switching the "Read" traffic.
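Here is a minimal circuit-breaker sketch for the SDK side, in the same featurestore package. The consecutive-failure trip condition and the defaultValue fallback are assumptions for illustration; a production breaker would typically also track latency percentiles.

```go
// breaker trips after consecutive failed lookups and, while open,
// short-circuits to a default value (e.g., the feature's global mean).
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time

	maxFailures int           // trip after this many consecutive failures
	cooldown    time.Duration // how long the circuit stays open
}

func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

func (b *breaker) record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= b.maxFailures {
		b.openUntil = time.Now().Add(b.cooldown) // trip the circuit
		b.failures = 0
	}
}

// getOrDefault wraps an online lookup with the breaker: when the circuit is
// open or the lookup fails, the model still receives a prediction-safe default.
func (b *breaker) getOrDefault(ctx context.Context, key string, defaultValue []byte,
	fetch func(context.Context, string) ([]byte, error)) []byte {

	if !b.allow() {
		return defaultValue // circuit open: degrade gracefully
	}
	val, err := fetch(ctx, key)
	b.record(err)
	if err != nil {
		return defaultValue
	}
	return val
}
```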
Conclusion
Designing a feature store for Meta-scale inference involves balancing the CAP theorem's trade-offs. We prioritize Availability and Partition Tolerance for online serving, accepting Eventual Consistency for feature updates. The separation of the Registry (Metadata) from the Data Plane (Storage) allows for independent scaling and robust governance.
Key patterns identified include:
- Unified Transformations: Using a single logic definition for both streaming and batch.
- Point-in-Time Correctness: Preventing data leakage by ensuring training sets only see data available at the time of the event.
- Multi-tier Caching: Mitigating hot-key issues and reducing tail latency.
By implementing these patterns, organizations can reduce the time-to-market for new ML models from months to days, ensuring that the data infrastructure is an accelerant rather than a bottleneck.