Lakehouse Architecture
For decades, data engineering was split between two distinct worlds: the Data Warehouse and the Data Lake. Data Warehouses, like Snowflake or Teradata, offered high-performance SQL and ACID transactions but were prohibitively expensive for unstructured data and lacked support for machine learning workloads. Data Lakes, built on S3 or HDFS, offered massive scale and low cost but quickly devolved into "data swamps" where data consistency was non-existent and performance was abysmal due to the lack of indexing and schema enforcement.
The "Lakehouse" architecture represents a fundamental shift in distributed systems design, merging the low-cost, open-format storage of the Data Lake with the management and performance capabilities of the Data Warehouse. By implementing a transactional metadata layer on top of open file formats like Parquet or Avro, organizations like Uber, Netflix, and Stripe can now execute complex transactions directly on object storage. This convergence is not merely a marketing term; it is a technical answer to the cost and consistency trade-offs that previously forced engineers to maintain redundant ETL pipelines between lakes and warehouses.
Implementing a production-grade Lakehouse requires a deep understanding of how to manage state in a stateless environment. In a distributed system where storage is decoupled from compute, the "source of truth" moves from the database engine to the metadata manifest files. This architecture allows for independent scaling of compute clusters (like Spark or Trino) and storage buckets (S3/GCS), providing the elasticity required for modern petabyte-scale analytics.
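To make "the source of truth moves to the metadata" concrete, here is roughly how an Apache Iceberg table is laid out in object storage (the bucket, table, and file names are illustrative). Query engines resolve the current metadata file through the catalog and walk this tree rather than listing the bucket:

```text
s3://prod-data-lakehouse/gold/db/user_profiles/
├── metadata/
│   ├── v3.metadata.json         # current schema, partition spec, and snapshot list
│   ├── snap-839201.avro         # manifest list: one entry per manifest in a snapshot
│   └── a1b2c3-m0.avro           # manifest: data file paths plus per-column min/max stats
└── data/
    └── ts_day=2023-10-27/
        └── 00000-0-a1b2c3.parquet   # immutable Parquet data files
```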
Requirements
To build a robust Lakehouse, we must satisfy both the rigorous consistency requirements of financial reporting and the high-throughput demands of real-time telemetry.
Capacity Estimation
When designing for a mid-to-large scale enterprise (e.g., a logistics company similar to Uber), we must account for the following scale:
| Metric | Estimated Value |
|---|---|
| Daily Ingest Volume | 50 TB / day |
| Total Data Footprint | 10 PB+ |
| Concurrent Query Users | 500+ |
| Latency Requirement | < 1s for Point Lookups; < 60s for Heavy Scans |
| Retention Policy | 7 Years (Regulatory) |
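A quick back-of-envelope on what those numbers imply for the metadata layer, assuming data is compacted into roughly 1 GB Parquet files (an assumption, not a value from the table):

```python
# Back-of-envelope file-count estimate for the capacity table above.
# The 1 GB target file size is an assumption; adjust to your compaction policy.
DAILY_INGEST_TB = 50
RETENTION_YEARS = 7
TARGET_FILE_GB = 1

files_per_day = DAILY_INGEST_TB * 1024 / TARGET_FILE_GB    # ~51,200 files/day
files_retained = files_per_day * 365 * RETENTION_YEARS     # ~131 million files

print(f"~{files_per_day:,.0f} data files per day")
print(f"~{files_retained / 1e6:,.0f}M data files under a {RETENTION_YEARS}-year retention policy")
```

At that file count, listing objects in S3 during query planning is not viable; planning has to work against manifest statistics, which is exactly what a table format provides.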
High-Level Architecture
The Lakehouse architecture is defined by its "Medallion" structure, which organizes data into layers of increasing cleanliness and aggregation. Unlike traditional warehouses, every layer resides in the same object store, governed by a unified metadata catalog.
In this design, the Bronze layer stores raw events in their original format. The Silver layer applies schema enforcement and normalization. The Gold layer contains business-level aggregates optimized for consumption. Companies like Netflix use Apache Iceberg to ensure that when a Spark job writes to the Gold layer, a Trino user reading that same data sees a consistent snapshot, even if the write is still in progress.
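As a minimal sketch of a Bronze-to-Silver promotion job (the paths, table name, and columns are hypothetical, and a Spark session configured with the Iceberg catalog from the next section is assumed):

```python
from pyspark.sql import functions as F

def promote_bronze_to_silver(spark, ingest_date: str):
    # Bronze: raw JSON events, kept exactly as they arrived.
    raw = spark.read.json(f"s3://prod-data-lakehouse/bronze/user_activity/date={ingest_date}/")

    # Silver: enforce types, drop malformed rows, deduplicate.
    cleaned = (
        raw.select(
            F.col("user_id").cast("bigint").alias("user_id"),
            F.col("event_type").cast("string").alias("event_type"),
            F.col("event_ts").cast("timestamp").alias("event_ts"),
        )
        .dropna(subset=["user_id", "event_ts"])
        .dropDuplicates(["user_id", "event_ts"])
    )

    # Appending to an Iceberg table commits a new snapshot atomically;
    # concurrent readers keep seeing the previous snapshot until the commit lands.
    cleaned.writeTo("glue_catalog.silver.user_activity").append()
```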
Detailed Design
The core of the Lakehouse is the Table Format. Let's look at how we implement a transactional "Upsert" using Apache Iceberg in Python/PySpark. Traditional data lakes cannot update a single row; they must rewrite an entire partition. Iceberg solves this using "Merge-on-Read" (MoR) or "Copy-on-Write" (CoW) strategies.
```python
from pyspark.sql import SparkSession

# Initialize Spark with Iceberg support, registering an AWS Glue-backed catalog
spark = SparkSession.builder \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://prod-data-lakehouse/gold/") \
    .getOrCreate()

# Schema-enforced Merge (Upsert) logic: only rows with a newer updated_at overwrite existing rows
def upsert_to_gold_table(incremental_df, target_table):
    incremental_df.createOrReplaceTempView("updates")
    spark.sql(f"""
        MERGE INTO glue_catalog.db.{target_table} t
        USING updates s
        ON t.user_id = s.user_id
        WHEN MATCHED AND t.updated_at < s.updated_at THEN
          UPDATE SET *
        WHEN NOT MATCHED THEN
          INSERT *
    """)

# Example usage: processing daily user activity
activity_df = spark.read.parquet("s3://prod-data-lakehouse/silver/user_activity/date=2023-10-27/")
upsert_to_gold_table(activity_df, "user_profiles")
```

The magic happens in the metadata. When the MERGE is committed, the engine writes a new snapshot whose manifests point to the old data files plus either a new "delete file" (for MoR) or freshly rewritten data files (for CoW). This is also what enables Time Travel: you can query the table as it existed at a specific point in time by referencing an older snapshot ID.
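A sketch of Time Travel against the table above (the timestamp and the values returned are placeholders):

```python
# Read the Gold table as it existed at a point in time (epoch milliseconds).
historical = (
    spark.read
    .option("as-of-timestamp", "1698364800000")  # 2023-10-27 00:00:00 UTC
    .format("iceberg")
    .load("glue_catalog.db.user_profiles")
)

# List the available snapshots (and their commit times) from the table's metadata.
spark.sql(
    "SELECT committed_at, snapshot_id, operation FROM glue_catalog.db.user_profiles.snapshots"
).show(truncate=False)
```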
Database Schema & Physical Layout
The physical layout of a Lakehouse is critical for performance. We use Z-Ordering and Multi-dimensional Clustering to colocate related data.
For a typical transactions table, the schema and partitioning strategy would look like this:
```sql
CREATE TABLE glue_catalog.gold.transactions (
    transaction_id  STRING,   -- UUID stored as a string (Spark SQL has no native UUID type)
    user_id         BIGINT,
    amount          DECIMAL(18, 2),
    currency        STRING,
    status          STRING,
    ts              TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(ts))
TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.metadata.compression-codec' = 'gzip',
    'write.target-file-size-bytes' = '536870912'   -- aim for ~512 MB data files
);

-- Optimization: sort writes by user_id so lookups for a single user touch as few
-- files as possible. (Multi-column Z-ordering is applied during compaction; see
-- the Scaling Strategy section below.)
-- ALTER TABLE glue_catalog.gold.transactions WRITE ORDERED BY user_id;
```

Scaling Strategy
Scaling a Lakehouse involves moving from simple file listing to metadata-based pruning. When a query hits a 1PB table, the system should not scan S3. Instead, it reads the manifest files to identify exactly which Parquet files contain the relevant data based on column statistics (min/max values).
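The statistics the planner prunes against are visible through Iceberg's metadata tables; a quick way to inspect them for the table defined above:

```python
# Per-file record counts, sizes, and column-level lower/upper bounds used for pruning.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes, lower_bounds, upper_bounds
    FROM glue_catalog.gold.transactions.files
""").show(truncate=False)
```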
As the system scales from 1K to 1M+ users, we implement Compaction Services. Small files are the enemy of distributed storage. A background process must periodically merge small 10MB files into 1GB files to maintain query performance.
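In Iceberg, that background process is usually a scheduled job calling the built-in maintenance procedures; the target size, sort order, and retention values below are assumptions, not required defaults:

```python
# Compact small files into ~512 MB files, Z-ordering on (user_id, ts) while rewriting.
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'gold.transactions',
        strategy => 'sort',
        sort_order => 'zorder(user_id, ts)',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire old snapshots so Time Travel history (and storage cost) stays bounded.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'gold.transactions',
        older_than => TIMESTAMP '2023-10-20 00:00:00',
        retain_last => 100
    )
""")
```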
Failure Modes & Resilience
In a distributed environment, the most common failure is a Write Conflict. If two jobs attempt to update the same table simultaneously, we use Optimistic Concurrency Control (OCC).
If Job A and Job B both try to commit a new snapshot, the catalog (e.g., AWS Glue or Project Nessie) guarantees that only one atomic pointer swap succeeds. The losing writer must refresh its view of the table metadata and retry the commit.
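A minimal retry sketch around the upsert from earlier; the exception check is an assumption, since from PySpark the underlying org.apache.iceberg.exceptions.CommitFailedException surfaces wrapped in a JVM error, so we match on the class name:

```python
import random
import time

def upsert_with_retry(incremental_df, target_table, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            upsert_to_gold_table(incremental_df, target_table)
            return
        except Exception as exc:  # Py4J wraps the JVM exception; inspect the message
            if "CommitFailedException" not in str(exc) or attempt == max_attempts:
                raise
            # Another writer won the atomic swap: back off, then retry.
            # The MERGE re-plans against the now-current snapshot on the next attempt.
            time.sleep(min(2 ** attempt, 30) + random.random())
```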
Conclusion
The Lakehouse architecture succeeds by acknowledging that data has gravity. By bringing warehouse-like reliability to the data lake, we eliminate the need for expensive "copy-out" operations to proprietary systems. The key patterns—Medallion architecture, open table formats (Iceberg/Delta), and decoupled compute—allow for a system that is both cost-effective and highly performant.
However, the trade-off is complexity. Managing compaction, vacuuming old snapshots, and ensuring catalog consistency requires more engineering overhead than a turnkey Data Warehouse. For organizations at the scale of Stripe or Netflix, this overhead is a small price to pay for the flexibility of a truly open distributed system.
References
- https://iceberg.apache.org/docs/latest/
- https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
- https://hudi.apache.org/docs/overview/
- https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf