Designing a Log & Event Ingestion System (Kafka Classic)

In the modern distributed landscape, data is no longer a static asset sitting in a relational database; it is a continuous stream of pulses representing user behavior, system health, and financial transactions. As a system evolves from a monolithic architecture to a microservices-oriented one, the "glue" that binds these services shifts from synchronous REST calls to asynchronous event streams. Designing a log and event ingestion system is one of the most critical challenges a staff engineer faces, as it sits at the intersection of high-throughput networking, durable storage, and real-time computation.

Companies like Uber, Netflix, and LinkedIn have pioneered this space, moving from batch processing to real-time pipelines that handle trillions of events per day. At this scale, the ingestion system must act as a massive shock absorber, decoupling unpredictable traffic spikes from downstream consumers like search indices (Elasticsearch), data lakes (S3/HDFS), and real-time analytics engines (Apache Flink). This post explores the architecture of a production-grade ingestion system built on the foundations of "Kafka Classic"—leveraging the proven durability of the distributed commit log.

Requirements

To build a system capable of handling global-scale traffic, we must define clear functional boundaries and strict non-functional constraints. Our target is a system that can ingest 100 billion events per day, providing a unified interface for both application logs and business events.

Capacity Estimation

Metric             | Target Value
-------------------|--------------------------
Daily Event Volume | 100 Billion Events
Average Event Size | 1 KB
Daily Data Volume  | ~100 TB
Peak Throughput    | 2.5 Million Events/sec
Retention Period   | 7 Days (Hot Storage)
Required Storage   | ~700 TB (pre-replication)
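
These figures follow from simple arithmetic; the sketch below walks through it (the 2x peak-to-average burst factor is an assumption of this post, and the table above rounds the results to friendlier planning numbers).

go
package main

import "fmt"

func main() {
    const (
        eventsPerDay  = 100e9 // 100 billion events/day
        avgEventBytes = 1024  // 1 KB average payload
        retentionDays = 7     // hot-storage retention
        burstFactor   = 2.0   // assumed peak-to-average ratio
    )
    dailyTB := eventsPerDay * avgEventBytes / 1e12
    avgEPS := eventsPerDay / 86400.0
    fmt.Printf("daily volume: ~%.0f TB\n", dailyTB)                               // ~102 TB
    fmt.Printf("average rate: ~%.2fM events/sec\n", avgEPS/1e6)                   // ~1.16M/sec
    fmt.Printf("peak rate:    ~%.1fM events/sec\n", avgEPS*burstFactor/1e6)       // ~2.3M/sec
    fmt.Printf("hot storage:  ~%.0f TB pre-replication\n", dailyTB*retentionDays) // ~717 TB
}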

High-Level Architecture

The architecture follows a "Source-to-Sink" pattern. We utilize a tiered approach: an Ingestion Layer (API Gateways), a Messaging Layer (Kafka), and a Processing Layer (Consumers). This decoupling allows us to scale the ingestion capacity independently of our processing logic.

In this design, the Ingestion Service acts as a lightweight proxy. It performs authentication, initial schema validation against a Schema Registry (like Confluent or an internal equivalent), and assigns a partition key to ensure ordering. By moving validation to the edge, we prevent "poison pills" from ever entering our Kafka brokers.
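
As a rough sketch of that edge path (the SchemaValidator and Enqueuer interfaces, the X-Tenant-ID header, and the HTTP status choices below are illustrative assumptions rather than a prescribed API), the proxy's hot path reduces to: authenticate, validate, assign a key, and hand off to the producer.

go
package ingestion

import (
    "context"
    "io"
    "net/http"
)

// SchemaValidator is an assumed abstraction over a schema-registry client
// (Confluent or an internal equivalent would sit behind it).
type SchemaValidator interface {
    Validate(topic string, payload []byte) error
}

// Enqueuer is the minimal producer surface the edge needs; the EventProducer
// shown in the next section satisfies it.
type Enqueuer interface {
    Ingest(ctx context.Context, key []byte, payload []byte) error
}

type EdgeHandler struct {
    validator SchemaValidator
    producer  Enqueuer
    topic     string
}

func (h *EdgeHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // Authentication is assumed to happen in upstream middleware.
    payload, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }
    // Reject malformed events at the edge so poison pills never reach Kafka.
    if err := h.validator.Validate(h.topic, payload); err != nil {
        http.Error(w, "schema validation failed", http.StatusUnprocessableEntity)
        return
    }
    // Partition key: a tenant ID header is assumed here; any stable identifier
    // that must preserve ordering (user ID, device ID) works equally well.
    key := []byte(r.Header.Get("X-Tenant-ID"))
    if err := h.producer.Ingest(r.Context(), key, payload); err != nil {
        http.Error(w, "ingestion unavailable", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusAccepted)
}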

Detailed Design: The Producer Logic

A production-grade ingestion service must handle backpressure and transient network failures gracefully. In Go, we can implement a robust producer that utilizes internal buffering and asynchronous retries to maximize throughput without blocking the incoming HTTP request.

go
package ingestion

import (
    "context"
    "log"
    "time"

    "github.com/segmentio/kafka-go"
)

type EventProducer struct {
    writer *kafka.Writer
}

func NewProducer(brokers []string, topic string) *EventProducer {
    return &EventProducer{
        writer: &kafka.Writer{
            Addr:         kafka.TCP(brokers...),
            Topic:        topic,
            Balancer:     &kafka.LeastBytes{},
            BatchSize:    1000,
            BatchTimeout: 10 * time.Millisecond,
            Async:        true,             // Non-blocking writes for high throughput
            RequiredAcks: kafka.RequireAll, // acks=all for durability
            // With Async enabled, delivery errors surface here rather than
            // from WriteMessages; log them and feed the circuit breaker.
            Completion: func(messages []kafka.Message, err error) {
                if err != nil {
                    log.Printf("delivery failed for %d messages: %v", len(messages), err)
                }
            },
        },
    }
}

func (p *EventProducer) Ingest(ctx context.Context, key []byte, payload []byte) error {
    // Inject metadata/trace headers before sending.
    // In async mode this call only enqueues into the writer's batch buffer,
    // so the returned error covers validation and buffering, not delivery.
    return p.writer.WriteMessages(ctx, kafka.Message{
        Key:   key,
        Value: payload,
        Time:  time.Now(),
    })
}

// Close flushes any buffered messages and releases the writer's resources.
func (p *EventProducer) Close() error {
    return p.writer.Close()
}

This implementation emphasizes three critical Kafka tunables:

  1. BatchSize and BatchTimeout: Balancing latency vs. throughput; larger batches amortize network round trips at the cost of a small delivery delay.
  2. RequiredAcks = kafka.RequireAll (acks=all): The broker acknowledges only after the message has been replicated to at least min.insync.replicas replicas.
  3. Async: true: Decoupling the HTTP response from the broker's disk I/O; with async writes, delivery errors arrive via the writer's Completion callback rather than the WriteMessages return value.

Database Schema (Metadata & Registry)

While Kafka stores the event data, we need a relational store to manage topic configurations, user permissions, and schema versions. We use PostgreSQL with partitioning for the audit logs of schema changes.

sql
CREATE TABLE topic_metadata (
    topic_id UUID PRIMARY KEY,
    name VARCHAR(255) UNIQUE NOT NULL,
    retention_ms BIGINT DEFAULT 604800000, -- 7 days
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_topic_name ON topic_metadata(name);

-- Partitioned table for schema history
CREATE TABLE schema_registry (
    schema_id SERIAL,
    topic_name VARCHAR(255),
    version INT,
    definition JSONB,
    PRIMARY KEY (topic_name, version)
) PARTITION BY LIST (topic_name);

-- A list-partitioned table accepts no rows until it has at least one
-- partition; a DEFAULT partition catches topics without a dedicated one.
CREATE TABLE schema_registry_default PARTITION OF schema_registry DEFAULT;
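
Reading from the registry is then a single indexed lookup. Below is a minimal sketch using Go's database/sql, assuming a PostgreSQL driver is registered elsewhere; the LatestSchema function and its caller-supplied *sql.DB are illustrative rather than a prescribed data-access layer.

go
package registry

import (
    "context"
    "database/sql"
)

// LatestSchema returns the highest registered version and its definition for
// a topic; it returns sql.ErrNoRows if the topic has never been registered.
func LatestSchema(ctx context.Context, db *sql.DB, topic string) (int, []byte, error) {
    var version int
    var definition []byte
    err := db.QueryRowContext(ctx, `
        SELECT version, definition
        FROM schema_registry
        WHERE topic_name = $1
        ORDER BY version DESC
        LIMIT 1`, topic).Scan(&version, &definition)
    return version, definition, err
}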

Scaling Strategy

Scaling a Kafka-based system requires a multi-dimensional approach. We don't just "add more servers"; we manage partition distribution and consumer group rebalancing.

  1. Partitioning Strategy: We scale by increasing the number of partitions. For 1M+ events/sec, we typically aim for 1,000 to 10,000 partitions across the cluster (see the sizing sketch after this list).
  2. Segment Compaction: For keyed, last-value-wins data (e.g., user profile updates), we use cleanup.policy=compact so only the latest record per key is retained.
  3. Tiered Storage: To retain a full week of data (~700 TB pre-replication) without unbounded local disk costs, we offload older segments from local NVMe drives to S3 using tiered storage (available in modern distributions such as Confluent Platform and in Uber's internal fork).
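
As a rough back-of-the-envelope for point 1, partition counts can be estimated from peak throughput; the per-partition rate and headroom figures below are assumptions that vary widely with hardware, message size, and acks settings.

go
package capacity

// EstimatePartitions is a sizing heuristic, not a Kafka API: it divides the
// peak event rate by an assumed sustainable per-partition write rate and adds
// headroom so the cluster can grow without repartitioning.
func EstimatePartitions(peakEventsPerSec, perPartitionEventsPerSec, headroom float64) int {
    if perPartitionEventsPerSec <= 0 || headroom <= 0 {
        return 0
    }
    return int(peakEventsPerSec/perPartitionEventsPerSec*headroom + 0.5)
}

// Example: 2.5M events/sec peak, ~2,000 events/sec sustained per partition,
// and 2x headroom gives EstimatePartitions(2.5e6, 2000, 2) == 2500 partitions,
// squarely inside the 1,000-10,000 range above.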

Failure Modes and Resiliency

In a distributed system, failure is the norm. We must design for "Partial Failure" where the network is slow, or a subset of brokers is down.

  • In-Sync Replicas (ISR) Shrinkage: If a broker lags, it is removed from the ISR. When the ISR drops below min.insync.replicas, acks=all writes are rejected with NOT_ENOUGH_REPLICAS; our producer must handle this by buffering locally or failing over to a secondary region.
  • Consumer Lag: We monitor per-partition offset lag. If lag exceeds a threshold, we trigger auto-scaling for our Flink/Go consumer pods.
  • Dead Letter Queues (DLQ): Events that fail schema validation or processing are routed to a separate "quarantine" topic for manual inspection or automated reprocessing (a minimal routing sketch follows this list).
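
A minimal sketch of that DLQ routing with segmentio/kafka-go is shown below; the dlq writer is assumed to be constructed with its Topic set to a quarantine topic (e.g., a hypothetical ingest.events.dlq), and the header names are illustrative.

go
package ingestion

import (
    "context"
    "time"

    "github.com/segmentio/kafka-go"
)

// routeToDLQ copies a failed event to the quarantine topic, tagging it with
// the failure reason and its origin so reprocessing jobs can filter by error
// class without re-parsing payloads.
func routeToDLQ(ctx context.Context, dlq *kafka.Writer, msg kafka.Message, cause error) error {
    return dlq.WriteMessages(ctx, kafka.Message{
        Key:   msg.Key,
        Value: msg.Value,
        Time:  time.Now(),
        Headers: []kafka.Header{
            {Key: "x-dlq-reason", Value: []byte(cause.Error())},
            {Key: "x-original-topic", Value: []byte(msg.Topic)},
        },
    })
}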

Conclusion

Designing a log and event ingestion system using Kafka Classic requires a clear understanding of the CAP trade-offs. By choosing acks=all together with min.insync.replicas, we prioritize Consistency and Partition Tolerance over Availability (CP): when too few replicas are in sync, writes are rejected rather than silently under-replicated. This ensures that once our ingestion service confirms a write, that data is durable, which is a requirement for financial systems like Stripe or audit-heavy environments at Netflix.

The key patterns for success involve shifting complexity left: validate schemas early, batch aggressively at the producer level, and use a robust partitioning strategy to ensure your consumers can scale alongside your producers. As you move from 1,000 to 1 million users, the bottlenecks will shift from disk I/O to network throughput and rebalance overhead. Mastering the configuration of the distributed log is not just a devops task; it is the core of reliable distributed systems design.

References:

  • https://kafka.apache.org/documentation/
  • https://eng.uber.com/reliable-assets-delivery-reliability-platform/
  • https://netflixtechblog.com/evolution-of-the-netflix-data-pipeline-da24634629f4
  • https://engineering.linkedin.com/blog/2019/04/kafka-replication-benchmarking