Payment Processing System (Stripe / PayPal)

Designing a payment processing system is one of the most challenging tasks for a software engineer. Unlike a social media feed, where a missed post is a minor inconvenience, a payment system deals with the atomic transfer of value. If you fail to record a transaction, you lose money; if you record it twice, you lose trust. Companies like Stripe and PayPal have built multi-billion-dollar businesses by abstracting the sheer complexity of global financial rails into simple APIs.

At its core, a payment system is a distributed state machine that must guarantee "exactly-once" processing while interacting with inherently unreliable third-party gateways and legacy banking systems. This requires a shift in mindset from "availability at all costs" to "consistency above all else." In this post, we will explore the architecture of a production-grade payment system, focusing on idempotency, the dual-write problem, and high-precision ledgering.

Requirements

To design a system that scales to the level of Uber or Netflix, we must define clear boundaries for functional and non-functional requirements.

Capacity Estimation

For a system handling 100 million transactions per month, we can estimate the following:

| Metric | Value |
| --- | --- |
| Total Transactions / Month | 100,000,000 |
| Average TPS (Transactions Per Second) | ~40 TPS |
| Peak TPS (10x average) | 400 TPS |
| Storage per Transaction | 2 KB |
| Monthly Storage Growth | 200 GB |
| Availability Target | 99.999% (5 Nines) |
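
The figures above can be checked with a few lines of arithmetic. The sketch below assumes a 30-day month and decimal units (1 KB = 1,000 bytes); the function names are illustrative. The exact average comes out slightly below the rounded 40 TPS, and the downtime budget is what a 99.999% target would imply over a year.

```go
package main

import "fmt"

// Inputs taken from the estimate above. The 30-day month and the decimal
// interpretation of "2 KB" are assumptions of this sketch.
const (
	txPerMonth      = 100_000_000
	secondsPerMonth = 30 * 24 * 60 * 60
	peakMultiplier  = 10
	bytesPerTx      = 2_000
	availability    = 0.99999
)

// averageTPS returns the mean number of transactions per second.
func averageTPS() float64 {
	return float64(txPerMonth) / float64(secondsPerMonth)
}

// monthlyStorageGB returns the storage growth per month in decimal gigabytes.
func monthlyStorageGB() float64 {
	return float64(txPerMonth) * float64(bytesPerTx) / 1e9
}

// downtimeMinutesPerYear returns the downtime budget implied by the
// availability target over a 365-day year.
func downtimeMinutesPerYear() float64 {
	const minutesPerYear = 365 * 24 * 60
	return (1 - availability) * minutesPerYear
}

func main() {
	avg := averageTPS()
	fmt.Printf("average TPS:       %.1f\n", avg)
	fmt.Printf("peak TPS:          %.0f\n", avg*peakMultiplier)
	fmt.Printf("storage per month: %.0f GB\n", monthlyStorageGB())
	fmt.Printf("downtime budget:   %.2f min/year\n", downtimeMinutesPerYear())
}
```

The throughput is modest; the hard part of the design is correctness under failure rather than raw request volume.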

High-Level Architecture

The architecture separates the public-facing API from the complex, long-running execution of financial transactions. This decoupling allows us to provide fast responses to the user while the heavy lifting happens asynchronously.

The Payment Service acts as the entry point, performing basic validation and persisting a "Pending" record. The Payment Executor consumes events and interacts with External Payment Service Providers (PSPs). The Ledger Service is the immutable source of truth for all fund movements.
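
One way to picture the hand-off between these services is as a small state machine over the payment record. The sketch below is illustrative only: PENDING and SUCCESS appear in this post, while PROCESSING, FAILED, UNKNOWN, and the transition table itself are assumptions made for the example.

```go
package main

import "fmt"

// Status is a payment lifecycle state. PENDING and SUCCESS come from the
// article; the other states are hypothetical additions for this sketch.
type Status string

const (
	Pending    Status = "PENDING"    // persisted by the Payment Service
	Processing Status = "PROCESSING" // picked up by the Payment Executor
	Success    Status = "SUCCESS"    // PSP confirmed the charge
	Failed     Status = "FAILED"     // PSP rejected the charge
	Unknown    Status = "UNKNOWN"    // PSP call timed out; outcome not yet known
)

// allowed lists the legal next states for each state. Terminal states
// (SUCCESS, FAILED) have no entry and therefore no outgoing transitions.
var allowed = map[Status][]Status{
	Pending:    {Processing},
	Processing: {Success, Failed, Unknown},
	Unknown:    {Success, Failed},
}

// CanTransition reports whether moving from one state to another is legal.
func CanTransition(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Pending, Processing)) // true
	fmt.Println(CanTransition(Success, Pending))    // false
}
```

Rejecting illegal transitions at a single choke point makes it harder for a retried or out-of-order event to move a completed payment back into an earlier state.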

Detailed Design: Idempotency and Exactly-Once Processing

The most critical pattern in payment systems is the Idempotency Key. If a client retries a request due to a network timeout, the system must recognize the request and return the previous result rather than processing a second charge.

Below is a Go-based implementation of an idempotency wrapper used by the Payment Executor.

go
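package payment

import (
    "context"
    "errors"
    "log"
)

// ErrNotFound is the sentinel a store is assumed to return from Get when no
// response has been saved for a key.
var ErrNotFound = errors.New("idempotency key not found")

// PaymentRequest, ChargeResult, and Response are minimal placeholder types.
type PaymentRequest struct {
    AmountMinor int64  // amount in minor units, e.g. cents
    Currency    string // ISO 4217 code, e.g. "USD"
}

type ChargeResult struct {
    ID string
}

type Response struct {
    Status        string
    TransactionID string
}

type IdempotencyStore interface {
    Get(ctx context.Context, key string) (*Response, error)
    Save(ctx context.Context, key string, resp *Response) error
}

type PaymentProcessor struct {
    store IdempotencyStore
    // executeExternalCharge calls the PSP. It is a field so that it can be
    // stubbed in tests.
    executeExternalCharge func(ctx context.Context, req PaymentRequest) (*ChargeResult, error)
}

func (p *PaymentProcessor) ProcessPayment(ctx context.Context, idempotencyKey string, req PaymentRequest) (*Response, error) {
    // 1. Return the stored response if this key has already been processed.
    cachedResp, err := p.store.Get(ctx, idempotencyKey)
    switch {
    case err == nil:
        return cachedResp, nil
    case !errors.Is(err, ErrNotFound):
        // The store itself failed: abort rather than risk a second charge.
        return nil, err
    }

    // 2. Execute the payment logic
    result, err := p.executeExternalCharge(ctx, req)
    if err != nil {
        return nil, err
    }

    resp := &Response{
        Status:        "SUCCESS",
        TransactionID: result.ID,
    }

    // 3. Save the result so that retries return it instead of charging again.
    if err := p.store.Save(ctx, idempotencyKey, resp); err != nil {
        // Critical failure: the payment succeeded but the record was not saved.
        // A retry with this key would charge again, so flag it for reconciliation.
        log.Printf("CRITICAL: charge %s succeeded but idempotency save failed: %v", result.ID, err)
    }

    return resp, nil
}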

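This pattern ensures that sequential retries of a POST /v1/charges request with the same key bill the customer only once. On its own it does not cover two requests with the same key arriving concurrently, because both can miss the lookup before either saves a result. Closing that gap requires reserving the key atomically (for example, with a unique constraint on the key column) before calling the PSP.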
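
The check-then-save flow can be tightened by reserving the key before the charge is attempted. The following is a minimal in-memory sketch of that idea, not the Payment Executor's actual code: Processor, record, and countCharges are hypothetical names, and the mutex-protected map stands in for whatever atomic "insert if absent" operation a real store provides.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// record tracks one idempotency key. done is closed once the result is known.
type record struct {
	done   chan struct{}
	result string
}

// Processor is a hypothetical in-memory stand-in for a store that can
// atomically insert a key only if it is absent.
type Processor struct {
	mu      sync.Mutex
	records map[string]*record
	charge  func() string // stand-in for the external PSP call
}

func NewProcessor(charge func() string) *Processor {
	return &Processor{records: make(map[string]*record), charge: charge}
}

// Process runs charge at most once per key. Callers that arrive while the
// first request is still in flight wait for its result instead of charging.
func (p *Processor) Process(key string) string {
	p.mu.Lock()
	rec, seen := p.records[key]
	if !seen {
		// Reserve the key while holding the lock, before any charge is made.
		rec = &record{done: make(chan struct{})}
		p.records[key] = rec
	}
	p.mu.Unlock()

	if seen {
		<-rec.done // wait for the owner of the reservation to finish
		return rec.result
	}

	rec.result = p.charge()
	close(rec.done) // publish the result to any waiting callers
	return rec.result
}

// countCharges sends the given number of concurrent requests with the same
// key and returns how many times the charge function actually ran.
func countCharges(requests int) int64 {
	var calls int64
	p := NewProcessor(func() string {
		atomic.AddInt64(&calls, 1)
		return "txn_1"
	})
	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.Process("key-123")
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&calls)
}

func main() {
	fmt.Println("charges for 50 concurrent requests:", countCharges(50))
}
```

In a multi-node deployment the reservation would have to live in shared storage, and a duplicate request that arrives mid-flight might be answered with a conflict response rather than made to wait. Both choices are outside the scope of this sketch.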

Database Schema

We use a relational database (PostgreSQL) for its ACID compliance. For the ledger, we use an append-only table structure to ensure auditability.

SQL Schema and Indexing Strategy

To handle scale, we partition the ledger_entries table by created_at and index payment_id for fast reconciliation.

sql
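CREATE TABLE payments (
    payment_id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    amount DECIMAL(19, 4) NOT NULL,
    currency CHAR(3) NOT NULL,
    status VARCHAR(20) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_payments_user_id ON payments(user_id);

-- Append-only ledger. Partitions (for example, one per month) must be created
-- before rows can be inserted.
CREATE TABLE ledger_entries (
    ledger_id BIGSERIAL,
    payment_id UUID REFERENCES payments(payment_id),
    account_id UUID NOT NULL,
    debit DECIMAL(19, 4),
    credit DECIMAL(19, 4),
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
    -- On a partitioned table, the primary key must include the partition key.
    PRIMARY KEY (ledger_id, created_at)
) PARTITION BY RANGE (created_at);

-- Supports reconciliation lookups by payment.
CREATE INDEX idx_ledger_entries_payment_id ON ledger_entries(payment_id);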
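
The debit and credit columns suggest a double-entry ledger, although the text does not say so explicitly. Under that assumption, a useful invariant is that the entries written for one payment sum to zero. The sketch below checks it in Go; the Entry type, the account names, and the use of int64 minor units instead of DECIMAL are all choices made for the example.

```go
package main

import "fmt"

// Entry mirrors one ledger_entries row. Amounts are int64 minor units
// (for example, cents); that representation is an assumption of this sketch.
type Entry struct {
	AccountID string
	Debit     int64
	Credit    int64
}

// Balanced reports whether total debits equal total credits across entries.
func Balanced(entries []Entry) bool {
	var debits, credits int64
	for _, e := range entries {
		debits += e.Debit
		credits += e.Credit
	}
	return debits == credits
}

func main() {
	// A 10.00 payment split between a merchant and a fee account.
	payment := []Entry{
		{AccountID: "buyer", Debit: 1000},
		{AccountID: "merchant", Credit: 970},
		{AccountID: "fees", Credit: 30},
	}
	fmt.Println(Balanced(payment)) // true

	// Dropping one leg leaves the books out of balance.
	fmt.Println(Balanced(payment[:2])) // false
}
```

Running a check like this before committing a batch of entries turns a silent accounting error into an immediate, visible failure.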

Scaling Strategy

As the system moves from 1,000 to 1,000,000+ users, the primary bottleneck is the database. We transition from a single primary instance to a sharded architecture.

  1. Database Sharding: Shard the payments and ledger tables by user_id or merchant_id. This ensures that all data for a single entity resides on one physical node, allowing for local transactions.
  2. Async Processing: Use a message queue (Kafka) to decouple the API response from the PSP communication. This absorbs spikes in traffic during events like Black Friday.
  3. Read Replicas: Use dedicated read replicas for the Merchant Dashboard and Reporting services to offload traffic from the primary transactional shards.
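
Shard routing from the first step can be sketched as a hash of the shard key. This is a minimal illustration that assumes a fixed shard count and simple modulo placement; ShardFor is a hypothetical helper, and a production system would more likely use a lookup table or consistent hashing so that adding shards does not move most keys.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ShardFor maps a shard key (a user_id or merchant_id) to one of numShards
// shards. The same key always maps to the same shard, so all rows for one
// entity can be written in a single local transaction.
func ShardFor(key string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return h.Sum32() % numShards
}

func main() {
	const numShards = 16 // assumed shard count for the example
	for _, id := range []string{"user-1", "user-2", "user-1"} {
		fmt.Printf("%s -> shard %d\n", id, ShardFor(id, numShards))
	}
}
```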

Failure Modes and Resilience

In distributed systems, failures are inevitable. A PSP might go down, or a network partition might occur after a payment is authorized but before our database is updated.

Handling the "Unknown" State: If the executor times out while calling Stripe, we do not know whether the charge happened, so we must not assume that it failed. The payment stays unresolved until a retry with the same idempotency key or the reconciliation process confirms the outcome. Three mechanisms support this:

  1. Circuit Breaker: If a PSP consistently returns 5xx errors, trip the circuit and fail over to a secondary PSP (e.g., switch from Stripe to Adyen).
  2. Reconciliation Engine: A background process runs every hour, comparing our internal ledger with the PSP's transaction logs via their API. This is the final safety net for financial integrity.
  3. Exponential Backoff: For transient failures (e.g., 429 Too Many Requests), retry with exponentially increasing delays plus jitter to avoid overwhelming the external gateway.

Conclusion

Building a payment processing system is an exercise in managing complexity and risk. The CAP theorem forces a choice between consistency and availability during a network partition, and for the ledger we choose consistency. By combining idempotency keys, an append-only ledger, and a robust reconciliation engine, we can build a system that remains reliable even when the underlying infrastructure fails.

The key patterns to remember are:

  • Never use floating-point numbers for currency; use integers (cents) or decimal with fixed precision.
  • Idempotency is not optional; it must be at the gate of every state change.
  • Decouple the payment intent from the payment execution to maintain high system throughput.
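
The first point is easy to demonstrate. The sketch below adds the same two amounts as float64 values and as integer cents; the Cents type and both helper functions are illustrative names, not part of any library.

```go
package main

import "fmt"

// Cents is an amount in minor currency units. Integer addition is exact.
type Cents int64

// String formats a non-negative amount as major.minor, e.g. 30 -> "0.30".
func (c Cents) String() string {
	return fmt.Sprintf("%d.%02d", int64(c)/100, int64(c)%100)
}

// sumFloat adds amounts as binary floating-point numbers.
func sumFloat(amounts ...float64) float64 {
	var total float64
	for _, a := range amounts {
		total += a
	}
	return total
}

// sumCents adds amounts as integers.
func sumCents(amounts ...Cents) Cents {
	var total Cents
	for _, a := range amounts {
		total += a
	}
	return total
}

func main() {
	// 0.10 and 0.20 have no exact binary representation, so the sum drifts.
	fmt.Println(sumFloat(0.10, 0.20) == 0.30) // false
	fmt.Println(sumCents(10, 20) == 30)       // true
	fmt.Println(sumCents(10, 20))             // 0.30
}
```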
