Designing Idempotent APIs (Stripe / Payments)
In the world of distributed systems, the network is fundamentally unreliable. Packets drop, connections time out, and services crash at the most inopportune moments. In most domains, a retry is a harmless annoyance; in payments, a retry without idempotency is a financial catastrophe. If a customer clicks "Pay" and the network times out, the client will likely retry the request. Without a robust idempotency strategy, the system risks double-charging the user, leading to support tickets, chargebacks, and a loss of trust.
Designing for idempotency—the property where an operation can be applied multiple times without changing the result beyond the initial application—is a core requirement for any high-stakes financial platform. Companies like Stripe, Uber, and Netflix have popularized the use of an Idempotency-Key header to solve this. This pattern ensures that even if a request is sent ten times, the side effect (the payment) occurs exactly once, and the caller receives the same response every time.
As a staff engineer, your goal isn't just to "make it work" but to build a system that is resilient to race conditions, handles high-concurrency "thundering herds," and maintains strict data integrity under the constraints of the CAP theorem. This post explores the architectural blueprints and implementation details required to build a production-grade idempotency layer.
Requirements
To build an idempotency system that scales to millions of transactions, we must define clear functional boundaries and performance expectations.
Capacity Estimation
For a global payment processor, we might anticipate the following load:
| Metric | Estimate |
|---|---|
| Peak Transactions Per Second (TPS) | 10,000 |
| Average Payload Size | 2 KB |
| Retention Period | 24 - 72 Hours |
| Daily Storage Growth | ~1.7 TB (upper bound: peak TPS sustained for 24 hours, before indexes) |
| Read/Write Ratio | 1:1 (every write is preceded by a check) |
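The storage figure can be sanity-checked with a few lines of arithmetic. This is a rough sketch that takes the table's peak TPS and payload size at face value and assumes the peak rate is sustained all day, so it gives an upper bound on raw payload volume; the function names are ours, and index overhead is not modeled.

```python
SECONDS_PER_DAY = 86_400

def daily_growth_tb(tps: int, payload_bytes: int) -> float:
    """Raw payload written per day, in decimal terabytes."""
    return tps * payload_bytes * SECONDS_PER_DAY / 1e12

def retained_tb(tps: int, payload_bytes: int, retention_hours: int) -> float:
    """Steady-state volume held for the given retention window."""
    return daily_growth_tb(tps, payload_bytes) * retention_hours / 24

if __name__ == "__main__":
    # 10,000 TPS at 2 KB (taken as 2,000 bytes) per stored record
    print(f"per day:  {daily_growth_tb(10_000, 2_000):.3f} TB")
    print(f"72 hours: {retained_tb(10_000, 2_000, 72):.3f} TB")
```

At a 72-hour retention window the store holds roughly three days of data at once, which is the number that matters when sizing disks.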
High-Level Architecture
The idempotency layer sits between the API Gateway and the downstream business logic. It acts as a gatekeeper that checks for the existence of a unique key before allowing the request to proceed to the core Payment Service.
Detailed Design
The core of the system relies on an atomic "Check-and-Set" operation. When a request arrives with an Idempotency-Key, the system must handle three states:
- Started: The request is currently being processed. Any subsequent request with the same key should receive a `409 Conflict`.
- Finished: The request has completed. Subsequent requests return the cached response.
- Not Found: This is a new request.
We use a combination of a relational database for durable record-keeping and a distributed lock (held in a store like Redis) to prevent race conditions.
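```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different payload."""


class IdempotencyManager:
    # IdempotencyRecord is assumed to be an ORM model (e.g. SQLAlchemy) mapped
    # to the idempotency_records table. Keys are looked up globally here; the
    # schema additionally scopes them by user_id.

    def __init__(self, redis_client, db_session):
        self.redis = redis_client
        self.db = db_session

    def get_request_hash(self, payload):
        """Ensures the payload hasn't changed for the same key."""
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def _replay(self, record, request_hash):
        """Return the stored response, rejecting a key reused with a new payload."""
        if record.request_hash != request_hash:
            raise IdempotencyConflict("Idempotency Key reused with different payload")
        return record.response_body, record.status_code

    def handle_request(self, key, payload, execute_logic):
        request_hash = self.get_request_hash(payload)

        # 1. Atomic Check-and-Set: SET NX EX takes the lock only if it is absent
        lock_key = f"lock:idempotency:{key}"
        token = str(uuid.uuid4())  # identifies this holder so release is safe
        if not self.redis.set(lock_key, token, nx=True, ex=30):
            # If lock exists, check if it's already finished in DB
            record = self.db.query(IdempotencyRecord).filter_by(key=key).first()
            if record:
                return self._replay(record, request_hash)
            # If no record but locked, another pod is working on it
            return {"error": "Request in progress"}, 409

        try:
            # 2. Re-check DB inside lock to prevent race conditions
            record = self.db.query(IdempotencyRecord).filter_by(key=key).first()
            if record:
                return self._replay(record, request_hash)

            # 3. Execute the actual business logic (e.g., call Stripe)
            result_body, status_code = execute_logic(payload)

            # 4. Persist the result
            self.db.add(IdempotencyRecord(
                key=key,
                request_hash=request_hash,
                response_body=result_body,
                status_code=status_code,
                created_at=datetime.now(timezone.utc),
            ))
            self.db.commit()
            return result_body, status_code
        finally:
            # Release only our own lock: once the 30 s TTL has expired, another
            # worker may hold it. GET-then-DELETE is not atomic; production
            # code would do this compare-and-delete in a Lua script.
            held = self.redis.get(lock_key)
            if held in (token, token.encode()):
                self.redis.delete(lock_key)
```
Database Schema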
A relational database is preferred over NoSQL for the idempotency store because we require ACID properties to ensure that the payment record and the idempotency record are committed atomically (or at least consistently).
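To make the atomicity point concrete, here is a minimal sketch using SQLite from the Python standard library as a stand-in for the production database. The `payments` table, its columns, and the `record_payment` helper are hypothetical; the point is only that the payment row and the idempotency row commit or roll back together.

```python
import sqlite3

def open_store() -> sqlite3.Connection:
    """In-memory database with a payments table and an idempotency table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE payments (
            id INTEGER PRIMARY KEY,
            user_id TEXT NOT NULL,
            amount_cents INTEGER NOT NULL
        );
        CREATE TABLE idempotency_records (
            key TEXT NOT NULL,
            user_id TEXT NOT NULL,
            response_body TEXT NOT NULL,
            PRIMARY KEY (key, user_id)
        );
    """)
    return conn

def record_payment(conn, key: str, user_id: str, amount_cents: int) -> bool:
    """Write both rows in one transaction; return False on a duplicate key."""
    try:
        with conn:  # commits on success, rolls back if any statement raises
            conn.execute(
                "INSERT INTO payments (user_id, amount_cents) VALUES (?, ?)",
                (user_id, amount_cents),
            )
            conn.execute(
                "INSERT INTO idempotency_records (key, user_id, response_body) "
                "VALUES (?, ?, ?)",
                (key, user_id, f"charged {amount_cents}"),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate (key, user_id): the payment insert above was rolled back too
        return False
```

Calling `record_payment` twice with the same key leaves exactly one row in `payments`. Note that this only links the two local rows; the call to an external PSP cannot be part of the database transaction, which is why the failure modes discussed later still need separate handling.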
To optimize for performance and scale:
- Indexing: A composite index on `(key, user_id)` is essential.
- Partitioning: Since idempotency keys are usually only relevant for 24-72 hours, we use range partitioning by day on `created_at`. This allows us to drop old partitions instantly without the overhead of massive `DELETE` operations.
```sql
CREATE TABLE idempotency_records (
    key VARCHAR(255) NOT NULL,
    user_id VARCHAR(255) NOT NULL,
    request_hash CHAR(64) NOT NULL,
    response_body TEXT,
    status_code INTEGER,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    -- On engines such as PostgreSQL the partition column must be part of the
    -- primary key, so this enforces uniqueness of the full triple rather than
    -- of (key, user_id) alone.
    PRIMARY KEY (key, user_id, created_at)
) PARTITION BY RANGE (created_at);

CREATE INDEX idx_idempotency_lookup ON idempotency_records (key, user_id);
```
Scaling Strategy
As traffic grows from 1,000 to 1,000,000 users, the bottleneck shifts from the database to the distributed locking mechanism and the storage volume.
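Storage volume is the easier of the two to bound, because the table is already range-partitioned on `created_at`. A maintenance job could generate its DDL along the following lines. This is a sketch that assumes PostgreSQL declarative partitioning; the partition naming scheme, the helper names, and the three-day default are illustrative choices, not part of the schema above.

```python
from datetime import date, timedelta

TABLE = "idempotency_records"  # parent table from the schema section

def partition_name(day: date) -> str:
    return f"{TABLE}_{day:%Y%m%d}"

def create_partition_sql(day: date) -> str:
    """DDL for the partition covering [day, day + 1)."""
    next_day = day + timedelta(days=1)
    return (
        f"CREATE TABLE IF NOT EXISTS {partition_name(day)} "
        f"PARTITION OF {TABLE} "
        f"FOR VALUES FROM ('{day.isoformat()}') TO ('{next_day.isoformat()}');"
    )

def drop_partition_sql(day: date) -> str:
    return f"DROP TABLE IF EXISTS {partition_name(day)};"

def maintenance_statements(today: date, retention_days: int = 3) -> list[str]:
    """Create tomorrow's partition ahead of time and drop the newest expired one.

    With a 3-day (72-hour) window, the partition from 4 days ago is the most
    recent one whose rows are all guaranteed to be older than the window.
    """
    return [
        create_partition_sql(today + timedelta(days=1)),
        drop_partition_sql(today - timedelta(days=retention_days + 1)),
    ]
```

Dropping a partition is a metadata operation, so the cost of expiry stays flat no matter how many rows the day contained.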
At the 1M+ user scale, Request Fingerprinting becomes essential. It guards against "Key Hijacking," where a client accidentally sends the same key for two different transactions. By hashing the request body and comparing it to the stored hash, we ensure that a key is only ever replayed for the request that first used it; a mismatch is rejected rather than answered with another transaction's cached response.
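A small sketch of the fingerprint comparison, using the same SHA-256 over key-sorted JSON as the manager class earlier. The `check_key_reuse` helper and its return values are hypothetical names chosen for illustration.

```python
import hashlib
import json

def fingerprint(payload: dict) -> str:
    """SHA-256 of the payload serialized with sorted keys, so field order is irrelevant."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_key_reuse(stored_hash: str, payload: dict) -> str:
    """Classify a retry: 'replay' if it matches the original request, else 'mismatch'."""
    return "replay" if fingerprint(payload) == stored_hash else "mismatch"

if __name__ == "__main__":
    original = {"amount": 100, "currency": "usd"}
    stored = fingerprint(original)
    # Same fields in a different order: treated as the same request
    print(check_key_reuse(stored, {"currency": "usd", "amount": 100}))
    # Same key, different amount: must not receive the cached response
    print(check_key_reuse(stored, {"amount": 200, "currency": "usd"}))
```

Sorting keys matters because two clients (or two retries from the same client library) may serialize the same object with fields in a different order.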
Failure Modes and Resiliency
The most dangerous failure mode is the "Zombie State": the payment service successfully charges the customer via a third-party PSP, but the service crashes before it can save the success response to the idempotency store.
To handle this, we implement:
- Deterministic Request IDs: Pass the `Idempotency-Key` directly to the downstream PSP (like Stripe's `Idempotency-Key` or Adyen's `merchantReference`).
- Reconciliation Workers: If a retry arrives and we find a "Processing" lock that has timed out, the system must query the PSP to see if the transaction actually went through before attempting a new one.
- Circuit Breakers: If the Redis cluster is down, the system should fail-closed (reject payments) rather than risk double-charging, as consistency is more important than availability in finance.
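The recovery rules above can be sketched as a pure decision function. Several things here are assumptions for the sake of the example: a durable `Attempt` record with a `started_at` timestamp (an expired Redis lock leaves no trace on its own), a `psp_lookup` callable that finds a charge by idempotency key, and the built-in `ConnectionError` standing in for whatever exception a real Redis client raises.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional

LOCK_TIMEOUT = timedelta(seconds=30)  # mirrors the 30 s lock TTL used earlier

@dataclass
class Attempt:
    key: str
    started_at: datetime
    finished: bool = False

def decide(attempt: Optional[Attempt], now: datetime,
           psp_lookup: Callable[[str], Optional[dict]]):
    """Return (action, psp_charge) for an incoming request."""
    if attempt is None:
        return "execute", None      # new request
    if attempt.finished:
        return "replay", None       # serve the cached response
    if now - attempt.started_at < LOCK_TIMEOUT:
        return "conflict", None     # still in flight elsewhere: 409
    # Zombie state: the worker died mid-flight. Ask the PSP before charging again.
    charge = psp_lookup(attempt.key)
    if charge is not None:
        return "adopt", charge      # the charge went through; record it, do not re-run
    return "execute", None          # nothing was charged; safe to retry

def acquire_or_reject(try_lock: Callable[[], bool]) -> str:
    """Fail closed: if the lock store is unreachable, reject the payment."""
    try:
        return "locked" if try_lock() else "conflict"
    except ConnectionError:
        return "reject"
```

Keeping the decision separate from the I/O makes each branch easy to unit-test, which is worth the effort for the one code path that decides whether a customer is charged twice.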
Conclusion
Designing for idempotency is a study in balancing consistency and performance. By implementing a multi-layered approach—using Redis for distributed locking, a partitioned relational database for persistence, and request fingerprinting for safety—we can build a system that handles the inherent unreliability of the network.
Key takeaways for production systems:
- Never rely on the client to provide a unique key without validating the payload hash.
- Always use TTLs for locks to prevent deadlocks during service crashes.
- Atomicity is non-negotiable; use database transactions to link your business logic results with the idempotency record.
- Design for failure by ensuring your downstream providers are also treated idempotently.