Multi-Tenant Database (Shopify / SaaS)

In the world of Software-as-a-Service (SaaS), the database architecture is the most consequential decision a founding engineering team will make. At the scale of Shopify or Stripe, the challenge isn't just storing data; it’s providing the illusion of a private, high-performance environment for millions of distinct customers while running on a shared, cost-effective infrastructure. This is the essence of multi-tenancy.

The architectural journey usually begins with a single database instance where every row is tagged with a tenant_id. However, as the system grows, you encounter the "Noisy Neighbor" effect: a single tenant's massive flash sale or runaway reporting query consumes all IOPS, degrading performance for everyone else. Transitioning from a monolithic database to a distributed, multi-tenant system requires a deep understanding of tenant isolation, routing layers, and data locality.

Designing for multi-tenancy involves balancing three competing forces: isolation, cost-efficiency, and operational complexity. High isolation (one database per tenant) offers the best security and performance guarantees but is prohibitively expensive and difficult to manage at scale. Low isolation (shared tables) is cheap but risky. Modern distributed systems usually land on a hybrid approach—logical isolation via sharding and physical isolation via "pods" or "cells."

Requirements

To build a production-grade multi-tenant system, we must define clear boundaries for how data behaves across tenant lines. The following requirements guide the design of a system capable of handling millions of stores:

  1. Strict isolation: one tenant must never be able to read or write another tenant's data.
  2. Noisy-neighbor protection: a traffic spike from one tenant must not degrade performance for others.
  3. Transparent routing: application code should not need to know which physical database holds a given tenant.
  4. Online rebalancing: tenants must be movable between pods without downtime or application changes.

Capacity Estimation

When designing for a Shopify-scale platform, we must estimate the load to ensure our routing and storage layers can handle the throughput.

Metric | Target Scale
------ | ------------
Total Tenants | 1,000,000
Active Monthly Tenants | 200,000
Peak Requests per Second (RPS) | 500,000
Average Database Size | 500 TB
Daily Order Volume | 50,000,000
Tenant Growth Rate | 15% MoM
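
A quick sanity check on these figures: 50,000,000 orders per day averages out to roughly 50,000,000 / 86,400 ≈ 580 order writes per second, so the 500,000 peak RPS is dominated by reads and storefront traffic rather than order writes. If we assume, purely for illustration, that a single pod can sustain about 10,000 RPS, peak load alone implies on the order of 50 pods, before accounting for 15% MoM tenant growth.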

High-Level Architecture

The architecture relies on a "Cellular" or "Pod-based" design. Instead of one giant cluster, we group tenants into isolated units called Pods. Each Pod is a self-contained environment with its own database shards, caches, and worker nodes.

In this model, the Tenant Router is the critical component. It inspects incoming requests (via headers or subdomains), queries a global Tenant Registry to find which Pod the tenant lives in, and proxies the request. This allows us to move tenants between pods for rebalancing without changing the application logic.
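
To make that flow concrete, here is a minimal sketch of the registry lookup the router performs. The Registry interface, the memRegistry implementation, and the pod addresses are all hypothetical; in production the mapping would live in a strongly consistent store (e.g., etcd or a replicated SQL table) rather than in process memory.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Registry maps a tenant to the pod that owns it. (Hypothetical interface;
// a real deployment would back this with a consistent store.)
type Registry interface {
	PodFor(tenantID string) (string, error)
}

// memRegistry is an illustrative in-memory implementation.
type memRegistry struct {
	mu   sync.RWMutex
	pods map[string]string // tenantID -> pod address
}

func (r *memRegistry) PodFor(tenantID string) (string, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	pod, ok := r.pods[tenantID]
	if !ok {
		return "", errors.New("unknown tenant: " + tenantID)
	}
	return pod, nil
}

// Assign moves a tenant to a new pod, e.g. during rebalancing.
func (r *memRegistry) Assign(tenantID, pod string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pods[tenantID] = pod
}

func main() {
	reg := &memRegistry{pods: map[string]string{}}
	reg.Assign("shop_9921", "pod-7.internal")

	// The router resolves the pod per request, so moving a tenant is a
	// single registry write; no application changes required.
	pod, err := reg.PodFor("shop_9921")
	if err != nil {
		panic(err)
	}
	fmt.Printf("Routing shop_9921 to %s\n", pod)
}
```

The key property is that moving a tenant is a single registry write; the router picks up the new pod on the next request.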

Detailed Design: Tenant Routing Logic

The routing layer must be extremely low-latency. We often use a combination of consistent hashing and a lookup table. Below is a Go-based implementation of a Tenant Router that handles shard mapping.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

type Shard struct {
	ID         int
	ConnString string
}

type Router struct {
	Shards map[int]Shard
}

// GetShardID uses a simple modulo hash for demonstration.
// In production, use a lookup table in Redis for explicit control.
func (r *Router) GetShardID(tenantID string) int {
	hash := sha256.Sum256([]byte(tenantID))
	val := binary.BigEndian.Uint64(hash[:8])
	return int(val % uint64(len(r.Shards)))
}

func main() {
	router := &Router{
		Shards: map[int]Shard{
			0: {ID: 0, ConnString: "db-shard-0.internal"},
			1: {ID: 1, ConnString: "db-shard-1.internal"},
			2: {ID: 2, ConnString: "db-shard-2.internal"},
		},
	}

	tenantID := "shop_9921"
	shardID := router.GetShardID(tenantID)
	fmt.Printf("Routing tenant %s to shard %d\n", tenantID, shardID)
}
```

This logic ensures that all data for shop_9921 always lands on the same physical database, which is crucial for maintaining ACID guarantees within a single tenant's scope.
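
The comment in GetShardID points at the production variant: consult an explicit lookup table first and fall back to the hash only for unmapped tenants, so individual tenants can be pinned or relocated. A minimal sketch building on the Router type above (the overrides argument is a stand-in for the Redis lookup; this helper is not from the original post):

```go
// GetShardIDWithOverride consults an explicit tenant -> shard mapping
// before falling back to hashing. Tenants present in overrides can be
// pinned to a specific shard (for example, mid-migration); all other
// tenants hash exactly as before.
func (r *Router) GetShardIDWithOverride(tenantID string, overrides map[string]int) int {
	if shardID, ok := overrides[tenantID]; ok {
		return shardID
	}
	return r.GetShardID(tenantID)
}
```

A typical migration flips this override only after the tenant's data has been copied to the target shard.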

Database Schema

We use a "Shared Schema, Multi-sharded" approach. Every table must include a tenant_id column to allow for horizontal partitioning (sharding). PostgreSQL with the Citus extension and Vitess (for MySQL) are the industry-standard implementations of this pattern.

To optimize performance, we use Composite Primary Keys. Instead of just id, we use (tenant_id, id). This ensures that the database engine can colocate data for the same tenant on the same physical disk pages, significantly reducing disk I/O during joins.

```sql
CREATE TABLE orders (
    tenant_id UUID NOT NULL,
    order_id UUID NOT NULL,
    customer_id UUID,
    total_price DECIMAL(12,2),
    created_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (tenant_id, order_id)
) PARTITION BY HASH (tenant_id);

CREATE INDEX idx_orders_tenant_created ON orders (tenant_id, created_at DESC);
```
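
Every query must then lead with tenant_id so it matches the composite key and stays on a single shard. A hedged example using Go's database/sql (the DSN and tenant UUID are placeholders; in this design the router would supply the shard's connection string):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works
)

func main() {
	// Placeholder DSN; in practice the router hands us the connection
	// string for the shard that owns this tenant.
	db, err := sql.Open("postgres", "postgres://app@db-shard-0.internal/orders?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Placeholder tenant UUID.
	tenantID := "00000000-0000-0000-0000-000000000001"

	// Leading with tenant_id matches the composite primary key and the
	// (tenant_id, created_at DESC) index, so only this tenant's pages are read.
	rows, err := db.Query(
		`SELECT order_id, total_price
		   FROM orders
		  WHERE tenant_id = $1
		  ORDER BY created_at DESC
		  LIMIT 50`, tenantID)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var orderID string
		var total float64
		if err := rows.Scan(&orderID, &total); err != nil {
			log.Fatal(err)
		}
		fmt.Println(orderID, total)
	}
}
```

An ORM or query builder can enforce this by injecting the tenant_id predicate automatically, which also guards against accidental cross-tenant reads.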

Scaling Strategy

Scaling from 1,000 to 1,000,000 tenants requires moving from a single database to a pod-based architecture. As a pod reaches its capacity (typically defined by CPU or IOPS limits), we "split" the pod or migrate heavy tenants to a fresh pod, as sketched after the table below.

Strategy | Complexity | Isolation | Cost
-------- | ---------- | --------- | ----
Single DB (Shared) | Low | Low | Low
Schema-per-Tenant | Medium | Medium | Medium
Database-per-Tenant | High | High | High
Pod-based Sharding | Very High | High | Optimized
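
Pod splitting is easiest to operate when the trigger is explicit. The sketch below selects the heaviest tenants in an overloaded pod as migration candidates; the PodStats shape, the 80% threshold, and migrationCandidates itself are illustrative assumptions, not part of the original design.

```go
package main

import (
	"fmt"
	"sort"
)

// PodStats is a hypothetical snapshot of a pod's utilization; a real
// system would populate it from the metrics pipeline.
type PodStats struct {
	CPUUtil    float64            // 0.0-1.0
	IOPSUtil   float64            // 0.0-1.0
	TenantLoad map[string]float64 // tenantID -> share of pod load
}

// migrationCandidates picks the heaviest tenants once a pod crosses a
// capacity threshold. The 80% threshold is illustrative; production
// values would be derived from SLOs and observed headroom.
func migrationCandidates(s PodStats, limit int) []string {
	const threshold = 0.80
	if s.CPUUtil < threshold && s.IOPSUtil < threshold {
		return nil // pod is healthy; nothing to move
	}
	ids := make([]string, 0, len(s.TenantLoad))
	for id := range s.TenantLoad {
		ids = append(ids, id)
	}
	// Heaviest tenants first: moving them frees the most capacity.
	sort.Slice(ids, func(i, j int) bool {
		return s.TenantLoad[ids[i]] > s.TenantLoad[ids[j]]
	})
	if len(ids) > limit {
		ids = ids[:limit]
	}
	return ids
}

func main() {
	stats := PodStats{
		CPUUtil:  0.91,
		IOPSUtil: 0.72,
		TenantLoad: map[string]float64{
			"shop_9921": 0.40, "shop_1138": 0.25, "shop_0042": 0.05,
		},
	}
	fmt.Println("migrate:", migrationCandidates(stats, 2))
}
```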

Failure Modes and Resilience

In a multi-tenant system, a fault in the routing layer or a single "heavy" tenant can cascade into a platform-wide outage. We implement Circuit Breakers and Tenant-Level Rate Limiting to protect the fleet.

If Tenant A starts sending 100x their normal traffic, the rate limiter (implemented at the Gateway) drops their requests before they hit the database. If a specific database shard becomes unresponsive, the Circuit Breaker trips for only that shard, allowing the rest of the system to function normally.
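
A common way to implement the tenant-level rate limiter is one token bucket per tenant at the gateway. Below is a minimal sketch using golang.org/x/time/rate; the flat per-tenant budget is an assumption, since real limits would come from the tenant's plan or recent traffic baseline.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// TenantLimiter keeps one token bucket per tenant, so a spike from one
// tenant exhausts only its own bucket, never the shared fleet capacity.
type TenantLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit // sustained requests/second per tenant (assumed flat here)
	burst    int
}

func NewTenantLimiter(rps rate.Limit, burst int) *TenantLimiter {
	return &TenantLimiter{
		limiters: make(map[string]*rate.Limiter),
		rps:      rps,
		burst:    burst,
	}
}

// Allow reports whether the tenant's next request fits its budget.
// Called at the gateway before the request reaches any database shard.
func (t *TenantLimiter) Allow(tenantID string) bool {
	t.mu.Lock()
	lim, ok := t.limiters[tenantID]
	if !ok {
		lim = rate.NewLimiter(t.rps, t.burst)
		t.limiters[tenantID] = lim
	}
	t.mu.Unlock()
	return lim.Allow()
}

func main() {
	// Illustrative budget: 100 req/s sustained, bursts up to 200.
	tl := NewTenantLimiter(100, 200)
	for i := 0; i < 3; i++ {
		fmt.Println("tenant_a allowed:", tl.Allow("tenant_a"))
	}
}
```

Because each tenant can only drain its own bucket, Tenant A's 100x spike becomes fast rejections at the gateway instead of pressure on the shared database shard.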

Conclusion

Designing a multi-tenant database system is a game of managing trade-offs. While the "Shared Schema" approach provides the best cost-efficiency, it requires rigorous engineering discipline to prevent data leaks and performance degradation. By adopting a pod-based architecture and a robust routing layer, organizations can achieve the horizontal scalability required to grow from a few hundred tenants to millions.

Key takeaways for any staff engineer embarking on this path:

  1. Always include tenant_id in your primary keys to enable future sharding.
  2. Decouple the application from the physical database using a routing service.
  3. Invest in observability to identify "noisy neighbors" before they impact the entire pod.
  4. Automate tenant migrations early; you will eventually need to rebalance your shards.

References

https://shopify.engineering/sharding-is-hard-how-shopify-scales
https://www.citusdata.com/blog/2016/10/03/designing-your-saas-database-for-scale/
https://stripe.com/blog/scaling-the-shard-manager
https://aws.amazon.com/blogs/apn/calculating-tenant-costs-in-saas-environments/