Designing an Alerting System (PagerDuty-Style)
In the lifecycle of a high-growth technology company, there is a defining moment when "checking the logs" stops being a manual task and becomes a distributed systems challenge. As organizations like Netflix or Uber scale to thousands of microservices, the surface area for failure expands exponentially. An alerting system is not merely a notification wrapper; it is the critical feedback loop of the entire engineering organization. If this system fails, the company is effectively flying blind during a storm.
Designing a PagerDuty-style system requires shifting our mindset from simple CRUD operations to high-throughput event processing and reliable state machines. We are building a system that must, by definition, be more resilient than the infrastructure it monitors. This involves handling massive bursts of events (alert storms) while ensuring that a single critical notification reaches the right engineer at 3:00 AM, regardless of downstream provider failures.
In this post, we will explore the architectural blueprint for a production-grade alerting system, focusing on the trade-offs between consistency and availability, the implementation of escalation logic, and the strategies for scaling to millions of users.
Requirements and Capacity Estimation
To build a robust system, we must first categorize our requirements into functional capabilities and operational constraints.
For capacity estimation, let’s assume we are designing for a mid-to-large enterprise environment similar to Stripe or Airbnb:
| Metric | Value |
|---|---|
| Daily Active Users (Engineers) | 100,000 |
| Events per Second (Average) | 10,000 |
| Events per Second (Peak) | 100,000 |
| Retention Period | 365 Days |
| Storage per Event | ~2 KB |
| Total Daily Storage | ~1.7 TB |
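The daily storage figure follows directly from the average rate: 10,000 events/s × 86,400 s/day × 2 KB ≈ 1.7 TB/day. Over the 365-day retention window, that is roughly 630 TB of raw event data before any compression or tiering to cold storage.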
High-Level Architecture
The architecture follows a decoupled, event-driven pattern. We use an "Ingestion-Processing-Dispatch" pipeline to ensure that bottlenecks in notification delivery do not back-pressure the ingestion of new alerts.
- Ingestion Service: A lightweight Go-based service that validates incoming payloads and pushes them to a Kafka topic.
- Deduplication Service: Uses a sliding window (via Redis) to suppress redundant alerts (e.g., 1,000 "CPU High" events for the same host); a sketch follows this list.
- Alert Processor: The brain of the system. It correlates events to incidents and evaluates rules.
- Escalation Engine: Manages the state machine of an incident (Triggered -> Acknowledged -> Resolved).
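To make the deduplication step concrete, here is a minimal sketch of the sliding-window check. It assumes the go-redis client and a caller-supplied fingerprint (for example, a hash of host plus alert name); the key layout and window handling are illustrative choices, not part of the design above.

```go
package dedup

import (
	"context"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

// Deduplicator suppresses events whose fingerprint has already been seen
// inside a sliding time window, using one Redis sorted set per fingerprint.
type Deduplicator struct {
	rdb    *redis.Client
	window time.Duration
}

// ShouldSuppress records the event and reports whether another event with the
// same fingerprint already arrived inside the window.
func (d *Deduplicator) ShouldSuppress(ctx context.Context, fingerprint string) (bool, error) {
	key := "dedup:" + fingerprint
	now := time.Now()
	cutoff := now.Add(-d.window).UnixMilli()

	pipe := d.rdb.TxPipeline()
	// Drop entries that have fallen out of the window, then record this event.
	pipe.ZRemRangeByScore(ctx, key, "-inf", strconv.FormatInt(cutoff, 10))
	pipe.ZAdd(ctx, key, redis.Z{Score: float64(now.UnixMilli()), Member: now.UnixNano()})
	count := pipe.ZCard(ctx, key)
	pipe.Expire(ctx, key, d.window)

	if _, err := pipe.Exec(ctx); err != nil {
		return false, err
	}
	// More than one member means a duplicate arrived inside the window.
	return count.Val() > 1, nil
}
```

The sorted set keeps one entry per event; trimming by score gives the sliding behavior, and the key expires once the window passes without new events.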
Detailed Design: Escalation Logic
The core complexity lies in the escalation engine. We need to determine who to page based on the service_id, the current time, and the escalation_policy. In distributed systems, we often use a "Worker Pattern" to handle these time-sensitive transitions.
The following Go snippet demonstrates how we might model the escalation state transition:
```go
package escalation

import "time"

type IncidentStatus string

const (
	Triggered    IncidentStatus = "triggered"
	Acknowledged IncidentStatus = "acknowledged"
	Resolved     IncidentStatus = "resolved"
)

type EscalationStep struct {
	DelayMinutes int
	Targets      []string // User IDs or Schedule IDs
}

// EscalationPolicy is an ordered list of steps, included here so the snippet compiles.
type EscalationPolicy struct {
	Steps []EscalationStep
}

type Incident struct {
	ID               string
	ServiceID        string
	CurrentStatus    IncidentStatus
	CurrentStepIndex int
	CreatedAt        time.Time
	LastNotifiedAt   time.Time
}

// Engine owns escalation processing; notification transports are omitted for brevity.
type Engine struct{}

func (e *Engine) dispatchNotification(incident *Incident, step EscalationStep) error {
	// Fan out to step.Targets via the dispatch pipeline (push, SMS, phone call).
	return nil
}

// ProcessEscalation evaluates whether an incident needs to move to the next level.
func (e *Engine) ProcessEscalation(incident *Incident, policy EscalationPolicy) error {
	if incident.CurrentStatus != Triggered {
		return nil // No action needed if acknowledged/resolved.
	}

	currentStep := policy.Steps[incident.CurrentStepIndex]
	elapsed := time.Since(incident.LastNotifiedAt).Minutes()

	if int(elapsed) >= currentStep.DelayMinutes {
		if incident.CurrentStepIndex < len(policy.Steps)-1 {
			incident.CurrentStepIndex++
			// Reset the timer so the next step's delay is measured from this notification.
			incident.LastNotifiedAt = time.Now()
			return e.dispatchNotification(incident, policy.Steps[incident.CurrentStepIndex])
		}
		// Policy exhausted: stay at the final step until acknowledged or resolved.
	}
	return nil
}
```

This logic is typically wrapped in a distributed cron or a task queue (such as Temporal or Sidekiq) that polls for "Triggered" incidents requiring action.
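As one possible shape for that worker, here is a minimal in-process polling loop built on the Engine above. The loadTriggeredIncidents and loadPolicy helpers are hypothetical stand-ins for persistence queries; a production deployment would typically use a durable scheduler (Temporal, a leased cron, etc.) so a single crashed worker cannot silently stop escalations.

```go
package escalation

import (
	"context"
	"log"
	"time"
)

// loadTriggeredIncidents and loadPolicy are hypothetical persistence helpers;
// in a real deployment they would query the primary database.
func (e *Engine) loadTriggeredIncidents(ctx context.Context) []*Incident { return nil }

func (e *Engine) loadPolicy(ctx context.Context, serviceID string) EscalationPolicy {
	return EscalationPolicy{}
}

// RunPoller re-evaluates every triggered incident on a fixed interval until
// the context is cancelled.
func (e *Engine) RunPoller(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, incident := range e.loadTriggeredIncidents(ctx) {
				policy := e.loadPolicy(ctx, incident.ServiceID)
				if err := e.ProcessEscalation(incident, policy); err != nil {
					log.Printf("escalation failed for incident %s: %v", incident.ID, err)
				}
			}
		}
	}
}
```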
Database Schema
We use a relational database for metadata because managing on-call schedules and incident state transitions requires ACID transactions.
To handle scale, we partition the INCIDENT and NOTIFICATION_LOG tables by created_at (range partitioning) and service_id (hash partitioning). This allows us to prune old data efficiently and keep indexes performant.
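As one illustration of that layout, the DDL below sketches a range-partitioned incident table with hash sub-partitions, embedded as a Go migration constant. It assumes PostgreSQL declarative partitioning, and the column names are hypothetical; only the partition keys (created_at, service_id) come from the design above.

```go
package migrations

// incidentDDL is an illustrative sketch assuming PostgreSQL 13+ declarative
// partitioning; column names are hypothetical. Range partitions let old data
// be pruned by detaching a partition, and hash sub-partitions keep any one
// partition's indexes small.
const incidentDDL = `
CREATE TABLE incident (
    id               UUID        NOT NULL,
    service_id       UUID        NOT NULL,
    current_status   TEXT        NOT NULL,
    current_step_idx INT         NOT NULL DEFAULT 0,
    created_at       TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

-- One monthly range partition, itself sub-partitioned by hash of service_id.
CREATE TABLE incident_2024_01 PARTITION OF incident
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
    PARTITION BY HASH (service_id);

CREATE TABLE incident_2024_01_h0 PARTITION OF incident_2024_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
`
```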
Scaling Strategy: From 1K to 1M Users
As we scale, a single monolithic database or a single Kafka partition becomes a bottleneck. We adopt a "Cellular Architecture," similar to how AWS partitions its services.
- Horizontal Sharding: Shard data by organization_id or team_id. This ensures that an alert storm in one company doesn't impact another.
- Read Replicas: Use read replicas for the "On-Call Schedule" UI, while the Escalation Engine hits the primary for consistency.
- Caching: Aggressively cache on-call schedules in Redis. Schedules change infrequently but are read every time an alert triggers; a read-through sketch follows this list.
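To illustrate the caching point above, here is a minimal read-through sketch, again assuming the go-redis client. The Schedule shape and the loadDB callback are hypothetical; the key decision shown is that a Redis outage degrades to a database read rather than a failed page.

```go
package oncall

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

// Schedule is a trimmed-down on-call schedule; the fields are illustrative.
type Schedule struct {
	ScheduleID string   `json:"schedule_id"`
	OnCallIDs  []string `json:"on_call_ids"`
}

// ScheduleCache reads schedules through Redis, falling back to the primary
// database (via the hypothetical loadDB callback) on a miss or a Redis error.
type ScheduleCache struct {
	rdb    *redis.Client
	ttl    time.Duration
	loadDB func(ctx context.Context, scheduleID string) (*Schedule, error)
}

func (c *ScheduleCache) Get(ctx context.Context, scheduleID string) (*Schedule, error) {
	key := "schedule:" + scheduleID

	// Fast path: cached copy in Redis.
	if raw, err := c.rdb.Get(ctx, key).Result(); err == nil {
		var s Schedule
		if jsonErr := json.Unmarshal([]byte(raw), &s); jsonErr == nil {
			return &s, nil
		}
	}
	// A miss (redis.Nil) or any Redis error falls through to the database,
	// so a cache outage never blocks a page.

	s, err := c.loadDB(ctx, scheduleID)
	if err != nil {
		return nil, err
	}
	if buf, jsonErr := json.Marshal(s); jsonErr == nil {
		// Best-effort repopulation; a failed cache write must not block paging.
		_ = c.rdb.Set(ctx, key, buf, c.ttl).Err()
	}
	return s, nil
}
```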
Failure Modes and Resilience
In a distributed alerting system, "Availability" trumps "Consistency" (referencing the CAP Theorem). It is better to send two notifications (At-Least-Once delivery) than to send zero.
- Circuit Breakers: If Twilio is down, the system should automatically fail over to an alternative provider (e.g., MessageBird or AWS SNS); see the failover sketch after this list.
- Dead Letter Queues (DLQ): Any event that fails processing after N retries is moved to a DLQ for manual inspection, preventing "Poison Pill" messages from blocking the pipeline.
- Self-Monitoring (The Watchdog): We run a separate, minimal "Canary" service that sends a heartbeat alert through the entire system every minute. If the heartbeat isn't received, an emergency "Meta-Alert" is triggered via a completely independent path.
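As a rough illustration of provider failover, the sketch below walks an ordered provider list and skips any provider whose circuit is open. The Provider interface, thresholds, and cooldown are all assumptions; in practice a dedicated circuit-breaker library and per-provider health checks would replace this hand-rolled version.

```go
package dispatch

import (
	"context"
	"errors"
	"sync"
	"time"
)

// Provider abstracts an SMS/voice/push vendor (Twilio, MessageBird, SNS, ...).
type Provider interface {
	Name() string
	Send(ctx context.Context, userID, message string) error
}

// breaker is a deliberately simple circuit breaker: after maxFails consecutive
// failures it opens for cooldown, during which the provider is skipped.
type breaker struct {
	mu        sync.Mutex
	fails     int
	openUntil time.Time
}

const (
	maxFails = 3
	cooldown = 2 * time.Minute
)

func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

func (b *breaker) record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.fails = 0
		return
	}
	b.fails++
	if b.fails >= maxFails {
		b.openUntil = time.Now().Add(cooldown)
		b.fails = 0
	}
}

// FailoverDispatcher walks the provider list in priority order, skipping any
// provider whose breaker is open, and stops at the first success.
type FailoverDispatcher struct {
	providers []Provider
	breakers  map[string]*breaker
}

func (d *FailoverDispatcher) Send(ctx context.Context, userID, message string) error {
	for _, p := range d.providers {
		br := d.breakers[p.Name()]
		if br != nil && !br.allow() {
			continue // Breaker open: skip this provider for now.
		}
		err := p.Send(ctx, userID, message)
		if br != nil {
			br.record(err)
		}
		if err == nil {
			return nil
		}
	}
	return errors.New("all notification providers failed or are circuit-open")
}
```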
Conclusion
Designing a PagerDuty-style system is a masterclass in building for the worst-case scenario. By prioritizing availability and partitioning the system into isolated cells, we ensure that the alerting infrastructure keeps working during the very outages it exists to report.
Key takeaways include using Kafka for ingestion buffering, Redis for high-speed deduplication, and a robust state machine for escalation logic. While we lean towards AP in the CAP theorem for notification delivery, we maintain ACID compliance for the underlying schedule and policy metadata. This hybrid approach provides the reliability required for mission-critical operations.