System Design Fundamentals

Single Points of Failure

The Day the Internet Broke

October 21, 2016. A distributed denial-of-service attack targeted Dyn, a major DNS provider. Dyn’s infrastructure was hammered by millions of requests per second, and for several hours its DNS servers couldn’t respond reliably. This might sound like a problem confined to one DNS vendor, but the impact was far broader: Twitter, GitHub, Netflix, Shopify, and dozens of other major services went offline because they all relied on Dyn for DNS.

A single company’s infrastructure failure cascaded across the internet.

This is the danger of a Single Point of Failure (SPOF). When any single component’s failure brings down your entire system, that component is a SPOF. Even if Dyn’s infrastructure was 99.99% available in normal operation, a single DDoS attack broke thousands of downstream services whose own systems were perfectly healthy.

What Is a Single Point of Failure?

A SPOF is any component where:

  1. The component’s failure causes total system failure
  2. There’s no redundancy or failover capability
  3. The component is critical to the system’s function

SPOFs exist at multiple levels:

Hardware SPOFs

  • Single web server (when one instance handles all traffic)
  • Single database server (no replicas)
  • Single network switch (no redundant paths)
  • Single hard drive (no RAID)
  • Single power supply to a cabinet

Software SPOFs

  • Single instance of a message broker (Kafka, RabbitMQ)
  • Single instance of a cache layer (Redis, Memcached)
  • Single API gateway handling all traffic
  • Single authentication service with no failover

Infrastructure SPOFs

  • Single data center (all servers in one location)
  • Single cloud region (all resources in us-east-1)
  • Single DNS provider (like the Dyn incident)
  • Single internet service provider (ISP)
  • Single certificate authority for SSL/TLS

Organizational SPOFs (The “Bus Factor”)

  • Single person who knows how to deploy to production
  • Single person with access to production databases
  • Single person who understands critical legacy code
  • Single person handling on-call rotations without backup

Did you know: The “bus factor” is the number of team members who would need to be hit by a bus for the project to fail. You want this number to be at least 2 for any critical system. A bus factor of 1 is a SPOF waiting to happen.

Finding SPOFs: Dependency Mapping

You can’t fix SPOFs you don’t know exist. Finding them requires systematic analysis:

1. Dependency Mapping

Draw your system architecture and trace dependencies:

    User Traffic
         ↓
    Load Balancer (SPOF!)
         ├→ Web Server A
         ├→ Web Server B
         └→ Web Server C
              ↓
    Database Primary (SPOF!)
    [No replicas: if the primary fails, all write traffic fails]

2. Failure Mode Analysis

For each component, ask: “What if this fails? What happens?”

  • If Load Balancer fails: All traffic is lost immediately
  • If Web Server A fails: Traffic routes to B and C, no problem
  • If Database Primary fails: reads may continue from replicas (if they exist), but all writes fail
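
This walk-through can be automated once the dependency graph is written down. Below is a minimal sketch of the idea, assuming each component is tagged with a redundancy group (components in the same group back each other up); all component names are illustrative, not from any real system:

    from collections import defaultdict

    # component -> redundancy group; components in the same group back
    # each other up. All names here are illustrative placeholders.
    COMPONENTS = {
        "lb-1": "load-balancer",
        "web-a": "web", "web-b": "web", "web-c": "web",
        "db-primary": "database",
    }

    def find_spofs(components):
        """Flag components whose redundancy group has only one member."""
        members = defaultdict(list)
        for component, group in components.items():
            members[group].append(component)
        return [c for cs in members.values() if len(cs) == 1 for c in cs]

    print(find_spofs(COMPONENTS))  # ['lb-1', 'db-primary']

A real tool would also walk the edges to find components whose loss disconnects users from a critical service, but even this crude grouping check surfaces the load balancer and the lone primary immediately.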

3. Chaos Engineering

Rather than imagining failures, actually cause them in a safe environment:

  • Netflix’s Chaos Monkey randomly terminates production instances
  • Gremlin provides failure-as-a-service platform for chaos testing
  • Game days: you write runbooks describing failure scenarios, then rehearse them to validate that recovery actually works

4. Dependency Graph Tools

Many organizations use visualization tools:

  • Service meshes (Istio) provide automatic service-to-service visibility
  • Distributed tracing (Jaeger, Datadog) shows request flow and pinpoints critical paths
  • Infrastructure-as-code tools (Terraform) can export dependency graphs

Elimination Strategies by Category

Eliminating Hardware SPOFs

Single Web Server Problem:

Before (SPOF):
    Load Balancer → Web Server (single instance)

After (N+1 redundancy):
    Load Balancer → ├─ Web Server A
                    └─ Web Server B

With multiple web servers, one can fail without losing service. Modern systems use auto-scaling groups that automatically add/remove instances based on demand.
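
Redundancy only pays off if traffic actually routes around a dead instance. Here is a small sketch of the load-balancing side, assuming placeholder backend addresses and a cheap TCP health probe (real load balancers like HAProxy or an ALB do this far more robustly):

    import itertools
    import socket

    # Backend addresses are placeholders for illustration.
    BACKENDS = [("web-a.internal", 8080), ("web-b.internal", 8080)]

    def is_up(host, port, timeout=0.5):
        """Cheap TCP health probe: can we open a connection?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Persistent rotation shared across calls (deliberate default arg).
    def pick_backend(rotation=itertools.cycle(BACKENDS)):
        """Return the next healthy backend, skipping dead ones."""
        for _ in range(len(BACKENDS)):
            host, port = next(rotation)
            if is_up(host, port):
                return host, port
        raise RuntimeError("no healthy backends")

The key design point: the balancer must detect failure (the probe) and route around it (the skip), or the second server is redundancy on paper only.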

Single Database Server Problem:

Before (SPOF):
    Database Primary (single instance, single disk)

After (Redundancy + Replication):
    Database Primary
         │ streams WAL (Write-Ahead Log)
         ├─→ Replica A (read-only)
         └─→ Replica B (standby)

Replication copies data to standby instances. If the primary fails, the system promotes a replica to primary and continues.
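
The promotion decision itself is usually a small loop. A simplified sketch of that logic follows; the health-check and promotion calls are placeholders for whatever your database or orchestrator provides (tools like Patroni implement this properly for PostgreSQL):

    import time

    FAILURE_THRESHOLD = 3    # consecutive failed checks before failover
    CHECK_INTERVAL_S = 5

    def monitor(primary, replicas, is_healthy, promote, replication_lag):
        """Promote the most caught-up replica after repeated failures.

        is_healthy, promote, and replication_lag are environment-specific
        stubs supplied by the caller.
        """
        failures = 0
        while True:
            if is_healthy(primary):
                failures = 0
            else:
                failures += 1
                if failures >= FAILURE_THRESHOLD:
                    # Least replication lag = fewest committed writes lost.
                    candidate = min(replicas, key=replication_lag)
                    promote(candidate)
                    return candidate
            time.sleep(CHECK_INTERVAL_S)

Note the threshold: promoting on a single missed check invites split-brain from transient network blips, which is why production failover tools are conservative here.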

Eliminating Infrastructure SPOFs

Single Data Center Problem:

Before (SPOF):
    All resources in us-east-1

After (Multi-AZ):
    us-east-1a  ├─ API Server
                ├─ Database Replica
                └─ Cache

    us-east-1b  ├─ API Server
                ├─ Database Replica
                └─ Cache

Availability Zones (AZs) are separate data centers within a region with independent power, cooling, and networking. AWS designs AZs to be isolated from one another, so a failure in one AZ should not cascade to the others.
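
Spreading instances across AZs can be as simple as launching into explicit zones. A hedged boto3 sketch, in which the AMI ID is a placeholder:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # One instance per AZ, so losing a single AZ does not take out
    # every API server. The AMI ID below is a placeholder.
    for az in ["us-east-1a", "us-east-1b"]:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )

In practice you would more often attach an Auto Scaling group to subnets spanning multiple AZs and let AWS balance placement, but the principle is the same: no single zone holds all the capacity.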

Single Region Problem:

Before (SPOF):
    All resources in us-east-1

After (Multi-Region):
    us-east-1 ├─ Primary Database
              └─ API Servers

    eu-west-1 ├─ Replica Database
              └─ API Servers (read-only or full duplicate)

Geographic redundancy adds complexity but provides resilience to regional outages (natural disasters, provider issues, DDoS attacks).
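
On the client or edge side, geographic redundancy typically appears as region-ordered failover. A minimal sketch, assuming two hypothetical regional endpoints (real systems often push this into DNS, e.g. Route 53 failover routing, rather than application code):

    import urllib.request
    import urllib.error

    # Endpoints are placeholders: prefer the local region, fall back
    # to the remote replica when it is unreachable.
    REGION_ENDPOINTS = [
        "https://api.us-east-1.example.com",
        "https://api.eu-west-1.example.com",
    ]

    def read(path):
        for endpoint in REGION_ENDPOINTS:
            try:
                with urllib.request.urlopen(endpoint + path, timeout=3) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError):
                continue  # region unreachable: try the next one
        raise RuntimeError("all regions unavailable")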

Eliminating Software SPOFs

Single Message Broker Problem:

Before (SPOF):
    Message Producer → Kafka Broker (single instance) → Downstream Consumers

After:
    Message Producer → Kafka Cluster (3+ instances with replication) → Downstream Consumers

Most stateful services support clustering, in which multiple instances replicate state to one another.
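
For Kafka specifically, the redundancy lives at the topic level. A sketch using the kafka-python package, with placeholder broker addresses and a hypothetical topic name:

    from kafka.admin import KafkaAdminClient, NewTopic

    # Broker addresses are placeholders for a 3-broker cluster.
    admin = KafkaAdminClient(
        bootstrap_servers=["broker-1:9092", "broker-2:9092", "broker-3:9092"],
    )

    # replication_factor=3 keeps a copy of each partition on every
    # broker, so any single broker can fail without data loss.
    admin.create_topics([
        NewTopic(name="orders", num_partitions=6, replication_factor=3),
    ])
    admin.close()

Pairing this with a min.insync.replicas setting of 2 ensures writes are acknowledged by at least two brokers before succeeding, closing the window where a freshly written message lives on only one machine.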

Eliminating Organizational SPOFs

Single Person Knowing Deployments:

  • Document deployment procedures in a runbook (wiki, GitHub, etc.)
  • Rotate on-call responsibilities so multiple people practice deployments
  • Automate deployments so no single person controls the process
  • Pair new team members with experienced ones on production changes

Single Person Understanding Legacy Code:

  • Schedule code reviews where others read critical systems
  • Write ADRs (Architecture Decision Records) explaining the why
  • Invest in refactoring to simplify systems
  • Budget time for knowledge transfer, not just feature development

Real-World SPOF Disasters

Let’s look at actual system failures caused by SPOFs:

Case Study 1: Google Calendar (2009)

A single server handled all calendar list operations. When that server was under maintenance, all Google Calendar users got errors. The lesson: even seemingly simple operations need redundancy.

Case Study 2: AWS S3 Outage (2017)

An operator meant to remove a small fraction of S3 capacity for debugging. They fat-fingered the command and removed more than intended. The issue: a single person’s mistake affected the entire us-east-1 region. Solutions AWS implemented: safer API designs, automated rate-limiting of operations, and multi-approval workflows for dangerous commands.

Case Study 3: Twitter Database (2010-2012)

Twitter’s MySQL database became a SPOF during rapid growth. Heavy write load exhausted the primary, and replicas fell behind, becoming unusable for failover. Solution: moving to a distributed database architecture (later Cassandra) that had no single point of failure.

Cost of Eliminating SPOFs

Eliminating every SPOF is theoretically possible but economically irrational. Each elimination requires investment:

    SPOF                  Elimination Cost              Complexity Added
    Single web server     Low (duplicate server)        Low (load balancing)
    Single database       Medium (replication setup)    Medium (consistency issues)
    Single data center    High (multi-AZ setup)         High (latency, sync)
    Single region         Very high (multi-region)      Very high (global coordination)
    Single person         Low (documentation)           Low (time investment)

This is why you prioritize:

  1. Eliminate cheaply-fixed SPOFs first (a duplicate web server, database replicas)
  2. Accept SPOFs with low-impact failures (development environment SPOF is fine)
  3. Invest in high-impact SPOFs (payment processing SPOFs are unacceptable)
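
One way to make this prioritization concrete is a simple risk score: expected annual loss (failure probability × outage cost) relative to elimination cost. A sketch with made-up illustrative figures, not benchmarks:

    # Rank SPOFs by expected annual loss avoided per dollar of fix cost.
    # All figures below are illustrative placeholders.
    spofs = [
        # (name, P(failure per year), outage cost ($), elimination cost ($))
        ("single web server", 0.30, 50_000, 5_000),
        ("single database",   0.10, 200_000, 30_000),
        ("single region",     0.01, 1_000_000, 500_000),
    ]

    for name, p_fail, outage_cost, fix_cost in sorted(
        spofs, key=lambda s: (s[1] * s[2]) / s[3], reverse=True
    ):
        expected_loss = p_fail * outage_cost
        print(f"{name}: expected loss ${expected_loss:,.0f}/yr, "
              f"fix ${fix_cost:,}, ratio {expected_loss / fix_cost:.1f}")

Even rough numbers force the right conversation: the cheap web-server fix usually dominates, and multi-region only pencils out when outage cost is enormous.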

Chaos Engineering in Practice

Rather than waiting for real failures, teams proactively test their resilience:

    Define hypothesis ("the system survives a database failover")
         ↓
    Run experiment (kill the primary database)
         ↓
    Does the system remain available?
         ├─ Yes → hypothesis confirmed → update the runbook
         └─ No  → identify the root cause → add redundancy/monitoring → retest
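
A chaos experiment can be as small as a script that encodes this loop. A minimal harness sketch, where the failure-injection and availability-check steps are environment-specific stubs you supply:

    import time

    def run_experiment(hypothesis, inject_failure, check,
                       checks=12, interval=5):
        """State a hypothesis, inject a failure, verify availability.

        inject_failure and check are placeholders: e.g., a function that
        kills the primary database, and one that probes a health endpoint.
        """
        print(f"Hypothesis: {hypothesis}")
        inject_failure()
        for _ in range(checks):  # observe for checks * interval seconds
            if not check():
                print("FAILED: system became unavailable -> find root cause")
                return False
            time.sleep(interval)
        print("CONFIRMED: system stayed available -> update the runbook")
        return True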

Popular chaos engineering tools:

  • Chaos Monkey (Netflix): Randomly kills production instances
  • Gremlin: SaaS platform for running chaos experiments
  • Pumba: Docker chaos testing
  • Chaos Mesh: Kubernetes-native chaos engineering

Pro tip: Start small. Run one chaos experiment per sprint on non-critical systems. Learn to recover from single-instance failures before you test multi-region disasters.

Key Takeaways

  • SPOFs exist at hardware, software, infrastructure, and organizational levels
  • Every SPOF will eventually fail—the only question is when
  • Find SPOFs through dependency mapping, failure analysis, and chaos engineering
  • Eliminate critical SPOFs; accept SPOFs that don’t impact user experience
  • Build redundancy for hardware and infrastructure; document and rotate for organizational SPOFs
  • Chaos engineering validates that your redundancy actually works

In the next section, we’ll explore how to build redundancy effectively: when active-active beats active-passive, and how geographic redundancy changes the reliability equation.