System Design Fundamentals

Failover & Recovery

The Black Friday Crisis

It’s 2 PM on Black Friday. Your e-commerce platform is handling peak traffic—thousands of orders per second flowing through your primary database. Everything is running smoothly until, suddenly, the database server suffers a catastrophic hardware failure. The lights on the rack go dark.

In that instant, every millisecond matters. Your business is hemorrhaging money—every second of downtime costs thousands of dollars. Orders are queuing, customers are refreshing, your monitoring systems are screaming. The question isn’t whether your replica will save you; it’s whether you can promote it fast enough and redirect traffic before your SLAs shatter.

This is the world of failover—the critical process of automatically (or manually) transitioning from a failed primary database to a healthy replica. It’s where the theory of replication meets brutal reality: network partitions, timing windows, in-flight transactions, and split-brain conditions that can corrupt your entire system if handled incorrectly.

We’ve already explored how replicas stay in sync. Now we need to answer the harder question: What happens when the primary dies, and how do we survive it?

What is Failover?

Failover is a coordinated process that answers three essential questions:

  1. Detection: How does the system know the primary has failed?
  2. Promotion: How is a replica elevated to become the new primary?
  3. Redirection: How do applications learn about the new primary and reconnect?

Let’s break this down systematically.

Manual vs. Automatic Failover

Manual failover requires a database administrator to:

  • Monitor the health of the primary
  • Detect the failure (often by noticing connection errors)
  • Manually promote a replica using commands like pg_promote() in PostgreSQL or REPLICAOF NO ONE in Redis
  • Update application configurations or DNS records to point to the new primary
  • Manually re-sync the old primary when it recovers

This approach is slow (human reaction time: minutes) but safe (humans can assess the situation and avoid catastrophic decisions).

Automatic failover uses orchestration software to:

  • Continuously monitor the primary’s health
  • Detect failures within seconds
  • Automatically promote a replica
  • Automatically redirect application connections
  • Optionally re-sync the old primary in the background

This approach is fast (seconds) but risky (can make wrong decisions during network partitions).
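
To make those three phases concrete, here is a minimal sketch of an automated failover loop. It is deliberately abstract: check_primary_health, promote_replica, and update_routing are hypothetical callables standing in for whatever monitoring, promotion, and redirection mechanisms your stack provides.

# Sketch of an automated failover loop. The three callables are hypothetical
# stand-ins for your actual monitoring, promotion, and redirection mechanisms.
import time

MISSED_LIMIT = 3        # consecutive failed health checks before failing over
CHECK_INTERVAL = 2.0    # seconds between health checks

def failover_loop(check_primary_health, promote_replica, update_routing):
    missed = 0
    while True:
        if check_primary_health():
            missed = 0                           # healthy: reset the counter
        else:
            missed += 1
            if missed >= MISSED_LIMIT:           # detection
                new_primary = promote_replica()  # promotion
                update_routing(new_primary)      # redirection
                return new_primary
        time.sleep(CHECK_INTERVAL)

Production orchestrators wrap the promotion step with quorum checks and fencing, which the rest of this section covers.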

Detection: Knowing Something Is Wrong

The first challenge is knowing the primary is actually down. This sounds simple but isn’t. Common detection mechanisms:

Heartbeat Pings: The primary periodically signals that it is alive (or a monitor pings it on a fixed interval). If a configured number of consecutive heartbeats are missed, the primary is declared dead. Patroni (for PostgreSQL) and Redis Sentinel both use this approach.

Primary:    ping → ping → [CRASH] → (no ping)
Monitoring:                          "Primary down after 3 missed pings"

Connection Timeouts: Applications or monitoring systems try to execute a query. If the connection hangs for N seconds without response, the primary is presumed dead.

Consensus-Based Detection: A witness node (or ensemble of nodes) requires agreement that the primary is down before triggering failover. This prevents false positives from network glitches.

Failure Detection Windows: The faster you detect a failure, the sooner you failover—but you risk false positives. A 3-second detection window might catch real failures but also react to temporary network hiccups.

Pro Tip: In production, use a detection window of 5-10 seconds and require multiple failed attempts before declaring the primary dead. This balance prevents hair-trigger failovers while still responding quickly.
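
A concrete probe along those lines might look like the following sketch, which declares the primary dead only after three consecutive failed TCP connection attempts with a 2-second timeout each (the host, port, and thresholds are illustrative):

# Health-probe sketch: require several consecutive connection failures
# before declaring the primary down. Host, port, and thresholds are examples.
import socket
import time

PRIMARY = ("db-primary-001", 5432)   # placeholder host and port
TIMEOUT = 2.0                        # per-attempt connection timeout (seconds)
FAILURES_REQUIRED = 3                # consecutive failures before declaring down

def primary_is_down():
    failures = 0
    while failures < FAILURES_REQUIRED:
        try:
            with socket.create_connection(PRIMARY, timeout=TIMEOUT):
                return False                 # connected: primary is reachable
        except OSError:
            failures += 1                    # timeout or refused connection
            time.sleep(TIMEOUT)              # brief back-off before retrying
    return True                              # several misses in a row: presume dead

A successful TCP connect is a weak signal on its own; real health checks usually also run a trivial query, as described above.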

The Promotion: Making a Replica the New Primary

Once a replica is chosen for promotion, several things must happen:

  1. Stop replication — The replica stops consuming writes from the primary.
  2. Apply any remaining WAL — Any buffered write-ahead logs are applied to ensure the replica has all data it had received but not yet processed.
  3. Promote to primary mode — The replica begins accepting writes from clients.
  4. Update metadata — Internal records reflect that this server is now the primary.

In PostgreSQL:

-- On the replica
SELECT pg_promote();

In Redis (SLAVEOF NO ONE on versions before 5.0):

REPLICAOF NO ONE

Redirection: Telling Applications Where to Connect

Once the replica is promoted, the hard part begins: getting thousands of application connections to the new primary. This is where different strategies emerge:

DNS-Based Failover: The application connects to a DNS name (e.g., db.myapp.com), which resolves to the primary’s IP. During failover, update the DNS record to point to the new primary. Downside: DNS caching means some clients won’t update immediately.

Virtual IP (VIP) Failover: A floating IP address is maintained by the cluster. When the primary fails, the VIP is reassigned to the new primary via ARP (Address Resolution Protocol). The network layer handles the redirection seamlessly. This is fast but requires special network setup.

Proxy-Based Failover: All connections go through a proxy (HAProxy, PgBouncer, etc.). The proxy knows about all replicas and can route new connections to the new primary without DNS changes. Existing connections might need to reconnect, but the proxy handles that logic.

Orchestrator Services: Kubernetes, Patroni, or other orchestration platforms maintain a “service endpoint” that always points to the current primary. The orchestrator updates this endpoint during failover.
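
Whichever redirection mechanism you choose, the application side still needs reconnect-and-retry logic, because connections that were open at failover time will break. A minimal sketch, assuming psycopg2 and a placeholder endpoint (db.myapp.com) that a DNS update, VIP, or proxy repoints at the new primary:

# Client-side reconnect sketch: keep retrying the shared endpoint until it
# reaches the (new) primary. Endpoint, database, and user are placeholders.
import time
import psycopg2

DB_ENDPOINT = "db.myapp.com"   # DNS name, VIP, or proxy in front of the primary

def connect_with_retry(max_wait_seconds=60):
    deadline = time.time() + max_wait_seconds
    while time.time() < deadline:
        try:
            conn = psycopg2.connect(
                host=DB_ENDPOINT, dbname="shop", user="app", connect_timeout=2,
            )
            return conn                  # connected to whichever node is primary now
        except psycopg2.OperationalError:
            time.sleep(1)                # endpoint not repointed yet; retry
    raise RuntimeError("could not reach a primary within the RTO budget")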

The Split-Brain Problem

Here’s a scenario that keeps database engineers awake at night:

The primary database is running fine in Datacenter A. Datacenter B has a replica. A network partition occurs—Datacenter A and B can’t communicate. The monitoring system in Datacenter B detects that the primary is unreachable and promotes its replica.

Now you have two primaries.

Both are accepting writes. Data diverges. When the network heals, you have two conflicting datasets. Which one is correct? Which writes do you lose?

This is the split-brain condition, and it’s catastrophic.

Preventing Split-Brain with Fencing

The solution is fencing—ensuring the old primary can’t write even if it recovers.

STONITH (Shoot The Other Node In The Head): The cluster literally forces the old primary offline. This might mean:

  • Revoking its database credentials
  • Disconnecting its network port
  • In extreme cases, triggering a hardware power-off
  • Revoking write access at the storage layer

The flow below shows how fencing fits into a cross-datacenter failover:

graph TD
    A["Primary in Datacenter A"] -->|write ops| Storage["Shared Storage Layer"]
    B["Replica in Datacenter B"] -->|read ops| Storage

    NetworkPartition["Network Partition Occurs"]

    NetworkPartition --> Detect["Datacenter B detects timeout"]
    Detect --> Promote["Promote replica to primary"]
    Promote --> Fence["STONITH: Revoke Datacenter A's write access"]
    Fence --> SafePromote["Promotion succeeds safely"]

    style SafePromote fill:#90EE90

A simpler version in single-datacenter scenarios: require that the new primary have quorum (agreement from a majority of nodes) before taking over. This prevents isolated nodes from thinking they should promote.
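
A minimal sketch of that quorum rule, where is_alive is a hypothetical per-node health probe and promotion only happens when a strict majority of the cluster is reachable:

# Quorum-check sketch: promote only if a strict majority of nodes is reachable.
# is_alive() is a hypothetical per-node health probe.
def has_quorum(reachable, cluster_size):
    return reachable > cluster_size // 2

def maybe_promote(nodes, is_alive, promote):
    reachable = sum(1 for node in nodes if is_alive(node))   # includes ourselves
    if has_quorum(reachable, len(nodes)):
        promote()   # we can see a majority, so no other majority can exist
    # Otherwise we may be on the minority side of a partition; promoting here
    # could create a split-brain, so we do nothing and wait.

With three nodes, an isolated node sees only itself (1 of 3) and refuses to promote, while the two-node side holds the majority and proceeds.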

RTO and RPO: Failover Metrics

We touched on these before, but they’re critical in the failover context:

RTO (Recovery Time Objective): How long can your database be unavailable? In the Black Friday scenario, your RTO might be 1 minute. Every second beyond that violates your SLA.

RTO = Detection Time + Promotion Time + Redirection Time + Connection Time

Primary fails at 14:00:00
Detection takes 5 seconds  → 14:00:05
Promotion takes 2 seconds  → 14:00:07
DNS update propagates      → 14:00:12
App servers reconnect      → 14:00:17
Total RTO: 17 seconds

RPO (Recovery Point Objective): How much data loss can you tolerate? If you use async replication, your replica might lag by milliseconds or seconds. During failover, any in-flight transactions on the primary are lost.

With async replication, RPO might be 5 seconds (meaning up to 5 seconds of transactions might be lost). With synchronous replication, RPO is near zero.
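
You can estimate your effective RPO by measuring replication lag on the replica. A sketch using psycopg2 and PostgreSQL's pg_last_xact_replay_timestamp() (host and credentials are placeholders):

# Replication-lag probe sketch: approximate RPO as the age of the last
# transaction replayed on the replica. Connection details are placeholders.
import psycopg2

def replication_lag_seconds(replica_host="db-replica-001"):
    conn = psycopg2.connect(host=replica_host, dbname="postgres", user="monitor")
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT COALESCE(
                    EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()),
                    0)
            """)
            return float(cur.fetchone()[0])   # roughly: seconds of writes at risk
    finally:
        conn.close()

Note that on an idle primary this number grows even though nothing is at risk, so treat it as an upper bound.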

Automatic Failover Tools and Platforms

Let’s examine how production-grade failover actually works:

PostgreSQL: Patroni

Patroni is an open-source orchestration platform for PostgreSQL that handles automatic failover:

# patroni.yml configuration
scope: my-postgres-cluster
namespace: /patroni/

postgresql:
  data_dir: /var/lib/postgresql/14/main
  bin_dir: /usr/lib/postgresql/14/bin

  parameters:
    wal_level: replica
    max_wal_senders: 10
    wal_keep_size: 1GB

etcd:
  host: etcd-node-1:2379

restapi:
  listen: 0.0.0.0:8008

watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5

Patroni relies on a distributed consensus store (etcd or Consul) to:

  1. Elect a leader (the primary)
  2. Detect when the leader fails
  3. Elect a new leader from healthy replicas
  4. Atomically update the cluster topology

Timeline with Patroni:

  • Primary failure: 0ms
  • Failure detection: 3-5 seconds (configurable)
  • Leader election: 2-3 seconds
  • New primary promotion: 1-2 seconds
  • DNS/VIP update: 0-30 seconds (depends on your network setup)
  • Total RTO: 6-40 seconds
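
Load balancers and clients can ask Patroni's REST API which node is currently the leader. A sketch using the requests library, assuming the default port 8008 from the config above and Patroni's /primary health endpoint, which answers 200 only on the leader:

# Sketch: find the current leader by probing each node's Patroni REST API.
# Node names are placeholders; /primary returns 200 only on the leader.
import requests

NODES = ["pg-node-1", "pg-node-2", "pg-node-3"]   # placeholder host names

def find_primary(port=8008):
    for node in NODES:
        try:
            resp = requests.get(f"http://{node}:{port}/primary", timeout=2)
            if resp.status_code == 200:
                return node                 # this node is the current leader
        except requests.RequestException:
            continue                        # unreachable node; try the next one
    return None                             # no reachable leader right now

This mirrors the common Patroni-plus-HAProxy setup, where HAProxy health-checks each node's REST endpoint and routes write traffic only to the node that answers 200.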

Redis: Redis Sentinel

Redis Sentinel is specifically designed for Redis failover:

sentinel.conf:
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000

Three Sentinel instances monitor the Redis primary. If 2 of the 3 agree it is down (matching the quorum of 2 in the config above), one Sentinel is elected to lead the failover and promotes one of the replicas to primary.

Key metrics:

  • down-after-milliseconds: 5000 = 5 second detection window
  • Failover can complete in 10-30 seconds
  • Clients use the Sentinel API to discover the current primary (see the sketch below)
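
With the redis-py client, that discovery looks roughly like this sketch (Sentinel host names are placeholders; the service name mymaster matches the config above):

# Sketch: discover and use the current Redis primary through Sentinel (redis-py).
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
    socket_timeout=0.5,
)

# A client that always talks to whichever node is primary right now; after a
# failover, subsequent commands are routed to the newly promoted primary.
primary = sentinel.master_for("mymaster", socket_timeout=0.5)
primary.set("order:12345", "confirmed")

host, port = sentinel.discover_master("mymaster")   # address of the current primary
print(f"Current primary: {host}:{port}")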

MySQL: Orchestrator and MHA

MySQL Orchestrator is a topology management tool:

  • Continuously monitors MySQL instances
  • Detects failures and automatically triggers failover
  • Updates proxies (like ProxySQL) to point to the new primary
  • Handles binlog-based replication intricacies

MySQL MHA (Master High Availability):

  • Monitors replication lag
  • Detects primary failure
  • Selects the most advanced replica as the new primary
  • Repoints the remaining replicas to replicate from the new primary
  • Optionally preserves the old primary for later analysis (GTID-based replication makes the repointing easier)

Handling the Old Primary: Rejoining the Cluster

When the old primary recovers, you have two choices:

1. Demote It to a Replica: The recovering primary becomes a replica of the new primary. In PostgreSQL:

# On the recovered (and cleanly stopped) old primary, rewind it against the new primary
pg_rewind --target-pgdata=/var/lib/postgresql/14/main \
          --source-server="host=new-primary dbname=postgres user=postgres"

pg_rewind is clever. It:

  • Finds the point in the WAL where the old primary's timeline diverged from the new primary's
  • Copies back the blocks that changed on the old primary after that point, discarding its divergent writes
  • Leaves the data directory ready to restart as a replica of the new primary (once standby configuration is in place)

2. Rebuild from Scratch: In some systems (especially with data loss from async replication), you rebuild the old primary as a fresh replica by taking a backup from the new primary.

Split-Brain in Detail: The Network Partition Scenario

Let’s walk through a dangerous scenario:

sequenceDiagram
    participant PA as Primary<br/>Datacenter A
    participant Rep as Replica<br/>Datacenter B
    participant Monitor as Monitoring<br/>Datacenter B
    participant Client as Client

    PA->>Rep: [replication stream]
    PA->>Monitor: [heartbeat]

    Note over PA,Rep: Network partition occurs

    PA--xRep: [replication broken]
    PA--xMonitor: [heartbeat stops]

    Monitor->>Monitor: Detect primary down
    Monitor->>Rep: Promote replica to primary
    Rep->>Rep: Accept writes

    Note over PA: Old primary still alive<br/>but isolated

    PA->>PA: Client writes queued<br/>locally

    Client->>Rep: Connect to new primary
    Client->>Rep: Write data

    Note over PA,Client: Data divergence!

Without fencing, the old primary (still running, still accepting connections from isolated clients) keeps accumulating writes. When the partition heals, you have two primaries with conflicting data.

Solution: Implement STONITH so that when the new primary takes over, the old primary is forcibly shut down or stripped of its write privileges.
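
What fencing looks like depends on your infrastructure. As one illustrative sketch (power_off is a hypothetical stand-in for an IPMI, PDU, or cloud power API), try the gentlest mechanism first and escalate if the node cannot be reached:

# Fencing sketch: try to force the old primary read-only; if it is unreachable,
# escalate to a hard power-off. power_off() is a hypothetical stand-in for your
# IPMI/PDU/cloud API.
import psycopg2

def fence_old_primary(old_primary_host, power_off):
    try:
        conn = psycopg2.connect(
            host=old_primary_host, dbname="postgres", user="postgres",
            connect_timeout=2,
        )
        conn.autocommit = True
        with conn.cursor() as cur:
            # Make read-only the default for new transactions, then reload config.
            cur.execute("ALTER SYSTEM SET default_transaction_read_only = on")
            cur.execute("SELECT pg_reload_conf()")
        conn.close()
    except psycopg2.OperationalError:
        power_off(old_primary_host)   # cannot reach it: shoot the node in the head

Because default_transaction_read_only can still be overridden per session, serious deployments combine this with revoking network or storage access, or cutting power, as listed above.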

Designing for Failover: A Checklist

Here’s what you need to implement production-grade failover:

Requirement              | Manual                     | Automatic
Health monitoring        | Human checks logs          | Automated heartbeat/consensus
Detection speed          | Minutes                    | 5-30 seconds
Promotion                | Admin command              | Orchestrator script
Redirection              | Update config/DNS          | Automatic
Split-brain protection   | Human judgment             | Fencing/quorum
Data loss risk           | Minimal (human oversight)  | Low-moderate (depends on sync mode)
Operational complexity   | Low                        | High (need orchestrator)
Cost                     | Lower (fewer systems)      | Higher (monitoring, orchestrator, quorum)

Practical Failover Timeline Walkthrough

Let’s trace a real failover event for PostgreSQL with Patroni:

14:00:00.000 - Primary database server loses power (hardware failure)
14:00:00.500 - Outgoing network packets from primary stop
14:00:01.200 - Patroni on replica detects heartbeat timeout
14:00:02.100 - Patroni votes through etcd for new leader
14:00:02.500 - Quorum achieved (2 of 3 nodes vote for replica)
14:00:02.800 - pg_promote() executes on replica
14:00:03.100 - New primary accepts first write request
14:00:04.000 - HAProxy detects primary change in Patroni API
14:00:04.200 - HAProxy routes new connections to new primary
14:00:04.500 - Existing connection pools reconnect
14:00:05.200 - All application servers sending traffic to new primary

Total RTO: 5.2 seconds
Total data loss (RPO): 0 bytes (if using synchronous replication)

Testing Failover: Chaos Engineering

The only way to truly know your failover will work is to test it under failure. This means deliberately crashing your primary:

# Chaos engineering script: kill the primary and time the failover
import subprocess
import time

def chaos_kill_primary():
    primary_host = "db-primary-001"
    replica_host = "db-replica-001"

    # Simulate a hard crash of the primary
    subprocess.run(f"ssh {primary_host} 'sudo killall -9 postgres'", shell=True)

    print("Primary killed. Monitoring failover...")
    failover_start = time.time()

    # Poll until the replica has been promoted, i.e. it is no longer in recovery
    while True:
        try:
            result = subprocess.run(
                f"psql -h {replica_host} -tAc 'SELECT pg_is_in_recovery()'",
                shell=True, timeout=2, capture_output=True, text=True,
            )
            if result.returncode == 0 and result.stdout.strip() == "f":
                failover_time = time.time() - failover_start
                print(f"Failover completed in {failover_time:.1f} seconds")
                break
        except subprocess.TimeoutExpired:
            pass
        time.sleep(0.5)

if __name__ == "__main__":
    chaos_kill_primary()

Run this in your staging environment regularly. If it takes longer than your RTO, you have a problem to fix.

Cloud-Managed Failover

If you’re using a cloud provider, they often handle failover for you:

AWS RDS Multi-AZ:

  • Automatically maintains a synchronous replica in a different availability zone
  • Detects primary failure and promotes replica within 60-120 seconds
  • Updates the DNS endpoint automatically
  • You don’t manage any of this—it’s transparent

Google Cloud SQL HA:

  • Similar to RDS—automatic replica in different zone
  • RTO typically under 5 minutes
  • Handles re-syncing the old primary automatically

Azure Database for PostgreSQL:

  • Built-in high-availability with automatic failover
  • Zone-redundant high availability (standby in a different zone) offered in supported tiers

The trade-off: less control, but also less operational burden.

Trade-offs You’ll Face

Speed vs. Safety: Aggressive failure detection (3-second windows) means faster failover but more false positives. Conservative detection (30-second windows) means slower response but fewer mistakes.

Automatic vs. Manual: Automatic failover is faster but can make bad decisions (promoting a replica that’s too far behind). Manual failover gives humans time to think but costs precious minutes.

Synchronous vs. Asynchronous Replication: Sync replication guarantees zero data loss but slows down writes. Async replication is fast but risks losing in-flight transactions.

Complexity: Simple manual failover is easy to understand and debug. Automatic failover with STONITH and consensus requires careful orchestration.

Cost: Maintaining hot standbys (replicas that are ready to be promoted immediately) costs money. Some systems use warm standbys (replicas that need a few minutes of catch-up) to save costs.

Key Takeaways

  • Failover is inevitable: Plan for when (not if) your primary fails. Your application’s resilience depends on how fast you can switch.

  • Detection is half the battle: You can’t failover if you don’t know the primary is down. Use heartbeats, connection timeouts, and consensus mechanisms—but be careful of false positives.

  • Split-brain is your enemy: Network partitions can create two “primary” databases with conflicting data. Use fencing (STONITH) and quorum mechanisms to prevent this.

  • RTO and RPO are your SLAs: Measure how long failover actually takes in your environment. If it exceeds your SLA, you need faster detection, faster promotion, or faster redirection.

  • Automatic failover tools exist: Patroni, Sentinel, Orchestrator, and cloud-managed solutions handle most of the complexity. But understand what they’re doing—don’t treat them as magic.

  • Test it or it doesn’t work: Use chaos engineering to regularly kill your primary in staging. If your failover strategy hasn’t been tested under real failure, you don’t actually have a failover strategy.

Practice Scenarios

Scenario 1: The Detection Delay

Your team has configured Patroni with a 30-second failure detection window (ttl: 30). You need an RTO of 10 seconds. The primary fails at 2:00 PM. What's the minimum failover time, and what needs to change?

Consider: detection time, promotion time, redirection time, and where the bottleneck is.

Scenario 2: Async Replication Data Loss

You’re running MySQL with asynchronous replication, and a replica is 2 seconds behind the primary when the primary crashes. The replica is promoted. During those 2 seconds, 500 write transactions were acknowledged to clients but not yet replicated. What are the consequences, and how would you handle this?

Scenario 3: Split-Brain Prevention

You have a primary in Datacenter A and a replica in Datacenter B. A network partition occurs. Your orchestration system promotes the replica to primary because it can’t reach the old primary. However, the old primary is still running and accepting writes from isolated clients in Datacenter A. Describe how STONITH would prevent data corruption in this scenario.


Now that we understand how databases stay in sync and fail over gracefully, we’re ready to tackle the next challenge: database partitioning and sharding. When your data becomes too large for a single primary-replica pair, we split data across multiple nodes. Instead of copying all data to replicas, we distribute it so each shard holds only a portion of the data. This introduces new complexities around consistency, routing, and cross-shard operations—but it’s the only way to scale beyond a single machine’s capacity.