Redundancy & Fault Tolerance
The Philosophy Behind Redundancy
Commercial aircraft are designed with redundancy everywhere. Multiple engines, multiple hydraulic systems, multiple flight computers. An Airbus A380 has four engines. You only need two to fly safely. Why have the extra two? Because redundancy isn’t just about survival—it’s about the confidence that the system will self-heal when something breaks, without pilots doing anything special.
This is the philosophy we bring to systems engineering. Redundancy isn’t extra capacity sitting unused. It’s the ability for a system to maintain its function when components fail, automatically and without human intervention. That’s fault tolerance.
Understanding redundancy means answering several questions: How many copies do you need? Should all copies serve traffic simultaneously (active-active) or wait in the wings (active-passive)? How do components detect failures and decide to switch? Let’s explore each pattern.
Redundancy Patterns: Active, Passive, and Warm
Active-Active Redundancy
All instances serve traffic simultaneously. If one fails, the others seamlessly absorb the load.
        ┌─────────────────────────┐
        │  Load Balancer / Router │
        └────────────┬────────────┘
                     │
     ┌─────────┬─────┴─────┬─────────┐
     │         │           │         │
   API-A     API-B       API-C     API-D
 (serving) (serving)   (serving) (serving)
Advantages:
- Maximizes resource utilization (no idle backup)
- Better performance (load distributed)
- Graceful degradation (losing one instance just reduces capacity slightly)
Challenges:
- More complex failure detection (detecting that API-C is actually broken, not just slow)
- State synchronization (if instances have local state, they must replicate)
- Coordination (some distributed consensus required)
Best for: Stateless or easily-replicated services (web servers, API gateways, read-only cache layers)
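To make the routing concrete, here is a minimal sketch in Python (the instance names and the health-marking calls are hypothetical, not any particular load balancer's API): it round-robins requests across whichever instances are currently marked healthy, which is all active-active routing fundamentally does.

```python
import itertools

class ActiveActivePool:
    """Round-robin over the instances that are currently marked healthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._cycle = itertools.cycle(self.instances)

    def mark_unhealthy(self, instance):
        self.healthy.discard(instance)

    def mark_healthy(self, instance):
        self.healthy.add(instance)

    def next_instance(self):
        # Skip anything out of rotation; give up after one full pass
        for _ in range(len(self.instances)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")

pool = ActiveActivePool(["api-a", "api-b", "api-c", "api-d"])
pool.mark_unhealthy("api-c")                        # simulate a failed instance
print([pool.next_instance() for _ in range(6)])
# api-c never appears; the remaining three absorb its share of the traffic
```

Losing one instance doesn't interrupt anything; the survivors simply appear more often in the rotation, which is exactly the graceful degradation described above.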
Active-Passive (Hot Standby)
The passive instance mirrors the active one but doesn’t serve traffic. When the active fails, the passive takes over.
Active Instance (serving traffic)
↓
Database
(replicating)
↓
Passive Instance (hot standby, not serving)
Advantages:
- Simpler to reason about (only one version of state at a time)
- Simpler failure detection (only the active instance needs monitoring)
- Less complex synchronization
Challenges:
- Resource waste (passive instance sitting idle, consuming money)
- Failover gap (brief moment when neither serves while switching)
- “Zombie passive” problem (passive might be out of sync without detecting it)
Best for: Stateful services (databases, message brokers) or when licensing/cost is high per instance
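The failover side can be sketched in a few lines. This is only an illustration, assuming hypothetical is_alive and promote hooks that stand in for whatever health check and promotion mechanism your system actually uses:

```python
import time

def run_failover_loop(active, standby, is_alive, promote,
                      check_interval=5.0, failure_threshold=3):
    """Monitor the active node; promote the standby after consecutive missed checks.

    `is_alive(node)` and `promote(node)` are placeholders for whatever health
    check and promotion mechanism your system actually uses.
    """
    misses = 0
    while True:
        if is_alive(active):
            misses = 0
        else:
            misses += 1
            if misses >= failure_threshold:
                promote(standby)                   # standby becomes the new active
                active, standby = standby, active  # keep watching the new topology
                misses = 0
        time.sleep(check_interval)
```

Note how the failover gap mentioned above lives in this loop: up to check_interval × failure_threshold seconds can pass before anyone serves traffic again.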
Warm Standby
A middle ground: the passive instance is partially running, periodically syncing state, but not actively serving requests.
Active Instance (serving)
↓ (periodic sync)
Cache or Snapshot
↓
Passive Instance (warm, ready to serve in seconds)
Advantages:
- Better resource efficiency than hot standby
- Faster recovery than cold standby
- Safer than active-active
Challenges:
- More complexity than either pattern
- State lag between active and passive
- Requires careful sync mechanisms
Best for: Services where you want fast failover but can’t justify keeping a full hot standby running
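A minimal sketch of the sync side, with a staleness measure that makes the "state lag" trade-off above explicit. The class name, interval, and the take_snapshot callable are illustrative assumptions, not a real tool's API:

```python
import time

class WarmStandby:
    """Periodically pull a snapshot from the active node and track how stale it is."""

    def __init__(self, sync_interval=60.0):
        self.sync_interval = sync_interval   # how often the sync job should run
        self.snapshot = None
        self.last_synced = None

    def sync_from(self, take_snapshot):
        # take_snapshot stands in for dumping state from the active instance
        self.snapshot = take_snapshot()
        self.last_synced = time.time()

    def staleness(self):
        """Seconds of state the standby would lose if it took over right now."""
        if self.last_synced is None:
            return float("inf")
        return time.time() - self.last_synced
```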
N+1, N+2, and 2N Redundancy
These formulas describe how many spares you have relative to capacity:
- N+1: Run the N instances you need for capacity plus 1 spare, N+1 total. One can fail without losing capacity.
- N+2: Run N instances plus 2 spares. Survive two simultaneous failures.
- 2N: Run a full set of N instances in each of two independent regions/data centers (2N total). Can survive the loss of one entire location.
Scenario: You need capacity for 10 concurrent connections
N+1 Redundancy:
├─ Instance 1 (5 connections)
├─ Instance 2 (5 connections)
└─ Instance 3 (spare)
Loss of any one instance: others handle full load ✓
N+2 Redundancy:
├─ Instance 1 (3.3 connections)
├─ Instance 2 (3.3 connections)
├─ Instance 3 (3.3 connections)
└─ Instance 4 (spare) + Instance 5 (spare)
Loss of any two instances: others handle full load ✓
Cost: 5 instances instead of the 2 you'd run with no redundancy
2N Redundancy (Multi-Region):
Region A:
├─ Instance 1A (5 connections)
└─ Instance 2A (5 connections)
Region B:
├─ Instance 1B (5 connections)
└─ Instance 2B (5 connections)
Loss of entire Region A: Region B handles full load ✓
Cost: 4 instances + complex geo-routing
The choice depends on your reliability requirements and budget. For most services, N+1 in a single region is reasonable. N+2 when you need very high reliability. 2N when you need to survive regional outages.
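The arithmetic is simple enough to capture in a few lines. This sketch (a hypothetical helper, with capacities chosen to match the scenario above) computes instance counts for each pattern:

```python
import math

def instances_needed(required_capacity, per_instance_capacity, spares=1, regions=1):
    """Instances for N+spares redundancy, duplicated across `regions` (2N when regions=2)."""
    n = math.ceil(required_capacity / per_instance_capacity)   # N: instances needed for capacity
    return (n + spares) * regions

# The 10-connection scenario above:
print(instances_needed(10, 5, spares=1))               # N+1 -> 3 instances
print(instances_needed(10, 3.4, spares=2))             # N+2 with smaller instances -> 5
print(instances_needed(10, 5, spares=0, regions=2))    # 2N across two regions -> 4
```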
Fault Detection: How Do We Know Something Failed?
Redundancy is useless without detection. You need to know when a component is failing before you can failover.
Heartbeat / Liveness Probes
The system periodically checks if a component is alive:
Health Checker → Component
↓
/health endpoint
↓
"status: healthy"
If a component misses N heartbeats in a row, it’s marked unhealthy. Simple but effective. Used by Kubernetes liveness probes, load balancer health checks, etc.
Tuning parameters:
- Check interval: How often to probe? 5 seconds is common.
- Timeout: How long to wait for a response? 2-3 seconds.
- Failure threshold: Miss how many checks before marking unhealthy? Usually 2-3.
Tune too aggressively and you get false positives (healthy instances marked unhealthy). Tune too conservatively and you miss real failures.
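A minimal sketch of both halves, probing a /health endpoint and applying a consecutive-failure threshold. The URL, thresholds, and the HealthTracker name are illustrative defaults, not a specific tool's configuration:

```python
import urllib.request

def check_health(url, timeout=2.0):
    """Probe a /health endpoint; any exception or non-200 response counts as a failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

class HealthTracker:
    """Mark an instance unhealthy only after `failure_threshold` consecutive misses."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.healthy = True

    def record(self, probe_ok):
        if probe_ok:
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy

# Typical loop: probe every 5 seconds, eject after 3 consecutive failures
# tracker = HealthTracker(failure_threshold=3)
# tracker.record(check_health("http://api-a.internal/health"))   # hypothetical URL
```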
Consensus Protocols
For critical state (database primaries, message brokers), a single monitor's heartbeat isn't enough; the monitor itself might be the node that got partitioned. You need a group decision:
Node A: "I can still reach the primary"
Node B: "I can't reach the primary"
Node C: "I can't reach the primary"
↓ (quorum vote: did the primary fail?)
Majority (2 of 3) agrees the primary is down
↓
Promote a standby to primary
Raft and Paxos are the main algorithms. Raft is simpler and used by etcd, Consul, and many others. The benefit: no split-brain (two servers thinking they’re primary), because a quorum must agree on state changes.
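This is not Raft itself (real leader election involves terms, logs, and persisted votes), but the quorum rule at its core is easy to show. A sketch with hypothetical node names:

```python
def quorum_agrees_primary_failed(votes):
    """`votes` maps node name -> True if that node can no longer reach the primary.

    A strict majority must agree before anyone promotes a standby, which is
    what prevents a single partitioned node from triggering a split-brain.
    """
    failed_votes = sum(1 for unreachable in votes.values() if unreachable)
    return failed_votes > len(votes) // 2

print(quorum_agrees_primary_failed({"node-a": False, "node-b": True, "node-c": True}))
# True: 2 of 3 nodes can't reach the primary, so a standby may be promoted
```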
Error Rate Spikes
Sometimes a component doesn’t crash—it just starts returning errors:
Error Rate: 0.1% (normal)
↓ [spike detected]
Error Rate: 15% (alert! start draining traffic)
↓
Error Rate: 85% (component is broken, remove from rotation)
This is common when a dependency (database, cache) degrades. The component is still alive but unhealthy.
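A sketch of error-rate-based ejection over a sliding window; the window size and the 5%/50% thresholds are assumptions for illustration, not recommended values:

```python
from collections import deque

class ErrorRateMonitor:
    """Track error rate over a sliding window of recent requests."""

    def __init__(self, window=1000, drain_at=0.05, eject_at=0.50):
        self.results = deque(maxlen=window)   # True = error, False = success
        self.drain_at = drain_at              # start shifting traffic away
        self.eject_at = eject_at              # remove from rotation entirely

    def record(self, is_error):
        self.results.append(is_error)

    def error_rate(self):
        return sum(self.results) / len(self.results) if self.results else 0.0

    def action(self):
        rate = self.error_rate()
        if rate >= self.eject_at:
            return "eject"      # component is broken, remove from rotation
        if rate >= self.drain_at:
            return "drain"      # alert and start draining traffic
        return "healthy"
```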
Failover Mechanisms
Once you detect failure, how do you switch traffic?
Load Balancer Failover
The simplest: the load balancer stops sending traffic to the failed instance.
User → Load Balancer ─┬─→ Instance A (healthy)
                      ├─✗  Instance B (failed, not used)
                      └─→ Instance C (healthy)
Time to failover: 5-10 seconds (depending on health check frequency).
DNS Failover
Change which IP address the DNS name resolves to:
Before: api.example.com → 192.0.2.1 (failed)
After: api.example.com → 192.0.2.2 (backup)
Time to failover: 30 seconds to several minutes, because clients and resolvers cache the old record until the TTL expires. Much slower than load-balancer failover, so DNS is usually a fallback rather than the first line of defense.
Database Failover (Promotion)
The most complex: promoting a replica to primary:
Primary (failed)
↓
Replica A ──promote──→ New Primary
Replica B ───catch up from binlog───
Must ensure:
- Replica has all committed writes (no data loss)
- Other replicas repoint to new primary
- Clients reconnect to new primary
- Old primary doesn’t accept writes if it comes back online (split-brain prevention)
Time to failover: 10-30 seconds for well-configured systems, minutes for complex setups.
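A hedged sketch of the safety check at the heart of promotion: pick the most caught-up replica and refuse to promote if doing so would lose committed writes. The lag numbers here stand in for real replication positions (binlog or WAL offsets):

```python
def choose_replica_to_promote(replicas, max_lag_bytes=0):
    """Pick the most caught-up replica; refuse to promote if data would be lost.

    `replicas` maps replica name -> bytes of replication lag behind the failed
    primary. In a real system this comes from replication positions; here it's
    just a number for illustration.
    """
    name, lag = min(replicas.items(), key=lambda item: item[1])
    if lag > max_lag_bytes:
        raise RuntimeError(
            f"best candidate {name} is {lag} bytes behind; "
            "promoting it would lose committed writes"
        )
    return name

print(choose_replica_to_promote({"replica-a": 0, "replica-b": 4096}))   # replica-a
```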
Data Consistency During Failover
Here’s where redundancy gets hard: when a component fails in the middle of operations, what happens to in-flight requests?
Scenario 1: Write In Progress
Client: INSERT a new user row
↓
Primary Database (commits locally, acknowledges the client)
↓ [CRASH before the write is replicated]
Replica (hasn't received the write yet)
↓
Replica promoted to primary
Result: User record lost! The client was told the insert succeeded, but the write never reached the replica.
Solutions:
- Synchronous replication (primary waits for replica to acknowledge write before responding to client)
- Quorum writes (write must reach a majority before being acknowledged)
- Durability guarantees (write goes to disk before acknowledging)
Trade-off: Durability guarantees slow down writes.
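A minimal sketch of the quorum-write idea, with a hypothetical write callable standing in for sending the record to one node; real systems fan out to all nodes in parallel rather than sequentially as shown here:

```python
def quorum_write(nodes, write, timeout=1.0):
    """Acknowledge a write only after a majority of nodes durably accepts it.

    `write(node, timeout)` is a placeholder: it should send the record to one
    node and return True on a durable (fsynced) acknowledgement.
    """
    needed = len(nodes) // 2 + 1          # strict majority
    acks = 0
    for node in nodes:
        try:
            if write(node, timeout):
                acks += 1
        except Exception:
            pass                          # a slow or failed node simply doesn't count
        if acks >= needed:
            return True                   # now it's safe to tell the client "committed"
    return False                          # not durable enough; client must retry
```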
Scenario 2: Read After Write Consistency
Client: Write to Primary
↓ [failover happens]
Client: Read from New Primary (replica)
↓
Read sees old value (replica wasn't caught up)
Result: Client sees their write disappear!
Solutions:
- Ensure replicas are caught up before promoting
- Redirect failed writes to the new primary (have client retry)
- Accept eventual consistency and warn users
RPO and RTO: The Recovery Equation
Two metrics critical for disaster recovery:
RTO (Recovery Time Objective): How long until the system is back in service after a failure?
Failure occurs at 2:00 PM
System is back online at 2:03 PM
RTO = 3 minutes
RPO (Recovery Point Objective): How much data loss is acceptable?
Last backup at 2:00 PM
Failure at 2:15 PM
Data from 2:00-2:15 is lost
RPO = 15 minutes
These drive redundancy architecture:
- Low RTO (seconds) → Need active-active or hot standby
- Low RPO (near-zero) → Need synchronous replication or persistent queue
- High RTO (hours) → Cold standby is fine
- High RPO (hours) → Periodic backups are fine
A common target for user-facing services is an RTO under 5 minutes and an RPO under 1 minute.
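Measuring the actual values after an incident is simple arithmetic; this small sketch reproduces the numbers from the examples above:

```python
from datetime import datetime

def recovery_metrics(last_replicated_at, failure_at, restored_at):
    """Actual RPO = window of data lost; actual RTO = downtime experienced."""
    rpo = failure_at - last_replicated_at
    rto = restored_at - failure_at
    return rpo, rto

rpo, rto = recovery_metrics(
    datetime(2024, 1, 1, 14, 0),    # last successful backup / replication
    datetime(2024, 1, 1, 14, 15),   # failure occurs
    datetime(2024, 1, 1, 14, 18),   # service restored
)
print(rpo, rto)   # 0:15:00 0:03:00 -> 15 minutes of data lost, 3 minutes of downtime
```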
Geographic Redundancy and Disaster Recovery
With active-active redundancy within a region, you’re protected against component failures. But what if the entire region fails (major power outage, earthquake, DDoS)?
Primary Region (us-east-1)
├─ API Servers
├─ Database Primary
└─ Cache
↕ [replication]
Secondary Region (eu-west-1)
├─ API Servers
├─ Database Replica
└─ Cache
Active-Active Geo-Redundancy: Both regions serve traffic simultaneously. Users in Europe hit eu-west-1, users in US hit us-east-1. If one region dies, traffic routes to the other.
Challenge: Keeping data consistent across regions when both are accepting writes. Usually requires eventual consistency.
Active-Passive Geo-Redundancy: Only primary region serves traffic. Secondary is a hot standby.
Simpler but less efficient (secondary resources idle).
Cold Standby: Snapshots/backups in secondary region, needs 30+ minutes to spin up.
The Cost-Reliability Curve
Every redundancy improvement has a cost:
Reliability
       ↑
99.99% ├─ Multi-region active-active (very expensive)
       │
 99.9% ├─ Multi-AZ active-active or hot standby (expensive)
       │
   99% ├─ N+1 redundancy in single AZ (moderate cost)
       │
   98% ├─ No redundancy (cheap but risky)
       └────────────────────────────────→ Cost
As a rule of thumb, most systems get 80% of their reliability from N+1 redundancy (duplicate components). The next 15% comes from multi-AZ. The final 5% (jumping from 99.9% to 99.99%+) requires an expensive multi-region setup.
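It helps to translate those percentages into downtime budgets, since that is what an availability target actually buys you:

```python
def allowed_downtime_per_year(availability_percent):
    """Convert an availability target into a yearly downtime budget, in hours."""
    hours_per_year = 365 * 24
    return hours_per_year * (1 - availability_percent / 100)

for target in (98.0, 99.0, 99.9, 99.99):
    print(f"{target}%: {allowed_downtime_per_year(target):.1f} hours/year")
# 98.0%: 175.2 hours/year    99.0%: 87.6 hours/year
# 99.9%: 8.8 hours/year      99.99%: 0.9 hours/year
```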
Key Takeaways
- Redundancy without automation is useless; you must have fault detection and automatic failover
- Choose your pattern based on statelessness and cost: active-active for stateless, active-passive for stateful
- N+1 gives you basic protection; N+2 for higher reliability; 2N for regional failure protection
- Failover detection is critical; tune health checks carefully to balance false positives and missed failures
- RPO and RTO are your reliability objectives; design redundancy to meet them
- Geographic redundancy is expensive and complex; only add it when you must survive regional outages
- Most reliability comes from within-region redundancy; geographic is a luxury for high-availability systems
In the next section, we’ll discuss the circuit breaker pattern—how to prevent cascading failures when dependencies become unhealthy.