System Design Fundamentals

Health Checks & Failure

Why a Load Balancer Needs to See Its Servers

A load balancer is only useful if it knows which servers are actually alive. Imagine routing traffic to a crashed database server, or a web server that hangs on every request. Without visibility into server health, your load balancer becomes a traffic director who can’t see the road. It distributes requests into the void, leaving users waiting forever while perfectly healthy servers sit idle elsewhere.

This is where health checks come in. Health checks are the nervous system of load balancing—they continuously monitor your backend servers and answer a simple question: “Is this server ready to handle traffic?” They’re the foundation of reliability discussed in Chapter 1, providing the feedback loop that makes load balancing actually work in practice. Without health checks, algorithms from previous sections (round-robin, least connections, weighted distribution) are flying blind.

Health checks are deceptively simple in concept but challenging in execution. Get them wrong, and you either remove good servers from rotation (false positives) or keep serving traffic to broken ones (false negatives). Strike the right balance, and your system gracefully handles failures that would otherwise cascade through your infrastructure.

Types of Health Checks

Health checks come in two flavors, active and passive, and each serves a different purpose in your architecture.

Active health checks are explicit, deliberate probes initiated by the load balancer. The load balancer sends a request to each backend server on a regular interval—perhaps every 5 or 10 seconds—and checks if it responds correctly. If the server doesn’t respond or responds with an error, the load balancer marks it unhealthy. This is direct, immediate, and you control exactly what you’re checking. The downside: you’re adding traffic to your backends just to verify they’re alive.
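
To make this concrete, here is a minimal sketch of an active prober in Python, using only the standard library. The backend hostnames and the 10-second interval are placeholders; a real load balancer would probe concurrently and feed the results into its routing table:

import time
import urllib.error
import urllib.request

# Placeholder backend health endpoints; substitute your own.
BACKENDS = [
    "http://backend-1.internal:8080/health",
    "http://backend-2.internal:8080/health",
]

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers timeouts, connection refused, DNS failures, and non-2xx responses.
        return False

def prober_loop(interval: float = 10.0):
    while True:
        for url in BACKENDS:
            healthy = probe(url)
            print(f"{url}: {'healthy' if healthy else 'UNHEALTHY'}")
        time.sleep(interval)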

Passive health checks observe real traffic flowing through the load balancer. If a backend server returns 500 errors or connections time out, the load balancer notices without sending dedicated probe requests. This requires no extra traffic, but it means problems are only detected after a user experiences them—you’re healing after the damage is done, not preventing it.

Within active checks, we distinguish between liveness and readiness. A liveness check asks, “Is this server running?” Does the process exist? Can I connect to it on port 8080? Readiness goes deeper: “Is this server actually ready to serve requests?” Has it finished initializing? Are its database connections healthy? Can it reach its cache? A server can be alive but not ready.

Most active health checks hit a dedicated health check endpoint, typically /health or /ready. This endpoint should be lightweight—it doesn’t need to serve your entire application logic, just return a quick status. Some systems go deeper with full health checks that validate database connectivity, cache access, and external service dependencies. Shallow checks are fast but might miss issues; deep checks catch problems but add latency and risk to your health check process itself.

Failure thresholds and recovery thresholds control how quickly we react. You don't mark a server unhealthy after one failed probe—that's too aggressive and causes rapid flapping. Instead, you might require 3 consecutive failures before removing a server. Similarly, once a server has been marked unhealthy, you don't return it to rotation after a single successful probe; you might require 2 consecutive successful checks before marking it healthy again.

Grace periods are another safeguard. When a server first starts, it might not be ready for a few seconds while it initializes. A grace period gives it time to become ready before we start marking it unhealthy for probe failures.
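
Here is a rough Python sketch of the per-backend bookkeeping this implies. The thresholds match the values above (3 failures, 2 successes); the 30-second grace period is an arbitrary illustration, not a recommendation:

import time

FAILURE_THRESHOLD = 3      # consecutive failed probes before removal
RECOVERY_THRESHOLD = 2     # consecutive successful probes before reinstatement
GRACE_PERIOD_SECONDS = 30  # ignore failures while the server warms up

class BackendState:
    def __init__(self):
        self.started_at = time.monotonic()
        self.healthy = True
        self.failures = 0
        self.successes = 0

    def record_probe(self, success: bool):
        if success:
            self.failures = 0
            self.successes += 1
            if not self.healthy and self.successes >= RECOVERY_THRESHOLD:
                self.healthy = True   # return to rotation
        else:
            self.successes = 0
            # During the grace period, failed probes don't count against the server.
            if time.monotonic() - self.started_at < GRACE_PERIOD_SECONDS:
                return
            self.failures += 1
            if self.healthy and self.failures >= FAILURE_THRESHOLD:
                self.healthy = False  # remove from rotation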

The Hospital Triage Analogy

Think of health checks like a hospital’s patient monitoring system. A nurse performs rounds every few hours, taking vital signs and checking on patients—this is your active health check. The nurse looks at heart rate, blood pressure, and breathing, just like your /health endpoint reports CPU usage, memory, and responsiveness.

But hospitals also use continuous monitoring. Heart monitors beep if a patient's rhythm becomes irregular—this is your passive health check. The system detects problems without waiting for the next nurse's round.

A patient doesn’t get discharged after one good blood pressure reading. The doctor requires sustained improvement over time—equivalent to your recovery threshold. Similarly, a concerning vital sign doesn’t mean immediate emergency; the nurse checks again before escalating, much like your failure threshold requires multiple failures before removing a server.

How Active Health Checks Actually Work

Active health checks typically work through three mechanisms: HTTP-based checks, TCP connection checks, and script-based verification.

HTTP-based checks are most common in modern systems. The load balancer sends an HTTP GET request to your backend on a dedicated endpoint:

GET /health HTTP/1.1
Host: backend-1.internal
User-Agent: nginx-health-check

HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "healthy",
  "timestamp": "2025-02-10T14:22:33Z"
}

The load balancer expects a 200 status code within a timeout period—maybe 5 seconds. If it gets anything else (500 error, timeout, connection refused), it counts as a failure.

TCP checks are simpler: can I establish a TCP connection to your server on port 8080? If the three-way handshake completes, the server is alive. This is fast but tells you nothing about whether your application logic works. Useful for detecting that a process crashed, but not that it’s hung.
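
A TCP check takes a few lines of Python with the standard library. It proves only that something is accepting connections on the port, not that the application behind it can do useful work:

import socket

def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the TCP three-way handshake completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or timeout.
        return False

# e.g. tcp_check("backend-1.internal", 8080)  -- hostname is a placeholder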

Script-based checks run custom logic on the load balancer side. You might have a script that connects to your database, runs a test query, and reports success or failure. This is powerful but adds complexity and must be carefully designed to avoid overloading backends.
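
A script-based check is just a small program whose exit code the load balancer (or an agent driving it) interprets as healthy or unhealthy. A sketch, assuming a PostgreSQL backend and placeholder connection details:

#!/usr/bin/env python3
# Script-based check: exit 0 if the test query succeeds, 1 otherwise.
import sys

import psycopg2  # assumes the psycopg2 driver is installed

def main() -> int:
    try:
        # Placeholder connection details.
        conn = psycopg2.connect(host="db.internal", dbname="mydb", connect_timeout=2)
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
        conn.close()
        return 0
    except Exception:
        return 1

if __name__ == "__main__":
    sys.exit(main())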

Passive health checks operate differently. They instrument your request/response pipeline:

Request sent to backend-1 →
  Response: 500 Internal Server Error →
    Failure counter incremented →
    (after 3 failures) Remove from rotation

The load balancer watches the status codes and response times of real traffic. Consistent 500s or timeouts indicate a problem. Some systems mark servers unhealthy immediately after a bad response; others use counters.
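
A sketch of the passive side in Python, assuming the proxying code reports every response (or timeout) per backend; it counts consecutive failures and ejects a backend once it crosses a threshold:

from collections import defaultdict

PASSIVE_FAILURE_THRESHOLD = 3

consecutive_failures = defaultdict(int)
ejected = set()

def observe_response(backend: str, status_code: int | None):
    """Call for every proxied response; status_code=None means a timeout."""
    failed = status_code is None or status_code >= 500
    if failed:
        consecutive_failures[backend] += 1
        if consecutive_failures[backend] >= PASSIVE_FAILURE_THRESHOLD:
            ejected.add(backend)  # stop routing new requests here
    else:
        consecutive_failures[backend] = 0
        # Re-admission usually happens separately (a cool-down timer or an active
        # probe), because an ejected backend sees no real traffic to observe.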

When a server is marked unhealthy, most load balancers implement connection draining (graceful shutdown). New requests don’t route to this server, but existing connections are allowed to complete. This prevents clients from getting connection resets for in-flight requests.
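
A rough sketch of draining, assuming the routing layer skips backends whose draining flag is set and maintains an in-flight request counter for each one:

import time

class Backend:
    def __init__(self, name: str):
        self.name = name
        self.draining = False  # routing code must skip backends with this flag set
        self.in_flight = 0     # incremented/decremented by the proxying code

def drain(backend: Backend, deadline_seconds: float = 30.0):
    """Stop new traffic, then wait for in-flight requests to finish (or time out)."""
    backend.draining = True
    deadline = time.monotonic() + deadline_seconds
    while backend.in_flight > 0 and time.monotonic() < deadline:
        time.sleep(0.1)
    # Anything still open after the deadline is typically closed forcibly.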

Here’s the health check lifecycle in diagram form:

graph LR
    A[Healthy] -->|Probe Timeout| B[1 Failure]
    B -->|Probe Timeout| C[2 Failures]
    C -->|Probe Timeout| D[3 Failures<br/>Mark Unhealthy]
    D -->|Connection Drain| E[Unhealthy<br/>No New Requests]
    E -->|Successful Probe| F[1 Recovery]
    F -->|Successful Probe| G[2 Recoveries]
    G -->|Successful Probe| A
    A -->|Continuous Probes| A

One critical problem to avoid: cascading health check failures. Imagine your backend has a slow database query. Health checks time out waiting for the query. The load balancer removes the server. Its load redistributes to other servers, which are now overloaded. Their health checks start timing out too. Within seconds, your entire backend cluster is marked unhealthy. This happens when the health checks themselves amplify the failure, turning one slow dependency into a cluster-wide outage.

Kubernetes takes a structured approach with liveness and readiness probes defined in your deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: api-server:latest   # placeholder image name
        livenessProbe:
          httpGet:
            path: /live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 2

The liveness probe checks “Is the process still running?” It waits 30 seconds before its first check (initialDelaySeconds), then probes every 10 seconds. The readiness probe checks “Is the app ready for traffic?” It starts sooner and checks more frequently. If the readiness probe fails repeatedly, Kubernetes stops routing Service traffic to the pod but doesn’t kill it. If the liveness probe keeps failing, Kubernetes restarts the container.

Building a Health Check Endpoint

Let’s implement a real /health endpoint. Here’s Node.js with Express:

// Assumes `app` is an Express application and that `db` (a promise-based SQL
// client, e.g. a pg Pool) and `redis` (a promise-based Redis client) are
// initialized elsewhere.
app.get('/health', async (req, res) => {
  try {
    // Check database connectivity with a trivial query
    await db.query('SELECT 1');

    // Check cache
    const cacheValue = await redis.get('heartbeat');
    if (!cacheValue) {
      return res.status(503).json({
        status: 'degraded',
        reason: 'cache_unreachable'
      });
    }

    return res.status(200).json({
      status: 'healthy',
      uptime: process.uptime(),
      timestamp: new Date().toISOString()
    });
  } catch (error) {
    res.status(503).json({
      status: 'unhealthy',
      error: error.message
    });
  }
});

And the same in Python with Flask:

from datetime import datetime

from flask import jsonify
import psycopg2
import redis

# Assumes `app` is the Flask application object created elsewhere.

@app.route('/health', methods=['GET'])
def health_check():
    try:
        # Check database
        conn = psycopg2.connect(database="mydb")
        conn.close()

        # Check cache
        r = redis.Redis()
        r.ping()

        return jsonify({
            'status': 'healthy',
            'timestamp': datetime.now().isoformat()
        }), 200
    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'error': str(e)
        }), 503

Notice we return 200 when healthy and 503 when unhealthy. The load balancer watches for these status codes.

For Nginx, active health checks against upstream servers require NGINX Plus or the third-party nginx_upstream_check_module; with the check module, the configuration looks like this:

upstream backend {
    server backend1.example.com:8080;
    server backend2.example.com:8080;

    check interval=3000 rise=2 fall=5 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    location / {
        proxy_pass http://backend;
    }
}

This checks every 3 seconds, requires 2 successful responses to mark healthy (rise=2), and 5 failures to mark unhealthy (fall=5).

AWS Application Load Balancer configuration:

{
  "TargetGroupArn": "arn:aws:elasticloadbalancing:...",
  "HealthCheckEnabled": true,
  "HealthCheckProtocol": "HTTP",
  "HealthCheckPath": "/health",
  "HealthCheckIntervalSeconds": 30,
  "HealthCheckTimeoutSeconds": 5,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 2,
  "Matcher": {
    "HttpCode": "200-299"
  }
}

When Health Checks Go Wrong

A real failure scenario: A company’s health check endpoint made a database query to validate database connectivity. This was thorough—it checked everything. But during peak traffic, the database got slow. Health checks started timing out. The load balancer marked all backends unhealthy. Traffic cascaded, the database became even slower, and the health checks failed faster. The entire system became inaccessible. The fix: make health checks lightweight. Don’t query your main database in the health check. Use a separate monitoring database, or just check local server state.

Another scenario: Health check interval was too aggressive—every 1 second across 100 load balancers checking 1,000 backends. That’s 100,000 requests per second just for health checks. The health checks themselves became a denial-of-service attack. The fix: stagger health checks, use longer intervals, or implement health check response caching.

Trade-offs in Health Check Design

Frequency vs. Overhead. Check every second and you detect failures instantly but generate immense traffic. Check every 60 seconds and you save traffic but might miss failures for up to a minute. Most systems settle on 10-30 second intervals as a practical balance.

Shallow vs. Deep Checks. A shallow check (TCP connect) is fast and low-risk but misses application-level issues. A deep check validates your entire dependency chain but risks cascading failures if overloaded. The sweet spot: check what’s critical to your service working, nothing more. If database failure means you can’t serve traffic, check the database. If a third-party API is optional, don’t check it in the health endpoint.

Aggressive Removal vs. False Positives. Require 1 failure to mark unhealthy and you might remove healthy servers due to transient blips. Require 10 failures and you might leave broken servers in rotation. Thresholds of 3-5 failures are common because they tolerate occasional timeouts without being so loose that real problems go undetected.

Thundering Herd. If all backends restart simultaneously and all health checks start at the same time, you create a synchronized spike of traffic. Stagger check times or have backends randomize their startup delays to spread the load.
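
One simple mitigation is to add random jitter to each prober's schedule so the checks drift apart instead of firing in lockstep. A sketch, assuming a 10-second base interval:

import random

BASE_INTERVAL = 10.0  # seconds between probes
JITTER = 0.2          # +/- 20% of the base interval

def next_delay() -> float:
    """Base interval plus a random offset, so probers drift out of sync."""
    return BASE_INTERVAL * (1 + random.uniform(-JITTER, JITTER))

# In the prober loop, sleep for next_delay() instead of a fixed BASE_INTERVAL.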

Key Takeaways

  • Health checks are essential infrastructure, not optional—without them, your load balancer sends traffic to dead servers and your reliability degrades instantly
  • Active checks are proactive (you detect failures before users do) but add probe traffic; passive checks are reactive but add no extra traffic
  • Shallow checks are fast; deep checks are thorough—balance coverage against risk and complexity
  • Thresholds matter: grace periods, failure thresholds, and recovery thresholds all prevent rapid flapping and false positives
  • Health check endpoints must be lightweight and fast—avoid making expensive queries, and never let the health check itself become the bottleneck
  • Cascading failures from aggressive health checks are a real risk—design defensively and test under load

Practice Scenarios

  1. Your team receives a page: 30% of requests are timing out. You investigate and find the health check endpoint is calling your main database with a complex JOIN query. The database is slow, so health checks time out, servers get marked unhealthy, traffic concentrates on remaining servers, and they become overloaded too. Design a new health check endpoint that avoids this cascade. What would you check instead?

  2. You deploy a new load balancer with health checks every 1 second across 500 servers. Production goes haywire—health check traffic itself is saturating your network. What parameters would you tune, and why?

  3. A server process is running and the TCP health check passes, but the application is actually hung in an infinite loop. Your HTTP health check would catch this, but the team wants to minimize health check overhead. How would you detect application hangs without expensive probes?

Bridging to Global Load Balancing

So far we’ve focused on health checks within a single datacenter or region—local visibility into local servers. But what happens when your users are distributed globally? What if one entire datacenter becomes unreachable? That’s where Global Server Load Balancing (GSLB) comes in, using these same health check principles but at geographic scale.