System Design Fundamentals

Health Checks & Heartbeats

The Zombie Instance Problem

Your load balancer is routing traffic to 5 API server instances. Everything looks good — TCP connections are being established, network is working fine. But one instance has a problem: its database connection pool is exhausted. The server accepts connections (so the load balancer thinks it’s alive) but returns 500 errors for every request.

The result? 20% of your users get errors while the other 80% succeed. The load balancer keeps sending traffic to the zombie instance because it can’t tell the difference between “alive” and “actually working.”

A proper health check would detect this. The instance would report “not ready,” the load balancer would remove it from the pool, and all users would get served by the 4 healthy instances. This is what health checks do: detect when a service is degraded and remove it from traffic before it hurts users.

This chapter is about the automated systems that keep your infrastructure honest.

Types of Health Checks in Kubernetes

Kubernetes gives you three types of probes, each serving a different purpose:

Liveness Probe: “Is the process running?” If the probe fails repeatedly, Kubernetes assumes the process is stuck and restarts it. This catches “hung” processes that are technically alive but unresponsive.

Readiness Probe: “Can it accept traffic?” If the readiness probe fails, Kubernetes removes the pod from the load balancer’s target list. This catches temporary issues (initializing, cache loading) without restarting the whole pod.

Startup Probe: “Has it finished initializing?” For applications with slow startups (Java with JIT warmup, Node with module loading), the startup probe delays liveness/readiness checks until the app has finished booting. Without this, Kubernetes might kill a service before it’s even started.

Here’s how they interact:

┌────────────────────────────────────────────┐
│         Container Starts                   │
└────────────────────────────────────────────┘

         ┌──────────────────────┐
         │ Startup Probe Check  │
         │ (Is it initialized?) │
         └──────────────────────┘
          (fails → restart container)

    ┌───────────────────────────────┐
    │ Liveness Probe                │
    │ (Is process still running?)   │
    │ (fails → restart container)   │
    └───────────────────────────────┘

    ┌───────────────────────────────┐
    │ Readiness Probe               │
    │ (Can it accept traffic?)      │
    │ (fails → remove from LB)      │
    └───────────────────────────────┘

      ┌─────────────────────────────┐
      │ Container Running & Healthy │
      │ (Receiving traffic)         │
      └─────────────────────────────┘
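
All three probes can live on the same container. Here is a minimal sketch of a pod spec that wires them together (the image name, endpoint paths, and port are placeholders, not from any particular application):

apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
    - name: api
      image: example/api:1.2.3          # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                     # gates the other probes until boot completes
        httpGet:
          path: /health/startup
          port: 8080
        periodSeconds: 10
        failureThreshold: 30            # up to ~5 minutes to finish initializing
      livenessProbe:                    # restarts the container if the process hangs
        httpGet:
          path: /ping
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
      readinessProbe:                   # removes the pod from load balancing when not ready
        httpGet:
          path: /health/ready
          port: 8080
        periodSeconds: 10
        failureThreshold: 2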

Shallow vs. Deep Health Checks

There’s a trade-off between simplicity and comprehensiveness.

Shallow health check: Does the process respond to requests?

GET /health → 200 OK

This tells you the process is running and the network path works. But it doesn’t verify that the database is reachable, Redis is working, or disk space is available. A shallow check is fast and won’t itself become a performance bottleneck.

Deep health check: Verify all critical dependencies.

GET /health → {
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "disk_free_gb": 45,
  "memory_available_gb": 2.1
}

This tells you much more, but it comes at a cost: if every instance hits the database and Redis on every probe, the health checks themselves add real load. With enough instances and a short probe interval, they can become a noticeable slice of your database traffic. Your health checks become the problem they’re trying to prevent.

Best practice: shallow checks (fast, frequent) for liveness, deeper checks for readiness and startup.

# Liveness: shallow, check if process responds (fast)
livenessProbe:
  httpGet:
    path: /ping
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3

# Readiness: deeper, check if it can handle requests (slower)
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 2

A Lifeguard Analogy

Imagine a swimming pool with dozens of swimmers. The lifeguard (health checker) periodically scans for problems. A shallow check just confirms each swimmer’s head is above water (the process is alive). A deep check means asking “are you okay?” and verifying they can tread water (dependencies are working).

If a swimmer doesn’t respond (health check timeout), the lifeguard acts immediately — they’re either in trouble or gone.

Kubernetes Health Probe Configuration

Here’s what each parameter does:

httpGet:
  path: /health
  port: 8080
  scheme: HTTP  # or HTTPS

# How long to wait before starting checks (app needs time to boot)
initialDelaySeconds: 30

# How often to check (every 5 seconds for liveness, every 10 for readiness)
periodSeconds: 5

# How long to wait for a response before giving up
timeoutSeconds: 2

# How many consecutive failures before action
# - Liveness: 3 failures = restart the pod
# - Readiness: 2 failures = remove from load balancer
failureThreshold: 3

# How many successes to consider it recovered
# - Usually 1 for liveness, 1-2 for readiness
successThreshold: 1

The key insight: these parameters should match your application’s failure patterns. If your database flakes out for 30 seconds during backups, a probe that runs every 5 seconds with failureThreshold=3 gives up after only 15 seconds, well inside the 30-second blip, and fires a false positive. Increase failureThreshold or periodSeconds so the total tolerance window is longer than the blips you expect.
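
For example, a readiness probe meant to ride out a 30-second backup window could be relaxed like this (a sketch; pick values from your own observed failure windows):

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 6     # 6 × 10s tolerates up to ~60 seconds of failures before removal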

Health Check Endpoint Design

When you implement /health in your application, what should you include?

{
  "status": "healthy",
  "timestamp": "2024-02-13T14:32:05Z",
  "version": "v1.2.3",
  "uptime_seconds": 3600,
  "dependencies": {
    "database": {
      "status": "connected",
      "latency_ms": 2
    },
    "redis": {
      "status": "connected",
      "latency_ms": 1
    },
    "external_api": {
      "status": "degraded",
      "latency_ms": 5000,
      "note": "Slow but operational"
    }
  },
  "checks": {
    "disk_free_gb": 45,
    "memory_available_percent": 65
  }
}

What to include:

  • Overall status (healthy, degraded, unhealthy)
  • Timestamp
  • Version (helps correlate health issues with deployments)
  • Dependency status (can optional dependencies fail without failing health?)
  • Resource availability (disk, memory)

What NOT to include:

  • Sensitive information (database credentials, API keys)
  • Expensive operations (full data consistency checks)
  • Business logic (is this user premium? — that’s not a health check)

The Cascading Health Check Problem

Imagine Service A’s health check queries Service B’s /health endpoint. Service B checks Service C. Service C checks the database. Now the database gets 1000 health check requests per second (300 from C’s own probes + 700 from B and A cascading checks).

Rule: Service A’s health check should report:

  • Its own health (can it process requests?)
  • Its dependencies’ health (but reported passively, not actively checked)
  • NOT propagate dependency failures as its own failure

If Service B is down, Service A reports that (as “degraded,” or “unhealthy” if B is truly critical), but it doesn’t restart itself, because restarting A fixes nothing. The Service B alert fires separately and Service B’s pods restart. Don’t cascade health check calls; cascade dependency status logically.

Better approach: Service A keeps track of its last successful request to Service B. If that was recent (less than 30 seconds ago), it reports healthy. If older, it reports degraded. This avoids creating a health check traffic storm.
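
A minimal sketch of that passive approach (the names record_success and STALE_AFTER_SECONDS are illustrative, not from any particular framework):

import time

STALE_AFTER_SECONDS = 30  # how long without a successful call before we report the dependency as degraded
_last_success = {"service_b": 0.0}

def record_success(dependency: str) -> None:
    """Call this after every successful real request to the dependency (e.g. Service B)."""
    _last_success[dependency] = time.monotonic()

def dependency_status(dependency: str) -> str:
    """Report dependency health passively, based on recent real traffic only."""
    age = time.monotonic() - _last_success.get(dependency, 0.0)
    return "healthy" if age < STALE_AFTER_SECONDS else "degraded"

# The /health handler reports its own status plus dependency_status("service_b").
# Crucially, it never calls Service B's /health endpoint itself.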

Heartbeats: Push-Based Health Checks

Health checks are pull-based: the monitor asks “are you alive?” But in some architectures, services send heartbeats: “I’m alive, all is well.”

Pull-based (Kubernetes liveness probes):

Load Balancer:  GET /health?
Service:        200 OK, all good

Push-based (heartbeat/watchdog):

Service:              POST /heartbeat (I'm alive)
Monitoring System:    Records last heartbeat timestamp
If no heartbeat in 30 seconds → alert

Push-based is common in peer-to-peer systems where there’s no centralized monitor. Services in a cluster gossip: each node says “I’m alive” and its neighbors relay the message. If a node stops sending heartbeats, its neighbors notice, mark it as failed, and the cluster reorganizes (electing a new leader if the failed node held that role).
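
A minimal push-based sketch (the monitor URL, payload shape, and 30-second window are assumptions for illustration):

import time
import requests

HEARTBEAT_INTERVAL = 10   # seconds between heartbeats
HEARTBEAT_TIMEOUT = 30    # monitor alerts if nothing arrives within this window

def send_heartbeats(monitor_url: str, instance_id: str) -> None:
    """Service side: periodically tell the monitor we're alive."""
    while True:
        try:
            requests.post(monitor_url, json={"instance": instance_id, "ts": time.time()}, timeout=2)
        except requests.RequestException:
            pass  # missing heartbeats are the signal; don't crash the service over it
        time.sleep(HEARTBEAT_INTERVAL)

# Monitor side: record the newest timestamp per instance and flag the silent ones.
last_seen: dict[str, float] = {}

def silent_instances() -> list[str]:
    now = time.time()
    return [name for name, ts in last_seen.items() if now - ts > HEARTBEAT_TIMEOUT]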

Health Checks in Load Balancers

Kubernetes isn’t the only place with health checks. Load balancers (AWS ALB, nginx, HAProxy) also have them.

Active health checks: Periodically send a request to each backend and check for success.

HAProxy configuration (inside a backend section; the check keyword enables the probe per server):

backend api_servers
  option httpchk GET /health
  server web1 10.0.0.1:8080 check    # placeholder addresses
  server web2 10.0.0.2:8080 check

Passive health checks: Monitor real traffic. If a backend returns 5xx errors repeatedly, temporarily remove it.
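
Open-source nginx, for instance, does passive checking on its upstream servers: after max_fails failed requests within fail_timeout, the server is skipped for the next fail_timeout period (addresses below are placeholders):

upstream api_backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}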

AWS ALB configures its (active) checks per target group:
Health check path: /health
Interval: 5 seconds
Timeout: 2 seconds
Healthy threshold: 2 consecutive successes
Unhealthy threshold: 3 consecutive failures

Load balancers often support passive health checks (watching real traffic) alongside active health checks (periodic probes). The combination is powerful: passive checks catch real-traffic failures immediately, while active probes confirm when an ejected backend has recovered and can safely take traffic again.

Service Mesh Health Checks

If you’re using a service mesh like Istio, it adds another layer. The mesh proxy (Envoy) can perform health checks independently of Kubernetes.

# Istio DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 100

This tells Envoy: “If a backend returns 5 consecutive 5xx errors, eject it (stop sending traffic to it) for 30 seconds.” This is passive health detection at the proxy level, complementing Kubernetes probes.

The Startup Probe: Critical for Long-Running Initialization

Java services are notorious for slow startups. The JVM loads, the Spring framework initializes, beans are created, caches are warmed. This can take 2-3 minutes.

Without a startup probe, Kubernetes would begin liveness checks shortly after launch (say, 30 seconds in), get no response, exhaust the failure threshold, and restart the container. The container would be stuck in a restart loop, never finishing initialization.

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30      # Allow 30 failures
  periodSeconds: 10         # Check every 10 seconds
                            # Total: up to 5 minutes of startup time

Once the startup probe succeeds, the liveness and readiness probes take over. This is crucial for applications with JVM warmup, cache loading, or startup database migrations.
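
On the application side, the startup endpoint can be as simple as a flag flipped when initialization completes. A hedged sketch (the contents of initialize() are placeholders for your real warmup work):

from flask import Flask, jsonify

app = Flask(__name__)
_started = False

def initialize():
    """Placeholder for slow startup work: warm caches, run migrations, load models."""
    global _started
    # ... expensive initialization happens here ...
    _started = True

@app.route('/health/startup')
def startup():
    # Return 200 only once initialization has finished; the startup probe retries until then
    if _started:
        return jsonify({"status": "started"}), 200
    return jsonify({"status": "starting"}), 503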

Preventing Health Checks From Creating the Problem They Measure

Sometimes health checks become their own bottleneck.

Example 1: Database overload from health checks

You have 100 pods, each checking the database health every 5 seconds. That’s 1200 database queries per minute just for health checks. If your database can only handle 3000 qpm total, you’ve just consumed 40% of capacity with monitoring.

Solution: Cache health check results. If the database was healthy 5 seconds ago, assume it still is. Querying a cached in-memory value is fast.

Example 2: The Thundering Herd

You deploy 50 new pods. By coincidence, all of them start up and check the database within the same second. The database connection pool exhausts. All health checks fail. Kubernetes restarts all 50 pods simultaneously. They all come back up, check the database simultaneously again. You’ve created an oscillation.

Solution: Stagger health check requests using jitter (add randomness to periodSeconds).

import random

# Instead of:
health_check_interval = 5  # All pods check at t=0, 5, 10, 15...

# Use:
jitter = random.uniform(0, 2)
health_check_interval = 5 + jitter  # Each pod checks at slightly different times

Real-World Health Check Example

Here’s a production-grade health check endpoint:

from flask import Flask, jsonify
from datetime import datetime
import os
import psycopg2
import redis

app = Flask(__name__)

# Database connection string (placeholder; supply your own via the environment)
DSN = os.environ.get("DATABASE_DSN", "dbname=app user=app host=localhost")

# Cache last successful dependency checks (module-level state shared by requests)
last_db_check = {"time": None, "status": "unknown"}
last_redis_check = {"time": None, "status": "unknown"}
CACHE_DURATION = 5  # seconds

@app.route('/ping')
def ping():
    """Shallow check: is the process responding?"""
    return jsonify({"status": "alive"}), 200

@app.route('/health')
def health():
    """Readiness check: can we handle requests?"""
    now = datetime.now().timestamp()

    # Check database (but cache it)
    if (not last_db_check["time"] or
        now - last_db_check["time"] > CACHE_DURATION):
        try:
            conn = psycopg2.connect(DSN, connect_timeout=2)  # short timeout so the check itself can't hang
            conn.close()
            last_db_check = {"time": now, "status": "healthy"}
        except Exception as e:
            last_db_check = {"time": now, "status": "unhealthy", "error": str(e)}

    # Check Redis (but cache it)
    if (not last_redis_check["time"] or
        now - last_redis_check["time"] > CACHE_DURATION):
        try:
            r = redis.Redis(socket_timeout=2)  # assumes a local Redis; the timeout keeps the check fast
            r.ping()
            last_redis_check = {"time": now, "status": "healthy"}
        except Exception as e:
            last_redis_check = {"time": now, "status": "unhealthy", "error": str(e)}

    # Determine overall status
    status = "healthy"
    if last_db_check["status"] != "healthy":
        status = "unhealthy"  # Critical dependency down
    elif last_redis_check["status"] != "healthy":
        status = "degraded"  # Optional dependency down

    return jsonify({
        "status": status,
        "version": "v1.2.3",
        "timestamp": datetime.utcnow().isoformat(),
        "dependencies": {
            "database": last_db_check["status"],
            "redis": last_redis_check["status"]
        }
    }), 200 if status != "unhealthy" else 503

The Zombie Instance Fix

Going back to our opening scenario: one instance has an exhausted database connection pool. The shallow health check (ping) passes. The deep health check (health) fails:

GET /health
{
  "status": "unhealthy",
  "database": "exhausted_connection_pool"
}
Returns 503 (Service Unavailable)

The readiness probe sees 503. After 2 failures (20 seconds), the pod is removed from the load balancer’s target list. No more traffic is sent to it. Users don’t see errors. The ops team gets a readiness probe failure alert and investigates: “Why are we out of database connections? Did a recent change increase connection usage?” They fix it (adjust connection pool, revert deploy), and the pod becomes healthy again.

Key Takeaways

  1. Health checks are your insurance policy against silent failures. A service that accepts connections but returns errors is worse than being completely down.

  2. Three probe types serve different purposes: Startup (has it finished initializing?), Liveness (is the process running?), Readiness (can it handle traffic?).

  3. Shallow checks (fast) for liveness, deep checks (comprehensive) for readiness. Health checks shouldn’t themselves become a performance problem.

  4. Cache health check results. Checking the database connection pool every 5 seconds across 100 pods creates a thundering herd.

  5. Don’t cascade health checks: Service A shouldn’t actively check Service B; instead, it should passively note dependency status.

  6. Use jitter to prevent synchronization: When all 50 pods start simultaneously and check dependencies at the same time, you create a spike. Randomize timings.

Practice Scenarios

Scenario 1: The Slow JVM Startup

You deploy a new version of your Java-based payment service. Kubernetes starts liveness probes 30 seconds after launch. But the JVM takes 2 minutes to warm up (Spring context loading, cache initialization). The liveness probe fails, Kubernetes restarts the pod. This repeats forever: CrashLoopBackOff.

Solution: Add a startup probe that allows 12 failures over 2 minutes (periodSeconds=10, failureThreshold=12). Now the startup probe gives the JVM time to initialize. Once that succeeds, liveness/readiness probes take over.
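
A sketch of that probe, using the numbers from the scenario:

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  periodSeconds: 10
  failureThreshold: 12     # 12 × 10s = up to 2 minutes of startup time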

Scenario 2: The Health Check Thundering Herd

You use auto-scaling. When traffic spikes at 2 PM, you scale from 10 to 50 instances in 10 seconds. Now all 50 instances start health-checking the database simultaneously: with a deep check every 5 seconds, that’s 600 extra queries per minute arriving in synchronized bursts, on top of the real traffic spike. The connection pool exhausts. All health checks fail. The pods fail readiness probes. They’re removed from the load balancer. Now you have 50 instances but zero are healthy.

Solution: Use jitter in health check intervals. Also, implement health check result caching (don’t actually ping the database every 5 seconds; cache the result and only check occasionally). This prevents one monitoring tool from becoming the bottleneck.

Scenario 3: The Cascading Health Check Failure

Your architecture has:

  • Service A (user-facing API)
  • Service B (payment processing)
  • Service C (fraud detection)
  • Service D (database)

Service A’s health check queries B’s health, which queries C’s health, which queries D’s health. When D is slow, C fails its health check, B fails, A fails. But A failing doesn’t mean users are actually seeing errors yet — maybe B can continue processing with a cached fraud list. Cascading health checks created a single point of failure.

Better: A’s health check checks only itself. It passively tracks B’s status (using cached results from B’s own health checks). If A detects B is down, it reports “degraded” but doesn’t fail. The operators see “Service A degraded due to Service B down” and fix B. Meanwhile, A might have a fallback (disable high-risk fraud checks, alert users) rather than returning errors to everyone.

Scenario 4: The Zombie Instance Detection

You have a production incident: 20% of requests are getting 500 errors. You check the logs and see one instance returning payment_gateway_timeout errors. But the load balancer is still sending traffic to it.

Investigation: The shallow health check (/ping) is passing (the process is running). But the deep health check shows the instance can’t reach the payment gateway. The readiness probe is configured but it’s only checking /ping, not /health.

Fix: Change the readiness probe to check /health (which includes payment gateway connectivity). Now when the gateway becomes unreachable, the instance is removed from traffic. Users aren’t affected.
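
The change is essentially a one-line path swap in the readiness probe (a sketch, assuming the same port and thresholds as before):

readinessProbe:
  httpGet:
    path: /health        # was /ping; /health includes payment gateway connectivity
    port: 8080
  periodSeconds: 10
  failureThreshold: 2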


We’ve now covered the full observability stack: metrics and distributed tracing (from previous chapters), centralized logging, alerting and incident response, and health checks. Together, these give you visibility into your system’s behavior and the tools to respond when things break.

In the next chapter, we shift gears from operations to security. We’ve built systems that are observable and reliable. Now we need to protect them.