Load Distribution Patterns
You’ve successfully scaled to multiple servers — but now you face a critical question: how does your traffic actually get distributed across them? In our previous sections, we explored horizontal scaling and database replication. Now we need the intelligence layer that sits between users and your server fleet. Without it, requests would pile up on the first server while others sit idle. This is where load distribution patterns become essential.
Load distribution isn’t just about spreading requests evenly. It’s about intelligently routing traffic based on server health, current load, user affinity, and geographic location. Think of it as the nervous system of your scaled infrastructure — it detects what’s happening and routes traffic accordingly.
How Load Balancers Work
A load balancer is a dedicated system (hardware or software) that sits between clients and your backend servers. When a request arrives, the load balancer decides which server should handle it based on its configured algorithm and the current state of your infrastructure.
There are two main layers where load balancing happens:
Layer 4 (Transport Layer) Load Balancing operates at the TCP/UDP level. It makes routing decisions based solely on IP protocol data — source IP, destination IP, and ports. This is fast because the load balancer doesn’t need to inspect the actual request content. It simply distributes connections to backend servers and maintains connection tables. Think of it as directing people to checkout lines without looking inside their shopping carts.
Layer 7 (Application Layer) Load Balancing inspects the actual application data — HTTP headers, request paths, hostnames, or cookie values. This allows for sophisticated routing decisions. You could route all requests for /api/v1/users to one server group and /api/v1/analytics to another. It’s more flexible but requires more CPU because the load balancer must parse and understand application protocols.
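As a sketch of what that application-aware decision looks like, the following Python fragment routes requests by path prefix. The pool names and addresses are hypothetical, and a real L7 balancer expresses this as declarative configuration (Nginx location blocks, ALB listener rules) rather than application code.
# Minimal sketch of Layer 7 path-prefix routing (illustrative addresses).
ROUTE_TABLE = [
    ("/api/v1/users",     ["10.0.1.10:8080", "10.0.1.11:8080"]),   # user-service pool
    ("/api/v1/analytics", ["10.0.2.10:8080", "10.0.2.11:8080"]),   # analytics pool
]
DEFAULT_POOL = ["10.0.3.10:8080"]

def select_pool(request_path: str) -> list[str]:
    """Return the backend pool whose prefix matches the request path."""
    for prefix, pool in ROUTE_TABLE:
        if request_path.startswith(prefix):
            return pool
    return DEFAULT_POOL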
Load Balancing Algorithms
Different algorithms suit different workloads:
Round-Robin distributes requests sequentially across servers. Server 1 gets request 1, server 2 gets request 2, server 3 gets request 3, then back to server 1. It’s simple and works well when servers have equal capacity and similar response times. However, it doesn’t account for server load or varying request complexity.
Least Connections routes new requests to the server currently handling the fewest active connections. If you have requests with varying durations, this prevents one server from becoming a bottleneck while others are idle. A batch processing server might handle one long-running request while others handle many quick ones.
Weighted Load Balancing assigns capacity weights to servers. If you have a beefy server with more resources, give it weight 3, while smaller servers get weight 1. The load balancer distributes traffic proportionally. This is practical in heterogeneous infrastructure where hardware differs.
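To make these first three strategies concrete, here is a minimal Python sketch of the selection logic. The addresses and weights are placeholders, and it simplifies reality: a real balancer maintains live connection tables, and Nginx uses a smooth weighted round-robin rather than the weighted random choice shown here.
import itertools
import random

servers = ["10.0.1.10:8080", "10.0.1.11:8080", "10.0.1.12:8080"]

# Round-robin: cycle through servers in a fixed order.
rr_cycle = itertools.cycle(servers)
def round_robin() -> str:
    return next(rr_cycle)

# Least connections: pick the server with the fewest active connections.
# In a real balancer this table is derived from live connection state.
active_connections = {s: 0 for s in servers}
def least_connections() -> str:
    return min(active_connections, key=active_connections.get)

# Weighted: the beefier server receives proportionally more traffic.
weights = {"10.0.1.10:8080": 3, "10.0.1.11:8080": 1, "10.0.1.12:8080": 1}
def weighted() -> str:
    return random.choices(list(weights), weights=list(weights.values()))[0]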
IP Hash uses the client’s IP address to determine routing. The same client always routes to the same server. This is useful for session affinity — if your application stores session state locally (not ideal, but sometimes necessary), you want requests from the same client hitting the same server. However, if that server fails, the client’s requests scatter randomly.
Consistent Hashing is more sophisticated. It maps both servers and request keys (client IDs, cache keys) onto a hash ring. When a server fails, only the keys that fall in that server's arc get redistributed to its neighbor, not all traffic. This minimizes cache invalidation and session loss, and it's essential for distributed caching layers.
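Here is a toy Python hash ring that illustrates the idea; it's a sketch only, since production implementations add virtual nodes (many ring points per server) to keep load even after failures. The naive IP-hash alternative is shown in a comment for contrast.
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash; Python's built-in hash() is randomized per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring (no virtual nodes, for clarity only)."""

    def __init__(self, servers):
        self.points = sorted((_hash(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        h = _hash(key)
        # First server point clockwise from the key's position (wraps around).
        idx = bisect.bisect(self.points, (h, "")) % len(self.points)
        return self.points[idx][1]

    def remove(self, server: str):
        # Only keys in the failed server's arc move to its clockwise neighbor.
        self.points = [p for p in self.points if p[1] != server]

# Naive IP hash, by contrast: server = servers[hash(ip) % len(servers)].
# Removing one server changes the modulus, so almost every key remaps.
ring = HashRing(["10.0.1.10", "10.0.1.11", "10.0.1.12"])
print(ring.lookup("client-203.0.113.7"))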
Least Response Time considers both connection count and server response latencies. Some servers might respond faster than others due to hardware differences or local caches. This algorithm ensures faster servers handle more traffic.
Reverse Proxies and Health Checks
A reverse proxy (often your load balancer) also handles SSL/TLS termination. Instead of each backend server managing encryption, the load balancer decrypts incoming HTTPS traffic and communicates with backends over plain HTTP (in a trusted internal network). This reduces CPU load on backend servers and centralizes certificate management.
Health checks keep your system resilient. The load balancer periodically probes backend servers — usually with HTTP requests to a specific endpoint like /health. If a server stops responding, the load balancer removes it from the active pool. When it recovers, health checks detect this and restore it. Active health checks explicitly probe servers, while passive checks infer health from request success/failure rates.
Global Server Load Balancing and Geo-Routing
For truly distributed systems, global server load balancing (GSLB) routes users to the nearest datacenter. DNS queries return different IPs based on the client’s geographic location or the current health status of regional datacenters. A user in Tokyo gets routed to your Tokyo datacenter, while one in New York hits your American infrastructure. This reduces latency and improves resilience against regional outages.
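A rough Python sketch of the decision a geo-DNS layer makes is shown below. The regions, IPs, and preference tables are illustrative; real GSLB systems rely on GeoIP databases, latency measurements, and live health probes rather than static tables.
REGIONS = {
    "us-east":      {"ip": "192.0.2.10",    "healthy": True},
    "eu-west":      {"ip": "198.51.100.10", "healthy": True},
    "ap-northeast": {"ip": "203.0.113.10",  "healthy": True},
}

# Preferred region order per client geography (nearest first).
PREFERENCE = {
    "US": ["us-east", "eu-west", "ap-northeast"],
    "JP": ["ap-northeast", "us-east", "eu-west"],
    "DE": ["eu-west", "us-east", "ap-northeast"],
}

def resolve(client_country: str) -> str:
    """Return the IP of the nearest healthy region, falling back if needed."""
    for region in PREFERENCE.get(client_country, list(REGIONS)):
        if REGIONS[region]["healthy"]:
            return REGIONS[region]["ip"]
    raise RuntimeError("no healthy region available")

REGIONS["us-east"]["healthy"] = False   # simulate a regional outage
print(resolve("US"))                    # now resolves to eu-west's IP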
Real-World Analogy: The Smart Airport Counter System
Imagine an airport check-in with multiple counters. A naive approach would assign customers in strict rotation — counter 1, then 2, then 3 — even if counter 2 has a customer with an extremely complex rebooking situation. A smarter system watches each counter’s queue length and directs new arrivals to the shortest line (least connections). International customers with VIP status get routed to dedicated counters (weighted routing). If counter 3’s agent becomes ill and steps away, customers are rerouted to active counters (health check failure). Customers with connecting flights get special routing to faster-processing counters. This mirrors how modern load balancing works — intelligent, adaptive, and aware of both capacity and current conditions.
Layer 4 vs Layer 7: Performance and Flexibility Trade-off
Layer 4 load balancing is blazingly fast. Since it only examines TCP/UDP headers, it can handle millions of requests per second with minimal CPU overhead. It’s ideal for non-HTTP protocols (gaming servers, databases, real-time streaming) or when you need extreme throughput. Most modern load balancers can handle L4 traffic at wire speed — the network bandwidth becomes the limit, not the load balancer’s processing.
Layer 7 adds flexibility at a performance cost. To inspect HTTP headers and make application-aware decisions, the load balancer must fully parse HTTP. For some workloads, this overhead is negligible. For others handling millions of requests per second, L7 load balancers might become the bottleneck. However, modern implementations using efficient parsing and bypassing (where high-traffic connections skip inspection after initial classification) minimize this overhead.
Hardware load balancers (like F5 BIG-IP) offer dedicated silicon optimized for packet processing, providing extreme performance at premium cost. Software load balancers like Nginx and HAProxy run on commodity hardware, offering flexibility and cost-effectiveness, and managed cloud services fill the same roles: AWS’s Network Load Balancer (NLB) handles L4 traffic at extreme throughput, while the Application Load Balancer (ALB) provides L7 capabilities.
Here’s how consistent hashing works in practice:
# Imagine a ring with 360 degrees
# Servers: A (at 0°), B (at 120°), C (at 240°)
# Client requests hash to: 45°, 150°, 280°

# Request hashing to 45°  → nearest server clockwise is B (120°)
# Request hashing to 150° → nearest server clockwise is C (240°)
# Request hashing to 280° → nearest server clockwise is A (wraps past 360° to 0°)

# Now server B fails. Only requests that hash between 0° and 120°
# (the arc B was responsible for) get redistributed to C, the next
# server clockwise. A's and C's existing traffic continues untouched.
# This minimizes cache invalidation.
Nginx Load Balancing Configuration Example
Here’s a practical Nginx setup:
upstream backend_servers {
    # Simple round-robin (the default)
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;

    # Passive health checks are built into the upstream module:
    # server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    # (Active health checks require NGINX Plus or a third-party module.)
}
server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://backend_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Connection timeout settings
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
To use least connections instead of round-robin:
upstream backend_servers {
    least_conn;  # Use the least-connections algorithm
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}
For weighted load balancing:
upstream backend_servers {
    server 10.0.1.10:8080 weight=3;  # Handles 3x more traffic
    server 10.0.1.11:8080 weight=1;
    server 10.0.1.12:8080 weight=1;
}
Health Checks in Action
Here’s a simple health check endpoint your backend should implement:
# Flask example
from flask import Flask

app = Flask(__name__)

@app.route('/health')
def health_check():
    # Check critical dependencies (database_connected and cache_available
    # stand in for your own dependency checks)
    if not database_connected():
        return {'status': 'unhealthy'}, 503
    if not cache_available():
        return {'status': 'degraded'}, 200  # Still serve, but marked degraded
    return {'status': 'healthy'}, 200
The load balancer periodically hits this endpoint:
# HAProxy health check configuration
backend api_servers
    balance leastconn
    option httpchk GET /health
    server api1 10.0.1.10:8080 check inter 5s fall 2 rise 3
    server api2 10.0.1.11:8080 check inter 5s fall 2 rise 3
This means: check every 5 seconds (inter 5s), mark server as down after 2 consecutive failures (fall 2), mark as up after 3 consecutive successes (rise 3).
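The fall/rise counters amount to a small state machine. Here is a hedged Python sketch of that logic, using the same thresholds as the HAProxy config above; it's useful for reasoning about how quickly a flapping server is removed and restored.
class HealthTracker:
    """Marks a server down after `fall` consecutive failed probes and
    up again after `rise` consecutive successful ones."""

    def __init__(self, fall=2, rise=3):
        self.fall, self.rise = fall, rise
        self.healthy = True
        self.streak = 0  # consecutive probes contradicting the current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self.streak = 0          # outcome matches current state; reset
            return self.healthy
        self.streak += 1
        threshold = self.rise if probe_ok else self.fall
        if self.streak >= threshold:
            self.healthy = probe_ok  # flip state after enough consecutive probes
            self.streak = 0
        return self.healthy

tracker = HealthTracker(fall=2, rise=3)
for ok in [False, False, True, True, True]:
    print(tracker.record(ok))  # goes down after the 2nd failure, up after the 3rd success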
Load Balancing Algorithm Selection Guide
| Scenario | Recommended Algorithm | Why |
|---|---|---|
| Stateless API with uniform requests | Round-Robin | Simple, fair distribution |
| Varying request durations (batch processing) | Least Connections | Prevents overloading servers with long requests |
| Heterogeneous server capacity | Weighted | Leverages different hardware capabilities |
| Users with local sessions | IP Hash or Consistent Hash | Maintains session affinity |
| Large distributed cache layer | Consistent Hashing | Minimizes cache invalidation on server failure |
| Real-time gaming or streaming | L4 Round-Robin | Minimizes latency, avoids L7 overhead |
| Microservice routing by path | L7 with path-based rules | Directs /users and /orders to different backends |
Load Balancer Resilience
Here’s where many teams stumble: your load balancer can become a single point of failure (SPOF). If it crashes, all traffic stops.
Active-Passive Configuration: You run two load balancers. One actively serves traffic while the other stays on standby. A heartbeat mechanism detects failure and triggers failover. The passive one takes over using a floating virtual IP. This has downtime during failover (typically seconds) and wastes one load balancer’s capacity, but it’s simple and reliable.
Active-Active Configuration: Both load balancers serve traffic simultaneously, usually in roughly equal shares. This requires careful state management: session tables and connection-tracking state must either be synchronized between the two or avoided entirely through stateless design. Done well, you get redundancy without wasted capacity and near-zero failover time.
DNS-Based Load Balancing: Return multiple A records from DNS, one for each load balancer IP. The client (or resolver) picks one. This is resilient at scale but has limitations: clients cache DNS results, and failover happens only when the cache expires (typically minutes). Not suitable for fast failure recovery.
Pro tip: run a single load balancer in development, active-passive for most production systems, and active-active only once you’ve mastered the operational complexity.
Handling Stateless vs Stateful Workloads
For stateless services (most APIs), any load balancing algorithm works fine. Requests are independent; it doesn’t matter which server handles them.
For stateful workloads (WebSockets, file uploads in progress, authentication sessions), you need affinity. IP hash or consistent hashing ensures the same client returns to the same server. Alternatively, store state externally (Redis, a database) so any server can handle any request — the best approach when feasible.
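If you take the external-state route, the pattern is simple: write the session to a shared store keyed by a session ID, and let any backend load it. Below is a minimal sketch using Redis via the redis-py client; the host, key names, and TTL are illustrative assumptions.
import json
import uuid

import redis

# Shared session store reachable from every backend (illustrative address).
r = redis.Redis(host="10.0.5.5", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 3600

def create_session(user_id: str) -> str:
    """Store session state centrally and hand the client an opaque ID."""
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS,
            json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id: str):
    """Any backend server can rehydrate the session from the shared store."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None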
Mermaid Diagram: Load Balancing Topology
graph TB
Users["Users across the internet"]
DNS["DNS (api.example.com)"]
LB1["Load Balancer 1<br/>(Active)"]
LB2["Load Balancer 2<br/>(Passive/Backup)"]
API1["API Server 1"]
API2["API Server 2"]
API3["API Server 3"]
HealthCheck["Health Check Service"]
Users -->|Query DNS| DNS
DNS -->|Returns LB1 IP| Users
Users -->|HTTP Requests| LB1
LB1 -->|Failover signal| LB2
LB1 -->|Round-robin| API1
LB1 -->|Round-robin| API2
LB1 -->|Round-robin| API3
LB1 -->|Periodic health checks| HealthCheck
HealthCheck -->|Pings /health| API1
HealthCheck -->|Pings /health| API2
HealthCheck -->|Pings /health| API3
style LB1 fill:#4CAF50
style LB2 fill:#FF9800
style API1 fill:#2196F3
style API2 fill:#2196F3
style API3 fill:#2196F3
Key Takeaways
- Load balancers distribute traffic across servers using algorithms that range from simple round-robin to sophisticated consistent hashing, matching the distribution strategy to your workload characteristics.
- Layer 4 load balancing offers extreme performance for any protocol, while Layer 7 provides application-aware routing at the cost of CPU overhead.
- Health checks enable automatic failover, removing unhealthy servers and restoring them when they recover, keeping your system resilient.
- Consistent hashing is critical for systems with distributed caches or stateful services, minimizing data loss when servers fail.
- Load balancers themselves must be made redundant through active-passive or active-active configurations to avoid becoming a single point of failure.
- Choosing the right algorithm depends on your workload: stateless APIs, session affinity needs, server heterogeneity, and request duration variance all influence the best choice.
Practice Scenarios
Scenario 1: The Overloaded Payment Service Your payment processing service has three identical servers, but during peak hours one server consistently carries 60% of the active load while the others carry 20% each, even though round-robin is distributing request counts evenly. Average response times are degrading. What algorithm would you switch to, and why? How would you verify the problem isn’t elsewhere (e.g., a database bottleneck)?
Scenario 2: Cache Invalidation After Server Failure You’re running a distributed caching layer with consistent hashing. When one cache server fails, you want to minimize the fraction of cache entries that need to be redistributed. Given 10 servers with 1 million total cache entries, calculate roughly how many entries would need rehashing with consistent hashing versus naive modulo hashing (hash(key) % N). Why is this difference important at scale?
Scenario 3: Designing for Geographic Distribution You operate services in three regions: US, Europe, and Asia-Pacific. Latency from users to different regions varies significantly. Propose a GSLB strategy using DNS and health checks to route users to the closest healthy region. What happens if the US region becomes unhealthy? How would you handle a user in the US whose requests keep timing out?
As you’ve learned, load balancing is the intelligence layer that makes horizontal scaling effective. But load balancing is just the beginning — what happens when you need to add servers dynamically during traffic spikes, or remove them during quiet periods? That’s where auto-scaling comes in. In the next section, we’ll explore how systems decide to scale up and down automatically, maintaining performance while controlling costs. You’ll see how metrics drive scaling decisions and how to prevent the cascade failures that occur when scaling logic goes wrong.