System Design Fundamentals

Load Balancing Algorithms

Deciding Where Each Request Goes

Now that we understand what load balancers do, we face a critical question: which server receives each incoming request? This decision—seemingly simple—has profound implications for system performance, user experience, and cost efficiency. We could send requests to servers in a round-robin fashion, route them to the least-busy server, or ensure the same client always reaches the same destination. Each approach trades off simplicity against optimal resource utilization. In this chapter, we’ll explore the algorithms that power modern load balancers and learn when to use each one.

Different applications demand different distribution strategies. A stateless REST API might benefit from simple round-robin distribution, while a real-time chat application requires requests from the same user to reach the same backend instance to maintain WebSocket connections. An e-commerce platform might need weighted distribution to account for servers with different hardware capabilities. Understanding these algorithms gives you the tools to architect systems that distribute load intelligently rather than blindly.

This chapter builds directly on our previous discussion of load balancer fundamentals. While we focused on the “what” and “why” of load balancing, we now examine the “how”—the decision-making logic that lives inside load balancers and determines every request’s destination.

The Core Algorithms

Let’s examine the most widely-used load balancing algorithms, starting with the simplest and building toward more sophisticated approaches.

Round Robin

The simplest algorithm treats all servers as equals and routes requests in sequential order: first request to server 1, second to server 2, third to server 3, then back to server 1. Think of it as a carousel—requests rotate through the server list continuously.

Advantages: Implementation is trivial; no state tracking required; works well when servers have similar capacity and request complexity is uniform.

Disadvantages: Ignores actual server load. If one request takes 10 seconds and another takes 100ms, you’ll have uneven utilization. A newly-restarted server gets the same traffic as a fully-warmed server.

Weighted Round Robin

This variant acknowledges that servers aren’t created equal. Some might have more CPU cores, more memory, or better networking hardware. You assign each server a weight—perhaps server 1 handles 1 unit of traffic, server 2 handles 2 units, server 3 handles 4 units.

Instead of cycling through each server once, the load balancer visits servers in proportion to their weights. If weights are [1, 2, 4], one cycle becomes: S1, S2, S2, S3, S3, S3, S3, and the pattern then repeats indefinitely.
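
A minimal sketch of one common implementation (assuming small integer weights): expand each server into the cycle once per unit of weight, then round-robin over the expanded list.

class WeightedRoundRobinBalancer:
    def __init__(self, weights):
        # e.g. {"S1": 1, "S2": 2, "S3": 4} -> [S1, S2, S2, S3, S3, S3, S3]
        self.cycle = [s for s, w in weights.items() for _ in range(w)]
        self.current = 0

    def select_server(self, request):
        server = self.cycle[self.current]
        self.current = (self.current + 1) % len(self.cycle)
        return server

Note that this naive expansion sends bursts of consecutive requests to the heavier servers; production balancers such as nginx instead use a "smooth" weighted round robin that interleaves picks.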

When to use: When your backend servers have heterogeneous hardware. You know server A can handle twice the load of server B? Weight it accordingly.

Least Connections

Rather than following a predetermined pattern, least connections tracks how many active connections each server currently handles. When a new request arrives, the load balancer sends it to whichever server has the fewest open connections.

This requires maintaining state—the load balancer must track connection counts for each backend. But the benefit is substantial: if one server is processing a few long-lived connections while others are idle, new requests intelligently route to the idle servers.

When to use: Systems with long-lived connections (databases, WebSockets, gRPC streams) where connection duration varies significantly. Short-lived HTTP requests with uniform complexity? Stick with round robin.

Weighted Least Connections

Combine the previous two approaches: track connection counts but weight servers differently. Server A with weight 2 is considered “full” at 8 connections while server B with weight 1 is full at 4. The algorithm sends new requests to the server with the lowest weighted connection ratio.

Formula: (active_connections / weight)

Send the new connection to the server with the lowest ratio.
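
As a sketch, the selection step divides each server's active connection count by its weight and picks the minimum:

class WeightedLeastConnectionsBalancer:
    def __init__(self, weights):
        self.weights = weights                       # {server: weight}
        self.connections = {s: 0 for s in weights}   # {server: active count}

    def select_server(self, request):
        # Lowest active_connections / weight wins
        server = min(self.weights,
                     key=lambda s: self.connections[s] / self.weights[s])
        self.connections[server] += 1
        return server

    def connection_closed(self, server):
        self.connections[server] -= 1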

IP Hash (Consistent Hashing)

Hash the client’s IP address to deterministically map it to a specific backend server. All requests from the same client always reach the same server.

Done naively, this is simple modulo hashing; done well, it uses consistent hashing, which minimizes server reassignments when the pool changes. If a server dies and you remove it from the pool, only requests originally destined for that server redistribute; others remain unaffected.

Use case: Session affinity / sticky sessions. Your application stores state in-process (bad practice, but common in legacy systems). Or you’re building a game where players must maintain a connection to the same game server.

Critical caveat: If a client’s IP address changes (mobile network switching), they’ll be routed to a different server and lose their session. IP hash also doesn’t account for actual load—one server might receive 10x more traffic if certain IP ranges have many users.

Random

Select a backend uniformly at random. Surprisingly effective when servers have equal capacity! The random distribution approximates round-robin over time but requires no sequential ordering logic.

Interesting fact: random actually performs nearly identically to round robin in most scenarios. The per-request overhead of computing a hash or synchronizing a shared sequence position often exceeds the benefit of perfectly even distribution.
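
The implementation is about as small as a balancer gets; a sketch:

import random

class RandomBalancer:
    def __init__(self, servers):
        self.servers = servers

    def select_server(self, request):
        # No shared state, so nothing to synchronize across threads
        return random.choice(self.servers)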

Least Response Time

Track not just connection count but actual response latencies. Measure how quickly each server responds to requests and preferentially send new requests to the fastest-responding server.

This requires active health checking and response time sampling. It’s more sophisticated than least connections but better handles performance variations (e.g., one server is CPU-constrained while another is network-limited).
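
One common way to track "fastest-responding" is an exponentially weighted moving average (EWMA) of observed latencies. A sketch (the smoothing factor 0.2 is an arbitrary choice here):

class LeastResponseTimeBalancer:
    def __init__(self, servers, alpha=0.2):
        self.servers = servers
        self.alpha = alpha  # EWMA smoothing factor
        self.avg_latency = {s: 0.0 for s in servers}

    def select_server(self, request):
        # Prefer the server with the lowest smoothed latency
        return min(self.servers, key=lambda s: self.avg_latency[s])

    def record_response(self, server, latency_ms):
        # Blend each new sample into the running average
        old = self.avg_latency[server]
        self.avg_latency[server] = self.alpha * latency_ms + (1 - self.alpha) * old

A real implementation also needs a warm-up strategy, since a freshly added server starts at zero latency and would otherwise absorb all new traffic.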

Comparison Table

Algorithm            State Tracking              Best For                    CPU Overhead  Session Affinity
Round Robin          None                        Uniform stateless requests  Very Low      No
Weighted RR          None                        Heterogeneous servers       Very Low      No
Least Connections    Connection counts           Long-lived connections      Medium        No
Weighted Least Conn  Connection counts, weights  Mixed workloads             Medium        No
IP Hash              Hash function               Session stickiness          Low           Yes
Random               None                        Stateless, uniform          Low           No
Least Response Time  Response latencies          Variable performance        High          No

The Restaurant Seating Analogy

Imagine you’re a restaurant host managing table assignments. Round robin seating assigns tables sequentially—table 1, table 2, table 3—regardless of how many people are eating. If table 1 has a family of eight eating slowly while tables 2 and 3 are empty, your latest guests wait while tables sit unused.

Least connections seating looks at current occupancy ("Which section has the fewest people?") and seats the new party there. This dynamic allocation improves utilization significantly.

Weighted seating accounts for capacity: experienced waitstaff handle larger sections. Your most seasoned waiter might be weighted 3x because she can manage three times the table count. New hires start at weight 1 until they’re trained.

IP hash (sticky seating) means once a customer sits, they always return to the same section: same tables, same waiter, who even remembers their coffee order. Perfect for regulars who value consistency, but problematic if that waiter calls in sick.

How Algorithms Decide: The Technical Perspective

Let’s examine how these algorithms actually work under the hood, starting with simple implementations and building toward sophisticated systems.

Round Robin Implementation

class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.current = 0

    def select_server(self, request):
        server = self.servers[self.current]
        self.current = (self.current + 1) % len(self.servers)
        return server

This implementation maintains a single integer—the next server index. Each request increments it. Thread-safe implementations use atomic operations or a lock, as sketched below. The appeal: it’s bulletproof in its simplicity.
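
In Python, for example, a thread-safe variant can guard the counter with a lock (a sketch; languages with atomic integers can use a fetch-and-add instead):

import threading

class ThreadSafeRoundRobinBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.current = 0
        self.lock = threading.Lock()

    def select_server(self, request):
        # The lock makes the read-increment-wrap sequence atomic
        with self.lock:
            server = self.servers[self.current]
            self.current = (self.current + 1) % len(self.servers)
        return server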

Least Connections Implementation

class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.connections = {s: 0 for s in servers}

    def select_server(self, request):
        # Find server with minimum connections
        server = min(self.servers,
                     key=lambda s: self.connections[s])
        self.connections[server] += 1
        return server

    def connection_closed(self, server):
        self.connections[server] -= 1

This requires tracking active connections. When a connection closes, the count decrements. The load balancer needs visibility into connection lifecycle—it must know when connections end. This is why load balancers typically sit in the TCP/connection-handling layer.

Consistent Hashing for IP Hash

import hashlib

class IPHashBalancer:
    def __init__(self, servers):
        self.servers = sorted(servers)

    def select_server(self, client_ip):
        # Hash the IP to a number
        hash_value = int(hashlib.md5(
            client_ip.encode()).hexdigest(), 16)

        # Map to a server
        index = hash_value % len(self.servers)
        return self.servers[index]

Simple modulo hashing, but it has a weakness: if a server dies and you remove it from the list, all hashes change. Most requests get reassigned, breaking session affinity for most users. Consistent hashing solves this by mapping both clients and servers to a hash ring. When a server leaves, only the “neighbors” on the ring are affected.
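
A minimal hash ring sketch (the class name and the choice of 100 virtual nodes per server are illustrative; virtual nodes smooth out the distribution):

import bisect
import hashlib

class ConsistentHashBalancer:
    def __init__(self, servers, vnodes=100):
        # Place each server at many points on the ring via virtual nodes
        self.ring = []  # sorted list of (hash, server) tuples
        for server in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def select_server(self, client_ip):
        # Walk clockwise from the client's position to the next server
        h = self._hash(client_ip)
        index = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[index][1]

Removing a server now deletes only its ring entries; clients hashed elsewhere keep their assignments.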

Request Distribution Visualization

graph TD
    A["Incoming Requests"] --> B{Load Balancing Algorithm}
    B -->|Round Robin| C1["S1"]
    B -->|Round Robin| C2["S2"]
    B -->|Round Robin| C3["S3"]
    B -->|Least Conn<br/>2 active| D1["S1"]
    B -->|Least Conn<br/>1 active| D2["S2"]
    B -->|Least Conn<br/>3 active| D3["S3"]
    B -->|IP Hash<br/>10.0.0.1| E1["S1"]
    B -->|IP Hash<br/>10.0.0.2| E2["S2"]
    B -->|IP Hash<br/>10.0.0.3| E3["S3"]

How Algorithms Interact with Health Checks

Health checks remove unhealthy servers from the rotation, but algorithms respond differently:

  • Round Robin: Skips dead servers. If S2 is down, it routes: S1, S3, S1, S3…
  • Least Connections: Never considers dead servers when selecting.
  • IP Hash: Reassigns clients whose server is down, using consistent hashing to minimize impact.

A server in “draining” state (graceful shutdown) accepts existing connections but receives no new ones. Least connections naturally handles this—new connections route elsewhere while active ones finish.
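
A sketch of how a least-connections balancer might honor draining (the draining set and method names are illustrative, not from any particular product):

class DrainAwareLeastConnectionsBalancer:
    def __init__(self, servers):
        self.connections = {s: 0 for s in servers}
        self.draining = set()

    def start_drain(self, server):
        # Stop routing new connections; existing ones finish naturally
        self.draining.add(server)

    def select_server(self, request):
        # Assumes at least one server is not draining
        candidates = [s for s in self.connections if s not in self.draining]
        server = min(candidates, key=lambda s: self.connections[s])
        self.connections[server] += 1
        return server

    def connection_closed(self, server):
        self.connections[server] -= 1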

Real-World Applications

Nginx Configuration Examples

Nginx lets you specify the algorithm in its upstream block:

upstream backend_pool {
    # Round Robin (default)
    server backend1.example.com;
    server backend2.example.com;
    server backend3.example.com;
}

upstream backend_weighted {
    # Weighted Round Robin
    server backend1.example.com weight=1;
    server backend2.example.com weight=2;
    server backend3.example.com weight=4;
}

upstream backend_least_conn {
    # Least Connections
    least_conn;
    server backend1.example.com;
    server backend2.example.com;
}

upstream backend_hash {
    # IP Hash for session affinity
    hash $remote_addr consistent;
    server backend1.example.com;
    server backend2.example.com;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend_weighted;
    }
}

Netflix’s Algorithm Strategy

Netflix uses different algorithms for different workloads:

  • Stateless API calls: Weighted round-robin. They know their server capacity profiles and weight accordingly.
  • Streaming connections: Least connections. Streaming sessions are long-lived; evenly distributing active sessions matters more than request count.
  • Cache layers: IP hash with consistent hashing. They want the same user’s requests hitting the same cache to maximize hit rates.

Choosing for E-Commerce vs. Chat

E-commerce platform: You’re processing independent requests (browse products, add to cart, checkout). Stateless servers. Use weighted round-robin if servers vary in power, or simple round-robin if homogeneous. Health checks disable slow servers automatically. Session data lives in Redis, not in-process.

Real-time chat application: Users maintain WebSocket connections to receive messages. Each connection is long-lived. Use least connections so new users are routed to the least-busy server. Or use IP hash if you want geographic/network affinity. A user’s chat history and presence data is tied to their specific connection.

Trade-offs You’ll Face

Simplicity vs. Optimal Distribution

Round robin is beautifully simple. It requires nearly zero CPU overhead—just an increment and modulo. But it’s dumb. If you add weighted round robin, you add marginally more complexity but account for heterogeneous servers. Least connections adds significant complexity (tracking connection state) but dramatically improves distribution for real workloads.

Pro tip: Start simple. Round robin works for most scenarios. Only add complexity if profiling shows uneven server utilization.

Session Affinity vs. Even Distribution

IP hash gives you sticky sessions but sacrifices perfect load balance. Certain IP ranges might have more traffic. Removing servers causes reassignments, breaking sessions.

Alternatively, don’t use session affinity at all. Store session data in Redis or a database. Any server can handle any request. This requires more infrastructure but enables flexible scaling and canary deployments.

Tracking Overhead

Least connections requires the load balancer to track every active connection. High-traffic systems might handle millions of concurrent connections. Each selection becomes an O(n) scan of the connection map (or O(log n) with a heap, but still overhead).

Round robin is a single atomic increment. IP hash is a single hash function. The computational difference is tiny on modern hardware, but at trillion-request scale, it matters.

When Random Actually Wins

Here’s a counterintuitive insight: random selection often performs as well as more sophisticated algorithms in well-designed systems. Why? If your servers are homogeneous and requests are similar in cost, randomness naturally produces balanced distribution. If they’re not—some servers are slow, some requests are expensive—then random fails. But sophisticated algorithms also fail if you don’t maintain accurate metadata (connection counts, response times).

Random wins when it’s accurate enough and has less overhead than tracking state.

Key Takeaways

  • Round robin distributes requests sequentially; use for stateless workloads with homogeneous servers
  • Weighted round robin accounts for different server capacities without tracking state
  • Least connections adapts to actual load; best for long-lived connections where duration varies
  • IP hash provides session affinity by consistently routing the same client to the same server
  • Consistent hashing minimizes disruption when the server pool changes
  • Choose algorithms based on your workload characteristics: stateless vs. stateful, uniform vs. variable cost, session affinity requirements

Practice Scenarios

Scenario 1: You’ve built a microservice that processes image uploads. Each upload takes 200ms to 30 seconds depending on file size. You’ve deployed it across three identical servers. Requests are stateless—no session data. Which algorithm would you choose, and why? How would you evolve your choice if you discovered image uploads from certain geographies consistently take longer?

Scenario 2: You’re redesigning a multiplayer game’s matchmaking service. Players connect to a server and maintain a persistent connection for the game session (average 45 minutes). The server stores player state in-memory, so disconnecting and reconnecting to a different server loses the session. You plan to scale from 10 servers to 100. What algorithm ensures players don’t disconnect mid-game? What happens when you deploy a new version and need to gracefully drain servers?

Scenario 3: Your API gateway handles 50,000 requests per second across 20 backend services, each with different hardware profiles. Some services scale to handle 10x more traffic than others based on their resource needs. Requests are independent. What’s your load balancing strategy? How do you account for service heterogeneity?

Connecting Forward

We’ve now explored how load balancers choose which server to send each request to. But there’s another dimension we haven’t addressed: at which network layer does load balancing happen? An IP hash algorithm can be implemented at layer 4 (TCP/IP) where we only see addresses and ports, or at layer 7 (application) where we understand HTTP headers, request paths, and content. This choice between layer 4 and layer 7 load balancing has profound implications for what information you can use to make routing decisions—and that’s exactly what we’ll dive into next.