System Design Fundamentals

What is Load Balancing?

The Traffic Cop Problem

Imagine you’ve built a successful online store. Your single server handles hundreds of requests per second, and business is booming. Then, one Friday evening during a flash sale, traffic spikes 10x. Your server becomes overwhelmed, response times crawl, and customers abandon their carts. Sound familiar? This is the fundamental challenge every growing system faces.

In Chapter 4, we explored horizontal scaling—the idea that instead of buying a massive server, you can buy many smaller ones and distribute work across them. But here’s the catch: if you have ten servers, how do incoming requests know which server to use? How do you ensure each server gets a fair share of the work? What happens if a server fails mid-request? You need something to act as a traffic cop, directing requests intelligently across your fleet. Enter load balancing.

By the end of this chapter, you’ll understand what load balancers are, how they work, where they fit in your architecture, and why they’re essential for scaling. We’ll explore different types of load balancers, see them in action, and discuss when (and when not) to use them.

What Load Balancing Really Does

At its core, load balancing is beautifully simple: it distributes incoming requests across multiple servers, ensuring no single server becomes a bottleneck. Instead of all traffic hitting one address, a load balancer sits in front of your servers, acts as a “smart router,” and decides which server should handle each request.

Think of a load balancer as an intelligent reverse proxy. While a reverse proxy’s primary job is to hide your backend servers from the public internet, a load balancer adds the critical feature of distributing work. It inspects incoming requests, tracks the health and capacity of backend servers, and makes routing decisions based on a configurable algorithm.

Load balancers come in two flavors: hardware and software. Hardware load balancers are expensive, powerful physical appliances you install in your data center—they can handle millions of connections per second. Software load balancers are programs running on standard servers. They’re cheaper, easier to configure, and more flexible, making them the default choice for modern systems. Popular software load balancers include Nginx, HAProxy, and Envoy; managed cloud services such as AWS’s Elastic Load Balancing fill the same role without you operating the load balancer yourself.

The load balancer sits at a strategic point in your architecture: between clients and your backend servers. When a client connects, they actually connect to the load balancer’s IP address, not the servers directly. The load balancer terminates the connection, processes the request (or forwards it), and decides which backend server to route to. From the client’s perspective, they’re talking to one address. Behind the scenes, work is distributed.
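
To make that flow concrete, here is a minimal sketch of the idea in Python, using only the standard library. It is a toy, not a production load balancer: header forwarding and error handling are omitted, and the backend addresses are placeholders. The "load balancer" listens on one address, picks a backend for each request, and relays the response, so the client only ever sees the balancer's address.

# Toy illustration of the client -> load balancer -> backend flow.
# Backend addresses are hypothetical placeholders.
import itertools
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

BACKENDS = ["http://192.168.1.10:8080", "http://192.168.1.11:8080"]
pool = itertools.cycle(BACKENDS)  # round-robin iterator over the backends

class Balancer(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next(pool)  # decide which backend handles this request
        with urllib.request.urlopen(backend + self.path) as upstream:
            body = upstream.read()
            status = upstream.status
        # Relay the backend's response to the client.
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8000), Balancer).serve_forever()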

One obvious concern: if the load balancer itself fails, your entire system goes down. This is solved through redundancy. In production systems, you run at least two load balancers in an active-passive or active-active configuration, often paired with a virtual IP (VIP) managed by a failover mechanism like Keepalived. If the primary load balancer fails, traffic automatically switches to the secondary.

An important distinction: not every reverse proxy is a load balancer. A load balancer must also track server health, handle failover, and actively distribute requests across a pool of servers—that’s what sets it apart from a plain reverse proxy.

The Manager at the Checkout

Picture a busy supermarket on Saturday afternoon. Ten checkout lanes are available, but all customers are lined up at lane one while lanes two through ten sit empty. A frustrated manager notices this and begins directing customers to available lanes. “You, go to lane five. You, lane three.” Immediately, the bottleneck dissolves. Customers check out faster, the system is more efficient, and everyone leaves happier.

Without the manager, customers have no visibility into which lanes are open, and they default to the first lane they see. The manager is the load balancer—they have a complete view of system state and make intelligent routing decisions. Your backend servers are the checkout lanes. Your incoming requests are the customers.

How Load Balancers Work at Scale

When you set up a load balancer, you define a backend pool: a set of healthy servers ready to handle traffic. The load balancer continuously health-checks these servers (often via HTTP GET requests to a /health endpoint). If a server stops responding or returns an error, it’s marked unhealthy and removed from the pool.
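
A health checker can be as simple as a loop that probes each backend and updates the pool. The sketch below (standard-library Python, with made-up addresses and thresholds) shows the idea: a backend that fails the probe is dropped from the healthy set and rejoins once it recovers.

# Minimal active health checker: probe GET /health on every backend
# and keep a set of the ones that respond with HTTP 200.
import time
import urllib.request

BACKENDS = ["http://192.168.1.10:8080", "http://192.168.1.11:8080",
            "http://192.168.1.12:8080"]  # hypothetical backend pool

def probe(base_url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False  # connection refused, timeout, HTTP error, ...

def health_check_loop(interval: float = 10.0) -> None:
    healthy = set(BACKENDS)
    while True:
        for backend in BACKENDS:
            if probe(backend):
                healthy.add(backend)       # recovered servers rejoin the pool
            else:
                healthy.discard(backend)   # failing servers stop receiving traffic
        print(f"healthy pool: {sorted(healthy)}")
        time.sleep(interval)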

Each incoming request goes through the load balancer’s decision logic. The simplest algorithm is round-robin: distribute requests evenly across all healthy servers. Request one goes to server A, request two to server B, request three to server C, then back to A. More sophisticated algorithms consider server load, latency, or geographic location.
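
In code, the decision logic is just a function from "pool of healthy servers" to "server for this request." The sketch below, with hypothetical server names, shows plain round-robin next to a least-connections variant as a preview of the algorithms covered in the next section.

import itertools

class RoundRobin:
    """Hand each request to the next server in a fixed rotation."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Hand each request to the server with the fewest active requests."""
    def __init__(self, servers):
        self._active = {server: 0 for server in servers}

    def pick(self):
        server = min(self._active, key=self._active.get)
        self._active[server] += 1   # caller should call done() when finished
        return server

    def done(self, server):
        self._active[server] -= 1

# Hypothetical usage:
lb = RoundRobin(["server-a", "server-b", "server-c"])
print([lb.pick() for _ in range(4)])  # ['server-a', 'server-b', 'server-c', 'server-a']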

At the network level, the load balancer can operate at Layer 4 (transport) or Layer 7 (application). A Layer 4 load balancer (like AWS’s Network Load Balancer) makes routing decisions based on IP addresses, ports, and protocol—it’s extremely fast but knows nothing about your application. A Layer 7 load balancer (like AWS’s Application Load Balancer) parses the HTTP request itself and can route based on URL path, hostname, or headers. Layer 7 adds processing overhead but is far more flexible.
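
The practical difference shows up in what information the routing decision can use. The sketch below contrasts the two: a Layer 4 decision sees only the connection's addresses, while a Layer 7 decision can look at the parsed HTTP request. The pools and routing rules here are hypothetical.

import hashlib

L4_POOL = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]          # generic app servers
L7_POOLS = {"api": ["10.0.1.1", "10.0.1.2"],             # path-based pools
            "static": ["10.0.2.1"],
            "default": ["10.0.0.1", "10.0.0.2"]}

def layer4_pick(client_ip: str, client_port: int) -> str:
    # Only the connection tuple is visible: hash it onto the pool.
    key = f"{client_ip}:{client_port}".encode()
    index = int(hashlib.sha256(key).hexdigest(), 16) % len(L4_POOL)
    return L4_POOL[index]

def layer7_pick(host: str, path: str) -> str:
    # The full HTTP request is visible: route on hostname and URL path.
    if host.startswith("static."):
        pool = L7_POOLS["static"]
    elif path.startswith("/api/"):
        pool = L7_POOLS["api"]
    else:
        pool = L7_POOLS["default"]
    return pool[0]  # a real balancer would also balance within the pool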

DNS-based load balancing is another approach: instead of a single load balancer, you publish multiple IP addresses for your domain. When a client does a DNS lookup, it receives one of several server IPs (resolvers typically rotate the order of the records). This is cheap and simple but gives up fine-grained control: plain DNS round-robin doesn’t health-check servers, and because clients and resolvers cache records until the TTL expires, you can’t rebalance traffic or pull a failed server out of rotation quickly.
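
You can see DNS-based distribution from the client side: a lookup for a name with multiple A records returns several addresses, and the client simply uses one of them. A small sketch, assuming a domain that publishes more than one record:

import random
import socket

def resolve_all(hostname: str, port: int = 80) -> list:
    # getaddrinfo returns every address contained in the DNS answer.
    results = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({sockaddr[0] for *_, sockaddr in results})

addresses = resolve_all("example.com")    # assumed to publish several A records
print(addresses)
print("connecting to", random.choice(addresses))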

Here’s an example of what a typical architecture looks like:

graph TB
    clients["Clients (Users)"]
    lb["Load Balancer<br/>(Nginx, HAProxy, or Cloud LB)"]
    s1["Server 1<br/>(Active)"]
    s2["Server 2<br/>(Active)"]
    s3["Server 3<br/>(Active)"]
    db[(Database)]
    cache["Cache<br/>(Redis/Memcached)"]

    clients -->|HTTP/HTTPS| lb
    lb -->|Route request| s1
    lb -->|Route request| s2
    lb -->|Route request| s3
    s1 -->|Read/Write| db
    s2 -->|Read/Write| db
    s3 -->|Read/Write| db
    s1 -->|Cache| cache
    s2 -->|Cache| cache
    s3 -->|Cache| cache

Two important operational concerns: connection draining and session persistence. When you need to take a server down for maintenance, you don’t want to drop active connections. Instead, the load balancer stops sending new requests to that server but allows existing connections to finish gracefully. This is connection draining.
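
Conceptually, draining is just a state on the backend entry: stop handing it new requests, keep counting its in-flight ones, and remove it once the count reaches zero or a timeout passes. A small sketch of that bookkeeping (class names and timings are illustrative, not any particular product's API):

import time

class Backend:
    def __init__(self, address: str):
        self.address = address
        self.draining = False
        self.active_requests = 0

class DrainingPool:
    def __init__(self, backends):
        self.backends = backends

    def pick(self):
        # New requests only go to backends that are not draining.
        candidates = [b for b in self.backends if not b.draining]
        return min(candidates, key=lambda b: b.active_requests)

    def drain(self, backend: Backend, timeout: float = 30.0):
        backend.draining = True           # stop sending new requests
        deadline = time.monotonic() + timeout
        while backend.active_requests > 0 and time.monotonic() < deadline:
            time.sleep(0.5)               # let in-flight requests finish
        self.backends.remove(backend)     # now safe to take the server down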

Some applications require sticky sessions (or session affinity): once a client connects to server A, all subsequent requests from that client should go to server A. This is necessary if you store session state locally on each server (a practice we generally discourage in distributed systems, but it happens). The load balancer tracks client identity (often via cookies) and ensures requests stick to the same backend.
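
One common way to implement affinity is to derive the backend deterministically from a client identifier (a session cookie, or more crudely the client IP), so the same client always maps to the same server while the pool is stable. A sketch, with hypothetical backends:

import hashlib

BACKENDS = ["server-a", "server-b", "server-c"]  # hypothetical pool

def sticky_pick(client_id: str, backends=BACKENDS) -> str:
    # Same client_id -> same index -> same backend, as long as the pool
    # does not change (real systems often use consistent hashing to limit
    # reshuffling when servers are added or removed).
    digest = hashlib.sha256(client_id.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

print(sticky_pick("session=abc123"))  # always the same backend for this client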

Nginx: A Software Load Balancer in Action

Let’s see load balancing in practice. Here’s a minimal Nginx configuration that distributes traffic across three backend servers using round-robin:

# Pool of backend servers; requests are distributed round-robin by default.
upstream backend_servers {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

server {
    listen 80;
    server_name myapp.com;

    location / {
        # Forward each request to the next server in the upstream pool.
        proxy_pass http://backend_servers;
        # Preserve the original host and client IP for the backends.
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

This configuration tells Nginx to listen on port 80, accept requests for myapp.com, and forward them to one of three backend servers in a round-robin fashion. The proxy_set_header directives preserve the original client IP and host information.

For AWS users, setting up an Application Load Balancer is similarly straightforward through the AWS Console: register your backend instances in a target group, specify health check rules (like “check /health every 30 seconds”), and the ALB handles the rest. It performs the health checks automatically, takes unhealthy instances out of rotation, and, if the target group is attached to an Auto Scaling group, picks up newly launched instances without manual registration.

A real-world example: when Shopify handles Black Friday traffic, they don’t add a single massive server. Instead, they spin up hundreds of smaller app servers behind load balancers. The load balancers distribute the surge across the fleet, ensuring no single server is overwhelmed. Requests that would have queued up or timed out on one overloaded server now complete quickly because the load is spread across many machines.

The Cost of Abstraction

Adding a load balancer introduces one additional network hop. Instead of client → server, you now have client → load balancer → server. This adds latency (typically milliseconds). In latency-critical systems, this matters. For most web applications, it’s negligible—the benefits far outweigh the cost.

There’s also operational complexity. A single server is simple to reason about. Ten servers behind a load balancer introduces questions: How do you monitor each server? What happens when one fails? How do you deploy updates across the fleet without downtime? Load balancing is the foundation that makes scaling feasible, but you need operational tools and processes to manage it.

Software load balancers require resources: CPU, memory, and network bandwidth. If your load balancer becomes the bottleneck (possible, but rare with modern software like Envoy), you’ve solved nothing. Cloud load balancers handle this by being fully managed and auto-scaling.

Not every system needs a load balancer. A small internal API serving five microservices might use client-side load balancing (the client picks a server) or simple DNS round-robin. The complexity should match your scale. When do you need a load balancer? When you have multiple backend servers and want intelligent routing, health checking, and graceful failover.
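
Client-side load balancing pushes the same decision into the caller: each client keeps its own list of server addresses and picks one per request, often retrying against a different server on failure. A minimal sketch of that pattern (addresses and timeouts are made up):

import random
import urllib.request

SERVERS = ["http://10.0.3.1:8080", "http://10.0.3.2:8080",
           "http://10.0.3.3:8080"]  # hypothetical internal API instances

def call_service(path: str, retries: int = 2) -> bytes:
    # The client itself picks a server; on failure it retries elsewhere.
    candidates = random.sample(SERVERS, k=len(SERVERS))
    for base in candidates[: retries + 1]:
        try:
            with urllib.request.urlopen(base + path, timeout=2.0) as resp:
                return resp.read()
        except OSError:
            continue  # try the next server
    raise RuntimeError(f"all {retries + 1} attempts failed for {path}")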

Hardware vs. software trade-offs:

| Aspect      | Hardware LB                        | Software LB                        |
|-------------|------------------------------------|------------------------------------|
| Cost        | Very high                          | Low to free                        |
| Setup       | Months (procurement, installation) | Minutes                            |
| Throughput  | Highest                            | Very high (but less than hardware) |
| Flexibility | Limited (firmware-based)           | High (code-based)                  |
| Scalability | Fixed capacity                     | Scales with infrastructure         |
| Best For    | Large enterprises, extreme scale   | Cloud-native, flexible needs       |

Key Takeaways

  • Load balancers distribute traffic across multiple backend servers, preventing any single server from becoming a bottleneck and enabling horizontal scaling.
  • They sit between clients and servers, acting as an intelligent reverse proxy that health-checks backends and routes requests intelligently.
  • Redundancy is critical: a single load balancer is a single point of failure, so run at least two in an active-passive or active-active setup.
  • Software load balancers (Nginx, HAProxy, Envoy) are the modern default; they’re flexible, cheap, and easier to manage than hardware alternatives.
  • Layer 4 and Layer 7 load balancers serve different purposes: Layer 4 is faster but dumb; Layer 7 is slower but application-aware.
  • Connection draining and session persistence are operational details that matter when managing real traffic and updates.

Practice Scenarios

Scenario 1: The Cascading Failure
Your load balancer routes 100 requests per second to three backend servers (roughly 33 per server). One server crashes. The load balancer removes it from the pool and now routes to two servers (50 per second each). This surge in load causes one of the remaining servers to crash as well. Walk through what happens next and explain how you’d design your system to prevent cascading failures.

Scenario 2: Black Friday Surge
You’re building a system that expects 10x normal traffic on Black Friday. You currently have 10 app servers behind a load balancer. Your database is shared. Where will your load balancer help? Where will it not? What else must you scale?

Scenario 3: Sticky Session Dilemma
A team member suggests using sticky sessions to cache user data on each app server. Your load balancer supports this. Why is this a bad idea? What’s a better approach?

What’s Next?

Now that you understand how load balancers distribute traffic, the natural question is: how do they decide which server to route to? Different algorithms have different properties—some prioritize fairness, others adapt to real-time load, and some fail faster in overload conditions. In the next section, we’ll explore load balancing algorithms: round-robin, least connections, weighted approaches, and advanced algorithms that adapt to your system’s behavior. You’ll see how algorithm choice can mean the difference between smooth scaling and performance cliffs.