Distributed Cache Design
The Problem
Your URL shortener service from the previous section is booming. Last month you were serving 400 redirects per second; this week it’s 4,000. Your single Redis instance is maxing out. You need a distributed caching system that scales horizontally: add servers, and the data automatically spreads across them.
We’re designing a system like Redis Cluster or Memcached: key-value storage that can span hundreds of nodes, automatically routes requests to the right node, rebalances when nodes join or leave, and recovers gracefully from failures. This is the infrastructure layer that powers real-world systems.
Requirements and Functional Scope
Functional Requirements
- Get/Set/Delete operations: Atomic key-value manipulation
- TTL (Time-To-Live): Keys expire automatically after specified duration
- Eviction policies: When memory is full, evict old/unused keys (LRU, LFU, random)
- Data types: Support strings, hashes, lists, sets, sorted sets (like Redis)
- Transactions: Multi-step operations execute atomically
- Persistence: Optional durability to disk (RDB snapshots, AOF)
Non-Functional Requirements
- Sub-millisecond latency: GET/SET under 1ms (99th percentile)
- High throughput: Support 1M operations per second across cluster
- Horizontal scalability: Add nodes to increase capacity and throughput
- High availability: Tolerate node failures with minimal data loss
- Cluster membership: Nodes automatically discover new nodes and rebalance
- Data distribution: Even distribution (avoid hot partitions)
The Core Problem: Data Distribution
Imagine you have 3 cache nodes and 1 million cached keys. How do you decide which key lives on which node?
Naive Approach: Modular Hashing
node_index = hash(key) % num_nodes
node = nodes[node_index]
Problem: When a node fails and you have 2 nodes left:
# Old mapping
node_index = hash("user:123") % 3 # → node 1
node_index = hash("user:456") % 3 # → node 2
# New mapping (one node failed!)
node_index = hash("user:123") % 2 # → node 1 (unchanged, OK)
node_index = hash("user:456") % 2 # → node 0 (MOVED! was on node 2)
In this example roughly two-thirds of the keys remap; in a large cluster, nearly all of them do. Your cache becomes almost useless because clients suddenly can’t find most of what they stored.
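A quick simulation makes the damage concrete (the key names are made up, and hashlib stands in for whatever hash function the cache client uses):

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic hash; Python's built-in hash() is randomized per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user:{i}" for i in range(100_000)]

# Compare where each key lives with 3 nodes vs. with 2 nodes after a failure.
moved = sum(1 for k in keys if stable_hash(k) % 3 != stable_hash(k) % 2)
print(f"{moved / len(keys):.1%} of keys changed nodes")  # roughly two-thirds
```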
Solution: Consistent Hashing
Imagine a circle (0 to 2^32). Nodes and keys both hash onto this circle. Each key belongs to the nearest node (clockwise).
            Node A
          /        \
      Key1          Key2
       |              |
   Node C          Node B
          \        /
        Key3      Key4
Adding a node: Only the keys that fall between the new node and its counter-clockwise neighbor move to the new node (they previously belonged to the next node clockwise). If you have N nodes and add one, roughly 1/N of the keys move, not most of them.
Virtual Nodes (Better Distribution)
Real nodes have multiple hash positions (virtual nodes). Instead of 3 actual nodes, imagine 300 virtual positions:
Nodes: [Node A, Node A, Node A, ..., Node B, Node B, ..., Node C, ...]
(100 virtual A's, 100 virtual B's, 100 virtual C's)
This smooths out the distribution even when nodes are heterogeneous or the data is skewed.
import bisect
import hashlib

class ConsistentHash:
    def __init__(self, nodes, virtual_nodes=160):
        self.ring = {}              # ring position -> physical node
        self.sorted_hashes = []     # sorted ring positions, for binary search
        for node in nodes:
            for i in range(virtual_nodes):
                virtual_key = f"{node}:{i}"
                hash_value = self._hash(virtual_key)
                self.ring[hash_value] = node
                bisect.insort(self.sorted_hashes, hash_value)

    @staticmethod
    def _hash(value):
        # Stable hash: Python's built-in hash() is randomized per process,
        # so every client would otherwise build a different ring.
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

    def get_node(self, key):
        key_hash = self._hash(key)
        # Find the first ring position clockwise from the key (>= key_hash)
        idx = bisect.bisect_left(self.sorted_hashes, key_hash)
        if idx == len(self.sorted_hashes):
            idx = 0                 # wrap around to the start of the ring
        return self.ring[self.sorted_hashes[idx]]
Adding a node: recalculate ring, ~1/N keys need to migrate.
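Here’s a quick check with the class above (hypothetical node and key names) that counts how many keys change owners when a fourth node joins the ring:

```python
before = ConsistentHash(["node-a", "node-b", "node-c"])
after = ConsistentHash(["node-a", "node-b", "node-c", "node-d"])

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if before.get_node(k) != after.get_node(k))

# Only the keys the new node takes over move (roughly a quarter here),
# versus the near-total reshuffle that modular hashing causes.
print(f"{moved / len(keys):.1%} of keys moved")
```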
High-Level Architecture
graph TB
subgraph Clients
C1["App Server 1"]
C2["App Server 2"]
end
subgraph "Cache Cluster"
N1["Node 1\n(Slot 0-5460)"]
N2["Node 2\n(Slot 5461-10922)"]
N3["Node 3\n(Slot 10923-16383)"]
N1R["Replica N1"]
N2R["Replica N2"]
N3R["Replica N3"]
N1 -.->|replication| N1R
N2 -.->|replication| N2R
N3 -.->|replication| N3R
N1 ---|gossip| N2
N2 ---|gossip| N3
N3 ---|gossip| N1
end
C1 -->|GET user:123| N1
C2 -->|GET user:456| N2
C1 -.->|read from replica| N1R
N1 -->|MOVED| C1
Two approaches for routing:
Client-side routing (Redis Cluster):
- Client library knows cluster topology
- Client calculates hash slot for key, directly talks to correct node
- On topology change, client updates mapping
- Pro: Fast (no proxy), scalable. Con: Client complexity.
Proxy-based routing (e.g., twemproxy or mcrouter in front of Memcached):
- Client talks to proxy
- Proxy knows topology, routes to correct backend
- Pro: Simpler clients. Con: Extra hop (higher latency), proxy becomes bottleneck.
We’ll focus on client-side routing (Redis Cluster approach).
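A minimal sketch of the client-side approach (simplified: `send_command` is a stub for a real network call, and `key_slot` is the slot calculation sketched in the next section; a production client caches the full slot table and handles ASK redirects too):

```python
def send_command(node, *args):
    """Stub for a real network call to the 'host:port' in `node`."""
    raise NotImplementedError

class ClusterClient:
    def __init__(self, slot_map):
        # slot_map: {(start_slot, end_slot): "host:port"}, learned at startup
        self.slot_map = dict(slot_map)

    def node_for_slot(self, slot):
        for (start, end), node in self.slot_map.items():
            if start <= slot <= end:
                return node
        raise KeyError(f"no known owner for slot {slot}")

    def get(self, key):
        slot = key_slot(key)                      # CRC16(key) % 16384, see below
        reply = send_command(self.node_for_slot(slot), "GET", key)
        if isinstance(reply, str) and reply.startswith("MOVED"):
            # "MOVED <slot> <host:port>": topology changed; update map and retry once.
            _, moved_slot, new_node = reply.split()
            self.slot_map[(int(moved_slot), int(moved_slot))] = new_node
            return send_command(new_node, "GET", key)
        return reply
```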
Deep Dive: Consistent Hashing at Scale
Hash Slots vs Hash Ring
Redis Cluster uses hash slots (simpler than continuous hash rings):
16384 total slots (fixed)
Cluster divides slots among nodes
Node A: slots 0-5460 (one-third)
Node B: slots 5461-10922 (one-third)
Node C: slots 10923-16383 (one-third)
For key "user:123":
slot = CRC16("user:123") % 16384 # → 4092
→ Send to Node A (owns slots 0-5460)
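A self-contained version of that slot calculation, assuming the CRC16 variant the Redis Cluster spec uses (XMODEM: polynomial 0x1021, initial value 0); treat specific slot numbers as illustrative:

```python
def crc16(data: bytes) -> int:
    # CRC16-XMODEM, bit by bit (fine for illustration; real clients use a lookup table)
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Real Redis hashes only the {tag} portion of keys with hash tags; omitted here.
    return crc16(key.encode()) % 16384

print(key_slot("user:123"))  # a slot in 0-16383; whichever node owns it serves the key
```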
Advantages over hash rings:
- Predictable slot ranges
- The total slot count is fixed; adding a node only changes which node owns which slots
- Rebalancing is deterministic
Cluster Topology and Gossip
Nodes communicate via gossip protocol:
A few times per second, each node pings a few random peers:
"I'm Node A, I own slots 0-5460, I have 1M keys, my version is 12345"
When Node D joins:
Node A sends: "Welcome Node D! Here's the current topology"
Node D joins and listens
All nodes gradually learn about D
Admin command: CLUSTER ADDSLOTS to assign slots to D
After slots are assigned to Node D (here slots 0-10922 are re-split evenly across A, B, and D; C is untouched):
Node A: slots 0-3640
Node B: slots 3641-7281
Node D: slots 7282-10922
Node C: slots 10923-16383
Clients gradually learn new topology and start routing to Node D directly.
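To make the bookkeeping concrete, here is a toy view of what a heartbeat might carry and how a node merges it; the field names are invented for illustration, and real Redis Cluster exchanges a binary packet keyed by configuration epochs:

```python
import time

# A node's local view of the cluster: node name -> slot ranges, epoch, last heartbeat
cluster_view = {
    "node-a": {"slots": [(0, 3640)], "epoch": 12, "last_seen": time.time()},
}

def merge_gossip(view, heartbeat):
    """Apply a heartbeat such as:
    {"node": "node-d", "slots": [(7282, 10922)], "epoch": 13}"""
    known = view.get(heartbeat["node"])
    # Only accept information that is newer than what we already have.
    if known is None or heartbeat["epoch"] > known["epoch"]:
        view[heartbeat["node"]] = {
            "slots": heartbeat["slots"],
            "epoch": heartbeat["epoch"],
            "last_seen": time.time(),
        }
```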
Replication for Durability
Each data node has replicas:
Master: Node A (slots 0-5460)
Replica: Node A-1 (standby)
Replica: Node A-2 (standby)
Write to A: SET key value (async replicated to A-1, A-2)
Read from A or A-1 or A-2 (stale reads possible)
If A fails:
Cluster promotes A-1 to master
Clients reroute to A-1
A-2 replicates from A-1
Async vs Sync replication trade-off:
- Async: Fast writes, data loss risk (A crashes before replicating)
- Sync: Slow writes, higher durability
- Reality: Most use async + replicas + background persistence (RDB snapshots)
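For the occasional write you genuinely can’t afford to lose, Redis offers the WAIT command, which blocks until a given number of replicas acknowledge the write or a timeout expires. A sketch with redis-py (host and port are assumptions):

```python
import redis

r = redis.Redis(host="cache-node-a", port=6379)  # assumed connection details

# Normal path: async replication, the call returns before any replica has the data.
r.set("order:42:status", "paid")

# Stronger guarantee on demand: wait for at least 1 replica, up to 100 ms.
acked = r.execute_command("WAIT", 1, 100)
if acked < 1:
    # No replica confirmed in time; retry, alert, or accept the risk.
    print("write may not be replicated yet")
```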
Memory Management and Eviction
What happens when cache is full?
Eviction Policies
LRU (Least Recently Used):
Track access time for each key
On memory pressure, evict least recently accessed
Good for typical workloads
Cost: extra metadata per key
LFU (Least Frequently Used):
Track access frequency for each key
Evict least frequently accessed
Better for skewed distributions (some keys very hot)
Cost: higher overhead
Random:
Evict random key
Fastest eviction
Predictable? Not really
TTL-based:
Evict keys closest to expiration
Natural fit for cache (temporary data anyway)
Low overhead
Redis uses approximate LRU to keep the bookkeeping cheap:
Instead of maintaining an exact access-ordered list of 10M keys:
- Store a small last-access timestamp with every key
- On eviction, sample a handful of keys (5 by default) and evict the least recently used of the sample
- Nearly as good as exact LRU in practice, with constant extra space per key and no global list to maintain
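A sketch of sampled eviction in that spirit (not Redis’s actual implementation, which packs a 24-bit access clock into each object header; here we just keep a timestamp per key):

```python
import random
import time

store = {}        # key -> value
last_access = {}  # key -> last access time

def cache_get(key):
    last_access[key] = time.monotonic()
    return store.get(key)

def evict_one(sample_size=5):
    # Approximate LRU: sample a few keys, evict the least recently used of the sample.
    if not store:
        return None
    sample = random.sample(list(store), min(sample_size, len(store)))
    victim = min(sample, key=lambda k: last_access[k])
    store.pop(victim, None)
    last_access.pop(victim, None)
    return victim
```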
Slab Allocation (Memcached)
Memcached preallocates memory into slabs (fixed-size buckets):
Slab class 1: 80 bytes per item, 10K items = 800KB
Slab class 2: 160 bytes per item, 5K items = 800KB
Slab class 3: 320 bytes per item, 2.5K items = 800KB
...
User stores 100-byte value:
→ Stored in slab class 2 (160 bytes)
→ 60 bytes wasted (internal fragmentation)
Advantage: No external fragmentation; allocation is fast and predictable
Disadvantage: Internal fragmentation (space wasted inside each fixed-size chunk)
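A sketch of how a slab allocator might pick a class, assuming the doubling size classes from the example above (real Memcached grows classes by a configurable factor, 1.25 by default):

```python
# Chunk sizes double from 80 bytes, as in the example above.
SLAB_CLASSES = [80, 160, 320, 640, 1280]  # bytes per chunk in each class

def pick_slab_class(item_size: int) -> int:
    # Smallest chunk that fits; the leftover space is internal fragmentation.
    for chunk_size in SLAB_CLASSES:
        if item_size <= chunk_size:
            return chunk_size
    raise ValueError("item too large for any slab class")

chunk = pick_slab_class(100)
print(f"{chunk}-byte chunk, {chunk - 100} bytes wasted")  # 160-byte chunk, 60 bytes wasted
```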
Redis uses jemalloc (modern allocator) instead—better fragmentation characteristics.
Hot Key Problem
Imagine a celebrity’s profile in your cache: 100,000 requests per second for the same key. A single cache node bottlenecks.
Detection
# Each node tracks local access frequency
for each request:
key_access_count[key] += 1
# Periodically (every minute), emit metrics:
emit_metric("hot_keys", key_access_count)
→ Monitoring system aggregates across cluster
→ Alerts if any key exceeds threshold (1000 req/sec)
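A minimal version of that counting loop (the metric sink is a stand-in; wire it to your real monitoring client):

```python
from collections import Counter

REPORT_INTERVAL = 60        # seconds between reports
HOT_KEY_THRESHOLD = 1000    # requests/sec, matching the alert above
key_access_count = Counter()

def emit_metric(name, payload):
    print(name, payload)    # stand-in for a real metrics client

def record_access(key):
    key_access_count[key] += 1

def report_hot_keys():
    # Run once per REPORT_INTERVAL from a background timer (not shown).
    for key, count in key_access_count.most_common(10):
        rate = count / REPORT_INTERVAL
        if rate > HOT_KEY_THRESHOLD:
            emit_metric("hot_keys", {"key": key, "req_per_sec": rate})
    key_access_count.clear()
```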
Solutions
Local Caching Layer:
App → Local memory cache (Caffeine, Guava)
↓ miss
→ Distributed cache (Redis)
App servers cache the celebrity profile locally. Reduces load on Redis node by 10-100x.
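A sketch of that two-tier read path (a plain dict with expiry stands in for Caffeine/Guava, and `redis_client` is an assumed redis-py connection):

```python
import time

LOCAL_TTL = 5.0      # seconds; short, so locally cached hot data stays roughly fresh
local_cache = {}     # key -> (value, expires_at)

def cached_get(key, redis_client):
    entry = local_cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                       # served from local memory, no network hop
    value = redis_client.get(key)             # fall through to the distributed cache
    if value is not None:
        local_cache[key] = (value, time.monotonic() + LOCAL_TTL)
    return value
```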
Key Replication (Across All Nodes):
Normally: key "profile:kim-kardashian" → lives on Node B
High traffic detected: replicate to all nodes
Get returns from nearest node (no single bottleneck)
Trade-off: Cache memory used for redundancy
Read-Only Replicas:
Master Node B handles writes
Replicas of B placed near each group of app servers:
→ App Server 1 reads from the B replica in its own datacenter
→ App Server 2 reads from the B replica in its own datacenter
→ Stale reads possible, but <1 second typically
Scaling and Rebalancing
Adding a Node
Initial cluster: A, B, C (5460 slots each)
Add Node D:
1. Assign slots 0-1820 to D (admin tooling, e.g., redis-cli --cluster reshard)
→ D now owns slots 0-1820
→ A loses slots 0-1820 (now owns 1821-5460)
2. CLUSTER REPLICATE <D-node-id> (run on D's replica so it replicates from D)
3. MIGRATE keys in slots 0-1820 from A to D (run on A):
   for each key in slots 0-1820:
       MIGRATE D_host D_port key_name 0 1000   # destination db 0, 1000 ms timeout per key
4. Once migration done, A no longer owns 0-1820
→ Requests to 0-1820 route to D
Node Failure Recovery
Node B fails
→ B has 5460 slots, B-replica has the data
Failover options:
Option 1: Manual: Admin promotes B-replica to master
Option 2: Automatic: Redis Cluster's built-in failover (or Redis Sentinel for non-cluster deployments) detects the failure and promotes a replica
After failover:
→ B-replica becomes new master for slots 5461-10922
→ Clients get redirected (MOVED error with new node info)
→ System continues with minor disruption
Scaling Considerations
| Scenario | Challenge | Solution |
|---|---|---|
| Single node saturated | 1M ops/sec, one node only does 200K ops/sec | Shard into 5 nodes (consistent hashing), 1M ÷ 5 = 200K ops/sec per node |
| Memory full | Cache doesn’t grow, hit rate drops | Add nodes (more total memory), tune eviction policy |
| High latency on some keys | Hot keys (100K req/sec to one key) | Local cache layer + replication + read-only replicas |
| Replication lag | Replica hasn’t received latest write | Stale reads acceptable for cache, use write-through for critical data |
| Network partition | Cluster split into two isolated groups | Majority side keeps serving; minority side rejects writes to avoid split-brain |
Pro Tip: Monitor cache hit rate closely. If it drops below 80%, either increase cache size or re-examine what you’re caching. If it sits above 95%, you may be over-provisioned or caching data that is already cached at another layer.
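Hit rate is easy to pull straight from Redis’s own counters (sketch with redis-py; connection details and thresholds are the ones assumed above):

```python
import redis

r = redis.Redis(host="cache-node-a", port=6379)  # assumed connection details
stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0

if hit_rate < 0.80:
    print(f"hit rate {hit_rate:.1%}: add capacity or rethink what you cache")
```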
Trade-offs and Design Decisions
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Routing | Client-side (know topology) | Proxy-based | Lower latency, client handles complexity |
| Data distribution | Hash slots (Redis) | Hash rings (Memcached) | Simpler, more predictable slots |
| Replication | Async (fast writes) | Sync (strong durability) | Cache is ephemeral, data loss acceptable, speed critical |
| Eviction | Approximate LRU | Exact LRU | Memory overhead vs accuracy trade-off |
| Hot keys | Multi-layer (local + cache) | Cache replication | Simpler, avoids cache memory bloat |
| Persistence | RDB snapshots + AOF | No persistence (pure in-memory) | Balance between crash recovery and performance |
Key Takeaways
- Consistent hashing solves the distribution problem—adding nodes requires rebalancing only 1/N keys, not all keys
- Client-side routing is faster but more complex—proxy-based is simpler but adds latency
- Hash slots (Redis) are more predictable than continuous hash rings
- Replication must be async for cache performance—durability comes from snapshots, not replication
- Eviction policies determine cache effectiveness—LRU for typical workloads, careful tuning needed
- Hot keys require multi-layer solutions—local caching + distributed cache works best
- Monitoring is crucial—track hit rate, key frequency, memory usage, latency percentiles
Practice
Design a feature to auto-scale a Redis Cluster: monitor throughput and latency, automatically add nodes when 99th percentile latency exceeds 5ms, remove nodes when cluster is underutilized. What metrics would you use? When would you trigger adds vs removes? How do you avoid thrashing (repeatedly adding/removing nodes)?
Connection to Next Section
We’ve designed systems for specific problems—rate limiting (protection), URL shortening (caching), distributed caching (infrastructure). Real systems combine these and more. The next chapters dive into specialized patterns: message queues for async work, search engines for text queries, time-series databases for metrics, and how to orchestrate them into complete platforms.