Distributed Cache Design
The Problem
Your URL shortener service from the previous section is booming. Last month you were serving 400 redirects per second; this week it’s 4,000. Your single Redis instance is maxing out. You need a distributed caching system that scales horizontally: add servers, and the data automatically spreads across them.
We’re designing a system like Redis Cluster or Memcached: key-value storage that can span hundreds of nodes, automatically routes requests to the right node, rebalances when nodes join or leave, and recovers gracefully from failures. This is the infrastructure layer that powers real-world systems.
Requirements and Functional Scope
Functional Requirements
- Get/Set/Delete operations: Atomic key-value manipulation
- TTL (Time-To-Live): Keys expire automatically after specified duration
- Eviction policies: When memory is full, evict old/unused keys (LRU, LFU, random)
- Data types: Support strings, hashes, lists, sets, sorted sets (like Redis)
- Transactions: Multi-step operations execute atomically
- Persistence: Optional durability to disk (RDB snapshots, AOF)
Non-Functional Requirements
- Sub-millisecond latency: GET/SET under 1ms (99th percentile)
- High throughput: Support 1M operations per second across cluster
- Horizontal scalability: Add nodes to increase capacity and throughput
- High availability: Tolerate node failures with minimal data loss
- Cluster membership: Nodes automatically discover new nodes and rebalance
- Data distribution: Even distribution (avoid hot partitions)
The Core Problem: Data Distribution
Imagine you have 3 cache nodes and 1 million cached keys. How do you decide which key lives on which node?
Naive Approach: Modular Hashing
node_index = hash(key) % num_nodes
node = nodes[node_index]
Problem: When a node fails and you have 2 nodes left:
# Old mapping
node_index = hash("user:123") % 3 # → node 1
node_index = hash("user:456") % 3 # → node 2
# New mapping (one node failed!)
node_index = hash("user:123") % 2 # → node 1 (unchanged, OK)
node_index = hash("user:456") % 2 # → node 0 (MOVED! was on node 2)
In this example roughly two-thirds of the keys remap; in a large cluster, nearly all of them do. Your cache becomes almost useless because clients suddenly can’t find most of what they stored.
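A quick simulation makes the damage concrete (the key names are made up, and hashlib stands in for whatever hash function the cache client uses):

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic hash; Python's built-in hash() is randomized per process.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user:{i}" for i in range(100_000)]

# Compare where each key lives with 3 nodes vs. with 2 nodes after a failure.
moved = sum(1 for k in keys if stable_hash(k) % 3 != stable_hash(k) % 2)
print(f"{moved / len(keys):.1%} of keys changed nodes")  # roughly two-thirds
```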
Solution: Consistent Hashing
Imagine a circle (0 to 2^32). Nodes and keys both hash onto this circle. Each key belongs to the nearest node (clockwise).
            Node A
          /        \
      Key1          Key2
       |              |
   Node C          Node B
          \        /
        Key3      Key4
Adding a node: Only the keys that fall between the new node and its counter-clockwise neighbor move to the new node (they previously belonged to the next node clockwise). If you have N nodes and add one, roughly 1/N of the keys move, not most of them.
Virtual Nodes (Better Distribution)
Real nodes have multiple hash positions (virtual nodes). Instead of 3 actual nodes, imagine 300 virtual positions:
Nodes: [Node A, Node A, Node A, ..., Node B, Node B, ..., Node C, ...]
(100 virtual A's, 100 virtual B's, 100 virtual C's)
This smooths out the distribution even when nodes are heterogeneous or the data is skewed.
import bisect
import hashlib

class ConsistentHash:
    def __init__(self, nodes, virtual_nodes=160):
        self.ring = {}              # ring position -> physical node
        self.sorted_hashes = []     # sorted ring positions, for binary search
        for node in nodes:
            for i in range(virtual_nodes):
                virtual_key = f"{node}:{i}"
                hash_value = self._hash(virtual_key)
                self.ring[hash_value] = node
                bisect.insort(self.sorted_hashes, hash_value)

    @staticmethod
    def _hash(value):
        # Stable hash: Python's built-in hash() is randomized per process,
        # so every client would otherwise build a different ring.
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

    def get_node(self, key):
        key_hash = self._hash(key)
        # Find the first ring position clockwise from the key (>= key_hash)
        idx = bisect.bisect_left(self.sorted_hashes, key_hash)
        if idx == len(self.sorted_hashes):
            idx = 0                 # wrap around to the start of the ring
        return self.ring[self.sorted_hashes[idx]]
Adding a node: recalculate ring, ~1/N keys need to migrate.
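Here’s a quick check with the class above (hypothetical node and key names) that counts how many keys change owners when a fourth node joins the ring:

```python
before = ConsistentHash(["node-a", "node-b", "node-c"])
after = ConsistentHash(["node-a", "node-b", "node-c", "node-d"])

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if before.get_node(k) != after.get_node(k))

# Only the keys the new node takes over move (roughly a quarter here),
# versus the near-total reshuffle that modular hashing causes.
print(f"{moved / len(keys):.1%} of keys moved")
```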
High-Level Architecture
graph TB
subgraph Clients
C1["App Server 1"]
C2["App Server 2"]
end
subgraph "Cache Cluster"
N1["Node 1\n(Slot 0-5460)"]
N2["Node 2\n(Slot 5461-10922)"]
N3["Node 3\n(Slot 10923-16383)"]
N1R["Replica N1"]
N2R["Replica N2"]
N3R["Replica N3"]
N1 -.->|replication| N1R
N2 -.->|replication| N2R
N3 -.->|replication| N3R
N1 ---|gossip| N2
N2 ---|gossip| N3
N3 ---|gossip| N1
end
C1 -->|GET user:123| N1
C2 -->|GET user:456| N2
C1 -.->|read from replica| N1R
N1 -->|MOVED| C1
Two approaches for routing:
Client-side routing (Redis Cluster):
- Client library knows cluster topology
- Client calculates hash slot for key, directly talks to correct node
- On topology change, client updates mapping
- Pro: Fast (no proxy), scalable. Con: Client complexity.
Proxy-based routing (e.g., twemproxy or mcrouter in front of Memcached):
- Client talks to proxy
- Proxy knows topology, routes to correct backend
- Pro: Simpler clients. Con: Extra hop (higher latency), proxy becomes bottleneck.
We’ll focus on client-side routing (Redis Cluster approach).
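A minimal sketch of the client-side approach (simplified: `send_command` is a stub for a real network call, and `key_slot` is the slot calculation sketched in the next section; a production client caches the full slot table and handles ASK redirects too):

```python
def send_command(node, *args):
    """Stub for a real network call to the 'host:port' in `node`."""
    raise NotImplementedError

class ClusterClient:
    def __init__(self, slot_map):
        # slot_map: {(start_slot, end_slot): "host:port"}, learned at startup
        self.slot_map = dict(slot_map)

    def node_for_slot(self, slot):
        for (start, end), node in self.slot_map.items():
            if start <= slot <= end:
                return node
        raise KeyError(f"no known owner for slot {slot}")

    def get(self, key):
        slot = key_slot(key)                      # CRC16(key) % 16384, see below
        reply = send_command(self.node_for_slot(slot), "GET", key)
        if isinstance(reply, str) and reply.startswith("MOVED"):
            # "MOVED <slot> <host:port>": topology changed; update map and retry once.
            _, moved_slot, new_node = reply.split()
            self.slot_map[(int(moved_slot), int(moved_slot))] = new_node
            return send_command(new_node, "GET", key)
        return reply
```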
Deep Dive: Consistent Hashing at Scale
Hash Slots vs Hash Ring
Redis Cluster uses hash slots (simpler than continuous hash rings):
16384 total slots (fixed)
Cluster divides slots among nodes
Node A: slots 0-5460 (one-third)
Node B: slots 5461-10922 (one-third)
Node C: slots 10923-16383 (one-third)
For key "user:123":
slot = CRC16("user:123") % 16384 # → 4092
→ Send to Node A (owns slots 0-5460)
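A self-contained version of that slot calculation, assuming the CRC16 variant the Redis Cluster spec uses (XMODEM: polynomial 0x1021, initial value 0); treat specific slot numbers as illustrative:

```python
def crc16(data: bytes) -> int:
    # CRC16-XMODEM, bit by bit (fine for illustration; real clients use a lookup table)
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Real Redis hashes only the {tag} portion of keys with hash tags; omitted here.
    return crc16(key.encode()) % 16384

print(key_slot("user:123"))  # a slot in 0-16383; whichever node owns it serves the key
```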
Advantages over hash rings:
- Predictable slot ranges
- The total slot count is fixed; adding a node only changes which node owns which slots
- Rebalancing is deterministic
Cluster Topology and Gossip
Nodes communicate via gossip protocol:
A few times per second, each node pings a few random peers:
"I'm Node A, I own slots 0-5460, I have 1M keys, my version is 12345"
When Node D joins:
Node A sends: "Welcome Node D! Here's the current topology"
Node D joins and listens
All nodes gradually learn about D
Admin command: CLUSTER ADDSLOTS to assign slots to D
After slots are assigned to Node D (here slots 0-10922 are re-split evenly across A, B, and D; C is untouched):
Node A: slots 0-3640
Node B: slots 3641-7281
Node D: slots 7282-10922
Node C: slots 10923-16383
Clients gradually learn new topology and start routing to Node D directly.
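To make the bookkeeping concrete, here is a toy view of what a heartbeat might carry and how a node merges it; the field names are invented for illustration, and real Redis Cluster exchanges a binary packet keyed by configuration epochs:

```python
import time

# A node's local view of the cluster: node name -> slot ranges, epoch, last heartbeat
cluster_view = {
    "node-a": {"slots": [(0, 3640)], "epoch": 12, "last_seen": time.time()},
}

def merge_gossip(view, heartbeat):
    """Apply a heartbeat such as:
    {"node": "node-d", "slots": [(7282, 10922)], "epoch": 13}"""
    known = view.get(heartbeat["node"])
    # Only accept information that is newer than what we already have.
    if known is None or heartbeat["epoch"] > known["epoch"]:
        view[heartbeat["node"]] = {
            "slots": heartbeat["slots"],
            "epoch": heartbeat["epoch"],
            "last_seen": time.time(),
        }
```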
Replication for Durability
Each data node has replicas:
Master: Node A (slots 0-5460)
Replica: Node A-1 (standby)
Replica: Node A-2 (standby)
Write to A: SET key value (async replicated to A-1, A-2)
Read from A or A-1 or A-2 (stale reads possible)
If A fails:
Cluster promotes A-1 to master
Clients reroute to A-1
A-2 replicates from A-1
Async vs Sync replication trade-off:
- Async: Fast writes, data loss risk (A crashes before replicating)
- Sync: Slow writes, higher durability
- Reality: Most use async + replicas + background persistence (RDB snapshots)
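For the occasional write you genuinely can’t afford to lose, Redis offers the WAIT command, which blocks until a given number of replicas acknowledge the write or a timeout expires. A sketch with redis-py (host and port are assumptions):

```python
import redis

r = redis.Redis(host="cache-node-a", port=6379)  # assumed connection details

# Normal path: async replication, the call returns before any replica has the data.
r.set("order:42:status", "paid")

# Stronger guarantee on demand: wait for at least 1 replica, up to 100 ms.
acked = r.execute_command("WAIT", 1, 100)
if acked < 1:
    # No replica confirmed in time; retry, alert, or accept the risk.
    print("write may not be replicated yet")
```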
Memory Management and Eviction
What happens when cache is full?
Eviction Policies
LRU (Least Recently Used):
Track access time for each key
On memory pressure, evict least recently accessed
Good for typical workloads
Cost: extra metadata per key
LFU (Least Frequently Used):
Track access frequency for each key
Evict least frequently accessed
Better for skewed distributions (some keys very hot)
Cost: higher overhead
Random:
Evict random key
Fastest eviction
Predictable? Not really
TTL-based:
Evict keys closest to expiration
Natural fit for cache (temporary data anyway)
Low overhead
Redis uses approximate LRU to keep the bookkeeping cheap:
Instead of maintaining an exact access-ordered list of 10M keys:
- Store a small last-access timestamp with every key
- On eviction, sample a handful of keys (5 by default) and evict the least recently used of the sample
- Nearly as good as exact LRU in practice, with constant extra space per key and no global list to maintain
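A sketch of sampled eviction in that spirit (not Redis’s actual implementation, which packs a 24-bit access clock into each object header; here we just keep a timestamp per key):

```python
import random
import time

store = {}        # key -> value
last_access = {}  # key -> last access time

def cache_get(key):
    last_access[key] = time.monotonic()
    return store.get(key)

def evict_one(sample_size=5):
    # Approximate LRU: sample a few keys, evict the least recently used of the sample.
    if not store:
        return None
    sample = random.sample(list(store), min(sample_size, len(store)))
    victim = min(sample, key=lambda k: last_access[k])
    store.pop(victim, None)
    last_access.pop(victim, None)
    return victim
```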
Slab Allocation (Memcached)
Memcached preallocates memory into slabs (fixed-size buckets):
Slab class 1: 80 bytes per item, 10K items = 800KB
Slab class 2: 160 bytes per item, 5K items = 800KB
Slab class 3: 320 bytes per item, 2.5K items = 800KB
...
User stores 100-byte value:
→ Stored in slab class 2 (160 bytes)
→ 60 bytes wasted (internal fragmentation)
Advantage: No external fragmentation; allocation is fast and predictable
Disadvantage: Internal fragmentation (space wasted inside each fixed-size chunk)
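A sketch of how a slab allocator might pick a class, assuming the doubling size classes from the example above (real Memcached grows classes by a configurable factor, 1.25 by default):

```python
# Chunk sizes double from 80 bytes, as in the example above.
SLAB_CLASSES = [80, 160, 320, 640, 1280]  # bytes per chunk in each class

def pick_slab_class(item_size: int) -> int:
    # Smallest chunk that fits; the leftover space is internal fragmentation.
    for chunk_size in SLAB_CLASSES:
        if item_size <= chunk_size:
            return chunk_size
    raise ValueError("item too large for any slab class")

chunk = pick_slab_class(100)
print(f"{chunk}-byte chunk, {chunk - 100} bytes wasted")  # 160-byte chunk, 60 bytes wasted
```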
Redis uses jemalloc (modern allocator) instead—better fragmentation characteristics.
Hot Key Problem
Imagine a celebrity’s profile in your cache: 100,000 requests per second for the same key. A single cache node bottlenecks.
Detection
# Each node tracks local access frequency
for each request:
key_access_count[key] += 1
# Periodically (every minute), emit metrics:
emit_metric("hot_keys", key_access_count)
→ Monitoring system aggregates across cluster
→ Alerts if any key exceeds threshold (1000 req/sec)
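A minimal version of that counting loop (the metric sink is a stand-in; wire it to your real monitoring client):

```python
from collections import Counter

REPORT_INTERVAL = 60        # seconds between reports
HOT_KEY_THRESHOLD = 1000    # requests/sec, matching the alert above
key_access_count = Counter()

def emit_metric(name, payload):
    print(name, payload)    # stand-in for a real metrics client

def record_access(key):
    key_access_count[key] += 1

def report_hot_keys():
    # Run once per REPORT_INTERVAL from a background timer (not shown).
    for key, count in key_access_count.most_common(10):
        rate = count / REPORT_INTERVAL
        if rate > HOT_KEY_THRESHOLD:
            emit_metric("hot_keys", {"key": key, "req_per_sec": rate})
    key_access_count.clear()
```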
Solutions
Local Caching Layer:
App → Local memory cache (Caffeine, Guava)
↓ miss
→ Distributed cache (Redis)
App servers cache the celebrity profile locally. Reduces load on Redis node by 10-100x.
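A sketch of that two-tier read path (a plain dict with expiry stands in for Caffeine/Guava, and `redis_client` is an assumed redis-py connection):

```python
import time

LOCAL_TTL = 5.0      # seconds; short, so locally cached hot data stays roughly fresh
local_cache = {}     # key -> (value, expires_at)

def cached_get(key, redis_client):
    entry = local_cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                       # served from local memory, no network hop
    value = redis_client.get(key)             # fall through to the distributed cache
    if value is not None:
        local_cache[key] = (value, time.monotonic() + LOCAL_TTL)
    return value
```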
Key Replication (Across All Nodes):
Normally: key "profile:kim-kardashian" → lives on Node B
High traffic detected: replicate to all nodes
Get returns from nearest node (no single bottleneck)
Trade-off: Cache memory used for redundancy
Read-Only Replicas:
Master Node B handles writes
Replicas of B placed near each group of app servers:
→ App Server 1 reads from the B replica in its own datacenter
→ App Server 2 reads from the B replica in its own datacenter
→ Stale reads possible, but <1 second typically
Scaling and Rebalancing
Adding a Node
Initial cluster: A, B, C (5460 slots each)
Add Node D:
1. Assign slots 0-1820 to D (admin tooling, e.g., redis-cli --cluster reshard)
→ D now owns slots 0-1820
→ A loses slots 0-1820 (now owns 1821-5460)
2. CLUSTER REPLICATE <D-node-id> (run on D's replica so it replicates from D)
3. MIGRATE keys in slots 0-1820 from A to D (run on A):
   for each key in slots 0-1820:
       MIGRATE D_host D_port key_name 0 1000   # destination db 0, 1000 ms timeout per key
4. Once migration done, A no longer owns 0-1820
→ Requests to 0-1820 route to D
Node Failure Recovery
Node B fails
→ B has 5460 slots, B-replica has the data
Failover options:
Option 1: Manual: Admin promotes B-replica to master
Option 2: Automatic: Redis Cluster's built-in failover (or Redis Sentinel for non-cluster deployments) detects the failure and promotes a replica
After failover:
→ B-replica becomes new master for slots 5461-10922
→ Clients get redirected (MOVED error with new node info)
→ System continues with minor disruption
Scaling Considerations
| Scenario | Challenge | Solution |
|---|---|---|
| Single node saturated | 1M ops/sec, one node only does 200K ops/sec | Shard into 5 nodes (consistent hashing), 1M ÷ 5 = 200K ops/sec per node |
| Memory full | Cache doesn’t grow, hit rate drops | Add nodes (more total memory), tune eviction policy |
| High latency on some keys | Hot keys (100K req/sec to one key) | Local cache layer + replication + read-only replicas |
| Replication lag | Replica hasn’t received latest write | Stale reads acceptable for cache, use write-through for critical data |
| Network partition | Cluster split into two isolated groups | Majority side keeps serving; minority side rejects writes to avoid split-brain |
Pro Tip: Monitor cache hit rate closely. If it drops below 80%, either increase cache size or re-examine what you’re caching. If it sits above 95%, you may be over-provisioned or caching data that is already cached at another layer.
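Hit rate is easy to pull straight from Redis’s own counters (sketch with redis-py; connection details and thresholds are the ones assumed above):

```python
import redis

r = redis.Redis(host="cache-node-a", port=6379)  # assumed connection details
stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0

if hit_rate < 0.80:
    print(f"hit rate {hit_rate:.1%}: add capacity or rethink what you cache")
```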
Trade-offs and Design Decisions
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Routing | Client-side (know topology) | Proxy-based | Lower latency, client handles complexity |
| Data distribution | Hash slots (Redis) | Hash rings (Memcached) | Simpler, more predictable slots |
| Replication | Async (fast writes) | Sync (strong durability) | Cache is ephemeral, data loss acceptable, speed critical |
| Eviction | Approximate LRU | Exact LRU | Memory overhead vs accuracy trade-off |
| Hot keys | Multi-layer (local + cache) | Cache replication | Simpler, avoids cache memory bloat |
| Persistence | RDB snapshots + AOF | No persistence (pure in-memory) | Balance between crash recovery and performance |
Key Takeaways
- Consistent hashing solves the distribution problem—adding nodes requires rebalancing only 1/N keys, not all keys
- Client-side routing is faster but more complex—proxy-based is simpler but adds latency
- Hash slots (Redis) are more predictable than continuous hash rings
- Replication must be async for cache performance—durability comes from snapshots, not replication
- Eviction policies determine cache effectiveness—LRU for typical workloads, careful tuning needed
- Hot keys require multi-layer solutions—local caching + distributed cache works best
- Monitoring is crucial—track hit rate, key frequency, memory usage, latency percentiles
Practice
Design a feature to auto-scale a Redis Cluster: monitor throughput and latency, automatically add nodes when 99th percentile latency exceeds 5ms, remove nodes when cluster is underutilized. What metrics would you use? When would you trigger adds vs removes? How do you avoid thrashing (repeatedly adding/removing nodes)?
Connection to Next Section
We’ve designed systems for specific problems—rate limiting (protection), URL shortening (caching), distributed caching (infrastructure). Real systems combine these and more. The next chapters dive into specialized patterns: message queues for async work, search engines for text queries, time-series databases for metrics, and how to orchestrate them into complete platforms.