System Design Fundamentals

Distributed Cache Design

The Problem

Your URL shortener service from the previous section is booming. One week you’re serving 400 redirects per second; the next, it’s 4,000. Your single Redis instance is maxing out. You need a distributed caching system that scales horizontally: add servers, and the data is automatically distributed across them.

We’re designing a system like Redis Cluster or Memcached: key-value storage that can span hundreds of nodes, automatically routes requests to the right node, rebalances when nodes join or leave, and recovers gracefully from failures. This is the infrastructure layer that powers real-world systems.

Requirements and Functional Scope

Functional Requirements

  • Get/Set/Delete operations: Atomic key-value manipulation (see the sketch after this list)
  • TTL (Time-To-Live): Keys expire automatically after specified duration
  • Eviction policies: When memory is full, evict old/unused keys (LRU, LFU, random)
  • Data types: Support strings, hashes, lists, sets, sorted sets (like Redis)
  • Transactions: Multi-step operations execute atomically
  • Persistence: Optional durability to disk (RDB snapshots, AOF)
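
To make the target API concrete, here is a minimal sketch of these operations using redis-py against a single node (hypothetical host); in the full design, each call is first routed to the node that owns the key:

import redis

r = redis.Redis(host="cache-node-1", port=6379, decode_responses=True)

r.set("user:123", "profile-blob", ex=60)   # SET with a 60-second TTL
print(r.get("user:123"))                   # GET -> "profile-blob" (or None on a miss)
print(r.ttl("user:123"))                   # seconds remaining before expiry
r.delete("user:123")                       # DELETE

# Transactions: queue multiple commands, then execute them atomically
with r.pipeline(transaction=True) as pipe:
  pipe.incr("counter:views").expire("counter:views", 3600)
  pipe.execute()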

Non-Functional Requirements

  • Sub-millisecond latency: GET/SET under 1ms (99th percentile)
  • High throughput: Support 1M operations per second across cluster
  • Horizontal scalability: Add nodes to increase capacity and throughput
  • High availability: Tolerate node failures with minimal data loss
  • Cluster membership: Nodes automatically discover new nodes and rebalance
  • Data distribution: Even distribution (avoid hot partitions)

The Core Problem: Data Distribution

Imagine you have 3 cache nodes and 1 million cached keys. How do you decide which key lives on which node?

Naive Approach: Modular Hashing

node_index = hash(key) % num_nodes
node = nodes[node_index]

Problem: When a node fails and you have 2 nodes left:

# Old mapping
node_index = hash("user:123") % 3  # → node 1
node_index = hash("user:456") % 3  # → node 2

# New mapping (one node failed!)
node_index = hash("user:123") % 2  # → node 1 (unchanged, OK)
node_index = hash("user:456") % 2  # → node 0 (MOVED! was on node 2)

With 3 nodes shrinking to 2, roughly two-thirds of keys map to a different node (and in larger clusters it approaches nearly all of them). Your cache becomes mostly useless as clients suddenly can’t find anything.
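
You can see the damage by simulating the remapping; this sketch counts how many sample keys change nodes when the cluster shrinks from 3 to 2 (md5 gives a hash that is stable across processes, unlike Python’s built-in hash()):

import hashlib

def stable_hash(key):
  # Built-in hash() is randomized per Python process, so use md5 for stability
  return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if stable_hash(k) % 3 != stable_hash(k) % 2)
print(f"{moved / len(keys):.1%} of keys moved")   # roughly 66-67%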

Solution: Consistent Hashing

Imagine a circle (0 to 2^32). Nodes and keys both hash onto this circle. Each key belongs to the nearest node (clockwise).

         Node A
        /       \
   Key1         Key2
   /               \
Node C          Node B
   \              /
    Key3    Key4

Adding a node: Only the keys that fall between the new node and its predecessor on the ring move, and they move from the next clockwise node to the new one. If you have N nodes and add one, roughly 1/N of the keys move (rather than nearly all of them).

Virtual Nodes (Better Distribution)

Real nodes have multiple hash positions (virtual nodes). Instead of 3 actual nodes, imagine 300 virtual positions:

Nodes: [Node A, Node A, Node A, ..., Node B, Node B, ..., Node C, ...]
       (100 virtual A's, 100 virtual B's, 100 virtual C's)

This smooths out the distribution even when hash positions would otherwise cluster or the data is skewed; heterogeneous nodes can be given proportionally more virtual nodes.

import bisect
import hashlib

class ConsistentHash:
  def __init__(self, nodes, virtual_nodes=160):
    self.ring = {}  # hash position on the ring -> physical node
    for node in nodes:
      for i in range(virtual_nodes):
        virtual_key = f"{node}:{i}"
        self.ring[self._hash(virtual_key)] = node
    self.sorted_hashes = sorted(self.ring)  # precomputed for binary search

  @staticmethod
  def _hash(key):
    # Built-in hash() is randomized per process; md5 keeps all clients in agreement
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

  def get_node(self, key):
    key_hash = self._hash(key)
    # Binary search for the smallest ring position >= key_hash (wraps around)
    idx = bisect.bisect_left(self.sorted_hashes, key_hash)
    if idx == len(self.sorted_hashes):
      idx = 0  # past the last position: wrap to the start of the ring
    return self.ring[self.sorted_hashes[idx]]

Adding a node: recalculate ring, ~1/N keys need to migrate.
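
A quick sanity check using the class above: build the ring before and after a fourth node joins (node names are hypothetical) and count how many keys land somewhere new:

before = ConsistentHash(["node-a", "node-b", "node-c"])
after = ConsistentHash(["node-a", "node-b", "node-c", "node-d"])

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if before.get_node(k) != after.get_node(k))
print(f"{moved / len(keys):.1%} of keys migrate")   # roughly 1/4 with 4 nodes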

High-Level Architecture

graph TB
    subgraph Clients
        C1["App Server 1"]
        C2["App Server 2"]
    end

    subgraph "Cache Cluster"
        N1["Node 1\n(Slot 0-5460)"]
        N2["Node 2\n(Slot 5461-10922)"]
        N3["Node 3\n(Slot 10923-16383)"]
        N1R["Replica N1"]
        N2R["Replica N2"]
        N3R["Replica N3"]

        N1 -.->|replication| N1R
        N2 -.->|replication| N2R
        N3 -.->|replication| N3R

        N1 ---|gossip| N2
        N2 ---|gossip| N3
        N3 ---|gossip| N1
    end

    C1 -->|GET user:123| N1
    C2 -->|GET user:456| N2
    C1 -.->|stale read| N1R
    N1 -->|MOVED| C1

Two approaches for routing:

Client-side routing (Redis Cluster):

  • Client library knows cluster topology
  • Client calculates hash slot for key, directly talks to correct node
  • On topology change, client updates mapping
  • Pro: Fast (no proxy), scalable. Con: Client complexity.

Proxy-based routing (e.g., Twemproxy or mcrouter in front of Memcached):

  • Client talks to proxy
  • Proxy knows topology, routes to correct backend
  • Pro: Simpler clients. Con: Extra hop (higher latency), proxy becomes bottleneck.

We’ll focus on client-side routing (Redis Cluster approach).

Deep Dive: Consistent Hashing at Scale

Hash Slots vs Hash Ring

Redis Cluster uses hash slots (simpler than continuous hash rings):

16384 total slots (fixed)
Cluster divides slots among nodes

Node A: slots 0-5460 (one-third)
Node B: slots 5461-10922 (one-third)
Node C: slots 10923-16383 (one-third)

For key "user:123":
  slot = CRC16("user:123") % 16384  # → 4092
  → Send to Node A (owns slots 0-5460)
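
A minimal sketch of slot-based routing in Python, assuming binascii.crc_hqx (CRC-16/XMODEM) as a stand-in for the cluster’s CRC16 and a hard-coded slot table mirroring the ranges above:

import binascii

TOTAL_SLOTS = 16384
# Hypothetical slot assignments matching the example above
SLOT_RANGES = [(0, 5460, "node-a"), (5461, 10922, "node-b"), (10923, 16383, "node-c")]

def key_slot(key):
  # crc_hqx implements CRC-16/XMODEM, used here as a stand-in for CRC16
  return binascii.crc_hqx(key.encode(), 0) % TOTAL_SLOTS

def node_for_key(key):
  slot = key_slot(key)
  for start, end, node in SLOT_RANGES:
    if start <= slot <= end:
      return node

print(key_slot("user:123"), node_for_key("user:123"))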

Advantages over hash rings:

  • Predictable slot ranges
  • The slot count is fixed at 16384; adding a node only changes which node owns which slots
  • Rebalancing is deterministic

Cluster Topology and Gossip

Nodes communicate via gossip protocol:

Every 100ms, each node sends a message to a few random nodes:
  "I'm Node A, I own slots 0-5460, I have 1M keys, my version is 12345"

When Node D joins:
  Node A sends: "Welcome Node D! Here's the current topology"
  Node D joins and listens
  All nodes gradually learn about D
  Admin command: CLUSTER ADDSLOTS to assign slots to D

After slots are assigned to Node D:

Node A: slots 0-3640
Node B: slots 3641-7281
Node D: slots 7282-10922 (moved from B)
Node C: slots 10923-16383

Clients gradually learn new topology and start routing to Node D directly.

Replication for Durability

Each data node has replicas:

Master: Node A (slots 0-5460)
  Replica: Node A-1 (standby)
  Replica: Node A-2 (standby)

Write to A: SET key value (async replicated to A-1, A-2)
Read from A or A-1 or A-2 (stale reads possible)

If A fails:
  Cluster promotes A-1 to master
  Clients reroute to A-1
  A-2 replicates from A-1

Async vs Sync replication trade-off:

  • Async: Fast writes, data loss risk (A crashes before replicating)
  • Sync: Slow writes, higher durability
  • Reality: Most use async + replicas + background persistence (RDB snapshots); see the sketch below
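
As a sketch of that trade-off (assuming redis-py and a hypothetical host): writes are async by default, and Redis’s WAIT command can be used selectively where a smaller data-loss window is worth the extra latency:

import redis

r = redis.Redis(host="node-a", port=6379)

# Async (default): SET returns once the master has the write;
# replication to the replicas happens in the background.
r.set("user:123", "profile-blob")

# Quasi-sync: WAIT blocks until at least 1 replica acknowledges the preceding
# writes, or 100 ms elapse. Slower, but it narrows the data-loss window.
r.set("order:789", "critical-blob")
acked = r.execute_command("WAIT", 1, 100)   # number of replicas that acked
print(acked)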

Memory Management and Eviction

What happens when cache is full?

Eviction Policies

LRU (Least Recently Used):

Track access time for each key
On memory pressure, evict least recently accessed
Good for typical workloads
Cost: extra metadata per key

LFU (Least Frequently Used):

Track access frequency for each key
Evict least frequently accessed
Better for skewed distributions (some keys very hot)
Cost: higher overhead

Random:

Evict random key
Fastest eviction
No per-key metadata, but unpredictable impact on hit rate

TTL-based:

Evict keys closest to expiration
Natural fit for cache (temporary data anyway)
Low overhead

Redis uses approximate LRU for memory efficiency:

Instead of maintaining an exact LRU ordering over 10M keys:
  - Store a small last-access clock for every key (24 bits in Redis)
  - On eviction, sample a handful of keys (default 5) and evict the least recently used of the sample
  - Nearly as good as exact LRU in practice, with constant extra space per key
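
A small sketch of the sampling idea (not Redis’s actual implementation), assuming each cache entry carries a last-access timestamp:

import random
import time

cache = {}  # key -> (value, last_access_time)

def cache_get(key):
  value, _ = cache[key]
  cache[key] = (value, time.monotonic())   # refresh the access clock on every read
  return value

def evict_one(sample_size=5):
  # Approximate LRU: inspect a small random sample and evict its oldest member
  if not cache:
    return
  sample = random.sample(list(cache), min(sample_size, len(cache)))
  victim = min(sample, key=lambda k: cache[k][1])
  del cache[victim]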

Slab Allocation (Memcached)

Memcached preallocates memory into slabs (fixed-size buckets):

Slab class 1: 80 bytes per item, 10K items = 800KB
Slab class 2: 160 bytes per item, 5K items = 800KB
Slab class 3: 320 bytes per item, 2.5K items = 800KB
...

User stores 100-byte value:
  → Stored in slab class 2 (160 bytes)
  → 60 bytes wasted (fragmentation)

Advantage: No external fragmentation (the allocator never hunts for oddly sized free blocks)
Disadvantage: Internal fragmentation (bytes wasted inside each fixed-size chunk)
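
To make the arithmetic concrete, a small sketch that picks a slab class for a value and reports the wasted bytes, assuming chunk sizes that double from 80 bytes as in the example above (real Memcached uses a configurable growth factor, typically 1.25):

# Hypothetical slab classes: 80, 160, 320, ... bytes
SLAB_CLASSES = [80 * (2 ** i) for i in range(8)]

def slab_class_for(size):
  for chunk_size in SLAB_CLASSES:
    if size <= chunk_size:
      return chunk_size
  raise ValueError(f"{size} bytes exceeds the largest slab class")

value_size = 100
chunk = slab_class_for(value_size)
print(f"{value_size}B value -> {chunk}B chunk, {chunk - value_size}B wasted")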

Redis uses jemalloc (modern allocator) instead—better fragmentation characteristics.

Hot Key Problem

Imagine a celebrity’s profile in your cache: 100,000 requests per second for the same key. A single cache node bottlenecks.

Detection

from collections import Counter

# Each node tracks local access frequency
key_access_count = Counter()

def record_access(key):
  key_access_count[key] += 1

# Periodically (every minute), emit metrics; the monitoring system aggregates
# across the cluster and alerts if any key exceeds a threshold (e.g. 1000 req/sec)
def report_hot_keys():
  emit_metric("hot_keys", dict(key_access_count))
  key_access_count.clear()

Solutions

Local Caching Layer:

App → Local memory cache (Caffeine, Guava)
           ↓ miss
      → Distributed cache (Redis)

App servers cache the celebrity profile locally. Reduces load on Redis node by 10-100x.
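
Caffeine and Guava are JVM libraries; here is a minimal Python stand-in for the same two-layer read path, with a short-TTL dict in front of a hypothetical redis-py client:

import time
import redis

remote = redis.Redis(host="cache-cluster", port=6379, decode_responses=True)
local = {}           # key -> (value, expires_at)
LOCAL_TTL = 5        # seconds; short, so hot values stay reasonably fresh

def get_profile(key):
  entry = local.get(key)
  if entry and entry[1] > time.monotonic():
    return entry[0]                        # local hit: no network round trip
  value = remote.get(key)                  # local miss: fall through to Redis
  if value is not None:
    local[key] = (value, time.monotonic() + LOCAL_TTL)
  return value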

Key Replication (Across All Nodes):

Normally: key "profile:kim-kardashian" → lives on Node B

High traffic detected: replicate to all nodes
  Get returns from nearest node (no single bottleneck)

Trade-off: Cache memory used for redundancy

Read-Only Replicas:

Master Node B handles writes

Replicas on all nodes (local to each app server):
  → App Server 1 reads from Replica B (in same datacenter)
  → App Server 2 reads from Replica B (in same datacenter)
  → Stale reads possible, but <1 second typically

Scaling and Rebalancing

Adding a Node

Initial cluster: A, B, C (5460 slots each)

Add Node D:
  1. CLUSTER ADDSLOTS 0-1820 D (admin command)
     → D now owns slots 0-1820
     → A loses slots 0-1820 (now owns 1821-5460)

  2. CLUSTER REPLICATE D (Node D-replica replicates from D)

  3. MIGRATE keys in slots 0-1820 from A to D (run on A, pointing at D):
     for each key in slots 0-1820:
       MIGRATE D_host D_port key_name 0 1000  # 0 = destination db, 1000 ms timeout per key

  4. Once migration done, A no longer owns 0-1820
     → Requests to 0-1820 route to D
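
A sketch of that migration loop using redis-py’s generic execute_command (hypothetical hosts; a full resharding also marks the slot MIGRATING/IMPORTING on both nodes and finishes with CLUSTER SETSLOT):

import redis

source = redis.Redis(host="node-a", port=6379)   # hypothetical source node

def migrate_slot(slot, dest_host="node-d", dest_port=6379, batch=100, timeout_ms=1000):
  while True:
    # Ask the source node which keys it currently holds in this slot
    keys = source.execute_command("CLUSTER", "GETKEYSINSLOT", slot, batch)
    if not keys:
      break
    # Move the batch to the destination (MIGRATE ... KEYS form, Redis >= 3.0.6)
    source.execute_command(
      "MIGRATE", dest_host, dest_port, "", 0, timeout_ms, "KEYS", *keys)

for slot in range(0, 1821):
  migrate_slot(slot)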

Node Failure Recovery

Node B fails
  → B owned slots 5461-10922; B-replica has a copy of the data

Failover options:
  Option 1: Manual: Admin promotes B-replica to master
  Option 2: Automatic: Redis Cluster’s built-in failover detects the failure and promotes the replica (Redis Sentinel fills this role for non-cluster deployments)

After failover:
  → B-replica becomes new master for slots 5461-10922
  → Clients get redirected (MOVED error with new node info)
  → System continues with minor disruption

Scaling Considerations

Scenario | Challenge | Solution
--- | --- | ---
Single node saturated | 1M ops/sec, but one node only does 200K ops/sec | Shard into 5 nodes (consistent hashing); 1M ÷ 5 = 200K ops/sec per node
Memory full | Cache doesn’t grow, hit rate drops | Add nodes (more total memory), tune eviction policy
High latency on some keys | Hot keys (100K req/sec to one key) | Local cache layer + replication + read-only replicas
Replication lag | Replica hasn’t received latest write | Stale reads acceptable for a cache; use write-through for critical data
Network partition | Cluster split into two isolated groups | Cluster remains operational (can operate in degraded mode)

Pro Tip: Monitor cache hit rate closely. If it drops below 80%, either increase cache size or re-examine what you’re caching. If it’s above 95%, you might be caching more than you need, for example data that is already cached at another layer.
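
A sketch of that check, assuming redis-py and the standard keyspace_hits / keyspace_misses counters from INFO stats (hypothetical host):

import redis

r = redis.Redis(host="node-a", port=6379)

def cache_hit_rate():
  stats = r.info("stats")
  hits = stats.get("keyspace_hits", 0)
  misses = stats.get("keyspace_misses", 0)
  total = hits + misses
  return hits / total if total else 0.0

rate = cache_hit_rate()
if rate < 0.80:
  print(f"Hit rate {rate:.1%}: grow the cache or rethink what you cache")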

Trade-offs and Design Decisions

Decision | Chosen | Alternative | Rationale
--- | --- | --- | ---
Routing | Client-side (knows topology) | Proxy-based | Lower latency; client handles the complexity
Data distribution | Hash slots (Redis) | Hash rings (Memcached clients) | Simpler, more predictable slots
Replication | Async (fast writes) | Sync (strong durability) | Cache is ephemeral, data loss acceptable, speed critical
Eviction | Approximate LRU | Exact LRU | Memory overhead vs accuracy trade-off
Hot keys | Multi-layer (local + cache) | Cache replication | Simpler, avoids cache memory bloat
Persistence | RDB snapshots + AOF | No persistence | Balance between crash recovery and performance

Key Takeaways

  1. Consistent hashing solves the distribution problem—adding nodes requires rebalancing only 1/N keys, not all keys
  2. Client-side routing is faster but more complex—proxy-based is simpler but adds latency
  3. Hash slots (Redis) are more predictable than continuous hash rings
  4. Replication must be async for cache performance—durability comes from snapshots, not replication
  5. Eviction policies determine cache effectiveness—LRU for typical workloads, careful tuning needed
  6. Hot keys require multi-layer solutions—local caching + distributed cache works best
  7. Monitoring is crucial—track hit rate, key frequency, memory usage, latency percentiles

Practice

Design a feature to auto-scale a Redis Cluster: monitor throughput and latency, automatically add nodes when 99th percentile latency exceeds 5ms, remove nodes when cluster is underutilized. What metrics would you use? When would you trigger adds vs removes? How do you avoid thrashing (repeatedly adding/removing nodes)?

Connection to Next Section

We’ve designed systems for specific problems—rate limiting (protection), URL shortening (caching), distributed caching (infrastructure). Real systems combine these and more. The next chapters dive into specialized patterns: message queues for async work, search engines for text queries, time-series databases for metrics, and how to orchestrate them into complete platforms.