Rebalancing and Resharding Strategies
As systems grow, the initial partitioning scheme becomes outdated. New shards are added to increase capacity, old nodes fail and are replaced, or business requirements shift, necessitating different partition boundaries. Rebalancing—the process of redistributing data across shards—is one of the most operationally critical and technically challenging aspects of distributed systems at scale. A naïve rebalancing can bring a system to its knees through excessive data movement, lock contention, or temporary unavailability. The goal is to redistribute data smoothly, minimizing disruption while maintaining consistency and availability.
Why Rebalancing Is Necessary
Rebalancing isn’t a theoretical concern—it’s a regular operational necessity in growing systems.
Capacity Expansion: As load grows, you add new nodes. To fully utilize the new capacity, data must be redistributed. If you partition by hash(key) mod N where N is the number of shards, adding a new shard changes N and invalidates the mapping for most keys. Without rebalancing, new data goes to shards that weren’t prepared for it, while old data remains in its original location, leading to severe imbalance.
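The scale of the problem is easy to demonstrate. The sketch below (a self-contained illustration, not tied to any particular system) counts how many keys land on a different shard when the shard count grows from 4 to 5 under hash(key) mod N; in general about N/(N+1) of all keys move, which is why mod-N schemes are rarely rebalanced in place.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Stable hash so results are reproducible across runs.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys map to a different shard")  # roughly 80%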
Node Failures and Replacements: When a node fails permanently and is replaced with a new node, the new node arrives empty. If you’ve been running with replication factor 3, you need to re-replicate data to the new node. This is a form of rebalancing, though often called “recovery” or “rebuild.”
Maintenance and Upgrades: Hardware maintenance, software upgrades, or configuration changes might require taking nodes offline. Before a node goes down for maintenance, its data should be redistributed so that the effective replication factor doesn’t drop below your minimum while it is offline.
Business-Driven Repartitioning: Sometimes the original partition key becomes unsuitable. A system initially partitioned by user_id might shift to hash(region, user_id) to improve locality and reduce cross-region traffic. This requires a full re-key and rebalance.
Fixed Number of Partitions
The simplest approach—and the one used by many systems like Kafka and Elasticsearch—is to create a fixed large number of virtual partitions upfront, then assign those partitions to physical nodes.
How it works:
- Choose a large number of partitions P (e.g., 10,000)
- Keys are mapped to partitions via hash(key) mod P
- Each partition is assigned to a node
- When a node is added or removed, partitions are reassigned, not recomputed
Example assignment table:
Partitions 0-3332: Node A
Partitions 3333-6665: Node B
Partitions 6666-9999: Node C
Adding Node D:
Partitions 0-2499: Node A
Partitions 3333-5832: Node B
Partitions 6666-9165: Node C
Partitions 2500-3332, 5833-6665, 9166-9999: Node D
(Reassignment: each existing node hands roughly a quarter of its partitions to Node D by copying their data; the key-to-partition mapping itself never changes)
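In code, the scheme boils down to a two-level lookup: a fixed key-to-partition function and a mutable partition-to-node table. The sketch below uses illustrative node names and an in-memory dict; a real system would keep the assignment table in a coordination service.
import hashlib

P = 10_000  # fixed number of partitions, chosen up front

def partition_for(key: str) -> int:
    # Stable hash; this mapping never changes as the cluster grows.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % P

# Partition -> node assignment: the only thing rebalancing modifies.
assignment = {p: ("node-a", "node-b", "node-c")[p % 3] for p in range(P)}

def node_for(key: str) -> str:
    return assignment[partition_for(key)]

def reassign(partitions_to_move, new_node):
    # Adding a node = updating the table and copying those partitions' data;
    # hash(key) mod P is untouched, so no keys are re-hashed.
    for p in partitions_to_move:
        assignment[p] = new_node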
Advantages:
- Simple conceptually: key-to-partition mapping is fixed
- Adding a node doesn’t require re-hashing all keys
- Straightforward to implement
- Works well when you know an upper bound on cluster size
Disadvantages:
- Per-partition overhead adds up if P is far larger than the cluster ever needs, leaving many small partitions
- If a partition is hot (celebrity problem), you can’t split it further
- Requires careful upfront capacity planning to choose P
- Wasted partitions if you never reach maximum capacity
When to use: Systems with relatively stable cluster sizes, or where provisioning several times more virtual partitions than physical nodes (e.g., 3-5x) is an acceptable overhead.
Dynamic Partitioning
Instead of fixing the number of partitions upfront, create new partitions dynamically as data grows.
How it works:
- Start with a single partition containing all data
- When a partition grows beyond a threshold (e.g., 100GB), split it into two
- New partitions are assigned to nodes, spreading the load
This is the approach used by systems like HBase, DynamoDB, and MongoDB.
Example progression:
Initial state:
Partition [0, infinity): 50GB on Node A
After growth:
Partition [0, 50M): 50GB on Node A
Partition [50M, infinity): 60GB on Node B
After further growth:
Partition [0, 25M): 45GB on Node A
Partition [25M, 50M): 48GB on Node A
Partition [50M, 75M): 55GB on Node B
Partition [75M, infinity): 65GB on Node C
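A dynamic partitioner’s core decision is when and where to split. The sketch below keeps partition metadata as an in-memory list of ranges (illustrative field names, no actual data movement); real systems such as HBase or MongoDB also copy the underlying data and update routing metadata when they split.
import bisect

SPLIT_THRESHOLD_GB = 100

# Range partitions in key order; "upper" is the exclusive upper boundary.
partitions = [
    {"upper": 50_000_000, "size_gb": 50, "node": "node-a"},
    {"upper": float("inf"), "size_gb": 60, "node": "node-b"},
]

def find_partition(key: int) -> dict:
    # First partition whose exclusive upper boundary is greater than the key.
    uppers = [p["upper"] for p in partitions]
    return partitions[bisect.bisect_right(uppers, key)]

def maybe_split(idx: int, median_key: int, target_node: str) -> None:
    # Once a partition crosses the threshold, split it at (roughly) its median
    # key: the lower half stays put, the upper half is assigned to target_node.
    p = partitions[idx]
    if p["size_gb"] <= SPLIT_THRESHOLD_GB:
        return
    lower = {"upper": median_key, "size_gb": p["size_gb"] / 2, "node": p["node"]}
    upper = {"upper": p["upper"], "size_gb": p["size_gb"] / 2, "node": target_node}
    partitions[idx:idx + 1] = [lower, upper]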
Advantages:
- Automatically adapts to actual data distribution
- No wasted partitions
- Can handle skewed distributions gracefully
Disadvantages:
- More complex: requires tracking partition boundaries
- Rebalancing is continuous, not one-time
- Hot partitions still need special handling (salting or further splitting)
- Unpredictable timing of splits can cause temporary load spikes
When to use: Systems with highly variable data distribution, or where you can’t predict cluster growth patterns.
Proportional Partitioning
A hybrid approach: the number of partitions grows with the number of nodes.
How it works:
- Maintain a ratio of K partitions per node
- When a node is added, create K new partitions
- When a node is removed, merge or redistribute its partitions
Example:
Target ratio: 10 partitions per node
3 nodes: 30 partitions
4 nodes: 40 partitions (split 10 existing partitions in two)
5 nodes: 50 partitions (split another 10 in two)
When reducing from 5 to 4 nodes:
Merge 10 partitions into their neighbors or move their contents to the remaining nodes (the data is redistributed, not deleted)
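The bookkeeping is simple enough to express directly. A hypothetical helper (not from any particular system) that computes how many splits a node addition requires:
K = 10  # target partitions per node

def target_partitions(num_nodes: int) -> int:
    return K * num_nodes

def splits_needed(current_partitions: int, new_num_nodes: int) -> int:
    # Each split turns one partition into two, so the number of splits equals
    # the gap between the current count and the new target.
    return max(0, target_partitions(new_num_nodes) - current_partitions)

print(splits_needed(30, 4))  # growing from 3 to 4 nodes: 10 splits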
Advantages:
- Scales naturally with cluster size
- Maintains a manageable number of partitions
- Simpler than pure dynamic partitioning
Disadvantages:
- Still requires periodic rebalancing
- Choosing the right ratio K requires tuning
Online vs. Offline Resharding
The timing and availability model of rebalancing matters as much as the mechanism.
Offline Resharding
The system is taken offline, its data is rebalanced, and then it is brought back online.
Process:
- Stop accepting writes
- Flush all data to stable storage
- Recompute partition mappings
- Redistribute data to new locations
- Restart the system with new partitioning
Impact: Complete unavailability during rebalancing. For systems serving millions of users, even 5 minutes of downtime is unacceptable. Offline resharding is typically relegated to migrations or major version upgrades.
When acceptable:
- Small systems or non-critical data
- Rebalancing frequency is low (quarterly or annual)
- Maintenance windows are planned and announced
Online Resharding
The system remains available for reads and writes during rebalancing. New clients see the new partitioning, while old data is migrated in the background.
Challenges:
- Consistency during migration: What if a key is being moved while a client tries to read it?
- Double writes: How do you ensure the new location has the latest value?
- Atomic cutover: When do you start directing new requests to the new partition?
Zero-Downtime Migration Strategies
Several techniques enable online resharding without taking the system offline.
Dual-Write Strategy
During migration, writes go to both the old and the new partition locations until migration is complete.
Phase 1: Pre-migration
Write request → Old partition
Phase 2: During migration
Write request → Old partition AND New partition
Read request → Old partition (authoritative)
Migration process copies old data to new location
Phase 3: Post-migration
Write request → New partition only
Read request → New partition
Implementation:
def write_during_migration(key, value, migration_state):
    """
    Route writes to the old and/or new partition depending on migration state.
    old_mapping / new_mapping map keys to partition clients; migration_state
    tracks per-key progress.
    """
    old_partition = old_mapping[key]
    new_partition = new_mapping[key]
    state = migration_state.get(key, 'NOT_STARTED')
    if state == 'COMPLETED':
        # Migration finished: the new location is now authoritative.
        new_partition.write(key, value)
    elif state == 'IN_PROGRESS':
        # Dual-write: the old location stays authoritative, the new one
        # receives a copy so it is up to date at cutover time.
        old_partition.write(key, value)
        new_partition.write(key, value)
    else:
        # Migration not started for this key: write to the old location only.
        old_partition.write(key, value)

def read_during_migration(key, migration_state):
    """
    Read from the old partition until migration is confirmed complete.
    """
    if migration_state.get(key, 'NOT_STARTED') == 'COMPLETED':
        return new_mapping[key].read(key)
    return old_mapping[key].read(key)
Advantages: Simple, easy to reason about
Disadvantages: Write amplification during migration, storage overhead, coordination complexity
Redirect Strategy
The old partition forwards reads to the new partition during migration.
Phase 1: Pre-migration
Write/Read → Old partition
Phase 2: During migration
Write → Old partition
Read → Old partition → (if not found) → New partition
Migration process copies data to new location
Phase 3: Post-migration
Write/Read → New partition
Old partition can be decommissioned
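A minimal sketch of the read path, written from the router’s point of view rather than as server-side forwarding, and reusing the hypothetical old_mapping / new_mapping helpers from the dual-write example:
def read_with_redirect(key):
    # The old partition is tried first; a miss (None) means the key has
    # already been migrated and removed, so fall through to the new location.
    value = old_mapping[key].read(key)
    if value is not None:
        return value
    return new_mapping[key].read(key)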
Advantages: Reduced write amplification
Disadvantages: Read latency increases during migration due to redirects, complexity in handling stale data
Versioned Epoch Strategy
Assign an epoch number to each rebalancing operation. Clients and servers track the epoch, and operations are tagged with the epoch in which they occurred.
Epoch 0 (Old partitioning):
hash(key) mod 100 → partition ID
Epoch 1 (New partitioning):
hash(key) mod 150 → partition ID
Transition:
System announces "switching from Epoch 0 to Epoch 1"
Servers accept operations from both epochs
Migrate all Epoch 0 data to Epoch 1 locations
After migration, reject Epoch 0 operations
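A sketch of epoch-aware routing, with illustrative names and Python’s built-in hash standing in for a stable hash function:
epoch_partition_count = {0: 100, 1: 150}  # epoch -> number of partitions
accepting_epochs = {0, 1}                 # both accepted during the transition

def partition_for_epoch(key, epoch):
    return hash(key) % epoch_partition_count[epoch]

def route(key, client_epoch):
    if client_epoch not in accepting_epochs:
        raise ValueError("stale epoch: client must refresh its routing table")
    return partition_for_epoch(key, client_epoch)

def finish_transition():
    # Called once all Epoch 0 data has been copied to its Epoch 1 locations.
    accepting_epochs.discard(0)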
Used by: Many real systems including some versions of Redis Cluster and Elasticsearch
Advantages: Clear semantics; the epoch number eliminates ambiguity about which data is “current”
Disadvantages: All clients must understand epochs, can be complex to implement correctly
Monitoring and Orchestrating Rebalancing
Rebalancing is inherently risky and requires careful orchestration.
Rebalancing Monitoring
Critical metrics during rebalancing:
- Data transferred per second (should be predictable)
- Network utilization (should not exceed 80% capacity)
- Partition sizes on source and destination nodes
- Replication lag (if applicable)
- Client latency percentiles (should remain within SLO)
- Migration progress (% complete)
Adaptive speed control: If you observe client latency degradation, throttle the rebalancing rate. Conversely, if headroom is available, accelerate it.
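One way to express that feedback loop is an additive-increase / multiplicative-decrease throttle; the numbers below are illustrative, not tuned recommendations.
def adjust_migration_rate(rate_mb_s, observed_p99_ms, slo_p99_ms,
                          min_rate=10, max_rate=500):
    if observed_p99_ms > slo_p99_ms:
        # Clients are suffering: back off sharply.
        return max(min_rate, rate_mb_s * 0.5)
    # Headroom available: ramp up gently.
    return min(max_rate, rate_mb_s + 10)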
Orchestration Tools and Patterns
Manual orchestration: For small clusters or rare rebalancing, manual processes (scripts and runbooks) might suffice. Dangerous but sometimes necessary when automation is unavailable.
Cluster manager: Systems like Kubernetes, Consul, or custom orchestrators automate node assignment and coordinate data movement.
Example orchestration workflow (pseudocode):
1. Operator: "Add 2 new nodes to cluster"
2. Cluster manager:
a. Assigns partitions to new nodes
b. Calculates data movement plan (which partitions move where)
c. Validates that replication factor will be maintained
d. Initiates migration with throttling
3. Monitor:
a. Track migration progress
b. Alert if migration is slower than expected
c. Alert if client latency degrades
4. Completion:
a. Verify data integrity on new locations
b. Update routing tables
c. Mark migration as complete
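Step 2b, calculating the movement plan, is worth sketching: a toy planner that gives every node (old and new) an equal share of partitions while moving only the excess. The assignment dict and node names are illustrative, not a real cluster-manager API.
from collections import defaultdict

def plan_rebalance(assignment, new_nodes):
    # assignment maps partition id -> node name.
    by_node = defaultdict(list)
    for pid, node in assignment.items():
        by_node[node].append(pid)
    for n in new_nodes:
        by_node.setdefault(n, [])
    target = len(assignment) // len(by_node)
    moves = []  # (partition, source, destination)
    for dest, pids in by_node.items():
        while len(pids) < target:
            # Take a partition from any node still holding more than its share.
            src = next(n for n, p in by_node.items() if len(p) > target)
            moved = by_node[src].pop()
            pids.append(moved)
            moves.append((moved, src, dest))
    return moves

plan = plan_rebalance({p: f"node-{p % 3}" for p in range(12)}, ["node-3"])
print(plan)  # three partitions move, one from each existing node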
Mermaid Diagram: Rebalancing Process
graph TD
A["Initial State<br/>3 nodes, 3 partitions<br/>Partition A on Node1<br/>Partition B on Node2<br/>Partition C on Node3"] -->|Add Node4| B["Planning Phase<br/>Decide new layout:<br/>Partition A→Node1<br/>Partition B→Node2<br/>Partition C1→Node3<br/>Partition C2→Node4"]
B --> C["Migration Phase<br/>Dual-Write or Redirect"]
C --> C1["Copy data from<br/>Node3 to Node4<br/>Partition C data splits"]
C1 --> C2["Node3: 50% of C<br/>Node4: 50% of C"]
C2 --> D["Verification Phase<br/>Validate data integrity<br/>Check replication"]
D --> E["Cutover Phase<br/>Update routing metadata<br/>Start new write mapping<br/>Drain old locations"]
E --> F["Final State<br/>4 nodes, 4 partitions<br/>Balanced load<br/>Replication maintained"]
style A fill:#ffcccc
style F fill:#ccffcc
style C1 fill:#fff4cc
Real-World Considerations and Edge Cases
Handling Stragglers
During rebalancing, some partitions might be slow to migrate due to hotness, network issues, or hardware problems. A rebalancing that waits for every partition to finish can stall indefinitely on a few stragglers. Some systems use a deadline-based approach: proceed once all but the slowest K% of partitions have migrated, handle the stragglers separately, and accept brief inconsistency.
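As a sketch, assuming a hypothetical progress() callable that reports the fraction of partitions fully migrated:
import time

def wait_for_migration(progress, completion_target=0.95, deadline_s=3600):
    # Proceed once all but the slowest partitions are done or the deadline
    # passes; the remaining stragglers are handled (or retried) separately.
    start = time.time()
    while progress() < completion_target and time.time() - start < deadline_s:
        time.sleep(5)
    return progress()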
Dealing with Cascading Failures
If multiple nodes fail during rebalancing, the system might become temporarily under-replicated. Orchestration must pause rebalancing and focus on re-replicating critical data first.
Cluster Shrinkage
Removing nodes is trickier than adding them. Before removing a node, all its data must be redistributed. If the node is unresponsive, you must assume its data is lost and re-replicate from other replicas. Cluster shrinkage should be done gradually to avoid overloading remaining nodes.
Coordination with Replication
If the system uses replication (Chapter 11), rebalancing must coordinate with replica synchronization. Ideally, you want strong consistency during rebalancing. Some systems prioritize availability and accept eventual consistency during migration windows.
Connection to Next Section
Rebalancing ensures that as your system grows, data is distributed evenly and efficiently. However, a partitioned system alone doesn’t guarantee durability or high availability. If a shard node fails, all data on that node is lost unless it’s replicated elsewhere. The next chapter, Database Replication, explores how to use replication across multiple nodes to provide fault tolerance, read scaling, and consistency guarantees that work in concert with partitioning strategies.