Multi-Master Replication
When Every Node Must Accept Writes
Picture this: Your application has grown to serve users across Tokyo, London, and São Paulo. A user in Tokyo tries to update their profile while another user in the same region performs a transaction. The request travels 200+ milliseconds to your primary database in Virginia, incurs another 200ms on the return trip, and suddenly a task that should complete in milliseconds takes half a second. Your users experience sluggish interfaces. More critically, during this latency window, other writes from London might have already modified the same record.
In the previous chapter, we saw how master-slave replication solved read scalability by distributing reads across replicas, but writes remained a bottleneck—all writes had to go to a single master. Multi-master replication (also called master-master or bidirectional replication) removes that bottleneck. Every node becomes a write target. Every node can accept modifications from users in its region, then replicate those changes bidirectionally to other nodes.
But this power comes with a profound complexity: What happens when two users in different regions update the same row simultaneously? Or when a network partition temporarily splits your nodes, and both halves accept conflicting writes? This chapter explores how systems detect, prevent, and resolve those conflicts—and when the added complexity is actually worth the benefits.
The Promise and the Problem
Multi-master replication promises:
- Write scalability across regions: Users in Tokyo write to the Tokyo replica, London users write to London, São Paulo users write to São Paulo. All writes are local.
- Lower write latency: No longer routing every write across continents to a central master.
- Availability during network partitions: If the Virginia data center goes offline, Tokyo and London can continue accepting writes.
But every benefit trades off against a fundamental challenge: write conflicts. In a master-slave setup, there’s only one authority—the master—so conflicts are impossible. With multiple masters, you have multiple authorities with potentially contradictory instructions.
Consider this scenario:
- User A in Tokyo updates a shopping cart at 10:00:00.001
- User B in London updates the same cart at 10:00:00.002
- Both updates replicate to each other
- Tokyo now has London’s version, but London also has Tokyo’s version
- Which one is correct?
Three types of conflicts emerge in multi-master systems:
- Write-write conflicts: The same row is updated on two different masters within replication lag. User A modifies a customer’s email in Tokyo, User B modifies the phone in London, both for the same customer record.
- Insert conflicts: Two masters independently insert rows with the same primary key. Tokyo inserts user ID 5000, London inserts user ID 5000 at nearly the same moment.
- Delete conflicts: One master deletes a row while another simultaneously updates it. You delete an order in Tokyo; London tries to refund it at the same moment.
The choice of how to detect and resolve these conflicts fundamentally shapes which multi-master systems you can use.
How Collaborative Editing Teaches Us About Conflict
Think of Google Docs: Multiple people edit the same document simultaneously. Alice types a sentence in paragraph three while Bob deletes paragraph four. If they were using traditional databases with locked rows, only one person could edit at a time. But Google Docs doesn’t lock—it merges changes.
How? The system doesn’t track just the final state of each paragraph. It tracks the history of operations with enough metadata to understand their relationships. When Alice’s edits and Bob’s edits arrive out of order, the system can replay them and produce the correct merged result.
Multi-master databases aspire to this same vision. Instead of “last writer wins” (which loses data), they try to merge changes intelligently. Some use operational transformation (similar to Google Docs), others use Conflict-free Replicated Data Types (CRDTs) that are mathematically designed to merge automatically without conflicts.
The analogy breaks down in one crucial way: Documents have unstructured text; databases have structured data with constraints. You can’t just merge conflicting phone numbers—one must be chosen or the conflict flagged for human resolution.
Conflict Detection: The Foundation
Before resolving a conflict, you must detect it. Three approaches exist:
Timestamps and Last-Writer-Wins
The simplest approach: each write gets a timestamp. When replication lag causes conflicts, the write with the latest timestamp wins. The earlier write is discarded.
UPDATE customers SET email = '[email protected]', updated_at = 1706012401
WHERE id = 5;
-- Later, this update arrives at Tokyo from London:
UPDATE customers SET email = '[email protected]', updated_at = 1706012402
WHERE id = 5;
-- Because 1706012402 > 1706012401, London's version wins
-- Alice's change is silently overwritten
Problem: Last-writer-wins is deterministic, but it loses data: one of the two concurrent writes is silently discarded. Worse, node clocks are never perfectly synchronized, so “latest” really means “whichever node’s clock happened to be ahead,” not necessarily the write the user made last.
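In code, the resolution itself is a one-liner, which is much of LWW’s appeal. Here is a minimal sketch (the field names value, updatedAt, and nodeId are illustrative, not tied to any particular database):
function lwwMerge(local, remote) {
  // Keep the write with the higher timestamp; break ties deterministically
  // by node ID so every replica picks the same winner.
  if (remote.updatedAt > local.updatedAt) return remote;
  if (remote.updatedAt < local.updatedAt) return local;
  return remote.nodeId > local.nodeId ? remote : local;
}
// The two conflicting writes from the SQL above:
const tokyo = { value: "[email protected]", updatedAt: 1706012401, nodeId: "tokyo" };
const london = { value: "[email protected]", updatedAt: 1706012402, nodeId: "london" };
console.log(lwwMerge(tokyo, london).value); // "[email protected]" - Alice's write is gone
The tie-break rule matters more than it looks: without it, two replicas holding identical timestamps could each keep their own value and never converge.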
Version Vectors
A version vector is metadata tracking which writes came before others. Each node tracks a vector of version numbers: [NodeA: 5, NodeB: 3, NodeC: 7] means “this record has seen 5 writes from A, 3 from B, 7 from C.”
When a write arrives at NodeB from NodeA, NodeB updates its version vector:
- NodeA’s write arrives carrying the vector [NodeA: 10, NodeB: 5, NodeC: 8].
- NodeB checks it against its own vector: did NodeB’s own component increase? No (still 5). Did NodeA’s? Yes (10 > 5). The incoming vector is ahead of (or equal to) NodeB’s in every position, so this write is a causal successor—NodeB can safely apply it.
But when two writes genuinely conflict (neither happened-before the other), version vectors reveal it:
- NodeA’s write: [NodeA: 6, NodeB: 5, NodeC: 3]
- NodeB’s write: [NodeA: 5, NodeB: 6, NodeC: 3]
- Neither vector dominates the other (each is ahead in one position and behind in another)—conflict detected.
Version vectors provide precise conflict detection but require storage overhead and become complex with many nodes.
Application-Level Detection
Some systems place the burden on the application: You write code that checks for conflicts and decides how to resolve them. Your application is responsible for detecting if the email field changed and implementing merge logic.
This is the most flexible approach, but it places sophisticated conflict-handling logic in application code, where bugs are costly.
Conflict Resolution Strategies
Once you’ve detected a conflict, you must resolve it. Four main strategies dominate:
Last-Writer-Wins (LWW)
Use timestamp or logical clock values to deterministically pick a winner. Simple, but sacrifices data.
When to use: For data where recent writes truly override previous ones (e.g., a feature flag toggle, a configuration value) and losing prior updates is acceptable.
When NOT to use: For transactional data, inventory counts, or financial records where losing writes causes serious problems.
Version Vectors with Application Resolution
Detect the conflict (via version vectors), then let the application handle it. Your code can implement business logic: “If there’s a conflict on the email field, keep the one that was manually updated by a user, not auto-filled.” Or: “Alert the user and ask them to choose.”
// Returns true if vector a causally dominates vector b: a has seen every
// write b has seen, plus at least one more.
function dominates(a, b) {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let strictlyGreater = false;
  for (const node of nodes) {
    const av = a[node] || 0;
    const bv = b[node] || 0;
    if (av < bv) return false;
    if (av > bv) strictlyGreater = true;
  }
  return strictlyGreater;
}

async function resolveConflict(localVersion, remoteVersion) {
  if (dominates(localVersion.versionVector, remoteVersion.versionVector)) {
    return localVersion; // local already includes the remote write's history
  } else if (dominates(remoteVersion.versionVector, localVersion.versionVector)) {
    return remoteVersion; // remote is a causal successor of local
  } else {
    // Genuine conflict: concurrent writes, neither dominates the other
    // Business logic: user-edited values take precedence over auto-filled ones
    if (localVersion.editedByUser && !remoteVersion.editedByUser) {
      return localVersion;
    }
    return remoteVersion;
  }
}
When to use: High-value data where losing writes is unacceptable and you have the business context to make intelligent merge decisions.
CRDTs: Mathematically Conflict-Free Data Types
A Conflict-free Replicated Data Type (CRDT) is a data structure designed so that concurrent updates from different nodes can be merged without human intervention, always producing a consistent result.
CRDTs work by changing how you represent data. Instead of “current value,” you track all operations. A CRDT counter doesn’t store “count = 5”; it stores “node A incremented 3 times, node B incremented 2 times” and derives the total.
G-Counter (Grow-only Counter):
{
"nodeA": 3, // node A has issued 3 increments
"nodeB": 2, // node B has issued 2 increments
"nodeC": 1
}
// Value = sum of all increments = 6
// Two nodes incrementing simultaneously:
// NodeA: {"nodeA": 4, "nodeB": 2, "nodeC": 1} = 7
// NodeB: {"nodeA": 3, "nodeB": 3, "nodeC": 1} = 7
// Merge them: {"nodeA": 4, "nodeB": 3, "nodeC": 1} = 8
// Both arrive at the same value without coordination
PN-Counter (Positive-Negative Counter): Uses two G-Counters (one for increments, one for decrements) to support both operations.
LWW-Register (Last-Writer-Wins Register): Stores (timestamp, value) pairs. Merging picks the highest timestamp. Simple but data-losing.
OR-Set (Observed-Remove Set): A set that supports add and remove. Each element gets a unique ID. When merged, elements exist if they’ve been added and not had all their IDs removed. This handles the case where an add and remove happen concurrently—the add wins.
// Setup: "item1" is already in the set on both nodes
// NodeB removes "item1", tombstoning only the add-IDs it has observed so far
// Concurrently, NodeA adds "item1" again, which creates a fresh unique ID "a1"
// After merging: "item1" still exists, because ID "a1" was never observed
// by the remove and so was never tombstoned, and the concurrent add wins
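A minimal OR-Set sketch makes the tagging rules concrete. The class below is illustrative (real CRDT libraries add tombstone garbage collection and more compact tags), but it is enough to show why a concurrent add survives a concurrent remove:
class ORSet {
  constructor(nodeId) {
    this.nodeId = nodeId;
    this.counter = 0;
    this.added = new Map();   // element -> Set of unique tags that added it
    this.removed = new Set(); // tombstones: tags whose adds have been removed
  }
  add(element) {
    const tag = `${this.nodeId}:${++this.counter}`; // unique ID per add
    if (!this.added.has(element)) this.added.set(element, new Set());
    this.added.get(element).add(tag);
  }
  remove(element) {
    // Tombstone only the tags this replica has actually observed
    for (const tag of this.added.get(element) || []) this.removed.add(tag);
  }
  has(element) {
    // Present if at least one of its add-tags has not been tombstoned
    for (const tag of this.added.get(element) || []) {
      if (!this.removed.has(tag)) return true;
    }
    return false;
  }
  merge(other) {
    for (const [element, tags] of other.added) {
      if (!this.added.has(element)) this.added.set(element, new Set());
      for (const tag of tags) this.added.get(element).add(tag);
    }
    for (const tag of other.removed) this.removed.add(tag);
  }
}
// Concurrent add and remove of the same element:
const nodeA = new ORSet("nodeA");
const nodeB = new ORSet("nodeB");
nodeA.add("item1");
nodeB.merge(nodeA);     // both replicas now contain item1
nodeB.remove("item1");  // B tombstones the tag it has seen...
nodeA.add("item1");     // ...while A concurrently adds item1 with a new tag
nodeA.merge(nodeB);
nodeB.merge(nodeA);
console.log(nodeA.has("item1"), nodeB.has("item1")); // true true - the add wins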
CRDTs are mathematically proven to converge (all nodes eventually reach the same state) without coordination. The tradeoff: CRDTs only support certain operations and require you to think differently about data structures.
When to use: Distributed systems where strong consistency isn’t possible (geographically distributed, network partitions acceptable), and data is naturally suited to CRDTs (counters, sets, collaborative editing). Riak ships CRDT data types, Redis Enterprise uses CRDTs for its active-active databases, and Cassandra’s counters apply similar ideas.
Multi-Master Implementations
Different databases take different approaches to multi-master replication:
MySQL Group Replication
MySQL Group Replication uses a Paxos-based consensus to coordinate writes among multiple masters. When you perform a write on any node, it reaches a quorum of other nodes, and they agree on a consistent global order before committing.
Write on NodeA:
1. NodeA proposes: "INSERT INTO users..."
2. NodeB and NodeC vote YES or NO
3. If quorum (2+ out of 3) agrees, all nodes commit
4. If partition isolates NodeA, it cannot get quorum and write fails
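The quorum rule at the heart of step 3 is simple enough to state in code. This is a deliberately tiny sketch of the decision, not MySQL's actual implementation (which certifies transactions through a Paxos-derived group communication layer):
function hasQuorum(acks, groupSize) {
  // A write commits only if a strict majority of the group acknowledges it
  return acks > Math.floor(groupSize / 2);
}
console.log(hasQuorum(2, 3)); // true  - 2 of 3 nodes reachable, the write commits
console.log(hasQuorum(1, 3)); // false - an isolated node cannot commit on its own
It also shows why even-sized groups buy nothing: 4 nodes need 3 acknowledgements, so they tolerate the same single failure as 3 nodes do.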
Advantages: Strong consistency within the quorum. Write conflicts are caught before commit: the group agrees on a single global transaction order, and a transaction that conflicts with one already certified is rolled back.
Disadvantages: Requires a majority quorum, so you typically deploy an odd number of nodes (3, 5, 7). If you can’t reach quorum, writes block. Poorly suited to high-latency WAN deployments, because every commit pays at least one cross-node consensus round-trip.
PostgreSQL BDR (Bi-Directional Replication)
PostgreSQL BDR uses a conflict-resolution approach based on write timestamps and application-defined resolution rules. BDR is designed for geographically distributed clusters where sub-second latency between nodes is unrealistic.
-- Illustrative multi-master setup (BDR's setup functions and arguments vary across versions)
-- On nodeA: register the local node and create a node group for the cluster
SELECT bdr.create_node('nodeA', 'host=nodeA dbname=app ...');
SELECT bdr.create_node_group('region_group');
-- On nodeB: register the local node, then join it to nodeA's group
SELECT bdr.create_node('nodeB', 'host=nodeB dbname=app ...');
SELECT bdr.join_node_group('host=nodeA dbname=app ...');
Advantages: Asynchronous replication, so writes don’t wait for remote nodes. Better latency than consensus-based systems. Can handle network partitions (both sides keep writing).
Disadvantages: Conflicts will occur during partitions and must be resolved. Complexity of managing conflict resolution.
CockroachDB
CockroachDB implements multi-master replication using distributed consensus (Raft) at a fine-grained level: the keyspace is split into ranges, each range is replicated to multiple nodes by its own Raft group, and leadership for a range can fail over to any of its replicas. From the user perspective, you can write to any node, and it coordinates consensus for the key range that holds your row.
CockroachDB replication:
NodeA, NodeB, NodeC all have replica of row "user:5000"
NodeA is leader for that key range
Write to NodeB routes to NodeA's Raft group
Raft achieves consensus among the three replicas
All three nodes durably store the write before acknowledging
Advantages: Strong consistency everywhere. ACID guarantees. Multi-region deployments with automatic failover.
Disadvantages: Cross-region writes still pay consensus latency (Raft needs a quorum, which may span regions). For widely separated regions, write latency stays high unless data is placed close to the users who write it.
Galera Cluster (for MySQL)
Galera implements virtually synchronous multi-master replication: every transaction’s write set is replicated to and certified by all nodes before the commit is acknowledged to the client.
Write on NodeA:
1. NodeA replicates to NodeB and NodeC
2. Both confirm replication
3. Only then does NodeA confirm to client (write is durable everywhere)
Advantages: All nodes are always synchronized. No replication lag. Very strong consistency.
Disadvantages: Latency = local write time + replication delay to farthest node. For multi-region, this adds continental latency to every write. Difficult to scale across slow networks.
The CAP Theorem and Multi-Master Trade-offs
Recall the CAP theorem: a distributed system can guarantee at most two of Consistency, Availability, and Partition tolerance.
Multi-master systems must choose:
| Approach | Consistency | Availability | Partition Tolerance | Best For |
|---|---|---|---|---|
| Consensus-based (MySQL GR) | Strong | Limited (needs quorum) | Limited (quorum breaks) | Regional clusters, strong consistency |
| Asynchronous + LWW (PostgreSQL BDR) | Eventual | High | High | Geo-distributed, conflict tolerance |
| CRDTs | Eventual | High | High | Counters, sets, collaborative data |
| Synchronous (Galera) | Strong | High | Limited | Regional, low-latency networks |
If a network partition splits your regions (Tokyo isolated from London), you must choose:
- Consistency: Both regions stop writing (lose availability).
- Availability: Both regions keep writing, conflicts occur (lose consistency).
Multi-master systems usually choose availability and partition tolerance, accepting eventual consistency.
Building Conflict-Free Counters with CRDTs
Let’s implement a simple G-Counter and see conflict resolution in action:
class GCounter {
constructor(nodeId) {
this.nodeId = nodeId;
this.state = {}; // { nodeA: 5, nodeB: 3, nodeC: 7 }
}
increment(value = 1) {
if (!this.state[this.nodeId]) {
this.state[this.nodeId] = 0;
}
this.state[this.nodeId] += value;
}
value() {
return Object.values(this.state).reduce((a, b) => a + b, 0);
}
merge(otherState) {
// Merge by taking max for each node
for (const [nodeId, count] of Object.entries(otherState)) {
if (!this.state[nodeId]) {
this.state[nodeId] = count;
} else {
this.state[nodeId] = Math.max(this.state[nodeId], count);
}
}
}
// Simulate replication
replicate() {
return JSON.parse(JSON.stringify(this.state));
}
}
// Example: inventory counter across three datacenters
const dcTokyo = new GCounter("tokyo");
const dcLondon = new GCounter("london");
const dcSP = new GCounter("sp");
// Tokyo receives 10 items
dcTokyo.increment(10);
console.log("Tokyo:", dcTokyo.value()); // 10
// London receives 5 items independently
dcLondon.increment(5);
console.log("London:", dcLondon.value()); // 5
// Replication: London receives Tokyo's state
dcLondon.merge(dcTokyo.replicate());
console.log("London after merge:", dcLondon.value()); // 15
// Tokyo receives London's state
dcTokyo.merge(dcLondon.replicate());
console.log("Tokyo after merge:", dcTokyo.value()); // 15
// All nodes eventually agree: 15 items total
// This is true despite concurrent increments with no coordination
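As noted earlier, a PN-Counter extends this to support decrements by pairing two G-Counters. Here is a minimal sketch reusing the GCounter class above (the PNCounter class itself is illustrative):
class PNCounter {
  constructor(nodeId) {
    this.positive = new GCounter(nodeId); // all increments ever issued
    this.negative = new GCounter(nodeId); // all decrements ever issued
  }
  increment(value = 1) { this.positive.increment(value); }
  decrement(value = 1) { this.negative.increment(value); }
  value() { return this.positive.value() - this.negative.value(); }
  merge(other) {
    this.positive.merge(other.positive.replicate());
    this.negative.merge(other.negative.replicate());
  }
}
// A like counter where users can also un-like:
const likesTokyo = new PNCounter("tokyo");
const likesLondon = new PNCounter("london");
likesTokyo.increment(3);   // three likes recorded in Tokyo
likesLondon.increment(2);  // two likes recorded in London
likesLondon.decrement(1);  // one user un-likes via London
likesTokyo.merge(likesLondon);
likesLondon.merge(likesTokyo);
console.log(likesTokyo.value(), likesLondon.value()); // 4 4 - both replicas converge
Note that a PN-Counter tracks a sum and nothing more; it cannot by itself enforce an invariant such as "this value must never go negative."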
Practical Conflict Scenario: User Updates
Imagine a user profile database replicated across Tokyo and London:
Initial state: { id: 1, email: "[email protected]", phone: "555-0000", updated_at: 1706000000 }
Time 10:00:00.001 - Tokyo User updates email
UPDATE users SET email='[email protected]', updated_at=1706012401 WHERE id=1
Time 10:00:00.002 - London User updates phone
UPDATE users SET phone='555-9999', updated_at=1706012402 WHERE id=1
Replication lag: These updates propagate to each other asynchronously
Scenario A: Last-Writer-Wins (Timestamp)
- Tokyo has: email=‘[email protected]’, phone=‘555-0000’, updated_at=1706012401
- London has: email=‘[email protected]’, phone=‘555-9999’, updated_at=1706012402
- London’s timestamp is newer, so London’s entire record wins
- Result: Tokyo’s email update is LOST. Phone is 555-9999.
- Problem: Data loss on email update.
Scenario B: Version Vectors + Smart Merge
- Version vector on email write: [tokyo: 5, london: 4]
- Version vector on phone write: [tokyo: 4, london: 5]
- These vectors are incomparable (neither dominates)
- Application logic: since different fields changed, MERGE them
- Result: email=‘[email protected]’, phone=‘555-9999’
- Both updates preserved because they touched different columns.
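Scenario B’s field-level merge is straightforward to sketch, assuming each replica keeps the last commonly replicated version of the row as a base (the three-way-merge shape and the mergeRows helper are illustrative, not a specific database’s API):
function mergeRows(base, local, remote) {
  // Three-way merge: for each column, keep whichever side changed it.
  // If both sides changed the same column to different values, record a
  // conflict for the application (or a human) to resolve.
  const merged = { ...base };
  const conflicts = [];
  for (const field of Object.keys(base)) {
    const localChanged = local[field] !== base[field];
    const remoteChanged = remote[field] !== base[field];
    if (localChanged && remoteChanged && local[field] !== remote[field]) {
      conflicts.push(field); // same field changed differently on both sides
    } else if (remoteChanged) {
      merged[field] = remote[field];
    } else if (localChanged) {
      merged[field] = local[field];
    }
  }
  return { merged, conflicts };
}
const base   = { email: "[email protected]", phone: "555-0000" };
const tokyo  = { email: "[email protected]",  phone: "555-0000" }; // email changed
const london = { email: "[email protected]", phone: "555-9999" }; // phone changed
console.log(mergeRows(base, tokyo, london));
// { merged: { email: '[email protected]', phone: '555-9999' }, conflicts: [] }
When both sides touch the same field (Scenario C below), the conflicts array is where the harder business decision surfaces.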
Scenario C: If both updates touched the same field
- Both Tokyo and London update email
- Vectors are still incomparable (true conflict)
- Application must resolve: perhaps use most recent user-initiated change, or alert the user
- Or fall back to last-writer-wins, accepting that one of the updates will be lost.
When Multi-Master Is Worth the Complexity
Multi-master replication isn’t always the solution. Evaluate it against alternatives:
Multi-Master Shines When:
- Users are distributed globally and write latency is critical.
- Multiple regions must continue operating during partition.
- Conflicts are rare (different regions touch different data).
- Data is naturally suited to CRDTs (counters, sets, collaborative documents).
Consider Alternatives When:
- Reads >> Writes: Master-slave replication with read replicas is simpler and more suitable.
- Strong consistency required: If conflicts are unacceptable, consensus-based approaches add complexity without solving the problem elegantly.
- Write conflicts are frequent: If regions constantly modify overlapping data, conflict resolution becomes a bottleneck. Sharding (splitting data by region) might be better.
- Single-region is acceptable: Many businesses don’t need multi-region writes; master-slave + failover is sufficient.
Pro Tip: Many teams implement multi-master thinking it solves latency, only to realize that most of their writes require global coordination anyway (user email is global; inventory counts are global). Sharding data by region (so regions own different subsets) often solves the actual problem more elegantly than multi-master.
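If you go the sharding route from the Pro Tip, the routing layer can be as simple as a lookup from a customer’s home region to the database that owns their data. A minimal sketch (the region keys and connection strings are placeholders):
// Each regional database owns the customers homed there, so there is
// nothing for their writes to conflict with.
const SHARDS = {
  "ap-northeast": "postgres://tokyo-db.internal/app",
  "eu-west": "postgres://london-db.internal/app",
  "sa-east": "postgres://saopaulo-db.internal/app",
};
function shardFor(customer) {
  const dsn = SHARDS[customer.homeRegion];
  if (!dsn) throw new Error(`No shard for region ${customer.homeRegion}`);
  return dsn;
}
console.log(shardFor({ id: 5000, homeRegion: "eu-west" }));
// postgres://london-db.internal/app - every write for this customer hits one master
The cost, of course, is that cross-shard queries and data genuinely shared across regions become the hard part.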
Key Takeaways
- Multi-master replication enables writes on any node, eliminating the single-master write bottleneck, but introduces write conflicts as the fundamental trade-off.
- Conflict detection varies: timestamps are simple but lossy; version vectors are precise; CRDTs are mathematically proven to merge correctly.
- Different systems make different trade-offs: consensus-based (MySQL GR) provides strong consistency at latency cost; asynchronous + conflict resolution (PostgreSQL BDR) provides geo-distributed availability at consistency cost.
- CRDTs are powerful for conflict-free data structures but require you to change how you represent data and only support specific operations.
- CAP theorem applies: multi-master systems must choose availability and partition tolerance, accepting eventual consistency. Alternatively, consensus protocols sacrifice availability or partition tolerance.
- Multi-master solves write latency but not all write conflicts; some problems are better solved by sharding or accepting master-slave replication with read replicas.
Practice Scenarios
Scenario 1: Inventory System Across Regions Your e-commerce platform maintains product inventory replicated across Tokyo, London, and São Paulo warehouses. A product has 100 units. The London warehouse ships 30 units to a customer (70 units remaining, from London’s local view). Simultaneously, Tokyo ships 20 units (80 remaining, from Tokyo’s view). The operations replicate asynchronously.
Questions:
- If you use last-writer-wins with timestamps, what’s the final inventory count at each node? What’s the real-world inventory?
- How would a G-Counter CRDT handle this? Would it give you accurate inventory?
- Why is a simple CRDT insufficient for inventory, and what additional logic would you need?
Scenario 2: Conflict-Driven Outage You’ve deployed a multi-master PostgreSQL BDR cluster across three regions for a financial application. A network partition isolates Region A (which contains 2 nodes) from Regions B and C (which contain 1 node each). During the partition, a transaction transfers $10,000 from account X to account Y. Region A records a debit on X and credit on Y. Regions B and C also record the same transfer independently (user retried thinking it failed).
Questions:
- Describe the conflict state when the partition heals.
- How should you resolve this in your application code?
- Would you be willing to accept eventual consistency for financial data? Why or why not?
Scenario 3: Multi-Master vs. Sharding Your SaaS product needs to reduce write latency for global customers. You’re considering either:
- Option A: Multi-master replication with conflict resolution
- Option B: Sharding by customer (customer’s data lives in region closest to them)
For each of these features, decide which approach is better:
- User profile updates (each user modifies their own record)
- Team collaboration (multiple users from different regions updating shared documents)
- Reporting (queries that join user profiles with team data across regions)
Looking Ahead
Multi-master replication solves some problems elegantly but introduces others. The next chapter examines synchronous vs. asynchronous replication—the timing dimension of replication. Should a write be acknowledged to the client before it has been persisted on every replica (fast but risky) or only after (slow but safe)? This choice compounds the trade-offs we’ve discussed: synchronous replication ensures durability but increases latency; asynchronous replication is fast but can lose writes during failures. Understanding this trade-off is crucial for designing systems that match your availability and consistency needs.