System Design Fundamentals

Distributed System Trade-offs

Why Perfect Distributed Systems Don’t Exist

Here’s an uncomfortable truth: there is no perfect distributed system, only trade-offs you’re willing to accept. Every architectural decision in a distributed system involves giving something up to gain something else. The art of system design isn’t finding the “best” solution—it’s finding the “least worst” set of trade-offs for your specific requirements.

In previous chapters, we explored consistency models in depth: strong consistency, eventual consistency, read-your-writes guarantees, and causal consistency. But we didn’t address the real question you face when designing a system: why would I ever choose eventual consistency if I could have strong consistency? The answer is always because you’re making a trade-off. You’re accepting weaker guarantees to gain something more valuable: lower latency, higher availability, or reduced operational complexity.

This chapter is about making those trade-offs consciously and systematically rather than stumbling into them by accident.

The Dimensions of Trade-offs

When we talk about trade-offs in distributed systems, we’re really juggling several dimensions simultaneously:

Consistency vs Availability (CAP Revisited)

Remember the CAP theorem? It states that in the presence of a network partition, you can guarantee either consistency (all nodes see the same data) or availability (all nodes remain responsive). You cannot have both during a partition. This framework shaped distributed systems thinking for two decades, but it’s somewhat blunt. Most systems operate without partitions most of the time, so treating them as equally likely misses nuance.

Consistency vs Latency (PACELC)

A more practical framework, PACELC (pronounced “pass-elk”), extends CAP: if there is a Partition, choose between Availability and Consistency; Else (in normal operation), choose between Latency and Consistency. During normal operation, adding replicas or consensus rounds to guarantee strong consistency introduces latency. In Google Spanner or CockroachDB, you’re trading latency for strong consistency even when the network works perfectly. In DynamoDB with eventual consistency, you’re trading consistency for millisecond latency.

Durability vs Performance

Write one record to disk synchronously (fsync every write)? That’s durable but slow—you’re blocked until the physical disk confirms. Write thousands of records in batches before fsync? That’s fast but risky—a power failure loses recent data. RocksDB lets you tune this. Kafka’s acks configuration embodies this trade-off: acks=all means all in-sync replicas acknowledge before the write completes (durable but slower), while acks=1 means the leader acknowledges immediately (fast but riskier).
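
To make the two extremes concrete, here is a minimal sketch in Java (illustrative only, not tied to RocksDB, Kafka, or any particular storage engine) of a log that either forces every record to disk or batches records before a single fsync:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class WriteAheadLog {
    private final FileChannel channel;

    WriteAheadLog(Path path) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Durable but slow: block until the OS confirms the bytes reached the disk.
    void appendDurable(byte[] record) throws IOException {
        channel.write(ByteBuffer.wrap(record));
        channel.force(false);   // fsync on every write
    }

    // Fast but risky: one fsync per batch; a crash can lose the unflushed tail.
    void appendBatch(List<byte[]> records) throws IOException {
        for (byte[] record : records) {
            channel.write(ByteBuffer.wrap(record));
        }
        channel.force(false);   // single fsync amortized over the whole batch
    }
}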

Complexity vs Reliability

A simple system—a single database with no replication—fails catastrophically but rarely surprises you with subtle bugs. Add replication, consensus protocols, and distributed coordination, and you gain reliability (it survives failures) but lose simplicity. Complex systems have more moving parts, more potential failure modes, and more opportunities for subtle bugs. Netflix’s Chaos Engineering exists because complexity hides failure modes that simple analysis misses.

Cost vs Performance and Availability

Run 3 replicas instead of 1? Your data is more available and survives node failures, but you pay 3x the infrastructure cost. Run 7 replicas? Better availability at higher cost, but each additional replica adds write latency (more nodes must acknowledge). Add caching layers? Reduces database load but increases operational complexity. There’s always a cost dimension.

The Fundamental Impossibility: FLP

Before we get practical, you need to understand the Fischer-Lynch-Paterson (FLP) impossibility result. In an asynchronous distributed system (where messages can be arbitrarily delayed), no deterministic consensus protocol can guarantee progress (termination) if even a single node might crash. This is not a limitation of current algorithms; it is a fundamental mathematical impossibility.

What does this mean in practice? It means that timeouts and heuristics are not failures of our designs; they’re essential features. There’s no way to distinguish between a crashed node and a node whose messages are just really delayed. So Raft uses election timeouts. When a leader hasn’t sent a heartbeat in 150ms, followers assume it’s dead and hold a new election. This is a pragmatic heuristic, not a guarantee. Sometimes the network is just slow and you’ll trigger an unnecessary election, introducing latency jitter. But you can’t avoid this—the FLP result proves you can’t.
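
A minimal sketch of that heuristic, assuming a randomized 150-300ms timeout (a commonly cited default range); this is illustrative pseudologic, not code from any particular Raft implementation:

import java.util.concurrent.ThreadLocalRandom;

class FollowerTimer {
    private volatile long lastHeartbeatMillis = System.currentTimeMillis();
    // Randomized timeout so followers don't all start elections at the same moment.
    private final long electionTimeoutMillis =
            ThreadLocalRandom.current().nextLong(150, 300);

    void onHeartbeat() {
        lastHeartbeatMillis = System.currentTimeMillis();
    }

    // Called periodically; returns true when this follower should start an election.
    // It cannot distinguish a crashed leader from a slow network (FLP), so an
    // unnecessary election is always possible.
    boolean shouldStartElection() {
        return System.currentTimeMillis() - lastHeartbeatMillis > electionTimeoutMillis;
    }
}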

This insight reframes trade-off thinking: you’re not choosing between a theoretical guarantee and a heuristic. You’re choosing which heuristics and which failure modes you’re willing to accept.

The Trade-off Decision Matrix

Let me walk through the five most consequential trade-offs you’ll encounter:

Trade-off #1: Replication Factor vs Write Latency vs Cost

Three replicas is the industry standard for good reason. With 3 replicas, you can tolerate 1 failure. With 5 replicas, you tolerate 2 failures but pay more and add latency (more nodes must acknowledge writes). With 1 replica, you have no failure tolerance.

But “standard” doesn’t mean right for your system. A logging service might use 1 replica per datacenter (losing some logs is painful but not catastrophic, and the cost of extra copies adds up). A payment ledger uses 5 or more replicas (losing transactions is unacceptable; cost is secondary).

Here’s the latency math: if you require all replicas to acknowledge before a write completes (synchronous replication), the latency is the slowest replica’s latency. If your replicas are geographically distributed (US West, US East, Europe), and you require all to ack, you’re waiting for the European datacenter’s round-trip time. That’s 150+ milliseconds. For a user-facing API, that’s brutal.

The solution we often choose: quorum writes. With 3 replicas, send the write to all 3 but wait for only 2 acknowledgments. You can still tolerate 1 failure (the third replica can be down and you still reach 2). The write latency is set by the second-fastest replica rather than the slowest. What you give up is the guarantee that every replica holds the write at the moment you acknowledge it, which is why reads must also go to a quorum (see the sketch below and the pro tip that follows).
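
Here is the idea as a sketch (the Replica interface is a hypothetical stand-in for whatever client talks to each replica); the write returns once the write quorum has acknowledged, so latency is set by the second-fastest of three replicas rather than the slowest:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class QuorumWriter {
    // Hypothetical replica client: completes its future when that replica acks the write.
    interface Replica {
        CompletableFuture<Void> write(byte[] key, byte[] value);
    }

    private final List<Replica> replicas;
    private final int writeQuorum;   // e.g. 2 of 3

    QuorumWriter(List<Replica> replicas, int writeQuorum) {
        this.replicas = replicas;
        this.writeQuorum = writeQuorum;
    }

    // Returns true once writeQuorum replicas have acked, or false on timeout.
    boolean write(byte[] key, byte[] value, long timeoutMillis) throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(writeQuorum);
        for (Replica replica : replicas) {
            replica.write(key, value).thenRun(acks::countDown);
        }
        // Latency is decided by the writeQuorum-th fastest replica, not the slowest.
        return acks.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}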

Pro Tip: Quorum replication is more sophisticated than it first appears. If you write to a quorum and read from a quorum with proper overlap (the write and read quorum sizes must sum to more than the total number of replicas, i.e. W + R > N), every read quorum intersects every write quorum, so you guarantee reading your own writes. This is one way to achieve “read-your-writes” consistency without global consensus.
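
Expressed in code, the overlap condition is a one-line check:

class QuorumMath {
    // W = write quorum, R = read quorum, N = total replicas.
    // If W + R > N, every read quorum overlaps every write quorum in at least one
    // replica, so a quorum read always sees the latest quorum-acknowledged write.
    static boolean readsSeeLatestWrite(int w, int r, int n) {
        return w + r > n;
    }
    // Example: N = 3, W = 2, R = 2  ->  2 + 2 > 3, so quorum reads see the write.
}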

Trade-off #2: Consensus Overhead vs Fault Tolerance

Raft is intuitive and practical, but it’s not free. Each leader election, each replication round, each log entry requires network messages and processing. The overhead scales with the number of nodes:

  • 3 nodes: Can tolerate 1 failure. Majority is 2. Most consensus messages complete in 1 round trip.
  • 5 nodes: Can tolerate 2 failures. Majority is 3. Takes slightly longer as more nodes must participate.
  • 7+ nodes: Can tolerate 3+ failures. But now you’re sending messages to many nodes, and the slowest node in the quorum determines latency.

Google uses Paxos (a more complex consensus protocol than Raft) in production systems where they’ve tuned it extensively. Facebook uses different consistency models for different components. But most teams shouldn’t run 7-node consensus for the fun of it. The overhead per additional replica isn’t linear—it compounds with network topologies, garbage collection pauses, and disk I/O.

The common pattern: use 3-node consensus clusters for most systems. If you need higher fault tolerance, shard your data and run multiple 3-node clusters. This is how CockroachDB and Google Spanner work internally: they keep each consensus group small and scale by running many such groups, one per shard, rather than one giant cluster-wide consensus.
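
The sizing arithmetic behind the node counts above is simple enough to write down (a small helper, shown for reference):

class ConsensusSizing {
    // Nodes that must agree for a consensus group of n nodes.
    static int majority(int n)  { return n / 2 + 1; }
    // Failures the group can survive while still forming a majority.
    static int tolerated(int n) { return (n - 1) / 2; }
    // majority(3) = 2, tolerated(3) = 1
    // majority(5) = 3, tolerated(5) = 2
    // majority(7) = 4, tolerated(7) = 3
}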

Trade-off #3: Data Locality vs Global Consistency

Users in Tokyo connect to a service in Tokyo. Users in New York connect to a service in New York. This is fast—data is close, network latency is low.

But now you have data in two places. A user updates their profile in Tokyo; the change must eventually reach New York. If a New York user reads the profile immediately after, they might see the old version. This is the trade-off: local reads are fast, but consistent reads require global coordination.

Global coordination is expensive. You could require all reads to check the authoritative source (maybe a central database in Europe), but that defeats the purpose of local caches. Or you could require all writes to be globally committed before acknowledging to the user, but that introduces latency (a Tokyo write must wait for replication to New York).

Most real systems make a hybrid choice: serve most reads from a local cache or replica (accepting that they may be slightly stale) and offer strong consistency as a more expensive option for critical operations. DynamoDB exposes this as the “strongly consistent read” option: you can request it per read, but it comes with higher latency and higher cost.
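
As a concrete illustration, a per-request strongly consistent read with the AWS SDK for Java v2 might look roughly like this (the table name and key attribute are made up for the example):

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse;

class ProfileReader {
    private final DynamoDbClient dynamo = DynamoDbClient.create();

    GetItemResponse readProfile(String userId, boolean mustBeCurrent) {
        GetItemRequest request = GetItemRequest.builder()
                .tableName("ProfileTable")   // hypothetical table
                .key(Map.of("userId", AttributeValue.builder().s(userId).build()))
                // true  -> strongly consistent read: higher latency, roughly double the read cost
                // false -> eventually consistent read: cheaper and faster, may be stale
                .consistentRead(mustBeCurrent)
                .build();
        return dynamo.getItem(request);
    }
}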

Netflix uses a different approach: they shard their user data by geography. New York users’ data lives in US-EAST, Tokyo users’ data lives in Tokyo. The system ensures data never moves (unless the user moves). This avoids replication entirely for the common case.

Trade-off #4: Batch Processing vs Stream Processing

Should you process millions of events all at once (batch) or continuously as they arrive (stream)?

Batch processing (throughput-optimized): Process 1 million events in a single batch. You amortize overhead. Modern CPUs love processing contiguous data—branch mispredictions are minimized, caches are efficient. Batch systems can achieve thousands of transactions per second per core.

Stream processing (latency-optimized): Process each event as it arrives. A user interaction triggers immediate processing. An anomaly is detected within milliseconds. But you process each event individually, which wastes CPU throughput and uses more resources.

The fundamental tension: batching adds latency (you wait for the batch to fill) but improves throughput. Streaming minimizes latency but sacrifices throughput.

Many systems use micro-batching: accumulate events for a few milliseconds (say, 10ms), then process them as a batch. Spark Streaming works this way, and Kafka producers apply the same idea with linger.ms batching. You get far better latency than true batch (tens of milliseconds versus hours) and better throughput than pure event-at-a-time processing.
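
A minimal micro-batching loop as a sketch (not the internals of Spark or Kafka), assuming a 10ms flush window and a batch-size cap:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class MicroBatcher<E> {
    private final BlockingQueue<E> queue;
    private final Consumer<List<E>> processor;
    private final long maxWaitMillis = 10;   // latency bound: flush at least every 10ms
    private final int maxBatchSize = 1_000;  // throughput bound: flush when the batch is full

    MicroBatcher(BlockingQueue<E> queue, Consumer<List<E>> processor) {
        this.queue = queue;
        this.processor = processor;
    }

    // One batching cycle: collect events until the window closes or the batch fills.
    void runOnce() throws InterruptedException {
        List<E> batch = new ArrayList<>(maxBatchSize);
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (batch.size() < maxBatchSize) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) break;
            E event = queue.poll(remaining, TimeUnit.MILLISECONDS);
            if (event == null) break;   // waited out the window
            batch.add(event);
        }
        if (!batch.isEmpty()) {
            processor.accept(batch);    // amortize per-event overhead across the batch
        }
    }
}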

Trade-off #5: Coordination vs Autonomy in Microservices

In a monolithic system, you have one transaction that either succeeds or fails atomically. In a microservices system, payment service, inventory service, and shipping service are separate. You need to coordinate them to process an order.

Option A: Distributed Transactions. Use a 2-phase commit (2PC) protocol: all services prepare the transaction, then all commit. This guarantees atomicity. But 2PC is slow (2 rounds of messaging), and if a service becomes unresponsive, the entire transaction blocks.
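
A bare-bones coordinator for that flow (the Participant interface is hypothetical): phase one asks every service to prepare, and only if all vote yes does phase two commit; otherwise everything aborts:

import java.util.List;

class TwoPhaseCommitCoordinator {
    // Hypothetical participant: payment, inventory, shipping services, etc.
    interface Participant {
        boolean prepare(String txId);   // vote yes/no and hold locks until the outcome
        void commit(String txId);
        void abort(String txId);
    }

    // Blocks for two full rounds of messaging; a stalled participant stalls everyone.
    boolean execute(String txId, List<Participant> participants) {
        boolean allPrepared = true;
        for (Participant p : participants) {          // phase 1: prepare
            if (!p.prepare(txId)) { allPrepared = false; break; }
        }
        for (Participant p : participants) {          // phase 2: commit or abort
            if (allPrepared) p.commit(txId); else p.abort(txId);
        }
        return allPrepared;
    }
}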

Option B: Saga Pattern. Split the order into steps: reserve inventory, then process payment, then ship. If payment fails, compensate by releasing inventory. This avoids 2PC overhead but requires writing compensation logic. It’s more complex and offers weaker guarantees (you can end up in “stuck” states where compensation fails).
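
And a corresponding saga sketch (the Step interface is again hypothetical): steps run in order, and a failure triggers compensation of the already-completed steps in reverse:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class OrderSaga {
    // Hypothetical saga step: a local transaction plus the logic that undoes it.
    interface Step {
        void execute(String orderId);      // e.g. reserve inventory, charge payment
        void compensate(String orderId);   // e.g. release inventory, refund payment
    }

    boolean run(String orderId, List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute(orderId);
                completed.push(step);
            } catch (RuntimeException failure) {
                // Undo the completed steps in reverse order. If a compensation itself
                // fails, the saga is "stuck" and needs retries or manual intervention.
                while (!completed.isEmpty()) {
                    completed.pop().compensate(orderId);
                }
                return false;
            }
        }
        return true;
    }
}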

Option C: Eventual Consistency. Submit the order, return success immediately, and asynchronously process payment and shipping. If something fails, the business processes it manually. This is fast and decoupled but doesn’t guarantee the order succeeds.

Different companies make different choices. Uber uses a hybrid: critical money-related operations use 2PC or Saga patterns; less critical operations use eventual consistency. Amazon historically pushed teams toward eventual consistency and compensation logic, accepting higher operational complexity in exchange for system independence.

The Harvest and Yield Model

Armando Fox and Eric Brewer introduced an alternative framework to CAP: Harvest and Yield.

Yield is the probability of completing a query. During network degradation, you might lose the ability to read from a replica, so your yield drops from 100% to 90% (you can serve 9 out of 10 requests).

Harvest is the fraction of the complete answer your response reflects. During degradation, you might answer a query using only the shards or replicas that are reachable, so you return, say, 80% of the “complete” answer (partial or filtered results).

This model is more realistic than CAP: most degradations don’t create hard partitions; they’re partial failures. A slow replica might still be reachable but adds latency. Network congestion might affect 5% of requests, not 100%. With Harvest/Yield thinking, you ask: “When things go wrong, do we return partial answers, or do we answer fewer requests completely?”
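
Stated as formulas, using the definitions above:

class HarvestYield {
    // Yield: what fraction of requests did we answer at all?
    static double yield(long answered, long received) {
        return (double) answered / received;
    }
    // Harvest: what fraction of the complete answer did each response reflect?
    static double harvest(long dataReturned, long dataAvailableWhenHealthy) {
        return (double) dataReturned / dataAvailableWhenHealthy;
    }
    // Degraded example: answer 9 of 10 requests  -> yield   = 0.9
    //                   use 8 of 10 shards each  -> harvest = 0.8
}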

Netflix chose to optimize for Harvest: during degradation, return popular recommendations even if the personalization service is unreachable. Amazon chose to optimize for Yield: refuse requests if the critical data source is unreachable, ensuring consistency for available requests.

Practical Trade-off Examples

Example: DynamoDB Configuration

DynamoDB lets you provision throughput per table and choose between strong and eventual consistency per read. Here is an illustrative configuration (pseudo-config, not literal DynamoDB syntax) capturing three different use cases:

# User profile table: reads need to be current
ProfileTable:
  readConsistency: STRONG  # Trade latency for consistency
  readCapacity: 5000      # Costs money for high throughput
  writeCapacity: 500      # Proportional write throughput

# Analytics table: eventual consistency is fine
AnalyticsTable:
  readConsistency: EVENTUAL  # Lower latency, lower cost
  readCapacity: 1000
  writeCapacity: 2000  # Many writes, few strong-consistency reads

# Recommendations table: stale data is fine, speed matters
RecommendationsTable:
  readConsistency: EVENTUAL
  readCapacity: 100      # Few direct reads; recommendations are cached close to users
  writeCapacity: 5000    # Bulk recompute happens frequently

Example: Kafka Durability Configuration

Kafka lets producers choose the durability level:

// Ultra-durable: wait for all in-sync replicas (ISRs) to acknowledge
properties.put("acks", "all");           // Write completes only after every ISR has it
properties.put("retries", Integer.MAX_VALUE);
properties.put("max.in.flight.requests.per.connection", "1");  // Preserve ordering across retries

// Fast: leader acknowledges immediately, async replication
properties.put("acks", "1");             // Only the leader ACKs
properties.put("compression.type", "snappy");

// Very fast: no acknowledgment, fire and forget
properties.put("acks", "0");             // No ACK at all

The latency differences are dramatic. For a 1MB message, acks=all might take 50ms (waiting for replicas), acks=1 takes 5ms (just the leader), and acks=0 takes less than 1ms.

Example: Multi-tier Architecture with Mixed Consistency

User Request

[Cache Layer] - eventual consistency, 1ms latency
    ↓ (cache miss)
[Read-Your-Writes Layer] - monotonic reads, 10ms latency
    ↓ (stale data unacceptable)
[Strong Consistency Layer] - Raft consensus, 100ms latency
    ↓ (user updates)
[Write Path] - 3-node Raft, async replication to analytical store

A typical request path:

  1. Check cache (fast, stale is okay)
  2. If needed, read from local read replica (consistent with last write, moderate latency)
  3. If needed, read from authoritative store (strong consistency, high latency)
  4. Writes always go through consensus (durability guarantee)

Different parts of the system make different trade-offs based on requirements.
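
Here is that request path as a sketch (the cache, replica, and strong-store interfaces are hypothetical placeholders for the layers in the diagram):

import java.util.Optional;

class TieredReader {
    interface Cache       { Optional<String> get(String key); }   // ~1ms, may be stale
    interface ReadReplica { Optional<String> get(String key); }   // ~10ms, read-your-writes
    interface StrongStore { String get(String key); }             // ~100ms, consensus-backed

    private final Cache cache;
    private final ReadReplica replica;
    private final StrongStore strongStore;

    TieredReader(Cache cache, ReadReplica replica, StrongStore strongStore) {
        this.cache = cache;
        this.replica = replica;
        this.strongStore = strongStore;
    }

    String read(String key, boolean staleOk) {
        if (staleOk) {
            Optional<String> cached = cache.get(key);
            if (cached.isPresent()) return cached.get();           // step 1: fast, possibly stale
            Optional<String> replicated = replica.get(key);
            if (replicated.isPresent()) return replicated.get();   // step 2: moderate latency
        }
        return strongStore.get(key);                               // step 3: strong consistency, slow
    }
}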

How to Evaluate Trade-offs Systematically

When you’re designing a system, use this framework:

  1. Understand your failure domain. What failures are actually likely? Network partitions between datacenters? Yes, happened to GitHub, Stripe, and AWS. Node crashes? Common. Correlated failures (entire datacenter goes down)? Rare. Silent data corruption? Very rare. Your consistency model should defend against likely failures, not hypothetical ones.

  2. SLA-driven decisions. What does your SLA require? “99.95% availability” means you can be down for roughly 22 minutes per month. If you’re running 3 replicas with 5-minute failover, you’re okay. If you need 99.99% availability (about 4.3 minutes per month), you need multi-region replication and sub-minute failover. (The arithmetic is sketched after this list.)

  3. The “good enough” principle. Don’t over-engineer consistency. If your use case tolerates 1-second stale data, eventual consistency is fine. If it tolerates 1-minute stale data, you can batch updates. The cost difference is often 10x.

  4. Ask what breaks. For each consistency level, ask: “What’s the failure scenario if I choose this?” If you choose eventual consistency for a payment ledger and the network partitions for 2 hours, what happens? Can you detect inconsistencies later? Can you compensate? If the answer is “no,” eventual consistency isn’t acceptable.

  5. Revisit requirements analysis. Chapter 2 asked about scale and latency. Now ask about consistency: Do all users need to see the same data immediately? Can some users see stale data? What’s the maximum acceptable staleness? Some of these answers come from interviewing stakeholders with distributed systems knowledge, not just product managers.
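
The downtime budgets in point 2 are straightforward arithmetic; assuming a 30-day month:

class AvailabilityBudget {
    // Downtime budget per 30-day month for a given availability target.
    static double downtimeMinutesPerMonth(double availability) {
        double minutesPerMonth = 30 * 24 * 60;   // 43,200 minutes
        return (1.0 - availability) * minutesPerMonth;
    }
    // downtimeMinutesPerMonth(0.999)  ≈ 43.2 minutes
    // downtimeMinutesPerMonth(0.9995) ≈ 21.6 minutes
    // downtimeMinutesPerMonth(0.9999) ≈  4.3 minutes
}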

Did you know? Google Spanner achieves strong consistency across multiple datacenters with the help of TrueTime: GPS receivers and atomic clocks in every Google datacenter that keep clock uncertainty down to a few milliseconds. Spanner still runs Paxos within each replica group, but TrueTime timestamps let it order transactions globally and serve externally consistent reads without extra cross-region coordination. It’s a hardware assist for a software problem, and it only makes sense at Google’s scale, where the cost is justified.

Key Takeaways

  • No free lunch: Every distributed systems choice trades off latency, consistency, availability, cost, or complexity. There’s no “best” choice, only the “least worst” for your requirements.

  • CAP is blunt; PACELC is practical: Most failures aren’t perfect partitions. In normal operation, you’re choosing consistency vs latency. Different parts of your system can make different choices.

  • FLP impossibility means heuristics, not guarantees: Consensus protocols and distributed systems rely on timeouts, not mathematical certainty. This isn’t a limitation—it’s a design feature.

  • Replication is a lever with multiple knobs: Replication factor, write quorum, read consistency, and durability level are independent levers. Adjust them separately for different tables or components.

  • Geographic distribution adds trade-offs: Local reads are fast but stale. Strong consistency requires global coordination. Most real systems serve fast stale reads and expensive strong reads.

  • Make trade-offs consciously: Use SLAs, failure domains, and staleness tolerance to drive decisions. Document why you made each choice. When failures occur, you’ll understand the constraints.

Practice Scenarios

Scenario 1: Payment Processing System

You’re building a payment system. Users can send money to each other. Requirements:

  • A transaction must either succeed completely or fail completely (no partial transfers)
  • Low latency is important but accuracy is critical
  • Users are globally distributed

Design your consistency and replication strategy. Where would you use strong consistency? Where would you use eventual consistency? What would you do if a network partition separated US and Europe for 10 minutes?

Scenario 2: Feed Generation System (Like Twitter/Facebook)

You’re building a social media feed system. Requirements:

  • Users want to see recent posts from people they follow
  • A post should appear within 2-5 seconds of creation
  • Global scale with millions of concurrent users
  • Stale feeds are acceptable if the system is degraded

Map out your trade-off decisions. Would you use strong consistency? Eventual consistency? Would you batch-process or stream? How would you handle a slow replica in Europe?

Scenario 3: Inventory Management Across Regions

You have warehouses in 5 regions, each with local inventory. Requirements:

  • When a customer orders, inventory must be available (can’t oversell)
  • Shipping should happen from the closest warehouse
  • System should remain available even if one region goes offline

Design the consistency model for inventory data. Would replicas across regions be synchronized, or would you use local-first with periodic sync? What happens during a partition?

What’s Next

You now understand the fundamental trade-offs in distributed systems. In the next chapter, we’ll apply this knowledge to choosing the right consistency model for your specific use case. You’ll learn decision frameworks for when to use strong consistency, when eventual consistency is safe, and when causal or session guarantees make sense.