System Design Fundamentals

Common Pitfalls to Avoid

When Good Intentions Go Wrong

Imagine you’ve just launched a successful web application. Your users love it, traffic is growing, and everything seems perfect. Then, one Friday night at 11 PM, your system starts crashing under load. Your team rushes to debug, and you discover a series of decisions that made perfect sense when you built the application—but now they’re strangling its ability to scale. This scenario plays out in companies large and small, and it often stems from a handful of recurring mistakes that we’ll explore in this chapter.

By the time you finish reading this section, you’ll understand the most common pitfalls that system designers encounter and, more importantly, how to sidestep them from the beginning. You’ve already learned about scalability, reliability, and basic system design thinking in the previous chapters. Now we’re going to show you the wrong turns that most teams take—not so you feel scared, but so you can build with confidence and foresight.

This chapter acts as a guardrail for everything you’ll learn going forward. Think of it as the “lessons learned” section before you dive into the technical building blocks. We’ll refer back to these pitfalls throughout the book because they influence every architectural decision you make.

The Five Deadly Sins of System Design

The five most dangerous pitfalls in system design fall into a clear pattern: they all stem from either rushing to implement without planning, or assuming that early choices won’t need to change later. Let’s introduce each one briefly, then we’ll dive deeper.

Pitfall #1: Premature Optimization happens when engineers optimize for performance, scalability, or cost before understanding actual system requirements and bottlenecks. This manifests as choosing a sophisticated caching strategy before knowing whether caching is even needed, or restructuring your database schema for a “theoretical future scale” you may never reach. The irony is that this optimization work creates complexity that makes the system harder to understand, test, and modify, and it often optimizes the wrong things.

Pitfall #2: Ignoring Failure Modes means designing systems as if failures won’t happen. Many junior designers assume that networks are reliable, databases always respond, and third-party APIs are always available. In reality, everything fails: disk drives fail, networks partition, services crash, and business logic bugs happen. A system that hasn’t been designed to handle these realities gracefully will suffer cascading failures: one component breaks and takes down the entire system.

Pitfall #3: Tight Coupling occurs when different components of your system are so interdependent that changing one requires changing many others. Imagine a house where the electrical system is wired directly into the walls with no switches or outlets. Want to add a new room? You’d need to rewire half the house. Similarly, tightly coupled systems are inflexible, hard to test, and fragile when requirements change.

Pitfall #4: Neglecting Observability is the sin of building systems that you can’t see into. Your application runs, but when something goes wrong, you’re blind. You have no logs, no metrics, no way to trace where a request went or why it failed. This pitfall is particularly insidious because you might not realize you’re in trouble until a critical failure leaves you completely in the dark.

Pitfall #5: One-Size-Fits-All Data Storage happens when engineers pick a single database technology and try to solve every data problem with it. Perhaps you fall in love with a relational database and force all your data into normalized tables, even data that would be better served by a document store, graph database, or cache. Or conversely, you use NoSQL for everything and lose the transactional guarantees you actually need.

These pitfalls don’t exist in isolation—they often reinforce each other. Tight coupling makes it harder to handle failures. Lack of observability hides the effects of premature optimization. One-size-fits-all databases couple your business logic to an inflexible data model.

Building a House the Wrong Way

Consider building a house. A novice builder might say, “Let’s add extra electrical infrastructure just in case we eventually want to power a factory in the basement” (premature optimization). But they also need to design the house so that if the water main breaks, water doesn’t destroy the entire structure (failure modes). They should use different materials for different purposes—wood framing, concrete foundations, copper pipes, electrical wiring—not try to build everything out of concrete (one-size-fits-all). And they need windows, meters, and inspection panels so homeowners can see how the house’s systems are functioning (observability), not a mysterious black box of a house.

The key insight is that good design acknowledges constraints and trade-offs upfront, rather than fighting them with over-engineering. A well-built house uses the right materials for each job, isolates failures to contained areas (a burst pipe doesn’t ruin the whole structure), and includes ways to inspect and maintain systems. Systems architecture works exactly the same way.

Anatomy of Each Pitfall

Understanding Premature Optimization

Premature optimization happens in layers. At the architectural level, you might introduce caching, sharding, or multi-region replication when a single, simple database would serve current users perfectly well. At the code level, you might optimize for performance when clarity and correctness matter more. The cost of this optimization is what we call “complexity debt”—you’ve made the system harder to reason about, test, and change.

The antidote is measurement-driven optimization. Before you optimize, ask: What is slow? How do you know? Have you actually measured it? In practice, this means:

  1. Build a system that works correctly and handles your current scale
  2. Deploy it and measure real performance (using the observability practices we’ll discuss)
  3. Identify actual bottlenecks (not theoretical ones)
  4. Optimize only the bottlenecks
  5. Measure again to confirm the optimization helped

This approach often reveals surprising truths: the database query you thought was the bottleneck isn’t, or a simple index solves the problem better than a complex caching layer.
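
To make this concrete, here is a minimal measurement sketch in plain Node.js. The handler name and the fake workload are illustrative stand-ins; the point is that a few lines of timing code tell you where the time actually goes before you reach for a cache or an index.

// A minimal latency-measurement sketch (illustrative names, fake workload).
const { performance } = require('perf_hooks');

const durations = [];

// Wrap any async handler so every call records how long it took.
function instrument(handler) {
  return async (...args) => {
    const start = performance.now();
    try {
      return await handler(...args);
    } finally {
      durations.push(performance.now() - start);
    }
  };
}

// Nearest-rank percentile over the recorded samples.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[index];
}

// Hypothetical handler standing in for real work.
const handleCheckout = instrument(async () => {
  await new Promise((resolve) => setTimeout(resolve, Math.random() * 50));
});

(async () => {
  for (let i = 0; i < 200; i++) await handleCheckout();
  console.log(`p50=${percentile(durations, 50).toFixed(1)}ms p99=${percentile(durations, 99).toFixed(1)}ms`);
})();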

Designing for Failure: Circuit Breakers and Bulkheads

Ignoring failure modes is perhaps the most costly pitfall. When you call an external service—say, a payment processor—what happens if it’s down? A naïve system might wait forever or retry indefinitely, exhausting its own resources and making the problem worse. A well-designed system uses patterns like the circuit breaker.

A circuit breaker works like an electrical circuit breaker in your home. When a service is down, instead of hammering it with requests that will fail, the circuit breaker “opens” and returns an error immediately. Once the service recovers, the breaker “closes” and resumes normal traffic. This prevents your system from wasting resources on doomed requests.

graph LR
    B[Closed: Pass Requests Through] -->|Failure Rate Exceeds Threshold| C[Open: Fail Fast]
    C -->|Requests While Open| E[Return Error Immediately]
    C -->|Timeout Elapsed| D[Half-Open: Test Recovery]
    D -->|Success| B
    D -->|Failure| C
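
Here is a minimal sketch of that state machine in code. The protected call is supplied by the caller, and the threshold and timeout values are illustrative defaults, not recommendations.

// A minimal circuit-breaker sketch (illustrative thresholds).
class CircuitBreaker {
  constructor(callService, { failureThreshold = 5, resetTimeoutMs = 30000 } = {}) {
    this.callService = callService;   // the async call being protected
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.failures = 0;
    this.state = 'CLOSED';            // CLOSED -> OPEN -> HALF_OPEN -> CLOSED
    this.openedAt = 0;
  }

  async call(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt >= this.resetTimeoutMs) {
        this.state = 'HALF_OPEN';     // let a trial request through (simplified)
      } else {
        throw new Error('Circuit open: failing fast');
      }
    }
    try {
      const result = await this.callService(...args);
      this.failures = 0;
      this.state = 'CLOSED';          // success closes the breaker again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';          // too many failures: stop hammering the service
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage (hypothetical payment client):
// const breaker = new CircuitBreaker((amount) => paymentClient.charge(amount));
// await breaker.call(order.amount);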

Similarly, bulkheads isolate different parts of your system so one failure doesn’t cascade. Imagine a ship with watertight compartments: if one section floods, it doesn’t sink the whole vessel. In system design, this means using separate thread pools for different services, limiting the connections to any one database, or running critical services in isolated containers. When the user authentication service is overloaded, it shouldn’t take down your product catalog.
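
In code, a bulkhead can be as simple as a per-dependency concurrency cap. The sketch below is illustrative: it limits how many calls may be in flight to one downstream service and sheds load once the waiting queue is full, so a slow dependency cannot absorb every available resource.

// A minimal bulkhead sketch: one instance per downstream dependency.
class Bulkhead {
  constructor(maxConcurrent, maxQueued) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
    this.active = 0;
    this.queue = [];
  }

  async run(task) {
    if (this.active >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueued) {
        throw new Error('Bulkhead full: rejecting request'); // shed load instead of piling up
      }
      await new Promise((resolve) => this.queue.push(resolve)); // wait for a slot
    } else {
      this.active += 1; // claim a free slot immediately
    }
    try {
      return await task();
    } finally {
      const next = this.queue.shift();
      if (next) {
        next();              // hand our slot directly to the next waiter
      } else {
        this.active -= 1;    // nobody waiting: release the slot
      }
    }
  }
}

// The inventory service gets its own pool of slots, separate from authentication.
// const inventoryBulkhead = new Bulkhead(10, 50);
// const stock = await inventoryBulkhead.run(() => inventoryClient.getStock(productId));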

Loose Coupling through Asynchronous Communication

Tight coupling often happens when components communicate synchronously and directly. Microservice A calls Microservice B, which calls Microservice C. If C is slow, B blocks, and A waits. Everything depends on everything else.

Event-driven architecture reduces coupling by inserting a message broker between services. Instead of A directly calling B, A publishes an event: “UserSignedUp”. Interested services (B, C, D) subscribe to this event and react independently. A doesn’t need to know about B, C, or D.

graph TB
    A[User Service] -->|Publishes: UserSignedUp| MB[Message Broker]
    MB -->|Subscribes| B[Email Service]
    MB -->|Subscribes| C[Analytics Service]
    MB -->|Subscribes| D[Notification Service]
    B -->|Independent| B1[Send Welcome Email]
    C -->|Independent| C1[Record Signup Event]
    D -->|Independent| D1[Create Welcome Task]

This decoupling has multiple benefits: services can be deployed independently, one slow service doesn’t block others, and you can add new consumers without modifying existing code. The trade-off is eventual consistency: B, C, and D might not react to the event at exactly the same time, and you need to handle failure scenarios where an event is processed multiple times.
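
Handling duplicate delivery usually means making consumers idempotent. The sketch below keeps processed event IDs in an in-memory set purely for illustration; a real consumer would persist them, and the event shape and email function are assumed names.

// A minimal idempotent-consumer sketch (illustrative event shape).
const processedEventIds = new Set(); // in production: a durable store shared by consumer instances

async function handleUserSignedUp(event, sendWelcomeEmail) {
  // A redelivered event is skipped, so processing it twice is harmless.
  if (processedEventIds.has(event.id)) return;

  await sendWelcomeEmail(event.userId);

  // Record the ID only after the work succeeds; a crash before this line means
  // the event is redelivered and retried, which at-least-once delivery expects.
  processedEventIds.add(event.id);
}

// Usage with the event-bus style shown later in this chapter:
// eventBus.subscribe('UserSignedUp', (event) =>
//   handleUserSignedUp(event, (userId) => emailService.sendWelcomeEmail(userId)));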

The Three Pillars of Observability

Neglecting observability is like flying a plane without instruments. You can do it for a short time, but the moment the weather changes, you’re in trouble.

Observability rests on three pillars:

Logs are records of discrete events. “User signed up,” “Payment processed,” “Database query took 500ms.” Logs are great for understanding what happened in a specific scenario, but with millions of events per second, they become overwhelming. You need good log aggregation tools (like the ELK stack) and the discipline to log meaningfully.

Metrics are aggregate numbers over time: request rate, error rate, latency percentiles (p50, p99), CPU usage, memory consumption. Metrics are compact enough to retain for long periods and are perfect for alerting: “If the error rate exceeds 1%, page me.”

Traces follow a single request through your entire system. You instrument your code so that when a user’s request enters your system, you tag it with a unique ID. As the request passes through services, databases, and caches, each component records what it did. At the end, you have a complete picture of the request’s journey. This is invaluable for debugging slow requests or understanding failure cascades.
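
A minimal sketch of that idea: assign an ID at the edge, reuse one if an upstream service already set it, and stamp it on every log line so a request’s journey can be stitched back together. Real systems use tracing libraries such as OpenTelemetry rather than hand-rolled code; the header and field names here are assumptions.

// A minimal trace-ID propagation sketch (illustrative header and field names).
const crypto = require('crypto');

// Structured log line: machine-parseable and always carrying the trace ID.
function log(traceId, message, fields = {}) {
  console.log(JSON.stringify({ time: new Date().toISOString(), traceId, message, ...fields }));
}

async function handleRequest(req) {
  // Reuse the caller's trace ID if one was forwarded; otherwise start a new trace.
  const traceId = req.headers['x-trace-id'] || crypto.randomUUID();
  const start = Date.now();

  log(traceId, 'request received', { path: req.path });

  // Downstream calls forward the same ID so their logs join the same trace, e.g.:
  // await inventoryClient.getStock(productId, { headers: { 'x-trace-id': traceId } });

  log(traceId, 'request completed', { durationMs: Date.now() - start });
}

handleRequest({ headers: {}, path: '/products/42' });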

Most modern platforms use all three. A good starting point: metrics for alerts, logs for investigation, traces for complex issues.

Polyglot Persistence: Right Tool for the Right Job

One-size-fits-all data storage happens because developers often aren’t trained in database selection. There’s a relational database, and it’s ACID-compliant, so it must be the right choice for everything, right? Not quite.

Consider these scenarios:

  • You need a user’s current location updated 1000 times per second: Redis (in-memory cache) is perfect
  • You need to store complex relationships between data entities: PostgreSQL (relational database) excels
  • You need to store semi-structured JSON documents with flexible schemas: MongoDB (document database) fits
  • You need to query relationships in social networks: Neo4j (graph database) is built for this
  • You need to store time-series data like stock prices or sensor readings: InfluxDB (time-series database) is optimized

Choosing the right database for each use case is called polyglot persistence. It requires understanding your data patterns (read-heavy? write-heavy? analytical? transactional?) and matching them to database strengths.
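
As a small illustration of polyglot persistence, the sketch below uses the Node pg and redis clients side by side: durable account data goes to PostgreSQL, while a rapidly changing location goes to Redis with a short TTL. The connection strings, table, and key names are assumptions for the example, and it presumes the users table already exists.

// A polyglot-persistence sketch: relational store for durable data, in-memory store for hot data.
const { Pool } = require('pg');            // npm install pg
const { createClient } = require('redis'); // npm install redis

const pool = new Pool({ connectionString: 'postgres://localhost/app' }); // assumed DSN
const cache = createClient({ url: 'redis://localhost:6379' });           // assumed URL

async function main() {
  await cache.connect();

  // Durable, relational data: user accounts belong in PostgreSQL.
  await pool.query(
    'INSERT INTO users (id, email) VALUES ($1, $2) ON CONFLICT (id) DO NOTHING',
    ['user-42', 'ada@example.com']
  );

  // Hot, constantly overwritten data: current location belongs in Redis, expiring after 60 seconds.
  await cache.set('location:user-42', JSON.stringify({ lat: 52.52, lon: 13.4 }), { EX: 60 });
  const location = JSON.parse(await cache.get('location:user-42'));
  console.log('cached location:', location);

  await cache.quit();
  await pool.end();
}

main().catch(console.error);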

Seeing It in Code and Architecture

Code Example: Tight vs. Loose Coupling

Tightly Coupled (Bad):

class OrderService {
  processOrder(order) {
    // OrderService directly depends on PaymentService
    const paymentService = new PaymentService();
    const result = paymentService.charge(order.amount);

    // And directly on EmailService
    const emailService = new EmailService();
    emailService.sendConfirmation(order.customerId);

    // If PaymentService or EmailService changes, this breaks
    return result;
  }
}

Loosely Coupled (Good):

class OrderService {
  constructor(eventBus) {
    this.eventBus = eventBus;
  }

  processOrder(order) {
    // Just publish an event
    // Other services subscribe independently
    this.eventBus.publish('OrderPlaced', {
      orderId: order.id,
      customerId: order.customerId,
      amount: order.amount
    });

    return { success: true };
  }
}

// Elsewhere, other services subscribe:
eventBus.subscribe('OrderPlaced', (event) => {
  // Payment service handles it
  paymentService.chargeCustomer(event.customerId, event.amount);
});

eventBus.subscribe('OrderPlaced', (event) => {
  // Email service handles it independently
  emailService.sendConfirmation(event.customerId);
});

In the second version, OrderService doesn’t know or care about payment or email. If you add a new service (like an inventory service), you just add a new subscriber. The original code never changes.

Database Selection Guide

Use Case                               | Best Choice             | Why
User accounts, financial records       | PostgreSQL (Relational) | ACID guarantees, complex joins, data integrity
Session data, cache, leaderboards      | Redis (In-memory)       | Extreme speed, perfect for temporary data
Product catalogs, user profiles        | MongoDB (Document)      | Flexible schema, scales horizontally
Social networks, recommendation graphs | Neo4j (Graph)           | Optimized for relationship queries
Metrics, sensor data, stock prices     | InfluxDB (Time-series)  | Optimized for time-based aggregations
Full-text search                       | Elasticsearch (Search)  | Fast text indexing and complex queries

Real Failure Scenario: The Cascading Failure

A real e-commerce company once experienced this: their frontend directly called their inventory microservice to check stock levels. The inventory service was down for 5 minutes due to a database migration. Immediately, the frontend crashed because it received errors. Users trying to browse products got 500 errors. The team received thousands of alerts simultaneously. When the inventory service came back up, the frontend still crashed due to a deployment issue in the restart process. The outage stretched from 5 minutes to 2 hours.

A better design would have:

  1. Cached the inventory data (even if 5 minutes stale) so the frontend still works (see the sketch after this list)
  2. Used a circuit breaker to fail gracefully instead of cascading errors
  3. Logged and traced the failure to identify the root cause in 30 seconds instead of 30 minutes
  4. Isolated the frontend and inventory service using bulkheads (separate thread pools) so an inventory issue doesn’t affect product browsing
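
A sketch of the caching remedy from point 1: wrap the inventory call so the frontend serves slightly stale data when the service is unavailable. The client, TTL, and function names are assumptions.

// A stale-tolerant cache sketch for inventory lookups (illustrative names and TTL).
const stockCache = new Map();   // productId -> { value, fetchedAt }
const FRESH_MS = 5 * 60 * 1000; // serve cached data for up to 5 minutes without refetching

async function getStock(productId, fetchFromInventoryService) {
  const cached = stockCache.get(productId);

  // Fresh enough: skip the network call entirely.
  if (cached && Date.now() - cached.fetchedAt < FRESH_MS) return cached.value;

  try {
    const value = await fetchFromInventoryService(productId);
    stockCache.set(productId, { value, fetchedAt: Date.now() });
    return value;
  } catch (err) {
    // Inventory service is down: prefer stale data over a broken product page.
    if (cached) return cached.value;
    throw err; // nothing cached at all; let the caller degrade gracefully
  }
}

// Usage (hypothetical client):
// const stock = await getStock('sku-123', (id) => inventoryClient.getStock(id));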

When the “Wrong” Choice is Right

Here’s where system design gets nuanced: some “pitfalls” are actually context-dependent choices.

When Premature Optimization Is OK: If you’re building a system where performance is a core requirement from day one (say, a real-time trading platform), you’re not optimizing prematurely—you’re optimizing for the known requirement. The key is the difference between “we know this needs to be fast” versus “it might need to be fast someday.”

When Tight Coupling Is Acceptable: In a monolithic application where all code is deployed together, some tight coupling through direct function calls is fine. The cost of adding a message broker might outweigh the benefits if your entire system deploys as one unit. However, as you grow and want independent services, coupling becomes toxic.

When Polyglot Persistence Is Overkill: If you have one team managing five different databases, you’re increasing operational burden. Sometimes, accepting some suboptimal database choices for a single technology stack is the pragmatic call. “Multiple databases for multiple teams” is often better than “multiple databases for one team.”

The meta-lesson: understand the principle behind each pitfall, and then make conscious trade-offs based on your context. Avoid accidentally falling into these traps due to ignorance. Actively choose simplicity when it serves you.

Key Takeaways

  • Measure before optimizing: Build systems that work, deploy them, find real bottlenecks through instrumentation, then optimize. Premature optimization creates complexity debt with little payoff.
  • Design for failure from day one: Use patterns like circuit breakers and bulkheads. Assume services will be slow or unavailable and handle it gracefully.
  • Prefer loose coupling: Use asynchronous communication and event-driven patterns to decouple services, making them independently deployable and resilient to failures.
  • Invest in observability: Implement logs, metrics, and traces. You cannot improve what you cannot measure, and you cannot debug what you cannot see.
  • Choose the right database for each job: Don’t force all data problems into a single database. Understand your data patterns and match them to database strengths.
  • Start simple, evolve deliberately: Complexity should be added in response to real constraints, not theoretical future needs.

Put It Into Practice

Scenario 1: The Overengineered Startup

You’re designing a booking system for a new online tutoring marketplace. Your co-founder wants to “build it right from the start” with a globally distributed database, advanced caching strategies, and separate microservices for every feature. Your startup has 50 users so far. What pitfalls do you see? What would you recommend instead?

Scenario 2: The Fragile System

Your team has built a video streaming service where a user’s request goes: Frontend → API Server → Content Service → Database → File Storage. There’s no caching, no circuit breakers, and when one component is slow, the entire chain backs up. Your analytics show that the File Storage service is occasionally slow. How would you apply the principles from this chapter to make the system more resilient?

Scenario 3: The Integrated Monolith

Your company has a monolithic application where user registration, billing, and notifications are tightly coupled in a single codebase. When you want to add a new notification channel (SMS), you have to touch five files and run the full test suite. How would you reduce this coupling without breaking up the monolith?

What Comes Next

Now that you understand the pitfalls to avoid, we’re ready to dive into the building blocks that make resilient systems possible. In Chapter 2, we’ll explore the foundational technologies: how networks actually work (and fail), how databases ensure data consistency, how caching accelerates systems, and how load balancing distributes work across multiple machines. These aren’t abstract concepts—they’re the concrete tools you’ll use to implement the principles we’ve discussed here. With these building blocks and pitfall awareness in your mental toolkit, you’ll be equipped to design systems that scale, fail gracefully, and stand the test of time.