Scaling Database R/W
The Database Bottleneck
Your application servers are humming along beautifully. You’ve added load balancers, scaled horizontally, and optimized your application tier. But then reality hits: your database is now processing 50,000 queries per second, CPU utilization is climbing toward the danger zone, and you’re seeing query timeouts. This is the critical inflection point that catches many growing teams off guard. The database, which you carefully optimized in the early days, has become the single most important bottleneck in your system.
We’ve established in previous sections how to scale stateless application servers—they’re almost trivial when you understand the principles. But databases are different. They maintain state, enforce consistency, and coordinate across your entire system. You can’t just spin up another database server and hope for the best. This chapter explores the strategies for pushing through this bottleneck: read replicas to distribute the query load, sharding to distribute data itself, and the careful engineering choices that determine whether these approaches will actually work for your use case.
The goal isn’t perfection—it’s pragmatism. We’ll examine multiple approaches because the right choice depends on your read-to-write ratio, your consistency requirements, and your tolerance for operational complexity.
Understanding Read Replicas
A read replica is a copy of your master database that stays synchronized with the original. When you write to the master, those changes are propagated to all replicas. Clients can then read from any replica, effectively multiplying your read capacity. This works because most production applications have highly skewed read-to-write ratios—often 100:1 or even 1000:1. You might write once but read that same data hundreds of times.
The beauty of read replicas is their simplicity: your application logic barely changes. You route write queries to the master and read queries to replicas. Most database drivers support read replicas natively or through lightweight middleware. From the database’s perspective, each replica is running a continuous background job: it reads the master’s write-ahead log (or binary log) and applies those changes sequentially.
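As a minimal sketch of that split—assuming psycopg2 and hypothetical hostnames for the master and replicas—the routing logic can be as small as this:

```python
import random
import psycopg2  # assumed driver; any DB-API driver works the same way

# Hypothetical hostnames, for illustration only.
MASTER_HOST = "db-master.internal"
REPLICA_HOSTS = ["db-replica-1.internal", "db-replica-2.internal"]

def connect(is_write):
    # Writes always go to the master; reads go to a randomly chosen replica.
    host = MASTER_HOST if is_write else random.choice(REPLICA_HOSTS)
    return psycopg2.connect(host=host, dbname="app", user="app")
```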
However, replicas introduce a new problem: replication lag. The master processes a write at time T, but the replica doesn’t apply that change until T+50ms (or more, depending on network latency and replica load). This means that if you read from a replica immediately after writing, you might get stale data. The inconsistency is often acceptable—users rarely re-read their own write within milliseconds. But for certain operations (like confirming an account creation), you may need to read from the master.
There are two replication models. Synchronous replication waits for acknowledgment from replicas before confirming the write—strong consistency, but write latency grows with replica count. Asynchronous replication confirms the write as soon as the master has applied it—fast writes, but replicas can fall behind. Most systems accept that trade-off and replicate asynchronously.
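In PostgreSQL, for instance, this choice is a configuration knob; a minimal sketch assuming two standbys registered as replica_a and replica_b:

```
# postgresql.conf on the master
synchronous_standby_names = 'FIRST 1 (replica_a, replica_b)'  # wait for one of the two standbys
synchronous_commit = on  # 'local' keeps commits fast and replication asynchronous
```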
The Sharding Problem
Read replicas multiply read capacity, but they don’t help with write capacity. If your writes are overwhelming the master, you need a different approach: sharding (also called horizontal partitioning). Sharding splits your data across multiple independent database servers, each responsible for a subset of records. A users table split into four shards means shard 0 owns user IDs 0-24,999, shard 1 owns 25,000-49,999, and so on. Each shard is a completely independent database. Writes to different shards happen in parallel, multiplying your write throughput.
The cost of sharding is complexity. Queries that span multiple shards (called cross-shard queries) require hitting multiple databases and merging results in your application layer. Rebalancing shards when you add new hardware is non-trivial. Hotspots emerge when one shard receives disproportionate traffic.
The shard key—the value you use to determine which shard owns a record—is critical and often irreversible. A poorly chosen shard key can create severely imbalanced load. If you shard users by geographic region but 90% of your users are in one region, you’ve created a bottleneck. The ideal shard key is high-cardinality (many distinct values), relatively uniform in distribution, and immutable.
Connection Pooling: The Unsung Hero
Before diving into topologies, let’s discuss connection pooling, because it’s essential at scale. Each database connection carries overhead—memory, file descriptors, and context switching. Opening 10,000 connections from 100 application servers creates a thundering herd of connections that the database can’t efficiently manage. Connection pooling proxies (like PgBouncer for PostgreSQL or ProxySQL for MySQL) sit between your application and database, maintaining a small, fixed pool of backend connections and reusing them. Hundreds of application connections can be multiplexed over a few dozen actual database connections, because at any instant most application connections are idle.
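On the client side, a minimal sketch using psycopg2’s built-in pool (PgBouncer would sit server-side and is configured separately; hostname and pool sizes are illustrative):

```python
from psycopg2.pool import ThreadedConnectionPool

# A small, fixed pool shared by all request handlers in this process.
pool = ThreadedConnectionPool(
    minconn=2, maxconn=10,
    host="db-master.internal", dbname="app", user="app",
)

conn = pool.getconn()          # borrow a connection
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
finally:
    pool.putconn(conn)         # return it for reuse instead of closing
```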
Replication and Sharding in Action
Let’s visualize a typical architecture combining read replicas with sharding:
graph TB
App1[App Server 1]
App2[App Server 2]
App3[App Server 3]
WRouter[Write Router]
RRouter[Read Router]
Master0[(Master Shard 0)]
Replica0A[(Replica 0A)]
Replica0B[(Replica 0B)]
Master1[(Master Shard 1)]
Replica1A[(Replica 1A)]
Replica1B[(Replica 1B)]
App1 -->|Write| WRouter
App2 -->|Write| WRouter
App3 -->|Write| WRouter
App1 -->|Read| RRouter
App2 -->|Read| RRouter
App3 -->|Read| RRouter
WRouter -->|shard_key%2==0| Master0
WRouter -->|shard_key%2==1| Master1
RRouter -->|shard_key%2==0| Replica0A
RRouter -->|shard_key%2==0| Replica0B
RRouter -->|shard_key%2==1| Replica1A
RRouter -->|shard_key%2==1| Replica1B
Master0 -->|replicate| Replica0A
Master0 -->|replicate| Replica0B
Master1 -->|replicate| Replica1A
Master1 -->|replicate| Replica1B
This architecture scales reads by routing to any replica and scales writes by distributing across shard masters. Each shard can be independently replicated, giving you both horizontal scaling dimensions.
Sharding Strategies Explained
Hash-based sharding applies a hash function to the shard key: shard_id = hash(user_id) % num_shards. Distribution is uniform and excellent for load balancing, but ranges of IDs don’t map to ranges of shards, making range queries slow. Finding all users with IDs between 100,000 and 200,000 requires querying potentially all shards.
Range-based sharding divides the key space into ranges: shard 0 owns user IDs 1-1,000,000, shard 1 owns 1,000,001-2,000,000. Range queries are efficient, but hotspots emerge if writes aren’t uniformly distributed. If all new users get sequential IDs and new signups are your traffic driver, all writes funnel into the shard containing the highest ID range.
Geographic sharding maps regions or zones to shards, reducing latency for local users and improving compliance for data residency regulations. The downside: global queries spanning regions become expensive cross-shard operations.
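To make the routing difference concrete, here’s a minimal sketch of hash-based versus range-based lookup (shard count and range boundaries are illustrative):

```python
import bisect
import zlib

NUM_SHARDS = 4

def hash_shard(user_id):
    # Stable hash: uniform spread, but neighboring IDs land on different shards.
    return zlib.crc32(str(user_id).encode()) % NUM_SHARDS

# Inclusive upper bound of each range shard.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]

def range_shard(user_id):
    # Neighboring IDs share a shard, so range scans touch few shards.
    return bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
```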
Here’s a comparison:
| Strategy | Load Balance | Range Queries | Hotspot Risk |
|---|---|---|---|
| Hash-based | Excellent | Poor | Low |
| Range-based | Good | Excellent | High |
| Geographic | Good | Poor | Medium |
Real-World Analog: The Library Network
Imagine a large city library system. Initially, one central library (the master) maintains the canonical card catalog. As demand grows, the library hires assistants (read replicas) who memorize sections of the catalog. These assistants help patrons find books without burdening the head librarian. But everyone still asks the head librarian to shelve new books. When even this becomes overwhelming, the city opens branch libraries (shards) in different neighborhoods. Now each branch maintains its own catalog for its section of the city. A neighborhood branch can quickly process requests for its area, but if you need a book that might be in a different branch, you need to check multiple catalogs and coordinate across branches.
Scaling a Social Media Database
Let’s ground this in a concrete example: a social media platform with 100 million users, growing 10% monthly. Your analytics show a 200:1 read-to-write ratio: typical users read their feed dozens of times per hour but post once per day.
Initial problem: Your master database runs on a 128-core machine with 512GB RAM, and you’re hitting 80% CPU at peak time. Queries are slowing. You need to scale reads first because that’s your dominant workload.
Step one: Add read replicas. You provision three replica instances in different availability zones and update your application tier with a read/write split. You route user-feed queries to replicas and user-write operations (posts, likes, comments) to the master. Within days, your master CPU drops to 45% because read load is now distributed. Replication lag of 100-200ms is acceptable for feed reads.
Step two: The shard decision. Six months later, you’re at 300 million users, your write load has tripled, and you’re hitting write capacity limits on the master. Sharding becomes necessary. You analyze your schema and choose user_id as the shard key—high cardinality, uniform, and immutable. You plan for 32 shards, which gives you 7+ years of headroom before resharding is necessary.
Step three: Sharding implementation. You introduce a routing layer that examines the user_id in each write and directs it to the appropriate shard. Each shard itself is replicated 3 ways for redundancy. The trickiest part is migrating existing data: a background job gradually redistributes user records across shards while the system remains online.
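A hedged sketch of that backfill job, assuming `execute`-style handles like the chapter’s other snippets (a `legacy_db` pointing at the old monolithic database, the shard-aware `pool`, and the `get_shard_id` routing function defined just below), with hypothetical column names:

```python
def backfill_users(batch_size=1000):
    last_id = 0
    while True:
        # Walk the legacy table in primary-key order, one batch at a time.
        rows = legacy_db.execute(
            "SELECT id, name, email FROM users WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        for user_id, name, email in rows:
            shard_db = pool.get_master(get_shard_id(user_id))
            # Idempotent insert: safe to re-run if the job crashes mid-batch.
            shard_db.execute(
                "INSERT INTO users (id, name, email) VALUES (%s, %s, %s) "
                "ON CONFLICT (id) DO NOTHING",
                (user_id, name, email),
            )
        last_id = rows[-1][0]
```

Dual-writing new records to both systems during the migration window keeps the copy converging while traffic continues.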
Here’s how you’d wire up your application’s database routing (Python; `pool` is an assumed shard-aware connection pool with `get_master` and `get_replica` methods):

import zlib

def get_shard_id(user_id, num_shards=32):
    # Use a stable hash (CRC32) rather than Python's built-in hash(),
    # which is randomized per process for strings and would route the
    # same user to different shards on different servers.
    return zlib.crc32(str(user_id).encode()) % num_shards

def get_db_connection(user_id, is_write=False):
    shard = get_shard_id(user_id)
    if is_write:
        return pool.get_master(shard)   # writes always hit the shard's master
    return pool.get_replica(shard)      # reads can go to any replica of the shard

# Posting a comment: a write, so route to the master of the user's shard.
db = get_db_connection(user_id, is_write=True)
db.execute(
    "INSERT INTO comments (user_id, post_id, text) VALUES (%s, %s, %s)",
    (user_id, post_id, comment_text),
)

# Reading a user's posts: a read, so any replica of the user's shard will do.
db = get_db_connection(user_id, is_write=False)
db.execute(
    "SELECT * FROM posts WHERE user_id = %s ORDER BY created_at DESC LIMIT 20",
    (user_id,),
)
Now consider a true cross-shard query: “find all comments on post_id=999.” Because comments live on their author’s shard, they’re scattered across every shard, and you’d need to:
- Query each shard for comments matching post_id=999.
- Merge and re-sort the results in your application layer.
(A query scoped to one user, like “comments by user_id=42 on post_id=999,” remains a cheap single-shard lookup.) Or better: maintain a denormalized copy of comments keyed by post_id so a single shard can answer the query; the fan-out it avoids is sketched below.
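A minimal sketch of that fan-out, reusing the assumed `pool` and `execute`-style handles from the snippet above, with the merge done in application code:

```python
NUM_SHARDS = 32  # matches the scenario's shard count

def comments_for_post(post_id, limit=50):
    # Scatter: any user may have commented, so every shard must be asked.
    results = []
    for shard in range(NUM_SHARDS):
        db = pool.get_replica(shard)
        rows = db.execute(
            "SELECT user_id, text, created_at FROM comments "
            "WHERE post_id = %s ORDER BY created_at DESC LIMIT %s",
            (post_id, limit),
        ).fetchall()
        results.extend(rows)
    # Gather: each shard sorted only its own rows, so merge and re-sort here.
    results.sort(key=lambda r: r[2], reverse=True)
    return results[:limit]
```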
Pro tip: Always consider denormalization when facing expensive cross-shard queries. Duplicating data across shards costs storage and increases update complexity but can eliminate costly distributed queries.
Trade-Offs and Reality
Replication lag is real and will bite you. If you read from a replica immediately after writing, you might see stale data. You can mitigate this with read-your-writes consistency (route post-write reads to the master) or by accepting eventual consistency. Choose based on your application’s requirements.
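One hedged sketch of read-your-writes, building on the shard-aware pool from earlier: track each user’s last write and pin their reads to the master for a window longer than your typical replication lag. An in-process dict is shown for brevity; real deployments keep this in the session or a shared store so every app server sees it.

```python
import time

_last_write_at = {}       # user_id -> monotonic time of the user's last write
PIN_WINDOW_SECONDS = 1.0  # assumed to exceed typical replication lag

def record_write(user_id):
    _last_write_at[user_id] = time.monotonic()

def get_read_connection(user_id):
    # Pin this user's reads to the master until replicas have likely caught up.
    last = _last_write_at.get(user_id)
    if last is not None and time.monotonic() - last < PIN_WINDOW_SECONDS:
        return pool.get_master(get_shard_id(user_id))
    return pool.get_replica(get_shard_id(user_id))
```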
Sharding is powerful but complex. Once you shard, reversing the decision is nearly impossible without significant downtime. Don’t shard until you have to. Use replicas and query optimization first. We’ve seen teams shard when they only needed better indexing—wasted effort on a permanently complex system.
Shard key selection is irreversible. A bad choice creates hotspots that cascade into downtime. Spend time on this decision. Analyze your actual traffic patterns, not guesses.
Denormalization becomes essential with sharding. Your normalized schema likely expects to join tables across shards. Instead of normalized design, you’ll denormalize heavily, maintaining multiple copies of data keyed differently. This increases storage and update complexity but eliminates expensive distributed transactions.
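For the comments example, a hedged sketch of that dual-write, with `comments_by_post` as a hypothetical denormalized table. The two inserts are not atomic, so real systems pair this with a transactional outbox or an async repair job:

```python
def post_comment(user_id, post_id, text):
    # Canonical copy on the author's shard (keyed by user_id).
    author_db = pool.get_master(get_shard_id(user_id))
    author_db.execute(
        "INSERT INTO comments (user_id, post_id, text) VALUES (%s, %s, %s)",
        (user_id, post_id, text),
    )
    # Denormalized copy on the post's shard (keyed by post_id), so
    # "all comments on this post" becomes a single-shard read.
    post_db = pool.get_master(get_shard_id(post_id))
    post_db.execute(
        "INSERT INTO comments_by_post (post_id, user_id, text) VALUES (%s, %s, %s)",
        (post_id, user_id, text),
    )
```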
Key Takeaways
- Read replicas multiply read capacity without modifying your write path. Use them before sharding.
- Replication lag is inevitable with asynchronous replication. Choose your consistency model (strong vs. eventual) based on use case.
- Sharding distributes data across independent databases, scaling write throughput at the cost of operational complexity and cross-shard query challenges.
- Your shard key choice is critical and effectively irreversible. Prefer high-cardinality, uniformly distributed, immutable fields.
- Connection pooling is essential at scale. Maintain a fixed pool of database connections to prevent overwhelming the database with connection overhead.
- Denormalization is necessary with sharding. Duplicate data keyed by different fields to avoid expensive cross-shard joins.
Practice Scenarios
Scenario 1: You’re designing a blogging platform with 10 million users. Your analytics show a 50:1 read-to-write ratio. You expect 500K peak concurrent users. Without sharding, estimate whether read replicas alone would suffice. What read-to-write ratio would force you to shard?
Scenario 2: You’re sharding a users table with 500 million records using user_id % num_shards. Your initial choice was 8 shards. Two years later, you’ve grown to capacity and need to reshard to 16 shards. Outline the migration strategy. How would you handle the period where both 8-shard and 16-shard systems exist?
Scenario 3: A social network experiences severe hotspots on celebrity accounts—one user’s profile receives 100,000x more reads than average users. Sharding by user_id creates a massive imbalance. Propose a solution. (Hint: think about what data is accessed most frequently.)
Looking Ahead
We’ve tackled how to scale the database itself through replication and sharding. But an equally important question remains: how do you route traffic efficiently to these scaled resources? The next section explores load distribution patterns—the routing layers, consistent hashing, and geographic distribution that ensure requests find the right shard, replica, and region with minimal latency.