System Design Fundamentals

Fundamentals of Caching

When Your Database Becomes Your Bottleneck

Picture this: your web application is handling 10,000 database queries per second. The infrastructure is solid—you’ve scaled horizontally, distributed the load, optimized the queries. Yet response times keep climbing. Users wait 500ms for a page load that used to take 100ms. You could buy more database servers, but at what cost? More hardware, more complexity, more operational overhead.

Or… you could simply avoid hitting the database for most requests.

That’s caching. It’s one of the most powerful optimizations in system design, and it operates on a deceptively simple principle: if we’ve already computed or retrieved something, don’t do it again. Store it somewhere fast. Return it on demand. In this chapter, we’ll build a solid foundation in how caching works, why it’s essential, and when it matters most. We’ll connect the dots back to the scalability challenges from Chapter 4 and the load balancing strategies from Chapter 5—caching often solves what hardware alone cannot.

By the end of this chapter, you’ll understand cache hits and misses, cache hierarchies, the fundamental patterns that power modern systems, and the real costs of getting caching wrong.

What Is a Cache, Really?

A cache is a fast, temporary storage layer that holds copies of expensive data. Expensive here means computationally costly, I/O intensive, or distant—anything that would slow down a user’s request if we had to fetch or calculate it fresh every time.

Here’s the essential dynamic: when a request arrives, we check the cache first. If the data is there (a cache hit), we return it instantly. If it’s not (a cache miss), we fetch it from the original source, store a copy in the cache, and return it to the user. Next time someone asks for the same data, boom—cache hit, fast response.

The metric that matters most is hit ratio: the percentage of requests served from cache versus total requests. A 90% hit ratio means 9 out of 10 requests skip the slow path entirely. That’s transformative for performance.
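To make the bookkeeping concrete, here is a toy in-memory cache in Python that tracks its own hit ratio. It is a minimal illustration of the check-miss-populate cycle, not a production design; the Cache class and the load callback are invented for this sketch.

class Cache:
    """Tiny in-memory cache that counts its own hits and misses."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.store:           # cache hit: skip the slow path
            self.hits += 1
            return self.store[key]
        self.misses += 1                # cache miss: do the expensive work once
        self.store[key] = load(key)     # ...and keep a copy for next time
        return self.store[key]

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

Ten requests for the same key produce one miss and nine hits: a 90% hit ratio, exactly the kind of number that makes a cache worthwhile.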

Why does this work so reliably? Because of temporal locality and spatial locality. Temporal locality means users tend to request the same data multiple times within a short window—your homepage gets refreshed constantly. Spatial locality means data related to what was just accessed tends to be requested next—when someone views a user profile, they often pull up that user’s posts immediately after. Both patterns favor caching.

There’s also an economic principle at work: the Pareto principle or 80/20 rule. In most systems, roughly 20% of the data serves 80% of requests. Celebrity profiles, trending posts, popular products—a small, hot dataset dominates traffic. Cache that hot 20%, and you’ve addressed the majority of load. The long tail of unique requests will always miss the cache, but they’re infrequent enough to handle with the underlying database.

The best candidates for caching are read-heavy data that rarely changes. User profiles (read often, updated occasionally), product catalogs (read constantly, inventory changes infrequently), computed results (expensive to generate, same for hours), and session data (read frequently, write once) all cache exceptionally well. Conversely, real-time stock prices, live lottery results, or write-heavy financial transactions are poor cache candidates—the data changes too fast, and stale answers cause problems.

The Chef’s Mise en Place

Imagine a busy restaurant kitchen during dinner service. Orders stream in: duck confit, sauce béarnaise, sautéed vegetables, each expected to leave the pass perfectly plated. If the chef chopped every ingredient from scratch for each order, service would grind to a halt. Instead, before service begins, the chef prepares: onions are diced, herbs are minced, stocks are warming, sauces are prepped. This is mise en place—“everything in its place.”

The prep station is a cache. The diced onions are cached data. When an order arrives, pulling pre-cut onions is a cache hit; chopping them from scratch is a miss. And just as stale, oxidized onions ruin a dish, a stale cache serving outdated information can break user trust. The chef must refresh ingredients—discard old prep, prepare fresh. Same with caches: they must expire and refresh.

The beauty of mise en place is efficiency. The chef doesn’t waste time during peak service. Neither should your system. Caching is that prep work, done before the rush.

How Caching Works in Modern Systems

Caching operates at every layer of computing. Your CPU has L1 and L2 caches (nanoseconds). RAM acts as a cache for your SSD (hundreds of nanoseconds versus microseconds). Your SSD can cache data that would otherwise come over the network (microseconds versus milliseconds). And application-level caches like Redis or Memcached sit between your web server and database. Each layer trades capacity for speed.

Let’s look at realistic latency numbers—these are critical benchmarks:

Storage Layer        Latency     Use Case
L1 CPU Cache         4 ns        Immediate CPU access
L2 CPU Cache         10 ns       Local operations
RAM                  100 ns      Application working set
SSD                  1-10 μs     Database working set
HDD                  10 ms       Cold storage
Network (local)      1 ms        Peer services
Network (internet)   50-100 ms   Remote APIs

That’s seven orders of magnitude between an L1 cache access and an internet round trip. A single 5ms database round trip is time in which a 3 GHz core could have executed roughly fifteen million cycles of work. This is why caching between the application and the database is so impactful.

The dominant pattern is cache-aside, also called lazy loading. Here’s how it works:

  1. Request arrives for data (e.g., “get user 42’s profile”)
  2. Check cache: is user 42 cached?
  3. If yes: return cached data (cache hit)
  4. If no: fetch from database, store in cache, return to user (cache miss)
  5. Next request for user 42 hits the cache

graph TD
    A[Request: User 42] --> B{Cache Contains?}
    B -->|Hit| C[Return Cached Data]
    B -->|Miss| D[Query Database]
    D --> E[Store in Cache]
    E --> F[Return Data]
    C --> G[Response]
    F --> G

Technologies like Redis and Memcached implement this at scale. Redis is an in-memory data store—it holds everything in RAM, trading memory for speed. Memcached is similar but simpler. Both offer fast lookups (microseconds), expiration policies (entries are automatically evicted once their TTL elapses), and distributed operation (data is partitioned across multiple machines).
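In code, the cache-aside pattern from the steps above is only a few lines. The sketch below uses Python with the redis-py client; the get_user_profile and fetch_user_from_db names, the key format, and the one-hour TTL are illustrative choices, not prescriptions.

import json
import redis  # redis-py; assumes a Redis server running on localhost

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL = 3600  # one hour, in seconds

def fetch_user_from_db(user_id):
    # Placeholder for the real database query.
    return {"id": user_id, "name": "..."}

def get_user_profile(user_id):
    key = f"user:{user_id}"
    cached = r.get(key)                            # check the cache first
    if cached is not None:
        return json.loads(cached)                  # cache hit
    profile = fetch_user_from_db(user_id)          # cache miss: slow path
    r.setex(key, CACHE_TTL, json.dumps(profile))   # populate for next time
    return profile

The TTL bounds how stale a profile can get; we will return to that trade-off shortly.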

One advanced technique is cache warming: pre-loading the cache with frequently accessed data before serving traffic. Instead of waiting for the first user to trigger cache misses, you seed the cache on startup. This smooths out the performance curve.
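Continuing the sketch above (same r, CACHE_TTL, and fetch_user_from_db), a warming job can be as simple as looping over the hottest keys before the service takes traffic; the top_user_ids input is a hypothetical list pulled from your own analytics.

def warm_cache(top_user_ids):
    # Seed the cache with the hottest profiles before serving traffic.
    for user_id in top_user_ids:
        key = f"user:{user_id}"
        if r.get(key) is None:  # don't clobber entries that are already warm
            profile = fetch_user_from_db(user_id)
            r.setex(key, CACHE_TTL, json.dumps(profile))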

But there’s a dark pattern lurking here: the thundering herd, also called a cache stampede. Imagine the cache entry for a very popular item expires, say a trending celebrity profile that everyone is loading at once. Suddenly, 1,000 concurrent requests miss the cache and query the database for the same data. The database gets hammered by redundant queries, slows down, and potentially fails. One mitigation is probabilistic early expiration: each request has a small, growing chance of refreshing the entry shortly before its true TTL, so the recomputation happens once, ahead of the deadline, instead of being triggered by a wall of simultaneous misses. Another is locking: on a miss, one thread acquires a lock and refreshes the entry while the others wait (or serve the slightly stale copy), as sketched below.
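One way to sketch the locking mitigation is with Redis itself: SET with NX (set-if-not-exists) and a short expiry acts as the lock, so a single caller recomputes while the rest back off briefly. The function below reuses the r connection from the earlier sketch and is illustrative only; the timings and retry count are arbitrary.

import time

def get_with_lock(key, recompute, ttl=3600, lock_ttl=10, retries=20):
    for _ in range(retries):
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        # SET NX is atomic: exactly one caller acquires the lock and recomputes.
        if r.set(f"lock:{key}", "1", nx=True, ex=lock_ttl):
            try:
                value = recompute()                    # expensive path, done once
                r.setex(key, ttl, json.dumps(value))
                return value
            finally:
                r.delete(f"lock:{key}")
        time.sleep(0.05)  # losers wait briefly, then re-check the cache
    return recompute()    # last resort if the lock holder is slow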

Caching in Practice: From Theory to Code

Let’s build intuition with a real scenario. You’re building a web API for a social media platform. Users request their homepage feed thousands of times per second. Computing a feed is expensive—sorting posts, filtering, personalizing, fetching user metadata. Compute time: 200ms.

Without caching:

GET /api/user/alice/feed
 → Compute feed: 200ms
 → Database queries: 100ms
 → Total: ~300ms per request

With Redis caching (cache-aside pattern):

GET /api/user/alice/feed
 → Check Redis: "feed:alice:daily"
 → Cache hit? Return instantly (1ms)
 → Cache miss? Compute feed (200ms) + store in Redis with TTL (1 hour) + return

Assume Alice requests her feed 10 times per day. Without the cache: 10 × 300ms = 3,000ms of work. With the cache: the first request misses and still costs ~300ms, but the remaining 9 are 1ms hits, for roughly 309ms total. Overall that is close to a 10× reduction in work, and every cached request is about 300 times faster than the uncached path.

Scale to millions of users, and the math becomes staggering. A 90% cache hit ratio cuts database read load by 90%: a single cache layer absorbs the traffic you would otherwise need roughly nine additional database replicas to handle.

Real-world example: Twitter caches user timelines aggressively. When you open Twitter, you’re seeing a pre-computed, cached timeline. When someone posts, Twitter invalidates related caches—followers’ timelines, user profile cache, trending topics cache. This approach achieves sub-100ms response times at global scale, serving billions of timeline requests daily.
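The invalidation half of that story is conceptually simple: when a write lands, delete the cache entries it makes stale so the next read rebuilds them. A simplified sketch in the same style, not Twitter’s actual code; save_post_to_db and get_follower_ids are hypothetical helpers.

def publish_post(author_id, post):
    save_post_to_db(author_id, post)   # hypothetical write path

    # Delete every cache entry the write makes stale; the next read for
    # each key misses and repopulates from the database.
    stale_keys = [f"profile:{author_id}"]
    stale_keys += [f"feed:{follower_id}:daily"
                   for follower_id in get_follower_ids(author_id)]
    r.delete(*stale_keys)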

The Costs of Caching

Caching isn’t free. You’re trading memory for speed. Redis instances consume RAM—lots of it. A billion-key dataset at 100 bytes per key is 100GB of RAM. That’s not cheap.

There’s also the staleness problem. If your cache holds data for one hour, and that data changes after 30 minutes, some users get stale answers. For an e-commerce site showing inventory, serving an old count can lead to overselling. For stock prices, yesterday’s data is worthless. You must carefully balance TTL (time-to-live) against accuracy requirements.

Complexity compounds. Now your system has two sources of truth: the cache and the database. They can fall out of sync. If the database updates but the cache isn’t invalidated, users see old data. If the cache stores incorrect data, it spreads that error to thousands of requests. Debugging a cache corruption issue is painful—by the time you notice, millions of requests have been served lies.

The cache stampede problem mentioned earlier is real and brutal. Popular items expiring simultaneously can cause cascading failures if not handled carefully.

When should you not cache? Write-heavy workloads: if the data changes every second, the cache serves far more misses than hits. Unique data: if every user’s request is different, cache hits are rare. Real-time requirements: medical sensors, financial trading, and disaster alerts can’t afford stale information. And privacy-sensitive data must be cached carefully; a leaked cache can expose PII.

Key Takeaways

  • Caching stores computed or fetched data for fast retrieval, eliminating redundant work. Cache hits serve data in microseconds; misses require the slow path but are opportunities to populate the cache.
  • Hit ratio drives value. A 90% hit ratio dramatically reduces load on origin systems. The Pareto principle suggests focusing on the hot 20% of data that generates 80% of requests.
  • Temporal and spatial locality make caching effective. Users revisit the same data and request related data, creating natural cache patterns.
  • Cache-aside is the dominant pattern. Check cache first; on miss, fetch from origin, store, return. Simple and powerful.
  • Memory and staleness are the core trade-offs. You must tune TTL, handle cache invalidation, and watch for stampedes.
  • Not everything should be cached. Write-heavy, unique, or real-time data often shouldn’t enter a cache.

Practice Scenarios

Scenario 1: Inventory Disaster. You cache product inventory in Redis with a 5-minute TTL. During a flash sale, thousands of cache misses hit your database simultaneously (the stampede problem), and the database locks up. Devise two strategies to prevent this: one using probabilistic early expiration, one using cache locks.

Scenario 2: Cold Start. Your mobile app launches and immediately requests 100,000 unique user profiles. The cache is empty (a cold cache), so the hit ratio is 0%. Design a cache warming strategy to pre-load hot profiles before serving traffic, and explain how you’d identify which profiles are “hot.”

Looking Ahead

Caching at the application layer (Redis, Memcached) solves immediate bottlenecks, but systems benefit from layered approaches. Your web servers cache responses. CDNs cache content at edge locations. Browsers cache static assets. In the next chapter, we’ll explore cache hierarchies—how different caching layers collaborate, when to cache at each tier, and how to reason about consistency when data flows through multiple caches. You’ll see that a well-designed cache hierarchy can reduce origin database load by 99%, transforming response times from 500ms to under 50ms globally.