System Design Fundamentals

Cache Consistency

The Problem That Never Goes Away

“There are only two hard things in computer science: cache invalidation and naming things.” Phil Karlton’s famous quote captures a challenge that has humbled systems engineers for decades. It sounds simple: store data in a cache to serve requests faster. But what happens when that data changes? Your cache becomes a source of truth that diverges from reality, showing stale information to your users.

Imagine a ride-sharing app displaying outdated prices, an e-commerce site showing out-of-stock inventory as available, or a banking system showing yesterday’s account balance. These aren’t theoretical problems—they’re the direct consequences of inconsistent caches. The gap between what’s in your cache and what’s actually true in your system directly impacts data accuracy, user trust, and business outcomes.

This chapter explores how we detect when cached data becomes stale, trigger updates, and maintain acceptable levels of consistency without sacrificing the performance benefits that caches provide. You’ll learn strategies ranging from simple time-based expiration to sophisticated event-driven systems that keep caches synchronized in real time.

Understanding Cache Invalidation

Cache invalidation is the process of removing, refreshing, or marking cached data as stale when the underlying source data changes. Without a deliberate invalidation strategy, your cache accumulates increasingly stale information until it becomes worse than having no cache at all.

There are two fundamental approaches to invalidation: time-based (TTL) and event-driven. Time-based invalidation assumes that data becomes less valuable over time, so we set an expiration clock on each cached item. When a product listing is cached with a 5-minute TTL, we trust that 5 minutes is a reasonable window for users to see slightly outdated information. Event-driven invalidation, by contrast, reacts directly to source data changes—when inventory updates, we immediately invalidate the cache for that product, so the next request sees fresh data.

The trade-off is immediate: TTL is simple to implement but accepts staleness. Event-driven invalidation is more complex but delivers better freshness. Most robust systems use both strategies, applying TTL as a safety net and event-driven invalidation as the primary mechanism.

When we invalidate, we choose between two actions: purge (remove the entry entirely) or refresh (update it with new data). Purging shifts the cost to the next request—that client experiences a cache miss and loads fresh data, which takes longer but guarantees freshness. Refreshing happens proactively, so the next request hits a cache with current data, but we pay the cost of fetching and updating immediately.

Cache consistency models define how much staleness we’ll tolerate. Strong consistency means every read sees the most recent write—typically requiring synchronous invalidation that blocks requests. Eventual consistency allows temporary divergence; data converges toward truth eventually. Bounded staleness sits in the middle: we guarantee that cached data is no older than a specific threshold (perhaps 30 seconds), combining reasonable freshness with acceptable performance.
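
To make bounded staleness concrete, here’s a minimal sketch of a bounded-staleness read, assuming each cache entry stores a cachedAt timestamp alongside its value (the entry shape and helper names are illustrative):

// Bounded staleness: serve cached data only if it is younger than a threshold
const MAX_STALENESS_MS = 30 * 1000; // 30-second staleness bound

async function getWithBoundedStaleness(key, loadFromSource) {
  const entry = await cache.get(key); // assumed shape: { value, cachedAt }

  if (entry && Date.now() - entry.cachedAt < MAX_STALENESS_MS) {
    return entry.value; // within the guaranteed staleness bound
  }

  // Entry is missing or too old: reload from the source and re-stamp it
  const value = await loadFromSource(key);
  await cache.set(key, { value, cachedAt: Date.now() });
  return value;
}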

The pattern you choose also depends on your access method. In cache-aside patterns, your application checks the cache, handles misses, and decides when to update. Invalidation is your responsibility. In read-through patterns, a cache provider handles misses, making invalidation more centralized. In write-through patterns, you write to cache and database together, making invalidation immediate but requiring coordination.
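
To make the cache-aside responsibility concrete, here’s a minimal sketch; database.getProduct and the 10-minute TTL are illustrative assumptions:

// Cache-aside: the application checks the cache, handles misses,
// and owns invalidation on writes
async function getProduct(productId) {
  const key = `product:${productId}`;

  const cached = await cache.get(key);
  if (cached) return cached; // hit: serve from cache

  // Miss: load from the source of truth and populate the cache
  const product = await database.getProduct(productId);
  await cache.set(key, product, { EX: 600 });
  return product;
}

async function updateProduct(productId, changes) {
  const product = await database.updateProduct(productId, changes);
  await cache.delete(`product:${productId}`); // invalidation is our job
  return product;
}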

The Restaurant Menu Board Analogy

Picture a busy restaurant with a printed menu board near the entrance. Throughout the day, the kitchen runs out of popular dishes, adds specials, and adjusts portions, so the board drifts out of date. The fix is simple: someone updates the board whenever inventory changes, using a small marker to cross out unavailable items or add new ones.

Without this system, customers order unavailable food, the kitchen wastes time saying “we’re out,” and diners grow frustrated. The cost of maintaining the menu board is minimal compared to the damage of showing outdated information.

Your cache is that menu board. The kitchen is your database. Your customers are application requests. When data in your database changes, you need a system—whether coordinated updates or time-based expiration—to reflect those changes in your cache before users see stale information. Just as a restaurant can’t wait hours to update its menu, high-demand systems can’t tolerate long cache consistency gaps.

Invalidation Strategies in Practice

Time-based invalidation (TTL) is the simplest strategy and requires no external coordination. You set an expiration time on each cached item, and the cache automatically discards it when the timer expires. A product page cached with a 10-minute TTL guarantees that users see updated information within 10 minutes of any change. The cost is a staleness window: for up to 10 minutes, you might display outdated information. TTL works well for data that changes slowly or where slight staleness is acceptable (trending topics, weather data, non-critical UI elements).

// TTL-based caching in Redis
cache.set('product:123', productData, {
  EX: 600  // Expires in 600 seconds (10 minutes)
});

Event-driven invalidation reacts directly to source changes using publish-subscribe messaging. When data changes in your database, an event is published. Your cache service subscribes to these events and invalidates relevant entries immediately. This approach delivers near-instant consistency but requires infrastructure: event brokers (Kafka, RabbitMQ), change data capture from your database, and event handlers in your cache layer.

// Invalidate when a product updates (choose ONE of the two options)
database.on('product-updated', async (productId, newData) => {
  // Option 1: Purge - the next request misses and loads fresh data
  await cache.delete(`product:${productId}`);

  // Option 2 (alternative): Refresh - update the cache proactively
  // await cache.set(`product:${productId}`, newData);

  // Notify subscribers (e.g., other cache nodes or services)
  await eventBus.publish('cache:invalidated', {
    key: `product:${productId}`
  });
});

Version-based invalidation uses ETags or version numbers to detect staleness. Your cache stores both data and a version identifier. When data is modified, the version changes. Clients or cache systems detect this mismatch and know the cache is invalid. This works particularly well for distributed caches where not all nodes receive invalidation events immediately.
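
A minimal sketch of version checking, assuming the source of truth exposes a cheap version (or ETag) lookup; database.getProductVersion is a hypothetical helper:

// Version-based invalidation: a cached entry is valid only if its
// version still matches the source of truth
async function getProductVersioned(productId) {
  const key = `product:${productId}`;
  const cached = await cache.get(key); // assumed shape: { version, data }

  // Cheap version lookup against the source of truth
  const currentVersion = await database.getProductVersion(productId);

  if (cached && cached.version === currentVersion) {
    return cached.data; // provably fresh
  }

  // Miss or version mismatch: reload and store the data with its version
  const data = await database.getProduct(productId);
  await cache.set(key, { version: currentVersion, data });
  return data;
}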

Cache stampede prevention is critical when many requests hit expired data simultaneously. Imagine a popular product page with a 5-minute TTL. When it expires, thousands of concurrent requests all miss the cache and hammer your database for fresh data. Prevention strategies include:

  • Locking: The first miss acquires a lock and loads fresh data while other requests wait for the result (see the locking sketch below)
  • Probabilistic early expiration: Refresh data before the TTL expires with some randomized probability, spreading refreshes over time rather than clustering them at expiration (sketched below)
  • Stale-while-revalidate: Serve stale data while fetching fresh data in the background

// Probabilistic early expiration
async function getProductWithEarlyRefresh(productId) {
  const key = `product:${productId}`;
  const cached = await cache.get(key);

  if (!cached) {
    // Cache miss: load fresh data and populate the cache
    const fresh = await loadFromDatabase(productId);
    await cache.set(key, fresh, { EX: 300 });
    return fresh;
  }

  // 5% chance of refreshing in the background before the TTL
  // actually expires, spreading refreshes over time
  if (Math.random() < 0.05) {
    loadFromDatabase(productId)
      .then(fresh => cache.set(key, fresh, { EX: 300 }));
  }

  return cached; // Serve the cached value immediately
}
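
The locking strategy can be sketched in a similar style, using an atomic set-if-absent operation as a short-lived lock; the lock key, timeout, and retry delay here are illustrative assumptions:

// Locking: only the first miss rebuilds the entry; others retry briefly
async function getProductWithLock(productId) {
  const key = `product:${productId}`;
  const cached = await cache.get(key);
  if (cached) return cached;

  // Try to acquire a short-lived lock via set-if-absent (NX)
  // so only one caller rebuilds the expired entry
  const lockAcquired = await cache.set(`lock:${key}`, '1', { NX: true, EX: 10 });

  if (lockAcquired) {
    try {
      const fresh = await loadFromDatabase(productId);
      await cache.set(key, fresh, { EX: 300 });
      return fresh;
    } finally {
      await cache.delete(`lock:${key}`);
    }
  }

  // Another caller holds the lock: wait briefly, then re-check the cache
  await new Promise(resolve => setTimeout(resolve, 100));
  return getProductWithLock(productId);
}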

Database Change Data Capture (CDC) enables sophisticated invalidation. CDC systems capture all changes to your database (inserts, updates, deletes) as a stream. These changes feed into invalidation pipelines that update or purge caches automatically. This approach scales well because it decouples your application code from invalidation logic—the cache layer subscribes to CDC streams and self-maintains consistency.

Here’s how event-driven invalidation flows in a distributed system:

graph LR
    A[Database Change] --> B[CDC/Replication Log]
    B --> C[Event Bus<br/>Kafka/RabbitMQ]
    C --> D[Invalidation Service]
    D --> E[Cache Layer]
    E --> F[Updated Cache]
    C --> G[Application Events]
    G --> H[Client Notification]
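
As a sketch of the invalidation service’s side of this flow, here’s a hypothetical handler for a CDC stream; the topic name, event shape (op, before, after), and key mapping are assumptions modeled on Debezium-style change events:

// CDC consumer: map raw database change events to cache actions
// (event shape modeled on Debezium-style messages; adjust to your CDC tool)
eventBus.subscribe('cdc.inventory.products', async (change) => {
  const { op, before, after } = change; // 'c' = create, 'u' = update, 'd' = delete
  const row = after ?? before;
  const key = `product:${row.id}`;

  if (op === 'd') {
    await cache.delete(key); // row deleted: purge the entry
  } else {
    await cache.set(key, after, { EX: 600 }); // refresh with the new row image
  }
});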

Implementing Pub/Sub Cache Invalidation

Let’s walk through a practical implementation of event-driven cache invalidation using Redis and a message broker:

// Cache invalidation service
class CacheInvalidationService {
  constructor(cache, eventBus) {
    this.cache = cache;
    this.eventBus = eventBus;

    // Subscribe to data change events
    this.eventBus.subscribe('product:updated',
      this.handleProductUpdate.bind(this));
    this.eventBus.subscribe('inventory:changed',
      this.handleInventoryChange.bind(this));
  }

  async handleProductUpdate(event) {
    const { productId, data } = event;

    // Invalidate related cache entries
    const keysToInvalidate = [
      `product:${productId}`,
      `product:${productId}:details`,
      `category:${data.categoryId}:products`
    ];

    // Batch the deletions into a single round trip
    const pipeline = this.cache.pipeline();
    keysToInvalidate.forEach(key => pipeline.del(key));
    await pipeline.exec();
  }

  async handleInventoryChange(event) {
    const { productId, newStock } = event;

    // For inventory, we refresh instead of purging so the next
    // request sees accurate stock without paying for a cache miss
    // (loadProduct is an assumed helper that reads from the database)
    const product = await this.loadProduct(productId);
    if (product) {
      product.stock = newStock;
      await this.cache.set(
        `product:${productId}`,
        product,
        { EX: 600 }
      );
    }
  }
}

// E-commerce price update scenario
async function updateProductPrice(productId, newPrice) {
  // Update source of truth
  const product = await database.updateProduct(productId, {
    price: newPrice
  });

  // Publish invalidation event
  await eventBus.publish('product:updated', {
    productId,
    data: product,
    timestamp: Date.now()
  });

  // Return immediately - invalidation happens asynchronously
  return product;
}

This approach ensures that:

  1. Database updates trigger invalidation events automatically
  2. Multiple cache entries related to the same entity are invalidated together
  3. Different data types use different strategies (purge vs. refresh)
  4. Application code stays simple and focused on business logic

Trade-offs and Practical Considerations

The freshness-performance spectrum defines your invalidation choices. At one extreme, strong consistency (always fresh data) requires synchronous coordination between cache and database, adding latency to every request. At the other, long TTLs (good performance) accept staleness. Most systems operate in the middle ground: event-driven invalidation provides near-instant consistency while event processing happens asynchronously, so your request path stays fast.

Implementing event-driven invalidation adds significant complexity. You need infrastructure (message brokers, CDC pipelines, event processors), operational overhead (monitoring event lag, handling broker failures), and careful coordination between your database, cache, and event system. It’s valuable when consistency is critical—financial systems, inventory counts, real-time analytics—but may be overkill for non-critical data like user preferences or trending lists.

TTL selection requires careful judgment. A 30-second TTL on product prices might be adequate for a consumer site but dangerous for a stock trading platform. A 24-hour TTL on user profiles serves stale data for hours when profiles are updated frequently. The right TTL depends on your data’s volatility, your consistency requirements, and what “slightly wrong” data costs your business.

During database migrations or major updates, your invalidation strategy becomes critical. If you’re moving data between systems, you need temporary dual-write invalidation—changes go to both old and new systems, and you invalidate caches for both. Gradual migrations require coordinated cache invalidation to prevent serving data from the old system after the migration is complete.
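
A minimal sketch of dual invalidation during such a migration window, assuming separate database and cache clients for the old and new systems (all of the client names here are illustrative):

// During the migration window, write to and invalidate both systems
async function updateProductDuringMigration(productId, changes) {
  // Dual-write: the change goes to both the old and new databases
  const product = await oldDatabase.updateProduct(productId, changes);
  await newDatabase.updateProduct(productId, changes);

  // Dual-invalidate: purge the entry from both cache layers so
  // neither system can serve pre-migration data
  await Promise.all([
    oldCache.delete(`product:${productId}`),
    newCache.delete(`product:${productId}`)
  ]);

  return product;
}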

Key Takeaways

  • Cache invalidation is unavoidable: Every caching strategy must address how data becomes stale and how you detect/respond to that staleness
  • Time-based (TTL) and event-driven approaches complement each other: TTL provides a safety net; event-driven invalidation delivers freshness
  • Cache stampede is real: Thousands of concurrent requests hitting expired cache keys can overwhelm your database; use locking or probabilistic early expiration to prevent it
  • Eventual consistency is often sufficient: You rarely need every read to see the latest write; bounded staleness (data under 30 seconds old) satisfies most use cases
  • Your invalidation strategy should match your consistency needs: Non-critical data can tolerate long TTLs; critical data needs event-driven invalidation
  • Implementation complexity grows with sophistication: TTL is simple; event-driven invalidation requires event infrastructure and operational discipline

Practice Scenarios

Scenario 1: Flash Sale Disaster You run an e-commerce platform where product prices are cached with a 5-minute TTL. At noon, you announce a flash sale with 70% discounts. You update the database, but thousands of requests still see old prices for 5 minutes, resulting in revenue loss. Design an invalidation strategy that immediately reflects price changes for flash sales without refreshing all 10 million product cache entries.

Scenario 2: Inventory Consistency Your inventory system shows real-time stock levels in both your website and mobile app. Caches are distributed across three data centers. When inventory updates in your database, you need to ensure all cache replicas across all data centers show consistent numbers within 1 second. How would you implement this?

What’s Next

We’ve mastered the defensive side of caching—how to keep cached data fresh and consistent. But caching strategies go deeper: different patterns for writing to caches (write-through, write-back, write-around) create different consistency and performance profiles. In the next section, we’ll explore how the choice of where you write your data dramatically changes how your entire system behaves under load.