Rate Limiter Design
The Problem
Imagine you’re running a platform serving 10 million users. Without any safeguards, a single misbehaving client—whether intentional or accidental—could hammer your API with millions of requests, consuming all your resources and degrading service for everyone else. This is where rate limiting comes in.
We need to design a distributed rate limiter that can throttle API requests fairly and efficiently. Think of it like a bouncer at a club: they track how often you’ve been coming in and decide whether to let you through. But doing this at scale, across hundreds of servers, with sub-millisecond response times, is a fascinating engineering challenge.
Requirements Gathering
Let’s nail down what we’re building:
Functional Requirements:
- Limit requests by user ID, IP address, or API key
- Configurable rate limits per endpoint (e.g., 100 requests/minute for user signup, 1000/minute for search)
- Return HTTP 429 (Too Many Requests) with a Retry-After header when the limit is exceeded
- Support multiple time windows: per second, per minute, per hour, per day
- Support different limit tiers (free users get 1K/day, premium get 10K/day)
Non-Functional Requirements:
- Latency overhead per request: under 1 millisecond
- Throughput: support 1M requests/second
- High availability: rate limiting system failure shouldn’t bring down the API
- Accuracy in a distributed environment: counting must be accurate even with multiple API servers
- Graceful degradation: if Redis is down, we might be slightly less accurate but still functional
Rate Limiting Algorithms
Before we architect, let’s understand the algorithms. Each has trade-offs:
Token Bucket (Most Common)
Tokens refill at a fixed rate; each request consumes one token. If no tokens remain, reject the request.
Tokens = min(Capacity, Tokens + RefillRate * TimeSinceLastRequest)
If Tokens ≥ 1:
    Tokens -= 1
    Allow request
Else:
    Reject request
Advantages: Handles bursts (if a user doesn’t make requests for a while, tokens accumulate), smooth rate limiting, easy to configure.
Disadvantages: More complex to implement at scale, requires state tracking.
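To make the pseudocode concrete, here is a minimal single-process sketch in Python. It is illustrative only; the distributed version later in this section keeps the same state in Redis instead of instance attributes.

```python
import time

class TokenBucket:
    """Single-process token bucket; a sketch, not the distributed design."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 100 requests/minute means refilling at 100/60 tokens per second
bucket = TokenBucket(capacity=100, refill_rate=100 / 60)
print(bucket.allow())  # True while tokens remain
```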
Leaky Bucket
Requests are queued and processed at a fixed rate, like water leaking from a bucket.
If Queue.size < Capacity:
    Queue.add(request)
    Process at fixed rate
Else:
    Drop request
Advantages: Smooth traffic, predictable output rate.
Disadvantages: Requests must wait in queue (higher latency), less flexible for bursty workloads.
Fixed Window Counter
Count requests in fixed time windows. Simple but has a boundary burst problem: if a user makes 100 requests in the last second of window 1 and 100 in the first second of window 2, they’ve doubled their rate.
Sliding Window Log
Track exact timestamp of every request. Accurate but memory-intensive (you’re storing timestamps for millions of requests).
Sliding Window Counter (Hybrid)
Use a weighted average of the current and previous window to smooth out boundary issues. Good balance of accuracy and efficiency.
For this design, we’ll implement Token Bucket using Redis for its simplicity and suitability for bursty traffic patterns.
High-Level Architecture
graph LR
Client["Client"]
LB["Load Balancer"]
RL["Rate Limiter Middleware"]
App["Application"]
Redis["Redis Cluster"]
Client -->|HTTP Request| LB
LB --> RL
RL -->|Check Counter| Redis
RL -->|Forward if Allowed| App
App -->|Response| RL
RL -->|Response| LB
LB -->|Response| Client
Here’s the flow:
- Client makes request to the API
- Load balancer routes to an API server
- Rate limiter middleware intercepts the request
- The middleware queries Redis for the current token count for this user
- Redis atomically consumes a token or reports that none remain
- The request passes to the application if allowed; otherwise the middleware returns 429
The rate limit rules (user quotas, time windows) are stored in a config service or fetched from Redis.
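As a sketch of steps 4 through 6, here is a framework-agnostic middleware in Python. The request and response are plain dicts to keep the example self-contained, and limiter.check is a stand-in for the Redis-backed token bucket developed in the next section; all of these names are illustrative.

```python
def rate_limit_middleware(request: dict, app, limiter) -> dict:
    """Check the limit before forwarding the request to the application."""
    key = request.get("api_key") or request["client_ip"]
    allowed, remaining = limiter.check(key, request["path"])
    if not allowed:
        return {"status": 429,
                "headers": {"Retry-After": "60"},
                "body": "Too Many Requests"}
    response = app(request)  # forward to the application
    response["headers"]["X-RateLimit-Remaining"] = str(int(remaining))
    return response
```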
Deep Dive: Implementation with Redis
Let’s implement token bucket in Redis. Here’s the challenge: we need to do this atomically to avoid race conditions.
The Algorithm
-- Lua script for atomic token bucket rate limiting
local key = KEYS[1]                   -- "rate_limit:user:123:minute"
local refill_rate = tonumber(ARGV[1]) -- e.g. 100/60 tokens per second
local now = tonumber(ARGV[2])         -- current timestamp in seconds
local capacity = tonumber(ARGV[3])    -- max tokens to accumulate
-- Get current state (HMGET returns values in the order requested,
-- unlike HGETALL, whose field order is not guaranteed)
local state = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now
-- Add tokens accumulated since the last refill
local time_passed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + time_passed * refill_rate)
-- Try to consume a token
if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600) -- clean up idle buckets after 1 hour
    return {1, tokens} -- {allowed, remaining_tokens}
else
    return {0, tokens} -- {denied, remaining_tokens}
end
Why Lua? Redis executes Lua scripts atomically. Without atomicity, two concurrent requests could both read the same token count, and both get through even if there’s only 1 token left.
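For reference, here is how an application might load and invoke that script with the redis-py client. This is a sketch: TOKEN_BUCKET_LUA is assumed to hold the script above, and the key naming and argument order follow its comments.

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379)

# TOKEN_BUCKET_LUA is assumed to contain the Lua script shown above
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def check_rate_limit(user_id: str, capacity: int = 100,
                     refill_rate: float = 100 / 60) -> tuple[bool, int]:
    """Returns (allowed, remaining_tokens) for one request."""
    allowed, remaining = token_bucket(
        keys=[f"rate_limit:user:{user_id}:minute"],
        args=[refill_rate, time.time(), capacity],
    )
    return bool(allowed), remaining
```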
Response Headers
When the rate limit is exceeded, we return helpful headers:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1645123860
Retry-After: 45
- X-RateLimit-Limit: total requests allowed in the window
- X-RateLimit-Remaining: tokens left
- X-RateLimit-Reset: Unix timestamp when the limit resets
- Retry-After: seconds to wait before retrying
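With the token bucket, the Retry-After value falls out of the bucket state: the client must wait until at least one whole token has refilled. A sketch of the arithmetic:

```python
import math

def retry_after_seconds(tokens: float, refill_rate: float) -> int:
    # Wait until the bucket refills back up to one whole token
    deficit = max(0.0, 1.0 - tokens)
    return math.ceil(deficit / refill_rate)

# 0.25 tokens left, refilling at 100/60 tokens/second -> wait 1 second
print(retry_after_seconds(0.25, 100 / 60))
```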
Distributed Challenges
With 10 API servers, how do we ensure accuracy?
Option 1: Centralized Counting (Recommended)
All servers query the same Redis instance for the counter. Redis is fast (sub-millisecond), and operations like INCR are atomic.
Problem: Redis becomes a bottleneck at 1M requests/second.
Solution:
- Use Redis Cluster with multiple master nodes
- Shard rate limit counters by user ID (user 123 → Redis node 2; see the key sketch after this list)
- Each node handles a partition of users independently
- Can scale horizontally by adding more Redis nodes
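One practical detail worth noting (an assumption about key layout, not something the design above mandates): Redis Cluster shards by hashing the key, and a hash tag in curly braces pins all of a user's counters to the same slot, which keeps multi-key Lua scripts on a single node.

```python
# Both keys hash on "user:123" because of the {...} hash tag, so they land
# on the same Redis Cluster slot and can be touched by one Lua script.
minute_key = "rate_limit:{user:123}:minute"
day_key = "rate_limit:{user:123}:day"
```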
Option 2: Distributed Counters with Sync
Each API server maintains local counters and syncs them periodically with Redis.
Problem: You lose accuracy (two servers might both allow requests that violate the limit), and you need complex reconciliation.
Result: Not recommended for strict rate limiting.
Handling Failures
What if Redis goes down?
- Circuit breaker: If Redis is unavailable, fall back to a permissive mode (allow all requests but log) rather than blocking everything
- Redis Cluster: With 3+ master nodes, the cluster continues operating even if one node fails
- Replication: Each rate limit counter is replicated across 2-3 Redis nodes
Pro Tip: Use Redis Sentinel for automatic failover. When a master fails, Sentinel promotes a replica to master, and clients reconnect automatically.
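A minimal fail-open wrapper might look like the sketch below. It assumes the check_rate_limit helper from earlier; a production circuit breaker would also track failure rates and stop hitting Redis entirely while it is down.

```python
import logging

import redis

def check_with_fallback(user_id: str) -> bool:
    """Fail open: if Redis is unreachable, allow the request but log it."""
    try:
        allowed, _ = check_rate_limit(user_id)
        return allowed
    except redis.RedisError:
        logging.warning("rate limiter unavailable; allowing request for %s", user_id)
        return True
```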
Sliding Window Counter Implementation
For comparison, here’s the sliding window counter approach, which is more accurate than fixed window:
-- Lua script for sliding window counter rate limiting
local key = KEYS[1]              -- "rate_limit:user:123"
local limit = tonumber(ARGV[1])  -- 100
local now = tonumber(ARGV[2])    -- current timestamp in seconds
local window = tonumber(ARGV[3]) -- 60 seconds
local current_bucket = math.floor(now / window)
local current_window_key = key .. ":" .. current_bucket
local previous_window_key = key .. ":" .. (current_bucket - 1)
-- Read both windows first; increment only if the request is allowed,
-- so denied requests don't inflate the count
local current_count = tonumber(redis.call('GET', current_window_key)) or 0
local previous_count = tonumber(redis.call('GET', previous_window_key)) or 0
-- Weight the previous window by how much of it still overlaps the sliding window
local weight_previous = 1 - ((now % window) / window)
local total_count = previous_count * weight_previous + current_count
if total_count < limit then
    redis.call('INCR', current_window_key)
    redis.call('EXPIRE', current_window_key, window * 2)
    return {1, limit - total_count - 1} -- {allowed, remaining}
else
    return {0, 0} -- {denied, remaining}
end
This avoids the boundary burst problem of fixed windows.
Scaling Considerations
Horizontal Scaling:
- Redis Cluster with 5-10 master nodes (each handling 10-20% of the key space)
- Each node can handle 100K-200K ops/second
- Need 5-10 nodes to handle 1M requests/second
Rate Limiting Levels:
- API Gateway: coarse-grained limits (IP-based, per endpoint)
- Application Service: medium-grained limits (per user, per operation type)
- Database: fine-grained limits (prevent hot key contention)
Dynamic Rules:
- Store rate limit rules in a config service (Consul, etcd); a sample rule shape is sketched after this list
- Each rate limiter watches for config changes
- Can update limits without restarting services
- Use feature flags to gradually roll out new limits
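As an illustration, the rules might be shaped like the following (the structure is an assumption for this sketch, not a standard format):

```python
# Hypothetical rule set fetched from the config service; the limiter
# re-reads it on change notifications rather than on every request.
RATE_LIMIT_RULES = {
    "/signup": {"free": (100, "minute"), "premium": (100, "minute")},
    "/search": {"free": (1000, "minute"), "premium": (5000, "minute")},
    "default": {"free": (1000, "day"), "premium": (10000, "day")},
}
```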
Monitoring:
- Track 429 response rate (should be low in normal conditions)
- Alert if the 429 rate exceeds 1% of total requests
- Monitor Redis latency and memory usage
- Count how many requests are actually being rate limited per endpoint (should be rare for legitimate traffic); a metric sketch follows
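For instance, with the prometheus_client library, throttled requests could be tracked per endpoint (metric names here are illustrative):

```python
from prometheus_client import Counter

THROTTLED = Counter(
    "rate_limit_throttled_total",
    "Requests rejected with HTTP 429",
    ["endpoint"],
)

# Called from the middleware whenever it returns a 429
THROTTLED.labels(endpoint="/search").inc()
```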
Trade-offs and Design Decisions
| Aspect | Choice | Rationale |
|---|---|---|
| Centralized vs Distributed | Centralized (Redis) | More accurate, simpler to reason about, Redis is fast enough with clustering |
| Algorithm | Token Bucket | Handles bursts gracefully, industry standard, easy to configure |
| Storage | Redis (in-memory) | Sub-millisecond latency, atomic operations, acceptable data loss on failure |
| Replication | Async + Cluster | Fast writes, automatic failover, some reads might see slightly stale data |
| Fallback on Failure | Permissive (allow) | Better user experience than blocking all traffic if rate limiter fails |
| Time Window Precision | Second-level buckets | Good balance; minute-level would be too coarse, millisecond too fine |
Key Takeaways
- Rate limiting is critical for protecting multi-tenant systems from resource exhaustion
- Token Bucket is the most practical algorithm for most use cases; it handles bursts naturally
- Centralized counting via Redis is the standard approach; consistency beats distributed complexity
- Lua scripts enable atomic operations without race conditions
- Redis Cluster enables horizontal scaling for high throughput environments
- Graceful degradation (fail open) is often better than strict enforcement when the rate limiter fails
- Multi-level rate limiting (gateway, app, database) gives fine-grained control
Practice
Design a rate limiter that supports adaptive limits: if the system is under heavy load (CPU above 80%), reduce the limit by 20%. How would you detect system load? Where would you store the adaptive limit thresholds? How would you handle the transition smoothly?
Connection to Next Section
Rate limiting protects individual APIs, but what about when users need to access many resources? In the next section, we’ll design a URL Shortener—a service that handles massive redirect traffic efficiently, requiring careful caching and routing strategies similar to rate limiting but focused on performance rather than throttling.