Rate Limiter Design
The Problem
Imagine you’re running a platform serving 10 million users. Without any safeguards, a single misbehaving client—whether intentional or accidental—could hammer your API with millions of requests, consuming all your resources and degrading service for everyone else. This is where rate limiting comes in.
We need to design a distributed rate limiter that can throttle API requests fairly and efficiently. Think of it like a bouncer at a club: they track how often you’ve been coming in and decide whether to let you through. But doing this at scale, across hundreds of servers, with sub-millisecond response times, is a fascinating engineering challenge.
Requirements Gathering
Let’s nail down what we’re building:
Functional Requirements:
- Limit requests by user ID, IP address, or API key
- Configurable rate limits per endpoint (e.g., 100 requests/minute for user signup, 1000/minute for search)
- Return HTTP 429 (Too Many Requests) with a Retry-After header when the limit is exceeded
- Support multiple time windows: per second, per minute, per hour, per day
- Support different limit tiers (free users get 1K/day, premium get 10K/day)
Non-Functional Requirements:
- Latency overhead per request: under 1 millisecond
- Throughput: support 1M requests/second
- High availability: rate limiting system failure shouldn’t bring down the API
- Accuracy in a distributed environment: counting must be accurate even with multiple API servers
- Graceful degradation: if Redis is down, we might be slightly less accurate but still functional
Rate Limiting Algorithms
Before we architect, let’s understand the algorithms. Each has trade-offs:
Token Bucket (Most Common)
Tokens refill at a fixed rate; each request consumes one token. If no tokens remain, reject the request.
Tokens = min(Capacity, Tokens + RefillRate * TimeSinceLastRequest)
If Tokens ≥ 1:
    Tokens -= 1
    Allow request
Else:
    Reject request
Advantages: Handles bursts (if a user doesn’t make requests for a while, tokens accumulate), smooth rate limiting, easy to configure.
Disadvantages: More complex to implement at scale, requires state tracking.
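To make the pseudocode concrete, here is a minimal single-process sketch in Python. It is illustrative only; the distributed version later in this section keeps the same state in Redis instead of instance attributes.

```python
import time

class TokenBucket:
    """Single-process token bucket; a sketch, not the distributed design."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 100 requests/minute means refilling at 100/60 tokens per second
bucket = TokenBucket(capacity=100, refill_rate=100 / 60)
print(bucket.allow())  # True while tokens remain
```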
Leaky Bucket
Requests are queued and processed at a fixed rate, like water leaking from a bucket.
If Queue.size < Capacity:
    Queue.add(request)
    Process at fixed rate
Else:
    Drop request
Advantages: Smooth traffic, predictable output rate.
Disadvantages: Requests must wait in queue (higher latency), less flexible for bursty workloads.
Fixed Window Counter
Count requests in fixed time windows. Simple but has a boundary burst problem: if a user makes 100 requests in the last second of window 1 and 100 in the first second of window 2, they’ve doubled their rate.
Sliding Window Log
Track exact timestamp of every request. Accurate but memory-intensive (you’re storing timestamps for millions of requests).
Sliding Window Counter (Hybrid)
Use a weighted average of the current and previous window to smooth out boundary issues. Good balance of accuracy and efficiency.
For this design, we’ll implement Token Bucket using Redis for its simplicity and suitability for bursty traffic patterns.
High-Level Architecture
graph LR
Client["Client"]
LB["Load Balancer"]
RL["Rate Limiter Middleware"]
App["Application"]
Redis["Redis Cluster"]
Client -->|HTTP Request| LB
LB --> RL
RL -->|Check Counter| Redis
RL -->|Forward if Allowed| App
App -->|Response| RL
RL -->|Response| LB
LB -->|Response| Client
Here’s the flow:
- Client makes request to the API
- Load balancer routes to an API server
- Rate limiter middleware intercepts the request
- The middleware queries Redis for the current token count for this user
- Redis atomically consumes a token or reports that none remain
- The request passes to the application if allowed; otherwise the middleware returns 429
The rate limit rules (user quotas, time windows) are stored in a config service or fetched from Redis.
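As a sketch of steps 4 through 6, here is a framework-agnostic middleware in Python. The request and response are plain dicts to keep the example self-contained, and limiter.check is a stand-in for the Redis-backed token bucket developed in the next section; all of these names are illustrative.

```python
def rate_limit_middleware(request: dict, app, limiter) -> dict:
    """Check the limit before forwarding the request to the application."""
    key = request.get("api_key") or request["client_ip"]
    allowed, remaining = limiter.check(key, request["path"])
    if not allowed:
        return {"status": 429,
                "headers": {"Retry-After": "60"},
                "body": "Too Many Requests"}
    response = app(request)  # forward to the application
    response["headers"]["X-RateLimit-Remaining"] = str(int(remaining))
    return response
```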
Deep Dive: Implementation with Redis
Let’s implement token bucket in Redis. Here’s the challenge: we need to do this atomically to avoid race conditions.
The Algorithm
-- Lua script for atomic token bucket rate limiting
local key = KEYS[1]                   -- "rate_limit:user:123:minute"
local refill_rate = tonumber(ARGV[1]) -- e.g. 100/60 tokens per second
local now = tonumber(ARGV[2])         -- current timestamp in seconds
local capacity = tonumber(ARGV[3])    -- max tokens to accumulate
-- Get current state (HMGET returns values in the order requested,
-- unlike HGETALL, whose field order is not guaranteed)
local state = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now
-- Add tokens accumulated since the last refill
local time_passed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + time_passed * refill_rate)
-- Try to consume a token
if tokens >= 1 then
    tokens = tokens - 1
    redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600) -- clean up idle buckets after 1 hour
    return {1, tokens} -- {allowed, remaining_tokens}
else
    return {0, tokens} -- {denied, remaining_tokens}
end
Why Lua? Redis executes Lua scripts atomically. Without atomicity, two concurrent requests could both read the same token count, and both get through even if there’s only 1 token left.
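For reference, here is how an application might load and invoke that script with the redis-py client. This is a sketch: TOKEN_BUCKET_LUA is assumed to hold the script above, and the key naming and argument order follow its comments.

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379)

# TOKEN_BUCKET_LUA is assumed to contain the Lua script shown above
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def check_rate_limit(user_id: str, capacity: int = 100,
                     refill_rate: float = 100 / 60) -> tuple[bool, int]:
    """Returns (allowed, remaining_tokens) for one request."""
    allowed, remaining = token_bucket(
        keys=[f"rate_limit:user:{user_id}:minute"],
        args=[refill_rate, time.time(), capacity],
    )
    return bool(allowed), remaining
```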
Response Headers
When the rate limit is exceeded, we return helpful headers:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1645123860
Retry-After: 45
- X-RateLimit-Limit: total requests allowed in the window
- X-RateLimit-Remaining: tokens left
- X-RateLimit-Reset: Unix timestamp when the limit resets
- Retry-After: seconds to wait before retrying
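With the token bucket, the Retry-After value falls out of the bucket state: the client must wait until at least one whole token has refilled. A sketch of the arithmetic:

```python
import math

def retry_after_seconds(tokens: float, refill_rate: float) -> int:
    # Wait until the bucket refills back up to one whole token
    deficit = max(0.0, 1.0 - tokens)
    return math.ceil(deficit / refill_rate)

# 0.25 tokens left, refilling at 100/60 tokens/second -> wait 1 second
print(retry_after_seconds(0.25, 100 / 60))
```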
Distributed Challenges
With 10 API servers, how do we ensure accuracy?
Option 1: Centralized Counting (Recommended)
All servers query the same Redis instance for the counter. Redis is fast (sub-millisecond), and operations like INCR are atomic.
Problem: Redis becomes a bottleneck at 1M requests/second.
Solution:
- Use Redis Cluster with multiple master nodes
- Shard rate limit counters by user ID (user 123 → Redis node 2; see the key sketch after this list)
- Each node handles a partition of users independently
- Can scale horizontally by adding more Redis nodes
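One practical detail worth noting (an assumption about key layout, not something the design above mandates): Redis Cluster shards by hashing the key, and a hash tag in curly braces pins all of a user's counters to the same slot, which keeps multi-key Lua scripts on a single node.

```python
# Both keys hash on "user:123" because of the {...} hash tag, so they land
# on the same Redis Cluster slot and can be touched by one Lua script.
minute_key = "rate_limit:{user:123}:minute"
day_key = "rate_limit:{user:123}:day"
```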
Option 2: Distributed Counters with Sync
Each API server maintains local counters and syncs them periodically with Redis.
Problem: You lose accuracy (two servers might both allow requests that violate the limit), and you need complex reconciliation.
Result: Not recommended for strict rate limiting.
Handling Failures
What if Redis goes down?
- Circuit breaker: If Redis is unavailable, fall back to a permissive mode (allow all requests but log) rather than blocking everything
- Redis Cluster: With 3+ master nodes, the cluster continues operating even if one node fails
- Replication: Each rate limit counter is replicated across 2-3 Redis nodes
Pro Tip: Use Redis Sentinel for automatic failover. When a master fails, Sentinel promotes a replica to master, and clients reconnect automatically.
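A minimal fail-open wrapper might look like the sketch below. It assumes the check_rate_limit helper from earlier; a production circuit breaker would also track failure rates and stop hitting Redis entirely while it is down.

```python
import logging

import redis

def check_with_fallback(user_id: str) -> bool:
    """Fail open: if Redis is unreachable, allow the request but log it."""
    try:
        allowed, _ = check_rate_limit(user_id)
        return allowed
    except redis.RedisError:
        logging.warning("rate limiter unavailable; allowing request for %s", user_id)
        return True
```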
Sliding Window Counter Implementation
For comparison, here’s the sliding window counter approach, which is more accurate than fixed window:
-- Lua script for sliding window counter rate limiting
local key = KEYS[1]              -- "rate_limit:user:123"
local limit = tonumber(ARGV[1])  -- 100
local now = tonumber(ARGV[2])    -- current timestamp in seconds
local window = tonumber(ARGV[3]) -- 60 seconds
local current_bucket = math.floor(now / window)
local current_window_key = key .. ":" .. current_bucket
local previous_window_key = key .. ":" .. (current_bucket - 1)
-- Read both windows first; increment only if the request is allowed,
-- so denied requests don't inflate the count
local current_count = tonumber(redis.call('GET', current_window_key)) or 0
local previous_count = tonumber(redis.call('GET', previous_window_key)) or 0
-- Weight the previous window by how much of it still overlaps the sliding window
local weight_previous = 1 - ((now % window) / window)
local total_count = previous_count * weight_previous + current_count
if total_count < limit then
    redis.call('INCR', current_window_key)
    redis.call('EXPIRE', current_window_key, window * 2)
    return {1, limit - total_count - 1} -- {allowed, remaining}
else
    return {0, 0} -- {denied, remaining}
end
This avoids the boundary burst problem of fixed windows.
Scaling Considerations
Horizontal Scaling:
- Redis Cluster with 5-10 master nodes (each handling 10-20% of the key space)
- Each node can handle 100K-200K ops/second
- Need 5-10 nodes to handle 1M requests/second
Rate Limiting Levels:
- API Gateway: coarse-grained limits (IP-based, per endpoint)
- Application Service: medium-grained limits (per user, per operation type)
- Database: fine-grained limits (prevent hot key contention)
Dynamic Rules:
- Store rate limit rules in a config service (Consul, etcd); a sample rule shape is sketched after this list
- Each rate limiter watches for config changes
- Can update limits without restarting services
- Use feature flags to gradually roll out new limits
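As an illustration, the rules might be shaped like the following (the structure is an assumption for this sketch, not a standard format):

```python
# Hypothetical rule set fetched from the config service; the limiter
# re-reads it on change notifications rather than on every request.
RATE_LIMIT_RULES = {
    "/signup": {"free": (100, "minute"), "premium": (100, "minute")},
    "/search": {"free": (1000, "minute"), "premium": (5000, "minute")},
    "default": {"free": (1000, "day"), "premium": (10000, "day")},
}
```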
Monitoring:
- Track 429 response rate (should be low in normal conditions)
- Alert if the 429 rate exceeds 1% of total requests
- Monitor Redis latency and memory usage
- Count how many requests are actually being rate limited per endpoint (should be rare for legitimate traffic); a metric sketch follows
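For instance, with the prometheus_client library, throttled requests could be tracked per endpoint (metric names here are illustrative):

```python
from prometheus_client import Counter

THROTTLED = Counter(
    "rate_limit_throttled_total",
    "Requests rejected with HTTP 429",
    ["endpoint"],
)

# Called from the middleware whenever it returns a 429
THROTTLED.labels(endpoint="/search").inc()
```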
Trade-offs and Design Decisions
| Aspect | Choice | Rationale |
|---|---|---|
| Centralized vs Distributed | Centralized (Redis) | More accurate, simpler to reason about, Redis is fast enough with clustering |
| Algorithm | Token Bucket | Handles bursts gracefully, industry standard, easy to configure |
| Storage | Redis (in-memory) | Sub-millisecond latency, atomic operations, acceptable data loss on failure |
| Replication | Async + Cluster | Fast writes, automatic failover, some reads might see slightly stale data |
| Fallback on Failure | Permissive (allow) | Better user experience than blocking all traffic if rate limiter fails |
| Time Window Precision | Second-level buckets | Good balance; minute-level would be too coarse, millisecond too fine |
Key Takeaways
- Rate limiting is critical for protecting multi-tenant systems from resource exhaustion
- Token Bucket is the most practical algorithm for most use cases; it handles bursts naturally
- Centralized counting via Redis is the standard approach; consistency beats distributed complexity
- Lua scripts enable atomic operations without race conditions
- Redis Cluster enables horizontal scaling for high throughput environments
- Graceful degradation (fail open) is often better than strict enforcement when the rate limiter fails
- Multi-level rate limiting (gateway, app, database) gives fine-grained control
Practice
Design a rate limiter that supports adaptive limits: if the system is under heavy load (CPU above 80%), reduce the limit by 20%. How would you detect system load? Where would you store the adaptive limit thresholds? How would you handle the transition smoothly?
Connection to Next Section
Rate limiting protects individual APIs, but what about when users need to access many resources? In the next section, we’ll design a URL Shortener—a service that handles massive redirect traffic efficiently, requiring careful caching and routing strategies similar to rate limiting but focused on performance rather than throttling.