Rate Limiting & Throttling
The Bouncer Problem
Imagine your API is running smoothly. You’ve scaled to thousands of concurrent users, your response times are solid, and everything feels production-ready. Then, at 2 AM, one client goes haywire and starts sending 10,000 requests per second. Maybe it’s a bug in their retry logic. Maybe it’s a DDoS attack. Maybe they’re scraping your entire database without permission.
Without rate limiting, that single client doesn’t just affect themselves—they bring down your entire service for everyone. Your legitimate users see timeouts. Your databases get overwhelmed. Your infrastructure team gets paged. You just learned an expensive lesson about protecting your API.
Rate limiting is the bouncer at the door. It’s the mechanism that controls how fast requests can flow into your system, ensuring no single client (malicious or accidental) can monopolize your resources. In this chapter, we’ll explore how to build rate limiting systems that protect your service while keeping your actual users happy.
Understanding Rate Limiting vs Throttling
These terms are often confused, so let’s be precise:
Rate Limiting means rejecting requests that exceed a defined threshold. When a client hits their limit, they get an HTTP 429 (Too Many Requests) response. The request is denied.
Throttling means slowing down requests that would exceed the limit, rather than rejecting them outright. Think of it as queuing and delaying rather than turning people away. Throttling is gentler but uses more server resources.
In practice, most systems use rate limiting (the hard rejection), often with client-side throttling (the client slows itself down voluntarily to avoid hitting the limit).
The Dimensions of Control
Rate limiting isn’t one-size-fits-all. You control requests across multiple dimensions:
- Per API Key / User: Premium users get 10,000 requests/minute; free users get 100 requests/minute
- Per IP Address: Limit requests from any single IP to prevent distributed attacks
- Per Endpoint: Expensive operations (machine learning inference) get lower limits than cheap reads
- Global: Absolute cap on total requests across your entire system as a safety valve
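In practice, each of these dimensions usually maps to its own counter key, and every request is checked against all of them. A minimal naming sketch (the `rl:` prefix and the helper function are illustrative, not from any particular library):

```python
def rate_limit_keys(user_id: str, ip: str, endpoint: str) -> list:
    """Build one counter key per dimension; each key gets its own limit."""
    return [
        f"rl:user:{user_id}",                  # per API key / user
        f"rl:ip:{ip}",                         # per IP address
        f"rl:endpoint:{endpoint}:{user_id}",   # per endpoint, per user
        "rl:global",                           # global safety valve
    ]
```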
The Rate Limit Contract
When you implement rate limiting, you communicate via HTTP headers:
- X-RateLimit-Limit: The maximum requests allowed in the window
- X-RateLimit-Remaining: How many requests the client has left before hitting the limit
- X-RateLimit-Reset: Unix timestamp when the limit resets
When a client exceeds the limit, you respond with 429 Too Many Requests and include a Retry-After header telling them when to try again:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708972260
Retry-After: 60
```
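On the server side, attaching these headers is a small helper in whatever framework you use. A hedged sketch, assuming a response object with a dict-like `headers` attribute (true of Flask and Django responses); the function name is illustrative:

```python
import time

def apply_rate_limit_headers(response, limit: int, remaining: int, reset_ts: int):
    """Attach the standard rate limit headers to an outgoing response."""
    response.headers["X-RateLimit-Limit"] = str(limit)
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    response.headers["X-RateLimit-Reset"] = str(reset_ts)
    if remaining == 0:
        # Tell the client how long to wait before the window resets
        response.headers["Retry-After"] = str(max(1, reset_ts - int(time.time())))
    return response
```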
The Toll Booth Analogy
Imagine a highway toll booth during rush hour. Cars (requests) arrive at varying rates. The booth can process, say, 50 cars per minute. Here’s how it manages:
- Express lane (premium tier): VIP subscribers bypass some checks and get priority—they’re rarely delayed
- Regular lanes: Most drivers queue up and are processed fairly
- Separate counting: Each vehicle type is counted separately—trucks have their own limit
- Emergency lane: Ambulances and fire trucks always get through, no matter how busy it is
The booth operator (your rate limiter) knows exactly how many vehicles have passed, and if too many start queuing up, new arrivals are told to circle back in 5 minutes instead of clogging the entrance.
This is exactly how a multi-tiered, multi-dimensional rate limiting system works. Different clients, different endpoints, different rules—but all flowing through a controlled gate.
Algorithm Showdown: Five Ways to Count Requests
Choosing the right rate limiting algorithm is crucial. Each has trade-offs in accuracy, memory, and complexity.
Fixed Window Counter
The simplest approach: divide time into fixed buckets. If your limit is 100 requests per minute, you reset the counter every 60 seconds.
- Pros: Simple, minimal memory
- Cons: Burst vulnerability at window boundaries (client sends 100 at 59s and 100 at 61s, using 200 requests in 2 seconds)
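A minimal single-process sketch of a fixed window counter (the class and names are illustrative; a real implementation would also evict stale windows):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    def __init__(self, limit: int = 100, window: int = 60):
        self.limit = limit
        self.window = window
        self.counters = defaultdict(int)  # (client_id, window index) -> request count

    def is_allowed(self, client_id: str) -> bool:
        window_index = int(time.time() // self.window)  # which fixed bucket we're in
        key = (client_id, window_index)
        self.counters[key] += 1
        return self.counters[key] <= self.limit
```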
Sliding Window Log
Keep a timestamped log of every request. To check if a new request is allowed, count how many requests occurred in the past minute.
- Pros: Perfectly accurate, no burst vulnerability
- Cons: Extremely memory-intensive (every request creates a log entry)
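A single-process sketch using a per-client deque of timestamps (names are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLogLimiter:
    def __init__(self, limit: int = 100, window: int = 60):
        self.limit = limit
        self.window = window
        self.logs = defaultdict(deque)  # client_id -> timestamps of recent requests

    def is_allowed(self, client_id: str) -> bool:
        now = time.time()
        log = self.logs[client_id]
        # Drop timestamps that have fallen out of the sliding window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False
```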
Sliding Window Counter
A hybrid approach: use fixed buckets but weight older buckets. More accurate than fixed window, much cheaper than sliding log.
- Pros: Good accuracy, reasonable memory usage
- Cons: Still has minor burst vulnerabilities depending on implementation
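A sketch of the weighted-bucket estimate: the previous window's count is scaled by how much of it still overlaps the sliding window (single-process, illustrative names; old buckets should be evicted in a real implementation):

```python
import time
from collections import defaultdict

class SlidingWindowCounterLimiter:
    def __init__(self, limit: int = 100, window: int = 60):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (client_id, window index) -> count

    def is_allowed(self, client_id: str) -> bool:
        now = time.time()
        idx = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        current = self.counts[(client_id, idx)]
        previous = self.counts[(client_id, idx - 1)]
        # Weight the previous window by how much of it still overlaps the sliding window
        estimated = previous * (1 - elapsed_fraction) + current
        if estimated < self.limit:
            self.counts[(client_id, idx)] += 1
            return True
        return False
```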
Token Bucket
Picture a bucket that refills at a constant rate (say, 100 tokens per minute). Each request consumes a token. If the bucket is full, new tokens are discarded. Clients can make requests as long as tokens exist.
- Pros: Allows controlled bursts, smooth traffic flow, handles varying request sizes
- Cons: Requires careful tuning of bucket size and refill rate
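Before the Redis-backed version later in this chapter, here is a minimal single-process preview of the same idea (names and defaults are illustrative):

```python
import time

class InMemoryTokenBucket:
    def __init__(self, capacity: int = 100, refill_rate: float = 100 / 60):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Top up the bucket based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```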
Leaky Bucket
Requests enter a fixed-capacity bucket. A background process drains the bucket at a constant rate. If new requests arrive and the bucket is full, they’re discarded.
- Pros: Smooths out burst traffic, predictable outflow
- Cons: Queuing adds latency; not ideal for real-time APIs
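A single-process sketch of the metering variant: instead of an actual queue, it tracks a "water level" that drains at a constant rate (names are illustrative):

```python
import time

class LeakyBucketLimiter:
    def __init__(self, capacity: int = 100, leak_rate: float = 10.0):
        self.capacity = capacity      # how many requests can sit in the bucket
        self.leak_rate = leak_rate    # requests drained per second
        self.level = 0.0              # current "water level"
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain the bucket at a constant rate since the last check
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.level < self.capacity:
            self.level += 1
            return True
        return False  # bucket is full; the request is discarded
```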
Let’s visualize token bucket, the most popular choice:
```mermaid
graph TD
    A["Incoming Requests"] --> B{"Tokens Available?"}
    B -->|Yes| C["Consume Token"]
    B -->|No| D["Return 429"]
    C --> E["Process Request"]
    F["Constant Refill Rate"] --> G["Bucket: Max Capacity"]
    G -->|"e.g., 100 tokens/min"| H["Token Pool"]
    H --> B
```
Here’s a comparison table:
| Algorithm | Accuracy | Memory | Burst Handling | Use Case |
|---|---|---|---|---|
| Fixed Window | Poor | Minimal | Bad - window boundary spikes | Quick & dirty, non-critical |
| Sliding Window Log | Perfect | High | Good | High-accuracy, small-scale |
| Sliding Window Counter | Good | Low-Med | Good | Standard distributed systems |
| Token Bucket | Good | Low | Controlled | General-purpose, allows bursts |
| Leaky Bucket | Perfect | Low-Med | Smooths out | Traffic shaping, steady-state |
Distributing Rate Limits Across Servers
Here’s where theory meets painful reality. If you have five API servers and use in-memory counters, each server independently tracks requests. One client can send 100 requests to each server (500 total) before any individual server rate limits them. Your limit is now effectively multiplied by the number of servers.
The solution: a centralized rate limiting service. Redis is the industry standard because it’s fast, atomic, and built for distributed counting.
Redis-Based Rate Limiting
A simple Python implementation using Redis:
```python
import redis
import time

class RateLimiter:
    def __init__(self, redis_client, limit=100, window=60):
        self.redis = redis_client
        self.limit = limit
        self.window = window

    def is_allowed(self, client_id):
        key = f"rate_limit:{client_id}"
        current = self.redis.incr(key)
        if current == 1:
            # First request in this window, set expiry
            self.redis.expire(key, self.window)
        remaining = self.limit - current
        reset_time = int(time.time()) + self.window
        return {
            'allowed': current <= self.limit,
            'remaining': max(0, remaining),
            'reset': reset_time
        }
```
The problem: INCR and EXPIRE are two separate commands. If the process crashes or the connection drops between them, the counter is created without an expiry and the client stays limited indefinitely; the reset time is also only an approximation. For true atomicity at scale, use Lua scripting:
```lua
-- Redis Lua script for atomic rate limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('incr', key)
if current == 1 then
    redis.call('expire', key, window)
end

local remaining = math.max(0, limit - current)
local reset_ttl = redis.call('ttl', key)  -- seconds until the window resets

-- Lua booleans don't survive the reply conversion, so return 1/0 instead
local allowed = 0
if current <= limit then allowed = 1 end
return {allowed, remaining, reset_ttl}
```
Call this from your application with EVAL, and Redis executes it atomically.
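A minimal call site using redis-py's `register_script`, which handles EVAL/EVALSHA for you. The `rate_limit.lua` filename and the `check` wrapper are illustrative assumptions, not part of any library:

```python
import time
import redis

r = redis.Redis()

# The Lua script shown above, stored in a file next to the application code
with open("rate_limit.lua") as f:
    rate_limit_script = r.register_script(f.read())

def check(client_id: str, limit: int = 100, window: int = 60):
    allowed, remaining, ttl = rate_limit_script(
        keys=[f"rate_limit:{client_id}"],
        args=[limit, window],
    )
    # Convert the remaining TTL into an absolute reset timestamp
    return bool(allowed), remaining, int(time.time()) + max(ttl, 0)
```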
API Gateway Rate Limiting
Most platforms handle rate limiting before requests hit your application servers:
AWS API Gateway lets you configure throttling per stage:
```yaml
MethodSettings:
  - ThrottlingBurstLimit: 5000
    ThrottlingRateLimit: 2000
```
Kong configures rate limiting as a plugin:
```yaml
plugins:
  - name: rate-limiting
    config:
      second: 10
      minute: 600
      hour: 36000
```
nginx offers the limit_req module:
```nginx
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api {
        limit_req zone=api_limit burst=20 nodelay;
    }
}
```
The advantage of gateway-level rate limiting: it stops bad traffic before it reaches your application, saving compute and database resources.
Tiered Limits in Practice
Real systems rarely have one rate limit. You implement tiers:
- Free tier: 1,000 requests/day
- Pro tier: 100,000 requests/day
- Enterprise: Custom limits, contact sales
- Internal services: No limits (trusted)
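In code, these tiers often reduce to a lookup table consulted before the counter check. A small sketch (the numbers mirror the list above; the enterprise entry is a placeholder for per-contract limits):

```python
# Daily request quotas per subscription tier
TIER_LIMITS = {
    "free": 1_000,
    "pro": 100_000,
    "enterprise": None,          # custom limits, negotiated per contract
    "internal": float("inf"),    # trusted internal services are effectively unlimited
}

def daily_limit(tier: str) -> float:
    """Look up the quota for a tier, defaulting unknown tiers to free."""
    limit = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
    return float("inf") if limit is None else float(limit)
```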
You also apply per-endpoint limits:
```python
# High-cost operation gets a stricter limit
@app.route('/api/expensive-ml-inference', methods=['POST'])
@rate_limit(limit=10, window=60, per='user_id')  # 10 per minute
def expensive_endpoint():
    pass

# Cheap read gets a looser limit
@app.route('/api/user/<user_id>', methods=['GET'])
@rate_limit(limit=1000, window=60, per='user_id')  # 1000 per minute
def get_user(user_id):
    pass
```
Some systems implement cost-based rate limiting: each operation costs a certain number of “tokens” based on resource consumption. A complex search query might cost 10 tokens, while a simple key-value lookup costs 1.
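A sketch of cost-based metering layered on a token bucket: each operation is assigned a token cost and the bucket is debited by that amount. The costs mirror the example above; the operation names and helper are illustrative:

```python
# Token cost per operation, roughly proportional to resource consumption
OPERATION_COSTS = {
    "complex_search": 10,   # heavy search query
    "kv_lookup": 1,         # simple key-value read
}

def charge(bucket_tokens: float, operation: str):
    """Debit the bucket by the operation's cost; deny if it can't afford it."""
    cost = OPERATION_COSTS.get(operation, 1)
    if bucket_tokens >= cost:
        return True, bucket_tokens - cost
    return False, bucket_tokens
```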
Practical Implementation: Token Bucket with Redis
Here’s a production-ready implementation combining Redis with token bucket logic:
```python
import redis
import time
from typing import Tuple

class TokenBucketLimiter:
    def __init__(self, redis_host='localhost', redis_port=6379):
        self.redis = redis.Redis(host=redis_host, port=redis_port)
        self.lua_script = self.redis.register_script("""
            local key = KEYS[1]
            local capacity = tonumber(ARGV[1])
            local refill_rate = tonumber(ARGV[2])  -- tokens per second
            local now = tonumber(ARGV[3])

            -- Read the bucket state explicitly by field name
            local state = redis.call('hmget', key, 'tokens', 'last_refill')
            local tokens = tonumber(state[1]) or capacity
            local last_refill = tonumber(state[2]) or now

            -- Calculate tokens to add since last refill
            local elapsed = math.max(0, now - last_refill)
            local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

            if new_tokens >= 1 then
                redis.call('hmset', key, 'tokens', new_tokens - 1, 'last_refill', now)
                -- Expire idle buckets (TTL of `capacity` seconds) so abandoned clients don't leak keys
                redis.call('expire', key, capacity)
                return {1, math.floor(new_tokens - 1)}
            else
                return {0, math.floor(new_tokens)}
            end
        """)

    def allow_request(self, client_id: str, capacity: int = 100,
                      refill_rate: float = 10) -> Tuple[bool, int]:
        """Returns (allowed, tokens_remaining)"""
        key = f"token_bucket:{client_id}"
        allowed, remaining = self.lua_script(
            keys=[key],
            args=[capacity, refill_rate, time.time()]
        )
        return bool(allowed), remaining
```
When a request comes in, you check once—the Lua script handles all the complex state management atomically.
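Wiring it into a request handler might look like the Flask-style sketch below. The route, the 100-token capacity, the 10 tokens/second refill, and the Redis instance on localhost are all example assumptions:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
limiter = TokenBucketLimiter()  # assumes Redis on localhost:6379

@app.route("/api/data")
def get_data():
    client_id = request.headers.get("X-API-Key", request.remote_addr)
    allowed, remaining = limiter.allow_request(client_id, capacity=100, refill_rate=10)
    if not allowed:
        response = jsonify(error="rate limit exceeded")
        response.status_code = 429
        response.headers["Retry-After"] = "1"  # at 10 tokens/s, a token is roughly 0.1s away
        return response
    return jsonify(data="...", tokens_remaining=remaining)
```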
Client-Side Resilience: Exponential Backoff
Your API can be perfect, but client code matters too. A well-behaved client respects rate limits and retries intelligently.
```python
import requests
import time

def api_call_with_backoff(url, max_retries=3):
    retry_delay = 1  # Start with 1 second
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response

        # Extract retry-after if provided
        retry_after = response.headers.get('Retry-After')
        if retry_after:
            wait_time = int(retry_after)
        else:
            wait_time = retry_delay

        print(f"Rate limited. Waiting {wait_time}s before retry...")
        time.sleep(wait_time)
        retry_delay = min(retry_delay * 2, 60)  # Cap at 60s

    raise Exception("Max retries exceeded")
```
The exponential backoff (1s, 2s, 4s, 8s…) prevents thundering herds when many clients get rate limited simultaneously.
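Adding random jitter spreads those simultaneous retries out even further. A small helper that could replace the fixed doubling in the loop above (the 0 to 1 second jitter range is an arbitrary choice):

```python
import random

def backoff_with_jitter(base_delay: float, attempt: int, cap: float = 60.0) -> float:
    """Exponential backoff (base * 2^attempt), capped, plus random jitter."""
    delay = min(base_delay * (2 ** attempt), cap)
    return delay + random.uniform(0, 1)  # jitter keeps clients from retrying in lockstep
```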
Design Trade-offs
Strictness vs. User Experience
Too strict and legitimate users hit limits during traffic spikes. Too lenient and you don’t actually protect the system.
Solution: Use two-level responses. Warn users when they’re approaching limits via headers, then reject at the hard limit. Add a small grace period for bursty legitimate traffic (token bucket’s burst capacity).
Distributed Complexity
Centralized rate limiting (Redis) is operationally simple but adds latency and a single point of failure. Distributed counters are faster locally but lose accuracy.
Solution: Most systems use Redis for tier 1 (per-user) limits and distributed counters for low-importance limits (per-IP abuse detection). Or use Redis in a cluster with replication.
Internal vs. External Traffic
Should your internal services respect the same limits as external APIs?
Convention: No. Internal services are trusted; they run your own code. Give them unlimited access or much higher quotas. Separate rate limit keys: external:user_id vs. internal:service_name.
Serverless Environments
In AWS Lambda or similar, you don’t have persistent in-memory state across invocations. Every check hits Redis or DynamoDB, adding latency.
Solution: Many serverless APIs skip strict per-request rate limiting and use external API Gateway throttling instead. Or accept the latency tradeoff and use a managed rate limiting service.
Key Takeaways
- Rate limiting protects your system from overload—one bad actor shouldn’t affect everyone else. Throttling delays requests instead of rejecting them, trading resources for gentleness.
- Choose your algorithm based on accuracy vs. complexity—token bucket is the sweet spot for most APIs, offering simplicity and controlled burst handling.
- Always centralize rate limiting in distributed systems—in-memory counters don’t work across multiple servers. Redis with Lua scripts provides atomic, fast rate limiting.
- Use HTTP headers to communicate limits—X-RateLimit-* headers and Retry-After allow clients to self-throttle and retry intelligently, reducing wasted requests.
- Implement tiered limits—different subscription levels, different endpoints, and different dimensions (per-user, per-IP) all need different limits. One global limit is rarely enough.
- Don’t forget the client side—clients that respect rate limits with exponential backoff reduce your operational burden and create a better user experience than constant retries.
Practice Scenarios
Scenario 1: Burst Protection
Your API supports real-time notifications. Clients naturally burst: they connect, catch up on missed events (50 requests in 1 second), then settle into normal polling. Your fixed 10 requests/second limit is too strict. How do you adapt your rate limiting to handle this legitimate burst pattern while still protecting against abuse?
Scenario 2: Cost-Based Metering
Your service offers two operations:
- Fast lookup: returns cached data in 1ms
- Slow inference: runs ML model, takes 5 seconds
Both count as “1 API call” but consume vastly different resources. Design a cost-based rate limiting system where you assign token costs per operation. A free user has 1000 tokens/day; how would you price each operation?
Scenario 3: Cascade Failure Prevention
Your rate limiter uses Redis, which goes down for 30 seconds due to a network partition. What happens to your API requests? Design a graceful degradation strategy: should you fail open (allow all requests) or fail closed (deny all requests)?
Next: Authentication and Authorization
Rate limiting controls how fast requests come in. Next, we’ll explore who is allowed to make those requests and what they’re authorized to do. Authentication and authorization are the bouncer’s clipboard—checking ID and the guest list—while rate limiting is the bouncer managing the crowd.
Understanding rate limiting deeply positions you to build robust API infrastructure. With these fundamentals in place, you’re ready to layer authentication on top, ensuring not just that requests are controlled, but that the right requests from the right users are getting through.