Rate Limiting & Throttling
The Bouncer Problem
Imagine your API is running smoothly. You’ve scaled to thousands of concurrent users, your response times are solid, and everything feels production-ready. Then, at 2 AM, one client goes haywire and starts sending 10,000 requests per second. Maybe it’s a bug in their retry logic. Maybe it’s a DDoS attack. Maybe they’re scraping your entire database without permission.
Without rate limiting, that single client doesn’t just affect themselves—they bring down your entire service for everyone. Your legitimate users see timeouts. Your databases get overwhelmed. Your infrastructure team gets paged. You just learned an expensive lesson about protecting your API.
Rate limiting is the bouncer at the door. It’s the mechanism that controls how fast requests can flow into your system, ensuring no single client (malicious or accidental) can monopolize your resources. In this chapter, we’ll explore how to build rate limiting systems that protect your service while keeping your actual users happy.
Understanding Rate Limiting vs Throttling
These terms are often confused, so let’s be precise:
Rate Limiting means rejecting requests that exceed a defined threshold. When a client hits their limit, they get an HTTP 429 (Too Many Requests) response. The request is denied.
Throttling means slowing down requests that would exceed the limit, rather than rejecting them outright. Think of it as queuing and delaying rather than turning people away. Throttling is gentler but uses more server resources.
In practice, most systems use rate limiting (the hard rejection), often with client-side throttling (the client slows itself down voluntarily to avoid hitting the limit).
The Dimensions of Control
Rate limiting isn’t one-size-fits-all. You control requests across multiple dimensions:
- Per API Key / User: Premium users get 10,000 requests/minute; free users get 100 requests/minute
- Per IP Address: Limit requests from any single IP to prevent distributed attacks
- Per Endpoint: Expensive operations (machine learning inference) get lower limits than cheap reads
- Global: Absolute cap on total requests across your entire system as a safety valve
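In practice, each of these dimensions usually maps to its own counter key, and every request is checked against all of them. A minimal naming sketch (the `rl:` prefix and the helper function are illustrative, not from any particular library):

```python
def rate_limit_keys(user_id: str, ip: str, endpoint: str) -> list:
    """Build one counter key per dimension; each key gets its own limit."""
    return [
        f"rl:user:{user_id}",                  # per API key / user
        f"rl:ip:{ip}",                         # per IP address
        f"rl:endpoint:{endpoint}:{user_id}",   # per endpoint, per user
        "rl:global",                           # global safety valve
    ]
```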
The Rate Limit Contract
When you implement rate limiting, you communicate via HTTP headers:
- X-RateLimit-Limit: The maximum requests allowed in the window
- X-RateLimit-Remaining: How many requests the client has left before hitting the limit
- X-RateLimit-Reset: Unix timestamp when the limit resets
When a client exceeds the limit, you respond with 429 Too Many Requests and include a Retry-After header telling them when to try again:
```http
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708972260
Retry-After: 60
```
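On the server side, attaching these headers is a small helper in whatever framework you use. A hedged sketch, assuming a response object with a dict-like `headers` attribute (true of Flask and Django responses); the function name is illustrative:

```python
import time

def apply_rate_limit_headers(response, limit: int, remaining: int, reset_ts: int):
    """Attach the standard rate limit headers to an outgoing response."""
    response.headers["X-RateLimit-Limit"] = str(limit)
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    response.headers["X-RateLimit-Reset"] = str(reset_ts)
    if remaining == 0:
        # Tell the client how long to wait before the window resets
        response.headers["Retry-After"] = str(max(1, reset_ts - int(time.time())))
    return response
```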
The Toll Booth Analogy
Imagine a highway toll booth during rush hour. Cars (requests) arrive at varying rates. The booth can process, say, 50 cars per minute. Here’s how it manages:
- Express lane (premium tier): VIP subscribers bypass some checks and get priority—they’re rarely delayed
- Regular lanes: Most drivers queue up and are processed fairly
- Separate counting: Each vehicle type is counted separately—trucks have their own limit
- Emergency lane: Ambulances and fire trucks always get through, no matter how busy it is
The booth operator (your rate limiter) knows exactly how many vehicles have passed, and if too many start queuing up, new arrivals are told to circle back in 5 minutes instead of clogging the entrance.
This is exactly how a multi-tiered, multi-dimensional rate limiting system works. Different clients, different endpoints, different rules—but all flowing through a controlled gate.
Algorithm Showdown: Five Ways to Count Requests
Choosing the right rate limiting algorithm is crucial. Each has trade-offs in accuracy, memory, and complexity.
Fixed Window Counter
The simplest approach: divide time into fixed buckets. If your limit is 100 requests per minute, you reset the counter every 60 seconds.
- Pros: Simple, minimal memory
- Cons: Burst vulnerability at window boundaries (client sends 100 at 59s and 100 at 61s, using 200 requests in 2 seconds)
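A minimal single-process sketch of a fixed window counter (the class and names are illustrative; a real implementation would also evict stale windows):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    def __init__(self, limit: int = 100, window: int = 60):
        self.limit = limit
        self.window = window
        self.counters = defaultdict(int)  # (client_id, window index) -> request count

    def is_allowed(self, client_id: str) -> bool:
        window_index = int(time.time() // self.window)  # which fixed bucket we're in
        key = (client_id, window_index)
        self.counters[key] += 1
        return self.counters[key] <= self.limit
```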
Sliding Window Log
Keep a timestamped log of every request. To check if a new request is allowed, count how many requests occurred in the past minute.
- Pros: Perfectly accurate, no burst vulnerability
- Cons: Extremely memory-intensive (every request creates a log entry)
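A single-process sketch using a per-client deque of timestamps (names are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLogLimiter:
    def __init__(self, limit: int = 100, window: int = 60):
        self.limit = limit
        self.window = window
        self.logs = defaultdict(deque)  # client_id -> timestamps of recent requests

    def is_allowed(self, client_id: str) -> bool:
        now = time.time()
        log = self.logs[client_id]
        # Drop timestamps that have fallen out of the sliding window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False
```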
Sliding Window Counter
A hybrid approach: use fixed buckets but weight older buckets. More accurate than fixed window, much cheaper than sliding log.
- Pros: Good accuracy, reasonable memory usage
- Cons: Still has minor burst vulnerabilities depending on implementation
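A sketch of the weighted-bucket estimate: the previous window's count is scaled by how much of it still overlaps the sliding window (single-process, illustrative names; old buckets should be evicted in a real implementation):

```python
import time
from collections import defaultdict

class SlidingWindowCounterLimiter:
    def __init__(self, limit: int = 100, window: int = 60):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (client_id, window index) -> count

    def is_allowed(self, client_id: str) -> bool:
        now = time.time()
        idx = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        current = self.counts[(client_id, idx)]
        previous = self.counts[(client_id, idx - 1)]
        # Weight the previous window by how much of it still overlaps the sliding window
        estimated = previous * (1 - elapsed_fraction) + current
        if estimated < self.limit:
            self.counts[(client_id, idx)] += 1
            return True
        return False
```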
Token Bucket
Picture a bucket that refills at a constant rate (say, 100 tokens per minute). Each request consumes a token. If the bucket is full, new tokens are discarded. Clients can make requests as long as tokens exist.
- Pros: Allows controlled bursts, smooth traffic flow, handles varying request sizes
- Cons: Requires careful tuning of bucket size and refill rate
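Before the Redis-backed version later in this chapter, here is a minimal single-process preview of the same idea (names and defaults are illustrative):

```python
import time

class InMemoryTokenBucket:
    def __init__(self, capacity: int = 100, refill_rate: float = 100 / 60):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Top up the bucket based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```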
Leaky Bucket
Requests enter a fixed-capacity bucket. A background process drains the bucket at a constant rate. If new requests arrive and the bucket is full, they’re discarded.
- Pros: Smooths out burst traffic, predictable outflow
- Cons: Queuing adds latency; not ideal for real-time APIs
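A single-process sketch of the metering variant: instead of an actual queue, it tracks a "water level" that drains at a constant rate (names are illustrative):

```python
import time

class LeakyBucketLimiter:
    def __init__(self, capacity: int = 100, leak_rate: float = 10.0):
        self.capacity = capacity      # how many requests can sit in the bucket
        self.leak_rate = leak_rate    # requests drained per second
        self.level = 0.0              # current "water level"
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain the bucket at a constant rate since the last check
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.level < self.capacity:
            self.level += 1
            return True
        return False  # bucket is full; the request is discarded
```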
Let’s visualize token bucket, the most popular choice:
```mermaid
graph TD
    A["Incoming Requests"] --> B{"Tokens Available?"}
    B -->|Yes| C["Consume Token"]
    B -->|No| D["Return 429"]
    C --> E["Process Request"]
    F["Constant Refill Rate"] --> G["Bucket: Max Capacity"]
    G -->|"e.g., 100 tokens/min"| H["Token Pool"]
    H --> B
```
Here’s a comparison table:
| Algorithm | Accuracy | Memory | Burst Handling | Use Case |
|---|---|---|---|---|
| Fixed Window | Poor | Minimal | Bad - window boundary spikes | Quick & dirty, non-critical |
| Sliding Window Log | Perfect | High | Good | High-accuracy, small-scale |
| Sliding Window Counter | Good | Low-Med | Good | Standard distributed systems |
| Token Bucket | Good | Low | Controlled | General-purpose, allows bursts |
| Leaky Bucket | Perfect | Low-Med | Smooths out | Traffic shaping, steady-state |
Distributing Rate Limits Across Servers
Here’s where theory meets painful reality. If you have five API servers and use in-memory counters, each server independently tracks requests. One client can send 100 requests to each server (500 total) before any individual server rate limits them. Your limit is now effectively multiplied by the number of servers.
The solution: a centralized rate limiting service. Redis is the industry standard because it’s fast, atomic, and built for distributed counting.
Redis-Based Rate Limiting
A simple Python implementation using Redis:
```python
import redis
import time

class RateLimiter:
    def __init__(self, redis_client, limit=100, window=60):
        self.redis = redis_client
        self.limit = limit
        self.window = window

    def is_allowed(self, client_id):
        key = f"rate_limit:{client_id}"
        current = self.redis.incr(key)
        if current == 1:
            # First request in this window, set expiry
            self.redis.expire(key, self.window)
        remaining = self.limit - current
        reset_time = int(time.time()) + self.window
        return {
            'allowed': current <= self.limit,
            'remaining': max(0, remaining),
            'reset': reset_time
        }
```
The problem: INCR and EXPIRE are two separate commands. If the process crashes or the connection drops between them, the counter is created without an expiry and the client stays limited indefinitely; the reset time is also only an approximation. For true atomicity at scale, use Lua scripting:
```lua
-- Redis Lua script for atomic rate limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('incr', key)
if current == 1 then
    redis.call('expire', key, window)
end

local remaining = math.max(0, limit - current)
local reset_ttl = redis.call('ttl', key)  -- seconds until the window resets

-- Lua booleans don't survive the reply conversion, so return 1/0 instead
local allowed = 0
if current <= limit then allowed = 1 end
return {allowed, remaining, reset_ttl}
```
Call this from your application with EVAL, and Redis executes it atomically.
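A minimal call site using redis-py's `register_script`, which handles EVAL/EVALSHA for you. The `rate_limit.lua` filename and the `check` wrapper are illustrative assumptions, not part of any library:

```python
import time
import redis

r = redis.Redis()

# The Lua script shown above, stored in a file next to the application code
with open("rate_limit.lua") as f:
    rate_limit_script = r.register_script(f.read())

def check(client_id: str, limit: int = 100, window: int = 60):
    allowed, remaining, ttl = rate_limit_script(
        keys=[f"rate_limit:{client_id}"],
        args=[limit, window],
    )
    # Convert the remaining TTL into an absolute reset timestamp
    return bool(allowed), remaining, int(time.time()) + max(ttl, 0)
```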
API Gateway Rate Limiting
Most platforms handle rate limiting before requests hit your application servers:
AWS API Gateway lets you configure throttling per stage:
```yaml
MethodSettings:
  - ThrottlingBurstLimit: 5000
    ThrottlingRateLimit: 2000
```
Kong configures rate limiting as a plugin:
```yaml
plugins:
  - name: rate-limiting
    config:
      second: 10
      minute: 600
      hour: 36000
```
nginx offers the limit_req module:
```nginx
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api {
        limit_req zone=api_limit burst=20 nodelay;
    }
}
```
The advantage of gateway-level rate limiting: it stops bad traffic before it reaches your application, saving compute and database resources.
Tiered Limits in Practice
Real systems rarely have one rate limit. You implement tiers:
- Free tier: 1,000 requests/day
- Pro tier: 100,000 requests/day
- Enterprise: Custom limits, contact sales
- Internal services: No limits (trusted)
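In code, these tiers often reduce to a lookup table consulted before the counter check. A small sketch (the numbers mirror the list above; the enterprise entry is a placeholder for per-contract limits):

```python
# Daily request quotas per subscription tier
TIER_LIMITS = {
    "free": 1_000,
    "pro": 100_000,
    "enterprise": None,          # custom limits, negotiated per contract
    "internal": float("inf"),    # trusted internal services are effectively unlimited
}

def daily_limit(tier: str) -> float:
    """Look up the quota for a tier, defaulting unknown tiers to free."""
    limit = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
    return float("inf") if limit is None else float(limit)
```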
You also apply per-endpoint limits:
```python
# High-cost operation gets a stricter limit
@app.route('/api/expensive-ml-inference', methods=['POST'])
@rate_limit(limit=10, window=60, per='user_id')  # 10 per minute
def expensive_endpoint():
    pass

# Cheap read gets a looser limit
@app.route('/api/user/<user_id>', methods=['GET'])
@rate_limit(limit=1000, window=60, per='user_id')  # 1000 per minute
def get_user(user_id):
    pass
```
Some systems implement cost-based rate limiting: each operation costs a certain number of “tokens” based on resource consumption. A complex search query might cost 10 tokens, while a simple key-value lookup costs 1.
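A sketch of cost-based metering layered on a token bucket: each operation is assigned a token cost and the bucket is debited by that amount. The costs mirror the example above; the operation names and helper are illustrative:

```python
# Token cost per operation, roughly proportional to resource consumption
OPERATION_COSTS = {
    "complex_search": 10,   # heavy search query
    "kv_lookup": 1,         # simple key-value read
}

def charge(bucket_tokens: float, operation: str):
    """Debit the bucket by the operation's cost; deny if it can't afford it."""
    cost = OPERATION_COSTS.get(operation, 1)
    if bucket_tokens >= cost:
        return True, bucket_tokens - cost
    return False, bucket_tokens
```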
Practical Implementation: Token Bucket with Redis
Here’s a production-ready implementation combining Redis with token bucket logic:
```python
import redis
import time
from typing import Tuple

class TokenBucketLimiter:
    def __init__(self, redis_host='localhost', redis_port=6379):
        self.redis = redis.Redis(host=redis_host, port=redis_port)
        self.lua_script = self.redis.register_script("""
            local key = KEYS[1]
            local capacity = tonumber(ARGV[1])
            local refill_rate = tonumber(ARGV[2])  -- tokens per second
            local now = tonumber(ARGV[3])

            -- Read the bucket state explicitly by field name
            local state = redis.call('hmget', key, 'tokens', 'last_refill')
            local tokens = tonumber(state[1]) or capacity
            local last_refill = tonumber(state[2]) or now

            -- Calculate tokens to add since last refill
            local elapsed = math.max(0, now - last_refill)
            local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

            if new_tokens >= 1 then
                redis.call('hmset', key, 'tokens', new_tokens - 1, 'last_refill', now)
                -- Expire idle buckets (TTL of `capacity` seconds) so abandoned clients don't leak keys
                redis.call('expire', key, capacity)
                return {1, math.floor(new_tokens - 1)}
            else
                return {0, math.floor(new_tokens)}
            end
        """)

    def allow_request(self, client_id: str, capacity: int = 100,
                      refill_rate: float = 10) -> Tuple[bool, int]:
        """Returns (allowed, tokens_remaining)"""
        key = f"token_bucket:{client_id}"
        allowed, remaining = self.lua_script(
            keys=[key],
            args=[capacity, refill_rate, time.time()]
        )
        return bool(allowed), remaining
```
When a request comes in, you check once—the Lua script handles all the complex state management atomically.
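Wiring it into a request handler might look like the Flask-style sketch below. The route, the 100-token capacity, the 10 tokens/second refill, and the Redis instance on localhost are all example assumptions:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
limiter = TokenBucketLimiter()  # assumes Redis on localhost:6379

@app.route("/api/data")
def get_data():
    client_id = request.headers.get("X-API-Key", request.remote_addr)
    allowed, remaining = limiter.allow_request(client_id, capacity=100, refill_rate=10)
    if not allowed:
        response = jsonify(error="rate limit exceeded")
        response.status_code = 429
        response.headers["Retry-After"] = "1"  # at 10 tokens/s, a token is roughly 0.1s away
        return response
    return jsonify(data="...", tokens_remaining=remaining)
```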
Client-Side Resilience: Exponential Backoff
Your API can be perfect, but client code matters too. A well-behaved client respects rate limits and retries intelligently.
```python
import requests
import time

def api_call_with_backoff(url, max_retries=3):
    retry_delay = 1  # Start with 1 second
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response

        # Extract retry-after if provided
        retry_after = response.headers.get('Retry-After')
        if retry_after:
            wait_time = int(retry_after)
        else:
            wait_time = retry_delay

        print(f"Rate limited. Waiting {wait_time}s before retry...")
        time.sleep(wait_time)
        retry_delay = min(retry_delay * 2, 60)  # Cap at 60s

    raise Exception("Max retries exceeded")
```
The exponential backoff (1s, 2s, 4s, 8s…) prevents thundering herds when many clients get rate limited simultaneously.
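Adding random jitter spreads those simultaneous retries out even further. A small helper that could replace the fixed doubling in the loop above (the 0 to 1 second jitter range is an arbitrary choice):

```python
import random

def backoff_with_jitter(base_delay: float, attempt: int, cap: float = 60.0) -> float:
    """Exponential backoff (base * 2^attempt), capped, plus random jitter."""
    delay = min(base_delay * (2 ** attempt), cap)
    return delay + random.uniform(0, 1)  # jitter keeps clients from retrying in lockstep
```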
Design Trade-offs
Strictness vs. User Experience
Too strict and legitimate users hit limits during traffic spikes. Too lenient and you don’t actually protect the system.
Solution: Use two-level responses. Warn users when they’re approaching limits via headers, then reject at the hard limit. Add a small grace period for bursty legitimate traffic (token bucket’s burst capacity).
Distributed Complexity
Centralized rate limiting (Redis) is operationally simple but adds latency and a single point of failure. Distributed counters are faster locally but lose accuracy.
Solution: Most systems use Redis for tier 1 (per-user) limits and distributed counters for low-importance limits (per-IP abuse detection). Or use Redis in a cluster with replication.
Internal vs. External Traffic
Should your internal services respect the same limits as external APIs?
Convention: No. Internal services are trusted; they run your own code. Give them unlimited access or much higher quotas. Separate rate limit keys: external:user_id vs. internal:service_name.
Serverless Environments
In AWS Lambda or similar, you don’t have persistent in-memory state across invocations. Every check hits Redis or DynamoDB, adding latency.
Solution: Many serverless APIs skip strict per-request rate limiting and use external API Gateway throttling instead. Or accept the latency tradeoff and use a managed rate limiting service.
Key Takeaways
- Rate limiting protects your system from overload—one bad actor shouldn’t affect everyone else. Throttling delays requests instead of rejecting them, trading resources for gentleness.
- Choose your algorithm based on accuracy vs. complexity—token bucket is the sweet spot for most APIs, offering simplicity and controlled burst handling.
- Always centralize rate limiting in distributed systems—in-memory counters don’t work across multiple servers. Redis with Lua scripts provides atomic, fast rate limiting.
- Use HTTP headers to communicate limits—X-RateLimit-* headers and Retry-After allow clients to self-throttle and retry intelligently, reducing wasted requests.
- Implement tiered limits—different subscription levels, different endpoints, and different dimensions (per-user, per-IP) all need different limits. One global limit is rarely enough.
- Don’t forget the client side—clients that respect rate limits with exponential backoff reduce your operational burden and create a better user experience than constant retries.
Practice Scenarios
Scenario 1: Burst Protection
Your API supports real-time notifications. Clients naturally burst: they connect, catch up on missed events (50 requests in 1 second), then settle into normal polling. Your fixed 10 requests/second limit is too strict. How do you adapt your rate limiting to handle this legitimate burst pattern while still protecting against abuse?
Scenario 2: Cost-Based Metering
Your service offers two operations:
- Fast lookup: returns cached data in 1ms
- Slow inference: runs ML model, takes 5 seconds
Both count as “1 API call” but consume vastly different resources. Design a cost-based rate limiting system where you assign token costs per operation. A free user has 1000 tokens/day; how would you price each operation?
Scenario 3: Cascade Failure Prevention
Your rate limiter uses Redis, which goes down for 30 seconds due to a network partition. What happens to your API requests? Design a graceful degradation strategy: should you fail open (allow all requests) or fail closed (deny all requests)?
Next: Authentication and Authorization
Rate limiting controls how fast requests come in. Next, we’ll explore who is allowed to make those requests and what they’re authorized to do. Authentication and authorization are the bouncer’s clipboard—checking ID and the guest list—while rate limiting is the bouncer managing the crowd.
Understanding rate limiting deeply positions you to build robust API infrastructure. With these fundamentals in place, you’re ready to layer authentication on top, ensuring not just that requests are controlled, but that the right requests from the right users are getting through.