System Design Fundamentals

Retry & Exponential Backoff

The Transient Failure Problem

Your application makes a request to a database. The request times out. But the database itself is healthy — it’s just a network hiccup. A packet was lost, a switch was momentarily overloaded, a route briefly flapped. The next request will almost certainly succeed.

This is a transient failure: a temporary error that has a good chance of succeeding if you try again. But here’s the trap: what if the database is actually overloaded? If you retry immediately, you’re adding more load to an already struggling system. If everyone retries at the same time, you’ve turned a minor slowdown into a retry storm — a cascading amplification of load that collapses the system.

Naive retry logic kills systems. Smart retry logic saves them. The difference is exponential backoff and jitter.

Transient vs. Permanent Failures

The first rule of retry logic: know what to retry.

Transient failures are worth retrying:

  • Network timeouts and connection resets
  • HTTP 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout
  • HTTP 429 Too Many Requests (rate limiting — wait and retry)
  • DNS resolution failures caused by a briefly unreachable resolver (a nonexistent name is a permanent failure)
  • Database connection pool exhaustion

Permanent failures are not worth retrying:

  • HTTP 400 Bad Request — your request is malformed, retrying won’t fix it
  • HTTP 401 Unauthorized, 403 Forbidden — authentication/authorization issues don’t change on retry
  • HTTP 404 Not Found — the resource doesn’t exist, retrying won’t create it
  • Application logic errors (division by zero, null pointer exceptions)
  • Invalid data format errors

Retrying a 400 is like knocking on a locked door repeatedly — the door won’t unlock because you knocked harder. You’re just wasting effort.
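
As a concrete illustration, here is a minimal sketch of how a client might classify a status code before deciding to retry. The function name and groupings are illustrative (they mirror the lists above), not from any particular library; network-level errors like timeouts and connection resets would be treated as retryable before a status code even exists.

# Hypothetical helper: classify an HTTP status code before retrying.
# The groupings mirror the transient/permanent lists above.
RETRYABLE_STATUS_CODES = {429, 502, 503, 504}

def is_retryable(status_code: int) -> bool:
    if status_code in RETRYABLE_STATUS_CODES:
        return True                       # transient: retry with backoff
    if 400 <= status_code < 500:
        return False                      # permanent: fix the request instead
    return 500 <= status_code < 600       # other 5xx: usually worth retrying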

Exponential Backoff with Jitter

Here’s the core formula:

delay = base * 2^attempt

Where base is your starting delay (often 100 milliseconds) and attempt is which retry this is (0, 1, 2, …).

  • Attempt 0 (first retry): 100ms * 2^0 = 100ms
  • Attempt 1 (second retry): 100ms * 2^1 = 200ms
  • Attempt 2 (third retry): 100ms * 2^2 = 400ms
  • Attempt 3 (fourth retry): 100ms * 2^3 = 800ms
  • Attempt 4 (fifth retry): 100ms * 2^4 = 1600ms

After each failure, you wait twice as long before the next attempt. This gives the remote system time to recover.
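
The schedule above is easy to compute directly. A minimal sketch, using the same 100ms base (no jitter yet):

# Plain exponential backoff, no jitter — values match the table above
base = 0.1  # 100 milliseconds

for attempt in range(5):
    delay = base * 2 ** attempt
    print(f"attempt {attempt}: wait {delay * 1000:.0f}ms")
# attempt 0: wait 100ms ... attempt 4: wait 1600ms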

But here’s the critical addition: jitter.

Imagine all 10,000 of your clients get a timeout. Without jitter, all 10,000 clients retry after exactly 100ms. The server gets hit by 10,000 simultaneous retry requests — the thundering herd. This often causes another failure, which triggers another round of synchronized retries. The system collapses.

With jitter, you add randomness to the delay:

delay = random(0, base * 2^attempt)

Now clients retry at slightly different times. Instead of 10,000 requests arriving simultaneously, they spread out. The server handles them gradually, recovers, and serves them all successfully.
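
A minimal sketch of this variant (often called full jitter); the cap is an assumption added here to keep delays bounded:

import random

# Full jitter: pick a uniformly random delay up to the exponential ceiling
def full_jitter_delay(attempt, base=0.1, cap=10.0):
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0, ceiling)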

AWS also describes a variant called decorrelated jitter, where each delay is drawn relative to the previous one and capped:

delay = min(cap, random(base, previous_delay * 3))

The cap keeps the delays from growing unbounded, while basing each delay on the previous one keeps retry attempts well spread out.

Here’s Python code implementing exponential backoff with decorrelated jitter:

import random
import time

class TransientError(Exception):
    """Raised by `func` for failures that are worth retrying."""

def call_with_retry(func, max_attempts=5, base_delay=0.1):
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Decorrelated jitter: delay = random(base, previous * 3), capped at 10s
            delay = random.uniform(base_delay, min(delay * 3, 10))
            print(f"Attempt {attempt + 1} failed, retrying in {delay:.2f}s")
            time.sleep(delay)

Idempotency: The Prerequisite for Safe Retries

Here’s a dangerous scenario: you retry a payment charge. The first attempt times out, but the charge actually went through — the confirmation just didn’t make it back to you. You retry, and now the customer is charged twice.

Retries are only safe if the operation is idempotent — calling it multiple times has the same effect as calling it once. For a payment charge, this means:

  • The charge must be deduplicated by a unique ID (an idempotency key)
  • If the same ID arrives twice, the system recognizes it and returns the cached result instead of processing again

We covered idempotency in Chapter 13. It’s not optional when retry logic is involved.
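
To make the idea concrete, here is a minimal sketch of server-side deduplication by idempotency key. The function and field names are illustrative, and a real system would persist the results durably rather than in process memory.

# Hypothetical server-side deduplication by idempotency key.
# A real implementation would store processed results in a database.
_processed = {}  # idempotency_key -> cached result

def charge(idempotency_key: str, amount_cents: int) -> dict:
    if idempotency_key in _processed:
        # Same key seen before: return the cached result, don't charge again
        return _processed[idempotency_key]
    result = {"status": "charged", "amount_cents": amount_cents}
    _processed[idempotency_key] = result
    return result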

Retry Budgets and Amplification

Here’s where cascading failures sneak back in. In a microservice architecture, retries amplify.

Imagine this call chain:

  • Frontend → Service A → Service B → Service C

Service C’s database becomes slow. C retries (let’s say 3 times per request). But while C is retrying, Service B’s requests are timing out and B retries its calls to C (3 more times). Meanwhile, Service A’s requests to B are timing out, so A retries (3 more times).

A single failed request can fan out into 3 × 3 = 9 calls to C, and 3 × 3 × 3 = 27 attempts against C’s already struggling database. If thousands of users are affected, you’re looking at exponential load amplification — exactly the retry storm we were trying to avoid.

Retry budgets limit this: each layer gets a limited number of retries, and exceeding that budget triggers fast failure. A system might say: “You can retry 3 times total in this request chain,” not “3 times per service.” This requires coordination and visibility.

Here’s a check-based approach:

# Track retries across the call chain
class RetryBudget:
    def __init__(self, max_retries=3):
        self.remaining = max_retries

    def should_retry(self):
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False

    def propagate_to_downstream(self):
        # Pass remaining budget downstream
        return {"retry-budget": str(self.remaining)}

# Usage (call_service is a placeholder that returns True on success)
budget = RetryBudget(max_retries=3)
while not call_service():
    if not budget.should_retry():
        raise RuntimeError("Retry budget exhausted")

In practice, service meshes and distributed tracing systems manage retry budgets automatically.
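
For illustration, here is a hedged sketch of how the remaining budget could travel across service boundaries using the header emitted by propagate_to_downstream above. The endpoint and the header-reading helper are hypothetical.

import requests

# Hypothetical: the caller forwards its remaining budget as a header...
budget = RetryBudget(max_retries=3)
response = requests.get(
    "https://service-b.internal/orders",        # illustrative endpoint
    headers=budget.propagate_to_downstream(),   # {"retry-budget": "3"}
    timeout=2,
)

# ...and the callee reconstructs its budget from that header instead of
# starting a fresh one.
def budget_from_headers(headers) -> RetryBudget:
    remaining = int(headers.get("retry-budget", 3))
    return RetryBudget(max_retries=remaining)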

HTTP Client Retry Configuration

Most HTTP clients provide built-in retry configuration. Here’s axios (JavaScript):

import axios from 'axios';
import axiosRetry from 'axios-retry';

axiosRetry(axios, {
  retries: 3,
  retryDelay: (retryCount) => {
    const baseDelay = 100;
    const delay = baseDelay * Math.pow(2, retryCount - 1);
    const jitter = Math.random() * delay;
    return jitter;
  },
  retryCondition: (error) => {
    // Retry on network errors, 5xx, and 429
    return !error.response ||
           error.response.status >= 500 ||
           error.response.status === 429;
  },
});

axios.get('/api/data').then(response => {
  // Automatically retried on transient failures
});

And requests (Python):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Configure retry strategy
retry_strategy = Retry(
    total=3,  # Total retries
    backoff_factor=1,  # exponential backoff between attempts (exact schedule depends on the urllib3 version)
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"]  # Don't retry POST by default
)

adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)

response = session.get('https://api.example.com/data')

Note the allowed_methods setting (called method_whitelist in older urllib3 releases): GET is idempotent, so it’s safe to retry. POST is not idempotent by default, so only retry it if you’re sure the endpoint is idempotent.
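
If you do retry a POST, pair it with an idempotency key so duplicates are harmless. A minimal client-side sketch; the endpoint and header name are illustrative (Stripe-style APIs use a similar Idempotency-Key header):

import uuid
import requests

# Hypothetical: attach the same idempotency key to every attempt so the
# server can deduplicate a retried POST.
idempotency_key = str(uuid.uuid4())

def place_order():
    return requests.post(
        "https://api.example.com/checkout",        # illustrative endpoint
        json={"cart_id": "abc123"},
        headers={"Idempotency-Key": idempotency_key},
        timeout=2,
    )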

Retry Policies in gRPC

gRPC supports declarative retry policies, configured in the service config published by the service owner:

{
  "methodConfig": [{
    "name": [{"service": "myservice.PaymentService"}],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "10s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": [
        "UNAVAILABLE",
        "RESOURCE_EXHAUSTED"
      ]
    }
  }]
}

The service config defines which status codes are retryable. Client libraries automatically retry according to this policy, with exponential backoff, and the mechanism is transparent to application code.
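
A client can also supply this service config itself when it creates a channel. A minimal Python sketch, assuming the grpcio package; the target address is illustrative:

import json
import grpc

# The same retry policy as above, passed to the channel as a service config.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "myservice.PaymentService"}],
        "retryPolicy": {
            "maxAttempts": 4,
            "initialBackoff": "0.1s",
            "maxBackoff": "10s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
        }
    }]
})

channel = grpc.insecure_channel(
    "payments.internal:50051",  # illustrative target
    options=[
        ("grpc.enable_retries", 1),
        ("grpc.service_config", service_config),
    ],
)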

Timeout Configuration is Half the Battle

Retries only work if you have reasonable timeouts. If your timeout is 30 seconds, your client will wait 30 seconds, then retry and wait another 30 seconds — total 60 seconds before failing. Users give up well before that.

A practical approach: set timeouts aggressively.

# Aggressive timeouts for public APIs
response = requests.get(url, timeout=2)  # 2 seconds

# More lenient for internal services
response = requests.get(internal_url, timeout=5)

# Different timeouts for different phases
# (connection timeout, read timeout)
response = requests.get(url, timeout=(1, 5))

Combine aggressive timeouts with exponential backoff. You fail fast on transient issues and quickly move to the next retry.
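
Putting the two together, a request can use a tight per-attempt timeout inside the call_with_retry helper from earlier. A minimal sketch; the URL is illustrative, and timeouts are mapped to the TransientError class defined above:

import requests

# Combine a tight per-attempt timeout with the earlier retry helper.
def fetch_data():
    try:
        return requests.get("https://api.example.com/data", timeout=2)
    except (requests.ConnectionError, requests.Timeout) as exc:
        raise TransientError(str(exc)) from exc

response = call_with_retry(fetch_data, max_attempts=3, base_delay=0.1)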

Monitoring Retry Behavior

Here’s what to instrument:

import logging
import time

def call_with_instrumented_retry(func, max_attempts=3, base_delay=0.1):
    for attempt in range(max_attempts):
        try:
            result = func()
            if attempt > 0:
                logging.info(f"Success after {attempt} retries")
            return result
        except TransientError as e:
            if attempt == max_attempts - 1:
                logging.error(f"Failed after {max_attempts} attempts")
                raise
            logging.warning(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(base_delay * 2 ** attempt)  # back off before retrying (jitter omitted for brevity)

Alert on high retry rates. If you see retries spiking, it often indicates:

  • A service is becoming unhealthy (go investigate!)
  • A dependency is overloaded (need more capacity)
  • Network conditions have degraded
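
One way to make retry rates alertable is to export them as a metric. A minimal sketch, assuming the prometheus_client package; the metric and label names are illustrative:

from prometheus_client import Counter

# Illustrative metric: count retry attempts per downstream dependency
RETRY_COUNTER = Counter(
    "client_retries_total",
    "Number of retry attempts against downstream services",
    ["dependency"],
)

# Inside the retry loop, alongside the warning log:
RETRY_COUNTER.labels(dependency="payment-service").inc()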

Trade-Offs and Limitations

Latency Increase: Retries add latency. A user might wait 5+ seconds through retry backoff for a request that would normally take 100ms. For time-sensitive operations, this is unacceptable. Design retry strategies differently for different operations.

Retry Storms in Cascades: Despite best efforts, complex call chains can amplify retries. Retry budgets help, but require system-wide coordination.

Non-Idempotent Operations: Many real operations aren’t idempotent (fund transfers, reservations, inventory deductions). Either make them idempotent with keys, or don’t retry them.

Resource Consumption: Retrying consumes resources — threads, memory, bandwidth. A retry storm consumes more resources than the original failure. This is why bulkheads (Chapter 89) matter.

Key Takeaways

  • Distinguish transient failures (safe to retry: network timeouts, 503) from permanent failures (unsafe: 400, 404, auth errors)
  • Exponential backoff with jitter prevents thundering herd — delays grow after each retry, and randomness spreads retry attempts across time
  • Decorrelated jitter (AWS formula) balances backoff growth with good distribution
  • Idempotency is a prerequisite for safe retries — the same request must be safe to call multiple times
  • Retry budgets limit cascading amplification in deep call chains
  • HTTP clients, gRPC, and service meshes provide built-in retry mechanisms — configure them, don’t hand-roll
  • Aggressive timeouts work with exponential backoff to fail fast and retry efficiently
  • Monitor retry rates closely — spikes indicate unhealthy downstream services

Practice Scenarios

Scenario 1: Retry Storm Analysis

Your system has three layers: API Gateway → Order Service → Payment Service. A network glitch causes Payment Service to become temporarily unavailable (returns 503 for 5 seconds).

  • If each service retries 3 times with 100ms fixed intervals, how many requests hit Payment Service during those 5 seconds?
  • How does exponential backoff with jitter improve the situation?
  • What would a retry budget of 2 do to limit amplification?

Scenario 2: Configuring Retries for Different Operations

You’re building an e-commerce platform with these operations:

  • GET /products — fetching a product list
  • POST /checkout — processing a purchase
  • GET /inventory/:id — checking stock level

For each operation, decide: should you retry? If yes, what errors should trigger a retry? What timeout values make sense? Consider the user’s experience in each case.

With proper retry logic in place, we’ve handled transient failures gracefully. But what about the failures that don’t recover? That’s where graceful degradation enters the picture.