System Design Fundamentals

Circuit Breaker Pattern

The Cascade Scenario

Imagine this: Your payment service depends on a third-party fraud detection API. The API has always been reliable, but at 3 AM, something goes wrong. The fraud API server has a memory leak and starts responding very slowly. What was a 100ms call now takes 10 seconds.

Your payment service is still working—it’s calling the fraud API correctly. But the fraud API isn’t answering quickly. Your code has a 10-second timeout, so each request blocks a thread waiting for a response.

You have 100 threads in your payment service thread pool. At 3 AM, there are maybe two or three requests per second to your payment API. Even with 10-second waits, that ties up only 20 or 30 threads at a time, leaving plenty of headroom. But at 9 AM, when traffic increases to 50 requests per second, something horrible happens.

Every request calls the fraud API. Every request waits 10 seconds for a timeout. At 50 requests per second with 10-second waits, the pool fills within a couple of seconds: all 100 threads are blocked waiting for the fraud API. New requests arrive but there are no threads available. Your payment service is down, not because anything is wrong with it, but because it didn't protect itself from a dependency failure.

This is a cascading failure. One service’s problem spreads up the call chain, bringing down systems that are themselves healthy. The circuit breaker pattern stops this cascade.

The Three States: Closed, Open, Half-Open

The circuit breaker is a state machine with three states:

graph LR
    A["CLOSED<br/>Requests flow through<br/>failures being tracked"] -->|Failure threshold exceeded| B["OPEN<br/>Requests fail immediately<br/>without calling service"]
    B -->|Wait timeout elapsed| C["HALF-OPEN<br/>Allow test requests<br/>to check if service recovered"]
    C -->|Test requests succeed| A
    C -->|Test requests fail| B

CLOSED State (Normal Operation)

The circuit is closed, and requests flow through to the downstream service.

Client → Circuit Breaker (CLOSED) → Fraud API Service
         [failures: 0]

The breaker tracks failures. Failure can mean:

  • Exception thrown (network error, timeout, exception from downstream)
  • Non-200 HTTP response code
  • Response time exceeds threshold (“slow call”)

As long as failures remain below a threshold, the circuit stays closed.

OPEN State (Failing Fast)

When the failure threshold is exceeded, the circuit opens. Now requests are rejected immediately without calling the downstream service.

Client → Circuit Breaker (OPEN) ✗ → Request fails immediately
         [failures: 15, threshold: 10]
         [no downstream call made]

This is the key insight: fail fast without wasting resources on a broken dependency. The threads that would have been blocked waiting for the fraud API are freed up to serve other requests.

The open state lasts for a configured duration (usually 30-60 seconds). During this time, the service is assumed to be unhealthy and is skipped entirely.

HALF-OPEN State (Testing Recovery)

After the wait duration, the circuit transitions to half-open. Now it allows a small number of test requests to see if the downstream has recovered.

Client #1 → Circuit Breaker (HALF-OPEN) → Fraud API
            [test request 1/3]

Client #2 → Circuit Breaker (HALF-OPEN) → Fraud API
            [test request 2/3]

If test requests succeed, the downstream has likely recovered. The circuit closes and normal traffic resumes. If test requests fail, the circuit opens again and waits another 30 seconds.
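
To make the transitions concrete, here is a minimal sketch of the state machine in Java. The class and field names are illustrative, and it deliberately omits the sliding-window bookkeeping and failure-rate math that real libraries such as Resilience4j or Polly handle for you:

import java.time.Duration;
import java.time.Instant;

// Illustrative sketch of the three-state circuit breaker.
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private int halfOpenSuccesses = 0;
    private Instant openedAt;

    private final int failureThreshold = 10;                 // consecutive failures before opening
                                                             // (real breakers track a failure *rate*)
    private final Duration waitInOpen = Duration.ofSeconds(30);
    private final int permittedHalfOpenCalls = 3;

    public synchronized boolean allowRequest() {
        if (state == State.OPEN && Instant.now().isAfter(openedAt.plus(waitInOpen))) {
            state = State.HALF_OPEN;                          // wait elapsed: probe the dependency
            halfOpenSuccesses = 0;
        }
        return state != State.OPEN;                           // OPEN rejects immediately
    }

    public synchronized void recordSuccess() {
        if (state == State.HALF_OPEN && ++halfOpenSuccesses >= permittedHalfOpenCalls) {
            state = State.CLOSED;                             // recovery confirmed
            failures = 0;
        } else if (state == State.CLOSED) {
            failures = 0;
        }
    }

    public synchronized void recordFailure() {
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN;                               // trip (or re-trip) the breaker
            openedAt = Instant.now();
        }
    }
}

Callers check allowRequest() before invoking the dependency and report the outcome with recordSuccess() or recordFailure(); everything else in this section is tuning how those transitions happen.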

Configuration: The Knobs You Turn

Real circuit breaker implementations have several tunable parameters. Getting them right is critical:

Failure Rate Threshold

What percentage of requests must fail before opening the circuit?

Example: 50% failure rate threshold
  Last 20 requests: 15 succeeded, 5 failed
  Failure rate: 25% (below 50% threshold)
  Circuit: CLOSED

  Last 20 requests: 5 succeeded, 15 failed
  Failure rate: 75% (exceeds 50% threshold)
  Circuit: OPEN

Too aggressive (10% threshold) and you open the circuit too early, rejecting healthy requests. Too lenient (90% threshold) and you stay on a broken service too long.

Typical value: 50% failure rate.

Slow Call Rate Threshold

Some failures aren’t errors—they’re just slow responses. If 30% of requests take more than 5 seconds, the service is degraded even though it’s not returning errors.

Response Time Threshold: 1 second
Last 20 requests:
  ├─ 15 completed in < 1s (normal)
  └─ 5 took > 1s (slow calls)

Slow call rate: 25%
If threshold is 20%, circuit opens

Typical value: 50% slow call rate with response time threshold of 2x normal latency.

Minimum Number of Calls

You don’t want to open a circuit based on just 1 or 2 requests. What if those requests were anomalies?

Minimum calls to check: 10
Circuit state: CLOSED

Request 1: Success
Request 2: Timeout
Request 3: Success
Request 4: Timeout
Request 5: Timeout (3 failures out of 5, 60% fail rate)

But we've only had 5 calls, less than minimum 10
Circuit: Still CLOSED (not enough data)

After 10 calls, if 60% are failing
Circuit: OPEN

Typical value: 10 calls minimum.
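
Tying the last three knobs together, here is a hedged sketch of how a breaker might evaluate a count-based sliding window of recent calls. The class, window size, and thresholds are illustrative; libraries like Resilience4j implement count- and time-based windows for you:

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: checks a window of recent calls against the
// minimum-calls, failure-rate, and slow-call-rate thresholds.
public class WindowEvaluator {
    record CallOutcome(boolean failed, long durationMs) {}

    private final Deque<CallOutcome> window = new ArrayDeque<>();
    private final int windowSize = 20;
    private final int minimumCalls = 10;
    private final double failureRateThreshold = 0.5;    // 50%
    private final double slowCallRateThreshold = 0.5;   // 50%
    private final long slowCallDurationMs = 2000;       // e.g. 2x normal latency

    public void record(boolean failed, long durationMs) {
        if (window.size() == windowSize) window.removeFirst();
        window.addLast(new CallOutcome(failed, durationMs));
    }

    public boolean shouldOpen() {
        if (window.size() < minimumCalls) return false;  // not enough data yet
        double failureRate = window.stream().filter(CallOutcome::failed).count() / (double) window.size();
        double slowRate = window.stream().filter(c -> c.durationMs() > slowCallDurationMs).count() / (double) window.size();
        return failureRate >= failureRateThreshold || slowRate >= slowCallRateThreshold;
    }
}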

Wait Duration in Open State

How long before trying again?

Circuit opens at 9:00:00 AM
  [failure detected in fraud API]

Wait duration: 30 seconds

Circuit tries half-open at 9:00:30 AM
  [test request to fraud API]

Too short and you hammer a still-failing service. Too long and you reject traffic unnecessarily after the service recovers.

Typical value: 30-60 seconds. Some systems use exponential backoff (30s, then 60s, then 120s) if the service keeps failing.
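
The backoff itself is easy to compute. A minimal sketch in plain Java, with a hypothetical 5-minute cap (libraries typically expose this as a configurable wait-interval function):

import java.time.Duration;

// Illustrative exponential backoff for the open-state wait: 30s, 60s, 120s, ... capped at 5 minutes.
public class OpenStateBackoff {
    private static final Duration BASE = Duration.ofSeconds(30);
    private static final Duration MAX = Duration.ofMinutes(5);

    // reopenCount = how many times the circuit has re-opened in a row (0 for the first open)
    public static Duration waitFor(int reopenCount) {
        Duration candidate = BASE.multipliedBy(1L << Math.min(reopenCount, 10));
        return candidate.compareTo(MAX) > 0 ? MAX : candidate;
    }
}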

Permitted Calls in Half-Open State

How many test requests before deciding if recovery succeeded?

Circuit in HALF-OPEN
Permitted calls: 3

Test request 1: Success
Test request 2: Success
Test request 3: Success

All 3 succeeded → Circuit closes

One test request is risky (one lucky success and you’re wrong). Three is reasonable. Five is overly cautious.

Typical value: 3 calls.

Resilience4j (Java)

CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("fraudApi");

CircuitBreaker customBreaker = CircuitBreaker.of("fraudApi",
    CircuitBreakerConfig.custom()
        .failureRateThreshold(50)
        .slowCallRateThreshold(50)
        .slowCallDurationThreshold(Duration.ofSeconds(2))
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .permittedNumberOfCallsInHalfOpenState(3)
        .minimumNumberOfCalls(10)
        .build());

Supplier<String> fraudCheck = CircuitBreaker.decorateSupplier(
    circuitBreaker,
    () -> fraudApiClient.checkTransaction(txn)
);

try {
    String result = fraudCheck.get();
} catch (CallNotPermittedException e) {
    // Circuit is OPEN - fail fast
    log.warn("Fraud API unavailable, rejecting transaction");
}

Polly (.NET)

var policy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 10,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (outcome, duration) => {
            logger.LogWarning($"Circuit opened for {duration.TotalSeconds}s");
        },
        onReset: () => logger.LogInformation("Circuit closed")
    );

var response = await policy.ExecuteAsync(async () =>
    await httpClient.GetAsync("https://fraud-api.example.com/check")
);

Hystrix (Deprecated but Influential)

Netflix’s Hystrix was the circuit breaker library that popularized the pattern. Although it has been in maintenance mode since 2018, many systems still use it; Netflix points new projects at Resilience4j, and many teams now move the concern into a service mesh (below).

Circuit Breaker in a Service Mesh

Modern infrastructure runs service meshes like Istio or Linkerd. The circuit breaker is configured at the infrastructure level, not in application code:

# Istio timeouts and retries (VirtualService) plus outlier detection
# (DestinationRule); the outlierDetection block provides the circuit-breaker behavior
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-api
spec:
  hosts:
  - fraud-api
  http:
  - match:
    - uri:
        prefix: /check
    route:
    - destination:
        host: fraud-api
        port:
          number: 8080
    timeout: 2s
    retries:
      attempts: 3
      perTryTimeout: 1s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: fraud-api
spec:
  host: fraud-api
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 100

Advantage: No application code changes needed. Operators configure reliability at the infrastructure level. When upgrading Istio, all services automatically get better circuit breaker logic.

Disadvantage: Less visibility into circuit breaker state from the application. Need to monitor infrastructure metrics rather than application logs.

Fallback Strategies: What Happens When the Circuit Opens?

When the circuit is open, you have several options for what to return to the user:

Option 1: Fail Loudly (User Sees Error)

try {
    String result = fraudCheck.get();
} catch (CallNotPermittedException e) {
    // Circuit is open
    return ResponseEntity
        .status(503) // Service Unavailable
        .body("Fraud check temporarily unavailable");
}

Honest but bad user experience. Users see “Service error, try again later.”

Option 2: Return Cached Data

try {
    String result = fraudCheck.get();
    cacheResult(txn.id, result);
    return result;
} catch (CallNotPermittedException e) {
    // Circuit is open
    String cachedResult = getFromCache(txn.id);
    if (cachedResult != null) {
        log.info("Using cached fraud check result");
        return cachedResult;
    }
    throw e; // No cache, must fail
}

If you’ve checked this transaction before, use the cached result. Better than nothing.

Option 3: Return Default/Safe Value

try {
    FraudCheckResult result = fraudApiClient.checkTransaction(txn);
    return result;
} catch (CallNotPermittedException e) {
    // Circuit is open
    // Assume it's safe rather than risky
    log.warn("Fraud API unavailable, allowing transaction");
    return FraudCheckResult.ALLOWED;
}

Works if your default is safe. For fraud detection, allowing by default is risky. For a recommendation engine, returning “no recommendation” is safe.

Option 4: Queue for Later

try {
    return fraudApiClient.checkTransaction(txn);
} catch (CallNotPermittedException e) {
    // Circuit is open
    queueForAsyncProcessing(txn);
    return ResponseEntity.accepted().build();
}

Tell the user “we’ll process this when the service recovers” and queue it. Works for non-critical paths.

Option 5: Call an Alternative Service

try {
    return primaryFraudApi.checkTransaction(txn);
} catch (CallNotPermittedException e) {
    // Primary API unavailable, try backup
    log.warn("Primary fraud API unavailable, using backup");
    return backupFraudApi.checkTransaction(txn);
}

If you have a secondary fraud detection service, use it. Requires maintaining multiple external integrations.
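
If you are using Resilience4j, several of these fallbacks can be wired declaratively with its Decorators helper instead of hand-written try/catch blocks. A hedged sketch, reusing circuitBreaker, txn, primaryFraudApi, and backupFraudApi from the snippets above and assuming the resilience4j-all module that provides Decorators:

// Sketch: circuit breaker plus fallback via Resilience4j's Decorators.
// Falls back to the backup client only when the breaker rejects the call.
Supplier<FraudCheckResult> decorated = Decorators
    .ofSupplier(() -> primaryFraudApi.checkTransaction(txn))
    .withCircuitBreaker(circuitBreaker)
    .withFallback(
        List.of(CallNotPermittedException.class),
        e -> backupFraudApi.checkTransaction(txn))
    .decorate();

FraudCheckResult result = decorated.get();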

Monitoring and Alerting Circuit Breakers

A circuit breaker that opens silently is dangerous. You need visibility:

Metrics to track:
  - Number of times circuit opened (alerting threshold: any opening)
  - Time spent in OPEN state (indicates slow recovery)
  - Number of requests rejected while OPEN
  - Failure rate that caused the opening

Dashboards should show:
  - Current state per circuit (CLOSED, OPEN, HALF-OPEN)
  - State transition history (when did each circuit last open?)
  - Correlation between circuit state and user impact

Alert on circuit opening:

- Name: "Circuit Breaker Opened"
- Condition: Any circuit transitions to OPEN
- Severity: Warning
- Action: Page on-call engineer to investigate upstream service

Alert on stuck OPEN state:

- Name: "Circuit Breaker Stuck Open"
- Condition: Circuit remains OPEN for more than 5 minutes
- Severity: Critical
- Action: Might indicate that the upstream service is fundamentally broken, not just temporarily failing
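
If you are using Resilience4j (as in the example earlier), state transitions can be pushed into your logs and metrics through its event publisher, which is a convenient hook for this kind of alerting. A brief sketch, assuming an SLF4J-style logger named log:

// Sketch: log every state transition so dashboards and alerts can pick it up.
circuitBreaker.getEventPublisher()
    .onStateTransition(event ->
        log.warn("Circuit '{}' transitioned: {}",
            event.getCircuitBreakerName(),
            event.getStateTransition()))
    .onCallNotPermitted(event ->
        log.debug("Circuit '{}' rejected a call (OPEN)", event.getCircuitBreakerName()));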

Circuit Breaker vs. Timeout vs. Retry: The Trio

These three patterns work together:

Timeout: "Wait maximum 2 seconds, then give up"
Retry: "If it fails, try up to 3 times"
Circuit Breaker: "If lots of calls are failing, stop even trying"

A well-designed resilient call looks like:

// Timeout: fail fast if the service is slow (Apache HttpClient request config)
RequestConfig requestConfig = RequestConfig.custom()
    .setConnectTimeout(2000)
    .setSocketTimeout(2000)
    .build();
CloseableHttpClient httpClient = HttpClients.custom()
    .setDefaultRequestConfig(requestConfig)
    .build();

// Retry: transient failures often succeed on retry (Resilience4j Retry)
Retry retry = Retry.of("myService", RetryConfig.custom()
    .maxAttempts(3)                        // initial call plus 2 retries
    .waitDuration(Duration.ofMillis(100))
    .retryExceptions(IOException.class)
    .build());

// Circuit Breaker: persistent failures trip the breaker
CircuitBreaker breaker = CircuitBreaker.ofDefaults("myService");

// All together: breaker wraps retry, retry wraps the timed HTTP call
Supplier<String> call = CircuitBreaker.decorateSupplier(
    breaker,
    Retry.decorateSupplier(retry, () -> doApiCall())
);

The order matters: Circuit Breaker (outermost) prevents hammering when the service is down. Retry (middle) handles transient failures. Timeout (innermost) ensures we don’t wait forever.

Common Mistakes

Mistake 1: Threshold Too Aggressive

Opening the circuit after just 3 failed requests out of 10 is too sensitive: a momentary blip of a few slow or failed responses will trip the breaker and reject healthy traffic.

Fix: Use minimum call count of 10-20 and failure rate of 50%+.

Mistake 2: No Fallback Strategy

Circuit opens and users see blank error pages. No graceful degradation.

Fix: Plan fallback strategies before circuit opens. Cache, default values, alternative services.

Mistake 3: Circuit Breaker Without Monitoring

Circuit opens at 3 AM. On-call engineer doesn’t notice until 8 AM when customer complaints arrive.

Fix: Alert on every circuit opening. Have dashboards showing circuit state.

Mistake 4: Too Long Wait Duration

Circuit opens, waits 5 minutes before half-opening. Service recovered after 30 seconds but users are rejected for 4.5 more minutes.

Fix: Start with 30 seconds, tune based on actual recovery times.

Mistake 5: Forgetting About the Whole Chain

You add a circuit breaker to the fraud API call but forget about the database call that happens earlier in the same request. The database fails, and requests pile up anyway: the fraud API call is protected (which is fine), but those same threads are still blocked waiting on the database.

Fix: Add circuit breakers at every external dependency, not just the most obvious ones.

Key Takeaways

  • Circuit breaker pattern prevents cascading failures by failing fast when dependencies are unhealthy
  • Three states (CLOSED, OPEN, HALF-OPEN) with configurable transitions
  • Configure thresholds carefully: failure rate, slow call rate, minimum call count, wait duration
  • Fallback strategies are essential: cached data, default values, alternative services, or graceful degradation
  • Monitor circuit breaker state actively; alerting on state changes is critical
  • Use circuit breaker alongside timeout and retry for complete resilience
  • Service meshes enable circuit breaker configuration at infrastructure level, eliminating code changes

The circuit breaker pattern, combined with the redundancy and fault tolerance we discussed earlier, forms the foundation of reliable distributed systems. In the next section (bulkhead pattern), we’ll look at how to isolate failures so that one degraded component doesn’t starve resources from others.