Circuit Breaker Pattern
The Cascade Scenario
Imagine this: Your payment service depends on a third-party fraud detection API. The API has always been reliable, but at 3 AM, something goes wrong. The fraud API server has a memory leak and starts responding very slowly. What was a 100ms call now takes 10 seconds.
Your payment service is still working—it’s calling the fraud API correctly. But the fraud API isn’t answering quickly. Your code has a 10-second timeout, so each request blocks a thread waiting for a response.
You have 100 threads in your payment service thread pool. At 3 AM, there are maybe 10 requests per second to your payment API. Even with every call hanging for the full 10 seconds, the pool can just barely absorb that. But at 9 AM, when traffic increases to 50 requests per second, something horrible happens.
Every request calls the fraud API. Every request waits 10 seconds for a timeout. At 50 requests per second, all 100 threads are blocked within about two seconds. New requests arrive but there are no threads available. Your payment service is down—not because anything is wrong with it, but because it didn’t protect itself from a dependency failure.
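The arithmetic, using the numbers above:
Thread pool: 100 threads
Each fraud API call holds a thread for up to 10 seconds (the timeout)
Sustainable load: 100 threads ÷ 10 s = 10 requests/second
3 AM: 10 requests/second → the pool is right at its limit
9 AM: 50 requests/second → 100 threads consumed in 100 ÷ 50 = 2 seconds
Every request after that finds no free thread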
This is a cascading failure. One service’s problem spreads up the call chain, bringing down systems that are themselves healthy. The circuit breaker pattern stops this cascade.
The Three States: Closed, Open, Half-Open
The circuit breaker is a state machine with three states:
graph LR
A["CLOSED<br/>Requests flow through<br/>failures being tracked"] -->|Failure threshold exceeded| B["OPEN<br/>Requests fail immediately<br/>without calling service"]
B -->|Wait timeout elapsed| C["HALF-OPEN<br/>Allow test requests<br/>to check if service recovered"]
C -->|Test requests succeed| A
C -->|Test requests fail| B
CLOSED State (Normal Operation)
The circuit is closed, and requests flow through to the downstream service.
Client → Circuit Breaker (CLOSED) → Fraud API Service
[failures: 0]
The breaker tracks failures. Failure can mean:
- Exception thrown (network error, timeout, exception from downstream)
- Non-200 HTTP response code
- Response time exceeds threshold (“slow call”)
As long as failures remain below a threshold, the circuit stays closed.
OPEN State (Failing Fast)
When the failure threshold is exceeded, the circuit opens. Now requests are rejected immediately without calling the downstream service.
Client → Circuit Breaker (OPEN) ✗ → Request fails immediately
[failures: 15, threshold: 10]
[no downstream call made]
This is the key insight: fail fast without wasting resources on a broken dependency. The threads that would have been blocked waiting for the fraud API are freed up to serve other requests.
The open state lasts for a configured duration (usually 30-60 seconds). During this time, the service is assumed to be unhealthy and is skipped entirely.
HALF-OPEN State (Testing Recovery)
After the wait duration, the circuit transitions to half-open. Now it allows a small number of test requests to see if the downstream has recovered.
Client #1 → Circuit Breaker (HALF-OPEN) → Fraud API
[test request 1/3]
Client #2 → Circuit Breaker (HALF-OPEN) → Fraud API
[test request 2/3]
If test requests succeed, the downstream has likely recovered. The circuit closes and normal traffic resumes. If test requests fail, the circuit opens again and waits another 30 seconds.
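To make the mechanics concrete, here is a minimal sketch of the state machine in Java. It is illustrative only: the class and field names (SimpleCircuitBreaker, failureThreshold, openWait, halfOpenPermits) are assumptions, and it trips on consecutive failures rather than the failure rate that real libraries use (covered next).
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final Duration openWait;      // how long to stay OPEN before testing
    private final int halfOpenPermits;    // successful test calls needed to close

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private int halfOpenSuccesses = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openWait, int halfOpenPermits) {
        this.failureThreshold = failureThreshold;
        this.openWait = openWait;
        this.halfOpenPermits = halfOpenPermits;
    }

    public synchronized <T> T call(Supplier<T> action) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openWait))) {
                state = State.HALF_OPEN;              // wait elapsed: let test requests through
                halfOpenSuccesses = 0;
            } else {
                throw new IllegalStateException("circuit is OPEN - failing fast");
            }
        }
        try {
            T result = action.get();                  // the downstream call
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private void onSuccess() {
        if (state == State.HALF_OPEN && ++halfOpenSuccesses >= halfOpenPermits) {
            state = State.CLOSED;                     // recovery confirmed
        }
        consecutiveFailures = 0;
    }

    private void onFailure() {
        if (state == State.HALF_OPEN || ++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;                       // trip (or re-trip) the breaker
            openedAt = Instant.now();
        }
    }
}
Production libraries add sliding windows, slow-call tracking, per-call metrics, and more careful concurrency than this single synchronized method.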
Configuration: The Knobs You Turn
Real circuit breaker implementations have several tunable parameters. Getting them right is critical:
Failure Rate Threshold
What percentage of requests must fail before opening the circuit?
Example: 50% failure rate threshold
Last 20 requests: 15 succeeded, 5 failed
Failure rate: 25% (below 50% threshold)
Circuit: CLOSED
Last 20 requests: 5 succeeded, 15 failed
Failure rate: 75% (exceeds 50% threshold)
Circuit: OPEN
Too aggressive (10% threshold) and you open the circuit too early, rejecting healthy requests. Too lenient (90% threshold) and you stay on a broken service too long.
Typical value: 50% failure rate.
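A sketch of the bookkeeping behind this check, assuming a count-based sliding window over the last N outcomes (the class and method names are invented for illustration; it also folds in the minimum-call check described in the "Minimum Number of Calls" section below):
import java.util.ArrayDeque;
import java.util.Deque;

// Count-based sliding window: remembers the last `windowSize` outcomes and
// reports the failure rate over them.
class FailureRateWindow {
    private final int windowSize;
    private final Deque<Boolean> outcomes = new ArrayDeque<>();   // true = failure
    private int failures = 0;

    FailureRateWindow(int windowSize) {
        this.windowSize = windowSize;
    }

    void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) failures++;
        if (outcomes.size() > windowSize && outcomes.removeFirst()) {
            failures--;                                           // oldest outcome falls out of the window
        }
    }

    double failureRatePercent() {
        return outcomes.isEmpty() ? 0.0 : 100.0 * failures / outcomes.size();
    }

    // Open only when there is enough data AND the rate crosses the threshold.
    boolean shouldOpen(double thresholdPercent, int minimumCalls) {
        return outcomes.size() >= minimumCalls
            && failureRatePercent() >= thresholdPercent;
    }
}
With a window of 20, a 50% threshold, and a minimum of 10 calls, the 15-failure example above makes shouldOpen return true.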
Slow Call Rate Threshold
Some failures aren’t errors—they’re just slow responses. If 30% of requests take more than 5 seconds, the service is degraded even though it’s not returning errors.
Response Time Threshold: 1 second
Last 20 requests:
├─ 15 completed in < 1s (normal)
└─ 5 took > 1s (slow calls)
Slow call rate: 25%
If threshold is 20%, circuit opens
Typical value: 50% slow call rate with response time threshold of 2x normal latency.
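Detecting a slow call is just a matter of timing it. A small sketch, assuming the 1-second threshold from the example above (the helper name is invented):
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

// Times a downstream call and records whether it was "slow". A slow call
// still returns a result; it counts toward the slow-call rate, not the
// failure rate.
static <T> T recordTimed(Supplier<T> downstreamCall, Duration slowThreshold, List<Boolean> slowCalls) {
    long start = System.nanoTime();
    T result = downstreamCall.get();
    Duration elapsed = Duration.ofNanos(System.nanoTime() - start);
    slowCalls.add(elapsed.compareTo(slowThreshold) > 0);   // true = slow call
    return result;
}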
Minimum Number of Calls
You don’t want to open a circuit based on just 1 or 2 requests. What if those requests were anomalies?
Minimum calls to check: 10
Circuit state: CLOSED
Request 1: Success
Request 2: Timeout
Request 3: Success
Request 4: Timeout
Request 5: Timeout (3 failures out of 5, 60% fail rate)
But we've only had 5 calls, less than minimum 10
Circuit: Still CLOSED (not enough data)
After 10 calls, if 60% are failing
Circuit: OPEN
Typical value: 10 calls minimum.
Wait Duration in Open State
How long before trying again?
Circuit opens at 9:00:00 AM
[failure detected in fraud API]
Wait duration: 30 seconds
Circuit tries half-open at 9:00:30 AM
[test request to fraud API]
Too short and you hammer a still-failing service. Too long and you reject traffic unnecessarily after the service recovers.
Typical value: 30-60 seconds. Some systems use exponential backoff (30s, then 60s, then 120s) if the service keeps failing.
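The exponential variant is straightforward; a sketch assuming a 30-second base and a 10-minute cap:
import java.time.Duration;

// Open-state wait grows with each consecutive re-open: 30s, 60s, 120s, ...
// capped so recovery checks never stop entirely.
static Duration openStateWait(int consecutiveOpens) {
    long baseSeconds = 30;
    long capSeconds = 600;                                     // assumed 10-minute cap
    long wait = baseSeconds << Math.min(consecutiveOpens, 10); // 30 * 2^n
    return Duration.ofSeconds(Math.min(wait, capSeconds));
}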
Permitted Calls in Half-Open State
How many test requests before deciding if recovery succeeded?
Circuit in HALF-OPEN
Permitted calls: 3
Test request 1: Success
Test request 2: Success
Test request 3: Success
All 3 succeeded → Circuit closes
One test request is risky (one lucky success and you’re wrong). Three is reasonable. Five is overly cautious.
Typical value: 3 calls.
Popular Implementations
Resilience4j (Java)
CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("fraudApi");
CircuitBreaker customBreaker = CircuitBreaker.of("fraudApi",
CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.slowCallRateThreshold(50)
.slowCallDurationThreshold(Duration.ofSeconds(2))
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(3)
.minimumNumberOfCalls(10)
.build());
Supplier<String> fraudCheck = CircuitBreaker.decorateSupplier(
circuitBreaker,
() -> fraudApiClient.checkTransaction(txn)
);
try {
String result = fraudCheck.get();
} catch (CallNotPermittedException e) {
// Circuit is OPEN - fail fast
log.warn("Fraud API unavailable, rejecting transaction");
}
Polly (.NET)
var policy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutException>()
    .OrResult<HttpResponseMessage>(r => !r.IsSuccessStatusCode)
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 10,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (outcome, duration) =>
            logger.LogWarning($"Circuit opened for {duration.TotalSeconds}s"),
        onReset: () => logger.LogInformation("Circuit closed again")
    );
var response = await policy.ExecuteAsync(async () =>
    await httpClient.GetAsync("https://fraud-api.example.com/check")
);
Hystrix (Deprecated but Influential)
Netflix’s Hystrix was the library that popularized the circuit breaker pattern. It is officially deprecated (Netflix recommends Resilience4j for new projects), but many systems still run it. Increasingly, circuit breaking is pushed into a service mesh instead (below).
Circuit Breaker in a Service Mesh
Much modern infrastructure runs a service mesh such as Istio or Linkerd. There, the circuit breaker is configured at the infrastructure level rather than in application code:
# Istio VirtualService (timeouts and retries) plus DestinationRule (outlier detection, the circuit-breaking piece)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: fraud-api
spec:
hosts:
- fraud-api
http:
- match:
- uri:
prefix: /check
route:
- destination:
host: fraud-api
port:
number: 8080
timeout: 2s
retries:
attempts: 3
perTryTimeout: 1s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: fraud-api
spec:
host: fraud-api
trafficPolicy:
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
maxEjectionPercent: 100
Advantage: No application code changes needed. Operators configure reliability at the infrastructure level. When upgrading Istio, all services automatically get better circuit breaker logic.
Disadvantage: Less visibility into circuit breaker state from the application. Need to monitor infrastructure metrics rather than application logs.
Fallback Strategies: What Happens When the Circuit Opens?
When the circuit is open, you have several options for what to return to the user:
Option 1: Fail Loudly (User Sees Error)
try {
String result = fraudCheck.get();
} catch (CallNotPermittedException e) {
// Circuit is open
return ResponseEntity
.status(503) // Service Unavailable
.body("Fraud check temporarily unavailable");
}
Honest but bad user experience. Users see “Service error, try again later.”
Option 2: Return Cached Data
try {
String result = fraudCheck.get();
cacheResult(txn.id, result);
return result;
} catch (CallNotPermittedException e) {
// Circuit is open
String cachedResult = getFromCache(txn.id);
if (cachedResult != null) {
log.info("Using cached fraud check result");
return cachedResult;
}
throw e; // No cache, must fail
}
If you’ve checked this transaction before, use the cached result. Better than nothing.
Option 3: Return Default/Safe Value
try {
FraudCheckResult result = fraudApiClient.checkTransaction(txn);
return result;
} catch (CallNotPermittedException e) {
// Circuit is open
// Assume it's safe rather than risky
log.warn("Fraud API unavailable, allowing transaction");
return FraudCheckResult.ALLOWED;
}
Works if your default is safe. For fraud detection, allowing by default is risky. For a recommendation engine, returning “no recommendation” is safe.
Option 4: Queue for Later
try {
return fraudApiClient.checkTransaction(txn);
} catch (CallNotPermittedException e) {
// Circuit is open
queueForAsyncProcessing(txn);
return ResponseEntity.accepted().build();
}
Tell the user “we’ll process this when the service recovers” and queue it. Works for non-critical paths.
Option 5: Call an Alternative Service
try {
return primaryFraudApi.checkTransaction(txn);
} catch (CallNotPermittedException e) {
// Primary API unavailable, try backup
log.warn("Primary fraud API unavailable, using backup");
return backupFraudApi.checkTransaction(txn);
}
If you have a secondary fraud detection service, use it. Requires maintaining multiple external integrations.
Monitoring and Alerting Circuit Breakers
A circuit breaker that opens silently is dangerous. You need visibility:
Metrics to track:
- Number of times circuit opened (alerting threshold: any opening)
- Time spent in OPEN state (indicates slow recovery)
- Number of requests rejected while OPEN
- Failure rate that caused the opening
Dashboards should show:
- Current state per circuit (CLOSED, OPEN, HALF-OPEN)
- State transition history (when did each circuit last open?)
- Correlation between circuit state and user impact
Alert on circuit opening:
- Name: "Circuit Breaker Opened"
- Condition: Any circuit transitions to OPEN
- Severity: Warning
- Action: Page the on-call engineer to investigate the failing downstream service
Alert on stuck OPEN state:
- Name: "Circuit Breaker Stuck Open"
- Condition: Circuit remains OPEN for more than 5 minutes
- Severity: Critical
- Action: Might indicate that the downstream service is fundamentally broken, not just temporarily failing
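With Resilience4j, the raw material for these metrics and alerts comes from the breaker's event publisher. A minimal sketch, reusing the circuitBreaker and log from the earlier example (in practice the events would also feed a metrics registry):
// Emit every state transition and every rejected call, so an open circuit
// is never silent.
circuitBreaker.getEventPublisher()
    .onStateTransition(event ->
        log.warn("Circuit '{}' transitioned: {}",
            event.getCircuitBreakerName(), event.getStateTransition()))
    .onCallNotPermitted(event ->
        log.info("Request rejected while circuit '{}' is OPEN",
            event.getCircuitBreakerName()));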
Circuit Breaker vs. Timeout vs. Retry: The Trio
These three patterns work together:
Timeout: "Wait maximum 2 seconds, then give up"
Retry: "If it fails, try up to 3 times"
Circuit Breaker: "If lots of calls are failing, stop even trying"
A well-designed resilient call looks like:
// Timeout: fail fast if the service is slow (configured on the HTTP client)
RequestConfig requestConfig = RequestConfig.custom()
    .setConnectTimeout(2000)
    .setSocketTimeout(2000)
    .build();
CloseableHttpClient httpClient = HttpClientBuilder.create()
    .setDefaultRequestConfig(requestConfig)
    .build();

// Retry: transient failures often succeed on retry
Retry retry = Retry.of("myService", RetryConfig.custom()
    .retryExceptions(IOException.class)
    .maxAttempts(3)                        // 1 initial attempt + 2 retries
    .waitDuration(Duration.ofMillis(100))
    .build());

// Circuit Breaker: persistent failures trip the breaker
CircuitBreaker breaker = CircuitBreaker.ofDefaults("myService");

// All together: breaker wraps retry, which wraps the timed HTTP call
Supplier<String> call = CircuitBreaker.decorateSupplier(
    breaker,
    Retry.decorateSupplier(retry, () -> doApiCall())
);
The order matters: Circuit Breaker (outermost) prevents hammering when the service is down. Retry (middle) handles transient failures. Timeout (innermost) ensures we don’t wait forever.
Common Mistakes
Mistake 1: Threshold Too Aggressive
Opening the circuit after just 3 failed requests out of 10 is too sensitive. A brief burst of transient errors or a couple of slow responses will trip the breaker even though the service is healthy.
Fix: Use minimum call count of 10-20 and failure rate of 50%+.
Mistake 2: No Fallback Strategy
Circuit opens and users see blank error pages. No graceful degradation.
Fix: Plan fallback strategies before circuit opens. Cache, default values, alternative services.
Mistake 3: Circuit Breaker Without Monitoring
Circuit opens at 3 AM. On-call engineer doesn’t notice until 8 AM when customer complaints arrive.
Fix: Alert on every circuit opening. Have dashboards showing circuit state.
Mistake 4: Too Long Wait Duration
Circuit opens, waits 5 minutes before half-opening. Service recovered after 30 seconds but users are rejected for 4.5 more minutes.
Fix: Start with 30 seconds, tune based on actual recovery times.
Mistake 5: Forgetting About the Whole Chain
You add a circuit breaker around the fraud API call but forget the database call in the same request path. When the database slows down, requests pile up waiting on it and the thread pool exhausts exactly as in the cascade scenario; the fraud API breaker can’t help, because the failure is in a dependency it doesn’t cover.
Fix: Add circuit breakers at every external dependency, not just the most obvious ones.
Key Takeaways
- Circuit breaker pattern prevents cascading failures by failing fast when dependencies are unhealthy
- Three states (CLOSED, OPEN, HALF-OPEN) with configurable transitions
- Configure thresholds carefully: failure rate, slow call rate, minimum call count, wait duration
- Fallback strategies are essential: cached data, default values, alternative services, or graceful degradation
- Monitor circuit breaker state actively; alerting on state changes is critical
- Use circuit breaker alongside timeout and retry for complete resilience
- Service meshes enable circuit breaker configuration at the infrastructure level, eliminating application code changes
The circuit breaker pattern, combined with the redundancy and fault tolerance we discussed earlier, forms the foundation of reliable distributed systems. In the next section (bulkhead pattern), we’ll look at how to isolate failures so that one degraded component doesn’t starve resources from others.