What is a circuit breaker?


A circuit breaker is a design pattern that detects repeated failures when calling a remote service and temporarily stops sending requests to it, so one slow dependency cannot take down the entire system.

Theory

TL;DR

Analogy: like the electrical breaker in your home - it trips on overload to protect the wiring, then resets after cooling down.
Three states: Closed (all requests pass), Open (instant rejection), Half-Open (probe a few requests to check recovery).
Main difference from retries: retries amplify load on a failing service; a circuit breaker stops all traffic until a recovery signal arrives.
Use it when calling remote services where failure rate exceeds 10-20% in a time window.
Pair with a fallback (cached data, default value) so users get something useful instead of an error page.

Quick example

javascript

const CircuitBreaker = require('opossum');
const axios = require('axios');

const breaker = new CircuitBreaker(async () => {
  return axios.get('https://api.example.com/inventory');
}, {
  timeout: 1000,                // fail if no response in 1s
  errorThresholdPercentage: 50, // open after 50% errors
  resetTimeout: 5000            // try half-open after 5s
});

breaker.fallback(() => ({ available: false, source: 'cache' }));

breaker.fire().then(console.log);

After hitting 50% errors, breaker.fire() returns the fallback immediately instead of waiting on a dead service.

How the three states work

Closed is the normal state. Requests pass through and the breaker counts errors in a sliding time window, typically the last N seconds via a ring buffer. When the error rate crosses the threshold, the state flips to Open.

Open means instant rejection. The breaker does not even attempt the call. It rejects the returned promise or resolves with the configured fallback right away. The timer that counts down to the next check starts at the moment the breaker opens.

Half-Open is the recovery probe. After resetTimeout expires, the breaker lets 1-2 requests through. If they succeed, state goes back to Closed. If they fail, the breaker reopens and restarts the timer. This is where a subtle problem hides: in a cluster of 100 pods, all of them might probe at the same time, creating a thundering herd against an already struggling service. The fix is staggering probes, for example only pods where podIndex % 10 === 0 send the probe, or using leader election.
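The three transitions above can be sketched as a small state machine. This is a minimal illustration, not how opossum or any other library implements it: the class name TinyBreaker, the failureThreshold option, and the injectable now clock are all hypothetical, and it counts consecutive failures instead of an error percentage to keep the code short.

```javascript
// Minimal illustrative breaker: consecutive-failure threshold, one probe at a time.
// All names here (TinyBreaker, failureThreshold, now) are hypothetical.
class TinyBreaker {
  constructor(action, { failureThreshold = 3, resetTimeout = 5000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now;          // injectable clock, handy for tests
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(...args) {
    if (this.state === 'HALF_OPEN') {
      // A probe is already in flight; reject everything else.
      throw new Error('Breaker is half-open');
    }
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Breaker is open'); // instant rejection, no call made
      }
      this.state = 'HALF_OPEN'; // timer expired: this call becomes the probe
    }
    try {
      const result = await this.action(...args);
      this.state = 'CLOSED';   // any success, including a probe, closes the breaker
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';   // failed probe or threshold hit: (re)open
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

Passing a fake clock through now makes the Open to Half-Open transition testable without waiting for a real timer.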

How it tracks failures internally

The breaker keeps a sliding window, usually implemented as a ring buffer (early Resilience4j releases called theirs RingBitSet; Hystrix and opossum use rolling time buckets instead). A count-based window stores the last N outcomes as bits: 0 for success, 1 for failure. On each request it recalculates (failures / total) * 100. If that number exceeds errorThresholdPercentage AND the total count has reached volumeThreshold, the breaker opens.

The volumeThreshold detail is easy to skip but matters a lot. Without it, a single failed request out of one total equals 100% error rate, which trips the breaker incorrectly. Most libraries expose this setting under some name: opossum calls it volumeThreshold, Resilience4j calls it minimumNumberOfCalls.
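Here is a rough sketch of a count-based window and the open decision described above. The OutcomeWindow class and shouldOpen function are hypothetical names for illustration; real libraries differ in detail (opossum, for instance, uses time buckets, not a fixed count of calls).

```javascript
// Fixed-size ring buffer of call outcomes: 0 = success, 1 = failure.
// OutcomeWindow and shouldOpen are illustrative names, not a library API.
class OutcomeWindow {
  constructor(size) {
    this.bits = new Uint8Array(size);
    this.size = size;
    this.next = 0;      // slot that will be overwritten next
    this.count = 0;     // outcomes recorded so far, capped at size
    this.failures = 0;  // running failure count inside the window
  }

  record(failed) {
    if (this.count === this.size) {
      this.failures -= this.bits[this.next]; // evict the oldest outcome
    } else {
      this.count += 1;
    }
    this.bits[this.next] = failed ? 1 : 0;
    this.failures += this.bits[this.next];
    this.next = (this.next + 1) % this.size;
  }

  failureRate() {
    return this.count === 0 ? 0 : (this.failures / this.count) * 100;
  }
}

// Open only when there is enough data AND the rate exceeds the threshold.
function shouldOpen(win, { errorThresholdPercentage, volumeThreshold }) {
  return win.count >= volumeThreshold &&
         win.failureRate() > errorThresholdPercentage;
}
```

Keeping a running failures counter means each request costs O(1), with no rescan of the buffer.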

When to use

High-latency remote calls (database, external API, another microservice) - use a circuit breaker.
Service mesh with Istio or Linkerd - breaker logic is handled at the proxy level (Envoy in Istio, Linkerd's own proxy in Linkerd); you configure it declaratively.
Low-volume batch jobs that run once per hour - a simple retry with exponential backoff is enough.
Internal function calls or in-process operations - no breaker needed.
Failure rate stays below 5-10% consistently - the overhead is not worth it.

Each service or endpoint gets its own breaker instance. A single shared breaker across all calls means one slow database kills the payment flow too.

Common mistakes

Setting errorThresholdPercentage too low without volumeThreshold

A single network blip causes 100% error rate, trips the breaker, and blocks a perfectly healthy service for the entire resetTimeout period.

javascript

// Wrong - trips on 1 failure out of 1 total call
{ errorThresholdPercentage: 1 }

// Correct - waits for at least 10 calls before evaluating
{ errorThresholdPercentage: 50, volumeThreshold: 10 }

No fallback defined

Without a fallback, the client gets a raw 500 or an unhandled promise rejection. Users see an error page when cached or stale data would have been acceptable.

javascript

// Wrong - caller gets an unhandled exception
breaker.fire(sku).then(use);

// Correct - return last-known data from cache
breaker.fallback(async (sku) => {
  const cached = await redis.get(`inventory:${sku}`);
  return cached ? JSON.parse(cached) : { available: false };
});

One global breaker for all services

If the inventory service degrades and trips the shared breaker, all other calls (payments, user profile, shipping) get blocked too.

javascript

// Wrong - single shared breaker
const globalBreaker = new CircuitBreaker(anyCall, options);

// Correct - one instance per dependency
const inventoryBreaker = new CircuitBreaker(checkInventory, options);
const paymentBreaker   = new CircuitBreaker(chargeCard, options);

resetTimeout set too short or too long

At 1 second, the breaker probes too aggressively and hammers a service that is still recovering. At 5 minutes, the system stays degraded long after the downstream service is healthy again. Base it on the realistic recovery time of the dependency, typically 30 seconds to 2 minutes.

Real-world usage

Netflix - Hystrix implemented per-command breakers inside the Zuul gateway; each downstream service had its own breaker visible in the Hystrix dashboard.
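Spring Cloud - Resilience4j succeeded Hystrix, which Netflix moved to maintenance mode, and is the implementation most Spring Boot apps now use through Spring Cloud Circuit Breaker.
Node.js - opossum (maintained by the NodeShift project, backed by Red Hat) is a widely used breaker library for Node.js microservices, including Express-based ones.
AWS - API Gateway in front of Lambda has built-in throttling. That is rate limiting, not a true circuit breaker (there are no Open or Half-Open states), but it similarly sheds load before it reaches a struggling backend.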
Istio - Envoy sidecars enforce circuit breaking at the mesh level via DestinationRule config, without any application code changes.

Follow-up questions

Q: Explain the three states and transitions without looking at code.
A: Closed counts errors in a window; threshold hit flips to Open, which starts a timer and rejects instantly. After the timer, Half-Open probes 1-2 requests; success closes it, failure reopens it.

Q: How is a circuit breaker different from a retry?
A: Retries send more requests to an already failing service, which makes things worse under load. A circuit breaker stops all requests until the service signals recovery. They are complementary: use exponential backoff retries inside the breaker for transient errors, and open the breaker for persistent failures.

Q: How do you share circuit breaker state across 50 instances of the same service?
A: Store state in Redis using pub/sub or a distributed counter. Only one instance (chosen by consistent hashing or leader election) probes in Half-Open to avoid the thundering herd problem.

Q: What is the math behind errorThresholdPercentage?
A: (failures / total) * 100 > threshold, evaluated over a sliding window. Early Resilience4j releases used a RingBitSet of the last N calls; Hystrix used rolling time buckets. The breaker only acts on the result if total >= volumeThreshold.

Q: In a service mesh with 100 pods, how do you prevent a thundering herd during Half-Open?
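A: Gate the probe so only pods matching podIndex % N === 0 send the probe request, or move the logic into the mesh with Envoy's outlier detection. In Istio that is the outlierDetection block of a DestinationRule: settings such as consecutive5xxErrors (the older consecutiveErrors is deprecated) and interval eject failing hosts and re-admit them later, without any code changes.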
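The podIndex % N gate can be sketched in a few lines. Both helper names below are hypothetical. The sketch also assumes pods have a stable numeric ordinal in their hostname, which holds for StatefulSets (for example "orders-7") but not for Deployments, whose pod names end in random suffixes.

```javascript
// Hypothetical helper: only every `stride`-th pod is allowed to send a probe.
function mayProbe(podIndex, stride = 10) {
  return podIndex % stride === 0;
}

// Parse the ordinal from a StatefulSet-style hostname such as "orders-7".
// Returns null when the hostname does not end in "-<digits>".
function podIndexFromHostname(hostname) {
  const match = /-(\d+)$/.exec(hostname);
  return match ? Number(match[1]) : null;
}
```

With 100 pods and a stride of 10, only pods 0, 10, 20, ... 90 probe, so the recovering service sees 10 trial requests, not 100.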

Examples

Basic circuit breaker with opossum

javascript

const CircuitBreaker = require('opossum');
const axios = require('axios');

const breaker = new CircuitBreaker(async () => {
  const { data } = await axios.get('https://api.example.com/status');
  return data;
}, {
  timeout: 1000,                // call times out after 1s
  errorThresholdPercentage: 50, // open after 50% failures in window
  resetTimeout: 5000,           // half-open probe after 5s
  volumeThreshold: 5            // need at least 5 calls before evaluating
});

breaker.fallback(() => ({ status: 'unknown', source: 'fallback' }));

// Event hooks for monitoring
breaker.on('open',     () => console.log('Breaker OPEN - rejecting requests'));
breaker.on('halfOpen', () => console.log('Breaker HALF-OPEN - probing'));
breaker.on('close',    () => console.log('Breaker CLOSED - traffic restored'));

breaker.fire()
  .then(result => console.log('Response:', result))
  .catch(err   => console.error('Rejected:', err.message));

The event hooks are useful for feeding metrics into Prometheus or DataDog so you can see breaker state changes on a dashboard in real time.
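As a rough sketch of the metrics idea, the helper below counts state transitions from any object that exposes .on(), which an opossum breaker does because it is an EventEmitter. The trackTransitions name is hypothetical, and a plain EventEmitter stands in for the breaker so the snippet runs without opossum installed. In a real service you would increment a Prometheus or DataDog counter where this increments a plain object.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: count breaker state transitions per event name.
function trackTransitions(emitter) {
  const counts = { open: 0, halfOpen: 0, close: 0 };
  for (const event of Object.keys(counts)) {
    emitter.on(event, () => { counts[event] += 1; });
  }
  return counts;
}

// Stand-in for a real breaker, used only to exercise the helper.
const fakeBreaker = new EventEmitter();
const counts = trackTransitions(fakeBreaker);

fakeBreaker.emit('open');
fakeBreaker.emit('halfOpen');
fakeBreaker.emit('close');

console.log(counts); // { open: 1, halfOpen: 1, close: 1 }
```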

Production Express API with Redis fallback

This is the pattern you would see in an e-commerce service calling an inventory microservice.

javascript

const express = require('express');
const CircuitBreaker = require('opossum');
const axios = require('axios');
const redis = require('redis');

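const redisClient = redis.createClient();
// node-redis v4+ needs an explicit connect before any command is issued
redisClient.connect().catch(console.error);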

// One breaker per dependency
const inventoryBreaker = new CircuitBreaker(async (sku) => {
  const { data } = await axios.post('http://inventory-service/check', { sku });
  return data;
}, {
  timeout: 200,                 // inventory must respond in 200ms
  errorThresholdPercentage: 25, // open after 25% failures
  resetTimeout: 10000,          // wait 10s before probing
  volumeThreshold: 10
});

inventoryBreaker.fallback(async (sku) => {
  const cached = await redisClient.get(`inventory:${sku}`);
  if (cached) return JSON.parse(cached);
  return { available: false, source: 'default' };
});

const app = express();
app.use(express.json());

app.post('/order', async (req, res) => {
  const { sku } = req.body;
  try {
    const inventory = await inventoryBreaker.fire(sku);
    res.json({ canOrder: inventory.available });
  } catch (err) {
    // Only reaches here if fallback itself throws
    res.status(503).json({ error: 'Service unavailable' });
  }
});

app.listen(3000);

I have seen teams skip volumeThreshold here and then spend 20 minutes debugging why the breaker trips in staging after a single cold-start timeout. Set it from day one.

Half-open probe and volumeThreshold edge case

This is the part that trips up a lot of people in code reviews.

javascript

const CircuitBreaker = require('opossum');

// apiCall is a placeholder for any async function that calls the dependency
const breaker = new CircuitBreaker(apiCall, {
  errorThresholdPercentage: 50,
  volumeThreshold: 5,      // must see 5 calls before % means anything
  resetTimeout: 3000       // after 3s, half-open lets a trial call through
});

// Without volumeThreshold:
// Call 1 fails -> 1/1 = 100% -> breaker opens immediately
// Wrong for a service that had one slow cold start

// With volumeThreshold: 5:
// Calls 1-4 fail -> not enough data, stays Closed
// Call 5 fails   -> 5/5 = 100% -> breaker opens
// Much more realistic behavior

breaker.on('halfOpen', () => {
  console.log('Half-open: the next call is the probe');
});

How many probes run in Half-Open is library-specific. As far as I can tell, opossum has no documented halfOpenActionCount option and admits a single trial call, while Resilience4j exposes permittedNumberOfCallsInHalfOpenState for this. The count matters in distributed scenarios. If two probes race and the network is flaky enough that one succeeds and one fails, the result depends on ordering. That is why some teams keep it at 1 to avoid ambiguity.
