# What is a circuit breaker?

**A circuit breaker** is a design pattern that stops requests to a failing remote service automatically. Closed state passes all traffic and counts errors; Open state rejects instantly; Half-Open sends probe requests to test recovery. Use it when calling remote APIs or microservices where failure rate crosses 10-20%. **Key point:** unlike retries, it cuts traffic completely until the downstream service recovers.

**A circuit breaker** is a design pattern that detects repeated failures when calling a remote service and temporarily stops sending requests to it, so one slow dependency cannot take down the entire system.

## Theory

### TL;DR

- Analogy: like the electrical breaker in your home - it trips on overload to protect the wiring, then resets after cooling down.
- Three states: Closed (all requests pass), Open (instant rejection), Half-Open (probe a few requests to check recovery).
- Main difference from retries: retries amplify load on a failing service; a circuit breaker stops all traffic until a recovery signal arrives.
- Use it when calling remote services where failure rate exceeds 10-20% in a time window.
- Pair with a fallback (cached data, default value) so users get something useful instead of an error page.
### Quick example

```javascript
const CircuitBreaker = require('opossum');
const axios = require('axios');

const breaker = new CircuitBreaker(async () => {
  return axios.get('https://api.example.com/inventory');
}, {
  timeout: 1000,                // fail if no response in 1s
  errorThresholdPercentage: 50, // open after 50% errors
  resetTimeout: 5000            // try half-open after 5s
});

breaker.fallback(() => ({ available: false, source: 'cache' }));

breaker.fire().then(console.log);
```

Once the error rate passes 50%, `breaker.fire()` returns the fallback immediately instead of waiting on a dead service.

### How the three states work

**Closed** is the normal state. Requests pass through and the breaker counts errors in a sliding time window, typically the last N seconds via a ring buffer. When the error rate crosses the threshold, the state flips to Open.

**Open** means instant rejection. The breaker does not even attempt the call. It rejects the returned promise or resolves with the configured fallback right away, while a timer started at the moment of opening counts down to the next check.

**Half-Open** is the recovery probe. After `resetTimeout` expires, the breaker lets a small number of probe requests through, typically 1-2. If they succeed, state goes back to Closed. If they fail, the breaker reopens and restarts the timer.

This is where a subtle problem hides: in a cluster of 100 pods, all of them might probe at the same time, creating a thundering herd against an already struggling service. The fix is staggering probes, for example only pods where `podIndex % 10 === 0` send the probe, or using leader election.

### How it tracks failures internally

The breaker keeps a sliding window, usually implemented as a ring buffer. Early Resilience4j versions used a `RingBitSet` for this, while Hystrix used time-bucketed rolling counters. A bit-based window stores the last N outcomes as bits: 0 for success, 1 for failure. On each request the breaker recalculates `(failures / total) * 100`. If that number exceeds `errorThresholdPercentage` AND the total count has reached `volumeThreshold`, the breaker opens.

The `volumeThreshold` detail is easy to skip but matters a lot.
Without it, a single failed request out of one total equals 100% error rate, which trips the breaker incorrectly. `opossum` exposes this setting as `volumeThreshold`; Resilience4j has the equivalent `minimumNumberOfCalls`.

### When to use

- High-latency remote calls (database, external API, another microservice) - use a circuit breaker.
- Service mesh with Istio or Linkerd - breaker logic is handled at the proxy level (Envoy in Istio's case); you configure it declaratively.
- Low-volume batch jobs that run once per hour - a simple retry with exponential backoff is enough.
- Internal function calls or in-process operations - no breaker needed.
- Failure rate stays below 5-10% consistently - the overhead is not worth it.

Each service or endpoint gets its own breaker instance. A single shared breaker across all calls means one slow database kills the payment flow too.

### Common mistakes

**Setting errorThresholdPercentage too low without volumeThreshold**

A single network blip causes 100% error rate, trips the breaker, and blocks a perfectly healthy service for the entire `resetTimeout` period.

```javascript
// Wrong - trips on 1 failure out of 1 total call
const fragileOptions = { errorThresholdPercentage: 1 };

// Correct - waits for at least 10 calls before evaluating
const saferOptions = { errorThresholdPercentage: 50, volumeThreshold: 10 };
```

**No fallback defined**

Without a fallback, the client gets a raw 500 or an unhandled promise rejection. Users see an error page when cached or stale data would have been acceptable.

```javascript
// Fragment: assumes `breaker`, a connected `redis` client, `sku`, and a `use` handler exist

// Wrong - caller gets an unhandled rejection
breaker.fire(sku).then(use);

// Correct - return last-known data from cache
breaker.fallback(async (sku) => {
  const cached = await redis.get(`inventory:${sku}`);
  return cached ? JSON.parse(cached) : { available: false };
});
```

**One global breaker for all services**

If the inventory service degrades and trips the shared breaker, all other calls (payments, user profile, shipping) get blocked too.
```javascript
// Wrong - single shared breaker
const globalBreaker = new CircuitBreaker(anyCall, options);

// Correct - one instance per dependency
const inventoryBreaker = new CircuitBreaker(checkInventory, options);
const paymentBreaker = new CircuitBreaker(chargeCard, options);
```

**resetTimeout set too short or too long**

At 1 second, the breaker probes too aggressively and hammers a service that is still recovering. At 5 minutes, the system stays degraded long after the downstream service is healthy again. Base it on the realistic recovery time of the dependency, typically 30 seconds to 2 minutes.

### Real-world usage

- Netflix - Hystrix implemented per-command breakers inside the Zuul gateway; each downstream service had its own breaker visible in the Hystrix dashboard.
- Spring Cloud - Resilience4j replaced Hystrix and is now the default implementation behind Spring Cloud Circuit Breaker in Spring Boot apps.
- Node.js - `opossum` (maintained by the NodeShift project at Red Hat) is a widely used library for Express-based microservices.
- AWS - API Gateway in front of Lambda has built-in throttling. It limits request rates rather than tracking failures, so it shields the backend from overload but is not a full circuit breaker.
- Istio - Envoy sidecars enforce circuit breaking at the mesh level via `DestinationRule` config, without any application code changes.

### Follow-up questions

**Q:** Explain the three states and transitions without looking at code.
**A:** Closed counts errors in a window; threshold hit flips to Open, which starts a timer and rejects instantly. After the timer, Half-Open probes 1-2 requests; success closes it, failure reopens it.

**Q:** How is a circuit breaker different from a retry?
**A:** Retries send more requests to an already failing service, which makes things worse under load. A circuit breaker stops all requests until the service signals recovery. They are complementary: use exponential backoff retries inside the breaker for transient errors, and open the breaker for persistent failures.

**Q:** How do you share circuit breaker state across 50 instances of the same service?
**A:** Store state in Redis using pub/sub or a distributed counter. Only one instance (chosen by consistent hashing or leader election) probes in Half-Open to avoid the thundering herd problem.

**Q:** What is the math behind errorThresholdPercentage?
**A:** `(failures / total) * 100 > threshold`, evaluated over a sliding window. Early Resilience4j versions kept the last N calls in a `RingBitSet`; Hystrix used time-bucketed rolling counters. The calculation only fires if `total >= volumeThreshold`.

**Q:** In a service mesh with 100 pods, how do you prevent a thundering herd during Half-Open?
**A:** Gate the probe so only pods matching `podIndex % N === 0` send the probe request, or rely on Envoy's built-in `outlier_detection`. Istio exposes it through `outlierDetection` settings such as `consecutive5xxErrors` and `interval`, so ejecting and re-admitting unhealthy hosts happens at the proxy without any code changes.

## Examples

### Basic circuit breaker with opossum

```javascript
const CircuitBreaker = require('opossum');
const axios = require('axios');

const breaker = new CircuitBreaker(async () => {
  const { data } = await axios.get('https://api.example.com/status');
  return data;
}, {
  timeout: 1000,                // call times out after 1s
  errorThresholdPercentage: 50, // open after 50% failures in window
  resetTimeout: 5000,           // half-open probe after 5s
  volumeThreshold: 5            // need at least 5 calls before evaluating
});

breaker.fallback(() => ({ status: 'unknown', source: 'fallback' }));

// Event hooks for monitoring
breaker.on('open', () => console.log('Breaker OPEN - rejecting requests'));
breaker.on('halfOpen', () => console.log('Breaker HALF-OPEN - probing'));
breaker.on('close', () => console.log('Breaker CLOSED - traffic restored'));

breaker.fire()
  .then(result => console.log('Response:', result))
  .catch(err => console.error('Rejected:', err.message));
```

The event hooks are useful for feeding metrics into Prometheus or DataDog so you can see breaker state changes on a dashboard in real time.
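To connect these options back to the Theory section, here is a minimal hand-rolled sketch of the state machine and the threshold math. It is not how opossum works internally: the class name `SketchBreaker`, the count-based window (`windowSize`), and the injectable `now` clock are assumptions made for illustration, and it does not guard against several callers probing at once in Half-Open.

```javascript
// Illustrative sketch only: a count-based window, not opossum's implementation.
class SketchBreaker {
  constructor({
    windowSize = 10,               // how many recent outcomes to remember
    errorThresholdPercentage = 50, // open when the error rate exceeds this
    volumeThreshold = 5,           // minimum calls before the rate is evaluated
    resetTimeout = 5000,           // ms to stay Open before allowing a probe
    now = Date.now                 // injectable clock (assumption, eases testing)
  } = {}) {
    Object.assign(this, {
      windowSize, errorThresholdPercentage, volumeThreshold, resetTimeout, now
    });
    this.outcomes = [];            // true = failure, false = success
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  // Decide whether a call may go out; moves OPEN -> HALF_OPEN once the timer expires.
  allowRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeout) {
      this.state = 'HALF_OPEN';
    }
    return this.state !== 'OPEN';
  }

  onSuccess() {
    if (this.state === 'HALF_OPEN') {
      this.state = 'CLOSED';       // probe succeeded: start with a clean window
      this.outcomes = [];
      return;
    }
    this.record(false);
  }

  onFailure() {
    if (this.state === 'HALF_OPEN') return this.trip(); // probe failed: reopen
    this.record(true);
    const total = this.outcomes.length;
    const failures = this.outcomes.filter(Boolean).length;
    const errorRate = (failures / total) * 100;
    if (total >= this.volumeThreshold && errorRate > this.errorThresholdPercentage) {
      this.trip();
    }
  }

  record(failed) {
    this.outcomes.push(failed);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift(); // drop oldest
  }

  trip() {
    this.state = 'OPEN';
    this.openedAt = this.now();
  }

  // Wrap an async action with the state machine above.
  async fire(action) {
    if (!this.allowRequest()) throw new Error('Breaker is open');
    try {
      const result = await action();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
}
```

Keeping `allowRequest`, `onSuccess`, and `onFailure` separate from `fire` makes the transitions easy to unit test without any network calls. In production code a maintained library such as opossum is usually the better choice.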
### Production Express API with Redis fallback

This is the pattern you would see in an e-commerce service calling an inventory microservice.

```javascript
const express = require('express');
const CircuitBreaker = require('opossum');
const axios = require('axios');
const redis = require('redis');

const redisClient = redis.createClient();
redisClient.on('error', (err) => console.error('Redis error:', err));
// node-redis v4+ needs an explicit connect before any command is issued
redisClient.connect().catch(console.error);

// One breaker per dependency
const inventoryBreaker = new CircuitBreaker(async (sku) => {
  const { data } = await axios.post('http://inventory-service/check', { sku });
  return data;
}, {
  timeout: 200,                 // inventory must respond in 200ms
  errorThresholdPercentage: 25, // open after 25% failures
  resetTimeout: 10000,          // wait 10s before probing
  volumeThreshold: 10
});

inventoryBreaker.fallback(async (sku) => {
  const cached = await redisClient.get(`inventory:${sku}`);
  if (cached) return JSON.parse(cached);
  return { available: false, source: 'default' };
});

const app = express();
app.use(express.json());

app.post('/order', async (req, res) => {
  const { sku } = req.body;
  try {
    const inventory = await inventoryBreaker.fire(sku);
    res.json({ canOrder: inventory.available });
  } catch (err) {
    // Only reaches here if fallback itself throws
    res.status(503).json({ error: 'Service unavailable' });
  }
});

app.listen(3000);
```

I have seen teams skip `volumeThreshold` here and then spend 20 minutes debugging why the breaker trips in staging after a single cold-start timeout. Set it from day one.

### Half-open probe and volumeThreshold edge case

This is the part that trips up a lot of people in code reviews.
```javascript
const CircuitBreaker = require('opossum');

// apiCall stands in for any async function that calls the remote service
const breaker = new CircuitBreaker(apiCall, {
  errorThresholdPercentage: 50,
  volumeThreshold: 5,   // must see 5 calls before % means anything
  resetTimeout: 3000    // after 3s, the next call is let through as the probe
});

// Without volumeThreshold:
//   Call 1 fails -> 1/1 = 100% -> breaker opens immediately
//   Wrong for a service that had one slow cold start

// With volumeThreshold: 5:
//   Calls 1-4 fail -> not enough data, stays Closed
//   Call 5 fails -> 5/5 = 100% -> breaker opens
//   Much more realistic behavior

breaker.on('halfOpen', () => {
  console.log('Half-open: the next call is the probe');
});
```

The number of probes allowed in Half-Open also matters in distributed scenarios. opossum's documented options do not include a `halfOpenActionCount` setting; in Half-Open it lets a single call through as the probe. Libraries that allow several probes (Resilience4j's `permittedNumberOfCallsInHalfOpenState`, for example) have to decide what a mixed result means: if two probes race and the network is flaky enough that one succeeds and one fails, the outcome depends on how the library weighs them. That is why some teams keep the probe count at 1 to avoid ambiguity.
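For the multi-pod case, the `podIndex % N === 0` gating mentioned in the Theory section can be sketched as below. All three helpers are hypothetical and not part of opossum. `podIndexFromHostname` assumes StatefulSet-style hostnames ending in `-<number>`, and the jittered reset timeout is an additional technique, not something the text above prescribes.

```javascript
// Hypothetical helpers for staggering Half-Open probes across pods.

// Extract a numeric index from a hostname such as "inventory-7".
// Assumes StatefulSet-style naming; falls back to 0 when no suffix is found.
function podIndexFromHostname(hostname) {
  const match = /-(\d+)$/.exec(hostname);
  return match ? Number(match[1]) : 0;
}

// Only every Nth pod is allowed to send the recovery probe.
function shouldProbe(podIndex, every = 10) {
  return podIndex % every === 0;
}

// Spread probe times so pods do not all leave the Open state at the same instant.
function jitteredResetTimeout(baseMs, spreadMs, random = Math.random) {
  return baseMs + Math.floor(random() * spreadMs);
}

const podIndex = podIndexFromHostname('inventory-7');
console.log(shouldProbe(podIndex));             // false: pod 7 is not a probing pod
console.log(jitteredResetTimeout(10000, 5000)); // somewhere in [10000, 15000)
```

The jittered value can be passed directly as opossum's `resetTimeout`. Wiring `shouldProbe` into a breaker requires custom logic around the probe call, since opossum has no built-in notion of pod-level gating.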