MCP server circuit breaker

Without a circuit breaker, a broken external dependency causes every MCP tool call that touches it to fail slowly: the tool waits for the full timeout (often 5–30 seconds) before returning an error. Concurrent sessions pile up holding connections, and the MCP server starts failing for reasons that have nothing to do with its own health. A circuit breaker short-circuits this cascade: once a dependency's failures cross a configured threshold, the breaker opens and all subsequent calls fail immediately, with no wait, no timeout, and no pile-up. After a configured recovery window, the breaker lets a trial call through and closes again if that call succeeds. This turns a slow cascading failure into a fast, contained, and self-healing one.

TL;DR

Wrap each external dependency (database, external API, cache) in its own circuit breaker. One breaker per dependency enables bulkhead isolation — a broken external API opens that breaker without affecting tools that only use the database. Use Opossum for Node.js: it handles the CLOSED → OPEN → HALF_OPEN state machine, exposes events for metrics, and supports synchronous fallback functions. Return isError: true from the fallback with a clear reason like "search_api circuit open — try again in 30 seconds". Expose circuit state in a health_check tool so AliveMCP's probe can see dependency health, not just that the server is accepting connections.

The three-state model

Every circuit breaker cycles through three states:

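State     | Behaviour                                                   | Transition
----------|-------------------------------------------------------------|-----------
CLOSED    | Normal operation: calls pass through to the dependency      | → OPEN when the failure rate in the rolling window exceeds errorThresholdPercentage (once at least volumeThreshold calls have been made)
OPEN      | Failing fast: all calls immediately invoke the fallback     | → HALF_OPEN after resetTimeout ms
HALF_OPEN | Testing recovery: one probe call passes through             | → CLOSED on success; → OPEN on failure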

The HALF_OPEN state is what makes the pattern self-healing: the breaker does not stay open forever. After the reset timeout, it lets one call through to check whether the dependency has recovered. A successful probe closes the circuit; a failure restarts the open timer.
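
The state machine described above can be sketched in a few lines. This is a hypothetical, deliberately simplified breaker (the `TinyBreaker` name, the consecutive-failure threshold, and the injectable clock are all assumptions made for illustration); Opossum itself trips on a failure rate over a rolling window rather than on a failure streak.

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical minimal breaker for illustration only; not Opossum's implementation.
class TinyBreaker {
  private state: State = 'CLOSED';
  private failures = 0;          // consecutive failures seen while CLOSED
  private openedAt = 0;          // timestamp of the last transition to OPEN
  private probeInFlight = false; // true while the single HALF_OPEN probe is running
  private failureThreshold: number;
  private resetTimeoutMs: number;
  private now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock so the transitions can be tested without waiting
  }

  // Current state; moves OPEN -> HALF_OPEN once the reset timeout has elapsed.
  getState(): State {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN';
    }
    return this.state;
  }

  // True if a call may reach the dependency right now.
  allowRequest(): boolean {
    const state = this.getState();
    if (state === 'OPEN') return false;
    if (state === 'HALF_OPEN') {
      if (this.probeInFlight) return false; // only one probe at a time
      this.probeInFlight = true;
    }
    return true;
  }

  // A success resets the streak; a successful probe closes the circuit.
  recordSuccess(): void {
    this.failures = 0;
    this.probeInFlight = false;
    this.state = 'CLOSED';
  }

  // A failed probe reopens at once; in CLOSED, open when the streak hits the threshold.
  recordFailure(): void {
    this.failures += 1;
    this.probeInFlight = false;
    if (this.getState() === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now(); // restart the open timer
    }
  }
}
```

A caller would check `allowRequest()` before touching the dependency, then report the outcome with `recordSuccess()` or `recordFailure()`. A failed probe resets `openedAt`, which is the "restarts the open timer" behaviour.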

Circuit breaker per dependency with Opossum

Create one breaker per external dependency in createDeps(). Opossum wraps an async function and returns a CircuitBreaker object that intercepts calls based on the failure rate:

// deps.ts — circuit breakers created alongside their dependencies
import CircuitBreaker from 'opossum';

interface Breakers {
  searchApi: CircuitBreaker;
  notificationApi: CircuitBreaker;
}

async function callSearchApi(query: string): Promise<SearchResult[]> {
  const res = await fetch(`https://search.internal/v2/search?q=${encodeURIComponent(query)}`, {
    signal: AbortSignal.timeout(5000),
  });
  if (!res.ok) throw new Error(`Search API ${res.status}`);
  return res.json();
}

export function createBreakers(): Breakers {
  const breakerOptions = {
    errorThresholdPercentage: 50, // open when more than 50% of calls in the rolling window (10 s by default) fail
    timeout: 5000,                // consider the call failed if it hasn't returned in 5s
    resetTimeout: 30000,          // try half-open after 30s
    volumeThreshold: 5,           // need at least 5 calls before evaluating error rate
  };

  const searchApi = new CircuitBreaker(callSearchApi, {
    ...breakerOptions,
    name: 'search-api',
  });

  // callNotificationApi is assumed to be defined like callSearchApi (not shown here)
  const notificationApi = new CircuitBreaker(callNotificationApi, {
    ...breakerOptions,
    name: 'notification-api',
    resetTimeout: 60000, // notification API recovers more slowly — wait longer
  });

  // Log state transitions for observability
  for (const breaker of [searchApi, notificationApi]) {
    breaker.on('open',     () => console.warn({ event: 'circuit_open',     name: breaker.name }));
    breaker.on('halfOpen', () => console.info({ event: 'circuit_half_open', name: breaker.name }));
    breaker.on('close',    () => console.info({ event: 'circuit_closed',    name: breaker.name }));
  }

  return { searchApi, notificationApi };
}
// In createDeps():
export async function createDeps(): Promise<Deps> {
  const config = parseConfig();
  const db = new Pool({ connectionString: config.DATABASE_URL });
  await db.query('SELECT 1');
  const breakers = createBreakers();
  return { config, db, breakers, logger };
}
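
To make the interaction between errorThresholdPercentage and volumeThreshold concrete, here is a simplified model of the trip decision. The `shouldTrip` function and its types are hypothetical names for this sketch, and the strict "greater than" comparison is an assumption about how the threshold is applied; it is an illustration of the rule, not Opossum's source.

```typescript
// Counts observed in the current rolling window.
interface WindowStats {
  fires: number;    // calls attempted
  failures: number; // calls that failed or timed out
}

interface TripOptions {
  errorThresholdPercentage: number;
  volumeThreshold: number;
}

// Simplified model: decide whether a CLOSED breaker should open.
function shouldTrip(stats: WindowStats, opts: TripOptions): boolean {
  // Too little traffic in the window to judge the dependency.
  if (stats.fires < opts.volumeThreshold) return false;
  const errorRate = (stats.failures / stats.fires) * 100;
  return errorRate > opts.errorThresholdPercentage;
}

const opts: TripOptions = { errorThresholdPercentage: 50, volumeThreshold: 5 };

const tooFewCalls = shouldTrip({ fires: 4, failures: 4 }, opts);  // false: below volumeThreshold
const mostlyFailed = shouldTrip({ fires: 5, failures: 3 }, opts); // true: 60% > 50%
const mostlyFine = shouldTrip({ fires: 5, failures: 2 }, opts);   // false: 40% <= 50%
```

The volume threshold is what stops a single failed call on a quiet server (one failure out of one call is a 100% error rate) from opening the circuit.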

Using circuit breakers in tool handlers

Call through the breaker instead of calling the dependency directly. Provide a fallback that returns a meaningful isError: true response — do not let the open-circuit exception propagate as an unhandled error:

server.tool(
  'search_documents',
  'Search documents across the knowledge base',
  { query: z.string().min(1) },
  async (args) => {
    // Response for a call rejected because the circuit is OPEN. The call never
    // reaches the search API, so there is no timeout to wait for.
    const circuitOpen = () => ({
      isError: true as const,
      content: [{
        type: 'text' as const,
        text: JSON.stringify({
          error: 'search_unavailable',
          message: 'Search service is temporarily unavailable. Try again in about 30 seconds.',
          circuit: 'open',
        }),
      }],
    });

    try {
      // With no fallback registered on the breaker, fire() resolves with the
      // search results or rejects (dependency error, timeout, or open circuit)
      const results = await deps.breakers.searchApi.fire(args.query);
      return {
        content: [{ type: 'text', text: JSON.stringify(results) }],
      };
    } catch (err: any) {
      // Opossum rejects with code 'EOPENBREAKER' when the circuit is open
      if (err?.code === 'EOPENBREAKER') return circuitOpen();
      // Any other rejection is a failure of the call itself
      return {
        isError: true,
        content: [{ type: 'text', text: `Search error: ${err.message}` }],
      };
    }
  }
);

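// Alternative to the inline try/catch: register a fallback on the breaker.
// fire() then resolves with the fallback's return value on any failed or
// rejected call, so the handler must return that value as-is instead of
// wrapping it in a success response.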
deps.breakers.searchApi.fallback(() => ({
  isError: true,
  content: [{ type: 'text', text: 'Search temporarily unavailable.' }],
}));

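The fallback-on-the-breaker pattern is cleaner when multiple tools share the same breaker: register the fallback once on the breaker object in createDeps(), and every tool that fires through the breaker gets the same fallback. Note that Opossum runs the fallback whenever a call fails, times out, or is rejected, not only when the circuit is open, so the fallback's message should make sense in each of those cases.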
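
One way to keep a shared fallback's message accurate is to build it from the breaker's current state and its reset timeout. The helper below is a minimal sketch; `dependencyErrorResult` and `ToolErrorResult` are hypothetical names, and the payload fields simply mirror the inline example above.

```typescript
// Shape of an MCP tool result that signals an error with a single text item.
interface ToolErrorResult {
  isError: true;
  content: { type: 'text'; text: string }[];
}

// Hypothetical helper: builds the error result a shared fallback could return.
function dependencyErrorResult(
  dependency: string,      // e.g. the breaker's name
  circuitOpen: boolean,    // whether the breaker is currently OPEN
  resetTimeoutMs: number,  // the breaker's configured resetTimeout
): ToolErrorResult {
  const seconds = Math.ceil(resetTimeoutMs / 1000);
  const message = circuitOpen
    ? `${dependency} circuit open; try again in about ${seconds} seconds.`
    : `${dependency} call failed; the request can be retried.`;
  return {
    isError: true,
    content: [{
      type: 'text',
      text: JSON.stringify({
        error: `${dependency}_unavailable`,
        message,
        circuit: circuitOpen ? 'open' : 'closed',
      }),
    }],
  };
}

const openResult = dependencyErrorResult('search-api', true, 30000);
const failedCallResult = dependencyErrorResult('notification-api', false, 60000);
```

With Opossum, this could be registered as `breaker.fallback(() => dependencyErrorResult(breaker.name, breaker.opened, 30000))`, so a breaker with a 60-second reset timeout does not tell callers to retry in 30 seconds.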

Bulkhead isolation: preventing cascade failures

The key benefit of one breaker per dependency is bulkhead isolation. Without bulkheads, calls to one slow external API pile up, holding sockets, memory, and client sessions, and degrade the entire server. With per-dependency breakers, only tools that use the broken dependency degrade; tools that use only the database (which may have its own breaker, or none at all) continue to work:

// Bulkhead example: three tool categories with different dependency profiles
// search_documents → searchApi breaker only
// send_notification → notificationApi breaker only
// get_user_profile  → database only (no breaker needed for a healthy local DB)

// When searchApi is OPEN:
//   search_documents → fast fail, isError: true
//   send_notification → still works (different breaker)
//   get_user_profile  → still works (different dependency, no breaker involved)

// Without bulkheads:
//   search_documents → 5s timeout × many concurrent calls → pending requests and sockets pile up
//   send_notification → also slow (shared socket, memory, and connection pool pressure)
//   get_user_profile  → also slow (collateral damage)
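
The isolation described in the comments above can be expressed as a small lookup: given each breaker's state, which tools are affected? The `toolDependencies` map and `degradedTools` function are hypothetical names for this sketch, with the tool-to-breaker mapping taken from the example.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical mapping from tool name to the breakers it fires through.
const toolDependencies: Record<string, string[]> = {
  search_documents: ['searchApi'],
  send_notification: ['notificationApi'],
  get_user_profile: [], // database only, no breaker involved
};

// Returns the tools that currently fail fast because one of their breakers is OPEN.
function degradedTools(states: Record<string, BreakerState>): string[] {
  return Object.entries(toolDependencies)
    .filter(([, breakers]) => breakers.some((name) => states[name] === 'OPEN'))
    .map(([tool]) => tool);
}

// searchApi is open, notificationApi is healthy: only search_documents degrades.
const affected = degradedTools({ searchApi: 'OPEN', notificationApi: 'CLOSED' });
```

A tool with no breakers in its list can never appear in the result, which is the bulkhead property: its availability does not depend on any external API's health.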

For the database itself, the connection pool's acquire timeout (connectionTimeoutMillis in node-postgres, acquireTimeoutMillis in generic-pool and Knex) already provides a form of fast-fail, provided it is set to a finite value. A circuit breaker on the database makes sense when you are worried about query-level failures (a specific query pattern causing database overload) rather than connection-level failures (the database is unreachable). For most MCP servers, a circuit breaker on external HTTP APIs is the highest-value use of the pattern.

Exposing circuit state in the health_check tool

AliveMCP probes confirm that initialize + tools/list succeeds — the server is up and accepting connections. But an open circuit is invisible to this probe: the server is up, tools are registered, the connection works, but tool calls are failing fast. Expose circuit state in a health_check tool so the probe can surface this:

server.tool(
  'health_check',
  'Report server and dependency health',
  {},
  async () => {
    const breakerStats = Object.entries(deps.breakers).map(([name, breaker]) => ({
      name,
      state: breaker.opened ? 'OPEN' : breaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
      stats: breaker.stats, // rolling-window counts (successes, failures, timeouts, rejects) and latency percentiles
    }));

    const anyOpen = breakerStats.some(b => b.state === 'OPEN');

    return {
      isError: anyOpen,
      content: [{
        type: 'text',
        text: JSON.stringify({
          status: anyOpen ? 'degraded' : 'healthy',
          breakers: breakerStats,
          timestamp: new Date().toISOString(),
        }, null, 2),
      }],
    };
  }
);

Configure AliveMCP to call health_check as a synthetic probe: the probe fires the tool and checks whether isError is false. If any circuit is open, isError: true triggers the AliveMCP alert, even though initialize and tools/list are succeeding. This gives you dependency-level observability without instrumenting each tool call individually.
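
On the consuming side, any caller of health_check can turn the JSON payload into a list of affected dependencies. The sketch below assumes the payload shape produced by the handler above; `degradedDependencies` and the interface names are hypothetical, and this is not a description of how AliveMCP itself processes the response.

```typescript
interface BreakerReport {
  name: string;
  state: 'CLOSED' | 'OPEN' | 'HALF_OPEN';
}

interface HealthPayload {
  status: 'healthy' | 'degraded';
  breakers: BreakerReport[];
  timestamp: string;
}

// Parses the text content of a health_check result and lists every
// dependency whose breaker is not fully CLOSED.
function degradedDependencies(payloadText: string): string[] {
  const payload = JSON.parse(payloadText) as HealthPayload;
  return payload.breakers
    .filter((b) => b.state !== 'CLOSED')
    .map((b) => b.name);
}

// Example payload in the shape the health_check handler emits.
const sample = JSON.stringify({
  status: 'degraded',
  breakers: [
    { name: 'searchApi', state: 'OPEN' },
    { name: 'notificationApi', state: 'CLOSED' },
  ],
  timestamp: '2024-01-01T00:00:00.000Z',
});

const unhealthy = degradedDependencies(sample);
```

Including HALF_OPEN in the result is a design choice: the dependency is being tested but has not yet proven itself healthy, so an alert that lists it gives an operator a more complete picture than isError alone.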

Related questions

Should I put a circuit breaker on the database?

For a local database that is either up or down, the connection pool's acquire timeout is usually sufficient: if the database is unreachable, new connection attempts fail once that timeout elapses (connectionTimeoutMillis in node-postgres, acquireTimeoutMillis in generic-pool and Knex), so keep it short. A circuit breaker on the database adds value when you have specific queries that can cause overload (a slow query pattern that cascades) or when you use the database as a distributed lock store and need to fail fast on contention. For external databases (cloud-hosted, accessed over the internet), a circuit breaker is more valuable, because network partitions cause long timeouts that the pool alone does not handle gracefully.

What's the difference between a circuit breaker and a retry with backoff?

Retry-with-backoff is appropriate when errors are transient and infrequent — a single failed request retries a few times with increasing delays. A circuit breaker is appropriate when a dependency is experiencing sustained failure — after the threshold is exceeded, all subsequent calls fail immediately rather than retrying, which protects both the caller (no wasted timeout cycles) and the dependency (no retry storms while it is trying to recover). Use both: retry for transient errors within the CLOSED state, circuit breaker to open when retries are failing at scale.
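
A minimal sketch of how the two can be combined: a capped exponential backoff schedule for transient errors, plus a check that stops retrying once the breaker has opened. The function names are hypothetical; the 'EOPENBREAKER' code is the one Opossum attaches to rejections from an open circuit.

```typescript
// Capped exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... up to capMs.
function backoffDelays(retries: number, baseMs: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(Math.min(capMs, baseMs * 2 ** attempt));
  }
  return delays;
}

// Retrying against an open circuit only adds latency: the breaker has already
// decided the dependency is down, so treat that rejection as final.
function isRetryable(err: { code?: string }): boolean {
  return err.code !== 'EOPENBREAKER';
}

const schedule = backoffDelays(4, 200, 1000); // [200, 400, 800, 1000]
```

A retry loop would wait `schedule[i]` milliseconds before attempt `i + 1` and give up as soon as `isRetryable(err)` returns false. Each retried call still goes through the breaker, so repeated failures count toward its threshold.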

How does a circuit breaker interact with connection pooling?

They are complementary. The connection pool manages capacity — how many concurrent connections you hold. The circuit breaker manages failure rate — when to stop trying. If the database pool is exhausted (pending acquisitions > 0), that is a capacity problem the pool handles with its acquire timeout. If a specific query is slow and failing consistently, that is a failure-rate problem the circuit breaker handles. Wire them together: the circuit breaker wraps the function that acquires a pool connection and runs the query; if the function fails (whether due to pool exhaustion or query error), the failure counts against the breaker threshold.
