MCP server circuit breaker

Without a circuit breaker, a broken external dependency makes every MCP tool call that touches it fail slowly: the tool waits for the full timeout (often 5–30 seconds) before returning an error. Concurrent sessions pile up holding connections, and the MCP server starts failing for reasons that have nothing to do with its own health. A circuit breaker short-circuits this cascade: once a dependency's failure rate crosses a configured threshold, the breaker opens and all subsequent calls fail immediately, with no wait, no timeout, and no pile-up. After a configured recovery window, the breaker lets a trial call through and closes again if that call succeeds. This turns a slow cascade failure into a fast, contained, and self-healing one.

TL;DR

Wrap each external dependency (database, external API, cache) in its own circuit breaker. One breaker per dependency enables bulkhead isolation — a broken external API opens that breaker without affecting tools that only use the database. Use Opossum for Node.js: it handles the CLOSED → OPEN → HALF_OPEN state machine, exposes events for metrics, and supports synchronous fallback functions. Return isError: true from the fallback with a clear reason like "search_api circuit open — try again in 30 seconds". Expose circuit state in a health_check tool so AliveMCP's probe can see dependency health, not just that the server is accepting connections.

The three-state model

Every circuit breaker cycles through three states:

State	Behaviour	Transition
CLOSED	Normal operation: calls pass through to the dependency	→ OPEN when the failure rate in the rolling window exceeds `errorThresholdPercentage`
OPEN	Failing fast: every call is rejected immediately, or served by the fallback if one is registered	→ HALF_OPEN after `resetTimeout` ms
HALF_OPEN	Testing recovery: one probe call passes through	→ CLOSED on success; → OPEN on failure

The HALF_OPEN state is what makes the pattern self-healing: the breaker does not stay open forever. After the reset timeout, it lets one call through to check whether the dependency has recovered. A successful probe closes the circuit; a failure restarts the open timer.
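The transitions above can be sketched as a small state machine. This is a simplified illustration, not Opossum's implementation: it counts consecutive failures instead of a rolling error percentage, it wraps synchronous functions only, and the names `TinyBreaker`, `failureThreshold`, and `resetTimeoutMs` are hypothetical.

```typescript
// Minimal three-state breaker sketch. All names here are illustrative and
// are not part of Opossum's API.
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class TinyBreaker {
  private state: State = 'CLOSED';
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock so the timer can be driven manually
  }

  // Current state, promoting OPEN to HALF_OPEN once the reset timeout has elapsed.
  getState(): State {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN';
    }
    return this.state;
  }

  // Runs fn unless the circuit is OPEN, in which case it throws immediately.
  call<T>(fn: () => T): T {
    const state = this.getState();
    if (state === 'OPEN') throw new Error('circuit open');
    try {
      const result = fn();
      // Success: a HALF_OPEN probe closes the circuit; CLOSED resets the count.
      this.state = 'CLOSED';
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed probe reopens at once; otherwise open when the threshold is hit.
      if (state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = this.now(); // restart the open timer
      }
      throw err;
    }
  }
}
```

Passing the clock in as `now` keeps the sketch deterministic: advancing the fake clock past `resetTimeoutMs` is enough to observe the OPEN to HALF_OPEN transition without waiting.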

Circuit breaker per dependency with Opossum

Create one breaker per external dependency in createDeps(). Opossum wraps an async function and returns a CircuitBreaker object that intercepts calls based on the failure rate:

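// deps.ts: circuit breakers created alongside their dependencies
import CircuitBreaker from 'opossum';
import { Pool } from 'pg'; // assumption: createDeps() below uses node-postgres

// SearchResult, callNotificationApi, parseConfig, logger and Deps are assumed
// to be defined elsewhere in the project.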

interface Breakers {
  searchApi: CircuitBreaker;
  notificationApi: CircuitBreaker;
}

async function callSearchApi(query: string): Promise<SearchResult[]> {
  const res = await fetch(`https://search.internal/v2/search?q=${encodeURIComponent(query)}`, {
    signal: AbortSignal.timeout(5000),
  });
  if (!res.ok) throw new Error(`Search API ${res.status}`);
  return res.json();
}

export function createBreakers(): Breakers {
  const breakerOptions = {
    errorThresholdPercentage: 50, // open when more than 50% of calls in the rolling window (10 s by default) fail
    timeout: 5000,                // consider the call failed if it hasn't returned in 5s
    resetTimeout: 30000,          // try half-open after 30s
    volumeThreshold: 5,           // need at least 5 calls before evaluating error rate
  };

  const searchApi = new CircuitBreaker(callSearchApi, {
    ...breakerOptions,
    name: 'search-api',
  });

  const notificationApi = new CircuitBreaker(callNotificationApi, {
    ...breakerOptions,
    name: 'notification-api',
    resetTimeout: 60000, // notification API recovers more slowly — wait longer
  });

  // Log state transitions for observability
  for (const breaker of [searchApi, notificationApi]) {
    breaker.on('open',     () => console.warn({ event: 'circuit_open',     name: breaker.name }));
    breaker.on('halfOpen', () => console.info({ event: 'circuit_half_open', name: breaker.name }));
    breaker.on('close',    () => console.info({ event: 'circuit_closed',    name: breaker.name }));
  }

  return { searchApi, notificationApi };
}

// In createDeps():
export async function createDeps(): Promise<Deps> {
  const config = parseConfig();
  const db = new Pool({ connectionString: config.DATABASE_URL });
  await db.query('SELECT 1');
  const breakers = createBreakers();
  return { config, db, breakers, logger };
}
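How `volumeThreshold` and `errorThresholdPercentage` interact is easy to misread, so here is a small sketch of the decision as described in the option comments. `shouldOpen`, `WindowStats`, and `Thresholds` are hypothetical names, and the strict "greater than" comparison is an assumption about how the threshold is applied, so check it against the Opossum version in use.

```typescript
// Hypothetical helper mirroring the option comments; not part of Opossum.
interface WindowStats {
  fires: number;    // calls attempted in the current rolling window
  failures: number; // calls that failed in the current rolling window
}

interface Thresholds {
  errorThresholdPercentage: number;
  volumeThreshold: number;
}

function shouldOpen(stats: WindowStats, opts: Thresholds): boolean {
  // Too little traffic in the window: never trip, whatever the error rate.
  if (stats.fires < opts.volumeThreshold) return false;
  const errorRate = (stats.failures / stats.fires) * 100;
  // Assumed strict comparison: exactly 50% does not trip a 50% threshold.
  return errorRate > opts.errorThresholdPercentage;
}
```

With the options used in the listing (50% and a volume threshold of 5), four failures out of four calls would not open the circuit under this sketch, while six failures out of ten would.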

Using circuit breakers in tool handlers

Call through the breaker instead of calling the dependency directly. Provide a fallback that returns a meaningful isError: true response — do not let the open-circuit exception propagate as an unhandled error:

server.tool(
  'search_documents',
  'Search documents across the knowledge base',
  { query: z.string().min(1) },
  async (args) => {
    // One response shape for every failure mode: circuit OPEN, call timed
    // out, or callSearchApi threw
    const unavailable = (reason: string) => ({
      isError: true as const,
      content: [{
        type: 'text' as const,
        text: JSON.stringify({
          error: 'search_unavailable',
          message: 'Search service is temporarily unavailable. Try again in about 30 seconds.',
          reason,
        }),
      }],
    });

    try {
      // With no fallback registered on the breaker, fire() resolves with the
      // result of callSearchApi or rejects
      const results = await deps.breakers.searchApi.fire(args.query);
      return {
        content: [{ type: 'text' as const, text: JSON.stringify(results) }],
      };
    } catch (err: any) {
      // Opossum rejects immediately with code 'EOPENBREAKER' when the circuit
      // is OPEN; any other error came from the call itself or its timeout
      const reason = err?.code === 'EOPENBREAKER' ? 'circuit_open' : String(err?.message ?? err);
      return unavailable(reason);
    }
  }
);

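// Alternative: register the fallback on the breaker. fire() then resolves with
// the fallback's return value instead of rejecting, so the handler must return
// that value as-is rather than wrapping it in a success response:
deps.breakers.searchApi.fallback(() => ({
  isError: true,
  content: [{ type: 'text', text: 'Search temporarily unavailable.' }],
}));

The fallback-on-the-breaker pattern is cleaner when several tools share one breaker: register the fallback once on the breaker object in createDeps(), and every tool that fires through it gets the same response. Two details matter. Opossum runs the fallback when the wrapped call throws or times out, not only when the circuit is open. And fire() resolves with whatever the fallback returns, so each handler has to tell a fallback value apart from real results.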
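One way to share a fallback without confusing it with real data is to have it return a sentinel value and convert that in a single helper. The sketch below is an assumption about how this could be structured: `Unavailable`, `BreakerLike`, `toToolResult`, and `fireForTool` are hypothetical names, and `BreakerLike` only models the `fire()` method rather than importing Opossum.

```typescript
// Sketch only: every name in this block is hypothetical.
interface ToolResult {
  isError?: boolean;
  content: { type: 'text'; text: string }[];
}

// Sentinel returned by the shared fallback so handlers can tell it from real data.
class Unavailable {
  readonly service: string;
  readonly retryAfterSeconds: number;
  constructor(service: string, retryAfterSeconds: number) {
    this.service = service;
    this.retryAfterSeconds = retryAfterSeconds;
  }
}

// Minimal shape of a breaker: only fire() is needed here.
interface BreakerLike<R> {
  fire(...args: unknown[]): Promise<R | Unavailable>;
}

// Maps either real results or the fallback sentinel to an MCP tool result.
function toToolResult<R>(value: R | Unavailable): ToolResult {
  if (value instanceof Unavailable) {
    const text = `${value.service} unavailable, try again in about ${value.retryAfterSeconds} seconds`;
    return { isError: true, content: [{ type: 'text', text }] };
  }
  return { content: [{ type: 'text', text: JSON.stringify(value) }] };
}

async function fireForTool<R>(breaker: BreakerLike<R>, ...args: unknown[]): Promise<ToolResult> {
  try {
    return toToolResult(await breaker.fire(...args));
  } catch (err) {
    // Reached only if no fallback is registered or the fallback itself throws.
    const message = err instanceof Error ? err.message : String(err);
    return { isError: true, content: [{ type: 'text', text: message }] };
  }
}
```

Under these assumptions the fallback would be registered once, for example `breakers.searchApi.fallback(() => new Unavailable('search_api', 30))`, and each handler would reduce to `return fireForTool(deps.breakers.searchApi, args.query)`.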

Bulkhead isolation: preventing cascade failures

The key benefit of one breaker per dependency is bulkhead isolation. Without bulkheads, one slow external API ties up sockets, memory, and pending requests until the entire server degrades. With per-dependency breakers, only tools that use the broken dependency degrade; tools that use only the database (which has its own breaker, or none at all) continue to work:

// Bulkhead example: three tool categories with different dependency profiles
// search_documents → searchApi breaker only
// send_notification → notificationApi breaker only
// get_user_profile  → database only (no breaker needed for a healthy local DB)

// When searchApi is OPEN:
//   search_documents → fast fail, isError: true
//   send_notification → still works (different breaker)
//   get_user_profile  → still works (different dependency, no breaker involved)

// Without bulkheads:
//   search_documents → 5s timeout × many concurrent calls → pending requests and sockets pile up
//   send_notification → also slow (shared socket and memory pressure)
//   get_user_profile  → also slow (collateral damage)

For the database itself, the connection pool's acquire timeout already provides a form of fast-fail, provided it is set: in node-postgres the option is connectionTimeoutMillis, which defaults to no timeout, while Knex-style pools call it acquireTimeoutMillis. A circuit breaker on the database makes sense when you are worried about query-level failures (a specific query pattern causing database overload) rather than connection-level failures (the database is unreachable). For most MCP servers, a circuit breaker on external HTTP APIs is the highest-value use of the pattern.

Exposing circuit state in the health_check tool

AliveMCP probes confirm that initialize + tools/list succeeds — the server is up and accepting connections. But an open circuit is invisible to this probe: the server is up, tools are registered, the connection works, but tool calls are failing fast. Expose circuit state in a health_check tool so the probe can surface this:

server.tool(
  'health_check',
  'Report server and dependency health',
  {},
  async () => {
    const breakerStats = Object.entries(deps.breakers).map(([name, breaker]) => ({
      name,
      state: breaker.opened ? 'OPEN' : breaker.halfOpen ? 'HALF_OPEN' : 'CLOSED',
      stats: breaker.stats, // rolling-window counters (fires, successes, failures, rejects, timeouts) and latency percentiles
    }));

    const anyOpen = breakerStats.some(b => b.state === 'OPEN');

    return {
      isError: anyOpen,
      content: [{
        type: 'text',
        text: JSON.stringify({
          status: anyOpen ? 'degraded' : 'healthy',
          breakers: breakerStats,
          timestamp: new Date().toISOString(),
        }, null, 2),
      }],
    };
  }
);

Configure AliveMCP to call health_check as a synthetic probe: the probe fires the tool and checks whether isError is false. If any circuit is open, isError: true triggers the AliveMCP alert, even though initialize and tools/list are succeeding. This gives you dependency-level observability without instrumenting each tool call individually.
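For reference, the check a probe performs on the health_check response can be sketched as follows. This is not AliveMCP's implementation; `evaluateHealthCheck` and `ProbeVerdict` are hypothetical names, and the payload shape is assumed to match the health_check tool defined above.

```typescript
// Hypothetical probe-side check; it assumes the JSON body produced by the
// health_check tool shown earlier.
interface ToolResult {
  isError?: boolean;
  content: { type: string; text?: string }[];
}

interface ProbeVerdict {
  healthy: boolean;
  openBreakers: string[]; // names of breakers reported as OPEN
}

function evaluateHealthCheck(result: ToolResult): ProbeVerdict {
  const openBreakers: string[] = [];
  const text = result.content.find((c) => c.type === 'text')?.text;
  if (text !== undefined) {
    try {
      const body = JSON.parse(text) as { breakers?: { name: string; state: string }[] };
      for (const b of body.breakers ?? []) {
        if (b.state === 'OPEN') openBreakers.push(b.name);
      }
    } catch {
      // Body is not the expected JSON: fall back to isError alone.
    }
  }
  // isError is the primary signal; openBreakers only adds detail for the alert.
  return { healthy: result.isError !== true, openBreakers };
}
```

Keeping `isError` as the deciding signal means the check still works if the body format changes, and the list of open breakers simply makes the alert message more specific.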