MCP server backpressure

LLM agents can issue tool calls far faster than your backend can process them. A single agent session running parallel tool calls can saturate a database connection pool, overwhelm an external API, or push CPU to 100% — cascading into failures that affect every other client. Backpressure is the mechanism by which a server signals to callers that it is at capacity and they should slow down. For MCP servers, backpressure takes the form of concurrency limits, bounded queues, and explicit rejection — all of which protect downstream resources while giving agents the signal they need to back off.

TL;DR

Wrap every tool handler in a concurrency semaphore (max N in-flight calls) with a small bounded queue. When both are full, reject immediately with HTTP 503 and a Retry-After header rather than queuing indefinitely; reserve 429 for per-client rate limits. Size N to your database connection pool (or whichever resource is the bottleneck). Monitor active concurrency and queue depth as metrics: if active ≥ N consistently, your server needs more capacity or fewer parallel agent sessions.

Why MCP servers need explicit backpressure

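Node.js is single-threaded but non-blocking: it can accept thousands of concurrent connections while awaiting I/O. The bottleneck is not the Node event loop but your downstream resources, such as a database connection pool or a rate-limited external API, which can serve only a fixed number of requests at once.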

Concurrency semaphore pattern

A semaphore limits the number of simultaneous in-flight operations. Use the p-limit package (or a manual implementation) to wrap tool handlers:

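import pLimit from 'p-limit';
import { z } from 'zod';

// `server` (an McpServer instance) and `db` (your database client) are assumed to be set up elsewhere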

// One limiter per resource class — don't share limits across unrelated tools
const dbLimit = pLimit(10);   // max 10 concurrent database operations
const apiLimit = pLimit(5);   // max 5 concurrent external API calls

server.tool(
  'search_records',
  'Full-text search across customer records',
  { query: z.string(), limit: z.number().int().min(1).max(100).default(20) },
  async ({ query, limit }) => {
    // pLimit queues if at capacity — good for uniform, short-lived operations
    return dbLimit(async () => {
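      // plainto_tsquery accepts free-form text; to_tsquery raises a syntax error on input like "acme corp"
      const rows = await db.query(
        'SELECT * FROM records WHERE content @@ plainto_tsquery($1) LIMIT $2',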
        [query, limit]
      );
      return { content: [{ type: 'text', text: JSON.stringify(rows) }] };
    });
  }
);

By default, pLimit queues excess requests. This is acceptable when operations are fast and queue depth is bounded. For slow operations or large bursts, pair with a queue depth check.
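
One way to add that queue depth check is to read the limiter's pendingCount before enqueuing. The sketch below assumes only that the limiter is callable and exposes pendingCount, as p-limit limiters do; LimiterLike, QueueFullError, runBounded, and the threshold are hypothetical names chosen for illustration.

```typescript
// Structural type covering the parts of a p-limit limiter used here.
interface LimiterLike {
  <T>(fn: () => Promise<T>): Promise<T>;
  readonly pendingCount: number; // calls waiting for a free slot
}

// Hypothetical error type; the code matches the one used later in this guide.
class QueueFullError extends Error {
  readonly code = 'BACKPRESSURE_REJECTION';
  readonly retryAfterSeconds: number;

  constructor(retryAfterSeconds: number) {
    super('Server at capacity, retry after backoff');
    this.retryAfterSeconds = retryAfterSeconds;
  }
}

// Run `fn` through the limiter unless too many calls are already waiting.
function runBounded<T>(
  limiter: LimiterLike,
  maxPending: number,
  fn: () => Promise<T>,
): Promise<T> {
  if (limiter.pendingCount >= maxPending) {
    return Promise.reject(new QueueFullError(5));
  }
  return limiter(fn);
}
```

With the limiter from the example above, a handler would call runBounded(dbLimit, 20, () => db.query(...)) instead of dbLimit(...).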

Bounded queue with early rejection

Queuing indefinitely is dangerous: the queue grows without bound, consuming memory, and requests at the back of the queue wait so long that the agent has already timed out and retried — meaning the queued work is obsolete when it finally executes.

Instead, reject requests when the queue depth exceeds a threshold:

class BoundedSemaphore {
  private active = 0;
  // Resolvers for callers waiting on a slot, in FIFO order
  private readonly waiters: Array<() => void> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly maxQueue: number
  ) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      if (this.waiters.length >= this.maxQueue) {
        // Queue is full: reject immediately rather than growing unbounded
        const err = new Error('Server at capacity, retry after backoff');
        (err as any).code = 'BACKPRESSURE_REJECTION';
        (err as any).retryAfterSeconds = 5;
        throw err;
      }
      // Park until release() hands over a slot (no polling). The slot is
      // transferred directly, so `active` is not incremented again on wake-up.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }

    try {
      return await fn();
    } finally {
      this.release();
    }
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // wake the oldest waiter; it inherits this slot
    } else {
      this.active--;
    }
  }

  get stats() {
    return { active: this.active, queued: this.waiters.length };
  }
}

// Size maxConcurrent to your database pool size
// Size maxQueue to ~2x maxConcurrent — brief bursts queue, sustained overload rejects
const semaphore = new BoundedSemaphore(10, 20);
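
If the rejection is thrown inside a tool handler rather than at the HTTP layer, one option is to catch it and return a tool result flagged with isError, so the agent sees the retry hint in the tool output. This is a sketch under that assumption; withBackpressure, Runner, and ToolResult are hypothetical names, and Runner stands in for any object with a run() method like the semaphore above.

```typescript
// Shape of a text-only MCP tool result, as used elsewhere in this guide.
type ToolResult = {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
};

// Anything with a run() method, such as the semaphore above.
interface Runner {
  run<T>(fn: () => Promise<T>): Promise<T>;
}

// Run a handler under the semaphore and turn a backpressure rejection
// into a tool result the agent can read. Other errors are rethrown.
async function withBackpressure(
  semaphore: Runner,
  handler: () => Promise<ToolResult>,
): Promise<ToolResult> {
  try {
    return await semaphore.run(handler);
  } catch (err) {
    const e = err as { code?: string; retryAfterSeconds?: number };
    if (e.code !== 'BACKPRESSURE_REJECTION') throw err;
    const retryAfter = e.retryAfterSeconds ?? 5;
    return {
      isError: true,
      content: [
        { type: 'text', text: `Server at capacity. Retry after ${retryAfter} seconds.` },
      ],
    };
  }
}
```

A tool handler would then return withBackpressure(semaphore, async () => { ... }) in place of calling semaphore.run directly.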

HTTP response codes and headers

When rejecting due to backpressure, use the correct HTTP status and signal the retry delay:

| Situation | Status | Headers | Meaning |
| --- | --- | --- | --- |
| Server at capacity (transient, retry works) | 503 | Retry-After: 5 | Service temporarily unavailable |
| Rate limit per client exceeded | 429 | Retry-After: 60, X-RateLimit-Limit, X-RateLimit-Reset | Too many requests from this client |
| Queue full (server-wide) | 503 | Retry-After: 10 | Load shedding, not caller-specific |

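Express middleware to translate backpressure errors into proper responses. This assumes the semaphore is acquired in the HTTP route, so the rejection reaches Express; an error thrown inside a tool handler is usually caught by the MCP SDK and returned as a tool result instead: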

import type { Request, Response, NextFunction } from 'express';

// Error handler middleware: register after all route handlers
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  if ((err as any).code === 'BACKPRESSURE_REJECTION') {
    const retryAfter = (err as any).retryAfterSeconds ?? 5;
    res
      .status(503)
      .set('Retry-After', String(retryAfter))
      .set('X-Backpressure-Reason', 'queue-full')
      .json({ error: 'server_at_capacity', retryAfter });
    return;
  }
  next(err);
});
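
On the calling side, the Retry-After value has to be turned into a wait. The helper below is a minimal sketch of that step, assuming the header carries either a number of seconds or an HTTP date; retryDelayMs and its fallback default are hypothetical.

```typescript
// Convert a Retry-After header value into a delay in milliseconds.
// Falls back to `fallbackMs` when the header is missing or unparseable.
function retryDelayMs(
  header: string | null | undefined,
  nowMs: number,
  fallbackMs = 1000,
): number {
  if (!header) return fallbackMs;
  const trimmed = header.trim();

  // Form 1: delay in whole seconds, e.g. "5"
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return fallbackMs;
  return Math.max(0, dateMs - nowMs);
}
```

A client using fetch could wait for retryDelayMs(response.headers.get('retry-after'), Date.now()) milliseconds before retrying a 503 or 429.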

Per-client vs global limits

Global concurrency limits protect your backend but do not prevent a single noisy client from consuming all available slots. Combine global and per-client limits:

const globalSemaphore = new BoundedSemaphore(50, 100);
const clientSemaphores = new Map<string, BoundedSemaphore>();

function getClientSemaphore(clientId: string): BoundedSemaphore {
  if (!clientSemaphores.has(clientId)) {
    // Each client gets at most 10 concurrent, queue of 20
    clientSemaphores.set(clientId, new BoundedSemaphore(10, 20));
    // GC stale entries — production code uses an LRU cache here
  }
  return clientSemaphores.get(clientId)!;
}

async function limitedToolCall<T>(clientId: string, fn: () => Promise<T>): Promise<T> {
  // Must acquire both client and global slot
  return getClientSemaphore(clientId).run(() =>
    globalSemaphore.run(fn)
  );
}
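
The comment above mentions an LRU cache for stale entries. A minimal sketch of one, relying on Map preserving insertion order, could look like this; LruCache and its parameters are hypothetical, and a production version would also avoid evicting a semaphore that still has calls in flight, since a fresh semaphore for that client would start with empty counters.

```typescript
// Small LRU keyed by client id. Map preserves insertion order, so the
// first key is always the least recently used one.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;
  private readonly create: (key: string) => V;

  constructor(maxEntries: number, create: (key: string) => V) {
    this.maxEntries = maxEntries;
    this.create = create;
  }

  // Return the value for `key`, creating it on first use.
  get(key: string): V {
    let value = this.entries.get(key);
    if (value === undefined) {
      value = this.create(key);
    } else {
      this.entries.delete(key); // re-inserted below to mark as most recent
    }
    this.entries.set(key, value);

    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
    return value;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

getClientSemaphore could then be backed by new LruCache(1000, () => new BoundedSemaphore(10, 20)), with 1000 as an arbitrary cap on tracked clients.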

Monitoring queue depth

Emit queue depth as a metric so you can alert before backpressure starts rejecting requests:

import { Counter, Gauge } from 'prom-client';

const activeCalls = new Gauge({
  name: 'mcp_active_tool_calls',
  help: 'Number of tool calls currently executing',
  labelNames: ['tool'],
});

const queuedCalls = new Gauge({
  name: 'mcp_queued_tool_calls',
  help: 'Number of tool calls waiting in the backpressure queue',
});

const rejectedCalls = new Counter({
  name: 'mcp_backpressure_rejections_total',
  help: 'Number of tool calls rejected due to backpressure',
  labelNames: ['reason'],
});

// In your semaphore: update gauges on each state transition
// Instrument the semaphore.stats fields and export via /metrics
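
One possible way to wire this up is a wrapper that publishes the semaphore's stats whenever a call is queued, started, finished, or rejected. The sketch below assumes only the prom-client methods Gauge#set(value) and Counter#inc(); GaugeLike, CounterLike, InstrumentedSemaphore, and runInstrumented are hypothetical names.

```typescript
// Minimal shapes matching the prom-client methods used here.
interface GaugeLike { set(value: number): void }
interface CounterLike { inc(): void }

// Anything exposing run() and stats, such as the semaphore in this guide.
interface InstrumentedSemaphore {
  run<T>(fn: () => Promise<T>): Promise<T>;
  readonly stats: { active: number; queued: number };
}

interface Metrics {
  active: GaugeLike;
  queued: GaugeLike;
  rejected: CounterLike;
}

// Run `fn` under the semaphore and refresh the gauges on each transition.
async function runInstrumented<T>(
  semaphore: InstrumentedSemaphore,
  metrics: Metrics,
  fn: () => Promise<T>,
): Promise<T> {
  const publish = () => {
    metrics.active.set(semaphore.stats.active);
    metrics.queued.set(semaphore.stats.queued);
  };
  try {
    const pending = semaphore.run(async () => {
      publish(); // slot acquired
      return fn();
    });
    publish(); // captures queueing, if the call had to wait
    return await pending;
  } catch (err) {
    if ((err as { code?: string }).code === 'BACKPRESSURE_REJECTION') {
      metrics.rejected.inc();
    }
    throw err;
  } finally {
    publish(); // slot released or call rejected
  }
}
```

With the metrics defined above, labelled children can be passed in, for example activeCalls.labels('search_records') and rejectedCalls.labels('queue-full').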

Alert when mcp_queued_tool_calls stays above 0 for more than 30 seconds — it means your server is consistently saturated. Alert when mcp_backpressure_rejections_total rate exceeds 1/minute — it means the queue is filling and clients are being turned away.

AliveMCP external probes detect the downstream symptom: probe response time rises, then probe returns 503. Pair external probing with internal queue depth metrics to distinguish "server is overloaded" from "server is down".

Backpressure and the circuit breaker

Backpressure and circuit breakers are complementary. Backpressure limits how much work enters your server from above. Circuit breakers limit how much work your server sends to dependencies below. Use both.

When a downstream circuit opens, the operations that would have gone there complete faster (with errors), freeing semaphore slots sooner. This makes the system self-regulating under partial failure.
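
To make the interaction concrete, here is a deliberately simplified breaker: it opens after a number of consecutive failures and fails fast until a cooldown passes. It is an illustrative sketch, not a replacement for a full circuit breaker library; CircuitBreaker, CircuitOpenError, and the parameters are hypothetical.

```typescript
class CircuitOpenError extends Error {
  readonly code = 'CIRCUIT_OPEN';
}

class CircuitBreaker {
  private failures = 0;  // consecutive failures since the last success
  private openUntil = 0; // timestamp (ms) until which calls fail fast
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.now() < this.openUntil) {
      // Circuit is open: fail without touching the dependency
      throw new CircuitOpenError('Dependency unavailable, failing fast');
    }
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.openUntil = this.now() + this.cooldownMs;
        this.failures = 0;
      }
      throw err;
    }
  }
}
```

Placing the breaker inside the semaphore, as in semaphore.run(() => breaker.call(() => callExternalApi())) with callExternalApi as a placeholder, means an open circuit returns the slot almost immediately instead of holding it for a full timeout.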
