MCP server retry logic

An MCP tool call that fails once does not have to fail permanently. Networks blip, rate limits reset, external APIs return transient 503s. Retry logic gives transient failures a second chance while keeping the server from hammering a dependency that is genuinely broken. The challenge is deciding which failures are worth retrying, how long to wait between attempts, and how to avoid duplicate side effects when the first attempt succeeded but the response got lost.

TL;DR

Classify errors before retrying: network timeouts and 429/503 responses are retryable; 400 Bad Request and 404 Not Found are not. Use exponential backoff with full jitter — delay = random(0, base × 2ⁿ) — to avoid thundering herds when many sessions hit the same failure simultaneously. Cap the maximum delay and total attempts (3–5 is usually enough). For tool calls with side effects, generate an idempotency key from the session ID and call parameters so retries are safe. Coordinate with your circuit breaker: stop retrying when the breaker is open, because the dependency is known-broken, not transiently failing.

Retryable vs. non-retryable errors

Retrying a non-retryable error wastes time and may cause harm. Classify errors before attempting a retry:

Error type	Retryable?	Reason
Network timeout / ECONNRESET	Yes	Transient — the dependency may recover
HTTP 429 Too Many Requests	Yes (with backoff)	Transient — rate limit window will reset; honour `Retry-After` header if present
HTTP 503 Service Unavailable	Yes	Transient — upstream is overloaded or deploying
HTTP 502 Bad Gateway / 504 Gateway Timeout	Yes	Usually transient: a gateway or proxy could not reach the upstream in time
HTTP 500 Internal Server Error	Sometimes	May be transient; limit to 2 retries before treating as permanent
HTTP 400 Bad Request	No	The request is malformed — retrying will produce the same error
HTTP 401 / 403	No	Auth failure — retrying without new credentials changes nothing
HTTP 404 Not Found	No	The resource does not exist — retrying will not create it
JSON parse error	No	The response body is corrupted — retrying may return the same bad data

// retry.ts
export class RetryableError extends Error {
  constructor(message: string, public readonly retryAfterMs?: number) {
    super(message);
    this.name = 'RetryableError';
  }
}

export function isRetryable(err: unknown): boolean {
  if (err instanceof RetryableError) return true;
  if (err instanceof Error) {
    // Node.js network errors
    if ('code' in err && typeof (err as NodeJS.ErrnoException).code === 'string') {
      const code = (err as NodeJS.ErrnoException).code!;
      return ['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND'].includes(code);
    }
  }
  return false;
}

export function isRetryableStatus(status: number): boolean {
  return status === 429 || status === 503 || status === 502 || status === 504;
}
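The table gives HTTP 500 a smaller retry allowance than the clearly transient statuses, which a boolean check cannot express. A minimal sketch of one way to encode that, assuming hypothetical helpers `maxRetriesForStatus` and `shouldRetryStatus` (neither is part of the listing above) and an assumed default of three retries:

```typescript
// Hypothetical helpers for illustration; the names and the default are assumptions.
const DEFAULT_MAX_RETRIES = 3;

// Returns how many retries the classification table allows for a status code.
export function maxRetriesForStatus(status: number): number {
  switch (status) {
    case 429:
    case 502:
    case 503:
    case 504:
      return DEFAULT_MAX_RETRIES; // clearly transient
    case 500:
      return 2; // "sometimes" transient: treat as permanent after 2 retries
    default:
      return 0; // 400, 401, 403, 404 and anything unknown: do not retry
  }
}

// True if another retry is allowed after `retriesSoFar` retries have already run.
export function shouldRetryStatus(status: number, retriesSoFar: number): boolean {
  return retriesSoFar < maxRetriesForStatus(status);
}

console.log(shouldRetryStatus(500, 1)); // true
console.log(shouldRetryStatus(500, 2)); // false
console.log(shouldRetryStatus(404, 0)); // false
```

Defaulting unknown statuses to zero retries is the conservative choice: a status you have not classified is more likely to be a permanent client error than a transient one.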

Exponential backoff with full jitter

Exponential backoff doubles the wait time after each failure. Full jitter — choosing a random delay between 0 and the computed ceiling — prevents synchronized retries when many sessions fail at the same moment. Without jitter, every caller waits exactly 1 s, then 2 s, then 4 s and fires simultaneously, recreating the original spike.

// retry.ts (continued)
const BASE_DELAY_MS = 200;
const MAX_DELAY_MS = 10_000;
const MAX_ATTEMPTS = 4;

function computeDelay(attempt: number, retryAfterMs?: number): number {
  if (retryAfterMs != null) return retryAfterMs; // honour server hint
  const ceiling = Math.min(BASE_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
  return Math.random() * ceiling; // full jitter: uniform [0, ceiling)
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = MAX_ATTEMPTS,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const retryAfterMs = err instanceof RetryableError ? err.retryAfterMs : undefined;
      if (!isRetryable(err) || attempt === maxAttempts - 1) throw err;
      const delay = computeDelay(attempt, retryAfterMs);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

Use withRetry around external dependency calls, not around entire MCP tool handlers. You want to retry the HTTP request to a flaky upstream API, not re-run all the side effects of the tool.
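To make the schedule concrete, the sketch below lists the jitter ceiling before each retry for a given base, cap, and attempt count. `backoffCeilings` is a hypothetical helper written for this illustration; it reuses the formula from `computeDelay` but omits the random draw so the output is deterministic.

```typescript
// Hypothetical helper: the upper bound of the jittered wait before each retry.
// There is one wait between consecutive attempts, so `attempts - 1` entries.
export function backoffCeilings(baseMs: number, maxMs: number, attempts: number): number[] {
  const ceilings: number[] = [];
  for (let attempt = 0; attempt < attempts - 1; attempt++) {
    ceilings.push(Math.min(baseMs * Math.pow(2, attempt), maxMs));
  }
  return ceilings;
}

console.log(backoffCeilings(200, 10000, 4)); // [ 200, 400, 800 ]
console.log(backoffCeilings(200, 10000, 8)); // [ 200, 400, 800, 1600, 3200, 6400, 10000 ]
```

With the constants above (200 ms base, 4 attempts), the waits add up to at most 1.4 seconds and about 0.7 seconds on average, because each wait is drawn uniformly between 0 and its ceiling. The 10-second cap only comes into play from the seventh wait onward.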

Idempotency keys for safe retries

A retry is only safe if the operation is idempotent — running it twice produces the same result as running it once. For read operations (search, fetch, query) this is usually free. For write operations (create order, send notification, deduct credit) you must make the retry safe explicitly.

The standard mechanism is an idempotency key: a stable identifier for a logical operation. The upstream API stores the result keyed by that identifier and returns the cached result on subsequent calls with the same key, rather than executing the operation again.

// tool-handler.ts — idempotency key derived from session + parameters
import { createHash } from 'crypto';

function idempotencyKey(sessionId: string, toolName: string, params: unknown): string {
  const data = JSON.stringify({ sessionId, toolName, params });
  return createHash('sha256').update(data).digest('hex').slice(0, 32);
}

server.tool('create_notification', notificationSchema, async (params, extra) => {
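  // Caution: with the shared 'unknown' fallback, sessionless callers that send
  // identical params get the same key and will be deduplicated against each other.
  const sessionId = extra.sessionId ?? 'unknown';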
  const key = idempotencyKey(sessionId, 'create_notification', params);

  const result = await withRetry(() =>
    notificationApi.post('/notifications', params, {
      headers: { 'Idempotency-Key': key },
    })
  );
  return { content: [{ type: 'text', text: `Notification created: ${result.id}` }] };
});
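One caveat with hashing `JSON.stringify` output: object keys are serialized in insertion order, so the same parameters supplied in a different key order would produce a different key. A minimal sketch of a canonical serializer, assuming a hypothetical helper `canonicalJson` that could stand in for the `JSON.stringify` call inside `idempotencyKey`, and assuming the params are plain JSON-compatible values:

```typescript
// Recursively rebuilds objects with their keys in sorted order.
// Assumes plain JSON-compatible input (no Date, Map, or class instances).
function sortKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(sortKeys); // array order is meaningful, keep it
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = sortKeys(source[key]);
    }
    return sorted;
  }
  return value;
}

// Hypothetical helper: serializes a value so that key order does not affect the output.
export function canonicalJson(value: unknown): string {
  return JSON.stringify(sortKeys(value));
}

console.log(canonicalJson({ b: 2, a: { d: 1, c: [3, 1] } }));
// {"a":{"c":[3,1],"d":1},"b":2}
```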

If the upstream API does not support idempotency keys, implement deduplication at the database layer: check for an existing record with the same key before inserting. See message queue patterns for durable deduplication when processing jobs.
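The check-before-insert idea can be sketched with an in-memory map standing in for the database table. `IdempotencyStore` is a hypothetical class for illustration; in a real database the equivalent is typically a unique constraint on the key column, so that two concurrent inserts cannot both succeed.

```typescript
// Illustration only: a Map stands in for a durable table keyed by idempotency key.
export class IdempotencyStore<T> {
  private readonly results = new Map<string, Promise<T>>();

  // Runs `operation` at most once per key; later calls reuse the stored outcome.
  run(key: string, operation: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key);
    if (existing) return existing; // duplicate: return the first call's result

    const pending = operation().catch(err => {
      // The operation failed, so forget the key and let a later retry execute it again.
      this.results.delete(key);
      throw err;
    });
    this.results.set(key, pending);
    return pending;
  }
}
```

Storing the promise rather than the resolved value means a retry that arrives while the first attempt is still in flight also reuses that attempt instead of starting a second one.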

Coordinating retries with circuit breakers

Retries and circuit breakers solve different problems but must coordinate. A circuit breaker tracks error rates across all callers and opens when the dependency is known-broken. Retrying once the circuit has opened wastes time: the breaker fails the call immediately, each retry fails just as quickly, and you have burned three fast failures instead of one.

The correct integration: check the breaker state before retrying. If the breaker is open, surface the error immediately rather than spending retry budget on guaranteed failures.

// retry-with-breaker.ts
import CircuitBreaker from 'opossum';
import { withRetry } from './retry'; // assumes retry.ts from the earlier listing sits next to this file

export function createRetryableBreaker<T>(
  fn: (...args: unknown[]) => Promise<T>,
  breakerOptions: CircuitBreaker.Options,
): CircuitBreaker {
  // Wrap the retryable function in the breaker.
  // The breaker counts a call as failed only after all retry attempts are exhausted.
  const retryingFn = (...args: unknown[]) => withRetry(() => fn(...args));
  const breaker = new CircuitBreaker(retryingFn, breakerOptions);
  // Note: opossum runs the fallback whenever a fired call fails, not only when the
  // circuit is open, so this message can also appear after retries are exhausted.
  breaker.fallback(() => ({
    isError: true,
    reason: 'service_unavailable',
    message: 'dependency circuit open — retry later',
  }));
  return breaker;
}

Wrap the retrying function in the breaker, not the other way round. This way the breaker sees the final outcome after all retry attempts, preventing false-positive circuit opens from transient single-attempt failures.
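The effect of this ordering can be seen in a small simulation. Everything below is a toy written for illustration: `ToyBreaker` is not opossum and has no reset or half-open logic, and `retryThreeTimes` is a simplified stand-in for `withRetry` with no backoff.

```typescript
// Toy breaker for illustration only: opens after N consecutive failed fires.
class ToyBreaker<T> {
  private consecutiveFailures = 0;
  private readonly action: () => Promise<T>;
  private readonly failureThreshold: number;

  constructor(action: () => Promise<T>, failureThreshold: number) {
    this.action = action;
    this.failureThreshold = failureThreshold;
  }

  async fire(): Promise<T> {
    if (this.consecutiveFailures >= this.failureThreshold) {
      throw new Error('circuit open'); // fail fast: the wrapped action never runs
    }
    try {
      const result = await this.action();
      this.consecutiveFailures = 0;
      return result;
    } catch (err) {
      this.consecutiveFailures++; // one failure per fire(), however many retries ran inside
      throw err;
    }
  }
}

// Simplified stand-in for withRetry: three immediate attempts, no backoff.
async function retryThreeTimes<T>(fn: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Fires the breaker `fires` times against an upstream that always fails
// and reports how many requests actually reached the upstream.
export async function countUpstreamCalls(fires: number): Promise<number> {
  let upstreamCalls = 0;
  const alwaysFails = async (): Promise<string> => {
    upstreamCalls++;
    throw new Error('upstream down');
  };
  const breaker = new ToyBreaker(() => retryThreeTimes(alwaysFails), 2);
  for (let i = 0; i < fires; i++) {
    await breaker.fire().catch(() => undefined); // ignore the error, only count calls
  }
  return upstreamCalls;
}
```

`countUpstreamCalls(4)` and `countUpstreamCalls(10)` both resolve to 6: the first two fires each spend three attempts, the breaker then opens, and every later fire fails fast without touching the upstream.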

Honouring Retry-After from rate-limited APIs

Many external APIs return a Retry-After header alongside a 429 response, specifying either a delay in seconds or an absolute date. Ignoring this and using your own backoff schedule is rude — it hammers the API during the window it told you to wait, which can get your IP blocked.

// http-client.ts — parse Retry-After and propagate to RetryableError
import { RetryableError, isRetryableStatus } from './retry'; // assumes retry.ts from the earlier listing
async function fetchWithRateLimit(url: string, options: RequestInit): Promise<Response> {
  const res = await fetch(url, options);
  if (res.status === 429) {
    const retryAfter = res.headers.get('Retry-After');
    let retryAfterMs = 5000; // default 5s
    if (retryAfter) {
      const seconds = parseInt(retryAfter, 10);
      if (!isNaN(seconds)) {
        retryAfterMs = seconds * 1000;
      } else {
        // Retry-After as HTTP-date
        const date = new Date(retryAfter);
        if (!isNaN(date.getTime())) {
          retryAfterMs = Math.max(0, date.getTime() - Date.now());
        }
      }
    }
    throw new RetryableError(`Rate limited by ${url}`, retryAfterMs);
  }
  if (!res.ok && isRetryableStatus(res.status)) {
    throw new RetryableError(`HTTP ${res.status} from ${url}`);
  }
  return res;
}
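The header parsing is easier to unit test when it is pulled out into a pure function. A minimal sketch, assuming a hypothetical helper `parseRetryAfterMs` that takes the current time as a parameter; unlike the `parseInt` approach above it only accepts all-digit values as seconds, so a value such as `5abc` is not read as 5.

```typescript
// Hypothetical helper: converts a Retry-After header value into a wait in milliseconds.
// Returns undefined when the header is absent or unparseable, so the caller can
// fall back to its own default.
export function parseRetryAfterMs(header: string | null, nowMs: number): number | undefined {
  if (header === null) return undefined;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000; // delay-seconds form
  const dateMs = Date.parse(value); // HTTP-date form
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

console.log(parseRetryAfterMs('120', Date.now())); // 120000
console.log(parseRetryAfterMs(null, Date.now())); // undefined
```

Inside `fetchWithRateLimit` this could be called as `parseRetryAfterMs(res.headers.get('Retry-After'), Date.now()) ?? 5000` to keep the existing 5-second default.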

Retry budgets across concurrent sessions

MCP servers handle many concurrent sessions. If 50 sessions all start retrying the same flaky dependency simultaneously, you generate 50 × 4 = 200 upstream requests in rapid succession — the opposite of throttling. A shared retry budget caps the total inflight retry attempts across all sessions at once.

A simple implementation uses a semaphore to cap concurrent retries per dependency:

// semaphore.ts
export class Semaphore {
  private queue: Array<() => void> = [];
  private running = 0;

  constructor(private max: number) {}

  async acquire(): Promise<() => void> {
    if (this.running < this.max) {
      this.running++;
      return () => this.release();
    }
    return new Promise(resolve => {
      this.queue.push(() => {
        this.running++;
        resolve(() => this.release());
      });
    });
  }

  private release(): void {
    this.running--;
    const next = this.queue.shift();
    if (next) next();
  }
}

// deps.ts — one semaphore per external dependency
export const searchApiSemaphore = new Semaphore(5); // max 5 concurrent retrying calls

// In tool handler:
const release = await searchApiSemaphore.acquire();
try {
  return await withRetry(() => callSearchApi(params.query));
} finally {
  release();
}

For a simpler alternative, use the rate-limiting patterns already built into your dependency infrastructure (Redis token bucket, API gateway throttle) rather than implementing session-level retry budgets from scratch.
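For reference, the token-bucket idea can also be sketched in process. `RetryBudget` is a hypothetical class, not part of the listings above: each retry (not each first attempt) spends one token, and tokens refill at a fixed rate. The injectable clock is an assumption made to keep the sketch testable.

```typescript
// Hypothetical in-memory token bucket that rations retries for one dependency.
export class RetryBudget {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start with a full budget
    this.lastRefillMs = now();
  }

  // Returns true and spends a token if a retry is allowed right now.
  tryAcquire(): boolean {
    const nowMs = this.now();
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A retry loop would call `tryAcquire()` before each backoff sleep and rethrow the error when it returns false, so a dependency-wide failure degrades into single attempts instead of a multiplied load. This version is per process; sharing a budget across several server instances needs an external store such as the Redis bucket mentioned above.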

Observability: logging what retried

Silent retries make debugging hard. Log each retry attempt with structured fields so you can correlate a tool call that eventually succeeded with the three preceding failures in your log aggregator:

// retry.ts — instrumented version
// Minimal logger shape assumed by this example; substitute your own structured logger.
interface Logger {
  info(event: string, fields: Record<string, unknown>): void;
  warn(event: string, fields: Record<string, unknown>): void;
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  context: { toolName: string; sessionId: string; logger: Logger },
  maxAttempts = MAX_ATTEMPTS,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const result = await fn();
      if (attempt > 0) {
        context.logger.info('retry_succeeded', {
          tool: context.toolName,
          session: context.sessionId,
          attempt,
        });
      }
      return result;
    } catch (err) {
      context.logger.warn('tool_attempt_failed', {
        tool: context.toolName,
        session: context.sessionId,
        attempt,
        error: err instanceof Error ? err.message : String(err),
        willRetry: isRetryable(err) && attempt < maxAttempts - 1,
      });
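      if (!isRetryable(err) || attempt === maxAttempts - 1) throw err;
      // Keep honouring the server's Retry-After hint, as the uninstrumented version does.
      const retryAfterMs = err instanceof RetryableError ? err.retryAfterMs : undefined;
      await new Promise(resolve => setTimeout(resolve, computeDelay(attempt, retryAfterMs)));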
    }
  }
  throw new Error('unreachable');
}

AliveMCP's transport-layer probe shows you when an MCP server is alive but does not distinguish transient failures from permanent ones. Structured retry logs in your application layer, correlated with AliveMCP's uptime timeline, help you determine whether the spike of errors you're seeing is a transient blip that retry logic handled or a hard failure requiring a page.

Timeout and retry interaction

A retry without a timeout can produce unbounded latency: the failing call hangs for 30 seconds, the retry hangs for another 30 seconds, and the MCP client receives an error 60+ seconds later. Set a per-attempt timeout shorter than the overall MCP tool call budget.

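If your MCP transport has a 30-second overall timeout for tool calls, a practical budget might be 4 attempts with a 5-second per-attempt timeout, separated by 3 backoff waits of up to 3 seconds each: 4 × 5 s + 3 × 3 s = 29 seconds worst case. That fits with only a second to spare, so trim the per-attempt timeout or the attempt count if the handler does any other work. Keep each attempt's timeout well below the overall budget to leave room for backoff delays.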
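The budget arithmetic and the per-attempt timeout can be sketched as two small helpers. Both `worstCaseLatencyMs` and `withTimeout` are hypothetical names introduced here. Note that there is one backoff wait between consecutive attempts, so N attempts involve N − 1 waits.

```typescript
// Worst-case latency: every attempt times out and every backoff wait hits its maximum.
export function worstCaseLatencyMs(
  attempts: number,
  perAttemptTimeoutMs: number,
  maxBackoffMs: number,
): number {
  return attempts * perAttemptTimeoutMs + Math.max(0, attempts - 1) * maxBackoffMs;
}

// Rejects if `fn` has not settled within `timeoutMs`.
// This only stops waiting; it does not cancel the underlying work.
export function withTimeout<T>(fn: () => Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    fn().then(
      value => { clearTimeout(timer); resolve(value); },
      err => { clearTimeout(timer); reject(err); },
    );
  });
}

console.log(worstCaseLatencyMs(4, 5000, 3000)); // 29000
```

To combine this with `withRetry`, the timeout would need to surface as an error that `isRetryable` accepts, for example by rejecting with the guide's `RetryableError` instead of a plain `Error`. For `fetch` calls, passing an `AbortSignal` is usually preferable because it also cancels the request.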

For longer background operations, consider the queue-and-return pattern instead: return a job_id immediately and let the client poll for completion, removing the tool-call timeout constraint entirely.
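A minimal sketch of queue-and-return, assuming a hypothetical in-memory `JobStore`. A production version would use a durable queue and unguessable job IDs; the map and the sequential IDs here are only for illustration.

```typescript
type Job<T> =
  | { status: 'pending' }
  | { status: 'done'; result: T }
  | { status: 'failed'; error: string };

// Illustration only: jobs live in memory and are lost on restart.
export class JobStore<T> {
  private readonly jobs = new Map<string, Job<T>>();
  private nextId = 1;

  // Starts the work in the background and returns a job ID immediately.
  submit(work: () => Promise<T>): string {
    const jobId = `job-${this.nextId++}`;
    this.jobs.set(jobId, { status: 'pending' });
    work().then(
      result => { this.jobs.set(jobId, { status: 'done', result }); },
      err => {
        const error = err instanceof Error ? err.message : String(err);
        this.jobs.set(jobId, { status: 'failed', error });
      },
    );
    return jobId;
  }

  // Returns the current state of a job, or undefined for an unknown ID.
  poll(jobId: string): Job<T> | undefined {
    return this.jobs.get(jobId);
  }
}
```

One tool handler would call `submit` and return the `job_id` in its response; a second tool would call `poll` with that ID. Retries and backoff then happen inside `work`, outside any tool-call timeout.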

Monitoring retry health with AliveMCP

Retry logic is working correctly when retries are infrequent, succeed on the second or third attempt, and do not mask genuine outages. Monitoring helps distinguish healthy retry activity from a retry storm that indicates a deeper problem.

Expose retry metrics in your health_check tool:

Total retry attempts in the last 5 minutes
Retry success rate (retries that eventually succeeded vs. exhausted all attempts)
Per-dependency retry counts — a spike in retries for one dependency while others are clean points to that dependency's instability
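The three metrics above can be collected with a small sliding-window recorder. `RetryMetrics` and its event names are hypothetical; the sketch assumes the retry wrapper calls `record` on each retry attempt and once more when a retried call finally succeeds or gives up.

```typescript
type RetryOutcome = 'attempt' | 'succeeded' | 'exhausted';

interface RetryEvent {
  atMs: number;
  dependency: string;
  outcome: RetryOutcome;
}

export interface RetrySnapshot {
  totalRetryAttempts: number;
  retrySuccessRate: number | null; // null when no retried call finished in the window
  perDependency: Record<string, number>;
}

// Hypothetical in-memory recorder covering the last `windowMs` milliseconds.
export class RetryMetrics {
  private events: RetryEvent[] = [];
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(windowMs: number = 5 * 60 * 1000, now: () => number = Date.now) {
    this.windowMs = windowMs;
    this.now = now;
  }

  record(dependency: string, outcome: RetryOutcome): void {
    this.events.push({ atMs: this.now(), dependency, outcome });
  }

  snapshot(): RetrySnapshot {
    const cutoff = this.now() - this.windowMs;
    this.events = this.events.filter(e => e.atMs >= cutoff); // drop expired events
    const perDependency: Record<string, number> = {};
    let attempts = 0;
    let succeeded = 0;
    let exhausted = 0;
    for (const e of this.events) {
      if (e.outcome === 'attempt') {
        attempts++;
        perDependency[e.dependency] = (perDependency[e.dependency] ?? 0) + 1;
      } else if (e.outcome === 'succeeded') {
        succeeded++;
      } else {
        exhausted++;
      }
    }
    const finished = succeeded + exhausted;
    return {
      totalRetryAttempts: attempts,
      retrySuccessRate: finished === 0 ? null : succeeded / finished,
      perDependency,
    };
  }
}
```

The `health_check` tool could return `snapshot()` as part of its payload, which gives a synthetic probe something concrete to alert on.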

AliveMCP probes the MCP transport layer — it can detect that a server is unreachable or not responding to initialize. Pair that with a synthetic tool call probe targeting your health_check tool to surface retry rate anomalies before users see errors. See the MCP Server Resilience and Configurability Guide for the full picture of how retry logic, circuit breakers, feature flags, and configuration validation work together as a resilience layer.