Guide · Rate Limiting

MCP server backoff guidance

Rate limiting only works as intended when callers retry intelligently. An LLM agent that retries a rate-limited endpoint every 100ms, or a group of agents that all retry on the same schedule, keeps the server under load and defeats the purpose of the limit. The server's job is to emit enough information in the error response for a caller to compute an appropriate wait time. The caller's job is to implement exponential backoff with jitter rather than fixed-interval retries.

TL;DR

Include a retry_after_ms field in every rate-limit error payload. Callers should wait for retry_after_ms when the server provides it, and otherwise use exponential backoff with full jitter: Math.random() * Math.min(cap, base * 2 ** attempt), with a cap of 30 seconds. Distinguish rate_limited (retryable after backoff) and server_overloaded (retryable after a short wait with full jitter) from permanent errors such as invalid_arguments, not_found, and permission_denied, which should never be retried.

Backoff-relevant error types in MCP

MCP uses isError: true for all tool-level failures — there is no built-in distinction between a permanent error and a transient one. Your server must encode that distinction in the error payload so callers can decide whether to retry.

Error type	Should caller retry?	Backoff strategy	Example
`rate_limited`	Yes	Exponential + jitter, respect `retry_after_ms`	Token bucket empty
`server_overloaded`	Yes	Short fixed wait + full jitter	Request queue full
`transient_error`	Yes	Exponential + jitter, up to 3 attempts	Database connection timeout
`upstream_error`	Maybe	Exponential + jitter, up to 2 attempts; give up if upstream is down	External API returned 503
`invalid_arguments`	No	None — the same call will fail again	Missing required field
`not_found`	No	None	Resource doesn't exist
`permission_denied`	No	None — credentials won't change between retries	Insufficient scope
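
The table above can be carried into caller code as a lookup. The following is a minimal sketch, not a prescribed API: `RetryPolicy`, `RETRY_POLICIES`, and `policyFor` are hypothetical names, the attempt limits for `rate_limited` and `server_overloaded` are assumed values (the table gives none), and unknown error types are treated as permanent.

```typescript
type ErrorType =
  | 'rate_limited' | 'server_overloaded' | 'transient_error' | 'upstream_error'
  | 'invalid_arguments' | 'not_found' | 'permission_denied';

interface RetryPolicy {
  retry: boolean;
  maxAttempts: number; // total attempts allowed; 1 means "do not retry"
}

const RETRY_POLICIES: Record<ErrorType, RetryPolicy> = {
  rate_limited:      { retry: true,  maxAttempts: 5 }, // assumed limit
  server_overloaded: { retry: true,  maxAttempts: 5 }, // assumed limit
  transient_error:   { retry: true,  maxAttempts: 3 }, // from the table
  upstream_error:    { retry: true,  maxAttempts: 2 }, // from the table
  invalid_arguments: { retry: false, maxAttempts: 1 },
  not_found:         { retry: false, maxAttempts: 1 },
  permission_denied: { retry: false, maxAttempts: 1 },
};

const NO_RETRY: RetryPolicy = { retry: false, maxAttempts: 1 };

// Look up the policy for an error type read from a parsed payload.
// Anything missing or unrecognized falls back to "do not retry".
function policyFor(errorType: unknown): RetryPolicy {
  if (
    typeof errorType === 'string' &&
    Object.prototype.hasOwnProperty.call(RETRY_POLICIES, errorType)
  ) {
    return RETRY_POLICIES[errorType as ErrorType];
  }
  return NO_RETRY;
}
```

Defaulting unknown types to "do not retry" is the conservative choice: a caller that meets an error type it has never seen surfaces it instead of looping.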

Encoding retry hints in error payloads

The MCP protocol does not define a standard retry-after field — you must define your own convention in the content[].text payload. Use a consistent JSON structure so callers can parse it reliably.

// src/rate-limit/error-response.ts
export interface RateLimitError {
  error: 'rate_limited' | 'server_overloaded' | 'transient_error';
  message: string;
  retry_after_ms: number;     // how long the caller should wait before retrying
  retry_after_iso: string;    // ISO 8601 timestamp when the limit resets (human-readable)
  retryable: true;
}

export function makeRateLimitError(retryAfterMs: number): RateLimitError {
  return {
    error: 'rate_limited',
    message: `Rate limit exceeded. Retry after ${Math.ceil(retryAfterMs / 1000)} seconds.`,
    retry_after_ms: retryAfterMs,
    retry_after_iso: new Date(Date.now() + retryAfterMs).toISOString(),
    retryable: true,
  };
}

// In your CallToolRequestSchema handler:
if (!rateLimiter.allow(toolName)) {
  // Calculate how long until the bucket has at least 1 token
  const refillTimeMs = rateLimiter.msUntilToken(toolName); // implement in your limiter
  return {
    content: [{ type: 'text', text: JSON.stringify(makeRateLimitError(refillTimeMs)) }],
    isError: true,
  };
}

Add msUntilToken() to your token bucket to calculate the exact wait time:

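// Add to TokenBucket class.
// Assumes this.refillRate is expressed in tokens per second and that refill()
// tops up this.tokens based on elapsed time. A per-tool limiter such as
// rateLimiter.msUntilToken(toolName) would look up the bucket for that tool
// and delegate to this method.
msUntilToken(): number {
  this.refill();
  if (this.tokens >= 1) return 0;
  const deficit = 1 - this.tokens; // fraction of a token still missing
  return Math.ceil((deficit / this.refillRate) * 1000); // seconds to ms, rounded up
}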
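
For context, here is a minimal sketch of a token bucket that such a method could live in. It is an illustration under stated assumptions rather than the limiter this guide builds on: the field names, the tokens-per-second unit for `refillRate`, and the injectable `now` clock are all assumptions, and it models a single bucket rather than one per tool.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillRate: number; // tokens per second (assumed unit)
  private readonly now: () => number;  // injectable clock, convenient in tests

  constructor(capacity: number, refillRate: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.now = now;
    this.tokens = capacity; // start full
    this.lastRefill = now();
  }

  // Add the tokens earned since the last refill, never exceeding capacity.
  private refill(): void {
    const current = this.now();
    const elapsedSec = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = current;
  }

  // Consume one token if available; return false when the caller is limited.
  allow(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  // Milliseconds until at least one whole token is available.
  msUntilToken(): number {
    this.refill();
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) / this.refillRate) * 1000);
  }
}
```

With a capacity of 2 and a refill rate of 2 tokens per second, an empty bucket reports a 500ms wait, which is the value that would go into `retry_after_ms`.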

Exponential backoff with jitter — client-side implementation

Callers should implement exponential backoff with full jitter. "Full jitter" means the wait time is a random value between 0 and the capped exponential delay, rather than the exponential delay itself. This prevents a thundering herd: if 100 clients all hit the same rate limit at the same time and all wait exactly 2 seconds, they will all hammer the server simultaneously again 2 seconds later. With full jitter, their retries spread out across the 0–2 second window.

// Backoff utility for MCP tool callers
interface BackoffOptions {
  baseMs?: number;      // initial wait before jitter (default 200ms)
  maxMs?: number;       // cap on each individual wait, not on the total (default 30s)
  maxAttempts?: number; // total attempts including the first call (default 5)
  jitterType?: 'full' | 'equal' | 'decorrelated';
}

function computeBackoff(attempt: number, options: BackoffOptions = {}): number {
  const { baseMs = 200, maxMs = 30_000, jitterType = 'full' } = options;
  const exponential = baseMs * Math.pow(2, attempt);
  const capped = Math.min(maxMs, exponential);

  switch (jitterType) {
    case 'full':
      // Random between 0 and the capped exponential — best for thundering herd prevention
      return Math.random() * capped;
    case 'equal':
      // Half deterministic, half random — retries are not too short and not synchronized
      return capped / 2 + Math.random() * (capped / 2);
    case 'decorrelated':
      // Decorrelated jitter derives each wait from the previous wait, so it needs
      // state that this stateless function does not track (see the AWS blog).
      // Falls back to full jitter here.
      return Math.random() * capped;
  }
}

async function callToolWithRetry(
  client: MCPClient,
  toolName: string,
  args: Record<string, unknown>,
  options: BackoffOptions = {}
): Promise<MCPToolResult> {
  const { maxAttempts = 5 } = options;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await client.callTool({ name: toolName, arguments: args });

    if (!result.isError) return result;

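    // Parse the error payload. Anything that is not a JSON object
    // (plain text, null, a bare string) is treated as non-retryable.
    let errorPayload: Record<string, unknown> = {};
    try {
      const text = (result.content as Array<{ type: string; text?: string }>)
        .find(c => c.type === 'text')?.text ?? '{}';
      const parsed: unknown = JSON.parse(text);
      if (parsed !== null && typeof parsed === 'object') {
        errorPayload = parsed as Record<string, unknown>;
      }
    } catch { /* non-JSON error text: leave errorPayload empty */ }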

    // Don't retry non-retryable errors
    if (errorPayload.retryable !== true) {
      throw new Error(`Tool error (non-retryable): ${JSON.stringify(errorPayload)}`);
    }

    if (attempt === maxAttempts - 1) {
      throw new Error(`Tool failed after ${maxAttempts} attempts: ${JSON.stringify(errorPayload)}`);
    }

    // Use server-provided retry_after_ms if available, else compute our own backoff
    const serverHint = typeof errorPayload.retry_after_ms === 'number'
      ? errorPayload.retry_after_ms
      : null;
    const backoffMs = serverHint !== null
      ? serverHint + Math.random() * 200  // add 0–200ms jitter to server hint
      : computeBackoff(attempt, options);

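    // The error may be any retryable type, not only rate_limited, so log the actual type
    console.log(`Tool '${toolName}' failed with retryable error '${String(errorPayload.error)}'. Retrying in ${Math.round(backoffMs)}ms (attempt ${attempt + 1}/${maxAttempts})`);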
    await new Promise(resolve => setTimeout(resolve, backoffMs));
  }

  throw new Error('unreachable');
}

Jitter comparison

Not all jitter strategies perform equally under load. The table below compares three common jitter strategies against a no-jitter baseline for MCP clients that might have multiple agents calling the same server simultaneously.

Strategy	Formula	Thundering herd protection	Min wait	Best for
No jitter	`min(cap, base × 2^n)`	None: all clients retry at the same time	`min(cap, base × 2^n)` (deterministic)	Single caller only
Full jitter	`random(0, min(cap, base × 2^n))`	Excellent: uniform spread	0ms (can be immediate)	Multiple concurrent callers, thundering herd risk
Equal jitter	`min(cap, base × 2^n) / 2 + random(0, min(cap, base × 2^n) / 2)`	Good: minimum guaranteed wait	`min(cap, base × 2^n) / 2`	When 0ms retries are undesirable
Decorrelated	`min(cap, random(base, prev × 3))`	Good: each client follows a different path	base	Long-lived agent loops with many retries

For most MCP use cases, full jitter is the right default. Its one drawback is that a wait can land close to 0ms, so some clients retry almost immediately. Prefer equal jitter when near-immediate retries are undesirable, for example when the limit cannot have reset yet: it guarantees a wait of at least half the capped exponential delay while still spreading clients across the other half.
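
Decorrelated jitter is the one strategy in the table that needs state, because each wait is drawn from a range based on the previous wait. A minimal sketch follows, assuming the formula from the table with the cap applied. `nextDecorrelatedWait` and `decorrelatedSchedule` are hypothetical names, and the `random` parameter exists only so the function can be tested deterministically.

```typescript
// Compute the next wait from the previous one: min(cap, random(base, prev * 3)).
function nextDecorrelatedWait(
  prevMs: number,
  baseMs: number = 200,
  maxMs: number = 30_000,
  random: () => number = Math.random,
): number {
  // Guard against prevMs * 3 falling below baseMs so the range is never inverted
  const upper = Math.max(baseMs, prevMs * 3);
  const wait = baseMs + random() * (upper - baseMs);
  return Math.min(maxMs, wait);
}

// Build the sequence of waits for a retry loop, seeding the first draw with baseMs.
function decorrelatedSchedule(
  retries: number,
  baseMs: number = 200,
  maxMs: number = 30_000,
  random: () => number = Math.random,
): number[] {
  const waits: number[] = [];
  let prev = baseMs;
  for (let i = 0; i < retries; i++) {
    prev = nextDecorrelatedWait(prev, baseMs, maxMs, random);
    waits.push(prev);
  }
  return waits;
}
```

A caller keeps the previous wait in a variable across iterations of its retry loop and passes it back in, which is why this strategy does not fit a stateless helper that only receives the attempt number.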

Communicating retry guidance in tool descriptions

LLM clients that receive an isError: true response may decide to retry on their own without consulting your retry-after hint. Guide the model explicitly in the tool description:

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: 'search_documents',
      description: [
        'Search documents by keyword. Returns matching document excerpts.',
        'If this tool returns an error with "error":"rate_limited", wait for the',
        'number of milliseconds specified in "retry_after_ms" before calling again.',
        'Do not retry more than 3 times.',
      ].join(' '),
      inputSchema: { /* ... */ },
    },
  ],
}));

This is belt-and-suspenders guidance: the retry wrapper in the calling code handles most retries automatically, and the tool description keeps the model itself from hammering the tool in a loop when no wrapper is in place.

Related questions

How does MCP handle retry differently from HTTP 429?

HTTP 429 Too Many Requests has a standard Retry-After header that clients can parse. MCP tool errors have no equivalent standard field — the isError: true flag only signals that the call failed, not why or when to retry. You must encode retry guidance in your content[].text JSON payload and document your convention so callers can parse it. This is why a consistent error structure (always JSON, always with error, retryable, and retry_after_ms fields) is important — it lets any caller consume your hints without server-specific logic.
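
One way to keep the structure consistent is to build every error result through the same pair of helpers. This is a sketch under assumptions, not part of MCP: `ToolErrorPayload`, `retryableError`, and `permanentError` are hypothetical names, and using `retry_after_ms: null` for permanent errors is one possible convention for keeping the field always present.

```typescript
// Every error payload carries the same four fields, whether retryable or not.
type ToolErrorPayload =
  | { error: string; message: string; retryable: true; retry_after_ms: number }
  | { error: string; message: string; retryable: false; retry_after_ms: null };

// Wrap a payload in the tool-result shape used throughout this guide.
function toErrorResult(payload: ToolErrorPayload) {
  return {
    content: [{ type: 'text' as const, text: JSON.stringify(payload) }],
    isError: true as const,
  };
}

// Transient failure: the caller may retry after the given delay.
function retryableError(error: string, message: string, retryAfterMs: number) {
  return toErrorResult({ error, message, retryable: true, retry_after_ms: retryAfterMs });
}

// Permanent failure: the same call will fail again, so no retry hint is given.
function permanentError(error: string, message: string) {
  return toErrorResult({ error, message, retryable: false, retry_after_ms: null });
}
```

For example, `permanentError('invalid_arguments', 'Missing required field: query')` produces a payload that a generic caller can reject without knowing anything else about the server.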

Should I retry on all isError: true responses?

No. isError: true covers both permanent errors (invalid arguments, resource not found, permission denied) and transient errors (rate limit, server overload, upstream timeout). Retrying on a permanent error wastes calls and delays the caller from surfacing the error to the user. Always inspect the error field in the payload. Only retry when retryable: true is explicitly set, or when the error type is in your known-retryable list.

What cap should I use for max backoff?

30 seconds is a reasonable default for interactive MCP use: a user waiting more than 30 seconds for a tool retry will perceive the system as broken. For background agent tasks such as an ETL job, a cap of 2–5 minutes is fine. Always pair the cap with a maximum attempt count (5–10 attempts) so the caller eventually gives up and surfaces the error rather than retrying indefinitely.
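
To see how a cap and an attempt count combine, it helps to add up the largest possible waits. The sketch below assumes the retry loop shown earlier in this guide, where a wait follows every failed attempt except the last. `backoffCeilings` and `worstCaseTotalMs` are hypothetical helper names, and the calculation ignores any server-provided `retry_after_ms`.

```typescript
// Upper bound of each wait: min(cap, base * 2^attempt).
// There are maxAttempts - 1 waits because the final failure is not followed by one.
function backoffCeilings(baseMs: number, maxMs: number, maxAttempts: number): number[] {
  const ceilings: number[] = [];
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    ceilings.push(Math.min(maxMs, baseMs * Math.pow(2, attempt)));
  }
  return ceilings;
}

// Longest total time a caller could spend waiting before giving up.
function worstCaseTotalMs(baseMs: number, maxMs: number, maxAttempts: number): number {
  return backoffCeilings(baseMs, maxMs, maxAttempts).reduce((sum, ms) => sum + ms, 0);
}
```

With a 200ms base, a 30-second cap, and 5 attempts, the waits are bounded by 200, 400, 800, and 1600ms, for a worst case of 3 seconds in total. Raising the limit to 10 attempts brings the worst case to 81 seconds, with the cap first applying on the ninth wait. Under full jitter the expected total is half the worst case.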

How do I test that my backoff implementation is correct?

Write a test that simulates a rate-limited server returning retry_after_ms: 500 on the first two calls and success on the third. Assert that the caller waited at least 500ms after each failed call and eventually received the successful result. Also test the max-attempts path: a server that always returns rate-limit errors should cause the caller to give up after maxAttempts attempts, not loop forever. Use vi.useFakeTimers() in Vitest or jest.useFakeTimers() in Jest to speed up the test without sleeping.
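
An alternative to fake timers is to inject the sleep function, which lets the same two scenarios run without any test framework. The sketch below is a simplified stand-in for the retry wrapper in this guide, not a copy of it: `retryWithHints`, `rateLimited`, and `runBackoffChecks` are hypothetical names, and the loop only honors `retry_after_ms` with no jitter so that the recorded waits are exact.

```typescript
type ToolResult = { isError?: boolean; content: Array<{ type: string; text?: string }> };
type CallFn = () => Promise<ToolResult>;
type SleepFn = (ms: number) => Promise<void>;

// Simplified retry loop: waits for the server hint, gives up after maxAttempts calls.
async function retryWithHints(call: CallFn, maxAttempts: number, sleep: SleepFn): Promise<ToolResult> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await call();
    if (!result.isError) return result;
    const payload = JSON.parse(result.content[0]?.text ?? '{}');
    if (payload.retryable !== true || attempt === maxAttempts - 1) {
      throw new Error(`giving up: ${String(payload.error)}`);
    }
    await sleep(payload.retry_after_ms);
  }
  throw new Error('unreachable');
}

// Fake server response carrying a retry hint.
function rateLimited(ms: number): ToolResult {
  const payload = { error: 'rate_limited', retryable: true, retry_after_ms: ms };
  return { isError: true, content: [{ type: 'text', text: JSON.stringify(payload) }] };
}

// Runs both scenarios and returns a list of failure messages (empty means all passed).
async function runBackoffChecks(): Promise<string[]> {
  const failures: string[] = [];
  const waits: number[] = [];
  const recordSleep: SleepFn = async ms => { waits.push(ms); }; // records instead of sleeping

  // Scenario 1: two rate-limit errors, then success
  let calls = 0;
  const flaky: CallFn = async () =>
    ++calls <= 2 ? rateLimited(500) : { content: [{ type: 'text', text: 'ok' }] };
  const result = await retryWithHints(flaky, 5, recordSleep);
  if (result.isError) failures.push('expected eventual success');
  if (waits.length !== 2 || waits.some(w => w < 500)) failures.push('expected two waits of at least 500ms');

  // Scenario 2: a server that never recovers must not be retried forever
  let alwaysCalls = 0;
  const alwaysLimited: CallFn = async () => { alwaysCalls++; return rateLimited(500); };
  let gaveUp = false;
  try { await retryWithHints(alwaysLimited, 3, recordSleep); } catch { gaveUp = true; }
  if (!gaveUp || alwaysCalls !== 3) failures.push('expected give-up after exactly 3 attempts');

  return failures;
}
```

Injecting `sleep` keeps the test synchronous in wall-clock terms and makes the wait durations directly observable. Fake timers remain the better fit when the wrapper calls `setTimeout` internally and cannot be changed.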