MCP server client throttling

Server-level rate limits protect your infrastructure from being overwhelmed in aggregate, but they don't prevent a single bad actor — a runaway agent loop, a misconfigured LLM client, or a deliberate abuser — from consuming all available capacity. Per-client throttling gives each caller its own independent rate budget so one aggressive session cannot starve the others, regardless of how many concurrent clients are connected.

TL;DR

Identify each MCP client by a stable key (session ID > API key > IP address), maintain a Map<clientKey, TokenBucket> with TTL-based eviction, and check the bucket in your CallToolRequestSchema handler before dispatching. Apply a penalty multiplier that slows the refill rate for repeat offenders: 2× slower after three consecutive rate limit hits from the same client, doubling every three further hits up to a cap of 8×. Evict buckets for inactive sessions to bound memory usage.

Choosing a client identity

Per-client throttling requires a stable identifier for each caller. The best identifier depends on how your MCP server authenticates connections.

Identity type	When to use	Stability	Spoofable?
Session ID (from transport)	Unauthenticated or session-token auth	High — server-assigned at connect	No — server generates it
API key (from auth header)	API-key authenticated server	High — stable across reconnects	Only if key is leaked
JWT subject (`sub` claim)	OAuth/JWT authenticated server	High — user-level identity	Only if token is forged
IP address	No auth, public server	Low — NAT, CDN, shared IPs	Yes — trivially with proxies
IP + User-Agent hash	No auth, want better granularity than pure IP	Medium	Somewhat

For most MCP servers, use the session ID assigned by the SDK transport as the primary identity. It is stable for the session lifetime, server-generated (not client-controlled), and free of the false-positive risk of shared IP addresses.
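
The priority order can be sketched as a small resolver. This is a minimal illustration, not part of the MCP SDK: the `ConnectionInfo` shape and its field names are hypothetical, and placing the JWT subject between the API key and the IP address is an assumption based on the stability column of the table above.

```typescript
// Hypothetical shape: the identity signals a server might hold for one connection.
// These field names are illustrative and do not come from the MCP SDK.
interface ConnectionInfo {
  sessionId?: string;   // server-assigned by the transport
  apiKey?: string;      // taken from an auth header
  jwtSubject?: string;  // `sub` claim of an already-verified token
  remoteIp?: string;    // socket address, used only as a last resort
}

// Returns a prefixed key so identities of different kinds never collide
// (an API key that happens to equal a session ID still gets its own bucket).
export function resolveClientKey(info: ConnectionInfo): string {
  if (info.sessionId) return `session:${info.sessionId}`;
  if (info.apiKey) return `key:${info.apiKey}`;
  if (info.jwtSubject) return `sub:${info.jwtSubject}`;
  if (info.remoteIp) return `ip:${info.remoteIp}`;
  return 'anonymous'; // every unidentified caller shares this one bucket
}
```

Because the resulting key may be logged or echoed in error payloads, a production server would probably hash the API key rather than embed it verbatim; that step is left out here to keep the sketch short.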

Per-client token bucket with TTL eviction

Each client gets its own token bucket. Buckets are created on first use and evicted after a configurable inactivity period to prevent unbounded memory growth.

// src/rate-limit/client-throttle.ts
interface BucketEntry {
  tokens: number;
  lastRefill: number;
  lastSeen: number;
  penaltyMultiplier: number;  // starts at 1, increases after repeated violations
  consecutiveViolations: number;
}

export class ClientThrottler {
  private readonly buckets = new Map<string, BucketEntry>();
  private readonly maxTokens: number;
  private readonly refillRate: number;  // tokens per second
  private readonly ttlMs: number;       // evict after this many ms of inactivity

  constructor(options: {
    maxTokens?: number;
    refillRatePerSecond?: number;
    ttlMs?: number;
  } = {}) {
    this.maxTokens = options.maxTokens ?? 60;
    this.refillRate = options.refillRatePerSecond ?? 1.0;
    this.ttlMs = options.ttlMs ?? 5 * 60 * 1000; // 5 minutes default
  }

  allow(clientKey: string): { allowed: boolean; remaining: number; penaltyActive: boolean } {
    const now = Date.now();
    let entry = this.buckets.get(clientKey);

    if (!entry) {
      entry = {
        tokens: this.maxTokens,
        lastRefill: now,
        lastSeen: now,
        penaltyMultiplier: 1,
        consecutiveViolations: 0,
      };
      this.buckets.set(clientKey, entry);
    }

    // Refill tokens based on elapsed time, applying penalty multiplier
    const elapsed = (now - entry.lastRefill) / 1000;
    const effectiveRate = this.refillRate / entry.penaltyMultiplier;
    entry.tokens = Math.min(this.maxTokens, entry.tokens + elapsed * effectiveRate);
    entry.lastRefill = now;
    entry.lastSeen = now;

    if (entry.tokens < 1) {
      entry.consecutiveViolations += 1;
      // Escalate penalty: 2× slower refill after 3 violations, 4× after 6, cap at 8×
      entry.penaltyMultiplier = Math.min(8, Math.pow(2, Math.floor(entry.consecutiveViolations / 3)));
      return { allowed: false, remaining: 0, penaltyActive: entry.penaltyMultiplier > 1 };
    }

    entry.tokens -= 1;
    // A successful call means the client has backed off: clear the violation
    // streak and the penalty so the refill rate returns to normal.
    entry.consecutiveViolations = 0;
    entry.penaltyMultiplier = 1;
    return { allowed: true, remaining: Math.floor(entry.tokens), penaltyActive: false };
  }

  evictStale(): number {
    const now = Date.now();
    let evicted = 0;
    for (const [key, entry] of this.buckets) {
      if (now - entry.lastSeen > this.ttlMs) {
        this.buckets.delete(key);
        evicted++;
      }
    }
    return evicted;
  }

  size(): number { return this.buckets.size; }
}
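
To make the penalty schedule concrete, the escalation rule from `allow()` can be pulled out and evaluated on its own. The sketch below copies that formula; the helper names `penaltyMultiplier` and `secondsUntilNextToken` are illustrative and do not exist in the class above.

```typescript
// Same escalation rule as ClientThrottler.allow(): the multiplier doubles
// every 3 consecutive violations and is capped at 8.
function penaltyMultiplier(consecutiveViolations: number): number {
  return Math.min(8, Math.pow(2, Math.floor(consecutiveViolations / 3)));
}

// Seconds until the bucket holds one whole token again, given the current
// (possibly fractional) token count and the penalised refill rate.
function secondsUntilNextToken(
  tokens: number,
  refillRatePerSecond: number,
  multiplier: number,
): number {
  if (tokens >= 1) return 0;
  const effectiveRate = refillRatePerSecond / multiplier;
  return (1 - tokens) / effectiveRate;
}

// Print the schedule for an empty bucket at the default 1 token/second.
for (const violations of [0, 2, 3, 6, 9, 30]) {
  const m = penaltyMultiplier(violations);
  const wait = secondsUntilNextToken(0, 1.0, m);
  console.log(`${violations} violations: ${m}x penalty, ${wait}s until next token`);
}
```

With the default refill rate of one token per second, an empty bucket recovers its next token in 1 s with no penalty, 2 s after three consecutive violations, 4 s after six, and 8 s from nine onward.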

Extracting the client key from MCP transport

The MCP SDK's server transport provides a session identifier. How you extract it depends on the transport type you're using.

// src/server.ts — wiring per-client throttling
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';
import { CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import { ClientThrottler } from './rate-limit/client-throttle.js';
import type { IncomingMessage } from 'http';

const throttler = new ClientThrottler({
  maxTokens: 60,
  refillRatePerSecond: 1.0,
  ttlMs: 10 * 60 * 1000,
});

// Evict stale entries every 5 minutes; unref() stops the timer from
// keeping the Node.js process alive on shutdown
setInterval(() => throttler.evictStale(), 5 * 60 * 1000).unref();

// Helper: extract client key from HTTP request headers
function getClientKey(req: IncomingMessage): string {
  // Prefer API key (stable across reconnects)
  const apiKey = req.headers['x-api-key'];
  if (typeof apiKey === 'string' && apiKey.length > 0) {
    return `key:${apiKey}`;
  }
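  // Fall back to IP address. X-Forwarded-For is client-controlled, so only
  // trust it when the server sits behind a reverse proxy you operate.
  const forwarded = req.headers['x-forwarded-for'];
  const ip = typeof forwarded === 'string'
    ? forwarded.split(',')[0].trim()
    : req.socket.remoteAddress ?? 'unknown';
  return `ip:${ip}`;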
}

// Assumes `server` is the Server instance created elsewhere in this file.
// For HTTP transports, the SDK passes the transport's session ID to each
// request handler as `extra.sessionId`; it is undefined over stdio.
server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
  // Prefer the session ID, then the authenticated client ID (when auth is configured).
  // Callers with neither all share the single 'unknown' bucket.
  const clientKey = extra.sessionId
    ?? extra.authInfo?.clientId
    ?? 'unknown';

  const { allowed, remaining, penaltyActive } = throttler.allow(clientKey);

  if (!allowed) {
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          error: 'client_rate_limited',
          client: clientKey,
          message: penaltyActive
            ? 'You have exceeded the rate limit repeatedly. A penalty backoff is active — please wait longer before retrying.'
            : 'Rate limit exceeded. Please wait before making more tool calls.',
          remaining_tokens: remaining,
        }),
      }],
      isError: true,
    };
  }

  // Normal tool dispatch continues...
});
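
When a server exposes many tools, repeating the throttle check in every branch gets noisy. One option is to wrap each tool function once. The sketch below is self-contained and uses structural stand-ins (`Throttle`, `ToolResult`, `ToolHandler`) that are assumptions for illustration; a real server would pass the `ClientThrottler` instance and return the SDK's own result type.

```typescript
// Structural stand-ins so the sketch compiles on its own.
interface ThrottleDecision { allowed: boolean; remaining: number; penaltyActive: boolean }
interface Throttle { allow(clientKey: string): ThrottleDecision }
interface ToolResult {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
}

// Hypothetical handler signature: tool arguments plus the resolved client key.
type ToolHandler<A> = (args: A, clientKey: string) => Promise<ToolResult>;

// Returns a handler that consults the throttle first and only runs the
// wrapped tool when the client still has budget.
function withThrottle<A>(throttle: Throttle, handler: ToolHandler<A>): ToolHandler<A> {
  return async (args, clientKey) => {
    const decision = throttle.allow(clientKey);
    if (!decision.allowed) {
      return {
        content: [{
          type: 'text',
          text: JSON.stringify({
            error: 'client_rate_limited',
            penalty_active: decision.penaltyActive,
            remaining_tokens: decision.remaining,
          }),
        }],
        isError: true,
      };
    }
    return handler(args, clientKey);
  };
}
```

The wrapped function has the same signature as the original, so the dispatch code that maps tool names to handlers does not need to know whether throttling is applied.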

Fair queuing: prioritizing well-behaved clients

An alternative to rejecting over-limit requests immediately is fair queuing: hold excess requests in a per-client queue instead of dropping them. Short bursts queue up and drain in order, which feels more responsive to the caller than an immediate rejection. The wrapper below uses a per-client FIFO queue with a per-client concurrency limit and a bounded queue depth.

// Simplified fair-queue wrapper
export class FairQueue {
  private queues = new Map<string, Array<() => void>>();
  private activeCounts = new Map<string, number>();
  private readonly maxPerClient: number;
  private readonly maxQueueDepth: number;

  constructor(maxPerClient = 3, maxQueueDepth = 10) {
    this.maxPerClient = maxPerClient;
    this.maxQueueDepth = maxQueueDepth;
  }

  async acquire(clientKey: string): Promise<() => void> {
    const active = this.activeCounts.get(clientKey) ?? 0;

    if (active < this.maxPerClient) {
      this.activeCounts.set(clientKey, active + 1);
      return () => this.release(clientKey);
    }

    // Reject when this client's queue is already full, to prevent unbounded memory growth
    const queue = this.queues.get(clientKey) ?? [];
    if (queue.length >= this.maxQueueDepth) {
      throw new Error(`client_queue_full:${clientKey}`);
    }

    // Wait for a slot
    return new Promise((resolve) => {
      const entry = () => {
        this.activeCounts.set(clientKey, (this.activeCounts.get(clientKey) ?? 0) + 1);
        resolve(() => this.release(clientKey));
      };
      queue.push(entry);
      this.queues.set(clientKey, queue);
    });
  }

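  private release(clientKey: string): void {
    const active = Math.max(0, (this.activeCounts.get(clientKey) ?? 1) - 1);
    const queue = this.queues.get(clientKey);
    if (queue && queue.length > 0) {
      this.activeCounts.set(clientKey, active);
      const next = queue.shift()!;
      next(); // promote next queued request (it re-increments the active count)
      return;
    }
    // Nothing queued: drop empty bookkeeping so idle clients don't leak map entries
    this.queues.delete(clientKey);
    if (active === 0) this.activeCounts.delete(clientKey);
    else this.activeCounts.set(clientKey, active);
  }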
}

Fair queuing works well when callers are LLM agents that will naturally retry, because a brief wait in the queue is invisible to the caller. When a client's queue depth exceeds your limit, reject the request instead of queuing it; this protects memory under extreme load.
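
Callers of `acquire()` must release the slot on every code path, including when the tool throws, or the client's concurrency budget leaks away. A small helper can enforce that. In the sketch below, `SlotQueue` is a structural stand-in for the `FairQueue` class, and `withSlot` and `Outcome` are hypothetical names introduced only for illustration.

```typescript
// Structural stand-in for FairQueue: acquire() resolves to a release
// function, or rejects when the client's queue is full.
interface SlotQueue {
  acquire(clientKey: string): Promise<() => void>;
}

type Outcome<T> = { ok: true; value: T } | { ok: false; reason: 'queue_full' };

// Runs `work` while holding one slot for `clientKey`, and always gives the
// slot back afterwards.
async function withSlot<T>(
  queue: SlotQueue,
  clientKey: string,
  work: () => Promise<T>,
): Promise<Outcome<T>> {
  // FairQueue signals a full queue with an Error whose message starts with
  // "client_queue_full:"; translate that into a value, rethrow anything else.
  const release = await queue.acquire(clientKey).catch((err: unknown) => {
    if (err instanceof Error && err.message.startsWith('client_queue_full:')) {
      return null;
    }
    throw err;
  });
  if (release === null) return { ok: false, reason: 'queue_full' };

  try {
    return { ok: true, value: await work() };
  } finally {
    release(); // free the slot even if work() throws
  }
}
```

A tool handler could map the `queue_full` outcome to the same `client_rate_limited` error payload used for token-bucket rejections, so callers see one consistent signal for both mechanisms.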