MCP server client throttling

Server-level rate limits protect your infrastructure from being overwhelmed in aggregate, but they don't prevent a single bad actor — a runaway agent loop, a misconfigured LLM client, or a deliberate abuser — from consuming all available capacity. Per-client throttling gives each caller its own independent rate budget so one aggressive session cannot starve the others, regardless of how many concurrent clients are connected.

TL;DR

Identify each MCP client by a stable key (session ID > API key > IP address), maintain a Map<clientKey, TokenBucket> with TTL-based eviction, and check the bucket in your CallToolRequestSchema handler before dispatching. Apply a penalty multiplier (double refill delay) after three consecutive rate limit hits from the same client. Evict buckets for inactive sessions to bound memory usage.

Choosing a client identity

Per-client throttling requires a stable identifier for each caller. The best identifier depends on how your MCP server authenticates connections.

| Identity type | When to use | Stability | Spoofable? |
| --- | --- | --- | --- |
| Session ID (from transport) | Unauthenticated or session-token auth | High (server-assigned at connect) | No (server generates it) |
| API key (from auth header) | API-key authenticated server | High (stable across reconnects) | Only if key is leaked |
| JWT subject (sub claim) | OAuth/JWT authenticated server | High (user-level identity) | Only if token is forged |
| IP address | No auth, public server | Low (NAT, CDN, shared IPs) | Yes (trivially with proxies) |
| IP + User-Agent hash | No auth, want better granularity than pure IP | Medium | Somewhat |

For most MCP servers, use the session ID assigned by the SDK transport as the primary identity. It is stable for the session lifetime, server-generated (not client-controlled), and free of the false-positive risk of shared IP addresses.
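The priority order in the table can be sketched as a small resolver. This is an illustration under assumptions, not part of any SDK: the `IdentitySignals` shape, the `resolveClientKey` and `fingerprint` names, and the key prefixes are all hypothetical. API keys and the IP + User-Agent pair are hashed so that raw values never become map keys.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape: the identity signals a server might hold for one caller.
interface IdentitySignals {
  sessionId?: string;   // server-assigned by the transport
  apiKey?: string;      // taken from an auth header
  jwtSubject?: string;  // `sub` claim of an already verified token
  ip?: string;
  userAgent?: string;
}

// Short, non-reversible fingerprint so secrets are never stored or logged raw.
function fingerprint(value: string): string {
  return createHash('sha256').update(value).digest('hex').slice(0, 16);
}

// Walks the identities from most to least stable and returns a prefixed key.
export function resolveClientKey(s: IdentitySignals): string {
  if (s.sessionId) return `session:${s.sessionId}`;
  if (s.apiKey) return `key:${fingerprint(s.apiKey)}`;
  if (s.jwtSubject) return `sub:${s.jwtSubject}`;
  if (s.ip && s.userAgent) return `ipua:${fingerprint(`${s.ip}|${s.userAgent}`)}`;
  if (s.ip) return `ip:${s.ip}`;
  return 'anonymous'; // every unidentified caller shares one bucket
}
```

The prefixes keep the namespaces apart, so a session ID can never collide with an IP-derived key. If reconnecting to reset the limit is a concern, check the API key or JWT subject before the session ID instead.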

Per-client token bucket with TTL eviction

Each client gets its own token bucket. Buckets are created on first use and evicted after a configurable inactivity period to prevent unbounded memory growth.

// src/rate-limit/client-throttle.ts
interface BucketEntry {
  tokens: number;
  lastRefill: number;
  lastSeen: number;
  penaltyMultiplier: number;  // starts at 1, increases after repeated violations
  consecutiveViolations: number;
}

export class ClientThrottler {
  private readonly buckets = new Map<string, BucketEntry>();
  private readonly maxTokens: number;
  private readonly refillRate: number;  // tokens per second
  private readonly ttlMs: number;       // evict after this many ms of inactivity

  constructor(options: {
    maxTokens?: number;
    refillRatePerSecond?: number;
    ttlMs?: number;
  } = {}) {
    this.maxTokens = options.maxTokens ?? 60;
    this.refillRate = options.refillRatePerSecond ?? 1.0;
    this.ttlMs = options.ttlMs ?? 5 * 60 * 1000; // 5 minutes default
  }

  allow(clientKey: string): { allowed: boolean; remaining: number; penaltyActive: boolean } {
    const now = Date.now();
    let entry = this.buckets.get(clientKey);

    if (!entry) {
      entry = {
        tokens: this.maxTokens,
        lastRefill: now,
        lastSeen: now,
        penaltyMultiplier: 1,
        consecutiveViolations: 0,
      };
      this.buckets.set(clientKey, entry);
    }

    // Refill tokens based on elapsed time, applying penalty multiplier
    const elapsed = (now - entry.lastRefill) / 1000;
    const effectiveRate = this.refillRate / entry.penaltyMultiplier;
    entry.tokens = Math.min(this.maxTokens, entry.tokens + elapsed * effectiveRate);
    entry.lastRefill = now;
    entry.lastSeen = now;

    if (entry.tokens < 1) {
      entry.consecutiveViolations += 1;
      // Escalate penalty: 2× slower refill after 3 violations, 4× after 6, cap at 8×
      entry.penaltyMultiplier = Math.min(8, Math.pow(2, Math.floor(entry.consecutiveViolations / 3)));
      return { allowed: false, remaining: 0, penaltyActive: entry.penaltyMultiplier > 1 };
    }

    entry.tokens -= 1;
    // A successful call means the client is behaving again: clear the violation
    // count and the penalty, so refill returns to the normal rate and the
    // penaltyActive flag matches the bucket's real state.
    entry.consecutiveViolations = 0;
    entry.penaltyMultiplier = 1;
    return { allowed: true, remaining: Math.floor(entry.tokens), penaltyActive: false };
  }

  evictStale(): number {
    const now = Date.now();
    let evicted = 0;
    for (const [key, entry] of this.buckets) {
      if (now - entry.lastSeen > this.ttlMs) {
        this.buckets.delete(key);
        evicted++;
      }
    }
    return evicted;
  }

  size(): number { return this.buckets.size; }
}
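A rejection is more useful when it tells the caller how long to wait. The sketch below is an assumption layered on the class above, not part of it: `retryAfterMs` is a hypothetical helper that reuses the same effective-rate formula (refill rate divided by the penalty multiplier) to estimate when the bucket will next hold one whole token.

```typescript
// Hypothetical helper: milliseconds until a bucket holds at least one token.
// Mirrors the refill formula: effectiveRate = refillRate / penaltyMultiplier.
export function retryAfterMs(
  tokens: number,
  refillRatePerSecond: number,
  penaltyMultiplier: number,
): number {
  if (tokens >= 1) return 0; // a call would already be allowed
  const effectiveRate = refillRatePerSecond / penaltyMultiplier;
  if (effectiveRate <= 0) return Number.POSITIVE_INFINITY; // bucket never refills
  const missing = 1 - tokens; // fraction of a token still needed
  return Math.ceil((missing / effectiveRate) * 1000);
}
```

With the default rate of one token per second, an empty bucket is ready again after 1000 ms at no penalty and after 8000 ms at the maximum 8× penalty. The value could be returned alongside `remaining` so well-behaved agents can back off precisely instead of guessing.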

Extracting the client key from MCP transport

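The MCP SDK's Streamable HTTP server transport assigns a session ID when it is configured with a session ID generator, and the SDK passes that ID to request handlers. The stdio transport serves a single client per process and has no session ID, so per-client throttling matters mainly for HTTP deployments. When no session ID is available, derive the key from request headers instead.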

// src/server.ts — wiring per-client throttling
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import { ClientThrottler } from './rate-limit/client-throttle.js';
import type { IncomingMessage } from 'node:http';

// The handler below needs a server instance; name and version are placeholders.
// Connect it to your transport (stdio or Streamable HTTP) as usual.
const server = new Server(
  { name: 'example-server', version: '1.0.0' },
  { capabilities: { tools: {} } },
);

const throttler = new ClientThrottler({
  maxTokens: 60,
  refillRatePerSecond: 1.0,
  ttlMs: 10 * 60 * 1000,
});

// Evict stale entries every 5 minutes
setInterval(() => throttler.evictStale(), 5 * 60 * 1000);

// Helper: extract client key from HTTP request headers
function getClientKey(req: IncomingMessage): string {
  // Prefer API key (stable across reconnects)
  const apiKey = req.headers['x-api-key'];
  if (typeof apiKey === 'string' && apiKey.length > 0) {
    return `key:${apiKey}`;
  }
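  // Fall back to IP address. Only trust X-Forwarded-For when the server sits
  // behind a proxy you control; otherwise clients can set it to any value.
  const forwarded = req.headers['x-forwarded-for'];
  const ip = typeof forwarded === 'string'
    ? forwarded.split(',')[0].trim()
    : req.socket.remoteAddress ?? 'unknown';
  return `ip:${ip}`;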
}

// For Streamable HTTP transport: use the session ID assigned by the SDK.
// The handler's second argument exposes it as extra.sessionId; it is undefined
// for transports without sessions (for example stdio, which serves one client).
server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
  // Callers without a session ID all share the single 'unknown' bucket
  const clientKey = extra.sessionId ?? 'unknown';

  const { allowed, remaining, penaltyActive } = throttler.allow(clientKey);

  if (!allowed) {
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          error: 'client_rate_limited',
          client: clientKey,
          message: penaltyActive
            ? 'You have exceeded the rate limit repeatedly. A penalty backoff is active — please wait longer before retrying.'
            : 'Rate limit exceeded. Please wait before making more tool calls.',
          remaining_tokens: remaining,
        }),
      }],
      isError: true,
    };
  }

  // Normal tool dispatch continues...
});

Fair queuing: prioritizing well-behaved clients

An alternative to rejecting requests outright is fair queuing: hold excess requests in a per-client queue rather than dropping them immediately. This allows short bursts to queue up and drain in order, which feels more responsive to the caller than an immediate rejection. Implement this with a per-client FIFO queue, a per-client concurrency limit, and a cap on queue depth.

// Simplified fair-queue wrapper
export class FairQueue {
  private queues = new Map<string, Array<() => void>>();
  private activeCounts = new Map<string, number>();
  private readonly maxPerClient: number;
  private readonly maxQueueDepth: number;

  constructor(maxPerClient = 3, maxQueueDepth = 10) {
    this.maxPerClient = maxPerClient;
    this.maxQueueDepth = maxQueueDepth;
  }

  async acquire(clientKey: string): Promise<() => void> {
    const active = this.activeCounts.get(clientKey) ?? 0;

    if (active < this.maxPerClient) {
      this.activeCounts.set(clientKey, active + 1);
      return () => this.release(clientKey);
    }

    // Queue is full — reject to prevent unbounded memory growth
    const queue = this.queues.get(clientKey) ?? [];
    if (queue.length >= this.maxQueueDepth) {
      throw new Error(`client_queue_full:${clientKey}`);
    }

    // Wait for a slot
    return new Promise((resolve) => {
      const entry = () => {
        this.activeCounts.set(clientKey, (this.activeCounts.get(clientKey) ?? 0) + 1);
        resolve(() => this.release(clientKey));
      };
      queue.push(entry);
      this.queues.set(clientKey, queue);
    });
  }

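  private release(clientKey: string): void {
    // Clamp at zero so an accidental double release cannot go negative
    const active = Math.max(0, (this.activeCounts.get(clientKey) ?? 1) - 1);
    this.activeCounts.set(clientKey, active);
    const queue = this.queues.get(clientKey);
    if (queue && queue.length > 0) {
      const next = queue.shift()!;
      next(); // promote next queued request (re-increments the active count)
      return;
    }
    // Nothing running or waiting: drop the entries so idle clients don't accumulate
    if (active === 0) {
      this.activeCounts.delete(clientKey);
      this.queues.delete(clientKey);
    }
  }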
}

Fair queuing works well when callers are LLM agents that would otherwise retry: a brief wait in the queue is invisible to the caller, who only sees a slightly slower tool call. When a client's queue is already at its depth limit, reject the request rather than queuing it, which protects memory under extreme load.
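A slot acquired from the queue has to be released even when the tool call throws, otherwise the client slowly loses concurrency. One way to guarantee that is a small wrapper; `withSlot` and the `SlotQueue` interface are hypothetical names for this sketch, and `SlotQueue` only assumes the `acquire()` signature shown in the class above.

```typescript
// Structural type matching the acquire() method of the FairQueue above.
interface SlotQueue {
  acquire(clientKey: string): Promise<() => void>;
}

// Hypothetical helper: runs `work` inside a slot and always releases the slot,
// even when `work` throws, so a failing tool call cannot leak capacity.
export async function withSlot<T>(
  queue: SlotQueue,
  clientKey: string,
  work: () => Promise<T>,
): Promise<T> {
  // acquire() rejects when the client's queue is full; that error propagates
  // to the caller before any slot is held, so there is nothing to release.
  const release = await queue.acquire(clientKey);
  try {
    return await work();
  } finally {
    release();
  }
}
```

In a tool handler this would wrap the dispatch, for example `withSlot(fairQueue, clientKey, () => runTool(request))`, where `fairQueue` and `runTool` stand in for your own instances.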

Related questions

How does per-client throttling differ from server-level rate limiting?

Server-level rate limits protect total capacity — they cap aggregate calls across all clients. Per-client throttling protects fairness — it ensures no single client can consume all available capacity. A server with 100 calls/second total capacity and 10 calls/second per-client limit allows up to 10 well-behaved clients to each run at full speed without blocking each other. Use both: a global cap to protect infrastructure and per-client limits to protect fairness.
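The two layers can be combined in a single check. The following is a minimal sketch using the numbers from the answer above (100 calls/second globally, 10 per client); `Bucket` and `TwoTierLimiter` are hypothetical names, and the clock is passed in as a parameter so the behavior is easy to verify.

```typescript
// Minimal token bucket used only for this sketch; `now` is in milliseconds.
class Bucket {
  private tokens: number;
  private last: number;
  private readonly max: number;
  private readonly ratePerSec: number;

  constructor(max: number, ratePerSec: number, now: number) {
    this.max = max;
    this.ratePerSec = ratePerSec;
    this.tokens = max;
    this.last = now;
  }

  // Refills without consuming; true when at least one token is available.
  peek(now: number): boolean {
    const elapsedSec = (now - this.last) / 1000;
    this.tokens = Math.min(this.max, this.tokens + elapsedSec * this.ratePerSec);
    this.last = now;
    return this.tokens >= 1;
  }

  take(): void {
    this.tokens -= 1;
  }
}

export class TwoTierLimiter {
  private readonly global: Bucket;
  private readonly clients = new Map<string, Bucket>();
  private readonly clientPerSec: number;

  constructor(globalPerSec = 100, clientPerSec = 10, now = Date.now()) {
    this.global = new Bucket(globalPerSec, globalPerSec, now);
    this.clientPerSec = clientPerSec;
  }

  allow(clientKey: string, now = Date.now()): 'ok' | 'client_limited' | 'server_limited' {
    let client = this.clients.get(clientKey);
    if (!client) {
      client = new Bucket(this.clientPerSec, this.clientPerSec, now);
      this.clients.set(clientKey, client);
    }
    // Check the client first so a throttled client never spends global capacity.
    if (!client.peek(now)) return 'client_limited';
    if (!this.global.peek(now)) return 'server_limited';
    // Consume from both buckets only when both checks passed.
    client.take();
    this.global.take();
    return 'ok';
  }
}
```

Returning distinct reasons lets the handler tell a caller whether it exceeded its own budget or the server as a whole is saturated. Eviction of idle client buckets is omitted here for brevity.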

What happens to the token bucket when a client reconnects?

For session-ID-based throttling, a reconnected client gets a fresh session ID and a fresh token bucket (starting full). If you want throttling to persist across reconnects — useful for preventing "reconnect to reset limit" abuse — use a stable identity like an API key or authenticated user ID, and only evict the bucket after a TTL of inactivity rather than on disconnect.

How do I handle shared API keys (multiple agents using one key)?

A shared API key means multiple agents share one token bucket, which may cause false positives where a legitimate agent is throttled because another agent on the same key was aggressive. The solution is to use a compound key: apiKey:sessionId. Each session gets its own bucket under the shared key prefix. You can add a secondary check that the sum of all buckets under the same API key doesn't exceed a combined limit, giving you per-agent fairness and per-key total protection.
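The compound-key idea with a combined cap can be sketched as follows. To keep the example short it uses fixed-window counters rather than token buckets, which is a simplification and not the mechanism used elsewhere in this guide; `SharedKeyLimiter` and its default limits are hypothetical.

```typescript
// Fixed-window counters stand in for token buckets to keep this sketch short.
export class SharedKeyLimiter {
  private readonly perSession = new Map<string, number>();
  private readonly perKey = new Map<string, number>();
  private windowStart: number;
  private readonly sessionLimit: number;
  private readonly keyLimit: number;
  private readonly windowMs: number;

  constructor(sessionLimit = 10, keyLimit = 30, windowMs = 1000, now = Date.now()) {
    this.sessionLimit = sessionLimit;
    this.keyLimit = keyLimit;
    this.windowMs = windowMs;
    this.windowStart = now;
  }

  allow(apiKeyId: string, sessionId: string, now = Date.now()): boolean {
    // Start a fresh window once the current one has elapsed.
    if (now - this.windowStart >= this.windowMs) {
      this.perSession.clear();
      this.perKey.clear();
      this.windowStart = now;
    }
    const compound = `${apiKeyId}:${sessionId}`;
    const sessionCount = this.perSession.get(compound) ?? 0;
    const keyCount = this.perKey.get(apiKeyId) ?? 0;
    if (sessionCount >= this.sessionLimit) return false; // this agent used its share
    if (keyCount >= this.keyLimit) return false;         // the key as a whole is over budget
    this.perSession.set(compound, sessionCount + 1);
    this.perKey.set(apiKeyId, keyCount + 1);
    return true;
  }
}
```

Pass a hashed or otherwise non-secret identifier as `apiKeyId` so the raw key is not kept in memory as a map key.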

Should I log client identities in rate limit events?

Log the identity key you're throttling on (session ID or a hash of the API key) but not the full API key itself. A session ID is safe to log — it is server-generated and meaningless outside the session. Log rate limit events at INFO level with the client key, tool name, and timestamp so you can identify misbehaving callers and decide whether to blocklist them.
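A minimal sketch of such a log event, assuming client keys of the form `key:<apiKey>` for API-key callers (as a header-based extractor might produce) and passing every other key through unchanged. The `RateLimitEvent` shape and both function names are hypothetical.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical event shape; field names are illustrative.
interface RateLimitEvent {
  level: 'info';
  event: 'client_rate_limited';
  client: string;
  tool: string;
  timestamp: string;
}

// Replaces the secret part of a `key:<apiKey>` client key with a short
// SHA-256 prefix. Session IDs and IP-based keys pass through unchanged.
export function loggableClientKey(clientKey: string): string {
  if (!clientKey.startsWith('key:')) return clientKey;
  const digest = createHash('sha256').update(clientKey.slice(4)).digest('hex');
  return `key:${digest.slice(0, 12)}`;
}

// Builds the structured event; the caller hands it to whatever logger is in use.
export function rateLimitEvent(clientKey: string, tool: string, now = new Date()): RateLimitEvent {
  return {
    level: 'info',
    event: 'client_rate_limited',
    client: loggableClientKey(clientKey),
    tool,
    timestamp: now.toISOString(),
  };
}
```

Because the hash is deterministic, repeated events from the same API key still group together in log queries without exposing the key itself.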
