MCP server per-tool rate limiting

A single rate limit shared across all tools is a blunt instrument. An LLM calling search_documents twenty times in a loop can exhaust the shared budget and silently block every other tool, while a budget generous enough for that loop does nothing to throttle delete_file, the destructive tool you most wanted to limit. Per-tool rate limiting assigns each tool its own independent counter, so a bursty read operation never starves a cautious write operation and you can tune limits to each tool's risk and cost profile.

TL;DR

At server startup, build a Map<string, TokenBucket> keyed by tool name, with limits sourced from a Zod-validated config object. In your CallToolRequestSchema handler, look up the bucket for request.params.name, call bucket.consume(), and return isError: true with a rate-limit message if the bucket is empty. Low-cost read tools get high limits (30–60/min), destructive tools get tight limits (1–3/min), and expensive tools sit in between (5–10/min).

Why per-tool limits matter

LLMs are not predictable callers. An agent loop solving a multi-step task may call a retrieval tool dozens of times in quick succession — that is normal behavior. The same agent might call a write or delete tool once per task. A uniform rate limit forces you to choose between two bad options: a limit high enough for retrieval (which also allows runaway writes) or a limit tight enough for writes (which throttles legitimate retrieval).

| Tool category | Typical call pattern | Suggested limit (calls/min per session) | Why |
| --- | --- | --- | --- |
| Read / search | Burst: 10–30 calls in a task | 30–60 | Low cost, low risk, high frequency |
| Write / create | Steady: 1–5 calls in a task | 5–15 | Moderate cost, moderate risk |
| Update / patch | Rare: 1–3 calls in a task | 3–10 | Side effects; mistakes are hard to undo |
| Delete / destroy | Rare: 0–2 calls in a task | 1–3 | Irreversible; a tight limit protects against runaway deletion |
| External API call | Varies by upstream limit | Match upstream rate limit | Prevents your server from hammering third-party APIs |
| LLM / AI inference | Medium: 1–10 calls in a task | 5–10 | High cost; limits protect budget |

Token bucket implementation with per-tool config

A token bucket gives each tool a pool of tokens. Each call consumes one token. Tokens refill at a configured rate. If no tokens remain, the call is rejected. This naturally handles bursts (up to the bucket capacity) while enforcing a steady-state rate.

// src/rate-limit/per-tool.ts
import { z } from 'zod';

// Zod schema for a single tool's rate limit config.
// Refill is computed continuously from elapsed time, so no interval setting is needed.
const ToolRateLimitSchema = z.object({
  maxTokens: z.number().int().positive(),    // bucket capacity (burst ceiling)
  refillRate: z.number().positive(),          // tokens added per second
});

// Zod schema for the full tool limit map
const ToolLimitsConfigSchema = z.record(z.string(), ToolRateLimitSchema);

type ToolRateLimit = z.infer<typeof ToolRateLimitSchema>;
type ToolLimitsConfig = z.infer<typeof ToolLimitsConfigSchema>;

class TokenBucket {
  private tokens: number;          // current balance; may be fractional
  private readonly maxTokens: number;
  private readonly refillRate: number;
  private lastRefill: number;      // timestamp (ms) of the last refill

  constructor(config: ToolRateLimit) {
    this.maxTokens = config.maxTokens;
    this.refillRate = config.refillRate;
    this.tokens = config.maxTokens; // start full
    this.lastRefill = Date.now();
  }

  // Takes one token if available; returns false when the bucket is empty
  consume(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  remainingTokens(): number {
    this.refill();
    return Math.floor(this.tokens);
  }

  // Adds tokens in proportion to the time elapsed since the last refill, capped at capacity
  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}

export class PerToolRateLimiter {
  private buckets = new Map<string, TokenBucket>();
  // Shared by every tool that has no entry in the config
  private defaultBucket: TokenBucket;

  constructor(config: ToolLimitsConfig, defaultConfig: ToolRateLimit = { maxTokens: 20, refillRate: 0.5 }) {
    const parsed = ToolLimitsConfigSchema.parse(config);
    for (const [toolName, limits] of Object.entries(parsed)) {
      this.buckets.set(toolName, new TokenBucket(limits));
    }
    this.defaultBucket = new TokenBucket(defaultConfig);
  }

  // Returns true if the call is allowed, false if rate-limited
  allow(toolName: string): boolean {
    const bucket = this.buckets.get(toolName) ?? this.defaultBucket;
    return bucket.consume();
  }

  remaining(toolName: string): number {
    const bucket = this.buckets.get(toolName) ?? this.defaultBucket;
    return bucket.remainingTokens();
  }
}
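
One way to sanity-check the bucket arithmetic is to drive it with a fake clock. The sketch below is an illustration rather than the class above: `SketchBucket` and its injected `now` function are hypothetical stand-ins, introduced so the refill behavior can be exercised without waiting on real time.

```typescript
// Hypothetical stand-in for TokenBucket with an injectable clock (assumption:
// the class above reads Date.now() directly, so this variant exists for testing).
class SketchBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly maxTokens: number;
  private readonly refillRate: number; // tokens per second
  private readonly now: () => number;  // returns milliseconds

  constructor(maxTokens: number, refillRate: number, now: () => number) {
    this.maxTokens = maxTokens;
    this.refillRate = refillRate;
    this.now = now;
    this.tokens = maxTokens; // start full
    this.lastRefill = now();
  }

  consume(): boolean {
    const t = this.now();
    const elapsedSeconds = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Fake clock: advance time manually instead of sleeping.
let fakeNowMs = 0;
const sketchBucket = new SketchBucket(2, 0.5, () => fakeNowMs);

// Capacity is 2, so the third back-to-back call is rejected.
const burst = [sketchBucket.consume(), sketchBucket.consume(), sketchBucket.consume()];

// 2 seconds at 0.5 tokens/s refills exactly one token.
fakeNowMs += 2000;
const afterWait = [sketchBucket.consume(), sketchBucket.consume()];
```

Here `burst` comes out as `[true, true, false]` and `afterWait` as `[true, false]`: the burst ceiling and the steady-state rate are enforced independently.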

Wire it into the MCP server with a config object that maps tool names to limits:

// src/server.ts
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { CallToolRequestSchema, ListToolsRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import { PerToolRateLimiter } from './rate-limit/per-tool.js';

const rateLimiter = new PerToolRateLimiter({
  search_documents:  { maxTokens: 30, refillRate: 0.5 },   // burst of 30, then 0.5/s (30/min) steady
  list_files:        { maxTokens: 60, refillRate: 1.0 },   // burst of 60, then 1/s (60/min) steady
  read_file:         { maxTokens: 30, refillRate: 0.5 },
  create_file:       { maxTokens: 10, refillRate: 0.17 },  // burst of 10, then ~10/min steady
  update_file:       { maxTokens: 8,  refillRate: 0.13 },  // burst of 8, then ~8/min steady
  delete_file:       { maxTokens: 2,  refillRate: 0.03 },  // burst of 2, then ~2/min steady
  run_query:         { maxTokens: 15, refillRate: 0.25 },  // burst of 15, then 15/min steady
  call_external_api: { maxTokens: 5,  refillRate: 0.08 },  // ~5/min steady; match the upstream limit
});

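// Placeholder server identity; substitute your own name and version.
const server = new Server(
  { name: 'example-server', version: '1.0.0' },
  { capabilities: { tools: {} } },
);

server.setRequestHandler(CallToolRequestSchema, async (request) => {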
  const toolName = request.params.name;

  if (!rateLimiter.allow(toolName)) {
    const remaining = rateLimiter.remaining(toolName);
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          error: 'rate_limited',
          tool: toolName,
          message: `Tool '${toolName}' rate limit exceeded. Retry after a short delay.`,
          remaining_tokens: remaining,
        }),
      }],
      isError: true,
    };
  }

  // ... normal tool dispatch
});

Per-session vs global per-tool limits

The implementation above uses a single global bucket per tool — all sessions share the limit. For multi-session servers, you usually want per-session buckets: each connected client gets its own token pool per tool so one aggressive session doesn't deplete the budget for all other sessions.

// Per-session, per-tool rate limiter
export class SessionToolRateLimiter {
  // sessionId → toolName → TokenBucket
  private sessions = new Map<string, Map<string, TokenBucket>>();
  private readonly config: ToolLimitsConfig;

  constructor(config: ToolLimitsConfig) {
    this.config = config;
  }

  allow(sessionId: string, toolName: string): boolean {
    if (!this.sessions.has(sessionId)) {
      this.sessions.set(sessionId, new Map());
    }
    const sessionBuckets = this.sessions.get(sessionId)!;
    if (!sessionBuckets.has(toolName)) {
      const toolConfig = this.config[toolName] ?? { maxTokens: 20, refillRate: 0.33 };
      sessionBuckets.set(toolName, new TokenBucket(toolConfig));
    }
    return sessionBuckets.get(toolName)!.consume();
  }

  // Call when a session disconnects to free memory
  clearSession(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}

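// Wire up session lifecycle. The SDK reports disconnects through the `onclose`
// callback rather than an event emitter. This sketch assumes one Server per
// connected transport, and that `sessionLimiter` is the SessionToolRateLimiter
// instance created at startup.
import { randomUUID } from 'node:crypto';

const sessionId = transport.sessionId ?? randomUUID();
server.onclose = () => sessionLimiter.clearSession(sessionId);
await server.connect(transport);
// Pass the same sessionId to sessionLimiter.allow(sessionId, toolName) in the tool handler.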

Session-level buckets mean a fresh session always starts with a full token pool, which is the right default for interactive use. Global buckets are better for shared API keys where you want a total cap regardless of how many sessions are active.

Exposing limits in the tool description

LLM clients benefit from knowing a tool's rate limit in advance — the model can self-regulate to avoid hitting limits. Include the limit in the tool description so it appears in tools/list output:

server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: 'delete_file',
      description: [
        'Permanently delete a file by path. This action cannot be undone.',
        'Rate limit: 2 calls per minute. Use only when the user explicitly confirms deletion.',
      ].join(' '),
      inputSchema: {
        type: 'object',
        properties: {
          path: { type: 'string', description: 'Absolute path of the file to delete' },
        },
        required: ['path'],
      },
    },
    {
      name: 'search_documents',
      description: 'Search documents by keyword. Rate limit: 30 calls per minute.',
      inputSchema: {
        type: 'object',
        properties: {
          query: { type: 'string' },
          limit: { type: 'number', default: 10, maximum: 50 },
        },
        required: ['query'],
      },
    },
  ],
}));

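This is particularly important for destructive tools. A well-prompted model is more likely to pause before calling delete_file a third time in quick succession if it sees the 2/min limit in the description, though the description is only a hint and the server-side limit remains the actual safeguard.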
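
Hand-written limits in descriptions can drift from the enforced config. A minimal sketch of one way to avoid that, assuming a hypothetical `describeLimit` helper (not part of the MCP SDK) that formats the same `maxTokens` and `refillRate` values the limiter uses:

```typescript
// Hypothetical helper: derive the description suffix from the limiter config,
// so the advertised limit cannot drift from the enforced one.
interface LimitConfig {
  maxTokens: number;  // burst capacity
  refillRate: number; // tokens per second
}

function describeLimit(limit: LimitConfig): string {
  // Steady-state rate, rounded to a whole number of calls per minute.
  const perMinute = Math.round(limit.refillRate * 60);
  return `Rate limit: about ${perMinute} calls per minute (burst of ${limit.maxTokens}).`;
}

const deleteFileDescription = [
  'Permanently delete a file by path. This action cannot be undone.',
  describeLimit({ maxTokens: 2, refillRate: 0.03 }),
].join(' ');
```

The rounding is deliberately coarse: the model needs an order of magnitude, not the exact refill rate. Very low rates (below roughly one call every two minutes) would round to zero and need a different phrasing.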

Monitoring per-tool rate limit hit rates

Track how often each tool hits its limit. A high hit rate on a read tool indicates the limit is too tight. A zero hit rate on a destructive tool suggests the tool may never be used — or the limit is so high it provides no protection.

// Instrument the rate limiter to emit metrics
class InstrumentedPerToolRateLimiter extends PerToolRateLimiter {
  private hitCounts = new Map<string, number>();
  private allowCounts = new Map<string, number>();

  allow(toolName: string): boolean {
    const allowed = super.allow(toolName);
    if (allowed) {
      this.allowCounts.set(toolName, (this.allowCounts.get(toolName) ?? 0) + 1);
    } else {
      this.hitCounts.set(toolName, (this.hitCounts.get(toolName) ?? 0) + 1);
      // Emit a structured log for monitoring. Use stderr: on the stdio transport,
      // stdout carries the protocol messages.
      console.error(JSON.stringify({
        event: 'rate_limit_hit',
        tool: toolName,
        timestamp: new Date().toISOString(),
      }));
    }
    return allowed;
  }

  getStats(): Record<string, { allowed: number; blocked: number; hitRate: number }> {
    const stats: Record<string, { allowed: number; blocked: number; hitRate: number }> = {};
    for (const toolName of new Set([...this.allowCounts.keys(), ...this.hitCounts.keys()])) {
      const allowed = this.allowCounts.get(toolName) ?? 0;
      const blocked = this.hitCounts.get(toolName) ?? 0;
      stats[toolName] = { allowed, blocked, hitRate: blocked / (allowed + blocked) };
    }
    return stats;
  }
}

Log the stats every few minutes or expose them on an admin endpoint. An AliveMCP external monitor can detect when rate limit errors are spiking (which shows up as tool-level errors in the protocol probe) even before your own alerting fires.
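
To turn the stats into something actionable, a periodic job can flag the tools worth a second look. The sketch below assumes the `getStats()` shape shown above; `toolsOverThreshold` and the 0.05 default are illustrative choices, not part of any library.

```typescript
// Shape matches the getStats() return value above.
type ToolStats = Record<string, { allowed: number; blocked: number; hitRate: number }>;

// Hypothetical helper: list tools whose hit rate exceeds a threshold
// (the 0.05 default is an illustrative assumption), worst offender first.
function toolsOverThreshold(stats: ToolStats, threshold = 0.05): string[] {
  return Object.entries(stats)
    .filter(([, s]) => s.hitRate > threshold)
    .sort(([, a], [, b]) => b.hitRate - a.hitRate)
    .map(([name]) => name);
}

// Made-up sample numbers, purely to show the shape of the output.
const sampleStats: ToolStats = {
  search_documents: { allowed: 90, blocked: 10, hitRate: 0.1 },
  read_file: { allowed: 99, blocked: 1, hitRate: 0.01 },
  delete_file: { allowed: 1, blocked: 3, hitRate: 0.75 },
};
const flagged = toolsOverThreshold(sampleStats);
```

With the sample data, `flagged` is `['delete_file', 'search_documents']`. A flagged read tool usually means the limit is too tight; a flagged destructive tool is more likely a looping agent and deserves investigation before any limit is raised.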

Related questions

Should I use per-tool or per-session rate limits?

Use both. Per-tool limits protect specific operations (a single destructive tool stays within its limit regardless of session count). Per-session limits protect fairness between clients (one aggressive session can't crowd out others). Apply per-tool limits globally and per-session limits layered on top — a call must pass both checks. Start with per-tool global limits since they're simpler to implement; add per-session limits when you have multiple concurrent clients or a multi-tenant deployment.

How do I set the right limits without real production data?

Start conservative and adjust upward. For read tools, allow 1 call per second per session (60/min). For write tools, allow 1 call every 6 seconds (10/min). For destructive tools, allow 1 call every 30 seconds (2/min). Instrument hit rates and raise the limit on any tool whose hit rate exceeds 5%, since that means legitimate callers are being blocked. The exception is any tool that triggers an irreversible side effect or a paid external API call: keep those limits tight even when they are hit.
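
Since the bucket config is expressed in tokens per second, it can help to write starter limits in calls per minute and convert. A minimal sketch, assuming a hypothetical `perMinute` helper whose default burst equals one minute's worth of calls:

```typescript
interface BucketConfig {
  maxTokens: number;  // burst capacity
  refillRate: number; // tokens per second
}

// Hypothetical helper: express a limit as calls per minute. The burst defaults
// to one minute's worth of calls unless a smaller value is passed.
function perMinute(callsPerMinute: number, burst: number = Math.ceil(callsPerMinute)): BucketConfig {
  if (callsPerMinute <= 0 || burst < 1) {
    throw new RangeError('callsPerMinute must be > 0 and burst must be >= 1');
  }
  return { maxTokens: burst, refillRate: callsPerMinute / 60 };
}

const starterLimits: Record<string, BucketConfig> = {
  search_documents: perMinute(60), // 1 call per second
  create_file: perMinute(10),      // 1 call every 6 seconds
  delete_file: perMinute(2, 1),    // 1 call every 30 seconds, no burst
};
```

Note that a bucket that starts full admits its burst on top of the steady rate, so `perMinute(60)` can let through up to 120 calls in the first minute. Pass a smaller `burst` where that matters.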

What should the rate limit error response look like?

Return isError: true with a JSON payload containing at minimum the error type ("error": "rate_limited"), the tool name, a human-readable message, and the number of remaining tokens. Do not return an HTTP 429: that is a transport-layer response, and tool failures belong inside the tool result, where the model can see them. The isError: true flag signals to LLM frameworks that the result is a tool failure, prompting retry logic or a graceful fallback.
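
A rejected call always reports zero whole tokens remaining, so a retry hint is often more useful to the caller. The sketch below shows one way to compute it. It assumes the bucket can expose its fractional token balance, which the TokenBucket shown earlier does not do as written, and `retry_after_seconds` is an illustrative field name rather than part of the MCP specification.

```typescript
// Seconds until the bucket holds one whole token again.
// `tokens` is the current (possibly fractional) balance; `refillRate` is tokens per second.
function retryAfterSeconds(tokens: number, refillRate: number): number {
  if (tokens >= 1) return 0;
  return Math.ceil((1 - tokens) / refillRate);
}

// Hypothetical builder for the tool result returned on a rate-limit rejection.
function rateLimitResult(toolName: string, tokens: number, refillRate: number) {
  return {
    content: [{
      type: 'text' as const,
      text: JSON.stringify({
        error: 'rate_limited',
        tool: toolName,
        message: `Tool '${toolName}' rate limit exceeded.`,
        remaining_tokens: Math.floor(tokens),
        retry_after_seconds: retryAfterSeconds(tokens, refillRate),
      }),
    }],
    isError: true,
  };
}

// Empty bucket refilling at 0.5 tokens/s: the next token arrives in 2 seconds.
const exampleResult = rateLimitResult('search_documents', 0, 0.5);
```

Rounding up with `Math.ceil` errs on the side of telling the client to wait slightly too long rather than retrying into another rejection.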

How do per-tool limits interact with session-level limits?

Apply them in sequence: check the session-level budget first (is this session over its total call budget?), then check the per-tool bucket (is this specific tool over its limit?). Return distinct error messages so the LLM client knows which limit was hit. Session-level limits reset when the session reconnects; per-tool global limits are shared across sessions and only refill over time.
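
The two-stage check can be sketched as follows. To keep the example short, both layers use a simple non-refilling counter (the hypothetical `Budget` class) in place of a token bucket, and `checkLimits` and the `reason` values are illustrative names. One design choice worth noting: the sketch spends from both layers only when both have room, so a tool-level rejection does not drain the session budget.

```typescript
type LimitDecision =
  | { allowed: true }
  | { allowed: false; reason: 'session_budget' | 'tool_limit'; message: string };

// Minimal counter used for both layers in this sketch (no refill, for brevity).
class Budget {
  private used = 0;
  private readonly limit: number;
  constructor(limit: number) { this.limit = limit; }
  hasRoom(): boolean { return this.used < this.limit; }
  spend(): void { this.used += 1; }
}

// Check the session budget first, then the tool limit; charge both only on success.
function checkLimits(session: Budget, tool: Budget, toolName: string): LimitDecision {
  if (!session.hasRoom()) {
    return { allowed: false, reason: 'session_budget', message: 'Session call budget exhausted.' };
  }
  if (!tool.hasRoom()) {
    return { allowed: false, reason: 'tool_limit', message: `Tool '${toolName}' rate limit exceeded.` };
  }
  session.spend();
  tool.spend();
  return { allowed: true };
}

const sessionBudget = new Budget(3);
const deleteBudget = new Budget(1);
const searchBudget = new Budget(10);

const decisions = [
  checkLimits(sessionBudget, deleteBudget, 'delete_file'),      // allowed
  checkLimits(sessionBudget, deleteBudget, 'delete_file'),      // rejected: tool_limit
  checkLimits(sessionBudget, searchBudget, 'search_documents'), // allowed
  checkLimits(sessionBudget, searchBudget, 'search_documents'), // allowed
  checkLimits(sessionBudget, searchBudget, 'search_documents'), // rejected: session_budget
];
```

The distinct `reason` values let the client tell "stop calling this tool for a while" apart from "this session is out of budget".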

What if I want to allow bursts on some tools but not others?

The token bucket's maxTokens parameter controls burst capacity independently of the steady-state rate. Set a high maxTokens and a lower refillRate to allow short bursts but throttle sustained use. For example, { maxTokens: 20, refillRate: 0.08 } allows a burst of 20 calls immediately but refills at only ~5 calls per minute — suitable for a batch operation that legitimately needs 20 calls at once but shouldn't sustain that pace.
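
The arithmetic behind that example can be checked with a small helper. This is a sketch under the assumption that the bucket starts full and the caller takes every token as soon as it becomes available; `maxCallsInWindow` is a hypothetical name.

```typescript
interface BurstConfig {
  maxTokens: number;  // burst capacity
  refillRate: number; // tokens per second
}

// Upper bound on calls admitted in the first `seconds` after a full bucket:
// the initial burst plus whatever has refilled since, rounded down to whole calls.
function maxCallsInWindow(config: BurstConfig, seconds: number): number {
  return Math.floor(config.maxTokens + config.refillRate * seconds);
}

const batchTool: BurstConfig = { maxTokens: 20, refillRate: 0.08 };
const firstInstant = maxCallsInWindow(batchTool, 0);  // 20: the burst alone
const firstMinute = maxCallsInWindow(batchTool, 60);  // 24: burst plus 4 refilled tokens

const searchTool: BurstConfig = { maxTokens: 30, refillRate: 0.5 };
const searchFirstMinute = maxCallsInWindow(searchTool, 60); // 60: burst of 30 plus 30 refilled
```

The first minute after an idle period always admits more than the steady-state rate suggests, which is worth remembering when sizing `maxTokens` for destructive tools.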

Further reading