MCP server per-tool rate limiting
A single rate limit shared across all tools is a blunt instrument. An LLM calling search_documents twenty times in a loop will consume the shared budget and silently block delete_file — a dangerous destructive tool you wanted to throttle tightly anyway. Per-tool rate limiting assigns each tool its own independent counter so a bursty read operation never starves a cautious write operation, and you can tune limits to each tool's risk and cost profile.
TL;DR
Create a Map<toolName, TokenBucket> at server startup, keyed by tool name, with limits sourced from a Zod-validated config object. In your CallToolRequestSchema handler, look up the bucket for request.params.name, call bucket.consume(), and return isError: true with a rate-limit message if the bucket is empty. Low-cost read tools get high limits (30–60/min); destructive or expensive tools get tight limits (1–5/min).
Why per-tool limits matter
LLMs are not predictable callers. An agent loop solving a multi-step task may call a retrieval tool dozens of times in quick succession — that is normal behavior. The same agent might call a write or delete tool once per task. A uniform rate limit forces you to choose between two bad options: a limit high enough for retrieval (which also allows runaway writes) or a limit tight enough for writes (which throttles legitimate retrieval).
| Tool category | Typical call pattern | Suggested limit (calls/min per session) | Why |
|---|---|---|---|
| Read / search | Burst: 10–30 calls in a task | 30–60 | Low cost, low risk, high frequency |
| Write / create | Steady: 1–5 calls in a task | 5–15 | Moderate cost, moderate risk |
| Update / patch | Rare: 1–3 calls in a task | 3–10 | Side effects; mistakes are hard to undo |
| Delete / destroy | Rare: 0–2 calls in a task | 1–3 | Irreversible — tight limit protects against runaway deletion |
| External API call | Varies by upstream limit | Match upstream rate limit | Prevents your server from hammering third-party APIs |
| LLM / AI inference | Medium: 1–10 calls in a task | 5–10 | High cost — limits protect budget |
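The per-minute figures in the table can be converted into token bucket settings with a small helper. This is a minimal sketch, assuming the `maxTokens` / `refillRate` (tokens per second) config shape used later in this guide; `fromPerMinute`, its `burst` parameter, and `exampleLimits` are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: converts a calls-per-minute target into token bucket settings.
interface BucketSettings {
  maxTokens: number;  // burst ceiling
  refillRate: number; // tokens added per second
}

function fromPerMinute(callsPerMinute: number, burst: number = callsPerMinute): BucketSettings {
  if (callsPerMinute <= 0 || burst < 1) {
    throw new RangeError('callsPerMinute must be > 0 and burst must be >= 1');
  }
  return { maxTokens: Math.floor(burst), refillRate: callsPerMinute / 60 };
}

// Example mapping using values inside the table's suggested ranges
// (tool names are illustrative).
const exampleLimits: Record<string, BucketSettings> = {
  search_documents: fromPerMinute(30),
  create_file: fromPerMinute(10),
  delete_file: fromPerMinute(2),
};
```

Defaulting `burst` to the per-minute figure keeps the config readable; pass a smaller `burst` when a tool should never be called in rapid succession.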
Token bucket implementation with per-tool config
A token bucket gives each tool a pool of tokens. Each call consumes one token. Tokens refill at a configured rate. If no tokens remain, the call is rejected. This naturally handles bursts (up to the bucket capacity) while enforcing a steady-state rate.
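The refill rule can be written compactly. The notation here is an assumption introduced for illustration: C is the bucket capacity (maxTokens), r is the refill rate in tokens per second, b(t) is the token count at time t, and N_max(T) is the largest number of calls that can be accepted in any window of T seconds.

```latex
b(t) = \min\bigl(C,\; b(t_0) + r\,(t - t_0)\bigr),
\qquad
N_{\max}(T) \le C + r\,T
```

For example, with C = 30 and r = 0.5, a full bucket can accept up to 30 + 0.5 × 60 = 60 calls in its first minute, after which the sustained rate settles at 30 calls per minute.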
// src/rate-limit/per-tool.ts
import { z } from 'zod';

// Zod schema for a single tool's rate limit config
const ToolRateLimitSchema = z.object({
  maxTokens: z.number().int().positive(), // bucket capacity (burst ceiling)
  refillRate: z.number().positive(), // tokens added per second
  // Optional and currently unused: TokenBucket refills continuously from elapsed time
  refillInterval: z.number().int().positive().optional(), // ms between refills
});

// Zod schema for the full tool limit map
const ToolLimitsConfigSchema = z.record(z.string(), ToolRateLimitSchema);
type ToolLimitsConfig = z.infer<typeof ToolLimitsConfigSchema>;

class TokenBucket {
  private tokens: number;
  private readonly maxTokens: number;
  private readonly refillRate: number;
  private lastRefill: number;

  constructor(config: z.infer<typeof ToolRateLimitSchema>) {
    this.maxTokens = config.maxTokens;
    this.refillRate = config.refillRate;
    this.tokens = config.maxTokens; // start full
    this.lastRefill = Date.now();
  }

  consume(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  remainingTokens(): number {
    this.refill();
    return Math.floor(this.tokens);
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}

export class PerToolRateLimiter {
  private buckets = new Map<string, TokenBucket>();
  // Shared by every tool that has no explicit entry in the config
  private defaultBucket: TokenBucket;

  constructor(
    config: ToolLimitsConfig,
    defaultConfig = { maxTokens: 20, refillRate: 0.5, refillInterval: 1000 },
  ) {
    const parsed = ToolLimitsConfigSchema.parse(config);
    for (const [toolName, limits] of Object.entries(parsed)) {
      this.buckets.set(toolName, new TokenBucket(limits));
    }
    this.defaultBucket = new TokenBucket(defaultConfig);
  }

  // Returns true if the call is allowed, false if rate-limited
  allow(toolName: string): boolean {
    const bucket = this.buckets.get(toolName) ?? this.defaultBucket;
    return bucket.consume();
  }

  // Whole tokens currently available for the tool
  remaining(toolName: string): number {
    const bucket = this.buckets.get(toolName) ?? this.defaultBucket;
    return bucket.remainingTokens();
  }
}
Wire it into the MCP server with a config object that maps tool names to limits:
// src/server.ts
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { CallToolRequestSchema, ListToolsRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import { PerToolRateLimiter } from './rate-limit/per-tool.js';

// Server name and version are placeholders
const server = new Server(
  { name: 'example-server', version: '1.0.0' },
  { capabilities: { tools: {} } },
);

const rateLimiter = new PerToolRateLimiter({
  search_documents: { maxTokens: 30, refillRate: 0.5 }, // burst of 30, 0.5/s (30/min) steady
  list_files: { maxTokens: 60, refillRate: 1.0 }, // burst of 60, 1/s (60/min) steady
  read_file: { maxTokens: 30, refillRate: 0.5 },
  create_file: { maxTokens: 10, refillRate: 0.17 }, // burst of 10, ~10/min steady
  update_file: { maxTokens: 8, refillRate: 0.13 }, // burst of 8, ~8/min steady
  delete_file: { maxTokens: 2, refillRate: 0.03 }, // burst of 2, ~2/min steady
  run_query: { maxTokens: 15, refillRate: 0.25 }, // burst of 15, 15/min steady
  call_external_api: { maxTokens: 5, refillRate: 0.08 }, // ~5/min steady; respect upstream limits
});

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const toolName = request.params.name;
  if (!rateLimiter.allow(toolName)) {
    // Effectively always 0 here: a call is only refused when less than one token is left
    const remaining = rateLimiter.remaining(toolName);
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          error: 'rate_limited',
          tool: toolName,
          message: `Tool '${toolName}' rate limit exceeded. Retry after a short delay.`,
          remaining_tokens: remaining,
        }),
      }],
      isError: true,
    };
  }
  // ... normal tool dispatch
});
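When a call is rejected the bucket holds less than one token, so a concrete retry delay tells the caller more than the remaining token count does. The following is a minimal sketch of how that delay could be derived; `retryAfterSeconds`, `rateLimitPayload`, and the `retry_after_seconds` field are hypothetical additions, and using them would require the bucket to expose its fractional token count, which the class above does not do.

```typescript
// Hypothetical helper: whole seconds until a bucket holds at least one token.
function retryAfterSeconds(currentTokens: number, refillRate: number): number {
  if (refillRate <= 0) throw new RangeError('refillRate must be > 0');
  if (currentTokens >= 1) return 0;
  const deficit = 1 - currentTokens;       // fraction of a token still missing
  return Math.ceil(deficit / refillRate);  // refillRate is tokens per second
}

// Builds the JSON text for a rate-limit tool result.
// retry_after_seconds is an assumed field name, not part of the handler above.
function rateLimitPayload(tool: string, currentTokens: number, refillRate: number): string {
  return JSON.stringify({
    error: 'rate_limited',
    tool,
    message: `Tool '${tool}' rate limit exceeded.`,
    retry_after_seconds: retryAfterSeconds(currentTokens, refillRate),
  });
}
```

Rounding up with `Math.ceil` errs on the side of a slightly longer wait, which avoids a retry that arrives a few milliseconds too early.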
Per-session vs global per-tool limits
The implementation above uses a single global bucket per tool — all sessions share the limit. For multi-session servers, you usually want per-session buckets: each connected client gets its own token pool per tool so one aggressive session doesn't deplete the budget for all other sessions.
// Per-session, per-tool rate limiter
export class SessionToolRateLimiter {
  // sessionId → toolName → TokenBucket
  private sessions = new Map<string, Map<string, TokenBucket>>();
  private readonly config: ToolLimitsConfig;

  constructor(config: ToolLimitsConfig) {
    this.config = config;
  }

  allow(sessionId: string, toolName: string): boolean {
    if (!this.sessions.has(sessionId)) {
      this.sessions.set(sessionId, new Map());
    }
    const sessionBuckets = this.sessions.get(sessionId)!;
    if (!sessionBuckets.has(toolName)) {
      const toolConfig = this.config[toolName] ?? { maxTokens: 20, refillRate: 0.33 };
      sessionBuckets.set(toolName, new TokenBucket(toolConfig));
    }
    return sessionBuckets.get(toolName)!.consume();
  }

  // Call when a session disconnects to free memory
  clearSession(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}
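
// Wire up session lifecycle using the SDK's onclose callback.
// Assumptions: one Server instance per transport, a `transport` created elsewhere,
// and `sessionRateLimiter` being a SessionToolRateLimiter instance.
const sessionId = transport.sessionId ?? crypto.randomUUID();
server.onclose = () => sessionRateLimiter.clearSession(sessionId);
await server.connect(transport);
// In the tool handler, pass the same sessionId to sessionRateLimiter.allow(sessionId, toolName).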
Session-level buckets mean a fresh session always starts with a full token pool, which is the right default for interactive use. Global buckets are better for shared API keys where you want a total cap regardless of how many sessions are active.
Exposing limits in the tool description
LLM clients benefit from knowing a tool's rate limit in advance — the model can self-regulate to avoid hitting limits. Include the limit in the tool description so it appears in tools/list output:
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: 'delete_file',
      description: [
        'Permanently delete a file by path. This action cannot be undone.',
        'Rate limit: 2 calls per minute. Use only when the user explicitly confirms deletion.',
      ].join(' '),
      inputSchema: {
        type: 'object',
        properties: {
          path: { type: 'string', description: 'Absolute path of the file to delete' },
        },
        required: ['path'],
      },
    },
    {
      name: 'search_documents',
      description: 'Search documents by keyword. Rate limit: 30 calls per minute.',
      inputSchema: {
        type: 'object',
        properties: {
          query: { type: 'string' },
          limit: { type: 'number', default: 10, maximum: 50 },
        },
        required: ['query'],
      },
    },
  ],
}));
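This is particularly important for destructive tools. A well-prompted model that sees the 2/min limit in the description is more likely to pause before calling delete_file a third time in quick succession. Treat the description as a hint only: the server-side bucket remains the actual enforcement.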
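Hand-written limit text in a description can drift from the configured bucket. One way to avoid that is to generate the sentence from the config. This is a sketch under the assumption that limits use the `maxTokens` / `refillRate` shape from this guide; `describeLimit` and `withLimit` are hypothetical helpers.

```typescript
interface LimitConfig {
  maxTokens: number;
  refillRate: number; // tokens per second
}

// Hypothetical helper: renders a limit as a sentence for a tool description.
// Note: sustained rates below roughly 0.5 calls/min round to 0 and would need separate wording.
function describeLimit(cfg: LimitConfig): string {
  const perMinute = Math.round(cfg.refillRate * 60);
  return `Rate limit: burst of ${cfg.maxTokens}, then about ${perMinute} calls per minute.`;
}

// Appends the generated limit sentence to an existing description.
function withLimit(description: string, cfg: LimitConfig): string {
  return `${description} ${describeLimit(cfg)}`;
}
```

Calling `withLimit('Search documents by keyword.', { maxTokens: 30, refillRate: 0.5 })` when building the tools/list response keeps the advertised limit and the enforced limit in one place.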
Monitoring per-tool rate limit hit rates
Track how often each tool hits its limit. A high hit rate on a read tool indicates the limit is too tight. A zero hit rate on a destructive tool suggests the tool may never be used — or the limit is so high it provides no protection.
// Instrument the rate limiter to emit metrics
class InstrumentedPerToolRateLimiter extends PerToolRateLimiter {
  private hitCounts = new Map<string, number>();
  private allowCounts = new Map<string, number>();

  allow(toolName: string): boolean {
    const allowed = super.allow(toolName);
    if (allowed) {
      this.allowCounts.set(toolName, (this.allowCounts.get(toolName) ?? 0) + 1);
    } else {
      this.hitCounts.set(toolName, (this.hitCounts.get(toolName) ?? 0) + 1);
      // Emit structured log for monitoring
      console.log(JSON.stringify({
        event: 'rate_limit_hit',
        tool: toolName,
        timestamp: new Date().toISOString(),
      }));
    }
    return allowed;
  }

  getStats(): Record<string, { allowed: number; blocked: number; hitRate: number }> {
    const stats: Record<string, { allowed: number; blocked: number; hitRate: number }> = {};
    for (const toolName of new Set([...this.allowCounts.keys(), ...this.hitCounts.keys()])) {
      const allowed = this.allowCounts.get(toolName) ?? 0;
      const blocked = this.hitCounts.get(toolName) ?? 0;
      stats[toolName] = { allowed, blocked, hitRate: blocked / (allowed + blocked) };
    }
    return stats;
  }
}
Log the stats every few minutes or expose them on an admin endpoint. An AliveMCP external monitor can detect when rate limit errors are spiking (which shows up as tool-level errors in the protocol probe) even before your own alerting fires.
Related questions
Should I use per-tool or per-session rate limits?
Use both. Per-tool limits protect specific operations (a single destructive tool stays within its limit regardless of session count). Per-session limits protect fairness between clients (one aggressive session can't crowd out others). Apply per-tool limits globally and per-session limits layered on top — a call must pass both checks. Start with per-tool global limits since they're simpler to implement; add per-session limits when you have multiple concurrent clients or a multi-tenant deployment.
How do I set the right limits without real production data?
Start conservative and adjust upward. For read tools: allow 1 call per second per session (60/min). For write tools: allow 1 call every 6 seconds (10/min). For destructive tools: allow 1 call every 30 seconds (2/min). Instrument hit rates and raise the limit on any tool whose hit rate exceeds 5%, since that means legitimate callers are being blocked. Keep limits tight on any tool that triggers an irreversible side effect or a paid external API call.
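The 5% rule can be checked mechanically against the stats object produced by getStats(). A minimal sketch follows; `toolsOverThreshold` is a hypothetical helper, and its `minCalls` guard against judging a tool on too few samples is an assumption rather than something this guide prescribes.

```typescript
interface ToolStats {
  allowed: number;
  blocked: number;
  hitRate: number; // blocked / (allowed + blocked)
}

// Hypothetical helper: names of tools whose hit rate exceeds the threshold,
// ignoring tools that have seen fewer than minCalls calls in total.
function toolsOverThreshold(
  stats: Record<string, ToolStats>,
  threshold: number = 0.05,
  minCalls: number = 20,
): string[] {
  return Object.entries(stats)
    .filter(([, s]) => s.allowed + s.blocked >= minCalls && s.hitRate > threshold)
    .map(([name]) => name)
    .sort();
}
```

Tools returned by this function are candidates for a higher limit; destructive tools on the list deserve a manual look before any change.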
What should the rate limit error response look like?
Return isError: true with a JSON payload containing at minimum: the error type ("error": "rate_limited"), the tool name, a human-readable message, and the number of remaining tokens. Do not return an HTTP 429: that is a transport-layer response, whereas MCP tool execution errors belong inside the tool result, which travels as an ordinary successful JSON-RPC response. The isError: true flag signals to LLM frameworks that the result is a tool failure, prompting retry logic or a graceful fallback.
How do per-tool limits interact with session-level limits?
Apply them in sequence: check the session-level budget first (is this session over its total call budget?), then check the per-tool bucket (is this specific tool over its limit?). Return distinct error messages so the LLM client knows which limit was hit. Session-level limits reset when the session reconnects; per-tool global limits are shared across sessions and only refill over time.
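The sequence described above can be sketched as a two-stage check that reports which limit was hit. This is an illustration only: modelling the session budget as a token bucket is an assumption, and `SimpleBucket`, `checkLayered`, and the reason strings are hypothetical names. The clock is injectable so the behaviour can be exercised without waiting.

```typescript
type Decision =
  | { allowed: true }
  | { allowed: false; reason: 'session_budget_exceeded' | 'tool_rate_limited' };

// Minimal token bucket with an injectable clock (milliseconds).
class SimpleBucket {
  private tokens: number;
  private last: number;
  private readonly maxTokens: number;
  private readonly refillRate: number; // tokens per second
  private readonly now: () => number;

  constructor(maxTokens: number, refillRate: number, now: () => number = () => Date.now()) {
    this.maxTokens = maxTokens;
    this.refillRate = refillRate;
    this.now = now;
    this.tokens = maxTokens; // start full
    this.last = now();
  }

  // Refills from elapsed time, then reports whether a whole token is available.
  hasToken(): boolean {
    const t = this.now();
    this.tokens = Math.min(this.maxTokens, this.tokens + ((t - this.last) / 1000) * this.refillRate);
    this.last = t;
    return this.tokens >= 1;
  }

  take(): void {
    this.tokens -= 1;
  }
}

// Session budget first, then the per-tool bucket; tokens are spent only if both pass.
function checkLayered(session: SimpleBucket, tool: SimpleBucket): Decision {
  if (!session.hasToken()) return { allowed: false, reason: 'session_budget_exceeded' };
  if (!tool.hasToken()) return { allowed: false, reason: 'tool_rate_limited' };
  session.take();
  tool.take();
  return { allowed: true };
}
```

Checking both buckets before spending from either means a call rejected by the tool limit does not also drain the session budget.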
What if I want to allow bursts on some tools but not others?
The token bucket's maxTokens parameter controls burst capacity independently of the steady-state rate. Set a high maxTokens and a lower refillRate to allow short bursts but throttle sustained use. For example, { maxTokens: 20, refillRate: 0.08 } allows a burst of 20 calls immediately but refills at only ~5 calls per minute — suitable for a batch operation that legitimately needs 20 calls at once but shouldn't sustain that pace.
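The burst behaviour of that example config can be checked by replaying call timestamps against the refill rule. This is a minimal sketch: `simulate` and `BurstConfig` are hypothetical names, and the bucket is assumed to start full, as in the TokenBucket class earlier in this guide.

```typescript
interface BurstConfig {
  maxTokens: number;
  refillRate: number; // tokens per second
}

// Replays call timestamps (ms, ascending) against a token bucket that starts full
// and returns, for each call, whether it would be accepted.
function simulate(cfg: BurstConfig, callTimesMs: number[]): boolean[] {
  let tokens = cfg.maxTokens;
  let last = callTimesMs[0] ?? 0;
  return callTimesMs.map((t) => {
    tokens = Math.min(cfg.maxTokens, tokens + ((t - last) / 1000) * cfg.refillRate);
    last = t;
    if (tokens < 1) return false;
    tokens -= 1;
    return true;
  });
}

// 21 calls at t = 0, then one more 15 seconds later.
const burstConfig: BurstConfig = { maxTokens: 20, refillRate: 0.08 };
const burstTimes: number[] = [...new Array<number>(21).fill(0), 15_000];
const burstResults = simulate(burstConfig, burstTimes);
const acceptedCount = burstResults.filter(Boolean).length;
```

Here the first 20 calls are accepted, the 21st is rejected because the bucket is empty, and the call 15 seconds later is accepted because 15 × 0.08 = 1.2 tokens have refilled by then.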