MCP server rate limiting

Rate limiting an MCP server differs from rate limiting a stateless REST API because MCP sessions are stateful. Each session starts with an initialize handshake and may make dozens or hundreds of tool calls across its lifetime. Rate limits can apply at several layers: connection rate (how many new sessions a caller can open per minute), concurrent sessions (how many sessions a caller can hold open at once), tool call rate (how many tool calls a session can make per minute), and per-tool budget (specific expensive tools have their own limits). The wrong choice is answering every rate-limited tool call with a session-level HTTP 429: that terminates the session and forces a full re-initialize, which is expensive for the client. The right approach depends on what you are protecting.

TL;DR

Limit new session creation at the HTTP layer with a per-identity token bucket (HTTP 429 before initialize is processed). Limit tool calls within a session at the tool handler level — return an isError: true result with a rate-limit message rather than throwing, so the session stays alive. Use different limits for expensive tools (those that call external APIs or do heavy computation) versus cheap tools. Log limit hits in structured logs with the tool name and caller identity. Alert when the hit rate exceeds 5% of calls — that signals either a misconfigured client or a limit that is too tight.

What to rate limit and where

Layer	What it limits	Where to enforce	Error returned
Connection rate	New sessions per minute per identity	HTTP middleware, before `initialize`	HTTP 429
Concurrent sessions	Maximum open sessions per identity	HTTP middleware	HTTP 429
Tool call rate	Tool calls per minute within a session	Tool handler or server middleware	`isError: true` result
Per-tool budget	Calls to a specific expensive tool	Inside that tool's handler	`isError: true` result

The distinction between HTTP 429 (before the session) and isError: true (inside the session) is critical. HTTP 429 terminates the connection — the client must start a new session, including a new initialize handshake. An isError: true tool result means the tool call was rate limited but the session is still alive — the client can retry after a delay, try a different tool, or back off. Use HTTP 429 only at the session creation boundary. Use isError: true for in-session limits.
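To make the boundary concrete, the sketch below models the two rejection shapes as plain data. The `Layer` and `Rejection` types and the `rejectionFor` helper are hypothetical names chosen for illustration, not part of any MCP SDK; only the 429 status and the `isError: true` result shape come from the text above.

```typescript
// Hypothetical helper: picks the rejection shape for each rate-limit layer.
type Layer = 'connection_rate' | 'concurrent_sessions' | 'tool_call_rate' | 'per_tool_budget';

type Rejection =
  | { kind: 'http'; status: 429; retryAfterSeconds: number }
  | { kind: 'tool_result'; result: { content: { type: 'text'; text: string }[]; isError: true } };

function rejectionFor(layer: Layer, message: string, retryAfterSeconds = 60): Rejection {
  // Session-boundary layers reject before initialize runs, so HTTP 429 costs nothing extra.
  if (layer === 'connection_rate' || layer === 'concurrent_sessions') {
    return { kind: 'http', status: 429, retryAfterSeconds };
  }
  // In-session layers return an ordinary tool result flagged as an error,
  // which leaves the session open for further calls.
  return {
    kind: 'tool_result',
    result: { content: [{ type: 'text', text: message }], isError: true },
  };
}
```

Keeping the decision in one place like this makes it harder for a tool handler to answer with a 429 by accident.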

Token bucket for session rate limiting

// In-process token bucket — no Redis required for single-instance servers
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,   // max tokens
    private readonly refillRate: number, // tokens per second
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  consume(n = 1): boolean {
    this.refill();
    if (this.tokens >= n) {
      this.tokens -= n;
      return true;
    }
    return false;
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}

// Per-identity buckets — keyed by API key prefix or OAuth sub
const sessionBuckets = new Map<string, TokenBucket>();

function rateLimitMiddleware(req, res, next) {
  const identity = res.locals.identity; // set by authMiddleware
  const key = identity?.sub ?? identity?.key_prefix ?? 'anonymous';

  if (!sessionBuckets.has(key)) {
    // 10 new sessions per minute = capacity 10, refill 0.167/s
    sessionBuckets.set(key, new TokenBucket(10, 10 / 60));
  }

  const bucket = sessionBuckets.get(key)!;
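  if (!bucket.consume()) {
    // At 10/60 tokens per second one token refills every 6 seconds; 60 is a conservative hint.
    res.set('Retry-After', '60');
    return res.status(429).json({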
      error: 'Too many sessions. Retry after 60 seconds.',
      retryAfterSeconds: 60,
    });
  }

  next();
}

The token bucket provides smooth rate limiting without the cliff behavior of a fixed window counter. A caller that has drained its bucket and then stays idle for 30 seconds regains 5 tokens (30 s × 10/60 tokens per second) and can burst 5 new sessions without hitting the limit. Periodic cleanup of the sessionBuckets map prevents unbounded memory growth: use an LRU eviction policy, or remove buckets that have been at full capacity for more than 10 minutes, since they belong to inactive callers.
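One way to implement the idle cleanup is a periodic sweep over the map. The sketch below assumes each bucket exposes a `lastUsedMs` timestamp; the `IdleTracked` interface and `sweepIdleBuckets` function are hypothetical, and the `TokenBucket` class above would need to expose its private `lastRefill` value (or an equivalent) for this to apply.

```typescript
// Assumed shape: each bucket records when it was last touched.
interface IdleTracked {
  lastUsedMs: number; // timestamp of the last consume() attempt
}

// Removes buckets idle for longer than maxIdleMs and returns how many were removed.
function sweepIdleBuckets<T extends IdleTracked>(
  buckets: Map<string, T>,
  nowMs: number,
  maxIdleMs = 10 * 60 * 1000, // 10 minutes, as suggested in the text
): number {
  let removed = 0;
  // Deleting entries from a Map while iterating over it is safe in JavaScript.
  buckets.forEach((bucket, key) => {
    if (nowMs - bucket.lastUsedMs > maxIdleMs) {
      buckets.delete(key);
      removed++;
    }
  });
  return removed;
}
```

Eviction does not change behavior for the caller: with a 60-second full refill, a bucket that has been idle for 10 minutes would be full anyway, which is the same state a freshly created bucket starts in. The sweep could be driven by a timer, for example once per minute.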

Sliding window with Redis for distributed deployments

import { randomUUID } from 'node:crypto';
import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

// Returns true if the request is allowed and recorded, false if the window is full.
async function checkRateLimit(key: string, windowSeconds: number, maxRequests: number): Promise<boolean> {
  const now = Date.now();
  const windowStart = now - windowSeconds * 1000;

  // Lua script for atomic sliding window check-and-record
  const script = `
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window_start = tonumber(ARGV[2])
    local max = tonumber(ARGV[3])
    local ttl = tonumber(ARGV[4])
    local member = ARGV[5]

    redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
    local count = redis.call('ZCARD', key)
    if count < max then
      redis.call('ZADD', key, now, member)
      redis.call('EXPIRE', key, ttl)
      return 1
    end
    return 0
  `;

  // Unique member per request: using the timestamp as the member would merge
  // two requests that arrive in the same millisecond into one sorted-set entry.
  const member = `${now}:${randomUUID()}`;

  const result = await redis.eval(script, {
    keys: [key],
    arguments: [String(now), String(windowStart), String(maxRequests), String(windowSeconds + 1), member],
  });

  return result === 1;
}

// Usage in middleware:
async function rateLimitMiddleware(req, res, next) {
  const identity = res.locals.identity;
  const key = `rl:sessions:${identity?.sub ?? 'anon'}`;
  const allowed = await checkRateLimit(key, 60, 10); // 10 sessions per 60 seconds

  if (!allowed) {
    return res.status(429).json({ error: 'Rate limit exceeded', retryAfterSeconds: 60 });
  }
  next();
}

Redis runs the whole Lua script atomically, so the check and the record cannot interleave with commands from other clients. This prevents race conditions when multiple instances handle requests from the same identity simultaneously. The sliding window is more accurate than a fixed window, which can admit up to 2× the limit in a short burst straddling the window boundary. For single-instance deployments, the in-process token bucket is simpler; use Redis only when you scale horizontally. See MCP server deployment for scaling patterns.
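The boundary burst is easy to reproduce without Redis. The following in-memory simulation is an illustrative sketch only (both function names are made up for this example); it feeds the same burst of timestamps to a fixed window counter and to a sliding window log that uses the same cutoff rule as the Lua script.

```typescript
// Counts how many of the given request timestamps (ms) a fixed window admits.
function fixedWindowAdmitted(timestamps: number[], windowMs: number, max: number): number {
  const counts = new Map<number, number>(); // window index -> admitted count
  let admitted = 0;
  for (const t of timestamps) {
    const idx = Math.floor(t / windowMs);
    const used = counts.get(idx) ?? 0;
    if (used < max) {
      counts.set(idx, used + 1);
      admitted++;
    }
  }
  return admitted;
}

// Counts how many of the same timestamps a sliding window log admits.
function slidingWindowAdmitted(timestamps: number[], windowMs: number, max: number): number {
  let log: number[] = []; // admitted requests still inside the window
  let admitted = 0;
  for (const t of timestamps) {
    log = log.filter((x) => x > t - windowMs); // drop entries at or before the window start
    if (log.length < max) {
      log.push(t);
      admitted++;
    }
  }
  return admitted;
}

// 10 requests just before the one-minute boundary, 10 just after it.
const burst = [
  ...Array.from({ length: 10 }, (_, i) => 59_000 + i),
  ...Array.from({ length: 10 }, (_, i) => 60_000 + i),
];
console.log(fixedWindowAdmitted(burst, 60_000, 10));   // 20
console.log(slidingWindowAdmitted(burst, 60_000, 10)); // 10
```

With a limit of 10 per minute, the fixed window lets all 20 requests through in about one second, while the sliding window holds the caller to 10.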

Per-tool call limits within a session

import { z } from 'zod';

// Tool-level rate limiting: the session stays alive when a call is rejected
const toolCallCounts = new Map<string, Map<string, number>>(); // sessionId → toolName → count

server.tool(
  'generate_report',
  'Generates a detailed analytics report — expensive operation',
  { dateRange: z.string().describe('Date range in ISO 8601 format') },
  async (args, context) => {
    const sessionId = context.sessionId;
    const LIMIT = 5; // max 5 report generations per session

    const sessionCounts = toolCallCounts.get(sessionId) ?? new Map();
    const currentCount = sessionCounts.get('generate_report') ?? 0;

    if (currentCount >= LIMIT) {
      return {
        content: [{
          type: 'text',
          text: `Rate limit reached: generate_report is limited to ${LIMIT} calls per session. ` +
                `Start a new session to reset the limit.`,
        }],
        isError: true,
      };
    }

    sessionCounts.set('generate_report', currentCount + 1);
    toolCallCounts.set(sessionId, sessionCounts);

    const report = await generateReport(args.dateRange);
    return { content: [{ type: 'text', text: JSON.stringify(report) }] };
  }
);

Returning isError: true keeps the session alive, so the client can read the error message, keep calling other tools, and decide whether to start a new session or work around the limit. The session-to-tool-count map should be cleaned up when the session ends; add a cleanup hook via the transport's close event. Per-tool limits are appropriate for tools that call external APIs with their own rate limits or perform expensive computation. Cheap tools (string transformation, lookup, calculation) typically do not need per-tool limits.
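The cleanup hook might look like the sketch below. It assumes the transport exposes an optional `onclose` callback property; `ClosableTransport` is a minimal stand-in interface and `clearCountsOnClose` is a hypothetical helper, so check the transport type in your SDK version before relying on this shape.

```typescript
// Assumed minimal transport shape: an optional onclose callback.
interface ClosableTransport {
  onclose?: () => void;
}

type ToolCounts = Map<string, Map<string, number>>; // sessionId -> toolName -> count

// Drops a session's counters when its transport closes, while still
// calling any onclose handler that was installed earlier.
function clearCountsOnClose(
  transport: ClosableTransport,
  sessionId: string,
  counts: ToolCounts,
): void {
  const previous = transport.onclose;
  transport.onclose = () => {
    counts.delete(sessionId);
    if (previous) previous();
  };
}
```

Chaining the previous handler matters because the server or other middleware may already have registered its own close callback on the same transport.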

Measuring rate limit hit rates

Log every rate limit event in your structured logs:

logger.warn({
  level: 'warn',
  event: 'rate_limit_hit',
  layer: 'tool_call', // or 'session_creation'
  tool_name: 'generate_report',
  session_id: context.sessionId,
  caller_prefix: identity?.key_prefix ?? identity?.sub?.slice(0, 8),
  limit: LIMIT,
  current_count: currentCount,
});

Alert when the rate limit hit rate exceeds 5% of total tool calls for a given tool. A sustained hit rate above that threshold indicates either a misconfigured client that is retrying aggressively without backing off, or a limit that is too tight for the actual workload. AliveMCP's probe calls never hit your tool-level limits (probes only run the initialize handshake, not tools/call), so an unexpected spike in rate limit events is genuine user traffic, not monitoring noise. See MCP server latency for how rate limiting introduces latency variance in overall MCP server performance metrics.
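The 5% rule can be expressed as a small check over per-tool counters. This is a sketch under assumptions: `ToolStats` and `toolsOverThreshold` are hypothetical names, and the counters are presumed to come from aggregating the `rate_limit_hit` log events against total tool calls over the same period.

```typescript
// Hypothetical per-tool counters, e.g. aggregated from structured logs.
interface ToolStats {
  calls: number;         // all tool calls, including rejected ones
  rateLimitHits: number; // calls rejected with a rate-limit isError result
}

// Returns the tools whose hit rate is strictly above the threshold (default 5%).
function toolsOverThreshold(stats: Map<string, ToolStats>, threshold = 0.05): string[] {
  const flagged: string[] = [];
  stats.forEach((s, tool) => {
    if (s.calls > 0 && s.rateLimitHits / s.calls > threshold) flagged.push(tool);
  });
  return flagged;
}

const stats = new Map<string, ToolStats>([
  ['generate_report', { calls: 200, rateLimitHits: 14 }], // 7%
  ['lookup', { calls: 1000, rateLimitHits: 10 }],         // 1%
]);
console.log(toolsOverThreshold(stats)); // [ 'generate_report' ]
```

Evaluating the ratio per tool rather than globally keeps a high-volume cheap tool from masking a tight limit on an expensive one.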

Related questions

How should I rate limit anonymous (unauthenticated) MCP servers?

Use the client IP address as the rate limit key, with a generous limit per IP (e.g., 30 new sessions per minute) to accommodate shared IP addresses like corporate proxies and NAT gateways. IP-based limits are less precise but necessary when there is no identity to key on. If your server is open to the internet without authentication, also consider connection-level limits at the reverse proxy (the rate_limit directive from Caddy's rate limit plugin, or nginx's limit_req_zone) as a first line of defense before your application logic. See MCP server authentication for adding identity-based rate limits.
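A minimal sketch of the key selection, assuming the same identity fields used in the middleware earlier (`sub`, `key_prefix`). The `rateLimitKey` function and the `id:` / `ip:` prefixes are illustrative choices, not an established convention.

```typescript
interface Identity {
  sub?: string;
  key_prefix?: string;
}

// Picks the bucket key and the per-minute session limit for a request.
// Authenticated callers get the identity-based limit (10/min in the earlier
// example); anonymous callers share a more generous per-IP limit.
function rateLimitKey(
  identity: Identity | undefined,
  clientIp: string,
): { key: string; perMinute: number } {
  const id = identity?.sub ?? identity?.key_prefix;
  if (id) return { key: `id:${id}`, perMinute: 10 };
  return { key: `ip:${clientIp}`, perMinute: 30 };
}
```

Prefixing the key keeps an IP address from ever colliding with an identity string. Behind a reverse proxy, the client IP has to come from a forwarded header that you trust, otherwise every request appears to originate from the proxy and shares one bucket.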

What should I return in the Retry-After header?

For HTTP 429 responses (session creation limits), set Retry-After: 60 (or however many seconds until the bucket has a token again). In-session tool-level limits are delivered as a tool result inside the session rather than as an HTTP error response, so there is no Retry-After header to set; include the retry guidance in the isError: true text content instead. For per-session tool counts (limits that reset when the session ends), the guidance is "start a new session" rather than "wait 60 seconds".
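For a token bucket, the wait until the next request can succeed follows directly from the refill rate: (tokens needed − tokens available) / refill rate, rounded up to whole seconds. The helper below is a sketch with a hypothetical name; it takes the bucket state as arguments because the `TokenBucket` class above keeps its fields private.

```typescript
// Seconds until `needed` tokens are available, rounded up for Retry-After.
function retryAfterSeconds(tokens: number, refillPerSecond: number, needed = 1): number {
  if (tokens >= needed) return 0;
  return Math.ceil((needed - tokens) / refillPerSecond);
}

// Empty bucket at 10 sessions per minute: the next token arrives in 6 seconds.
console.log(retryAfterSeconds(0, 10 / 60)); // 6
```

A fixed 60 is a safe over-estimate for the example limit; the computed value lets well-behaved clients resume sooner.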

How do rate limits interact with AliveMCP probes?

AliveMCP probes send only the initialize handshake — they do not call any tools. This means probe traffic contributes to your session creation rate limit but not to your tool call limits. At one probe per 60 seconds, the probe consumes 1 session-creation token per minute. If your monitoring account has its own API key or OAuth sub, give it a separate bucket with a reserved allocation so a burst of user traffic cannot starve the probe and cause false uptime alerts.
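Because the buckets are keyed by identity, a probe with its own identity already gets its own bucket; what remains is sizing it. The sketch below uses a hypothetical override table, and the `monitoring-probe` key is a placeholder for whatever `sub` or key prefix your monitoring account actually has.

```typescript
interface BucketLimits {
  capacity: number;
  refillPerSecond: number;
}

// Default from the earlier example: 10 new sessions per minute.
const DEFAULT_LIMITS: BucketLimits = { capacity: 10, refillPerSecond: 10 / 60 };

// Hypothetical reserved allocations, keyed by identity.
// One probe per 60 seconds needs at least 1/60 tokens per second;
// 2 per minute leaves headroom for a single retry.
const RESERVED_LIMITS = new Map<string, BucketLimits>([
  ['monitoring-probe', { capacity: 2, refillPerSecond: 2 / 60 }],
]);

function limitsFor(identityKey: string): BucketLimits {
  return RESERVED_LIMITS.get(identityKey) ?? DEFAULT_LIMITS;
}
```

In the middleware, `limitsFor(key)` would supply the arguments when a new bucket is created for that key.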

Should I rate limit the initialize handshake itself?

Yes, at the connection rate layer — that is the session creation rate limit described above. Rate limiting the initialize response itself (slowing down the JSON-RPC response) is not a good pattern because AliveMCP and other health check tools measure initialize latency. A slow initialize caused by rate-limit throttling will appear as a latency spike in your uptime monitoring dashboard. Rate limit by rejecting new session creation early (HTTP 429 before the transport handles the request) rather than by slowing down the response.