Rate Limiting · 2026-06-27 · Rate Limiting & Throttling arc

MCP Server Rate Limiting: Per-Tool Limits, Client Throttling, Backoff, DDoS Defense, and Quota Management

Rate limiting an MCP server is not one problem; it is six, layered on top of each other. The session creation rate (how many new initialize handshakes per minute) is different from the tool call rate (how many calls per session). A per-tool limit for a destructive delete_file is different from a per-tool limit for a cheap search_documents. Per-client throttling (one aggressive session shouldn't starve the others) is different from quota management (daily or monthly call budgets per plan tier). And none of the in-process limits matter if a connection flood exhausts your file descriptors at the OS level before a single byte of JSON-RPC is processed. This guide covers each layer in turn: per-tool limits, per-client throttling, backoff for callers, transport-level DDoS defense, and quotas. It closes with the order in which to deploy them, which starts with transport defense. Each layer is accompanied by TypeScript code you can adapt.

TL;DR

Apply rate limiting in layers: (1) Caddy/nginx enforces connection-rate limits and a 1MB body cap before Node.js sees the request; (2) a ClientThrottler assigns each caller its own token bucket keyed by session ID, with a penalty multiplier that escalates after repeated violations; (3) a PerToolRateLimiter maintains per-tool buckets, with high limits for read tools (30–60/min) and tight limits for destructive tools (1–3/min); (4) rate-limit error payloads include retry_after_ms so callers can implement full-jitter exponential backoff rather than thundering-herd retries; (5) a ConcurrencyGuard caps simultaneous tool calls at a global maximum; (6) a SQLite-backed QuotaManager enforces daily call budgets per plan tier, distinct from per-second rate limits. AliveMCP's external probe catches the cases where a transport-level defense blocks the protocol handshake itself; rate-limit hit rates from structured logs reveal when the in-process layers are rejecting legitimate tool calls.

Why uniform rate limits fail for MCP servers

A single shared rate limit across all tools is the most common mistake in MCP server rate limiting. It forces an impossible choice: set the limit high enough for a legitimate agent that calls search_documents twenty times per task, and it is also far too high for a destructive delete_file tool that should run at most twice per minute. Set it low enough to protect the destructive tool, and you throttle legitimate retrieval so aggressively that LLM agent loops time out before completing basic tasks.

LLMs are not predictable callers in the way a human with a browser is. An agent loop solving a multi-step task may call a retrieval tool dozens of times in quick succession — that is correct behavior, not abuse. The same agent might call a write or delete tool once per task. The rate limit system needs to reflect that asymmetry rather than flatten it.

Tool category	Typical call pattern	Suggested limit (calls/min per session)	Why
Read / search	Burst: 10–30 calls in a task	30–60	Low cost, low risk, high frequency
Write / create	Steady: 1–5 calls in a task	5–15	Moderate cost, moderate risk
Update / patch	Rare: 1–3 calls in a task	3–10	Side effects; mistakes are hard to undo
Delete / destroy	Rare: 0–2 calls in a task	1–3	Irreversible — tight limit protects against runaway deletion
External API call	Varies by upstream limit	Match upstream rate limit	Prevents your server from hammering third-party APIs
LLM / AI inference	Medium: 1–10 calls in a task	5–10	High cost — limits protect budget

There is also a second dimension the tool axis doesn't capture: which client is calling. One aggressive session — a runaway agent loop, a misconfigured retry handler — can exhaust a global rate limit and starve all other connected sessions. This is the fairness problem, and it requires per-client throttling on top of per-tool limits.

The correct mental model for MCP rate limiting is a stack: global transport defense at the bottom, per-client throttling in the middle, per-tool limits on top, quotas as the longest-horizon control. Each layer addresses a different threat, and they compose rather than replace each other.

Per-tool token buckets: the PerToolRateLimiter pattern

The token bucket algorithm is the right default for MCP tool rate limits because it naturally handles the burst behavior of LLM agents. Each bucket starts full (capacity = burst ceiling) and refills at a steady rate. A burst of twenty search_documents calls consumes twenty tokens instantly; at a refill rate of 0.5 tokens/sec, those twenty tokens are back within 40 seconds. A tight bucket for delete_file (capacity 2, refill rate 0.03/sec) allows two quick deletes and then enforces a cooldown of roughly 33 seconds before the next one.
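The arithmetic behind those settings is easy to get wrong when converting between "calls per minute" and "tokens per second", so a small sketch may help. The helper names here (bucketFromPerMinute, secondsToRefill) are illustrative assumptions, not part of any library or of the limiter below.

```typescript
// Hypothetical helpers for reasoning about token bucket settings.
interface BucketSettings {
  maxTokens: number;   // burst ceiling
  refillRate: number;  // tokens added per second
}

// Convert a "calls per minute" budget into token bucket settings.
function bucketFromPerMinute(burst: number, callsPerMinute: number): BucketSettings {
  return { maxTokens: burst, refillRate: callsPerMinute / 60 };
}

// Seconds needed to regain `tokens` tokens at a given refill rate.
function secondsToRefill(tokens: number, refillRate: number): number {
  return tokens / refillRate;
}

// 30 burst, 30 calls/min steady gives a refill rate of 0.5 tokens/sec.
const search = bucketFromPerMinute(30, 30);
// Regaining the 20 tokens spent in a burst takes 40 seconds.
const afterBurst = secondsToRefill(20, search.refillRate);
// One delete token at 0.03 tokens/sec takes about 33 seconds.
const deleteCooldown = secondsToRefill(1, 0.03);
```

Working backwards from the cooldown you want (seconds per call) to the refill rate (1 / seconds) is usually the less error-prone direction for destructive tools.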

// src/rate-limit/per-tool.ts
import { z } from 'zod';

const ToolRateLimitSchema = z.object({
  maxTokens: z.number().int().positive(),   // bucket capacity (burst ceiling)
  refillRate: z.number().positive(),         // tokens added per second
});

const ToolLimitsConfigSchema = z.record(z.string(), ToolRateLimitSchema);
type ToolLimitsConfig = z.infer<typeof ToolLimitsConfigSchema>;

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly maxTokens: number,
    private readonly refillRate: number,
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  consume(): boolean {
    this.refill();
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  msUntilToken(): number {
    this.refill();
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) / this.refillRate) * 1000);
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}

export class PerToolRateLimiter {
  // sessionId → toolName → TokenBucket
  private sessions = new Map<string, Map<string, TokenBucket>>();
  private readonly config: ToolLimitsConfig;
  private readonly defaultConfig = { maxTokens: 20, refillRate: 0.33 };

  constructor(config: ToolLimitsConfig) {
    this.config = ToolLimitsConfigSchema.parse(config);
  }

  allow(sessionId: string, toolName: string): { allowed: boolean; retryAfterMs: number } {
    if (!this.sessions.has(sessionId)) {
      this.sessions.set(sessionId, new Map());
    }
    const sessionBuckets = this.sessions.get(sessionId)!;

    if (!sessionBuckets.has(toolName)) {
      const toolConfig = this.config[toolName] ?? this.defaultConfig;
      sessionBuckets.set(toolName, new TokenBucket(toolConfig.maxTokens, toolConfig.refillRate));
    }

    const bucket = sessionBuckets.get(toolName)!;
    const allowed = bucket.consume();
    return { allowed, retryAfterMs: allowed ? 0 : bucket.msUntilToken() };
  }

  clearSession(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}

// Wire into server:
const rateLimiter = new PerToolRateLimiter({
  search_documents:  { maxTokens: 30, refillRate: 0.5 },    // 30 burst, 30/min steady
  list_files:        { maxTokens: 60, refillRate: 1.0 },
  read_file:         { maxTokens: 30, refillRate: 0.5 },
  create_file:       { maxTokens: 10, refillRate: 0.17 },
  update_file:       { maxTokens: 8,  refillRate: 0.13 },
  delete_file:       { maxTokens: 2,  refillRate: 0.03 },   // 2 burst, ~2/min steady
  call_external_api: { maxTokens: 5,  refillRate: 0.08 },
});

Key decisions in this implementation: buckets are per-session and per-tool. A fresh session always starts with a full token pool — correct for interactive use. Clearing the session map on disconnect prevents unbounded memory growth. The msUntilToken() method calculates the exact wait time so it can be included in the error payload for callers to use.

Wire the limiter into the CallToolRequestSchema handler and return isError: true rather than throwing or returning HTTP 429 mid-session. An HTTP 429 is a transport-level failure: many clients treat it as a broken connection and re-run the full initialize handshake. An isError: true tool result leaves the session alive so the agent can read the error message, call other tools, and retry after the specified delay.

server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
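  // The SDK's handler context exposes the transport session ID as extra.sessionId.
  // Calls without one share the 'unknown' bucket, so treat that as a fallback only.
  const sessionId = extra.sessionId ?? 'unknown';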
  const toolName = request.params.name;

  const { allowed, retryAfterMs } = rateLimiter.allow(sessionId, toolName);
  if (!allowed) {
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          error: 'rate_limited',
          tool: toolName,
          message: `Tool '${toolName}' rate limit exceeded. Retry after ${Math.ceil(retryAfterMs / 1000)}s.`,
          retry_after_ms: retryAfterMs,
          retry_after_iso: new Date(Date.now() + retryAfterMs).toISOString(),
          retryable: true,
        }),
      }],
      isError: true,
    };
  }
  // ... normal dispatch
});

Include retry_after_ms and retryable: true in every rate-limit error. These fields are the contract between your server and the caller's retry logic. Without them, callers must guess how long to wait and whether retrying is safe.
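On the caller side, that contract is only useful if the payload is validated before it is trusted. The following is a minimal sketch of such a check, assuming the payload shape shown in the handler above; parseRateLimitError is a hypothetical helper name.

```typescript
// Shape of the rate-limit payload returned by the handler above.
interface RateLimitError {
  error: 'rate_limited';
  tool: string;
  retry_after_ms: number;
  retryable: boolean;
}

// Returns the parsed payload, or null if the text is not a well-formed
// rate-limit error. Callers should treat null as "do not auto-retry".
function parseRateLimitError(text: string): RateLimitError | null {
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return null;

  const p = data as Record<string, unknown>;
  if (p.error !== 'rate_limited') return null;
  if (typeof p.tool !== 'string') return null;
  if (typeof p.retryable !== 'boolean') return null;

  const wait = p.retry_after_ms;
  if (typeof wait !== 'number' || !Number.isFinite(wait) || wait < 0) return null;

  return { error: 'rate_limited', tool: p.tool, retry_after_ms: wait, retryable: p.retryable };
}
```

Rejecting negative or non-finite waits matters because the value is typically passed straight to a timer.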

Per-client throttling: session identity and penalty multipliers

Per-tool limits protect individual operations. Per-client throttling protects fairness between concurrent callers. A server with 100 calls/second of total capacity and no per-client limits will silently give all of that capacity to the first aggressive session, starving every well-behaved client connected at the same time.

The throttler's penalty multiplier is the component most MCP implementations omit. When a client hits its limit repeatedly, which points to a misconfigured retry loop rather than a bursty legitimate task, the penalty slows the refill rate: half speed after three consecutive violations, quarter speed after six, and one-eighth speed from nine onward. This distinguishes a legitimately bursty agent (which waits and then calls less aggressively) from a broken retry handler that keeps hammering the limit at full speed.

// src/rate-limit/client-throttle.ts
interface BucketEntry {
  tokens: number;
  lastRefill: number;
  lastSeen: number;
  penaltyMultiplier: number;
  consecutiveViolations: number;
}

export class ClientThrottler {
  private readonly buckets = new Map<string, BucketEntry>();
  private readonly maxTokens: number;
  private readonly refillRate: number;
  private readonly ttlMs: number;

  constructor(options: {
    maxTokens?: number;
    refillRatePerSecond?: number;
    ttlMs?: number;
  } = {}) {
    this.maxTokens = options.maxTokens ?? 60;
    this.refillRate = options.refillRatePerSecond ?? 1.0;
    this.ttlMs = options.ttlMs ?? 5 * 60 * 1000;
  }

  allow(clientKey: string): { allowed: boolean; remaining: number; penaltyActive: boolean } {
    const now = Date.now();
    let entry = this.buckets.get(clientKey);

    if (!entry) {
      entry = { tokens: this.maxTokens, lastRefill: now, lastSeen: now, penaltyMultiplier: 1, consecutiveViolations: 0 };
      this.buckets.set(clientKey, entry);
    }

    const elapsed = (now - entry.lastRefill) / 1000;
    entry.tokens = Math.min(this.maxTokens, entry.tokens + elapsed * (this.refillRate / entry.penaltyMultiplier));
    entry.lastRefill = now;
    entry.lastSeen = now;

    if (entry.tokens < 1) {
      entry.consecutiveViolations += 1;
      // Escalate: 2× slower refill after 3 violations, 4× after 6, cap at 8×
      entry.penaltyMultiplier = Math.min(8, Math.pow(2, Math.floor(entry.consecutiveViolations / 3)));
      return { allowed: false, remaining: 0, penaltyActive: entry.penaltyMultiplier > 1 };
    }

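    entry.tokens -= 1;
    // A successful call clears the penalty itself, not just the violation counter;
    // otherwise the slower refill would persist until the next violation.
    entry.consecutiveViolations = 0;
    entry.penaltyMultiplier = 1;
    return { allowed: true, remaining: Math.floor(entry.tokens), penaltyActive: false };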
  }

  evictStale(): void {
    const now = Date.now();
    for (const [key, entry] of this.buckets) {
      if (now - entry.lastSeen > this.ttlMs) this.buckets.delete(key);
    }
  }
}

const clientThrottler = new ClientThrottler({ maxTokens: 60, refillRatePerSecond: 1.0 });
setInterval(() => clientThrottler.evictStale(), 5 * 60 * 1000);

Identifying each caller requires a stable key. The session ID assigned by the MCP SDK transport is the best default — it is server-generated, stable for the session lifetime, and not spoofable by the client. For servers with API key auth, use the key prefix as the identity so throttling persists across reconnects from the same caller. Fall back to IP address only when no session or auth identity is available, and be aware that corporate NAT and CDN IPs can make IP-based limiting inaccurate.
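That precedence (auth identity, then session, then IP) can be sketched as a small resolver. The CallerIdentity shape and the 8-character prefix length are assumptions for illustration; how these values are obtained depends on your transport and auth layer.

```typescript
// Hypothetical identity inputs gathered by the transport/auth layer.
interface CallerIdentity {
  apiKey?: string;
  sessionId?: string;
  remoteIp?: string;
}

// Assumed prefix length; pick one that is unique per key in your key format.
const KEY_PREFIX_LENGTH = 8;

// Resolve a throttling key, preferring the most stable identity available.
// Each key is namespaced so a session ID can never collide with an IP or key prefix.
function resolveClientKey(id: CallerIdentity): string {
  if (id.apiKey) return `key:${id.apiKey.slice(0, KEY_PREFIX_LENGTH)}`;
  if (id.sessionId) return `session:${id.sessionId}`;
  if (id.remoteIp) return `ip:${id.remoteIp}`;
  return 'anonymous';
}
```

Callers that resolve to 'anonymous' all share one bucket, which is deliberately the least generous outcome.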

Apply client throttling before per-tool limits in the request handler. If the caller is penalized, return the penaltyActive signal so they know the delay is longer than usual. A caller that hits the limit once and backs off appropriately should see their penalty clear on the next successful call — the consecutiveViolations counter resets to zero after a successful call.
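To see what the escalation means in practice, the multiplier formula from the throttler can be pulled out into a pure function and tabulated. This is only a restatement of that formula under hypothetical helper names (penaltyMultiplierFor, effectiveRefillRate).

```typescript
// Same formula as the throttler: doubles every 3 consecutive violations, capped at 8x.
function penaltyMultiplierFor(consecutiveViolations: number): number {
  return Math.min(8, Math.pow(2, Math.floor(consecutiveViolations / 3)));
}

// Effective refill rate (tokens/sec) once the penalty is applied.
function effectiveRefillRate(baseRate: number, consecutiveViolations: number): number {
  return baseRate / penaltyMultiplierFor(consecutiveViolations);
}

// Tabulate the schedule for a base rate of 1 token/sec.
const penaltySchedule = [0, 2, 3, 6, 9, 30].map(violations => ({
  violations,
  multiplier: penaltyMultiplierFor(violations),
  refillPerSecond: effectiveRefillRate(1.0, violations),
}));
```

At the 8x cap a client with a base rate of 1 token/sec earns one call every 8 seconds, which is slow enough to blunt a broken retry loop without locking the client out entirely.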

Caller backoff guidance: full jitter and the retry_after_ms contract

Rate limiting only works as intended when callers retry intelligently. Callers that retry immediately, or after the same fixed delay, create a thundering herd: if 100 clients all hit a limit at the same time and all retry exactly 2 seconds later, they produce a second synchronized burst that hits the server in unison. Full-jitter exponential backoff breaks the synchronization by randomizing each retry delay across a range.

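// Backoff utility for MCP tool callers

// Minimal structural types for what this helper needs from a client
// (an assumption; substitute the client and result types from your MCP SDK).
interface MCPToolResult {
  content: unknown;
  isError?: boolean;
}

interface MCPClient {
  callTool(request: { name: string; arguments: Record<string, unknown> }): Promise<MCPToolResult>;
}
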
interface BackoffOptions {
  baseMs?: number;
  maxMs?: number;
  maxAttempts?: number;
}

function computeFullJitterBackoff(attempt: number, options: BackoffOptions = {}): number {
  const { baseMs = 200, maxMs = 30_000 } = options;
  // full jitter: random(0, min(cap, base × 2^attempt))
  return Math.random() * Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

async function callToolWithRetry(
  client: MCPClient,
  toolName: string,
  args: Record<string, unknown>,
  options: BackoffOptions = {}
): Promise<MCPToolResult> {
  const { maxAttempts = 5 } = options;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await client.callTool({ name: toolName, arguments: args });
    if (!result.isError) return result;

    let errorPayload: Record<string, unknown> = {};
    try {
      const text = (result.content as Array<{ type: string; text?: string }>)
        .find(c => c.type === 'text')?.text ?? '{}';
      errorPayload = JSON.parse(text);
    } catch { /* non-JSON error — don't retry */ }

    // Only retry if the server explicitly says it's retryable
    if (errorPayload.retryable !== true) {
      throw new Error(`Non-retryable tool error: ${JSON.stringify(errorPayload)}`);
    }

    if (attempt === maxAttempts - 1) {
      throw new Error(`Tool '${toolName}' failed after ${maxAttempts} attempts`);
    }

    // Prefer server-provided retry hint; add jitter to prevent herd
    const serverHint = typeof errorPayload.retry_after_ms === 'number'
      ? errorPayload.retry_after_ms
      : null;
    const waitMs = serverHint !== null
      ? serverHint + Math.random() * 200
      : computeFullJitterBackoff(attempt, options);

    await new Promise(resolve => setTimeout(resolve, waitMs));
  }

  throw new Error('unreachable');
}

Strategy	Formula (d = min(cap, base × 2^n))	Thundering herd protection	Best for
No jitter	`d`	None	Single caller only
Full jitter	`random(0, d)`	Excellent	Multiple concurrent callers (default)
Equal jitter	`d/2 + random(0, d/2)`	Good; min wait guaranteed	When zero-delay retries are undesirable
Decorrelated	`min(cap, random(base, prev × 3))`	Good; each client diverges	Long-lived agent loops with many retries

Use full jitter as the default. The only case to prefer equal jitter is when you know all callers will be released simultaneously (e.g., after a maintenance window) and you want to guarantee a minimum spread before any retry starts.
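For comparison with the full-jitter helper, here is a sketch of the equal-jitter and decorrelated variants. It assumes the common formulation in which equal jitter halves the capped exponential delay and decorrelated jitter draws from the range between the base and three times the previous delay. The function names and the injectable rng parameter are illustrative.

```typescript
// Returns a value in [0, 1), like Math.random. Injectable for deterministic tests.
type Rng = () => number;

// Capped exponential delay with no jitter: min(cap, base * 2^attempt).
function cappedDelay(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Equal jitter: half the capped delay is guaranteed, the other half is random.
function equalJitter(attempt: number, baseMs: number, capMs: number, rng: Rng = Math.random): number {
  const d = cappedDelay(attempt, baseMs, capMs);
  return d / 2 + rng() * (d / 2);
}

// Decorrelated jitter: the next delay depends on the previous delay, not the attempt count.
function decorrelatedJitter(prevMs: number, baseMs: number, capMs: number, rng: Rng = Math.random): number {
  const upper = Math.max(baseMs, prevMs * 3);
  return Math.min(capMs, baseMs + rng() * (upper - baseMs));
}
```

Decorrelated jitter needs the caller to carry the previous delay between attempts, seeded with baseMs on the first retry.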

Also include retry guidance in the tool description itself — LLM clients that receive an isError: true response may choose to retry immediately without consulting your retry-after hint unless the description tells them not to. A one-sentence note like "if this tool returns error: rate_limited, wait for retry_after_ms milliseconds before calling again" is belt-and-suspenders but valuable for preventing model-initiated retry storms.
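One way to keep that note consistent across tools is to append it programmatically. The sketch below assumes a simple ToolDefinition shape (name, description, inputSchema) and a hypothetical withRetryGuidance helper; adapt it to however your server registers tools.

```typescript
// Assumed minimal tool definition shape for illustration.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
}

const RETRY_NOTE =
  'If this tool returns error: rate_limited, wait for retry_after_ms milliseconds before calling again.';

// Append the retry note once; calling this twice does not duplicate it.
function withRetryGuidance(tool: ToolDefinition): ToolDefinition {
  if (tool.description.includes(RETRY_NOTE)) return tool;
  return { ...tool, description: `${tool.description.trim()} ${RETRY_NOTE}` };
}

const searchTool = withRetryGuidance({
  name: 'search_documents',
  description: 'Search indexed documents by keyword.',
  inputSchema: {
    type: 'object',
    properties: { query: { type: 'string' } },
    required: ['query'],
  },
});
```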

Transport-layer DDoS defense: Caddy, Node.js abuse guard, and Cloudflare

In-process rate limits are a second line of defense, not the first. A connection flood that opens 10,000 SSE connections from a single IP will exhaust the server's file descriptors before a single JSON-RPC byte is processed — the per-tool rate limiter never runs. The first defense must happen at the transport layer, before Node.js accepts the connection.

Caddy configuration

# Caddyfile — MCP endpoint with DDoS mitigations
your-mcp-server.com {
  # Block requests with bodies over 1MB (prevents large payload attacks)
  request_body {
    max_size 1MB
  }

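  # Limit requests per client IP: at most 10 events per 1s window
  # Requires the caddy-ratelimit plugin (not part of the standard Caddy build);
  # third-party directives also need an `order` entry in the global options block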
  rate_limit {
    zone mcp_connections {
      key {remote_host}
      events 10
      window 1s
    }
  }

  @mcp_endpoint path /mcp /mcp/*
  handle @mcp_endpoint {
    reverse_proxy localhost:3000 {
      transport http {
        read_buffer 4096
        # 30s timeout for long-running tool calls
        response_header_timeout 30s
      }
    }
  }
}

Node.js abuse guard

Even with reverse proxy size limits, validate tool argument sizes inside the server. The proxy's 1MB global cap is a coarse guard; per-tool validation catches arguments that fit within the global limit but are still unreasonable for a specific tool.

// src/middleware/abuse-guard.ts
const MAX_ARGUMENT_BYTES = 64 * 1024;  // 64KB per tool call
const MAX_STRING_LENGTH = 10_000;       // 10k chars per string field
const MAX_SESSION_DEPTH = 10;           // max in-flight (nested or concurrent) tool calls per session

const sessionDepth = new Map<string, number>();

function validateArgSize(args: Record<string, unknown>): void {
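  // Measure encoded bytes, not UTF-16 code units, so multi-byte text is counted correctly.
  const size = Buffer.byteLength(JSON.stringify(args) ?? '', 'utf8');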
  if (size > MAX_ARGUMENT_BYTES) {
    throw Object.assign(new Error('argument_too_large'), { retryable: false });
  }
  for (const [key, val] of Object.entries(args)) {
    if (typeof val === 'string' && val.length > MAX_STRING_LENGTH) {
      throw Object.assign(new Error(`argument_string_too_long: field '${key}'`), { retryable: false });
    }
  }
}

export function checkAbuse(sessionId: string, args: Record<string, unknown>): void {
  validateArgSize(args);
  const depth = sessionDepth.get(sessionId) ?? 0;
  if (depth >= MAX_SESSION_DEPTH) {
    throw Object.assign(new Error('depth_limit_exceeded'), { retryable: false });
  }
  sessionDepth.set(sessionId, depth + 1);
}

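export function releaseDepth(sessionId: string): void {
  const depth = sessionDepth.get(sessionId) ?? 0;
  // Drop the entry once it reaches zero so idle sessions don't accumulate in the map.
  if (depth <= 1) sessionDepth.delete(sessionId);
  else sessionDepth.set(sessionId, depth - 1);
}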
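The string-length check above only inspects top-level fields, so a long string nested inside an object or array would pass. A recursive walk is one way to close that gap. This is a sketch under assumptions: findOversizedString and the MAX_NESTING limit are hypothetical, and the string limit mirrors the constant used above.

```typescript
const STRING_LIMIT = 10_000; // same limit as MAX_STRING_LENGTH above
const MAX_NESTING = 8;       // assumed cap on argument nesting depth

// Returns the path of the first violation (an oversized string, or an object
// nested deeper than MAX_NESTING), or null if the value passes.
function findOversizedString(value: unknown, path = 'args', depth = 0): string | null {
  if (typeof value === 'string') {
    return value.length > STRING_LIMIT ? path : null;
  }
  if (value === null || typeof value !== 'object') return null;
  if (depth >= MAX_NESTING) return `${path} (nesting too deep)`;

  // Object.entries also walks arrays, using the index as the key.
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    const hit = findOversizedString(child, `${path}.${key}`, depth + 1);
    if (hit !== null) return hit;
  }
  return null;
}
```

Returning the path rather than a boolean lets the error message name the offending field, as the top-level check already does.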

Global concurrency cap

Orthogonal to rate limiting, a concurrency cap limits how many tool calls are running simultaneously. This is the last line of defense against a burst of calls that each individually stay within rate limits but collectively saturate the server's thread pool or event loop.

export class ConcurrencyGuard {
  private active = 0;
  constructor(private readonly max = 50) {}

  async acquire(): Promise<() => void> {
    if (this.active >= this.max) {
      throw Object.assign(
        new Error(`server_overloaded: ${this.active}/${this.max} concurrent calls`),
        { retryable: true, retryAfterMs: 2000 }
      );
    }
    this.active++;
    return () => { this.active--; };
  }
}

const concurrencyGuard = new ConcurrencyGuard(50);
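A concurrency cap only helps if every acquired slot is eventually released, including when the tool handler throws. The sketch below is a variation on the guard above, not a replacement for it: SlotPool and runWithSlot are hypothetical names, acquisition is synchronous, and the release function is idempotent so a double release cannot drive the counter negative.

```typescript
class SlotPool {
  private active = 0;
  constructor(private readonly max = 50) {}

  get inUse(): number {
    return this.active;
  }

  // Returns a release function, or null when the pool is full.
  tryAcquire(): (() => void) | null {
    if (this.active >= this.max) return null;
    this.active++;
    let released = false;
    return () => {
      if (released) return; // ignore repeated releases
      released = true;
      this.active--;
    };
  }
}

// Run `work` inside a slot and always release it, even if `work` rejects.
async function runWithSlot<T>(pool: SlotPool, work: () => Promise<T>): Promise<T> {
  const release = pool.tryAcquire();
  if (release === null) {
    throw Object.assign(new Error('server_overloaded'), { retryable: true, retryAfterMs: 2000 });
  }
  try {
    return await work();
  } finally {
    release();
  }
}
```

The handler would catch the server_overloaded error and convert it to an isError: true result carrying the retry hint.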

Cloudflare WAF rules

If your server is behind Cloudflare, WAF rules absorb volumetric attacks before they reach your origin. Three rules cover the MCP-specific surface:

# Rule 1: Challenge IPs making >60 requests/min to the MCP endpoint
# Condition: http.request.uri.path starts_with "/mcp"
#   AND rate limit > 60/60s per IP
# Action: JS Challenge

# Rule 2: Block oversized request bodies to the MCP endpoint
# Condition: http.request.uri.path starts_with "/mcp"
#   AND request body size > 1048576 bytes
#   (compare numerically; a string comparison against "1048576" is lexicographic)
# Action: Block

# Rule 3: Allow AliveMCP probe by User-Agent (bypass challenge rules)
# Condition: http.user_agent contains "AliveMCP-Probe"
# Action: Allow

The third rule is critical: without it, Cloudflare's JS Challenge will block the AliveMCP probe that verifies your server is answering. A Cloudflare rule that challenges all non-human traffic is a common cause of false uptime alerts — the CDN intercepts the probe, the probe times out, and AliveMCP fires a down alert for a server that is perfectly healthy behind the CDN.

Quota management: daily budgets and cost-weighted limits

Rate limits and quotas solve different problems. A rate limit says "no more than 10 calls per second." A quota says "no more than 1,000 calls per day." Both are necessary for multi-tenant or tiered MCP servers: the rate limit prevents burst abuse, the quota prevents sustained overuse across a billing period. Rate limits live in memory and reset continuously; quotas live in a persistent database and reset at a period boundary.

Property	Rate limit	Quota
Time window	Seconds or minutes	Hours, days, months
Purpose	Burst control, server protection	Cost control, plan enforcement
Reset behavior	Continuous (sliding window / token refill)	Hard reset at period boundary
State storage	In-memory	Persistent database
Error to return	`rate_limited` with retry hint	`quota_exhausted` with reset timestamp

// src/quota/quota-manager.ts
import Database from 'better-sqlite3';

interface QuotaConfig {
  dailyCallLimit: number;
}

const PLAN_QUOTAS: Record<string, QuotaConfig> = {
  free:       { dailyCallLimit: 100 },
  starter:    { dailyCallLimit: 2_000 },
  team:       { dailyCallLimit: 10_000 },
  enterprise: { dailyCallLimit: Infinity },
};

export class QuotaManager {
  constructor(private readonly db: Database.Database) {}

  check(userId: string, userPlan: string): { allowed: boolean; remaining: number; resetsAt: string } {
    const today = new Date().toISOString().slice(0, 10);
    const limit = PLAN_QUOTAS[userPlan]?.dailyCallLimit ?? 100;

    const row = this.db.prepare(
      `SELECT call_count FROM quota_usage WHERE user_id = ? AND period_start = ?`
    ).get(userId, today) as { call_count: number } | undefined;

    const current = row?.call_count ?? 0;
    if (current >= limit) {
      const tomorrow = new Date();
      tomorrow.setUTCDate(tomorrow.getUTCDate() + 1);
      tomorrow.setUTCHours(0, 0, 0, 0);
      return { allowed: false, remaining: 0, resetsAt: tomorrow.toISOString() };
    }

    return {
      allowed: true,
      remaining: limit - current,
      resetsAt: new Date(new Date().setUTCHours(24, 0, 0, 0)).toISOString(),
    };
  }

  increment(userId: string, costUnits = 1.0): void {
    const today = new Date().toISOString().slice(0, 10);
    this.db.prepare(`
      INSERT INTO quota_usage (user_id, period_start, call_count, cost_units, updated_at)
      VALUES (?, ?, 1, ?, datetime('now'))
      ON CONFLICT(user_id, period_start) DO UPDATE SET
        call_count = call_count + 1,
        cost_units = cost_units + excluded.cost_units,
        updated_at = datetime('now')
    `).run(userId, today, costUnits);
  }
}
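The QuotaManager above relies on a quota_usage table whose (user_id, period_start) pair is unique, since the ON CONFLICT clause needs that constraint. The original does not show the schema, so the following is an assumed sketch inferred from the columns the queries use, along with two pure helpers for the UTC period boundary.

```typescript
// Assumed schema, inferred from the columns used by check() and increment().
const QUOTA_USAGE_SCHEMA = `
  CREATE TABLE IF NOT EXISTS quota_usage (
    user_id      TEXT    NOT NULL,
    period_start TEXT    NOT NULL,   -- UTC date, YYYY-MM-DD
    call_count   INTEGER NOT NULL DEFAULT 0,
    cost_units   REAL    NOT NULL DEFAULT 0,
    updated_at   TEXT    NOT NULL,
    PRIMARY KEY (user_id, period_start)
  );
`;

// The daily period key: the UTC calendar date of `now`.
function periodKey(now: Date): string {
  return now.toISOString().slice(0, 10);
}

// ISO timestamp of the next UTC midnight after `now`.
function nextResetIso(now: Date): string {
  const next = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return new Date(next).toISOString();
}
```

With better-sqlite3, the schema string can be applied once at startup via db.exec(QUOTA_USAGE_SCHEMA). Passing the clock in as a parameter keeps the boundary logic testable around midnight and month ends.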

Cost-weighted quotas are the natural extension. Not all tool calls cost the same: a database lookup costs 1 unit, while an LLM-calling tool that spends $0.01 in inference costs 20. Cost weighting lets you set a single per-period budget in "cost units" rather than a call count, giving expensive operations appropriate weight without writing separate quotas for every tool. To enforce it, pass the tool's cost to increment() and compare the stored cost_units, rather than call_count, against the budget in check().

const TOOL_COSTS: Record<string, number> = {
  search_documents:    1,
  read_file:           1,
  run_sql_query:       2,
  call_external_api:   5,
  generate_with_llm:   20,
};

Charge quota only after a successful tool execution, not on rate-limited or validation-failed calls. A server error that is not the user's fault shouldn't count against their daily budget. When quota is exhausted, return quota_exhausted with the resets_at timestamp. Unlike a rate-limit error, this is not retryable, because retrying before the UTC midnight reset won't help. Include an upgrade URL if you have paid tiers.
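The charge-after-success rule and the cost table combine into a small amount of pure logic. The sketch below uses hypothetical helpers (costOf, canAfford, settle) and assumes that tools missing from the table cost 1 unit.

```typescript
const COST_TABLE: Record<string, number> = {
  search_documents: 1,
  read_file: 1,
  run_sql_query: 2,
  call_external_api: 5,
  generate_with_llm: 20,
};

const DEFAULT_COST = 1; // assumption: unlisted tools cost one unit

function costOf(toolName: string): number {
  return COST_TABLE[toolName] ?? DEFAULT_COST;
}

// Pre-check: would this call fit in the remaining budget?
function canAfford(usedUnits: number, budgetUnits: number, toolName: string): boolean {
  return usedUnits + costOf(toolName) <= budgetUnits;
}

// Post-call: charge only when the tool actually succeeded.
function settle(usedUnits: number, toolName: string, succeeded: boolean): number {
  return succeeded ? usedUnits + costOf(toolName) : usedUnits;
}
```

Checking affordability before the call and charging after it means an expensive tool is refused up front when the budget cannot cover it, while a failed call costs the user nothing.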

The quota check is slightly slower than the rate limit check because it hits SQLite rather than in-memory state. Apply rate limiting first (fast in-memory check) and only reach the quota check if the rate limit passes. This ordering minimizes database load under high traffic.

Which layer defends against what

The six layers compose into a single defense stack. A real request travels through them in order; a single layer's rejection is enough to stop it.

Layer	What it defends against	State	Where
Cloudflare WAF	Volumetric HTTP floods, oversized payloads at CDN edge	CDN	Before your origin
Caddy/nginx	Connection floods per IP, large request bodies, slow-write attacks	Reverse proxy	Before Node.js
Concurrency guard	Simultaneous call saturation — too many calls running at once	In-memory (global)	First check in handler
Client throttler	Single aggressive caller starving all others — fairness	In-memory per-identity	Second check in handler
Per-tool rate limiter	Tool-specific burst and rate — read vs write asymmetry	In-memory per-session-per-tool	Third check in handler
Quota manager	Sustained overuse across billing periods — plan enforcement	SQLite persistent	Fourth check (after rate limit passes)
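The in-handler ordering in the table can be expressed as a list of checks where the first rejection wins and later checks never run. This is a sketch with hypothetical types (CheckResult, DefenseCheck) and a hypothetical runChecks helper; each run function would wrap one of the limiters described earlier.

```typescript
type CheckResult =
  | { ok: true }
  | { ok: false; error: string; retryAfterMs?: number };

type PipelineResult =
  | { ok: true }
  | { ok: false; error: string; retryAfterMs?: number; rejectedBy: string };

interface DefenseCheck {
  name: string;
  run: () => CheckResult;
}

// Run checks in order and stop at the first rejection, so cheap in-memory
// checks shield the more expensive ones (such as the SQLite quota lookup).
function runChecks(checks: DefenseCheck[]): PipelineResult {
  for (const check of checks) {
    const result = check.run();
    if (!result.ok) return { ...result, rejectedBy: check.name };
  }
  return { ok: true };
}
```

Recording which check rejected the call in rejectedBy gives the structured logs a consistent field to aggregate on.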

The abuse guard (argument size validation + tool call depth limit) runs alongside the concurrency guard. It defends against a different threat class — argument-based attacks and prompt injection leading to recursive tool calls — rather than volume-based attacks.

Errors returned at each layer carry distinct semantics. Cloudflare and Caddy return HTTP errors before any MCP session is established. The concurrency guard returns server_overloaded with a short retry_after_ms. The client throttler returns client_rate_limited with penaltyActive when escalation is active. The per-tool limiter returns rate_limited with the exact retry_after_ms until the bucket refills. The quota manager returns quota_exhausted with resets_at and no retry hint. A caller that parses these error types can handle each appropriately — retrying with backoff for rate limits, waiting for midnight for quota exhaustion, not retrying at all for argument validation failures.
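A caller can turn those error codes into a retry decision with a single switch. The mapping below is a sketch using the error names from this guide; RetryAction and retryActionFor are hypothetical, and any unrecognized code is treated as non-retryable.

```typescript
type RetryAction = 'retry_with_backoff' | 'wait_for_reset' | 'do_not_retry';

// Map a defense-layer error code to what the caller should do next.
function retryActionFor(errorCode: string): RetryAction {
  switch (errorCode) {
    case 'server_overloaded':
    case 'client_rate_limited':
    case 'rate_limited':
      return 'retry_with_backoff'; // transient: honor retry_after_ms plus jitter
    case 'quota_exhausted':
      return 'wait_for_reset';     // only resets_at (or an upgrade) helps
    default:
      return 'do_not_retry';       // validation failures and unknown errors
  }
}
```

Defaulting unknown codes to do_not_retry keeps a new server-side error type from triggering an accidental retry loop in older clients.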

AliveMCP integration: when defenses are rejecting legitimate calls

Rate limiting introduces a new silent failure mode: a defense tuned too aggressively starts rejecting legitimate traffic, but it does so through your normal error path — isError: true responses that look like any other tool failure to the caller. The server is "up" by every standard health check definition, and the AliveMCP probe (which only sends initialize + tools/list, not tools/call) will also show it as green. The callers getting rate-limited are the only signal, and if you're not collecting rate-limit hit rates, that signal is invisible.

The integration point is structured logging. Log every defense trigger with a consistent event type so you can aggregate hit rates and alert when they exceed acceptable thresholds.

// Structured log events for rate limiting
const defenseEvents = {
  rateLimited: (tool: string, sessionId: string, retryAfterMs: number) =>
    console.log(JSON.stringify({
      event: 'rate_limit_hit', tool, sessionId, retry_after_ms: retryAfterMs, ts: new Date().toISOString()
    })),
  clientThrottled: (clientKey: string, penaltyActive: boolean) =>
    console.log(JSON.stringify({
      event: 'client_throttled', clientKey, penalty_active: penaltyActive, ts: new Date().toISOString()
    })),
  quotaExhausted: (userId: string, plan: string) =>
    console.log(JSON.stringify({
      event: 'quota_exhausted', userId, plan, ts: new Date().toISOString()
    })),
  serverOverloaded: (active: number, max: number) =>
    console.log(JSON.stringify({
      event: 'concurrency_cap_hit', active, max, ts: new Date().toISOString()
    })),
};

Alert when the rate-limit hit rate on any read tool exceeds 5% of total calls for that tool in a 5-minute window; that signals a limit that is too tight for the actual workload. Alert when the hit rate on a destructive tool exceeds 20%; that signals either a runaway agent or a deliberate abuse attempt. A zero hit rate on a destructive tool's limit is also informative: it might mean the tool is never used, or it might mean the limit is set so high that it provides no protection.
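Those thresholds reduce to a simple comparison once the log events have been aggregated per tool and window. The sketch below assumes that aggregation has already happened; WindowStats, hitRate, and shouldAlert are hypothetical names.

```typescript
type ToolCategory = 'read' | 'destructive';

// Aggregated counts for one tool over one 5-minute window.
interface WindowStats {
  category: ToolCategory;
  totalCalls: number;
  rateLimitHits: number;
}

// Thresholds from the text: 5% for read tools, 20% for destructive tools.
const ALERT_THRESHOLDS: Record<ToolCategory, number> = {
  read: 0.05,
  destructive: 0.2,
};

function hitRate(stats: WindowStats): number {
  // An idle window has no hit rate to speak of; report 0 rather than NaN.
  return stats.totalCalls === 0 ? 0 : stats.rateLimitHits / stats.totalCalls;
}

function shouldAlert(stats: WindowStats): boolean {
  return hitRate(stats) > ALERT_THRESHOLDS[stats.category];
}
```

In practice you would likely also require a minimum call count per window, so that one rejected call out of three does not page anyone.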

AliveMCP's external probe runs every 60 seconds and sends a real initialize handshake followed by tools/list. It will not trigger your per-tool rate limits (probes don't call tools). But it will catch the cases where defensive layers are breaking the protocol itself: a Cloudflare WAF rule that blocks the AliveMCP probe's IP, a Caddy rate limit that triggers on repeated probes from the same monitoring IP, or a concurrency cap applied to every request rather than only to tools/call, so that in-progress tool calls cause the probe's tools/list to be rejected or to time out. These are real misconfiguration failure modes in production rate-limited servers, and the external probe surfaces them before a user does.

The recommended monitoring configuration for a rate-limited MCP server: one AliveMCP monitor on the MCP endpoint (protocol probe), one HTTP monitor on /health (basic availability), and structured log alerting on the rate-limit hit rate thresholds above. The three together cover all three failure classes: transport down, protocol broken, and defenses misconfigured.

Implementation order

If you're starting from an unprotected server, implement the layers in this order. Each step is independently deployable and provides protection without requiring the next step to be in place.

1. Transport limits first. Add Caddy's request_body { max_size 1MB } and a per-IP rate limit. This is the highest-leverage change: it stops most abuse before a single line of your application code runs.
2. Concurrency guard. Add a 50-slot global concurrency cap. This prevents the server from being overwhelmed by a burst of simultaneous long-running tool calls. Without it, a flood of requests that each individually stay within rate limits can still exhaust the event loop.
3. Per-tool rate limiter. Add PerToolRateLimiter with conservative limits (start tight, loosen based on observed hit rates). Include retry_after_ms in every rate-limit error.
4. Per-client throttler. Add ClientThrottler with penalty escalation. This is particularly important when you have multiple concurrent callers.
5. Structured logging. Add defense event logging and set up hit rate alerts. You can't tune limits you can't measure.
6. Quota management. Add QuotaManager with SQLite-backed daily limits when you're ready to tie usage to plan tiers.
7. Cloudflare WAF rules. Add CDN-level rules last, after you understand your traffic patterns and have confirmed the monitoring probe is in the Allow list.