MCP server performance

MCP server performance is more than latency. It includes how quickly the server handles simultaneous agent sessions, how large the tools/list payload grows as you add more tools, how cold starts interact with user-perceived response time, and whether the server degrades gracefully under resource pressure. Here's the full picture, with concrete thresholds and optimization paths.

TL;DR

The most common MCP performance bottleneck is a large tools/list payload: 50+ tools with verbose descriptions and nested JSON schemas add 50–200KB to every agent session initialization. Keep individual tool descriptions under 200 characters, keep input schemas flat where possible, and keep the total tools/list response under 30KB. For concurrency, MCP servers that run on single-threaded Node.js or Python need explicit connection limits to prevent request queuing that looks like high latency. AliveMCP's probes detect both payload growth (schema hash change + tools count increase) and latency trends (30-day baseline tracking) before they become user-visible problems.

The tool payload problem

Every AI agent that connects to your MCP server calls tools/list early in the session. The response contains a definition for every tool you expose: its name, its description, and its inputSchema (a JSON Schema object). The LLM reads this list to understand what capabilities are available and how to call each tool.

The problem: as you add tools, the payload grows, and the LLM has to process all of it. With fewer than 20 tools and concise descriptions, the tools/list response is typically 5–15KB, which is fast to transfer and fast for the LLM to process. At 50+ tools with verbose descriptions and deeply nested schemas, the response can exceed 100KB. This creates two performance problems: session initialization slows down because more bytes have to be transferred and parsed, and the tool definitions take up more of the LLM's context budget.

The 50-tool inflection point is where these problems become visible in production benchmarks. Below 50 tools with average descriptions, most modern MCP clients handle the payload without user-visible impact. Above 50, consider splitting the server into domain-specific sub-servers, implementing capability-based tool filtering (return only the tools relevant to the current user's permissions or context), or paginating tools/list (the MCP specification defines cursor-based pagination for it) if your client supports it.
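As an illustration of capability-based tool filtering, here is a minimal Python sketch. It assumes each tool in a server-side catalog carries a "scope" tag and that the caller's granted scopes are already known; the catalog, the scope names, and the visible_tools helper are all hypothetical and not part of MCP or any SDK.

```python
from typing import Any

# Hypothetical catalog. The "scope" key is server-side metadata invented for
# this sketch; it is not part of the MCP tool definition and is stripped
# before the tools/list response is built.
CATALOG: list[dict[str, Any]] = [
    {"name": "read_file", "description": "Read a text file.",
     "inputSchema": {"type": "object"}, "scope": "files"},
    {"name": "run_query", "description": "Run a read-only SQL query.",
     "inputSchema": {"type": "object"}, "scope": "database"},
    {"name": "drop_table", "description": "Drop a table.",
     "inputSchema": {"type": "object"}, "scope": "database-admin"},
]


def visible_tools(catalog: list[dict[str, Any]],
                  granted_scopes: set[str]) -> list[dict[str, Any]]:
    """Return only the tools whose scope is granted, without the scope tag."""
    return [
        {key: value for key, value in tool.items() if key != "scope"}
        for tool in catalog
        if tool["scope"] in granted_scopes
    ]


if __name__ == "__main__":
    for tool in visible_tools(CATALOG, {"files", "database"}):
        print(tool["name"])
```

A tools/list handler would return visible_tools(...) instead of the full catalog, so a session only pays for the definitions it can actually use.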

Tool schema design for performance

Individual tool schema design has a larger effect on payload size than raw tool count. Several common anti-patterns significantly inflate schema size:

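To measure your current payload size in bytes, send a JSON-RPC tools/list request to your server's endpoint and pipe the response through wc -c, for example: curl -s -X POST https://your-server/mcp -H 'Content-Type: application/json' -H 'Accept: application/json, text/event-stream' -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | wc -c (servers that require a session need an initialize call first). If the result is above 30,000 bytes (roughly 30KB), optimize descriptions before optimizing anything else. It's the cheapest fix with the largest impact.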
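The same check can be scripted against the tool definitions themselves. The sketch below is one possible audit, assuming you can load your tools as a list of dicts with name, description, and inputSchema keys; audit_tools and schema_depth are hypothetical helper names, and the byte count covers only a compact {"tools": [...]} serialization, so the real response with its JSON-RPC envelope will be slightly larger.

```python
import json
from typing import Any

MAX_DESCRIPTION_CHARS = 200  # per-tool description budget from this guide
MAX_PAYLOAD_BYTES = 30_000   # total tools/list budget from this guide


def schema_depth(node: Any) -> int:
    """Nesting depth of a JSON value: scalars are 0, each dict or list adds 1."""
    if isinstance(node, dict):
        return 1 + max((schema_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((schema_depth(v) for v in node), default=0)
    return 0


def audit_tools(tools: list[dict[str, Any]]) -> dict[str, Any]:
    """Report payload size, over-long descriptions, and the deepest schema."""
    payload = json.dumps({"tools": tools}, separators=(",", ":")).encode("utf-8")
    return {
        "payload_bytes": len(payload),
        "over_budget": len(payload) > MAX_PAYLOAD_BYTES,
        "long_descriptions": [
            tool["name"] for tool in tools
            if len(tool.get("description", "")) > MAX_DESCRIPTION_CHARS
        ],
        "deepest_schema": max(
            (schema_depth(tool.get("inputSchema", {})) for tool in tools),
            default=0,
        ),
    }
```

Running this in CI on every change to the tool catalog is one way to notice payload growth before it reaches production.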

Concurrency and connection handling

MCP servers often handle multiple simultaneous agent sessions. Concurrency problems look like latency problems in a probe — the probe's request queues behind live traffic and the round-trip time spikes — but the root cause is different.

Single-threaded Node.js and Python servers

A Node.js or Python MCP server running as a single process effectively handles requests serially whenever a handler does CPU-bound or blocking work (JSON parsing of a large payload, synchronous file reads, blocking database queries). A long-running synchronous tool call blocks the event loop, causing all other in-flight requests (including your monitoring probe) to queue behind it.

Fixes: use async/await throughout and never call blocking APIs on the event loop, move CPU-heavy tool work to worker threads (or worker processes in Python, where the GIL limits CPU-bound threads), and configure a maximum request queue depth so requests fail fast rather than queuing indefinitely under load.
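A minimal Python sketch of two of these fixes, assuming an asyncio-based server: blocking work is pushed to a worker thread with asyncio.to_thread, and requests beyond a fixed in-flight limit are rejected immediately instead of queued. BoundedRunner, Overloaded, and the limit of 8 are hypothetical names and values chosen for illustration.

```python
import asyncio
from typing import Any, Callable

MAX_IN_FLIGHT = 8  # hypothetical limit; tune it for your own server


class Overloaded(Exception):
    """Raised instead of queuing when the server is already at capacity."""


class BoundedRunner:
    """Runs blocking callables off the event loop with a fail-fast limit."""

    def __init__(self, limit: int = MAX_IN_FLIGHT) -> None:
        self._limit = limit
        self._in_flight = 0

    async def run(self, func: Callable[..., Any], *args: Any) -> Any:
        # Fail fast: reject right away rather than queue behind existing work.
        if self._in_flight >= self._limit:
            raise Overloaded(f"{self._in_flight} requests already in flight")
        self._in_flight += 1
        try:
            # The blocking call runs on a worker thread, so the event loop
            # stays free to serve other requests (including probes).
            return await asyncio.to_thread(func, *args)
        finally:
            self._in_flight -= 1
```

A handler would catch Overloaded and return an error the client can retry. Threads suit blocking I/O; for pure-Python CPU work a process pool is the usual alternative, since threads share the GIL.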

Serverless concurrency limits

Lambda, Cloud Run, and Vercel Functions have default concurrency limits. Lambda's default is 1,000 concurrent executions per account per Region; Cloud Run defaults to 80 concurrent requests per instance. When agent traffic bursts above these limits, the platform throttles with 429 or 503 responses. This shows up in probe logs as periodic HTTP-layer failures that cluster in business hours, when agents are active, and that recover without intervention.

Fixes: raise the limit (a Lambda concurrency quota increase, or a higher Cloud Run max-instances setting), reserve capacity for the function (Lambda reserved concurrency), add a queue in front of the MCP server to absorb bursts, or use horizontal scaling with a load balancer. See AWS MCP monitoring and GCP MCP monitoring for platform-specific concurrency limit details.

Probe interference

Monitoring probes add load to your server. AliveMCP's 60-second probe cadence is one initialize + one tools/list call per minute — equivalent to one light user session per minute. For most servers, this is negligible. For a heavily constrained serverless function (128MB RAM, 1 vCPU), a monitoring probe can consume a meaningful fraction of available capacity during cold starts.

If you observe that your monitoring probe's latency is consistently higher than the latency your real users experience (as measured by client-side telemetry), the probe may be landing on cold or resource-constrained instances, while live traffic mostly reaches instances that earlier requests have already warmed up. This is usually only a problem for very low-resource serverless deployments.

Resource sizing for MCP workloads

MCP tool calls often involve external I/O: database queries, web fetches, file operations, calls to downstream APIs. The right resource sizing depends on your tool implementation, but some rough guidelines apply:

Performance monitoring vs uptime monitoring

Uptime monitoring answers "is the server alive?" Performance monitoring answers "is the server usable?" The two require different signals:

A server can be 100% "up" by uptime metrics while its p95 latency trends upward 10% per week for 3 months — which is exactly what happens when a growing tool schema is never trimmed. Performance monitoring catches this drift before it becomes user-visible. AliveMCP's Author tier tracks latency trends as part of every probe; Team tier ($49/mo) adds 30-day baseline alerts that fire when p95 exceeds 3× the rolling baseline. See MCP server latency for the full alerting setup.
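To make the drift arithmetic concrete: 10% weekly growth compounds, so after 13 weeks (about 3 months) p95 is roughly 1.1^13 ≈ 3.45× its starting value. The Python sketch below works that out and shows a simple 3× threshold check. It is a simplification that compares against a fixed starting baseline, whereas a rolling baseline moves along with the drift; the function names are hypothetical and this is not AliveMCP's implementation.

```python
def growth_multiple(weekly_growth: float, weeks: int) -> float:
    """Latency multiple after compounding weekly_growth for the given weeks."""
    return (1.0 + weekly_growth) ** weeks


def weeks_until_multiple(weekly_growth: float, factor: float = 3.0) -> int:
    """First week in which compounding growth exceeds factor x the start."""
    if weekly_growth <= 0:
        raise ValueError("weekly_growth must be positive")
    weeks, multiple = 0, 1.0
    while multiple <= factor:
        multiple *= 1.0 + weekly_growth
        weeks += 1
    return weeks


def exceeds_baseline(current_p95_ms: float, baseline_p95_ms: float,
                     factor: float = 3.0) -> bool:
    """True when the current p95 is above factor x the reference baseline."""
    return current_p95_ms > factor * baseline_p95_ms


if __name__ == "__main__":
    print(f"multiple after 13 weeks: {growth_multiple(0.10, 13):.2f}")
    print(f"weeks to pass 3x: {weeks_until_multiple(0.10)}")
```

Under these assumptions a fixed-baseline 3× check would not trip until around week 12, which is why trend tracking matters in addition to a threshold.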

Related questions

How many tools is too many for an MCP server?

Above 50 tools with average-length descriptions and schemas, payload size consistently pushes above 30KB and user-visible session initialization latency starts to degrade. Above 100 tools, the tools/list response approaches 100KB+ and becomes a genuine bottleneck on both latency and LLM context budget. If you're building a broad-surface MCP server, split into domain-specific sub-servers (each <30 tools) and let the agent router pick the right one for the task.

Why does my server's latency vary so much between AliveMCP probes and my users' client-side measurements?

The most common reason is geographic distance: if your server is in us-east-1 and your users are in Europe, their client latency is 120–200ms higher than that of a probe originating in us-east-1. The second most common reason: client-side sessions reuse connections (HTTP keep-alive), so they skip TCP connect + TLS handshake on subsequent calls within a session. AliveMCP probes open a fresh connection each time (more conservative, matches a new agent session). Both are valid measurements of different things: probe latency measures new-session startup cost, and client latency measures ongoing session cost.

How do I benchmark my MCP server's performance before launch?

Run 100 sequential initialize + tools/list pairs and calculate p50/p95/p99. Then run 10 concurrent sessions and repeat — check whether p95 degrades under concurrency. Measure the tools/list payload size (wc -c) and check that it's under 30KB. Finally, kill the server process and cold-start it 3 times to get a cold-start latency distribution. These four measurements give you a pre-launch performance baseline. Once live, AliveMCP's continuous probing maintains that baseline automatically.
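The sequential and concurrent runs can share one small harness. The Python sketch below assumes you supply a coroutine, here called session, that performs one initialize + tools/list pair against your server; benchmark, timed, and percentiles are hypothetical helper names, and the stub in the usage example only sleeps.

```python
import asyncio
import time
from statistics import quantiles
from typing import Awaitable, Callable


def percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from the 99 cut points returned by statistics.quantiles."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


async def timed(session: Callable[[], Awaitable[None]]) -> float:
    """Run one session and return its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    await session()
    return (time.perf_counter() - start) * 1000.0


async def benchmark(session: Callable[[], Awaitable[None]],
                    runs: int = 100, concurrency: int = 1) -> list[float]:
    """Run `runs` sessions with at most `concurrency` in flight at once."""
    gate = asyncio.Semaphore(concurrency)

    async def one() -> float:
        async with gate:
            return await timed(session)

    return list(await asyncio.gather(*(one() for _ in range(runs))))


if __name__ == "__main__":
    async def fake_session() -> None:
        # Stand-in for a real initialize + tools/list round trip.
        await asyncio.sleep(0.005)

    sequential = asyncio.run(benchmark(fake_session, runs=100, concurrency=1))
    concurrent = asyncio.run(benchmark(fake_session, runs=100, concurrency=10))
    print("sequential:", percentiles(sequential))
    print("concurrent:", percentiles(concurrent))
```

With concurrency=1 the semaphore forces sessions to run one at a time, so the same function covers both the sequential baseline and the 10-session concurrency check; comparing the two p95 values shows whether latency degrades under load.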

Does AliveMCP measure tool call latency, or just initialize and tools/list?

Standard probes cover initialize and tools/list — the session establishment and capability discovery phases. Tool call latency is not probed by default because tool calls have side effects (they run actual operations) and their latency depends entirely on what the tool does (a file read is <1ms; a web scrape is 2–10 seconds). For tool call performance monitoring, instrument your server directly with OpenTelemetry traces and export to your observability backend.

Further reading