MCP server response time
Response time for an MCP server is not a single number. It's a distribution — and the tail (p95, p99) matters far more than the median, because a slow tool call blocks an entire agent turn while the user waits.
TL;DR
Track three latency buckets for every MCP server: time-to-initialize (how long the protocol handshake takes), time-to-tools-list (how quickly the server returns its capability schema), and tool call round-trip (for your most critical tool, if you run synthetic calls). Alert on p95 > 3× your 7-day baseline for three consecutive probes — a single spike is noise; a sustained tail regression is a real problem. AliveMCP tracks p50 and p95 for every public endpoint it monitors, free on the public dashboard. Join the waitlist to add private endpoints with custom alert thresholds.
Why response time is especially critical for MCP servers
A web page can tolerate a 1.2-second load time without losing every user. An MCP tool call cannot — when an agent invokes a tool, the user's conversation is on hold. The LLM has generated a tool call, the agent framework is waiting for the result, and the user sees a spinner or nothing at all. Slow tools are invisible failures: the agent usually doesn't surface latency to the user, so a 4-second response reads as "the product is broken" with no explanation.
The practical implication: your p95 response time budget for agent-facing MCPs is roughly 2 seconds end-to-end. That's the threshold at which most agent frameworks start showing users that something is taking a while. As rough bands, under 200ms is excellent, under 500ms is good, and anything over 3 seconds needs fixing.
What to measure
Time-to-initialize (T2I)
Start the clock when you send the initialize JSON-RPC request. Stop when you receive a valid response with protocolVersion and capabilities. This measures the handshake overhead — cold-start costs, auth round trips, and server boot time all show up here. Most healthy production servers respond under 200ms. If your T2I is consistently over 600ms, investigate cold starts and auth cache warmth.
Time-to-tools-list (T2TL)
Immediately after initialize, send tools/list and time the response. This is the schema-discovery cost that every new agent session pays. A server that takes 800ms to return its tool list is adding nearly a second to every conversation cold start. T2TL should be under 150ms, because the server is usually reading a static registry, not executing business logic.
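The two handshake timings above can be sketched as a single probe. This is a minimal illustration, not AliveMCP's implementation: the `send` transport is a hypothetical callable you would supply (HTTP, stdio, or a test double), the `protocolVersion` string and `clientInfo` values are assumptions to adjust for your server, and the `notifications/initialized` message is included on the assumption that your server expects it before serving `tools/list`.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical transport: takes a JSON-RPC message and returns the parsed
# response dict, or None when the message is a notification.
Send = Callable[[dict], Optional[dict]]


def timed_request(send: Send, message: dict) -> Tuple[dict, float]:
    """Send one JSON-RPC request; return (response, elapsed milliseconds)."""
    start = time.perf_counter()
    response = send(message)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if response is None or "error" in response:
        raise RuntimeError(f"probe failed for {message['method']}: {response}")
    return response, elapsed_ms


def probe_handshake(send: Send) -> dict:
    """Measure T2I and T2TL for a single probe."""
    init_response, t2i_ms = timed_request(send, {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumption: match your server
            "capabilities": {},
            "clientInfo": {"name": "latency-probe", "version": "0.1.0"},
        },
    })
    # T2I only counts if the response carries the fields named in the text.
    result = init_response.get("result", {})
    if "protocolVersion" not in result or "capabilities" not in result:
        raise RuntimeError("initialize response missing protocolVersion or capabilities")

    # Notification (no id, no response expected); deliberately not timed.
    send({"jsonrpc": "2.0", "method": "notifications/initialized"})

    _, t2tl_ms = timed_request(send, {"jsonrpc": "2.0", "id": 2, "method": "tools/list"})
    return {"t2i_ms": t2i_ms, "t2tl_ms": t2tl_ms}
```

Injecting the transport keeps the timing logic independent of how the server is reached, so the same function can run against a real endpoint or a stub in tests.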
Tool call round-trip (TCRT)
For critical tools, run a synthetic call with valid arguments that trigger a fast, idempotent path. Time from call to result. This is the number that matters most to users, but it's tool-specific and depends on what the tool does. An LLM-backed tool will be slower than a lookup. Set per-tool baselines and alert on regression, not on absolute thresholds.
Percentile benchmarks from the AliveMCP dashboard (May 2026)
Based on the servers currently being monitored on the public dashboard:
| Metric | p50 (median) | p95 (typical tail) | p99 (worst 1%) |
|---|---|---|---|
| Time-to-initialize | 110ms | 380ms | 950ms |
| Time-to-tools-list | 85ms | 290ms | 720ms |
| Tool call (simple lookup) | 220ms | 680ms | 2,100ms |
| Tool call (LLM-backed) | 1,800ms | 4,500ms | 9,200ms |
LLM-backed tools are slower because every call waits on an upstream model. Factor that into your SLA. Simple lookups should be well under 500ms at p95. If they're not, the server is likely doing avoidable synchronous work (an unindexed database query, a cold function spin-up) that it could cache or defer.
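For readers computing these percentiles from their own probe samples, here is a small sketch using the nearest-rank method. The method is an assumption on my part; the source does not say how the dashboard computes its percentiles, and other definitions (for example, linear interpolation) give slightly different values on small samples.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples_ms):
    """Return the three percentiles used in the table above."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Illustrative data: 18 fast probes and two slow ones.
samples = [100] * 18 + [900, 5000]
print(summarize(samples))  # {'p50': 100, 'p95': 900, 'p99': 5000}
```

The made-up sample shows why the median alone is misleading: p50 stays at 100ms while the tail is 9 to 50 times slower.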
Alerting thresholds that work in practice
- Page immediately: p95 T2I > 3,000ms for 5+ consecutive probes, or any probe returning a JSON-RPC error code.
- Slack alert (within 5 min): p95 T2I > 3× your 7-day baseline for 3 consecutive probes. This catches regressions without false positives from single slow probes.
- Daily digest: p50 creeping up over 7 days (indicating a slow resource leak or growing data volume), or tool call p95 > 2× baseline without a deploy marker.
- Weekly review: T2TL trending up over time (may indicate tool schema growth or server boot time regression).
The most important thing to avoid is alerting on a single slow probe. Networks are noisy. A 3× threshold sustained across three consecutive 60-second windows gives you a 3-minute confirmation window, which is fast enough to matter and quiet enough not to wake you for transient spikes.
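The "3× baseline for three consecutive probes" rule can be sketched as a small stateful check. The class and field names below are hypothetical, and the baseline is assumed to be computed elsewhere (for example, from a 7-day window of p95 values).

```python
from dataclasses import dataclass, field


@dataclass
class SustainedBreachDetector:
    """Fires only after several consecutive probe windows exceed
    factor x baseline; a single slow window resets nothing but the streak."""
    baseline_p95_ms: float          # assumed to come from a 7-day baseline
    factor: float = 3.0
    required_consecutive: int = 3
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, current_p95_ms: float) -> bool:
        """Record one probe window; return True once the breach is sustained."""
        if current_p95_ms > self.factor * self.baseline_p95_ms:
            self._streak += 1
        else:
            self._streak = 0  # any healthy window clears the streak
        return self._streak >= self.required_consecutive


detector = SustainedBreachDetector(baseline_p95_ms=300.0)
for p95 in (1000.0, 200.0, 1000.0, 1000.0, 1000.0):
    fired = detector.observe(p95)
print(fired)  # True: the last three windows all exceeded 900ms
```

With 60-second probe windows, the default settings correspond to the 3-minute confirmation window described above. The isolated spike in the first window does not fire because the healthy second window resets the streak.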
How AliveMCP tracks this
Every probe the public dashboard runs records T2I and T2TL for each endpoint. The 90-day rolling p50 and p95 are shown on each server's /status/<slug> page, with a 24-hour trend chart. Author tier users ($9/mo) can set custom alert thresholds and get webhook notifications when their server's p95 crosses a limit. Check the live dashboard for your server's current performance profile, or see what's in each tier.
Related questions
Should I alert on absolute latency or relative to baseline?
Relative to baseline is almost always better for MCP servers, because tool complexity varies wildly. A 1,500ms LLM-backed tool is fine; a 1,500ms database lookup is a problem. Relative thresholds adapt to each server's character automatically.
How do I profile what's slow inside my MCP server?
Add span timing around your handler dispatch, recording a timestamp immediately before and after each tool executes. For Node.js MCP servers, a simple process.hrtime.bigint() wrap on your tool handler is enough to see whether the slowdown is in your code, in a downstream API call, or in the JSON serialization.
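The answer above describes the wrap for Node.js. For a server written in Python, the same idea can be sketched with `time.perf_counter_ns()`. The decorator name, the `record` callback, and the `lookup_user` handler are all hypothetical, and this version only covers synchronous handlers; an async handler would need an `async def` wrapper that awaits the call.

```python
import functools
import time


def timed_tool(name, record):
    """Decorator that times a synchronous tool handler and reports
    (name, elapsed_ms) to the supplied record callback."""
    def decorate(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start_ns = time.perf_counter_ns()
            try:
                return handler(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failures are timed too.
                elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
                record(name, elapsed_ms)
        return wrapper
    return decorate


timings = []


@timed_tool("lookup_user", lambda name, ms: timings.append((name, ms)))
def lookup_user(user_id):
    # Hypothetical tool body; a real handler would hit a database or API.
    return {"id": user_id}


lookup_user(42)
print(timings[0][0])  # lookup_user
```

Wrapping the downstream call and the serialization step with the same decorator, under different names, is one way to separate the three sources of slowness mentioned above.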
Does AliveMCP show regional latency?
Yes — Enterprise tier probes from five geographic regions and surfaces regional latency separately. This catches the "works for us, broken in EU" class of failures that a single-region probe misses.