MCP server latency

Uptime tells you whether your MCP server is alive. Latency tells you whether it's usable. A server that makes an agent wait 8 seconds for every tools/list call is technically "up", but the user experience is broken. Here's how to measure latency per protocol layer, set thresholds that catch real degradation without alerting on noise, and integrate latency into your SLO framework.

TL;DR

MCP latency has four components: TCP connect, TLS handshake, initialize round-trip, and tools/list round-trip. Measure all four separately. Alert on p95, not p50 — the median hides tail latency problems. Set the alert threshold at 3× the 30-day p95 baseline, require 3 consecutive probe periods above threshold before firing, and exclude cold-start probes (first probe after ≥ 5-minute idle) from SLO calculations on serverless platforms. AliveMCP tracks all four components and provides a 30-day latency baseline automatically.

Why latency matters differently for MCP than for HTTP APIs

HTTP API latency is usually measured end-to-end: how long until the API returns a response? MCP server latency has a compounding structure because AI agents make multiple sequential calls to a single server within one task:

  1. The agent connects and calls initialize to establish the session.
  2. The agent calls tools/list to discover what tools are available and reads their schemas.
  3. The agent calls one or more tools (the actual work).
  4. Optionally, the agent calls resources/list or prompts/list for additional capability discovery.

Steps 1 and 2 happen on every agent session, even for a simple one-tool call. If your initialize + tools/list round-trip takes 2 seconds, every user-facing agent interaction is 2 seconds slower before any tool work begins. At p95, if that 2 seconds becomes 8 seconds, the user experience falls apart even though your server is "healthy" by uptime metrics.

Per-layer latency components and budgets

TCP connect + TLS handshake

This is pure network latency: how long it takes to establish the connection. For a server colocated in the same cloud region as your probe origin, this should be 5–20ms. Cross-continent probes add 80–200ms. TLS 1.3 session resumption with 0-RTT removes the TLS handshake round trip for repeat connections, although the TCP connect round trip remains.

Target budget: <50ms for same-region, <250ms for cross-region. An alert threshold of >500ms for 3 consecutive probes is reasonable — if your TCP connect time spikes to 500ms+ from a probe origin in the same region, something is wrong with your server's network path (DNS misconfiguration, routing anomaly, firewall rate-limiting).
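A minimal sketch of timing these two phases separately, using only the Python standard library. The function name measure_connect and its parameters are illustrative assumptions, not part of any MCP tooling. Note that the TCP figure here also includes DNS resolution, because socket.create_connection resolves the host name before connecting.

```python
import socket
import ssl
import time


def measure_connect(host, port=443, use_tls=True, timeout=5.0):
    """Return TCP connect and TLS handshake durations in milliseconds.

    Hypothetical helper for illustration. The TCP figure includes DNS
    resolution, since create_connection resolves the name first.
    """
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    tcp_ms = (time.perf_counter() - start) * 1000.0

    tls_ms = None
    try:
        if use_tls:
            context = ssl.create_default_context()
            tls_start = time.perf_counter()
            # wrap_socket completes the TLS handshake before it returns.
            sock = context.wrap_socket(sock, server_hostname=host)
            tls_ms = (time.perf_counter() - tls_start) * 1000.0
    finally:
        sock.close()

    return {"tcp_connect_ms": tcp_ms, "tls_handshake_ms": tls_ms}
```

Calling measure_connect("your-server.example") from a probe in the same region and comparing the two numbers against the budgets above shows whether a spike sits in the network path or in the TLS layer. The host name is a placeholder.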

HTTP response (TTFB of the initialize response)

After the connection is established, this is the time until the server starts returning bytes. A fast initialize response starts returning bytes within 50–100ms of receiving the request. Spikes here often indicate server-side CPU contention or a slow database query during session initialization.

Target budget: <200ms for the initialize HTTP response TTFB. Cold-start platforms (Vercel, Railway, Render) will spike to 600ms–30 seconds on the first request after an idle period — see MCP server cold start for how to handle cold-start latency without false alert storms.

JSON-RPC initialize round-trip

The time from sending the initialize request body to receiving the full result response, including JSON parse. This measures the server's ability to validate the request, look up session configuration, and respond with protocol metadata.

Target budget: <500ms total (end-to-end, including connection setup). Anything above 1 second is worth investigating. Anything above 3 seconds will cause timeouts in some MCP client implementations that use default HTTP client timeouts.
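As a rough sketch, the TTFB and full round-trip for initialize can be timed with the standard library's http.client. Several things here are assumptions: the helper name timed_jsonrpc_post, the clientInfo values, the protocolVersion string (use whichever version your server supports), and that the server answers over HTTP with a plain JSON body rather than an event stream. Connection setup is deliberately excluded from these two numbers so it can be reported on its own.

```python
import http.client
import json
import time
from urllib.parse import urlsplit

# Assumed request shape for illustration; adjust protocolVersion and
# clientInfo to match your own server and probe.
INITIALIZE_REQUEST = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-03-26",
        "capabilities": {},
        "clientInfo": {"name": "latency-probe", "version": "0.1"},
    },
}


def timed_jsonrpc_post(url, payload, timeout=10.0):
    """POST a JSON-RPC payload and return TTFB and round-trip time in ms."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    try:
        conn.connect()  # connection setup happens before the timer starts
        start = time.perf_counter()
        conn.request("POST", parts.path or "/", body=body, headers=headers)
        response = conn.getresponse()  # returns once status line and headers arrive
        ttfb_ms = (time.perf_counter() - start) * 1000.0
        parsed = json.loads(response.read())  # assumes a plain JSON response body
        round_trip_ms = (time.perf_counter() - start) * 1000.0
        status = response.status
    finally:
        conn.close()
    return {
        "status": status,
        "ttfb_ms": ttfb_ms,
        "round_trip_ms": round_trip_ms,
        "body": parsed,
    }
```

The same helper can time tools/list by swapping the method name in the payload, though a real client would also need to carry over any session state the server returned from initialize.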

tools/list round-trip

The time from sending the tools/list request to receiving the full response, including all tool schemas. Tools/list can be surprisingly slow if the server generates tool definitions dynamically (querying a database of available plugins, loading configuration from a file, or calling an upstream API to discover available operations).

Target budget: <300ms for <20 tools with static schemas; <800ms for dynamic tool discovery. Tool count matters: a response body with 50 tool definitions and their full JSON schemas can exceed 100KB, which adds meaningful serialization and transfer time. See MCP server performance for guidance on tool schema size optimization.
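The per-layer budgets above can be collected into one table and checked against a probe's measurements. This is a sketch only: the phase names, the BUDGETS_MS dictionary, and the over_budget helper are made up for illustration, and the numbers are the same-region, static-schema targets from this guide.

```python
# Budgets in milliseconds, taken from the same-region, static-schema
# targets in this guide. The phase names are illustrative.
BUDGETS_MS = {
    "tcp_tls_connect": 50,
    "initialize_ttfb": 200,
    "initialize_total": 500,
    "tools_list": 300,
}


def over_budget(timings_ms, budgets_ms=BUDGETS_MS):
    """Return {phase: (measured_ms, budget_ms)} for each phase above budget."""
    return {
        phase: (timings_ms[phase], budget)
        for phase, budget in budgets_ms.items()
        if phase in timings_ms and timings_ms[phase] > budget
    }


# Invented sample measurements for one probe.
sample = {
    "tcp_tls_connect": 18,
    "initialize_ttfb": 240,
    "initialize_total": 410,
    "tools_list": 290,
}
print(over_budget(sample))  # {'initialize_ttfb': (240, 200)}
```

Keeping the budgets per phase, rather than checking only the end-to-end total, points you at the layer that regressed even when the total still looks acceptable.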

Choosing the right percentile for alerting

Never alert on p50 (median) latency. The median hides the tail experience: 49% of requests can be badly slow while the median looks fine. The percentile to alert on depends on your SLO target.

For a typical public MCP server, start with p95. The p95 threshold should be set at 3× your 30-day p95 baseline. If your server's 30-day p95 initialize latency is 180ms, the alert threshold is 540ms. This prevents alert fatigue from normal daily variation while catching genuine degradation events.
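To make the p50-versus-p95 point concrete, here is a small sketch using a nearest-rank percentile. The sample latencies are invented, and the helper names percentile and alert_threshold_ms are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def alert_threshold_ms(baseline_p95_ms, multiplier=3):
    """Alert threshold as a multiple of the 30-day p95 baseline."""
    return baseline_p95_ms * multiplier


# Invented data: 90 fast requests and 10 very slow ones.
latencies_ms = [120] * 90 + [4000] * 10
print(percentile(latencies_ms, 50))  # 120
print(percentile(latencies_ms, 95))  # 4000
print(alert_threshold_ms(180))       # 540
```

In this sample one request in ten takes 4 seconds, yet the median still reports 120ms. The p95 exposes the slow tail immediately.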

Require 3 consecutive probe periods above the threshold before firing the alert. A single p95 spike is often a single slow request at the probe origin, not a server problem. Three consecutive periods (3 minutes at 60-second probe cadence) indicate a sustained degradation that is worth waking someone up for.
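The consecutive-period rule can be sketched as a check over the most recent per-period p95 values. The function name should_alert and the sample values are illustrative assumptions, and the comparison is strictly "above threshold".

```python
def should_alert(period_p95_ms, threshold_ms, required_consecutive=3):
    """True when the latest `required_consecutive` periods all exceed the threshold."""
    if len(period_p95_ms) < required_consecutive:
        return False
    recent = period_p95_ms[-required_consecutive:]
    return all(value > threshold_ms for value in recent)


# One isolated spike does not fire; three periods in a row do.
print(should_alert([170, 900, 175, 180], threshold_ms=540))  # False
print(should_alert([170, 600, 650, 700], threshold_ms=540))  # True
```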

Separating cold-start latency from genuine degradation

Serverless MCP platforms (Vercel, Railway free tier, Render free tier, AWS Lambda) scale to zero after a period of inactivity and have a multi-second cold start on the first request. This creates a predictable latency spike pattern: one high-latency probe after an idle gap, then recovery to normal latency on subsequent probes.

A naive latency alert fires P1 on the cold-start spike. The correct handling is to flag the probe as a "post-idle probe" and exclude it from SLO calculations while still logging the cold-start latency for trend analysis. AliveMCP recognizes serverless domains (vercel.app, railway.app, render.com, onrender.com, fly.dev, workers.dev) and applies a post-idle suppression automatically.

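In a probe log, the distinction between a cold start and genuine degradation is the shape of the spike: a cold start is a single high-latency probe after an idle gap, followed by recovery on the next probe, while genuine degradation stays above the threshold for several consecutive probes.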
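One way to apply the post-idle rule is to split probes into those that count toward the SLO and those that follow an idle gap. This is a sketch, not AliveMCP's implementation: the Probe type, the split_post_idle helper, and the sample values are all invented, and it assumes the probe is the only traffic, so the gap between probes stands in for server idle time. The very first probe has no predecessor and is kept in the SLO set here.

```python
from dataclasses import dataclass

IDLE_GAP_SECONDS = 300  # the 5-minute idle threshold used in this guide


@dataclass
class Probe:
    timestamp: float   # seconds since epoch
    latency_ms: float


def split_post_idle(probes, idle_gap_seconds=IDLE_GAP_SECONDS):
    """Return (slo_probes, post_idle_probes), ordered by timestamp."""
    slo_probes, post_idle_probes = [], []
    previous = None
    for probe in sorted(probes, key=lambda p: p.timestamp):
        gap = None if previous is None else probe.timestamp - previous.timestamp
        if gap is not None and gap >= idle_gap_seconds:
            post_idle_probes.append(probe)  # logged for trends, excluded from the SLO
        else:
            slo_probes.append(probe)
        previous = probe
    return slo_probes, post_idle_probes


# Invented log: a 10-minute gap precedes the 4200ms probe.
log = [Probe(0, 150), Probe(60, 160), Probe(660, 4200), Probe(720, 155)]
slo, post_idle = split_post_idle(log)
print([p.latency_ms for p in slo])        # [150, 160, 155]
print([p.latency_ms for p in post_idle])  # [4200]
```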

See MCP server cold start for the complete playbook, including platform-specific benchmarks and mitigation options (keep-alive pings, min-instances=1, AliveMCP probes as incidental keep-alive).

Latency and SLO math

Latency SLOs are typically expressed as: "p95 response time for initialize + tools/list will be below X ms, measured over a rolling 30-day window." Choosing X requires knowing your 30-day baseline first — you can't set a meaningful SLO for a new server until you have 30 days of probe data.

For a server hosted on a dedicated VM or container (Railway paid tier, Fly.io, Hetzner VPS), a reasonable target is p95 <800ms end-to-end (TCP connect + initialize + tools/list). For serverless (Vercel Edge Functions, Lambda), excluding cold-starts, p95 <2,000ms is more realistic.

Latency SLOs interact with error budget calculations: a request that takes 10 seconds and eventually succeeds is not an "error" in the traditional sense, but if you have a 5-second client-side timeout, it is effectively an error from the user's perspective. Define whether your SLO counts latency-induced timeouts as errors — most teams count them as errors once they exceed the client timeout, regardless of whether the server eventually responds.
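A short sketch of counting latency-induced timeouts as errors. The function name latency_slo_report, the tuple layout, and the sample results are assumptions for illustration. The 5-second client timeout matches the example above.

```python
def latency_slo_report(results, client_timeout_ms=5000):
    """Count failures and slow successes as errors.

    `results` is a list of (latency_ms, succeeded) tuples. A request that
    succeeded but took longer than the client timeout counts as an error.
    """
    total = len(results)
    errors = sum(
        1
        for latency_ms, succeeded in results
        if not succeeded or latency_ms > client_timeout_ms
    )
    return {
        "total": total,
        "errors": errors,
        "error_rate": errors / total if total else 0.0,
    }


# Invented results: the 10-second success still counts against the budget.
report = latency_slo_report([(200, True), (10000, True), (300, False), (450, True)])
print(report)  # {'total': 4, 'errors': 2, 'error_rate': 0.5}
```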

Related questions

What causes sudden p95 latency spikes that recover on their own?

Three common causes: (1) a serverless cold start after an idle period, where the spike is predictable and the recovery is fast; (2) noisy-neighbor resource contention on a shared hosting provider while another tenant handles a traffic burst; (3) a transient slowdown in a downstream API your server depends on. To diagnose, check whether the spike follows a known idle gap (cold start), appears across multiple unrelated servers on the same provider (noisy neighbor), or is isolated to one server with a stable prior baseline (downstream API issue).
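That diagnostic order can be written down as a tiny decision function. It is a simplification for illustration: the function name and its two boolean inputs are assumptions, and a real investigation would weigh more evidence than this.

```python
def classify_spike(follows_idle_gap, other_servers_on_provider_spiked):
    """Map the two diagnostic checks to the most likely cause of a self-recovering spike."""
    if follows_idle_gap:
        return "cold start"
    if other_servers_on_provider_spiked:
        return "noisy neighbor"
    # Isolated to one server with a stable prior baseline.
    return "downstream API issue"


print(classify_spike(follows_idle_gap=False, other_servers_on_provider_spiked=True))
# noisy neighbor
```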

How do I set a latency SLO for a brand-new server with no baseline data?

Use conservative defaults for the first 30 days: p95 <2,000ms for serverless, p95 <800ms for dedicated. After 30 days of probe data, refine the SLO target from your actual p95 baseline and set the alert threshold at 3× that baseline. Don't set the threshold lower than your actual baseline, or you'll spend the first 30 days triaging false alerts instead of understanding your server's real performance envelope.

Should latency be a component of my public status page?

Show current status (healthy/degraded/down) and historical uptime percentage on the public status page. Keep latency percentiles internal — they expose implementation details to competitors and can cause user confusion ("why is it 847ms instead of 200ms?"). A general "degraded performance" indicator on the public status page is appropriate when p95 exceeds the alert threshold, without showing the raw numbers.

How does AliveMCP measure latency at each protocol layer?

AliveMCP times each phase independently: TCP connect (time to SYN-ACK), TLS handshake completion, HTTP response TTFB (first byte of initialize response), initialize response body complete, then tools/list request start to body complete. The dashboard shows each phase as a separate timeseries, so you can see whether a latency spike is in the connection (network problem) or in the protocol phases (server-side problem).

Further reading