
MCP server timeout

A timeout on an MCP tool call doesn't surface cleanly to users — the agent framework usually swallows it, retries silently, or returns a degraded response without explaining why. To users, it looks like the product is slow or broken. To you, it looks like nothing at all unless you're watching.

TL;DR

Three layers of timeout control whether an MCP tool call succeeds: the transport timeout (TCP connection + HTTP response), the protocol timeout (how long the client waits for a JSON-RPC result), and the tool execution timeout (how long your server allows a tool handler to run). Set all three explicitly. Monitor for sustained high-latency probes (leading indicator of timeouts) with AliveMCP before they become user-visible failures. Join the waitlist to track your server's latency profile.

The three timeout layers

1. Transport timeout

This covers TCP connection establishment, the TLS handshake, and the wait for HTTP response headers. If your server is behind a load balancer or reverse proxy, both the client-to-proxy and proxy-to-server timeouts apply in series. HTTP client defaults vary widely: some wait 30 seconds or more, and some set no timeout at all. Either is too generous for connection setup, because a server that takes 30 seconds to accept a connection is effectively down.

Recommended values for general-purpose clients: 5s connection timeout, 30s read timeout. For agent-facing MCP servers, where the caller is itself inside an LLM generation loop, these numbers need to be much more conservative, because a 30-second read timeout can stall an entire agent turn. Consider a 10s read timeout at the transport layer and rely on the protocol layer for the actual work budget.
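
As a minimal sketch of separate connection and read timeouts, the following uses only Python's standard library. The function name fetch_with_timeouts and its defaults are illustrative assumptions, not part of any MCP SDK; your HTTP client or framework likely exposes equivalent settings under its own names.

```python
import http.client


def fetch_with_timeouts(host, port, path, connect_timeout=5.0, read_timeout=10.0):
    """GET a path with distinct connect and read timeouts (hypothetical helper).

    connect_timeout bounds connection setup. read_timeout is applied to the
    socket afterwards, so it bounds each blocking send/recv, not the total
    request duration.
    """
    conn = http.client.HTTPConnection(host, port, timeout=connect_timeout)
    try:
        conn.connect()                       # bounded by connect_timeout
        conn.sock.settimeout(read_timeout)   # bounds each later socket read
        conn.request("GET", path)
        response = conn.getresponse()        # raises socket.timeout if headers stall
        return response.status, response.read()
    finally:
        conn.close()
```

Because the read timeout here is per socket operation, a server that trickles bytes can still exceed it in total. That is one reason to keep a separate protocol-level budget for the whole call.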

2. Protocol (JSON-RPC) timeout

This is how long the calling client waits for the JSON-RPC response to a specific request ID. This is separate from the transport timeout — the transport stays alive (the connection is open) but the application-level response hasn't arrived. Agent frameworks that implement MCP typically allow configuring this per-server.

Common mistake: setting the protocol timeout too high (e.g. 120 seconds) because you expect some tool calls to be slow. This allows a single slow tool to block the entire agent turn for two minutes. Better: set a reasonable default (10-15 seconds) and use async/streaming patterns for tools that legitimately need more time.
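
To make the distinction from the transport layer concrete, here is a rough sketch of a per-request protocol timeout using asyncio. JsonRpcClient, its send callable, and on_message are hypothetical names for illustration; real MCP client libraries implement this differently and expose their own configuration.

```python
import asyncio
import itertools
import json


class JsonRpcClient:
    """Tracks pending requests by ID and enforces a per-call deadline."""

    def __init__(self, send, default_timeout=10.0):
        self._send = send                    # assumed: async callable that writes one message
        self._default_timeout = default_timeout
        self._ids = itertools.count(1)
        self._pending = {}                   # request id -> Future awaiting the response

    async def call(self, method, params=None, timeout=None):
        request_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        await self._send(json.dumps(
            {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or {}}
        ))
        budget = self._default_timeout if timeout is None else timeout
        try:
            # The connection may stay open; only this request's wait is bounded.
            return await asyncio.wait_for(future, timeout=budget)
        finally:
            self._pending.pop(request_id, None)

    def on_message(self, raw):
        """Call this for every message the transport receives."""
        message = json.loads(raw)
        future = self._pending.get(message.get("id"))
        if future is None or future.done():
            return                           # late or unknown response: ignore it
        if "error" in message:
            future.set_exception(RuntimeError(message["error"].get("message", "error")))
        else:
            future.set_result(message.get("result"))
```

A caller can then pass a larger timeout only for the specific tools that need it, rather than raising the default for every request.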

3. Tool execution timeout

This is a timeout you enforce on the server side — how long you allow a single tool handler to run before you abort it and return an error. Without this, a slow downstream API call (a hung database query, a stalled LLM call) can hold up the server's thread or async task indefinitely, eventually exhausting your connection pool and timing out every caller.

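Recommended values: 8 seconds for synchronous lookups, 25 seconds for LLM-backed tools. Keep each value below the protocol timeout the client uses for that server, so the caller receives your error rather than hitting its own timeout first. If a tool legitimately needs more time, return a pending result and implement a polling or subscription mechanism rather than holding the connection open.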
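
One way to enforce this on the server, sketched with asyncio: wrap each async tool handler in a decorator that aborts it after a fixed budget. The with_timeout name and the shape of the returned error dict are assumptions for illustration; adapt the error to whatever result type your MCP server framework expects.

```python
import asyncio
import functools


def with_timeout(seconds):
    """Abort an async tool handler after `seconds` and return an error result."""
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, **kwargs):
            try:
                return await asyncio.wait_for(handler(*args, **kwargs), timeout=seconds)
            except asyncio.TimeoutError:
                # Assumed error shape; replace with your framework's error type.
                return {
                    "isError": True,
                    "content": [{
                        "type": "text",
                        "text": f"{handler.__name__} timed out after {seconds}s",
                    }],
                }
        return wrapper
    return decorator


@with_timeout(8.0)
async def lookup_record(record_id):
    # Placeholder for a real downstream call (database, API, and so on).
    await asyncio.sleep(0)
    return {"id": record_id}
```

Note that wait_for cancels the handler's task on timeout, so the handler should release connections and other resources in finally blocks.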

Common timeout causes

Upstream API timeouts: Your tool calls an external dependency (a database, a third-party service, another LLM) that becomes slow or unresponsive. This is the most common category. Fix: add a timeout on every external call with a deadline shorter than your tool execution timeout, and fail fast with an error rather than hanging.
Cold starts: Serverless or container-based deployments that spin up on demand add cold-start latency to the first call after an idle period. This typically shows as a periodic p95 spike visible in response-time trends. Fix: keep-alive probes (like AliveMCP's 60-second probes) maintain warm containers as a side effect. Alternatively, provision minimum instances.
Connection pool exhaustion: High-traffic periods can exhaust a database or upstream connection pool, queuing requests indefinitely. A server that's fast at p50 but times out at p99 under load usually has a pool-sizing issue. Fix: tune pool size and add queue-depth metrics.
Large tool schemas: A tools/list response that contains many tools with large inputSchema objects can be slow to serialize, especially on low-memory servers. This rarely causes actual timeouts but is a common source of high T2TL (time-to-tools-list) latency. Fix: compress your schemas, remove redundant descriptions, and consider lazy loading for large input schema definitions.
Network path issues: BGP route changes, CDN configuration drift, or firewall rule changes can increase round-trip time by hundreds of milliseconds on specific paths. These are often regional and show up as high latency from some probe locations but not others. An external monitoring service that probes from multiple regions will catch this before internal metrics do.
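
For the connection pool case above, the key change is that waiting for a pool slot should itself have a deadline and a visible queue depth. The following is a simplified sketch using an asyncio semaphore as a stand-in for a real pool; BoundedPool and its attributes are hypothetical names, and production database drivers usually offer their own acquire-timeout setting.

```python
import asyncio
from contextlib import asynccontextmanager


class BoundedPool:
    """Limits concurrent upstream calls and fails fast when the queue is stuck."""

    def __init__(self, size, acquire_timeout):
        self._slots = asyncio.Semaphore(size)
        self._acquire_timeout = acquire_timeout
        self.waiting = 0                     # queue-depth metric: callers waiting for a slot

    @asynccontextmanager
    async def slot(self):
        self.waiting += 1
        try:
            # Raises asyncio.TimeoutError instead of queuing indefinitely.
            await asyncio.wait_for(self._slots.acquire(), timeout=self._acquire_timeout)
        finally:
            self.waiting -= 1
        try:
            yield
        finally:
            self._slots.release()
```

Exporting the waiting counter to your metrics system gives an early signal of pool pressure before callers start timing out.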

How to detect impending timeouts before they happen

Actual timeouts are a lagging indicator — by the time a caller reports a timeout, the problem has been building for a while. Leading indicators to watch:

Rising p95 response time — a p95 that's been creeping up over 48 hours often precedes a wave of timeouts when traffic increases or a dependency degrades further.
Widening p50-to-p95 spread — when the gap between median and p95 grows, it means an increasing fraction of calls are experiencing the slow path. The slow path is often a prelude to timeout territory.
Increased error rate on specific tools — if one tool's error rate rises while others are stable, a tool-specific upstream dependency is the likely culprit. Fix the dependency; don't raise the timeout.
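
If you collect your own latency samples, the first two indicators can be computed with a few lines of Python. This sketch uses the nearest-rank percentile definition, which is one of several common conventions and is an assumption here; the function names are illustrative.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return p50, p95, and the p50-to-p95 spread for a window of latencies."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    return {"p50": p50, "p95": p95, "spread": p95 - p50}
```

Comparing the summary for the current window against the same window a day or two earlier shows whether p95 or the spread is growing.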

AliveMCP tracks p50 and p95 for every server it probes, and shows the 24-hour trend on each server's status page. Check your server's current trend on the public dashboard.

Alerting on timeout risk

Don't wait for actual timeout errors to alert. Alert when you're trending toward them:

Alert now: p95 response time exceeds your tool execution timeout ceiling for 3+ consecutive probes. At that point at least 5% of measured requests are already slower than the ceiling.
Investigate: p95 is between 50% and 100% of your timeout ceiling and trending up over 6 hours.
Note for review: p95 has risen to more than 2× its 7-day average but is still below the timeout ceiling. Watch it; don't page.
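
The three tiers above can be expressed as a small classifier. This is a sketch under simplifying assumptions: "trending up over 6 hours" is reduced to comparing the latest p95 with the value from 6 hours earlier, and all names and parameters are hypothetical.

```python
def classify(recent_p95, ceiling, avg_7d, p95_6h_ago):
    """Map p95 readings (same unit as ceiling) to an alert tier.

    recent_p95 is ordered oldest to newest, one value per probe.
    """
    latest = recent_p95[-1]
    # Alert now: the last three probes all exceeded the timeout ceiling.
    if len(recent_p95) >= 3 and all(value > ceiling for value in recent_p95[-3:]):
        return "alert"
    # Investigate: between 50% and 100% of the ceiling and higher than 6 hours ago.
    if 0.5 * ceiling <= latest <= ceiling and latest > p95_6h_ago:
        return "investigate"
    # Note for review: more than 2x the 7-day average but still below the ceiling.
    if latest > 2 * avg_7d and latest < ceiling:
        return "review"
    return "ok"
```

A single probe above the ceiling returns "ok" here because the first tier requires three consecutive breaches; tighten that rule if one spike should already draw attention.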

How AliveMCP helps

The 60-second probes AliveMCP runs against every public MCP endpoint measure T2I (time-to-initialize) and T2TL (time-to-tools-list) on every cycle. These are the leading indicators that show when a server is heading toward timeouts. Author tier ($9/mo) lets you set custom latency thresholds and receive webhook alerts before your p95 reaches the danger zone. See all plans or join the waitlist for early access.
