MCP server timeout

A timeout on an MCP tool call doesn't surface cleanly to users — the agent framework usually swallows it, retries silently, or returns a degraded response without explaining why. To users, it looks like the product is slow or broken. To you, it looks like nothing at all unless you're watching.

TL;DR

Three layers of timeout control whether an MCP tool call succeeds: the transport timeout (TCP connection and HTTP response), the protocol timeout (how long the client waits for a JSON-RPC result), and the tool execution timeout (how long your server allows a tool handler to run). Set all three explicitly. Use AliveMCP to monitor for sustained high-latency probes, a leading indicator of timeouts, before they become user-visible failures. Join the waitlist to track your server's latency profile.

The three timeout layers

1. Transport timeout

This timeout covers TCP connection establishment, the TLS handshake, and the wait for HTTP response headers. If your server is behind a load balancer or reverse proxy, both the client-to-proxy and proxy-to-server timeouts apply in series. Many HTTP clients default to 30 seconds here, which is usually too generous: a server that takes 30 seconds to accept a connection is effectively down.

Recommended values for general-purpose clients: connection timeout 5s, read timeout 30s. For agent-facing MCPs, where the agent is itself inside an LLM generation loop, these numbers need to be much more conservative, because a 30-second read timeout can stall an entire agent turn. Consider a 10s read timeout at the transport layer and rely on the protocol layer for the actual work budget.
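
A minimal sketch of the two transport budgets, using only Python's asyncio. The function name fetch_first_line and the plain-TCP, line-based exchange are illustrative assumptions, not part of any MCP SDK. A real client would normally apply the same two budgets through its HTTP library's connect and read timeout settings, and the connect budget would also need to cover the TLS handshake.

```python
import asyncio

CONNECT_TIMEOUT_S = 5.0   # connection budget recommended above
READ_TIMEOUT_S = 10.0     # agent-facing read budget suggested above


async def fetch_first_line(host: str, port: int, payload: bytes,
                           connect_timeout: float = CONNECT_TIMEOUT_S,
                           read_timeout: float = READ_TIMEOUT_S) -> bytes:
    """Open a TCP connection, send payload, return the first response line.

    Raises asyncio.TimeoutError if either budget is exceeded.
    """
    # Connect budget: how long we wait for the peer to accept the connection.
    reader, writer = await asyncio.wait_for(
        asyncio.open_connection(host, port), timeout=connect_timeout)
    try:
        writer.write(payload)
        await writer.drain()
        # Read budget: how long we wait for the first line of the response.
        return await asyncio.wait_for(reader.readline(), timeout=read_timeout)
    finally:
        # Always release the socket, including on timeout.
        writer.close()
        await writer.wait_closed()
```

Keeping the two budgets as separate parameters makes it possible to fail quickly on an unreachable server while still allowing a slower first byte from a healthy one.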

2. Protocol (JSON-RPC) timeout

This is how long the calling client waits for the JSON-RPC response to a specific request ID. This is separate from the transport timeout — the transport stays alive (the connection is open) but the application-level response hasn't arrived. Agent frameworks that implement MCP typically allow configuring this per-server.

Common mistake: setting the protocol timeout too high (e.g. 120 seconds) because you expect some tool calls to be slow. This allows a single slow tool to block the entire agent turn for two minutes. Better: set a reasonable default (10-15 seconds) and use async/streaming patterns for tools that legitimately need more time.
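
The protocol-level wait can be sketched as a table of in-flight requests keyed by JSON-RPC id, each with its own deadline. The PendingRequests class and its send callback are hypothetical names chosen for illustration. Real agent frameworks expose this as a configuration value rather than as code you write yourself.

```python
import asyncio
import itertools
import json


class PendingRequests:
    """Tracks in-flight JSON-RPC requests by id (hypothetical helper)."""

    def __init__(self, send, default_timeout: float = 15.0):
        self._send = send                      # async callable taking a JSON string
        self._default_timeout = default_timeout
        self._ids = itertools.count(1)
        self._pending = {}                     # request id -> asyncio.Future

    async def call(self, method: str, params=None, timeout=None):
        """Send one request and wait for the response with the matching id."""
        request_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        await self._send(json.dumps({
            "jsonrpc": "2.0", "id": request_id,
            "method": method, "params": params or {},
        }))
        budget = self._default_timeout if timeout is None else timeout
        try:
            # The connection may stay open; only this request's wait is bounded.
            return await asyncio.wait_for(future, timeout=budget)
        finally:
            # Forget the id so a late response is ignored instead of leaking.
            self._pending.pop(request_id, None)

    def on_message(self, raw: str) -> None:
        """Route an incoming response to the request that is waiting for it."""
        message = json.loads(raw)
        future = self._pending.get(message.get("id"))
        if future is None or future.done():
            return
        if "error" in message:
            future.set_exception(RuntimeError(str(message["error"])))
        else:
            future.set_result(message.get("result"))
```

A per-call timeout argument lets a caller keep the short default for most tools and pass a larger budget only for the few that need it.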

3. Tool execution timeout

This is a timeout you enforce on the server side — how long you allow a single tool handler to run before you abort it and return an error. Without this, a slow downstream API call (a hung database query, a stalled LLM call) can hold up the server's thread or async task indefinitely, eventually exhausting your connection pool and timing out every caller.

Recommended values: 8 seconds for synchronous lookups, 25 seconds for LLM-backed tools. If a tool legitimately needs more time, return a pending result and implement a polling or subscription mechanism rather than holding the connection open.
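
A server-side execution budget can be sketched with asyncio.wait_for. The run_tool wrapper is a hypothetical name, and the result dict only loosely mirrors an MCP tool result with an isError flag; adapt the shape to whatever your server framework expects.

```python
import asyncio

SYNC_LOOKUP_TIMEOUT_S = 8.0   # budget for synchronous lookups, from the text above
LLM_TOOL_TIMEOUT_S = 25.0     # budget for LLM-backed tools, from the text above


async def run_tool(handler, arguments: dict, timeout: float) -> dict:
    """Run an async tool handler, aborting it once the budget is spent."""
    try:
        # wait_for cancels the handler task when the timeout expires.
        text = await asyncio.wait_for(handler(arguments), timeout=timeout)
        return {"isError": False, "content": [{"type": "text", "text": text}]}
    except asyncio.TimeoutError:
        # Return a structured error instead of leaving the caller waiting.
        message = f"tool exceeded its {timeout:g}s execution budget"
        return {"isError": True, "content": [{"type": "text", "text": message}]}
```

Cancellation only interrupts the handler at an await point. Blocking work that runs in a thread is not stopped by this wrapper and needs its own timeout on the underlying call.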

Common timeout causes

Upstream API timeouts
Your tool calls an external API (a database, a third-party service, another LLM) that becomes slow or unresponsive. This is the most common category. Fix: put a timeout on every external call, with a deadline shorter than your tool execution timeout, and fail fast with an error rather than hanging.
Cold starts
Serverless or container-based deployments that spin up on demand add cold-start latency to the first call after an idle period. This typically shows as a periodic p95 spike visible in response-time trends. Fix: keep-alive probes (like AliveMCP's 60-second probes) maintain warm containers as a side effect. Alternatively, provision minimum instances.
Connection pool exhaustion
High-traffic periods can exhaust a database or upstream connection pool, queuing requests indefinitely. A server that's fast at p50 but times out at p99 under load usually has a pool-sizing issue. Fix: tune pool size and add queue-depth metrics.
Large tool schemas
A tools/list response that contains many tools with large inputSchema objects can be slow to serialize, especially on low-memory servers. This rarely causes actual timeouts but is a common source of high T2TL (time-to-tools-list) latency. Fix: compress your schemas, remove redundant descriptions, and consider lazy loading for large input schema definitions.
Network path issues
BGP route changes, CDN configuration drift, or firewall rule changes can increase round-trip time by hundreds of milliseconds on specific paths. These are often regional and show up as high latency from some probe locations but not others. An external monitoring service that probes from multiple regions will catch this before internal metrics do.
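
For the upstream API case above, one way to keep every external call inside the tool's budget is to pass a shared deadline down the call chain. This is a sketch under assumptions: Deadline, call_upstream, and the one-second reserve are hypothetical names and values, not taken from any MCP library.

```python
import asyncio
import time


class Deadline:
    """Absolute deadline shared by every step of one tool call."""

    def __init__(self, budget_s: float) -> None:
        self._expires_at = time.monotonic() + budget_s

    def remaining(self) -> float:
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._expires_at - time.monotonic())


async def call_upstream(make_call, deadline: Deadline, reserve_s: float = 1.0):
    """Await make_call() with the time that is left, minus a reserve.

    The reserve keeps a little time back so the handler can still build
    and return an error before the tool execution timeout fires.
    """
    budget = deadline.remaining() - reserve_s
    if budget <= 0:
        # Fail fast: a call started now could not finish inside the budget.
        raise asyncio.TimeoutError("no time budget left for upstream call")
    return await asyncio.wait_for(make_call(), timeout=budget)
```

A handler would create one Deadline from its execution budget (for example 8 seconds) and pass it to each external call, so a slow first call automatically shrinks the time allowed for the second.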

How to detect impending timeouts before they happen

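Actual timeouts are a lagging indicator: by the time a caller reports a timeout, the problem has been building for a while. The leading indicator to watch is sustained high latency, visible as p50 and p95 response times that keep climbing toward your timeout thresholds.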

AliveMCP tracks p50 and p95 for every server it probes, and shows the 24-hour trend on each server's status page. Check your server's current trend on the public dashboard.
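
If you want to compute the same two percentiles from your own latency samples, a nearest-rank calculation is enough for a first look. This is an illustrative sketch, not a description of how AliveMCP computes its figures; percentile and latency_summary are hypothetical helper names.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the two figures discussed above for one window of samples."""
    return {"p50": percentile(samples_ms, 50), "p95": percentile(samples_ms, 95)}
```

Comparing the p95 of the current window with the p95 of earlier windows shows whether the slow tail is growing even while the median stays flat.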

Alerting on timeout risk

Don't wait for actual timeout errors to alert. Alert when you're trending toward them, for example when p95 latency crosses a threshold you have set below your timeout.
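
A trend-based alert rule can be sketched as a check over recent p95 values. The 70% warning ratio and the three-window requirement below are assumptions chosen for illustration; tune both to your own traffic. The function name should_alert is hypothetical.

```python
def should_alert(p95_history_ms, timeout_ms, warn_ratio=0.7, sustained_windows=3):
    """Return True when p95 has stayed near the timeout for several windows.

    p95_history_ms: p95 latency per measurement window, oldest first.
    timeout_ms: the timeout that callers are configured with.
    """
    threshold = timeout_ms * warn_ratio
    recent = p95_history_ms[-sustained_windows:]
    # Require a full run of breaches so one slow window does not page anyone.
    return len(recent) == sustained_windows and all(v >= threshold for v in recent)
```

Requiring several consecutive windows trades a little detection delay for far fewer false alarms from one-off spikes such as a cold start.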

How AliveMCP helps

The 60-second probes AliveMCP runs against every public MCP endpoint measure T2I (time-to-initialize) and T2TL (time-to-tools-list) on every cycle. These are the timeout-leading metrics that predict when a server is heading toward a timeout crisis. Author tier ($9/mo) lets you set custom latency thresholds and receive webhook alerts before your p95 reaches the danger zone. See all plans or join the waitlist for early access.

Related questions

What timeout should I set in my agent framework for MCP tool calls?

A reasonable default is 15 seconds for simple tools and 45 seconds for LLM-backed tools. Set these explicitly in your agent framework's MCP client configuration — don't rely on OS defaults, which are often 60-120 seconds and will stall user turns for a long time before failing.
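
One way to keep these values explicit is a small table that pairs each client-side wait with the server-side execution budget suggested earlier. The TimeoutProfile class and the rule that the server budget must be shorter than the client wait are assumptions made for this sketch, so that the server's error can reach the client before the client gives up. How the values are passed to a given agent framework depends on that framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutProfile:
    client_timeout_s: float       # how long the agent framework waits
    server_tool_timeout_s: float  # how long the server lets the handler run

    def validate(self) -> None:
        # Assumption: the server should give up first so its error is delivered.
        if self.server_tool_timeout_s >= self.client_timeout_s:
            raise ValueError(
                "server tool timeout must be shorter than the client timeout")


# Values taken from the recommendations in this article.
PROFILES = {
    "simple": TimeoutProfile(client_timeout_s=15.0, server_tool_timeout_s=8.0),
    "llm_backed": TimeoutProfile(client_timeout_s=45.0, server_tool_timeout_s=25.0),
}


def timeout_for(tool_kind: str) -> TimeoutProfile:
    """Look up and sanity-check the profile for one kind of tool."""
    profile = PROFILES[tool_kind]
    profile.validate()
    return profile
```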

Can timeouts cause data corruption in my MCP server?

If your tool handlers have side effects (database writes, external API calls that can't be rolled back), a timeout mid-execution can leave data in an inconsistent state. Design tool handlers to be idempotent or transactional where possible. Return an error early if you can't guarantee safe abort.
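
Idempotency can be sketched with a caller-supplied key, so that a retry after a timeout returns the earlier result instead of repeating the write. IdempotentWriter is a hypothetical, in-memory illustration. A production version would store the key in the same transaction as the write, so that a crash between the two cannot leave them out of step.

```python
class IdempotentWriter:
    """Applies each write at most once per idempotency key (in-memory sketch)."""

    def __init__(self, apply_write):
        self._apply_write = apply_write   # callable that performs the side effect
        self._results = {}                # idempotency key -> stored result

    def write(self, idempotency_key: str, payload):
        # A retry with a key we have already seen returns the stored result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._apply_write(payload)
        self._results[idempotency_key] = result
        return result
```

With this in place, a client that timed out can safely send the same request again with the same key.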

My server returns the initialize response fast but times out on tools/list — why?

In some MCP frameworks the tools/list response is generated dynamically: it may trigger schema reflection, database lookups, or plugin loading. If initialize is fast but tools/list is slow, the schema generation path is the bottleneck. Cache the tools/list response after the first call if the schema doesn't change between client sessions.
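
A minimal caching sketch, assuming the tool list is produced by an async function and does not change between sessions. ToolsListCache and build_tools are hypothetical names; create the cache from inside your server's running event loop.

```python
import asyncio


class ToolsListCache:
    """Builds the tools/list payload once and reuses it for later requests."""

    def __init__(self, build_tools):
        self._build_tools = build_tools   # async callable returning the tool list
        self._cached = None
        self._lock = asyncio.Lock()

    async def get(self):
        if self._cached is None:
            # The lock stops concurrent first requests from all rebuilding it.
            async with self._lock:
                if self._cached is None:
                    self._cached = await self._build_tools()
        return self._cached

    def invalidate(self) -> None:
        """Drop the cached payload, for example after plugins are reloaded."""
        self._cached = None
```

Call invalidate whenever the set of tools can change, otherwise clients will keep seeing the old list.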

Further reading