
MCP server downtime

Most MCP server downtime is invisible to its author and only becomes visible when a user tweets about it, leaves a review, or simply stops using the tool. Here's how to flip that: detect outages before users do, track downtime history, and communicate status proactively.

TL;DR

The only reliable way to know your MCP server is down before a user tells you is continuous external probing: at least every 60 seconds, from outside your own network. At that cadence, the first failed probe arrives within a minute of a full outage starting. AliveMCP runs this probe for free against every public MCP endpoint. Join the waitlist to add your server, get alert webhooks, and track your 90-day downtime history.

Why MCP downtime is worse than web app downtime

When a web app goes down, users hit a visible error page. They know. They complain, and that feedback reaches the developer quickly. When an MCP server goes down, the user experience is different: the agent either silently skips the tool call and returns a degraded response, or hangs while waiting for a timeout. Neither produces a clear "the server is down" signal for the user. They just think the AI product is bad. They churn without filing a bug.

This helps explain why AliveMCP's April 2026 scan found 91% of public MCP endpoints non-functional: most authors didn't know their server had been down for days or weeks. The silent failure mode is a structural property of how agent frameworks handle tool errors, not a failing of individual authors.

Downtime detection: what the probe looks like

A correct downtime probe for an MCP server is not an HTTP GET to the base URL. It's:

A POST to the MCP endpoint with an initialize JSON-RPC body.
Validation that the response contains protocolVersion and capabilities.
A follow-up tools/list call to confirm the tool registry is functional.

A server is "down" if any of these three steps fails with an error — a network failure, a non-2xx HTTP response, a JSON-RPC error code, or a response that doesn't conform to the MCP spec. A server that returns HTTP 200 with a broken JSON body is down for MCP purposes even though a conventional uptime monitor would mark it green.
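As a minimal sketch of the validation half of that probe, the checks could look roughly like this in Python. The request bodies assume MCP's JSON-RPC 2.0 framing; the protocolVersion string, the clientInfo values, and the function names (parse_result, check_initialize, check_tools_list) are illustrative assumptions, not AliveMCP's implementation. Sending the requests is left out on purpose: pass in whatever HTTP status and body your client received, or None for the status if the request never completed. A server that answers with an event stream would need its JSON message extracted first.

```python
import json

# Request bodies, assuming MCP's JSON-RPC 2.0 framing. The protocolVersion
# string and clientInfo values are placeholders, not taken from this page.
INITIALIZE_REQUEST = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-03-26",
        "capabilities": {},
        "clientInfo": {"name": "example-probe", "version": "0.0.1"},
    },
}
TOOLS_LIST_REQUEST = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}


def parse_result(status, body):
    """Return (result, None) if the response is usable, else (None, reason).

    status is the HTTP status code, or None if the request never completed.
    """
    if status is None:
        return None, "network failure"
    if not 200 <= status < 300:
        return None, f"HTTP {status}"
    try:
        message = json.loads(body)
    except (TypeError, ValueError):
        return None, "body is not valid JSON"
    if not isinstance(message, dict):
        return None, "body is not a JSON-RPC object"
    if "error" in message:
        return None, "JSON-RPC error"
    result = message.get("result")
    if not isinstance(result, dict):
        return None, "missing result object"
    return result, None


def check_initialize(status, body):
    """Return None if the initialize step passed, else a failure reason."""
    result, reason = parse_result(status, body)
    if reason:
        return reason
    for field in ("protocolVersion", "capabilities"):
        if field not in result:
            return f"initialize result lacks {field}"
    return None


def check_tools_list(status, body):
    """Return None if the tools/list step passed, else a failure reason."""
    result, reason = parse_result(status, body)
    if reason:
        return reason
    if not isinstance(result.get("tools"), list):
        return "tools/list result lacks a tools array"
    return None
```

Keeping the checks free of network code means the same functions can be exercised against recorded responses, including the "HTTP 200 with a broken body" case that a conventional uptime monitor would miss.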

See MCP server health check for the exact probe request shapes.

Downtime categories you'll encounter

Hard down: TCP connection refused or TLS handshake fails. The server process has crashed, the host is unreachable, or DNS has lapsed. This is the most obvious failure class and the easiest to detect. Recovery is usually a restart or a deploy.
Protocol down: HTTP reaches the server but the JSON-RPC handler returns an error or an invalid response. Common causes: a dependency that failed on startup, a middleware bug introduced by a recent deploy, or a serialization regression.
Tool-registry down: initialize succeeds but tools/list errors. The server is alive but its capability advertisement is broken. Agents can connect but can't discover what to call — functionally down for any new session.
Flapping: The server alternates between healthy and unhealthy probes. Common causes: a container under memory pressure restarting, a connection pool exhausted during traffic spikes, or a network path with high packet loss. Flapping is tricky to alert on — see the alerting section below.
Degraded / regional: The server is healthy from some probe locations but not others. This usually indicates a CDN misconfiguration, a regional load balancer issue, or a dependency that's only available in one geography.
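One way to make these categories concrete is to map the outcome of each probe stage to a label, then combine labels across probe locations. The sketch below is illustrative only: the Status enum, the function names, and the rule for mixed failures (report the earliest failing stage) are assumptions rather than anything this page specifies. Flapping is left out because it depends on history across probes, not on a single probe.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    HARD_DOWN = "hard down"
    PROTOCOL_DOWN = "protocol down"
    TOOL_REGISTRY_DOWN = "tool-registry down"
    DEGRADED_REGIONAL = "degraded / regional"


def classify_probe(connected, initialize_ok, tools_list_ok):
    """Classify one probe from one location by the first stage that failed."""
    if not connected:
        return Status.HARD_DOWN
    if not initialize_ok:
        return Status.PROTOCOL_DOWN
    if not tools_list_ok:
        return Status.TOOL_REGISTRY_DOWN
    return Status.HEALTHY


# Assumed ordering for mixed failures: earliest failing stage first.
_FAILURE_ORDER = [
    Status.HARD_DOWN,
    Status.PROTOCOL_DOWN,
    Status.TOOL_REGISTRY_DOWN,
]


def classify_across_regions(per_region):
    """Combine per-location statuses (a dict of region name -> Status)."""
    if not per_region:
        raise ValueError("need at least one probe location")
    statuses = set(per_region.values())
    if statuses == {Status.HEALTHY}:
        return Status.HEALTHY
    if Status.HEALTHY in statuses:
        # Healthy from some locations but not others.
        return Status.DEGRADED_REGIONAL
    for status in _FAILURE_ORDER:
        if status in statuses:
            return status
    raise ValueError("unexpected status values")
```

For example, a server that passes all three stages from one location but refuses connections from another would come out as Status.DEGRADED_REGIONAL.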

Alerting on downtime without over-alerting

The naive approach — alert the moment any probe fails — produces alert fatigue from transient network hiccups. A better policy:

Confirmed down: 3 consecutive failed probes. At a 60-second cadence, the third failure lands 2 to 3 minutes after the outage begins. This filters out single-probe network noise while still confirming a real outage within 3 minutes.
Confirmed flapping: more than 3 flip-flops (up → down → up) in a 30-minute window. This is a distinct alert from a hard outage — it signals instability rather than absence.
Down for > 15 minutes: escalate to a higher-urgency channel. If the first alert (a Slack message, for example) didn't get a response, the 15-minute mark warrants a louder signal.
Down for > 1 hour: post a notice to your public status page, if you have one, and send an email summary. At this point, some users have already been affected.
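The policy above can be sketched as a small function over probe history. This is a rough illustration, not a definitive implementation: the alert labels, the function names, and the reading of one "flip-flop" as a complete up → down → up cycle are assumptions. Outage duration is measured from the first failed probe of the current streak, and timestamps are plain seconds.

```python
CONFIRM_AFTER = 3            # consecutive failed probes
FLAP_LIMIT = 3               # more than this many flip-flops means flapping
FLAP_WINDOW = 30 * 60        # seconds
ESCALATE_AFTER = 15 * 60     # seconds
STATUS_PAGE_AFTER = 60 * 60  # seconds


def count_flip_flops(results):
    """Count complete up -> down -> up cycles in a list of booleans."""
    cycles = 0
    went_down = False
    previous = None
    for ok in results:
        if previous and not ok:
            went_down = True   # an up -> down transition
        elif went_down and ok:
            cycles += 1        # recovered, so the cycle is complete
            went_down = False
        previous = ok
    return cycles


def evaluate(history, now):
    """Return alert labels for history, a list of (timestamp, ok) pairs
    sorted oldest first."""
    alerts = []

    # Length and start time of the failure streak at the end of history.
    streak = 0
    streak_start = None
    for timestamp, ok in reversed(history):
        if ok:
            break
        streak += 1
        streak_start = timestamp

    if streak >= CONFIRM_AFTER:
        alerts.append("confirmed_down")
        down_for = now - streak_start
        if down_for > ESCALATE_AFTER:
            alerts.append("escalate")
        if down_for > STATUS_PAGE_AFTER:
            alerts.append("status_page")

    # Flapping is a separate alert, based on recent history only.
    recent = [ok for timestamp, ok in history if now - timestamp <= FLAP_WINDOW]
    if count_flip_flops(recent) > FLAP_LIMIT:
        alerts.append("flapping")

    return alerts
```

Calling evaluate after every probe returns the full set of conditions that currently hold, so a real alerter would also need to remember which alerts it has already sent to avoid repeating them each minute.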

Tracking downtime history

Beyond real-time alerting, downtime history is what earns user trust over time. A public 90-day uptime graph showing >99.5% availability is a credibility signal; an opaque server with no history is a liability. Consider:

Recording each outage with start time, end time, duration, and cause (even just "deploy regression" or "host restart").
Publishing your uptime percentage on your server's documentation page or GitHub README.
Linking to a public status page where users can check current status themselves instead of opening a support ticket.
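To illustrate the record-keeping, here is a minimal sketch of an outage record and a rolling uptime calculation. The Outage class and uptime_percent function are hypothetical names, and the calculation assumes outage records do not overlap one another. Outages that straddle the window boundary are clipped to the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    start: datetime
    end: datetime
    cause: str  # free text, e.g. "deploy regression" or "host restart"

    @property
    def duration(self):
        return self.end - self.start


def uptime_percent(outages, window_start, window_end):
    """Percentage of the window during which no recorded outage was active.

    Assumes the outage records do not overlap one another.
    """
    window_seconds = (window_end - window_start).total_seconds()
    down_seconds = 0.0
    for outage in outages:
        # Clip each outage to the reporting window.
        start = max(outage.start, window_start)
        end = min(outage.end, window_end)
        if end > start:
            down_seconds += (end - start).total_seconds()
    return 100.0 * (1 - down_seconds / window_seconds)


# Usage: a 90-day window with one outage of 10 hours 48 minutes.
window_end = datetime(2026, 4, 1)
window_start = window_end - timedelta(days=90)
outages = [
    Outage(
        start=datetime(2026, 3, 1),
        end=datetime(2026, 3, 1, 10, 48),
        cause="deploy regression",
    )
]
print(round(uptime_percent(outages, window_start, window_end), 3))
```

A 90-day window is 2,160 hours, so the 10.8 hours of downtime in this example is 0.5% of the window, which puts the server right at the 99.5% line.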

How AliveMCP handles this

Every MCP server on the public dashboard gets a 90-day rolling downtime history, visible at /status/<server-slug>. Outages are marked on a timeline with confirmed-down windows and recovery timestamps. Claim your listing on the Author tier ($9/mo) to unlock custom alert webhooks and a verified-author badge. The Team tier ($49/mo) adds a public status-page subdomain (yourserver.alivemcp.com/status) that you can link from your documentation. See the live dashboard or compare plans.
