Reference · Incident management
MCP server downtime
Most MCP server downtime is invisible to its author and only becomes visible when a user tweets about it, leaves a review, or simply stops using the tool. Here's how to flip that: detect outages before users do, track downtime history, and communicate status proactively.
TL;DR
The only reliable way to know your MCP server is down before a user tells you is continuous external probing — at least every 60 seconds, from outside your own network. A single probe at that cadence catches a full outage within a minute. AliveMCP runs this free for every public MCP endpoint. Join the waitlist to add your server, get alert webhooks, and track your 90-day downtime history.
Why MCP downtime is worse than web app downtime
When a web app goes down, users hit a visible error page. They know. They complain, and that feedback reaches the developer quickly. When an MCP server goes down, the user experience is different: the agent either silently skips the tool call and returns a degraded response, or hangs while waiting for a timeout. Neither produces a clear "the server is down" signal for the user. They just think the AI product is bad. They churn without filing a bug.
This helps explain why AliveMCP's April 2026 scan found 91% of public MCP endpoints non-functional: most authors didn't know their server had been down for days or weeks. The silent failure mode is a structural property of how agent frameworks handle tool errors, not a failing on the author's part.
Downtime detection: what the probe looks like
A correct downtime probe for an MCP server is not an HTTP GET to the base URL. It's:
- A POST to the MCP endpoint with an initialize JSON-RPC body.
- Validation that the response contains protocolVersion and capabilities.
- A follow-up tools/list call to confirm the tool registry is functional.
A server is "down" if any of these three steps fails with an error — a network failure, a non-2xx HTTP response, a JSON-RPC error code, or a response that doesn't conform to the MCP spec. A server that returns HTTP 200 with a broken JSON body is down for MCP purposes even though a conventional uptime monitor would mark it green.
See MCP server health check for the exact probe request shapes.
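The three probe steps can be sketched as pure functions that build the request bodies and validate whatever comes back. This is a minimal illustration, not AliveMCP's implementation: the function names, the client name, and the protocol version string are assumptions, and the HTTP call itself is left to whichever client you already use. Depending on the transport and spec revision, a server may also expect an initialized notification or a session header between the two calls, and may answer with an event stream rather than a plain JSON body; neither is handled here.

```python
import json

# Assumed protocol revision; use the one your server actually targets.
ASSUMED_PROTOCOL_VERSION = "2025-03-26"


def build_initialize_request(request_id: int = 1) -> dict:
    """JSON-RPC body for the first probe step."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": ASSUMED_PROTOCOL_VERSION,
            "capabilities": {},
            # Hypothetical client identity for the probe.
            "clientInfo": {"name": "example-probe", "version": "0.1"},
        },
    }


def build_tools_list_request(request_id: int = 2) -> dict:
    """JSON-RPC body for the follow-up tool registry check."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}


def check_response(status: int, body: str, required_keys: tuple) -> tuple[bool, str]:
    """Return (ok, reason) for one probe step.

    Assumption: the server answers with a plain JSON body. A server that
    replies with an event stream needs the payload extracted first.
    """
    if not 200 <= status < 300:
        return False, f"non-2xx HTTP status {status}"
    try:
        message = json.loads(body)
    except json.JSONDecodeError:
        # HTTP 200 with a broken body still counts as down.
        return False, "body is not valid JSON"
    if not isinstance(message, dict):
        return False, "body is not a JSON-RPC object"
    if "error" in message:
        return False, f"JSON-RPC error: {message['error']}"
    result = message.get("result")
    if not isinstance(result, dict):
        return False, "missing result object"
    missing = [key for key in required_keys if key not in result]
    if missing:
        return False, f"result missing {missing}"
    return True, "ok"
```

For the initialize step, pass ("protocolVersion", "capabilities") as required_keys; for tools/list, pass ("tools",). A network failure never reaches check_response, so the caller should treat a connection exception as a failed step in its own right.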
Downtime categories you'll encounter
- Hard down: TCP connection refused or TLS handshake fails. The server process has crashed, the host is unreachable, or DNS has lapsed. This is the most obvious failure class and the easiest to detect. Recovery is usually a restart or a deploy.
- Protocol down: HTTP reaches the server but the JSON-RPC handler returns an error or an invalid response. Common causes: a dependency that failed on startup, a middleware bug introduced by a recent deploy, or a serialization regression.
- Tool-registry down: initialize succeeds but tools/list errors. The server is alive but its capability advertisement is broken. Agents can connect but can't discover what to call, so the server is functionally down for any new session.
- Flapping: The server alternates between healthy and unhealthy probes. Common causes: a container under memory pressure restarting, a connection pool exhausted during traffic spikes, or a network path with high packet loss. Flapping is tricky to alert on; see the alerting section below.
- Degraded / regional: The server is healthy from some probe locations but not others. This usually indicates a CDN misconfiguration, a regional load balancer issue, or a dependency that's only available in one geography.
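As a rough illustration of how a single probe result maps onto these categories, the sketch below classifies by the first step that failed. The ProbeResult fields and the function names are hypothetical, chosen for this example only. Flapping is left out because it is a property of a sequence of probes rather than of any one result.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    HARD_DOWN = "hard down"
    PROTOCOL_DOWN = "protocol down"
    TOOL_REGISTRY_DOWN = "tool-registry down"


@dataclass
class ProbeResult:
    connected: bool       # TCP connect and TLS handshake succeeded
    initialize_ok: bool   # initialize returned a valid result
    tools_list_ok: bool   # tools/list returned a valid result


def classify(probe: ProbeResult) -> Status:
    """Name the failure class by the earliest step that failed."""
    if not probe.connected:
        return Status.HARD_DOWN
    if not probe.initialize_ok:
        return Status.PROTOCOL_DOWN
    if not probe.tools_list_ok:
        return Status.TOOL_REGISTRY_DOWN
    return Status.HEALTHY


def is_regional(statuses_by_location: dict[str, Status]) -> bool:
    """True when some probe locations see a healthy server and others do not."""
    healthy = [status is Status.HEALTHY for status in statuses_by_location.values()]
    return any(healthy) and not all(healthy)
```

Checking the steps in order matters: a server that refuses connections also fails initialize and tools/list, but reporting it as hard down points at the right fix.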
Alerting on downtime without over-alerting
The naive approach — alert the moment any probe fails — produces alert fatigue from transient network hiccups. A better policy:
- Confirmed down: 3 consecutive failed probes (3 minutes at 60-second cadence). This eliminates single-probe network noise while still catching a real outage within 3 minutes.
- Confirmed flapping: more than 3 flip-flops (up → down → up) in a 30-minute window. This is a distinct alert from a hard outage — it signals instability rather than absence.
- Down for > 15 minutes: escalate to a higher-urgency channel. If the first Slack message didn't generate a response, the 15-minute mark warrants a louder signal.
- Down for > 1 hour: post a notice to your public status page and, if you have a mailing list, send an email summary. At this point, some users have already been affected.
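The policy above can be sketched as a small state machine that is fed one probe result at a time. This is an illustration under stated assumptions rather than a reference implementation: the class and alert names are hypothetical, timestamps are plain seconds, outage duration is measured from the first failed probe, and one flip-flop is counted each time the server recovers after a failed probe.

```python
from collections import deque

CONFIRM_AFTER_FAILURES = 3         # consecutive failed probes
FLAP_WINDOW_S = 30 * 60            # 30-minute window
FLAP_THRESHOLD = 3                 # alert on more than 3 flip-flops
ESCALATE_AFTER_S = 15 * 60         # louder channel after 15 minutes
PUBLIC_NOTICE_AFTER_S = 60 * 60    # status page notice after 1 hour


class AlertPolicy:
    def __init__(self):
        self.consecutive_failures = 0
        self.first_failure_at = None
        self.recoveries = deque()  # timestamps of down -> up transitions
        self.sent = set()          # alerts already raised and still active

    def _raise(self, name, alerts):
        # Raise each alert once until its condition clears.
        if name not in self.sent:
            self.sent.add(name)
            alerts.append(name)

    def observe(self, now, ok):
        """Feed one probe result; return the alerts it newly triggers."""
        alerts = []
        if ok:
            if self.consecutive_failures > 0:
                self.recoveries.append(now)  # one up -> down -> up cycle done
            self.consecutive_failures = 0
            self.first_failure_at = None
            self.sent -= {"confirmed_down", "escalate", "public_notice"}
        else:
            if self.consecutive_failures == 0:
                self.first_failure_at = now
            self.consecutive_failures += 1
            down_for = now - self.first_failure_at
            if self.consecutive_failures >= CONFIRM_AFTER_FAILURES:
                self._raise("confirmed_down", alerts)
                if down_for > ESCALATE_AFTER_S:
                    self._raise("escalate", alerts)
                if down_for > PUBLIC_NOTICE_AFTER_S:
                    self._raise("public_notice", alerts)
        # Forget flip-flops that have aged out of the window.
        while self.recoveries and now - self.recoveries[0] > FLAP_WINDOW_S:
            self.recoveries.popleft()
        if len(self.recoveries) > FLAP_THRESHOLD:
            self._raise("flapping", alerts)
        else:
            self.sent.discard("flapping")
        return alerts
```

Keeping the flapping check independent of the consecutive-failure counter is deliberate: a server that fails every other probe never reaches three failures in a row, so without a separate rule it would never alert at all.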
Tracking downtime history
Beyond real-time alerting, downtime history is what earns user trust over time. A public 90-day uptime graph showing >99.5% availability is a credibility signal; an opaque server with no history is a liability. Consider:
- Recording each outage with start time, end time, duration, and cause (even just "deploy regression" or "host restart").
- Publishing your uptime percentage on your server's documentation page or GitHub README.
- Linking to a public status page where users can check current status themselves instead of opening a support ticket.
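A minimal sketch of the bookkeeping behind such a figure: each outage is a record with start, end, and cause, and the uptime percentage is the share of a rolling window not covered by any outage. The Outage type and uptime_percent function are hypothetical names for this example, and the calculation assumes recorded outages do not overlap one another.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Outage:
    start: datetime
    end: datetime
    cause: str  # even a short label such as "deploy regression"

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def uptime_percent(outages: list[Outage], window_end: datetime, window_days: int = 90) -> float:
    """Percentage of the rolling window during which no outage was recorded."""
    window = timedelta(days=window_days)
    window_start = window_end - window
    down = timedelta()
    for outage in outages:
        # Count only the part of each outage that falls inside the window.
        overlap = min(outage.end, window_end) - max(outage.start, window_start)
        if overlap > timedelta():
            down += overlap
    return 100.0 * (1 - down / window)


# Example: one hypothetical 45-minute outage in the last 90 days.
now = datetime(2026, 4, 1, tzinfo=timezone.utc)
history = [Outage(now - timedelta(days=10), now - timedelta(days=10) + timedelta(minutes=45), "host restart")]
print(f"{uptime_percent(history, now):.3f}% uptime over 90 days")
```

For scale, a 90-day window contains 129,600 minutes, so staying above 99.5% means keeping total downtime under 648 minutes, or about 10.8 hours.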
How AliveMCP handles this
Every MCP server on the public dashboard gets a 90-day rolling downtime history, visible at /status/<server-slug>. Outages are marked on a timeline with confirmed-down windows and recovery timestamps. Claim your listing on the Author tier ($9/mo) to unlock custom alert webhooks and a verified-author badge. The Team tier ($49/mo) adds a public status-page subdomain (yourserver.alivemcp.com/status) that you can link from your documentation. See the live dashboard or compare plans.
Related questions
How long does a typical MCP server outage last?
From the AliveMCP dashboard data: the median confirmed outage is 8 minutes (usually a container restart or a brief deploy). The long tail is what matters — outages lasting more than an hour are typically caused by hosting provider incidents, expired credentials, or a broken deploy that wasn't noticed because no monitoring was in place.
Should I have a maintenance window?
For most indie MCP authors, deploys are fast enough that you don't need a formal maintenance window. For servers with paying customers, it's worth coordinating: post a maintenance notice to your status page an hour in advance, keep the window under 10 minutes, and post a recovery confirmation when you're done.
What should I post to my status page during an outage?
Keep it factual: the start time, what you know about the cause, and your best estimate of resolution. Update it every 15-30 minutes even if there's nothing new to report — "still investigating" is better than silence. When resolved, post the root cause in 1-2 sentences.