Deep dive · 2026-04-24 · Failure taxonomy
Why MCP servers die silently — 7 failure modes from 2,181 endpoints
When we probed every public Model Context Protocol endpoint in April 2026, only 9% answered correctly. The other 91% broke in seven distinct ways. Some failures are loud — anyone with a browser can see them. Others are quiet enough that the author who shipped the server, the registry that lists it, and the agent that depends on it can all be looking right at the listing and miss the fact that it stopped working months ago. This post unpacks each failure mode with concrete examples from the dataset, what catches it, what doesn't, and the order to wire detection in.
TL;DR
Of 2,181 MCP endpoints across six public registries, 91% failed at least one of seven recurring modes. Loud failures (DNS lapsed, hosting slept, TLS expired) account for ~38% and any HTTP-aware monitor catches them. Quiet failures (route moved, half-configured auth, malformed JSON-RPC, schema drift) account for ~53% and require a protocol-aware probe to see. The failure mode authors most underestimate is schema drift — the server is up, returns 200, parses fine, and silently produces wrong answers. The fix is the same in every case: a real initialize + tools/list probe, on a 60-second cadence, with the previous tool-list hash held alongside the response.
The spectrum: loud death vs quiet death
Before walking through the seven modes, here is the framing that matters: not all failures are equally visible. Plot them on a "how loud is this when it breaks" axis and a clear pattern emerges. A DNS lapse is loud — every browser, every curl, every monitor sees the same NXDOMAIN. A schema drift is quiet — the server returns 200 OK with a parseable body, and only an MCP-aware client comparing the new tool list against the previous one notices that the schema_version jumped or that required: ["query"] appeared on a tool that used to take no arguments.
The first half of this post is the loud half. If your server is here, you'll know about it within hours of it breaking, because something else will alert you — usually a user. The second half is the quiet half, where the only alert is the one you wired yourself, against an MCP-aware probe, before the failure happened. We saw 26.7% of all endpoints in the "HTTP alive, MCP dead" bucket — every one of those is in the quiet half, and every one of those would have shown green on UptimeRobot the day we caught them. We covered the headline numbers and the per-registry breakdown in the Q2 2026 registry audit; this post is the deep dive into how each mode actually presents.
Mode 1 — DNS lapsed
How it presents: dig +short your-server.com returns nothing. Every probe, in every region, fails the same way: the host doesn't resolve. This was the single largest sub-bucket in our 38.3% transport-dead pile.
Why it happens: An MCP server gets shipped on a personal project domain — my-cool-mcp.dev, registered for one year on the same card the author uses for everything. Twelve months later the auto-renewal fails (expired card, declined transaction, email to an inbox the author no longer reads), the registrar grace window passes, and the domain drops. The MCP listing on three registries still points at it. The server itself, on whatever host, is probably still running and could probably still answer. But there is no name pointing at the IP any more.
Concrete example from the Q2 dataset: 19 of the dead endpoints we scanned shared a registrar that had been part of a 2025 acquisition. Their domains had all expired in a 6-week window after the acquired registrar's auto-renewal flow broke during the integration. Every one of those servers was discoverable on at least two registries. Every one of those listings was clicked, on average, between 4 and 11 times during our scan window — by users who got an unhelpful browser error and moved on.
What catches it: any monitor that does a real DNS lookup on each probe. Most do.
What doesn't: a monitor that resolves the hostname once at config time and then probes the cached IP. There aren't many of those, but they exist.
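For illustration, a minimal sketch of what "resolve on every probe" looks like in Node; the function name and error handling here are ours, not part of any monitor's API:

```typescript
import { resolve4 } from "node:dns/promises";

// Resolve the hostname on every probe instead of caching the IP at config time.
// A lapsed domain shows up as ENOTFOUND on the very next probe.
async function hostStillResolves(hostname: string): Promise<boolean> {
  try {
    const addresses = await resolve4(hostname);
    return addresses.length > 0;
  } catch (err: any) {
    if (err.code === "ENOTFOUND" || err.code === "ENODATA") return false;
    throw err; // a transient resolver error is a different signal than a dropped domain
  }
}
```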
How to prevent it: register the domain on a payment method that will still work at renewal, and set a calendar reminder two weeks before expiry, independent of the registrar's emails. The two-week window is wide enough to catch a registrar's email going to spam.
Mode 2 — Free-tier hosting slept or got reaped
How it presents: DNS resolves. TCP connect succeeds. But the server takes 30+ seconds to respond on the first request and then returns a generic platform error page. Or — depending on the platform — the connection is refused, or the host returns a 404 that is clearly the platform's, not yours.
Why it happens: The MCP was shipped on a free tier — Render, Railway, Fly, Vercel, a Supabase Edge Function — and the conditions for the free tier eventually changed. The container went to sleep after 15 minutes of inactivity (and now wakes too slowly to satisfy MCP clients with short timeouts). The project hit the monthly credit cap. A region migration moved the app to a new instance type where the old config no longer boots. Or the platform reaped inactive projects in a quarterly clean-up and you missed the email.
Concrete example from the Q2 dataset: 73 endpoints we marked transport-dead returned platform-branded 404 pages — Render, Railway, Fly, and Vercel boilerplate, in that order of frequency. Eleven more answered with valid HTML but a "this app is sleeping, click to wake" page that no MCP client would ever click. None of those servers had an obvious "I am hosted on a free tier" tell in the registry listing, so a downstream agent had no way to weight reliability.
What catches it: a monitor that actually parses the response body (not just the status code). HTTP 200 with platform-branded body content is a real failure.
What doesn't: a monitor that only watches HTTP status. Many platform sleep pages return 200 with a "service unavailable" body, which is a status code lie.
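As a sketch of what "parse the body" means in practice, a check like the one below is enough to catch platform boilerplate; the tell strings are illustrative assumptions and would need maintaining as platforms change their pages:

```typescript
// Illustrative tell strings; real sleep and 404 pages vary by platform and change over time.
const PLATFORM_TELLS = [
  "this app is sleeping",
  "application error",
  "no such app",
  "project not found",
];

async function looksLikePlatformPage(url: string): Promise<boolean> {
  const res = await fetch(url);
  const body = (await res.text()).toLowerCase();
  // A 200 whose body is platform boilerplate is still a failure for an MCP client.
  return PLATFORM_TELLS.some((tell) => body.includes(tell));
}
```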
How to prevent it: assume the free tier is a personal-project tier. If your MCP gets cited from anywhere public, move it to the cheapest paid tier the day that happens. Set a budget alert at 80% of the cap so you find out before the platform decides for you.
Mode 3 — TLS certificate expired
How it presents: DNS resolves. TCP on 443 connects. The TLS handshake fails with certificate has expired or — more often — certificate verify failed: unable to get local issuer certificate, because the chain itself broke when an intermediate CA was rotated.
Why it happens: Three sub-cases dominate. (1) The server is behind a CDN and the CDN's auto-renewal flow needs a DNS record the author moved or revoked. (2) The cert is provisioned by Let's Encrypt with a cron job that hasn't run since the server's last reboot. (3) The cert is fine but the server is serving a default Nginx cert because the vhost config was lost in a deploy.
Concrete example from the Q2 dataset: 41 endpoints failed exclusively at the TLS layer. Of those, 12 had a cert that had expired in the previous 90 days — a pattern consistent with a renewal cron that ran one cycle and then stopped. Modern browsers and most MCP clients will refuse to talk to those servers; the author has no way to know unless someone tells them.
What catches it: any monitor that performs a real HTTPS handshake (most do). External cert-expiry monitors (Let's Monitor, certinel, etc.) will alert before expiry rather than after.
What doesn't: a monitor that probes the HTTP-only port (80) and not 443. Or a monitor that disables certificate verification — fine for diagnostics, dangerous as a default.
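A minimal sketch of an expiry check in Node, assuming you want the alert before the deadline rather than after; an already-expired cert or broken chain fails the handshake itself and lands in the error handler, which is exactly the signal you want:

```typescript
import { connect } from "node:tls";

// Days until the served certificate expires; alert when this drops below 14.
function daysUntilCertExpiry(host: string, port = 443): Promise<number> {
  return new Promise((resolve, reject) => {
    const socket = connect({ host, port, servername: host }, () => {
      const cert = socket.getPeerCertificate();
      socket.end();
      resolve((new Date(cert.valid_to).getTime() - Date.now()) / 86_400_000);
    });
    // Expired certs and broken chains fail the handshake and surface here.
    socket.on("error", reject);
  });
}
```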
How to prevent it: use a cert source with built-in renewal that you don't have to think about (a CDN with managed certs is the simplest). If you run Let's Encrypt yourself, add a separate cert-expiry alert with a 14-day window — independent of your renewal pipeline.
Mode 4 — Route moved without a redirect
How it presents: Server is up. TLS is fine. The registry listing's URL returns 404. POST to /mcp returns 404. The server is alive at some other path — /v1/mcp, or /mcp/sse, or /api/mcp — that the author moved to in a refactor and never went back to update the registries with.
Why it happens: An MCP author ships v1, gets it listed, then six weeks later refactors the routing because they're adding a v2 path and want a clean prefix. They update their own README, but a registry listing edit is one more chore on a quiet Tuesday and it never makes it onto the to-do list. Anyone hitting the registry's URL gets a 404; the author's friends, who follow the README, get the right path and never know there's a problem.
Concrete example from the Q2 dataset: 88 endpoints in the HTTP-alive bucket returned a 404 on the registry-supplied path but, when we tried /mcp, /v1/mcp, and /api/mcp as common alternates, returned a valid initialize response on one of them. Those servers are alive, just unfindable from the registries.
What catches it: a probe that exercises the exact registry-supplied URL. The whole point of the failure is that the listing is wrong; testing your own server from your own README will look fine.
What doesn't: any monitor where the URL was only ever entered once, by you, from your README. The agent that finds you via a registry will hit the registry's path, not yours. For walking through which layer is failing in a "not responding" report, see the diagnostic ladder in MCP endpoint not responding.
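A sketch of how this is probed, assuming the server speaks the HTTP POST transport; the alternate paths are the three we tried in the audit, and the protocolVersion pin is whatever spec revision your client actually targets:

```typescript
const INITIALIZE = {
  jsonrpc: "2.0",
  id: 1,
  method: "initialize",
  params: {
    protocolVersion: "2025-03-26", // assumption: pin the revision your client targets
    capabilities: {},
    clientInfo: { name: "probe", version: "0.1.0" },
  },
};

// Try the exact registry-supplied URL first; only then fall back to common alternates.
async function findLivePath(registryUrl: string): Promise<string | null> {
  const origin = new URL(registryUrl).origin;
  const candidates = [registryUrl, ...["/mcp", "/v1/mcp", "/api/mcp"].map((p) => origin + p)];
  for (const url of candidates) {
    const res = await fetch(url, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        accept: "application/json, text/event-stream",
      },
      body: JSON.stringify(INITIALIZE),
    });
    if (res.ok) return url; // the route answers; validate the body as a separate step
  }
  return null;
}
```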
How to prevent it: when you change a route, leave a permanent redirect in place from the old path for at least one quarter, and edit the registries the same day you ship the change. Better: never delete an old MCP path; alias it.
Mode 5 — Auth configured half-way
How it presents: initialize succeeds. tools/list returns a populated tool array. Every individual tools/call returns 401 or JSON-RPC error -32001. The server looks alive — it answers, it advertises tools — but those tools are not callable without a credential the registry never advertised.
Why it happens: The author wanted public discovery (so people would find their server in registries) but private execution (so random callers couldn't burn their downstream OpenAI bill). They turned on bearer-token auth at the tool-call layer, exposed initialize + tools/list for crawlers, and then forgot to publish anywhere how to actually get a token. From a tool-caller's perspective the server is up and useless.
Concrete example from the Q2 dataset: 366 endpoints — 16.8% of all listings — fell in this bucket. We counted them as dead because that is what they look like to a real user, but it's the bucket with the most ambiguity: an unknown share are intentional private servers that a registry listed in error. We're conservative in flagging them and would estimate the true "broken auth" share at 12-14% and the "intentional private" share at 3-4%, but it varies by registry.
What catches it: any probe that goes past tools/list and actually invokes one cheap, idempotent tool. We call a list_resources tool, or whatever the cheapest discovery-mode call the server advertises. If that returns 401, the server isn't dead — but for the audience that finds it via a public registry, it might as well be.
What doesn't: any probe that stops at initialize. initialize is the easy part, precisely because it's where most servers permit anonymous access.
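A sketch of the classification, assuming a hypothetical rpc helper that sends one request over your already-initialized session and hands back the HTTP status plus the parsed body:

```typescript
type JsonRpcError = { code: number; message: string };
type JsonRpcResponse = { jsonrpc: string; id: number; result?: unknown; error?: JsonRpcError };
type RpcCall = (method: string, params: object) => Promise<{ status: number; body: JsonRpcResponse }>;

// Invoke one cheap, idempotent tool and decide whether the server is actually
// callable or merely discoverable behind an auth wall.
async function classifyToolAccess(rpc: RpcCall, toolName: string): Promise<"callable" | "auth-walled"> {
  const { status, body } = await rpc("tools/call", { name: toolName, arguments: {} });
  if (status === 401 || body.error?.code === -32001) return "auth-walled";
  return "callable";
}
```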
How to prevent it: if you want a public listing, advertise the auth model in the listing's description or in your serverInfo metadata. If the server is genuinely private, take it off public registries. The half-way state is the worst of both worlds.
Mode 6 — Malformed JSON-RPC
How it presents: The server returns 200 OK. The body is JSON. But the JSON is missing required JSON-RPC 2.0 fields — no jsonrpc: "2.0" envelope, no id echoed back from the request, or a result that's a flat string where it should be an object. Or the response uses old MCP protocol shapes that drifted away from spec — tools as a top-level array instead of inside result.tools; inputSchema on tools written as schema; or version strings that no longer match any released MCP spec.
Why it happens: The server was hand-rolled — written from a blog post that was up-to-date 12 months ago, or generated by an LLM from a prompt that didn't include the current spec link. Or the SDK is real but two major versions out of date and the breaking changes between versions never propagated. Either way the server is honestly trying to be an MCP server; it just doesn't quite speak the protocol any current client will accept.
Concrete example from the Q2 dataset: Of 200 endpoints in the schema-malformed bucket, the most common single defect (54 servers) was tools returned as a top-level array rather than nested in result.tools — a shape from an early MCP draft that was changed and never deprecated loudly. A close second (38 servers) was an inputSchema that was a bare JSON string instead of a JSON Schema object — usually because the author wrote JSON.stringify(schema) in their handler.
What catches it: a probe that validates the response against the current MCP spec. We use the official JSON Schema for the protocol envelope and a smaller hand-rolled validator for the MCP-specific shapes inside it. The probe sequence is laid out in the MCP server health check guide.
What doesn't: any monitor that checks for status code 200 and a non-empty body. By that bar, every malformed server passes.
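For illustration, the structural checks for the two defects above reduce to a few lines; this is a minimal sketch, not a full spec validator, and the real probe validates the envelope against the protocol's JSON Schema:

```typescript
// Minimal structural checks for the two defects called out above; not a full spec validator.
function checkToolsListShape(requestId: number, body: any): string[] {
  const problems: string[] = [];
  if (body?.jsonrpc !== "2.0") problems.push('missing jsonrpc: "2.0" envelope');
  if (body?.id !== requestId) problems.push("id not echoed back from the request");
  if (Array.isArray(body?.tools)) problems.push("tools is a top-level array instead of result.tools");
  const tools = body?.result?.tools;
  if (!Array.isArray(tools)) {
    problems.push("result.tools missing or not an array");
    return problems;
  }
  for (const tool of tools) {
    if (typeof tool.inputSchema === "string") {
      problems.push(`tool ${tool.name}: inputSchema is a JSON string, not a schema object`);
    }
  }
  return problems;
}
```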
How to prevent it: keep the SDK current. If you don't use an SDK, wire a CI check that runs the official MCP test client against your server before merging anything that touches the request/response layer.
Mode 7 — Schema drift
How it presents: The server is up. It speaks the protocol. tools/list returns a clean, valid tool array. The shape of that tool array, however, has changed in a breaking way since the client that integrated against it last cached it. A field name renamed from q to query. A previously optional field now required. A tool removed entirely. A new x-deprecated: true annotation on what used to be the primary call.
Why it happens: Schemas evolve. The MCP spec doesn't enforce versioning on tools, so authors add and remove inputs in the same lifecycle they add and remove tools — which is to say, freely. There is no client-visible contract telling the agent platform on the other side that the shape it cached at integration time is no longer the shape the server returns.
Concrete example from the Q2 dataset: We probe each endpoint three times, 24 hours apart, and we hash the tools/list response on each probe. Across the 196 healthy servers, 14 had a non-trivial hash change between the first and third probe — a 7.1% drift rate over 48 hours. Extrapolated naively to 30 days, that's a ~50% chance any given healthy MCP will have shifted its tool surface by the time you next look. None of these were obvious version bumps; none had patch-note URLs in the response. The tool list just changed.
What catches it: a probe that hashes the tool list, stores the hash alongside each probe, and alerts when the hash changes between probes — even if the response is otherwise valid. This is the single most-requested alert in our author waitlist conversations: "tell me when my tool list changes shape, even if I'm the one who changed it."
What doesn't: any monitor that doesn't have a memory between probes. Stateless HTTP probes by definition cannot detect drift; they have nothing to compare against.
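A minimal sketch of drift hashing; the only non-obvious part is canonicalizing key order so a harmless re-serialization doesn't read as drift:

```typescript
import { createHash } from "node:crypto";

// Sort object keys before hashing so key ordering alone never looks like drift.
function hashToolList(tools: object[]): string {
  const canonical = JSON.stringify(tools, (_key, value) =>
    value && typeof value === "object" && !Array.isArray(value)
      ? Object.fromEntries(Object.entries(value).sort(([a], [b]) => a.localeCompare(b)))
      : value,
  );
  return createHash("sha256").update(canonical).digest("hex");
}

// previousHash comes from whatever store the probe keeps between runs.
function toolListDrifted(previousHash: string | undefined, tools: object[]): boolean {
  return previousHash !== undefined && previousHash !== hashToolList(tools);
}
```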
How to prevent it: version your tools at the schema level (x-version on each tool), keep a changelog endpoint at /mcp/changelog that any client can fetch, and treat any breaking schema change the same way you'd treat a major-version bump in a public package.
What catches what — the detection ladder
Mapping the seven modes against the kind of monitor most likely to catch them:
| Failure mode | Plain HTTP monitor | HTTPS + body parse | JSON-RPC probe | MCP-aware probe + drift hashing |
|---|---|---|---|---|
| 1. DNS lapsed | Yes | Yes | Yes | Yes |
| 2. Free-tier hosting slept | Sometimes | Yes | Yes | Yes |
| 3. TLS expired | No (if probing :80) | Yes | Yes | Yes |
| 4. Route moved without redirect | Yes (404) | Yes | Yes | Yes |
| 5. Auth half-configured | No | No | No | Yes |
| 6. Malformed JSON-RPC | No | No | Yes | Yes |
| 7. Schema drift | No | No | No | Yes |
The pattern: each rung up the ladder catches a strict superset of the previous one. A plain HTTP monitor catches the loud failures (modes 1-4 with caveats). An MCP-aware probe with drift hashing catches every mode in the dataset. The 53% of endpoints that were "alive but broken" map almost exactly to the modes that the bottom three monitor types miss — which is why "is your server up?" is the wrong question and "is your server up and answering MCP correctly?" is the right one. The compact form of this table — and what each price tier actually buys — is captured in UptimeRobot vs AliveMCP.
The 60-second rule
Detection cadence matters as much as detection coverage. Most of these failure modes share a property that makes them especially destructive in agent contexts: they fail on the first call after an event (a deploy, a cert rotation, a free-tier nap), and then keep failing the same way until something fixes them. Once your MCP is broken, every minute you don't know about it is a minute of bad agent answers, dropped sessions, or — worst case — silent wrong answers from drift.
We run our own probe loop on a 60-second cadence for every endpoint in every public registry. The reason is not that 60 seconds is magic; it's that anything coarser than that loses the ability to catch a 15-minute deploy regression before users see it. A six-hour cron — what most authors wire when they wire anything — typically lets a regression run for hours before anyone gets paged.
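For a single endpoint you run yourself, the loop itself is trivial; probeOnce below is a hypothetical stand-in for the initialize + tools/list + validation + drift-hash sequence sketched throughout this post:

```typescript
// A minimal 60-second loop for one endpoint. probeOnce stands in for the
// full initialize + tools/list + validation + drift-hash sequence.
function startProbeLoop(
  url: string,
  probeOnce: (url: string) => Promise<string[]>,
  onFailure: (reason: string) => void,
) {
  return setInterval(async () => {
    try {
      const problems = await probeOnce(url);
      if (problems.length > 0) onFailure(problems.join("; "));
    } catch (err: any) {
      onFailure(`transport: ${err.code ?? err.message}`); // DNS, TCP, and TLS failures land here
    }
  }, 60_000);
}
```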
For a single server you operate yourself, a 60-second probe is realistic to wire by hand if you're already running a small monitoring stack. For "every endpoint in every registry," it is not realistic to wire by hand, which is why we run a hosted version: discovery + probe + alerting + history, sub-$50/month even at the Team tier, free at the Public tier. The point isn't that you have to use ours; the point is that anyone serious about an MCP-dependent agent platform needs probes at this cadence somewhere in the stack.
What to do this week if you ship MCP
Three concrete things, in order:
- Run the probe from check if an MCP server is alive against your own server. Not in your dev env — against the production URL listed in the registries you submitted to. Every author we've talked to who has been shipping MCPs for more than three months has been surprised by something the probe found.
- Set up alerts on schema drift specifically. Not just on uptime. Of the seven failure modes, schema drift is the one most likely to bite a healthy server. If you don't want to wire it yourself, plug a webhook into AliveMCP's Author tier — Slack alerts on drift events ship that day.
- Fix any registry listings that are out of date. Walk every registry your server is on and verify the URL still resolves to your /mcp endpoint. If it doesn't, edit the listing the same day. The 88 servers we found alive at an alternate path could have been recovered with 10 minutes of cleanup work.
What we'll do next
The Q3 2026 audit is scheduled for mid-July. We'll re-run the same methodology, publish the new headline numbers, and report which of the seven modes shifted. Our hope is that the loud-half failures (DNS, hosting, TLS) are flat or down — these are the ones a single ecosystem-wide push (a "claim your listing and verify its URL" event) could fix in a quarter. The quiet-half failures (auth, malformed JSON-RPC, schema drift) likely move slower because they require tooling adoption, not just hygiene.
If you operate an MCP and want a heads-up the next time your server slips into one of these modes, the easiest path is to claim it on the public dashboard. Free for the public-tier alert; $9/mo if you want to add a Slack or webhook target.
Further reading
- State of the MCP Registry — Q2 2026 — the audit this taxonomy is grounded in.
- MCP endpoint not responding — diagnostic checklist — the layer-by-layer ladder for diagnosing a specific failing endpoint.
- MCP server health check — probe sequence explained — the exact probe sequence we run every 60 seconds.
- How to monitor an MCP server — full setup guide — wiring detection from scratch.
- MCP server Slack alerts — alert tiers and payload shape — how to actually be notified.
- UptimeRobot vs AliveMCP — why HTTP pings miss 27% of failures — what each tier of monitor catches.