
MCP server availability

Availability means something more precise for MCP servers than it does for web applications. An MCP server can be "up" by every HTTP metric — accepting connections, returning 200, passing load balancer health checks — and still be unavailable to the agents that depend on it. Protocol-level availability is what matters, and it's not what standard uptime monitors measure.

TL;DR

MCP server availability has two layers. Transport availability: the server accepts TCP connections and completes the TLS handshake. Protocol availability: the server handles the JSON-RPC initialize request and returns a valid tools/list response. Both layers must be up for an agent to use the server. AliveMCP measures both in a single probe sequence. Join the waitlist to see the real availability of your server, not just the HTTP layer.

Two availability layers, two failure modes

Standard uptime monitors work at the transport layer: they open a TCP connection (or send an HTTP GET to a health endpoint) and record whether the server responded with an acceptable status code. This catches a server that is completely down — process crashed, network unreachable, certificate expired.

MCP availability requires the protocol layer too. A server whose MCP router is broken but whose HTTP listener is intact will pass every transport-layer check. Concretely:

The HTTP health check endpoint (GET /health) returns 200, but the MCP endpoint (POST /mcp) returns 500 or an invalid JSON-RPC response.
The initialize request succeeds, but tools/list returns an empty array because the tool registry is disconnected.
The tools/list response returns the expected tools, but one tool's inputSchema has been corrupted by a deploy, so agents that try to call it fail at the schema validation step.

Each of these is an availability failure for the agents depending on the server, even though the server is "up" by conventional monitoring definitions. The silent failure modes of MCP servers are almost entirely in this second layer — the protocol surface that transport-level monitors miss.
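The protocol-layer failures listed above can be caught by inspecting the tools/list response body instead of the HTTP status. The following is a minimal sketch, not AliveMCP's implementation: the function name check_tools_list and the expected_tools parameter are hypothetical, and the schema check assumes that each tool's inputSchema is a JSON Schema object whose type is "object".

```python
from typing import Any, Optional


def check_tools_list(response: Any, expected_tools: set[str]) -> Optional[str]:
    """Return None if a tools/list response looks healthy, else a reason string."""
    # A JSON-RPC 2.0 success response carries "jsonrpc": "2.0" and a "result" member.
    if not isinstance(response, dict) or response.get("jsonrpc") != "2.0":
        return "invalid JSON-RPC response"
    if "error" in response or "result" not in response:
        return "JSON-RPC error response"

    result = response["result"]
    tools = result.get("tools") if isinstance(result, dict) else None
    if not isinstance(tools, list):
        return "result.tools is not a list"
    if not tools:
        # Example: the tool registry is disconnected.
        return "tools list is empty"

    names = set()
    for tool in tools:
        if not isinstance(tool, dict) or not isinstance(tool.get("name"), str):
            return "tool entry is missing a string name"
        schema = tool.get("inputSchema")
        # Assumption: a usable inputSchema is an object schema.
        if not isinstance(schema, dict) or schema.get("type") != "object":
            return f"tool {tool['name']!r} has a malformed inputSchema"
        names.add(tool["name"])

    missing = expected_tools - names
    if missing:
        return "missing tools: " + ", ".join(sorted(missing))
    return None
```

A transport-level monitor would report all of these cases as healthy, because each one can arrive with a 200 status code.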

SLA math for MCP servers

Availability SLAs are expressed as a percentage of time the service is available, measured over a rolling window. The arithmetic is straightforward; the implications for agent workloads are worth understanding concretely.

Availability	Downtime per year	Downtime per month	Downtime per week
99.99%	52 minutes	4 minutes	1 minute
99.9%	8 hours 46 min	43 minutes	10 minutes
99.5%	43 hours 49 min	3 hours 39 min	50 minutes
99.0%	87 hours 39 min	7 hours 18 min	1 hour 41 min
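The table values follow from one formula: allowed downtime equals the unavailable fraction multiplied by the length of the window. The sketch below assumes a 365.25-day year and an average month of one twelfth of that; those conventions, and the names used, are assumptions, so results can differ from the table by a minute depending on rounding.

```python
MINUTES_PER_DAY = 24 * 60

# Assumed window lengths in days (average year and average month).
WINDOW_DAYS = {"year": 365.25, "month": 365.25 / 12, "week": 7}


def allowed_downtime_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of downtime permitted in a window at a given availability target."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * window_days * MINUTES_PER_DAY


# Print one row per availability target, in minutes per window.
for pct in (99.99, 99.9, 99.5, 99.0):
    row = {name: round(allowed_downtime_minutes(pct, days), 1)
           for name, days in WINDOW_DAYS.items()}
    print(pct, row)
```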

For web applications, 99.9% availability is considered good. For agent-facing MCP servers, the framing is different. An AI assistant that makes 50 tool calls per day across all users makes about 1,500 calls per month and will, at 99.9% availability, experience roughly 1–2 failed tool calls per month due to downtime (1,500 × 0.1% = 1.5), assuming calls are spread evenly over time. Each failed tool call either causes a degraded agent response or surfaces as an error to the user, depending on how the agent framework handles tool failures.

More critically, MCP downtime tends to cluster. A server doesn't go down for 43 minutes spread uniformly across the month — it goes down for 43 minutes in a single incident. That incident window will fail 100% of tool calls from every connected agent. A single mid-day 43-minute outage affects every user who happened to use the agent during that window.
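The difference between average-case and clustered impact can be made concrete with a small calculation. This is a sketch under stated assumptions: the hourly traffic profile is hypothetical, the outage is assumed to start on the hour, and every call that falls inside the outage window is assumed to fail.

```python
def expected_failed_calls(calls_per_day: float, availability_pct: float,
                          days: int = 30) -> float:
    """Average-case estimate: calls and downtime both spread evenly over time."""
    return calls_per_day * days * (1 - availability_pct / 100)


def failed_calls_in_incident(calls_per_hour: list[float], start_hour: int,
                             minutes: float) -> float:
    """Calls lost to one contiguous outage that starts on the hour.

    calls_per_hour holds 24 hypothetical hourly call counts. Calls within an
    hour are assumed to be evenly spread across that hour.
    """
    failed = 0.0
    hour = start_hour
    remaining = minutes
    while remaining > 0:
        covered = min(60.0, remaining)  # minutes of this hour inside the outage
        failed += calls_per_hour[hour % 24] * covered / 60
        remaining -= covered
        hour += 1
    return failed


# Hypothetical profile: all traffic arrives between 09:00 and 17:00.
profile = [0.0] * 9 + [6.0] * 8 + [0.0] * 7
print(failed_calls_in_incident(profile, start_hour=12, minutes=43))  # mid-day outage
print(failed_calls_in_incident(profile, start_hour=3, minutes=43))   # overnight outage
```

The same 43 minutes of downtime costs nothing overnight in this profile and several calls at mid-day, which is why the timing of an incident matters as much as its length.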

Availability budget allocation

If you have a 99.9% monthly availability target, that's 43 minutes of allowed downtime. Allocating that budget:

Planned maintenance windows: a deploy that causes a 2-minute restart window uses about 5% of the monthly budget. Three such deploys per month use 6 minutes, or about 14% of the budget, on planned work.
Dependency outage buffer: if your MCP server depends on a third-party API with a 99.5% SLA, that dependency alone accounts for up to 219 minutes of potential downtime per month, roughly five times the entire 43-minute budget. This is measured at the API tier; not all of it propagates to your server if you have caching or graceful degradation.
Unplanned incident reserve: what's left for actual incidents.
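The allocation above can be worked through as a small calculation. The sketch below uses a 30-day month, and the function and field names are hypothetical. The dependency figure is a worst case that assumes none of the dependency's downtime is absorbed by caching or graceful degradation.

```python
def budget_breakdown(target_pct: float, window_days: float, deploys: int,
                     restart_minutes: float, dependency_sla_pct: float) -> dict:
    """Split an availability budget into planned, dependency, and reserve parts."""
    window_minutes = window_days * 24 * 60
    budget = (1 - target_pct / 100) * window_minutes
    planned = deploys * restart_minutes
    # Worst case: every minute of dependency downtime reaches your server.
    dependency_worst_case = (1 - dependency_sla_pct / 100) * window_minutes
    return {
        "budget_minutes": budget,
        "planned_minutes": planned,
        "planned_share_pct": 100 * planned / budget,
        "dependency_worst_case_minutes": dependency_worst_case,
        "reserve_after_planned_minutes": budget - planned,
    }


report = budget_breakdown(target_pct=99.9, window_days=30, deploys=3,
                          restart_minutes=2, dependency_sla_pct=99.5)
for key, value in report.items():
    print(f"{key}: {value:.1f}")
```

If the dependency's worst case exceeds the whole budget, as it does with these inputs, the target is only reachable when the server can keep answering while the dependency is down.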

A 99.9% target is achievable without redundancy for many low-traffic MCP servers. A 99.99% target requires a deployment architecture with zero-downtime deploys, cross-region failover, or both.

How availability is measured

Probe-based external measurement

The most operationally accurate way to measure MCP availability is from outside the server, from the perspective of an agent trying to connect. This is what AliveMCP does: a probe runs every 60 seconds, completing the full MCP handshake sequence (TLS, initialize, tools/list) and recording whether each step succeeded.
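For illustration, the protocol part of such a probe comes down to a short sequence of JSON-RPC 2.0 messages. The sketch below only builds the messages and leaves transport out. The client name, the version string, and the function name are assumptions, and the protocolVersion default should be replaced with a revision your server supports. It is not a description of AliveMCP's internal probe.

```python
import json


def build_probe_requests(protocol_version: str = "2025-03-26") -> list[dict]:
    """JSON-RPC 2.0 messages for one probe: initialize, initialized, tools/list."""
    return [
        {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": protocol_version,
                "capabilities": {},
                # Hypothetical client identity for the probe.
                "clientInfo": {"name": "example-probe", "version": "0.1.0"},
            },
        },
        # A notification has no "id"; it tells the server initialization is complete.
        {"jsonrpc": "2.0", "method": "notifications/initialized"},
        {"jsonrpc": "2.0", "id": 2, "method": "tools/list"},
    ]


for message in build_probe_requests():
    print(json.dumps(message))
```

A probe would send these in order and record the first step that fails or returns an invalid response.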

The rolling 30-day availability percentage displayed on the AliveMCP dashboard is calculated from probe results: successful_probes / total_probes × 100. A probe that fails at any layer (TLS, HTTP, initialization, tools list) counts as a failed probe, subject to the confirmation threshold described in the next section.
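The ratio can be sketched as follows. This computes the raw ratio only, before any confirmation filtering, and the layer names and record shape are hypothetical rather than AliveMCP's data model.

```python
from typing import Optional

# Probe steps in the order they run; a probe stops counting as successful
# at the first step that fails.
LAYERS = ("tls", "http", "initialize", "tools_list")


def first_failed_layer(step_ok: dict[str, bool]) -> Optional[str]:
    """Return the first layer that failed, or None if every layer succeeded."""
    for layer in LAYERS:
        if not step_ok.get(layer, False):
            return layer
    return None


def availability_pct(probes: list[dict[str, bool]]) -> float:
    """successful_probes / total_probes * 100 over a window of probe records."""
    if not probes:
        raise ValueError("no probe results in window")
    successful = sum(1 for probe in probes if first_failed_layer(probe) is None)
    return successful / len(probes) * 100
```

Recording the first failed layer alongside the pass or fail result is what lets a dashboard say where a check failed, not only that it failed.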

Confirmation threshold and availability calculation

A single failed probe is not enough to declare an outage. Transient network issues (a packet loss event, a brief DNS hiccup) can cause isolated probe failures that have no practical impact on agents. AliveMCP uses a three-consecutive-probe threshold before declaring a server down, and three consecutive successes to declare it recovered. This affects the availability math in a specific way:

A server that fails exactly one probe per hour (always isolated, never three in a row) will show 100% availability: the threshold filters out transient noise.
A server that has a genuine 3-minute outage once per week will fail three consecutive probes during that outage window and record approximately 3 minutes of measured downtime per week: the threshold captures genuine outages accurately.

This confirmation model prevents alert fatigue (too many false positives) while remaining sensitive to real outages. It does mean that a very short outage (under 3 minutes) can go undetected. If your availability budget requires detecting sub-3-minute outages, you need a higher-frequency probe cadence than 60 seconds — or a secondary monitoring layer at the application level.
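The confirmation model behaves like a small state machine. The sketch below assumes the server starts in the "up" state and that downtime is counted only while the confirmed state is "down"; both are assumptions for illustration, since the exact accounting is not specified here.

```python
def confirmed_states(probes: list[bool], threshold: int = 3) -> list[str]:
    """Map raw probe results (True = success) to confirmed states.

    The state flips only after `threshold` consecutive probes contradict it.
    """
    state = "up"  # assumed starting state
    streak = 0    # consecutive probes that contradict the current state
    states = []
    for ok in probes:
        contradicts = (state == "up" and not ok) or (state == "down" and ok)
        streak = streak + 1 if contradicts else 0
        if streak >= threshold:
            state = "down" if state == "up" else "up"
            streak = 0
        states.append(state)
    return states


def measured_downtime_minutes(probes: list[bool], interval_seconds: int = 60) -> float:
    """Minutes spent in the confirmed 'down' state."""
    return confirmed_states(probes).count("down") * interval_seconds / 60


# One isolated failure per hour for a day: never three in a row.
isolated = ([False] + [True] * 59) * 24
print(measured_downtime_minutes(isolated))

# A 3-minute outage: three consecutive failed probes, then recovery.
outage = [True, False, False, False, True, True, True]
print(confirmed_states(outage))
```

With these assumptions, an outage of one or two failed probes never changes the confirmed state, which is the blind spot described above.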

Availability vs reliability

These terms are often used interchangeably but have distinct meanings in operational contexts. Availability is the fraction of time the service is up (binary: is it reachable and functional?). Reliability includes latency and error rate while the service is up. A server with 99.9% availability but a p95 latency of 8 seconds when it is up is highly available but not very reliable. Both metrics matter for agent-facing services; availability is the floor, reliability is the ceiling.
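Both numbers can be derived from the same probe records. This sketch assumes each record is a hypothetical (succeeded, latency_seconds) pair and uses the nearest-rank method for the percentile, which is one of several common definitions.

```python
def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no values")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100) using integer math
    return ordered[rank - 1]


def summarize(probes: list[tuple[bool, float]]) -> dict:
    """Availability over all probes; p95 latency over successful probes only."""
    latencies = [latency for ok, latency in probes if ok]
    return {
        "availability_pct": 100 * len(latencies) / len(probes),
        "p95_latency_s": percentile_nearest_rank(latencies, 95),
    }


# Highly available but slow: 999 of 1000 probes succeed, each taking 8 seconds.
print(summarize([(True, 8.0)] * 999 + [(False, 0.0)]))
```

Latency is computed only over successful probes, so a slow server and an unreachable server show up in different metrics.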

Communicating availability to MCP consumers

If other teams, third-party integrations, or external agents depend on your MCP server, availability data needs to be communicated externally. The standard mechanism is a public status page: a URL that shows current server status, historical availability, and incident timelines.

AliveMCP generates a status page for every monitored server, showing:

Current status (operational, degraded, or down) with the layer at which the check failed
90-day rolling availability chart
Response time trend (30-day p50/p95)
Incident history with start time, end time, and duration

Linking to this page in your MCP server's README, in the MCP registry listing, and in your documentation gives agent operators a single source of truth for your server's availability record. See an example status page on the AliveMCP dashboard, or sign up to get one for your server.
