MCP server availability

Availability means something more precise for MCP servers than it does for web applications. An MCP server can be "up" by every HTTP metric — accepting connections, returning 200, passing load balancer health checks — and still be unavailable to the agents that depend on it. Protocol-level availability is what matters, and it's not what standard uptime monitors measure.

TL;DR

MCP server availability has two layers. Transport availability: the server accepts TCP connections and completes the TLS handshake. Protocol availability: the server handles the JSON-RPC initialize request and returns a valid tools/list response. Both layers must be up for an agent to use the server. AliveMCP measures both in a single probe sequence. Join the waitlist to see the real availability of your server, not just the HTTP layer.

Two availability layers, two failure modes

Standard uptime monitors work at the transport layer: they open a TCP connection (or send an HTTP GET to a health endpoint) and record whether the server responded with an acceptable status code. This catches a server that is completely down — process crashed, network unreachable, certificate expired.

MCP availability requires the protocol layer too. A server whose MCP router is broken but whose HTTP listener is intact will pass every transport-layer check. Concretely:

Each of these is an availability failure for the agents depending on the server, even though the server is "up" by conventional monitoring definitions. The silent failure modes of MCP servers are almost entirely in this second layer — the protocol surface that transport-level monitors miss.
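To make the two layers concrete, here is a minimal sketch of how a probe could classify the first failing layer. It is not AliveMCP's implementation. It assumes the transport step has already produced an HTTP status and two parsed JSON-RPC responses; the function names, the `clientInfo` values, and the protocol version string are illustrative assumptions. A real client would also send a `notifications/initialized` notification between the two requests.

```python
def build_initialize_request(request_id=1):
    """JSON-RPC 2.0 `initialize` request, shaped as an MCP client would send it."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            # Assumed version string; use one your server actually supports.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client identity for the probe.
            "clientInfo": {"name": "availability-probe", "version": "0.1.0"},
        },
    }


def build_tools_list_request(request_id=2):
    """JSON-RPC 2.0 `tools/list` request."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}


def is_valid_result(response, request_id):
    """True if `response` is a JSON-RPC 2.0 success reply to `request_id`."""
    return (
        isinstance(response, dict)
        and response.get("jsonrpc") == "2.0"
        and response.get("id") == request_id
        and "error" not in response
        and isinstance(response.get("result"), dict)
    )


def classify_probe(http_status, initialize_response, tools_list_response):
    """Return the first layer that failed, or "ok" if every step passed."""
    if http_status != 200:
        return "transport"
    if not is_valid_result(initialize_response, 1):
        return "initialize"
    if not is_valid_result(tools_list_response, 2):
        return "tools/list"
    # A usable tools/list result carries a `tools` array (it may be empty).
    if not isinstance(tools_list_response["result"].get("tools"), list):
        return "tools/list"
    return "ok"
```

A server that returns HTTP 200 with a JSON-RPC error body for `initialize` is classified as an `initialize` failure here, which is the kind of outage a status-code check alone would miss.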

SLA math for MCP servers

Availability SLAs are expressed as a percentage of time the service is available, measured over a rolling window. The arithmetic is straightforward; the implications for agent workloads are worth understanding concretely.

Availability   Downtime per year   Downtime per month   Downtime per week
99.99%         52 minutes          4 minutes            1 minute
99.9%          8 hours 46 min      43 minutes           10 minutes
99.5%          43 hours 49 min     3 hours 39 min       50 minutes
99.0%          87 hours 38 min     7 hours 18 min       1 hour 41 min
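The figures in the table follow from one multiplication. The sketch below reproduces them approximately. The window lengths (365, 30, and 7 days) are assumptions, so results can differ from the table by a minute or so depending on how a year or month is defined.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct, window_days):
    """Allowed downtime in minutes for an availability target over a window."""
    return (1 - availability_pct / 100) * window_days * MINUTES_PER_DAY


for target in (99.99, 99.9, 99.5, 99.0):
    year = downtime_budget_minutes(target, 365)
    month = downtime_budget_minutes(target, 30)
    week = downtime_budget_minutes(target, 7)
    print(f"{target}%: {year:.0f} min/year, {month:.1f} min/month, {week:.1f} min/week")
```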

For web applications, 99.9% availability is considered good. For agent-facing MCP servers, the framing is different. An AI assistant that makes 50 tool calls per day across all users makes about 1,500 calls per month and will, at 99.9% availability, experience roughly 1–2 failed tool calls per month due to downtime (1,500 × 0.1% = 1.5). Each failed tool call either causes a degraded agent response or surfaces as an error to the user, depending on how the agent framework handles tool failures.

More critically, MCP downtime tends to cluster. A server doesn't go down for 43 minutes spread uniformly across the month — it goes down for 43 minutes in a single incident. That incident window will fail 100% of tool calls from every connected agent. A single mid-day 43-minute outage affects every user who happened to use the agent during that window.
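The difference between averaged and clustered downtime can be sketched with two small functions. Both assume calls arrive at an even rate unless the hypothetical `traffic_multiplier` parameter says otherwise; it is an illustrative knob for modelling a mid-day peak, not a measured value.

```python
MINUTES_PER_DAY = 24 * 60


def expected_failed_calls(calls_per_day, days, availability_pct):
    """Average failed calls if downtime were spread evenly over the window."""
    return calls_per_day * days * (1 - availability_pct / 100)


def failed_calls_in_incident(calls_per_day, incident_minutes, traffic_multiplier=1.0):
    """Calls landing inside one outage window, all of which fail.

    traffic_multiplier is a hypothetical factor: 1.0 means average traffic,
    3.0 means the outage hit while traffic was three times the daily average.
    """
    calls_per_minute = calls_per_day / MINUTES_PER_DAY
    return calls_per_minute * incident_minutes * traffic_multiplier


# Same monthly budget, two views: an average over the month versus one
# 43-minute incident that happens during a busy period.
average = expected_failed_calls(calls_per_day=50, days=30, availability_pct=99.9)
peak_incident = failed_calls_in_incident(calls_per_day=50, incident_minutes=43, traffic_multiplier=3.0)
print(f"average view: {average:.1f} failed calls, peak incident: {peak_incident:.1f} failed calls")
```

With even traffic the two views give nearly the same total. The clustering matters because every one of those failures lands in the same window, and an incident during peak hours multiplies the count.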

Availability budget allocation

If you have a 99.9% monthly availability target, that's 43 minutes of allowed downtime. Allocating that budget:

A 99.9% target is achievable without redundancy for many low-traffic MCP servers. A 99.99% target requires a deployment architecture with zero-downtime deploys, cross-region failover, or both.

How availability is measured

Probe-based external measurement

The most operationally accurate way to measure MCP availability is from outside the server, from the perspective of an agent trying to connect. This is what AliveMCP does: a probe runs every 60 seconds, completing the full MCP handshake sequence (TLS, initialize, tools/list) and recording whether each step succeeded.

The rolling 30-day availability percentage displayed on the AliveMCP dashboard is calculated from probe results: successful_probes / total_probes × 100. A probe that fails at any layer — TLS, HTTP, initialization, tools list — counts as a failed probe.
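As a minimal sketch of that calculation, assuming each probe is recorded as a mapping from step name to a success flag (the record shape and step names here are illustrative, not AliveMCP's data model):

```python
def probe_succeeded(steps):
    """A probe succeeds only if every step in it succeeded."""
    return all(steps.values())


def availability_pct(probes):
    """successful_probes / total_probes × 100 over a window of probe records."""
    probes = list(probes)
    if not probes:
        raise ValueError("no probes in window")
    successful = sum(1 for steps in probes if probe_succeeded(steps))
    return 100 * successful / len(probes)


# Example window: 999 clean probes and one that failed at tools/list.
clean = {"tls": True, "http": True, "initialize": True, "tools_list": True}
broken = {"tls": True, "http": True, "initialize": True, "tools_list": False}
window = [clean] * 999 + [broken]
print(f"{availability_pct(window):.2f}%")
```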

Confirmation threshold and availability calculation

A single failed probe is not enough to declare an outage. Transient network issues (a packet loss event, a brief DNS hiccup) can cause isolated probe failures that have no practical impact on agents. AliveMCP uses a three-consecutive-probe threshold before declaring a server down, and three consecutive successes to declare it recovered. This affects the availability math in a specific way:

A server that fails exactly one probe per hour (always isolated, never three in a row) is never declared down, so it records no confirmed downtime and shows 100% availability: the threshold filters out transient noise. A server that has a genuine 3-minute outage once per week will fail three consecutive probes during that outage window and record approximately 3 minutes of measured downtime per week, so the threshold still captures genuine outages accurately.

This confirmation model prevents alert fatigue (too many false positives) while remaining sensitive to real outages. It does mean that a very short outage (under 3 minutes) can go undetected. If your availability budget requires detecting sub-3-minute outages, you need a higher-frequency probe cadence than 60 seconds — or a secondary monitoring layer at the application level.
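The confirmation rule can be sketched as a small state machine. This is an illustration of the three-consecutive-probe rule described above, not AliveMCP's code. It assumes monitoring starts with the server considered up, and it only tracks the declared state; how declared states are converted into downtime minutes is left out.

```python
def confirmed_states(probe_results, threshold=3):
    """Return the declared state ("up" or "down") after each probe.

    The state flips only after `threshold` consecutive probe results that
    contradict the current state; any agreeing result resets the streak.
    """
    state = "up"  # assumption: the server is considered up when monitoring starts
    streak = 0
    states = []
    for ok in probe_results:
        contradicts = (state == "up" and not ok) or (state == "down" and ok)
        streak = streak + 1 if contradicts else 0
        if streak >= threshold:
            state = "down" if state == "up" else "up"
            streak = 0
        states.append(state)
    return states


# An isolated failure never changes the declared state.
print(confirmed_states([True, False, True, True]))
# Three failures in a row declare an outage; three successes declare recovery.
print(confirmed_states([True, False, False, False, True, True, True]))
```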

Availability vs reliability

These terms are often used interchangeably but have distinct meanings in operational contexts. Availability is the fraction of time the service is up (binary: is it reachable and functional?). Reliability includes latency and error rate while the service is up. A server with 99.9% availability but a p95 latency of 8 seconds when it is up is highly available but not very reliable. Both metrics matter for agent-facing services; availability is the floor, reliability is the ceiling.
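A short sketch shows how the two metrics can diverge on the same data. It assumes each probe is recorded as a `(succeeded, latency_ms)` pair and uses the nearest-rank method for p95; both choices are illustrative.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def summarize(probes):
    """Return (availability_pct, p95_latency_ms) for (succeeded, latency_ms) pairs.

    Latency is computed over successful probes only, since reliability
    describes behaviour while the service is up.
    """
    probes = list(probes)
    latencies = [latency for ok, latency in probes if ok]
    availability = 100 * len(latencies) / len(probes)
    return availability, p95(latencies)


# Every probe succeeds, but one in ten takes 8 seconds.
probes = [(True, 120)] * 90 + [(True, 8000)] * 10
availability, latency = summarize(probes)
print(f"availability {availability:.1f}%, p95 latency {latency} ms")
```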

Communicating availability to MCP consumers

If other teams, third-party integrations, or external agents depend on your MCP server, availability data needs to be communicated externally. The standard mechanism is a public status page: a URL that shows current server status, historical availability, and incident timelines.

AliveMCP generates a status page for every monitored server, showing:

Linking to this page in your MCP server's README, in the MCP registry listing, and in your documentation gives agent operators a single source of truth for your server's availability record. See an example status page on the AliveMCP dashboard, or sign up to get one for your server.

Related questions

What's the difference between availability and uptime?

In practice the terms are used interchangeably, but uptime usually refers to the raw fraction of time the server is running (HTTP-level), while availability more precisely means the fraction of time the server is usable by its clients. For MCP servers, the distinction matters: a server with 100% uptime can have lower protocol availability if the MCP router or tool registry is intermittently broken.

How do I measure availability if I don't have external monitoring?

You can approximate it from error rate data in your server logs: (1 - error_rate) × 100. However, this only captures failures that reach your server. A server that's down at the network or TLS layer never generates log entries. External monitoring is the only way to capture those failures.

Should I report availability to users of my MCP server?

Yes, especially if you're publishing a server intended for use by others. Transparency is the expectation in the MCP ecosystem, but it is not yet common practice: the Q2 2026 registry audit found that 91% of public endpoints are unreachable, which suggests most authors are not actively monitoring or reporting availability. A public status page differentiates your server as production-ready.

How does planned maintenance affect my availability SLA?

If you notify users in advance and your SLA terms exclude planned maintenance windows, those outages don't count against your availability target. For internal SLAs (no formal notice requirement), planned maintenance counts as downtime unless you implement zero-downtime deployments. Budget your maintenance windows explicitly: a 2-minute rolling restart uses roughly 5% of the 43-minute monthly budget at a 99.9% target.