
MCP server error rate

Uptime is binary — up or down. Error rate is continuous — 0% to 100% of probes failing, with every point in between being meaningful. A server with a 5% error rate is technically "up" by uptime metrics but degraded in practice: 1 in 20 agent sessions hits a failure. Tracking error rate instead of (or alongside) uptime gives you a finer signal, an error budget to spend, and earlier warning of brewing problems.

TL;DR

Error rate = failed probes / total probes over a rolling window. Measure it per protocol layer (transport, HTTP, initialize, tools/list): you can have 0% transport errors and 3% initialize errors simultaneously. Alert on rate over a 5-minute window, not on individual probe failures. A single probe failure is usually probe-origin network jitter. Three consecutive failures (a 60% error rate over 5 minutes at 60-second probe cadence) indicate a real server problem. Error budget math: at a 99.9% SLO over a 30-day month, you have 43.2 minutes of error budget. AliveMCP tracks error rate per layer on every probe interval.

What counts as an error in MCP

The MCP protocol has multiple error surfaces, and counting them correctly requires knowing which layer the failure occurred at:

Transport errors

TCP connection refused, TCP timeout (no response in probe timeout window), TLS handshake failure (certificate invalid, expired, or hostname mismatch), TLS version mismatch. These are the most severe: transport errors mean the server is completely unreachable. See MCP server SSL certificate monitoring for the SSL-specific failure modes.

HTTP errors

The transport connection (TCP and TLS) succeeded but the HTTP response indicates a problem. Relevant status codes: 4xx (client error, usually auth misconfiguration or a wrong endpoint path), 429 in particular (rate-limited: the probe origin is being throttled), 5xx (server error: application crash, out-of-memory, upstream dependency failure), 301/302 (redirect: the server moved but the probe URL wasn't updated). A 200 with a non-JSON-RPC response body (HTML error page, maintenance page) also counts as an HTTP error for MCP purposes.

JSON-RPC errors (initialize layer)

The HTTP response is 200 and the body is valid JSON-RPC, but the initialize method returns an error object rather than a result. Common error codes: -32700 (parse error), -32600 (invalid request), -32601 (method not found — the server doesn't implement initialize), -32603 (internal error), protocol-specific error codes for auth failures. These are distinct from HTTP 4xx — the server successfully received the request but refused or failed to process the MCP handshake.

Tool surface errors (tools/list layer)

Initialize succeeded but tools/list returned an error, returned an empty array when at least one tool is expected, returned a malformed response (missing tools key, tools with missing required fields), or returned a schema that can't be parsed as valid JSON Schema. These errors indicate a server that is alive at the protocol level but is not functioning correctly for agent use.
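The four layers above can be checked in order, attributing each failed probe to the first layer that broke. The following is a minimal sketch, not AliveMCP's implementation: the ProbeResult shape and the failing_layer function are hypothetical names, it assumes the server answers with plain JSON bodies, and it treats name and inputSchema as the required fields of a tool entry.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Raw outcome of one probe (hypothetical shape, for illustration only)."""
    transport_error: Optional[str] = None   # e.g. "tcp_timeout", "tls_handshake"
    http_status: Optional[int] = None
    initialize_body: Optional[str] = None   # raw body of the initialize response
    tools_list_body: Optional[str] = None   # raw body of the tools/list response


def _parse_jsonrpc(body: Optional[str]) -> Optional[dict]:
    """Return the decoded JSON-RPC 2.0 object, or None if the body is not one."""
    if body is None:
        return None
    try:
        msg = json.loads(body)
    except json.JSONDecodeError:
        return None
    if isinstance(msg, dict) and msg.get("jsonrpc") == "2.0":
        return msg
    return None


def failing_layer(r: ProbeResult, expect_tools: bool = True) -> Optional[str]:
    """Return the first layer that failed, or None if the probe passed."""
    if r.transport_error is not None:
        return "transport"
    if r.http_status != 200:
        return "http"
    init = _parse_jsonrpc(r.initialize_body)
    if init is None:
        return "http"          # 200 with an HTML or other non-JSON-RPC body
    if "error" in init or "result" not in init:
        return "initialize"
    listing = _parse_jsonrpc(r.tools_list_body)
    if listing is None or "error" in listing:
        return "tools/list"
    result = listing.get("result")
    tools = result.get("tools") if isinstance(result, dict) else None
    if not isinstance(tools, list):
        return "tools/list"    # missing or malformed tools key
    if expect_tools and not tools:
        return "tools/list"    # empty array when at least one tool is expected
    for tool in tools:
        # Assumed required fields for a tool entry: name and inputSchema.
        if not isinstance(tool, dict) or "name" not in tool or "inputSchema" not in tool:
            return "tools/list"
    return None
```

Returning only the first failing layer keeps the counts unambiguous: a probe that never connected is a transport error, not also an initialize error.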

Error rate calculation

Error rate is calculated over a rolling window, not instantaneously. Two window sizes matter for different purposes:

Short window (5–15 minutes): used for real-time alerting. Captures acute incidents quickly. At 60-second probe cadence, a 5-minute window contains 5 probe samples — enough to distinguish a single jittery probe (1/5 = 20% error rate over 5 minutes) from sustained failure (5/5 = 100%). Alert threshold: fire P1 at ≥ 60% error rate over any 5-minute window, P2 at ≥ 20% over 15 minutes.
Long window (30 days): used for SLO accounting. Measures the month-to-date availability against your SLO target. Captures slow-burn degradation that the short window misses (1 failure per hour = 1.7% error rate, below the short-window alert threshold but significant over a month).

A practical implementation: track both windows simultaneously. The short window drives operational alerts (page someone, fix the server now). The long window drives SLO reviews (are we on track to meet our monthly availability commitment?). They answer different questions and should not be conflated.
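As a rough illustration of tracking two windows side by side, the sketch below keeps a time-bounded queue of probe samples per window and applies the P1 and P2 thresholds quoted above. RollingErrorRate, severity, and replay are hypothetical names introduced for this example; the same class could serve the 30-day window by passing a larger window_seconds.

```python
from collections import deque
from typing import Deque, Optional, Set, Tuple


class RollingErrorRate:
    """Failed probes / total probes over the trailing `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, bool]] = deque()  # oldest first

    def record(self, timestamp: float, failed: bool) -> None:
        """Add one probe sample and drop samples that fell out of the window."""
        self._samples.append((timestamp, failed))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def rate(self) -> float:
        if not self._samples:
            return 0.0
        failures = sum(1 for _, failed in self._samples if failed)
        return failures / len(self._samples)


def severity(short: RollingErrorRate, medium: RollingErrorRate) -> Optional[str]:
    """P1 at >= 60% over the 5-minute window, P2 at >= 20% over the 15-minute window."""
    if short.rate() >= 0.60:
        return "P1"
    if medium.rate() >= 0.20:
        return "P2"
    return None


def replay(failed_at: Set[int], probes: int = 15, cadence: int = 60) -> Optional[str]:
    """Feed `probes` samples spaced `cadence` seconds apart into both windows."""
    short = RollingErrorRate(5 * 60)
    medium = RollingErrorRate(15 * 60)
    for i in range(probes):
        t = i * cadence
        for window in (short, medium):
            window.record(t, t in failed_at)
    return severity(short, medium)
```

Replaying 15 minutes of probes with one isolated failure yields no alert, three failures in the last five probes yields P1, and three failures spread across the 15 minutes yields P2. Note that a freshly started window holds few samples, so a real alerter would probably want a minimum sample count before evaluating thresholds.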

Per-layer error rate: why it matters

Aggregating all errors into a single "error rate" loses information. A server with a 3% error rate where all errors are at the tools/list layer is a different situation than a server with 3% transport errors — same rate, completely different diagnosis and fix.

Concrete example: after a deployment of a new MCP server version, you see:

Transport: 0% errors (server is up)
HTTP: 0% errors (server responds 200)
Initialize: 0% errors (MCP handshake succeeds)
Tools/list: 8% errors (tools/list returns empty array on 1 in 12 probes)

Without per-layer error rate, the aggregate reads as either 2% (8% × 0.25, if tools/list is one of four equally weighted layer checks) or 8% (if you only run the tools/list probe). Either way, the signal is "something is wrong." With per-layer error rate, the signal is "the server is alive and the MCP handshake works but the tool surface is intermittently empty." This immediately points toward a race condition in tool registration during server startup, a completely different diagnosis path than a transport or HTTP error.
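The per-layer breakdown in this example can be computed from a list of probe outcomes. This is a sketch under one assumption: each probe is attributed to the first layer that failed (or to no layer if it passed), and every layer's rate uses the total probe count as its denominator. The function name per_layer_error_rates is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

LAYERS = ("transport", "http", "initialize", "tools/list")


def per_layer_error_rates(outcomes: List[Optional[str]]) -> Dict[str, float]:
    """Map each layer to failures-at-that-layer / total probes.

    `outcomes` holds one entry per probe: the name of the first failing
    layer, or None when the probe passed every layer.
    """
    total = len(outcomes)
    counts = Counter(layer for layer in outcomes if layer is not None)
    return {layer: (counts[layer] / total if total else 0.0) for layer in LAYERS}


if __name__ == "__main__":
    # 12 probes, one of which got an empty tools array.
    outcomes = [None] * 11 + ["tools/list"]
    for layer, rate in per_layer_error_rates(outcomes).items():
        print(f"{layer}: {rate:.1%}")
```

Because each probe is attributed to exactly one layer, the per-layer rates sum to the overall probe failure rate, so the aggregate stays available without hiding where the failures occurred.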

Error budget and SLO math

An error budget SLO expresses availability as a target percentage over a rolling period. At 99.9% SLO over a 30-day month:

Error budget = (1 - 0.999) × 30 days × 24 hours × 60 minutes
             = 0.001 × 43,200 minutes
             = 43.2 minutes of allowed downtime per month

At 60-second probe cadence, 43.2 minutes = 43 probe failures per month before you breach the SLO. That's less than 1.5 probe failures per day. A single five-minute incident (5 consecutive probe failures) consumes 11.6% of your monthly budget.
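The budget arithmetic above is easy to parameterize. The sketch below assumes, as the text does, a 30-day period and that each failed probe counts as one probe interval of downtime; the function names are illustrative.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an SLO such as 0.999 over `period_days`."""
    return (1.0 - slo) * period_days * 24 * 60


def budget_consumed(failed_probes: int, slo: float,
                    probe_interval_seconds: int = 60,
                    period_days: int = 30) -> float:
    """Fraction of the budget used, counting each failed probe as one interval of downtime."""
    downtime_minutes = failed_probes * probe_interval_seconds / 60
    return downtime_minutes / error_budget_minutes(slo, period_days)


if __name__ == "__main__":
    for slo in (0.99, 0.995, 0.999, 0.9999):
        print(f"{slo:.2%} SLO: {error_budget_minutes(slo):.1f} minutes per 30 days")
    print(f"5 failed probes at 99.9%: {budget_consumed(5, 0.999):.1%} of budget")
```

Changing probe_interval_seconds shows how cadence affects the accounting: at a slower cadence each failed probe stands in for a longer stretch of assumed downtime.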

SLO tiers and what they imply for operations:

99.0% SLO: 432 minutes (7.2 hours) of allowed downtime per month. Viable with manual incident response — one on-call engineer can investigate and fix most incidents within this budget. Typical for internal tools and developer-facing MCP servers.
99.5% SLO: 216 minutes (3.6 hours) per month. Requires a documented incident response playbook and an alert that wakes someone up within 5 minutes. See MCP server incident response for playbook structure.
99.9% SLO: 43.2 minutes per month. Requires automated recovery for the most common failure modes and alert-to-acknowledge time under 10 minutes. Achievable for most MCP servers with good monitoring and a hysteresis-based alerting setup.
99.99% SLO: 4.3 minutes per month. Multi-region active-active deployment required. Outside the scope of most MCP server operators unless building a commercial service that provides SLA guarantees to enterprise customers.

Error budget burn rate is the most actionable operational metric: if you're 10 days into the month and have already consumed 60% of your monthly error budget, you're on track to breach your SLO. Alerting on burn rate ("you will breach SLO in N days at the current rate") is more useful than alerting on raw downtime ("you have X minutes of budget left").
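A burn-rate projection can be sketched in a few lines. This assumes the budget keeps burning at the average rate observed so far in the period, which is a simplification; burn_rate and days_until_breach are hypothetical helper names.

```python
from typing import Optional


def burn_rate(consumed_fraction: float, elapsed_days: float,
              period_days: float = 30) -> float:
    """Budget consumed relative to an even burn (1.0 = exactly on pace)."""
    return consumed_fraction / (elapsed_days / period_days)


def days_until_breach(consumed_fraction: float,
                      elapsed_days: float) -> Optional[float]:
    """Days until the budget is exhausted at the average rate so far.

    Returns None when nothing has been consumed, 0.0 when already breached.
    """
    if consumed_fraction <= 0:
        return None
    if consumed_fraction >= 1:
        return 0.0
    daily_burn = consumed_fraction / elapsed_days
    return (1.0 - consumed_fraction) / daily_burn


if __name__ == "__main__":
    # The scenario from the text: 60% of the budget gone after 10 days.
    print(f"burn rate: {burn_rate(0.6, 10):.1f}x")
    print(f"days until breach: {days_until_breach(0.6, 10):.1f}")
```

For the scenario in the text, the burn rate is 1.8 times the sustainable pace, and the remaining 40% of the budget would last roughly another 6.7 days, so the breach would land well before the end of the month.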

False positives: probe-origin jitter vs. real errors

Not every probe failure is a server error. The probe origin (AliveMCP's monitoring infrastructure) can have transient network issues that cause individual probes to fail even when the server is healthy. This is probe-origin jitter, and it accounts for most single-sample probe failures.

How to distinguish probe-origin jitter from real server errors:

Jitter signature: a single probe failure followed immediately by a passing probe. The failure mode is usually TCP timeout or TLS handshake timeout — the connection never completed, suggesting the probe packet was dropped in transit rather than reaching the server.
Real error signature: multiple consecutive failures, or failures that show a specific error code (HTTP 500, JSON-RPC -32603) rather than a generic timeout. Real server errors have a characteristic error payload; probe-origin jitter produces timeouts.

AliveMCP's alerting requires 3 consecutive failures before firing a P1 alert, and evaluates error rate over a 5-minute window rather than on individual samples. This design specifically suppresses probe-origin jitter while catching real server errors within 3 minutes of onset. A single jittery probe is logged but does not contribute to the error rate trend or trigger an alert.
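The consecutive-failure rule can be illustrated with a small state machine. This is a sketch of the general technique, not AliveMCP's actual alerting code: ConsecutiveFailureGate is a hypothetical name, and resolving on the first passing probe is an assumption made here for brevity.

```python
from typing import Optional


class ConsecutiveFailureGate:
    """Open an alert only after `threshold` failed probes in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.streak = 0          # current run of consecutive failures
        self.alerting = False    # whether an alert is currently open

    def observe(self, failed: bool) -> Optional[str]:
        """Return "fire" when an alert opens, "resolve" when it closes, else None."""
        if failed:
            self.streak += 1
            if self.streak >= self.threshold and not self.alerting:
                self.alerting = True
                return "fire"
            return None
        # A passing probe resets the streak (assumed: one pass is enough to resolve).
        self.streak = 0
        if self.alerting:
            self.alerting = False
            return "resolve"
        return None


if __name__ == "__main__":
    gate = ConsecutiveFailureGate(threshold=3)
    # One jittery failure, a pass, then a sustained outage and recovery.
    for failed in (True, False, True, True, True, True, False):
        print(failed, gate.observe(failed))
```

The isolated failure at the start never reaches the threshold, while the sustained run fires on its third failed probe, which at 60-second cadence is about three minutes after onset.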

For the most rigorous analysis, multi-region probing eliminates probe-origin jitter as a false positive entirely: if probes from three independent regions all fail simultaneously, the problem is the server, not any individual probe origin. AliveMCP Team tier ($49/mo) includes multi-region probing with cross-region correlation built in.
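Cross-region correlation reduces to a quorum check over the results of one probe round. The sketch below is illustrative only: the correlate function, the region names, and the verdict labels are hypothetical and are not part of any AliveMCP API. By default it requires every region to fail before blaming the server, matching the rule described above.

```python
from typing import Dict, Optional


def correlate(region_failed: Dict[str, bool], quorum: Optional[int] = None) -> str:
    """Classify one probe round from per-region results (True = probe failed).

    Returns "healthy", "server_error" (at least `quorum` regions failed), or
    "probe_origin_suspect" (only some regions failed). `quorum` defaults to
    the number of regions, i.e. all of them must fail.
    """
    if not region_failed:
        raise ValueError("need at least one region result")
    failing = [region for region, failed in region_failed.items() if failed]
    if not failing:
        return "healthy"
    needed = len(region_failed) if quorum is None else quorum
    if len(failing) >= needed:
        return "server_error"
    return "probe_origin_suspect"


if __name__ == "__main__":
    # Region names are made up for the example.
    print(correlate({"us-east": True, "eu-west": False, "ap-south": False}))
    print(correlate({"us-east": True, "eu-west": True, "ap-south": True}))
```

Lowering quorum to 2 of 3 would catch outages sooner when one region's probe is slow to report, at the cost of treating a correlated network problem between two origins as a server fault.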

Error rate vs. downtime: which metric to lead with

Uptime percentage (downtime as the inverse) is the number most people ask for: "what's your server's uptime?" But error rate is a more precise metric for a server with intermittent problems:

A server with 3 one-hour outages per month has 99.6% uptime and a 0.4% error rate. These numbers tell the same story.
A server with 1 failed probe per hour, every hour of the month, has a 1.7% error rate at 60-second probe cadence (720 failures out of 43,200 probes). Counting each failed probe as a minute of downtime gives 98.3% uptime, yet the server had zero multi-minute outages. A user who hits the server at random has a 1.7% chance of hitting the failure. The uptime story ("no major outages this month") is misleading; the error rate story ("1.7% of sessions failed") is accurate.

For MCP servers where agent sessions are relatively short and users retry automatically, error rate is the better operational metric. For MCP servers where a session is a long-running workflow that can't be retried cheaply, downtime duration is the more impactful metric (a 30-minute outage during a workflow is much worse than 30 one-minute intermittent failures).
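Both metrics can be derived from the same probe history, which makes it cheap to report them together. The sketch below computes the error rate and the longest run of consecutive failures for a month of 60-second probes; summarize is a hypothetical helper, and the two synthetic histories are constructed to have the same number of failures.

```python
from typing import Dict, List


def summarize(probes: List[bool], probe_interval_seconds: int = 60) -> Dict[str, float]:
    """Summarize a probe history where True means the probe failed."""
    total = len(probes)
    failures = sum(probes)
    longest = run = 0
    for failed in probes:
        run = run + 1 if failed else 0   # length of the current failure streak
        longest = max(longest, run)
    return {
        "error_rate": failures / total if total else 0.0,
        "longest_outage_minutes": longest * probe_interval_seconds / 60,
    }


if __name__ == "__main__":
    month = 30 * 24 * 60  # probes in a 30-day month at 60-second cadence

    # Three one-hour outages (start positions chosen arbitrarily).
    outage_starts = (1_000, 20_000, 40_000)
    clustered = [any(s <= i < s + 60 for s in outage_starts) for i in range(month)]

    # The same 180 failures spread out evenly, one every 240 probes.
    scattered = [i % 240 == 0 for i in range(month)]

    print(summarize(clustered))
    print(summarize(scattered))
```

The two histories have identical error rates but very different longest outages (60 minutes versus 1 minute), which is the distinction that decides which metric matters more for a given workload.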