MCP server error rate

Uptime is binary — up or down. Error rate is continuous — 0% to 100% of probes failing, with every point in between being meaningful. A server with a 5% error rate is technically "up" by uptime metrics but degraded in practice: 1 in 20 agent sessions hits a failure. Tracking error rate instead of (or alongside) uptime gives you a finer signal, an error budget to spend, and earlier warning of brewing problems.

TL;DR

Error rate = failed probes / total probes over a rolling window. Measure it per protocol layer (transport, HTTP, initialize, tools/list): you can have 0% transport errors and 3% initialize errors simultaneously. Alert on rate over a 5-minute window, not on individual probe failures. A single probe failure is usually probe-origin network jitter. Three consecutive failures (or a 5% error rate over 5 minutes) indicate a real server problem. Error budget SLO math: at a 99.9% SLO over a 30-day month, you have 43.2 minutes of error budget. AliveMCP tracks error rate per layer on every probe interval.

What counts as an error in MCP

The MCP protocol has multiple error surfaces, and counting them correctly requires knowing which layer the failure occurred at:

Transport errors

TCP connection refused, TCP timeout (no response in probe timeout window), TLS handshake failure (certificate invalid, expired, or hostname mismatch), TLS version mismatch. These are the most severe: transport errors mean the server is completely unreachable. See MCP server SSL certificate monitoring for the SSL-specific failure modes.

HTTP errors

The TCP connection succeeded but the HTTP response indicates a problem. Relevant status codes: 4xx (client error — usually auth misconfiguration or wrong endpoint path), 5xx (server error — application crash, out-of-memory, upstream dependency failure), 429 (rate-limited — the probe origin is being throttled), 301/302 (redirect — the server moved but the probe URL wasn't updated). A 200 with non-JSON-RPC response body (HTML error page, maintenance page) also counts as an HTTP error for MCP purposes.

JSON-RPC errors (initialize layer)

The HTTP response is 200 and the body is valid JSON-RPC, but the initialize method returns an error object rather than a result. Common error codes: -32700 (parse error), -32600 (invalid request), -32601 (method not found — the server doesn't implement initialize), -32603 (internal error), protocol-specific error codes for auth failures. These are distinct from HTTP 4xx — the server successfully received the request but refused or failed to process the MCP handshake.

Tool surface errors (tools/list layer)

Initialize succeeded but tools/list returned an error, returned an empty array when at least one tool is expected, returned a malformed response (missing tools key, tools with missing required fields), or returned a schema that can't be parsed as valid JSON Schema. These errors indicate a server that is alive at the protocol level but is not functioning correctly for agent use.
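The four layers above form a ladder: a probe is attributed to the first layer that fails. The following is a minimal sketch of that attribution, not AliveMCP's implementation. The `ProbeObservation` record and its field names are hypothetical, and treating any non-2xx status as an HTTP-layer error is an assumption based on the status codes listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeObservation:
    """Hypothetical record of one probe; field names are illustrative only."""
    connected: bool              # TCP connect and TLS handshake both succeeded
    http_status: Optional[int]   # None if no HTTP response was received
    body_is_jsonrpc: bool        # response body parsed as a JSON-RPC message
    initialize_error: bool       # initialize returned an error object
    tools_list_ok: bool          # tools/list returned a well-formed, non-empty list


def failing_layer(obs: ProbeObservation) -> Optional[str]:
    """Return the first layer that failed, or None for a clean probe."""
    if not obs.connected:
        return "transport"
    # Assumption: redirects, 4xx, 5xx, and a 200 with a non-JSON-RPC body
    # all count as HTTP-layer errors.
    if obs.http_status is None or not 200 <= obs.http_status < 300:
        return "http"
    if not obs.body_is_jsonrpc:
        return "http"
    if obs.initialize_error:
        return "initialize"
    if not obs.tools_list_ok:
        return "tools/list"
    return None
```

Attributing each probe to exactly one layer keeps the per-layer counts from double-counting: a transport failure never also shows up as an initialize failure.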

Error rate calculation

Error rate is calculated over a rolling window, not instantaneously. Two window sizes matter for different purposes:

A practical implementation: track both windows simultaneously. The short window drives operational alerts (page someone, fix the server now). The long window drives SLO reviews (are we on track to meet our monthly availability commitment?). They answer different questions and should not be conflated.
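Tracking both windows at once could look like the sketch below. The class name `RollingErrorRate` is hypothetical, and the window sizes are assumptions: 5 minutes is taken from the alerting guidance in the TL;DR and 30 days from the monthly SLO framing.

```python
from collections import deque


class RollingErrorRate:
    """Error rate over the most recent `window_seconds` of probe results."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, timestamp: float, failed: bool) -> None:
        """Add one probe result and drop samples that fell out of the window."""
        self._samples.append((timestamp, failed))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def rate(self) -> float:
        """Failed probes divided by total probes currently in the window."""
        if not self._samples:
            return 0.0
        failures = sum(1 for _, failed in self._samples if failed)
        return failures / len(self._samples)


# Assumed window sizes: short for operational alerts, long for SLO review.
alert_window = RollingErrorRate(5 * 60)
slo_window = RollingErrorRate(30 * 24 * 3600)
```

Feeding every probe result into both objects keeps the two questions separate: `alert_window.rate()` answers "is it broken right now?" and `slo_window.rate()` answers "are we on track for the month?".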

Per-layer error rate: why it matters

Aggregating all errors into a single "error rate" loses information. A server with a 3% error rate where all errors are at the tools/list layer is a different situation than a server with 3% transport errors — same rate, completely different diagnosis and fix.

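Concrete example: after a deployment of a new MCP server version, you see:

Transport error rate: 0%
HTTP error rate: 0%
initialize error rate: 0%
tools/list error rate: 8% (intermittently empty tools array)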

Without per-layer error rates, the aggregate reads as either 2% (the 8% tools/list rate averaged across four layers) or 8% (if tools/list is the only probe you run). Either way, the signal is "something is wrong." With per-layer error rates, the signal is "the server is alive and the MCP handshake works, but the tool surface is intermittently empty." This immediately points toward a race condition in tool registration during server startup, a completely different diagnosis path from a transport or HTTP error.
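The arithmetic behind that comparison can be sketched as follows, taking the 8% tools/list figure from the paragraph above and assuming the other three layers sit at 0%. The function names are hypothetical.

```python
from collections import Counter

LAYERS = ("transport", "http", "initialize", "tools/list")


def per_layer_rates(failures: Counter, total_probes: int) -> dict:
    """Error rate per layer; each probe is attributed to at most one layer."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return {layer: failures[layer] / total_probes for layer in LAYERS}


def averaged_aggregate(rates: dict) -> float:
    """Unweighted mean across layers, which hides where the errors are."""
    return sum(rates.values()) / len(rates)


# Assumed example: 100 probes, 8 of which failed at the tools/list layer.
rates = per_layer_rates(Counter({"tools/list": 8}), total_probes=100)
aggregate = averaged_aggregate(rates)
```

Here `rates` keeps the 8% attached to tools/list, while `aggregate` collapses it to roughly 2% with no indication of which layer is responsible.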

Error budget and SLO math

An error budget SLO expresses availability as a target percentage over a rolling period. At 99.9% SLO over a 30-day month:

Error budget = (1 - 0.999) × 30 days × 24 hours × 60 minutes
             = 0.001 × 43,200 minutes
             = 43.2 minutes of allowed downtime per month

At 60-second probe cadence, 43.2 minutes = 43 probe failures per month before you breach the SLO. That's less than 1.5 probe failures per day. A single five-minute incident (5 consecutive probe failures) consumes 11.6% of your monthly budget.
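The same arithmetic as a small sketch, so the numbers can be recomputed for other SLO targets or probe cadences. The function names are hypothetical; the 30-day period and 60-second cadence defaults follow the text above.

```python
import math


def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an SLO given as a fraction (0.999)."""
    return (1 - slo) * period_days * 24 * 60


def allowed_probe_failures(slo: float,
                           probe_interval_seconds: float = 60.0,
                           period_days: int = 30) -> int:
    """Whole probe failures that fit inside the error budget."""
    budget_seconds = error_budget_minutes(slo, period_days) * 60
    # round() guards against float noise such as 43.20000000000004
    return math.floor(round(budget_seconds / probe_interval_seconds, 9))


def incident_budget_share(incident_minutes: float,
                          slo: float,
                          period_days: int = 30) -> float:
    """Fraction of the period's budget consumed by one incident."""
    return incident_minutes / error_budget_minutes(slo, period_days)
```

For a 99.9% SLO, `error_budget_minutes(0.999)` gives about 43.2, `allowed_probe_failures(0.999)` gives 43, and `incident_budget_share(5, 0.999)` gives about 0.116, matching the figures above.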

SLO tiers and what they imply for operations:

Error budget burn rate is the most actionable operational metric: if you're 10 days into the month and have already consumed 60% of your monthly error budget, you're on track to breach your SLO. Alerting on burn rate ("you will breach SLO in N days at the current rate") is more useful than alerting on raw downtime ("you have X minutes of budget left").
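A projection of this kind can be sketched as a linear extrapolation of budget consumed so far. This is a simplification that assumes the current pace continues unchanged, and the function names are hypothetical.

```python
from typing import Optional


def projected_budget_use(consumed_fraction: float,
                         days_elapsed: float,
                         period_days: float = 30) -> float:
    """Fraction of the budget used by period end if the current pace holds."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return consumed_fraction * period_days / days_elapsed


def days_until_exhausted(consumed_fraction: float,
                         days_elapsed: float) -> Optional[float]:
    """Days from now until the budget runs out, or None if nothing is burning."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    if consumed_fraction <= 0:
        return None
    burn_per_day = consumed_fraction / days_elapsed
    return max(0.0, (1 - consumed_fraction) / burn_per_day)
```

For the example above (60% consumed after 10 days), the projection is 180% of the budget by day 30, with the budget running out in a little under 7 days.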

False positives: probe-origin jitter vs. real errors

Not every probe failure is a server error. The probe origin (AliveMCP's monitoring infrastructure) can have transient network issues that cause individual probes to fail even when the server is healthy. This is probe-origin jitter, and it accounts for most single-sample probe failures.

How to distinguish probe-origin jitter from real server errors:

AliveMCP's alerting requires 3 consecutive failures before firing a P1 alert, and evaluates error rate over a 5-minute window rather than on individual samples. This design specifically suppresses probe-origin jitter while catching real server errors within 3 minutes of onset. A single jittery probe is logged but does not contribute to the error rate trend or trigger an alert.
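The consecutive-failure rule can be illustrated with a small state machine. This is a sketch of the idea only, not AliveMCP's code; the class name and the fire-once behaviour are assumptions.

```python
class ConsecutiveFailureAlert:
    """Fires once when `threshold` probe failures occur back to back."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0

    def observe(self, failed: bool) -> bool:
        """Record one probe result; return True only on the probe that
        brings the current failure streak up to the threshold."""
        self.streak = self.streak + 1 if failed else 0
        return self.streak == self.threshold


alert = ConsecutiveFailureAlert(threshold=3)
# One isolated failure, a success, then a sustained outage.
fired = [alert.observe(f) for f in (True, False, True, True, True, True)]
```

In this run only the fifth probe fires: the isolated first failure is reset by the following success, and the sixth probe extends the streak without re-alerting.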

For the most rigorous analysis, multi-region probing eliminates probe-origin jitter as a false positive entirely: if probes from three independent regions all fail simultaneously, the problem is the server, not any individual probe origin. AliveMCP Team tier ($49/mo) includes multi-region probing with cross-region correlation built in.
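Cross-region correlation reduces to a simple rule: blame the server only when every region fails in the same probe round. A minimal sketch, with hypothetical region names and verdict labels:

```python
from typing import Mapping


def classify_round(region_failed: Mapping[str, bool]) -> str:
    """Classify one probe round given a region -> failed flag mapping."""
    if not region_failed:
        raise ValueError("at least one region result is required")
    failed = [region for region, did_fail in region_failed.items() if did_fail]
    if not failed:
        return "healthy"
    if len(failed) == len(region_failed):
        return "server"  # every independent origin failed simultaneously
    # Some origins succeeded, so the server answered at least one probe.
    return "suspect-probe-origin"
```

A partial failure is labelled "suspect" rather than dismissed, because a regional routing problem between that origin and the server would look the same from the probe's side.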

Error rate vs. downtime: which metric to lead with

Uptime percentage (downtime as the inverse) is the number most people ask for: "what's your server's uptime?" But error rate is a more precise metric for a server with intermittent problems:

For MCP servers where agent sessions are relatively short and users retry automatically, error rate is the better operational metric. For MCP servers where a session is a long-running workflow that can't be retried cheaply, downtime duration is the more impactful metric (a 30-minute outage during a workflow is much worse than 30 one-minute intermittent failures).

Related questions

What error rate is acceptable for a public MCP server?

Below 1% error rate over any 24-hour window is a reasonable baseline target. Above 1%, investigate: at 60-second probe cadence, 1% means about 14 failed probes per day — that's detectable and fixable. Above 5% sustained, most users will notice failures in normal use. Below 0.1% is achievable with good infrastructure and worth targeting for commercial MCP services. The right target depends on your SLO commitment — derive it from the SLO math rather than picking an arbitrary number.

How do I count an error that happens on tools/list but not on initialize?

Count it as an error at the tools/list layer specifically, not as a global error. Your per-layer error rate for tools/list increments; your per-layer error rates for transport, HTTP, and initialize do not. For aggregate SLO accounting, decide in advance whether you count partial-layer failures (tools/list failed but initialize succeeded) as "the server was available" or "the server was unavailable." Both choices are defensible — document which you chose so SLO calculations are consistent month to month.
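The two accounting choices can be made explicit as a single flag, which also makes the documented choice hard to forget. A sketch with hypothetical names, where each probe is represented by the layer it failed at (or None for a clean probe):

```python
from typing import Optional, Sequence

HANDSHAKE_LAYERS = {"transport", "http", "initialize"}


def availability(failing_layers: Sequence[Optional[str]],
                 count_partial_as_down: bool) -> float:
    """Fraction of probes counted as 'available' under the chosen policy."""
    if not failing_layers:
        raise ValueError("at least one probe result is required")
    down = 0
    for layer in failing_layers:
        if layer is None:
            continue  # clean probe
        if layer in HANDSHAKE_LAYERS or count_partial_as_down:
            down += 1
    return 1 - down / len(failing_layers)
```

With 100 probes of which 3 failed at tools/list and 1 at transport, the strict policy (`count_partial_as_down=True`) reports 96% availability and the lenient one reports 99%.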

Should I track error rate for tool calls (not just initialize and tools/list)?

Yes, if you have the instrumentation. Tool call error rate is separate from probe error rate — it measures real user impact, not monitoring signal. Instrument your MCP server to emit a metric on every tool call (success/error, latency, tool name) and compute error rate per tool. Some tools will naturally have higher error rates than others (tools that call unreliable downstream APIs). Tracking per-tool error rate helps you identify which tools are causing agent frustration vs. which tools are healthy. AliveMCP's probes don't cover tool call error rate because tool calls have side effects — this instrumentation belongs in your server's application code.
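Per-tool instrumentation might look like the sketch below. It is deliberately framework-agnostic: `ToolCallStats` and its decorator are hypothetical, it wraps plain synchronous functions, and latency recording is omitted for brevity. An async server would need an async wrapper, and a production setup would likely export to a metrics system instead of in-memory counters.

```python
import functools
from collections import defaultdict


class ToolCallStats:
    """In-memory success/error counters keyed by tool name."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)

    def instrument(self, tool_name: str):
        """Decorator that counts calls and raised exceptions for one tool."""
        def decorator(handler):
            @functools.wraps(handler)
            def wrapper(*args, **kwargs):
                self.calls[tool_name] += 1
                try:
                    return handler(*args, **kwargs)
                except Exception:
                    self.errors[tool_name] += 1
                    raise  # the caller still sees the original error
            return wrapper
        return decorator

    def error_rates(self) -> dict:
        """Error rate per tool, for tools called at least once."""
        return {name: self.errors[name] / count
                for name, count in self.calls.items()}
```

A handler decorated with `@stats.instrument("search")` then contributes to `stats.error_rates()["search"]`, which makes a tool backed by an unreliable downstream API stand out from healthy ones.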

How do I measure error budget burn rate in real time?

Burn rate = observed error rate / allowed error rate, where the allowed error rate is 1 minus the SLO. Both are rates, so no scaling by window length is needed. If your SLO is 99.9% (0.1% allowed) and your rolling 7-day error rate is 0.5%, your burn rate is 0.5% / 0.1% = 5×: you're burning your monthly budget 5 times faster than sustainable, and those 7 days alone have consumed about 50 minutes against a 43.2-minute budget. Alert when burn rate reaches 5× (at that rate a 30-day budget is exhausted in 6 days). AliveMCP's error rate dashboard shows the rolling error rate; combine it with your SLO target to compute burn rate manually, or use Team tier's SLO mode, which computes it automatically.
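For manual computation, burn rate can be treated as the ratio of the observed error rate to the error rate the SLO allows. A minimal sketch under that definition; the function names and the default alert threshold parameter are hypothetical.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    Both arguments are fractions, e.g. 0.005 for 0.5% and 0.999 for 99.9%.
    """
    allowed_error_rate = 1 - slo
    if allowed_error_rate <= 0:
        raise ValueError("slo must be below 1.0")
    return observed_error_rate / allowed_error_rate


def days_to_exhaust(burn: float, period_days: float = 30) -> float:
    """Days a full budget lasts at a constant burn rate."""
    if burn <= 0:
        raise ValueError("burn must be positive")
    return period_days / burn


def should_alert(observed_error_rate: float, slo: float,
                 threshold: float = 5.0) -> bool:
    """True when the burn rate is at or above the alert threshold."""
    # Small tolerance so float noise at exactly 5x does not suppress the alert.
    return burn_rate(observed_error_rate, slo) >= threshold - 1e-9
```

A burn rate of 1× spends the budget exactly over the period; at 5×, `days_to_exhaust(5)` gives 6 days for a 30-day budget.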

Further reading