MCP server reliability

Reliability isn't just "uptime percentage" — it's the combination of how quickly you detect failures (MTTD) and how quickly you restore service (MTTR). A server with 99.9% uptime but a 4-hour MTTR provides a worse user experience than one with 99.5% uptime and a 10-minute MTTR. Reliability engineering for MCP servers means designing for fast detection, fast recovery, and graceful degradation when full recovery isn't immediately possible.

TL;DR

Target MTTD under 5 minutes (achievable with 60-second probe cadence + 3-probe confirmation + immediate alert routing) and MTTR under 30 minutes (achievable with runbooks, automatic restart policies, and pre-rehearsed rollback procedures). The most impactful reliability investments for most MCP servers: zero-downtime deployments (blue-green or rolling), automatic process restart on crash, and monthly incident rehearsal. External monitoring with AliveMCP gives you MTTD data; your incident timeline gives you MTTR data. Improve both with each post-mortem.

MTTD — Mean Time to Detect

MTTD measures the gap between "the server failed" and "you know the server failed." Every minute of MTTD is a minute where agent sessions are silently failing and users have no visibility into why.

MTTD components

MTTD = probe detection delay + alert routing delay + human acknowledgment delay

Probe detection delay: at 60-second cadence with 3-probe confirmation, your monitoring system detects a failure in 3 minutes at most. At 5-minute cadence, detection delay is up to 15 minutes. Cadence is the single biggest lever on probe detection delay.
Alert routing delay: the time between the monitoring system detecting the failure and a notification reaching someone who can act. With a correctly configured PagerDuty or Slack webhook this should be under 30 seconds, which is negligible next to the other two components.
Human acknowledgment delay: the time between the alert arriving and someone actively working the incident. Depends on on-call rotation coverage and working hours. During business hours with an on-call SRE, this can be 1–2 minutes. At 3am without an on-call rotation, it might be hours. This is the component most teams underinvest in.
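The formula above can be worked through numerically. The sketch below is illustrative only: the function names are hypothetical, and the 30-second routing delay and 90-second acknowledgment delay are assumed values chosen from the ranges mentioned above.

```python
def worst_case_probe_delay_s(cadence_s: float, confirmations: int) -> float:
    """Upper bound on probe detection delay: one full interval per confirming probe."""
    return cadence_s * confirmations


def estimated_mttd_s(cadence_s: float, confirmations: int,
                     routing_s: float, ack_s: float) -> float:
    """MTTD = probe detection delay + alert routing delay + human acknowledgment delay."""
    return worst_case_probe_delay_s(cadence_s, confirmations) + routing_s + ack_s


# 60-second cadence, 3-probe confirmation, 30 s routing, 90 s acknowledgment (assumed values).
fast = estimated_mttd_s(60, 3, 30, 90)
# 5-minute cadence with the same routing and acknowledgment delays.
slow = estimated_mttd_s(300, 3, 30, 90)

print(fast / 60, slow / 60)  # 5.0 17.0 (minutes)
```

With these assumed delays, only the 60-second cadence lands at the 5-minute MTTD target; the 5-minute cadence spends 15 of its 17 minutes in probe detection alone.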

For a solo MCP author with no on-call rotation: your MTTD during sleeping hours effectively runs until you wake up and check your phone. Accept this as a constraint and design your server to recover automatically where possible (reducing the impact of high MTTD) rather than trying to achieve low MTTD through a one-person on-call rotation that will inevitably lead to burnout.

Improving MTTD

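Shorten the probe interval from 5 minutes to 60 seconds. With 3-probe confirmation this cuts worst-case probe detection delay from 15 minutes to 3 minutes, a 5× improvement in that component.
Switch from email alerts to push notifications or SMS. For business-hours incidents this can cut human acknowledgment delay from roughly 30 minutes to roughly 2 minutes.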
Add process-level alerting alongside external probing — your server should log and alert on its own crashes and OOM kills, which can be detected faster from inside the process than from an external probe.

See MCP server downtime alerting for the full alert configuration guide.

MTTR — Mean Time to Restore

MTTR measures the gap between "you know the server failed" and "the server is serving requests again." High MTTR usually comes from one of three causes: slow diagnosis (what broke?), slow access (getting to the server, getting credentials, finding the right config), or slow fix execution (manual steps that could be automated).

Reducing diagnosis time

Monitoring data dramatically reduces diagnosis time when the failure is visible to an external probe. If AliveMCP shows that the failure started at 14:32 UTC, occurred at the transport layer (TCP connection refused), and recovered at 15:04 UTC, you know: (1) it was a connectivity failure, not an application-level failure; (2) it lasted 32 minutes; (3) the probe never got as far as the MCP protocol layer, so protocol handling is not the first suspect. This narrows the diagnosis to network or host issues (including a server process that is not listening) immediately.

Without external monitoring, diagnosis starts with "the server might be down" and requires logging in, checking the process, checking logs — adding 5–15 minutes to every incident before you even know what failed.

Per-layer probe data from MCP server health check gives you the failure layer as the first data point, which narrows the diagnosis space from "anything in the stack" to "the specific layer that failed."

Runbooks: pre-written recovery procedures

A runbook is a documented response to a specific failure mode. Instead of diagnosing under pressure, the on-call engineer matches the failure pattern to a runbook and executes a pre-verified recovery procedure. Runbooks reduce MTTR by eliminating the "what do I do now?" hesitation under stress.

Essential MCP server runbooks:

Server crash / OOM: check process state, restart command, verify probe passes, check memory limits.
Transport failure (TCP refused): check server process, check host status, check firewall/security group, check if deployment in progress.
HTTP 5xx: check application logs, check upstream dependencies, check recent deployments, rollback command.
Initialize failure: check MCP server version compatibility, check auth configuration, verify protocol version in initialize response.
Tools/list failure: check tool registration code, check tool definition schema validity, restart with debug logging.
SSL expiry: renew certificate, verify renewal, reload TLS configuration.
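One lightweight way to make these runbooks easy to reach during an incident is to key them by failure mode. The sketch below is a minimal illustration: the dictionary keys, the `runbook_for` helper, and the fallback step are hypothetical names introduced here, and the steps simply restate the list above.

```python
from typing import Dict, List

# Hypothetical failure-mode keys; steps restate the runbook list above.
RUNBOOKS: Dict[str, List[str]] = {
    "crash_oom": ["Check process state", "Run restart command",
                  "Verify probe passes", "Check memory limits"],
    "tcp_refused": ["Check server process", "Check host status",
                    "Check firewall/security group", "Check if deployment in progress"],
    "http_5xx": ["Check application logs", "Check upstream dependencies",
                 "Check recent deployments", "Run rollback command"],
    "initialize_failure": ["Check MCP server version compatibility", "Check auth configuration",
                           "Verify protocol version in initialize response"],
    "tools_list_failure": ["Check tool registration code", "Check tool definition schema validity",
                           "Restart with debug logging"],
    "ssl_expiry": ["Renew the certificate", "Verify renewal", "Reload TLS configuration"],
}

# Assumed fallback for failure modes that have no runbook yet.
FALLBACK: List[str] = ["No runbook matched: diagnose manually and write a runbook afterwards"]


def runbook_for(failure_mode: str) -> List[str]:
    """Return the ordered recovery steps for a failure mode, or the fallback."""
    return RUNBOOKS.get(failure_mode, FALLBACK)


for number, step in enumerate(runbook_for("tcp_refused"), start=1):
    print(f"{number}. {step}")
```

Keeping the steps in version control next to the server code makes it more likely they are updated when the deployment changes.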

Automatic restart policies

The simplest MTTR improvement for crash-based failures is automatic process restart. If your MCP server crashes and restarts in 5 seconds, the MTTR for crash-induced outages drops from "however long it takes a human to notice and act" to 5 seconds. Configure your process manager (systemd, Docker restart policy, Kubernetes liveness probe + restart policy) to restart on failure automatically.

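Systemd: set Restart=on-failure and RestartSec=5s in the [Service] section, plus StartLimitIntervalSec=300 and StartLimitBurst=5 in the [Unit] section. With these settings systemd gives up after 5 start attempts within 5 minutes, which prevents crash loops from consuming resources indefinitely. Docker: --restart=on-failure:5 (at most 5 restart attempts). Kubernetes: the default restartPolicy: Always already restarts crashed containers, with an increasing back-off delay. A liveness probe is not configured by default and must be added explicitly if you also want hung-but-running processes restarted.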
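The "give up after 5 attempts in 5 minutes" rule is a sliding-window rate limit on restarts. The sketch below illustrates the idea in Python for a custom supervisor; it is not how systemd or Docker implement it internally, and `RestartLimiter` is a hypothetical name.

```python
from collections import deque


class RestartLimiter:
    """Allow at most `burst` restarts within any `interval_s`-second window."""

    def __init__(self, burst: int = 5, interval_s: float = 300.0) -> None:
        self.burst = burst
        self.interval_s = interval_s
        self._starts = deque()  # timestamps of recent restarts, oldest first

    def allow_restart(self, now: float) -> bool:
        # Forget restarts that have aged out of the window.
        while self._starts and now - self._starts[0] >= self.interval_s:
            self._starts.popleft()
        if len(self._starts) >= self.burst:
            return False  # crash loop: stop restarting and leave it to a human
        self._starts.append(now)
        return True


limiter = RestartLimiter(burst=5, interval_s=300.0)
# A process crashing every 5 seconds: the first five restarts pass, the sixth is refused.
decisions = [limiter.allow_restart(t) for t in (0, 5, 10, 15, 20, 25)]
print(decisions)  # [True, True, True, True, True, False]
```

When the limiter refuses a restart, the failure is no longer a transient crash, so that is the point at which to alert rather than retry.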

Zero-downtime deployments

Deployments are the most common cause of planned downtime for MCP servers. Every deployment that restarts the server process creates a downtime event — even if it's only 30 seconds, it consumes error budget and disrupts active agent sessions. The solution is zero-downtime deployment.

Blue-green deployment

Run two identical server instances (blue and green). Blue is live and serving traffic. Deploy the new version to green. Run smoke tests against green. Switch the load balancer/DNS to point to green. Blue is now idle but still running. If green fails, switch back to blue in seconds. After green is stable for 10 minutes, terminate blue.
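The switch itself is a small state change guarded by a smoke test. The following sketch models it in memory; `BlueGreenRouter` and the `smoke_test` callable are hypothetical, and in a real deployment the swap would be a load balancer or DNS update rather than an attribute assignment.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which of two backends receives traffic."""

    def __init__(self, live: str, idle: str) -> None:
        self.live = live
        self.idle = idle

    def promote(self, smoke_test: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle backend only if it passes the smoke test."""
        if not smoke_test(self.idle):
            return False  # keep serving from the current live backend
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self) -> None:
        """Switch back to the previous backend, which is still running."""
        self.live, self.idle = self.idle, self.live


router = BlueGreenRouter(live="blue", idle="green")
router.promote(smoke_test=lambda backend: True)  # new version deployed to green passes
print(router.live)  # green
router.rollback()   # green misbehaves in production: switch back
print(router.live)  # blue
```

The rollback path is cheap only while the old instance is still running, which is why blue is kept alive until green has been stable for a while.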

Required infrastructure: a load balancer or reverse proxy that can switch between two backend instances (Caddy, nginx, HAProxy, cloud load balancer). For serverless deployments (Lambda, Cloud Run), the platform handles this via traffic split or revision routing.

Rolling deployment

For MCP servers behind a load balancer with multiple instances: update one instance at a time while the others continue serving traffic. The load balancer's health check gates traffic to each new instance, so only instances passing the health check receive traffic. If the new version fails the health check, the rolling update stops and the failed instance can be rolled back (automatically or manually, depending on your orchestrator). The remaining instances are still on the old version and serving traffic normally.

MCP health check for rolling deployment: GET /health endpoint that runs the initialize probe internally and returns 200 only if the server can successfully complete the initialize handshake. See MCP server health check.
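A minimal sketch of such an endpoint using only the Python standard library follows. It makes several assumptions: `run_initialize_probe` is a hypothetical placeholder for a real initialize handshake against your own server, the JSON body shape is invented for illustration, and 503 is an assumed choice of failure status (the text above only requires that 200 mean success).

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Tuple


def health_response(initialize_probe: Callable[[], bool]) -> Tuple[int, bytes]:
    """Return (status, body): 200 only if the initialize probe succeeds."""
    try:
        ok = initialize_probe()
    except Exception:
        ok = False  # a probe that raises counts as unhealthy
    body = json.dumps({"status": "ok" if ok else "unhealthy"}).encode()
    return (200 if ok else 503), body


def run_initialize_probe() -> bool:
    """Hypothetical placeholder: replace with a real initialize handshake."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(run_initialize_probe)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, serve the handler with `http.server.HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()` and point the load balancer health check at `/health`.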

Canary deployment

Route 5–10% of traffic to the new version for 10–30 minutes before promoting it to 100%. If the error rate on the canary instance exceeds the baseline, roll back automatically. This requires percentage-based traffic routing (Cloudflare Workers, AWS ALB weighted target groups, Argo Rollouts on Kubernetes). It is most appropriate for high-traffic production MCP servers where even a 1% post-deploy error rate represents meaningful user impact.
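The promote-or-rollback rule can be expressed as a small decision function. This is a sketch under assumptions: `canary_decision`, the 10-minute default soak time, and the minimum sample size of 100 requests are illustrative choices, not values prescribed by any of the platforms named above.

```python
def canary_decision(errors: int, requests: int, baseline_error_rate: float,
                    elapsed_min: float, soak_min: float = 10.0,
                    min_requests: int = 100) -> str:
    """Return "rollback", "promote", or "wait" for a running canary."""
    enough_data = requests >= min_requests
    if enough_data and errors / requests > baseline_error_rate:
        return "rollback"  # canary is worse than baseline
    if enough_data and elapsed_min >= soak_min:
        return "promote"   # healthy for the full soak period
    return "wait"          # too early or too little traffic to judge


print(canary_decision(errors=5, requests=200, baseline_error_rate=0.01, elapsed_min=3))
# rollback  (2.5% error rate against a 1% baseline)
print(canary_decision(errors=1, requests=200, baseline_error_rate=0.01, elapsed_min=12))
# promote   (0.5% error rate after the soak period)
```

The minimum-sample guard matters on low-traffic servers, where a single failed request would otherwise look like a large error rate.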

Graceful degradation patterns

When full availability isn't achievable, graceful degradation keeps some functionality available rather than providing none:

Cached tool definitions on tools/list failure

If your tools/list handler fails (database unreachable, dependency timeout), return the last successfully-fetched tool list from a local in-memory cache with a "stale" timestamp, rather than returning an error. Agent clients can use cached tool definitions for most queries; only operations requiring freshly updated tool schemas are affected. This converts a hard tools/list failure into a soft degradation.
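A minimal sketch of this fallback is shown below. `ToolListCache` and the `fetch_tools` callable are hypothetical, and the returned dictionary is an internal structure: how the stale marker is surfaced to clients (a log line, a metric, response metadata) is left open here.

```python
import time
from typing import Callable, Dict, List, Optional


class ToolListCache:
    """Serve the last good tool list when the live fetch fails."""

    def __init__(self, fetch_tools: Callable[[], List[dict]],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch_tools
        self._clock = clock
        self._tools: Optional[List[dict]] = None
        self._fetched_at: Optional[float] = None

    def list_tools(self) -> Dict[str, object]:
        try:
            tools = self._fetch()
        except Exception:
            if self._tools is None:
                raise  # nothing cached yet, so the error has to surface
            # Degraded path: last good list, marked stale with its fetch time.
            return {"tools": self._tools, "stale": True, "fetched_at": self._fetched_at}
        self._tools, self._fetched_at = tools, self._clock()
        return {"tools": tools, "stale": False, "fetched_at": self._fetched_at}
```

Usage: construct one cache per server process and call `list_tools()` from the tools/list handler. The cache is in-memory, so it is empty after a restart and the first fetch after a restart must succeed.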

Reduced capability mode

If a subset of your tools depend on an unavailable downstream service, return a tools/list that excludes those tools rather than failing entirely. Agents see fewer tools but can still use the available ones. Include a server-side log entry noting which tools were excluded and why — useful for post-incident review.
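The filtering step might look like the sketch below. The tool names, dependency names, and the `available_tools` helper are hypothetical; the dependency health set would come from whatever health tracking your server already has.

```python
import logging
from typing import Dict, List, Set

log = logging.getLogger("mcp.degraded")


def available_tools(tool_dependencies: Dict[str, Set[str]],
                    unhealthy: Set[str]) -> List[str]:
    """Return tools whose dependencies are all healthy, logging each exclusion."""
    available = []
    for tool, deps in tool_dependencies.items():
        blocked = deps & unhealthy
        if blocked:
            # Server-side record of what was excluded and why, for post-incident review.
            log.warning("excluding tool %s: unavailable dependencies %s",
                        tool, sorted(blocked))
        else:
            available.append(tool)
    return available


# Hypothetical tools and the downstream services each one needs.
deps = {"search": {"index"}, "echo": set(), "report": {"index", "db"}}
print(available_tools(deps, unhealthy={"db"}))  # ['search', 'echo']
```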

Circuit breakers on downstream dependencies

A circuit breaker is a pattern that stops sending requests to a failing downstream dependency after N consecutive failures, instead returning an immediate error (or degraded response) without waiting for the timeout. This prevents a slow/failed downstream from causing your entire MCP server to time out on every request. After a configurable recovery period, the circuit breaker enters "half-open" state and tries one request — if it succeeds, the circuit closes and normal operation resumes.

Implementation: use a circuit breaker library (opossum for Node.js, resilience4j for Java/Kotlin, pybreaker for Python) on each external API call within your tool handlers. At the probe layer, circuit breakers are transparent: the probe doesn't know whether a tool failure was due to the server or a tripped circuit breaker. Track circuit breaker state in your server's internal metrics to correlate with probe failures.
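To make the closed, open, and half-open states concrete, here is a hand-rolled, single-threaded sketch. It is for illustration only and does not reproduce the API of any of the libraries named above; the class and parameter names are hypothetical, and in production a maintained library is the safer choice.

```python
import time


class CircuitOpenError(Exception):
    """Raised immediately while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_s = recovery_s
        self._clock = clock
        self._failures = 0      # consecutive failures seen so far
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_s:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of waiting for the downstream timeout.
            raise CircuitOpenError("downstream marked unavailable")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed half-open trial, or N consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage: create one breaker per downstream dependency and wrap each outbound call as `breaker.call(fetch_from_api, arg)`, catching `CircuitOpenError` in the tool handler to return a degraded response. Exposing `breaker.state` as a metric gives you the correlation data mentioned above.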

Tracking reliability over time

Reliability engineering requires trend data, not just current status. Track these metrics month-over-month:

MTTD: for each incident, record when the failure started (first failed probe) and when the alert was acknowledged. Average over all incidents in the month.
MTTR: from acknowledgment to recovery (first passing probe after alert acknowledgment). Average over incidents.
Incident count: how many distinct incidents (not individual probe failures) per month. Trend should be down as reliability engineering compounds.
Error budget consumption: percentage of monthly SLO error budget consumed. If consistently >50%, the SLO is at risk. See MCP server SLO.
Deployment MTTR: separate from unplanned MTTR, track how much downtime each deployment causes. This is the primary target for zero-downtime deployment work.
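These monthly figures can be computed from a simple incident log. The sketch below assumes a hypothetical `Incident` record with three timestamps in seconds and a 30-day month for the error budget; the sample incidents are made-up illustration values, not measurements.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Incident:
    first_failed_probe_s: float   # when the failure was first observed
    acknowledged_s: float         # when a human acknowledged the alert
    first_passing_probe_s: float  # first passing probe after acknowledgment


def mttd_minutes(incidents: List[Incident]) -> float:
    """Mean time from first failed probe to acknowledgment, in minutes."""
    return mean(i.acknowledged_s - i.first_failed_probe_s for i in incidents) / 60


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean time from acknowledgment to recovery, in minutes."""
    return mean(i.first_passing_probe_s - i.acknowledged_s for i in incidents) / 60


def error_budget_consumed_pct(downtime_min: float, slo: float,
                              period_min: float = 30 * 24 * 60) -> float:
    """Share of the period's allowed downtime that has been used, in percent."""
    budget_min = (1 - slo) * period_min
    return 100 * downtime_min / budget_min


# Two made-up incidents for illustration.
month = [Incident(0, 240, 1440), Incident(1000, 1360, 2560)]
print(mttd_minutes(month), mttr_minutes(month))  # 5.0 20.0
```

For a 99.9% SLO over 30 days the budget is about 43.2 minutes, so `error_budget_consumed_pct(21.6, 0.999)` comes out at roughly 50, the threshold flagged above.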