MCP server reliability

Reliability isn't just "uptime percentage" — it's the combination of how quickly you detect failures (MTTD) and how quickly you restore service (MTTR). A server with 99.9% uptime but a 4-hour MTTR provides a worse user experience than one with 99.5% uptime and a 10-minute MTTR. Reliability engineering for MCP servers means designing for fast detection, fast recovery, and graceful degradation when full recovery isn't immediately possible.

TL;DR

Target MTTD under 5 minutes (achievable with 60-second probe cadence + 3-probe confirmation + immediate alert routing) and MTTR under 30 minutes (achievable with runbooks, automatic restart policies, and pre-rehearsed rollback procedures). The most impactful reliability investments for most MCP servers: zero-downtime deployments (blue-green or rolling), automatic process restart on crash, and monthly incident rehearsal. External monitoring with AliveMCP gives you MTTD data; your incident timeline gives you MTTR data. Improve both with each post-mortem.

MTTD — Mean Time to Detect

MTTD measures the gap between "the server failed" and "you know the server failed." Every minute of MTTD is a minute where agent sessions are silently failing and users have no visibility into why.

MTTD components

MTTD = probe detection delay + alert routing delay + human acknowledgment delay
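The sum above can be worked through with concrete numbers. The sketch below assumes the probe detection delay is roughly the probe interval multiplied by the number of confirming probes; the 30-second routing delay and 90-second acknowledgment delay are illustrative assumptions, not measurements.

```python
def probe_detection_delay_s(probe_interval_s: float, confirmations: int) -> float:
    """Approximate time for enough consecutive failing probes to confirm an outage."""
    return probe_interval_s * confirmations


def mttd_s(probe_interval_s: float, confirmations: int,
           routing_delay_s: float, ack_delay_s: float) -> float:
    """MTTD = probe detection delay + alert routing delay + human acknowledgment delay."""
    return (probe_detection_delay_s(probe_interval_s, confirmations)
            + routing_delay_s + ack_delay_s)


# Assumed delays: 30 s for alert routing, 90 s for a human to acknowledge.
fast = mttd_s(60, 3, routing_delay_s=30, ack_delay_s=90)
slow = mttd_s(300, 3, routing_delay_s=30, ack_delay_s=90)
print(fast / 60)  # 5.0 minutes with 60-second probes
print(slow / 60)  # 17.0 minutes with 5-minute probes
```

With these assumptions, probe cadence dominates the total: moving from 5-minute to 60-second probes removes 12 minutes, more than any realistic improvement in routing or acknowledgment.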

For a solo MCP author with no on-call rotation, MTTD during sleeping hours is effectively the time until you wake up and check your phone. Accept this as a constraint and design your server to recover automatically where possible (reducing the impact of high MTTD) rather than chasing low MTTD through a one-person on-call rotation that will inevitably lead to burnout.

Improving MTTD

See MCP server downtime alerting for the full alert configuration guide.

MTTR — Mean Time to Restore

MTTR measures the gap between "you know the server failed" and "the server is serving requests again." High MTTR usually comes from one of three causes: slow diagnosis (what broke?), slow access (getting to the server, getting credentials, finding the right config), or slow fix execution (manual steps that could be automated).
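Both metrics fall out of three timestamps per incident. The following is a minimal sketch, assuming you record when the server failed, when you learned about it, and when it was serving requests again; the `Incident` record and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass
class Incident:
    failed_at: datetime    # when the server actually failed
    detected_at: datetime  # when you knew it had failed
    restored_at: datetime  # when it was serving requests again


def mttd_minutes(incidents: List[Incident]) -> float:
    """Mean gap between failure and detection, in minutes."""
    return mean((i.detected_at - i.failed_at).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean gap between detection and restored service, in minutes."""
    return mean((i.restored_at - i.detected_at).total_seconds() for i in incidents) / 60


# Illustrative incident log, not real data.
log = [
    Incident(datetime(2025, 1, 6, 14, 32), datetime(2025, 1, 6, 14, 36), datetime(2025, 1, 6, 15, 4)),
    Incident(datetime(2025, 1, 20, 9, 0), datetime(2025, 1, 20, 9, 2), datetime(2025, 1, 20, 9, 14)),
]
print(mttd_minutes(log))  # 3.0
print(mttr_minutes(log))  # 20.0
```

Keeping detection and restoration as separate fields makes it possible to see which half of an incident is costing the most time.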

Reducing diagnosis time

Monitoring data dramatically reduces diagnosis time when the failure is visible to an external probe. If AliveMCP shows the failure started at 14:32 UTC, occurred at the transport layer (TCP connection refused), and recovered at 15:04 UTC, you know: (1) it was a connectivity failure, not an application failure; (2) it lasted 32 minutes; (3) the fault sat below the MCP protocol layer, since probes never got far enough to exercise it. This narrows the diagnosis to network/host issues immediately.

Without external monitoring, diagnosis starts with "the server might be down" and requires logging in, checking the process, checking logs — adding 5–15 minutes to every incident before you even know what failed.

Per-layer probe data from MCP server health check gives you the failure layer as the first data point, which narrows the diagnosis space from "anything in the stack" to "the specific layer that failed."

Runbooks: pre-written recovery procedures

A runbook is a documented response to a specific failure mode. Instead of diagnosing under pressure, the on-call engineer matches the failure pattern to a runbook and executes a pre-verified recovery procedure. Runbooks reduce MTTR by eliminating the "what do I do now?" hesitation under stress.

Essential MCP server runbooks:

Automatic restart policies

The simplest MTTR improvement for crash-based failures is automatic process restart. If your MCP server crashes and restarts in 5 seconds, the MTTR for crash-induced outages drops from "however long it takes a human to notice and act" to 5 seconds. Configure your process manager (systemd, Docker restart policy, Kubernetes liveness probe + restart policy) to restart on failure automatically.

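Systemd: set Restart=on-failure and RestartSec=5s in the [Service] section, and StartLimitIntervalSec=300 with StartLimitBurst=5 in the [Unit] section (systemd gives up after 5 start attempts within 5 minutes, which prevents crash loops from consuming resources indefinitely). Docker: --restart=on-failure:5 retries at most 5 times. Kubernetes: the default restartPolicy of Always already restarts crashed containers with exponential back-off; a liveness probe is only needed to also restart containers that hang without exiting.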
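The "give up after 5 attempts in 5 minutes" rule is a sliding-window limit. The sketch below illustrates that idea only; `RestartLimiter` is a hypothetical helper and does not reproduce the exact accounting of systemd or Docker.

```python
from collections import deque


class RestartLimiter:
    """Allow at most `burst` restarts within any `interval_s`-second window."""

    def __init__(self, burst: int = 5, interval_s: float = 300.0):
        self.burst = burst
        self.interval_s = interval_s
        self._starts = deque()  # timestamps of restarts still inside the window

    def allow_restart(self, now_s: float) -> bool:
        # Forget restarts that have aged out of the window.
        while self._starts and now_s - self._starts[0] >= self.interval_s:
            self._starts.popleft()
        if len(self._starts) >= self.burst:
            return False  # crash loop: stop restarting and leave it to a human
        self._starts.append(now_s)
        return True


limiter = RestartLimiter()
# A crash every 5 seconds: the first five restarts are allowed, the sixth is refused.
print([limiter.allow_restart(t) for t in (0, 5, 10, 15, 20, 25)])
```

The limit matters because a server that crashes on startup will otherwise restart forever, hiding a real outage behind a process that looks alive to the process manager.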

Zero-downtime deployments

Deployments are the most common cause of planned downtime for MCP servers. Every deployment that restarts the server process creates a downtime event — even if it's only 30 seconds, it consumes error budget and disrupts active agent sessions. The solution is zero-downtime deployment.

Blue-green deployment

Run two identical server instances (blue and green). Blue is live and serving traffic. Deploy the new version to green. Run smoke tests against green. Switch the load balancer/DNS to point to green. Blue is now idle but still running. If green fails, switch back to blue in seconds. After green is stable for 10 minutes, terminate blue.
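The cut-over logic can be sketched in a few lines. This is an illustration of the sequence only, assuming a hypothetical `BlueGreenRouter` that stands in for your load balancer or DNS switch and a `smoke_test` callable that you supply.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which of two instances is live; a stand-in for a real load balancer."""

    def __init__(self, live: str = "blue", idle: str = "green"):
        self.live = live
        self.idle = idle

    def cut_over(self, smoke_test: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle instance only if it passes the smoke test."""
        if not smoke_test(self.idle):
            return False  # live instance keeps serving; nothing changed
        self.live, self.idle = self.idle, self.live
        return True

    def roll_back(self) -> None:
        """Point traffic back at the previous instance, which is still running."""
        self.live, self.idle = self.idle, self.live


router = BlueGreenRouter()
router.cut_over(lambda name: True)  # green passed its smoke tests
print(router.live)                  # green
router.roll_back()                  # green misbehaved after the switch
print(router.live)                  # blue
```

Rollback is just another swap, which is why it takes seconds: the old instance was never stopped.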

Required infrastructure: a load balancer or reverse proxy that can switch between two backend instances (Caddy, nginx, HAProxy, cloud load balancer). For serverless deployments (Lambda, Cloud Run), the platform handles this via traffic split or revision routing.

Rolling deployment

For MCP servers behind a load balancer with multiple instances: update one instance at a time while others continue serving traffic. The load balancer's health check gates traffic to each new instance. Only instances passing the health check receive traffic. If the new version fails the health check, the rolling update stops and the failed instance is rolled back — the remaining instances are still on the old version and serving traffic normally.
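A rolling update reduces to a loop with a health gate. The sketch below assumes a hypothetical in-memory map of instance names to versions and an `is_healthy` callable; a real orchestrator performs these steps for you.

```python
from typing import Callable, Dict


def rolling_update(versions: Dict[str, str], new_version: str,
                   is_healthy: Callable[[str, str], bool]) -> bool:
    """Update one instance at a time; stop and roll back on the first failure."""
    for instance in list(versions):
        old_version = versions[instance]
        versions[instance] = new_version
        if not is_healthy(instance, new_version):
            versions[instance] = old_version  # roll back only the failed instance
            return False                      # untouched instances stay on the old version
    return True


fleet = {"a": "v1", "b": "v1", "c": "v1"}
# Hypothetical failure: instance "b" does not pass its health check on v2.
ok = rolling_update(fleet, "v2", lambda instance, version: instance != "b")
print(ok, fleet)
```

In this run the update stops at "b": "a" is already on v2, "b" is restored to v1, and "c" was never touched, so the fleet keeps serving throughout.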

Health check for rolling deployments: expose a GET /health endpoint that runs the initialize probe internally and returns 200 only if the server can successfully complete the initialize handshake. See MCP server health check.
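A minimal sketch of such an endpoint using only the Python standard library. `probe_initialize` is a hypothetical placeholder for whatever performs the initialize handshake against your own server, and returning 503 on failure is an assumption; the only requirement stated above is that 200 is reserved for success.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple


def probe_initialize() -> bool:
    """Hypothetical placeholder: run the MCP initialize handshake and report success."""
    return True


def health_response(probe: Callable[[], bool]) -> Tuple[int, bytes]:
    """Map the probe outcome to an HTTP status and JSON body."""
    try:
        ok = probe()
    except Exception:
        ok = False  # a probe that raises counts as unhealthy
    status = 200 if ok else 503
    body = json.dumps({"status": "ok" if ok else "unhealthy"}).encode()
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(probe_initialize)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve it: HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

Keeping the status decision in `health_response` makes it easy to unit-test without starting a server.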

Canary deployment

Route 5–10% of traffic to the new version for 10–30 minutes before promoting it to 100%. If the error rate on the canary instance exceeds the baseline, roll back automatically. Requires percentage-based traffic routing (Cloudflare Workers, AWS ALB weighted target groups, Argo Rollouts on Kubernetes). Most appropriate for high-traffic production MCP servers where even a 1% post-deploy error rate represents meaningful user impact.
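The two moving parts of a canary are weighted routing and the promote-or-roll-back decision. The sketch below shows both in isolation; the function names, the `tolerance` parameter, and the "wait" outcome are assumptions for illustration, and a real setup would delegate routing to the platform.

```python
import random


def pick_version(canary_fraction: float, rng: random.Random) -> str:
    """Send roughly `canary_fraction` of requests to the canary."""
    return "canary" if rng.random() < canary_fraction else "stable"


def canary_decision(canary_errors: int, canary_requests: int,
                    baseline_error_rate: float, tolerance: float = 0.0) -> str:
    """Evaluate the canary at the end of the observation window."""
    if canary_requests == 0:
        return "wait"  # no traffic yet, so no evidence either way
    error_rate = canary_errors / canary_requests
    if error_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "promote"


print(canary_decision(canary_errors=3, canary_requests=100, baseline_error_rate=0.01))  # rollback
print(canary_decision(canary_errors=0, canary_requests=100, baseline_error_rate=0.01))  # promote
```

On a low-traffic server the canary may see too few requests for the error rate to mean anything, which is one reason this pattern suits high-traffic deployments.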

Graceful degradation patterns

When full availability isn't achievable, graceful degradation keeps some functionality available rather than providing none:

Cached tool definitions on tools/list failure

If your tools/list handler fails (database unreachable, dependency timeout), return the last successfully-fetched tool list from a local in-memory cache with a "stale" timestamp, rather than returning an error. Agent clients can use cached tool definitions for most queries; only operations requiring freshly updated tool schemas are affected. This converts a hard tools/list failure into a soft degradation.
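One way to sketch the fallback, assuming a hypothetical `fetch_tools` callable that builds the tool list from your backing store. The `stale` and `cached_at` fields are illustrative; how you surface staleness to clients or logs is up to your server.

```python
import time
from typing import Callable, Dict, List, Optional


class ToolListCache:
    """Serve the last good tool list when the live fetch fails."""

    def __init__(self, fetch_tools: Callable[[], List[dict]]):
        self._fetch_tools = fetch_tools
        self._cached: Optional[List[dict]] = None
        self._cached_at: Optional[float] = None

    def list_tools(self) -> Dict[str, object]:
        try:
            tools = self._fetch_tools()
        except Exception:
            if self._cached is None:
                raise  # nothing cached yet, so the failure has to surface
            return {"tools": self._cached, "stale": True, "cached_at": self._cached_at}
        # Successful fetch: refresh the cache and return fresh data.
        self._cached, self._cached_at = tools, time.time()
        return {"tools": tools, "stale": False, "cached_at": self._cached_at}
```

The cache lives in process memory, so it is empty after a restart; a failure before the first successful fetch still produces an error.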

Reduced capability mode

If a subset of your tools depend on an unavailable downstream service, return a tools/list that excludes those tools rather than failing entirely. Agents see fewer tools but can still use the available ones. Include a server-side log entry noting which tools were excluded and why — useful for post-incident review.
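A sketch of the filtering step, assuming each tool definition carries a hypothetical `depends_on` list naming its downstream services and that you supply a `dependency_up` check. Neither is part of MCP itself.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("mcp.degraded")


def available_tools(tools: List[dict], dependency_up: Callable[[str], bool]) -> List[dict]:
    """Return only the tools whose downstream dependencies are all reachable."""
    kept = []
    for tool in tools:
        down = [dep for dep in tool.get("depends_on", []) if not dependency_up(dep)]
        if down:
            # Server-side record of what was excluded and why, for post-incident review.
            logger.warning("excluding tool %s: dependencies down: %s",
                           tool["name"], ", ".join(down))
        else:
            kept.append(tool)
    return kept


tools = [
    {"name": "search", "depends_on": ["index-service"]},
    {"name": "echo"},  # no downstream dependencies
]
print([t["name"] for t in available_tools(tools, lambda dep: dep != "index-service")])
```

Here only `echo` is returned while the hypothetical index service is down, and `search` reappears as soon as the dependency check passes again.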

Circuit breakers on downstream dependencies

A circuit breaker is a pattern that stops sending requests to a failing downstream dependency after N consecutive failures, instead returning an immediate error (or degraded response) without waiting for the timeout. This prevents a slow/failed downstream from causing your entire MCP server to time out on every request. After a configurable recovery period, the circuit breaker enters "half-open" state and tries one request — if it succeeds, the circuit closes and normal operation resumes.

Implementation: use a circuit breaker library (opossum for Node.js, resilience4j for Java/Kotlin, pybreaker for Python) on each external API call within your tool handlers. Retry libraries such as tenacity are not circuit breakers: they repeat failing calls rather than stopping them. At the probe layer, circuit breakers are transparent: the probe doesn't know whether a tool failure was due to the server or a tripped circuit breaker. Track circuit breaker state in your server's internal metrics to correlate with probe failures.
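To make the closed, open, and half-open states concrete, here is a hand-rolled, single-threaded sketch. It is meant to show the state machine, not to replace a maintained library; the class name, defaults, and injectable `clock` are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised immediately while the circuit is open, without calling downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_s = recovery_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_s:
            return "half-open"  # recovery period over: allow a trial request
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("downstream circuit is open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or re-open after a failed trial request.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Inside a tool handler you would wrap each downstream call, for example `breaker.call(fetch_from_api, query)`, and translate `CircuitOpenError` into a degraded response. The sketch does not guard against concurrent trial requests in the half-open state, which a production library handles for you.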

Tracking reliability over time

Reliability engineering requires trend data, not just current status. Track these metrics month-over-month:

Related questions

What's a realistic MTTD target for an indie MCP server?

A 5-minute MTTD is achievable with a 60-second probe cadence and push notification alerting. At a 5-minute probe cadence with the same 3-probe confirmation, detection alone takes 15 minutes, so that becomes the floor. For a solo developer running a server as a side project without 24/7 on-call coverage, accept that after-hours MTTD will be long (hours) and invest instead in automatic restart policies to reduce MTTR for the failure modes that auto-recover. Focus your MTTD improvements on business hours, where fast detection actually results in fast human response.

How do I reduce MTTR for a server I can't SSH into directly?

Managed platforms (Railway, Render, Fly.io, Lambda) don't always provide direct SSH. Your MTTR tools: the platform's web dashboard or CLI for restarts, log tail for diagnosis, environment variable updates for config changes, and git push for code changes. Most platform failures have a faster recovery path than manual SSH: trigger a new deployment (often 60–90 seconds on Railway), scale to zero and back up (Cloud Run), or click "restart service" in the dashboard. Pre-authenticate your CLI tools before an incident — running railway login for the first time during an outage adds unnecessary recovery delay.

Should I focus more on MTTD or MTTR?

MTTR has more impact on user experience in most cases. The difference between 3-minute MTTD and 5-minute MTTD is 2 minutes — most users won't notice. The difference between 10-minute MTTR and 90-minute MTTR is 80 minutes of lost availability. Invest in MTTR first (runbooks, automatic restarts, zero-downtime deployments), then invest in MTTD once your recovery procedures are solid. The exception: if you're running a workflow where even 2 minutes of missed-detection time causes significant damage (think: automated financial transactions or critical infrastructure), MTTD becomes more important.

Further reading