MCP server reliability
Reliability isn't just "uptime percentage" — it's the combination of how quickly you detect failures (MTTD) and how quickly you restore service (MTTR). A server with 99.9% uptime but a 4-hour MTTR provides a worse user experience than one with 99.5% uptime and a 10-minute MTTR. Reliability engineering for MCP servers means designing for fast detection, fast recovery, and graceful degradation when full recovery isn't immediately possible.
TL;DR
Target MTTD under 5 minutes (achievable with 60-second probe cadence + 3-probe confirmation + immediate alert routing) and MTTR under 30 minutes (achievable with runbooks, automatic restart policies, and pre-rehearsed rollback procedures). The most impactful reliability investments for most MCP servers: zero-downtime deployments (blue-green or rolling), automatic process restart on crash, and monthly incident rehearsal. External monitoring with AliveMCP gives you MTTD data; your incident timeline gives you MTTR data. Improve both with each post-mortem.
MTTD — Mean Time to Detect
MTTD measures the gap between "the server failed" and "you know the server failed." Every minute of MTTD is a minute where agent sessions are silently failing and users have no visibility into why.
MTTD components
MTTD = probe detection delay + alert routing delay + human acknowledgment delay
- Probe detection delay: at 60-second cadence with 3-probe confirmation, your monitoring system detects a failure in 3 minutes at most. At 5-minute cadence, detection delay is up to 15 minutes. Cadence is the single biggest lever on probe detection delay.
- Alert routing delay: the time between the monitoring system detecting the failure and a notification reaching someone who can act. With a correctly configured PagerDuty or Slack webhook this should be under 30 seconds, which is negligible in practice.
- Human acknowledgment delay: the time between the alert arriving and someone actively working the incident. Depends on on-call rotation coverage and working hours. During business hours with an on-call SRE, this can be 1–2 minutes. At 3am without an on-call rotation, it might be hours. This is the component most teams underinvest in.
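The component sum above can be sketched as a small worst-case calculation. This is an illustrative sketch only: the function name `worst_case_mttd_seconds` and its parameters are hypothetical, and the example inputs are the figures quoted in the list (30-second routing, a 1-minute business-hours acknowledgment).

```python
def worst_case_mttd_seconds(
    probe_interval_s: float,
    confirmations: int,
    routing_delay_s: float,
    ack_delay_s: float,
) -> float:
    """Worst-case MTTD = probe detection + alert routing + human acknowledgment."""
    # Worst case: the failure starts just after a passing probe, so a full
    # interval elapses before each of the confirming failed probes.
    probe_detection_s = probe_interval_s * confirmations
    return probe_detection_s + routing_delay_s + ack_delay_s


# 60-second cadence, 3-probe confirmation, 30 s routing, 60 s acknowledgment
fast = worst_case_mttd_seconds(60, 3, 30, 60)
# 5-minute cadence with the same routing and acknowledgment delays
slow = worst_case_mttd_seconds(300, 3, 30, 60)
print(fast / 60, slow / 60)  # minutes: 4.5 16.5
```

With these inputs the probe interval dominates the total, which is why cadence is the first thing to tune.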
For a solo MCP author with no on-call rotation: your MTTD during sleeping hours is effectively until you wake up and check your phone. Accept this as a constraint and design your server to recover automatically where possible (reducing the impact of high MTTD) rather than trying to achieve low MTTD through a one-person on-call rotation that will inevitably burn out.
Improving MTTD
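- Shorten the probe interval from 5 minutes to 60 seconds. With 3-probe confirmation, worst-case probe detection delay drops from 15 minutes to 3 minutes, a 5× improvement in that component.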
- Switch from email alerts to push notifications or SMS (reduces human acknowledgment delay from 30 minutes to 2 minutes for business-hours incidents).
- Add process-level alerting alongside external probing. Your server, or the process manager supervising it, should log and alert on crashes and OOM kills; these are visible on the host as soon as they happen, before an external probe has had time to confirm the failure.
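One way to sketch process-level crash alerting in a Python server is an uncaught-exception hook. This is a minimal sketch under assumptions: `install_crash_alert` and the `notify` callable are hypothetical names, and `notify` stands in for whatever alert channel you already use (a pager or chat webhook, for example).

```python
import logging
import sys
from typing import Callable

logger = logging.getLogger("mcp_server")


def install_crash_alert(notify: Callable[[str], None]) -> None:
    """Log and alert on any uncaught exception before the process exits."""

    def handle(exc_type, exc, tb):
        message = f"MCP server crashed: {exc_type.__name__}: {exc}"
        logger.critical(message, exc_info=(exc_type, exc, tb))
        try:
            notify(message)
        except Exception:
            # Never let a failing alert channel mask the original crash.
            logger.exception("crash alert delivery failed")
        # Fall through to the default hook so the traceback still prints.
        sys.__excepthook__(exc_type, exc, tb)

    sys.excepthook = handle
```

Note that an OOM kill arrives as SIGKILL, which a process cannot catch, so this hook never sees it. That case has to be reported from outside the process, for example by the process manager or from kernel logs.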
See MCP server downtime alerting for the full alert configuration guide.
MTTR — Mean Time to Restore
MTTR measures the gap between "you know the server failed" and "the server is serving requests again." High MTTR usually comes from one of three causes: slow diagnosis (what broke?), slow access (getting to the server, getting credentials, finding the right config), or slow fix execution (manual steps that could be automated).
Reducing diagnosis time
Monitoring data dramatically reduces diagnosis time when the failure is visible to an external probe. If AliveMCP shows that the failure started at 14:32 UTC, occurred at the transport layer (TCP refused), and recovered at 15:04 UTC, you know: (1) nothing was accepting connections, so it was a connectivity-level failure rather than an application error; (2) it lasted 32 minutes; (3) the probe never reached the MCP protocol layer, so the handshake and tool code are not the first suspects. This narrows the diagnosis to the network, the host, or a process that is not listening.
Without external monitoring, diagnosis starts with "the server might be down" and requires logging in, checking the process, checking logs — adding 5–15 minutes to every incident before you even know what failed.
Per-layer probe data from MCP server health check gives you the failure layer as the first data point, which narrows the diagnosis space from "anything in the stack" to "the specific layer that failed."
Runbooks: pre-written recovery procedures
A runbook is a documented response to a specific failure mode. Instead of diagnosing under pressure, the on-call engineer matches the failure pattern to a runbook and executes a pre-verified recovery procedure. Runbooks reduce MTTR by eliminating the "what do I do now?" hesitation under stress.
Essential MCP server runbooks:
- Server crash / OOM: check process state, restart command, verify probe passes, check memory limits.
- Transport failure (TCP refused): check server process, check host status, check firewall/security group, check if deployment in progress.
- HTTP 5xx: check application logs, check upstream dependencies, check recent deployments, rollback command.
- Initialize failure: check MCP server version compatibility, check auth configuration, verify protocol version in initialize response.
- Tools/list failure: check tool registration code, check tool definition schema validity, restart with debug logging.
- SSL expiry: renew certificate, verify renewal, reload TLS configuration.
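The runbook list above can be kept as data, so that an alert links straight to the matching checklist. This is an illustrative sketch: the failure keys and the `runbook_for` helper are hypothetical names, and the steps are copied from the list.

```python
RUNBOOKS: dict[str, list[str]] = {
    "crash_oom": ["check process state", "restart command",
                  "verify probe passes", "check memory limits"],
    "tcp_refused": ["check server process", "check host status",
                    "check firewall/security group",
                    "check if deployment in progress"],
    "http_5xx": ["check application logs", "check upstream dependencies",
                 "check recent deployments", "rollback command"],
    "initialize_failure": ["check MCP server version compatibility",
                           "check auth configuration",
                           "verify protocol version in initialize response"],
    "tools_list_failure": ["check tool registration code",
                           "check tool definition schema validity",
                           "restart with debug logging"],
    "ssl_expiry": ["renew certificate", "verify renewal",
                   "reload TLS configuration"],
}


def runbook_for(failure: str) -> list[str]:
    """Return the ordered steps for a failure key, or [] if none is written yet."""
    return RUNBOOKS.get(failure, [])


for step_number, step in enumerate(runbook_for("tcp_refused"), start=1):
    print(f"{step_number}. {step}")
```

An empty result is a useful signal in itself: a failure mode without a runbook is a candidate for the next post-mortem.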
Automatic restart policies
The simplest MTTR improvement for crash-based failures is automatic process restart. If your MCP server crashes and restarts in 5 seconds, the MTTR for crash-induced outages drops from "however long it takes a human to notice and act" to 5 seconds. Configure your process manager (systemd, Docker restart policy, Kubernetes liveness probe + restart policy) to restart on failure automatically.
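Systemd: Restart=on-failure with RestartSec=5s, plus StartLimitIntervalSec=300 and StartLimitBurst=5 in the [Unit] section (systemd gives up after 5 start attempts within 5 minutes, which prevents crash loops from consuming resources indefinitely). Docker: --restart=on-failure:5. Kubernetes: the default restartPolicy: Always already restarts crashed containers with exponential back-off; add a liveness probe to also restart a process that hangs without exiting.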
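The restart-with-a-limit behaviour these settings describe can be sketched as a small supervisor loop. This is a simplified model for illustration, not how systemd or Docker implement it; `supervise` and its parameters are hypothetical, and in production you would rely on the process manager itself.

```python
import subprocess
import time
from typing import Callable


def _run(cmd: list[str]) -> int:
    """Run the command to completion and return its exit code."""
    return subprocess.run(cmd).returncode


def supervise(
    cmd: list[str],
    burst: int = 5,
    window_s: float = 300.0,
    restart_delay_s: float = 5.0,
    run: Callable[[list[str]], int] = _run,
) -> int:
    """Restart cmd on non-zero exit; give up after `burst` starts in `window_s`."""
    if burst < 1:
        raise ValueError("burst must be at least 1")
    starts: list[float] = []
    code = 0
    while True:
        now = time.monotonic()
        # Only starts inside the rate-limit window count toward the burst.
        starts = [t for t in starts if now - t < window_s]
        if len(starts) >= burst:
            return code  # crash loop detected: stop restarting
        starts.append(now)
        code = run(cmd)
        if code == 0:
            return 0  # clean exit, nothing to restart
        time.sleep(restart_delay_s)
```

The `run` parameter exists only so the loop can be exercised without spawning a real process; the defaults mirror the 5-second delay and 5-starts-in-5-minutes limit quoted above.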
Zero-downtime deployments
Deployments are the most common cause of planned downtime for MCP servers. Every deployment that restarts the server process creates a downtime event — even if it's only 30 seconds, it consumes error budget and disrupts active agent sessions. The solution is zero-downtime deployment.
Blue-green deployment
Run two identical server instances (blue and green). Blue is live and serving traffic. Deploy the new version to green. Run smoke tests against green. Switch the load balancer/DNS to point to green. Blue is now idle but still running. If green fails, switch back to blue in seconds. After green is stable for 10 minutes, terminate blue.
Required infrastructure: a load balancer or reverse proxy that can switch between two backend instances (Caddy, nginx, HAProxy, cloud load balancer). For serverless deployments (Lambda, Cloud Run), the platform handles this via traffic split or revision routing.
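The switch-and-rollback logic of blue-green deployment can be sketched independently of any particular load balancer. In this hedged sketch, `BlueGreenRouter` and the `smoke_test` callable are hypothetical; a real implementation would update the proxy or load balancer configuration instead of swapping two strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueGreenRouter:
    live: str = "blue"   # instance currently receiving traffic
    idle: str = "green"  # instance that receives the new version

    def promote(self, smoke_test: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle instance only if its smoke tests pass."""
        if not smoke_test(self.idle):
            return False  # live instance keeps serving; nothing changed
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous instance, which is still running."""
        self.live, self.idle = self.idle, self.live


router = BlueGreenRouter()
if router.promote(smoke_test=lambda instance: True):
    print(f"traffic now on {router.live}; {router.idle} kept as fallback")
```

Rollback is cheap only while the previous instance is still running, which is why the old instance is terminated after the stability window rather than immediately.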
Rolling deployment
For MCP servers behind a load balancer with multiple instances: update one instance at a time while others continue serving traffic. The load balancer's health check gates traffic to each new instance. Only instances passing the health check receive traffic. If the new version fails the health check, the rolling update stops and the failed instance is rolled back — the remaining instances are still on the old version and serving traffic normally.
MCP health check for rolling deployment: expose a GET /health endpoint that runs the initialize probe internally and returns 200 only if the server can successfully complete the initialize handshake. See MCP server health check.
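A /health endpoint of this shape might look like the following standard-library sketch. The `probe` callable is an assumption: it stands in for code that runs the initialize handshake against your own server and returns True on success, which is not shown here. `health_status` and `make_health_server` are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def health_status(probe: Callable[[], bool]) -> tuple[int, bytes]:
    """Map the initialize probe result to an HTTP status and body."""
    try:
        healthy = probe()
    except Exception:
        healthy = False  # a probe that raises counts as unhealthy
    return (200, b"ok") if healthy else (503, b"initialize probe failed")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_status(self.server.probe)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep load balancer polling out of the access log


def make_health_server(probe: Callable[[], bool],
                       host: str = "127.0.0.1", port: int = 8080) -> HTTPServer:
    server = HTTPServer((host, port), HealthHandler)
    server.probe = probe  # read by HealthHandler on each request
    return server
```

Calling `make_health_server(my_probe).serve_forever()` would start it. Returning 503 instead of raising lets the load balancer take the instance out of rotation without the health endpoint itself going down.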
Canary deployment
Route 5–10% of traffic to the new version for 10–30 minutes before promoting it to 100%. If error rate on the canary instance exceeds baseline, roll back automatically. Requires traffic percentage routing (Cloudflare Workers, AWS ALB weighted target groups, Kubernetes Argo Rollouts). Most appropriate for high-traffic production MCP servers where even a 1% post-deploy error rate represents meaningful user impact.
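The automatic rollback rule can be sketched as a comparison of the canary's error rate against the baseline. This is an illustrative sketch: `canary_decision`, the `tolerance` margin, and the `min_requests` guard are assumptions added so that a handful of early requests does not trigger a rollback.

```python
def canary_decision(
    canary_errors: int,
    canary_requests: int,
    baseline_error_rate: float,
    tolerance: float = 0.0,
    min_requests: int = 100,
) -> str:
    """Return "wait", "rollback", or "healthy" for the current canary window."""
    if canary_requests < min_requests:
        return "wait"  # too little traffic to judge the new version
    canary_rate = canary_errors / canary_requests
    if canary_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "healthy"


# Baseline error rate of 0.5%; the canary served 1,000 requests with 12 errors.
print(canary_decision(12, 1000, baseline_error_rate=0.005))  # rollback
```

A "healthy" result only means the canary is not worse than baseline so far; promotion to 100% still waits for the full observation window.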
Graceful degradation patterns
When full availability isn't achievable, graceful degradation keeps some functionality available rather than providing none:
Cached tool definitions on tools/list failure
If your tools/list handler fails (database unreachable, dependency timeout), return the last successfully-fetched tool list from a local in-memory cache with a "stale" timestamp, rather than returning an error. Agent clients can use cached tool definitions for most queries; only operations requiring freshly updated tool schemas are affected. This converts a hard tools/list failure into a soft degradation.
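A minimal sketch of the stale-cache fallback follows, assuming a hypothetical `fetch` callable that loads tool definitions from wherever they live. The `stale` and `fetched_at` fields are illustrative; how the staleness marker is surfaced to clients (or only logged server-side) is left as an assumption.

```python
import time
from typing import Callable, Optional


class ToolListCache:
    """Serve the last good tool list when the live fetch fails."""

    def __init__(self, fetch: Callable[[], list[dict]]) -> None:
        self._fetch = fetch
        self._tools: Optional[list[dict]] = None
        self._fetched_at: Optional[float] = None

    def list_tools(self) -> dict:
        try:
            tools = self._fetch()
        except Exception:
            if self._tools is None:
                raise  # nothing cached yet: surface the real error
            # Degrade softly: return the cached list, marked as stale.
            return {"tools": self._tools, "stale": True,
                    "fetched_at": self._fetched_at}
        self._tools = tools
        self._fetched_at = time.time()
        return {"tools": tools, "stale": False, "fetched_at": self._fetched_at}
```

The cache is only updated after a successful fetch, so a failing dependency can never overwrite the last good list.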
Reduced capability mode
If a subset of your tools depend on an unavailable downstream service, return a tools/list that excludes those tools rather than failing entirely. Agents see fewer tools but can still use the available ones. Include a server-side log entry noting which tools were excluded and why — useful for post-incident review.
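Reduced capability mode can be sketched as a filter over the tool list. The `depends_on` field and the `dependency_up` map are hypothetical: they assume you track, per tool, which downstream services it needs and whether each one is currently reachable.

```python
import logging

logger = logging.getLogger("mcp_server")


def available_tools(tools: list[dict], dependency_up: dict[str, bool]) -> list[dict]:
    """Return only the tools whose downstream dependencies are all reachable."""
    kept = []
    for tool in tools:
        # A dependency with no recorded status is treated as down.
        down = [dep for dep in tool.get("depends_on", [])
                if not dependency_up.get(dep, False)]
        if down:
            # Record what was excluded and why, for post-incident review.
            logger.warning("excluding tool %s: unavailable dependencies: %s",
                           tool["name"], ", ".join(down))
        else:
            kept.append(tool)
    return kept


tools = [
    {"name": "search_docs", "depends_on": ["search_api"]},
    {"name": "echo", "depends_on": []},
]
print([t["name"] for t in available_tools(tools, {"search_api": False})])  # ['echo']
```

In a real handler you would likely strip the internal `depends_on` field before returning the list to clients.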
Circuit breakers on downstream dependencies
A circuit breaker is a pattern that stops sending requests to a failing downstream dependency after N consecutive failures, instead returning an immediate error (or degraded response) without waiting for the timeout. This prevents a slow/failed downstream from causing your entire MCP server to time out on every request. After a configurable recovery period, the circuit breaker enters "half-open" state and tries one request — if it succeeds, the circuit closes and normal operation resumes.
Implementation: use a circuit breaker library (opossum for Node.js, resilience4j for Java/Kotlin, pybreaker for Python) on each external API call within your tool handlers. Note that retry libraries such as tenacity solve a different problem: they repeat failed calls rather than stopping them. At the probe layer, circuit breakers are transparent: the probe doesn't know whether a tool failure was due to the server or a tripped circuit breaker. Track circuit breaker state in your server's internal metrics to correlate with probe failures.
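To make the closed, open, and half-open states concrete, here is a minimal single-threaded sketch of the pattern. It is not the API of any of the libraries named above; `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` are hypothetical, and a production server would more likely use a maintained library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised immediately while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_s = recovery_s
        self._clock = clock
        self._failures = 0  # consecutive failures since the last success
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_s:
            return "half-open"  # recovery period over: allow a trial request
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            # Fail fast instead of waiting for the downstream timeout.
            raise CircuitOpenError("downstream dependency unavailable")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial, or N consecutive failures, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

A tool handler would wrap each external call, for example `breaker.call(lambda: fetch_from_api(query))`, where `fetch_from_api` is a placeholder for your own client code. This sketch does not limit half-open to a single in-flight request under concurrency, which the libraries above handle for you.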
Tracking reliability over time
Reliability engineering requires trend data, not just current status. Track these metrics month-over-month:
- MTTD: for each incident, record when the failure started (first failed probe) and when the alert was acknowledged. Average over all incidents in the month.
- MTTR: from acknowledgment to recovery (first passing probe after alert acknowledgment). Average over incidents.
- Incident count: how many distinct incidents (not individual probe failures) per month. Trend should be down as reliability engineering compounds.
- Error budget consumption: percentage of monthly SLO error budget consumed. If consistently >50%, the SLO is at risk. See MCP server SLO.
- Deployment MTTR: separate from unplanned MTTR, track how much downtime each deployment causes. This is the primary target for zero-downtime deployment work.
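The monthly numbers above can be computed from three timestamps per incident. This is a sketch under assumptions: the `Incident` record and `monthly_report` function are hypothetical, and the 99.9% SLO over a 30-day month is an example target, not a value taken from this guide.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    first_failed_probe: datetime   # failure start
    acknowledged: datetime         # a human starts working the incident
    first_passing_probe: datetime  # recovery


def monthly_report(incidents: list[Incident], slo: float = 0.999,
                   days: int = 30) -> dict:
    """Summarise MTTD, MTTR, incident count, and error budget use, in minutes."""
    def minutes(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 60

    count = len(incidents)
    detect = [minutes(i.first_failed_probe, i.acknowledged) for i in incidents]
    restore = [minutes(i.acknowledged, i.first_passing_probe) for i in incidents]
    downtime = sum(detect) + sum(restore)
    budget = days * 24 * 60 * (1 - slo)  # allowed downtime for the month
    return {
        "incident_count": count,
        "mttd_min": sum(detect) / count if count else 0.0,
        "mttr_min": sum(restore) / count if count else 0.0,
        "budget_consumed_pct": 100 * downtime / budget,
    }


report = monthly_report([
    Incident(datetime(2025, 1, 10, 14, 32), datetime(2025, 1, 10, 14, 37),
             datetime(2025, 1, 10, 15, 4)),
])
print(report)
```

A single 32-minute outage already consumes most of a 99.9% monthly budget (about 43 minutes), which is why deployment downtime is worth tracking separately.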
Related questions
What's a realistic MTTD target for an indie MCP server?
5 minutes is achievable with 60-second probe cadence and push notification alerting. At 5-minute probe cadence with 3-probe confirmation, probe detection alone takes 10 to 15 minutes, so a 5-minute MTTD is out of reach. For a solo developer running a server as a side project without 24/7 on-call coverage, accept that after-hours MTTD will be long (hours) and invest instead in automatic restart policies to reduce MTTR for the failure modes that auto-recover. Focus your MTTD improvements on business hours, where fast detection actually results in fast human response.
How do I reduce MTTR for a server I can't SSH into directly?
Managed platforms (Railway, Render, Fly.io, Lambda) don't always provide direct SSH. Your MTTR tools: the platform's web dashboard or CLI for restarts, log tail for diagnosis, environment variable updates for config changes, and git push for code changes. Most platform failures have a faster recovery path than manual SSH: trigger a new deployment (often 60–90 seconds on Railway), scale to zero and back up (Cloud Run), or click "restart service" in the dashboard. Pre-authenticate your CLI tools before an incident — running railway login for the first time during an outage adds unnecessary recovery delay.
Should I focus more on MTTD or MTTR?
MTTR has more impact on user experience in most cases. The difference between 3-minute MTTD and 5-minute MTTD is 2 minutes — most users won't notice. The difference between 10-minute MTTR and 90-minute MTTR is 80 minutes of lost availability. Invest in MTTR first (runbooks, automatic restarts, zero-downtime deployments), then invest in MTTD once your recovery procedures are solid. The exception: if you're running a workflow where even 2 minutes of missed-detection time causes significant damage (think: automated financial transactions or critical infrastructure), MTTD becomes more important.
Further reading
- MCP server downtime alerting — reducing MTTD with fast alert configuration
- MCP server SLO — connecting reliability metrics to error budgets
- MCP server incident response — runbooks and MTTR playbooks
- MCP server health check — the four-layer probe powering MTTD
- MCP server performance — reliability at the performance layer
- MCP server availability — the availability SLO that reliability engineering defends
- AliveMCP — MTTD and MTTR tracking built into every monitor