Guide · MCP Resilience

MCP server canary deployments

A canary deployment routes a small fraction of real traffic to a new server version while the majority stays on the stable version. If the canary behaves correctly — same error rate, similar latency, no unexpected tool schema errors — traffic is gradually shifted until the new version handles 100% of requests. If the canary reveals a regression, it is rolled back before most agents ever see the bad version. For MCP servers, canary releases are especially valuable because agents operate autonomously and may retry aggressively: a regression that causes 5% of tool calls to fail will trigger retries that amplify the impact, making early detection essential.

TL;DR

Run two instances: mcp-stable (current) and mcp-canary (new version). Configure your reverse proxy (nginx or Caddy) to route 5–10% of requests to the canary. Monitor error rate, latency P99, and tool schema mismatch errors on both. After 1 hour with no regression, shift to 25%, then 50%, then 100%. If canary error rate exceeds stable by more than 2×, roll back immediately — remove the canary weight and redeploy.

Why MCP servers need canary deployments

MCP servers interact with agents that are not under your control. Unlike a web app where a bad deploy affects users who can refresh or report issues, a bad MCP server deploy affects agents that:

Retry silently — an agent that receives an unexpected schema validation error may retry 5 times before giving up, amplifying your error rate by 5×
Propagate errors downstream — in a pipeline, a tool failure causes the upstream orchestrator to fail the entire task, not just the single tool call
Cache tool schemas — agents may not re-fetch the tools list after a deploy, meaning a schema change that was intended to be backward-compatible may fail for agents with cached old schemas until they restart
Run autonomously at scale — an enterprise deployment may have hundreds of agents calling your server simultaneously; a bad deploy is immediately felt at scale

A canary deployment limits blast radius to the small fraction of agents on the canary, giving you real signal with bounded damage.

Traffic splitting with nginx

nginx upstream split directives route a percentage of requests to each backend:

# nginx.conf
upstream mcp_stable {
  server 127.0.0.1:3000 weight=95;
}

upstream mcp_canary {
  server 127.0.0.1:3001 weight=5;
}

# Use split_clients for percentage-based routing
split_clients "${remote_addr}${request_id}" $mcp_upstream {
  5%    mcp_canary;
  *     mcp_stable;
}

server {
  listen 443 ssl http2;
  server_name mcp.yourdomain.com;

  location / {
    proxy_pass http://$mcp_upstream;
    # Inject version header so agents can see which version served them
    add_header X-Server-Version $upstream_http_x_server_version;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  }
}

The split_clients hash on ${remote_addr}${request_id} means a single agent session (same IP, different request IDs) may hit both versions — this is intentional for stateless HTTP-transport MCP. For SSE-based sessions that must stay on one version for their entire duration, use session-affinity instead (see below).

Traffic splitting with Caddy

Caddy's reverse_proxy directive supports weighted load balancing:

# Caddyfile
mcp.yourdomain.com {
  reverse_proxy {
    to localhost:3000 localhost:3001
    lb_policy weighted_round_robin 95 5
    header_up X-Server-Version {http.reverse_proxy.upstream.address}
  }
}

Caddy does not support named upstream groups like nginx, but the weighted_round_robin policy gives 95/5 split across the two upstreams listed.

Session-affinity for SSE transports

If your MCP server uses SSE transport, a client opens a long-lived HTTP stream for the session. Switching that stream mid-session from stable to canary would reconnect the client and may cause state loss. Use sticky routing based on a session ID header:

# nginx: route by Mcp-Session-Id if present, otherwise split randomly
map $http_mcp_session_id $mcp_upstream {
  default    "";
}

split_clients "${http_mcp_session_id:-${remote_addr}}" $mcp_split {
  5%    "canary";
  *     "stable";
}

map $mcp_split $mcp_backend {
  "canary"  "127.0.0.1:3001";
  default   "127.0.0.1:3000";
}

server {
  location / {
    proxy_pass http://$mcp_backend;
  }
}

This hashes on the session ID so the same session always goes to the same backend. A new session has a 5% chance of landing on the canary and stays there for its entire duration.

Canary health metrics

Monitor these metrics independently for stable and canary, tagging each with a version label:

import { Counter, Histogram } from 'prom-client';

const toolCallDuration = new Histogram({
  name: 'mcp_tool_call_duration_seconds',
  help: 'Tool call execution duration',
  labelNames: ['tool', 'outcome', 'version'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});

const toolCallErrors = new Counter({
  name: 'mcp_tool_call_errors_total',
  help: 'Tool call errors by type',
  labelNames: ['tool', 'error_type', 'version'],
});

// Set version from env at server start
const SERVER_VERSION = process.env.SERVER_VERSION ?? 'unknown';

// In your tool middleware:
toolCallDuration.observe({ tool: toolName, outcome, version: SERVER_VERSION }, durationSeconds);
if (outcome === 'error') {
  toolCallErrors.inc({ tool: toolName, error_type: errorType, version: SERVER_VERSION });
}

In your Prometheus/Grafana setup, compare mcp_tool_call_errors_total{version="canary"} vs {version="stable"}. A Grafana panel showing the error rate ratio is the primary canary health gate.

Automated rollback triggers

Define rollback criteria before deploying the canary — not after you see a problem:

Signal	Rollback threshold	Action
Canary error rate	> 2× stable error rate for 5 minutes	Remove canary weight → redeploy stable
Canary P99 latency	> 3× stable P99 for 5 minutes	Investigate slow path; likely a missing index or N+1 query
Schema validation errors (tool calls)	> 0.1% of canary tool calls	Breaking schema change — revert and use schema migration pattern
Canary process crash (OOM, SIGABRT)	Any crash	Immediate rollback; check for memory regression

Encode these as Prometheus alerting rules so rollback is triggered automatically by your on-call system, not by a human noticing a dashboard.

Canary progression gates

Before moving from 5% to 25% to 100%, require explicit confirmation of health:

5% for 30 minutes — baseline health check. No crashes, error rate within 1.5× stable.
25% for 1 hour — sufficient volume to detect rare error paths. All rollback thresholds green.
50% for 1 hour — confirm performance at half-traffic. P99 latency stable.
100% — complete the cutover. Scale down the old version 5 minutes later.

This progression can be automated (update nginx upstream weights via the nginx Plus API, or swap Caddy config via its admin API) or done manually for infrequent releases.

Canary deployments and AliveMCP

AliveMCP runs external protocol probes against your server endpoint. During a canary, configure two monitor targets — one for the stable port and one for the canary port — so you see independent uptime and response time graphs for each. If the canary probe returns errors or latency spikes that the stable probe does not, you have clean signal for a rollback decision independent of your internal metrics.