MCP server blue-green deployment
Blue-green deployment is simple for stateless REST APIs: bring up the green slot, verify it, flip the load balancer, shut down blue. MCP servers using SSE transport complicate that sequence because sessions are long-lived: an AI client that connected before the cutover holds an open SSE connection to the blue slot, and shutting blue down mid-session drops that session with no way for the client to resume it. The pattern that works: bring up green, gate the flip on a passing MCP probe, route new connections to green while blue drains its existing sessions, then shut blue down. Servers running Streamable HTTP transport in stateless mode can skip the drain step because no request depends on a long-lived connection or on in-process session state.
TL;DR
Bring up the green slot in parallel with blue. Run your smoke tests and wait for AliveMCP to confirm the green slot passes the initialize → tools/list probe. Then flip and enter the session drain window: route all new connections to green and none to blue, but let existing SSE sessions on blue finish, typically 60–120 seconds for MCP sessions. After the drain window, shut down blue. If the production probe fails within five minutes of the flip, roll back immediately: shift the upstream back to blue, restarting it from the previous build first if it has already been shut down.
Why MCP blue-green differs from REST blue-green
REST API blue-green is straightforward: requests are stateless, each one carries all its context, and a mid-request flip causes at worst one retried request. MCP SSE sessions differ on each of the four properties below:
| Property | REST API | MCP SSE session | MCP Streamable HTTP |
|---|---|---|---|
| Connection lifetime | Milliseconds | Minutes to hours | Milliseconds |
| Session state location | None (or DB) | Server process memory | None (or DB) |
| Mid-deploy disconnect impact | Client retries one request | AI client loses full session context | Client retries one request |
| Blue-green complexity | Flip immediately | Drain window required | Flip immediately |
The core problem with SSE sessions: the MCP SDK's Client does not automatically reconnect and replay the initialize handshake after a connection drop. The AI client (Claude Desktop, Cursor, etc.) sees a broken SSE stream and treats the server as dead. Your users notice. This is why the session drain window is essential for SSE-transport servers.
Blue-green topology
The simplest topology runs two identical server processes on different ports behind a single reverse proxy. The proxy is the only thing that changes during a deploy — both slots stay running simultaneously during the drain window.
Internet → Caddy / nginx (port 443)
├─ blue → localhost:3001 (current production)
└─ green → localhost:3002 (new version, being validated)
In container deployments, green and blue are separate containers. In cloud deployments, they're separate instances or app revisions. The proxy mechanism is the same in all cases: a weighted upstream where you shift the weight from blue to green during the flip.
Step 1 — Bring up the green slot and run probes
Start the new version on the green port. It should not receive production traffic yet. Run your smoke tests against it directly, and add a temporary AliveMCP monitor pointed at the green slot's URL.
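# Start green on port 3002 (production traffic still hits blue on 3001)
PORT=3002 node dist/index.js &
GREEN_PID=$!
# Streamable HTTP requires clients to accept both JSON and SSE responses
ACCEPT='Accept: application/json, text/event-stream'
# Smoke test: verify the MCP initialize handshake completes
# (the retry flags give the freshly started process time to begin listening)
curl -sf --retry 5 --retry-delay 1 --retry-connrefused -X POST http://localhost:3002/mcp \
  -H 'Content-Type: application/json' -H "$ACCEPT" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"deploy-smoke","version":"1"}}}' \
  | grep -q protocolVersion || { echo "Green slot failed initialize probe"; exit 1; }
# Smoke test: verify tools/list returns expected tools
# (assumes stateless mode; a stateful server also needs the Mcp-Session-Id header from initialize)
# The sed filter unwraps an SSE-framed reply ("data: {...}") and passes plain JSON through
TOOLS=$(curl -sf -X POST http://localhost:3002/mcp \
  -H 'Content-Type: application/json' -H "$ACCEPT" \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}' \
  | sed -n -e 's/^data: //' -e '/^{/p' \
  | jq -r '[.result.tools[].name] | sort | @json')
EXPECTED='["get_document","list_documents","search_documents"]'
[ "$TOOLS" = "$EXPECTED" ] || { echo "Tool list mismatch: $TOOLS"; exit 1; }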
Add an AliveMCP monitor for https://staging.yourdomain.com (or the green slot's direct URL) before flipping. This gives you an external probe independent of your deploy script — if the green slot fails its MCP handshake for any reason, AliveMCP alerts you before your users are affected.
Step 2 — Drain the blue slot
Once the green slot passes all probes, begin the drain: stop routing new connections to blue, but keep the blue process running for existing sessions. In Caddy:
# Caddyfile: upstream configuration during the drain window
# Weights are listed in upstream order: blue (3001) = 0, green (3002) = 1
# A weight of 0 means blue is not selected for new requests
# Blue still handles existing SSE connections
reverse_proxy /mcp {
    to localhost:3001 localhost:3002
    lb_policy weighted_round_robin 0 1
    lb_try_duration 5s
    # The Caddy admin API can also update the weights without editing this file
}
With nginx, mark blue as down and reload. nginx rejects weight=0 as an invalid parameter, and after a reload the old worker processes keep serving their open connections until those close:
upstream mcp_backend {
    # Blue: marked down during drain (receives no new connections)
    server localhost:3001 down;
    # Green: all new connections
    server localhost:3002 weight=1;
}
server {
    listen 443 ssl;
    location /mcp {
        proxy_pass http://mcp_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_read_timeout 3600s;  # Keep SSE connections alive during drain
    }
}
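One way to script the nginx flip is to generate the upstream block from the name of the active slot. The sketch below is illustrative only: the function name `render_upstream` is hypothetical, the ports are the ones used in this guide, and it assumes the inactive slot is taken out of rotation with nginx's `down` server parameter.

```shell
# render_upstream SLOT: print an nginx upstream block that sends new
# connections only to SLOT ("blue" or "green"). Hypothetical helper.
render_upstream() {
  local active="$1" blue_flags="" green_flags=""
  case "$active" in
    blue)  green_flags=" down" ;;   # green takes no new connections
    green) blue_flags=" down" ;;    # blue takes no new connections
    *) echo "usage: render_upstream blue|green" >&2; return 2 ;;
  esac
  cat <<EOF
upstream mcp_backend {
    server localhost:3001${blue_flags};   # blue
    server localhost:3002${green_flags};  # green
}
EOF
}
```

Writing the output to a file that your nginx.conf includes, then running `nginx -t && nginx -s reload`, applies the change. The include path depends on your installation.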
The drain window should be long enough for active MCP sessions to finish naturally. This guide assumes most sessions end within about 60 seconds, in which case a 120-second window leaves comfortable headroom. SSE connections can stay open far longer (the comparison table above lists minutes to hours), so measure actual session durations from your server logs before picking a value.
# Wait for drain window
DRAIN_SECONDS=120
echo "Drain window: waiting ${DRAIN_SECONDS}s for blue sessions to finish..."
sleep $DRAIN_SECONDS
# Verify no active SSE connections on blue (check /metrics or process-level sockets)
BLUE_CONNECTIONS=$(ss -tn state ESTABLISHED dst localhost:3001 | wc -l)
if [ "$BLUE_CONNECTIONS" -gt 1 ]; then
echo "Warning: $BLUE_CONNECTIONS connections still active on blue after drain window"
fi
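The drain length itself can be derived from measured session durations rather than guessed. The following is a minimal sketch, assuming you have already extracted one session duration in seconds per line into a file (how you get that from your logs depends on your log format). The function names `percentile` and `drain_window` are hypothetical, and the 60-second floor and 300-second ceiling follow the FAQ guidance at the end of this guide.

```shell
# percentile P FILE: nearest-rank P-th percentile (integer P) of the
# numeric values in FILE, one value per line.
percentile() {
  local p="$1" file="$2"
  sort -n "$file" | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      k = int((p * NR + 99) / 100)   # ceil(p * N / 100) for integers
      if (k < 1) k = 1
      print v[k]
    }'
}

# drain_window FILE: p95 session duration, clamped to 60..300 seconds.
drain_window() {
  local file="$1" p95
  p95="$(percentile 95 "$file")" || return 1
  p95="${p95%.*}"                    # drop fractional seconds
  if [ "$p95" -lt 60 ]; then echo 60
  elif [ "$p95" -gt 300 ]; then echo 300
  else echo "$p95"
  fi
}
```

With this in place, `DRAIN_SECONDS="$(drain_window durations.txt)"` could replace the hard-coded value, where `durations.txt` is a placeholder file name.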
Step 3 — Shut down blue and verify green
After the drain window, shut down the blue process. All traffic is now on green.
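# BLUE_PID must hold the PID of the running blue process, for example read from a
# PID file written when blue was started. Blue was launched by an earlier deploy,
# so it is usually not a child of this shell and `wait` cannot be used on it.
kill -TERM "$BLUE_PID"
while kill -0 "$BLUE_PID" 2>/dev/null; do sleep 1; done
# Verify green is healthy with an external probe (curl the production endpoint)
VERIFIED=0
for i in $(seq 1 12); do
  RESULT=$(curl -sf -X POST https://yourdomain.com/mcp \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"post-deploy-verify","version":"1"}}}' \
    2>/dev/null | sed -n -e 's/^data: //' -e '/^{/p' \
    | jq -r '.result.protocolVersion // empty')
  [ -n "$RESULT" ] && { echo "Green slot verified: $RESULT"; VERIFIED=1; break; }
  sleep 5
done
# Fail the deploy if green never answered, so the rollback can start
[ "$VERIFIED" -eq 1 ] || { echo "Green slot failed post-deploy verification"; exit 1; }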
Check AliveMCP — the probe should be green within two minutes of blue shutting down. If it goes red instead, that's the rollback trigger.
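The probe loops in this guide all follow the same shape: try a command a fixed number of times, pause between attempts, and fail if it never succeeds. A small helper keeps that logic in one place. This is a sketch, and `retry` is a hypothetical name rather than part of any tool mentioned here.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Returns 0 on the first
# success and 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # No need to sleep after the final failed attempt
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

If the curl probe is wrapped in a function, for example a hypothetical `probe_green`, the verify step becomes `retry 12 5 probe_green || exit 1`.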
Rollback procedure
If the green slot fails its post-deploy probe (AliveMCP alerts, or your verify loop fails), roll back. If blue is still running, that is a single upstream flip. If blue has already been shut down, restart it from the previous build, wait for it to pass its probe, then shift all weight back to it:
# Rollback: restart blue with the previous build
PORT=3001 node dist-prev/index.js &
BLUE_PID=$!
# Wait for blue to pass its probe; abort if it never does
BLUE_OK=0
for i in $(seq 1 20); do
  curl -sf -X POST http://localhost:3001/mcp \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"rollback-check","version":"1"}}}' \
    | grep -q protocolVersion && { BLUE_OK=1; break; }
  sleep 2
done
[ "$BLUE_OK" -eq 1 ] || { echo "Blue failed its probe; leaving green in place"; exit 1; }
# Flip nginx/Caddy so all new connections go to blue and none to green
# Then kill the green slot
kill -TERM "$GREEN_PID"
Keep the previous build artifact (previous Docker image tag, previous dist/ folder, or previous Git tag) accessible during every deploy. The rollback is only fast if you don't have to rebuild — a rollback that requires a full CI rebuild costs you 5–10 minutes under an outage.
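For the `dist/` layout used in the scripts above, keeping the previous artifact can be as simple as copying the current build aside before each new build. The sketch below assumes the build output lives in `dist/` and the rollback copy in `dist-prev/`, matching the rollback script. The function name `preserve_previous_build` is hypothetical.

```shell
# preserve_previous_build ROOT: copy ROOT/dist to ROOT/dist-prev so the
# rollback script has a ready-to-run previous version.
preserve_previous_build() {
  local root="$1"
  # First deploy: there is no earlier build to keep
  [ -d "$root/dist" ] || return 0
  rm -rf "$root/dist-prev"
  cp -a "$root/dist" "$root/dist-prev"
}
```

Call it before `npm run build` overwrites `dist/`. With container images the equivalent is retaining the previous image tag rather than copying files.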
| Rollback trigger | Detection method | Rollback action |
|---|---|---|
| Green initialize probe fails in smoke test | Pre-flip smoke test script | Kill green, deploy never happened |
| AliveMCP alert within 5 minutes of flip | AliveMCP external probe | Restart blue (prev build) → flip upstream |
| Error rate spike in tool call responses | Application metrics / structured logs | Flip upstream back to blue (if still running) |
| Memory/CPU spike on green | Process monitoring | Flip upstream back to blue |
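The "within 5 minutes of flip" trigger can also be watched from the deploy script itself, alongside the external monitor. The sketch below repeats a probe for a fixed window and reports the first failure. `watch_probe` is a hypothetical name, and the probe command is whatever check you already use, such as the curl-based initialize probe.

```shell
# watch_probe DURATION INTERVAL CMD...: run CMD every INTERVAL seconds for
# roughly DURATION seconds. Returns 1 as soon as CMD fails, 0 if CMD passed
# on every run, 2 on a bad INTERVAL. Probe run time is not counted, so the
# real window is slightly longer than DURATION.
watch_probe() {
  local duration="$1" interval="$2" elapsed=0
  shift 2
  [ "$interval" -ge 1 ] || return 2
  while [ "$elapsed" -lt "$duration" ]; do
    "$@" || return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}
```

For example, `watch_probe 300 15 probe_production || start_rollback`, where both function names are placeholders for your own probe and rollback steps. Treating a single failed probe as a trigger is the strictest choice; requiring two consecutive failures is a reasonable variation if your probe is occasionally flaky.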
Streamable HTTP: blue-green without session drain
If you're using Streamable HTTP transport in stateless mode instead of SSE, each request is independent and no session is pinned to one server process, so there are no long-lived connections to drain. Blue-green becomes straightforward:
# With Streamable HTTP: no drain window needed
# Each request to /mcp is a self-contained JSON-RPC exchange
# Cutting traffic mid-request drops one request (client retries)
# Step 1: start green
PORT=3002 node dist/index.js &
# Step 2: smoke test green
# Step 3: flip upstream to green immediately (no drain)
# Step 4: shut down blue immediately (no drain)
The trade-off is that Streamable HTTP in stateless mode requires any session state (tool call history within a session, user context) to live in a shared store like Redis or PostgreSQL rather than process memory. For most MCP servers this is the right architecture anyway — it also makes horizontal scaling much simpler.
Automating blue-green in CI/CD
A complete GitHub Actions workflow integrating the deploy + drain + verify pattern:
name: Blue-green deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Build green artifact
        run: npm ci && npm run build
      - name: Deploy green slot
        run: |
          # Deploy to green slot (vendor-specific commands)
          # e.g., fly deploy --app myapp-green --wait-timeout 60
          # e.g., railway deploy --service myapp-green
      - name: Smoke test green slot
        run: |
          GREEN_URL="https://green.yourdomain.com"
          for i in $(seq 1 20); do
            curl -sf -X POST "${GREEN_URL}/mcp" \
              -H 'Content-Type: application/json' \
              -H 'Accept: application/json, text/event-stream' \
              -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"ci","version":"1"}}}' \
              | grep -q protocolVersion && { echo "Green probe passed"; exit 0; }
            sleep 3
          done
          # Fail the job so the flip never happens on an unverified green slot
          echo "Green probe never passed; aborting before the flip"
          exit 1
      - name: Flip traffic to green (via reverse proxy config update)
        run: |
          # Update load balancer to route to green
          # Vendor-specific: update target group, flip DNS, update Caddy via API, etc.
      - name: Drain window
        run: |
          echo "Waiting 120s for active blue sessions to complete..."
          sleep 120
      - name: Shut down blue slot
        run: |
          # Decommission the old blue slot
          # fly scale count 0 --app myapp-blue
          # railway down --service myapp-blue
      - name: Post-deploy verification
        run: |
          PROD_URL="https://yourdomain.com"
          for i in $(seq 1 12); do
            curl -sf -X POST "${PROD_URL}/mcp" \
              -H 'Content-Type: application/json' \
              -H 'Accept: application/json, text/event-stream' \
              -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"post-deploy","version":"1"}}}' \
              | grep -q protocolVersion && { echo "Production verified"; exit 0; }
            sleep 5
          done
          echo "Post-deploy verification failed: check AliveMCP"
          exit 1
AliveMCP as the deploy gate
AliveMCP's external probe runs the same initialize → tools/list sequence that a real MCP client runs. This makes it a good gate for blue-green deploys: it catches protocol-level failures that HTTP health checks miss and, because it probes from outside your infrastructure, routing failures that a local smoke test misses.
| Failure mode | HTTP /health catches it | Smoke test catches it | AliveMCP catches it |
|---|---|---|---|
| Server process crashed | Yes | Yes | Yes |
| MCP initialize returns wrong protocolVersion | No | Yes | Yes |
| tools/list returns empty array (registration bug) | No | Yes | Yes |
| TLS certificate expired on green slot | Maybe (depends on check) | Yes (if using HTTPS) | Yes |
| DNS routing still pointing to blue after flip | No (hits old server) | No (hits old server) | Yes (external probe sees routing) |
Configure a second AliveMCP monitor for the green slot URL before the flip. If it fails, the flip never happens. If it passes and you flip but the production probe then goes red, that is your rollback signal: AliveMCP sends the alert and your on-call runbook says "flip upstream back to blue immediately."
FAQ
How long should the drain window be for MCP SSE sessions?
Measure your actual session durations from server access logs — look for SSE connection close events. Most interactive MCP sessions end in under 60 seconds when the user is done with a task. Set the drain window to the 95th-percentile session duration, with a minimum of 60 seconds and a maximum of 5 minutes. Beyond 5 minutes, the deploy is stalled for too long; any sessions still active should be terminated gracefully with a SIGTERM to the blue process.
Can I skip the drain window if I tell clients to reconnect?
MCP clients (Claude Desktop, Cursor, VS Code Copilot) do not automatically reconnect and replay the initialize handshake. From their perspective, an SSE disconnect means the server is gone. There is no MCP protocol-level session resumption — the client must be restarted or must re-initialize from scratch. Skipping the drain window means cutting those sessions, which users experience as the MCP server going offline mid-task.
Does blue-green work with Kubernetes rolling deploys?
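Kubernetes rolling deploys replace pods incrementally rather than switching between two complete environments, but the same drain concern applies. For MCP SSE servers, set terminationGracePeriodSeconds on the pod spec to at least your drain window, and handle SIGTERM in the server by refusing new sessions while letting existing ones finish; a preStop hook can add a delay before SIGTERM is sent. Note that minReadySeconds only controls how long a new pod must stay ready before it counts as available, so it does not keep old pods alive for draining. For stateless Streamable HTTP servers, rolling deploys need no special session handling.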
How do I handle database migrations in a blue-green deploy?
Run additive migrations before the flip — new columns, new tables, new indexes — so both blue (old version) and green (new version) can operate against the schema simultaneously. Never run destructive schema changes (column removal, column rename) during the window when both versions are running. Run destructive cleanup in a separate migration after the old version is fully decommissioned. See MCP server database migrations for the full three-phase migration pattern.
What if AliveMCP shows the green slot as healthy but users are reporting errors?
AliveMCP verifies that the MCP protocol handshake succeeds — it doesn't verify that tool calls return correct data. If the protocol layer is healthy but tool results are wrong (wrong database, stale cache, misconfigured environment variable), AliveMCP will be green while users report errors. Add a canary tool call to your smoke test suite: call a tool that exercises the full data path and verify the response content, not just the HTTP status.