MCP server blue-green deployment

Blue-green deployment is simple for stateless REST APIs: bring up the green slot, verify it, flip the load balancer, shut down blue. MCP servers using SSE transport complicate that sequence because sessions are long-lived: an AI client that connected before the cutover holds an open SSE connection to the blue slot, and shutting blue down mid-session drops that session with no way for the client to resume it. The pattern that works: gate the flip on a passing probe of green, stop sending new connections to blue, let existing sessions drain, then shut blue down. Servers using Streamable HTTP transport in stateless mode can skip the drain step because no session lives in the server process.

TL;DR

Bring up the green slot in parallel with blue. Run your smoke tests and wait for AliveMCP to confirm the green slot passes the initialize → tools/list probe. Then enter the session drain window: stop sending new connections to blue (take it out of rotation for new traffic) but allow existing SSE sessions to finish, typically 60–120 seconds for MCP sessions. After the drain window, shut down blue. If the green probe fails within five minutes of the flip, roll back: route traffic back to blue, restarting it from the previous build if it has already been shut down.

Why MCP blue-green differs from REST blue-green

REST API blue-green is straightforward: requests are stateless, each one carries all its context, and a mid-request flip causes at worst one retried request. MCP SSE sessions differ in four ways; Streamable HTTP, run in stateless mode, behaves like REST:

Property	REST API	MCP SSE session	MCP Streamable HTTP
Connection lifetime	Milliseconds	Minutes to hours	Milliseconds
Session state location	None (or DB)	Server process memory	None (or DB)
Mid-deploy disconnect impact	Client retries one request	AI client loses full session context	Client retries one request
Blue-green complexity	Flip immediately	Drain window required	Flip immediately

The core problem with SSE sessions: the MCP SDK's Client does not automatically reconnect and replay the initialize handshake after a connection drop. The AI client (Claude Desktop, Cursor, etc.) sees a broken SSE stream and treats the server as dead. Your users notice. This is why the session drain window is essential for SSE-transport servers.

Blue-green topology

The simplest topology runs two identical server processes on different ports behind a single reverse proxy. The proxy is the only thing that changes during a deploy — both slots stay running simultaneously during the drain window.

Internet → Caddy / nginx (port 443)
              ├─ blue  → localhost:3001 (current production)
              └─ green → localhost:3002 (new version, being validated)

In container deployments, green and blue are separate containers. In cloud deployments, they're separate instances or app revisions. The proxy mechanism is the same in all cases: a weighted upstream where you shift the weight from blue to green during the flip.
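
A deploy script for this topology needs to know which port each slot uses and which slot is idle. The sketch below is one minimal way to encode that in POSIX shell; the function names `slot_port` and `other_slot` are hypothetical, and the ports are the ones from the diagram above.

```shell
# Map a slot name to its local port (ports taken from the topology above).
slot_port() {
  case "$1" in
    blue)  echo 3001 ;;
    green) echo 3002 ;;
    *)     echo "unknown slot: $1" >&2; return 1 ;;
  esac
}

# Given the slot currently serving production, print the idle slot
# that the next deploy should target.
other_slot() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown slot: $1" >&2; return 1 ;;
  esac
}

# Example: production is on blue, so the next build starts on green's port.
# LIVE=blue
# NEXT=$(other_slot "$LIVE")
# PORT=$(slot_port "$NEXT") node dist/index.js &
```

Keeping the mapping in one place means the slots can swap roles on the next deploy without editing port numbers throughout the script.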

Step 1 — Bring up the green slot and run probes

Start the new version on the green port. It should not receive production traffic yet. Run your smoke tests against it directly, and add a temporary AliveMCP monitor pointed at the green slot's URL.

# Start green on port 3002 (production traffic still hits blue on 3001)
PORT=3002 node dist/index.js &
GREEN_PID=$!

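# Smoke test: verify the MCP initialize handshake completes.
# --retry-connrefused gives the just-started process time to bind its port.
# Streamable HTTP clients must list both media types in the Accept header.
curl -sf --retry 10 --retry-delay 1 --retry-connrefused -X POST http://localhost:3002/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"deploy-smoke","version":"1"}}}' \
| grep -q protocolVersion || { echo "Green slot failed initialize probe"; exit 1; }

# Smoke test: verify tools/list returns expected tools.
# Assumes a stateless server; a stateful one also expects the Mcp-Session-Id
# header returned by initialize. The sed step strips SSE framing if the
# server answers with text/event-stream instead of plain JSON.
TOOLS=$(curl -sf -X POST http://localhost:3002/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}' \
| sed -e '/^event:/d' -e '/^id:/d' -e 's/^data: //' \
| jq -r '[.result.tools[].name] | sort | @json')
EXPECTED='["get_document","list_documents","search_documents"]'
[ "$TOOLS" = "$EXPECTED" ] || { echo "Tool list mismatch: $TOOLS"; exit 1; }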

Add an AliveMCP monitor for https://staging.yourdomain.com (or the green slot's direct URL) before flipping. This gives you an external probe independent of your deploy script — if the green slot fails its MCP handshake for any reason, AliveMCP alerts you before your users are affected.
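
One detail the smoke tests in this step depend on: a Streamable HTTP server may answer a POST either with a plain JSON body or with an SSE-framed body (`event:` and `data:` lines). The helper below is a small sketch that normalizes both to bare JSON so the same `jq` or `grep` check works either way. The name `jsonrpc_body` is hypothetical, and it assumes each `data:` line carries a complete JSON message.

```shell
# Print only the JSON-RPC payload from a response body read on stdin.
# Plain JSON passes through unchanged; SSE framing lines are removed
# and the "data:" prefix is stripped from payload lines.
jsonrpc_body() {
  sed -e '/^event:/d' \
      -e '/^id:/d' \
      -e '/^retry:/d' \
      -e '/^:/d' \
      -e '/^$/d' \
      -e 's/^data: //' \
      -e 's/^data://'
}

# Example (green slot on port 3002, as in the script above):
# curl -sf -X POST http://localhost:3002/mcp \
#   -H 'Content-Type: application/json' \
#   -H 'Accept: application/json, text/event-stream' \
#   -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}' \
# | jsonrpc_body | jq -r '.result.tools[].name'
```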

Step 2 — Drain the blue slot

Once the green slot passes all probes, begin the drain: stop routing new connections to blue, but keep the blue process running for existing sessions. In Caddy:

# Caddyfile: upstream configuration during drain window
# Set green to 100% weight, blue to 0 (no new connections)
# Blue still handles existing SSE connections

reverse_proxy /mcp {
    to localhost:3001 localhost:3002

    # Weights are listed in upstream order: blue=0 (no new), green=1 (all new traffic).
    # Zero weights may need a recent Caddy release; check the docs for your version.
    lb_policy weighted_round_robin 0 1
    lb_try_duration 5s

    # The Caddy admin API lets you update weights without restarting Caddy
}

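With nginx, mark blue as down and reload. Open-source nginx rejects weight=0, and on a graceful reload the old worker processes keep serving their existing connections until those close:

upstream mcp_backend {
    # Blue: marked down during drain (receives no new connections)
    server localhost:3001 down;
    # Green: all new connections
    server localhost:3002 weight=1;
}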

server {
    listen 443 ssl;
    location /mcp {
        proxy_pass http://mcp_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection '';
        proxy_buffering off;
        proxy_read_timeout 3600s; # Keep SSE connections alive during drain
    }
}

The drain window should be long enough for active MCP sessions to finish naturally. As a starting point, assume most sessions last under 60 seconds and use a 120-second window, then check that figure against your own data: measure session durations from your server logs and size the window to cover a high percentile of them (the FAQ below suggests the 95th).

# Wait for drain window
DRAIN_SECONDS=120
echo "Drain window: waiting ${DRAIN_SECONDS}s for blue sessions to finish..."
sleep $DRAIN_SECONDS

# Verify no active connections remain on blue (check /metrics or process-level sockets).
# -H drops the header line so the count is exact; sport matches blue's listening port.
BLUE_CONNECTIONS=$(ss -Htn state established '( sport = :3001 )' | wc -l)
if [ "$BLUE_CONNECTIONS" -gt 0 ]; then
  echo "Warning: $BLUE_CONNECTIONS connections still active on blue after drain window"
fi
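
A fixed sleep always waits the full window, even when blue empties early. As an alternative, the sketch below polls a connection counter and returns as soon as it reaches zero, or gives up at the timeout. `wait_for_drain` and `blue_connections` are hypothetical names; the counter assumes Linux with `ss` from iproute2 and blue listening on port 3001.

```shell
# Wait until the counter command prints 0, or give up after a timeout.
# usage: wait_for_drain <timeout_seconds> <poll_seconds> <count_command...>
# Returns 0 once drained, 1 if connections remain when the timeout expires.
wait_for_drain() {
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while :; do
    remaining=$("$@") || return 1
    [ "$remaining" -eq 0 ] && return 0
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Counts established TCP connections whose local port is blue's.
blue_connections() {
  ss -Htn state established '( sport = :3001 )' | wc -l
}

# Example: poll every 5 seconds, for at most 120 seconds.
# wait_for_drain 120 5 blue_connections || echo "Blue still has active sessions"
```

Passing the counter in as a command keeps the loop independent of how connections are measured, so a `/metrics` scrape could replace `ss` without changing the wait logic.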

Step 3 — Shut down blue and verify green

After the drain window, shut down the blue process. All traffic is now on green.

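# BLUE_PID must hold the PID of the running blue process, for example read
# from a pidfile written when blue was started (the path below is hypothetical).
# BLUE_PID=$(cat /var/run/mcp-blue.pid)
kill -TERM "$BLUE_PID"
# `wait` only works for children of this shell, so poll until the process exits
while kill -0 "$BLUE_PID" 2>/dev/null; do sleep 1; done

# Verify green is healthy with an external probe (curl the production endpoint)
VERIFIED=""
for i in $(seq 1 12); do
  RESULT=$(curl -sf -X POST https://yourdomain.com/mcp \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"post-deploy-verify","version":"1"}}}' \
    2>/dev/null | sed -e '/^event:/d' -e '/^id:/d' -e 's/^data: //' \
    | jq -r '.result.protocolVersion // empty')
  [ -n "$RESULT" ] && { echo "Green slot verified: $RESULT"; VERIFIED=1; break; }
  sleep 5
done
# Fail loudly if no attempt succeeded, so the deploy script can start the rollback
[ -n "$VERIFIED" ] || { echo "Green failed post-deploy verification"; exit 1; }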

Check AliveMCP — the probe should be green within two minutes of blue shutting down. If it goes red instead, that's the rollback trigger.
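
The probe loops in this guide all follow the same shape: try a check several times, pause between attempts, and fail if none succeeds. A small helper makes the failure case explicit. This is a sketch; `retry` and `probe_initialize` are hypothetical names, and the probe reuses the initialize request from the scripts above.

```shell
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# usage: retry <attempts> <delay_seconds> <command...>
# Returns 0 on the first success and 1 if every attempt fails.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Succeeds if the server at base URL $1 completes the initialize handshake.
probe_initialize() {
  curl -sf -X POST "$1/mcp" \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"post-deploy-verify","version":"1"}}}' \
  | grep -q protocolVersion
}

# Example: 12 attempts, 5 seconds apart, against production.
# retry 12 5 probe_initialize https://yourdomain.com || { echo "Verification failed"; exit 1; }
```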

Rollback procedure

If the green slot fails its post-deploy probe (AliveMCP alerts, or your verify loop fails), rollback has two parts: restart blue from the previous build if it has already been shut down, then shift all traffic back to it.

# Rollback: restart blue with the previous build
PORT=3001 node dist-prev/index.js &
BLUE_PID=$!

# Wait for blue to pass its probe; abort if it never does
BLUE_OK=""
for i in $(seq 1 20); do
  curl -sf -X POST http://localhost:3001/mcp \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json, text/event-stream' \
    -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"rollback-check","version":"1"}}}' \
    | grep -q protocolVersion && { BLUE_OK=1; break; }
  sleep 2
done
[ -n "$BLUE_OK" ] || { echo "Blue failed its probe; leaving green in place"; exit 1; }

# Flip nginx/Caddy: blue back in rotation, green out
# Then stop the green slot
kill -TERM "$GREEN_PID"

Keep the previous build artifact (previous Docker image tag, previous dist/ folder, or previous Git tag) accessible during every deploy. The rollback is only fast if you don't have to rebuild — a rollback that requires a full CI rebuild costs you 5–10 minutes under an outage.
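
For the dist/ folder case, keeping the previous artifact can be as simple as copying the current build aside before each new build. A minimal sketch, assuming the `dist` and `dist-prev` directory names used in the rollback script above; the function name `keep_previous_build` is hypothetical.

```shell
# Copy the current build aside before a new one is produced, so a rollback
# can start the previous version without rebuilding.
# usage: keep_previous_build <build_dir> <previous_dir>
keep_previous_build() {
  build=$1; prev=$2
  [ -n "$build" ] && [ -n "$prev" ] || return 1   # refuse empty paths
  [ -d "$build" ] || return 0                     # first deploy: nothing to keep
  rm -rf "$prev"
  cp -R "$build" "$prev"
}

# Example: run before every build.
# keep_previous_build dist dist-prev && npm run build
```

For container deployments the equivalent is to leave the previous image tag in the registry rather than overwriting it.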

Rollback trigger	Detection method	Rollback action
Green initialize probe fails in smoke test	Pre-flip smoke test script	Kill green, deploy never happened
AliveMCP alert within 5 minutes of flip	AliveMCP external probe	Restart blue (prev build) → flip upstream
Error rate spike in tool call responses	Application metrics / structured logs	Flip upstream back to blue (if still running)
Memory/CPU spike on green	Process monitoring	Flip upstream back to blue

Streamable HTTP: blue-green without session drain

If you're using Streamable HTTP transport in stateless mode instead of SSE, each request is independent and there are no long-lived SSE connections to drain. Blue-green becomes straightforward:

# With Streamable HTTP: no drain window needed
# Each request to /mcp is a self-contained JSON-RPC exchange
# Cutting traffic mid-request drops one request (client retries)

# Step 1: start green in the background so the script can continue
PORT=3002 node dist/index.js &

# Step 2: smoke test green
# Step 3: flip upstream to green immediately (no drain)
# Step 4: shut down blue immediately (no drain)

The trade-off is that Streamable HTTP in stateless mode requires any session state (tool call history within a session, user context) to live in a shared store like Redis or PostgreSQL rather than process memory. For most MCP servers this is the right architecture anyway — it also makes horizontal scaling much simpler.

Automating blue-green in CI/CD

A skeleton GitHub Actions workflow integrating the deploy + drain + verify pattern. The vendor-specific steps are left as comments to fill in for your platform:

name: Blue-green deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Build green artifact
        run: npm ci && npm run build

      - name: Deploy green slot
        run: |
          # Deploy to green slot (vendor-specific commands)
          # e.g., fly deploy --app myapp-green --wait-timeout 60
          # e.g., railway deploy --service myapp-green

      - name: Smoke test green slot
        run: |
          GREEN_URL="https://green.yourdomain.com"
          for i in $(seq 1 20); do
            curl -sf -X POST "${GREEN_URL}/mcp" \
              -H 'Content-Type: application/json' \
              -H 'Accept: application/json, text/event-stream' \
              -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"ci","version":"1"}}}' \
            | grep -q protocolVersion && { echo "Green probe passed"; exit 0; }
            sleep 3
          done
          # Without this, the step would pass even if green never answered
          echo "Green probe never passed; aborting before the flip"
          exit 1

      - name: Flip traffic to green (via reverse proxy config update)
        run: |
          # Update load balancer to route to green
          # Vendor-specific: update target group, flip DNS, update Caddy via API, etc.

      - name: Drain window
        run: |
          echo "Waiting 120s for active blue sessions to complete..."
          sleep 120

      - name: Shut down blue slot
        run: |
          # Decommission the old blue slot
          # fly scale count 0 --app myapp-blue
          # railway down --service myapp-blue

      - name: Post-deploy verification
        run: |
          PROD_URL="https://yourdomain.com"
          for i in $(seq 1 12); do
            curl -sf -X POST "${PROD_URL}/mcp" \
              -H 'Content-Type: application/json' \
              -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"post-deploy","version":"1"}}}' \
            | grep -q protocolVersion && { echo "Production verified"; exit 0; }
            sleep 5
          done
          echo "Post-deploy verification failed — check AliveMCP"
          exit 1

AliveMCP as the deploy gate

AliveMCP's external probe runs the same initialize → tools/list sequence that a real MCP client runs. This makes it the right gate for blue-green deploys: it catches protocol-level failures that HTTP health checks miss, and because it runs from outside your network, it also catches routing failures that local smoke tests miss.

Failure mode	HTTP /health catches it	Smoke test catches it	AliveMCP catches it
Server process crashed	Yes	Yes	Yes
MCP initialize returns wrong protocolVersion	No	Yes	Yes
tools/list returns empty array (registration bug)	No	Yes	Yes
TLS certificate expired on green slot	Maybe (depends on check)	Yes (if using HTTPS)	Yes
DNS routing still pointing to blue after flip	No (hits old server)	No (hits old server)	Yes (external probe sees routing)

Configure a second AliveMCP monitor for the green slot URL before the flip. If it fails, the flip never happens. If it passes and you flip but the production probe then goes red, that's your rollback signal: AliveMCP sends the alert and your on-call runbook says "flip upstream back to blue immediately."
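
The gate logic itself is small enough to sketch in shell. AliveMCP's API is not shown in this guide, so the example below takes the probes, the flip, and the rollback as commands you supply; `gated_flip` and the command names in the usage line are hypothetical.

```shell
# Flip traffic only if the pre-flip probe passes; undo the flip if the
# post-flip probe fails.
# usage: gated_flip <probe_green> <flip> <probe_prod> <rollback>
# Each argument names a function or command that takes no arguments.
# Returns 0 on success, 1 if the flip was skipped or failed, 2 if rolled back.
gated_flip() {
  "$1" || { echo "green probe failed, not flipping"; return 1; }
  "$2" || { echo "flip command failed"; return 1; }
  "$3" && return 0
  echo "production probe failed after flip, rolling back"
  "$4"
  return 2
}

# Example, with functions defined elsewhere in the deploy script:
# gated_flip probe_green flip_to_green probe_production flip_to_blue
```

Distinct exit codes let the calling pipeline tell "never deployed" apart from "deployed and rolled back", which usually warrant different alerts.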

FAQ

How long should the drain window be for MCP SSE sessions?

Measure your actual session durations from server access logs — look for SSE connection close events. Most interactive MCP sessions end in under 60 seconds when the user is done with a task. Set the drain window to the 95th-percentile session duration, with a minimum of 60 seconds and a maximum of 5 minutes. Beyond 5 minutes, the deploy is stalled for too long; any sessions still active should be terminated gracefully with a SIGTERM to the blue process.
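
The rule above can be computed directly once you have session durations extracted from your logs. The sketch below assumes one duration in seconds per line on stdin (how you extract them depends on your log format) and uses the nearest-rank definition of the 95th percentile; `drain_window_seconds` is a hypothetical name.

```shell
# Read session durations in seconds (one per line) on stdin and print a drain
# window: the nearest-rank 95th percentile, clamped to the range 60..300.
# With no input, prints the 60-second minimum.
drain_window_seconds() {
  sort -n | awk '
    { d[NR] = $1 + 0 }
    END {
      if (NR == 0) { print 60; exit }
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * N) in integer arithmetic
      w = d[rank]
      if (w < 60) w = 60                 # minimum of 60 seconds
      if (w > 300) w = 300               # maximum of 5 minutes
      print w
    }'
}

# Example, assuming durations.txt holds one duration per line:
# DRAIN_SECONDS=$(drain_window_seconds < durations.txt)
```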

Can I skip the drain window if I tell clients to reconnect?

MCP clients (Claude Desktop, Cursor, VS Code Copilot) do not automatically reconnect and replay the initialize handshake. From their perspective, an SSE disconnect means the server is gone. There is no MCP protocol-level session resumption — the client must be restarted or must re-initialize from scratch. Skipping the drain window means cutting those sessions, which users experience as the MCP server going offline mid-task.

Does blue-green work with Kubernetes rolling deploys?

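Kubernetes rolling deploys replace pods incrementally rather than switching all traffic at once, so both versions serve traffic during the rollout. For MCP SSE servers, the drain has to happen at pod termination: set terminationGracePeriodSeconds to at least your drain window, and make sure the server keeps serving open sessions after SIGTERM (or add a preStop hook that delays shutdown) so existing sessions can finish before the old pod is killed. minReadySeconds only controls how long a new pod must stay ready before it counts as available; it does not drain old pods. For Streamable HTTP servers in stateless mode, rolling deploys work without any special configuration since requests are stateless.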

How do I handle database migrations in a blue-green deploy?

Run additive migrations before the flip — new columns, new tables, new indexes — so both blue (old version) and green (new version) can operate against the schema simultaneously. Never run destructive schema changes (column removal, column rename) during the window when both versions are running. Run destructive cleanup in a separate migration after the old version is fully decommissioned. See MCP server database migrations for the full three-phase migration pattern.

What if AliveMCP shows the green slot as healthy but users are reporting errors?

AliveMCP verifies that the MCP protocol handshake succeeds — it doesn't verify that tool calls return correct data. If the protocol layer is healthy but tool results are wrong (wrong database, stale cache, misconfigured environment variable), AliveMCP will be green while users report errors. Add a canary tool call to your smoke test suite: call a tool that exercises the full data path and verify the response content, not just the HTTP status.
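
A canary check along those lines might look like the sketch below. It inspects a tools/call response for the `isError` flag and for expected text in the result. The function name `canary_ok`, the `canary-doc` argument, and the expected string are hypothetical, and plain text matching is a rough stand-in for parsing the response with `jq`.

```shell
# Succeeds only if a tools/call response (JSON text in $1) is not flagged as
# an error and contains the expected text in $2.
canary_ok() {
  response=$1; expected=$2
  case "$response" in
    *'"isError":true'*|*'"isError": true'*) return 1 ;;   # tool-level failure
    *'"error"'*)                            return 1 ;;   # JSON-RPC error object
  esac
  case "$response" in
    *"$expected"*) return 0 ;;
  esac
  return 1
}

# Example: call get_document on the green slot and check the content.
# RESPONSE=$(curl -sf -X POST http://localhost:3002/mcp \
#   -H 'Content-Type: application/json' \
#   -H 'Accept: application/json, text/event-stream' \
#   -d '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"get_document","arguments":{"id":"canary-doc"}}}')
# canary_ok "$RESPONSE" "Known canary title" || { echo "Canary tool call failed"; exit 1; }
```

Because the match is textual, a legitimate result that contains the string `"error"` would be rejected; pick a canary document whose content avoids that, or switch to a real JSON parser.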