Release engineering · 2026-06-27 · Release Engineering arc

The MCP Server Release Engineering Stack: Blue-Green Deploys, Preview Environments, npm Publishing, and Automated Releases

Almost every MCP server tutorial ends at node dist/index.js. The server runs, Claude Desktop can call its tools, the author closes the laptop. But "it runs locally" and "it ships reliably to users" are separated by a large gap — one that REST API developers close with mature, well-understood tooling, but that MCP authors are figuring out from scratch because MCP's stateful SSE sessions, tool schema contracts, and protocol handshake requirements introduce problems that don't exist in the REST world. This post synthesizes the five disciplines that close that gap: blue-green deployments with session drain, preview environments with per-PR database namespacing, npm publishing with tool-schema semver, monorepo coordination with a shared schema package, and release automation that closes the release loop with a post-deploy protocol probe.

The five disciplines at a glance

Each discipline solves a specific class of release failure. Skipping any one leaves a specific category of breakage with no safety net:

Discipline	Failure class it prevents	MCP-specific complication	Verification signal
Blue-green deploy	Users hit a broken version while the new one starts up	SSE sessions are stateful — cutting traffic mid-session drops active AI client sessions without reconnect	AliveMCP probe on green slot before traffic flip
Preview environments	Bugs that pass unit + integration tests but fail in a real deployed environment	Missing env vars, unapplied database migrations, CORS issues — only caught with a real deployment	AliveMCP probe per PR as a required CI check
npm publishing	Breaking tool schema changes shipped without a major version bump	Tool `inputSchema` is a contract consumed by LLMs — a breaking change to it breaks every agent using that tool	AliveMCP probe on the HTTP-mode endpoint post-publish
Monorepo	Tool schema drift between packages in the same repo	Multiple MCP servers sharing tool definitions — a schema change in one must propagate to all	One AliveMCP monitor per app, independent health signals
Release automation	Manual release steps that are skipped under pressure or forgotten	MCP tool schema changes require snapshot testing to enforce semver discipline automatically	AliveMCP post-deploy probe closes the release loop

The unifying pattern: all five disciplines converge on the same external verification signal, an MCP protocol probe that runs the full initialize → tools/list handshake against the live endpoint. An HTTP health check can return 200 while the MCP server is silently broken, because it exercises the HTTP stack and not tool registration. Only a probe that speaks the actual MCP protocol can confirm the new version is correctly serving requests.
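To make the handshake concrete, here is a minimal sketch of such a probe in TypeScript. It is not AliveMCP's implementation. The Send transport type, the function names, and the rule that an empty tool list counts as unhealthy are all assumptions made for illustration; the initialize payload mirrors the one used in the deploy gate later in this post.

```typescript
// Sketch of an MCP protocol probe: initialize, then tools/list.
// The transport is injected so the same logic works over any HTTP client.

type JsonRpcMessage = {
  jsonrpc: "2.0";
  method: string;
  id?: number; // omitted for notifications
  params?: Record<string, unknown>;
};

type JsonRpcResponse = {
  jsonrpc: "2.0";
  id: number;
  result?: Record<string, unknown>;
  error?: { code: number; message: string };
};

// Hypothetical transport signature: resolves to the server's reply,
// or undefined for notifications, which have no reply.
type Send = (message: JsonRpcMessage) => Promise<JsonRpcResponse | undefined>;

function buildInitialize(id: number): JsonRpcMessage {
  return {
    jsonrpc: "2.0",
    id,
    method: "initialize",
    params: {
      protocolVersion: "2024-11-05",
      capabilities: {},
      clientInfo: { name: "deploy-probe", version: "1" },
    },
  };
}

// Healthy means: initialize returned a protocolVersion and tools/list
// returned a non-empty array (the non-empty rule is an assumed policy).
function isHealthy(
  init: JsonRpcResponse | undefined,
  tools: JsonRpcResponse | undefined,
): boolean {
  const version = init?.result?.["protocolVersion"];
  const list = tools?.result?.["tools"];
  return typeof version === "string" && Array.isArray(list) && list.length > 0;
}

async function probe(send: Send): Promise<boolean> {
  try {
    const init = await send(buildInitialize(1));
    // The client signals readiness before issuing further requests.
    await send({ jsonrpc: "2.0", method: "notifications/initialized" });
    const tools = await send({ jsonrpc: "2.0", id: 2, method: "tools/list" });
    return isHealthy(init, tools);
  } catch {
    return false; // network errors and timeouts count as unhealthy
  }
}
```

A real send function would POST each message to the endpoint and parse the reply. Keeping it injectable means the health rule itself can be unit-tested without a running server.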

Blue-green deployments: the session drain window

Blue-green deployment is simple for stateless REST APIs: start the new version on a green slot, verify it passes health checks, flip the load balancer, decommission blue. The same sequence applied naively to an MCP SSE server causes a user-visible failure: every AI client that was mid-session on the blue slot loses its connection the instant you flip traffic, and the MCP SDK's Client does not automatically reconnect and replay the initialize handshake. From the user's perspective, their Claude Desktop session simply dies.

The solution is a session drain window inserted between "flip new connections to green" and "shut down blue":

Internet → Caddy (port 443)
              ├─ blue  → localhost:3001  ← existing SSE sessions draining
              └─ green → localhost:3002  ← all new connections since flip

The drain window is the interval after setting blue's weight to 0 (so no new connections route there) but before killing the blue process. Active SSE sessions on blue continue uninterrupted. After 60–120 seconds, the vast majority of sessions have naturally ended, and you can safely shut down blue. A Caddy-based flip looks like:

# Start green slot on port 3002 (blue still taking production traffic on 3001)
PORT=3002 node dist/index.js &
GREEN_PID=$!

# Gate: verify green passes the MCP protocol handshake
# (--retry-connrefused gives the new process a few seconds to bind its port)
curl -sf --retry 5 --retry-delay 2 --retry-connrefused -X POST http://localhost:3002/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"deploy-probe","version":"1"}}}' \
| grep -q protocolVersion || { echo "Green slot failed probe"; kill "$GREEN_PID"; exit 1; }

# Flip: new connections go to green, existing SSE sessions drain on blue
caddy reload --config /etc/caddy/green.Caddyfile

# Drain window: wait for active SSE sessions to end naturally
sleep 90

# Shut down blue (BLUE_PID is assumed to have been recorded when blue was started)
kill "$BLUE_PID"

The gate before the flip is the critical step. An AliveMCP monitor pointed at the green slot URL confirms not just that the process is running, but that it correctly completes the MCP protocol handshake — the same check an AI client performs when it connects. If the green slot is running but has a broken tool registration or a schema mismatch, that probe fails before a single user is affected.
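The fixed sleep in the drain window can be tightened if the blue process tracks its own open sessions. The sketch below is one way to do that, under assumptions: the DrainTracker class and its method names are hypothetical, and the caller is expected to invoke open and close from its own SSE connect and disconnect handlers. Time is passed in explicitly so the logic stays testable.

```typescript
// Tracks open SSE sessions on the blue slot so it can exit as soon as
// draining is complete instead of always waiting the full window.
class DrainTracker {
  private readonly active = new Set<string>();
  private deadlineMs: number | null = null;

  // Call when an SSE session connects.
  open(sessionId: string): void {
    this.active.add(sessionId);
  }

  // Call when an SSE session disconnects.
  close(sessionId: string): void {
    this.active.delete(sessionId);
  }

  get activeCount(): number {
    return this.active.size;
  }

  // Call at the moment new connections start routing to green.
  startDrain(nowMs: number, maxDrainMs: number): void {
    this.deadlineMs = nowMs + maxDrainMs;
  }

  // Blue may exit once draining has begun and either every session
  // has ended or the drain window has elapsed.
  canShutDown(nowMs: number): boolean {
    if (this.deadlineMs === null) return false;
    return this.active.size === 0 || nowMs >= this.deadlineMs;
  }
}
```

The deadline keeps the deploy bounded: a client that never disconnects delays shutdown by at most the configured window, which matches the behavior of the sleep-based script.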

If your server uses Streamable HTTP transport in stateless mode (no session ID issued at initialize) instead of SSE, you can skip the drain window entirely: each request carries all its context, and a mid-deploy request is at worst retried. A Streamable HTTP server that does issue session IDs typically still keeps per-session state in the process, so treat it like SSE and keep the drain window. The tradeoff is that stateless requests have no long-lived connection that can persist through brief network hiccups the way an SSE stream can, but for deploy scenarios the stateless property is a significant operational simplification.

Preview environments: catching what CI cannot

Unit tests and integration tests run against in-process fakes. They're fast and reliable for testing handler logic. But there's a class of bug they cannot detect: the production environment differs from the test environment in a way that breaks the MCP server only when deployed. Preview environments — ephemeral per-PR deployments — are the layer that catches this class before it reaches production.

The four failure categories that preview environments catch that CI misses:

Failure class	Why CI misses it	Why preview catches it
Missing environment variable	Test doubles don't need real credentials	The deployed server crashes at startup without the var
Unapplied database migration	In-memory fakes have no schema	The deployed server hits a real database with the old schema
CORS misconfiguration	Browser isn't involved in CI	The browser-based AI client fails the preflight check
TLS certificate issue	Tests run over HTTP or use test TLS	The deployed server uses the real certificate chain

The most elegant pattern for handling the database challenge in preview environments is per-PR PostgreSQL schema namespacing. Rather than spinning up a separate database instance per PR — expensive and slow to provision — you create a separate schema within a shared development database:

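# CI creates a schema named after the PR branch (lowercased, with anything
# outside [a-z0-9_] replaced by an underscore so it is a valid identifier)
PREVIEW_SCHEMA="pr_$(printf '%s' "$GITHUB_HEAD_REF" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9_' '_' | cut -c1-40)"

psql "$DATABASE_URL" -c "CREATE SCHEMA IF NOT EXISTS $PREVIEW_SCHEMA;"

# Run migrations against only the preview schema.
# Note: the query parameter that selects the schema is driver-specific
# (libpq-style clients expect options=-c%20search_path%3D<schema>, Prisma
# expects schema=<schema>); adjust the suffix below to match your driver.
DATABASE_URL="${DATABASE_URL}?search_path=${PREVIEW_SCHEMA}" \
  node scripts/migrate.js

# Deploy the preview server with the schema-scoped URL
DATABASE_URL="${DATABASE_URL}?search_path=${PREVIEW_SCHEMA}" \
  railway up --service mcp-server-preview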

On PR close, CI drops the schema: psql "$DATABASE_URL" -c "DROP SCHEMA IF EXISTS $PREVIEW_SCHEMA CASCADE;". The IF EXISTS guard keeps teardown from failing when the preview was never created. Each PR gets isolated database state, migrations are tested against a real database engine, and teardown is a single command.
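If the preview tooling lives in a Node script rather than shell, the schema name can be derived the same way. The sketch below is illustrative only: previewSchemaName and dropSchemaSql are hypothetical helpers, and this version lowercases the branch name and replaces every character outside [a-z0-9_] so the result is always a valid unquoted PostgreSQL identifier.

```typescript
// Derive a per-PR schema name from a branch name.
// The branch portion is capped at 40 characters before the pr_ prefix,
// which stays well under PostgreSQL's 63-byte identifier limit.
function previewSchemaName(branch: string): string {
  const sanitized = branch
    .toLowerCase()
    .replace(/[^a-z0-9_]/g, "_")
    .slice(0, 40);
  return `pr_${sanitized}`;
}

// Build the teardown statement run when the PR closes.
// The name is double-quoted defensively even though it is already sanitized.
function dropSchemaSql(schema: string): string {
  return `DROP SCHEMA IF EXISTS "${schema}" CASCADE;`;
}

const schema = previewSchemaName("feature/add-search");
console.log(schema); // pr_feature_add_search
console.log(dropSchemaSql(schema));
```

Sanitizing to a fixed character set matters because the name is interpolated into DDL, and branch names can contain dots, uppercase letters, and other characters that are not valid in an unquoted identifier.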

The required CI check for every preview environment is an AliveMCP probe against the preview URL. Not an HTTP health check — an MCP protocol probe. A preview server that returns HTTP 200 on /health but has a broken tool registration (a common migration-related failure) will pass an HTTP health check and fail the protocol probe. The CI check should block merge until the protocol probe passes:

# .github/workflows/preview.yml
- name: Wait for AliveMCP preview probe to pass
  run: |
    for i in $(seq 1 12); do
      STATUS=$(curl -sf "https://alivemcp.com/api/probe?url=${PREVIEW_URL}" | jq -r '.status')
      [ "$STATUS" = "healthy" ] && exit 0
      echo "Probe attempt $i: $STATUS — retrying in 15s"
      sleep 15
    done
    echo "Preview environment failed MCP probe after 3 minutes"
    exit 1

npm publishing: semver for tool schema changes

Many MCP servers are distributed as npm packages — installed with npx, configured in Claude Desktop or Cursor via a path in the config file, and updated with npm update. The versioning rules for an npm-distributed MCP server are different from the rules for a library because the primary consumer of your tool schema is not a developer but an LLM. A change that a developer would consider minor — renaming a parameter from userId to user_id — is a breaking change for every agent prompt that was written against the old parameter name.

The semver table for MCP tool schema changes:

Change type	Semver bump	Reasoning
Bug fix that doesn't affect `inputSchema` or tool behavior	Patch (1.0.x)	Safe for any consumer — no agent prompts need to change
New tool added (no existing tool changes)	Minor (1.x.0)	Additive — existing consumers work unchanged, new consumers can use new tool
New optional parameter added to existing tool	Minor (1.x.0)	Additive — existing calls without the parameter still work
Existing parameter renamed or removed	Major (x.0.0)	Breaking — agent prompts referencing the old parameter name will pass invalid inputs
Tool renamed or removed	Major (x.0.0)	Breaking — agent workflows depending on the tool will fail at tool-selection time
Required parameter added to existing tool	Major (x.0.0)	Breaking — existing calls without the new parameter will fail schema validation
Tool description changed significantly	Minor (1.x.0) minimum	Affects LLM tool selection — agents calibrated on old description may behave differently
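
The table can be encoded as a small classifier that compares two tool manifests and returns the minimum bump. This is a sketch under assumptions, not a complete schema differ: the ToolDef shape and function names are hypothetical, only top-level properties and required lists are inspected, any description change is treated as minor, and an optional parameter becoming required is treated as major (the table covers only newly added required parameters). Type changes on an existing parameter are not detected.

```typescript
type Bump = "patch" | "minor" | "major";

interface ToolDef {
  name: string;
  description: string;
  inputSchema: {
    properties?: Record<string, unknown>;
    required?: string[];
  };
}

const rank: Record<Bump, number> = { patch: 0, minor: 1, major: 2 };
const maxBump = (a: Bump, b: Bump): Bump => (rank[a] >= rank[b] ? a : b);

function classifyTool(before: ToolDef, after: ToolDef): Bump {
  const oldProps = Object.keys(before.inputSchema.properties ?? {});
  const newProps = Object.keys(after.inputSchema.properties ?? {});
  const oldRequired = new Set(before.inputSchema.required ?? []);
  const newRequired = after.inputSchema.required ?? [];

  // Parameter renamed or removed: an old name is no longer present.
  if (oldProps.some((p) => !newProps.includes(p))) return "major";
  // Parameter newly required (brand new, or previously optional).
  if (newRequired.some((p) => !oldRequired.has(p))) return "major";

  let bump: Bump = "patch";
  // New optional parameter: additive.
  if (newProps.some((p) => !oldProps.includes(p))) bump = "minor";
  // Description change: affects LLM tool selection.
  if (before.description !== after.description) bump = "minor";
  return bump;
}

function classifyManifest(before: ToolDef[], after: ToolDef[]): Bump {
  const afterByName = new Map(after.map((t) => [t.name, t] as const));
  const beforeNames = new Set(before.map((t) => t.name));
  let bump: Bump = "patch";
  for (const tool of before) {
    const next = afterByName.get(tool.name);
    if (!next) return "major"; // tool renamed or removed
    bump = maxBump(bump, classifyTool(tool, next));
  }
  // Any tool that did not exist before is an additive change.
  if (after.some((t) => !beforeNames.has(t.name))) bump = maxBump(bump, "minor");
  return bump;
}
```

A check like this could run in CI next to the release tooling to flag a commit whose declared bump is lower than what the manifest diff implies.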

The package.json structure for a published MCP server differs from a library. You need both a bin entry for the CLI entry point (enabling npx use) and an exports field exposing the programmatic API for testing:

{
  "name": "@yourorg/my-mcp-server",
  "version": "1.3.0",
  "type": "module",
  "exports": {
    ".": "./dist/index.js",
    "./server": "./dist/server.js"
  },
  "bin": {
    "my-mcp-server": "./dist/cli.js"
  },
  "files": ["dist/", "README.md"]
}

The exports["./server"] entry exposes the createServer(deps) factory function so tests can import the server without triggering the CLI entry point's process.exit calls. The bin entry is what enables npx @yourorg/my-mcp-server to work in Claude Desktop config. Automate the publish step via GitHub Actions with --provenance on every tagged release:

# .github/workflows/publish.yml
on:
  push:
    tags: ['v*.*.*']

jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write  # required for --provenance
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          registry-url: 'https://registry.npmjs.org'
      - run: npm ci
      - run: npm test
      - run: npm run build
      - run: npm publish --provenance --access public
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

Monorepos: the shared mcp-schema package

If you maintain more than one MCP server, the temptation is to copy tool definitions between repositories. The problem: a single edit then has to be repeated across N repositories, each of which needs its own PR, review, and deploy. The monorepo pattern with pnpm workspaces solves this by extracting tool definitions into a shared package that every server imports.

The layout that scales:

packages/
  mcp-schema/        ← shared tool definitions, Zod schemas, DOCUMENT_TOOLS array
    src/
      tools.ts       ← all tool definitions exported as an array
      schemas.ts     ← Zod schemas for all tool inputs
    package.json

apps/
  documents-server/  ← imports @yourorg/mcp-schema
  search-server/     ← imports @yourorg/mcp-schema
  admin-server/      ← imports @yourorg/mcp-schema
pnpm-workspace.yaml
turbo.json

The packages/mcp-schema package exports the tool definitions as a typed array:

// packages/mcp-schema/src/tools.ts
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

export const GetDocumentSchema = z.object({
  documentId: z.string().describe('The document ID to retrieve'),
  version:    z.number().optional().describe('Optional version number; defaults to latest'),
});

export const DOCUMENT_TOOLS = [
  {
    name: 'get_document',
    description: 'Retrieve a document by ID. Returns full content and metadata.',
    inputSchema: zodToJsonSchema(GetDocumentSchema),
  },
  // ... more tools
] as const;

Each app server imports from @yourorg/mcp-schema and registers the tools from the array. The critical architectural benefit: a change to any tool definition in the shared package propagates to every server automatically — there's no synchronization step, no risk of servers diverging, and the TypeScript compiler catches any place in any server that calls a changed schema incorrectly.
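
As a sketch of how the compiler enforces this, the handler table in each app can be keyed by the tool names in the shared array. Everything here is illustrative: the inline DOCUMENT_TOOLS stands in for the real import from @yourorg/mcp-schema, list_documents is a hypothetical second tool, and Handler and callTool are assumed names rather than SDK APIs.

```typescript
// Stand-in for the array exported by the shared package.
const DOCUMENT_TOOLS = [
  { name: "get_document", description: "Retrieve a document by ID." },
  { name: "list_documents", description: "List available documents." },
] as const;

// Union of every tool name in the shared array.
type ToolName = (typeof DOCUMENT_TOOLS)[number]["name"];
type Handler = (args: Record<string, unknown>) => string;

// Record<ToolName, Handler> makes the compiler reject a missing handler
// or a handler registered under a name the shared package no longer exports.
const handlers: Record<ToolName, Handler> = {
  get_document: (args) => `document ${String(args["documentId"])}`,
  list_documents: () => "no documents",
};

// Narrow an incoming tool name from the wire to the known union.
function isToolName(name: string): name is ToolName {
  return DOCUMENT_TOOLS.some((t) => t.name === name);
}

function callTool(name: string, args: Record<string, unknown>): string {
  if (!isToolName(name)) throw new Error(`Unknown tool: ${name}`);
  return handlers[name](args);
}

console.log(callTool("get_document", { documentId: "42" })); // document 42
```

Renaming get_document in the shared array would turn the handlers object into a compile error in every app that has not been updated, which is the propagation guarantee described above.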

The Turborepo pipeline ensures that mcp-schema builds before any app that depends on it:

// turbo.json
{
  "tasks": {
    "build":  { "dependsOn": ["^build"], "outputs": ["dist/**"] },
    "test":   { "dependsOn": ["^build"] },
    "lint":   { "dependsOn": ["^build"] }
  }
}

For CI, pnpm's filter syntax with git change detection scopes the build to affected packages: pnpm --filter "...[HEAD~1]" build builds only packages that changed plus their dependents. A change to mcp-schema triggers builds for all three app servers. A change to only search-server triggers only that build.
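
The selection rule behind that filter (changed packages plus everything that depends on them) is a small graph traversal. The sketch below illustrates the idea only; it is not how pnpm implements it, and the DepGraph type and affectedPackages function are hypothetical.

```typescript
// Maps each workspace package to the workspace packages it depends on.
type DepGraph = Record<string, string[]>;

// Returns the changed packages plus all direct and transitive dependents.
function affectedPackages(graph: DepGraph, changed: string[]): string[] {
  const affected = new Set<string>(changed);
  let grew = true;
  // Repeat until a full pass adds nothing new (fixed point).
  while (grew) {
    grew = false;
    for (const pkg of Object.keys(graph)) {
      const deps = graph[pkg] ?? [];
      if (!affected.has(pkg) && deps.some((d) => affected.has(d))) {
        affected.add(pkg);
        grew = true;
      }
    }
  }
  return Array.from(affected).sort();
}

const graph: DepGraph = {
  "mcp-schema": [],
  "documents-server": ["mcp-schema"],
  "search-server": ["mcp-schema"],
  "admin-server": ["mcp-schema"],
};

console.log(affectedPackages(graph, ["mcp-schema"]));
// [ 'admin-server', 'documents-server', 'mcp-schema', 'search-server' ]
console.log(affectedPackages(graph, ["search-server"]));
// [ 'search-server' ]
```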

In the monorepo pattern, each deployed app gets its own AliveMCP monitor rather than a single monitor covering all three. The reason: you want independent health signals. If documents-server goes down during a deploy, its AliveMCP alert should fire independently of search-server's status. A single monitor covering a load-balanced endpoint would miss a partial outage where two of three servers are healthy.

Release automation: closing the loop with snapshot testing

Manual release processes fail under pressure. The step most often skipped is the one that confirms the new version actually works, because after a two-hour deploy incident, nobody wants to run one more check. Release automation via semantic-release or changesets takes those skippable manual steps out of the release loop (changesets keeps one deliberate human step, the per-PR changeset), and snapshot testing is the enforcement mechanism that makes the automation safe.

The two main tools for MCP server release automation serve different team workflows:

Tool	How it decides what to release	Best fit
semantic-release	Reads conventional commit messages (`feat:`, `fix:`, `BREAKING CHANGE:`) and derives the next version automatically	Teams that want zero manual steps — commit message discipline is the only required input
changesets	Developers run `changeset add` per PR to record the change type (patch/minor/major) and description; a version PR accumulates them and publishes on merge	Teams that want a human-reviewed CHANGELOG and explicit per-PR declarations

For MCP servers, the key conventional commit mapping is:

# patch — bug fixes that don't touch tool schema
fix: handle null response from upstream API gracefully

# minor — new tools (additive)
feat: add search_documents tool with full-text query support

# major — breaking tool schema change (parameter rename, tool removal)
feat!: rename userId parameter to user_id across all tools

BREAKING CHANGE: The userId parameter has been renamed to user_id in
get_user, update_user, and delete_user. Agent prompts using userId will
receive a Zod validation error. Update all prompts before upgrading.

Snapshot testing is the gate that makes automated releases safe for MCP servers specifically. It captures the tools/list response — the complete tool manifest the server advertises — and fails the build if it changes unexpectedly:

// test/snapshot.test.ts
import { expect, it } from 'vitest';
import { createMcpTestClient } from './helpers/mcp-client.js';
import { createServer } from '../src/server.js';

it('tool manifest matches snapshot', async () => {
  const client = await createMcpTestClient(createServer);

  const { tools } = await client.raw.listTools();
  const manifest = tools.map(t => ({
    name:        t.name,
    description: t.description,
    inputSchema: t.inputSchema,
  }));

  // Fails if any tool name, description, or schema changed without intent
  expect(manifest).toMatchSnapshot();
});

When a developer changes a tool definition, the snapshot test fails with a diff showing exactly what changed. This is the moment to decide, using the semver table above: is this a minor (new tool, new optional parameter, or reworded description) or a major (a renamed, removed, or newly required parameter, or a renamed or removed tool)? A pure bug fix that leaves the manifest untouched never trips the snapshot. Update the snapshot with vitest --update (or vitest -u) and commit the new snapshot file alongside the conventional commit that reflects the decision. The commit message and the snapshot change together form an auditable record of every tool schema change in the git history.

After every successful release, a post-deploy probe closes the loop. For deployed services, this is a curl against the live endpoint:

# semantic-release exec plugin: runs after publish
after_success() {
  echo "Probing live endpoint after release..."
  for i in $(seq 1 6); do
    STATUS=$(curl -sf "https://alivemcp.com/api/probe?url=${PRODUCTION_URL}" | jq -r '.status')
    [ "$STATUS" = "healthy" ] && { echo "Post-deploy probe passed"; return 0; }
    echo "Attempt $i: $STATUS — retrying in 30s"
    sleep 30
  done
  echo "Post-deploy probe failed after 3 minutes — investigate immediately"
  exit 1
}

If the post-deploy probe fails, the release pipeline exits non-zero, the CI job is red, and the on-call notification fires before a single user reports an issue. This is the verification gap that manual releases leave open: the CHANGELOG is updated, the tag is pushed, the package is published to npm, but nobody checks whether the deployed service is actually serving the new version's protocol correctly.

The unifying signal: AliveMCP across all five disciplines

What differentiates MCP release engineering from REST API release engineering is the protocol probe requirement. An HTTP health check endpoint returns 200 or 500 — it tells you the process is alive and the HTTP stack is responding. It does not tell you whether the MCP protocol handshake succeeds, whether the tool manifest is correctly registered, whether the Zod schemas are valid, or whether the server can complete an initialize → tools/list sequence.

All five release engineering disciplines require a probe that speaks the actual MCP protocol — not an HTTP health check. The reason each discipline converges on this requirement:

Blue-green: the green slot must pass an MCP protocol probe before any traffic is flipped. HTTP 200 on /health does not confirm tool registration.
Preview environments: the per-PR CI gate must confirm the MCP handshake succeeds in the deployed environment. A missing env var that breaks tool registration returns HTTP 200 on /health but fails the protocol probe.
npm publishing: HTTP-mode MCP servers can be monitored post-publish by pointing an AliveMCP monitor at the deployment URL that serves the published package.
Monorepo: each app in the monorepo needs an independent AliveMCP monitor so a broken deploy to one app doesn't hide behind another app's healthy status.
Release automation: the post-deploy probe in the release pipeline closes the automation loop — confirming that what was released is actually serving protocol requests correctly, not just that the process started.

The gap that AliveMCP closes across all five: in-process tests (unit tests, integration tests, snapshot tests) verify what the code does. The protocol probe verifies what the running deployed service does. A deploy-time misconfiguration — a missing environment variable, a failed migration, a TLS certificate mismatch, a process that panics on first request — passes every in-process test and fails the protocol probe. That's the bug class each of these five disciplines is designed to catch before users do.

Adoption sequence: what to implement first

The five disciplines don't need to be implemented simultaneously. The right adoption order minimizes risk while delivering the highest-value safety net earliest:

Monorepo first (if you have multiple packages). The shared mcp-schema package is a prerequisite for versioning discipline — you can't enforce semver across multiple packages if the tool definitions are duplicated everywhere. If you only have one MCP server, skip this step for now.
Versioning discipline second. Commit to the semver table for tool schema changes. Add the snapshot test. This has zero infrastructure cost and immediately makes every tool schema change an explicit decision rather than an accident.
Release automation third. Once versioning discipline is established, automate it. Set up semantic-release or changesets. The snapshot test is the gate that makes this automation trustworthy — without it, automated major/minor/patch decisions might be wrong.
Preview environments fourth. Once your release process is clean, add preview environments to catch the class of bugs that only appear in real deployments. Railway and Fly.io both have native PR environment support that makes this low-effort to add.
Blue-green deployments fifth (if SSE transport). Once everything else is in place, the blue-green pattern is the last piece and the most operationally complex. If you've already migrated to Streamable HTTP transport running statelessly, skip this: with no per-connection session state, there is nothing to drain.

At each step, add the corresponding AliveMCP monitor: one per deployed app, pointed at the production URL, with a Slack or webhook alert for any outage. The monitor is the confirmation that your release engineering stack is actually working in production — not just that it passed in CI.

Frequently asked questions

How long should the session drain window be for blue-green MCP deployments?

60–120 seconds covers most MCP SSE sessions. A typical Claude Desktop session with an active MCP tool is either completing a tool call (seconds to tens of seconds) or idle (can be dropped without impact). If your server's tool calls can take longer than 60 seconds — for example, a server that runs slow database queries or calls upstream APIs with long timeouts — measure your p99 tool call duration and set the drain window to 1.5× that value. After the drain window, any remaining SSE connections are dropped; this is acceptable because the alternative (waiting indefinitely) blocks the deploy forever.
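
The 1.5× p99 rule is easy to compute from logged tool call durations. The sketch below is a worked example under assumptions: drainWindowSeconds is a hypothetical helper, p99 uses the nearest-rank method, and the 60-second floor reflects the lower bound suggested above rather than anything the rule itself requires.

```typescript
// Compute a drain window from observed tool call durations (in seconds).
function drainWindowSeconds(durationsSec: number[]): number {
  const floorSec = 60; // assumed minimum, from the 60-120 second guidance
  if (durationsSec.length === 0) return floorSec;

  const sorted = durationsSec.slice().sort((a, b) => a - b);
  // Nearest-rank p99: the smallest value with at least 99% of samples at or below it.
  const rankIndex = Math.ceil((99 * sorted.length) / 100) - 1;
  const p99 = sorted[rankIndex] ?? 0;

  return Math.max(floorSec, Math.ceil(1.5 * p99));
}

// Fast tool calls: the floor applies.
console.log(drainWindowSeconds([1, 2, 3])); // 60
// One slow call dominates a small sample: 1.5 x 200 = 300.
console.log(drainWindowSeconds([10, 200, 90])); // 300
```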

What's the difference between a preview environment and a staging environment?

A staging environment is a single shared environment that represents the current state of the main branch — it's used for manual QA before release. A preview environment is ephemeral and per-PR — it's created when a PR opens and destroyed when the PR closes. Preview environments are superior for catching the class of bugs described here (missing env vars, unapplied migrations) because they're always testing the specific combination of code changes in the PR against a real database migration, not the accumulated state of everything that's merged since someone last touched staging.

Should I use semantic-release or changesets for an MCP server?

Use changesets if your team has multiple contributors and you want a human-reviewed CHANGELOG that describes tool schema changes in plain language (important when your users are other developers integrating your MCP server into their agents). Use semantic-release if you're a solo maintainer who trusts conventional commit discipline and wants zero manual steps between "merge to main" and "npm publish" (semantic-release derives the version and creates the tag itself from the commit history). Either choice works: the snapshot test is the gate that makes both safe for MCP tool schema changes.

Why does the per-PR PostgreSQL schema namespacing approach work better than separate databases per PR?

Separate databases per PR require provisioning (slow, often takes 1–3 minutes on cloud providers), cost money when you have many open PRs, and require teardown automation that's easy to get wrong and leaves orphaned databases accumulating charges. Schema namespacing in a shared development database provisions in milliseconds (it's a DDL statement), costs nothing beyond the shared database's baseline, and teardown is a single DROP SCHEMA. The isolation is sufficient for catching migration and configuration bugs — preview environments are not running production load, so the shared database's performance characteristics don't matter.

What should the snapshot test capture besides tool names and descriptions?

Capture the complete inputSchema for every tool — including nested object shapes, required/optional fields, enum values, description strings on each property, and minimum/maximum constraints. The description strings on schema properties are the most commonly overlooked: they're part of what the LLM reads to understand how to call the tool correctly. A change to a property description is technically non-breaking (the tool still works) but can affect LLM behavior enough to break agent workflows that were calibrated on the old wording. Including property descriptions in the snapshot makes every such change explicit and intentional rather than invisible.