Practical guide · 2026-06-02 · Production operations

MCP Server Production Checklist: 12 Things to Verify Before Going Live

Building an MCP server that works on your laptop and building one that handles real AI-agent traffic are different jobs. The first mostly requires knowing the SDK. The second requires a dozen additional decisions about authentication, error handling, shutdown behaviour, monitoring, and schema governance. These decisions look optional until the day a tool handler throws uncaught, a connection pool exhausts under session load, or a rolling deploy silently breaks every client that cached your old tool list. This checklist covers each of the twelve, grouped by layer: startup and transport boundary (items 1–3), error contract and shutdown (items 4–5), connection pooling and logging (items 6–7), external monitoring and schema governance (items 8–9), build pipeline (items 10–11), and streaming infrastructure (item 12). A recommended order for hardening an existing server is given near the end.

TL;DR

The twelve items, grouped into six layers:

  1. Fail at startup — validate every required environment variable before the server accepts traffic.
  2. Auth at the transport boundary — Bearer-token or JWT middleware before initialize, never inside tool handlers.
  3. Rate limit at the transport boundary — connection rate and concurrent session cap enforced at the HTTP layer before a session is created; per-tool call budget enforced inside the session.
  4. Typed error handling — isError: true for application failures, McpError for protocol invariants; never let an uncaught exception escape a tool handler.
  5. Graceful shutdown with drain — SIGTERM → mark unhealthy → wait for active sessions to close → shut down DB connections → exit. In that order.
  6. Connection pool tuned for long-lived sessions — acquire per tool call, not per session; size for concurrent tool calls, not concurrent sessions.
  7. Structured JSON logging without PII — level, ts, session_id, tool_name, duration_ms; never log tool call arguments.
  8. External uptime monitoring — a real initialize + tools/list probe from outside your network, on a 60-second cadence.
  9. Schema snapshot in version control — SHA-256 of sorted tools/list committed as a baseline; CI fails if the hash changes without a baseline update.
  10. Three CI gates — protocol compliance test, schema snapshot gate, and post-deploy probe before the deployment pipeline marks the release successful.
  11. TypeScript strict mode with Zod — single source of truth for tool input schema, runtime validation, and type narrowing.
  12. SSE infrastructure config — server.timeout = 0, proxy flush interval set to −1 (immediate), termination grace period wider than your drain timeout.

Why MCP servers need a different checklist from REST APIs

Most production hardening advice is written for stateless HTTP APIs. MCP servers are different in three structural ways that invalidate a large fraction of the standard advice.

Sessions are stateful and long-lived. An MCP session begins with an initialize handshake and stays open until the client disconnects, which in an agent context can be minutes or hours. A traditional rate limit applied to each HTTP request would trigger constantly; a connection pool that holds one handle per request would be idle 99% of the time. Both patterns need rethinking for the session model.

Errors have two distinct channels. A REST API signals application errors with HTTP status codes. An MCP server has to choose, for every error, between a JSON-RPC error object (which terminates the current request with an error code) and a successful response that contains isError: true in the result body (which keeps the session alive and gives the AI client enough information to decide whether to retry). Getting this wrong means either leaking uncaught exceptions as confusing -32603 errors or, worse, breaking sessions that should have stayed open.

Schema is part of the contract. When a client calls a REST endpoint and the request body changes shape, the server returns a 400 and the client sees it immediately. When an MCP tool schema changes — a parameter added, a required field renamed — the client that cached the old tools/list response at session start has no way to know. The session continues, the next call silently fails parameter validation, and the AI agent gets a -32602 invalid params error it wasn't expecting. Schema governance that would be optional on a REST API is load-bearing on an MCP server.

Item 1 — Validate required environment variables at startup

The most common failure mode in production MCP servers is not a bug in the tool logic — it is a server that starts successfully, passes the initialize probe, and then fails on the first real tool call because process.env.API_KEY is undefined. The fix is straightforward: validate every required variable at startup, before the server begins accepting connections, and throw a descriptive error if any are missing.

const required = ['MCP_API_KEY', 'DATABASE_URL', 'REDIS_URL'];
for (const key of required) {
  if (!process.env[key]) throw new Error(`Missing required env var: ${key}`);
}

This forces the failure to happen at deploy time, where your CI pipeline and post-deploy probe catch it, rather than at call time, where the only signal is a confused AI client. The pattern extends naturally to value validation: a DATABASE_URL that doesn't start with postgres:// is better caught at startup than on the first query.
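The value-validation extension might look like the sketch below. It assumes the same three variables as the snippet above; the validateEnv and assertEnv helpers and their rules are illustrative names, not part of any SDK.

```typescript
// Hypothetical helper: checks presence first, then the shape of each value.
type Env = Record<string, string | undefined>;

function validateEnv(env: Env): string[] {
  const problems: string[] = [];
  for (const key of ['MCP_API_KEY', 'DATABASE_URL', 'REDIS_URL']) {
    if (!env[key]) problems.push(`Missing required env var: ${key}`);
  }
  const dbUrl = env['DATABASE_URL'];
  // Both postgres:// and postgresql:// are accepted URI schemes for PostgreSQL.
  if (dbUrl && !/^postgres(ql)?:\/\//.test(dbUrl)) {
    problems.push('DATABASE_URL must start with postgres:// or postgresql://');
  }
  return problems;
}

// At startup, before listening: report every problem at once, then refuse to start.
function assertEnv(env: Env): void {
  const problems = validateEnv(env);
  if (problems.length > 0) throw new Error(problems.join('\n'));
}
```

Calling assertEnv(process.env) as the first statement of the entry point keeps the failure at deploy time. Collecting all problems before throwing means one failed deploy surfaces every missing variable rather than one per attempt.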

The variables themselves should be injected by the platform, never committed to source control. The practical patterns for each deployment target — Fly.io secrets, Railway variables, Kubernetes secretRef, Docker Compose env_file — are covered in detail in the MCP server environment variables guide. The key rule: commit a .env.example with placeholder values and add .env to .gitignore; never load dotenv in production where the platform injects variables directly.

Item 2 — Add authentication at the HTTP transport boundary

Authentication belongs on the HTTP transport layer, in Express middleware that runs before any MCP traffic is processed. The common mistake — checking auth inside individual tool handlers — means the initialize handshake succeeds, the session opens, the client discovers the tool list, and only the subsequent tool call returns a 401. This leaves the session in a half-open state that MCP clients handle inconsistently. Authentication at the transport boundary means an unauthenticated request never reaches initialize at all — it sees an HTTP 401 and never creates a session.

For API key authentication, the middleware is a Bearer token check using Node's built-in timingSafeEqual to prevent timing attacks:

import { timingSafeEqual, createHash } from 'node:crypto';

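function requireAuth(req, res, next) {
  const header = req.headers.authorization ?? '';
  // Accept only the "Bearer <token>" form; anything else counts as no token.
  const provided = header.startsWith('Bearer ') ? header.slice(7) : '';
  const expected = process.env.MCP_API_KEY ?? '';
  // Hash both sides so timingSafeEqual always compares equal-length buffers.
  const a = createHash('sha256').update(provided).digest();
  const b = createHash('sha256').update(expected).digest();
  // Reject empty values explicitly so an unset key can never match an empty token.
  const ok = provided !== '' && expected !== '' && timingSafeEqual(a, b);
  if (!ok) return res.status(401).json({ error: 'Unauthorized' });
  next();
}
// `app` is the Express application that hosts the MCP transport.
app.use('/mcp', requireAuth);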

For OAuth 2.0 bearer tokens with JWT, the pattern is JWKS verification using the jose library — cache the key set at module scope, verify issuer and audience claims, and store the decoded identity in res.locals for use in tool handlers. For a monitoring probe, the right approach is a dedicated probe API key with read-only scope that is never used by real clients — if the probe key starts returning 401, the alert fires on credential expiry before real users see it. The full patterns are in the MCP server authentication deep-dive.

Item 3 — Add rate limiting at the transport boundary

MCP rate limiting has four distinct layers, and the placement of each matters. Connection-rate limiting (how many new sessions can be opened per minute from a given IP) belongs at the HTTP layer, returning HTTP 429 before transport.handleRequest is called. Concurrent session caps also belong at the HTTP layer, rejecting new sessions once the cap is reached. Per-tool call rate limiting belongs inside the session, returning isError: true so the session stays alive for the tools that aren't being abused. Per-tool budgets (a specific expensive tool gets its own call counter) are a fourth layer applied only where the cost model requires it.

The reason placement matters: rate limiting the initialize response adds spurious latency to probe metrics and can cause false health alerts. Rate limiting at the HTTP layer before transport.handleRequest means the probe's initialize call is never affected by rate state from real clients. For distributed deployments where multiple server instances share the load, the per-connection state needs to live in Redis using a sliding window backed by a Lua script to make the increment + window-trim + read atomic. The single-instance pattern is an in-process token bucket, fast enough that it adds no measurable latency to initialize. Full recipes are in the MCP server rate limiting guide.
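For the single-instance case, an in-process token bucket can be sketched as follows. The class name, the per-IP keying, and the numbers (a burst of 5 with 5 refills per minute) are assumptions chosen for illustration.

```typescript
// In-process token bucket: `capacity` tokens, refilled at `refillPerSec`.
// The clock is injectable so the behaviour can be checked without real waiting.
class TokenBucket {
  private capacity: number;
  private refillPerSec: number;
  private now: () => number;
  private tokens: number;
  private last: number;

  constructor(capacity: number, refillPerSec: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity;
    this.last = now();
  }

  // Returns true and spends one token if the caller is within its rate.
  tryTake(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per client IP (hypothetical keying; use whatever identifies a caller).
const buckets = new Map<string, TokenBucket>();

function allowNewSession(ip: string): boolean {
  let bucket = buckets.get(ip);
  if (!bucket) {
    bucket = new TokenBucket(5, 5 / 60); // burst of 5, refilled at 5 per minute
    buckets.set(ip, bucket);
  }
  return bucket.tryTake();
}
```

An HTTP middleware would call allowNewSession with the caller's address and answer 429 when it returns false, before transport.handleRequest runs. The map grows with the number of distinct callers, so a real deployment would also evict idle buckets.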

Item 4 — Wire typed error handling across all tool handlers

Every tool handler has exactly two failure paths, and the right choice between them is deterministic. An application failure — the database returned no rows, the external API returned a 404, the user passed a value the logic cannot process — should return a successful JSON-RPC response with isError: true and a human-readable error message in the content array. The session stays open. The AI client gets a textual description of what went wrong and can decide whether to retry, rephrase, or give up. A protocol failure — a method that doesn't exist, a request body that isn't valid JSON-RPC, a session that has already been terminated — should throw a McpError with the appropriate error code from the SDK's ErrorCode enum.

The failure mode to eliminate entirely is the uncaught exception escaping a tool handler. An uncaught exception at the handler level propagates to the SDK's internal error handling, which returns a -32603 Internal error JSON-RPC error — a response that carries no application context and may confuse the client into terminating the session. The structural fix is a try/catch wrapping every tool handler with a fallback to isError: true. The broader error taxonomy — retry-safe vs non-retry-safe errors, transient vs permanent failures, the right alert thresholds for each — is in the MCP server error handling guide.
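A wrapper along the following lines is one way to apply that try/catch uniformly. To stay runnable without the SDK, the sketch declares its own ToolResult type and uses a local ProtocolError class as a stand-in for McpError; both names are assumptions.

```typescript
// Shape of a tool result as described above: a content array plus optional isError.
type ToolResult = {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
};

// Stand-in for the SDK's McpError so the sketch runs without the SDK installed.
class ProtocolError extends Error {
  code: number;
  constructor(code: number, message: string) {
    super(message);
    this.code = code;
  }
}

// Wraps a handler so that no unexpected exception escapes it.
function safeTool<A>(handler: (args: A) => Promise<ToolResult>) {
  return async (args: A): Promise<ToolResult> => {
    try {
      return await handler(args);
    } catch (err) {
      // Protocol failures keep their own channel and are rethrown untouched.
      if (err instanceof ProtocolError) throw err;
      const message = err instanceof Error ? err.message : String(err);
      return { content: [{ type: 'text', text: `Tool failed: ${message}` }], isError: true };
    }
  };
}
```

Passing the raw exception message through is convenient while developing, but it may expose internal details to the client; mapping known failures to curated messages is the safer production choice.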

Item 5 — Implement graceful shutdown with a drain timeout

The shutdown sequence matters for two reasons: it protects active sessions from being dropped mid-call, and it determines whether your uptime monitoring sees a crash or a clean transition. A server that exits immediately on SIGTERM will leave active tool calls with broken connections, cause the probe to see a hard failure rather than a planned restart, and force clients to reconnect from scratch.

The correct sequence is: (1) set isShuttingDown = true so the health endpoint returns 503, removing the pod from the load balancer's pool; (2) stop accepting new connections by closing the HTTP listener; (3) wait for active sessions to close, up to a configured DRAIN_TIMEOUT_MS; (4) close database connection pools and other resources; (5) exit with code 0. The drain timeout should be set to your P99 tool-call duration plus a 5-second buffer — check your structured logs or AliveMCP latency history to size it correctly.

Two infrastructure details that break this without any code error: the container's CMD must use the exec form so that Node.js runs as PID 1 and receives SIGTERM directly (shell form spawns a shell that doesn't forward signals), and the platform's grace period must be wider than DRAIN_TIMEOUT_MS (Kubernetes terminationGracePeriodSeconds, Docker Compose stop_grace_period). On Kubernetes, a preStop sleep of 5 seconds absorbs the endpoint-propagation race that can cause requests to arrive after the listener is already closed. Full recipes and the platform-specific configuration table are in the MCP server graceful shutdown guide.
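The five-step sequence can be sketched as one function with its dependencies injected. Everything named here (ShutdownDeps, markUnhealthy, and so on) is hypothetical; the point is the ordering, not the specific API.

```typescript
type ShutdownDeps = {
  markUnhealthy: () => void;           // step 1: health endpoint starts returning 503
  closeListener: () => void;           // step 2: stop accepting new connections
  activeSessions: () => number;        // number of sessions still open
  closeResources: () => Promise<void>; // step 4: database pools and other resources
  drainTimeoutMs: number;
};

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs steps 1 to 4 in order and reports whether the drain finished in time.
async function shutdown(deps: ShutdownDeps): Promise<{ drained: boolean }> {
  deps.markUnhealthy();
  deps.closeListener();
  const deadline = Date.now() + deps.drainTimeoutMs;
  // Step 3: poll until every session has closed or the drain timeout expires.
  while (deps.activeSessions() > 0 && Date.now() < deadline) {
    await sleep(50);
  }
  const drained = deps.activeSessions() === 0;
  await deps.closeResources();
  return { drained };
}
```

A SIGTERM handler would call shutdown and then process.exit(0) once it resolves (step 5). closeListener is deliberately synchronous: with Node's HTTP server, close() stops new connections immediately, while its callback only fires after existing connections end, so awaiting it would bypass the drain timeout.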

Item 6 — Size and use the connection pool correctly for MCP sessions

MCP sessions are long-lived. If your tool handlers acquire a database connection at session start and hold it until the session closes, the pool is exhausted as soon as the number of concurrent sessions reaches pool_size, no matter how few queries those sessions actually run. In a stateless API, by contrast, exhaustion depends on request rate and query duration. The practical consequence: a server with a pool of 20 and a typical concurrent session count of 15 has only five sessions of headroom. The 21st concurrent session waits for a connection and eventually hits a pool timeout error, which can happen quickly under any real load.

The correct pattern is: acquire per tool call, release immediately in a finally block. Most query builders (Knex, Drizzle) do this automatically when you chain calls — the connection is released when the Promise resolves or rejects. The pattern that causes subtle bugs is using the raw adapter's pool.connect(), which returns a client that holds the connection until you explicitly call client.release(). A missed release in an error path means a connection is permanently leaked from the pool until the process restarts.
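When the raw client API is unavoidable, a small helper can make the release impossible to forget. The Pool and PoolClient interfaces below are minimal assumptions modelled on pools whose connect() hands out a client that must be released explicitly; withClient is a hypothetical name.

```typescript
// Minimal pool shape: connect() returns a client that must be released by the caller.
interface PoolClient {
  query(sql: string, params?: unknown[]): Promise<unknown>;
  release(): void;
}
interface Pool {
  connect(): Promise<PoolClient>;
}

// Acquire for the duration of one tool call; release on every path.
async function withClient<T>(pool: Pool, work: (client: PoolClient) => Promise<T>): Promise<T> {
  const client = await pool.connect();
  try {
    return await work(client);
  } finally {
    client.release(); // runs on success, on throw, and on early return
  }
}
```

A tool handler then wraps only its database work in withClient, so the connection is held for the length of the query rather than the length of the session.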

Pool size formula: target_concurrent_sessions × avg_tool_calls_per_session × db_query_fraction × concurrent_fraction. For most MCP servers the right number is significantly smaller than the default (min: 2, max: 10 is a reasonable start), because tool calls are short. Pool exhaustion shows up as latency spikes on AliveMCP's response-time graph before it shows up as errors — the first sign is numPendingAcquires > 0 in pool telemetry during peak load. The detailed sizing analysis and Redis connection patterns are in the MCP server connection pooling guide.
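As a worked example of the formula, the sketch below plugs in a made-up workload. The reading of db_query_fraction as the share of tool calls that touch the database, and of concurrent_fraction as the share of those in flight at the same moment, is an assumption, as are all four input values.

```typescript
// Applies the formula from the text; every input is an estimate you supply.
function estimatePoolSize(input: {
  targetConcurrentSessions: number;
  avgToolCallsPerSession: number;
  dbQueryFraction: number;    // assumed meaning: share of tool calls that query the DB
  concurrentFraction: number; // assumed meaning: share of those in flight at once
}): number {
  const raw =
    input.targetConcurrentSessions *
    input.avgToolCallsPerSession *
    input.dbQueryFraction *
    input.concurrentFraction;
  return Math.max(2, Math.ceil(raw)); // never below the suggested minimum of 2
}

// Hypothetical workload: 20 sessions x 4 calls x 0.5 x 0.125 = 5 connections.
const suggested = estimatePoolSize({
  targetConcurrentSessions: 20,
  avgToolCallsPerSession: 4,
  dbQueryFraction: 0.5,
  concurrentFraction: 0.125,
});
```

Here the estimate is 5, which sits inside the min 2, max 10 starting range; the estimate is only a first guess to be corrected against pool telemetry under load.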

Item 7 — Add structured JSON logging without PII

The one rule that is non-negotiable: never log tool call arguments. Tool handlers receive user queries and personal context from AI conversations. The arguments the user passed to the AI, which the AI passed to your tool, contain information the user did not consent to have written to your log storage and, eventually, your log aggregation vendor. Enforce this at the logger level with a redact configuration, not just in code review discipline — the architectural goal is that it is structurally impossible for a tool argument to appear in a log line, regardless of what any individual developer writes.

Beyond the PII rule, the structure of log lines matters for operational value. Every line should include: level, ts (ISO 8601), session_id (for correlating all lines from one session), and msg. Tool call lines add: tool_name, duration_ms, and error_code (null on success). The initialize line adds: client_name, duration_ms, and error_code. AliveMCP probe calls appear as client_name: "AliveMCP" and can be filtered in dashboards to separate probe traffic from real-user traffic.

Propagate session_id automatically using AsyncLocalStorage — store the session ID in async context during the initialize handler, then read it in the logger's mixin. This means session_id appears on every log line from every tool call in that session without any function needing to explicitly pass it through. The monitoring gap that structured logging cannot cover — when the server is completely down, there are zero log lines — is exactly what external uptime monitoring exists to close. Full patterns and the pino configuration recipe are in the MCP server logging guide.
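The propagation mechanism can be shown with Node's AsyncLocalStorage alone. This sketch uses a hand-written logLine function instead of a logging library so that it stays self-contained; sessionContext, logLine, and handleToolCall are hypothetical names, and the hard-coded duration is a placeholder.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Per-session context, set once when the session is established.
const sessionContext = new AsyncLocalStorage<{ sessionId: string }>();

// The field type is closed on purpose: there is no slot for tool arguments.
type LogFields = { tool_name?: string; duration_ms?: number; error_code?: number | null };

// Builds one JSON log line, reading session_id from async context.
function logLine(level: 'info' | 'warn' | 'error', msg: string, fields: LogFields = {}): string {
  const line = {
    level,
    ts: new Date().toISOString(),
    session_id: sessionContext.getStore()?.sessionId ?? null,
    msg,
    ...fields,
  };
  return JSON.stringify(line);
}

// No session ID is passed in; it is picked up from the surrounding run() call.
function handleToolCall(toolName: string): string {
  return logLine('info', 'tool call finished', { tool_name: toolName, duration_ms: 12, error_code: null });
}

const example = sessionContext.run({ sessionId: 'sess-123' }, () => handleToolCall('search'));
```

In a logging library the same getStore() lookup would live in the logger's mixin or equivalent hook, so individual call sites never handle the session ID at all.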

Item 8 — Add external uptime monitoring

HTTP monitoring is not enough for an MCP server. An HTTP monitor checks only that your server returns a 200 response to an HTTP request. A healthy-looking HTTP response is compatible with an MCP server where initialize hangs, tools/list returns an empty array, or every tool call returns a JSON-RPC error. Of the 2,414 endpoints in the Q3 2026 registry audit, 26.9% were in the "HTTP alive, MCP dead" bucket: they would have shown green on an HTTP monitor on the day we caught them.

A protocol-aware probe runs the full initialize → tools/list sequence, verifies that initialize returns a result with a capabilities object, a protocolVersion string, and a serverInfo block, checks that tools/list returns a non-empty tools array, and hashes the tools array to detect schema drift between probes. The probe should run from outside your network, because a probe that runs on the same host as the server doesn't catch network-level failures. Run it on a 60-second cadence, from multiple geographic regions to catch region-specific degradation.
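The checks a probe applies to the two responses can be sketched as a pure function. It takes the result objects of the initialize and tools/list responses as already-parsed values; how they are fetched is left out, and evaluateProbe and ProbeVerdict are hypothetical names.

```typescript
import { createHash } from 'node:crypto';

type ProbeVerdict = { ok: boolean; reason?: string; toolsHash?: string };

// Evaluates the `result` objects of the two JSON-RPC responses a probe collects.
function evaluateProbe(initResult: unknown, listResult: unknown): ProbeVerdict {
  const init = initResult as
    | { protocolVersion?: unknown; capabilities?: unknown; serverInfo?: unknown }
    | null
    | undefined;
  if (!init || typeof init.protocolVersion !== 'string' || init.protocolVersion === '') {
    return { ok: false, reason: 'initialize: missing protocolVersion' };
  }
  if (typeof init.capabilities !== 'object' || init.capabilities === null) {
    return { ok: false, reason: 'initialize: missing capabilities' };
  }
  if (typeof init.serverInfo !== 'object' || init.serverInfo === null) {
    return { ok: false, reason: 'initialize: missing serverInfo' };
  }
  const tools = (listResult as { tools?: unknown } | null | undefined)?.tools;
  if (!Array.isArray(tools) || tools.length === 0) {
    return { ok: false, reason: 'tools/list: empty or missing tools array' };
  }
  // Hash the tools array so consecutive probes can be compared for drift.
  const toolsHash = createHash('sha256').update(JSON.stringify(tools)).digest('hex');
  return { ok: true, toolsHash };
}
```

The hash here is taken over the array exactly as received, so a server that returns tools in a different order would register as drift; a production probe would normalise ordering before hashing.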

AliveMCP runs this probe automatically for every MCP server registered in the public directories, from all five probe regions, with schema-drift hashing. If you are shipping a private server or want configured alerts, the free tier covers 5 servers; the Author tier ($9/month) covers 25. The probe sequence in curl form is in check if an MCP server is alive; the full uptime monitoring architecture is in MCP server uptime monitoring.

Item 9 — Commit a schema snapshot and gate deploys on it

Your tool list is a contract. Every client that connects to your server, including AI agents that cache the tool list at the start of a conversation, depends on the tool names, parameter names, parameter types, and required-field sets remaining stable across your deploys. When you add a required parameter to an existing tool, every client that is mid-session with the cached old tool list will send tool calls with the old parameter set and receive a -32602 invalid params error — without any indication that the schema changed.

The structural fix is a schema snapshot committed to your repository. At the end of your build step, run initialize + tools/list against the just-built server, sort the tools array by name, sort each tool's properties deterministically, serialize to JSON, and SHA-256 hash the result. Commit the hash as a file (e.g. schema-snapshot.sha256). Your CI pipeline has a gate that runs the same hash against the current server build and fails if the hash differs from the committed baseline. This makes any schema change a mandatory review moment — the developer has to update the baseline file, which creates a diff, which gets reviewed.
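The hashing step might be implemented along these lines. This is a sketch under the assumption that the tools array has already been fetched from the running server; canonicalize and schemaHash are illustrative names.

```typescript
import { createHash } from 'node:crypto';

// Recursively sorts object keys so that key order never affects the hash.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) sorted[key] = canonicalize(source[key]);
    return sorted;
  }
  return value;
}

type Tool = { name: string; [key: string]: unknown };

// Sort tools by name, canonicalize each one, then hash the serialized result.
function schemaHash(tools: Tool[]): string {
  const ordered = [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  const json = JSON.stringify(ordered.map(canonicalize));
  return createHash('sha256').update(json).digest('hex');
}
```

Array order inside a tool definition (for example the required list) is preserved, so reordering it changes the hash. Whether that should count as a schema change is a policy decision; sorting those arrays as well would make the gate more lenient.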

The snapshot gate catches pre-deploy schema drift. AliveMCP's continuous hash monitoring catches post-deploy schema changes that are invisible to a one-time gate (config-driven schema changes, feature-flag-controlled tools). The two are complementary: CI catches what's knowable before the deploy; AliveMCP catches what only becomes visible after. The CI gate configuration — including how to run the snapshot check in GitHub Actions and how to integrate it with the post-deploy probe gate — is in the MCP server versioning guide and the MCP server CI/CD guide.

Item 10 — Add three MCP-specific CI gates

Standard CI pipelines check compilation, test pass rate, and lint. MCP servers need three additional gates that standard CI doesn't include by default.

Gate 1: Protocol compliance test. A test that starts the server in a test environment, sends a real initialize request, and verifies the response shape — protocolVersion is a non-empty string, serverInfo.name and serverInfo.version are present, capabilities is an object. This gate catches SDK upgrades that change the protocol version string, configuration errors that produce malformed responses, and refactors that accidentally remove required initialization fields. It should run on every push, not just on releases.

Gate 2: Schema snapshot gate. As described in item 9 — SHA-256 of sorted tools/list must match the committed baseline or the gate fails. This gate runs after compilation, because the schema is only verifiable against a running server.

Gate 3: Post-deploy probe gate. After the deployment step pushes the new version to production, the pipeline waits up to 120 seconds for an external probe to verify that the production endpoint answers a real initialize + tools/list sequence successfully. Only when that verification passes does the pipeline mark the deployment successful. If the probe times out or returns an error, the pipeline marks the deployment failed and triggers rollback. This gate converts an undetected production regression into a deploy-time failure, typically reducing mean time to detection from "when the next user complains" to "within 2 minutes of deploy".

AliveMCP provides a webhook notification on each probe state change that can serve as the post-deploy gate signal. Alternatively, the probe script in check if an MCP server is alive can be wrapped in a retry loop with a timeout and called directly from the pipeline. The complete four-stage pipeline configuration — build → test → deploy → verify — is in the MCP server CI/CD guide.
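The retry-loop variant can be sketched as below. The probe itself is injected as a function that resolves true on success, so the sketch makes no assumption about how initialize + tools/list is performed; waitForHealthy and the 5-second interval are illustrative choices, while the 120-second default follows the gate description above.

```typescript
// Polls `probe` until it succeeds or the overall timeout would be exceeded.
async function waitForHealthy(
  probe: () => Promise<boolean>,
  timeoutMs = 120_000,
  intervalMs = 5_000,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      if (await probe()) return true;
    } catch {
      // A thrown error (connection refused, timeout) counts as "not ready yet".
    }
    // Give up if another wait would run past the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A pipeline step would exit non-zero when this resolves false, which is what marks the deployment failed and lets the rollback step run.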

Item 11 — Use TypeScript strict mode with Zod schemas

Zod schemas in MCP server code play three roles simultaneously, and all three are load-bearing in production. First, Zod generates the JSON Schema that your tool's inputSchema returns to clients via tools/list — meaning the schema that AI clients see in the tool list is derived from the same declaration as the validation code, not from a separate hand-maintained definition that can drift. Second, Zod validates incoming tool call arguments at runtime, returning a structured -32602 invalid params error with a precise validation message before your handler logic runs. Third, Zod's type inference narrows the TypeScript type of the arguments inside the handler, eliminating an entire class of runtime type errors.

TypeScript's strict: true flag makes this effective in production by surfacing process.env.API_KEY as string | undefined (forcing the startup validation in item 1) and by catching unhandled nullable returns from database queries. Adding noUncheckedIndexedAccess: true, which strict does not enable on its own, also catches potential undefined in indexed accesses. The sourceMap: true flag in tsconfig.json, combined with starting Node.js with --enable-source-maps, makes stack traces in production logs point at TypeScript lines rather than compiled JavaScript, which is the difference between a 2-minute and a 20-minute debugging session.

The one build-pipeline rule: never run ts-node in production. It compiles TypeScript on every startup, adding seconds to cold-start time and additional CPU overhead during the initialization phase. Compile to dist/ with tsc and run the compiled output. For development, use a watch-mode runner (ts-node has no built-in watch flag; tsx watch and ts-node-dev are common choices), and use tsc --noEmit as a fast CI typecheck gate. The full TypeScript configuration, including tsconfig.json best practices, ESM setup, Zod input patterns, and the Node.js version recommendation, is in the MCP server TypeScript guide.

Item 12 — Configure SSE infrastructure for streaming tools

The final item is one that most developers discover by accident, typically when a streaming tool works perfectly in local development and silently stops working after the first production deploy. The symptom: progress notifications sent via server.notification() are never received by the client, or they arrive in a large batch after a long delay rather than incrementally. The root cause is almost always an intermediate layer — a reverse proxy, a CDN, or the Node.js HTTP server itself — buffering the Server-Sent Events stream before forwarding it.

The infrastructure configuration required for streaming to work reliably has four components: (1) server.timeout = 0 on the Node.js HTTP server that Express listens on (the object returned by app.listen()), so that no socket inactivity timeout closes an idle streaming connection; current Node.js releases already default to 0, but releases before Node.js 13 defaulted to 2 minutes, so setting it explicitly is cheap insurance; (2) flush_interval -1 on the Caddy reverse proxy (or proxy_buffering off on nginx, or the X-Accel-Buffering: no response header when nginx proxies the request), instructing the proxy to forward SSE frames immediately rather than waiting to fill a buffer; (3) a platform-level connection timeout longer than your longest streaming operation (Cloudflare's 100-second limit on connections that send no data is a hard constraint that requires a keep-alive ping pattern for tools that run longer); (4) a Kubernetes ingress annotation nginx.ingress.kubernetes.io/proxy-read-timeout set wide enough for your longest tool call.
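On the application side, the pieces that live in Node.js can be sketched as follows. This is a generic SSE illustration built on node:http rather than any SDK transport; openSseStream, startKeepAlive, and the 25-second ping interval are assumptions, and if your SDK transport owns the response, only the server.timeout line applies directly.

```typescript
import { createServer, ServerResponse } from 'node:http';

// Headers that ask intermediaries not to buffer or cache the stream.
function openSseStream(res: ServerResponse): void {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'X-Accel-Buffering': 'no', // honoured by nginx when it proxies this response
  });
}

// Sends an SSE comment line periodically so idle-timeout proxies see traffic.
// Returns a function that stops the pings; call it when the stream closes.
function startKeepAlive(write: (chunk: string) => void, intervalMs = 25_000): () => void {
  const timer = setInterval(() => write(': ping\n\n'), intervalMs);
  return () => clearInterval(timer);
}

const server = createServer((_req, res) => {
  openSseStream(res);
  const stop = startKeepAlive((chunk) => res.write(chunk));
  res.on('close', stop); // fires when the response ends or the client disconnects
});
server.timeout = 0; // no socket inactivity timeout on streaming connections
```

Lines beginning with a colon are comments in the SSE format, so clients ignore the pings while every proxy on the path still sees bytes flowing well inside a 100-second idle window.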

The monitoring wrinkle: AliveMCP's probe uses initialize + tools/list only. Streaming failures are invisible to uptime probes — a server can be passing its probe while every streaming tool call fails silently. The signal for streaming failures is in structured logs: duration_ms outliers and notifications_sent: 0 on tool calls where notifications are expected. Alert on sessions open longer than max_tool_duration × 1.5. The full streaming configuration and monitoring approach are in the MCP server streaming guide.

The order to work through this list

Not all twelve items are equal in urgency or blast radius. The right order to work through them if you are hardening an existing server:

  1. Items 1, 4, and 7 first — startup validation, error contract, and logging. These three eliminate the most common sources of silent production failures and give you the observability to catch the rest. A server with these three wired correctly tells you when it breaks and why. Without them, debugging is guesswork.
  2. Items 2 and 3 second — authentication and rate limiting. If your server is public, these are security requirements, not hardening suggestions. Add them before you share the endpoint URL anywhere beyond your own testing.
  3. Items 5 and 6 third — graceful shutdown and connection pooling. These require load testing to validate and may require infra changes (container grace period, pool size configuration), so they take longer to ship correctly. But their failures are also more visible — dropped connections and pool exhaustion are loud in logs.
  4. Items 8 and 9 fourth — external monitoring and schema snapshots. Wire these before your first real-traffic deploy. Once real agents are using your server, a schema change becomes a coordination problem. The monitoring baseline is also most useful when it captures the initial healthy state, not an already-degraded state.
  5. Items 10, 11, and 12 as a batch — CI gates, TypeScript strict mode, and SSE config. These improve quality and operational stability but are rarely the difference between a server that runs and one that doesn't. Add them as a batch when the server is already stable in production.

What this checklist does not cover

Twelve items is the right scope for production readiness. This checklist does not cover architectural questions that belong upstream of "are you ready to ship": whether your tool set is the right abstraction for your use case, whether your pricing covers your infrastructure costs at scale, or whether your server belongs in the MCP registry at all. It also does not cover operational maturity questions that belong downstream of "is this ready for traffic": multi-tenant isolation, long-term schema governance, per-tenant rate limits, and the archiver pattern for long-term uptime history. Those are multi-tenant operations territory.

What this checklist does cover is the set of decisions that, in our experience probing 2,414 public MCP endpoints in the Q3 2026 audit, most frequently distinguish servers that stay up from servers that go dark within weeks. The "HTTP alive, MCP dead" bucket — 650 servers that looked healthy to an HTTP monitor but failed a real protocol probe — is almost entirely attributable to items 1, 4, and 8 above. The servers that fail silently for months before anyone notices are almost always missing item 8. Treat these as the minimum bar, then build forward from there.

Further reading