Infrastructure guide · 2026-06-03 · Production hardening

MCP Server Infrastructure Hardening Guide: Secrets Management, API Gateway, Bulkheads, Retry Logic, and Service Mesh

The resilience and configurability guide covered four concerns that operate inside the MCP server process: config validation, feature flags, circuit breakers, and compression. Those patterns make the application layer resilient. They do not, on their own, harden the infrastructure that surrounds the process — the layer where credentials enter the system, where unauthenticated connections are rejected, where transient failures are retried safely, where concurrent load is isolated so one slow dependency cannot exhaust all resources, and where a service mesh enforces all of this as a consistent policy rather than a per-service convention. Five concerns address the outer layer: secrets management, API gateway, bulkheads, retry logic, and service mesh. This guide covers them as a system.

TL;DR

The Two Resilience Layers

A useful mental model: production resilience for an MCP server has two distinct layers.

The application layer — covered in the resilience and configurability guide — handles concerns that live inside the process boundary: how env vars are validated before any connection opens (config validation), which tools a session can see (feature flags), what happens when an external API starts failing mid-operation (circuit breakers), and whether large responses compress efficiently without breaking streaming (compression).

The infrastructure layer — covered in this guide — handles concerns that live outside the process boundary or in the boundary itself:

Neither layer replaces the other. Config validation catches malformed values; secrets management ensures those values are present and were never written to a log. Circuit breakers detect broken dependencies; bulkheads limit the concurrent damage while the breaker decides whether to open. Application-layer retries handle transient failures; a service mesh catches the ones that happen before the application layer sees the request. The two layers compose — not redundantly, but complementarily.

The AliveMCP probe sits outside both layers and sees their combined effect. A secrets management failure (wrong credential → authentication error → tool calls return 401) looks different from an API gateway misconfiguration (no flush_interval -1 → SSE events delayed → clients see high latency without errors). The probe can surface both, but you need both layers wired correctly to make the signals interpretable.

Secrets Management — Credentials Before the Deps Object Opens

Secrets management addresses one narrow but critical question: how do plaintext credentials get into the environment that parseConfig() reads, without ever appearing in a log, a git commit, or a crash dump?

The four injection patterns span a spectrum from simplest to most operationally mature:

Pattern | How it works | Main risk | When to use
Plain env vars | DATABASE_URL=postgres://user:pass@host/db in shell / .env | Credential in version control, shell history, or ps aux output | Local development only
Secrets manager at deploy time | CI/CD pipeline fetches secret, injects into container as env var | Credential is still plaintext in container env; visible to any process in the container | Simple setups without dynamic rotation
SDK fetch at startup | Application calls AWS Secrets Manager / Vault in createDeps() before opening any connection | Requires IAM role or Vault auth; startup fails loudly if access is denied | AWS or Vault already in stack; credential rotation needed
Kubernetes Secrets as files | Secret mounted as file in /run/secrets/; application reads file at startup; kubelet updates file on rotation without pod restart | File visible to any process in pod; mount as noSwap tmpfs to prevent swap exposure | Kubernetes deployments with credential rotation

Regardless of which pattern you use, the Zod config schema is the validation boundary. The secrets manager's job is to produce a value; the config schema's job is to ensure that value has the right shape. Neither layer should know about the other's mechanics:

// deps.ts — SDK fetch pattern
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';
import { Pool } from 'pg';
import { parseConfig } from './config';

export async function createDeps(): Promise<Deps> {
  // 1. Fetch secrets from AWS Secrets Manager and merge into process.env
  //    before parseConfig() runs — the schema validates, not the fetcher
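  const secretId = process.env.SECRET_ARN;
  if (!secretId) throw new Error('SECRET_ARN is not set');
  const sm = new SecretsManagerClient({ region: process.env.AWS_REGION ?? 'us-east-1' });
  const { SecretString } = await sm.send(new GetSecretValueCommand({ SecretId: secretId }));
  // Fail loudly on an empty payload; never include the payload itself in the error
  if (!SecretString) throw new Error('Secret has no SecretString payload');
  const fetched: Record<string, string> = JSON.parse(SecretString);
  Object.assign(process.env, fetched); // merge into env; Zod schema validates shape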

  // 2. Now parseConfig() sees the full env including fetched secrets
  const config = parseConfig();

  // 3. Open connections using validated config — credentials never logged
  const db = new Pool({ connectionString: config.DATABASE_URL, max: 10 });
  await db.query('SELECT 1');

  return { config, db };
}
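
The same boundary holds for the Kubernetes file-mount pattern from the table: the loader produces raw values and the schema validates them. The following is a minimal sketch, not the guide's implementation. The function name readSecretFiles is hypothetical, and the default directory is an assumption taken from the /run/secrets/ path mentioned above.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Reads every regular file in a mounted secrets directory into a plain
// object: file name -> trimmed file contents. The default path is an
// assumption; use whatever mountPath your pod spec declares.
export function readSecretFiles(dir: string = '/run/secrets'): Record<string, string> {
  const secrets: Record<string, string> = {};
  for (const name of fs.readdirSync(dir)) {
    // Skip dot-prefixed bookkeeping entries that the kubelet keeps in the mount
    if (name.startsWith('.')) continue;
    const filePath = path.join(dir, name);
    // statSync follows symlinks, so symlinked secret files still count as files
    if (!fs.statSync(filePath).isFile()) continue;
    secrets[name] = fs.readFileSync(filePath, 'utf8').trim();
  }
  return secrets;
}
```

In createDeps(), the result would be merged the same way as the SDK-fetched values: Object.assign(process.env, readSecretFiles()) followed by parseConfig().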

Three invariants to enforce regardless of injection pattern:

  1. Never log the raw config object. logConfigSummary() logs credential presence and length ([32 chars]), not value. Enforce this at the logger level, not in code review; code review misses it in incident-response patches.
  2. Redact connection strings before logging. DATABASE_URL.replace(/:\/\/[^@]+@/, '://***@') strips the username and password from any Postgres URL that appears in a startup log or error message.
  3. Validate before connecting. parseConfig() throws on malformed values. The process exits before opening any connection to a database whose URL was corrupted during a rotation. AliveMCP sees a connection failure immediately rather than a degraded-mode server that accepts connections but returns errors on every authenticated tool call.
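
The first two invariants can be sketched as small pure helpers. This is an illustration under assumptions, not the guide's logConfigSummary(): the names describeSecret and summarizeConfig are hypothetical, and the caller is assumed to know which keys hold secrets.

```typescript
// Replace the userinfo part (user:password) of a URL-style connection string.
// Same pattern as the one quoted in invariant 2.
export function redactConnectionString(url: string): string {
  return url.replace(/:\/\/[^@]+@/, '://***@');
}

// Describe a secret by presence and length only, never by value.
export function describeSecret(value: string | undefined): string {
  return value ? `[${value.length} chars]` : '[missing]';
}

// Hypothetical summary builder: keys listed in secretKeys are described by
// length, every other value is passed through the connection-string redactor.
export function summarizeConfig(
  config: Record<string, string | undefined>,
  secretKeys: ReadonlySet<string>,
): Record<string, string> {
  const summary: Record<string, string> = {};
  for (const [key, value] of Object.entries(config)) {
    if (secretKeys.has(key)) {
      summary[key] = describeSecret(value);
    } else {
      summary[key] = value === undefined ? '[missing]' : redactConnectionString(value);
    }
  }
  return summary;
}
```

Logging summarizeConfig(config, new Set(['API_KEY'])) instead of config keeps the startup log useful for debugging while leaving no credential value in it.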

Dynamic secret rotation matters for long-running MCP servers. Database connections with a 24-hour-old password break when the credential is rotated unless the pool reconnects. The Kubernetes file-mount pattern handles this naturally (kubelet updates the file; fs.watch callback re-validates and re-opens the pool). For Vault, startCredentialRenewer renews the lease at half the lease_duration, and the pool reconnects on rotation. The key invariant: connection pool reconnection is triggered by the secrets layer, not by a crashed tool call that happens to hit the rotated credential first.
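
One way to keep reconnection owned by the secrets layer is a small holder that swaps the live resource when a new credential arrives. The sketch below is an assumption-laden illustration: RotatingResource and its open and close callbacks are hypothetical names, and the trigger (file watcher or lease renewer) is left to the caller.

```typescript
// Holds a live resource (for example a connection pool) and swaps it when the
// secrets layer reports a new credential. All names here are hypothetical.
export class RotatingResource<T> {
  private constructor(
    private current: T,
    private credential: string,
    private readonly open: (credential: string) => Promise<T>,
    private readonly close: (resource: T) => Promise<void>,
  ) {}

  static async create<T>(
    credential: string,
    open: (credential: string) => Promise<T>,
    close: (resource: T) => Promise<void>,
  ): Promise<RotatingResource<T>> {
    return new RotatingResource<T>(await open(credential), credential, open, close);
  }

  // Tool handlers call get() on every use so they always see the live resource.
  get(): T {
    return this.current;
  }

  // Called by the secrets layer with the newly read credential.
  // Returns false when nothing changed, true after a successful swap.
  async rotate(nextCredential: string): Promise<boolean> {
    if (nextCredential === this.credential) return false;
    // If open() throws, the old resource stays in place and keeps serving
    const next = await this.open(nextCredential);
    const previous = this.current;
    this.current = next;
    this.credential = nextCredential;
    await this.close(previous); // release the old resource last
    return true;
  }
}
```

With a Postgres pool, open would create the pool and run a test query, and close would end the old pool. Because the swap happens only after the new resource opens successfully, a bad rotated value leaves the old connections untouched.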

The interaction with config management: secrets management and config validation share the same parseConfig() Zod schema. A secret fetched by the secrets layer is validated by the same schema that validates the port number and the log level. This is deliberate — the schema is the single source of truth for what a valid server configuration looks like, regardless of where the values came from.

API Gateway — The Protocol-Aware Front Door

An API gateway addresses three things the MCP server application layer should not need to implement: TLS termination, JWT signature verification, and per-client rate limiting. Handling these at the gateway means they apply to every connection before a single byte of MCP JSON-RPC is parsed — and they apply consistently regardless of which application process handles the routed request.

The boundary between gateway and application is not arbitrary:

Concern | Gateway | Application | Reason
TLS termination | Yes | No | Gateway hardware is optimised for TLS; Node.js handles it adequately but adds overhead per process
JWT signature verification | Yes | Optionally | Gateway rejects bad tokens before the MCP server allocates a session; application extracts verified claims from forwarded headers
Per-client rate limiting | Yes | No (usually) | Gateway has the client identity before routing; a Redis-backed shared rate-limit state ensures limits hold across multiple application replicas
Tool-level authorisation | No | Yes | The gateway cannot inspect MCP JSON-RPC method names; only the application knows tools/call from tools/list
Circuit breaking to upstream | Sometimes | Yes | Application-layer circuit breakers know which downstream dependency failed; gateway-layer breakers protect against application overload

The critical MCP-specific gateway concern is SSE buffering. MCP's streaming transport sends server-to-client events as Server-Sent Events. A buffering gateway delays every SSE frame until a buffer is full, which means clients receive batched events rather than a live stream. The fix is one directive:

# Caddyfile — minimal production gateway for an MCP server
alivemcp.com {
  # TLS: auto-managed via ACME — no manual cert rotation
  # Compress everything except the streaming paths (see mcp-server-compression)
  @compressible not path /sse /mcp/stream
  encode @compressible zstd gzip

  # SSE and streaming endpoints: disable buffering
  @mcp_stream path /sse /mcp/stream
  handle @mcp_stream {
    reverse_proxy localhost:3000 {
      flush_interval -1  # CRITICAL: a reverse_proxy option; -1 flushes every write to the client immediately
      header_up X-Forwarded-For {remote_host}
      header_up X-Request-ID    {http.request.uuid}
    }
  }

  # Health probe: no auth, no rate limit — reachable by AliveMCP and LB health checks
  handle /healthz {
    reverse_proxy localhost:3000
  }

  # All other routes: per-client rate limit (rate_limit comes from a plugin, not core Caddy;
  # JWT verification is shown in the next block)
  handle {
    rate_limit {
      zone dynamic {
        key     {http.request.header.X-Api-Key}
        events  100
        window  60s
      }
    }
    reverse_proxy localhost:3000 {
      header_up X-Forwarded-For {remote_host}
      header_up X-Request-ID    {http.request.uuid}
    }
  }
}

Two things to verify after adding the gateway: (1) AliveMCP's probe still reaches the /healthz endpoint and returns healthy — if it does not, the gateway is either blocking the probe or the health endpoint is behind auth. (2) SSE tool calls return responses in real time, not in batches — a delayed first frame after a long tool call is normal; delayed intermediate stream events are a buffering problem.

JWT verification at the gateway means the application receives pre-verified identity as request headers. For Caddy, the caddy-jwt plugin verifies RS256/ES256 tokens against a JWKS endpoint and forwards verified claims:

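# Caddyfile: JWT verification block (caddy-jwt plugin)
# Directive and option names differ between JWT plugins and versions; check them against the plugin you install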
handle /api/* {
  jwt {
    primary yes
    jwks_url https://your-idp.com/.well-known/jwks.json
    allow_claims sub, plan   # forward these as X-User-Id and X-User-Plan headers
  }
  reverse_proxy localhost:3000 {
    header_up X-User-Id   {http.auth.user.id}
    header_up X-User-Plan {http.auth.user.claims.plan}
  }
}

The MCP server application reads X-User-Id and X-User-Plan in the initialize handler without re-verifying the JWT. The gateway is the verification point; the application trusts these headers only on connections that arrive from the gateway, so the application port must not be reachable directly and the gateway must overwrite any client-supplied copies of the headers. This separation lets the gateway verify tokens once per connection rather than once per tool call, which matters for long-lived SSE sessions where the JWT would otherwise be re-verified on every tools/call request.
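
The trust rule can be made explicit in code. The following is a minimal sketch under stated assumptions: the function name claimsFromGateway is hypothetical, and the trusted-address list assumes the gateway proxies over loopback, as in the Caddyfile above (reverse_proxy localhost:3000).

```typescript
export interface VerifiedClaims {
  userId: string;
  plan: string;
}

// Assumption: the gateway runs on the same host and connects over loopback.
// Replace with the gateway's address or network when it runs elsewhere.
const TRUSTED_GATEWAY_ADDRESSES = new Set(['127.0.0.1', '::1', '::ffff:127.0.0.1']);

// Returns the forwarded claims only when the connection came from the gateway.
// Header keys are lowercase because Node.js lowercases incoming header names.
export function claimsFromGateway(
  remoteAddress: string | undefined,
  headers: Record<string, string | string[] | undefined>,
): VerifiedClaims | null {
  if (!remoteAddress || !TRUSTED_GATEWAY_ADDRESSES.has(remoteAddress)) return null;
  const userId = headers['x-user-id'];
  const plan = headers['x-user-plan'];
  if (typeof userId !== 'string' || userId === '') return null;
  if (typeof plan !== 'string' || plan === '') return null;
  return { userId, plan };
}
```

A null result should reject the session at initialize time rather than fall back to an anonymous plan, so a connection that bypasses the gateway cannot claim an identity by sending the headers itself.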

The interaction with feature flags: the X-User-Plan header forwarded by the gateway is often the input to tool-registration flag resolution at initialize time. The gateway handles authentication; the application maps the authenticated plan to a flag set. Neither layer needs to know how the other works — the forwarded header is the interface.

Bulkheads — Containing Blast Radius Per Dependency

The bulkhead pattern addresses cascade failures caused by shared resource pools. When all external dependencies share a single HTTP connection pool, a slow dependency can exhaust all available sockets — blocking unrelated tools from reaching healthy dependencies. Bulkheads divide the shared pool into per-dependency allocations so that one slow dependency can only exhaust its own allocation.

The failure mode without bulkheads is concrete:

  1. The search API slows to 15-second responses (instead of 200ms).
  2. 50 concurrent MCP sessions call the search tool. Each holds an HTTP socket waiting for the search API.
  3. The server uses a shared https.Agent with maxSockets: 50. All 50 sockets are now waiting on the search API.
  4. A notify tool call needs a socket to reach the notification API — which is healthy and would respond in 100ms. It queues behind the 50 search calls and waits 15 seconds.
  5. AliveMCP sees high latency on all tools, not just search. The database is healthy. The notification API is healthy. The slow failure propagates through the shared pool.

With bulkheads: the search API gets its own https.Agent with maxSockets: 10. A search slowdown can block at most 10 concurrent tool calls. The notification API and database have their own agents — their full capacity is available regardless of search API state. The pattern lives entirely in createDeps():

// deps.ts — per-dependency HTTP agents as bulkheads
import https from 'https';
import { Pool } from 'pg';
import { parseConfig } from './config';

export interface Deps {
  searchAgent: https.Agent;
  notificationAgent: https.Agent;
  db: Pool;
  cache: Redis;
  config: AppConfig;
}

export async function createDeps(): Promise<Deps> {
  const config = parseConfig();

  const searchAgent = new https.Agent({
    maxSockets: 10,       // blast radius: at most 10 concurrent search calls
    maxFreeSockets: 2,
    timeout: 6000,
    keepAlive: true,
  });

  const notificationAgent = new https.Agent({
    maxSockets: 5,        // notification API has lower concurrency budget
    maxFreeSockets: 1,
    timeout: 5000,
    keepAlive: true,
  });

  // Database pool is already an isolated bulkhead
  const db = new Pool({
    connectionString: config.DATABASE_URL,
    max: 20,
    idleTimeoutMillis: 30_000,
    connectionTimeoutMillis: 5_000,
  });
  await db.query('SELECT 1');

  return { searchAgent, notificationAgent, db, cache: await createRedis(config), config };
}

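Pass the per-dependency agent when making HTTP calls from tool handlers. An https.Agent works with https.request and with HTTP clients that accept a Node agent; fetch in Node.js 18+ is backed by undici and does not accept one. For fetch, the equivalent bulkhead is an undici Agent passed as the dispatcher, whose connections option caps the sockets opened per origin:

import { Agent, Dispatcher, fetch } from 'undici';

// fetch equivalent of the https.Agent bulkhead: at most 10 sockets to the search origin.
// When tool handlers use fetch, store this in Deps in place of the https.Agent.
export const searchDispatcher = new Agent({ connections: 10 });

async function callSearchApi(query: string, dispatcher: Dispatcher): Promise<SearchResult[]> {
  const res = await fetch(
    `https://search.internal/v2/search?q=${encodeURIComponent(query)}`,
    {
      dispatcher,                          // per-dependency pool, not the global one
      signal: AbortSignal.timeout(5000),
    }
  );
  // Only statuses classified as transient are retryable; other failures surface as plain errors
  if (res.status === 429 || res.status === 503) throw new RetryableError(`Search API ${res.status}`);
  if (!res.ok) throw new Error(`Search API ${res.status}`);
  return (await res.json()) as SearchResult[];
}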

Semaphore-based bulkheads provide a second isolation mechanism for async operations that do not go through HTTP. A semaphore caps the number of concurrent callers regardless of the connection pool:

class Bulkhead {
  private running = 0;
  private queue: Array<() => void> = [];

  constructor(
    private maxConcurrent: number,
    private maxQueue: number,
  ) {}

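  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.running >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        throw new Error('Bulkhead full: dependency queue at capacity');
      }
      // Wait for a slot. The queue length is bounded, so callers beyond maxQueue fail fast.
      // The finishing caller hands its slot over directly, so running is not incremented here.
      await new Promise<void>(resolve => this.queue.push(resolve));
    } else {
      this.running++;
    }
    try {
      return await fn();
    } finally {
      const next = this.queue.shift();
      if (next) {
        next();          // transfer the slot to the next waiter; running stays unchanged
      } else {
        this.running--;  // nobody waiting: free the slot
      }
    }
  }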

  get stats() {
    return { running: this.running, queued: this.queue.length };
  }
}

The relationship between bulkheads and circuit breakers is complementary, not redundant. A bulkhead limits concurrent callers while a dependency is slow — it does not cut off calls. A circuit breaker cuts off calls when a dependency is broken — it does not limit concurrency. The correct layering: wrap the bulkhead-limited function with the circuit breaker. The breaker sees the final outcome after all attempts (including retries within the bulkhead window), not individual attempt failures. When the breaker opens, bulkhead capacity is immediately freed because calls fail fast instead of queuing.

Expose bulkhead stats in the health_check tool. A bulkhead that is permanently at maxConcurrent is a leading indicator of dependency degradation — it appears before the error rate climbs high enough to trip the circuit breaker:

server.tool('health_check', {}, async () => {
  return {
    content: [{
      type: 'text',
      text: JSON.stringify({
        bulkheads: {
          searchApi: deps.searchBulkhead.stats,       // { running: 10, queued: 4 }
          notificationApi: deps.notifyBulkhead.stats,
        },
        circuitBreakers: {
          searchApi: { open: deps.breakers.searchApi.opened },
          notificationApi: { open: deps.breakers.notificationApi.opened },
        },
      }),
    }],
  };
});

Retry Logic — Second Chances for Transient Failures

Retry logic gives transient failures a second chance without amplifying load on a struggling dependency. The two correctness requirements are: (1) only retry errors that are actually transient — retrying a 400 Bad Request wastes resources and confuses the dependency, and (2) space retries with exponential backoff and jitter — retrying at a fixed interval with many sessions creates a thundering herd that can overwhelm a recovering dependency.

Error classification is the most important step. Before writing any retry loop, decide which error categories are retryable for your dependencies:

Error | Retryable? | Reason
ECONNRESET, ECONNREFUSED | Yes | Network blip; the dependency may be accepting connections again in milliseconds
ETIMEDOUT | Yes (with cap) | Transient congestion, but cap at 2–3 retries; repeated timeouts indicate sustained degradation
429 Too Many Requests | Yes | Rate limit; honour the Retry-After header if present and wait at least that long
503 Service Unavailable | Yes | Transient overload; same as 429 for Retry-After handling
400 Bad Request | No | Input is wrong; retrying will produce the same error
401 Unauthorized | No | Credential is wrong or expired; retrying wastes requests
403 Forbidden | No | Access denied; retrying will produce the same error
404 Not Found | No | Resource does not exist; retrying will produce the same error
JSON parse error | No | Malformed response; will not improve on retry

A typed error class that carries retry metadata keeps the classification explicit:

export class RetryableError extends Error {
  constructor(
    message: string,
    public readonly retryAfterMs?: number,  // from Retry-After header
  ) {
    super(message);
    this.name = 'RetryableError';
  }
}

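// Returns a plain boolean: errors with a transient network code are retryable
// without being RetryableError instances, so a type predicate would be unsound.
function isRetryable(err: unknown): boolean {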
  if (err instanceof RetryableError) return true;
  if (err instanceof Error) {
    const code = (err as NodeJS.ErrnoException).code;
    return code === 'ECONNRESET' || code === 'ETIMEDOUT' || code === 'ECONNREFUSED';
  }
  return false;
}
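
The retryAfterMs field has to come from somewhere. The sketch below shows one way to derive it from a response, following the status rows of the table above. The names parseRetryAfterMs and classifyStatus are hypothetical, and statuses that the table does not list are treated as non-retryable here, which is an assumption rather than a rule from this guide.

```typescript
// Parse a Retry-After header value into milliseconds. The header carries
// either a number of seconds ("120") or an HTTP date.
export function parseRetryAfterMs(value: string | null, now: number = Date.now()): number | undefined {
  if (value === null) return undefined;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const date = Date.parse(trimmed);
  if (Number.isNaN(date)) return undefined;
  return Math.max(0, date - now); // a date in the past means "retry now"
}

export type Classification =
  | { retryable: true; retryAfterMs: number | undefined }
  | { retryable: false };

// 429 and 503 are retryable and may carry Retry-After; everything else is not.
export function classifyStatus(
  status: number,
  retryAfter: string | null,
  now: number = Date.now(),
): Classification {
  if (status === 429 || status === 503) {
    return { retryable: true, retryAfterMs: parseRetryAfterMs(retryAfter, now) };
  }
  return { retryable: false };
}
```

A caller would use it as: if the classification is retryable, throw new RetryableError(message, classification.retryAfterMs); otherwise throw a plain Error so the retry loop stops at once.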

Exponential backoff with full jitter prevents thundering herds. Full jitter randomises the entire delay window — delay = random(0, min(base × 2ⁿ, MAX_DELAY)) — rather than adding a small jitter to a fixed base. When 50 sessions all fail at the same moment, full jitter spreads retries across the full window instead of clustering them:

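const BASE_DELAY_MS = 200;
const MAX_DELAY_MS = 10_000;
const MAX_ATTEMPTS = 4;

// Promise-based delay used between attempts
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));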

export async function withRetry<T>(
  fn: () => Promise<T>,
  context: { toolName: string; sessionId: string },
): Promise<T> {
  let lastErr: unknown;

  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const willRetry = isRetryable(err) && attempt < MAX_ATTEMPTS;
      console.log(JSON.stringify({
        event: 'retry',
        ...context,
        attempt,
        willRetry,
        err: err instanceof Error ? err.message : String(err),
      }));

      if (!willRetry) break;

      // Honour Retry-After if present; otherwise use exponential backoff with full jitter
      const suggested = err instanceof RetryableError ? (err.retryAfterMs ?? 0) : 0;
      const backoff = Math.random() * Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
      await sleep(Math.max(suggested, backoff));
    }
  }

  throw lastErr;
}
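
To make the delay windows concrete, the arithmetic of the loop above can be isolated. This sketch reuses the same formula and constants; the function names are hypothetical and the random source is injectable only so the result can be checked.

```typescript
// Upper bound of the full-jitter window for a given attempt number, using
// the same formula and defaults as withRetry above.
export function backoffCapMs(attempt: number, baseMs: number = 200, maxMs: number = 10_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Full jitter: a uniform draw from [0, cap). `random` defaults to Math.random.
export function fullJitterDelayMs(attempt: number, random: () => number = Math.random): number {
  return random() * backoffCapMs(attempt);
}

// Windows for the three sleeps that MAX_ATTEMPTS = 4 allows
const caps = [1, 2, 3].map((attempt) => backoffCapMs(attempt));
console.log(caps); // [ 400, 800, 1600 ]
```

With MAX_ATTEMPTS = 4 the loop sleeps at most three times, after attempts 1, 2 and 3, so the worst-case total backoff is under 2.8 seconds when no Retry-After value is supplied. The 10-second cap only takes effect from attempt 6 onward.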

Idempotency keys are required for tool calls with side effects. Without them, a retry of a write operation that succeeded but whose response was lost produces a duplicate side effect — two emails sent, two records created, two payments charged. A deterministic key derived from the session ID and call parameters makes the operation safe to retry:

import { createHash } from 'crypto';

function idempotencyKey(sessionId: string, toolName: string, params: unknown): string {
  return createHash('sha256')
    .update(JSON.stringify({ sessionId, toolName, params }))
    .digest('hex')
    .slice(0, 32);
}

// In a tool handler for a write operation:
server.tool('send_notification', SendNotificationSchema, async (input) => {
  const key = idempotencyKey(sessionId, 'send_notification', input);
  await withRetry(
    () => callNotificationApi({ ...input, idempotencyKey: key }, deps.notificationAgent),
    { toolName: 'send_notification', sessionId },
  );
  return { content: [{ type: 'text', text: 'Notification sent' }] };
});
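
One caveat with hashing JSON.stringify output: two inputs with the same fields in a different key order serialise differently and so produce different keys. A possible refinement, sketched here with hypothetical names (canonicalJson, stableIdempotencyKey), sorts object keys at every level before hashing.

```typescript
import { createHash } from 'node:crypto';

// Serialise with object keys sorted at every level so that logically equal
// inputs produce identical strings. Undefined properties are dropped, as
// JSON.stringify does.
export function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .filter((key) => obj[key] !== undefined)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`);
    return `{${entries.join(',')}}`;
  }
  // Primitives; undefined inside an array becomes null, matching JSON.stringify
  return JSON.stringify(value) ?? 'null';
}

// Same shape as idempotencyKey above, but insensitive to key order.
export function stableIdempotencyKey(sessionId: string, toolName: string, params: unknown): string {
  return createHash('sha256')
    .update(canonicalJson({ sessionId, toolName, params }))
    .digest('hex')
    .slice(0, 32);
}
```

Whether two identical calls in one session should share a key is a product decision: with a purely parameter-derived key, an intentional repeat of the same write is also deduplicated by the downstream API.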

Circuit breaker coordination is the final piece. When the circuit breaker for a dependency is open, retrying is wrong — the dependency is known-broken, not transiently failing. The correct structure wraps the retrying function inside the circuit breaker, not outside it:

// CORRECT: breaker wraps retry
const result = await deps.breakers.searchApi.fire(() =>
  withRetry(
    () => callSearchApi(input.query, deps.searchAgent),
    { toolName: 'search', sessionId },
  )
);

// WRONG: retry wraps breaker — retries a known-broken dependency
const result = await withRetry(
  () => deps.breakers.searchApi.fire(() => callSearchApi(input.query, deps.searchAgent)),
  { toolName: 'search', sessionId },
);

With the correct structure, the breaker sees the final outcome after all retry attempts. If retries exhaust and the final attempt fails, the breaker increments its failure count. When the breaker opens, subsequent calls fail immediately without entering the retry loop — no retries of a known-broken dependency, no additional load on a struggling service.
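
The counting behaviour can be checked with minimal stand-ins. SimpleBreaker and retry below are hypothetical, illustration-only classes; they are not the breaker library or the withRetry helper used elsewhere in this guide, and a production breaker also needs a half-open state and a reset timer.

```typescript
// Counts one failure per fire() call, regardless of how many attempts the
// wrapped function made internally.
class SimpleBreaker {
  private failures = 0;
  constructor(private readonly threshold: number) {}

  get opened(): boolean {
    return this.failures >= this.threshold;
  }

  async fire<T>(fn: () => Promise<T>): Promise<T> {
    if (this.opened) throw new Error('Breaker open'); // fail fast, fn is never called
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      throw err;
    }
  }
}

// Bare retry loop without backoff, enough to show the layering.
async function retry<T>(fn: () => Promise<T>, attempts: number): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
    }
  }
  throw lastErr;
}

export async function demo(): Promise<{ calls: number; opened: boolean }> {
  let calls = 0;
  const failing = async (): Promise<never> => {
    calls++;
    throw new Error('down');
  };
  const breaker = new SimpleBreaker(2);
  for (let i = 0; i < 3; i++) {
    // Breaker wraps retry: each fire() is one outcome from the breaker's view
    await breaker.fire(() => retry(failing, 3)).catch(() => undefined);
  }
  return { calls, opened: breaker.opened };
}
```

With a threshold of 2 and three attempts per call, the first two fire() calls reach the dependency six times in total and open the breaker; the third is rejected without touching the dependency, so demo() resolves to { calls: 6, opened: true }.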

Retries and bulkheads interact as follows, provided the bulkhead wraps each individual attempt: a failed call releases its bulkhead slot immediately, and the retry acquires a new slot, so the bulkhead's running stat stays bounded during a retry storm. A semaphore-based bulkhead with maxQueue: 0 throws immediately when full, which lets the retry loop decide whether to wait instead of having the bulkhead silently queue retries.

Service Mesh — Infrastructure-Level Policy Enforcement

A service mesh moves retry, timeout, circuit-breaking, and mTLS policies from application code into the infrastructure layer. Instead of each service team implementing these concerns, the mesh sidecar enforces them consistently for all service-to-service traffic without any application code change. For MCP server fleets with multiple calling services, a mesh eliminates the per-service application code that would otherwise duplicate these patterns.

The two mainstream options have different tradeoffs:

Concern | Linkerd | Istio
Installation complexity | Low: CLI + annotation injection | Medium: full Kubernetes operator
mTLS | Automatic for all pod-to-pod traffic | Automatic; configurable per namespace with PeerAuthentication
Retry policy configuration | ServiceProfile CRD per route | VirtualService CRD; flexible retry conditions
Per-pod circuit breaking | Failure-accrual annotations on the Service (not part of ServiceProfile) | Via DestinationRule outlier detection
Traffic splitting / canary | TrafficSplit CRD (SMI) | VirtualService weight routing
SSE long-connection support | Good; idle connection detection adjustable | Requires timeout: 0s on SSE routes in VirtualService

An Istio VirtualService that applies retry and timeout policies to all traffic hitting the MCP server, with an SSE exemption for the streaming path:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: alivemcp-server
spec:
  hosts: ["alivemcp-server"]
  http:
    # SSE path: disable timeout — long-lived connections must not be cut by mesh
    - match:
        - uri:
            prefix: "/sse"
        - uri:
            prefix: "/mcp/stream"
      route:
        - destination:
            host: alivemcp-server
            port:
              number: 3000
      timeout: 0s          # no mesh-level timeout on SSE connections

    # All other routes: retry + 20s total budget
    - route:
        - destination:
            host: alivemcp-server
            port:
              number: 3000
      timeout: 20s
      retries:
        attempts: 3
        perTryTimeout: 5s
        retryOn: "gateway-error,connect-failure,retriable-4xx,503"
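
The numbers in the second route are worth checking against each other and against the application-level retry loop. The arithmetic below is a sketch under one stated assumption: that attempts counts retries on top of the initial try, so a request can be tried 1 + attempts times. The constant names are hypothetical; the values come from the VirtualService above and from MAX_ATTEMPTS in withRetry.

```typescript
const meshRetries = 3;     // retries.attempts in the VirtualService
const perTryTimeoutS = 5;  // retries.perTryTimeout
const routeTimeoutS = 20;  // timeout on the route
const appMaxAttempts = 4;  // MAX_ATTEMPTS in withRetry

// Assumption: the initial try is not included in `attempts`
export const meshTries = 1 + meshRetries;

// If every try runs into its per-try timeout, this much time is consumed
export const worstCaseTrySeconds = meshTries * perTryTimeoutS;

// The route timeout should be at least that large, or late retries never run
export const fitsRouteTimeout = worstCaseTrySeconds <= routeTimeoutS;

// Each mesh try can reach a tool handler that runs its own retry loop, so one
// client request can fan out to this many calls on a downstream dependency
export const worstCaseDependencyCalls = meshTries * appMaxAttempts;

console.log({ meshTries, worstCaseTrySeconds, fitsRouteTimeout, worstCaseDependencyCalls });
```

Under that assumption the budget is exactly consumed (4 tries at 5 seconds each against a 20-second route timeout), and the multiplied worst case is 16 dependency calls per client request. That multiplication is the reason to keep mesh-level retry conditions narrow and to rely on idempotency keys for writes.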

Per-pod circuit breaking via Istio DestinationRule outlier detection is complementary to application-layer circuit breakers. The mesh ejects individual pods from load-balancing rotation based on observed error rates; the application-layer breaker cuts off a specific downstream dependency for all pods. Neither is redundant:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: alivemcp-server
spec:
  host: alivemcp-server
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5      # eject pod after 5 consecutive 5xx responses
      interval: 10s                # check every 10 seconds
      baseEjectionTime: 30s        # minimum ejection duration
      maxEjectionPercent: 50       # never eject more than half the pod pool

The SSE idle timeout is the most common mesh misconfiguration for MCP servers. Both Linkerd and Istio apply a default idle connection timeout (typically 60s) that terminates long-lived connections that have no traffic. An MCP session with an idle SSE connection — no events flowing but the client is still connected — is terminated by the mesh and the client must reconnect. Set timeout: 0s on SSE routes in Istio VirtualService, or configure Linkerd's ServiceProfile with an appropriate timeout for SSE routes.

W3C traceparent propagation is the practical win from a service mesh. Istio and Linkerd both generate and propagate traceparent headers automatically. In the MCP server application, extract the incoming trace context and start a child span per tool call so distributed traces span the entire request path from calling service through the mesh sidecar to the MCP server to its downstream dependencies:

import { trace, context, propagation } from '@opentelemetry/api';

server.tool('search', SearchInputSchema, async (input, { headers }) => {
  const ctx = propagation.extract(context.active(), headers);
  const span = trace.getTracer('mcp-server').startSpan('tool.search', {}, ctx);

  try {
    const results = await context.with(
      trace.setSpan(context.active(), span),
      () => withRetry(
        () => callSearchApi(input.query, deps.searchAgent),
        { toolName: 'search', sessionId: headers['mcp-session-id'] },
      ),
    );
    return { content: [{ type: 'text', text: JSON.stringify(results) }] };
  } catch (err) {
    span.recordException(err as Error);
    throw err;
  } finally {
    span.end();
  }
});
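
For debugging propagation it helps to know what the header contains. The sketch below parses the W3C traceparent layout by hand (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags). It is an illustration only; in the handler above, propagation.extract() does this work, and the names TraceParent and parseTraceparent are hypothetical.

```typescript
export interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Four lowercase-hex fields separated by dashes, as used by version 00
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

export function parseTraceparent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  if (version === 'ff') return null;         // ff is not a valid version
  if (/^0+$/.test(traceId)) return null;     // all-zero trace ID is invalid
  if (/^0+$/.test(parentId)) return null;    // all-zero parent ID is invalid
  // Bit 0 of the flags byte is the "sampled" flag
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Logging the parsed traceId next to the X-Request-ID forwarded by the gateway gives a quick check that the trace seen by the MCP server is the one the calling service started.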

The relationship with AliveMCP's external probes: the service mesh enforces policies on service-to-service traffic inside the cluster. AliveMCP probes from outside the cluster — it sees gateway-layer failures and application-layer failures that are invisible to within-cluster mesh metrics. A misconfigured Istio VirtualService that cuts SSE connections at 60 seconds is visible to AliveMCP as a periodic latency spike (clients reconnect every 60 seconds); it is invisible to mesh-internal golden-signal metrics, which measure RPS, error rate, and latency per route and all look fine for a connection that was validly terminated and reconnected.

How the Five Concerns Compose

The five outer-layer concerns connect in a specific order. Starting with a freshly deployed MCP server:

  1. Secrets management runs first — credentials are injected into process.env before createDeps() starts. The result is a fully populated environment that the Zod config schema can validate. If secret injection fails (wrong IAM role, expired Vault token, missing Kubernetes Secret), the process exits before opening any connection.
  2. Config validation and bulkhead creation happen together in createDeps() — once parseConfig() returns a validated AppConfig, each external dependency gets its own https.Agent (bulkhead) and CircuitBreaker (application-layer circuit break). These three concerns share the same initialization call: const config = parseConfig(); const searchAgent = new https.Agent({ maxSockets: config.SEARCH_POOL_SIZE }); const searchBreaker = createBreaker(config);
  3. The API gateway verifies identity before the process sees the connection — TLS terminates at Caddy, JWTs are verified against the JWKS endpoint, rate limits are checked against the per-client Redis key, and the verified X-User-Plan header is forwarded. The MCP server's initialize handler reads the header and snapshots feature flags for the session. The gateway does not know about MCP; the application does not know about TLS.
  4. Retry logic wraps individual dependency calls inside tool handlers — each call to an external API is wrapped with withRetry(), which classifies the error, spaces retries with full-jitter backoff, honours Retry-After headers, and generates idempotency keys for write operations. Retry is layered inside the circuit breaker: the breaker sees the final outcome after all attempts.
  5. The service mesh, if present, enforces policies at the infrastructure layer — Istio's VirtualService adds mesh-level retry for gateway errors and connection failures (complementing application-level retry for dependency-specific transient errors), DestinationRule outlier detection ejects unhealthy pods from the load-balancer rotation, and mTLS is enforced automatically for all service-to-service connections within the cluster. The SSE path gets a timeout: 0s exception.

The startup sequence, with all five concerns visible:

// Full startup with outer-layer concerns in order

// 1. Secrets management — inject credentials before parseConfig()
await injectSecrets();           // AWS SM / Vault / Kubernetes file mount

// 2. Config validation + connection opening + bulkhead + circuit breaker creation
const deps = await createDeps(); // parseConfig() → Zod validation → per-dep agents + breakers

// 3. HTTP server with gateway-aware middleware
const app = express();
app.use(correlationId());        // read X-Request-ID forwarded by gateway
app.use(structuredLogger(deps)); // log with request ID for trace correlation
app.use(extractVerifiedClaims()); // read X-User-Id, X-User-Plan from gateway headers

// 4. MCP transport — registers per-session handlers
//    → feature flags resolved at initialize time using X-User-Plan
//    → tools call breaker.fire(() => withRetry(...)) for each external dependency
//    → bulkhead on searchAgent limits concurrent search calls
app.use('/mcp', createMcpHandler(deps));

// 5. Health endpoint — exempt from gateway auth for AliveMCP + LB probes
app.get('/healthz', healthHandler(deps));

// 6. Start listening — only after all of the above succeed
app.listen(deps.config.PORT, () => {
  deps.logger.info({ event: 'server_ready', port: deps.config.PORT });
});
// Service mesh: sidecars (Istio/Linkerd) are already running in the pod —
//   they intercept traffic at the network layer, no application code change needed

What AliveMCP Can and Cannot See

Running AliveMCP's external probe on top of this infrastructure gives you visibility into a set of failures that inner-cluster metrics cannot observe:

What AliveMCP cannot see without the health_check tool: the open/closed state of individual circuit breakers, bulkhead occupancy per dependency, retry rates and success rates per tool, and the state of per-tenant flag snapshots. Wire the health_check tool as a second probe target in the AliveMCP dashboard to surface application-layer state alongside the transport-layer probe.

Recommended Introduction Order

Not all five concerns need to be introduced at once. A progression that adds complexity only when the scale justifies it:

  1. Secrets management + config validation (day one) — before accepting any external traffic, credentials should not be in plain env vars. AWS Secrets Manager or a Kubernetes Secret file mount adds minimal complexity and eliminates the entire class of "credential in git" incidents. The Zod config schema validates everything in one place. This is the highest-value, lowest-overhead concern.
  2. API gateway (before public launch) — Caddy with automatic TLS handles the certificate management that is otherwise a recurring operational burden. JWT verification at the gateway means the application never needs to handle unauthenticated connections. The flush_interval -1 fix for SSE must be in place before the first client connects. Cost: a Caddyfile and an IAM policy for Caddy's ACME DNS challenge.
  3. Retry logic (when the first external dependency is added): the retry wrapper and error classification can be written as a utility function once and reused across all tool handlers. Idempotency key generation is cheapest to add before the first write operation goes to production; retrofitting it after a duplicate-write incident is more expensive.
  4. Bulkheads (when adding a second external dependency): a single external dependency does not need per-dependency pool isolation. When the second dependency is added, create per-dependency https.Agent instances. The cost is near zero; the blast-radius reduction is significant when either dependency degrades.
  5. Service mesh (when running multiple services or multiple replicas): a service mesh adds operational complexity (sidecar injection, CRD management, certificate issuance and rotation for mTLS). It pays off when you have multiple services calling the MCP server, need consistent retry/timeout policies without per-service application code, or need mTLS without implementing it yourself. Single-service, single-replica deployments do not need a mesh.

Further Reading

Each concern covered in this guide has a dedicated deep-dive with full code examples:

For the application-layer concerns that sit inside the process boundary:

For real-world failure data from the public MCP ecosystem: State of the MCP Registry — Q3 2026 covers 2,414 endpoints across six registries and documents the failure modes that production hardening prevents.