Infrastructure guide · 2026-06-03 · Production operations

MCP Server Resilience and Configurability Guide: Config Validation, Feature Flags, Circuit Breakers, and Compression

The infrastructure operations guide established the Deps object as the backbone of production MCP servers: database pool, cache, queue, logger, and config all created once at startup and flowed into tool handlers as typed parameters. That backbone handles the structure of resources. It does not, on its own, handle what happens when those resources are misconfigured at boot, when a downstream API starts failing mid-operation, when you need to change the tool surface without restarting the server, or when large tool responses are choking a slow client connection. Four operational maturity concerns close those gaps: config validation, feature flags, circuit breakers, and compression. This guide covers them as a system — each one extends the Deps backbone rather than standing alone.

TL;DR

Config validation happens first inside createDeps() — call parseConfig() before opening any connections. A Zod schema throws with named-variable error messages on any missing or malformed value; the process exits before app.listen. AliveMCP sees this as a probe failure immediately, which is the right signal.
Feature flags have three evaluation points, not one — infrastructure flags at startup (what connections to open), tool-registration flags at initialize time per session (which tools the session can call), and behaviour flags per tool call (how a registered tool operates). Using the wrong evaluation point is the root cause of clients that cache a tool list and then call tools that no longer exist.
One circuit breaker per external dependency in createDeps() — bulkhead isolation means a broken external API opens its breaker without affecting tools that only touch the database. Return isError: true from the fallback immediately; no timeout wait. Expose circuit state in a health_check tool so AliveMCP can see beyond the transport layer.
Compress HTTP POST responses; exempt the SSE GET path — a buffering compressor on the SSE path delays every server-to-client notification. One filter function on the Express compression middleware fixes this. Set a 1 KB threshold to skip small JSON responses where compression overhead exceeds savings.

The Startup Sequence as a Layered Contract

Before covering each concern individually, it helps to see how they compose in the startup sequence. The four concerns each occupy a distinct position in the lifecycle:

// startup order — each layer depends on the one above it
parseConfig()                          // 1. validate all env vars → AppConfig
createConnections(config)              //    open db, cache using validated config
createBreakers()                       // 2. wrap connections with fault tolerance
app.use(compression({ filter: sseExempt })) // 3. HTTP transport: compress POSTs, skip SSE
app.listen(config.PORT, onReady)       //    server ready

// per session (inside onConnect / handleInitialize):
const flags = await flagsForSession(deps, session.id)   // 4. flag snapshot for this session
registerToolsForSession(server, deps, session, flags)   //    build tool surface

Config validation runs first because every subsequent step depends on it. Circuit breakers are created alongside the connections they wrap. Compression wraps the HTTP layer before the first request arrives. Feature flags resolve per session because different sessions — or different tenants — may have different tool surfaces. If any step fails, the failure is loud and immediate: parseConfig() throws, the process exits, and AliveMCP's probe sees a connection failure before any client has been served.

This ordering matters because it rules out partial starts caused by bad configuration. A server that exits during config validation never becomes half-initialised: no connections open, no tools registered, no sessions established. The probe failure is clean.

Config Validation — Fail Before Accepting Connections

MCP server configuration management rests on one rule: all environment variables are validated before any connection is opened. The most common failure mode is not missing config — it is undetected misconfiguration that allows the server to start in a degraded state. A missing Redis URL that silently disables the rate limiter. An API key with a typo that produces authenticated-but-unauthorised errors on the first tool call. A database URL with a wrong password that hangs Pool.connect() without throwing.

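Zod schema validation inside createDeps() catches missing and malformed values, and the connection checks that follow it catch values that are well-formed but wrong (a bad password, an unreachable host). Both run before app.listen: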

// config.ts
import { z } from 'zod';

const configSchema = z.object({
  PORT: z.coerce.number().int().min(1).max(65535).default(3000),
  NODE_ENV: z.enum(['development', 'test', 'production']).default('development'),
  DATABASE_URL: z.string().url('DATABASE_URL must be a valid connection string'),
  REDIS_URL: z.string().url().optional(),
  API_SECRET: z.string().min(32, 'API_SECRET must be at least 32 characters'),
  // Circuit breaker tuning — in config so they can be adjusted per deployment
  CB_ERROR_THRESHOLD: z.coerce.number().int().min(1).max(100).default(50),
  CB_RESET_TIMEOUT_MS: z.coerce.number().int().min(1000).default(30000),
  // Feature flag source
  ENABLED_FEATURES: z.string().default(''),
  REDIS_FLAG_PREFIX: z.string().default('flags:'),
});

export type AppConfig = z.infer<typeof configSchema>;

export function parseConfig(): AppConfig {
  const result = configSchema.safeParse(process.env);
  if (!result.success) {
    const errors = result.error.issues
      .map(i => `  ${i.path.join('.')}: ${i.message}`)
      .join('\n');
    throw new Error(`Configuration error — fix before starting:\n${errors}`);
  }
  return result.data;
}

Notice that circuit breaker thresholds (CB_ERROR_THRESHOLD, CB_RESET_TIMEOUT_MS) and feature flag config (ENABLED_FEATURES, REDIS_FLAG_PREFIX) are part of the same schema. This is deliberate: the config schema is the single source of truth for everything that shapes the server's runtime behaviour, including the other three concerns in this guide.

The createDeps() function uses the validated config throughout:

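// deps.ts
import { Pool } from 'pg';   // node-postgres; assumed from the Pool options used here
import Redis from 'ioredis'; // assumed from the Redis options used here
import { parseConfig } from './config';

export async function createDeps(): Promise<Deps> {
  const config = parseConfig(); // throws on misconfiguration, so the process exits

  const db = new Pool({
    connectionString: config.DATABASE_URL,
    max: 10,
    connectionTimeoutMillis: 5000, // without a timeout, an unreachable host hangs instead of failing
  });
  await db.query('SELECT 1'); // fail fast: rejects on bad credentials or a connect timeout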

  const cache = config.REDIS_URL
    ? new Redis(config.REDIS_URL, { maxRetriesPerRequest: 3 })
    : null;
  if (cache) await cache.ping();

  const logger = buildLogger(config.NODE_ENV);
  logConfigSummary(config, logger); // log redacted summary — never log config object

  return { config, db, cache, logger };
}

function logConfigSummary(config: AppConfig, logger: Logger) {
  logger.info({
    event: 'config_loaded',
    port: config.PORT,
    database: config.DATABASE_URL.replace(/:\/\/[^@]+@/, '://***@'),
    redis: config.REDIS_URL ? 'configured' : 'disabled',
    api_secret: `[${config.API_SECRET.length} chars]`,
    features: config.ENABLED_FEATURES || '(none)',
  });
}

The redacted summary logs what was configured (port, database host, presence of optional connections) without logging the credentials themselves. Passing the raw config object to a logger is the classic secret-in-logs incident that shows up in post-mortems. Treat it as a rule at every logger call site: log the summary, never the config object.
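
One way to make that rule hard to break is an allow-list: build the loggable view only from keys you have explicitly marked safe, so a secret added to the schema later is dropped by default. The sketch below is an illustration rather than part of the guide's codebase. The names toLoggable and maskUrlPassword and the two key lists are hypothetical, and it assumes credentials inside connection strings are percent-encoded.

```typescript
// Hypothetical helper: builds a log-safe view of config from an explicit allow-list.
type ConfigLike = Record<string, string | number | undefined>;

// Keys whose values are safe to log verbatim (assumed list).
const PLAIN_KEYS = ['PORT', 'NODE_ENV', 'ENABLED_FEATURES'];
// Keys that hold connection strings with embedded credentials (assumed list).
const URL_KEYS = ['DATABASE_URL', 'REDIS_URL'];

// Mask the password in scheme://user:password@host/... and keep everything else.
// Assumes '@' and '/' inside credentials are percent-encoded.
function maskUrlPassword(url: string): string {
  return url.replace(
    /^([a-z][a-z0-9+.-]*:\/\/)([^:@\/]*)(:[^@\/]*)?@/i,
    (_match: string, scheme: string, user: string, pass: string | undefined) =>
      `${scheme}${user}${pass ? ':***' : ''}@`,
  );
}

function toLoggable(config: ConfigLike): Record<string, string | number> {
  const out: Record<string, string | number> = {};
  for (const key of PLAIN_KEYS) {
    const value = config[key];
    if (value !== undefined) out[key] = value;
  }
  for (const key of URL_KEYS) {
    const value = config[key];
    out[key] = typeof value === 'string' ? maskUrlPassword(value) : 'disabled';
  }
  // Anything not listed above (API_SECRET, future secrets) is never copied.
  return out;
}
```

The design choice is that forgetting to classify a new key fails safe: it is missing from the log rather than leaked into it.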

Dynamic config reload for non-structural settings (rate limits, log verbosity, flag state) can be added without changing the startup contract. The key boundary is: settings that affect what connections are open (DATABASE_URL, REDIS_URL, PORT) require a restart. Settings that affect how existing connections behave (timeouts, thresholds, flags) can be refreshed from Redis or a config file at runtime without restarting. Keep this boundary explicit in your schema — it is the difference between a config change that requires a deploy and one that can be applied through a feature flag update.
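
A minimal sketch of making that boundary explicit, assuming two hand-maintained key lists kept next to the schema. The function planConfigChange and both lists are hypothetical. The point is that every changed key is classified as restart-required, live-refreshable, or unclassified before anything is applied.

```typescript
// Hypothetical classification of settings, following the boundary described above.
const RESTART_REQUIRED = ['DATABASE_URL', 'REDIS_URL', 'PORT'];
const RUNTIME_REFRESHABLE = ['CB_ERROR_THRESHOLD', 'CB_RESET_TIMEOUT_MS', 'ENABLED_FEATURES'];

type Settings = Record<string, string | number | undefined>;

interface ChangePlan {
  restart: string[]; // changed keys that need a new process
  live: string[];    // changed keys that can be applied at runtime
  unknown: string[]; // changed keys nobody has classified yet
}

function planConfigChange(current: Settings, next: Settings): ChangePlan {
  const plan: ChangePlan = { restart: [], live: [], unknown: [] };
  const keys = Object.keys({ ...current, ...next }).sort();
  for (const key of keys) {
    if (current[key] === next[key]) continue; // unchanged
    if (RESTART_REQUIRED.includes(key)) plan.restart.push(key);
    else if (RUNTIME_REFRESHABLE.includes(key)) plan.live.push(key);
    else plan.unknown.push(key); // force an explicit decision
  }
  return plan;
}
```

A reload path would apply the change only when restart and unknown are both empty, and otherwise report which keys need a deploy.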

Feature Flags — Three Evaluation Points, Not One

Feature flags in MCP servers work differently from feature flags in web applications. A web request is stateless: a flag that gates a UI component is re-evaluated on every render and affects only that response. An MCP server exposes a tool surface that clients cache. Agents typically call tools/list once, right after initialize, and cache the result for the lifetime of the session. If you change which tools are registered after that point, the agent is operating against a stale schema.

This produces the three evaluation points:

Flag category	Evaluation point	What it controls	Where the flag is read from
Infrastructure flags	Process startup — inside `createDeps()`	Which connections are opened	`config.REDIS_URL`, `config.ENABLED_FEATURES` parsed from env
Tool-registration flags	`initialize` — once per session	Which tools the session can call	`deps.config.ENABLED_FEATURES` or Redis hash at session start
Behaviour flags	Per tool call	How a registered tool operates	Redis `GET` on each call, or local in-memory cache with TTL

Infrastructure flags are already handled by the config schema — REDIS_URL being present or absent determines whether createDeps() opens a Redis connection. That decision is made once at process start and cannot change without a restart. The remaining two categories require explicit handling.

Tool-registration flags at initialize time: read the flag snapshot at session start, pass it to the tool registration function, and never re-evaluate it for the lifetime of that session. The snapshot is the session's contract with the client:

// flagStore.ts
export async function flagsForSession(
  deps: Deps,
  sessionId: string
): Promise<Set<string>> {
  // Single-tenant: static set from env
  if (!deps.cache) {
    return new Set(
      deps.config.ENABLED_FEATURES
        .split(',').map(s => s.trim()).filter(Boolean)
    );
  }
  // Multi-tenant or runtime-mutable: read per-session flags from Redis
  const tenantId = await getTenantIdForSession(deps, sessionId);
  const flagKeys = await deps.cache.hgetall(
    `${deps.config.REDIS_FLAG_PREFIX}tenant:${tenantId}`
  );
  return new Set(
    Object.entries(flagKeys ?? {})
      .filter(([, v]) => v === 'true')
      .map(([k]) => k)
  );
}

// In your MCP session handler:
server.on('connect', async (session) => {
  const flags = await flagsForSession(deps, session.id);
  registerToolsForSession(server, deps, session, flags);
});

function registerToolsForSession(
  server: McpServer,
  deps: Deps,
  session: Session,
  flags: Set<string>
) {
  // Core tools — always registered
  registerSearchTools(server, deps);
  registerStatusTools(server, deps);

  // Flagged tools — registered only if the session's snapshot includes the flag
  if (flags.has('pdf_export')) {
    registerPdfExportTools(server, deps);
  }
  if (flags.has('semantic_search')) {
    registerSemanticSearchTools(server, deps);
  }
}
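
The "resolve once, never re-evaluate" rule can be enforced in code rather than by convention. The sketch below is one possible shape, not the guide's implementation. The class SessionFlagSnapshots and its resolver parameter are hypothetical; in the listing above the resolver would be a call to flagsForSession.

```typescript
// Hypothetical wrapper that pins one flag snapshot per session.
type FlagResolver = (sessionId: string) => Promise<ReadonlySet<string>>;

class SessionFlagSnapshots {
  private readonly snapshots = new Map<string, Promise<ReadonlySet<string>>>();
  private readonly resolve: FlagResolver;

  constructor(resolve: FlagResolver) {
    this.resolve = resolve;
  }

  // The first call for a session resolves flags; later calls reuse that result,
  // even if the underlying flag source has changed in the meantime.
  forSession(sessionId: string): Promise<ReadonlySet<string>> {
    let snapshot = this.snapshots.get(sessionId);
    if (!snapshot) {
      snapshot = this.resolve(sessionId);
      this.snapshots.set(sessionId, snapshot);
      // Do not pin a failed lookup: let the next call retry.
      snapshot.catch(() => { this.snapshots.delete(sessionId); });
    }
    return snapshot;
  }

  // Call when the session closes so the map does not grow without bound.
  release(sessionId: string): void {
    this.snapshots.delete(sessionId);
  }
}
```

Storing the promise rather than the resolved set means two concurrent lookups for the same session share one read of the flag source.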

Behaviour flags per call: evaluate inside the tool handler on each invocation. These flags affect how a registered tool behaves, not whether it exists. A client that cached the tool list is unaffected by a behaviour flag changing between calls — the tool is still there, it just operates differently:

server.tool('search', SearchInputSchema, async (input) => {
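  // Behaviour flag: evaluated per call, not at session start.
  // Built from the validated REDIS_FLAG_PREFIX rather than a hard-coded 'flags:' prefix.
  const useSemanticSearch = deps.cache
    ? (await deps.cache.get(`${deps.config.REDIS_FLAG_PREFIX}use_semantic_search`)) === 'true'
    : false;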

  const results = useSemanticSearch
    ? await deps.embeddingSearch(input.query)
    : await deps.db.query('SELECT * FROM items WHERE content ILIKE $1', [`%${input.query}%`]);

  return { content: [{ type: 'text', text: JSON.stringify(results.rows) }] };
});
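
The table above lists a local in-memory cache with a TTL as the alternative to a Redis GET on every call. A minimal sketch of that option follows. CachedFlags and its loader parameter are hypothetical names, and the clock is injectable only so the behaviour can be tested.

```typescript
// Hypothetical TTL cache for behaviour flags.
type FlagLoader = (name: string) => Promise<boolean>;

class CachedFlags {
  private readonly entries = new Map<string, { value: boolean; expiresAt: number }>();
  private readonly load: FlagLoader;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(load: FlagLoader, ttlMs: number, now: () => number = Date.now) {
    this.load = load;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async isEnabled(name: string): Promise<boolean> {
    const hit = this.entries.get(name);
    if (hit && hit.expiresAt > this.now()) return hit.value; // fresh: no round trip
    const value = await this.load(name);                      // stale or missing: reload
    this.entries.set(name, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}
```

In the handler above, the loader would wrap the existing deps.cache.get call and compare the result with 'true'. The trade-off is that a flag flip can take up to ttlMs to reach every process, so pick a TTL you are comfortable waiting out during an incident.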

AliveMCP's tools/list probe detects when a flag change silently changes the tool count. If a tool-registration flag is flipped mid-deployment and new sessions get a different tool surface than existing ones, the tool count the probe sees drifts from its baseline. That is not an outage, but it is a signal worth investigating before agents start failing with "unknown tool" errors.
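
If you also keep your own baseline, comparing tool names is more informative than comparing counts, because a swap of one tool for another leaves the count unchanged. A small sketch, with diffToolSurface and the tool names as hypothetical examples:

```typescript
// Hypothetical comparison of a stored baseline against a fresh tools/list result.
interface SurfaceDiff {
  added: string[];   // tools present now but not in the baseline
  removed: string[]; // tools in the baseline that are no longer offered
}

function diffToolSurface(baseline: string[], current: string[]): SurfaceDiff {
  const before = new Set(baseline);
  const after = new Set(current);
  return {
    added: current.filter((name) => !before.has(name)).sort(),
    removed: baseline.filter((name) => !after.has(name)).sort(),
  };
}

// Same count (3 and 3), different surface.
const drift = diffToolSurface(
  ['search', 'status', 'pdf_export'],
  ['search', 'status', 'semantic_search'],
);
console.log(drift);
```

An empty removed list is the property that protects existing clients: anything in it is a tool some cached tool list may still try to call.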

Circuit Breakers — One per Dependency, Wired in `createDeps()`

Without a circuit breaker, a failing external API produces slow cascading failures: every tool call that touches the API waits for the full timeout (5–30 seconds) before returning an error. Concurrent sessions accumulate. The MCP server looks sick, with high latency and connection pile-up, for reasons entirely outside its own code. The circuit breaker pattern short-circuits this: once the failure rate over the recent window crosses the configured threshold, the breaker opens and all subsequent calls fail immediately with an explicit error, with no timeout wait.
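
To make the state machine concrete, here is a deliberately simplified breaker. It is not how opossum is implemented: it is synchronous, has no call timeout, and counts calls since the last close instead of using a rolling window. MiniBreaker and its option names are illustrative, chosen to echo the options used later in this section.

```typescript
// Simplified illustration of the closed -> open -> halfOpen cycle.
type BreakerState = 'closed' | 'open' | 'halfOpen';

interface MiniBreakerOptions {
  errorThresholdPercentage: number; // open when failures / calls reaches this percentage
  volumeThreshold: number;          // minimum calls before the percentage is considered
  resetTimeoutMs: number;           // how long to stay open before allowing a trial call
}

class MiniBreaker {
  state: BreakerState = 'closed';
  private calls = 0;
  private failures = 0;
  private openedAt = 0;
  private readonly opts: MiniBreakerOptions;
  private readonly now: () => number;

  constructor(opts: MiniBreakerOptions, now: () => number = Date.now) {
    this.opts = opts;
    this.now = now;
  }

  fire<T>(action: () => T): T {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.opts.resetTimeoutMs) {
        throw new Error('circuit open'); // fail fast: the action is never invoked
      }
      this.state = 'halfOpen'; // reset timeout elapsed: allow one trial call
    }
    try {
      const result = action();
      if (this.state === 'halfOpen') this.close(); // trial succeeded
      else this.calls += 1;
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  private recordFailure(): void {
    if (this.state === 'halfOpen') return this.open(); // trial failed: back to open
    this.calls += 1;
    this.failures += 1;
    const pct = (this.failures / this.calls) * 100;
    if (this.calls >= this.opts.volumeThreshold && pct >= this.opts.errorThresholdPercentage) {
      this.open();
    }
  }

  private open(): void { this.state = 'open'; this.openedAt = this.now(); }
  private close(): void { this.state = 'closed'; this.calls = 0; this.failures = 0; }
}
```

The property to notice is in fire(): while the breaker is open, the wrapped action is not called at all, which is what removes the timeout wait from every subsequent tool call.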

Breakers belong in createDeps(), created alongside the connections they protect. One breaker per external dependency is the bulkhead isolation rule: a broken search API opens the search API breaker without affecting tools that only use the database pool:

// deps.ts (extended from config section above)
import CircuitBreaker from 'opossum';

export interface Deps {
  config: AppConfig;
  db: Pool;
  cache: Redis | null;
  logger: Logger;
  breakers: {
    searchApi: CircuitBreaker;
    notificationApi: CircuitBreaker;
  };
}

async function callSearchApi(query: string): Promise<SearchResult[]> {
  const res = await fetch(`https://search.internal/v2/search?q=${encodeURIComponent(query)}`, {
    signal: AbortSignal.timeout(5000),
  });
  if (!res.ok) throw new Error(`Search API ${res.status}`);
  return res.json();
}

async function sendNotification(payload: NotificationPayload): Promise<void> {
  const res = await fetch('https://notify.internal/v1/send', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }, // the body is JSON, so declare it
    body: JSON.stringify(payload),
    signal: AbortSignal.timeout(3000),
  });
  if (!res.ok) throw new Error(`Notification API ${res.status}`);
}

function createBreakers(config: AppConfig, logger: Logger): Deps['breakers'] {
  const opts = {
    errorThresholdPercentage: config.CB_ERROR_THRESHOLD,
    timeout: 5000,
    resetTimeout: config.CB_RESET_TIMEOUT_MS,
    volumeThreshold: 5,
  };

  const searchApi = new CircuitBreaker(callSearchApi, { ...opts, name: 'search-api' });
  const notificationApi = new CircuitBreaker(sendNotification, { ...opts, name: 'notification-api' });

  // Log state transitions for observability
  for (const [name, breaker] of [['search-api', searchApi], ['notification-api', notificationApi]] as const) {
    breaker.on('open',     () => logger.warn({ circuit: name }, 'circuit opened'));
    breaker.on('halfOpen', () => logger.info({ circuit: name }, 'circuit half-open, probing'));
    breaker.on('close',    () => logger.info({ circuit: name }, 'circuit closed, recovered'));
  }

  return { searchApi, notificationApi };
}

export async function createDeps(): Promise<Deps> {
  const config = parseConfig();
  const db = new Pool({ connectionString: config.DATABASE_URL });
  await db.query('SELECT 1');
  const cache = config.REDIS_URL ? new Redis(config.REDIS_URL) : null;
  const logger = buildLogger(config.NODE_ENV);
  // The logger is passed in explicitly: it is created here, not at module scope
  const breakers = createBreakers(config, logger);
  return { config, db, cache, logger, breakers };
}

Note that CB_ERROR_THRESHOLD and CB_RESET_TIMEOUT_MS come from config — the Zod schema defined in the first section. Circuit breaker thresholds are not constants to hardcode; they are deployment-specific tuning parameters. A search API with a known 10-second degradation window needs a longer reset timeout than an in-datacenter cache.

In tool handlers, call through the breaker rather than the raw function:

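// Register the fallback once, outside the per-call path. opossum runs it whenever
// fire() fails, including the immediate rejection while the circuit is open.
const searchUnavailable = {
  isError: true,
  content: [{ type: 'text' as const, text: 'Search API is temporarily unavailable; try again in 30 seconds' }],
};
deps.breakers.searchApi.fallback(() => searchUnavailable);

server.tool('search', SearchInputSchema, async (input) => {
  try {
    const results = await deps.breakers.searchApi.fire(input.query);
    // With a fallback registered, fire() resolves with the fallback value on failure,
    // so return it as-is instead of wrapping it as a successful result.
    if (results === searchUnavailable) return searchUnavailable;
    return { content: [{ type: 'text', text: JSON.stringify(results) }] };
  } catch (err) {
    // Reached only if the fallback itself throws
    return { isError: true, content: [{ type: 'text', text: String(err) }] };
  }
});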

The fallback returns isError: true immediately — no waiting for a timeout. The agent receives an explicit "unavailable" signal rather than a hung tool call. LLM reasoning handles "try again in 30 seconds" cleanly; it handles a 30-second hang followed by a generic network error much less cleanly.

The health_check tool exposes circuit state for monitoring beyond the transport layer. AliveMCP confirms that initialize succeeds and tools/list returns correctly — the transport layer is up. It cannot confirm whether the search API breaker is currently open. The health_check tool bridges the gap:

server.tool('health_check', {}, async () => {
  const searchCircuit = deps.breakers.searchApi;
  const notifyCircuit = deps.breakers.notificationApi;

  const [dbResult, cacheResult] = await Promise.allSettled([
    deps.db.query('SELECT 1').then(() => ({ ok: true })),
    deps.cache?.ping().then(() => ({ ok: true })) ?? Promise.resolve({ ok: true }),
  ]);

  const status = {
    db: dbResult.status === 'fulfilled' ? 'ok' : 'error',
    cache: cacheResult.status === 'fulfilled' ? 'ok' : 'error',
    search_api: searchCircuit.opened ? 'circuit_open' : searchCircuit.halfOpen ? 'half_open' : 'ok',
    notification_api: notifyCircuit.opened ? 'circuit_open' : notifyCircuit.halfOpen ? 'half_open' : 'ok',
    search_api_stats: {
      failures: searchCircuit.stats.failures,
      successes: searchCircuit.stats.successes,
      rejects: searchCircuit.stats.rejects,
    },
  };

  const isError = status.db === 'error' || status.cache === 'error'
    || status.search_api === 'circuit_open' || status.notification_api === 'circuit_open';

  return { isError, content: [{ type: 'text', text: JSON.stringify(status) }] };
});

Configure AliveMCP to run a second probe type that calls health_check after the standard initialize probe. A server that passes the transport-layer probe but returns isError: true from health_check is degraded at the application layer — a distinction that matters for on-call response. A dead transport is a restart; an open circuit is "the downstream is down, not us".
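
The two-level distinction can be captured in a few lines on the monitoring side. This is a sketch of the decision only, not an AliveMCP feature. The function triageProbe is hypothetical, and it assumes the JSON status shape returned by the health_check tool above.

```typescript
// Hypothetical triage of a transport probe plus a health_check payload.
type Verdict = 'healthy' | 'degraded' | 'down';

interface Triage {
  verdict: Verdict;
  failing: string[]; // dependency names reporting 'error' or 'circuit_open'
}

function triageProbe(transportOk: boolean, healthText: string): Triage {
  if (!transportOk) {
    return { verdict: 'down', failing: [] }; // nothing answered: restart territory
  }
  const status = JSON.parse(healthText) as Record<string, unknown>;
  const failing = Object.keys(status).filter(
    (name) => status[name] === 'error' || status[name] === 'circuit_open',
  );
  // Transport is up; any failing dependency means degraded, not down.
  return { verdict: failing.length > 0 ? 'degraded' : 'healthy', failing };
}
```

Routing 'down' and 'degraded' to different alerts keeps the on-call response proportionate: the first is about this server, the second names the downstream to look at.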

Compression — Compact the POST Path, Exempt the SSE Path

MCP server compression has one non-negotiable constraint: the SSE GET endpoint must not be compressed by a buffering compressor. A standard gzip middleware compresses by buffering output until a flush threshold is met, then flushing a compressed chunk. On an SSE stream where each event is 50–200 bytes, the compressor buffers many events before flushing. The client sees long silences followed by bursts — the latency profile of a broken streaming connection, not a working one. Tools that stream intermediate results appear to hang.

The fix is a single filter function on the Express compression middleware:

import express from 'express';
import compression from 'compression';
import type { Request, Response } from 'express';

const app = express();

app.use(compression({
  threshold: 1024, // skip responses under 1 KB — compression overhead exceeds savings
  level: 6,        // gzip level 6 is the right dynamic-response tradeoff
  filter: (req: Request, res: Response) => {
    const contentType = res.getHeader('Content-Type') as string | undefined;
    if (contentType?.includes('text/event-stream')) {
      return false; // never compress SSE — buffering compressor breaks streaming
    }
    return compression.filter(req, res); // default filter for everything else
  },
}));

// Register MCP transport AFTER compression middleware
app.post('/mcp', mcpTransportHandler);  // compressed (JSON tool responses)
app.get('/mcp',  sseTransportHandler);  // NOT compressed (SSE stream, filter returns false)

The 1 KB threshold is important. Short tool responses (a boolean status, a single integer, a brief confirmation) gain nothing from gzip: the format adds a 10-byte header and an 8-byte trailer, and a payload that small has almost no redundancy to remove, so a 50-byte response typically comes out larger than it went in. Apply compression where it saves bandwidth: large JSON arrays (search results, document lists), prose content (summaries, document extracts), and structured data with repetitive keys.
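
The size argument is easy to check with Node's built-in zlib. The sketch below gzips a tiny and a large JSON payload at level 6 and reports both sizes. The payloads are made up, and exact byte counts depend on the zlib build, so none are quoted here.

```typescript
import { gzipSync } from 'node:zlib';

// Compare raw and gzipped sizes for a JSON-serialisable payload.
function gzipSizes(payload: unknown): { raw: number; gzipped: number } {
  const raw = Buffer.from(JSON.stringify(payload));
  return { raw: raw.length, gzipped: gzipSync(raw, { level: 6 }).length };
}

// Sample payloads, invented for illustration.
const tiny = gzipSizes({ ok: true });
const large = gzipSizes(
  Array.from({ length: 200 }, (_, i) => ({ id: i, status: 'active', title: `Document ${i}` })),
);

console.log(tiny);  // gzipped is larger than raw
console.log(large); // gzipped is a small fraction of raw
```

Running this against a sample of your own tool responses is a reasonable way to decide whether 1 KB is the right threshold for your traffic.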

Static assets: pre-compress at build time with Brotli. Runtime compression has to finish within the request, which is why the Express middleware above runs gzip at a moderate level. Brotli at quality 11 produces significantly smaller output but is too slow to run per request. Pre-compressing static assets (your frontend JS bundle, CSS, any large JSON datasets) produces .br files that the server serves directly without runtime CPU overhead:

# Build step: pre-compress all static assets
# -print0 / -0 keep filenames with spaces or quotes intact
find public/ -type f \( -name "*.js" -o -name "*.css" -o -name "*.json" \) -print0 | \
  xargs -0 -P4 -n1 brotli --best --keep

# Caddy can serve the resulting .br files when file_server is configured with
# `precompressed br` and the client sends Accept-Encoding: br
# No Express middleware needed for static assets if Caddy handles them

Caddy as an alternative: if Caddy is already your reverse proxy (recommended for MCP servers because of its flush_interval -1 SSE support), you can centralise compression at the proxy layer instead of in Express. The SSE exemption is an explicit route matcher:

# Caddyfile: compression at proxy layer with SSE exemption
alivemcp.com {
  # A named matcher with several conditions needs a block; the conditions are ANDed
  @sse {
    path /mcp
    method GET
  }

  handle @sse {
    # no encode directive: do not compress SSE
    reverse_proxy localhost:3000 {
      flush_interval -1        # no buffering, required for SSE
    }
  }

  handle {
    encode zstd gzip           # compress all other responses
    reverse_proxy localhost:3000
  }
}

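Choose one approach, Express middleware or Caddy, not both. Compressing twice buys nothing: a well-behaved second layer skips responses that already carry Content-Encoding, and one that does not skip them re-compresses data that no longer shrinks, spending CPU for output that is no smaller. If Caddy is already in the stack for sticky session routing (from the infrastructure operations guide), centralising compression there reduces application-layer complexity.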

How the Four Concerns Interact

Each concern is a separable addition to the Deps backbone, but they interact in ways that are worth understanding explicitly.

Config and circuit breakers share the same schema. The circuit breaker thresholds (CB_ERROR_THRESHOLD, CB_RESET_TIMEOUT_MS) live in the Zod config schema. This means they are validated at startup alongside all other config, can be changed between deployments without code changes, and can be added to the redacted startup summary, since they are not secrets. A circuit breaker tuned with hardcoded constants is harder to adjust for an API whose reliability characteristics vary by environment.

Config and feature flags share the same flag source boundary. Infrastructure flags (which connections to open) are in the config schema; they require a restart to change. Tool-registration flags (which tools a session can call) extend the config with runtime mutability via Redis. The Redis connection itself is an infrastructure flag — if REDIS_URL is absent, the server falls back to env-var flags, which are static. The two-tier model (static env-var flags for simple deployments, Redis-backed flags for runtime mutability) degrades gracefully.

Circuit breakers and feature flags both express graceful degradation. When the search API breaker opens, the tool still exists — the client's cached tool list is unchanged — but calling it returns isError: true immediately. When a tool-registration flag is absent, the tool is never registered — the client's tool list is smaller from the start. Both are forms of intentional capability reduction. The breaker is reactive (triggered by failure rate); the flag is proactive (configured by the operator). Together they cover both planned and unplanned degradation.

Compression and circuit breakers affect the same latency observable. A buffering compressor on the SSE path produces events that arrive in delayed bursts — similar to what a slow external API dependency produces before the circuit opens. If you diagnose "SSE stream is delivering results in chunks instead of streaming" and the circuit breakers are all closed, the issue is almost certainly the compression middleware not exempting the SSE path. If the circuit breaker for the search API has been open for 10 minutes, the issue is the external API. Knowing both patterns prevents misdiagnosis.

The Complete `createDeps()` with All Four Concerns

The full startup function combining config validation, circuit breakers, and the inputs needed for feature flags:

// deps.ts — full startup with all four concerns integrated
export async function createDeps(): Promise<Deps> {
  // 1. Config validation — throws on any missing or malformed env var
  const config = parseConfig();

  // 2. Connections — using validated config values
  const db = new Pool({
    connectionString: config.DATABASE_URL,
    max: 10,
    connectionTimeoutMillis: 5000, // otherwise an unreachable host hangs the startup check
  });
  await db.query('SELECT 1'); // fail fast before app.listen

  const cache = config.REDIS_URL
    ? new Redis(config.REDIS_URL, { maxRetriesPerRequest: 3, enableReadyCheck: true })
    : null;
  if (cache) await cache.ping();

  const logger = buildLogger(config.NODE_ENV);
  logConfigSummary(config, logger);

  // 3. Circuit breakers — one per external dependency, thresholds from config
  const breakerOpts = {
    errorThresholdPercentage: config.CB_ERROR_THRESHOLD,
    timeout: 5000,
    resetTimeout: config.CB_RESET_TIMEOUT_MS,
    volumeThreshold: 5,
  };
  const breakers = {
    searchApi: new CircuitBreaker(callSearchApi, { ...breakerOpts, name: 'search-api' }),
    notificationApi: new CircuitBreaker(sendNotification, { ...breakerOpts, name: 'notification-api' }),
  };
  for (const [name, b] of Object.entries(breakers)) {
    b.on('open',     () => logger.warn({ circuit: name }, 'circuit opened'));
    b.on('halfOpen', () => logger.info({ circuit: name }, 'circuit half-open'));
    b.on('close',    () => logger.info({ circuit: name }, 'circuit closed'));
  }

  return { config, db, cache, logger, breakers };
}

// server.ts — compression registered before MCP transport
const deps = await createDeps();

const app = express();
app.use(compression({
  threshold: 1024,
  filter: (req, res) => {
    const ct = res.getHeader('Content-Type') as string | undefined;
    return ct?.includes('text/event-stream') ? false : compression.filter(req, res);
  },
}));

app.post('/mcp', createMcpPostHandler(deps));
app.get('/mcp',  createMcpSseHandler(deps));

app.listen(deps.config.PORT, () => {
  deps.logger.info({ port: deps.config.PORT }, 'server ready');
});

Feature flags are resolved per session, not at this level. The deps object carries everything needed to resolve them — deps.config.ENABLED_FEATURES for static flags, deps.cache for Redis-backed flags — but the resolution itself happens inside the initialize handler where the session context (tenant ID, session ID) is available.

Monitoring the Full Stack

External uptime monitoring confirms the transport layer: that the HTTP server is reachable, that initialize completes, that tools/list returns the expected tool count. This is necessary but not sufficient. The four concerns in this guide each produce failure modes that a transport-layer probe either misses or can report but not explain:

Config failure: the process exits before app.listen — this is visible to probes (connection refused) but the cause (bad env var) is in the process logs, not the probe result. Configure your probe to alert on connection refused and check structured logs for Configuration error events.
Feature flag bug: tool count changes unexpectedly. AliveMCP's tools/list probe detects this via tool-count drift — not an outage, but a signal that a flag change did not produce the intended tool surface. Configure a baseline tool count alert.
Open circuit breaker: the transport passes but the application layer is degraded. The health_check tool bridges this gap with explicit circuit state reporting and an isError: true aggregate response.
Compression misconfiguration: SSE events arrive in delayed bursts. This produces tool call latency that looks like a slow downstream rather than a transport configuration bug. Diagnose by checking whether the SSE path returns Content-Encoding: gzip in response headers — it should not.
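
The header check in the last item can be automated. The sketch below takes the response headers of the SSE GET endpoint as a plain object and returns any problems it finds. The function sseHeaderProblems is hypothetical, and how you obtain the headers (curl, fetch, a probe script) is left open.

```typescript
// Hypothetical check of the response headers returned by the SSE GET endpoint.
function sseHeaderProblems(headers: Record<string, string>): string[] {
  // Header names are case-insensitive, so normalise before looking them up.
  const lower: Record<string, string> = {};
  for (const name of Object.keys(headers)) lower[name.toLowerCase()] = headers[name];

  const problems: string[] = [];
  if (!(lower['content-type'] ?? '').includes('text/event-stream')) {
    problems.push('Content-Type is not text/event-stream');
  }
  const encoding = lower['content-encoding'];
  if (encoding !== undefined && encoding.toLowerCase() !== 'identity') {
    problems.push(`SSE response is compressed (Content-Encoding: ${encoding})`);
  }
  return problems;
}
```

An empty result means the stream is at least not being compressed in transit; it says nothing about buffering elsewhere in the path, which still has to be checked at the proxy.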

Together, transport-layer probes from AliveMCP and application-layer probes via the health_check tool give you full coverage: the transport is up or down (AliveMCP), and the application layer is healthy or degraded and why (health_check). The four operational concerns in this guide each contribute to one or both monitoring surfaces — which is how you know they are in place and working.

What to Add First

For a server that already has the Deps backbone from the infrastructure operations guide, the recommended introduction order follows the startup sequence:

Config validation first — add Zod to the config parsing immediately. The only cost is adding the schema; the benefit is that every misconfiguration becomes a named error at startup rather than an undefined behaviour mid-operation.
Circuit breakers second — as soon as the server has any external API dependency (not just the database pool, which is already in createDeps()). One breaker per external API, thresholds in the config schema.
Compression third — once the server is receiving real traffic and you want to reduce bandwidth. One middleware with one filter function. The SSE exemption is the only MCP-specific consideration.
Feature flags last — when you have a concrete need to vary the tool surface between environments, tenants, or deployment stages without a restart. Flag infrastructure built before you need it becomes dead code; built for a specific use case, it becomes the right abstraction at the right time.

Config validation is the only one that has no downside to adding on day one. The others are worth deferring until the need is clear — but once you need them, the Deps pattern makes them easy to add without refactoring anything that already works.