MCP server API gateway

An API gateway sits in front of your MCP server and handles cross-cutting concerns — TLS termination, authentication, rate limiting, request routing — without burdening the application layer. For MCP servers this boundary matters more than for typical REST APIs because MCP sessions are long-lived: a single SSE connection can persist for minutes or hours, which means gateway behaviour during connection establishment has lasting effects on every tool call that follows.

TL;DR

Use Caddy or Kong as the gateway layer. Terminate TLS at the gateway, not in the Node.js process. Verify JWTs at the gateway before the request reaches the MCP server: reject early, log the rejection, and never forward unauthenticated connections. Apply per-client rate limits at the gateway, keyed by the API key or client ID in the request header, so that a misbehaving client does not affect others. Set flush_interval -1 inside the reverse_proxy block for the SSE route to disable buffering, because a buffering gateway breaks MCP streaming transport. Use AliveMCP to probe from outside the gateway so you can tell gateway failures apart from application failures.

What belongs in the gateway vs. the application

Deciding where to enforce a concern determines who sees the overhead and who can reason about it:

Concern	Gateway	Application	Notes
TLS termination	Yes	No	Node.js can serve HTTPS, but terminating at the gateway keeps certificate issuance and renewal in one place
JWT signature verification	Yes	Optionally	Gateway rejects bad tokens before the MCP server sees the connection; application may still extract claims
Per-client rate limiting	Yes	No (usually)	Gateway has the client identity before routing — application-layer rate limits add a second tier for per-tool limits
Request logging / access log	Yes	Yes	Gateway logs every request; application logs tool-level events
Tool-level authorisation	No	Yes	A gateway does not normally parse JSON-RPC request bodies, so only the application layer can tell `tools/call` from `tools/list`
Business logic / tool execution	No	Yes	Always in application
Circuit breaking to upstream	Sometimes	Yes	Application-layer breakers know which dependency failed; gateway-layer breakers protect against application overload

Caddy as a minimal MCP gateway

Caddy is a quick path to a production-quality gateway for MCP servers. It obtains and renews TLS certificates automatically via ACME, and its reverse proxy streams SSE correctly once buffering and compression are disabled on the stream route.

# Caddyfile: gateway in front of MCP server on :3000
alivemcp.com {
  # TLS: auto-managed via ACME

  @mcp_stream path /sse /mcp/stream
  @not_stream not path /sse /mcp/stream

  # Compress everything except the SSE stream, which must not be buffered
  encode @not_stream zstd gzip

  handle @mcp_stream {
    reverse_proxy localhost:3000 {
      flush_interval -1           # flush each write immediately for SSE
      header_up X-Forwarded-For {remote_host}
      header_up X-Request-ID    {http.request.uuid}
    }
  }

  # Health probe endpoint: not rate limited, no auth
  handle /healthz {
    reverse_proxy localhost:3000
  }

  # All other routes: rate limited per API key
  # (rate_limit is provided by a plugin such as mholt/caddy-ratelimit, not by stock Caddy)
  handle {
    rate_limit {
      zone dynamic {
        key     {http.request.header.X-Api-Key}
        events  100
        window  60s
      }
    }
    reverse_proxy localhost:3000 {
      header_up X-Forwarded-For {remote_host}
      header_up X-Request-ID    {http.request.uuid}
    }
  }
}

Note that flush_interval -1 is set inside the reverse_proxy block on the SSE path. Without it, Caddy may buffer SSE frames before forwarding them, so MCP clients receive delayed or batched events. The encode directive is scoped with the @not_stream matcher so that SSE connections skip the compression middleware for the same reason; see MCP server compression for the full reasoning.

JWT verification at the gateway

Gateway-layer JWT verification rejects unauthenticated connections before they consume MCP server resources. For Caddy, the caddy-jwt plugin verifies RS256/ES256 tokens against a JWKS endpoint. For Kong, use the bundled jwt plugin.

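# Caddyfile: JWT verification via the caddy-jwt plugin (jwtauth directive)
# Option names below follow the ggicci/caddy-jwt README; check them against
# the plugin version you build with.
alivemcp.com {
  @authenticated {
    not path /healthz /assets/*
  }
  handle @authenticated {
    jwtauth {
      sign_alg    RS256
      jwk_url     https://your-auth-provider.com/.well-known/jwks.json
      user_claims sub              # exposed as {http.auth.user.id}
      meta_claims "plan->plan"     # exposed as {http.auth.user.plan}
    }
    reverse_proxy localhost:3000 {
      # Overwrite any client-supplied values with the verified claims
      header_up X-User-Id   {http.auth.user.id}
      header_up X-User-Plan {http.auth.user.plan}
    }
  }

  # Health probes pass through unauthenticated
  handle /healthz {
    reverse_proxy localhost:3000
  }
}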

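The verified claims (sub → X-User-Id, plan → X-User-Plan) are forwarded as request headers to the MCP server. The application layer reads them in the initialize handler to set up per-session context without re-verifying the JWT signature, because the gateway has already done that work. This is only safe if the MCP server is reachable solely through the gateway: a client that can reach port 3000 directly could set these headers itself.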

On the MCP server side, read the forwarded headers in the request handler and store them in the session context:

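// server.ts: read gateway-forwarded claims
// Sketch: `app`, `mcpServer` and the per-session `transport` are created elsewhere.
interface SessionContext { userId: string; userPlan?: string }
const sessionContexts = new Map<string, SessionContext>(); // keyed by MCP session ID

app.post('/mcp', async (req, res) => {
  const userId = req.headers['x-user-id'] as string | undefined;
  const userPlan = req.headers['x-user-plan'] as string | undefined;
  if (!userId) {
    // the gateway should never forward a request without verified claims
    res.status(401).json({ error: 'missing auth' });
    return;
  }
  // connect() resolves without a value, so keep the claims in our own map
  // where tool handlers can look them up by session ID
  await mcpServer.connect(transport);
  if (transport.sessionId) {
    sessionContexts.set(transport.sessionId, { userId, userPlan });
  }
});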
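
Once the claims are attached to the session, a tool handler can use them for the tool-level authorisation that the gateway cannot perform. The following is a minimal sketch, assuming the plan claim decides which tools a session may call; the plan names, tool names, and the `isAuthorised` helper are hypothetical and not part of any MCP SDK.

```typescript
// Hypothetical plan-to-tool mapping; plan and tool names are placeholders.
const TOOLS_BY_PLAN = new Map<string, ReadonlySet<string>>([
  ['free', new Set(['search_docs'])],
  ['pro', new Set(['search_docs', 'run_report'])],
]);

// Only the fields this check needs from an MCP JSON-RPC request.
interface JsonRpcRequest {
  jsonrpc: '2.0';
  id?: string | number;
  method: string;
  params?: { name?: string };
}

/** Returns true if a session on the given plan may send this request. */
function isAuthorised(request: JsonRpcRequest, userPlan: string | undefined): boolean {
  // Only tool invocations are restricted; listing and lifecycle methods pass.
  if (request.method !== 'tools/call') return true;

  const toolName = request.params?.name;
  if (toolName === undefined || userPlan === undefined) return false;

  // Unknown plans and unknown tools are denied by default.
  return TOOLS_BY_PLAN.get(userPlan)?.has(toolName) ?? false;
}
```

Denying by default means a new tool stays unavailable until it is added to a plan explicitly, which is usually the safer failure mode.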

Per-client rate limiting at the gateway

Gateway-layer rate limits protect the MCP server from a single client consuming all capacity. Key the rate limit by client identity (API key or JWT subject, with the IP address as a fallback) rather than by IP alone, because many legitimate clients may share an IP (NAT, office networks, CI runners).

For Kong, the rate-limiting-advanced plugin (part of Kong Enterprise) with Redis as the shared state store handles per-consumer limits across multiple gateway replicas:

# Kong plugin config (declarative)
plugins:
  - name: rate-limiting-advanced
    config:
      limit: [100]
      window_size: [60]
      identifier: consumer         # key by authenticated consumer ID
      sync_rate: 1                 # sync Redis every 1s for accuracy
      strategy: redis
      redis:
        host: redis.internal
        port: 6379

For application-layer per-tool rate limits (e.g., a specific tool that calls an expensive external API), see MCP server rate limiting. Gateway limits and application limits compose: the gateway enforces the outer budget; the application enforces per-tool inner budgets.
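
The inner, per-tool budget can be sketched as a small fixed-window limiter inside the application. This is an illustration under stated assumptions: `createToolRateLimiter`, the tool name `expensive_search`, and the limits are hypothetical, and the in-memory map only works for a single replica (several replicas would need a shared store such as Redis).

```typescript
// One fixed window per (client, tool) pair.
type RateWindow = { startedAt: number; count: number };

function createToolRateLimiter(limits: ReadonlyMap<string, number>, windowMs: number) {
  const windows = new Map<string, RateWindow>();

  /** Returns true and records the call if it fits the budget, false otherwise. */
  return function allow(clientId: string, tool: string, now: number = Date.now()): boolean {
    const limit = limits.get(tool);
    if (limit === undefined) return true; // no inner budget: the gateway limit still applies

    const key = JSON.stringify([clientId, tool]);
    const current = windows.get(key);

    // Start a new window on the first call or once the old window has expired.
    if (current === undefined || now - current.startedAt >= windowMs) {
      windows.set(key, { startedAt: now, count: 1 });
      return true;
    }
    if (current.count >= limit) return false;
    current.count += 1;
    return true;
  };
}

// Example: at most 2 calls per minute to a hypothetical expensive tool.
const allowToolCall = createToolRateLimiter(new Map([['expensive_search', 2]]), 60_000);
```

A tool handler would call `allowToolCall(userId, toolName)` before doing any work and return a tool error when it yields false. The map in this sketch is never pruned, so a long-running server would also need to evict expired windows.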

Load balancing MCP sessions across replicas

SSE-based MCP transport is stateful: once a session is established, all tool calls for that session must reach the same replica. A gateway that load-balances without session affinity will route subsequent requests to different replicas, breaking the session.

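Caddy sticky routing by session header:

reverse_proxy localhost:3001 localhost:3002 localhost:3003 {
  lb_policy header Mcp-Session-Id   # sticky by MCP session ID
  flush_interval -1
  health_uri      /healthz          # active health check endpoint
  health_interval 10s
}

The header policy picks a replica by hashing the header value, and a request without the header goes to a random replica. The first request of a session carries no Mcp-Session-Id, so it may be served by a different replica than the one the ID later hashes to. Verify that follow-up requests reach the replica that issued the session ID, or share session state between replicas.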

For stateless MCP (HTTP POST only, no SSE), round-robin works correctly because each request is independent. See MCP server load balancing for the full comparison. Stateless mode also simplifies the gateway configuration: no session affinity required, and the flush_interval directive is unnecessary.
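
To see why hashing a session header keeps a session on one replica, the routing decision can be sketched as a hash of the session ID reduced modulo the replica count. This is an illustration only: the FNV-1a hash, the `pickReplica` name, and the replica addresses are assumptions, not Caddy's internal implementation.

```typescript
// FNV-1a over UTF-16 code units; a stand-in hash for illustration.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash >>> 0;
}

/** Picks a replica: deterministic for a known session, random otherwise. */
function pickReplica(sessionId: string | undefined, replicas: readonly string[]): string {
  if (replicas.length === 0) throw new Error('no replicas available');
  if (sessionId === undefined) {
    // No session yet (for example the first request): any replica will do.
    return replicas[Math.floor(Math.random() * replicas.length)];
  }
  return replicas[fnv1a(sessionId) % replicas.length];
}

const replicas = ['localhost:3001', 'localhost:3002', 'localhost:3003'];
const first = pickReplica('session-abc', replicas);
const second = pickReplica('session-abc', replicas);
// first === second: the same session ID always maps to the same replica
```

In this modulo-based sketch, adding or removing a replica changes the mapping for some session IDs, which is one reason to drain sessions before scaling a stateful deployment.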

Monitoring gateway health vs. application health

A gateway sits between the internet and your application. It can fail independently of the application — TLS certificate renewal error, misconfigured route, OOM kill of the gateway process. Probing only the application from inside the same host misses gateway failures.

The correct monitoring topology: probe from outside the gateway using an external monitor so the probe traverses the full request path (internet → gateway → application). AliveMCP probes your MCP server's initialize endpoint from external infrastructure, catching both gateway failures (probe can't connect) and application failures (probe connects but MCP handshake fails).
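
The distinction between the two failure classes can be sketched as a classification over the outcome of an external initialize probe. This is a hypothetical illustration of the idea, not AliveMCP's implementation: the `ProbeOutcome` shape and the `diagnose` function are assumptions, and the check for a `protocolVersion` string in the JSON-RPC result is a simplified stand-in for a full MCP handshake.

```typescript
// What an external probe observed, reduced to two cases.
type ProbeOutcome =
  | { kind: 'connect_error'; reason: string }             // DNS, TCP, or TLS failure
  | { kind: 'response'; status: number; body: unknown };  // something answered over HTTP

type Diagnosis = 'healthy' | 'gateway_failure' | 'application_failure';

/** True if the body looks like a successful JSON-RPC initialize result. */
function isInitializeResult(body: unknown): boolean {
  if (typeof body !== 'object' || body === null) return false;
  const result = (body as { result?: unknown }).result;
  return (
    typeof result === 'object' &&
    result !== null &&
    typeof (result as { protocolVersion?: unknown }).protocolVersion === 'string'
  );
}

function diagnose(outcome: ProbeOutcome): Diagnosis {
  // The probe never got through: the gateway (or the path to it) is the suspect.
  if (outcome.kind === 'connect_error') return 'gateway_failure';
  // The probe connected and the handshake completed.
  if (outcome.status === 200 && isInitializeResult(outcome.body)) return 'healthy';
  // The probe connected but the handshake did not complete.
  return 'application_failure';
}
```

A real monitor would likely refine the last branch, since some HTTP statuses (a 404 from a misconfigured route, for instance) point back at the gateway rather than the application.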

Expose two health endpoints with different semantics:

/healthz — gateway-accessible, no auth required. Returns 200 if the application process is up, 503 if it is not yet ready (before app.listen) or draining. This is what the gateway load balancer polls.
health_check MCP tool — full application-layer health: database ping, Redis ping, circuit-breaker states, queue depth, scheduler status. This is what AliveMCP calls as a synthetic tool probe.
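
The two endpoints can be sketched as two small functions: one maps the process lifecycle to a status code for /healthz, the other runs dependency checks for the health_check tool. This is a minimal sketch; the `Lifecycle` states, `healthzStatus`, `runHealthCheck`, and the check names are assumptions chosen for illustration.

```typescript
type Lifecycle = 'starting' | 'ready' | 'draining';

/** Status code for /healthz: 200 only while the process accepts new work. */
function healthzStatus(state: Lifecycle): 200 | 503 {
  return state === 'ready' ? 200 : 503;
}

// A dependency check resolves when healthy and rejects otherwise.
type DependencyCheck = () => Promise<void>;

interface HealthReport {
  healthy: boolean;
  checks: Record<string, string>; // 'ok' or 'failed: <reason>' per dependency
}

/** Body of a health_check tool: runs every check and reports each result. */
async function runHealthCheck(checks: Record<string, DependencyCheck>): Promise<HealthReport> {
  const report: HealthReport = { healthy: true, checks: {} };
  // Run all checks even if one fails, so the report shows the full picture.
  await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await check();
        report.checks[name] = 'ok';
      } catch (error) {
        report.healthy = false;
        report.checks[name] = `failed: ${error instanceof Error ? error.message : String(error)}`;
      }
    }),
  );
  return report;
}
```

Keeping /healthz free of dependency checks means a slow database does not cause the load balancer to eject an otherwise working replica, while the tool-level report still surfaces the problem to the external probe.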

The two-layer approach mirrors the infrastructure operations pattern where each concern has its own observability surface. See also the MCP Server Resilience Guide for how the gateway fits into the broader resilience stack.

Request ID propagation

Debugging a failed tool call requires correlating logs from the gateway and the application. The convention is an X-Request-ID header: the gateway generates or forwards a UUID per request, and the application includes it in every structured log line.

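// server.ts: read request ID from gateway header
import { AsyncLocalStorage } from 'node:async_hooks';
import { v4 as uuidv4 } from 'uuid';

// Per-request store; loggers read asyncLocalStorage.getStore()?.requestId
const asyncLocalStorage = new AsyncLocalStorage<{ requestId: string }>();

app.use((req, res, next) => {
  // fall back to a fresh UUID when the request did not pass through the gateway
  const requestId = req.header('x-request-id') ?? uuidv4();
  // attach to async local storage so all logger calls in this request include it
  asyncLocalStorage.run({ requestId }, next);
});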

When AliveMCP alerts on a probe failure, the request ID assigned to the probe request appears in both the gateway access log and the application log. You can grep both log sources for the same ID to reconstruct exactly what happened during the failed probe attempt.
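
For that grep to work, every application log line has to carry the ID. A minimal sketch of a log formatter that reads it from async local storage follows; the `requestContext` store and `formatLog` function are hypothetical names, and a real service would more likely wire this into its logging library.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Per-request store, populated by middleware for each incoming request.
const requestContext = new AsyncLocalStorage<{ requestId: string }>();

/** Builds one structured log line that always includes the current request ID. */
function formatLog(
  level: 'info' | 'error',
  message: string,
  fields: Record<string, unknown> = {},
): string {
  // Outside a request (startup, timers) there is no store to read from.
  const requestId = requestContext.getStore()?.requestId ?? 'unknown';
  // Spread fields first so they cannot overwrite level, requestId, or message.
  return JSON.stringify({ ...fields, level, requestId, message });
}

// Example: everything logged inside run() is tagged with the same ID.
const exampleLine = requestContext.run({ requestId: 'req-123' }, () =>
  formatLog('info', 'tool call finished', { tool: 'search_docs' }),
);
```

Because the store follows the asynchronous call chain, tool handlers and helpers can log without receiving the request ID as a parameter.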