MCP server API gateway
An API gateway sits in front of your MCP server and handles cross-cutting concerns — TLS termination, authentication, rate limiting, request routing — without burdening the application layer. For MCP servers this boundary matters more than for typical REST APIs because MCP sessions are long-lived: a single SSE connection can persist for minutes or hours, which means gateway behaviour during connection establishment has lasting effects on every tool call that follows.
TL;DR
Use Caddy or Kong as the gateway layer. Terminate TLS at the gateway, not the Node.js process. Verify JWTs at the gateway before the request reaches the MCP server — reject early, log the rejection, never forward unauthenticated connections. Apply per-client rate limits at the gateway, keyed by the API key or client ID in the request header, so a misbehaving client does not affect others. Set flush_interval -1 on the SSE route to disable buffering — a buffering gateway breaks MCP streaming transport. Use AliveMCP to probe from outside the gateway so you detect both gateway failures and application failures independently.
What belongs in the gateway vs. the application
Deciding where to enforce a concern determines who sees the overhead and who can reason about it:
| Concern | Gateway | Application | Notes |
|---|---|---|---|
| TLS termination | Yes | No | Node.js can serve HTTPS itself, but terminating at the gateway centralises certificate management and keeps handshake work off the application process |
| JWT signature verification | Yes | Optionally | Gateway rejects bad tokens before the MCP server sees the connection; application may still extract claims |
| Per-client rate limiting | Yes | No (usually) | Gateway has the client identity before routing — application-layer rate limits add a second tier for per-tool limits |
| Request logging / access log | Yes | Yes | Gateway logs every request; application logs tool-level events |
| Tool-level authorisation | No | Yes | Gateways route on path and headers and do not normally parse the JSON-RPC body; the application layer can tell tools/call from tools/list |
| Business logic / tool execution | No | Yes | Always in application |
| Circuit breaking to upstream | Sometimes | Yes | Application-layer breakers know which dependency failed; gateway-layer breakers protect against application overload |
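The tool-level authorisation row can be made concrete with a short sketch. It assumes the plan claim has already been verified by the gateway; the plan names (`free`, `pro`) and tool names (`search_docs`, `run_report`) are hypothetical and only illustrate the shape of the check.

```typescript
// Hypothetical plan-to-tool allow-list; all names are illustrative.
type Plan = "free" | "pro";

interface JsonRpcRequest {
  jsonrpc: "2.0";
  id?: number | string;
  method: string;
  params?: { name?: string; [key: string]: unknown };
}

const allowedTools: Record<Plan, ReadonlySet<string>> = {
  free: new Set(["search_docs"]),
  pro: new Set(["search_docs", "run_report"]),
};

// Returns true when the request may proceed for the given plan.
// Only tools/call is restricted; other methods (e.g. tools/list) pass through.
function isAuthorized(req: JsonRpcRequest, plan: Plan): boolean {
  if (req.method !== "tools/call") return true;
  const tool = req.params?.name;
  return typeof tool === "string" && allowedTools[plan].has(tool);
}
```

A check like this needs the parsed JSON-RPC body, which is why it sits in the application rather than in the gateway.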
Caddy as a minimal MCP gateway
Caddy is the fastest path to a production-quality gateway for MCP servers. It handles TLS certificates automatically via ACME, and it streams SSE correctly once response buffering and compression are disabled on the stream route.
# Caddyfile: gateway in front of MCP server on :3000
alivemcp.com {
    # TLS: auto-managed via ACME

    # Compress everything except the MCP stream endpoints
    @compressible not path /sse /mcp/stream
    encode @compressible zstd gzip

    @mcp_stream path /sse /mcp/stream
    handle @mcp_stream {
        reverse_proxy localhost:3000 {
            flush_interval -1 # flush every write immediately (no buffering) for SSE
            header_up X-Forwarded-For {remote_host}
            header_up X-Request-ID {http.request.uuid}
        }
    }

    # Health probe endpoint: not rate limited, no auth
    handle /healthz {
        reverse_proxy localhost:3000
    }

    # All other routes: rate limited (JWT verification is covered in the next section)
    # rate_limit comes from the third-party caddy-ratelimit module, not core Caddy
    handle {
        rate_limit {
            zone dynamic {
                key {http.request.header.X-Api-Key}
                events 100
                window 60s
            }
        }
        reverse_proxy localhost:3000 {
            header_up X-Forwarded-For {remote_host}
            header_up X-Request-ID {http.request.uuid}
        }
    }
}
Note the flush_interval -1 setting inside the reverse_proxy block on the SSE path. Setting it explicitly guarantees that every write is flushed to the client immediately; buffered SSE frames reach MCP clients delayed or batched. The encode directive is scoped by the @compressible matcher so that the stream endpoints skip the compression middleware for the same reason; see MCP server compression for the full reasoning.
JWT verification at the gateway
Gateway-layer JWT verification rejects unauthenticated connections before they consume MCP server resources. For Caddy, the caddy-jwt plugin verifies RS256/ES256 tokens against a JWKS endpoint. For Kong, use the built-in jwt plugin.
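# Caddyfile: JWT verification via the caddy-jwt plugin
# Option names follow the ggicci/caddy-jwt README; verify them against the plugin
# version you build. The plugin may also need an `order jwtauth ...` global option.
alivemcp.com {
    @authenticated {
        not path /healthz /assets/*
    }
    handle @authenticated {
        jwtauth {
            sign_alg RS256
            jwk_url https://your-auth-provider.com/.well-known/jwks.json
            user_claims sub
            meta_claims "plan->plan"
        }
        reverse_proxy localhost:3000 {
            # Forward verified claims; header_up overwrites any client-supplied value
            header_up X-User-Id {http.auth.user.id}
            header_up X-User-Plan {http.auth.user.plan}
        }
    }
    # Health probes pass through unauthenticated
    handle /healthz {
        reverse_proxy localhost:3000
    }
}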
The verified claims (sub → X-User-Id, plan → X-User-Plan) are forwarded as request headers to the MCP server. The application layer reads them to set up request context without re-verifying the JWT signature, because the gateway already did that work.
On the MCP server side, read the forwarded headers in the request handler and make them available to tool handlers:
// server.ts: read gateway-forwarded claims
// Assumes `transport` is the session's Streamable HTTP transport from the MCP
// TypeScript SDK, created elsewhere; `authContext` is introduced for this sketch.
import { AsyncLocalStorage } from 'node:async_hooks';

const authContext = new AsyncLocalStorage<{ userId: string; userPlan?: string }>();

app.post('/mcp', async (req, res) => {
    const userId = req.headers['x-user-id'] as string | undefined;
    const userPlan = req.headers['x-user-plan'] as string | undefined;
    if (!userId) {
        res.status(401).json({ error: 'missing auth' });
        return;
    }
    // Run the MCP handler inside a context that tool handlers can read
    // via authContext.getStore()
    await authContext.run({ userId, userPlan }, () =>
        transport.handleRequest(req, res, req.body));
});
Per-client rate limiting at the gateway
Gateway-layer rate limits protect the MCP server from a single client consuming all capacity. Key the rate limit by client identity — API key, JWT subject, or IP — not by IP alone, because many legitimate clients may share an IP (NAT, office networks, CI runners).
For Kong, the rate-limiting-advanced plugin with Redis as the shared state store handles per-consumer limits across multiple gateway replicas:
# Kong plugin config (declarative)
plugins:
  - name: rate-limiting-advanced
    config:
      limit: [100]
      window_size: [60]
      identifier: consumer # key by authenticated consumer ID
      sync_rate: 1 # sync Redis every 1s for accuracy
      strategy: redis
      redis:
        host: redis.internal
        port: 6379
For application-layer per-tool rate limits (e.g., a specific tool that calls an expensive external API), see MCP server rate limiting. Gateway limits and application limits compose: the gateway enforces the outer budget; the application enforces per-tool inner budgets.
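To illustrate the inner, per-tool budget, here is a minimal sketch of a fixed-window counter keyed by client and tool. The class name, the key format, and the in-memory `Map` are assumptions made for this example; a multi-replica deployment would need shared state such as Redis, as in the Kong configuration above.

```typescript
// Fixed-window counter keyed by "<clientId>:<tool>". Hypothetical helper, not a library API.
class FixedWindowLimiter {
  private readonly counters = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the call is within budget. `now` is injectable so the logic is testable.
  allow(clientId: string, tool: string, now: number = Date.now()): boolean {
    const key = `${clientId}:${tool}`;
    const entry = this.counters.get(key);
    // Start a new window on first use or once the current window has elapsed.
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.counters.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Example: at most 5 calls per minute to an expensive tool, per client.
const expensiveToolLimiter = new FixedWindowLimiter(5, 60_000);
```

Because the key includes the client ID taken from the gateway-forwarded headers, one client exhausting its budget for a tool leaves other clients unaffected.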
Load balancing MCP sessions across replicas
SSE-based MCP transport is stateful: once a session is established, all tool calls for that session must reach the same replica. A gateway that load-balances without session affinity will route subsequent requests to different replicas, breaking the session.
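Caddy can hash the Mcp-Session-Id header so that every request carrying the same ID lands on the same replica. The first request of a session has no ID yet and is routed by the fallback policy, so confirm that the replica selected by the hash is the one holding the session, or keep session state in a shared store:
reverse_proxy localhost:3001 localhost:3002 localhost:3003 {
    lb_policy header Mcp-Session-Id # sticky by MCP session ID
    flush_interval -1
    health_uri /healthz
    health_interval 10s
}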
For stateless MCP (HTTP POST only, no SSE), round-robin works correctly because each request is independent. See MCP server load balancing for the full comparison. Stateless mode also simplifies the gateway configuration: no session affinity required, and the flush_interval directive is unnecessary.
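Header-based affinity boils down to hashing the session ID onto the list of replicas. The sketch below shows the idea with an FNV-1a hash and a modulo; both are choices made for this example, and the hashing inside any particular gateway is an implementation detail that may differ.

```typescript
// FNV-1a 32-bit hash, chosen here only for brevity.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Picks an upstream for a request. Without a session ID (the first request of a
// session) the choice is random, similar to a load balancer's fallback policy.
function pickUpstream(upstreams: string[], sessionId?: string): string {
  if (upstreams.length === 0) throw new Error("no upstreams configured");
  const index =
    sessionId === undefined
      ? Math.floor(Math.random() * upstreams.length)
      : fnv1a(sessionId) % upstreams.length;
  return upstreams[index];
}
```

The same session ID always maps to the same replica as long as the replica list is unchanged. With a plain modulo, adding or removing a replica remaps most sessions, which is one reason stateless mode is simpler to operate.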
Monitoring gateway health vs. application health
A gateway sits between the internet and your application. It can fail independently of the application — TLS certificate renewal error, misconfigured route, OOM kill of the gateway process. Probing only the application from inside the same host misses gateway failures.
The correct monitoring topology: probe from outside the gateway using an external monitor so the probe traverses the full request path (internet → gateway → application). AliveMCP probes your MCP server's initialize endpoint from external infrastructure, catching both gateway failures (probe can't connect) and application failures (probe connects but MCP handshake fails).
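The distinction between the two failure classes can be sketched as a small classifier over probe outcomes. The `ProbeOutcome` shape and the function name are hypothetical, and the sketch assumes the server answers `initialize` with a plain JSON body rather than an event stream; it does not describe how any particular monitoring product works internally.

```typescript
// Outcome of one external probe attempt. Shape is hypothetical, for illustration only.
type ProbeOutcome =
  | { kind: "connect_error"; message: string }
  | { kind: "response"; status: number; body: string };

type Diagnosis = "healthy" | "gateway_failure" | "application_failure";

function classifyProbe(outcome: ProbeOutcome): Diagnosis {
  // No connection at all: the gateway (or the network in front of it) is unreachable.
  if (outcome.kind === "connect_error") return "gateway_failure";
  // The gateway answered, so anything short of a valid handshake points at the application.
  if (outcome.status !== 200) return "application_failure";
  try {
    const parsed = JSON.parse(outcome.body);
    return parsed?.jsonrpc === "2.0" && parsed.result !== undefined
      ? "healthy"
      : "application_failure";
  } catch {
    return "application_failure";
  }
}
```

A probe that runs on the same host as the application can never produce the `gateway_failure` case, which is the argument for probing from outside.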
Expose two health endpoints with different semantics:
- /healthz: gateway-accessible, no auth required. Returns 200 if the application process is up, 503 if it is not yet ready (before app.listen) or draining. This is what the gateway load balancer polls.
- health_check MCP tool: covers full application-layer health, including database ping, Redis ping, circuit-breaker states, queue depth, and scheduler status. This is what AliveMCP calls as a synthetic tool probe.
The two-layer approach mirrors the infrastructure operations pattern where each concern has its own observability surface. See also the MCP Server Resilience Guide for how the gateway fits into the broader resilience stack.
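The /healthz semantics described above (200 when ready, 503 while starting or draining) can be sketched with a small lifecycle flag. The `Lifecycle` type and the function names are assumptions for this example; where the flag is flipped depends on how your server starts up and shuts down.

```typescript
import type { IncomingMessage, ServerResponse } from "node:http";

type Lifecycle = "starting" | "ready" | "draining";

// Current phase: set to "ready" once the server is listening, "draining" on shutdown.
let lifecycle: Lifecycle = "starting";

function setLifecycle(next: Lifecycle): void {
  lifecycle = next;
}

// Only a ready process reports 200; the load balancer stops routing on 503.
function healthzStatus(state: Lifecycle): 200 | 503 {
  return state === "ready" ? 200 : 503;
}

// Plain Node.js request handler for /healthz.
function handleHealthz(_req: IncomingMessage, res: ServerResponse): void {
  const status = healthzStatus(lifecycle);
  res.writeHead(status, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ status: status === 200 ? "ok" : lifecycle }));
}
```

Calling `setLifecycle("draining")` from a shutdown handler makes the gateway's health check fail first, so no new sessions are routed to a replica that is about to exit.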
Request ID propagation
Debugging a failed tool call requires correlating logs from the gateway and the application. The convention is an X-Request-ID header: the gateway generates or forwards a UUID per request, and the application includes it in every structured log line.
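// server.ts: read request ID from gateway header
import { AsyncLocalStorage } from 'node:async_hooks';
import { v4 as uuidv4 } from 'uuid';

// Per-request store; loggers read the ID via asyncLocalStorage.getStore()
const asyncLocalStorage = new AsyncLocalStorage<{ requestId: string }>();

app.use((req, res, next) => {
    // Reuse the gateway's ID, or generate one if the request bypassed the gateway
    const requestId = (req.headers['x-request-id'] as string | undefined) ?? uuidv4();
    // attach to async local storage so all logger calls in this request include it
    asyncLocalStorage.run({ requestId }, next);
});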
When AliveMCP alerts on a probe failure, the request ID from its probe is logged at the gateway and the application simultaneously. You can grep both log sources with the same ID to reconstruct exactly what happened during the failed probe attempt.
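On the logging side, a structured logger can pick the request ID out of the async context so that call sites never pass it explicitly. This is a minimal, self-contained sketch; the `requestContext` store and the `formatLog` and `log` helpers are names invented for the example, not part of any logging library.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface RequestContext {
  requestId: string;
}

// Store populated by the request middleware (hypothetical name for this sketch).
const requestContext = new AsyncLocalStorage<RequestContext>();

// Builds one JSON log line, adding the request ID from the current async context if present.
function formatLog(
  level: "info" | "error",
  message: string,
  fields: Record<string, unknown> = {},
): string {
  const requestId = requestContext.getStore()?.requestId;
  return JSON.stringify({ level, message, ...(requestId ? { requestId } : {}), ...fields });
}

function log(level: "info" | "error", message: string, fields?: Record<string, unknown>): void {
  console.log(formatLog(level, message, fields));
}
```

Any `log` call made while handling a request then carries the same `requestId` value that the gateway wrote to its access log, which is what makes a single grep across both sources possible.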