MCP server load balancing
The MCP StreamableHTTP transport uses two kinds of HTTP connection per session: a POST for each tool call (request/response) and a long-lived GET SSE stream for server-to-client notifications. Both carry the same mcp-session-id header, and the server-side session state (the in-memory McpServer instance, the registered tools, the AsyncLocalStorage context) lives in one process. Round-robin load balancing can route the POST to one backend and the GET SSE request to a different one; the second backend has no record of the session, so the connection fails. Load-balancing MCP servers therefore requires either sticky sessions (route all requests for a session to the same backend) or a stateless design (eliminate SSE entirely so every request is independently routable).
TL;DR
There are two approaches. Sticky sessions: route all requests that carry the same mcp-session-id header to the same backend instance, using cookie-based or header-based affinity. Stateless mode: disable the SSE stream on the transport so that every POST is self-contained and round-robin friendly. Stateless mode works for clients that don't require server-initiated notifications. Use protocol-aware health checks (/healthz returning JSON with a status field) rather than TCP checks for upstream probes. A probe sent through the load balancer reaches only one backend per request; for full coverage, add one AliveMCP monitor per backend instance alongside the monitor on the load balancer address.
Why round-robin breaks MCP sessions
An MCP session has three phases. In the initialize phase, the client sends a POST to establish the session and receives an mcp-session-id in the response. In the tool-call phase, the client sends POSTs using that session ID, and the server routes them to the in-memory McpServer instance that owns the session. In the notification phase, the client opens a GET SSE stream so the server can push notifications back.
With naive round-robin across three backends A, B, C:
- initialize POST → backend A. Session created in A's memory; mcp-session-id: sess-xyz returned.
- Next tool call POST (same session ID) → backend B. B has no session with that ID → JSON-RPC error or silent failure.
- GET SSE stream → backend C. C has no session → connection immediately closed.
The fix is to ensure all requests for a session ID reach the same backend.
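The failure sequence above can be reproduced with a small in-memory model. The sketch below is illustrative only: the Backend type, the roundRobin picker, and the route helper are hypothetical stand-ins for real servers and a real balancer, and none of it is MCP SDK code.

```typescript
// Minimal in-memory model of three backends behind a round-robin balancer.
// All names here are hypothetical; nothing below is MCP SDK code.
type Backend = { name: string; sessions: Set<string> };

function makeBackends(names: string[]): Backend[] {
  return names.map((name) => ({ name, sessions: new Set<string>() }));
}

// Returns a picker that cycles through the backends in order.
function roundRobin(backends: Backend[]): () => Backend {
  let next = 0;
  return () => backends[next++ % backends.length];
}

// A request succeeds only if the chosen backend already holds the session.
function route(pick: () => Backend, sessionId: string): string {
  const backend = pick();
  return backend.sessions.has(sessionId)
    ? `${backend.name}: ok`
    : `${backend.name}: unknown session`;
}

const backends = makeBackends(['A', 'B', 'C']);
const pick = roundRobin(backends);

// initialize lands on the first backend, which stores the session.
const owner = pick();
owner.sessions.add('sess-xyz');

const toolCall = route(pick, 'sess-xyz');  // second pick: backend B
const sseStream = route(pick, 'sess-xyz'); // third pick: backend C
console.log(owner.name, toolCall, sseStream);
```

In this model only the fourth request would return to backend A, which is why affinity has to be keyed on the session rather than left to request order.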
Sticky sessions with Caddy
Caddy supports header-based load balancing through lb_policy header. Route on the mcp-session-id header so all requests for a session go to the same upstream:
api.yourdomain.com {
    reverse_proxy /mcp* backend1:3000 backend2:3000 backend3:3000 {
        lb_policy header mcp-session-id
        health_uri /healthz
        health_interval 10s
        health_timeout 5s
        flush_interval -1 # required for SSE streaming
    }
}
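lb_policy header mcp-session-id hashes the header value and maps it consistently to one backend. Requests without the header (the initial initialize POST has no session ID yet) fall through to Caddy's fallback policy, which is random selection by default. The initialize response includes the session ID, and every subsequent request from that client carries it, so all later requests hash to the same backend. One caveat: the hash depends only on the session ID, so the backend it selects is not necessarily the one that handled initialize. Header hashing is therefore safe only when a backend can create or restore a session for an ID it has not seen before. If it cannot, use an affinity mechanism that remembers the backend chosen for the first request, such as Caddy's lb_policy cookie, and confirm that your MCP clients persist cookies.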
The critical addition is flush_interval -1, which tells Caddy to flush the response buffer immediately. Without it, SSE frames buffer in Caddy's response writer and the client receives batched updates instead of a real-time stream — or nothing at all if Caddy's buffer never fills.
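To make the idea of hashing a header onto a backend concrete, here is a minimal sketch using rendezvous (highest-random-weight) hashing with an FNV-1a string hash. This is an assumption-laden illustration, not Caddy's actual implementation: the fnv1a and pickBackend functions are hypothetical, and the backend names simply mirror the config above.

```typescript
// FNV-1a: a small, non-cryptographic 32-bit string hash.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value unsigned
  }
  return hash;
}

// Rendezvous hashing: score every backend against the session ID and pick
// the highest score. The same ID always wins on the same backend, and
// removing a different backend does not move the session.
function pickBackend(sessionId: string, backends: string[]): string {
  if (backends.length === 0) throw new Error('no backends available');
  let best = backends[0];
  let bestScore = -1;
  for (const backend of backends) {
    const score = fnv1a(`${backend}|${sessionId}`);
    if (score > bestScore) {
      best = backend;
      bestScore = score;
    }
  }
  return best;
}

const upstreams = ['backend1:3000', 'backend2:3000', 'backend3:3000'];
const first = pickBackend('sess-xyz', upstreams);
const second = pickBackend('sess-xyz', upstreams);
console.log(first === second); // the mapping is deterministic
```

Note what the function takes as input: only the session ID and the backend list. Nothing records which backend actually created the session, which is the property to keep in mind when choosing between header hashing and cookie-based affinity.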
Sticky sessions with nginx
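nginx open source can hash on an arbitrary request header with the upstream hash directive (hash $http_mcp_session_id consistent;), subject to the same caveat as any header hash: the backend selected by the hash is not necessarily the one that handled initialize. Two other common options are IP-based affinity and cookie-based affinity: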
# Option 1: ip_hash — consistent routing by client IP
upstream mcp_backends {
    ip_hash;
    server backend1:3000;
    server backend2:3000;
    server backend3:3000;
    keepalive 32;
}

server {
    location /mcp {
        proxy_pass http://mcp_backends;
        proxy_http_version 1.1;
        proxy_set_header Connection '';   # keepalive
        proxy_buffering off;              # required for SSE streaming
        proxy_read_timeout 3600s;         # SSE connections are long-lived
        proxy_set_header X-Real-IP $remote_addr;
    }
}

# Option 2: sticky cookie (nginx Plus / OpenResty / lua-resty-balancer)
# On nginx Plus:
upstream mcp_backends {
    server backend1:3000;
    server backend2:3000;
    sticky cookie srv_id expires=1h domain=.yourdomain.com httponly;
}

# On OpenResty with lua-resty-balancer:
# set $backend based on ngx.req.get_headers()["mcp-session-id"]
IP hash is simpler but breaks when a client's IP changes mid-session (e.g., on mobile networks). For MCP clients connecting from fixed-IP infrastructure (agent platforms, CI systems), IP hash is usually adequate. For browser-based or mobile clients, a sticky cookie or hashing on the mcp-session-id header is more reliable.
Stateless mode — horizontal scaling without sticky sessions
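If your MCP server does not need to push notifications to clients (most tool-serving servers don't), you can eliminate SSE entirely by running the transport without sessions and with plain JSON responses. In the official TypeScript SDK that corresponds to sessionIdGenerator: undefined and enableJsonResponse: true; confirm both option names against the SDK version you have installed. Every request then becomes a stateless POST:
// server.ts: stateless mode
import express from 'express';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';

// authMiddleware, rateLimitMiddleware, registerAllTools and deps are
// application code defined elsewhere.
const app = express();
app.use(express.json());

app.post('/mcp', authMiddleware, rateLimitMiddleware, async (req, res) => {
  // A fresh server and transport per request: nothing survives between calls.
  const server = new McpServer({ name: 'my-server', version: '1.0.0' });
  registerAllTools(server, deps);
  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: undefined, // stateless: no mcp-session-id is issued
    enableJsonResponse: true,      // reply with JSON, no long-lived SSE stream
  });
  // Release both objects once the response has finished.
  res.on('close', () => { transport.close(); server.close(); });
  await server.connect(transport);
  await transport.handleRequest(req, res, req.body);
});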
In stateless mode, each POST /mcp creates a short-lived server instance that handles exactly one request and closes. There is no session ID correlation between requests from the same client. The load balancer can round-robin freely. The tradeoff: the server cannot push notifications, progress events, or log streams to the client via SSE. Clients that need to receive in-session notifications must poll via tool calls instead.
Stateless mode is ideal for MCP servers that expose read-only tools over a shared database — search, lookup, summarize — where each tool call is independently meaningful and no session continuity is needed.
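Where a stateless server still needs to report progress, the polling pattern mentioned above might look like the following sketch. Everything here is hypothetical: MemoryJobStore stands in for a shared database or cache that every backend can reach, and startJob, reportProgress, and getJobStatus model the bodies of tool handlers rather than real SDK registrations.

```typescript
type Job = { id: string; progress: number; done: boolean };

// Stand-in for a store shared by all backends (a database or cache in a
// real deployment). Hypothetical; not part of the MCP SDK.
class MemoryJobStore {
  private jobs = new Map<string, Job>();

  get(id: string): Job | undefined {
    const job = this.jobs.get(id);
    return job ? { ...job } : undefined;
  }

  set(job: Job): void {
    this.jobs.set(job.id, { ...job });
  }
}

// Tool handler body: create a job and return immediately.
function startJob(store: MemoryJobStore, id: string): Job {
  const job: Job = { id, progress: 0, done: false };
  store.set(job);
  return job;
}

// Called by whatever does the work; progress is clamped to 0..100.
function reportProgress(store: MemoryJobStore, id: string, progress: number): void {
  const clamped = Math.max(0, Math.min(100, progress));
  store.set({ id, progress: clamped, done: clamped === 100 });
}

// Tool handler body: the client polls this instead of listening on SSE.
function getJobStatus(store: MemoryJobStore, id: string): Job | { error: string } {
  return store.get(id) ?? { error: `unknown job ${id}` };
}

const store = new MemoryJobStore();
startJob(store, 'job-1');
reportProgress(store, 'job-1', 40);
const midway = getJobStatus(store, 'job-1');
reportProgress(store, 'job-1', 100);
const finished = getJobStatus(store, 'job-1');
console.log(midway, finished);
```

Because each handler reads and writes only the shared store, any backend can answer any poll, which is what keeps the design compatible with round-robin routing.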
Kubernetes with session affinity
In Kubernetes with a standard nginx Ingress controller, configure session affinity on the Ingress resource:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "mcp-backend"
    nginx.ingress.kubernetes.io/session-cookie-expires: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-buffering: "off" # required for SSE
spec:
  rules:
    - host: api.yourdomain.com
      http:
        paths:
          - path: /mcp
            pathType: Prefix
            backend:
              service:
                name: mcp-service
                port:
                  number: 3000
The Ingress controller sets a sticky cookie on the first response and routes subsequent requests that present that cookie to the same pod, so this approach depends on the MCP client storing and returning cookies. The proxy-buffering: "off" annotation is the ingress-nginx equivalent of proxy_buffering off in a plain nginx config. Without it, SSE frames buffer in the Ingress controller and clients see delayed or missing streaming events.
Health checks for load-balanced MCP backends
TCP-level health checks (a simple SYN/ACK test) confirm the port is open but not that the MCP server is healthy. An HTTP health check on /healthz that returns the startup state is better:
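// server.ts: health endpoint
import express from 'express';

const app = express();
let ready = false;          // flips to true once the server is listening
let isShuttingDown = false; // flips to true when SIGTERM arrives
// Live sessions keyed by mcp-session-id; populated by the /mcp route (not shown).
const activeSessions = new Map<string, unknown>();

app.get('/healthz', (req, res) => {
  if (!ready) {
    res.status(503).json({ status: 'starting' });
    return;
  }
  if (isShuttingDown) {
    res.status(503).json({ status: 'shutting_down' });
    return;
  }
  res.json({ status: 'ok', sessions: activeSessions.size });
});

// Start failing health checks as soon as shutdown begins.
process.on('SIGTERM', () => { isShuttingDown = true; });

async function main() {
  const deps = await createDeps(); // application-specific setup, defined elsewhere
  // register tools, set up routes...
  app.listen(3000, () => { ready = true; });
}

main().catch((err) => { console.error(err); process.exit(1); });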
The load balancer's health check should target /healthz. A 503 response during startup (before ready = true) keeps the backend out of rotation until the MCP server has finished initializing. A 503 during shutdown stops the load balancer from sending new sessions to this instance while its existing sessions drain.
AliveMCP provides a complementary layer above the load balancer: it probes the full MCP protocol (initialize + tools/list) at the load-balancer IP, confirming that the entire stack — LB routing, backend, tool registration — is working. Configure one AliveMCP monitor per backend IP if you want per-instance health visibility, in addition to the one monitor on the load balancer address.
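For readers who want to see what a protocol-level probe involves, here is a minimal sketch of the initialize + tools/list sequence. It is not how AliveMCP is implemented. The PostFn hook is a hypothetical helper you would supply (for example a thin wrapper around fetch that sets the mcp-session-id header), and the client name and version are placeholders.

```typescript
// A JSON-RPC 2.0 message; notifications omit the id.
type JsonRpcMessage = {
  jsonrpc: '2.0';
  id?: number;
  method: string;
  params?: Record<string, unknown>;
};

// Builds the three messages a minimal MCP probe sends, in order.
function buildProbeMessages(protocolVersion: string): JsonRpcMessage[] {
  return [
    {
      jsonrpc: '2.0',
      id: 1,
      method: 'initialize',
      params: {
        protocolVersion,
        capabilities: {},
        clientInfo: { name: 'probe-sketch', version: '0.0.0' },
      },
    },
    { jsonrpc: '2.0', method: 'notifications/initialized' },
    { jsonrpc: '2.0', id: 2, method: 'tools/list' },
  ];
}

// Hypothetical transport hook: POSTs one message and reports the outcome,
// including any session ID the server returned.
type PostFn = (
  url: string,
  message: JsonRpcMessage,
  sessionId: string | null,
) => Promise<{ ok: boolean; sessionId: string | null }>;

// Returns true only if every step of the handshake succeeds.
async function probe(url: string, post: PostFn, protocolVersion: string): Promise<boolean> {
  let sessionId: string | null = null;
  for (const message of buildProbeMessages(protocolVersion)) {
    const result = await post(url, message, sessionId);
    if (!result.ok) return false;
    sessionId = result.sessionId ?? sessionId; // carry the session forward
  }
  return true;
}
```

Running probe once against the load balancer address and once against each backend address gives the two views described above: overall reachability and per-instance health. The protocolVersion argument should be a version your server supports.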
Related questions
Can I use Redis to share session state across backends and avoid sticky sessions?
In principle yes: the session's in-progress tool call state is small enough to serialize to Redis. In practice, the MCP SDK does not provide built-in distributed session storage, so you would need to reimplement session state management yourself. For most MCP servers, sticky sessions or stateless mode is simpler, and stateless mode (disabling SSE) is a cleaner path to horizontal scaling than distributed session state.
How does graceful shutdown work in a load-balanced cluster?
When Kubernetes terminates a pod (on scale-in or a rolling deploy), it runs the preStop hook first and sends SIGTERM once the hook returns. The sequence is: (1) the pod leaves the load balancer's rotation, either because Kubernetes removes it from the Service endpoints or because /healthz starts returning 503; (2) a preStop.exec.command sleep (typically 10–15s) gives in-flight requests time to complete; (3) on SIGTERM, the server drains active SSE sessions and exits. See the graceful shutdown guide for the full sequence. The key timing: the load balancer's health check interval must be shorter than the preStop sleep, so the backend is removed from rotation before new connections arrive.
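The application side of that sequence can be sketched as a small registry that tracks open sessions and reports health. This is a minimal illustration under stated assumptions: SessionRegistry and Closable are hypothetical names, and a real server would close SSE responses rather than call a bare close().

```typescript
type Closable = { close(): void };

class SessionRegistry {
  private sessions = new Map<string, Closable>();
  private shuttingDown = false;

  add(id: string, session: Closable): void {
    this.sessions.set(id, session);
  }

  remove(id: string): void {
    this.sessions.delete(id);
  }

  // What /healthz should answer right now.
  health(): { code: number; status: string } {
    return this.shuttingDown
      ? { code: 503, status: 'shutting_down' }
      : { code: 200, status: 'ok' };
  }

  // Fail health checks so the balancer stops sending new sessions here.
  beginShutdown(): void {
    this.shuttingDown = true;
  }

  // Close whatever is still open and report how many sessions were closed.
  drain(): number {
    const count = this.sessions.size;
    this.sessions.forEach((session) => session.close());
    this.sessions.clear();
    return count;
  }
}
```

In a Node.js server, one plausible wiring is to call beginShutdown() when shutdown starts, wait at least one health check interval so the balancer notices the 503, then call drain() and exit.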
Should AliveMCP monitor the load balancer or individual backends?
Both, for different purposes. The load-balancer endpoint monitor tells you whether the service is reachable from outside. Individual backend monitors (if your backends are reachable on private IPs within the probe network) tell you which backend is failing when the cluster is partially degraded. Most teams start with a single load-balancer-level monitor and add per-backend monitors when they hit a degraded-cluster incident they couldn't diagnose from the LB monitor alone.
Does Cloudflare work as a load balancer for MCP servers?
Yes, with caveats. Cloudflare's load balancing supports HTTP health checks and cookie-based session affinity (called "session affinity" in the Traffic tab). SSE sessions through Cloudflare hit the 100-second connection limit on free and pro plans — configure your server to send a keep-alive SSE comment (: ping\n\n) every 90 seconds to avoid forced disconnections. See the WebSockets and SSE guide for the proxy configuration table per provider.
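The keep-alive comment mentioned above can be produced with a few lines. This is a sketch, assuming the caller passes in a write function for the open SSE response; sseComment and startKeepAlive are hypothetical helper names, and the 90-second default simply follows the figure in the answer.

```typescript
// Formats an SSE comment line. Lines starting with ':' are ignored by
// SSE clients but still count as traffic on the connection.
function sseComment(text: string): string {
  return `: ${text}\n\n`;
}

// Writes a ping comment every intervalMs and returns a function that
// stops the timer. `write` is whatever sends a chunk on the open stream.
function startKeepAlive(
  write: (chunk: string) => void,
  intervalMs: number = 90_000,
): () => void {
  const timer = setInterval(() => write(sseComment('ping')), intervalMs);
  return () => clearInterval(timer);
}
```

With Express, for example, the write function could be (chunk) => res.write(chunk), and the returned stop function should be called when the request's close event fires so the timer does not outlive the connection.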
Further reading
- MCP server WebSockets — SSE proxy configuration for Caddy, nginx, ALB, Cloudflare, and Kubernetes
- MCP server graceful shutdown — drain sequences and SIGTERM handling in a load-balanced cluster
- MCP server multi-tenant — per-tenant routing and session isolation
- MCP server scheduled tasks — leader election to prevent duplicate scheduled runs across replicas
- MCP server reliability — zero-downtime deployments and rolling restart patterns
- AliveMCP — uptime monitoring that probes the MCP protocol at the load-balancer level