MCP server load balancing

The MCP StreamableHTTP transport uses two kinds of HTTP request per session: a POST for each tool call (request/response) and a long-lived GET that opens an SSE stream for server-to-client notifications. Both carry the same mcp-session-id header, and the server-side session state (the in-memory McpServer instance, the registered tools, the AsyncLocalStorage context) lives in one process. Round-robin load balancing can route the POST to one backend and the GET to another; the second backend has no record of the session, so the request fails. Load-balancing MCP servers therefore requires either sticky sessions (route all requests for a session to the same backend) or a stateless design (eliminate SSE entirely so every request is independently routable).

TL;DR

Two approaches. Sticky sessions: route all requests carrying the same mcp-session-id header to the same backend instance using cookie-based or header-based affinity. Stateless mode: turn off sessions and SSE responses on the transport (in the official TypeScript SDK, sessionIdGenerator: undefined together with enableJsonResponse: true), making every POST fully stateless and round-robin friendly. Stateless mode works for clients that don't require server-initiated notifications. Use protocol-aware health checks (/healthz returning JSON with a status field) rather than TCP checks for upstream probes. A probe sent through the load balancer reaches one backend at a time, so for full coverage configure one AliveMCP monitor per backend instance in addition to the one on the load balancer address.

Why round-robin breaks MCP sessions

An MCP session has three phases. In the initialize phase, the client sends a POST to establish the session and receives an mcp-session-id header in the response. In the tool-call phase, the client sends POSTs carrying that session ID, and the server dispatches them to the in-memory McpServer instance that owns the session. In the notification phase, the client opens a GET SSE stream so the server can push notifications back.

With naive round-robin across three backends A, B, C:

initialize POST → backend A. Session created in A's memory. mcp-session-id: sess-xyz returned.
Next tool call POST (same session ID) → backend B. B has no session with that ID → JSON-RPC error or silent failure.
GET SSE stream → backend C. C has no session → connection immediately closed.

The fix is to ensure all requests for a session ID reach the same backend.
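The failure sequence above can be reproduced with a toy model. This is a minimal sketch, not MCP SDK code: Backend, roundRobin, and handle are hypothetical names, and each backend's in-memory session table is reduced to a Set of session IDs.

```typescript
// Toy model: every backend keeps its sessions in its own memory,
// mirroring the per-process McpServer state described above.
type Backend = { name: string; sessions: Set<string> };
type RequestKind = "initialize" | "tool" | "sse";

// Returns a picker that cycles through the backends in order.
function roundRobin(backends: Backend[]): () => Backend {
  let next = 0;
  return () => backends[next++ % backends.length];
}

// "initialize" creates the session on whichever backend receives it;
// every other request only succeeds if that backend already holds the session.
function handle(backend: Backend, kind: RequestKind, sessionId: string): string {
  if (kind === "initialize") {
    backend.sessions.add(sessionId);
    return `${backend.name}: session created`;
  }
  return backend.sessions.has(sessionId)
    ? `${backend.name}: ok`
    : `${backend.name}: unknown session`;
}

const backends: Backend[] = ["A", "B", "C"].map((name) => ({
  name,
  sessions: new Set<string>(),
}));
const pick = roundRobin(backends);

const trace = [
  handle(pick(), "initialize", "sess-xyz"),
  handle(pick(), "tool", "sess-xyz"),
  handle(pick(), "sse", "sess-xyz"),
];
console.log(trace);
// [ 'A: session created', 'B: unknown session', 'C: unknown session' ]
```

Replacing the picker with one that always returns the backend holding the session makes all three requests succeed, which is what sticky routing provides.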

Sticky sessions with Caddy

Caddy supports header-based load balancing through lb_policy header. Route on the mcp-session-id header so all requests for a session go to the same upstream:

api.yourdomain.com {
  reverse_proxy /mcp* backend1:3000 backend2:3000 backend3:3000 {
    lb_policy header mcp-session-id
    health_uri /healthz
    health_interval 10s
    health_timeout 5s
    flush_interval -1                  # flush each write immediately (safeguard for SSE streaming)
  }
}

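lb_policy header mcp-session-id hashes the header value and maps it consistently to one backend. Requests without the header (e.g., the initial initialize POST, which has no session ID yet) go to Caddy's fallback policy, which is random by default. The initialize response includes the session ID, and all subsequent requests from that client include it, so only the very first request can land on any backend and all following requests hash to the same one. One caveat to verify in your deployment: the hash of a newly issued session ID is not guaranteed to select the backend that issued it, so confirm that the first tool call after initialize reaches the backend holding the session. If it does not, Caddy's cookie policy (lb_policy cookie) pins a client to the upstream chosen on its first request, provided the client stores and resends cookies.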

The other setting to note is flush_interval -1, which tells Caddy to flush each write to the client immediately. Caddy already flushes responses served with Content-Type: text/event-stream right away, so the setting mainly protects streams that are missing that header. When SSE frames do get buffered, the client receives batched updates instead of a real-time stream, or nothing at all if the buffer never fills.
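The header-hash routing described in this section can be sketched in a few lines. This is an illustration under stated assumptions: fnv1a, backendForHeader, and pickBackend are hypothetical helpers, and the plain hash-modulo scheme is a stand-in, not the selection algorithm Caddy actually implements.

```typescript
const pool = ["backend1:3000", "backend2:3000", "backend3:3000"];

// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Same header value in, same backend out (as long as the pool is unchanged).
function backendForHeader(value: string, backends: string[]): string {
  return backends[fnv1a(value) % backends.length];
}

// Requests without the header cannot be hashed, so they get a random backend.
function pickBackend(sessionId: string | undefined, backends: string[]): string {
  if (sessionId === undefined || sessionId === "") {
    return backends[Math.floor(Math.random() * backends.length)];
  }
  return backendForHeader(sessionId, backends);
}

const first = pickBackend("sess-xyz", pool);
const second = pickBackend("sess-xyz", pool);
console.log(first === second); // true: same header value, same backend
```

In this sketch the chosen backend depends only on the header value, not on which backend issued the session ID, and removing a backend from the pool changes the mapping for some existing IDs.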

Sticky sessions with nginx

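Open-source nginx does support consistent hashing on an arbitrary request header through the upstream hash directive (hash $http_mcp_session_id consistent;). The caveat is that requests without the header all hash to the same empty key, and the hash of a session ID is not tied to the backend that issued it, so test that follow-up requests reach the backend holding the session before relying on it. Two options that do not depend on the header are: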

# Option 1: ip_hash — consistent routing by client IP
upstream mcp_backends {
  ip_hash;
  server backend1:3000;
  server backend2:3000;
  server backend3:3000;
  keepalive 32;
}

server {
  location /mcp {
    proxy_pass http://mcp_backends;
    proxy_http_version 1.1;
    proxy_set_header Connection '';      # keepalive
    proxy_buffering off;                 # required for SSE streaming
    proxy_read_timeout 3600s;            # SSE connections are long-lived
    proxy_set_header X-Real-IP $remote_addr;
  }
}

# Option 2: sticky cookie (nginx Plus / OpenResty / lua-resty-balancer)
# On nginx Plus:
upstream mcp_backends {
  server backend1:3000;
  server backend2:3000;
  sticky cookie srv_id expires=1h domain=.yourdomain.com httponly;
}

# On OpenResty with lua-resty-balancer:
# set $backend based on ngx.req.get_headers()["mcp-session-id"]

IP hash is simpler but breaks when a client's IP changes (e.g., mobile networks), and because ip_hash keys on the first three octets of an IPv4 address, clients behind the same NAT or /24 all land on one backend. For MCP clients connecting from fixed-IP infrastructure (agent platforms, CI systems), IP hash is usually adequate. For browser-based or mobile clients, a sticky cookie or header-based hashing is more reliable.

Stateless mode — horizontal scaling without sticky sessions

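If your MCP server does not need to push notifications to clients (most tool-serving servers don't), you can eliminate SSE entirely. In the official TypeScript SDK, create the transport with sessionIdGenerator: undefined (no session is issued) and enableJsonResponse: true (plain JSON replies instead of an SSE stream), and every request becomes a stateless POST: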

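// server.ts: stateless mode
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';

// app (Express), authMiddleware, rateLimitMiddleware, registerAllTools, and deps are defined elsewhere
app.post('/mcp', authMiddleware, rateLimitMiddleware, async (req, res) => {
  const server = new McpServer({ name: 'my-server', version: '1.0.0' });
  registerAllTools(server, deps);

  const transport = new StreamableHTTPServerTransport({
    sessionIdGenerator: undefined,      // stateless: no mcp-session-id is issued
    enableJsonResponse: true,           // plain JSON reply, no long-lived SSE stream
  });

  res.on('close', () => { transport.close(); server.close(); });  // free per-request resources
  await server.connect(transport);
  await transport.handleRequest(req, res, req.body);  // req.body is set when express.json() is installed
});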

In stateless mode, each POST /mcp creates a short-lived server instance that handles exactly one request and closes. There is no session ID correlation between requests from the same client. The load balancer can round-robin freely. The tradeoff: the server cannot push notifications, progress events, or log streams to the client via SSE. Clients that need to receive in-session notifications must poll via tool calls instead.

Stateless mode is ideal for MCP servers that expose read-only tools over a shared database — search, lookup, summarize — where each tool call is independently meaningful and no session continuity is needed.
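A toy model of why such servers tolerate round-robin: the handler below builds everything it needs per request and reads only shared data, so any backend gives the same answer. The names (catalog, handleStateless, the lookup tool) are hypothetical and stand in for a real tool registry and database.

```typescript
// Shared, read-only data standing in for the database every backend can reach.
const catalog: Record<string, string> = { "sku-1": "widget", "sku-2": "gadget" };

type ToolCall = { tool: "lookup"; args: { sku: string } };
type ToolResponse = { servedBy: string; result: string };

// Stateless handler: the tool table is created for this one request and
// discarded afterwards, so nothing ties a client to a particular backend.
function handleStateless(backend: string, call: ToolCall): ToolResponse {
  const tools = {
    lookup: (args: { sku: string }) => catalog[args.sku] ?? "not found",
  };
  return { servedBy: backend, result: tools[call.tool](call.args) };
}

const backendNames = ["A", "B", "C"];
const calls: ToolCall[] = [
  { tool: "lookup", args: { sku: "sku-1" } },
  { tool: "lookup", args: { sku: "sku-2" } },
  { tool: "lookup", args: { sku: "sku-9" } },
];

// Plain round-robin: request i goes to backend i mod 3.
const responses = calls.map((call, i) =>
  handleStateless(backendNames[i % backendNames.length], call),
);
console.log(responses);
```

Each call is answered correctly regardless of which backend serves it, because no request depends on state left behind by an earlier one.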

Kubernetes with session affinity

In Kubernetes with the ingress-nginx controller, configure session affinity through annotations on the Ingress resource:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "mcp-backend"
    nginx.ingress.kubernetes.io/session-cookie-expires: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"   # required for SSE
spec:
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /mcp
        pathType: Prefix
        backend:
          service:
            name: mcp-service
            port:
              number: 3000

The Ingress controller sets a sticky cookie on the first response and routes subsequent requests that carry the cookie to the same pod. The proxy-buffering: "off" annotation is the ingress-nginx counterpart of proxy_buffering off in a plain nginx config. With buffering on, SSE frames accumulate in the proxy's buffers and clients see delayed or missing streaming events.
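Cookie affinity only helps if the MCP client sends the cookie back, and non-browser HTTP clients often do not keep cookies unless told to. The following is a minimal sketch of the client side, assuming a hypothetical AffinityJar helper; the cookie name mcp-backend comes from the annotation above, while the cookie value is made up for illustration.

```typescript
// Extract the name=value pair from one Set-Cookie header, ignoring attributes
// such as Path, Expires, and HttpOnly.
function parseSetCookie(header: string): { name: string; value: string } | undefined {
  const pair = header.split(";")[0].trim();
  const eq = pair.indexOf("=");
  if (eq <= 0) return undefined; // no name, or not a name=value pair
  return { name: pair.slice(0, eq), value: pair.slice(eq + 1) };
}

// Remembers cookies from responses and replays them on later requests.
class AffinityJar {
  private cookies = new Map<string, string>();

  // Call with the Set-Cookie header values of each response.
  store(setCookieHeaders: string[]): void {
    for (const header of setCookieHeaders) {
      const cookie = parseSetCookie(header);
      if (cookie) this.cookies.set(cookie.name, cookie.value);
    }
  }

  // Value for the Cookie request header, or undefined if nothing is stored.
  header(): string | undefined {
    const parts: string[] = [];
    this.cookies.forEach((value, name) => parts.push(`${name}=${value}`));
    return parts.length > 0 ? parts.join("; ") : undefined;
  }
}

const jar = new AffinityJar();
jar.store(["mcp-backend=1a2b3c; Path=/mcp; HttpOnly"]);
console.log(jar.header()); // mcp-backend=1a2b3c
```

A client would call store() after the initialize response and attach header() to every later POST and to the GET that opens the SSE stream. The sketch ignores expiry, domain, and path matching, which a production cookie jar would need.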

Health checks for load-balanced MCP backends

TCP-level health checks (a simple SYN/ACK test) confirm that the port is open but not that the MCP server is ready to serve. An HTTP health check on /healthz that reports startup and shutdown state is better:

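// server.ts: health endpoint
import express from 'express';

const app = express();
let ready = false;                 // flipped to true once the server is listening
let isShuttingDown = false;        // set to true by your shutdown handler (not shown)
const activeSessions = new Map<string, unknown>();  // session ID -> session state, filled by the /mcp routes

app.get('/healthz', (req, res) => {
  if (!ready) {
    res.status(503).json({ status: 'starting' });
    return;
  }
  if (isShuttingDown) {
    res.status(503).json({ status: 'shutting_down' });
    return;
  }
  res.json({ status: 'ok', sessions: activeSessions.size });
});

async function main() {
  const deps = await createDeps();   // createDeps is assumed to be defined elsewhere
  // register tools, set up routes...
  app.listen(3000, () => { ready = true; });
}

main().catch((err) => {
  console.error('startup failed', err);
  process.exit(1);
});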

The load balancer's health check should target /healthz. A 503 response during startup (before ready = true) keeps the backend out of rotation until the MCP server has finished initializing. A 503 during shutdown stops the load balancer from sending new sessions to this instance while it drains, so new clients land on the other backends.

AliveMCP provides a complementary layer above the load balancer: it probes the full MCP protocol (initialize + tools/list) at the load-balancer IP, confirming that the entire stack — LB routing, backend, tool registration — is working. Configure one AliveMCP monitor per backend IP if you want per-instance health visibility, in addition to the one monitor on the load balancer address.
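To make "probing the full MCP protocol" concrete, here is a rough sketch of the two JSON-RPC requests such a probe sends and a check on the tools/list reply. It is not AliveMCP's implementation: the function names, the example-probe client name, and the URL are assumptions, the protocol version shown is one published revision of the spec, and the reply check assumes the server answers with plain JSON rather than an SSE stream.

```typescript
type ProbeRequest = {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
};

const baseHeaders: Record<string, string> = {
  "Content-Type": "application/json",
  Accept: "application/json, text/event-stream",
};

// Step 1: initialize. The response carries the mcp-session-id header (if any).
function initializeRequest(url: string): ProbeRequest {
  return {
    method: "POST",
    url,
    headers: { ...baseHeaders },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: {
        protocolVersion: "2025-03-26",
        capabilities: {},
        clientInfo: { name: "example-probe", version: "0.0.1" },
      },
    }),
  };
}

// Step 2: tools/list, reusing the session ID returned by initialize.
// (A full client also sends a notifications/initialized message in between.)
function toolsListRequest(url: string, sessionId?: string): ProbeRequest {
  const headers: Record<string, string> = { ...baseHeaders };
  if (sessionId) headers["mcp-session-id"] = sessionId;
  return {
    method: "POST",
    url,
    headers,
    body: JSON.stringify({ jsonrpc: "2.0", id: 2, method: "tools/list" }),
  };
}

// Healthy only if the reply is a JSON-RPC result that lists at least one tool.
function isHealthyToolsList(responseBody: string): boolean {
  try {
    const parsed = JSON.parse(responseBody);
    return Array.isArray(parsed?.result?.tools) && parsed.result.tools.length > 0;
  } catch {
    return false;
  }
}

const probeUrl = "https://api.yourdomain.com/mcp";
console.log(initializeRequest(probeUrl).body);
console.log(toolsListRequest(probeUrl, "sess-xyz").headers);
```

Sent through a sticky load balancer, the second request only succeeds if it reaches the backend that handled the first, so a probe of this shape exercises the affinity configuration as well as the backend itself.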