MCP server load testing

Load testing an MCP server differs from load testing an HTTP API because MCP sessions are stateful: each load session must complete the initialize handshake before it can send tool calls. You can't reuse connections across sessions the way you would with a REST endpoint. The right metric isn't requests-per-second. It's the number of concurrent active sessions the server can sustain before P99 tool-call latency exceeds your acceptable threshold (typically 2–5 seconds).

TL;DR

Measure concurrent sessions, not RPS. Build a load harness that opens N sessions in parallel, each completing initialize → one or more tool calls → session close, and tracks per-session latency percentiles. Find the session ceiling: the N where P99 tool-call latency first exceeds your SLO. Use that ceiling to set your auto-scaling trigger (or your manual scaling decision). After the load test, compare the initialize latency distribution against AliveMCP's probe history — production probe latency is your ongoing canary for regression.

Why RPS is the wrong metric

Traditional HTTP load testing tools (k6, Locust, JMeter) are designed around requests-per-second: ramp up virtual users, each sending rapid HTTP requests, measure throughput and latency. MCP session semantics break this model:

Session setup cost is per-session, not per-request. The initialize handshake happens once per session. A load test that reconnects on every tool call adds artificial overhead that real clients don't incur (clients hold sessions open across multiple tool calls).
Tool calls are not uniform in duration. A tool that calls an external API takes 500ms to 5s. A tool that does an in-memory computation takes 5ms. Averaging these into a single RPS number is misleading — what matters is whether the slow tools meet their latency SLO under concurrency.
Concurrency pressure on MCP is session-level. The server's bottleneck is usually the number of active sessions it can hold open, not the throughput of individual tool calls. A single session with 10 sequential tool calls creates less server pressure than 10 concurrent sessions with 1 tool call each — the latter requires 10 times the per-session state memory.

Measure what clients actually experience: the latency distribution of each tool call under N concurrent sessions. Step N from 1 up to your expected peak concurrency in increments of 5 or 10.

Building a load test harness

Use the official MCP SDK client so the harness follows the same session lifecycle as real clients. A Node.js harness for N concurrent sessions:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { SSEClientTransport } from '@modelcontextprotocol/sdk/client/sse.js';

// One full session: initialize handshake, a single tool call, then close.
async function runSession(serverUrl, toolName, toolArgs) {
  const client = new Client({ name: 'load-test', version: '1' }, { capabilities: {} });
  const transport = new SSEClientTransport(new URL(serverUrl));
  try {
    const initStart = Date.now();
    await client.connect(transport); // connect() performs the initialize handshake
    const initLatency = Date.now() - initStart;

    const callStart = Date.now();
    const result = await client.callTool({ name: toolName, arguments: toolArgs });
    const callLatency = Date.now() - callStart;

    return { initLatency, callLatency, success: !result.isError };
  } finally {
    // Close even when connect() or callTool() throws, so failed sessions don't leak connections.
    await client.close().catch(() => {});
  }
}

// Percentile by index into an ascending array; undefined when there are no samples.
function percentile(sorted, p) {
  return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * p))];
}

async function loadTest(serverUrl, concurrency, toolName, toolArgs) {
  const sessions = Array.from({ length: concurrency }, () =>
    runSession(serverUrl, toolName, toolArgs)
  );
  const results = await Promise.allSettled(sessions);
  // Only successful tool calls count toward latency; fast failures would skew percentiles down.
  const ok = results
    .filter(r => r.status === 'fulfilled' && r.value.success)
    .map(r => r.value);
  const latencies = ok.map(v => v.callLatency).sort((a, b) => a - b);
  const initLatencies = ok.map(v => v.initLatency).sort((a, b) => a - b);
  // Errors include rejected sessions and tool results flagged with isError.
  const errors = results.length - ok.length;

  const p50 = percentile(latencies, 0.5);
  const p95 = percentile(latencies, 0.95);
  const p99 = percentile(latencies, 0.99);
  const initP50 = percentile(initLatencies, 0.5);
  console.log(`N=${concurrency}: p50=${p50}ms p95=${p95}ms p99=${p99}ms init_p50=${initP50}ms errors=${errors}`);
  return { p50, p95, p99, initP50, errors };
}

// Ramp up from 1 to 50 concurrent sessions
const SERVER = 'https://your-mcp-server.example.com/mcp';
for (const n of [1, 5, 10, 20, 30, 50]) {
  await loadTest(SERVER, n, 'your_tool_name', { param: 'test-value' });
  await new Promise(r => setTimeout(r, 2000)); // brief pause between steps
}

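Run this against a production-representative instance (not localhost, because the measurement should include network latency). Capture the output at each step. With one tool call per session, each step yields only N samples, so at small N the reported P99 is effectively the slowest call; repeat each step or make several calls per session to get a stable percentile.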

Realistic load profiles

The harness above simulates synchronized sessions: all sessions start at the same moment, which is a worst-case spike. Real traffic arrives spread out over time. For a more realistic test:

Staggered arrival: start sessions with a random jitter (0–500ms per session). This creates overlapping sessions that aren't synchronized, which reduces memory pressure spikes while maintaining the target concurrency level.
Session duration variance: real sessions hold multiple tool calls over a 10–300 second window. Simulate this by having each session call the tool several times with random think time between calls (e.g., uniform 1–5 seconds).
Mixed tool workload: if your server has fast tools and slow tools, run load tests with the realistic mix (e.g., 80% fast tool calls, 20% slow). The slow calls occupy server resources longer, which affects how many sessions can run concurrently.
Sustained load: run the test for at least 5 minutes at each concurrency level. Memory leaks, connection pool exhaustion, and GC pressure often appear only after sustained load, not in a quick ramp-up-and-stop test.
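
The first three adjustments above (staggered arrival, think time, and a mixed tool workload) can be sketched as a small set of helpers. This is a minimal sketch, not part of the MCP SDK: the helper names, the default values, and the tool names in the usage note are illustrative assumptions, and `callTool` is a function you supply (for example a thin wrapper around `client.callTool`).

```javascript
// Sketch only: helper names and defaults are illustrative assumptions.

// Sleep helper used for start jitter and think time.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Uniform random number in [min, max); `rand` is injectable for deterministic tests.
function uniform(min, max, rand = Math.random) {
  return min + rand() * (max - min);
}

// Picks a tool name from a weighted mix such as
// [{ name: 'fast_tool', weight: 0.8 }, { name: 'slow_tool', weight: 0.2 }].
function pickTool(mix, rand = Math.random) {
  const total = mix.reduce((sum, t) => sum + t.weight, 0);
  let r = rand() * total;
  for (const t of mix) {
    r -= t.weight;
    if (r < 0) return t.name;
  }
  return mix[mix.length - 1].name; // guard against floating-point rounding
}

// One session: wait a random start delay, then make `calls` tool calls
// with random think time between them. Returns per-call latencies.
async function runRealisticSession(callTool, mix, opts = {}) {
  const { calls = 5, maxJitterMs = 500, thinkMs = [1000, 5000], rand = Math.random } = opts;
  await sleep(uniform(0, maxJitterMs, rand));
  const latencies = [];
  for (let i = 0; i < calls; i++) {
    const name = pickTool(mix, rand);
    const start = Date.now();
    await callTool(name);
    latencies.push({ name, ms: Date.now() - start });
    // No think time after the final call.
    if (i < calls - 1) await sleep(uniform(thinkMs[0], thinkMs[1], rand));
  }
  return latencies;
}
```

To hold a target concurrency, start N of these sessions with Promise.allSettled, as in the harness above, and keep latencies separate per tool name so the slow tools can be checked against their own SLO.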

Finding the session ceiling

The session ceiling is the number of concurrent sessions where P99 tool-call latency first exceeds your SLO. For most MCP servers with interactive users, the SLO is 2–5 seconds for tool-call P99 latency. Beyond the ceiling, the server is overloaded: latency climbs, errors appear, or both.

From your load test output, plot P99 latency versus concurrency. The relationship is usually:

Linear region: P99 scales roughly linearly with concurrency, so doubling the session count roughly doubles latency. This is acceptable scaling.
Knee: at some point, P99 starts growing faster than linearly. You're hitting a bottleneck (CPU saturation, database connection limit, event loop starvation, GC pressure).
Cliff: beyond the knee, latency spikes and errors appear. This is your hard ceiling.

Set your auto-scaling trigger (or capacity planning target) at the knee, not the cliff, so that new replicas have time to come up before real traffic reaches the cliff. For example, if P99 stays under 2s up to 20 concurrent sessions, configure your autoscaler (such as a Kubernetes HPA) to scale out at about 15 sessions per replica.
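
Reading the ceiling and knee off the ramp output can be automated. The sketch below assumes you have collected one `{ n, p99 }` record per step, sorted by ascending `n`. The function names are hypothetical, and both the slope factor of 2 used to detect the knee and the 0.75 headroom ratio (chosen to match the 20 to 15 example above) are assumptions to tune for your own server.

```javascript
// Sketch only: function names, the slope factor, and the headroom ratio are assumptions.

// Returns the highest tested N still within the SLO and the first N that breaches it.
function findSessionCeiling(steps, sloMs) {
  let lastWithinSlo = null;
  let firstBreach = null;
  for (const step of steps) {
    if (step.p99 === undefined || step.p99 > sloMs) {
      firstBreach = step.n;
      break;
    }
    lastWithinSlo = step.n;
  }
  return { lastWithinSlo, firstBreach };
}

// Returns the N just before latency-per-added-session grows by more than
// `factor` times the slope of the first segment, or null if no knee is found.
function findKnee(steps, factor = 2) {
  if (steps.length < 3) return null;
  const slope = (a, b) => (b.p99 - a.p99) / (b.n - a.n);
  const baseline = Math.max(slope(steps[0], steps[1]), 1e-9);
  for (let i = 2; i < steps.length; i++) {
    if (slope(steps[i - 1], steps[i]) > factor * baseline) return steps[i - 1].n;
  }
  return null;
}

// Scaling trigger with headroom below the ceiling.
function scalingTrigger(ceiling, headroom = 0.75) {
  return Math.floor(ceiling * headroom);
}
```

Because the ramp only samples a few values of N, the true ceiling lies somewhere between lastWithinSlo and firstBreach; rerun with finer steps in that range if you need a tighter number.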

Common bottlenecks and their signatures:

CPU saturation: P95 and P99 both climb. docker stats shows CPU near 100%. Fix: add replicas or optimize hot code paths.
Event loop starvation (Node.js): P99 climbs sharply while P50 stays low — some sessions get stuck behind a slow operation blocking the event loop. Fix: move CPU-intensive operations to worker threads.
Memory pressure: P99 latency spikes coincide with GC pauses. RSS climbs linearly with concurrent sessions and doesn't release. Fix: find per-session memory leaks; cap session duration; add replicas.
Database connection pool exhaustion: errors appear with "connection pool full" messages. Fix: increase pool size or reduce connection hold time in tool implementations.
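
To tell event loop starvation apart from memory pressure, it helps to sample both inside the server process while the load test runs. The following is a minimal sketch using timer drift as a proxy for event loop lag, plus `process.memoryUsage().rss`. The function names and the 100ms interval are assumptions; Node's `perf_hooks.monitorEventLoopDelay` offers a higher-resolution alternative.

```javascript
// Sketch only: function names and the sampling interval are assumptions.

// Starts a sampler inside the server process. Each tick records how late the
// timer fired (a proxy for event loop lag) and the current resident set size.
function startServerSampler(intervalMs = 100) {
  const samples = [];
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    const lagMs = Math.max(0, now - expected); // lateness of this tick
    expected = now + intervalMs;
    samples.push({ lagMs, rssMb: process.memoryUsage().rss / (1024 * 1024) });
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for sampling
  return {
    stop() {
      clearInterval(timer);
      return summarize(samples);
    },
  };
}

// Reduces samples to the worst observed lag and the peak RSS.
function summarize(samples) {
  if (samples.length === 0) return { count: 0, maxLagMs: 0, maxRssMb: 0 };
  return {
    count: samples.length,
    maxLagMs: Math.max(...samples.map((s) => s.lagMs)),
    maxRssMb: Math.max(...samples.map((s) => s.rssMb)),
  };
}
```

Start the sampler before a load step and call stop() afterward. A large maxLagMs alongside a low P50 points to event loop starvation, while maxRssMb that keeps rising from step to step and never drops points to memory pressure.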

Load test results vs AliveMCP probe data

Your load test gives you a point-in-time measurement of initialize latency at various concurrency levels. AliveMCP gives you a continuous time series of initialize latency from a single external probe under real-world conditions (varying DNS resolution time, TLS handshake variability, network jitter).

Compare the two:

Baseline single-session initialize latency from the load test (N=1) should be close to AliveMCP's median initialize latency. A large discrepancy suggests the probe is experiencing higher network overhead than you measured (check AliveMCP's probe region — it may probe from a distant geography).
Latency spikes in AliveMCP probe history that don't appear in your load test often indicate infrastructure-level issues: GC pauses on a shared VPS, load balancer health-check bursts, or a cron job that runs on the same host at a fixed interval.
After a deploy, AliveMCP probe latency should remain stable or improve. A latency increase post-deploy means the new version has a regression — catch it before users do. See MCP server deployment for post-deploy verification.
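
The baseline comparison in the first point can be reduced to a ratio of medians. This sketch assumes you have both sets of initialize latencies as plain arrays of milliseconds; how you obtain the probe values is up to you, and no AliveMCP API is assumed here. The function names and the 1.5× tolerance are illustrative.

```javascript
// Sketch only: function names and the tolerance are assumptions.

// Median of a non-empty numeric array.
function median(values) {
  if (values.length === 0) throw new Error('median of empty array');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compares N=1 load-test initialize latencies with probe initialize latencies.
// Flags the pair when the medians differ by more than `maxRatio` in either direction.
function compareInitBaseline(loadTestInitMs, probeInitMs, maxRatio = 1.5) {
  const loadMedian = median(loadTestInitMs);
  const probeMedian = median(probeInitMs);
  const ratio = probeMedian / loadMedian;
  const discrepancy = ratio > maxRatio || ratio < 1 / maxRatio;
  return { loadMedian, probeMedian, ratio, discrepancy };
}
```

A flagged discrepancy is a prompt to check the probe region and network path, not proof of a server problem.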

Common failure modes under load

Failure modes that appear only under concurrent load:

Session state leakage: session A's data appears in session B's tool results. Caused by sharing mutable state between sessions, for example a JavaScript module-level variable that should be scoped to the session. Caught by running two sessions simultaneously and asserting that each session's results are independent.
Initialize race on startup: the first N sessions to arrive before the server finishes startup all fail. Caused by starting the HTTP listener before the server is fully initialized. Fix: start listening only after all dependencies are ready, and use a startup probe in the orchestrator to gate traffic.
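SSE connection limit: each SSE session holds a long-lived connection, so connection limits are reached at far lower session counts than with short-lived HTTP requests. Node.js's default listen backlog of 511 only bounds connections that are waiting to be accepted. Established connections are capped by the process's file descriptor limit (ulimit -n), by server.maxConnections if it is set, and by any connection limits on a reverse proxy or load balancer in front of the server. Fix: raise the file descriptor limit, check proxy limits, and set server.maxConnections explicitly so that overload is rejected predictably.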
Tool call timeouts under concurrent load: a downstream API that responds in 300ms under single-session load responds in 2s+ under 20 concurrent sessions because it's also being hit concurrently. The MCP server's tool-call timeout triggers. Fix: add a per-tool-call circuit breaker that fails fast when the downstream is slow, rather than queuing all 20 tool calls to wait.
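
The circuit breaker mentioned in the last item can be sketched as follows. This is a minimal illustration rather than a production implementation: the class name, option names, and default thresholds are assumptions, and the half-open state is simplified to let every caller through once the cooldown has elapsed.

```javascript
// Sketch only: class name, option names, and defaults are assumptions.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 10000, timeoutMs = 2000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.timeoutMs = timeoutMs;
    this.now = now; // injectable clock for tests
    this.failures = 0;
    this.openedAt = null;
  }

  // 'closed' = normal, 'open' = failing fast, 'half-open' = cooldown elapsed, trial calls allowed.
  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }

  // Wraps one downstream call: rejects immediately while open, otherwise
  // races the call against a timeout and records the outcome.
  async call(fn) {
    if (this.state === 'open') throw new Error('circuit open: failing fast');
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('downstream timeout')), this.timeoutMs);
    });
    try {
      const result = await Promise.race([fn(), timeout]);
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    } finally {
      clearTimeout(timer);
    }
  }
}
```

Keep one breaker per downstream dependency and share it across sessions, so that a slow downstream observed by one session protects the others. Note that the timeout only stops waiting for the call; cancelling the underlying request needs separate handling (for example an AbortController passed into the downstream client).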

See MCP server timeout configuration and MCP server reliability for mitigation strategies.