MCP server load testing
Load testing an MCP server differs from load testing an HTTP API because MCP sessions are stateful: each session must complete the initialize handshake before it can send tool calls, and a session can't be shared between clients the way pooled connections are reused against a REST endpoint. The right metric isn't requests per second. It's the number of concurrent active sessions the server sustains before P99 tool-call latency exceeds your acceptable threshold (typically 2–5 seconds).
TL;DR
Measure concurrent sessions, not RPS. Build a load harness that opens N sessions in parallel, each completing initialize → one or more tool calls → session close, and tracks per-session latency percentiles. Find the session ceiling: the N where P99 tool-call latency first exceeds your SLO. Use that ceiling to set your auto-scaling trigger (or your manual scaling decision). After the load test, compare the initialize latency distribution against AliveMCP's probe history — production probe latency is your ongoing canary for regression.
Why RPS is the wrong metric
Traditional HTTP load testing tools (k6, Locust, JMeter) are designed around requests-per-second: ramp up virtual users, each sending rapid HTTP requests, measure throughput and latency. MCP session semantics break this model:
- Session setup cost is per-session, not per-request. The initialize handshake happens once per session. A load test that reconnects on every tool call adds artificial overhead that real clients don't incur (clients hold sessions open across multiple tool calls).
- Tool calls are not uniform in duration. A tool that calls an external API takes 500ms to 5s. A tool that does an in-memory computation takes 5ms. Averaging these into a single RPS number is misleading; what matters is whether the slow tools meet their latency SLO under concurrency.
- Concurrency pressure on MCP is session-level. The server's bottleneck is usually the number of active sessions it can hold open, not the throughput of individual tool calls. A single session with 10 sequential tool calls creates less server pressure than 10 concurrent sessions with 1 tool call each — the latter requires 10 times the per-session state memory.
Measure what clients actually experience: the latency distribution of each tool call under N concurrent sessions. Vary N from 1 up to your expected peak concurrency, increasing in steps of 5 or 10.
Building a load test harness
Use the official MCP SDK client rather than hand-written HTTP requests: it handles the session lifecycle, including the initialize handshake, for you. A Node.js harness for N concurrent sessions:
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { SSEClientTransport } from '@modelcontextprotocol/sdk/client/sse.js';

// Open one session, make one tool call, and report both latencies.
async function runSession(serverUrl, toolName, toolArgs) {
  const client = new Client({ name: 'load-test', version: '1' }, { capabilities: {} });
  // Assumes the server exposes the SSE transport; use the transport your server actually serves.
  const transport = new SSEClientTransport(new URL(serverUrl));
  const initStart = Date.now();
  await client.connect(transport); // performs the initialize handshake
  const initLatency = Date.now() - initStart;
  try {
    const callStart = Date.now();
    const result = await client.callTool({ name: toolName, arguments: toolArgs });
    const callLatency = Date.now() - callStart;
    return { initLatency, callLatency, success: !result.isError };
  } finally {
    await client.close(); // close the session even if the tool call throws
  }
}

// Run `concurrency` sessions at once and print tool-call latency percentiles.
async function loadTest(serverUrl, concurrency, toolName, toolArgs) {
  const sessions = Array.from({ length: concurrency }, () =>
    runSession(serverUrl, toolName, toolArgs)
  );
  const results = await Promise.allSettled(sessions);
  // Keep only sessions that connected and whose tool result was not flagged isError.
  const ok = results
    .filter(r => r.status === 'fulfilled' && r.value.success)
    .map(r => r.value);
  const latencies = ok.map(r => r.callLatency).sort((a, b) => a - b);
  // Errors = rejected sessions plus tool results flagged isError.
  const errors = results.length - ok.length;
  // Percentiles are undefined when every session failed.
  const p50 = latencies[Math.floor(latencies.length * 0.5)];
  const p95 = latencies[Math.floor(latencies.length * 0.95)];
  const p99 = latencies[Math.floor(latencies.length * 0.99)];
  console.log(`N=${concurrency}: p50=${p50}ms p95=${p95}ms p99=${p99}ms errors=${errors}`);
  return { p50, p95, p99, errors };
}

// Ramp up from 1 to 50 concurrent sessions
const SERVER = 'https://your-mcp-server.example.com/mcp';
for (const n of [1, 5, 10, 20, 30, 50]) {
  await loadTest(SERVER, n, 'your_tool_name', { param: 'test-value' });
  await new Promise(r => setTimeout(r, 2000)); // brief pause between steps
}
Run this against a production-representative instance (not localhost — you need network latency in the measurement). Capture the output at each step.
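Percentiles computed from a few dozen samples are sensitive to the indexing rule, so it can help to make the rule explicit when you capture results. The following is a minimal sketch of a nearest-rank percentile helper; `percentile` is a name introduced here for illustration (it is not part of the MCP SDK), and the sample latencies are made up.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of all
// samples at or below it. `percentile` is a hypothetical helper.
function percentile(values, p) {
  if (values.length === 0) return undefined; // e.g. every session failed
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up tool-call latencies (ms) from a 10-session step.
const sample = [120, 95, 110, 400, 105, 98, 130, 101, 99, 2500];
console.log({
  p50: percentile(sample, 50),
  p95: percentile(sample, 95),
  p99: percentile(sample, 99),
});
```

With only 10 samples, P95 and P99 both resolve to the single slowest call, which is one reason to run more sessions per step before trusting a tail percentile.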
Realistic load profiles
The harness above simulates synchronized sessions: all sessions start at the same moment, which is a worst-case spike. Real traffic arrives spread out over time. For a more realistic test:
- Staggered arrival: start sessions with a random jitter (0–500ms per session). This creates overlapping sessions that aren't synchronized, which reduces memory pressure spikes while maintaining the target concurrency level.
- Session duration variance: real sessions hold multiple tool calls over a 10–300 second window. Simulate this by having each session call the tool several times with random think time between calls (e.g., uniform 1–5 seconds).
- Mixed tool workload: if your server has fast tools and slow tools, run load tests with the realistic mix (e.g., 80% fast tool calls, 20% slow). The slow calls occupy server resources longer, which affects how many sessions can run concurrently.
- Sustained load: run the test for at least 5 minutes at each concurrency level. Memory leaks, connection pool exhaustion, and GC pressure often appear only after sustained load, not in a quick ramp-up-and-stop test.
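The first three profile adjustments above (staggered arrival, think time, mixed workload) can be sketched as a single virtual-session function. This is an illustration under assumptions, not a definitive implementation: `realisticSession`, `pickTool`, and the option names are hypothetical, and `callTool` stands for any async function you supply that takes a tool name (for example, a thin wrapper around an already connected SDK client).

```javascript
// All helper names and option names below are illustrative assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const randomBetween = (min, max) => min + Math.random() * (max - min);

// Pick a tool by weight, e.g. [['fast_tool', 0.8], ['slow_tool', 0.2]].
function pickTool(mix, roll = Math.random()) {
  let cumulative = 0;
  for (const [name, weight] of mix) {
    cumulative += weight;
    if (roll < cumulative) return name;
  }
  return mix[mix.length - 1][0]; // guard against rounding error in the weights
}

// One virtual session: jittered start, then `calls` tool calls with think time.
async function realisticSession(callTool, { mix, calls, maxJitterMs, thinkMs }) {
  await sleep(randomBetween(0, maxJitterMs)); // staggered arrival
  const latencies = [];
  for (let i = 0; i < calls; i++) {
    const tool = pickTool(mix);
    const start = performance.now();
    await callTool(tool);
    latencies.push({ tool, ms: performance.now() - start });
    if (i < calls - 1) await sleep(randomBetween(thinkMs[0], thinkMs[1])); // think time
  }
  return latencies;
}
```

To approximate the profile described above, you might start 20 of these in parallel with `maxJitterMs: 500`, `thinkMs: [1000, 5000]`, and an 80/20 mix, then repeat for at least 5 minutes per concurrency level.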
Finding the session ceiling
The session ceiling is the number of concurrent sessions where P99 tool-call latency first exceeds your SLO. For most MCP servers with interactive users, the SLO is 2–5 seconds for tool-call P99 latency. Beyond the ceiling, the server is overloaded: latency climbs, errors appear, or both.
From your load test output, plot P99 latency versus concurrency. The relationship is usually:
- Linear region: P99 scales roughly linearly with concurrency, so doubling the number of sessions roughly doubles P99. This is acceptable scaling.
- Knee: at some point, P99 starts growing faster than linearly. You're hitting a bottleneck (CPU saturation, database connection limit, event loop starvation, GC pressure).
- Cliff: beyond the knee, latency spikes and errors appear. This is your hard ceiling.
Set your auto-scaling trigger (or capacity planning target) at the knee, the point before the cliff, so there is enough headroom to scale out before real traffic reaches the cliff. For example, if your ceiling is 20 concurrent sessions before P99 exceeds 2s, set your Kubernetes HorizontalPodAutoscaler (HPA) to scale at 15 sessions per replica.
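If you collect the ramp output as an array of `{ n, p99 }` steps, the ceiling and an approximate knee can be read off programmatically. The sketch below is one possible heuristic, not a standard algorithm: `findCeiling`, `findKnee`, and the `kneeFactor` threshold are assumptions introduced here, the 75% scaling target simply mirrors the 20-to-15 example above, and the step data is made up.

```javascript
// `steps` must be sorted by ascending concurrency n.
// Ceiling: the first concurrency level whose P99 exceeds the SLO.
function findCeiling(steps, sloMs) {
  const over = steps.find((step) => step.p99 > sloMs);
  return over ? over.n : null; // null: SLO held at every tested level
}

// Knee: the last step before P99-per-added-session grows more than
// `kneeFactor` times faster than over the previous step (assumed heuristic).
function findKnee(steps, kneeFactor = 2) {
  for (let i = 2; i < steps.length; i++) {
    const prevSlope = (steps[i - 1].p99 - steps[i - 2].p99) / (steps[i - 1].n - steps[i - 2].n);
    const slope = (steps[i].p99 - steps[i - 1].p99) / (steps[i].n - steps[i - 1].n);
    if (prevSlope > 0 && slope > kneeFactor * prevSlope) return steps[i - 1].n;
  }
  return null; // no super-linear jump found in the tested range
}

// Made-up ramp results for illustration only.
const steps = [
  { n: 1, p99: 180 }, { n: 5, p99: 260 }, { n: 10, p99: 390 },
  { n: 20, p99: 2100 }, { n: 30, p99: 6500 },
];
const ceiling = findCeiling(steps, 2000);
const knee = findKnee(steps);
const scaleAt = ceiling === null ? null : Math.floor(ceiling * 0.75);
console.log({ ceiling, knee, scaleAt }); // { ceiling: 20, knee: 10, scaleAt: 15 }
```

Coarse steps make the knee estimate coarse too: here the knee lies somewhere between 10 and 20 sessions, so adding intermediate steps in that range would narrow it down.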
Common bottlenecks and their signatures:
- CPU saturation: P95 and P99 both climb. docker stats shows CPU near 100%. Fix: add replicas or optimize hot code paths.
- Event loop starvation (Node.js): P99 climbs sharply while P50 stays low, because some sessions get stuck behind a slow operation blocking the event loop. Fix: move CPU-intensive operations to worker threads.
- Memory pressure: P99 latency spikes coincide with GC pauses. RSS climbs linearly with concurrent sessions and doesn't release. Fix: find per-session memory leaks; cap session duration; add replicas.
- Database connection pool exhaustion: errors appear with "connection pool full" messages. Fix: increase pool size or reduce connection hold time in tool implementations.
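To tell event loop starvation and memory pressure apart during a run, it can help to sample both inside the server process while the load test is in progress. The sketch below uses simple timer drift to estimate event loop lag and `process.memoryUsage().rss` for memory; `startLagSampler` is a hypothetical helper introduced here, and timer drift is a rough approximation rather than a precise measurement.

```javascript
// Timer-drift lag sampler: a timer that should fire every `intervalMs` fires
// late when the event loop is blocked, and the lateness approximates the lag.
// `startLagSampler` is a hypothetical helper, not a Node.js or MCP SDK API.
function startLagSampler(intervalMs = 100) {
  const rssStartMb = process.memoryUsage().rss / 1048576;
  let maxLagMs = 0;
  let expected = performance.now() + intervalMs;
  let timer = setTimeout(function tick() {
    const now = performance.now();
    maxLagMs = Math.max(maxLagMs, now - expected); // how late this tick fired
    expected = now + intervalMs;
    timer = setTimeout(tick, intervalMs);
  }, intervalMs);

  return {
    // Stop sampling (the timer otherwise keeps the process alive) and report.
    stop() {
      clearTimeout(timer);
      const rssEndMb = process.memoryUsage().rss / 1048576;
      return { maxLagMs, rssGrowthMb: rssEndMb - rssStartMb };
    },
  };
}
```

A large `maxLagMs` alongside a low P50 points toward event loop starvation, while steadily growing `rssGrowthMb` across steps points toward memory pressure.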
Load test results vs AliveMCP probe data
Your load test gives you a point-in-time measurement of initialize latency at various concurrency levels. AliveMCP gives you a continuous time series of initialize latency from a single external probe under real-world conditions (varying DNS resolution time, TLS handshake variability, network jitter).
Compare the two:
- Baseline single-session initialize latency from the load test (N=1) should be close to AliveMCP's median initialize latency. A large discrepancy suggests the probe is experiencing higher network overhead than you measured (check AliveMCP's probe region — it may probe from a distant geography).
- Latency spikes in AliveMCP probe history that don't appear in your load test often indicate infrastructure-level issues: GC pauses on a shared VPS, load balancer health-check bursts, or a cron job that runs on the same host at a fixed interval.
- After a deploy, AliveMCP probe latency should remain stable or improve. A latency increase post-deploy means the new version has a regression — catch it before users do. See MCP server deployment for post-deploy verification.
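The first comparison above can be reduced to a small check once you have both sets of numbers as plain arrays. This sketch makes several assumptions: how you obtain probe latencies from AliveMCP is not covered here, `compareBaseline` and `median` are hypothetical helpers, and the 1.5× tolerance is an arbitrary starting point rather than a recommended value.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// loadTestInitMs: initialize latencies from repeated N=1 load test runs.
// probeInitMs: initialize latencies taken from your probe history.
// maxRatio: assumed tolerance before the gap is worth investigating.
function compareBaseline(loadTestInitMs, probeInitMs, maxRatio = 1.5) {
  const baseline = median(loadTestInitMs);
  const probe = median(probeInitMs);
  const ratio = probe / baseline;
  return { baseline, probe, ratio, investigate: ratio > maxRatio || ratio < 1 / maxRatio };
}
```

Both arrays need at least one sample; with empty input the medians are not meaningful.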
Common failure modes under load
Failure modes that appear only under concurrent load:
- Session state leakage: session A's data appears in session B's tool results. Caused by incorrectly sharing mutable state between sessions, for example a JavaScript module-level variable that should be scoped to the session. Caught by running two sessions simultaneously and asserting that each session's results are independent.
- Initialize race on startup: the first N sessions to arrive before the server finishes startup all fail. Caused by starting the HTTP listener before the server is fully initialized. Fix: start listening only after all dependencies are ready, and use a startup probe in the orchestrator to gate traffic.
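- SSE connection limit: SSE connections are long-lived, so every active session holds a socket and a file descriptor for its whole duration. Under high concurrency the server runs into whichever connection limit is lowest: the process's open-file limit (ulimit -n), a configured server.maxConnections value, or a connection cap on the proxy or load balancer in front of it. The listen backlog (511 by default in Node.js) only bounds the queue of connections waiting to be accepted, but a burst of simultaneous connects can overflow it and be refused. Fix: raise the file descriptor limit, increase the listen backlog if connection bursts are refused, and set server.maxConnections explicitly so the limit is a deliberate choice.
- Tool call timeouts under concurrent load: a downstream API that responds in 300ms under single-session load responds in 2s+ under 20 concurrent sessions because it's also being hit concurrently. The MCP server's tool-call timeout triggers. Fix: add a per-tool-call circuit breaker that fails fast when the downstream is slow, rather than queuing all 20 tool calls to wait.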
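The circuit breaker mentioned in the last item can be sketched in a few lines. This is a minimal illustration under assumptions, not a production implementation: `createBreaker` and its option names and defaults are hypothetical, and `fn` stands for whatever async function your tool uses to call the downstream API.

```javascript
// After `maxFailures` consecutive failures or timeouts the breaker opens and
// rejects immediately for `cooldownMs`. Once the cooldown has passed, calls
// go through again, and a single further failure reopens the breaker.
function createBreaker(fn, { timeoutMs = 2000, maxFailures = 3, cooldownMs = 10000 } = {}) {
  let failures = 0;
  let openedAt = null;

  return async function guarded(...args) {
    if (openedAt !== null && Date.now() - openedAt < cooldownMs) {
      throw new Error('circuit open: failing fast');
    }
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('downstream timeout')), timeoutMs);
    });
    try {
      // Note: the race stops waiting on timeout but does not cancel fn itself.
      const result = await Promise.race([fn(...args), timeout]);
      failures = 0;
      openedAt = null;
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= maxFailures) openedAt = Date.now();
      throw err;
    } finally {
      clearTimeout(timer);
    }
  };
}
```

Wrapping each downstream call separately (one breaker per tool or per downstream host) keeps one slow dependency from tripping the breaker for unrelated tools.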
See MCP server timeout configuration and MCP server reliability for mitigation strategies.
Related questions
Can I use k6 or Locust to load test an MCP server?
k6 and Locust are HTTP load testing tools. They can send HTTP requests to your MCP server's HTTP/SSE endpoint, but they don't implement the MCP protocol, so out of the box they can't complete the initialize handshake or hold sessions open. You can write a custom k6 extension or Locust task that simulates the session lifecycle, but it's easier to use the MCP SDK client in a Node.js or Python script, as in the harness in this guide. For very high concurrency (thousands of sessions), use k6 with a custom module or write a Go-based harness using the MCP Go client.
How long should I run a load test?
At least 5 minutes at each concurrency level. Short tests (under 2 minutes) miss GC pressure effects in Node.js (which accumulate over minutes), database connection pool behavior (which stabilizes after a warm-up period), and rate limiting on downstream APIs (which may allow burst traffic for the first 60 seconds). For identifying memory leaks, run for 30 minutes at your expected peak concurrency and measure RSS at the start and end. Any growth beyond the expected per-session overhead indicates a leak.
What should I do if my server hits the ceiling before expected traffic?
Profile first. Add timing instrumentation to each tool call and look for the slow path. Common findings: one tool call takes 5× longer than others under load (optimize or offload it); the database connection pool is exhausted (increase pool size or add read replicas); Node.js event loop is blocked by a synchronous operation (move to worker threads or use an async library). If profiling shows the bottleneck is CPU, add replicas. If it's memory, find the leak. Don't add replicas without profiling — you'll just hit the same ceiling at slightly higher concurrency and pay more for hosting.
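One lightweight way to add the timing instrumentation is to wrap each tool handler before registering it. The sketch below is an assumption-laden illustration: `withTiming`, `timings`, and `slowestTools` are hypothetical names, and how handlers are registered depends on your server code.

```javascript
// Tool name -> array of observed durations in ms.
const timings = new Map();

// Wrap a tool handler so every call records its duration, including failures.
function withTiming(toolName, handler) {
  return async (...args) => {
    const start = performance.now();
    try {
      return await handler(...args);
    } finally {
      const list = timings.get(toolName) ?? [];
      list.push(performance.now() - start);
      timings.set(toolName, list);
    }
  };
}

// Summarize per tool, slowest worst-case duration first.
function slowestTools() {
  return [...timings.entries()]
    .map(([tool, ms]) => ({ tool, calls: ms.length, maxMs: Math.max(...ms) }))
    .sort((a, b) => b.maxMs - a.maxMs);
}
```

Calling `slowestTools()` after a load test step gives a quick ranking of which tool to profile first. For long runs you would likely cap or aggregate the stored durations rather than keep every sample.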
How does load testing relate to SLO tracking?
Your SLO defines the acceptable latency threshold. Your load test finds the session ceiling, the concurrency up to which that SLO is still met. AliveMCP tracks your SLO compliance in production: it records whether each probe response falls within your target latency, and alerts when the error budget starts burning. Load testing tells you how much headroom you have; AliveMCP tells you whether you're burning through it in production.
Further reading
- MCP server testing — protocol compliance and schema snapshots
- MCP server performance — latency profiling and optimization
- MCP server reliability — MTTD, MTTR, and error budget
- MCP server SLO — defining and tracking latency targets
- MCP server timeout configuration
- MCP server deployment — post-deploy verification
- AliveMCP — continuous external monitoring that measures real-world initialize latency