MCP server load testing

Load testing an MCP server differs from load testing an HTTP API because MCP sessions are stateful: each load session must complete the initialize handshake before it can send tool calls. You can't reuse connections across sessions the way you would with a REST endpoint. The right metric isn't requests per second; it's the number of concurrent active sessions the server sustains before P99 tool-call latency exceeds your acceptable threshold (typically 2–5 seconds).

TL;DR

Measure concurrent sessions, not RPS. Build a load harness that opens N sessions in parallel, each completing initialize → one or more tool calls → session close, and tracks per-session latency percentiles. Find the session ceiling: the N where P99 tool-call latency first exceeds your SLO. Use that ceiling to set your auto-scaling trigger (or your manual scaling decision). After the load test, compare the initialize latency distribution against AliveMCP's probe history — production probe latency is your ongoing canary for regression.

Why RPS is the wrong metric

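Traditional HTTP load testing tools (k6, Locust, JMeter) are designed around requests-per-second: ramp up virtual users, each sending rapid HTTP requests, measure throughput and latency. MCP session semantics break this model: every session must complete the initialize handshake before it can send a tool call, and connections can't be reused across sessions.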

Measure what clients actually experience: the latency distribution of each tool call under N concurrent sessions. Vary N from 1 up to your expected peak concurrency, increasing in steps of 5 or 10.
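As a small illustration of that ramp, the helper below builds the list of concurrency levels to test. The function name and the default step size are assumptions made for this sketch, not part of any tool.

```javascript
// Build the concurrency levels for a ramp: 1, then multiples of `step`
// below `peak`, then `peak` itself as the final level.
function concurrencySteps(peak, step = 10) {
  const levels = [1];
  for (let n = step; n < peak; n += step) levels.push(n);
  if (peak > 1) levels.push(peak);
  return levels;
}

console.log(concurrencySteps(50, 10)); // [ 1, 10, 20, 30, 40, 50 ]
```

Including 1 as the first level gives you an unloaded baseline to compare every other level against.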

Building a load test harness

Use the official MCP SDK client, which handles the session lifecycle for you. A Node.js harness for N concurrent sessions:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { SSEClientTransport } from '@modelcontextprotocol/sdk/client/sse.js';

async function runSession(serverUrl, toolName, toolArgs) {
  const client = new Client({ name: 'load-test', version: '1' }, { capabilities: {} });
  const transport = new SSEClientTransport(new URL(serverUrl));
  try {
    // connect() performs the initialize handshake, so this times session setup.
    const initStart = Date.now();
    await client.connect(transport);
    const initLatency = Date.now() - initStart;

    const callStart = Date.now();
    const result = await client.callTool({ name: toolName, arguments: toolArgs });
    const callLatency = Date.now() - callStart;

    return { initLatency, callLatency, success: !result.isError };
  } finally {
    // Always close, even after a failure, so failed sessions don't stay open
    // on the server and skew later steps.
    await client.close().catch(() => {});
  }
}

async function loadTest(serverUrl, concurrency, toolName, toolArgs) {
  const sessions = Array.from({ length: concurrency }, () =>
    runSession(serverUrl, toolName, toolArgs)
  );
  const results = await Promise.allSettled(sessions);
  // Only sessions whose tool call succeeded contribute latency samples.
  const ok = results.filter(r => r.status === 'fulfilled' && r.value.success);
  const latencies = ok.map(r => r.value.callLatency).sort((a, b) => a - b);
  // Rejected sessions and tool calls that returned isError both count as errors.
  const errors = results.length - ok.length;

  // Nearest-rank percentile; yields undefined if every session failed.
  const pct = p => latencies[Math.max(0, Math.ceil(latencies.length * p) - 1)];
  const [p50, p95, p99] = [pct(0.5), pct(0.95), pct(0.99)];
  console.log(`N=${concurrency}: p50=${p50}ms p95=${p95}ms p99=${p99}ms errors=${errors}`);
  return { p50, p95, p99, errors };
}

// Ramp up from 1 to 50 concurrent sessions
const SERVER = 'https://your-mcp-server.example.com/mcp';
for (const n of [1, 5, 10, 20, 30, 50]) {
  await loadTest(SERVER, n, 'your_tool_name', { param: 'test-value' });
  await new Promise(r => setTimeout(r, 2000)); // brief pause between steps
}

Run this against a production-representative instance (not localhost — you need network latency in the measurement). Capture the output at each step.

Realistic load profiles

The harness above simulates synchronized sessions: all sessions start at the same moment, which is a worst-case spike. Real traffic arrives spread out over time. For a more realistic test:
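One way to spread arrivals, sketched here under assumptions, is to give each session a random start offset inside a fixed window. The `runStaggered` name, the `windowMs` parameter, and the uniform jitter are choices made for this example; `runOne` stands in for a call such as `() => runSession(SERVER, 'your_tool_name', { param: 'test-value' })`.

```javascript
// Start `count` sessions at random offsets spread across `windowMs`
// instead of all at once. `runOne` is any async function that runs a
// single session and resolves with its measurements (hypothetical parameter).
async function runStaggered(count, windowMs, runOne) {
  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
  const sessions = Array.from({ length: count }, async () => {
    const startOffset = Math.random() * windowMs; // uniform arrival jitter
    await sleep(startOffset);
    const result = await runOne();
    return { startOffset, result };
  });
  // Same result shape as Promise.allSettled in the harness above.
  return Promise.allSettled(sessions);
}
```

Staggering lowers the true peak concurrency: sessions only overlap when a session lasts a meaningful fraction of the window, so report the window size alongside N when you record results.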

Finding the session ceiling

The session ceiling is the number of concurrent sessions where P99 tool-call latency first exceeds your SLO. For most MCP servers with interactive users, the SLO is 2–5 seconds for tool-call P99 latency. Beyond the ceiling, the server is overloaded: latency climbs, errors appear, or both.
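A minimal sketch of that definition follows. It assumes you have collected one `{ concurrency, p99 }` record per ramp step (the harness returns `p99`; attaching `concurrency` is left to the caller), and the sample numbers are illustrative, not measurements.

```javascript
// Return the first concurrency level whose P99 exceeds the SLO, or null if
// the SLO held at every tested level. A step with no P99 at all (every
// session failed) is treated as a breach.
function findSessionCeiling(steps, sloMs) {
  const ordered = [...steps].sort((a, b) => a.concurrency - b.concurrency);
  const firstBreach = ordered.find(s => s.p99 === undefined || s.p99 > sloMs);
  return firstBreach ? firstBreach.concurrency : null;
}

// Illustrative data only.
const sampleSteps = [
  { concurrency: 1, p99: 120 },
  { concurrency: 10, p99: 340 },
  { concurrency: 20, p99: 900 },
  { concurrency: 30, p99: 2600 },
  { concurrency: 50, p99: 7100 },
];
console.log(findSessionCeiling(sampleSteps, 2000)); // 30
```

A `null` result means the ramp never reached the ceiling, so extend the ramp to higher N before drawing conclusions.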

From your load test output, plot P99 latency versus concurrency. The curve usually shows a knee, where latency starts to climb, followed by a cliff, where it rises sharply.

Set your auto-scaling trigger (or capacity planning target) at the knee: the point before the cliff, with enough headroom to scale out before real traffic reaches it. If your ceiling is 20 concurrent sessions before P99 exceeds 2s, set your Kubernetes HorizontalPodAutoscaler (HPA) to scale at 15 sessions per replica.
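The arithmetic behind that example can be sketched as below. The 0.75 headroom factor is an assumption chosen because it reproduces the 20 to 15 example; pick your own factor based on how quickly new replicas become ready. The function and parameter names are hypothetical.

```javascript
// Derive a per-replica scaling trigger from a measured session ceiling and
// estimate how many replicas an expected peak would need at that trigger.
function scalingPlan(ceiling, expectedPeak, headroom = 0.75) {
  const trigger = Math.max(1, Math.floor(ceiling * headroom));
  const replicasAtPeak = Math.ceil(expectedPeak / trigger);
  return { trigger, replicasAtPeak };
}

console.log(scalingPlan(20, 60)); // { trigger: 15, replicasAtPeak: 4 }
```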

Common bottlenecks and their signatures:

Load test results vs AliveMCP probe data

Your load test gives you a point-in-time measurement of initialize latency at various concurrency levels. AliveMCP gives you a continuous time series of initialize latency from a single external probe under real-world conditions (varying DNS resolution time, TLS handshake variability, network jitter).

Compare the two:
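One way to make the comparison concrete is to put the same percentiles side by side. This sketch assumes you have two plain arrays of initialize latencies in milliseconds: `initLatency` values from one concurrency level of the harness, and values exported from your probe history. How you export probe data is not covered here, so that input is hypothetical.

```javascript
// Nearest-rank percentile over an array of millisecond samples.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.max(0, Math.ceil(sorted.length * p) - 1)];
}

// Compare load-test initialize latency against probe initialize latency.
// A ratio well above 1 means the load test saw slower handshakes than the
// single external probe does in production.
function compareInitLatency(loadTestInitMs, probeInitMs) {
  const summary = {};
  for (const p of [0.5, 0.95, 0.99]) {
    const load = percentile(loadTestInitMs, p);
    const probe = percentile(probeInitMs, p);
    summary[`p${Math.round(p * 100)}`] = { load, probe, ratio: load / probe };
  }
  return summary;
}
```

Run it once per concurrency level: the level at which the ratio starts to grow is a rough indication of where handshake cost begins to depend on load.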

Common failure modes under load

Failure modes that appear only under concurrent load:

See MCP server timeout configuration and MCP server reliability for mitigation strategies.

Related questions

Can I use k6 or Locust to load test an MCP server?

k6 and Locust are HTTP load testing tools. They can send HTTP requests to your MCP server's HTTP/SSE endpoint, but they don't implement MCP themselves, so out of the box they can't complete the initialize handshake or hold sessions open. You can write a custom k6 extension or Locust task that simulates the session lifecycle, but it's easier to use the MCP SDK client in a Node.js or Python script, as shown above. For very high concurrency (thousands of sessions), use k6 with a custom module or write a Go-based harness using the MCP Go client.

How long should I run a load test?

At least 5 minutes at each concurrency level. Short tests (under 2 minutes) miss GC pressure in Node.js (which builds up over minutes), database connection pool behavior (which stabilizes after a warm-up period), and rate limiting on downstream APIs (which may allow burst traffic for the first 60 seconds). To identify memory leaks, run for 30 minutes at your expected peak concurrency and measure RSS at the start and end; any growth beyond the expected per-session overhead indicates a leak.
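The RSS comparison can be sketched as simple arithmetic. The per-session overhead is an estimate you supply, and the function name and input shape are assumptions for this example. Inside a Node.js server process, RSS can be sampled with `process.memoryUsage().rss`.

```javascript
// Compare RSS growth over a soak test with the growth explained by sessions
// that are still open. Sizes are in bytes; `perSessionBytes` is your own
// estimate of memory held per open session.
function unexplainedRssGrowth({ rssStart, rssEnd, openSessions, perSessionBytes }) {
  const growth = rssEnd - rssStart;
  const expected = openSessions * perSessionBytes;
  return Math.max(0, growth - expected);
}

const MiB = 1024 * 1024;
// Illustrative numbers only: 180 MiB of growth, 50 MiB explained by sessions.
console.log(
  unexplainedRssGrowth({
    rssStart: 200 * MiB,
    rssEnd: 380 * MiB,
    openSessions: 50,
    perSessionBytes: 1 * MiB,
  }) / MiB
); // 130
```

A single start and end pair is noisy because garbage collection timing affects RSS, so sampling several times during the run gives a more trustworthy trend.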

What should I do if my server hits the ceiling before expected traffic?

Profile first. Add timing instrumentation to each tool call and look for the slow path. Common findings: one tool call takes 5× longer than others under load (optimize or offload it); the database connection pool is exhausted (increase pool size or add read replicas); Node.js event loop is blocked by a synchronous operation (move to worker threads or use an async library). If profiling shows the bottleneck is CPU, add replicas. If it's memory, find the leak. Don't add replicas without profiling — you'll just hit the same ceiling at slightly higher concurrency and pay more for hosting.
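A minimal sketch of per-tool timing instrumentation is shown below. It wraps an async handler and records durations by tool name; the `withTiming` and `slowestTools` names and the in-memory `Map` are assumptions for illustration, and how you register handlers depends on your server code.

```javascript
// Durations recorded per tool: tool name -> array of milliseconds.
const toolTimings = new Map();

// Wrap an async tool handler so every call records its duration, including
// calls that throw.
function withTiming(toolName, handler) {
  return async (...args) => {
    const start = performance.now();
    try {
      return await handler(...args);
    } finally {
      const elapsed = performance.now() - start;
      if (!toolTimings.has(toolName)) toolTimings.set(toolName, []);
      toolTimings.get(toolName).push(elapsed);
    }
  };
}

// Summarize recorded timings, slowest mean duration first.
function slowestTools() {
  return [...toolTimings.entries()]
    .map(([name, ms]) => ({
      name,
      calls: ms.length,
      meanMs: ms.reduce((a, b) => a + b, 0) / ms.length,
    }))
    .sort((a, b) => b.meanMs - a.meanMs);
}
```

Collect these timings during a load test step rather than at idle: the tool that is slowest under load is often not the one that is slowest in isolation.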

How does load testing relate to SLO tracking?

Your SLO defines the acceptable latency threshold. Your load test finds the session ceiling: the concurrency up to which that SLO is still met. AliveMCP tracks your SLO compliance in production: it records whether each probe response falls within your target latency, and alerts when the error budget starts burning. Load testing tells you how much headroom you have; AliveMCP tells you whether you're burning through it in production.
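To show what burning an error budget means numerically, here is a sketch under assumptions. The probe record shape `{ latencyMs, ok }` and the function name are hypothetical and do not describe AliveMCP's actual data format.

```javascript
// probes: array of { latencyMs, ok } records (hypothetical shape).
// targetMs: latency threshold from your SLO.
// sloTarget: required fraction of good probes, e.g. 0.99.
function errorBudget(probes, targetMs, sloTarget) {
  const bad = probes.filter(p => !p.ok || p.latencyMs > targetMs).length;
  const allowedBad = probes.length * (1 - sloTarget);
  return {
    compliance: (probes.length - bad) / probes.length,
    // 1.0 means the whole budget for this window has been used.
    budgetConsumed: allowedBad === 0 ? (bad > 0 ? Infinity : 0) : bad / allowedBad,
  };
}
```

For example, with 100 probes, a 0.95 target, and 2 slow or failed probes, compliance is 0.98 and roughly 40% of the budget for that window is consumed.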

Further reading