MCP server benchmarking
Measuring MCP server performance requires matching the benchmarking tool to what you want to measure. An InMemoryTransport microbenchmark isolates tool-handler logic from network overhead and is the right choice when you want to compare two handler implementations or measure the cost of a library call. An autocannon or k6 load test against the live HTTP/SSE endpoint measures end-to-end latency including the transport, OS networking stack, and any middleware. Neither replaces production monitoring — but both give you numbers to optimize against.
TL;DR
For handler latency: create an InMemoryTransport linked pair, warm up the JIT with 500+ calls, then time 10,000 iterations with performance.now() and report p50/p95/p99 using a percentile function. For HTTP/SSE transport latency: use autocannon -c 10 -d 30 http://localhost:3000/sse, keeping in mind that an SSE GET holds a long-lived stream open, so a POST to the server's message endpoint usually gives more meaningful per-call numbers. Report both numbers when evaluating optimizations: handler time and transport time are independent, and either can dominate depending on the scenario.
What to benchmark and why it matters
Before running any benchmark, define what you're measuring and what decision it will inform. Common MCP benchmarking goals:
| Goal | What to measure | Tool |
|---|---|---|
| Compare two handler implementations | Per-call handler latency (no network) | InMemoryTransport + performance.now() |
| Find the throughput ceiling | Max requests/sec before latency climbs | autocannon with concurrency sweep |
| Validate an optimization | p99 before and after the change | InMemoryTransport benchmark in CI |
| Set an SLO | p99 at target concurrency over 60s | autocannon or k6 with percentile reporting |
| Profile a specific function | CPU ticks in the function | --prof or 0x |
Benchmarking is only meaningful when you have a hypothesis to test. "The server is slow" is not a hypothesis. "The search handler is slower than the get handler because it parses a 500KB JSON corpus on every call" is — and a benchmark can confirm or refute it.
InMemoryTransport microbenchmark
The MCP SDK's InMemoryTransport creates an in-process linked pair that runs the full MCP protocol (initialize handshake, tools/list, tools/call) without any network. Tool-call round-trips through InMemoryTransport complete in microseconds on modern hardware, so the latency you measure is almost entirely your handler code — not the transport.
// benchmark/handler-bench.ts
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import { createServer } from '../src/server.js';

function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

async function runBenchmark(name: string, toolName: string, args: Record<string, unknown>) {
  const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
  const server = createServer();
  await server.connect(serverTransport);
  const client = new Client({ name: 'bench', version: '1.0.0' }, { capabilities: {} });
  await client.connect(clientTransport);

  // JIT warmup: critical for accurate V8 measurements
  const WARMUP = 500;
  for (let i = 0; i < WARMUP; i++) {
    await client.callTool({ name: toolName, arguments: args });
  }

  // Timed run
  const ITERATIONS = 10_000;
  const times: number[] = [];
  for (let i = 0; i < ITERATIONS; i++) {
    const start = performance.now();
    await client.callTool({ name: toolName, arguments: args });
    times.push(performance.now() - start);
  }

  times.sort((a, b) => a - b);
  console.log(`\n=== ${name} (${ITERATIONS} calls) ===`);
  console.log(`  p50: ${percentile(times, 50).toFixed(3)} ms`);
  console.log(`  p95: ${percentile(times, 95).toFixed(3)} ms`);
  console.log(`  p99: ${percentile(times, 99).toFixed(3)} ms`);
  console.log(`  max: ${times[times.length - 1].toFixed(3)} ms`);
  console.log(`  ops/s: ${(1000 / percentile(times, 50)).toFixed(0)}`);
  await client.close();
}

async function main() {
  await runBenchmark('search_documents', 'search_documents', { query: 'test query', page: 1 });
  await runBenchmark('get_document', 'get_document', { id: 'doc-001' });
}

main();
npx tsx benchmark/handler-bench.ts
=== search_documents (10000 calls) ===
p50: 0.412 ms
p95: 1.830 ms
p99: 4.211 ms
max: 23.441 ms
ops/s: 2427
=== get_document (10000 calls) ===
p50: 0.051 ms
p95: 0.112 ms
p99: 0.188 ms
max: 1.204 ms
ops/s: 19608
This immediately surfaces the problem: search_documents is 8× slower at p50 and 22× slower at p99 than get_document. The max spike of 23ms while p99 is 4ms suggests occasional GC pauses or a deoptimization event — worth profiling with --prof to investigate.
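The same comparison can be scripted so it does not rely on reading numbers by eye. The following is a minimal sketch: the `Summary` shape and both helper names are hypothetical, and the max/p99 cutoff of 5 is an arbitrary illustration rather than an established rule.

```typescript
// Hypothetical summary shape mirroring the benchmark's printed output (ms).
interface Summary {
  p50: number;
  p95: number;
  p99: number;
  max: number;
}

// Ratio of each percentile between a slow and a fast handler.
function slowdown(slow: Summary, fast: Summary): { p50: number; p99: number } {
  return { p50: slow.p50 / fast.p50, p99: slow.p99 / fast.p99 };
}

// Flags a run whose max sits far above p99, which hints at rare stalls
// (GC pauses, deoptimization) rather than uniformly slow code.
// The default cutoff of 5 is an assumption chosen for illustration.
function hasTailSpike(s: Summary, cutoff = 5): boolean {
  return s.max / s.p99 > cutoff;
}

// Numbers copied from the sample output above.
const search: Summary = { p50: 0.412, p95: 1.83, p99: 4.211, max: 23.441 };
const get: Summary = { p50: 0.051, p95: 0.112, p99: 0.188, max: 1.204 };

const ratio = slowdown(search, get);
console.log(`p50 ${ratio.p50.toFixed(1)}x, p99 ${ratio.p99.toFixed(1)}x`);
console.log(`search tail spike: ${hasTailSpike(search)}`);
```

A flagged tail spike is only a prompt to profile; it does not identify the cause by itself.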
Benchmarking HTTP/SSE transport with autocannon
For MCP servers using SSEServerTransport or StreamableHTTPServerTransport, test the full stack including the HTTP layer. autocannon is a Node.js HTTP benchmarker that reports latency percentiles alongside throughput.
npm install -g autocannon
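# Basic: 10 concurrent connections for 30 seconds.
# Note: a GET on a long-lived SSE stream does not complete like a normal request,
# so treat this as a connection-level check and use the POST form below for per-call latency.
autocannon -c 10 -d 30 http://localhost:3000/sse
# Send a tool-call body for POST benchmarks (Streamable HTTP).
# Streamable HTTP servers generally expect the Accept header below, and stateful
# servers may also require an initialized session before they accept tools/call.
autocannon -c 10 -d 30 \
  -m POST \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -b '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"search_documents","arguments":{"query":"test"}}}' \
  http://localhost:3000/mcp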
Key autocannon output fields to read:
| Field | What it means |
|---|---|
| Latency p50 | Typical request — most users experience this |
| Latency p97.5 | Worst 2.5% — if this is >10× p50, you have outliers |
| Latency p99 | SLO target — set an alert if production p99 exceeds this |
| Req/sec | Throughput ceiling at this concurrency level |
| Errors / non-2xx | Connection errors and timeouts are reported separately from non-2xx responses; a nonzero count of either means requests are failing under load |
Run autocannon at multiple concurrency levels (1, 10, 50, 100) to find where latency starts climbing while req/sec stops improving; that inflection point is your throughput ceiling. Beyond it, requests arrive faster than the server can process them, so they queue and p99 keeps growing with every added connection.
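Once the sweep results are collected, locating the inflection point can be automated. The sketch below assumes a hypothetical `SweepPoint` record per autocannon run and invented sample numbers; the 10% minimum throughput gain is likewise an assumption to tune. It returns the last concurrency level at which adding connections still bought meaningful throughput.

```typescript
// One row per autocannon run. The shape is hypothetical.
interface SweepPoint {
  connections: number;
  reqPerSec: number;
  p99Ms: number;
}

// Walks the sweep in order of concurrency and returns the level just before
// throughput gains flatten (relative gain below `minGain`) while p99 rises.
// Returns undefined if the sweep never reaches such a point.
function findCeiling(points: SweepPoint[], minGain = 0.1): SweepPoint | undefined {
  const sorted = [...points].sort((a, b) => a.connections - b.connections);
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const cur = sorted[i];
    const gain = (cur.reqPerSec - prev.reqPerSec) / prev.reqPerSec;
    if (gain < minGain && cur.p99Ms > prev.p99Ms) {
      return prev;
    }
  }
  return undefined;
}

// Made-up sample data for illustration only.
const sweep: SweepPoint[] = [
  { connections: 1, reqPerSec: 900, p99Ms: 2 },
  { connections: 10, reqPerSec: 4200, p99Ms: 5 },
  { connections: 50, reqPerSec: 6100, p99Ms: 14 },
  { connections: 100, reqPerSec: 6200, p99Ms: 31 },
];

const ceiling = findCeiling(sweep);
console.log(ceiling ? `ceiling near ${ceiling.connections} connections` : 'no ceiling found');
```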
Vitest bench for per-function microbenchmarks
Vitest includes a built-in benchmark runner (bench) that uses Tinybench under the hood. Use it for testing individual functions — not full MCP round-trips — when you want to compare implementations of a parsing function, a validation step, or a data transformation.
// benchmark/parse.bench.ts
import { bench, describe } from 'vitest';
import { parseDocumentSlow } from '../src/parsers/slow.js';
import { parseDocumentFast } from '../src/parsers/fast.js';

const SAMPLE = JSON.stringify({ id: 'doc-001', body: 'x'.repeat(50_000) });

describe('document parsing', () => {
  bench('slow parser (rebuild schema each call)', () => {
    parseDocumentSlow(SAMPLE);
  });

  bench('fast parser (cached schema)', () => {
    parseDocumentFast(SAMPLE);
  });
});
npx vitest bench
BENCH benchmark/parse.bench.ts
document parsing
name hz min max mean p75 p99 p999 rme samples
· slow parser (rebuild schema each call) 843.11 1.0421 6.5432 1.1860 1.2031 3.1042 6.5432 ±1.23% 422
· fast parser (cached schema) 19504.22 0.0421 1.2011 0.0513 0.0532 0.1042 0.2431 ±0.51% 9752
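23× throughput difference from caching the schema object. This is the kind of result worth guarding against regression. The bench runner reports hz but does not fail on a threshold by itself, so add a CI step that compares hz for the fast path against a stored baseline and fails if it drops too far (recent Vitest versions can write results with --outputJson and diff against them with --compare).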
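A regression guard on throughput boils down to comparing current hz values with a stored baseline. The sketch below assumes a hypothetical, simplified `HzByName` map (benchmark name to operations per second) that you would extract from whatever your benchmark runner outputs; the 10% tolerance is an assumption, not a recommended value.

```typescript
// Hypothetical simplified shape: benchmark name -> operations per second.
type HzByName = Record<string, number>;

// Returns the names whose throughput fell more than `tolerance` (a fraction)
// below the baseline. Benchmarks missing from `current` are also reported.
function findRegressions(baseline: HzByName, current: HzByName, tolerance = 0.1): string[] {
  return Object.keys(baseline).filter((name) => {
    const now: number | undefined = current[name];
    return now === undefined || now < baseline[name] * (1 - tolerance);
  });
}

// Baseline taken from the sample output above; the "current" values are invented.
const baselineHz: HzByName = { 'fast parser': 19504.22, 'slow parser': 843.11 };
const currentHz: HzByName = { 'fast parser': 15000, 'slow parser': 850 };

const regressed = findRegressions(baselineHz, currentHz);
console.log(regressed.length ? `regressed: ${regressed.join(', ')}` : 'no regressions');
```

In CI, a non-empty result would translate into a non-zero exit code for the job.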
Common benchmarking mistakes
| Mistake | Why it matters | Fix |
|---|---|---|
| No JIT warmup | First 200–1000 calls run in the V8 interpreter; times are 2–10× higher | Run 500+ warmup calls before timing |
| Benchmarking too few iterations | GC pauses dominate small sample sets, inflating p99 | Use ≥1000 iterations; 10,000 for stable p99 |
| Benchmarking in debug mode | ts-node/tsx in non-optimized mode is 3–5× slower | Compile to JS with tsc first, then benchmark |
| Sharing state between runs | Cache warm-up in first run benefits subsequent runs | Create fresh InMemoryTransport per benchmark, or flush caches explicitly |
| Not measuring percentiles | Mean conceals tail latency; a p99 spike is user-visible even if mean is fine | Always report p50, p95, p99, max |
| Benchmarking on development machine | Laptop thermal throttling and background processes add noise | Benchmark on the same hardware class as production, or use a CI benchmark runner |
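The "not measuring percentiles" row is easy to demonstrate with synthetic data. This small sketch reuses the nearest-rank percentile function from the handler benchmark; the latency samples are invented for illustration.

```typescript
// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Synthetic latencies (ms): 980 fast calls plus 20 slow outliers.
const samples: number[] = [
  ...Array.from({ length: 980 }, () => 1),
  ...Array.from({ length: 20 }, () => 100),
].sort((a, b) => a - b);

console.log(`mean: ${mean(samples).toFixed(2)} ms`); // 2.98
console.log(`p50:  ${percentile(samples, 50)} ms`); // 1
console.log(`p99:  ${percentile(samples, 99)} ms`); // 100
```

A mean just under 3 ms looks healthy, yet 2% of calls take 100 ms, which only the upper percentiles reveal.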
Connecting benchmarks to SLOs
A benchmark number only has meaning relative to an SLO. If your MCP server performance target is p99 tool-call latency under 200ms, an InMemoryTransport p99 of 4ms means you have 196ms of budget left for network, middleware, and database round-trips. If your p99 is already 180ms in the microbenchmark, only 20ms remains for everything else.
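The budget arithmetic can be written down explicitly. In this sketch the `Component` shape and the network and database figures are hypothetical; only the 200ms target and the 4ms handler p99 come from the text above. Summing per-stage p99 values is a deliberately conservative simplification.

```typescript
interface Component {
  name: string;
  p99Ms: number;
}

// Sums per-component p99 estimates and compares them to the SLO.
// Adding p99s overstates the combined tail, because the worst case of every
// stage rarely lands on the same request, so this errs on the safe side.
function budgetReport(sloMs: number, components: Component[]) {
  const usedMs = components.reduce((sum, c) => sum + c.p99Ms, 0);
  return { usedMs, remainingMs: sloMs - usedMs, withinSlo: usedMs <= sloMs };
}

const report = budgetReport(200, [
  { name: 'handler (InMemoryTransport p99)', p99Ms: 4 },
  { name: 'network (assumed)', p99Ms: 30 },
  { name: 'database (assumed)', p99Ms: 120 },
]);

console.log(`used ${report.usedMs} ms, remaining ${report.remainingMs} ms`);
```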
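Add benchmark regression checks to CI. Vitest's bench runner reports throughput but does not fail on a latency threshold by itself, so the simplest approach is an ordinary Vitest test: run the InMemoryTransport benchmark, assert that p99 is below a threshold, and fail the build if it regresses. This catches performance regressions at code review time, before they reach production.
// benchmark/regression.test.ts
import { expect, test } from 'vitest';
// Hypothetical helper (not part of the SDK or Vitest): assumed to wrap the warmup
// and timed loop from handler-bench.ts and return the p99 latency in milliseconds.
import { measureToolP99 } from './measure.js';

test('search_documents p99 must be under 10ms', async () => {
  const p99 = await measureToolP99('search_documents', { query: 'test query', page: 1 });
  expect(p99).toBeLessThan(10);
}, 60_000); // generous timeout: the timed loop runs thousands of calls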
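For a manual p99 assertion built on performance.now(), a generic timing helper is enough. The sketch below is one possible shape, not an SDK or Vitest API: `measureP99` and `assertP99Below` are hypothetical names, and the stand-in function being timed would be replaced by a real tool call such as `client.callTool(...)`.

```typescript
// Hypothetical helper: times `fn` after a warmup and returns the p99 in ms.
async function measureP99(
  fn: () => Promise<unknown>,
  iterations = 1000,
  warmup = 100,
): Promise<number> {
  for (let i = 0; i < warmup; i++) await fn();
  const times: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await fn();
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  const idx = Math.ceil(0.99 * times.length) - 1; // nearest-rank p99
  return times[Math.max(0, idx)];
}

// Rejects (failing the CI job) when the measured p99 exceeds the threshold.
async function assertP99Below(
  fn: () => Promise<unknown>,
  thresholdMs: number,
  iterations = 1000,
): Promise<void> {
  const p99 = await measureP99(fn, iterations);
  if (p99 > thresholdMs) {
    throw new Error(`p99 ${p99.toFixed(3)} ms exceeds ${thresholdMs} ms`);
  }
}

// Example: gate a stand-in async function at 10 ms.
assertP99Below(async () => JSON.parse('{"ok":true}'), 10)
  .then(() => console.log('p99 within threshold'));
```

Thresholds set this way should leave headroom for CI runner noise, otherwise the check will fail intermittently.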
What benchmarks cannot tell you
An InMemoryTransport benchmark runs in the same process, on the same machine, with no network. It cannot tell you: how the server behaves under real network conditions with connection establishment overhead; whether TLS handshake latency is significant; how the server handles concurrent connections from multiple LLM agents simultaneously; or whether your deployment environment introduces additional latency (cold starts, container CPU limits, shared-tenant database connection pools). Use concurrency testing for multi-client scenarios. Use load testing for realistic network conditions. Use AliveMCP to monitor end-to-end latency continuously in production from an external vantage point.
Related questions
How do I benchmark a stdio-transport MCP server?
Use InMemoryTransport in a separate benchmark script rather than spawning the stdio process. Process startup (fork, pipe setup) is a fixed cost per run rather than per tool call, and in a short microbenchmark it would swamp the handler time you are trying to measure. InMemoryTransport exercises the full MCP protocol path without the process overhead. To test the stdio transport itself (for example, to measure process spawn latency), connect a client through StdioClientTransport, which launches the server as a child process, and time the full lifecycle.
Should I benchmark with real or fake dependencies?
Use fake (in-memory) dependencies for handler microbenchmarks so you're measuring handler logic, not database round-trips. Use real dependencies (or a staging database) for end-to-end benchmarks where you want to know the full-stack number. Fake dependencies let you isolate regressions in handler code; real dependencies let you set SLOs that include realistic I/O latency. Both are useful — just be explicit about which you're running.
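One common way to make that switch cheap is to inject the dependency into the handler. The following sketch uses hypothetical names throughout (`DocumentStore`, `InMemoryStore`, `makeGetDocumentHandler`); it illustrates the pattern and is not code from the MCP SDK.

```typescript
// Hypothetical storage interface the handler depends on.
interface DocumentStore {
  get(id: string): Promise<{ id: string; body: string } | undefined>;
}

// In-memory fake: no I/O, so timing it measures handler logic only.
class InMemoryStore implements DocumentStore {
  private docs = new Map<string, { id: string; body: string }>();

  constructor(seed: { id: string; body: string }[]) {
    for (const d of seed) this.docs.set(d.id, d);
  }

  async get(id: string) {
    return this.docs.get(id);
  }
}

// Hypothetical handler factory: the store is injected, so the same handler can
// run against the fake in microbenchmarks and a real database in end-to-end runs.
function makeGetDocumentHandler(store: DocumentStore) {
  return async (args: { id: string }) => {
    const doc = await store.get(args.id);
    return doc
      ? { content: [{ type: 'text', text: doc.body }] }
      : { content: [{ type: 'text', text: `not found: ${args.id}` }], isError: true };
  };
}

const handler = makeGetDocumentHandler(new InMemoryStore([{ id: 'doc-001', body: 'hello' }]));
handler({ id: 'doc-001' }).then((result) => console.log(result.content[0].text));
```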
Why is my benchmark result different every time I run it?
Variance in microbenchmark results comes from GC pauses, V8 deoptimization events, OS scheduler interruptions, and thermal throttling. For stable results: run on a quiet machine (no other load), use enough iterations (≥1000 after warmup), report percentiles rather than the mean (GC noise lands in max and the upper percentiles and skews the mean, while p50 stays comparatively stable), and run the benchmark multiple times and take the median run. Run-to-run variation under 5% for p50 is acceptable; variation over 20% suggests the benchmark is too short or the environment is too noisy.
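The 5% and 20% guidelines can be checked mechanically across repeated runs. The sketch below makes one assumption explicit: it measures variation as (max − min) / median of the per-run p50 values, which is only one of several reasonable definitions. The sample numbers are invented.

```typescript
// Median of a list of numbers (does not mutate the input).
function median(values: number[]): number {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 0 ? (s[mid - 1] + s[mid]) / 2 : s[mid];
}

// Relative spread of repeated runs: (max - min) / median.
// This definition of "variation" is an assumption made for this sketch.
function relativeSpread(values: number[]): number {
  return (Math.max(...values) - Math.min(...values)) / median(values);
}

type Verdict = 'stable' | 'borderline' | 'noisy';

// Applies the 5% / 20% guidelines to the p50 of each run.
function classify(p50PerRun: number[]): Verdict {
  const spread = relativeSpread(p50PerRun);
  if (spread < 0.05) return 'stable';
  if (spread > 0.2) return 'noisy';
  return 'borderline';
}

// Invented p50 values (ms) from five runs of the same benchmark.
const runs = [0.412, 0.405, 0.418, 0.409, 0.415];
console.log(`median p50: ${median(runs)} ms, verdict: ${classify(runs)}`);
```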
Further reading
- MCP server profiling — CPU flame graphs to find what's slow
- MCP server performance — latency budgets and SLO design
- MCP server latency — p99 measurement and reduction techniques
- MCP server worker threads — CPU-intensive tool offloading
- MCP server concurrency — concurrent tool calls and back-pressure
- MCP server load testing — realistic protocol load generation
- MCP server unit testing — InMemoryTransport for automated tests
- AliveMCP — continuous production latency monitoring from an external probe