
MCP server benchmarking

Measuring MCP server performance starts with matching the benchmarking tool to the question you want answered. An InMemoryTransport microbenchmark isolates tool-handler logic from network overhead and is the right choice for comparing two handler implementations or measuring the cost of a library call. An autocannon or k6 load test against the live HTTP/SSE endpoint measures end-to-end latency, including the transport, the OS networking stack, and any middleware. Neither replaces production monitoring, but both give you numbers to optimize against.

TL;DR

For handler latency: create an InMemoryTransport linked pair, warm up the JIT with 500+ calls, then time 10,000 iterations with performance.now() and report p50/p95/p99 using a percentile function. For HTTP transport latency: use autocannon -c 10 -d 30 with a POST tools/call body against the server's HTTP endpoint (a plain GET on http://localhost:3000/sse holds the event stream open, so it mostly measures connection setup). Report both numbers when evaluating optimizations: handler time and transport time are independent, and either one can dominate depending on the workload.

What to benchmark and why it matters

Before running any benchmark, define what you're measuring and what decision it will inform. Common MCP benchmarking goals:

Goal	What to measure	Tool
Compare two handler implementations	Per-call handler latency (no network)	InMemoryTransport + performance.now()
Find the throughput ceiling	Max requests/sec before latency climbs	autocannon with concurrency sweep
Validate an optimization	p99 before and after the change	InMemoryTransport benchmark in CI
Set an SLO	p99 at target concurrency over 60s	autocannon or k6 with percentile reporting
Profile a specific function	CPU ticks in the function	--prof or 0x

Benchmarking is only meaningful when you have a hypothesis to test. "The server is slow" is not a hypothesis. "The search handler is slower than the get handler because it parses a 500KB JSON corpus on every call" is — and a benchmark can confirm or refute it.

InMemoryTransport microbenchmark

The MCP SDK's InMemoryTransport creates an in-process linked pair that runs the full MCP protocol (initialize handshake, tools/list, tools/call) without any network. Tool-call round-trips through InMemoryTransport complete in microseconds on modern hardware, so the latency you measure is almost entirely your handler code — not the transport.

// benchmark/handler-bench.ts
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import { createServer } from '../src/server.js';

function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

async function runBenchmark(name: string, toolName: string, args: Record<string, unknown>) {
  const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
  const server = createServer();
  await server.connect(serverTransport);

  const client = new Client({ name: 'bench', version: '1.0.0' }, { capabilities: {} });
  await client.connect(clientTransport);

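  // JIT warmup: critical for accurate V8 measurements
  const WARMUP = 500;
  for (let i = 0; i < WARMUP; i++) {
    const result = await client.callTool({ name: toolName, arguments: args });
    // Fail fast; otherwise the timed run would silently measure the error path
    if (result.isError) throw new Error(`${toolName} returned an error during warmup`);
  }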

  // Timed run
  const ITERATIONS = 10_000;
  const times: number[] = [];

  for (let i = 0; i < ITERATIONS; i++) {
    const start = performance.now();
    await client.callTool({ name: toolName, arguments: args });
    times.push(performance.now() - start);
  }

  times.sort((a, b) => a - b);

  console.log(`\n=== ${name} (${ITERATIONS} calls) ===`);
  console.log(`  p50:  ${percentile(times, 50).toFixed(3)} ms`);
  console.log(`  p95:  ${percentile(times, 95).toFixed(3)} ms`);
  console.log(`  p99:  ${percentile(times, 99).toFixed(3)} ms`);
  console.log(`  max:  ${times[times.length - 1].toFixed(3)} ms`);
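  // Sequential throughput estimated from the median latency; it says nothing about concurrent capacity
  console.log(`  ops/s: ${(1000 / percentile(times, 50)).toFixed(0)}`);

  await client.close();
  await server.close();
}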

async function main() {
  await runBenchmark('search_documents', 'search_documents', { query: 'test query', page: 1 });
  await runBenchmark('get_document', 'get_document', { id: 'doc-001' });
}

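// Surface failures instead of leaving an unhandled promise rejection
main().catch((err) => {
  console.error(err);
  process.exit(1);
});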

npx tsx benchmark/handler-bench.ts

=== search_documents (10000 calls) ===
  p50:  0.412 ms
  p95:  1.830 ms
  p99:  4.211 ms
  max:  23.441 ms
  ops/s: 2427

=== get_document (10000 calls) ===
  p50:  0.051 ms
  p95:  0.112 ms
  p99:  0.188 ms
  max:  1.204 ms
  ops/s: 19608

This immediately surfaces the problem: search_documents is about 8× slower at p50 and 22× slower at p99 than get_document. The 23 ms max against a 4 ms p99 suggests occasional GC pauses or a deoptimization event, which is worth profiling with node --prof.
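The ratios quoted above can be reproduced directly from the two result sets. The following is a minimal sketch; the LatencySummary shape and the compareRuns helper are hypothetical names introduced for illustration and are not part of the benchmark script.

```typescript
// Hypothetical summary shape; the fields mirror the benchmark output above.
interface LatencySummary {
  p50: number;
  p95: number;
  p99: number;
  max: number;
}

// Slowdown of `slow` relative to `fast` at p50 and p99, plus the
// max-to-p99 ratio of the slow run as a rough outlier indicator.
function compareRuns(slow: LatencySummary, fast: LatencySummary) {
  return {
    p50Ratio: slow.p50 / fast.p50,
    p99Ratio: slow.p99 / fast.p99,
    maxOverP99: slow.max / slow.p99,
  };
}

// Values copied from the sample output above (milliseconds).
const search: LatencySummary = { p50: 0.412, p95: 1.83, p99: 4.211, max: 23.441 };
const get: LatencySummary = { p50: 0.051, p95: 0.112, p99: 0.188, max: 1.204 };

const ratios = compareRuns(search, get);
console.log(ratios.p50Ratio.toFixed(1)); // 8.1
console.log(ratios.p99Ratio.toFixed(1)); // 22.4
console.log(ratios.maxOverP99.toFixed(1)); // 5.6
```

A max that sits several times above p99 points at rare events rather than a uniformly slow handler, which is why profiling is the suggested next step.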

Benchmarking HTTP/SSE transport with autocannon

For MCP servers using SSEServerTransport or StreamableHTTPServerTransport, test the full stack including the HTTP layer. autocannon is a Node.js HTTP benchmarker that reports latency percentiles alongside throughput.

npm install -g autocannon

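# Basic: 10 concurrent connections for 30 seconds.
# Note: a GET on an SSE endpoint keeps the response stream open, so this run
# exercises connection setup rather than tool-call latency.
autocannon -c 10 -d 30 http://localhost:3000/sse

# Send a tool-call body for POST benchmarks.
# Streamable HTTP servers generally expect an Accept header listing both
# application/json and text/event-stream; stateful servers also expect the
# Mcp-Session-Id header returned by a prior initialize request.
autocannon -c 10 -d 30 \
  -m POST \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -b '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"search_documents","arguments":{"query":"test"}}}' \
  http://localhost:3000/mcp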

Key autocannon output fields to read:

Field	What it means
Latency p50	Typical request; half of all requests finish at or below this
Latency p97.5	Worst 2.5% of requests; if this is >10× p50, you have outliers
Latency p99	SLO target; set an alert if production p99 exceeds this
Req/Sec	Throughput achieved at this concurrency level
Errors, non-2xx	Errors counts failed connections and timeouts; non-2xx responses are reported on a separate line. A nonzero value in either means requests are failing under load

Run autocannon at multiple concurrency levels (1, 10, 50, 100) to find where latency starts climbing while Req/Sec stops rising; that inflection point is your throughput ceiling. Beyond it, requests arrive faster than the server can process them, so they queue and p99 keeps growing with each added connection.
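Spotting the inflection point in a sweep can be automated with a simple heuristic. The sketch below assumes you have copied Req/Sec and p99 from each autocannon run into an array. SweepPoint, findSaturation, and the 10% gain cutoff are hypothetical choices made for illustration, and the numbers are invented rather than measured.

```typescript
// One row per autocannon run; values are copied by hand from its output.
interface SweepPoint {
  connections: number;
  reqPerSec: number;
  p99Ms: number;
}

// Returns the first concurrency level at which throughput gained less than
// `minGain` (fractional) over the previous level while p99 still rose.
// Returns undefined when no such level exists in the sweep.
function findSaturation(points: SweepPoint[], minGain = 0.1): SweepPoint | undefined {
  const sorted = [...points].sort((a, b) => a.connections - b.connections);
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const curr = sorted[i];
    const gain = (curr.reqPerSec - prev.reqPerSec) / prev.reqPerSec;
    if (gain < minGain && curr.p99Ms > prev.p99Ms) return curr;
  }
  return undefined;
}

// Illustrative numbers only, not measurements.
const sweep: SweepPoint[] = [
  { connections: 1, reqPerSec: 900, p99Ms: 2 },
  { connections: 10, reqPerSec: 5200, p99Ms: 4 },
  { connections: 50, reqPerSec: 5400, p99Ms: 21 },
  { connections: 100, reqPerSec: 5350, p99Ms: 48 },
];

console.log(findSaturation(sweep)?.connections); // 50
```

In this made-up sweep, going from 10 to 50 connections adds under 4% throughput while p99 grows fivefold, so the ceiling sits somewhere between those two levels and a finer sweep in that range would narrow it down.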

Vitest bench for per-function microbenchmarks

Vitest includes a built-in benchmark runner (bench) that uses Tinybench under the hood. Use it for testing individual functions — not full MCP round-trips — when you want to compare implementations of a parsing function, a validation step, or a data transformation.

// benchmark/parse.bench.ts
import { bench, describe } from 'vitest';
import { parseDocumentSlow } from '../src/parsers/slow.js';
import { parseDocumentFast } from '../src/parsers/fast.js';

const SAMPLE = JSON.stringify({ id: 'doc-001', body: 'x'.repeat(50_000) });

describe('document parsing', () => {
  bench('slow parser (rebuild schema each call)', () => {
    parseDocumentSlow(SAMPLE);
  });

  bench('fast parser (cached schema)', () => {
    parseDocumentFast(SAMPLE);
  });
});

npx vitest bench

 BENCH  benchmark/parse.bench.ts

  document parsing
    name                                     hz        min        max       mean     p75     p99    p999   rme  samples
  · slow parser (rebuild schema each call)  843.11   1.0421   6.5432   1.1860  1.2031  3.1042  6.5432  ±1.23%     422
  · fast parser (cached schema)          19504.22   0.0421   1.2011   0.0513  0.0532  0.1042  0.2431  ±0.51%    9752

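Caching the schema object yields a 23× throughput difference. This is the kind of result worth committing as a regression guard. Vitest bench reports numbers but does not appear to enforce pass/fail thresholds on its own, so write the results to a file (for example with --outputJson) and have a CI step fail the build if hz for the fast path drops below a threshold.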
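The guard itself can be a small comparison against stored baseline numbers. The sketch below is one possible shape, not a Vitest feature: HzByName, findRegressions, and the 20% tolerance are hypothetical, and extracting hz values from the benchmark output is left to the CI step.

```typescript
// Throughput per benchmark name, in operations per second (the `hz` column).
type HzByName = Record<string, number>;

// Returns the names whose current hz fell more than `tolerance` (fractional)
// below the baseline. Benchmarks missing from `current` count as regressions.
function findRegressions(baseline: HzByName, current: HzByName, tolerance = 0.2): string[] {
  return Object.keys(baseline).filter((name) => {
    const now = current[name];
    return now === undefined || now < baseline[name] * (1 - tolerance);
  });
}

// Baseline taken from the sample output above; the current value is invented.
const baselineHz: HzByName = { 'fast parser (cached schema)': 19504.22 };
const currentHz: HzByName = { 'fast parser (cached schema)': 12000 };

const regressions = findRegressions(baselineHz, currentHz);
console.log(regressions); // [ 'fast parser (cached schema)' ]
```

A CI script would exit with a non-zero status when the returned list is non-empty. A relative tolerance is used instead of an absolute hz floor because absolute throughput varies between runner machines.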

Common benchmarking mistakes

Mistake	Why it matters	Fix
No JIT warmup	First 200–1000 calls run in the V8 interpreter; times are 2–10× higher	Run 500+ warmup calls before timing
Benchmarking too few iterations	GC pauses dominate small sample sets, inflating p99	Use ≥1000 iterations; 10,000 for stable p99
Benchmarking in debug mode	ts-node/tsx in non-optimized mode can be 3–5× slower	Compile to JS with tsc and run with node for numbers you intend to compare; reserve npx tsx for quick checks
Sharing state between runs	Cache warm-up in first run benefits subsequent runs	Create fresh InMemoryTransport per benchmark, or flush caches explicitly
Not measuring percentiles	Mean conceals tail latency; a p99 spike is user-visible even if mean is fine	Always report p50, p95, p99, max
Benchmarking on development machine	Laptop thermal throttling and background processes add noise	Benchmark on the same hardware class as production, or use a CI benchmark runner
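The "Not measuring percentiles" row is easy to demonstrate with synthetic data. The sketch below uses an invented sample, not a measurement, and the same nearest-rank percentile approach as the handler benchmark.

```typescript
// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Synthetic sample: 980 calls at 1 ms and 20 calls at 100 ms.
const samples: number[] = [
  ...new Array<number>(980).fill(1),
  ...new Array<number>(20).fill(100),
].sort((a, b) => a - b);

console.log(mean(samples).toFixed(2)); // 2.98
console.log(percentile(samples, 50)); // 1
console.log(percentile(samples, 99)); // 100
```

A mean of about 3 ms looks healthy, yet 2% of calls take 100 ms. Only the tail percentiles expose that.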

Connecting benchmarks to SLOs

A benchmark number only has meaning relative to an SLO. If your MCP server performance target is p99 tool-call latency under 200ms, an InMemoryTransport p99 of 4ms means you have 196ms of budget left for network, middleware, and database round-trips. If your p99 is already 180ms in the microbenchmark, only 20ms remains for everything else, which is rarely enough.
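The budget arithmetic can be written down as a small helper. This sketch treats percentiles as additive, which is a simplifying assumption: real p99 values from separate layers do not sum exactly. The function name is hypothetical.

```typescript
// Latency budget left for everything outside the handler, in milliseconds.
// A negative result means the handler alone already breaks the SLO.
function remainingBudgetMs(sloP99Ms: number, handlerP99Ms: number): number {
  return sloP99Ms - handlerP99Ms;
}

console.log(remainingBudgetMs(200, 4)); // 196
console.log(remainingBudgetMs(200, 180)); // 20
```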

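Add benchmark regression checks to CI. The simplest approach: run the InMemoryTransport round-trip in a separate Vitest bench file, compare the reported p99 against a threshold in a follow-up step, and fail the build if it regresses. bench() records timings but does not enforce a latency threshold on its own, so the comparison has to be explicit. This catches performance regressions at code review time, before they reach production.

// benchmark/regression.bench.ts
import { bench } from 'vitest';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import { createServer } from '../src/server.js';

const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
await createServer().connect(serverTransport);
const client = new Client({ name: 'bench', version: '1.0.0' }, { capabilities: {} });
await client.connect(clientTransport);

// bench() records timings only; enforce the p99 budget in a follow-up CI step.
bench('search_documents round-trip (p99 budget: 10 ms)', async () => {
  await client.callTool({ name: 'search_documents', arguments: { query: 'test query', page: 1 } });
}, { time: 5000, iterations: 1000 });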
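One way to make the threshold explicit is to time the call yourself and throw when p99 exceeds the budget. The following is a minimal sketch: measureP99 and assertP99Below are hypothetical helpers, and fakeToolCall stands in for a real client.callTool invocation so the example runs without an MCP server.

```typescript
// Times `fn` after a warmup phase and returns the p99 latency in milliseconds.
async function measureP99(
  fn: () => Promise<unknown>,
  iterations = 1000,
  warmup = 100,
): Promise<number> {
  for (let i = 0; i < warmup; i++) await fn();

  const times: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await fn();
    times.push(performance.now() - start);
  }

  times.sort((a, b) => a - b);
  // Nearest-rank p99 over the sorted samples.
  return times[Math.max(0, Math.ceil(0.99 * times.length) - 1)];
}

// Throws when the measured p99 exceeds the budget, which fails a test or CI step.
async function assertP99Below(fn: () => Promise<unknown>, budgetMs: number): Promise<number> {
  const p99 = await measureP99(fn);
  if (p99 > budgetMs) {
    throw new Error(`p99 ${p99.toFixed(3)} ms exceeds budget of ${budgetMs} ms`);
  }
  return p99;
}

// Stand-in for a real call such as
// () => client.callTool({ name: 'search_documents', arguments: { query: 'test query', page: 1 } })
const fakeToolCall = async () => JSON.parse('{"ok":true}');

assertP99Below(fakeToolCall, 10).then((p99) => {
  console.log(`p99 ${p99.toFixed(3)} ms is within the 10 ms budget`);
});
```

Because the helper throws, it works the same way inside a regular test, a standalone script, or a CI step. Keep the budget loose enough to absorb runner noise, or the check will fail intermittently.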

What benchmarks cannot tell you

An InMemoryTransport benchmark runs in the same process, on the same machine, with no network. It cannot tell you: how the server behaves under real network conditions with connection establishment overhead; whether TLS handshake latency is significant; how the server handles concurrent connections from multiple LLM agents simultaneously; or whether your deployment environment introduces additional latency (cold starts, container CPU limits, shared-tenant database connection pools). Use concurrency testing for multi-client scenarios. Use load testing for realistic network conditions. Use AliveMCP to monitor end-to-end latency continuously in production from an external vantage point.