Testing pyramid · 2026-06-26 · MCP Server Testing & QA
The MCP Server Testing Pyramid: Integration Tests, Acceptance Tests, Test Infrastructure, and Production Monitoring
Most MCP server testing guides start with a unit test that calls a handler function directly and end with "add more tests." That's not a testing strategy — it's the first rung of a four-layer pyramid. The layers above it — integration tests that exercise the full MCP protocol stack, acceptance tests written from the LLM's perspective, and production probing that catches what every in-process test misses — each catch a class of bug the layer below them cannot see. This post synthesizes the infrastructure and practice behind all four layers: the mock client factory that makes integration and acceptance tests maintainable, the test doubles pattern that makes them fast, parallel execution that keeps CI times reasonable as the suite grows, and the production gap that only external monitoring can close.
The four-layer pyramid at a glance
Each layer in the pyramid covers a different scope and catches a different class of bug. A complete test suite uses all four — removing any one leaves a specific failure class with no detector:
| Layer | Scope | Speed | Bug class caught | Bug class missed |
|---|---|---|---|---|
| Handler unit test | Single tool handler function | ~1ms | Handler logic, input validation, output formatting | Tool registration, protocol routing, wiring bugs |
| Integration test (InMemoryTransport) | Full Server + Client over MCP protocol | ~10ms | Tool registration, handler routing, schema declarations, protocol handshake | Network, TLS, HTTP binding, process lifecycle |
| Acceptance test (scenario) | Multi-step workflows, LLM-facing behavior | ~50ms | Tool description accuracy, error LLM-readability, cross-tool consistency, realistic workflows | Production infrastructure, deployment environment |
| Production probe (AliveMCP) | Live deployed endpoint over real network | ~200ms | TCP reachability, TLS validity, MCP handshake over network, process crash, cloud deployment failure | Handler logic, tool behavior (tests the infrastructure, not the code) |
The critical insight is the layering direction: each lower layer is faster and more granular; each higher layer covers infrastructure that the lower layers cannot reach. Handler unit tests are the right place to verify that a get_user tool returns the right JSON. They are the wrong place to verify that get_user is registered under the correct name in the tool manifest. That verification belongs at the integration layer, one rung up.
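As a minimal sketch of the bottom layer, the handler can be written as a plain async function and called directly with an inline stand-in for the database lookup. The names handleGetUser, GetUser, and ToolResult, along with the User fields, are illustrative assumptions rather than the server's real code:

```typescript
// Hypothetical shapes for illustration; the real handler and types may differ.
interface User { id: string; name: string; email: string }

interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

type GetUser = (id: string) => Promise<User | null>;

// The handler is a plain async function, so a unit test can call it
// without a Server, a Client, or a transport.
async function handleGetUser(
  getUser: GetUser,
  args: { userId: string },
): Promise<ToolResult> {
  const user = await getUser(args.userId);
  if (!user) {
    return {
      isError: true,
      content: [{ type: 'text', text: `User not found (ID: ${args.userId})` }],
    };
  }
  return { content: [{ type: 'text', text: JSON.stringify(user) }] };
}

// Direct calls with an inline stand-in for the database lookup.
async function demoHandlerUnitTest(): Promise<{ found: ToolResult; missing: ToolResult }> {
  const lookup: GetUser = async (id) =>
    id === 'u1' ? { id: 'u1', name: 'Alice', email: 'alice@example.com' } : null;
  const found = await handleGetUser(lookup, { userId: 'u1' });
  const missing = await handleGetUser(lookup, { userId: 'nope' });
  return { found, missing };
}
```

A test at this layer checks the returned JSON and the error branch. It says nothing about whether the tool is registered under the name get_user, which is why the integration layer exists.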
Foundation: dependency injection and test doubles
The integration and acceptance layers are only fast if your server accepts its dependencies from outside rather than constructing them internally. A server that creates its own database pool at startup forces every integration test to either connect to a real database or monkey-patch the module. Neither scales. The prerequisite for a scalable test suite is a Deps interface and a factory function that accepts it:
// server.ts: accept all external dependencies from the caller
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import type { User } from './types.js'; // assumed location of the User type

export interface Deps {
  db: {
    getUser: (id: string) => Promise<User | null>;
    saveUser: (user: User) => Promise<void>;
  };
  email: {
    sendWelcome: (to: string) => Promise<void>;
  };
}

export function createServer(deps: Deps): Server {
  const server = new Server(
    { name: 'user-service', version: '1.0.0' },
    { capabilities: { tools: {} } },
  );
  // ... register handlers using deps.db.getUser, deps.email.sendWelcome
  return server;
}
With this shape, the production bootstrap passes real implementations; every test passes a test double. The three double types serve different purposes in an MCP test suite:
| Double type | What it is | When to use for MCP |
|---|---|---|
| Fake | A working mini-implementation — an in-memory Map that behaves like a database | Default. Use for all handler dependencies across the entire test suite. |
| Stub | A function that returns a hardcoded value without inspecting its inputs | Single-use dependencies: a pricing API that always returns $9.99 in tests. |
| Spy | A wrapper that records call arguments and count | Side-effect assertions: verify that a webhook fired with the correct payload. |
For databases specifically, the createFakeDb() factory pattern is the most useful form: a function that returns a fresh fake on each call, so tests that create or delete records start from a clean state without resetting a real database between runs:
// test/fakes/fake-db.ts
import type { User } from '../../src/types.js'; // assumed location of the User type

export function createFakeDb() {
  const users = new Map<string, User>();
  return {
    async getUser(id: string): Promise<User | null> {
      return users.get(id) ?? null;
    },
    async saveUser(user: User): Promise<void> {
      users.set(user.id, user);
    },
    // Test-only helpers for seeding and inspecting state
    _seed(user: User) { users.set(user.id, user); },
    _all() { return [...users.values()]; },
  };
}

export type FakeDb = ReturnType<typeof createFakeDb>;
The _seed and _all helpers (prefixed with _ to signal they're test-only) let acceptance tests set up "given" state without calling tools to create it. A test that verifies the delete_user tool can seed the user directly into the fake, call the tool, and then verify the fake's state — without depending on create_user working correctly first.
For email sending, a spy is the right double: you don't need it to do anything, you need to assert it was called with the correct address after a create_user tool call:
import { vi } from 'vitest';
const emailSpy = { sendWelcome: vi.fn().mockResolvedValue(undefined) };
// After calling create_user:
expect(emailSpy.sendWelcome).toHaveBeenCalledWith('alice@example.com');
expect(emailSpy.sendWelcome).toHaveBeenCalledTimes(1);
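The doubles table also lists stubs, which have no example so far. The following is a minimal sketch of a stub and a hand-rolled spy that need no test framework. PricingApi, createPricingStub, and createEmailSpy are hypothetical names introduced for illustration; they are not part of the Deps interface shown earlier:

```typescript
// Hypothetical dependency shapes, introduced only for this example.
interface PricingApi {
  getPrice: (sku: string) => Promise<number>;
}

interface EmailSender {
  sendWelcome: (to: string) => Promise<void>;
}

// Stub: returns a hardcoded value and ignores its input.
function createPricingStub(price = 9.99): PricingApi {
  return { getPrice: async (_sku: string) => price };
}

// Spy: records every call so a test can assert on arguments and count.
function createEmailSpy(): EmailSender & { calls: string[] } {
  const calls: string[] = [];
  return {
    calls,
    async sendWelcome(to: string): Promise<void> {
      calls.push(to);
    },
  };
}
```

The hand-rolled spy trades vi.fn()'s matchers for a plain calls array, which can be convenient when the same double is shared between Vitest tests and other tooling.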
One pattern to avoid: mocking the MCP SDK itself (vi.mock('@modelcontextprotocol/sdk')). The SDK is the protocol — mocking it defeats the purpose of the integration layer. Use real Server, real Client, and real InMemoryTransport everywhere. Use test doubles only for the external dependencies that your handlers call.
The mock client factory: eliminating protocol plumbing from every test
Every integration test needs the same four lines: create a linked transport pair, connect the server, create a client, connect the client. Repeated across twenty test files, that's eighty lines of identical boilerplate — and twenty places to forget afterAll(() => client.close()). The mock client factory extracts all of this into a single helper:
// test/helpers/mcp-client.ts
import { afterAll, expect } from 'vitest'; // expect is used by assertSchemaIncludes
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import type { Server } from '@modelcontextprotocol/sdk/server/index.js';

export async function createMcpTestClient<TDeps>(
  serverFactory: (deps: TDeps) => Server,
  deps: TDeps,
): Promise<McpTestClient> {
  // Linked pair: messages written to one side arrive on the other, in-process
  const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
  const server = serverFactory(deps);
  await server.connect(serverTransport);

  const client = new Client(
    { name: 'mock-test-client', version: '1.0.0' },
    { capabilities: {} },
  );
  await client.connect(clientTransport);

  afterAll(async () => { await client.close(); });
  return new McpTestClient(client);
}

export class McpTestClient {
  constructor(private readonly client: Client) {}

  listTools() { return this.client.listTools(); }

  callTool(name: string, args: Record<string, unknown>) {
    return this.client.callTool({ name, arguments: args });
  }

  // Typed convenience: extract first text content block
  async callToolText(name: string, args: Record<string, unknown>): Promise<string> {
    const result = await this.callTool(name, args);
    const block = result.content[0];
    if (!block || block.type !== 'text') throw new Error('No text content in response');
    return block.text;
  }

  // Schema assertion: verify a tool's inputSchema contains expected properties
  async assertSchemaIncludes(
    toolName: string,
    partial: Record<string, unknown>,
  ): Promise<void> {
    const { tools } = await this.listTools();
    const tool = tools.find(t => t.name === toolName);
    if (!tool) throw new Error(`Tool ${toolName} not found in listTools response`);
    expect(tool.inputSchema).toMatchObject(partial);
  }

  close() { return this.client.close(); }
}
Usage in a test file reduces to two lines of setup:
// user-service.test.ts
import { describe, it, beforeAll, expect } from 'vitest';
import { createMcpTestClient } from '../test/helpers/mcp-client.js';
import { createServer } from './server.js';
import { createFakeDb } from '../test/fakes/fake-db.js';

describe('user-service integration', () => {
  let client: Awaited<ReturnType<typeof createMcpTestClient>>;
  const fakeDb = createFakeDb();

  beforeAll(async () => {
    client = await createMcpTestClient(createServer, {
      db: fakeDb,
      email: { sendWelcome: async () => {} },
    });
  });

  it('lists the get_user and create_user tools', async () => {
    const { tools } = await client.listTools();
    expect(tools.map(t => t.name)).toEqual(
      expect.arrayContaining(['get_user', 'create_user']),
    );
  });

  it('get_user returns isError for unknown ID', async () => {
    const result = await client.callTool('get_user', { userId: 'does-not-exist' });
    expect(result.isError).toBe(true);
  });

  it('get_user inputSchema declares userId as required string', async () => {
    await client.assertSchemaIncludes('get_user', {
      type: 'object',
      properties: { userId: { type: 'string' } },
      required: ['userId'],
    });
  });
});
The callToolText() wrapper eliminates the cast noise ((result.content[0] as TextContent).text) that appears in almost every assertion. The assertSchemaIncludes() helper turns schema regression tests into one-liners — if a future change removes userId from the required array, this test fails before the deploy reaches production.
The afterAll registered inside the factory is the key ergonomic improvement: tests cannot forget to call client.close() because the factory registers it automatically within the test scope. If the factory is called inside a describe block, cleanup runs after that block; if called at the module level, cleanup runs after all tests in the file.
Integration tests: the full protocol stack without a network
An integration test that uses the factory above exercises the complete MCP protocol path: the initialize handshake fires when client.connect() is called, capabilities are negotiated, and every subsequent listTools() or callTool() travels through the actual request dispatcher in your server. What it skips is the network: InMemoryTransport routes messages in-process with no HTTP, no TCP, no TLS.
This in-process skip is a feature, not a gap. It means integration tests run at ~10ms per test — fast enough to run on every commit without slowing down the CI loop. The network layer belongs at the E2E tier (spawning a real server process with SSEClientTransport) or at the production monitoring layer. The integration layer's job is to catch wiring bugs: wrong handler name, missing tool registration, incorrect inputSchema format, handler that throws instead of returning isError: true.
The most common integration bugs that unit tests miss:
- Tool name mismatch: the handler is registered under 'getUser' but the description says get_user, so the CallToolRequest for get_user hits the unknown-tool path. client.callTool({ name: 'get_user', ... }) catches this; a direct handler call never exercises the routing logic.
- Missing tool registration: a handler function was written but never connected to the server via setRequestHandler, so the tool is absent from the client.listTools() response. Direct handler calls never reveal this.
- Incorrect inputSchema: the server advertises required: ['user_id'] but the handler reads request.params.arguments.userId (camelCase). The tool is registered correctly; the advertised schema is wrong. assertSchemaIncludes catches this.
- isError protocol violation: a handler throws rather than returning { isError: true, content: [...] }, and the SDK propagates the exception as an MCP protocol error, not a tool-level error. Integration tests that assert result.isError === true on error paths catch this; unit tests that call the handler directly may never see the MCP error shape at all.
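One way to guard against the last bug class is to wrap handlers so that thrown errors become tool-level error results. The sketch below assumes a simplified ToolResult shape; withToolErrors is a hypothetical helper, not an SDK function:

```typescript
// Minimal result shape for illustration; the SDK's own types are richer.
interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

type ToolHandler<A> = (args: A) => Promise<ToolResult>;

// Hypothetical helper: wraps a handler so any thrown error becomes a
// tool-level error result instead of escaping to the protocol layer.
function withToolErrors<A>(handler: ToolHandler<A>): ToolHandler<A> {
  return async (args: A) => {
    try {
      return await handler(args);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { isError: true, content: [{ type: 'text', text: message }] };
    }
  };
}

// A handler that throws on a missing user.
const throwingGetUser: ToolHandler<{ userId: string }> = async ({ userId }) => {
  throw new Error(`User not found (ID: ${userId})`);
};

// The wrapped version resolves with isError: true instead of rejecting.
const safeGetUser = withToolErrors(throwingGetUser);
```

An integration test that asserts result.isError === true on the error path would pass against safeGetUser and fail against throwingGetUser, which is the distinction the list describes.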
For servers with real database dependencies, the integration tests above run against fakes. When you want final confidence that the handler queries are correct, add a small separate suite that runs against a real database container in CI using GitHub Actions services:
# .github/workflows/test.yml (excerpt)
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        ports: ['5432:5432']
        options: --health-cmd pg_isready --health-interval 10s --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm ci
      - run: npm test # fast fake-based suite
      - run: npm run test:integration # real-db suite, tagged separately
        env:
          DATABASE_URL: postgres://testuser:testpass@localhost:5432/testdb
Keep the fast fake-based suite as the gating check that runs on every commit. The real-db suite can run on PRs or nightly. The split keeps commit feedback fast while still validating against real infrastructure: before a change reaches main if the real-db suite runs on PRs, or within a day if it runs nightly.
Acceptance tests: the LLM's perspective
Integration tests ask "is the tool wired correctly?" Acceptance tests ask "does calling this tool help an LLM accomplish the user's goal?" These are different questions. A tool can have correct handler logic and correct wiring and still fail acceptance — because its error messages are opaque to an LLM, its description promises behavior the implementation doesn't deliver, or a realistic multi-step workflow breaks at the boundary between tools.
The Given/When/Then pattern from behavior-driven development maps naturally to this layer. "Given" sets up the server state. "When" makes a tool call (or a sequence of calls). "Then" asserts the observable result from the client's perspective — not an internal state check:
// user-service.acceptance.test.ts
import { describe, it, beforeEach, afterEach, expect } from 'vitest';
import { createMcpTestClient, type McpTestClient } from '../test/helpers/mcp-client.js';
import { createServer } from './server.js';
import { createFakeDb, type FakeDb } from '../test/fakes/fake-db.js';

describe('create_user acceptance', () => {
  let client: McpTestClient;
  let fakeDb: FakeDb;

  beforeEach(async () => {
    fakeDb = createFakeDb(); // fresh state per test
    client = await createMcpTestClient(createServer, {
      db: fakeDb,
      email: { sendWelcome: async () => {} },
    });
  });

  afterEach(() => client.close());

  it('returns a human-readable confirmation after creating a user', async () => {
    // When
    const text = await client.callToolText('create_user', {
      name: 'Alice',
      email: 'alice@example.com',
    });
    // Then: confirmation is readable, not a raw JSON dump
    expect(text).toMatch(/alice/i);
    expect(text).toMatch(/created/i);
    // Then: the user actually exists in the backing store
    const saved = fakeDb._all();
    expect(saved).toHaveLength(1);
    expect(saved[0].email).toBe('alice@example.com');
  });

  it('returns an LLM-actionable error when email is already taken', async () => {
    // Given: alice already exists
    fakeDb._seed({ id: 'u1', name: 'Alice', email: 'alice@example.com' });
    // When: create_user is called with the same email
    const result = await client.callTool('create_user', {
      name: 'Alice Duplicate',
      email: 'alice@example.com',
    });
    // Then: isError is true
    expect(result.isError).toBe(true);
    // Then: the error message explains what went wrong and how to recover
    const errorText = (result.content[0] as { text: string }).text;
    expect(errorText).toMatch(/email/i);
    expect(errorText).toMatch(/already|exists|taken/i);
    // The LLM can read this and decide to use a different email or fetch the existing user
  });
});
The key difference from integration tests: the assertion target is the text content that an LLM would read and reason from, not just whether isError is the right boolean. An error that returns isError: true with text: 'constraint_violation' passes an integration test but fails acceptance — because an LLM receiving that message cannot determine whether it should retry with a different email, fetch the existing user, or report an internal error to the user.
Three acceptance test classes that every production MCP server should have:
- Description accuracy: verify that each tool's description matches what it does. The description is what the LLM reads to decide whether to call a tool — a description that says "fetches a user by ID" when the tool actually queries by email causes the LLM to send the wrong argument type.
- Error LLM-readability: for every isError: true path, assert that the error message contains the information an LLM needs to self-correct. At minimum: what went wrong (e.g., "User not found"), what identifier was used (e.g., "ID: xyz-123"), and what the caller can do (e.g., "Use list_users to find valid IDs").
- Multi-step scenarios: test realistic workflows that chain multiple tool calls. A "create and then retrieve" scenario catches the bug where create_user succeeds but the returned ID format doesn't match what get_user accepts, a cross-tool consistency bug that no single-tool test can detect.
it('create_user then get_user returns the same user', async () => {
  // Given: no existing users
  // When: create a user and capture the ID from the response
  const createText = await client.callToolText('create_user', {
    name: 'Bob',
    email: 'bob@example.com',
  });
  // Assumes create_user returns a JSON text block that includes the new ID
  const created = JSON.parse(createText);
  expect(created.id).toBeTruthy();

  // When: retrieve the user using the ID from the creation response
  const fetchText = await client.callToolText('get_user', { userId: created.id });
  const fetched = JSON.parse(fetchText);

  // Then: the fetched user matches the created user
  expect(fetched.name).toBe('Bob');
  expect(fetched.email).toBe('bob@example.com');
});
This kind of round-trip test is the highest-value acceptance test for a CRUD MCP server. It validates the complete data flow — write, read, consistency — in a single scenario that mirrors what an LLM agent actually does when operating over your tools.
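The error-readability criteria from earlier in this section (what went wrong, which identifier was used, what the caller can do next) can be turned into a small check. This is a sketch under assumptions: formatToolError and missingErrorParts are hypothetical helper names, and substring matching is only one possible way to test for each part:

```typescript
// Hypothetical helpers; the three fields mirror the readability criteria.
interface ToolErrorParts {
  what: string;        // what went wrong
  identifier: string;  // which identifier was used
  recovery: string;    // what the caller can do next
}

// Builds an error message that carries all three parts.
function formatToolError(parts: ToolErrorParts): string {
  return `${parts.what} (${parts.identifier}). ${parts.recovery}`;
}

// Returns the names of the parts missing from a message, so an
// acceptance test can assert that the returned list is empty.
function missingErrorParts(message: string, expected: ToolErrorParts): string[] {
  const missing: string[] = [];
  if (!message.includes(expected.what)) missing.push('what');
  if (!message.includes(expected.identifier)) missing.push('identifier');
  if (!message.includes(expected.recovery)) missing.push('recovery');
  return missing;
}

const parts: ToolErrorParts = {
  what: 'User not found',
  identifier: 'ID: xyz-123',
  recovery: 'Use list_users to find valid IDs',
};

const readable = formatToolError(parts);
// "User not found (ID: xyz-123). Use list_users to find valid IDs"

const opaque = 'constraint_violation';
// missingErrorParts(opaque, parts) reports all three parts as missing
```

In an acceptance test, the message under test would come from result.content rather than from formatToolError; the formatter is included only to show a message that satisfies the check.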
Parallel testing: scaling the suite without slowing CI
As the test suite grows to hundreds of tests across dozens of files, wall-clock time becomes the constraint. MCP tests are unusually easy to parallelize because InMemoryTransport creates isolated in-process connections with no shared network state. Two test files can each run their own Server and Client instances simultaneously without port conflicts:
| Shared resource concern | With InMemoryTransport |
|---|---|
| Port conflicts (two servers on port 3000) | No port — transport is in-process |
| Shared database state | Each test creates a new fake database instance via createFakeDb() |
| Network saturation | No network — all messages route in memory |
| Process startup time | No process — server is a JS object created synchronously |
| TLS certificate | No TLS — transport skips the HTTP layer entirely |
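Vitest runs test files in parallel by default. Whether the workers are threads or child processes depends on the default pool of the Vitest version in use, so setting pool explicitly makes the behavior predictable. The following configuration is a reasonable starting point: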
// vitest.config.ts
import { defineConfig } from 'vitest/config';
import os from 'os';

export default defineConfig({
  test: {
    pool: 'threads',
    poolOptions: {
      threads: {
        maxThreads: Math.max(1, os.cpus().length - 1),
        minThreads: 1,
      },
    },
    sequence: { shuffle: true }, // catch ordering dependencies early
  },
});
The main parallelism pitfall is module-level shared state — a const db = createFakeDb() at the top of a test file, shared across all tests in that file. Tests in the same file run in the same worker and share the module scope. If test A creates a user and test B calls list_users, test B's result depends on whether test A ran first. The fix is to move createFakeDb() into beforeEach (a fresh database per test) or into the beforeAll of a describe block with the test's known initial state:
// Unsafe: module-level fake shared across all tests in the file
const db = createFakeDb(); // ← don't do this

// Safe: fresh fake per test
describe('get_user', () => {
  let client: McpTestClient;

  beforeEach(async () => {
    const db = createFakeDb(); // fresh per test
    client = await createMcpTestClient(createServer, {
      db,
      email: { sendWelcome: async () => {} }, // Deps also requires email
    });
  });

  afterEach(() => client.close());

  // ... tests go here
});
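The ordering dependency can be reproduced without a test runner. In this sketch, testA and testB are stand-ins for two tests, and the fake is a simplified synchronous variant of createFakeDb with a hypothetical listUsers method:

```typescript
interface User { id: string; name: string }

// Simplified synchronous fake, for illustration only.
function createFakeDb() {
  const users = new Map<string, User>();
  return {
    saveUser(user: User): void { users.set(user.id, user); },
    listUsers(): User[] { return Array.from(users.values()); },
  };
}
type FakeDb = ReturnType<typeof createFakeDb>;

// Stand-ins for two tests: A writes a user, B reports how many users it sees.
const testA = (db: FakeDb): void => db.saveUser({ id: 'u1', name: 'Alice' });
const testB = (db: FakeDb): number => db.listUsers().length;

// Shared fake: what B observes depends on whether A already ran.
function runShared(order: 'A-then-B' | 'B-then-A'): number {
  const db = createFakeDb();
  if (order === 'A-then-B') {
    testA(db);
    return testB(db);
  }
  const seen = testB(db);
  testA(db);
  return seen;
}

// Fresh fake per test: B sees an empty store regardless of order.
function runIsolated(): number {
  testA(createFakeDb());
  return testB(createFakeDb());
}
```

With the shared fake, B sees 1 user when A runs first and 0 when it runs second. With a fresh fake per test, B always sees 0, which is why sequence.shuffle stops producing intermittent failures once the fake moves into beforeEach.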
For CI with a large suite (several hundred tests or more), split the run across multiple GitHub Actions jobs using Vitest's --shard flag:
# .github/workflows/test.yml (matrix sharding)
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm ci
      - run: npx vitest run --shard=${{ matrix.shard }}/4
Four shards running in parallel on four GitHub Actions runners take roughly one quarter of the sequential test time. For a 500-test suite that takes 2 minutes sequentially, each shard runs about 30 seconds of tests. Per-job setup (checkout, npm ci) adds a fixed overhead on top of that, but the total usually stays short enough to remain in the commit feedback loop.
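The arithmetic behind that estimate can be written down as a rough model. It assumes tests split evenly across shards and that every job pays the same fixed setup cost; estimateShardedSeconds is a hypothetical helper, and the 20-second setup figure is an assumed value for illustration:

```typescript
// Rough estimate only: assumes an even split across shards and a fixed
// per-job setup cost (checkout, npm ci) paid before any tests run.
function estimateShardedSeconds(
  sequentialTestSeconds: number,
  shards: number,
  perJobSetupSeconds = 0,
): number {
  if (!Number.isInteger(shards) || shards < 1) {
    throw new RangeError('shards must be a positive integer');
  }
  return sequentialTestSeconds / shards + perJobSetupSeconds;
}

const testTimeOnly = estimateShardedSeconds(120, 4);   // 30
const withSetup = estimateShardedSeconds(120, 4, 20);  // 50, using the assumed 20s setup cost
```

Because the setup cost does not shrink as shards are added, doubling the shard count again saves less than half of the remaining wall-clock time.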
During development, vitest --watch re-runs only the tests affected by changed files. For a module that 20 test files import, watch mode limits the re-run to those 20 files rather than the full suite, keeping feedback in seconds even for large codebases.
Production monitoring: the fourth layer that closes the pyramid
The three test layers above run before deployment. They verify that your code is correct. None of them can verify that your deployed server is reachable. This is the pyramid's open top, the failure class that every in-process test leaves uncovered:
- TCP unreachable: the cloud deployment succeeded but the process is listening on port 3000 while the load balancer forwards to port 8080. Every test passes. No real MCP client can connect.
- TLS certificate expired: the certificate expired over the weekend. InMemoryTransport has no TLS. Your CI runs on port 3000 with no certificate. The production error is invisible until a client reports it on Monday morning.
- MCP protocol broken post-deploy: an environment variable controls which transport class is instantiated. In production, the wrong value causes the server to start an HTTP/1.1 server instead of an SSE server. The health check endpoint returns 200; the MCP initialize handshake returns nothing intelligible.
- Process crash loop: the process starts, fails to connect to the database (wrong password in the production secret), and crashes. A new process starts, fails again. The health check endpoint returns 200 during the startup window before the crash. From inside CI, the server looks healthy.
These failure classes share a property: they require a real network client making a real MCP protocol call to detect. AliveMCP probes the live initialize handshake over the network every 60 seconds. This is not just a TCP ping or an HTTP health check; it is the full MCP protocol sequence: connect, negotiate capabilities, call tools/list, and optionally call a sentinel tool. The probe's failure taxonomy lines up with the failure classes above, with a timeout category for slow or degraded servers:
| Failure class | AliveMCP failure_reason | What the probe does differently from /health |
|---|---|---|
| TCP unreachable | connection_refused | Connects on the MCP endpoint port, not the health check port |
| TLS expired or invalid | tls_error | Validates the TLS certificate chain before sending any data |
| Protocol broken | protocol_error | Sends a real MCP initialize request and parses the response as JSON-RPC 2.0 |
| Slow / degraded | timeout | Times out if the handshake exceeds the threshold (default 10s) |
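To make the table concrete, the sketch below shows a standard MCP initialize request and one possible way to sort probe outcomes into the four failure_reason values. It is illustrative only and is not AliveMCP's implementation: the ProbeOutcome type, the classifyProbe function, the mapping from Node.js error codes, and the assumption that the response body is an already-extracted JSON-RPC message are all hypothetical.

```typescript
// Reason strings come from the table above; everything else is assumed.
type FailureReason = 'connection_refused' | 'tls_error' | 'protocol_error' | 'timeout';

// An MCP initialize request expressed as a JSON-RPC 2.0 message.
const initializeRequest = {
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: {
    protocolVersion: '2024-11-05',
    capabilities: {},
    clientInfo: { name: 'example-probe', version: '0.0.1' },
  },
} as const;

type ProbeOutcome =
  | { kind: 'network_error'; code: string } // e.g. a Node.js error.code value
  | { kind: 'timed_out' }
  | { kind: 'response'; body: string };     // assumed: the raw JSON-RPC message text

// A few certificate-related error codes that Node.js can report.
const TLS_ERROR_CODES = new Set([
  'CERT_HAS_EXPIRED',
  'DEPTH_ZERO_SELF_SIGNED_CERT',
  'UNABLE_TO_VERIFY_LEAF_SIGNATURE',
  'ERR_TLS_CERT_ALTNAME_INVALID',
]);

function classifyProbe(outcome: ProbeOutcome): FailureReason | 'unclassified' | 'ok' {
  if (outcome.kind === 'timed_out') return 'timeout';
  if (outcome.kind === 'network_error') {
    if (outcome.code === 'ECONNREFUSED') return 'connection_refused';
    if (outcome.code === 'ETIMEDOUT') return 'timeout';
    if (TLS_ERROR_CODES.has(outcome.code)) return 'tls_error';
    return 'unclassified';
  }
  try {
    const msg: unknown = JSON.parse(outcome.body);
    if (typeof msg !== 'object' || msg === null) return 'protocol_error';
    const m = msg as { jsonrpc?: unknown; id?: unknown; result?: { protocolVersion?: unknown } };
    // A usable initialize response echoes the request id and reports a protocol version.
    const valid =
      m.jsonrpc === '2.0' &&
      m.id === initializeRequest.id &&
      typeof m.result?.protocolVersion === 'string';
    return valid ? 'ok' : 'protocol_error';
  } catch {
    return 'protocol_error'; // body was not JSON at all, e.g. an HTML error page
  }
}
```

The point of the sketch is the last branch: an endpoint that answers HTTP requests with something other than a valid initialize result would pass a plain /health check but is classified as protocol_error here.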
The timing matters: a 60-second probe interval means any of these failures is detected within one minute of occurring. The /health endpoint your platform calls on startup typically isn't called again until the next deploy. AliveMCP's continuous probing closes the gap between deploy-time health checks and real client experience.
The relationship between the testing pyramid layers and AliveMCP is complementary, not competitive. Your unit tests verify handler logic. Your integration tests verify wiring. Your acceptance tests verify user-facing behavior. AliveMCP verifies that the deployed server your users actually call is alive, reachable, and speaking valid MCP — the question that no test running in CI can answer.
Putting the pyramid together: a practical setup checklist
The complete setup for the four-layer pyramid in a TypeScript MCP server:
- Dependency injection first. Refactor your createServer() to accept a Deps interface. This unblocks all subsequent layers.
- Write createFakeDb() for each external data store your handlers use. Keep fakes in test/fakes/ alongside production code in src/.
- Create test/helpers/mcp-client.ts with the createMcpTestClient factory and typed helpers (callToolText, assertSchemaIncludes).
- Write integration tests that cover: tool names appear in listTools(), each tool's inputSchema is correct, happy-path call returns a response, error-path call returns isError: true with non-empty content.
- Write acceptance tests for each non-trivial user scenario: create-then-retrieve round trips, multi-tool workflows, and every isError: true path (asserting that the error message is LLM-readable, not just present).
- Configure Vitest for parallel execution with pool: 'threads' and verify no module-level shared state across tests in the same file.
- Add CI sharding via --shard=N/M once the suite exceeds ~500 tests and wall-clock time exceeds 2 minutes.
- Register your production endpoint with AliveMCP to close the pyramid's open top: continuous protocol probing of the live endpoint with alerts on connection_refused, tls_error, protocol_error, and timeout.
Each step is independently valuable — the dependency injection refactor pays for itself even if you never write an acceptance test. But the full pyramid, with all four layers operational, gives you something no individual layer provides: confidence that the server you shipped is the server that works for the real LLM clients calling it in production.