Testing pyramid · 2026-06-26 · MCP Server Testing & QA

The MCP Server Testing Pyramid: Integration Tests, Acceptance Tests, Test Infrastructure, and Production Monitoring

Most MCP server testing guides start with a unit test that calls a handler function directly and end with "add more tests." That's not a testing strategy — it's the first rung of a four-layer pyramid. The layers above it — integration tests that exercise the full MCP protocol stack, acceptance tests written from the LLM's perspective, and production probing that catches what every in-process test misses — each catch a class of bug the layer below them cannot see. This post synthesizes the infrastructure and practice behind all four layers: the mock client factory that makes integration and acceptance tests maintainable, the test doubles pattern that makes them fast, parallel execution that keeps CI times reasonable as the suite grows, and the production gap that only external monitoring can close.

The four-layer pyramid at a glance

Each layer in the pyramid covers a different scope and catches a different class of bug. A complete test suite uses all four — removing any one leaves a specific failure class with no detector:

Layer	Scope	Speed	Bug class caught	Bug class missed
Handler unit test	Single tool handler function	~1ms	Handler logic, input validation, output formatting	Tool registration, protocol routing, wiring bugs
Integration test (InMemoryTransport)	Full Server + Client over MCP protocol	~10ms	Tool registration, handler routing, schema declarations, protocol handshake	Network, TLS, HTTP binding, process lifecycle
Acceptance test (scenario)	Multi-step workflows, LLM-facing behavior	~50ms	Tool description accuracy, error LLM-readability, cross-tool consistency, realistic workflows	Production infrastructure, deployment environment
Production probe (AliveMCP)	Live deployed endpoint over real network	~200ms	TCP reachability, TLS validity, MCP handshake over network, process crash, cloud deployment failure	Handler logic, tool behavior (tests the infrastructure, not the code)

The critical insight is the layering direction: each lower layer is faster and more granular; each higher layer covers infrastructure that the lower layers cannot reach. Handler unit tests are the right place to verify that a get_user tool returns the right JSON. They are the wrong place to verify that get_user is registered under the correct name in the tool manifest. That verification belongs at the integration layer, one rung up.
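
As a minimal sketch of the bottom layer, assume the get_user handler is written as a plain function that receives its dependencies and arguments. The names getUserHandler, UserLookup, and ToolResult are illustrative and not part of the MCP SDK. A unit-level check then calls the function directly, with no Server or Client involved:

```typescript
// Hypothetical shapes for illustration; a real server would import its own types.
interface User { id: string; name: string; email: string; }
interface UserLookup { getUser(id: string): Promise<User | null>; }
interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

// The handler under test: pure logic, no MCP protocol code.
async function getUserHandler(
  db: UserLookup,
  args: { userId: string },
): Promise<ToolResult> {
  const user = await db.getUser(args.userId);
  if (!user) {
    return {
      isError: true,
      content: [{ type: 'text', text: `User not found (ID: ${args.userId}).` }],
    };
  }
  return { content: [{ type: 'text', text: JSON.stringify(user) }] };
}

// A unit-level check: call the function directly with an in-memory lookup.
async function runHandlerUnitChecks(): Promise<string[]> {
  const alice: User = { id: 'u1', name: 'Alice', email: 'alice@example.com' };
  const db: UserLookup = { getUser: async (id) => (id === 'u1' ? alice : null) };

  const found = await getUserHandler(db, { userId: 'u1' });
  const missing = await getUserHandler(db, { userId: 'nope' });
  return [found.content[0].text, String(missing.isError)];
}
```

Nothing in this check proves that the tool is registered under the name get_user or that its schema is advertised correctly. Those questions belong to the integration layer.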

Foundation: dependency injection and test doubles

The integration and acceptance layers are only fast if your server accepts its dependencies from outside rather than constructing them internally. A server that creates its own database pool at startup forces every integration test to either connect to a real database or monkey-patch the module. Neither scales. The prerequisite for a scalable test suite is a Deps interface and a factory function that accepts it:

// server.ts — accept all external dependencies from the caller
import { Server } from '@modelcontextprotocol/sdk/server/index.js';

// Minimal User shape, assumed from how the tests in this post seed and read users
export interface User {
  id: string;
  name: string;
  email: string;
}

export interface Deps {
  db: {
    getUser: (id: string) => Promise<User | null>;
    saveUser: (user: User) => Promise<void>;
  };
  email: {
    sendWelcome: (to: string) => Promise<void>;
  };
}

export function createServer(deps: Deps): Server {
  const server = new Server(
    { name: 'user-service', version: '1.0.0' },
    { capabilities: { tools: {} } },
  );
  // ... register handlers using deps.db.getUser, deps.email.sendWelcome
  return server;
}

With this shape, the production bootstrap passes real implementations; every test passes a test double. The three double types serve different purposes in an MCP test suite:

Double type	What it is	When to use for MCP
Fake	A working mini-implementation — an in-memory Map that behaves like a database	Default. Use for all handler dependencies across the entire test suite.
Stub	A function that returns a hardcoded value without inspecting its inputs	Single-use dependencies: a pricing API that always returns $9.99 in tests.
Spy	A wrapper that records call arguments and count	Side-effect assertions: verify that a webhook fired with the correct payload.

For databases specifically, the createFakeDb() factory pattern is the most useful form: a function that returns a fresh fake on each call, so tests that create or delete records start from a clean state without resetting a real database between runs:

// test/fakes/fake-db.ts
export function createFakeDb() {
  const users = new Map<string, User>();

  return {
    async getUser(id: string): Promise<User | null> {
      return users.get(id) ?? null;
    },
    async saveUser(user: User): Promise<void> {
      users.set(user.id, user);
    },
    // Utility for test setup
    _seed(user: User) { users.set(user.id, user); },
    _all() { return [...users.values()]; },
  };
}

export type FakeDb = ReturnType<typeof createFakeDb>;

The _seed and _all helpers (prefixed with _ to signal they're test-only) let acceptance tests set up "given" state without calling tools to create it. A test that verifies the delete_user tool can seed the user directly into the fake, call the tool, and then verify the fake's state — without depending on create_user working correctly first.

For email sending, a spy is the right double: you don't need it to do anything, you need to assert it was called with the correct address after a create_user tool call:

import { vi } from 'vitest';

const emailSpy = { sendWelcome: vi.fn().mockResolvedValue(undefined) };

// After calling create_user:
expect(emailSpy.sendWelcome).toHaveBeenCalledWith('alice@example.com');
expect(emailSpy.sendWelcome).toHaveBeenCalledTimes(1);
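
The stub row in the table has no example yet, so here is a minimal sketch. The PricingApi interface and the describePrice function are hypothetical, invented for illustration. The point is only that a stub ignores its inputs and returns a fixed value:

```typescript
// Hypothetical dependency: a pricing service that a handler would call.
interface PricingApi {
  getPrice(sku: string): Promise<number>;
}

// Stub: returns a hardcoded value without inspecting its input.
function createPricingStub(fixedPrice = 9.99): PricingApi {
  return { getPrice: async (_sku: string) => fixedPrice };
}

// Hypothetical handler logic that depends on the pricing service.
async function describePrice(pricing: PricingApi, sku: string): Promise<string> {
  const price = await pricing.getPrice(sku);
  return `${sku} costs $${price.toFixed(2)}`;
}
```

A stub like this suits a dependency whose answer does not matter to the test. Once a test needs the dependency to remember writes, a fake is the better fit.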

One pattern to avoid: mocking the MCP SDK itself (vi.mock('@modelcontextprotocol/sdk')). The SDK is the protocol — mocking it defeats the purpose of the integration layer. Use real Server, real Client, and real InMemoryTransport everywhere. Use test doubles only for the external dependencies that your handlers call.

The mock client factory: eliminating protocol plumbing from every test

Every integration test needs the same four lines: create a linked transport pair, connect the server, create a client, connect the client. Repeated across twenty test files, that's eighty lines of identical boilerplate — and twenty places to forget afterAll(() => client.close()). The mock client factory extracts all of this into a single helper:

// test/helpers/mcp-client.ts
import { afterAll, expect } from 'vitest';  // expect is used by assertSchemaIncludes
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import type { Server } from '@modelcontextprotocol/sdk/server/index.js';

export async function createMcpTestClient<TDeps>(
  serverFactory: (deps: TDeps) => Server,
  deps: TDeps,
): Promise<McpTestClient> {
  const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
  const server = serverFactory(deps);
  await server.connect(serverTransport);

  const client = new Client(
    { name: 'mock-test-client', version: '1.0.0' },
    { capabilities: {} },
  );
  await client.connect(clientTransport);

  afterAll(async () => { await client.close(); });

  return new McpTestClient(client);
}

export class McpTestClient {
  constructor(private readonly client: Client) {}

  listTools() { return this.client.listTools(); }
  callTool(name: string, args: Record<string, unknown>) {
    return this.client.callTool({ name, arguments: args });
  }

  // Typed convenience: extract first text content block
  async callToolText(name: string, args: Record<string, unknown>): Promise<string> {
    const result = await this.callTool(name, args);
    const block = result.content[0];
    if (!block || block.type !== 'text') throw new Error('No text content in response');
    return block.text;
  }

  // Schema assertion: verify a tool's inputSchema contains expected properties
  async assertSchemaIncludes(
    toolName: string,
    partial: Record<string, unknown>,
  ): Promise<void> {
    const { tools } = await this.listTools();
    const tool = tools.find(t => t.name === toolName);
    if (!tool) throw new Error(`Tool ${toolName} not found in listTools response`);
    expect(tool.inputSchema).toMatchObject(partial);
  }

  close() { return this.client.close(); }
}

Usage in a test file reduces to two lines of setup:

// user-service.test.ts
import { describe, it, beforeAll, expect } from 'vitest';
import { createMcpTestClient } from '../test/helpers/mcp-client.js';
import { createServer } from './server.js';
import { createFakeDb } from '../test/fakes/fake-db.js';

describe('user-service integration', () => {
  let client: Awaited<ReturnType<typeof createMcpTestClient>>;
  const fakeDb = createFakeDb();

  beforeAll(async () => {
    client = await createMcpTestClient(createServer, {
      db: fakeDb,
      email: { sendWelcome: async () => {} },
    });
  });

  it('lists the get_user and create_user tools', async () => {
    const { tools } = await client.listTools();
    expect(tools.map(t => t.name)).toEqual(
      expect.arrayContaining(['get_user', 'create_user']),
    );
  });

  it('get_user returns isError for unknown ID', async () => {
    const result = await client.callTool('get_user', { userId: 'does-not-exist' });
    expect(result.isError).toBe(true);
  });

  it('get_user inputSchema declares userId as required string', async () => {
    await client.assertSchemaIncludes('get_user', {
      type: 'object',
      properties: { userId: { type: 'string' } },
      required: ['userId'],
    });
  });
});

The callToolText() wrapper eliminates the cast noise ((result.content[0] as TextContent).text) that appears in almost every assertion. The assertSchemaIncludes() helper turns schema regression tests into one-liners — if a future change removes userId from the required array, this test fails before the deploy reaches production.

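The afterAll registered inside the factory is meant as a safety net so that tests do not have to remember client.close(). A hook registered while a describe block is being collected runs after that block, and one registered at the module level runs after all tests in the file. The examples in this post call the factory from inside beforeAll or beforeEach, however, and a hook registered from within an already running hook may not be picked up by the runner. In those cases, keep an explicit afterAll or afterEach that calls client.close(), as the acceptance tests later in this post do.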

Integration tests: the full protocol stack without a network

An integration test that uses the factory above exercises the complete MCP protocol path: the initialize handshake fires when client.connect() is called, capabilities are negotiated, and every subsequent listTools() or callTool() travels through the actual request dispatcher in your server. What it skips is the network: InMemoryTransport routes messages in-process with no HTTP, no TCP, no TLS.

This in-process skip is a feature, not a gap. It means integration tests run at ~10ms per test — fast enough to run on every commit without slowing down the CI loop. The network layer belongs at the E2E tier (spawning a real server process with SSEClientTransport) or at the production monitoring layer. The integration layer's job is to catch wiring bugs: wrong handler name, missing tool registration, incorrect inputSchema format, handler that throws instead of returning isError: true.

The most common integration bugs that unit tests miss:

Tool name mismatch: the handler is registered under 'getUser' but the description says get_user — the CallToolRequest for get_user hits the unknown-tool path. client.callTool({ name: 'get_user', ... }) catches this; a direct handler call never exercises the routing logic.
Missing tool registration: a handler function was written but never connected to the server via setRequestHandler. client.listTools() returns a list that does not include the tool. Direct handler calls never reveal this.
Incorrect inputSchema: the server advertises required: ['user_id'] but the handler reads request.params.arguments.userId (camelCase). The tool is registered correctly; the advertised schema is wrong. assertSchemaIncludes catches this.
isError protocol violation: a handler throws rather than returning { isError: true, content: [...] } — the SDK propagates the exception as an MCP protocol error, not a tool-level error. Integration tests that assert result.isError === true on error paths catch this; unit tests that call the handler directly may never see the MCP error shape at all.
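
One way to guard against the last bug in this list is to wrap each handler so that thrown exceptions become tool-level errors. This is a sketch under assumptions: ToolResult is a simplified stand-in for the SDK's result type, and withToolErrors is a hypothetical helper, not something the MCP SDK provides.

```typescript
// Simplified stand-in for the SDK's tool result type.
interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

type ToolHandler<A> = (args: A) => Promise<ToolResult>;

// Converts any thrown exception into an isError result
// so the caller sees a tool-level error instead of a protocol error.
function withToolErrors<A>(handler: ToolHandler<A>): ToolHandler<A> {
  return async (args: A) => {
    try {
      return await handler(args);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { isError: true, content: [{ type: 'text', text: message }] };
    }
  };
}

// Example: a handler that throws when the record is missing.
const getUserUnsafe: ToolHandler<{ userId: string }> = async ({ userId }) => {
  throw new Error(`User not found (ID: ${userId}).`);
};
const getUserSafe = withToolErrors(getUserUnsafe);
```

An integration test should still assert result.isError through the client, since the wrapper only helps if every registered handler actually uses it.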

For servers with real database dependencies, the integration tests above run against fakes. When you want final confidence that the handler queries are correct, add a small separate suite that runs against a real database container in CI using GitHub Actions services:

# .github/workflows/test.yml (excerpt)
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: testdb
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        ports: ['5432:5432']
        options: --health-cmd pg_isready --health-interval 10s --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm ci
      - run: npm test                          # fast fake-based suite
      - run: npm run test:integration          # real-db suite, tagged separately
        env:
          DATABASE_URL: postgres://testuser:testpass@localhost:5432/testdb

Keep the fast fake-based suite as the gating check that runs on every commit. The real-db suite can run on PRs or nightly instead; the excerpt above runs both in one job only for brevity. The split keeps commit feedback fast while still validating against real infrastructure before changes reach main.

Acceptance tests: the LLM's perspective

Integration tests ask "is the tool wired correctly?" Acceptance tests ask "does calling this tool help an LLM accomplish the user's goal?" These are different questions. A tool can have correct handler logic and correct wiring and still fail acceptance — because its error messages are opaque to an LLM, its description promises behavior the implementation doesn't deliver, or a realistic multi-step workflow breaks at the boundary between tools.

The Given/When/Then pattern from behavior-driven development maps naturally to this layer. "Given" sets up the server state. "When" makes a tool call (or a sequence of calls). "Then" asserts the observable result from the client's perspective — not an internal state check:

// user-service.acceptance.test.ts
import { describe, it, beforeEach, afterEach, expect } from 'vitest';
import { createMcpTestClient, type McpTestClient } from '../test/helpers/mcp-client.js';
import { createServer } from './server.js';
import { createFakeDb, type FakeDb } from '../test/fakes/fake-db.js';

describe('create_user acceptance', () => {
  let client: McpTestClient;
  let fakeDb: FakeDb;

  beforeEach(async () => {
    fakeDb = createFakeDb();   // fresh state per test
    client = await createMcpTestClient(createServer, {
      db: fakeDb,
      email: { sendWelcome: async () => {} },
    });
  });

  afterEach(() => client.close());

  it('returns a human-readable confirmation after creating a user', async () => {
    // When
    const text = await client.callToolText('create_user', {
      name: 'Alice',
      email: 'alice@example.com',
    });

    // Then: confirmation is readable, not a raw JSON dump
    expect(text).toMatch(/alice/i);
    expect(text).toMatch(/created/i);
    // Then: the user actually exists in the backing store
    const saved = fakeDb._all();
    expect(saved).toHaveLength(1);
    expect(saved[0].email).toBe('alice@example.com');
  });

  it('returns an LLM-actionable error when email is already taken', async () => {
    // Given: alice already exists
    fakeDb._seed({ id: 'u1', name: 'Alice', email: 'alice@example.com' });

    // When: create_user is called with the same email
    const result = await client.callTool('create_user', {
      name: 'Alice Duplicate',
      email: 'alice@example.com',
    });

    // Then: isError is true
    expect(result.isError).toBe(true);
    // Then: the error message explains what went wrong and how to recover
    const errorText = (result.content[0] as { text: string }).text;
    expect(errorText).toMatch(/email/i);
    expect(errorText).toMatch(/already|exists|taken/i);
    // The LLM can read this and decide to use a different email or fetch the existing user
  });
});

The key difference from integration tests: the assertion target is the text content that an LLM would read and reason from, not just whether isError is the right boolean. An error that returns isError: true with text: 'constraint_violation' passes an integration test but fails acceptance — because an LLM receiving that message cannot determine whether it should retry with a different email, fetch the existing user, or report an internal error to the user.
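
To make the contrast concrete, here is a small sketch of building an error message an LLM can act on. The toolError helper and its three fields are assumptions chosen for illustration. The structure (what went wrong, which input, how to recover) is one reasonable convention, not an MCP requirement.

```typescript
// Simplified stand-in for the SDK's tool result type.
interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

interface ErrorDetails {
  problem: string;     // what went wrong
  identifier: string;  // which input triggered it
  recovery: string;    // what the caller can do next
}

// Hypothetical helper that assembles an LLM-readable error result.
function toolError({ problem, identifier, recovery }: ErrorDetails): ToolResult {
  return {
    isError: true,
    content: [{ type: 'text', text: `${problem} (${identifier}). ${recovery}` }],
  };
}

const opaque = 'constraint_violation';
const actionable = toolError({
  problem: 'A user with this email already exists',
  identifier: 'email: alice@example.com',
  recovery: 'Use a different email, or call list_users to find the existing account.',
}).content[0].text;

// The same checks the acceptance test applies to the error text.
function passesAcceptance(text: string): boolean {
  return /email/i.test(text) && /already|exists|taken/i.test(text);
}
```

Both strings would arrive with isError set to true, so an integration test cannot tell them apart. Only the text-level assertion separates the opaque message from the actionable one.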

Three acceptance test classes that every production MCP server should have:

Description accuracy: verify that each tool's description matches what it does. The description is what the LLM reads to decide whether to call a tool — a description that says "fetches a user by ID" when the tool actually queries by email causes the LLM to send the wrong argument type.
Error LLM-readability: for every isError: true path, assert that the error message contains the information an LLM needs to self-correct. At minimum: what went wrong (e.g., "User not found"), what identifier was used (e.g., "ID: xyz-123"), and what the caller can do (e.g., "Use list_users to find valid IDs").
Multi-step scenarios: test realistic workflows that chain multiple tool calls. A "create and then retrieve" scenario catches the bug where create_user succeeds but the returned ID format doesn't match what get_user accepts — a cross-tool consistency bug that no single-tool test can detect.

it('create_user then get_user returns the same user', async () => {
  // Given: no existing users

  // When: create a user and capture the ID from the response
  const createText = await client.callToolText('create_user', {
    name: 'Bob',
    email: 'bob@example.com',
  });
  const created = JSON.parse(createText);   // assumes create_user returns a JSON text payload that includes the new id
  expect(created.id).toBeTruthy();

  // When: retrieve the user using the ID from the creation response
  const fetchText = await client.callToolText('get_user', { userId: created.id });
  const fetched = JSON.parse(fetchText);

  // Then: the fetched user matches the created user
  expect(fetched.name).toBe('Bob');
  expect(fetched.email).toBe('bob@example.com');
});

This kind of round-trip test is the highest-value acceptance test for a CRUD MCP server. It validates the complete data flow — write, read, consistency — in a single scenario that mirrors what an LLM agent actually does when operating over your tools.

Parallel testing: scaling the suite without slowing CI

As the test suite grows to hundreds of tests across dozens of files, wall-clock time becomes the constraint. MCP tests are unusually easy to parallelize because InMemoryTransport creates isolated in-process connections with no shared network state. Two test files can each run their own Server and Client instances simultaneously without port conflicts:

Shared resource concern	With InMemoryTransport
Port conflicts (two servers on port 3000)	No port — transport is in-process
Shared database state	Each test creates a new fake database instance via `createFakeDb()`
Network saturation	No network — all messages route in memory
Process startup time	No process — server is a JS object created synchronously
TLS certificate	No TLS — transport skips the HTTP layer entirely

Vitest runs test files in parallel by default. Recent versions default to forked child processes, and pool: 'threads' opts into worker threads instead. The configuration below is a reasonable starting point:

// vitest.config.ts
import { defineConfig } from 'vitest/config';
import os from 'os';

export default defineConfig({
  test: {
    pool: 'threads',
    poolOptions: {
      threads: {
        maxThreads: Math.max(1, os.cpus().length - 1),
        minThreads: 1,
      },
    },
    sequence: { shuffle: true },  // catch ordering dependencies early
  },
});

The main parallelism pitfall is module-level shared state — a const db = createFakeDb() at the top of a test file, shared across all tests in that file. Tests in the same file run in the same worker and share the module scope. If test A creates a user and test B calls list_users, test B's result depends on whether test A ran first. The fix is to move createFakeDb() into beforeEach (a fresh database per test) or into the beforeAll of a describe block with the test's known initial state:

// Unsafe: module-level fake shared across all tests in the file
const db = createFakeDb();  // ← don't do this

// Safe: fresh fake per test
describe('get_user', () => {
  let client: McpTestClient;

  beforeEach(async () => {
    const db = createFakeDb();  // fresh per test
    client = await createMcpTestClient(createServer, {
      db,
      email: { sendWelcome: async () => {} },  // Deps also requires an email sender
    });
  });

  afterEach(() => client.close());
});
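
The ordering dependency is easy to reproduce without a test runner. The sketch below uses a cut-down, synchronous version of the fake shown earlier (only saveUser and _all) and two plain functions standing in for tests. It shows how a shared instance leaks state while a fresh instance per test does not:

```typescript
interface User { id: string; name: string; email: string; }

// Cut-down fake: each call returns an independent in-memory store.
function createFakeDb() {
  const users = new Map<string, User>();
  return {
    saveUser(user: User): void { users.set(user.id, user); },
    _all(): User[] { return Array.from(users.values()); },
  };
}

type FakeDb = ReturnType<typeof createFakeDb>;

// "Test A" writes a user; "test B" expects to start from an empty store.
function testA(db: FakeDb): void {
  db.saveUser({ id: 'u1', name: 'Alice', email: 'alice@example.com' });
}
function testB(db: FakeDb): number {
  return db._all().length;
}

// Shared instance: B's result depends on whether A ran first.
const shared = createFakeDb();
testA(shared);
const countWithSharedDb = testB(shared);

// Fresh instance per test: B always starts from a clean state.
testA(createFakeDb());
const countWithFreshDb = testB(createFakeDb());
```

With sequence.shuffle enabled, the shared variant fails intermittently rather than consistently, which is what makes this class of bug expensive to track down later.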

For CI with a large suite (roughly 500 tests or more, once wall-clock time passes a couple of minutes), split the run across multiple GitHub Actions jobs using Vitest's --shard flag:

# .github/workflows/test.yml — matrix sharding
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm ci
      - run: npx vitest run --shard=${{ matrix.shard }}/4

Four shards running in parallel on four GitHub Actions runners take roughly one quarter of the sequential test time. For a 500-test suite that takes 2 minutes sequentially, each shard runs about 30 seconds of tests, plus per-job setup such as checkout and npm ci. That is still fast enough to stay in the commit feedback loop.
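
The arithmetic behind that estimate, as a small sketch. The function name and the per-job overhead figure are assumptions for illustration. Real overhead depends on checkout and install time, and shards are rarely perfectly balanced.

```typescript
// Estimate per-shard wall-clock time: test time divides across shards,
// but each job still pays its own fixed setup cost.
function estimateShardedSeconds(
  sequentialTestSeconds: number,
  shardCount: number,
  perJobOverheadSeconds: number,
): number {
  if (shardCount < 1) throw new Error('shardCount must be at least 1');
  return sequentialTestSeconds / shardCount + perJobOverheadSeconds;
}

// 2 minutes of tests across 4 shards, ignoring setup.
const testTimeOnly = estimateShardedSeconds(120, 4, 0);

// Same suite, assuming 20 seconds of setup per job.
const withSetup = estimateShardedSeconds(120, 4, 20);
```

Because the setup cost does not shrink with more shards, adding shards gives diminishing returns once per-shard test time approaches the setup time.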

During development, vitest --watch re-runs only the tests affected by changed files. For a module that 20 test files import, watch mode limits the re-run to those 20 files rather than the full suite, keeping feedback in seconds even for large codebases.

Production monitoring: the fourth layer that closes the pyramid

The three test layers above all run before deployment. They verify that your code is correct. None of them can verify that your deployed server is reachable. This is the pyramid's open top, the failure class that every in-process test leaves uncovered:

TCP unreachable: the cloud deployment succeeded but the process is listening on port 3000 while the load balancer forwards to port 8080. Every test passes. No real MCP client can connect.
TLS certificate expired: the certificate expired over the weekend. InMemoryTransport has no TLS. Your CI runs on port 3000 with no certificate. The production error is invisible until a client reports it on Monday morning.
MCP protocol broken post-deploy: an environment variable controls which transport class is instantiated. In production, the wrong value causes the server to start a plain HTTP server instead of the SSE transport. The health check endpoint returns 200; the MCP initialize handshake returns nothing intelligible.
Process crash loop: the process starts, fails to connect to the database (wrong password in the production secret), and crashes. A new process starts, fails again. The health check endpoint returns 200 during the startup window before the crash. From inside CI, the server looks healthy.

These failure classes share a property: they require a real network client making a real MCP protocol call to detect. AliveMCP probes the live initialize handshake over the network every 60 seconds. This is not just a TCP ping or an HTTP health check but the full MCP protocol sequence: connect, negotiate capabilities, call tools/list, and optionally call a sentinel tool. The probe's failure taxonomy covers the first three classes directly and adds a timeout reason for slow or degraded servers:

Failure class	AliveMCP failure_reason	What the probe does differently from `/health`
TCP unreachable	`connection_refused`	Connects on the MCP endpoint port, not the health check port
TLS expired or invalid	`tls_error`	Validates the TLS certificate chain before sending any data
Protocol broken	`protocol_error`	Sends a real MCP `initialize` request and parses the response as JSON-RPC 2.0
Slow / degraded	`timeout`	Times out if the handshake exceeds the threshold (default 10s)

The timing matters: a 60-second probe interval means any of these failures is detected within one minute of occurring. The /health endpoint your platform calls on startup typically isn't called again until the next deploy. AliveMCP's continuous probing closes the gap between deploy-time health checks and real client experience.
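
For a sense of what a protocol-level check involves beyond an HTTP 200, the sketch below builds an MCP initialize request and validates the shape of the response. It is illustrative only and is not AliveMCP's implementation. The function names, the client name, and the protocolVersion string are assumptions, and transport details (HTTP, SSE, sessions) are left out.

```typescript
interface JsonRpcRequest {
  jsonrpc: '2.0';
  id: number;
  method: string;
  params: Record<string, unknown>;
}

// Builds the JSON-RPC 2.0 message that opens an MCP session.
function buildInitializeRequest(id: number): JsonRpcRequest {
  return {
    jsonrpc: '2.0',
    id,
    method: 'initialize',
    params: {
      protocolVersion: '2024-11-05', // assumed version string for this sketch
      capabilities: {},
      clientInfo: { name: 'example-probe', version: '0.0.1' },
    },
  };
}

// Returns true only for a well-formed JSON-RPC 2.0 success response
// that carries the fields an initialize result is expected to have.
function isValidInitializeResponse(rawBody: string, expectedId: number): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return false; // for example, an HTML error page served with status 200
  }
  if (typeof parsed !== 'object' || parsed === null) return false;
  const msg = parsed as Record<string, unknown>;
  if (msg.jsonrpc !== '2.0' || msg.id !== expectedId) return false;
  const result = msg.result;
  if (typeof result !== 'object' || result === null) return false;
  const fields = result as Record<string, unknown>;
  return typeof fields.protocolVersion === 'string'
    && typeof fields.serverInfo === 'object' && fields.serverInfo !== null;
}
```

A server that answers /health with 200 but returns an HTML page or a JSON-RPC error to initialize fails this check, which is the gap a plain health endpoint leaves open.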

The relationship between the testing pyramid layers and AliveMCP is complementary, not competitive. Your unit tests verify handler logic. Your integration tests verify wiring. Your acceptance tests verify user-facing behavior. AliveMCP verifies that the deployed server your users actually call is alive, reachable, and speaking valid MCP — the question that no test running in CI can answer.

Putting the pyramid together: a practical setup checklist

The complete setup for the four-layer pyramid in a TypeScript MCP server:

Dependency injection first. Refactor your createServer() to accept a Deps interface. This unblocks all subsequent layers.
Write createFakeDb() for each external data store your handlers use. Keep fakes in test/fakes/ alongside production code in src/.
Create test/helpers/mcp-client.ts with the createMcpTestClient factory and typed helpers (callToolText, assertSchemaIncludes).
Write integration tests that cover: tool names appear in listTools(), each tool's inputSchema is correct, happy-path call returns a response, error-path call returns isError: true with non-empty content.
Write acceptance tests for each non-trivial user scenario: create-then-retrieve round trips, multi-tool workflows, and every isError: true path (asserting that the error message is LLM-readable, not just present).
Configure Vitest for parallel execution with pool: 'threads' and verify no module-level shared state across tests in the same file.
Add CI sharding via --shard=N/M once the suite exceeds ~500 tests and wall-clock time exceeds 2 minutes.
Register your production endpoint with AliveMCP to close the pyramid's open top: continuous protocol probing of the live endpoint with alerts on connection_refused, tls_error, protocol_error, and timeout.

Each step is independently valuable — the dependency injection refactor pays for itself even if you never write an acceptance test. But the full pyramid, with all four layers operational, gives you something no individual layer provides: confidence that the server you shipped is the server that works for the real LLM clients calling it in production.