MCP server acceptance testing

Unit tests ask: "does this handler function produce the right output?" Acceptance tests ask: "does this tool do what it says it does, from the perspective of the LLM or human calling it?" The difference in framing matters for MCP servers: a tool can have correct handler logic and still fail acceptance — because its error messages are opaque to an LLM, its description doesn't match its behavior, or a realistic multi-step scenario breaks at the boundary between tools. Acceptance tests cover the surface your users see.

TL;DR

Write acceptance tests as end-to-end scenarios using InMemoryTransport and a real Client. Structure each test as Given (setup state), When (call the tool), Then (assert the observable result). Verify that: (1) the tool list matches its documented surface, (2) each tool's description accurately reflects what it does, (3) error messages contain enough information for an LLM to retry or self-correct, and (4) realistic multi-tool workflows complete without unexpected isError responses.

What acceptance tests cover that unit tests don't

MCP server test levels differ in what question they answer:

| Test level | Question | Perspective |
| --- | --- | --- |
| Unit test (handler call) | Does this handler return the right data? | Developer |
| Integration test (InMemoryTransport) | Is the tool wired correctly to the protocol? | Developer |
| Acceptance test (scenario) | Does calling this tool help an LLM accomplish the user's goal? | LLM / user |
| AliveMCP probe (production) | Is the deployed endpoint reachable over the network? | External observer |

Acceptance tests catch the category of bug where the implementation is internally correct but externally wrong: the tool creates a record but the confirmation message says "deleted"; the search tool returns results but in a format the LLM can't parse; the pagination tool works correctly but starts at page 0 when the description says 1.

Given / When / Then structure for MCP tools

The Given/When/Then pattern from Behavior-Driven Development (BDD) maps naturally to MCP tool tests. "Given" sets up the server state. "When" makes a tool call. "Then" asserts the observable result.

// user-service.acceptance.test.ts
import { describe, it, beforeEach, afterEach, expect } from 'vitest';
import { createMcpTestClient, McpTestClient } from '../test/helpers/mcp-client.js';
import { createServer } from './server.js';
import { createFakeDb } from '../test/fakes/fake-db.js';

describe('create_user tool acceptance', () => {
  let client: McpTestClient;
  let fakeDb: ReturnType<typeof createFakeDb>;

  beforeEach(async () => {
    fakeDb = createFakeDb();
    client = await createMcpTestClient(createServer, {
      db: fakeDb,
      email: { sendWelcome: async () => {} },
    });
  });

  afterEach(() => client.close());

  it('creates a user and returns a readable confirmation', async () => {
    // Given: no existing users
    // When: create_user is called with valid name and email
    const text = await client.callToolText('create_user', {
      name: 'Alice',
      email: 'alice@example.com',
    });

    // Then: the response contains the new user ID and name
    const user = JSON.parse(text);
    expect(user).toMatchObject({ name: 'Alice', email: 'alice@example.com' });
    expect(user.id).toMatch(/^[0-9a-f-]{36}$/); // UUID format
  });

  it('returns an LLM-actionable error when email is already registered', async () => {
    // Given: alice is already registered
    await client.callToolText('create_user', { name: 'Alice', email: 'alice@example.com' });

    // When: the same email is used again
    const errorMsg = await client.callToolExpectError('create_user', {
      name: 'Alicia',
      email: 'alice@example.com',
    });

    // Then: the error message tells the LLM what to do next
    expect(errorMsg).toMatch(/already registered/i);
    expect(errorMsg).toMatch(/alice@example\.com/);
  });
});
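
The helper methods used in the tests above (callToolText, callToolExpectError) come from a project-local test helper that this guide does not show. As a rough sketch, they could be thin wrappers over whatever function performs the tool call. The ToolResult shape below mirrors only the text content and isError flag of an MCP tool result; the names makeToolHelpers, CallTool, and extractText are hypothetical and not part of any SDK.

```typescript
// Minimal structural view of an MCP tool result: content items plus an optional isError flag.
type ContentItem = { type: string; text?: string };
type ToolResult = { content: ContentItem[]; isError?: boolean };

// Hypothetical signature for "perform one tool call"; in a real suite this
// would delegate to the MCP client connected over InMemoryTransport.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<ToolResult>;

// Concatenate the text of every text content item, ignoring other content types.
function extractText(result: ToolResult): string {
  return result.content
    .filter((item) => item.type === 'text' && typeof item.text === 'string')
    .map((item) => item.text as string)
    .join('\n');
}

function makeToolHelpers(callTool: CallTool) {
  return {
    // Resolves with the response text; throws if the tool reported an error.
    async callToolText(name: string, args: Record<string, unknown>): Promise<string> {
      const result = await callTool(name, args);
      const text = extractText(result);
      if (result.isError) {
        throw new Error(`Tool '${name}' returned isError: ${text}`);
      }
      return text;
    },
    // Resolves with the error text; throws if the tool unexpectedly succeeded.
    async callToolExpectError(name: string, args: Record<string, unknown>): Promise<string> {
      const result = await callTool(name, args);
      if (!result.isError) {
        throw new Error(`Tool '${name}' was expected to fail but succeeded`);
      }
      return extractText(result);
    },
  };
}
```

With the official TypeScript SDK, the callTool argument would typically wrap client.callTool({ name, arguments: args }). A callToolJson helper can be layered on top by passing the returned text through JSON.parse.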

Testing that error messages are LLM-readable

An LLM calling your MCP tool reads the isError: true response text and decides whether to retry with different arguments, ask the user for clarification, or report failure. Error messages that say "Error: 400" give the LLM no information; messages that say "Invalid email address: 'alice@example' is missing a domain" let the LLM self-correct. Acceptance tests can enforce LLM-readability explicitly.

describe('error message quality', () => {
  it('validation error names the invalid field', async () => {
    const msg = await client.callToolExpectError('create_user', {
      name: '',   // empty — should fail
      email: 'alice@example.com',
    });
    // Must contain the field name so the LLM knows what to fix
    expect(msg.toLowerCase()).toMatch(/name/);
  });

  it('not-found error names the missing resource', async () => {
    const msg = await client.callToolExpectError('get_user', {
      userId: 'nonexistent-id',
    });
    // Must include the attempted ID so the LLM can report it to the user
    expect(msg).toContain('nonexistent-id');
  });

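  it('permission error suggests what the user should do', async () => {
    // Given: the caller lacks permission to delete users (this depends on how
    // the test client is set up; the lifecycle test assumes the opposite)
    const msg = await client.callToolExpectError('delete_user', {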
      userId: 'user-1',
      confirm: true,
    });
    // Good error: "Only administrators can delete users. Contact your admin."
    // Bad error: "403"
    expect(msg.length).toBeGreaterThan(20); // Non-trivially descriptive
  });
});
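
The individual assertions above can be folded into one reusable check. The following is a minimal sketch, assuming a project wants a single definition of "LLM-readable"; the function name findErrorMessageProblems, its options, and the 20-character default are illustrative choices, not an established standard.

```typescript
type ErrorCheckOptions = {
  mustMention?: string[]; // substrings the message must contain (field names, IDs)
  minLength?: number; // shortest message considered descriptive
};

// Returns a list of human-readable problems; an empty list means the message passed.
function findErrorMessageProblems(message: string, options: ErrorCheckOptions = {}): string[] {
  const { mustMention = [], minLength = 20 } = options;
  const problems: string[] = [];
  const trimmed = message.trim();

  // A bare status code such as "403" or "Error: 400" tells the caller nothing.
  if (/^(error:?\s*)?\d{3}$/i.test(trimmed)) {
    problems.push('message is only a status code');
  }
  if (trimmed.length < minLength) {
    problems.push(`message is shorter than ${minLength} characters`);
  }
  // Case-insensitive containment check for each required term.
  const lower = trimmed.toLowerCase();
  for (const term of mustMention) {
    if (!lower.includes(term.toLowerCase())) {
      problems.push(`message does not mention '${term}'`);
    }
  }
  return problems;
}
```

In a test this would read as expect(findErrorMessageProblems(msg, { mustMention: ['name'] })).toEqual([]), so a failure prints every reason the message fell short.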

Multi-step scenario tests

LLMs typically call several tools in sequence to complete a task. Acceptance tests should include these multi-step workflows to catch bugs at tool boundaries — where one tool's output format doesn't match another tool's expected input, or where a create operation returns an ID in a format the downstream tool doesn't accept.

describe('end-to-end user lifecycle', () => {
  // `User` is the server's user type, assumed here to be { id, name, email }.
  it('creates, retrieves, and deletes a user, then confirms the deletion', async () => {
    // Step 1: Create user
    const created = await client.callToolJson<User>(
      'create_user',
      { name: 'Bob', email: 'bob@example.com' }
    );
    expect(created.id).toBeDefined();

    // Step 2: Retrieve the user by the ID returned from step 1
    const retrieved = await client.callToolJson<User>(
      'get_user',
      { userId: created.id }
    );
    expect(retrieved.name).toBe('Bob');

    // Step 3: Delete the user using the same ID
    const deleteMsg = await client.callToolText(
      'delete_user',
      { userId: created.id, confirm: true }
    );
    expect(deleteMsg).toMatch(/deleted/i);

    // Step 4: Verify deletion — get_user should now return isError
    const afterDelete = await client.callToolExpectError(
      'get_user',
      { userId: created.id }
    );
    expect(afterDelete).toMatch(/not found/i);
  });
});
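
When a suite contains many workflows like this one, the step sequencing can be factored out. The following is a sketch under stated assumptions: runScenario, ScenarioStep, and the CallTool signature are hypothetical names, and the tool result is reduced to its text content and isError flag. It stops at the first unexpected isError response and reports which step failed, which is the boundary information a multi-step test is after.

```typescript
type ToolResult = { content: { type: string; text?: string }[]; isError?: boolean };
type CallTool = (name: string, args: Record<string, unknown>) => Promise<ToolResult>;

// One step of a scenario. `args` receives the parsed outputs of earlier steps,
// keyed by step label, so IDs can flow from one tool call into the next.
type ScenarioStep = {
  label: string;
  tool: string;
  args: (previous: Record<string, unknown>) => Record<string, unknown>;
};

type ScenarioOutcome =
  | { ok: true; outputs: Record<string, unknown> }
  | { ok: false; failedStep: string; error: string };

async function runScenario(callTool: CallTool, steps: ScenarioStep[]): Promise<ScenarioOutcome> {
  const outputs: Record<string, unknown> = {};
  for (const step of steps) {
    const result = await callTool(step.tool, step.args(outputs));
    const text = result.content.map((item) => item.text ?? '').join('\n');
    if (result.isError) {
      // Stop at the first unexpected error and report which boundary broke.
      return { ok: false, failedStep: step.label, error: text };
    }
    // Keep JSON output as a parsed value; fall back to raw text for prose responses.
    try {
      outputs[step.label] = JSON.parse(text);
    } catch {
      outputs[step.label] = text;
    }
  }
  return { ok: true, outputs };
}
```

Steps that are expected to fail, such as the final get_user call after a deletion, are better asserted separately, since this runner treats every isError response as a failure.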

Testing tool descriptions

Tool descriptions are part of the LLM-facing surface of your server. The MCP listTools response includes a description field for each tool — LLM clients use it to decide which tool to call. Acceptance tests can verify that descriptions are present, non-trivially descriptive, and actually match the tool's behavior.

describe('tool descriptions', () => {
  let tools: Awaited<ReturnType<McpTestClient['listTools']>>['tools'];

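  // beforeEach rather than beforeAll: `client` is created in a beforeEach hook,
  // and beforeAll hooks run before it exists.
  beforeEach(async () => {
    ({ tools } = await client.listTools());
  });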

  it('every tool has a non-empty description', () => {
    for (const tool of tools) {
      expect(tool.description, `Tool '${tool.name}' has empty description`).toBeTruthy();
      expect(tool.description!.length, `Tool '${tool.name}' description is too short`).toBeGreaterThan(15);
    }
  });

  it('create_user description mentions what it creates', () => {
    const tool = tools.find(t => t.name === 'create_user')!;
    expect(tool.description!.toLowerCase()).toMatch(/user|account/);
  });

  it('delete_user description warns about irreversibility', () => {
    const tool = tools.find(t => t.name === 'delete_user')!;
    expect(tool.description!.toLowerCase()).toMatch(/permanent|irreversible|cannot be undone|cannot undo/);
  });
});

These tests may seem pedantic, but they prevent a category of bug that's invisible in unit tests: a refactor renames the handler and accidentally clears the description to an empty string, or a generator produces "PLACEHOLDER — TODO: fill in description".
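
The length check in the tests above would not catch a generated placeholder, since such text is usually longer than 15 characters, and nothing yet compares the tool list against its documented surface. Both checks can be sketched as one pure function over the listTools output. The name auditToolList, the documentedNames parameter, and the placeholder pattern are assumptions for illustration.

```typescript
type ToolInfo = { name: string; description?: string };

// Patterns that suggest a generated or unfinished description (illustrative, not exhaustive).
const PLACEHOLDER_PATTERN = /\b(placeholder|todo|tbd|fixme|lorem ipsum)\b/i;

// Returns a list of problems; an empty list means the tool list matches the
// documented surface and every description looks finished.
function auditToolList(tools: ToolInfo[], documentedNames: string[], minLength = 15): string[] {
  const problems: string[] = [];
  const actualNames = new Set(tools.map((tool) => tool.name));
  const documented = new Set(documentedNames);

  for (const name of documentedNames) {
    if (!actualNames.has(name)) problems.push(`documented tool '${name}' is missing`);
  }
  for (const tool of tools) {
    if (!documented.has(tool.name)) problems.push(`tool '${tool.name}' is not documented`);
    const description = (tool.description ?? '').trim();
    if (description.length <= minLength) {
      problems.push(`tool '${tool.name}' has an empty or too-short description`);
    } else if (PLACEHOLDER_PATTERN.test(description)) {
      problems.push(`tool '${tool.name}' has a placeholder description`);
    }
  }
  return problems;
}
```

A server whose tools legitimately mention one of these words, for example a to-do list server, would need to narrow the pattern to avoid false positives.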

Organizing acceptance tests separately from unit tests

Acceptance tests are slower than unit tests and often more brittle — they exercise more of the stack and can fail for reasons that aren't code bugs (fake data setup, test ordering). Keep them in a separate directory and run them on a different schedule.

// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Default — unit + integration tests (fast)
    include: ['src/**/*.test.ts'],
    exclude: ['src/**/*.acceptance.test.ts'],
  },
});

// vitest.acceptance.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['src/**/*.acceptance.test.ts'],
    testTimeout: 15000, // Acceptance tests can be slower
  },
});

// package.json scripts
{
  "scripts": {
    "test": "vitest run",
    "test:acceptance": "vitest run --config vitest.acceptance.config.ts",
    "test:all": "vitest run && vitest run --config vitest.acceptance.config.ts"
  }
}

The gap between acceptance tests and production

Acceptance tests with InMemoryTransport verify that your MCP server does what it promises — from the client's perspective, with the full protocol stack. What they don't cover is the gap between "tests pass" and "users can connect": the deployed server might not start on the expected port, the TLS certificate might have expired, the process might crash under load, or a cloud provider health check might be misconfigured. AliveMCP bridges this gap by probing the live production MCP endpoint every 60 seconds and alerting you when the deployed system fails the same initialize handshake your tests pass.
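
To make the idea of a production probe concrete, here is a rough sketch of a single reachability check that sends a JSON-RPC initialize request over HTTP. It is not AliveMCP's implementation, which this guide does not describe. The names probeEndpoint, buildInitializeRequest, and FetchLike are hypothetical, the protocolVersion and clientInfo values are assumptions, and the check only inspects the HTTP status rather than parsing the initialize response.

```typescript
// Minimal fetch-like signature so the probe can be exercised with a stub.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

type ProbeResult = { reachable: boolean; detail: string };

// JSON-RPC "initialize" request, the first message of an MCP session.
// The protocolVersion and clientInfo values are assumptions for illustration.
function buildInitializeRequest(id = 1) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'initialize',
    params: {
      protocolVersion: '2025-03-26',
      capabilities: {},
      clientInfo: { name: 'example-probe', version: '0.0.1' },
    },
  };
}

async function probeEndpoint(url: string, fetchImpl: FetchLike): Promise<ProbeResult> {
  try {
    const response = await fetchImpl(url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Streamable HTTP servers may answer with JSON or an SSE stream.
        Accept: 'application/json, text/event-stream',
      },
      body: JSON.stringify(buildInitializeRequest()),
    });
    return { reachable: response.ok, detail: `HTTP ${response.status}` };
  } catch (error) {
    // DNS failures, refused connections, and TLS errors surface here.
    return { reachable: false, detail: error instanceof Error ? error.message : String(error) };
  }
}
```

On a runtime with a global fetch, that function can be passed as fetchImpl. A scheduler would then call probeEndpoint at a fixed interval and alert when reachable turns false.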

Related questions

Should acceptance tests use real or fake external dependencies?

Use fakes for the default acceptance test suite so tests run without external services. Add a second acceptance test suite (tagged or in a separate file) that uses real dependencies — a real test database, a real API sandbox account — and run it before production deployments or on a nightly schedule. The fake-based acceptance tests catch logic bugs; the real-dependency tests catch integration bugs between your code and the real services.
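
The createFakeDb helper imported in the earlier test is not shown in this guide. A minimal in-memory version might look like the sketch below, assuming the server depends on a small storage interface; UserStore, its method names, and fakeUuid are hypothetical. A real-dependency suite would supply a second implementation of the same interface backed by a test database.

```typescript
type User = { id: string; name: string; email: string };

// Hypothetical storage interface the server depends on.
interface UserStore {
  createUser(input: { name: string; email: string }): Promise<User>;
  getUser(id: string): Promise<User | undefined>;
  deleteUser(id: string): Promise<boolean>;
}

// UUID-shaped identifier built from Math.random; adequate for a fake, not for production.
function fakeUuid(): string {
  return 'xxxxxxxx-xxxx-4xxx-8xxx-xxxxxxxxxxxx'.replace(/x/g, () =>
    Math.floor(Math.random() * 16).toString(16),
  );
}

function createFakeDb(): UserStore {
  // Each call returns an isolated store, so tests cannot leak state into each other.
  const users = new Map<string, User>();
  return {
    async createUser({ name, email }) {
      for (const existing of users.values()) {
        if (existing.email === email) {
          throw new Error(`Email ${email} is already registered`);
        }
      }
      const user = { id: fakeUuid(), name, email };
      users.set(user.id, user);
      return user;
    },
    async getUser(id) {
      return users.get(id);
    },
    async deleteUser(id) {
      // Map.delete reports whether the user existed.
      return users.delete(id);
    },
  };
}
```

Because both suites exercise the same interface, the acceptance scenarios themselves do not need to change when the dependency is swapped.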

How do acceptance tests differ from end-to-end tests?

For MCP servers, the distinction is the transport layer. Acceptance tests use InMemoryTransport — no network, no HTTP server, runs in CI with no infrastructure. End-to-end tests use the real HTTP transport (StreamableHTTPServerTransport or SSE) with a real HTTP client — they catch HTTP-specific bugs like CORS headers, session ID handling, and connection timeouts. Acceptance tests are the primary quality gate (run on every commit); end-to-end tests are a secondary gate (run before deploy).

How do I test a tool that has non-deterministic output?

Extract the deterministic parts. A tool that returns a timestamp in its output can be tested with expect(result).toMatchObject({ name: 'Alice' }) using partial matching, leaving out the createdAt field. For outputs with random IDs, assert the format (expect(id).toMatch(/^[0-9a-f-]{36}$/) for a UUID) rather than the value. For tools that call an LLM internally and return generated text, test structural properties (response is non-empty, required sections exist) rather than exact content.
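
Outside a matcher library, the same idea can be expressed by removing the volatile fields before comparing. This is a small sketch; omitVolatile is a hypothetical helper and the two sample responses are made up for illustration.

```typescript
// Shallow copy of `record` without the listed volatile keys (timestamps,
// generated IDs), so the remaining fields can be compared exactly.
function omitVolatile(record: Record<string, unknown>, volatileKeys: string[]): Record<string, unknown> {
  const stable: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    if (!volatileKeys.includes(key)) stable[key] = value;
  }
  return stable;
}

// Loose UUID shape check: 36 characters, each a hex digit or a hyphen.
const UUID_SHAPE = /^[0-9a-f-]{36}$/;

// Two hypothetical create_user responses that differ only in volatile fields.
const firstRun = { id: '0b9d6c1e-2f4a-4e7b-8a3c-5d1f7e9b2c40', name: 'Alice', createdAt: '2024-01-01T10:00:00Z' };
const secondRun = { id: '7c2e9a4f-1b3d-4f6a-9c8e-2a5b7d0e1f63', name: 'Alice', createdAt: '2024-01-02T11:30:00Z' };

// The stable parts are compared by value; the IDs are checked by shape only.
const stableFirst = omitVolatile(firstRun, ['id', 'createdAt']);
const stableSecond = omitVolatile(secondRun, ['id', 'createdAt']);
const idsLookValid = UUID_SHAPE.test(firstRun.id) && UUID_SHAPE.test(secondRun.id);
```

Here stableFirst and stableSecond both reduce to an object containing only the name field, so an exact equality assertion on them is deterministic across runs.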

Can I use Cucumber or Gherkin feature files for MCP acceptance tests?

Yes. @cucumber/cucumber integrates with Node.js and can use the same InMemoryTransport setup in step definitions. Gherkin syntax makes the behavioral specification readable by non-developers and produces a living documentation artifact. The tradeoff: more setup overhead and a separate runner. For most MCP server projects, plain Vitest describe/it with explicit Given/When/Then comments achieves the same readability benefit with less infrastructure.