MCP server mutation testing

Line coverage at 90% means your tests execute 90% of lines — it says nothing about whether a test would fail if those lines were wrong. Mutation testing runs your test suite against hundreds of deliberately broken versions of your code. Any broken version that passes all tests is a surviving mutant — a real gap in your test coverage. For MCP tool handlers, the most common survivors cluster in one place: the error paths. The branch where an external API call throws, where a Zod schema fails to parse, where an empty array comes back. These are exactly the failures that happen in production, and exactly the ones happy-path unit tests skip.

TL;DR

Run Stryker against your tool handler source files only, not the server boilerplate. A mutant survives when every test still passes against the mutated code. The most revealing surviving mutants in MCP handlers are the catch branch that swallows an API error silently, the conditional that short-circuits on empty results, and the boundary check on numeric inputs. Write tests that assert specifically on error-path behaviour (result.isError === true, specific error message content) to kill these mutants. Target 80%+ mutation score for tool handler core logic. Run Stryker in incremental mode in CI so it only re-tests mutants affected by changed files.

Why line coverage lies

Consider this tool handler for a GitHub search integration:

// tools/search-repos.ts
import { z } from 'zod';
import type { Deps } from '../deps.js';

export const searchReposSchema = z.object({
  query: z.string().min(1),
  limit: z.number().int().min(1).max(50).default(10),
});

export async function searchReposHandler(
  args: z.infer<typeof searchReposSchema>,
  deps: Deps
): Promise<{ content: Array<{ type: 'text'; text: string }>; isError?: boolean }> {
  let results;
  try {
    results = await deps.github.searchRepositories(args.query, args.limit);
  } catch (err) {
    deps.logger.error('GitHub search failed', { err });
    return {
      content: [{ type: 'text', text: `Search failed: ${(err as Error).message}` }],
      isError: true,
    };
  }

  if (results.length === 0) {
    return {
      content: [{ type: 'text', text: 'No repositories found.' }],
    };
  }

  const formatted = results
    .map(r => `${r.full_name} — ${r.description ?? 'no description'} (${r.stargazers_count} stars)`)
    .join('\n');

  return {
    content: [{ type: 'text', text: formatted }],
  };
}

Now consider this test:

// test/search-repos.test.ts
test('returns formatted results', async () => {
  const deps = createTestDeps();
  deps.github.searchRepositories = async () => [
    { full_name: 'org/repo', description: 'A tool', stargazers_count: 42 },
  ];

  const result = await searchReposHandler({ query: 'typescript mcp', limit: 5 }, deps);

  expect(result.isError).toBeFalsy();
  expect(result.content[0].text).toContain('org/repo');
});

Suppose Istanbul reports 85% line coverage on search-repos.ts for the suite that contains this test. The catch block shows as executed because an unrelated test in the same run happened to make searchRepositories throw, but no test asserts on what the handler returns when that happens. The if (results.length === 0) branch is never hit at all.

Coverage tells you a line was executed. It does not tell you whether a test would fail if that line returned the wrong value, threw an unexpected type, or was deleted entirely. A test that calls a function and discards the result can still produce 100% line coverage for that function. Mutation testing measures the thing that actually matters: would your tests catch a bug here?
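The gap between executing a line and checking its result can be shown with a minimal sketch. The formatStars function, its mutant, and both test helpers are hypothetical and not part of the handler above; the "tests" are plain functions rather than Jest tests so the snippet stands on its own.

```typescript
// Hypothetical function under test: formats a star count for display.
function formatStars(count: number): string {
  return count === 1 ? '1 star' : `${count} stars`;
}

// A hand-written mutant of the same function: the equality check is negated.
function formatStarsMutant(count: number): string {
  return count !== 1 ? '1 star' : `${count} stars`;
}

type Formatter = (count: number) => string;

// Executes both branches of the formatter but asserts nothing.
// Returns true when the "test" passes.
function coverageOnlyTest(format: Formatter): boolean {
  format(1);
  format(42);
  return true; // nothing can fail, so this passes for any implementation
}

// Executes the same lines and checks the returned values.
function assertingTest(format: Formatter): boolean {
  return format(1) === '1 star' && format(42) === '42 stars';
}

// Both tests produce identical line coverage, but only one kills the mutant.
const mutantPassesCoverageOnly = coverageOnlyTest(formatStarsMutant); // true: mutant survives
const mutantPassesAsserting = assertingTest(formatStarsMutant); // false: mutant is killed
```

A coverage tool cannot tell these two tests apart, because both execute every line. Only running them against the mutant shows that the first one checks nothing.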

What mutation testing does

A mutation testing tool like Stryker works in three steps. First it parses your source files into an AST. Then it generates mutants — modified copies of the AST with one small change applied, each representing a plausible bug a developer could introduce. Then it reruns your test suite against each mutant. If any test fails, the mutant is killed. If all tests pass, the mutant survived.

The four most common mutation categories, and what they reveal for MCP handlers:

Conditional negation. if (results.length === 0) becomes if (results.length !== 0), or the whole condition is replaced with true or false. The if (false) variant survives when no test asserts on the empty-results code path.
Arithmetic operator flip. args.limit - 1 becomes args.limit + 1. Survives when your boundary tests don't assert on the exact value, only that the response is non-empty.
Value substitution. isError: true becomes isError: false. Survives when your error-path test asserts that the response contains text, but never checks result.isError === true.
Statement deletion. The entire return { isError: true } statement is removed. The function continues past the catch block with results still undefined and throws a TypeError when it reads results.length. Survives when your error-path test doesn't exist at all.

For the handler above, Stryker would generate approximately 18–22 mutants. A typical happy-path test suite kills about 11 of them — a mutation score of roughly 55%. The survivors are concentrated in the catch block, the empty-results branch, and the isError flag.
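The kill/survive bookkeeping behind a mutation score can be sketched by hand. This is a simplified, synchronous stand-in for the handler above with three hand-written mutants; the names original, mutants, suitePasses, and mutationScore are illustrative assumptions, and Stryker generates and runs its mutants very differently.

```typescript
type Result = { text: string; isError?: boolean };
type Search = () => string[];
type Handler = (search: Search) => Result;
type Test = (handler: Handler) => void; // throws when an expectation fails

// Simplified stand-in for searchReposHandler: `search` returns names or throws.
const original: Handler = (search) => {
  let names: string[];
  try {
    names = search();
  } catch (err) {
    return { text: `Search failed: ${(err as Error).message}`, isError: true };
  }
  if (names.length === 0) return { text: 'No repositories found.' };
  return { text: names.join('\n') };
};

const mutants: Handler[] = [
  // Conditional negation: === 0 became !== 0
  (search) => {
    let names: string[];
    try { names = search(); }
    catch (err) { return { text: `Search failed: ${(err as Error).message}`, isError: true }; }
    if (names.length !== 0) return { text: 'No repositories found.' };
    return { text: names.join('\n') };
  },
  // Value substitution: isError: true became isError: false
  (search) => {
    let names: string[];
    try { names = search(); }
    catch (err) { return { text: `Search failed: ${(err as Error).message}`, isError: false }; }
    if (names.length === 0) return { text: 'No repositories found.' };
    return { text: names.join('\n') };
  },
  // Statement deletion: the return inside the catch block is gone
  (search) => {
    let names: string[] | undefined;
    try { names = search(); } catch { /* return statement deleted */ }
    if (names!.length === 0) return { text: 'No repositories found.' };
    return { text: names!.join('\n') };
  },
];

function expectTrue(condition: boolean, message: string): void {
  if (!condition) throw new Error(message);
}

const happyPath: Test = (handler) => {
  const result = handler(() => ['org/repo']);
  expectTrue(!result.isError, 'expected no error flag');
  expectTrue(result.text.includes('org/repo'), 'expected repo name in text');
};

const errorPath: Test = (handler) => {
  const result = handler(() => { throw new Error('rate limit exceeded'); });
  expectTrue(result.isError === true, 'expected isError to be true');
  expectTrue(result.text.includes('rate limit exceeded'), 'expected error message in text');
};

// A suite passes only if no test throws.
function suitePasses(tests: Test[], handler: Handler): boolean {
  try { for (const test of tests) test(handler); return true; } catch { return false; }
}

// Mutation score: percentage of mutants that make at least one test fail.
function mutationScore(tests: Test[], candidates: Handler[]): number {
  const killed = candidates.filter((mutant) => !suitePasses(tests, mutant)).length;
  return Math.round((killed / candidates.length) * 100);
}

const happyOnlyScore = mutationScore([happyPath], mutants); // 1 of 3 killed
const fullScore = mutationScore([happyPath, errorPath], mutants); // 3 of 3 killed
```

The same ratio gives the figure quoted above: 11 killed out of roughly 20 mutants is a score of about 55%.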

Setting up Stryker for an MCP server

Install Stryker and the Jest runner (or Vitest runner if your project uses Vitest):

npm install --save-dev @stryker-mutator/core @stryker-mutator/jest-runner

Create stryker.config.mjs in the project root. The key configuration decision is mutate: scope it tightly to your tool handler files. Do not mutate the server wiring, the transport setup, or the SDK call sites — those aren't business logic and mutations there produce noise:

// stryker.config.mjs
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
  packageManager: 'npm',
  reporters: ['html', 'clear-text', 'progress'],
  testRunner: 'jest',
  coverageAnalysis: 'perTest',

  // Only mutate tool handler business logic
  mutate: [
    'src/tools/**/*.ts',
    '!src/tools/index.ts',      // exclude the registration barrel
    '!src/tools/**/*.test.ts',  // never mutate the tests themselves
  ],

  // Jest config for Stryker — point to your existing jest config
  jest: {
    projectType: 'custom',
    configFile: 'jest.config.ts',
    enableFindRelatedTests: true,
  },

  // Mutation thresholds — fail CI below these scores
  thresholds: {
    high: 80,
    low: 70,
    break: 65,  // fail the process with exit code 1 below 65%
  },

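  // Incremental mode: reuse results for mutants whose code and covering tests are unchanged
  incremental: true,
  // Stryker's default location is reports/stryker-incremental.json. If you keep the
  // file under the temp dir as here, check that temp-dir cleanup does not delete it.
  incrementalFile: '.stryker-tmp/incremental.json',

  // Skip mutations that mostly produce noise (log messages, debug strings).
  // To also discard mutants that fail type checking, install
  // @stryker-mutator/typescript-checker and add checkers: ['typescript'].
  mutator: {
    excludedMutations: ['StringLiteral'], // skip string constant mutations
  },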
};

Run it:

npx stryker run

The first run is slow. Without coverage analysis, Stryker would run the full test suite once per mutant: with 10 tool handler files averaging 20 mutants each, that is 200 full-suite runs. With coverageAnalysis: 'perTest', Stryker records which tests cover which mutants during an initial dry run and then runs only those tests for each mutant; mutants that no test covers are reported as NoCoverage without running anything. On a typical MCP server this reduces the cost of a 200-run pass to roughly that of 40–80 runs.
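The saving from per-test coverage analysis comes down to simple counting, sketched here with made-up coverage data. The coveredBy map, the mutant ids, and the helper names are hypothetical; Stryker gathers the equivalent data itself during its dry run.

```typescript
// Hypothetical coverage data: which mutant ids each test reaches.
const coveredBy: Record<string, number[]> = {
  'returns formatted results': [1, 2, 3],
  'returns isError when GitHub API throws': [4, 5],
  'returns "No repositories found" on empty array': [3, 6],
};

// Seven mutants in total; no test reaches mutant 7.
const mutantIds = [1, 2, 3, 4, 5, 6, 7];

// Tests worth running for one mutant: only those that execute the mutated code.
function testsToRun(mutantId: number): string[] {
  return Object.keys(coveredBy).filter((name) => coveredBy[name].includes(mutantId));
}

// Without coverage analysis: every test runs against every mutant.
const executionsWithoutAnalysis = mutantIds.length * Object.keys(coveredBy).length;

// With per-test analysis: each mutant only runs the tests that cover it.
const executionsWithPerTest = mutantIds.reduce(
  (sum, id) => sum + testsToRun(id).length,
  0,
);

// Mutants that no test covers need no run at all.
const uncovered = mutantIds.filter((id) => testsToRun(id).length === 0);
```

In this toy data set, 21 test executions drop to 7, and mutant 7 is flagged as uncovered without running anything. Real ratios depend on how much your tests overlap.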

The HTML report at reports/mutation/mutation.html shows each surviving mutant highlighted in source, with the mutation applied shown inline. This is where you find the exact lines whose behaviour no test checks.

High-value mutations for MCP handlers

Four kinds of surviving mutant consistently surface real test gaps in MCP tool handlers. Each is shown below with the test that kills it:

1. Error path coverage

When the injected dependency throws, the catch block runs. A surviving mutant here means no test ever caused the dependency to throw and asserted on the result:

// SURVIVING MUTANT: Stryker flipped isError: true to isError: false
// Original:
return { content: [{ type: 'text', text: `Search failed: ${(err as Error).message}` }], isError: true };
// Mutant:
return { content: [{ type: 'text', text: `Search failed: ${(err as Error).message}` }], isError: false };

// Test that kills it:
test('returns isError when GitHub API throws', async () => {
  const deps = createTestDeps();
  deps.github.searchRepositories = async () => {
    throw new Error('rate limit exceeded');
  };

  const result = await searchReposHandler({ query: 'mcp', limit: 5 }, deps);

  expect(result.isError).toBe(true);
  expect(result.content[0].text).toContain('rate limit exceeded');
});
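The tests in this guide call createTestDeps() without showing it. A minimal sketch of such a factory might look like the following. The Deps shape is an assumption inferred from how the handler uses deps.github and deps.logger (the real type lives in ../deps.js and is not shown), and loggedErrors is a hypothetical extra for asserting on logging.

```typescript
// Assumed shapes, inferred from the handler; the real Deps type is not shown in this guide.
interface Repo {
  full_name: string;
  description: string | null;
  stargazers_count: number;
}

interface Deps {
  github: {
    searchRepositories: (query: string, limit: number) => Promise<Repo[]>;
  };
  logger: {
    error: (message: string, meta?: Record<string, unknown>) => void;
  };
}

// Hypothetical test-only extension: records what was logged.
interface TestDeps extends Deps {
  loggedErrors: string[];
}

// Returns fresh fakes on every call so tests cannot leak state into each other.
function createTestDeps(): TestDeps {
  const loggedErrors: string[] = [];
  return {
    github: {
      // Default: an empty result. Tests overwrite this per scenario.
      searchRepositories: async () => [],
    },
    logger: {
      error: (message) => {
        loggedErrors.push(message);
      },
    },
    loggedErrors,
  };
}
```

Because every property is a plain function, a test can replace deps.github.searchRepositories with a throwing stub, as the error-path test does, without a mocking library. A db fake for the pagination example would be added in the same way.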

2. Schema validation bypass

Zod's .safeParse() returns a discriminated union with success: boolean. Mutations on the success check survive when no test sends invalid input and asserts that validation was applied:

// Tool handler snippet that validates an API response before using it.
// (The test below assumes searchReposHandler has been extended with this check.)
const parsed = ResponseSchema.safeParse(rawApiData);
if (!parsed.success) {
  return {
    content: [{ type: 'text', text: 'API returned unexpected shape' }],
    isError: true,
  };
}

// SURVIVING MUTANT: !parsed.success → parsed.success (negation flipped)
// The function proceeds with unvalidated data when the condition is negated.

// Test that kills it:
test('returns isError when API response fails schema validation', async () => {
  const deps = createTestDeps();
  // Return data that does not match ResponseSchema
  deps.github.searchRepositories = async () => [
    { not_the_right_shape: true } as any,
  ];

  const result = await searchReposHandler({ query: 'mcp', limit: 5 }, deps);

  expect(result.isError).toBe(true);
  expect(result.content[0].text).toContain('unexpected shape');
});

3. Empty result handling

The results.length === 0 guard is a common surviving mutant because most tests seed test data that returns at least one result:

// SURVIVING MUTANT: if (results.length === 0) → if (false)
// The empty-results branch is never taken, so an empty array falls through to
// the formatter and the tool returns an empty string ([].map(...).join('\n')
// does not crash) instead of "No repositories found."
// The related === → !== flip swaps the two branches; a happy-path test that
// asserts on content already kills that one.

// Test that kills it:
test('returns "No repositories found" when search returns empty array', async () => {
  const deps = createTestDeps();
  deps.github.searchRepositories = async () => [];

  const result = await searchReposHandler({ query: 'unlikely-query-xyz', limit: 5 }, deps);

  expect(result.isError).toBeFalsy();
  expect(result.content[0].text).toBe('No repositories found.');
});

4. Boundary conditions on numeric inputs

Tools that apply minimum/maximum clamping or offset logic are frequent sources of off-by-one survivors:

// Pagination tool handler snippet:
const offset = Math.max(0, args.page - 1) * args.pageSize;
const rows = await deps.db.query(sql, [args.pageSize, offset]);

// SURVIVING MUTANT: args.page - 1 → args.page + 1
// Page 1 now fetches offset=2*pageSize (page 3's data).

// Test that kills it:
test('page 1 returns the first page of results', async () => {
  const deps = createTestDeps();
  await seedRows(deps.db, 25); // 25 total rows

  const result = await listRecordsHandler({ page: 1, pageSize: 10 }, deps);
  const rows = JSON.parse(result.content[0].text);

  // First page must contain row IDs 1–10, not 21–25
  expect(rows[0].id).toBe(1);
  expect(rows).toHaveLength(10);
});

test('page 2 returns the second page of results', async () => {
  const deps = createTestDeps();
  await seedRows(deps.db, 25);

  const result = await listRecordsHandler({ page: 2, pageSize: 10 }, deps);
  const rows = JSON.parse(result.content[0].text);

  expect(rows[0].id).toBe(11);
  expect(rows).toHaveLength(10);
});

The boundary tests work because they assert on the specific row ID returned at a specific offset — not just that rows were returned. Stryker's arithmetic operator mutations (+ vs −, * vs /) survive tests that only check response shape, not response values.
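The difference between a shape check and a value check can be sketched with an in-memory version of the pagination logic. Here listPage, seedRows, and the two test helpers are hypothetical stand-ins for the database-backed handler and its Jest tests.

```typescript
interface Row {
  id: number;
}

type OffsetFn = (page: number, pageSize: number) => number;

// Original offset calculation from the snippet above.
const offsetOriginal: OffsetFn = (page, pageSize) => Math.max(0, page - 1) * pageSize;

// Arithmetic mutant: page - 1 became page + 1.
const offsetMutant: OffsetFn = (page, pageSize) => Math.max(0, page + 1) * pageSize;

// In-memory stand-in for the seeded table: rows with ids 1..count.
function seedRows(count: number): Row[] {
  return Array.from({ length: count }, (_, index) => ({ id: index + 1 }));
}

// In-memory stand-in for the LIMIT/OFFSET query.
function listPage(rows: Row[], page: number, pageSize: number, offsetOf: OffsetFn): Row[] {
  const offset = offsetOf(page, pageSize);
  return rows.slice(offset, offset + pageSize);
}

// Shape-only check: passes as long as some rows come back.
function shapeOnlyTest(offsetOf: OffsetFn): boolean {
  const rows = listPage(seedRows(25), 1, 10, offsetOf);
  return Array.isArray(rows) && rows.length > 0;
}

// Value check: asserts the exact page size and the first row id.
function valueTest(offsetOf: OffsetFn): boolean {
  const rows = listPage(seedRows(25), 1, 10, offsetOf);
  return rows.length === 10 && rows[0].id === 1;
}

const mutantPassesShapeTest = shapeOnlyTest(offsetMutant); // true: mutant survives
const mutantPassesValueTest = valueTest(offsetMutant); // false: mutant is killed
```

With the mutant, page 1 returns rows 21 to 25. That is still a non-empty array, so the shape check passes, while the value check fails on both the length and the first id.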

Mutation score targets

Not all code deserves the same mutation score target. Prioritise where mutations reveal real production risk:

Tool handler business logic — target 80%+. This is the code that transforms inputs into outputs and handles error conditions. A surviving mutant here is a plausible production bug. The 80% threshold is achievable without exhaustive testing: the remaining 20% is typically in logging calls, minor string formatting, and defensive fallbacks that are genuinely hard to trigger.
Zod schema definitions — target 70%+. Schema mutations (a .min(1) becomes .min(0), a required field becomes optional) reveal whether your integration tests actually validate argument rejection. These are important but require more test setup than pure unit tests.
Server wiring and registration code — no target needed. The code that calls server.tool() and passes the handler function is structural boilerplate. Mutations here (for example, a changed tool name in the registration call) are caught by integration tests that call the tool by name, not by dedicated mutation tests. Exclude this code from Stryker's mutate glob.
Transport adapters and protocol code — exclude entirely. This is SDK behaviour, not your code. Mutating it wastes test runs on behaviour your tests can't control.

The Stryker HTML report breaks down mutation score by file. A file with a score below 60% almost always contains an error path or edge case that has never been tested. Start there — those surviving mutants map directly to the production failures most likely to be reported.
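If you also enable a JSON reporter, the same per-file triage can be scripted. The sketch below works on a simplified, assumed subset of the report shape (a files map with a status on each mutant) and counts Killed and Timeout as detected. Check the field names and status values against the report your Stryker version produces. The helper names and sample data are hypothetical.

```typescript
// Assumed subset of the JSON report; real reports carry many more fields and statuses.
type MutantStatus = 'Killed' | 'Timeout' | 'Survived' | 'NoCoverage';

interface FileResult {
  mutants: Array<{ status: MutantStatus }>;
}

interface MutationReport {
  files: Record<string, FileResult>;
}

// Percentage of a file's mutants that the tests detected.
function mutationScore(file: FileResult): number {
  if (file.mutants.length === 0) return 100; // no mutants, nothing to flag
  const detected = file.mutants.filter(
    (mutant) => mutant.status === 'Killed' || mutant.status === 'Timeout',
  ).length;
  return Math.round((detected / file.mutants.length) * 100);
}

// File names whose score falls below the threshold, sorted for stable output.
function filesBelow(report: MutationReport, threshold: number): string[] {
  return Object.keys(report.files)
    .filter((name) => mutationScore(report.files[name]) < threshold)
    .sort();
}

// Helper for building sample data.
function mutantsWith(status: MutantStatus, count: number): Array<{ status: MutantStatus }> {
  return Array.from({ length: count }, () => ({ status }));
}

// Hypothetical report: one weak file, one healthy file.
const sampleReport: MutationReport = {
  files: {
    'src/tools/search-repos.ts': {
      mutants: [...mutantsWith('Killed', 11), ...mutantsWith('Survived', 9)],
    },
    'src/tools/list-records.ts': {
      mutants: [
        ...mutantsWith('Killed', 17),
        ...mutantsWith('Timeout', 1),
        ...mutantsWith('NoCoverage', 2),
      ],
    },
  },
};

const needsAttention = filesBelow(sampleReport, 60);
```

Here search-repos.ts scores 55 and is flagged, while list-records.ts scores 90 and is not.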

Mutation testing in CI

Stryker's full run is too slow for every commit if you have a large codebase. Two strategies keep it practical:

Incremental mode

With incremental: true and an incrementalFile path, Stryker caches mutant results across runs. On the next run it only re-tests mutants whose source file or test file changed. Cache the incremental file in CI between runs so subsequent commits reuse results for unchanged files.

GitHub Actions configuration

# .github/workflows/mutation.yml
name: Mutation testing

on:
  push:
    branches: [main]
  pull_request:
    paths:
      - 'src/tools/**'
      - 'test/**'

jobs:
  mutation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'

      - run: npm ci

      - name: Restore Stryker incremental cache
        uses: actions/cache@v4
        with:
          path: .stryker-tmp/incremental.json
          key: stryker-${{ runner.os }}-${{ hashFiles('src/tools/**/*.ts') }}
          restore-keys: |
            stryker-${{ runner.os }}-

      - name: Run mutation tests
        run: npx stryker run
        env:
          # GitHub Actions already sets CI=true; repeated here to make the non-interactive run explicit
          CI: true

      - name: Upload mutation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: mutation-report
          path: reports/mutation/

The cache key includes a hash of the tool handler source. When tool handler files change, the key changes, restore-keys falls back to the most recent earlier cache, and Stryker re-tests the mutants in the changed files before the updated results are saved under the new key. When only test files change, the key matches exactly and the cached results are restored; Stryker then re-evaluates whether the updated tests kill previously surviving mutants. This is fast because it only re-runs the mutants affected by the change, not all mutants.

For pull requests, the workflow trigger is scoped with paths to src/tools/** and test/**, so mutation tests run only when tool handler code or its tests change. Unit tests and integration tests run on every push; mutation testing runs only when the logic under test, or the tests that check it, changes.

The thresholds.break option in stryker.config.mjs exits with code 1 when the mutation score drops below the threshold, which fails the CI step and blocks merges when test quality regresses. Set it conservatively at first (50–60%, rather than the 65 in the example config, if your starting score is low) and tighten over time as you kill more survivors.
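How the three threshold values relate to each other can be sketched as two small functions. This models the documented behaviour as understood here (scores at or above high are healthy, scores between low and high are a warning, scores below low are danger, and only break affects the exit code). The function names are hypothetical and this is not Stryker's own code.

```typescript
interface Thresholds {
  high: number;
  low: number;
  break: number | null; // null means the build never fails on score
}

type Band = 'high' | 'warning' | 'danger';

// Which colour band a score falls into in the report.
function classifyScore(score: number, thresholds: Thresholds): Band {
  if (score >= thresholds.high) return 'high';
  if (score >= thresholds.low) return 'warning';
  return 'danger';
}

// Process exit code: non-zero only when the score is below `break`.
function exitCodeFor(score: number, thresholds: Thresholds): number {
  return thresholds.break !== null && score < thresholds.break ? 1 : 0;
}

// Values from the example config in this guide.
const configured: Thresholds = { high: 80, low: 70, break: 65 };

const summary = [55, 67, 72, 85].map((score) => ({
  score,
  band: classifyScore(score, configured),
  exitCode: exitCodeFor(score, configured),
}));
```

With these values, a score of 67 is already in the danger band but still exits with 0. The band only controls how the report is coloured, while break alone decides whether CI fails.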

Test gaps in error paths and AliveMCP

There is a consistent pattern in MCP server post-mortems: the production incident involved an error path. The external API started returning a 503. The upstream service began returning a response body that didn't match the expected schema. The database query returned zero rows when the code assumed at least one. In each case, a test existed for the happy path — and the error path had a surviving mutant that nobody noticed.

Mutation testing and production monitoring are complementary layers of the same problem:

Mutation testing (pre-production): tells you which error paths have no test asserting on their behaviour. You fix these before shipping. The surviving mutant in the catch block is a reminder to write a test that injects a failure and asserts on result.isError === true.
AliveMCP (production): detects when the error path actually triggers in production. When a downstream API degrades, AliveMCP's external probe — which calls initialize and a representative tool via the full MCP protocol — returns an error or times out. AliveMCP alerts within 60 seconds.

Without mutation testing, your test suite gives you false confidence: 90% line coverage looks good in the CI dashboard, but the error path survives because nothing checks its output. The mutation score tells a different story — 55% mutation score means nearly half your tool handler mutations produce code that your tests cannot distinguish from correct code.

Kill the surviving mutants first. Then let AliveMCP monitor whether those error paths ever actually fire in production. The two tools measure different things: mutation testing measures whether your tests would catch a bug; AliveMCP measures whether a bug is happening right now. You need both. A well-tested MCP server that is currently down is not the same as a poorly-tested server that happens to be up.

See also: MCP server unit testing for the handler-level test patterns that kill most mutants, and MCP server integration testing for the InMemoryTransport-based tests that catch error paths at the protocol level.