MCP server mutation testing

Line coverage at 90% means your tests execute 90% of lines — it says nothing about whether a test would fail if those lines were wrong. Mutation testing runs your test suite against hundreds of deliberately broken versions of your code. Any broken version that passes all tests is a surviving mutant — a real gap in your test coverage. For MCP tool handlers, the most common survivors cluster in one place: the error paths. The branch where an external API call throws, where a Zod schema fails to parse, where an empty array comes back. These are exactly the failures that happen in production, and exactly the ones happy-path unit tests skip.

TL;DR

Run Stryker against your tool handler source files only, not the server boilerplate. A mutant survives when every test still passes against it. The most revealing surviving mutants in MCP handlers are the catch branch that swallows an API error silently, the conditional that short-circuits on empty results, and the boundary check on numeric inputs. To kill them, write tests that assert specifically on error-path behaviour (result.isError === true, specific error message content). Target 80%+ mutation score for tool handler core logic. Run Stryker in incremental mode in CI so it only re-tests mutants affected by changed files.

Why line coverage lies

Consider this tool handler for a GitHub search integration:

// tools/search-repos.ts
import { z } from 'zod';
import type { Deps } from '../deps.js';

export const searchReposSchema = z.object({
  query: z.string().min(1),
  limit: z.number().int().min(1).max(50).default(10),
});

export async function searchReposHandler(
  args: z.infer<typeof searchReposSchema>,
  deps: Deps
): Promise<{ content: Array<{ type: 'text'; text: string }>; isError?: boolean }> {
  let results;
  try {
    results = await deps.github.searchRepositories(args.query, args.limit);
  } catch (err) {
    deps.logger.error('GitHub search failed', { err });
    return {
      content: [{ type: 'text', text: `Search failed: ${(err as Error).message}` }],
      isError: true,
    };
  }

  if (results.length === 0) {
    return {
      content: [{ type: 'text', text: 'No repositories found.' }],
    };
  }

  const formatted = results
    .map(r => `${r.full_name} — ${r.description ?? 'no description'} (${r.stargazers_count} stars)`)
    .join('\n');

  return {
    content: [{ type: 'text', text: formatted }],
  };
}

Now consider this test:

// test/search-repos.test.ts
test('returns formatted results', async () => {
  const deps = createTestDeps();
  deps.github.searchRepositories = async () => [
    { full_name: 'org/repo', description: 'A tool', stargazers_count: 42 },
  ];

  const result = await searchReposHandler({ query: 'typescript mcp', limit: 5 }, deps);

  expect(result.isError).toBeFalsy();
  expect(result.content[0].text).toContain('org/repo');
});

Suppose Istanbul reports 85% line coverage on search-repos.ts for the suite that contains this test. The catch block counts as executed because an unrelated test happened to trigger it, but no test asserts on what the handler returns when searchRepositories throws. The if (results.length === 0) branch is never hit at all.

Coverage tells you a line was executed. It does not tell you whether a test would fail if that line returned the wrong value, threw an unexpected type, or was deleted entirely. A test that calls a function and discards the result still earns full coverage for every line it executes. Mutation testing measures the thing that actually matters: would your tests catch a bug here?

What mutation testing does

A mutation testing tool like Stryker works in three steps. First it parses your source files into an AST. Then it generates mutants — modified copies of the AST with one small change applied, each representing a plausible bug a developer could introduce. Then it reruns your test suite against each mutant. If any test fails, the mutant is killed. If all tests pass, the mutant survived.

The four most common mutation categories, and what they reveal for MCP handlers:

For the handler above, Stryker would generate approximately 18–22 mutants. A typical happy-path test suite kills about 11 of them — a mutation score of roughly 55%. The survivors are concentrated in the catch block, the empty-results branch, and the isError flag.
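To make the killed/survived distinction concrete, here is a hand-rolled sketch. It is not how Stryker works internally: the mutants are written by hand as separate functions, and formatCount, the mutant names, and the survivors helper are all hypothetical names chosen for illustration.

```typescript
// Hypothetical stand-in for a small piece of handler logic.
type Formatter = (count: number) => string;

const formatCount: Formatter = (count) =>
  count === 0 ? "No repositories found." : `${count} repositories`;

// Hand-written mutants: each is the original with one small change.
const mutants: Record<string, Formatter> = {
  // Equality operator flipped: === becomes !==
  "equality flipped": (count) =>
    count !== 0 ? "No repositories found." : `${count} repositories`,
  // Condition replaced with false: the empty branch is never taken
  "condition forced false": (count) => `${count} repositories`,
};

// A test returns true when it passes against the given implementation.
type Test = (impl: Formatter) => boolean;

const happyPathOnly: Test[] = [(impl) => impl(3) === "3 repositories"];
const withEmptyCase: Test[] = [
  ...happyPathOnly,
  (impl) => impl(0) === "No repositories found.",
];

// A mutant survives when every test still passes against it.
function survivors(tests: Test[]): string[] {
  return Object.entries(mutants)
    .filter(([, mutant]) => tests.every((test) => test(mutant)))
    .map(([mutantName]) => mutantName);
}

console.log(survivors(happyPathOnly)); // [ 'condition forced false' ]
console.log(survivors(withEmptyCase)); // []
```

The happy-path test kills the flipped equality operator by accident, but only the added empty-case assertion kills the mutant that removes the empty branch.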

Setting up Stryker for an MCP server

Install Stryker and the Jest runner (or Vitest runner if your project uses Vitest):

npm install --save-dev @stryker-mutator/core @stryker-mutator/jest-runner

Create stryker.config.mjs in the project root. The key configuration decision is mutate: scope it tightly to your tool handler files. Do not mutate the server wiring, the transport setup, or the SDK call sites — those aren't business logic and mutations there produce noise:

// stryker.config.mjs
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
  packageManager: 'npm',
  reporters: ['html', 'clear-text', 'progress'],
  testRunner: 'jest',
  coverageAnalysis: 'perTest',

  // Only mutate tool handler business logic
  mutate: [
    'src/tools/**/*.ts',
    '!src/tools/index.ts',      // exclude the registration barrel
    '!src/tools/**/*.test.ts',  // never mutate the tests themselves
  ],

  // Jest config for Stryker — point to your existing jest config
  jest: {
    projectType: 'custom',
    configFile: 'jest.config.ts',
    enableFindRelatedTests: true,
  },

  // Mutation thresholds — fail CI below these scores
  thresholds: {
    high: 80,
    low: 70,
    break: 65,  // fail the process with exit code 1 below 65%
  },

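  // Incremental mode: reuse results for mutants whose code and tests are unchanged.
  // Note: .stryker-tmp is Stryker's temp directory, which is cleaned after a
  // successful run by default. Confirm this file persists between runs, or use
  // Stryker's default location (reports/stryker-incremental.json).
  incremental: true,
  incrementalFile: '.stryker-tmp/incremental.json',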

  // Skip string-literal mutants (log messages, debug strings) to cut noise
  mutator: {
    excludedMutations: ['StringLiteral'],
  },
};

Run it:

npx stryker run

The first run is the slowest. Without coverage analysis, Stryker would run the full test suite once per mutant: with 10 tool handler files averaging 20 mutants each, that is 200 full-suite runs. With coverageAnalysis: 'perTest', Stryker records which tests cover which mutants and runs only that subset for each mutant. On a typical MCP server this cuts the work to roughly the equivalent of 40–80 full-suite runs.

The HTML report at reports/mutation/mutation.html shows each surviving mutant highlighted in source, with the mutation applied shown inline. This is where you find the exact lines that your tests execute but never verify.
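If you prefer a text list of survivors to clicking through the HTML, one option is to add 'json' to the reporters array and filter the resulting report. The sketch below assumes the report follows the mutation-testing report schema as I understand it (files, mutants, status, mutatorName, location.start.line); verify the field names against the report your Stryker version produces. The listSurvivors helper, the sample data, and its line numbers are hypothetical.

```typescript
// Trimmed-down types: only the report fields this sketch reads.
interface MutantResult {
  mutatorName: string;
  status: string; // e.g. "Killed", "Survived", "NoCoverage", "Timeout"
  replacement?: string;
  location: { start: { line: number } };
}

interface MutationReport {
  files: Record<string, { mutants: MutantResult[] }>;
}

// Returns one "file:line mutator" entry per surviving mutant.
function listSurvivors(report: MutationReport): string[] {
  const entries: string[] = [];
  for (const [file, { mutants }] of Object.entries(report.files)) {
    for (const mutant of mutants) {
      if (mutant.status === "Survived") {
        entries.push(`${file}:${mutant.location.start.line} ${mutant.mutatorName}`);
      }
    }
  }
  return entries;
}

// Inline sample data standing in for a parsed report file (illustrative only).
const sampleReport: MutationReport = {
  files: {
    "src/tools/search-repos.ts": {
      mutants: [
        {
          mutatorName: "BooleanLiteral",
          status: "Survived",
          replacement: "false",
          location: { start: { line: 35 } },
        },
        {
          mutatorName: "EqualityOperator",
          status: "Killed",
          location: { start: { line: 39 } },
        },
      ],
    },
  },
};

console.log(listSurvivors(sampleReport));
```

In a real project you would replace sampleReport with the parsed contents of the JSON report file, which I believe defaults to reports/mutation/mutation.json.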

High-value mutations for MCP handlers

Four mutation categories that consistently surface real test gaps in MCP tool handlers, and the tests you need to kill them:

1. Error path coverage

When the injected dependency throws, the catch block runs. A surviving mutant here means no test ever caused the dependency to throw and asserted on the result:

// SURVIVING MUTANT: Stryker flipped isError: true to isError: false
// Original:
return { content: [{ type: 'text', text: `Search failed: ${msg}` }], isError: true };
// Mutant:
return { content: [{ type: 'text', text: `Search failed: ${msg}` }], isError: false };

// Test that kills it:
test('returns isError when GitHub API throws', async () => {
  const deps = createTestDeps();
  deps.github.searchRepositories = async () => {
    throw new Error('rate limit exceeded');
  };

  const result = await searchReposHandler({ query: 'mcp', limit: 5 }, deps);

  expect(result.isError).toBe(true);
  expect(result.content[0].text).toContain('rate limit exceeded');
});

2. Schema validation bypass

Zod's .safeParse() returns a discriminated union with success: boolean. Mutations on the success check survive when no test sends invalid input and asserts that validation was applied:

// Assumes the handler validates the raw API response before using it
// (the searchReposHandler shown earlier would need this step added):
const parsed = ResponseSchema.safeParse(rawApiData);
if (!parsed.success) {
  return {
    content: [{ type: 'text', text: 'API returned unexpected shape' }],
    isError: true,
  };
}

// SURVIVING MUTANT: !parsed.success → parsed.success (negation flipped)
// The function proceeds with unvalidated data when the condition is negated.

// Test that kills it:
test('returns isError when API response fails schema validation', async () => {
  const deps = createTestDeps();
  // Return data that does not match ResponseSchema
  deps.github.searchRepositories = async () => [
    { not_the_right_shape: true } as any,
  ];

  const result = await searchReposHandler({ query: 'mcp', limit: 5 }, deps);

  expect(result.isError).toBe(true);
  expect(result.content[0].text).toContain('unexpected shape');
});

3. Empty result handling

The results.length === 0 guard is a common surviving mutant because most tests seed test data that returns at least one result:

// SURVIVING MUTANT: results.length === 0 → results.length !== 0
// The empty-results path and the results path swap: an empty string is
// returned when there are no results (.map over an empty array does not throw),
// and "No repositories found." is returned when there ARE results.

// Test that kills it:
test('returns "No repositories found" when search returns empty array', async () => {
  const deps = createTestDeps();
  deps.github.searchRepositories = async () => [];

  const result = await searchReposHandler({ query: 'unlikely-query-xyz', limit: 5 }, deps);

  expect(result.isError).toBeFalsy();
  expect(result.content[0].text).toBe('No repositories found.');
});

4. Boundary conditions on numeric inputs

Tools that apply minimum/maximum clamping or offset logic are frequent sources of off-by-one survivors:

// Pagination tool handler snippet:
const offset = Math.max(0, args.page - 1) * args.pageSize;
const rows = await deps.db.query(sql, [args.pageSize, offset]);

// SURVIVING MUTANT: args.page - 1 → args.page + 1
// Page 1 now fetches offset=2*pageSize (page 3's data).

// Test that kills it:
test('page 1 returns the first page of results', async () => {
  const deps = createTestDeps();
  await seedRows(deps.db, 25); // 25 total rows

  const result = await listRecordsHandler({ page: 1, pageSize: 10 }, deps);
  const rows = JSON.parse(result.content[0].text);

  // First page must contain row IDs 1–10, not 21–25
  expect(rows[0].id).toBe(1);
  expect(rows).toHaveLength(10);
});

test('page 2 returns the second page of results', async () => {
  const deps = createTestDeps();
  await seedRows(deps.db, 25);

  const result = await listRecordsHandler({ page: 2, pageSize: 10 }, deps);
  const rows = JSON.parse(result.content[0].text);

  expect(rows[0].id).toBe(11);
  expect(rows).toHaveLength(10);
});

The boundary tests work because they assert on the specific row ID returned at a specific offset — not just that rows were returned. Stryker's arithmetic operator mutations (+ vs −, * vs /) survive tests that only check response shape, not response values.
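One way to make these boundary mutants cheap to kill, offered as a sketch rather than a recommendation from the Stryker docs, is to pull the offset arithmetic into a pure function and test it with a small table of cases. The names pageOffset, pageOffsetMutant, and passesAll are hypothetical; the mutant is the page + 1 variant from the example above, written out by hand.

```typescript
// Hypothetical helper: the offset arithmetic pulled out of the handler.
function pageOffset(page: number, pageSize: number): number {
  return Math.max(0, page - 1) * pageSize;
}

// The "page - 1 → page + 1" mutant, written by hand for comparison.
function pageOffsetMutant(page: number, pageSize: number): number {
  return Math.max(0, page + 1) * pageSize;
}

// Boundary cases: [page, pageSize, expected offset].
const offsetCases: Array<[number, number, number]> = [
  [1, 10, 0], // first page starts at row 0
  [2, 10, 10], // second page starts right after the first
  [0, 10, 0], // page 0 is clamped to the first page
  [-3, 10, 0], // negative pages are clamped too
];

// True when the implementation returns the expected offset for every case.
function passesAll(impl: (page: number, pageSize: number) => number): boolean {
  return offsetCases.every(
    ([page, pageSize, expected]) => impl(page, pageSize) === expected
  );
}

console.log(passesAll(pageOffset)); // true
console.log(passesAll(pageOffsetMutant)); // false
```

Because each case pins an exact value, the arithmetic mutant fails on the very first row instead of slipping past a shape-only assertion.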

Mutation score targets

Not all code deserves the same mutation score target. Prioritise where mutations reveal real production risk:

The Stryker HTML report breaks down mutation score by file. A file with a score below 60% almost always contains an error path or edge case that has never been tested. Start there — those surviving mutants map directly to the production failures most likely to be reported.

Mutation testing in CI

Stryker's full run is too slow for every commit if you have a large codebase. Two strategies keep it practical:

Incremental mode

With incremental: true and an incrementalFile path, Stryker caches mutant results across runs. On the next run it only re-tests mutants whose source file or test file changed. Cache the incremental file in CI between runs so subsequent commits reuse results for unchanged files.

GitHub Actions configuration

# .github/workflows/mutation.yml
name: Mutation testing

on:
  push:
    branches: [main]
  pull_request:
    paths:
      - 'src/tools/**'
      - 'test/**'

jobs:
  mutation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '22'
          cache: 'npm'

      - run: npm ci

      - name: Restore Stryker incremental cache
        uses: actions/cache@v4
        with:
          path: .stryker-tmp/incremental.json
          key: stryker-${{ runner.os }}-${{ hashFiles('src/tools/**/*.ts') }}
          restore-keys: |
            stryker-${{ runner.os }}-

      - name: Run mutation tests
        run: npx stryker run
        env:
          # GitHub Actions already sets CI=true; kept explicit for clarity
          CI: true

      - name: Upload mutation report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: mutation-report
          path: reports/mutation/

The cache key includes a hash of the tool handler source. When tool handler files change, the cache key changes and Stryker re-tests all mutants in those files. When only test files change, Stryker uses the cached mutant results but re-evaluates whether the updated tests kill previously surviving mutants — this is fast because it only runs tests against the surviving mutants from the cache, not all mutants.

For pull requests, the paths filter limits the workflow to changes under src/tools/** and test/**, so mutation tests run only when tool handler code or its tests change. Unit tests and integration tests run on every push; mutation testing runs only when the logic under test, or the tests that exercise it, changes.

The thresholds.break option in stryker.config.mjs exits with code 1 when the mutation score drops below the threshold — this fails the CI step and blocks merges when test quality regresses. Set it conservatively at first (50–60%) and tighten over time as you kill more survivors.
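The arithmetic behind that gate can be sketched as follows. This reflects my understanding of how Stryker derives the score (killed and timed-out mutants count as detected; survived and uncovered mutants count as undetected), not its actual implementation. The MutantCounts type and both function names are hypothetical, and the sample counts are made up to match the 55% figure used earlier.

```typescript
interface MutantCounts {
  killed: number;
  timeout: number;
  survived: number;
  noCoverage: number;
}

interface Thresholds {
  high: number;
  low: number;
  break: number;
}

// Score = detected / (detected + undetected) * 100.
function mutationScore(c: MutantCounts): number {
  const detected = c.killed + c.timeout;
  const valid = detected + c.survived + c.noCoverage;
  // Sketch choice: report 0 when there are no valid mutants at all.
  return valid === 0 ? 0 : (detected / valid) * 100;
}

// Mirrors the idea of thresholds.break: fail when the score is below it.
function shouldFailBuild(scorePercent: number, t: Thresholds): boolean {
  return scorePercent < t.break;
}

// Illustrative counts: 11 of 20 valid mutants detected, roughly 55%.
const sampleCounts: MutantCounts = { killed: 11, timeout: 0, survived: 7, noCoverage: 2 };
const sampleScore = mutationScore(sampleCounts);

console.log(sampleScore.toFixed(1));
console.log(shouldFailBuild(sampleScore, { high: 80, low: 70, break: 65 }));
```

With these counts the score sits below the break value of 65 from the example config, so the build would fail until more survivors are killed.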

Test gaps in error paths and AliveMCP

There is a consistent pattern in MCP server post-mortems: the production incident involved an error path. The external API started returning a 503. The upstream service began returning a response body that didn't match the expected schema. The database query returned zero rows when the code assumed at least one. In each case, a test existed for the happy path — and the error path had a surviving mutant that nobody noticed.

Mutation testing and production monitoring are complementary layers of the same problem:

Without mutation testing, your test suite gives you false confidence: 90% line coverage looks good in the CI dashboard, but the error path survives because nothing checks its output. The mutation score tells a different story — 55% mutation score means nearly half your tool handler mutations produce code that your tests cannot distinguish from correct code.

Kill the surviving mutants first. Then let AliveMCP monitor whether those error paths ever actually fire in production. The two tools measure different things: mutation testing measures whether your tests would catch a bug; AliveMCP measures whether a bug is happening right now. You need both. A well-tested MCP server that is currently down is not the same as a poorly-tested server that happens to be up.

See also: MCP server unit testing for the handler-level test patterns that kill most mutants, and MCP server integration testing for the InMemoryTransport-based tests that catch error paths at the protocol level.

Related questions

How is mutation testing different from code coverage?

Code coverage (Istanbul, c8) records which lines were executed during a test run. A line counts as covered even if the test does not assert on what that line returned, modified, or produced. Mutation testing inserts one small bug at a time and checks whether any test fails. A line that is executed but not asserted on produces a surviving mutant: coverage says "covered," mutation testing says "untested." The mutation score is an estimate of the fraction of bugs your test suite would actually catch, not a guarantee.

How long does Stryker take on a typical MCP server?

A single tool handler file with 50 lines of logic generates roughly 15–25 mutants. With coverageAnalysis: 'perTest', each mutant runs only the tests that cover its source line — typically 3–8 tests per mutant rather than the full suite. For a server with 8 tool handlers, expect 120–200 mutants total and a run time of 2–5 minutes on a GitHub Actions runner. The first run is the slowest; incremental mode on subsequent runs typically reduces this to under 60 seconds if fewer than 3 files changed.

Should I aim for 100% mutation score?

No. Some mutations are semantically equivalent — they change the code without changing observable behaviour, and no test can distinguish them. Others are in defensive fallbacks that require impractical test setups (testing a catch-of-a-catch, or a type guard on data that is always the right type). Chasing 100% creates tests that are brittle and test implementation details. Target 80% for tool handler core logic, accept lower for peripheral code, and focus energy on the surviving mutants that map to real error paths.

Can mutation testing replace integration testing?

No. Mutation testing runs your existing tests against modified source — it measures the quality of whatever tests you already have. It does not replace the need for integration tests that run the full MCP protocol stack, or end-to-end tests that test the deployed server. Mutation testing is a quality metric for unit and integration tests, not a substitute for them. Use it to find the gaps, then write the missing tests, then run mutation testing again to confirm the mutants are killed.

What is the difference between a killed mutant and a timeout?

A killed mutant is one where at least one test fails with the mutation applied: the test suite detected the bug. A timeout mutant is one where the test run took too long to complete with the mutation applied, usually because the mutation introduced an infinite loop or a blocking wait. Stryker reports timeouts separately from killed mutants but counts both as detected when it calculates the mutation score, so timeouts are not survivors. They are not clean kills either. Investigate timeout mutants, because they often point to a missing test for an infinite-retry loop or a recursive call.

Further reading

Mutation testing tells you which error paths lack assertions. AliveMCP tells you when those error paths fire in production.

Kill your surviving mutants. Then let AliveMCP monitor whether the underlying failures ever actually happen — external probe, full MCP protocol, 60-second alert cadence.
