k6 Load Testing for MCP Servers — scripting protocol flows and finding limits

k6 is a developer-friendly load testing tool that speaks HTTP natively, making it well-suited for testing MCP servers over the Streamable HTTP transport. Load testing an MCP server requires scripting the full protocol flow — initialize handshake, tool discovery, tool calls — because the protocol has state and semantics that a simple HTTP GET benchmark cannot capture.

TL;DR

Script a k6 virtual user that runs the complete MCP flow: initialize → tools/list → tools/call with representative tool inputs. Use a ramp-up/sustain/ramp-down scenario. Assert on P95 latency (< 3 s for tool calls), error rate (< 1%), and that initialize always returns a valid protocolVersion. Run the load test against staging before every deploy and make the result a required CI gate. After deploy, AliveMCP takes over with continuous external monitoring. The two tools are complementary: k6 finds your pre-production breaking point, and AliveMCP detects post-production failures. Never run a k6 load test against production without first verifying that the service has enough headroom to absorb the test load on top of real traffic and monitoring probes.

Why generic HTTP load testing is insufficient for MCP servers

A naive k6 script that sends POST requests to the MCP endpoint with hardcoded JSON bodies misses several important characteristics of real MCP traffic:

Protocol handshake overhead. Every MCP session begins with an initialize request. This is not a lightweight health check: it negotiates the protocol version, exchanges capabilities, and registers client information. If your load test skips initialize, you are not testing the protocol overhead that every real client imposes.
Tool discovery overhead. Most MCP clients call tools/list once per session to discover available tools. This call is often cached server-side but still deserializes and serializes a tool registry that may contain dozens of tool definitions. Under load, this serialization becomes a significant CPU cost that a tool-call-only benchmark misses.
JSON-RPC request ID correlation. MCP uses JSON-RPC 2.0, where each request carries a string or integer ID and the response id must match it. A load test that reuses the same ID for all requests will not catch ID-collision bugs in servers that use in-memory request tracking.
Tool-specific workloads. Different tools have radically different latency and CPU profiles. A tool that reads from SQLite is fast; a tool that calls an external AI API may take 2–5 seconds. Load testing only the fast tools gives you a falsely optimistic picture of the server's capacity under realistic workloads.
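
The ID-correlation point can be made concrete with a few lines of plain JavaScript. This is a minimal sketch, not part of k6 or any MCP SDK: the names makeIdFactory, buildRequest, and idMatches are hypothetical, and the string ID format is an assumption (JSON-RPC 2.0 permits string or number IDs). In a k6 script the VU number would come from the built-in __VU variable.

```javascript
// Hypothetical helpers (not part of k6 or an MCP SDK) for JSON-RPC 2.0 request IDs.

// Returns a function that yields IDs unique across virtual users by
// prefixing a per-VU counter with the VU number, e.g. "3-1", "3-2".
function makeIdFactory(vu) {
  let counter = 0;
  return function nextId() {
    counter += 1;
    return `${vu}-${counter}`;
  };
}

// Builds a JSON-RPC 2.0 request envelope for an MCP method.
function buildRequest(id, method, params) {
  return { jsonrpc: '2.0', id, method, params: params || {} };
}

// True when the response body is valid JSON whose id equals the request id.
function idMatches(request, responseBody) {
  try {
    return JSON.parse(responseBody).id === request.id;
  } catch (e) {
    return false;
  }
}

// In a k6 script this would be: const nextId = makeIdFactory(__VU);
const nextId = makeIdFactory(3);
const req = buildRequest(nextId(), 'tools/list');
console.log(req.id); // prints "3-1"
console.log(idMatches(req, '{"jsonrpc":"2.0","id":"3-1","result":{"tools":[]}}')); // prints true
```

Adding idMatches as an extra check on each response would surface a server that answers one virtual user's request with another's ID, which a status-code check alone cannot see.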

Complete k6 load test script for Streamable HTTP MCP servers

import http from 'k6/http';
import { check, sleep } from 'k6';
import { Counter, Trend } from 'k6/metrics';

// Custom metrics for MCP-specific tracking
const mcpInitErrors = new Counter('mcp_init_errors');
const mcpToolErrors = new Counter('mcp_tool_errors');
const mcpInitDuration = new Trend('mcp_init_duration', true);
const mcpToolDuration = new Trend('mcp_tool_duration', true);

const MCP_URL = __ENV.MCP_URL || 'https://mcp-staging.example.com/mcp';
const headers = {
  'Content-Type': 'application/json',
  'Accept': 'application/json, text/event-stream',
};

// Scenario configuration: ramp up → sustain → ramp down
export const options = {
  scenarios: {
    mcp_load: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '2m', target: 20 },   // Ramp up to 20 VUs over 2 minutes
        { duration: '5m', target: 20 },   // Sustain 20 VUs for 5 minutes
        { duration: '2m', target: 50 },   // Spike to 50 VUs
        { duration: '3m', target: 50 },   // Sustain spike
        { duration: '2m', target: 0 },    // Ramp down
      ],
    },
  },

  // Thresholds — test fails if these are violated
  thresholds: {
    'http_req_failed': ['rate<0.01'],             // < 1% HTTP errors overall
    'mcp_init_errors': ['count<5'],               // Fewer than 5 initialize failures in entire test
    // mcp_tool_errors is a Counter, so 'rate' means errors per second, not a percentage
    'mcp_tool_errors': ['rate<0.02'],             // < 0.02 tool call errors per second
    'mcp_init_duration': ['p(95)<1000'],          // 95th percentile initialize < 1 s
    // Both limits go in one array: a repeated object key would silently replace the first
    'mcp_tool_duration': ['p(95)<3000', 'p(99)<8000'],  // p95 < 3 s, p99 < 8 s
  },
};

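// Request IDs unique across the whole test. Each VU has its own JS runtime and
// therefore its own counter, so the VU number is folded into the ID
// (unique as long as a VU sends fewer than 1,000,000 requests).
let requestId = 0;
function nextId() { return __VU * 1000000 + (++requestId); }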

// Step 1: MCP initialize handshake
function mcpInitialize() {
  const start = Date.now();
  const res = http.post(MCP_URL, JSON.stringify({
    jsonrpc: '2.0',
    id: nextId(),
    method: 'initialize',
    params: {
      // Streamable HTTP was introduced in MCP revision 2025-03-26
      protocolVersion: '2025-03-26',
      clientInfo: { name: 'k6-load-test', version: '1.0.0' },
      capabilities: {},
    },
  }), { headers });

  mcpInitDuration.add(Date.now() - start);

  const ok = check(res, {
    'initialize status 200': r => r.status === 200,
    'initialize returns protocolVersion': r => {
      try {
        const body = JSON.parse(r.body);
        // The server may negotiate a different revision; accept any date-formatted version
        return /^\d{4}-\d{2}-\d{2}$/.test(body.result?.protocolVersion);
      } catch { return false; }
    },
    'initialize has no error': r => {
      try {
        const body = JSON.parse(r.body);
        return !body.error;
      } catch { return false; }
    },
  });

  if (!ok) mcpInitErrors.add(1);
  return ok;
}

// Step 2: Send initialized notification (required by MCP spec)
function mcpInitialized() {
  http.post(MCP_URL, JSON.stringify({
    jsonrpc: '2.0',
    method: 'notifications/initialized',
    params: {},
  }), { headers });
  // Notifications carry no id; a Streamable HTTP server acknowledges with 202 Accepted and no body
}

// Step 3: List available tools
function mcpToolsList() {
  const res = http.post(MCP_URL, JSON.stringify({
    jsonrpc: '2.0',
    id: nextId(),
    method: 'tools/list',
    params: {},
  }), { headers });

  return check(res, {
    'tools/list status 200': r => r.status === 200,
    'tools/list returns array': r => {
      try {
        const body = JSON.parse(r.body);
        return Array.isArray(body.result?.tools);
      } catch { return false; }
    },
  });
}

// Step 4: Call a representative tool with realistic inputs
function mcpToolCall(toolName, toolInput) {
  const start = Date.now();
  const res = http.post(MCP_URL, JSON.stringify({
    jsonrpc: '2.0',
    id: nextId(),
    method: 'tools/call',
    params: {
      name: toolName,
      arguments: toolInput,
    },
  }), { headers, timeout: '10s' });

  mcpToolDuration.add(Date.now() - start);

  const ok = check(res, {
    [`${toolName} status 200`]: r => r.status === 200,
    [`${toolName} returns result or error`]: r => {
      try {
        const body = JSON.parse(r.body);
        return body.result !== undefined || body.error !== undefined;
      } catch { return false; }
    },
    [`${toolName} not server error`]: r => {
      // isError:true is a valid MCP tool result (the tool failed gracefully)
      // HTTP 5xx means the server itself crashed — that is a load test failure
      return r.status < 500;
    },
  });

  if (!ok) mcpToolErrors.add(1);
  return ok;
}

// Main VU function — runs once per iteration per virtual user
export default function () {
  // Full MCP session flow
  if (!mcpInitialize()) {
    sleep(1);
    return; // Don't proceed if initialize failed
  }

  mcpInitialized();
  mcpToolsList();

  // Call representative tools — adjust to match your server's actual tool set
  // Mix of fast and slow tools to simulate realistic load distribution
  const toolScenario = Math.random();
  if (toolScenario < 0.5) {
    // Fast tool: database read (50% of calls)
    mcpToolCall('get_record', { id: `rec_${Math.floor(Math.random() * 1000)}` });
  } else if (toolScenario < 0.8) {
    // Medium tool: search with filtering (30% of calls)
    mcpToolCall('search_records', { query: 'test', limit: 10 });
  } else {
    // Slow tool: external API call (20% of calls)
    mcpToolCall('fetch_external_data', { source: 'api', resource_id: `r_${Math.floor(Math.random() * 100)}` });
  }

  // Realistic think time between sessions (0.5–2 seconds)
  sleep(0.5 + Math.random() * 1.5);
}
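
Two details of Streamable HTTP are worth handling when adapting the script above. A server may answer a POST with an SSE-framed body (text/event-stream) instead of plain JSON, in which case a bare JSON.parse fails. A stateful server may also return an Mcp-Session-Id header on the initialize response and expect it on every later request. The following is a minimal sketch in plain JavaScript: parseMcpBody, getHeader, and withSession are hypothetical helper names, and the parser assumes each SSE event carries its JSON on a single data line.

```javascript
// Hypothetical helpers for Streamable HTTP responses; not part of k6 itself.

// Parses a body that is either plain JSON or SSE-framed
// ("event: message\ndata: {...}\n\n"). Returns the last JSON-RPC message
// carrying a result or error, or null if nothing can be parsed.
// Assumption: each event's JSON fits on one "data:" line.
function parseMcpBody(body) {
  if (typeof body !== 'string' || body.length === 0) return null;
  try {
    return JSON.parse(body);
  } catch (e) {
    // Not plain JSON: fall through and treat the body as SSE text.
  }
  let found = null;
  for (const line of body.split(/\r?\n/)) {
    if (!line.startsWith('data:')) continue;
    try {
      const msg = JSON.parse(line.slice(5).trim());
      if (msg && (msg.result !== undefined || msg.error !== undefined)) found = msg;
    } catch (e) {
      // Ignore data lines that are not complete JSON documents.
    }
  }
  return found;
}

// Looks up a header case-insensitively in a headers object.
function getHeader(headers, name) {
  const wanted = name.toLowerCase();
  for (const key of Object.keys(headers || {})) {
    if (key.toLowerCase() === wanted) return headers[key];
  }
  return undefined;
}

// Returns the request headers extended with the session ID from the
// initialize response, or the base headers unchanged if none was issued.
function withSession(baseHeaders, initResponseHeaders) {
  const sessionId = getHeader(initResponseHeaders, 'Mcp-Session-Id');
  return sessionId
    ? Object.assign({}, baseHeaders, { 'Mcp-Session-Id': sessionId })
    : baseHeaders;
}

const sseBody = 'event: message\ndata: {"jsonrpc":"2.0","id":1,"result":{"tools":[]}}\n\n';
console.log(Array.isArray(parseMcpBody(sseBody).result.tools)); // prints true
console.log(withSession({ Accept: 'application/json' }, { 'mcp-session-id': 'abc' }));
```

One way to use these in the script above: have mcpInitialize return withSession(headers, res.headers) and pass that object to the later calls, and call parseMcpBody(r.body) inside the checks in place of JSON.parse(r.body).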

Running the load test and interpreting results

# Run against staging environment
MCP_URL=https://mcp-staging.example.com/mcp k6 run mcp-load-test.js

# Run with output to InfluxDB for Grafana dashboards (optional)
k6 run --out influxdb=http://influxdb:8086/k6 mcp-load-test.js

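# Write raw metric samples to JSON for offline analysis
k6 run --out json=results.json mcp-load-test.js

# Export an HTML report from the built-in web dashboard (k6 v0.49+)
K6_WEB_DASHBOARD=true K6_WEB_DASHBOARD_EXPORT=report.html k6 run mcp-load-test.js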

Key metrics to watch in k6 output

Metric	Target threshold	What a violation means
`mcp_init_duration p(95)`	< 1,000 ms	Server is slow to complete protocol handshake; check session store or capability negotiation
`mcp_tool_duration p(95)`	< 3,000 ms for DB tools; < 10,000 ms for external API tools	Tool handlers are slow; check database query plans or upstream API latency
`http_req_failed`	< 1%	Server returning 5xx; check process memory, connection pool, or event loop metrics
`mcp_init_errors count`	< 5 total	Initialize failures indicate protocol-level problems; check SDK version or session management
`http_reqs rate`	Set your target RPS based on production traffic estimates	Actual throughput under load; divide by VU count to get per-session RPS

Reading the end-of-test summary

# Example k6 output showing threshold violations
FAIL ✗ mcp_tool_duration.........: avg=4521ms min=120ms med=3800ms max=45200ms p(90)=8100ms p(95)=12400ms
     ✗ { threshold: 'p(95)<3000' }

# What this tells you:
# - Median tool call: 3.8 s (half of all calls already exceed the 3 s P95 target)
# - P95: 12.4 s (exceeds the 3 s threshold; the slowest 5% of calls take 12.4 s or longer)
# - Max: 45.2 s (some tool calls take 45 seconds, likely a timeout or external API stall)
# Fix: add timeout to external API calls in tool handlers; scale up database connection pool
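
The same reading can be automated. k6 passes an end-of-test summary object to an exported handleSummary(data) function, where each metric exposes its aggregated values and the pass/fail state of its thresholds. The sketch below is plain JavaScript with a hypothetical helper name, failedThresholds. The sample object reuses the figures from the output above for mcp_tool_duration, while the http_req_failed entry is a made-up value for illustration.

```javascript
// Hypothetical helper: lists violated thresholds from the object that k6
// passes to handleSummary(data).
function failedThresholds(data) {
  const failures = [];
  for (const [name, metric] of Object.entries(data.metrics || {})) {
    for (const [expr, outcome] of Object.entries(metric.thresholds || {})) {
      if (!outcome.ok) {
        failures.push({ metric: name, threshold: expr, values: metric.values });
      }
    }
  }
  return failures;
}

// Wiring inside a k6 script (shown as comments so this block runs standalone):
//   export function handleSummary(data) {
//     return { 'failed-thresholds.json': JSON.stringify(failedThresholds(data), null, 2) };
//   }

// Sample shaped like k6's summary data; the http_req_failed rate is invented.
const sample = {
  metrics: {
    mcp_tool_duration: {
      values: { med: 3800, 'p(95)': 12400, max: 45200 },
      thresholds: { 'p(95)<3000': { ok: false } },
    },
    http_req_failed: {
      values: { rate: 0.004 },
      thresholds: { 'rate<0.01': { ok: true } },
    },
  },
};

for (const f of failedThresholds(sample)) {
  console.log(`${f.metric} violated ${f.threshold}: p(95)=${f.values['p(95)']} ms`);
}
```

Exporting handleSummary replaces k6's default end-of-test summary, so a script that still wants the console output would need to return a stdout entry as well.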

Load testing SSE-transport MCP servers

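SSE-transport MCP servers require a different k6 approach because the protocol uses a long-lived SSE stream for server-to-client messages and HTTP POST for client-to-server messages. The k6/http module has no streaming API: it buffers each response until the server closes it or the request times out, so it cannot follow an SSE stream as events arrive. The script below approximates the flow by opening the stream with a short timeout, reading the endpoint event from the buffered body if one is returned, and then exercising the POST side. Treat it as a test of the HTTP layer rather than of the stream itself.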

import http from 'k6/http';
import { check } from 'k6';

// For SSE transport: establish the SSE connection, then POST messages to /messages endpoint
export default function () {
  // 1. Open SSE stream: the server sends an 'endpoint' event with the session POST URL
  const sseRes = http.get('https://mcp-staging.example.com/sse', {
    headers: { Accept: 'text/event-stream' },
    timeout: '5s',
  });

  // Extract the POST endpoint URL from the SSE data (server sends "data: /messages?sessionId=XXX")
  // The body can be null if the request timed out before any data was buffered
  const endpointMatch = (sseRes.body || '').match(/data: (\/messages\?sessionId=\S+)/);
  if (!endpointMatch) return;
  const postUrl = `https://mcp-staging.example.com${endpointMatch[1]}`;

  // 2. POST initialize to the session endpoint
  const initRes = http.post(postUrl, JSON.stringify({
    jsonrpc: '2.0',
    id: 1,
    method: 'initialize',
    params: {
      protocolVersion: '2024-11-05',
      clientInfo: { name: 'k6-sse-test', version: '1.0.0' },
      capabilities: {},
    },
  }), { headers: { 'Content-Type': 'application/json' } });

  check(initRes, { 'SSE initialize accepted': r => r.status === 202 });

  // Note: in real SSE transport, the initialize response arrives on the SSE stream.
  // k6 reads the stream body once — for sustained SSE connections, use the k6 browser
  // or a custom extension. For basic load testing of the HTTP layer, this pattern is sufficient.
}

For sustained SSE connection load testing at scale — simulating hundreds of concurrent long-lived SSE streams — consider using a dedicated SSE load testing tool (Artillery with the SSE plugin, or a custom Go/Node.js harness) alongside k6 for the HTTP layer testing. k6's strength is the HTTP request-response pattern; SSE stream simulation benefits from tools built specifically for persistent connections.

Integrating k6 with CI/CD and AliveMCP

CI/CD gate on load test thresholds

# .github/workflows/deploy.yml
jobs:
  load-test:
    runs-on: ubuntu-latest
    needs: deploy-staging
    steps:
      - uses: actions/checkout@v4
      - name: Install k6
        run: |
          curl https://github.com/grafana/k6/releases/download/v0.52.0/k6-v0.52.0-linux-amd64.tar.gz -L | tar xvz
          sudo mv k6-v0.52.0-linux-amd64/k6 /usr/local/bin/k6

      - name: Run MCP load test against staging
        env:
          MCP_URL: ${{ secrets.STAGING_MCP_URL }}
        run: k6 run mcp-load-test.js
        # k6 exits non-zero (code 99) if any threshold is violated → CI job fails → deploy blocked

  deploy-production:
    needs: load-test   # Production deploy only runs if load test passes
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        # ... your deployment step

      - name: Verify AliveMCP probe after deploy
        run: |
          # Wait for AliveMCP to run a fresh probe
          sleep 90
          # Check if AliveMCP sees the endpoint as up
          STATUS=$(curl -sf "https://api.alivemcp.com/v1/monitors/${{ secrets.ALIVEMCP_MONITOR_ID }}/status" \
            -H "Authorization: Bearer ${{ secrets.ALIVEMCP_API_KEY }}" | jq -r '.status')
          [ "$STATUS" = "up" ] || (echo "AliveMCP reports endpoint down after deploy" && exit 1)

The k6 / AliveMCP boundary

k6 load testing and AliveMCP external monitoring are complementary tools that cover different phases of the deployment lifecycle:

Phase	Tool	What it checks
Pre-deploy staging validation	k6	Throughput, P95 latency, error rate under load — finds the breaking point before production
Post-deploy smoke check	AliveMCP API call in CI	Confirms the production endpoint is up and speaking MCP within 90 seconds of deploy
Steady-state production monitoring	AliveMCP (continuous)	Every-minute protocol probe; TLS expiry; latency trend; 90-day uptime history
Capacity regression detection	k6 (weekly baseline)	Run the same load test weekly; compare P95 vs previous week; catch latency regressions before they affect users

Spike testing and soak testing patterns

Spike test: validate autoscaling response

export const options = {
  scenarios: {
    spike: {
      executor: 'ramping-vus',
      stages: [
        { duration: '30s', target: 5 },    // Baseline
        { duration: '10s', target: 100 },  // Sudden spike to 100 VUs
        { duration: '3m',  target: 100 },  // Hold spike (HPA should scale out)
        { duration: '10s', target: 5 },    // Drop back to baseline
        { duration: '2m',  target: 5 },    // Hold baseline (HPA should scale in)
        { duration: '10s', target: 0 },
      ],
    },
  },
  // During spike, expect some queuing latency, so be more lenient on P95
  thresholds: {
    'mcp_tool_duration': ['p(95)<5000'],  // 5 s p95 during spike (vs 3 s normal)
    'http_req_failed': ['rate<0.02'],
  },
};

Soak test: validate memory and connection pool stability over time

export const options = {
  scenarios: {
    soak: {
      executor: 'constant-vus',
      vus: 10,
      duration: '2h',   // 2-hour sustained load — catches memory leaks and pool drift
    },
  },
  thresholds: {
    // A leak shows up as P95 rising steadily over the 2 hours; this end-of-run threshold only trips once the overall P95 crosses 3 s
    'mcp_tool_duration': ['p(95)<3000'],
    'http_req_failed': ['rate<0.01'],
  },
};

After a soak test, compare the P95 latency in the first 10 minutes versus the last 10 minutes. A 20%+ increase indicates a gradual degradation — likely a memory leak in a tool handler, a connection pool that is not returning connections after long-running operations, or a caching layer that is growing unbounded. These are exactly the failures that AliveMCP's 90-day latency graph will show as a gradual upward trend if they escape into production undetected.
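
The first-versus-last comparison can be sketched as follows. This is a minimal, hedged example in plain JavaScript: percentile and p95Drift are hypothetical helper names, the nearest-rank percentile method is an assumption (k6's own summary may interpolate differently), and the sample points are synthetic. In practice the points could be extracted from the samples that k6 run --out json=results.json writes for mcp_tool_duration.

```javascript
// Hypothetical helpers for comparing early vs late latency in a soak run.

// Nearest-rank percentile (p between 0 and 100) of an array of numbers.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// points:   [{ t: milliseconds since test start, ms: tool call duration }]
// windowMs: size of the first and last comparison windows.
// Returns the P95 of each window and the relative increase between them.
function p95Drift(points, windowMs) {
  const end = points.reduce((max, pt) => Math.max(max, pt.t), 0);
  const first = points.filter(pt => pt.t < windowMs).map(pt => pt.ms);
  const last = points.filter(pt => pt.t > end - windowMs).map(pt => pt.ms);
  const firstP95 = percentile(first, 95);
  const lastP95 = percentile(last, 95);
  return { firstP95, lastP95, increase: (lastP95 - firstP95) / firstP95 };
}

// Synthetic samples for illustration only: one call per minute over 2 hours,
// with latency creeping up from 1000 ms by 4 ms per minute.
const points = [];
for (let minute = 0; minute < 120; minute++) {
  points.push({ t: minute * 60000, ms: 1000 + minute * 4 });
}

const drift = p95Drift(points, 10 * 60000);
console.log(drift.increase > 0.2 ? 'degradation suspected' : 'stable'); // prints "degradation suspected"
```

The 0.2 cutoff mirrors the 20% figure above; a noisier service may need a larger window or a higher cutoff to avoid false alarms.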

Frequently asked questions

How many VUs should I target in my k6 load test?

Start by estimating your expected peak production traffic: how many concurrent MCP client sessions do you expect at peak? Each k6 VU simulates one client session, looping through initialize → tools/list → tool calls → think time. If you expect 50 concurrent sessions in production, test to 100 VUs (2× peak) to find your headroom. If your production traffic is unknown (new product, no historical data), use the capacity planning formulas in the capacity planning guide to derive an estimate from resource limits. Always run at least one test that intentionally exceeds your scaling limits: this establishes your actual breaking point and confirms you are not already at your limit in production. A single k6 instance on adequately sized hardware can usually drive thousands of VUs; beyond what one machine can generate, use Grafana Cloud k6 or distribute the test across multiple machines.

Should I run k6 load tests against production or only staging?

Staging only, with caveats. Running load tests against production risks: (1) impacting real users if the test pushes the server to its limits, (2) polluting your analytics and waitlist data with synthetic tool call results, (3) triggering paid API calls if your MCP tools proxy to external services like OpenAI or Stripe. Staging should match production as closely as possible — same instance types, same Kubernetes resource limits, same HPA configuration — so that staging results are meaningful predictors of production behavior. The one exception is occasional low-intensity smoke tests against production (5–10 VUs for 2 minutes) to validate that the deployment is behaving correctly under real traffic conditions. AliveMCP's continuous 1-VU external probe effectively provides this signal automatically for the protocol layer, without the risk of overwhelming production with a full load test.

How do I script k6 to use different tool inputs for each virtual user?

Use k6's SharedArray for pre-generated test inputs, or generate inputs procedurally in the default function using the built-in variables __VU (virtual user number) and __ITER (iteration count). The SharedArray approach is best when you have a dataset of representative inputs: import { SharedArray } from 'k6/data'; const inputs = new SharedArray('inputs', () => JSON.parse(open('./test-inputs.json'))); then access inputs[__ITER % inputs.length] in the default function. For procedural generation, use __VU as a user ID seed: const userId = `user_${__VU}`; const recordId = `rec_${__VU * 1000 + __ITER}`; this gives each VU its own record range as long as a VU runs fewer than 1,000 iterations, which keeps cache effects from distorting results. Avoid using the same input for every request: it warms caches unrealistically and makes your results over-optimistic.
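
Both approaches can be sketched as small helpers. This is an illustrative example in plain JavaScript, not a k6 API: pickInput and recordIdFor are hypothetical names, and offsetting the index by the VU number is a variation on the plain __ITER index, added so that concurrent VUs do not walk the dataset in lockstep.

```javascript
// Hypothetical helpers. In k6, `inputs` would be a SharedArray and the
// vu/iter arguments would be the built-in __VU and __ITER variables.

// Picks a dataset entry, offset by VU number so that concurrent VUs do not
// all request the same entry on the same iteration.
function pickInput(inputs, vu, iter) {
  return inputs[(vu + iter) % inputs.length];
}

// Procedural record ID. Ranges are disjoint per VU only while iter < 1000.
function recordIdFor(vu, iter) {
  return `rec_${vu * 1000 + iter}`;
}

// Wiring inside a k6 script (shown as comments so this block runs standalone):
//   import { SharedArray } from 'k6/data';
//   const inputs = new SharedArray('inputs', () => JSON.parse(open('./test-inputs.json')));
//   export default function () {
//     const input = pickInput(inputs, __VU, __ITER);
//     const recordId = recordIdFor(__VU, __ITER);
//   }

const demoInputs = [{ query: 'alpha' }, { query: 'beta' }, { query: 'gamma' }];
console.log(pickInput(demoInputs, 1, 0).query); // prints "beta"
console.log(pickInput(demoInputs, 2, 0).query); // prints "gamma"
console.log(recordIdFor(2, 5));                 // prints "rec_2005"
```

If a soak test runs more than 1,000 iterations per VU, widening the multiplier in recordIdFor keeps the per-VU ranges from overlapping.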

My k6 test passes in staging but I see latency spikes in production. What should I check?

The most common cause is that staging does not match production in one of these dimensions: (1) Data volume: staging has 1,000 records, production has 10 million; queries that hit an index in staging hit an unindexed scan in production at scale. Run EXPLAIN on your most common queries against production-size data. (2) Geographic latency: k6 runs locally or from a single region; production clients are geographically distributed with higher RTT. k6 has no built-in network emulation (the --env flag only passes a variable to the script), so add latency at the OS level, for example with Linux tc/netem on the load generator, or run k6 from the regions where your clients actually are. (3) Warm vs cold caches: staging restarts frequently, flushing in-memory caches; production has warm caches from continuous traffic. If your tool handlers cache results, the first few minutes of production traffic after a deploy will be slower than the steady state. AliveMCP's latency graph shows this warm-up curve; look for the first few minutes after a deploy in the 90-day response time graph to calibrate expectations.

How does k6 load testing complement AliveMCP continuous monitoring?

k6 and AliveMCP answer different questions. k6 asks: "how does the server behave under synthetic controlled load?" It is a pre-production tool for finding breaking points, tuning thresholds, and validating HPA configuration before a deploy goes live. AliveMCP asks: "is the server alive and correctly speaking MCP right now?" It is a continuous post-production tool for detecting real-world failures — TLS expiry, infrastructure drift, upstream dependency failures, memory leaks that only manifest after days of production traffic. The two tools are designed to be used together: k6 prevents deploys that would cause immediate failures under load; AliveMCP catches the long-tail failures that k6 cannot simulate because they require days or weeks of production traffic to manifest. If AliveMCP's response time graph shows latency increasing over a week, run a soak test with k6 to reproduce the degradation in staging and identify the root cause before it reaches user-visible levels.