
MCP server concurrency

The MCP TypeScript SDK dispatches tool calls concurrently: if two CallToolRequest messages arrive before the first handler returns, both handlers are in flight at once on the same event loop, interleaving wherever one of them awaits. This is usually what you want (high throughput, no blocking), but it creates two hazards: shared mutable state (two handlers that read and write the same variable across an await can interleave in unexpected ways) and resource exhaustion (an LLM agent issuing 50 simultaneous tool calls can open 50 database connections). This guide shows how to detect both hazards and fix them with async-mutex, p-limit, and back-pressure patterns.

TL;DR

For shared mutable state: use async-mutex to serialize access to the shared resource. For resource exhaustion: use p-limit to cap how many tool calls run concurrently. For database connections specifically: use a connection pool with a max cap and rely on the pool's built-in queuing rather than implementing your own. Test concurrent handlers by running Promise.all() with multiple simultaneous tool calls through InMemoryTransport and asserting correct outcomes.

How the MCP SDK handles concurrent calls

The MCP SDK processes incoming messages from the transport as they arrive. There is no automatic queuing or serialization of CallToolRequest messages: a request that arrives while an earlier handler is still awaiting starts its own handler invocation immediately, and each invocation returns its own Promise. Only one handler executes JavaScript at any instant, but the handlers take turns at every await, so their steps interleave.
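The interleaving can be seen without the SDK at all. The following is a minimal sketch in which `handle` and `pause` are hypothetical stand-ins for a tool handler and its awaited I/O; it is not SDK code, only an illustration of how two in-flight async functions take turns at each await.

```typescript
// Stand-in for any awaited I/O (database query, HTTP call, file read).
const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical handler: records when it starts and when it finishes.
async function handle(id: string, ms: number, events: string[]): Promise<void> {
  events.push(`${id}:start`);
  await pause(ms); // control returns to the event loop here
  events.push(`${id}:end`);
}

// Start both calls before either finishes, like two tool calls arriving
// back to back, and return the order in which their steps ran.
async function interleavingDemo(): Promise<string[]> {
  const events: string[] = [];
  await Promise.all([handle('A', 20, events), handle('B', 5, events)]);
  return events;
}
```

Calling `interleavingDemo()` yields the order A:start, B:start, B:end, A:end. B starts before A has finished and even completes first, because A's await released the event loop.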

This mirrors how any Node.js HTTP server works: multiple requests are served concurrently by default. Most of the time this is correct. The cases where it requires attention:

Scenario	Problem	Fix
Handler reads then writes shared state	Read-modify-write race: both handlers read the same value, both write back, one update is lost	async-mutex around the read-modify-write
Handler creates a resource (file, connection) checked for existence first	Check-then-act race: both check "does file exist?", both create, one overwrites	atomic create-if-not-exists (O_CREAT | O_EXCL) or mutex
LLM agent issues many simultaneous calls	Resource exhaustion: database pool saturated, rate limit hit, OOM	p-limit cap per resource or at the server level
Handler has per-call setup/teardown	Teardown of call A runs while call B is using the resource they share	Scope setup/teardown to each call explicitly; use try/finally
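For the check-then-act row, Node.js exposes the O_CREAT | O_EXCL combination through the 'wx' file-system flag. The sketch below assumes a Node.js runtime; `createIfAbsent` is a hypothetical helper name, not part of the MCP SDK.

```typescript
import { open } from 'node:fs/promises';

// Creates the file only if it does not exist yet. The existence check and
// the creation happen in one atomic system call, so two concurrent callers
// cannot both succeed. Returns true for the caller that created the file.
async function createIfAbsent(path: string, contents: string): Promise<boolean> {
  try {
    const handle = await open(path, 'wx'); // 'wx' fails with EEXIST if the file exists
    try {
      await handle.writeFile(contents);
    } finally {
      await handle.close(); // always release the descriptor
    }
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === 'EEXIST') {
      return false; // another caller created the file first
    }
    throw err; // any other failure is a real error
  }
}
```

The design choice here is to let the operating system arbitrate rather than adding a lock in the application: whichever call loses simply gets `false` and can report that the resource already exists.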

Shared mutable state race conditions

The classic Node.js concurrency bug is the read-modify-write race. It looks like this in an MCP handler:

// Shared state (module-level)
let requestCount = 0;
const activeUsers = new Set<string>();

// BUG: check-then-act race across an await
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'register_user') {
    const { userId } = request.params.arguments as { userId: string };

    // Race: two concurrent calls both see activeUsers.size = 9
    if (activeUsers.size >= 10) {
      return { content: [{ type: 'text', text: 'User limit reached' }], isError: true };
    }
    // The handler yields here, so a second call can run its own size check
    // before this one adds. verifyUser is a hypothetical async lookup,
    // for example an HTTP call to an identity service.
    await verifyUser(userId);
    // Both passed the check, both add: now activeUsers.size = 11
    activeUsers.add(userId);
    requestCount++;
    return { content: [{ type: 'text', text: 'Registered' }] };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

Because an await sits between the activeUsers.size check and activeUsers.add(), the two steps are not atomic: two concurrent calls can both read 9, both pass the >= 10 check, and both insert, ending up with 11 users despite the limit of 10. This is a logic-level race in JavaScript, not a threading race (there are no threads). It can only happen across await boundaries, where handlers interleave; a check and an add with no await between them run to completion without interruption.

// Fix: async-mutex for the critical section
import { Mutex } from 'async-mutex';

const userMutex = new Mutex();
const activeUsers = new Set<string>();

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'register_user') {
    const { userId } = request.params.arguments as { userId: string };

    // Only one caller at a time runs the check, the awaited work, and the add
    return userMutex.runExclusive(async () => {
      if (activeUsers.size >= 10) {
        return { content: [{ type: 'text', text: 'User limit reached' }], isError: true };
      }
      await verifyUser(userId); // hypothetical async lookup
      activeUsers.add(userId);
      return { content: [{ type: 'text', text: 'Registered' }] };
    });
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

runExclusive queues callers: if a second call arrives while the first is in the critical section, the second waits until the first finishes. Keep the critical section as short as possible, because every awaited millisecond inside it is time that all other callers spend waiting. A mutex is only needed when the critical section contains an await; a check and an update with no await between them cannot be interleaved, so where you can, do the awaited work before the check instead. For critical sections that include database calls, prefer database-level locking (SELECT FOR UPDATE, transactions with serializable isolation) over application-level mutexes.
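To make the queuing behavior concrete, here is a dependency-free sketch. `SimpleMutex` is a hypothetical promise-chain lock written for illustration; it mimics the shape of runExclusive but is not async-mutex's implementation. `registerAll` and `pause` are likewise illustrative names.

```typescript
// Illustrative promise-chain lock: each task starts only after the
// previous one has settled.
class SimpleMutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Swallow the outcome on the chain so a failed task does not block later ones.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Fires 20 concurrent registrations against a limit of 10 and returns
// how many users ended up registered.
async function registerAll(useMutex: boolean): Promise<number> {
  const LIMIT = 10;
  const users = new Set<string>();
  const mutex = new SimpleMutex();

  const register = async (userId: string): Promise<void> => {
    if (users.size >= LIMIT) return; // check
    await pause(1); // stand-in for awaited work between check and add
    users.add(userId); // act
  };

  await Promise.all(
    Array.from({ length: 20 }, (_, i) => {
      const id = `user-${i}`;
      return useMutex ? mutex.runExclusive(() => register(id)) : register(id);
    }),
  );
  return users.size;
}
```

`registerAll(false)` resolves to 20, because every call passes the check before any call adds. `registerAll(true)` resolves to 10, because each call runs its check only after the previous call has finished adding.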

Capping concurrency with p-limit

When you want to allow concurrent tool calls but limit how many run simultaneously, p-limit is the right tool. It is not a mutex — it allows up to N concurrent operations, not exactly 1.

npm install p-limit

import pLimit from 'p-limit';

// Allow at most 5 concurrent calls to the external API
const apiCallLimit = pLimit(5);

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'enrich_record') {
    const { recordId } = request.params.arguments as { recordId: string };

    // If 5 calls are already running, this waits in the queue
    return apiCallLimit(async () => {
      const enriched = await externalApi.enrich(recordId);
      return { content: [{ type: 'text', text: JSON.stringify(enriched) }] };
    });
  }
});

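Use case for p-limit: your MCP server wraps an external API that tolerates at most 5 requests in flight at a time. An LLM agent calls 50 tools simultaneously. Without limiting, 50 requests hit the external API at once and most get 429 errors. With pLimit(5), only 5 run at once and the rest queue. Note that p-limit caps concurrency, not rate: if the API's limit is 5 requests per second and each call takes 100 ms, five concurrent slots can still send about 50 requests per second. For a per-second limit, put a rate limiter (for example a token bucket) in front of the call as well.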
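A concurrency cap bounds how many calls are in flight, not how many start per second. When the upstream limit is expressed per time window, a small limiter can sit alongside the cap. The following is a minimal sliding-window sketch; `SlidingWindowLimiter`, `limited`, and their parameters are hypothetical names, and a production server would more likely use an established rate-limiting package.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Allows at most `maxPerWindow` acquisitions in any `windowMs` interval.
class SlidingWindowLimiter {
  private readonly maxPerWindow: number;
  private readonly windowMs: number;
  private readonly starts: number[] = []; // timestamps of recent acquisitions

  constructor(maxPerWindow: number, windowMs: number) {
    this.maxPerWindow = maxPerWindow;
    this.windowMs = windowMs;
  }

  async acquire(): Promise<void> {
    for (;;) {
      const now = Date.now();
      // Forget acquisitions that have left the window.
      while (this.starts.length > 0 && now - (this.starts[0] as number) >= this.windowMs) {
        this.starts.shift();
      }
      if (this.starts.length < this.maxPerWindow) {
        this.starts.push(now);
        return;
      }
      // Wait until the oldest acquisition leaves the window, then re-check.
      await sleep((this.starts[0] as number) + this.windowMs - now);
    }
  }
}

// Gate an outbound call on the limiter before running it.
async function limited<T>(limiter: SlidingWindowLimiter, call: () => Promise<T>): Promise<T> {
  await limiter.acquire();
  return call();
}
```

With `new SlidingWindowLimiter(5, 1000)`, at most five calls start in any one-second interval regardless of how quickly each one completes. Combining it with a concurrency cap covers both kinds of upstream limit.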

Typical concurrency limits by resource type:

Resource	Typical p-limit value	Why
External HTTP API (rate-limited)	The API's concurrent-request allowance	Avoid 429 errors; queue excess calls (add a rate limiter if the limit is per second)
CPU-intensive worker pool	Worker pool size (maxThreads)	No point queuing more than the pool can handle
Database (no pool)	2–5	SQLite allows one writer at a time, so use 1 for write-heavy tools; Postgres: stay below the server's connection limit
File system writes	1–10	Too many concurrent writes cause I/O saturation on spinning disks

Per-connection vs global state

MCP servers may handle multiple simultaneous client connections (multiple LLM agents, or one agent with multiple sessions). State can be scoped per-connection (isolated: each connection has its own copy, although a single client can still issue concurrent calls against it) or global (shared by every connection, so it requires explicit concurrency control).

// Per-connection state: scoped inside the createServer function
// Each client connection gets a fresh server instance with its own state
export function createServer(): Server {
  // State is per-server-instance, not module-level
  const sessionData = new Map<string, unknown>();
  let callCount = 0;

  const server = new Server(
    { name: 'my-mcp', version: '1.0.0' },
    { capabilities: { tools: {} } }
  );

  server.setRequestHandler(CallToolRequestSchema, async (request) => {
    callCount++; // isolated from other connections; a synchronous increment cannot be interleaved
    // ...
  });

  return server;
}

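import type { Transport } from '@modelcontextprotocol/sdk/shared/transport.js';

// Main: create a fresh server per client session.
// handleNewSession is a hypothetical hook: call it wherever your transport
// layer produces a new Transport (for example, an HTTP route that creates
// one transport per session).
async function handleNewSession(transport: Transport): Promise<void> {
  const server = createServer();
  await server.connect(transport);
}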

When state must be shared across connections (a shared database, a shared cache, a shared rate limit), declare it outside createServer() at module level and apply mutex/p-limit as described above.

Back-pressure: rejecting when overloaded

Queuing excess calls with p-limit is appropriate when the queue is bounded in size and drains quickly. For public or multi-tenant MCP servers, an unbounded queue is itself a resource exhaustion risk — a flood of requests fills memory. Implement explicit back-pressure: reject calls when the queue is full, rather than queuing indefinitely.

const MAX_QUEUE_DEPTH = 100;
let queueDepth = 0;

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (queueDepth >= MAX_QUEUE_DEPTH) {
    return {
      content: [{ type: 'text', text: 'Server is busy. Please retry in a moment.' }],
      isError: true,
    };
  }

  queueDepth++;
  try {
    // processRequest is a placeholder for your tool dispatch logic
    return await apiCallLimit(() => processRequest(request));
  } finally {
    queueDepth--;
  }
});

The isError: true response is the right mechanism here: it tells the client that the tool call failed while the server is still reachable, and the model sees the message and can decide to retry with exponential back-off. Compare this to a JSON-RPC protocol error, which signals that the request itself could not be processed and which clients commonly surface as a hard failure rather than passing it to the model. See MCP server error handling for the full distinction.
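On the calling side, a retry loop with exponential back-off might look like the sketch below. `callWithBackoff`, its parameters, and the `ToolResult` shape are hypothetical; real MCP clients and agent frameworks decide for themselves whether and how to retry.

```typescript
// Minimal shape of a tool result for this sketch (hypothetical subset).
interface ToolResult {
  isError?: boolean;
  content: { type: 'text'; text: string }[];
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries `call` while it reports isError, doubling the delay each time.
// Returns the last result, which may still be an error. This sketch retries
// on any isError result; a real caller would inspect the message and retry
// only on transient failures such as "server is busy".
async function callWithBackoff(
  call: () => Promise<ToolResult>,
  maxAttempts = 4,
  baseDelayMs = 100,
): Promise<ToolResult> {
  let result = await call();
  for (let attempt = 1; attempt < maxAttempts && result.isError; attempt++) {
    await sleep(baseDelayMs * 2 ** (attempt - 1)); // 100, 200, 400, ... with the defaults
    result = await call();
  }
  return result;
}
```

Capping the number of attempts matters as much as the delay: a client that retries forever turns a brief overload into a sustained one.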

Testing concurrent tool call handlers

Race conditions are notoriously hard to reproduce in tests because they depend on exact interleaving. The most effective approach: write a test that deliberately exercises concurrent paths and asserts invariants that a race would violate.

// test/concurrency.test.ts
import { describe, it, expect } from 'vitest';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import { createServer } from '../src/server.js';

describe('concurrent tool calls', () => {
  it('register_user respects the 10-user limit under concurrent load', async () => {
    const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
    const server = createServer();
    await server.connect(serverTransport);
    const client = new Client({ name: 'test', version: '1.0.0' }, { capabilities: {} });
    await client.connect(clientTransport);

    // Issue 20 simultaneous registrations — only 10 should succeed
    const results = await Promise.all(
      Array.from({ length: 20 }, (_, i) =>
        client.callTool({ name: 'register_user', arguments: { userId: `user-${i}` } })
      )
    );

    const successes = results.filter(r => !r.isError);
    const failures = results.filter(r => r.isError);

    expect(successes).toHaveLength(10);
    expect(failures).toHaveLength(10);

    await client.close();
  });

  it('concurrent calls to independent tools do not interfere', async () => {
    const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
    const server = createServer();
    await server.connect(serverTransport);
    const client = new Client({ name: 'test', version: '1.0.0' }, { capabilities: {} });
    await client.connect(clientTransport);

    // Both should succeed independently
    const [resultA, resultB] = await Promise.all([
      client.callTool({ name: 'get_user', arguments: { userId: 'a' } }),
      client.callTool({ name: 'get_user', arguments: { userId: 'b' } }),
    ]);

    expect(resultA.isError).toBeFalsy();
    expect(resultB.isError).toBeFalsy();

    await client.close();
  });
});

These tests are not exhaustive — they exercise a few interleavings. For critical business logic, pair them with property-based testing (fast-check): generate random sequences of concurrent operations and assert invariants hold for all of them.
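As a dependency-free approximation of that idea (fast-check itself is not used here), the sketch below replays seeded random timings against a registration function and reports the first seed that breaks the invariant. All names (`lcg`, `findViolatingSeed`, `Register`, `unsafeRegister`, `safeRegister`) are hypothetical.

```typescript
type Register = (users: Set<string>, id: string, delayMs: number) => Promise<void>;

const LIMIT = 10;

const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Small seeded linear congruential generator so failing runs can be replayed.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 2 ** 32; // value in [0, 1)
  };
}

// Runs `runs` randomized schedules of 20 concurrent registrations and returns
// the first seed whose outcome exceeds LIMIT, or null if none does.
async function findViolatingSeed(register: Register, runs: number): Promise<number | null> {
  for (let seed = 1; seed <= runs; seed++) {
    const random = lcg(seed);
    const users = new Set<string>();
    await Promise.all(
      Array.from({ length: 20 }, (_, i) =>
        register(users, `user-${i}`, Math.floor(random() * 5)),
      ),
    );
    if (users.size > LIMIT) return seed; // invariant broken: replay with this seed
  }
  return null;
}

// Unsafe: an await sits between the check and the add.
const unsafeRegister: Register = async (users, id, delayMs) => {
  if (users.size >= LIMIT) return;
  await pause(delayMs);
  users.add(id);
};

// Safe: the awaited work happens first, so check and add run without interruption.
const safeRegister: Register = async (users, id, delayMs) => {
  await pause(delayMs);
  if (users.size < LIMIT) users.add(id);
};
```

`findViolatingSeed(unsafeRegister, 5)` returns 1, since all 20 calls pass the check before any of them adds, while `findViolatingSeed(safeRegister, 5)` returns null. Returning the seed rather than a boolean is deliberate: it lets a failing schedule be replayed exactly while debugging.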

What concurrency bugs look like in production

Concurrency bugs in production often appear as intermittent, non-reproducible errors: a count that is occasionally wrong by one, a user that exists but can't be found, a file that was overwritten unexpectedly. They are hard to reproduce in development because the exact interleaving that causes the bug is timing-dependent. AliveMCP tracks tool-call error rates continuously — a sudden spike in isError: true responses that correlates with high concurrency (multiple LLM agents active simultaneously) is often the first signal of a shared-state race condition. Monitor the error rate trend, not just uptime.