
MCP server concurrency

The MCP SDK dispatches tool calls concurrently: if a second CallToolRequest message arrives before the first handler returns, both handlers are in flight at once and interleave on the same event loop. This is usually what you want (high throughput, no blocking), but it creates two hazards: shared mutable state (two handlers that read and write the same variable can interleave in unexpected ways) and resource exhaustion (an LLM agent issuing 50 simultaneous tool calls opens 50 database connections). This guide shows how to detect both hazards and fix them with async-mutex, p-limit, and back-pressure patterns.

TL;DR

For shared mutable state: use async-mutex to serialize access to the shared resource. For resource exhaustion: use p-limit to cap how many tool calls run concurrently. For database connections specifically: use a connection pool with a max cap and rely on the pool's built-in queuing rather than implementing your own. Test concurrent handlers by running Promise.all() with multiple simultaneous tool calls through InMemoryTransport and asserting correct outcomes.

How the MCP SDK handles concurrent calls

The MCP SDK processes incoming messages from the transport as they arrive. There is no automatic queuing or serialization of CallToolRequest messages: a request that arrives while an earlier handler is still awaiting something invokes the request handler again, so each request gets its own Promise. Those Promises make progress concurrently, interleaving at await points.
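
The interleaving can be seen without the SDK at all. The sketch below is only an illustration: fakeHandler and runTwoCalls are hypothetical names, and the timer stands in for whatever a real handler awaits.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for a tool handler: logs when it starts and finishes.
async function fakeHandler(name: string, workMs: number, log: string[]): Promise<void> {
  log.push(`${name}:start`);
  await sleep(workMs); // the handler yields to the event loop here
  log.push(`${name}:end`);
}

// Starts two handlers back to back, the way two requests arriving close
// together would, and returns the order in which events were logged.
async function runTwoCalls(): Promise<string[]> {
  const log: string[] = [];
  await Promise.all([fakeHandler('A', 20, log), fakeHandler('B', 1, log)]);
  return log;
}
```

runTwoCalls() resolves to A:start, B:start, B:end, A:end: the second handler starts before the first finishes and, having less work to wait for, completes first.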

This mirrors how any Node.js HTTP server works: multiple requests are served concurrently by default. Most of the time this is correct. The cases where it requires attention:

| Scenario | Problem | Fix |
| --- | --- | --- |
| Handler reads then writes shared state | Read-modify-write race: both handlers read the same value, both write back, one update is lost | async-mutex around the read-modify-write |
| Handler creates a resource (file, connection) checked for existence first | Check-then-act race: both check "does file exist?", both create, one overwrites | Atomic create-if-not-exists (O_CREAT \| O_EXCL) or mutex |
| LLM agent issues many simultaneous calls | Resource exhaustion: database pool saturated, rate limit hit, OOM | p-limit cap per resource or at the server level |
| Handler has per-call setup/teardown | Teardown of call A runs while call B is using the resource they share | Scope setup/teardown to each call explicitly; use try/finally |
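
For the check-then-act case, the file system itself can do the check. This is a minimal sketch, assuming Node's fs/promises module, where the 'wx' open flag corresponds to O_CREAT | O_EXCL. The helper name createOnce is hypothetical.

```typescript
import { open } from 'node:fs/promises';

// Hypothetical helper: creates the file only if it does not exist yet.
// Resolves to true if this caller created it, false if it was already there.
async function createOnce(path: string, contents: string): Promise<boolean> {
  try {
    // 'wx' opens for writing and fails with EEXIST if the path already exists,
    // so the existence check and the creation are one atomic step.
    const handle = await open(path, 'wx');
    try {
      await handle.writeFile(contents);
    } finally {
      await handle.close();
    }
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === 'EEXIST') return false;
    throw err;
  }
}
```

When several handlers call createOnce with the same path at once, exactly one of them gets true and the others get false, with no separate "does it exist?" step to race on.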

Shared mutable state race conditions

The classic Node.js concurrency bug is the read-modify-write race. It looks like this in an MCP handler:

import { CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';

// Shared state (module-level)
let requestCount = 0;
const activeUsers = new Set<string>();

// BUG: an await sits between the limit check and the insert
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'register_user') {
    const { userId } = request.params.arguments as { userId: string };

    // Race: two concurrent calls both see activeUsers.size = 9
    if (activeUsers.size >= 10) {
      return { content: [{ type: 'text', text: 'User limit reached' }], isError: true };
    }
    // Hypothetical async lookup, added for illustration. The handler yields
    // here, so a second call can run its own check before this one inserts.
    await userDirectory.verify(userId);
    // Both passed the check, both add: now activeUsers.size = 11
    activeUsers.add(userId);
    requestCount++;
    return { content: [{ type: 'text', text: 'Registered' }] };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

The size check and activeUsers.add() are separated by an await, so together they are not atomic: two concurrent calls can both read 9, both pass the ≥10 check, and both insert, ending up with 11 users despite the limit of 10. This is a logic-level race in JavaScript, not a threading race (there are no threads). Each synchronous stretch of a handler runs to completion, so handlers can only interleave at await boundaries; with no await between the check and the insert, this particular handler would be safe.

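// Fix: async-mutex for the critical section
import { Mutex } from 'async-mutex';

const userMutex = new Mutex();
const activeUsers = new Set<string>();

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'register_user') {
    const { userId } = request.params.arguments as { userId: string };

    // The check, the async work, and the insert all run under the lock,
    // so no other register_user call can slip in between them.
    return userMutex.runExclusive(async () => {
      if (activeUsers.size >= 10) {
        return { content: [{ type: 'text', text: 'User limit reached' }], isError: true };
      }
      // Hypothetical async lookup; any await here is what makes the lock necessary
      await userDirectory.verify(userId);
      activeUsers.add(userId);
      return { content: [{ type: 'text', text: 'Registered' }] };
    });
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});

runExclusive queues callers: if a second call arrives while the first is in the critical section, the second waits until the first releases the lock. Keep the critical section as short as you can, because every register_user call is serialized for its full duration, including any awaited work inside it. When the critical section contains no I/O it lasts microseconds and the serialization overhead is negligible. For critical sections that include database calls, prefer database-level locking (SELECT ... FOR UPDATE, transactions with serializable isolation) over application-level mutexes.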
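
To see the queuing idea in isolation, here is a dependency-free simulation. It is not async-mutex's implementation: TinyMutex is a hypothetical stand-in built on a promise chain, and the 1 ms sleep stands in for any await between the check and the insert.

```typescript
// Hypothetical stand-in for a mutex: each caller chains onto the previous one.
class TinyMutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after every earlier task has settled.
    const result = this.tail.then(task);
    // Keep the chain usable even if the task rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Registers a user unless the limit is reached. The await between the
// check and the add is the window in which another call can interleave.
async function register(users: Set<string>, userId: string, limit: number): Promise<boolean> {
  if (users.size >= limit) return false;
  await sleep(1);
  users.add(userId);
  return true;
}

// Fires `calls` registrations at once, with or without the mutex, and
// returns how many users ended up in the set.
async function simulate(calls: number, limit: number, useMutex: boolean): Promise<number> {
  const users = new Set<string>();
  const mutex = new TinyMutex();
  await Promise.all(
    Array.from({ length: calls }, (_, i) =>
      useMutex
        ? mutex.runExclusive(() => register(users, `user-${i}`, limit))
        : register(users, `user-${i}`, limit)
    )
  );
  return users.size;
}
```

simulate(20, 10, false) leaves all 20 users in the set, because every call passes the check before any of them inserts. simulate(20, 10, true) stops at 10.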

Capping concurrency with p-limit

When you want to allow concurrent tool calls but limit how many run simultaneously, p-limit is the right tool. It is not a mutex — it allows up to N concurrent operations, not exactly 1.

npm install p-limit

import pLimit from 'p-limit';

// Allow at most 5 concurrent calls to the external API
const apiCallLimit = pLimit(5);

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'enrich_record') {
    const { recordId } = request.params.arguments as { recordId: string };

    // If 5 calls are already running, this waits in the queue
    return apiCallLimit(async () => {
      const enriched = await externalApi.enrich(recordId);
      return { content: [{ type: 'text', text: JSON.stringify(enriched) }] };
    });
  }
});

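Use case for p-limit: your MCP server wraps an external API that tolerates only a handful of requests in flight at once. An LLM agent calls 50 tools simultaneously. Without limiting, 50 requests hit the external API at once and most get 429 errors. With pLimit(5), only 5 run at once; the rest queue and start as slots free up. Note that p-limit caps concurrency, not rate: if the API enforces a requests-per-second quota and each call finishes in well under a second, five slots can still exceed it, so pair the cap with a rate limiter or a minimum delay per slot.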
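
What a concurrency cap does can be checked with a small simulation. The limiter below is a hand-rolled sketch of the idea, not p-limit's implementation; createLimiter and measurePeak are hypothetical names, and the 2 ms sleep stands in for the external API round trip.

```typescript
// Hypothetical minimal limiter: at most `max` tasks are in flight at once.
function createLimiter(max: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active < max) {
      active++;
    } else {
      // Park until a finishing task hands its slot over directly.
      await new Promise<void>((resolve) => waiting.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // the slot passes to the next waiter; `active` is unchanged
      else active--;
    }
  };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `calls` fake requests through a limiter and reports the highest
// number that were in flight at the same moment.
async function measurePeak(calls: number, max: number): Promise<number> {
  const limit = createLimiter(max);
  let inFlight = 0;
  let peak = 0;
  await Promise.all(
    Array.from({ length: calls }, () =>
      limit(async () => {
        inFlight++;
        peak = Math.max(peak, inFlight);
        await sleep(2);
        inFlight--;
      })
    )
  );
  return peak;
}
```

measurePeak(50, 5) resolves to 5: the cap bounds how many requests are in flight, and how many complete per second then depends on how long each request takes.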

Typical concurrency limits by resource type:

| Resource | Typical p-limit value | Why |
| --- | --- | --- |
| External HTTP API (rate-limited) | Match the API's concurrency limit | Avoid 429 errors; queue excess calls |
| CPU-intensive worker pool | Worker pool size (maxThreads) | No point queuing more than the pool can handle |
| Database (no pool) | 2–5 | SQLite is single-writer; Postgres: tune to connection count |
| File system writes | 1–10 | Too many concurrent writes cause I/O saturation on spinning disks |

Per-connection vs global state

MCP servers may handle multiple simultaneous client connections (multiple LLM agents, or one agent with multiple sessions). State can be scoped per-connection (safe: each connection has its own copy) or global (requires explicit concurrency control).

// Per-connection state: scoped inside the createServer function
// Each client connection gets a fresh server instance with its own state
export function createServer(): Server {
  // State is per-server-instance, not module-level
  const sessionData = new Map<string, unknown>();
  let callCount = 0;

  const server = new Server(
    { name: 'my-mcp', version: '1.0.0' },
    { capabilities: { tools: {} } }
  );

  server.setRequestHandler(CallToolRequestSchema, async (request) => {
    callCount++; // safe: a synchronous increment, and no other connection shares this counter
    // ...
  });

  return server;
}

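// Main (pseudocode): create a fresh server per connection.
// onconnect is illustrative, not an SDK API; use whatever hook your
// transport layer gives you for a new session (for example, the HTTP
// handler that receives an initialize request).
transport.onconnect = async (connection) => {
  const server = createServer();
  await server.connect(connection);
};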

When state must be shared across connections (a shared database, a shared cache, a shared rate limit), declare it outside createServer() at module level and apply mutex/p-limit as described above.

Back-pressure: rejecting when overloaded

Queuing excess calls with p-limit is appropriate when the queue is bounded in size and drains quickly. For public or multi-tenant MCP servers, an unbounded queue is itself a resource exhaustion risk — a flood of requests fills memory. Implement explicit back-pressure: reject calls when the queue is full, rather than queuing indefinitely.

// queueDepth counts calls that are running or waiting, so the cap covers both.
const MAX_QUEUE_DEPTH = 100;
let queueDepth = 0;

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (queueDepth >= MAX_QUEUE_DEPTH) {
    return {
      content: [{ type: 'text', text: 'Server is busy. Please retry in a moment.' }],
      isError: true,
    };
  }

  queueDepth++;
  try {
    return await apiCallLimit(async () => {
      const result = await processRequest(request);
      return result;
    });
  } finally {
    queueDepth--;
  }
});

The isError: true response is the right mechanism here: it tells the client that the tool call failed but the server is reachable, and the result goes back to the LLM, which can decide to retry with exponential back-off. A JSON-RPC protocol error, by contrast, is typically surfaced by the client as a failed request rather than as a tool result the model can react to. See MCP server error handling for the full distinction.
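
On the calling side, a busy response can be retried with exponential back-off. This is a minimal sketch, assuming the caller receives a result with an optional isError flag. callWithBackoff, its options, and the ToolResult shape are hypothetical and not part of the MCP SDK.

```typescript
// Hypothetical, simplified shape of a tool result.
interface ToolResult {
  isError?: boolean;
  content: Array<{ type: string; text?: string }>;
}

interface BackoffOptions {
  retries: number;     // extra attempts after the first call
  baseDelayMs: number; // ceiling for the first retry delay; doubles each retry
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `call` and retries while the result is flagged as an error.
// A real client would retry only errors it knows to be transient.
async function callWithBackoff(
  call: () => Promise<ToolResult>,
  { retries, baseDelayMs }: BackoffOptions
): Promise<ToolResult> {
  let result = await call();
  for (let attempt = 0; attempt < retries && result.isError; attempt++) {
    // Random jitter below an exponentially growing ceiling keeps many
    // retrying clients from waking up in lockstep.
    const ceilingMs = baseDelayMs * 2 ** attempt;
    await sleep(Math.random() * ceilingMs);
    result = await call();
  }
  return result;
}
```

With retries set to 3 and a server that is busy for the first two attempts, the third call succeeds and its result is returned. If every attempt fails, the last error result is returned unchanged so the caller can still inspect it.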

Testing concurrent tool call handlers

Race conditions are notoriously hard to reproduce in tests because they depend on exact interleaving. The most effective approach: write a test that deliberately exercises concurrent paths and asserts invariants that a race would violate.

// test/concurrency.test.ts
import { describe, it, expect } from 'vitest';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { InMemoryTransport } from '@modelcontextprotocol/sdk/inMemory.js';
import { createServer } from '../src/server.js';

describe('concurrent tool calls', () => {
  it('register_user respects the 10-user limit under concurrent load', async () => {
    const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
    const server = createServer();
    await server.connect(serverTransport);
    const client = new Client({ name: 'test', version: '1.0.0' }, { capabilities: {} });
    await client.connect(clientTransport);

    // Issue 20 simultaneous registrations — only 10 should succeed
    const results = await Promise.all(
      Array.from({ length: 20 }, (_, i) =>
        client.callTool({ name: 'register_user', arguments: { userId: `user-${i}` } })
      )
    );

    const successes = results.filter(r => !r.isError);
    const failures = results.filter(r => r.isError);

    expect(successes).toHaveLength(10);
    expect(failures).toHaveLength(10);

    await client.close();
  });

  it('concurrent calls to independent tools do not interfere', async () => {
    const [serverTransport, clientTransport] = InMemoryTransport.createLinkedPair();
    const server = createServer();
    await server.connect(serverTransport);
    const client = new Client({ name: 'test', version: '1.0.0' }, { capabilities: {} });
    await client.connect(clientTransport);

    // Both should succeed independently
    const [resultA, resultB] = await Promise.all([
      client.callTool({ name: 'get_user', arguments: { userId: 'a' } }),
      client.callTool({ name: 'get_user', arguments: { userId: 'b' } }),
    ]);

    expect(resultA.isError).toBeFalsy();
    expect(resultB.isError).toBeFalsy();

    await client.close();
  });
});

These tests are not exhaustive — they exercise a few interleavings. For critical business logic, pair them with property-based testing (fast-check): generate random sequences of concurrent operations and assert invariants hold for all of them.
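
The idea behind randomized interleaving tests can be sketched without any library. fast-check generates and shrinks cases far more thoroughly; the version below only approximates it with a seeded generator so that a failing run can be replayed. All names here (mulberry32, SystemUnderTest, findViolation) are hypothetical.

```typescript
// Small deterministic PRNG so a failing seed can be replayed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface SystemUnderTest {
  // One concurrent operation; delayMs is how long its async step takes.
  op(index: number, delayMs: number): Promise<void>;
  // True while the system's invariant still holds.
  invariantHolds(): boolean;
}

// Runs several trials. Each trial starts all operations at once with random
// delays, then checks the invariant. Returns the index of the first failing
// trial, or null if every trial passed.
async function findViolation(
  makeSystem: () => SystemUnderTest,
  opts: { seed: number; trials: number; opsPerTrial: number; maxDelayMs: number }
): Promise<number | null> {
  const random = mulberry32(opts.seed);
  for (let trial = 0; trial < opts.trials; trial++) {
    const system = makeSystem(); // fresh state for every trial
    const ops = Array.from({ length: opts.opsPerTrial }, (_, i) =>
      system.op(i, Math.floor(random() * (opts.maxDelayMs + 1)))
    );
    await Promise.all(ops);
    if (!system.invariantHolds()) return trial;
  }
  return null;
}
```

A test would wrap the handler under test as a SystemUnderTest, with an invariant such as "the set never holds more than 10 users", and assert that findViolation returns null for a range of seeds.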

What concurrency bugs look like in production

Concurrency bugs in production often appear as intermittent, non-reproducible errors: a count that is occasionally wrong by one, a user that exists but can't be found, a file that was overwritten unexpectedly. They are hard to reproduce in development because the exact interleaving that causes the bug is timing-dependent. AliveMCP tracks tool-call error rates continuously — a sudden spike in isError: true responses that correlates with high concurrency (multiple LLM agents active simultaneously) is often the first signal of a shared-state race condition. Monitor the error rate trend, not just uptime.

Related questions

Does the MCP SDK serialize tool calls automatically?

No. The SDK processes incoming messages as they arrive from the transport and dispatches each CallToolRequest to the registered handler immediately. If two requests arrive before the first handler returns, both run concurrently. This is the same behavior as Node.js HTTP servers — you are responsible for managing shared state in your handlers. The SDK provides no built-in serialization mechanism.

What is the difference between async-mutex and p-limit?

async-mutex provides a binary lock: exactly one caller is inside the critical section at a time. Use it when you need to serialize access to a shared resource. p-limit is a concurrency limiter: up to N callers run simultaneously; additional callers queue. Use it when you want to allow concurrency but cap it at a safe level. They are often used together: p-limit at the outer level (cap concurrent API calls at 5) and async-mutex at the inner level (serialize writes to a shared counter).

Should I use a database transaction instead of async-mutex?

For operations that read and write to a database, yes: use a database transaction with the appropriate isolation level. Transactions are implemented at the database level and handle concurrency correctly even across multiple Node.js processes (important if you run multiple server instances). Application-level mutexes like async-mutex work only within a single process. If your MCP server runs as multiple instances behind a load balancer, coordination has to happen in something all instances share; database transactions and row locks are the usual choice, since an in-process mutex cannot see the other instances.
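
As a rough illustration, the user-limit check could move into a transaction that locks a counter row. Everything specific here is an assumption: the SqlClient interface is a minimal shape modelled loosely on a PostgreSQL client, the user_limits and active_users tables are hypothetical, and the counter row with id 1 is assumed to exist already.

```typescript
// Minimal, hypothetical shape of a SQL client bound to ONE connection.
// All statements of a transaction must go through the same connection.
interface SqlClient {
  query(sql: string, params?: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// Registers a user if the limit allows it. Resolves to true on success,
// false when the limit was already reached.
async function registerUserTx(db: SqlClient, userId: string, limit: number): Promise<boolean> {
  await db.query('BEGIN');
  try {
    // FOR UPDATE locks the counter row: a concurrent transaction running
    // the same SELECT waits here until this one commits or rolls back.
    const { rows } = await db.query(
      'SELECT active_count FROM user_limits WHERE id = 1 FOR UPDATE'
    );
    const active = Number(rows[0]?.active_count ?? 0);
    const allowed = active < limit;
    if (allowed) {
      await db.query('INSERT INTO active_users (user_id) VALUES ($1)', [userId]);
      await db.query('UPDATE user_limits SET active_count = active_count + 1 WHERE id = 1');
    }
    await db.query('COMMIT'); // nothing was written if the limit was reached
    return allowed;
  } catch (err) {
    await db.query('ROLLBACK');
    throw err;
  }
}
```

Because the lock lives in the database, the check and the insert stay atomic even when the calls come from different server processes.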

How do I handle concurrency in a stateless MCP server?

A truly stateless MCP server (all state in the database, no module-level mutable variables) has no in-process shared-state race conditions by definition. Races over the data itself move into the database, where transactions handle them. Stateless is the easiest architecture to reason about under concurrency, and it's the right default for MCP servers that need to scale horizontally. The remaining in-process concern for a stateless server is resource exhaustion (database connections, rate limits), which you address with connection pooling and p-limit.

Further reading