Orchestration guide · 2026-06-11 · Multi-agent MCP systems

Multi-Agent MCP Orchestration: Five Patterns for Parallel Tool Calls, Shared State, and Agent Handoffs

Single-agent MCP development is relatively forgiving: one session, sequential tool calls, a shared process that holds whatever state you need. Multi-agent deployments are not. When an orchestrator spawns twenty sub-agents that all connect to the same MCP server simultaneously, five failure modes appear that single-agent testing rarely surfaces: parallel writes corrupt shared state, fan-out saturates the database connection pool, agent handoffs lose context at server boundaries, composed tool chains swallow errors in intermediate steps, and long-running sessions overflow the context window. Each failure mode has a pattern that addresses it. This post covers all five as an operational architecture guide for MCP servers moving from single-agent prototypes to multi-agent production systems.


Why multi-agent patterns differ from single-agent patterns

Every pattern in this post has an analogue in single-agent MCP development. But the multi-agent context changes the urgency and the failure mode for each one. Understanding why helps you prioritize — not all five patterns are equally important for every deployment.

Multi-agent behavior | Why it breaks single-agent assumptions | Pattern that addresses it
--- | --- | ---
N parallel tool calls from N sub-agents in the same process | Module-level state (caches, counters, in-process Maps) is shared across all sessions; concurrent writes produce torn reads and lost updates that are invisible under sequential single-agent load | Stateless handlers + external storage
Fan-out: orchestrator spawns N clients, each calling the same server | A connection pool sized for 5 concurrent connections holds up fine for a single agent, but 20 sub-agents × 3 parallel tool calls each = 60 concurrent connection acquisitions; the overflow waits in the pool's queue (indefinitely, if no acquire timeout is set) and P99 latency climbs with queue depth | Pool sizing + backpressure
Agent execution shifts from one MCP server to a different one mid-workflow | The receiving server cannot reconstruct the sending server's state from the MCP protocol alone; accumulated context (resolved entity IDs, constraint flags, prior decisions) lives in process memory or the sending server's database | HandoffEnvelope + checkpoint-and-resume
Server assembles a multi-step pipeline from primitive operations | A step failure in a composed pipeline can be swallowed silently if only the final result is returned; the agent sees "tool call failed" with no indication of which step failed or what partial progress was made | Typed StepError pipeline
Long-running sessions accumulate context across many tool calls | Naive context stores grow without bound; in-process stores exhaust heap memory under many parallel long-running sessions; agent prompts bloat with raw history that has already been processed | Sliding-window + LRU eviction

The failure modes compound: an orchestrator that spawns 20 sub-agents creates connection pool pressure (fan-out), shared state contention (parallel writes), and potential context store growth (many active sessions) simultaneously. Adding one pattern reduces pressure on the others: sizing the connection pool correctly reduces the latency spikes that cause sub-agents to time out and retry, which reduces the duplicate writes that hit the shared state, which reduces the conflict rate in the optimistic locking layer.

Pattern 1: Topology — choosing your coordination model

Before any of the other patterns apply, you need to choose how your agents coordinate. The two dominant models in MCP deployments are orchestrator-dispatcher and swarm. They look similar from the MCP server's perspective — both produce parallel sessions — but have fundamentally different coordination semantics.

The orchestrator-dispatcher topology has a single controller agent that receives the top-level task, decomposes it into subtasks, dispatches each to a sub-agent, waits for results, and synthesizes the output. The orchestrator controls parallelism, enforces the task dependency graph, and decides when to retry a failed sub-agent. Sub-agents are the heavy MCP callers. This topology is predictable and debuggable — every tool call traces to a specific sub-agent assignment in the orchestrator's decision log.

The swarm topology has no central coordinator. Multiple agents observe a shared task queue, each picks up a work item, calls MCP tools, writes results to a shared store, and publishes a completion event. Swarms are resilient to individual agent failure — another agent picks up the pending item. They are the right choice for embarrassingly parallel workloads (bulk document processing, web crawls, data transformation pipelines) where no individual result depends on a specific agent's identity.
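The claim step is what makes a swarm resilient. Each item must be taken by exactly one agent, and an agent that dies mid-item must not hold that item forever. The following is a minimal in-memory sketch of leased claims. `SwarmQueue`, its methods, and the lease mechanism are illustrative assumptions and are not part of any MCP SDK. In a real deployment, the queue lives in shared storage and the claim is a single atomic operation there, for example a conditional UPDATE on a status column.

```typescript
// swarm-queue.ts: hypothetical work queue with leased claims (in-memory for illustration)
type ItemStatus = 'pending' | 'claimed' | 'done';

interface WorkItem {
  id: string;
  status: ItemStatus;
  claimedBy?: string;
  leaseExpiresAt?: number; // epoch ms after which another agent may take the item
  result?: unknown;
}

class SwarmQueue {
  private items = new Map<string, WorkItem>();
  private readonly leaseMs: number;

  constructor(leaseMs: number) {
    this.leaseMs = leaseMs;
  }

  enqueue(id: string): void {
    this.items.set(id, { id, status: 'pending' });
  }

  // Claim the first pending item, or an item whose previous claimant's lease expired.
  claim(agentId: string, now = Date.now()): WorkItem | undefined {
    for (const item of this.items.values()) {
      const leaseExpired = item.status === 'claimed' && (item.leaseExpiresAt ?? 0) <= now;
      if (item.status === 'pending' || leaseExpired) {
        item.status = 'claimed';
        item.claimedBy = agentId;
        item.leaseExpiresAt = now + this.leaseMs;
        return item;
      }
    }
    return undefined;
  }

  // Only the current lease holder may complete an item; a late, stale holder is rejected.
  complete(id: string, agentId: string, result: unknown): boolean {
    const item = this.items.get(id);
    if (!item || item.status !== 'claimed' || item.claimedBy !== agentId) return false;
    item.status = 'done';
    item.result = result;
    return true;
  }

  pendingCount(): number {
    let n = 0;
    for (const item of this.items.values()) if (item.status !== 'done') n++;
    return n;
  }
}
```

Leases avoid the need for a failure detector, at the cost of a detection delay of up to `leaseMs`. When an agent crashes, its item simply becomes claimable again once the lease runs out.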

Fan-out in the orchestrator-dispatcher model should use p-limit to bound parallelism to a number your MCP server's connection pool can sustain, rather than unbounded Promise.all:

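import pLimit from 'p-limit';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { SSEClientTransport } from '@modelcontextprotocol/sdk/client/sse.js';

// Illustrative task and result shapes
interface SubTask {
  id: string;
  tool: string;
  args: Record<string, unknown>;
}

interface Result {
  taskId: string;
  result: unknown;
}

// Bound fan-out to what the server's connection pool can sustain.
// With pool.max = 40 and 2 concurrent calls per sub-agent, 20 sub-agents would
// saturate the pool exactly; 15 leaves headroom for retries.
const limit = pLimit(15);

async function fanOut(serverUrl: string, tasks: SubTask[]): Promise<Result[]> {
  return Promise.all(
    tasks.map(task => limit(async () => {
      // One MCP client session per sub-agent. SSEClientTransport is the SDK's
      // HTTP+SSE transport; use whichever transport your server exposes.
      const transport = new SSEClientTransport(new URL(serverUrl));
      const client = new Client(
        { name: `sub-agent-${task.id}`, version: '1.0.0' },
        { capabilities: {} }
      );
      try {
        await client.connect(transport); // runs the MCP initialize handshake
        const result = await client.callTool({
          name: task.tool,
          arguments: task.args,
        });
        return { taskId: task.id, result };
      } finally {
        await client.close(); // release the session even if connect or the call fails
      }
    }))
  );
}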

Here is the pool sizing math. If your database connection pool has max: 40 and each tool handler acquires one connection, you can sustain 40 concurrent handler executions. If your orchestrator fans out to 20 sub-agents and each sub-agent makes 2 parallel tool calls, peak concurrency is 40, which is exactly the pool limit. Leave a margin instead. With pLimit(15), peak concurrency is capped at 30 and 10 connections stay free for retries, so retries do not immediately hit the pool ceiling.
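The same arithmetic can be written as a small helper. This sketch assumes that every concurrent tool call holds exactly one pool connection for its full duration. `maxFanOut` and its `headroomFraction` parameter are hypothetical names.

```typescript
// fan-out-budget.ts: hypothetical helper for the pool-sizing arithmetic
function maxFanOut(poolMax: number, callsPerSubAgent: number, headroomFraction = 0.25): number {
  if (poolMax <= 0 || callsPerSubAgent <= 0) {
    throw new RangeError('poolMax and callsPerSubAgent must be positive');
  }
  if (headroomFraction < 0 || headroomFraction >= 1) {
    throw new RangeError('headroomFraction must be in [0, 1)');
  }
  // Reserve part of the pool for retries, then divide the rest among sub-agents.
  const usable = Math.floor(poolMax * (1 - headroomFraction));
  return Math.max(1, Math.floor(usable / callsPerSubAgent));
}

// pool.max = 40, 2 concurrent calls per sub-agent
console.log(maxFanOut(40, 2, 0));    // prints 20: saturates the pool exactly
console.log(maxFanOut(40, 2, 0.25)); // prints 15: peak of 30 connections, 10 kept free
```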

Pattern 2: Shared state — avoiding contention across parallel agents

The most common multi-agent MCP bug looks like data corruption. Two sub-agents call the same tool concurrently, both read a record with version: 5, both compute an update, and both write back. The second write silently overwrites the first, so the record ends up at version 6 containing only one agent's changes, and the other agent's update is lost. Under single-agent sequential testing this cannot happen, because calls never overlap. Under multi-agent parallel load it shows up quickly.

The shared state guide covers three dangerous in-process state patterns (lost updates, torn reads, phantom reads) and their fixes. The core fix for the lost update problem is optimistic locking with a version field:

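// shared-state.ts: optimistic locking with retry on conflict
import Database from 'better-sqlite3';

// Illustrative payload type; `payload` is stored as JSON text
type TaskState = Record<string, unknown>;
interface TaskRow {
  id: string;
  payload: string;
  version: number;
  updated_at: string;
}

const db = new Database('./state.db');
db.pragma('journal_mode = WAL');  // WAL mode: readers don't block the writer, and vice versa
db.exec(`
  CREATE TABLE IF NOT EXISTS task_state (
    id TEXT PRIMARY KEY,
    payload TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 0,
    updated_at TEXT NOT NULL
  )
`);

const selectTask = db.prepare('SELECT id, payload, version, updated_at FROM task_state WHERE id = ?');
const updateWithLock = db.prepare(`
  UPDATE task_state
  SET payload = @payload, version = version + 1, updated_at = @now
  WHERE id = @id AND version = @expectedVersion
`);

// better-sqlite3 is synchronous, so within one process the read and write below cannot
// interleave; conflicts come from other connections, such as other server processes
// sharing the same database file.
async function updateTaskState(
  id: string,
  transform: (current: TaskState) => TaskState,
  maxRetries = 5
): Promise<TaskState> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const row = selectTask.get(id) as TaskRow | undefined;
    if (!row) throw new Error(`Task state not found: ${id}`);

    const next = transform(JSON.parse(row.payload) as TaskState);
    const changed = updateWithLock.run({
      id,
      payload: JSON.stringify(next),
      expectedVersion: row.version,
      now: new Date().toISOString(),
    });

    if (changed.changes === 1) return next; // success

    // Version mismatch: another writer updated the row since we read it
    const backoffMs = 20 * Math.pow(2, attempt) + Math.random() * 10;
    await new Promise(r => setTimeout(r, backoffMs));
  }
  throw new Error(`Optimistic lock failed after ${maxRetries} retries: task ${id} has high write contention`);
}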

SQLite in WAL mode handles this correctly for single-node deployments, including several server processes that share one database file. The WHERE version = @expectedVersion predicate is evaluated atomically within the write transaction. If a concurrent writer changed the version first, the update reports changes === 0 and triggers a retry. Exponential backoff with jitter spreads out the retries, so many agents colliding on the same record do not all retry in lockstep.

For multi-node deployments where SQLite is not shared across instances, use Redis with a Lua CAS script — the same compare-and-swap semantics, but atomic across the network boundary:

-- redis-cas.lua: atomic compare-and-swap; only write if the stored version matches
local key = KEYS[1]
local expected_version = tonumber(ARGV[1])
local new_payload = ARGV[2]
local new_version = expected_version + 1

-- HGET on a missing hash returns false; treat a missing record as version 0
local current_version = tonumber(redis.call('HGET', key, 'version')) or 0
if current_version ~= expected_version then
  return { 0, current_version }  -- conflict: return the actual version
end

redis.call('HSET', key, 'payload', new_payload, 'version', new_version)
return { 1, new_version }  -- success: return the new version

When write contention is very high (many agents writing to the same small set of records), consider an event-sourced append-only log instead of direct updates. Appends never conflict. Each agent appends its event, and a read-time fold over the log produces the current state. Snapshot the log periodically to bound read time. This approach removes write conflicts at the cost of read-time folding and snapshot management.
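The following is a minimal in-memory sketch of the append-and-fold approach. It assumes that events carry a monotonically increasing sequence number. `EventLog`, `TaskEvent`, and the two event types are hypothetical. In a shared store, the sequence number would come from the store itself (an autoincrement key or a Redis stream ID), and that is what keeps concurrent appends conflict-free.

```typescript
// event-log.ts: hypothetical append-only log with read-time fold and periodic snapshots
interface TaskEvent {
  seq: number;
  agentId: string;
  type: 'item_added' | 'item_removed';
  item: string;
}

type TaskView = { items: string[] };

// Pure reducer: applies one event to the current view
function applyEvent(view: TaskView, e: TaskEvent): TaskView {
  if (e.type === 'item_added') return { items: [...view.items, e.item] };
  return { items: view.items.filter(i => i !== e.item) };
}

class EventLog {
  private events: TaskEvent[] = [];
  private snapshot: { upToSeq: number; view: TaskView } = { upToSeq: 0, view: { items: [] } };
  private nextSeq = 1;

  // Appends never conflict: an agent only adds a new event, it never rewrites an old one.
  append(agentId: string, type: TaskEvent['type'], item: string): number {
    const seq = this.nextSeq++;
    this.events.push({ seq, agentId, type, item });
    return seq;
  }

  // Current state = latest snapshot + fold over the events recorded after it.
  read(): TaskView {
    return this.events
      .filter(e => e.seq > this.snapshot.upToSeq)
      .reduce(applyEvent, { items: [...this.snapshot.view.items] });
  }

  // Fold everything into a new snapshot and drop folded events to bound read time.
  compact(): void {
    const view = this.read();
    const upToSeq = this.nextSeq - 1;
    this.snapshot = { upToSeq, view };
    this.events = this.events.filter(e => e.seq > upToSeq);
  }

  size(): number {
    return this.events.length;
  }
}
```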

Pattern 3: Tool composition — typed pipelines with error propagation

In a multi-step agent workflow, the agent often has a choice: call N primitive tools sequentially and coordinate the pipeline itself, or call one composed tool that runs the whole pipeline server-side. The decision hinges on whether the agent needs to reason about the intermediate results. If the agent would only pass each intermediate directly to the next call unchanged, composing server-side saves N-1 round-trips and keeps the agent's context window clean.

The key challenge in server-side tool composition is error propagation. When a four-step pipeline fails at step 3, the calling agent needs to know: which step failed, what partial progress was made, and whether the failure is retryable. A naked throw new Error('step failed') loses all of that context. A typed StepError preserves it:

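// step-error.ts: typed error for composed pipeline steps
export class StepError extends Error {
  constructor(
    public readonly step: string,
    message: string,
    public readonly context: Record<string, unknown> = {},
    public readonly retryable: boolean = true
  ) {
    super(`[${step}] ${message}`);
    this.name = 'StepError';
  }
}

// pipeline.ts: sequential typed pipeline with partial result recovery
// Document shapes and step functions are placeholders for your own implementations.
interface RawDocument { rawHtml: string }
interface ParsedDocument { title: string; bodyText: string }
interface ScoredDocument {
  url: string;
  title: string;
  relevanceScore: number;
  summary: string;
  degraded?: boolean; // true when scoring failed and a fallback was returned
}
declare function fetchStep(url: string): Promise<RawDocument>;
declare function parseStep(raw: RawDocument): Promise<ParsedDocument>;
declare function scoreStep(doc: ParsedDocument, query: string): Promise<ScoredDocument>;
declare function isTransient(err: unknown): boolean;

async function processDocument(url: string, query: string): Promise<ScoredDocument> {
  // Step 1: fetch (retryable if the failure looks transient, e.g. a timeout)
  const raw = await fetchStep(url).catch((err): never => {
    throw new StepError('fetch', String(err), { url }, isTransient(err));
  });

  // Step 2: parse (deterministic, so retrying the same input will not help)
  const parsed = await parseStep(raw).catch((err): never => {
    throw new StepError('parse', String(err), { url, rawLength: raw.rawHtml.length }, false);
  });

  // Step 3: score is non-critical, so return a degraded result rather than failing the pipeline
  try {
    return await scoreStep(parsed, query);
  } catch {
    return {
      url,
      title: parsed.title,
      relevanceScore: 0,
      summary: parsed.bodyText.slice(0, 200),
      degraded: true,
    };
  }
}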

For pipelines where you want to process a list of items and collect all errors rather than short-circuiting on the first failure, use Promise.allSettled for the map step:

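// map-reduce with partial success: process all items, collect errors
// (inside a tool handler; `items`, `query`, and `limit` come from the surrounding scope)
const results = await Promise.allSettled(
  items.map(item => limit(() => processDocument(item.url, query)))
);

const successful: ScoredDocument[] = [];
const failed: Array<{
  url: string;
  step?: string;
  message: string;
  retryable: boolean;
  context?: Record<string, unknown>;
}> = [];

// allSettled preserves input order, so results[i] belongs to items[i]
results.forEach((r, i) => {
  if (r.status === 'fulfilled') {
    successful.push(r.value);
  } else if (r.reason instanceof StepError) {
    const { step, message, retryable, context } = r.reason;
    failed.push({ url: items[i].url, step, message, retryable, context });
  } else {
    // Unknown error type: report it, but do not advertise it as retryable
    failed.push({ url: items[i].url, message: String(r.reason), retryable: false });
  }
});

return {
  content: [{ type: 'text', text: JSON.stringify({ successful, failed, total: items.length }) }]
};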

The agent receives a structured result that distinguishes successful items from failed ones, with enough context to decide how to proceed — whether to retry only the failed items, surface an error to the user, or continue with the partial results. This is categorically better than a binary success/failure response for long-running multi-agent pipelines where partial success is the expected norm under production load.
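The follow-up decision can be made mechanically on the agent side or in the orchestrator. The following sketch assumes that each entry in `failed` carries the item's `url` and the `retryable` flag from its StepError. `planFollowUp` and the response shape are illustrative, not a fixed protocol.

```typescript
// retry-plan.ts: hypothetical agent-side handling of a partial-success response
interface FailedItem {
  url: string;
  step?: string;
  message: string;
  retryable: boolean;
}

interface BatchResponse<T> {
  successful: T[];
  failed: FailedItem[];
  total: number;
}

// Split failures into items worth resubmitting and items to surface to the user.
function planFollowUp<T>(res: BatchResponse<T>): { retry: string[]; surface: FailedItem[] } {
  return {
    retry: res.failed.filter(f => f.retryable).map(f => f.url),
    surface: res.failed.filter(f => !f.retryable),
  };
}
```

Resubmitting only `retry` keeps already-successful items out of the second round. This matters when each item is expensive or has side effects.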

Pattern 4: Agent handoffs — checkpoint-and-resume across server boundaries

A handoff is any transition where agent execution shifts from one MCP server to a different one with context that must survive the boundary. The most common triggers: a routing server dispatches to a specialist, a saturated server sheds load to an overflow instance, a server crashes and the agent retries on a standby, or a staged pipeline moves context from an extraction server to a summarization server to a storage server.

What all of these have in common: the receiving server cannot reconstruct the sending server's state from the MCP protocol alone. It depends on context accumulated across prior tool calls — resolved entity IDs, extracted parameters, constraint flags, prior decisions. The agent handoff guide defines a HandoffEnvelope schema for this transfer:

// handoff.ts — Zod schema for the context serialization contract
import { z } from 'zod';

export const HandoffEnvelopeSchema = z.object({
  session_id: z.string().uuid(),           // stable across the multi-server conversation
  handoff_id: z.string().uuid(),           // unique per handoff event
  idempotency_token: z.string().min(1),    // receiving server deduplicates on this

  source_server: z.string(),              // for tracing
  target_server: z.string(),              // receiving server validates it is the intended target
  next_tool_hint: z.string().optional(),  // tool to call first — skips tools/list round-trip

  accumulated_context: z.record(z.string(), z.unknown()),  // distilled key/value context (not raw transcript)
  continuation_token: z.string().optional(),   // cursor for resuming pagination or streaming

  created_at: z.string().datetime(),
  ttl_seconds: z.number().int().positive().default(300),
});

export type HandoffEnvelope = z.infer<typeof HandoffEnvelopeSchema>;

The accumulated_context field is not a conversation transcript — it is a distilled summary of what matters for the next stage. Raw message history belongs in the agent prompt or a separate conversation-history store, not in the handoff envelope. A good rule of thumb: if the envelope exceeds 64 KB uncompressed, the context payload is too large and should be stored externally with a reference key in the envelope.
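The following sketches that size check. It assumes the envelope is serialized as JSON and that large context goes to some external key-value store. `BlobStore`, `offloadIfLarge`, and the `context_ref` key are hypothetical names. `context_ref` is not a field of the HandoffEnvelope schema above, so the sketch places it inside `accumulated_context` to keep the envelope schema-valid.

```typescript
// envelope-size.ts: hypothetical enforcement of the 64 KB rule of thumb
const MAX_ENVELOPE_BYTES = 64 * 1024;

// Stand-in for whatever holds large payloads (Redis, object storage, a DB table)
interface BlobStore {
  put(key: string, value: string): void;
}

interface EnvelopeLike {
  handoff_id: string;
  accumulated_context: Record<string, unknown>;
  [field: string]: unknown;
}

// UTF-8 size of the JSON serialization
function byteLength(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Small envelopes pass through unchanged; large ones have their context moved
// to the store and replaced by a reference key.
function offloadIfLarge(envelope: EnvelopeLike, store: BlobStore): EnvelopeLike {
  if (byteLength(envelope) <= MAX_ENVELOPE_BYTES) return envelope;
  const key = `handoff-context:${envelope.handoff_id}`;
  store.put(key, JSON.stringify(envelope.accumulated_context));
  return { ...envelope, accumulated_context: { context_ref: key } };
}
```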

The checkpoint-and-resume flow: the sending server writes the envelope to a durable store (SQLite or Redis) before returning the handoff to the orchestrating agent. The receiving server reads the checkpoint immediately on receiving the handoff call, before doing any work. This makes a crash between checkpoint write and handoff processing recoverable — the agent retries, the receiving server reads the checkpoint again, and the idempotency token prevents double execution:

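// receiving-server.ts: idempotency check on handoff receipt
// Assumes `server` (McpServer), `db` (better-sqlite3), SERVER_NAME and resumeWork are defined
// elsewhere, and that handoff_checkpoints has idempotency_token as its PRIMARY KEY.
server.tool('receive_handoff', 'Resume a task from a handoff envelope', {
  envelope: HandoffEnvelopeSchema,
}, async ({ envelope }) => {
  // Validate we are the intended target
  if (envelope.target_server !== SERVER_NAME) {
    throw new Error(`Handoff mis-delivered: intended ${envelope.target_server}, got ${SERVER_NAME}`);
  }

  // Reject stale envelopes
  const ageSeconds = (Date.now() - new Date(envelope.created_at).getTime()) / 1000;
  if (ageSeconds > envelope.ttl_seconds) {
    throw new Error(`Handoff expired: ${ageSeconds.toFixed(0)}s old, TTL was ${envelope.ttl_seconds}s`);
  }

  const token = envelope.idempotency_token;

  // Claim the token atomically: with INSERT OR IGNORE, two concurrent deliveries of the
  // same envelope cannot both start work, because only one insert changes a row.
  const claim = db.prepare(`
    INSERT OR IGNORE INTO handoff_checkpoints (idempotency_token, handoff_id, session_id, status)
    VALUES (?, ?, ?, 'processing')
  `).run(token, envelope.handoff_id, envelope.session_id);

  if (claim.changes === 0) {
    // Already claimed: return the stored result, or tell the caller to retry later
    const existing = db.prepare(`
      SELECT status, result_json FROM handoff_checkpoints WHERE idempotency_token = ?
    `).get(token) as { status: string; result_json: string | null } | undefined;

    if (existing?.status === 'completed' && existing.result_json) {
      return { content: [{ type: 'text', text: existing.result_json }] };
    }
    return { content: [{ type: 'text', text: JSON.stringify({ status: 'processing', retry_after_seconds: 5 }) }] };
  }

  try {
    const result = await resumeWork(envelope);
    const resultJson = JSON.stringify(result);
    db.prepare(`UPDATE handoff_checkpoints SET status = 'completed', result_json = ? WHERE idempotency_token = ?`)
      .run(resultJson, token);
    return { content: [{ type: 'text', text: resultJson }] };
  } catch (err) {
    // Release the claim so a retried delivery can run the work again. A process crash
    // here still leaves a 'processing' row; production code should also expire stale claims.
    db.prepare(`DELETE FROM handoff_checkpoints WHERE idempotency_token = ?`).run(token);
    throw err;
  }
});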
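On the sending side, the same flow comes down to an ordering rule: persist the checkpoint before handing the envelope back. The following is a minimal sketch with an in-memory interface standing in for SQLite or Redis. `createHandoff`, `CheckpointStore`, and the token format are illustrative assumptions, and the `Envelope` interface hand-copies the HandoffEnvelope fields so the sketch runs without Zod. Deriving `idempotency_token` from `handoff_id` is one option among several. The only requirement is that every redelivery of the same handoff carries the same token.

```typescript
// sending-server.ts: hypothetical checkpoint-then-return ordering for a handoff
import { randomUUID } from 'node:crypto';

interface Envelope {
  session_id: string;
  handoff_id: string;
  idempotency_token: string;
  source_server: string;
  target_server: string;
  next_tool_hint?: string;
  accumulated_context: Record<string, unknown>;
  continuation_token?: string;
  created_at: string;
  ttl_seconds: number;
}

// Stand-in for the durable checkpoint store
interface CheckpointStore {
  write(envelope: Envelope): void;
  read(handoffId: string): Envelope | undefined;
}

function createHandoff(
  store: CheckpointStore,
  sessionId: string,
  source: string,
  target: string,
  context: Record<string, unknown>,
): Envelope {
  const handoffId = randomUUID();
  const envelope: Envelope = {
    session_id: sessionId,
    handoff_id: handoffId,
    idempotency_token: `handoff:${handoffId}`, // stable across redeliveries of this handoff
    source_server: source,
    target_server: target,
    accumulated_context: context,
    created_at: new Date().toISOString(),
    ttl_seconds: 300,
  };
  // Persist first: if the process dies after this line, the checkpoint survives and a
  // retried delivery reuses the same idempotency token.
  store.write(envelope);
  return envelope;
}
```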

One critical operational requirement: every server in a handoff chain must be independently monitored. A handoff that lands on a server whose health probe is failing is a context-loss event waiting to happen. Configure AliveMCP to probe each server in the chain with a separate endpoint — the external protocol probe will detect a crashed receiving server before the orchestrating agent spends its retry budget trying to deliver a handoff to a dead server.

Pattern 5: Conversation context — sliding window and LRU eviction

MCP tools are stateless by default — each tool call arrives with no memory of previous calls in the same session. For many tools this is fine. But conversational workflows benefit from server-side context: knowing which documents were already retrieved, which parameters the user established at session start, which decisions were made in earlier steps. The naive fix — put all context in the agent's prompt — collapses quickly as sessions grow: context windows are finite, raw tool-call history is verbose, and the agent wastes inference tokens on information it already processed.

The conversation context guide covers the full implementation. The key design decisions for multi-agent deployments:

Store choice by deployment topology: in-memory Map for single-process (zero latency, zero dependencies, lost on restart), SQLite for single-instance durable (survives restarts, no cross-instance sharing), Redis for multi-instance (shared across instances, survives restarts, network latency per access). Most single-process MCP servers start with Map and switch to Redis when they scale horizontally or when context must survive server restarts.
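One way to keep that migration cheap is to code against a small async interface from the start. The Map-backed version can then be swapped for a SQLite or Redis one without touching tool handlers. The following is a sketch in which `ContextStore` and `MapContextStore` are hypothetical names. The interface is async even for the Map, so that network-backed implementations fit the same shape.

```typescript
// context-store.ts: hypothetical storage interface so the backend can follow the topology
interface ContextStore<T> {
  get(sessionId: string): Promise<T | undefined>;
  set(sessionId: string, value: T): Promise<void>;
  delete(sessionId: string): Promise<void>;
}

// Single-process backend: zero dependencies, contents lost on restart
class MapContextStore<T> implements ContextStore<T> {
  private readonly data = new Map<string, T>();

  async get(sessionId: string): Promise<T | undefined> {
    return this.data.get(sessionId);
  }

  async set(sessionId: string, value: T): Promise<void> {
    this.data.set(sessionId, value);
  }

  async delete(sessionId: string): Promise<void> {
    this.data.delete(sessionId);
  }
}
```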

Sliding window compression prevents unbounded context growth. When the accumulated tool-call history exceeds a budget, the server summarizes the oldest turns into a compact paragraph and discards the raw records. The budget below is measured as a count of tool calls; a token count works the same way:

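// context-compression.ts: sliding window with summarization trigger
// Record and context shapes are illustrative; getContext/saveContext wrap your context store.
interface ToolCallRecord { tool: string; args: unknown; resultSummary: string; at: string }
interface SessionContext { toolCallHistory: ToolCallRecord[]; summaryBeforeWindow?: string }
declare function getContext(sessionId: string): Promise<SessionContext>;
declare function saveContext(sessionId: string, ctx: SessionContext): Promise<void>;

const MAX_TOOL_CALLS_IN_WINDOW = 20;

// This is a read-modify-write spanning several awaits. If sub-agents share a session ID,
// serialize calls per session (or add version checks as in Pattern 2) to avoid lost records.
async function addToolCallRecord(
  sessionId: string,
  record: ToolCallRecord,
  summarizer: (turns: ToolCallRecord[]) => Promise<string>
): Promise<void> {
  const ctx = await getContext(sessionId);
  ctx.toolCallHistory.push(record);

  if (ctx.toolCallHistory.length > MAX_TOOL_CALLS_IN_WINDOW) {
    // Summarize the oldest half and discard the raw records
    const toSummarize = ctx.toolCallHistory.splice(0, MAX_TOOL_CALLS_IN_WINDOW / 2);
    const summary = await summarizer(toSummarize);
    // summaryBeforeWindow still grows by one summary per compaction; re-summarize it
    // when it exceeds its own budget.
    ctx.summaryBeforeWindow = ctx.summaryBeforeWindow
      ? `${ctx.summaryBeforeWindow} Then: ${summary}`
      : summary;
  }

  await saveContext(sessionId, ctx);
}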

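// LRU eviction: expire idle sessions from the in-memory store
interface StoredContext<T> {
  data: T;
  lastAccessedAt: number; // epoch ms, refreshed on every get/set
}

class LRUContextStore<T> {
  private store = new Map<string, StoredContext<T>>();
  private readonly maxSessions: number;
  private readonly idleTtlMs: number;

  constructor(maxSessions = 500, idleTtlMs = 30 * 60 * 1000) {
    this.maxSessions = maxSessions;
    this.idleTtlMs = idleTtlMs;
    // Periodic sweep; unref() so the timer alone does not keep the process alive
    setInterval(() => this.evict(), 60_000).unref();
  }

  get(sessionId: string): T | undefined {
    const entry = this.store.get(sessionId);
    if (!entry) return undefined;
    entry.lastAccessedAt = Date.now(); // touching a session marks it recently used
    return entry.data;
  }

  set(sessionId: string, data: T): void {
    this.store.set(sessionId, { data, lastAccessedAt: Date.now() });
    if (this.store.size > this.maxSessions) this.evict(); // enforce the cap immediately
  }

  // Backs a context.clear tool
  delete(sessionId: string): void {
    this.store.delete(sessionId);
  }

  private evict(): void {
    const now = Date.now();
    // Evict idle sessions first
    for (const [id, entry] of this.store) {
      if (now - entry.lastAccessedAt > this.idleTtlMs) this.store.delete(id);
    }
    // If still over the cap, evict least recently used sessions
    if (this.store.size > this.maxSessions) {
      const sorted = [...this.store.entries()]
        .sort(([, a], [, b]) => a.lastAccessedAt - b.lastAccessedAt);
      const toRemove = sorted.slice(0, this.store.size - this.maxSessions);
      toRemove.forEach(([id]) => this.store.delete(id));
    }
  }
}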

Security note specific to multi-agent deployments: when multiple sub-agents share a session ID (intentionally, to collaborate on shared context), never derive the session ID from user-supplied input without sanitizing it. A malicious session_id containing ../ or a Redis key separator can cause a context read to cross session boundaries. Validate session IDs against a UUID regex before any store operation.
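The following is a minimal sketch of that validation. It assumes session IDs are standard hyphenated UUIDs and that store keys are built by prefixing a fixed namespace. `contextKey` and the `ctx:` prefix are illustrative.

```typescript
// session-id.ts: validate session IDs before they are used in any store key
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[1-8][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function contextKey(sessionId: string): string {
  // Rejects path fragments like "../", key separators like ":", and anything else non-UUID
  if (!UUID_RE.test(sessionId)) {
    throw new Error('Invalid session_id: expected a UUID');
  }
  // Normalize case so the same session always maps to the same key
  return `ctx:${sessionId.toLowerCase()}`;
}
```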

Expose a context.clear tool so the orchestrating agent can explicitly reset context between distinct tasks within a long-running session. Without it, context from a completed sub-task bleeds into the next one, producing confusing tool behavior that is hard to reproduce in testing.

How the five patterns compose

Each pattern reduces pressure on the others. They are most effective when applied together:

Pattern A | Pattern B | How they interact
--- | --- | ---
Topology (fan-out sizing) | Shared state (optimistic locking) | Capping fan-out at 15 concurrent sub-agents, instead of launching them all at once, reduces write contention on shared records: fewer concurrent writers means fewer optimistic lock conflicts and fewer retries
Shared state (optimistic locking) | Tool composition (typed pipeline) | When a pipeline step fails due to an optimistic lock conflict, a StepError with retryable: true gives the caller enough information to retry only that step rather than the entire pipeline
Tool composition (map-reduce) | Agent handoffs (idempotency) | A map-reduce pipeline that fans out across multiple MCP servers uses idempotency tokens at the handoff layer, so retried fan-out calls do not re-execute completed sub-tasks
Agent handoffs (checkpoint-and-resume) | Conversation context (sliding window) | The handoff envelope carries the summaryBeforeWindow from the sending server's context store inside accumulated_context, so the receiving server can initialize its context from the checkpoint without replaying the full session history
Conversation context (LRU store) | Topology (orchestrator fan-out) | Capping the LRU store at 500 active sessions limits memory growth even when an orchestrator spawns hundreds of short-lived sub-agents; each sub-agent session is evicted once it goes idle rather than accumulating forever

The recommended introduction order matches the urgency of the failure modes under real multi-agent load:

  1. Shared state first — data corruption from concurrent writes is the most dangerous failure because it is invisible during testing and produces subtle, hard-to-reproduce bugs in production. Add external storage and optimistic locking before you add any parallelism.
  2. Topology + pool sizing second — once shared state is safe, control your fan-out. A bounded pLimit prevents connection pool exhaustion and gives the server enough headroom to handle retries without amplifying pressure.
  3. Tool composition third — invest in typed pipelines for any workflow that today requires the agent to call 3+ tools in sequence with no branching on intermediates. The round-trip savings pay off quickly at scale.
  4. Agent handoffs fourth — add checkpoint-and-resume only once you have server-to-server handoffs in your workflow. The pattern is essential when you have them, but overkill if all your agents call the same server.
  5. Conversation context last — sliding window and LRU eviction are worth adding once sessions are actually long enough to overflow. Under typical sub-agent workloads (short-lived, task-specific sessions), the base in-memory Map with an idle TTL is sufficient.

What external monitoring sees in multi-agent deployments

All five patterns are in-process. They depend on the server being healthy enough to execute them. Optimistic locking cannot resolve a write conflict if the database is unreachable. Tool composition cannot return a partial result if the Node.js event loop is blocked by a CPU-intensive step. A handoff cannot be checkpointed if the SQLite file is on a full disk. Context cannot be retrieved from the LRU store if the process has crashed and restarted without loading the prior state.

This is the gap that AliveMCP fills in multi-agent deployments specifically. In a single-agent system, a crashed MCP server is noticed immediately — the single session fails. In a multi-agent system with an orchestrator-dispatcher topology, the orchestrator's retry logic may absorb the failure silently: if one of N sub-agents fails, the orchestrator retries it on a different connection, and the overall task still completes (slowly). The failed server can go undetected for long enough to accumulate a backlog of missed handoffs and corrupted checkpoints.

The 60-second external probe from AliveMCP sends a full MCP initialize handshake from outside the server — not an HTTP healthcheck that the server's own router handles — and verifies that the server can negotiate the protocol, list its tools, and return a valid response. In a multi-server handoff chain, each server in the chain should be configured as a separate monitored endpoint. A receiving server with a failing probe is identifiable before the orchestrating agent exhausts its handoff retry budget against it.

The combination of the five in-process patterns from this post and out-of-process protocol probing covers the full operational failure surface for multi-agent MCP deployments: safe under concurrency (shared state + topology), efficient under volume (tool composition + context), resilient across server boundaries (handoffs), and visible to the on-call team when infrastructure fails (external monitoring).