
MCP server shared state

When multiple agent sessions call the same MCP server simultaneously, shared state is the root cause of most race conditions and data corruption bugs. The right architecture depends on whether your deployment is single-node or distributed — but the patterns for safe concurrent access are the same.

TL;DR

Never store mutable state in Node.js process memory across MCP sessions; use external storage instead. For single-node deployments, SQLite in WAL mode handles hundreds of concurrent readers with serialized writes and no separate infrastructure. For multi-node or distributed deployments, Redis is the right shared context store: use hash keys per session, SET with the NX and EX options (the successor to the deprecated SETNX) for distributed locks, and Lua scripts for atomic compare-and-swap. In either case, apply optimistic locking with a version field to detect concurrent writes, and switch to an event-sourced append-only log when write contention is high. Pair with read-through caching to reduce storage round-trips under high read load. Monitor state health with AliveMCP external probes alongside internal write conflict rate metrics. See also: multi-agent topologies and error handling for conflict recovery strategies.

Why shared state is dangerous in MCP servers

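An MCP server that handles a single agent session at a time can safely use in-process state: module-level Maps, cached objects, counters. The Node.js event loop is single-threaded, so synchronous code between two await points always runs to completion without interruption. The problem emerges the moment a second agent session starts issuing tool calls while the first is still active: every await in a handler is a point where another handler can run and mutate the same state, so a value read before the await may be stale by the time it is written back.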

Consider a tool handler that tracks per-session context in a module-level Map:

// Dangerous: shared in-process state across sessions
const sessionContext = new Map<string, { stepCount: number; lastTool: string }>();

server.tool(
  'run_step',
  'Execute the next step in a multi-step workflow',
  { sessionId: z.string(), action: z.string() },
  async ({ sessionId, action }) => {
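    const ctx = sessionContext.get(sessionId) ?? { stepCount: 0, lastTool: '' };
    const nextStep = ctx.stepCount + 1;

    // Any await between the read and the write lets another handler interleave.
    // executeAction is a hypothetical async helper standing in for the real work.
    await executeAction(action);

    // RACE: two concurrent calls with the same sessionId both read stepCount = 5,
    // both compute 6, and both write it back, so one increment is lost
    sessionContext.set(sessionId, { stepCount: nextStep, lastTool: action });

    return { content: [{ type: 'text', text: `Step ${nextStep} complete` }] };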
  }
);

The three dangerous patterns to eliminate:

The fix in all three cases is the same: move state to a store that provides atomic operations and isolation guarantees, and remove all mutable state from the Node.js process. The process becomes stateless — a pure request-response transformer — and horizontal scaling becomes trivial.
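The lost update described above can be reproduced without an MCP server or a database. The sketch below is a minimal illustration, assuming an in-memory Map as a stand-in for session state; `tick`, `racyIncrement`, `atomicIncrement`, and `lostUpdateDemo` are hypothetical names, not SDK APIs.

```typescript
// In-memory stand-in for a session store; not a real MCP or Redis API.
const counts = new Map<string, number>();

// Yields to the event loop, like any real I/O await would.
const tick = () => new Promise<void>((resolve) => setTimeout(resolve, 0));

// Read-modify-write with an await in the middle: vulnerable to interleaving.
async function racyIncrement(id: string): Promise<void> {
  const current = counts.get(id) ?? 0; // read
  await tick();                        // another handler may run here
  counts.set(id, current + 1);         // write based on a possibly stale read
}

// Read and write in one synchronous step: nothing can interleave.
async function atomicIncrement(id: string): Promise<void> {
  await tick();
  counts.set(id, (counts.get(id) ?? 0) + 1);
}

async function lostUpdateDemo(): Promise<{ racy: number; atomic: number }> {
  await Promise.all([racyIncrement('a'), racyIncrement('a')]);
  await Promise.all([atomicIncrement('b'), atomicIncrement('b')]);
  return { racy: counts.get('a') ?? 0, atomic: counts.get('b') ?? 0 };
}
```

Both racy calls read 0 before either writes, so the racy counter ends at 1 while the atomic counter ends at 2. An atomic store operation such as Redis HINCRBY plays the role of `atomicIncrement` once state moves out of the process.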

Redis as a shared context store

Redis is the most common choice for shared MCP server state in distributed deployments. Its single-threaded command execution model means every command is atomic with respect to every other command — there is no equivalent of a torn read at the Redis layer. Lua scripts extend this atomicity to multi-command sequences.

Model per-session context as a Redis hash keyed by session ID:

import { createClient } from 'redis';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

const server = new McpServer({ name: 'stateful-server', version: '1.0.0' });

server.tool(
  'update_workflow_step',
  'Advance the workflow to the next step and record the action',
  {
    sessionId: z.string(),
    action: z.string(),
    result: z.string(),
  },
  async ({ sessionId, action, result }) => {
    const key = `session:${sessionId}:context`;

    // HINCRBY is atomic — no race condition on stepCount
    const newStepCount = await redis.hIncrBy(key, 'stepCount', 1);
    await redis.hSet(key, {
      lastTool: action,
      lastResult: result,
      updatedAt: Date.now().toString(),
    });
    // Set TTL so stale sessions are GC'd automatically
    await redis.expire(key, 3600);

    return {
      content: [{
        type: 'text',
        text: JSON.stringify({ stepCount: newStepCount, recorded: true }),
      }],
    };
  }
);

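// Distributed lock using SET with NX and EX (the atomic successor to SETNX + EXPIRE).
// Caveat: if fn() runs longer than ttlSeconds, the lock expires and another caller can acquire it.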
async function withLock<T>(
  lockKey: string,
  ttlSeconds: number,
  fn: () => Promise<T>
): Promise<T> {
  const lockValue = `${Date.now()}-${Math.random()}`;
  const acquired = await redis.set(lockKey, lockValue, {
    NX: true,       // only set if not exists
    EX: ttlSeconds, // auto-expire prevents deadlocks
  });

  if (!acquired) {
    throw new Error(`Could not acquire lock: ${lockKey}`);
  }

  try {
    return await fn();
  } finally {
    // Release only if we still own the lock (Lua script for atomicity)
    const releaseLua = `
      if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
      else
        return 0
      end
    `;
    await redis.eval(releaseLua, { keys: [lockKey], arguments: [lockValue] });
  }
}

Use Lua scripts for compare-and-swap operations that must be atomic across multiple Redis commands. The script runs in Redis's single-threaded context — no other command can execute between the script's commands. This is the correct way to implement optimistic locking checks, conditional updates, and atomic dequeue operations in Redis.

// Atomic compare-and-swap: update value only if it matches expected
const casLua = `
  local current = redis.call('hget', KEYS[1], 'value')
  if current == ARGV[1] then
    redis.call('hset', KEYS[1], 'value', ARGV[2])
    redis.call('hset', KEYS[1], 'version', ARGV[3])
    return 1
  else
    return 0
  end
`;

async function atomicCAS(
  key: string,
  expectedValue: string,
  newValue: string,
  newVersion: string
): Promise<boolean> {
  const result = await redis.eval(casLua, {
    keys: [key],
    arguments: [expectedValue, newValue, newVersion],
  });
  return result === 1;
}

Redis connection management for MCP servers differs from typical database pooling. A node-redis client sends all of its commands over a single connection and pipelines them, which is usually enough because Redis executes commands serially anyway. Create additional clients (for example with client.duplicate()) only for blocking commands or Pub/Sub subscribers, which need a dedicated connection. Set socket.reconnectStrategy and pingInterval so the MCP server recovers automatically from Redis restarts without manual intervention. Pair with a circuit breaker that opens when Redis latency spikes, preventing cascading timeouts across all tool handlers.
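The recovery pieces mentioned here can be sketched in plain TypeScript. This is a minimal sketch under stated assumptions: `reconnectDelay` follows the function shape node-redis accepts for socket.reconnectStrategy (retry count in, delay in milliseconds or an Error out), and `CircuitBreaker` is a hypothetical helper that counts failures (including timeouts) rather than measuring latency directly. The thresholds are illustrative values, not recommendations.

```typescript
// Exponential backoff capped at 5 seconds; gives up after 20 attempts.
function reconnectDelay(retries: number): number | Error {
  if (retries > 20) return new Error('Redis unreachable, giving up');
  return Math.min(100 * 2 ** retries, 5_000);
}

type BreakerState = 'closed' | 'open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly cooldownMs: number,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  // Open while the failure threshold is reached and the cooldown has not elapsed.
  get state(): BreakerState {
    const open =
      this.failures >= this.failureThreshold &&
      this.now() - this.openedAt < this.cooldownMs;
    return open ? 'open' : 'closed';
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') throw new Error('circuit open: skipping call');
    try {
      const result = await fn();
      this.failures = 0; // any success fully closes the breaker
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

A tool handler would wrap each storage call, for example `breaker.run(() => redis.hGetAll(key))`, so that once the breaker opens, calls fail fast instead of each waiting for its own timeout. After the cooldown the next call is let through as a trial: a success closes the breaker, a failure reopens it.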

SQLite WAL mode for single-node deployments

For MCP servers that run on a single host and do not need to share state across multiple hosts, SQLite in WAL (Write-Ahead Log) mode is a compelling alternative to Redis. There is no separate infrastructure to operate, latency is sub-millisecond (no network round-trip), and WAL mode allows concurrent readers without blocking writes.

SQLite's concurrency model in WAL mode: multiple readers can read simultaneously without blocking each other or blocking a concurrent writer. Writers serialize — only one write transaction can be active at a time, but readers are never blocked by writers. For MCP servers with a read-heavy workload (tool calls that mostly query data with occasional writes), WAL mode delivers near-Redis throughput with zero operational overhead.

import Database from 'better-sqlite3';

// Open with WAL mode — set once at startup, persists in the database file
const db = new Database('./state.db');
db.pragma('journal_mode = WAL');
db.pragma('synchronous = NORMAL');  // faster than FULL; WAL stays consistent, but the latest commits can be lost on power failure
db.pragma('busy_timeout = 5000');   // wait up to 5s for write locks
db.pragma('foreign_keys = ON');

// Schema: sessions table with version field for optimistic locking
db.exec(`
  CREATE TABLE IF NOT EXISTS session_state (
    session_id TEXT PRIMARY KEY,
    step_count INTEGER NOT NULL DEFAULT 0,
    last_tool   TEXT,
    payload     TEXT,  -- JSON
    version     INTEGER NOT NULL DEFAULT 1,
    updated_at  INTEGER NOT NULL DEFAULT (unixepoch())
  );
`);

// Prepare statements at startup — faster than preparing per-call
const getState = db.prepare(
  'SELECT * FROM session_state WHERE session_id = ?'
);
const upsertState = db.prepare(`
  INSERT INTO session_state (session_id, step_count, last_tool, payload, version, updated_at)
  VALUES (@sessionId, @stepCount, @lastTool, @payload, 1, unixepoch())
  ON CONFLICT(session_id) DO UPDATE SET
    step_count = excluded.step_count,
    last_tool  = excluded.last_tool,
    payload    = excluded.payload,
    version    = session_state.version + 1,
    updated_at = unixepoch()
`);

server.tool(
  'record_step',
  'Record a workflow step result in session state',
  { sessionId: z.string(), tool: z.string(), payload: z.unknown() },
  async ({ sessionId, tool, payload }) => {
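    // better-sqlite3 is synchronous: each call blocks the event loop until it
    // returns, which is fine for sub-millisecond statements like these.
    // Run the read and the write in one IMMEDIATE transaction so the
    // read-modify-write stays atomic even if another process shares the file.
    const recordStep = db.transaction(() => {
      const existing = getState.get(sessionId) as
        { step_count: number } | undefined;

      upsertState.run({
        sessionId,
        stepCount: (existing?.step_count ?? 0) + 1,
        lastTool: tool,
        payload: JSON.stringify(payload ?? null),
      });
    });
    recordStep.immediate();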

    return {
      content: [{
        type: 'text',
        text: JSON.stringify({ recorded: true }),
      }],
    };
  }
);

One limitation of SQLite WAL mode: write serialization. If 50 concurrent tool calls all attempt to write simultaneously, they queue behind the single active write transaction. For write-heavy workloads under high concurrency, the queue length grows and p99 latency rises. The solution is to batch writes — collect pending writes in memory for a short window (10–50ms) and commit them in a single transaction. This reduces write transactions from N to 1 while keeping per-call result delivery fast.
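The batching idea can be sketched independently of SQLite. This is a minimal sketch, assuming a caller-supplied `commitBatch` function that writes all items in one transaction; with better-sqlite3 that could be a function created by db.transaction(...). `WriteBatcher` and its parameter names are hypothetical.

```typescript
interface PendingWrite<T> {
  item: T;
  resolve: () => void;
  reject: (reason: unknown) => void;
}

class WriteBatcher<T> {
  private pending: PendingWrite<T>[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private readonly commitBatch: (items: T[]) => void, // one transaction per batch
    private readonly windowMs = 20,                     // collection window
  ) {}

  // Queue a write; the promise settles once the whole batch has been committed.
  write(item: T): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      this.pending.push({ item, resolve, reject });
      if (this.timer === null) {
        this.timer = setTimeout(() => this.flush(), this.windowMs);
      }
    });
  }

  private flush(): void {
    const batch = this.pending;
    this.pending = [];
    this.timer = null;
    try {
      this.commitBatch(batch.map((entry) => entry.item));
      batch.forEach((entry) => entry.resolve());
    } catch (err) {
      // The transaction failed as a unit, so every caller in the batch is told.
      batch.forEach((entry) => entry.reject(err));
    }
  }
}
```

Because each promise resolves only after the commit, a handler that awaits `write()` never acknowledges a write that was not stored. The cost is that every write waits up to `windowMs`, and writes still queued in memory are lost if the process crashes before the flush.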

Optimistic locking with version fields

Optimistic locking is the right pattern when write conflicts are rare but must be detected and handled correctly when they occur. The key idea: every record carries a version counter. A handler reads the record and its version, performs its computation, then writes back the updated record with WHERE version = read_version. If another handler modified the record in the interim, the version will have changed and the WHERE clause matches zero rows — signaling a conflict.

interface WorkflowState {
  sessionId: string;
  currentPhase: string;
  completedSteps: string[];
  version: number;
}

const readState = db.prepare<[string]>(
  'SELECT * FROM workflow_state WHERE session_id = ?'
);
const updateStateOptimistic = db.prepare(`
  UPDATE workflow_state
  SET current_phase   = @currentPhase,
      completed_steps = @completedSteps,
      version         = version + 1,
      updated_at      = unixepoch()
  WHERE session_id = @sessionId
    AND version    = @expectedVersion
`);

async function advanceWorkflow(
  sessionId: string,
  nextPhase: string,
  completedStep: string,
  maxRetries = 3
): Promise<WorkflowState> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
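    // Rows come back with snake_case column names. This assumes completed_steps
    // is stored as a JSON array in a TEXT column (the table schema is not shown).
    const row = readState.get(sessionId) as
      | { current_phase: string; completed_steps: string; version: number }
      | undefined;
    if (!row) throw new Error(`Session not found: ${sessionId}`);

    const completedSteps: string[] = [
      ...JSON.parse(row.completed_steps),
      completedStep,
    ];

    const result = updateStateOptimistic.run({
      sessionId,
      currentPhase: nextPhase,
      completedSteps: JSON.stringify(completedSteps),
      expectedVersion: row.version,
    });

    if (result.changes === 1) {
      // Success: exactly one row matched the expected version and was updated
      return {
        sessionId,
        currentPhase: nextPhase,
        completedSteps,
        version: row.version + 1,
      };
    }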

    // Conflict detected — another handler modified the record
    // Retry with a brief backoff (exponential with jitter)
    if (attempt < maxRetries) {
      const backoffMs = Math.min(50 * 2 ** attempt, 500) + Math.random() * 20;
      await new Promise(r => setTimeout(r, backoffMs));
    }
  }

  throw new Error(
    `Optimistic lock conflict on session ${sessionId} after ${maxRetries} retries`
  );
}

Retry on conflict is appropriate when the conflict rate is low (under 5%) and retries quickly converge. If you see a high conflict rate under load, it means multiple agents are racing on the same session ID — a sign that the orchestrator's task partitioning is not properly isolating agents from each other. Re-examine the fan-out design: each sub-agent should work on a distinct partition of data. See the multi-agent topologies guide for partitioning strategies.

Emit a counter metric for optimistic lock conflicts and a histogram of retry counts. A rising conflict rate metric is an early warning signal — investigate before it causes user-visible latency.
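The counter and histogram can start as a small in-process recorder before being wired into a metrics backend. The sketch below is a minimal illustration; `ConflictMetrics` and its method names are hypothetical, and exporting to a real metrics system is left out.

```typescript
class ConflictMetrics {
  private attempts = 0;   // total write attempts, including retries
  private conflicts = 0;  // attempts that failed the version check
  private readonly retryCounts = new Map<number, number>();

  // Call once per completed operation with the number of retries it needed.
  recordOperation(retriesNeeded: number): void {
    this.attempts += 1 + retriesNeeded; // first attempt plus each retry
    this.conflicts += retriesNeeded;    // every retry was caused by one conflict
    this.retryCounts.set(
      retriesNeeded,
      (this.retryCounts.get(retriesNeeded) ?? 0) + 1,
    );
  }

  // Fraction of write attempts that hit a conflict.
  conflictRate(): number {
    return this.attempts === 0 ? 0 : this.conflicts / this.attempts;
  }

  // Number of operations that needed exactly `retries` retries.
  operationsWithRetries(retries: number): number {
    return this.retryCounts.get(retries) ?? 0;
  }
}
```

In a retry loop like `advanceWorkflow`, the loop index at the point of success is the value to pass to `recordOperation`. Comparing `conflictRate()` against the 5% guideline gives a simple alert condition.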

Event-sourced state

When write contention is consistently high — because many agents are writing to overlapping state — consider switching from a mutable state model to an event-sourced model. Instead of reading a record, modifying it, and writing it back, each tool call appends an immutable event to an ordered log. State is derived by replaying the log, not by reading a single mutable row. Appending to a log is always safe under concurrency: two appends to the same log do not conflict, because they produce two sequential entries.

// Event log schema — append-only, never update or delete
db.exec(`
  CREATE TABLE IF NOT EXISTS workflow_events (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id  TEXT NOT NULL,
    event_type  TEXT NOT NULL,
    payload     TEXT NOT NULL,  -- JSON
    agent_id    TEXT NOT NULL,
    occurred_at INTEGER NOT NULL DEFAULT (unixepoch('now', 'subsec') * 1000)
  );
  CREATE INDEX IF NOT EXISTS idx_events_session
    ON workflow_events(session_id, id);
`);

const appendEvent = db.prepare(`
  INSERT INTO workflow_events (session_id, event_type, payload, agent_id)
  VALUES (@sessionId, @eventType, @payload, @agentId)
`);

const getEvents = db.prepare<[string]>(
  'SELECT * FROM workflow_events WHERE session_id = ? ORDER BY id ASC'
);

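// State shape produced by replay: no version field, plus error fields
interface ReplayedState {
  sessionId: string;
  currentPhase: string;
  completedSteps: string[];
  hasError: boolean;
  errorMessage?: string;
}

// Materialize current state by replaying all events for a session
function materializeState(sessionId: string): ReplayedState {
  const events = getEvents.all(sessionId) as Array<{
    event_type: string;
    payload: string;
  }>;

  // Fold events into current state; each event returns an updated copy of the state
  return events.reduce<ReplayedState>(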
    (state, event) => {
      const p = JSON.parse(event.payload);
      switch (event.event_type) {
        case 'step_completed':
          return {
            ...state,
            completedSteps: [...state.completedSteps, p.stepId],
            currentPhase: p.nextPhase ?? state.currentPhase,
          };
        case 'phase_advanced':
          return { ...state, currentPhase: p.phase };
        case 'error_recorded':
          return { ...state, hasError: true, errorMessage: p.message };
        default:
          return state;
      }
    },
    { sessionId, currentPhase: 'init', completedSteps: [], hasError: false }
  );
}

server.tool(
  'complete_step',
  'Record a completed workflow step',
  { sessionId: z.string(), stepId: z.string(), agentId: z.string() },
  async ({ sessionId, stepId, agentId }) => {
    // Append is always safe — no read-modify-write, no conflicts
    appendEvent.run({
      sessionId,
      eventType: 'step_completed',
      payload: JSON.stringify({ stepId }),
      agentId,
    });

    return {
      content: [{ type: 'text', text: JSON.stringify({ appended: true }) }],
    };
  }
);

The tradeoff with event sourcing: reading current state requires replaying all events for the session, which grows more expensive as the event log grows. Mitigate this with periodic snapshots — materialize state after every N events and store the snapshot alongside the event log. When reading, load the most recent snapshot and replay only events after it. For most MCP workflows (tens to hundreds of steps per session), snapshots are unnecessary — the replay is fast enough. For long-running sessions with thousands of events, snapshots become essential.
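Snapshot-based replay can be sketched over an in-memory event list. This is a minimal sketch under stated assumptions: events carry a monotonically increasing `id` (like the AUTOINCREMENT column), only step-completion events are modeled, and `Snapshot`, `replayFrom`, and `shouldSnapshot` are hypothetical names.

```typescript
interface StepEvent {
  id: number;      // monotonically increasing event id
  stepId: string;
}

interface Snapshot {
  lastEventId: number;       // id of the last event folded into this snapshot
  completedSteps: string[];
}

const EMPTY_SNAPSHOT: Snapshot = { lastEventId: 0, completedSteps: [] };

// Fold only the events newer than the snapshot into a new snapshot.
function replayFrom(snapshot: Snapshot, events: StepEvent[]): Snapshot {
  return events
    .filter((event) => event.id > snapshot.lastEventId)
    .reduce<Snapshot>(
      (state, event) => ({
        lastEventId: event.id,
        completedSteps: [...state.completedSteps, event.stepId],
      }),
      snapshot,
    );
}

// Decide whether the freshly materialized state should replace the stored snapshot.
function shouldSnapshot(stored: Snapshot, current: Snapshot, everyN: number): boolean {
  return current.lastEventId - stored.lastEventId >= everyN;
}
```

Against the SQLite log, the filter would move into the query (for example `WHERE session_id = ? AND id > ?`) so that only events after the snapshot are read. Replaying from a snapshot gives the same state as a full replay, because the snapshot is itself the result of folding the earlier events.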

Event sourcing also produces a complete audit trail for free: every state transition is recorded with its agent ID and timestamp. This is directly useful for debugging multi-agent conflicts — you can replay the event log for any session and see exactly which agents made which state transitions in which order. See the audit logging guide for patterns to query and expose this history.

Read-through cache patterns

Tool handlers that read shared state on every call impose storage round-trip latency on every tool invocation. For state that changes infrequently relative to its read rate — configuration, user preferences, feature flags, permission sets — a read-through cache in process memory reduces storage load and cuts p50 latency dramatically.

interface CacheEntry<T> {
  value: T;
  fetchedAt: number;
  ttlMs: number;
}

class ReadThroughCache<T> {
  private store = new Map<string, CacheEntry<T>>();

  constructor(
    private readonly fetch: (key: string) => Promise<T>,
    private readonly defaultTtlMs: number
  ) {}

  async get(key: string, ttlMs = this.defaultTtlMs): Promise<T> {
    const entry = this.store.get(key);
    if (entry && Date.now() - entry.fetchedAt < entry.ttlMs) {
      return entry.value; // cache hit
    }

    // Cache miss or expired — fetch from storage
    const value = await this.fetch(key);
    this.store.set(key, { value, fetchedAt: Date.now(), ttlMs });
    return value;
  }

  invalidate(key: string): void {
    this.store.delete(key);
  }

  invalidateAll(): void {
    this.store.clear();
  }
}

// Cache session configuration — rarely changes, read on every tool call
const configCache = new ReadThroughCache<SessionConfig>(
  async (sessionId) => {
    const row = db.prepare(
      'SELECT config FROM session_config WHERE session_id = ?'
    ).get(sessionId) as { config: string } | undefined;
    return row ? JSON.parse(row.config) : defaultConfig;
  },
  30_000 // 30-second TTL
);

// In a write tool: invalidate the cache for the affected session
server.tool(
  'update_session_config',
  'Update session configuration options',
  { sessionId: z.string(), config: z.record(z.unknown()) },
  async ({ sessionId, config }) => {
    db.prepare(
      'INSERT OR REPLACE INTO session_config (session_id, config) VALUES (?, ?)'
    ).run(sessionId, JSON.stringify(config));

    // Invalidate so next read fetches fresh data
    configCache.invalidate(sessionId);

    return { content: [{ type: 'text', text: JSON.stringify({ updated: true }) }] };
  }
);

The read-through cache above is process-local — it does not synchronize across multiple MCP server instances. If you run multiple instances behind a load balancer, a write to instance A invalidates instance A's cache, but instance B still serves the stale value until its TTL expires. For configuration data with a 30-second TTL this is usually acceptable — agents see a stale config for at most 30 seconds. For data that must be consistent across all instances immediately after a write, use a Redis Pub/Sub invalidation channel: the writing instance publishes a cache invalidation message, all instances subscribe and invalidate their local entry. See the caching guide for the full distributed cache invalidation pattern.
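The invalidation flow can be sketched with an in-memory channel standing in for Redis Pub/Sub. This is a minimal sketch, not the full pattern from the caching guide: `InvalidationChannel` and `ServerInstance` are hypothetical classes, the shared Map stands in for the storage layer, and delivery is synchronous here whereas a real channel delivers asynchronously and can drop messages for disconnected subscribers.

```typescript
type Listener = (key: string) => void;

// In-memory stand-in for a Redis Pub/Sub channel (illustration only).
class InvalidationChannel {
  private listeners: Listener[] = [];

  subscribe(listener: Listener): void {
    this.listeners.push(listener);
  }

  publish(key: string): void {
    this.listeners.forEach((listener) => listener(key));
  }
}

// One server instance: a local cache in front of shared storage.
class ServerInstance {
  private cache = new Map<string, string>();

  constructor(
    private readonly storage: Map<string, string>,
    private readonly channel: InvalidationChannel,
  ) {
    // Drop the local entry whenever any instance announces a write.
    channel.subscribe((key) => this.cache.delete(key));
  }

  read(key: string): string | undefined {
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached; // cache hit
    const value = this.storage.get(key);     // cache miss: go to storage
    if (value !== undefined) this.cache.set(key, value);
    return value;
  }

  write(key: string, value: string): void {
    this.storage.set(key, value); // write to shared storage first
    this.channel.publish(key);    // then tell every instance to invalidate
  }
}
```

Keeping the TTL in place alongside the channel is a reasonable safety net: if an invalidation message is missed, the stale entry still expires on its own.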

Monitoring state health with AliveMCP

Shared state bugs are often invisible in low-traffic testing and surface only under concurrent load. By the time a race condition manifests as a user-visible error, it has usually been silently corrupting data for some time. Monitoring state health proactively requires both external and internal signals.

External probing with AliveMCP: configure a probe that calls a read-heavy tool and validates the response schema. A state corruption bug often shows up as a malformed JSON response, an unexpected null, or a missing required field — all detectable by schema validation in the probe assertion. A probe that calls get_workflow_state and asserts response.completedSteps is an array catches the torn-read case where two concurrent writes left the field in an inconsistent type.
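The kind of assertion such a probe makes can be written as a plain function. This is a sketch of the check itself, not of AliveMCP's configuration or API; it assumes the tool returns a JSON object with `currentPhase` and `completedSteps` fields, and `checkWorkflowState` is a hypothetical name.

```typescript
// Returns a list of problems; an empty list means the response looks healthy.
function checkWorkflowState(body: unknown): string[] {
  if (typeof body !== 'object' || body === null) {
    return ['response is not a JSON object'];
  }

  const problems: string[] = [];
  const state = body as Record<string, unknown>;
  const phase = state['currentPhase'];
  const steps = state['completedSteps'];

  if (typeof phase !== 'string') {
    problems.push('currentPhase is not a string');
  }
  if (!Array.isArray(steps)) {
    problems.push('completedSteps is not an array');
  } else if (!steps.every((step) => typeof step === 'string')) {
    problems.push('completedSteps contains a non-string entry');
  }
  return problems;
}
```

Returning every problem instead of failing on the first makes the probe's alert message more useful when several fields are corrupted at once.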

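Internal metrics to emit: a counter of optimistic lock conflicts, a histogram of retries per operation, and the write conflict rate derived from them.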

AliveMCP external probes detect when your server's state subsystem is causing observable failures — 500 errors from unhandled conflict exceptions, 503s from storage connection pool exhaustion, or slow responses from lock contention. Set up structured log alerts alongside external probes: log every optimistic lock conflict with the session ID, tool name, and retry count, then alert when the conflict log rate exceeds a threshold. The combination of external availability monitoring and internal conflict metrics gives you early warning before state bugs affect production agents.

See also: MCP server observability, MCP server metrics, and retry logic for handling transient state conflicts at the caller layer.

Further reading