
MCP server conversation context

MCP tools are stateless by default — each tool call arrives with no memory of previous calls in the same conversation. But many tools are more useful with context: knowing what the user searched for two calls ago, which documents have already been retrieved, or what decisions were made earlier in the session. Maintaining conversation context server-side lets tools personalize responses without bloating the agent's prompt with raw history.

TL;DR

Store conversation context server-side in a Map<sessionId, ConversationContext> for single-instance deployments, or Redis for multi-instance. Scope every read and write to the session_id, and never let context bleed across sessions. Apply a sliding TTL (typically 30 minutes of idle time) and bound the context size with a sliding window of recent tool calls, summarizing older turns when the window overflows. Track which tools were called with which arguments so you can skip re-fetching resources already retrieved. Expose a context_clear tool so the agent can explicitly reset context between tasks. Add a health_check tool that reports context store connectivity and memory pressure, and configure AliveMCP to probe it: a context store that goes down silently degrades every subsequent tool call in every active session. For handing session context across server boundaries, see MCP server agent handoff; for shared mutable state across multiple server instances, see MCP server shared state.

The stateless MCP problem

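The MCP protocol is request/response: the client sends a tools/call request, and the server returns a result. A transport may attach a session identifier to each request (Streamable HTTP does, via the Mcp-Session-Id header), but the protocol carries no application state between calls. Each tool invocation is handled by the server as an independent event. The handler receives the tool arguments and some request metadata, but it has no memory of earlier calls.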

This is the right default. Statelessness makes servers easy to scale horizontally, simple to reason about, and straightforward to test. But it creates friction for tools that are inherently conversational:

A search_documents tool that returns the same results on every call because it cannot remember which results the user already dismissed.
A fetch_data tool that re-fetches a large dataset on every call because it does not know it was already fetched two calls ago.
A generate_report tool that asks for the same parameters every call because it cannot remember the user's preferences from the session start.

The naive fix is to put all context in the agent's prompt. This works for short sessions but collapses quickly: context windows are finite, raw tool-call history is verbose, and the agent wastes inference tokens on information it already processed. A better architecture maintains a compact, structured context store on the server and keeps the agent's prompt focused on the current task.

Know when server-side context is not the right answer. If the context is needed only for a single tool call and can be passed as an argument, pass it as an argument — no state needed. If the context is fundamental to the agent's reasoning (which path did I take in this decision tree?), it belongs in the agent's prompt, not on the server. Server-side context is appropriate for derived state: resolved IDs, fetched resources, computed summaries, and user preferences that the agent should not have to re-specify on every call.

Session context store design

Choose your backing store based on your deployment topology:

Store	Best for	Tradeoffs
In-memory Map	Single-process, single-instance	Zero latency; lost on restart; no cross-instance sharing
Redis	Multi-instance, horizontally scaled	Shared across instances; survives restarts (with persistence); network latency per read/write
SQLite	Single-instance, durable across restarts	Survives restarts; no cross-instance sharing; zero network overhead

For most single-process MCP servers, an in-memory Map is the right starting point. It is zero-dependency, sub-microsecond, and trivial to implement. Switch to Redis when you scale to multiple instances or when you need context to survive server restarts (for long-running workflows where a mid-session restart would be disruptive).

// context-store.ts — in-memory store with TTL, switchable to Redis
interface StoredContext {
  context: ConversationContext;
  lastAccessedAt: number;  // ms since epoch
}

const store = new Map<string, StoredContext>();
const IDLE_TTL_MS = 30 * 60 * 1000;  // 30 minutes

// Evict idle sessions on a background timer
setInterval(() => {
  const now = Date.now();
  for (const [sessionId, entry] of store.entries()) {
    if (now - entry.lastAccessedAt > IDLE_TTL_MS) {
      store.delete(sessionId);
    }
  }
}, 60_000);  // scan every minute

export function getContext(sessionId: string): ConversationContext {
  const entry = store.get(sessionId);
  if (!entry) {
    const fresh = createFreshContext(sessionId);
    store.set(sessionId, { context: fresh, lastAccessedAt: Date.now() });
    return fresh;
  }
  entry.lastAccessedAt = Date.now();
  return entry.context;
}

export function saveContext(sessionId: string, context: ConversationContext): void {
  store.set(sessionId, { context, lastAccessedAt: Date.now() });
}

export function clearContext(sessionId: string): void {
  store.delete(sessionId);
}

export function contextStoreSize(): number {
  return store.size;
}
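The header comment says the store is switchable to Redis. One way to make that switch cheap is to have tool handlers depend on a small async interface instead of on the Map directly. The sketch below is illustrative only. `ContextStore` and `InMemoryContextStore` are hypothetical names, and the clock is injectable purely so the TTL can be tested. The interface is async even for the in-memory version because a Redis-backed implementation would be.

```typescript
// Hypothetical interface: tool handlers depend on this, not on a Map or a Redis client.
interface ContextStore<T> {
  get(sessionId: string): Promise<T | undefined>;
  set(sessionId: string, value: T): Promise<void>;
  delete(sessionId: string): Promise<void>;
}

interface Entry<T> {
  value: T;
  lastAccessedAt: number; // ms since epoch
}

// In-memory implementation. Expiry is checked lazily on read, and every
// successful read slides the idle TTL forward.
class InMemoryContextStore<T> implements ContextStore<T> {
  private readonly entries = new Map<string, Entry<T>>();
  private readonly idleTtlMs: number;
  private readonly now: () => number;

  constructor(idleTtlMs: number, now: () => number = () => Date.now()) {
    this.idleTtlMs = idleTtlMs;
    this.now = now; // injectable clock, useful in tests
  }

  async get(sessionId: string): Promise<T | undefined> {
    const entry = this.entries.get(sessionId);
    if (!entry) return undefined;
    if (this.now() - entry.lastAccessedAt > this.idleTtlMs) {
      this.entries.delete(sessionId); // idle too long: treat as missing
      return undefined;
    }
    entry.lastAccessedAt = this.now();
    return entry.value;
  }

  async set(sessionId: string, value: T): Promise<void> {
    this.entries.set(sessionId, { value, lastAccessedAt: this.now() });
  }

  async delete(sessionId: string): Promise<void> {
    this.entries.delete(sessionId);
  }
}
```

A Redis-backed class that implements the same three methods can then be swapped in at startup without changing any tool handler.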

Memory bounds matter. An unbounded Map grows with every new session and never shrinks — a slow memory leak on long-running servers. Combine TTL eviction (sessions idle for more than N minutes are evicted) with a maximum session count (if the store exceeds M sessions, evict the least-recently-used). See the eviction and cleanup section for the full LRU implementation.

Context schema design

A well-designed context schema stores only what tools need to personalize their responses — not a raw transcript. The schema should be versioned so that you can migrate existing context objects when the schema evolves:

// types.ts — ConversationContext schema
interface ToolCallRecord {
  tool: string;
  args: Record<string, unknown>;
  called_at: string;        // ISO 8601
  result_summary: string;   // compact summary, not the full result
}

interface FetchedResource {
  resource_id: string;
  resource_type: 'document' | 'database_row' | 'api_response';
  fetched_at: string;
  content_hash: string;     // for cache invalidation
  summary: string;          // compressed summary of the resource
}

interface ConversationContext {
  schema_version: 1;
  session_id: string;
  created_at: string;
  last_updated_at: string;

  // User preferences resolved during this session
  user_preferences: {
    output_format?: 'json' | 'markdown' | 'plain';
    max_results?: number;
    language?: string;
  };

  // Rolling window of recent tool calls (bounded — see compression section)
  recent_tool_calls: ToolCallRecord[];

  // Deduplicated set of resources already fetched
  fetched_resources: FetchedResource[];

  // Arbitrary key/value context set by tools
  custom: Record<string, unknown>;

  // Compression state: absent until the first compression
  summary_before_window?: string;
  window_start_index?: number;
}

function createFreshContext(sessionId: string): ConversationContext {
  const now = new Date().toISOString();
  return {
    schema_version: 1,
    session_id: sessionId,
    created_at: now,
    last_updated_at: now,
    user_preferences: {},
    recent_tool_calls: [],
    fetched_resources: [],
    custom: {},
  };
}
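The schema_version field only pays off if something reads it. Here is a minimal sketch of migrate-on-read. It assumes a hypothetical version 2 that adds a `pinned_resource_ids` array, which is not part of the schema above, and it includes only the fields needed to show the migration.

```typescript
// Hypothetical schema versions, trimmed to the fields needed to show migration.
interface ContextV1 {
  schema_version: 1;
  session_id: string;
  custom: Record<string, unknown>;
}

interface ContextV2 {
  schema_version: 2;
  session_id: string;
  custom: Record<string, unknown>;
  pinned_resource_ids: string[]; // new in the hypothetical v2
}

type StoredAnyVersion = ContextV1 | ContextV2;

// Upgrade whatever was stored to the latest version on read. Unknown versions
// (for example, data written by a newer deployment) are rejected, not guessed at.
function migrateContext(stored: StoredAnyVersion): ContextV2 {
  switch (stored.schema_version) {
    case 2:
      return stored;
    case 1:
      return { ...stored, schema_version: 2, pinned_resource_ids: [] };
    default: {
      const unsupported: never = stored;
      throw new Error(`Unsupported context: ${JSON.stringify(unsupported)}`);
    }
  }
}
```

Calling this right after JSON.parse, and writing the migrated object back on the next save, upgrades each stored context at most once.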

Store summaries and IDs rather than full content. The result_summary field on a ToolCallRecord should be a one- or two-sentence description of what the tool returned. That is enough for other tools to decide whether a re-call is needed without bloating the context object. The full result lives in the agent's context window or in a separate cache, and the context store holds only the metadata needed to avoid redundant work.
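A small helper can keep result_summary within that budget. This is a hypothetical sketch. The sentence split is a naive regex that will break on abbreviations such as "e.g.", and the 240-character cap is an arbitrary example value.

```typescript
// Keep the first `maxSentences` sentences of a description, then hard-cap the
// length so one verbose tool result cannot bloat the context object.
function compactSummary(text: string, maxSentences = 2, maxChars = 240): string {
  const normalized = text.replace(/\s+/g, " ").trim();
  // Naive split: a sentence ends at ., ! or ? followed by whitespace
  const sentences = normalized.split(/(?<=[.!?])\s+/);
  const kept = sentences.slice(0, maxSentences).join(" ");
  return kept.length <= maxChars ? kept : kept.slice(0, maxChars - 1) + "…";
}
```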

Derive rather than store when possible. If you can recompute a value cheaply from other stored values, do not store it. Stored state has migration costs; derived state is always up to date. For example, instead of storing a has_fetched_user_profile boolean, derive it from fetched_resources.some(r => r.resource_type === 'database_row' && r.resource_id.startsWith('user:')).
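Wrapped as a helper, the derived check from that example could look like the following. `ResourceRef` is a trimmed-down, hypothetical stand-in for FetchedResource, and the `user:` ID prefix is the convention assumed in the text.

```typescript
// Only the fields this check needs (a subset of FetchedResource)
interface ResourceRef {
  resource_id: string;
  resource_type: "document" | "database_row" | "api_response";
}

// Computed on demand instead of stored as a boolean flag, so it can never
// drift out of sync with fetched_resources.
function hasFetchedUserProfile(fetched: ResourceRef[]): boolean {
  return fetched.some(
    (r) => r.resource_type === "database_row" && r.resource_id.startsWith("user:"),
  );
}
```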

Context compression for long sessions

A session that runs for dozens of tool calls accumulates a recent_tool_calls array that grows without bound. The context object gets large, serialization becomes slow, and the oldest entries add noise without adding value. Compression trims the window and replaces older entries with a dense summary.

// context-compression.ts — sliding window with summary generation
const COMPRESS_THRESHOLD = 15; // compress once recent_tool_calls reaches 15 entries
const KEEP_RECENT = 10;        // after compressing, keep the 10 most recent calls live

export function maybeCompressContext(
  context: ConversationContext,
  summarize: (calls: ToolCallRecord[]) => string  // inject summarization logic
): ConversationContext {
  if (context.recent_tool_calls.length < COMPRESS_THRESHOLD) {
    return context;
  }

  // Split: keep the KEEP_RECENT most recent calls live; summarize the older ones
  const toSummarize = context.recent_tool_calls.slice(0, -KEEP_RECENT);
  const toKeep = context.recent_tool_calls.slice(-KEEP_RECENT);

  // Append to any existing summary. The summary grows by one block per
  // compression, so cap or re-summarize it in very long sessions.
  const combinedSummary = context.summary_before_window
    ? `${context.summary_before_window}\n\n${summarize(toSummarize)}`
    : summarize(toSummarize);

  return {
    ...context,
    recent_tool_calls: toKeep,
    summary_before_window: combinedSummary,
    last_updated_at: new Date().toISOString(),
  };
}

// Simple rule-based summarizer (no LLM required for most use cases)
export function summarizeToolCalls(calls: ToolCallRecord[]): string {
  if (calls.length === 0) return '';  // nothing to summarize (guards calls[0] below)

  const byTool = new Map<string, number>();
  for (const call of calls) {
    byTool.set(call.tool, (byTool.get(call.tool) ?? 0) + 1);
  }

  const first = calls[0];
  const last = calls[calls.length - 1];
  const lines: string[] = [
    `Summary of ${calls.length} earlier tool calls (${first.called_at} to ${last.called_at}):`,
  ];
  for (const [tool, count] of byTool.entries()) {
    lines.push(`  - ${tool}: called ${count} time${count > 1 ? 's' : ''}`);
  }

  // Add up to three non-empty result summaries, earliest calls first
  const notable = calls.filter(c => c.result_summary.length > 0).slice(0, 3);
  for (const call of notable) {
    lines.push(`  - ${call.tool} result: ${call.result_summary}`);
  }

  return lines.join('\n');
}

For sessions where the older context is still highly relevant (research sessions that loop back to earlier findings), LLM-based summarization produces better output than rule-based compression. Call a lightweight summarization model with the toSummarize array and cache the summary — the summarization call is a one-time cost per compression event, and the result replaces potentially hundreds of raw records. Keep the summarization call out of the hot path: run it asynchronously after saving the compressed context, not during the tool call that triggered the threshold.
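The off-hot-path pattern can be sketched as a fire-and-forget upgrade. The tool call saves the rule-based summary immediately. A background task then asks a model for a better summary and swaps it in only if the session still holds the summary the task started from. `llmSummarize` is a hypothetical injected function, not a specific API, and the load/save callbacks stand in for whichever context store you use.

```typescript
// Only the pieces of the context this sketch touches.
interface CallLite {
  tool: string;
  result_summary: string;
}

interface SummaryState {
  summary_before_window?: string;
}

// Injected dependencies; llmSummarize stands in for a call to whatever
// summarization model you use.
interface SummaryDeps<C extends SummaryState> {
  load(sessionId: string): C | undefined;
  save(sessionId: string, ctx: C): void;
  llmSummarize(calls: CallLite[], ruleBasedSummary: string): Promise<string>;
}

// Called without awaiting, after the compressed context (carrying the
// rule-based summary) has been saved. Replaces that summary with the model's
// version only if nothing else changed it in the meantime.
function upgradeSummaryInBackground<C extends SummaryState>(
  sessionId: string,
  toSummarize: CallLite[],
  savedSummary: string,
  deps: SummaryDeps<C>,
): Promise<void> {
  return deps
    .llmSummarize(toSummarize, savedSummary)
    .then((better) => {
      const current = deps.load(sessionId);
      // Session cleared or evicted, or compressed again: leave it alone
      if (!current || current.summary_before_window !== savedSummary) return;
      deps.save(sessionId, { ...current, summary_before_window: better });
    })
    .catch(() => {
      // On model failure the rule-based summary simply stays; log this in real code
    });
}
```

Comparing against the saved summary is a cheap optimistic check. It prevents a slow model call from overwriting a newer compression or resurrecting a session that context_clear already removed.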

For retrieval of older context (sessions that span many hundreds of turns and need to surface specific past facts), embedding-based retrieval is more accurate than summary windows. Store embeddings of each tool call's result summary, and when a tool needs older context, do a nearest-neighbor search over the embedding store. This is more complex to implement but avoids the information loss inherent in sliding-window compression. See MCP server caching for patterns that work alongside this approach.
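The nearest-neighbor step itself is small. The following is a brute-force sketch. It assumes the embedding vectors are produced elsewhere by a model of your choice (not shown) and stored per session alongside each result_summary. `EmbeddedRecord` and `nearestRecords` are illustrative names. A linear scan with cosine similarity is usually adequate for hundreds of records, and a dedicated vector index becomes worthwhile well beyond that.

```typescript
interface EmbeddedRecord {
  tool: string;
  result_summary: string;
  embedding: number[]; // produced by an embedding model (assumed, not shown)
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: no meaningful direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan: score every record against the query, return the top k.
function nearestRecords(query: number[], records: EmbeddedRecord[], k = 3): EmbeddedRecord[] {
  return records
    .map((r) => ({ r, score: cosineSimilarity(query, r.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.r);
}
```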

Tool call history tracking

Tracking which tools were called with which arguments enables two key optimizations: deduplication of fetched resources (do not re-fetch a document that was already fetched) and "already tried" context (do not retry a search query that returned no results).

// In each tool handler: record the call and check for prior calls
server.tool(
  'fetch_document',
  'Fetch a document by ID from the knowledge base',
  { document_id: z.string() },
  async (args, extra) => {
    // The TypeScript SDK passes the transport session ID as extra.sessionId.
    // It is undefined on stdio, where one process serves a single session.
    const sessionId = extra.sessionId ?? 'stdio';
    const ctx = getContext(sessionId);

    // Check if this document was already fetched in this session
    const alreadyFetched = ctx.fetched_resources.find(
      r => r.resource_id === args.document_id
    );
    if (alreadyFetched) {
      return {
        content: [{
          type: 'text',
          text: JSON.stringify({
            note: 'document_already_fetched_this_session',
            document_id: args.document_id,
            fetched_at: alreadyFetched.fetched_at,
            summary: alreadyFetched.summary,
          }),
        }],
      };
    }

    // Fetch the document (fetchDocumentFromStore and hashContent are app-specific helpers)
    const doc = await fetchDocumentFromStore(args.document_id);
    const now = new Date().toISOString();

    // Record the fetch in context
    const updatedCtx = maybeCompressContext({
      ...ctx,
      fetched_resources: [
        ...ctx.fetched_resources,
        {
          resource_id: args.document_id,
          resource_type: 'document',
          fetched_at: now,
          content_hash: hashContent(doc.content),
          summary: doc.content.slice(0, 200) + (doc.content.length > 200 ? '…' : ''),
        },
      ],
      recent_tool_calls: [
        ...ctx.recent_tool_calls,
        {
          tool: 'fetch_document',
          args: { document_id: args.document_id },
          called_at: now,
          result_summary: `Fetched "${doc.title}" (${doc.content.length} chars)`,
        },
      ],
      last_updated_at: now,
    }, summarizeToolCalls);

    saveContext(sessionId, updatedCtx);

    return {
      content: [{ type: 'text', text: JSON.stringify(doc) }],
    };
  }
);

Bound the fetched_resources array separately from the recent_tool_calls array. Resources do not compress well — you need the actual resource_id to deduplicate, not a summary. Apply a maximum of, say, 200 entries and evict the oldest when the limit is reached. For resource-intensive sessions (a research task that fetches hundreds of documents), consider a lightweight content-hash index instead of storing full records.
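A bounded, deduplicating append is enough to enforce that cap. The following is a sketch. `FetchedRef` is a hypothetical subset of FetchedResource, and 200 is the example limit from the text.

```typescript
interface FetchedRef {
  resource_id: string;
  fetched_at: string;
}

const MAX_FETCHED_RESOURCES = 200; // example cap from the text

// Append a resource, replacing any earlier entry with the same ID, then drop
// the oldest entries once the cap is exceeded. Returns a new array.
function addFetchedResource<T extends FetchedRef>(
  list: T[],
  entry: T,
  max = MAX_FETCHED_RESOURCES,
): T[] {
  const next = [...list.filter((r) => r.resource_id !== entry.resource_id), entry];
  return next.length > max ? next.slice(next.length - max) : next;
}
```

Re-fetching a resource moves it to the newest end, so frequently used resources are the last to be evicted.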

Track search query results to avoid re-running identical searches. Store the query string (or a hash of it) and whether it returned results. When the same query arrives again, surface the "already tried, returned N results" context so the agent can decide whether to vary the query or proceed with what it already has. In iterative search workflows this can substantially cut redundant API calls, which directly lowers your rate limit consumption.
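Here is one way to keep that history. It is a sketch that uses a normalized query string as the key instead of a hash. The names, the 100-entry cap, and the wording of the note are all hypothetical.

```typescript
interface SearchAttempt {
  query: string;        // normalized query text
  result_count: number;
  searched_at: string;  // ISO 8601
}

// Treat case and whitespace differences as the same query.
function normalizeQuery(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, " ");
}

function findPriorSearch(history: SearchAttempt[], query: string): SearchAttempt | undefined {
  const key = normalizeQuery(query);
  return history.find((h) => h.query === key);
}

// Record (or refresh) an attempt and keep only the newest `max` entries.
function recordSearch(
  history: SearchAttempt[],
  query: string,
  resultCount: number,
  max = 100,
  now: Date = new Date(),
): SearchAttempt[] {
  const key = normalizeQuery(query);
  const next = [
    ...history.filter((h) => h.query !== key),
    { query: key, result_count: resultCount, searched_at: now.toISOString() },
  ];
  return next.length > max ? next.slice(next.length - max) : next;
}

// The note a search tool could return instead of silently re-running the query.
function priorSearchNote(prior: SearchAttempt): string {
  const plural = prior.result_count === 1 ? "" : "s";
  return `already tried at ${prior.searched_at}, returned ${prior.result_count} result${plural}`;
}
```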

Context isolation between sessions

Cross-session context leakage is a serious security issue in multi-tenant MCP servers. If session A's context is readable from session B, an attacker who controls session B can exfiltrate session A's data — documents fetched, queries made, user preferences stored. Strict isolation is non-negotiable.

// Safe context key design — session_id must be cryptographically unguessable
import { randomBytes } from 'node:crypto';

// Generate session IDs from 32 random bytes (64 hex chars), not sequential integers
export function generateSessionId(): string {
  return randomBytes(32).toString('hex');
}

// All context reads and writes go through this wrapper — never bypass it
export function getContextForSession(
  sessionId: string,
  requestingSessionId: string  // must match sessionId — enforced here
): ConversationContext {
  if (sessionId !== requestingSessionId) {
    // This should never happen in correct code; log and throw if it does
    throw new Error(
      `Context isolation violation: session ${requestingSessionId} attempted to read context for ${sessionId}`
    );
  }
  return getContext(sessionId);
}

Never derive session IDs from user-supplied data (usernames, email addresses, sequential counters). Session IDs must be cryptographically random and unguessable, such as a UUID v4 or 32 random bytes encoded as hex. Sequential IDs (session-1, session-2) allow enumeration attacks: an attacker increments the ID to read adjacent sessions' context.
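A format check is cheap insurance before a session ID is used to build a store key. The Redis key builder in this guide accepts only the 64-character hex form. If you also issue UUID v4 IDs, a check that accepts both could look like this. It is a sketch, and it assumes lowercase IDs, which is what Node's `crypto.randomUUID()` and `randomBytes(32).toString('hex')` produce. It validates shape only and says nothing about how random an ID actually is.

```typescript
// 64 lowercase hex chars, i.e. 32 random bytes hex-encoded
const HEX_256_RE = /^[0-9a-f]{64}$/;
// Canonical lowercase UUID v4: version nibble 4, variant nibble 8, 9, a or b
const UUID_V4_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/;

function isValidSessionId(id: string): boolean {
  return HEX_256_RE.test(id) || UUID_V4_RE.test(id);
}
```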

For Redis-backed context stores, build every key from a namespace prefix plus the session ID, so that context keys cannot collide with other data in the same Redis instance:

// redis-context.ts: namespace-prefixed keys prevent cross-session leakage.
// Assumes `redis` is a connected node-redis v4 client (createClient from 'redis').
const CONTEXT_KEY_PREFIX = 'mcp:ctx:v1:';
const CONTEXT_TTL_SECONDS = 30 * 60;  // 30-minute idle TTL

function contextKey(sessionId: string): string {
  // Accept only the 64-hex-char format produced by generateSessionId, which
  // prevents key injection. Widen this check if you issue UUID v4 session IDs.
  if (!/^[0-9a-f]{64}$/.test(sessionId)) {
    throw new Error(`Invalid session ID format: ${sessionId}`);
  }
  return `${CONTEXT_KEY_PREFIX}${sessionId}`;
}

export async function getContextRedis(
  sessionId: string
): Promise<ConversationContext> {
  const key = contextKey(sessionId);
  const raw = await redis.get(key);
  if (!raw) return createFreshContext(sessionId);
  // Slide the idle TTL on reads as well as writes
  await redis.expire(key, CONTEXT_TTL_SECONDS);
  return JSON.parse(raw) as ConversationContext;
}

export async function saveContextRedis(
  sessionId: string,
  context: ConversationContext
): Promise<void> {
  if (context.session_id !== sessionId) {
    throw new Error('session_id mismatch: context.session_id does not match key');
  }
  await redis.set(contextKey(sessionId), JSON.stringify(context), {
    EX: CONTEXT_TTL_SECONDS,  // refreshed on every write
  });
}

Audit all context reads in your structured logs. Log the session_id and the tool name on every context read, at DEBUG level. In a security incident, this log trail allows you to verify whether cross-session access occurred. Related: see MCP server error handling for how to surface context isolation errors without leaking internal details to the agent.
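One way to make the audit trail hard to forget is to wrap the reader itself. The following is a sketch that assumes a logger accepting one JSON line per call. `auditedReader` is a hypothetical helper, and the field names are illustrative.

```typescript
type LogFn = (line: string) => void;

// Wrap a context reader so every read emits one structured DEBUG line with the
// session ID and the tool that asked for it.
function auditedReader<C>(
  reader: (sessionId: string) => C,
  log: LogFn,
): (sessionId: string, tool: string) => C {
  return (sessionId, tool) => {
    log(
      JSON.stringify({
        level: "debug",
        event: "context_read",
        session_id: sessionId,
        tool,
        at: new Date().toISOString(),
      }),
    );
    return reader(sessionId);
  };
}
```

With something like `const readContext = auditedReader(getContext, console.debug);`, handlers call `readContext(sessionId, 'fetch_document')` instead of `getContext`, so no read path can skip the log line.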

Eviction and cleanup strategies

A context store without aggressive eviction is a slow memory leak. Sessions accumulate, idle context is never freed, and the server's memory footprint grows until it is restarted or OOM-killed. The four complementary eviction strategies:

Idle TTL. Evict sessions that have not been accessed for N minutes. 30 minutes is a reasonable default for interactive sessions; 5 minutes for automated agent pipelines that complete quickly.
LRU eviction. When the store exceeds a maximum session count, evict the least-recently-used session. This is a safety valve for pathological cases where many short sessions accumulate faster than TTL eviction can clear them.
Explicit context_clear. Expose a context_clear tool that the agent can call to explicitly reset its context. Useful between distinct tasks in a long-running session where the agent wants to start fresh.
Memory pressure eviction. Monitor process heap usage and trigger an LRU sweep when heap exceeds a high-water mark. This prevents OOM kills without requiring a predefined session limit.

// lru-context-store.ts — LRU map with idle TTL and memory pressure eviction
class LRUContextStore {
  private store = new Map<string, StoredContext>();
  private readonly maxSessions: number;
  private readonly idleTtlMs: number;
  private readonly heapHighWaterMark: number;  // bytes

  constructor(opts: {
    maxSessions?: number;
    idleTtlMs?: number;
    heapHighWaterMark?: number;
  } = {}) {
    this.maxSessions = opts.maxSessions ?? 10_000;
    this.idleTtlMs = opts.idleTtlMs ?? 30 * 60 * 1000;
    this.heapHighWaterMark = opts.heapHighWaterMark ?? 512 * 1024 * 1024;  // 512 MB
  }

  get(sessionId: string): ConversationContext {
    const entry = this.store.get(sessionId);
    if (!entry) {
      const ctx = createFreshContext(sessionId);
      this.set(sessionId, ctx);
      return ctx;
    }
    // Move to end (most recently used) by re-inserting
    this.store.delete(sessionId);
    entry.lastAccessedAt = Date.now();
    this.store.set(sessionId, entry);
    return entry.context;
  }

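  set(sessionId: string, context: ConversationContext): void {
    // Delete first: Map.set on an existing key keeps its old position, so an
    // update would otherwise stay at the least-recently-used end. This also
    // avoids evicting another session when merely updating this one.
    this.store.delete(sessionId);
    if (this.store.size >= this.maxSessions) {
      // Evict the first (least-recently-used) entry
      const lruKey = this.store.keys().next().value;
      if (lruKey !== undefined) this.store.delete(lruKey);
    }
    this.store.set(sessionId, { context, lastAccessedAt: Date.now() });
  }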

  clear(sessionId: string): void {
    this.store.delete(sessionId);
  }

  evictIdle(): number {
    const now = Date.now();
    let evicted = 0;
    for (const [id, entry] of this.store.entries()) {
      if (now - entry.lastAccessedAt > this.idleTtlMs) {
        this.store.delete(id);
        evicted++;
      }
    }
    return evicted;
  }

  evictUnderMemoryPressure(): number {
    const heap = process.memoryUsage().heapUsed;
    if (heap < this.heapHighWaterMark) return 0;

    // Evict the oldest 20% of sessions
    const toEvict = Math.ceil(this.store.size * 0.2);
    let evicted = 0;
    for (const key of this.store.keys()) {
      if (evicted >= toEvict) break;
      this.store.delete(key);
      evicted++;
    }
    return evicted;
  }

  stats() {
    return {
      session_count: this.store.size,
      max_sessions: this.maxSessions,
      heap_used_mb: Math.round(process.memoryUsage().heapUsed / 1024 / 1024),
    };
  }
}

export const contextStore = new LRUContextStore();

// Background eviction
setInterval(() => {
  contextStore.evictIdle();
  contextStore.evictUnderMemoryPressure();
}, 60_000);

Expose a context_clear tool so the orchestrating agent can reset its server-side context explicitly. This is useful at task boundaries — the agent has finished one task and is starting another, and wants to ensure that fetched resources and tool call history from the previous task do not influence the new one:

server.tool(
  'context_clear',
  'Clear all server-side conversation context for this session',
  {},
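  async (_args, extra) => {
    // extra.sessionId is undefined on stdio, where one process serves one session
    const sessionId = extra.sessionId ?? 'stdio';
    contextStore.clear(sessionId);
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          status: 'cleared',
          session_id: sessionId,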
          message: 'Conversation context has been reset. Starting fresh.',
        }),
      }],
    };
  }
);

Monitoring context health with AliveMCP

An MCP protocol probe that confirms initialize and tools/list succeed cannot tell you whether the context store is healthy. A context store failure is often silent. If reads fall back to a fresh context when the store is unavailable or has been wiped, tool calls still succeed (the handler runs). However, every call behaves as if it were the first in the session, with no deduplication, no preferences, and no accumulated state. Users notice this as tools becoming repetitive and unhelpful, not as an obvious error.

Surface context store health in a health_check tool and configure AliveMCP to probe it:

server.tool(
  'health_check',
  'Report server health including context store stats',
  {},
  async () => {
    const stats = contextStore.stats();
    const heapHighWater = 512 * 1024 * 1024;  // keep in sync with LRUContextStore's heapHighWaterMark
    const heapPressure = process.memoryUsage().heapUsed / heapHighWater;

    // For Redis-backed stores: also ping the Redis connection
    let redisOk = true;
    try {
      if (process.env.REDIS_URL) {
        await redis.ping();
      }
    } catch {
      redisOk = false;
    }

    const degraded = !redisOk || heapPressure > 0.9 || stats.session_count >= stats.max_sessions;

    return {
      isError: degraded,
      content: [{
        type: 'text',
        text: JSON.stringify({
          status: degraded ? 'degraded' : 'healthy',
          context_store: {
            session_count: stats.session_count,
            max_sessions: stats.max_sessions,
            heap_used_mb: stats.heap_used_mb,
            heap_pressure_pct: Math.round(heapPressure * 100),
            redis_ok: redisOk,
          },
          timestamp: new Date().toISOString(),
        }, null, 2),
      }],
    };
  }
);

Configure AliveMCP to call health_check every minute. Set alert conditions for: Redis unreachable (redis_ok: false), heap pressure above 90%, and session count at the maximum (indicating that LRU eviction is actively occurring — live sessions may be getting evicted). These three conditions represent progressive degradation of context quality before it becomes a complete failure.
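The three alert conditions can also be written as a pure function over the health_check JSON. This is useful for unit-testing the thresholds or for running a self-hosted check alongside AliveMCP. It is a sketch: the reason strings are made up, and it does not show how AliveMCP itself expresses alert rules.

```typescript
// Shape of the health_check payload (heap_used_mb and timestamp omitted)
interface HealthPayload {
  status: "healthy" | "degraded";
  context_store: {
    session_count: number;
    max_sessions: number;
    heap_pressure_pct: number;
    redis_ok: boolean;
  };
}

// Mirrors the three conditions in the text. heap_pressure_pct is rounded by the
// tool, so this boundary can differ slightly from the tool's own 0.9 check.
function alertReasons(payload: HealthPayload): string[] {
  const cs = payload.context_store;
  const reasons: string[] = [];
  if (!cs.redis_ok) reasons.push("redis_unreachable");
  if (cs.heap_pressure_pct > 90) reasons.push("heap_pressure_above_90_pct");
  if (cs.session_count >= cs.max_sessions) reasons.push("session_cap_reached");
  return reasons;
}
```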

Add context store metrics to your metrics pipeline: emit context_store_sessions (gauge), context_store_evictions_total (counter, by reason: idle/lru/memory_pressure), and context_read_latency_ms (histogram). Correlate eviction rate spikes with latency increases — when sessions are being evicted under memory pressure, the next call from an evicted session pays the cost of rebuilding context from scratch. See also MCP server observability for how to wire these metrics into a unified observability stack alongside your other MCP server signals.
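If you do not already run a metrics client, you can prototype the three signals in-process. The sketch below is a hypothetical stand-in, not any particular library's API. The eviction counter can be fed from the counts that the LRU store's sweeps already return, for example `metrics.recordEviction('idle', contextStore.evictIdle())`.

```typescript
type EvictionReason = "idle" | "lru" | "memory_pressure";

// Minimal in-process stand-ins for the three metrics named above. In
// production these would be registered with your metrics client instead.
class ContextStoreMetrics {
  sessions = 0; // context_store_sessions (gauge): set after each sweep
  readonly evictionsTotal: Record<EvictionReason, number> = {
    idle: 0,
    lru: 0,
    memory_pressure: 0,
  }; // context_store_evictions_total (counter, labelled by reason)
  private readonly readLatencyMs: number[] = []; // context_read_latency_ms samples

  recordEviction(reason: EvictionReason, count = 1): void {
    this.evictionsTotal[reason] += count;
  }

  observeReadLatency(ms: number): void {
    this.readLatencyMs.push(ms);
  }

  // Time a synchronous context read. Date.now() has millisecond resolution;
  // prefer a high-resolution clock where one is available.
  timeRead<T>(read: () => T): T {
    const start = Date.now();
    try {
      return read();
    } finally {
      this.observeReadLatency(Date.now() - start);
    }
  }

  // Cumulative bucket counts (each bucket counts samples <= its bound),
  // followed by a final catch-all bucket holding the total sample count.
  histogram(upperBoundsMs: number[]): number[] {
    const counts = upperBoundsMs.map(
      (bound) => this.readLatencyMs.filter((v) => v <= bound).length,
    );
    counts.push(this.readLatencyMs.length);
    return counts;
  }
}
```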

For long-running context that must survive server restarts — agent workflows that span hours or days — combine the in-memory store with a periodic flush to SQLite or Redis. Flush every 5 minutes in the background; on startup, reload all non-expired contexts from the durable store. This gives you the low latency of an in-memory read with the durability of a persistent store, at the cost of up to 5 minutes of context if the server crashes between flushes. Related: for context that needs to be passed between different server instances, see MCP server agent handoff.
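The flush-and-reload cycle can be sketched against the same map shape the in-memory store uses, a session ID mapped to `{ context, lastAccessedAt }`. `DurableStore` is a hypothetical interface standing in for a SQLite table or a Redis hash, and `writeAll` is assumed to replace the previous snapshot wholesale. In this sketch, `flushContexts` would run on a 5-minute timer and `reloadContexts` once at startup.

```typescript
interface Snapshot<C> {
  sessionId: string;
  context: C;
  lastAccessedAt: number; // ms since epoch
}

// Hypothetical durable backend: only these two operations are assumed.
interface DurableStore<C> {
  writeAll(snapshots: Snapshot<C>[]): Promise<void>; // replaces the previous snapshot
  readAll(): Promise<Snapshot<C>[]>;
}

type MemoryMap<C> = Map<string, { context: C; lastAccessedAt: number }>;

// Copy every in-memory session to durable storage; returns how many were written.
async function flushContexts<C>(memory: MemoryMap<C>, durable: DurableStore<C>): Promise<number> {
  const snapshots: Snapshot<C>[] = [];
  memory.forEach((entry, sessionId) => {
    snapshots.push({ sessionId, context: entry.context, lastAccessedAt: entry.lastAccessedAt });
  });
  await durable.writeAll(snapshots);
  return snapshots.length;
}

// On startup, restore only sessions that are still within the idle TTL.
async function reloadContexts<C>(
  memory: MemoryMap<C>,
  durable: DurableStore<C>,
  idleTtlMs: number,
  now: number = Date.now(),
): Promise<number> {
  let restored = 0;
  for (const snap of await durable.readAll()) {
    if (now - snap.lastAccessedAt <= idleTtlMs) {
      memory.set(snap.sessionId, { context: snap.context, lastAccessedAt: snap.lastAccessedAt });
      restored++;
    }
  }
  return restored;
}
```

Restoring the original lastAccessedAt, rather than the reload time, means a restart does not reset any session's idle clock.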