Production resilience guide · 2026-06-10 · Agent-scale MCP servers
MCP Server Production Resilience: Six Patterns for Agent-Scale Traffic
When developers first build MCP servers, they test them with single sequential tool calls and a fixed schema. Production looks different: an orchestrating agent calls three tools in parallel, retries on timeout, caches the tool schema for hours, and may be running dozens of instances simultaneously against the same server. The six failure modes that emerge from this gap — duplicate side effects from retried calls, database exhaustion from parallel N+1 queries, agent fleets stuck on a stale tool schema, a bad deploy amplified by silent agent retries, cascading failure when a dependency hangs, and event-loop saturation from fan-out parallelism — each have a corresponding production pattern. This post covers all six as an operational checklist for MCP servers that have moved beyond hobby traffic.
TL;DR
- Idempotency: attach an Idempotency-Key header to every tool call with side effects. Store the result in Redis with a TTL sized to the operation type (1h for interactive calls, 24h for automated, 7d for batch). Return the cached result for duplicate keys instead of re-executing. See the idempotency deep-dive for the in_flight/complete state machine that blocks concurrent duplicates.
- Backpressure: wrap tool handlers in a semaphore (p-limit or a BoundedSemaphore). Reject with HTTP 503 + Retry-After when the queue depth exceeds maxQueue. Set per-client and global concurrency limits independently. See the backpressure guide for the per-client LRU semaphore map and Prometheus queue-depth gauge.
- Schema evolution: only make additive changes (add optional parameters, expand enums, widen constraints). For any breaking change, run dual-accept in the handler while you migrate the agent fleet, then remove the old form after 30 zero-call days. See the schema evolution guide for the safe vs. breaking change table and the deprecation audit log query.
- Canary deployment: route 5% of traffic to the new version by hashing on a stable client or session key. Track per-version error rates in Prometheus. Promote (5% → 25% → 50% → 100%) only after a clean hold at each gate (30 minutes at 5%, 1 hour at 25% and at 50%); auto-roll back if error rate exceeds 2× stable for 5 minutes. See the canary deployment guide for nginx and Caddy split-traffic configurations and SSE session affinity.
- Graceful degradation: return partial results rather than total failure when a dependency is slow or down. Use a five-tier response model (full → stale cache → partial enrichment → IDs only → informative error). Return a _meta.degraded flag in the tool response so agents can decide how to proceed. See the graceful degradation guide for the Promise.race() stale-cache pattern and health check conventions.
- Request batching: use DataLoader to coalesce concurrent per-item database queries into a single batch query. Scope the loader per-request (attach to Express req) to share deduplication across parallel tool calls in the same HTTP request. See the request batching guide for the DataLoader per-request scope pattern and the N+1 diagnostic query.
Why agent traffic is unlike HTTP API traffic
Each of these six patterns has analogues in conventional HTTP API development — idempotency keys exist in payment APIs, backpressure exists in message queues, canary deployments exist in any CI/CD pipeline. But they become uniquely critical for MCP servers because of how LLM agents generate traffic:
| Agent behavior | How it differs from browser/mobile clients | Failure it creates |
|---|---|---|
| Retry on timeout or error | Agents retry silently, often 3–5 times, without user awareness | Duplicate side effects — N emails sent, N payments charged, N rows inserted |
| Call tools in parallel | LLMs fan out to multiple tool calls in a single inference step | N+1 query explosion — 10 parallel get_order calls = 10 SELECT queries |
| Cache tool schemas | Agents read tools/list once; prompt cache TTL is 5+ minutes | Schema drift — agent fleet runs old schema for hours after a breaking deploy |
| Run autonomously at scale | Enterprise deployments run 50–200 agent instances against the same server | Traffic spikes at concurrency × parallelism factor, overwhelming database pools |
| Retry silently on bad deploy | Agents don't distinguish between a 503 and a 500; both trigger retry | A broken deploy is amplified 5× before the on-call team notices |
| Continue through partial failure | Agents have no fallback logic for malformed responses; they stall | One slow dependency freezes the entire agent pipeline |
The interaction between these behaviors is what makes production MCP servers fail in ways that are hard to debug: a deployment that looks fine in staging (sequential test calls, fixed schema, single instance) falls over in production because 50 agents retried in parallel on a slow database query, each retry creating a new duplicate side effect, while a stale schema in the prompt cache sent every 5th retry to a broken v2 tool name.
Pattern 1: Idempotency — safe retries for tool calls with side effects
Any tool call that creates, modifies, or deletes persistent state — send email, charge payment, insert row, dispatch job — is a side effect. Agent retry loops execute the tool call body multiple times on transient failure. Without idempotency controls, each retry produces a duplicate side effect. The MCP idempotency pattern uses a client-generated key to deduplicate executions at the server.
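// Assumes `redis` is an ioredis-style client (positional 'EX' / 'NX' arguments).
async function withIdempotency(key, ttlSeconds, fn) {
  const existing = await redis.get(`idem:${key}`);
  if (existing) {
    // Replay the stored outcome, whether it was a result or a failure
    const stored = JSON.parse(existing);
    if ('error' in stored) throw new Error(stored.error);
    return stored.result;
  }
  const lockKey = `idem:lock:${key}`;
  const acquired = await redis.set(lockKey, '1', 'EX', 30, 'NX');
  if (!acquired) {
    // Concurrent duplicate in flight: wait, then re-check for its stored outcome
    await new Promise(r => setTimeout(r, 500));
    return withIdempotency(key, ttlSeconds, fn);
  }
  try {
    let outcome;
    try {
      outcome = { result: await fn() };
    } catch (err) {
      outcome = { error: String(err?.message ?? err) };
    }
    // Store failures as well as results so a retry never re-executes fn
    await redis.set(`idem:${key}`, JSON.stringify(outcome), 'EX', ttlSeconds);
    if ('error' in outcome) throw new Error(outcome.error);
    return outcome.result;
  } finally {
    await redis.del(lockKey);
  }
}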
Idempotency keys deduplicate attempts, not outcomes — if the first execution fails, the error is stored and returned for all subsequent attempts with the same key. This is intentional: a failed charge should not be retried indefinitely by an agent loop. The agent receives the stored error and can decide to surface it to the user rather than silently retrying forever. TTL selection determines how long the deduplication window stays open:
| Operation type | Recommended TTL | Rationale |
|---|---|---|
| Interactive (user-triggered) | 1 hour | Agent session is bounded; no retry crosses session boundary |
| Automated (cron/pipeline) | 24 hours | Daily pipelines may retry hours later on transient failure |
| Batch (async job dispatch) | 7 days | Long-running jobs; retries may span multiple processing windows |
| Financial (payment, invoice) | 30 days | Compliance requirement — idempotency records = audit trail |
The idempotency key should be generated by the caller (the orchestrating agent or its wrapper code) before the tool call, not by the server. A UUID generated per logical operation works for interactive calls. For automated pipelines, an operation hash derived from the operation inputs produces deterministic keys that survive process restarts.
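The two key-generation strategies above can be sketched as follows. This is a minimal illustration rather than a prescribed scheme: the function names (canonicalize, interactiveKey, pipelineKey) and the choice of SHA-256 are assumptions made for the example.

```javascript
// CommonJS form; in an ES module use:
//   import { createHash, randomUUID } from 'node:crypto';
const { createHash, randomUUID } = require('node:crypto');

// Serialize with object keys sorted, so { a: 1, b: 2 } and { b: 2, a: 1 }
// produce the same string and therefore the same key.
function canonicalize(value) {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Interactive calls: one random key per logical operation.
// Generate it once and reuse the same value on every retry.
function interactiveKey() {
  return randomUUID();
}

// Automated pipelines: a deterministic key derived from the tool name and
// inputs, so a restarted process computes the same key again.
function pipelineKey(toolName, inputs) {
  return createHash('sha256')
    .update(`${toolName}:${canonicalize(inputs)}`)
    .digest('hex');
}
```

One design caveat with input-derived keys: two intentionally separate operations with identical inputs collapse into one. Including a run date or pipeline run identifier in the inputs keeps them distinct.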
Pattern 2: Backpressure — bound concurrency before the database pays for it
Agent parallelism translates directly into database connection pool pressure. An agent that fans out to 20 parallel tool calls in a single inference step creates 20 concurrent handler executions. Each handler that touches the database holds a connection for the duration of the query. A database pool of 10 connections with 20 concurrent handlers means 10 handlers waiting for connections — and if those waiting handlers hold HTTP connections that the agent is polling, the agent may time out and retry, adding 20 more handlers to the queue.
The MCP backpressure pattern uses a bounded semaphore that rejects excess load rather than queuing it indefinitely:
class BoundedSemaphore {
  constructor(maxConcurrent, maxQueue) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.active = 0;
    this.queue = [];
  }
  async acquire() {
    if (this.active < this.maxConcurrent) {
      this.active++;
      return;
    }
    if (this.queue.length >= this.maxQueue) {
      const err = new Error('Queue full');
      err.status = 503;
      err.retryAfter = 5;
      throw err;
    }
    // Wait for release() to hand over a slot; `active` already counts it
    await new Promise((resolve) => this.queue.push(resolve));
  }
  release() {
    const next = this.queue.shift();
    if (next) {
      // Transfer the slot directly so a new caller cannot take it first
      next();
    } else {
      this.active--;
    }
  }
}
Rejecting with HTTP 503 + Retry-After: 5 rather than queuing indefinitely is a deliberate design choice. An unbounded queue keeps the agent waiting but provides no signal that the server is under pressure. A 503 with a retry hint tells the agent to back off. Well-designed agent frameworks respond to 503 with exponential backoff, which reduces inbound pressure exactly when the server needs relief. The reject-rather-than-queue approach is what transforms a positive feedback loop (more retries → more pressure → more retries) into a negative one (pressure → rejection → backoff → pressure decreases).
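On the agent side, the backoff behaviour described above might look like the sketch below. The function name retryDelayMs and the constants (500 ms base, 30 s cap) are illustrative assumptions, not values from any particular agent framework.

```javascript
// Compute how long to wait before retry number `attempt` (0-based).
// The server's Retry-After hint acts as a floor; the exponential term
// lengthens the wait when rejections repeat. All constants are illustrative.
function retryDelayMs(attempt, retryAfterSeconds = 0, baseMs = 500, capMs = 30000) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  const floor = retryAfterSeconds * 1000;
  return Math.max(floor, exponential);
}
```

With Retry-After: 5, the first retries wait 5 seconds and later ones grow toward the cap. In a fleet of agents it is usually worth adding random jitter on top, so that instances rejected at the same moment do not all retry at the same moment.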
Layer a per-client semaphore (using an LRU map keyed on actor.id) over the global semaphore so a single agent cannot consume all available capacity. The per-client limit (10 concurrent) prevents monopolization while the global limit (50 concurrent) prevents total server saturation. A client hitting its per-client limit gets 429; a client hitting the global limit gets 503 — different codes so the agent can distinguish "slow down" from "server is full".
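The two-layer check can be sketched as follows. This is a simplified, non-queuing illustration: Limiter, ClientLimiters and admit are hypothetical names, the limits of 10 and 50 come from the text, and the cap of 1000 tracked clients is an assumption.

```javascript
// Minimal non-queuing limiter: tryAcquire() returns false instead of waiting.
class Limiter {
  constructor(max) { this.max = max; this.active = 0; }
  tryAcquire() {
    if (this.active >= this.max) return false;
    this.active++;
    return true;
  }
  release() { this.active--; }
}

// A Map iterates in insertion order, so re-inserting on access gives
// simple least-recently-used ordering.
class ClientLimiters {
  constructor(perClientMax, maxClients) {
    this.perClientMax = perClientMax;
    this.maxClients = maxClients;
    this.map = new Map();
  }
  get(clientId) {
    let limiter = this.map.get(clientId);
    if (limiter) this.map.delete(clientId);        // refresh recency
    else limiter = new Limiter(this.perClientMax);
    this.map.set(clientId, limiter);
    if (this.map.size > this.maxClients) {
      // Evict the least recently used client that has nothing in flight
      for (const [id, l] of this.map) {
        if (l.active === 0 && id !== clientId) { this.map.delete(id); break; }
      }
    }
    return limiter;
  }
}

const globalLimiter = new Limiter(50);
const clientLimiters = new ClientLimiters(10, 1000);

// Returns { status } on rejection, or { release } on admission.
function admit(clientId) {
  const client = clientLimiters.get(clientId);
  if (!client.tryAcquire()) return { status: 429 };    // this client must slow down
  if (!globalLimiter.tryAcquire()) {
    client.release();                                  // undo the per-client slot
    return { status: 503, retryAfter: 5 };             // the whole server is full
  }
  return {
    release() { globalLimiter.release(); client.release(); },
  };
}
```

A handler would call admit(actor.id) before doing any work and call release() in a finally block. Checking the per-client limit first means a noisy client sees 429 even when the server as a whole is also full.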
Pattern 3: Schema evolution — safe changes across a deployed agent fleet
MCP server tool schemas are not like REST API response schemas. A REST client is code you control; when you ship a breaking API change, you update the client in the same deploy. An MCP tool schema is consumed by an LLM at inference time, often from a prompt cache with a 5-minute TTL. In enterprise deployments, 50 agent instances may be running the old schema for up to an hour after you ship a breaking change. The MCP schema evolution pattern prevents downtime during the transition window.
The critical distinction is between additive and breaking changes. Additive changes are safe to ship at any time:
| Change type | Safe to ship immediately? | Why |
|---|---|---|
| Add optional parameter | Yes | Old agents omit it; handler defaults work |
| Expand an enum (add values) | Yes | Old agents never send the new value; no breakage |
| Widen a constraint (maxLength 50 → 200) | Yes | Old agents stay within the old constraint |
| Add required parameter | No | Old agents omit it; handler receives undefined |
| Remove or rename parameter | No | Old agents send the old name; handler ignores it |
| Narrow a constraint (maxLength 200 → 50) | No | Old agents may send values in 51–200 range |
For unavoidable breaking changes, the migration path is: (1) add the new parameter as optional alongside the old one, (2) accept both forms in the handler during the migration window, (3) emit a deprecation warning in the tool response when the old form is used, (4) query the audit log for calls using the old form, (5) remove the old form only after 30 consecutive zero-call days. Never remove a parameter and add its replacement in the same deployment — there is no migration window where both old and new agents are handled correctly.
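// Dual-accept: old `user_id` param and new `actor_id` param
server.setRequestHandler(CallToolRequestSchema, async (req) => {
  // `arguments` is optional in a tools/call request, so default to {}
  const { actor_id, user_id } = req.params.arguments ?? {};
  const resolvedId = actor_id ?? user_id; // accept both during migration
  if (resolvedId === undefined) {
    throw new Error('get_profile requires actor_id (or the deprecated user_id)');
  }
  if (user_id !== undefined) {
    // Old form used: record it so the zero-call window can be measured
    auditLog.warn({ event: 'deprecated_param', param: 'user_id', tool: 'get_profile' });
  }
  return await getProfile(resolvedId);
});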
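Step (5) of the migration path, the 30 consecutive zero-call days, can be checked with a small helper like the one below. It is a sketch under assumptions: the audit entries are taken to carry an ISO `timestamp` field alongside the `event` and `param` fields shown above, and `observedSince` (the moment deprecation logging was switched on) is a parameter introduced for this example so that an empty log is not mistaken for a long quiet period.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Whole days since the deprecated parameter was last seen, or since
// logging began if it has not been seen at all.
function zeroCallDays(events, param, observedSince, now = new Date()) {
  const lastSeen = events
    .filter((e) => e.event === 'deprecated_param' && e.param === param)
    .reduce((latest, e) => Math.max(latest, Date.parse(e.timestamp)), observedSince.getTime());
  return Math.floor((now.getTime() - lastSeen) / DAY_MS);
}

// True once the quiet window is at least `requiredDays` long.
function safeToRemove(events, param, observedSince, now = new Date(), requiredDays = 30) {
  return zeroCallDays(events, param, observedSince, now) >= requiredDays;
}
```

Any call that uses the old form resets the window, which is what "consecutive" requires.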
Pattern 4: Canary deployment — blast-radius limiting for MCP server releases
MCP servers are unusually sensitive to bad deploys. An agent that retries silently on error turns a 2% error rate in canary into 10% apparent failure rate before the alert fires — because each error becomes 3–5 retries that all hit the canary shard. The MCP canary deployment pattern uses traffic splitting with per-version Prometheus labels to detect problems before they reach full traffic.
The routing key for splitting must be deterministic: the same agent session must consistently hit the same backend, otherwise a session that starts on stable ends up on canary mid-conversation and vice versa. Hash on a value that stays constant for the session, such as remote_addr (or the MCP session ID for SSE transports). Do not include nginx's request_id in the key: nginx generates a new random value for every request, so a key that contains it splits traffic per request rather than per session.
# nginx split_clients: deterministic 5/95 split keyed on the client address
split_clients "${remote_addr}" $upstream {
    5%  mcp_canary;
    *   mcp_stable;
}
upstream mcp_canary { server 127.0.0.1:3001; }
upstream mcp_stable { server 127.0.0.1:3000; }
Tag every metric with a version label so you can compare error rates between canary and stable in the same Prometheus query:
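# Canary error fraction divided by stable error fraction; sum() drops the version label so the two sides match.
# mcp_tool_calls_total (all calls, per version) is an assumed counter name.
(sum(rate(mcp_tool_call_errors_total{version="canary"}[5m])) / sum(rate(mcp_tool_calls_total{version="canary"}[5m])))
/
(sum(rate(mcp_tool_call_errors_total{version="stable"}[5m])) / sum(rate(mcp_tool_calls_total{version="stable"}[5m])))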
The four-gate promotion schedule and rollback thresholds:
| Gate | Traffic % | Minimum hold time | Auto-rollback trigger |
|---|---|---|---|
| 1 — Initial | 5% | 30 minutes | Error rate >2× stable for 5 minutes |
| 2 — Expand | 25% | 1 hour | P99 latency >3× stable for 5 minutes |
| 3 — Half | 50% | 1 hour | Schema validation errors >0.1% |
| 4 — Full | 100% | — | Any crash or unhandled promise rejection |
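The gate table translates fairly directly into a decision function. The sketch below assumes a hypothetical metrics object: errorRatio and p99Ratio are canary-to-stable ratios, breachMinutes is how long the current gate's trigger condition has held, cleanMinutes is time at the current gate without a breach, and schemaErrorFraction and crashes are as named. It applies each trigger only at the gate where the table lists it; applying every trigger at every gate would be a stricter variant.

```javascript
// One entry per row of the promotion table. holdMinutes: null marks the final gate.
const GATES = [
  { percent: 5,   holdMinutes: 30,   rollback: (m) => m.errorRatio > 2 && m.breachMinutes >= 5 },
  { percent: 25,  holdMinutes: 60,   rollback: (m) => m.p99Ratio > 3 && m.breachMinutes >= 5 },
  { percent: 50,  holdMinutes: 60,   rollback: (m) => m.schemaErrorFraction > 0.001 },
  { percent: 100, holdMinutes: null, rollback: (m) => m.crashes > 0 },
];

// Decide what to do at gate `gateIndex` given the current metrics `m`.
function nextAction(gateIndex, m) {
  const gate = GATES[gateIndex];
  if (gate.rollback(m)) return { action: 'rollback', percent: 0 };
  if (gate.holdMinutes !== null && m.cleanMinutes >= gate.holdMinutes) {
    return { action: 'promote', percent: GATES[gateIndex + 1].percent };
  }
  return { action: 'hold', percent: gate.percent };
}
```

Running this on a timer against the per-version metrics gives the promote, hold, or roll-back decision for the current gate.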
For SSE transports, apply session affinity: hash on the Mcp-Session-Id header value so long-lived streaming connections are not interrupted by traffic redistribution during a promotion step. Clients that connected to stable continue on stable; new clients are distributed according to the current split.
Pattern 5: Graceful degradation — partial responses under dependency failure
MCP tool calls are often composite — a single get_enriched_profile call might hit a primary database, a CRM, and a billing service. If the billing service is slow, returning a 500 fails the entire tool call and stalls the agent pipeline. The MCP graceful degradation pattern returns partial results with a _meta.degraded flag rather than total failure.
The five-tier response model defines a priority ordering for degraded responses:
| Tier | What you return | When to use |
|---|---|---|
| 1 — Full | All enriched data from all sources | All dependencies healthy |
| 2 — Stale cache | Previously-fetched data from Redis, with stale: true | Live fetch timed out; stale data is better than nothing |
| 3 — Partial enrichment | Primary data + some enrichments, skipped: ["billing"] | Non-critical enrichment services unavailable |
| 4 — IDs only | Just the entity identifiers | Primary store slow; agent can re-fetch individually later |
| 5 — Informative error | { error: "service unavailable", retryAfterSeconds: 30 } | All sources down; give agent a retry hint |
The stale-cache pattern uses a short TTL "freshness" key and a long TTL "stale" key together with a Promise.race() timeout:
async function getWithStaleFallback(key, fetchFn) {
  const fresh = await redis.get(`fresh:${key}`);
  if (fresh) return { ...JSON.parse(fresh), stale: false };
  const stale = await redis.get(`stale:${key}`); // long-TTL fallback
  const live = fetchFn()
    .then(async (data) => {
      const json = JSON.stringify(data);
      // Refresh both keys; a cache write failure must not fail the call
      await Promise.all([
        redis.set(`fresh:${key}`, json, 'EX', 30),
        redis.set(`stale:${key}`, json, 'EX', 3600),
      ]).catch(() => {});
      return { ...data, stale: false };
    })
    .catch(() => null); // a failed fetch falls through to the stale copy
  let timer;
  const timeout = new Promise(r => { timer = setTimeout(() => r(null), 2000); }); // 2s timeout
  const liveResult = await Promise.race([live, timeout]);
  clearTimeout(timer);
  if (liveResult) return liveResult;
  if (stale) return { ...JSON.parse(stale), stale: true };
  return null; // tier 5: no data at all
}
Agents consuming degraded responses need a signal to act on. The _meta convention in the tool response body carries this signal without changing the tool's primary schema: _meta: { degraded: true, degradationReason: "billing_service_timeout", cachedAt: "2026-06-10T14:23:00Z", skipped: ["billing"], retryAfterSeconds: 30 }. An agent that reads degraded: true can surface a "data may be outdated" note to the user, retry after retryAfterSeconds, or proceed with the available data — rather than stalling and waiting for a full retry on the same blocked dependency.
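A tier 3 response with the _meta signal could be assembled along these lines. This is a sketch, not a reference implementation: withTimeout, getEnrichedProfile, fetchPrimary and the enrichers map are hypothetical names, the degradationReason format is simplified, and the 2-second timeout and 30-second retry hint reuse the values quoted in this section.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `fetchPrimary` loads the required data; `enrichers` maps a source name
// (for example "billing") to an async function that loads optional data.
async function getEnrichedProfile(id, fetchPrimary, enrichers, timeoutMs = 2000) {
  const result = { ...(await fetchPrimary(id)) };   // primary data is mandatory
  const names = Object.keys(enrichers);
  const settled = await Promise.allSettled(
    names.map((name) =>
      // Promise.resolve().then(...) also captures synchronous throws
      withTimeout(Promise.resolve().then(() => enrichers[name](id)), timeoutMs)
    )
  );

  const skipped = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') result[names[i]] = outcome.value;
    else skipped.push(names[i]);
  });

  if (skipped.length > 0) {
    result._meta = {
      degraded: true,
      degradationReason: skipped.map((n) => `${n}_unavailable`).join(','),
      skipped,
      retryAfterSeconds: 30,
    };
  }
  return result;
}
```

Promise.allSettled is what keeps one failing enrichment from rejecting the whole call; a healthy response simply carries no _meta block.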
One important configuration detail: a gracefully degrading server should return HTTP 200 with status: "degraded" in its health check response, not HTTP 503. A 503 tells load balancers and uptime monitors (including AliveMCP) that the server is down and should be routed around. A server returning partial results is functioning — it is a degraded state, not a down state, and should not be taken out of rotation.
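The health check convention can be expressed as a small pure function. The sketch below assumes a hypothetical checks object that maps each dependency name to { ok, critical }; the split into critical and non-critical dependencies is an assumption added for the example, as is the "down" status string.

```javascript
// Translate dependency checks into an HTTP status code and response body.
// Only a failing critical dependency produces 503; anything else stays 200.
function healthResponse(checks) {
  const failing = Object.entries(checks)
    .filter(([, c]) => !c.ok)
    .map(([name, c]) => ({ name, critical: Boolean(c.critical) }));
  const names = failing.map((f) => f.name);

  if (failing.some((f) => f.critical)) {
    return { statusCode: 503, body: { status: 'down', failing: names } };
  }
  return {
    statusCode: 200,
    body: { status: failing.length > 0 ? 'degraded' : 'ok', failing: names },
  };
}
```

An HTTP handler would send statusCode and the JSON body as-is, so a load balancer keeps the instance in rotation while dashboards still see which dependency is failing.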
Pattern 6: Request batching — eliminating the N+1 query problem
The N+1 query problem is familiar from GraphQL, but MCP servers encounter it in a different form. An LLM agent making 10 parallel get_order_details calls generates 10 concurrent handlers. Each handler fires one SELECT * FROM orders WHERE id = ? query. Ten queries where one would have done the same work. On a database with a 10ms query time and 10 parallel callers, this is a 10ms operation that takes 10× the connection-pool pressure it should.
The DataLoader pattern coalesces concurrent keys into a single batch query within one Node.js event loop tick:
import DataLoader from 'dataloader';

// Batch function: receives an array of IDs, returns results in same order
async function batchOrders(ids) {
  const rows = await db.all(
    `SELECT * FROM orders WHERE id IN (${ids.map(() => '?').join(',')})`,
    ids
  );
  const map = new Map(rows.map(r => [r.id, r]));
  return ids.map(id => map.get(id) ?? null); // must preserve input order
}

// Per-request loader: shared across all tool calls in one HTTP request
function createOrderLoader() {
  return new DataLoader(batchOrders, { maxBatchSize: 1000 });
}

// Attach to request context (Express middleware)
app.use((req, res, next) => {
  req.loaders = { orders: createOrderLoader() };
  next();
});
Per-request scoping is the critical design decision. A global loader would accumulate keys across requests — but it would also share the deduplication cache between different users' requests, creating a privacy boundary violation. A per-request loader is created fresh per HTTP request, shared across all parallel tool calls within that request, and garbage collected when the response is sent. This gives you batching within an agent's parallel tool calls without any cross-request contamination.
The DataLoader also handles deduplication within a request: if two parallel tool calls both need the same order ID, they share a single Promise and the batch function receives the ID only once. A request that triggers 10 parallel get_order_details calls for 10 distinct orders plus 3 repeat lookups for the same order becomes 1 batch query for the 10 distinct IDs — not 13 queries.
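To make the coalescing and deduplication mechanics concrete, here is a deliberately tiny stand-in for DataLoader. It is not the library's implementation: TinyLoader is a hypothetical class, and it schedules dispatch with queueMicrotask, which is a simplification of how the real library defers its batch.

```javascript
// Collects keys requested during the current synchronous run and resolves
// them all from a single call to batchFn.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn;
    this.cache = new Map();   // key -> Promise, shared by repeat lookups
    this.pending = [];        // keys waiting for the next dispatch
  }

  load(key) {
    if (this.cache.has(key)) return this.cache.get(key);   // dedupe repeats
    const promise = new Promise((resolve, reject) => {
      if (this.pending.length === 0) {
        // First key of this run: schedule exactly one dispatch
        queueMicrotask(() => this.dispatch());
      }
      this.pending.push({ key, resolve, reject });
    });
    this.cache.set(key, promise);
    return promise;
  }

  async dispatch() {
    const batch = this.pending;
    this.pending = [];
    try {
      // batchFn must return one value per key, in the same order
      const values = await this.batchFn(batch.map((p) => p.key));
      batch.forEach((p, i) => p.resolve(values[i]));
    } catch (err) {
      batch.forEach((p) => p.reject(err));
    }
  }
}
```

Calling load() for 10 distinct IDs plus 3 repeats inside one Promise.all results in a single batchFn call that receives 10 keys, matching the 13-lookups-to-1-query example above. The length of each array passed to batchFn is the value a batch-size histogram would record.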
The diagnostic for an N+1 problem is simple: add a DataLoader batch size histogram (mcp_dataloader_batch_size_bucket). A healthy distribution is bimodal: many batches of size 1 (single-order lookups) and a peak at your agent's typical parallelism factor (5–20). A distribution that is entirely size-1 despite parallel tool calls signals a scoping bug: the loader is probably being created inside each tool handler instead of once per request, or each lookup is awaited before the next one is issued, so keys never accumulate within a single tick.
How the six patterns compose
Each pattern addresses a distinct failure mode, but they interact in ways that are worth mapping:
| Pattern | Works with | Interaction |
|---|---|---|
| Idempotency | Backpressure | A 503 from backpressure triggers an agent retry; idempotency ensures the retry is safe |
| Backpressure | Request batching | Batching reduces concurrent database connections, raising the effective concurrency limit before backpressure kicks in |
| Schema evolution | Canary deployment | Never ship a breaking schema change in a canary deploy — old agents will hit the new schema on their 5% shard. Always make the change additive before promoting to any traffic split. |
| Canary deployment | Graceful degradation | Add the canary shard to your AliveMCP monitoring as a separate endpoint — the external probe detects infrastructure failures in the canary that in-process metrics cannot surface |
| Graceful degradation | Idempotency | A degraded response with retryAfterSeconds instructs the agent to retry; the idempotency key ensures the retry doesn't double-execute any side effects that completed before the degradation occurred |
| Request batching | Idempotency | DataLoader deduplicates read calls within a request. Idempotency keys deduplicate write calls across requests. Use both independently — they operate at different scopes. |
The layering recommendation: add request batching first (biggest performance gain, zero user-facing behavior change), then backpressure (prevents pool exhaustion under fan-out load), then idempotency for any tool with side effects, then graceful degradation for tools with multiple dependencies, then canary deployment once you have enough production traffic to make the split meaningful, and finally schema evolution discipline as a permanent practice rather than a one-time addition — every schema change goes through the additive-first review.
What external monitoring sees that you cannot
All six patterns are in-process: they run inside the server and depend on the server being healthy enough to execute them. A backpressure semaphore cannot reject traffic if the Node process has crashed. Idempotency key lookups cannot succeed if Redis is unreachable. Graceful degradation code cannot return a partial response if the event loop is blocked by a synchronous CPU-intensive operation. Schema evolution handling cannot serve any traffic if the deployment failed partway through and left the server in a mixed state.
This is the gap that AliveMCP fills. The 60-second external probe sends a full MCP initialize handshake from outside the server — not an HTTP healthcheck endpoint inside it — and verifies that the server can successfully negotiate the protocol, list its tools, and return a valid response. That probe catches infrastructure-layer failures (crashed process, expired TLS certificate, overloaded reverse proxy, botched Caddy reload after a canary promotion) that none of the six in-process patterns can detect. The combination of in-process resilience patterns and out-of-process protocol probing covers the failure surface that neither covers alone.
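For readers who want to see what the first step of such a handshake looks like on the wire, the sketch below builds a JSON-RPC initialize request and validates the shape of the reply. It illustrates the MCP message format only and is not AliveMCP's implementation; the function names, the clientInfo values and the chosen protocolVersion string are assumptions for the example.

```javascript
// Build the JSON-RPC request that opens an MCP session.
function buildInitializeRequest(id = 1) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'initialize',
    params: {
      protocolVersion: '2025-03-26',   // example revision; use the one you target
      capabilities: {},
      clientInfo: { name: 'example-probe', version: '0.0.1' },
    },
  };
}

// Check that a reply is a successful initialize result for request `id`.
function isValidInitializeResponse(response, id = 1) {
  if (!response || response.jsonrpc !== '2.0' || response.id !== id) return false;
  if (response.error) return false;
  const r = response.result;
  return Boolean(
    r &&
    typeof r.protocolVersion === 'string' &&
    r.capabilities && typeof r.capabilities === 'object' &&
    r.serverInfo && typeof r.serverInfo.name === 'string'
  );
}
```

A full probe would POST this request to the server from an outside network, confirm the reply, and then continue with a tools/list request.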
The six patterns in this post, combined with the security hardening guide, cover the two primary operational concerns for production MCP servers: who can call you, what they did, and whether the call was safe (security), and what happens when they call you in ways you didn't design for (resilience). Both are prerequisites for a server that stays healthy when it moves from "working in testing" to "running under production agent workloads".