MCP server graceful degradation
Graceful degradation means returning something useful when a dependency fails rather than hard-failing the entire tool call. When your MCP server's database is slow, serve a stale cached result. When your search index is down, return an empty result with a degraded: true flag rather than a 500 error. When a third-party API is unavailable, skip the enrichment step and return the base data. The agent gets a response it can reason about and continue with, rather than an error that terminates the task. Graceful degradation is what separates a resilient production server from one that fails completely whenever a non-critical dependency has a bad minute.
TL;DR
Define degradation tiers for each tool before writing fallback code. Tier 1: full response from live data. Tier 2: cached response with age metadata. Tier 3: partial response (some enrichments skipped). Tier 4: minimal response (IDs only, no details). Tier 5: informative error (dependency down, try again in N minutes). Implement each tier as an explicit fallback in the tool handler, ordered from best to worst. Return a degraded flag in the response so agents know to treat the result accordingly.
Graceful degradation vs graceful shutdown
These are often confused but address different failure modes:
- Graceful shutdown: the MCP server process itself is stopping. It drains in-flight requests, closes connections cleanly, and then exits. The server is intentionally going away. This is covered in a separate guide.
- Graceful degradation: the MCP server process is running and healthy, but one or more of its dependencies (database, external API, cache, search index) are experiencing failures or elevated latency. The server continues to operate but returns reduced-quality responses.
Graceful shutdown is about the server's own lifecycle. Graceful degradation is about its dependency health. Both are necessary for a production-grade MCP server.
Degradation tier model
Before writing any fallback code, define the degradation tiers for each tool. What is the minimum acceptable response when each dependency fails?
| Tier | State | Response quality | When to use |
|---|---|---|---|
| 1 | Fully operational | Full live data | All dependencies healthy |
| 2 | Database slow/unavailable | Stale cached data with cachedAt timestamp | DB read times out and Redis holds a cached copy |
| 3 | Enrichment service down | Base data without enrichment, enriched: false | Optional third-party API unavailable |
| 4 | Read replica down, primary overloaded | IDs and essential fields only | DB returning data but extremely slow |
| 5 | Primary data source unavailable | Informative error with retry guidance | Nothing can be served safely |
Tier 5 is still better than an unhandled exception: it tells the agent how long to wait before retrying, which the agent can use to schedule a delayed retry rather than hammering the server.
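The ordered fallback described above can be sketched as a small helper that tries each tier from best to worst and returns the first one that succeeds. This is a minimal sketch, not part of any MCP SDK: `runTiers`, `Tier`, and `TierResult` are hypothetical names introduced for illustration, and the retry hint is simply whatever value the caller passes in.

```typescript
// Hypothetical types for illustration; not part of any MCP SDK.
type TierResult<T> = {
  tier: number;      // 1 = full live data, higher = more degraded
  degraded: boolean; // true whenever a fallback tier produced the result
  data?: T;
  error?: { message: string; retryAfterSeconds: number };
};

type Tier<T> = { tier: number; run: () => Promise<T> };

// Try each tier in the order given (best first) and return the first success.
async function runTiers<T>(
  tiers: Tier<T>[],
  retryAfterSeconds: number
): Promise<TierResult<T>> {
  for (const { tier, run } of tiers) {
    try {
      const data = await run();
      return { tier, degraded: tier > 1, data };
    } catch {
      // This tier failed; fall through to the next, more degraded tier.
    }
  }
  // Tier 5: nothing could be served, so return an informative error
  // with retry guidance instead of throwing.
  return {
    tier: 5,
    degraded: true,
    error: { message: 'All data sources unavailable', retryAfterSeconds },
  };
}
```

A tool handler would pass its live read as tier 1, its cache read as tier 2, and so on, then serialize the returned object into the tool response.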
Stale cache fallback
The most common degradation pattern is serving a cached result when the authoritative data source is slow. Use Redis with a short TTL for the "fresh" cache and a longer TTL for the "stale" fallback:
import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

async function withStaleCache<T>(
  key: string,
  fetchFresh: () => Promise<T>,
  options: {
    freshTtlSeconds: number; // e.g. 60: serve from cache for 1 min before re-fetching
    staleTtlSeconds: number; // e.g. 3600: keep stale copy for 1 hour as fallback
    timeoutMs: number;       // e.g. 2000: how long to wait for fresh data
  }
): Promise<{ data: T; stale: boolean; cachedAt: string | null }> {
  const freshKey = `fresh:${key}`;
  const staleKey = `stale:${key}`;
  const metaKey = `meta:${key}`;

  // Try fresh cache first
  const cached = await redis.get(freshKey);
  if (cached) {
    const meta = await redis.get(metaKey);
    return { data: JSON.parse(cached) as T, stale: false, cachedAt: meta };
  }

  // Try to fetch live data with timeout
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const result = await Promise.race([
      fetchFresh(),
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('fetch_timeout')), options.timeoutMs);
      }),
    ]);
    const cachedAt = new Date().toISOString();
    // Update both fresh and stale caches. A failed cache write is ignored
    // so that it never discards a successful live result.
    await Promise.all([
      redis.set(freshKey, JSON.stringify(result), { EX: options.freshTtlSeconds }),
      redis.set(staleKey, JSON.stringify(result), { EX: options.staleTtlSeconds }),
      redis.set(metaKey, cachedAt, { EX: options.staleTtlSeconds }),
    ]).catch(() => undefined);
    return { data: result, stale: false, cachedAt };
  } catch (err) {
    // Live fetch failed or timed out: try stale fallback
    const stale = await redis.get(staleKey);
    const meta = await redis.get(metaKey);
    if (stale) {
      return { data: JSON.parse(stale) as T, stale: true, cachedAt: meta };
    }
    // No cache at all: re-throw
    throw err;
  } finally {
    // Clear the timer so it does not stay pending after the race settles
    clearTimeout(timer);
  }
}
// Tool using stale-cache fallback.
// Assumes `server` is your MCP server instance and `db` is your data layer.
import { z } from 'zod';

server.tool(
  'get_account',
  'Get account details by ID',
  { accountId: z.string() },
  async ({ accountId }) => {
    const { data, stale, cachedAt } = await withStaleCache(
      `account:${accountId}`,
      () => db.accounts.findById(accountId),
      { freshTtlSeconds: 30, staleTtlSeconds: 3600, timeoutMs: 2000 }
    );
    return {
      content: [{
        type: 'text',
        // degraded mirrors stale so agents can check one flag across all tools
        text: JSON.stringify({ ...data, _meta: { degraded: stale, stale, cachedAt } }),
      }],
    };
  }
);
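The TL;DR asks for age metadata on cached responses, and a raw `cachedAt` timestamp leaves the agent to do the date arithmetic. One option is to add a precomputed age next to it. The helper below is a sketch under that assumption: `cacheAge` and its `label` format are hypothetical and not part of the handler above.

```typescript
// Hypothetical helper: turn a cachedAt ISO 8601 timestamp into age metadata.
function cacheAge(
  cachedAt: string | null,
  now: Date = new Date()
): { ageSeconds: number | null; label: string } {
  if (cachedAt === null) return { ageSeconds: null, label: 'age unknown' };
  const cachedMs = Date.parse(cachedAt);
  if (isNaN(cachedMs)) return { ageSeconds: null, label: 'age unknown' };
  // Clamp at zero so clock skew between hosts never yields a negative age.
  const ageSeconds = Math.max(0, Math.floor((now.getTime() - cachedMs) / 1000));
  const label =
    ageSeconds < 60 ? `${ageSeconds}s old` : `${Math.floor(ageSeconds / 60)}m old`;
  return { ageSeconds, label };
}
```

In the `get_account` handler, the result could be spread into `_meta` alongside `stale` and `cachedAt`, so the agent can compare `ageSeconds` against its own freshness requirement.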
Partial response pattern
When a non-critical enrichment service is unavailable, return the base data with the enrichment skipped rather than failing the entire call:
// Assumes `linkedinApi` and `crunchbaseApi` are API clients created elsewhere.
server.tool(
  'get_company',
  'Get company details with optional LinkedIn enrichment and funding data',
  { companyId: z.string() },
  async ({ companyId }) => {
    // Core data is required; failure here is a real error
    const company = await db.companies.findById(companyId);
    if (!company) throw new Error(`Company ${companyId} not found`);

    const enrichments: Record<string, unknown> = {};
    const skipped: string[] = [];

    // LinkedIn enrichment is optional; degrade gracefully if unavailable
    try {
      const linkedin = await linkedinApi.getCompanyProfile(company.domain);
      enrichments.linkedin = linkedin;
    } catch {
      skipped.push('linkedin_profile');
    }

    // Funding data is optional
    try {
      const funding = await crunchbaseApi.getFunding(company.domain);
      enrichments.funding = funding;
    } catch {
      skipped.push('funding_data');
    }

    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          ...company,
          ...enrichments,
          _meta: {
            degraded: skipped.length > 0,
            skipped,
            note: skipped.length > 0
              ? `${skipped.join(', ')} unavailable; base data returned`
              : undefined,
          },
        }),
      }],
    };
  }
);
The agent receives the base company data and can proceed with its task. The _meta.skipped field tells the agent exactly which enrichments were omitted, so it can factor that into its reasoning.
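With more than a couple of enrichments, the repeated try/catch blocks can be factored into one helper that also runs the calls concurrently. The sketch below is one way to do that. `collectEnrichments` and `Enricher` are hypothetical names, and the helper adds no per-call timeout, so a hung enrichment would still block the response unless each enricher enforces its own deadline.

```typescript
// Hypothetical helper: run optional enrichments concurrently and record
// which ones failed, without letting any failure reject the whole call.
type Enricher = () => Promise<unknown>;

async function collectEnrichments(
  enrichers: Record<string, Enricher>
): Promise<{ enrichments: Record<string, unknown>; skipped: string[] }> {
  const enrichments: Record<string, unknown> = {};
  const skipped: string[] = [];
  await Promise.all(
    Object.keys(enrichers).map(async (name) => {
      try {
        enrichments[name] = await enrichers[name]();
      } catch {
        // Degrade: omit this enrichment and keep the rest.
        skipped.push(name);
      }
    })
  );
  return { enrichments, skipped };
}
```

Because every enricher is wrapped in its own try/catch, `Promise.all` never rejects here. Entries in `skipped` appear in the order the failures complete, so sort the array if a stable order matters.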
Signaling degraded state to agents
Agents make better decisions when they know a response is degraded. Establish a consistent _meta convention across all your tools:
interface ResponseMeta {
  degraded?: boolean;          // true if any fallback was used
  degradationReason?: string;  // short reason code, e.g. 'database_slow', 'enrichment_unavailable'
  cachedAt?: string;           // ISO 8601 timestamp of when the cached data was fetched
  stale?: boolean;             // true if served from stale cache
  skipped?: string[];          // list of skipped enrichments/operations
  retryAfterSeconds?: number;  // if degraded: how long to wait before trying again
}
An agent can detect degraded: true and decide whether to: accept the partial result and continue the task, note the limitation in its output to the user, or schedule a retry for operations where fresh data is required.
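The three choices above can be written as a small decision function on the consuming side. This is only a sketch of one possible policy: `decide` and `AgentAction` are hypothetical, the 30-second default retry delay is an arbitrary assumption, and the interface redeclares just the `ResponseMeta` fields the function reads so that the block is self-contained.

```typescript
// Subset of the ResponseMeta fields, redeclared so this block stands alone.
interface ResponseMeta {
  degraded?: boolean;
  stale?: boolean;
  skipped?: string[];
  retryAfterSeconds?: number;
}

type AgentAction =
  | { kind: 'accept' }
  | { kind: 'accept_with_note'; note: string }
  | { kind: 'retry'; afterSeconds: number };

// Hypothetical policy: retry only when the task needs fresh data;
// otherwise accept the result and surface the limitation to the user.
function decide(meta: ResponseMeta | undefined, needsFreshData: boolean): AgentAction {
  if (!meta || !meta.degraded) return { kind: 'accept' };
  if (needsFreshData) {
    // 30 seconds is an assumed default when the server gives no hint.
    return { kind: 'retry', afterSeconds: meta.retryAfterSeconds ?? 30 };
  }
  const parts: string[] = [];
  if (meta.stale) parts.push('data may be out of date');
  if (meta.skipped && meta.skipped.length > 0) {
    parts.push(`missing: ${meta.skipped.join(', ')}`);
  }
  return { kind: 'accept_with_note', note: parts.join('; ') || 'result is degraded' };
}
```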
Circuit breaker integration
Graceful degradation works best when combined with a circuit breaker. When a dependency is consistently failing, the circuit opens and subsequent calls fail fast — returning the stale cache or partial response without waiting for the full timeout on every request:
// Combining a circuit breaker with graceful degradation.
// CircuitBreaker is a minimal assumed interface; adapt it to the breaker library you use.
interface CircuitBreaker {
  execute<T>(fn: () => Promise<T>): Promise<T>;
}

async function callWithFallback<T>(
  circuitBreaker: CircuitBreaker,
  fetchFresh: () => Promise<T>,
  fetchFallback: () => Promise<{ data: T; degraded: true }>
): Promise<{ data: T; degraded: boolean }> {
  try {
    const data = await circuitBreaker.execute(fetchFresh);
    return { data, degraded: false };
  } catch {
    // Circuit is open or live fetch failed: use fallback
    return fetchFallback();
  }
}
The circuit breaker eliminates the timeout wait — once the circuit opens, the fallback is returned immediately rather than after a 2-second timeout on every call. This keeps response time consistent even during extended dependency outages.
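To make the fail-fast behaviour concrete, here is a minimal breaker exposing the `execute` method that `callWithFallback` calls. It is an illustrative sketch rather than a production implementation: `SimpleCircuitBreaker` is a hypothetical class that counts consecutive failures, and in its half-open phase it does not limit how many concurrent trial calls get through. The `now` parameter exists only so that time can be controlled in tests.

```typescript
// Illustrative only: opens after N consecutive failures, fails fast while
// open, and lets calls through again once the reset window has elapsed.
class SimpleCircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    resetAfterMs: number,
    now: () => number = () => Date.now()
  ) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null && this.now() - this.openedAt < this.resetAfterMs) {
      // Open: reject immediately without touching the dependency.
      throw new Error('circuit_open');
    }
    try {
      const result = await fn();
      // Success closes the circuit and clears the failure count.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        // Open (or re-open after a failed trial call) and restart the window.
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

While the circuit is open, `execute` rejects without calling the dependency at all, which is what lets the fallback return immediately.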
Health check integration
Expose degradation state in your health check endpoint so external monitors can distinguish "fully operational" from "degraded but serving":
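// Assumes an Express `app` plus `db`, `redis`, and `searchIndex` clients created elsewhere.
app.get('/health', (req, res) => {
  const status = {
    status: 'ok', // ok | degraded | down
    version: SERVER_VERSION,
    dependencies: {
      database: db.isHealthy() ? 'ok' : 'degraded',
      redis: redis.isReady ? 'ok' : 'degraded',
      searchIndex: searchIndex.isHealthy() ? 'ok' : 'degraded',
    },
    degradedFeatures: [] as string[],
  };

  if (status.dependencies.database !== 'ok') {
    status.status = 'degraded';
    status.degradedFeatures.push('live_data_reads');
  }
  if (status.dependencies.redis !== 'ok') {
    status.status = 'degraded';
    status.degradedFeatures.push('stale_cache_fallback');
  }
  if (status.dependencies.searchIndex !== 'ok') {
    status.status = 'degraded';
    status.degradedFeatures.push('full_text_search');
  }

  // One possible policy: with both the database and the cache unavailable,
  // nothing can be served safely (Tier 5), so report 'down'.
  if (status.dependencies.database !== 'ok' && status.dependencies.redis !== 'ok') {
    status.status = 'down';
  }

  const httpStatus = status.status === 'down' ? 503 : 200;
  res.status(httpStatus).json(status);
});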
AliveMCP probes this endpoint on every check cycle. A degraded server returns HTTP 200, so it is not flagged as "down", but the response body can be shown in the status dashboard, letting you monitor degradation events without triggering false-positive downtime alerts.
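On the monitoring side, telling "degraded" apart from "down" means reading both the HTTP status and the response body. The function below sketches how a generic monitor might do that. It is an assumption for illustration and not AliveMCP's actual logic: `classifyProbe` and `ProbeState` are hypothetical names, and treating an unparseable 2xx body as "up" is a policy choice.

```typescript
type ProbeState = 'up' | 'degraded' | 'down';

// Hypothetical classifier for a /health probe result.
function classifyProbe(
  httpStatus: number,
  body: unknown
): { state: ProbeState; degradedFeatures: string[] } {
  // Any non-2xx response counts as down, whatever the body says.
  if (httpStatus < 200 || httpStatus >= 300) {
    return { state: 'down', degradedFeatures: [] };
  }
  if (typeof body === 'object' && body !== null) {
    const b = body as { status?: unknown; degradedFeatures?: unknown };
    if (b.status === 'down') return { state: 'down', degradedFeatures: [] };
    if (b.status === 'degraded') {
      // Keep only string entries so a malformed body cannot break the dashboard.
      const degradedFeatures = Array.isArray(b.degradedFeatures)
        ? b.degradedFeatures.filter((f): f is string => typeof f === 'string')
        : [];
      return { state: 'degraded', degradedFeatures };
    }
  }
  // 2xx with status 'ok', or with a body we cannot interpret.
  return { state: 'up', degradedFeatures: [] };
}
```

Only the `down` state would page someone under this policy. The `degraded` state and its feature list are for dashboards and trend tracking.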
Further reading
- MCP server circuit breaker — fast-fail on known-broken dependencies
- MCP server graceful shutdown — draining sessions before process exit
- MCP server caching — response cache and stale-while-revalidate
- MCP server Redis — stale cache and session storage patterns
- MCP server health check — liveness, readiness, and degraded state
- MCP server backpressure — flow control when dependencies are slow
- MCP server retry logic — exponential backoff and jitter
- MCP server error handling — structured error types and recovery
- AliveMCP — uptime monitoring for HTTP-deployed MCP servers