MCP server caching
Tool result caching reduces external API calls, cuts upstream rate limit consumption, and lowers tool call latency for repeated queries. In an MCP server, caching sits inside tool handlers — a cache hit returns a stored result before the external call is made. The key design decisions are: what to use as the cache key (a deterministic serialization of the tool arguments), what TTL to set (determined by how stale the data can be before it causes user-visible errors), and what not to cache (tools with side effects or user-specific results). These decisions are different for each tool — there is no one-size-fits-all TTL.
TL;DR
Use the lru-cache package for in-process caching with per-entry TTLs. Build cache keys from a deterministic serialization of the tool arguments with sorted keys: JSON.stringify(args, Object.keys(args).sort()) works for flat argument objects, but the array replacer also filters nested keys, so deep-sort nested arguments before serializing. Set TTLs based on how stale a result can acceptably be, not on how fast the external API is: if users would notice a 5-minute-old result, use a 60-second TTL. Log cache hits and misses to measure hit rate. Never cache tools that write data or trigger notifications, and never cache tools whose results vary by caller identity unless the identity is part of the cache key.
In-process LRU cache
import { LRUCache } from 'lru-cache';
import { z } from 'zod';

// Assumes `server` (the MCP server instance), `logger`, `searchDocsApi`, and a
// `hashKey` helper are defined elsewhere in the module.

// One cache per tool, or a shared cache keyed by tool name + args
const searchCache = new LRUCache<string, string>({
  max: 500, // max 500 cached entries
  ttl: 5 * 60 * 1000, // 5 minutes in milliseconds
  updateAgeOnGet: false, // TTL is absolute from insertion, not from last access
});

function cacheKey(args: Record<string, unknown>): string {
  // Sort keys for deterministic serialization regardless of argument order.
  // The array replacer also filters nested objects, so this form is only safe
  // for flat arguments such as { query, limit }.
  return JSON.stringify(args, Object.keys(args).sort());
}

server.tool(
  'search_docs',
  'Search the documentation for a query',
  {
    query: z.string().min(1),
    limit: z.number().int().min(1).max(20).default(5),
  },
  async (args) => {
    const key = cacheKey(args);
    const cached = searchCache.get(key);
    if (cached !== undefined) {
      // Cache hit: return the stored result without calling the upstream API
      logger.info({ event: 'cache_hit', tool: 'search_docs', key_hash: hashKey(key) });
      return { content: [{ type: 'text', text: cached }] };
    }
    // Cache miss: call upstream, then store the serialized result
    const results = await searchDocsApi(args.query, args.limit);
    const serialized = JSON.stringify(results);
    searchCache.set(key, serialized);
    logger.info({ event: 'cache_miss', tool: 'search_docs', key_hash: hashKey(key) });
    return { content: [{ type: 'text', text: serialized }] };
  }
);
The lru-cache package (npm install lru-cache) provides a well-tested in-process cache with LRU eviction and per-entry TTLs. Set updateAgeOnGet: false so that entries expire on an absolute schedule: an entry cached with a 5-minute TTL expires 5 minutes after insertion, whether or not it was read in the meantime. With updateAgeOnGet: true, every read resets the entry's age, so a frequently accessed entry never expires and stale data stays cached for as long as any client keeps requesting it.
Log the key hash (not the full key) for cache hit/miss events — the full key may contain user query terms. Use a fast non-cryptographic hash like xxhash or a simple FNV hash for the log correlation ID. This gives you enough information to investigate cache behavior without logging user input.
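The handler above calls a hashKey helper that this guide does not define. A minimal sketch, assuming a 32-bit FNV-1a hash is good enough for log correlation (the function body is illustrative and not taken from any particular library):

```typescript
// 32-bit FNV-1a over the UTF-8 bytes of the key.
// Non-cryptographic: suitable for correlating log lines, not for security.
function hashKey(key: string): string {
  const bytes = new TextEncoder().encode(key);
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    // Multiply by the FNV prime and keep the result as an unsigned 32-bit integer
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

console.log(hashKey('{"limit":5,"query":"mcp monitoring"}'));
```

With only 32 bits, two different keys will occasionally share a hash. That is acceptable for log correlation but is the reason this helper should not be used as the cache key itself.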
Cache key design
The cache key must be deterministic: the same logical query with its arguments in a different order must produce the same key, otherwise equivalent calls miss the cache and the hit rate drops for no good reason:
// WRONG: plain JSON.stringify preserves insertion order, so these produce different keys
JSON.stringify({ limit: 5, query: 'mcp monitoring' }) // '{"limit":5,"query":"mcp monitoring"}'
JSON.stringify({ query: 'mcp monitoring', limit: 5 }) // '{"query":"mcp monitoring","limit":5}'

// BETTER (flat arguments only): sort the top-level keys
function flatCacheKey(args: Record<string, unknown>): string {
  return JSON.stringify(args, Object.keys(args).sort());
}
// Both orderings produce: '{"limit":5,"query":"mcp monitoring"}'
// Caveat: the array replacer is an allowlist applied at every nesting level,
// so nested keys that are not also top-level keys are silently dropped.

// For nested objects, deep-sort recursively:
function deepSortKeys(obj: unknown): unknown {
  if (obj === null || typeof obj !== 'object') return obj;
  if (Array.isArray(obj)) return obj.map(deepSortKeys); // array order is preserved
  const sorted: Record<string, unknown> = {};
  for (const key of Object.keys(obj as object).sort()) {
    sorted[key] = deepSortKeys((obj as Record<string, unknown>)[key]);
  }
  return sorted;
}

function cacheKey(args: Record<string, unknown>): string {
  return JSON.stringify(deepSortKeys(args));
}
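To see why the array-replacer form is only safe for flat arguments, the sketch below compares it with the deep-sort form on a nested argument object. The filter argument is a hypothetical example, and deepSortKeys is repeated so the snippet runs on its own.

```typescript
function deepSortKeys(obj: unknown): unknown {
  if (obj === null || typeof obj !== 'object') return obj;
  if (Array.isArray(obj)) return obj.map(deepSortKeys);
  const sorted: Record<string, unknown> = {};
  for (const key of Object.keys(obj as object).sort()) {
    sorted[key] = deepSortKeys((obj as Record<string, unknown>)[key]);
  }
  return sorted;
}

// Top-level sort via an array replacer. The array doubles as an allowlist
// that JSON.stringify applies at every nesting level.
function shallowKey(args: Record<string, unknown>): string {
  return JSON.stringify(args, Object.keys(args).sort());
}

function deepKey(args: Record<string, unknown>): string {
  return JSON.stringify(deepSortKeys(args));
}

const english = { query: 'mcp', filter: { lang: 'en' } };
const german = { query: 'mcp', filter: { lang: 'de' } };

// "lang" is not a top-level key, so the replacer drops it from both objects
// and two different queries collapse onto one cache key.
console.log(shallowKey(english)); // {"filter":{},"query":"mcp"}
console.log(shallowKey(german)); // {"filter":{},"query":"mcp"}

// Deep sorting keeps the nested values, so the keys stay distinct.
console.log(deepKey(english)); // {"filter":{"lang":"en"},"query":"mcp"}
console.log(deepKey(german)); // {"filter":{"lang":"de"},"query":"mcp"}
```

A key collision of this kind is worse than a false miss: the second caller receives the first caller's cached result.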
For tools where the cache should be per-caller (user-specific results), include the caller identity in the key: `${userId}:${JSON.stringify(deepSortKeys(args))}`. This trades cache efficiency (less sharing between callers) for correctness (different callers get their own results). Never use a shared cache for tools that return user-specific data without including the user ID in the key — this is a data privacy violation, not just a bug.
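One way to build a caller-scoped key is sketched below. The callerScopedKey name is hypothetical, and argsKey is assumed to be the output of a deterministic serializer such as the deep-sorting cacheKey. Serializing the parts as a JSON array, rather than joining them with ":", is a design choice that avoids ambiguity when a user ID or tool name itself contains the separator.

```typescript
// Combine caller identity, tool name, and the serialized arguments into one key.
// JSON array encoding quotes and escapes each part, so part boundaries are unambiguous.
function callerScopedKey(userId: string, toolName: string, argsKey: string): string {
  return JSON.stringify([userId, toolName, argsKey]);
}

const argsKey = '{"limit":5,"query":"mcp monitoring"}';
console.log(callerScopedKey('user-1', 'search_docs', argsKey));
console.log(callerScopedKey('user-2', 'search_docs', argsKey));

// With a plain ":" join these two would both become "a:b:c:{}".
console.log(callerScopedKey('a:b', 'c', '{}') === callerScopedKey('a', 'b:c', '{}')); // false
```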
Redis cache for distributed deployments
import { createClient } from 'redis';
import { createHash } from 'node:crypto';
import { z } from 'zod';

// Assumes `server`, `fetchMetrics`, and the `deepSortKeys` helper from the
// previous section are available in this module.
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

async function cachedToolCall<T>(
  toolName: string,
  args: Record<string, unknown>,
  ttlSeconds: number,
  fn: () => Promise<T>
): Promise<T> {
  // Hash the tool name plus canonical arguments into a compact, fixed-length key
  const rawKey = `${toolName}:${JSON.stringify(deepSortKeys(args))}`;
  const key = `mcpcache:${createHash('sha256').update(rawKey).digest('hex').slice(0, 16)}`;
  const cached = await redis.get(key);
  // redis.get returns null on a miss; compare explicitly so falsy payloads still count as hits
  if (cached !== null) {
    return JSON.parse(cached) as T;
  }
  const result = await fn();
  // setEx stores the value together with its TTL in one command
  await redis.setEx(key, ttlSeconds, JSON.stringify(result));
  return result;
}

// Usage in tool handler:
server.tool('get_metrics', 'Get system metrics for a service', {
  service: z.string(),
  window: z.enum(['1m', '5m', '1h']),
}, async (args) => {
  const metrics = await cachedToolCall('get_metrics', args, 60, () => fetchMetrics(args));
  return { content: [{ type: 'text', text: JSON.stringify(metrics) }] };
});
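Use Redis caching when your MCP server runs multiple instances: in-process caches are per-instance and do not share state. The SHA-256 hash truncated to 16 hex characters (8 bytes, 2^64 possible values) gives a compact key; by the birthday bound, collisions only become likely at around 2^32 (roughly 4 billion) distinct keys. Because a collision would return another call's result, keep more of the digest if that risk is unacceptable. Use setEx (SET with EXpiry) so that every entry has a TTL; without TTLs or a maxmemory eviction policy, the cache grows without bound. The key prefix (mcpcache:) makes cache keys easy to identify and delete in bulk with SCAN + DEL. FLUSHDB is only an option when the cache has a Redis database to itself, because it removes every key in that database, not just the prefixed ones.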
What not to cache
Four categories of tools should never have their results cached:
- Mutation tools: any tool that writes data, sends a message, triggers an action, or has side effects. Returning a cached "success" for a mutation that was never executed is a correctness bug. Examples: create_issue, send_notification, update_config.
- Time-sensitive tools: tools where a stale result is actively harmful, not just slightly wrong. Examples: get_current_price, check_availability, get_live_status. If the cache TTL must be under 5 seconds to be safe, consider whether caching adds any value at all, since the overhead of a cache lookup approaches the benefit.
- User-specific tools without identity in the key: tools that return different results based on who is calling (permissions, personal data, account-specific state) must include the caller identity in the cache key. If you cannot reliably determine the caller identity, do not cache.
- Non-deterministic tools: tools that call LLMs, random number generators, or any system that returns different results for the same input by design. Caching these defeats the purpose of calling them.
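These rules can be encoded as an explicit per-tool policy, so that a tool is cached only when someone has deliberately opted it in. The sketch below is one possible shape. The CachePolicy type, the policy table, the TTL values, and the list_my_issues tool are illustrative assumptions; the other tool names are the examples used in this guide.

```typescript
type CachePolicy =
  | { cache: false; reason: string }
  | { cache: true; ttlSeconds: number; perCaller: boolean };

const policies: Record<string, CachePolicy> = {
  search_docs: { cache: true, ttlSeconds: 300, perCaller: false },
  get_metrics: { cache: true, ttlSeconds: 60, perCaller: false },
  list_my_issues: { cache: true, ttlSeconds: 30, perCaller: true }, // hypothetical user-specific tool
  create_issue: { cache: false, reason: 'mutation' },
  send_notification: { cache: false, reason: 'mutation' },
  get_current_price: { cache: false, reason: 'time-sensitive' },
};

// Unknown tools default to "do not cache": caching must be opted into.
function policyFor(toolName: string): CachePolicy {
  return policies[toolName] ?? { cache: false, reason: 'no policy defined' };
}

// Returns the cache key to use, or null when the call must bypass the cache.
// argsKey is assumed to be a deterministic serialization of the tool arguments.
function keyFor(toolName: string, argsKey: string, userId?: string): string | null {
  const policy = policyFor(toolName);
  if (!policy.cache) return null;
  if (policy.perCaller) {
    // Cannot identify the caller, so do not cache
    if (userId === undefined) return null;
    return JSON.stringify([userId, toolName, argsKey]);
  }
  return JSON.stringify([toolName, argsKey]);
}

console.log(keyFor('create_issue', '{}')); // null
console.log(keyFor('list_my_issues', '{}')); // null
console.log(keyFor('search_docs', '{"query":"mcp"}') !== null); // true
```

Keeping the TTL in the same table as the cache decision also makes the per-tool staleness choices visible in one place.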
Cache warming and cold start
An in-process cache is cold after every deployment: the first tool call after deploy hits the upstream API with nothing cached. (A Redis cache survives a deploy unless it is flushed or its entries have expired.) If your server handles high-traffic tools that are expensive to call cold, warm the cache at startup:
// Warm the cache with the most common queries at startup.
// Assumes the search_docs handler reads through cachedToolCall with the same
// tool name, arguments, and TTL; otherwise the warmed entries are never hit.
async function warmCache() {
  const commonQueries = ['mcp monitoring', 'uptime check', 'health check'];
  await Promise.all(
    commonQueries.map(query =>
      cachedToolCall('search_docs', { query, limit: 5 }, 300, () => searchDocsApi(query, 5))
    )
  );
  logger.info({ event: 'cache_warmed', queries: commonQueries.length });
}

// Call during startup, before the server begins accepting traffic
await warmCache();
AliveMCP's probe calls initialize and tools/list, not tools/call, so it does not warm your tool cache and does not appear in your cache hit/miss metrics. Without warming, the first real user session after a deployment hits the cold cache, which shows up as a latency spike in response time metrics immediately after deploy. Use the shape of that spike as a signal: if warming is working, latency for the warmed queries is normal from the first call or normalizes within the first few calls; if latency stays elevated, calls are still missing the cache.
Related questions
How do I invalidate the cache when the underlying data changes?
TTL-based expiration is the simplest approach — set the TTL to the acceptable staleness window and let entries expire naturally. Event-driven invalidation (deleting cache entries when data changes) is more complex but allows longer TTLs: listen for a webhook or database change event and call cache.delete(key) for affected entries. Event-driven invalidation requires that you know which cache keys correspond to the changed data, which is only feasible for simple key structures. For complex query results, TTL-based expiration is usually the right default.
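Event-driven invalidation needs a way to map a changed record back to the cache keys derived from it. A minimal sketch, assuming each cached result can be tagged with the IDs of the records it depends on. The TaggedCache class and its method names are hypothetical, and a plain Map stands in for the real cache:

```typescript
class TaggedCache<V> {
  private entries = new Map<string, V>();
  private keysByTag = new Map<string, Set<string>>();

  // Store a value and remember which tags (record IDs) it depends on.
  set(key: string, value: V, tags: string[]): void {
    this.entries.set(key, value);
    for (const tag of tags) {
      let keys = this.keysByTag.get(tag);
      if (keys === undefined) {
        keys = new Set<string>();
        this.keysByTag.set(tag, keys);
      }
      keys.add(key);
    }
  }

  get(key: string): V | undefined {
    return this.entries.get(key);
  }

  // Call from a webhook or change-event handler.
  // Returns the number of cache entries that were removed.
  invalidateTag(tag: string): number {
    const keys = this.keysByTag.get(tag);
    if (keys === undefined) return 0;
    let removed = 0;
    keys.forEach((key) => {
      if (this.entries.delete(key)) removed++;
    });
    this.keysByTag.delete(tag);
    return removed;
  }
}

const cache = new TaggedCache<string>();
cache.set('search:uptime', '[...results...]', ['doc:42', 'doc:7']);
cache.set('search:health', '[...results...]', ['doc:42']);
cache.set('search:other', '[...results...]', ['doc:99']);

console.log(cache.invalidateTag('doc:42')); // 2
console.log(cache.get('search:other')); // [...results...]
```

The sketch does not expire entries on its own, so it would still be combined with a TTL; the tag index is what makes the longer TTL safe.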
Should I cache at the tool level or use a shared response cache?
Cache at the tool level (inside each tool handler) rather than at the HTTP response level. MCP is a JSON-RPC protocol in which a single HTTP exchange may carry multiple messages (for example, a response streamed as several events), so response-level caching does not map cleanly onto the message-by-message nature of the protocol. Tool-level caching is precise: you cache the result of a specific tool call with specific arguments, not the entire HTTP response body.
How do I measure cache effectiveness?
Track hit rate per tool: (cache hits) / (cache hits + cache misses). A hit rate below 20% for a tool you expected to benefit from caching means the query patterns are too diverse (no two queries use the same arguments) or the TTL is too short. Log both cache hits and misses with the tool name and a hash of the key. Export these counters to your metrics system (Prometheus cache_hits_total and cache_misses_total labeled by tool name) and alert when hit rate drops below your baseline — this can indicate cache eviction pressure (max entries too low), a TTL change, or a deployment that cleared the cache unexpectedly.
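Hit rate per tool can be tracked with two counters. The sketch below keeps them in memory so the arithmetic is visible; in production the same increments would go to a metrics client instead. HitRateTracker is a hypothetical name.

```typescript
class HitRateTracker {
  private counts = new Map<string, { hits: number; misses: number }>();

  record(tool: string, outcome: 'hit' | 'miss'): void {
    const entry = this.counts.get(tool) ?? { hits: 0, misses: 0 };
    if (outcome === 'hit') {
      entry.hits++;
    } else {
      entry.misses++;
    }
    this.counts.set(tool, entry);
  }

  // hits / (hits + misses), or null before any lookups have been recorded.
  hitRate(tool: string): number | null {
    const entry = this.counts.get(tool);
    if (entry === undefined) return null;
    const total = entry.hits + entry.misses;
    return total === 0 ? null : entry.hits / total;
  }
}

const tracker = new HitRateTracker();
tracker.record('search_docs', 'hit');
tracker.record('search_docs', 'miss');
tracker.record('search_docs', 'miss');
tracker.record('search_docs', 'miss');

console.log(tracker.hitRate('search_docs')); // 0.25
console.log(tracker.hitRate('get_metrics')); // null
```

Returning null rather than 0 for a tool with no lookups keeps "no traffic yet" distinct from "every lookup missed", which matters when alerting on a hit-rate threshold.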