Semantic Caching for MCP Servers — reduce latency and cost for similar tool queries
Exact-match caching works for deterministic queries: the same SQL, the same file path, the same API endpoint. But MCP tool calls from LLM agents are rarely identical: "what is the rate limit policy?" and "how many requests per minute can I make?" are different strings that retrieve the same documentation chunks. Semantic caching catches these paraphrase variants by embedding the incoming query and checking cosine similarity against previously cached queries. A cache hit skips roughly 100ms of retrieval and pays only for the query embedding plus a cache lookup of a millisecond or two. This guide covers the full semantic cache implementation: similarity threshold tuning, TTL and invalidation, Redis-backed production deployment, cache cold start behavior, and AliveMCP's role in detecting the latency signature of a cold cache.
TL;DR
Embed each incoming tool query and compare it against cached query vectors using cosine similarity. If similarity exceeds your threshold (start at 0.92, adjust based on observed false positives and false negatives), return the cached response. Store cache entries in Redis with a 1-hour TTL for dynamic corpora, 24 hours for stable reference data. Invalidate cache entries for affected queries when documents are updated. Monitor with AliveMCP: a cold cache (after server restart or Redis flush) produces a latency spike visible in P95 that decays as the cache warms — configure an AliveMCP alert on sustained P95 increases to detect permanent performance regressions rather than expected warmup behavior.
What semantic caching is — and what it isn't
Semantic caching returns a previously computed response when an incoming query is semantically similar to a cached query. The similarity check uses the embedding space — queries that mean the same thing but use different words have high cosine similarity and hit the cache.
Semantic caching is not the same as:
- Exact-match caching: stores only identical strings. Miss rate is near 100% for LLM-generated queries.
- Retrieval caching (result caching): stores the vector search results for a specific embedding. These are the documents retrieved, not the tool response. Lower-level and harder to reason about.
- Prompt caching: the LLM provider caches parts of the prompt in their infrastructure. Speeds up LLM inference, not tool call execution.
Semantic caching operates at the tool call boundary: before executing the expensive retrieval, check whether a semantically equivalent query has been answered before and the answer is still fresh. The cached response is the full formatted tool output — exactly what the retrieval pipeline would have produced.
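The similarity check described above can be sketched in a few lines. This is a minimal illustration, not part of any library: the three-dimensional vectors are made up for the example (real embeddings have hundreds or thousands of dimensions), and the variable names are hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors, invented for illustration only.
const rateLimitPolicy = [0.9, 0.1, 0.2];      // "what is the rate limit policy?"
const requestsPerMinute = [0.85, 0.15, 0.25]; // "how many requests per minute can I make?"
const refundPolicy = [0.1, 0.9, 0.3];         // an unrelated question

const paraphraseScore = cosineSimilarity(rateLimitPolicy, requestsPerMinute);
const unrelatedScore = cosineSimilarity(rateLimitPolicy, refundPolicy);
console.log({ paraphraseScore, unrelatedScore });
```

With these toy inputs the paraphrase pair scores well above a 0.92 threshold and the unrelated pair scores far below it, which is the separation a semantic cache relies on.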
Architecture
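// Semantic cache middleware for MCP tools.
// Not declared async: it must return the wrapped function itself, not a Promise of it,
// so that withSemanticCache(fn, opts)(args) can be called directly.
function withSemanticCache(toolFn, options = {}) {
const {
similarityThreshold = 0.92,
ttlSeconds = 3600,
maxCacheSize = 10000, // not enforced in this sketch; eviction relies on the TTL
} = options;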
return async function cachedTool(args) {
const { query } = args;
if (!query) return toolFn(args); // only cache query-based tools
// 1. Embed the incoming query
const queryEmbedding = await embedText(query);
// 2. Check cache for a similar query
const cacheHit = await findSimilarCacheEntry(queryEmbedding, similarityThreshold);
if (cacheHit) {
console.log(JSON.stringify({
event: 'semantic_cache_hit',
similarity: cacheHit.similarity,
cached_query: cacheHit.query,
incoming_query: query,
}));
return cacheHit.response;
}
// 3. Cache miss: execute the tool
const response = await toolFn(args);
// 4. Store in cache
await storeCacheEntry({
query,
embedding: queryEmbedding,
response,
ttlSeconds,
});
return response;
};
}
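The wrapper pattern applies semantic caching to any MCP tool transparently. The tool function itself doesn't know about caching: it always receives the real arguments and produces the real result. Caching is a middleware concern. Note that the wrapper matches on the query alone. If a tool takes other arguments that change the result (a source filter, for example), include them in the cache lookup, otherwise a hit can return a response computed for different arguments.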
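The wrapper depends on two helpers, `findSimilarCacheEntry` and `storeCacheEntry`. For local experiments, an in-memory version might look like the sketch below. It assumes a single process and a linear scan over all entries, which is only reasonable for small caches; expired entries are skipped on lookup but never removed.

```javascript
// In-memory stand-ins for the two cache helpers (single process, linear scan).
const entries = []; // each item: { query, embedding, response, expiresAt }

function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function storeCacheEntry({ query, embedding, response, ttlSeconds }) {
  entries.push({ query, embedding, response, expiresAt: Date.now() + ttlSeconds * 1000 });
}

async function findSimilarCacheEntry(queryEmbedding, threshold) {
  const now = Date.now();
  let best = null;
  for (const entry of entries) {
    if (entry.expiresAt <= now) continue; // skip expired entries
    const similarity = cosineSimilarity(queryEmbedding, entry.embedding);
    // Keep the most similar entry that clears the threshold
    if (similarity >= threshold && (!best || similarity > best.similarity)) {
      best = { query: entry.query, response: entry.response, similarity };
    }
  }
  return best; // null on a miss
}
```

The function signatures match the ones the wrapper calls, so the same wrapper can later be pointed at a Redis-backed implementation without changes.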
Redis implementation for production
In-process cache (JavaScript Map or Python dict) works for single-process MCP servers, but loses the cache on every restart and isn't shared across multiple server instances. Redis with the RediSearch module supports vector similarity search natively, making it the natural production choice.
import { createClient } from 'redis';
import { createHash } from 'crypto';
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();
// Create the vector index on startup
async function initCacheIndex() {
try {
await redis.ft.create(
'idx:semantic_cache',
{
'embedding': {
type: 'VECTOR',
ALGORITHM: 'HNSW',
TYPE: 'FLOAT32',
DIM: 1536,
DISTANCE_METRIC: 'COSINE',
},
'query': { type: 'TEXT' },
'created_at': { type: 'NUMERIC', SORTABLE: true },
},
{ ON: 'HASH', PREFIX: 'cache:' }
);
console.log('Semantic cache index created');
} catch (err) {
if (err.message.includes('Index already exists')) return;
throw err;
}
}
async function storeCacheEntry({ query, embedding, response, ttlSeconds }) {
const id = createHash('sha256').update(query).digest('hex').slice(0, 16);
const key = `cache:${id}`;
// Store the embedding as a Float32 buffer
const embeddingBuf = Buffer.allocUnsafe(embedding.length * 4);
embedding.forEach((v, i) => embeddingBuf.writeFloatLE(v, i * 4));
await redis.hSet(key, {
query,
embedding: embeddingBuf,
response: JSON.stringify(response),
created_at: Date.now(),
});
await redis.expire(key, ttlSeconds);
}
async function findSimilarCacheEntry(queryEmbedding, threshold) {
const embeddingBuf = Buffer.allocUnsafe(queryEmbedding.length * 4);
queryEmbedding.forEach((v, i) => embeddingBuf.writeFloatLE(v, i * 4));
const results = await redis.ft.search(
'idx:semantic_cache',
`*=>[KNN 1 @embedding $vec AS score]`,
{
PARAMS: { vec: embeddingBuf },
RETURN: ['query', 'response', 'score'],
DIALECT: 2,
}
);
if (results.total === 0) return null;
const top = results.documents[0];
const similarity = 1 - parseFloat(top.value.score); // RediSearch returns distance
if (similarity < threshold) return null;
return {
query: top.value.query,
response: JSON.parse(top.value.response),
similarity,
};
}
RediSearch's HNSW vector index answers the nearest-neighbor query in sub-linear time, so the lookup stays at a millisecond or two (including the Redis round trip) as the cache grows. The dominant cost of a cache hit is the embedding API call to embed the incoming query: approximately 50–100ms for OpenAI. This is the irreducible minimum for semantic caching, because you must embed the query before you can check whether a similar embedding is cached. If your goal is to avoid embedding API latency and cost as well as downstream retrieval, run a local embedding model for cache lookups.
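The latency arithmetic behind this paragraph can be made explicit. The sketch below is a back-of-the-envelope model, not a measurement: `expectedLatencyMs` and its parameter names are hypothetical, `retrievalMs` is assumed to exclude the query embedding, and the model assumes the embedding computed for the cache lookup is reused by the retrieval step on a miss.

```javascript
// Mean per-call latency with and without the semantic cache.
// Assumption: on a miss the query embedding is reused, so it is paid once.
function expectedLatencyMs({ hitRate, embedMs, lookupMs, retrievalMs }) {
  const uncached = embedMs + retrievalMs;
  const cached = embedMs + lookupMs + (1 - hitRate) * retrievalMs;
  return { uncached, cached, savedMs: uncached - cached };
}

// Illustrative numbers taken from the ranges quoted in this guide.
const estimate = expectedLatencyMs({
  hitRate: 0.5,
  embedMs: 75,
  lookupMs: 2,
  retrievalMs: 100,
});
console.log(estimate);
```

If the wrapped tool embeds the query again internally (the wrapper shown earlier passes only `args` to `toolFn`), every miss pays for two embedding calls and the saving shrinks accordingly, so it may be worth passing the embedding through.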
Similarity threshold tuning
The threshold controls the trade-off between false positives (returning a cached response for a query that needed a fresh retrieval) and false negatives (cache miss when the cached response would have been correct).
| Threshold | Behavior | Risk | Use case |
|---|---|---|---|
| 0.99 | Near-exact match only | Low false positives; very low hit rate | High-precision factual queries |
| 0.95 | Same meaning, slight rephrasing | Very low false positives; moderate hit rate | Conservative starting point when wrong answers are costly |
| 0.92 | Clear paraphrases | Rare false positives; good hit rate | Documentation search, FAQ tools |
| 0.85 | Related but distinct queries | Occasional wrong cache hits | Exploratory assistants where close is enough |
| 0.80 | Broad topic match | Frequent wrong cache hits in ambiguous domains | Avoid unless corpus is very narrow |
Measure your threshold's behavior empirically. Log every cache hit with the cached query, incoming query, and similarity score. After a week, sample the pairs with similarity between 0.90 and 0.95 — if the cached responses correctly answer the incoming queries, your threshold can stay at 0.92 or lower. If you find incorrect cache hits in that range, raise the threshold to 0.95. The right threshold is domain-specific: a legal document search tool should use 0.97 or higher because subtle wording differences carry legal significance; a general FAQ tool can use 0.90.
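The review step above can be partly automated once a sample of logged pairs has been labeled by hand. The following is a sketch under assumptions: `falsePositiveRateByBand` is a hypothetical helper, each pair carries a `correct` flag assigned by a human reviewer, and the sample data is invented.

```javascript
// pairs: [{ similarity, correct }] where `correct` is a manual judgment of
// whether the cached response actually answered the incoming query.
// bands: [[low, high], ...] half-open similarity ranges [low, high).
function falsePositiveRateByBand(pairs, bands) {
  return bands.map(([low, high]) => {
    const inBand = pairs.filter((p) => p.similarity >= low && p.similarity < high);
    const wrong = inBand.filter((p) => !p.correct).length;
    return {
      low,
      high,
      count: inBand.length,
      falsePositiveRate: inBand.length > 0 ? wrong / inBand.length : null,
    };
  });
}

// Invented sample of reviewed cache hits.
const reviewedPairs = [
  { similarity: 0.91, correct: true },
  { similarity: 0.92, correct: false },
  { similarity: 0.93, correct: true },
  { similarity: 0.94, correct: true },
  { similarity: 0.96, correct: true },
  { similarity: 0.98, correct: true },
];

const report = falsePositiveRateByBand(reviewedPairs, [[0.90, 0.95], [0.95, 1.01]]);
console.log(report);
```

A non-trivial false positive rate in the band just above your threshold is the signal to raise it; a clean band suggests the threshold can stay where it is.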
TTL strategy by data volatility
const TTL_SECONDS = {
stable_reference: 86400, // 24h — API docs, specs, policies that rarely change
weekly_updated: 3600 * 8, // 8h — blog content, changelogs, tutorials
daily_updated: 3600, // 1h — product documentation, pricing pages
real_time: 0, // disable cache — live inventory, current status, prices
};
async function searchWithCachedTool({ query, source_filter, no_cache = false }) {
if (no_cache) return executeSearch({ query, source_filter });
// Determine TTL from the source being queried
const sourceType = classifySource(source_filter);
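// Use ?? rather than ||: a TTL of 0 is falsy, so || would silently re-enable caching for real_time sources
const ttl = TTL_SECONDS[sourceType] ?? TTL_SECONDS.weekly_updated;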
if (ttl === 0) return executeSearch({ query, source_filter });
return withSemanticCache(
executeSearch,
{ similarityThreshold: 0.92, ttlSeconds: ttl }
)({ query, source_filter });
}
Expose a no_cache: true parameter for agents that need guaranteed freshness — for example, when the user has just updated a document and wants the next query to reflect the change. Without this escape hatch, agents cannot override the cache TTL.
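The `classifySource` helper used in the snippet above is not defined in this guide. One possible shape is a static lookup from source name to volatility class, sketched below. The source names in `SOURCE_CLASSES` are hypothetical placeholders; substitute whatever identifiers your `source_filter` argument actually carries.

```javascript
const TTL_SECONDS = {
  stable_reference: 86400,
  weekly_updated: 3600 * 8,
  daily_updated: 3600,
  real_time: 0,
};

// Hypothetical mapping from source identifier to volatility class.
const SOURCE_CLASSES = new Map([
  ['api-docs', 'stable_reference'],
  ['changelog', 'weekly_updated'],
  ['pricing', 'daily_updated'],
  ['inventory', 'real_time'],
]);

// Unknown or missing sources fall back to the middle-of-the-road class.
function classifySource(sourceFilter) {
  return SOURCE_CLASSES.get(sourceFilter) ?? 'weekly_updated';
}

function resolveTtlSeconds(sourceFilter) {
  return TTL_SECONDS[classifySource(sourceFilter)];
}

console.log(resolveTtlSeconds('inventory'), resolveTtlSeconds('something-else'));
```

A `Map` is used instead of a plain object so that a source named like a built-in property (for example `constructor`) cannot accidentally resolve to something other than a class name.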
Cache invalidation on document updates
When a document is updated and re-indexed, existing cache entries for queries that matched chunks from that document are stale. Invalidating by exact document match requires storing which document chunks each cache entry is based on — expensive to track. A simpler approach: use short TTLs for frequently-updated content and accept that stale responses last at most TTL seconds.
For corpora where you need immediate cache invalidation on update (compliance documents, pricing pages), implement event-based invalidation by tagging each cache entry with the sources it was built from:
// On document update: delete every cache entry tagged with this source
async function invalidateCacheForSource(sourceId) {
const tagKey = `cache:source:${sourceId}:entries`;
// The tag set holds the keys of the cache entries built from this source
const entryKeys = await redis.sMembers(tagKey);
if (entryKeys.length > 0) {
// Delete the entries themselves, then the tag set that pointed at them
await redis.del([...entryKeys, tagKey]);
console.log(JSON.stringify({
event: 'cache_invalidated',
source_id: sourceId,
entries_deleted: entryKeys.length,
}));
}
}
// When storing cache entries, also tag by source
async function storeCacheEntryWithSourceTag({ query, embedding, response, sources, ttlSeconds }) {
// Reuse storeCacheEntry so the embedding buffer and JSON response are serialized consistently
await storeCacheEntry({ query, embedding, response, ttlSeconds });
const id = createHash('sha256').update(query).digest('hex').slice(0, 16);
// Tag by source for targeted invalidation
for (const source of sources) {
const tagKey = `cache:source:${source}:entries`;
await redis.sAdd(tagKey, `cache:${id}`);
await redis.expire(tagKey, ttlSeconds);
}
}
Cold start latency and AliveMCP monitoring
A semantic cache starts empty after every server restart or Redis flush. The first queries after startup are all cache misses — they pay the full retrieval cost (embedding + vector search + reranking). The cache warms as queries arrive and populate it. For a production MCP server with consistent query patterns, the cache typically reaches 40–60% hit rate within the first 15–30 minutes of operation.
The latency signature of a cold cache restart:
- T+0 (restart): AliveMCP protocol probe detects connection refused → P1 alert fires
- T+30s (server back up): AliveMCP protocol probe succeeds → server marked recovered
- T+0–15min (cache warming): P95 tool call latency elevated (all misses, full retrieval cost)
- T+15–30min: P95 latency decays toward baseline as hit rate rises
Configure AliveMCP with a latency alert threshold that distinguishes the expected warmup period from a permanent regression. A threshold of "P95 > 800ms for more than 20 minutes sustained" catches permanent problems while ignoring the expected cold-start transient.
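The "sustained for more than 20 minutes" rule can be expressed as a small check over a time series of P95 samples. This sketch only illustrates the logic; it is not AliveMCP's configuration format or API, and `isSustainedBreach` and the sample shape are assumptions made for the example.

```javascript
// samples: [{ timestampMs, p95Ms }] sorted by time ascending.
// Returns true once P95 has stayed above the threshold for longer than durationMs.
function isSustainedBreach(samples, { thresholdMs = 800, durationMs = 20 * 60 * 1000 } = {}) {
  let breachStart = null;
  for (const { timestampMs, p95Ms } of samples) {
    if (p95Ms > thresholdMs) {
      if (breachStart === null) breachStart = timestampMs;
      if (timestampMs - breachStart > durationMs) return true;
    } else {
      breachStart = null; // any healthy sample resets the clock
    }
  }
  return false;
}

const MINUTE = 60 * 1000;

// Cold start: elevated for 15 minutes, then back to baseline.
const warmup = Array.from({ length: 31 }, (_, i) => ({
  timestampMs: i * MINUTE,
  p95Ms: i < 15 ? 900 : 400,
}));

// Regression: elevated and never recovering.
const regression = Array.from({ length: 31 }, (_, i) => ({
  timestampMs: i * MINUTE,
  p95Ms: 900,
}));

console.log(isSustainedBreach(warmup), isSustainedBreach(regression));
```

The warmup series never accumulates 20 continuous minutes above the threshold, so it stays quiet, while the regression series trips the check.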
// Expose cache metrics for AliveMCP health endpoint
app.get('/health', async (req, res) => {
const cacheStats = await getCacheStats();
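const totalLookups = cacheStats.hits + cacheStats.misses;
const hitRate = totalLookups > 0 ? cacheStats.hits / totalLookups : 0; // avoid dividing by zero before the first lookup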
const p95Latency = getP95Latency();
const status = {
cache_hit_rate: Math.round(hitRate * 100) / 100,
cache_size: cacheStats.total_entries,
p95_latency_ms: p95Latency,
cache_warm: hitRate > 0.20, // cache is "warm" when hit rate exceeds 20%
};
if (p95Latency > 2000) {
return res.status(503).json({ status: 'degraded', reason: 'high_p95_latency', ...status });
}
res.json({ status: 'ok', ...status });
});
The cache_warm field helps distinguish a freshly-started server (expected high latency) from a performance regression (unexpected high latency). Your monitoring dashboard can display this to explain why latency is elevated without triggering a false alarm.
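The health endpoint calls `getP95Latency()`, which this guide does not define. One simple way to back it is a fixed-size window of recent call latencies with a nearest-rank percentile, as sketched below. The window size and the `recordLatency` helper are assumptions; a production server might prefer a histogram from its metrics library.

```javascript
const WINDOW_SIZE = 1000; // number of recent tool calls to keep
const recentLatencies = [];

// Call this once per tool invocation with the measured latency.
function recordLatency(latencyMs) {
  recentLatencies.push(latencyMs);
  if (recentLatencies.length > WINDOW_SIZE) recentLatencies.shift();
}

// Nearest-rank P95 over the current window; 0 when no calls were recorded.
function getP95Latency() {
  if (recentLatencies.length === 0) return 0;
  const sorted = [...recentLatencies].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

// Example: 100 calls with latencies 1ms..100ms.
for (let ms = 1; ms <= 100; ms++) recordLatency(ms);
console.log(getP95Latency());
```

Because the window only holds recent calls, the reported P95 recovers on its own as the cache warms, which matches the decay curve described above.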
Cost and latency impact measurements
Before deploying semantic caching to production, instrument your tool to measure the actual impact on your query patterns:
// Measure cache impact over 7 days before enabling
const metrics = {
cache_hits: 0,
cache_misses: 0,
hit_latency_sum_ms: 0,
miss_latency_sum_ms: 0,
embedding_tokens_saved: 0,
retrieval_calls_saved: 0,
};
// After 7 days of measurement (the guards keep an empty counter from producing NaN):
const totalCalls = metrics.cache_hits + metrics.cache_misses;
const hitRate = totalCalls > 0 ? metrics.cache_hits / totalCalls : 0;
const avgHitLatency = metrics.cache_hits > 0 ? metrics.hit_latency_sum_ms / metrics.cache_hits : 0;
const avgMissLatency = metrics.cache_misses > 0 ? metrics.miss_latency_sum_ms / metrics.cache_misses : 0;
// Mean saving per call: the hit/miss latency gap weighted by how often a hit occurs
const latencySavingMs = (avgMissLatency - avgHitLatency) * hitRate;
const tokensSavedPerDay = (metrics.embedding_tokens_saved / 7);
const costSavedPerDay = (tokensSavedPerDay / 1_000_000) * 0.02; // assumes $0.02 per 1M embedding tokens (OpenAI); check current pricing
console.log(`Hit rate: ${(hitRate * 100).toFixed(1)}%`);
console.log(`Avg hit latency: ${avgHitLatency.toFixed(0)}ms vs miss: ${avgMissLatency.toFixed(0)}ms`);
console.log(`Mean latency reduction per call: ~${latencySavingMs.toFixed(0)}ms`);
console.log(`Embedding cost saved: $${(costSavedPerDay * 30).toFixed(2)}/month`);
For most MCP documentation tools, the latency benefit (reduced P50 for users asking common questions) is the primary value. The cost benefit from embedding token savings is typically small: embedding is already cheap relative to vector search and LLM inference, and the cache lookup itself spends one embedding call per query. The secondary benefit is reduced load on the vector store, which improves reliability under concurrent load.
Frequently asked questions
Is semantic caching worth adding to my MCP server?
It depends on your query repetition rate. If your MCP server handles a narrow domain (customer support for one product, internal documentation for one team) with many users asking similar questions, semantic caching can achieve 40–70% hit rates and significantly reduce P50 latency. If your server handles highly diverse queries (research tool, general search over a large corpus), hit rates may be below 10% and the embedding cost of checking the cache exceeds the benefit. Measure your actual query patterns for one week before adding caching. Log all incoming queries, cluster them by embedding similarity, and calculate the natural hit rate at various thresholds — that gives you a realistic hit rate estimate before you write any caching code.
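The offline estimate described above can be approximated by replaying logged query embeddings in arrival order and counting how many would have matched an earlier query. This is a rough sketch: `estimateHitRate` is a hypothetical helper, it ignores TTL expiry, and the pairwise scan is quadratic, so it suits a sample of the log rather than the full week.

```javascript
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// embeddings: query vectors in the order the queries arrived.
// A query counts as a hit if any earlier query is at least `threshold` similar.
function estimateHitRate(embeddings, threshold) {
  if (embeddings.length === 0) return 0;
  let hits = 0;
  for (let i = 1; i < embeddings.length; i++) {
    for (let j = 0; j < i; j++) {
      if (cosineSimilarity(embeddings[i], embeddings[j]) >= threshold) {
        hits++;
        break; // one earlier match is enough
      }
    }
  }
  return hits / embeddings.length;
}

// Toy 2-dimensional log: two topics, each asked twice.
const loggedEmbeddings = [[1, 0], [0.99, 0.05], [0, 1], [0.05, 0.99]];
for (const threshold of [0.85, 0.92, 0.99]) {
  console.log(threshold, estimateHitRate(loggedEmbeddings, threshold));
}
```

Running the same log through several thresholds gives the hit-rate side of the trade-off; the false-positive side still needs manual review of the matched pairs.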
How do I know when the semantic cache is returning a wrong answer?
You can't know at the time of the cache hit — that's the trade-off. Your quality monitoring is the external signal: if users start reporting wrong answers or if your evaluation set shows regression, the similarity threshold is too low. Build a shadow mode first: run the cache check and log potential hits without actually serving them. Review the logged (cached_query, incoming_query, similarity) pairs to calibrate your threshold on real traffic before enabling it in production. After enabling, add sampling-based evaluation: for 1% of cache hits, ignore the cache, run the real retrieval, and compare the results to the cached response. High divergence rate indicates the threshold needs adjustment.
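Shadow mode can reuse the same wrapper shape as the real cache. The sketch below is one way to do it, under assumptions: `withShadowCache` is a hypothetical name, the embedding and cache helpers are injected so the example has no external dependencies, and responses are compared by their JSON serialization, which is a crude equality check.

```javascript
// Shadow mode: always run the real tool, only log what the cache would have served.
function withShadowCache(toolFn, deps) {
  const {
    embedText,
    findSimilarCacheEntry,
    storeCacheEntry,
    similarityThreshold = 0.92,
    ttlSeconds = 3600,
    log = console.log,
  } = deps;

  return async function shadowedTool(args) {
    const { query } = args;
    if (!query) return toolFn(args);

    const embedding = await embedText(query);
    const wouldHit = await findSimilarCacheEntry(embedding, similarityThreshold);
    const response = await toolFn(args); // never skipped in shadow mode

    if (wouldHit) {
      log(JSON.stringify({
        event: 'semantic_cache_shadow_hit',
        similarity: wouldHit.similarity,
        cached_query: wouldHit.query,
        incoming_query: query,
        // Crude comparison: did the cached response equal the fresh one?
        responses_match: JSON.stringify(wouldHit.response) === JSON.stringify(response),
      }));
    } else {
      await storeCacheEntry({ query, embedding, response, ttlSeconds });
    }
    return response; // callers always get the fresh result
  };
}
```

The logged `responses_match` rate at each similarity level is direct evidence for choosing a threshold, and the same comparison can be kept after launch for the 1% sampling check.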
Should I use Redis or an in-process cache for semantic caching?
Use Redis if you run more than one MCP server instance (even two), or if cache persistence across server restarts is important for maintaining low latency. Use in-process (JavaScript Map or Python dict with a simple vector index) if you run a single server instance and you're comfortable losing the cache on restart. The in-process option has lower latency for cache lookups (~0.1ms for an optimized nearest-neighbor search vs ~2ms for Redis round-trip) but doesn't share across instances and disappears on restart. For most production MCP servers, Redis is the right choice — the 2ms overhead is negligible against the 50ms embedding API call that must precede the cache lookup anyway.
Does semantic caching interact badly with streaming MCP responses?
If your MCP tool streams its response as progressive chunks, only a complete response can be cached; a partial stream cannot. On a cache miss, forward chunks to the client as they arrive while accumulating them, and store the assembled response once the stream completes. On a cache hit, return the cached complete response. Cache hits therefore don't stream: they return the full response at once, which is faster for the client than waiting for a stream to finish. Uncached queries keep their normal streaming behavior.
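Forwarding chunks while accumulating them can be done with an async generator. This is a sketch under assumptions: the tool's output is taken to be an async iterable of string chunks, which may not match your server's streaming interface, and `streamAndCapture` is a hypothetical helper name.

```javascript
// Yields each chunk to the consumer immediately while keeping a copy.
// onComplete receives the assembled response only if the stream finishes;
// an abandoned or failed stream is never handed to the cache.
async function* streamAndCapture(chunks, onComplete) {
  const buffered = [];
  for await (const chunk of chunks) {
    buffered.push(chunk);
    yield chunk;
  }
  await onComplete(buffered.join(''));
}

// Usage sketch with a fake chunk source.
async function* fakeToolStream() {
  yield 'Rate limit: ';
  yield '100 requests ';
  yield 'per minute.';
}

async function demo() {
  let cachedResponse = null;
  const received = [];
  const stream = streamAndCapture(fakeToolStream(), async (full) => {
    cachedResponse = full; // in a real server, call storeCacheEntry here
  });
  for await (const chunk of stream) received.push(chunk);
  return { received, cachedResponse };
}
```

Because the code after the loop only runs when the source is fully consumed, a client that disconnects mid-stream leaves nothing behind in the cache.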
How does AliveMCP monitoring help with a semantic cache deployment?
AliveMCP monitors the MCP protocol layer and reports P95 latency trends. Semantic cache deployments produce a distinctive latency pattern: P50 drops significantly (cache hits are fast), while P95 changes less, because as long as more than 5% of calls miss, the 95th percentile is still a slow miss. After a Redis restart or cache flush the hit rate drops to 0% temporarily, every call pays the full retrieval cost, and latency climbs until the cache warms. AliveMCP's alert fires on the protocol probe failure during a server restart, letting you correlate the latency spike on the recovery curve with the known restart time. Without that restart alert, a sudden P95 spike looks like a retrieval performance regression; with it, you know it's a cache cold start and the latency will recover as the cache warms.
Further reading
- RAG with MCP Servers — retrieval-augmented generation tool patterns
- Vector Search MCP Tools — pgvector, Chroma, and Pinecone integration
- Embedding Tools in MCP Servers — generate and store vectors via MCP
- MCP Server Caching — exact-match response caching patterns
- MCP Server Redis Integration — session state, rate limiting, and caching
- MCP Server Performance — latency profiling and P95 optimization