Guide · Multi-Tenant SaaS
Usage Metering for MCP Servers — per-tool tracking, quota enforcement, and billing integration
A multi-tenant MCP server that runs without usage metering is a cost leak waiting to happen: one aggressive agent can exhaust compute budget for every other tenant in the same instance. Usage metering — counting tool calls per tenant, enforcing quotas, and feeding usage data into a billing system — is the operational layer that makes MCP-as-a-service financially sustainable. This guide covers where to intercept tool calls for metering, Redis sliding window counters for real-time quota enforcement, asynchronous event batching for billing pipelines, and Stripe metered billing integration so that tool calls translate directly to customer invoices.
TL;DR
Wrap every tool handler with a metering middleware that: (1) reads tenantId from the session context, (2) increments a Redis sliding-window counter, (3) checks the counter against the tenant's plan quota and returns isError: true with a quota_exceeded code if breached, and (4) enqueues a usage event for the billing pipeline. Point AliveMCP at a /health endpoint that checks Redis connectivity and billing pipeline lag — metering failures are silent: the server stays up while tenants are charged incorrectly (or not at all).
Where to intercept tool calls
MCP SDK servers expose a server.tool() registration method. The metering layer sits between the SDK dispatch and your handler logic — not inside each handler. This keeps metering code in one place and prevents a missed import from silently bypassing quota checks.
// metering-middleware.ts
import { getMeteringClient } from './metering-client.js';
import { getTenantFromSession } from './auth.js';

type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

export function withMetering(
  toolName: string,
  handler: ToolHandler,
): ToolHandler {
  return async (args) => {
    const tenant = getTenantFromSession(); // reads from AsyncLocalStorage
    if (!tenant) {
      return { isError: true, content: [{ type: 'text', text: 'Unauthenticated' }] };
    }
    const metering = getMeteringClient();
    const allowed = await metering.checkAndIncrement(tenant.id, toolName);
    if (!allowed) {
      return {
        isError: true,
        content: [{ type: 'text', text: `quota_exceeded: ${toolName} limit reached for plan ${tenant.plan}` }],
      };
    }
    // Enqueue the billing event without awaiting: billing stays off the hot path
    metering.enqueueUsageEvent({ tenantId: tenant.id, tool: toolName, cost: 1, timestamp: Date.now() });
    return handler(args);
  };
}

// Registration: wrap at the point of registration, not inside the handler
server.tool('search_products', searchProductsSchema, withMetering('search_products', searchProductsHandler));
server.tool('get_order', getOrderSchema, withMetering('get_order', getOrderHandler));
server.tool('create_shipment', createShipmentSchema, withMetering('create_shipment', createShipmentHandler));
The getTenantFromSession() call uses Node.js AsyncLocalStorage populated by the authentication middleware before the tool handler fires. This avoids threading tenantId through every tool argument and keeps the tool interface clean.
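The guide does not show how the tenant gets into AsyncLocalStorage, so here is a minimal sketch of one way to do it. The `Tenant` shape and the `runWithTenant` helper are assumptions for illustration; only `getTenantFromSession` is named in the text above.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Hypothetical tenant shape: the middleware only relies on `id` and `plan`.
interface Tenant {
  id: string;
  plan: string;
}

const tenantStorage = new AsyncLocalStorage<Tenant>();

// Called by the auth layer once it has resolved the tenant for a request.
// Everything invoked inside `fn`, including awaited calls, sees this tenant.
export function runWithTenant<T>(tenant: Tenant, fn: () => T): T {
  return tenantStorage.run(tenant, fn);
}

// Returns undefined when called outside runWithTenant().
export function getTenantFromSession(): Tenant | undefined {
  return tenantStorage.getStore();
}
```

The auth middleware would wrap request dispatch in `runWithTenant(tenant, () => handleRequest(req))`, where `handleRequest` stands in for whatever function hands the request to the MCP server.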
Redis sliding window counters
A sliding window counter counts requests within a rolling time window (e.g., "1000 tool calls in the last hour") rather than a fixed calendar window (e.g., "1000 tool calls this calendar month"). Fixed windows allow bursts at the boundary: a tenant can spend a full quota just before the reset and another full quota just after it, doubling the effective limit over a short span. A sliding window has no reset boundary, so the limit holds over any interval of the window's length.
Lua script for atomic check-and-increment
The check (is the tenant over quota?) and increment (count this call) must be atomic to prevent TOCTOU races under concurrency. A Redis Lua script runs atomically on the server:
// metering-client.ts
import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

// Lua script: sliding window using a sorted set
// Members are event IDs, scores are Unix timestamps (milliseconds)
// Returns 1 if allowed, 0 if quota exceeded
const CHECK_AND_INCREMENT = `
  local key = KEYS[1]
  local now = tonumber(ARGV[1])
  local window_ms = tonumber(ARGV[2])
  local limit = tonumber(ARGV[3])
  local event_id = ARGV[4]
  -- Remove events outside the sliding window
  redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window_ms)
  -- Count remaining events in window
  local count = redis.call('ZCARD', key)
  if count >= limit then
    return 0
  end
  -- Add this event
  redis.call('ZADD', key, now, event_id)
  -- TTL slightly longer than window to ensure cleanup
  redis.call('PEXPIRE', key, window_ms + 60000)
  return 1
`;

const WINDOW_MS = 60 * 60 * 1000; // 1 hour sliding window
const PLAN_LIMITS: Record<string, number> = {
  free: 100,
  starter: 1000,
  pro: 10000,
  enterprise: Infinity,
};

export async function checkAndIncrement(tenantId: string, toolName: string): Promise<boolean> {
  const plan = await getTenantPlan(tenantId); // cached, 60s TTL
  const limit = PLAN_LIMITS[plan] ?? 100;
  if (limit === Infinity) return true; // enterprise: skip Redis, never block
  const key = `meter:${tenantId}:calls`;
  const now = Date.now();
  const eventId = `${now}-${Math.random().toString(36).slice(2)}`;
  const result = await redis.eval(CHECK_AND_INCREMENT, {
    keys: [key],
    arguments: [String(now), String(WINDOW_MS), String(limit), eventId],
  });
  return result === 1;
}
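For unit-testing the middleware without a Redis instance, it can help to have an in-memory model of the same three steps (evict, count, add). The class below is a sketch for tests only, not part of the guide's implementation; the name `InMemorySlidingWindow` is hypothetical, and the model is neither shared across processes nor durable.

```typescript
// In-memory model of the sorted-set sliding window: one timestamp array per key.
export class InMemorySlidingWindow {
  private readonly events = new Map<string, number[]>();
  private readonly windowMs: number;
  private readonly limit: number;

  constructor(windowMs: number, limit: number) {
    this.windowMs = windowMs;
    this.limit = limit;
  }

  // Mirrors the Lua script: evict old events, count, then add only if under the limit.
  checkAndIncrement(key: string, now: number): boolean {
    const cutoff = now - this.windowMs;
    // Same boundary as ZREMRANGEBYSCORE '-inf' cutoff: scores <= cutoff are dropped.
    const live = (this.events.get(key) ?? []).filter((ts) => ts > cutoff);
    if (live.length >= this.limit) {
      this.events.set(key, live);
      return false;
    }
    live.push(now);
    this.events.set(key, live);
    return true;
  }
}
```

Passing `now` explicitly keeps the model deterministic in tests: a call that was denied becomes allowed again once the oldest recorded timestamp falls out of the window.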
Per-tool quotas
Some tools are more expensive than others. A semantic search over 10 million vectors costs more compute than a simple database lookup. Model this by assigning each tool a cost in quota units and charging that cost against the tenant's shared window:
// Per-tool cost in quota units, charged against the plan limit
const TOOL_COST: Record<string, number> = {
  'search_products': 1,
  'semantic_search': 5, // costs 5 quota units
  'generate_report': 20, // costs 20 quota units
  'batch_import': 100, // costs 100 quota units
};

export async function checkAndIncrementWeighted(
  tenantId: string,
  toolName: string,
): Promise<boolean> {
  const cost = TOOL_COST[toolName] ?? 1;
  const plan = await getTenantPlan(tenantId);
  const limit = PLAN_LIMITS[plan] ?? 100;
  if (limit === Infinity) return true;
  // Record `cost` events to represent the weighted cost. Each increment is
  // atomic on its own, but the group is not: on denial, the units that did
  // fit stay consumed. A single weighted Lua script avoids that.
  const results = await Promise.all(
    Array.from({ length: cost }, () => checkAndIncrement(tenantId, toolName)),
  );
  // Allow only if every unit fit within the quota. Checking just the first
  // result would admit calls whose later units were rejected.
  return results.every(Boolean);
}
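Issuing one increment per cost unit is simple, but the units are not reserved as a group. A sketch of an all-or-nothing alternative is a variant of the Lua script that takes the cost as a fifth argument and adds every unit in one atomic step. The script, the `EvalClient` interface, and both function names below are assumptions for illustration; the `eval` call shape follows the node-redis usage shown earlier in this guide.

```typescript
// Variant of the sliding-window script that reserves `cost` units atomically.
// ARGV[5] is the cost; members get a numeric suffix so every unit is distinct.
export const CHECK_AND_INCREMENT_WEIGHTED = `
  local key = KEYS[1]
  local now = tonumber(ARGV[1])
  local window_ms = tonumber(ARGV[2])
  local limit = tonumber(ARGV[3])
  local event_id = ARGV[4]
  local cost = tonumber(ARGV[5])
  redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window_ms)
  local count = redis.call('ZCARD', key)
  if count + cost > limit then
    return 0
  end
  for i = 1, cost do
    redis.call('ZADD', key, now, event_id .. ':' .. i)
  end
  redis.call('PEXPIRE', key, window_ms + 60000)
  return 1
`;

// Minimal slice of the Redis client that this sketch needs.
export interface EvalClient {
  eval(script: string, options: { keys: string[]; arguments: string[] }): Promise<unknown>;
}

// Builds the keys/arguments in the order the script reads them.
export function buildWeightedEvalOptions(
  tenantId: string,
  now: number,
  windowMs: number,
  limit: number,
  cost: number,
  eventId: string,
): { keys: string[]; arguments: string[] } {
  return {
    keys: [`meter:${tenantId}:calls`],
    arguments: [String(now), String(windowMs), String(limit), eventId, String(cost)],
  };
}

export async function checkAndIncrementWeightedAtomic(
  client: EvalClient,
  tenantId: string,
  limit: number,
  cost: number,
  windowMs: number = 60 * 60 * 1000,
): Promise<boolean> {
  const now = Date.now();
  const eventId = `${now}-${Math.random().toString(36).slice(2)}`;
  const options = buildWeightedEvalOptions(tenantId, now, windowMs, limit, cost, eventId);
  return (await client.eval(CHECK_AND_INCREMENT_WEIGHTED, options)) === 1;
}
```

With this variant a denied call consumes nothing, so a tenant close to the limit can still make cheaper calls after an expensive one is rejected.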
Asynchronous billing event pipeline
Never synchronously call your billing API (Stripe, Lago, Orb) in the tool handler's hot path. Billing API latency (50–500ms) would add to every tool call's response time. Instead, enqueue usage events and batch them asynchronously:
// billing-pipeline.ts: async event queue with batching
export interface UsageEvent {
  tenantId: string;
  tool: string;
  cost: number; // quota units consumed
  timestamp: number; // Unix ms
}

export const eventQueue: UsageEvent[] = [];
const BATCH_SIZE = 100;
const FLUSH_INTERVAL_MS = 30_000; // flush every 30 seconds
// Read by the /health check to detect a stalled pipeline
export let lastSuccessfulFlushAt = Date.now();

export function enqueueUsageEvent(event: UsageEvent): void {
  eventQueue.push(event);
  if (eventQueue.length >= BATCH_SIZE) {
    void flushEvents(); // fire and forget: errors are handled inside flushEvents
  }
}

async function flushEvents(): Promise<void> {
  if (eventQueue.length === 0) return;
  const batch = eventQueue.splice(0, BATCH_SIZE);
  try {
    await reportBatchToStripe(batch);
  } catch (err) {
    // Put the batch back at the front; the next interval tick retries it
    console.error('Billing flush failed, re-enqueueing', err);
    eventQueue.unshift(...batch);
    return;
  }
  lastSuccessfulFlushAt = Date.now();
  try {
    await persistBatchToDb(batch); // local audit trail
  } catch (err) {
    // Already reported to Stripe: re-enqueueing here would bill the batch twice
    console.error('Audit trail write failed for a reported batch', err);
  }
}

// Start background flush loop
setInterval(flushEvents, FLUSH_INTERVAL_MS);
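The flush loop above retries on a fixed 30-second interval. If repeated failures should slow the retries down (for example during a billing provider outage), one option is exponential backoff. The sketch below is an assumption-laden illustration: `startFlushLoop`, `backoffDelayMs`, and the convention that `flush` resolves to `true` on success are all hypothetical names and choices, not part of the pipeline code in this guide.

```typescript
// Exponential backoff with a cap: 1s, 2s, 4s, ... up to maxMs.
export function backoffDelayMs(attempt: number, baseMs = 1_000, maxMs = 60_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `flush` repeatedly. After a success the next run is scheduled at the
// normal interval; after each consecutive failure the retry delay doubles.
export function startFlushLoop(
  flush: () => Promise<boolean>, // assumed contract: resolves true on success
  intervalMs = 30_000,
): () => void {
  let failures = 0;
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = async (): Promise<void> => {
    if (stopped) return;
    const ok = await flush().catch(() => false);
    failures = ok ? 0 : failures + 1;
    const delay = ok ? intervalMs : backoffDelayMs(failures - 1);
    if (!stopped) timer = setTimeout(tick, delay);
  };

  timer = setTimeout(tick, intervalMs);

  // Returns a stop function so the loop can be shut down cleanly.
  return () => {
    stopped = true;
    if (timer) clearTimeout(timer);
  };
}
```

Note that the first retry (1 second) is sooner than the normal interval, which suits transient errors; the cap keeps a long outage from pushing retries further apart than 60 seconds.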
Stripe metered billing integration
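Stripe Billing supports metered subscriptions: you report usage (call stripe.subscriptionItems.createUsageRecord) and Stripe invoices the tenant at the end of the billing period based on reported units. The usage record call shown here belongs to Stripe's legacy usage-based billing interface and matches the API version pinned below; newer Stripe API versions replace it with billing meters and meter events, so confirm which interface your account's API version supports before adopting this code.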
import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!, { apiVersion: '2024-06-20' });

async function reportBatchToStripe(events: UsageEvent[]): Promise<void> {
  // Aggregate by tenant: sum cost units for the batch period
  const totals = new Map<string, number>();
  for (const event of events) {
    totals.set(event.tenantId, (totals.get(event.tenantId) ?? 0) + event.cost);
  }
  // Caveat: if one tenant's report fails, the caller retries the whole batch,
  // and 'increment' then double-counts tenants whose report already succeeded
  await Promise.all(
    [...totals.entries()].map(async ([tenantId, units]) => {
      const subscriptionItemId = await getStripeSubscriptionItemId(tenantId);
      if (!subscriptionItemId) return; // tenant not on metered plan
      await stripe.subscriptionItems.createUsageRecord(subscriptionItemId, {
        quantity: units,
        timestamp: Math.floor(Date.now() / 1000),
        action: 'increment', // additive: Stripe accumulates across reports
      });
    })
  );
}

// Cache subscription item IDs to avoid a database lookup per batch
const subscriptionItemCache = new Map<string, string | null>();

async function getStripeSubscriptionItemId(tenantId: string): Promise<string | null> {
  if (subscriptionItemCache.has(tenantId)) {
    return subscriptionItemCache.get(tenantId) ?? null;
  }
  const tenant = await db.query('SELECT stripe_subscription_item_id FROM tenants WHERE id = $1', [tenantId]);
  const id = tenant.rows[0]?.stripe_subscription_item_id ?? null;
  subscriptionItemCache.set(tenantId, id);
  // Expire cache entry after 10 minutes (plan upgrades take effect promptly)
  setTimeout(() => subscriptionItemCache.delete(tenantId), 10 * 60 * 1000);
  return id;
}
Exposing usage to tenants
Tenants want to see their own usage. Expose a get_usage_summary tool that returns current period consumption and remaining quota. This reduces support tickets and helps tenants plan upgrades:
server.tool(
  'get_usage_summary',
  'Get tool-call usage for the current rolling one-hour window',
  async () => {
    const tenant = getTenantFromSession();
    if (!tenant) {
      return { isError: true, content: [{ type: 'text', text: 'Unauthenticated' }] };
    }
    // Read from Redis: current window count
    const key = `meter:${tenant.id}:calls`;
    const now = Date.now();
    await redis.zRemRangeByScore(key, '-inf', String(now - WINDOW_MS));
    const used = await redis.zCard(key);
    const limit = PLAN_LIMITS[tenant.plan] ?? 100;
    const remaining = limit === Infinity ? null : Math.max(0, limit - used);
    // A sliding window never resets all at once: capacity returns when the
    // oldest recorded call ages out of the window
    const [oldest] = await redis.zRangeWithScores(key, 0, 0);
    const nextSlotFreesAt = oldest ? new Date(oldest.score + WINDOW_MS).toISOString() : null;
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          plan: tenant.plan,
          period: 'rolling_1h',
          used,
          limit: limit === Infinity ? 'unlimited' : limit,
          remaining,
          next_slot_frees_at: nextSlotFreesAt,
        }),
      }],
    };
  }
);
Health checks for metering infrastructure
The metering layer introduces two new failure modes: Redis unreachable (quota enforcement breaks — all tenants become unmetered or all blocked depending on your fail-open/fail-closed decision) and billing pipeline lag (usage events queue up but never reach Stripe, causing under-billing or over-billing on reconciliation).
// /health additions for metering infrastructure
async function getMeteringHealth(): Promise<HealthStatus> {
  const checks: Record<string, 'ok' | 'degraded' | 'down'> = {};
  // 1. Redis connectivity
  try {
    await redis.ping();
    checks.redis = 'ok';
  } catch {
    checks.redis = 'down';
  }
  // 2. Billing pipeline lag (test the higher threshold first so 'down' is reachable)
  const queueDepth = eventQueue.length;
  if (queueDepth > 5000) {
    checks.billing_pipeline = 'down';
  } else if (queueDepth > 1000) {
    checks.billing_pipeline = 'degraded'; // 1000+ events queued = pipeline falling behind
  } else {
    checks.billing_pipeline = 'ok';
  }
  // 3. Last successful Stripe flush (lastSuccessfulFlushAt is maintained by the flush loop)
  const timeSinceFlush = Date.now() - lastSuccessfulFlushAt;
  if (queueDepth > 0 && timeSinceFlush > 5 * 60 * 1000) { // events waiting, no flush for 5 minutes
    checks.billing_flush = 'degraded';
  } else {
    checks.billing_flush = 'ok';
  }
  const degraded = Object.values(checks).some(s => s === 'degraded');
  const down = Object.values(checks).some(s => s === 'down');
  return {
    status: down ? 'down' : degraded ? 'degraded' : 'ok',
    checks,
    queue_depth: queueDepth,
  };
}
AliveMCP probes this /health endpoint every 60 seconds. If Redis goes down, you want to know immediately — not when tenants have been using your service for free for three hours because quota enforcement was silently bypassed. If the billing pipeline lags, you want to reconcile the audit log before the billing period closes.
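To make the metering status visible to an external probe, the health function has to be served over HTTP. The sketch below uses Node's built-in `http` module and is only one possible wiring: `createHealthServer` and `httpCodeFor` are hypothetical names, and the choice to return 200 for 'degraded' and 503 only for 'down' is an assumption, not documented AliveMCP behavior.

```typescript
import { createServer } from 'node:http';

type Status = 'ok' | 'degraded' | 'down';

// Assumption: 'degraded' still returns 200 so a lagging billing queue is
// reported in the body without reading as an outage; only 'down' maps to 503.
export function httpCodeFor(status: Status): number {
  return status === 'down' ? 503 : 200;
}

// `getHealth` stands in for a function such as getMeteringHealth().
export function createHealthServer(getHealth: () => Promise<{ status: Status }>) {
  return createServer(async (req, res) => {
    if (req.url !== '/health') {
      res.writeHead(404).end();
      return;
    }
    try {
      const health = await getHealth();
      res.writeHead(httpCodeFor(health.status), { 'content-type': 'application/json' });
      res.end(JSON.stringify(health));
    } catch {
      // If the health check itself throws, report the service as down.
      res.writeHead(503, { 'content-type': 'application/json' });
      res.end(JSON.stringify({ status: 'down' }));
    }
  });
}
```

A caller would start it with something like `createHealthServer(getMeteringHealth).listen(8080)`, with the port chosen to suit the deployment.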
Fail-open vs fail-closed on Redis outage
When Redis is unreachable, you must decide: allow all tool calls (fail-open) or block all tool calls (fail-closed). Neither is universally correct:
| Behavior | Revenue risk | Availability risk | Best for |
|---|---|---|---|
| Fail-open (allow all) | Tenants may exceed quota at no cost during outage | Low — tool calls succeed | Availability-critical services where downtime costs more than over-usage |
| Fail-closed (block all) | None — no unmetered usage | High — all tenants blocked | Strict quota enforcement where over-usage is a compliance issue |
| Fail-open for paying tenants only | Low — paid tenants with history unlikely to massively over-use | Medium — free tier blocked | SaaS with paid vs free tiers; protects revenue while maintaining paid SLA |
The recommended approach for most MCP SaaS: fail-open for tenants on paid plans (they have skin in the game), fail-closed for free-tier tenants. Implement this by catching the Redis error in checkAndIncrement and checking the tenant's plan tier:
export async function checkAndIncrement(tenantId: string, toolName: string): Promise<boolean> {
  const plan = await getTenantPlan(tenantId);
  try {
    // ... normal Redis logic ...
    return result === 1;
  } catch (err) {
    // Redis outage: apply fail-open/closed policy by tier
    if (plan === 'free') {
      return false; // fail-closed for free tier
    }
    return true; // fail-open for paid plans
  }
}
Frequently asked questions
Should I use a fixed window or sliding window for quota counters?
Use both, for different jobs. A fixed window is cheap (a single counter per tenant) but permits bursts at the boundary: a tenant can spend a full quota just before the reset and another full quota just after it, so up to twice the limit can land in a short span. A sliding window (e.g., 10,000 calls in any rolling 1-hour window) has no boundary to exploit. The tradeoff: sliding windows require more Redis memory (a sorted set per tenant vs a single counter) and slightly more complex logic. For billing periods (monthly invoicing), use fixed windows aligned to the billing cycle. For rate limiting (preventing bursts), use sliding windows. Both can coexist: a monthly fixed-window quota for billing plus an hourly sliding window for burst protection.
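As a rough sketch of how the two limits can coexist, the fixed monthly counter can live under a key that embeds the month, so a new month starts from a fresh counter with no explicit reset. The key format and both function names are assumptions for illustration, and the sketch aligns the window to UTC calendar months; a real billing cycle is usually anchored to the subscription start date instead.

```typescript
// Key for a fixed calendar-month counter, e.g. "meter:tenant_42:month:2025-03".
// Assumption: UTC calendar months stand in for the billing cycle.
export function monthlyCounterKey(tenantId: string, at: Date): string {
  const year = at.getUTCFullYear();
  const month = String(at.getUTCMonth() + 1).padStart(2, '0');
  return `meter:${tenantId}:month:${year}-${month}`;
}

// A call goes through only when the burst limit (sliding window) and the
// monthly quota (fixed window) both allow it.
export function allowCall(
  slidingWindowAllowed: boolean,
  monthlyUsed: number,
  monthlyLimit: number,
): boolean {
  return slidingWindowAllowed && monthlyUsed < monthlyLimit;
}
```

In Redis the monthly counter could be a plain integer incremented per call under this key, with an expiry a little longer than a month so old keys clean themselves up.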
What happens to usage events if the MCP server restarts before flushing?
In-memory queues are lost on restart. For usage events that must not be lost (billing is a regulatory concern for many SaaS products), persist usage events to a durable store before enqueuing for billing: write each usage event to a PostgreSQL usage_events table as it occurs, then batch-report to Stripe from that table using a background job. The table acts as a durable audit log and retry buffer. For less critical metering (analytics, dashboards), in-memory queuing with a short flush interval (30 seconds) is usually acceptable — the data loss window is small and the operational simplicity is worth it.
How do I handle plan upgrades mid-window?
Cache tenant plan lookups with a short TTL (60 seconds) rather than indefinitely. When a tenant upgrades from Free to Pro, the stale cache entry expires within 60 seconds and the next tool call after that loads the new limit. For immediate effect (a tenant complains they're still blocked after upgrading), expose a cache invalidation endpoint your billing webhook can call: POST /internal/tenants/:id/invalidate-plan-cache. Stripe sends a customer.subscription.updated webhook when a plan changes; wire it to call your invalidation endpoint and the tenant's new limit takes effect within seconds. Calls already recorded in the current sliding window still count against the new, higher limit.
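A TTL cache with explicit invalidation can be quite small. The class below is a sketch under assumptions: `PlanCache` and its method names are hypothetical, and the clock is injectable only so that expiry can be tested deterministically.

```typescript
// Small TTL cache for tenant plans with explicit invalidation.
export class PlanCache {
  private readonly entries = new Map<string, { plan: string; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 60_000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached plan, or undefined when missing or expired.
  get(tenantId: string): string | undefined {
    const entry = this.entries.get(tenantId);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(tenantId);
      return undefined;
    }
    return entry.plan;
  }

  set(tenantId: string, plan: string): void {
    this.entries.set(tenantId, { plan, expiresAt: this.now() + this.ttlMs });
  }

  // Called from the billing webhook handler after a plan change.
  invalidate(tenantId: string): void {
    this.entries.delete(tenantId);
  }
}
```

A plan lookup would try `cache.get(tenantId)` first and fall back to the database on `undefined`, storing the result with `cache.set`. The invalidation endpoint only needs to call `cache.invalidate(id)`; in a multi-instance deployment each instance holds its own cache, so the invalidation has to reach all of them or the TTL remains the upper bound.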
How do I meter tool calls that return streaming responses?
Count the tool call at initiation (before the first chunk), not at completion. This prevents a tenant from starting 1000 concurrent streaming calls that each take 60 seconds before the counter is incremented — the quota is checked and consumed at request start. For streaming tools where cost is proportional to output tokens (e.g., a tool that calls an LLM internally), you need a two-phase approach: consume one quota unit at start (reservation), then adjust based on actual token count at completion. This requires a separate "settlement" mechanism and is only worth implementing if the per-call cost variance is large enough to matter for billing accuracy.
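The settlement step of the two-phase approach reduces to simple arithmetic once a conversion rate is chosen. The sketch below assumes a hypothetical rate of one quota unit per 1,000 output tokens and a minimum charge of one unit; both the rate and the function names are illustrative, not taken from this guide.

```typescript
// Converts output tokens to quota units, rounding up, with a one-unit minimum.
// Assumption: 1 unit per 1,000 tokens is an illustrative rate only.
export function unitsForTokens(tokens: number, tokensPerUnit = 1_000): number {
  return Math.max(1, Math.ceil(tokens / tokensPerUnit));
}

// Units still owed at completion, given what was reserved at the start.
// Zero means the reservation already covered the call.
export function settlementDelta(
  reservedUnits: number,
  actualTokens: number,
  tokensPerUnit = 1_000,
): number {
  return unitsForTokens(actualTokens, tokensPerUnit) - reservedUnits;
}
```

For example, a call that reserved one unit and produced 3,500 tokens settles with three additional units, which would then be recorded in the same counter and enqueued as a second usage event.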
Can I use this metering pattern with serverless MCP deployments?
Yes, but the in-memory event queue doesn't survive between function invocations. In serverless environments (Lambda, Vercel Functions, Cloudflare Workers), write usage events synchronously to a durable store (DynamoDB, PostgreSQL, Upstash Redis) as part of the tool call; don't rely on a background flush loop that won't run after the invocation ends. Redis sliding window counters work well in serverless because each invocation reads and writes to the same shared Redis instance. Use Upstash Redis (HTTP-based, works in edge runtimes) rather than traditional Redis clients that require persistent TCP connections.