MCP server token budget

When an MCP server calls an upstream LLM API (Anthropic, OpenAI, Gemini) on behalf of tenants, each tool call can cost $0.001–$0.10 in token fees. Without budget enforcement, a single poorly-prompted LLM session — one that calls a tool in a loop or generates an enormous context — can exhaust a month's budget in minutes. Token budget enforcement at the MCP server layer is the last line of defense: it is independent of the client, cannot be overridden by a prompt injection, and applies consistently across every MCP client that connects to your server.

TL;DR

Identify the tenant from the MCP connection context (API key, OAuth token, or session header). Before executing any tool that calls an upstream LLM, check the tenant's usage against their monthly quota in SQLite. If over quota, return isError: true with a budget-exceeded message. After each successful call, record the tokens consumed (actual counts from the API response where available, estimates otherwise). Expose a check_budget tool so LLMs can check remaining budget before starting expensive operations. Quotas reset implicitly because usage is summed from the start of the current billing period; a nightly cron only prunes old usage events. Use soft limits (warn at 80%) and hard limits (block at 100%) to prevent surprise overage.

Why enforce budgets at the MCP layer

Several layers could enforce token budgets: the LLM client (Claude Desktop, a custom agent), the LLM API itself (Anthropic usage limits), or the MCP server. The MCP server is the right layer for three reasons:

  1. It is the only layer you control when serving multiple clients with different client implementations. You cannot modify Claude Desktop's behavior; you can modify your server.
  2. It is prompt-injection resistant. A user cannot instruct the LLM to bypass a budget check in the server's tool handler via a system prompt or conversation message — the check happens in server code, not in the LLM's reasoning.
  3. It is the only layer that has context about upstream cost. The MCP client does not know what each tool call costs internally. The server that calls the upstream LLM knows the token counts from the API response and can record them accurately.

Database schema

Two tables: tenants (quota configuration) and usage_events (individual call records). This separation allows quota changes without touching usage history, and allows analytics on usage patterns without touching quota enforcement.

-- SQLite schema
CREATE TABLE IF NOT EXISTS tenants (
  id              TEXT PRIMARY KEY,          -- API key or org ID
  name            TEXT NOT NULL,
  monthly_quota   INTEGER NOT NULL DEFAULT 100000,   -- token limit per month (Free plan)
  soft_limit_pct  REAL NOT NULL DEFAULT 0.8,          -- warn at 80%
  plan            TEXT NOT NULL DEFAULT 'free',       -- free | pro | team | enterprise
  created_at      TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
  reset_day       INTEGER DEFAULT 1          -- day of month to reset quota
);

CREATE TABLE IF NOT EXISTS usage_events (
  id              INTEGER PRIMARY KEY AUTOINCREMENT,
  tenant_id       TEXT NOT NULL REFERENCES tenants(id),
  tool_name       TEXT NOT NULL,
  tokens_input    INTEGER NOT NULL DEFAULT 0,
  tokens_output   INTEGER NOT NULL DEFAULT 0,
  tokens_total    INTEGER NOT NULL DEFAULT 0,
  recorded_at     TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
);

CREATE INDEX IF NOT EXISTS usage_tenant_month
  ON usage_events(tenant_id, recorded_at);  -- fast monthly sum queries

Budget check middleware

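Wrap tool execution with a budget check function, and call it before executing any tool that incurs upstream cost. The check is synchronous because better-sqlite3 exposes a synchronous API; with the index on (tenant_id, recorded_at), it should add very little latency per tool call (on the order of a millisecond or less for typical data volumes).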

// src/budget.ts
import type Database from 'better-sqlite3';

export interface BudgetStatus {
  tenant_id:     string;
  quota:         number;
  used_this_month: number;
  remaining:     number;
  pct_used:      number;
  over_hard_limit: boolean;
  over_soft_limit: boolean;
  reset_day:     number;
}

export function getBudgetStatus(db: Database.Database, tenantId: string): BudgetStatus {
  const tenant = db.prepare('SELECT * FROM tenants WHERE id = ?').get(tenantId) as {
    monthly_quota: number; soft_limit_pct: number; reset_day: number;
  } | undefined;

  if (!tenant) throw new Error(`Unknown tenant: ${tenantId}`);

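  const now = new Date();
  // Calculate the start of the current billing period in UTC, because
  // recorded_at is stored as a UTC timestamp. A reset_day past the end of a
  // short month is clamped to that month's last day.
  const resetDay = tenant.reset_day;
  const daysIn = (y: number, m: number) => new Date(Date.UTC(y, m + 1, 0)).getUTCDate();
  const startOf = (y: number, m: number) =>
    new Date(Date.UTC(y, m, Math.min(resetDay, daysIn(y, m))));
  const y = now.getUTCFullYear();
  const m = now.getUTCMonth();
  let periodStart = startOf(y, m);
  if (periodStart > now) periodStart = startOf(y, m - 1); // period began last month

  // Match the stored timestamp format (no milliseconds) for the string comparison.
  const periodStartIso = periodStart.toISOString().replace(/\.\d{3}Z$/, 'Z');

  const { total } = db.prepare(`
    SELECT COALESCE(SUM(tokens_total), 0) as total
    FROM usage_events
    WHERE tenant_id = ? AND recorded_at >= ?
  `).get(tenantId, periodStartIso) as { total: number };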

  const pct = total / tenant.monthly_quota;
  return {
    tenant_id:       tenantId,
    quota:           tenant.monthly_quota,
    used_this_month: total,
    remaining:       Math.max(0, tenant.monthly_quota - total),
    pct_used:        pct,
    over_hard_limit: pct >= 1.0,
    over_soft_limit: pct >= tenant.soft_limit_pct,
    reset_day:       resetDay,
  };
}

export function recordUsage(
  db: Database.Database,
  tenantId: string,
  toolName: string,
  tokensInput: number,
  tokensOutput: number
): void {
  db.prepare(`
    INSERT INTO usage_events (tenant_id, tool_name, tokens_input, tokens_output, tokens_total)
    VALUES (?, ?, ?, ?, ?)
  `).run(tenantId, toolName, tokensInput, tokensOutput, tokensInput + tokensOutput);
}
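
The "wrap the tool execution" idea can also be factored into a reusable higher-order function. The following is a minimal sketch, not part of the module above: the names ToolResult, BudgetGate, MeteredHandler, withBudget, and makeMemoryGate are hypothetical, and the in-memory gate stands in for getBudgetStatus and recordUsage so the example runs without SQLite.

```typescript
// Hypothetical shapes for illustration; they are not part of the MCP SDK.
interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

interface BudgetGate {
  // Returns true when the tenant has reached the hard limit.
  isOverHardLimit(tenantId: string): boolean;
  // Records tokens consumed by one completed call.
  record(tenantId: string, toolName: string, tokens: number): void;
}

// A metered handler returns its result plus the tokens the upstream call used.
type MeteredHandler<A> = (args: A) => Promise<{ result: ToolResult; tokens: number }>;

// Wrap a handler so the budget is checked before it runs and usage is
// recorded only after it succeeds.
function withBudget<A>(
  gate: BudgetGate,
  tenantId: string,
  toolName: string,
  handler: MeteredHandler<A>,
): (args: A) => Promise<ToolResult> {
  return async (args: A): Promise<ToolResult> => {
    if (gate.isOverHardLimit(tenantId)) {
      return { content: [{ type: 'text', text: 'Budget exceeded.' }], isError: true };
    }
    const { result, tokens } = await handler(args);
    gate.record(tenantId, toolName, tokens);
    return result;
  };
}

// In-memory gate for demonstration: one counter checked against a quota.
function makeMemoryGate(quota: number): BudgetGate & { used(): number } {
  let usedTokens = 0;
  return {
    isOverHardLimit: () => usedTokens >= quota,
    record: (_tenantId: string, _toolName: string, tokens: number) => {
      usedTokens += tokens;
    },
    used: () => usedTokens,
  };
}
```

In a real server the gate would delegate to getBudgetStatus and recordUsage. Keeping the check and the recording in one wrapper makes it harder to add a new upstream-calling tool that forgets one of the two steps.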

Wiring budget enforcement into tool handlers

Identify the tenant from the MCP connection context. A tool-call request carries no tenant identity of its own, so the server has to derive it from the connection. Common patterns are: an API key in the server environment (single-tenant), an API key or OAuth token presented at connection time, for example in a custom header (HTTP transport), or a tenant ID derived from the process environment (one server process per tenant).

// src/tools/summarize.ts — example tool that calls an upstream LLM
import type { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { CallToolRequestSchema, ListToolsRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
import { getBudgetStatus, recordUsage } from '../budget.js';
import type { Deps } from '../deps.js';

const SummarizeSchema = z.object({
  text:          z.string().min(1).max(50000).describe('Text to summarize'),
  max_words:     z.number().int().positive().max(500).default(150),
});

export function registerSummarizeTool(server: Server, deps: Deps) {
  const anthropic = new Anthropic({ apiKey: deps.anthropicApiKey });

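  // Note: the low-level Server keeps one handler per request type, and a later
  // setRequestHandler call replaces an earlier one. A server with several
  // tools should dispatch all of them from a single CallTool handler.
  server.setRequestHandler(CallToolRequestSchema, async (request) => {
    if (request.params.name !== 'summarize') {
      throw new Error(`Unknown tool: ${request.params.name}`);
    }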

    // 1. Identify tenant (here: from env; in multi-tenant: from connection auth)
    const tenantId = deps.tenantId;

    // 2. Budget pre-check
    const budget = getBudgetStatus(deps.db, tenantId);
    if (budget.over_hard_limit) {
      return {
        content: [{
          type: 'text',
          text: [
            `Budget exceeded: you have used ${budget.used_this_month.toLocaleString()} of ${budget.quota.toLocaleString()} tokens this month (${Math.round(budget.pct_used * 100)}%).`,
            `Quota resets on day ${budget.reset_day} of each month.`,
            `To increase your quota, upgrade your plan at https://alivemcp.com/#pricing.`,
          ].join(' '),
        }],
        isError: true,
      };
    }

    // 3. Soft-limit warning (include in successful response, don't block)
    const softLimitWarning = budget.over_soft_limit
      ? `[Note: ${Math.round(budget.pct_used * 100)}% of monthly token budget used — ${budget.remaining.toLocaleString()} tokens remaining.]`
      : null;

    // 4. Validate inputs
    const parsed = SummarizeSchema.safeParse(request.params.arguments);
    if (!parsed.success) {
      return { content: [{ type: 'text', text: parsed.error.message }], isError: true };
    }

    // 5. Execute the upstream LLM call
    const response = await anthropic.messages.create({
      model:      'claude-haiku-4-5-20251001',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: `Summarize the following text in at most ${parsed.data.max_words} words:\n\n${parsed.data.text}`,
      }],
    });

    // 6. Record actual token usage from the API response
    const usage = response.usage;
    recordUsage(deps.db, tenantId, 'summarize', usage.input_tokens, usage.output_tokens);

    const summary = response.content[0].type === 'text' ? response.content[0].text : '';
    const content = softLimitWarning ? `${softLimitWarning}\n\n${summary}` : summary;
    return { content: [{ type: 'text', text: content }] };
  });
}
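
For the multi-tenant HTTP case, step 1 has to map connection credentials to a tenant ID. A minimal sketch of that lookup follows, under stated assumptions: the header names (Authorization with a Bearer token, or x-api-key), the ApiKeyDirectory type, and both function names are hypothetical, and the directory map stands in for a query against the tenants table or an auth service. Header names are assumed to be lowercased already, as Node's HTTP server does for incoming requests.

```typescript
// Hypothetical lookup table: API key -> tenant ID.
type ApiKeyDirectory = ReadonlyMap<string, string>;

// Extract an API key from "Authorization: Bearer <key>" or from "x-api-key".
// Both header choices are assumptions for this sketch.
function extractApiKey(headers: Record<string, string | undefined>): string | null {
  const auth = headers['authorization'];
  if (auth) {
    const match = /^Bearer\s+(\S+)$/i.exec(auth.trim());
    if (match) return match[1] ?? null;
  }
  const key = headers['x-api-key'];
  return key && key.trim() !== '' ? key.trim() : null;
}

// Resolve the tenant for a connection, or null when the key is missing or
// unknown. Callers should reject the connection on null rather than fall
// back to a default tenant.
function resolveTenantId(
  headers: Record<string, string | undefined>,
  directory: ApiKeyDirectory,
): string | null {
  const key = extractApiKey(headers);
  if (key === null) return null;
  return directory.get(key) ?? null;
}
```

Resolving the tenant once per connection and passing it through deps keeps the tool handlers unaware of how authentication works.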

The check_budget tool

Expose a check_budget tool so LLMs can query remaining budget before starting expensive operations. This allows the LLM to warn the user proactively ("You have 12% of your monthly budget remaining — this operation will use approximately 5%") rather than failing mid-task when the budget is exhausted.

// src/tools/budget.ts
import { getBudgetStatus } from '../budget.js';
import type { Deps } from '../deps.js';

// Add this definition to the ListTools response.
export const checkBudgetTool = {
  name: 'check_budget',
  description: 'Check remaining token budget for this month.',
  inputSchema: { type: 'object' as const, properties: {} },
};

// Call this from the CallTool handler when request.params.name === 'check_budget'.
export function handleCheckBudget(deps: Deps) {
  const status = getBudgetStatus(deps.db, deps.tenantId);
  return {
    content: [{
      type: 'text' as const,
      text: JSON.stringify({
        quota:            status.quota,
        used_this_month:  status.used_this_month,
        remaining:        status.remaining,
        pct_used:         Math.round(status.pct_used * 100),
        status:           status.over_hard_limit ? 'exhausted'
                        : status.over_soft_limit ? 'warning'
                        : 'ok',
        resets_on_day:    status.reset_day,
      }),
    }],
  };
}

The check_budget tool has no required arguments and is free to call (it reads SQLite, not an upstream API). Instruct the LLM to call it at the start of any multi-step workflow that involves heavy tool use.

Estimating token counts when the upstream API doesn't return them

When the upstream API does not return a usage object (some APIs, some response streaming modes, or tools that call non-LLM APIs), estimate token counts for budget accounting. A rough but consistent estimate is sufficient — budget enforcement does not need to be exact to the token.

| Scenario | Estimation approach |
| --- | --- |
| Anthropic / OpenAI API with usage object | Use the counts in the usage object exactly (e.g., response.usage.input_tokens + response.usage.output_tokens for Anthropic) |
| Streaming API response | Accumulate the usage metadata reported in the stream's delta events, or estimate from character count (chars / 4 ≈ tokens) |
| Non-LLM API (web search, database, etc.) | Charge a fixed "administrative" token cost per call (e.g., 100 tokens) to account for context overhead |
| Tool that returns large text (web page, document) | Estimate output tokens from response length: Math.ceil(text.length / 4) |

Document your estimation methodology in a comment near the recordUsage() call so you can audit it later: // Estimating ~150 tokens overhead per web fetch — actual LLM context cost billed to client.
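
The rules in the table can be collected into one helper so every tool estimates the same way. This is a minimal sketch: the function names, the ReportedUsage shape, and the two constants are assumptions for illustration (the 4 characters per token and 100 token figures are the example values from the table, not measured ones).

```typescript
// Rough heuristic from the table: about four characters per token.
const CHARS_PER_TOKEN = 4;
// Fixed charge for calls with no text to measure; 100 is the table's example value.
const FIXED_CALL_TOKENS = 100;

interface ReportedUsage {
  input_tokens: number;
  output_tokens: number;
}

// Note: text.length counts UTF-16 code units, which is close enough for a
// budget estimate but is not a real tokenizer.
function estimateTokensFromText(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Prefer exact usage when the upstream API reports it; otherwise estimate
// from text lengths; otherwise fall back to the fixed per-call charge.
function tokensForCall(opts: {
  usage?: ReportedUsage;
  inputText?: string;
  outputText?: string;
}): { input: number; output: number; estimated: boolean } {
  if (opts.usage) {
    return { input: opts.usage.input_tokens, output: opts.usage.output_tokens, estimated: false };
  }
  if (opts.inputText === undefined && opts.outputText === undefined) {
    return { input: FIXED_CALL_TOKENS, output: 0, estimated: true };
  }
  return {
    input: estimateTokensFromText(opts.inputText ?? ''),
    output: estimateTokensFromText(opts.outputText ?? ''),
    estimated: true,
  };
}
```

Returning the estimated flag alongside the counts makes it possible to store it with the usage event later, so audits can separate measured usage from estimates.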

Quota reset — nightly cron

Quotas reset at the start of the tenant's billing period (e.g., the 1st of each month). No reset job is needed: rather than deleting or flagging usage events, getBudgetStatus sums only the events recorded since periodStart, so usage from earlier billing periods stops counting as soon as a new period begins.
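
To make the implicit reset concrete, here is a minimal standalone sketch of a period-start calculation in UTC. The helper names (daysInUtcMonth, billingPeriodStart) and the rule that a reset day past the end of a short month is clamped to that month's last day are assumptions for illustration; the guide itself does not specify how short months are handled.

```typescript
// Number of days in a UTC month (month is 0-based). Day 0 of the next month
// is the last day of this one.
function daysInUtcMonth(year: number, month: number): number {
  return new Date(Date.UTC(year, month + 1, 0)).getUTCDate();
}

// Start of the billing period that contains `now`, for a tenant whose quota
// resets on `resetDay` (1-31).
function billingPeriodStart(now: Date, resetDay: number): Date {
  const y = now.getUTCFullYear();
  const m = now.getUTCMonth();
  const thisMonth = new Date(Date.UTC(y, m, Math.min(resetDay, daysInUtcMonth(y, m))));
  if (thisMonth.getTime() <= now.getTime()) return thisMonth;
  // The reset day has not arrived yet this month, so the period began last
  // month. Date.UTC normalizes month -1 to December of the previous year.
  return new Date(Date.UTC(y, m - 1, Math.min(resetDay, daysInUtcMonth(y, m - 1))));
}

// Examples:
// billingPeriodStart(new Date('2025-03-15T12:00:00Z'), 20).toISOString()
//   is '2025-02-20T00:00:00.000Z' (the 20th has not arrived yet in March)
// billingPeriodStart(new Date('2025-03-01T00:00:00Z'), 31).toISOString()
//   is '2025-02-28T00:00:00.000Z' (reset day clamped to the end of February)
```

Because the period start is derived from the current time on every check, a tenant's usage drops to zero at the reset boundary without any scheduled job touching the data.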

The only maintenance task is ensuring that very old usage events don't slow down the aggregate query. Archive or delete events older than 13 months (a full year of history plus one month of margin) with a nightly script:

// scripts/archive-usage.ts
import { openDb } from '../src/db.js';

const db = openDb();
const cutoff = new Date();
cutoff.setMonth(cutoff.getMonth() - 13);

const { changes } = db.prepare(
  'DELETE FROM usage_events WHERE recorded_at < ?'
).run(cutoff.toISOString());

console.log(`Deleted ${changes} usage events older than ${cutoff.toISOString()}`);
db.close();

Run this as a system cron job or a GitHub Actions scheduled workflow: 0 3 * * * tsx /app/scripts/archive-usage.ts.

Plan tiers and quota configuration

Different plans get different monthly quotas. Set quotas in the tenants table when a new tenant is onboarded or when their plan changes. Quota changes take effect immediately (the next getBudgetStatus call reads the new value).

| Plan | Monthly token quota | Equivalent usage |
| --- | --- | --- |
| Free | 100,000 tokens | ~500 tool calls at 200 tokens avg |
| Pro ($9/mo) | 1,000,000 tokens | ~5,000 tool calls |
| Team ($49/mo) | 10,000,000 tokens | ~50,000 tool calls |
| Enterprise | Custom (negotiated) | Unlimited practical usage |

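Choose quota numbers based on your upstream API cost per token and your target gross margin per plan. As an illustration, take a small model priced at $0.25/MTok input and $1.25/MTok output (Claude 3 Haiku's list price; newer Haiku models cost more, so check current pricing for the model you actually call). At a blended rate of about $0.50/MTok, 1M tokens costs $0.50 in upstream fees. A $9 Pro plan with a 1M token quota then yields roughly an 18× markup, a comfortable margin for a bootstrapped product.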
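
The margin arithmetic is easy to rerun for other prices and quotas. A minimal sketch follows, with hypothetical function names and the paragraph's illustrative prices as inputs; the 25% output share used in the example is an assumption that happens to reproduce the $0.50 blended figure.

```typescript
// Upstream cost in USD for a number of tokens at a per-million-token rate.
function upstreamCostUsd(tokens: number, usdPerMTok: number): number {
  return (tokens / 1e6) * usdPerMTok;
}

// Blended rate for a given input/output mix; outputShare is between 0 and 1.
function blendedRate(
  inputUsdPerMTok: number,
  outputUsdPerMTok: number,
  outputShare: number,
): number {
  return inputUsdPerMTok * (1 - outputShare) + outputUsdPerMTok * outputShare;
}

// Plan price divided by the upstream cost of a fully consumed quota.
function markupMultiple(planPriceUsd: number, quotaTokens: number, usdPerMTok: number): number {
  return planPriceUsd / upstreamCostUsd(quotaTokens, usdPerMTok);
}

// Worked example with the illustrative prices from the text:
const exampleRate = blendedRate(0.25, 1.25, 0.25);        // 0.5 USD per MTok
const proMarkup = markupMultiple(9, 1_000_000, exampleRate);   // 18
const teamMarkup = markupMultiple(49, 10_000_000, exampleRate); // about 9.8
```

The Team tier's markup is lower than the Pro tier's at these numbers, which is worth checking against your target margin whenever upstream prices or the input/output mix change.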

Related questions

Should I enforce token budgets at the MCP server or at the LLM API level?

Both, with different roles. LLM API-level limits (Anthropic spend limits, OpenAI usage tiers) are a safety net that prevents runaway spend in absolute dollar terms — they should be set conservatively as a fallback. MCP server budgets are your product's enforcement mechanism: they map to your pricing tiers, reset on your billing cycle, and return errors with context that matches your product's terminology. API-level limits don't know which of your tenants is responsible for the spend; your server does.

How do I handle budget enforcement for streaming tool responses?

For streaming LLM responses, collect the usage metadata from the final stream event (Anthropic sends message_delta with usage at stream end). Record usage after the stream closes, not before. For the budget pre-check, estimate the maximum likely cost before starting the stream: output is bounded by max_tokens, so with a 2,000-token input and max_tokens set to 1,024 the worst case is 3,024 tokens total. Check that the remaining budget covers this estimate; if it does not, return a budget error before starting the stream.
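
The pre-check reduces to one comparison. A minimal sketch, assuming the output cap (the max_tokens request parameter) is used as the worst-case output size; the function names are hypothetical.

```typescript
// Worst-case tokens for one call: the full input plus the output cap.
function worstCaseTokens(inputTokens: number, maxOutputTokens: number): number {
  return inputTokens + maxOutputTokens;
}

// True when the remaining budget covers the worst case, so the stream can
// start without risking exhaustion mid-response.
function canStartStream(
  remainingTokens: number,
  inputTokens: number,
  maxOutputTokens: number,
): boolean {
  return remainingTokens >= worstCaseTokens(inputTokens, maxOutputTokens);
}

// Example: a 2,000-token input with an output cap of 1,024 tokens needs
// 3,024 tokens of headroom.
const streamAllowed = canStartStream(5000, 2000, 1024); // true
const streamBlocked = canStartStream(3000, 2000, 1024); // false
```

This check is conservative: most responses stop well short of the cap, so a tenant near their limit may be refused a call that would have fit. That is usually preferable to cutting off a stream halfway.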

What's the right granularity for budget tracking — per-user or per-org?

It depends on your billing model. If you charge per seat (each user has their own subscription), track per-user. If you charge per organization (one team subscription covers all members), track per-org/tenant. In most B2B SaaS models, per-org tracking is simpler: the org admin manages the budget, team members share the pool, and you bill the org. Even if quota enforcement is at the org level, add a user_id column to usage_events (the schema above does not include one) and record it as metadata. It enables usage analytics by member without changing the enforcement model.
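
The split between org-level enforcement and member-level analytics can be sketched with two aggregations over the same events. This is an in-memory illustration only: the MemberUsageEvent shape and both function names are hypothetical, and a real server would express the same thing as SQL over usage_events.

```typescript
// Hypothetical event shape: a usage event tagged with both tenant and user.
interface MemberUsageEvent {
  tenantId: string;
  userId: string;
  tokens: number;
}

// Enforcement view: total tokens for the whole org, compared against the quota.
function orgTokensUsed(events: readonly MemberUsageEvent[], tenantId: string): number {
  return events
    .filter((e) => e.tenantId === tenantId)
    .reduce((sum, e) => sum + e.tokens, 0);
}

// Analytics view: the same events broken down by member. This never affects
// whether a call is allowed.
function tokensByMember(
  events: readonly MemberUsageEvent[],
  tenantId: string,
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const e of events) {
    if (e.tenantId !== tenantId) continue;
    totals.set(e.userId, (totals.get(e.userId) ?? 0) + e.tokens);
  }
  return totals;
}
```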

How do I test budget enforcement without burning real tokens?

Use Anthropic's claude-haiku-4-5-20251001 with short prompts for the cheapest real API calls during testing — the cost per test run is under $0.001. For unit tests, mock the upstream API call: inject a fake anthropic client via the deps pattern that returns a fixed usage object ({ input_tokens: 100, output_tokens: 50 }) without making a real API call. Test the budget check, recording, and enforcement logic independently of the upstream API.
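
A minimal sketch of the fake-client approach follows. Everything here is hypothetical scaffolding: SummarizerClient is a deliberately narrow interface rather than the real Anthropic SDK client type, and MemoryLedger stands in for the usage_events table so the enforcement logic can be exercised without SQLite or network access.

```typescript
// Narrow slice of the client surface the tool needs (hypothetical interface).
interface SummarizerClient {
  createMessage(prompt: string): Promise<{
    text: string;
    usage: { input_tokens: number; output_tokens: number };
  }>;
}

// Fake client: returns a fixed usage object and never touches the network.
const fakeClient: SummarizerClient = {
  createMessage: async () => ({
    text: 'stub summary',
    usage: { input_tokens: 100, output_tokens: 50 },
  }),
};

// In-memory ledger standing in for the usage_events table.
class MemoryLedger {
  private total = 0;
  private readonly quota: number;
  constructor(quota: number) {
    this.quota = quota;
  }
  overHardLimit(): boolean {
    return this.total >= this.quota;
  }
  record(input: number, output: number): void {
    this.total += input + output;
  }
  used(): number {
    return this.total;
  }
}

// Logic under test: check the budget, call upstream, record usage.
async function summarizeWithBudget(
  client: SummarizerClient,
  ledger: MemoryLedger,
  text: string,
): Promise<{ ok: boolean; text: string }> {
  if (ledger.overHardLimit()) return { ok: false, text: 'Budget exceeded' };
  const res = await client.createMessage(text);
  ledger.record(res.usage.input_tokens, res.usage.output_tokens);
  return { ok: true, text: res.text };
}
```

With a quota of 300 tokens and 150 tokens per fake call, two calls succeed and the third is blocked, which exercises the check, the recording, and the hard limit in one short test.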

Further reading