MCP server token budget

When an MCP server calls an upstream LLM API (Anthropic, OpenAI, Gemini) on behalf of tenants, each tool call can cost $0.001–$0.10 in token fees. Without budget enforcement, a single poorly-prompted LLM session — one that calls a tool in a loop or generates an enormous context — can exhaust a month's budget in minutes. Token budget enforcement at the MCP server layer is the last line of defense: it is independent of the client, cannot be overridden by a prompt injection, and applies consistently across every MCP client that connects to your server.

TL;DR

Identify the tenant from the MCP connection context (API key, OAuth token, or session header). Before executing any tool that calls an upstream LLM, check the tenant's usage for the current billing period against their monthly quota in SQLite. If the tenant is over quota, return isError: true with a budget-exceeded message. After each successful call, record the token count consumed (actual when the API reports it, estimated otherwise). Expose a check_budget tool so LLMs can check remaining budget before starting expensive operations. Quotas reset implicitly because usage is summed from the start of the current billing period; a nightly cron job only prunes old usage events. Use soft limits (warn at 80%) and hard limits (block at 100%) to prevent surprise overage.

Why enforce budgets at the MCP layer

Several layers could enforce token budgets: the LLM client (Claude Desktop, a custom agent), the LLM API itself (Anthropic usage limits), or the MCP server. The MCP server is the right layer for three reasons:

It is the only layer you control when serving multiple clients with different client implementations. You cannot modify Claude Desktop's behavior; you can modify your server.
It is prompt-injection resistant. A user cannot instruct the LLM to bypass a budget check in the server's tool handler via a system prompt or conversation message — the check happens in server code, not in the LLM's reasoning.
It is the only layer that has context about upstream cost. The MCP client does not know what each tool call costs internally. The server that calls the upstream LLM knows the token counts from the API response and can record them accurately.

Database schema

Two tables: tenants (quota configuration) and usage_events (individual call records). This separation allows quota changes without touching usage history, and allows analytics on usage patterns without touching quota enforcement.

-- SQLite schema
CREATE TABLE IF NOT EXISTS tenants (
  id              TEXT PRIMARY KEY,          -- API key or org ID
  name            TEXT NOT NULL,
  monthly_quota   INTEGER NOT NULL DEFAULT 100000,   -- token limit per month (Free plan default)
  soft_limit_pct  REAL NOT NULL DEFAULT 0.8,          -- warn at 80%
  plan            TEXT NOT NULL DEFAULT 'free',       -- free | pro | team | enterprise
  created_at      TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
  reset_day       INTEGER DEFAULT 1          -- day of month to reset quota
);

CREATE TABLE IF NOT EXISTS usage_events (
  id              INTEGER PRIMARY KEY AUTOINCREMENT,
  tenant_id       TEXT NOT NULL REFERENCES tenants(id),
  tool_name       TEXT NOT NULL,
  tokens_input    INTEGER NOT NULL DEFAULT 0,
  tokens_output   INTEGER NOT NULL DEFAULT 0,
  tokens_total    INTEGER NOT NULL DEFAULT 0,
  recorded_at     TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now'))
);

CREATE INDEX IF NOT EXISTS usage_tenant_month
  ON usage_events(tenant_id, recorded_at);  -- fast monthly sum queries

Budget check middleware

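Wrap the tool execution with a budget check function. Call it before executing any tool that incurs upstream cost. The check is synchronous because better-sqlite3 exposes a synchronous API, and with the (tenant_id, recorded_at) index it typically adds on the order of a millisecond or less per tool call.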

// src/budget.ts
import type Database from 'better-sqlite3';

export interface BudgetStatus {
  tenant_id:     string;
  quota:         number;
  used_this_month: number;
  remaining:     number;
  pct_used:      number;
  over_hard_limit: boolean;
  over_soft_limit: boolean;
  reset_day:     number;
}

export function getBudgetStatus(db: Database.Database, tenantId: string): BudgetStatus {
  const tenant = db.prepare('SELECT * FROM tenants WHERE id = ?').get(tenantId) as {
    monthly_quota: number; soft_limit_pct: number; reset_day: number;
  } | undefined;

  if (!tenant) throw new Error(`Unknown tenant: ${tenantId}`);

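  // Start of the current billing period, computed in UTC to match recorded_at.
  const now = new Date();
  const resetDay = tenant.reset_day;
  // Clamp the reset day so that reset_day = 31 still lands inside short months.
  const startFor = (year: number, month: number): Date => {
    const lastDay = new Date(Date.UTC(year, month + 1, 0)).getUTCDate();
    return new Date(Date.UTC(year, month, Math.min(resetDay, lastDay)));
  };
  let periodStart = startFor(now.getUTCFullYear(), now.getUTCMonth());
  if (periodStart > now) periodStart = startFor(now.getUTCFullYear(), now.getUTCMonth() - 1);
  // Drop milliseconds so the bound uses the same format as recorded_at.
  const periodStartIso = periodStart.toISOString().replace(/\.\d{3}Z$/, 'Z');

  const { total } = db.prepare(`
    SELECT COALESCE(SUM(tokens_total), 0) as total
    FROM usage_events
    WHERE tenant_id = ? AND recorded_at >= ?
  `).get(tenantId, periodStartIso) as { total: number };

  // A zero or negative quota counts as fully used, which avoids a NaN ratio.
  const pct = tenant.monthly_quota > 0 ? total / tenant.monthly_quota : 1;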
  return {
    tenant_id:       tenantId,
    quota:           tenant.monthly_quota,
    used_this_month: total,
    remaining:       Math.max(0, tenant.monthly_quota - total),
    pct_used:        pct,
    over_hard_limit: pct >= 1.0,
    over_soft_limit: pct >= tenant.soft_limit_pct,
    reset_day:       resetDay,
  };
}

export function recordUsage(
  db: Database.Database,
  tenantId: string,
  toolName: string,
  tokensInput: number,
  tokensOutput: number
): void {
  db.prepare(`
    INSERT INTO usage_events (tenant_id, tool_name, tokens_input, tokens_output, tokens_total)
    VALUES (?, ?, ?, ?, ?)
  `).run(tenantId, toolName, tokensInput, tokensOutput, tokensInput + tokensOutput);
}
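
The billing-period arithmetic is the easiest part of getBudgetStatus to get wrong, so it can help to isolate it as a pure function and check it against known dates. The following is a minimal sketch rather than part of the module above: billingPeriodStart and toSqliteUtc are hypothetical helpers, and the sketch assumes that periods are evaluated in UTC (matching the Z-suffixed recorded_at values) and that a reset_day beyond the end of a short month clamps to that month's last day.

```typescript
// Hypothetical helper: start of the billing period that contains `now`.
// Assumption: periods are evaluated in UTC, matching recorded_at.
function billingPeriodStart(now: Date, resetDay: number): Date {
  // Reset date for a given month, clamped to that month's last day.
  const startFor = (year: number, month: number): Date => {
    const lastDay = new Date(Date.UTC(year, month + 1, 0)).getUTCDate();
    return new Date(Date.UTC(year, month, Math.min(resetDay, lastDay)));
  };
  const y = now.getUTCFullYear();
  const m = now.getUTCMonth();
  const thisMonth = startFor(y, m);
  // If this month's reset has not happened yet, the period began last month.
  // A month index of -1 rolls back to December of the previous year.
  return thisMonth.getTime() <= now.getTime() ? thisMonth : startFor(y, m - 1);
}

// Second-precision UTC string, the same format as the schema's strftime default.
function toSqliteUtc(d: Date): string {
  return d.toISOString().replace(/\.\d{3}Z$/, 'Z');
}

const midMarch = new Date('2025-03-15T12:00:00Z');
const resetOn1st = toSqliteUtc(billingPeriodStart(midMarch, 1));   // '2025-03-01T00:00:00Z'
const resetOn20th = toSqliteUtc(billingPeriodStart(midMarch, 20)); // '2025-02-20T00:00:00Z'
const resetOn31st = toSqliteUtc(billingPeriodStart(midMarch, 31)); // '2025-02-28T00:00:00Z' (clamped)
```

Keeping the bound in the same string format as recorded_at matters because SQLite compares these TEXT columns lexicographically.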

Wiring budget enforcement into tool handlers

Identify the tenant from the MCP connection context. A tool call request carries no tenant identity of its own, so the server has to derive it from the connection. Common patterns are: an API key in the server environment (single-tenant), an API key or OAuth token passed at connection time via a header (HTTP transport), or a tenant ID derived from the process environment (one server process per tenant).
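
For the header-based pattern, the lookup itself can be a small pure function. This is a sketch under stated assumptions: resolveTenantId and TenantDirectory are hypothetical names, the header map is assumed to use lowercase keys (as Node's HTTP layer provides), and how the headers reach your handler depends on the transport you use. A production server would more likely look the key up in the tenants table or an auth service, and store only hashed keys.

```typescript
// Hypothetical in-memory lookup from API key to tenant ID.
type TenantDirectory = ReadonlyMap<string, string>;

// Returns the tenant ID for an "Authorization: Bearer <key>" header, or null
// when the header is missing, malformed, or the key is unknown.
function resolveTenantId(
  headers: Record<string, string | undefined>,
  directory: TenantDirectory,
): string | null {
  const raw = headers['authorization'];
  if (!raw) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(raw.trim());
  if (!match) return null;
  return directory.get(match[1]) ?? null;
}

const directory: TenantDirectory = new Map([['key_abc123', 'tenant_acme']]);
const known = resolveTenantId({ authorization: 'Bearer key_abc123' }, directory); // 'tenant_acme'
const missing = resolveTenantId({ authorization: 'Bearer nope' }, directory);     // null
```

Returning null instead of throwing lets the caller decide whether an unidentified connection is rejected outright or mapped to a restricted default tenant.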

// src/tools/summarize.ts — example tool that calls an upstream LLM
import type { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { CallToolRequestSchema, ListToolsRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
import { getBudgetStatus, recordUsage } from '../budget.js';
import type { Deps } from '../deps.js';

const SummarizeSchema = z.object({
  text:          z.string().min(1).max(50000).describe('Text to summarize'),
  max_words:     z.number().int().positive().max(500).default(150),
});

export function registerSummarizeTool(server: Server, deps: Deps) {
  const anthropic = new Anthropic({ apiKey: deps.anthropicApiKey });

  server.setRequestHandler(CallToolRequestSchema, async (request) => {
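    // The SDK keeps one handler per request type, so a real server should route
    // every tool from a single CallTool handler. Returning undefined here would
    // not be a valid tool result, so reject unknown names explicitly.
    if (request.params.name !== 'summarize') {
      throw new Error(`Unknown tool: ${request.params.name}`);
    }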

    // 1. Identify tenant (here: from env; in multi-tenant: from connection auth)
    const tenantId = deps.tenantId;

    // 2. Budget pre-check
    const budget = getBudgetStatus(deps.db, tenantId);
    if (budget.over_hard_limit) {
      return {
        content: [{
          type: 'text',
          text: [
            `Budget exceeded: you have used ${budget.used_this_month.toLocaleString()} of ${budget.quota.toLocaleString()} tokens this month (${Math.round(budget.pct_used * 100)}%).`,
            `Quota resets on day ${budget.reset_day} of each month.`,
            `To increase your quota, upgrade your plan at https://alivemcp.com/#pricing.`,
          ].join(' '),
        }],
        isError: true,
      };
    }

    // 3. Soft-limit warning (include in successful response, don't block)
    const softLimitWarning = budget.over_soft_limit
      ? `[Note: ${Math.round(budget.pct_used * 100)}% of monthly token budget used — ${budget.remaining.toLocaleString()} tokens remaining.]`
      : null;

    // 4. Validate inputs
    const parsed = SummarizeSchema.safeParse(request.params.arguments);
    if (!parsed.success) {
      return { content: [{ type: 'text', text: parsed.error.message }], isError: true };
    }

    // 5. Execute the upstream LLM call
    const response = await anthropic.messages.create({
      model:      'claude-haiku-4-5-20251001',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: `Summarize the following text in at most ${parsed.data.max_words} words:\n\n${parsed.data.text}`,
      }],
    });

    // 6. Record actual token usage from the API response
    const usage = response.usage;
    recordUsage(deps.db, tenantId, 'summarize', usage.input_tokens, usage.output_tokens);

    const summary = response.content[0].type === 'text' ? response.content[0].text : '';
    const content = softLimitWarning ? `${softLimitWarning}\n\n${summary}` : summary;
    return { content: [{ type: 'text', text: content }] };
  });
}
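
Steps 2, 3, and 6 above are the same for every metered tool, so they can be factored into a wrapper. The following is a minimal sketch of that idea, not the guide's own implementation: withBudget, BudgetLedger, and MeteredOutput are hypothetical names, and the ledger interface stands in for getBudgetStatus and recordUsage so that the example runs without a database.

```typescript
// Hypothetical view of the budget store; in this guide it would be backed by
// getBudgetStatus and recordUsage.
interface BudgetLedger {
  status(tenantId: string): { used: number; quota: number; softLimitPct: number };
  record(tenantId: string, tool: string, tokensIn: number, tokensOut: number): void;
}

interface ToolResult {
  content: { type: 'text'; text: string }[];
  isError?: boolean;
}

// What a metered tool body returns: its text plus the tokens it consumed.
interface MeteredOutput { text: string; tokensIn: number; tokensOut: number }

async function withBudget(
  ledger: BudgetLedger,
  tenantId: string,
  tool: string,
  run: () => Promise<MeteredOutput>,
): Promise<ToolResult> {
  const { used, quota, softLimitPct } = ledger.status(tenantId);
  // Hard limit: refuse before any upstream cost is incurred.
  if (quota <= 0 || used >= quota) {
    const text = `Budget exceeded: ${used} of ${quota} tokens used this month.`;
    return { content: [{ type: 'text', text }], isError: true };
  }
  const out = await run();
  // Record only after the upstream call succeeds; a thrown error skips this.
  ledger.record(tenantId, tool, out.tokensIn, out.tokensOut);
  // Soft limit: prepend a warning but still return the result.
  const pct = Math.round((used / quota) * 100);
  const text = used / quota >= softLimitPct
    ? `[Note: ${pct}% of monthly token budget used.]\n\n${out.text}`
    : out.text;
  return { content: [{ type: 'text', text }] };
}
```

One consequence of checking before the call and recording after it is that concurrent calls from the same tenant can each pass the check and together overshoot the quota slightly. That is usually acceptable for budget enforcement, but it is worth knowing about.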

The check_budget tool

Expose a check_budget tool so LLMs can query remaining budget before starting expensive operations. This allows the LLM to warn the user proactively ("You have 12% of your monthly budget remaining — this operation will use approximately 5%") rather than failing mid-task when the budget is exhausted.

// src/tools/budget.ts
import { getBudgetStatus } from '../budget.js';
import type { Deps } from '../deps.js';

// Add this entry to the tools array returned by the ListTools handler.
export const checkBudgetTool = {
  name: 'check_budget',
  description: 'Check remaining token budget for this month.',
  inputSchema: { type: 'object' as const, properties: {} },
};

// Call this from the server's CallTool handler when
// request.params.name === 'check_budget'.
export function handleCheckBudget(deps: Deps) {
  const status = getBudgetStatus(deps.db, deps.tenantId);
  return {
    content: [{
      type: 'text' as const,
      text: JSON.stringify({
        quota:            status.quota,
        used_this_month:  status.used_this_month,
        remaining:        status.remaining,
        pct_used:         Math.round(status.pct_used * 100),
        status:           status.over_hard_limit ? 'exhausted'
                        : status.over_soft_limit ? 'warning'
                        : 'ok',
        resets_on_day:    status.reset_day,
      }),
    }],
  };
}

The check_budget tool has no required arguments and is free to call (it reads SQLite, not an upstream API). Instruct the LLM to call it at the start of any multi-step workflow that involves heavy tool use.
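
To make the proactive warning concrete, here is a sketch of the arithmetic behind a message such as "this operation will use approximately 5%". The planOperation helper and the BudgetReport type are hypothetical; they only illustrate how an agent or orchestration layer might combine the check_budget output with its own cost estimate.

```typescript
// Subset of the JSON returned by check_budget.
interface BudgetReport { quota: number; remaining: number }

// Hypothetical pre-flight check: does an operation with an estimated token
// cost fit, and what share of the monthly quota would it consume?
function planOperation(report: BudgetReport, estimatedTokens: number) {
  const pctOfQuota = report.quota > 0
    ? Math.round((estimatedTokens / report.quota) * 100)
    : 100;
  return {
    fits: estimatedTokens <= report.remaining,
    pctOfQuota,
    remainingAfter: Math.max(0, report.remaining - estimatedTokens),
  };
}

// 12% of a 1M quota remains; the operation is estimated at 50,000 tokens.
const plan = planOperation({ quota: 1_000_000, remaining: 120_000 }, 50_000);
// plan is { fits: true, pctOfQuota: 5, remainingAfter: 70000 }
```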

Estimating token counts when the upstream API doesn't return them

When the upstream API does not return a usage object (some APIs, some response streaming modes, or tools that call non-LLM APIs), estimate token counts for budget accounting. A rough but consistent estimate is sufficient — budget enforcement does not need to be exact to the token.

Scenario	Estimation approach
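Anthropic / OpenAI API with usage object	Use the reported counts exactly: `response.usage.input_tokens + response.usage.output_tokens` for Anthropic (OpenAI's Chat Completions API names the fields `prompt_tokens` and `completion_tokens`)
Streaming API response	Accumulate the usage fields that the provider emits in its stream events (names vary by provider, e.g. `usage_metadata`), or estimate from character count (`chars / 4` ≈ tokens)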
Non-LLM API (web search, database, etc.)	Charge a fixed "administrative" token cost per call (e.g., 100 tokens) to account for context overhead
Tool that returns large text (web page, document)	Estimate output tokens from response length: `Math.ceil(text.length / 4)`

Document your estimation methodology in a comment near the recordUsage() call so you can audit it later: // Estimating ~150 tokens overhead per web fetch — actual LLM context cost billed to client.
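
A minimal sketch of the fallback logic, assuming the 4-characters-per-token heuristic from the table. The names estimateTokens, tokensFor, and UsageLike are hypothetical, and the heuristic is a budgeting approximation rather than a real tokenizer.

```typescript
// Rough heuristic: about 4 characters per token.
// Assumption: good enough for budgeting, not for billing reconciliation.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Shape of a usage object when the upstream API provides one.
interface UsageLike { input_tokens?: number; output_tokens?: number }

// Prefer counts reported by the API; fall back to the estimate per side.
function tokensFor(usage: UsageLike | undefined, prompt: string, output: string) {
  return {
    input: usage?.input_tokens ?? estimateTokens(prompt),
    output: usage?.output_tokens ?? estimateTokens(output),
  };
}

const reported = tokensFor({ input_tokens: 10, output_tokens: 5 }, 'ignored', 'ignored');
// reported is { input: 10, output: 5 }
const estimated = tokensFor(undefined, 'abcdefgh', 'abcde');
// estimated is { input: 2, output: 2 } (8 / 4 = 2, and 5 / 4 rounds up to 2)
```

Falling back per side means a response that reports only one of the two counts still uses the reported figure where it exists.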

Quota reset — nightly cron

Quotas reset at the start of the tenant's billing period (e.g., the 1st of each month). No reset job is needed: usage events are never deleted or zeroed at the boundary. Instead, getBudgetStatus sums only the events recorded since periodStart, so usage from earlier periods stops counting as soon as a new period begins.

The only maintenance task is ensuring that very old usage events don't slow down the aggregate query. Archive or delete events older than 13 months (a full year of history plus one month of margin) with a nightly script:

// scripts/archive-usage.ts
import { openDb } from '../src/db.js';

const db = openDb();
const cutoff = new Date();
cutoff.setMonth(cutoff.getMonth() - 13);

const { changes } = db.prepare(
  'DELETE FROM usage_events WHERE recorded_at < ?'
).run(cutoff.toISOString());

console.log(`Deleted ${changes} usage events older than ${cutoff.toISOString()}`);
db.close();

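Run this as a system cron job on the host that holds the SQLite file, for example: 0 3 * * * tsx /app/scripts/archive-usage.ts. A GitHub Actions scheduled workflow also works, provided the runner can reach the database.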

Plan tiers and quota configuration

Different plans get different monthly quotas. Set quotas in the tenants table when a new tenant is onboarded or when their plan changes. Quota changes take effect immediately (the next getBudgetStatus call reads the new value).

Plan	Monthly token quota	Equivalent usage
Free	100,000 tokens	~500 tool calls at 200 tokens avg
Pro ($9/mo)	1,000,000 tokens	~5,000 tool calls
Team ($49/mo)	10,000,000 tokens	~50,000 tool calls
Enterprise	Custom (negotiated)	Unlimited practical usage
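
As a sketch, the table can be mirrored in code so that onboarding and plan changes write a consistent monthly_quota. PLAN_QUOTAS and quotaForPlan are hypothetical names; the numbers are the ones from the table, and Enterprise has no default because its quota is negotiated.

```typescript
type Plan = 'free' | 'pro' | 'team' | 'enterprise';

// Default monthly token quotas for the self-serve plans.
const PLAN_QUOTAS: Record<Exclude<Plan, 'enterprise'>, number> = {
  free: 100_000,
  pro: 1_000_000,
  team: 10_000_000,
};

// Returns the value to store in tenants.monthly_quota for a plan.
function quotaForPlan(plan: Plan, negotiatedQuota?: number): number {
  if (plan === 'enterprise') {
    if (negotiatedQuota === undefined) {
      throw new Error('Enterprise plans need a negotiated quota');
    }
    return negotiatedQuota;
  }
  return PLAN_QUOTAS[plan];
}

const proQuota = quotaForPlan('pro');                            // 1000000
const enterpriseQuota = quotaForPlan('enterprise', 250_000_000); // 250000000
```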

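Choose quota numbers based on your upstream API cost per token and your target gross margin per plan. As an illustration, at Claude 3 Haiku's list prices of $0.25/MTok input and $1.25/MTok output, a mix of roughly three input tokens per output token blends to about $0.50/MTok, so 1M tokens costs about $0.50 in upstream fees. A $9 Pro plan with a 1M token quota is then priced at roughly 18× its worst-case upstream cost, a comfortable margin for a bootstrapped product. Newer or larger models cost more per token, so recompute the blended rate from the current price list of the model you actually call.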
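
The margin arithmetic can be written out as a short worked example. The function names are hypothetical, and the 75% input share is an assumption chosen because it reproduces the roughly $0.50/MTok blended figure; substitute your own measured mix and current prices.

```typescript
// Blended price per million tokens, given the share of tokens that are input.
function blendedPricePerMTok(inputPrice: number, outputPrice: number, inputShare: number): number {
  return inputShare * inputPrice + (1 - inputShare) * outputPrice;
}

// Worst-case upstream cost in USD if a tenant consumes the whole quota.
function upstreamCost(quotaTokens: number, pricePerMTok: number): number {
  return (quotaTokens / 1_000_000) * pricePerMTok;
}

const blended = blendedPricePerMTok(0.25, 1.25, 0.75); // 0.5 USD per MTok
const cost = upstreamCost(1_000_000, blended);         // 0.5 USD for the full Pro quota
const priceMultiple = 9 / cost;                        // 18
```

Because the multiple assumes full quota consumption, tenants who use less than their quota push the realized margin higher.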