MCP server Anthropic API integration

Calling the Anthropic Claude API from an MCP tool has three sharp edges specific to this integration: the Messages API's content block structure (text, tool_use, and tool_result blocks are typed differently from OpenAI's format), prompt caching via cache_control (a misplaced cache marker pays the write surcharge without ever producing a hit; correct usage cuts the input cost of the cached prefix by up to 90%), and nested tool use (Claude can call tools that are themselves your MCP tools, creating a recursion risk). This guide covers the production patterns for each.

TL;DR

Install @anthropic-ai/sdk and initialize with ANTHROPIC_API_KEY. Buffer all streaming responses with stream.finalMessage() — MCP tools can't stream mid-call. Apply cache_control: { type: "ephemeral" } to your system prompt and any large static context blocks to enable prompt caching; check usage.cache_read_input_tokens to verify hits. Track response.usage.input_tokens + output_tokens per call and return isError: true with token counts before calls that would overflow context. Handle overloaded_error (529) with exponential backoff — it's a temporary capacity signal, not a billing error. Wire AliveMCP on your MCP endpoint to distinguish Claude API outages from tool-level errors.

SDK setup and authentication

npm install @anthropic-ai/sdk zod

import Anthropic from "@anthropic-ai/sdk";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  maxRetries: 2,   // SDK default; retries connection errors, 429, and 5xx (including 529)
  timeout: 60_000, // ms — Claude can take 30-60s on long generations
});

const server = new McpServer({ name: "claude-tools", version: "1.0.0" });

The SDK is initialized once at module level and is safe to reuse across tool calls. The maxRetries default of 2 covers transient connection failures, 429 rate limits, and 5xx errors (including 529). For high-throughput tools, set maxRetries: 0 and implement your own backoff (see the error handling section below) so you can log retry attempts and respect per-minute token limits.

Model selection

The current Claude model family as of mid-2026:

Model	Model ID	Context window	Best for
Claude Opus 4.7	`claude-opus-4-7`	200,000 tokens	Complex multi-step reasoning, high-stakes analysis
Claude Sonnet 4.6	`claude-sonnet-4-6`	200,000 tokens	Balanced performance, production workloads
Claude Haiku 4.5	`claude-haiku-4-5-20251001`	200,000 tokens	Fast, cheap, high-volume classification and extraction

Pass the model ID as a tool argument rather than hardcoding it. Expose a list_claude_models tool that returns the table above so the calling agent can select the right model for cost-vs-quality tradeoffs without needing to know model IDs out of band.
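
A minimal sketch of the data a list_claude_models tool could return, assuming the table above is the single source of truth. The ModelInfo type, the MODEL_CATALOG constant, and both helper functions are hypothetical names introduced for illustration; a real handler would place the returned JSON string in an MCP text content block.

```typescript
// Hypothetical catalog mirroring the model table above; update it when the table changes.
interface ModelInfo {
  name: string;
  id: string;
  contextWindow: number; // tokens
  bestFor: string;
}

const MODEL_CATALOG: readonly ModelInfo[] = [
  {
    name: "Claude Opus 4.7",
    id: "claude-opus-4-7",
    contextWindow: 200_000,
    bestFor: "Complex multi-step reasoning, high-stakes analysis",
  },
  {
    name: "Claude Sonnet 4.6",
    id: "claude-sonnet-4-6",
    contextWindow: 200_000,
    bestFor: "Balanced performance, production workloads",
  },
  {
    name: "Claude Haiku 4.5",
    id: "claude-haiku-4-5-20251001",
    contextWindow: 200_000,
    bestFor: "Fast, cheap, high-volume classification and extraction",
  },
];

// Returns the JSON string a list_claude_models handler could put in a text content block.
function listClaudeModels(): string {
  return JSON.stringify({ models: MODEL_CATALOG });
}

// Looks up a model by ID so other tools can validate a caller-supplied model argument.
function findModel(id: string): ModelInfo | undefined {
  return MODEL_CATALOG.find((m) => m.id === id);
}
```

Keeping the catalog in one constant means the z.enum lists in the tool schemas and the list_claude_models output can be derived from the same data instead of drifting apart.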

Messages API tool with prompt caching

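The Anthropic Messages API structures content as typed blocks. The most important non-obvious rule: cache_control marks the end of a cacheable prefix. Everything up to and including the marked block is cached, and nothing after it is, so place the marker on the last system prompt block or user turn whose content stays identical between calls. A marker in the middle of a long system prompt leaves the rest of the prompt uncached and billed at the full input rate on every call.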

server.tool(
  "ask_claude",
  "Send a message to Claude and return the response. Uses prompt caching for repeated system prompts.",
  {
    model: z.enum(["claude-opus-4-7", "claude-sonnet-4-6", "claude-haiku-4-5-20251001"])
      .default("claude-sonnet-4-6"),
    system_prompt: z.string().optional()
      .describe("System context; cached when it meets the model's minimum cacheable length (1024 tokens on many models)"),
    user_message: z.string(),
    max_tokens: z.number().int().min(1).max(8192).default(1024),
  },
  async ({ model, system_prompt, user_message, max_tokens }) => {
    try {
      const response = await anthropic.messages.create({
        model,
        max_tokens,
        // Mark the system prompt as a cache boundary if present.
        // Prefixes shorter than the model's minimum (1024 tokens on many models) are sent uncached.
        system: system_prompt
          ? [{ type: "text", text: system_prompt, cache_control: { type: "ephemeral" } }]
          : undefined,
        messages: [{ role: "user", content: user_message }],
      });

      // Record usage so get_token_usage reflects this call as well
      tokenBudget.record(response.usage);

      const textBlock = response.content.find(b => b.type === "text");
      const replyText = textBlock?.type === "text" ? textBlock.text : "(no text response)";

      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            reply: replyText,
            model: response.model,
            stop_reason: response.stop_reason,
            usage: {
              input_tokens: response.usage.input_tokens,
              output_tokens: response.usage.output_tokens,
              cache_read_input_tokens: response.usage.cache_read_input_tokens ?? 0,
              cache_creation_input_tokens: response.usage.cache_creation_input_tokens ?? 0,
            },
          }),
        }],
      };
    } catch (err) {
      return anthropicErrorToToolResult(err);
    }
  }
);

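Prompt caching applies when cache_control: { type: "ephemeral" } is present and the prefix up to that block meets the model's minimum cacheable length (1024 tokens on many models; some models require more, so check the documentation for the model you call). Shorter prefixes are processed normally without caching. Cache writes cost 25% more than regular input tokens; cache reads cost 10% of the regular price. The break-even is the second identical request: the first pays 1.25x for the prefix and the second pays 0.1x, for 1.35x in total against 2x uncached. Every later hit is 90% cheaper for that prefix, provided it arrives before the cache entry expires. Check usage.cache_read_input_tokens > 0 in the response to verify a hit occurred.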
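
The break-even arithmetic can be checked with a small helper. This is a sketch that assumes the relative prices quoted above (125% for a cache write, 10% for a cache read) and assumes every request after the first arrives while the cache entry is still alive; the function name and the percentage constants are illustrative, not part of any SDK.

```typescript
// Relative prices in percent of the regular input-token price, taken from the text above.
const BASE_PCT = 100;
const CACHE_WRITE_PCT = 125; // the first request writes the prefix to the cache
const CACHE_READ_PCT = 10; // later requests read the prefix back

// Cost of sending the same prefix `requests` times, in percent of one uncached send.
// Integer percentages avoid floating-point noise in the comparison.
function prefixCostPct(requests: number, cached: boolean): number {
  if (requests <= 0) return 0;
  if (!cached) return requests * BASE_PCT;
  return CACHE_WRITE_PCT + (requests - 1) * CACHE_READ_PCT;
}
```

For a single request caching is a net loss (125 against 100), for two requests it is already ahead (135 against 200), and for ten requests the prefix costs 215 against 1000. If hits are rare because the prefix changes or the calls are spread far apart, each request pays the write surcharge again and the saving disappears.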

Token budget tracking

Claude's 200k-token context is generous but not unlimited. MCP tools that accumulate conversation history across calls can overflow it without warning: the API rejects the oversized request, and the caller sees only a failed tool call with no indication that context length was the cause. Track token usage per call and check the budget before the next call:

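// Module-level token budget tracker
const tokenBudget = {
  inputTotal: 0,   // uncached input tokens only
  outputTotal: 0,
  cacheReads: 0,
  cacheWrites: 0,

  // With caching enabled, usage.input_tokens excludes cached tokens,
  // so cache reads and writes are tracked separately.
  record(usage: Anthropic.Usage) {
    this.inputTotal += usage.input_tokens;
    this.outputTotal += usage.output_tokens;
    this.cacheReads += usage.cache_read_input_tokens ?? 0;
    this.cacheWrites += usage.cache_creation_input_tokens ?? 0;
  },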

  estimateNextCallFit(estimatedInputTokens: number, maxOutputTokens: number, modelLimit = 200_000): boolean {
    return estimatedInputTokens + maxOutputTokens <= modelLimit;
  },
};

server.tool(
  "get_token_usage",
  "Return the cumulative token usage for this MCP server session",
  {},
  async () => {
    return {
      content: [{
        type: "text",
        text: JSON.stringify({
          input_tokens_total: tokenBudget.inputTotal,
          output_tokens_total: tokenBudget.outputTotal,
          cache_reads_total: tokenBudget.cacheReads,
          estimated_cost_note: "Cache reads are ~90% cheaper than regular input tokens",
        }),
      }],
    };
  }
);
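
The pre-call check described above could look like the following sketch. It assumes a crude estimate of about four characters per token, which is a heuristic and not a real tokenizer; the Messages API also offers a token-counting endpoint that could replace the estimate when accuracy matters. The names roughTokenEstimate, checkContextFit, and ToolErrorResult are hypothetical.

```typescript
// Shape of an MCP tool error result carrying a single text payload.
interface ToolErrorResult {
  isError: true;
  content: { type: "text"; text: string }[];
}

// Rough heuristic, NOT a real tokenizer: assumes about 4 characters per token.
function roughTokenEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns an error result when the call would not fit, or null when it is safe to proceed.
function checkContextFit(
  promptText: string,
  maxOutputTokens: number,
  modelLimit = 200_000,
): ToolErrorResult | null {
  const estimatedInput = roughTokenEstimate(promptText);
  if (estimatedInput + maxOutputTokens <= modelLimit) return null;
  return {
    isError: true,
    content: [
      {
        type: "text",
        text: JSON.stringify({
          error: "Request would exceed the model context window",
          estimated_input_tokens: estimatedInput,
          max_output_tokens: maxOutputTokens,
          model_limit: modelLimit,
        }),
      },
    ],
  };
}
```

A handler would call checkContextFit before anthropic.messages.create and return the result directly when it is not null, so the calling agent sees the token counts instead of an opaque API failure.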

For the broader token budget management pattern across MCP tools, see token budget management for MCP servers.

Buffered streaming

The Anthropic SDK's streaming API uses an async iterator over MessageStreamEvents. MCP tool handlers return a complete result — they can't yield chunks mid-execution. Use stream.finalMessage() to buffer:

server.tool(
  "claude_long_generation",
  "Generate long-form content with Claude — buffers the full stream before returning",
  {
    prompt: z.string(),
    model: z.enum(["claude-opus-4-7", "claude-sonnet-4-6"]).default("claude-sonnet-4-6"),
    max_tokens: z.number().int().min(1).default(4096),
  },
  async ({ prompt, model, max_tokens }) => {
    try {
      const stream = anthropic.messages.stream({
        model,
        max_tokens,
        messages: [{ role: "user", content: prompt }],
      });

      // finalMessage() collects all stream events and returns the complete Message
      const message = await stream.finalMessage();

      const text = message.content
        .filter(b => b.type === "text")
        .map(b => b.type === "text" ? b.text : "")
        .join("");

      tokenBudget.record(message.usage);

      return {
        content: [{ type: "text", text }],
      };
    } catch (err) {
      return anthropicErrorToToolResult(err);
    }
  }
);
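
Both handlers so far pull text out of typed content blocks, and a response can contain several text blocks interleaved with non-text ones. A shared helper is one way to avoid repeating the narrowing logic. The sketch below uses a local stand-in type rather than the SDK's own union (which has more members), and collectText is a hypothetical name.

```typescript
// Minimal local stand-ins for the SDK's content block types; the real union is larger.
type ContentBlockLike =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// Concatenates every text block in order and ignores non-text blocks.
// Returns the fallback when the response contains no text at all.
function collectText(
  content: readonly ContentBlockLike[],
  fallback = "(no text response)",
): string {
  const parts: string[] = [];
  for (const block of content) {
    if (block.type === "text") parts.push(block.text);
  }
  return parts.length > 0 ? parts.join("") : fallback;
}
```

Compared with taking only the first text block, joining all of them avoids silently dropping text that follows a tool_use block in the same response.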

For more on the streaming-in-MCP pattern and when to use the two-tool start/poll approach instead, see streaming responses in MCP servers.

Nested tool use

Claude supports tool use — the model can respond with tool_use content blocks requesting a function call. When your MCP tool calls Claude with tools enabled, Claude may respond with tool use requests rather than text. You need to handle this loop:

server.tool(
  "claude_agent_loop",
  "Run a Claude agent loop that can call sub-tools (calculator, lookup) until it produces a final answer",
  {
    task: z.string(),
    max_rounds: z.number().int().min(1).max(8).default(5),
  },
  async ({ task, max_rounds }) => {
    const tools: Anthropic.Tool[] = [
      {
        name: "calculate",
        description: "Evaluate a mathematical expression",
        input_schema: {
          type: "object",
          properties: { expression: { type: "string", description: "Math expression to evaluate" } },
          required: ["expression"],
        },
      },
    ];

    const messages: Anthropic.MessageParam[] = [
      { role: "user", content: task },
    ];

    for (let round = 0; round < max_rounds; round++) {
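      let response: Anthropic.Message;
      try {
        response = await anthropic.messages.create({
          model: "claude-sonnet-4-6",
          max_tokens: 1024,
          tools,
          messages,
        });
      } catch (err) {
        // Surface API failures as a tool error, matching the other handlers
        return anthropicErrorToToolResult(err);
      }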

      tokenBudget.record(response.usage);

      if (response.stop_reason === "end_turn") {
        const textBlock = response.content.find(b => b.type === "text");
        const text = textBlock?.type === "text" ? textBlock.text : "(no answer)";
        return { content: [{ type: "text", text }] };
      }

      if (response.stop_reason === "tool_use") {
        // Add Claude's response (with tool_use blocks) to messages
        messages.push({ role: "assistant", content: response.content });

        // Execute each tool call and collect results
        const toolResults: Anthropic.ToolResultBlockParam[] = [];
        for (const block of response.content) {
          if (block.type !== "tool_use") continue;
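          let result: string;
          if (block.name === "calculate") {
            // block.input is untyped model output, so narrow it before use
            const expression = String((block.input as { expression?: unknown }).expression ?? "");
            // Function() runs arbitrary JavaScript, so only evaluate strings made
            // entirely of digits, whitespace, and arithmetic characters
            if (!/^[\d\s+\-*\/%().]+$/.test(expression)) {
              result = "Error: invalid expression";
            } else {
              try {
                result = String(Function('"use strict"; return (' + expression + ')')());
              } catch {
                result = "Error: invalid expression";
              }
            }
          } else {
            result = "Tool not found";
          }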
          toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });
        }

        messages.push({ role: "user", content: toolResults });
        continue;
      }

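      // Any other stop_reason (for example "max_tokens") means no final answer;
      // report it directly instead of blaming the round limit
      return {
        isError: true,
        content: [{ type: "text", text: `Agent loop stopped early (stop_reason: ${response.stop_reason})` }],
      };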
    }

    return {
      isError: true,
      content: [{ type: "text", text: `Agent loop ended after ${max_rounds} rounds without a final answer` }],
    };
  }
);

The recursion risk: if your MCP server's own tools are exposed to Claude (as tools in the above loop), Claude might call your MCP tool, which calls Claude, which calls your MCP tool. Set a conservative max_rounds and track depth in module state to break cycles.
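
One way to track depth in module state is a small counter that handlers enter and exit around each Claude call. The sketch below is an assumption about how that could be structured, not a prescribed pattern: DepthGuard and claudeDepth are hypothetical names, and the guard only works when the nested call re-enters the same server process.

```typescript
// Hypothetical module-level guard counting how many Claude-calling handlers are active.
class DepthGuard {
  private depth = 0;
  private readonly maxDepth: number;

  constructor(maxDepth: number) {
    this.maxDepth = maxDepth;
  }

  // Returns true and increments the depth if the limit allows another nested call.
  tryEnter(): boolean {
    if (this.depth >= this.maxDepth) return false;
    this.depth++;
    return true;
  }

  // Call from a finally block so the counter is released even when the handler throws.
  exit(): void {
    if (this.depth > 0) this.depth--;
  }

  get current(): number {
    return this.depth;
  }
}

// A limit of 1 means no Claude-calling tool may run while another one is in flight.
const claudeDepth = new DepthGuard(1);

// Usage inside a handler (sketch):
//   if (!claudeDepth.tryEnter()) {
//     return { isError: true, content: [{ type: "text", text: "Nested Claude call refused" }] };
//   }
//   try { /* call Claude */ } finally { claudeDepth.exit(); }
```

A plain counter cannot tell recursion apart from two unrelated callers running at the same time, so on a server that handles concurrent requests a limit of 1 also serializes them. Passing an explicit depth argument through the tool inputs is an alternative when that matters.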

Error handling

HTTP status	error type	Cause	Response
401	`authentication_error`	Invalid API key	`isError: true` — rotate key
403	`permission_error`	Model not available on tier	`isError: true` — check plan
429	`rate_limit_error`	Request or token limit exceeded	Backoff and retry
529	`overloaded_error`	Anthropic capacity pressure	Backoff and retry — not a billing issue
500	`api_error`	Anthropic infrastructure error	Retry with backoff

function anthropicErrorToToolResult(err: unknown) {
  if (err instanceof Anthropic.APIError) {
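    // err.status is undefined for connection-level failures; 529 is covered by the >= 500 check
    const retryable = err.status === 429 || (err.status !== undefined && err.status >= 500);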
    return {
      isError: true,
      content: [{
        type: "text" as const,
        text: JSON.stringify({
          error: `Anthropic API error (${err.status})`,
          message: err.message,
          retryable,
          hint: err.status === 529
            ? "Claude is temporarily overloaded — retry in 5-30 seconds"
            : err.status === 429
            ? "Rate limit reached; wait for the retry-after header value, then retry"
            : undefined,
        }),
      }],
    };
  }
  throw err;
}

The overloaded_error (529) is specific to Anthropic and means the API received more traffic than capacity can serve at that moment. Unlike a rate limit error (which means you sent too many requests), an overloaded error means Anthropic is at capacity. Back off 5–30 seconds and retry. For MCP tools where the caller can't wait, return isError: true with the retry hint and let the calling agent decide when to retry.
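
For tools that set maxRetries: 0 and retry themselves, the delay schedule could be sketched as below. It assumes the 5 to 30 second window suggested above, doubles the base delay per attempt, and adds up to one base interval of jitter so that concurrent callers do not retry in lockstep. The function names and default values are illustrative; the random source is injectable only so the schedule can be tested deterministically.

```typescript
// Statuses treated as retryable in this guide: 429 rate limit and any 5xx, including 529.
function isRetryableStatus(status: number | undefined): boolean {
  return status === 429 || (status !== undefined && status >= 500);
}

// Exponential backoff with additive jitter, clamped to capMs.
// attempt is 0 for the first retry, 1 for the second, and so on.
function backoffDelayMs(
  attempt: number,
  baseMs = 5_000,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const exponential = baseMs * Math.pow(2, attempt);
  const jitter = Math.floor(random() * baseMs);
  return Math.min(capMs, exponential + jitter);
}
```

With the defaults, the first retry waits between 5 and 10 seconds, the second between 10 and 15, the third between 20 and 25, and every later retry waits the full 30 seconds. A handler would sleep for backoffDelayMs(attempt) only when isRetryableStatus(err.status) is true, and log each attempt before retrying.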

For the general error handling pattern in MCP tools, see error handling patterns for MCP tools.

Frequently asked questions

When should I use prompt caching versus just passing context each time?

Prompt caching pays off when: (1) you have a large static system prompt (at least 1024 tokens) that doesn't change between calls, (2) you're doing RAG and the retrieved documents are the same across multiple calls in a session, or (3) you're running a multi-turn conversation where the full conversation history is re-sent on each call. Cache writes cost 25% more than regular input; reads cost 10% of the regular price. The break-even is the first cache hit, that is, the second request with the same prefix. If your system prompt is below the minimum cacheable length (1024 tokens on many models) or changes on every call, prompt caching has no effect.

How do I use Claude's extended thinking in an MCP tool?

Pass thinking: { type: "enabled", budget_tokens: N } in the messages.create call, with N at least 1024 and smaller than max_tokens. The response will include thinking content blocks before the text block. In MCP tools you have two choices: return the thinking blocks alongside the text (useful for debugging or when the caller wants to see reasoning), or filter them out and return only the final text. Thinking tokens are billed as output tokens and count toward your token budget; in exchange they tend to help most on complex multi-step problems. Expect added latency and cost on every call that enables thinking.
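
The two choices can share one helper with a flag. This sketch uses local stand-in types for the thinking, redacted_thinking, and text blocks rather than the SDK's own definitions, and splitThinking is a hypothetical name.

```typescript
// Local stand-ins for the block types a thinking-enabled response can contain.
type ThinkingAwareBlock =
  | { type: "thinking"; thinking: string }
  | { type: "redacted_thinking"; data: string }
  | { type: "text"; text: string };

// Separates the final answer from the reasoning.
// The reasoning is only included in the result when includeReasoning is true.
function splitThinking(
  content: readonly ThinkingAwareBlock[],
  includeReasoning: boolean,
): { answer: string; reasoning?: string } {
  let answer = "";
  let reasoning = "";
  for (const block of content) {
    if (block.type === "text") {
      answer += block.text;
    } else if (block.type === "thinking") {
      reasoning += block.thinking;
    }
    // redacted_thinking blocks carry opaque data and are skipped here
  }
  return includeReasoning ? { answer, reasoning } : { answer };
}
```

Exposing includeReasoning as an optional tool argument that defaults to false keeps ordinary responses small while still letting a caller request the reasoning when debugging.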

Can I call the Anthropic Batch API from an MCP tool?

Yes. The Batch API (anthropic.messages.batches.create()) is useful when your MCP tool needs to process many independent inputs in parallel without hitting rate limits. Use the two-tool start/poll pattern: a start_batch tool submits the batch and returns the batch ID, and a poll_batch tool checks status and returns results when ready. Batches can take minutes to hours depending on size, so they're unsuitable for synchronous tool calls where the user is waiting.
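
As an illustration of the start half of that pattern, the sketch below builds the request list a start_batch tool could submit. It assumes the batch request shape of a custom_id plus the usual Messages parameters; BatchRequestLike and buildBatchRequests are hypothetical names, and the index-based custom_id scheme is only one option.

```typescript
// Local stand-in for one batch entry: a caller-chosen ID plus ordinary Messages parameters.
interface BatchRequestLike {
  custom_id: string;
  params: {
    model: string;
    max_tokens: number;
    messages: { role: "user"; content: string }[];
  };
}

// Turns independent inputs into batch entries. The custom_id encodes the input index
// so poll_batch can match each result back to the input that produced it.
function buildBatchRequests(
  inputs: readonly string[],
  model: string,
  maxTokens = 1024,
): BatchRequestLike[] {
  return inputs.map((input, index) => ({
    custom_id: `item-${index}`,
    params: {
      model,
      max_tokens: maxTokens,
      messages: [{ role: "user" as const, content: input }],
    },
  }));
}
```

A start_batch handler would pass { requests: buildBatchRequests(...) } to anthropic.messages.batches.create() and return the resulting batch ID. The poll_batch side, which is assumed here and not shown, would look the batch up by that ID and return results only once processing has ended.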

What's the right model for an MCP tool that runs inside a Claude session?

If your MCP tool is called by Claude (the user's AI assistant), and you call Claude again inside that tool, you're creating nested Claude sessions. For most use cases, use Haiku 4.5 for the inner call — it's fast and cheap, which matters because the outer Claude session is already thinking about when to call your tool. Reserve Sonnet or Opus for inner calls that genuinely need heavy reasoning (e.g., code generation, long-form synthesis). Track the total token spend across both sessions in your tool response so the outer Claude can account for it.

How does AliveMCP help with Anthropic API reliability?

AliveMCP monitors your MCP server's endpoint — the transport layer — not the Anthropic API itself. When your MCP server is down (process crashed, OOM, deployment failed), AliveMCP pages you within 60 seconds. This is distinct from the Anthropic API being slow or returning 529 errors, which is Anthropic's problem to fix. By monitoring the MCP endpoint separately, you can distinguish "my server crashed" from "Anthropic is having an outage" in your incident response.