MCP server Anthropic API integration
Calling the Anthropic Claude API from an MCP tool has three sharp edges that are specific to this integration: the Messages API's content block structure (text, tool_use, and tool_result blocks are typed differently from OpenAI's format), prompt caching via cache_control (a misplaced breakpoint pays the cache-write premium without producing hits; a well-placed one cuts the input cost of the cached prefix by about 90%), and the nested tool use problem (Claude can call tools that are themselves your MCP tools, creating a recursion risk). This guide covers the production patterns for each.
TL;DR
Install @anthropic-ai/sdk and initialize with ANTHROPIC_API_KEY. Buffer all streaming responses with stream.finalMessage(), because MCP tools can't stream mid-call. Apply cache_control: { type: "ephemeral" } to your system prompt and any large static context blocks to enable prompt caching; check usage.cache_read_input_tokens to verify hits. Track response.usage.input_tokens + output_tokens per call, and when the next call would overflow the context window, return isError: true with the token counts instead of making it. Handle overloaded_error (529) with exponential backoff: it's a temporary capacity signal, not a billing error. Wire AliveMCP on your MCP endpoint to distinguish outages of your own server from Claude API outages.
SDK setup and authentication
npm install @anthropic-ai/sdk zod
import Anthropic from "@anthropic-ai/sdk";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
maxRetries: 2, // SDK default; retries connection errors, 408, 409, 429, and 5xx (including 529)
timeout: 60_000, // ms — Claude can take 30-60s on long generations
});
const server = new McpServer({ name: "claude-tools", version: "1.0.0" });
The SDK is initialized once at module level and is safe to reuse across tool calls. The maxRetries default of 2 covers transient failures: connection errors, 408, 409, 429, and 5xx responses, which include 529. For high-throughput tools, set maxRetries: 0 and implement your own backoff (see the error handling section below) so you can log retry attempts and respect per-minute token limits.
Model selection
The current Claude model family as of mid-2026:
| Model | Model ID | Context window | Best for |
|---|---|---|---|
| Claude Opus 4.7 | claude-opus-4-7 | 200,000 tokens | Complex multi-step reasoning, high-stakes analysis |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | 200,000 tokens | Balanced performance, production workloads |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 | 200,000 tokens | Fast, cheap, high-volume classification and extraction |
Pass the model ID as a tool argument rather than hardcoding it. Expose a list_claude_models tool that returns the table above so the calling agent can select the right model for cost-vs-quality tradeoffs without needing to know model IDs out of band.
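A minimal sketch of the payload such a list_claude_models tool could return, mirroring the table above. The `CLAUDE_MODELS` constant, the `listClaudeModelsResult` helper, and the JSON field names are illustrative assumptions; registering the handler with server.tool is left out so the snippet stays self-contained.

```typescript
// Hypothetical catalog mirroring the model table; update it when the table changes.
interface ClaudeModelInfo {
  name: string;
  id: string;
  contextWindowTokens: number;
  bestFor: string;
}

const CLAUDE_MODELS: readonly ClaudeModelInfo[] = [
  {
    name: "Claude Opus 4.7",
    id: "claude-opus-4-7",
    contextWindowTokens: 200_000,
    bestFor: "Complex multi-step reasoning, high-stakes analysis",
  },
  {
    name: "Claude Sonnet 4.6",
    id: "claude-sonnet-4-6",
    contextWindowTokens: 200_000,
    bestFor: "Balanced performance, production workloads",
  },
  {
    name: "Claude Haiku 4.5",
    id: "claude-haiku-4-5-20251001",
    contextWindowTokens: 200_000,
    bestFor: "Fast, cheap, high-volume classification and extraction",
  },
];

// Builds an MCP tool result: a single text block holding the catalog as JSON.
function listClaudeModelsResult() {
  return {
    content: [{ type: "text" as const, text: JSON.stringify(CLAUDE_MODELS) }],
  };
}
```

Inside the server this would likely be wired as a zero-argument tool whose handler simply returns `listClaudeModelsResult()`, keeping the catalog and the z.enum of model IDs in one place.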
Messages API tool with prompt caching
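The Anthropic Messages API structures content as typed blocks. The most important non-obvious rule: cache_control marks the end of a cacheable prefix. Everything up to and including the marked block is cached as one unit (tools first, then system, then messages), so put the marker on the last block of the static content you want to reuse. Marking a block in the middle of a long system prompt caches only the part up to that block; everything after it is billed at the full input rate on every call.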
server.tool(
"ask_claude",
"Send a message to Claude and return the response. Uses prompt caching for repeated system prompts.",
{
model: z.enum(["claude-opus-4-7", "claude-sonnet-4-6", "claude-haiku-4-5-20251001"])
.default("claude-sonnet-4-6"),
system_prompt: z.string().optional()
.describe("System context; cached when it meets the model's minimum cacheable length (1024 tokens on many models)"),
user_message: z.string(),
max_tokens: z.number().int().min(1).max(8192).default(1024),
},
async ({ model, system_prompt, user_message, max_tokens }) => {
try {
const response = await anthropic.messages.create({
model,
max_tokens,
// Mark the system prompt as a cache breakpoint if present.
// Prefixes below the model's minimum cacheable length (1024 tokens on many models) are processed uncached.
system: system_prompt
? [{ type: "text", text: system_prompt, cache_control: { type: "ephemeral" } }]
: undefined,
messages: [{ role: "user", content: user_message }],
});
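// Record usage so get_token_usage also reflects this call (tokenBudget is the module-level tracker from the token budget section)
tokenBudget.record(response.usage);
const textBlock = response.content.find(b => b.type === "text");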
const replyText = textBlock?.type === "text" ? textBlock.text : "(no text response)";
return {
content: [{
type: "text",
text: JSON.stringify({
reply: replyText,
model: response.model,
stop_reason: response.stop_reason,
usage: {
input_tokens: response.usage.input_tokens,
output_tokens: response.usage.output_tokens,
cache_read_input_tokens: response.usage.cache_read_input_tokens ?? 0,
cache_creation_input_tokens: response.usage.cache_creation_input_tokens ?? 0,
},
}),
}],
};
} catch (err) {
return anthropicErrorToToolResult(err);
}
}
);
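Prompt caching applies when cache_control: { type: "ephemeral" } is present and the prefix ending at that block meets the model's minimum cacheable length (1024 tokens on many models; shorter prefixes are processed without caching and without an error). Cache writes cost 25% more than regular input tokens; cache reads cost 10% of the regular input price. An ephemeral cache entry expires after 5 minutes by default, and each hit refreshes that lifetime. The first cache hit already recovers the write premium: two requests cost 1.25 + 0.10 = 1.35 units of prefix input instead of 2.00, and every later hit within the cache lifetime is 90% cheaper for that prefix. Check usage.cache_read_input_tokens > 0 in the response to verify a hit occurred.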
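To make the placement rule concrete, here is a small sketch that assembles a system array with the breakpoint on the last static block and per-request text after it. `buildSystemBlocks` and the `SystemTextBlock` type are hypothetical helpers written for this example, not part of the SDK; the block shape mirrors the system parameter used in ask_claude.

```typescript
// Simplified stand-in for the SDK's system text block type (an assumption for this sketch).
interface SystemTextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Static parts come first and the breakpoint goes on the LAST static block,
// so the cached prefix covers all of them. Per-request text follows the
// breakpoint and is billed as regular input on every call.
function buildSystemBlocks(staticParts: string[], perRequestPart?: string): SystemTextBlock[] {
  const blocks = staticParts.map((text): SystemTextBlock => ({ type: "text", text }));
  const lastStatic = blocks[blocks.length - 1];
  if (lastStatic) {
    lastStatic.cache_control = { type: "ephemeral" };
  }
  if (perRequestPart) {
    blocks.push({ type: "text", text: perRequestPart });
  }
  return blocks;
}
```

The result can be passed as the `system` argument of `messages.create`. Keeping anything that varies per call (timestamps, user names, retrieved snippets) out of the static parts matters, because a change anywhere before the breakpoint produces a different prefix and therefore a cache miss.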
Token budget tracking
Claude's 200k-token context is generous but not unlimited. MCP tools that accumulate conversation history across calls can overflow it without warning: the API rejects the request, but the caller sees only a failed tool call with no indication that context size was the cause. Track token usage per call and check the budget before the next call:
// Module-level token budget tracker
const tokenBudget = {
inputTotal: 0,
outputTotal: 0,
cacheReads: 0,
record(usage: Anthropic.Usage) {
this.inputTotal += usage.input_tokens;
this.outputTotal += usage.output_tokens;
this.cacheReads += usage.cache_read_input_tokens ?? 0;
},
estimateNextCallFit(estimatedInputTokens: number, maxOutputTokens: number, modelLimit = 200_000): boolean {
return estimatedInputTokens + maxOutputTokens <= modelLimit;
},
};
server.tool(
"get_token_usage",
"Return the cumulative token usage for this MCP server session",
{},
async () => {
return {
content: [{
type: "text",
text: JSON.stringify({
input_tokens_total: tokenBudget.inputTotal,
output_tokens_total: tokenBudget.outputTotal,
cache_reads_total: tokenBudget.cacheReads,
estimated_cost_note: "Cache reads are ~90% cheaper than regular input tokens",
}),
}],
};
}
);
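The tracker above records usage but never blocks a call. A minimal sketch of the pre-flight check follows, under two assumptions: the input size is estimated with a crude 4-characters-per-token heuristic (the API's token counting endpoint would give an exact number at the cost of an extra request), and the helper names `totalInputTokens`, `roughTokenEstimate`, and `contextOverflowResult` are invented for this example.

```typescript
// Subset of the usage fields reported by the Messages API.
interface UsageLike {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

// input_tokens excludes cached tokens, so add both cache counters to get
// the full prompt size that occupied the context window.
function totalInputTokens(usage: UsageLike): number {
  return (
    usage.input_tokens +
    (usage.cache_read_input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0)
  );
}

// Crude heuristic (about 4 characters per token for English text); not a tokenizer.
function roughTokenEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns null when the call fits; otherwise an MCP error result carrying the numbers
// so the calling agent can see why the call was refused.
function contextOverflowResult(
  estimatedInputTokens: number,
  maxOutputTokens: number,
  modelLimit = 200_000,
) {
  const needed = estimatedInputTokens + maxOutputTokens;
  if (needed <= modelLimit) return null;
  return {
    isError: true as const,
    content: [{
      type: "text" as const,
      text: JSON.stringify({
        error: "Call would exceed the model context window",
        estimated_input_tokens: estimatedInputTokens,
        max_output_tokens: maxOutputTokens,
        model_limit: modelLimit,
        over_by: needed - modelLimit,
      }),
    }],
  };
}
```

A handler would call `contextOverflowResult(roughTokenEstimate(prompt), max_tokens)` before `messages.create` and return the result if it is not null. Because the estimate is rough, leaving some headroom below the real limit is a sensible precaution.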
For the broader token budget management pattern across MCP tools, see token budget management for MCP servers.
Buffered streaming
The Anthropic SDK's streaming API uses an async iterator over MessageStreamEvents. MCP tool handlers return a complete result — they can't yield chunks mid-execution. Use stream.finalMessage() to buffer:
server.tool(
"claude_long_generation",
"Generate long-form content with Claude — buffers the full stream before returning",
{
prompt: z.string(),
model: z.enum(["claude-opus-4-7", "claude-sonnet-4-6"]).default("claude-sonnet-4-6"),
max_tokens: z.number().int().min(1).default(4096),
},
async ({ prompt, model, max_tokens }) => {
try {
const stream = anthropic.messages.stream({
model,
max_tokens,
messages: [{ role: "user", content: prompt }],
});
// finalMessage() collects all stream events and returns the complete Message
const message = await stream.finalMessage();
const text = message.content
.filter(b => b.type === "text")
.map(b => b.type === "text" ? b.text : "")
.join("");
tokenBudget.record(message.usage);
return {
content: [{ type: "text", text }],
};
} catch (err) {
return anthropicErrorToToolResult(err);
}
}
);
For more on the streaming-in-MCP pattern and when to use the two-tool start/poll approach instead, see streaming responses in MCP servers.
Nested tool use
Claude supports tool use — the model can respond with tool_use content blocks requesting a function call. When your MCP tool calls Claude with tools enabled, Claude may respond with tool use requests rather than text. You need to handle this loop:
server.tool(
"claude_agent_loop",
"Run a Claude agent loop that can call sub-tools (calculator, lookup) until it produces a final answer",
{
task: z.string(),
max_rounds: z.number().int().min(1).max(8).default(5),
},
async ({ task, max_rounds }) => {
const tools: Anthropic.Tool[] = [
{
name: "calculate",
description: "Evaluate a mathematical expression",
input_schema: {
type: "object",
properties: { expression: { type: "string", description: "Math expression to evaluate" } },
required: ["expression"],
},
},
];
const messages: Anthropic.MessageParam[] = [
{ role: "user", content: task },
];
for (let round = 0; round < max_rounds; round++) {
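let response: Anthropic.Message;
try {
response = await anthropic.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
tools,
messages,
});
} catch (err) {
// Convert API failures into a tool error result, as the other tools do
return anthropicErrorToToolResult(err);
}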
tokenBudget.record(response.usage);
if (response.stop_reason === "end_turn") {
const textBlock = response.content.find(b => b.type === "text");
const text = textBlock?.type === "text" ? textBlock.text : "(no answer)";
return { content: [{ type: "text", text }] };
}
if (response.stop_reason === "tool_use") {
// Add Claude's response (with tool_use blocks) to messages
messages.push({ role: "assistant", content: response.content });
// Execute each tool call and collect results
const toolResults: Anthropic.ToolResultBlockParam[] = [];
for (const block of response.content) {
if (block.type !== "tool_use") continue;
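let result: string;
if (block.name === "calculate") {
try {
// block.input is typed as unknown, so narrow it before use
const expression = String((block.input as { expression?: unknown }).expression ?? "");
// Function() executes arbitrary JavaScript, so accept only digits, whitespace, and arithmetic operators
if (!/^[\d\s+\-*\/%.()]+$/.test(expression)) throw new Error("unsupported characters");
result = String(Function('"use strict"; return (' + expression + ')')());
} catch {
result = "Error: invalid expression";
}
} else {
result = "Tool not found";
}
toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });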
}
messages.push({ role: "user", content: toolResults });
continue;
}
// stop_reason === "max_tokens" or unexpected
break;
}
return {
isError: true,
content: [{ type: "text", text: `Agent loop ended without a final answer: either the ${max_rounds}-round limit was reached or a response stopped early (for example on max_tokens)` }],
};
}
);
The recursion risk: if your MCP server's own tools are exposed to Claude (as tools in the above loop), Claude might call your MCP tool, which calls Claude, which calls your MCP tool. Set a conservative max_rounds and track depth in module state to break cycles.
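One way to track depth in module state is a small guard around every handler that calls Claude. This is a sketch under stated assumptions: `withDepthGuard`, `DepthLimitError`, and `currentDepth` are hypothetical names, and a single module-level counter also counts unrelated concurrent calls, so it errs on the side of refusing work under load.

```typescript
// Thrown when a Claude-calling handler is entered while too many are already active.
class DepthLimitError extends Error {
  constructor(depth: number, maxDepth: number) {
    super(`Nested Claude call refused: depth ${depth} would exceed limit ${maxDepth}`);
    this.name = "DepthLimitError";
  }
}

// Number of guarded handlers currently running in this process.
let activeDepth = 0;

function currentDepth(): number {
  return activeDepth;
}

// Runs `run` only if fewer than `maxDepth` guarded calls are active.
async function withDepthGuard<T>(maxDepth: number, run: () => Promise<T>): Promise<T> {
  if (activeDepth >= maxDepth) {
    throw new DepthLimitError(activeDepth + 1, maxDepth);
  }
  activeDepth += 1;
  try {
    return await run();
  } finally {
    activeDepth -= 1; // always release, even when run() throws
  }
}
```

Wrapping the body of claude_agent_loop in `withDepthGuard(2, ...)` would let one level of nesting through and refuse a third. A handler can catch `DepthLimitError` and return isError: true so the outer model sees why the call was refused. If concurrent sessions must not affect each other, a per-call-chain counter (for example via Node's AsyncLocalStorage) would be the more precise choice.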
Error handling
| HTTP status | Error type (error.type) | Cause | Handling |
|---|---|---|---|
| 401 | authentication_error | Invalid API key | isError: true — rotate key |
| 403 | permission_error | Model not available on tier | isError: true — check plan |
| 429 | rate_limit_error | Request or token limit exceeded | Backoff and retry |
| 529 | overloaded_error | Anthropic capacity pressure | Backoff and retry — not a billing issue |
| 500 | api_error | Anthropic infrastructure error | Retry with backoff |
function anthropicErrorToToolResult(err: unknown) {
if (err instanceof Anthropic.APIError) {
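// status is undefined for connection failures, which are worth retrying; the 5xx range includes 529
const retryable = err.status === undefined || err.status === 429 || err.status >= 500;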
return {
isError: true,
content: [{
type: "text" as const,
text: JSON.stringify({
error: `Anthropic API error (${err.status})`,
message: err.message,
retryable,
hint: err.status === 529
? "Claude is temporarily overloaded — retry in 5-30 seconds"
: err.status === 429
? "Rate limit reached: wait for the duration in the retry-after header, then retry"
: undefined,
}),
}],
};
}
throw err;
}
The overloaded_error (529) is specific to Anthropic and means the API received more traffic than capacity can serve at that moment. Unlike a rate limit error (which means you sent too many requests), an overloaded error means Anthropic is at capacity. Back off 5–30 seconds and retry. For MCP tools where the caller can't wait, return isError: true with the retry hint and let the calling agent decide when to retry.
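The backoff itself can be sketched as a generic wrapper. This is an illustration, not SDK functionality: `retryWithBackoff`, `backoffDelayMs`, and `BackoffOptions` are hypothetical names, and the "equal jitter" variant (half of each delay fixed, half random) is one choice among several.

```typescript
interface BackoffOptions {
  maxAttempts: number; // total tries, including the first
  baseMs: number;      // backoff ceiling for the first retry
  capMs: number;       // upper bound for any single delay
  random?: () => number;                 // injectable for tests
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Exponential backoff with equal jitter: the ceiling doubles per retry up to capMs;
// half of it is a fixed wait and the other half is randomized.
function backoffDelayMs(
  retryIndex: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** retryIndex);
  return Math.floor(ceiling / 2 + random() * (ceiling / 2));
}

// Calls `call` until it succeeds, the error is not retryable, or attempts run out.
async function retryWithBackoff<T>(
  call: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  opts: BackoffOptions,
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const lastAttempt = attempt >= opts.maxAttempts - 1;
      if (lastAttempt || !isRetryable(err)) throw err;
      await sleep(backoffDelayMs(attempt, opts.baseMs, opts.capMs, opts.random));
    }
  }
}
```

With baseMs: 10_000 and capMs: 30_000 the waits fall between 5 and 30 seconds, matching the range above. For Anthropic calls, `isRetryable` would typically accept an Anthropic.APIError whose status is undefined, 429, or 500 and above, and the client would be created with maxRetries: 0 so the SDK's retries and this wrapper do not multiply.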
For the general error handling pattern in MCP tools, see error handling patterns for MCP tools.
Frequently asked questions
When should I use prompt caching versus just passing context each time?
Prompt caching pays off when: (1) you have a large static system prompt that doesn't change between calls, (2) you're doing RAG and the retrieved documents are the same across multiple calls in a session, or (3) you're running a multi-turn conversation where the full conversation history is re-sent on each call. Cache writes cost 25% more than regular input; reads cost 10% of the regular input price. The first cache hit already pays back the write premium, provided it arrives before the entry expires (5 minutes by default, refreshed on each hit). If your prompt prefix is shorter than the minimum cacheable length (1024 tokens on many models) or changes on every call, prompt caching has no effect.
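The arithmetic behind that break-even can be written out as a small function. It is a sketch that assumes every request after the first hits the cache within its lifetime; `prefixCostPct` and the two constants are names made up for this example, and costs are expressed in percent of one uncached pass over the prefix to keep the numbers exact.

```typescript
// Costs are in percent of one uncached pass over the prefix (100 = full input price).
const CACHE_WRITE_PCT = 125; // the first request writes the cache at a 25% premium
const CACHE_READ_PCT = 10;   // later requests read it at 10% of the input price

// Total prefix input cost for `requests` identical-prefix calls, with or without caching.
function prefixCostPct(requests: number, useCache: boolean): number {
  if (!Number.isInteger(requests) || requests < 0) {
    throw new RangeError("requests must be a non-negative integer");
  }
  if (requests === 0) return 0;
  return useCache
    ? CACHE_WRITE_PCT + CACHE_READ_PCT * (requests - 1)
    : 100 * requests;
}
```

For a single request caching costs 125 against 100, so a prefix that is sent only once is cheaper uncached. At two requests the totals are 135 against 200, and at ten requests 215 against 1000.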
How do I use Claude's extended thinking in an MCP tool?
Pass thinking: { type: "enabled", budget_tokens: N } in the messages.create call, with N at least 1024 and lower than max_tokens. The response will include thinking content blocks before the text block. In MCP tools you have two choices: return the thinking blocks alongside the text (useful for debugging or when the caller wants to see reasoning), or filter them out and return only the final text. Thinking tokens are billed as output tokens and count toward your token budget; in exchange, they typically improve answer quality on complex multi-step problems. Expect added latency as well as cost.
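Both choices can be served by one helper that separates the answer from the reasoning. The sketch below uses a simplified local `ResponseBlock` type as a stand-in for the SDK's content block union, and `splitThinking` is a hypothetical name.

```typescript
// Simplified stand-in for the content blocks a thinking-enabled response can contain.
type ResponseBlock =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string; signature: string }
  | { type: "redacted_thinking"; data: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// Concatenates text blocks into the answer; includes thinking text only on request.
function splitThinking(
  blocks: ResponseBlock[],
  includeReasoning: boolean,
): { answer: string; reasoning?: string } {
  let answer = "";
  let reasoning = "";
  for (const block of blocks) {
    if (block.type === "text") {
      answer += block.text;
    } else if (block.type === "thinking" && includeReasoning) {
      reasoning += block.thinking;
    }
    // redacted_thinking and tool_use blocks carry no displayable text
  }
  return includeReasoning ? { answer, reasoning } : { answer };
}
```

This filtering belongs at the MCP output boundary only. If the same response feeds a tool use loop, the thinking blocks should be passed back to the API unmodified in the assistant turn.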
Can I call the Anthropic Batch API from an MCP tool?
Yes. The Batch API (anthropic.messages.batches.create()) is useful when your MCP tool needs to process many independent inputs and no one is waiting on the result: requests are processed asynchronously under rate limits that are separate from those of the Messages API. Use the two-tool start/poll pattern: a start_batch tool submits the batch and returns the batch ID, and a poll_batch tool checks status and returns results when ready. Batches can take minutes to hours depending on size, so they're unsuitable for synchronous tool calls where the user is waiting.
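The start/poll split can be sketched against a narrow interface. Everything here is an assumption made for illustration: `BatchApi` is a hypothetical wrapper that a real server would implement on top of anthropic.messages.batches, and `startBatch`, `pollBatch`, and `textResult` are invented helper names. The status values follow the Batch API's processing_status field.

```typescript
type ProcessingStatus = "in_progress" | "canceling" | "ended";

interface BatchInfo {
  id: string;
  processing_status: ProcessingStatus;
}

// Hypothetical narrow wrapper; a real implementation would delegate to the SDK.
interface BatchApi {
  create(prompts: string[]): Promise<BatchInfo>;
  retrieve(batchId: string): Promise<BatchInfo>;
  results(batchId: string): Promise<string[]>;
}

// Wraps any payload as an MCP tool result with a single JSON text block.
function textResult(payload: unknown, isError = false) {
  return { isError, content: [{ type: "text" as const, text: JSON.stringify(payload) }] };
}

// start_batch handler body: submit the work and return immediately with the ID.
async function startBatch(api: BatchApi, prompts: string[]) {
  if (prompts.length === 0) return textResult({ error: "no prompts supplied" }, true);
  const batch = await api.create(prompts);
  return textResult({ batch_id: batch.id, status: batch.processing_status });
}

// poll_batch handler body: report status, and attach results only once the batch has ended.
async function pollBatch(api: BatchApi, batchId: string) {
  const batch = await api.retrieve(batchId);
  if (batch.processing_status !== "ended") {
    return textResult({ batch_id: batch.id, status: batch.processing_status, done: false });
  }
  const results = await api.results(batchId);
  return textResult({ batch_id: batch.id, status: "ended", done: true, results });
}
```

Neither handler waits for completion, so each tool call returns quickly and the calling agent decides how often to poll.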
What's the right model for an MCP tool that runs inside a Claude session?
If your MCP tool is called by Claude (the user's AI assistant), and you call Claude again inside that tool, you're creating nested Claude sessions. For most use cases, use Haiku 4.5 for the inner call — it's fast and cheap, which matters because the outer Claude session is already thinking about when to call your tool. Reserve Sonnet or Opus for inner calls that genuinely need heavy reasoning (e.g., code generation, long-form synthesis). Track the total token spend across both sessions in your tool response so the outer Claude can account for it.
How does AliveMCP help with Anthropic API reliability?
AliveMCP monitors your MCP server's endpoint — the transport layer — not the Anthropic API itself. When your MCP server is down (process crashed, OOM, deployment failed), AliveMCP pages you within 60 seconds. This is distinct from the Anthropic API being slow or returning 529 errors, which is Anthropic's problem to fix. By monitoring the MCP endpoint separately, you can distinguish "my server crashed" from "Anthropic is having an outage" in your incident response.
Further reading
- MCP tools for the OpenAI API — chat completions, function calling, rate limits
- MCP tools for Ollama — local LLM inference, model management, health checks
- MCP tools for AWS Bedrock — Converse API, IAM auth, cross-region inference
- Token budget management for MCP servers
- Streaming responses in MCP servers
- Caching patterns for MCP tool responses
- Error handling patterns for MCP tools
- AliveMCP — production protocol monitoring for MCP servers