LLM Provider Integrations · 2026-07-03 · LLM Provider Integrations arc

Building MCP Tools for LLM Provider APIs: Token Budgets, Streaming, and the Three Patterns Every Integration Shares

When you build your first MCP tool that calls an LLM provider (OpenAI, Anthropic, Ollama) or a vector database, you solve three problems in sequence: you hit a context overflow error and add token counting, you notice the tool hangs because the SDK is streaming and you add buffering, then you hardcode a model name and immediately regret it. By the second integration you recognize the same three problems showing up in a different uniform. This synthesis covers all four integrations through the three patterns they share, so you recognize them immediately the next time they appear.

TL;DR

Four different providers, three shared challenges.

(1) Token budget management: count tokens before every LLM call and return isError: true with exact counts if the prompt would overflow; track usage after every call; cap retrieved vector chunks at 500 tokens each before returning them as context. OpenAI uses gpt-tokenizer to count before calling; Anthropic tracks usage.input_tokens + usage.output_tokens per call; Ollama exposes per-model context limits that vary wildly (llama3.2: 8k, phi4: 16k, gemma3: 8k, mistral: 32k, qwen2.5: 32k).

(2) Streaming-to-buffered conversion: LLM SDKs are built around streaming, while an MCP tool handler returns one complete result. Buffer with stream.finalChatCompletion() (OpenAI), stream.finalMessage() (Anthropic), or stream: false in the request body (Ollama, the least obvious of the three, because its HTTP API streams by default). Vector databases are already synchronous.

(3) Model/provider selection tools: expose a list_models tool in every LLM integration so the calling agent can choose among quality/cost/latency tradeoffs without hardcoded model IDs.

Where they diverge: auth (API keys for OpenAI/Anthropic/Pinecone, no auth for Ollama/Chroma), error handling (OpenAI throws RateLimitError at 429; Anthropic returns overloaded_error at 529, a capacity signal rather than a quota signal, which calls for different backoff), and context window size (OpenAI/Anthropic are 128k–200k; Ollama varies by model; vector databases need chunk-level budget management).

Wire AliveMCP on your MCP server endpoint: without protocol-level probing, an LLM provider outage looks identical to a tool crash.

The four integrations at a glance

Before diving into each pattern, here's where the four integrations stand on the dimensions that matter most for MCP tool design:

Integration | Auth | Buffering method | Context window | Typed error class
OpenAI | OPENAI_API_KEY | stream.finalChatCompletion() | 128k (gpt-4o/mini) | RateLimitError, AuthenticationError
Anthropic | ANTHROPIC_API_KEY | stream.finalMessage() | 200k (all Claude models) | overloaded_error (529)
Ollama | None (localhost:11434) | stream: false in request body | Varies by model (2k–128k) | HTTP 503, connection refused
Vector databases | API key (Pinecone) or none (Chroma) | Already synchronous | N/A (chunk-level cap instead) | Index not found, dimension mismatch

The three shared patterns all stem from the same root cause: MCP tools are synchronous request-response units operating inside a context-limited protocol, while LLM provider SDKs are designed for streaming, open-ended generation with large buffers. Every friction point is a version of that mismatch.

Pattern 1 — Token budget management

When you call a SaaS API from an MCP tool, the payload size is under your control — you pass a fixed set of arguments, get a fixed-size response. When you call an LLM provider, the payload grows with the context you inject, the conversation history you carry forward, and the chunks you retrieve from a vector database. Overflow that context window and the API fails — but from the MCP tool's perspective, a context_length_exceeded error looks the same as any other crash. The calling agent gets an opaque failure with no signal on how to fix it.

The correct pattern: check before calling, and when the check fails, return isError: true with the exact token counts so the agent can truncate and retry rather than giving up.

OpenAI: count with gpt-tokenizer before the call

The OpenAI API does not expose token counting as a separate endpoint — you have to count locally with gpt-tokenizer. The count is imperfect (it doesn't include tool definition overhead) but accurate enough to catch the common overflow cases:

import OpenAI from "openai";
import { encode } from "gpt-tokenizer";

const MODEL_CONTEXT_LIMITS: Record<string, number> = {
  "gpt-4o": 128_000,
  "gpt-4o-mini": 128_000,
  "gpt-4-turbo": 128_000,
  "gpt-3.5-turbo": 16_385,
};

function countMessageTokens(messages: OpenAI.ChatCompletionMessageParam[]): number {
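  // Roughly 4 tokens of overhead per message (role + framing), plus 2 for the reply primer
  return messages.reduce((total, msg) => {
    // Only string content is counted; array content (multimodal parts) is treated as empty
    const content = typeof msg.content === "string" ? msg.content : "";
    return total + encode(content).length + 4;
  }, 2);
}

// At the top of every chat completions tool handler.
// checkTokenBudget is a helper that compares countMessageTokens(messages) + max_tokens
// against MODEL_CONTEXT_LIMITS[model]: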
const { ok, inputTokens, limit } = checkTokenBudget(model, messages, max_tokens);
if (!ok) {
  return {
    isError: true,
    content: [{ type: "text", text: JSON.stringify({
      error: "context_overflow",
      input_tokens: inputTokens,
      max_output_tokens: max_tokens,
      model_limit: limit,
      overflow_by: inputTokens + max_tokens - limit,
      hint: "Reduce the prompt length or split into smaller calls",
    })}],
  };
}

The overflow_by field is what makes this useful: the agent can see it needs to shorten the prompt by exactly N tokens, rather than guessing. For a deeper treatment of the general token budget management problem, see token budget management for MCP servers.
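The handler calls checkTokenBudget without defining it. A minimal sketch of what such a helper could look like follows. The fallback limit for unknown models and the estimateTokens stand-in are assumptions made so the sketch runs without dependencies; in the real handler you would pass the gpt-tokenizer-based countMessageTokens as the counter.

```typescript
// Hypothetical helper: the handler calls checkTokenBudget, but its body is not part of the article.
type ChatMessage = { role: string; content: string };

const MODEL_CONTEXT_LIMITS: Record<string, number> = {
  "gpt-4o": 128_000,
  "gpt-4o-mini": 128_000,
  "gpt-4-turbo": 128_000,
  "gpt-3.5-turbo": 16_385,
};

// Assumption: unknown models get a conservative limit instead of being rejected.
const FALLBACK_LIMIT = 8_192;

// Stand-in counter (~4 characters per token) so the sketch has no dependencies.
// In the real handler, pass the gpt-tokenizer-based countMessageTokens instead.
function estimateTokens(messages: ChatMessage[]): number {
  return messages.reduce(
    (total, msg) => total + Math.ceil(msg.content.length / 4) + 4,
    2,
  );
}

function checkTokenBudget(
  model: string,
  messages: ChatMessage[],
  maxTokens: number,
  countTokens: (messages: ChatMessage[]) => number = estimateTokens,
): { ok: boolean; inputTokens: number; limit: number } {
  const limit = MODEL_CONTEXT_LIMITS[model] ?? FALLBACK_LIMIT;
  const inputTokens = countTokens(messages);
  // The prompt and the reserved output must both fit inside the window.
  return { ok: inputTokens + maxTokens <= limit, inputTokens, limit };
}
```

Reserving max_tokens up front is the design choice that matters here: a prompt that fits on its own can still overflow once the requested output is added.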

Anthropic: track usage per call and check before accumulating context

Anthropic's 200k-token context window is large enough that single-call overflow is rare — the more common problem is accumulating context across multiple calls in an agent loop. The Claude API returns exact token usage in every response; use it to maintain a session-level budget tracker:

const tokenBudget = {
  inputTotal: 0,
  outputTotal: 0,
  cacheReads: 0,

  record(usage: Anthropic.Usage) {
    this.inputTotal += usage.input_tokens;
    this.outputTotal += usage.output_tokens;
    this.cacheReads += usage.cache_read_input_tokens ?? 0;
  },

  estimateNextCallFit(
    estimatedInputTokens: number,
    maxOutputTokens: number,
    modelLimit = 200_000
  ): boolean {
    // Check whether a new call with this token estimate will fit
    return estimatedInputTokens + maxOutputTokens <= modelLimit;
  },
};

// After each successful API call:
tokenBudget.record(response.usage);

// Expose as a tool so agents can introspect their own budget:
server.tool("get_token_usage", "Return cumulative token usage for this session", {}, async () => {
  return { content: [{ type: "text", text: JSON.stringify({
    input_tokens_total: tokenBudget.inputTotal,
    output_tokens_total: tokenBudget.outputTotal,
    cache_reads_total: tokenBudget.cacheReads,
    cache_savings_note: "Cache reads are ~90% cheaper than regular input tokens",
  })}] };
});

Anthropic's prompt caching feature intersects with budget management in an important way: a system prompt marked with cache_control: { type: "ephemeral" } that is at least 1,024 tokens long (the minimum cacheable length depends on the model) will be cached after the first call. Cache reads cost 10% of the normal input token price, so a repeated 2,000-token system prompt is billed like 200 regular input tokens on every read after the first. Track cache_read_input_tokens separately so you can report the actual cost rather than the nominal token count.
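To make the arithmetic concrete, here is a small sketch that converts a usage object into "regular input token" units. The 0.1 read multiplier is the 10% figure from the text; the 1.25 write multiplier is an assumption that should be checked against current pricing, as is the premise that input_tokens counts only the uncached part of the prompt. The function name is hypothetical.

```typescript
// Field names follow the usage object returned by the Anthropic Messages API.
type CacheAwareUsage = {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
};

// Express a call's input side in "regular input token" units.
// readMultiplier = 0.1 comes from the 10% figure in the text.
// writeMultiplier = 1.25 is an assumption; verify it against current pricing.
function billableInputEquivalent(
  usage: CacheAwareUsage,
  readMultiplier = 0.1,
  writeMultiplier = 1.25,
): number {
  const reads = usage.cache_read_input_tokens ?? 0;
  const writes = usage.cache_creation_input_tokens ?? 0;
  return usage.input_tokens + reads * readMultiplier + writes * writeMultiplier;
}

// A 2,000-token cached system prompt plus 300 uncached tokens:
const firstCall = billableInputEquivalent({ input_tokens: 300, cache_creation_input_tokens: 2000 });
const laterCall = billableInputEquivalent({ input_tokens: 300, cache_read_input_tokens: 2000 });
```

Under these assumptions the first call is slightly more expensive than an uncached one, and every later call within the cache lifetime pays for the system prompt at a tenth of the price.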

Ollama: know your model's context window before sending

Ollama's context limits vary wildly between models, and unlike OpenAI/Anthropic, Ollama doesn't return an error if you exceed the limit — it silently truncates the context from the beginning. The only way to prevent silent truncation is to know the limit before you send:

Model | Default context (tokens) | Can override with num_ctx?
llama3.2 (3B/1B) | 8,192 | Yes, up to hardware limit
mistral (7B) | 32,768 | Yes
gemma3 (12B/27B) | 8,192 | Yes
qwen2.5 (7B/14B) | 32,768 | Yes
phi4 (14B) | 16,384 | Yes
deepseek-r1 (7B/14B) | 65,536 | Yes

The num_ctx parameter in the Ollama request body overrides the model's default, but is capped by your available GPU VRAM. Expose the context limit as part of your list_models tool response so the calling agent knows the constraint before constructing its prompt. The Ollama response body includes prompt_eval_count (tokens in the prompt) and eval_count (tokens generated) — log both per call so you can detect when you're approaching the limit.
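One way to act on those two counters is a small headroom report computed after each call. This is a sketch only: contextReport and the 0.8 warning threshold are hypothetical, and numCtx is whatever limit you passed as num_ctx or looked up for the model.

```typescript
// Subset of the fields Ollama returns on a non-streaming /api/chat response.
type OllamaCounts = { prompt_eval_count: number; eval_count: number };

type ContextReport = {
  used: number;
  limit: number;
  fractionUsed: number;
  nearLimit: boolean;
};

// warnAt = 0.8 is an arbitrary threshold chosen for illustration.
function contextReport(counts: OllamaCounts, numCtx: number, warnAt = 0.8): ContextReport {
  const used = counts.prompt_eval_count + counts.eval_count;
  const fractionUsed = used / numCtx;
  return { used, limit: numCtx, fractionUsed, nearLimit: fractionUsed >= warnAt };
}
```

Because Ollama truncates silently, a nearLimit flag in the tool result may be the only warning the calling agent gets that its next, longer prompt could lose its beginning.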

Vector databases: chunk token cap as budget management

Vector databases don't have a context window in the LLM sense, but they contribute to the LLM's context budget through the chunks they return. Without a cap, a single search_pinecone call with top_k: 10 can return 10 × (chunk size) tokens — enough to overflow a model with a 16k context if your chunks average 2000 tokens each. Apply two independent caps before returning results:

// After filtering by score threshold, apply per-chunk and total token caps
const MAX_TOKENS_PER_CHUNK = 500;
const MAX_TOTAL_TOKENS = 3_000;

function truncateToTokens(text: string, maxTokens: number): string {
  const words = text.split(/\s+/);
  let tokenEstimate = 0;
  const kept: string[] = [];
  for (const word of words) {
    tokenEstimate += Math.ceil(word.length / 4); // rough 4 chars/token estimate
    if (tokenEstimate > maxTokens) break;
    kept.push(word);
  }
  return kept.join(" ");
}

const filtered = matches
  .filter(m => (m.score ?? 0) >= score_threshold)
  .map(m => ({
    id: m.id,
    score: m.score,
    text: truncateToTokens(String(m.metadata?.text ?? ""), MAX_TOKENS_PER_CHUNK),
  }));

// Apply total token budget: stop adding chunks once we'd exceed the limit
const budgeted: typeof filtered = [];
let totalTokens = 0;
for (const chunk of filtered) {
  const chunkTokens = Math.ceil(chunk.text.length / 4);
  if (totalTokens + chunkTokens > MAX_TOTAL_TOKENS) break;
  budgeted.push(chunk);
  totalTokens += chunkTokens;
}

The score threshold and token cap solve different problems: score threshold removes low-quality results that would pollute the LLM's context with irrelevant noise; the token cap prevents high-quality results from exhausting the LLM's context budget before the agent can include its own context. Apply both. For a deeper treatment, see embedding generation in MCP servers and semantic caching for MCP tools.

Pattern 2 — Streaming-to-buffered conversion

The MCP protocol is synchronous at the tool call level: a client sends a tools/call request and waits for a CallToolResult response. There's no mechanism for a tool handler to send partial results mid-call. LLM provider SDKs are designed for the opposite: they stream tokens as they generate so users see output immediately. Every LLM integration therefore requires the same conversion step: buffer the stream and return only when it's complete.

Where the integrations diverge is in how you buffer — and the difficulty of the buffering step is inversely proportional to how obvious it is that you need it.

OpenAI: stream.finalChatCompletion()

The OpenAI Node.js SDK's streaming API returns an async iterable. Without buffering, a naively written tool handler would return a stream object that the MCP SDK can't serialize:

// ✗ WRONG: this is a stream object, not a result the MCP SDK can serialize
const unbuffered = openai.chat.completions.stream({ model, messages, max_tokens });

// ✓ CORRECT — buffer to completion, then return the full message
const stream = openai.chat.completions.stream({ model, messages, max_tokens });
const completion = await stream.finalChatCompletion();
const text = completion.choices[0]?.message?.content ?? "";
const usage = completion.usage;

return {
  content: [{ type: "text", text: JSON.stringify({ reply: text, usage }) }],
};

finalChatCompletion() is a convenience method on the stream object that resolves only when generation is complete; under the hood it consumes the async iterator. If you don't need the stream at all, await openai.chat.completions.create({ ... }) with stream: false (or with the flag omitted) returns the completion directly, with no stream object involved. Going through stream.finalChatCompletion() is still useful when you also want to monitor token throughput or detect a stalled stream. For related streaming patterns, see streaming responses in MCP servers.

Anthropic: stream.finalMessage()

Anthropic's SDK follows the same pattern with a different method name. The SDK's streaming response exposes stream.finalMessage() which resolves to a complete Message object:

// ✓ CORRECT — buffer the Anthropic stream to completion
const stream = anthropic.messages.stream({
  model,
  max_tokens,
  system: system_prompt
    ? [{ type: "text", text: system_prompt, cache_control: { type: "ephemeral" } }]
    : undefined,
  messages: [{ role: "user", content: user_message }],
});

const message = await stream.finalMessage();

const textBlock = message.content.find(b => b.type === "text");
const replyText = textBlock?.type === "text" ? textBlock.text : "(no text response)";

return {
  content: [{ type: "text", text: JSON.stringify({
    reply: replyText,
    model: message.model,
    stop_reason: message.stop_reason,
    usage: {
      input_tokens: message.usage.input_tokens,
      output_tokens: message.usage.output_tokens,
      cache_read_input_tokens: message.usage.cache_read_input_tokens ?? 0,
      cache_creation_input_tokens: message.usage.cache_creation_input_tokens ?? 0,
    },
  })}],
};

One advantage of using the stream API even when you're buffering: stream.finalMessage() returns the same Message object as the non-streaming API, but you can also attach event handlers to monitor the stream in flight — useful for logging latency to first token or detecting stuck generations:

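const startedAt = Date.now();
let firstTokenMs: number | null = null;
stream.on("text", () => {
  // Fires on each text delta; record latency to first token once
  if (firstTokenMs === null) firstTokenMs = Date.now() - startedAt;
});
// For tool_use blocks, the "inputJson" event fires on each input JSON delta instead
// Stream finalizes in the same finalMessage() call regardless
const message = await stream.finalMessage();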

Ollama: stream: false in the request body (the non-obvious one)

Ollama is the most problematic case because streaming is the default rather than an opt-in. When you fetch Ollama's /api/chat or /api/generate endpoint without setting stream: false, Ollama responds with newline-delimited JSON (NDJSON): one JSON object per generated chunk, separated by newlines. That body is not a single JSON document, so await res.json() fails with a SyntaxError as soon as it reaches the second line, and a handler that works around the error by parsing only the first line gets a partial object that claims done: false. Either way the tool never returns the complete response.

The fix is a single flag in the request body:

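// Abort if inference runs too long; the 2-minute value is an assumption, tune it for your hardware
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 120_000);

const res = await fetch(`${OLLAMA_BASE}/api/chat`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model,
    messages,
    stream: false,  // CRITICAL: Ollama streams NDJSON by default; this returns a single JSON object
    options: { temperature },
  }),
  signal: controller.signal,
});
clearTimeout(timeout);

// Surface HTTP-level failures (for example an unknown model) before parsing the body
if (!res.ok) throw new Error(`Ollama HTTP ${res.status}: ${await res.text()}`);

// Now res.json() returns a single complete object: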
const data = await res.json() as {
  message: { role: string; content: string };
  done: boolean;
  done_reason: string;
  prompt_eval_count: number;
  eval_count: number;
  total_duration: number;
};

If you forget stream: false, the symptom is either a JSON parse error from res.json() or, when the handler reads only the first line, a result like { "message": { "role": "assistant", "content": "Tell" }, "done": false } instead of a complete response. That's the bug. It's one of the most common Ollama integration mistakes because the default streaming behavior is easy to overlook.
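If you cannot control the request body (for example, when a proxy sits between the tool and Ollama), the NDJSON body can still be buffered by hand. The following is a minimal sketch, assuming the chunk shape described above; assembleNdjsonReply is a hypothetical helper name, and it expects the full response text, such as the result of await res.text().

```typescript
// One line of Ollama's streamed /api/chat output (only the fields used here).
type OllamaChatChunk = {
  message?: { role: string; content: string };
  done: boolean;
};

// Join an NDJSON body into one reply. Throws if the stream ended before done: true.
function assembleNdjsonReply(body: string): { content: string; chunks: number } {
  let content = "";
  let chunks = 0;
  let done = false;
  for (const line of body.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines and the trailing newline
    const chunk = JSON.parse(line) as OllamaChatChunk;
    content += chunk.message?.content ?? "";
    chunks += 1;
    if (chunk.done) done = true;
  }
  if (!done) throw new Error("Ollama stream ended before done: true");
  return { content, chunks };
}
```

Treating a missing done: true as an error keeps a cut-off generation from being returned to the agent as if it were a complete answer.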

Vector databases: already synchronous

Vector databases are the exception: they don't stream. A Pinecone index.query() or Chroma collection.query() call resolves with all results at once, so no buffering step is needed. The score threshold and token caps from Pattern 1 still apply to those results before you return them.

Provider | Default behavior | Buffering approach | Difficulty
OpenAI | Non-streaming with create(); async iterable with .stream() or stream: true | stream.finalChatCompletion() or stream: false | Low (well-documented)
Anthropic | Non-streaming with messages.create(); async iterable with messages.stream() | stream.finalMessage() | Low (well-documented)
Ollama | Streaming (NDJSON) by default | stream: false in request body | High (easy to miss)
Vector databases | Already synchronous | No buffering needed | N/A (but apply chunk caps)

Pattern 3 — Model/provider selection tools

Hardcoding a model name in a tool definition is a local optimization that creates a global constraint. Once a model name is hardcoded in a tool, every call to that tool uses that model regardless of the task — a simple classification task pays Opus prices, a complex reasoning task gets Haiku quality. The fix is to expose a model parameter (with a sensible default) and a separate list_models tool that returns the options with enough quality/cost/speed information for the agent to make an informed choice.

OpenAI: separate quality and speed tiers clearly

The OpenAI model family has clear cost/quality tiers that map to different task types:

server.tool(
  "list_openai_models",
  "List available OpenAI models with quality, cost, and context window information",
  {},
  async () => {
    return {
      content: [{ type: "text", text: JSON.stringify({
        models: [
          {
            id: "gpt-4o",
            tier: "premium",
            context_tokens: 128_000,
            best_for: "Complex reasoning, multi-step analysis, tool use chains",
            relative_cost: "high",
          },
          {
            id: "gpt-4o-mini",
            tier: "standard",
            context_tokens: 128_000,
            best_for: "Fast extraction, classification, simple Q&A, high-volume tasks",
            relative_cost: "low",
          },
          {
            id: "gpt-3.5-turbo",
            tier: "legacy",
            context_tokens: 16_385,
            best_for: "Very simple tasks where cost is paramount",
            relative_cost: "lowest",
          },
        ],
        recommendation: "Default to gpt-4o-mini; use gpt-4o for tasks requiring multi-step reasoning or tool use",
      })}],
    };
  }
);

The model selection tool also gives you a safe place to centralize model ID validation: if OpenAI deprecates a model, update the list tool and the enum in your chat tool's schema rather than hunting for hardcoded strings across multiple tools.
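One way to get that centralization is to derive both the list tool's output and the chat tool's validation from a single constant. The sketch below uses plain TypeScript rather than a schema library; OPENAI_MODELS, resolveModel, and the chosen default are illustrative names and values, not part of any SDK.

```typescript
// Single source of truth for model IDs; the entries mirror the list tool above.
const OPENAI_MODELS = [
  { id: "gpt-4o", tier: "premium" },
  { id: "gpt-4o-mini", tier: "standard" },
  { id: "gpt-3.5-turbo", tier: "legacy" },
] as const;

type OpenAIModelId = (typeof OPENAI_MODELS)[number]["id"];

const DEFAULT_MODEL: OpenAIModelId = "gpt-4o-mini";

function isKnownModel(id: string): id is OpenAIModelId {
  return OPENAI_MODELS.some((m) => m.id === id);
}

// Hypothetical helper: resolve the agent's requested model, or explain the valid options.
function resolveModel(
  requested?: string,
):
  | { ok: true; model: OpenAIModelId }
  | { ok: false; error: string; valid: OpenAIModelId[] } {
  if (requested === undefined) return { ok: true, model: DEFAULT_MODEL };
  if (isKnownModel(requested)) return { ok: true, model: requested };
  return {
    ok: false,
    error: `Unknown model "${requested}"`,
    valid: OPENAI_MODELS.map((m) => m.id),
  };
}
```

When a model is retired, removing its entry from OPENAI_MODELS updates the list tool, the type, and the validation together, and the error path tells the agent which IDs are still valid.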

Anthropic: Opus/Sonnet/Haiku map to cost tiers, not just quality

The Claude model family's three-tier structure is designed for cost-aware selection. Haiku is not a "worse" model; it's a fast, cheap model that is often the better choice for extraction and classification tasks, where a larger model adds cost without adding much accuracy. The selection tool should communicate this:

server.tool(
  "list_claude_models",
  "List available Claude models with their strengths and cost tier",
  {},
  async () => {
    return {
      content: [{ type: "text", text: JSON.stringify({
        models: [
          {
            id: "claude-opus-4-7",
            tier: "premium",
            context_tokens: 200_000,
            best_for: "Complex multi-step reasoning, high-stakes analysis, nuanced judgment",
            relative_cost: "highest",
          },
          {
            id: "claude-sonnet-4-6",
            tier: "standard",
            context_tokens: 200_000,
            best_for: "Balanced performance, production workloads, most tasks",
            relative_cost: "medium",
          },
          {
            id: "claude-haiku-4-5-20251001",
            tier: "fast",
            context_tokens: 200_000,
            best_for: "Classification, extraction, summarization, high-volume processing",
            relative_cost: "low",
          },
        ],
        prompt_caching_note: "All models support prompt caching via cache_control. Cache reads cost 10% of normal input token price.",
      })}],
    };
  }
);

One Anthropic-specific detail worth including in the list tool: prompt caching applies to all three models, but the break-even point is different. At Haiku prices, caching is less dramatic as a cost reduction; at Opus prices, a long system prompt cached across 100 calls saves significantly. The cache_read_input_tokens field in every response lets you calculate the actual savings per session.

Ollama: list installed models from the local server

Ollama's model list is dynamic — it reflects what's actually installed on the local machine, not a fixed catalog. The list_models tool should always query the live Ollama server rather than returning a hardcoded list:

server.tool(
  "list_ollama_models",
  "List models currently installed on this Ollama server, with context window and size information",
  {},
  async () => {
    try {
      const res = await ollamaFetch("/api/tags");
      const data = await res.json() as {
        models: Array<{
          name: string;
          size: number;
          modified_at: string;
          details?: { parameter_size?: string; context_length?: number };
        }>
      };

      // Annotate with known context limits where Ollama doesn't expose them
      const KNOWN_CONTEXT: Record<string, number> = {
        "llama3.2": 8_192, "mistral": 32_768, "gemma3": 8_192,
        "qwen2.5": 32_768, "phi4": 16_384, "deepseek-r1": 65_536,
      };

      return {
        content: [{ type: "text", text: JSON.stringify({
          models: data.models.map(m => {
            const baseName = m.name.split(":")[0];
            return {
              name: m.name,
              size_gb: (m.size / 1e9).toFixed(1),
              context_tokens: m.details?.context_length ?? KNOWN_CONTEXT[baseName] ?? "unknown",
              modified_at: m.modified_at,
            };
          }),
          hint: "Use check_ollama_health first to verify the server is running",
        })}],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Cannot list models: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

The dynamic listing matters for Ollama because an agent asking "what models are available?" will get wrong answers if you return a hardcoded list that includes models the user never pulled. A model that isn't installed then fails only at call time, after the agent has already planned around it.

Vector databases: list indexes instead of models

The selection tool concept applies to vector databases too, but the choice is an index rather than a model. An agent searching across multiple knowledge bases needs to know what indexes exist, what they contain, and their dimensionality (to match the embedding model). A Pinecone version of the tool, list_pinecone_indexes, looks like this:

server.tool(
  "list_pinecone_indexes",
  "List available Pinecone indexes with their dimension and metric, so you can select the right index for a query",
  {},
  async () => {
    try {
      const indexes = await pinecone.listIndexes();
      return {
        content: [{ type: "text", text: JSON.stringify({
          indexes: indexes.indexes?.map(idx => ({
            name: idx.name,
            dimension: idx.dimension,
            metric: idx.metric,
            status: idx.status?.ready ? "ready" : "not_ready",
          })) ?? [],
          hint: "Dimension must match your embedding model: text-embedding-3-small = 1536, text-embedding-3-large = 3072",
        })}],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Pinecone index list error: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

The dimension/metric mismatch between embedding model and index is the most common vector database error in MCP tools — an agent generates a 1536-dimension embedding and queries a 3072-dimension index, getting a dimension mismatch error that looks like a configuration failure rather than a tool design gap. Exposing the dimension in the list tool lets the agent cross-check before querying.
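The cross-check itself is a small lookup. As a sketch, assuming the two default dimensions quoted in the hint above (checkDimensionMatch and modelsForIndexDimension are hypothetical helper names):

```typescript
// Default output dimensions, taken from the hint in the list tool.
const EMBEDDING_DIMENSIONS: Record<string, number> = {
  "text-embedding-3-small": 1536,
  "text-embedding-3-large": 3072,
};

// Return the embedding models whose output dimension matches the index.
function modelsForIndexDimension(indexDimension: number): string[] {
  return Object.keys(EMBEDDING_DIMENSIONS).filter(
    (model) => EMBEDDING_DIMENSIONS[model] === indexDimension,
  );
}

// Check a planned (model, index) pairing before spending an embedding call on it.
function checkDimensionMatch(
  model: string,
  indexDimension: number,
): { ok: boolean; expected: number | null; compatible: string[] } {
  const expected = EMBEDDING_DIMENSIONS[model] ?? null;
  return {
    ok: expected === indexDimension,
    expected,
    compatible: modelsForIndexDimension(indexDimension),
  };
}
```

Returning the compatible list alongside the failure gives the agent a concrete next step: switch embedding models rather than retry the same query.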

Where the integrations diverge

The three patterns above are constant across all four integrations. Below are the things that are genuinely different — places where you can't carry a pattern from one integration and expect it to work in another.

Auth: from none to explicit API keys

The four integrations span the full range of auth complexity:

Integration | Auth mechanism | Failure mode to handle
OpenAI | OPENAI_API_KEY env var | Expired/revoked key throws AuthenticationError; catch it and rethrow instead of returning isError: true
Anthropic | ANTHROPIC_API_KEY env var | Same as OpenAI: rethrow on auth failure
Ollama | None (assumes localhost:11434) | Process not running → connection refused → tool timeout after 2 minutes unless you add a health check
Pinecone | PINECONE_API_KEY env var | Wrong region or index namespace returns empty matches, not an auth error
Chroma (local) | None (default), token auth optional | Server not running → connection refused, same pattern as Ollama

The Ollama and Chroma pattern — where "auth failure" is actually "process not running" — requires a different handling strategy than API key auth. Add an explicit health check tool (check_ollama_health, check_chroma_health) that verifies connectivity early, and check health at MCP server startup so you surface the failure immediately rather than on the first inference call. See the Ollama integration guide for the health check implementation.
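A health check along these lines can be quite small. The sketch below is not the implementation from the Ollama guide; it assumes that a successful GET on /api/tags means the server is up, uses a 2-second timeout chosen for illustration, and takes the fetch function as a parameter (ProbeFetch is a hypothetical type) so the check can be exercised without a running server.

```typescript
type HealthResult =
  | { healthy: true; latencyMs: number }
  | { healthy: false; error: string; hint: string };

// Minimal shape of the fetch call this sketch needs; the global fetch satisfies it.
type ProbeFetch = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean; status: number }>;

async function checkOllamaHealth(
  baseUrl = "http://localhost:11434",
  timeoutMs = 2_000, // assumption: fail fast instead of waiting for the tool-call timeout
  probe: ProbeFetch = fetch,
): Promise<HealthResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const startedAt = Date.now();
  try {
    // /api/tags is the endpoint the list tool already uses, so it doubles as a liveness probe.
    const res = await probe(`${baseUrl}/api/tags`, { signal: controller.signal });
    if (!res.ok) {
      return { healthy: false, error: `HTTP ${res.status}`, hint: "Ollama responded but reported an error" };
    }
    return { healthy: true, latencyMs: Date.now() - startedAt };
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    return { healthy: false, error: msg, hint: "Run `ollama serve` to start the Ollama server" };
  } finally {
    clearTimeout(timer);
  }
}
```

Calling this once at server startup and again from a check_ollama_health tool covers both cases: the process was never started, and the process died mid-session.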

Error handling: RateLimitError vs overloaded_error — they're not the same

Both OpenAI and Anthropic can be temporarily unavailable, but the error type and the correct response differ significantly:

OpenAI's RateLimitError (HTTP 429) means you've exceeded a quota: requests per minute, tokens per minute, or requests per day. The x-ratelimit-reset-requests response header tells you how long until the quota resets, as a duration string such as 1s or 6m0s. The backoff strategy is to wait until reset and retry. This is a billing/usage signal.

Anthropic's overloaded_error (HTTP 529) means Anthropic's API is under high load and can't accept new requests right now. It's a capacity signal, not a quota signal. There's no "wait until your quota resets" — you wait until load decreases, which may be seconds or minutes. The correct backoff is exponential without a reset timer, and the error is transient by nature rather than tied to your usage level.

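// OpenAI rate limit handling: wait for the quota window to reset
if (err instanceof OpenAI.RateLimitError) {
  // The header is a duration string ("1s", "6m0s", "20ms"), so parseInt alone misreads minutes and milliseconds
  const reset = String(err.headers?.["x-ratelimit-reset-requests"] ?? "");
  const unitMs: Record<string, number> = { ms: 1, s: 1_000, m: 60_000, h: 3_600_000 };
  const parts = [...reset.matchAll(/(\d+(?:\.\d+)?)(ms|s|m|h)/g)];
  const resetMs = parts.length > 0
    ? Math.ceil(parts.reduce((sum, [, n, unit]) => sum + parseFloat(n) * unitMs[unit], 0))
    : 60_000; // header missing or unparseable: fall back to 60s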
  return {
    isError: true,
    content: [{ type: "text", text: JSON.stringify({
      error: "rate_limited",
      retry_after_ms: resetMs,
      retryable: true,
      message: `OpenAI rate limit hit (quota). Retry after ${Math.ceil(resetMs / 1000)}s when your quota window resets.`,
    })}],
  };
}

// Anthropic overloaded_error — capacity, not quota
if (err instanceof Anthropic.APIError && err.status === 529) {
  const backoffMs = 5_000; // start at 5s, not at a reset timer
  return {
    isError: true,
    content: [{ type: "text", text: JSON.stringify({
      error: "provider_overloaded",
      retry_after_ms: backoffMs,
      retryable: true,
      message: "Anthropic API is overloaded (capacity, not quota). Retry with exponential backoff.",
    })}],
  };
}

The practical difference: if you treat overloaded_error like a rate limit and wait for a quota reset that never comes, you'll retry at the wrong time. If you treat a RateLimitError like a transient capacity issue and retry with random jitter, you'll burn more quota while the window hasn't reset yet. For comprehensive error handling patterns, see MCP server error handling and rate limiting in MCP servers.
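On the calling side, the two error payloads can drive different wait strategies. The following is a sketch under stated assumptions: the payload shape matches the two handlers above, the 60-second cap is arbitrary, and the "equal jitter" variant (half fixed, half random) is one reasonable choice among several, not something either provider prescribes.

```typescript
// Shape of the error payloads returned by the two handlers above.
type RetryableError = {
  error: "rate_limited" | "provider_overloaded";
  retry_after_ms: number;
};

// Exponential backoff with equal jitter: half of the ceiling is fixed, half is random,
// so the delay never collapses to zero. capMs and the injectable random source are assumptions.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs = 60_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(ceiling / 2 + random() * (ceiling / 2));
}

// Quota errors wait for the reported reset; capacity errors back off exponentially.
function nextRetryDelayMs(
  err: RetryableError,
  attempt: number,
  random: () => number = Math.random,
): number {
  if (err.error === "rate_limited") return err.retry_after_ms;
  return backoffDelayMs(attempt, err.retry_after_ms, 60_000, random);
}
```

For rate_limited the attempt number is ignored on purpose: retrying before the window resets only spends more quota, so the reported reset time is the delay.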

The Ollama-specific errors: process crashes, not API errors

Ollama's error surface is fundamentally different from the managed cloud providers because Ollama is a local process, not a remote service. The errors you encounter are:

function ollamaErrorToToolResult(err: unknown, model: string): CallToolResult {
  const msg = err instanceof Error ? err.message : String(err);

  if (msg.includes("ECONNREFUSED") || msg.includes("connection refused")) {
    return { isError: true, content: [{ type: "text", text: JSON.stringify({
      error: "ollama_not_running",
      hint: "Run `ollama serve` to start the Ollama server",
      retryable: false,
    })}] };
  }
  if (msg.includes("model") && msg.includes("not found")) {
    return { isError: true, content: [{ type: "text", text: JSON.stringify({
      error: "model_not_installed",
      model,
      hint: `Run \`ollama pull ${model}\` to install it`,
      retryable: false,
    })}] };
  }
  if (err instanceof Error && err.name === "AbortError") {
    return { isError: true, content: [{ type: "text", text: JSON.stringify({
      error: "inference_timeout",
      model,
      hint: "Try reducing num_ctx or switching to a smaller model",
      retryable: true,
    })}] };
  }
  return { isError: true, content: [{ type: "text", text: `Ollama error: ${msg}` }] };
}

The retryable: false flag on connection-refused and model-not-found errors is important: these are configuration problems the LLM cannot fix by retrying. Setting retryable: false signals to the agent that it should escalate rather than loop.

Vector database errors: dimension mismatches and index-level failures

Vector databases have their own error category that doesn't exist in text generation APIs: structural incompatibility between the embedding you're querying with and the index you're querying against.

function vectorDbErrorToToolResult(err: unknown, indexName: string): CallToolResult {
  const msg = err instanceof Error ? err.message : String(err);

  // Pinecone dimension mismatch — most common misconfiguration
  if (msg.includes("dimension") && (msg.includes("mismatch") || msg.includes("does not match"))) {
    return { isError: true, content: [{ type: "text", text: JSON.stringify({
      error: "dimension_mismatch",
      index: indexName,
      hint: "Call list_pinecone_indexes to check the index dimension, then use the matching embedding model (3-small=1536, 3-large=3072)",
      retryable: false,
    })}] };
  }

  // Index not found — can happen if index is deleted or wrong namespace
  if (msg.includes("not found") || msg.includes("404")) {
    return { isError: true, content: [{ type: "text", text: JSON.stringify({
      error: "index_not_found",
      index: indexName,
      hint: "Call list_pinecone_indexes to see available indexes",
      retryable: false,
    })}] };
  }

  return { isError: true, content: [{ type: "text", text: `Vector DB error: ${msg}` }] };
}

For the broader vector database integration — Chroma, Weaviate, and pgvector alongside Pinecone — see the vector database integration guide. For caching repeated embedding queries to avoid both latency and cost, see semantic caching for MCP tools and MCP server caching patterns.

The AWS Bedrock variant

AWS Bedrock adds a fifth integration in this arc that uses a different API shape: instead of a provider-specific SDK, you call the Bedrock InvokeModel or Converse API with a model ARN. The same three patterns apply — token budget management (Bedrock returns inputTokens and outputTokens in the response metadata), streaming-to-buffered conversion (Bedrock streams with InvokeModelWithResponseStream and requires collecting chunks until the amazon-bedrock-invocationMetrics event arrives), and model selection (Bedrock's ListFoundationModels API returns available models). The auth model differs: Bedrock uses AWS IAM credentials rather than a dedicated API key. See the AWS Bedrock MCP integration guide for the full pattern.

Composing the patterns: RAG as the example

Retrieval-augmented generation (RAG) composes all four integrations in a single tool chain — and all three patterns apply simultaneously. The agent calls a flow that:

  1. Generates an embedding for the query (OpenAI embeddings API — token budget on query length, model selection between 3-small and 3-large)
  2. Searches a vector database for relevant chunks (Pinecone/Chroma — score threshold, chunk token cap, index selection)
  3. Calls an LLM with the chunks as context (OpenAI or Anthropic — count total tokens before calling, buffer the stream)

Each step has its own token budget constraint that cascades into the next. A common bug: the vector database returns 10 chunks that pass the score threshold, and together they total 6,000 tokens. Added to the system prompt and the original query, the prompt reaches 9,000 tokens. gpt-3.5-turbo has a 16,385-token context window and a 4,096-token output cap, so a 9,000-token prompt with max_tokens: 4096 needs 13,096 tokens and fits. Then the system prompt changes and adds 5,000 more tokens: the request now needs 18,096 tokens and overflows the context.
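The arithmetic in that scenario can be worked through in a few lines. This is a sketch only: checkWindow is a hypothetical helper, and the window is taken as 16,385 tokens, an assumption for gpt-3.5-turbo that you should replace with your model's actual limit:

```typescript
type BudgetCheck = { required: number; window: number; fits: boolean; overflowBy: number };

// Hypothetical helper: the prompt plus the reserved output must fit the window.
function checkWindow(
  promptTokens: number,
  maxOutputTokens: number,
  contextWindow: number,
): BudgetCheck {
  const required = promptTokens + maxOutputTokens;
  return {
    required,
    window: contextWindow,
    fits: required <= contextWindow,
    overflowBy: Math.max(0, required - contextWindow),
  };
}

const WINDOW = 16_385; // assumed context window; substitute your model's limit

// 6,000 tokens of chunks plus system prompt and query: a 9,000-token prompt.
const before = checkWindow(9_000, 4_096, WINDOW);
// required: 13096, fits: true

// The system prompt grows by 5,000 tokens.
const after = checkWindow(9_000 + 5_000, 4_096, WINDOW);
// required: 18096, fits: false, overflowBy: 1711
```

The prompt that used to fit now exceeds the window by 1,711 tokens, even though neither the query nor the retrieved chunks changed.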

The correct RAG tool chain applies budget checks at each step and short-circuits early with isError: true when any step would exceed the next step's budget:

server.tool(
  "rag_search_and_answer",
  "Search the knowledge base and generate an answer using retrieved context",
  {
    query: z.string().max(1000),
    model: z.enum(["gpt-4o", "gpt-4o-mini"]).default("gpt-4o-mini"),
    max_chunks: z.number().int().min(1).max(10).default(5),
    max_tokens: z.number().int().min(100).max(4096).default(1024),
  },
  async ({ query, model, max_chunks, max_tokens }) => {
    // Step 1: embed the query
    const embeddingRes = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: query,
    });
    const embedding = embeddingRes.data[0].embedding;

    // Step 2: search vector database with token-capped chunks
    const index = pinecone.index(process.env.PINECONE_INDEX_NAME!);
    const results = await index.query({
      vector: embedding, topK: max_chunks,
      includeMetadata: true, includeValues: false,
    });

    const chunks = (results.matches ?? [])
      .filter(m => (m.score ?? 0) >= 0.7)
      .map(m => truncateToTokens(String(m.metadata?.text ?? ""), 400))
      .slice(0, max_chunks);

    // Step 3: build prompt and check token budget before calling LLM
    const systemPrompt = "You are a helpful assistant. Answer using only the provided context.";
    const userMessage = `Context:\n${chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n")}\n\nQuestion: ${query}`;
    const messages = [
      { role: "system" as const, content: systemPrompt },
      { role: "user" as const, content: userMessage },
    ];

    const { ok, inputTokens, limit } = checkTokenBudget(model, messages, max_tokens);
    if (!ok) {
      return { isError: true, content: [{ type: "text", text: JSON.stringify({
        error: "context_overflow_in_rag",
        input_tokens: inputTokens,
        limit,
        hint: `Reduce max_chunks or use a model with a larger context window`,
      })}] };
    }

    // Step 4: call the LLM without streaming (create() resolves with the full completion unless stream: true is set)
    const completion = await openai.chat.completions.create({ model, messages, max_tokens });
    const reply = completion.choices[0]?.message?.content ?? "";

    return { content: [{ type: "text", text: JSON.stringify({
      reply,
      chunks_used: chunks.length,
      input_tokens: inputTokens,
      usage: completion.usage,
    })}] };
  }
);

This tool compresses all three patterns into a single handler. Each step's output is the next step's input, and the token budget check before the LLM call guards against the accumulated context of all previous steps.
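One possible variation, not part of the handler above: instead of returning an error when the prompt overflows, drop the lowest-ranked chunks until the context fits. The sketch below assumes chunks arrive sorted by score. Both estimateTokens and packChunks are hypothetical names, and the estimate of roughly 4 characters per token is an approximation you would replace with a real tokenizer:

```typescript
// Rough estimate only: ~4 characters per token is an assumption, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep chunks in ranked order until the next one would exceed the budget.
function packChunks(
  chunks: string[],
  budgetTokens: number,
): { kept: string[]; usedTokens: number } {
  const kept: string[] = [];
  let usedTokens = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    if (usedTokens + cost > budgetTokens) break; // stop at the first chunk that does not fit
    kept.push(chunk);
    usedTokens += cost;
  }
  return { kept, usedTokens };
}
```

The tradeoff is that the agent gets an answer built on less context rather than an explicit error, so a tool using this approach should report how many chunks it kept.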

Monitoring: why LLM provider outages look like tool crashes

A final note on observability that applies to all four integrations: when an LLM provider is down, the error that surfaces in the MCP tool response looks identical to a bug in your tool handler. From the calling agent's perspective, isError: true with "OpenAI is unavailable" and isError: true with "tool threw an exception" are the same signal. Without external monitoring, the only way to tell "my tool is broken" from "OpenAI is having an outage" is to check the status page manually.
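Inside the handler, one way to make the two cases less ambiguous is to tag each error with its likely source. This is a sketch under a stated assumption: provider SDK errors expose a numeric HTTP status, and anything without one is treated as a handler bug. The toToolError name and the field names in the payload are hypothetical:

```typescript
type ErrorSource = "provider" | "tool";

type ToolError = {
  isError: true;
  content: { type: "text"; text: string }[];
};

// Assumption: provider errors carry a numeric `status`. 429 and 5xx (including
// Anthropic's 529) count as provider-side; everything else is attributed to
// the tool, including 4xx responses caused by bad configuration.
function toToolError(err: unknown, provider: string): ToolError {
  const status =
    typeof err === "object" && err !== null && "status" in err
      ? Number((err as { status: unknown }).status)
      : NaN;
  const fromProvider = Number.isFinite(status) && (status === 429 || status >= 500);
  const source: ErrorSource = fromProvider ? "provider" : "tool";
  const message = err instanceof Error ? err.message : String(err);

  return {
    isError: true,
    content: [{
      type: "text",
      text: JSON.stringify({
        error: fromProvider ? "provider_unavailable" : "tool_failure",
        source,
        provider: fromProvider ? provider : undefined, // omitted for tool errors
        status: fromProvider ? status : undefined,
        retryable: fromProvider,
        message,
      }),
    }],
  };
}
```

Tagging helps the calling agent decide whether to retry, but it only covers failures the handler gets to catch. It says nothing when the server itself is unreachable.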

AliveMCP probes your MCP server's protocol endpoint every 60 seconds — not the LLM provider's status page, but your own server's tools/call responses. This means it catches the cases that status pages miss: your API key expired at 3am, the Ollama process crashed on the GPU box, or Pinecone's index in your region is degraded while the global status page shows green. If your MCP server is down, AliveMCP tells you before your users do.

Further reading