MCP server OpenAI integration
Building MCP tools that call the OpenAI API surfaces three problems that typical REST integrations don't have: token budget management (a prompt that overflows the context window fails the tool call with an opaque error), streaming responses (the SDK streams through an async iterator, but an MCP tool handler returns one complete result), and model selection (hardcoding a model ID in a tool description limits reuse). This guide covers the production patterns for each.
TL;DR
Install openai and initialize with OPENAI_API_KEY. For chat completions, count tokens with gpt-tokenizer before calling, and return isError: true if the prompt plus the requested output exceeds the model's context window. Buffer streaming responses with stream.finalChatCompletion() rather than returning mid-stream text. Handle RateLimitError with exponential backoff (respect the x-ratelimit-reset-requests header). Expose a list_available_models tool so the agent can select the right model per subtask rather than hardcoding. Wire AliveMCP to monitor your MCP server's endpoint: an OpenAI API outage looks identical to a tool crash unless you have protocol-level probing.
SDK setup and authentication
Install the official OpenAI Node.js SDK, the MCP TypeScript SDK, Zod for argument validation, and gpt-tokenizer for token counting:
npm install openai @modelcontextprotocol/sdk zod gpt-tokenizer
import OpenAI from "openai";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";
// Initialize once at module level — the client maintains a connection pool
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
// Optional: scope to a project for key rotation isolation
// project: process.env.OPENAI_PROJECT_ID,
maxRetries: 2, // SDK default; retries 408, 409, 429, 5xx and connection errors automatically
timeout: 30_000, // ms — increase for long-generation requests
});
const server = new McpServer({ name: "openai-tools", version: "1.0.0" });
Project vs admin API keys: project keys (sk-proj-...) scope access to a single project and support per-project rate limits and spend caps. Use them for production MCP servers where you want billing isolation per product. Admin keys (sk-admin-...) are meant for organization management tasks such as creating and managing projects, not for model calls. For most MCP integrations a project key is the right choice because it limits blast radius if the key leaks.
Token counting before calls
The most common avoidable failure when calling OpenAI from an MCP tool: the combined prompt (system + user messages) plus the requested output exceeds the model's context window, and the API rejects the request with a context_length_exceeded error. Because MCP clients often surface tool errors as generic failures, catch this before the API call:
import { encode } from "gpt-tokenizer";
const MODEL_CONTEXT_LIMITS: Record<string, number> = {
"gpt-4o": 128_000,
"gpt-4o-mini": 128_000,
"gpt-4-turbo": 128_000,
"gpt-3.5-turbo": 16_385,
};
function countMessageTokens(messages: OpenAI.ChatCompletionMessageParam[]): number {
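// Approximation: about 4 framing tokens per message plus 2 for the reply primer; exact overhead varies by model
return messages.reduce((total, msg) => {
// Only string content is counted; multi-part content (text/image arrays) is treated as empty
const content = typeof msg.content === "string" ? msg.content : "";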
return total + encode(content).length + 4;
}, 2);
}
function checkTokenBudget(
model: string,
messages: OpenAI.ChatCompletionMessageParam[],
maxOutputTokens: number
): { ok: boolean; inputTokens: number; limit: number } {
const limit = MODEL_CONTEXT_LIMITS[model] ?? 8_192;
const inputTokens = countMessageTokens(messages);
return { ok: inputTokens + maxOutputTokens <= limit, inputTokens, limit };
}
Use this guard at the start of every chat completions tool. Return isError: true with the token counts so the calling agent can truncate its context and retry rather than hitting an opaque API error.
For more on the general token budget pattern inside MCP tools, see token budget management for MCP servers.
Chat completions tool with error handling
A production chat completions tool guards token budget, handles rate limits with backoff, and maps OpenAI error types to informative isError: true responses:
server.tool(
"chat_with_gpt",
"Send a message to an OpenAI chat model and return the response",
{
model: z.enum(["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"])
.default("gpt-4o-mini")
.describe("OpenAI model to use — gpt-4o-mini for fast/cheap, gpt-4o for complex reasoning"),
system_prompt: z.string().optional().describe("System message to set context and persona"),
user_message: z.string().describe("The user's message or question"),
max_tokens: z.number().int().min(1).max(4096).default(1024),
temperature: z.number().min(0).max(2).default(0.7),
},
async ({ model, system_prompt, user_message, max_tokens, temperature }) => {
const messages: OpenAI.ChatCompletionMessageParam[] = [
...(system_prompt ? [{ role: "system" as const, content: system_prompt }] : []),
{ role: "user", content: user_message },
];
// Token budget guard
const { ok, inputTokens, limit } = checkTokenBudget(model, messages, max_tokens);
if (!ok) {
return {
isError: true,
content: [{
type: "text",
text: `Prompt too long: ${inputTokens} input tokens + ${max_tokens} output tokens exceeds model limit of ${limit}. Reduce system_prompt or user_message length.`,
}],
};
}
try {
const completion = await openai.chat.completions.create({
model,
messages,
max_tokens,
temperature,
});
const reply = completion.choices[0]?.message.content ?? "(no response)";
return {
content: [{
type: "text",
text: JSON.stringify({
reply,
model: completion.model,
usage: completion.usage,
finish_reason: completion.choices[0]?.finish_reason,
}),
}],
};
} catch (err) {
return openaiErrorToToolResult(err);
}
}
);
function openaiErrorToToolResult(err: unknown) {
if (err instanceof OpenAI.APIError) {
return {
isError: true,
content: [{
type: "text" as const,
text: `OpenAI API error (${err.status} ${err.code}): ${err.message}`,
}],
};
}
throw err; // unexpected — re-throw so MCP surfaces it as a transport error
}
Rate limit handling
OpenAI rate limits operate on two axes simultaneously: Requests Per Minute (RPM) and Tokens Per Minute (TPM). Either can trigger a 429 error independently. The SDK retries 429s automatically (controlled by maxRetries), but for MCP tools that call OpenAI in a loop or process batches, you need explicit backoff logic.
| Tier | gpt-4o RPM | gpt-4o TPM | gpt-4o-mini RPM | gpt-4o-mini TPM |
|---|---|---|---|---|
| Free | 3 | 40,000 | 3 | 40,000 |
| Tier 1 | 500 | 30,000 | 500 | 200,000 |
| Tier 2 | 5,000 | 450,000 | 5,000 | 2,000,000 |
| Tier 3 | 5,000 | 800,000 | 5,000 | 4,000,000 |
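These figures are a snapshot and change over time; check the limits page in your OpenAI account for current values.
When your MCP tool gets a 429, check the x-ratelimit-reset-requests and x-ratelimit-reset-tokens headers for the time remaining until the limit resets rather than using a fixed sleep. The values are durations such as 1s or 6m0s, not timestamps: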
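// Reset headers hold durations such as "20ms", "1s" or "6m0s", not timestamps
function parseResetDurationMs(value: string): number | undefined {
  const unitMs: Record<string, number> = { h: 3_600_000, m: 60_000, s: 1_000, ms: 1 };
  const pattern = /(\d+(?:\.\d+)?)(ms|h|m|s)/g;
  let total: number | undefined;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(value)) !== null) {
    total = (total ?? 0) + Number(match[1]) * unitMs[match[2]];
  }
  return total; // undefined when the header could not be parsed
}

async function callWithRateLimitBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!(err instanceof OpenAI.RateLimitError)) throw err;
      if (attempt === maxAttempts - 1) throw err;
      // Depending on SDK version, headers may be a plain object or a Headers instance
      const headers = (err as any).headers;
      const resetHeader: string | undefined =
        typeof headers?.get === "function"
          ? headers.get("x-ratelimit-reset-requests") ?? undefined
          : headers?.["x-ratelimit-reset-requests"];
      const resetMs = resetHeader ? parseResetDurationMs(resetHeader) : undefined;
      // Wait for the advertised reset (capped at 30s), else fall back to exponential backoff
      const delayMs = resetMs !== undefined
        ? Math.min(resetMs + 100, 30_000)
        : Math.min(1000 * 2 ** attempt, 30_000);
      await new Promise(res => setTimeout(res, delayMs));
    }
  }
  throw new Error("unreachable");
}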
For more on rate limit patterns in MCP tools generally, see MCP server rate limiting and per-tool rate limiting.
Buffered streaming for MCP tools
The OpenAI SDK's streaming API uses an async iterator: you read chunks as they arrive. An MCP tool handler is async but returns one complete result, so streaming doesn't compose directly. The clean solution is to buffer the full stream inside the tool before returning.
server.tool(
"generate_long_form",
"Generate long-form text (article, code, etc.) — buffers the full stream before returning",
{
prompt: z.string(),
model: z.enum(["gpt-4o", "gpt-4o-mini"]).default("gpt-4o"),
max_tokens: z.number().int().default(2048),
},
async ({ prompt, model, max_tokens }) => {
try {
// stream.finalChatCompletion() buffers the full stream and returns
// the same shape as a non-streaming completion. On older 4.x SDK releases
// the helper may live at openai.beta.chat.completions.stream instead.
const stream = await openai.chat.completions.stream({
model,
messages: [{ role: "user", content: prompt }],
max_tokens,
});
const completion = await stream.finalChatCompletion();
const text = completion.choices[0]?.message.content ?? "";
return {
content: [{
type: "text",
text: JSON.stringify({
text,
usage: completion.usage,
finish_reason: completion.choices[0]?.finish_reason,
}),
}],
};
} catch (err) {
return openaiErrorToToolResult(err);
}
}
);
The two-tool alternative for truly long generations: a start_generation tool that kicks off a background job and returns a job ID, and a poll_generation tool that checks status. This avoids MCP client timeouts for multi-minute generations. For most use cases under 60 seconds, buffered streaming is simpler and correct.
For the general streaming-in-MCP problem and the start/poll pattern, see streaming responses in MCP servers.
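The start/poll pattern can be sketched with an in-memory job table. This is a minimal illustration, not a production design: the names startGeneration, pollGeneration and the Job type are hypothetical, the generate callback stands in for whatever OpenAI call the tool makes, and jobs are lost on restart and never evicted.

```typescript
// Hypothetical job states for the start/poll pattern
type Job =
  | { status: "running" }
  | { status: "done"; text: string }
  | { status: "failed"; error: string };

const jobs = new Map<string, Job>();
let nextId = 1;

// Kicks off the work without awaiting it and returns a job ID immediately
function startGeneration(generate: () => Promise<string>): string {
  const id = `job-${nextId++}`;
  jobs.set(id, { status: "running" });
  Promise.resolve()
    .then(generate)
    .then(text => {
      jobs.set(id, { status: "done", text });
    })
    .catch((err: unknown) => {
      const error = err instanceof Error ? err.message : String(err);
      jobs.set(id, { status: "failed", error });
    });
  return id;
}

// Returns the current state, or "unknown" for IDs that were never issued
function pollGeneration(id: string): Job | { status: "unknown" } {
  return jobs.get(id) ?? { status: "unknown" };
}
```

In an MCP server, start_generation would wrap startGeneration and return the ID as text, while poll_generation would return the JSON-serialized job state.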
Function calling from within MCP tools
A less common but powerful pattern: your MCP tool passes a set of tools to the OpenAI API (using OpenAI's function calling feature) and lets GPT call those sub-tools to complete a task. This creates a nested execution loop inside a single MCP tool call:
const AVAILABLE_FUNCTIONS = {
get_weather: async ({ location }: { location: string }) => {
// Call your own weather API
return `Weather in ${location}: 72°F, partly cloudy`;
},
search_web: async ({ query }: { query: string }) => {
// Call your own search API
return `Search results for: ${query}`;
},
};
server.tool(
"gpt_with_tools",
"Run a GPT-4o agent loop that can call sub-tools (weather, search) to answer the question",
{
question: z.string(),
max_rounds: z.number().int().min(1).max(10).default(5),
},
async ({ question, max_rounds }) => {
const tools: OpenAI.ChatCompletionTool[] = [
{
type: "function",
function: {
name: "get_weather",
description: "Get current weather for a location",
parameters: { type: "object", properties: { location: { type: "string" } }, required: ["location"] },
},
},
{
type: "function",
function: {
name: "search_web",
description: "Search the web for current information",
parameters: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
},
},
];
const messages: OpenAI.ChatCompletionMessageParam[] = [
{ role: "user", content: question },
];
try {
for (let round = 0; round < max_rounds; round++) {
const completion = await openai.chat.completions.create({ model: "gpt-4o", messages, tools });
const choice = completion.choices[0];
messages.push(choice.message);
if (choice.finish_reason !== "tool_calls") {
return { content: [{ type: "text", text: choice.message.content ?? "(no answer)" }] };
}
// Execute each requested tool call
for (const call of choice.message.tool_calls ?? []) {
const fn = AVAILABLE_FUNCTIONS[call.function.name as keyof typeof AVAILABLE_FUNCTIONS];
let result: string;
try {
// Arguments are model-generated JSON and can be malformed
const args = JSON.parse(call.function.arguments);
result = fn ? await fn(args) : `Tool not found: ${call.function.name}`;
} catch (toolErr) {
// Report the failure to the model so it can correct the call
result = `Tool call failed: ${toolErr instanceof Error ? toolErr.message : String(toolErr)}`;
}
messages.push({ role: "tool", tool_call_id: call.id, content: result });
}
}
} catch (err) {
// Map OpenAI API failures to isError results, as the other tools do
return openaiErrorToToolResult(err);
}
return { isError: true, content: [{ type: "text", text: `Exceeded max_rounds (${max_rounds}) without a final answer` }] };
}
);
Model selection tool
Hardcoding a model name in a tool description creates a maintenance problem: when OpenAI releases a new model, every tool description that names one has to change. More importantly, calling agents can't choose a cheaper model for simple tasks. A list_available_models tool keeps the catalog in one place and lets the agent choose:
server.tool(
"list_available_models",
"List the OpenAI models available for use in other tools, with their context limits and cost tier",
{},
async () => {
const models = [
{ id: "gpt-4o", context_tokens: 128_000, cost_tier: "high", best_for: "complex reasoning, vision, long context" },
{ id: "gpt-4o-mini", context_tokens: 128_000, cost_tier: "low", best_for: "simple tasks, high volume, cost-sensitive" },
{ id: "gpt-4-turbo", context_tokens: 128_000, cost_tier: "high", best_for: "high-quality text, code generation" },
{ id: "gpt-3.5-turbo", context_tokens: 16_385, cost_tier: "low", best_for: "legacy workloads; gpt-4o-mini is generally the cheaper choice" },
];
return { content: [{ type: "text", text: JSON.stringify(models) }] };
}
);
Pair this with a chat_with_gpt tool that accepts a model argument. The calling agent calls list_available_models first when it needs to make a cost-vs-quality tradeoff, then picks a model for the actual completion call.
Error handling reference
| Error class | HTTP status | Common cause | Recommended response |
|---|---|---|---|
| AuthenticationError | 401 | Invalid or expired API key | isError: true; key needs rotation |
| PermissionDeniedError | 403 | Model not available on this org/project | isError: true; check tier / model access |
| NotFoundError | 404 | Model ID does not exist | isError: true; surface model name to caller |
| RateLimitError | 429 | RPM or TPM exceeded | Backoff and retry (see above) |
| InternalServerError | 500 | OpenAI infrastructure issue | Retry with backoff, surface if persistent |
| APIConnectionError | n/a | Network failure reaching OpenAI | Retry with backoff |
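The SDK's maxRetries setting retries RateLimitError, InternalServerError and APIConnectionError automatically. Handle RateLimitError yourself only when you need custom backoff behavior, such as honoring the reset headers across a batch. Avoid surfacing raw error messages to end users: they can contain prompt content or key prefixes. The err.message interpolation in openaiErrorToToolResult above is acceptable for a trusted calling agent, but replace it with fixed messages when tool results reach end users.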
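A sketch of the fixed-message approach, under a few assumptions: ApiErrorLike is a hypothetical structural type covering only the status and code fields that the error mapper reads, and the message wording is illustrative. The idea is that the tool result carries the status and machine-readable code but never the raw message text.

```typescript
// Hypothetical structural type: only the fields this mapper reads
interface ApiErrorLike {
  status?: number;
  code?: string | null;
}

// Fixed, caller-safe messages keyed by HTTP status (wording is illustrative)
const SAFE_MESSAGES: Record<number, string> = {
  401: "OpenAI rejected the API key; the server operator needs to rotate it.",
  403: "The requested model is not available to this project.",
  404: "The requested model ID does not exist.",
  429: "OpenAI rate limit reached; retry after a short delay.",
};

function safeErrorText(err: ApiErrorLike): string {
  const status = err.status;
  // No status means the request never got an HTTP response
  if (status === undefined) return "Could not reach the OpenAI API; retry shortly.";
  if (status >= 500) return "OpenAI returned a server error; retry shortly.";
  const known: string | undefined = SAFE_MESSAGES[status];
  // Fall back to status and code only, never the raw message
  return known ?? `OpenAI request failed (status ${status}, code ${err.code ?? "unknown"}).`;
}

function toSafeToolResult(err: ApiErrorLike) {
  return { isError: true, content: [{ type: "text" as const, text: safeErrorText(err) }] };
}
```

Log the full error server-side so the detail is still available for debugging.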
Frequently asked questions
Can I use the Responses API instead of Chat Completions in an MCP tool?
Yes. The Responses API is OpenAI's newer stateful interface that maintains conversation state server-side and supports built-in tools like web search and code interpreter. For MCP servers, the tradeoff is: Chat Completions is stateless (your tool manages message history), while Responses API manages state for you but adds session management complexity. For single-turn tool calls, Chat Completions is simpler. For multi-turn agent loops that persist across MCP sessions, Responses API reduces your state management burden.
How do I handle the OpenAI API when it's down without blocking my MCP server?
Implement a circuit breaker: track consecutive failures in module-level state, and after N failures within a window, return isError: true with "OpenAI API currently unavailable — try again in 60 seconds" without attempting the call. Reset the circuit after a successful call. This prevents your MCP tool from hanging on timeouts during an outage. Wire AliveMCP on your own MCP server's endpoint so you know if the transport layer failed versus the OpenAI API specifically.
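The breaker described above might look like the following sketch. The names (createCircuitBreaker, BreakerOptions) and thresholds are hypothetical, and the clock is injectable only to make the behavior easy to test. One addition beyond the description: after the cooldown a single trial call is allowed, and a failed trial re-opens the circuit immediately.

```typescript
interface BreakerOptions {
  maxFailures: number; // failures within the window that trip the breaker
  windowMs: number;    // how far back failures are counted
  cooldownMs: number;  // how long calls are rejected once tripped
  now?: () => number;  // injectable clock (defaults to Date.now)
}

function createCircuitBreaker(opts: BreakerOptions) {
  const now = opts.now ?? Date.now;
  let failures: number[] = []; // timestamps of recent failures
  let openedAt: number | null = null;

  return {
    // True when a call to OpenAI may be attempted
    canCall(): boolean {
      if (openedAt === null) return true;
      return now() - openedAt >= opts.cooldownMs; // trial call after cooldown
    },
    recordSuccess(): void {
      failures = [];
      openedAt = null;
    },
    recordFailure(): void {
      const t = now();
      failures = failures.filter(ts => t - ts < opts.windowMs);
      failures.push(t);
      // Trip when the window fills up, or re-trip if a trial call failed
      if (openedAt !== null || failures.length >= opts.maxFailures) openedAt = t;
    },
  };
}
```

A tool handler would check canCall() first and return the isError: true "currently unavailable" result when it is false, then call recordSuccess() or recordFailure() after the OpenAI request settles.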
What's the right way to pass large context to an OpenAI tool call?
Use the gpt-tokenizer check to verify the context fits before calling. If it doesn't fit, either truncate the oldest messages (for conversation history) or use OpenAI's file upload + Assistants API for large documents. For RAG patterns where you retrieve text before calling GPT, cap retrieval to the top-K chunks that fit within your token budget after accounting for system prompt, query, and desired output length. See token budget management for the budget calculation pattern.
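For the RAG case, capping retrieval to the budget could be sketched as below. fitChunksToBudget and its option names are hypothetical, and the token counter is passed in so the sketch does not depend on a particular tokenizer; with gpt-tokenizer you would pass text => encode(text).length.

```typescript
// Keeps the highest-ranked chunks that fit in the remaining token budget.
// Assumes rankedChunks is already sorted best-first by the retriever.
function fitChunksToBudget(
  rankedChunks: string[],
  opts: { contextLimit: number; reservedTokens: number; maxOutputTokens: number },
  countTokens: (text: string) => number,
): string[] {
  // reservedTokens covers the system prompt and the query
  let remaining = opts.contextLimit - opts.reservedTokens - opts.maxOutputTokens;
  const kept: string[] = [];
  for (const chunk of rankedChunks) {
    const cost = countTokens(chunk);
    if (cost > remaining) break; // stop at the first chunk that does not fit
    kept.push(chunk);
    remaining -= cost;
  }
  return kept;
}
```

Stopping at the first chunk that does not fit, rather than skipping it and trying smaller ones, keeps the result a strict top-K prefix of the ranking; that is a design choice, not a requirement.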
How do I implement caching for repeated OpenAI calls?
OpenAI applies prompt caching automatically to prompts of 1,024 tokens or more on supported models (cache hits appear in usage.prompt_tokens_details.cached_tokens). For near-duplicate queries that differ only in casing or whitespace, implement a local cache keyed on a hash of the normalized prompt. For MCP servers processing high-throughput identical queries (like a lookup tool), a simple Map<string, string> with a TTL covers most cases. For true semantic deduplication, where differently worded queries should return the same cached answer, see semantic caching for MCP servers for the embedding-based approach.
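A minimal sketch of the Map-with-TTL cache, assuming hypothetical helper names (createTtlCache, cacheKey). For brevity the key is the normalized prompt itself rather than a hash of it; hashing (for example with SHA-256) would keep keys short for long prompts.

```typescript
// Small TTL cache; the clock is injectable so expiry can be tested
function createTtlCache<V>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, { value: V; expiresAt: number }>();
  return {
    get(key: string): V | undefined {
      const hit = entries.get(key);
      if (!hit) return undefined;
      if (now() >= hit.expiresAt) {
        entries.delete(key); // expired entries are dropped lazily on read
        return undefined;
      }
      return hit.value;
    },
    set(key: string, value: V): void {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}

// Collapses case and whitespace so trivially different prompts share a key
function cacheKey(model: string, prompt: string): string {
  return `${model}:${prompt.trim().toLowerCase().replace(/\s+/g, " ")}`;
}
```

Including the model in the key avoids returning an answer generated by a different model than the one requested.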
Should my MCP tool expose streaming to the MCP client?
Usually not. An MCP tool call returns a single result; the protocol does define progress notifications for long-running calls, but they report progress rather than deliver partial content, and client support varies. Most clients expect a single complete response per tool call. The safe approach for MCP tools calling OpenAI is to buffer the full stream using stream.finalChatCompletion() and return it as a single response. If you're building a custom MCP client and want incremental output, implement the two-tool start/poll pattern: start_generation returns a job ID, and poll_generation returns chunks as they complete.
Further reading
- MCP tools for the Anthropic/Claude API — Messages API, tool use, prompt caching
- MCP tools for Ollama — local LLM inference, model management, health checks
- MCP tools for AWS Bedrock — Converse API, IAM auth, cross-region inference
- Token budget management for MCP servers
- Rate limiting in MCP tool handlers
- Streaming responses in MCP servers
- Error handling patterns for MCP tools
- AliveMCP — production protocol monitoring for MCP servers