Platform guide · 2026-06-15 · MCP + AI Platform Integration

MCP Servers Across AI Inference Platforms

The MCP wire protocol is the same regardless of which AI inference platform calls it. The same JSON-RPC sequence (initialize, tools/list, tools/call) runs under every integration. What differs across OpenAI Agents SDK, AWS Bedrock, Google Gemini, Ollama, and Groq is the adapter layer each platform requires to bridge between its native function-calling interface and MCP's JSON-RPC protocol. OpenAI Agents SDK ships native MCP support and abstracts the adapter entirely. AWS Bedrock requires a hand-written conversion loop from MCP tool definitions to Bedrock's ToolSpec format, with a second pattern for Lambda-based action groups. Google Gemini converts MCP inputSchema to FunctionDeclaration objects and frequently returns multiple function calls per turn, which makes parallel dispatch a practical requirement for performance rather than an optional optimization. Ollama and Groq both expose an OpenAI-compatible API, so a single adapter function handles both, but each has characteristics that affect how you integrate MCP tools: Ollama's local inference means unattended deployments can silently lose remote MCP connectivity; Groq's ultra-fast inference means MCP round-trips become a disproportionate share of total latency. All five share one more thing: none of them distinguishes an MCP server failure from an application-layer error, so external monitoring is the only mechanism that catches MCP downtime before it appears as agent misbehavior.

Five platforms at a glance

The table below captures the integration approach, the adapter each platform requires, the key performance or architecture consideration, and the silent failure mode that makes external MCP server monitoring necessary for each one.

Platform	Integration approach	Adapter type	Key consideration	Silent failure mode
OpenAI Agents SDK	Native MCP support	`MCPServerStreamableHttp` / `MCPServerSse` / `MCPServerStdio` passed via `Agent(mcp_servers=[...])`	Open a persistent connection in the FastAPI lifespan; with `cache_tools_list=True` the tool list is fetched once and cached for the connection lifetime	Server down while persistent connection is live → tool calls fail mid-run and the agent hallucinates or loops
AWS Bedrock	Manual adapter (two patterns)	boto3 Converse API loop with MCP SDK, or Lambda proxy for Bedrock Agents action group	Converse API requires a hand-written `ToolUseBlock` dispatch loop; Lambda action group can't discover tools at runtime — schema must be committed manually	Bedrock errors and MCP errors surface through the same exception type — one dead MCP server makes the entire Converse loop fail without indicating which tier caused it
Google Gemini	Manual adapter or Google ADK	`FunctionDeclaration` conversion or Google ADK `MCPToolset`	Gemini returns multiple function calls per turn — parallel `asyncio.gather` dispatch is mandatory, not optional; latency = max of individual calls	One degraded MCP server in a parallel batch blocks the entire batch at its latency — no per-call timeout isolation
Ollama	OpenAI-compatible adapter	`openai.AsyncOpenAI(base_url="http://localhost:11434/v1")` with MCP-to-OpenAI conversion	Verify tool-calling capability before building — not all Ollama models support tools reliably; inference latency dominates over MCP round-trips	Local LLM inference with remote MCP servers — Ollama process restarts silently drop all MCP server connections; no process manager = no alert
Groq	OpenAI-compatible adapter	`groq.AsyncGroq` or `openai.AsyncOpenAI(base_url="https://api.groq.com/openai/v1")`	MCP round-trips are 25–35% of total run time (vs <5% on GPT-4o) — parallel dispatch and rate-limit context budgeting are mandatory	Slow MCP server eliminates Groq's speed advantage before any timeout fires — response-time degradation is invisible until the Groq rate limit hits

The shared protocol layer

Every platform in this post calls the same MCP protocol to discover and invoke tools. No matter what adapter layer wraps it, the wire protocol is identical:

// 1. Client connects and negotiates the protocol version
{ "jsonrpc": "2.0", "id": 1, "method": "initialize",
  "params": { "protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": { "name": "my-client", "version": "1.0" } } }

// 2. Server acknowledges with the negotiated version and its capabilities
{ "jsonrpc": "2.0", "id": 1, "result": { "protocolVersion": "2024-11-05", "serverInfo": { "name": "my-tools", "version": "1.0" }, "capabilities": { "tools": {} } } }

// 3. Client confirms it is ready (a notification: no id, no response)
{ "jsonrpc": "2.0", "method": "notifications/initialized" }

// 4. Client fetches the tool list
{ "jsonrpc": "2.0", "id": 2, "method": "tools/list" }

// 5. Client calls a specific tool
{ "jsonrpc": "2.0", "id": 3, "method": "tools/call",
  "params": { "name": "search_docs", "arguments": { "query": "MCP monitoring" } } }

// 6. Server returns the result
{ "jsonrpc": "2.0", "id": 3, "result": { "content": [{ "type": "text", "text": "..." }], "isError": false } }

What differs across the five platforms is everything above this wire protocol: how the platform's function-calling API is structured, how tool definitions from MCP's inputSchema format must be converted to the platform's native format, how the platform dispatches multiple tool calls within a single model turn, and how errors from the MCP layer are surfaced (or absorbed) by the platform's orchestration code.
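For most platforms the conversion step amounts to re-wrapping the same JSON Schema. The following is a minimal sketch, assuming a tool definition is available as a plain dict with name, description, and inputSchema keys; the helper names to_openai_tool and to_bedrock_tool are illustrative and not part of any SDK.

```python
from typing import Any

def to_openai_tool(tool: dict[str, Any]) -> dict[str, Any]:
    """OpenAI-compatible function format (the shape Ollama and Groq also accept)."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description") or "",
            "parameters": tool["inputSchema"],  # JSON Schema passes through unchanged
        },
    }

def to_bedrock_tool(tool: dict[str, Any]) -> dict[str, Any]:
    """Bedrock Converse toolSpec: same schema, wrapped one level deeper under "json"."""
    return {
        "toolSpec": {
            "name": tool["name"],
            "description": tool.get("description") or tool["name"],
            "inputSchema": {"json": tool["inputSchema"]},
        }
    }

# Hypothetical tool definition, shaped like one entry of a tools/list result
mcp_tool = {
    "name": "search_docs",
    "description": "Search the documentation index.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

openai_tool = to_openai_tool(mcp_tool)
bedrock_tool = to_bedrock_tool(mcp_tool)
```

Gemini is the exception: its FunctionDeclaration uses its own schema types, so the schema has to be translated rather than wrapped.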

The adapter layer is the integration surface, and what you learn on one platform transfers directly to the others: flat inputSchema designs perform better than nested schemas on every platform (the LLM fills fewer levels, and validation errors are clearer); connection reuse matters wherever each request would otherwise pay a fresh MCP handshake; and MCP server uptime is critical on all five regardless of their orchestration differences.

OpenAI Agents SDK — native MCP and the persistent-connection lifecycle

The OpenAI Agents SDK is the only platform in this group with native MCP support built into its core. You pass MCP server objects such as MCPServerStreamableHttp or MCPServerStdio (imported from agents.mcp) directly to the Agent constructor; the SDK then handles tools/list and tools/call without any adapter code. The server object must be connected before the run, most simply by using it as an async context manager:

import asyncio
from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp

async def main():
    # Entering the context connects and runs the MCP handshake; leaving it closes the connection
    async with MCPServerStreamableHttp(
        name="search",
        params={
            "url": "https://search.internal/mcp",
            "headers": {"Authorization": "Bearer sk-..."},
            "timeout": 30,
        },
    ) as search_server:
        research_agent = Agent(
            name="ResearchAgent",
            model="gpt-4o",
            instructions="Use the search and fetch tools to answer questions thoroughly.",
            mcp_servers=[search_server],
        )
        result = await Runner.run(research_agent, "What are common MCP server failure modes?")
        print(result.final_output)

asyncio.run(main())

If the MCP connection is opened and closed around each Runner.run() call, a FastAPI service handling many requests pays one MCP handshake per request, typically 50–300 ms of overhead per call. The remedy is the same as in LangChain and Pydantic AI: open the connection once at service startup and keep it open for the lifetime of the service:

from contextlib import asynccontextmanager
from fastapi import FastAPI
from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp

search_server = MCPServerStreamableHttp(
    name="search",
    params={"url": "https://search.internal/mcp"},
    cache_tools_list=True,  # Fetch tools/list once instead of on every run
)

search_agent = Agent(
    name="SearchAgent",
    model="gpt-4o",
    instructions="Answer questions using the search tools.",
    mcp_servers=[search_server],
)

@asynccontextmanager
async def lifespan(app: FastAPI):
    async with search_server:
        yield  # Connection stays open for all requests

app = FastAPI(lifespan=lifespan)

@app.post("/ask")
async def ask(question: str):
    result = await Runner.run(search_agent, question)
    return {"answer": result.final_output}

With cache_tools_list=True, the tool list is fetched once after the connection opens and cached for the connection's lifetime. Without it, the SDK requests tools/list on every run, which costs a round-trip but always reflects the server's current tools. If the MCP server adds or removes tools while the service is running with a cached tool list, those changes are invisible until the service restarts. Build MCP servers so that tool changes are backward-compatible: adding tools is safe, while removing them breaks cached tool lists.
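One way to notice a stale cache is to compare the tool names captured at startup with a fresh tools/list result from a periodic check. This is a minimal sketch; the helper name removed_tools and the example tool names are hypothetical.

```python
def removed_tools(cached: list[str], live: list[str]) -> list[str]:
    """Tool names the service still advertises to the model but the server no longer serves."""
    live_set = set(live)
    return [name for name in cached if name not in live_set]

# Hypothetical snapshot taken at startup vs. a later tools/list result
cached_names = ["search_docs", "fetch_page"]
live_names = ["search_docs", "summarize_page"]

stale = removed_tools(cached_names, live_names)
```

A non-empty result means the model can still request a tool that will fail on tools/call, which is the case worth alerting on; newly added tools are merely unused until the next restart.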

The Handoffs feature, which routes the conversation to specialist agents, adds a startup consideration: each agent in a handoff graph carries its own mcp_servers list, and each of those servers needs its own connection. If you're using persistent connections, open connections for all agents in the handoff graph at startup, not just the entry-point agent. An agent that receives a handoff mid-conversation without a pre-opened MCP connection either fails when it first needs its tools or has to connect on demand, adding the handshake latency at the worst possible moment.

The SDK's silent failure mode: when the MCP server goes down while a persistent connection is live, the SDK's next tools/call attempt fails mid-run. The agent receives no tool results and may hallucinate answers or enter a loop. Neither failure is distinguishable from the SDK's side without external observability — the failure looks like an application-layer issue, not an infrastructure failure.

AWS Bedrock — two adapter patterns and structured error isolation

AWS Bedrock has no native MCP support. Connecting MCP tools to Bedrock requires writing one of two adapter patterns: a Converse API loop that manages the tool-call cycle in your own code, or a Lambda proxy that bridges Bedrock Agents' action groups to an MCP server.

The Converse API pattern gives you full control. You call bedrock_client.converse(), inspect the stopReason in the response, and dispatch MCP tool calls when the model requests them:

import asyncio, boto3, json
from mcp import ClientSession
from mcp.client.sse import sse_client

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

async def run_with_mcp(prompt: str) -> str:
    async with sse_client("https://tools.internal/mcp") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools_result = await session.list_tools()

            # Convert MCP tool definitions to Bedrock ToolSpec format
            bedrock_tools = [
                {
                    "toolSpec": {
                        "name": t.name,
                        "description": t.description or t.name,  # Bedrock requires a non-empty string
                        "inputSchema": { "json": t.inputSchema },  # Bedrock wraps in {"json": ...}
                    }
                }
                for t in tools_result.tools
            ]

            messages = [{"role": "user", "content": [{"text": prompt}]}]

            while True:
                # boto3 is synchronous: run it in a worker thread so the event loop is not blocked
                response = await asyncio.to_thread(
                    bedrock.converse,
                    modelId="anthropic.claude-sonnet-4-6-v1:0",
                    messages=messages,
                    toolConfig={"tools": bedrock_tools},
                )
                content = response["output"]["message"]["content"]
                messages.append({"role": "assistant", "content": content})

                if response["stopReason"] != "tool_use":
                    # end_turn, max_tokens, etc.: return whatever text the model produced
                    return "".join(block["text"] for block in content if "text" in block)

                # Dispatch all requested tool calls in parallel
                tool_use_blocks = [b for b in content if "toolUse" in b]

                async def call_tool(block):
                    tool_use = block["toolUse"]
                    try:
                        result = await session.call_tool(tool_use["name"], tool_use["input"])
                        # Keep only text blocks; a tool may return none
                        text = "\n".join(c.text for c in result.content if c.type == "text")
                        return {
                            "toolResult": {
                                "toolUseId": tool_use["toolUseId"],
                                "content": [{"text": text or "(empty result)"}],
                                "status": "error" if result.isError else "success",
                            }
                        }
                    except Exception as e:
                        # Report the failure to the model instead of aborting the loop
                        return {
                            "toolResult": {
                                "toolUseId": tool_use["toolUseId"],
                                "content": [{"text": f"MCP tool error: {e}"}],
                                "status": "error",
                            }
                        }

                tool_results = await asyncio.gather(*[call_tool(b) for b in tool_use_blocks])
                messages.append({"role": "user", "content": tool_results})

asyncio.run(run_with_mcp("Summarize the latest MCP reliability data."))

The critical difference in the Bedrock ToolSpec format is the inputSchema wrapping: where MCP's tool definition has "inputSchema": { "type": "object", "properties": {...} }, Bedrock requires "inputSchema": { "json": { "type": "object", "properties": {...} } } — the same JSON Schema, but wrapped one level deeper. Missing the wrapper produces a Bedrock validation error that looks like a schema problem, not an adapter problem.

The Lambda proxy pattern serves a different architecture: when you're using Bedrock Agents (not the Converse API directly), Bedrock calls your Lambda function as an action group. The Lambda in turn calls the MCP server. The limitation is that Bedrock Agents' action group schema is defined statically in the Bedrock console or CloudFormation — there is no runtime tools/list discovery. When the MCP server adds or removes tools, the Bedrock action group schema must be updated manually and the agent alias republished. This eliminates one of MCP's key operational advantages (dynamic tool registration) in exchange for Bedrock Agents' orchestration capabilities.
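The Lambda side of this pattern is mostly translation between two envelopes. The sketch below assumes the function-details event shape (function, parameters, actionGroup) and the matching response envelope; confirm the field names against the current Bedrock Agents Lambda documentation before relying on them. The names handle_event, event_to_call, to_agent_response, and McpCaller are hypothetical, and the actual MCP call is injected so that the translation stays independent of the transport.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical: any coroutine that forwards (tool name, arguments) to the MCP server
McpCaller = Callable[[str, dict[str, Any]], Awaitable[str]]

def event_to_call(event: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Turn the action group's parameter list into MCP tool arguments."""
    args = {p["name"]: p["value"] for p in event.get("parameters", [])}
    return event["function"], args

def to_agent_response(event: dict[str, Any], body: str, failed: bool = False) -> dict[str, Any]:
    """Wrap tool output in the response envelope returned to Bedrock Agents."""
    function_response: dict[str, Any] = {"responseBody": {"TEXT": {"body": body}}}
    if failed:
        function_response["responseState"] = "FAILURE"
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": function_response,
        },
    }

def handle_event(event: dict[str, Any], call_mcp: McpCaller) -> dict[str, Any]:
    """Body of a Lambda handler: translate, call the MCP tool, translate back."""
    name, args = event_to_call(event)
    try:
        return to_agent_response(event, asyncio.run(call_mcp(name, args)))
    except Exception as exc:
        # Surface MCP failures explicitly instead of letting the Lambda crash
        return to_agent_response(event, f"MCP tool error: {exc}", failed=True)
```

Parameter values arrive as strings in this event shape, so a real proxy would coerce them to the types declared in the tool's inputSchema before calling the MCP server.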

Structured error logging is especially important in the Bedrock integration because boto3 exceptions and MCP SDK exceptions both arise inside the same adapter loop. Without explicit logging at each layer, a dead MCP server produces a Python exception that is hard to tell apart from a Bedrock API error, a quota exhaustion, or a network timeout. The try/except blocks in the call_tool function above are a start; pairing them with structured log fields (error_source: "mcp" | "bedrock" | "network") enables log-based alerting that distinguishes MCP infrastructure failures from Bedrock service issues.
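One way to attach the error_source field is to tag failures by call site rather than by exception type, since a dead MCP server often surfaces as a generic connection error. This is a minimal sketch using only the standard library; the error_source context manager and the logger name are hypothetical.

```python
import json
import logging
from contextlib import contextmanager

logger = logging.getLogger("bedrock_mcp_adapter")

@contextmanager
def error_source(source: str, **fields):
    """Log any exception raised inside the block with an error_source field, then re-raise."""
    try:
        yield
    except Exception as exc:
        logger.error(json.dumps({
            "error_source": source,  # "mcp" | "bedrock" | "network"
            "error_type": type(exc).__name__,
            "message": str(exc),
            **fields,
        }))
        raise
```

In the adapter, the MCP call would be wrapped as `with error_source("mcp", tool=tool_use["name"]): ...` and the converse() call as `with error_source("bedrock"): ...`, so each log line names the layer where the failure occurred.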

Google Gemini — parallel dispatch and the ADK shortcut

Google Gemini requires converting MCP tool definitions to FunctionDeclaration objects — Gemini's native tool format. The conversion is straightforward, but the dispatch pattern is not: Gemini's function-calling model frequently returns multiple function calls in a single model turn, which means your dispatch loop must call MCP tools in parallel, not sequentially:

import asyncio
import google.generativeai as genai
from mcp import ClientSession
from mcp.client.sse import sse_client

async def run_with_gemini_mcp(prompt: str) -> str:
    async with sse_client("https://tools.internal/mcp") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools_result = await session.list_tools()

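            # Map JSON Schema primitive types to Gemini types; anything else falls back to STRING.
            # Nested objects and arrays need a recursive conversion that this flat sketch omits.
            type_map = {
                "string": genai.protos.Type.STRING,
                "integer": genai.protos.Type.INTEGER,
                "number": genai.protos.Type.NUMBER,
                "boolean": genai.protos.Type.BOOLEAN,
            }

            def gemini_type(prop: dict):
                json_type = prop.get("type")
                if isinstance(json_type, str):
                    return type_map.get(json_type, genai.protos.Type.STRING)
                return genai.protos.Type.STRING

            # Convert MCP inputSchema to Gemini FunctionDeclaration format
            gemini_tools = [
                genai.protos.Tool(function_declarations=[
                    genai.protos.FunctionDeclaration(
                        name=t.name,
                        description=t.description or "",
                        parameters=genai.protos.Schema(
                            type=genai.protos.Type.OBJECT,
                            properties={
                                name: genai.protos.Schema(
                                    type=gemini_type(prop),
                                    description=prop.get("description", ""),
                                )
                                for name, prop in t.inputSchema.get("properties", {}).items()
                            },
                            required=t.inputSchema.get("required", []),
                        ),
                    )
                    for t in tools_result.tools
                ])
            ]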

            model = genai.GenerativeModel("gemini-1.5-pro", tools=gemini_tools)
            chat = model.start_chat()
            messages = [{"role": "user", "parts": [prompt]}]

            while True:
                response = await asyncio.to_thread(chat.send_message, messages[-1]["parts"])
                candidate = response.candidates[0]

                # Check if the model made function calls
                function_calls = [
                    part.function_call
                    for part in candidate.content.parts
                    if hasattr(part, "function_call") and part.function_call.name
                ]

                if not function_calls:
                    # Final text response
                    return "".join(
                        part.text for part in candidate.content.parts if hasattr(part, "text")
                    )

                # Dispatch ALL function calls in parallel — latency = max, not sum
                async def dispatch(fc):
                    try:
                        result = await session.call_tool(fc.name, dict(fc.args))
                        return genai.protos.Part(
                            function_response=genai.protos.FunctionResponse(
                                name=fc.name,
                                response={"result": result.content[0].text if result.content else ""},
                            )
                        )
                    except Exception as e:
                        return genai.protos.Part(
                            function_response=genai.protos.FunctionResponse(
                                name=fc.name,
                                response={"error": str(e)},
                            )
                        )

                parts = await asyncio.gather(*[dispatch(fc) for fc in function_calls])
                messages.append({"role": "model", "parts": list(candidate.content.parts)})
                messages.append({"role": "user", "parts": list(parts)})

asyncio.run(run_with_gemini_mcp("Compare the uptime of MCP servers across major registries."))

The parallel dispatch pattern is not an optimization here — it is architecturally correct behavior. When Gemini returns multiple function calls in a single turn, it expects all results before generating the next response. Sequential dispatch (calling each MCP tool one at a time) works functionally but multiplies latency: if Gemini requests three tools and each takes 200 ms, sequential dispatch takes 600 ms; parallel dispatch takes 200 ms. The failure mode is the inverse: one degraded MCP server in a parallel batch blocks all results at that server's latency. A server that normally completes in 200 ms and starts timing out at 5 seconds turns a typical three-tool response from 200 ms into 5 seconds.
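A per-call timeout limits how long one degraded server can hold up the batch. The sketch below uses only asyncio; fake_tool is a stand-in for session.call_tool, and the delays and the 0.2-second budget are arbitrary values chosen for illustration.

```python
import asyncio

async def call_with_timeout(coro, timeout_s: float, label: str) -> dict:
    """Await one tool call, converting a timeout into an error result instead of stalling the batch."""
    try:
        return {"tool": label, "ok": True, "result": await asyncio.wait_for(coro, timeout_s)}
    except asyncio.TimeoutError:
        return {"tool": label, "ok": False, "result": f"timed out after {timeout_s}s"}

async def fake_tool(delay_s: float, value: str) -> str:
    """Stand-in for session.call_tool: sleeps, then returns a value."""
    await asyncio.sleep(delay_s)
    return value

async def demo() -> list:
    # Two healthy calls and one degraded call, all dispatched in parallel
    return await asyncio.gather(
        call_with_timeout(fake_tool(0.01, "a"), 0.2, "fast_a"),
        call_with_timeout(fake_tool(5.0, "b"), 0.2, "slow_b"),
        call_with_timeout(fake_tool(0.01, "c"), 0.2, "fast_c"),
    )

results = asyncio.run(demo())
```

The batch now completes in roughly the timeout budget rather than at the slow server's latency, and the timed-out call can be returned to the model as an error response like any other tool failure.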

For teams already using Google's Agent Development Kit (ADK), the MCPToolset class provides native integration without manual adapter code:

from google.adk.agents import Agent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, SseServerParams

# ADK agent with MCP tools — no manual FunctionDeclaration conversion
research_agent = Agent(
    name="research_agent",
    model="gemini-1.5-pro",
    description="Research agent with access to external search and data tools.",
    instruction="Answer questions thoroughly using available tools.",
    tools=[
        MCPToolset(
            connection_params=SseServerParams(
                url="https://search.internal/mcp",
                headers={"Authorization": "Bearer sk-..."},
            )
        )
    ],
)

from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

runner = Runner(agent=research_agent, app_name="research", session_service=InMemorySessionService())
# runner.run_async() handles the tool dispatch loop, including parallel calls

The ADK's MCPToolset handles FunctionDeclaration conversion and parallel dispatch internally. The trade-off vs the manual adapter is flexibility: the manual adapter lets you customize error handling, add structured logging, implement per-tool timeouts, and control the connection lifecycle independently. The ADK handles all of this for you at the cost of control over the internals.

Ollama — local inference with remote MCP tools

Ollama exposes an OpenAI-compatible REST API, which means the same adapter code that works with the OpenAI API works with Ollama by changing only the base URL. The critical first step is confirming that the Ollama model you've chosen actually supports tool calling — not all do, and silent failures are the common outcome when they don't:

import asyncio
import json
from openai import AsyncOpenAI
from mcp import ClientSession
from mcp.client.sse import sse_client

# Ollama's OpenAI-compatible API
ollama = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

async def verify_tool_capability(model: str) -> bool:
    """Verify the model responds to tool calls rather than ignoring them."""
    probe_tool = [{
        "type": "function",
        "function": {
            "name": "health_check",
            "description": "Return the string 'ok'.",
            "parameters": { "type": "object", "properties": {} },
        }
    }]
    try:
        response = await ollama.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Call the health_check tool."}],
            tools=probe_tool,
            tool_choice="required",  # Request a forced tool call; a server that ignores tool_choice relies on the prompt alone
        )
        # tool_calls is None or empty when the model answered in plain text
        return bool(response.choices[0].message.tool_calls)
    except Exception:
        return False

# Tool-capable models as of mid-2026:
# llama3.1:8b     — reliable tool calls, 1-2s on M3 GPU
# llama3.1:70b    — excellent, requires 40+ GB VRAM
# qwen2.5:7b      — reliable, good JSON adherence
# qwen2.5:72b     — excellent, best open-source option for complex tasks
# gemma2:9b       — limited, frequent plain-text fallbacks

async def run_with_ollama_mcp(model: str, prompt: str) -> str:
    if not await verify_tool_capability(model):
        raise RuntimeError(f"Model {model} does not support tool calling — check model selection")

    async with sse_client("https://tools.internal/mcp") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools_result = await session.list_tools()

            # Convert MCP tools to OpenAI function format — same conversion works for Groq too
            openai_tools = [
                {
                    "type": "function",
                    "function": {
                        "name": t.name,
                        "description": t.description,
                        "parameters": t.inputSchema,
                    }
                }
                for t in tools_result.tools
            ]

            messages = [{"role": "user", "content": prompt}]

            while True:
                response = await ollama.chat.completions.create(
                    model=model,
                    messages=messages,
                    tools=openai_tools,
                )
                choice = response.choices[0]
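                if not choice.message.tool_calls:
                    return choice.message.content or ""

                # Echo the assistant turn back as plain dicts so it serializes cleanly
                messages.append({
                    "role": "assistant",
                    "content": choice.message.content,
                    "tool_calls": [tc.model_dump() for tc in choice.message.tool_calls],
                })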

                # Ollama models rarely return multiple tool calls per turn — sequential is fine
                for tool_call in choice.message.tool_calls:
                    args = json.loads(tool_call.function.arguments)
                    try:
                        result = await session.call_tool(tool_call.function.name, args)
                        tool_content = result.content[0].text if result.content else ""
                        if result.isError:
                            tool_content = f"Error: {tool_content}"
                    except Exception as e:
                        tool_content = f"MCP tool unavailable: {e}"
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": tool_content,
                    })

asyncio.run(run_with_ollama_mcp("llama3.1:8b", "What MCP servers are currently healthy?"))

The latency profile for Ollama integrations is the inverse of Groq's. With fast cloud inference (<200 ms on Groq), MCP round-trips (50–300 ms) can represent 25–35% of total agent latency. For Ollama on consumer hardware, LLM inference takes 1–30 seconds depending on the model and hardware, and MCP round-trips (50–300 ms) are typically under 10% of total latency. Optimizing MCP connection pooling matters less; optimizing model size for the task matters more.
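The share arithmetic behind this comparison is simple enough to check directly. The figures below are illustrative values picked from within the ranges quoted above, not measurements, and mcp_share is a hypothetical helper.

```python
def mcp_share(inference_ms: float, mcp_ms: float) -> float:
    """Fraction of one agent turn spent waiting on the MCP round-trip."""
    return mcp_ms / (inference_ms + mcp_ms)

# Fast cloud inference: 200 ms of inference plus a 100 ms MCP round-trip
fast_cloud = mcp_share(inference_ms=200, mcp_ms=100)

# Local inference on consumer hardware: 5 s of inference plus a 300 ms MCP round-trip
local_ollama = mcp_share(inference_ms=5000, mcp_ms=300)
```

With these inputs the MCP round-trip is about a third of the fast-cloud turn and under 6% of the local one, which is why the optimization priorities differ.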

The monitoring gap specific to Ollama is the local + remote split: Ollama runs locally (or on a local network machine) while MCP servers typically run on the public internet or a cloud VPC. When the local stack restarts (OS update, crash, container recycle), nothing restores the adapter's MCP client connections automatically, and any running agent sessions lose their MCP tool access silently. In production setups without a process manager (systemd, supervisord, Docker restart policies) watching Ollama, crashes go undetected and unrecovered. AliveMCP monitors the remote MCP servers themselves; local Ollama process health needs a separate watchdog.

Groq — ultra-fast inference and the MCP round-trip budget

Groq uses the same OpenAI-compatible adapter pattern as Ollama, so the same openai_tools conversion code works unchanged. The difference is context: Groq's inference is so fast (50–200 ms for most completions) that MCP round-trips become a meaningful share of total agent latency, making parallel dispatch and connection management more important than on slower platforms:

import asyncio, json
from groq import AsyncGroq
from mcp import ClientSession
from mcp.client.sse import sse_client

groq = AsyncGroq()  # Uses GROQ_API_KEY env var

async def run_with_groq_mcp(prompt: str) -> str:
    async with sse_client("https://tools.internal/mcp") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools_result = await session.list_tools()

            openai_tools = [
                {
                    "type": "function",
                    "function": {
                        "name": t.name,
                        "description": t.description,
                        "parameters": t.inputSchema,
                    }
                }
                for t in tools_result.tools
            ]

            messages = [{"role": "user", "content": prompt}]
            # Token accounting: Groq rate limits are per-minute, so track usage across turns
            total_tokens = 0  # accumulated from response.usage, which already includes prompt tokens

            while True:
                response = await groq.chat.completions.create(
                    model="llama-3.3-70b-versatile",
                    messages=messages,
                    tools=openai_tools,
                    max_tokens=4096,
                )
                choice = response.choices[0]
                total_tokens += response.usage.total_tokens

                messages.append({
                    "role": "assistant",
                    "content": choice.message.content,
                    "tool_calls": [tc.model_dump() for tc in (choice.message.tool_calls or [])],
                })

                if not choice.message.tool_calls:
                    return choice.message.content or ""

                # Parallel MCP dispatch: essential for Groq where LLM completes in <200ms
                async def call_mcp(tool_call):
                    try:
                        args = json.loads(tool_call.function.arguments)
                        result = await session.call_tool(tool_call.function.name, args)
                        content = result.content[0].text if result.content else ""
                        if result.isError:
                            content = f"Error: {content}"
                    except Exception as e:
                        # Return the error as text and never raise: every tool_call_id needs a reply
                        content = f"Tool error: {e}"
                    return {"role": "tool", "tool_call_id": tool_call.id, "content": content}

                messages.extend(await asyncio.gather(
                    *[call_mcp(tc) for tc in choice.message.tool_calls]
                ))

                # Rolling context trim: keep the original prompt plus the most recent messages
                if len(messages) > 16:
                    tail = messages[-15:]
                    # Drop leading tool results whose assistant tool_calls message was trimmed away
                    while tail and tail[0]["role"] == "tool":
                        tail = tail[1:]
                    messages = [messages[0]] + tail

asyncio.run(run_with_groq_mcp("Analyze MCP server uptime trends over the past 30 days."))

Groq enforces tokens-per-minute (TPM) limits alongside its request limits, and in tool-heavy agent loops TPM is often the limit that binds first. This post assumes approximately 14,400 TPM for Llama 3.3-70B-Versatile on the free tier; limits change, so check the current figures for your account. In an agent loop that calls tools on every turn, token consumption accumulates quickly: a prompt of 200 tokens that triggers two tool calls returning 500 tokens each plus a 300-token model response costs roughly 1,500 tokens per turn. At 14,400 TPM, that's roughly 9 turns per minute, and fewer as the resent history grows. Rolling context trimming (keeping the original prompt plus the most recent N messages) keeps the context from growing linearly with conversation length, which helps stay within rate limits.
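The budget arithmetic above can be written as a small helper for sizing an agent loop against a TPM limit. The function name turns_per_minute is hypothetical, and the inputs are the figures from the paragraph above.

```python
def turns_per_minute(
    tpm_limit: int,
    prompt_tokens: int,
    tool_result_tokens: list[int],
    response_tokens: int,
) -> int:
    """Whole agent turns that fit in one minute's token budget, ignoring history growth."""
    per_turn = prompt_tokens + sum(tool_result_tokens) + response_tokens
    return tpm_limit // per_turn

# 200-token prompt, two 500-token tool results, 300-token response: 1,500 tokens per turn
turns = turns_per_minute(14_400, 200, [500, 500], 300)
```

This is an upper bound: each later turn resends the accumulated history, so the real per-turn cost rises until the rolling trim caps it.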

The Groq-specific monitoring concern is the interaction between MCP server response time and Groq's speed advantage. When a slow or intermittently degraded MCP server adds 2–5 seconds of latency per tool call, Groq's 100–200 ms inference advantage becomes irrelevant — the total agent latency is dominated by the slow MCP server. Because Groq's inference completes before the MCP tool has even started returning data, the degraded MCP server is the only bottleneck, but nothing in the Groq API or error messages identifies it as such. External monitoring that tracks per-server response time provides the signal that separates "Groq is slow today" from "an MCP dependency is degraded."

The shared failure mode: all five platforms absorb MCP failures silently

Despite their different architectures, all five platforms share a structural blind spot: when an MCP server becomes unavailable mid-run, the failure does not surface immediately as an unambiguous "MCP server down" error. Each platform absorbs the failure in its own way, and each absorption mechanism costs time and compute before the root cause becomes visible:

OpenAI Agents SDK: A server down while a persistent connection is live causes the next tools/call attempt to fail with a transport error. The SDK's behavior depends on the agent's retry configuration — by default, the agent may loop on tool invocation failures before eventually returning an incomplete or hallucinated response. The error that surfaces looks like an agent reasoning failure, not an infrastructure failure.
AWS Bedrock (Converse API): A dead MCP server causes the session.call_tool() call inside the adapter to raise a Python exception. If the exception is caught by a generic handler and returned as a toolResult with status: "error", the Bedrock model sees an error response and may retry with different arguments (spending tokens), try a different approach, or report failure — none of which distinguishes the MCP server failure from a bad tool argument. Without structured error-source logging, the Bedrock CloudWatch logs show a tool error with no indication that the cause was the MCP server going down rather than the model making a bad request.
Google Gemini: A dead server in a parallel dispatch batch makes the corresponding tool-call coroutine raise. Depending on how the adapter calls asyncio.gather, that exception either propagates out of the gather call (the default) or comes back as an exception object in the results list (with return_exceptions=True). In either case, the failed function call usually ends up being returned to Gemini as an error response. The model then typically reasons about the error and tries again, and each retry is a new inference call plus another MCP round-trip attempt. The monitoring signal from inside the application layer is ambiguous: it looks like the model is struggling with a tool, not like the tool's server is down.
Ollama: A dead remote MCP server causes tool calls to fail with connection errors. Because the open-weight models served by Ollama handle tool-call errors less consistently than hosted models, the failure mode depends on the specific model: some models will report the error in their final response; others will attempt alternative reasoning paths that ignore the tool failure entirely. The failure is completely invisible if the model doesn't mention it in its output.
Groq: A slow or dead MCP server eliminates Groq's entire speed advantage. Because inference completes in 100–200 ms while the MCP server is timing out over 5–30 seconds, the agent loop appears to "run fine" except that every tool call takes far longer than expected. This latency creep is invisible to Groq's monitoring — the API sees normal inference calls completing normally, with tool result latency entirely outside its visibility. The agent eventually times out or exhausts rate limits, but the proximate cause (slow MCP server) is never named.
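Several of the failure modes above come down to the adapter not recording where an error originated. A minimal sketch of one way to tag that at dispatch time follows, assuming the adapter can tell transport failures apart by exception type. `McpTransportError`, `call_tagged`, `dispatch_parallel`, and the `error_source` values are hypothetical names for illustration, not part of any platform SDK.

```python
import asyncio


class McpTransportError(Exception):
    """Hypothetical: raised by the adapter when the MCP server cannot be reached."""


async def call_tagged(name, call):
    """Await one tool call and return a dict tagged with where any failure came from."""
    try:
        return {"tool": name, "ok": True, "result": await call}
    except (McpTransportError, ConnectionError, asyncio.TimeoutError) as exc:
        # The server was unreachable or too slow: an infrastructure failure.
        return {"tool": name, "ok": False, "error_source": "mcp_server", "error": repr(exc)}
    except Exception as exc:
        # The server answered but the call itself failed, e.g. bad arguments.
        return {"tool": name, "ok": False, "error_source": "tool_call", "error": repr(exc)}


async def dispatch_parallel(calls):
    """Run all tool calls concurrently; one failure never cancels the others.

    `calls` maps a tool name to an awaitable. Results keep the input order.
    """
    return await asyncio.gather(*(call_tagged(n, c) for n, c in calls.items()))
```

Because each call catches its own exceptions, the gather call itself does not raise when one server is down, and the `error_source` field can be written to the log before the error result is handed back to the model.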

The pattern is the same across all five: MCP server downtime does not produce an immediate, unambiguous platform-level failure. It produces a series of partial failures that each platform's orchestration layer attempts to absorb, generating LLM token spend in the process, before eventually surfacing as a high-level agent error. The error message that arrives says something about the agent's response, not about the MCP server.

AliveMCP closes this gap by monitoring the MCP server independently of any platform. Probes run the full protocol sequence — initialize, tools/list, and actual tools/call invocations — every 60 seconds. When a probe fails, an alert fires within one check interval. The monitoring is platform-agnostic: one AliveMCP monitor per MCP server endpoint detects failures before any platform's retry cycle has a chance to waste LLM tokens on a server that isn't coming back.
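The probe sequence described above can be sketched as three JSON-RPC requests sent in order, stopping at the first failure. This is an illustrative sketch only and does not represent AliveMCP's implementation: `probe` and the `send` transport callable are hypothetical, the client name and protocol version string are placeholder values, and the `notifications/initialized` notification that a full MCP client sends after `initialize` is omitted for brevity.

```python
import itertools
from typing import Callable

# Hypothetical transport: takes a JSON-RPC request dict, returns the response dict.
Send = Callable[[dict], dict]


def probe(send: Send, tool: str, arguments: dict) -> dict:
    """Run initialize -> tools/list -> tools/call and report the first step that fails."""
    ids = itertools.count(1)
    steps = [
        ("initialize", {
            "protocolVersion": "2024-11-05",  # placeholder; use the version you target
            "capabilities": {},
            "clientInfo": {"name": "probe-sketch", "version": "0.0.1"},
        }),
        ("tools/list", {}),
        ("tools/call", {"name": tool, "arguments": arguments}),
    ]
    for method, params in steps:
        request = {"jsonrpc": "2.0", "id": next(ids), "method": method, "params": params}
        try:
            response = send(request)
        except Exception as exc:
            # Transport-level failure: the server could not be reached.
            return {"ok": False, "failed_step": method, "error": repr(exc)}
        if "error" in response:
            # JSON-RPC-level failure: the server answered with an error object.
            return {"ok": False, "failed_step": method, "error": response["error"]}
        if method == "tools/call" and response.get("result", {}).get("isError"):
            # The tool ran but reported failure in its result.
            return {"ok": False, "failed_step": method, "error": response["result"]}
    return {"ok": True, "failed_step": None, "error": None}
```

Injecting `send` keeps the sequence logic independent of the transport, so the same function can sit behind an HTTP client or a stdio pipe, and the `failed_step` field shows whether the server is unreachable altogether or only failing on real tool invocations.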

Choosing a platform for a new MCP-backed project

The five platforms cover different positions on the control-vs-abstraction tradeoff. A few heuristics that emerge from their MCP-specific integration characteristics:

Use OpenAI Agents SDK if you're building on GPT-4o models and want the lowest-friction MCP integration: native MCPServerHTTP support, built-in Handoffs for multi-agent workflows, and Guardrails for output safety — all without writing adapter code. The persistent-connection lifespan pattern is the one optimization worth adding before launch.
Use AWS Bedrock Converse API if you need to run inference in AWS for data residency or compliance reasons, or if you want to combine MCP tools with AWS-native infrastructure (IAM, CloudWatch, S3). Accept that you'll maintain the adapter loop and structured error logging yourself. Use structured error_source fields in every log line from the adapter to distinguish Bedrock failures from MCP failures.
Use Google Gemini (or Google ADK) if you need models with long context windows (Gemini 1.5 Pro's 1M token context) or multimodal tool outputs. Use the ADK's MCPToolset for the fastest integration; write the manual adapter if you need control over per-tool error handling or connection lifecycle. Internalize that parallel dispatch is mandatory — sequential dispatch in a multi-function-call turn multiplies latency unnecessarily.
Use Ollama for local inference on hardware you control — privacy-sensitive data, air-gapped environments, or cost-constrained development. Verify tool capability per model before building (tool_choice="required" probe) and set up a process manager to restart Ollama on crashes. Expect that LLM inference is the bottleneck, not MCP round-trips.
Use Groq when latency is the primary product requirement — chat applications, real-time assistance, or time-sensitive pipelines where 200 ms end-to-end matters. Implement parallel MCP dispatch unconditionally and rolling context trimming to manage TPM rate limits. Set AliveMCP alerts to fire on response-time degradation (not just downtime), because a slow MCP server kills Groq's speed advantage before any timeout fires.
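Alerting on response-time degradation rather than only on downtime, as suggested for Groq above, can be approximated with a rolling window over successful probe latencies. The sketch below is a generic illustration and not AliveMCP configuration: `LatencyWatch`, the window size, and the latency budget are hypothetical and would need tuning per server.

```python
from collections import deque
from statistics import median


class LatencyWatch:
    """Flags a server as degraded when its recent median latency exceeds a budget."""

    def __init__(self, budget_s: float, window: int = 5):
        self.budget_s = budget_s
        self.samples = deque(maxlen=window)  # oldest sample drops off automatically

    def record(self, seconds: float) -> bool:
        """Store one successful probe's latency; return True if the server looks degraded."""
        self.samples.append(seconds)
        # Wait for a full window so a single slow probe cannot trigger an alert.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and median(self.samples) > self.budget_s
```

Using the median instead of the mean keeps one outlier from tripping the alert, while a sustained slowdown pushes the median over the budget within a few check intervals.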

All five choices share the same operational requirement: external MCP server monitoring. The platform determines your inference model and orchestration architecture; the MCP server monitoring determines how quickly you know when the tools that architecture depends on become unavailable.