
MCP server Ollama integration

Ollama runs open-source LLMs locally via a simple REST API — no API keys, no per-token billing, no data leaving your machine. For MCP servers, this makes Ollama the default choice for privacy-sensitive workloads, zero-cost local development, and offline-capable deployments. The integration is straightforward but has three non-obvious edges: Ollama streams newline-delimited JSON by default (MCP tools need buffered responses), context window limits vary wildly between models, and a missing health check leads to silent failures when the Ollama process isn't running.

TL;DR

Call Ollama at http://localhost:11434 with plain fetch — no SDK required. Use /api/chat for messages format and /api/generate for raw text. Always set "stream": false in the request body for MCP tools — Ollama streams by default and a streaming response can't be returned synchronously from a tool handler. Implement a check_ollama_health tool that calls GET /api/tags before any inference — return isError: true immediately if Ollama isn't running rather than timing out. Know your model's context limit before sending large prompts. Wire AliveMCP on your MCP server's endpoint so you catch Ollama process crashes before users do.

Setup: no SDK required

Ollama exposes a REST API on localhost:11434. You don't need an npm package — the built-in fetch API (available in Node.js 18+) is sufficient:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const OLLAMA_BASE = process.env.OLLAMA_HOST ?? "http://localhost:11434";

const server = new McpServer({ name: "ollama-tools", version: "1.0.0" });

// Shared fetch helper with timeout
async function ollamaFetch(path: string, body?: object): Promise<Response> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 120_000); // 2 min timeout

  try {
    const res = await fetch(`${OLLAMA_BASE}${path}`, {
      method: body ? "POST" : "GET",
      headers: body ? { "Content-Type": "application/json" } : undefined,
      body: body ? JSON.stringify(body) : undefined,
      signal: controller.signal,
    });
    if (!res.ok) {
      const text = await res.text();
      throw new Error(`Ollama ${path} returned ${res.status}: ${text}`);
    }
    return res;
  } finally {
    clearTimeout(timeout);
  }
}

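The OLLAMA_HOST environment variable lets you point MCP tools at a remote Ollama instance (on a GPU box, for example) without code changes. Because the value is passed straight to fetch, it must be a full URL including the scheme, such as http://gpu-box:11434. The default http://localhost:11434 covers local dev.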
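If the same variable might also hold a bare host:port value, a small normalization step can help. The following is a minimal sketch, not part of the original setup: the name normalizeOllamaBase is hypothetical, falling back to port 11434 is an assumption based on the default address used in this guide, and IPv6 literals are not handled.

```typescript
// Hypothetical helper: turn an OLLAMA_HOST-style value into a base URL for fetch.
function normalizeOllamaBase(raw: string | undefined): string {
  const DEFAULT_BASE = "http://localhost:11434";
  let value = (raw ?? "").trim();
  if (value === "") return DEFAULT_BASE;

  // Drop trailing slashes so `${base}${path}` never produces "//api/...".
  value = value.replace(/\/+$/, "");

  // Add a scheme when only host[:port] was given, e.g. "gpu-box:11434".
  if (!/^https?:\/\//i.test(value)) value = `http://${value}`;

  // Assumption: plain-http values with no port and no path get the default port.
  const hostPart = value.replace(/^https?:\/\//i, "");
  const hasPort = /:\d+$/.test(hostPart);
  const hasPath = hostPart.includes("/");
  if (!hasPort && !hasPath && value.toLowerCase().startsWith("http://")) {
    value += ":11434";
  }
  return value;
}
```

In the setup above this could wrap the environment lookup, for example normalizeOllamaBase(process.env.OLLAMA_HOST), so that both "gpu-box:11434" and "http://gpu-box:11434/" resolve to the same base.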

Health check tool

Ollama is a separate process — it can crash, fail to start, or be unavailable while your MCP server is running. Always check health before inference rather than letting inference calls time out after 2 minutes:

server.tool(
  "check_ollama_health",
  "Verify that the Ollama server is running and return the list of available models",
  {},
  async () => {
    try {
      const res = await ollamaFetch("/api/tags");
      const data = await res.json() as { models: Array<{ name: string; size: number; modified_at: string }> };

      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            status: "running",
            host: OLLAMA_BASE,
            models: data.models.map(m => ({
              name: m.name,
              size_gb: (m.size / 1e9).toFixed(1),
              modified_at: m.modified_at,
            })),
          }),
        }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{
          type: "text",
          text: JSON.stringify({
            status: "unreachable",
            host: OLLAMA_BASE,
            error: err instanceof Error ? err.message : String(err),
            hint: "Run `ollama serve` to start the Ollama server",
          }),
        }],
      };
    }
  }
);

Call check_ollama_health at MCP server startup and log the result. For production deployments, add this as a liveness check in your process monitor so restarts are automatic when Ollama goes down.
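One way to run that check at startup is sketched below, under the assumption that you also want to confirm specific models are already pulled. The names checkOllamaAtStartup, TagsFetcher, and StartupStatus are hypothetical; the fetcher is injected so the logic can be exercised without a running Ollama process.

```typescript
// Hypothetical types for a startup probe; only the `name` field of /api/tags is used.
type TagsFetcher = () => Promise<{ models: Array<{ name: string }> }>;

interface StartupStatus {
  ok: boolean;
  models: string[];
  error?: string;
}

// Calls the injected fetcher once and reports whether Ollama is reachable
// and whether every required model name is present in the returned list.
async function checkOllamaAtStartup(
  fetchTags: TagsFetcher,
  requiredModels: string[] = [],
): Promise<StartupStatus> {
  try {
    const { models } = await fetchTags();
    const names = models.map((m) => m.name);
    const missing = requiredModels.filter((r) => !names.includes(r));
    if (missing.length > 0) {
      return { ok: false, models: names, error: `Missing models: ${missing.join(", ")}` };
    }
    return { ok: true, models: names };
  } catch (err) {
    // Connection refused, timeout, or a non-2xx response all end up here.
    return { ok: false, models: [], error: err instanceof Error ? err.message : String(err) };
  }
}
```

With the helper from this guide, the fetcher could be async () => (await ollamaFetch("/api/tags")).json(). Log the returned status before connecting the MCP transport, and decide whether a failed check should stop the server or only produce a warning.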

Chat endpoint tool

The /api/chat endpoint takes a messages array (the same shape as the OpenAI chat format) and returns the model's reply. The critical MCP-specific flag is "stream": false. Without it, Ollama returns a stream of newline-delimited JSON objects, and await res.json() fails with a parse error because the body is not a single JSON document.

server.tool(
  "chat_with_model",
  "Send a conversation to an Ollama model and return the response",
  {
    model: z.string().describe("Ollama model name — e.g. llama3.2, mistral, gemma3, qwen2.5"),
    messages: z.array(z.object({
      role: z.enum(["system", "user", "assistant"]),
      content: z.string(),
    })).min(1).describe("Conversation history — the last message should be role: user"),
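    // .default() followed by .optional() never applies the default, so rely on the model's own default instead
    temperature: z.number().min(0).max(2).optional()
      .describe("Sampling temperature; omit to use the model's default"),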
    num_ctx: z.number().int().optional()
      .describe("Override context window size — defaults to model's built-in limit"),
  },
  async ({ model, messages, temperature, num_ctx }) => {
    const options: Record<string, unknown> = {};
    if (temperature !== undefined) options.temperature = temperature;
    if (num_ctx !== undefined) options.num_ctx = num_ctx;

    try {
      const res = await ollamaFetch("/api/chat", {
        model,
        messages,
        stream: false,  // CRITICAL — must be false for MCP tools
        options: Object.keys(options).length > 0 ? options : undefined,
      });

      const data = await res.json() as {
        message: { role: string; content: string };
        done: boolean;
        total_duration: number;
        prompt_eval_count: number;
        eval_count: number;
      };

      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            reply: data.message.content,
            model,
            prompt_tokens: data.prompt_eval_count,
            output_tokens: data.eval_count,
            total_duration_ms: Math.round(data.total_duration / 1e6),
          }),
        }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Ollama chat error: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);
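If a request does go out without "stream": false (for example through a proxy that rewrites the body), the newline-delimited response can still be buffered into one reply. This is a minimal sketch, assuming each line is a JSON object carrying an optional message.content fragment and a done flag as described above; collectStreamedChat and ChatChunk are hypothetical names.

```typescript
// Shape of one streamed line, limited to the fields this sketch reads.
interface ChatChunk {
  message?: { role: string; content: string };
  done: boolean;
}

// Concatenates the content fragments from a newline-delimited JSON body.
function collectStreamedChat(ndjson: string): { content: string; done: boolean } {
  let content = "";
  let done = false;
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines, including the trailing one
    const chunk = JSON.parse(trimmed) as ChatChunk;
    content += chunk.message?.content ?? "";
    if (chunk.done) done = true;
  }
  return { content, done };
}
```

The input would come from await res.text() instead of await res.json(). If done is still false after the last line, the stream was cut off and the reply should be treated as incomplete.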

Generate endpoint tool

The /api/generate endpoint takes a single text prompt (not a messages array) and is useful for completion-style tasks — filling in code, generating raw text without a conversational structure:

server.tool(
  "generate_with_model",
  "Complete a text prompt using an Ollama model — useful for code generation, summarization, extraction",
  {
    model: z.string().describe("Ollama model name"),
    prompt: z.string().describe("Text prompt to complete"),
    system: z.string().optional().describe("System context to prepend to the prompt"),
    max_tokens: z.number().int().min(1).max(8192).default(512)
      .describe("Maximum number of tokens to generate (num_predict)"),
  },
  async ({ model, prompt, system, max_tokens }) => {
    try {
      const res = await ollamaFetch("/api/generate", {
        model,
        prompt,
        system,
        stream: false,
        options: { num_predict: max_tokens },
      });

      const data = await res.json() as {
        response: string;
        done: boolean;
        done_reason: string;
        prompt_eval_count: number;
        eval_count: number;
      };

      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            text: data.response,
            done_reason: data.done_reason,
            prompt_tokens: data.prompt_eval_count,
            output_tokens: data.eval_count,
          }),
        }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Ollama generate error: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

The done_reason field tells you why generation stopped: "stop" means the model finished naturally, "length" means num_predict was hit (increase max_tokens if you're seeing truncated output).
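The two values described above can be turned into an explicit truncation flag for the calling agent. This is a minimal sketch; interpretDoneReason and DoneInfo are hypothetical names, and any value other than "stop" or "length" is passed through as unrecognized.

```typescript
interface DoneInfo {
  truncated: boolean;
  hint: string;
}

// Maps done_reason to a flag and a short hint that can be added to the tool response.
function interpretDoneReason(doneReason: string | undefined, maxTokens: number): DoneInfo {
  if (doneReason === "length") {
    return {
      truncated: true,
      hint: `Output hit the ${maxTokens}-token limit; retry with a larger max_tokens.`,
    };
  }
  if (doneReason === "stop") {
    return { truncated: false, hint: "Model finished naturally." };
  }
  return { truncated: false, hint: `Unrecognized done_reason: ${String(doneReason)}` };
}
```

Including the truncated flag in the JSON returned by generate_with_model lets an agent decide on its own whether to call the tool again with a higher limit.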

Model management tools

MCP tools for pulling, listing, and removing models turn Ollama model management into agent-accessible operations:

server.tool(
  "pull_model",
  "Download an Ollama model from the registry (ollama.com/library). Blocks until complete.",
  {
    model: z.string().describe("Model name to pull — e.g. 'llama3.2', 'mistral:7b', 'qwen2.5:14b'"),
  },
  async ({ model }) => {
    try {
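      // With stream: false, /api/pull responds once with a single status object after the download finishes.
      // Note: ollamaFetch aborts after 2 minutes, so large models may need a longer timeout.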
      const res = await ollamaFetch("/api/pull", { name: model, stream: false });
      const data = await res.json() as { status: string };
      return {
        content: [{ type: "text", text: JSON.stringify({ model, status: data.status }) }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Pull failed: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

server.tool(
  "delete_model",
  "Remove an Ollama model from local storage to reclaim disk space",
  {
    model: z.string().describe("Model name to delete — must match exactly as returned by check_ollama_health"),
  },
  async ({ model }) => {
    try {
      const res = await fetch(`${OLLAMA_BASE}/api/delete`, {
        method: "DELETE",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ name: model }),
      });
      if (res.status === 404) {
        return {
          isError: true,
          content: [{ type: "text", text: `Model '${model}' not found — run check_ollama_health to see available models` }],
        };
      }
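      // Any other non-2xx status is also a failure, not a successful delete
      if (!res.ok) {
        return {
          isError: true,
          content: [{ type: "text", text: `Delete failed: Ollama returned ${res.status}` }],
        };
      }
      return {
        content: [{ type: "text", text: `Deleted model: ${model}` }],
      };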
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Delete failed: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

Context window limits by model

Hosted APIs from OpenAI and Anthropic mostly offer context windows in the 128k–200k token range, but models served through Ollama vary dramatically. Sending a prompt that exceeds the active context window causes silent truncation: the oldest tokens are dropped and no error is returned.

Model	Max supported context (tokens)	Notes
llama3.2:3b	131,072	Raise num_ctx to use the full window
llama3.2:1b	131,072	Raise num_ctx to use the full window
mistral:7b	32,768	Use qwen2.5 for longer context
gemma3:4b	131,072	Google's efficient model
qwen2.5:7b	131,072	Excellent coding + long context
qwen2.5:14b	131,072	Higher quality, requires 10GB+ RAM
phi4:14b	16,384	Short context; use for quick tasks

These are maximums, not what you get out of the box: Ollama's runtime default for num_ctx is much smaller (2,048 tokens in older releases) unless the request or the model's Modelfile sets it. Override num_ctx in the options object to use more of the window. Going above the model's maximum has no effect, and going above available GPU/CPU memory causes OOM errors. Check prompt_eval_count in responses: if it is at or very close to num_ctx, the prompt was probably truncated.
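A rough pre-flight check can catch oversized prompts before they are sent. This is a minimal sketch, assuming about four characters per token, which is only a crude heuristic for English text and not the model's real tokenizer; estimateTokens and fitsContext are hypothetical names.

```typescript
// Crude token estimate: assumes roughly 4 characters per token (heuristic only).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns true when the estimated prompt plus the reserved output budget fits in numCtx.
function fitsContext(prompt: string, numCtx: number, reservedForOutput: number): boolean {
  return estimateTokens(prompt) + reservedForOutput <= numCtx;
}
```

A tool handler could return isError: true with a clear message when fitsContext is false, instead of letting the prompt be truncated without notice. Because the estimate is approximate, leave some margin below the real limit.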

When to use Ollama vs cloud APIs in an MCP server

Factor	Use Ollama	Use cloud API (OpenAI / Anthropic)
Data privacy	Data stays on your machine	Data sent to third-party API
Cost	Zero per-token cost (hardware only)	Pay per token — varies by model
Latency	Low on GPU; slow on CPU (10-100+ s)	Consistent 1-15s depending on model
Quality	Good (Llama3.2, Qwen2.5, Gemma3)	Best available (GPT-4o, Claude Opus)
Context window	Up to 128k for most models	128k-200k
Availability	Depends on local Ollama process	99.9%+ SLA
API keys	None required	Required — rotation, expiry risks
Offline use	Works without internet	Requires internet

The practical rule: use Ollama in local development (zero cost, no key management) and cloud APIs in production (reliability, quality). For privacy-sensitive production MCP servers (medical, legal, financial data), Ollama on a private GPU box is the correct architecture — no data leaves the network perimeter.

Frequently asked questions

How do I handle the case where Ollama is slow on CPU versus fast on GPU?

Check the total_duration field in responses (reported in nanoseconds; the chat tool above converts it to milliseconds as total_duration_ms). If it exceeds 30 s on simple prompts, Ollama is probably running on CPU rather than GPU, or it spent much of that time loading the model. The 120 s timeout in the ollamaFetch helper is a reasonable floor for CPU inference; raise it if large prompts still abort. Expose the duration in your tool response so calling agents can decide whether to retry with a smaller model. For production MCP servers, deploy Ollama on a machine with a compatible GPU: CPU inference is 5-20x slower and unsuitable for interactive workloads.
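Generation throughput is often a steadier signal than total duration, because it excludes model load time. The sketch below assumes the response also carries eval_duration in nanoseconds alongside eval_count; the function names and the 10 tokens-per-second threshold are hypothetical and should be tuned to your hardware.

```typescript
// Tokens generated per second, from eval_count and eval_duration (nanoseconds).
function tokensPerSecond(evalCount: number, evalDurationNs: number): number {
  if (evalDurationNs <= 0) return 0; // duration missing or zero: rate unknown
  return evalCount / (evalDurationNs / 1e9);
}

// Flags a response as likely CPU-bound when throughput falls below a chosen threshold.
function looksCpuBound(
  evalCount: number,
  evalDurationNs: number,
  minTokensPerSecond = 10, // assumption: arbitrary cut-off, not an Ollama constant
): boolean {
  const rate = tokensPerSecond(evalCount, evalDurationNs);
  return rate > 0 && rate < minTokensPerSecond;
}
```

Returning the rate in the tool response gives the calling agent a concrete number to act on when choosing between models.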

How do I pin a specific model version rather than using the floating 'latest' tag?

Ollama model names include a tag: llama3.2:3b pins the 3B parameter variant. To pin an exact model version, pull by digest (ollama pull llama3.2@sha256:abc123...). GET /api/tags returns the full SHA256 digest in each model's digest field; the check_ollama_health tool above does not surface it, so add digest to its mapped output if you need it. For MCP tools in production, pass the full name with tag in the model argument and document which tag you've tested. Avoid bare names like llama3 without a tag: Ollama resolves them to the latest tag, which can point to a different build over time.
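A tool handler can reject floating names before they reach Ollama. This is a minimal sketch with a hypothetical isPinnedModelName helper; it treats a missing tag and the literal latest tag as unpinned, and it only inspects the last path segment so that a registry host with a port is not mistaken for a tag.

```typescript
// Returns true when the model name carries an explicit tag other than "latest".
function isPinnedModelName(model: string): boolean {
  // Only the part after the final "/" can contain the tag.
  const lastSegment = model.slice(model.lastIndexOf("/") + 1);
  const tag = lastSegment.split(":")[1];
  return tag !== undefined && tag !== "" && tag !== "latest";
}
```

In chat_with_model or generate_with_model, a failed check could return isError: true with a message asking the caller for a tagged name.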

Can I expose Ollama to the internet so remote MCP clients can use it?

Not directly: Ollama's default server has no authentication. For remote access, put a reverse proxy (nginx, Caddy) in front with mTLS or basic auth. If the proxy runs on the same machine, leave Ollama bound to 127.0.0.1; set OLLAMA_HOST=0.0.0.0 on the Ollama server only when the proxy lives on another host, and firewall port 11434 so that only the proxy can reach it. A safer pattern: run your MCP server (with proper auth) on a public endpoint, and have the MCP server call Ollama on localhost:11434 internally. This keeps Ollama completely isolated from the public internet. Wire AliveMCP on your public MCP endpoint to monitor the full stack.

How do I implement a model selection tool for Ollama?

Call GET /api/tags (the same endpoint as check_ollama_health) and return the model list. Include the context window size for each model: query POST /api/show with { name: modelName } and read the context length from model_info, where it is stored under an architecture-prefixed key such as "llama.context_length". Expose this as a list_ollama_models tool that the calling agent calls first, then picks the right model based on task complexity and available context.
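Reading the context length out of the /api/show response could look like the sketch below. It assumes model_info is a flat object whose keys are prefixed with the architecture name given under "general.architecture" (for example "llama.context_length"); contextLengthFromModelInfo is a hypothetical name, and the fallback scan covers responses where the architecture key is absent.

```typescript
// Extracts the context length from a model_info object, or undefined if not present.
function contextLengthFromModelInfo(modelInfo: Record<string, unknown>): number | undefined {
  // Preferred path: use the declared architecture as the key prefix.
  const arch = modelInfo["general.architecture"];
  if (typeof arch === "string") {
    const value = modelInfo[`${arch}.context_length`];
    if (typeof value === "number") return value;
  }
  // Fallback: accept any numeric key that ends in ".context_length".
  for (const [key, value] of Object.entries(modelInfo)) {
    if (key.endsWith(".context_length") && typeof value === "number") return value;
  }
  return undefined;
}
```

A list_ollama_models tool could call this once per model and return the value next to each name, leaving it undefined when the model does not report one.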

How is Ollama different from the Ollama MCP server listed in the MCP registry?

The Ollama MCP server in the MCP registry is a pre-built MCP server that wraps Ollama — you install it and connect your MCP client to it. The approach in this guide is building your own MCP server that calls Ollama internally as one of many backends. The registry server is useful if you want Ollama access without writing code; the embedded approach is right when your MCP server has other tools and you want Ollama as one of them. See Ollama as an MCP server for the registry approach.