MCP server vector database integration

Vector databases are the retrieval layer for RAG-powered MCP tools: the LLM calls a search tool, the tool queries a vector database, and the retrieved chunks are returned as context. Three decisions shape the MCP-specific design: whether to generate embeddings inside the search tool or in a separate dedicated tool, how to expose metadata filters so the LLM can scope retrieval without hard-coded knowledge of your schema, and how to cap retrieved chunk sizes so a single tool call doesn't overflow the LLM's context window.

TL;DR

Expose embedding generation as a separate generate_embedding tool so the LLM can reuse embeddings across multiple search calls. For Pinecone, use @pinecone-database/pinecone — the query and upsert tools are simple wrappers around index.query() and index.upsert(). For Chroma (local), call the HTTP API directly at localhost:8000 or use chromadb npm. Always apply a score threshold (discard results below 0.7 cosine similarity) and a chunk token cap (max 500 tokens per chunk) before returning to prevent low-quality context that wastes the LLM's attention. Wire AliveMCP on your MCP endpoint so vector database connection failures (index down, auth expired) surface as monitored outages rather than silent wrong answers.

Embedding generation as a separate tool

The first design question: should your search tool generate embeddings internally, or should you expose a separate generate_embedding tool? A separate tool lets the LLM embed a text once and reuse the vector across multiple search calls:

import OpenAI from "openai";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const server = new McpServer({ name: "rag-tools", version: "1.0.0" });

server.tool(
  "generate_embedding",
  "Convert text to a vector embedding for semantic search. Returns a float array. Cache the result and reuse it across multiple search calls.",
  {
    text: z.string().max(8000).describe("Text to embed — keep under 8000 chars for text-embedding-3-small"),
    model: z.enum(["text-embedding-3-small", "text-embedding-3-large"])
      .default("text-embedding-3-small")
      .describe("Embedding model — 3-small (1536 dims, cheap) vs 3-large (3072 dims, more accurate)"),
  },
  async ({ text, model }) => {
    try {
      const response = await openai.embeddings.create({
        model,
        input: text,
        encoding_format: "float",
      });

      const embedding = response.data[0].embedding;
      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            embedding,
            dimensions: embedding.length,
            model,
            usage: response.usage,
          }),
        }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Embedding error: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

For local embedding generation without external API calls, use Ollama with an embedding model such as nomic-embed-text via POST /api/embeddings. See embedding generation in MCP servers for the local alternative.
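A minimal sketch of the local path, assuming Ollama is listening on its default port (11434) and the nomic-embed-text model has already been pulled. The helper names (buildOllamaEmbeddingRequest, parseOllamaEmbedding, embedLocally) are hypothetical, and the fetch function is passed in so the sketch can be exercised without a running server.

```typescript
// Assumption: Ollama runs locally on its default port.
const OLLAMA_URL = "http://localhost:11434";

interface HttpRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Minimal shape of fetch that this sketch relies on.
type FetchLike = (
  url: string,
  init: HttpRequest["init"],
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the POST /api/embeddings request; the endpoint takes { model, prompt }.
function buildOllamaEmbeddingRequest(text: string, model = "nomic-embed-text"): HttpRequest {
  return {
    url: `${OLLAMA_URL}/api/embeddings`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt: text }),
    },
  };
}

// Validate the response body, expected to look like { embedding: number[] }.
function parseOllamaEmbedding(body: unknown): number[] {
  const embedding = (body as { embedding?: unknown } | null)?.embedding;
  if (
    !Array.isArray(embedding) ||
    embedding.length === 0 ||
    !embedding.every((v) => typeof v === "number")
  ) {
    throw new Error("Ollama response did not contain a numeric embedding array");
  }
  return embedding as number[];
}

// Send the request and return the vector, failing loudly on HTTP errors.
async function embedLocally(text: string, fetchFn: FetchLike): Promise<number[]> {
  const { url, init } = buildOllamaEmbeddingRequest(text);
  const res = await fetchFn(url, init);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  return parseOllamaEmbedding(await res.json());
}
```

In Node 18 or later, the global fetch can be passed as the second argument: await embedLocally("hello", fetch). The returned array can go into the same response shape that generate_embedding uses above.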

Pinecone query and upsert tools

Pinecone is a widely used managed vector database. The SDK wraps the Pinecone REST API:

npm install @pinecone-database/pinecone
import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

// Query tool — retrieve semantically similar records
server.tool(
  "search_pinecone",
  "Query a Pinecone index for records similar to an embedding vector. Returns top-K matches with scores and metadata.",
  {
    index_name: z.string().describe("Pinecone index name"),
    embedding: z.array(z.number()).describe("Query vector from generate_embedding"),
    top_k: z.number().int().min(1).max(20).default(5),
    score_threshold: z.number().min(0).max(1).default(0.7)
      .describe("Minimum cosine similarity score — discard results below this"),
    filter: z.record(z.unknown()).optional()
      .describe("Pinecone metadata filter object — e.g. { source: { '$eq': 'docs' }, year: { '$gte': 2024 } }"),
    namespace: z.string().optional().describe("Pinecone namespace for multi-tenant isolation"),
  },
  async ({ index_name, embedding, top_k, score_threshold, filter, namespace }) => {
    try {
      const index = pinecone.index(index_name);
      const ns = namespace ? index.namespace(namespace) : index;

      const queryResponse = await ns.query({
        vector: embedding,
        topK: top_k,
        filter: filter as Record<string, unknown> | undefined,
        includeMetadata: true,
        includeValues: false, // don't return vectors — saves bandwidth
      });

      // Apply score threshold and token budget
      const filtered = (queryResponse.matches ?? [])
        .filter(m => (m.score ?? 0) >= score_threshold)
        .map(m => ({
          id: m.id,
          score: m.score,
          // Truncate long text chunks to avoid context overflow
          text: truncateToTokens(String(m.metadata?.text ?? ""), 500),
          metadata: excludeKey(m.metadata ?? {}, "text"), // return metadata without the full text
        }));

      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            matches: filtered,
            total_returned: queryResponse.matches?.length ?? 0,
            after_threshold: filtered.length,
            score_threshold,
          }),
        }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Pinecone query error: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

// Upsert tool — store new records
server.tool(
  "upsert_pinecone",
  "Store records in a Pinecone index. Each record needs an ID, embedding vector, and optional metadata.",
  {
    index_name: z.string(),
    records: z.array(z.object({
      id: z.string(),
      embedding: z.array(z.number()),
      text: z.string().describe("The original text for this chunk — stored in metadata"),
      metadata: z.record(z.unknown()).optional().describe("Additional metadata fields for filtering"),
    })).min(1).max(100),
    namespace: z.string().optional(),
  },
  async ({ index_name, records, namespace }) => {
    try {
      const index = pinecone.index(index_name);
      const ns = namespace ? index.namespace(namespace) : index;

      const vectors = records.map(r => ({
        id: r.id,
        values: r.embedding,
        metadata: { text: r.text, ...(r.metadata ?? {}) },
      }));

      await ns.upsert(vectors);

      return {
        content: [{ type: "text", text: JSON.stringify({ upserted: records.length, index: index_name }) }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Pinecone upsert error: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

function truncateToTokens(text: string, maxTokens: number): string {
  // Approximate: 1 token ≈ 4 chars for English text
  const maxChars = maxTokens * 4;
  return text.length > maxChars ? text.slice(0, maxChars) + "…" : text;
}

function excludeKey(obj: Record<string, unknown>, key: string): Record<string, unknown> {
  const { [key]: _, ...rest } = obj;
  return rest;
}

The metadata filter syntax uses Pinecone's filter language: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin. Expose filter building as a separate build_pinecone_filter tool if your agents need to construct complex filters without knowing Pinecone's syntax.
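One possible shape for the logic behind a build_pinecone_filter tool is sketched here. The Condition type and the friendly operator names are assumptions made for illustration, not part of Pinecone's SDK. The sketch also combines multiple conditions with $and, an operator Pinecone's filter language supports in addition to the comparison operators listed above.

```typescript
type FilterOp = "eq" | "ne" | "gt" | "gte" | "lt" | "lte" | "in" | "nin";
type Scalar = string | number | boolean;

// Hypothetical input shape: one entry per constraint the agent wants to apply.
interface Condition {
  field: string;
  op: FilterOp;
  value: Scalar | Scalar[];
}

// Translate plain conditions into a Pinecone metadata filter object.
function buildPineconeFilter(conditions: Condition[]): Record<string, unknown> | undefined {
  if (conditions.length === 0) return undefined; // no filter: search everything

  const clauses = conditions.map(({ field, op, value }) => {
    const isList = Array.isArray(value);
    if ((op === "in" || op === "nin") !== isList) {
      throw new Error(`"${op}" on "${field}": in/nin take an array, other operators take a single value`);
    }
    if (["gt", "gte", "lt", "lte"].includes(op) && typeof value !== "number") {
      throw new Error(`"${op}" on "${field}" needs a numeric value`);
    }
    return { [field]: { [`$${op}`]: value } };
  });

  // A single clause is returned as-is; several are combined with $and.
  return clauses.length === 1 ? clauses[0] : { $and: clauses };
}

const filter = buildPineconeFilter([
  { field: "source", op: "eq", value: "docs" },
  { field: "year", op: "gte", value: 2024 },
]);
console.log(JSON.stringify(filter));
// prints {"$and":[{"source":{"$eq":"docs"}},{"year":{"$gte":2024}}]}
```

Validating operators up front means a malformed condition becomes a readable tool error for the agent instead of a rejected query at the database.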

Chroma tools (local vector database)

Chroma is the standard choice for local development: it runs in-process (from Python) or as a server, requires no API key, and persists to disk. From a Node.js MCP server, use the chromadb npm client, which talks to a Chroma server process over HTTP:

npm install chromadb
import { ChromaClient } from "chromadb";

const chroma = new ChromaClient({ path: process.env.CHROMA_HOST ?? "http://localhost:8000" });

server.tool(
  "search_chroma",
  "Query a Chroma collection for records similar to an embedding vector",
  {
    collection_name: z.string(),
    embedding: z.array(z.number()).describe("Query vector from generate_embedding"),
    n_results: z.number().int().min(1).max(20).default(5),
    where: z.record(z.unknown()).optional()
      .describe("Chroma where clause for metadata filtering — e.g. { source: 'docs' } or { '$and': [{ year: { '$gte': 2024 } }] }"),
    where_document: z.record(z.unknown()).optional()
      .describe("Chroma where_document clause — filter by document content — e.g. { '$contains': 'TypeScript' }"),
  },
  async ({ collection_name, embedding, n_results, where, where_document }) => {
    try {
      const collection = await chroma.getCollection({ name: collection_name });

      const results = await collection.query({
        queryEmbeddings: [embedding],
        nResults: n_results,
        where: where as Record<string, unknown> | undefined,
        whereDocument: where_document as Record<string, unknown> | undefined,
        include: ["documents", "metadatas", "distances"],
      });

      const ids = results.ids[0] ?? [];
      const documents = results.documents[0] ?? [];
      const metadatas = results.metadatas[0] ?? [];
      const distances = results.distances?.[0] ?? [];

      const matches = ids.map((id, i) => ({
        id,
        // Chroma returns L2 distances — convert to similarity score (0-1)
        score: Math.max(0, 1 - (distances[i] ?? 0) / 2),
        text: truncateToTokens(documents[i] ?? "", 500),
        metadata: metadatas[i] ?? {},
      })).filter(m => m.score >= 0.5); // threshold for L2-converted scores

      return {
        content: [{
          type: "text",
          text: JSON.stringify({ matches, collection: collection_name }),
        }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Chroma query error: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

server.tool(
  "upsert_chroma",
  "Add or update documents in a Chroma collection. Creates the collection if it doesn't exist.",
  {
    collection_name: z.string(),
    records: z.array(z.object({
      id: z.string(),
      text: z.string(),
      embedding: z.array(z.number()),
      metadata: z.record(z.union([z.string(), z.number(), z.boolean()])).optional(),
    })).min(1).max(100),
  },
  async ({ collection_name, records }) => {
    try {
      const collection = await chroma.getOrCreateCollection({ name: collection_name });

      await collection.upsert({
        ids: records.map(r => r.id),
        documents: records.map(r => r.text),
        embeddings: records.map(r => r.embedding),
        metadatas: records.map(r => r.metadata ?? {}),
      });

      return {
        content: [{ type: "text", text: JSON.stringify({ upserted: records.length, collection: collection_name }) }],
      };
    } catch (err) {
      return {
        isError: true,
        content: [{ type: "text", text: `Chroma upsert error: ${err instanceof Error ? err.message : String(err)}` }],
      };
    }
  }
);

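Chroma uses L2 distance by default, reported as the squared Euclidean distance. The score conversion above (1 - distance/2) maps that onto a 0–1 score. It matches cosine similarity only when the embeddings are unit length, as OpenAI's are; for other vectors it is a rough approximation. To get cosine distances regardless of vector length, create the collection with metadata: { "hnsw:space": "cosine" }. The returned distances are then cosine dissimilarities (0 = identical, 2 = opposite), so the similarity is 1 - distance rather than 1 - distance/2.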
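The relationship between the two distance modes can be sketched as a small helper. This assumes unit-length embeddings for the "l2" case, where Chroma's squared L2 distance d and the cosine similarity are related by d = 2 × (1 − cos); for vectors of other lengths the "l2" branch is only a heuristic. The names distanceToSimilarity and squaredL2 are hypothetical.

```typescript
type ChromaSpace = "l2" | "cosine";

// Convert a Chroma distance into a cosine-style similarity clamped to [0, 1].
function distanceToSimilarity(distance: number, space: ChromaSpace): number {
  // "l2": squared Euclidean distance, so cos = 1 - d / 2 for unit vectors.
  // "cosine": the distance is already 1 - cos.
  const cosine = space === "l2" ? 1 - distance / 2 : 1 - distance;
  return Math.min(1, Math.max(0, cosine));
}

// Squared Euclidean distance, used here only to check the identity by hand.
function squaredL2(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0);
}

// Two unit vectors whose cosine similarity is 0.6.
const queryVec = [1, 0];
const docVec = [0.6, 0.8];

// Both values come out at approximately 0.6 (up to floating-point rounding).
console.log(distanceToSimilarity(squaredL2(queryVec, docVec), "l2"));
console.log(distanceToSimilarity(0.4, "cosine"));
```

Anti-correlated vectors (negative cosine) are clamped to 0, which keeps the score inside the 0–1 range that a threshold expects.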

Context window budget management

The most common mistake in RAG MCP tools is returning too many chunks, or chunks that are too long, and overflowing the LLM's context window. A search tool that returns 20 chunks of 2,000 tokens each adds 40,000 tokens of context to the next LLM call, potentially more than the model can handle.

The correct approach combines four constraints:

| Constraint | Value | Purpose |
| --- | --- | --- |
| Score threshold | ≥ 0.7 cosine similarity | Discard low-relevance chunks that add noise |
| top_k cap | 5–10 chunks maximum | Limit total retrieved chunks |
| Per-chunk token cap | 300–500 tokens per chunk | Prevent single long chunks from dominating context |
| Total token budget | 2,000–4,000 tokens for all chunks | Leave room for system prompt + query + LLM response |
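A quick worked calculation shows how these caps interact. The helper name worstCaseTokens is hypothetical; it simply multiplies the chunk count by the per-chunk cap and applies the total budget as a ceiling.

```typescript
// Upper bound on the tokens a single search call can add to the context.
function worstCaseTokens(topK: number, perChunkCap: number, totalBudget = Infinity): number {
  return Math.min(topK * perChunkCap, totalBudget);
}

// The uncapped example: 20 chunks of 2,000 tokens each.
const uncapped = worstCaseTokens(20, 2000);

// The upper end of each range in the table: 10 chunks, 500 tokens, 4,000 total.
const capped = worstCaseTokens(10, 500, 4000);

console.log({ uncapped, capped });
// prints { uncapped: 40000, capped: 4000 }
```

At the upper end of the ranges, 10 chunks of 500 tokens would come to 5,000 tokens, so the total budget is the constraint that actually binds.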

Implement the total token budget check in the tool before returning:

function applyContextBudget(
  chunks: Array<{ id: string; score: number; text: string; metadata: unknown }>,
  options: { scoreThreshold: number; maxChunks: number; maxTokensPerChunk: number; totalTokenBudget: number }
): Array<{ id: string; score: number; text: string; metadata: unknown }> {
  let tokenTotal = 0;
  return chunks
    .filter(c => c.score >= options.scoreThreshold)
    .slice(0, options.maxChunks) // assumes chunks are already sorted by score, best first
    // Truncate first so the budget counts the text that is actually returned,
    // and copy each chunk instead of mutating the caller's objects
    .map(c => ({ ...c, text: truncateToTokens(c.text, options.maxTokensPerChunk) }))
    .filter(c => {
      const chunkTokens = Math.ceil(c.text.length / 4); // ~4 chars/token
      if (tokenTotal + chunkTokens > options.totalTokenBudget) return false;
      tokenTotal += chunkTokens;
      return true;
    });
}

For the broader token budget problem across MCP tool calls, see token budget management for MCP servers.

Vector database comparison

| Database | Hosting | Best for | MCP integration effort |
| --- | --- | --- | --- |
| Pinecone | Managed cloud | Production RAG, large indexes (>1M vectors) | Low: official SDK |
| Chroma | Local / self-hosted | Local dev, small indexes, no API key | Low: official SDK or HTTP API |
| Weaviate | Cloud + self-hosted | Hybrid search (vector + BM25), GraphQL queries | Medium: GraphQL client |
| Qdrant | Cloud + self-hosted | High-performance, filtering, payload storage | Low: official JS SDK |
| pgvector | PostgreSQL extension | Existing Postgres infra, SQL queries with vectors | Low: standard Postgres client |

For most new MCP servers: Chroma for local dev, Pinecone for production. Switch to pgvector if you already have Postgres and want to avoid adding another service. For more on the vector search patterns these tools support, see vector search in MCP servers.

Frequently asked questions

Should my MCP tool generate the embedding or expect the caller to provide it?

Expose both options, but default to generating internally. A search_documents tool that accepts a plain text query and generates the embedding internally is simpler for most callers. A separate generate_embedding tool is useful when the LLM needs to embed multiple texts for comparison or use the same embedding across several collections. For production RAG MCP servers, expose both: search_by_text (embeds internally) and search_by_vector (accepts a pre-computed embedding).

How do I handle the case where no results meet the score threshold?

Return an empty matches array with a clear message: { matches: [], reason: "No results above score threshold 0.7 — try a lower threshold or rephrase the query" }. Never return low-score results silently — a 0.3-similarity chunk is worse than no context for most LLMs. Consider including the top score that was available even when below threshold, so the calling agent can decide whether to lower the bar or escalate to a different retrieval strategy.
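A minimal sketch of that response shape follows. The field name best_available_score and the helper buildSearchResult are assumptions for illustration; any name works as long as the tool description documents it.

```typescript
interface Match {
  id: string;
  score: number;
  text: string;
}

interface SearchResult {
  matches: Match[];
  reason?: string;
  best_available_score?: number; // hypothetical field name
}

// Apply the threshold; if nothing passes, explain why and report the best score seen.
function buildSearchResult(candidates: Match[], scoreThreshold: number): SearchResult {
  const matches = candidates.filter((m) => m.score >= scoreThreshold);
  if (matches.length > 0) return { matches };

  const result: SearchResult = {
    matches: [],
    reason: `No results above score threshold ${scoreThreshold}. Try a lower threshold or rephrase the query.`,
  };
  if (candidates.length > 0) {
    result.best_available_score = Math.max(...candidates.map((m) => m.score));
  }
  return result;
}

const emptyResult = buildSearchResult(
  [
    { id: "a", score: 0.42, text: "loosely related chunk" },
    { id: "b", score: 0.31, text: "unrelated chunk" },
  ],
  0.7,
);
console.log(emptyResult.matches.length, emptyResult.best_available_score);
// prints 0 0.42
```

The low-scoring chunk text is never returned; only its score is, which gives the calling agent enough information to decide whether retrying with a lower threshold is worthwhile.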

How do I implement delete in these tools?

Pinecone: index.deleteOne(id) or index.deleteMany(ids). Chroma: collection.delete({ ids: [id] }) or with a where filter to delete by metadata. Expose a delete_records tool with an ids array argument. Implement a confirmation step for bulk deletes — have the LLM call a preview_delete tool first that returns what would be deleted before executing. Delete operations are irreversible and should be treated with the same caution as database mutations.

How do I chunk documents for storage in a vector database?

Sentence-window chunking works well for most MCP RAG use cases: split at sentence boundaries, target 200–400 tokens per chunk, and overlap adjacent chunks by 1–2 sentences to avoid splitting context at boundaries. Store the parent document ID in each chunk's metadata so you can fetch the full document when a chunk matches. Fixed-size character splitting is simpler but creates worse retrieval quality because it cuts mid-sentence. For code documents, chunk at function/class boundaries rather than by token count — code semantics don't transfer well with arbitrary splits.
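The sentence-window approach can be sketched as follows. This is a simplified illustration: the sentence splitter is a naive punctuation-based regex (it will mis-split abbreviations and decimals), token counts use the same rough 4-characters-per-token estimate as the rest of this guide, and the names chunkBySentence and Chunk are hypothetical.

```typescript
// Rough token estimate: about 4 characters per token for English text.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

interface Chunk {
  parentId: string; // ID of the source document, stored so the full text can be fetched later
  index: number;
  text: string;
}

function chunkBySentence(
  parentId: string,
  text: string,
  targetTokens = 300,
  overlapSentences = 1,
): Chunk[] {
  // Naive splitter: a sentence is a run of text ending in ., ! or ?
  const raw: string[] = text.match(/[^.!?]+[.!?]*/g) ?? [];
  const sentences = raw.map((s) => s.trim()).filter((s) => s.length > 0);

  const chunks: Chunk[] = [];
  let start = 0;
  while (start < sentences.length) {
    // Always take one sentence, then grow while the next one still fits the target.
    let end = start + 1;
    let tokens = estimateTokens(sentences[start]);
    while (end < sentences.length && tokens + estimateTokens(sentences[end]) <= targetTokens) {
      tokens += estimateTokens(sentences[end]);
      end++;
    }
    chunks.push({ parentId, index: chunks.length, text: sentences.slice(start, end).join(" ") });
    if (end >= sentences.length) break;
    // Step back by the overlap, but always advance by at least one sentence.
    start = Math.max(end - overlapSentences, start + 1);
  }
  return chunks;
}

const demoChunks = chunkBySentence("doc-1", "One two. Three four. Five six. Seven eight.", 6, 1);
console.log(demoChunks.map((c) => c.text));
// prints [ 'One two. Three four.', 'Three four. Five six.', 'Five six. Seven eight.' ]
```

The tiny target of 6 tokens is only there to make the overlap visible in the output. With the 200–400 token target suggested above, each chunk would hold several sentences and share its last sentence with the start of the next chunk.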

How do I monitor vector database health in an MCP server?

Add a check_vector_db_health tool that performs a lightweight query (fetch one record by ID, or query with a zero vector and top_k=1) to verify the connection and index are operational. Run this at MCP server startup and surface the result in your server's health check endpoint. Wire AliveMCP on your MCP server endpoint — when the vector database goes down (Pinecone outage, Chroma process crash, network partition), your MCP server's tools fail, and AliveMCP will page you before users report wrong answers.
