
Embedding Tools in MCP Servers — generate and store vectors via MCP

Exposing embedding generation as an MCP tool lets agents generate and store vectors without direct API key access. The MCP server holds the credentials, handles batching, caches identical inputs, and presents a single tool interface whether the embedding model is OpenAI, Cohere, or a local sentence-transformers model. When the embedding API goes down, the failure surfaces only inside the tool layer: the generate_embedding tool returns an error response, the agent stops indexing, and your RAG pipeline degrades without a visible alarm. AliveMCP can close this gap, but only if you configure a readiness probe that distinguishes process liveness from embedding API reachability.

TL;DR

Expose a generate_embedding tool that accepts text (or an array of texts for batch), calls your embedding model API, caches the result keyed by SHA-256 of the input, and returns the vector as a JSON array. Cache hits make the tool free (no API cost, ~1ms latency). For production, configure two health probes: /live checks the process, /ready calls the embedding API with a 1-token test input and returns 503 if the call fails or takes more than 2 seconds. AliveMCP monitors the MCP protocol layer; point its custom health check URL at /ready to catch embedding API outages before they degrade your RAG pipeline.

Why MCP as the embedding layer

Direct embedding API calls from agent code create several problems. Every agent component that needs embeddings duplicates the API client, key management, retry logic, and error handling. Rate limits are managed per client rather than globally. When you switch from text-embedding-ada-002 to text-embedding-3-small, you update every agent component separately. Caching, the biggest cost reduction lever for embeddings, requires a shared cache, but each client's in-process cache does nothing for other processes.

An MCP embedding server centralizes all of this. Agents call generate_embedding({ text: "...", model: "default" }) and receive a vector. The server decides which model to call, manages API keys, enforces rate limits across all callers, and maintains a shared embedding cache. Switching models means updating one server and re-indexing the corpus — agent code doesn't change.
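On the wire, the generate_embedding call described above is an ordinary MCP tools/call request over JSON-RPC 2.0. The sketch below shows the shape of that request; buildEmbeddingCall is a hypothetical helper used only for illustration, since an MCP client library would normally build this message for you.

```javascript
// Build the JSON-RPC 2.0 request an MCP client sends to invoke a tool.
// `buildEmbeddingCall` is a hypothetical helper, not part of any SDK.
function buildEmbeddingCall(id, text, model = "default") {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "generate_embedding",
      arguments: { text, model },
    },
  };
}

const request = buildEmbeddingCall(1, "How do I reset my password?");
console.log(JSON.stringify(request, null, 2));
```

Nothing in the request names a provider or carries a key, which is why the server can change models without touching agent code.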

Concern | Direct API calls (agent code) | MCP embedding server
API key management | Per-agent env var | Centralized in server
Rate limiting | Per-agent (10 agents = 10× usage) | Global pool with queuing
Caching | In-process, not shared | Shared across all callers
Model switching | Update every agent | Update one server
Cost tracking | Fragmented across agents | Unified in server logs
Fallback to local model | Complex per-agent logic | One server-side fallback
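
The "global pool with queuing" entry can be pictured as a single token bucket that every caller draws from. The sketch below is a minimal in-process illustration. The class name TokenBucket, its parameters, and the injectable clock are all hypothetical, and a deployment with several server processes would need a shared store instead.

```javascript
// A shared token bucket: every caller draws from one budget, so ten agents
// cannot collectively exceed the provider limit. `now` is injectable for tests.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Add tokens for the time elapsed since the last refill, capped at capacity.
  refill() {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = t;
  }

  // Returns 0 if `cost` tokens were taken, else the ms to wait before retrying.
  tryTake(cost) {
    if (cost > this.capacity) {
      throw new RangeError("cost exceeds bucket capacity");
    }
    this.refill();
    if (cost <= this.tokens) {
      this.tokens -= cost;
      return 0;
    }
    return Math.ceil(((cost - this.tokens) / this.refillPerSecond) * 1000);
  }
}
```

A queue would call tryTake with the estimated token count of each request and sleep for the returned number of milliseconds before trying again.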

Embedding model selection

Model | Dimensions | Cost | Latency | Best for
text-embedding-3-small | 1536 (reducible) | $0.02 / 1M tokens | 50–100ms / call | Most use cases; good quality/cost
text-embedding-3-large | 3072 (reducible) | $0.13 / 1M tokens | 80–150ms / call | High-precision search; multilingual
Cohere embed-english-v3.0 | 1024 | $0.10 / 1M tokens | 80–200ms / call | Search-optimized; input_type routing
BAAI/bge-small-en-v1.5 | 384 | Free (local) | 5–20ms / call (GPU) | Offline, privacy-sensitive, cost-zero
all-MiniLM-L6-v2 | 384 | Free (local) | 10–30ms / call (CPU) | Dev/test; small corpora

A critical constraint: the embedding model used at indexing time must be used at query time. The vector space is model-specific — embeddings from text-embedding-3-small and embeddings from bge-small-en-v1.5 are not interchangeable in the same index. If you switch models, you must re-embed the entire corpus and rebuild the index. Plan model choice before committing to production scale.
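
One way to enforce this constraint is to record the model alongside the index and check it on every query. The sketch below assumes a hypothetical indexMeta record written at indexing time and a hypothetical assertCompatible helper; neither is part of any vector store API.

```javascript
// Hypothetical index metadata recorded once at indexing time.
const indexMeta = { model: "text-embedding-3-small", dimensions: 1536 };

// Throw instead of running a query whose vector comes from a different space.
function assertCompatible(meta, queryModel, queryVector) {
  if (queryModel !== meta.model) {
    throw new Error(`Index built with ${meta.model}, query embedded with ${queryModel}`);
  }
  if (queryVector.length !== meta.dimensions) {
    throw new Error(`Expected ${meta.dimensions} dimensions, got ${queryVector.length}`);
  }
}
```

The model check matters even when dimensions happen to match: bge-small-en-v1.5 and all-MiniLM-L6-v2 both produce 384 values, so a length check alone would not catch that mix-up.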

OpenAI's text-embedding-3 models support dimension reduction via the dimensions parameter. Reducing text-embedding-3-small from 1536 to 256 dimensions reduces storage and query latency with measurable but often acceptable quality loss. Test against your specific domain before committing to reduced dimensions in production.
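
As a rough illustration, the dimensions parameter goes in the request body passed to openai.embeddings.create. If you instead shorten vectors you have already stored, they need to be L2-normalized again after truncation. The helper name truncateAndNormalize below is hypothetical, and the sketch assumes plain number arrays.

```javascript
// Request body for server-side reduction (passed to openai.embeddings.create).
const reducedRequest = {
  model: "text-embedding-3-small",
  input: "example text",
  dimensions: 256,
};

// Client-side alternative: keep the first `dims` values, then L2-normalize.
function truncateAndNormalize(vector, dims) {
  const head = vector.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, v) => sum + v * v, 0));
  if (norm === 0) return head; // avoid division by zero
  return head.map(v => v / norm);
}
```

Whichever route you take, the reduced size is part of the index definition: queries must be reduced to the same number of dimensions as the stored vectors.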

MCP tool schema and implementation

// Tool schema
const generateEmbeddingTool = {
  name: "generate_embedding",
  description: "Generate a vector embedding for text using the configured embedding model. Returns a numeric array suitable for similarity search.",
  inputSchema: {
    type: "object",
    properties: {
      text: {
        oneOf: [
          { type: "string", description: "Single text to embed" },
          { type: "array", items: { type: "string" }, maxItems: 100, description: "Batch of texts (max 100)" }
        ],
        description: "Text or array of texts to embed"
      },
      model: {
        type: "string",
        enum: ["default", "small", "large"],
        default: "default",
        description: "Embedding model variant. 'default' uses text-embedding-3-small."
      }
    },
    required: ["text"]
  }
};

// Implementation with SHA-256 caching
import OpenAI from 'openai';
import { createHash } from 'crypto';
import Database from 'better-sqlite3';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const db = new Database('./embedding_cache.db');

db.exec(`
  CREATE TABLE IF NOT EXISTS embedding_cache (
    hash       TEXT PRIMARY KEY,
    model      TEXT NOT NULL,
    dimensions INTEGER NOT NULL,
    vector     BLOB NOT NULL,   -- Float32Array as binary
    created_at INTEGER DEFAULT (unixepoch())
  )
`);

const MODEL_MAP = {
  default: 'text-embedding-3-small',
  small: 'text-embedding-3-small',
  large: 'text-embedding-3-large',
};

function cacheKey(text, model) {
  return createHash('sha256').update(`${model}:${text}`).digest('hex');
}

function vectorToBlob(vector) {
  const buf = Buffer.allocUnsafe(vector.length * 4);
  vector.forEach((v, i) => buf.writeFloatLE(v, i * 4));
  return buf;
}

function blobToVector(blob) {
  const arr = [];
  for (let i = 0; i < blob.length; i += 4) {
    arr.push(blob.readFloatLE(i));
  }
  return arr;
}

async function generateEmbedding(text, modelAlias = 'default') {
  const model = MODEL_MAP[modelAlias] || MODEL_MAP.default;
  const isBatch = Array.isArray(text);
  const inputs = isBatch ? text : [text];

  // Check cache for each input (prepare the statement once, outside the loop)
  const selectStmt = db.prepare('SELECT vector FROM embedding_cache WHERE hash = ?');
  const results = new Array(inputs.length);
  const uncachedIndexes = [];

  for (let i = 0; i < inputs.length; i++) {
    const key = cacheKey(inputs[i], model);
    const cached = selectStmt.get(key);
    if (cached) {
      results[i] = blobToVector(cached.vector);
    } else {
      uncachedIndexes.push(i);
    }
  }

  if (uncachedIndexes.length > 0) {
    const uncachedTexts = uncachedIndexes.map(i => inputs[i]);

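    // The embeddings endpoint accepts an array of up to 2048 inputs per request
    // (this is the regular synchronous endpoint, not the separate Batch API)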
    const response = await openai.embeddings.create({
      model,
      input: uncachedTexts,
      encoding_format: 'float',
    });

    const insertStmt = db.prepare(
      'INSERT OR REPLACE INTO embedding_cache (hash, model, dimensions, vector) VALUES (?, ?, ?, ?)'
    );

    uncachedIndexes.forEach((originalIndex, batchIndex) => {
      const vector = response.data[batchIndex].embedding;
      results[originalIndex] = vector;

      const key = cacheKey(inputs[originalIndex], model);
      insertStmt.run(key, model, vector.length, vectorToBlob(vector));
    });
  }

  return {
    content: [{
      type: "text",
      text: JSON.stringify({
        embeddings: isBatch ? results : results[0],
        model,
        dimensions: results.length > 0 ? results[0].length : 0,  // guard against an empty batch
        cache_hits: inputs.length - uncachedIndexes.length,
        api_calls: uncachedIndexes.length > 0 ? 1 : 0,
      })
    }]
  };
}
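
The schema caps batches at 100 items, but a server may still want to check the arguments itself before spending an API call. The sketch below is one possible guard; normalizeInputs is a hypothetical helper, and the rule that empty strings are rejected is an assumption chosen to fail early.

```javascript
// Validate tool arguments against the schema's constraints before any API call.
// Returns the inputs as an array, or throws with a message the agent can read.
function normalizeInputs(text, maxBatch = 100) {
  const inputs = Array.isArray(text) ? text : [text];
  if (inputs.length === 0) {
    throw new Error("text must contain at least one item");
  }
  if (inputs.length > maxBatch) {
    throw new Error(`batch exceeds ${maxBatch} items`);
  }
  inputs.forEach((t, i) => {
    if (typeof t !== "string" || t.trim() === "") {
      throw new Error(`input ${i} must be a non-empty string`);
    }
  });
  return inputs;
}
```

Calling this at the top of generateEmbedding would also rule out the empty-array case before the cache lookup runs.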

Batch embedding for efficiency

Calling the embedding API once per text is the slowest pattern. OpenAI's embeddings endpoint accepts up to 2048 texts per request (subject to total token limits). Token charges are the same either way, but a request with 100 short texts takes about as long as a request with 1 text, because per-request overhead dominates latency for short inputs.

For document indexing workloads (where an agent calls index_document with a large corpus), batch the embedding calls:

async function embedCorpus(chunks, batchSize = 200) {
  const embeddings = [];
  const tokenCounts = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const texts = batch.map(c => c.text);

    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: texts,
    });

    response.data.forEach(item => {
      embeddings.push(item.embedding);
    });

    tokenCounts.push(response.usage.total_tokens);

    // Rate limits depend on your account tier; the figures below are illustrative.
    // Batch size 200 × ~100 tokens per chunk ≈ 20K tokens per batch, so a
    // 1M tokens/min limit allows ~50 batches/min (one every ~1.2 s).
    if (i + batchSize < chunks.length) {
      // A 100 ms pause permits up to ~10 batches/sec (~12M tokens/min at this
      // batch size); lengthen it if your token limit is lower.
      await new Promise(r => setTimeout(r, 100));
    }
  }

  const totalTokens = tokenCounts.reduce((sum, t) => sum + t, 0);
  const costUsd = (totalTokens / 1_000_000) * 0.02;  // $0.02 per 1M tokens

  console.log(JSON.stringify({
    event: 'corpus_embedded',
    chunks: chunks.length,
    total_tokens: totalTokens,
    cost_usd: costUsd.toFixed(4),
  }));

  return embeddings;
}
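
embedCorpus batches by item count only, while requests are also subject to a total token limit. The sketch below groups texts by both constraints. The 4-characters-per-token estimate and the 100,000-token default budget are assumptions for illustration, and planBatches is a hypothetical helper.

```javascript
// Rough token estimate: ~4 characters per token for English text (an assumption;
// use a real tokenizer when accuracy matters).
const estimateTokens = text => Math.ceil(text.length / 4);

// Group texts into batches that respect both an item cap and a token budget.
// A single text larger than the budget still gets a batch of its own.
function planBatches(texts, maxItems = 2048, maxTokens = 100_000) {
  const batches = [];
  let current = [];
  let currentTokens = 0;

  for (const text of texts) {
    const tokens = estimateTokens(text);
    const full = current.length >= maxItems || currentTokens + tokens > maxTokens;
    if (full && current.length > 0) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(text);
    currentTokens += tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each returned batch can then be passed as the input array of one embeddings request.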

Local embedding with sentence-transformers

For privacy-sensitive corpora or cost-zero deployments, run embedding inference locally. The @xenova/transformers package runs ONNX-format models in Node.js without a Python dependency or GPU requirement — though GPU acceleration significantly reduces inference time.

import { pipeline } from '@xenova/transformers';

let embedder = null;

async function getEmbedder() {
  if (!embedder) {
    // Downloads model on first call (~90MB for bge-small-en-v1.5)
    embedder = await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');
  }
  return embedder;
}

async function localEmbed(texts) {
  const model = await getEmbedder();
  const isArray = Array.isArray(texts);
  const inputs = isArray ? texts : [texts];

  const outputs = await model(inputs, { pooling: 'mean', normalize: true });

  // outputs.tolist() returns number[][] — one vector per input
  const vectors = outputs.tolist();

  return isArray ? vectors : vectors[0];
}

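The normalize: true flag applies L2 normalization, producing unit vectors. With normalized vectors, cosine similarity equals dot product, so you can use the faster inner product operator in pgvector (<#>) instead of the cosine distance operator (<=>). Note that <#> returns the negative inner product, so sorting ascending still puts the most similar rows first.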
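
The equivalence is easy to check numerically. The sketch below uses two small hand-picked vectors and plain arrays; the helper names are illustrative only.

```javascript
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);
const norm = v => Math.sqrt(dot(v, v));
const normalize = v => {
  const n = norm(v);
  return v.map(x => x / n);
};
const cosine = (a, b) => dot(a, b) / (norm(a) * norm(b));

const a = [3, 4, 0];
const b = [1, 2, 2];

const cos = cosine(a, b);                         // cosine on the raw vectors
const dotUnit = dot(normalize(a), normalize(b));  // dot product on unit vectors
console.log(Math.abs(cos - dotUnit) < 1e-12);     // prints true
```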

Local inference adds 5–30ms per batch call (CPU) compared to 50–100ms for the OpenAI API. However, there's no per-call cost and no rate limit. For high-throughput indexing pipelines, local inference is faster and cheaper. For sporadic queries where the model initialization cost (model download + load, ~1–3 seconds) would dominate, the OpenAI API is faster unless you keep the model loaded in memory.

Embedding cache TTL and invalidation

Embedding cache entries are keyed by SHA-256(model + text). They never need to expire for correctness, because the same text with the same model produces the same vector. The only reason to expire cache entries is disk space management.

For a corpus of 100,000 chunks averaging 200 tokens (approximately 150 words), each cache row holds a 64-byte hex SHA-256 hash, a 1536 × 4 = 6,144-byte float32 vector, and a few dozen bytes of metadata: about 6.2KB per row, or roughly 625MB in total. On modern servers, this is negligible. Increase corpus size by 10× and the cache still fits in about 6GB, a single SQLite file.
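
The size estimate can be reproduced with a few lines of arithmetic. The sketch below assumes the hash is stored as a 64-character hex string (as the cacheKey function produces), vectors are float32, and the remaining columns plus row overhead come to about 40 bytes; that last figure is a guess, not a measurement.

```javascript
// Estimate SQLite cache size. Assumptions: 64-byte hex SHA-256 key,
// float32 vectors, and ~40 bytes per row for the other columns and overhead.
function estimateCacheBytes(rows, dimensions, overheadBytes = 40) {
  const hashBytes = 64;               // SHA-256 digest as a hex string
  const vectorBytes = dimensions * 4; // 4 bytes per float32
  return rows * (hashBytes + vectorBytes + overheadBytes);
}

const bytes = estimateCacheBytes(100_000, 1536);
console.log((bytes / 1e6).toFixed(0) + " MB"); // prints "625 MB"
```

The vector dominates the row, so halving the dimension count roughly halves the cache.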

-- Prune cache entries older than 90 days for text no longer in the corpus
-- Run this periodically via cron, not on every tool call
DELETE FROM embedding_cache
WHERE created_at < unixepoch('now', '-90 days')
  AND hash NOT IN (
    SELECT embedding_hash FROM document_chunks  -- keep if still referenced
    WHERE embedding_hash IS NOT NULL            -- a NULL here would make NOT IN match no rows
  );

If you switch embedding models, the old cache entries are still valid but will never be accessed (the new model produces different hashes for the same text, since the key includes the model name). Either delete the old model's entries manually or let them age out via TTL pruning.

Health probes: separating process liveness from API reachability

An MCP embedding server has two independent failure modes that require separate health probes:

  1. Process dead: the Node.js/Python process crashed. AliveMCP's protocol probe detects this within 60 seconds via connection refused.
  2. Embedding API down: the process is running but calls to OpenAI/Cohere are failing. AliveMCP's protocol probe cannot detect this — the MCP server responds to initialize and tools/list normally. Only a test tool call reveals the failure.
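
// Assumes an Express app; `openai` is the client created earlier.
import express from 'express';

const app = express();

// /live: process liveness
app.get('/live', (req, res) => {
  res.json({ status: 'ok' });
});

// /ready: embedding API reachability
app.get('/ready', async (req, res) => {
  try {
    const start = Date.now();

    // Test call with a minimal 1-token input. The per-request options put a
    // hard cap on the probe so it cannot hang on the SDK's default timeout
    // and retries (the 5 s cap is an assumed value; tune it for your setup).
    await openai.embeddings.create(
      {
        model: 'text-embedding-3-small',
        input: 'ok',
        encoding_format: 'float',
      },
      { timeout: 5000, maxRetries: 0 }
    );

    const latencyMs = Date.now() - start;

    // Slower than 2 seconds counts as degraded even though the call succeeded
    if (latencyMs > 2000) {
      return res.status(503).json({
        status: 'degraded',
        reason: 'embedding_api_slow',
        latency_ms: latencyMs,
      });
    }

    res.json({ status: 'ok', latency_ms: latencyMs });
  } catch (err) {
    // Network errors, API errors, and probe timeouts all land here
    res.status(503).json({
      status: 'unhealthy',
      reason: 'embedding_api_unreachable',
      error: err.message,
    });
  }
});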

Configure AliveMCP's custom health check URL to point at /ready. This makes embedding API outages visible as MCP server failures in your AliveMCP dashboard. The failure_reason field in AliveMCP's alert payload will read external_api_failure (triggered by your 503 response), distinguishing it from connection_refused (process dead) and timeout (process overloaded). This distinction matters for the runbook: process dead → restart the server; embedding API down → check OpenAI status page, activate local model fallback.
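
The runbook branch can be written down as a small lookup. In the sketch below the three reason strings are the ones named above, while the alert object shape, the function name runbookStep, and the suggested action for timeout are assumptions for illustration.

```javascript
// Map a failure reason to a first runbook step. The alert object shape is a
// hypothetical example; only the `failure_reason` values come from the text.
function runbookStep(alert) {
  switch (alert.failure_reason) {
    case "connection_refused":
      return "restart the MCP server process";
    case "external_api_failure":
      return "check the provider status page and consider the local fallback";
    case "timeout":
      return "inspect server load before restarting"; // assumed action
    default:
      return "escalate: unrecognized failure reason";
  }
}

console.log(runbookStep({ failure_reason: "external_api_failure" }));
```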

Fallback to local model on API outage

async function generateEmbeddingWithFallback(text, model = 'text-embedding-3-small') {
  try {
    const response = await openai.embeddings.create({ model, input: text });
    return { vector: response.data[0].embedding, source: 'openai' };
  } catch (err) {
    // Circuit open or API down: fall back to local model
    console.warn('OpenAI embedding API failed, using local model:', err.message);
    const vector = await localEmbed(text);
    return { vector, source: 'local' };
  }
}

The fallback introduces a subtle correctness issue: the corpus was indexed with OpenAI vectors, but queries on the fallback path use local model vectors. The two vector spaces are not compatible — cosine similarity between them is meaningless. The fallback produces wrong results, not an error. Document this trade-off: failing fast with an error response is often preferable to silently returning incorrect results.
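
For the fail-fast option, the tool can return an MCP result flagged with isError: true instead of a vector from a different model. A minimal sketch follows; the function name embeddingUnavailableResult and the fields inside the JSON message are hypothetical.

```javascript
// Build an MCP tool result that reports failure explicitly. `isError: true`
// tells the client the call failed; the message fields here are illustrative.
function embeddingUnavailableResult(err) {
  return {
    isError: true,
    content: [{
      type: "text",
      text: JSON.stringify({
        error: "embedding_api_unavailable",
        detail: err.message,
        retryable: true,
      }),
    }],
  };
}

console.log(embeddingUnavailableResult(new Error("connect ETIMEDOUT")));
```

An agent that receives this result knows the query was not answered, which is easier to handle than a ranked list built from incompatible vectors.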

Frequently asked questions

Should I expose embedding generation as an MCP tool or keep it internal to the server?

Keep it internal unless you have a specific reason to expose it. If the only use case is "generate embeddings and store them," the agent should call a higher-level tool like index_document — the MCP server handles embedding internally. Expose generate_embedding as a tool when agents need embeddings for purposes outside your vector store: storing them in a separate system, computing similarity between two texts, or generating embeddings as inputs to other ML models. If your agents are already well-served by search_documents, adding generate_embedding to the tool surface increases the agent's context window usage without benefit.

How much does the embedding cache reduce costs in practice?

Cache hit rate depends on query repetition patterns. For a customer support bot that handles common questions repeatedly, cache hit rates of 60–80% are typical — most users ask the same things in similar ways. For a research assistant handling unique queries, cache hit rates fall to 5–20%. Measure your actual hit rate by logging cache_hits / total_inputs from the tool response and aggregating over a week. The cost calculation: at 70% cache hit rate, 100,000 queries of average 50 tokens each — total tokens without cache = 5M ($0.10); with cache = 1.5M ($0.03). The $0.07 saving per 100K queries is small, but the latency saving is more significant: cache hits take ~1ms vs 50–100ms for API calls, reducing P50 tool call latency noticeably.
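
The cost figures above follow from a one-line formula. The sketch below reproduces them, assuming every cache hit avoids billing entirely and using the $0.02 per 1M tokens price of text-embedding-3-small; embeddingCostUsd is a hypothetical helper.

```javascript
// Billed cost in USD when a fraction `cacheHitRate` of queries never reach the API.
function embeddingCostUsd(queries, avgTokens, cacheHitRate, pricePerMillion = 0.02) {
  const billedTokens = queries * avgTokens * (1 - cacheHitRate);
  return (billedTokens / 1_000_000) * pricePerMillion;
}

const withoutCache = embeddingCostUsd(100_000, 50, 0);
const withCache = embeddingCostUsd(100_000, 50, 0.7);
console.log(withoutCache.toFixed(2), withCache.toFixed(2)); // prints "0.10 0.03"
```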

What happens if I hit OpenAI's rate limit during a large indexing job?

When you exceed a rate limit, OpenAI returns HTTP 429; the response may include a Retry-After header indicating how long to wait. The official Node.js SDK retries these responses automatically with exponential backoff (two retries by default; raise the maxRetries option for long jobs). For large indexing jobs, prefer the OpenAI Batch API, which is 50% cheaper, allows up to 50,000 requests per batch, and completes within 24 hours; it draws on a separate rate limit pool, so synchronous rate limits don't apply. Use the synchronous embeddings API only for real-time requests where a user is waiting. The distinction: document indexing (asynchronous, batch) vs query embedding (synchronous, real-time).
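
If you handle retries yourself rather than relying on the SDK, the delay calculation might look like the sketch below. It is an illustration only, not the SDK's internal algorithm; retryDelayMs, the 500 ms base, and the 8 s cap are assumed values, and jitter is omitted for brevity.

```javascript
// Delay before retry number `attempt` (0-based). Prefers the server's
// Retry-After value (in seconds) when present; otherwise uses exponential
// backoff with a cap.
function retryDelayMs(attempt, retryAfterSeconds, baseMs = 500, capMs = 8000) {
  if (Number.isFinite(retryAfterSeconds) && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000;
  }
  return Math.min(capMs, baseMs * 2 ** attempt);
}

console.log(retryDelayMs(0));    // prints 500
console.log(retryDelayMs(3));    // prints 4000
console.log(retryDelayMs(1, 2)); // prints 2000
```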

Can I use different embedding models for indexing vs querying?

No. The entire corpus must be embedded with the same model, and every query must be embedded with the same model. The vector space is model-specific — similarity scores between vectors from different models are not meaningful. If you want to test a new model, embed the entire corpus with the new model into a separate collection, run evaluation queries against both collections to compare retrieval quality, and then switch the production server to the new model after re-indexing completes. This is the main operational cost of switching embedding models at scale.
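
Comparing the two collections needs a retrieval metric. Recall at k is one common choice; the sketch below computes it for a single query. The document IDs and result lists are made-up examples, and recallAtK is a hypothetical helper.

```javascript
// Fraction of known-relevant documents that appear in the top k results.
function recallAtK(retrievedIds, relevantIds, k) {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = relevantIds.filter(id => topK.has(id)).length;
  return hits / relevantIds.length;
}

// Hypothetical results for one evaluation query against each collection.
const relevant = ["d1", "d4"];
const oldModelResults = ["d1", "d2", "d3", "d4"];
const newModelResults = ["d4", "d1", "d9", "d2"];

console.log(recallAtK(oldModelResults, relevant, 2)); // prints 0.5
console.log(recallAtK(newModelResults, relevant, 2)); // prints 1
```

Averaging this value over a set of labeled queries gives one number per collection to compare before switching production traffic.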

How does AliveMCP detect an embedding API outage vs a process crash?

AliveMCP's baseline protocol probe detects process crashes within 60 seconds: the MCP server stops responding to initialize handshakes, and AliveMCP fires failure_reason: connection_refused. An embedding API outage with the process still running is invisible to the protocol probe — the server responds normally until an actual tool call fails. Configure your MCP server's /ready endpoint to call the embedding API with a test input and return 503 on failure. Point AliveMCP's custom health check URL at /ready. When the embedding API goes down, AliveMCP sends an alert with failure_reason: external_api_failure (from your 503 response), distinct from the process crash failure reason. Your runbook can branch on this: process crash → restart; API outage → activate local fallback and check provider status page.
