Vector Search MCP Tools — pgvector, Chroma, and Pinecone integration

Every vector store has a different failure surface. pgvector saturates connection pools under concurrent MCP tool calls. Chroma's EphemeralClient loses the index on process restart. Pinecone's managed API adds external network latency that can break tool call P95 budgets. This guide walks through integrating each vector backend into an MCP search_documents tool: the specific gotchas of each store, index type selection, latency budgets across the full tool call stack, and how AliveMCP monitors the retrieval path that your process health check cannot see.

TL;DR

Choose pgvector (HNSW index) if you already run PostgreSQL — same infra, low operational overhead, connection pool sizing is the critical tuning knob. Choose Chroma's PersistentClient if you want embedded storage without a separate server. Choose Pinecone if you need managed horizontal scaling and can accept ~80ms extra latency per query for the network round-trip to their API. Monitor all three with AliveMCP: configure a custom /health endpoint that runs a real vector query against a canary embedding and returns 503 if latency exceeds your SLA or results are empty — AliveMCP's protocol probe alone cannot detect a saturated connection pool returning empty results.

Vector store deployment models

Store	Deployment	Latency (query)	Scaling	Best for
pgvector (HNSW)	In PostgreSQL	5–30ms (warm)	Vertical; read replicas	Existing PG infra; <10M vectors
Chroma (PersistentClient)	Embedded / local server	5–50ms (warm)	Single node	Dev, small corpora, no infra budget
Qdrant	Self-hosted / cloud	5–20ms (warm)	Horizontal sharding	High-throughput; custom filtering
Pinecone	Managed SaaS	50–150ms (API)	Fully managed	No infra ops; >10M vectors
SQLite-vec	Embedded file	1–10ms (small)	None (file-based)	Tiny corpora; serverless edge

pgvector — HNSW vs IVFFlat, connection pooling, and the MCP concurrency problem

pgvector exposes two index types. IVFFlat (Inverted File Flat) partitions vectors into lists and searches only the lists nearest the query. It is fast to build and uses less memory, but recall depends on ivfflat.probes at query time: the default of 1 searches a single list, and raising it improves recall at the cost of latency. HNSW (Hierarchical Navigable Small World) builds a multi-layer graph for navigation, offers better recall at the same latency, and needs no probes tuning; its query-time knob is hnsw.ef_search. For MCP tools where query latency matters more than index build time, prefer HNSW.

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Document chunks table with HNSW index
CREATE TABLE document_chunks (
  id          BIGSERIAL PRIMARY KEY,
  source_id   TEXT NOT NULL,
  chunk_index INTEGER NOT NULL,
  content     TEXT NOT NULL,
  embedding   vector(1536),       -- match your embedding model's dimensions
  created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index on cosine distance
-- m: max connections per node (higher = better recall, more memory)
-- ef_construction: search width during build (higher = better recall, slower build)
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

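-- Query-time recall vs latency tradeoff (pgvector's default is 40; higher = better recall, slower queries)
-- SET applies to the current session only, so with a connection pool run it on every connection
SET hnsw.ef_search = 64;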

The MCP concurrency problem: an MCP server handling 10 simultaneous agent sessions issues 10 concurrent vector queries to PostgreSQL. Each query acquires a connection from the pool. If the pool is capped at 5 connections (node-postgres defaults to max: 10), 5 queries wait in the queue, causing P99 latency spikes that look like a slow server but are actually a pool exhaustion event.
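
The queueing effect described above can be illustrated with a small deterministic model. This is a sketch rather than a benchmark: it assumes every query takes the same time and that waiting queries are served first-come, first-served. The helper name completion_times_ms is hypothetical.

```python
def completion_times_ms(concurrent_queries: int, pool_size: int, query_ms: float) -> list:
    """Model FIFO queueing on a fixed-size pool where every query takes query_ms.

    Queries run in "waves" of pool_size; query i finishes at the end of its wave.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return [(i // pool_size + 1) * query_ms for i in range(concurrent_queries)]


# 10 simultaneous agent sessions, 20 ms per vector query (illustrative numbers)
undersized = completion_times_ms(10, pool_size=5, query_ms=20)
right_sized = completion_times_ms(10, pool_size=10, query_ms=20)

print(undersized)        # first five finish at 20 ms, the queued five at 40 ms
print(max(right_sized))  # every caller gets a connection immediately: 20 ms
```

In this model the queued half of the callers sees double the latency even though no individual query became slower, which is why pool exhaustion shows up in P99 before it shows up in errors.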

// Node.js: size the pool to match expected MCP concurrency
import express from 'express';
import { Pool } from 'pg';

const app = express();

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,             // max connections; server-side rule of thumb: (DB CPU cores × 2) + effective_spindle_count
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,  // fail fast rather than queue indefinitely
});

// Health check endpoint: verify pool has idle connections available
app.get('/ready', async (req, res) => {
  if (pool.idleCount === 0 && pool.waitingCount > 0) {
    return res.status(503).json({
      status: 'degraded',
      reason: 'connection_pool_exhausted',
      idle: pool.idleCount,
      waiting: pool.waitingCount,
    });
  }
  res.json({ status: 'ok', idle: pool.idleCount });
});

async function vectorSearch(queryEmbedding, topK = 5, sourceFilter = null) {
  const client = await pool.connect();
  try {
    const params = [JSON.stringify(queryEmbedding), topK];
    const whereClause = sourceFilter ? 'AND source_id = $3' : '';
    if (sourceFilter) params.push(sourceFilter);

    const result = await client.query(`
      SELECT id, source_id, chunk_index, content,
             1 - (embedding <=> $1::vector) AS similarity
      FROM document_chunks
      WHERE 1=1 ${whereClause}
      ORDER BY embedding <=> $1::vector
      LIMIT $2
    `, params);

    return result.rows;
  } finally {
    client.release();
  }
}

The pool.idleCount === 0 && pool.waitingCount > 0 check in the readiness endpoint is the canary for pool exhaustion. AliveMCP polling your /ready endpoint catches this condition before it becomes a full outage. Configure your AliveMCP Author plan webhook to send an alert when /ready returns 503. This gives you time to scale the connection pool before queueing delay pushes tool calls past the agent's latency budget.

Chroma — PersistentClient vs EphemeralClient, query API, and restart behavior

Chroma's two client modes have critically different behavior for MCP servers. EphemeralClient stores the vector index in memory, so when the MCP server process restarts, all indexed documents are lost. PersistentClient persists the index to disk as a SQLite database plus hnswlib index files. Always use PersistentClient in production. The directory you provide must be on a persistent volume, not a tmpfs or container ephemeral layer.

import asyncio
from typing import Optional

import chromadb
from chromadb.config import Settings

# PersistentClient: survives process restarts
client = chromadb.PersistentClient(
    path="/data/chroma",           # must be on a persistent volume
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=False,         # keep client.reset() disabled so the database cannot be wiped by accident
    )
)

collection = client.get_or_create_collection(
    name="mcp_documents",
    metadata={"hnsw:space": "cosine"}  # or "l2" or "ip"
)

async def chroma_search(query_text: str, top_k: int = 5, source_filter: Optional[str] = None):
    where_filter = {"source": source_filter} if source_filter else None

    # collection.query is synchronous; run it in a worker thread so it does
    # not block the event loop while other tool calls are in flight
    results = await asyncio.to_thread(
        collection.query,
        query_texts=[query_text],         # embedded by the collection's embedding function
        n_results=top_k,
        include=["documents", "distances", "metadatas"],
        where=where_filter,
    )

    chunks = []
    for doc, dist, meta in zip(
        results["documents"][0],
        results["distances"][0],
        results["metadatas"][0],
    ):
        # Chroma cosine distance is 1 - cosine similarity: 0 = identical, 2 = opposite
        # Rescale to a 0..1 score: 1 - (distance / 2); plain cosine similarity would be 1 - distance
        similarity = 1.0 - (dist / 2.0)
        chunks.append({
            "text": doc,
            "source": meta.get("source", "unknown"),
            "similarity": round(similarity, 4),
        })

    return chunks

Chroma uses HNSW internally via the hnswlib library. The hnsw:space metadata field set at collection creation determines the distance function, and it cannot be changed afterwards without rebuilding the collection. Use cosine for text embeddings from transformer models; use l2 when the embedding model's documentation specifies Euclidean distance. For unit-normalized embeddings, cosine, l2, and ip all produce the same ranking.
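
The relationship between the metrics can be checked with a few lines of plain Python. This is a minimal sketch using made-up three-dimensional vectors, not Chroma's internals: for unit-length vectors, squared Euclidean distance equals twice the cosine distance, so both metrics rank neighbors identically.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Illustrative vectors only; real embeddings have hundreds of dimensions
query = normalize([0.9, 0.1, 0.3])
docs = {
    "a": normalize([1.0, 0.0, 0.2]),
    "b": normalize([0.1, 1.0, 0.0]),
    "c": normalize([0.5, 0.5, 0.5]),
}

by_cosine = sorted(docs, key=lambda k: cosine_distance(query, docs[k]))
by_l2 = sorted(docs, key=lambda k: squared_l2(query, docs[k]))
print(by_cosine, by_l2)  # same order for unit-length vectors
```

If your embeddings are not normalized, the two orderings can diverge, which is when the choice of hnsw:space actually matters.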

For the Chroma HTTP server mode (running Chroma as a separate service instead of embedded), use chromadb.HttpClient(host="localhost", port=8000). This adds an internal network hop but allows the Chroma server to be scaled independently and monitored separately. AliveMCP can probe the Chroma HTTP server's heartbeat endpoint directly (/api/v1/heartbeat; newer Chroma releases serve it under /api/v2) if you add it as a secondary monitor.

Pinecone — serverless vs pod-based, upsert batching, and latency characterization

Pinecone offers two architectures. Serverless indexes auto-scale and bill per query — no provisioning, but latency is variable (50–300ms depending on index warmth). Pod-based indexes use dedicated compute, offer predictable latency (~30–80ms), and charge per pod per hour. For MCP tools with a 500ms total latency budget, serverless Pinecone with a large index can consume the entire budget just on the vector search step.

import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pc.index(process.env.PINECONE_INDEX_NAME);

async function pineconeSearch(queryEmbedding, topK = 5, namespace = 'default') {
  const start = Date.now();

  const response = await index.namespace(namespace).query({
    vector: queryEmbedding,
    topK,
    includeMetadata: true,
    includeValues: false,
  });

  const latencyMs = Date.now() - start;

  // Log for P95 tracking — emit to your metrics system
  console.log(JSON.stringify({
    event: 'pinecone_query',
    latency_ms: latencyMs,
    matches: response.matches.length,
    namespace,
  }));

  if (latencyMs > 400) {
    console.warn(`Pinecone query slow: ${latencyMs}ms — may exceed tool call latency budget`);
  }

  return response.matches.map(m => ({
    text: m.metadata?.text,       // metadata is undefined if the record was upserted without it
    source: m.metadata?.source,
    score: m.score,
  }));
}

// Upsert in batches of 100 (Pinecone hard limit is 2MB per request)
async function upsertBatch(vectors, namespace = 'default') {
  const BATCH_SIZE = 100;
  for (let i = 0; i < vectors.length; i += BATCH_SIZE) {
    const batch = vectors.slice(i, i + BATCH_SIZE);
    await index.namespace(namespace).upsert(batch);
    // Small delay to avoid rate limiting
    if (i + BATCH_SIZE < vectors.length) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  }
}
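
A fixed batch of 100 records can still exceed the 2MB request limit when metadata is large. The sketch below, written in Python for brevity, shows one way to cap batches by estimated payload size as well as record count. The helper batch_by_size is hypothetical, and the JSON-length estimate is an assumption that ignores request framing overhead, so leave some headroom.

```python
import json

MAX_REQUEST_BYTES = 2_000_000  # conservative reading of the 2MB request limit
MAX_BATCH_RECORDS = 100        # batch size used in the upsert loop above


def batch_by_size(records, max_bytes=MAX_REQUEST_BYTES, max_records=MAX_BATCH_RECORDS):
    """Yield lists of records whose estimated JSON size stays under max_bytes."""
    batch, batch_bytes = [], 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"record {record.get('id')} alone exceeds {max_bytes} bytes")
        # Flush the current batch before it would break either limit
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_records):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch


# Illustrative records: 1536-dim vectors with about 20 KB of text metadata each
records = [
    {"id": f"chunk-{i}", "values": [0.1] * 1536, "metadata": {"text": "x" * 20000}}
    for i in range(250)
]
batches = list(batch_by_size(records))
print([len(b) for b in batches])  # each batch holds fewer than 100 records here
```

With metadata this large the size cap, not the record cap, decides where each batch ends.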

Pinecone namespaces partition vectors within a single index. Use namespaces for multi-tenancy in an MCP server: each client gets their own namespace, queries are isolated, and you pay for one index instead of N. The namespace is determined at query time from the agent's authentication context — include it in the tool's internal routing, not in the tool's public input schema.
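
The routing rule above can be sketched as follows. AuthContext, resolve_namespace, and build_query are hypothetical names, and the tenant-id format check is an assumption; the point is only that the namespace comes from the authenticated session and any namespace field supplied in the tool arguments is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthContext:
    """Hypothetical auth context produced by the server's authentication layer."""
    tenant_id: str


def resolve_namespace(auth: AuthContext) -> str:
    """Map an authenticated tenant to its namespace."""
    # Assumption: tenant ids are alphanumeric with optional hyphens
    if not auth.tenant_id or not auth.tenant_id.replace("-", "").isalnum():
        raise PermissionError("invalid tenant id")
    return f"tenant-{auth.tenant_id}"


def build_query(auth: AuthContext, tool_args: dict) -> dict:
    """Build vector query parameters; only public fields are read from tool_args."""
    return {
        "namespace": resolve_namespace(auth),  # never taken from tool_args
        "query": tool_args["query"],
        "top_k": int(tool_args.get("top_k", 5)),
    }


# A caller that tries to smuggle in another tenant's namespace is ignored
params = build_query(
    AuthContext(tenant_id="acme"),
    {"query": "reset password", "namespace": "tenant-other"},
)
print(params["namespace"])  # tenant-acme
```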

Latency budget across the full MCP tool call stack

An MCP agent calling search_documents has a total patience budget — typically 2,000–5,000ms before it times out the tool call. That budget is consumed by every step in the retrieval chain.

Step	Typical latency	Worst case	What can go wrong
MCP transport overhead (SSE)	5–15ms	50ms	High server load
Query embedding (OpenAI API)	50–100ms	500ms (rate limit)	API rate limiting, cold start
Vector search (pgvector HNSW)	10–30ms	200ms (pool exhaustion)	Connection pool saturation
Vector search (Pinecone serverless)	80–200ms	500ms (cold)	Index not warm, serverless cold start
Cross-encoder reranking (CPU)	100–300ms (20 candidates)	1000ms	OOM, model not loaded
Context assembly + serialization	2–5ms	20ms	Very large result sets

The P95 tool call latency target is 800ms for documentation retrieval (users tolerate it as a background operation) and 300ms for real-time assistant responses (latency visible to users). With Pinecone, hitting the 300ms target requires skipping cross-encoder reranking or using a very small candidate set. With pgvector HNSW and a warm connection pool, 300ms is achievable including cross-encoder reranking of 10 candidates.
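
The arithmetic behind those claims can be reproduced from the table's typical ranges. This is a rough sketch: summing typical ranges is not the same as a measured P95, and the 10-candidate reranking figure assumes cost scales linearly with candidate count (half of the 20-candidate row).

```python
# (low, high) typical latencies in ms, copied from the table above
STEPS = {
    "transport": (5, 15),
    "embedding": (50, 100),
    "pgvector": (10, 30),
    "pinecone_serverless": (80, 200),
    "rerank_20": (100, 300),
    "assembly": (2, 5),
}
# Assumption: reranking 10 candidates costs half of reranking 20
STEPS["rerank_10"] = tuple(v // 2 for v in STEPS["rerank_20"])


def budget(*names):
    """Sum the low and high ends of the typical range for the given steps."""
    low = sum(STEPS[n][0] for n in names)
    high = sum(STEPS[n][1] for n in names)
    return low, high


pg_rerank = budget("transport", "embedding", "pgvector", "rerank_10", "assembly")
pinecone_plain = budget("transport", "embedding", "pinecone_serverless", "assembly")
pinecone_rerank = budget("transport", "embedding", "pinecone_serverless", "rerank_20", "assembly")

print(pg_rerank)        # (117, 300): fits a 300 ms target, with no headroom at the top
print(pinecone_plain)   # (137, 320): near 300 ms only when reranking is skipped
print(pinecone_rerank)  # (237, 620): fits the 800 ms target, not the 300 ms one
```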

AliveMCP tracks your server's P95 response time at the protocol level. A spike in P95 that correlates with increased traffic usually indicates connection pool exhaustion in pgvector or Pinecone API rate limiting — both produce slow responses before they produce errors.

Index warmth and cold-start latency

HNSW indexes (in pgvector and Chroma) must be loaded into memory before they serve queries at full speed. On a fresh process start, the first query triggers a cold read from disk that can take 1–10 seconds for large indexes. Subsequent queries run from the OS page cache — orders of magnitude faster.

Warm the index on startup before marking the MCP server ready:

async function warmVectorIndex() {
  console.log('Warming vector index...');
  const start = Date.now();

  // Send a throwaway query so PostgreSQL reads index pages into memory
  // Use a non-zero vector: cosine distance is undefined for the zero vector
  const warmupEmbedding = new Array(1536).fill(1 / Math.sqrt(1536));
  await vectorSearch(warmupEmbedding, 1);

  console.log(`Vector index warm in ${Date.now() - start}ms`);
}

async function startServer() {
  await warmVectorIndex();   // wait for warmup before accepting traffic
  server.listen(3000);
}

The /ready endpoint should return 503 until warmup completes. If you deploy with Kubernetes, wire /ready to the readiness probe and combine it with AliveMCP's external probe, which sees the failing endpoint and alerts. Together they catch the case where the pod is running but the index never finishes loading: an out-of-memory failure during index load can leave a process that passes liveness but fails readiness.
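
For /ready to answer 503 during warmup, the server has to be listening before warmup finishes. A minimal sketch of that gating logic, in Python with asyncio, is shown below. The Readiness class and warm_index coroutine are hypothetical stand-ins; a real handler would return the status and body through your HTTP framework.

```python
import asyncio


class Readiness:
    """Tracks whether startup warmup has finished."""

    def __init__(self) -> None:
        self._warm = False

    def mark_ready(self) -> None:
        self._warm = True

    def response(self):
        """Return (http_status, body) for the /ready handler."""
        if self._warm:
            return 200, {"status": "ok"}
        return 503, {"status": "warming"}


async def warm_index(readiness: Readiness) -> None:
    await asyncio.sleep(0.01)  # stand-in for the real warmup query
    readiness.mark_ready()


async def main():
    readiness = Readiness()
    # Start warmup in the background; the server would begin listening here
    warmup = asyncio.create_task(warm_index(readiness))
    during = readiness.response()  # what /ready returns while warming
    await warmup
    after = readiness.response()   # what /ready returns once warm
    return during, after


during, after = asyncio.run(main())
print(during, after)
```

The trade-off is that the process accepts connections before it can serve queries, so tool handlers should check the same flag and fail fast while it is false.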

Frequently asked questions

Should I use pgvector or a dedicated vector database for an MCP server?

If your MCP server already uses PostgreSQL for other data (user records, document metadata, audit logs), add pgvector to the same database. You avoid a second database to operate, monitor, and back up. pgvector with HNSW handles millions of 1536-dimension vectors with query latencies under 30ms on a reasonable server. Only move to a dedicated vector database (Qdrant, Weaviate, Milvus) when you exceed 10–20 million vectors, need multi-tenant vector isolation at the storage level, or require sub-10ms latency at high concurrency that pgvector's connection pool can't achieve without a very large PG instance.

How do I prevent Chroma from losing my index when the MCP server restarts?

Always use PersistentClient with a path on a durable volume — not a path under /tmp or the container's writable layer. If you deploy on Docker, mount a named volume to the Chroma data path: -v chroma_data:/data/chroma. If you deploy on Kubernetes, use a PersistentVolumeClaim. Verify persistence by restarting the container and checking that collection.count() returns the expected number of documents — not zero. AliveMCP's canary query health check (a known query that must return at least 1 result) will alert you within 60 seconds if the index was not restored after a restart.

What happens when the Pinecone API is down and my MCP tool is called?

Without a circuit breaker, each tool call waits up to the configured timeout (default 20s in the Pinecone client) before failing. With 10 concurrent tool calls, you have 10 threads/coroutines blocked for 20 seconds each, and the MCP server becomes unresponsive to all tool calls, not just the retrieval ones. Add a circuit breaker: after 3 consecutive Pinecone failures, open the circuit and return a structured error immediately: {"error": "retrieval_unavailable", "reason": "vector_store_circuit_open", "retry_after_seconds": 30}. After 30 seconds the circuit moves to half-open and probes Pinecone with a single query; it closes and resumes normal routing only if that probe succeeds. AliveMCP will detect the MCP server's degraded state via increased P95 latency before the circuit opens.
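
A minimal sketch of such a breaker follows, in Python and single-threaded for clarity. The CircuitBreaker class is illustrative rather than part of any Pinecone client; the clock is injectable so the behavior can be exercised without waiting 30 seconds, and a concurrent server would also need to make sure only one caller sends the half-open probe.

```python
import math
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows one probe after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            elapsed = self._clock() - self._opened_at
            if elapsed < self.reset_after_s:
                # Open: fail immediately without touching the vector store
                return {
                    "error": "retrieval_unavailable",
                    "reason": "vector_store_circuit_open",
                    "retry_after_seconds": math.ceil(self.reset_after_s - elapsed),
                }
            # Half-open: cooldown elapsed, let this call through as the probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def failing_query():
    raise TimeoutError("vector store unreachable")


now = [0.0]  # fake clock so the demo does not sleep
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    try:
        breaker.call(failing_query)
    except TimeoutError:
        pass

rejected = breaker.call(failing_query)  # circuit is open: structured error, no call made
now[0] = 31.0                           # cooldown has elapsed
recovered = breaker.call(lambda: ["match"])  # probe succeeds and closes the circuit
print(rejected["reason"], recovered)
```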

How many vectors per collection is too many for Chroma?

Chroma using hnswlib handles up to approximately 1 million 1536-dimension vectors before query latency degrades noticeably on a single node with 8GB RAM. The HNSW index for 1M × 1536-dim vectors requires approximately 6–8GB of memory when loaded. Above 1M vectors, either shard across multiple Chroma collections (with routing logic in your MCP tool) or migrate to Qdrant or Weaviate which support distributed sharding. A more practical limit for a single Chroma MCP server on a 4GB instance is around 250K vectors — beyond that, index load time on restart and memory pressure during concurrent queries start affecting tool call latency.
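
The memory figures can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes 4-byte floats and, as a rough approximation of HNSW graph overhead, about 2 × M four-byte neighbor ids per vector with M = 16. Real usage is higher once metadata and allocator overhead are included.

```python
def hnsw_memory_bytes(num_vectors: int, dims: int, m: int = 16, bytes_per_float: int = 4) -> dict:
    """Rough lower bound on memory for an in-memory HNSW index."""
    raw = num_vectors * dims * bytes_per_float
    # Assumption: about 2*m neighbor ids (4 bytes each) per vector in the base layer
    links = num_vectors * 2 * m * 4
    return {"raw_vectors": raw, "graph_links": links, "total": raw + links}


one_million = hnsw_memory_bytes(1_000_000, 1536)
quarter_million = hnsw_memory_bytes(250_000, 1536)

print(one_million["total"] / 1e9)      # about 6.3 GB before any other overhead
print(quarter_million["total"] / 1e9)  # about 1.6 GB, leaving headroom on a 4GB instance
```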

How does AliveMCP detect vector store connection pool saturation?

AliveMCP's protocol probe measures the end-to-end latency of an MCP initialize handshake. If the server is slow because connections are queued behind a saturated pool, this shows up as increased protocol response time even before error responses appear. Configure your MCP server's /ready endpoint to return 503 when pool.idleCount === 0 and pool.waitingCount > 0 (for pgvector) or when a test query exceeds 200ms (for any backend). AliveMCP's Author and Team plans let you point the health check at a custom URL. Point it at /ready to get an alert the moment the pool saturates, before users experience slow or failed retrieval.
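
The test-query variant of that check can be sketched as a backend-agnostic function. The name canary_health and the response body shape are assumptions; search_fn stands for whichever search function your server uses, taking an embedding and a result count. The function returns a status code and body for your health endpoint to send.

```python
import time


def canary_health(search_fn, canary_embedding, sla_ms=200.0, clock=time.perf_counter):
    """Run one real query and map the outcome to (http_status, body)."""
    start = clock()
    try:
        results = search_fn(canary_embedding, 1)
    except Exception as exc:
        return 503, {"status": "degraded", "reason": "query_failed", "detail": str(exc)}
    latency_ms = (clock() - start) * 1000.0
    if not results:
        # An empty result for a known-good canary means the index is missing or broken
        return 503, {"status": "degraded", "reason": "empty_results"}
    if latency_ms > sla_ms:
        return 503, {
            "status": "degraded",
            "reason": "latency_sla_exceeded",
            "latency_ms": round(latency_ms, 1),
        }
    return 200, {"status": "ok", "latency_ms": round(latency_ms, 1)}


# Stub search functions standing in for a real backend
canary = [0.1] * 8
healthy = canary_health(lambda emb, k: [{"id": 1}], canary)
empty = canary_health(lambda emb, k: [], canary)

ticks = iter([0.0, 0.5])  # fake clock: the query appears to take 500 ms
slow = canary_health(lambda emb, k: [{"id": 1}], canary, clock=lambda: next(ticks))

print(healthy[0], empty[0], slow[0])
```

Use an embedding of a document you know is in the corpus as the canary, so that an empty result always indicates a fault rather than a legitimate miss.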