Data Infrastructure · 2026-07-04 · Data Infrastructure arc
Building MCP Tools for Data Infrastructure: The Three Contracts That Apply to PostgreSQL, MySQL, S3, Kafka, and DSPy
When you build your first MCP tool that touches data infrastructure — a PostgreSQL query tool, an S3 upload tool, a Kafka publisher — three problems appear in sequence. You hit a SQL injection vector and add parameterized queries. You notice every tool call is reconnecting to the database and you add a connection pool. You deploy a health check that confirms the MCP server process is running and later discover your database pool was exhausted for 20 minutes without anyone noticing. By the second integration you recognize the same three problems in a different uniform. This synthesis covers five integrations — PostgreSQL, MySQL, AWS S3, Kafka, and DSPy — through the three contracts they all share, so you recognize them immediately the next time they appear.
TL;DR
Five different data infrastructure backends, three shared contracts. (1) The parameter safety contract: every data infrastructure integration has a distinct injection vector — SQL injection via string-interpolated queries (PostgreSQL: $1/$2 placeholders; MySQL: ? placeholders), path traversal via S3 key names (reject ../, leading /, null bytes), topic injection via Kafka topic names (hard-coded allow-list — an unrestricted producer can corrupt internal broker metadata topics), and prompt injection via DSPy inputs (Zod schemas at the tool boundary before any LLM call). The common failure: letting caller-provided strings reach the infrastructure API unsanitized. Each integration has exactly one failure path and one fix. (2) The singleton resource contract: reconnecting on every tool call is the most common performance bug in data infrastructure MCP servers. The fix is always a singleton: Pool instance for PostgreSQL and MySQL (never a bare Client), singleton S3Client for AWS (credential resolution overhead on every call otherwise), singleton connected Producer + Admin for KafkaJS (reconnecting triggers broker rebalances that affect other consumers), and a module-level dspy.configure(lm=...) for DSPy. (3) The health transparency contract: each integration requires a resource-specific /health endpoint that surfaces the integration-specific failure mode — not just whether the process is running. pool.waitingCount for PostgreSQL (pool exhaustion is invisible to HTTP process checks), HeadBucketCommand for S3 (catches IAM credential expiry — a 403 means the credentials are expired, not a network failure), admin.describeCluster() for Kafka (catches broker reachability changes), and a canary inference call for DSPy (LLM backend outages look identical to server crashes from the calling agent's perspective). Where the integrations diverge: error taxonomy (PostgreSQL string codes like 23505, MySQL errno integers like 1062, AWS SDK named error classes like NoSuchKey, KafkaJS typed errors, DSPy Python exceptions surfaced via HTTP status codes), transaction model (PostgreSQL/MySQL have ACID transactions with pinned connections; S3 uses idempotent PutObject with conditional writes; Kafka has producer-side idempotence but no cross-topic transactions; DSPy has no transaction concept), and monitoring blind spots (pool exhaustion, IAM expiry, broker rebalancing, LLM API outages — each requires a different probe). Wire AliveMCP to your resource-specific health endpoints — a process health check alone misses every failure mode in this list.
The five integrations at a glance
Before diving into each contract, here's where the five data infrastructure backends stand on the dimensions that matter most for MCP tool design:
| Integration | Injection vector | Singleton resource | Health probe | Transaction model |
|---|---|---|---|---|
| PostgreSQL | SQL injection via string interpolation | pg.Pool with max: 10 | pool.waitingCount === 0 | ACID via pool.connect() pinned client |
| MySQL | SQL injection via string interpolation | mysql2.createPool() | Internal pool stats via _freeConnections | ACID via pool.getConnection() |
| AWS S3 | Path traversal via key names | Singleton S3Client | HeadBucketCommand — catches IAM expiry | None — use idempotent PutObject |
| Apache Kafka | Topic injection via topic names | Connected Producer + Admin | admin.describeCluster() | Producer-side idempotence only |
| DSPy | Prompt injection via unvalidated inputs | dspy.configure(lm=...) at module level | Canary inference endpoint on FastAPI service | None |
The three contracts all stem from the same root cause: MCP tools are synchronous request-response units that are called concurrently and with potentially untrusted arguments from an LLM caller, while data infrastructure systems were designed for long-lived connections, bulk operations, and trusted administrative clients. Every friction point is a version of that mismatch.
Contract 1 — The parameter safety contract
When you call a SaaS API from an MCP tool, the arguments you pass are bounded by your own type definitions — the LLM fills in a structured object, your Zod schema validates it, and the API call is parameterized by the SDK. When you call data infrastructure, the path from tool argument to infrastructure API is often shorter and more dangerous: you concatenate a SQL fragment, you join a file path, you accept a topic name. The injection vector is always a string that reaches the infrastructure API without going through the infrastructure's own parameter handling.
PostgreSQL: $1/$2 parameterized queries
SQL injection is the canonical example. The calling agent may pass values derived from untrusted sources — scraped content, user input, values read from other tools. If you interpolate those values into a SQL string, the database executes whatever SQL the attacker embedded. PostgreSQL's parameterized query syntax breaks the injection path at the driver level: $1, $2, etc. are not string-substituted — they are sent as separate protocol values and the database engine never interprets them as SQL.
// WRONG — interpolated string reaches the SQL engine
const result = await pool.query(`SELECT * FROM users WHERE email = '${email}'`);
// RIGHT — $1 is a typed protocol parameter, not a string substitution
const result = await pool.query('SELECT * FROM users WHERE email = $1', [email]);
The pattern gets harder for dynamic WHERE clauses — when the tool builds a filter from multiple optional arguments. The correct approach tracks conditions and parameters as parallel arrays, incrementing the placeholder index after each push:
const conditions: string[] = [];
const params: unknown[] = [];
if (filters.email) {
conditions.push(`email = $${params.length + 1}`);
params.push(filters.email);
}
if (filters.status) {
conditions.push(`status = $${params.length + 1}`);
params.push(filters.status);
}
const where = conditions.length > 0 ? `WHERE ${conditions.join(' AND ')}` : '';
const result = await pool.query(`SELECT * FROM users ${where} LIMIT 50`, params);
Every value the caller provides goes through params, never into the SQL string. The placeholder index ($${params.length + 1}) is computed from the growing params array — this is the only correct way to build dynamic parameterized queries in PostgreSQL without using a query builder library.
MySQL: ? placeholders with the same discipline
MySQL uses ? as the parameter placeholder instead of PostgreSQL's positional $1 syntax, but the discipline is identical. Pass the values array as the second argument to pool.query(); never interpolate. The driver replaces each ? with the corresponding array element, properly escaped at the driver level:
// WRONG
const [rows] = await pool.query(`SELECT * FROM orders WHERE customer_id = ${id}`);
// RIGHT
const [rows] = await pool.query('SELECT * FROM orders WHERE customer_id = ?', [id]);
One MySQL-specific note: numeric values in LIMIT and OFFSET must also go through params — even though they look safe, allowing the caller to specify arbitrary LIMIT values opens a resource exhaustion vector. Pass LIMIT ? with a validated, capped integer.
AWS S3: key validation prevents path traversal
S3 doesn't have SQL, but it has a different injection class: path traversal via object key names. An agent that can specify arbitrary S3 keys can read ../../../config/secrets.json (on systems that map S3 keys to filesystem paths), can escape a key prefix scope, or can access objects in sibling "directories" by traversing up with dot-dot sequences. The fix is a validateKey() function that runs before any S3 operation:
function validateKey(key: string): void {
if (key.includes('../') || key.includes('..\\')) {
throw new McpError(ErrorCode.InvalidParams, 'Key contains path traversal sequence');
}
if (key.startsWith('/')) {
throw new McpError(ErrorCode.InvalidParams, 'Key must not start with /');
}
if (key.includes('\0')) {
throw new McpError(ErrorCode.InvalidParams, 'Key contains null byte');
}
if (key.length > 1024) {
throw new McpError(ErrorCode.InvalidParams, 'Key exceeds 1024 character limit');
}
}
// Call before every S3 operation
server.tool('get_object', { key: z.string() }, async ({ key }) => {
validateKey(key);
const response = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
// ...
});
Defense in depth: scope the IAM policy to a specific key prefix too — "Resource": "arn:aws:s3:::your-bucket/uploads/*". A bug in the tool's validation doesn't grant broader access because IAM rejects the call regardless. The bucket name should never come from the caller — hard-code it from an environment variable.
Kafka: topic allow-lists prevent internal topic access
Kafka's injection vector is the most dangerous: an unrestricted producer that accepts arbitrary topic names from callers can write to internal Kafka topics — __consumer_offsets, __transaction_state, compacted state topics — that corrupt broker metadata. The fix is a hard-coded allow-list checked before every publish:
const ALLOWED_TOPICS = new Set(['events.user-actions', 'events.orders', 'events.analytics']);
function validateTopic(topic: string): void {
if (!ALLOWED_TOPICS.has(topic)) {
throw new McpError(
ErrorCode.InvalidParams,
`Topic '${topic}' is not in the allowed list. Allowed: ${[...ALLOWED_TOPICS].join(', ')}`
);
}
}
server.tool('publish_message', { topic: z.string(), ... }, async ({ topic, ... }) => {
validateTopic(topic); // before any Kafka operation
// ...
});
The allow-list must be hard-coded, not configurable by the caller. An agent that can modify the allow-list to add arbitrary topics has the same attack surface as no allow-list at all.
DSPy: Zod schemas before LLM calls
DSPy's injection vector is prompt injection: tool arguments that reach a DSPy module without validation can carry embedded instructions that redirect the LLM's reasoning. The fix is the same Zod schema validation you apply to any MCP tool — but applied before passing values to the DSPy module:
const querySchema = z.object({
question: z.string().min(1).max(500),
context_source: z.enum(['docs', 'code', 'issues']),
});
server.tool('answer_question', querySchema, async ({ question, context_source }) => {
// question is validated — max 500 chars, cannot be empty
const result = await fetch(DSPY_SERVICE_URL, {
method: 'POST',
body: JSON.stringify({ question, context_source }), // validated values only
});
// ...
});
Length caps on string inputs matter particularly for LLM-backed tools — a 50,000-character "question" that embeds adversarial instructions is a different threat than a long SQL value.
Contract 2 — The singleton resource contract
The second contract is about connection lifecycle. Every data infrastructure system has a "connect" operation that is expensive relative to the "use" operation. Reconnecting on every MCP tool call adds latency, wastes resources, and — for systems with connection limits — causes failures under concurrent load. MCP tools are called concurrently: multiple tool calls arrive before previous ones complete. The singleton resource pattern ensures the expensive connection setup happens once at startup, and all concurrent tool calls share it.
PostgreSQL and MySQL: Pool, not Client
Both database drivers expose two connection objects. The temptation is to use the simpler one:
| Object | PostgreSQL | MySQL | Concurrent queries | Auto-reconnect | Use for MCP? |
|---|---|---|---|---|---|
| Single connection | pg.Client | Individual createConnection() | One at a time — queued | No | No |
| Pool | pg.Pool | mysql2.createPool() | Parallel across pool connections | Yes | Always |
A single Client handles one query at a time. Under concurrent MCP tool calls, queries queue behind each other — a slow query blocks all subsequent ones. A pool distributes concurrent queries across multiple connections, auto-reconnects failed connections, and manages idle connection lifetimes. Create the pool once at module scope and export it:
// db.ts — PostgreSQL
import { Pool } from 'pg';
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 10,
idleTimeoutMillis: 30_000,
connectionTimeoutMillis: 5_000,
ssl: { rejectUnauthorized: true },
});
pool.on('error', (err) => console.error('Pool error:', err.message));
export { pool };
// db.ts — MySQL
import mysql from 'mysql2/promise';
const pool = mysql.createPool({
uri: process.env.DATABASE_URL,
connectionLimit: 10,
waitForConnections: true,
timezone: '+00:00', // critical — prevents DATETIME offset bugs
});
export { pool };
The timezone: '+00:00' setting for MySQL is non-obvious but critical: MySQL DATETIME columns store literal time strings without timezone information. If the Node.js process runs in a different timezone than the MySQL server, Date objects serialize with the wrong offset. Always set +00:00 on both the connection and the MySQL server.
AWS S3: singleton S3Client
The AWS SDK v3 resolves credentials through a chain: environment variables, EC2 instance metadata, ECS task role, IAM role attached to the process. This chain resolution involves HTTP calls to the metadata service on first use. Creating a new S3Client on every tool call re-runs this chain on every request. Create one client at startup:
import { S3Client } from '@aws-sdk/client-s3';
// Singleton — credential chain resolves once at startup
const s3 = new S3Client({ region: process.env.AWS_REGION ?? 'us-east-1' });
const BUCKET = process.env.S3_BUCKET!;
export { s3, BUCKET };
The credentials are cached by the SDK after the first resolution. Subsequent tool calls reuse the cached credentials until they expire, at which point the SDK refreshes them automatically in the background.
Kafka: connected Producer and Admin kept open
KafkaJS uses a layered client model: Kafka (configuration), Producer (publishing), Consumer (subscribing), Admin (cluster management). The expensive operations are the TCP connections to brokers and the producer registration. Keep the Producer and Admin connected as singletons:
import { Kafka } from 'kafkajs';
const kafka = new Kafka({ brokers: process.env.KAFKA_BROKERS!.split(',') });
const producer = kafka.producer({ idempotent: true, maxInFlightRequests: 5 });
const admin = kafka.admin();
// Connect once at startup — before registering tools
export async function connectKafka() {
await producer.connect();
await admin.connect();
}
export { producer, admin };
The idempotent: true flag enables Kafka's idempotent producer mode: the broker deduplicates retried messages using sequence numbers, so a network error during publish doesn't result in duplicate messages. This is the right default for MCP tools — a failed tool call often results in a retry from the calling agent, and duplicate events in downstream consumers cause hard-to-debug state corruption.
For consumer operations, don't keep a long-running consumer loop. Instead, use a transient consumer pattern: connect, subscribe, collect N messages, disconnect. A long-running consumer group that stays connected indefinitely doesn't fit the synchronous MCP tool model and causes consumer group rebalances every time the MCP server restarts.
DSPy: module-level configure
DSPy's LM client is configured globally at module level via dspy.configure(). Call it once at service startup — not inside tool handlers:
# dspy_service.py — Pattern 2: DSPy module inside an MCP tool handler
import dspy
from fastapi import FastAPI
# Configure once at startup
dspy.configure(lm=dspy.LM('anthropic/claude-sonnet-4-6', timeout=25))
app = FastAPI()
DSPy programs are Python-only as of mid-2026. For TypeScript MCP servers using DSPy for reasoning (Pattern 2), the integration point is HTTP: run the DSPy module as a FastAPI service and call it from TypeScript tool handlers via fetch(). Set explicit timeouts on both sides — 25 seconds for the DSPy LLM call and 30 seconds for the MCP handler's HTTP request to the DSPy service — so the DSPy service fails cleanly before the MCP handler gives up.
Contract 3 — The health transparency contract
The third contract is the most commonly skipped. Process-level health checks — "is the MCP server process running?" — miss every infrastructure-specific failure mode. A PostgreSQL pool can be fully exhausted (all connections in use, new queries queuing indefinitely) while the MCP server process reports healthy. IAM credentials can expire silently while the S3Client continues to return healthy because the credentials haven't been used since startup. A Kafka broker can become unreachable while the singleton Producer is still marked connected because it hasn't attempted a publish since the failure. Each integration requires a resource-specific health probe that exercises the actual failure path.
PostgreSQL: pool.waitingCount reveals exhaustion
The PostgreSQL Pool exposes connection counts directly on the pool instance:
app.get('/health/postgres', async (_req, res) => {
try {
const start = Date.now();
await pool.query('SELECT 1');
const latencyMs = Date.now() - start;
const poolStats = {
totalCount: pool.totalCount, // connections currently open
idleCount: pool.idleCount, // connections waiting for work
waitingCount: pool.waitingCount // requests waiting for a free connection
};
// Pool is exhausted if requests are queuing for connections
const healthy = poolStats.waitingCount === 0;
res.status(healthy ? 200 : 503).json({
status: healthy ? 'ok' : 'degraded',
latencyMs,
pool: poolStats,
});
} catch (err) {
res.status(503).json({ status: 'error', error: (err as Error).message });
}
});
pool.waitingCount is the critical signal: it represents requests that have been queued because all pool connections are busy. A non-zero waitingCount means tool calls are experiencing latency proportional to how long they wait for a connection. HTTP process checks see none of this — the endpoint responds as long as the process is alive, whether or not the pool is degraded. Wire AliveMCP to /health/postgres and set the failure threshold on waitingCount rather than just HTTP status.
MySQL: internal pool stats via driver internals
The mysql2 pool exposes connection stats through internal pool references. The health endpoint pattern is the same:
app.get('/health/mysql', async (_req, res) => {
try {
const start = Date.now();
await pool.query('SELECT 1');
const latencyMs = Date.now() - start;
const internalPool = (pool as unknown as {
pool: { _allConnections: unknown[]; _freeConnections: unknown[]; _connectionQueue: unknown[] }
}).pool;
const poolStats = {
total: internalPool._allConnections.length,
idle: internalPool._freeConnections.length,
queued: internalPool._connectionQueue.length,
};
const healthy = poolStats.queued === 0;
res.status(healthy ? 200 : 503).json({ status: healthy ? 'ok' : 'degraded', latencyMs, pool: poolStats });
} catch (err) {
res.status(503).json({ status: 'error', error: (err as Error).message });
}
});
The access pattern through internal pool references is less clean than PostgreSQL's public API — but the signal is the same. A non-zero _connectionQueue.length means tool calls are waiting for connections.
AWS S3: HeadBucketCommand catches IAM expiry
The S3 health probe must exercise the actual credentials and bucket access path, not just network connectivity to the S3 endpoint. HeadBucketCommand does both:
import { HeadBucketCommand, S3ServiceException } from '@aws-sdk/client-s3';
app.get('/health/s3', async (_req, res) => {
try {
await s3.send(new HeadBucketCommand({ Bucket: BUCKET }));
res.status(200).json({ status: 'ok', bucket: BUCKET });
} catch (err) {
const s3err = err as S3ServiceException;
if (s3err.$metadata?.httpStatusCode === 403) {
// IAM credentials expired or missing s3:HeadBucket permission
res.status(503).json({ status: 'error', reason: 'iam_credentials_invalid', bucket: BUCKET });
} else if (s3err.$metadata?.httpStatusCode === 404) {
// Bucket name wrong or region misconfigured
res.status(503).json({ status: 'error', reason: 'bucket_not_found', bucket: BUCKET });
} else {
res.status(503).json({ status: 'error', error: s3err.message });
}
}
});
The HeadBucketCommand failure mode table is specific to S3 and not covered by any general HTTP check:
| HeadBucket response | Root cause | Health endpoint result |
|---|---|---|
| 200 OK | All good | 200 ok |
| 403 Forbidden | IAM credentials expired or policy changed | 503 — iam_credentials_invalid |
| 404 Not Found | Bucket name wrong or region mismatch | 503 — bucket_not_found |
| Timeout | Regional S3 outage or VPC endpoint issue | 503 after timeout |
| Connection refused | VPC endpoint misconfigured | 503 — network error |
A 403 on HeadBucket is particularly useful: it tells you the credentials have expired without requiring an actual GetObject or PutObject call that modifies data. Wire your monitoring to alert on iam_credentials_invalid specifically — it's the most common S3 failure mode for long-running services where credentials rotate on a schedule.
Kafka: describeCluster verifies broker reachability
admin.describeCluster() makes a metadata request to the Kafka broker and returns information about the cluster, including the number of brokers available. A broker count that drops below expected is an early signal of cluster instability:
app.get('/health/kafka', async (_req, res) => {
try {
const start = Date.now();
const cluster = await admin.describeCluster();
const latencyMs = Date.now() - start;
const healthy = cluster.brokers.length > 0;
res.status(healthy ? 200 : 503).json({
status: healthy ? 'ok' : 'degraded',
brokerCount: cluster.brokers.length,
controllerId: cluster.controller?.nodeId,
latencyMs,
});
} catch (err) {
res.status(503).json({ status: 'error', error: (err as Error).message });
}
});
The admin client stays connected as a singleton — the health check doesn't create a new connection. If the admin client's connection to the broker has been lost (network partition, broker restart), describeCluster() will throw and the health endpoint returns 503. This is the right signal: the Producer's connectivity is likely affected by the same partition, and tool calls that attempt to publish will fail.
DSPy: canary inference endpoint
DSPy's health probe is the most complex because it sits at the intersection of three failure layers: the MCP server process, the DSPy FastAPI service, and the LLM backend. Add a health endpoint to the FastAPI service that runs a lightweight canary inference:
# DSPy FastAPI service health endpoint
@app.get('/health')
async def health():
try:
# Lightweight canary inference — one-word classification
class HealthCheck(dspy.Signature):
text: str = dspy.InputField()
label: str = dspy.OutputField()
checker = dspy.ChainOfThought(HealthCheck)
result = checker(text='ping')
return {'status': 'ok', 'lm_reachable': bool(result.label)}
except Exception as e:
raise HTTPException(status_code=503, detail=str(e))
Monitor both endpoints — the MCP server's own health endpoint for server availability, and the DSPy service's /health for LLM backend reachability. A DSPy service outage causes all tool handlers that depend on it to return InternalError. Without a dedicated probe on the DSPy service, these errors look identical to MCP server crashes from the calling agent's perspective — and from AliveMCP's perspective, the MCP server endpoint might still respond while all reasoning tools are failing.
Where the integrations diverge
The three contracts give you a framework that works across all five integrations. But each integration diverges in three specific areas that don't generalize: how it surfaces errors, whether it supports transactions, and which failure modes are invisible to general-purpose health checks.
Error taxonomy: four different languages for failure
Each infrastructure backend uses a different error representation, which means error handling code cannot be shared across integrations:
| Integration | Error type | How to read it | MCP error response |
|---|---|---|---|
| PostgreSQL | String error code on err.code | '23505' = unique_violation, '23503' = foreign_key, '23514' = check_violation | isError: true — constraint violations are domain errors the LLM can act on |
| MySQL | Integer errno on err.errno | 1062 = ER_DUP_ENTRY, 1452 = ER_NO_REFERENCED_ROW_2, 1213 = ER_LOCK_DEADLOCK | isError: true — same principle; MySQL integers not PostgreSQL strings |
| AWS S3 | Named exception class from SDK | NoSuchKey, NoSuchBucket, AccessDenied, S3ServiceException | isError: true for not-found/permission; throw McpError for internal SDK errors |
| Kafka | KafkaJS typed errors | KafkaJSNumberOfRetriesExceeded, KafkaJSProtocolError, topic-not-found produces a typed code | isError: true for publish failures; throw McpError for broker connectivity loss |
| DSPy | Python exceptions surfaced via HTTP status | 5xx from FastAPI = DSPy module error; timeout = LLM backend unreachable; 422 = Pydantic validation error from bad input | throw McpError(ErrorCode.InternalError) for service failures; isError: true for validation errors |
The most important distinction is PostgreSQL vs MySQL error codes. Both are relational databases with ACID transactions, but a handler that checks err.code === '23505' won't catch MySQL's ER_DUP_ENTRY errno 1062. If you share tool code between a PostgreSQL-backed and a MySQL-backed MCP server, the error mapping layer must be abstracted separately.
Transaction model: four different answers to atomicity
Atomic multi-step operations — insert a record and update a counter in the same transaction — have different answers across the five integrations:
PostgreSQL and MySQL both support full ACID transactions, but require a pinned connection. Use pool.connect() (PostgreSQL) or pool.getConnection() (MySQL) to check out a dedicated connection, run all statements on it, and release in a finally block. Failing to release leaks the connection from the pool permanently — it's the most dangerous resource leak in database-backed MCP servers.
// PostgreSQL transaction pattern
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('INSERT INTO events (type) VALUES ($1)', [type]);
await client.query('UPDATE counters SET count = count + 1 WHERE type = $1', [type]);
await client.query('COMMIT');
} catch (err) {
await client.query('ROLLBACK');
throw err;
} finally {
client.release(); // CRITICAL — always in finally
}
AWS S3 has no transaction concept. S3 operations are individually atomic (a PutObject either succeeds completely or fails completely) but there is no way to make two S3 operations atomic together. For scenarios that need atomicity, use idempotent operations with conditional writes: include an IfNoneMatch: '*' condition on PutObject to fail if the object already exists, and design the overall operation to be safe on retry. Accept that S3 does not offer cross-object atomicity.
Kafka has producer-side idempotence (enabled by idempotent: true — the broker deduplicates retried messages by sequence number) but not cross-topic atomicity. Kafka transactions exist in the protocol but require a transactional producer, additional broker configuration, and careful consumer-side handling — significantly more complexity than the typical MCP tool needs. For most MCP use cases, producer-side idempotence is sufficient: a retry from the calling agent produces at most one event delivery.
DSPy has no transaction concept — it's a reasoning engine, not a data store. LLM inference is not transactional by nature: if a DSPy module call fails partway through a multi-step reasoning chain, there is no rollback. Design DSPy-backed tools to be idempotent where possible, and document clearly when tool side effects (writes to databases triggered by DSPy output) are not atomic.
Monitoring blind spots: what each integration hides from general health checks
Each integration has a failure mode that is invisible to process-level health checks and requires the resource-specific probe:
| Integration | Invisible failure mode | Why process checks miss it | What catches it |
|---|---|---|---|
| PostgreSQL | Pool exhaustion — all connections in use, queries queuing | MCP server process is alive and responding to HTTP | /health/postgres with pool.waitingCount > 0 → 503 |
| MySQL | Pool exhaustion — connection queue growing under concurrent load | Same as PostgreSQL — process health is orthogonal to pool health | /health/mysql with _connectionQueue.length > 0 → 503 |
| AWS S3 | IAM credential expiry — credentials expired since last use | S3Client is initialized; process is alive; no error until first actual S3 call | /health/s3 with HeadBucketCommand → 403 = iam_credentials_invalid |
| Kafka | Broker rebalancing — producer connected but broker count changed | Producer connection object exists; no error surfaced until next publish | /health/kafka with admin.describeCluster() returns fewer brokers than expected |
| DSPy | LLM backend outage — DSPy service running but LLM API unreachable | DSPy FastAPI service responds 200 on process check; LLM calls fail inside handlers | Canary inference on /health endpoint — catches LLM backend failures separately from service health |
Wire AliveMCP to the resource-specific health endpoint, not the generic process health endpoint. The process-level check tells you the MCP server is running. The resource-specific check tells you the data infrastructure behind it is working — which is what the calling agent actually needs to know.
Putting the three contracts together: a reference matrix
The three contracts produce a consistent checklist for every data infrastructure integration. Before shipping any new data backend behind an MCP tool, verify all three:
| Checkpoint | PostgreSQL | MySQL | AWS S3 | Kafka | DSPy |
|---|---|---|---|---|---|
| Safety contract met? | $1/$2 placeholders everywhere; no string interpolation |
? placeholders everywhere; no string interpolation |
validateKey() on every tool; bucket from env only |
Topic allow-list checked before every publish | Zod schema validates inputs before DSPy module call |
| Singleton contract met? | Module-level Pool exported from db.ts |
Module-level createPool() with timezone: '+00:00' |
Module-level S3Client; bucket from env |
Connected Producer + Admin at startup |
dspy.configure(lm=...) at module level, once |
| Health contract met? | /health/postgres returns 503 if waitingCount > 0 |
/health/mysql returns 503 if connection queue non-empty |
/health/s3 runs HeadBucketCommand; maps 403/404 to reasons |
/health/kafka calls admin.describeCluster() |
/health on FastAPI runs canary inference call |
The contracts are orthogonal — each addresses a different failure mode. You need all three. Skipping the safety contract opens an injection vector. Skipping the singleton contract causes performance failures under concurrent load. Skipping the health contract means failures are invisible until tool callers start timing out.
The integration order that builds confidence
If you're adding data infrastructure to a new MCP server, this is the sequence that catches problems earliest:
- Set up the singleton first — confirm the connection succeeds with a startup log line. A failed database connection at startup is easier to debug than a failed connection during a tool call at runtime.
- Add the safety contract before the first tool — write the parameterized query template or key validator before adding any tool handlers. Retrofitting safety into existing handlers is harder and error-prone.
- Wire the health endpoint and confirm it before deployment — call it yourself with
curlbefore deploying. A health endpoint that returns200even when the database is down (because you forgot to await the query) is worse than no health endpoint — it creates false confidence. - Add AliveMCP monitoring on the resource-specific endpoint — not the process health endpoint. Set an alert threshold on the resource-specific field (pool waiting count, IAM error reason, broker count) rather than just HTTP status.
Monitoring gap: the failure mode that survives all three contracts
The three contracts address the most common failure classes, but there's a failure mode that survives all three: the MCP server is healthy, the database pool is healthy, the health endpoint returns 200 — and a tool call still fails because the LLM caller passed an argument combination that triggers a slow query, a rare constraint violation, or an edge case in the DSPy reasoning chain.
These failures require protocol-level monitoring — not health endpoint polling. AliveMCP probes the full MCP protocol sequence (initialize → tools/list → tool call with a known-good argument set) from outside the server. This catches failures that health endpoint checks don't see:
- A tool call that succeeds most of the time but fails on specific argument combinations
- A schema drift where the tool's parameter schema changed but the health endpoint doesn't reflect tool call behavior
- A pool exhaustion that resolves temporarily (healthy endpoint) but recurs under load
- A DSPy service that responds to process checks but returns garbage output on actual inference
The health endpoints you build with Contract 3 give you resource-level visibility. Protocol-level monitoring gives you tool-call-level visibility — which is what the calling agent actually experiences. Both are required for data infrastructure MCP servers that go into production.
Related guides
- MCP Server PostgreSQL — pg driver, parameterized queries, and connection pooling
- MCP Server MySQL — mysql2 driver, connection pooling, and prepared statements
- MCP Server AWS S3 — GetObject, PutObject, presigned URLs, and health monitoring
- MCP Server Kafka — KafkaJS producer and consumer tools, topic management, health monitoring
- MCP Server DSPy — typed signatures, MCP tool adapters, and monitoring
- MCP Server Connection Pooling — pool sizing, timeout configuration, and exhaustion detection
- MCP Server Error Handling — isError vs throw, retry signals, and error taxonomy
- MCP Server MongoDB — document operations and connection management
- MCP Server Redis — caching, pub/sub, and session state
- MCP Server Health Check — designing endpoints for uptime monitors
- Building MCP Tools for LLM Provider APIs — the three patterns every LLM integration shares