
Monitoring an MCP server

Monitoring an MCP server means continuously answering four questions: is the protocol handshake succeeding, is the tool surface stable, are latencies inside your envelope, and are callers getting clean responses? Everything else is cosmetic.

TL;DR

The four signals worth monitoring on an MCP server: handshake success rate, tool-surface stability (count + schema hash), p95 response latency, and client-facing error rate. Anything you measure beyond those is usually noise. You can self-host a probe in an afternoon with cron and curl, or let AliveMCP do it across every public MCP server for free.

Why MCP monitoring is different from API monitoring

A conventional REST API has one natural health signal — an HTTP 2xx on a known route — and monitoring tools can piggyback on that. MCP doesn't. An MCP server over HTTP is a JSON-RPC endpoint where every call is a POST to the same URL, the method is in the body, and the meaning of "healthy" depends on which capabilities the server advertised. You can't monitor an MCP server by GETing its root and checking for 200. You have to speak the protocol.
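"Speaking the protocol" starts with the handshake. A minimal sketch of the request a probe has to send, assuming the standard JSON-RPC envelope from the MCP specification (the protocolVersion value and clientInfo fields here are illustrative):

```python
import json

# No GET, no health route: the probe POSTs a JSON-RPC "initialize"
# request to the single MCP endpoint. The method lives in the body.
initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-03-26",   # example version string
        "capabilities": {},
        "clientInfo": {"name": "health-probe", "version": "0.1"},
    },
}

body = json.dumps(initialize_request)
```

A 200 on this POST is still not "healthy" — the probe must also parse the response and check that it carries a result rather than a JSON-RPC error.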

The consequence: pointing a generic uptime tool like UptimeRobot or Pingdom at an MCP endpoint tells you only that the HTTPS server is up. It tells you nothing about whether agents talking to it can initialize, list tools, or get valid responses. In our April 2026 audit of 2,181 public MCP endpoints, 91% failed at the protocol or tool layer while passing a generic HTTP probe.

The four signals

  1. Handshake success rate. Every probe runs an initialize request. Record success / failure. Alert when the rolling 5-minute success rate drops below 99% — most auth or deploy-breakage failures show up here first.
  2. Tool-surface stability. Call tools/list on every probe. Hash the sorted list of (tool_name, input_schema). Alert on a shrinking count (a registration crashed) and on unexpected hash changes outside release windows (a deploy broke a contract).
  3. p95 response latency. Track two latencies: time-to-first-byte on initialize, and total round-trip on tools/list. Baseline the 7-day rolling p95; alert on 3× deviation sustained over 3+ consecutive probes.
  4. Client-facing error rate. If you have access to the server's own logs (self-hosted), count the JSON-RPC error responses it emits per minute. A server that probes clean but is emitting errors to real clients is the worst failure mode — your dashboard says green while users see nothing but red.
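The hash in signal 2 can be sketched as follows. This is one reasonable construction, not the only one; it assumes `tools` is the parsed tool array from a tools/list result:

```python
import hashlib
import json

def surface_hash(tools: list) -> str:
    """Hash the sorted (name, inputSchema) pairs from a tools/list result.

    Sorting by tool name makes the hash independent of listing order, so
    it changes only when a tool is added, removed, or its schema changes.
    """
    canonical = sorted(
        (t["name"], json.dumps(t.get("inputSchema", {}), sort_keys=True))
        for t in tools
    )
    return hashlib.sha256(json.dumps(canonical).encode()).hexdigest()

def surface_alert(prev_count: int, prev_hash: str, tools: list):
    """Apply the two alert rules from signal 2; return a reason or None."""
    count, digest = len(tools), surface_hash(tools)
    if count < prev_count:
        return "shrinking tool count"      # a registration crashed
    if digest != prev_hash:
        return "surface hash changed"      # a deploy may have broken a contract
    return None
```

In practice you would suppress the hash-change alert inside declared release windows, as described above, and page only on the shrinking count.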


The minimum self-hosted stack

If you want to roll your own, here's the shortest path that actually works:
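The core of that stack is a single probe script on a cron. A sketch in Python rather than raw curl (the endpoint URL is a placeholder; the Accept header follows MCP's Streamable HTTP transport, and some servers additionally require a session header after initialize, which this sketch omits):

```python
import json
import time
import urllib.request

ENDPOINT = "https://example.com/mcp"  # placeholder: your MCP endpoint

def rpc(method: str, params: dict, req_id: int):
    """POST one JSON-RPC request; return (parsed response, seconds taken)."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}
    ).encode()
    req = urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.loads(resp.read())
    return body, time.monotonic() - start

def classify(response: dict) -> str:
    """Map a JSON-RPC response to a probe verdict (signal 1 and 4)."""
    if "error" in response:
        return "protocol-error"
    if "result" not in response:
        return "malformed"
    return "ok"

def probe():
    """One probe cycle: handshake verdict plus latency (signals 1 and 3)."""
    init, latency = rpc("initialize", {
        "protocolVersion": "2025-03-26",
        "capabilities": {},
        "clientInfo": {"name": "probe", "version": "0.1"},
    }, 1)
    print(classify(init), f"initialize {latency:.3f}s")
```

Append each line of output to a log file, and the 7-day p95 baseline is one awk invocation away.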

Budget: 3–6 hours to ship, 30 minutes a week to keep running. Fine for one or two internal servers. Past that, the hosted option is cheaper once you count your time.

When hosted makes sense

AliveMCP already runs the probe against every public MCP server in MCP.so, Glama, PulseMCP, Smithery, the Official Registry, and the GitHub mcp topic — so if you're monitoring third-party MCPs, there's nothing to build. For your own servers, the Author tier ($9/mo) adds webhook + Slack alerts, 90-day response-time history, a public status badge, and a verified-author mark on the public dashboard. The Team tier ($49/mo) adds 10 private endpoints, per-environment status pages, and SSO — a feature set Datadog charges a 100× premium for, minus the dashboards-within-dashboards.


Related questions

Should I monitor from one region or many?

For MCPs aimed at end users, probe from at least two geographically distant regions; a single region is blind to routing-layer issues. For internal MCPs used only by colocated agents, one region is fine.

What about monitoring resource and prompt surfaces, not just tools?

If your server advertises resources or prompts capabilities, extend the probe: resources/list and prompts/list go in the same rotation. The hash-the-surface discipline is identical — shrinking counts are the red flag.
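Extending the hash-the-surface discipline across all three surfaces can be sketched like this (a sketch under the assumption that each argument is the parsed item array from the corresponding list call; tools and prompts are keyed by name, resources by uri, per the MCP schema):

```python
import hashlib
import json

def full_surface_hash(tools, resources=(), prompts=()):
    """Hash the tools, resources, and prompts surfaces together.

    Each entry is reduced to its stable identifier, so the hash changes
    only when the advertised surface itself changes, not when listing
    order or unrelated metadata does.
    """
    surface = {
        "tools": sorted(t["name"] for t in tools),
        "resources": sorted(r["uri"] for r in resources),
        "prompts": sorted(p["name"] for p in prompts),
    }
    return hashlib.sha256(
        json.dumps(surface, sort_keys=True).encode()
    ).hexdigest()
```

Track the per-surface counts alongside the combined hash so an alert can say which surface shrank.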

Does MCP monitoring need agent-behavior simulation?

Rarely. A real tool-call synthetic (like exercising your search tool end-to-end) adds confidence but triples cost and complexity. Reserve it for one or two critical-path tools, not a blanket policy.

Further reading