MCP server health check
A correct MCP health check verifies three things: the protocol handshake succeeds, the tool registry responds without error, and the response envelope conforms to the MCP spec. Anything less is prone to reporting a broken server as healthy.
TL;DR
Run this sequence every 60 seconds: initialize → verify protocolVersion and capabilities → tools/list → hash the returned schema → compare with last known hash. Alert on any JSON-RPC error, missing keys, or schema drift. AliveMCP does exactly this for every public MCP server, for free. Join the waitlist to add your private endpoints.
Why "HTTP 200" isn't a health check
MCP speaks JSON-RPC 2.0 over HTTP, SSE, or stdio. The transport layer and the protocol layer can fail independently: the transport can be fine while the protocol is broken, or the protocol can be fine while one tool is misconfigured. A health check that only inspects the transport is blind to the most common real-world failures — which is why MCP authors keep discovering their server has been silently broken for a week.
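A minimal sketch of why the two layers must be checked separately: the helper below (the function name is my own, not from any SDK) accepts a response only when it is both an HTTP 200 and a well-formed JSON-RPC result.

```python
import json

def protocol_healthy(http_status: int, body: str) -> bool:
    """Transport success (HTTP 200) is necessary but not sufficient:
    the JSON-RPC envelope inside can still carry an error."""
    if http_status != 200:
        return False  # transport-level failure
    try:
        msg = json.loads(body)
    except json.JSONDecodeError:
        return False  # 200 OK but not JSON at all: protocol broken
    # A JSON-RPC 2.0 response carries either "result" or "error", never both.
    return "error" not in msg and "result" in msg

# A server can return HTTP 200 while the protocol layer is broken:
broken = '{"jsonrpc":"2.0","id":1,"error":{"code":-32601,"message":"Method not found"}}'
protocol_healthy(200, broken)  # False: transport fine, protocol broken
```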
The probe sequence
- `initialize` request. Send a POST with body `{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"healthcheck","version":"1"}}}`. Expect a result with `protocolVersion`, `serverInfo.name`, and `capabilities`.
- `notifications/initialized` send. Per spec, the client signals it's ready. A server that behaves differently after this signal reveals state-dependent bugs.
- `tools/list` request. Returns an array of `{name, description, inputSchema}`. An empty array is legal only if the server advertised no `tools` capability; otherwise it's a failure.
- Hash the schema. SHA-256 over the sorted, canonicalized list of `(name, inputSchema)`. Store it. Compare on the next run: if it differs, emit a schema drift event (not always an alert; sometimes it's an intentional deploy).
- Measure latency. Record time-to-first-byte and time-to-complete. Keep p50/p95 rolling windows.
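The steps above can be sketched as pure helpers. The transport call itself (POST over HTTP, SSE, or stdio) is omitted; the field names follow the request body quoted in step 1, and the canonicalization choice (pairs sorted by name, `sort_keys` JSON) is one reasonable option rather than a mandated format.

```python
import hashlib
import json

PROTOCOL_VERSION = "2025-06-18"  # pin the version the probe claims to speak

def initialize_request(req_id: int = 1) -> dict:
    """JSON-RPC body for step 1 of the probe sequence."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {},
            "clientInfo": {"name": "healthcheck", "version": "1"},
        },
    }

def check_initialize_result(result: dict) -> list:
    """Return the list of required keys missing from the initialize result."""
    missing = [k for k in ("protocolVersion", "capabilities") if k not in result]
    if "name" not in result.get("serverInfo", {}):
        missing.append("serverInfo.name")
    return missing

def schema_hash(tools: list) -> str:
    """SHA-256 over the sorted, canonicalized (name, inputSchema) pairs,
    so reordering tools does not register as drift."""
    canonical = sorted(
        ((t["name"], t.get("inputSchema", {})) for t in tools),
        key=lambda pair: pair[0],
    )
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()
```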
Alert signals ranked by urgency
- Critical (page now): TCP refused, TLS failure, `initialize` returns a JSON-RPC error, `tools/list` returns an error, no response within 30s.
- High (Slack within 1 min): schema hash change, tool count drop > 0, `initialize` succeeds but `protocolVersion` has changed without a release.
- Medium (daily digest): p95 latency > 3× the 7-day baseline for 3+ consecutive probes, intermittent 5xx, transient timeouts under 5% of probes.
- Low (weekly): descriptions changed but schemas are stable, capabilities block changed, serverInfo version bumped (useful for release auditing).
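One way to encode this ranking in a router, assuming illustrative event labels of my own choosing:

```python
def severity(event: str) -> str:
    """Map a probe finding to one of the urgency tiers above.
    Event names are illustrative labels, not part of any MCP spec."""
    critical = {"tcp_refused", "tls_failure", "initialize_error",
                "tools_list_error", "timeout_30s"}
    high = {"schema_hash_change", "tool_count_drop", "protocol_version_change"}
    medium = {"p95_latency_regression", "intermittent_5xx", "transient_timeouts"}
    if event in critical:
        return "critical"
    if event in high:
        return "high"
    if event in medium:
        return "medium"
    return "low"  # description/capability/version changes: audit, don't page
```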
Probe frequency
For agent-facing infrastructure, 60-second probes are the floor. Agents retry fast, and a five-minute-old cached status is a lifetime in conversation time. For cold internal tooling, 5-minute intervals are fine. Don't probe faster than 15 seconds unless you coordinate with the server author — it looks like abuse and can trigger rate limits.
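A small sketch of that cadence policy; the constants mirror the numbers above, and the function name is hypothetical:

```python
from typing import Optional

MIN_COORDINATED_S = 15    # faster than this needs the server author's sign-off
AGENT_FLOOR_S = 60        # floor for agent-facing infrastructure
INTERNAL_DEFAULT_S = 300  # cold internal tooling

def probe_interval(agent_facing: bool, requested_s: Optional[int] = None) -> int:
    """Pick a probe cadence consistent with the guidance above."""
    default = AGENT_FLOOR_S if agent_facing else INTERNAL_DEFAULT_S
    floor = AGENT_FLOOR_S if agent_facing else MIN_COORDINATED_S
    interval = requested_s if requested_s is not None else default
    return max(interval, floor)  # never probe faster than the floor
```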
How AliveMCP implements this
The public dashboard runs exactly the sequence above, every 60 seconds, against every MCP endpoint it discovers in MCP.so, Glama, PulseMCP, Smithery, the Official Registry, and GitHub. Private endpoints go on the Team tier at $49/mo with Slack + webhook alerts and a public status-page subdomain. Enterprise teams running 5-30 servers get SAML SSO, an audit log, and monthly SLA PDFs. See the live feed on the AliveMCP home page or review what's in each tier.
Related questions
What HTTP method should my MCP accept for probes?
The spec uses POST for JSON-RPC envelopes. A monitor that uses GET will not correctly test the protocol — it'll hit whatever the server decides to serve on GET, which may be a landing page, a docs redirect, or a 405.
Should the probe execute a real tool call?
Usually no — tools/list exercises the same code paths as a call without the side effects. If a specific tool is your critical path (e.g. a payment tool), add a synthetic check for it, but keep it isolated from the main liveness probe.
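If you do add a synthetic check, this is a sketch of the `tools/call` envelope it would send; the `create_payment` tool and its arguments are hypothetical:

```python
def synthetic_tool_call(tool_name: str, arguments: dict, req_id: int = 99) -> dict:
    """JSON-RPC body for a tools/call probe against one critical-path tool.
    Run it on its own schedule, isolated from the liveness probe."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical critical-path check against a sandbox amount:
body = synthetic_tool_call("create_payment", {"amount_cents": 1, "dry_run": True})
```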
How do I avoid rate-limiting my own server?
60s probes from a fixed set of AliveMCP IPs are designed to stay below any sane rate limit. If you use internal monitoring too, coordinate cadences or allowlist the probe's source by IP or JWT claim.