MCP server health check
A correct MCP health check verifies three things: the protocol handshake succeeds, the tool registry responds without error, and the response envelope conforms to the MCP spec. Anything less is prone to reporting a broken server as healthy.
TL;DR
Run this sequence every 60 seconds: initialize → verify protocolVersion and capabilities → tools/list → hash the returned schema → compare with last known hash. Alert on any JSON-RPC error, missing keys, or schema drift. AliveMCP does exactly this for every public MCP server, for free. Join the waitlist to add your private endpoints.
Why "HTTP 200" isn't a health check
MCP speaks JSON-RPC 2.0 over HTTP, SSE, or stdio. The transport layer and the protocol layer can fail independently: the transport can be fine while the protocol is broken, or the protocol can be fine while one tool is misconfigured. A health check that only inspects the transport is blind to the most common real-world failures — which is why MCP authors keep discovering their server has been silently broken for a week.
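A minimal sketch of why the two layers must be checked separately: the helper below (the function name is my own, not from any SDK) accepts a response only when it is both an HTTP 200 and a well-formed JSON-RPC result.

```python
import json

def protocol_healthy(http_status: int, body: str) -> bool:
    """Transport success (HTTP 200) is necessary but not sufficient:
    the JSON-RPC envelope inside can still carry an error."""
    if http_status != 200:
        return False  # transport-level failure
    try:
        msg = json.loads(body)
    except json.JSONDecodeError:
        return False  # 200 OK but not JSON at all: protocol broken
    # A JSON-RPC 2.0 response carries either "result" or "error", never both.
    return "error" not in msg and "result" in msg

# A server can return HTTP 200 while the protocol layer is broken:
broken = '{"jsonrpc":"2.0","id":1,"error":{"code":-32601,"message":"Method not found"}}'
protocol_healthy(200, broken)  # False: transport fine, protocol broken
```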
The probe sequence
- `initialize` request. Send a POST with body `{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"healthcheck","version":"1"}}}`. Expect a result with `protocolVersion`, `serverInfo.name`, and `capabilities`.
- `notifications/initialized` send. Per spec, the client signals it's ready. A server that behaves differently after this signal reveals state-dependent bugs.
- `tools/list` request. Returns an array of `{name, description, inputSchema}`. An empty array is legal only if the server advertised no `tools` capability; otherwise it's a failure.
- Hash the schema. SHA-256 over the sorted, canonicalized list of `(name, inputSchema)`. Store it. Compare on the next run: if it differs, emit a schema drift event (not always an alert; sometimes it's an intentional deploy).
- Measure latency. Record time-to-first-byte and time-to-complete. Keep p50/p95 rolling windows.
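The steps above can be sketched as pure helpers. The transport call itself (POST over HTTP, SSE, or stdio) is omitted; the field names follow the request body quoted in step 1, and the canonicalization choice (pairs sorted by name, `sort_keys` JSON) is one reasonable option rather than a mandated format.

```python
import hashlib
import json

PROTOCOL_VERSION = "2025-06-18"  # pin the version the probe claims to speak

def initialize_request(req_id: int = 1) -> dict:
    """JSON-RPC body for step 1 of the probe sequence."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {},
            "clientInfo": {"name": "healthcheck", "version": "1"},
        },
    }

def check_initialize_result(result: dict) -> list:
    """Return the list of required keys missing from the initialize result."""
    missing = [k for k in ("protocolVersion", "capabilities") if k not in result]
    if "name" not in result.get("serverInfo", {}):
        missing.append("serverInfo.name")
    return missing

def schema_hash(tools: list) -> str:
    """SHA-256 over the sorted, canonicalized (name, inputSchema) pairs,
    so reordering tools does not register as drift."""
    canonical = sorted(
        ((t["name"], t.get("inputSchema", {})) for t in tools),
        key=lambda pair: pair[0],
    )
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()
```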
Alert signals ranked by urgency
- Critical (page now): TCP refused, TLS failure, `initialize` returns a JSON-RPC error, `tools/list` returns an error, no response within 30s.
- High (Slack within 1 min): schema hash change, tool count drop > 0, `initialize` succeeds but `protocolVersion` has changed without a release.
- Medium (daily digest): p95 latency > 3× the 7-day baseline for 3+ consecutive probes, intermittent 5xx, transient timeouts under 5% of probes.
- Low (weekly): descriptions changed but schemas are stable, capabilities block changed, serverInfo version bumped (useful for release auditing).
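One way to encode this ranking in a router, assuming illustrative event labels of my own choosing:

```python
def severity(event: str) -> str:
    """Map a probe finding to one of the urgency tiers above.
    Event names are illustrative labels, not part of any MCP spec."""
    critical = {"tcp_refused", "tls_failure", "initialize_error",
                "tools_list_error", "timeout_30s"}
    high = {"schema_hash_change", "tool_count_drop", "protocol_version_change"}
    medium = {"p95_latency_regression", "intermittent_5xx", "transient_timeouts"}
    if event in critical:
        return "critical"
    if event in high:
        return "high"
    if event in medium:
        return "medium"
    return "low"  # description/capability/version changes: audit, don't page
```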
Probe frequency
For agent-facing infrastructure, 60-second probes are the floor. Agents retry fast, and a five-minute-old cached status is a lifetime in conversation time. For cold internal tooling, 5-minute intervals are fine. Don't probe faster than 15 seconds unless you coordinate with the server author — it looks like abuse and can trigger rate limits.
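A small sketch of that cadence policy; the constants mirror the numbers above, and the function name is hypothetical:

```python
from typing import Optional

MIN_COORDINATED_S = 15    # faster than this needs the server author's sign-off
AGENT_FLOOR_S = 60        # floor for agent-facing infrastructure
INTERNAL_DEFAULT_S = 300  # cold internal tooling

def probe_interval(agent_facing: bool, requested_s: Optional[int] = None) -> int:
    """Pick a probe cadence consistent with the guidance above."""
    default = AGENT_FLOOR_S if agent_facing else INTERNAL_DEFAULT_S
    floor = AGENT_FLOOR_S if agent_facing else MIN_COORDINATED_S
    interval = requested_s if requested_s is not None else default
    return max(interval, floor)  # never probe faster than the floor
```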
How AliveMCP implements this
The public dashboard runs exactly the sequence above, every 60 seconds, against every MCP endpoint it discovers in MCP.so, Glama, PulseMCP, Smithery, the Official Registry, and GitHub. Private endpoints go on the Team tier at $49/mo with Slack + webhook alerts and a public status-page subdomain. Enterprise teams running 5-30 servers get SAML SSO, an audit log, and monthly SLA PDFs. See the live feed on the AliveMCP home page or review what's in each tier.
Related questions
What HTTP method should my MCP accept for probes?
The spec uses POST for JSON-RPC envelopes. A monitor that uses GET will not correctly test the protocol — it'll hit whatever the server decides to serve on GET, which may be a landing page, a docs redirect, or a 405.
Should the probe execute a real tool call?
Usually no — tools/list exercises the same code paths as a call without the side effects. If a specific tool is your critical path (e.g. a payment tool), add a synthetic check for it, but keep it isolated from the main liveness probe.
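If you do add a synthetic check, this is a sketch of the `tools/call` envelope it would send; the `create_payment` tool and its arguments are hypothetical:

```python
def synthetic_tool_call(tool_name: str, arguments: dict, req_id: int = 99) -> dict:
    """JSON-RPC body for a tools/call probe against one critical-path tool.
    Run it on its own schedule, isolated from the liveness probe."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical critical-path check against a sandbox amount:
body = synthetic_tool_call("create_payment", {"amount_cents": 1, "dry_run": True})
```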
How do I avoid rate-limiting my own server?
60s probes from a fixed set of AliveMCP IPs are designed to stay below any sane rate limit. If you use internal monitoring too, coordinate cadences or allowlist the probe's source by IP or JWT claim.