Troubleshooting · Outages
MCP endpoint not responding — diagnostic checklist
Your MCP endpoint is answering (or not), an agent is timing out, and you need to know where the failure is. Walk this checklist from the outside in — transport, then TLS, then JSON-RPC, then MCP semantics, then auth.
TL;DR
Test in this order: DNS → TCP → TLS → HTTP status → JSON-RPC parse → initialize → tools/list → auth on a real tool call. The first layer that fails is your problem. Most "not responding" reports resolve at the JSON-RPC or auth layer, not the transport layer — which is why HTTP-only monitors miss them.
Step 1 — DNS and TCP
Run dig +short your-server.com and nc -vz your-server.com 443. If DNS fails or TCP is refused, the server's host is down or firewalled. This is the only layer a classical uptime monitor reliably catches — everything past here requires protocol awareness.
Step 2 — TLS
openssl s_client -connect your-server.com:443 -servername your-server.com. Look for Verify return code: 0 (ok). Common failures: expired cert (especially around auto-renewal boundaries), wrong SNI, or a cert that doesn't include your-server.com in SANs. This is frequently the culprit after a migration or a registrar change.
Step 3 — HTTP surface
curl -sv -X POST https://your-server.com/mcp -H 'content-type: application/json' -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"diag","version":"1"}}}'. If you get HTTP 4xx on POST with a correct body: routing or method filter is wrong. If you get HTTP 200 with an empty or non-JSON body: your server is accepting the request but crashing before writing the response.
Step 4 — JSON-RPC parse
Pipe the response through jq. You should see a result key with protocolVersion, serverInfo, and capabilities. If you see error.code -32700 (Parse error): your server can't deserialize its own inputs. If error.code -32601 (Method not found): the server doesn't implement initialize, which means it isn't an MCP server in any meaningful sense.
Step 5 — MCP semantics
With the initialize response in hand, send tools/list. If this returns an error: a backend dependency (DB, API key, memory cache) is broken in a way the initialize handler happened to survive. If it returns with fewer tools than expected: a deploy silently removed one — this is the classic "schema drift" failure and agents will fail in confusing ways downstream. See our health check reference for what to hash and alert on.
Step 6 — Auth on a real call
Invoke one low-impact tool end-to-end (something idempotent and cheap). A common failure mode is that initialize and tools/list are unauthenticated but real calls require a bearer token; the server looks healthy in any naive probe, and every downstream call fails on auth. A working auth path is part of "responsive."
How to not find out from a user next time
If the above walk surfaced the problem, it means you didn't have a continuous probe that does this. That's what AliveMCP exists for — we run steps 3–5 against every public MCP endpoint every 60 seconds, and we run all six steps against private endpoints on the Author and Team tiers with Slack + webhook alerts. Claim your listing on the AliveMCP home page, and the next "not responding" report comes from the monitor, not from a customer.
Related questions
My MCP shows up as dead on the public AliveMCP dashboard — what does that mean?
It means our 60-second probe hit step 1–5 somewhere and failed. Open the per-server status page to see which step; the last 90 days of probes are logged with the specific error. If you think it's a false positive, let us know and we'll re-probe.
Why does my server pass curl but fail a client library?
Client libraries set Accept and MCP-Protocol-Version headers; a permissive curl doesn't. Servers that reject missing headers will fail clients and pass curl. Probe with the exact headers a real client would send.
How often does this turn out to be a schema mismatch rather than a real outage?
In our crawl data, roughly 1 in 4 "not responding" reports on MCP servers are actually tool-level regressions, not host outages. That ratio is the whole case for MCP-aware monitoring.