Prometheus MCP monitoring

Prometheus is the right tool for measuring what's happening inside your MCP server: handler latency, tool call counts, error rates, memory pressure. It's the wrong tool for measuring whether your MCP server looks alive from the outside. Prometheus typically scrapes from inside your infrastructure, so it can't see what your users' agents see.

TL;DR

Use Prometheus for in-process observability (request rates, handler durations, memory, queues). Use AliveMCP for external protocol-level uptime (does the server respond to a real MCP handshake from outside your network?). The two are complementary — one watches the inside, the other watches the outside. Join the waitlist to add your private endpoints to AliveMCP alongside your existing Prometheus stack.

What Prometheus does well for MCP servers

If you're already running Prometheus in your infrastructure, instrumenting your MCP server is straightforward. Expose a /metrics endpoint with standard histogram and counter metrics:

mcp_tool_call_duration_seconds — histogram of per-tool handler latency. Gives you p50/p95/p99 bucketed by tool name.
mcp_tool_call_total — counter by tool name and result (success/error). Rate of this over time = your tool call throughput.
mcp_initialize_duration_seconds — time for the protocol handshake to complete, from server-side perspective. Useful for tracking cold-start regressions after deploys.
mcp_active_sessions — gauge of currently active client sessions (if your server is stateful). Tracks session leak behavior.
mcp_schema_version_info — an info-style gauge (constant value 1) that carries the current tool list hash as a label value. When the tool list changes, a series with the new hash appears and the old one goes stale, which is visible in Grafana.

With these, a Grafana dashboard gives you a rich real-time view of your server's internal behavior.
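As a rough illustration of what sits behind the first two metrics in the list above, here is a minimal, dependency-free sketch that records tool calls and renders them in the Prometheus text exposition format. In a real server you would most likely use an official Prometheus client library instead of hand-rolling this. The label keys (tool, result), the bucket bounds, and the helper names record_call, timed_call, and render_metrics are assumptions made for the example.

```python
import time
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; tune them to your latency range.
BUCKETS = (0.05, 0.25, 1.0, float("inf"))

calls = defaultdict(int)  # (tool, result) -> number of calls
durations = defaultdict(lambda: {"buckets": [0] * len(BUCKETS), "sum": 0.0})


def record_call(tool, seconds, ok):
    """Record one finished tool call and its duration."""
    calls[(tool, "success" if ok else "error")] += 1
    hist = durations[tool]
    hist["sum"] += seconds
    for i, bound in enumerate(BUCKETS):
        if seconds <= bound:
            hist["buckets"][i] += 1  # histogram buckets are cumulative


def timed_call(tool, handler, *args, **kwargs):
    """Run a handler, timing it and recording success or error."""
    start = time.perf_counter()
    try:
        result = handler(*args, **kwargs)
    except Exception:
        record_call(tool, time.perf_counter() - start, ok=False)
        raise
    record_call(tool, time.perf_counter() - start, ok=True)
    return result


def render_metrics():
    """Render the recorded values as the body of a /metrics response."""
    lines = ["# TYPE mcp_tool_call_total counter"]
    for (tool, result), n in sorted(calls.items()):
        lines.append(f'mcp_tool_call_total{{tool="{tool}",result="{result}"}} {n}')
    lines.append("# TYPE mcp_tool_call_duration_seconds histogram")
    for tool, hist in sorted(durations.items()):
        for bound, n in zip(BUCKETS, hist["buckets"]):
            le = "+Inf" if bound == float("inf") else str(bound)
            lines.append(
                f'mcp_tool_call_duration_seconds_bucket{{tool="{tool}",le="{le}"}} {n}'
            )
        lines.append(f'mcp_tool_call_duration_seconds_sum{{tool="{tool}"}} {hist["sum"]}')
        lines.append(
            f'mcp_tool_call_duration_seconds_count{{tool="{tool}"}} {hist["buckets"][-1]}'
        )
    return "\n".join(lines) + "\n"
```

Wrapping each tool handler in timed_call and serving render_metrics() at /metrics is enough for Prometheus to derive per-tool throughput, error rate, and latency percentiles from the scraped series.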

What Prometheus misses for MCP monitoring

External protocol verification

Prometheus scrapes your server's own /metrics endpoint, so a successful scrape only proves that the server is reachable from inside your infrastructure. It cannot verify that the MCP endpoint itself is reachable and responding correctly from the internet, where your users' agents are making calls. A misconfigured reverse proxy, an expired TLS certificate, or a firewall rule change can make your MCP endpoint unreachable from outside while your internal Prometheus scrape continues to succeed.

Protocol-level validation

Even with the Prometheus blackbox exporter, you'd be writing a TCP or HTTP probe, not an MCP protocol probe. The HTTP prober can verify that your endpoint answers an HTTP POST and can regex-match the response body, but it doesn't parse JSON-RPC or hold a session. It can't check that the negotiated protocolVersion is one your clients support, that tools/list returns the expected schemas, or that the response envelope conforms to the MCP spec. Protocol conformance failures are largely invisible to HTTP-level probes.
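To make the difference concrete, here is a minimal sketch of what a protocol-level check might look at beyond the HTTP status code. It assumes an MCP server reachable over HTTP that answers the initialize request with a plain JSON body; servers that reply with an event stream would need extra parsing. The protocol revision string, the client name, and the function names are illustrative assumptions, and this is not a description of how AliveMCP itself probes.

```python
import json
import urllib.request


def build_initialize_request(request_id=1):
    """Build a JSON-RPC initialize request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumed revision; use the one you target
            "capabilities": {},
            "clientInfo": {"name": "example-probe", "version": "0.1.0"},
        },
    }


def validate_initialize_response(payload, request_id=1):
    """Return a list of problems; an empty list means these checks passed."""
    if not isinstance(payload, dict):
        return ["response is not a JSON object"]
    problems = []
    if payload.get("jsonrpc") != "2.0":
        problems.append("jsonrpc field is not '2.0'")
    if payload.get("id") != request_id:
        problems.append("response id does not match request id")
    if "error" in payload:
        problems.append("server returned a JSON-RPC error")
    result = payload.get("result")
    if not isinstance(result, dict):
        problems.append("missing result object")
        return problems
    version = result.get("protocolVersion")
    if not isinstance(version, str) or not version:
        problems.append("missing protocolVersion")
    if not isinstance(result.get("capabilities"), dict):
        problems.append("missing capabilities object")
    if not isinstance(result.get("serverInfo"), dict):
        problems.append("missing serverInfo object")
    return problems


def probe(url, timeout=10.0):
    """POST an initialize request to url and validate the JSON reply."""
    body = json.dumps(build_initialize_request()).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return validate_initialize_response(json.loads(response.read().decode("utf-8")))
```

An HTTP-level probe would count any 200 response as healthy; the validation step here is what catches a reply that has the right status code but the wrong shape.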

Third-party MCPs you don't own

If your product depends on third-party MCP servers (from public registries or vendor MCPs), Prometheus has nothing to say about their health. You can't instrument a server you don't control. You need external probing for the dependency graph, and that probing has to speak the same protocol the servers use.

Cross-region visibility

Your Prometheus scraper runs from a fixed location inside your infrastructure. It can't tell you whether your MCP endpoint is independently reachable from the EU, APAC, or us-west-2. Regional failures (a CDN misconfiguration, a split-brain DNS entry, a region-specific TLS certificate problem) are invisible unless you run scrapers in each region, which adds complexity most teams skip.

Combining Prometheus and AliveMCP

The two tools cover different angles of the same question. A practical setup:

Signal	Source
Handler latency (p50/p95/p99)	Prometheus histogram
Tool call error rate	Prometheus counter
Memory / CPU pressure	Prometheus + node_exporter
External protocol liveness	AliveMCP (60s probes)
Tool schema drift detection	AliveMCP (hash comparison)
Third-party dependency uptime	AliveMCP public dashboard
Public status page for users	AliveMCP status subdomain
90-day uptime history	AliveMCP probe archive

Run them together. They don't overlap; they cover each other's blind spots.
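The hash comparison behind schema drift detection can be sketched as follows. This is one plausible way to fingerprint a tools/list result, not a description of AliveMCP's actual method: the tool entries are sorted by name and serialized with sorted keys so that ordering differences alone do not change the hash. The function name tool_list_hash is made up for the example.

```python
import hashlib
import json


def tool_list_hash(tools):
    """Return a stable SHA-256 fingerprint of a list of tool definitions."""
    # Sort tools by name and keys alphabetically so ordering does not affect the hash.
    canonical = json.dumps(
        sorted(tools, key=lambda tool: tool["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Comparing the hash from the current probe against the previous one flags any change to a tool's name, description, or input schema. The same value could also be exported as the label on mcp_schema_version_info so both systems report the same fingerprint.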

Alerting across both systems

The simplest routing: Prometheus alerts go to your team's primary engineering channel (they signal internal infrastructure behavior); AliveMCP alerts go to a customer-facing on-call channel (they signal what users actually experience). A memory-pressure alert from Prometheus is usually informational. A "3 consecutive probe failures" alert from AliveMCP means users are likely affected right now.

For small teams, both channels can be the same Slack channel — the important thing is to label them clearly so you can distinguish an internal performance regression from an external availability incident at a glance.
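The "3 consecutive probe failures" rule mentioned above can be sketched as a small state machine. This is an illustrative assumption about how such a rule might work, not AliveMCP's implementation; the class name and the default threshold are chosen for the example.

```python
class ConsecutiveFailureAlert:
    """Fire once when a run of failed probes reaches the threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streak = 0  # number of failures in the current unbroken run

    def observe(self, ok):
        """Record one probe result; return True only when the alert should fire."""
        if ok:
            self.streak = 0  # any success resets the run
            return False
        self.streak += 1
        # Fire exactly once, at the moment the streak reaches the threshold.
        return self.streak == self.threshold
```

Requiring several failures in a row keeps a single dropped probe from paging anyone, and firing only at the moment the threshold is reached avoids repeating the alert on every later failed probe.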

Getting started

If you already have Prometheus running, exposing MCP-specific metrics from your server is a 30-minute task — most Node.js and Python MCP server frameworks have a middleware hook that's the right integration point. AliveMCP requires nothing from your server: it probes your public endpoint from outside and works with any MCP implementation. Check if your server is already on the public dashboard, or join the waitlist to add private endpoints alongside your Prometheus stack. See pricing for the Team tier ($49/mo) which includes private endpoint monitoring and Slack alert integration.
