Prometheus MCP monitoring
Prometheus is the right tool for measuring what's happening inside your MCP server: handler latency, tool call counts, error rates, memory pressure. It's the wrong tool for measuring whether your MCP server looks alive from the outside, because Prometheus scrapes from inside your infrastructure — it can't see what your users' agents see.
TL;DR
Use Prometheus for in-process observability (request rates, handler durations, memory, queues). Use AliveMCP for external protocol-level uptime (does the server respond to a real MCP handshake from outside your network?). The two are complementary — one watches the inside, the other watches the outside. Join the waitlist to add your private endpoints to AliveMCP alongside your existing Prometheus stack.
What Prometheus does well for MCP servers
If you're already running Prometheus in your infrastructure, instrumenting your MCP server is straightforward. Expose a /metrics endpoint with standard histogram and counter metrics:
- mcp_tool_call_duration_seconds: histogram of per-tool handler latency. Gives you p50/p95/p99 bucketed by tool name.
- mcp_tool_call_total: counter by tool name and result (success/error). Rate of this over time = your tool call throughput.
- mcp_initialize_duration_seconds: time for the protocol handshake to complete, from server-side perspective. Useful for tracking cold-start regressions after deploys.
- mcp_active_sessions: gauge of currently active client sessions (if your server is stateful). Tracks session leak behavior.
- mcp_schema_version_info: gauge-per-label that records the current tool list hash as a label value. Changes appear as metric-label transitions, visible in Grafana.
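To make the first two metrics concrete, here is a minimal, stdlib-only sketch that records tool calls and renders them in the Prometheus text exposition format. In practice you would normally use an official Prometheus client library; the ToolMetrics class, its method names, and the bucket boundaries are assumptions made for illustration, not part of any MCP framework.

```python
import time
from collections import defaultdict

# Upper bounds (in seconds) for the latency buckets. Illustrative values only.
BUCKETS = (0.05, 0.1, 0.5, 1.0, 5.0)


class ToolMetrics:
    """Hypothetical in-memory store for two of the metrics listed above."""

    def __init__(self):
        self.calls = defaultdict(int)  # (tool, result) -> count
        self.bucket_counts = defaultdict(lambda: [0] * len(BUCKETS))
        self.duration_sum = defaultdict(float)
        self.duration_count = defaultdict(int)

    def observe(self, tool, result, seconds):
        """Record one finished tool call."""
        self.calls[(tool, result)] += 1
        self.duration_sum[tool] += seconds
        self.duration_count[tool] += 1
        # Histogram buckets are cumulative: bump every bucket that fits.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[tool][i] += 1

    def timed(self, tool, handler, *args, **kwargs):
        """Run a handler, recording its latency and success/error result."""
        start = time.perf_counter()
        try:
            value = handler(*args, **kwargs)
        except Exception:
            self.observe(tool, "error", time.perf_counter() - start)
            raise
        self.observe(tool, "success", time.perf_counter() - start)
        return value

    def render(self):
        """Return the body a /metrics endpoint would serve."""
        lines = ["# TYPE mcp_tool_call_total counter"]
        for (tool, result), n in sorted(self.calls.items()):
            lines.append(
                f'mcp_tool_call_total{{tool="{tool}",result="{result}"}} {n}'
            )
        lines.append("# TYPE mcp_tool_call_duration_seconds histogram")
        name = "mcp_tool_call_duration_seconds"
        for tool in sorted(self.duration_count):
            for bound, n in zip(BUCKETS, self.bucket_counts[tool]):
                lines.append(f'{name}_bucket{{tool="{tool}",le="{bound}"}} {n}')
            total = self.duration_count[tool]
            lines.append(f'{name}_bucket{{tool="{tool}",le="+Inf"}} {total}')
            lines.append(f'{name}_sum{{tool="{tool}"}} {self.duration_sum[tool]}')
            lines.append(f'{name}_count{{tool="{tool}"}} {total}')
        return "\n".join(lines) + "\n"
```

Wrapping each tool handler in timed() keeps the instrumentation in one place, which is roughly what a middleware hook would do.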
With these, a Grafana dashboard gives you a rich real-time view of your server's internal behavior.
What Prometheus misses for MCP monitoring
External protocol verification
Prometheus scrapes your server's own /metrics endpoint from inside your infrastructure, so a successful scrape only proves that the server is reachable internally. It cannot verify that the MCP endpoint itself is reachable and responding correctly from the internet, which is where your users' agents are making calls. A misconfigured reverse proxy, an expired TLS certificate, or a firewall rule change can make your MCP endpoint unreachable from outside while your internal Prometheus scrape continues to succeed perfectly.
Protocol-level validation
Even with the Prometheus blackbox exporter, you'd be writing a TCP or HTTP probe, not an MCP protocol probe. The blackbox exporter can verify that your endpoint responds to an HTTP POST, and can pattern-match a single response body, but it has no notion of a JSON-RPC session. It can't reliably verify that the JSON-RPC response contains a valid protocolVersion, that tools/list returns the expected schemas, or that the response envelope conforms to the MCP spec. Protocol conformance failures are largely invisible to HTTP-level probes.
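As a rough illustration of what a protocol-level check adds over an HTTP status check, the sketch below builds an MCP initialize request and inspects the response envelope. The function names, the example-probe client name, and the chosen protocol version string are assumptions for this example. It also assumes the server answers with a plain JSON body rather than an event stream, and it covers only a few envelope fields, not full spec conformance.

```python
import json


def build_initialize_request(request_id=1, protocol_version="2024-11-05"):
    """Return the JSON-RPC body for an MCP initialize call."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": protocol_version,
            "capabilities": {},
            # Hypothetical client identity used by this sketch.
            "clientInfo": {"name": "example-probe", "version": "0.1.0"},
        },
    })


def check_initialize_response(body, request_id=1):
    """Return a list of problems found; an empty list means the check passed."""
    try:
        msg = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    if not isinstance(msg, dict):
        return ["response is not a JSON object"]

    problems = []
    if msg.get("jsonrpc") != "2.0":
        problems.append('missing or wrong "jsonrpc" version')
    if msg.get("id") != request_id:
        problems.append("response id does not match request id")
    if "error" in msg:
        problems.append("server returned a JSON-RPC error")
        return problems

    result = msg.get("result")
    if not isinstance(result, dict):
        problems.append('missing "result" object')
        return problems
    if not isinstance(result.get("protocolVersion"), str):
        problems.append('missing "protocolVersion" string')
    if not isinstance(result.get("serverInfo"), dict):
        problems.append('missing "serverInfo" object')
    return problems
```

An endpoint that returns HTTP 200 with a body like {"status": "ok"} would pass an HTTP probe but fail this check.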
Third-party MCPs you don't own
If your product depends on third-party MCP servers (from public registries or vendor MCPs), Prometheus has nothing to say about their health. You can't instrument a server you don't control. You need external probing for the dependency graph, and that probing has to speak the same protocol the servers use.
Cross-region visibility
Your Prometheus scraper runs from a fixed location inside your infrastructure. It doesn't see whether your MCP endpoint is independently reachable from the EU, APAC, or us-west-2. Regional failures (a CDN misconfiguration, a split-brain DNS entry, a TLS certificate problem that affects only one region) are invisible unless you run scrapers in each region, which adds complexity most teams skip.
Combining Prometheus and AliveMCP
The two tools cover different angles of the same question. A practical setup:
| Signal | Source |
|---|---|
| Handler latency (p50/p95/p99) | Prometheus histogram |
| Tool call error rate | Prometheus counter |
| Memory / CPU pressure | Prometheus + node_exporter |
| External protocol liveness | AliveMCP (60s probes) |
| Tool schema drift detection | AliveMCP (hash comparison) |
| Third-party dependency uptime | AliveMCP public dashboard |
| Public status page for users | AliveMCP status subdomain |
| 90-day uptime history | AliveMCP probe archive |
Run them together. They don't overlap; they cover each other's blind spots.
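Schema drift detection by hash comparison, like the tool list hash carried by mcp_schema_version_info, comes down to fingerprinting the tools/list result in a stable way. The sketch below is one possible approach and is not a description of how AliveMCP computes its hash; the function name and the choice of SHA-256 over canonical JSON are assumptions.

```python
import hashlib
import json


def tool_list_hash(tools):
    """Return a stable fingerprint for a list of MCP tool definitions.

    Tools are sorted by name and keys are sorted inside each object, so the
    hash changes only when the content changes, not the ordering.
    """
    ordered = sorted(tools, key=lambda tool: tool["name"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Comparing the current hash with the previously stored one is then enough to flag that a tool was added, removed, or had its input schema changed.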
Alerting across both systems
The simplest routing: Prometheus alerts go to your team's primary engineering channel (they signal internal infrastructure behavior); AliveMCP alerts go to a customer-facing oncall channel (they signal what users actually experience). A memory-pressure alert from Prometheus is informational. A "3 consecutive probe failures" alert from AliveMCP means users are broken right now.
For small teams, both channels can be the same Slack channel — the important thing is to label them clearly so you can distinguish an internal performance regression from an external availability incident at a glance.
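The "3 consecutive probe failures" rule can be expressed as a small state machine that fires once when the threshold is reached and resolves on the first success afterwards. This is a generic sketch of that pattern; the ProbeAlerter class is hypothetical and does not describe AliveMCP internals.

```python
class ProbeAlerter:
    """Turns a stream of probe results into fire/resolve events."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0    # current run of consecutive failures
        self.firing = False  # whether an alert is currently open

    def record(self, ok):
        """Record one probe result and return "fire", "resolve", or None."""
        if ok:
            was_firing = self.firing
            self.failures = 0
            self.firing = False
            return "resolve" if was_firing else None
        self.failures += 1
        if self.failures >= self.threshold and not self.firing:
            self.firing = True
            return "fire"
        return None
```

Requiring several consecutive failures trades a few minutes of detection delay for fewer pages caused by a single dropped probe.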
Getting started
If you already have Prometheus running, exposing MCP-specific metrics from your server is a 30-minute task: most Node.js and Python MCP server frameworks have a middleware hook that's the right integration point. AliveMCP requires nothing from your server: it probes your public endpoint from outside and works with any MCP implementation. Check if your server is already on the public dashboard, or join the waitlist to add private endpoints alongside your Prometheus stack. See pricing for the Team tier ($49/mo), which includes private endpoint monitoring and Slack alert integration.
Related questions
Can I forward AliveMCP alert webhooks into my Alertmanager?
Yes. Author and Team tier webhooks send a standard JSON payload. Alertmanager's own webhook receivers are for outgoing notifications, so ingestion needs a small adapter that receives the AliveMCP webhook and posts the alert to Alertmanager's alerts API. The payload includes the server slug, the probe result, the previous status, and the UTC timestamp. Full payload spec is in the Author tier documentation.
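One way to get such a payload into Alertmanager is a small adapter that translates it into the alert objects accepted by Alertmanager's POST /api/v2/alerts endpoint. The sketch below shows only the translation step. The payload field names (slug, result, previous_status, timestamp), the "up" status value, and the MCPProbeFailed alert name are all assumptions, since the real payload spec lives in the Author tier documentation. It also assumes the timestamp is already an RFC 3339 string.

```python
def to_alertmanager_alerts(payload):
    """Translate one (hypothetical) AliveMCP webhook payload into a list of
    alert objects shaped for Alertmanager's v2 alerts API."""
    alert = {
        # Labels identify the alert; they must be identical for the firing
        # and the resolving message so Alertmanager matches them up.
        "labels": {
            "alertname": "MCPProbeFailed",
            "server": payload["slug"],
            "source": "alivemcp",
        },
        "annotations": {
            "summary": (
                f'{payload["slug"]} changed from '
                f'{payload["previous_status"]} to {payload["result"]}'
            ),
        },
    }
    if payload["result"] == "up":
        # An end time at or before now marks the alert as resolved.
        alert["endsAt"] = payload["timestamp"]
    else:
        alert["startsAt"] = payload["timestamp"]
    return [alert]
```

The returned list can be serialized with json.dumps and posted as the request body; routing and grouping then follow your existing Alertmanager configuration.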
Does AliveMCP expose a Prometheus-compatible /metrics endpoint?
Not yet. That's on the Enterprise tier roadmap. For now, the API (Author tier) returns uptime data in JSON, which you can scrape with a custom exporter, or fetch with curl and feed to node_exporter's textfile collector, if you want to pull it into Grafana alongside your own metrics.
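The textfile collector route can be sketched as follows: convert the JSON into Prometheus text format and write it atomically into the directory that node_exporter's textfile collector watches. The JSON shape (a list of objects with slug, up, and uptime_ratio) and the alivemcp_* metric names are assumptions for illustration; fetching the JSON from the API is left out.

```python
import os
import tempfile


def render_uptime_metrics(servers):
    """Render a list of hypothetical uptime records as Prometheus text."""
    lines = ["# TYPE alivemcp_up gauge"]
    for s in servers:
        lines.append(f'alivemcp_up{{server="{s["slug"]}"}} {1 if s["up"] else 0}')
    lines.append("# TYPE alivemcp_uptime_ratio gauge")
    for s in servers:
        lines.append(
            f'alivemcp_uptime_ratio{{server="{s["slug"]}"}} {s["uptime_ratio"]}'
        )
    return "\n".join(lines) + "\n"


def write_textfile(path, text):
    """Write atomically so node_exporter never reads a half-written file.

    The path should end in .prom, the extension the textfile collector reads.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as owner-only; make it readable by the exporter.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

Run on a schedule, this gives Grafana a local copy of the external uptime signal next to your in-process metrics.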
What about OpenTelemetry?
If you're running OTel instead of Prometheus, the same split applies: OTel traces and spans give you internal call-path visibility; AliveMCP gives you external protocol-level uptime. OTel doesn't have a concept of "does the endpoint look healthy from outside your infrastructure" — that's a fundamentally external check.