
MCP server monitoring dashboard

A generic uptime dashboard shows you a green dot or a red dot. An MCP-aware monitoring dashboard shows you which protocol layer failed, how latency has drifted over 30 days, which tools disappeared from the surface, and whether the failure is isolated to one server or a cross-provider common-mode event. Here's what to build, and what to look at.

TL;DR

An MCP monitoring dashboard needs five views: a multi-server health matrix (one row per server, columns for each protocol layer), a latency heatmap (p50/p95/p99 over time), a tool surface changelog (schema diff on every tools/list change), a cross-server correlation panel (to separate common-mode failures from individual outages), and a 30-day uptime summary per server. AliveMCP provides all five out of the box — no Grafana configuration required.

Why generic uptime dashboards fail for MCP

Standard uptime dashboards were designed for HTTP APIs, where one question is the whole signal: is the server returning 200? MCP servers have a four-layer protocol stack in which each layer can fail independently while the layers beneath it still look healthy to a generic HTTP probe:

Transport (TCP/TLS): the connection itself. An HTTP uptime monitor catches this.
HTTP layer: the server is up but returns non-JSON-RPC responses, maintenance pages, or proxy errors. An HTTP uptime monitor usually catches 5xx here.
JSON-RPC handshake (initialize): the server speaks HTTP but fails the MCP handshake — wrong protocolVersion, auth error, or method not found. An HTTP uptime monitor sees 200 and marks the server green.
Tool surface (tools/list): initialize succeeds but tool list is empty, truncated, or schema-changed. The server is alive; your agent can't do any work.

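A dashboard built on top of a generic HTTP monitor will show at least two of those four failures (the initialize and tools/list failures) as "healthy," and it also misses HTTP-layer failures that come back with a 200, such as a maintenance page. An MCP monitoring dashboard runs all four probes and surfaces them as independent signals. See MCP server health check for the full four-layer probe sequence.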
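As a rough illustration of treating the four layers as independent signals, the sketch below classifies one probe attempt by the first layer that failed. The function names, the argument shapes, and the choice to accept only application/json responses are assumptions made for this example; they are not AliveMCP's implementation, and a server that answers over an event stream would need extra handling.

```python
import json
from typing import Optional


def jsonrpc_result(body: Optional[str]) -> Optional[dict]:
    """Return the JSON-RPC 'result' object, or None if the body is not a success response."""
    try:
        message = json.loads(body)
    except (ValueError, TypeError):
        return None  # not JSON at all (maintenance page, empty body, None)
    if not isinstance(message, dict) or message.get("jsonrpc") != "2.0":
        return None
    if "error" in message:
        return None  # e.g. auth error or method not found
    result = message.get("result")
    return result if isinstance(result, dict) else None


def first_failed_layer(connected: bool, http_status: Optional[int],
                       content_type: Optional[str], init_body: Optional[str],
                       tools_body: Optional[str]) -> Optional[str]:
    """Hypothetical classifier: name the lowest layer that failed, or None if all passed."""
    if not connected:
        return "transport"
    # Assumption: a healthy HTTP layer means a 2xx status with a JSON content type.
    is_json = "application/json" in (content_type or "").lower()
    if http_status is None or not 200 <= http_status < 300 or not is_json:
        return "http"
    init = jsonrpc_result(init_body)
    if init is None or "protocolVersion" not in init:
        return "initialize"
    tools = jsonrpc_result(tools_body)
    if tools is None or not tools.get("tools"):
        return "tools_list"  # missing or empty tool surface
    return None
```

A generic HTTP monitor would effectively stop after the second check, which is why the last two return values never appear on its dashboard.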

The five panels every MCP dashboard needs

1. Multi-server health matrix

The health matrix is the home screen of your dashboard: a table where each row is one monitored MCP server and each column is one protocol layer. Each cell shows the current state (green/yellow/red) and the probe timestamp. At a glance you can see whether one server has a tools/list failure while all others are green, or whether all servers failed at the transport layer simultaneously (a common-mode failure, covered under the cross-server correlation panel below).

Column headers: Transport (TCP connect + TLS) · HTTP (response code + content-type) · Initialize (JSON-RPC method success, protocolVersion) · Tools/List (tool count, schema hash). A fifth column for Latency (current p95 vs 30-day baseline) rounds out the row.

Useful secondary columns: SSL expiry (days remaining), Last incident (how long ago), 30-day uptime %. See MCP server SSL certificate monitoring for why SSL expiry belongs in the matrix, not a separate page.
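One possible shape for the data behind a matrix row is sketched below: one cell per protocol layer, each carrying a state and the probe timestamp. The Cell type, the layer keys, and the text rendering are illustrative assumptions; a real dashboard would render the same structure as a table.

```python
from dataclasses import dataclass
from datetime import datetime

# Layer keys are assumptions chosen for this sketch.
LAYERS = ("transport", "http", "initialize", "tools_list")


@dataclass
class Cell:
    state: str           # "green", "yellow", or "red"
    probed_at: datetime  # when the probe behind this cell ran


def render_row(server: str, cells: dict) -> str:
    """Render one matrix row as text; layers with no probe result show as unknown."""
    parts = [f"{server:<12}"]
    for layer in LAYERS:
        cell = cells.get(layer)
        if cell is None:
            parts.append(f"{layer}: unknown")
        else:
            parts.append(f"{layer}: {cell.state} ({cell.probed_at:%H:%M:%S})")
    return " | ".join(parts)
```

Keeping the timestamp inside each cell, rather than one timestamp per row, lets the layers be probed and displayed independently.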

2. Latency heatmap

An up/down check is binary; latency is continuous. A server can have 100% uptime and still be unusably slow during peak hours. The latency panel shows p50, p95, and p99 response times over a configurable time window (1 hour, 24 hours, 7 days, 30 days). Heatmap rendering, which color-codes each time bucket by its latency, makes slow periods visible immediately without needing to read numbers.

Separate the latency into protocol phases where possible: time to TCP connect, time to HTTP response (TTFB), time to complete the initialize handshake, time to receive tools/list response. Knowing that your server's latency spike is in TCP connect (DNS issue, cold start) rather than in tools/list processing (slow database query returning tool definitions) points you toward the right fix. See MCP server latency for full guidance on latency measurement and thresholds.
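The per-phase breakdown can be sketched as a percentile summary over probe timings. The example below uses the nearest-rank percentile method and assumes each probe is recorded as a mapping from a phase name to seconds; both the method and the phase names are choices made for this illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def phase_summary(probes):
    """Group timings by phase and report p50/p95/p99 for each phase.

    probes: list of dicts such as {"tcp_connect": 0.02, "tools_list": 0.30}
    (phase names here are hypothetical).
    """
    by_phase = {}
    for probe in probes:
        for phase, seconds in probe.items():
            by_phase.setdefault(phase, []).append(seconds)
    return {
        phase: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for phase, values in by_phase.items()
    }
```

Comparing the p95 of tcp_connect against the p95 of tools_list in this summary is what separates a connection problem from slow tool-definition processing.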

3. Tool surface changelog

Every time the tools/list response changes — new tool added, tool removed, description text changed, input schema modified — the dashboard should show a diff: what changed, when it changed, and what the previous state was. This is the panel that answers "when did the agent start saying it can't find the tool?" (Answer: the tool was removed from the surface at 14:32 on Tuesday.)

Schema drift is the silent killer of MCP integrations. An agent built around search_documents breaks when the tool is renamed to search_docs, and a liveness-style probe sees nothing wrong: the initialize call still succeeds and tools/list still returns a non-empty list. The changelog panel makes this visible. See schema drift in MCP tool definitions for the full taxonomy of schema changes and their impact.
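A changelog entry boils down to a diff between two tools/list snapshots. The sketch below hashes each tool definition in a canonical JSON form and reports added, removed, and changed tools by name. The function names and the use of SHA-256 are assumptions for illustration; the only thing it relies on from the protocol is that each tool entry has a name.

```python
import hashlib
import json


def schema_hash(tool: dict) -> str:
    """Hash a tool definition in canonical form so that key order does not matter."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_tools(previous: list, current: list) -> dict:
    """Compare two tool lists and report tool names that were added, removed, or changed."""
    old = {tool["name"]: schema_hash(tool) for tool in previous}
    new = {tool["name"]: schema_hash(tool) for tool in current}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(name for name in old.keys() & new.keys()
                          if old[name] != new[name]),
    }
```

Note that a rename shows up as one removal plus one addition, which is exactly the pattern that breaks an agent calling the old name.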

4. Cross-server correlation panel

When multiple servers fail at the same time, the dashboard needs to show it explicitly. The correlation panel groups incidents by time window and flags common-mode failures: if ≥ 50% of monitored servers failed within a 5-minute window, it's almost certainly not all of them failing independently — it's your probe origin losing connectivity, a shared hosting provider outage, or a CDN edge failure.

Common-mode failure detection prevents alert fatigue: instead of 40 individual P1 alerts, you get one cross-server event with a suppression note that explains the common cause. See multi-tenant MCP probe collector for the architecture behind this suppression logic.
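The grouping rule described above (at least 50% of monitored servers failing within a 5-minute window) could be implemented along these lines. This is a minimal sketch that assumes failures arrive as (server, failed_at) pairs; the function name and the greedy windowing strategy are choices made for this example rather than a description of any particular product.

```python
from datetime import datetime, timedelta


def common_mode_events(failures, total_servers,
                       window=timedelta(minutes=5), threshold=0.5):
    """Group failures into common-mode events.

    failures: list of (server_name, failed_at) tuples.
    Returns a list of (window_start, sorted_server_names) for every window in
    which the share of distinct failed servers reaches the threshold.
    """
    ordered = sorted(failures, key=lambda failure: failure[1])
    events = []
    i = 0
    while i < len(ordered):
        start = ordered[i][1]
        servers = set()
        j = i
        # Collect every failure that falls inside the window opened by failure i.
        while j < len(ordered) and ordered[j][1] - start <= window:
            servers.add(ordered[j][0])
            j += 1
        if total_servers > 0 and len(servers) / total_servers >= threshold:
            events.append((start, sorted(servers)))
            i = j  # these failures are covered by one cross-server event
        else:
            i += 1
    return events
```

Failures that fall inside a returned event are the ones to collapse into a single alert; anything left over is still routed as an individual incident.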

5. 30-day uptime summary

The historical summary panel answers "what's the SLA story for this server?" It shows uptime percentage by layer (not just overall), total downtime minutes, incident count, mean time to detection (MTTD), and mean time to recovery (MTTR). These numbers are what you show to users, put in status page reports, and use to decide whether your current monitoring cadence (60-second probes) is sufficient or whether you need shorter intervals for a mission-critical server.
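The summary numbers can be derived from a list of incidents. The sketch below assumes each incident is recorded as a (started, detected, resolved) triple and measures MTTR from incident start to resolution; some teams measure it from detection instead, so treat that definition, like the function name, as an assumption.

```python
from datetime import datetime, timedelta


def uptime_summary(incidents, period=timedelta(days=30)):
    """Summarise incidents over a reporting period.

    incidents: list of (started, detected, resolved) datetimes.
    """
    count = len(incidents)
    downtime = sum((resolved - started for started, _, resolved in incidents),
                   timedelta())
    detection = sum((detected - started for started, detected, _ in incidents),
                    timedelta())
    return {
        "uptime_pct": 100 * (1 - downtime / period),
        "downtime_minutes": downtime.total_seconds() / 60,
        "incidents": count,
        "mttd": detection / count if count else None,
        "mttr": downtime / count if count else None,
    }
```

Running the same function once per protocol layer, with that layer's incidents only, gives the per-layer uptime figures the panel calls for.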

Building a custom MCP dashboard with Grafana

If you want to self-host your MCP monitoring dashboard, Grafana + Prometheus is the most common stack. The setup has three parts:

Prometheus exporter for MCP probes. Write a probe runner that calls initialize + tools/list on each monitored server every 60 seconds and exposes the results as Prometheus metrics: mcp_probe_success{server="...", layer="transport"}, mcp_probe_latency_seconds{server="...", layer="initialize", quantile="0.95"}, mcp_tools_count{server="..."}, mcp_tools_schema_hash{server="..."}. See Prometheus MCP monitoring for the full metric schema.
Grafana dashboard panels. Build one row per server with four stat panels (one per protocol layer), a timeseries panel for latency percentiles, and a state timeline panel for uptime history. The tool surface changelog requires a text panel reading from a Loki log stream or a custom Grafana plugin — it's the hardest panel to build from scratch.
Alert rules. Grafana alerting rules trigger on Prometheus query thresholds: fire P1 when mcp_probe_success{layer="transport"} == 0 for 3 consecutive minutes, P2 when mcp_tools_count drops below its 7-day minimum, P3 when mcp_probe_latency_seconds{quantile="0.95"} exceeds 3× the 30-day p95 baseline. See MCP server alerting for the full severity ladder and routing table.
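For the exporter part, the sketch below renders probe results in the Prometheus text exposition format using only the standard library, so the output can be served from any HTTP handler. The input shape and function names are assumptions for illustration. It covers mcp_probe_success and mcp_tools_count only: Prometheus sample values are numeric, and the text above does not say how the schema hash is encoded, so mcp_tools_schema_hash is left out here.

```python
def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline as the text exposition format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(results) -> str:
    """Render probe results as Prometheus text exposition lines.

    results: list of dicts shaped like (hypothetical shape)
        {"server": "search", "layers": {"transport": True, ...}, "tools_count": 7}
    """
    lines = ["# TYPE mcp_probe_success gauge"]
    for result in results:
        server = escape_label(result["server"])
        for layer, ok in result["layers"].items():
            value = 1 if ok else 0
            lines.append(
                f'mcp_probe_success{{server="{server}",layer="{escape_label(layer)}"}} {value}'
            )
    lines.append("# TYPE mcp_tools_count gauge")
    for result in results:
        server = escape_label(result["server"])
        lines.append(f'mcp_tools_count{{server="{server}"}} {result["tools_count"]}')
    return "\n".join(lines) + "\n"
```

Exposing success as a 0/1 gauge per layer is what makes alert conditions such as mcp_probe_success{layer="transport"} == 0 straightforward to write.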

The build cost for a self-hosted Grafana stack is roughly 8–12 engineering hours for the initial setup, plus ongoing maintenance of the probe exporter as MCP protocol versions evolve. AliveMCP's Team tier ($49/mo) provides all five panels pre-built with no infrastructure to maintain — the trade-off is worth evaluating against your team's available bandwidth.

Public vs private dashboard views

Your monitoring dashboard typically has two audiences with different access requirements:

Internal engineering view: full four-layer detail, latency percentiles, schema diffs, incident history, alert configuration. Password-protected or SSO-gated. This is the dashboard your on-call engineer has open on their second monitor.
External user-facing status page: simplified uptime indicator, current incident banner, 90-day history. No tool surface detail (that's internal IP). Publicly accessible. Embeddable as a badge in documentation or README. See MCP server status page for the design decisions around public status pages, and MCP server uptime badge for the embed pattern.

The cleanest implementation keeps these as two separate rendering targets from the same underlying data store — not two separate monitoring setups. AliveMCP generates both from the same probe data: the internal dashboard is the management console, and the public status page is the /status/{server-slug} URL you share with users.

Dashboard refresh rate and probe cadence

Dashboard refresh rate and probe cadence are independent settings that are often confused. Probe cadence is how often the monitoring system actually probes the server: AliveMCP runs a probe every 60 seconds. Dashboard refresh rate is how often the browser fetches updated data from the monitoring backend.

For a real-time incident view, a 15-second dashboard refresh is reasonable: you see new probe results shortly after they arrive. For a daily health review, refreshing every 5 minutes is fine. The key point is that refreshing faster than the probe cadence cannot show you fresher data. If your probes run every 60 seconds, a 10-second dashboard refresh will just show the same data six times.

In practice, the dashboard should show the probe timestamp alongside each status indicator so the viewer knows exactly how stale the data could be. "Transport: green (14:32:00)" is more informative than "Transport: green" when you're trying to determine if an incident you just learned about has already resolved.
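A small sketch of that labelling, together with a rough worst-case staleness estimate, is shown below. It assumes the worst case is approximately the probe interval plus the refresh interval (ignoring probe duration and backend delay); the function names and the label layout are illustrative only.

```python
from datetime import datetime, timedelta


def max_staleness(probe_interval: timedelta, refresh_interval: timedelta) -> timedelta:
    """Approximate worst-case age of what the viewer sees.

    The latest probe can be up to one probe interval old when the browser fetches
    it, and the browser can then display it for up to one refresh interval.
    """
    return probe_interval + refresh_interval


def status_label(layer: str, state: str, probed_at: datetime, now: datetime) -> str:
    """Format a status indicator with its probe timestamp and age in seconds."""
    age_seconds = int((now - probed_at).total_seconds())
    return f"{layer}: {state} ({probed_at:%H:%M:%S}, {age_seconds}s ago)"
```

With 60-second probes and a 15-second refresh, the estimate comes to 75 seconds, which is a useful number to keep in mind when deciding whether an incident has really resolved.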