
MCP server alerting

Alert routing for MCP servers fails in two ways: it wakes up engineers for noise they can't act on, or it misses real failures until a user complains. Both are fixable with a severity ladder, a routing table, and three suppression rules. Here's the complete wiring.

TL;DR

Map each failure mode to a severity level (P1–P4), route each severity to a different channel (PagerDuty / Slack / digest / weekly), suppress duplicate alerts with a 15-minute dedup window, and keep a maintenance-mode bypass for planned downtime. AliveMCP Author tier ($9/mo) ships this wiring out of the box — you paste a webhook URL, we do the routing and dedup.

Why MCP server alerting is different from HTTP alerting

A standard HTTP uptime monitor fires one alert: "got 5xx, site down." An MCP server has a four-layer protocol where each layer can fail silently while the layers above it appear healthy:

  1. Transport (TCP/TLS): the socket never opens or TLS handshake fails. Easy to detect — your HTTP client errors immediately.
  2. HTTP layer: the server responds 200 but the body is not JSON-RPC. Usually a reverse-proxy misconfiguration, a maintenance page, or a catch-all error handler that eats the request.
  3. JSON-RPC handshake (initialize): the server speaks HTTP but doesn't implement the MCP initialize method, returns an auth error, or replies with a protocolVersion your client doesn't accept.
  4. Tool surface (tools/list): initialize succeeds but tools/list returns an empty array, a subset of expected tools, or a structurally different schema. The server is "up" but your agent can't do any work.

An HTTP uptime monitor catches failure at layer 1 and sometimes layer 2. Layers 3 and 4 require MCP-aware probing, which is what AliveMCP runs every 60 seconds. Your alert routing needs to cover all four layers, not just the TCP check.
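The four layers can be treated as an ordered checklist where the first failing layer names the failure mode. The sketch below assumes a probe has already run and recorded its observations; `ProbeResult`, `classify`, and the layer labels are hypothetical names for illustration, not AliveMCP's API. It only checks for missing tool names, not for structural schema changes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Observations from one probe run (hypothetical shape)."""
    connected: bool                    # layer 1: TCP/TLS succeeded
    body_is_jsonrpc: bool = False      # layer 2: HTTP body parsed as JSON-RPC
    initialize_ok: bool = False        # layer 3: initialize returned an acceptable result
    tool_names: Optional[list] = None  # layer 4: names from tools/list, None if not reached


def classify(result: ProbeResult, expected_tools: set) -> Optional[str]:
    """Return the first failing layer, or None when all four layers pass."""
    if not result.connected:
        return "transport"
    if not result.body_is_jsonrpc:
        return "http"
    if not result.initialize_ok:
        return "initialize"
    # An empty list or a subset of the expected tools both count as a tool-surface failure.
    missing = expected_tools - set(result.tool_names or [])
    if missing:
        return "tools"
    return None
```

Checking the layers in order matters: a transport failure makes every later check meaningless, so only the lowest failing layer is reported.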

The severity ladder (P1–P4)

Each failure mode maps to a severity level. The mapping should be stable and documented — if your team argues over whether a missing tool is P2 or P3 during an incident, the mapping isn't documented well enough.

The most common calibration mistake is putting schema-hash changes at P1. Schema drift is worth knowing about, but a changed hash doesn't mean broken: it could be an intentional tool addition. P2 with a 5-minute Slack post is the right home for schema events, with an escalation path to P1 if the tool count actually drops.
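The schema-drift rule can be written as a small function. This is a minimal sketch that encodes only the rule stated above (changed hash is P2, a drop in tool count escalates to P1); the function name and its parameters are assumptions made for illustration.

```python
from typing import Optional


def schema_event_severity(old_hash: str, new_hash: str,
                          old_tool_count: int, new_tool_count: int) -> Optional[str]:
    """Map a schema observation to a severity, or None if nothing changed."""
    if new_tool_count < old_tool_count:
        return "P1"   # tools disappeared: escalate
    if new_hash != old_hash:
        return "P2"   # drift worth a Slack post, possibly an intentional change
    return None
```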

Routing table

Severity alone isn't routing — you also need the destination. A minimal routing table for a two-person MCP team:

P1  → PagerDuty (alerts → on-call rotation)
        + Slack #oncall (informational mirror)
P2  → Slack #mcp-health (tagged, not @channel)
P3  → Slack #mcp-digest (daily roll-up post at 09:00 UTC)
P4  → email digest (weekly, Monday 08:00 UTC)
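Keeping the table as data rather than as branching logic makes it easy to review and change. A minimal sketch follows; the destination labels (`"pagerduty"`, `"slack-digest"`, and so on) are hypothetical identifiers that your own sender code would interpret.

```python
# Severity -> list of (channel type, target). Mirrors the table above.
ROUTES = {
    "P1": [("pagerduty", "on-call rotation"), ("slack", "#oncall")],
    "P2": [("slack", "#mcp-health")],
    "P3": [("slack-digest", "#mcp-digest")],   # rolled up daily at 09:00 UTC
    "P4": [("email-digest", "weekly")],        # Monday 08:00 UTC
}


def destinations(severity: str) -> list:
    """Look up where an alert of the given severity should go."""
    try:
        return ROUTES[severity]
    except KeyError:
        # Fail loudly: an unmapped severity should never be silently dropped.
        raise ValueError(f"unknown severity: {severity!r}")
```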

Larger teams typically split P1 into two routing targets: the MCP owner (always paged) and the platform on-call (paged only if the failure blocks multiple downstream agents). That split requires ownership metadata per endpoint — either in your monitoring config or in a CODEOWNERS-style file next to your MCP repo.

The three suppression rules that prevent alert fatigue

Without suppression, a server that stays down for an hour generates one alert per minute. Nobody reads alert #12, let alone #60. Three rules keep your alert volume at signal level:

  1. Consecutive-probe threshold before first fire. Don't fire P1 on the first failed probe; require 3 consecutive failures within a 5-minute window. This absorbs transient network blips and cold-start latency spikes at a small, bounded cost in detection speed: with 60-second probes, the third consecutive failure arrives 2 to 3 minutes after the outage begins.
  2. Dedup window after first fire. Once an incident fires, suppress all duplicate alerts for the same (server_id, failure_mode) pair for 15 minutes. After 15 minutes, post a "still failing" update rather than a fresh incident. This turns an hour of downtime from 60 alerts to approximately 5 (initial + 4 "still failing" updates).
  3. Maintenance mode bypass. When you push a release or run a migration, you want no routine alerts for the duration. Implement a maintenance window that accepts a start time, end time, and server scope, and suppresses all non-P1 alerts in that window. P1 should still fire, because "we went further down than expected" during maintenance is always important to know.
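The three rules compose into one decision made on every failed probe. The sketch below is one possible arrangement, not a reference implementation: `Suppressor`, `MaintenanceWindow`, and the return labels are hypothetical, time is passed in as epoch seconds, and the 5-minute window is not checked separately because three consecutive 60-second probes always fall inside it. A single success clears the incident here; the symmetric recovery rule from the recovery section is out of scope for this sketch.

```python
from dataclasses import dataclass, field

FAIL_THRESHOLD = 3         # rule 1: consecutive failures before the first alert
DEDUP_WINDOW_S = 15 * 60   # rule 2: seconds between notifications per incident


@dataclass
class MaintenanceWindow:
    start: float
    end: float
    server_ids: set

    def covers(self, server_id, now):
        return server_id in self.server_ids and self.start <= now < self.end


@dataclass
class Suppressor:
    windows: list = field(default_factory=list)
    streak: dict = field(default_factory=dict)     # (server_id, mode) -> consecutive failures
    last_sent: dict = field(default_factory=dict)  # (server_id, mode) -> time of last notification

    def on_failure(self, server_id, mode, severity, now):
        """Return 'fire', 'still_failing', or None when the alert is suppressed."""
        key = (server_id, mode)
        self.streak[key] = self.streak.get(key, 0) + 1
        if self.streak[key] < FAIL_THRESHOLD:
            return None                            # rule 1: not enough evidence yet
        if severity != "P1" and any(w.covers(server_id, now) for w in self.windows):
            return None                            # rule 3: maintenance mutes non-P1
        last = self.last_sent.get(key)
        if last is None:
            self.last_sent[key] = now
            return "fire"
        if now - last >= DEDUP_WINDOW_S:
            self.last_sent[key] = now
            return "still_failing"                 # rule 2: periodic update, not a new incident
        return None                                # rule 2: duplicate inside the window

    def on_success(self, server_id, mode):
        key = (server_id, mode)
        self.streak.pop(key, None)
        self.last_sent.pop(key, None)
```

Keying both dictionaries on the (server_id, failure_mode) pair means a server that fails in a new way opens a separate incident instead of being swallowed by the dedup window of the first one.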

Escalation: when an alert doesn't get acknowledged

A P2 Slack post that nobody looks at for 2 hours is functionally a missed alert. Escalation policies close this gap:

PagerDuty, Opsgenie, and Grafana OnCall all support escalation policies natively. If you're routing via Slack only, you need to implement escalation in your alerting code — a cron that checks open incidents and re-notifies if no emoji reaction or reply has appeared in the thread within the window.
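For the Slack-only case, the cron job reduces to a filter over open incidents. A minimal sketch, assuming you already record when each incident was posted and whether a reaction or reply has been seen; `OpenIncident`, `needs_escalation`, and the `ack_window_s` parameter are hypothetical names, and the window length is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class OpenIncident:
    incident_id: str
    posted_at: float      # epoch seconds when the alert was posted
    acknowledged: bool    # True once a reaction or thread reply was observed


def needs_escalation(incidents, now, ack_window_s):
    """Return ids of incidents that are still unacknowledged after the window."""
    return [
        i.incident_id
        for i in incidents
        if not i.acknowledged and now - i.posted_at >= ack_window_s
    ]
```

The job that calls this would re-notify for each returned id, for example by mentioning the next person in the rotation.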

Recovery alerts: the other half of the signal

Every incident alert needs a paired recovery alert. Without it, the team sees the P1 fire but never learns the server came back — they assume it's still broken, and someone manually re-probes it hours later. Recovery alerts should post to the same channel as the incident, thread into the original message (not a top-level post), and include the total downtime duration and the failure mode that triggered the incident.

The recovery condition is symmetric to the fire condition: if you require 3 consecutive failures to fire, require 3 consecutive successes to recover. An instant-recovery on the first passing probe is how you get flapping alerts when a server teeters on the edge of health. See MCP server flapping for the full treatment of that failure mode.
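The symmetric condition amounts to a two-state machine with the same threshold in both directions. The sketch below is an illustration under that assumption; the class name `HealthState` and its return labels are hypothetical.

```python
class HealthState:
    """Flip between healthy and failing only after `threshold` consecutive
    probes that contradict the current state."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failing = False   # True while an incident is open
        self.streak = 0        # consecutive probes contradicting the current state

    def observe(self, probe_ok: bool):
        """Feed one probe result; return 'fire', 'recover', or None."""
        contradicts = probe_ok if self.failing else not probe_ok
        if not contradicts:
            self.streak = 0    # any agreeing probe resets the count
            return None
        self.streak += 1
        if self.streak < self.threshold:
            return None
        self.failing = not self.failing
        self.streak = 0
        return "fire" if self.failing else "recover"
```

Because one agreeing probe resets the streak, a server that alternates between passing and failing stays in its current state rather than generating a fire and recover pair on every flip.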

PagerDuty wiring for MCP servers

If you're using PagerDuty, the integration contract is:

POST https://events.pagerduty.com/v2/enqueue
{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "mcp-{server_id}-{failure_mode}",
  "payload": {
    "summary": "P1: {server_slug} — {failure_mode}",
    "severity": "critical",
    "source": "alivemcp.com",
    "component": "{server_slug}",
    "custom_details": {
      "probe_url": "https://alivemcp.com/status/{server_slug}",
      "failure_mode": "{failure_mode}",
      "consecutive_failures": "{count}",
      "first_seen": "{iso8601}"
    }
  }
}

The dedup_key is the field that collapses re-fires into the same incident rather than opening a new one. Use event_action: "resolve" with the same dedup_key on recovery. AliveMCP Team tier ($49/mo) has native PagerDuty routing with configurable severity mapping — no code required.
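If you are wiring this yourself, the trigger and resolve events can share one builder so the dedup_key cannot drift between them. This is a hedged sketch using only the Python standard library: `build_event` and `send` are hypothetical helper names, the field values mirror the payload above, and error handling and retries are omitted.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(routing_key, server_id, server_slug, failure_mode, action="trigger"):
    """Build a trigger or resolve event with a stable dedup_key."""
    event = {
        "routing_key": routing_key,
        "event_action": action,
        # Same key for trigger and resolve, so the resolve closes the right incident.
        "dedup_key": f"mcp-{server_id}-{failure_mode}",
    }
    if action == "trigger":
        # Only trigger events carry a payload in this sketch.
        event["payload"] = {
            "summary": f"P1: {server_slug} - {failure_mode}",
            "severity": "critical",
            "source": "alivemcp.com",
            "component": server_slug,
        }
    return event


def send(event, timeout=10):
    """POST the event as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

On recovery, call `build_event(..., action="resolve")` with the same server id and failure mode, then pass the result to `send`.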

Related questions

Should I alert on latency or just on failures?

Both, but separate. Latency alerts are P2/P3 by default — a slow server is worse than a fast server but better than a dead one. Set a p95 threshold at 3× your 30-day baseline and require 3 consecutive probe periods before firing. Don't alert on p50 — the median hides burst traffic patterns. Alert on p95 or p99.
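A sketch of that rule, assuming you keep one p95 value per probe period and a precomputed 30-day baseline p95; the function names and parameters are illustrative. The percentile uses `statistics.quantiles` from the standard library, which needs at least two samples.

```python
import statistics


def p95(samples):
    """95th percentile of latency samples (requires at least two values)."""
    # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]


def latency_alert(period_p95s, baseline_p95, factor=3.0, periods=3):
    """True when the most recent `periods` p95 values all exceed factor x baseline."""
    recent = period_p95s[-periods:]
    return len(recent) == periods and all(v > factor * baseline_p95 for v in recent)
```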

How do I handle multi-region MCP deployments?

Fire P1 only when the failure is global (all regions failing). Single-region failures are P2: degraded but not dead. Each alert should carry a scope field (global / region), for example in the event's custom details, so your escalation policy can treat the two cases differently.
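The global versus single-region rule can be sketched as a function over per-region probe results. The function name, the input shape (region name mapped to a pass/fail boolean), and the returned dictionary are assumptions for illustration.

```python
def region_severity(region_ok: dict):
    """Map per-region probe results to a severity and scope, or None if all pass."""
    failing = sorted(r for r, ok in region_ok.items() if not ok)
    if not failing:
        return None
    # Global only when every probed region is failing.
    scope = "global" if len(failing) == len(region_ok) else "region"
    return {
        "severity": "P1" if scope == "global" else "P2",
        "scope": scope,
        "regions": failing,
    }
```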

What's the minimum viable alerting setup for a solo MCP author?

One Slack webhook, one rule: if 3 consecutive probes fail, post to Slack with a link to the status page. That's it. The four-tier routing table is for teams. A solo author just needs to know before their users do — which is exactly what AliveMCP's Author tier ($9/mo) delivers with zero infrastructure setup.
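If you would rather build the one-rule setup yourself, it fits in a few lines. A minimal sketch, assuming a Slack incoming webhook (which accepts a JSON body with a `text` field) and a list of recent probe results kept by your own probe loop; the function names are hypothetical. It posts exactly once, when the failure streak first reaches the threshold.

```python
import json
import urllib.request


def should_post(history, threshold=3):
    """True only when the trailing failure streak has just reached the threshold.

    `history` is a list of booleans, oldest first, True meaning the probe passed.
    """
    streak = 0
    for ok in reversed(history):
        if ok:
            break
        streak += 1
    return streak == threshold


def slack_message(server_slug, status_url, threshold=3):
    """Build the JSON body for a Slack incoming webhook."""
    return {"text": f"{server_slug} failed {threshold} consecutive probes. Status: {status_url}"}


def post_to_slack(webhook_url, message, timeout=10):
    """POST the message to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```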

How does AliveMCP handle alerting for servers with authentication?

On the Author tier, you provide a demo credential (API key, Bearer token, or OAuth client credential) that AliveMCP stores encrypted and uses only for your server's probes. Alerts fire on auth failure separately from protocol failure — so you can tell "my server is down" from "my demo credential expired."
