MCP server alerting

Alert routing for MCP servers fails in two ways: it wakes up engineers for noise they can't act on, or it misses real failures until a user complains. Both are fixable with a severity ladder, a routing table, and three suppression rules. Here's the complete wiring.

TL;DR

Map each failure mode to a severity level (P1–P4), route each severity to a different channel (PagerDuty / Slack / daily digest / weekly digest), require 3 consecutive failed probes before the first alert, suppress duplicate alerts with a 15-minute dedup window, and keep a maintenance-mode bypass for planned downtime. AliveMCP Author tier ($9/mo) ships this wiring out of the box: you paste a webhook URL, we do the routing and dedup.

Why MCP server alerting is different from HTTP alerting

A standard HTTP uptime monitor fires one alert: "got 5xx, site down." An MCP server has a four-layer protocol where each layer can fail silently while the layers above it appear healthy:

1. Transport (TCP/TLS): the socket never opens or the TLS handshake fails. Easy to detect: your HTTP client errors immediately.
2. HTTP layer: the server responds 200 but the body is not JSON-RPC. Usually a reverse-proxy misconfiguration, a maintenance page, or a catch-all error handler that eats the request.
3. JSON-RPC handshake (initialize): the server speaks HTTP but doesn't implement the MCP initialize method, returns an auth error, or replies with a protocolVersion your client doesn't accept.
4. Tool surface (tools/list): initialize succeeds but tools/list returns an empty array, a subset of expected tools, or a structurally different schema. The server is "up" but your agent can't do any work.

An HTTP uptime monitor catches failure at layer 1 and sometimes layer 2. Layers 3 and 4 require MCP-aware probing, which is what AliveMCP runs every 60 seconds. Your alert routing needs to cover all four layers, not just the TCP check.
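
As a rough illustration of layer-aware probing, the sketch below reduces one probe to the lowest layer that failed. The ProbeResult fields and the failing_layer helper are hypothetical names chosen for this example, not AliveMCP's implementation, and the tool-surface check only looks for missing tool names rather than comparing schemas.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProbeResult:
    """Hypothetical record of one MCP-aware probe (field names are assumptions)."""
    connected: bool                                  # TCP connect and TLS handshake succeeded
    body_is_jsonrpc: bool = False                    # response body parsed as JSON-RPC
    initialize_ok: bool = False                      # initialize succeeded with an accepted protocolVersion
    tools: list = field(default_factory=list)        # tool names returned by tools/list


def failing_layer(result: ProbeResult, expected_tools: set) -> Optional[str]:
    """Return the lowest layer that failed, or None if all four layers look healthy."""
    if not result.connected:
        return "transport"
    if not result.body_is_jsonrpc:
        return "http"
    if not result.initialize_ok:
        return "initialize"
    if not expected_tools.issubset(result.tools):
        # Covers both an empty array and a partial tool surface.
        return "tool_surface"
    return None
```

Checking layers from the bottom up means a transport failure is never misreported as a missing tool, which keeps the failure mode attached to each alert meaningful.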

The severity ladder (P1–P4)

Each failure mode maps to a severity level. The mapping should be stable and documented — if your team argues over whether a missing tool is P2 or P3 during an incident, the mapping isn't documented well enough.

P1 — page on-call immediately: TCP refused for 3+ consecutive probes, TLS certificate expired, initialize returns auth error, tools/list returns HTTP 5xx, complete tool surface gone (0 tools where ≥ 1 expected), no response in 30 seconds for 3+ probes.
P2 — Slack, no page, within 5 minutes: tool surface shrinkage (tools count dropped but not to zero), schema hash changed outside a release window, protocolVersion mismatch on non-breaking version bump, p95 latency ≥ 3× 30-day baseline for 3+ consecutive probes.
P3 — daily digest: single-probe timeouts below 1% rate, transient errors recovering within 5 minutes, latency creep that hasn't crossed the p95 threshold, new tool added (informational).
P4 — weekly digest: description text changed, capabilities block additions, server-info version bumps with no schema change, any event that resolves before the next probe.

The most common calibration mistake is putting schema-hash changes at P1. Schema drift is worth knowing about, but a changed hash doesn't mean broken: it could be an intentional tool addition. P2 with a 5-minute Slack post is the right home for schema events, with an escalation path to P1 if the tool count drops to zero (the P1 "complete tool surface gone" condition above).
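
One way to keep the ladder stable and documented is to store it as data rather than as scattered conditionals. The sketch below is an assumption-laden example: the failure-mode keys are illustrative names invented here, and raising on an unknown mode is a design choice of this sketch, not something the ladder itself prescribes.

```python
from enum import IntEnum


class Severity(IntEnum):
    P1 = 1  # page on-call immediately
    P2 = 2  # Slack, no page
    P3 = 3  # daily digest
    P4 = 4  # weekly digest


# Failure-mode keys are illustrative names, not a fixed vocabulary.
SEVERITY_BY_FAILURE_MODE = {
    "tcp_refused": Severity.P1,
    "tls_cert_expired": Severity.P1,
    "initialize_auth_error": Severity.P1,
    "tools_list_5xx": Severity.P1,
    "tool_surface_empty": Severity.P1,
    "probe_timeout_sustained": Severity.P1,
    "tool_surface_shrunk": Severity.P2,
    "schema_hash_changed": Severity.P2,
    "protocol_version_mismatch": Severity.P2,
    "latency_p95_regression": Severity.P2,
    "single_probe_timeout": Severity.P3,
    "transient_error": Severity.P3,
    "tool_added": Severity.P3,
    "description_changed": Severity.P4,
    "capabilities_added": Severity.P4,
    "server_info_version_bump": Severity.P4,
}


def classify(failure_mode: str) -> Severity:
    """Look up the documented severity; unknown modes raise KeyError on purpose."""
    return SEVERITY_BY_FAILURE_MODE[failure_mode]
```

Failing loudly on an unmapped mode forces the table to be extended deliberately, so the P2-versus-P3 argument happens in code review instead of during an incident.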

Routing table

Severity alone isn't routing — you also need the destination. A minimal routing table for a two-person MCP team:

P1  → PagerDuty (alerts → on-call rotation)
        + Slack #oncall (informational mirror)
P2  → Slack #mcp-health (tagged, not @channel)
P3  → Slack #mcp-digest (daily roll-up post at 09:00 UTC)
P4  → email digest (weekly, Monday 08:00 UTC)

Larger teams typically split P1 into two routing targets: the MCP owner (always paged) and the platform on-call (paged only if the failure blocks multiple downstream agents). That split requires ownership metadata per endpoint — either in your monitoring config or in a CODEOWNERS-style file next to your MCP repo.
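
The routing table and the P1 split can be expressed as a small lookup. This is a minimal sketch: the destination strings (such as "pagerduty:platform-on-call") are made-up labels for this example, and the blocks_multiple_agents flag stands in for whatever ownership metadata you keep per endpoint.

```python
# Destination labels are hypothetical; replace them with your own channel identifiers.
ROUTES = {
    "P1": ["pagerduty:on-call", "slack:#oncall"],
    "P2": ["slack:#mcp-health"],
    "P3": ["slack:#mcp-digest"],
    "P4": ["email:weekly-digest"],
}


def destinations(severity: str, blocks_multiple_agents: bool = False) -> list:
    """Return every destination for an alert of the given severity."""
    targets = list(ROUTES[severity])  # copy so callers cannot mutate the table
    if severity == "P1" and blocks_multiple_agents:
        # Larger-team split: the platform on-call is paged only for wide-impact failures.
        targets.append("pagerduty:platform-on-call")
    return targets
```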

The three suppression rules that prevent alert fatigue

Without suppression, a server that stays down for an hour generates one alert per minute. Nobody reads alert #12, let alone #60. Three rules keep your alert volume at signal level:

Consecutive-probe threshold before first fire. Don't fire P1 on the first failed probe: require 3 consecutive failures within a 5-minute window. This absorbs transient network blips and cold-start latency spikes at a small, bounded cost in detection speed. With 60-second probes, the third consecutive failure lands two minutes after the first, so the first alert fires within about 3 minutes of the outage starting.
Dedup window after first fire. Once an incident fires, suppress all duplicate alerts for the same (server_id, failure_mode) pair for 15 minutes. After 15 minutes, post a "still failing" update rather than a fresh incident. This turns an hour of downtime from 60 alerts to approximately 5 (initial + 4 "still failing" updates).
Maintenance mode bypass. When you push a release or run a migration, you don't want routine alerts for the duration. Implement a maintenance window that accepts a start time, end time, and server scope, and suppresses all non-P1 alerts in that window. P1 should still fire, because "we went further down than expected" during maintenance is always important to know.
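
The three rules compose into one decision per failed probe. The sketch below is a minimal in-memory version under stated assumptions: class and method names are hypothetical, timestamps are plain seconds, the threshold is applied to every severity for simplicity, and the 5-minute window is not checked separately because three consecutive 60-second probes always fall inside it.

```python
from dataclasses import dataclass, field

FAILURE_THRESHOLD = 3      # consecutive failed probes before the first alert
DEDUP_WINDOW_S = 15 * 60   # minimum seconds between notifications for one incident


@dataclass
class MaintenanceWindow:
    start: float
    end: float
    server_ids: set

    def covers(self, server_id: str, now: float) -> bool:
        return server_id in self.server_ids and self.start <= now < self.end


@dataclass
class Suppressor:
    windows: list = field(default_factory=list)
    streak: dict = field(default_factory=dict)     # (server_id, failure_mode) -> consecutive failures
    last_sent: dict = field(default_factory=dict)  # (server_id, failure_mode) -> last notification time

    def on_failure(self, server_id, failure_mode, severity, now):
        """Return 'fire', 'still_failing', or None (suppressed) for one failed probe."""
        key = (server_id, failure_mode)
        self.streak[key] = self.streak.get(key, 0) + 1
        if self.streak[key] < FAILURE_THRESHOLD:
            return None  # rule 1: not enough consecutive failures yet
        if severity != "P1" and any(w.covers(server_id, now) for w in self.windows):
            return None  # rule 3: maintenance mutes everything except P1
        if key not in self.last_sent:
            self.last_sent[key] = now
            return "fire"
        if now - self.last_sent[key] >= DEDUP_WINDOW_S:
            self.last_sent[key] = now
            return "still_failing"  # rule 2: periodic update, not a new incident
        return None

    def on_success(self, server_id, failure_mode):
        """A passing probe resets the failure streak; it does not close the incident."""
        self.streak.pop((server_id, failure_mode), None)

    def close(self, server_id, failure_mode):
        """Forget the incident so the next outage fires as a fresh one."""
        self.last_sent.pop((server_id, failure_mode), None)
```

Fed 60 failed probes one minute apart, this sketch emits four notifications (one fire plus three "still failing" updates), which is in line with the roughly five quoted above; the exact count depends on where the hour boundary falls.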

Escalation: when an alert doesn't get acknowledged

A P2 Slack post that nobody looks at for 2 hours is functionally a missed alert. Escalation policies close this gap:

P2 unacknowledged for 30 minutes → escalate to P1 path (PagerDuty).
P1 unacknowledged for 10 minutes → escalate to secondary on-call.
P1 unacknowledged for 30 minutes → escalate to engineering manager.

PagerDuty, Opsgenie, and Grafana OnCall all support escalation policies natively. If you're routing via Slack only, you need to implement escalation in your alerting code — a cron that checks open incidents and re-notifies if no emoji reaction or reply has appeared in the thread within the window.
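
If you do end up writing the escalation check yourself, the core of that cron job can be a pure function over open incidents. The sketch below assumes timestamps in seconds and uses invented target labels; how you detect an acknowledgement (emoji reaction, thread reply) is left outside the function.

```python
# (delay in seconds, target) pairs per severity; target labels are hypothetical.
ESCALATION_STEPS = {
    "P2": [(30 * 60, "pagerduty:on-call")],
    "P1": [
        (10 * 60, "pagerduty:secondary-on-call"),
        (30 * 60, "email:engineering-manager"),
    ],
}


def due_escalations(severity, opened_at, now, acknowledged, already_notified):
    """Return escalation targets that are due and have not been notified yet."""
    if acknowledged:
        return []
    age = now - opened_at
    return [
        target
        for delay, target in ESCALATION_STEPS.get(severity, [])
        if age >= delay and target not in already_notified
    ]
```

Tracking already_notified per incident keeps the job idempotent, so running it every minute does not re-page the same person on every pass.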

Recovery alerts: the other half of the signal

Every incident alert needs a paired recovery alert. Without it, the team sees the P1 fire but never learns the server came back — they assume it's still broken, and someone manually re-probes it hours later. Recovery alerts should post to the same channel as the incident, thread into the original message (not a top-level post), and include the total downtime duration and the failure mode that triggered the incident.

The recovery condition is symmetric to the fire condition: if you require 3 consecutive failures to fire, require 3 consecutive successes to recover. An instant-recovery on the first passing probe is how you get flapping alerts when a server teeters on the edge of health. See MCP server flapping for the full treatment of that failure mode.
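
The symmetric fire and recover thresholds amount to a small state machine with hysteresis. The class below is a sketch with hypothetical names; it tracks a single (server, failure mode) pair and leaves out the downtime-duration bookkeeping a real recovery alert would carry.

```python
class IncidentState:
    """Tracks one (server, failure mode) pair with symmetric fire/recover thresholds."""

    def __init__(self, fire_after=3, recover_after=3):
        self.fire_after = fire_after
        self.recover_after = recover_after
        self.open = False
        self.failures = 0   # current run of consecutive failed probes
        self.successes = 0  # current run of consecutive passing probes

    def observe(self, probe_ok: bool):
        """Feed one probe result; return 'fire', 'recover', or None."""
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if self.open and self.successes >= self.recover_after:
                self.open = False
                return "recover"
        else:
            self.failures += 1
            self.successes = 0
            if not self.open and self.failures >= self.fire_after:
                self.open = True
                return "fire"
        return None
```

A server that alternates pass and fail after an incident opens never reaches three consecutive successes, so the incident stays open instead of producing a recover and re-fire pair on every flap.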

PagerDuty wiring for MCP servers

If you're using PagerDuty, the integration contract is:

POST https://events.pagerduty.com/v2/enqueue
{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "dedup_key": "mcp-{server_id}-{failure_mode}",
  "payload": {
    "summary": "P1: {server_slug} — {failure_mode}",
    "severity": "critical",
    "source": "alivemcp.com",
    "component": "{server_slug}",
    "custom_details": {
      "probe_url": "https://alivemcp.com/status/{server_slug}",
      "failure_mode": "{failure_mode}",
      "consecutive_failures": "{count}",
      "first_seen": "{iso8601}"
    }
  }
}

The dedup_key is the field that collapses re-fires into the same incident rather than opening a new one. Use event_action: "resolve" with the same dedup_key on recovery. AliveMCP Team tier ($49/mo) has native PagerDuty routing with configurable severity mapping — no code required.
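
For teams wiring this by hand, the trigger and resolve events can be built from one helper so the dedup_key cannot drift between them. This is a sketch using only the Python standard library; build_event and send_event are hypothetical helper names, and the field values mirror the template above rather than any confirmed AliveMCP output.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(action, routing_key, server_id, server_slug, failure_mode,
                count=0, first_seen=""):
    """Build a trigger or resolve event that shares one dedup_key per incident."""
    event = {
        "routing_key": routing_key,
        "event_action": action,  # "trigger" or "resolve"
        "dedup_key": f"mcp-{server_id}-{failure_mode}",
    }
    if action == "trigger":
        # The payload is only attached to trigger events in this sketch.
        event["payload"] = {
            "summary": f"P1: {server_slug} - {failure_mode}",
            "severity": "critical",
            "source": "alivemcp.com",
            "component": server_slug,
            "custom_details": {
                "probe_url": f"https://alivemcp.com/status/{server_slug}",
                "failure_mode": failure_mode,
                "consecutive_failures": count,
                "first_seen": first_seen,
            },
        }
    return event


def send_event(event):
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

On recovery, call build_event with "resolve" and the same server_id and failure_mode, then pass the result to send_event; because the key is derived rather than stored, the resolve always targets the incident the trigger opened.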