Guide · Alerting
MCP server Slack alerts
Useful Slack alerts for MCP servers are specific (one failure mode per message), tiered (critical vs. digest), and dedup'd (no re-posting the same failure every minute). Here's the wiring, the payload, and how to stop paging your team for noise.
TL;DR
Four alert tiers: critical (fire immediately, page on-call), high (Slack, within 1 minute, no page), medium (daily digest), low (weekly digest). Dedup by (server_id, failure_mode) with a 15-minute TTL. Include the probe timestamp, the failure signature, and a link to the public dashboard row. AliveMCP Author tier ($9/mo) ships this webhook with zero code.
The four alert tiers (and what fires in each)
- Critical — fire now, page on-call: TCP refused, TLS failure, `initialize` returns a JSON-RPC error, `tools/list` returns an error, no response within 30 seconds, 3 consecutive failed probes.
- High — Slack within 1 minute, no page: tool-surface shrinkage (count drop > 0), schema-hash change outside a release window, p95 latency 3× baseline for 3+ consecutive probes, `protocolVersion` change without a release tag.
- Medium — daily digest: transient error rate 1–5% over 24h, single-probe timeouts <1% of total, incremental latency drift, a new tool added (interesting, not urgent).
- Low — weekly digest: description text changed but schemas stable, capabilities-block additions, server-info version bumps (useful for release audits).
The boundary between tiers is the hardest part. A good rule of thumb: if a fully-rested engineer would be annoyed to be woken up for it, it isn't critical. If they'd want to read it before morning coffee, it's high. If it's a "neat, look at that" fact, it's medium or low.
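As a sketch, that tier decision fits in a single classifier function. This is hypothetical Python, not AliveMCP's implementation; every field name on the probe dict (`error_kind`, `consecutive_failures`, `tool_count_delta`, and so on) is an assumption:

```python
# Hypothetical probe-result fields; a real probe schema may differ.
CRITICAL_ERRORS = {"tcp_refused", "tls_failure", "initialize_error",
                   "tools_list_error", "timeout_30s"}

def classify(probe: dict) -> str:
    """Map one probe result onto the four alert tiers described above."""
    if (probe.get("error_kind") in CRITICAL_ERRORS
            or probe.get("consecutive_failures", 0) >= 3):
        return "critical"
    if (probe.get("tool_count_delta", 0) < 0            # tool-surface shrinkage
            or probe.get("schema_hash_changed", False)  # outside a release window
            or probe.get("p95_ms", 0) >= 3 * probe.get("baseline_p95_ms", float("inf"))):
        return "high"
    if (probe.get("new_tool_added", False)
            or probe.get("transient_error_rate", 0.0) > 0.01):  # 1-5% over 24h
        return "medium"
    return "low"
```

The ordering matters: check critical first, fall through to low, so a probe that trips two tiers always gets the more urgent one.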
The Slack payload that actually helps
Alerts fail by being either too thin (one line, no context, person has to click out) or too noisy (50 fields, person scans nothing). Target the middle.
{
"text": "🔴 my-server.com/mcp — initialize failed",
"attachments": [{
"color": "#c40040",
"fields": [
{"title": "Failure mode", "value": "JSON-RPC error: auth_required", "short": true},
{"title": "First seen", "value": "2026-04-24 03:50:22Z", "short": true},
{"title": "Probe", "value": "initialize", "short": true},
{"title": "Consecutive", "value": "3 of 3", "short": true}
],
"actions": [
{"type": "button", "text": "Open in AliveMCP",
"url": "https://alivemcp.com/status/my-server-com"}
]
}]
}
One line that a mobile lock-screen can read. Four fields a phone-scanner can parse in 3 seconds. One button to drill in. That's it.
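A minimal stdlib-only Python sketch of assembling and POSTing that payload. The builder mirrors the example fields above; the webhook URL would come from your Slack app's incoming-webhook configuration:

```python
import json
import urllib.request

def build_alert_payload(server: str, failure: str, first_seen: str,
                        probe: str, consecutive: str, status_url: str) -> dict:
    """Assemble the one-line-plus-four-fields payload shown above."""
    return {
        "text": f"🔴 {server} — {probe} failed",
        "attachments": [{
            "color": "#c40040",
            "fields": [
                {"title": "Failure mode", "value": failure, "short": True},
                {"title": "First seen", "value": first_seen, "short": True},
                {"title": "Probe", "value": probe, "short": True},
                {"title": "Consecutive", "value": consecutive, "short": True},
            ],
            "actions": [{"type": "button", "text": "Open in AliveMCP",
                         "url": status_url}],
        }],
    }

def post_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook; returns the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```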
Dedup, throttling, and snooze
- Dedup by (server, failure mode). A server that stays down for an hour should not post 60 identical alerts. Hash `(server_id, failure_signature)` and suppress duplicates inside a 15-minute window. After 15 minutes, post a "still failing" update, not a fresh incident.
- Post recoveries, not recoveries-and-also-incident-resolved. One green message when the server comes back. Don't re-post the incident timeline — people find it in the scrollback.
- Snooze windows. Scheduled maintenance: quiet the alerts between those hours. If your monitoring can't snooze, it'll train the team to snooze Slack, which is worse.
- Per-channel tiering. Send critical → `#oncall`, high → `#mcp-monitoring`, digests → email or a `#mcp-health-digest`. Mixing tiers in one channel is how you get an ignored channel.
Wiring it yourself vs letting AliveMCP do it
A minimum DIY Slack-alert setup on a cron-probed MCP: Slack incoming webhook URL + a ~30-line script that reads your probe's last 3 rows, decides on a tier, formats the payload, and POSTs. Add a dedup table in SQLite. Add a snooze check. Add a recovery post. You're at 150 lines and a weekend.
On AliveMCP Author ($9/mo), paste the Slack webhook URL into the dashboard and we do the rest with the tier defaults above. Team tier ($49/mo) adds per-environment channels and on-call rotation support. Join the waitlist.
Common alerting mistakes
- One alert per probe. 60-second probes × 1 hour of downtime = 60 messages. Nobody reads alert #42.
- No recovery alert. Team sees the incident, never sees the resolution, assumes it's still broken. Every alert needs its mate.
- Tagging `@channel` on medium-severity. One `@channel` too many and the team mutes the channel, which means they miss the next critical.
- No link to primary data. "Server is down" without a link to the probe history is unactionable — the operator has to re-derive which server, which failure, when it started.
Related questions
Can I route to Discord or MS Teams instead?
Yes — AliveMCP webhooks post raw JSON that Discord and Teams accept with their own wrappers. The tier structure and dedup logic are identical; only the payload shape differs.
How do I stop false positives from a flaky network?
Require 3 consecutive failed probes (or 2-of-3 in a 3-minute window) before firing critical. That single change removes >90% of flappy alerts in our dataset.
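That debounce rule can be sketched in a few lines; the policy names here are illustrative, not a setting in any real product:

```python
def should_fire(failures: list[bool], policy: str = "3-consecutive") -> bool:
    """failures is oldest-first; True means the probe failed.
    Only the most recent three probes are considered."""
    recent = failures[-3:]
    if policy == "3-consecutive":
        return len(recent) == 3 and all(recent)
    if policy == "2-of-3":
        return sum(recent) >= 2
    raise ValueError(f"unknown policy: {policy}")
```

A single green probe resets the 3-consecutive count, which is exactly the behavior that absorbs one-off network blips.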
What's the right cadence for the daily digest?
One post, same time every day. 09:00 local is the usual pick. Include the 24h summary: uptime % per server, total incidents, top 3 latency offenders. No links unless something needs action.
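A sketch of formatting that digest for a Slack post; the shape of each stats row (`server`, `uptime_pct`, `incidents`, `p95_ms`) is an assumption:

```python
def build_digest(stats: list[dict]) -> str:
    """Format the 24h summary: uptime %, incident count, top 3 latency offenders."""
    lines = ["*24h MCP health digest*"]
    for s in sorted(stats, key=lambda s: s["uptime_pct"]):  # worst uptime first
        lines.append(f"• {s['server']}: {s['uptime_pct']:.2f}% uptime, "
                     f"{s['incidents']} incidents")
    worst = sorted(stats, key=lambda s: -s["p95_ms"])[:3]
    lines.append("Top latency offenders: "
                 + ", ".join(f"{s['server']} ({s['p95_ms']} ms p95)" for s in worst))
    return "\n".join(lines)
```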