Guide · Alerting
MCP server Slack alerts
Useful Slack alerts for MCP servers are specific (one failure mode per message), tiered (critical vs. digest), and dedup'd (no re-posting the same failure every minute). Here's the wiring, the payload, and how to stop paging your team for noise.
TL;DR
Four alert tiers: critical (fire immediately, page on-call), high (Slack, within 1 minute, no page), medium (daily digest), low (weekly digest). Dedup by (server_id, failure_mode) with a 15-minute TTL. Include the probe timestamp, the failure signature, and a link to the public dashboard row. AliveMCP Author tier ($9/mo) ships this webhook with zero code.
The four alert tiers (and what fires in each)
- Critical — fire now, page on-call: TCP refused, TLS failure, `initialize` returns a JSON-RPC error, `tools/list` returns an error, no response within 30 seconds, 3 consecutive failed probes.
- High — Slack within 1 minute, no page: tool-surface shrinkage (count drop > 0), schema-hash change outside a release window, p95 latency 3× baseline for 3+ consecutive probes, `protocolVersion` change without a release tag.
- Medium — daily digest: transient error rate 1–5% over 24h, single-probe timeouts <1% of total, incremental latency drift, a new tool added (interesting, not urgent).
- Low — weekly digest: description text changed but schemas stable, capabilities-block additions, server-info version bumps (useful for release audits).
The boundary between tiers is the hardest part. A good rule of thumb: if a fully-rested engineer would be annoyed to be woken up for it, it isn't critical. If they'd want to read it before morning coffee, it's high. If it's a "neat, look at that" fact, it's medium or low.
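As a sketch, that tier decision fits in a single classifier function. This is hypothetical Python, not AliveMCP's implementation; every field name on the probe dict (`error_kind`, `consecutive_failures`, `tool_count_delta`, and so on) is an assumption:

```python
# Hypothetical probe-result fields; a real probe schema may differ.
CRITICAL_ERRORS = {"tcp_refused", "tls_failure", "initialize_error",
                   "tools_list_error", "timeout_30s"}

def classify(probe: dict) -> str:
    """Map one probe result onto the four alert tiers described above."""
    if (probe.get("error_kind") in CRITICAL_ERRORS
            or probe.get("consecutive_failures", 0) >= 3):
        return "critical"
    if (probe.get("tool_count_delta", 0) < 0            # tool-surface shrinkage
            or probe.get("schema_hash_changed", False)  # outside a release window
            or probe.get("p95_ms", 0) >= 3 * probe.get("baseline_p95_ms", float("inf"))):
        return "high"
    if (probe.get("new_tool_added", False)
            or probe.get("transient_error_rate", 0.0) > 0.01):  # 1-5% over 24h
        return "medium"
    return "low"
```

The ordering matters: check critical first, fall through to low, so a probe that trips two tiers always gets the more urgent one.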
The Slack payload that actually helps
Alerts fail by being either too thin (one line, no context, person has to click out) or too noisy (50 fields, person scans nothing). Target the middle.
{
"text": "🔴 my-server.com/mcp — initialize failed",
"attachments": [{
"color": "#c40040",
"fields": [
{"title": "Failure mode", "value": "JSON-RPC error: auth_required", "short": true},
{"title": "First seen", "value": "2026-04-24 03:50:22Z", "short": true},
{"title": "Probe", "value": "initialize", "short": true},
{"title": "Consecutive", "value": "3 of 3", "short": true}
],
"actions": [
{"type": "button", "text": "Open in AliveMCP",
"url": "https://alivemcp.com/status/my-server-com"}
]
}]
}
One line that a mobile lock-screen can read. Four fields a phone-scanner can parse in 3 seconds. One button to drill in. That's it.
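A minimal stdlib-only Python sketch of assembling and POSTing that payload. The builder mirrors the example fields above; the webhook URL would come from your Slack app's incoming-webhook configuration:

```python
import json
import urllib.request

def build_alert_payload(server: str, failure: str, first_seen: str,
                        probe: str, consecutive: str, status_url: str) -> dict:
    """Assemble the one-line-plus-four-fields payload shown above."""
    return {
        "text": f"🔴 {server} — {probe} failed",
        "attachments": [{
            "color": "#c40040",
            "fields": [
                {"title": "Failure mode", "value": failure, "short": True},
                {"title": "First seen", "value": first_seen, "short": True},
                {"title": "Probe", "value": probe, "short": True},
                {"title": "Consecutive", "value": consecutive, "short": True},
            ],
            "actions": [{"type": "button", "text": "Open in AliveMCP",
                         "url": status_url}],
        }],
    }

def post_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook; returns the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```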
Dedup, throttling, and snooze
- Dedup by (server, failure mode). A server that stays down for an hour should not post 60 identical alerts. Hash `(server_id, failure_signature)` and suppress duplicates inside a 15-minute window. After 15 minutes, post a "still failing" update, not a fresh incident.
- Post recoveries, not recoveries-and-also-incident-resolved. One green message when the server comes back. Don't re-post the incident timeline — people find it in the scrollback.
- Snooze windows. Scheduled maintenance: quiet the alerts between those hours. If your monitoring can't snooze, it'll train the team to snooze Slack, which is worse.
- Per-channel tiering. Send critical → `#oncall`, high → `#mcp-monitoring`, digests → email or a `#mcp-health-digest`. Mixing tiers in one channel is how you get an ignored channel.
Wiring it yourself vs letting AliveMCP do it
A minimum DIY Slack-alert setup on a cron-probed MCP: Slack incoming webhook URL + a ~30-line script that reads your probe's last 3 rows, decides on a tier, formats the payload, and POSTs. Add a dedup table in SQLite. Add a snooze check. Add a recovery post. You're at 150 lines and a weekend.
On AliveMCP Author ($9/mo), paste the Slack webhook URL into the dashboard and we do the rest with the tier defaults above. Team tier ($49/mo) adds per-environment channels and on-call rotation support. Join the waitlist.
Common alerting mistakes
- One alert per probe. 60-second probes × 1 hour of downtime = 60 messages. Nobody reads alert #42.
- No recovery alert. Team sees the incident, never sees the resolution, assumes it's still broken. Every alert needs its mate.
- Tagging `@channel` on medium-severity. One `@channel` too many and the team mutes the channel, which means they miss the next critical.
- No link to primary data. "Server is down" without a link to the probe history is unactionable — the operator has to re-derive which server, which failure, when it started.
Related questions
Can I route to Discord or MS Teams instead?
Yes — AliveMCP webhooks post raw JSON that Discord and Teams accept with their own wrappers. The tier structure and dedup logic are identical; only the payload shape differs.
How do I stop false positives from a flaky network?
Require 3 consecutive failed probes (or 2-of-3 in a 3-minute window) before firing critical. That single change removes >90% of flappy alerts in our dataset.
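That debounce rule can be sketched in a few lines; the policy names here are illustrative, not a setting in any real product:

```python
def should_fire(failures: list[bool], policy: str = "3-consecutive") -> bool:
    """failures is oldest-first; True means the probe failed.
    Only the most recent three probes are considered."""
    recent = failures[-3:]
    if policy == "3-consecutive":
        return len(recent) == 3 and all(recent)
    if policy == "2-of-3":
        return sum(recent) >= 2
    raise ValueError(f"unknown policy: {policy}")
```

A single green probe resets the 3-consecutive count, which is exactly the behavior that absorbs one-off network blips.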
What's the right cadence for the daily digest?
One post, same time every day. 09:00 local is the usual pick. Include the 24h summary: uptime % per server, total incidents, top 3 latency offenders. No links unless something needs action.
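A sketch of formatting that digest for a Slack post; the shape of each stats row (`server`, `uptime_pct`, `incidents`, `p95_ms`) is an assumption:

```python
def build_digest(stats: list[dict]) -> str:
    """Format the 24h summary: uptime %, incident count, top 3 latency offenders."""
    lines = ["*24h MCP health digest*"]
    for s in sorted(stats, key=lambda s: s["uptime_pct"]):  # worst uptime first
        lines.append(f"• {s['server']}: {s['uptime_pct']:.2f}% uptime, "
                     f"{s['incidents']} incidents")
    worst = sorted(stats, key=lambda s: -s["p95_ms"])[:3]
    lines.append("Top latency offenders: "
                 + ", ".join(f"{s['server']} ({s['p95_ms']} ms p95)" for s in worst))
    return "\n".join(lines)
```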