MCP Server Alert Routing Architecture — multi-channel, deduplication, and escalation
Alert routing is the design layer between detection and response. AliveMCP handles detection — it tells you an MCP server is down, why, and for how long. Alert routing handles what happens next: which channel receives the alert, how it is deduplicated when the server stays down for 30 minutes, when it escalates to a louder channel, and how it closes when the server recovers. Getting this architecture right means the right person is woken up for the right incident, with zero duplicate noise, every time. Getting it wrong means alert fatigue that causes real incidents to be ignored.
TL;DR
Build a three-tier alert pipeline: (1) detect with AliveMCP, (2) classify into critical/warning/info severity based on server tier and failure type, (3) route to the appropriate channel. Use a single stable key per server for deduplication across all channels. Implement an escalation ladder with time-based level upgrades: push notification at T+0, phone call at T+5min, secondary contact at T+15min. Suppress alerts during maintenance windows by checking a local flag before routing. Always send the resolved signal through the same routing path as the trigger — auto-resolution is non-negotiable for preventing stale open incidents.
The complete alert routing pipeline
An MCP server alert begins at detection and ends at resolution. The pipeline has six stages:
- Detect. AliveMCP probes your server every minute via a full MCP initialize handshake. If the server fails 2 consecutive checks (2 minutes of confirmed downtime), it sends alert.triggered to your configured webhook URL.
- Classify. Your bridge receives the webhook and classifies the alert severity based on the server's tier (internal vs third-party), the failure type (protocol failure vs connection refused vs timeout), and any business rules you configure.
- Deduplicate. Check whether an active alert already exists for this server slug. If yes, update the existing incident rather than creating a new one.
- Route. Send the alert to the appropriate channel(s) based on severity. Critical alerts go to PagerDuty + Slack. Warning alerts go to Slack only. Info alerts go to an email digest queue.
- Escalate. If the alert is not acknowledged within the configured timeout, upgrade to the next escalation level (push → phone call → secondary contact).
- Resolve. When AliveMCP sends alert.resolved, close the incident in every channel where it was opened: resolve the PagerDuty incident, update the Slack message to green, and remove the entry from the email queue.
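The trigger, deduplicate, and resolve stages above can be sketched as a single dispatch function. This is a minimal in-memory illustration, not AliveMCP's documented payload format: it assumes the webhook body carries a `type` field (`alert.triggered` or `alert.resolved`) alongside `server_slug`, and `handleEvent` and the `Map` store are hypothetical names.

```javascript
// Minimal in-memory sketch of the trigger/resolve lifecycle.
// Assumption: the payload has `type` ('alert.triggered' | 'alert.resolved')
// and `server_slug`. A real bridge would use a shared store, not a Map.
function handleEvent(event, activeAlerts) {
  const slug = event.server_slug;
  const open = activeAlerts.get(slug);

  if (event.type === 'alert.triggered') {
    if (open) {
      // Deduplicate: the same server is still down, so update the open alert.
      open.failureCount += 1;
      return 'updated';
    }
    activeAlerts.set(slug, { failureCount: 1 });
    return 'created';
  }

  if (event.type === 'alert.resolved') {
    if (!open) return 'ignored'; // nothing is open for this server
    activeAlerts.delete(slug);   // close it in every channel, then forget it
    return 'resolved';
  }

  return 'ignored'; // unknown event type
}
```

The return value tells the caller which routing action to take; the resolve branch uses the same state lookup as the trigger branch, so every opened alert has a matching close.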
Severity taxonomy for MCP server alerts
Not all MCP server failures have the same urgency. The severity taxonomy maps failure characteristics to response expectations.
| Severity | Trigger condition | Response expectation | Notification channels |
|---|---|---|---|
| Critical (P1) | Internal MCP server down; or any server failing with connection refused (process crashed, not just degraded) | Acknowledge within 5 minutes, resolve within 30 | PagerDuty phone call + Slack ping + SMS |
| High (P2) | Author-claimed server down; or response time > 10× baseline for 5+ minutes | Acknowledge within 30 minutes, resolve within 2 hours | PagerDuty push + Slack ping |
| Medium (P3) | Third-party dependency down; or schema drift detected in tools/list | Acknowledge within 4 hours; handled during business hours | Slack message (no ping) + Jira ticket |
| Low (P4) | Brief blip (< 3 minutes, recovered); or non-critical dependency degraded | Review in morning digest; no action required | Email digest only |
The classification logic in your bridge should be deterministic and fast: if any condition matches a higher severity, use the higher severity. Do not attempt ML-based severity prediction for MCP server alerts — the rule table above covers 95% of real incidents, and a misclassified severe incident is worse than an overclassified minor one.
function classifySeverity(event) {
  const {
    failure_type, server_tier, downtime_seconds, schema_drift,
    p95_latency_ms, p95_baseline_ms,
  } = event;
  // P1: connection refused means the process is dead
  if (failure_type === 'connection_refused') return 'P1';
  // P1: internal servers are always critical
  if (server_tier === 'internal') return 'P1';
  // P2: author-claimed servers or high latency
  // (the 5-minute duration condition from the table is not checked here)
  if (server_tier === 'author') return 'P2';
  if (p95_latency_ms > p95_baseline_ms * 10) return 'P2';
  // P3: third-party or schema drift
  if (schema_drift) return 'P3';
  if (server_tier === 'thirdparty') return 'P3';
  // P4: short blips on servers that matched no rule above
  if (downtime_seconds < 180) return 'P4';
  return 'P3'; // default: medium
}
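The P2 row in the table requires latency above 10× baseline for 5 or more minutes, which a single-sample comparison cannot establish. One way to check the duration, sketched under the assumption that the bridge keeps recent probe samples as `{ timestampMs, latencyMs }` objects sorted oldest-first (both the field names and the function name are hypothetical):

```javascript
// Sketch: is latency above 10x baseline for at least 5 continuous minutes?
// Assumption: `samples` is sorted oldest-first, one entry per probe.
const SUSTAINED_MS = 5 * 60 * 1000;

function isSustainedLatencyBreach(samples, baselineMs, multiplier = 10) {
  const threshold = baselineMs * multiplier;
  let breachStart = null;
  for (const s of samples) {
    if (s.latencyMs > threshold) {
      if (breachStart === null) breachStart = s.timestampMs;
    } else {
      breachStart = null; // a healthy sample resets the streak
    }
  }
  if (breachStart === null) return false;
  const latest = samples[samples.length - 1].timestampMs;
  return latest - breachStart >= SUSTAINED_MS;
}
```

The function returns true only when the most recent sample is still breaching and the unbroken streak reaches back at least five minutes.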
Deduplication strategy
AliveMCP sends alert.triggered every minute while a server is down. Without deduplication, a 30-minute outage generates 30 separate alerts. Deduplication requires two things: a stable key per server, and state to track whether an alert is already open.
The deduplication key pattern:
- Key: alivemcp-{serverSlug}. Use the server slug, not the server name (names can change, slugs don't).
- State store: A key-value store mapping dedup_key → {channelAlertIds}, where channelAlertIds contains the identifier for the active alert in each routing channel (Slack message ID, PagerDuty incident key, OpsGenie alert alias).
- TTL: Set a 24-hour TTL on stored entries. If a server is still down after 24 hours, create a new alert rather than updating a 24-hour-old one, because the original context has scrolled out of view.
// Deduplication state store (Redis-backed)
// Assumes `redis` is an already-connected client that accepts
// ioredis-style arguments: set(key, value, 'EX', seconds).
class AlertStateStore {
  async getActiveAlert(serverSlug) {
    const raw = await redis.get(`alert:${serverSlug}`);
    return raw ? JSON.parse(raw) : null;
  }
  async setActiveAlert(serverSlug, alertIds) {
    await redis.set(
      `alert:${serverSlug}`,
      JSON.stringify(alertIds),
      'EX', 86400 // 24-hour TTL
    );
  }
  async clearActiveAlert(serverSlug) {
    await redis.del(`alert:${serverSlug}`);
  }
}
// In your webhook handler (`store` is an AlertStateStore instance,
// `severity` is the result of classifySeverity(event)):
const existing = await store.getActiveAlert(event.server_slug);
if (existing) {
  // Update all channels with new timestamp / failure count
  await updateAllChannels(existing, event);
} else {
  // Create new alerts in all channels, collect their IDs
  const alertIds = await createInAllChannels(event, severity);
  await store.setActiveAlert(event.server_slug, alertIds);
}
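For PagerDuty specifically, the stable key can be passed as the `dedup_key` of an Events API v2 event, so repeated triggers with the same key attach to the same open incident and a later resolve event with that key closes it. The sketch below only builds the request bodies. The P1 to P4 mapping onto PagerDuty's `critical`/`error`/`warning`/`info` severities and the summary wording are assumptions made for illustration, and the helper names are hypothetical.

```javascript
// Sketch: build PagerDuty Events API v2 bodies that share one dedup_key.
// Assumption: the P1-P4 to PagerDuty severity mapping is an illustrative choice.
const PD_SEVERITY = { P1: 'critical', P2: 'error', P3: 'warning', P4: 'info' };

function dedupKey(serverSlug) {
  return `alivemcp-${serverSlug}`;
}

function buildTriggerEvent(routingKey, event, severity) {
  return {
    routing_key: routingKey,
    event_action: 'trigger',
    dedup_key: dedupKey(event.server_slug),
    payload: {
      summary: `MCP server ${event.server_slug} is down (${event.failure_type})`,
      source: event.server_slug,
      severity: PD_SEVERITY[severity],
    },
  };
}

function buildResolveEvent(routingKey, event) {
  return {
    routing_key: routingKey,
    event_action: 'resolve',
    dedup_key: dedupKey(event.server_slug),
  };
}
```

Either body would be sent as JSON in a POST to https://events.pagerduty.com/v2/enqueue. Because the key is derived only from the slug, the resolve event needs no lookup of a stored incident ID.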
Escalation ladder design
An escalation ladder upgrades the alert to a louder channel if no one responds within a time threshold. For MCP servers, the standard ladder is:
| Time since trigger | Action | Channel |
|---|---|---|
| T+0 | Initial alert | Slack message + PagerDuty push notification |
| T+5 min (no ack) | Escalate to voice | PagerDuty phone call |
| T+15 min (no ack) | Escalate to secondary | PagerDuty escalates to secondary on-call person |
| T+30 min (no ack) | Escalate to manager | PagerDuty escalates to engineering manager + Slack DM |
| T+60 min | War room trigger | Create incident channel, notify all stakeholders |
The escalation ladder lives in PagerDuty or OpsGenie's escalation policy configuration, not in your bridge code. Your bridge only needs to trigger the initial incident; the on-call tool handles the escalation timing. What your bridge does control is the severity mapping: a P1 alert starts the ladder at T+0 with phone call enabled; a P2 alert starts with push only; a P3 alert never enters the ladder, because it is routed to Slack and Jira and handled during business hours.
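Although the timers themselves belong in the on-call tool, it can help to have the ladder as data for dry runs or for checking that the configured policy matches the table. The sketch below is one reading of the table: it assumes the "(no ack)" rows are skipped once the incident is acknowledged, while the T+60 row applies regardless. `LADDER` and `expectedStep` are hypothetical names.

```javascript
// Sketch: the ladder from the table as data, for dry runs and tests.
// The real timers belong in the on-call tool's escalation policy.
const LADDER = [
  { afterMin: 0, action: 'Initial alert', needsNoAck: false },
  { afterMin: 5, action: 'Escalate to voice', needsNoAck: true },
  { afterMin: 15, action: 'Escalate to secondary', needsNoAck: true },
  { afterMin: 30, action: 'Escalate to manager', needsNoAck: true },
  { afterMin: 60, action: 'War room trigger', needsNoAck: false },
];

// Returns the most recent step that should have fired by now, or null.
function expectedStep(minutesSinceTrigger, acknowledged) {
  let current = null;
  for (const step of LADDER) {
    if (minutesSinceTrigger < step.afterMin) break;
    if (step.needsNoAck && acknowledged) continue; // ack stops this rung
    current = step;
  }
  return current ? current.action : null;
}
```

For example, `expectedStep(7, false)` yields the voice escalation, while `expectedStep(20, true)` stays at the initial alert because the acknowledgement stopped the climb.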
Multi-channel fan-out
Critical MCP server failures should reach multiple channels simultaneously, not sequentially. A bridge that posts to Slack first and then PagerDuty means the first responder might see the Slack message and start investigating without acknowledging in PagerDuty — which causes the escalation ladder to continue even though someone is already handling the incident.
Fan out to all channels in parallel using Promise.allSettled, then collect the results. Using Promise.allSettled rather than Promise.all ensures that a failure in one channel (e.g., Slack is having an outage) does not prevent the alert from reaching PagerDuty.
async function routeAlert(event, severity) {
  const routes = [];
  if (severity === 'P1' || severity === 'P2') {
    routes.push(routeToPagerDuty(event, severity));
    routes.push(routeToSlack(event, severity, { ping: true }));
  }
  if (severity === 'P3') {
    routes.push(routeToSlack(event, severity, { ping: false }));
    routes.push(createJiraTicket(event));
  }
  if (severity === 'P4') {
    routes.push(addToEmailDigest(event));
  }
  // Always include Discord for community visibility
  routes.push(routeToDiscord(event, severity));
  const results = await Promise.allSettled(routes);
  const failures = results.filter(r => r.status === 'rejected');
  if (failures.length > 0) {
    console.error('Alert routing partial failure:', failures.map(f => f.reason));
    // Don't throw: partial delivery is better than no delivery
  }
  return results;
}
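The array returned by Promise.allSettled preserves the order of the input promises, so the results can be paired with channel names to build the channelAlertIds object that the deduplication store expects. A minimal sketch, assuming each route function resolves to that channel's alert identifier (the original does not specify return values) and that the caller tracks channel names in the same order it pushed the routes; `collectAlertIds` is a hypothetical helper.

```javascript
// Sketch: pair each allSettled result with its channel name.
// Assumption: each route function resolves to the channel's alert ID.
function collectAlertIds(channelNames, results) {
  const alertIds = {};
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      alertIds[channelNames[i]] = result.value;
    } else {
      failed.push(channelNames[i]); // candidate for a retry or a log line
    }
  });
  return { alertIds, failed };
}
```

Storing only the IDs of channels that succeeded means a later resolve does not try to close an alert that was never opened.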
Maintenance window suppression
Planned MCP server deployments cause brief downtime. Without suppression, a 30-second rolling restart during a deployment generates a flurry of alerts that train your team to ignore notifications.
Implement maintenance windows as time-bounded suppression rules. AliveMCP's Team tier supports API-configurable maintenance windows. Alternatively, manage suppression in your bridge:
// maintenance-windows.json: [{serverSlug, startISO, endISO, reason}]
const fs = require('fs');
// Read once at startup; restart or redeploy the bridge to pick up changes.
const windows = JSON.parse(fs.readFileSync('./maintenance-windows.json', 'utf8'));
function isInMaintenanceWindow(serverSlug) {
  const now = Date.now();
  return windows.some(w =>
    w.serverSlug === serverSlug &&
    new Date(w.startISO).getTime() <= now &&
    new Date(w.endISO).getTime() >= now
  );
}
// In your webhook handler:
if (isInMaintenanceWindow(event.server_slug)) {
  console.log(`Suppressing alert for ${event.server_slug}: maintenance window active`);
  return res.status(200).json({ suppressed: true });
}
Integrate maintenance window management into your deploy pipeline: before running helm upgrade or kubectl rollout, add the server slug to the maintenance window file with a 10-minute window. After the deploy completes, remove the window. This way alert suppression is a code-committed artifact that goes through your normal deployment review process.
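The add-before and remove-after steps can be sketched as two pure helpers that a deploy script might call around the rollout, writing the returned array back to the maintenance window file. The function names are hypothetical; the entry shape follows the {serverSlug, startISO, endISO, reason} format used above.

```javascript
// Sketch: helpers a deploy script could call around a rollout.
// Both return a new array and leave the input untouched.
function openWindow(windows, serverSlug, minutes, reason, now = new Date()) {
  const end = new Date(now.getTime() + minutes * 60 * 1000);
  return [
    ...windows,
    { serverSlug, startISO: now.toISOString(), endISO: end.toISOString(), reason },
  ];
}

function closeWindows(windows, serverSlug) {
  // Drop every window for this server once the deploy has finished.
  return windows.filter(w => w.serverSlug !== serverSlug);
}
```

Because each window carries its own end time, a deploy script that crashes before calling `closeWindows` still leaves suppression to expire on its own after the configured minutes.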
Frequently asked questions
How do I handle alert routing when my bridge itself is down?
Your bridge is a single point of failure in the alert pipeline. If it crashes, AliveMCP's webhooks have no destination and you lose alert coverage. Three mitigations: First, deploy the bridge on infrastructure that is independent of your MCP server. If a VPS outage takes down the MCP server, a bridge hosted on that same VPS goes down with it, so host the bridge with a different provider. Cloudflare Workers or Vercel Edge functions are good choices because they run on globally distributed infrastructure independent of your VPS. Second, configure AliveMCP to retry failed webhook deliveries with exponential backoff (AliveMCP retries up to 24 hours). Third, configure a PagerDuty or OpsGenie heartbeat as a dead-man switch: your bridge pings the heartbeat every 5 minutes, and if pings stop arriving, the on-call tool pages you directly, bypassing the bridge entirely.
Should I route alerts from all MCP servers through one bridge or one bridge per server?
One bridge that handles all servers is almost always the right choice. A single bridge can apply classification logic, deduplication state, and routing rules across your entire server portfolio. Multiple bridges fragment state (deduplication becomes impossible across bridges), multiply infrastructure (each bridge needs its own deployment and monitoring), and make rule changes require updates in multiple places. The only exception is security isolation: if different teams should not see each other's MCP server alerts (e.g., multi-tenant enterprise setup), separate bridges per team with separate webhook URLs per bridge is appropriate. Even then, consider a single bridge with team-level access control in the routing logic rather than fully separate deployments.
How do I handle "alert storm" when many MCP servers go down simultaneously?
An alert storm occurs when a shared dependency (a database, a network segment, a cloud provider region) causes multiple MCP servers to fail simultaneously. Naive routing creates dozens of simultaneous PagerDuty incidents. The mitigation is alert correlation: detect when multiple server slugs fail within a short window and create a single aggregated incident rather than N individual ones. In your bridge, maintain a sliding window counter: if more than 3 servers fail within 2 minutes, create a single "MCP server portfolio degradation" incident with a list of affected servers, rather than one incident per affected server. The individual incidents can be linked from the aggregated one. This requires storing timestamps of recent triggers and comparing against a correlation window on each new event.
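A minimal sketch of that sliding window counter follows. The 2-minute window and the "more than 3 servers" threshold come from the text above; the class and method names are hypothetical, and the in-memory Map would need to move to a shared store if the bridge runs as more than one instance.

```javascript
// Sketch: sliding-window correlation of trigger events.
// Defaults follow the text: more than 3 distinct servers within 2 minutes.
class StormCorrelator {
  constructor(windowMs = 2 * 60 * 1000, threshold = 3) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.recent = new Map(); // serverSlug -> time of latest trigger (ms)
  }

  record(serverSlug, nowMs = Date.now()) {
    this.recent.set(serverSlug, nowMs);
    // Drop triggers that have aged out of the correlation window.
    for (const [slug, ts] of this.recent) {
      if (nowMs - ts > this.windowMs) this.recent.delete(slug);
    }
    const affected = [...this.recent.keys()];
    return { storm: affected.length > this.threshold, affected };
  }
}
```

Call `record` on every trigger event: when `storm` is true, open or update the single aggregated incident with the `affected` list instead of creating a per-server incident.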
How long should the auto-resolution delay be after a server recovers?
AliveMCP sends alert.resolved after 2 consecutive successful checks following a failure — 2 minutes of confirmed recovery. Your bridge should forward the resolution immediately without adding an additional delay. Some teams add a 5-minute "stabilization" delay before closing the incident, under the assumption that the server might fail again immediately after recovery. In practice, this delay creates more problems than it solves: an on-call person who has fixed the issue and sees the server recover expects the incident to close immediately. A 5-minute delay looks like a bug. If the server fails again after recovery, AliveMCP will send a new alert.triggered event and the bridge creates a new incident — the resolved incident correctly represents the first failure window, and the new incident represents the second.
What is the right number of routing channels for a solo MCP server author?
Two: one async channel (Slack or Discord) for daytime awareness, and one synchronous channel (PagerDuty free tier or OpsGenie free tier) for after-hours escalation. The async channel gives you a browsable history of all alerts and recoveries. The synchronous channel wakes you up for sustained outages at night. Adding more channels without adding team members creates more noise than value. Solo authors often start with just Slack and discover that they miss alerts that arrive while they are sleeping; adding PagerDuty as the second channel solves that without adding complexity. The email digest is useful as a third channel for third-party dependency alerts that don't warrant immediate response — it collects the day's P3/P4 events for morning review.
Further reading
- PagerDuty for MCP Servers — Events API integration and escalation policies
- OpsGenie for MCP Servers — team-based routing and on-call schedules
- Discord Alerts for MCP Servers — webhook routing and embed formatting
- MCP Server Slack Alerts — channel routing and Slack Block Kit formatting
- MCP Server Incident Runbook — response playbook for common failure modes
- MCP Server Flapping — detecting and suppressing oscillating alerts