MCP server downtime alerting

A downtime alert that fires on every dropped probe will page you 20 times a month for noise. One that requires too much confirmation will miss 15-minute outages entirely. Getting MCP server downtime alerting right means threading the needle: quick enough to catch real outages within 3–5 minutes of onset, selective enough to suppress single-probe jitter, smart enough to route by severity, and quiet during planned maintenance.

TL;DR

Require 3 consecutive failing probes before firing a P1 downtime alert. At 60-second cadence, that is a 3-minute detection window with a false-positive probability under 1% for a server at 99.9% uptime. Route P1 alerts to PagerDuty or your on-call rotation, and P2 alerts (partial failure at a single layer) to Slack. Suppress alerts during registered maintenance windows. Send a recovery alert threaded into the original incident. For the most reliable downtime detection, use multi-region probing: a server that fails from three independent geographic origins simultaneously is down, not jittery.

Downtime alerting vs. error rate alerting

These are two distinct alert types that answer different questions:

Downtime alert: fires when the server transitions from "available" to "unavailable" — a state change. The trigger is N consecutive failing probes with no recovery in between. The alert represents an incident: something needs human attention now.
Error rate alert: fires when the fraction of failing probes over a rolling window exceeds a threshold — a continuous signal. The trigger is degradation that may not look like binary downtime. See MCP server error rate for the error-rate alerting model.

Both alert types are useful, and they complement each other. A server can have a 100% error rate for 5 minutes (the downtime alert fires on the third consecutive failed probe) or a 3% error rate continuously for a week (the error rate alert fires; the downtime alert never fires as long as the failures are scattered and never land 3 in a row). Your alerting config should include both.
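To make the difference concrete, here is a minimal sketch that runs both triggers over the same kind of probe history. The function names, the 100-probe rolling window, and the 2% threshold are illustrative assumptions, not values taken from any particular monitoring product.

```python
from collections import deque

def downtime_fires(results, n=3):
    """True if the history contains n consecutive failing probes (False = failed)."""
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= n:
            return True
    return False

def error_rate_fires(results, window=100, threshold=0.02):
    """True if the failure fraction over any full rolling window exceeds threshold.

    window and threshold are assumed example values.
    """
    recent = deque(maxlen=window)
    for ok in results:
        recent.append(ok)
        if len(recent) == window and recent.count(False) / window > threshold:
            return True
    return False

# Hard outage: 5 failing probes in a row inside an otherwise healthy hour.
outage = [True] * 30 + [False] * 5 + [True] * 25

# Chronic degradation: every 33rd probe fails (about 3%), never two in a row.
degraded = [i % 33 != 0 for i in range(1, 1001)]
```

The outage trips only the downtime trigger (the hour of data never fills the rolling window), while the degraded series trips only the error-rate trigger.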

The consecutive-probe confirmation window

The core design decision in downtime alerting is how many consecutive failing probes constitute confirmed downtime. This decision has two competing forces:

Fewer probes required → faster detection, more false positives. Fire on 1 failure: maximum detection speed (1 probe interval = 60 seconds), but packet-level probe jitter produces frequent false positives — a single dropped probe looks identical to a genuine outage.
More probes required → slower detection, fewer false positives. Fire on 5 failures: detection takes 5 minutes at 60-second cadence, but you're extremely confident it's real by the time the alert fires.

A common default for well-tuned monitoring is N=3 consecutive failures. At 60-second probe cadence, that gives a 3-minute detection window. The false-positive probability is small: if each probe fails spuriously with independent probability p, then 3 consecutive spurious failures occur with probability p^3. Even at a pessimistic p = 5%, that is about 0.0125%; at p = 0.1% (the failure fraction implied by 99.9% uptime) it is 10^-9. Real jitter can arrive in bursts, so treat these figures as optimistic, but the rate stays low enough to avoid alert fatigue while detection remains fast enough that incidents are caught within a business-tolerable window.

For MCP servers specifically, 3-probe confirmation matters because MCP initialization involves multiple protocol layers. A single transport timeout may be network-path jitter; 3 consecutive transport timeouts means the server or its network path is genuinely unavailable. See MCP server flapping for the detailed hysteresis math.
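A minimal sketch of the confirmation window, assuming probes arrive one at a time as booleans. The class and function names are hypothetical; the independence assumption in the probability helper is a simplification, since real probe failures are often correlated.

```python
class ConsecutiveFailureDetector:
    """Fires once when a failure streak reaches the confirmation threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streak = 0

    def observe(self, probe_ok):
        """Feed one probe result; return True only on the probe that confirms downtime."""
        if probe_ok:
            self.streak = 0  # any passing probe resets the window
            return False
        self.streak += 1
        # Equality (not >=) means a long outage fires exactly once per streak.
        return self.streak == self.threshold


def false_positive_probability(p_single, n=3):
    """Chance that n consecutive probes all fail spuriously, assuming independence."""
    return p_single ** n


def detection_window_seconds(cadence_seconds=60, n=3):
    """Time from the first failing probe interval to the confirming probe."""
    return cadence_seconds * n
```

Raising the threshold to 5 stretches the detection window to 300 seconds at the same cadence, which is the trade-off described above.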

Cold-start exemption

Serverless MCP servers (Cloud Run, Lambda, Railway free tier, Render free tier) have a cold-start problem: the first probe after an idle period can time out because the server needs anywhere from 500ms to 30s to boot. A cold start at the slow end of that range looks identical to real downtime in a probe log.

The solution is a post-idle probe exemption: suppress the first probe failure after a period of server inactivity (typically >10 minutes with no traffic). Only trigger downtime alerting if the second post-idle probe also fails. This exemption requires knowing which platforms have cold starts — AliveMCP maintains a recognized platform list (vercel.app, railway.app, render.com, onrender.com, fly.dev, *.lambda-url.*.on.aws) and applies the suppression automatically. See MCP server cold start monitoring for platform-specific thresholds.
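The exemption rule can be sketched as a small predicate. This is an illustration under stated assumptions, not AliveMCP's implementation: the wildcard patterns are derived from the platform list above, and the function names and the way idle time is passed in are hypothetical.

```python
from fnmatch import fnmatchcase

# Hostname patterns derived from the platform list above (matching style is assumed).
COLD_START_PATTERNS = [
    "*.vercel.app",
    "*.railway.app",
    "*.render.com",
    "*.onrender.com",
    "*.fly.dev",
    "*.lambda-url.*.on.aws",
]

IDLE_THRESHOLD_SECONDS = 10 * 60  # ">10 minutes with no traffic"


def is_cold_start_platform(hostname):
    """True if the hostname matches a platform known to have cold starts."""
    host = hostname.lower()
    return any(fnmatchcase(host, pattern) for pattern in COLD_START_PATTERNS)


def counts_toward_downtime(hostname, idle_seconds, failures_since_wake):
    """Decide whether a failed probe should count toward the downtime streak.

    idle_seconds: length of the idle period that preceded this probe sequence.
    failures_since_wake: failed probes already seen since that idle period ended.
    """
    if not is_cold_start_platform(hostname):
        return True
    if idle_seconds <= IDLE_THRESHOLD_SECONDS:
        return True
    # Exempt only the first post-idle failure; the second one counts.
    return failures_since_wake >= 1
```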

Severity tiers for MCP downtime

Not all downtime is equally urgent. A 4-layer MCP probe produces different signals at each layer, and the appropriate alert severity depends on which layer failed:

P1 — Full outage (transport, HTTP, or initialize layer unavailable)

Transport failure (TCP refused/timeout), HTTP 5xx, or JSON-RPC initialize error. The server is completely unreachable or refusing connections. All dependent agent workflows are failing. Page on-call immediately. Routing: PagerDuty (or equivalent interrupt-driven on-call tool), SMS, phone call after 5 minutes unacknowledged.

Trigger: 3 consecutive failing probes at transport, HTTP, or initialize layer. Resolution: 3 consecutive passing probes at the same layer.

P2 — Partial failure (tools/list layer only)

Transport, HTTP, and initialize all pass, but tools/list returns an error, an empty array, or a malformed schema. The server is alive but your agents can't discover or invoke tools. This is serious but not an all-hands emergency — agents may still be able to use cached tool definitions. Routing: Slack alert to the relevant channel, no phone wake-up unless it persists for 30+ minutes.

Trigger: 3 consecutive tools/list failures with all other layers passing. Escalate to P1 if unresolved after 30 minutes.

P3 — Degradation (latency SLO breach)

All layers pass but latency exceeds 3× the 30-day rolling baseline for 3 consecutive probes. The server is technically available but slow enough to affect agent performance. Routing: async notification (email digest, low-priority Slack message). No interrupt. See MCP server latency for latency-specific alerting thresholds.
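The three tiers can be expressed as a per-probe classifier plus a 3-probe confirmation step. The sketch below assumes a hypothetical ProbeResult record with one flag per layer; requiring the last three probes to agree on the same tier is a simplification chosen for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    """One 4-layer probe outcome (field names are illustrative)."""
    transport_ok: bool
    http_ok: bool
    initialize_ok: bool
    tools_list_ok: bool
    latency_ms: float


def classify(probe: ProbeResult, baseline_ms: float) -> Optional[str]:
    """Map a single probe to a severity tier, or None if it is healthy."""
    if not (probe.transport_ok and probe.http_ok and probe.initialize_ok):
        return "P1"  # full outage
    if not probe.tools_list_ok:
        return "P2"  # partial failure at the tools/list layer
    if probe.latency_ms > 3 * baseline_ms:
        return "P3"  # latency above 3x the rolling baseline
    return None


def confirmed_severity(probes: List[ProbeResult], baseline_ms: float, n: int = 3) -> Optional[str]:
    """Return a tier only when the last n probes all classify to that same tier."""
    if len(probes) < n:
        return None
    tiers = {classify(p, baseline_ms) for p in probes[-n:]}
    return tiers.pop() if len(tiers) == 1 else None
```

A production version would also need to decide what a mixed run such as P1, P1, P2 means; this sketch returns None for it.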

Alert routing configuration

The right routing for a downtime alert depends on the time of day, the severity, and whether anyone has already acknowledged it.

Business hours vs. after-hours routing

For indie MCP authors running a server as a side project, the same person is on-call 24/7. For teams, separate routing makes sense: business hours (9am–6pm local) → Slack channel for immediate manual response; after-hours → PagerDuty escalation policy with a 5-minute acknowledgment window before escalating to phone.
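For a team setup, the time-of-day split can be as small as one function. This is a sketch only: the channel names are placeholders, the caller is assumed to pass a timestamp already converted to the team's local time, and weekend handling is left out because the text above does not specify it.

```python
from datetime import datetime, time

BUSINESS_START = time(9, 0)   # 9am local
BUSINESS_END = time(18, 0)    # 6pm local


def p1_channel(local_now: datetime) -> str:
    """Pick the first notification channel for a P1 alert by local time of day."""
    if BUSINESS_START <= local_now.time() < BUSINESS_END:
        return "slack"      # business hours: channel for immediate manual response
    return "pagerduty"      # after hours: escalation policy with acknowledgment window
```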

Escalation policy

A well-designed escalation policy for MCP downtime:

T+0: P1 alert fires. Notify primary on-call via PagerDuty push notification.
T+5 min: Primary on-call has not acknowledged. Re-notify via SMS.
T+15 min: Still unacknowledged. Escalate to secondary on-call (if defined) or team Slack channel.
T+30 min: P2 alerts that haven't resolved upgrade to P1 escalation path.
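The timeline above maps naturally onto a small table of offsets. The sketch below is one way to evaluate it; the action names are hypothetical labels, not identifiers from PagerDuty or any other tool.

```python
# (minutes after the alert fires, action label); labels are illustrative.
ESCALATION_STEPS = [
    (0, "pagerduty_push_primary"),
    (5, "sms_primary"),
    (15, "escalate_secondary_or_team_slack"),
]

P2_UPGRADE_MINUTES = 30


def actions_due(elapsed_min, acked_at_min=None):
    """List every escalation action that should have been sent by elapsed_min.

    The T+0 notification always goes out; later steps are skipped once the
    alert has been acknowledged.
    """
    due = []
    for offset, action in ESCALATION_STEPS:
        if offset > elapsed_min:
            break
        if offset == 0 or acked_at_min is None or offset < acked_at_min:
            due.append(action)
    return due


def effective_severity(severity, minutes_unresolved):
    """Upgrade an unresolved P2 to the P1 escalation path after 30 minutes."""
    if severity == "P2" and minutes_unresolved >= P2_UPGRADE_MINUTES:
        return "P1"
    return severity
```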

Deduplication

A 2-hour outage should produce one incident, not one alert per probe cycle. Deduplication: after the initial P1 fires, suppress subsequent failure alerts for the same server until either (a) the server recovers and then goes down again, or (b) 4 hours pass. Use a unique dedup_key per server per incident window (server ID + incident start timestamp) so incident management tools like PagerDuty or OpsGenie can group all alert events into one incident thread.
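A minimal in-memory sketch of this deduplication rule follows. The key format (server ID plus a UTC timestamp) and the class name are assumptions for illustration; an incident tool only needs the key to be stable for the life of one incident window.

```python
from datetime import datetime, timedelta, timezone


def dedup_key(server_id, incident_start):
    """Build a key from server ID and incident start (format is an assumption)."""
    return f"{server_id}:{incident_start.strftime('%Y%m%dT%H%M%SZ')}"


class Deduplicator:
    """Tracks one open incident window per server."""

    def __init__(self, max_suppress=timedelta(hours=4)):
        self.max_suppress = max_suppress
        self.open_incidents = {}  # server_id -> incident window start

    def on_failure(self, server_id, now):
        """Return (dedup_key, should_notify) for a confirmed failure event."""
        start = self.open_incidents.get(server_id)
        if start is None or now - start >= self.max_suppress:
            # New incident, or the suppression cap has elapsed: notify again.
            self.open_incidents[server_id] = now
            return dedup_key(server_id, now), True
        # Same incident window: reuse the key so events group together, stay quiet.
        return dedup_key(server_id, start), False

    def on_recovery(self, server_id):
        """Close the incident so the next failure starts a fresh one."""
        self.open_incidents.pop(server_id, None)
```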

Maintenance window suppression

Planned maintenance that causes genuine downtime should not page anyone. Maintenance window suppression requires:

Registered time window: define the maintenance window in your monitoring system before it starts. Start time, end time (or duration), affected servers.
Alert suppression during window: probes continue running (so you have continuity data), but alerts are not fired. If the server is still down 5 minutes after the maintenance window ends, a post-maintenance downtime alert fires — maintenance overruns require attention too.
Window expiry protection: if you forget to end a maintenance window, cap suppression at 4 hours maximum. An accidental 24-hour window that covers a real outage defeats the purpose of monitoring entirely.
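The three requirements above reduce to one check at alert time. The sketch below assumes timezone-aware datetimes and treats a missing end time as a forgotten window; the function name is hypothetical, and applying the 4-hour cap to the grace period as well is a design choice made here so that total suppression never exceeds the cap.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_SUPPRESSION = timedelta(hours=4)        # window expiry protection
POST_WINDOW_GRACE = timedelta(minutes=5)    # time allowed for maintenance overruns


def alerts_suppressed(now: datetime, starts_at: datetime, ends_at: Optional[datetime] = None) -> bool:
    """True if downtime alerts should be held back at time `now`.

    Probes keep running regardless; only alert delivery is suppressed.
    """
    hard_limit = starts_at + MAX_SUPPRESSION
    if ends_at is None:
        suppress_until = hard_limit  # forgotten window: fall back to the cap
    else:
        suppress_until = min(ends_at + POST_WINDOW_GRACE, hard_limit)
    return starts_at <= now < suppress_until
```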

AliveMCP supports maintenance windows via the dashboard or API: POST /api/v1/monitors/{id}/maintenance with starts_at, ends_at, and an optional reason field logged to the incident timeline.
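As a rough sketch, a request to that endpoint could be assembled as shown below. Only the path and the starts_at, ends_at, and reason field names come from the text above. The base URL, the JSON body encoding, the ISO 8601 timestamp format, and the Bearer token header are all assumptions, so check the actual API documentation before relying on any of them.

```python
import json
from datetime import datetime, timezone
from typing import Optional
from urllib.request import Request


def build_maintenance_request(
    base_url: str,
    monitor_id: str,
    starts_at: datetime,
    ends_at: datetime,
    reason: Optional[str] = None,
    api_token: Optional[str] = None,
) -> Request:
    """Build (but do not send) a maintenance-window request."""
    payload = {
        "starts_at": starts_at.isoformat(),  # ISO 8601 is an assumed format
        "ends_at": ends_at.isoformat(),
    }
    if reason:
        payload["reason"] = reason
    headers = {"Content-Type": "application/json"}
    if api_token:
        headers["Authorization"] = f"Bearer {api_token}"  # assumed auth scheme
    return Request(
        f"{base_url}/api/v1/monitors/{monitor_id}/maintenance",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical base URL and monitor ID, for illustration only.
request = build_maintenance_request(
    "https://alivemcp.example",
    "mon_123",
    datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
    reason="database migration",
)
```

Passing the returned object to urllib.request.urlopen would send it; the sketch stops short of that so it can be inspected first.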

Recovery alerts

A recovery alert is as important as the downtime alert — it closes the incident. Recovery alert design:

Require 3 consecutive passing probes before declaring recovery. A server that flaps (down-up-down) should not generate a recovery alert followed by another downtime alert — the hysteresis window should hold the incident open until the server has been stable for 3 minutes. See MCP server flapping.
Thread into the original incident. Send the recovery notification as a reply or update to the original incident thread (PagerDuty incident timeline, Slack thread, email reply) rather than a new top-level message. This keeps the incident timeline coherent and makes post-incident review easier.
Include duration and layer summary. "MCP server api.example.com recovered after 23 minutes (14:32–14:55 UTC). Root layer: HTTP 5xx × 23 probes. All layers now passing."
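The recovery rules can be sketched as a two-state tracker with symmetric hysteresis, plus a formatter for the summary line. Class, method, and parameter names are hypothetical; threading the message into the original incident is left to whichever notification tool is in use.

```python
from datetime import datetime
from typing import Optional


class IncidentTracker:
    """Opens an incident after n failures in a row, closes it after n passes in a row."""

    def __init__(self, n: int = 3):
        self.n = n
        self.fail_streak = 0
        self.pass_streak = 0
        self.down = False

    def observe(self, probe_ok: bool) -> Optional[str]:
        """Return "down" or "recovered" on a state change, otherwise None."""
        if probe_ok:
            self.pass_streak += 1
            self.fail_streak = 0
        else:
            self.fail_streak += 1
            self.pass_streak = 0
        if not self.down and self.fail_streak >= self.n:
            self.down = True
            return "down"
        if self.down and self.pass_streak >= self.n:
            self.down = False
            return "recovered"
        return None


def recovery_message(host: str, start: datetime, end: datetime, layer: str, failed_probes: int) -> str:
    """Format the recovery summary with duration and failing-layer details."""
    minutes = int((end - start).total_seconds() // 60)
    return (
        f"MCP server {host} recovered after {minutes} minutes "
        f"({start:%H:%M}–{end:%H:%M} UTC). "
        f"Root layer: {layer} × {failed_probes} probes. All layers now passing."
    )
```

With this tracker, a down-up-down flap keeps the incident open: a single passing probe resets nothing but the failure streak, and recovery is declared only after three passes in a row.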

Multi-region downtime confirmation

Single-origin monitoring has a fundamental ambiguity: if your probe origin is in us-east-1 and your server is in us-east-1, a probe failure could be a server failure or a us-east-1 network issue. Multi-region probing resolves this:

All regions fail simultaneously: the server is down. High confidence. Fire P1.
One region fails, others pass: the server is up globally but unreachable from one region. Fire P2 with regional context ("available from eu-west-1 and ap-southeast-1, but not from us-east-1 — possible routing or CDN issue").
Intermittent failures across regions with no pattern: probe-origin jitter or edge-CDN intermittency. Do not fire a downtime alert; log the anomaly for review.
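A minimal sketch of cross-region correlation, assuming each region has already applied its own 3-consecutive-failure confirmation, so that unpatterned intermittent failures never show up as confirmed. The function name is hypothetical, and treating any strict subset of failing regions (not just one) as P2 is an assumption the rules above do not spell out.

```python
from typing import Dict, Optional, Tuple


def correlate(region_confirmed_down: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Combine per-region confirmed-down flags into (severity, context) or None.

    region_confirmed_down maps a region name to True when that region has
    seen its own run of consecutive failing probes.
    """
    down = [r for r, is_down in region_confirmed_down.items() if is_down]
    up = [r for r, is_down in region_confirmed_down.items() if not is_down]
    if not down:
        return None  # nothing confirmed anywhere: log anomalies, do not alert
    if not up:
        return "P1", "unreachable from all regions: " + ", ".join(down)
    context = f"available from {' and '.join(up)}, but not from {' and '.join(down)}"
    return "P2", context
```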

AliveMCP Team tier ($49/mo) includes probing from three independent geographic origins (US East, EU West, Asia Pacific) with cross-region correlation built into the alerting engine. Single-origin probing is available on Author tier ($9/mo) — sufficient for most indie MCP authors where a single-region false positive rate of <0.1% is acceptable.