
MCP server downtime alerting

A downtime alert that fires on every dropped probe will page you 20 times a month for noise. One that requires too much confirmation will miss 15-minute outages entirely. Getting MCP server downtime alerting right means threading the needle: quick enough to catch real outages within 3–5 minutes of onset, selective enough to suppress single-probe jitter, smart enough to route by severity, and quiet during planned maintenance.

TL;DR

Require 3 consecutive failing probes before firing a P1 downtime alert. At 60-second cadence, that's a 3-minute detection window with a false-positive probability under 1% for a server at 99.9% uptime. Route P1 alerts to PagerDuty or your on-call rotation and P2 alerts (partial failure at a single layer) to Slack. Suppress alerts during registered maintenance windows. Send a recovery alert threaded into the original incident. For the most reliable downtime detection, use multi-region probing: a server that fails from three independent geographic origins simultaneously is down, not jittery.

Downtime alerting vs. error rate alerting

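These are two distinct alert types that answer different questions: a downtime alert asks whether the server is failing every probe right now, while an error rate alert asks what fraction of requests fail over a longer window.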

Both alert types are useful, and they complement each other. A server can have a 100% error rate for 2 minutes (downtime alert fires) or a 3% error rate continuously for a week (error rate alert fires; downtime alert never fires because there's never 3 consecutive failures). Your alerting config should include both.

The consecutive-probe confirmation window

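The core design decision in downtime alerting is how many consecutive failing probes constitute confirmed downtime. This decision has two competing forces: detection speed, since every extra confirming probe adds one probe interval before the alert fires, and noise suppression, since every extra confirming probe makes it less likely that transient jitter triggers a page.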

A common default for well-tuned monitoring is N=3 consecutive failures. At 60-second probe cadence, that gives a 3-minute detection window. As a rough bound on false positives: if each probe against a healthy server failed independently with probability 0.1%, the chance that three given consecutive probes all fail from jitter alone would be 0.001^3, or about one in a billion. Real jitter tends to be correlated in time, so the practical rate is higher than this idealized figure. It is still low enough to avoid alert fatigue while remaining operationally fast enough that incidents are caught within a business-tolerable window.

For MCP servers specifically, 3-probe confirmation matters because MCP initialization involves multiple protocol layers. A single transport timeout may be network-path jitter; 3 consecutive transport timeouts mean the server or its network path is genuinely unavailable. See MCP server flapping for the detailed hysteresis math.
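The confirmation window can be sketched as a small state machine. This is a minimal illustration, not AliveMCP's implementation: the class name `ConfirmationWindow` and the event labels "down" and "recovered" are hypothetical, and it assumes recovery uses the same threshold as detection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfirmationWindow:
    """Tracks consecutive probe results for one server (hypothetical helper)."""

    threshold: int = 3      # N consecutive results needed to change state
    down: bool = False      # current confirmed state
    _fail_streak: int = 0
    _pass_streak: int = 0

    def observe(self, probe_ok: bool) -> Optional[str]:
        """Feed one probe result; return "down", "recovered", or None."""
        if probe_ok:
            self._pass_streak += 1
            self._fail_streak = 0
            # Recovery also needs a full streak, so one lucky probe
            # during an outage does not close the incident.
            if self.down and self._pass_streak >= self.threshold:
                self.down = False
                return "recovered"
        else:
            self._fail_streak += 1
            self._pass_streak = 0
            # A single pass resets the streak, so isolated jitter never fires.
            if not self.down and self._fail_streak >= self.threshold:
                self.down = True
                return "down"
        return None
```

Feeding it one result per probe cycle yields at most one "down" event per outage, which also keeps later failing probes from re-firing the alert.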

Cold-start exemption

Serverless MCP servers (Cloud Run, Lambda, Railway free tier, Render free tier) have a cold-start problem: the first probe after an idle period can time out because the server needs anywhere from 500 ms to 30 s to boot. Cold starts look identical to real downtime in a probe log.

The solution is a post-idle probe exemption: suppress the first probe failure after a period of server inactivity (typically >10 minutes with no traffic). Only trigger downtime alerting if the second post-idle probe also fails. This exemption requires knowing which platforms have cold starts — AliveMCP maintains a recognized platform list (vercel.app, railway.app, render.com, onrender.com, fly.dev, *.lambda-url.*.on.aws) and applies the suppression automatically. See MCP server cold start monitoring for platform-specific thresholds.
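One way to express the post-idle exemption in code is shown below. It is a sketch under stated assumptions: `ColdStartFilter` is a hypothetical name, "activity" means any known traffic including the probe itself (the probe is what wakes the server), and a server with no recorded activity is treated as idle.

```python
from datetime import datetime, timedelta
from typing import Optional

IDLE_THRESHOLD = timedelta(minutes=10)  # "typically >10 minutes" in the text


class ColdStartFilter:
    """Decides whether a failed probe should count toward downtime."""

    def __init__(self, cold_start_platform: bool,
                 idle_threshold: timedelta = IDLE_THRESHOLD) -> None:
        self.cold_start_platform = cold_start_platform
        self.idle_threshold = idle_threshold
        self.last_activity: Optional[datetime] = None

    def record_activity(self, when: datetime) -> None:
        """Call this for any traffic known to have reached the server."""
        self.last_activity = when

    def failure_counts(self, probe_time: datetime) -> bool:
        """Return False only for the first failure after an idle period."""
        idle = (
            self.last_activity is None
            or probe_time - self.last_activity > self.idle_threshold
        )
        # The probe itself triggers the boot, so it counts as activity.
        # The next probe is therefore no longer "post-idle".
        self.last_activity = probe_time
        return not (self.cold_start_platform and idle)
```

Because the exempted probe resets the idle clock, a second failure one probe interval later is counted normally, which matches the rule that only the first post-idle failure is suppressed.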

Severity tiers for MCP downtime

Not all downtime is equally urgent. A 4-layer MCP probe produces different signals at each layer, and the appropriate alert severity depends on which layer failed:

P1 — Full outage (transport, HTTP, or initialize layer unavailable)

Transport failure (TCP refused/timeout), HTTP 5xx, or JSON-RPC initialize error. The server is completely unreachable or refusing connections. All dependent agent workflows are failing. Page on-call immediately. Routing: PagerDuty (or equivalent interrupt-driven on-call tool), SMS, phone call after 5 minutes unacknowledged.

Trigger: 3 consecutive failing probes at transport, HTTP, or initialize layer. Resolution: 3 consecutive passing probes at the same layer.

P2 — Partial failure (tools/list layer only)

Transport, HTTP, and initialize all pass, but tools/list returns an error, an empty array, or a malformed schema. The server is alive but your agents can't discover or invoke tools. This is serious but not an all-hands emergency — agents may still be able to use cached tool definitions. Routing: Slack alert to the relevant channel, no phone wake-up unless it persists for 30+ minutes.

Trigger: 3 consecutive tools/list failures with all other layers passing. Escalate to P1 if unresolved after 30 minutes.

P3 — Degradation (latency SLO breach)

All layers pass but latency exceeds 3× the 30-day rolling baseline for 3 consecutive probes. The server is technically available but slow enough to affect agent performance. Routing: async notification (email digest, low-priority Slack message). No interrupt. See MCP server latency for latency-specific alerting thresholds.
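The three tiers can be summarized as a classification over the last three probes. The sketch below is illustrative only: `ProbeResult`, `Severity`, and `classify` are hypothetical names, and the 30-day baseline is assumed to be computed elsewhere and passed in.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    P1 = "full outage"
    P2 = "partial failure"
    P3 = "degradation"


@dataclass
class ProbeResult:
    transport_ok: bool
    http_ok: bool
    initialize_ok: bool
    tools_list_ok: bool
    latency_ms: float


def _core_ok(p: ProbeResult) -> bool:
    return p.transport_ok and p.http_ok and p.initialize_ok


def classify(probes: List[ProbeResult], baseline_ms: float) -> Optional[Severity]:
    """Classify the three most recent probes; None means no alert."""
    if len(probes) < 3:
        return None
    last3 = probes[-3:]
    # P1: transport, HTTP, or initialize failed on all three probes.
    if all(not _core_ok(p) for p in last3):
        return Severity.P1
    # P2: the core layers pass but tools/list fails on all three probes.
    if all(_core_ok(p) and not p.tools_list_ok for p in last3):
        return Severity.P2
    # P3: everything passes but latency exceeds 3x the rolling baseline.
    if all(_core_ok(p) and p.tools_list_ok and p.latency_ms > 3 * baseline_ms
           for p in last3):
        return Severity.P3
    return None
```

Checking the tiers from most to least severe means a server that is both slow and missing tools is reported once, at the higher severity.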

Alert routing configuration

The right routing for a downtime alert depends on the time of day, the severity, and whether anyone has already acknowledged it.

Business hours vs. after-hours routing

For indie MCP authors running a server as a side project, the same person is on-call 24/7. For teams, separate routing makes sense: business hours (9am–6pm local) → Slack channel for immediate manual response; after-hours → PagerDuty escalation policy with a 5-minute acknowledgment window before escalating to phone.

Escalation policy

A well-designed escalation policy for MCP downtime:

  1. T+0: P1 alert fires. Notify primary on-call via PagerDuty push notification.
  2. T+5 min: Primary on-call has not acknowledged. Re-notify via SMS.
  3. T+15 min: Still unacknowledged. Escalate to secondary on-call (if defined) or team Slack channel.
  4. T+30 min: Any P2 alert that is still unresolved is upgraded to the P1 escalation path.
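The timeline above can be written as data plus a small lookup. This is a sketch, not a PagerDuty configuration: the action labels such as "push:primary" are hypothetical placeholders for whatever your on-call tool calls these steps.

```python
from typing import List, Optional


def notifications_sent(elapsed_min: float,
                       ack_at_min: Optional[float] = None,
                       has_secondary: bool = True) -> List[str]:
    """Return the P1 notifications that should have gone out by now."""
    steps = [
        (0, "push:primary"),
        (5, "sms:primary"),
        (15, "page:secondary" if has_secondary else "slack:team-channel"),
    ]
    sent = []
    for at_min, action in steps:
        if at_min > elapsed_min:
            break  # this step is not due yet
        # The T+0 step always fires; later steps stop once someone acknowledges.
        if at_min > 0 and ack_at_min is not None and ack_at_min <= at_min:
            break
        sent.append(action)
    return sent


def p2_should_escalate(minutes_unresolved: float) -> bool:
    """A P2 that is still open after 30 minutes joins the P1 path."""
    return minutes_unresolved >= 30
```

For example, `notifications_sent(20, ack_at_min=3)` stops after the initial push because the acknowledgment arrived before the SMS step was due.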

Deduplication

A 2-hour outage should produce one incident, not one alert per probe cycle. Deduplication: after the initial P1 fires, suppress subsequent failure alerts for the same server until either (a) the server recovers and then goes down again, or (b) 4 hours pass. Use a unique dedup_key per server per incident window (server ID + incident start timestamp) so incident management tools like PagerDuty or OpsGenie can group all alert events into one incident thread.
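A minimal sketch of this deduplication logic follows. The key format and the `IncidentDeduper` class are assumptions for illustration, and the sketch interprets "4 hours pass" as re-sending a reminder under the same key so it stays in the same incident thread.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple


def dedup_key(server_id: str, incident_start: datetime) -> str:
    """Server ID plus incident start timestamp (expects a timezone-aware time)."""
    stamp = incident_start.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{server_id}:{stamp}"


class IncidentDeduper:
    def __init__(self, resend_after: timedelta = timedelta(hours=4)) -> None:
        self.resend_after = resend_after
        # server_id -> (incident start, time the last alert was sent)
        self._open: Dict[str, Tuple[datetime, datetime]] = {}

    def on_confirmed_failure(self, server_id: str, now: datetime) -> Optional[str]:
        """Return the dedup key if an alert should go out now, else None."""
        if server_id not in self._open:
            self._open[server_id] = (now, now)
            return dedup_key(server_id, now)
        started, last_sent = self._open[server_id]
        if now - last_sent >= self.resend_after:
            # Reminder for the same incident: reuse the original key.
            self._open[server_id] = (started, now)
            return dedup_key(server_id, started)
        return None

    def on_recovery(self, server_id: str) -> None:
        """Close the incident so the next outage gets a fresh key."""
        self._open.pop(server_id, None)
```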

Maintenance window suppression

Planned maintenance that causes genuine downtime should not page anyone. Maintenance window suppression requires:

AliveMCP supports maintenance windows via the dashboard or API: POST /api/v1/monitors/{id}/maintenance with starts_at, ends_at, and an optional reason field logged to the incident timeline.
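As a sketch, the request body for that endpoint and the matching suppression check might look like the following. Only the field names starts_at, ends_at, and reason come from the text above; the ISO 8601 timestamp format, the half-open window semantics, and both function names are assumptions, and the base URL and authentication are deliberately left out because they are not specified here.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def maintenance_payload(starts_at: datetime, ends_at: datetime,
                        reason: Optional[str] = None) -> Dict[str, str]:
    """Build a JSON-serializable body for the maintenance endpoint."""
    if ends_at <= starts_at:
        raise ValueError("ends_at must be after starts_at")
    body = {
        "starts_at": starts_at.isoformat(),  # assumed format: ISO 8601
        "ends_at": ends_at.isoformat(),
    }
    if reason:
        body["reason"] = reason              # optional, per the text
    return body


def in_maintenance(now: datetime, starts_at: datetime, ends_at: datetime) -> bool:
    """True while alerts should be suppressed (start inclusive, end exclusive)."""
    return starts_at <= now < ends_at
```

Using timezone-aware datetimes throughout avoids a window that silently shifts by the local UTC offset.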

Recovery alerts

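A recovery alert is as important as the downtime alert because it closes the incident. Recovery alert design: fire it once the resolution condition is met (3 consecutive passing probes at the layer that failed) and thread it into the original incident rather than opening a new one.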

Multi-region downtime confirmation

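Single-origin monitoring has a fundamental ambiguity: if your probe origin is in us-east-1 and your server is in us-east-1, a probe failure could be a server failure or a us-east-1 network issue. Multi-region probing resolves this: a failure seen from several independent origins at the same time points to the server, while a failure seen from only one origin points to that origin's network path.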

AliveMCP Team tier ($49/mo) includes probing from three independent geographic origins (US East, EU West, Asia Pacific) with cross-region correlation built into the alerting engine. Single-origin probing is available on Author tier ($9/mo) — sufficient for most indie MCP authors where a single-region false positive rate of <0.1% is acceptable.
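The idea behind cross-region correlation can be sketched in a few lines. AliveMCP's actual correlation logic is not described here, so the rule below (all origins failing means down, some failing means a partial or path-level problem) and the labels it returns are assumptions for illustration.

```python
from typing import Dict


def correlate(region_failed: Dict[str, bool]) -> str:
    """Combine per-region confirmed-failure flags into one verdict."""
    if not region_failed:
        raise ValueError("at least one probe origin is required")
    failed = [region for region, is_failed in region_failed.items() if is_failed]
    if not failed:
        return "up"
    if len(failed) == len(region_failed):
        # Every independent origin agrees: treat as a real outage.
        return "down"
    # Only some origins fail: more likely a network path or regional issue.
    return "partial"
```

With three origins, `correlate({"us-east": True, "eu-west": False, "ap": False})` returns "partial", which would be a candidate for a lower-severity notification instead of a page.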

Related questions

How do I avoid alert fatigue with MCP downtime alerts?

Alert fatigue comes from three sources: too-aggressive trigger thresholds (fire on 1 failure instead of 3), missing deduplication (one alert per probe during a 2-hour outage), and missing maintenance window suppression (planned restarts page on-call). Fix all three: require 3 consecutive failures, deduplicate by incident, register maintenance windows. If you're still getting too many alerts, examine your alert history — are they mostly from a specific server? Check that server's cold-start pattern and apply the post-idle probe exemption.

What probe cadence should I use for downtime alerting?

60 seconds is the standard for production MCP servers. At 60-second cadence with 3-probe confirmation, you get a 3-minute detection window — fast enough for most production use cases. 30-second cadence halves the detection window to 90 seconds but doubles probe volume. 5-minute cadence (useful for dev/staging environments where fast detection isn't critical) gives a 15-minute detection window. AliveMCP Author tier uses 60-second cadence; for dev environments, a 5-minute cadence on a free-tier monitor is appropriate.
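The arithmetic behind these trade-offs is simple enough to check directly. The helper below is a hypothetical convenience function, assuming a fixed cadence and 3-probe confirmation:

```python
from typing import Dict


def cadence_tradeoff(cadence_s: int, threshold: int = 3) -> Dict[str, int]:
    """Detection window and daily probe volume for a given cadence."""
    if cadence_s <= 0 or threshold <= 0:
        raise ValueError("cadence and threshold must be positive")
    return {
        # Time from the first failing probe interval to a confirmed alert.
        "detection_window_s": cadence_s * threshold,
        # Probes sent per server per day from one origin.
        "probes_per_day": 86_400 // cadence_s,
    }
```

Going from 60-second to 30-second cadence moves the detection window from 180 to 90 seconds while the daily probe count rises from 1,440 to 2,880.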

Should I alert on HTTP 429 (rate limited) as downtime?

HTTP 429 is ambiguous: it could mean the probe origin is being rate-limited (not a server problem) or that the server's rate limit is too aggressive (a configuration problem). The safest approach: don't count a single 429 as downtime, but fire a P2 alert if you see 3 consecutive 429s. This distinguishes between "the probe got rate limited once" (likely a probe-origin issue) and "the monitoring system is consistently rate limited" (likely a real server misconfiguration). If 429s are recurring, add the probe-origin IPs to your server's rate-limit allowlist.

How should I test my downtime alert configuration?

Test your alerting pipeline end-to-end at least once per month. The easiest method: use the monitoring system's "fire test alert" function if available. If not: temporarily set the confirmation window to 1 probe (not 3), manually stop your MCP server, verify the alert fires and routes correctly, then restart the server and verify the recovery alert. Reset the confirmation window back to 3. Document the test in your runbook so the on-call rotation doesn't panic when they see the test alert. AliveMCP has a built-in "test alert" button on each monitor that fires a simulated P1 through your full routing config without requiring a real server outage.

Further reading