MCP server incident response

MCP server incidents are structurally different from web application incidents. The affected party is an agent framework, not a human — which means no one files a support ticket when the server goes down. The agent simply returns a degraded response, and the user assumes the AI is having a bad day. Without external monitoring and a defined incident response process, MCP outages go unresolved until someone notices the pattern.

TL;DR

Define three severity tiers (P1: complete outage, P2: degraded, P3: warning-level leading indicator), route P1 alerts to your phone, P2 to Slack, P3 to a weekly review. Update your status page within 5 minutes of a P1. Write a 5-bullet postmortem within 24 hours of resolution. AliveMCP fires alerts at the moment of detection — three consecutive failed probes — so you're notified within 3–4 minutes of an outage starting. Join the waitlist to wire up the alert pipeline.

Why MCP incidents are different

For a web application, the incident detection pipeline has a human layer: a user tries to log in, gets an error, and files a support ticket or posts on Twitter. Your support queue becomes a lagging indicator of server health. You know something is wrong because people tell you.

MCP servers have no such feedback loop. When an MCP server goes down, the calling agent framework logs an error, retries once or twice, and then either returns a degraded response ("I wasn't able to complete that action") or silently skips the tool call and fabricates a response. The user doesn't necessarily see an error — they see an AI that seems slightly less capable than usual. They don't know a tool is broken. They don't report it.

This means detection depends entirely on external monitoring. If you don't have a probe running against your MCP endpoint, you may not discover an outage until you look at your server logs days later, or until a technically sophisticated user notices that the AI's responses have changed. The seven silent failure modes of MCP servers are almost all in this invisible-to-users category.

Severity tiers

P1 — Complete outage

Definition: no agent can connect to or use the MCP server. The failure affects 100% of users and every agent session that tries to use this tool suite.

Triggers:

Server is unreachable (TCP connection refused, DNS failure, TLS handshake failure)
initialize request times out or returns a non-JSON-RPC response
tools/list returns an empty array when it should contain tools
Certificate expired (same practical effect as server down)
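
The triggers above can be folded into a single check on a probe result. This is a minimal sketch, not AliveMCP's implementation: the ProbeResult shape, its field names, and the is_p1 function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProbeResult:
    # Hypothetical shape of one probe run; all field names are assumptions.
    failed_step: Optional[str] = None          # first step that failed, or None
    tools: list = field(default_factory=list)  # tool names returned by tools/list
    expected_tool_count: int = 0               # how many tools the server should expose


def is_p1(result: ProbeResult) -> bool:
    """Return True when the probe result matches one of the P1 triggers."""
    # Unreachable server, expired certificate, or a broken initialize:
    # any failed probe step means no agent can use the server.
    if result.failed_step is not None:
        return True
    # tools/list answered, but with an empty array when tools were expected.
    if result.expected_tool_count > 0 and len(result.tools) == 0:
        return True
    return False
```

A result with some tools missing returns False here on purpose: a partial tool list is a P2 condition, not a P1.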

MTTR target: 15 minutes. Most P1 MCP incidents trace back to one of three causes: (1) the process crashed, and a restart fixes it; (2) a deploy broke something, and a rollback fixes it; (3) an upstream dependency (database, API) is down, and you escalate to that dependency's owner. The 15-minute target assumes one person with server access is on call.

Communication: update the status page to "Investigating" within 5 minutes of detection. Update every 15 minutes until resolved.

P2 — Degraded

Definition: the server is reachable and handling MCP initialization, but a subset of functionality is impaired. Affects some users or some agent flows, not all.

Triggers:

One or more tools return errors while others succeed — a tool-specific upstream dependency is failing
p95 latency is 2× or more above the 7-day baseline — agents are waiting noticeably longer for tool responses
Flapping — the server alternates between up and down on consecutive probes, suggesting instability (resource exhaustion, unhealthy instance behind a load balancer, intermittent network path issue)
tools/list returns fewer tools than expected — tool registry connection is intermittent
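
The latency and flapping triggers can be sketched as follows, assuming you keep raw probe latencies in milliseconds and an ordered list of recent probe outcomes. The function names and the min_transitions default of three are assumptions chosen for illustration, not AliveMCP settings.

```python
from statistics import quantiles


def p95(samples):
    """95th percentile of latency samples (needs at least two samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20)[-1]


def latency_degraded(recent_ms, baseline_ms, factor=2.0):
    """True when recent p95 is at least `factor` times the baseline p95."""
    return p95(recent_ms) >= factor * p95(baseline_ms)


def is_flapping(outcomes, min_transitions=3):
    """outcomes: booleans, True = probe succeeded, oldest first.

    Counts up/down transitions between consecutive probes; several
    transitions in a short window suggest instability rather than a
    clean outage. The threshold is an assumption.
    """
    transitions = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    return transitions >= min_transitions
```

Here baseline_ms would hold the last seven days of samples and recent_ms a shorter recent window; how long that window should be is a tuning decision this sketch leaves open.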

MTTR target: 1 hour. P2 incidents have a broader range of root causes than P1 incidents and often require investigation rather than an immediate restart/rollback.

Communication: update the status page to "Degraded" within 15 minutes. Update once per hour until resolved.

P3 — Warning (leading indicator)

Definition: no current user-visible impact, but a metric is trending toward a failure threshold. These are predictive indicators, not active incidents.

Triggers:

TLS certificate expiry within 14 days
p95 latency trending upward over 48 hours — not yet in the danger zone but heading there
Tool schema drift detected — a tool's inputSchema has changed without a corresponding version bump
Error rate on a specific tool is elevated but below the P2 threshold
Probe success rate is 95–99% over the past 24 hours — isolated failures that don't trigger the consecutive-probe threshold, but suggest instability
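
Two of these triggers reduce to simple arithmetic. The sketch below assumes the certificate's expiry is available as the notAfter string that Python's ssl module reports for a peer certificate, and that probe outcomes are stored as booleans. The function names are hypothetical.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days between now and the certificate's notAfter timestamp.

    not_after uses the certificate time format, for example
    "Jan  1 00:00:00 2030 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / SECONDS_PER_DAY


def cert_expiry_warning(not_after: str, now_epoch: float, threshold_days: int = 14) -> bool:
    """True when the certificate is still valid but expires within the threshold."""
    return 0 < days_until_expiry(not_after, now_epoch) <= threshold_days


def success_rate_warning(outcomes, low: float = 0.95, high: float = 0.99) -> bool:
    """True when the success rate over the window falls in the 95 to 99% band.

    outcomes: booleans for the past 24 hours, True = probe succeeded.
    Rates below the band are not a P3; they should already be a P2 or P1.
    """
    rate = sum(outcomes) / len(outcomes)
    return low <= rate <= high
```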

MTTR target: 24 hours. P3 items go into a weekly review queue unless they escalate to P2 or P1 in the meantime.

Communication: internal only. No status page update required unless the P3 escalates.
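
The schema drift trigger listed above can be approximated by fingerprinting each tool's inputSchema and comparing snapshots. This is a sketch under assumptions: where the version string comes from is up to you (for example, the server version reported during initialize), and the function names are hypothetical.

```python
import hashlib
import json


def schema_fingerprint(input_schema: dict) -> str:
    """Stable hash of a tool's inputSchema, independent of key order."""
    canonical = json.dumps(input_schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_tools(prev_version, prev_tools, cur_version, cur_tools):
    """Names of tools whose schema changed while the version stayed the same.

    prev_tools / cur_tools map tool name -> inputSchema dict.
    """
    if prev_version != cur_version:
        return []  # version was bumped, so schema changes are expected
    shared = prev_tools.keys() & cur_tools.keys()
    return sorted(
        name for name in shared
        if schema_fingerprint(prev_tools[name]) != schema_fingerprint(cur_tools[name])
    )
```

Added or removed tools are deliberately ignored here; a shrinking tool list is already covered by the P2 trigger on tools/list.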

Alert routing

Route alerts by severity, not by channel preference:

P1 → phone (PagerDuty, OpsGenie, direct SMS): P1 alerts must be loud enough to wake someone up. A Slack message that nobody sees for 45 minutes while an outage is in progress is worse than no alert at all — it creates a false sense that the incident is acknowledged. Use a system that can escalate to a backup on-call contact if the primary doesn't acknowledge within 5 minutes.
P2 → Slack / Discord channel + email: P2 requires attention but not an immediate phone wake-up. A channel alert that the on-call can acknowledge during business hours is appropriate for most P2 incidents. Add an escalation to phone if the P2 is not acknowledged within 30 minutes.
P3 → email / weekly digest: P3 items are inputs to a scheduled review, not immediate action items. Configure them to aggregate into a weekly digest rather than generating individual notifications — P3 alert noise is one of the fastest ways to train an on-call engineer to ignore all alerts.
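
The escalation windows above (5 minutes for an unacknowledged P1, 30 minutes for an unacknowledged P2) can be expressed as a small check. Paging tools such as PagerDuty or OpsGenie handle this for you; the sketch below only illustrates the rule, and the names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Escalation windows taken from the routing rules above.
ESCALATE_AFTER = {
    "P1": timedelta(minutes=5),   # escalate to the backup on-call contact
    "P2": timedelta(minutes=30),  # escalate from channel alert to phone
}


def should_escalate(
    severity: str,
    fired_at: datetime,
    now: datetime,
    acknowledged_at: Optional[datetime] = None,
) -> bool:
    """True when an alert has gone unacknowledged past its escalation window."""
    if acknowledged_at is not None:
        return False
    window = ESCALATE_AFTER.get(severity)
    if window is None:
        return False  # P3 items wait for the weekly digest and never escalate here
    return now - fired_at >= window
```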

AliveMCP's Author tier supports webhook-based alert routing. You can point the P1 webhook at a PagerDuty integration URL and the P2/P3 webhooks at a Slack channel. The webhook payload includes the severity, the failing probe step (TLS, HTTP, initialize, tools/list), and the consecutive-failure count, so your router can dispatch accordingly.
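
If you route through your own webhook receiver, dispatch can look roughly like the sketch below. The payload field names (severity, failing_step, consecutive_failures) are assumptions made for illustration; check the actual webhook payload before relying on them. The destination names are placeholders for your own integrations.

```python
import json

# Destinations per severity, following the routing rules above.
# The destination names are placeholders, not real integration identifiers.
ROUTES = {
    "P1": ["pagerduty"],
    "P2": ["slack", "email"],
    "P3": ["weekly_digest"],
}


def route_alert(raw_body: str) -> dict:
    """Parse a webhook body and decide where the alert should go."""
    payload = json.loads(raw_body)
    severity = payload.get("severity")
    # Design choice: an unknown or missing severity is routed like a P1,
    # so a malformed alert is loud rather than silently dropped.
    destinations = ROUTES.get(severity, ROUTES["P1"])
    step = payload.get("failing_step", "unknown step")
    count = payload.get("consecutive_failures", 0)
    summary = f"[{severity or 'P1?'}] {step} failed ({count} consecutive probes)"
    return {"destinations": destinations, "summary": summary}
```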

Incident response workflow

1. Detect

AliveMCP declares an incident after three consecutive failed probes (roughly 3 minutes from the first failure). The alert fires at probe 3. If you're monitoring more granularly (sub-60-second probes via your own infrastructure), detection can be faster, but requiring three consecutive failures is still worth keeping: it filters out one-off network blips that would otherwise page you.
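
If you run your own probes, three-probe confirmation is a small state machine. The sketch below is an illustration, not AliveMCP's code; the class name is hypothetical, and closing the incident on the first successful probe is an assumption you may want to tighten.

```python
from typing import Optional


class IncidentDetector:
    """Opens an incident after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.incident_open = False

    def record(self, probe_ok: bool) -> Optional[str]:
        """Record one probe outcome; return "opened", "resolved", or None."""
        if probe_ok:
            self.consecutive_failures = 0
            if self.incident_open:
                # Assumption: a single successful probe closes the incident.
                self.incident_open = False
                return "resolved"
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.incident_open:
            self.incident_open = True
            return "opened"
        return None
```

Feeding it the outcomes fail, fail, success, fail, fail, fail opens an incident only on the last probe, because the success in the middle resets the counter.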

2. Triage

Within 5 minutes of the P1 alert firing, establish which layer failed:

Can you reach the server at all? (ping, TCP connect, TLS handshake)
Does the HTTP listener respond? (curl https://your-server/health)
Does the MCP endpoint respond to initialize? (MCP-specific probe)
Does tools/list return the expected tools?
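
The four questions above form an ordered checklist: stop at the first layer that fails. A minimal sketch, assuming each layer is wrapped in a callable that returns True on success. The helper names are hypothetical, and the initialize and tools/list checks would come from whatever MCP client you use.

```python
import socket
import ssl
from typing import Callable, Optional, Sequence, Tuple

Check = Tuple[str, Callable[[], bool]]


def tcp_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def tls_handshake_ok(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """True if a TLS handshake with certificate verification succeeds."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:  # ssl.SSLError is a subclass of OSError
        return False


def first_failing_layer(checks: Sequence[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first failure, or None."""
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failed layer
        if not ok:
            return name
    return None


# Example wiring (the hostname is a placeholder):
# checks = [
#     ("tcp", lambda: tcp_reachable("mcp.example.com")),
#     ("tls", lambda: tls_handshake_ok("mcp.example.com")),
# ]
# print(first_failing_layer(checks))
```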

AliveMCP's dashboard shows which step in the probe sequence failed, giving you a head start on triage without running these checks manually.

Cross-reference with your deployment history: did anything deploy in the 30 minutes before the incident started? A deploy is the most common cause of sudden MCP outages.
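
The deploy cross-reference is a window query over your deployment log. The sketch below assumes the log is available as (deploy_id, finished_at) pairs; that shape and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def suspect_deploys(deploys, incident_start: datetime, window=timedelta(minutes=30)):
    """Return ids of deploys that finished in the window before the incident.

    deploys: iterable of (deploy_id, finished_at) pairs.
    Results are ordered most recent first, since the latest deploy is the
    likeliest culprit.
    """
    hits = [
        (finished_at, deploy_id)
        for deploy_id, finished_at in deploys
        if incident_start - window <= finished_at <= incident_start
    ]
    return [deploy_id for _, deploy_id in sorted(hits, reverse=True)]
```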

3. Communicate

Update the status page immediately. "Investigating" is enough for the first update; you don't need to know the root cause before communicating. The developers whose agents depend on your server will check the status page once they notice degraded behavior. A prompt "Investigating" entry is substantially better than a 45-minute silence followed by "Resolved".
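
The update cadence defined for each severity tier (P1: first update within 5 minutes, then every 15 minutes; P2: first update within 15 minutes, then hourly) can be turned into concrete deadlines for whoever is on call. A small sketch with hypothetical names:

```python
from datetime import datetime, timedelta

# (time allowed for the first update, interval between later updates)
CADENCE = {
    "P1": (timedelta(minutes=5), timedelta(minutes=15)),
    "P2": (timedelta(minutes=15), timedelta(hours=1)),
}


def update_deadlines(severity: str, detected_at: datetime, count: int = 3):
    """Latest acceptable times for the next `count` status page updates.

    Returns an empty list for P3, which needs no status page update.
    """
    cadence = CADENCE.get(severity)
    if cadence is None:
        return []
    first, interval = cadence
    return [detected_at + first + i * interval for i in range(count)]
```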

4. Resolve

Common resolution paths for P1 MCP incidents:

Process restart: covers crashes, memory leaks, hung threads. Takes 30–120 seconds.
Deploy rollback: if a deploy caused the incident, rolling back is usually faster than forward-fixing. Requires knowing which deploy is current and having a rollback procedure that doesn't itself depend on pushing a new deploy through the pipeline.
Certificate renewal: for cert-expiry outages. certbot renew --force-renewal or equivalent, followed by an nginx/Caddy reload. Takes 2–5 minutes if you have the renewal command documented.
Upstream dependency escalation: if the failure is in an upstream dependency (database, third-party API), the fix belongs to that system's owner and their SLA. Tell your users which dependency is down so they understand the blocker isn't your server.

5. Postmortem

Write a postmortem within 24 hours of resolution. The goal is not blame assignment — it's timeline reconstruction and prevention. A five-bullet format works for small teams:

What happened (the observable facts: server went down at X, recovered at Y, N users affected)
Why it happened (root cause, one level down: deploy introduced a bug in the MCP router)
Why we didn't catch it sooner (detection gap: no external monitoring was configured on this endpoint)
What we're changing (prevention: add AliveMCP monitoring + alert before next deploy)
When we'll verify the changes are in place (deadline: by end of sprint)

Postmortems don't need to be long. A 5-bullet summary written the same day is worth more operationally than a 10-page document written two weeks later.

How AliveMCP fits into incident response

AliveMCP covers steps 1 and 2 of the incident lifecycle: detection (three-probe confirmation, typically 3–4 minutes from first failure) and first-stage triage (the probe result tells you exactly which protocol layer failed, not just that "something is wrong"). The webhook alert is the bridge to your existing on-call tooling — PagerDuty, OpsGenie, Slack, or direct email.

Setting up Slack alerts for MCP server events takes about 2 minutes: create an incoming webhook in Slack, then paste its URL into AliveMCP's alert configuration. From that point, any P1 or P2 event fires a channel notification with the server name, the failing step, and a link to the status page.
