Alert routing guide · 2026-06-19 · Alert Routing & Incident Management
MCP Server Alert Routing: PagerDuty, OpsGenie, Discord, and the Architecture to Connect Them
AliveMCP detects when your MCP server is down and fires a webhook. What happens next is a design problem that has nothing to do with monitoring — it is an alert routing problem. Who receives the alert? How does it deduplicate when the server stays down for 30 minutes? When does it escalate to a phone call? What does the person who wakes up actually do? This guide synthesizes five components that together form a complete incident response architecture: PagerDuty for guaranteed wakeup with escalation, OpsGenie for team-based routing and on-call schedules, Discord webhooks for community-visible incident tracking, the routing pipeline architecture that ties multiple channels together without noise, and the incident runbook that tells you what to do once the alert arrives.
Five components, one incident response architecture
The table below maps each component to its role, what it does that the others cannot, and which author profile it fits.
| Component | Primary role | Unique capability | Best for |
|---|---|---|---|
| PagerDuty | Guaranteed wakeup: escalation if not acknowledged | Phone call + SMS if push notification is silenced; on-call rotation; structured incident lifecycle (open → acknowledged → resolved) | Solo authors with SLA commitments; teams where on-call burden needs rotation |
| OpsGenie | Team-based routing: alerts go to teams, not individuals | Team routing policies; Heartbeat dead-man switch (AliveMCP pings OpsGenie every 5 minutes to prove connectivity); Jira auto-ticket creation on P1/P2 unresolved after 15 minutes | Multi-squad organizations where different teams own different MCP servers |
| Discord | Community-visible incident tracking | Message-edit deduplication (one embed updated in place across a 30-minute outage, not 30 separate messages); thread-based incident timeline; zero additional tool cost | Indie authors and open-source projects whose users are already in a Discord server |
| Routing architecture | Pipeline design: severity → channel → deduplication → escalation | Single stable dedup key per server across all channels; Promise.allSettled fan-out (Slack outage cannot block PagerDuty); alert storm correlation when multiple servers fail simultaneously | Any setup using more than one notification channel |
| Incident runbook | Response playbook: what to do after the alert fires | Playbook indexed by AliveMCP failure_reason field; eliminates the context-reconstruction step that adds 5–15 minutes to every incident | Any MCP server operator regardless of alerting stack |
The pattern across all five: they operate on the downstream side of the detection boundary. AliveMCP sends the initialize JSON-RPC sequence to your MCP server every 60 seconds from outside your infrastructure, from the same network path an LLM client traverses. When it detects a failure — connection refused, protocol handshake failure, tool call timeout, schema drift, elevated error rate — it fires a webhook. Everything in this guide is about what happens to that webhook. None of these five components is a substitute for AliveMCP; they amplify the detection signal into a response.
The detection layer: what AliveMCP provides to the routing pipeline
Before designing the routing layer, it is worth being precise about what arrives in the AliveMCP webhook payload — because the routing architecture depends on it. AliveMCP detects five distinct failure modes, each with a different failure_reason field in the payload:
| Failure mode | failure_reason | What it means | Severity mapping |
|---|---|---|---|
| Connection refused | connection_refused | Process not listening / container not running / network unreachable | P1: server completely unavailable |
| Protocol handshake failure | protocol_error | Process running but initialize returns error or unexpected response | P1: server available but broken for all LLM clients |
| Tool call timeout | timeout | initialize succeeds but tool call exceeds timeout threshold | P2: degraded; some tool calls failing |
| Schema drift | schema_drift | tools/list response changed (tools added, removed, or renamed since last check) | P3: informational; may or may not affect clients depending on which tools changed |
| Elevated error rate | error_rate_elevated | Tool calls returning errors above the configured threshold | P2: degraded; fraction of tool calls failing |
The failure_reason field is the entry point for the incident runbook. When you receive a PagerDuty page at 2 AM, the first thing to look at is not the server's own logs — it is the AliveMCP dashboard's failure_reason field, because AliveMCP knows what the server returned (or failed to return) from an external protocol probe. That tells you which of the five runbook playbooks to open before you have touched a single CLI tool.
AliveMCP also sends three event types in the webhook: alert.triggered (failure first detected), alert.updated (still failing after N minutes), and alert.resolved (server recovered). The alert routing pipeline must handle all three correctly — deduplicating alert.updated events so they do not create 30 separate notifications for a 30-minute outage, and closing the incident automatically on alert.resolved without requiring manual action.
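A bridge can validate the incoming event and translate it into a channel-neutral action before anything else touches it. The sketch below assumes the payload field names used in the bridge examples in this guide (type, server_slug, failure_reason); the exact AliveMCP schema is not specified here, so treat parseAliveMcpEvent and its return shape as hypothetical.

```javascript
// Assumed payload fields: type, server_slug, failure_reason (as used by the bridges in this guide).
const EVENT_TYPES = new Set(['alert.triggered', 'alert.updated', 'alert.resolved']);
const FAILURE_REASONS = new Set([
  'connection_refused', 'protocol_error', 'timeout', 'schema_drift', 'error_rate_elevated',
]);

function parseAliveMcpEvent(body) {
  if (typeof body !== 'object' || body === null) {
    throw new TypeError('webhook body must be a JSON object');
  }
  if (!EVENT_TYPES.has(body.type)) {
    throw new RangeError(`unknown event type: ${body.type}`);
  }
  if (typeof body.server_slug !== 'string' || body.server_slug === '') {
    throw new TypeError('server_slug is required');
  }
  // Assumption: failure_reason is always present while failing, but may be absent on resolve.
  if (body.type !== 'alert.resolved' && !FAILURE_REASONS.has(body.failure_reason)) {
    throw new RangeError(`unknown failure_reason: ${body.failure_reason}`);
  }
  let action = 'close';
  if (body.type === 'alert.triggered') action = 'open';
  else if (body.type === 'alert.updated') action = 'update';
  return {
    type: body.type,
    serverSlug: body.server_slug,
    failureReason: body.failure_reason ?? null,
    action, // what each downstream channel should do: open, update, or close
  };
}
```

Rejecting unknown event types early keeps a malformed or future payload from silently opening or closing incidents.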
PagerDuty: guaranteed wakeup for production MCP servers
Slack does not wake you up. A channel notification at 2 AM is silenced by Do Not Disturb. Discord has the same limitation. Email is slower. PagerDuty solves the last-mile problem of getting a notification through to a sleeping human — it escalates through push notification, SMS, and phone call in sequence until someone acknowledges. For any MCP server that has paying users or commitments around availability, that escalation model is not optional.
The integration is a small webhook bridge between AliveMCP and the PagerDuty Events API v2. The key design decisions are the dedup_key strategy and the escalation policy:
// Bridge: AliveMCP webhook → PagerDuty Events API v2
async function handleAliveMcpWebhook(event) {
  const payload = {
    routing_key: process.env.PAGERDUTY_INTEGRATION_KEY,
    dedup_key: event.server_slug, // Stable per server: prevents duplicate incidents
    event_action: event.type === 'alert.resolved' ? 'resolve' : 'trigger',
    payload: {
      summary: `MCP server ${event.server_slug}: ${event.failure_reason}`,
      severity: mapSeverity(event.failure_reason), // 'critical' | 'error' | 'warning' | 'info'
      source: 'alivemcp',
      custom_details: {
        failure_reason: event.failure_reason,
        duration_minutes: event.duration_minutes,
        server_url: event.server_url,
        runbook: 'https://alivemcp.com/seo/mcp-server-runbook',
      },
    },
  };
  const res = await fetch('https://events.pagerduty.com/v2/enqueue', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  // Surface rejected events instead of silently dropping a page
  if (!res.ok) throw new Error(`PagerDuty enqueue failed: HTTP ${res.status}`);
}
The dedup_key set to server_slug is the critical design choice. It means that when AliveMCP's minute-by-minute probe fires 30 alert.updated events during a 30-minute outage, PagerDuty sees 30 trigger events with the same dedup_key and collapses them all into a single open incident. No new pages are fired for updates — only the initial trigger and the final resolution. Without this, a 30-minute outage generates 30 phone calls.
Escalation policy design for solo authors: a single policy with two levels — push notification at T+0, phone call at T+5 minutes if not acknowledged. For teams, a weekly on-call rotation in the escalation policy distributes the burden. The maintenance window suppression feature in PagerDuty (or a maintenance window flag in the bridge) prevents alerts from firing during scheduled deploys, where a 60-second downtime during restart would otherwise trigger a P1.
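The "maintenance window flag in the bridge" option can be as small as a time-range check run before any notification is sent. This is a minimal sketch: the window shape ({ serverSlug, startsAt, endsAt }) and both function names are assumptions for illustration, not part of AliveMCP or PagerDuty.

```javascript
// windows: array of { serverSlug, startsAt, endsAt } with ISO 8601 timestamps (hypothetical shape)
function isInMaintenance(windows, serverSlug, now = new Date()) {
  return windows.some((w) =>
    w.serverSlug === serverSlug &&
    new Date(w.startsAt) <= now &&
    now < new Date(w.endsAt));
}

function shouldSuppress(event, windows, now = new Date()) {
  // Never suppress resolve events, so an incident opened before the window can still close.
  if (event.type === 'alert.resolved') return false;
  return isInMaintenance(windows, event.server_slug, now);
}
```

Letting alert.resolved through regardless of the window avoids leaving a stale open incident when a deploy starts in the middle of a real outage.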
For multi-server setups, use PagerDuty's Event Orchestration (the successor to Event Rules) to route alerts from different server slugs to different services and escalation policies. A production-tier server ($49/mo Team plan) gets an aggressive escalation; a development-tier server gets a low-urgency policy that only sends push notifications and never calls.
Severity mapping from AliveMCP failure_reason to PagerDuty severity: connection_refused and protocol_error → critical (immediate phone escalation); timeout and error_rate_elevated → error (push notification, phone escalation if not acked in 5 minutes); schema_drift → warning (push notification only, no escalation — this is informational and the server is still serving).
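The bridge above calls mapSeverity without defining it. One possible implementation of the mapping just described is shown below; the fallback for unknown reasons is an assumption, chosen so that an unrecognized failure still notifies someone.

```javascript
// failure_reason → PagerDuty Events API v2 severity, per the mapping described above
const PAGERDUTY_SEVERITY = new Map([
  ['connection_refused', 'critical'],
  ['protocol_error', 'critical'],
  ['timeout', 'error'],
  ['error_rate_elevated', 'error'],
  ['schema_drift', 'warning'],
]);

function mapSeverity(failureReason) {
  // Assumption: unknown reasons fall back to 'error' rather than being dropped or downgraded.
  return PAGERDUTY_SEVERITY.get(failureReason) ?? 'error';
}
```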
OpsGenie: team-based routing and the heartbeat dead-man switch
OpsGenie's routing model is team-centric rather than service-centric. Where PagerDuty routes alerts to a service (with its own escalation policy and on-call schedule), OpsGenie routes alerts to a team (and the team manages its own on-call schedule and escalation rules). This distinction matters when different squads own different MCP servers — the payment-API MCP server routes to the platform team; the customer-data MCP server routes to the data team; the third-party MCP servers route to the integrations team.
The integration uses OpsGenie's Alert API v2 with an alias field for deduplication — the OpsGenie equivalent of PagerDuty's dedup_key:
async function handleAliveMcpWebhook(event) {
  const alias = `alivemcp-${event.server_slug}`; // Stable per server
  if (event.type === 'alert.resolved') {
    // Close the alert via alias; OpsGenie closes the open incident automatically
    await fetch(`https://api.opsgenie.com/v2/alerts/${alias}/close?identifierType=alias`, {
      method: 'POST',
      headers: {
        'Authorization': `GenieKey ${process.env.OPSGENIE_API_KEY}`,
        'Content-Type': 'application/json', // Required because the request carries a JSON body
      },
      body: JSON.stringify({ note: `MCP server recovered after ${event.duration_minutes}m` }),
    });
  } else {
    // Create or update the open alert; alias deduplication prevents duplicates
    await fetch('https://api.opsgenie.com/v2/alerts', {
      method: 'POST',
      headers: {
        'Authorization': `GenieKey ${process.env.OPSGENIE_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        alias,
        message: `MCP server ${event.server_slug}: ${event.failure_reason}`,
        priority: mapPriority(event.failure_reason), // P1-P4
        tags: ['mcp-server', event.server_tier, event.failure_reason],
        details: {
          failure_reason: event.failure_reason,
          server_url: event.server_url,
          runbook: 'https://alivemcp.com/seo/mcp-server-runbook',
        },
        responders: [{ type: 'team', name: mapTeam(event.server_slug) }],
      }),
    });
  }
}
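The bridge relies on two helpers, mapPriority and mapTeam, that are not shown. A minimal sketch follows. The priority values mirror the severity mapping in the failure-mode table earlier in this guide; the slug prefixes and team names are hypothetical placeholders modeled on the platform/data/integrations example above, and the default team is an assumption.

```javascript
// failure_reason → OpsGenie priority, following the failure-mode table in this guide
const OPSGENIE_PRIORITY = new Map([
  ['connection_refused', 'P1'],
  ['protocol_error', 'P1'],
  ['timeout', 'P2'],
  ['error_rate_elevated', 'P2'],
  ['schema_drift', 'P3'],
]);

function mapPriority(failureReason) {
  return OPSGENIE_PRIORITY.get(failureReason) ?? 'P2'; // assumption: unknown reasons page as P2
}

// Hypothetical slug prefixes; replace with however your servers are actually named.
const TEAM_BY_SLUG_PREFIX = [
  ['payment-', 'platform'],
  ['customer-data-', 'data'],
  ['thirdparty-', 'integrations'],
];

function mapTeam(serverSlug) {
  const match = TEAM_BY_SLUG_PREFIX.find(([prefix]) => serverSlug.startsWith(prefix));
  return match ? match[1] : 'platform'; // assumption: unowned servers default to one team
}
```

Defaulting to a named team means an unmapped server still reaches someone instead of producing an alert with no responder.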
OpsGenie's Heartbeat feature provides a dead-man switch that covers an important gap: what if AliveMCP itself cannot reach your private endpoint due to a network issue between AliveMCP and your server? The Heartbeat turns the monitoring direction around — instead of AliveMCP probing your server, your server (or AliveMCP) pings OpsGenie every N minutes. If OpsGenie stops receiving the ping, it fires an alert. Configure AliveMCP to ping the Heartbeat URL every 5 minutes as a secondary detection path for endpoints that are behind a VPN or private network.
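For whichever process ends up sending the ping, the request itself is small. The sketch below uses the OpsGenie Heartbeat ping endpoint (GET /v2/heartbeats/{name}/ping with a GenieKey header); the function names, the injectable fetchImpl parameter, and the 5-minute default are illustrative choices, and the heartbeat must already exist in OpsGenie under the given name.

```javascript
// fetchImpl is injectable so the function can be exercised without network access.
async function pingHeartbeat(heartbeatName, apiKey, fetchImpl = fetch) {
  const url = `https://api.opsgenie.com/v2/heartbeats/${encodeURIComponent(heartbeatName)}/ping`;
  const res = await fetchImpl(url, {
    method: 'GET',
    headers: { Authorization: `GenieKey ${apiKey}` },
  });
  if (!res.ok) throw new Error(`heartbeat ping failed: HTTP ${res.status}`);
  return true;
}

// Pings on a fixed interval and returns a function that stops the timer.
function startHeartbeat(heartbeatName, apiKey, intervalMs = 5 * 60_000) {
  const timer = setInterval(() => {
    // A failed ping is logged, not thrown: OpsGenie raises the alert when pings stop arriving.
    pingHeartbeat(heartbeatName, apiKey).catch((err) => console.error(err.message));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

The heartbeat's expected interval in OpsGenie should be somewhat longer than intervalMs so that one delayed ping does not raise a false alert.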
On-call schedule configuration: OpsGenie's schedule editor supports business-hours restrictions (alert only during work hours for P3/P4), follow-the-sun multi-region rotation (US team handles 9am–6pm EST; EU team handles CEST business hours), and override rules for planned absences. For teams using Jira, OpsGenie can create a Jira ticket automatically when a P1 or P2 alert is not resolved within 15 minutes — useful for postmortem tracking and sprint board visibility.
OpsGenie vs PagerDuty decision guide: use PagerDuty if your team has an existing service-centric on-call structure, uses PagerDuty's native scheduling UI, or has fewer than 5 people on rotation. Use OpsGenie if your team is in the Atlassian ecosystem (Jira, Confluence), has multiple distinct squads who should own their own alerts, or wants the Heartbeat dead-man switch as a built-in feature without a separate service.
Discord: community-visible incident tracking with message-edit deduplication
Indie MCP server authors live in Discord. Their community is there, their beta users are there, and their contributors are there. Routing AliveMCP alerts to Discord keeps the incident timeline in the same space where users will ask "is the server down?" — no context switch, no separate status page to check. The engineering challenge with Discord is deduplication: the default approach of POSTing a new message for each AliveMCP webhook event produces 30 separate messages for a 30-minute outage, flooding the alert channel.
The correct implementation uses Discord's message-edit API. POST the initial alert with ?wait=true to get the message_id back, store it in memory (or Redis for durability), and PATCH the same message on every subsequent webhook event:
const alertStore = new Map(); // server_slug → { messageId }
async function handleAliveMcpWebhook(event) {
  const existing = alertStore.get(event.server_slug);
  if (event.type === 'alert.triggered' && !existing) {
    // Initial alert: POST with ?wait=true to get message_id
    const res = await fetch(`${WEBHOOK_URL}?wait=true`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        content: `<@&${MCP_ONCALL_ROLE_ID}>`, // Ping role ONLY on initial trigger
        embeds: [buildEmbed(event, 'red')],
      }),
    });
    const msg = await res.json();
    alertStore.set(event.server_slug, { messageId: msg.id });
  } else if (event.type === 'alert.updated' && existing) {
    // Update existing message: no new ping, no new message, no noise
    await fetch(`${WEBHOOK_URL}/messages/${existing.messageId}`, {
      method: 'PATCH',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        content: '', // Remove role ping from updates
        embeds: [buildEmbed(event, 'orange')],
      }),
    });
  } else if (event.type === 'alert.resolved' && existing) {
    // Patch to green: incident resolved
    await fetch(`${WEBHOOK_URL}/messages/${existing.messageId}`, {
      method: 'PATCH',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        content: '',
        embeds: [buildEmbed(event, 'green')],
      }),
    });
    alertStore.delete(event.server_slug); // Ready for next incident
  }
}
The embed color field uses integer values: 15158332 (red, #E74C3C) for triggered, 15105570 (orange, #E67E22) for sustained outage updates, 3066993 (green, #2ECC71) for resolved. Combined with an emoji in the embed title (🔴 / 🟡 / 🟢), this makes alert state readable at a glance even for users with color vision deficiencies.
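The handler above calls buildEmbed without showing it. A possible version is sketched below using the integer colors and emoji listed in this section; the labels, the two fields, and the optional now parameter are assumptions made for illustration.

```javascript
// Colors and emoji come from this guide; labels and field layout are illustrative.
const EMBED_STYLES = new Map([
  ['red', { color: 15158332, emoji: '🔴', label: 'down' }],
  ['orange', { color: 15105570, emoji: '🟡', label: 'still down' }],
  ['green', { color: 3066993, emoji: '🟢', label: 'recovered' }],
]);

function buildEmbed(event, state, now = new Date()) {
  const style = EMBED_STYLES.get(state);
  if (!style) throw new RangeError(`unknown embed state: ${state}`);
  return {
    title: `${style.emoji} ${event.server_slug}: ${style.label}`,
    color: style.color, // Discord expects the color as a decimal integer
    fields: [
      { name: 'Failure reason', value: String(event.failure_reason ?? 'n/a'), inline: true },
      { name: 'Duration', value: `${event.duration_minutes ?? 0} min`, inline: true },
    ],
    timestamp: now.toISOString(),
  };
}
```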
For sustained outages beyond 5 minutes, create a Discord thread on the alert message and post duration updates there. The main channel stays clean, with one embed that changes color in place, while the thread holds the full timeline for users who want detail. Discord's POST /channels/{channel.id}/messages/{message.id}/threads endpoint creates the thread attached to the alert message; it requires a bot token with permission to create public threads, because a webhook URL alone cannot start a thread. Subsequent duration updates post to the thread (a webhook can target it with the thread_id query parameter) rather than patching the parent embed on every check.
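A rough sketch of the thread step is shown below. It assumes a bot token is available in addition to the webhook URL, since starting a thread from a message is a bot-authenticated call; the function names, the options object, and the injectable fetchImpl are illustrative.

```javascript
const DISCORD_API = 'https://discord.com/api/v10';

// Starts a thread on an existing alert message and returns the new thread's id.
async function startIncidentThread({ channelId, messageId, botToken, name }, fetchImpl = fetch) {
  const res = await fetchImpl(
    `${DISCORD_API}/channels/${channelId}/messages/${messageId}/threads`,
    {
      method: 'POST',
      headers: { Authorization: `Bot ${botToken}`, 'Content-Type': 'application/json' },
      // Thread names are limited to 100 characters; 1440 minutes archives after a day of inactivity.
      body: JSON.stringify({ name: name.slice(0, 100), auto_archive_duration: 1440 }),
    },
  );
  if (!res.ok) throw new Error(`thread creation failed: HTTP ${res.status}`);
  const thread = await res.json();
  return thread.id;
}

// Webhook posts land in a thread when the thread_id query parameter is set.
function threadUpdateUrl(webhookUrl, threadId) {
  const url = new URL(webhookUrl);
  url.searchParams.set('thread_id', threadId);
  return url.toString();
}
```

Storing the returned thread id next to the messageId keeps later duration updates pointed at the same timeline.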
Discord limitation: Do Not Disturb is per-user, not per-notification, and Discord cannot bypass it. This is why Discord is a community-visibility layer, not a wakeup layer. For any server where you need guaranteed wakeup, layer PagerDuty or OpsGenie on top of Discord — Discord gets the community-visible timeline, PagerDuty gets the guaranteed phone call. The layered pattern (Discord + PagerDuty simultaneously) costs nothing extra per alert — the bridge routes to both channels with Promise.allSettled in parallel.
Alert routing architecture: building the pipeline that ties all channels together
When you have more than one notification channel, you need an explicit routing architecture. Without it, you end up with: duplicate alerts in multiple channels (because each integration re-fires on every alert.updated event independently), Slack outages blocking PagerDuty notifications (because they are sequentially awaited in the same function), and no consistent severity classification (PagerDuty gets a critical, Discord gets a red embed, but they do not agree on what "critical" means).
The complete alert routing pipeline has six stages:
async function routeAlert(event) {
  // Stage 1: Classify severity
  const severity = classifySeverity(event);
  // Stage 2: Check maintenance window (resolve events always pass so open incidents can close)
  if (event.type !== 'alert.resolved' && maintenanceWindows.isActive(event.server_slug)) return;
  // Stage 3: Deduplicate against the alert state store
  const existing = await alertState.get(event.server_slug);
  if (event.type === 'alert.triggered') {
    if (existing?.status === 'open') return; // Already open
    await alertState.set(event.server_slug, { status: 'open', severity, openedAt: Date.now() });
  } else if (event.type === 'alert.resolved') {
    await alertState.set(event.server_slug, { ...existing, status: 'resolved' });
  }
  // Stage 4: Alert storm correlation: if >3 servers failed in last 2 minutes, aggregate
  const recentFailures = await alertState.countRecentTriggers(120_000);
  if (recentFailures > 3 && event.type === 'alert.triggered') {
    return routeToAggregatedIncident(event); // Single aggregated alert, not N individual alerts
  }
  // Stage 5: Fan-out to channels in parallel; failures are independent
  const channels = getChannelsForSeverity(severity);
  const results = await Promise.allSettled(channels.map(ch => ch.notify(event)));
  const failures = results.filter(r => r.status === 'rejected');
  if (failures.length) logChannelFailure(failures);
  // Stage 6: Schedule escalation check once, when the incident opens
  if (event.type === 'alert.triggered' && (severity === 'P1' || severity === 'P2')) {
    await escalationScheduler.schedule(event.server_slug, { afterMinutes: 5 });
  }
}
The Promise.allSettled fan-out is the most important architectural decision. Using Promise.all instead means a Slack outage (HTTP 503 from Slack's API) prevents PagerDuty from being notified — the most important notification fails because the least critical one threw an error. Promise.allSettled runs all channel notifications in parallel and captures failures without propagating them. You log the Slack failure separately and PagerDuty still fires.
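The difference is easy to demonstrate in isolation. The sketch below assumes each channel is an object with a name and a notify(event) method, which is an illustrative shape rather than a fixed interface; fanOut reports one result per channel instead of throwing.

```javascript
// Notifies every channel in parallel and reports per-channel outcomes.
async function fanOut(channels, event) {
  const results = await Promise.allSettled(
    // Wrapping in Promise.resolve().then() also captures channels that throw synchronously.
    channels.map((ch) => Promise.resolve().then(() => ch.notify(event))),
  );
  return results.map((r, i) => ({
    channel: channels[i].name,
    ok: r.status === 'fulfilled',
    error: r.status === 'rejected' ? String(r.reason?.message ?? r.reason) : null,
  }));
}
```

With Promise.all in place of Promise.allSettled, the first rejection would reject the whole call and the caller would never learn whether the remaining channels succeeded.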
Severity taxonomy for MCP servers:
| Severity | Trigger condition | Response SLA | Channels |
|---|---|---|---|
| P1: Critical | connection_refused or protocol_error on production/internal-tier server | Acknowledge within 5 min | PagerDuty (phone) + Slack + Discord |
| P2: High | timeout or error_rate_elevated on any server tier | Acknowledge within 15 min | PagerDuty (push) + Slack |
| P3: Medium | schema_drift on any tier; connection_refused on third-party dependency servers | Review within 1 hour | Slack only |
| P4: Low | Blip <3 minutes (recovered before first alert.updated fires) | Review at next business-hours review | Log only, no notification |
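The taxonomy translates fairly directly into a classifier. In this sketch the server_tier field and its values ('production', 'internal') are assumptions, as is the treatment of cases the table leaves open: a protocol_error on a non-production tier is classed as P3, and an unknown reason as P2. P4 is decided after recovery, so it appears only in the channel lookup.

```javascript
const P1_TIERS = new Set(['production', 'internal']); // hypothetical tier names

function classifySeverity(event) {
  const tier = event.server_tier ?? 'production'; // assumption: untagged servers count as production
  switch (event.failure_reason) {
    case 'connection_refused':
    case 'protocol_error':
      return P1_TIERS.has(tier) ? 'P1' : 'P3';
    case 'timeout':
    case 'error_rate_elevated':
      return 'P2';
    case 'schema_drift':
      return 'P3';
    default:
      return 'P2'; // assumption: unknown failures are treated as degraded
  }
}

const CHANNELS_BY_SEVERITY = new Map([
  ['P1', ['pagerduty', 'slack', 'discord']],
  ['P2', ['pagerduty', 'slack']],
  ['P3', ['slack']],
  ['P4', []], // log only
]);

function channelNamesForSeverity(severity) {
  return CHANNELS_BY_SEVERITY.get(severity) ?? [];
}
```

Keeping the classifier and the channel lookup in one module gives every bridge the same definition of "critical".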
Alert storm correlation handles the case where a shared dependency (a database, a VPN, a certificate authority) causes multiple MCP servers to fail simultaneously. Without correlation, each of your 10 monitored servers fires a P1 independently — you get 10 PagerDuty pages within 60 seconds. With correlation, the routing layer detects that more than 3 servers failed within a 2-minute window and routes a single aggregated incident ("10 servers failed simultaneously — likely shared dependency") rather than 10 individual pages. The investigation path for "10 servers failed at once" is completely different from "one server failed" — the correlation alert points directly at the shared dependency class.
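The correlation check itself is a sliding-window count of distinct servers. The sketch below keeps the window in memory for clarity; TriggerWindow and isAlertStorm are hypothetical names, and a multi-instance bridge would need shared storage instead.

```javascript
// Tracks recent alert.triggered events and counts distinct servers inside a sliding window.
class TriggerWindow {
  constructor(windowMs = 120_000) {
    this.windowMs = windowMs;
    this.triggers = []; // { serverSlug, at }
  }

  record(serverSlug, now = Date.now()) {
    this.triggers.push({ serverSlug, at: now });
  }

  countRecent(now = Date.now()) {
    const cutoff = now - this.windowMs;
    this.triggers = this.triggers.filter((t) => t.at > cutoff); // drop expired entries
    return new Set(this.triggers.map((t) => t.serverSlug)).size;
  }
}

// "More than 3 servers in the window" matches the threshold described above.
function isAlertStorm(triggerWindow, now = Date.now(), threshold = 3) {
  return triggerWindow.countRecent(now) > threshold;
}
```

Counting distinct slugs rather than raw events keeps one flapping server from being mistaken for a shared-dependency failure.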
Deduplication state store design: a Redis HASH keyed by alivemcp-{server_slug} with a 24-hour TTL. The hash stores status (open/resolved), severity, openedAt timestamp, and a channelAlertIds map (PagerDuty dedup_key, Discord messageId, OpsGenie alias). Having all channel IDs in one place makes the resolution path clean — when alert.resolved arrives, the routing layer reads the store, knows exactly which open incidents to close in each channel, and closes them all in one fan-out.
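The shape of that store can be sketched without a Redis dependency. The class below is an in-memory stand-in with the same key format, fields, and TTL behavior; in a Redis-backed version the writes would typically become HSET on the same key followed by EXPIRE. The class and method names are hypothetical.

```javascript
// In-memory stand-in for the Redis hash described above (same key format, fields, and TTL).
class AlertStateStore {
  constructor(ttlMs = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.entries = new Map(); // key → { expiresAt, fields }
  }

  static keyFor(serverSlug) {
    return `alivemcp-${serverSlug}`;
  }

  get(serverSlug, now = Date.now()) {
    const key = AlertStateStore.keyFor(serverSlug);
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= now) { // emulate TTL expiry
      this.entries.delete(key);
      return null;
    }
    return entry.fields;
  }

  open(serverSlug, severity, now = Date.now()) {
    const fields = { status: 'open', severity, openedAt: now, channelAlertIds: {} };
    this.entries.set(AlertStateStore.keyFor(serverSlug), { expiresAt: now + this.ttlMs, fields });
    return fields;
  }

  setChannelAlertId(serverSlug, channel, id, now = Date.now()) {
    const fields = this.get(serverSlug, now);
    if (!fields) throw new Error(`no alert state for ${serverSlug}`);
    fields.channelAlertIds[channel] = id;
  }

  // Marks the alert resolved and returns the per-channel ids that need closing.
  resolve(serverSlug, now = Date.now()) {
    const fields = this.get(serverSlug, now);
    if (!fields || fields.status !== 'open') return null;
    fields.status = 'resolved';
    return { ...fields.channelAlertIds };
  }
}
```

Returning the channel ids from resolve gives the resolution fan-out everything it needs in a single read.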
Incident runbook: what to do after the alert fires
The runbook is the document you wish you had when the alert wakes you at 2 AM. High-stakes, low-context, half-asleep — the worst possible conditions for diagnostic reasoning. A good runbook converts that moment into a sequence of specific, low-ambiguity steps: check this first, if you see X do Y, escalate when Z. For MCP servers, the entry point into the runbook is the AliveMCP failure_reason field — it names the failure mode before you open a single CLI tool.
The five playbooks, indexed by failure_reason:
1. connection_refused: process dead or container not running. Investigation sequence: (a) systemctl status mcp-server / pm2 list. Is the process running? (b) If dead: journalctl -u mcp-server -n 50 / pm2 logs --lines 50. Why did it crash? (c) dmesg | grep -i oom. Was it OOM-killed? (d) If running: curl -s localhost:3000/healthz. Is it listening on the expected port? (e) If the port is correct but the external probe still fails: check network/firewall rules, then check the Caddy/nginx config. Remediation: systemctl restart mcp-server / pm2 restart mcp-server. If OOM-killed, increase the memory limit before restarting.
2. protocol_error: process alive but MCP initialize fails. Investigation sequence: (a) Send a manual initialize probe: curl -X POST https://your-server.com -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"probe","version":"1"}}}'. What does the response contain? (b) git log --oneline -5. Was there a recent deploy? (c) node --version, and check the MCP SDK version in package.json. Is there an SDK version mismatch? (d) Check for syntax errors in tool registration code if a recent deploy changed tool definitions. Remediation: if caused by a deploy, roll back: git revert HEAD --no-edit && pm2 restart mcp-server.
3. timeout: initialize succeeds but tool calls are slow. Investigation sequence: (a) top / htop. Is CPU or memory maxed out? (b) Check external dependency status pages (database, API providers your tools call). (c) Check the DB connection pool: are all connections held by slow queries? (d) Check for a recent deploy that introduced an N+1 query or a missing index. Remediation: if an external dependency is down, there is no local fix; monitor for recovery and post a status update. If it is local resource exhaustion: scale up, or identify and kill the slow operation.
4. schema_drift: tools/list changed. Investigation sequence: (a) AliveMCP's dashboard shows the before/after tools/list comparison; read it. (b) git log --oneline -3. Does a recent deploy explain the change? (c) If a tool was removed, check whether any clients are calling it, because removed tools are breaking changes. Remediation: if the change was intentional, update documentation and notify clients. If unintentional, roll back.
5. error_rate_elevated: fraction of tool calls returning errors. Investigation sequence: (a) Check application error logs for exception patterns. What error type is recurring? (b) Check external dependency status. (c) Check AliveMCP's trend graph. Is the error rate rising (escalating problem) or steady (partial failure)? Remediation depends entirely on root cause; this is the playbook that most often ends in "wait for external dependency to recover."
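For the protocol_error playbook, the manual probe can also be scripted so the response is interpreted the same way every time. The sketch below builds the initialize request and classifies a parsed response; the function names and the diagnosis strings are illustrative, and the checks cover only the fields an MCP initialize result is expected to carry (protocolVersion, capabilities, serverInfo).

```javascript
// Builds the JSON-RPC initialize request used by the manual probe.
function buildInitializeRequest(id = 1) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'initialize',
    params: {
      protocolVersion: '2024-11-05',
      capabilities: {},
      clientInfo: { name: 'probe', version: '1' },
    },
  };
}

// Classifies a parsed response body so the responder knows what kind of failure it is.
function diagnoseInitializeResponse(response, requestId = 1) {
  if (typeof response !== 'object' || response === null) return 'not_json_object';
  if (response.jsonrpc !== '2.0') return 'not_jsonrpc';
  if (response.id !== requestId) return 'id_mismatch';
  if (response.error) return `jsonrpc_error:${response.error.code}`;
  const result = response.result;
  if (!result || typeof result.protocolVersion !== 'string' || !result.serverInfo) {
    return 'malformed_result';
  }
  return 'ok';
}
```

A jsonrpc_error points at the server's own handler, while not_jsonrpc usually means a proxy or error page answered instead of the MCP process.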
Escalation decision tree: if you cannot identify root cause within 15 minutes, escalate immediately. Do not extend the investigation window because the problem feels like it should be solvable quickly — 15 minutes of additional investigation that does not reach a conclusion is better given to a second pair of eyes. Post a status update at 15 minutes regardless of whether you have found the cause: "Investigating MCP server downtime, root cause not yet identified, update in 15 minutes." The update timer prevents a situation where the investigation continues in silence and users perceive silence as "no one is working on it."
Pre-incident runbook hygiene: the runbook must be accessible from a phone without internet connection to the MCP server's hosting environment (it should be a static page, not a file on the server that's down). Commit the runbook URL into the PagerDuty alert payload (custom_details.runbook) and the OpsGenie alert details — when the page fires, the person who receives it sees the runbook link before they open a second tab.
Putting it together: recommended stack by author profile
The five components are not all required simultaneously. Which ones to deploy depends on the author profile and the MCP server's tier:
| Profile | Detection | Routing architecture | Wakeup | Community | Runbook |
|---|---|---|---|---|---|
| Solo indie author, free-tier MCP server | AliveMCP | Not needed (single channel) | Discord only (no SLA) | Discord | Minimal (connection_refused playbook only) |
| Solo author, paid-tier MCP server or SLA commitment | AliveMCP | Single-channel bridge | PagerDuty (push + phone escalation) | Optional Discord layer | Full 5-playbook runbook |
| Small team (2–5), multiple MCP servers, different owners | AliveMCP | Required (multi-channel fan-out + severity taxonomy) | PagerDuty or OpsGenie with on-call rotation | Slack for team channel | Full runbook + escalation decision tree |
| Organization with Atlassian stack | AliveMCP | Required (team-based routing, alert storm correlation) | OpsGenie (team routing + Jira auto-ticket) | Slack for team visibility | Full runbook + postmortem template in Confluence |
The common thread across all profiles: the routing pipeline must send the resolved signal through the same path as the trigger. Auto-resolution is non-negotiable. A PagerDuty incident that is not auto-resolved by alert.resolved must be closed manually, and manual-close discipline degrades under load. The result is stale open incidents that pollute the incident history, invalidate MTTD/MTTR metrics, and cause alert fatigue when team members start ignoring "always-open" incidents. Every bridge implementation must handle alert.resolved explicitly.
Frequently asked questions
Do I need both PagerDuty and Discord, or is one enough?
It depends on whether you need guaranteed wakeup. Discord does not bypass Do Not Disturb — if you are asleep and your phone is on DND, a Discord webhook notification will not wake you. If your MCP server has users with expectations about availability, or if a downtime event at 2 AM has meaningful consequence, you need PagerDuty (or OpsGenie) as the wakeup layer. Discord is excellent as a community-visibility layer that shows your users you are on top of incidents — but it is not a replacement for a system designed to wake up a human.
How do I prevent alert fatigue from minute-by-minute re-checks?
Deduplication at the routing layer is the structural answer. For PagerDuty: use the same dedup_key for every alert.triggered and alert.updated event from the same server. PagerDuty will suppress all updates and not re-page. For Discord: use the message-edit pattern with ?wait=true — update the same embed instead of posting new messages. For Slack: use the ts timestamp to call chat.update on the original message. The routing architecture's deduplication state store (Redis, 24h TTL, keyed by server slug) is the source of truth for which channel IDs to update vs create.
What if the alert routing bridge itself goes down?
Design the bridge for independence from the MCP server it monitors. Deploy it on separate infrastructure (a different VPS, a different cloud provider, or a serverless function) so that the same failure that took down the MCP server does not also take down the alert bridge. OpsGenie's Heartbeat feature provides a dead-man switch for the bridge itself: have the bridge ping the Heartbeat URL every 5 minutes; if OpsGenie stops receiving the ping because the bridge went down, OpsGenie fires an alert independently of the webhook bridge. For PagerDuty, use PagerDuty's native email integration as a backup: point a monitored email address at PagerDuty's inbound email integration, so that even if the webhook bridge is down, a manually sent email from any device can open a PagerDuty incident.
Should the alert routing bridge include the runbook URL in the notification?
Yes, always. The runbook URL should appear in every notification that reaches a human — in the PagerDuty custom_details.runbook field, in the OpsGenie alert details, in the Discord embed footer, and in the Slack message. The person who receives a 2 AM page should be able to open the correct runbook playbook within 30 seconds of receiving the alert, without navigating to any other document or system. AliveMCP's failure_reason field in the webhook payload tells the bridge which playbook URL to include: connection_refused links directly to the "Connection Refused" section of the runbook, not just the runbook homepage.
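Selecting the playbook link can be a simple lookup keyed by failure_reason. The base URL below is the runbook link used in the bridge examples; the anchor names are hypothetical and need to match whatever section ids the runbook page actually uses.

```javascript
const RUNBOOK_BASE = 'https://alivemcp.com/seo/mcp-server-runbook';

// Hypothetical anchors: adjust to the real section ids on your runbook page.
const RUNBOOK_ANCHORS = new Map([
  ['connection_refused', 'connection-refused'],
  ['protocol_error', 'protocol-error'],
  ['timeout', 'timeout'],
  ['schema_drift', 'schema-drift'],
  ['error_rate_elevated', 'error-rate-elevated'],
]);

function runbookUrlFor(failureReason) {
  const anchor = RUNBOOK_ANCHORS.get(failureReason);
  // Fall back to the runbook homepage when the reason has no dedicated playbook.
  return anchor ? `${RUNBOOK_BASE}#${anchor}` : RUNBOOK_BASE;
}
```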
How often should the incident runbook be updated?
After every postmortem where the runbook did not correctly describe the investigation path. The runbook is wrong if the actual investigation found a step that the runbook did not suggest, or if the runbook suggested steps that were not relevant to this failure type. Treat runbook updates as a postmortem action item with the same priority as code fixes — a runbook that is wrong in the middle of an incident is worse than no runbook, because it sends the responder down an incorrect investigation path under time pressure. A runbook that is reviewed and updated after every incident converges quickly on an accurate representation of the actual failure modes your specific server experiences.