
PagerDuty Integration for MCP Servers — route downtime alerts to on-call

When an MCP server goes down at 2 AM, a Slack message no one reads is not an alert — it is a log entry. PagerDuty turns AliveMCP's detection signal into a structured incident that pages the right person, escalates if they don't acknowledge, and auto-resolves when the server recovers. This guide walks through the complete integration: Events API v2 payload format, deduplication key strategy, escalation policies sized for indie authors and small teams, and the exact Node.js bridge code to run between AliveMCP webhooks and PagerDuty.

TL;DR

Create a PagerDuty service, grab its Events API v2 integration key, and write a small webhook bridge that forwards AliveMCP alert.triggered events to https://events.pagerduty.com/v2/enqueue with event_action: "trigger" and a stable per-server dedup_key such as alivemcp-<server_slug>. When AliveMCP sends alert.resolved, forward with event_action: "resolve" and the same dedup_key, and PagerDuty closes the incident automatically. Use a single escalation policy that pages you immediately and escalates to a backup after 5 minutes. For multi-server setups, use PagerDuty's Event Orchestration routing rules, or separate services with their own integration keys, to direct alerts from different servers to different on-call teams.

Why PagerDuty for MCP server incidents

Most MCP server authors start with Slack alerts and quickly discover two problems. First, Slack does not wake you up — a channel notification at 2 AM is silenced by Do Not Disturb. Second, Slack has no acknowledgment model: there is no way to know whether anyone has seen the alert, who is handling it, or whether it resolved. PagerDuty solves both problems.

PagerDuty's core value for MCP server monitoring is the on-call rotation model. If you have a solo side project, PagerDuty pages you via push notification, phone call, and SMS in sequence until you acknowledge. If you have a team, PagerDuty rotates the on-call burden so one person doesn't carry it indefinitely. For teams running MCP servers in production — especially the Author tier ($9/mo) and Team tier ($49/mo) use cases — the escalation model is the difference between a 5-minute recovery and a 4-hour outage discovered by a user.

Channel	Wakes you up?	Escalates on no-ack?	Auto-resolves?	Best for
Slack	No (DND)	No	No	Daytime awareness
Email	No	No	No	Digest / low-priority
Discord	Push only	No	No	Dev community teams
PagerDuty	Yes (call + SMS + push)	Yes	Yes	Production on-call
OpsGenie	Yes	Yes	Yes	Team-based routing

PagerDuty integration architecture

AliveMCP sends HTTP POST webhooks to a URL you configure on your Author or Team plan. The simplest integration is a small serverless function — a Cloudflare Worker, a Vercel edge function, or a Node.js Express route — that receives AliveMCP webhooks, transforms the payload, and forwards to PagerDuty's Events API v2.

The data flow is:

1. AliveMCP detects your MCP server is down and sends a POST with alert.triggered to your webhook URL.
2. Your bridge function transforms the payload and calls https://events.pagerduty.com/v2/enqueue with event_action: "trigger".
3. PagerDuty opens an incident and pages the on-call person.
4. The on-call person acknowledges the incident.
5. AliveMCP detects recovery and sends alert.resolved.
6. Your bridge sends event_action: "resolve" with the same dedup_key.
7. PagerDuty closes the incident automatically.

You do not need PagerDuty's native webhook receiver or the PagerDuty app in AliveMCP — a simple bridge gives you full control over payload mapping and routing logic.

Setting up the PagerDuty service and integration

In PagerDuty, create a new Service for your MCP server monitoring. Name it something like "MCP Server Uptime". In the service's integrations tab, add an Events API v2 integration. PagerDuty will generate an integration key (a 32-character string). Copy this key: it is the routing_key in every Events API call, so store it as a secret rather than committing it to source control.

Set up an Escalation Policy on the service. For a solo author:

Level 1: Page you immediately via push notification + phone call (PagerDuty app on your phone).
Level 2: Escalate after 5 minutes to a backup contact (could be the same person with SMS as the channel, or a trusted colleague).
High-urgency vs low-urgency: Use the service's urgency settings so that brief blips do not trigger a 2 AM call. Treat an incident as low-urgency (push notification only) for its first 5 minutes, and as high-urgency (phone call + SMS) only once the outage is sustained beyond that.

For a team setup, create on-call schedules with weekly rotation and assign them to escalation levels. PagerDuty's schedule editor handles timezone-aware handoffs and allows override scheduling for planned absence.

AliveMCP webhook to PagerDuty Events API bridge

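The following Node.js handler is written in Express style (req, res) and handles the AliveMCP → PagerDuty translation. The same logic ports to Fastify, Cloudflare Workers, or Vercel Edge once you swap in that platform's request, response, and environment-variable APIs.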

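// alivemcp-pagerduty-bridge.js
// Express-style handler; needs Node.js 18+ for the global fetch().
const crypto = require('node:crypto');

const PAGERDUTY_ROUTING_KEY = process.env.PAGERDUTY_ROUTING_KEY;
const ALIVEMCP_WEBHOOK_SECRET = process.env.ALIVEMCP_WEBHOOK_SECRET;

// Verify the webhook signature (HMAC-SHA256 of the raw body).
// Assumes the header carries a hex digest; adjust if your webhook settings differ.
function isValidSignature(signature, rawBody) {
  if (typeof signature !== 'string' || rawBody == null) return false;
  const expected = crypto.createHmac('sha256', ALIVEMCP_WEBHOOK_SECRET).update(rawBody).digest();
  const received = Buffer.from(signature, 'hex');
  // crypto.timingSafeEqual throws on a length mismatch, so compare lengths first
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}

async function handleAliveMCPWebhook(req, res) {
  // req.rawBody must hold the unparsed request body, e.g. captured with
  // express.json({ verify: (req, _res, buf) => { req.rawBody = buf; } })
  if (!isValidSignature(req.headers['x-alivemcp-signature'], req.rawBody)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const event = req.body;
  const { type, server_slug, server_name, failure_reason, check_url } = event;
  const dedup_key = `alivemcp-${server_slug}`;   // stable key per server

  try {
    if (type === 'alert.triggered') {
      await triggerPagerDuty({
        routing_key: PAGERDUTY_ROUTING_KEY,
        event_action: 'trigger',
        dedup_key,
        payload: {
          summary: `MCP server down: ${server_name}`,
          source: 'AliveMCP',
          severity: 'critical',
          custom_details: {
            server_slug,
            failure_reason,
            check_url,
            dashboard: `https://alivemcp.com/status/${server_slug}`,
          }
        },
        links: [
          { href: `https://alivemcp.com/status/${server_slug}`, text: 'AliveMCP dashboard' },
          { href: check_url, text: 'MCP endpoint' }
        ]
      });
    } else if (type === 'alert.resolved') {
      await triggerPagerDuty({
        routing_key: PAGERDUTY_ROUTING_KEY,
        event_action: 'resolve',
        dedup_key,   // same key as the trigger, so the open incident is closed
      });
    }
  } catch (err) {
    console.error(err);
    // Respond with a non-2xx status so the failed delivery is retried
    return res.status(502).json({ error: 'PagerDuty delivery failed' });
  }

  res.status(200).json({ ok: true });
}

// POST one event to the Events API v2; throws if PagerDuty does not accept it.
async function triggerPagerDuty(body) {
  const resp = await fetch('https://events.pagerduty.com/v2/enqueue', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!resp.ok) {
    throw new Error(`PagerDuty error: ${resp.status} ${await resp.text()}`);
  }
}

module.exports = { handleAliveMCPWebhook };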

Three details matter in this bridge:

dedup_key is derived from the server slug. PagerDuty deduplicates events with the same dedup_key within a service. If AliveMCP sends multiple alert.triggered events for the same server (because the server keeps failing the health check every minute), PagerDuty will not open multiple incidents. The first event creates the incident, and subsequent events with the same dedup_key are appended to it while it remains open.
Severity is always critical. MCP server downtime means your users' LLM interactions are failing. There is no "warning" level for a down MCP server: it is either up or down. Use the PagerDuty service's urgency settings to control call timing, not the severity field.
Include the AliveMCP dashboard link. The first thing the on-call person does is check context — how long has it been down, what is the failure reason, is this a flap or sustained. The links field in the Events API payload puts the AliveMCP status page one tap away in the PagerDuty mobile app.
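
To make the deduplication behaviour in the first point concrete, here is a toy in-memory model of how events that share a dedup_key collapse into a single incident. It only illustrates the behaviour described above and is not PagerDuty's implementation; applyEvent, the incident shape, and the example slug are hypothetical.

```javascript
// Toy model of dedup_key semantics. NOT PagerDuty's implementation.
// incidents maps dedup_key -> { status, eventCount } (hypothetical shape).
function applyEvent(incidents, event) {
  const existing = incidents.get(event.dedup_key);
  if (event.event_action === 'trigger') {
    if (existing && existing.status === 'open') {
      existing.eventCount += 1; // repeat trigger: appended to the open incident
    } else {
      incidents.set(event.dedup_key, { status: 'open', eventCount: 1 });
    }
  } else if (event.event_action === 'resolve' && existing) {
    existing.status = 'resolved'; // same key closes the incident
  }
  return incidents;
}

const incidents = new Map();
const key = 'alivemcp-weather-mcp'; // example slug, made up for illustration

// Three failed checks in a row produce three trigger events...
for (let i = 0; i < 3; i++) {
  applyEvent(incidents, { event_action: 'trigger', dedup_key: key });
}
console.log(incidents.size, incidents.get(key)); // ...but only one incident

// Recovery sends a resolve with the same key.
applyEvent(incidents, { event_action: 'resolve', dedup_key: key });
console.log(incidents.get(key).status);
```

The practical consequence is that the key must be identical across the trigger and the resolve. If the bridge derived it from anything that changes between events, such as a timestamp, the resolve would never match the open incident.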

Alert routing for multi-server setups

If you monitor multiple MCP servers — your own plus several third-party ones your application depends on — you need routing rules to direct alerts to the right team or person. PagerDuty's Event Orchestration (available on Business plans) lets you route events from a single global routing key to different services based on event payload fields.

For Team tier users monitoring private endpoints, a simpler approach is to use separate PagerDuty services per MCP server category (e.g., "Internal MCP Servers" and "Third-party MCP Dependencies"), each with its own integration key. Your bridge function selects the routing key based on the server_tags field in the AliveMCP webhook payload.

// Route to different PagerDuty services based on server tags
const ROUTING_KEYS = {
  internal: process.env.PD_KEY_INTERNAL,
  author:   process.env.PD_KEY_AUTHOR,
  thirdparty: process.env.PD_KEY_THIRDPARTY,
};

function selectRoutingKey(event) {
  if (event.server_tags?.includes('internal')) return ROUTING_KEYS.internal;
  if (event.server_tags?.includes('claimed'))  return ROUTING_KEYS.author;
  return ROUTING_KEYS.thirdparty;
}

This lets you configure different escalation policies for different server categories: internal servers get immediate on-call pages, while third-party dependency alerts create low-urgency incidents for morning review.
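
One way to sanity-check the routing logic before deploying is to run it against sample events. The sketch below restates the tag-based selection with the keys passed in as a parameter, so it can be exercised without environment variables. pickRoutingKey, the placeholder key values, and the sample tags are illustrative only.

```javascript
// Same tag-based selection as above, with the keys injected for easy testing.
function pickRoutingKey(event, keys) {
  const tags = event.server_tags ?? [];
  if (tags.includes('internal')) return keys.internal;
  if (tags.includes('claimed')) return keys.author;
  return keys.thirdparty; // default for untagged or unknown servers
}

// Placeholder values; real integration keys come from each PagerDuty service.
const sampleKeys = {
  internal: 'KEY_INTERNAL',
  author: 'KEY_AUTHOR',
  thirdparty: 'KEY_THIRDPARTY',
};

const routed = [
  { server_tags: ['internal', 'claimed'] }, // 'internal' wins because it is checked first
  { server_tags: ['claimed'] },
  {},                                        // no tags at all
].map((event) => pickRoutingKey(event, sampleKeys));

console.log(routed);
```

Note that the order of the checks is the routing priority: a server tagged both internal and claimed goes to the internal service.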

Suppression and maintenance windows

Two noise sources cause alert fatigue in MCP server monitoring: planned maintenance (you are deploying a new version, the server will be down for 30 seconds) and flapping (the server alternates between up and down faster than you can respond).

Maintenance windows: AliveMCP's Team tier lets you schedule maintenance windows that suppress alerts during planned downtime. Pass the maintenance window ID in the AliveMCP API call and the webhook will not fire during the window. If your PagerDuty bridge receives an alert during a period you know is maintenance, you can also suppress on the bridge side by checking a local flag set by your deploy script.
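
Bridge-side suppression can be as small as a timestamp check. The sketch below assumes the deploy script writes an expiry time (epoch milliseconds) somewhere the bridge can read it, such as an environment variable or a key-value store. isInMaintenance, shouldForward, and the idea of an expiry value are hypothetical and not part of AliveMCP.

```javascript
// Hypothetical bridge-side maintenance check.
// maintenanceUntil: epoch ms written by the deploy script, or null/undefined/'' when unset.
function isInMaintenance(maintenanceUntil, now = Date.now()) {
  if (maintenanceUntil == null || maintenanceUntil === '') return false;
  const until = Number(maintenanceUntil);
  // Unparsable values count as "no maintenance" so real alerts are never swallowed.
  if (!Number.isFinite(until)) return false;
  return now < until;
}

function shouldForward(event, maintenanceUntil, now = Date.now()) {
  // Always forward resolves so an incident opened before the window can still close.
  if (event.type === 'alert.resolved') return true;
  return !isInMaintenance(maintenanceUntil, now);
}

const now = Date.parse('2025-01-01T02:00:00Z');
const until = now + 60_000; // deploy script opened a one-minute window

const duringWindow = shouldForward({ type: 'alert.triggered' }, until, now);
const resolveDuringWindow = shouldForward({ type: 'alert.resolved' }, until, now);
const afterWindow = shouldForward({ type: 'alert.triggered' }, until, now + 61_000);
console.log({ duringWindow, resolveDuringWindow, afterWindow });
```

Using an expiry time rather than a boolean flag means a deploy script that crashes midway cannot leave alerting switched off indefinitely.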

Flap suppression: AliveMCP's flap detection uses a sliding window: it does not send alert.triggered until the server has been down for at least 2 consecutive checks (2 minutes). This eliminates single-check false positives. At the PagerDuty layer, you can add a second suppression tier: only escalate to phone call after the incident has been open for 5 minutes. Configure this in the service's urgency settings — "low urgency" for the first 5 minutes (push notification only), "high urgency" after that (phone call + SMS).
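
The two-consecutive-checks rule can be pictured as a small state machine. The following is an illustrative sketch of that kind of rule, not AliveMCP's actual flap-detection code; createFlapFilter and its threshold parameter are made-up names.

```javascript
// Illustrative only: emit 'alert.triggered' after `threshold` consecutive failures,
// and 'alert.resolved' on the first passing check after an alert has fired.
function createFlapFilter(threshold = 2) {
  let failures = 0;
  let alerting = false;
  return function observe(checkPassed) {
    if (checkPassed) {
      failures = 0;
      if (alerting) {
        alerting = false;
        return 'alert.resolved';
      }
      return null;
    }
    failures += 1;
    if (!alerting && failures >= threshold) {
      alerting = true;
      return 'alert.triggered';
    }
    return null; // below threshold, or already alerting
  };
}

const observe = createFlapFilter(2);
// up, a single blip, up again, then a sustained outage, then recovery
const checks = [true, false, true, false, false, false, true];
const emitted = checks.map((passed) => observe(passed));
console.log(emitted);
```

The single failed check at position two produces no event at all, which is why one-check blips never reach PagerDuty under a rule like this.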

What PagerDuty does not replace

PagerDuty is an alerting and escalation tool, not a monitoring tool. It does not probe your MCP server — it receives the detection signal from AliveMCP and routes it to the right person. The monitoring gap matters: if AliveMCP does not detect the failure, PagerDuty never fires.

AliveMCP probes the full MCP protocol stack from outside your infrastructure: it sends an actual initialize handshake, validates the protocol version in the response, checks for schema drift in tools/list, and measures response time. A server that responds HTTP 200 to a basic health check but returns a malformed MCP response will not fool AliveMCP — but it will fool a simple ping monitor. This is what makes the combination of AliveMCP detection + PagerDuty routing effective for MCP-specific incidents: the detection layer understands the protocol, the alerting layer handles the human escalation.
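
To show why a protocol-level probe catches failures that a plain HTTP check misses, here is a minimal sketch of validating an MCP initialize response. It follows the JSON-RPC 2.0 message shape used by the Model Context Protocol, but it is not AliveMCP's probe. checkInitializeResponse, the client name, and the list of accepted versions are assumptions chosen for illustration.

```javascript
// Shape of an MCP initialize request (JSON-RPC 2.0). Client name/version are placeholders.
const initializeRequest = {
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: {
    protocolVersion: '2024-11-05',
    capabilities: {},
    clientInfo: { name: 'example-probe', version: '0.0.1' },
  },
};

// Returns a list of problems; an empty list means the response looks like a valid handshake.
function checkInitializeResponse(response, request, supportedVersions) {
  const problems = [];
  if (response?.jsonrpc !== '2.0') problems.push('not a JSON-RPC 2.0 message');
  if (response?.id !== request.id) problems.push('response id does not match request id');
  if (response?.error) problems.push(`server returned error: ${response.error.message}`);
  const version = response?.result?.protocolVersion;
  if (typeof version !== 'string') {
    problems.push('missing result.protocolVersion');
  } else if (!supportedVersions.includes(version)) {
    problems.push(`unsupported protocol version: ${version}`);
  }
  return problems;
}

const healthy = {
  jsonrpc: '2.0',
  id: 1,
  result: {
    protocolVersion: '2024-11-05',
    capabilities: {},
    serverInfo: { name: 'demo-server', version: '1.0.0' },
  },
};
// What a generic health endpoint might return alongside HTTP 200.
const malformed = { status: 'ok' };

const healthyProblems = checkInitializeResponse(healthy, initializeRequest, ['2024-11-05']);
const malformedProblems = checkInitializeResponse(malformed, initializeRequest, ['2024-11-05']);
console.log(healthyProblems, malformedProblems);
```

Both responses could arrive with HTTP 200, yet only the first is a usable MCP handshake. A ping monitor sees no difference between them.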

Frequently asked questions

Do I need a paid PagerDuty plan to use this integration?

PagerDuty's free tier (Developer plan) supports up to 5 users and includes the Events API v2, escalation policies, and on-call scheduling — enough for a solo author or small team. The Events API v2 integration key is available on all plans including free. You do not need PagerDuty's premium features (Intelligent Alert Grouping, Event Intelligence, AIOps) for basic MCP server alerting. The main limitation of the free plan is the 5-user ceiling and the lack of multi-schedule override management, which matters only when you are rotating on-call across a team larger than 5.

How do I prevent PagerDuty from calling me for every brief blip?

Two layers of suppression work together. First, AliveMCP only sends alert.triggered after 2 consecutive failed checks (2 minutes of confirmed downtime), so brief blips never reach PagerDuty at all. Second, configure your PagerDuty service to use low-urgency notification for the first 5 minutes after an alert triggers: low-urgency sends a push notification only (not a phone call). After 5 minutes without acknowledgment, the policy escalates to high-urgency (phone call + SMS). This means a 2-minute blip wakes you via push only; a sustained 7-minute outage wakes you via phone call. Tune the thresholds based on your server's criticality and your own tolerance for 2 AM calls.

Can I use PagerDuty's native AliveMCP integration instead of a custom bridge?

AliveMCP does not currently have a native PagerDuty app in the PagerDuty integration directory. The custom bridge described here is the recommended approach. It is also more flexible: you can add routing logic, payload enrichment (appending the last 3 failure reasons), and per-server escalation rules that a native integration might not support. The bridge code is under 100 lines and runs free on Cloudflare Workers (the free tier handles millions of requests per month, far more than MCP server alerts will ever generate).

What happens if my bridge function is down when AliveMCP fires an alert?

AliveMCP retries failed webhook deliveries with exponential backoff: 1 minute, 5 minutes, 15 minutes, then hourly for 24 hours. If your bridge is temporarily unavailable, the alert will be delivered once the bridge recovers. For the case where both AliveMCP and your bridge are involved in the same infrastructure incident (unlikely but possible), configure PagerDuty's native heartbeat monitoring as a secondary check: create a heartbeat that expects a ping from AliveMCP every 5 minutes; if the ping stops, PagerDuty pages you directly. AliveMCP's uptime page (alivemcp.com/status/alivemcp-itself) is also publicly visible so you can check whether AliveMCP itself is experiencing issues.
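
The retry schedule only helps if the bridge reports failures honestly. The sketch below maps the HTTP status returned by the Events API v2 to the status the bridge hands back to AliveMCP. It assumes that AliveMCP treats any non-2xx webhook response as a failed delivery. bridgeStatusFor is a hypothetical helper, and the choice to acknowledge a 400 is a design decision rather than a requirement.

```javascript
// Map a PagerDuty Events API v2 HTTP status to the status the bridge returns to AliveMCP.
// Assumption: AliveMCP retries a delivery when the webhook responds with a non-2xx status.
function bridgeStatusFor(pagerDutyStatus) {
  // 202 Accepted: the event was queued by PagerDuty.
  if (pagerDutyStatus === 202) return 200;
  // 400 Bad Request: the event body is invalid. Retrying the same body cannot succeed,
  // so acknowledge the webhook, log loudly, and fix the payload mapping instead.
  if (pagerDutyStatus === 400) return 200;
  // 429 (rate limited), 5xx, or anything unexpected: signal failure so delivery is retried.
  return 503;
}

const outcomes = [202, 400, 429, 500].map((status) => [status, bridgeStatusFor(status)]);
console.log(outcomes);
```

Because every event carries a stable dedup_key, a delivery that is retried after PagerDuty already accepted the first attempt updates the existing incident rather than opening a second one.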

How do I test the integration without waiting for a real outage?

Send a test event directly to the Events API v2 endpoint using curl:

curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key": "YOUR_KEY", "event_action": "trigger", "dedup_key": "test-001", "payload": {"summary": "Test: MCP server down", "source": "AliveMCP", "severity": "critical"}}'

This triggers a real incident in PagerDuty, so verify that you receive the notification on your phone. Then send the resolve event with the same dedup_key to confirm auto-resolution. AliveMCP also has a "send test alert" button on the webhook configuration page that fires a simulated alert.triggered event through your configured webhook URL, exercising the full pipeline including your bridge code.