MCP server webhook alerts

Webhook alerts give you programmatic control over what happens when your MCP server goes down. Instead of relying on a monitoring service's built-in notification UI, you receive a structured HTTP POST to an endpoint you own — and you decide what to do with it: page an on-call engineer, post to Slack, open a ticket, trigger an auto-remediation script, or all of the above. The flexibility comes with responsibility: your webhook endpoint needs to be fast, idempotent, and secure. This guide covers every decision point from payload schema design to HMAC signature verification to retry handling.

TL;DR

A webhook alert is a POST request with a JSON payload to an endpoint you control. Your endpoint must respond with a 2xx status within the delivery timeout (typically 10–30 seconds). Use a dedup_key field in the payload for idempotency — retry logic means you may receive the same alert more than once. Sign payloads with HMAC-SHA256 and verify the signature before processing. Test locally with a free inspection service before deploying. AliveMCP Author tier ships webhook routing with configurable URLs, signatures, and per-severity routing out of the box.

Webhook payload schema

A well-designed webhook payload gives the consumer everything it needs to act without a follow-up API call. The minimum required fields:

{
  "event": "downtime_started",
  "dedup_key": "alivemcp-incident-7f3a91",
  "server_slug": "my-mcp-server",
  "server_url": "https://api.example.com/mcp",
  "failure_layer": "initialize",
  "severity": "P1",
  "started_at": "2026-06-01T14:32:00Z",
  "probe_count": 3,
  "last_error": "connection refused on port 443",
  "dashboard_url": "https://alivemcp.com/status/my-mcp-server"
}

Key fields explained:

event: one of downtime_started, downtime_resolved, slo_breach_warning, schema_drift_detected. Your consumer branches on this field.
dedup_key: stable identifier for the incident lifecycle. The same key appears in both the downtime_started and downtime_resolved events, allowing your consumer to thread recovery alerts into the original incident. It also lets you deduplicate retries — if you receive two deliveries with the same dedup_key and event, the second is a retry of the first.
failure_layer: which protocol layer failed — transport, http, initialize, or tools_list. Use this for alert routing: transport failures go to infrastructure on-call; initialize failures go to the MCP server developer. See MCP server downtime alerting for the full severity-per-layer routing table.
probe_count: how many consecutive failed probes triggered this alert. A value of 3 at a 60-second cadence means the server has been failing for at least 2 minutes (three probes span two intervals) and possibly close to 3. A higher value means the outage ran longer before the first successful delivery to your webhook.
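
As an illustration of routing on failure_layer, here is a minimal sketch. Only the transport and initialize rows come from the text above; the team names, the fallback, and the routeFor helper are assumptions made for the example.

```javascript
// Routing table keyed by failure_layer. Only the transport and initialize
// rows come from the guide; the team names and the fallback are assumptions.
const ROUTES = new Map([
  ['transport', 'infrastructure-oncall'],
  ['initialize', 'mcp-server-developer'],
]);

// Assumed catch-all for http, tools_list, and any unknown layer.
const FALLBACK_ROUTE = 'infrastructure-oncall';

function routeFor(payload) {
  return ROUTES.get(payload.failure_layer) ?? FALLBACK_ROUTE;
}
```

A Map is used instead of a plain object so that an unexpected failure_layer value such as "constructor" cannot resolve to an inherited property.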

Recovery payloads use the same schema with "event": "downtime_resolved" plus two additional fields: a resolved_at timestamp and duration_seconds. Use dedup_key to thread the recovery into your original incident ticket or PagerDuty incident.
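
Threading by dedup_key can be sketched as follows. The in-memory Map stands in for a real incident tracker, and applyEvent and its return values are hypothetical names chosen for this example.

```javascript
// In-memory stand-in for an incident tracker, keyed by dedup_key.
const incidents = new Map();

function applyEvent(payload) {
  switch (payload.event) {
    case 'downtime_started':
      // A late retry of the start event must not reopen a resolved incident.
      if (incidents.has(payload.dedup_key)) return 'duplicate';
      incidents.set(payload.dedup_key, {
        server: payload.server_slug,
        startedAt: payload.started_at,
        status: 'open',
      });
      return 'opened';
    case 'downtime_resolved': {
      const incident = incidents.get(payload.dedup_key);
      if (!incident) return 'unmatched'; // recovery without a known start event
      incident.status = 'resolved';
      incident.resolvedAt = payload.resolved_at;
      incident.durationSeconds = payload.duration_seconds;
      return 'resolved';
    }
    default:
      // slo_breach_warning and schema_drift_detected are not handled in this sketch.
      return 'ignored';
  }
}
```

The 'unmatched' branch matters in practice: with at-least-once delivery, a recovery can reach a consumer that never saw the start event, for example after a restart that cleared in-memory state.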

HTTP delivery mechanics

The monitoring system sends a POST request with Content-Type: application/json to your configured endpoint. Your endpoint must:

Respond within the timeout window. Most monitoring webhooks use a 10–30 second delivery timeout. If your endpoint takes longer than that to respond, the delivery is treated as a failure and retried. Respond with 2xx immediately after receiving and validating the request, then process asynchronously in a background queue. Keep slow work (database writes, downstream API calls, ticket creation) out of the request handler.
Return an appropriate HTTP status. 200, 201, or 204 all indicate successful delivery. 4xx responses indicate a permanent delivery failure (the monitoring system will not retry a 4xx — the assumption is that 4xx means your endpoint rejected the payload intentionally). 5xx responses indicate a transient failure and trigger retry logic.
Handle duplicate deliveries. Retry logic means you may receive the same payload more than once. Your processing logic must be idempotent — processing the same event twice must produce the same outcome as processing it once. Use dedup_key + event as the idempotency key: if you've already processed this combination, return 200 without re-processing.
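
The dedup_key + event idempotency check might look like the sketch below. The createIdempotencyStore helper, its claim method, and the 24-hour retention are assumptions for illustration; a production consumer would more likely back this with a shared store than with process memory.

```javascript
// Remembers which (dedup_key, event) pairs have been processed.
function createIdempotencyStore(ttlMs = 24 * 60 * 60 * 1000) {
  const seen = new Map(); // idempotency key -> expiry time in ms

  return {
    // Returns true the first time a pair is seen, false for a duplicate
    // that arrives before the entry expires.
    claim(payload, nowMs = Date.now()) {
      const key = `${payload.dedup_key}:${payload.event}`;
      const expiry = seen.get(key);
      if (expiry !== undefined && expiry > nowMs) return false;
      seen.set(key, nowMs + ttlMs);
      return true;
    },
  };
}
```

In a handler, a false result from claim means the delivery is a retry: respond 200 and skip processing.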

Retry logic and delivery guarantees

At-least-once delivery is the standard webhook guarantee. The monitoring system retries on 5xx responses and timeout failures. A typical retry policy:

Attempt 1: immediate on event trigger.
Attempt 2: 30 seconds after attempt 1 failure.
Attempt 3: 2 minutes after attempt 2 failure.
Attempt 4: 10 minutes after attempt 3 failure.
Attempt 5: 30 minutes after attempt 4 failure.
Dead letter: after 5 failed attempts, the delivery is dropped and logged as a delivery failure in the monitoring system's audit log.
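
To make the schedule concrete, the sketch below simulates the sender's side of this policy. The function names are hypothetical, the elapsed time counts only the waits between attempts (not the time each request takes), and treating every non-2xx, non-5xx status as a permanent rejection is an assumption that goes slightly beyond the 4xx rule stated earlier.

```javascript
// Wait before each attempt, in seconds: immediate, 30 s, 2 min, 10 min, 30 min.
const RETRY_DELAYS_SECONDS = [0, 30, 120, 600, 1800];

// outcome is an HTTP status code or the string 'timeout'.
function isRetryable(outcome) {
  return outcome === 'timeout' || (outcome >= 500 && outcome <= 599);
}

// outcomes[i] is the result of attempt i + 1; missing entries count as timeouts.
function simulateDelivery(outcomes) {
  let elapsedSeconds = 0;
  for (let i = 0; i < RETRY_DELAYS_SECONDS.length; i++) {
    elapsedSeconds += RETRY_DELAYS_SECONDS[i];
    const outcome = outcomes[i] ?? 'timeout';
    if (typeof outcome === 'number' && outcome >= 200 && outcome < 300) {
      return { result: 'delivered', attempts: i + 1, elapsedSeconds };
    }
    if (!isRetryable(outcome)) {
      return { result: 'rejected', attempts: i + 1, elapsedSeconds };
    }
  }
  return { result: 'dead-letter', attempts: RETRY_DELAYS_SECONDS.length, elapsedSeconds };
}
```

Under this policy the fifth attempt goes out 2,550 seconds (42.5 minutes) after the first, so an endpoint that stays down longer than that loses the alert.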

The increasing delay between attempts keeps retries from overwhelming an endpoint that is itself temporarily unavailable. The most dangerous pattern is hosting your webhook endpoint on the same machine as the MCP server being monitored: if the host goes down, the webhook endpoint goes down with it, and the alert never arrives. Deploy your webhook endpoint on infrastructure separate from your MCP server.

For critical P1 alerts, don't rely solely on webhooks. Configure a secondary alert channel (email, SMS, push) as a fallback in case your webhook endpoint is unreachable. The primary path gets the structured webhook; the fallback path gets a plain text alert. See MCP server on-call for how to structure the complete alerting chain.

HMAC signature verification

Webhook payloads arrive over the public internet. Without signature verification, any party can POST a spoofed alert to your endpoint. HMAC-SHA256 signing is the standard mitigation.

The monitoring system holds a shared signing secret (a random 32-byte string you configure). Before delivery, it computes:

signature = HMAC-SHA256(secret, raw_request_body)

The signature is sent in an HTTP header — typically X-Signature: sha256=<hex_digest> or Authorization: Signature sha256=<hex_digest>. Your endpoint:

Reads the raw request body as bytes before any JSON parsing.
Computes HMAC-SHA256(your_secret, raw_body).
Compares your computed signature to the header value using a constant-time comparison function. Plain string equality returns as soon as it finds a mismatched character, which leaks timing information an attacker can exploit.
Returns 401 if the signatures don't match. Never process an unverified payload.
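
The four steps above can be sketched with Node's built-in crypto module. The helper names are hypothetical, and the sketch assumes the header carries sha256= followed by a lowercase hex digest; check your provider's exact header format before relying on it.

```javascript
const crypto = require('node:crypto');

// Builds the header value the sender is assumed to produce.
function signatureFor(secret, rawBody) {
  const digest = crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  return `sha256=${digest}`;
}

// rawBody must be the exact bytes received, not a re-serialized JSON object.
function isValidSignature(secret, rawBody, headerValue) {
  if (typeof headerValue !== 'string') return false;
  const expected = Buffer.from(signatureFor(secret, rawBody), 'utf8');
  const received = Buffer.from(headerValue, 'utf8');
  // timingSafeEqual throws when lengths differ, so check length first.
  if (expected.length !== received.length) return false;
  return crypto.timingSafeEqual(expected, received);
}
```

In an Express app, one way to get the raw bytes is to mount express.raw({ type: 'application/json' }) on the webhook route so that req.body is a Buffer, then call JSON.parse only after the signature check passes.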

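Replay attack prevention: the sender includes a delivered_at timestamp in a request header or in the body, and your endpoint rejects deliveries where abs(now - delivered_at) > 300 seconds. The timestamp must be covered by the signature; otherwise an attacker can attach a fresh timestamp to a captured body. With this check in place, a captured and replayed legitimate payload can't trigger spurious actions more than 5 minutes after the initial delivery.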
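
A minimal freshness check, assuming delivered_at arrives as an ISO 8601 string; the isFresh name and the injectable clock are illustrative choices, not part of any provider's API.

```javascript
const MAX_SKEW_SECONDS = 300;

// Returns true when deliveredAt is within 300 seconds of the current time,
// in either direction. Unparseable timestamps are rejected.
function isFresh(deliveredAt, nowMs = Date.now()) {
  const deliveredMs = Date.parse(deliveredAt);
  if (Number.isNaN(deliveredMs)) return false;
  return Math.abs(nowMs - deliveredMs) <= MAX_SKEW_SECONDS * 1000;
}
```

This check only works alongside retries if the sender stamps delivered_at per attempt. If it reflected the first attempt only, retries arriving 10 or 30 minutes later would be rejected as stale, so it is worth confirming how your provider sets the field.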

Rotate your signing secret annually or after any suspected compromise. After rotation, there is a window during which both the old and new secrets must be accepted: the monitoring system signs new deliveries with the new secret, but in-flight retries from before the rotation may still carry signatures made with the old one. Support an overlap window of at least 10 minutes. If retries are not re-signed at send time, the retry schedule above can run for more than 40 minutes, so extend the overlap to cover it.
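
Accepting two secrets during the overlap can be sketched as below. The shape of the secrets object and the function names are assumptions for this example, and the header format (sha256= plus a hex digest) is the same assumption used for single-secret verification.

```javascript
const crypto = require('node:crypto');

// Constant-time check of one secret against the received header value.
function matches(secret, rawBody, headerValue) {
  const digest = crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  const expected = Buffer.from(`sha256=${digest}`, 'utf8');
  const received = Buffer.from(String(headerValue), 'utf8');
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}

// secrets: { current: string, previous?: { value: string, acceptUntilMs: number } }
function isValidDuringRotation(rawBody, headerValue, secrets, nowMs = Date.now()) {
  if (matches(secrets.current, rawBody, headerValue)) return true;
  const previous = secrets.previous;
  // The old secret is honoured only until the overlap window closes.
  if (!previous || nowMs > previous.acceptUntilMs) return false;
  return matches(previous.value, rawBody, headerValue);
}
```

Storing an explicit acceptUntilMs alongside the old secret means the overlap ends on its own, without a second deploy to remove the old value.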

The slow consumer problem

The most common webhook implementation mistake is synchronous processing inside the request handler. A typical bad pattern:

app.post('/webhook/alivemcp', async (req, res) => {
  await pagerduty.createIncident(req.body);     // 2–5 seconds
  await slack.postMessage(req.body);             // 1–2 seconds
  await db.insertAlert(req.body);               // 50ms
  res.sendStatus(200);
});

If PagerDuty is slow, the total handler time can exceed 10 seconds — causing the monitoring system to treat the delivery as a timeout failure and retry. You now get a duplicate PagerDuty incident. The correct pattern:

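app.post('/webhook/alivemcp', (req, res) => {
  // verifySignature is your own helper: it checks the HMAC over the raw body
  if (!verifySignature(req)) {
    return res.sendStatus(401);                  // never process unverified payloads
  }
  queue.enqueue('process-alert', req.body);      // fast, in-memory
  res.sendStatus(202);                           // acknowledge immediately
});
// Background worker processes queue asynchronously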

The endpoint acknowledges receipt in under 100ms. The queue worker handles PagerDuty, Slack, and database writes at its own pace without the delivery timeout constraint.
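
The worker side of this pattern might look like the following sketch. AlertQueue is a hypothetical in-memory class written for this example; it loses queued alerts if the process crashes, so a durable queue is the safer choice for P1 alerts.

```javascript
class AlertQueue {
  constructor(handler, maxAttempts = 3) {
    this.handler = handler;         // async function doing the slow downstream work
    this.maxAttempts = maxAttempts;
    this.pending = [];
    this.deadLetters = [];
  }

  // Called from the request handler: constant time, no I/O.
  enqueue(payload) {
    this.pending.push({ payload, attempts: 0 });
  }

  // Called by the background worker, outside the request/response cycle.
  async drain() {
    while (this.pending.length > 0) {
      const job = this.pending.shift();
      job.attempts += 1;
      try {
        await this.handler(job.payload);
      } catch (err) {
        if (job.attempts < this.maxAttempts) {
          this.pending.push(job);   // retry after the other queued jobs
        } else {
          this.deadLetters.push({ payload: job.payload, error: String(err) });
        }
      }
    }
  }
}
```

Because a failed job is retried inside the worker, a slow or failing PagerDuty call never surfaces as a 5xx from the webhook endpoint, and so never triggers a redelivery from the monitoring system.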

Testing webhook endpoints without a public URL

During development, your webhook endpoint runs on localhost — not reachable from the monitoring system. Three testing approaches:

Request inspection service: webhook.site and requestbin.com give you a public URL that logs incoming payloads. Configure this URL in your monitoring system temporarily to capture real payload shapes from your test servers. Capture one real payload, then use it as a fixture for local handler testing.
Local tunnel: ngrok, Cloudflare Tunnel, or localtunnel expose a localhost port with a public HTTPS URL. The monitoring system delivers to the tunnel URL; the tunnel proxies to your local handler. Useful for testing the full signature verification and idempotency logic in the real delivery path.
Unit tests from fixtures: once you have a real payload shape, write a test suite that sends the raw JSON to your handler directly, without a tunnel. Test that a valid signature is accepted; that an invalid signature returns 401; that a repeated dedup_key and event pair returns 200 without re-processing; and that a 5xx from a downstream (PagerDuty mock) triggers a queue retry, not a handler error.
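
A fixture-driven test might look like the sketch below, which covers the first three cases. The makeHandler function is a hypothetical framework-free stand-in for your real handler, the secret and header format are test assumptions, and the fixture reuses a subset of the example payload from this guide.

```javascript
const { strictEqual } = require('node:assert');
const crypto = require('node:crypto');

const SECRET = 'test-secret'; // fixture secret, not a real credential
const FIXTURE = JSON.stringify({
  event: 'downtime_started',
  dedup_key: 'alivemcp-incident-7f3a91',
  server_slug: 'my-mcp-server',
});

const sign = (body, secret = SECRET) =>
  'sha256=' + crypto.createHmac('sha256', secret).update(body).digest('hex');

// Handler under test: returns the HTTP status it would send.
function makeHandler(enqueue) {
  const seen = new Set();
  return function handle(rawBody, signatureHeader) {
    const expected = Buffer.from(sign(rawBody));
    const received = Buffer.from(String(signatureHeader));
    if (expected.length !== received.length || !crypto.timingSafeEqual(expected, received)) {
      return 401;
    }
    const payload = JSON.parse(rawBody);
    const key = `${payload.dedup_key}:${payload.event}`;
    if (seen.has(key)) return 200; // duplicate: acknowledge, do not re-process
    seen.add(key);
    enqueue(payload);
    return 202;
  };
}

const queued = [];
const handle = makeHandler((payload) => queued.push(payload));

strictEqual(handle(FIXTURE, sign(FIXTURE)), 202);                 // valid signature accepted
strictEqual(handle(FIXTURE, sign(FIXTURE, 'wrong-secret')), 401); // invalid signature rejected
strictEqual(handle(FIXTURE, sign(FIXTURE)), 200);                 // duplicate acknowledged
strictEqual(queued.length, 1);                                    // processed exactly once
```

The fourth case, a downstream 5xx, belongs in a test of the queue worker rather than the handler, since the handler has already responded by the time the downstream call runs.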

AliveMCP webhook configuration

AliveMCP Author tier ($9/mo) includes configurable webhook routing per monitored endpoint. Configuration options:

Webhook URL: any HTTPS URL you own.
Signing secret: random 32-byte string; AliveMCP uses it to sign each payload with HMAC-SHA256 and sends the signature in the X-AliveMCP-Signature header.
Per-severity routing: P1 events can route to a different URL than P2 and P3, allowing you to send critical downtime to PagerDuty and non-critical SLO warnings to a logging endpoint.
Recovery alerts: toggle whether resolved events are also delivered — some consumers only want downtime_started (for fire-and-forget incident creation); others need downtime_resolved to close incidents automatically.

See MCP server alerting for the full severity ladder that drives per-severity routing, and per-tenant alert routing at scale for the multi-tenant webhook fanout pattern.