
MCP server webhook alerts

Webhook alerts give you programmatic control over what happens when your MCP server goes down. Instead of relying on a monitoring service's built-in notification UI, you receive a structured HTTP POST to an endpoint you own — and you decide what to do with it: page an on-call engineer, post to Slack, open a ticket, trigger an auto-remediation script, or all of the above. The flexibility comes with responsibility: your webhook endpoint needs to be fast, idempotent, and secure. This guide covers every decision point from payload schema design to HMAC signature verification to retry handling.

TL;DR

A webhook alert is a POST request with a JSON payload to an endpoint you control. Your endpoint must respond with a 2xx status within the delivery timeout (typically 10–30 seconds). Use a dedup_key field in the payload for idempotency — retry logic means you may receive the same alert more than once. Sign payloads with HMAC-SHA256 and verify the signature before processing. Test locally with a free inspection service before deploying. AliveMCP Author tier ships webhook routing with configurable URLs, signatures, and per-severity routing out of the box.

Webhook payload schema

A well-designed webhook payload gives the consumer everything it needs to act without a follow-up API call. The minimum required fields:

{
  "event": "downtime_started",
  "dedup_key": "alivemcp-incident-7f3a91",
  "server_slug": "my-mcp-server",
  "server_url": "https://api.example.com/mcp",
  "failure_layer": "initialize",
  "severity": "P1",
  "started_at": "2026-06-01T14:32:00Z",
  "probe_count": 3,
  "last_error": "connection refused on port 443",
  "dashboard_url": "https://alivemcp.com/status/my-mcp-server"
}

Key fields explained:

Recovery payloads use the same schema with "event": "downtime_resolved" plus two additional fields: a resolved_at timestamp and duration_seconds. Thread the recovery into your original incident ticket or PagerDuty incident using dedup_key.
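A consumer may want to check the payload shape before acting on it. The sketch below is one way to do that, assuming the field names from the example payload; the type rules and the validatePayload helper are hypothetical and not a published AliveMCP schema.

```javascript
// Hypothetical validator: field names come from the example payload;
// the expected types are assumptions, not a published schema.
const REQUIRED_FIELDS = {
  event: 'string',
  dedup_key: 'string',
  server_slug: 'string',
  server_url: 'string',
  failure_layer: 'string',
  severity: 'string',
  started_at: 'string',
  probe_count: 'number',
  last_error: 'string',
  dashboard_url: 'string',
};

// Extra fields that recovery payloads are described as carrying.
const RECOVERY_FIELDS = {
  resolved_at: 'string',
  duration_seconds: 'number',
};

// Returns a list of problems; an empty list means the payload looks usable.
function validatePayload(payload) {
  if (payload === null || typeof payload !== 'object') {
    return ['payload is not an object'];
  }
  const problems = [];
  const expected = payload.event === 'downtime_resolved'
    ? { ...REQUIRED_FIELDS, ...RECOVERY_FIELDS }
    : REQUIRED_FIELDS;
  for (const [field, type] of Object.entries(expected)) {
    if (typeof payload[field] !== type) {
      problems.push(`${field} should be a ${type}`);
    }
  }
  // Timestamps must at least be parseable dates.
  for (const field of ['started_at', 'resolved_at']) {
    if (typeof payload[field] === 'string' && Number.isNaN(Date.parse(payload[field]))) {
      problems.push(`${field} is not a parseable timestamp`);
    }
  }
  return problems;
}
```

Returning a list of problems rather than throwing makes it easy to log every issue with a malformed delivery in one pass.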

HTTP delivery mechanics

The monitoring system sends a POST request with Content-Type: application/json to your configured endpoint. Your endpoint must:

Retry logic and delivery guarantees

At-least-once delivery is the standard webhook guarantee. The monitoring system retries on 5xx responses and timeout failures. A typical retry policy:

Exponential backoff prevents retry storms from overwhelming an endpoint that is itself temporarily unavailable. The most dangerous pattern is configuring your webhook endpoint on the same host as the MCP server being monitored — if the host goes down, the webhook endpoint goes down at the same time, defeating the purpose of the alert. Deploy your webhook endpoint on separate infrastructure from your MCP server.
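To make the backoff idea concrete, here is a minimal sketch of how a sender might compute its retry schedule. The base delay, multiplier, cap, and attempt count are illustrative assumptions, not the policy of any particular monitoring service.

```javascript
// Illustrative parameters only: base delay, factor, cap, and attempt count
// are assumptions, not the retry policy of any specific service.
function backoffDelaysMs({ baseMs = 1000, factor = 2, maxMs = 60000, attempts = 5 } = {}) {
  const delays = [];
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    // The delay grows geometrically until it reaches the cap.
    delays.push(Math.min(baseMs * factor ** attempt, maxMs));
  }
  return delays;
}

// Full jitter: pick a random delay in [0, delayMs) so that many senders
// retrying at the same moment do not hit the endpoint in lockstep.
function withJitter(delayMs, random = Math.random) {
  return Math.floor(random() * delayMs);
}
```

With the defaults, the schedule is 1, 2, 4, 8, and 16 seconds. The cap keeps later retries from drifting so far apart that a recovered endpoint waits needlessly for the next attempt.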

For critical P1 alerts, don't rely solely on webhooks. Configure a secondary alert channel (email, SMS, push) as a fallback in case your webhook endpoint is unreachable. The primary path gets the structured webhook; the fallback path gets a plain text alert. See MCP server on-call for how to structure the complete alerting chain.

HMAC signature verification

Webhook payloads arrive over the public internet. Without signature verification, any party can POST a spoofed alert to your endpoint. HMAC-SHA256 signing is the standard mitigation.

The monitoring system holds a shared signing secret (a random 32-byte string you configure). Before delivery, it computes:

signature = HMAC-SHA256(secret, raw_request_body)

The signature is sent in an HTTP header — typically X-Signature: sha256=<hex_digest> or Authorization: Signature sha256=<hex_digest>. Your endpoint:

  1. Reads the raw request body as bytes before any JSON parsing.
  2. Computes HMAC-SHA256(your_secret, raw_body).
  3. Compares your computed signature to the header value using a constant-time comparison function. Ordinary string equality returns as soon as it finds a mismatch, which exposes the check to timing attacks.
  4. Returns 401 if the signatures don't match. Never process an unverified payload.
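The four steps above can be sketched with Node's built-in crypto module. This assumes the sha256=<hex_digest> header format mentioned earlier; the isValidSignature name and its parameters are hypothetical.

```javascript
const crypto = require('node:crypto');

// Assumes the header value looks like "sha256=<hex_digest>".
// rawBody must be the exact bytes received (a Buffer), not re-serialized JSON.
function isValidSignature(rawBody, signatureHeader, secret) {
  if (typeof signatureHeader !== 'string' || !signatureHeader.startsWith('sha256=')) {
    return false;
  }
  const received = Buffer.from(signatureHeader.slice('sha256='.length), 'hex');
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  // timingSafeEqual throws when lengths differ, so check the length first.
  if (received.length !== expected.length) {
    return false;
  }
  return crypto.timingSafeEqual(received, expected);
}
```

A caller would respond with 401 whenever this returns false. In an Express app, one way to get the raw bytes is to mount express.raw({ type: 'application/json' }) on the webhook route and parse the JSON only after the check passes.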

Replay attack prevention: include a delivered_at timestamp in a request header or in the body, make sure it is covered by the signature (otherwise an attacker can simply rewrite it), and reject deliveries where abs(now - delivered_at) > 300 seconds. A captured legitimate payload then cannot be replayed to trigger spurious actions more than 5 minutes after the initial delivery.
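The freshness check can be sketched as follows, assuming delivered_at is an ISO 8601 string that has already been authenticated by the signature. The isFreshDelivery helper is hypothetical; the 300-second tolerance comes from the text above.

```javascript
// Assumes delivered_at is an ISO 8601 string covered by the signature.
const MAX_SKEW_SECONDS = 300;

function isFreshDelivery(deliveredAt, nowMs = Date.now()) {
  const deliveredMs = Date.parse(deliveredAt);
  if (Number.isNaN(deliveredMs)) {
    return false; // missing or unparseable timestamp: reject
  }
  // Use the absolute difference so clock skew in either direction is tolerated.
  return Math.abs(nowMs - deliveredMs) <= MAX_SKEW_SECONDS * 1000;
}
```

Passing the current time as a parameter keeps the function easy to test with fixed clocks.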

Rotate your signing secret annually or after any suspected compromise. After rotation, the monitoring system signs new deliveries with the new secret, but retries that were already in flight were signed with the old one. Support a 10-minute overlap window during which your endpoint accepts both secrets.
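One possible way to implement the overlap window is to keep the previous secret alongside the current one with an expiry time. The key ring shape (current, previous, previousExpiresAtMs) and both function names below are hypothetical.

```javascript
const crypto = require('node:crypto');

// Hypothetical key ring: `previous` is honored only until `previousExpiresAtMs`.
function acceptedSecrets(keyRing, nowMs = Date.now()) {
  const secrets = [keyRing.current];
  if (keyRing.previous && nowMs < keyRing.previousExpiresAtMs) {
    secrets.push(keyRing.previous);
  }
  return secrets;
}

// Returns true if the hex signature matches the body under any accepted secret.
function matchesAnySecret(rawBody, receivedHex, keyRing, nowMs = Date.now()) {
  if (typeof receivedHex !== 'string') {
    return false;
  }
  const received = Buffer.from(receivedHex, 'hex');
  let ok = false;
  for (const secret of acceptedSecrets(keyRing, nowMs)) {
    const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
    if (received.length === expected.length && crypto.timingSafeEqual(received, expected)) {
      ok = true;
    }
  }
  return ok;
}
```

At rotation time, the caller would set previousExpiresAtMs to the rotation time plus 10 minutes, after which the old secret stops being accepted without any further action.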

The slow consumer problem

The most common webhook implementation mistake is synchronous processing inside the request handler. A typical bad pattern:

app.post('/webhook/alivemcp', async (req, res) => {
  await pagerduty.createIncident(req.body);     // 2–5 seconds
  await slack.postMessage(req.body);             // 1–2 seconds
  await db.insertAlert(req.body);               // 50ms
  res.sendStatus(200);
});

If PagerDuty is slow, the total handler time can exceed 10 seconds — causing the monitoring system to treat the delivery as a timeout failure and retry. You now get a duplicate PagerDuty incident. The correct pattern:

app.post('/webhook/alivemcp', (req, res) => {
  if (!verifySignature(req)) {                   // fast, synchronous; assumed to return false on mismatch
    return res.sendStatus(401);                  // never process an unverified payload
  }
  queue.enqueue('process-alert', req.body);      // fast, in-memory
  res.sendStatus(202);                           // acknowledge immediately
});
// Background worker processes the queue asynchronously

The endpoint acknowledges receipt in under 100ms. The queue worker handles PagerDuty, Slack, and database writes at its own pace without the delivery timeout constraint.
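The background worker side of this pattern might look like the sketch below. AlertQueue is a hypothetical, deliberately small in-memory queue; a production setup would more likely use a durable queue so that accepted alerts survive a process restart.

```javascript
// Hypothetical in-memory queue. Jobs are lost if the process exits,
// so treat this as an illustration of the hand-off, not a durable design.
class AlertQueue {
  constructor(handler) {
    this.handler = handler; // async function that does the slow downstream work
    this.jobs = [];
    this.draining = false;
  }

  // Called from the request handler: returns immediately.
  enqueue(payload) {
    this.jobs.push(payload);
    if (!this.draining) {
      this.draining = true;
      setImmediate(() => this.drain());
    }
  }

  // Runs jobs one at a time, outside the request/response cycle.
  async drain() {
    while (this.jobs.length > 0) {
      const job = this.jobs.shift();
      try {
        await this.handler(job);
      } catch (err) {
        // One failing job must not stop the rest of the queue.
        console.error('alert processing failed', err);
      }
    }
    this.draining = false;
  }
}
```

The handler passed to the constructor is where the PagerDuty, Slack, and database calls would go. Because enqueue only pushes onto an array and schedules the drain, the HTTP response is never blocked by downstream latency.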

Testing webhook endpoints without a public URL

During development, your webhook endpoint runs on localhost — not reachable from the monitoring system. Three testing approaches:

AliveMCP webhook configuration

AliveMCP Author tier ($9/mo) includes configurable webhook routing per monitored endpoint. Configuration options:

See MCP server alerting for the full severity ladder that drives per-tier routing, and per-tenant alert routing at scale for the multi-tenant webhook fanout pattern.

Related questions

Why does my webhook receive duplicate alerts?

Duplicates are caused by retry logic — your endpoint returned a 5xx or timed out, so the monitoring system retried. The most common cause is synchronous processing in the request handler: a downstream call (PagerDuty, Slack) is slow, the handler doesn't respond within the timeout window, and the monitoring system retries. Fix: move all downstream calls to a background queue and respond with 202 immediately. Implement deduplication on dedup_key + event to handle any duplicates that slip through.
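Deduplication on dedup_key + event can be sketched as a small lookup with expiry. DeliveryDeduplicator and its one-hour retention are assumptions for illustration; if several processes handle deliveries, the lookup would need to live in a shared store instead of process memory.

```javascript
// Hypothetical in-memory deduplicator keyed on dedup_key + event.
// The one-hour retention is an assumption, not a documented value.
class DeliveryDeduplicator {
  constructor(ttlMs = 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
    this.seen = new Map(); // key -> expiry time in ms
  }

  // Returns true the first time a (dedup_key, event) pair is seen, false on repeats.
  shouldProcess(payload, nowMs = Date.now()) {
    // Drop expired entries so the map does not grow without bound.
    for (const [key, expiresAt] of this.seen) {
      if (expiresAt <= nowMs) this.seen.delete(key);
    }
    const key = `${payload.dedup_key}:${payload.event}`;
    if (this.seen.has(key)) return false;
    this.seen.set(key, nowMs + this.ttlMs);
    return true;
  }
}
```

Including the event in the key means a retried downtime_started is suppressed while the later downtime_resolved for the same incident still gets through.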

Should I use webhook alerts or email alerts for P1 downtime?

Both. Webhooks give you machine-readable structured data for automated incident creation; email gives you a human-readable fallback if the webhook endpoint is unreachable. P1 events warrant redundant alerting: configure webhook as primary (routes to PagerDuty or Slack) and email as secondary (delivered independently via the monitoring system's own email path). P2 and P3 events can use webhook-only, since a missed lower-severity delivery doesn't warrant waking someone up.

How do I handle alert fatigue from too many webhook deliveries?

Alert fatigue from webhooks usually means either (1) your severity thresholds are too low — P3 noise is being treated like P1; or (2) you're not deduplicating within an incident window — each re-check failure triggers a new delivery instead of one delivery per incident. Mitigations: configure a minimum duration threshold (don't fire until 3 consecutive failed probes, not just 1); set per-severity routing so P3 events go to a low-priority logging endpoint rather than a push notification channel; use dedup_key to suppress duplicate deliveries within the same incident lifecycle. See MCP server on-call for on-call alert fatigue mitigation more broadly.
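For consumers that do their own filtering, the threshold and per-severity routing ideas can be sketched as a single function. The channel names are placeholders, and treating probe_count as the number of consecutive failed probes is an assumption about the payload.

```javascript
// Hypothetical routing table: channel names are placeholders.
const ROUTES = { P1: 'pager', P2: 'chat', P3: 'log' };

// Mirrors the "3 consecutive failed probes" example; assumes probe_count
// counts consecutive failures.
const MIN_FAILED_PROBES = 3;

// Returns the channel for an alert, or null when it should be suppressed.
function routeAlert(payload) {
  if (payload.event === 'downtime_started' && payload.probe_count < MIN_FAILED_PROBES) {
    return null; // not enough consecutive failures yet
  }
  // Unknown severities fall back to the quiet channel rather than paging.
  return ROUTES[payload.severity] ?? 'log';
}
```

Recovery events skip the probe threshold so that an incident that was routed always gets its matching resolution.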

Further reading