MCP server webhook alerts
Webhook alerts give you programmatic control over what happens when your MCP server goes down. Instead of relying on a monitoring service's built-in notification UI, you receive a structured HTTP POST to an endpoint you own — and you decide what to do with it: page an on-call engineer, post to Slack, open a ticket, trigger an auto-remediation script, or all of the above. The flexibility comes with responsibility: your webhook endpoint needs to be fast, idempotent, and secure. This guide covers every decision point from payload schema design to HMAC signature verification to retry handling.
TL;DR
A webhook alert is a POST request with a JSON payload to an endpoint you control. Your endpoint must respond with a 2xx status within the delivery timeout (typically 10–30 seconds). Use a dedup_key field in the payload for idempotency — retry logic means you may receive the same alert more than once. Sign payloads with HMAC-SHA256 and verify the signature before processing. Test locally with a free inspection service before deploying. AliveMCP Author tier ships webhook routing with configurable URLs, signatures, and per-severity routing out of the box.
Webhook payload schema
A well-designed webhook payload gives the consumer everything it needs to act without a follow-up API call. The minimum required fields:
{
  "event": "downtime_started",
  "dedup_key": "alivemcp-incident-7f3a91",
  "server_slug": "my-mcp-server",
  "server_url": "https://api.example.com/mcp",
  "failure_layer": "initialize",
  "severity": "P1",
  "started_at": "2026-06-01T14:32:00Z",
  "probe_count": 3,
  "last_error": "connection refused on port 443",
  "dashboard_url": "https://alivemcp.com/status/my-mcp-server"
}
Key fields explained:
- event: one of downtime_started, downtime_resolved, slo_breach_warning, schema_drift_detected. Your consumer branches on this field.
- dedup_key: stable identifier for the incident lifecycle. The same key appears in both the downtime_started and downtime_resolved events, allowing your consumer to thread recovery alerts into the original incident. It also lets you deduplicate retries: if you receive two deliveries with the same dedup_key and event, the second is a retry of the first.
- failure_layer: which protocol layer failed (transport, http, initialize, or tools_list). Use this for alert routing: transport failures go to infrastructure on-call; initialize failures go to the MCP server developer. See MCP server downtime alerting for the full severity-per-layer routing table.
- probe_count: how many consecutive failed probes triggered this alert. A value of 3 with 60-second cadence means the server has been down at least 2 minutes (the span from the first failed probe to the third), and up to 3 minutes if it failed right after its last successful probe. A higher value indicates a longer outage before the first successful delivery to your webhook.
Recovery payloads use the same schema with "event": "downtime_resolved" plus two additional fields: a resolved_at timestamp and duration_seconds. Thread the recovery into your original incident ticket or PagerDuty incident using dedup_key.
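The branching and threading described above might look roughly like the following. This is a minimal sketch, not AliveMCP's reference consumer: the in-memory Map and the handleAlert name are assumptions for illustration, and a real consumer would call its ticketing or paging API where the Map is updated.

```javascript
// Hypothetical in-memory incident store keyed by dedup_key (illustration only).
const openIncidents = new Map();

// Branches on payload.event and threads recovery into the original incident.
function handleAlert(payload) {
  switch (payload.event) {
    case 'downtime_started':
      openIncidents.set(payload.dedup_key, {
        server: payload.server_slug,
        severity: payload.severity,
        startedAt: payload.started_at,
      });
      return 'opened';
    case 'downtime_resolved': {
      const incident = openIncidents.get(payload.dedup_key);
      if (!incident) return 'unknown-incident'; // recovery with no matching start
      openIncidents.delete(payload.dedup_key);
      return `resolved after ${payload.duration_seconds}s`;
    }
    default:
      // slo_breach_warning, schema_drift_detected, or an event type added later
      return 'ignored';
  }
}
```

Returning a string per branch is only there to make the sketch easy to test; the important part is that both lifecycle events look up the same dedup_key.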
HTTP delivery mechanics
The monitoring system sends a POST request with Content-Type: application/json to your configured endpoint. Your endpoint must:
- Respond within the timeout window. Most monitoring webhooks use a 10–30 second delivery timeout. If your endpoint takes longer than the timeout to respond, the delivery is treated as a failure and retried. Respond with 2xx immediately after receiving and validating the request, then process asynchronously in a background queue. Do not do synchronous work (database writes, downstream API calls, ticket creation) inside the synchronous response path.
- Return an appropriate HTTP status. 200, 201, or 204 all indicate successful delivery. 4xx responses indicate a permanent delivery failure (the monitoring system will not retry a 4xx — the assumption is that 4xx means your endpoint rejected the payload intentionally). 5xx responses indicate a transient failure and trigger retry logic.
- Handle duplicate deliveries. Retry logic means you may receive the same payload more than once. Your processing logic must be idempotent: processing the same event twice must produce the same outcome as processing it once. Use dedup_key + event as the idempotency key, and if you've already processed this combination, return 200 without re-processing.
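The idempotency check can be sketched as a small helper. The Set and the markIfNew name are hypothetical; a process-local Set is only adequate for a single handler instance, so a shared store would be needed once the endpoint runs on more than one machine.

```javascript
// Hypothetical in-memory record of processed deliveries (illustration only).
const processed = new Set();

// Returns true the first time a dedup_key + event pair is seen, false on a retry.
function markIfNew(payload) {
  const key = `${payload.dedup_key}:${payload.event}`;
  if (processed.has(key)) return false;
  processed.add(key);
  return true;
}
```

A handler would call markIfNew(payload) before enqueueing work and return 200 without further processing when it yields false.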
Retry logic and delivery guarantees
At-least-once delivery is the standard webhook guarantee. The monitoring system retries on 5xx responses and timeout failures. A typical retry policy:
- Attempt 1: immediate on event trigger.
- Attempt 2: 30 seconds after attempt 1 failure.
- Attempt 3: 2 minutes after attempt 2 failure.
- Attempt 4: 10 minutes after attempt 3 failure.
- Attempt 5: 30 minutes after attempt 4 failure.
- Dead letter: after 5 failed attempts, the delivery is dropped and logged as a delivery failure in the monitoring system's audit log.
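To see how long this example policy keeps trying, the delays can be summed into offsets from the original trigger. The sketch below only restates the arithmetic of the schedule above; it ignores the time each failed attempt takes to time out, so the real offsets would be somewhat later.

```javascript
// Delays in seconds before each attempt, taken from the example policy above.
const retryDelaysSeconds = [0, 30, 120, 600, 1800];

// Cumulative offset of each attempt from the original event trigger.
function attemptOffsets(delays) {
  let elapsed = 0;
  return delays.map((delay) => (elapsed += delay));
}

const offsets = attemptOffsets(retryDelaysSeconds);
console.log(offsets); // [ 0, 30, 150, 750, 2550 ]
console.log(offsets[offsets.length - 1] / 60); // 42.5 (minutes until the final attempt)
```

Under this policy an endpoint that stays down for more than about 42.5 minutes misses the alert entirely.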
Exponential backoff prevents retry storms from overwhelming an endpoint that is itself temporarily unavailable. The most dangerous pattern is configuring your webhook endpoint on the same host as the MCP server being monitored — if the host goes down, the webhook endpoint goes down at the same time, defeating the purpose of the alert. Deploy your webhook endpoint on separate infrastructure from your MCP server.
For critical P1 alerts, don't rely solely on webhooks. Configure a secondary alert channel (email, SMS, push) as a fallback in case your webhook endpoint is unreachable. The primary path gets the structured webhook; the fallback path gets a plain text alert. See MCP server on-call for how to structure the complete alerting chain.
HMAC signature verification
Webhook payloads arrive over the public internet. Without signature verification, any party can POST a spoofed alert to your endpoint. HMAC-SHA256 signing is the standard mitigation.
The monitoring system holds a shared signing secret (a random 32-byte string you configure). Before delivery, it computes:
signature = HMAC-SHA256(secret, raw_request_body)
The signature is sent in an HTTP header — typically X-Signature: sha256=<hex_digest> or Authorization: Signature sha256=<hex_digest>. Your endpoint:
- Reads the raw request body as bytes before any JSON parsing.
- Computes HMAC-SHA256(your_secret, raw_body).
- Compares your computed signature to the header value using a constant-time comparison function (not string equality, which is vulnerable to timing attacks).
- Returns 401 if the signatures don't match. Never process an unverified payload.
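The verification steps above can be sketched with Node's built-in crypto module. The function signature here is an assumption for illustration (it takes the raw body, the header value, and the secret as separate arguments), and it assumes the sha256=<hex_digest> header format mentioned above.

```javascript
const crypto = require('node:crypto');

// rawBody: Buffer of the exact bytes received.
// headerValue: signature header, expected as "sha256=<hex_digest>".
function verifySignature(rawBody, headerValue, secret) {
  if (typeof headerValue !== 'string' || !headerValue.startsWith('sha256=')) {
    return false;
  }
  const received = Buffer.from(headerValue.slice('sha256='.length), 'hex');
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  // timingSafeEqual throws if the lengths differ, so compare lengths first.
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(received, expected);
}
```

In Express, one way to get the raw bytes is to mount express.raw({ type: 'application/json' }) on the webhook route so that req.body arrives as a Buffer, then parse the JSON only after verification succeeds.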
Replay attack prevention: include a delivered_at timestamp in the signed body (or in a header that is itself covered by the signature, since an unsigned timestamp can simply be rewritten by an attacker), and reject deliveries where abs(now - delivered_at) > 300 seconds. A captured and replayed legitimate payload then can't be used to trigger spurious actions more than 5 minutes after initial delivery.
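A freshness check along these lines is short. The sketch assumes delivered_at is an ISO 8601 string, which is an assumption about the payload format; the isFresh name and the injectable clock are for illustration and testing.

```javascript
const MAX_SKEW_SECONDS = 300;

// deliveredAt: ISO 8601 timestamp taken from the verified payload.
// now: milliseconds since the epoch, injectable so the check can be tested.
function isFresh(deliveredAt, now = Date.now()) {
  const sentAt = Date.parse(deliveredAt);
  if (Number.isNaN(sentAt)) return false; // missing or malformed timestamp
  return Math.abs(now - sentAt) <= MAX_SKEW_SECONDS * 1000;
}
```

Run this only after the signature has been verified, so the timestamp being checked is one the sender actually produced.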
Rotate your signing secret annually or after any suspected compromise. After rotation, there is a brief window where both the old and new secrets are valid: the monitoring system sends using the new secret, but in-flight retries from before the rotation used the old secret. Support an overlap window during which both secrets are accepted. Ten minutes is a reasonable minimum, but if retries are not re-signed, the window needs to cover the full retry schedule (about 43 minutes under the example policy above).
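Accepting two secrets during the overlap can be sketched as trying each currently valid secret in turn. The shape of the secrets list (value plus validUntil) and the function names are assumptions for illustration, not part of any monitoring system's API.

```javascript
const crypto = require('node:crypto');

// Constant-time check of a hex signature against one secret.
function matches(rawBody, receivedHex, secret) {
  if (typeof receivedHex !== 'string') return false;
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(receivedHex, 'hex');
  return received.length === expected.length &&
    crypto.timingSafeEqual(received, expected);
}

// secrets: [{ value, validUntil }] where validUntil is a millisecond timestamp,
// or null for the current secret (hypothetical shape).
function verifyWithRotation(rawBody, receivedHex, secrets, now = Date.now()) {
  return secrets
    .filter((s) => s.validUntil === null || now <= s.validUntil)
    .some((s) => matches(rawBody, receivedHex, s.value));
}
```

On rotation, the old secret stays in the list with validUntil set to the end of the overlap window, after which it stops matching without any further deploy.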
The slow consumer problem
The most common webhook implementation mistake is synchronous processing inside the request handler. A typical bad pattern:
app.post('/webhook/alivemcp', async (req, res) => {
  await pagerduty.createIncident(req.body); // 2–5 seconds
  await slack.postMessage(req.body); // 1–2 seconds
  await db.insertAlert(req.body); // 50ms
  res.sendStatus(200);
});
If PagerDuty is slow, the total handler time can exceed 10 seconds — causing the monitoring system to treat the delivery as a timeout failure and retry. You now get a duplicate PagerDuty incident. The correct pattern:
app.post('/webhook/alivemcp', (req, res) => {
  // Assumes verifySignature returns false for a missing or invalid signature
  if (!verifySignature(req)) return res.sendStatus(401); // fast, synchronous
  queue.enqueue('process-alert', req.body); // fast, in-memory
  res.sendStatus(202); // acknowledge immediately
});
// Background worker processes the queue asynchronously
The endpoint acknowledges receipt in under 100ms. The queue worker handles PagerDuty, Slack, and database writes at its own pace without the delivery timeout constraint.
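For illustration, the queue side of this pattern could be as small as the class below. AlertQueue is a hypothetical name and this in-memory version loses jobs if the process crashes, so treat it as a sketch of the control flow rather than a substitute for a durable queue.

```javascript
// Hypothetical minimal in-memory queue (illustration only).
class AlertQueue {
  constructor(handler) {
    this.handler = handler; // async function called once per job
    this.jobs = [];
    this.draining = false;
  }

  // Returns immediately; processing starts on a later turn of the event loop.
  enqueue(job) {
    this.jobs.push(job);
    if (!this.draining) {
      this.draining = true;
      setImmediate(() => this.drain());
    }
  }

  // Processes jobs one at a time; a failing job does not block the ones behind it.
  async drain() {
    while (this.jobs.length > 0) {
      const job = this.jobs.shift();
      try {
        await this.handler(job);
      } catch (err) {
        console.error('alert processing failed', err); // real code would retry or dead-letter
      }
    }
    this.draining = false;
  }
}
```

Because enqueue never awaits the handler, the request handler's response time is independent of how slow PagerDuty or Slack happen to be.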
Testing webhook endpoints without a public URL
During development, your webhook endpoint runs on localhost — not reachable from the monitoring system. Three testing approaches:
- Request inspection service: webhook.site and requestbin.com give you a public URL that logs incoming payloads. Configure this URL in your monitoring system temporarily to capture real payload shapes from your test servers. Capture one real payload, then use it as a fixture for local handler testing.
- Local tunnel: ngrok, Cloudflare Tunnel, or localtunnel expose a localhost port with a public HTTPS URL. The monitoring system delivers to the tunnel URL; the tunnel proxies to your local handler. Useful for testing the full signature verification and idempotency logic in the real delivery path.
- Unit tests from fixtures: once you have a real payload shape, write a test suite that sends the raw JSON to your handler directly, without a tunnel. Test: valid signature accepted; invalid signature returns 401; duplicate dedup_key returns 200 without re-processing; 5xx from a downstream (PagerDuty mock) triggers a queue retry, not a handler error.
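Fixture-based tests need a valid signature for the captured body. A small helper can produce one by mirroring what the sender computes; signFixture, the test secret, and the trimmed-down fixture are all hypothetical, and the sha256=<hex_digest> format is assumed from the signature section above.

```javascript
const crypto = require('node:crypto');

// Builds the signature header value for a fixture body.
function signFixture(rawBody, secret) {
  const hex = crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  return `sha256=${hex}`;
}

// A trimmed-down fixture; a real test would use a captured payload verbatim.
const fixture = JSON.stringify({
  event: 'downtime_started',
  dedup_key: 'alivemcp-incident-7f3a91',
});
const validHeader = signFixture(fixture, 'test-secret');
const invalidHeader = signFixture(fixture, 'wrong-secret');
```

Sending fixture with validHeader should be accepted by the handler, and the same body with invalidHeader should produce a 401.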
AliveMCP webhook configuration
AliveMCP Author tier ($9/mo) includes configurable webhook routing per monitored endpoint. Configuration options:
- Webhook URL: any HTTPS URL you own.
- Signing secret: random 32-byte string; AliveMCP uses it to sign payloads with HMAC-SHA256 and sends the signature in the X-AliveMCP-Signature header.
- Per-severity routing: P1 events can route to a different URL than P2 and P3, allowing you to send critical downtime to PagerDuty and non-critical SLO warnings to a logging endpoint.
- Recovery alerts: toggle whether resolved events are also delivered. Some consumers only want downtime_started (for fire-and-forget incident creation); others need downtime_resolved to close incidents automatically.
See MCP server alerting for the full severity ladder that drives per-tier routing, and per-tenant alert routing at scale for the multi-tenant webhook fanout pattern.
Related questions
Why does my webhook receive duplicate alerts?
Duplicates are caused by retry logic — your endpoint returned a 5xx or timed out, so the monitoring system retried. The most common cause is synchronous processing in the request handler: a downstream call (PagerDuty, Slack) is slow, the handler doesn't respond within the timeout window, and the monitoring system retries. Fix: move all downstream calls to a background queue and respond with 202 immediately. Implement deduplication on dedup_key + event to handle any duplicates that slip through.
Should I use webhook alerts or email alerts for P1 downtime?
Both. Webhooks give you machine-readable structured data for automated incident creation; email gives you a human-readable fallback if the webhook endpoint is unreachable. P1 events warrant redundant alerting — configure webhook as primary (routes to PagerDuty or Slack) and email as secondary (delivered independently via the monitoring system's own email path). P2 and P3 events can use webhook-only since the failure to deliver a P3 doesn't warrant waking someone up.
How do I handle alert fatigue from too many webhook deliveries?
Alert fatigue from webhooks usually means either (1) your severity thresholds are too low — P3 noise is being treated like P1; or (2) you're not deduplicating within an incident window — each re-check failure triggers a new delivery instead of one delivery per incident. Mitigations: configure a minimum duration threshold (don't fire until 3 consecutive failed probes, not just 1); set per-severity routing so P3 events go to a low-priority logging endpoint rather than a push notification channel; use dedup_key to suppress duplicate deliveries within the same incident lifecycle. See MCP server on-call for on-call alert fatigue mitigation more broadly.
Further reading
- MCP server alerting — severity ladder and routing table
- MCP server downtime alerting — confirmation window and false-positive reduction
- MCP server on-call — structuring the full alerting chain
- MCP server reliability — MTTD and MTTR engineering
- Per-tenant alert routing at scale — webhook fanout for multi-tenant MCP
- AliveMCP — webhook alerts with HMAC signing, built in