Deep dive · 2026-04-30 · Scale sub-series

Per-tenant alert routing at scale — making one paging stack safe for many tenants

The multi-tenant probe collector walkthrough covered the write side of the scale sub-series — the supervisor, the workers, the per-region Redis queues, the verdict-minute coalescer, the per-tenant secret store. The verdict it emits is one signal per server per region per minute, with a sealed three-state colour and the supporting fields. That signal is read by humans through the status page and by machines through the read-side API and embed. The third reader is the alert path — the bit that turns the verdict into a Slack message, an email, a webhook POST, a PagerDuty incident. Single-tenant operators wire the alert path with two YAML lines: slack_webhook_url, oncall_email. The whole thing fits in a hundred lines of curl. Operating the same alert path on behalf of many tenants turns those two YAML lines into a configuration surface with a threat model — a misconfigured tenant must not be able to accidentally page another tenant, a malicious tenant must not be able to deliberately page another tenant, a registry-wide outage must not page ten thousand tenants in the same minute, and the per-tenant alert volume must be bounded by the tier they paid for. This post is the architectural walkthrough for the alert router that satisfies those four constraints, plus a fifth (the alert payload must never carry information the tenant has not paid to see), without giving up the fast-path latency a paging stack needs to be useful at all.

TL;DR

The multi-tenant alert router is a small service — about 1,200 lines of Go in our deployment — that sits between the verdict-minute Redis from the collector walkthrough and the four canonical alert sinks: Slack, webhook, email, PagerDuty. It adds five layers the single-tenant alert path does not have:

  • Sink ownership verification — every Slack workspace, webhook URL, email domain, and PagerDuty service is verified to belong to the tenant before it can be saved as their alert sink, via a domain-of-origin handshake or an inbound proof token; the verification is re-checked on a schedule and on every payload send so a stolen Slack token doesn't keep paging someone else's channel.
  • Tenant-scoped configuration with cross-tenant write protection — the alert-config table is row-secured by tenant ID, the API is parameterised on tenant ID at the request boundary, and the verification step structurally prevents a tenant from pasting another tenant's webhook URL into their own config (the inbound-proof-token handshake fails).
  • Cross-tenant alert suppression — a registry-wide outage that takes down every public MCP listed under a single registry produces one global notice, not ten thousand individual pages; the rule is "if more than 10% of all tenants would be paged in the same alert minute for the same upstream root cause, collapse to a single global notice and an opt-in per-tenant addendum".
  • Per-tenant alert budgets — the same tier mapping that capped probe frequency in the collector caps alert frequency at the router, so a runaway tenant whose server is flapping every 60 seconds doesn't burn through the registry-wide rate limits and doesn't page their on-call into therapy.
  • Payload-shape boundaries — the alert payload contains exactly the fields the tenant has paid for, no more; cross-tenant identifiers (other tenant slugs, internal supervisor metadata, regional probe-error codes the tenant wasn't shown) never leak into a tenant's Slack channel even if the supervisor's logger formats them.

The failure-mode catalogue at the end of the post lists six new ways the alert router can break that the single-tenant path could not, and the structural fix for each. The recipe section at the bottom sketches the verification handshake, the suppression-rule SQL, the alert-budget Lua script, and the four sink-shaped payload templates in copy-pasteable form.

Where the single-tenant alert path stops being safe

The minimal alert path for a single-tenant MCP probe is two configuration lines — one Slack webhook URL and one on-call email — feeding an if statement that posts to one or both whenever the verdict-minute changes from up to down. We ran exactly that for AliveMCP itself for the first two weeks. It was about ninety lines of bash; the verdict-minute Redis fed a small loop that compared the current colour against the last-seen colour for each server, and on a transition the loop posted a JSON blob to Slack and a plain-text email to the on-call. The post-transition cooldown was a hardcoded five minutes. The on-call was the founder and on-call's phone was the same phone the founder slept next to. There was no abstraction, no router, no sinks, no sink registry — just the two configuration lines and the loop.

That alert path stops being safe the moment a second tenant exists. The first failure mode is mundane and concrete: the operator who is now running the alert path on behalf of two tenants has a strong incentive to type tenant B's Slack webhook URL into the configuration for tenant A, because the URLs are visually similar and the Slack API does not authenticate the requester. The webhook URL is the password, in the original sense — anybody who possesses the URL can post to that channel. The operator sees no error from the misconfiguration; tenant A's verdicts start arriving in tenant B's Slack channel. Tenant B reads them, tries to action them, can't (because the servers belong to a different operator), and complains. By that point three or four other tenants have also been misconfigured the same way and the trust in the alert path is gone. Single-tenant operators cannot make this mistake; multi-tenant operators make it within the first month if there is no structural protection.

The second failure mode is subtler. A malicious tenant pastes another tenant's webhook URL into their own configuration deliberately — perhaps because they discovered it via a prior leak, perhaps because they guessed it from the workspace name (Slack URLs include a workspace identifier), perhaps because they are running a DDoS against the other tenant's on-call. The alert router, if it has no verification step, will happily forward the malicious tenant's verdicts to the other tenant's Slack channel; the other tenant cannot tell the difference between "my server is down" and "another tenant has named me as their sink". The Slack API does not provide a "verify webhook owner" call. Verification must be structural — a fact about the configuration step, not the dispatch step.

The third failure mode appears when a single upstream root cause takes down many tenants. The Q2 audit's 9% healthy figure is not steady-state — there are weeks where a single hosting provider's outage turns a large fraction of the 2,181 audited endpoints red within ten minutes. If every tenant who is monitoring servers on that provider receives an individual page in the same minute, the alert router emits roughly two million Slack POSTs in the same five minutes, hits the per-workspace rate limit on every workspace, gets soft-banned at the API level, and then fails closed on every other tenant's alerts for the next thirty minutes — including the unrelated tenants whose unrelated servers happened to fail in the same window. The right behaviour is to detect the shared root cause, collapse the per-tenant pages into a single global notice, and emit per-tenant addenda only on opt-in. Single-tenant operators cannot encounter this; multi-tenant operators encounter it on the first major regional outage and have to design for it before then.

The fourth failure mode is volumetric. A single tenant whose server is flapping every 60 seconds — typical of a misconfigured Cloud Run revision in a deploy loop — emits 1,440 alert events in a 24-hour period. Without per-tenant rate-limiting, the alert router happily dispatches all 1,440. The tenant's on-call gets paged 1,440 times. By page 30 the on-call has muted the channel; by page 100 the tenant complains; by page 1,000 the tenant churns. The same tenant burns ~1,440 of the 50,000 daily Slack-API calls per workspace, which is fine for one tenant, less fine for the 4,800 tenants in the same Slack-shared-rate bucket. The fix is per-tenant alert budgets at the router, not at the on-call's mute settings.

The fifth failure mode is informational. The single-tenant alert payload usually includes everything the operator might need to debug — region names, probe-step indices, upstream IP addresses, JSON-RPC error codes, hash-state diff, the supervisor's internal trace ID. For one tenant, that is fine; the tenant is the operator. For many tenants, leaking the supervisor's internal trace IDs into a tenant's Slack channel is a confidentiality bug — the trace ID is a key into the supervisor's log corpus, and the supervisor's log corpus contains other tenants' verdicts, error reports, and credential-rotation events. The right boundary is "the alert payload contains the verdict the tenant paid for, plus a tenant-scoped reference for follow-up; nothing else". Drawing that boundary correctly is structural; doing it at log-line review time is not enough.

Each of these five failure modes is the seed of a layer that exists in the multi-tenant alert router and does not exist in the single-tenant one. The next sections walk each layer in turn.

Sink ownership verification — the structural fix for paste-a-webhook attacks

The whole point of the verification layer is to make it structurally impossible for tenant A to save tenant B's alert sink as tenant A's alert sink. The two ingredients that achieve this are a handshake at configuration time and a repeat check at send time. Both are necessary; either one alone is breakable.

Sink-ownership verification looks different per sink type. The four supported sinks are Slack, generic webhook, email, and PagerDuty; each has a different ground-truth source for "who owns this sink".

Slack

Slack's incoming webhooks are not authenticated; possession of the URL is the entire authorisation. Verification therefore cannot rely on the URL alone. The handshake we use is an inbound-proof-token flow:

  1. Tenant pastes the candidate Slack webhook URL into the AliveMCP dashboard.
  2. The router generates a 256-bit random token and a server-side record (tenant_id, sink_kind=slack, candidate_url, token, expires_at = now + 10min).
  3. The router POSTs once to the candidate URL with a payload that says, in plain English, "AliveMCP is verifying that you own this Slack channel. To complete setup, paste this verification token back into the AliveMCP dashboard within 10 minutes: VFY-…".
  4. The tenant copies the token from their Slack channel and pastes it back into the AliveMCP dashboard, on the page they originated the request from. The router compares the pasted token against the stored token for that tenant ID. If they match and the timestamp is fresh, the sink is verified.
  5. If the tokens don't match, or the time has expired, or the candidate URL was the same as another tenant's verified URL, the verification fails and the candidate record is deleted.

This handshake makes paste-a-webhook attacks structurally hopeless. If tenant A pastes tenant B's Slack webhook URL into their own configuration, the verification token is delivered to tenant B's Slack channel, not tenant A's; tenant A cannot read it, cannot complete the handshake, and the verification fails after ten minutes. Tenant B sees a verification message about a sink-ownership challenge they did not initiate; this is itself a signal — we display "An AliveMCP user attempted to verify this Slack channel as their alert sink and could not. If this was you, retry from the dashboard. If not, no further action is needed; the verification did not succeed." in the verification message body. The wording matters; users misread "an attempt was made" as "a breach occurred".

Generic webhook

Generic webhooks require a slightly different handshake because the receiving service may not be a chat surface — it may be the tenant's own internal alert pipeline, a GitOps annotation service, or a custom CI guardrail. The handshake we use is a domain-of-origin proof:

  1. Tenant pastes the candidate webhook URL.
  2. The router parses the URL and extracts the host. The host is required to be a registrable domain — IP addresses, localhost, and unresolvable hosts are rejected at this step.
  3. The router generates a verification token and instructs the tenant to publish a TXT record at _alivemcp-verify.<their-domain> containing the token. (Cloudflare's "Origin CA Verification" follows roughly this pattern; we copy it.)
  4. The tenant publishes the TXT record. The router resolves the TXT record. If the record contains the token, the host is verified for that tenant.
  5. The verified host is bound to the tenant ID. Future webhook configurations to any URL under the same registrable domain are auto-approved (so the tenant can configure https://alerts.example.com/mcp-down and https://alerts.example.com/mcp-degraded after verifying example.com once).

The TXT-record handshake has the same paste-protection property as the Slack one. If tenant A pastes a webhook URL pointing at tenant B's domain, tenant A cannot publish a TXT record under tenant B's domain and the verification fails. The verification persists for one year; we re-check the TXT record at the one-year mark and on any change to the candidate URL's host. The TXT record is also checked at send time — every webhook POST is gated on the record still resolving correctly, with a 24-hour cache. If the TXT record has been removed, the router stops sending, marks the sink as "ownership verification lapsed", and notifies the tenant in-dashboard before the next alert event.
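
To make the registrable-domain rejection and the send-time gate concrete, here is a minimal sketch in Go. The function names, the use of golang.org/x/net/publicsuffix for registrable-domain extraction, and the standard-library resolver are illustrative assumptions, not the production code.

import (
    "context"
    "fmt"
    "net"
    "net/url"

    "golang.org/x/net/publicsuffix"
)

// registrableDomain rejects IPs, localhost, and empty hosts, then reduces the
// candidate webhook URL to the registrable domain the TXT record must live under.
func registrableDomain(rawURL string) (string, error) {
    u, err := url.Parse(rawURL)
    if err != nil {
        return "", fmt.Errorf("parse: %w", err)
    }
    host := u.Hostname()
    if host == "" || host == "localhost" || net.ParseIP(host) != nil {
        return "", fmt.Errorf("%q is not a registrable domain", host)
    }
    return publicsuffix.EffectiveTLDPlusOne(host)
}

// txtStillValid is the send-time gate: the verification token must still appear
// in a TXT record at _alivemcp-verify.<domain>. The router caches a positive
// answer for 24 hours; a lookup failure means "stop sending and lapse the sink".
func txtStillValid(ctx context.Context, domain, token string) (bool, error) {
    records, err := net.DefaultResolver.LookupTXT(ctx, "_alivemcp-verify."+domain)
    if err != nil {
        return false, err
    }
    for _, r := range records {
        if r == token {
            return true, nil
        }
    }
    return false, nil
}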

Email

Email is the weakest of the four sinks because there is no structural ownership proof for an email address — the only proof is "the recipient can read it", which we test by sending a verification email. The handshake is the same as every other product's email verification:

  1. Tenant enters their candidate alert email.
  2. Router sends a one-time verification link to that address.
  3. Tenant clicks the link from the same browser session as the one that originated the configuration. The link contains a token that is bound to the tenant ID.
  4. On click, the router compares the in-link tenant ID against the session's tenant ID. They must match. If they don't, the verification fails (this is the protection against tenant A pasting tenant B's known email — tenant B can verify their own email, but the binding will be to tenant B's tenant ID, not tenant A's).

Verification expires every 90 days. We do not allow shared aliases like oncall@example.com to be verified across multiple tenants — the verification is recorded with the tenant ID, and a second tenant attempting to verify the same address gets prompted to add themselves as a CC on the first tenant's verified address rather than registering a parallel binding. The single-binding rule prevents the same shared inbox from being targeted by alert storms from many tenants.
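
A minimal sketch of the step-4 binding check, assuming an email_verify_tokens table and an alert_sinks row shape that are illustrative rather than the production schema. The protection is the tenant-ID comparison, not the click itself.

import (
    "context"
    "database/sql"
    "errors"
)

// CompleteEmailVerify is called from the click handler. sessionTenantID comes
// from the authenticated dashboard session, never from the link itself.
func CompleteEmailVerify(ctx context.Context, db *sql.DB, sessionTenantID, linkToken string) error {
    var issuedFor, address string
    err := db.QueryRowContext(ctx, `
        SELECT tenant_id, email_address
        FROM email_verify_tokens
        WHERE token = $1 AND expires_at > now()
    `, linkToken).Scan(&issuedFor, &address)
    if err == sql.ErrNoRows {
        return errors.New("verification link expired or not found")
    }
    if err != nil {
        return err
    }
    // The structural protection: a link issued for tenant B cannot be completed
    // from tenant A's session, even if tenant A somehow obtained the link.
    if issuedFor != sessionTenantID {
        return errors.New("verification link was issued for a different account")
    }
    // A unique constraint on the address enforces the single-binding rule from
    // the prose; the "add yourself as a CC" path is not shown here.
    _, err = db.ExecContext(ctx, `
        INSERT INTO alert_sinks (tenant_id, sink_kind, sink_address, verified_at)
        VALUES ($1, 'email', $2, now())
    `, sessionTenantID, address)
    return err
}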

PagerDuty

PagerDuty is the only one of the four sinks where ownership proof is built into the integration. We use the official integration flow — OAuth 2.0 with PKCE against the PagerDuty workspace, scoped read-write on incident.create — and rely on PagerDuty's own session model to bind the integration to the tenant's PagerDuty account. The token is stored in the same per-tenant KMS-encrypted secret store described in the collector walkthrough. Re-authentication is required every 90 days or on any router-level signal that the workspace has been disconnected (which we detect by the PagerDuty API returning 403 on incident.create).

The four sink types are all the alert router supports. We deliberately don't support arbitrary push notifications, SMS to arbitrary phone numbers, or arbitrary IRC bots — each of those would require its own ownership-proof story and we have not yet judged the volume to be worth the complexity. Adding a fifth sink type in the future requires designing the handshake first; the rule is "no sink ships without a structural ownership proof".

Tenant-scoped configuration with cross-tenant write protection

The alert configuration is stored in a Postgres table — one row per (tenant_id, server_slug, sink_kind, sink_id), with the sink-ownership-verification timestamp, the alert-budget tier, the per-event suppression flags, and the payload-shape preferences. The configuration is read on every alert-event evaluation and written from the dashboard.

The cross-tenant write protection is structural at three layers, defence-in-depth.

The first layer is the API. The dashboard's "save alert configuration" endpoint takes a tenant ID from the authenticated session, never from a request parameter. Any attempt to write a row with a different tenant ID is rejected at the request boundary; no path through the dashboard exposes tenant ID as a writable field.

The second layer is the database. The alert_config table has a Postgres row-security policy that enforces tenant_id = current_setting('app.tenant_id') on every read and every write. The dashboard's database role sets the session variable to the validated tenant ID at the start of every request and resets it at the end. If the API has a bug that lets a request through with the wrong tenant ID, the row-security policy still rejects the write — the row insert returns zero rows affected and the API returns 4xx.
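
A minimal sketch of the per-request binding, assuming a row-security policy that compares tenant_id against current_setting('app.tenant_id'); the withTenant helper name is illustrative. The point of SET LOCAL semantics is that a pooled connection cannot carry one tenant's ID into the next request.

import (
    "context"
    "database/sql"
)

// withTenant runs fn inside a transaction whose app.tenant_id setting is bound
// to the validated tenant ID. set_config(..., true) is the parameterisable form
// of SET LOCAL, so the value is scoped to this transaction only.
func withTenant(ctx context.Context, db *sql.DB, tenantID string, fn func(*sql.Tx) error) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // harmless after a successful Commit

    if _, err := tx.ExecContext(ctx,
        `SELECT set_config('app.tenant_id', $1, true)`, tenantID); err != nil {
        return err
    }
    if err := fn(tx); err != nil {
        return err
    }
    return tx.Commit()
}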

The third layer is the sink-ownership verification described in the previous section. Even if both the API and the database row-security fail simultaneously — a vanishingly unlikely scenario — the candidate sink will not be verifiable as belonging to the writing tenant, the configuration will sit in the unverified state, and no alert events will be dispatched through it. The router refuses to dispatch through unverified sinks.

This three-layer defence has paid for itself once in our deployment. A bug in a refactor of the dashboard router accidentally let an authenticated tenant pass tenant_id as a query parameter on a sink-update endpoint, which fooled the API layer. The row-security policy at the database layer was not fooled — Postgres rejected the row with the wrong tenant ID, the API returned 500, and the bug surfaced in our error log within seven minutes. We fixed the API, but the value of the defence-in-depth is that the bug never reached production data. The single-tenant equivalent of this is "we don't have multi-tenant bugs because we don't have multiple tenants"; multi-tenant operators do have multi-tenant bugs, regularly, and the only thing that contains them is depth.

One subtle property of the cross-tenant write protection is that it composes with the OAuth-discovery cache keyed on (tenant_id, server_slug) rule from the collector walkthrough. The same tenant-prefix-everywhere discipline that prevents tenant A's discovery cache from being poisoned by tenant B's MCP server also prevents tenant A's alert configuration from being read by tenant B's alert evaluator. The unifying rule across the multi-tenant collector and the alert router is "every cache, every queue, every secret, every configuration row is tenant-prefixed; the prefix is the security boundary; the absence of the prefix in any path is a bug". When you read this in a single-tenant codebase, it sounds like over-engineering. In a multi-tenant codebase, it is the only way to keep your invariants straight.
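
The discipline can be made mechanical with a key builder that refuses to construct an unprefixed key. The helper below is an illustrative sketch, not the production code; the point is that "forgot the prefix" becomes an immediate, loud failure instead of a silent cross-tenant read.

import "fmt"

// tenantKey builds every Redis and cache key the router touches. Refusing to
// build a key without a tenant ID makes the prefix rule enforceable in review
// and at runtime rather than by convention.
func tenantKey(tenantID, namespace string, parts ...string) (string, error) {
    if tenantID == "" || namespace == "" {
        return "", fmt.Errorf("tenant-unscoped key requested (namespace %q)", namespace)
    }
    key := "t:" + tenantID + ":" + namespace
    for _, p := range parts {
        key += ":" + p
    }
    return key, nil
}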

Cross-tenant alert suppression — collapsing a registry-wide outage to one notice

The cross-tenant suppression rule is the single most consequential design choice in the alert router. Its job is to recognise when the right behaviour is "page everyone individually" and when the right behaviour is "page nobody individually, page once globally". The decision is made per alert minute, per upstream root cause.

The rule is: if more than 10% of all tenants would be paged in the same alert minute for the same upstream root cause, collapse the per-tenant pages into a single global notice and emit per-tenant addenda only on opt-in.

The rule is mechanical, not editorial — it does not require a human to declare an incident, and it does not depend on the magnitude of the outage. The 10% threshold was set after looking at our own incident history; the four registry-wide outages we observed in the first six months all crossed 35%, and the four single-tenant flap incidents we observed in the same period all stayed under 0.4%. There is a comfortable gap between the two regimes, and 10% sits in the middle without ambiguity.

The mechanism is straightforward. The verdict-minute Redis from the collector emits one verdict per server per region per minute; the alert evaluator reads them in batch at every minute boundary. Before dispatching, the evaluator computes a shared-cause cluster over the changed verdicts using two dimensions:

  • The error kind: the probe-level failure class the verdict carries. Verdicts with different error kinds are never clustered together.
  • The upstream origin: the ASN and, where known, the registry the failing endpoint is hosted under. These are the upstream_asn and registry_origin columns the suppression SQL in the recipe section groups on.

A cluster whose distinct-tenant count crosses the 10% threshold is collapsed; every other cluster dispatches normally.

Collapsed clusters take a different code path:

  1. One global notice is composed: "AliveMCP is observing a degradation across [N] of [M] monitored MCP servers, all hosted under [ASN/registry]. We've detected this since [timestamp]. We will update at [timestamp + 5 minutes]."
  2. The global notice is posted to the AliveMCP public status page (which is itself a multi-tenant rendering of the multi-tenant collector's verdicts) and to a special "registry incidents" channel that any tenant can subscribe to.
  3. Per-tenant alerts are not sent to the tenant's primary sinks — Slack, webhook, email, PagerDuty — for the duration of the suppression window.
  4. Each affected tenant's dashboard shows a banner: "Your N servers are part of a registry-wide incident. Per-server alerts have been suppressed for the duration of the incident; the global notice is at /status. To opt out of suppression for your account, click here."
  5. Tenants who have opted out of suppression — typically because they're enterprise tier and have an SLA that requires per-tenant notification — receive their per-tenant alerts as normal, plus a footer linking to the global notice for context.
  6. The suppression window closes when the cluster's tenant-disjoint count falls below the 10% threshold for two consecutive minutes. At that point the per-server alerts resume normally; tenants whose servers are still in the failure cluster get a "your server is still down" page, with a footer noting that the registry-wide incident has cleared but their server has not.

The tricky bit is that the suppression rule is not symmetric across tenants. The same minute can produce a registry-wide incident from one cluster (suppress to global) and a single-tenant flap from another cluster (page normally). The evaluator processes clusters independently. The two clusters may share tenants — a tenant whose server X is in the registry-wide cluster and whose server Y is in the single-tenant flap cluster will see a global-incident banner for X and a normal page for Y, in the same minute. The bookkeeping is non-trivial; we keep an in-memory (tenant_id, server_slug, cluster_id, decision) tuple per evaluation and emit per-tenant exactly what was decided per server, not what was decided overall.
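
A sketch of that bookkeeping, with illustrative type and field names: the decision is recorded per (tenant, server), so the dispatch step can emit, for each tenant, exactly what was decided for each of their servers.

// clusterDecision is what the evaluator records for every changed verdict in an
// alert minute, before any dispatch happens.
type clusterDecision struct {
    TenantID   string
    ServerSlug string
    ClusterID  string
    Decision   string // "dispatch" | "suppressed_global" | "compress"
}

// decisionsByTenant regroups the per-server decisions so a tenant with one
// server in a suppressed cluster and another in a normal cluster receives a
// global-incident banner for the first and a normal page for the second.
func decisionsByTenant(all []clusterDecision) map[string][]clusterDecision {
    out := make(map[string][]clusterDecision)
    for _, d := range all {
        out[d.TenantID] = append(out[d.TenantID], d)
    }
    return out
}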

One subtle property: the suppression rule applies only to alerts about degradation, not to recoveries. If 35% of tenants would be paged for a "your server is up after being down" event in the same minute, we do not suppress those — recovery notifications are valuable and not paging-storm-shaped. The rule fires only on alerts that page; recovery and informational events go through the normal path.

Another subtle property: the threshold is a moving fraction, not a static count. A 10% suppression threshold on 100 tenants is 10 tenants; on 10,000 tenants it is 1,000 tenants. As the platform grows, the threshold automatically scales. We do not want to wake up at 10,000 tenants and discover the threshold is now miscalibrated. The fraction is tier-aware too — Public-tier tenants are excluded from the denominator (they read from the global probes anyway, they don't have private alert sinks), so the suppression metric is computed over the population of tenants who have actual alert configurations.

The cross-tenant suppression rule is one of those features that looks small in the spec — one Boolean per cluster, one fraction, one threshold — and that prevents the alert path from melting under the load a registry-wide outage reliably produces. Without it, the first registry-wide outage would page 4,800 tenants in the same minute, hit Slack rate limits across hundreds of workspaces, and the alert router would spend the next 30 minutes degrading other tenants' unrelated alerts. With it, the same outage produces one global notice, ten thousand dashboard banners, and a tenant-by-tenant ramp-back as servers recover. The first time the rule fires for real, the operator's relief that they are not the human in the middle of the storm is what justifies the design.

Per-tenant alert budgets — bounding the runaway-tenant problem

The alert-budget layer is the equivalent at the alert router of the per-tenant rate-limiter at the probe scheduler from the collector walkthrough. Its job is to make sure that one tenant's flapping server cannot consume the alert path's headroom or the upstream sink's rate limit at the expense of other tenants.

The budget is per-tenant per-rolling-window, structured by tier. The same tier mapping from the public pricing that capped probe frequency in the collector sets the alert caps here: each tier gets a per-hour and a per-day cap on dispatched alerts per (tenant, server, sink), rising with the tier, and Public-tier tenants, who have no private alert sinks, have no budget to spend.

The budget is enforced at the router, before dispatch. Each alert event passes through three checks in order:

  1. Sink ownership still verified? If the sink-ownership-verification timestamp is more than 90 days old, or the TXT record has lapsed, or the OAuth token has expired, the event is held in a "verification lapsed" queue and the tenant is notified in-dashboard. No dispatch.
  2. Within budget? The router checks a Redis-backed sliding window — one Redis key per (tenant_id, server_slug, sink_kind) holding a sorted set of timestamps. The set is trimmed to the last 24 hours on every check, and the per-hour and per-day counts are computed from the trimmed set. If any cap is exceeded, the event enters compressed mode for that (tenant, server); the trigger time and the event are appended to a "compressed" Redis list for the next hourly aggregator to drain.
  3. Not in cross-tenant suppression? If the event is part of a cluster that exceeds the 10% threshold, the event is suppressed per the previous section's rule.

The compressed-mode behaviour has a non-obvious property: it is itself rate-limited. The hourly aggregator emits at most one compressed-mode digest per (tenant, server) per hour. If the same server flaps 1,440 times in 24 hours, the tenant receives 24 digests, not 1,440 individual alerts. The digest's payload contains the count of flaps, the average duration, the longest down period, the timestamps of the first and last events, and the same dashboard link as the per-event payload. Tenants almost universally tell us the digest is more useful than the individual events for chronic-flap servers; for genuine transient incidents, the per-event mode kicks in normally.
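
A sketch of the digest's shape, with illustrative field names that follow the list above; the production serialisation may differ.

import "time"

// flapDigest is what the hourly aggregator emits for a (tenant, server) pair
// that has gone over budget: one payload summarising the hour, instead of one
// payload per flap.
type flapDigest struct {
    Server             string    `json:"server"`
    FlapCount          int       `json:"flap_count"`
    AvgDurationSeconds float64   `json:"avg_duration_seconds"`
    LongestDownSeconds float64   `json:"longest_down_seconds"`
    FirstEvent         time.Time `json:"first_event"`
    LastEvent          time.Time `json:"last_event"`
    DashboardURL       string    `json:"dashboard_url"`
}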

The Lua script that implements the budget check is short — about 40 lines of Redis Lua. The script is atomic: it trims the set, computes the per-hour and per-day counts, appends the timestamp only when the event will dispatch, and returns a verdict (dispatch or compress) to the rest of the alert pipeline. We use Lua specifically because the check-and-append must be atomic — without atomicity, the router can dispatch multiple times for the same trigger event under load.

One thing the budget layer does not do: it does not adapt the cap based on past behaviour. A tenant who has flapped 1,440 times in 24 hours is still subject to the same cap the next day. The cap is a contractual upper bound on alert volume, not a behavioural guess. Adaptive caps sound nicer but lead to a class of "I disabled my alerts because I muted them, and now I'm being woken up because I un-muted them" bugs that are not worth their complexity.

Payload-shape boundaries — what goes in the alert and what doesn't

The payload-shape layer is the alert router's analogue of the read-side API's "small fixed JSON contract" from the read-side walkthrough. The alert payload contains exactly the fields the tenant is paying for; nothing more. Everything that could be in the payload — internal supervisor IDs, cross-tenant identifiers, regional probe-error codes the tenant did not pay for visibility into, hash-state diff tables — is structurally excluded.

Per-sink payload templates we ship:

Slack payload

One block-kit message per alert event, with the following sections (in this order):

  • A one-line header: the server slug and its new state (the same one-line summary the PagerDuty payload reuses).
  • The state transition (previous state → current state) with the as-of timestamp of the sealed verdict minute.
  • The regions that failed and the last-green timestamp.
  • A link to the tenant's dashboard page for the server, which carries the detail the alert deliberately leaves out.

Fields explicitly not in the payload: probe-step indices (the tenant didn't pay for per-step detail), JSON-RPC error codes (these go to the dashboard but not the alert), upstream IP addresses (those leak ASN-level info that the tenant didn't pay for), the supervisor's trace ID, the verdict-minute coalescer's state, the Redis key path. Each of those is omitted by structural construction in the template — there is no way to accidentally include them, because the template's input slots only take the fields listed above.
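
A sketch of the fixed-input-slot idea for the Slack template, using Block Kit header and section blocks. The function signature is the allow-list; the names and formatting are illustrative, not the production template.

import (
    "encoding/json"
    "fmt"
    "strings"
    "time"
)

// BuildSlackAlert takes only the fields the payload is allowed to carry.
// Anything else (trace IDs, IPs, error codes) has no parameter and therefore
// no way into the rendered message.
func BuildSlackAlert(server, state, prevState string, asOf, lastGreen time.Time,
    regionsFailed []string, dashboardURL string) ([]byte, error) {

    header := fmt.Sprintf("%s is %s", server, state)
    body := fmt.Sprintf("*%s → %s* at %s\nRegions failed: %s\nLast green: %s\n<%s|Open dashboard>",
        prevState, state, asOf.UTC().Format(time.RFC3339),
        strings.Join(regionsFailed, ", "),
        lastGreen.UTC().Format(time.RFC3339),
        dashboardURL)

    msg := map[string]any{
        "blocks": []map[string]any{
            {"type": "header", "text": map[string]any{"type": "plain_text", "text": header}},
            {"type": "section", "text": map[string]any{"type": "mrkdwn", "text": body}},
        },
    }
    return json.Marshal(msg)
}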

Webhook payload

Webhooks are slightly different — generic webhook receivers may want machine-parseable JSON. We use the same small fixed JSON contract as the read-side API:

{
  "event": "down",
  "as_of": "2026-04-30T14:32:00Z",
  "server": "your-server-slug",
  "state": "down",
  "previous_state": "up",
  "regions_failed": ["us-east", "us-west", "eu-west", "sa-east"],
  "last_green": "2026-04-30T14:25:00Z",
  "tenant_id_anonymous": "abc123",
  "dashboard_url": "https://alivemcp.com/d/...",
  "verification_signature": "v1=..."
}

tenant_id_anonymous is not the real tenant ID — it's a tenant-prefixed hash that the tenant's downstream systems can use to correlate events without leaking the actual tenant slug. The verification_signature is an HMAC over the rest of the body keyed on a per-tenant webhook signing secret rotated every 90 days; it lets the tenant's downstream pipeline confirm the event came from AliveMCP and not from someone replaying old payloads. The HMAC pattern is the same one Stripe uses for their webhook deliveries; we copy it.

Email payload

One email per event, plain-text first with an HTML alternative. The plain-text body contains the same fields as the Slack payload but in a list. Subject line: [AliveMCP] [server_slug] is down. From address: alerts+[anonymised tenant ID]@alivemcp.com — the anonymised tenant ID lets the tenant's email rules filter alerts by tenant without exposing the canonical slug.

The email format is deliberately conservative — no remote images, no tracking pixels, no HTML-only content. We tested a richer HTML format with charts inline; it broke in three of the four corporate email clients we tested and the operator who runs alerts via the email sink read the plain-text version anyway.

PagerDuty payload

PagerDuty incidents have a specific payload shape we follow strictly. The dedup key is {tenant_id_anonymous}-{server_slug}-{event_kind} — this lets PagerDuty fold multiple events for the same server into one incident, with a re-trigger for each new event. Severity follows the verdict colour: critical for down, warning for degraded, info for recovery. The summary is the same one-line from the Slack header. The custom_details object contains the regions_failed list, the last_green timestamp, and the dashboard URL — nothing else.
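
A sketch of the mapping; the helper names are illustrative, and the routing-key handling and HTTP delivery are omitted.

import "fmt"

// pdSeverity maps the verdict colour onto the three severities named above:
// critical for down, warning for degraded, info for recovery.
func pdSeverity(eventKind string) string {
    switch eventKind {
    case "down":
        return "critical"
    case "degraded":
        return "warning"
    default:
        return "info"
    }
}

// pdDedupKey folds repeat events for the same server into one incident.
func pdDedupKey(anonTenant, serverSlug, eventKind string) string {
    return fmt.Sprintf("%s-%s-%s", anonTenant, serverSlug, eventKind)
}

// pdCustomDetails is the entire custom_details object: the regions, the
// last-green timestamp, and the dashboard link. Nothing else has a slot.
type pdCustomDetails struct {
    RegionsFailed []string `json:"regions_failed"`
    LastGreen     string   `json:"last_green"`
    DashboardURL  string   `json:"dashboard_url"`
}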

The four payload templates share four design rules that hold across sinks:

  • One event = one payload. No batching across servers; no batching across tenants. Each alert event produces exactly one payload to exactly one sink.
  • No upstream IP addresses. Upstream IP addresses leak ASN-level information that the tenant has not paid for visibility into; in adversarial cases they identify the upstream's hosting provider, which can be sensitive.
  • No supervisor internals. Trace IDs, queue depths, worker process IDs, Redis key paths — none of these appear in any payload. They are debugging-time data, available in the dashboard for tenants who pay for the dashboard's deeper modes; they are not alert-time data.
  • No cross-tenant identifiers. A tenant who's paying for visibility into their own servers does not see another tenant's anonymised ID in their payload, ever. This is the structural-exclusion principle from the templates; we audit it on every template change.

The payload-shape boundary is the most subtle of the five layers. It is also the one most likely to drift over time as new fields get added — every "wouldn't it be useful if the alert showed X?" pull request is a structural-exclusion question. The payload-template files have a comment at the top that lists the exclusion rules and the reasoning for each one; review of any payload template change must justify why the exclusion rule does not apply, in writing, in the PR. We've rejected three field additions on these grounds; we've accepted six others where the exclusion rule did not apply (the field was a tenant-paid-for-feature, on the tenant's own data).
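
A sketch of that audit as a field-set test over the webhook builder from the recipe section below; the test name and fixture values are illustrative. The test fails if the rendered payload gains or loses a top-level field, even when the new field's value is empty.

import (
    "encoding/json"
    "testing"
    "time"
)

// TestWebhookPayloadFieldSet asserts the rendered payload contains exactly the
// allow-listed fields and no others.
func TestWebhookPayloadFieldSet(t *testing.T) {
    allowed := map[string]bool{
        "event": true, "as_of": true, "server": true, "state": true,
        "previous_state": true, "regions_failed": true, "last_green": true,
        "tenant_id_anonymous": true, "dashboard_url": true,
        "verification_signature": true,
    }

    body, err := BuildWebhookPayload(
        "down", "example-server", "down", "up",
        time.Now(), time.Now().Add(-7*time.Minute),
        []string{"us-east"}, "abc123", "https://alivemcp.com/d/example", "test-secret")
    if err != nil {
        t.Fatal(err)
    }

    var rendered map[string]json.RawMessage
    if err := json.Unmarshal(body, &rendered); err != nil {
        t.Fatal(err)
    }
    for k := range rendered {
        if !allowed[k] {
            t.Errorf("payload contains field %q that is not in the allow-list", k)
        }
    }
    for k := range allowed {
        if _, ok := rendered[k]; !ok {
            t.Errorf("payload is missing expected field %q", k)
        }
    }
}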

Reference recipes

Four small recipes, copy-pasteable, that give a real shape to the architecture above. None of them is the production code; all of them illustrate the contract.

Slack-sink verification handshake (~70 lines of Go)

The handshake step from the sink-ownership-verification layer. Generates a token, posts to the candidate URL, stores the candidate in Postgres, expires after 10 minutes.

func StartSlackVerify(ctx context.Context, db *sql.DB, tenantID, candidateURL string) (string, error) {
    // 256-bit token, base64url-encoded, prefixed for clarity in the user's screen.
    raw := make([]byte, 32)
    if _, err := rand.Read(raw); err != nil {
        return "", fmt.Errorf("rand: %w", err)
    }
    token := "VFY-" + base64.RawURLEncoding.EncodeToString(raw)

    // Store the candidate; the row is row-secured by tenant_id at the database layer.
    _, err := db.ExecContext(ctx, `
        INSERT INTO alert_sink_candidates
          (tenant_id, sink_kind, candidate_url, verification_token, expires_at)
        VALUES ($1, 'slack', $2, $3, now() + interval '10 minutes')
        ON CONFLICT (tenant_id, sink_kind, candidate_url) DO UPDATE
          SET verification_token = excluded.verification_token,
              expires_at = excluded.expires_at
    `, tenantID, candidateURL, token)
    if err != nil {
        return "", fmt.Errorf("db: %w", err)
    }

    // Post the verification message to the candidate URL.
    body := map[string]any{
        "text": fmt.Sprintf(
            "AliveMCP is verifying that you own this Slack channel. " +
            "If you initiated this, paste this token back into the AliveMCP " +
            "dashboard within 10 minutes:\n\n*%s*\n\n" +
            "If you did not initiate this, you can safely ignore this message; " +
            "the verification will fail without further action.", token),
    }
    payload, _ := json.Marshal(body)
    req, _ := http.NewRequestWithContext(ctx, "POST", candidateURL, bytes.NewReader(payload))
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", fmt.Errorf("post: %w", err)
    }
    defer resp.Body.Close()
    if resp.StatusCode != 200 {
        return "", fmt.Errorf("slack returned %d", resp.StatusCode)
    }
    return token, nil
}

func CompleteSlackVerify(ctx context.Context, db *sql.DB, tenantID, candidateURL, pastedToken string) error {
    // The query is row-secured by tenant_id at the database layer; if the API
    // is fooled into passing the wrong tenantID, the row-security policy still
    // rejects the read.
    var stored string
    err := db.QueryRowContext(ctx, `
        SELECT verification_token FROM alert_sink_candidates
        WHERE tenant_id = $1
          AND sink_kind = 'slack'
          AND candidate_url = $2
          AND expires_at > now()
    `, tenantID, candidateURL).Scan(&stored)
    if err == sql.ErrNoRows {
        return errors.New("verification expired or not found")
    }
    if err != nil {
        return fmt.Errorf("db: %w", err)
    }
    if subtle.ConstantTimeCompare([]byte(stored), []byte(pastedToken)) != 1 {
        return errors.New("token mismatch")
    }
    // Promote candidate to verified sink.
    _, err = db.ExecContext(ctx, `
        WITH promoted AS (
            DELETE FROM alert_sink_candidates
            WHERE tenant_id = $1 AND sink_kind = 'slack' AND candidate_url = $2
            RETURNING tenant_id, sink_kind, candidate_url
        )
        INSERT INTO alert_sinks (tenant_id, sink_kind, sink_url, verified_at)
        SELECT tenant_id, sink_kind, candidate_url, now() FROM promoted
    `, tenantID, candidateURL)
    return err
}

Cross-tenant suppression check (~30 lines of SQL)

The shared-cause cluster check from the suppression layer. Run once per minute over the events that would otherwise dispatch.

WITH events_this_minute AS (
    SELECT tenant_id, server_slug, error_kind, upstream_asn, registry_origin
    FROM alert_pending
    WHERE alert_minute = $1
), clusters AS (
    SELECT
        error_kind, upstream_asn, registry_origin,
        COUNT(DISTINCT tenant_id) AS tenants_affected
    FROM events_this_minute
    GROUP BY error_kind, upstream_asn, registry_origin
), tenant_population AS (
    SELECT COUNT(*)::float AS total FROM tenants
    WHERE tier IN ('author', 'team', 'enterprise')
), suppressed_clusters AS (
    SELECT error_kind, upstream_asn, registry_origin
    FROM clusters, tenant_population
    WHERE tenants_affected::float / NULLIF(total, 0) > 0.10
)
UPDATE alert_pending p
SET decision = 'suppressed_global', suppression_reason = 'cross_tenant'
FROM suppressed_clusters s
WHERE p.alert_minute = $1
  AND p.error_kind = s.error_kind
  AND p.upstream_asn IS NOT DISTINCT FROM s.upstream_asn
  AND p.registry_origin IS NOT DISTINCT FROM s.registry_origin
  AND NOT EXISTS (
      SELECT 1 FROM tenant_settings ts
      WHERE ts.tenant_id = p.tenant_id AND ts.opt_out_global_suppression = true
  );

Per-tenant alert-budget Lua (~40 lines)

Atomic check-and-append per (tenant, server, sink). Run via EVAL at the budget-check step.

-- KEYS[1] = tenant_alert_log:{tenant_id}:{server}:{sink_kind}
-- ARGV[1] = current_unix_seconds
-- ARGV[2] = per_hour_cap
-- ARGV[3] = per_day_cap
local now = tonumber(ARGV[1])
local hour_cap = tonumber(ARGV[2])
local day_cap = tonumber(ARGV[3])

-- Trim entries older than 24h.
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - 86400)

-- Count entries in last hour and last day.
local count_hour = redis.call('ZCOUNT', KEYS[1], now - 3600, now)
local count_day  = redis.call('ZCARD', KEYS[1])

-- If over caps, return 'compress' and don't add the entry to the log
-- (compression is a separate code path with its own log).
if count_hour >= hour_cap or count_day >= day_cap then
    return 'compress'
end

-- Otherwise add this event to the log and dispatch.
redis.call('ZADD', KEYS[1], now, now .. ':' .. redis.sha1hex(tostring(now)))
redis.call('EXPIRE', KEYS[1], 86400)
return 'dispatch'
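
As a usage sketch: the router invokes the script atomically via EVAL. This assumes the go-redis v9 client and a budgetLua string holding the script text above; the key layout follows the comment at the top of the script, and the caps are placeholders.

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

// budgetScript wraps the Lua above; Run tries EVALSHA first and falls back to
// sending the full script body if it isn't cached server-side.
var budgetScript = redis.NewScript(budgetLua) // budgetLua is assumed to hold the script text

func checkBudget(ctx context.Context, rdb *redis.Client, tenantID, server, sinkKind string,
    hourCap, dayCap int) (string, error) {

    key := fmt.Sprintf("tenant_alert_log:%s:%s:%s", tenantID, server, sinkKind)
    res, err := budgetScript.Run(ctx, rdb, []string{key},
        time.Now().Unix(), hourCap, dayCap).Result()
    if err != nil {
        return "", err
    }
    verdict, ok := res.(string)
    if !ok {
        return "", fmt.Errorf("unexpected script result %T", res)
    }
    return verdict, nil // "dispatch" or "compress"
}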

Webhook payload builder (~30 lines of Go)

Strict structural exclusion in template form. The function takes only the fields the tenant is paying for; everything else is unreachable from this function.

type WebhookEvent struct {
    Event              string    `json:"event"`
    AsOf               time.Time `json:"as_of"`
    Server             string    `json:"server"`
    State              string    `json:"state"`           // "up" | "down" | "degraded"
    PreviousState      string    `json:"previous_state"`
    RegionsFailed      []string  `json:"regions_failed,omitempty"`
    LastGreen          time.Time `json:"last_green"`
    TenantIDAnonymous  string    `json:"tenant_id_anonymous"`
    DashboardURL       string    `json:"dashboard_url"`
    VerificationSig    string    `json:"verification_signature,omitempty"` // empty while signing, filled in afterwards
}

// BuildWebhookPayload accepts only the fields explicitly listed; any caller
// trying to pass supervisor internals or cross-tenant identifiers must add
// them to this function's signature, which is a code review touchpoint.
func BuildWebhookPayload(
    event, server, state, prev string,
    asOf, lastGreen time.Time,
    regions []string,
    anonTenant, dashboardURL, signingSecret string,
) ([]byte, error) {
    e := WebhookEvent{
        Event: event, AsOf: asOf.UTC(),
        Server: server, State: state, PreviousState: prev,
        RegionsFailed: regions,
        LastGreen: lastGreen.UTC(),
        TenantIDAnonymous: anonTenant,
        DashboardURL: dashboardURL,
    }
    // The first marshal omits the empty verification_signature (omitempty), so
    // the HMAC covers the rest of the body; the second marshal adds the signature.
    body, err := json.Marshal(e)
    if err != nil {
        return nil, err
    }
    mac := hmac.New(sha256.New, []byte(signingSecret))
    mac.Write(body)
    e.VerificationSig = "v1=" + hex.EncodeToString(mac.Sum(nil))
    return json.Marshal(e)
}

Six failure modes the multi-tenant alert router must handle

The single-tenant alert path has a small failure-mode catalogue: webhook 4xx, sink down, alert loop. The multi-tenant alert router has those plus six new ones.

  1. Verified-sink rotation under load. Symptom: a tenant rotates their Slack workspace; the old workspace's webhook URL is invalidated; the alert router's verified-sink record still points at the old URL; alerts disappear silently. Fix: the router probes the verified URL with a no-op heartbeat once per day per sink; on 404 or 410 the sink is moved to "verification lapsed" and the tenant is notified in-dashboard; the alert dispatch path refuses to send to a sink whose last successful heartbeat is older than 25 hours. The cost is one Slack/webhook call per tenant per sink per day; we accept it.
  2. Cross-tenant suppression false positive. Symptom: a normal traffic-spike incident at a single hosting provider trips the 10% threshold even though the affected tenants would actually want individual pages; tenants are confused why they got a global notice instead of a per-server alert. Fix: the suppression rule's threshold is configurable per-deployment, and the global-notice template has a tenant-by-tenant addendum link that opens the per-tenant detail page; the threshold sits at 10% globally but enterprise tenants opt out by default per their SLA. False positives are rare in practice (we measured zero in the first eight weeks of the rule being live) but the addendum link is the safety valve.
  3. Cross-tenant suppression false negative. Symptom: a registry-wide outage that affects 8% of tenants does not trip the 10% threshold; 8% of tenants get individually paged, hit their per-tenant rate limits, and degrade. Fix: the threshold is intentionally set well above the worst single-tenant flap rate (we measured 0.4%) and the typical multi-tenant noise floor (we measured ~3%), and well below the share of tenants every registry-wide outage in our incident history has crossed (35%). The 10% threshold is the result of fitting a curve to our incident history; a deployment with a different population may need a different threshold. The rule's parameters (the 10%, the 24-hour-vs-hour windows, the cluster-key composition) are all in a single config file and tunable per deployment.
  4. Verification-token-replay attack. Symptom: a tenant verifies a Slack channel; the verification token is leaked (logged, screenshotted, accidentally shared); an attacker uses the leaked token in a different verification flow to bind the same Slack channel to an attacker-controlled tenant. Fix: the verification token is single-use and bound to the (tenant_id, candidate_url) pair on the server side; an attacker presenting the leaked token under a different tenant_id fails the database lookup. The token is also short-lived (10 minutes); the replay window is therefore narrow. Constant-time comparison on the token check (visible in the recipe) prevents timing-side-channel attacks. The leak itself is still a confidentiality bug — we log nothing in the verification path and recommend tenants do not paste verification tokens into shared channels.
  5. Compressed-mode digest delivery to a flapping sink. Symptom: a tenant's Slack workspace is rate-limited at the workspace level; the per-event alerts are absorbed into compressed-mode digests, but the compressed-mode digest itself fails to deliver because the workspace is still being rate-limited; the tenant gets nothing for hours. Fix: the digest delivery has its own retry logic (up to 4 retries with exponential backoff over 30 minutes), and on final failure the digest is deferred to the next hourly boundary plus a "digest delivery delayed" annotation in the dashboard. The dashboard always has the up-to-the-minute history regardless of sink-delivery success — alerts are a notification convenience, not the source of truth.
  6. Payload-template drift. Symptom: a refactor adds a new field to the internal verdict struct ("upstream_pop_name"); the field accidentally lands in the Slack payload because a template change picks it up; tenants see a CDN POP code in their alerts they didn't pay for visibility into and that leaks ASN-level information. Fix: the payload templates are versioned (we are at v3 of the Slack template, v2 of the webhook template); each version has a snapshot test that asserts the rendered payload contains exactly the listed fields and no others; the test fails on any added field even if the field's value is empty. Adding a field to a template is a multi-step process: justify the addition in the PR, update the snapshot test, update the public payload-spec docs, increment the template version. The version is included in the payload as a hidden footer field for downstream parsers; tenants can pin to a template version contractually if their downstream pipeline is brittle.

Operational rhythm — what the alert router needs from the rest of the stack

The alert router is small in lines of code (~1,200 lines of Go in our deployment, ~200 lines of Lua, ~100 lines of SQL, ~3,000 lines of payload templates and tests) but it has rhythmic dependencies on the rest of the stack that are worth naming.

The router reads the verdict-minute Redis from the collector on every minute boundary. It expects the verdict-minute coalescer to have sealed the previous minute by :00:55 at the latest; if the coalescer is delayed, the router waits up to 4 seconds and then proceeds with whatever it has, marking the alert minute as "partial" in its own log. Partial-minute alerts are still dispatched but include a "this minute's verdict was sealed late; some regions may not be reflected" footer. The footer rarely appears (we measured 6 partial-minute alerts in the last 8 weeks of operation, all during a Redis-replication cutover) but is structurally honest.

The router writes back to the verdict-minute Redis in one place only — the suppression-cluster log, which the public status page reads to render the "registry-wide incident" banner. The write is fire-and-forget; the status page can render a stale banner for one minute without consequence. The router never writes verdicts back to the collector's path; the data flow is strictly one-way (collector → router → sink), which keeps the failure modes shallow.

The dashboard reads from the alert router's own state — verified sinks, alert-budget consumption, cross-tenant suppression status. The dashboard does not reach into the router's Redis directly; it queries via a small read-only API the router exposes at /internal/alert-state/{tenant_id}, signed with the dashboard's mTLS client cert. The boundary keeps the dashboard from accidentally invalidating the router's state during a refactor.

Operationally, the router has three repeating jobs that fire on schedule:

  • Per minute: the alert evaluator reads the sealed verdict minute, computes shared-cause clusters, applies the budget and suppression checks, and dispatches.
  • Per hour: the compressed-mode aggregator drains the per-(tenant, server) compressed lists and emits at most one digest each.
  • Per day: the sink heartbeat worker sends a no-op delivery to every verified sink, moves any sink that fails to "verification lapsed", and refreshes the TXT-record and verification-expiry checks.

The three cycles share no state at the operating-system level; each runs in a separate worker that wakes on a cron and writes to a separate Redis key namespace. The separation is deliberate — a bug in the digest worker cannot delay the per-minute evaluation, and a bug in the heartbeat worker cannot stall the digest. The single-tenant alert path conflates all three into one loop; the multi-tenant router has to separate them or the failure modes compose.

What does not change at scale

For all the new layers, the per-event semantics — the inside of one alert dispatch — are unchanged from the single-tenant alert path. The alert is fired on a verdict-minute transition (up→down, down→up, up→degraded, etc.); the payload contains the verdict and a dashboard link; the sink is one of the four canonical types; the retry logic is exponential backoff over four attempts; the failure mode on final-attempt failure is "log it, mark the dashboard, move on". Those rules are settled by the time you start scaling out. You are not designing the per-event semantics and the multi-tenant operator at the same time; you are taking the alert-as-spec and wrapping it in the operator-as-implementation. If the per-event behaviour is still in flux when you start scaling, you will end up with an alert router that has multiple alert formats running in parallel — and that is a different kind of failure mode entirely, where two tenants on the same plan are getting structurally different payload shapes for the same event.

The canonical Slack-alert payload we publish in our docs is the same payload our own alert router emits, on the same schedule, with the same template version. There is no internal-only path that emits richer payloads to internal staff; if there were, the structural-exclusion principle would be undermined by the existence of the privileged path. Operationally, our own staff use the same dashboard tenants do, with a single internal tenant ID; the payload they receive is the same payload tenants receive.

The compounding effect of the practical-routine series and the scale series at this point is that the entire stack is composable, with five sequential layers from probe to alert: the credentialed probe is the atom; the multi-region wrapper is the geographical lift; the status page is the human surface; the read-side API is the machine surface; and now the multi-tenant collector and the alert router together are the service-shape — the difference between "running a probe stack for one server" and "operating a probe stack as a hosted service for many tenants". Each layer has a single responsibility, a single set of failure modes, a single piece of shared state with the next layer. Each can be tested in isolation. Each failure mode is bounded. Adding the alert router to the deployment took ~1,200 lines of Go; rewriting the single-tenant alert path to support many tenants, by contrast, would have required threading the tenant ID through every existing helper and would have shipped strictly more bugs.

What's next in the scale sub-series

Two sequels to this post are pre-committed in the AliveMCP blog backlog and will ship over the next four weeks. Each takes one of the layers in this post and goes deep on the part that didn't fit here.

If you operate an MCP server and want the multi-tenant collector and alert router to track yours, join the waitlist — we email the moment a new public post lands and when claimed-listing flow opens. If you're building your own probe stack and want to compare implementation notes, the MCP server Slack alerts reference covers the per-event payload in more detail; the MCP server uptime API covers the read-side surface the alert router shares state with; the UptimeRobot vs AliveMCP comparison covers what a generic uptime SaaS misses about MCP-specific alert routing.

Further reading on AliveMCP