Deep dive · 2026-04-30 · Scale sub-series — alert-router companion
Operating per-tenant alert routing with five staff or fewer
The per-tenant alert routing walkthrough built five new layers on top of the single-tenant alert path: sink-ownership verification with per-sink-type handshakes, tenant-scoped configuration with three-layer cross-tenant write protection, a cross-tenant suppression rule that collapses a registry-wide outage to one notice, per-tenant alert budgets with hourly compressed-mode digests, and payload-shape boundaries with four design rules. That post answered "what does the alert router look like at scale." This post answers a different question: how does a small team actually run that alert router every day, when the entire team is one founder, or one founder plus one ops hire, or maybe five humans on a good week, and the founder is also the on-call? The five layers from the architectural walkthrough survive the small-team setting unchanged — the threat model is the same — but the human routines that operate them are very different from the routines a fifty-person SRE rotation runs. This post is the operator's guide. It maps headcount to alert ownership for one- through five-person deployments. It walks the week-1 setup checklist that turns the alert-router architecture into a working deployment, sketches the daily, weekly, monthly, and quarterly drills that keep the verifications, the suppression rule, and the budgets honest, and names the seven failure modes that show up specifically when the alert router is operated by a small team. And it gives the reference recipes — the sink-verification handshake template, the IdP-bound on-call rotation script, the synthetic-outage drill harness, the sink-credential rotation runbook — that turn the routine into something a one-to-five-person team can actually run without the alert router collapsing into "we just point everything at the founder's Slack DMs and hope."
§1 · TL;DR
A five-or-fewer team operating a multi-tenant alert router is not a smaller version of an enterprise paging team; it has its own shape and its own failure modes. The five-layer alert-router architecture from the previous post still applies — sink-ownership verification, tenant-scoped configuration with cross-tenant write protection, cross-tenant suppression, per-tenant alert budgets, and payload-shape boundaries are all required at any team size — but the way the layers map onto humans, the cadence at which they are exercised, and the failure modes the team has to watch for change with team size. The headcount-to-alert-ownership mapping is the first decision: in a one-person deployment the founder owns sink verification, alert-rule editing, on-call response, the cross-tenant suppression cron's output review, and the budget-cap exception list, with a strict rule that they explicitly switch hats for each task and that the role-switch is captured in the audit log; in a two-person deployment the ops hire takes the on-call response and the founder retains alert-rule editing and verification rotation; in a three-person deployment the third slot is the alert-rule reviewer, who exists structurally to refuse the founder's "just push this rule live, it's small" requests; the four- and five-person deployments add a dedicated on-call rotation with a separate escalation channel and a sink-rotation owner.
The week-1 setup is the minimum-viable boundary: pick which four canonical sinks the platform supports (Slack, generic webhook, email, PagerDuty — the same four as the architectural walkthrough), run the sink-verification handshake template against the team's own internal sinks first, set per-tenant alert budgets per tier with hourly compressed-mode caps, schedule the cross-tenant suppression cron, configure the on-call rotation in the IdP rather than in PagerDuty's own roster, stand up a synthetic-outage drill tenant whose entire purpose is being deliberately broken once a month, configure the payload-shape blacklist CI check, and park a compressed-mode digest reader role on the rotation calendar.
The daily routine is one line: the on-call reads yesterday's notification stream end-to-end, looks for one anomaly, and either notes it as benign or escalates. The weekly routine is the sink-verification re-handshake output and the payload-shape audit. The monthly routine is the synthetic-outage drill: deliberately break the synthetic tenant, watch the alert reach the right sink within the SLA, watch the cross-tenant suppression rule stay silent on it, watch the per-tenant budget reset cleanly. The quarterly routine is the sink-credential rotation drill: rotate the Slack-app token, the webhook signing secret, the email DKIM key, and the PagerDuty OAuth refresh token, in that order, and watch each verification handshake re-handshake automatically.
Seven small-team failure modes with structural fixes — the founder-paging-themselves problem (a hard rule that the founder is not on the on-call rotation past two ops staff, structurally enforced in the IdP), the customer paste-a-webhook attack on a small workspace (the verification handshake refuses), the cross-tenant suppression false positive when you serve fewer than 100 tenants (the threshold is a percentage with a minimum-tenant-count floor, not a flat percentage), the per-tenant budget set too generous on the free tier (a tier-default registry that is reviewed quarterly), the sink-rotation drill that collides with the support queue (a calendar lock and a same-day rule that no support tickets ship during a sink rotation), the on-call channel that depends on one phone (a hardware-failover plan with a second phone or laptop and the recovery codes already in PagerDuty), and the compressed-mode digest that no one reads (a named role on the rotation calendar, with the read-receipt logged). The recipe section sketches the sink-verification handshake template, the IdP-bound on-call rotation script, the synthetic-outage drill harness, and the sink-credential rotation runbook in copy-pasteable form. This post is the practical companion to the per-tenant alert routing architectural walkthrough; together they describe both halves of how a small multi-tenant MCP-monitoring team operates the alert-router side of the stack. Two more small-team-companion posts are scheduled before the Q3 2026 audit lands mid-July: the small-team companion to the shared-state archiver and the small-team companion to the multi-tenant probe collector.
§2 · Why five-or-fewer changes the alert router
The five layers in the architectural walkthrough are shaped by threat models, not by team size. The defences against a customer pasting a malicious webhook into the configuration UI, against a tenant whose rules write into another tenant's row, against a registry-wide outage that pages every tenant individually instead of as one notice, against a runaway tenant whose budget exhausts the platform's send capacity, against a payload that leaks an upstream IP or a supervisor internal — all of those defences sit in the alert router's middleware and the cross-tenant suppression cron and the per-tenant budget Lua, and they are the same whether the team is one person or fifty. What changes with team size is who watches the verifications, who reviews the suppression-cron output, who renews the sink credentials, and how the on-call rotation works when there is no rotation because there is one operator and one phone.
Three things are different at small scale and they cascade. The first is verification rotation. The sink-ownership verification handshakes are not one-time events — Slack-app tokens get re-issued when the workspace owner changes the app, generic-webhook TXT records get rotated when a customer's DNS provider changes, email per-recipient bindings get challenged whenever the customer changes their contact lists, PagerDuty OAuth refresh tokens expire on whatever cadence PagerDuty picks. With fifty operators a "verifications expiring in the next 7 days" Jira board is a routine ticket; with one operator the same expiry list arrives as a Slack DM and gets buried under three customer support tickets and a partition-roll cron failure. The structural answer is to make the verification renewals run on a calendar-bound cadence the team controls, with the on-call (or the founder, in a one-person deployment) given an explicit slot every Friday for re-verifications, and to make the dashboard refuse to send any alert through a sink whose verification is older than the cadence. We will name this concretely in §5.
The second is the suppression-rule false positive. The cross-tenant suppression rule from the architectural walkthrough collapses ">10% of tenants paged for the same upstream root cause clustered on error.kind+ASN+registry" into one global notice. With fifty thousand tenants the 10% threshold means the rule fires only when at least 5,000 tenants are seeing the same outage; the false-positive rate is essentially zero. With one hundred tenants the same 10% threshold means the rule fires when ten tenants are seeing the same outage, which is well within the noise floor of a normal Tuesday morning if a popular registry's CDN has a bad cache shard. The structural answer is to make the threshold both a percentage and a minimum-tenant-count floor — for example, "fires only when both ≥10% of tenants are paged AND ≥30 distinct tenants are paged" — and to make the floor configurable per platform, with the floor scaling as the tenant count grows. We will name this concretely in §7.
The third is the on-call rotation that does not exist. With one operator the rotation is a fiction. The dashboard cannot pretend the rotation exists; it has to know that "everyone on the rotation" means "the founder" and that the founder's phone is the only PagerDuty-attached device on the platform. The right way to handle this is not to fake a rotation but to be explicit: the team has one on-call human, that human's contact details are in PagerDuty, and the platform's escalation policy is documented as "founder-only, 30-minute auto-escalate to a parked secondary escalation that emails an off-site contact whose only job is to text the founder again." Pretending the rotation is fully staffed when it is not produces worse outcomes than admitting it is a single person and writing down the bus-factor plan that compensates. The right shape for a small team is explicit single-point-of-contact, not simulated rotation: the dashboard knows there is one on-call, the IdP knows there is one on-call, the audit log records that the alert was sent to and acknowledged by that one on-call, and the team writes down what happens when the one on-call is genuinely unreachable.
None of those three are reasons to abandon the five-layer alert router. They are reasons to operate it deliberately. The rest of this post walks how.
§3 · Mapping headcount to alert ownership
The decision of who owns which piece of the alert router is the most consequential staffing choice the deployment makes after the four-layer permission model from the previous companion post. The right answer depends on team size and on what other systems your team already uses (your IdP, your support queue, your shared on-call calendar). The five-team-size mapping below is what we have run; treat it as a starting point that you adapt to your team's actual shape, not as a rule.
One-person deployment — the founder is everything
The single operator owns sink verification, alert-rule editing, on-call response, the cross-tenant suppression cron's output review, the per-tenant budget exception list, and the synthetic-outage drill execution. The auditor seat from the permission-model companion exists on paper but is unstaffed; we provision the auditor account, leave it parked at zero permissions, and use that parked seat to grant alert-router-config read access to a SOC-2 reviewer or a part-time security advisor when the team has one.
The discipline that prevents the alert router from collapsing into "founder's Slack DMs receive everything" is explicit hat-switching on the dashboard. The alert-router config UI in the dashboard has a role selector at the top of the navigation that shows the operator's current role; the default is tenant_scoped_operator scoped to no tenant. To edit a per-tenant alert rule the founder selects a tenant from the tenant-pin field; to edit a platform-wide rule (the cross-tenant suppression rule, the per-tier budget defaults, the payload-shape blacklist) the founder clicks "elevate to root operator" and the click is gated by an MFA prompt with the hardware token, with a justification field that the UI refuses to accept as a copy-paste of the previous justification. Every elevation is in the audit log with the role and the justification. Every tenant pin is in the audit log with the tenant and the justification. The single founder's own audit log is what they read on Monday morning to verify that last week's actions match their memory of what they did; the discipline is to read it. If the founder does not read their own audit log, the audit log is a forensic-only artefact and the alert-router operations have effectively collapsed to "one founder, all permissions, no log."
One non-obvious choice for the one-person deployment: the synthetic-outage drill tenant is provisioned at week one, even though the deployment has zero paying tenants. The reason is that the day a real outage happens, the founder needs to know that the alert router actually delivers, that the cross-tenant suppression rule actually stays silent on a single-tenant outage, and that the per-tenant budget actually resets on the hour. Provisioning the drill tenant at week one means the synthetic-outage harness has been exercised, the founder has a calibration baseline for "what does a real-but-deliberately-injected outage look like in this dashboard," and the drill's output is what the founder reads when the suppression rule fires for the first time in production. None of that is true on day one if the drill tenant is added the morning of the first real outage.
Two-person deployment — founder and first ops hire
The founder retains alert-rule editing for platform-wide rules (the cross-tenant suppression rule, the per-tier budget defaults, the payload-shape blacklist). The first ops hire takes on-call response, sink verification, and per-tenant alert-rule editing. The auditor seat is still parked, for the same reason as the one-person deployment — the day a security advisor or a SOC-2 reviewer arrives, the seat is the first thing they need.
The two-person deployment introduces rotation discipline on the on-call hat, even though the rotation has only one human. The first ops hire is on-call; the founder is the secondary escalation. The founder is not on the day-to-day on-call rotation. If the first ops hire is on holiday, the founder picks up on-call but does so by switching roles to tenant_scoped_operator for the duration of the cover, the same way the founder picks up tenant-scoped support actions during a cover. The role switch is one click and one justification, and is captured in the audit log. The founder does not "log in as root and just answer the page" during the cover, because the structural defence — the dashboard refuses tenant-scoped actions from a root-operator session, the IdP refuses to route on-call to a root-only account — holds. The structural defence is what makes the rotation discipline survive a Saturday outage when the first ops hire is on a flight.
The single most important thing the two-person deployment does is elect a sink-verification reviewer. With one operator, the founder reviews their own verification handshakes and that is the routine. With two, there is a temptation for the first ops hire to verify their own sinks and never have the founder review them, and the verification handshakes silently become unwitnessed. The fix is to put the verification review on the rotation, not on the operator: every sink-verification handshake is approved by a different human from the one who initiated it, the dashboard refuses to mark a verification as "live" until the second human has approved it, and the audit log records both rows. The rotation is what keeps the verification real.
Three-person deployment — adding the alert-rule reviewer
The founder retains platform-wide rule editing. The first ops hire is on the on-call rotation. The third hire — call them the alert-rule reviewer — is the structural counterweight whose only job is to refuse the founder's "just push this rule live, it's small" requests. The auditor seat moves from "parked" to "rotated quarterly with an external advisor." The alert-rule reviewer is staffed before the platform reaches 50 tenants because that is roughly the size at which the founder's rule changes start affecting other people's customers in non-obvious ways.
The alert-rule reviewer is not a senior hire; they are a discipline. The role can be the third human on the team regardless of seniority, provided they have refusal rights structurally enforced in the dashboard: any platform-wide rule change goes through a pull-request-style two-stage flow where the founder proposes the change, the alert-rule reviewer reviews the diff against the rule definitions repo, and the dashboard refuses to apply the change without the reviewer's MFA-gated approval. The audit log records both the proposal and the approval. The reviewer's job is not to be smarter than the founder; it is to be a different pair of eyes whose first job is to ask "is this change going to fire on the synthetic-outage drill tenant the way we expect" and refuse if the answer is unclear. The rule changes that get past the reviewer are the ones the founder bothered to explain.
Four-person deployment — second on-call and escalation
The fourth hire is the second human on the on-call rotation. With two ops staff on a 7-day-on / 7-day-off rotation, the on-call burden becomes survivable; with one ops hire alone the rotation is fiction. The fourth hire flips the rotation from fiction to reality. The dashboard's escalation policy is rewritten: primary on-call rotates between the two ops staff, the founder is the secondary escalation only for the cross-tenant-suppression-fired and tier-budget-cap events, and the alert-rule reviewer (the third hire) is the tertiary escalation for the rare cases where the secondary is unreachable.
The four-person deployment is also the size at which the on-call channel splits. With two ops staff the team is large enough that "the founder's Slack DMs" is no longer a viable on-call channel. The on-call channel becomes a dedicated channel — Slack workspace channel, PagerDuty, or both — with the rotation membership pulled from the IdP rotation group, not maintained by hand in PagerDuty. The IdP-bound rotation is what makes the rotation survive a hire or a leave; the rotation membership is the IdP group, not a static list in PagerDuty's own roster. We will name this concretely in §4, item 5, and in §8.
Five-person deployment — the largest size the model is calibrated for
The fifth hire is the sink-rotation owner — the human whose explicit role is to drive the quarterly sink-credential rotation drill, to file the renewal tickets, and to be the standing reviewer on the verification handshakes when the on-call rotation is unavailable. With five humans and four hats (founder for platform-wide rules, two ops on rotation, alert-rule reviewer, sink-rotation owner) the model is at its calibrated size. Beyond five humans the model still works but the hats start to specialise — the alert-rule reviewer becomes a security engineer, the sink-rotation owner becomes a release engineer, the on-call rotation grows to three or more staff — and at that point the deployment has crossed out of the small-team companion's scope and into the architecture's enterprise-team default.
§4 · The week-1 alert-router setup checklist
The week-1 boundary is the minimum-viable line that converts the alert-router architecture into a running deployment. Every item below is required at any team size; the difference at smaller team sizes is not which items get done but how they are split across humans. The list is calibrated for a one-person deployment to be able to complete in one full working day; larger teams parallelise.
1. Pick the four canonical sinks the platform supports
The architectural walkthrough fixes the four canonical sinks at Slack, generic webhook, email, and PagerDuty. The week-1 decision is not which sinks to add; it is the explicit, written-down statement that only those four are supported. The sink-type field on the alert-rule UI refuses to accept a sink type outside the four. Customers asking for a fifth sink (Discord, Microsoft Teams, Mattermost, SMS-via-Twilio, OpsGenie) get a written response that the platform supports four sinks and will revisit the list at the next quarterly review. The structural reason the list is fixed is that the verification handshake — the structural defence against a paste-a-webhook attack — is per-sink-type, and adding a fifth sink type without a verified handshake widens the attack surface in a way the small team cannot review. Locking the list at four is not a feature ceiling; it is a verification ceiling, and the verification ceiling is what makes the alert router safe to operate with five humans.
2. Run the sink-verification handshake against the team's own internal sinks first
Before any customer's sink is verified, the team's own internal sinks are verified. The on-call channel's Slack workspace is verified through the inbound-proof-token handshake. The team's own dogfooded webhook (typically a webhook into the on-call channel's incident-management bot) is verified through the TXT-record domain-of-origin handshake. The team's email contact for billing-only on-call notices is verified through the per-recipient binding handshake. The team's PagerDuty integration is verified through the OAuth 2.0 PKCE flow. Each verification is a row in the dashboard's verification log, marked "live," with a "verified-by" column pointing to the human who initiated the handshake and an "approved-by" column pointing to a different human (in the one-person deployment the second column is the founder's own audit-trail confirmation rather than a separate human). Verifying the team's own internal sinks through the same path the customers' sinks will travel is what proves the path works before any customer's sink touches it.
3. Set per-tenant alert budgets per tier with hourly compressed-mode caps
The architectural walkthrough specifies per-tenant alert budgets per tier; the week-1 decision is to write down the tier-default registry. For the AliveMCP pricing tiers from the homepage, the budgets are: Public (free) gets 50 alerts/hour, Author ($9/mo) gets 200 alerts/hour, Team ($49/mo) gets 1,000 alerts/hour, Enterprise ($299/mo) gets 5,000 alerts/hour. Above the cap each tier rolls into the hourly compressed-mode digest: a single alert is sent to the tenant's primary sink per hour summarising the suppressed alerts. The tier-default registry lives in a checked-in YAML file (tier-defaults.yaml in the operator-config repo); changes to the registry are platform-wide rule changes and go through the alert-rule reviewer's two-stage approval flow. The compressed-mode digest's payload is itself subject to the payload-shape blacklist CI check (item 7).
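A sketch of what tier-defaults.yaml can contain, using the four budgets above; the key names are illustrative, not the platform's actual schema:
# tier-defaults.yaml — per-tier alert budgets (key names illustrative).
# Edits to this file are platform-wide rule changes: they merge only
# through the alert-rule reviewer's two-stage approval flow.
tiers:
  public:     { price_usd_month: 0,   alerts_per_hour: 50,   over_cap: hourly_compressed_digest }
  author:     { price_usd_month: 9,   alerts_per_hour: 200,  over_cap: hourly_compressed_digest }
  team:       { price_usd_month: 49,  alerts_per_hour: 1000, over_cap: hourly_compressed_digest }
  enterprise: { price_usd_month: 299, alerts_per_hour: 5000, over_cap: hourly_compressed_digest }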
4. Schedule the cross-tenant suppression cron with the minimum-tenant-count floor
The cross-tenant suppression rule from the architectural walkthrough fires when ">10% of tenants paged for the same upstream root cause clustered on error.kind+ASN+registry." For the small-team companion the threshold is rewritten with a minimum-tenant-count floor: ">10% of tenants paged AND ≥30 distinct tenants paged." The floor of 30 is what keeps the rule from firing on a Tuesday morning when 10 of the platform's 80 tenants happen to be probing the same registry shard that just had a CDN hiccup. The floor is configurable in the suppression-cron config; the week-1 default is 30, and the floor is reviewed quarterly with the platform's actual tenant count. The cron runs every 60 seconds against the verdict-minute Redis described in the collector walkthrough; its output is a row in the suppression log with the cluster's error.kind, ASN, registry, and the list of tenant IDs that were silenced. The on-call's daily routine (§5) reads the previous day's suppression-log rows.
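The dual condition is small enough to show inline. A sketch in Go with illustrative names; the production cron evaluates the same predicate against the verdict-minute Redis:
package suppression

// suppressionShouldFire applies the dual threshold from the week-1
// default: the rule fires only when BOTH the percentage condition and
// the minimum-tenant-count floor hold. Names are illustrative, not the
// production cron's API.
func suppressionShouldFire(pagedTenants, totalTenants, minFloor int, minFraction float64) bool {
	if totalTenants == 0 {
		return false
	}
	fraction := float64(pagedTenants) / float64(totalTenants)
	return fraction >= minFraction && pagedTenants >= minFloor
}

// With the week-1 defaults (minFraction=0.10, minFloor=30):
//   10 of 80 tenants paged  -> 12.5% >= 10% but 10 < 30: stays silent.
//   40 of 350 tenants paged -> 11.4% >= 10% and 40 >= 30: fires.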
5. Configure the on-call rotation in the IdP, not in PagerDuty
The week-1 boundary on the rotation is that the rotation membership is an IdP group (Google Workspace group, GitHub team, Authentik group, whichever IdP the team uses), and PagerDuty's roster is populated from the IdP via SCIM or a nightly sync script. The rotation is not maintained by hand in PagerDuty's own roster. The reason is that the IdP is the team's source of truth for "who is allowed to assert that they are an on-call ops engineer," and the alert router's escalation policy must trace back to the IdP for the same audit-log reasons the four-layer permission model from the operator companion post traces back to the IdP. The rotation cadence (week-on / week-off, day-on / day-off, custom) is a separate decision the team makes at week one; the structural decision is that the rotation membership is the IdP group, not a hand-maintained list. We give the script for the IdP-to-PagerDuty sync in §8.
6. Stand up the synthetic-outage drill tenant
The drill tenant is a fully provisioned tenant whose only purpose is to be deliberately broken once a month. It has the four sinks wired to four different drill destinations (a separate Slack channel for drill alerts, a separate webhook into the team's drill-receipt bot, a separate email address that the team owns, a separate PagerDuty service whose escalation goes only to the rotation, never to the founder unless the rotation is asleep). The drill tenant has its own per-tenant budget that is set deliberately low — 10 alerts/hour — so that the compressed-mode digest is exercised every drill. The drill tenant's MCP endpoint is a synthetic endpoint hosted on the team's own infrastructure that the team can break on demand. Provisioning the drill tenant at week one means the monthly drill (§5) is a one-line invocation against the drill harness, not a multi-day setup project the first time the team needs it.
7. Configure the payload-shape blacklist CI check
The payload-shape boundaries from the architectural walkthrough — one event per payload, no upstream IPs, no supervisor internals, no cross-tenant identifiers — are enforced by a CI check that runs against every alert-rule change and every payload-template change. The CI check is the structural defence against a payload-template drift; without it the rules drift over time as customers ask for "just include the upstream IP this once" and the team adds it without remembering that the rule exists for a reason. The CI check is a single Go test that loads the latest payload-template-set and validates each template against four assertions: the template emits exactly one event per payload, the template's variable bindings include no upstream IP fields, the template's variable bindings include no supervisor-internal fields, the template's variable bindings include no fields that name another tenant's identifier. The CI check is run on every PR to the operator-config repo; the team's PR cannot merge without the check passing. The CI check is the smallest possible representation of the payload-shape contract — four assertions, one Go file — and the smallness is what makes it survive the small team's review cadence.
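The check is small enough to sketch in full. A minimal version, assuming an illustrative Template type and binding-name prefixes; the operator-config repo's real representation will differ:
// payload_shape_test.go — a sketch of the four-assertion check described
// above; Template, its fields, and the binding-name prefixes are
// illustrative stand-ins for the repo's real schema.
package operatorconfig

import (
	"strings"
	"testing"
)

// Template is an illustrative stand-in for the checked-in template type.
type Template struct {
	Name             string
	EventsPerPayload int
	VariableBindings []string
}

// loadTemplateSet is a hypothetical helper; swap in however the repo
// actually stores the payload-template-set.
func loadTemplateSet(t *testing.T) []Template { t.Helper(); return nil }

func TestPayloadShape(t *testing.T) {
	for _, tpl := range loadTemplateSet(t) {
		// Assertion 1: exactly one event per payload.
		if tpl.EventsPerPayload != 1 {
			t.Errorf("%s: emits %d events per payload, want 1", tpl.Name, tpl.EventsPerPayload)
		}
		for _, b := range tpl.VariableBindings {
			switch {
			// Assertion 2: no upstream IP fields.
			case strings.HasPrefix(b, "upstream.ip"):
				t.Errorf("%s: binds upstream IP field %q", tpl.Name, b)
			// Assertion 3: no supervisor-internal fields.
			case strings.HasPrefix(b, "supervisor."):
				t.Errorf("%s: binds supervisor-internal field %q", tpl.Name, b)
			// Assertion 4: no fields naming another tenant's identifier.
			case strings.HasPrefix(b, "tenant.") && !strings.HasPrefix(b, "tenant.self."):
				t.Errorf("%s: binds another tenant's identifier %q", tpl.Name, b)
			}
		}
	}
}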
8. Park the compressed-mode digest reader role on the rotation calendar
The compressed-mode digest is the message that goes to a tenant's primary sink when their per-tenant budget is exhausted; it summarises the suppressed alerts. From the platform side, the team also receives the team's own compressed-mode digest — the message that goes to the on-call channel when the cross-tenant suppression rule fires and a registry-wide outage is collapsed to one notice. That digest is what the on-call reads to know "ah, registry X is having a bad day, here's the cluster of tenants affected, the suppression rule did its job." The digest is high-information but it has to be read; if no one reads it, the suppression rule's output is a forensic-only artefact and the team will find out about a registry-wide outage from a customer ticket instead of from the digest. Parking the digest reader role on the rotation calendar at week one means the role is named, the read-receipt is logged in the audit log when the on-call clicks "acknowledged" on the digest, and the team has a structural defence against the "digest is a Slack notification no one ever read" failure mode (§7).
§5 · Daily, weekly, monthly, quarterly alert-router routines
The week-1 setup is what gets the alert router running. The routines are what keep it running. The routines are calibrated for the on-call to be one human on a one-person deployment, two humans on a 7-day rotation on a four-person deployment, and to scale linearly between. The cadence below is the cadence we have run; treat the times as a starting point and adapt them to your team's actual day length.
Daily — the previous-day notification stream review
Every day, the on-call reads the previous day's notification stream end-to-end. The notification stream is a single chronological log of every alert that left the alert router in the last 24 hours, grouped by tenant, with the sink, the verdict, the cluster ID (if the suppression rule fired), and the budget-state (in-budget, compressed-digest, capped). The on-call's job is to look for one anomaly. An anomaly is anything that looks unfamiliar: an alert that fired on a customer who has been quiet for weeks, a verdict whose error.kind is one the team has not seen before, a budget-cap event on a tenant whose tier should not be hitting the cap, a verification status that has slipped from "live" to "pending re-handshake" without anyone noticing.
The discipline is the one-anomaly-per-day rule. The on-call is not asked to read the full notification stream and remember everything; they are asked to find one anomaly and either note it as benign in the dashboard's anomaly journal or escalate it. The benign annotation is one click and the escalation is one click and the dashboard records both with the actor and the timestamp. If the on-call finds zero anomalies on a quiet day, they record "zero anomalies" with one click and the routine is logged. The structural reason the rule is one-per-day rather than zero-or-many is that "find at least one" forces the on-call to engage with the stream rather than scrolling past it; the dashboard's anomaly journal is what the team reviews on Friday to see how the week's signal looked.
Weekly — sink-verification re-handshake and payload-shape audit
Every Friday, the on-call (or the verification-rotation owner on the larger deployments) runs the sink-verification re-handshake against every sink whose verification is older than the cadence (default cadence: 30 days). The re-handshake is the same handshake the verification went through originally; the dashboard's verification log records the re-handshake as a new row, marked "renewed," with a "renewed-by" actor. Sinks whose re-handshake fails are marked "pending" and the dashboard refuses to send any alert through them until a fresh successful handshake is recorded; the customer is notified through the in-product surface that their sink is pending re-verification.
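The refusal is a send-path guard, not a dashboard banner. A sketch, assuming an illustrative subset of the verification-log row from §8:
package alertrouter

import (
	"fmt"
	"time"
)

// VerificationRow is an illustrative subset of the verification-log row
// format from §8.
type VerificationRow struct {
	SinkID      string
	Status      string // pending | live | renewed | rotated | failed
	HandshakeAt time.Time
}

// sinkSendable is the send-path guard: a sink whose verification is not
// current is refused before any delivery is attempted.
func sinkSendable(v VerificationRow, now time.Time, cadence time.Duration) error {
	if v.Status != "live" && v.Status != "renewed" {
		return fmt.Errorf("sink %s: verification status %q, refusing to send", v.SinkID, v.Status)
	}
	if now.Sub(v.HandshakeAt) > cadence {
		return fmt.Errorf("sink %s: verification older than %s, pending re-handshake", v.SinkID, cadence)
	}
	return nil
}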
The same Friday slot is also when the payload-shape CI check is run against the latest payload-template-set, even if no PRs landed in the past week. The reason is that the payload-template-set may have been edited through the dashboard's UI rather than the operator-config repo (a path the architectural walkthrough deprecates but small teams sometimes still have); the weekly re-run is the safety net that catches a UI-edited template that bypassed the CI check on its way in. The weekly re-run is run from the operator-config repo against the live template set, and any failure produces a P2 ticket that the alert-rule reviewer (or the founder, on smaller deployments) addresses on Monday. The harness is the same four-assertion Go test from §4, item 7, pointed at the live template set instead of a PR diff.
Monthly — the synthetic-outage drill
Every month, on a calendar-pinned day (the first Wednesday of the month), the team runs the synthetic-outage drill against the drill tenant provisioned at week one. The drill is one of three flavours, rotating: (a) single-tenant outage — break the drill tenant's MCP endpoint, watch the alert reach the drill destinations within the SLA, watch the cross-tenant suppression rule stay silent on it, watch the per-tenant budget reset cleanly on the hour boundary; (b) cross-tenant cluster — break the drill tenant and four other synthetic tenants set up specifically for the cross-tenant drill (the drill window temporarily lowers the suppression floor in the cron config, since five tenants sit well under the production floor of 30), watch the suppression rule fire, watch the team's own compressed-mode digest arrive in the on-call channel, watch the digest reader role acknowledge it; (c) budget-exhaustion — break the drill tenant in a way that produces 30 alerts/hour against its 10-alert budget, watch the compressed-mode digest fire to the drill tenant's sink at the cap, watch the budget reset on the hour boundary, watch the audit log record the cap-event and the reset.
The drill takes 30 minutes if everything works and up to 2 hours if a defect is found. The drill harness produces a one-page receipt: which flavour ran, which sinks delivered, which assertions held, which assertions failed. The receipt is committed to the team's drill-log repo and signed by the on-call. A failed assertion is a P1; a successful drill is a one-line note in the team's channel. The drill's purpose is not to find defects in the alert router (defects should be caught by the CI check) but to maintain the team's calibration baseline for what a real outage looks like. The drill is what makes the team confident that when a real outage happens, the alert router will deliver, and that confidence is what allows the team to stop worrying about the alert router on the other 28 days of the month.
Quarterly — the sink-credential rotation drill
Every quarter, on a calendar-pinned day (the second Tuesday of the quarter's first month), the team runs the sink-credential rotation drill. The drill rotates the four classes of credential the alert router holds: the Slack-app token (rotated by the team via the Slack admin UI, with the new token pasted into the dashboard's secret store; the verification handshake re-runs automatically against the new token), the webhook signing secret (rotated by the team via the dashboard's webhook config UI, with the new secret pushed to the customer's webhook receiver via the customer's documented rotation channel), the email DKIM key (rotated by the team via the email provider's admin UI, with the new key pushed to the team's DNS provider), and the PagerDuty OAuth refresh token (rotated by the team via the PagerDuty admin UI, with the new refresh token pasted into the dashboard's secret store).
The four rotations run in series, not parallel. Each rotation produces a verification log row marked "rotated," and the dashboard refuses to send alerts through the rotated sink until the verification handshake has re-run successfully against the new credential. A rotation that fails the re-handshake is rolled back to the previous credential through the dashboard's secret-store versioning (the secret store keeps the previous N=2 versions for exactly this reason); the rollback is one click and is in the audit log. The rotation drill takes 2 hours if everything works and up to a day if one of the four sinks does not re-handshake on the new credential, in which case the team's on-call burden for the rest of the day is the rollback investigation. The drill's quarterly cadence matches the typical credential-rotation cadence the platform's customers expect; the alignment is what makes the customer's rotation experience match the platform's.
§6 · The contractor and external-handshake pattern
One of the under-discussed features of small-team operations is that not every role on the alert-router rotation has to be a full-time employee. The roles that show up specifically with five-or-fewer staff and that map well onto contractor or advisor relationships are: the fractional security advisor (the alert-rule reviewer's seat at quarter-time intensity), the part-time on-call cover (the second on-call hat at 30% intensity, typically a freelancer with operations experience who can take the rotation on the holiday weeks), and the third-party sink-rotation auditor (the SOC-2 reviewer or a paid security-engineering firm that audits the quarterly sink-credential rotation drill once a year and signs off on the receipts).
Each contractor pattern has a structural shape that mirrors the corresponding employee role from §3. The fractional security advisor lives in the alert-rule reviewer IdP group with full refusal rights on platform-wide rule changes, but a tag in the IdP group says "fractional, 12-month renewal" and the dashboard's session timeout is 8 hours instead of the employee default of 30 days. The part-time on-call cover lives in the on-call rotation IdP group with a tag that says "part-time, holiday cover only" and a calendar binding that activates the seat for the cover week and deactivates it after. The third-party sink-rotation auditor lives in the auditor IdP group (the same parked seat from the permission-model companion) with a scope-restricting tag that says "audit-the-rotation-drill" and a one-time receipt CSV export at the end of the audit.
The structural decision for each contractor role is the same: the role lives in the IdP, the dashboard refuses to grant the role permissions outside the IdP-bound scope, and the role's expiry is calendar-bound at week one. The contractor pattern is what makes the small-team operation survive the human reality that not every role on the rotation can be staffed full-time at week one and not every role needs to be.
§7 · Seven failure modes specific to small-team alert routing
The architectural walkthrough listed six multi-tenant-specific failure modes for the alert router (verified-sink rotation, suppression false positive, suppression false negative, verification-token replay, compressed-mode digest delivery to a flapping sink, payload-template drift). All six survive at small-team scale unchanged. What follows is seven additional failure modes that show up specifically when the team is small and the rotation is one or two humans deep. Each has a structural fix that does not depend on team discipline alone.
1. Founder-paging-themselves on a Saturday outage
The founder is on the on-call rotation in a one- or two-person deployment. On Saturday at 3am the alert router pages the founder, who acknowledges the page on their phone and then opens the dashboard on their laptop to investigate. The acknowledged page is in the audit log under the founder's name, the investigation is in the audit log under the founder's name, and the resolution (a tenant-scoped operation to mute the noisy customer's sink while the team investigates) is in the audit log under the founder's name as a tenant_scoped_operator action. The failure mode is not that the founder is on-call; it is that on the Saturday outage the founder is tempted to "log in as root and just fix it" because root is one click away.
The structural fix from the permission-model companion's failure mode #2 applies here unchanged: the dashboard refuses tenant-scoped actions from a root-operator session, the founder must explicitly switch to tenant_scoped_operator for any per-tenant alert-rule edit, and the role switch is captured in the audit log. The structural fix specific to the alert router is that the on-call escalation policy is structurally barred from routing to a root-operator-only account — the IdP group "on-call" cannot contain a member whose only role is root operator; the membership requires a tenant-scoped-operator role too. The combination of the two structural fixes is what makes the founder paging themselves on Saturday at 3am still produce an audit log that records a tenant-scoped action under the tenant-scoped role, not a root-only action that nobody reviewed.
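The IdP-side bar can be enforced mechanically, as a CI check against the group export or inside the sync script from §8. A sketch; the member shape and role names are illustrative (tenant_scoped_operator follows the post's naming):
package idpcheck

import "fmt"

// Member is an illustrative shape for an IdP group export.
type Member struct {
	Email string
	Roles []string
}

// validateOnCallGroup enforces the structural bar: every member of the
// on-call group must hold tenant_scoped_operator, so a root-operator-only
// account can never be a page target.
func validateOnCallGroup(members []Member) error {
	for _, m := range members {
		scoped := false
		for _, r := range m.Roles {
			if r == "tenant_scoped_operator" {
				scoped = true
				break
			}
		}
		if !scoped {
			return fmt.Errorf("on-call member %s lacks tenant_scoped_operator; refusing sync", m.Email)
		}
	}
	return nil
}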
2. Customer paste-a-webhook attack on a small workspace
A customer pastes a webhook URL into the alert-rule UI that points at a Slack-incoming-webhook for the team's own internal channel rather than the customer's own channel. The customer's intent is innocent (typo) or malicious (phishing-by-misdirection). On a small Slack workspace this attack is harder to spot than on an enterprise workspace because the team has fewer channels and the team's own incoming-webhook URLs are statistically more likely to be in the customer's clipboard.
The structural fix is the sink-verification handshake from the architectural walkthrough's §3. The TXT-record domain-of-origin handshake refuses any webhook URL whose domain does not have a TXT record proving the customer owns the domain. For Slack-incoming-webhook URLs the equivalent handshake is the inbound-proof-token handshake that the customer must paste back into the dashboard from a private message in the workspace. The handshake refuses any URL whose target workspace did not produce the inbound-proof-token via a private-message round-trip. On a small workspace the failure mode is that the team has historically put the inbound-proof-token in a public channel for convenience; the fix is to make the workflow strictly private-message-only, structurally enforced in the handshake. We name the handshake's refusal logic in the recipe section of the architectural walkthrough; we will not repeat it here.
3. Cross-tenant suppression false positive on a small tenant base
The cross-tenant suppression rule's threshold of ">10% of tenants paged" fires on a Tuesday morning when 10 of the platform's 80 tenants happen to be probing the same registry shard that just had a CDN hiccup. The suppression rule silences all 10 tenants behind one notice. Two of the silenced tenants are paying customers whose individual outages are real and unrelated to the cluster's root cause; the suppression rule has produced a false positive that hid two paying customers' outages.
The structural fix is the minimum-tenant-count floor from §4 of this post: the suppression rule fires only when both ≥10% of tenants are paged AND ≥30 distinct tenants are paged. The floor of 30 is what stops the rule from firing on a 10-tenant Tuesday-morning cluster. The floor is configurable; the week-1 default is 30, and the floor is reviewed quarterly with the platform's actual tenant count. Past 300 tenants the 10% condition overtakes the floor (10% of 300 is 30), so the threshold keeps scaling with the tenant base; below about 50 tenants the floor leaves the suppression rule effectively dormant, which is the right behaviour for a tenant base too small for cross-tenant clustering to be statistically distinguishable from noise.
4. Per-tenant budget set too generous on the free tier
The week-1 tier-default registry sets the Public (free) tier budget at 50 alerts/hour. A free-tier customer accidentally configures their alert rule to fire on every probe verdict instead of every transition; the customer's budget exhausts at 50 alerts in the first 90 seconds. The compressed-mode digest fires for the rest of the hour with one summary message to the customer's primary sink. So far the alert router did its job. The failure mode is that the platform silently absorbs the delivery cost of the 50 alerts/hour sent before the cap, and the team's own digest reader never realises the customer's rule is misconfigured, because the digest fires only once per hour at the cap.
The structural fix is to make the per-tenant budget event itself a P3 ticket on the team's side: the cap-event row in the audit log triggers a one-line message in the team's #alerts-cap channel, and the on-call's daily routine (§5) has the cap-channel as one of the streams it reads. The fix does not change the customer's experience (they still get the compressed-mode digest); it changes the team's visibility into "which customers are hitting their cap." The team's quarterly tier-default review (the alert-rule reviewer's review) consumes the cap-channel's last-90-day history when deciding whether to adjust the tier-default budgets up or down. The fix is what closes the loop between "the platform absorbs the cap-event silently" and "the team learns from the cap-event quarterly."
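A sketch of the hook, assuming a hypothetical CapEvent shape pulled from the audit-log writer; the channel name matches this section, everything else is illustrative:
package alertrouter

import (
	"fmt"
	"time"
)

// CapEvent is an illustrative shape for the audit log's cap-event row.
type CapEvent struct {
	TenantID string
	Tier     string
	Budget   int
	At       time.Time
}

// onBudgetCapEvent posts one line per cap event to #alerts-cap; the post
// callback stands in for whatever chat client the team runs.
func onBudgetCapEvent(post func(channel, line string), ev CapEvent) {
	post("#alerts-cap", fmt.Sprintf("cap: tenant=%s tier=%s hit %d alerts/hour at %s",
		ev.TenantID, ev.Tier, ev.Budget, ev.At.Format(time.RFC3339)))
}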
5. Sink-rotation drill collides with the support queue
The quarterly sink-credential rotation drill (§5) takes 2 hours if everything works. On the drill day the on-call has a queue of customer support tickets; the drill is calendar-locked to a specific day and the support queue does not pause. The on-call switches between the drill and the support queue every 15 minutes, makes a mistake on the second sink's rotation (pastes the new credential into the wrong tenant's sink), and the rollback adds another hour to the drill day. The failure mode is that the small team's calendar has no slack for the drill to run uninterrupted.
The structural fix is a same-day rule: no support tickets ship on the drill day, no PRs merge on the drill day, no platform-wide rule changes apply on the drill day. The day is calendar-locked in the team's shared calendar at week one and the rule is documented in OPERATIONS.md. The customer-facing impact is one-day SLAs slipping by one day on the drill day, which the team's published SLA accounts for. The structural reason the same-day rule exists is that the drill is the team's calibration baseline for the alert router's credential rotation; collisions between the drill and the support queue eat the calibration baseline and the team loses confidence in the rotation. The same-day rule is what makes the calibration baseline survive.
6. On-call channel depends on one phone
The founder's phone is the only PagerDuty-attached device on the platform. On Saturday the phone's battery dies, the founder is at a friend's wedding 200km from a charger, and the alert router's escalation policy fires for 30 minutes against the unreachable founder before timing out. The 30-minute timeout is too long for a credentialed-probe outage on the platform's largest customer.
The structural fix is a hardware-failover plan with a second phone or laptop and the recovery codes already in PagerDuty: the founder's secondary device (a tablet, an old phone, the founder's laptop) is also on the rotation, with a shared PagerDuty seat the founder owns rather than a separate seat. The shared seat means the alert lands on both devices; the secondary device is silent unless the primary's heartbeat is older than 5 minutes, at which point it un-mutes. The recovery codes for the founder's account are stored at two off-site locations (the same recovery-code locations from the permission-model companion's failure mode #1), so a lost primary device does not lock the founder out of PagerDuty for the duration of the outage. The structural fix is the same shape as the bus-factor fix from the permission-model companion; the alert-router companion just specialises it to the on-call channel's hardware dependency.
7. Compressed-mode digest no one reads
The cross-tenant suppression rule fires on a Tuesday morning. The team's own compressed-mode digest arrives in the on-call channel summarising the outage. The on-call is mid-stream on a different ticket and the Slack notification is dismissed without being clicked. The digest is the only place the team would have learned about the registry-wide outage; the team learns about it from a customer ticket six hours later.
The structural fix is the compressed-mode digest reader role from §4, item 8: the role is named on the rotation calendar, the digest's "acknowledged" click is required within 15 minutes of the digest landing, and the dashboard escalates an unacknowledged digest to the secondary on-call after 15 minutes. The escalation cascade is the same shape as the per-customer escalation policy; the digest reader's failure to acknowledge promotes the digest to the team's escalation tree. The structural reason the role is named on the calendar rather than left to "whoever is around" is that "whoever is around" produces the failure mode; an explicit named role with an explicit acknowledgement cadence is what turns the digest from a Slack notification into a structural signal.
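A sketch of the acknowledgement window; the ack signal and escalation callback are illustrative stand-ins for the dashboard's real acknowledgement machinery:
package alertrouter

import "time"

// watchDigestAck enforces the 15-minute acknowledgement window: an
// unacknowledged digest escalates to the secondary on-call. The
// read-receipt itself is logged by the acknowledgement handler.
func watchDigestAck(digestID string, acked <-chan struct{}, escalate func(digestID string)) {
	select {
	case <-acked:
		// acknowledged in time; nothing further to do here
	case <-time.After(15 * time.Minute):
		escalate(digestID)
	}
}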
§8 · Reference recipes
The recipes below are calibrated for a small team — short, copy-pasteable, and defensible against the failure modes named above. They are not full implementations; the architectural walkthrough has the full Go and SQL and Lua. These are the small-team operator's drop-in scaffolds.
The sink-verification handshake template (markdown — drop into OPERATIONS.md)
# OPERATIONS.md — sink-verification handshake template
## When to run
- New customer adds a sink (initiated automatically by the dashboard).
- Quarterly rotation drill (§5 of operating-per-tenant-alert-routing).
- Friday weekly re-handshake for sinks > 30 days old (§5, weekly routine).
## Slack — inbound-proof-token handshake
1. The dashboard generates a 32-byte random nonce.
2. The customer pastes the nonce into a private-DM-only Slack message
to a Slack-app the dashboard owns.
3. The Slack-app verifies the nonce, signs it with the workspace ID,
and posts the signed nonce back to the dashboard.
4. The dashboard verifies the signature against the workspace's public
key (cached from the OAuth handshake at workspace-install time).
5. Verification log row written: actor=customer, verified-by=dashboard,
approved-by=<reviewer-on-rotation, or self for one-person dep.>.
## Generic webhook — TXT-record domain-of-origin handshake
1. The dashboard generates a 16-byte random nonce.
2. The customer adds a TXT record at _alivemcp-verify.<their-domain>
containing the nonce.
3. The dashboard polls DNS for the TXT record at 5-minute intervals,
up to 24 hours.
4. On match the verification log row is written; on timeout the URL
is marked "verification-failed" and refused.
5. The TXT record can be removed after the verification log row is
marked "live"; the binding is to the domain, not to the TXT record.
## Email — per-recipient binding handshake
1. The dashboard generates a 32-byte random nonce and sends a one-line
email to the recipient address with the nonce as a magic-link.
2. The recipient clicks the magic-link; the click arrives from the
recipient's browser and carries the recipient's dashboard session cookie.
3. The dashboard verifies the cookie's tenant-binding (the recipient is
a member of a tenant the customer owns).
4. Verification log row written; binding is to the recipient address,
tenant ID, and a 12-month expiry.
5. Re-verification at 12 months is automatic; re-handshake required.
## PagerDuty — OAuth 2.0 PKCE flow
1. The dashboard initiates a PKCE flow against the customer's PagerDuty
account; the customer authorises the dashboard's app.
2. PagerDuty returns an authorisation code; the dashboard exchanges
the code for an access token + refresh token.
3. The refresh token is stored in the dashboard's secret store.
4. Verification log row written; binding is to the PagerDuty user ID,
the team ID, and the refresh-token expiry.
5. Re-verification on refresh-token expiry is automatic; re-handshake
required if the customer revokes the dashboard's app.
## Verification log row format
| column | type | notes |
|----------------|-----------|---------------------------------------------|
| sink_id | uuid | the sink being verified |
| sink_type      | enum      | slack, webhook, email, pagerduty            |
| status         | enum      | pending, live, renewed, rotated, failed     |
| verified_by | actor_id | the human who initiated the handshake |
| approved_by | actor_id | the second human who approved (if any) |
| handshake_at | timestamp | when the handshake completed |
| expires_at | timestamp | when re-verification is required |
| nonce_hash | sha-256 | for the Slack and webhook handshakes |
| audit_log_link | uuid | link to the audit log row for this action |
The IdP-bound on-call rotation (bash + jq + curl)
#!/usr/bin/env bash
# idp-to-pagerduty-sync.sh — sync the IdP "on-call" group to PagerDuty.
# Run from cron every 15 minutes; idempotent.
set -euo pipefail
IDP_TOKEN="${IDP_TOKEN:?missing}"
PAGERDUTY_TOKEN="${PAGERDUTY_TOKEN:?missing}"
PAGERDUTY_SCHEDULE_ID="${PAGERDUTY_SCHEDULE_ID:?missing}"
# 1. Pull the IdP "on-call" group's current members.
idp_members=$(curl -s \
-H "Authorization: Bearer ${IDP_TOKEN}" \
"https://idp.example.com/api/groups/on-call/members" \
| jq -r '.members[].email' | sort)
# 2. Pull the PagerDuty schedule's current members.
pd_members=$(curl -s \
-H "Authorization: Token token=${PAGERDUTY_TOKEN}" \
"https://api.pagerduty.com/schedules/${PAGERDUTY_SCHEDULE_ID}" \
| jq -r '.schedule.users[].email' | sort)
# 3. Diff. Members in IdP but not PD = add. Members in PD but not IdP = remove.
to_add=$(comm -23 <(echo "${idp_members}") <(echo "${pd_members}"))
to_remove=$(comm -13 <(echo "${idp_members}") <(echo "${pd_members}"))
# 4. Apply diffs. Endpoint shapes follow PagerDuty's REST conventions;
#    adjust to your PD API version (some versions manage schedule
#    membership through schedule layers rather than a /users sub-resource).
for email in ${to_add}; do
  user_id=$(curl -s \
    -H "Authorization: Token token=${PAGERDUTY_TOKEN}" \
    "https://api.pagerduty.com/users?query=${email}" \
    | jq -r '.users[0].id')
  # Guard: the IdP member may not have a PagerDuty account yet; skip and
  # let the next run pick them up once the PD invite is accepted.
  if [ -z "${user_id}" ] || [ "${user_id}" = "null" ]; then
    echo "$(date -Iseconds) idp-to-pd: no PD user for ${email}, skipping" >&2
    continue
  fi
  curl -s -X POST \
    -H "Authorization: Token token=${PAGERDUTY_TOKEN}" \
    -H "Content-Type: application/json" \
    --data "{\"user\":{\"id\":\"${user_id}\",\"type\":\"user_reference\"}}" \
    "https://api.pagerduty.com/schedules/${PAGERDUTY_SCHEDULE_ID}/users"
done
for email in ${to_remove}; do
  user_id=$(curl -s \
    -H "Authorization: Token token=${PAGERDUTY_TOKEN}" \
    "https://api.pagerduty.com/users?query=${email}" \
    | jq -r '.users[0].id')
  if [ -z "${user_id}" ] || [ "${user_id}" = "null" ]; then
    continue
  fi
  curl -s -X DELETE \
    -H "Authorization: Token token=${PAGERDUTY_TOKEN}" \
    "https://api.pagerduty.com/schedules/${PAGERDUTY_SCHEDULE_ID}/users/${user_id}"
done
# 5. Log the diff.
echo "$(date -Iseconds) idp-to-pd: added=${to_add:-none} removed=${to_remove:-none}" \
| tee -a /var/log/idp-to-pagerduty.log
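One way to pin the script to the 15-minute cadence named in its header; the paths, user, and env file are illustrative, and stdout goes to a separate cron log because the script already tees its own diff line:
# /etc/cron.d/idp-to-pagerduty — tokens sourced from a root-owned env file
*/15 * * * * ops . /etc/idp-sync.env && /usr/local/bin/idp-to-pagerduty-sync.sh >> /var/log/idp-to-pagerduty-cron.log 2>&1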
The synthetic-outage drill harness (Go)
// drill_test.go — monthly synthetic-outage drill against the drill tenant.
// Run via: go test -run TestDrill_SingleTenantOutage -tags=drill -v
// Drill tenant ID is loaded from DRILL_TENANT env var.
package drill
import (
"context"
"encoding/json"
"fmt"
"os"
"testing"
"time"
)
func TestDrill_SingleTenantOutage(t *testing.T) {
tenantID := os.Getenv("DRILL_TENANT")
if tenantID == "" {
t.Fatal("DRILL_TENANT not set")
}
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
defer cancel()
h := newDrillHarness(t, tenantID)
h.recordReceiptStart("single-tenant-outage")
// 1. Break the drill tenant's MCP endpoint.
if err := h.injectOutage(ctx); err != nil {
t.Fatalf("inject outage: %v", err)
}
h.recordStep("inject", "ok", time.Now())
	// 2. Wait up to 90s for the alert to reach the four drill destinations,
	// and fail the drill explicitly if the delivery SLA is missed.
	deadline := time.Now().Add(90 * time.Second)
	var deliveries map[string]bool
	for time.Now().Before(deadline) {
		deliveries = h.checkDeliveries(ctx)
		if all(deliveries) {
			break
		}
		time.Sleep(2 * time.Second)
	}
	if !all(deliveries) {
		t.Errorf("delivery SLA missed: %v", deliveries)
	}
	h.recordStep("delivery", deliveryStatus(deliveries), time.Now())
// 3. Assert the cross-tenant suppression rule did NOT fire.
suppressionRow, err := h.checkSuppressionLog(ctx, time.Now().Add(-5*time.Minute))
if err != nil {
t.Fatalf("check suppression log: %v", err)
}
if suppressionRow != nil {
t.Errorf("suppression rule fired on a single-tenant outage: %+v", suppressionRow)
}
h.recordStep("suppression-silent", "ok", time.Now())
	// 4. Assert the per-tenant budget reset cleanly on the hour boundary.
	// Checked only when the boundary is within 5 minutes: a t.Skip here
	// would abort the test before step 5 restores the drill tenant, so a
	// missed window is recorded in the receipt instead.
	nextHour := time.Now().Truncate(time.Hour).Add(time.Hour)
	if time.Until(nextHour) <= 5*time.Minute {
		time.Sleep(time.Until(nextHour) + 30*time.Second)
		budget, err := h.checkTenantBudget(ctx, tenantID)
		if err != nil {
			t.Fatalf("check tenant budget: %v", err)
		}
		if budget.Used != 0 {
			t.Errorf("budget did not reset on hour boundary: used=%d", budget.Used)
		}
		h.recordStep("budget-reset", "ok", time.Now())
	} else {
		h.recordStep("budget-reset", "skipped: drill started too early in the hour", time.Now())
	}
// 5. Restore the drill tenant.
if err := h.restoreTenant(ctx); err != nil {
t.Fatalf("restore tenant: %v", err)
}
h.recordStep("restore", "ok", time.Now())
// 6. Write the receipt.
receipt := h.finalReceipt()
receiptJSON, _ := json.MarshalIndent(receipt, "", " ")
fmt.Printf("DRILL RECEIPT\n%s\n", receiptJSON)
if !receipt.AllPassed() {
t.Fatalf("drill failed: %+v", receipt.Failures())
}
}
The sink-credential rotation runbook (markdown)
# OPERATIONS.md — sink-credential quarterly rotation runbook
## Calendar
Second Tuesday of each quarter's first month.
Calendar-locked: no support tickets, no PRs, no platform-wide rule changes.
Same-day rule documented in OPERATIONS.md and pinned in the team channel.
## Order of operations (4 sinks, in series, NOT parallel)
### Step 1 — Slack-app token (~30 min)
- [ ] In Slack admin UI, regenerate the app's signing secret + bot token.
- [ ] In dashboard secret store, add the new secret as version N+1.
- [ ] In dashboard, click "rotate Slack secret" — refuses if verification
handshake does not re-handshake on new secret.
- [ ] Verification log row written with status=rotated, verified-by=<you>.
- [ ] If the re-handshake fails: roll back to version N, file a P2, page the founder.
### Step 2 — Webhook signing secret (~30 min)
- [ ] In dashboard webhook config UI, regenerate the platform's webhook
signing secret. New secret is now version N+1 in the secret store.
- [ ] Push the new secret to each customer's documented rotation channel
(the email address each customer registered for credential rotation).
- [ ] Customers update their webhook receiver to accept signatures from
either version N or N+1 for the next 30 days (a receiver-side Go sketch
follows this runbook).
- [ ] At day 31 the dashboard refuses to sign with version N.
- [ ] Verification log row written for each customer's webhook sink.
### Step 3 — Email DKIM key (~30 min)
- [ ] In email provider's admin UI, generate a new DKIM key.
- [ ] Push the new public-key DNS record to the team's DNS provider.
- [ ] Wait for DNS propagation (10-30 min depending on TTL).
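- [ ] Optional propagation check before the rotate click (selector and
domain are placeholders for your provider's values):
`dig +short TXT <selector>._domainkey.<domain>` should return the new
public key.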
- [ ] In dashboard, click "rotate email DKIM" — refuses if a fresh
delivery test through the email sink does not pass DKIM check.
- [ ] Verification log row written; old DNS record retained for 30 days
to handle in-flight queued emails signed with old key.
### Step 4 — PagerDuty OAuth refresh token (~30 min)
- [ ] In PagerDuty admin UI, revoke the dashboard's existing OAuth grant.
- [ ] In dashboard, click "re-authorise PagerDuty" — initiates PKCE flow.
- [ ] Customer authorises (or the team's own service-account authorises
for the team's own PagerDuty integration).
- [ ] Dashboard receives new refresh token; stored in secret store as
version N+1.
- [ ] Verification log row written.
## Receipt
- [ ] Drill receipt committed to drill-log/<quarter>-rotation.md
- [ ] Receipt signed by the rotation owner and the founder
- [ ] One-line note posted to the team's channel:
"Q<n> rotation drill ran. 4/4 sinks rotated. No P1 raised."
- [ ] If a P1 was raised: receipt includes the rollback details and
the next-day re-attempt plan.
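Step 2's 30-day overlap puts one concrete requirement on each customer's webhook receiver: accept a signature minted under either version N or version N+1 of the signing secret until the cut-over. Here is a minimal receiver-side sketch, assuming HMAC-SHA256 over the raw request body, hex-encoded in an X-Alert-Signature header — the header name, encoding, and scheme are assumptions for illustration, not the platform's documented contract:
// dual_secret_receiver.go — customer-side webhook receiver that accepts
// either signing-secret version during the 30-day rotation overlap.
package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "io"
    "log"
    "net/http"
)

// Version N stays in this list until day 31, then gets removed.
var acceptedSecrets = [][]byte{
    []byte("whsec_version_N"),  // old — drop after the 30-day window
    []byte("whsec_version_N1"), // new
}

// verify reports whether sigHex is a valid HMAC-SHA256 of body under any
// currently accepted secret version. hmac.Equal compares in constant time.
func verify(body []byte, sigHex string) bool {
    sig, err := hex.DecodeString(sigHex)
    if err != nil {
        return false
    }
    for _, secret := range acceptedSecrets {
        mac := hmac.New(sha256.New, secret)
        mac.Write(body)
        if hmac.Equal(sig, mac.Sum(nil)) {
            return true
        }
    }
    return false
}

func main() {
    http.HandleFunc("/alerts", func(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil || !verify(body, r.Header.Get("X-Alert-Signature")) {
            http.Error(w, "bad signature", http.StatusUnauthorized)
            return
        }
        w.WriteHeader(http.StatusAccepted) // hand off to the alert pipeline
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}
Dropping version N on day 31 is then a one-line change on the customer side, matching the day the dashboard stops signing with it.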
Where this fits — alert-router companion
This post is the alert-router-side companion to the architectural walkthrough. The architectural walkthrough described the five layers — sink-ownership verification, tenant-scoped configuration with cross-tenant write protection, cross-tenant suppression, per-tenant alert budgets, payload-shape boundaries — that make a single alert router safe across many tenants. This post described how to actually operate that architecture with one to five humans on the team — the headcount-to-alert-ownership mapping, the week-1 setup checklist, the daily and weekly and monthly and quarterly drill cadence, the contractor and external-handshake pattern, and seven small-team-specific failure modes with structural fixes. Together they form the two halves of how a small multi-tenant MCP-monitoring team operates the alert-router side of the stack. The architectural side and the operational side reinforce each other; neither stands alone.
The small-team-companion arc is now two posts deep. The first companion (post #14) paired with the operator-dashboard architectural walkthrough and described the four-layer permission model in operation. This post pairs with the per-tenant alert routing architectural walkthrough and describes the five-layer alert router in operation. Two more companions are scheduled: the small-team companion to the shared-state archiver — long-term retention, GDPR delete fan-out, archive-snapshot rotation — when operated with five-or-fewer staff, and the small-team companion to the multi-tenant probe collector — supervisor, workers, queues, secret store, coalescer — when operated with five-or-fewer staff.
The next deliverable after the small-team-companion arc is the Q3 2026 registry audit, landing mid-July 2026. The audit re-runs every probe from all five regions in parallel through the multi-tenant collector designed in post #10, with verdicts archived through the system designed in post #12, with cross-tenant suppression measured against the cluster log designed in post #11, and with operator actions during the audit window logged in the audit log designed in post #13. The audit will report bucket-by-bucket movement vs the Q2 baseline — including how the cross-tenant alert-suppression rule from this post's architectural counterpart behaved on the registry-wide outages observed during the audit window; whether the credentialed-probe rollout from post #6 shrank the auth-walled 16.8% bucket as expected; whether the schema-drift detector from post #4 caught the same 7.1%/48h drift rate or a different one; and the first end-to-end pass through the archiver designed in post #12 at registry scale. Between now and then the small-team-companion arc continues — the practical guides that pair with each architectural walkthrough, calibrated for the team that has actually been doing the work the post describes.
Further reading on AliveMCP
- Per-tenant alert routing at scale — the architectural reference this post operationalises.
- Operating the four-layer permission model with five staff or fewer — the first small-team companion; this post's predecessor in the arc.
- Operator dashboard walkthrough — the operator-architecture side of the scale sub-series.
- Multi-tenant MCP probe collector — the write side of the scale sub-series; the verdict-minute Redis the alert router consumes.
- Shared-state archiver walkthrough — the persistence side; the alert-history's long-term home.
- State of the MCP Registry — Q2 2026 — the audit baseline the next quarterly audit will measure against.
- Why MCP servers die silently — 7 failure modes — the failure-class taxonomy the alert payloads encode.
- JSON-RPC health checks vs HTTP probes — the protocol-aware probe whose verdicts the alert router emits on.
- Schema drift in MCP tool definitions — the canonical-JSON SHA-256 hash that the verification handshake reuses.
- MCP authentication primer — the four-posture decision tree that the credentialed-sink handshakes inherit from.
- Running a credentialed MCP health check, end to end — the per-region probe atom whose outage events the alert router fans out.
- Multi-region MCP probe deployment — the geographic-redundancy wrapper.
- Public status page for an MCP server — the human-facing surface the alert router does not have to send to.
- MCP uptime API and embeddable badge — the read-side surface the small-team's daily review never has to touch.
- MCP server uptime monitoring — the whole stack
- MCP server Slack alerts
- MCP server health check — probe sequence explained
- MCP monitoring tool
- MCP endpoint not responding
- Check if your MCP server is alive
- UptimeRobot vs AliveMCP — a direct comparison
Want to be told before your MCP server dies silently?
AliveMCP probes every public MCP endpoint every 60 seconds, fans the verdict through a multi-tenant alert router that survives sink-ownership attacks and registry-wide outages, archives the canonical history through an audited operator console, and gives your own staff a self-serve surface for alert sinks, retention preferences, and Article 17 requests — all from the same multi-tenant stack described across the posts of the scale sub-series and operated by the small-team routines this post and its predecessors walk. Public servers are free; private servers start at $9/mo.