Deep dive · 2026-04-30 · Scale sub-series — collector companion · Closes the small-team-companion arc

Operating the multi-tenant probe collector with five staff or fewer

The multi-tenant MCP probe collector walkthrough built six new layers on top of the single-tenant probe stack: per-tenant worker isolation with cgroup CPU/memory caps and a 50-second wall-clock cap that the supervisor enforces with SIGKILL, KMS-envelope-encrypted credential storage scoped per tenant with a 5-minute signed IAM token mounted into the worker container at start-up, a fan-out architecture that turns a cron-driven scheduler into a per-region work-queue dispatcher with a worker pool sized to the 95th-percentile load, per-tenant rate limiting at the scheduler tied to the billing tier so Public reads from global probes only and Enterprise gets dedicated workers, shared state with a tenant-prefixed key namespace and a verdict-minute Lua coalescer that turns five region-writes per server-minute into one, and billing-aware probe paths that filter the read-side surface by what the tier paid for. That post answered "what does the collector look like at scale." This post answers a different question: "how does a small team actually run that collector every day, when the supervisor is one binary on one box, the founder is also the on-call for runaway tenants and the KMS-key-grant auditor, the secret store is one Postgres table the founder once configured and has never since rotated grants on, and the queue-depth alert has to fire correctly the first time a Q3-audit re-run pushes 200,000 Redis writes per minute through a worker pool that has never been touched since week one." The six layers from the architectural walkthrough survive the small-team setting unchanged — the threat model is the same, the noisy-neighbour blast radius is the same, the credential-leakage exposure is the same — but the human routines that operate them are very different from the routines a fifty-person platform team runs. This post is the operator's guide. It maps headcount to collector ownership for one-, two-, three-, four-, and five-person deployments, walks the week-1 setup checklist that turns the collector architecture into a working deployment with one supervisor and a small worker pool on a single VPS, sketches the daily and weekly and monthly and quarterly drills that keep the queue depth, the KMS-grant inventory, the verdict-minute coalesce health, and the supervisor's SIGKILL discipline honest, names the seven failure modes that show up specifically when the collector is operated by a small team, and gives the reference recipes — the small-team supervisor with cgroup CPU/memory caps, the envelope-encrypted-Postgres-column secret-store recipe, the IdP-bound KMS-grant rotation script, the queue-depth alert with a small-team rate window — that turn the collector into something a one-to-five-person team can actually run without the noisy-neighbour problem collapsing the platform on a Saturday afternoon. This post closes the small-team-companion arc; the next post pivots back to primary research with the Q3 2026 registry audit.

TL;DR

A five-or-fewer team operating a multi-tenant probe collector is not a smaller version of an enterprise platform team; it has its own shape and its own failure modes. The six-layer collector architecture from the previous post still applies — worker-as-security-boundary tenant isolation, KMS-envelope-encrypted per-tenant secret store, per-region work-queue fan-out, per-tenant rate limiting at the scheduler, tenant-prefixed shared state with a verdict-minute Lua coalescer, billing-aware probe paths — but the way the layers map onto humans, the cadence at which they are exercised, and the failure modes the team has to watch for change with team size. The headcount-to-collector-ownership mapping is the first decision: in a one-person deployment the founder owns the supervisor's tenant manifest, the KMS-grant inventory, the queue-depth alert, the verdict-minute coalescer's health, the per-region worker pool's deploy cadence, and the runaway-tenant on-call seat by virtue of being the only human; in a two-person deployment the ops hire takes the queue-depth alert and the per-region worker rotation and the founder retains the tenant manifest and the KMS-grant inventory; in a three-person deployment the third slot is the secret-store reviewer who exists structurally to refuse the founder's "just store this credential in a config file, it's small" requests; the four- and five-person deployments add a dedicated KMS-grant rotation owner and a third-party security advisor who runs a quarterly KMS-grant audit. The week-1 setup is the minimum-viable boundary: choose the secret store between the three small-team-viable options (envelope-encrypted Postgres column, age-encrypted file in the operator-config repo, hosted secret store like AWS Secrets Manager / GCP Secret Manager — the architectural walkthrough's KMS pattern but with the small-team trade-offs each implies), set the supervisor's rate-limit knobs that matter on day one (per-tenant probe budget per minute, per-tenant CPU and memory caps tied to the billing tier, the 50-second wall-clock cap with the supervisor's SIGKILL discipline, the per-region queue's BLPOP timeout), schedule the per-region worker rotation that survives a small team's deploy cadence (rolling restart on the 1st and 15th of each month with a per-region staggered window so no two regions deploy simultaneously), calibrate the queue-depth alert for the team's actual probe volume (a small team's pool size is much smaller than the architectural reference's, so the alert thresholds are correspondingly smaller and the signal-to-noise ratio is what matters), stand up the synthetic noisy-neighbour drill tenant whose only job is to be a deliberately-runaway tenant once a quarter, configure the supervisor's audit-log row format so every SIGKILL is in the audit log with the killed worker's CPU and memory and wall-clock state at the moment of kill, lock the IdP-bound KMS-grant rotation cadence (one grant audit per quarter, on the third Wednesday, with the auditor signing the receipt). The daily routine is one line: the on-call reads the queue-depth dashboard end-to-end, looks for one anomaly, and either notes it as benign or escalates. The weekly routine is the supervisor's SIGKILL log review (which tenants got killed for wall-clock overrun this week, are any of them the same tenant repeatedly) and the per-region worker pool health review (each of the five regions has its expected pool size, and no region's pool has quietly decayed because a missing worker was never replaced).
The monthly routine is the per-region worker rotation: deploy the new worker image to one region at a time, verify it for one full probe minute before the next region rolls. The quarterly routine is the KMS-grant audit (every grant in the inventory is scoped to one tenant ID and one container instance and has a 5-minute expiry; no grant has been added without an audit-log row; no rolled-off contractor still has an active grant) and the synthetic noisy-neighbour drill (a deliberately runaway tenant probe-job is enqueued; the supervisor must SIGKILL it within the wall-clock window; the read-side API must serve a "probe timed out" partial verdict for that tenant within 60 seconds; no other tenant is affected). Seven small-team failure modes with structural fixes — the runaway tenant on a Saturday afternoon when the founder is asleep (the supervisor's wall-clock SIGKILL is the structural defence and the supervisor's restart loop has to be self-healing without paging the founder), the secret-store cache poisoning that the small team cannot catch in code review alone (the OAuth-discovery cache key is structurally bound to the tuple (tenant_id, server_slug) and the unit test asserts the binding on every change), the per-region worker rotation that misses a region (the deploy script reads the canonical region list from the tenant manifest and refuses to declare success if any region's pool size is short), the supervisor SIGKILL that left a half-decrypted credential on the host (the worker mounts credentials into a tmpfs that is unmapped on supervisor SIGKILL via a cgroup release notifier), the queue-depth alert calibrated for the worst-case minute (the alert is a percentile and a rate-of-change rather than an absolute, so a Q3-audit re-run does not page every minute), the KMS grant that was never revoked when a contractor rolled off (the grant inventory is a checked-in YAML in operator-config that the IdP-bound rotation cron compares to the live KMS state on every cron run), and the verdict-minute coalesce race that surfaces only on the first cross-region partition (the Lua coalescer is unit-tested against the cross-region-partition adversary and the test runs in CI on every coalescer change). The recipe section sketches the small-team supervisor with cgroup CPU/memory caps, the envelope-encrypted-Postgres-column secret-store recipe, the IdP-bound KMS-grant rotation script, and the queue-depth alert with a small-team rate window in copy-pasteable form. This post is the practical companion to the multi-tenant collector architectural walkthrough; together they describe both halves of how a small multi-tenant MCP-monitoring team operates the write side of the stack. The small-team-companion arc closes here; the next post is the Q3 2026 registry audit.

Why five-or-fewer changes the collector

The six layers in the architectural walkthrough are shaped by the multi-tenant threat model, not by team size. The defences against a noisy-neighbour tenant whose probe hangs for 60 seconds and pegs a worker, against an ambient-credentials leak via a logging line, against a registry rate-limit blowback that pages every tenant in the same minute, against a verdict-minute coalesce race that flips a tenant's colour for 60 seconds — all of those defences sit in the supervisor's SIGKILL discipline and the KMS-envelope encryption and the registry-deduplicating crawl and the Lua coalescer's atomic two-of-N rule, and they are the same whether the team is one person or fifty. What changes with team size is who is paged when the supervisor SIGKILLs a tenant at 03:14 UTC, who reviews the SIGKILL log on Friday morning, who runs the quarterly KMS-grant audit, who owns the per-region worker rotation, and how the secret-store reviewer hat works when there are three humans on the team and one of them is the founder.

Three things are different at small scale and they cascade. The first is the runaway-tenant pager. The architectural walkthrough has a supervisor that SIGKILLs a worker after the 50-second wall-clock cap and a partial verdict that surfaces in the read-side API as "probe timed out". With fifty operators on a tiered on-call rotation, "worker_killed at 03:14 UTC for tenant X server Y" is a P3 ticket that is reviewed by the on-call the next morning. With one operator the same kill arrives as a Slack DM that the founder snoozes; the same tenant continues to schedule probes that get killed every minute for the next eight hours; the founder's Slack inbox accumulates 480 kill notifications by lunch on Saturday. The structural answer is to make the runaway-tenant condition self-quenching at the supervisor level — the supervisor automatically reduces a tenant's probe cadence after the third SIGKILL within an hour and surfaces the throttle as a customer-visible note in the dashboard's tenant detail page; the founder is paged only on the first SIGKILL per tenant per day, not on every kill. We will name this concretely in §5.

The second is the KMS-grant inventory drift. The architectural walkthrough fixes the per-tenant KMS keys at one per tenant, with the worker mounting only its own tenant's grant via a 5-minute signed IAM token. With fifty operators and a dedicated platform-security engineer, the KMS-grant inventory is reconciled with the live KMS state on a quarterly automated cadence and any drift is a P2 ticket. With one operator, the KMS-grant inventory is whatever the founder remembers; a contractor who needed a temporary grant six months ago to debug an OAuth-discovery problem still has it; the live KMS state has 23 grants while the founder's mental model has 8. The structural answer is to make the grant inventory a checked-in YAML file in the operator-config repo (one row per grant with the grantee's IdP user ID, the tenant ID, the expiry, the justification ticket reference) and to make the IdP-bound KMS-grant rotation cron compare the YAML to the live KMS state on every run; any drift is a P2 ticket that the founder addresses on Monday. We will name this concretely in §4.

The third is the secret-store reviewer seat that does not exist. With one operator the secret-store reviewer hat is a fiction. The temptation to write a tenant credential to a .env file "just for this debug session" is large, so the structural defence has to live in the tooling rather than in the team's review cadence. The right way to handle this is not to fake a reviewer function but to be explicit: the deploy script refuses to start a worker with any environment variable whose name matches the credential blacklist (*_TOKEN, *_KEY, *_SECRET, *_PASSWORD) other than the signed IAM token; the supervisor's pre-flight check refuses to start a worker if any tenant credential is found in any plain-text-readable surface other than the encrypted Postgres column; the supervisor logs the pre-flight check's outcome on every worker start. The defences live in the tooling because they cannot live in the human review cadence at one operator; a sketch of the blacklist check follows. We will name this concretely in §3.
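A minimal sketch of the blacklist half of that pre-flight, assuming the worker's environment is assembled as a list of KEY=VALUE strings before start-up; the function and variable names here are illustrative, not the shipped pre-flight:

// preflight.go — refuse to start a worker whose environment carries any
// credential-shaped variable other than the signed IAM token (sketch).
package supervisor

import (
    "fmt"
    "strings"
)

var credentialSuffixes = []string{"_TOKEN", "_KEY", "_SECRET", "_PASSWORD"}

// checkWorkerEnv returns an error if any variable name matches the
// credential blacklist, except the one signed IAM token the worker needs.
func checkWorkerEnv(env []string) error {
    for _, kv := range env {
        name, _, _ := strings.Cut(kv, "=")
        if name == "SIGNED_IAM_TOKEN" {
            continue // the only credential a worker is allowed to carry
        }
        for _, suffix := range credentialSuffixes {
            if strings.HasSuffix(name, suffix) {
                return fmt.Errorf("pre-flight: %s matches the credential blacklist", name)
            }
        }
    }
    return nil
}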

None of those three are reasons to abandon the six-layer collector. They are reasons to operate it deliberately. The rest of this post walks how.

Mapping headcount to collector ownership

The decision of who owns which piece of the collector is the most consequential staffing choice the deployment makes after the four-layer permission model from the first companion post, the five-layer alert router from the second companion post, and the five-layer archiver from the third companion post. The right answer depends on team size and on what other systems your team already uses (your IdP, your KMS provider, your queue runtime). The five-team-size mapping below is what we have run; treat it as a starting point that you adapt to your team's actual shape, not as a rule.

One-person deployment — the founder is the supervisor

The single operator owns the supervisor's tenant manifest, the KMS-grant inventory, the queue-depth alert, the verdict-minute coalescer's health, the per-region worker pool's deploy cadence, and the runaway-tenant on-call seat. The secret-store reviewer seat from §2 exists on paper but is unstaffed; we provision the reviewer account, leave it parked at zero permissions, and use that parked seat to grant secret-store-review rights to a part-time security advisor or a SOC-2 reviewer when the team has one to give the seat to. The KMS-grant rotation owner role is also held by the founder; the IdP-bound rotation cron's output is reviewed by the founder every quarter and the founder's review is the receipt the SOC-2 auditor reads.

The discipline that prevents the collector from collapsing into "founder writes the supervisor once, mounts a credential by hand once, and prays" is explicit calendar-binding on the dashboard, the same shape as the calendar-binding from the archiver companion. The collector-config UI in the dashboard has a routine selector at the top of the navigation; the default is "no routine selected." To run the daily queue-depth review the founder selects the daily-queue-depth routine; to run the quarterly KMS-grant audit the founder selects the quarterly-kms-grant-audit routine and the click is gated by an MFA prompt with the hardware token, with the live KMS state and the YAML inventory pre-loaded. Every routine completion is in the audit log with the routine name and the actor and a JSON summary of the routine's outcome (queue depth at the time of review, SIGKILL count over the prior 24 hours, KMS grants reconciled). Every supervisor SIGKILL is in the audit log with the killed worker's CPU and memory and wall-clock state at the moment of kill, the tenant ID, the server slug, the regions affected, and the partial-verdict outcome.

One non-obvious choice for the one-person deployment: the synthetic noisy-neighbour drill tenant is provisioned at week one, even though the deployment has zero paying tenants. The reason is that the day a real runaway-tenant condition occurs — a customer's MCP server hangs for 60 seconds on every probe, or a customer's probe-credential rotation produces a worker that loops on a malformed credential — the founder needs to know that the supervisor's SIGKILL discipline actually works, that the wall-clock cap fires within a 50-second window, that the partial-verdict surfaces as "probe timed out" in the read-side API within 60 seconds, that the runaway tenant's throttle kicks in after the third SIGKILL within an hour, and that no other tenant is affected during the kill. Provisioning the synthetic noisy-neighbour drill tenant at week one means the SIGKILL discipline has been exercised, the founder has a calibration baseline for "what does a real runaway tenant look like in this dashboard," and the drill's output is the receipt the founder hands to the SOC-2 auditor when the auditor asks for evidence that the platform's tenant-isolation claims are real. None of that is true on day one if the synthetic drill tenant is added the morning of the first real runaway-tenant condition — and a runaway tenant on a Saturday afternoon does not pause for "we are still configuring the drill harness."

Two-person deployment — founder and first ops hire

The founder retains the tenant manifest, the KMS-grant inventory, and the secret-store reviewer hat. The first ops hire takes the queue-depth alert, the per-region worker pool's deploy cadence, the supervisor's SIGKILL log review, and the day-to-day collector-worker monitoring. The secret-store reviewer seat is still parked, for the same reason as the one-person deployment — the day a security advisor or a SOC-2 reviewer arrives, the seat is the first thing they need.

The two-person deployment introduces routine-rotation discipline on the queue-depth alert, even though the rotation has only one human on the daily slot. The first ops hire is the daily-queue-depth reader; the founder is the secondary check on the weekly SIGKILL log review. The founder is not on the day-to-day queue-depth rotation. If the first ops hire is on holiday, the founder picks up the queue-depth alert but does so by switching to the daily-queue-depth routine for the duration of the cover, the same way the founder picks up tenant-scoped support actions during a cover (per the permission-model companion's rotation discipline). The routine switch is one click and is captured in the audit log. The founder does not "log in as root and just check the queue-depth dashboard" during the cover, because the structural defence — the dashboard refuses tenant-scoped actions from a root-operator session — holds.

The single most important thing the two-person deployment does is elect a per-region worker rotation reviewer. With one operator, the founder reviews their own deploy script's output and that is the routine. With two, there is a temptation for the first ops hire to deploy a new worker image without the founder reviewing the per-region pool-size diff first, and the per-region rotations silently become unwitnessed. The fix is to put the per-region rotation review on the rotation, not on the operator: every rotation is approved by a different human from the one who scheduled it, the dashboard refuses to apply the rotation until the second human has approved the per-region pool-size diff, and the audit log records both rows. The rotation is what keeps the per-region rotations real.

Three-person deployment — adding the secret-store reviewer

The founder retains the tenant manifest and the KMS-grant inventory. The first ops hire is on the daily-queue-depth rotation. The third hire — call them the secret-store reviewer — is the structural counterweight whose only job is to refuse the founder's "just store this credential in a config file, it's small" requests. The secret-store reviewer seat moves from "parked" to "staffed by an internal hire or a quarterly-rotated external advisor." The secret-store reviewer is staffed before the platform reaches 50 paying tenants because that is roughly the size at which the founder's secret-store changes start affecting other people's customers in non-obvious ways (a credential format change that breaks the credentialed-probe sequence from the credentialed walkthrough, a KMS-key-grant scope expansion that lets a worker read a different tenant's credentials, a Postgres row-security policy relaxation that lets the worker join across tenants).

The secret-store reviewer is not a senior hire; they are a discipline. The role can be the third human on the team regardless of seniority, provided they have refusal rights structurally enforced in the dashboard: any change to the secret-store schema (a new credential kind, a new KMS-key scope, a new IdP-bound IAM role with KMS access, a new Postgres row-security policy) goes through a pull-request-style two-stage flow where the founder proposes the change, the secret-store reviewer reviews the diff against the schema-definitions repo, and the dashboard refuses to apply the change without the reviewer's MFA-gated approval. The audit log records both the proposal and the approval. The reviewer's job is not to be smarter than the founder; it is to be a different pair of eyes whose first job is to ask "does this change widen the credential-leakage blast radius" and refuse if the answer is unclear.

Four-person deployment — KMS-grant rotation owner and per-region rotation lead

The fourth hire is the KMS-grant rotation owner and per-region rotation lead. With three or fewer staff the founder is the only human who has ever logged into the KMS console; if the founder is unreachable on the day a grant rotation is needed, the platform's KMS-grant inventory is effectively un-rotated. The fourth hire takes ownership of the KMS-grant inventory — the IdP-bound rotation cron's output is reviewed by the fourth hire on a weekly cadence, the YAML inventory is updated by the fourth hire on every grant change, the quarterly grant audit is led by the fourth hire and signed by the founder. The fourth hire also owns the per-region worker rotation: the deploy script's per-region staggered window is the fourth hire's responsibility, the post-rotation pool-size diff is the fourth hire's review, the rollback decision in the event of a regional regression is the fourth hire's call.

The four-person deployment is also the size at which the runaway-tenant on-call rotation flips from fiction to reality. With three or fewer staff the on-call channel is the founder's Slack DMs by default; the supervisor's SIGKILL notifications go to the founder regardless of who is on call. With four staff and a dedicated on-call rotation, the SIGKILL notifications go to a dedicated channel that the on-call rotation reads (per the alert-router companion's on-call channel discipline). The dashboard's runaway-tenant inbox routing is rewritten: primary recipient is the on-call rotation channel, secondary recipient is the founder for kills involving Enterprise-tier tenants, the dashboard's auto-throttle status surfaces in both channels from the moment the third SIGKILL within an hour fires.

Five-person deployment — the largest size the model is calibrated for

The fifth hire is the third-party security advisor or the fractional KMS-grant auditor — the human whose explicit role is to run the quarterly KMS-grant audit cycle, to sign off on the secret-store reviewer's diffs when the secret-store reviewer is unavailable, to be the standing reviewer on the supervisor's source-code changes, and to be the off-site escrow contact for the KMS root-key recovery materials. With five humans and four hats (founder for the tenant manifest and the secret-store reviewer primary, two ops on queue-depth and per-region rotation, KMS-grant rotation owner, and the fifth hire wearing the security-advisor / KMS-audit / off-site-escrow hat) the model is at its calibrated size. Beyond five humans the model still works but the hats start to specialise — the secret-store reviewer becomes a security engineer, the KMS-grant rotation owner becomes a platform engineer, the security advisor becomes a compliance lead — and at that point the deployment has crossed out of the small-team companion's scope and into the architecture's enterprise-team default.

The week-1 collector setup checklist

The week-1 boundary is the minimum-viable line that converts the collector architecture into a running deployment. Every item below is required at any team size; the difference at smaller team sizes is not which items get done but how they are split across humans. The list is calibrated for a one-person deployment to be able to complete in two full working days; larger teams parallelise.

1. Choose the secret store between the three small-team-viable options

The architectural walkthrough specifies KMS-envelope encryption with per-tenant keys served by AWS KMS, Google Cloud KMS, or HashiCorp Vault transit. For a five-or-fewer team there are three viable variations: (a) envelope-encrypted Postgres column with a hosted-KMS data key — the credential ciphertexts live in a relational table keyed by (tenant_id, server_slug, credential_kind), the data key is wrapped by a tenant-scoped hosted KMS key (AWS KMS or GCP KMS), the worker decrypts the data key once at boot via a 5-minute signed IAM token, and the table is gated by Postgres row-security as a second-line defence; (b) age-encrypted file in the operator-config repo — credential ciphertexts live in operator-config/secrets/(tenant_id)/(server_slug).age, encrypted to a tenant-scoped public key whose private half lives in the worker's runtime via a hardware-backed token, the worker decrypts at boot from the file mounted into a tmpfs, and the file's pull-request review is the structural gate; (c) hosted secret store like AWS Secrets Manager or GCP Secret Manager — credentials live in the hosted secret store, scoped per tenant via IAM, the worker reads via a 5-minute signed IAM token, and the IAM-policy diff is the structural gate. The trade-offs: option (a) requires a Postgres deployment and a hosted KMS but gives the team row-security as a second line; option (b) requires no hosted-secret-store dependency but requires every credential rotation to be a pull request and a deploy; option (c) requires no Postgres deployment but ties the team to the hosted secret store's IAM model and pricing curve. The week-1 decision is to pick one and write it down. We pick (a) for AliveMCP because the platform already has a Postgres and a hosted KMS (per the archiver post and the read-side API's database) and option (a) folds into the existing schema-reviewer flow; the recipe is in §8.

2. Set the supervisor's rate-limit knobs that matter on day one

The architectural walkthrough specifies five knobs at the supervisor level: per-tenant probe budget per minute (mapped to the billing tier — Public reads only the global probes, Author 3 servers × 60s × 3 regions, Team 10 servers × 60s × 5 regions, Enterprise dedicated workers and contractual cadence), per-tenant CPU and memory caps per worker tied to the billing tier (0.25/0.5/1 vCPU, 128/256/512 MB), the 50-second wall-clock cap with the supervisor's SIGKILL discipline, the per-region queue's BLPOP timeout (5 seconds at the worker, so a worker that idles for 5 seconds checks for a new minute boundary and exits cleanly), and the per-server probe budget per minute (one per region per minute per server, deduplicated at the scheduler so two tenants who list the same public server share the verdict from one set of probes). The week-1 decision is to set the values once and check them into the supervisor's config repo. The values map to the tier-defaults YAML from the archiver companion's §3 (one source of truth for tier defaults across the archiver's retention values, the alert router's per-tier alert budgets, and the collector's per-tier probe budgets). A change to any of the five knobs is a pull request reviewed by the secret-store reviewer (or the founder, on smaller deployments) and approved via the dashboard's MFA-gated flow. We give the YAML shape in §8.
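As a sketch, the knob set as it might look in the supervisor's config types; the field names are illustrative (the checked-in YAML in §8 is the source of truth), and the per-minute budgets below are derived from the tier shapes above (3 servers × 3 regions for Author, 10 × 5 for Team):

// tier_knobs.go — the day-one supervisor knobs, one row per billing tier
// (sketch; the fifth knob, per-server dedup, lives in the scheduler).
package supervisor

import "time"

type TierKnobs struct {
    ProbeBudgetPerMin int           // per-tenant probe budget per minute
    CPUVCPUs          float64       // per-worker CPU cap
    MemoryBytes       int64         // per-worker memory cap
    WallClockCap      time.Duration // supervisor SIGKILLs past this
    BLPOPTimeout      time.Duration // per-region queue poll timeout
}

var knobDefaults = map[string]TierKnobs{
    "author":     {9, 0.25, 128 << 20, 50 * time.Second, 5 * time.Second}, // 3 servers × 3 regions
    "team":       {50, 0.5, 256 << 20, 50 * time.Second, 5 * time.Second}, // 10 servers × 5 regions
    "enterprise": {0, 1, 512 << 20, 50 * time.Second, 5 * time.Second},    // 0 = dedicated workers
}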

3. Schedule the per-region worker rotation

The architectural walkthrough specifies a worker pool sized to the 95th-percentile load with rolling restart on deploys. The week-1 decision is the rotation cadence and the per-region staggered window. The cadence is the 1st and 15th of each month at 02:00 UTC; the per-region staggered window is 30 minutes per region in the order us-east → us-west → eu-west → ap-southeast → sa-east. The supervisor's deploy script reads the canonical region list from the tenant manifest and refuses to declare success if any region's pool size is short of the configured floor; a short region triggers a P2 ticket on the deploy log and a manual investigation before the next region rolls. The two-rotations-per-month cadence is what keeps the worker image fresh across security patches and credential-format updates without overloading the team's deploy cadence; the per-region staggered window is what limits the blast radius of a regression to one region at a time. We give the deploy script's pseudocode in §8.

4. Calibrate the queue-depth alert for the team's actual probe volume

The architectural walkthrough specifies a queue-depth alert at the per-region work-queue level. The week-1 decision is to calibrate the alert thresholds for the small team's actual probe volume, not for the architectural reference's. A small team's pool size is much smaller than the architectural reference's; the threshold for "the queue is backing up" scales with the pool size. The alert is a percentile (the queue depth at the 90th percentile over the last 5 minutes) and a rate-of-change (the queue depth's first derivative over the last minute), not an absolute. The structural reason is that an absolute threshold calibrated for steady-state probe volume fires constantly during a Q3-audit re-run when the probe volume legitimately spikes; a percentile-and-rate-of-change calibration recognises the spike as a rising-then-falling shape and only fires when the shape stays elevated. The alert routes to the on-call channel (per the alert-router companion) on a 15-minute compressed-mode digest cadence, not on every minute; the digest is the structural defence against alert-fatigue at small scale.

5. Stand up the synthetic noisy-neighbour drill tenant

The synthetic noisy-neighbour drill tenant is provisioned at week one for the same reason the synthetic-outage drill tenant from the alert-router companion is provisioned at week one and the synthetic deletion-target tenant from the archiver companion is provisioned at week one: the day a real runaway-tenant condition occurs, the team needs to know that the supervisor's SIGKILL discipline actually works. The synthetic noisy-neighbour drill tenant is a fully provisioned tenant with a distinctive tenant-id prefix (drill-noisy-neighbour-), one server slug pointing at a deliberately-hanging endpoint that the team controls (a small Go server that accepts a TCP connection and then sleeps for 90 seconds before responding), sample rows in the tenant manifest, and an entry in the supervisor's tenant-aware throttle. The drill against this tenant exercises the SIGKILL discipline end-to-end every quarter; the receipt is committed to the team's drill-log repo. The synthetic tenant is reseeded after each drill so the next quarter's drill has a fresh hang to exercise.
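The deliberately-hanging endpoint itself can be tiny. A sketch under the post's assumptions (a 90-second sleep, any TCP port the team controls; the port here is made up):

// hangserver.go — the drill tenant's deliberately-hanging endpoint:
// accept a TCP connection, sleep 90 seconds, respond, close.
package main

import (
    "log"
    "net"
    "time"
)

func main() {
    ln, err := net.Listen("tcp", ":9090")
    if err != nil {
        log.Fatal(err)
    }
    for {
        conn, err := ln.Accept()
        if err != nil {
            continue
        }
        go func(c net.Conn) {
            defer c.Close()
            time.Sleep(90 * time.Second) // comfortably past the 50-second cap
            c.Write([]byte("{}\n"))      // respond only after the worker is dead
        }(conn)
    }
}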

6. Configure the supervisor's audit-log row format

The architectural walkthrough specifies the supervisor's audit-log row format as the row that records every worker start and every worker termination. The week-1 decision is to lock the row's shape: every SIGKILL is in the audit log with the killed worker's CPU and memory and wall-clock state at the moment of kill, the tenant ID, the server slug, the regions affected, the partial-verdict outcome, and the byte-count of the worker's stdout and stderr at the moment of kill (the byte-count is the structural defence against a logging regression that floods the supervisor's log forwarder). The audit-log row's schema is the same as the audit-log row from the operator-dashboard walkthrough with two collector-specific extension columns; the row is written by the supervisor's middleware in the same write transaction as the SIGKILL outcome (no try/catch absorption — if the audit-log insert fails, the supervisor restarts the worker and re-tries the kill on the next probe minute).

7. Lock the IdP-bound KMS-grant rotation cadence

The architectural walkthrough specifies one grant per tenant with a 5-minute expiry on the signed IAM token. The week-1 decision is the rotation cadence: one grant audit per quarter, run on the third Wednesday of each quarter's first month at 14:00 UTC (the same calendar slot as the archiver companion's offsite-backup restore drill, just one hour earlier). The audit walks the YAML inventory in operator-config/kms-grants.yaml and the live KMS state, finds drift, surfaces the drift as a P2 ticket, and the receipt is committed to the team's drill-log repo and signed by the rotation owner and the founder. The rotation cron's source is in the supervisor's config repo; a change to the cron is a pull request reviewed by the secret-store reviewer.

8. Stand up the registry-deduplicating crawl

The architectural walkthrough specifies that the hourly registry crawl deduplicates server-level (not tenant-level) so the discovery probe runs once and the verdict fans out to every tenant who has listed that server. The week-1 decision is to lock the crawl's deduplication boundary at the supervisor level — the supervisor's crawl worker reads the tenant manifest's union of listed servers, deduplicates, fans the verdicts out to per-tenant verdict-minute Redis prefixes, and the crawl has its own per-source-IP rate limit per registry that is half of the registry's published cap. The half-cap is the structural defence against the registry rate-limit problem from the architectural walkthrough's §1; the half is what gives the platform headroom to re-crawl on a registry's transient outage without crossing the cap. The crawl's source is in the supervisor's source repo; a change to the half-cap is a pull request reviewed by the secret-store reviewer.
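A sketch of the crawl loop's two structural pieces, the server-level dedup and the half-cap limiter, assuming golang.org/x/time/rate for the limiter; the manifest type and the probe/fan-out helpers are illustrative:

// crawl.go — dedupe the manifest's server union; crawl each registry at
// half its published cap so a transient-outage re-crawl stays under it.
package supervisor

import (
    "context"

    "golang.org/x/time/rate"
)

// halfCapLimiter pins the crawl at half the registry's published cap.
func halfCapLimiter(publishedPerSec float64) *rate.Limiter {
    return rate.NewLimiter(rate.Limit(publishedPerSec/2), 1)
}

func crawlOnce(ctx context.Context, m TenantManifest, lim *rate.Limiter) error {
    seen := map[string]bool{}
    for _, tenant := range m.Tenants {
        for _, slug := range tenant.ServerSlugs {
            if seen[slug] {
                continue // server-level dedup: one discovery probe per server
            }
            seen[slug] = true
            if err := lim.Wait(ctx); err != nil { // half-cap throttle
                return err
            }
            verdict := probeDiscovery(ctx, slug)
            fanOut(m, slug, verdict) // write to each listing tenant's verdict-minute prefix
        }
    }
    return nil
}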

Daily, weekly, monthly, quarterly collector routines

The week-1 setup is what gets the collector running. The routines are what keep it running. The routines are calibrated for the on-call to be one human on a one-person deployment, two humans on a 7-day rotation on a four-person deployment, and to scale linearly between. The cadence below is the cadence we have run; treat the times as a starting point and adapt them to your team's actual day length.

Daily — the queue-depth review

Every day, at the start of the operator's working day, the on-call reads the queue-depth dashboard end-to-end. The dashboard shows the last 24 hours of per-region queue depth at the 90th percentile over a 5-minute rolling window, the previous day's longest queue depth, the previous day's count of supervisor SIGKILL events, and the rate of worker_started and worker_killed events arriving from the supervisor. The on-call's job is to look for one anomaly. An anomaly is anything that looks unfamiliar: a SIGKILL count that is higher than the previous week's median, a queue-depth percentile that has crept up over the last three days, a region whose queue depth pattern differs from the other four regions, a tenant that has shown up in the SIGKILL log on three consecutive days.

The discipline is the one-anomaly-per-day rule, the same rule from the alert-router companion and the archiver companion. The on-call is not asked to read the full queue-depth dashboard and remember everything; they are asked to find one anomaly and either note it as benign in the dashboard's anomaly journal or escalate it. The benign annotation is one click and the escalation is one click and the dashboard records both with the actor and the timestamp. If the on-call finds zero anomalies on a quiet day, they record "zero anomalies" with one click and the routine is logged. The structural reason the rule is one-per-day rather than zero-or-many is that "find at least one" forces the on-call to engage with the dashboard rather than scrolling past it; the dashboard's anomaly journal is what the team reviews on Friday to see how the week's signal looked.

Weekly — supervisor SIGKILL log review and per-region pool health review

Every Friday, the on-call (or the per-region rotation lead on the larger deployments) runs the supervisor's SIGKILL log review. The review has three assertions: no tenant appears in the SIGKILL log on more than three consecutive days (a same-tenant repeat is the supervisor's auto-throttle working as intended once it crosses the third-SIGKILL-within-an-hour threshold; a same-tenant repeat across days is a pattern worth the customer-success conversation), no region accounts for more than 30% of the week's SIGKILL events (a region with disproportionate kills is a regional infrastructure regression — the network is slow to that region's MCPs, or the worker image has a regression that affects only that region), and no SIGKILL has been recorded with the worker's stdout or stderr byte-count at zero (a zero-byte SIGKILL means the worker died before it logged a single byte, which is the symptom of a credential-format crash rather than a wall-clock overrun and is a P2 ticket). The check is a single SQL statement that the dashboard's SIGKILL-review widget executes on demand; the on-call clicks the widget and reads three green checkmarks or one red one. A red checkmark is a P2 ticket that the secret-store reviewer (or the founder, on smaller deployments) addresses on Monday.
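One of the three assertions as the widget might run it, in Go with an embedded query; the table and column names are assumptions, not the shipped schema:

// sigkill_review.go — the zero-byte-kill assertion from the Friday review.
package dashboard

import (
    "context"
    "database/sql"
)

const zeroByteKillsSQL = `
SELECT count(*)
FROM   supervisor_audit_log
WHERE  event = 'worker_killed'
  AND  stdout_bytes = 0 AND stderr_bytes = 0
  AND  killed_at > now() - interval '7 days'`

// zeroByteKills returns the week's count of kills that logged nothing;
// any non-zero count turns the checkmark red (credential-format crash, P2).
func zeroByteKills(ctx context.Context, db *sql.DB) (int, error) {
    var n int
    err := db.QueryRowContext(ctx, zeroByteKillsSQL).Scan(&n)
    return n, err
}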

The same Friday slot is also when the per-region pool health review runs. The dashboard's pool-health widget shows the last 7 days of per-region pool size; each region should have its expected pool size (e.g. 8 workers in us-east, 8 in us-west, 8 in eu-west, 4 in ap-southeast, 4 in sa-east), no region should be short on workers (a region that is short on workers means the supervisor has been SIGKILLing workers faster than the deploy script is restoring them, which is a P2), and no region should be over the cap (a region that is over the cap is a deploy-script regression that is creating workers without registering them in the tenant manifest's region list). The fix for a short region is a manual deploy of the canonical worker image to the region; the discipline is reading the widget every Friday rather than waiting for the queue-depth alert to start firing late.

Monthly — the per-region worker rotation

Every month, on calendar-pinned days (the 1st and 15th at 02:00 UTC), the per-region worker rotation runs. The deploy script rotates each region's pool in a 30-minute staggered window; each region's pool is rotated in three batches (33% / 33% / 34%) with a 10-minute soak between batches. After each batch, the deploy script verifies that the region's queue depth has not risen above its 5-minute rolling 90th-percentile baseline by more than one standard deviation; if it has, the rotation aborts in that region and the deploy script raises a P2 ticket. After all five regions are rotated, the deploy script writes the rotation receipt to the audit log and posts a one-line note to the team's channel: "April 15 2026 worker rotation: 5/5 regions rotated, 0 aborts, 0 anomalies."

The rotation's structural-defence assertions are what make it survive a small team's review cadence. The assertion that the queue depth has not risen by more than one standard deviation is the rotation's structural defence against a regression that ships a slow worker image; the abort is what catches a regression before it propagates to the next region. The two assertions (the queue-depth soak check here and the region-completeness check from the week-1 checklist) live in the deploy script's source; both are tested by a unit test that runs in CI on every change to the deploy script, as sketched below. The unit test is the small team's structural defence against a deploy-script change that breaks one of the assertions.
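A sketch of the soak assertion as the deploy script might compute it; the Worker type and the helper names are illustrative:

// rotate.go — rotate one region in three batches with a 10-minute soak;
// abort if the queue depth rises past baseline + one standard deviation.
package deploy

import (
    "fmt"
    "time"
)

func rotateRegion(region string, pool []Worker) error {
    for _, batch := range splitBatches(pool, 3) { // 33% / 33% / 34%
        restartWorkers(region, batch)
        time.Sleep(10 * time.Minute)  // soak before judging the batch
        base := queueBaseline(region) // rolling 5-minute p90 and its stddev
        if queueDepthP90(region) > base.P90+base.StdDev {
            return fmt.Errorf("rotation aborted in %s: queue depth above baseline + 1 stddev", region)
        }
    }
    return nil
}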

Quarterly — the synthetic noisy-neighbour drill and the KMS-grant audit

Every quarter, on calendar-pinned days (the third Wednesday of each quarter's first month at 14:00 UTC for the KMS-grant audit, the third Thursday at 14:00 UTC for the noisy-neighbour drill), the team runs the two quarterly drills. The two drills are intentionally separated by a day to keep their failure modes from compounding into one bad day; the third-Wednesday slot for the KMS-grant audit is one hour earlier than the archiver companion's offsite-backup restore drill so the team's "drill day" muscle memory carries over.

The synthetic noisy-neighbour drill exercises the synthetic noisy-neighbour drill tenant. The drill has six steps: (1) reseed the synthetic tenant's deliberately-hanging endpoint with a fresh 90-second sleep; (2) confirm the supervisor's tenant-aware throttle is in its initial state for the synthetic tenant; (3) enqueue a probe-job for the synthetic tenant; (4) within 60 seconds confirm the supervisor SIGKILLs the worker with the wall-clock cap firing within the 50-second window; (5) confirm the read-side API serves a "probe timed out" partial verdict for the synthetic tenant within 60 seconds; (6) repeat steps 3-5 two more times and confirm that after the third SIGKILL within an hour the supervisor's auto-throttle reduces the synthetic tenant's probe cadence and the throttle surfaces in the dashboard's tenant detail page. The drill takes 30 minutes if everything works and up to 2 hours if a defect is found. The drill receipt is committed to the team's drill-log repo and signed by the rotation owner and the founder. A failed assertion is a P1; a successful drill is a one-line note in the team's channel.

The KMS-grant audit exercises the YAML inventory and the live KMS state. The audit has six steps: (1) read the YAML inventory in operator-config/kms-grants.yaml; (2) read the live KMS state via the cloud provider's IAM API; (3) compute the diff between the YAML and the live state; (4) for each YAML row not in the live state, surface a P2 ticket (the inventory thinks a grant exists that does not — likely a deploy that did not propagate); (5) for each live grant not in the YAML, surface a P1 ticket (the live state has a grant that the inventory does not — likely a contractor's grant that was never recorded or a grant that should have been revoked); (6) write the audit receipt to the drill-log repo with the diff and the action items. The audit's quarterly cadence matches the typical SOC-2 cadence the platform's auditors expect; the alignment is what makes the auditor's KMS-evidence request match a receipt the team already has. We give the audit script in §8.
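The heart of that script is a set diff; a sketch with an illustrative Grant shape keyed by grant ID:

// kms_audit.go — diff the checked-in YAML inventory against live KMS state.
package audit

type Grant struct {
    GranteeIdPUserID string // who holds the grant
    TenantID         string // which tenant's key it can decrypt
    Expiry           string // hard expiry on the cloud-IAM side
    Ticket           string // justification ticket reference
}

// diffGrants returns YAML-only rows (P2: the inventory thinks a grant
// exists that does not) and live-only rows (P1: unrecorded or unrevoked).
func diffGrants(inventory, live map[string]Grant) (p2, p1 []Grant) {
    for id, g := range inventory {
        if _, ok := live[id]; !ok {
            p2 = append(p2, g)
        }
    }
    for id, g := range live {
        if _, ok := inventory[id]; !ok {
            p1 = append(p1, g)
        }
    }
    return p2, p1
}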

The contractor and external-handshake pattern

One of the under-discussed features of small-team operations is that not every role on the collector rotation has to be a full-time employee. The roles that show up specifically with five-or-fewer staff and that map well onto contractor or advisor relationships are: the part-time security advisor (the secret-store reviewer's seat at quarter-time intensity), the fractional KMS-grant auditor (the rotation owner's quarterly audit at one day per quarter, typically a security-engineering firm or a SOC-2 reviewer with cloud-IAM experience), and the third-party SOC-2 reviewer (the same reviewer from the archiver companion, expanded to also sign off on the supervisor's source-code changes once a year).

Each contractor pattern has a structural shape that mirrors the corresponding employee role from §3. The part-time security advisor lives in the secret-store-reviewer IdP group with full refusal rights on secret-store changes, but a tag in the IdP group says "fractional, 12-month renewal" and the dashboard's session timeout is 8 hours instead of the employee default of 30 days. The fractional KMS-grant auditor lives in the audit IdP group with a tag that says "quarterly engagement" and a calendar binding that activates the seat for the audit week and deactivates it after; the audit's receipt is the deliverable. The third-party SOC-2 reviewer lives in the auditor IdP group (the same parked seat from the permission-model companion) with a scope-restricting tag that says "audit-the-supervisor-source" and a one-time receipt CSV export at the end of the audit.

The structural decision for each contractor role is the same: the role lives in the IdP, the dashboard refuses to grant the role permissions outside the IdP-bound scope, and the role's expiry is calendar-bound at week one. The fractional KMS-grant auditor role has one additional structural defence: the contractor's grant on the cloud-IAM side has a hard expiry at the audit-week's end, enforced by the cloud provider's IAM policy not by the team's IdP alone — so even if the IdP-side cleanup is forgotten, the cloud-IAM side has expired the grant. The contractor pattern is what makes the small-team collector operation survive the human reality that the KMS-grant audit in particular cannot be staffed full-time at week one and the SOC-2 audit cycle does not pause for hiring.

Seven failure modes specific to small-team collector operations

The architectural walkthrough listed five collector-specific failure modes (noisy-neighbour CPU contention, secret-store cache poisoning across tenants, queue starvation under one Enterprise tenant's burst, registry rate-limit blowback, verdict-minute coalesce races). All five survive at small-team scale unchanged. What follows is seven additional failure modes that show up specifically when the team is small and the rotation is one or two humans deep. Each has a structural fix that does not depend on team discipline alone.

1. The runaway tenant on a Saturday afternoon when the founder is asleep

The supervisor SIGKILLs a worker for the runaway tenant at 03:14 UTC. The same tenant's next probe minute schedules another worker that hangs on the same endpoint and gets killed at 03:15, then 03:16, then on every probe minute after that. The founder is asleep. By 11:14 UTC the founder has 480 SIGKILL Slack DMs in their inbox (eight hours of one kill per minute); the supervisor has spent eight hours in a kill-and-restart loop on one tenant's probes; the read-side API has served "probe timed out" for that tenant the whole night, and that tenant's customer is justifiably angry on Monday morning.

The structural fix is the supervisor's tenant-aware auto-throttle. After the third SIGKILL within an hour, the supervisor reduces the tenant's probe cadence by half (60 seconds → 120 seconds) and surfaces the throttle as a customer-visible note in the dashboard's tenant detail page; after the sixth SIGKILL within four hours, the supervisor reduces the cadence to 600 seconds and pages the founder once with a P2 (one page, not 480); after the twelfth SIGKILL within twelve hours, the supervisor stops scheduling probes for that tenant entirely and the customer's status page shows "monitoring suspended for this server — contact support." The structural reason the throttle is supervisor-side rather than alert-router-side is that the alert router is downstream of the verdict; if the verdict is "probe timed out" every minute for eight hours, the alert router has the right to suppress them, but the structural cost-of-running-a-probe is at the supervisor and the supervisor is where the throttle has to live. The auto-throttle is what makes the small-team on-call survive a runaway-tenant Saturday.
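A sketch of the throttle ladder, assuming the supervisor keeps an in-memory list of SIGKILL timestamps per tenant; the type names are illustrative, and the thresholds are the ones from the paragraph above:

// throttle.go — the supervisor's tenant-aware auto-throttle ladder.
package supervisor

import "time"

type tenantThrottle struct {
    kills []time.Time // SIGKILL timestamps for one tenant
}

// cadenceAfterKill records a SIGKILL and returns the tenant's new probe
// cadence; a zero cadence means probes for this tenant are suspended.
func (t *tenantThrottle) cadenceAfterKill(now time.Time) (time.Duration, bool) {
    t.kills = append(t.kills, now)
    within := func(window time.Duration) int {
        n := 0
        for _, k := range t.kills {
            if now.Sub(k) <= window {
                n++
            }
        }
        return n
    }
    switch {
    case within(12*time.Hour) >= 12:
        return 0, true // status page: "monitoring suspended for this server — contact support"
    case within(4*time.Hour) >= 6:
        return 600 * time.Second, false // plus one P2 page to the founder
    case within(time.Hour) >= 3:
        return 120 * time.Second, false // customer-visible throttle note
    default:
        return 60 * time.Second, false
    }
}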

2. The secret-store cache poisoning the small team cannot catch in code review alone

The OAuth-discovery cache from the architectural walkthrough's §3 is keyed on (tenant_id, server_slug). A code change accidentally re-keys the cache on server_slug alone — the diff is small (one line removed from the cache-key tuple), and the code review on a small team passes the change because it looks like a refactor. A malicious tenant's MCP server returns a poisoned discovery document with the auth-server URL pointing to the attacker's server; the cache stores the document under the unscoped key; another tenant's worker, probing the same server_slug for its own tenant, reads the poisoned document and starts an OAuth flow against the attacker's server. The other tenant's bearer token is exfiltrated.

The structural fix is a property-based unit test that asserts the cache-key binding on every change. The test enumerates the cache's API surface (every read, every write), constructs an adversarial input where two tenants have the same server_slug with different auth-server URLs, asserts that the second tenant's read does not return the first tenant's URL, and runs in CI on every change to the cache module. The structural defence is in CI because the small-team code-review surface cannot reliably catch a one-line change to a tuple; the property-based test is what catches it. We give the test outline in §8.
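A sketch of the core case, assuming a cache under test with Put/Get methods keyed on (tenant_id, server_slug); a full property-based version would randomise the tenant and slug pairs (for example with testing/quick):

// cache_key_test.go — two tenants, one server_slug, different auth URLs;
// one tenant's write must not leak into the other tenant's read.
package cache

import "testing"

func TestDiscoveryCacheKeyIsTenantScoped(t *testing.T) {
    c := NewDiscoveryCache() // assumed constructor for the cache under test
    c.Put("tenant-a", "shared-slug", "https://auth.tenant-a.example")
    c.Put("tenant-b", "shared-slug", "https://attacker.example")
    got, ok := c.Get("tenant-a", "shared-slug")
    if !ok || got != "https://auth.tenant-a.example" {
        t.Fatalf("cache key lost its tenant binding: tenant-a read %q", got)
    }
}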

3. The per-region worker rotation that misses a region

The deploy script rolls the worker image to four regions; the fifth region (sa-east) is missing from the deploy script's region list because the script reads from a hand-maintained constant rather than from the tenant manifest. The fifth region's pool runs the previous worker image for two weeks while the other four regions run the new one. A regression in the previous worker's credentialed-probe sequence affects only the fifth region; the team learns about it only when a customer in São Paulo files a support ticket.

The structural fix is the deploy script's canonical-region-list-from-the-tenant-manifest read. The deploy script reads the region list from the tenant manifest's union of regions in use, not from a hand-maintained constant; the script refuses to declare success if any region in the manifest is not in the deploy's per-region success list. The structural defence is the manifest-as-source-of-truth: the deploy script and the supervisor share the same source of truth for which regions exist, and the script cannot miss a region because the manifest tells it the region exists.
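The refusal is a dozen lines; a sketch with illustrative names:

// deploy_regions.go — refuse to declare a rotation successful unless every
// region in the tenant manifest is in the deploy's success list.
package deploy

import "fmt"

func assertAllRegionsRotated(manifestRegions, rotated []string) error {
    done := map[string]bool{}
    for _, r := range rotated {
        done[r] = true
    }
    for _, r := range manifestRegions {
        if !done[r] {
            return fmt.Errorf("rotation incomplete: region %s not rotated", r)
        }
    }
    return nil
}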

4. The supervisor SIGKILL that left a half-decrypted credential on the host

The supervisor SIGKILLs a worker for wall-clock overrun. The worker had decrypted the tenant's credentials at boot and was mid-flight on a probe when the kill arrived. The decrypted credentials live in the worker's process memory at the moment of kill; on a normal exit the kernel reclaims the memory and the credentials disappear. But if the host is under memory pressure, the page holding the plaintext may already have been written out to swap before the kill arrives; the half-decrypted credential is now on disk in the swap partition.

The structural fix is to mount the worker's credentials into a tmpfs that is unmapped on supervisor SIGKILL via a cgroup release notifier. The tmpfs is configured with noswap so the credential never reaches the swap partition; the cgroup release notifier fires when the worker's cgroup transitions to empty (which is what SIGKILL produces) and the notifier executes a one-line script that unmounts the tmpfs and zeroes the underlying memory. The structural defence is the kernel: the credential physically cannot survive a supervisor SIGKILL because the tmpfs is gone before the kernel even gets to the swap-eviction step. The configuration is in the supervisor's source; a change to the tmpfs flags is a pull request reviewed by the secret-store reviewer.
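On cgroup v2 there is no per-cgroup release agent, so the notifier is a watcher on the cgroup.events file, whose populated flag drops to 0 when the last process dies. A polling sketch (production code would use inotify rather than a sleep loop):

// tmpfs_reaper.go — unmount the credential tmpfs once the worker's cgroup
// goes empty, which is exactly what a supervisor SIGKILL produces.
package supervisor

import (
    "os"
    "strings"
    "syscall"
    "time"
)

func reapTmpfsOnCgroupEmpty(cgroupPath, tmpfsPath string) error {
    for {
        b, err := os.ReadFile(cgroupPath + "/cgroup.events")
        if err != nil {
            return err
        }
        if strings.Contains(string(b), "populated 0") {
            break // the SIGKILLed worker is gone; reap the mount
        }
        time.Sleep(100 * time.Millisecond)
    }
    return syscall.Unmount(tmpfsPath, syscall.MNT_DETACH)
}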

5. The queue-depth alert calibrated for the worst-case minute

The queue-depth alert is calibrated at the absolute threshold "queue depth ≥ 100." During a Q3-audit re-run the probe volume legitimately spikes by 5× for an hour; the queue depth crosses 100 every minute for an hour; the alert fires every minute for an hour; the on-call rotation receives 60 pages in an hour and the on-call human's pager-fatigue threshold is exceeded by minute 20. The on-call disables the alert in PagerDuty's UI for the rest of the day; the next real queue-starvation incident, three days later, goes unpaged for 40 minutes.

The structural fix is the alert as a percentile and a rate-of-change rather than an absolute. The alert fires when the 90th-percentile queue depth over a 5-minute rolling window stays elevated above its baseline for more than 15 minutes (so a Q3-audit-re-run spike fires once at minute 15 if it stays high, not every minute) AND the queue depth's first derivative over the last minute is positive (so a falling queue depth does not fire even if it is still high). The alert routes to the on-call channel on a 15-minute compressed-mode digest cadence (per the alert-router companion's compressed-mode pattern), not on every minute. The structural defence is the percentile-and-rate-of-change shape: the alert recognises a Q3-audit re-run as a rising-then-falling shape and only fires when the shape stays elevated.
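The firing condition reduces to a few lines; a sketch assuming the alerter samples the rolling p90 once a minute (the names are illustrative):

// queue_alert.go — fire on sustained elevation plus a positive derivative,
// never on an absolute depth, so audit-spike shapes do not page per minute.
package alerts

// shouldFire takes the last 15 one-minute samples of the 5-minute rolling
// p90 queue depth, the steady-state baseline, and the last minute's delta.
func shouldFire(p90Last15 []float64, baselineP90, lastMinuteDelta float64) bool {
    if len(p90Last15) < 15 {
        return false // not enough history to call the shape sustained
    }
    for _, p := range p90Last15 {
        if p <= baselineP90 {
            return false // rising-then-falling shape: no page
        }
    }
    return lastMinuteDelta > 0 // elevated for 15 minutes and still climbing
}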

6. The KMS grant that was never revoked when a contractor rolled off

A contractor was given a KMS grant six months ago to debug an OAuth-discovery problem. The contractor rolled off three months ago; their IdP account was deactivated; their grant on the cloud-IAM side was forgotten. The cloud-IAM grant is still active. A supply-chain compromise of the contractor's laptop produces an attacker who has the contractor's IAM credentials; the attacker decrypts a tenant's credentials via the still-active grant.

The structural fix is the IdP-bound KMS-grant rotation cron. The cron reads the YAML inventory in operator-config/kms-grants.yaml and the live KMS state on every quarterly audit; the diff between the two is the audit's deliverable. A live grant whose grantee's IdP account is deactivated is a P1 ticket that the rotation owner addresses immediately. The structural defence is the cross-check between the IdP and the cloud IAM: the IdP knows the contractor is gone; the cron's reconciliation is what surfaces the still-active grant. The contractor pattern's hard-expiry on the cloud-IAM side from §6 is the second structural defence; the cron is the first.

7. The verdict-minute coalesce race that surfaces only on the first cross-region partition

The verdict-minute Lua coalescer from the architectural walkthrough's §6 runs the two-of-N aggregation atomically and refuses to seal until ≥2 regions are in. A network partition isolates eu-west and ap-southeast from the central Redis for 90 seconds. The remaining three regions (us-east, us-west, sa-east) seal a verdict for a server based on those three regions' results; eu-west and ap-southeast's results arrive 90 seconds later, after the verdict has been sealed; the coalescer's idempotency contract has to handle the late arrival without flipping the colour. A regression in the coalescer's late-arrival handling that is only exercised on the first cross-region partition flips the colour for 60 seconds and the read-side API serves the wrong colour to the read-side cache.

The structural fix is a property-based unit test that asserts the coalescer's late-arrival behaviour. The test constructs the cross-region-partition adversarial input — three regions seal a verdict, two regions arrive late, the test asserts the post-late-arrival sealed verdict is byte-identical to the pre-late-arrival one — and runs in CI on every change to the coalescer. The structural defence is the test: the coalescer's late-arrival contract is too subtle for the small-team code-review surface to catch reliably, so the contract is enforced in CI. We give the test outline in §8.
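A sketch of the adversary, assuming a test double for the Lua coalescer with Write/Seal/Sealed methods; the real test would drive the script through a Redis test instance:

// coalesce_test.go — three regions seal, two arrive 90 seconds late; the
// sealed verdict must be byte-identical before and after the stragglers.
package coalesce

import (
    "bytes"
    "testing"
)

func TestLateRegionArrivalDoesNotFlipSealedVerdict(t *testing.T) {
    c := NewCoalescer() // assumed harness around the verdict-minute script
    for _, r := range []string{"us-east", "us-west", "sa-east"} {
        c.Write("server-1", r, "green") // two-of-N satisfied at the second write
    }
    sealed := c.Seal("server-1")
    c.Write("server-1", "eu-west", "red")      // partitioned region, arriving late
    c.Write("server-1", "ap-southeast", "red") // partitioned region, arriving late
    if !bytes.Equal(sealed, c.Sealed("server-1")) {
        t.Fatal("late regional arrivals flipped a sealed verdict")
    }
}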

Reference recipes

The recipes below are calibrated for a small team — short, copy-pasteable, and defensible against the failure modes named above. They are not full implementations; the architectural walkthrough has the full Go and SQL and Lua. These are the small-team operator's drop-in scaffolds.

The small-team supervisor with cgroup CPU/memory caps (Go pseudocode)

// supervisor.go — small-team supervisor that starts and reaps workers
// per probe minute, with cgroup CPU/memory caps tied to the billing tier
// and a 50-second wall-clock cap enforced by SIGKILL.

package supervisor

import (
    "context"
    "fmt"
    "os/exec"
    "syscall"
    "time"
)

const wallClockCap = 50 * time.Second

type tenantTier int

const (
    tierAuthor tenantTier = iota
    tierTeam
    tierEnterprise
)

type tierLimits struct {
    cpuShares   int64 // cgroup v2 cpu.weight (1..10000)
    memoryBytes int64 // cgroup v2 memory.max
}

var limitsByTier = map[tenantTier]tierLimits{
    tierAuthor:     {cpuShares: 250, memoryBytes: 128 << 20},
    tierTeam:       {cpuShares: 500, memoryBytes: 256 << 20},
    tierEnterprise: {cpuShares: 1000, memoryBytes: 512 << 20},
}

// startWorker mounts the worker's tmpfs (no-swap, unmounted on cgroup release),
// starts the worker container with the tenant's signed IAM token, and
// returns the worker's process handle so the supervisor can reap it.
func startWorker(
    ctx context.Context,
    tenantID, serverSlug, region string,
    tier tenantTier,
    signedIAMToken string,
) (*exec.Cmd, error) {
    cgroupPath := fmt.Sprintf("/sys/fs/cgroup/alivemcp-workers/%s-%s",
        tenantID, region)
    if err := writeCgroupFile(cgroupPath, "cpu.weight",
        fmt.Sprintf("%d", limitsByTier[tier].cpuShares)); err != nil {
        return nil, err
    }
    if err := writeCgroupFile(cgroupPath, "memory.max",
        fmt.Sprintf("%d", limitsByTier[tier].memoryBytes)); err != nil {
        return nil, err
    }
    if err := writeCgroupFile(cgroupPath, "cgroup.events.release_agent",
        "/usr/local/bin/unmount-tmpfs-on-release.sh"); err != nil {
        return nil, err
    }

    cmd := exec.CommandContext(ctx, "/usr/local/bin/probe-worker",
        "--tenant-id", tenantID,
        "--server-slug", serverSlug,
        "--region", region)
    cmd.Env = append(cmd.Env,
        "SIGNED_IAM_TOKEN="+signedIAMToken,
        // ABSOLUTELY NOTHING ELSE — see §3 of the post.
    )
    cmd.SysProcAttr = &syscall.SysProcAttr{
        UseCgroupFD: true,
        CgroupFD:    openCgroup(cgroupPath),
    }
    return cmd, cmd.Start()
}

// reapWorker waits up to wallClockCap for the worker to exit, then
// SIGKILLs it. Records the kill in the audit log with the worker's
// CPU/memory/wall-clock state and the byte-counts of stdout/stderr.
func reapWorker(
    ctx context.Context,
    cmd *exec.Cmd,
    tenantID, serverSlug, region string,
    auditLog AuditLog,
) ProbeOutcome {
    done := make(chan error, 1)
    go func() { done <- cmd.Wait() }()

    select {
    case err := <-done:
        return outcomeFromExit(err)
    case <-time.After(wallClockCap):
        cpuStat, memStat := readCgroupStat(cmd)
        stdoutBytes, stderrBytes := readPipeBytes(cmd)
        cmd.Process.Signal(syscall.SIGKILL)
        <-done // reap the killed worker so the PID is released immediately
        // Audit-log row in the same transaction as the partial-verdict.
        auditLog.WriteSIGKILL(ctx, SIGKILLRow{
            TenantID:        tenantID,
            ServerSlug:      serverSlug,
            Region:          region,
            CPUUsageNanos:   cpuStat.UsageNanos,
            MemoryRSSBytes:  memStat.RSSBytes,
            WallClockSec:    int(wallClockCap / time.Second),
            StdoutBytes:     stdoutBytes,
            StderrBytes:     stderrBytes,
            PartialVerdict:  "probe timed out",
        })
        // Increment the tenant-aware throttle counter — see §7.1.
        throttle.RecordSIGKILL(tenantID)
        return outcomeFromKill()
    }
}
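
A usage sketch ties the two halves together. The per-minute glue below is an assumption, not the walkthrough's scheduler surface (ProbeJob, recordStartFailure, and publishOutcome are illustrative names), but it shows the intended call shape: a worker that cannot start is recorded as a verdict-minute gap for the coalescer's ≥2-region rule to absorb, never retried inside the minute.

// probe_minute.go — illustrative per-minute glue; names are assumptions.
func runProbeMinute(ctx context.Context, job ProbeJob, auditLog AuditLog) {
    cmd, err := startWorker(ctx, job.TenantID, job.ServerSlug, job.Region,
        job.Tier, job.SignedIAMToken)
    if err != nil {
        // A start failure is a missing-region gap, not a retry: the
        // verdict-minute coalescer's ≥2-region rule absorbs it.
        recordStartFailure(job, err)
        return
    }
    publishOutcome(job, reapWorker(ctx, cmd,
        job.TenantID, job.ServerSlug, job.Region, auditLog))
}

reapWorker also increments the tenant-aware throttle on every kill. A minimal in-memory sketch of the intended semantics (repeated wall-clock offenders get their jobs skipped until a 24-hour window drains; the window and threshold are assumptions to tune) looks like this:

// throttle.go — minimal tenant-aware SIGKILL throttle sketch; add "sync"
// to the supervisor imports.
const (
    killWindow     = 24 * time.Hour
    maxKillsPerDay = 10 // assumption: tune to the tenant count
)

type sigkillThrottle struct {
    mu    sync.Mutex
    kills map[string][]time.Time // tenantID -> recent SIGKILL timestamps
}

// RecordSIGKILL notes one wall-clock kill for the tenant.
func (t *sigkillThrottle) RecordSIGKILL(tenantID string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.kills[tenantID] = append(t.pruneLocked(tenantID), time.Now())
}

// Suspended reports whether the scheduler should skip this tenant's jobs.
func (t *sigkillThrottle) Suspended(tenantID string) bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.kills[tenantID] = t.pruneLocked(tenantID)
    return len(t.kills[tenantID]) > maxKillsPerDay
}

// pruneLocked drops timestamps older than killWindow; callers hold mu.
func (t *sigkillThrottle) pruneLocked(tenantID string) []time.Time {
    cutoff := time.Now().Add(-killWindow)
    kept := t.kills[tenantID][:0]
    for _, ts := range t.kills[tenantID] {
        if ts.After(cutoff) {
            kept = append(kept, ts)
        }
    }
    return kept
}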

The envelope-encrypted-Postgres-column secret-store recipe (SQL + bash)

-- secret-store.sql — week-1 setup for the envelope-encrypted column option.
-- Each tenant's data key is wrapped by a tenant-scoped hosted-KMS key.
-- Postgres row-security is the second-line defence.

CREATE TABLE tenant_credentials (
    tenant_id        UUID NOT NULL,
    server_slug      TEXT NOT NULL,
    credential_kind  TEXT NOT NULL CHECK (credential_kind IN
                       ('bearer', 'api_key', 'oauth_refresh', 'mtls_key')),
    ciphertext       BYTEA NOT NULL,
    wrapped_data_key BYTEA NOT NULL, -- KMS-encrypted data key
    kms_key_arn      TEXT NOT NULL,
    created_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    rotates_at       TIMESTAMPTZ,
    PRIMARY KEY (tenant_id, server_slug, credential_kind)
);

ALTER TABLE tenant_credentials ENABLE ROW LEVEL SECURITY;
-- FORCE applies the policy to the table owner too, so even an owner-role
-- session cannot read across tenants.
ALTER TABLE tenant_credentials FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON tenant_credentials
    USING (tenant_id = current_setting('app.tenant_id')::uuid);

CREATE ROLE probe_worker NOINHERIT;
GRANT SELECT ON tenant_credentials TO probe_worker;
-- probe_worker NEVER has UPDATE/INSERT/DELETE on this table.
-- Credential rotation is performed by a separate role.

CREATE ROLE secret_store_writer NOINHERIT;
GRANT INSERT, UPDATE, DELETE ON tenant_credentials TO secret_store_writer;
-- Granted only to the secret-store-reviewer's IdP-bound IAM session
-- via the dashboard's MFA-gated approval flow.

#!/usr/bin/env bash
# fetch-credential.sh — worker-side credential fetch at boot.
# Called once per worker per probe minute; never logged.

set -euo pipefail

TENANT_ID="${TENANT_ID:?missing}"
SERVER_SLUG="${SERVER_SLUG:?missing}"
SIGNED_IAM_TOKEN="${SIGNED_IAM_TOKEN:?missing}"
KMS_KEY_ARN="${KMS_KEY_ARN:?missing}"
PG_DSN="${PG_DSN:?missing}"

# 1. Set the row-security session variable from the validated tenant ID.
export PGOPTIONS="-c app.tenant_id=${TENANT_ID}"

# 2. Read the wrapped data key for this tenant + server.
wrapped_key=$(psql "${PG_DSN}" -tAc \
    "SELECT encode(wrapped_data_key, 'base64')
     FROM tenant_credentials
     WHERE tenant_id = '${TENANT_ID}'
       AND server_slug = '${SERVER_SLUG}'
     LIMIT 1")

# 3. Unwrap the data key via KMS using the signed IAM token. KMS returns
#    the plaintext key base64-encoded; openssl expects it as hex.
data_key_hex=$(aws kms decrypt \
    --key-id "${KMS_KEY_ARN}" \
    --ciphertext-blob fileb://<(echo "${wrapped_key}" | base64 -d) \
    --query Plaintext --output text \
    --no-cli-pager 2>/dev/null | base64 -d | xxd -p -c 256)

# 4. Decrypt the credential ciphertext locally — never leaves this shell.
#    Sketch assumption: the writer sealed with AES-256-CBC and prepended the
#    16-byte IV to the ciphertext. `openssl enc` cannot do AEAD modes such
#    as GCM, so a production deployment should pair an AEAD-sealing writer
#    with an AEAD-capable reader instead of this lowest-common-denominator
#    path.
ciphertext=$(psql "${PG_DSN}" -tAc \
    "SELECT encode(ciphertext, 'base64')
     FROM tenant_credentials
     WHERE tenant_id = '${TENANT_ID}'
       AND server_slug = '${SERVER_SLUG}'
     LIMIT 1")
iv_hex=$(echo "${ciphertext}" | base64 -d | head -c 16 | xxd -p)
credential=$(echo "${ciphertext}" | base64 -d | tail -c +17 \
    | openssl enc -d -aes-256-cbc -K "${data_key_hex}" -iv "${iv_hex}")

# 5. Mount the credential on tmpfs (noswap) so SIGKILL cannot leak it.
mkdir -p /run/probe-creds
mount -t tmpfs -o size=1m,mode=0600,noswap tmpfs /run/probe-creds
echo "${credential}" > /run/probe-creds/credential
unset credential data_key_hex iv_hex ciphertext

# 6. Hand off to the credentialed probe (see the credentialed walkthrough).
exec /usr/local/bin/credentialed-probe \
    --credential-file /run/probe-creds/credential

The IdP-bound KMS-grant rotation script (bash)

#!/usr/bin/env bash
# kms-grant-audit.sh — quarterly cron run by the rotation owner.
# Reads the YAML inventory and the live KMS state, surfaces the diff.

set -euo pipefail

INVENTORY="${INVENTORY:-/opt/operator-config/kms-grants.yaml}"
DRIFT_LOG="${DRIFT_LOG:-/opt/drill-log/kms-grant-drift.log}"
PRIMARY_KEY_ARN="${PRIMARY_KEY_ARN:?missing}"
DRILL_LOG_REPO="${DRILL_LOG_REPO:?missing}"

# 1. Read the YAML inventory, projected to the two fields both surfaces
#    share (grantee + tenant) so the diff below compares like with like.
yq eval '.grants[] | [.grantee_idp_user, .tenant_id] | @tsv' \
    "${INVENTORY}" | sort > /tmp/yaml-inventory.tsv

# 2. Read the live KMS state via the cloud-IAM API. Keep the full listing
#    (with grant IDs) for the revocation loop, and the same grantee+tenant
#    projection for the diff.
aws kms list-grants --key-id "${PRIMARY_KEY_ARN}" \
    --query 'Grants[].[GranteePrincipal, Constraints.EncryptionContextEquals.tenant_id, GrantId, IssuingAccount]' \
    --output text | sort > /tmp/live-kms.tsv
cut -f1,2 /tmp/live-kms.tsv | sort > /tmp/live-kms-normalised.tsv

# 3. Diff the two surfaces. Overwrite (not append) the drift log so the
#    -s test below reflects this run only; `|| true` keeps `set -e` from
#    aborting on diff's exit-1-when-different.
diff /tmp/yaml-inventory.tsv /tmp/live-kms-normalised.tsv \
    | tee "${DRIFT_LOG}" || true

if [ -s "${DRIFT_LOG}" ]; then
    echo "P2: KMS-grant inventory drift detected at $(date -Iseconds)"
    echo "    See ${DRIFT_LOG} for the diff"
    echo "    Rotation owner reviews on Monday morning"

    # 4. For each live grant whose grantee's IdP account is deactivated,
    #    raise a P1 immediately.
    while IFS=$'\t' read -r grantee tenant_id grant_id _; do
        if ! idp-status "${grantee}" | grep -q ACTIVE; then
            echo "P1: live KMS grant ${grant_id} for deactivated grantee ${grantee}"
            echo "    revoking grant immediately"
            aws kms revoke-grant --key-id "${PRIMARY_KEY_ARN}" \
                --grant-id "${grant_id}"
        fi
    done < /tmp/live-kms.tsv
fi

# 5. Write the audit receipt.
cat >> "${DRILL_LOG_REPO}/kms-grant-audit-$(date +%Y-Q%q).md" <<EOF
# KMS grant audit — $(date -Iseconds)

## Inventory rows: $(wc -l < /tmp/yaml-inventory.tsv)
## Live KMS rows:  $(wc -l < /tmp/live-kms.tsv)
## Drift rows:     $(wc -l < "${DRIFT_LOG}")

## Action items
$(head -20 "${DRIFT_LOG}")

Signed: $(whoami) at $(date -Iseconds)
EOF
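
The week-1 checklist schedules this script for the third Wednesday of each quarter, and that schedule does not map onto cron directly: when both the day-of-month and day-of-week fields are restricted, cron fires when either matches rather than both. The standard workaround is to restrict only the weekday and guard the day-of-month in the command. A sketch, with the install path assumed:

# crontab for the rotation owner — 09:00 on Wednesdays in Jan/Apr/Jul/Oct;
# the guard keeps it to the third Wednesday (day 15-21), because cron ORs
# restricted day-of-month and day-of-week fields instead of ANDing them.
0 9 * 1,4,7,10 3 [ "$(date +\%d)" -ge 15 ] && [ "$(date +\%d)" -le 21 ] && /opt/operator-config/kms-grant-audit.sh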

The queue-depth alert with a small-team rate window (PromQL)

# queue-depth-alert.yaml — calibrated for a small-team probe volume.
# The alert is a percentile-and-rate-of-change, not an absolute.
# See §7.5 of the post for the failure mode this defends against.

groups:
- name: collector-queue-depth
  rules:
  - alert: CollectorQueueDepthSustainedAboveBaseline
    expr: |
      # 90th percentile queue depth over a 5-minute rolling window
      quantile_over_time(0.9,
        alivemcp_queue_depth_per_region[5m]
      )
      >
      # Baseline: mean over the prior 7 days, plus 1 stddev.
      (avg_over_time(alivemcp_queue_depth_per_region[7d])
       + stddev_over_time(alivemcp_queue_depth_per_region[7d]))
      and
      # AND the queue is still rising over the last minute
      # (so a falling queue does not fire even if still elevated).
      deriv(alivemcp_queue_depth_per_region[1m]) > 0
    for: 15m
    labels:
      severity: page
      team: collector
      delivery_mode: compressed-digest
      digest_window: 15m
    annotations:
      summary: |
        Queue depth has stayed above its 7-day baseline + 1 stddev
        for >15 minutes, AND is still rising. Likely causes:
        Q3-audit re-run (legitimate, will subside), regional
        infrastructure regression, or a deploy-script regression
        that under-sized the worker pool.
      runbook: https://operator-config/runbooks/queue-depth-alert.md
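
The coalescer late-arrival property test (Go test outline)

This is the test outline promised in failure mode 7: the property is that a verdict sealed by ≥2 regions is byte-identical after every late arrival has been applied. The Coalescer interface and newCoalescerUnderTest below are scaffolding assumptions; wire them to the walkthrough's actual Lua script and a test Redis before trusting the test.

// coalescer_late_arrival_test.go — property test for the late-arrival contract.

package coalescer_test

import (
    "math/rand"
    "testing"
)

// Coalescer is an assumed thin wrapper around the verdict-minute Lua script.
type Coalescer interface {
    // Submit records one region's result for a server-minute; it returns
    // the sealed colour once ≥2 regions are in, or "" while unsealed.
    Submit(serverSlug string, minute int64, region, colour string) string
    // Sealed returns the sealed colour for a server-minute, or "".
    Sealed(serverSlug string, minute int64) string
}

func newCoalescerUnderTest(t *testing.T) Coalescer {
    t.Helper()
    t.Skip("wire to the walkthrough's Lua coalescer and a test Redis")
    return nil
}

func TestLateArrivalNeverFlipsASealedVerdict(t *testing.T) {
    regions := []string{"us-east", "us-west", "sa-east", "eu-west", "ap-southeast"}
    colours := []string{"green", "amber", "red"}
    rng := rand.New(rand.NewSource(1)) // fixed seed so failures reproduce

    for trial := 0; trial < 1000; trial++ {
        c := newCoalescerUnderTest(t)
        var sealed string
        // Random colour per region, random arrival order: the partition
        // case (three regions seal, two arrive late) is one ordering.
        for i, idx := range rng.Perm(len(regions)) {
            got := c.Submit("example-server", 42, regions[idx],
                colours[rng.Intn(len(colours))])
            if got != "" && sealed == "" {
                if i == 0 {
                    t.Fatalf("trial %d: sealed on the first region", trial)
                }
                sealed = got
            }
        }
        // The property: late arrivals never flip a sealed verdict.
        if final := c.Sealed("example-server", 42); final != sealed {
            t.Fatalf("trial %d: sealed %q, after late arrivals %q",
                trial, sealed, final)
        }
    }
}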

Where this fits — collector companion · closes the small-team-companion arc

This post is the collector-side companion to the architectural walkthrough. The architectural walkthrough described the six layers — worker-as-security-boundary tenant isolation, KMS-envelope-encrypted per-tenant secret store, per-region work-queue fan-out, per-tenant rate limiting at the scheduler, tenant-prefixed shared state with a verdict-minute Lua coalescer, billing-aware probe paths — that make a single collector safe across many tenants. This post described how to actually operate that architecture with one to five humans on the team — the headcount-to-collector-ownership mapping, the week-1 setup checklist, the daily and weekly and monthly and quarterly drill cadence, the contractor and external-handshake pattern, and seven small-team-specific failure modes with structural fixes. Together they form the two halves of how a small multi-tenant MCP-monitoring team operates the write side of the stack. The architectural side and the operational side reinforce each other; neither stands alone.

The small-team-companion arc is now four posts deep and closes here. The first companion (post #14) paired with the operator-dashboard architectural walkthrough and described the four-layer permission model in operation. The second companion (post #15) paired with the per-tenant alert routing architectural walkthrough and described the five-layer alert router in operation. The third companion (post #16) paired with the shared-state archiver architectural walkthrough and described the five-layer archiver in operation. This post (post #17) pairs with the multi-tenant probe collector architectural walkthrough and describes the six-layer collector in operation. With this post the small-team-companion arc is complete — every architectural walkthrough in the scale sub-series now has its operational counterpart, and the team that has actually been doing the work has a practical guide for each layer of the stack.

The next deliverable is the Q3 2026 registry audit, landing mid-July 2026. The audit re-runs every probe from all five regions in parallel through the multi-tenant collector designed in post #10 (the architectural reference this post operationalises) and operated by the routines this post walks, with verdicts archived through the system designed in post #12 and operated by the routines from post #16, with cross-tenant suppression measured against the cluster log designed in post #11 and operated by the routines from post #15, and with operator actions during the audit window logged in the audit-log designed in post #13 and operated by the routines from post #14. The audit will report bucket-by-bucket movement vs the Q2 baseline — including how the collector from this post's architectural counterpart held up under the Q3 audit's per-minute write rate, whether the credentialed-probe rollout from post #6 shrunk the auth-walled 16.8% bucket as expected, whether the schema-drift detector from post #4 caught the same 7.1%/48h drift rate or a different one, and the first end-to-end pass through the supervisor's tenant-aware throttle and the verdict-minute coalescer at registry scale. The audit will be the first end-to-end exercise of the entire scale stack — collector, alert router, archiver, operator dashboard — under load, and will be operated by the small-team routines this companion arc has now described in full.

Further reading on AliveMCP

Want to be told before your MCP server dies silently?

AliveMCP probes every public MCP endpoint every 60 seconds, runs the supervisor's tenant-isolating worker pool with cgroup CPU/memory caps and a 50-second wall-clock SIGKILL discipline, mounts credentials only into a noswap tmpfs that the kernel reclaims on every kill, audits the KMS-grant inventory against the live cloud-IAM state every quarter, and gives your own staff a self-serve surface for retention preferences and deletion requests — all from the same multi-tenant stack described across the posts of the scale sub-series and operated by the small-team routines this post and its predecessors walk. Public servers are free; private servers start at $9/mo.

Join the waitlist