Deep dive · 2026-04-30 · Scale sub-series
Multi-tenant MCP probe collector — what changes when the probe stack becomes a service
The four walkthroughs that built the practical-routine series — the credentialed probe, the multi-region wrapper, the public status page, the read-side API and badge — describe a stack that fits one MCP server cleanly. One server, one cron, one set of credentials, one shared-state Redis, one status page, one badge. That stack is what most indie operators should run today; it is also exactly the stack we run at AliveMCP, multiplied by 2,181. Crossing from one to many is not a matter of running the same script in a wider for loop. The probe stack becomes a service, and a service has obligations the script never had — process isolation so a noisy tenant cannot kill the rest, secret stores that never leak across boundaries, fan-out that does not melt the registries, per-tenant rate limiting that maps to billing tiers, shared state that survives ten thousand writes per minute, and a failure-mode catalogue that is specific to operating one probe stack on behalf of many. This post is the architectural walkthrough for that crossing — what changes, what stays, and why.
TL;DR
A multi-tenant MCP probe collector is the single-tenant stack from the credentialed walkthrough and the multi-region wrapper, surrounded by six new layers that exist only because there are now many tenants sharing one collector.
- Tenant isolation: each tenant's probes run in a separate worker process with a CPU and memory cap, so a tenant pointing the probe at a server that hangs for 60s does not stall the other 1,999 probes scheduled for the same minute.
- Per-tenant secret store: the probe credentials from the auth primer are stored under a tenant-scoped key in an envelope-encrypted KMS, and the probe worker mounts only its own tenant's keys at start-up — never the global set.
- Fan-out: the cron-driven scheduler from the single-tenant stack becomes a work-queue dispatcher that emits one job per probe-per-region-per-minute, dequeued by a worker pool sized to the 95th-percentile load, not the worst case.
- Per-tenant rate limiting: probe frequency is capped per tenant per minute, mapped onto the billing tiers — Public reads from the global probes only, Author runs at 60-second cadence on three regions, Team at 60-second cadence on five regions plus credentialed probes, Enterprise gets dedicated workers.
- Shared state at scale: the single Redis from the multi-region walkthrough partitions by tenant prefix; reads stay cheap, writes go through a verdict-minute write coalescer that turns five region writes per server per minute into one per server per minute.
- Billing-aware probe paths: every probe is tagged with the tenant ID and the cost tier, so the read-side API serves only what that tier paid for.
The recipe section sketches the worker pool, the secret-mount loop, the queue partition keys, and the verdict-minute coalesce step in copy-pasteable form. Five new failure modes are specific to multi-tenant operation: noisy-neighbour CPU contention, secret-store cache poisoning across tenants, queue starvation under one Enterprise tenant's burst, registry rate-limit blowback that pages every tenant, and verdict-minute coalesce races that produce the wrong colour for one tenant for 60 seconds. Each gets a paragraph and a mitigation.
Why the single-tenant stack does not scale by repetition
The reflex when an operator first looks at the multi-tenant problem is to wrap the existing probe in a for loop over a list of tenants and call it done. We tried this internally for two days when the AliveMCP collector was first being stood up. It looks plausible on paper; the per-tenant probe is independent, the cron fires every 60 seconds, the script visits each tenant in turn. The reasons it fails are subtle and specific.
The first reason is the blocking-tenant problem. The credentialed probe sequence from the credentialed walkthrough has a 4-second timeout per step. Eight steps, four seconds each, worst case is 32 seconds per probe. A loop over 2,000 tenants in a single process is a worst-case 17.7-hour run — not a 60-second one. Concurrency is therefore non-negotiable; the question is what concurrency primitive. A naive asyncio.gather across 2,000 coroutines or 2,000 goroutines blows the file-descriptor table on the first burst, and a single bad tenant whose origin stalls mid-TCP-read will sit there absorbing 4 seconds of wall clock per step before timing out. A loop with bounded concurrency — say, 50 workers each picking from a queue — is closer to right, but it still suffers from the second reason.
The second reason is the noisy-neighbour problem. Even with 50 workers each handling 40 tenants, one tenant's probe that hits a slow upstream pegs one worker for 32 seconds while the other 49 spin. If five such tenants appear in one minute — common during a regional outage that affects a whole cluster of MCPs — five workers stall and the remaining 45 must finish all 2,000 probes in 60 seconds, which they cannot. The probe latency budget collapses; some tenants miss a probe minute; the verdict-minute timestamps drift; the read-side API starts serving stale answers. None of these are visible in single-tenant testing because there is only one tenant, so noise is signal.
The third reason is the secret-store leakage problem. The single-tenant stack treats credentials as ambient — a .env file mounted at probe start, environment variables for the probe credential, perhaps an OAuth refresh token cached in a local file. Every line of probe code can see every credential. Inside one tenant, that is fine; the credentials all belong to the same operator. The moment the probe stack is shared across tenants, ambient credentials are a confidentiality bug. If tenant A's probe code can read tenant B's environment variables, even by accident in a logging line, the entire trust model of the collector is broken. The fix is not "be careful with logging"; the fix is structural — credentials per tenant are mounted only into the worker that handles that tenant's probes, and the worker's process boundary is the security boundary.
The fourth reason is the shared-state contention problem. The multi-region wrapper from the multi-region walkthrough writes one verdict per server per region per minute to a shared Redis. For one tenant with one server, that is five writes per minute. For 2,000 tenants, with three to twenty servers each, it is 30,000 to 200,000 Redis writes per minute. The Redis instance that worked perfectly for a single tenant becomes a bottleneck and a coordination point. Worse, naive multi-tenant writing creates verdict-minute coalesce races — region eu-west finishes its probe at :00:42 and writes the green verdict, region us-east finishes at :00:58 and writes red, the read-side API queries between those two writes and serves a colour the per-server aggregation rule never produced. Without explicit coalescing, the colour-flicker problem at scale is constant.
The fifth reason is the registry rate-limit problem. The hourly registry crawl from the Q2 audit works fine when one operator runs it for their own use. When the same crawl supports 2,000 tenants who each want a fresh listing, several of them list servers from the same MCP author, and the registries' rate limits are per-source-IP, the collector becomes the loud guy on the registry's nightly graph. The fix is structural too — the crawl deduplicates at the server level (not the tenant level), the discovery probe runs once and the verdict fans out to every tenant who has listed that server, and per-tenant probes only ever talk to the tenant's own MCP servers, never to public registries.
Each of these five reasons is the seed of a layer that has to exist in the multi-tenant collector and does not exist in the single-tenant one. The next sections walk each layer in turn.
Tenant isolation — the worker as the security boundary
Multi-tenant isolation is structural, not procedural. The two structural choices that matter are the unit of isolation and the resource cap that surrounds it. The unit we settled on is one tenant per worker process, sized to the tenant's plan tier. Public-tier listings — the global probes that everyone reads — share a pool of generic workers, because there is no per-tenant credential to protect and the probes are running against public servers anyway. Author, Team, and Enterprise tier probes each get a dedicated worker process, started with the tenant's credentials mounted, and torn down after each probe minute.
The resource cap surrounding each worker is what stops one tenant's probes from pegging the host. The CPU cap is enforced by the container runtime — a worker for an Author tenant gets 0.25 vCPU, a Team tenant gets 0.5 vCPU, an Enterprise tenant gets 1 vCPU. The memory cap is similarly tiered — 128 MB, 256 MB, 512 MB. The wall-clock cap is the same across tiers — 50 seconds total per worker run, leaving 10 seconds of slack inside the 60-second probe minute. If a worker exceeds its wall clock, the supervisor SIGKILLs it; the tenant gets a partial verdict for that minute and a "probe timed out at supervisor level" note attached. The note is non-fatal — it does not page the tenant — but it does show up in their next-day report so an operator can see "your probe ran over the wall clock 4 times this week".
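A minimal sketch of what that cap enforcement can look like, assuming systemd cgroups v2 rather than a full container engine; worker.sh and the tier numbers mirror the text above, everything else is illustrative:
#!/usr/bin/env bash
# run-worker-capped.sh — illustrative: per-tier cgroup caps around one worker
# run. Assumes systemd cgroups v2; swap in your container runtime's flags
# (e.g. docker run --cpus/--memory) if that is what you deploy on.
set -euo pipefail
TIER=${1:?tier} TID=${2:?tenant id}
case "$TIER" in
  author)     CPU="25%"  MEM="128M" ;;
  team)       CPU="50%"  MEM="256M" ;;
  enterprise) CPU="100%" MEM="512M" ;;
  *)          echo "unknown tier: $TIER" >&2; exit 1 ;;
esac
# 50s wall clock leaves 10s slack inside the probe minute. timeout sends
# SIGKILL directly to the worker, so the cap holds even if it ignores SIGTERM.
systemd-run --scope --quiet \
  -p CPUQuota="$CPU" -p MemoryMax="$MEM" \
  timeout --signal=KILL 50 \
  env TENANT_ID="$TID" ./worker.sh \
  || echo "tenant $TID: probe timed out at supervisor level" >&2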
Isolation has a less obvious benefit: blast-radius containment for credential bugs. If a tenant's probe credential rotates, and the probe code has a bug that crashes on the new credential format, only that tenant's worker crashes. The supervisor restarts the worker, marks the tenant's probe as "error" for that minute, and continues. The other 1,999 tenants are untouched. In an ambient-credentials single-process design, the same crash takes down the whole probe minute for everyone.
The supervisor itself is a small piece of code — a few hundred lines — and is one of the few components shared across all tenants. It must be paranoid. Its job is to read the tenant manifest, fan out one job per tenant per region per minute, pull a worker image from the local cache, mount only that tenant's secrets and only that tenant's region-specific Redis credentials, start the worker, time it, kill it on overrun, and aggregate the verdict it emits over a Unix socket back to the supervisor. The supervisor never reads tenant secrets directly; it reads references to tenant secrets, and the secret-mount step happens inside the worker container's start-up so the supervisor's process memory never holds them.
One of the more counter-intuitive tenant-isolation rules: workers do not talk to other workers. The temptation is to share a database connection pool, an HTTP client, an OAuth-discovery cache. Don't. Every worker has its own ephemeral connections, its own short-lived caches, and its results land in shared Redis only after the verdict is sealed. The cost is a few extra TCP handshakes per probe minute; the benefit is that a memory-corruption bug in one worker's HTTP client cannot leak bytes into another worker's request. The single-tenant stack treats this as overkill; the multi-tenant stack treats it as the floor.
Per-tenant secret store — KMS envelopes, scoped mounts, no ambient credentials
The auth primer covered the four authentication patterns in the wild — bearer token, API key in custom header, OAuth 2.1 with PKCE, mTLS. For a single-tenant probe, those credentials sit in a .env file or an environment variable. For a multi-tenant collector, that pattern is structurally unsafe; ten thousand tenants' bearer tokens cannot share one process's environment without one of them eventually leaking via a stack trace, a log line, or a misconfigured error reporter.
The model that works is envelope encryption with per-tenant keys, served by a Key Management Service (KMS — AWS KMS, Google Cloud KMS, HashiCorp Vault transit). The credential ciphertexts live in a relational table keyed by (tenant_id, server_slug, credential_kind); the data key that decrypts each row is itself encrypted by a tenant-scoped KMS key. To read tenant A's credentials, the worker must hold an IAM credential that is authorised to call Decrypt on tenant A's KMS key — and only tenant A's. The IAM credential is mounted into the worker container by the supervisor at start-up via a short-lived (5-minute) signed token; the token is scoped to one KMS key, one tenant, one start time, one container instance.
The flow per probe minute is:
- Supervisor decides which tenant's worker to start. Reads the tenant manifest, looks up the tenant's KMS key ARN, mints a signed 5-minute IAM token scoped to that ARN, that tenant ID, and the worker's container ID. Mints nothing else.
- Supervisor starts the worker container with the tenant ID and the signed token in its environment. The signed token is the only credential in the worker's environment.
- Worker boots, reads the tenant ID and signed token, calls KMS Decrypt with the signed token to unwrap the data key. KMS verifies the token is correctly scoped and time-valid, authorises the call, and returns the data key.
- Worker uses the data key to decrypt the tenant's credentials from the relational table. It decrypts only the rows for this tenant — the table query is parameterised on the tenant ID, and the database role the worker uses is row-scoped to that tenant by a Postgres row-security policy.
- Worker runs the credentialed probe sequence from the credentialed walkthrough, using the now-in-memory credentials. Plaintext credentials never touch disk; never enter logs (the worker's logger redacts on a regex of the credential's known prefixes); never leave the worker process.
- Worker writes the verdict to shared Redis, exits. Container is reaped. The signed token is now expired; even if leaked, it cannot be used.
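The write path — how a credential ciphertext gets into that table at onboarding — is the mirror image of the data-key unwrap above. A sketch assuming AWS KMS and Postgres pgcrypto; the variable names and the BEARER_TOKEN input are illustrative:
# Illustrative onboarding step: envelope-encrypt one credential for one tenant.
TID="t_team_042"
KEY_ARN="arn:aws:kms:us-east-1:...:key/abc-123"   # the tenant's KMS key
# 1. Ask KMS for a fresh data key under the tenant's KMS key.
DK_JSON=$(aws kms generate-data-key --key-id "$KEY_ARN" --key-spec AES_256)
DK_PLAIN=$(echo "$DK_JSON" | jq -r .Plaintext)        # base64; used once, in memory
DK_CIPHER=$(echo "$DK_JSON" | jq -r .CiphertextBlob)  # stored alongside the tenant
# 2. Encrypt the credential under the data key; store only ciphertexts. The
#    base64 text is the passphrase on both sides — the worker's pgp_sym_decrypt
#    uses the same string after unwrapping DK_CIPHER via KMS Decrypt.
#    (In production the token would not pass through a shell variable.)
psql -c "INSERT INTO creds (tenant_id, server_slug, credential_kind, ciphertext)
         VALUES ('$TID', 'acme-private-search', 'bearer',
                 pgp_sym_encrypt('$BEARER_TOKEN', '$DK_PLAIN'))"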
Five rules that fall out of this model and are easy to violate accidentally:
- The supervisor must never log the signed token. Most logging frameworks log environment-variable contents on container start unless explicitly told not to. The supervisor's logger has a redaction pass on environment-variable values that match the signed-token format (typically a base64 prefix). The redaction is regex-based and is unit-tested with adversarial cases.
- OAuth refresh tokens are stored encrypted, but the OAuth-discovery cache is per-tenant too. The MCP OAuth flow caches the auth-server URL and the client metadata; if the cache is shared across tenants, a malicious tenant's MCP server can return a poisoned discovery document and influence another tenant's auth flow. Cache-key the discovery on (tenant_id, server_slug), not on server_slug alone.
- Database row-security is the second line of defence, not the first. The worker's database role has a Postgres row-security policy that enforces tenant_id = current_setting('app.tenant_id') on every read. The worker sets the session variable at boot from the validated tenant ID. If the worker code has a bug that joins across tenants, the row-security policy still rejects the rows. This is a common pattern in multi-tenant Postgres deployments; for the probe collector it is non-optional (a sketch of the table and policy follows this list).
- Credential rotation is per-tenant and ladder-driven. The probe-credential watchdog from the credentialed walkthrough still runs at the 30/7/3-day escalation tiers, but per-tenant. The supervisor's tenant manifest holds the rotation state for each credential; on rotation, the new credential is encrypted under the same tenant KMS key, the old one is left readable for 24 hours so a slow-rolling probe doesn't fail mid-window, and after 24 hours the old ciphertext is deleted.
- Decryption costs are per-call to KMS. Calling Decrypt 200,000 times per minute is expensive both in latency and in dollars. The worker decrypts the tenant data key once at boot, then uses it locally for the remaining steps in the probe minute. KMS is called once per worker per minute, not once per credential.
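The row-security policy the third rule leans on is a few lines of SQL. A sketch of the credential table and policy, using the names the recipe below uses (creds, app.tenant_id); the probe_worker role name is illustrative:
psql <<'SQL'
-- Per-tenant credential rows; ciphertexts are envelope-encrypted under the
-- tenant's KMS data key before they land here.
CREATE TABLE IF NOT EXISTS creds (
  tenant_id       text  NOT NULL,
  server_slug     text  NOT NULL,
  credential_kind text  NOT NULL,  -- bearer | api_key | oauth | mtls
  ciphertext      bytea NOT NULL,
  rotated_at      timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (tenant_id, server_slug, credential_kind)
);

-- Second line of defence: even a buggy cross-tenant JOIN returns zero rows.
ALTER TABLE creds ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_scope ON creds
  USING (tenant_id = current_setting('app.tenant_id'));

-- The worker's role must not own the table, or the policy is bypassed.
GRANT SELECT ON creds TO probe_worker;
SQL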
Fan-out — work queues, worker pools, and probe-minute discipline
Single-tenant probes are scheduler-driven: a cron fires every 60 seconds, the probe runs, the verdict lands. Multi-tenant probes are queue-driven: a scheduler fires every 60 seconds, fans out one job per tenant per region per minute, and a pool of workers consumes from the queue. The choice of queue is consequential.
The queue we use is a Redis list per region — q:probes:us-east, q:probes:us-west, q:probes:eu-west, q:probes:ap-southeast, q:probes:sa-east. Each region's list lives in the regional Redis the worker pool there reads from. The reasons are mundane and load-bearing:
- Per-region queues match per-region workers. The multi-region walkthrough decouples regions; the queue infrastructure should not re-couple them. A region failure should isolate to that region, not back-pressure the global queue.
- Job shape is small and deterministic. A job is JSON, ~200 bytes — {tenant_id, server_slug, region, minute, kind: "credentialed" | "public"}. The credentials are not in the job; the worker fetches them from KMS on dequeue, scoped to the tenant ID in the job.
- Dequeue is atomic and at-least-once. BLPOP with a 50-second timeout. If a worker dies after dequeue but before writing the verdict, the supervisor's per-tenant heartbeat misses, the supervisor re-enqueues the job, and the next worker picks it up within the same minute (a sketch of that sweep follows this list). Duplicate verdicts are possible and are handled by the verdict-minute coalescer (next section). The trade-off is that a tail-end worker can produce a duplicate write; the coalescer collapses it.
- The scheduler is one process per region. Each scheduler reads the tenant manifest at boot, computes the per-tenant probe schedule for the next minute, and enqueues exactly that many jobs at the minute boundary. The scheduler's own clock skew is the dominant source of minute-boundary error; we run NTP on the scheduler hosts and gate the enqueue on NTP being healthy. If NTP is unhealthy, the scheduler refuses to enqueue and a backup scheduler in another region takes over within 90 seconds.
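The heartbeat sweep that the dequeue bullet references can be its own small cron, run ~30 seconds into the minute. A sketch, assuming each worker SETs a short-TTL heartbeat key (hb:t:{tid}:s:{slug}:m:{minute}) immediately after dequeue — that key shape is hypothetical and not part of the recipe below; budget caps are elided for brevity:
#!/usr/bin/env bash
# reenqueue-missed.sh — illustrative at-least-once sweep; runs mid-minute.
set -euo pipefail
REGION=${REGION:?required}
TENANT_MANIFEST=${TENANT_MANIFEST:-/etc/alivemcp/tenants.json}
MINUTE=$(date -u +%Y-%m-%dT%H:%M:00Z)
jq -c '.tenants[] | select(.tier != "public")' "$TENANT_MANIFEST" \
| while read -r tenant; do
    TID=$(echo "$tenant" | jq -r .id)
    TIER=$(echo "$tenant" | jq -r .tier)
    echo "$tenant" | jq -c '.servers[]' | while read -r server; do
      SLUG=$(echo "$server" | jq -r .slug)
      HB="hb:t:$TID:s:$SLUG:m:$MINUTE"
      if [ "$(redis-cli -h "redis-$REGION" EXISTS "$HB")" = "0" ]; then
        # No heartbeat: never dequeued, or the worker died mid-probe.
        # Re-enqueue; the verdict-minute coalescer collapses any duplicate
        # write the dead worker may still have produced.
        KIND=$(echo "$server" | jq -r '.credentialed // false
                 | if . then "credentialed" else "public" end')
        JOB=$(jq -nc --arg tid "$TID" --arg slug "$SLUG" \
              --arg region "$REGION" --arg minute "$MINUTE" \
              --arg kind "$KIND" --arg tier "$TIER" \
              '{tenant_id:$tid, server_slug:$slug, region:$region,
                minute:$minute, kind:$kind, tier:$tier}')
        redis-cli -h "redis-$REGION" RPUSH "q:probes:$REGION" "$JOB"
      fi
    done
  done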
The worker pool is sized for the 95th-percentile load, not the worst case. We measured the per-minute job count for our own deployment over 30 days — the median is 11,200 jobs per minute, the 95th percentile is 13,600, the maximum is 18,400 (during a registry-import burst). The pool size is computed as p95 jobs / p95 jobs-per-worker-per-minute. A typical worker handles 240 jobs per minute (4 jobs per second sustained, with burst). 13,600 / 240 ≈ 57 workers. We run 64 to leave headroom. The extra 4,800 jobs in the worst minute spill roughly 10 seconds past the window; the read-side API absorbs the lateness thanks to stale-while-revalidate and the verdict-minute coalescer's tolerance of late writes.
The single most-common multi-tenant fan-out bug is probe-minute discipline drift. A worker takes 47 seconds because the upstream is slow; finishes at :00:47; the verdict's as_of field is :00:00 because that was the minute boundary; the read-side API serves it as the verdict for that minute. The next minute begins; the upstream is faster; the worker finishes at :01:09; the as_of is :01:00; the verdict is still valid for minute :01 because the discipline says "the verdict applies to the minute it was scheduled for, not the minute it finished in". The discipline is enforced inside the worker — the as_of is computed from the job's minute field, not from the worker's wall-clock at write time. Single-tenant stacks usually have the script and the cron tightly coupled; multi-tenant stacks have to make this explicit.
Per-tenant rate limiting — billing tiers as probe budgets
Probe frequency is the most expensive lever in the collector. Doubling the cadence from 60 seconds to 30 seconds doubles the worker count; doubling the regions from three to five increases per-tenant probe count by 67%; turning on credentialed probes adds 8 steps' worth of latency to each probe minute. The product surface that maps these levers to billing is what makes the collector economically viable.
Our four-tier mapping (matching the public pricing):
- Public (free) — the tenant has no probes of their own. The verdict for any public-listed server is computed from the global probes that AliveMCP runs on every server in every registry; the tenant just reads it. No worker is allocated; no credentials are stored; no rate limit applies because there is no per-tenant probe to limit.
- Author ($9/mo) — the tenant owns one to three claimed listings. Each claimed listing gets a 60-second probe in three regions (us-east, eu-west, ap-southeast), unauthenticated. Total: 3 probes per listing per minute (up to 9 per tenant), or ~129,600 per listing per month. The rate limit is structural — the tenant has to claim each listing individually, and "claim" is per-server-slug; we cap at 3 in software, but in practice the tier's value collapses past 3 anyway.
- Team ($49/mo) — the tenant owns up to ten servers, can mix public and private. Each server gets a 60-second probe in five regions; credentialed probes are supported (the auth primer's Posture C and D). Total: ~50 probes per tenant per minute, ~2.16M per month. The rate limit is enforced at the scheduler — the scheduler reads the tenant manifest, which carries a hard cap of 10 server slots; the 11th server is rejected at write time, not at probe time. Slack and PagerDuty alert routing is part of the tier.
- Enterprise (custom) — the tenant owns 30+ servers, often with mTLS or OAuth, and has a dedicated worker pool. The rate limit is contractual, not structural; the cadence can drop to 15 seconds for critical servers, regions can be added (e.g., ap-northeast for Tokyo-anchored MCPs), and the worker isolation is at the host level — the tenant's workers do not share a host with other tenants' workers. The shared Redis is per-region but the schema is namespaced by the Enterprise tenant ID.
The mistake to avoid in this design is letting the rate limit be enforced at the wrong layer. The temptation is to rate-limit at the worker — "if this tenant has run more than 50 probes this minute, drop the job." This works under light load and fails in two ways. First, by the time the worker dequeues the job, the supervisor has already mounted credentials and started a worker, so the wasted cost is most of the per-probe cost. Second, dropped jobs do not produce a verdict, so the read-side API serves a stale answer for that minute, and the tenant cannot tell whether the probe was rate-limited or the upstream was slow.
The correct layer is at the scheduler. The scheduler reads the tenant manifest, which carries the tier's probe budget; the scheduler emits exactly that many jobs per tenant per minute, no more. If the tenant is somehow flagged as exceeding their budget, the scheduler emits zero jobs and surfaces a "probe budget exhausted — see plan upgrade" notice on the dashboard and the read-side API. The decision is visible to the tenant in real-time; the worker pool is not wasted on rate-limited jobs; the verdict-minute is served as the last good verdict with a "stale" flag.
A specific Enterprise edge case worth calling out: an Enterprise tenant with 15-second cadence does not run 4× more probes per minute against their MCP than a Team tenant does at 60-second cadence. The MCP server is the tenant's own server; the probe budget is for the AliveMCP collector, not for the tenant's MCP. We coordinate with the Enterprise tenant on a per-server probe budget so the collector doesn't accidentally turn into a small DDoS against the tenant's own infrastructure. The contract clause that codifies this is short — "AliveMCP probes will not exceed N requests per minute per server endpoint, where N defaults to 60 and can be lowered on request" — and is paired with a probe-rate dashboard the tenant can read.
Shared state at scale — partitioning, write coalescing, and the verdict-minute lock
The single Redis from the multi-region walkthrough handled five writes per server per minute. The multi-tenant Redis handles the same five writes, multiplied by 2,000 to 10,000 servers. Three changes are required to keep that workable.
The first change is tenant-prefixed key namespace. Every key in the shared state carries the tenant ID as a prefix: v1:tenant:{tid}:server:{slug}:region:{region}:verdict:{minute}, abbreviated to v1:t:{tid}:s:{slug}:r:{region}:m:{minute} in the recipe below. The prefix is used both for partitioning (Redis Cluster routes keys to slots by hash; including the tenant ID in the slot key spreads tenants across the cluster evenly) and for access control (the worker's Redis ACL is configured with a key-pattern restriction on the tenant ID prefix). A tenant's worker physically cannot read or write another tenant's keys; the Redis ACL enforces this at the wire-protocol level. Misconfigured workers fail loudly on the first write to a key outside their prefix, which surfaces in the supervisor's startup logs.
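Redis ACLs (Redis 6+) can express that key-pattern restriction directly. An illustrative provisioning line — user naming and password handling are elided, and in practice the dequeue step runs under a separate pool-level user, since the queue key q:probes:{region} is not tenant-prefixed:
# Illustrative: a per-tenant Redis ACL user locked to the tenant's key prefix.
TID="t_team_042"
redis-cli -h "redis-us-east" ACL SETUSER "worker-$TID" on ">$WORKER_PASSWORD" \
  "~v1:t:$TID:*" +get +set +exists +evalsha \
  resetchannels "&verdict-sealed" +publish
# Any GET/SET outside ~v1:t:{tid}:* now fails with a NOPERM error at the
# protocol level — the misconfigured worker surfaces in startup logs.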
The second change is verdict-minute write coalescing. Five regions write five verdicts per server per minute. The read-side API consumers want one verdict per server per minute. Without coalescing, a read between region writes can see "us-east wrote up at :42, eu-west wrote down at :48, sa-east hasn't written yet". The colour the API serves between :42 and :48 would be wrong — the two-of-N aggregation rule needs at least two region writes in before any verdict is known, and all five before it is final.
The coalescer is a per-server lightweight Lua script that runs on the regional Redis. Each region writes its verdict to a region-tagged key (v1:t:{tid}:s:{slug}:r:{region}:m:{minute}). After the write, the script checks the count of region keys for that (tenant, server, minute); once at least two are present, it computes the two-of-N aggregate, writes it to the canonical verdict key (v1:t:{tid}:s:{slug}:verdict:{minute}) — refining it as later regions land, and marking it partial until all five are in — and emits a "verdict-sealed" event on a Redis Pub/Sub channel. The read-side API reads the canonical verdict key. Until the canonical key exists, the API serves the previous minute's canonical verdict with a last_probe_ago that reflects the truth — "37s" for last minute, "97s" for two minutes ago. It never assembles an answer from raw region keys; only sealed canonical verdicts are served.
The Lua-script approach is the right primitive because Redis runs it atomically; the count-and-aggregate has no race. The alternative — a Postgres SERIALIZABLE transaction across a verdict table — works but adds 30–50 ms of latency per write and burns Postgres connection capacity on writes that are inherently temporary. Redis Lua keeps the verdict on the hot path and lets Postgres own the long-term history table (one row per server per minute, written by a per-region archiver process that runs out-of-band).
The third change is read-amplification pacing. With 2,000 tenants reading the API at 5–60 second cadences, the regional Redis would absorb millions of GETs per minute. The API server caches each verdict for the verdict-minute window plus 60 seconds of stale-while-revalidate; cache hit rates exceed 99% in steady state. The read path that goes to Redis is reserved for cache misses (typically the first read of a new server slug) and verdict-sealed events; the verdict-sealed event invalidates the API server's cache for that key.
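The invalidation half of that cache is driven by the verdict-sealed channel. A sketch of a subscriber, assuming a file-per-key local cache under /var/cache/alivemcp (the production API server does the same in memory):
# Illustrative: drop the cached copy of a verdict the moment it re-seals.
# redis-cli prints three lines per event: "message", the channel, the payload.
redis-cli -h "redis-$REGION" SUBSCRIBE verdict-sealed \
| while read -r line; do
    case "$line" in
      v1:t:*:verdict:*)
        # Payload line — the canonical verdict key that just sealed.
        rm -f "/var/cache/alivemcp/$(echo "$line" | tr ':' '_')"
        ;;
    esac
  done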
One subtle property of the verdict-minute coalescer is that partial-verdict edge cases are bounded. If region us-east is itself unreachable from the worker pool — say the AliveMCP us-east region has a network partition for 20 minutes — the coalescer seals on the 4 of 5 region writes that do arrive and emits a verdict tagged partial:true for those 20 minutes. The read-side API surfaces it with the same green/amber/red but adds a "based on 4 of 5 regions" note on the per-server status page. The two-of-N rule is robust to one missing region; the operator dashboard pages when three or more are missing.
Billing-aware probe paths — the cost tier as a request property
In a single-tenant probe stack, "the probe" is a single thing — the operator runs it, the verdict comes back, the cost is the operator's compute bill. In a multi-tenant collector, "the probe" is parameterised by what the tenant is paying for. A Public-tier reader and a Team-tier subscriber both ask the read-side API for the same server slug and get an answer; the Team-tier subscriber's answer includes credentialed-probe data and per-region detail, the Public-tier reader's answer includes only the global verdict. The machinery that serves these two answers correctly is the billing-aware probe path.
The principle is to tag every verdict with the cost tier that produced it. The Lua coalescer writes the canonical verdict with a tier field — public, author, team, enterprise. The read-side API checks the requesting tenant's tier against the verdict's tier and serves only fields that are at-or-below the tenant's authorisation. A Public-tier reader querying a Team-tier server's status sees only state and uptime_30d; a Team-tier subscriber gets the full shape from the read-side walkthrough; an Enterprise tenant gets the full per-region detail and the credentialed probe step status.
The mistake here is to do the field filtering at the database layer. The verdict is a single Redis blob; trying to slice it differently for different readers at query time means scattering the slice rules across the read path, which is exactly where tenant authorisation bugs hide. The right approach is to store the verdict whole and filter at the API boundary, with the boundary's filter logic in a single tested function. The function takes (verdict, requesting_tenant_tier) and returns a JSON object that is provably a subset of the input. The filter is unit-tested with each tier as input, and integration-tested against a synthetic verdict that contains every possible field.
One frequent confusion: a tenant's own server-status read is different from another tenant reading that server. Team-tier tenant Alice has a server. Bob is a Public-tier reader; he asks the API for Alice's server. Alice gets the full shape (her own server, her own tier). Bob gets only the public fields. The filter takes both the verdict tier and the (tenant, server) ownership relationship into account. The filter signature in our codebase is (verdict, viewer_tenant_id, viewer_tier, server_owner_tenant_id) -> subset(verdict), and the implementation is one function with five branches, mirrored to the four tiers plus the cross-tenant case.
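The filter's shape can be sketched with jq; the field names below are illustrative stand-ins for the read-side walkthrough's shape, and the production version is the Go function described above:
#!/usr/bin/env bash
# filter-verdict.sh — illustrative API-boundary filter.
# Usage: filter-verdict.sh verdict.json <viewer_tier> <viewer_is_owner>
set -euo pipefail
VERDICT=${1:?verdict json} VIEWER_TIER=${2:?tier} IS_OWNER=${3:?true|false}
if [ "$IS_OWNER" != "true" ]; then
  # Cross-tenant reads always collapse to the public subset,
  # regardless of the viewer's own tier.
  jq '{state, uptime_30d}' "$VERDICT"
  exit 0
fi
case "$VIEWER_TIER" in
  enterprise) jq '.' "$VERDICT" ;;  # full per-region + credentialed detail
  team)       jq '{state, uptime_30d, regions, credentialed_steps}' "$VERDICT" ;;
  author)     jq '{state, uptime_30d, regions}' "$VERDICT" ;;
  *)          jq '{state, uptime_30d}' "$VERDICT" ;;
esac
Each branch can only project fields that already exist in the input, which is the provably-a-subset property the text asks for.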
Recipe — the worker pool, the secret mount, the queue partition, the coalescer
The architectural sketch is the load-bearing part of this post; the recipe below is a deliberately simplified reference implementation, suitable for adapting to your stack. Production code at AliveMCP has retries, observability hooks, and operator escape hatches that are out of scope here.
1. The supervisor (per region, one process)
#!/usr/bin/env bash
# supervisor.sh — runs once per minute via cron in each region
set -euo pipefail
REGION=${REGION:?required}
TENANT_MANIFEST=${TENANT_MANIFEST:-/etc/alivemcp/tenants.json}
NTP_HEALTH=${NTP_HEALTH:-/usr/bin/chronyc -n tracking}

# Refuse to run if NTP isn't healthy (clock-skew protection).
$NTP_HEALTH | grep -q 'Leap status : Normal' || { echo "NTP unhealthy"; exit 1; }
MINUTE=$(date -u +%Y-%m-%dT%H:%M:00Z)

# For each tenant in the manifest, enqueue one job per (server, region).
jq -c '.tenants[] | select(.tier != "public")' "$TENANT_MANIFEST" \
| while read -r tenant; do
    TID=$(echo "$tenant" | jq -r .id)
    TIER=$(echo "$tenant" | jq -r .tier)
    SERVERS=$(echo "$tenant" | jq -c '.servers[]')
    # Per-tier probe budget enforced here (scheduler-side, not worker-side).
    case "$TIER" in
      author)     MAX=3 ;;
      team)       MAX=10 ;;
      enterprise) MAX=$(echo "$tenant" | jq -r '.enterprise_max // 100') ;;
      *)          echo "unknown tier '$TIER' for tenant $TID" >&2; continue ;;
    esac
    COUNT=0
    echo "$SERVERS" | while read -r server; do
      COUNT=$((COUNT + 1))
      [ "$COUNT" -gt "$MAX" ] && break
      SLUG=$(echo "$server" | jq -r .slug)
      KIND=$(echo "$server" | jq -r '.credentialed // false | if . then "credentialed" else "public" end')
      JOB=$(jq -nc --arg tid "$TID" --arg slug "$SLUG" \
            --arg region "$REGION" --arg minute "$MINUTE" \
            --arg kind "$KIND" --arg tier "$TIER" \
            '{tenant_id:$tid, server_slug:$slug, region:$region, minute:$minute, kind:$kind, tier:$tier}')
      redis-cli -h "redis-$REGION" RPUSH "q:probes:$REGION" "$JOB"
    done
  done
# Workers will dequeue with BLPOP for the next 50 seconds.
2. The worker (one process per dequeued job)
#!/usr/bin/env bash
# worker.sh — dequeues, mints scoped credentials, runs probe, writes verdict
set -euo pipefail
REGION=${REGION:?required}
ACCT=${ACCT:?AWS account id, required}
COALESCER_SHA=${COALESCER_SHA:?required}   # SHA of coalescer.lua, loaded at deploy (see section 3)
WORKER_ID=$(uuidgen)
TOKEN_TTL=900   # STS floor is 900s; the text's 5-minute scope assumes a token
                # service that allows shorter TTLs
# The 50-second wall clock is enforced by the supervisor; inside the worker
# it only bounds the BLPOP wait below.

# Block-pop a job for up to 50s; exit clean if no work.
JOB=$(redis-cli -h "redis-$REGION" BLPOP "q:probes:$REGION" 50 | tail -1)
[ -z "$JOB" ] && exit 0
TID=$(echo "$JOB" | jq -r .tenant_id)
SLUG=$(echo "$JOB" | jq -r .server_slug)
MINUTE=$(echo "$JOB" | jq -r .minute)
KIND=$(echo "$JOB" | jq -r .kind)
TIER=$(echo "$JOB" | jq -r .tier)

# Assume the tenant-scoped role. The supervisor's IAM role can assume any
# tenant's role; the resulting credentials are single-tenant.
ROLE_CREDS=$(aws sts assume-role \
  --role-arn "arn:aws:iam::$ACCT:role/probe-tenant-$TID" \
  --role-session-name "$WORKER_ID" \
  --duration-seconds "$TOKEN_TTL" \
  --output json | jq .Credentials)
export AWS_ACCESS_KEY_ID=$(echo "$ROLE_CREDS" | jq -r .AccessKeyId)
export AWS_SECRET_ACCESS_KEY=$(echo "$ROLE_CREDS" | jq -r .SecretAccessKey)
export AWS_SESSION_TOKEN=$(echo "$ROLE_CREDS" | jq -r .SessionToken)

# Decrypt the tenant data key once; use it locally for row-decryption.
# (Plaintext comes back base64-encoded and is used as-is as the passphrase;
# get_tenant_dk_ciphertext fetches the tenant's encrypted data key, base64.)
DK=$(aws kms decrypt \
  --ciphertext-blob "$(get_tenant_dk_ciphertext "$TID")" \
  --query Plaintext --output text)

if [ "$KIND" = "credentialed" ]; then
  # Pull the credential row. Postgres row-level security checks app.tenant_id
  # (set via connection options) against every row the query touches.
  CREDS_JSON=$(PGOPTIONS="-c app.tenant_id=$TID" psql -A -t -c \
    "SELECT pgp_sym_decrypt(ciphertext, '$DK') FROM creds
     WHERE tenant_id='$TID' AND server_slug='$SLUG' LIMIT 1")
  # Run credentialed-probe sequence from the credentialed walkthrough
  VERDICT=$(./probe-credentialed.sh "$SLUG" "$CREDS_JSON" "$REGION")
else
  VERDICT=$(./probe-public.sh "$SLUG" "$REGION")
fi

# Tag with tier and minute; the Redis Lua coalescer does the aggregation.
# as_of comes from the job's minute field, not the wall clock (probe-minute
# discipline).
KEY="v1:t:$TID:s:$SLUG:r:$REGION:m:$MINUTE"
PAYLOAD=$(echo "$VERDICT" \
  | jq -c --arg minute "$MINUTE" --arg tier "$TIER" \
      '. + {as_of: $minute, tier: $tier}')
redis-cli -h "redis-$REGION" SET "$KEY" "$PAYLOAD" EX 600

# Trigger the per-server coalescer
redis-cli -h "redis-$REGION" EVALSHA "$COALESCER_SHA" 1 \
  "v1:t:$TID:s:$SLUG:m:$MINUTE" "$REGION"
# Plaintext credentials leave scope here; container exit reaps memory.
3. The verdict-minute coalescer (Redis Lua)
-- coalescer.lua
-- Called as: EVALSHA <sha> 1 v1:t:{tid}:s:{slug}:m:{minute} <writing_region>
-- Aggregates per the two-of-N rule and seals the canonical verdict.
-- ARGV[1] (the writing region) is kept for audit logging upstream; unused here.
local key_prefix = KEYS[1] -- e.g. v1:t:abc:s:ant:m:2026-04-30T12:34:00Z
local tenant, slug, minute = key_prefix:match("v1:t:(.-):s:(.-):m:(.+)")
local regions = {"us-east","us-west","eu-west","ap-southeast","sa-east"}
local present = 0
local up, down, degraded = 0, 0, 0
for _, r in ipairs(regions) do
  local k = "v1:t:"..tenant..":s:"..slug..":r:"..r..":m:"..minute
  local v = redis.call("GET", k)
  if v then
    present = present + 1
    local state = cjson.decode(v).state
    if state == "up" then up = up + 1
    elseif state == "down" then down = down + 1
    else degraded = degraded + 1 end
  end
end
local canonical_state = "degraded"
if up >= 2 and down == 0 then canonical_state = "up"
elseif down >= 2 then canonical_state = "down" end
local canonical_key = "v1:t:"..tenant..":s:"..slug..":verdict:"..minute
-- Only write the canonical verdict once we have at least 2 regions in,
-- and never overwrite a more-complete verdict with a less-complete one
-- (this also makes duplicate writes from re-enqueued jobs idempotent).
local existing = redis.call("GET", canonical_key)
if existing and cjson.decode(existing).regions_present >= present then
  return present
end
if present >= 2 then
  local payload = cjson.encode({
    state = canonical_state,
    regions_present = present,
    partial = present < 5,
    as_of = minute
  })
  redis.call("SET", canonical_key, payload, "EX", 7200)
  redis.call("PUBLISH", "verdict-sealed", canonical_key)
end
return present
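The COALESCER_SHA the worker passes to EVALSHA comes from loading the script once per regional Redis, typically at deploy time:
# Load once per regional Redis; workers reference the script by SHA.
COALESCER_SHA=$(redis-cli -h "redis-$REGION" SCRIPT LOAD "$(cat coalescer.lua)")
export COALESCER_SHA
# Note: a Redis restart flushes the script cache and EVALSHA starts failing
# with NOSCRIPT — re-run the loader (or have workers fall back to EVAL).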
4. Tenant manifest (the supervisor's input)
// /etc/alivemcp/tenants.json (owned by the dashboard's writer; read-only here)
{
"tenants": [
{ "id": "t_anth_001", "tier": "team",
"servers": [
{ "slug": "anthropic-server-everything", "credentialed": false },
{ "slug": "anthropic-server-fetch", "credentialed": false }
] },
{ "id": "t_team_042", "tier": "team",
"servers": [
{ "slug": "acme-private-search", "credentialed": true,
"kms_key_arn": "arn:aws:kms:us-east-1:...:key/abc-123" }
] },
{ "id": "t_ent_009", "tier": "enterprise",
"enterprise_max": 50,
"dedicated_workers": true,
"servers": [ /* 30+ entries */ ] }
]
}
This is the minimum viable shape. To run it as a service you need at least: a per-tenant probe-rate dashboard the tenant can see, a per-tenant alert-routing config (Slack, PagerDuty, email), the verdict-history archiver to Postgres, a per-region health monitor on the Redis itself (a watchdog that checks regional Redis health every 30 seconds and pages on partition), and a tenant-onboarding flow that registers the KMS key and provisions the IAM role in one step. None of those are in the recipe; all of them follow the same isolation rules the recipe sketches.
Five failure modes specific to multi-tenant operation
Each of these has bitten us at least once in production. Each has a structural fix, not a procedural one.
- Noisy-neighbour CPU contention. Symptom: per-tenant probe latency p95 climbs by 40% on a host that's running 30 workers; one of those workers happens to be doing a slow JSON-RPC parse on a malformed response. Fix: enforce per-worker CPU caps via cgroups (already in the worker model above); pin worker count per host so the 30 is the cap, not the median; the supervisor refuses to start a 31st worker on a host. The contention surfaces as queue depth growth, not as silent latency drift.
- Secret-store cache poisoning across tenants. Symptom: tenant A's OAuth flow starts working with tenant B's authorization-server URL after tenant A claims a server with a similar slug. Cause: the OAuth-discovery cache was keyed on server_slug alone. Fix: cache-key on (tenant_id, server_slug); treat the discovery cache as tenant-scoped state, not global state. The bug is cheap to fix and easy to introduce; tests for cross-tenant cache isolation are a fixed cost worth paying once.
- Queue starvation under one Enterprise tenant's burst. Symptom: an Enterprise tenant doubles their server count overnight from 30 to 60; the next minute the queue absorbs 60 extra jobs all destined for the same regional worker pool; mid-tier tenants' jobs sit in the queue past the 50-second timeout. Fix: per-tenant queue partitioning. The job's tenant ID is hashed into a slot key; a per-tenant max-in-queue is enforced; once a tenant has 60 jobs in flight, their 61st is held back to the next minute and a warning surfaces on their dashboard (a sketch of the in-flight cap follows this list). Other tenants' jobs are unaffected. The contention is bounded by tenant, not by the global queue.
- Registry rate-limit blowback. Symptom: the hourly registry crawl gets 429s from MCP.so for 90 minutes after a re-deploy; every tenant's "discovered servers" list goes stale; the dashboard pages every tenant about "discovery failed". Fix: dedupe the crawl at the source — one crawl per registry, results fanned out to every tenant's discovery cache; back off on 429s with exponential delay; the per-tenant alert path checks for "global discovery degraded" state and suppresses the per-tenant page for the duration. The tenant sees one global notice, not ten thousand individual ones.
- Verdict-minute coalesce races. Symptom: between region writes for one server in one minute, the read-side API serves a green colour, then amber, then red, before sealing on green. The sealed state is correct; the in-window flicker is wrong. Fix: the coalescer (above) refuses to write a canonical verdict until at least 2 of 5 regions are in. The API serves the previous minute's verdict during the in-window period, with last_probe_ago incrementing by seconds. The flicker is gone because there's nothing to flicker — the API serves one verdict at a time.
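The in-flight cap from the queue-starvation fix is a few lines on the scheduler's enqueue path. A sketch that extends the supervisor recipe's inner loop; the inflight: key name is hypothetical:
# Illustrative guard inside the supervisor's enqueue loop: cap jobs in
# flight per tenant per minute before RPUSHing.
MAX_INFLIGHT=60
INFLIGHT_KEY="inflight:t:$TID:m:$MINUTE"
N=$(redis-cli -h "redis-$REGION" INCR "$INFLIGHT_KEY")
redis-cli -h "redis-$REGION" EXPIRE "$INFLIGHT_KEY" 120 > /dev/null
if [ "$N" -gt "$MAX_INFLIGHT" ]; then
  # Held back to the next minute; a dashboard warning surfaces instead.
  echo "tenant $TID over in-flight cap ($N > $MAX_INFLIGHT); deferring" >&2
else
  redis-cli -h "redis-$REGION" RPUSH "q:probes:$REGION" "$JOB"
fi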
What does not change at scale
For all the new layers, the per-probe sequence — the inside of the worker — is unchanged from the single-tenant credentialed walkthrough. The eight-step probe still runs DNS, TLS, unauthenticated initialize, OAuth discovery, authenticated initialize, tools/list, tools/call against the health tool, and the canonical-JSON SHA-256 hash for drift detection. The probe credential watchdog still fires at the 30/7/3-day escalation tiers. The two-of-N region aggregation is the same rule. The four authentication patterns from the auth primer are still the four authentication patterns. The schema-drift detector is the same canonical-JSON hash. The distinction between HTTP probes and JSON-RPC health checks is still the distinction.
The reason any of this is tractable at all is that the per-probe behaviour is a settled question by the time you start scaling out. You are not designing the probe and the multi-tenant operator at the same time; you are taking the probe-as-spec and wrapping it in the operator-as-implementation. If the probe behaviour is still in flux when you start scaling, you will end up with a multi-tenant collector that has multiple probe versions running in parallel — and that is a different kind of failure mode entirely, where two tenants are getting different verdicts for the same server because they happen to be sampled by different probe versions.
The compounding effect of the four practical-routine walkthroughs and now this scale walkthrough is that the entire collector is composable. The probe is the atom; the multi-region wrapper is the geographical lift; the status page is the human surface; the read-side API is the machine surface; the multi-tenant operator is the service-shape. Each layer has a single responsibility, a single set of failure modes, a single piece of shared state with the next layer. Every layer can be tested in isolation. Every failure mode is bounded. The whole stack is ~3,500 lines of code in our deployment, split across ~12 binaries; the supervisor is 400 lines of Bash, the worker is 600 lines of Bash plus a 200-line Python module for OAuth discovery, the coalescer is 50 lines of Lua, the API server is 1,200 lines of Go, and the rest is dashboards and admin tooling. None of the binaries are large. The leverage is that the layering makes each one small.
What's next in the scale sub-series
Three sequels to this post are pre-committed in the AliveMCP blog backlog and will ship over the next four weeks. Each takes one of the layers in this post and goes deep on the part that didn't fit here.
- Per-tenant alert routing at scale — what changes when a Slack channel and a webhook URL are tenant-scoped configuration, how to enforce that a misconfigured tenant cannot accidentally page another tenant, and the cross-tenant alert-suppression rule that prevents a registry-wide outage from paging every tenant in the same minute.
- The shared-state archiver — the path from the verdict-minute Redis to the long-term Postgres history table, the schema choice (one row per server per minute vs JSONB partitioned by month), the retention policy by tier, and the GDPR-shaped delete path.
- Q3 2026 registry audit (mid-July) — the next iteration of the Q2 audit, run from the multi-region multi-tenant collector built across the practical-routine and scale series. The Q3 numbers will be the first comparison point — bucket-by-bucket movement vs Q2, including the regionally degraded bucket the multi-region rollout surfaced, and whether the credentialed-probe rollout has shrunk the auth-walled 16.8% bucket as expected.
If you operate an MCP server and want the multi-tenant collector to track yours, join the waitlist — we email the moment a new public post lands and when claimed-listing flow opens. If you're building your own probe stack and want to compare implementation notes, the MCP server uptime API and open-source MCP monitoring reference pages collect every primitive used in this series, with the relevant deep-dive linked. The UptimeRobot vs AliveMCP comparison covers what a generic uptime SaaS misses about MCP-specific failure modes; the Datadog MCP monitoring page covers the cost shape when scaling beyond the indie tier.
Further reading on AliveMCP
- MCP uptime API and embeddable badge — the read-side walkthrough — what the multi-tenant collector serves to consumers
- Public status page for an MCP server — the surface-area walkthrough — what the multi-tenant collector renders for human readers
- Multi-region MCP probe deployment — what the multi-tenant collector inherits from the regional layer
- Running a credentialed MCP health check, end to end — the per-probe sequence inside every worker
- MCP authentication primer — the credentials being scoped and rotated above
- Schema drift in MCP tool definitions — the failure mode the worker's hash detects
- JSON-RPC health checks vs HTTP probes — the protocol-aware-probe rationale
- Why MCP servers die silently — 7 failure modes — the failure taxonomy this stack is built to surface
- State of the MCP Registry — Q2 2026 — the audit the multi-tenant rollout will be re-run on for Q3
- MCP server uptime API reference — read-side spec of the data this collector emits
- MCP monitoring tool — buyer's checklist — the questions a buyer should ask of any MCP-monitoring service
- MCP server health check — probe sequence explained — the canonical probe sequence the worker runs
- MCP server uptime monitoring — the brand-match definition for the collector
- Datadog MCP monitoring — when the enterprise SKU makes sense vs running this stack
- Open-source MCP monitoring — adjacent OSS that overlaps with parts of this stack
- UptimeRobot vs AliveMCP — what generic uptime monitors miss