Deep dive · 2026-04-25 · Probes
Multi-region MCP probe deployment — the walkthrough for catching edge-cache-localised outages
A single-region probe is a useful lie. It catches the failure modes that take a server fully down — DNS, TLS, hard 5xx — and it confidently misses every failure mode that is regional. An MCP server fronted by Cloudflare or CloudFront serves a stale cached tools/list from the EU edge for two hours after the origin's protocol-version field changed; the probe in us-east-1 sees the new shape, the probe in eu-west-2 sees the old shape, and a single-region probe — wherever it happens to live — has no way to know which one represents the truth users are getting. This post is the practical follow-up to the credentialed health check walkthrough: how to deploy MCP probes across multiple geographic regions, the five regions worth probing from, the two-of-five aggregation rule that converts single-region noise into a real signal, the shared-state design that lets every probe agree on a verdict, the credentialed-probe intersection (token replication, scoping, rotation), and a copy-pasteable shell wrapper that runs the eight-step probe in parallel from any number of regions.
TL;DR
Deploy at minimum three regions: one in the Americas, one in EMEA, one in APAC. Five regions is better — add a second North American region and one in South America or Africa to catch ASN-level routing problems. Run the eight-step credentialed probe from each region every 60 seconds. Adopt a two-of-N aggregation rule: a single region failing in isolation is a warning (probably a regional issue, possibly a probe-side fault, never a hard page); two or more regions failing concurrently is an alert. Replicate the probe credential with read-only scope to each region's secret store, never share files between regions, and rotate via the same token-expiry watchdog covered in post #6. Keep the canonical-JSON tool-list hash per region, not global — region-local cache divergence is signal, not noise. Aggregate verdicts in a shared state store (one Redis or one Postgres row per endpoint, not per region). The whole thing is the credentialed probe with a region label and an aggregation pass on top — about 40 extra lines of bash and one shared-state write.
Why one region of probes lies — the three failure modes you only see from a second region
The Q2 2026 audit ran from a single North American region. We were upfront about it in the methodology — the audit is a snapshot, the bucket numbers are accurate for what a North American probe saw at that moment, and a multi-region re-run would shake out a slightly different shape. The reason it produces a different shape is the same reason that single-region probes mislead production monitors: a non-trivial number of MCP-server failure modes are regional, not global.
1. Edge-cache divergence on tool lists
An MCP server fronted by a CDN — Cloudflare, CloudFront, Fastly, Cloud Run with the default edge cache — caches the tools/list response at every edge POP. When the origin updates the tool list (a deploy, a registry refresh, a tool added or removed), the cache invalidates on the POPs the deploy traffic hits first; the rest serve the old shape until their TTL expires or they get invalidated. For a five-minute TTL on a multi-POP CDN that's typically a 90-second-to-five-minute window of divergence; for a one-hour TTL it's an hour. During that window the canonical-JSON tool-list hash from the schema-drift detector is region-dependent: probes in the EU see one hash, probes in us-east-1 see another, and only a multi-region probe knows the divergence exists at all.
The cleanest example we've seen in production: an indie MCP server author renamed a tool from get_data to fetch_data on a Tuesday afternoon. The deploy hit the origin at 14:02 UTC. Their single-region uptime monitor (us-east) saw the new hash by 14:03 and reported a clean drift event. Their users in Europe were calling get_data and getting tool not found errors until 14:47 UTC — the EU edge cache TTL. Forty-five minutes of "the dashboard is green and users can't use the server" because the probe and the affected users were on different sides of the same CDN.
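The renamed-tool incident can be reproduced in miniature: the drift detector's canonical-JSON hash differs between regions exactly when the edge caches disagree. A minimal sketch, where the two JSON payloads are stand-ins for what each region's probe fetched and `canonical_hash` is an illustrative helper, not part of the probe script:

```shell
# Hash the canonical (key-sorted, compact) form of a JSON document so that
# semantically-equal responses hash identically regardless of key order.
canonical_hash() {
  jq -S -c . | sha256sum | awk '{print $1}'
}

# Stand-ins for the tools/list body each region's probe saw mid-deploy.
us_hash=$(echo '{"tools":[{"name":"fetch_data"}]}' | canonical_hash)
eu_hash=$(echo '{"tools":[{"name":"get_data"}]}'  | canonical_hash)

# Only a probe that sees both hashes can know the divergence exists.
if [ "$us_hash" != "$eu_hash" ]; then
  echo "divergence"
fi
```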
2. ASN-level routing failures
Sometimes an MCP server is up, the CDN is up, and one specific ISP's BGP routes are wrong. A Hurricane Electric peering issue, a misconfigured prefix announcement on the customer's side, a route-leak incident at a transit provider — all of which take some fraction of the internet off the server while the server itself is healthy. From the operator's monitoring host, the server is fine. From a third of the operator's users, the server is unreachable. A single-region probe will never catch this; a probe in a different ASN sometimes will, and a probe in three different ASNs almost always will.
3. Region-local origin outages
Multi-region origins (Cloud Run multi-region, Lambda with regional endpoints, manually-deployed origins behind a latency-routed DNS record) can have one region of the origin go down while the others stay up. Probes from regions that hit the healthy origin will report green; probes from regions that hit the dead origin will report red. The user impact is non-uniform — the same fraction of users hitting the dead region see outages — but a single-region probe can only describe one slice of the user population. Two regions is the minimum that can detect this; three regions is enough to triangulate which origin is dead.
The TL;DR of these three failure modes: the unit of "is the server up" is not the server, it's the (server, region) pair. A probe that only measures one region only describes one (server, region) pair, and the other pairs you didn't measure can be in a different state at the same instant.
The empirical evidence — what fraction of MCP outages are region-local
From the Q2 audit re-runs we've quietly run from two extra regions over the last fortnight, the share of healthy-from-one-region / unhealthy-from-another events on the 196 servers in the healthy bucket has been ~3.4% measured over 24 hours. That number is small but load-bearing: 3.4% of the healthy bucket is a meaningful number of servers that look healthy from one place and broken from another, and the disposition of those servers is overwhelmingly the edge-cache-divergence and region-local-origin failure modes above. Our working number for the Q3 2026 audit (mid-July re-run) is that the multi-region probe will move ~3-5% of servers from the healthy bucket to a new "regionally-degraded" bucket — and that bucket is where the most genuinely user-impacting silent outages are hiding.
The number is not huge. The number is also not zero. For a Posture C server (sign-up gated) where the operator has paying users on three continents, that 3.4% probability per day is roughly a one-in-30-day chance of a region-local outage that single-region monitoring will miss. For a hobby MCP with three users all in the same metro area, multi-region monitoring is overkill. The decision rule is simple: if your users are concentrated in one region and you can verify that, single-region is fine; otherwise, the deployment patterns below pay for themselves quickly.
Three deployment patterns (which one fits your stack)
Pattern A — probe from a laptop in three cities, run on a cron
For the indie MCP author with one or two servers and zero infrastructure budget, the cheapest viable multi-region probe is the credentialed probe in three crontabs on three different machines: a home machine in your local region, a free-tier VPS in a second region, and either a friend's machine or a free-tier compute instance in a third region. The probes write their results to a shared store (a free-tier Redis, a Cloudflare KV, a public-read Gist with a token) and an aggregator script reads the three results and emits one verdict.
This pattern is genuinely fine for hobby use. The downside is a single point of human attention — when one of the three machines reboots, gets disconnected from Wi-Fi, or has its power supply die, the probe goes silent and nobody finds out until the watchdog catches the missing-data signal. We recommend it for MCPs with fewer than 100 users, where one missed regional failure per quarter is an acceptable cost.
Pattern B — probe from three cloud providers
The canonical pattern. One probe runner each in: AWS (an us-east-1 Lambda or a tiny EC2), Google Cloud (a europe-west2 Cloud Run job), and Hetzner or another EU/US-mixed indie provider (a $4/mo VPS in fsn1 or similar). Three different providers means the probe survives an AWS-wide outage; three different regions means a regional CDN issue is visible. Each runner runs the credentialed probe from post #6 on a 60-second cron, writes results to a shared store, and a fourth lightweight aggregator (one Lambda invocation per minute) computes the consensus verdict.
Cost: typically $10-15/mo for three providers' probe runners plus a state store. For most teams this is the right call. The implementation is the credentialed probe binary plus a region environment variable and a write to a shared store; no other code changes.
Pattern C — probe from the edge
The most-region-coverage-per-dollar pattern. A Cloudflare Worker, a Lambda@Edge function, or a Fly.io machine deployed to multiple regions runs the probe from whichever POP routes to the worker. With Cloudflare Workers in particular, a single deploy reaches 300+ POPs; the probe runs from whichever POP's cron triggers it, and the aggregator sees results tagged by POP. This pattern catches CDN-cache divergence with the highest resolution because the probe shares the same edge-routing layer as real users.
The trade-off is that edge runtimes have a strict execution-time budget (Cloudflare Workers: 30 seconds CPU on the paid tier, much less on free), so the eight-step probe needs to be split into smaller chunks: DNS + TLS + unauthenticated initialize in one Worker invocation; authenticated initialize + tools/list + tools/call in a second; the canonical-JSON hash compare in a third. The aggregator stitches them together. The other trade-off is that edge probes can't easily hold mTLS client certificates; if the server uses mTLS, fall back to Pattern B.
The five regions worth probing from (and why these specifically)
For the regional probe deployment to actually catch the three failure modes above, the region selection needs to span three things: continents, ASNs, and CDN POPs. Five regions is the practical minimum that does all three.
- North America East (`us-east-1` / Ashburn / NYC) — covers the highest concentration of internet traffic and the largest CDN POP cluster. Skipping this region means missing the largest single audience block.
- North America West (`us-west-2`/`us-west-1` / Oregon / California) — coast-to-coast variance is its own signal. Different transit providers, different submarine-cable ingress to APAC. A failure visible from the East but not the West is almost always a Cloudflare/Fastly POP issue rather than an origin issue.
- EU West (`eu-west-2` / London or `eu-central-1` / Frankfurt) — a different CDN POP cluster, different ISPs, different DNS resolvers. Catches the EU-edge-cache-divergence case that prompted this whole discussion.
- APAC (`ap-southeast-1` / Singapore or `ap-northeast-1` / Tokyo) — the highest network distance from typical origin regions, the highest TLS-handshake latency, the most likely region to surface protocol-timeout failures that don't manifest closer to origin. Singapore is the better default — it routes to both Tokyo and Mumbai/Sydney via reasonable paths.
- South America or Africa (`sa-east-1` / São Paulo or `af-south-1` / Cape Town) — the region most likely to expose ASN-level routing weirdness. The smallest user share, but the highest "tells you something the other four don't" rate per probe.
Five is the practical minimum. Three (one Americas, one EMEA, one APAC) is the floor for "multi-region" to be a meaningful description of the probe — anything less than three and the aggregation rule below isn't statistically distinguishable from a probe-side fault. Beyond five, marginal returns drop quickly: more probes mean more probe-credential replication, more shared-state writes, and more aggregator complexity, with diminishing yield in newly-detectable failure modes.
The aggregation rule — single-region failure is a warning, two-region is an alert
The single most important design decision in a multi-region probe deployment is the aggregation rule. Get it wrong in one direction and you'll page the on-call every time a single regional probe has a bad minute (probe noise becomes false-page noise). Get it wrong in the other direction and you'll suppress real outages because two regions need to be down before anything fires.
The rule that has worked in practice, and the one we run on the AliveMCP collector, is two-of-N agreement on the same step failing within the same 60-second window. If one region reports step 5 failed and four other regions report all-pass, the verdict is regionally degraded — investigate but don't page. If two regions report step 5 failed within 60 seconds of each other, the verdict is step-5 alert — page on-call. If three or more regions report step 5 failed, the verdict is hard outage — top-priority page.
Two adjustments to this baseline that have proved load-bearing:
- The "same step" qualifier matters. Two regions both failing, but on different steps (one on TLS, one on tool-list hash mismatch), is almost certainly two unrelated regional issues — neither alone, nor both together, indicates a server-side problem. Don't aggregate them. The alert format covered in MCP server Slack alerts includes a `probe_step` field on every payload exactly so the aggregator can do this by-step grouping.
- Concurrent vs non-concurrent matters. Two regions failing 30 seconds apart is one signal (a propagating outage); two regions failing 20 minutes apart is two unrelated incidents. The aggregator's lookback window for "same outage" should be 2–5 minutes — long enough to absorb cross-region propagation, short enough that two truly-separate incidents don't accidentally merge.
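Both adjustments can be sketched as a jq pass over per-step probe results. The field names follow the alert-payload convention (`probe_step`, `region`); the events, timestamps, and the 300-second window below are illustrative:

```shell
# Hypothetical per-step results from three regions in one lookback window.
# The tls failure is a lone-region event; the tools_list failures come from
# two distinct regions 30 seconds apart, so only tools_list should alert.
events='[
 {"region":"us-east-1","probe_step":"tls","status":"fail","probe_started_at":100},
 {"region":"eu-west-2","probe_step":"tools_list","status":"fail","probe_started_at":130},
 {"region":"ap-southeast-1","probe_step":"tools_list","status":"fail","probe_started_at":160}
]'

# Group failures by step, keep groups whose failures fall inside the window,
# then require two or more *distinct* regions before emitting the step name.
alerting=$(echo "$events" | jq -r '
  map(select(.status=="fail"))
  | group_by(.probe_step)[]
  | select((max_by(.probe_started_at).probe_started_at
            - min_by(.probe_started_at).probe_started_at) <= 300)
  | select(([.[].region] | unique | length) >= 2)
  | .[0].probe_step')
echo "$alerting"
```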
The dashboard surfaces three colours that map to this rule directly: green (all regions pass), amber (one region fails, others pass — annotated with the region for human triage), red (two or more regions fail concurrently). The same three-state machine from the credentialed probe walkthrough applies — the per-region probe still emits healthy/auth-walled/broken, the aggregator computes the global verdict.
Time-skew and clock-drift gotchas
The aggregation rule depends on knowing which probe results happened "in the same window." That sounds trivial; it stops being trivial the moment one of your probe runners drifts more than a few seconds. We've seen these specific problems in production multi-region deployments:
- NTP drift on a tiny VPS. Some indie VPS providers don't run NTP by default. A box that drifts 30 seconds per week is a box whose probe results land in the wrong aggregation window after a month. Before deploying a probe runner, verify NTP or chrony is enabled and the offset is < 1 second. Run `chronyc tracking` or `timedatectl` as part of the probe-runner bootstrap script.
- The "minute boundary" trap. If every probe runs at `:00` seconds of every minute, a 0.5-second-skewed runner can land its result in either the last aggregation window or the next one depending on which side of the second its skew falls. The fix: tag every probe result with a server-side `probe_started_at` (the actual epoch time it started, not the cron-tick time) and let the aggregator window probes by that timestamp with a 60-second tolerance. Don't assume the cron tick equals the probe's actual fire time.
- Long-running probes overlapping into the next window. An eight-step probe targeting a slow server can take 15+ seconds; if the server is timing out, it can take the full timeout (typically 30 seconds). A probe that started at `:00` and finished at `:31` belongs to the window of `:00`, not the window of `:31`. Aggregate by `probe_started_at`, not `probe_finished_at`.
- The shared-state-write race. Multiple probe runners writing concurrently to a single Redis key can race; with Redis's `SET` the last writer wins, which is fine if the last writer is the most recent probe but bad if it's a stale probe whose result took longer to ship. Use `SET NX` with a per-window key (`probe:<slug>:<window-start-epoch>:<region>`) so concurrent writes for the same (region, window) collide and the first writer wins; subsequent writes for the next window get a fresh key.
None of these are fancy. All of them are the kind of thing that doesn't show up in a single-region probe deployment because there's only one clock, one writer, and one window. The cost of multi-region is the cost of distributed-systems-thinking applied to a problem most teams have not previously needed to apply it to. Plan for it.
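The windowing rule reduces to integer division on the start timestamp. A minimal sketch with made-up epoch values, showing a probe that fires at second 59 of its window and finishes 31 seconds into the next one:

```shell
WINDOW_SEC=60
probe_started_at=1000000019    # second :59 of its minute (illustrative epoch)
probe_finished_at=1000000051   # second :31 of the *next* minute

# A probe belongs to the window of its start time, never its finish time.
window=$(( probe_started_at / WINDOW_SEC * WINDOW_SEC ))
echo "$window"
```

Windowing by `probe_finished_at` here would put the result one window late, which is exactly the long-running-probe trap above.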
The shared-state design — where the probe state lives so all regions converge
The multi-region probe needs exactly one place where every region writes its result and the aggregator reads from. The choices, in increasing order of operational weight:
- A single Redis instance with a 30-day TTL. Cheapest. Works for hundreds of endpoints and dozens of regions. Lives in one geographic region — accept the eventual-consistency latency from probes in other regions. Schema: one hash per (endpoint, region), keyed by epoch-minute window; one set per endpoint of "regions reporting this minute"; one string per endpoint of the latest aggregated verdict.
- A single Postgres row per endpoint, JSONB column for region results. Slightly more operational weight. Buys you SQL queryability and easy back-fills when the aggregator's logic changes. Use a `FOR UPDATE` lock or an `INSERT ... ON CONFLICT DO UPDATE` with a region-keyed JSONB merge to avoid the write-race trap above.
- A regional KV store that replicates (Cloudflare KV, DynamoDB Global Tables). Highest operational ceiling. Lets each region write to its local KV node and reads pick up the latest state with low latency. Worth the complexity only at the scale where probes-per-second exceed the throughput of a single-instance Redis (typically several thousand endpoints with sub-60-second cadence).
The mistake to avoid: writing the state into the probe runner's local filesystem and trying to rsync it. The race conditions are unmanageable. The shared store has to be shared, not synchronised — exactly one source of truth, accessed concurrently, with the storage layer enforcing serialisation. For most teams running fewer than 1,000 endpoints, single-instance Redis is the boring right answer, and the operational cost is one $5/mo Redis on a single region with a daily snapshot to S3.
One more piece of the design: keep the canonical-JSON tool-list hash per region. The whole point of multi-region probing is detecting that the EU-edge hash differs from the us-east hash for 45 minutes after a deploy; collapsing the per-region hashes into one global hash erases that signal. Schema: probe:<slug>:<region>:tool-hash — one key per region, with hash-drift detection running per region. Cross-region hash divergence within a 5-minute window is its own alert tier (cdn-cache-divergence), distinct from same-region drift (schema-drift from the drift detector post).
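The divergence check itself is a one-liner over the per-region hash keys. A sketch against hardcoded values (stand-ins for `redis-cli GET` results on the `tool-hash` keys):

```shell
# Stand-ins for the per-region values of probe:<slug>:<region>:tool-hash.
# Two regions agree and one serves a stale edge cache.
hashes='aaa111
bbb222
aaa111'

# Count distinct non-empty hashes; more than one within the same window
# means the CDN edges disagree about the tool list.
distinct=$(printf '%s\n' "$hashes" | sort -u | grep -c .)
divergence=""
if [ "$distinct" -gt 1 ]; then divergence="cdn-cache-divergence"; fi
echo "$divergence"
```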
The credentialed-probe + multi-region intersection
The probe credential design from post #6 survives the multi-region jump but needs three small adjustments.
- Replicate the credential to each region's secret store, don't share a single store across regions. Cross-region secret-store reads add latency to every probe and create a single point of failure (if the secret store goes down, every probe in every region goes blind, which is the opposite of what multi-region buys you). Each region's runner reads its credential from its own region-local secret store; rotation is done by the watchdog writing the new credential to all stores at once.
- Same credential, different region label. Don't issue a different probe credential per region — that multiplies the rotation surface and makes it harder to verify all regions are actually running. Issue one credential with the full month-or-quarter expiry and a region claim added to the request as a header (`X-Probe-Region: us-east-1`) so the server's logs can distinguish them. The token-expiry watchdog runs once and pages once; the rotation script writes to N region stores in one go.
- The token-expiry watchdog stays single-region. The watchdog's job is to alert on credential lifecycle — there's no multi-region dimension to that. Run it from one region (the same region as the aggregator), keep the alert path single-source, and accept that the watchdog itself is a single point of failure. If the watchdog fails, the worst case is a credential expires unnoticed; the multi-region probe will then go red from every region simultaneously, which trips the hard-outage alert. The redundancy is built into the system without needing a multi-region watchdog.
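The one-rotation-writes-everywhere flow can be simulated in a few lines. Temp files stand in for the region-local secret stores here; a real rotation script would replace the `printf` with the store's write call (for example, a put-secret-value invocation), which is an assumption about your secret backend, not part of the probe:

```shell
# One new credential per rotation, written to every region's store in one pass.
NEW_TOKEN="tok-$(date +%s)"
store_dir=$(mktemp -d)

for region in us-east-1 eu-west-2 ap-southeast-1; do
  # stand-in for the region-local secret-store write
  printf '%s\n' "$NEW_TOKEN" > "$store_dir/$region"
done

# After one rotation pass, every region store must hold the same credential.
distinct_tokens=$(cat "$store_dir"/* | sort -u | wc -l)
```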
The shell recipe — multi-region wrapper around the credentialed probe
What follows is the structure of the multi-region orchestration, written as a single bash script that runs the credentialed probe from post #6 in N parallel SSH-or-Lambda invocations and aggregates results from a shared Redis. The recipe assumes the credentialed-probe script is already deployed to each region's runner and that redis-cli can reach the shared state store. Substitute the variables at the top with your endpoint, your regions, and your Redis URL.
#!/usr/bin/env bash
# multi-region-mcp-probe.sh — orchestrates the credentialed probe across regions
# and writes the aggregated verdict to a shared Redis.
# Dependencies: bash 4+, ssh or aws-cli, redis-cli, jq, date (GNU coreutils).
set -euo pipefail
# --- config -----------------------------------------------------------------
SERVER_SLUG="${SERVER_SLUG:-example-mcp}"
SERVER_URL="${SERVER_URL:-https://mcp.example.com/}"
REGIONS=("us-east-1" "us-west-2" "eu-west-2" "ap-southeast-1" "sa-east-1")
PROBE_RUNNER="${PROBE_RUNNER:-ssh}" # ssh | lambda | worker
REDIS_URL="${REDIS_URL:-redis://localhost:6379}"
WINDOW_SEC=60
WINDOW_START=$(( ($(date +%s) / WINDOW_SEC) * WINDOW_SEC ))
# --- 1. fan out: invoke the credentialed probe in every region in parallel --
declare -A pids
# note: no `declare -A results` here — the aggregation step below assigns a
# scalar named `results`, which would collide with an associative array.
for region in "${REGIONS[@]}"; do
(
# the per-region runner returns one JSON line per step; we collect them all.
case "$PROBE_RUNNER" in
ssh) ssh "probe-${region}" "SERVER_URL='$SERVER_URL' /usr/local/bin/credentialed-mcp-probe.sh" ;;
lambda) aws lambda invoke --region "$region" --function-name mcp-probe \
--payload "{\"server_url\":\"$SERVER_URL\"}" --cli-binary-format raw-in-base64-out /dev/stdout ;;
worker) curl -fsS "https://probe-${region}.example.com/probe?url=$SERVER_URL" ;;
esac
) > "/tmp/probe-${region}-${WINDOW_START}.jsonl" 2>&1 &
pids[$region]=$!
done
# --- 2. wait with bounded timeout (no probe should exceed 35s) --------------
# `timeout` can't wrap the `wait` builtin, so run a background reaper that
# kills any probe still running after 35s, then wait on each pid normally.
( sleep 35; kill "${pids[@]}" 2>/dev/null ) & reaper=$!
for region in "${REGIONS[@]}"; do
  if ! wait "${pids[$region]}" 2>/dev/null; then
    echo '{"step":"runner","status":"timeout","region":"'"$region"'"}' \
      > "/tmp/probe-${region}-${WINDOW_START}.jsonl"
  fi
done
kill "$reaper" 2>/dev/null || true
# --- 3. write each region's result to shared Redis (per-window key) ---------
for region in "${REGIONS[@]}"; do
result_file="/tmp/probe-${region}-${WINDOW_START}.jsonl"
# SET NX so the first writer for this (region, window) wins.
redis-cli -u "$REDIS_URL" SET \
"probe:${SERVER_SLUG}:${WINDOW_START}:${region}" \
"$(cat "$result_file" | jq -sc .)" \
NX EX 86400 >/dev/null
redis-cli -u "$REDIS_URL" SADD \
"probe:${SERVER_SLUG}:${WINDOW_START}:reporters" "$region" >/dev/null
done
# --- 4. aggregate: count failures by (step, region) within window -----------
reporters=$(redis-cli -u "$REDIS_URL" SMEMBERS "probe:${SERVER_SLUG}:${WINDOW_START}:reporters")
n=$(echo "$reporters" | wc -l)
declare -A step_failures
for region in $reporters; do
results=$(redis-cli -u "$REDIS_URL" GET "probe:${SERVER_SLUG}:${WINDOW_START}:${region}")
failed_steps=$(echo "$results" | jq -r '.[] | select(.status=="fail") | .step')
for step in $failed_steps; do
step_failures["$step"]=$(( ${step_failures["$step"]:-0} + 1 ))
done
done
# --- 5. emit verdict per the two-of-N rule ---------------------------------
verdict="green"
fail_step=""
fail_count=0
for step in "${!step_failures[@]}"; do
if (( step_failures[$step] >= 2 )); then
verdict="red"; fail_step="$step"; fail_count="${step_failures[$step]}"
break
fi
done
if [[ "$verdict" == "green" ]]; then
for step in "${!step_failures[@]}"; do
if (( step_failures[$step] >= 1 )); then
verdict="amber"; fail_step="$step"; fail_count="${step_failures[$step]}"
fi
done
fi
# --- 6. cross-region hash-divergence check ----------------------------------
# grep -c . counts distinct non-empty hashes, so a region that hasn't written
# its tool-hash key yet can't masquerade as a divergent hash; `|| true` keeps
# set -e/pipefail happy when no hashes exist at all.
hashes=$(for region in $reporters; do
  redis-cli -u "$REDIS_URL" GET "probe:${SERVER_SLUG}:${region}:tool-hash"
done | sort -u | grep -c . || true)
divergence=""
if (( hashes > 1 )); then divergence="cdn-cache-divergence"; fi
# --- 7. write the aggregated verdict ----------------------------------------
redis-cli -u "$REDIS_URL" SET "probe:${SERVER_SLUG}:verdict" \
"{\"window\":$WINDOW_START,\"verdict\":\"$verdict\",\"step\":\"$fail_step\",\"failed_regions\":$fail_count,\"reporters\":$n,\"divergence\":\"$divergence\"}" \
EX 86400 >/dev/null
# --- 8. fire alert at red, log at amber, no-op at green --------------------
case "$verdict" in
red) /usr/local/bin/alert-fire "$SERVER_SLUG" "$fail_step" "$fail_count" ;;
amber) /usr/local/bin/alert-log "$SERVER_SLUG" "$fail_step" "$fail_count" ;;
green) : ;;
esac
What the recipe is doing, in plain English: invoke the credentialed probe in every region in parallel; bound the wait so a hung region can't stall the whole batch; write each region's per-step results to a window-keyed Redis entry that locks first-writer-wins; once all reporting regions have written, count failures by step across regions; emit green/amber/red per the two-of-N rule; check the per-region tool-list hashes for cross-region divergence; write the verdict; route alerts according to severity. About 80 lines of bash, plus the credentialed-probe script from post #6 deployed to each runner. The total operational footprint is one Redis instance, one shared alert-fire binary, and one cron entry per region.
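For anyone wiring a consumer (a badge, a status row, a CI guard) against the verdict key, this is the payload shape the recipe writes, consumed with jq against a hardcoded sample rather than a live Redis:

```shell
# Sample of what `redis-cli GET probe:<slug>:verdict` returns after step 7.
verdict='{"window":1751234520,"verdict":"red","step":"tools_list","failed_regions":2,"reporters":5,"divergence":""}'

# One-line human-readable summary for a dashboard row or alert body.
summary=$(echo "$verdict" | jq -r \
  '"\(.verdict): step \(.step) failing in \(.failed_regions)/\(.reporters) regions"')
echo "$summary"
```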
What "AliveMCP Author tier" handles for you
The shell recipe above is intentionally minimal so anyone can audit it and run it themselves. It is also approximately the same logic that runs in the AliveMCP collector for every endpoint we monitor, with a few extras that the indie-author tier wires in by default rather than asking the author to operate:
- Three-region probe pool by default (us-east, eu-west, ap-southeast) for every endpoint, with two more regions available on the Team tier (us-west, sa-east).
- Probe-credential vault replicated to all five region-local secret stores; one rotation, one watchdog, no per-region credential management.
- Per-region tool-list hash with a CDN-cache-divergence alert tier — surfaces the EU-edge-stale-cache case automatically without any configuration on the author's side.
- Aggregation in a managed Redis with a daily snapshot, sub-second-skewed NTP-synced runners, and the two-of-N rule wired by default. The author never sees a false page from a single regional blip.
- Public dashboard surfaces all five region states for free public listings — anyone curious about a server's regional uptime can see the breakdown without claiming the listing.
The honest summary: the multi-region probe is straightforward to build but has half a dozen distributed-systems edges (clock skew, write races, hash-partitioning) that take a quarter to find and another to wire correctly. The Author tier exists for indie authors who would rather pay $9/mo than spend that quarter; the shell recipe exists for everyone who wants to start tonight and decide later whether the $9/mo is worth it.
How this fits the rest of the AliveMCP probe stack
The multi-region probe is the second layer of the practical-routine series. The credentialed probe is the per-region atom; this post wraps that atom in geographic redundancy. The full monitoring page walks through how the per-region and aggregated states sit on the same dashboard panel; the short version is "the global verdict is the headline number, the per-region breakdown is the drill-down, and the cross-region hash-divergence alert is its own row."
For the Q3 2026 audit re-run (mid-July), we're going to run the audit from all five regions in parallel — not just to refresh the bucket numbers, but to surface the regionally-degraded bucket explicitly. The expectation is that 3-5% of servers in the Q2 healthy bucket will move into the new regionally-degraded bucket, and that ~1% of servers in the Q2 hard-down bucket will move into "regionally up" (servers that looked dead from us-east but are responding from EU or APAC — typically because the operator deployed origin-side fixes that propagated to one region's CDN POPs first). Either way the bucket map gets less ambiguous, which is the point.
What we'll cover next
This is post #7 in the Q2-audit-driven series and the second of the practical-routine sub-series. Posts #1-#5 covered the audit, the seven failure modes, the JSON-RPC probe, the schema-drift detector, and the auth primer. Post #6 was the credentialed-probe walkthrough; this post extends it across regions. Up next: the Q3 2026 registry audit (mid-July re-run, with bucket-by-bucket movement vs Q2 and the new regionally-degraded bucket explicitly enumerated), and a follow-up walkthrough on the public status-page surface area — what to publish, what to keep internal, and how the per-region state map should render for users with no infra context.
If you operate an MCP server and want multi-region probing wired up without operating five region-local probe runners and a shared-state Redis, claim your listing on the public dashboard. The Author tier covers the credential vault, three-region probe pool by default, per-region tool-list hashing, and the CDN-cache-divergence alert tier. $9/mo for Slack or webhook delivery the moment any of the eight steps fail in two or more regions concurrently, with the failed step and the failing regions on the alert payload so on-call doesn't have to guess.
Further reading
- Running a credentialed MCP health check, end to end — the per-region probe atom this post wraps. Read first if you haven't wired the eight-step probe yet.
- State of the MCP Registry — Q2 2026 — the audit whose single-region methodology motivates the multi-region re-run for Q3.
- Why MCP servers die silently — 7 failure modes — the taxonomy. Edge-cache divergence sits between schema drift and route-moved.
- JSON-RPC health checks vs HTTP probes — the unauthenticated five-step probe; the multi-region wrapper applies equally to it.
- Schema drift in MCP tool definitions — the per-region hash divergence is the same canonical-JSON SHA-256 hash, computed independently per region.
- MCP authentication primer — the four-posture decision tree that determines whether the per-region probes need to carry credentials at all.
- MCP server health check — probe sequence explained — the canonical reference page; multi-region probing is layered on top of the same eight steps.
- MCP server Slack alerts — payload shape — the alert payload's `probe_step` and `region` fields are what the aggregator's by-step grouping rule reads.
- MCP endpoint not responding — diagnostic walkthrough — the user-facing companion for "the dashboard is amber from EU but green from us-east, what does that mean?"
- Check if an MCP server is alive — the human-grade single-curl test the multi-region probe automates 7,200 times a day across five regions.
- Monitoring an MCP server — signals worth watching — the dashboard layer that the per-region states feed into.
- MCP server status page — what to publish on it — what to surface from the per-region probe states publicly (and what to keep internal).
- How to monitor an MCP server — step by step — the setup walkthrough single-region readers should start with before scaling up to multi-region.
- MCP server uptime API — the read endpoint that exposes per-region states for badges, dashboards, and CI guardrails.
- MCP monitoring tool — buyer's evaluation checklist — multi-region probe coverage is one of the line items on the checklist; here's why it matters.
- MCP registry uptime — ecosystem-level numbers; the live tracker for how the multi-region rollout moves the regionally-degraded bucket between Q2 and Q3.
- UptimeRobot vs AliveMCP — UptimeRobot offers multi-region pings by default; what it still doesn't do is per-region MCP-protocol-aware verification.