Deep dive · 2026-04-25 · Probes

Multi-region MCP probe deployment — the walkthrough for catching edge-cache-localised outages

A single-region probe is a useful lie. It catches the failure modes that take a server fully down — DNS, TLS, hard 5xx — and it confidently misses every failure mode that is regional. An MCP server fronted by Cloudflare or CloudFront serves a stale cached tools/list from the EU edge for two hours after the origin's protocol-version field changed; the probe in us-east-1 sees the new shape, the probe in eu-west-2 sees the old shape, and a single-region probe — wherever it happens to live — has no way to know which one represents the truth users are getting. This post is the practical follow-up to the credentialed health check walkthrough: how to deploy MCP probes across multiple geographic regions, the five regions worth probing from, the two-of-five aggregation rule that converts single-region noise into a real signal, the shared-state design that lets every probe agree on a verdict, the credentialed-probe intersection (token replication, scoping, rotation), and a copy-pasteable shell wrapper that runs the eight-step probe in parallel from any number of regions.

TL;DR

Deploy at minimum three regions: one in the Americas, one in EMEA, one in APAC. Five regions is better — add a second North American region and one in South America or Africa to catch ASN-level routing problems. Run the eight-step credentialed probe from each region every 60 seconds. Adopt a two-of-N aggregation rule: a single region failing in isolation is a warning (probably a regional issue, possibly a probe-side fault, never a hard page); two or more regions failing concurrently is an alert. Replicate the probe credential with read-only scope to each region's secret store, never share files between regions, and rotate via the same token-expiry watchdog covered in post #6. Keep the canonical-JSON tool-list hash per region, not global — region-local cache divergence is signal, not noise. Aggregate verdicts in a shared state store (one Redis or one Postgres row per endpoint, not per region). The whole thing is the credentialed probe with a region label and an aggregation pass on top — about 40 extra lines of bash and one shared-state write.

Why one region of probes lies — the three failure modes you only see from a second region

The Q2 2026 audit ran from a single North American region. We were upfront about it in the methodology — the audit is a snapshot, the bucket numbers are accurate for what a North American probe saw at that moment, and a multi-region re-run would shake out a slightly different shape. The reason it produces a different shape is the same reason that single-region probes mislead production monitors: a non-trivial number of MCP-server failure modes are regional, not global.

1. Edge-cache divergence on tool lists

An MCP server fronted by a CDN — Cloudflare, CloudFront, Fastly, Cloud Run with the default edge cache — caches the tools/list response at every edge POP. When the origin updates the tool list (a deploy, a registry refresh, a tool added or removed), the cache invalidates on the POPs the deploy traffic hits first; the rest serve the old shape until their TTL expires or they get invalidated. For a five-minute TTL on a multi-POP CDN that's typically a 90-second-to-five-minute window of divergence; for a one-hour TTL it's an hour. During that window the canonical-JSON tool-list hash from the schema-drift detector is region-dependent: probes in the EU see one hash, probes in us-east-1 see another, and only a multi-region probe knows the divergence exists at all.

The cleanest example we've seen in production: an indie MCP server author renamed a tool from get_data to fetch_data on a Tuesday afternoon. The deploy hit the origin at 14:02 UTC. Their single-region uptime monitor (us-east) saw the new hash by 14:03 and reported a clean drift event. Their users in Europe were calling get_data and getting tool not found errors until 14:47 UTC — the EU edge cache TTL. Forty-five minutes of "the dashboard is green and users can't use the server" because the probe and the affected users were on different sides of the same CDN.
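The region-dependent hash check that would have caught this can be sketched in a few lines of shell. This is a minimal sketch, not the drift detector from the earlier post: the file names and the jq path into the tools/list response body are assumptions — adjust them to whatever your probe actually captures per region.

```shell
#!/usr/bin/env bash
# Sketch: compare canonical tool-list hashes captured by two regional probes.
canonical_hash() {
  # jq -S sorts object keys, -c strips whitespace: semantically identical
  # tool lists hash identically regardless of key order or formatting.
  jq -S -c '.result.tools // .' | sha256sum | awk '{print $1}'
}

compare_regions() {   # usage: compare_regions region-A.json region-B.json
  local h1 h2
  h1=$(canonical_hash < "$1")
  h2=$(canonical_hash < "$2")
  if [ "$h1" = "$h2" ]; then echo "match"; else echo "divergence"; fi
}
```

Run it against the captured tools/list bodies from two vantage points, e.g. `compare_regions /tmp/tools-us-east-1.json /tmp/tools-eu-west-2.json` (hypothetical capture paths). A `divergence` inside a deploy window is expected; one that persists past the CDN TTL is the 45-minute outage from the story above.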

2. ASN-level routing failures

Sometimes an MCP server is up, the CDN is up, and one specific ISP's BGP routes are wrong. A Hurricane Electric peering issue, a misconfigured prefix announcement on the customer's side, a route-leak incident at a transit provider — all of which take some fraction of the internet off the server while the server itself is healthy. From the operator's monitoring host, the server is fine. From a third of the operator's users, the server is unreachable. A single-region probe will never catch this; a probe in a different ASN sometimes will, and a probe in three different ASNs almost always will.

3. Region-local origin outages

Multi-region origins (Cloud Run multi-region, Lambda with regional endpoints, manually-deployed origins behind a latency-routed DNS record) can have one region of the origin go down while the others stay up. Probes from regions that hit the healthy origin will report green; probes from regions that hit the dead origin will report red. The user impact is non-uniform — the same fraction of users hitting the dead region see outages — but a single-region probe can only describe one slice of the user population. Two regions is the minimum that can detect this; three regions is enough to triangulate which origin is dead.

The TL;DR of these three failure modes: the unit of "is the server up" is not the server, it's the (server, region) pair. A probe that only measures one region only describes one (server, region) pair, and the other pairs you didn't measure can be in a different state at the same instant.

The empirical evidence — what fraction of MCP outages are region-local

From the Q2 audit re-runs we've quietly run from two extra regions over the last fortnight, the share of healthy-from-one-region / unhealthy-from-another events on the 196 servers in the healthy bucket has been ~3.4% measured over 24 hours. That number is small but load-bearing: 3.4% of the healthy bucket is a meaningful number of servers that look healthy from one place and broken from another, and the disposition of those servers is overwhelmingly the edge-cache-divergence and region-local-origin failure modes above. Our working number for the Q3 2026 audit (mid-July re-run) is that the multi-region probe will move ~3-5% of servers from the healthy bucket to a new "regionally-degraded" bucket — and that bucket is where the most genuinely user-impacting silent outages are hiding.

The number is not huge. The number is also not zero. For a Posture C server (sign-up gated) where the operator has paying users on three continents, that 3.4% probability per day is roughly a one-in-30-day chance of a region-local outage that single-region monitoring will miss. For a hobby MCP with three users all in the same metro area, multi-region monitoring is overkill. The decision rule is simple: if your users are concentrated in one region and you can verify that, single-region is fine; otherwise, the deployment patterns below pay for themselves quickly.

Three deployment patterns (which one fits your stack)

Pattern A — probe from a laptop in three cities, run on a cron

For the indie MCP author with one or two servers and zero infrastructure budget, the cheapest viable multi-region probe is the credentialed probe in three crontabs on three different machines: a home machine in your local region, a free-tier VPS in a second region, and either a friend's machine or a free-tier compute instance in a third region. The probes write their results to a shared store (a free-tier Redis, a Cloudflare KV, a public-read Gist with a token) and an aggregator script reads the three results and emits one verdict.

This pattern is genuinely fine for hobby use. The downside is a single point of human attention — when one of the three machines reboots, gets disconnected from Wi-Fi, or has its power supply die, the probe goes silent and nobody finds out until the watchdog catches the missing-data signal. We recommend it for MCPs with fewer than 100 users, where one missed regional failure per quarter is an acceptable cost.

Pattern B — probe from three cloud providers

The canonical pattern. One probe runner each in: AWS (an us-east-1 Lambda or a tiny EC2), Google Cloud (a europe-west2 Cloud Run job), and Hetzner or another EU/US-mixed indie provider (a $4/mo VPS in fsn1 or similar). Three different providers means the probe survives an AWS-wide outage; three different regions means a regional CDN issue is visible. Each runner runs the credentialed probe from post #6 on a 60-second cron, writes results to a shared store, and a fourth lightweight aggregator (one Lambda invocation per minute) computes the consensus verdict.

Cost: typically $10-15/mo for three providers' probe runners plus a state store. For most teams this is the right call. The implementation is the credentialed probe binary plus a region environment variable and a write to a shared store; no other code changes.

Pattern C — probe from the edge

The most-region-coverage-per-dollar pattern. A Cloudflare Worker, a Lambda@Edge function, or a Fly.io machine deployed to multiple regions runs the probe from the edge itself. With Cloudflare Workers in particular, a single deploy reaches 300+ POPs; each scheduled run executes from whichever POP the cron trigger lands on, and the aggregator sees results tagged by POP. This pattern catches CDN-cache divergence with the highest resolution because the probe shares the same edge-routing layer as real users.

The trade-off is that edge runtimes have a strict execution-time budget (Cloudflare Workers: 30 seconds CPU on the paid tier, much less on free), so the eight-step probe needs to be split into smaller chunks: DNS + TLS + unauthenticated initialize in one Worker invocation; authenticated initialize + tools/list + tools/call in a second; the canonical-JSON hash compare in a third. The aggregator stitches them together. The other trade-off is that edge probes can't easily hold mTLS client certificates; if the server uses mTLS, fall back to Pattern B.
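The chunked pattern needs a stitching step before the aggregator can treat the region's result like a single-runner result. A minimal sketch, assuming each chunk emits jsonl with a `step` field like the shell recipe later in this post (the file layout is an assumption, not any real Worker API):

```shell
#!/usr/bin/env bash
# Sketch: concatenate the jsonl fragments the chunked Worker invocations
# wrote, and fail loudly if the same probe step appears in two chunks.
stitch_chunks() {   # usage: stitch_chunks out.jsonl chunk1.jsonl [chunk2 ...]
  local out=$1; shift
  cat "$@" > "$out"
  local dupes
  dupes=$(jq -r '.step' "$out" | sort | uniq -d)
  if [ -n "$dupes" ]; then
    echo "duplicate steps after stitch: $dupes" >&2
    return 1
  fi
}
```

The duplicate-step check matters because an edge cron can retry a chunk; double-counting a step would skew the two-of-N tally downstream.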

The five regions worth probing from (and why these specifically)

For the regional probe deployment to actually catch the three failure modes above, the region selection needs to span three things: continents, ASNs, and CDN POPs. Five regions is the practical minimum that does all three, and the five the shell recipe below defaults to are: us-east-1 (Americas primary), us-west-2 (a second North American region on a different coast and transit mix), eu-west-2 (EMEA), ap-southeast-1 (APAC), and sa-east-1 (South America, the cheapest way to catch ASN-level routing problems that never surface from North American or European vantage points).

Three (one Americas, one EMEA, one APAC) is the floor for "multi-region" to be a meaningful description of the probe — anything fewer than three and the aggregation rule below can't distinguish a real regional failure from a probe-side fault. Beyond five, marginal returns drop quickly: more probes mean more probe-credential replication, more shared-state writes, and more aggregator complexity, with diminishing yield in newly-detectable failure modes.

The aggregation rule — single-region failure is a warning, two-region is an alert

The single most important design decision in a multi-region probe deployment is the aggregation rule. Get it wrong in one direction and you'll page the on-call every time a single regional probe has a bad minute (probe noise becomes false-page noise). Get it wrong in the other direction and you'll suppress real outages because two regions need to be down before anything fires.

The rule that has worked in practice, and the one we run on the AliveMCP collector, is two-of-N agreement on the same step failing within the same 60-second window. If one region reports step 5 failed and four other regions report all-pass, the verdict is regionally degraded — investigate but don't page. If two regions report step 5 failed within 60 seconds of each other, the verdict is step-5 alert — page on-call. If three or more regions report step 5 failed, the verdict is hard outage — top-priority page.

Two adjustments to the bare two-of-N count have proved load-bearing. First, require the agreeing regions to fail the same step: two regions failing different steps in the same window are two separate amber events, not one red — a DNS failure in one region and a tools/call timeout in another almost never share a cause. Second, require the failures to land in the same 60-second window: a step-5 failure at 14:02 and another at 14:09 are seven windows apart and say nothing about concurrency.

The dashboard surfaces three colours that map to this rule directly: green (all regions pass), amber (one region fails, others pass — annotated with the region for human triage), red (two or more regions fail concurrently). The same three-state machine from the credentialed probe walkthrough applies — the per-region probe still emits healthy/auth-walled/broken, the aggregator computes the global verdict.
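As a pure function the rule is tiny. This sketch collapses the three-or-more case into red, as the shell recipe later in the post does — the hard-outage escalation is alert-routing on top of the verdict, not separate verdict logic:

```shell
#!/usr/bin/env bash
# Two-of-N verdict as a pure function: the input is the number of regions
# that reported the same step failing inside the same 60-second window.
verdict_for() {
  if [ "$1" -ge 2 ]; then echo red      # two-region agreement: page
  elif [ "$1" -eq 1 ]; then echo amber  # lone region: investigate, no page
  else echo green
  fi
}
```

Keeping the verdict a pure function of the failure count is what makes the aggregator testable without standing up five probe runners.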

Time-skew and clock-drift gotchas

The aggregation rule depends on knowing which probe results happened "in the same window." That sounds trivial; it stops being trivial the moment one of your probe runners drifts more than a few seconds. The problems that show up in production multi-region deployments are all variations on the same theme: a runner whose clock has drifted writes its result into the neighbouring window, so two genuinely concurrent failures land in different windows and count as two ambers instead of one red; a runner without NTP drifts a little further every day until every result it writes lands a window late; and a runner that computes the window boundary from local wall-clock conventions instead of epoch seconds disagrees with every other runner about where the windows are.

None of these are fancy. All of them are the kind of thing that doesn't show up in a single-region probe deployment because there's only one clock, one writer, and one window. The cost of multi-region is the cost of distributed-systems-thinking applied to a problem most teams have not previously needed to apply it to. Plan for it.
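The defence that covers most of this is making the window id a pure function of epoch seconds, exactly as the recipe's `WINDOW_START` line does, so every runner whose clock agrees also agrees on the window. A sketch:

```shell
#!/usr/bin/env bash
# Window alignment sketch: the window id is epoch seconds floored to the
# window size, so runners never disagree on boundaries unless their clocks
# actually disagree — which is then the only thing left to monitor (NTP).
WINDOW_SEC=60
window_id() {   # usage: window_id <epoch-seconds>
  echo $(( $1 / WINDOW_SEC * WINDOW_SEC ))
}
```

Two results 59 seconds apart can still straddle a boundary; if that false-negative bites, have the aggregator also check the neighbouring window before downgrading a would-be red to two ambers.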

The shared-state design — where the probe state lives so all regions converge

The multi-region probe needs exactly one place where every region writes its result and the aggregator reads from. In increasing order of operational weight, the realistic choices are: a free-tier KV store (Cloudflare KV or similar — fine for Pattern A's three crontabs), a single-instance Redis (the default for Patterns B and C), and a Postgres row per endpoint (when you already operate Postgres and want the verdict history queryable).

The mistake to avoid: writing the state into the probe runner's local filesystem and trying to rsync it. The race conditions are unmanageable. The shared store has to be shared, not synchronised — exactly one source of truth, accessed concurrently, with the storage layer enforcing serialisation. For most teams running fewer than 1,000 endpoints, single-instance Redis is the boring right answer, and the operational cost is one $5/mo Redis in a single region with a daily snapshot to S3.

One more piece of the design: keep the canonical-JSON tool-list hash per region. The whole point of multi-region probing is detecting that the EU-edge hash differs from the us-east hash for 45 minutes after a deploy; collapsing the per-region hashes into one global hash erases that signal. Schema: probe:<slug>:<region>:tool-hash — one key per region, with hash-drift detection running per region. Cross-region hash divergence within a 5-minute window is its own alert tier (cdn-cache-divergence), distinct from same-region drift (schema-drift from the drift detector post).
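The divergence check then reduces to counting distinct non-empty hashes across regions. A sketch — hashes are passed as arguments here for clarity; in production they come from the per-region Redis keys above:

```shell
#!/usr/bin/env bash
# Count distinct non-empty tool-list hashes across regions. A region whose
# hash key is missing contributes an empty string, which must NOT count as
# a distinct value — otherwise every late region flags false divergence.
distinct_hashes() {   # usage: distinct_hashes hash1 hash2 ...
  printf '%s\n' "$@" | sed '/^$/d' | sort -u | wc -l
}
```

Usage: `(( $(distinct_hashes "$h_us" "$h_eu" "$h_ap") > 1 ))` is the cdn-cache-divergence condition.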

The credentialed-probe + multi-region intersection

The probe credential design from post #6 survives the multi-region jump but needs three small adjustments: replicate the credential into each region's own secret store rather than sharing a file between runners (a region that can't reach a central vault is a region that can't probe); keep every replicated copy read-only scoped, so a compromised probe runner in any one region can't mutate anything; and rotate all the copies together via the same token-expiry watchdog from post #6, so no region silently keeps probing with a stale token after a rotation.

The shell recipe — multi-region wrapper around the credentialed probe

What follows is the structure of the multi-region orchestration, written as a single bash script that runs the credentialed probe from post #6 in N parallel SSH-or-Lambda invocations and aggregates results from a shared Redis. The recipe assumes the credentialed-probe script is already deployed to each region's runner and that redis-cli can reach the shared state store. Substitute the variables at the top with your endpoint, your regions, and your Redis URL.

#!/usr/bin/env bash
# multi-region-mcp-probe.sh — orchestrates the credentialed probe across regions
# and writes the aggregated verdict to a shared Redis.
# Dependencies: bash 4+, ssh or aws-cli, redis-cli, jq, date (GNU coreutils).
set -euo pipefail

# --- config -----------------------------------------------------------------
SERVER_SLUG="${SERVER_SLUG:-example-mcp}"
SERVER_URL="${SERVER_URL:-https://mcp.example.com/}"
REGIONS=("us-east-1" "us-west-2" "eu-west-2" "ap-southeast-1" "sa-east-1")
PROBE_RUNNER="${PROBE_RUNNER:-ssh}"            # ssh | lambda | worker
REDIS_URL="${REDIS_URL:-redis://localhost:6379}"
WINDOW_SEC=60
WINDOW_START=$(( ($(date +%s) / WINDOW_SEC) * WINDOW_SEC ))

# --- 1. fan out: invoke the credentialed probe in every region in parallel --
declare -A pids   # one background job per region; per-region results live in files
for region in "${REGIONS[@]}"; do
  (
    # the per-region runner returns one JSON line per step; we collect them all.
    case "$PROBE_RUNNER" in
      ssh)    ssh "probe-${region}" "SERVER_URL='$SERVER_URL' /usr/local/bin/credentialed-mcp-probe.sh" ;;
      lambda) aws lambda invoke --region "$region" --function-name mcp-probe \
                --payload "{\"server_url\":\"$SERVER_URL\"}" --cli-binary-format raw-in-base64-out /dev/stdout ;;
      worker) curl -fsS "https://probe-${region}.example.com/probe?url=$SERVER_URL" ;;
    esac
  ) > "/tmp/probe-${region}-${WINDOW_START}.jsonl" 2>/dev/null &   # stderr would corrupt the jsonl
  pids[$region]=$!
done

# --- 2. wait with bounded timeout (no probe should exceed 35s) --------------
# `timeout` can't wrap the `wait` builtin (it only runs external commands),
# so poll each child against a shared deadline and kill stragglers.
deadline=$(( $(date +%s) + 35 ))
for region in "${REGIONS[@]}"; do
  while kill -0 "${pids[$region]}" 2>/dev/null && (( $(date +%s) < deadline )); do
    sleep 1
  done
  if kill -0 "${pids[$region]}" 2>/dev/null; then
    kill "${pids[$region]}" 2>/dev/null || true
    echo '{"step":"runner","status":"timeout","region":"'"$region"'"}' \
      > "/tmp/probe-${region}-${WINDOW_START}.jsonl"
  else
    wait "${pids[$region]}" 2>/dev/null || true   # reap exit status without tripping set -e
  fi
done

# --- 3. write each region's result to shared Redis (per-window key) ---------
for region in "${REGIONS[@]}"; do
  result_file="/tmp/probe-${region}-${WINDOW_START}.jsonl"
  # SET NX so the first writer for this (region, window) wins.
  redis-cli -u "$REDIS_URL" SET \
    "probe:${SERVER_SLUG}:${WINDOW_START}:${region}" \
    "$(jq -sc . "$result_file" 2>/dev/null || echo '[]')" \
    NX EX 86400 >/dev/null
  redis-cli -u "$REDIS_URL" SADD \
    "probe:${SERVER_SLUG}:${WINDOW_START}:reporters" "$region" >/dev/null
done

# --- 4. aggregate: count failures by (step, region) within window -----------
reporters=$(redis-cli -u "$REDIS_URL" SMEMBERS "probe:${SERVER_SLUG}:${WINDOW_START}:reporters")
n=$(redis-cli -u "$REDIS_URL" SCARD "probe:${SERVER_SLUG}:${WINDOW_START}:reporters")   # wc -l would miscount an empty set as 1
declare -A step_failures
for region in $reporters; do
  results=$(redis-cli -u "$REDIS_URL" GET "probe:${SERVER_SLUG}:${WINDOW_START}:${region}")
  failed_steps=$(echo "$results" | jq -r '.[] | select(.status=="fail") | .step')
  for step in $failed_steps; do
    step_failures["$step"]=$(( ${step_failures["$step"]:-0} + 1 ))
  done
done

# --- 5. emit verdict per the two-of-N rule ---------------------------------
verdict="green"
fail_step=""
fail_count=0
for step in "${!step_failures[@]}"; do
  if (( step_failures[$step] >= 2 )); then
    verdict="red"; fail_step="$step"; fail_count="${step_failures[$step]}"
    break
  fi
done
if [[ "$verdict" == "green" ]]; then
  # any entry left in step_failures at this point is a single-region failure
  for step in "${!step_failures[@]}"; do
    verdict="amber"; fail_step="$step"; fail_count="${step_failures[$step]}"
  done
fi

# --- 6. cross-region hash-divergence check ----------------------------------
hashes=$(for region in $reporters; do
  redis-cli -u "$REDIS_URL" GET "probe:${SERVER_SLUG}:${region}:tool-hash"
done | sed '/^$/d' | sort -u | wc -l)   # drop empties: a missing key must not count as a distinct hash
divergence=""
if (( hashes > 1 )); then divergence="cdn-cache-divergence"; fi

# --- 7. write the aggregated verdict ----------------------------------------
redis-cli -u "$REDIS_URL" SET "probe:${SERVER_SLUG}:verdict" \
  "{\"window\":$WINDOW_START,\"verdict\":\"$verdict\",\"step\":\"$fail_step\",\"failed_regions\":$fail_count,\"reporters\":$n,\"divergence\":\"$divergence\"}" \
  EX 86400 >/dev/null

# --- 8. fire alert at red, log at amber, no-op at green --------------------
case "$verdict" in
  red)    /usr/local/bin/alert-fire   "$SERVER_SLUG" "$fail_step" "$fail_count" ;;
  amber)  /usr/local/bin/alert-log    "$SERVER_SLUG" "$fail_step" "$fail_count" ;;
  green)  : ;;
esac

What the recipe is doing, in plain English: invoke the credentialed probe in every region in parallel; bound the wait so a hung region can't stall the whole batch; write each region's per-step results to a window-keyed Redis entry where SET NX makes the first writer win; once all reporting regions have written, count failures by step across regions; emit green/amber/red per the two-of-N rule; check the per-region tool-list hashes for cross-region divergence; write the verdict; route alerts according to severity. About 80 lines of bash, plus the credentialed-probe script from post #6 deployed to each runner. The total operational footprint is one Redis instance, one shared alert-fire binary, and one cron entry per region.
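One way to wire the cron entry on the aggregator host — the paths, the Redis URL, and the log location are placeholders for your own deployment:

```shell
# m h dom mon dow  command — run the wrapper once per minute. The window
# maths inside the script keys every write to the current 60-second window,
# and SET NX makes the first writer win, so an overlapping run is harmless.
* * * * *  REDIS_URL=redis://state.internal:6379 /usr/local/bin/multi-region-mcp-probe.sh >> /var/log/mcp-probe.log 2>&1
```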

What "AliveMCP Author tier" handles for you

The shell recipe above is intentionally minimal so anyone can audit it and run it themselves. It is also approximately the same logic that runs in the AliveMCP collector for every endpoint we monitor, with a few extras the indie-author tier wires in by default rather than asking the author to operate: the credential vault and per-region token replication, a three-region probe pool, per-region tool-list hashing, and the CDN-cache-divergence alert tier.

The honest summary: the multi-region probe is straightforward to build but has half a dozen distributed-systems edges (clock skew, write races, hash-partitioning) that take a quarter to find and another to wire correctly. The Author tier exists for indie authors who would rather pay $9/mo than spend that quarter; the shell recipe exists for everyone who wants to start tonight and decide later whether the $9/mo is worth it.

How this fits the rest of the AliveMCP probe stack

The multi-region probe is the second layer of the practical-routine series. The credentialed probe is the per-region atom; this post wraps that atom in geographic redundancy. The full monitoring page walks through how the per-region and aggregated states sit on the same dashboard panel; the short version is "the global verdict is the headline number, the per-region breakdown is the drill-down, and the cross-region hash-divergence alert is its own row."

For the Q3 2026 audit re-run (mid-July), we're going to run the audit from all five regions in parallel — not just to refresh the bucket numbers, but to surface the regionally-degraded bucket explicitly. The expectation is that 3-5% of servers in the Q2 healthy bucket will move into the new regionally-degraded bucket, and that ~1% of servers in the Q2 hard-down bucket will move into "regionally up" (servers that looked dead from us-east but are responding from EU or APAC — typically because the operator deployed origin-side fixes that propagated to one region's CDN POPs first). Either way the bucket map gets less ambiguous, which is the point.

What we'll cover next

This is post #7 in the Q2-audit-driven series and the second of the practical-routine sub-series. Posts #1-#5 covered the audit, the seven failure modes, the JSON-RPC probe, the schema-drift detector, and the auth primer. Post #6 was the credentialed-probe walkthrough; this post extends it across regions. Up next: the Q3 2026 registry audit (mid-July re-run, with bucket-by-bucket movement vs Q2 and the new regionally-degraded bucket explicitly enumerated), and a follow-up walkthrough on the public status-page surface area — what to publish, what to keep internal, and how the per-region state map should render for users with no infra context.

If you operate an MCP server and want multi-region probing wired up without operating five region-local probe runners and a shared-state Redis, claim your listing on the public dashboard. The Author tier covers the credential vault, three-region probe pool by default, per-region tool-list hashing, and the CDN-cache-divergence alert tier. $9/mo for Slack or webhook delivery the moment any of the eight steps fail in two or more regions concurrently, with the failed step and the failing regions on the alert payload so on-call doesn't have to guess.

Join the waitlist

Further reading