Deep dive · 2026-04-29 · Status pages
Public status page for an MCP server — the surface-area walkthrough
The probe stack we built across the credentialed probe walkthrough and the multi-region wrapper emits a verdict every 60 seconds for every endpoint, in five regions, with the failed step and the failing region attached. That verdict is the truth. The status page is the shape of that truth that a non-technical reader can read in under five seconds and act on. Most teams who ship a status page for the first time get the surface area wrong — they publish too much, in the wrong vocabulary, with no edit policy, and the page becomes a source of confusion instead of trust. This post is the practical walkthrough for the last layer of the practical-routine series: what to render publicly from the shared-state Redis, what to keep internal, the five questions a reader actually needs answered, the per-region state-map UX that doesn't require knowing what an ASN is, the incident-card schema, the subscription model that doesn't spam, and the static-render recipe that turns the verdict into HTML on a 60-second cadence with no extra infrastructure.
TL;DR
A public MCP status page should answer five questions: is it working right now, where is it broken if anywhere, has anything been broken in the last 24 hours, are the operator and the system aware, and how do I get notified if it changes. Render those five answers above the fold; nothing else has to be there. Drive the page off the same shared-state Redis the multi-region probe writes to — re-render a static HTML file every 60 seconds and serve it from the same origin as the rest of your product. Show per-region state with a 5-cell strip of green/amber/red dots, labelled by city not region code (London not eu-west-2), and use exactly three colours: working, working slowly or partially, not working. Never show ASNs, never show CDN POP names, never show internal probe-step numbers, never show stack traces, never show the probe credential's expiry. Hide one server's auth-walled state behind one binary publicly-visible flag (responding) and elaborate in the operator-only internal view. Subscriptions are opt-in email or webhook, debounced to incident-creation, status-change, and resolution only — never same-incident heartbeat. The whole status page can ship as one ~250-line static-render bash + jq script that reads the shared Redis and writes status.html, plus a thin Caddy route to serve it from status.yourdomain.com. The recipe is at the end.
The five questions a reader needs answered (and what they don't need)
A status page exists to answer questions. The status pages that work answer five specific ones, all visible above the fold; the status pages that fail try to answer either fewer (just a green dot) or vastly more (every infra metric the operator has, presented to a reader who can't interpret them). The list:
- Is it working right now? One sentence, three states. All MCP services operational. / Some MCP requests are slow or failing. / MCP services are unavailable. A sentence the reader can read in two seconds and that resolves their primary question.
- Where is it broken, if anywhere? A five-region strip with one labelled cell per region. The reader doesn't know what ap-southeast-1 means and doesn't need to. They know what Singapore means, and if their agent is failing from the EU and the EU cell is amber, they have their answer.
- Has anything been broken in the last 24 hours? A 24-hour bar at minute resolution. Each minute is a coloured rectangle. If the last 24 hours has been entirely green, that bar is the strongest piece of social proof the page can carry; if it has a red gap from 14:02 to 14:47 yesterday, that's something the reader's incident review will point at, and you'd rather they pointed at your honest record than guessed.
- Are the operator and the system aware? An incident card. If the page is amber or red and there's no incident card open, the reader assumes the alert hasn't fired yet, and that erodes trust faster than the outage itself. The card needs three things: detection time, current state, next-update time. Nothing else above the fold.
- How do I get notified if it changes? A subscribe link, with email or webhook. The subscription is opt-in and debounced (more on this in the subscription section); the link is small but unmissable, in the top right.
Things that do not belong above the fold, regardless of how interesting they are to operate: protocol version negotiation results, individual probe-step numbers, per-tool latency percentiles, CDN POP coverage maps, BGP route diagnostics, internal alert-routing topology, the operator's pager schedule, the probe credential's expiry timer, the canonical-JSON tool-list hash, build SHAs, deploy event log, regional CDN cache hit rates. All of these are valuable in the internal view; none of them help a reader answer the five questions above. Push them below the fold or hide them on an operator-authenticated subdomain.
The clean test: ask three non-engineers to load the page on their phone and tell you whether the service is working. If two out of three can answer correctly within ten seconds, the surface area is right. If anyone has to scroll, ask, or interpret, the surface area is wrong.
The status-page state machine — three states, one truth source
The two-of-N aggregation rule from the multi-region probe walkthrough emits one of three verdicts per minute: green (all regions pass), amber (exactly one region fails), red (two or more regions fail concurrently). The status page renders those three states verbatim, with no extra states layered on top:
- Green — All MCP services operational. Every region passes every step. Headline copy is exactly that sentence.
- Amber — Some MCP requests are slow or failing. One region fails, others pass. The headline copy names the affected region in plain language: Some MCP requests from Europe are slow or failing. If the failure is the credentialed-probe step (auth-walled), the copy degrades gracefully to Some private requests from Europe are not being authenticated.
- Red — MCP services are unavailable. Two or more regions fail concurrently. The headline drops the regional qualifier — at red, the failure is global from the user's perspective and the regional detail goes into the incident card body.
The headline copy is templated; the status page generator pulls one of three templates, fills in the affected-region name, and renders. Avoid creative wording. "Operational", "degraded", and "down" beat "nominal", "elevated error rates", and "incident in progress" for non-engineering readers; the cleverer the language, the more interpretation the reader has to do.
One state to explicitly not introduce: maintenance. Maintenance windows belong in the incident log as a scheduled-incident card, not as a fourth headline state. The reason is that maintenance is operationally indistinguishable from amber for the user — their request still doesn't work — and giving it a special colour invites the user to think the failure mode is somehow not the operator's responsibility. Render it as a regular amber or red incident with a different incident-card icon (scheduled) and the same five-question answer surface.
The per-region state map — UX for readers who don't know what an ASN is
The five-region probe deployment from post #7 produces five per-region state cells per minute. The temptation is to render those cells with their region codes — us-east-1, eu-west-2, ap-southeast-1 — because that's how the operator thinks about them. Don't. The status page reader is not your SRE. They are a hobbyist agent author, an indie hacker integrating your MCP into their app, an enterprise developer with a pager going off, or a curious user clicking the status-page link from your README. They know what cities are. They know roughly where their users live. They do not know what ap-southeast-1 resolves to.
The render rule:
- us-east-1 → New York
- us-west-2 → Oregon (or San Francisco if the reader is more likely west-coast-tech)
- eu-west-2 → London
- ap-southeast-1 → Singapore
- sa-east-1 → São Paulo
Render the five cells as a horizontal strip, labelled with the city name, coloured green/amber/red. Each cell is clickable for users who want a per-region detail view (latency last 24h, last incident in this region) but the click-through is optional context — never required. The mental model: the strip is at the resolution of does the service work where I am, and that's the resolution the reader cares about.
Two things to avoid in the strip:
- Don't show internal failure-step detail in the cell tooltip. Hovering on the amber London cell should reveal Some requests from London are slow, not step 5 (tools/list) failing with JSON-RPC -32603 since 14:02. The internal detail belongs in the operator view; the reader doesn't have the context to interpret step 5.
- Don't show CDN POP detail. If your reader is in Manchester and the EU edge cache they hit is divergent from the rest of the EU edges, the right thing to show is some requests from London are slow, not cdn-cache-divergence on cloudflare-mma01-pop. The operator's view shows the POP detail because the operator's job is to fix it; the reader's job is to know that London is amber.
The 24-hour bar — minute-resolution honest history
The third question — has anything been broken in the last 24 hours — is answered by a single visual element: a horizontal bar with one minute per cell, 1,440 cells in total, coloured green/amber/red. Each cell is the global verdict for that minute (the same two-of-N rule output). The bar is a sliding window: the current minute is the right edge, and the left edge is the same minute yesterday.
Three rules for honest rendering:
- Don't smooth. If a five-minute incident happened at 03:14 last night, render five red minutes at 03:14 last night. Smoothing those five minutes into a single average colour lies about both the duration and the recovery time. The incident card carries the summary; the bar carries the raw history.
- Don't backdate. If you fix an incident and want to relabel it from red to amber after-the-fact because root-cause analysis revealed it was always partial, fight the urge. The bar is what the probe saw at the time. The retrospective belongs in the incident card's post-mortem text. Backdating the bar is the cleanest way to lose reader trust permanently.
- Don't aggregate downtime into "uptime percentages." The 99.9% uptime this week banner is a lossy summary that hides whether the missing 0.1% was one ten-minute outage at 3am or a thousand thirty-second outages distributed evenly. Show the bar; let the reader compute their own summary if they want one. If you must show a number, show minutes affected in the last 24 hours, which is a count, not a percentage, and which the reader can interpret directly.
Storage cost: each cell is one colour code, 24 hours × 60 minutes = 1,440 cells per endpoint per day, retain 30 days for the public bar plus the bigger uptime chart. That's 43,200 cells per endpoint, easily a single Redis hash per endpoint with epoch-minute keys. Replicating the storage scheme from the shared-state design: one Redis hash per endpoint, one field per minute, value is the global verdict letter (g / a / r). The status-page renderer reads the last 1,440 fields and emits the bar.
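For concreteness, a minimal sketch of the write side, assuming the probe:${SERVER_SLUG}:minute-bar hash naming the renderer at the end of this post reads (the key names are this series' convention, not a fixed schema):
#!/usr/bin/env bash
# write-minute-cell.sh — sketch of the aggregator's write after each 60s verdict.
# Assumes REDIS_URL, SERVER_SLUG, and verdict (green|amber|red) are set by the
# surrounding probe job; key names mirror render-status-page.sh below.
set -euo pipefail
minute_epoch=$(( $(date -u +%s) / 60 * 60 ))
case "$verdict" in green) letter=g ;; amber) letter=a ;; *) letter=r ;; esac
# One hash field per minute; the value is the single verdict letter.
redis-cli -u "$REDIS_URL" \
HSET "probe:${SERVER_SLUG}:minute-bar" "$minute_epoch" "$letter" >/dev/null
# Trim fields older than the 30-day public retention window (~43,200 fields).
cutoff=$(( minute_epoch - 30 * 24 * 60 * 60 ))
for field in $(redis-cli -u "$REDIS_URL" HKEYS "probe:${SERVER_SLUG}:minute-bar"); do
(( field < cutoff )) && redis-cli -u "$REDIS_URL" \
HDEL "probe:${SERVER_SLUG}:minute-bar" "$field" >/dev/null
done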
What to publish vs what to keep internal
The tightest source of bugs in a first-time public status page is conflating the operator's view of the service with the reader's view. The operator's view exists to debug; the reader's view exists to answer the five questions. Trying to make one view serve both audiences ends up with a page that's too detailed for the reader and not detailed enough for the operator.
The tabular cut, drawing from the data the credentialed probe and the multi-region wrapper already collect:
| Field | Public | Operator-only | Why |
|---|---|---|---|
| Global verdict (green/amber/red) | yes | yes | The headline answer. No interpretation needed. |
| Per-region cell with city label | yes | yes | Answers where is it broken with no jargon. |
| Region code (us-east-1) | no | yes | Internal vocabulary; the city label is enough publicly. |
| Probe step number / name | no | yes | Reader can't interpret step 5 failing. The incident card describes the user impact instead. |
| JSON-RPC error code (-32603) | no | yes | Same. |
| CDN POP name | no | yes | Internal routing detail; surfaces in operator view as cdn-cache-divergence alert tier. |
| Tool-list canonical-JSON hash | no | yes | Internal drift signal; schema drift surfaces as a public incident only when it breaks user requests. |
| p50/p95 latency | yes (per region, headline only) | yes (per step, per region, per tool) | Headline latency answers is it slow; the breakdown is operator-only. |
| Tool count + last-changed timestamp | yes | yes | Helps integrators verify their integration matches the current surface. |
| Probe credential expiry | no | yes | Security-sensitive; goes in the credential watchdog, never the public page. |
| BGP / ASN routing diagnostic | no | yes | Internal; the public-facing failure is captured by the regional cell going amber. |
| 24-hour minute-resolution bar | yes | yes | Honest history. |
| 30-day daily summary chart | yes | yes | Below the fold; the broader history. |
| Incident card (detection / state / next-update) | yes | yes | The operator-and-aware answer. |
| Incident post-mortem | yes (after resolution) | yes | Honest; trust-building. |
| Internal alert routing / pager assignment | no | yes | Operationally sensitive. |
| Stack traces / log lines | no | yes | Security-sensitive (may leak internal hostnames, paths, dependency versions, ID schemas). |
| Auth-walled vs broken split | no (collapse to amber) | yes (split) | The auth-walled classification is internal triage detail; the user just sees some private requests are not authenticating. |
| Build SHA / deploy timeline | no | yes | Internal; correlates incidents to deploys for root-cause but not for reader. |
| Subscribe link | yes | n/a | The reader's escape hatch from polling. |
The honest summary: about a third of the data is on the public page, two-thirds is on the operator view. The public page is intentionally smaller and the temptation is always to add one more chart; resist. The operator view is where the surfaces multiply.
Incident cards — the schema for honest communication
An incident card is the unit of public communication when the headline state goes amber or red. The card has four elements, in this order:
- Title — one sentence, written in user-impact terms. Some MCP requests from Europe are timing out. Not JSON-RPC -32603 errors elevated in eu-west-2 origin.
- Detection time — when the probe first observed the failing state. Render as relative time (20 minutes ago) and absolute UTC. Honesty point: this is when the probe saw it, not when an operator confirmed it. If your probe is 60-second cadence, the worst-case lag is 60 seconds; that is fine and your reader can compute it.
- Current state — one of: investigating, identified, monitoring, resolved. This is the standard four-state Atlassian template and there is no upside in deviating from it. Don't invent "looking", "analyzing", "working on it"; readers who land on multiple status pages benefit from the standard.
- Next-update time — explicit. Next update at 15:45 UTC. If you commit to an update time, hit it; if you can't hit it, post one minute before saying so. The single biggest amplifier of trust on a status page is hitting your own next-update commitments. The single biggest erosion is letting them slip silently.
What does not belong on the public incident card while the incident is open: speculation about root cause, dependencies named, internal blame, time estimates beyond the next-update window. Wait for the post-mortem, which is a separate card pinned to the incident after resolution.
The post-mortem card, written within 72 hours of resolution and before the incident scrolls off the page, has five elements: what happened, what caused it, what we did during the incident, what we'll do to prevent a recurrence, and a timeline with timestamps. Three paragraphs and a timeline; not a 2,000-word incident report. The reader's tolerance for prose drops sharply once the immediate impact has passed; keep it short, keep it honest, and let the timeline do the talking.
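For concreteness, a sketch of what opening an incident card looks like against the shared Redis, using the exact field names the renderer at the end of this post reads (the schema is this series' convention; the title and update window here are illustrative):
# open-incident.sh — push a new incident card onto the open-incidents list.
# Field names match what render-status-page.sh step 5 reads below.
now_human=$(date -u +"%Y-%m-%d %H:%M UTC")
next_human=$(date -u -d '+30 minutes' +"%H:%M UTC")   # GNU date
incident=$(jq -nc \
  --arg title "Some MCP requests from Europe are timing out." \
  --arg detected "$now_human" \
  --arg next_update "$next_human" \
  '{title: $title, detected_at_human: $detected, detected_at_relative: "just now",
    state: "investigating", next_update_at_human: $next_update}')
redis-cli -u "$REDIS_URL" LPUSH "incidents:${SERVER_SLUG}:open" "$incident"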
Subscriptions — opt-in, debounced, three event types only
The subscribe link in the top right of the status page is a load-bearing element. The reader who clicks it has decided to stop polling your page and trust you to tell them when something matters; betraying that trust is one of the easier ways to make them unsubscribe and never come back. The rules:
- Opt-in only. Email confirmation, double-opt-in, the works. No defaulting users to subscribed because they bought from you. Slack-channel webhooks are an additional opt-in channel that scales better than email for team subscribers.
- Three event types, period. Incident created, incident state changed (one of the investigating → identified → monitoring → resolved transitions), incident resolved. That's it. No "still investigating, no update yet"; no "nothing has changed in the last hour". The reader has subscribed precisely so they don't get those.
- Debounce within five minutes. If an incident transitions investigating → identified → monitoring within five minutes, send one email at the end of the window with the full transition log, not three emails. The reader's mailbox is not a chat log.
- Resolved means resolved. Don't reopen-then-resolve an incident every time a flaky regional probe blips. Reopening tells the reader it broke again; re-resolving minutes later teaches them that your resolution emails mean nothing. If a fresh failure happens, open a new incident card; the bar will accurately show the gap.
- Per-component scoping. A reader who only uses the EU edge of your MCP doesn't want pages for São Paulo-only outages. The subscribe form should let them pick which regions / components they want notifications for, defaulting to all but allowing only my regions.
The minimum-viable subscribe backend: one Postgres table (or a SQLite one, since your probe stack is already small), three columns (email, components JSON, confirmed flag), one HTML form on the status page that POSTs to a tiny endpoint that emits a confirmation email via your transactional sender of choice. The notification fan-out runs on the same cadence as the incident-card update job: once per minute, query for incidents that have changed state in the last 60 seconds, generate one email per subscriber whose component list intersects the incident's affected components, send. The whole subsystem is roughly 100 lines of code and the operational footprint is one table, one endpoint, one cron. A sketch of the fan-out pass is below.
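A minimal sketch of that fan-out pass, assuming the SQLite variant and an incident record that carries a components array (the table layout, the state-changed list name, and the send-incident-email wrapper are illustrative, not a fixed interface):
# subscribe-fanout.sh — one pass per minute over state-changed incidents.
# Assumes: sqlite3 subscribers.db with
#   CREATE TABLE subscribers (email TEXT, components_json TEXT, confirmed INTEGER);
changed=$(redis-cli -u "$REDIS_URL" \
  LRANGE "incidents:${SERVER_SLUG}:state-changed:1m" 0 -1)
while read -r incident; do
  [[ -z "$incident" ]] && continue
  affected=$(echo "$incident" | jq -c '.components')    # e.g. ["eu-west-2"]
  sqlite3 subscribers.db \
    "SELECT email, components_json FROM subscribers WHERE confirmed = 1;" |
  while IFS='|' read -r email components; do
    # Notify only when the subscriber's component list intersects the incident's.
    overlap=$(jq -n --argjson a "$affected" --argjson b "$components" \
      '[$a[] | select(. as $x | $b | index($x))] | length')
    (( overlap > 0 )) && send-incident-email "$email" "$incident"  # hypothetical sender wrapper
  done
done <<< "$changed"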
The auth-walled question — collapsing internal triage detail to a public binary
The auth primer classified MCP servers into three triage states based on what the credentialed probe sees: healthy (probe credential authenticated and tools list returned), auth-walled (initialize succeeded, every tool call 401'd or returned -32001), and broken (the server itself was unreachable or returned a hard error). The internal operator view distinguishes all three; the public view collapses two of them.
The collapse:
- healthy → public renders green, headline is operational.
- broken → public renders red if the breakage is in two-or-more regions, amber otherwise. Headline is unavailable or some requests slow or failing.
- auth-walled → public renders amber with a special headline: Some private requests are not being authenticated. The reader doesn't need the JSON-RPC -32001 code; they need to know whether their integration is the one affected. Surface the affected scope (private endpoints only) but not the auth method or the credential rotation event.
The reason for this care: auth-walled is an internal classification that exists to help the operator triage is this our auth or their auth. Publishing it directly to readers using the same word would invite questions the public page can't answer (walled by what?) and would imply more security context than the page is meant to carry. Translating it to some private requests are not authenticating answers the user's actual question (does my integration work) without exposing the underlying auth-method debugging detail.
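As a pure mapping, the collapse looks like the sketch below, assuming the probe records a triage field alongside the verdict (that field name is an assumption; the full recipe at the end uses the failing step as a proxy instead):
# Collapse internal triage state to a public (state, headline) pair.
triage=$(echo "$verdict_blob" | jq -r '.triage // "healthy"')   # hypothetical field
case "$triage" in
healthy)
public_state="green"
public_copy="All ${SERVER_NAME} services operational." ;;
auth-walled)
public_state="amber"
public_copy="Some private ${SERVER_NAME} requests are not being authenticated." ;;
broken)
if (( fail_count >= 2 )); then   # two-of-N rule: red only on concurrent failures
public_state="red";  public_copy="${SERVER_NAME} services are unavailable."
else
public_state="amber"; public_copy="Some ${SERVER_NAME} requests are slow or failing."
fi ;;
esac
# The operator view renders $triage verbatim; the public page only ever
# sees $public_state and $public_copy.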
The render cadence — why every 60 seconds is the right cadence
The status-page render job runs on the same 60-second tick as the probe. The probe writes a verdict to Redis at the top of every minute; the renderer reads the latest verdict, the last 1,440 minute-cells, the open incidents, and the per-region cells, and writes one HTML file. That HTML is served as a static file by the same Caddy that serves the rest of the product (or by Cloudflare R2 with edge caching, or by S3 + CloudFront, depending on your stack). No live-rendering, no JavaScript polling, no WebSockets.
Three reasons static-render-every-60-seconds is the right cadence:
- Survives traffic spikes. The status page is the URL traffic explodes to during incidents. A static file behind a CDN can serve millions of requests at the cost of $0 and the render job is unaffected by the traffic. A live-rendering page that hits Redis on every request can fall over in exactly the moment it's most needed.
- Probe cadence and render cadence align naturally. Rendering more often than the probe writes is wasted work; rendering less often is showing stale state. 60 seconds is the cadence both ends are already running on.
- The reader's expectations match the cadence. A reader looking at a status page during an incident is fine refreshing once a minute; they aren't expecting real-time push and most don't notice the absence of it. The page renders the timestamp prominently (updated 14:23 UTC, 38 seconds ago) so the reader can compute their own staleness.
One small refinement: during an active incident, push the render cadence to once every 30 seconds and surface a live indicator in the timestamp. The probe still writes once per minute, so the second render usually carries identical state, but the reader's perception of staleness drops, and during incidents that perception matters more than between them. A wrapper sketch follows.
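Cron's floor is one minute, so the 30-second incident cadence needs a small wrapper around the renderer; a sketch, using the open-incident list length as the signal (an assumption consistent with step 5 of the recipe below):
#!/usr/bin/env bash
# render-tick.sh — invoked by cron once per minute. Renders once always,
# then a second time 30 seconds later only while an incident is open.
set -euo pipefail
/usr/local/bin/render-status-page.sh
open=$(redis-cli -u "${REDIS_URL:-redis://localhost:6379}" \
  LLEN "incidents:${SERVER_SLUG:-example-mcp}:open")
if (( open > 0 )); then
  sleep 30
  /usr/local/bin/render-status-page.sh
fi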
The shell recipe — static-render the status page from shared Redis
What follows is the structure of the status-page renderer, written as a single bash script that reads the shared Redis populated by the multi-region probe and writes status.html to a directory served by Caddy. The recipe assumes the probe is already writing per-window verdicts and per-region cells under the schema described in post #7; substitute the variables at the top with your endpoint slug, your status subdomain, and your Redis URL. Roughly 250 lines of bash + jq; the operational footprint is one cron entry, one shared Redis, one static-file directory.
#!/usr/bin/env bash
# render-status-page.sh — turns the multi-region probe's shared Redis into one
# static HTML page served from status.yourdomain.com.
# Dependencies: bash 4+, redis-cli, jq, date (GNU coreutils), envsubst.
set -euo pipefail
# --- config -----------------------------------------------------------------
SERVER_SLUG="${SERVER_SLUG:-example-mcp}"
SERVER_NAME="${SERVER_NAME:-Example MCP}"
REDIS_URL="${REDIS_URL:-redis://localhost:6379}"
STATUS_DIR="${STATUS_DIR:-/var/www/status}"
TPL_DIR="${TPL_DIR:-/etc/render-status/templates}"
WINDOW_SEC=60
NOW=$(date -u +%s)
NOW_HUMAN=$(date -u +"%Y-%m-%d %H:%M:%S UTC")
REGIONS_PUBLIC=("us-east-1:New York" "us-west-2:Oregon" \
"eu-west-2:London" "ap-southeast-1:Singapore" \
"sa-east-1:São Paulo")
# --- 1. read latest global verdict ------------------------------------------
verdict_blob=$(redis-cli -u "$REDIS_URL" GET "probe:${SERVER_SLUG}:verdict")
verdict=$(echo "$verdict_blob" | jq -r '.verdict') # green | amber | red
fail_step=$(echo "$verdict_blob" | jq -r '.step')            # failing probe step; internal only
fail_count=$(echo "$verdict_blob" | jq -r '.failed_regions') # regions failing this window
divergence=$(echo "$verdict_blob" | jq -r '.divergence')     # operator-view detail; never rendered publicly
# --- 2. translate verdict to public headline (templated, not generated) -----
case "$verdict" in
green) headline="All ${SERVER_NAME} services operational." ;;
amber) headline="Some ${SERVER_NAME} requests are slow or failing." ;;
red) headline="${SERVER_NAME} services are unavailable." ;;
esac
# auth-walled override (collapse internal classification to user-facing copy)
if [[ "$fail_step" == "tools/call" && "$verdict" == "amber" ]]; then
headline="Some private ${SERVER_NAME} requests are not being authenticated."
fi
# --- 3. read per-region cells, render city-labeled strip --------------------
region_strip=""
for pair in "${REGIONS_PUBLIC[@]}"; do
region_code="${pair%%:*}"
city="${pair##*:}"
cell=$(redis-cli -u "$REDIS_URL" \
GET "probe:${SERVER_SLUG}:current:${region_code}")
cell_state=$(echo "$cell" | jq -r '.cell_state') # g | a | r
region_strip+="<li class=\"${cell_state}\">${city}</li>"
done
# --- 4. read 1,440 minute-cells from the last 24h (epoch-minute keys) -------
minute_bar=""
for offset in $(seq 1439 -1 0); do
minute_epoch=$(( (NOW / WINDOW_SEC - offset) * WINDOW_SEC ))
# One HGET per cell keeps the sketch simple; swap in a single HMGET over all
# 1,440 fields if the per-render round-trips ever dominate render time.
cell=$(redis-cli -u "$REDIS_URL" \
HGET "probe:${SERVER_SLUG}:minute-bar" "$minute_epoch")
case "${cell:-g}" in
r) minute_bar+='<i class="r"></i>' ;;
a) minute_bar+='<i class="a"></i>' ;;
*) minute_bar+='<i class="g"></i>' ;;
esac
done
# --- 5. read open incidents (most recent first, public-only fields) ---------
incidents=$(redis-cli -u "$REDIS_URL" LRANGE "incidents:${SERVER_SLUG}:open" 0 5)
incident_html=""
while read -r incident; do
[[ -z "$incident" ]] && continue
title=$(echo "$incident" | jq -r '.title')
detected=$(echo "$incident" | jq -r '.detected_at_human')
detected_rel=$(echo "$incident" | jq -r '.detected_at_relative')
state=$(echo "$incident" | jq -r '.state') # investigating | identified | monitoring | resolved
next_update=$(echo "$incident" | jq -r '.next_update_at_human')
incident_html+="<article class=\"incident\">
<h3>${title}</h3>
<p>Detected ${detected_rel} (${detected})</p>
<p>Status: ${state}</p>
<p>Next update at ${next_update}</p>
</article>"
done <<< "$incidents"
[[ -z "$incident_html" ]] && incident_html="<p>No open incidents.</p>"
# --- 6. read latency headline (p50, p95 last 24h, single number per region) -
latency_html=""
for pair in "${REGIONS_PUBLIC[@]}"; do
region_code="${pair%%:*}"
city="${pair##*:}"
p50=$(redis-cli -u "$REDIS_URL" GET "lat:${SERVER_SLUG}:${region_code}:p50")
p95=$(redis-cli -u "$REDIS_URL" GET "lat:${SERVER_SLUG}:${region_code}:p95")
latency_html+="<li>${city}: p50 ${p50}ms · p95 ${p95}ms</li>"
done
# --- 7. tool surface (count + last-changed) ---------------------------------
tool_count=$(redis-cli -u "$REDIS_URL" GET "tools:${SERVER_SLUG}:count")
tool_changed=$(redis-cli -u "$REDIS_URL" GET "tools:${SERVER_SLUG}:last-changed-human")
# --- 8. render the templated HTML -------------------------------------------
export SERVER_NAME HEADLINE="$headline" VERDICT="$verdict" \
NOW_HUMAN REGION_STRIP="$region_strip" MINUTE_BAR="$minute_bar" \
INCIDENT_HTML="$incident_html" LATENCY_HTML="$latency_html" \
TOOL_COUNT="$tool_count" TOOL_CHANGED="$tool_changed"
mkdir -p "$STATUS_DIR"
envsubst < "${TPL_DIR}/status.html.tpl" > "${STATUS_DIR}/status.html.tmp"
mv "${STATUS_DIR}/status.html.tmp" "${STATUS_DIR}/status.html"
# --- 9. emit subscription-fanout side-effect (one per minute, debounced) ----
# Reads incidents:${SERVER_SLUG}:state-changed:1m, fans out to subscribers.
# Implementation in render-subscribe-fanout.sh; called as a sibling job.
/usr/local/bin/render-subscribe-fanout.sh "$SERVER_SLUG" &
The HTML template (status.html.tpl) is intentionally minimal: one headline element with the $VERDICT class for colour, one <ul> for the region strip, one <div> for the minute bar (each <i> styled as a 1-pixel-wide block), one <section> for incidents, one <section> for latency. Three CSS classes (g, a, r) drive every colour. A couple of screenfuls of CSS, no JavaScript on the public page, and total transfer typically under 30KB. The page can be cached for the full minute by Caddy with the header Cache-Control "public, max-age=60"; readers behind any proxy or CDN see consistent state during their visit.
What the recipe is doing, in plain English: read the latest global verdict from Redis, translate it to a templated headline using one of three sentences (with a special case for the auth-walled overlay), read each region's current cell and render it with the city label, read the last 1,440 minute-cells and render them as the 24-hour bar, read open incidents and render the four-element card per incident, read the latency-headline pair per region, read the current tool surface count, and render one HTML file. Cron the script every 60 seconds (every 30 during open incidents); serve status.html from status.yourdomain.com behind your existing CDN. Subscription fan-out is a sibling job that reads a separate state-changed list from Redis and emits opt-in emails — debounced as described above.
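The serving side is one Caddy site block; a sketch, assuming the STATUS_DIR default from the recipe (standard Caddyfile v2 directives):
# Caddyfile — serve the rendered page from status.yourdomain.com.
status.yourdomain.com {
    root * /var/www/status
    header Cache-Control "public, max-age=60"
    rewrite / /status.html
    file_server
}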
What "AliveMCP Team tier" handles for you
The shell recipe above is intentionally minimal so anyone can audit it and run it themselves. It is also approximately the same logic that ships with the AliveMCP Team tier's hosted status page, with the operational extras the indie author would otherwise need to wire by hand:
- Hosted at status.yourdomain.com on a CNAME you point at us. No render job to operate, no Redis to manage, no template directory, no cron — the page reflects the same probe state your uptime monitoring dashboard does.
- City-labelled per-region strip by default. Five regions on Team, three on Author; both render with city labels, both honour the public/internal cut from the table above.
- Subscription system pre-wired. Email confirmation, debounce, per-region scoping, three-event-type fan-out, and the opt-out link in every notification footer.
- Incident card lifecycle managed. The four-state machine (investigating / identified / monitoring / resolved), the next-update commitment timestamp, the post-mortem template, the 72-hour retention before the card scrolls into the archive — all default behaviour.
- The auth-walled collapse. Internal triage state is invisible publicly; the public copy is some private requests are not being authenticated, the operator view shows the full classification.
- Static rendering with sub-30KB pages, sub-second TTFB, behind a global CDN. Survives the traffic spike that is the incident.
The honest summary: a public status page is straightforward to build but has half a dozen judgement calls (city labels not region codes, three states not four, debounced subscriptions not heartbeat, auth-walled collapse) where the wrong call erodes reader trust faster than any uptime number can rebuild it. The Team tier exists for operators who would rather pay $49/mo for the judgement calls already made than spend a quarter rediscovering them. The recipe exists for everyone who wants to start tonight and decide later whether the $49/mo is worth it.
How this fits the rest of the AliveMCP probe stack
The status page is the third layer of the practical-routine series. The credentialed probe is the per-region atom; the multi-region wrapper aggregates per-region atoms into one verdict; this post turns that verdict into a public surface a non-technical reader can read in five seconds. The three posts together are the practical-routine spine: probe → aggregate → publish. Anything else on the AliveMCP roadmap is either a refinement of one of these layers or a new question the same data answers (the uptime API answers can my CI block on this server's verdict; the embed widget answers can I show this on my own README; the alerts integration answers can my team wake up when this changes).
The same layering applies to the operator view — built from the same probe state, same regions, same incident cards, but with the public/internal cut inverted. Operators see what readers see plus the internal-only fields. Readers never see the operator-only fields. The cleanest test for whether you've got the cut right is the screenshots: take a screenshot of the public page during an incident, take a screenshot of the operator view of the same incident, and ask whether anything on the public page would have been better omitted, and whether anything on the operator view would have been useful to the reader. If the answer to both is no, the cut is right.
What we'll cover next
This is post #8 in the Q2-audit-driven series and the third (and likely the final) post in the practical-routine sub-series. Posts #1-#5 covered the audit, the seven failure modes, the JSON-RPC probe, the schema-drift detector, and the auth primer; posts #6-#8 covered the credentialed probe, the multi-region wrapper, and this status-page surface area. Up next: the Q3 2026 registry audit (mid-July re-run, the headline numbers refresh with bucket-by-bucket movement vs Q2 plus the new regionally degraded bucket the multi-region rollout is built to surface), and a sub-series on uptime APIs and badges — the read-side of the same data the status page is built on, but for integrators who want to embed AliveMCP state directly into their CI, README, or runtime guards.
If you operate an MCP server and want a hosted status page on status.yourdomain.com without running your own render job, claim your listing on the public dashboard. The Team tier covers the hosted status page, the subscription system, the city-labelled per-region strip, the incident-card lifecycle, and the static-render CDN delivery. $49/mo for the package, billed monthly, cancel anytime.
Further reading
- Multi-region MCP probe deployment — the per-region probe state this post renders publicly. Read first if you haven't wired the multi-region wrapper.
- Running a credentialed MCP health check, end to end — the per-region atom; the auth-walled classification this post collapses to public copy comes from this layer.
- MCP authentication primer — the four-posture decision tree behind the auth-walled classification and the some private requests are not being authenticated public-copy collapse.
- Schema drift in MCP tool definitions — the canonical-JSON tool-list hash is internal-only on the public page; surfaces as an incident only when a user-visible request fails.
- JSON-RPC health checks vs HTTP probes — why the headline copy is operational and not HTTP 200; the protocol layer is the layer that has to be honest.
- Why MCP servers die silently — 7 failure modes — the failure taxonomy that drives which incident-card titles to template.
- State of the MCP Registry — Q2 2026 — the audit that motivated the whole series; the public-status-page rollout is its operational product.
- MCP server status page — the buyer's-guide overview — the companion overview page; this post is the practical walkthrough of what's on that page's checklist.
- MCP server health check — probe sequence explained — the probe layer that feeds the status page's underlying state.
- MCP server Slack alerts — payload shape — the subscription channel for teams that prefer Slack to email; same debounce rules apply.
- MCP endpoint not responding — diagnostic walkthrough — the user-side companion when the public status page is amber from one region; what to do while you wait for the operator's incident card to update.
- Check if an MCP server is alive — the human-grade single-curl test the status page automates from five regions; the page is the social-proof layer above this curl.
- MCP server uptime monitoring — the whole stack — the monitoring stack the status page is the public face of.
- Monitoring an MCP server — signals worth watching — the dashboard layer that the public status page is the read-only export of.
- MCP server uptime API — the next layer down; what to expose machine-readable in addition to the human-readable status page.
- How to monitor an MCP server — step by step — the on-ramp for operators who haven't yet wired even the single-region probe; the status page is the destination, the on-ramp gets you to the start of the runway.
- MCP monitoring tool — buyer's evaluation checklist — public status-page coverage is one of the line items; this post is why it matters.
- MCP registry uptime — the ecosystem-level page that aggregates every public MCP server's status; the per-server page is the leaf, this is the index.
- UptimeRobot vs AliveMCP — UptimeRobot ships generic status pages by default; what they don't ship is an MCP-protocol-aware status page with the auth-walled collapse and the city-labelled per-region cut.