Deep dive · 2026-04-29 · Read-side integrations
MCP uptime API and embeddable badge — the read-side walkthrough
The probe stack we built across the credentialed probe, the multi-region wrapper, and the public status page emits one verdict per minute, in five regions, with the failed step and failing region attached, and renders it for a non-technical reader. That covers the human surface. It does not cover the much larger surface that wants to consume the same verdict by machine — README badges, CI guardrails that block a deploy when an upstream MCP is failing, runtime liveness checks inside an agent platform, downstream dashboards that pull the verdict into Grafana or Datadog. Each of those wants the same truth in a different shape, with different polling rates, different cache rules, and a different graceful-degradation policy. This post is the practical walkthrough for that machine surface — the JSON contract, the polling-and-cache rules, the embeddable-badge anatomy, the four canonical read-side integrations, and a copy-pasteable recipe for each.
TL;DR
- The read side of an MCP uptime stack needs one canonical JSON endpoint per server (/api/embed-status/<slug> in our shape), one aggregate endpoint per account (the dashboard's read source), and one downloadable widget script (embed.js). The endpoint returns a small fixed shape — state, uptime_30d, p95_ms, last_probe_ago — never more than that, never raw probe steps, never CDN POP names, never the credential's expiry timer.
- Cache it for 60 seconds with a strong ETag on the verdict-minute and Cache-Control: public, max-age=60, stale-while-revalidate=300, so a CI guardrail polling every five seconds during a deploy doesn't melt the origin and a README-rendered badge keeps showing the last good answer through a brief CDN hiccup.
- The badge is one <script> tag, zero deps, ~3.5KB gzipped, mounts on a data-*-configured element, has two styles (card and inline pill) and two themes (dark and light), and fails open to a pending state on any 404, CORS error, or timeout — never silently disappears, never fakes a number.
- CI guardrails block the deploy on state == "down", warn on state == "degraded", pass on state == "up", and fail closed (block) on any non-2xx response from the API by default (configurable to fail open in staging) so a misconfigured probe doesn't paint deploys green.
- Agent platforms running runtime liveness checks should treat the API as advisory, not authoritative — an MCP can be up at probe time and rejecting calls one minute later — and combine it with an in-call circuit breaker.
- The recipe section at the end has the bash CI guardrail (~60 lines), the README badge embed snippet, the runtime liveness check (Node, ~80 lines), and the downstream-dashboard pull pattern (~40 lines).
Why a read-side API is the missing layer
The first three walkthroughs of the practical-routine series end at a publishable surface — the status page renders the per-region probe verdict for human readers, with three colour states, the city-labelled regional map, and the four-element incident card. Most teams stop there because that is the artefact a status page is "for". The problem is that the people who depend on the verdict are not all human readers. Most of them are not. The dependency tree on a healthy MCP server typically looks like:
- The author's README on GitHub, where new users land before they ever touch the server. A plain "this server is alive" badge prevents the most expensive failure mode in the ecosystem — a user pulls a server, follows a tutorial, watches their first call fail, and silently leaves without ever knowing the cause was a half-deployed origin not their config. The author's reputation is on the line; the badge has to be there and it has to be honest.
- Downstream agents' CI pipelines, which deploy code that depends on a third-party MCP. If that MCP is failing at deploy time, the deploy paints green and ten minutes later the agent's first production call fails. The deploy gate has to read upstream uptime before it lets the change land.
- Runtime liveness checks inside agent platforms, which decide whether to route a tool call to MCP A, MCP B, or fall back to no-tool mode. The platform is making this decision a thousand times a minute; it cannot do an in-band probe each time, so it polls a cached uptime API and combines that with an in-call circuit breaker.
- Operations dashboards — Grafana, Datadog, Honeycomb, Sentry — which want the same verdict on the same SRE board as the rest of the team's services. The team that already has a Datadog board doesn't want a separate tab for MCP uptime; they want a panel.
None of those four surfaces is rendered by the status page. Each one wants the same underlying truth — is this MCP server working right now, and what is its trailing performance — but in a different shape, at a different cadence, with a different graceful-degradation policy. That's what the read-side API is for. It is not a separate truth source from the status page; it is the same shared-state Redis the multi-region probe writes to, exposed through a different door, with cache headers and a stable JSON contract attached.
The endpoint contract — small shape, high stability
The first design decision is what the JSON looks like, because every other layer cascades from it. The temptation, when you control the probe stack, is to expose every internal field — the per-step timing, the canonical-JSON tool-list hash, the OAuth-discovery cached responses, the credential expiry tier, the CDN POP, the BGP path. Resist it. The endpoint is read by README badges, CI scripts written six months from now by someone who isn't on the team, and downstream dashboards that will pin one specific schema for a long time. Every field you expose is a field you can never change. Keep the surface small.
Our canonical shape, served from /api/embed-status/<slug>:
GET /api/embed-status/anthropic-server-everything HTTP/1.1
Host: alivemcp.com
Accept: application/json
HTTP/2 200
Content-Type: application/json; charset=utf-8
Cache-Control: public, max-age=60, stale-while-revalidate=300
ETag: "v1-2026-04-29T20:31:00Z"
Vary: Accept-Encoding
Access-Control-Allow-Origin: *
{
"state": "up",
"uptime_30d": 99.87,
"p95_ms": 142,
"last_probe_ago": "37s",
"as_of": "2026-04-29T20:31:00Z"
}
Five fields, three states, two timestamps. state is one of "up", "down", "degraded", mapped one-to-one from the green/amber/red verdict the two-of-N aggregation rule emits. uptime_30d is a percentage rounded to two decimals — never the full distribution, never per-region uptimes, never per-step uptimes; consumers who want those fetch the public status page or the authenticated dashboard. p95_ms is a single integer, milliseconds, computed from the trailing 24-hour window across all regions, integer-rounded — never the histogram. last_probe_ago is a human-readable relative-time string ("37s", "1m", "12m") because every consumer of the read API is going to render this directly to a UI; making them parse a Unix timestamp and format it themselves is the kind of friction that makes the surface go unused. as_of is the ISO-8601 timestamp of the verdict minute, for cache invalidation and freshness checks, and is the only Unix-shaped field.
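The relative-time strings are cheap to produce; a sketch of a formatter in the same shape (thresholds here are illustrative, and the API's real formatter runs server-side):

```javascript
// Format a verdict timestamp as a last_probe_ago-style relative string
// ("37s", "12m", "3h"). Thresholds are illustrative, not the API's
// actual rules.
function relativeAgo(asOfIso, nowMs = Date.now()) {
  const deltaS = Math.max(0, Math.floor((nowMs - Date.parse(asOfIso)) / 1000));
  if (deltaS < 60) return `${deltaS}s`;
  if (deltaS < 3600) return `${Math.floor(deltaS / 60)}m`;
  return `${Math.floor(deltaS / 3600)}h`;
}
```

The point of shipping the string pre-formatted is that every consumer would otherwise write this exact helper, slightly differently, and some would get it wrong.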
The fields that are not in the response, by design:
- Per-region detail. The status page shows the per-region map; the API does not. Consumers who want that fetch /api/status/<slug>/regions (a different shape, different cache) or the status page itself. The badge endpoint stays small.
- The probe step. "Failed at tools/list" is invaluable for the operator and meaningless to the consumer. It changes between probe runs, so it cannot be relied on as a stable contract.
- The CDN POP, ASN, or BGP path. Same reasoning as the status page — those leak infrastructure detail and don't help the consumer make a routing decision.
- The credential expiry timer for credentialed servers. The probe-credential watchdog from the credentialed walkthrough stays internal. A consumer asking "is this MCP working" does not need to know that AliveMCP's probe credential rotates in 73 days; that's our operational concern, not theirs.
- Stack traces, error messages, JSON-RPC error codes. A 4xx error code surfaces as state == "down"; a 5xx error code surfaces as state == "down". The consumer doesn't need to switch on whether it was -32601 or -32603.
The shape is small enough to memorise, stable enough to write a v1 client around, and honest enough that we never have to delete a field. That's the bar for a contract that's going to be consumed by code we don't control.
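A v1 client can start as one parse-and-validate function over the five fields; a sketch (the helper name is ours, the field names come from the contract above):

```javascript
// Validate a v1 embed-status payload into a plain record. Throws on
// anything outside the five-field contract so callers never propagate
// a half-shaped object.
const STATES = new Set(['up', 'down', 'degraded']);

function parseEmbedStatus(json) {
  const { state, uptime_30d, p95_ms, last_probe_ago, as_of } = json;
  if (!STATES.has(state)) throw new Error(`unknown state: ${state}`);
  if (typeof uptime_30d !== 'number') throw new Error('uptime_30d must be a number');
  if (!Number.isInteger(p95_ms)) throw new Error('p95_ms must be an integer');
  if (typeof last_probe_ago !== 'string') throw new Error('last_probe_ago must be a string');
  if (Number.isNaN(Date.parse(as_of))) throw new Error('as_of must be ISO-8601');
  return { state, uptime_30d, p95_ms, last_probe_ago, as_of };
}
```

Because the contract is additive-only, a validator like this stays correct across schema bumps: unknown extra fields are simply dropped.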
Cache-Control, ETag, and the polling-rate negotiation
The read API will be polled. We do not get to choose how often. README badges only fetch on page load; CI guardrails poll every five seconds during a deploy window; agent platforms doing runtime liveness checks poll every fifteen seconds across thousands of servers. Without cache headers, the origin and the probe collector get hammered together every time a deploy ships.
The numbers we serve from /api/embed-status/<slug>:
- Cache-Control: public, max-age=60, stale-while-revalidate=300. The verdict only updates on the minute boundary, so caching for 60 seconds is honest — there isn't a fresher answer to give. The stale-while-revalidate=300 means a CDN with a 60-second-old cached value will keep serving it while it asynchronously re-fetches; for a CI guardrail polling every five seconds, that means the first poll inside a 60-second window pays the origin round-trip and the next eleven get the CDN cache for free.
- ETag: "v1-2026-04-29T20:31:00Z". The ETag is just the verdict-minute timestamp, prefixed with a schema version. If-None-Match requests get a 304 with no body, which is what an agent platform polling every 15 seconds across thousands of servers wants — a thousand 304-and-empty-body responses cost dramatically less bandwidth than a thousand JSON payloads. A misconfigured client that doesn't honour If-None-Match still gets the right answer with a slightly larger bandwidth bill; we do not punish them with rate limits, we just stop sending bytes whenever we can.
- Vary: Accept-Encoding. Every CDN handles this correctly. Without it, a gzip-supporting client and a non-gzip-supporting client share a cache and one of them gets gibberish.
- Access-Control-Allow-Origin: *. The badge is rendered on third-party origins by definition. CORS has to be open. We allow GET only — never POST, never preflight-on-this-route, never auth headers — which keeps the surface secure-by-shape.
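The ETag scheme is simple enough to restate in code; a sketch of the server-side construction and the match check (helper names are ours; strong comparison only, since the tag is strong):

```javascript
// Build the strong ETag for a verdict minute: schema version plus as_of
// timestamp, wrapped in the double quotes HTTP entity-tags require.
function etagForVerdict(asOfIso, schemaVersion = 'v1') {
  return `"${schemaVersion}-${asOfIso}"`;
}

// A conditional GET matches when the client's If-None-Match equals the
// current tag; respond 304 with an empty body in that case.
function shouldSend304(ifNoneMatch, currentEtag) {
  return ifNoneMatch === currentEtag;
}
```

Deriving the tag from the verdict minute (rather than hashing the body) means two regions serving the same minute emit identical tags for free.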
The cache rules are not just polite — they're load-bearing. Without them, a single popular MCP server with the badge on its README and 200 daily README readers spread across five regions would send every one of those page loads to the origin, with crawler traffic on top. With them, the same workload reduces to ~5 origin fetches per minute (one per CDN region per verdict minute) and the rest is CDN.
The recommended client polling rate, by surface:
- README badges — fetch on page load; never poll. Most readers see the badge once.
- CI guardrails — poll every 5–15 seconds during the deploy window only; stop polling once the deploy lands.
- Runtime liveness checks inside agent platforms — poll every 15–60 seconds, ideally with If-None-Match. A 304 is a "no change" signal you can use without re-rendering the cached value.
- Downstream dashboards (Grafana, Datadog scrape) — poll every 30–60 seconds. The dashboard isn't acting on the value in real time; it's plotting a trend.
If a client asks us to poll faster than 5 seconds we do not refuse — we just keep returning 304s most of the time, which the client is welcome to ignore. There's no rate limiting on the read endpoint at the contracted poll rates; we'd rather absorb the load than push consumers off a public surface.
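The 304-handling decision on the client side reduces to one small function; a sketch (the action names are ours):

```javascript
// What a polling client should do with each response status:
// 'keep'   -> 304: nothing changed, just refresh the freshness clock;
// 'update' -> 2xx: parse the body and re-render;
// 'ignore' -> anything else: keep the last good value in place.
function pollAction(status) {
  if (status === 304) return 'keep';
  if (status >= 200 && status < 300) return 'update';
  return 'ignore';
}
```

The 'ignore' branch is what makes fast polling safe: a transient 5xx never clobbers a cached verdict.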
The four read-side integrations and what each of them wants
The four canonical surfaces consume the same JSON in different shapes. Understanding what each one wants is the difference between an API that's used and one that's looked at and abandoned.
1. The README badge. A new user lands on the GitHub README. The author wants a small visual that says either this server is alive or this server is currently down, that links back to a status page where the user can dig in. The badge has to be tiny — the README is already long — and it has to render on first paint without blocking. It must never show a fake green dot if the API is unreachable; it must show a pending state. It must be free to embed and zero-config to add. The implementation we ship is embed.js, ~11KB raw / ~3.5KB gzipped, one script tag, two render styles.
2. The CI guardrail. A team's deploy pipeline runs terraform apply or kubectl rollout, and one of the things the new code touches is an upstream MCP server. The guardrail's job is to fail the deploy if that upstream is currently down, not for any deep reason but because shipping a change while a dependency is in incident is a way to confuse two bugs. The guardrail is a 30–60 line bash script in the pre-deploy stage that fetches the API, parses the state, exits 0 on up, exits 1 on down, and warns-but-passes on degraded. The behaviour on a 5xx from the API itself has to be configurable — fail-closed for production, fail-open for staging — and the script has to log the verdict to the CI artefact log so the post-mortem story is "we shipped at 14:02 when upstream was already amber" not "we shipped at 14:02".
3. The runtime liveness check inside an agent platform. A platform routes tool calls to several third-party MCPs. Before each call, it checks a tiny in-memory cache of "which MCPs are currently working" and skips the ones that aren't. The cache is populated from a 15-second poll of the read API across all configured MCPs. The poll uses If-None-Match, so 90% of the responses are 304s with no body, and the cache lookup on the hot path is O(1). The critical design rule is that the API is advisory, never authoritative — a server can be up at the last probe and reject the next call one minute later — so the platform combines the poll-driven cache with an in-call circuit breaker that opens after three consecutive call failures and stays open for one minute regardless of what the API says.
4. The downstream dashboard. An SRE team already has a Grafana board for the rest of their services. They want one panel for "third-party MCP uptime" so it's on the same screen as the rest of the on-call view. The dashboard polls the API every 30 seconds via a Prometheus scrape config or a Datadog HTTP check, parses state and uptime_30d, and plots them as a coloured-state-line and a trailing-percentage-line respectively. The dashboard does not act on the values; it surfaces them. The rule for this surface is the polling rate: 30 seconds is fine, 5 seconds is wasteful — the dashboard's refresh cadence is the thing that determines the budget here, not the API.
One contract, four surfaces, four polling rates, four graceful-degradation policies. The reason the contract works at all is that the JSON shape is small enough that none of the four surfaces has to bend around someone else's needs — they each pick the two or three fields they care about and ignore the rest.
The embed badge anatomy — what one script tag actually does
The badge looks like one line. It is one line. The HTML on a third-party origin is:
<div id="alivemcp-embed" data-server="my-mcp-slug"></div>
<script async src="https://alivemcp.com/embed.js"></script>
What the script does on load, in order, is:
- Find every mount point. Either id="alivemcp-embed" (the documented pattern) or class="alivemcp-embed" (for pages that want multiple badges). Idempotent — a mount marked data-amcp-done is skipped, so re-running the script doesn't double-render.
- Inject inline CSS once. Two palettes (dark default, light opt-in via data-theme="light") coexist in the same stylesheet. All selectors are namespaced under .amcp-w so styles from the host page don't leak in and the badge styles don't leak out. box-sizing: border-box on every descendant prevents host-page CSS resets from breaking the layout.
- Read configuration off the mount. data-server (the slug), data-style (card or badge), data-theme (dark or light). The slug is sanitised — lowercased, restricted to [a-z0-9._-], truncated to 80 chars — so a mistyped slug never becomes an XSS vector or a server-side path traversal.
- Branch on whether data-server is set. No slug? Render the default CTA card — "Is your MCP server alive?" with a link back to alivemcp.com. The badge is doing double duty as a recruitment surface for new server authors when the embed is on a generic page like a docs site. With a slug? Fetch /api/embed-status/<slug> with a 4-second timeout and an AbortController.
- Render the result. On 2xx with valid JSON, render a state badge with the live values. On any non-2xx, CORS error, parse error, or timeout, render a pending badge — a grey dot, the slug name, "awaiting first probe", and a link to the status page. Never a fake green dot. Never silent disappearance.
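The sanitisation step can be sketched in a few lines (the helper name is ours; embed.js's real implementation may differ in detail):

```javascript
// Sanitise a data-server slug before it reaches a URL path: lowercase,
// drop anything outside [a-z0-9._-], cap at 80 characters. A sketch of
// the rule described above, not embed.js's actual source.
function sanitiseSlug(raw) {
  return String(raw)
    .toLowerCase()
    .replace(/[^a-z0-9._-]/g, '')
    .slice(0, 80);
}
```

Stripping rather than rejecting is a deliberate choice: a slightly mangled slug yields a 404 and a pending badge, which is a recoverable state, where a thrown error would leave an empty div.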
The whole script is ~260 lines, ~11KB raw, ~3.5KB gzipped. Zero dependencies. No build step. Plain ES5 so it works in old browsers without polyfills. The full source is at https://alivemcp.com/embed.js and is meant to be read; nothing is minified, every helper has a name, and the comments explain why the choices were made the way they were.
The two render styles solve different needs. The card is the README header — full width up to 420px, three lines of body, a CTA button, "powered by AliveMCP" attribution at the bottom. The badge is the inline pill — a single line, ~200px wide, designed to fit next to a project title on a docs site or a footer. Both link to the public status page; both carry the attribution. The attribution is non-removable by config — the deal is "free for public MCP servers in exchange for one small link back to alivemcp.com". If a team wants no attribution, they can request a paid licence; in two years of running embed widgets across the factory we have not had that come up enough to add the toggle.
Failure modes specific to the read side (and how to render each one)
The read side has its own failure modes that don't show up on the probe side. The probe side fails by detecting that an MCP is down; the read side fails by losing the answer about whether it is. Each failure mode has a correct render — a wrong choice here erodes trust faster than the original outage would have.
- Stale cache. The CDN is serving a 90-second-old verdict. Mostly fine — the badge says up and the server is up; harmless. But during a state transition, a 90-second-stale cache will report up for the first 90 seconds of an outage. The mitigation is the as_of field; consumers who are doing real-time decisions can compare it to the wall clock and refuse to act on values older than 5 minutes. Most consumers don't, and that's okay — most consumers' read-side decisions tolerate 90 seconds of staleness because that's already the resolution of the underlying probe loop.
- CORS error. A misconfigured CDN or an origin restriction breaks the cross-origin fetch. The badge can't get a value. Render: pending state with the slug name and a link to the status page. Never silently disappear; an empty <div> on a README is more alarming than a grey dot saying "awaiting first probe".
- 404 on the slug. The slug is wrong, or the server hasn't been discovered yet (we crawl the registries hourly; new servers can take up to an hour to show up in the API). Same render as CORS — pending state, link to the status page where the user can search.
- 4-second timeout. The collector is slow, the network is sad, the CDN cache miss took too long. The badge times out. Pending state. The next page load will retry; we don't auto-retry within the same page load because that turns one slow render into a much slower one.
- Schema bump. A future version of the API adds a field. Existing consumers are fine — the contract is additive only, and clients are coded against the small fixed shape. We never delete a field; we only add. If we have to delete, we ship a v2 endpoint at /api/v2/embed-status/<slug> and leave v1 running indefinitely, because v1 is on README badges that the original author has long since stopped maintaining.
- Outage of the read API itself. If /api/embed-status/ is returning 5xx for everyone, that's our incident. The badge renders pending state on 5xx; it does not fail to render. The status page has its own status-of-the-status banner from the meta-monitoring job we've described separately; the badge can stay quiet through the recovery.
The clean rule across all of these: never fake a green dot, never silently disappear, always link to the status page. Three rules, full coverage of the failure surface.
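Those three rules compress into one render decision; a sketch of the mapping (the outcome shape and helper name are ours, for illustration, not embed.js's actual internals):

```javascript
// Map a fetch outcome to a badge render state. Anything that is not a
// clean 2xx carrying a recognised state renders 'pending': never a fake
// green dot, never nothing.
function renderStateFor(outcome) {
  // outcome: { ok: boolean, status?: number, json?: object }
  if (!outcome || !outcome.ok) return 'pending'; // timeout, CORS, network error
  if (outcome.status < 200 || outcome.status >= 300) return 'pending'; // 404, 5xx
  const s = outcome.json && outcome.json.state;
  return s === 'up' || s === 'down' || s === 'degraded' ? s : 'pending';
}
```

Note that an unrecognised state string also renders pending, which is what makes the additive-only schema rule safe on the badge side.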
The CI guardrail — fail-closed by default, log the verdict
The CI guardrail is the read-side integration that's furthest from a UI and closest to consequential. Get the policy wrong and you ship a broken deploy, or you fail to ship a working one. The policy table that survives a real incident:
| API state | HTTP from API | Production | Staging | Log line |
|---|---|---|---|---|
| "up" | 2xx | pass | pass | "upstream slug is up — proceeding" |
| "degraded" | 2xx | warn | warn | "upstream slug is degraded (uptime_30d=99.71%, p95=812ms) — proceeding with annotation" |
| "down" | 2xx | fail | fail | "upstream slug is down (last_probe_ago=4m) — blocking deploy" |
| any | 5xx | fail (closed) | pass (open) | "upstream API returned code — falling back to fail-closed/open per env" |
| any | timeout | fail (closed) | pass (open) | "upstream API timeout after 5s — falling back to fail-closed/open per env" |
Three properties of this policy that aren't obvious until they catch a real incident:
- Fail-closed in production, fail-open in staging. Production deploys are infrequent and high-stakes; failing closed if the API is unreachable is the conservative choice. Staging deploys are frequent and low-stakes; failing open lets the team continue working through transient API hiccups. The script reads an environment variable to decide which it is.
- Annotate, don't block, on degraded. Degraded means one region is slow or failing; the deploy can still proceed but a record of the degraded state should be on the deploy itself for post-mortem context. The script writes the verdict to the deploy's annotations / labels / CI artefact log so the post-mortem six weeks later can see "the deploy at 14:02 happened while upstream was at amber in EU".
- Always log the verdict, even on pass. If the policy is "pass on up", the temptation is to log nothing. Don't — log the verdict and the timestamp on every run. The line "upstream slug is up at 14:02:11Z" in a CI log is what lets you confirm, six months from now, that the upstream was healthy at deploy time and the bug was elsewhere.
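The policy table collapses into a single decision function; a sketch in JavaScript (names are ours; the bash recipe later in this post encodes the same table):

```javascript
// The guardrail policy table as one function. env is 'production' or
// 'staging'; httpOk means the read API answered 2xx. Returns
// 'pass' | 'warn' | 'fail'.
function guardrailVerdict(state, httpOk, env) {
  if (!httpOk) return env === 'production' ? 'fail' : 'pass'; // fail-closed vs fail-open
  if (state === 'up') return 'pass';
  if (state === 'degraded') return 'warn';
  return 'fail'; // 'down' or anything unrecognised
}
```

Treating unrecognised states as fail is the same conservatism as the non-2xx branch: an answer the script doesn't understand should never paint a deploy green.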
The full bash recipe is in the recipe section. ~60 lines, runs in any CI that can call curl and jq, no service account, no auth — the read API is public for the four documented integrations.
The runtime liveness check — advisory, not authoritative
An agent platform routing tool calls across multiple third-party MCPs has the most demanding read-side use case. The platform handles thousands of calls per minute; each call needs a sub-millisecond decision about whether to attempt the upstream MCP or skip it. The naive implementation — fetch the read API on each call — is wrong. Even with the cache headers, a thousand-calls-per-minute client polls the API a thousand times a minute, which is fine for the API but burns ~30ms of latency on each call for the network round-trip.
The correct shape is a poll-driven local cache:
- The platform maintains an in-memory map of {slug: state} for every MCP it routes to.
- A background poller fetches /api/embed-status/<slug> for each configured slug every 15 seconds, with If-None-Match. On 304 (no change), nothing happens. On 200 with new state, the map is updated atomically.
- On the hot call path, the platform reads the local map. up → call the MCP; down → skip and route to the fallback; degraded → call but with a tightened timeout.
- An in-call circuit breaker runs in parallel — three consecutive call failures opens the circuit for one minute regardless of the API state. This is the safety net that catches the "MCP was up at probe time and is down right now" race.
Three rules that prevent this from going wrong:
- Treat the API as advisory, not authoritative. The probe runs every 60 seconds; the call could fail in the 59 seconds between probes. The circuit breaker is the authoritative signal at call time; the API is the optimisation that lets the platform skip the call entirely when it knows it would fail.
- On API outage, fail open at the platform layer. If the read API is down, the platform should not refuse to route any calls — it should fall back to attempting every call and letting the in-call circuit breaker do its job. The opposite policy (refuse all calls on API outage) takes the platform down with the API, which is a much larger blast radius than necessary.
- Cache the verdict, not the JSON body. Memory pressure in a hot agent platform is real; the local map should hold a parsed enum, not the raw JSON. ~32 bytes per entry, hash-mapped, regardless of how many MCPs are configured.
The Node recipe is in the recipe section — ~80 lines, no dependencies, drop-in for any agent platform that already has a routing layer.
The recipes — copy-pasteable, four surfaces, ready to wire up
The four read-side integrations as ready-to-paste recipes. None of them needs a build step, none of them needs an auth token, all of them honour the cache-and-ETag contract. Test against your own slug; the API is public.
1. README badge — one HTML snippet
For static-rendered READMEs (GitHub, Forgejo, GitLab markdown), drop this into an HTML island:
<!-- AliveMCP status badge — see https://alivemcp.com/embed-preview -->
<div id="alivemcp-embed" data-server="my-mcp-slug" data-style="badge" data-theme="dark"></div>
<script async src="https://alivemcp.com/embed.js"></script>
For pure-markdown READMEs that don't allow inline JS, link to a static badge endpoint instead:
[](https://alivemcp.com/status/my-mcp-slug)
The SVG endpoint returns a small status pill rendered server-side, with the same cache headers as the JSON endpoint. It's the fallback for surfaces that won't run JS.
2. CI guardrail — bash, ~60 lines
Drop this into a pre-deploy CI step. Set AMCP_SLUG to your upstream's slug and AMCP_FAIL_MODE to closed for production or open for staging.
#!/usr/bin/env bash
# alivemcp-ci-guardrail.sh — block deploy when upstream MCP is failing.
set -euo pipefail
SLUG="${AMCP_SLUG:?AMCP_SLUG required}"
FAIL_MODE="${AMCP_FAIL_MODE:-closed}" # 'closed' = fail on API error, 'open' = pass
TIMEOUT="${AMCP_TIMEOUT:-5}"
URL="https://alivemcp.com/api/embed-status/${SLUG}"
ts() { date -u +'%Y-%m-%dT%H:%M:%SZ'; }
log() { echo "[alivemcp-guardrail $(ts)] $*"; }
http_code=$(curl -sS -o /tmp/amcp.json -w '%{http_code}' \
--max-time "${TIMEOUT}" \
-H 'Accept: application/json' \
"${URL}" || echo '000')
if [[ "${http_code}" == '000' || "${http_code}" -ge 500 ]]; then
log "API returned ${http_code} — fail-${FAIL_MODE} per AMCP_FAIL_MODE"
if [[ "${FAIL_MODE}" == 'closed' ]]; then exit 1; fi
exit 0
fi
if [[ "${http_code}" == '404' ]]; then
log "slug ${SLUG} not found — check spelling or wait for hourly registry crawl"
exit 1
fi
if [[ "${http_code}" -ge 400 ]]; then
log "API returned ${http_code} — failing"
exit 1
fi
state=$(jq -r '.state' /tmp/amcp.json)
uptime=$(jq -r '.uptime_30d' /tmp/amcp.json)
p95=$(jq -r '.p95_ms' /tmp/amcp.json)
ago=$(jq -r '.last_probe_ago' /tmp/amcp.json)
case "${state}" in
up)
log "upstream ${SLUG} is up — proceeding"
exit 0 ;;
degraded)
log "upstream ${SLUG} is degraded (uptime_30d=${uptime}%, p95=${p95}ms) — proceeding with annotation"
# Optional: emit annotation for the deploy system.
[[ -n "${GITHUB_STEP_SUMMARY:-}" ]] && \
echo "WARN: upstream ${SLUG} is degraded at deploy time" >> "${GITHUB_STEP_SUMMARY}"
exit 0 ;;
down)
log "upstream ${SLUG} is down (last_probe_ago=${ago}) — blocking deploy"
exit 1 ;;
*)
log "unknown state '${state}' — failing"
exit 1 ;;
esac
The script does the right thing on every documented branch of the policy table. Tested against the live API; honours the 60s cache. Total external dependencies: curl and jq.
3. Runtime liveness check — Node, ~80 lines
Drop this into an agent platform's tool-router. UpstreamLiveness exposes a synchronous state(slug) for the hot path and a background poller that maintains the cache. No dependencies beyond the standard library.
// alivemcp-liveness.js — advisory liveness cache for agent platforms.
'use strict';
const { setInterval, setTimeout } = require('timers');
class UpstreamLiveness {
constructor(slugs, opts = {}) {
this.cache = new Map(); // slug -> { state, etag, as_of, fetched_at }
this.slugs = new Set(slugs);
this.pollMs = opts.pollMs || 15_000;
this.timeoutMs = opts.timeoutMs || 4_000;
this.base = opts.base || 'https://alivemcp.com';
this._timer = null;
}
start() {
this.slugs.forEach((s) => this._fetch(s));
this._timer = setInterval(() => this.slugs.forEach((s) => this._fetch(s)), this.pollMs);
}
stop() { if (this._timer) clearInterval(this._timer); this._timer = null; }
state(slug) {
const e = this.cache.get(slug);
if (!e) return 'unknown'; // fail-open at the platform
if (Date.now() - e.fetched_at > 5 * 60_000) return 'unknown'; // stale > 5min
return e.state;
}
async _fetch(slug) {
const url = `${this.base}/api/embed-status/${encodeURIComponent(slug)}`;
const headers = { Accept: 'application/json' };
const cached = this.cache.get(slug);
if (cached && cached.etag) headers['If-None-Match'] = cached.etag;
const ctrl = new AbortController();
const t = setTimeout(() => ctrl.abort(), this.timeoutMs);
try {
const r = await fetch(url, { method: 'GET', headers, signal: ctrl.signal });
if (r.status === 304) {
// No state change — refresh the freshness clock.
if (cached) cached.fetched_at = Date.now();
return;
}
if (!r.ok) return; // fail-open on non-2xx
const j = await r.json();
this.cache.set(slug, {
state: j.state || 'unknown',
etag: r.headers.get('etag') || null,
as_of: j.as_of || null,
fetched_at: Date.now(),
});
} catch (_) {
// Network error or timeout — leave previous cache entry in place; it
// will go stale after 5 minutes and the platform will fail-open.
} finally {
clearTimeout(t);
}
}
}
module.exports = { UpstreamLiveness };
// Example usage in a tool router:
//
// const live = new UpstreamLiveness(['my-mcp', 'their-mcp']);
// live.start();
//
// function shouldRoute(slug) {
// const s = live.state(slug);
// if (s === 'down') return false; // skip and fall back
// return true; // 'up', 'degraded', 'unknown' → attempt
// }
Pair with an in-call circuit breaker — the recipe is on the health-check reference page. Three consecutive call failures opens the circuit for one minute regardless of the cached liveness state.
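A breaker implementing that policy can be sketched in about 25 lines (class and method names are ours; the authoritative recipe lives on the reference page):

```javascript
// Minimal in-call circuit breaker under the stated policy: opens after
// `threshold` consecutive failures, stays open for `openMs`, then lets a
// single probe call through (half-open). A success closes it; a failure
// while half-open re-opens it immediately, since the failure count is
// still at the threshold.
class CircuitBreaker {
  constructor(threshold = 3, openMs = 60_000) {
    this.threshold = threshold;
    this.openMs = openMs;
    this.failures = 0;
    this.openedAt = null; // null = closed or half-open
  }
  // Should the caller attempt the upstream call right now?
  allow(nowMs = Date.now()) {
    if (this.openedAt === null) return true;
    if (nowMs - this.openedAt >= this.openMs) {
      this.openedAt = null; // half-open: permit one probe call
      return true;
    }
    return false;
  }
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure(nowMs = Date.now()) {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = nowMs;
  }
}
```

Wire it into the router by checking allow() before each attempt and feeding the result back with recordSuccess() / recordFailure(); skip the call when either the breaker or the advisory liveness cache says no.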
4. Downstream dashboard — Prometheus scrape config, ~40 lines
For teams already running Prometheus, scrape the API and let Prometheus do the alerting. The shape:
# prometheus.yml — fragment
scrape_configs:
- job_name: 'alivemcp_uptime'
scrape_interval: 60s
metrics_path: /api/embed-status-prom
static_configs:
- targets:
- my-mcp-slug
- their-mcp-slug
relabel_configs:
- source_labels: [__address__]
target_label: mcp_slug
- source_labels: [__address__]
target_label: __address__
replacement: alivemcp.com
- target_label: __metrics_path__
replacement: /api/embed-status-prom
- source_labels: [mcp_slug]
target_label: __param_slug
scheme: https
The /api/embed-status-prom endpoint exposes the same verdict in Prometheus exposition format:
GET /api/embed-status-prom?slug=my-mcp-slug HTTP/1.1
HTTP/2 200
Content-Type: text/plain; version=0.0.4
# HELP alivemcp_state 1=up, 0.5=degraded, 0=down
# TYPE alivemcp_state gauge
alivemcp_state{slug="my-mcp-slug"} 1
# HELP alivemcp_uptime_30d 30-day uptime percent
# TYPE alivemcp_uptime_30d gauge
alivemcp_uptime_30d{slug="my-mcp-slug"} 99.87
# HELP alivemcp_p95_ms p95 latency in ms across all regions
# TYPE alivemcp_p95_ms gauge
alivemcp_p95_ms{slug="my-mcp-slug"} 142
# HELP alivemcp_last_probe_seconds seconds since last successful probe
# TYPE alivemcp_last_probe_seconds gauge
alivemcp_last_probe_seconds{slug="my-mcp-slug"} 37
Prometheus alert rules write themselves from there. The same pattern works for Datadog HTTP checks, Honeycomb URL probes, and Grafana JSON API data sources — each tool has its own config syntax, but the shape of the integration is identical.
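As a starting point, two rules that follow directly from those gauges — the rule names, thresholds, and severity labels here are placeholders to adapt, not a shipped ruleset:

```yaml
# alerts.yml — illustrative rules; names and thresholds are placeholders
groups:
  - name: alivemcp
    rules:
      - alert: MCPServerDown
        expr: alivemcp_state == 0
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.slug }} has been down for 2 minutes"
      - alert: MCPProbeStale
        # No successful probe in 5 minutes: the probe pipeline itself,
        # not the MCP, may be the thing that is broken.
        expr: alivemcp_last_probe_seconds > 300
        labels:
          severity: warn
        annotations:
          summary: "{{ $labels.slug }} has no recent successful probe"
```

The `for: 2m` on the down rule absorbs a single bad scrape; the stale-probe rule is the watchdog that keeps a dead collector from reading as a healthy fleet.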
The arc — read-side closes the loop on the practical-routine series
The first eight posts of the AliveMCP blog form two arcs. The first is the failure-class taxonomy: the Q2 audit that quantified the surface, the seven-failure-modes deep-dive, the JSON-RPC vs HTTP distinction, the schema-drift treatment, and the auth-walled bucket primer. Five posts that together explain what failure looks like on Model Context Protocol servers in 2026, with the audit numbers as the anchor.
The second is the practical routine: the credentialed probe atom, the multi-region wrapper, the public status page, and now this read-side post. Four posts that together describe an end-to-end pipeline — probe, aggregate, publish for humans, publish for machines — that an indie MCP author can wire up in an afternoon and a small SRE team can run as the foundation of the rest of their MCP observability stack. Each post stands on its own; together they're the runbook.
The next sub-series we're working on is scale — what changes when you go from monitoring three MCPs to monitoring three thousand, when the probe collector itself becomes a multi-tenant service, when the read-side API has to serve six different schema versions across a long upgrade window, and when the alert volume has to be re-shaped because one human can't process a three-thousand-server pager queue. Different shape, different problems, same probe-to-publish skeleton underneath. We'll start with the multi-tenant probe collector.
Until then: the API and badge are public, free for any public MCP server, and the recipes above are tested against the live endpoint. If you're an indie author with an MCP in one of the registries, you already have a status page at alivemcp.com/status/<your-slug>. The badge is one line on your README, the CI guardrail is one bash file in your pre-deploy stage, the runtime liveness check is one Node module in your agent platform's router. The point of the read side is that none of those four needs to be reinvented for every server. We did it once; you wire it up once; the verdict propagates the rest of the way.
Further reading on AliveMCP
- Public status page for an MCP server — the surface-area walkthrough — the human-facing surface that this post complements.
- Multi-region MCP probe deployment — how the verdict served by this API is computed across five regions.
- Running a credentialed MCP health check, end to end — the probe atom underneath the multi-region wrapper.
- JSON-RPC health checks vs HTTP probes — what an MCP probe is actually checking, that an HTTP probe can't.
- State of the MCP Registry — Q2 2026 — the audit numbers that anchor the rest of the series.
- MCP server uptime API — reference page — the full programmatic contract, including the auth tier for private endpoints.
- MCP server status page — what a good one shows — the user-facing reference for the human surface.
- MCP server health check — probe sequence + alert tiers — the probe-side reference.
- MCP server uptime monitoring — the whole stack — the brand-match definition.
- MCP server Slack alerts — alert tiers + payload shape — what to do once the API tells you something is down.
- Check if an MCP server is alive — the 30-second curl test — the manual version of the API.
- MCP monitoring tool — buyer's evaluation checklist — what to look for if you're shopping.
- Monitoring an MCP server — signals worth watching — what the read API exposes vs what it leaves to the dashboard.
- How to monitor an MCP server — step-by-step — the entry-level walkthrough.
- MCP endpoint not responding — diagnostic ladder — what to do when the badge goes red.
- MCP registry uptime — Q2 2026 numbers — per-registry context for the audit.
- Embed preview — the four widget modes live — see the badge before you paste it.
- AliveMCP vs UptimeRobot — direct comparison — for teams already on UptimeRobot.