Deep dive · 2026-04-30 · Scale sub-series

Shared-state archiver walkthrough — turning verdict-minute Redis into long-term MCP uptime history

The multi-tenant probe collector walkthrough covered the write side of the scale sub-series. The per-tenant alert routing walkthrough covered the paging side. Both consume the same inner loop: a verdict-minute Redis that holds, for every (tenant_id, server_slug) pair, a sealed three-state colour, the supporting per-region cells, the regional p95_ms, the canonical-JSON tool-list hash, and a small handful of metadata fields, written once per minute by the verdict-coalesce Lua script. That Redis is the right surface for an inner loop — the alert router needs to compare the current minute against the last-seen minute on a sub-second cadence, and the read-side API serves badges and CI guardrails out of the same hot keys with a 60-second TTL. Redis is the wrong surface for everything beyond the inner loop. Memory is finite, eviction is real, replicas are not history, and a 30-day uptime percentage cannot be computed by reading 30 days of one-minute keys when the keys age out after 96 hours. The archiver is the service that drains the verdict-minute Redis into a long-term Postgres history table, applies retention by tier, exposes the rollups the read-side API depends on, and enforces a GDPR-shaped delete path that takes a tenant and a server and removes every archived row plus every derived view in one transaction. This post is the architectural walkthrough.

TL;DR

The archiver is a small, boring service — about 800 lines of Go in our deployment — that runs continuously alongside the verdict-coalesce supervisor described in the collector walkthrough. Every 60 seconds, on a watermark tick aligned to the verdict-minute boundary, it drains the previous minute's sealed verdict keys from Redis into a Postgres history table, aggregates the day-just-finished into a daily rollup, aggregates the month-just-finished into a monthly rollup, applies the tenant-tier retention policy by deleting rows older than the tier cap, and updates the suppression-cluster log used by the alert router to recognise registry-wide outages. Five structural choices make it work.

Schema choice — one row per (tenant_id, server_slug, region, minute_bucket), native columns for the small fixed set of fields the read-side surfaces care about (state, p95_ms, tool_list_hash, error_kind, asn, registry_of_origin), and a single small JSONB column for the long tail; partitioned monthly so retention is a partition drop, not a million row deletes.

Idempotent ingestion — the primary key is the deterministic tuple (tenant_id, server_slug, region, minute_bucket), the archiver writes with INSERT ... ON CONFLICT DO NOTHING, and Redis keys carry an archived_at marker, so a worker crash mid-batch results in at most one duplicate write attempt that the unique constraint absorbs.

Retention by tier — Public 7 days, Author 90 days, Team 180 days, Enterprise 365 days by default, extendable to a contractual cap of 730 only with a written agreement; rollups have their own retention (daily at least 365 days on every tier, monthly 2,555 days = 7 years) so the uptime_30d the read-side API serves never has to fall back to per-minute scans.

GDPR-shaped delete — a single API call DELETE /api/admin/data/(tenant)/(server) takes a per-server advisory lock, deletes from probe_minute, deletes from the daily rollup, deletes from the monthly rollup, scrubs the suppression-cluster log, deletes from the tenant's verdict-minute Redis prefix, and writes a tombstone row in data_deletion_log within one Postgres transaction; the read-side API and the alert router both honour the tombstone within the next minute and refuse to serve any cached row for the deleted server.

Suppression-cluster log as a derived view — the alert router's cross-tenant suppression rule from the alert routing walkthrough reads from a derived view of the archiver, not from a separate store; suppression_clusters is a materialised view refreshed on the same minute-boundary tick, which keeps the alert router's view of "is this a registry-wide outage?" in step with the canonical history.

The failure-mode catalogue at the end lists six archiver-specific ways things break — Redis eviction overtaking the watermark, watermark drift after a Postgres failover, partition-drop racing a slow read, JSONB schema bumps that re-write history, a suppression-cluster query plan flipping under load, and a GDPR delete that fans out to a stale rollup — with the structural fix for each. The recipe section sketches the table DDL, the archiver-worker pseudocode, the rollup queries, and the GDPR delete in copy-pasteable form.

Where the verdict-minute Redis stops being a history surface

The verdict-minute Redis is the right surface for an inner loop. Its primary keys are six-byte slug ids, its values are 200-to-400-byte CBOR blobs, the read latency is sub-millisecond, the write contention is bounded by the per-region coalescer Lua script, and the eviction policy is configured to volatile-lru with a TTL of 96 hours. That set of properties is exactly what the alert router and the read-side API need: the alert router's transition detector compares the current minute against the previous minute and a small per-tenant cooldown ledger, all of which live in Redis with TTLs measured in minutes; the read-side API serves badges, CI guardrails, runtime liveness checks, and downstream dashboards from the hot keys with a 60-second cache window described in the read-side walkthrough; the status page static-renders every 60 seconds from the same hot keys. None of those readers needs more than a few minutes of history.

Three things break the moment a reader needs more than a few minutes of history. The first is the uptime_30d field that the read-side API exposes as part of the small fixed JSON contract. uptime_30d is by definition 43,200 verdict-minutes long; reading 43,200 keys from Redis on every API call is impractical, the per-tenant Redis-bandwidth budget would exceed the API's bandwidth budget by an order of magnitude, and the API call would no longer fit inside the 60-second cache window because the read itself takes longer than that. The right shape for uptime_30d is a pre-computed rollup that the read-side API can fetch in a single Postgres lookup; the rollup has to be refreshed every minute (otherwise the reported uptime lags by hours during fast-flip incidents), it has to use archived minutes (otherwise it is wrong during the gap between Redis eviction and Postgres ingestion), and it has to be refreshable cheaply (otherwise the rollup itself becomes the bottleneck). All three of those constraints push toward an archiver that runs continuously and a rollup table that is updated once per minute on the same boundary as the verdict-coalescer.
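
As a concrete shape, the single lookup can be as small as the sketch below, assuming uptime_30d uses the same denominator rule as the SLA percentage described later in this post (auth-walled and unknown minutes excluded) and the probe_day rollup defined in the rollup section; the production read-side query is not reproduced here.

-- uptime_30d from the daily rollup: one indexed aggregate per (tenant, server)
SELECT round(
         sum(minutes_up)::numeric
         / nullif(sum(minutes_up) + sum(minutes_down) + sum(minutes_degraded), 0)
         * 100, 3) AS uptime_30d_pct
FROM probe_day
WHERE tenant_id   = $1
  AND server_slug = $2
  AND day >= (now() - interval '30 days')::date;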

The second thing that breaks is the 24-hour minute-resolution history bar on the public status page. The status page reader expects to see the last 1,440 minutes of state for each server, painted as small coloured cells. Those cells are minute-resolution, not aggregated; the bar would lie if we drew 1,440 cells from a daily rollup. The cells therefore have to come from per-minute archived rows. Drawing them from the verdict-minute Redis would work for the most recent ~96 hours, but the bar covers 24 hours and the bar is rendered from the same shared state for every server on every page load, which means the read pattern is N servers × 1,440 minutes per render. That read pattern is not friendly to Redis (N × 1,440 small reads) and is far friendlier to Postgres (one indexed range scan per server). The minute-resolution rendering moves to Postgres the moment "last 24 hours" stops fitting in a single Redis read.

The third thing that breaks is the alert router's suppression-cluster log. The cross-tenant suppression rule from the alert routing walkthrough needs to recognise that "more than 10% of all tenants would be paged in the same alert minute for the same upstream root cause" and collapse to a single global notice. The "same upstream root cause" is computed by clustering the alert events on (error.kind, asn, registry_of_origin); the "more than 10%" is computed by counting distinct tenant IDs in the cluster against the total tenant count for the relevant tier band; the "in the same alert minute" is the verdict-minute boundary. Every input to that rule lives, ultimately, in the archived history. The alert router could read the inputs from Redis, but the cluster query has to run every alert-minute, the cluster query has to be deterministic across cluster restarts, and the cluster history has to outlive Redis eviction so a post-incident investigation can replay why the suppression triggered. The right shape is a materialised view of the archiver that the alert router queries directly. The archiver is therefore not just a backup of Redis; it is the canonical source of truth that derived views like the suppression-cluster log compose against.

Each of those three readers has the same shape: a request for more than a few minutes of history, a tolerance for one-minute staleness, and a hard requirement that the data still exists after Redis evicts. Together they make the archiver a load-bearing part of the data path, not an offline batch job. The archiver is online, on the verdict-minute boundary, and on the same shared-state tick as the coalescer and the alert router.

The schema choice — one row per server per minute vs JSONB partitioned by month

The two reasonable schemas for an MCP uptime history table are very different shapes. The first is what we ended up with: one row per (tenant_id, server_slug, region, minute_bucket), with native columns for the small fixed set of fields the read-side cares about and a single small JSONB column for the long tail. The second is JSONB partitioned by month, where each row is the entire verdict blob and the row is keyed by (tenant_id, server_slug, minute_bucket). Both are workable; both have been tried by adjacent products. The walkthrough is most useful if it includes the wrong fork and explains why we abandoned it.

The wrong fork — JSONB partitioned by month

The starting instinct, especially for a system whose verdict shape is not stable in week one, is to dump the whole CBOR blob into a JSONB column and partition by month so retention is a partition drop. The schema is roughly:

CREATE TABLE probe_history (
    tenant_id   uuid NOT NULL,
    server_slug text NOT NULL,
    minute_at   timestamptz NOT NULL,
    verdict     jsonb NOT NULL,
    PRIMARY KEY (tenant_id, server_slug, minute_at)
) PARTITION BY RANGE (minute_at);

This shape works for ingestion and is friendly to schema evolution — a new field in the verdict blob shows up in the JSONB column without a migration. The retention policy is a single DROP TABLE probe_history_2026_03 at the end of every month. That is genuinely nice. The same shape becomes painful as soon as the read pattern is anything other than "fetch the verdict for one server in one month". Three problems compound.

First, the read-side API's uptime_30d aggregation has to walk every JSONB row in the 30-day window for the server, parse the JSONB on each row, evaluate verdict->>'state' = 'up', and divide by the count. The plan is, in our profiling, dominated by JSONB handling: JSONB is binary on disk, but every row still pays the detoast-and-key-extraction cost before the comparison can run. With 43,200 minutes per server and ~50,000 active servers per tenant in our enterprise-tier deployment, the rollup query is doing two billion JSONB extractions every minute. The native-column schema does the same work without parsing.
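
For reference, the uptime_30d aggregation against the abandoned JSONB schema looked roughly like the sketch below; every row pays the extraction cost before the filter can evaluate.

-- uptime over 30 days against the abandoned JSONB schema: each row is
-- detoasted and the 'state' key extracted before the FILTER can run
SELECT (count(*) FILTER (WHERE verdict->>'state' = 'up'))::numeric
       / nullif(count(*), 0) AS uptime_30d
FROM probe_history
WHERE tenant_id   = $1
  AND server_slug = $2
  AND minute_at  >= now() - interval '30 days';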

Second, the suppression-cluster log query needs to count distinct tenant_id values for which the verdict has the same (error.kind, asn, registry_of_origin) in the same minute. With JSONB, that query is a jsonb->'error'->>'kind' grouping on every row, which is unindexable without a generated column for each component of the grouping key. We tried the generated-column workaround; it works, but you end up with three generated columns that are effectively native columns with extra steps, and the JSONB row is now redundant with the generated columns it produced.
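
The generated-column workaround we tried looked roughly like this sketch (the JSONB paths for asn and registry_of_origin are assumed to mirror the verdict blob layout); each component of the grouping key becomes a stored column derived from the JSONB it duplicates.

-- generated columns pulled out of the verdict blob so the cluster grouping
-- can be indexed; the JSONB still carries the same values redundantly
ALTER TABLE probe_history
    ADD COLUMN error_kind         text    GENERATED ALWAYS AS (verdict->'error'->>'kind')     STORED,
    ADD COLUMN asn                integer GENERATED ALWAYS AS ((verdict->>'asn')::integer)    STORED,
    ADD COLUMN registry_of_origin text    GENERATED ALWAYS AS (verdict->>'registry_of_origin') STORED;

CREATE INDEX ON probe_history (registry_of_origin, asn, error_kind, minute_at);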

Third — and this is the one that finally killed the JSONB schema for us — JSONB does not enforce field-level constraints. The verdict shape is supposed to have region as one of five known values; we discovered, during a Q1 Redis eviction storm, that one of the region workers had been emitting verdicts with a typo in the region name (us-easr instead of us-east) for forty minutes. With native columns and a CHECK constraint, the bad rows would have been rejected at ingestion, the bug would have surfaced as an archiver error in the worker's logs, and the typo would have been fixed within ten minutes. With JSONB, the bad rows arrived silently, the typo persisted, and the operational visibility we needed (per-region uptime aggregation) was off by a few percent for forty minutes. JSONB's "no schema" property is not a feature for canonical history data; canonical data wants schema enforcement.

The right fork — one row per server per minute, native columns plus small JSONB tail

The schema we ended up with is verbose at the column-list level and small in row size:

CREATE TABLE probe_minute (
    tenant_id          uuid          NOT NULL,
    server_slug        text          NOT NULL,
    region             text          NOT NULL CHECK (region IN ('us-east','us-west','eu-west','ap-southeast','sa-east')),
    minute_bucket      timestamptz   NOT NULL,
    state              text          NOT NULL CHECK (state IN ('up','down','degraded','auth-walled','unknown')),
    p95_ms             integer       NULL,
    tool_list_hash     bytea         NULL,
    error_kind         text          NULL,
    asn                integer       NULL,
    registry_of_origin text          NULL,
    archived_at        timestamptz   NOT NULL DEFAULT now(),
    extra              jsonb         NOT NULL DEFAULT '{}'::jsonb,
    PRIMARY KEY (tenant_id, server_slug, region, minute_bucket)
) PARTITION BY RANGE (minute_bucket);

CREATE INDEX probe_minute_idx_cluster
    ON probe_minute (registry_of_origin, asn, error_kind, minute_bucket)
    WHERE state IN ('down','degraded');

CREATE INDEX probe_minute_idx_uptime
    ON probe_minute (tenant_id, server_slug, minute_bucket)
    INCLUDE (state, p95_ms);

The native columns carry exactly the fields that the read-side surfaces, the alert router's suppression-cluster query, and the daily/monthly rollups all read on the hot path. The extra JSONB column carries everything else — the per-step probe timings, the credential rotation marker, the per-region tool-list-hash divergence diagnostic — that surfaces only on operator drill-down and never on a hot-path query. The CHECK constraints on region and state reject the typo case at ingestion. The two indexes are the two indexes the hot-path queries actually need; we measured both before adding either, and we have not added a third.
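
For concreteness, the 24-hour bar read that probe_minute_idx_uptime exists for looks roughly like the sketch below; the coalescing of the five regional cells into the displayed colour happens in the renderer, and the query shown is illustrative rather than the production read-side code.

-- 24-hour status bar: one indexed range scan per server (probe_minute_idx_uptime)
SELECT minute_bucket, region, state, p95_ms
FROM probe_minute
WHERE tenant_id   = $1
  AND server_slug = $2
  AND minute_bucket >= now() - interval '24 hours'
ORDER BY minute_bucket;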

The partition strategy is monthly. Every month we create a new partition (probe_minute_2026_05) one week before it is needed, and we drop a partition (probe_minute_2025_07, ten months old, past every tier's retention cap) on the first of the month. Partition creation is a cron; partition drop is a cron; both run in transactions that take a small lock and release it within a few seconds. We do not use pg_partman's automatic partition management because the failure mode of "partition for the current minute does not exist" is, in our deployment, the worst possible failure mode (it kills ingestion for every tenant) and we want partition creation to be operator-monitored and operator-fixable, not a background job with its own failure mode. Both crons are shown below.
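
A sketch of the two crons, using the partition names from the paragraph above as examples; the lock_timeout guard mirrors the small-lock discipline just described.

-- create next month's partition one week early (cron, operator-monitored)
CREATE TABLE IF NOT EXISTS probe_minute_2026_05
    PARTITION OF probe_minute
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

-- drop a partition that has aged past every tier's retention cap
BEGIN;
SET LOCAL lock_timeout = '5s';
DROP TABLE IF EXISTS probe_minute_2025_07;
COMMIT;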

The row size in this schema is around 90 bytes on average (the extra JSONB is empty for healthy minutes and small for the others). At 50,000 servers × 5 regions × 1,440 minutes × 30 days, the monthly partition is around 100 GB of raw rows compressed to ~30 GB on disk. That is large enough to want partition-level retention and small enough to fit on a single Postgres replica. Citus or timescaledb would become a conversation somewhere around a petabyte of raw rows; at this scale we do not need either: the planner is happy with monthly partitions, the queries hit single partitions on the hot path, and the I/O budget is dominated by the rollup refresh, not the per-minute ingest.

Retention policy by tier

The retention table is the contract between the tenant and the archiver. Three properties matter: it is a small fixed set of values per tier (no per-tenant exceptions), the values are visible to the tenant in the dashboard and in the privacy policy (no surprises), and the archiver enforces them by partition drop and per-row delete (no soft-delete, no flag-and-hide). The values for AliveMCP today are:

Tier | Per-minute retention | Daily-rollup retention | Monthly-rollup retention | Notes
Public (free) | 7 days | 365 days | 2,555 days (7 yr) | Public servers; per-minute history serves the 24-hour bar; daily rollup serves uptime_30d, with per-minute fallback for the first 7 days. Monthly rollup is the public ecosystem-health dataset.
Author ($9/mo) | 90 days | 730 days | 2,555 days | Per-minute history serves the 90-day response-time history advertised on the pricing page; daily rollup serves uptime_90d and uptime_30d; monthly rollup is exportable as PDF in the dashboard.
Team ($49/mo) | 180 days | 1,095 days (3 yr) | 2,555 days | Per-minute history covers a 180-day window for incident review; daily rollup is exportable as CSV.
Enterprise ($299/mo+) | 365 days (contractual cap 730) | 2,555 days | 2,555 days | Per-minute retention beyond 365 days is contractual and requires a written agreement; the 730-day cap exists to bound legal exposure on data-breach scenarios. Monthly rollup is the SLA PDF source dataset.

The retention values are deliberately unsymmetrical across the three tables. The per-minute table holds the most data and has the shortest retention (because per-minute resolution is most expensive to store and is least useful far in the past). The daily rollup has medium retention (because daily resolution is what humans use to understand "is the server reliable on average?" and the answer is interesting for a few years). The monthly rollup has the longest retention (because monthly resolution is cheap to store, monthly resolution is what regulators and auditors ask about, and seven years is the standard SOC 2 / GDPR-compatible audit horizon).

Two practical consequences fall out of this. First, the read-side API's uptime_30d field can always be served from the daily rollup, no matter the tier — the daily rollup retention is at least 365 days for every tier — so the read-side does not need a tier-conditional fallback path. Second, an Enterprise tenant who churns gets their per-minute history dropped after 365 days even though their last paid month is in the past; the archiver does not condition retention on subscription state, only on tier-at-time-of-write. We considered conditioning on subscription state and rejected it because the failure mode "we deleted your data because your card declined for one month" is a worse customer experience than the alternative ("we kept your data for 365 days from the day it was probed regardless of what your card did").

A subtle point about partition-drop retention: the per-minute table is partitioned monthly, but the retention windows above are 7, 90, 180, 365 days. The partition drop has to be conservative; we drop a partition only when the highest-tier retention has expired across every row in it. That means a partition for May 2026 is droppable in early May 2027, not May 2026 + 365 days. Per-row deletes inside a still-live partition handle the per-tier-cap-not-yet-partition-dropped window: every night, a small batch job runs DELETE FROM probe_minute WHERE minute_bucket < now() - tier_retention(tenant_id) AND minute_bucket >= partition_floor, scoped to one tenant per pass, with a row-limit cap so a misconfiguration cannot eat the entire partition's I/O budget. The two retention mechanisms — partition drop for the floor, per-row delete for the per-tier cap — coexist and never duplicate work.
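
A sketch of the nightly per-tier delete, assuming a tier_retention(tenant_id) helper function that returns an interval and an illustrative 50,000-row batch cap; the delete targets primary-key tuples rather than ctids because ctids are not unique across partitions.

-- one tenant per pass, bounded batch so a misconfiguration cannot eat the
-- partition's I/O budget; the job re-runs until zero rows are deleted
DELETE FROM probe_minute p
USING (
    SELECT tenant_id, server_slug, region, minute_bucket
    FROM probe_minute
    WHERE tenant_id      = $1
      AND minute_bucket  < now() - tier_retention($1)
      AND minute_bucket >= $2          -- partition_floor
    ORDER BY minute_bucket
    LIMIT 50000
) victim
WHERE p.tenant_id     = victim.tenant_id
  AND p.server_slug   = victim.server_slug
  AND p.region        = victim.region
  AND p.minute_bucket = victim.minute_bucket;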

The Redis-to-Postgres ingestion pipeline

The archiver worker is a single Go binary, deployed as one process per Redis shard. Its loop is intentionally boring:

  1. Wait until the next 60-second boundary aligned to NTP-corrected wall-clock. The same boundary the verdict-coalescer uses, offset by 5 seconds to give the coalescer time to seal the previous minute.
  2. Read the watermark — the last fully-archived minute — from the archive_watermark Postgres row, locked FOR UPDATE SKIP LOCKED to give multiple archiver replicas mutual exclusion without blocking.
  3. Compute the next minute to archive: watermark + 1 minute.
  4. If the next minute is in the future, release the lock and sleep until the next boundary.
  5. Otherwise, scan the verdict-minute Redis for keys matching the pattern verdict:<tenant_id>:<server_slug>:<region>:<next_minute> using SCAN with cursor pagination. The shard count is fixed; the worker addresses one shard per process; the SCAN match pattern is anchored on the minute-bucket so the scanned-key count per minute is bounded by the active server count.
  6. For each batch of 1,000 keys, parse the CBOR blob, build a Postgres bulk-insert via COPY through the pgx driver, and write with INSERT ... ON CONFLICT DO NOTHING on the deterministic primary key. (We use COPY for the bulk write and then INSERT from a temporary table for the conflict-handling because COPY alone does not support ON CONFLICT.)
  7. After the batch commits, advance the watermark to next_minute and release the lock.
  8. Mark the source Redis keys with an archived_at field via HSET on a small audit hash; this is what allows post-incident investigation to distinguish "Redis evicted before archive" from "archived and Redis evicted after".
  9. Loop.

The boring shape is the point. The pipeline is idempotent because the Postgres primary key is deterministic and the conflict handling is a no-op; a worker crash mid-batch rolls the transaction back, the next loop re-attempts the same minute, and any rows that did land are absorbed by the primary-key conflict while the rest are inserted fresh. A coordinated archiver-and-coalescer crash that loses the watermark (the watermark itself is a Postgres row, so this is unlikely, but we have to design for it) results in the next archiver to come up reading the watermark as NULL, defaulting to now() - 1 hour (configurable), and back-filling the last hour from Redis; every minute already archived collides on the primary key and inserts nothing. Everything we do in the archiver is structured to make idempotency the default, not the exception.

The 5-second offset between the coalescer's verdict-minute boundary and the archiver's read tick is load-bearing. We tried 0 seconds initially; the consequence is that the archiver reads the verdict key just as the coalescer is writing it, and the read sometimes catches a stale value or, worse, a half-written CBOR blob (Redis is single-threaded so this is rare, but the Lua script does multiple HSETs and the read is not atomic with the script). 5 seconds is enough margin for the coalescer's slowest observed tail-latency (~3.4 seconds at the p99.9 in our deployment) plus a 1.5-second buffer; we chose 5 because it is round, easy to reason about, and well below the 60-second cadence floor that the read-side API's cache window assumes.
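
The tick alignment itself is a few lines of Go; a sketch of the helper the recipe-section pseudocode calls, assuming the host clock is already NTP-corrected.

package archiver

import "time"

// sleepUntilNextMinuteBoundary blocks until the next wall-clock minute
// boundary plus the given offset (5 s in our deployment), so the archiver
// reads a minute only after the coalescer has had time to seal it.
func sleepUntilNextMinuteBoundary(offsetSeconds int) {
    now := time.Now().UTC()
    target := now.Truncate(time.Minute).Add(time.Duration(offsetSeconds) * time.Second)
    if !target.After(now) {
        target = target.Add(time.Minute)
    }
    time.Sleep(time.Until(target))
}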

The watermark mechanism deserves a paragraph of its own because it is the part that, in practice, decides whether the archiver is correct. The watermark is one Postgres row, structured as archive_watermark(shard_id PRIMARY KEY, last_minute timestamptz NOT NULL, owner uuid NOT NULL, taken_at timestamptz NOT NULL, expires_at timestamptz NOT NULL). The owner column lets us see, in the dashboard, which archiver replica is currently holding the lock. The expires_at column lets us recover from a stuck owner — a replica that crashed without releasing the lock — by stealing the lock if expires_at < now() and the candidate replica can verify the previous owner is no longer running (we check via the supervisor's health endpoint; if the supervisor reports the previous owner is dead, we steal). Stealing the lock without verification is the wrong behaviour because two archivers writing to the same shard's verdict minutes can produce duplicate extra-column updates that, while idempotent on the primary key, leave a benign-but-confusing log of conflicting extra field values; the verification step keeps the audit trail clean.

One subtle correctness note: the verdict-minute Redis is sealed by the coalescer with a separate verdict_sealed flag in the same hash, set to 1 only after every contributing region has either reported or been timed out. The archiver only ingests rows where verdict_sealed = 1. If a minute is unsealed (because a region is still reporting), the archiver skips that minute, leaves the watermark at the previous minute, and tries again on the next tick. This means the archiver's worst-case lag is bounded by the coalescer's worst-case unsealing window, which is 90 seconds in our deployment (the regional-probe wall-clock cap is 50 seconds plus a 40-second coalescer-side buffer for late-arriving cells). We surface that lag as archiver_lag_seconds in the operator dashboard; if it exceeds 180 seconds for more than five minutes, the supervisor fires the on-call.
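
A sketch of how the lag metric can be exported, assuming the standard Prometheus Go client; the gauge is set once per tick from the watermark the archiver just read.

package archiver

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// archiverLagSeconds tracks (now - last fully archived minute); the operator
// dashboard graphs it and the supervisor pages when it stays high.
var archiverLagSeconds = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "archiver_lag_seconds",
    Help: "Seconds between now and the last fully archived verdict minute.",
})

func init() { prometheus.MustRegister(archiverLagSeconds) }

// recordLag is called once per tick with the watermark the loop just read.
func recordLag(lastArchivedMinute time.Time) {
    archiverLagSeconds.Set(time.Since(lastArchivedMinute).Seconds())
}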

Daily and monthly aggregation rollups

The rollups exist for the same reason the archiver exists at all: the read-side API and the alert router need to answer "how is this server doing on average over a long window?" and the per-minute table is too expensive to scan for that question. The shape of the rollup tables is dictated by the queries that read them.

Daily rollup

CREATE TABLE probe_day (
    tenant_id            uuid          NOT NULL,
    server_slug          text          NOT NULL,
    day                  date          NOT NULL,
    minutes_total        integer       NOT NULL,
    minutes_up           integer       NOT NULL,
    minutes_down         integer       NOT NULL,
    minutes_degraded     integer       NOT NULL,
    minutes_auth_walled  integer       NOT NULL,
    minutes_unknown      integer       NOT NULL,
    p95_ms_p50_of_day    integer       NULL,
    p95_ms_p95_of_day    integer       NULL,
    incident_count       integer       NOT NULL DEFAULT 0,
    PRIMARY KEY (tenant_id, server_slug, day)
);

The daily rollup row is one per server per day, regardless of how many regions contributed. The aggregation rule is "a minute counts as up if and only if the verdict colour for that minute was up" — the same two-of-N rule the verdict-coalescer applied at write time. The p95_ms_p50_of_day field is the 50th-percentile of the per-minute p95s during the day (yes, a percentile of a percentile; it is the right shape for "what was the typical p95 today" and we have measured this and found it to be the field operators actually want). The incident_count field is the number of state transitions during the day from up to anything-else, which gives the dashboard a quick "how flappy was this server today?" number without having to scan the per-minute history.

The daily rollup is computed by a Postgres job that runs at 00:01 UTC every day, scanning the previous day's per-minute partition, doing a single GROUP BY (tenant_id, server_slug, date_trunc('day', minute_bucket)), and writing the rollup with INSERT ... ON CONFLICT DO UPDATE SET .... The same job runs incrementally every minute on the day-in-progress so the read-side API can serve uptime_30d with at most 60-second staleness for the current day's contribution; the incremental update is bounded by the active-server count (50,000 in our enterprise deployment) and a small WHERE clause on the minute-bucket, and runs in around 1.2 seconds end-to-end on our production database.

Monthly rollup

The monthly rollup is structurally the same shape — one row per server per month, with the same per-state minute counts scaled up — but the source is the daily rollup, not the per-minute table. The reason is twofold: the daily rollup retention is always at least as long as the per-minute retention, so the monthly rollup's source is always available regardless of tier; and the daily rollup is two orders of magnitude smaller than the per-minute table, so the monthly aggregation is fast.

CREATE TABLE probe_month (
    tenant_id            uuid          NOT NULL,
    server_slug          text          NOT NULL,
    month                date          NOT NULL,  -- first of month
    days_total           integer       NOT NULL,
    minutes_up           integer       NOT NULL,
    minutes_down         integer       NOT NULL,
    minutes_degraded     integer       NOT NULL,
    minutes_auth_walled  integer       NOT NULL,
    minutes_unknown      integer       NOT NULL,
    incident_count       integer       NOT NULL DEFAULT 0,
    sla_uptime_pct       numeric(6,3)  NOT NULL,
    PRIMARY KEY (tenant_id, server_slug, month)
);

The sla_uptime_pct column is computed as minutes_up / (minutes_up + minutes_down + minutes_degraded) — note that auth_walled and unknown minutes are excluded from the denominator. The reason is that an auth_walled minute is, in the SLA sense, "we cannot tell" — the server may be up to its real users but is rejecting the probe credential, and counting it as either up or down would be wrong. unknown is similarly inconclusive. The exclusion is documented in the SLA PDF that the Enterprise tier consumes from the monthly rollup, and we have signed customer contracts that explicitly reference this calculation.
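
The monthly aggregation is a single GROUP BY over probe_day; a sketch, assuming the job runs shortly after month end and leans on ON CONFLICT DO NOTHING for idempotency (the window arithmetic and the zero fallback for a month with no conclusive minutes are illustrative choices).

INSERT INTO probe_month (
    tenant_id, server_slug, month, days_total,
    minutes_up, minutes_down, minutes_degraded,
    minutes_auth_walled, minutes_unknown,
    incident_count, sla_uptime_pct
)
SELECT
    tenant_id,
    server_slug,
    date_trunc('month', day)::date AS month,
    count(*)                       AS days_total,
    sum(minutes_up), sum(minutes_down), sum(minutes_degraded),
    sum(minutes_auth_walled), sum(minutes_unknown),
    sum(incident_count),
    -- auth-walled and unknown minutes are excluded from the denominator
    coalesce(round(sum(minutes_up)::numeric
          / nullif(sum(minutes_up) + sum(minutes_down) + sum(minutes_degraded), 0)
          * 100, 3), 0)            AS sla_uptime_pct
FROM probe_day
WHERE day >= date_trunc('month', now() - interval '1 month')::date
  AND day <  date_trunc('month', now())::date
GROUP BY tenant_id, server_slug, date_trunc('month', day)
ON CONFLICT (tenant_id, server_slug, month) DO NOTHING;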

The monthly rollup is the surface that survives every retention cap. Seven years of monthly rollup is around 4 GB on disk for the entire AliveMCP corpus. The same surface is what regulators ask about ("show me the incident history of MCP server X for the year 2025"), what auditors ask about ("did this server meet its 99.5% SLA in March 2026?"), and what the public ecosystem-health dataset is built from. The dataset is published as a CSV on the quarterly registry report blog post and is updated every quarter from the same monthly rollup.

The GDPR-shaped delete path

The GDPR-shaped delete path is the part of the archiver that has the highest per-incident cost and the lowest per-incident frequency. We have executed it about a dozen times in the lifetime of the system; every execution has surfaced a corner case that motivated a fix; the path as it stands today survives every corner case we have surfaced. The endpoint is intentionally simple:

DELETE /api/admin/data/(tenant_id)/(server_slug)
Authorization: Bearer (admin_token)
X-Reason: gdpr-art17 | tenant-churned | manual-cleanup | ...

The endpoint requires admin auth, takes a tenant ID and a server slug, and synchronously executes the following sequence inside a single Postgres transaction with a per-server advisory lock keyed on (tenant_id, server_slug):

  1. Insert a tombstone row into data_deletion_log(tenant_id, server_slug, deleted_at, reason, requester). The tombstone exists before the deletes so a subsequent restore-from-backup never silently re-creates the data.
  2. Delete from probe_minute where (tenant_id, server_slug) matches. With the index probe_minute_idx_uptime, this is a single index range scan; for an enterprise-tier server with 365 days of per-minute data, the row count is around 2.6 million and the delete takes around 8 seconds. We have not found this latency to be a problem; the operation is rare and the lock granularity is per-server.
  3. Delete from probe_day where (tenant_id, server_slug) matches.
  4. Delete from probe_month where (tenant_id, server_slug) matches.
  5. Scrub the tenant from the suppression-cluster log wherever a cluster's contributing-tenant set references tenant_id. This is the subtle one: the live suppression_clusters materialised view cannot be edited in place, but it re-derives from probe_minute on the next minute-boundary refresh, so the deleted rows fall out of it within a minute; the permanent suppression_clusters_history table carries a contributing_tenant_hashes array column for exactly this case, and we scrub it in this transaction rather than defer it, so the tenant reference is gone the moment the tombstone is recorded.
  6. Delete from the verdict-minute Redis using a SCAN MATCH verdict:<tenant_id>:<server_slug>:*:* and pipelined UNLINK for the matching keys. (UNLINK rather than DEL because UNLINK is non-blocking; the per-minute Redis can have ~50,000 keys for a long-tenured server and we do not want to block the event loop.)
  7. Delete from the alert router's per-tenant Redis namespace using the same scan-and-unlink pattern, scoped to keys whose payload references the deleted server.
  8. Delete from the read-side API's cache by issuing a POST /api/admin/cache-invalidate against every regional read-side API instance with the deleted server slug as the payload. This is what guarantees the API stops serving the stale row within at most 60 seconds (the cache TTL is 60s; the invalidation makes it immediate).
  9. Commit the Postgres transaction and release the advisory lock.

If any step in the sequence fails, the entire Postgres transaction rolls back, the advisory lock is released, and the tombstone is not recorded. The Redis deletes are not transactional with the Postgres transaction, and we accept two asymmetric failure shapes. A Redis delete that succeeds while a later Postgres step fails leaves a hole in the verdict-minute Redis for a server whose history was never actually deleted; the hole heals as the coalescer keeps writing new minutes, and the delete request simply has to be retried. The worse shape is the reverse: the Postgres deletes commit but a verdict key survives in Redis (a delete that failed partway, or a minute the coalescer wrote during the transaction window), and the next archiver tick re-ingests that stale key and re-creates a row for a server that is supposed to be gone, which is wrong. The mitigation is that the Redis deletes happen inside the Postgres transaction (steps 6 and 7 are dispatched and their completion is awaited before commit), and a small reconciliation job runs every five minutes and re-runs the deletes for any tombstone whose Postgres rows or Redis keys have resurfaced. We have triggered this reconciliation job exactly once in production, after a Postgres deadlock during step 4; the reconciliation completed in under a minute.

The reasoning behind step 1 — write the tombstone first — is worth surfacing. A tenant who exercises GDPR Article 17 expects, after the deletion, that the data does not return in a backup restore six months later. The only way to enforce that durably is to record the deletion request itself in a log that is itself backed up; on a restore, the post-restore reconciliation job reads the deletion log, replays every recorded deletion against the restored data, and only after that does the read-side API come back online. The tombstone log is therefore the actual GDPR-compliance artifact; the per-row deletes are merely the immediate effect.

One more wrinkle: step 5 (the suppression-cluster scrub) is a data delete, but the suppression-cluster log itself is sometimes about deleted tenants. If the alert router clustered an event on (error.kind=tls-expired, asn=AS16509, registry_of_origin=mcp.so) and one of the contributing tenants subsequently exercised GDPR-Art-17, the cluster row's contributing-tenant set has to be updated to remove that tenant ID, but the cluster row itself survives because it documents an event that genuinely happened to other tenants. The structural choice we made is to store contributing tenant IDs as a salted hash, not the raw UUID, in contributing_tenant_hashes; the deletion replaces the salted hash in the history table with a constant 'tombstone' string. The cluster row is kept; the tenant identity is gone; the historical fact of the registry-wide outage survives. We documented this exact behaviour in the privacy policy under "what is preserved after deletion".

Step 8 (cache invalidation) is the part that makes the deletion visible in the time the GDPR regulation expects it to be visible. The Article 17 default expectation is "without undue delay"; courts have read that as ~30 days for paperwork-heavy deletions, but the read-side API's cache window is 60 seconds, and an actual cache miss after the invalidation is far cheaper than the legal cost of a single deferred delete. We take the latency hit happily.

The suppression-cluster log as a derived view of the archiver

The cross-tenant alert-suppression rule from the alert routing walkthrough collapses a registry-wide outage to one global notice when more than 10% of all tenants would be paged in the same alert minute for the same upstream root cause, clustered on (error.kind, asn, registry_of_origin). The rule's correctness depends on three properties: the cluster computation has to be deterministic across cluster restarts, the cluster history has to outlive Redis eviction so a post-incident investigation can replay why the suppression triggered, and the inputs to the cluster have to be the canonical archived verdicts, not the in-flight Redis verdicts.

All three properties are bought by making suppression_clusters a materialised view of probe_minute:

CREATE MATERIALIZED VIEW suppression_clusters AS
SELECT
    error_kind,
    asn,
    registry_of_origin,
    minute_bucket,
    count(DISTINCT tenant_id)                AS contributing_tenant_count,
    -- digest() is provided by the pgcrypto extension; the salt lives in a
    -- server-side setting so raw tenant UUIDs never appear in the view
    array_agg(DISTINCT encode(
        digest(tenant_id::text || current_setting('app.salt'), 'sha256'),
        'hex'
    ))                                       AS contributing_tenant_hashes,
    count(DISTINCT (tenant_id, server_slug)) AS contributing_pair_count
FROM probe_minute
WHERE state IN ('down', 'degraded')
  AND minute_bucket >= now() - interval '24 hours'
GROUP BY error_kind, asn, registry_of_origin, minute_bucket
HAVING count(DISTINCT tenant_id) > 1;

The view is refreshed concurrently every minute on the verdict-minute boundary, immediately after the archiver writes the previous minute's per-minute rows. REFRESH MATERIALIZED VIEW CONCURRENTLY takes a unique index — we provide one on (error_kind, asn, registry_of_origin, minute_bucket) — and refreshes without blocking concurrent reads, which is what the alert router needs because the alert router queries this view on every alert evaluation.
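
Concretely, the unique index and the per-minute refresh look like the sketch below (the index name is illustrative, and the sketch assumes the grouping keys are non-null for down and degraded rows).

-- the unique index REFRESH ... CONCURRENTLY requires
CREATE UNIQUE INDEX suppression_clusters_uq
    ON suppression_clusters (error_kind, asn, registry_of_origin, minute_bucket);

-- run every minute, immediately after the archiver commits the previous minute
REFRESH MATERIALIZED VIEW CONCURRENTLY suppression_clusters;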

The salted-hash form of the contributing tenant set is the GDPR-friendly representation discussed above; it lets the alert router answer "how many distinct tenants are in this cluster?" without ever revealing tenant identities, and it lets the GDPR-delete path replace one hash with the constant 'tombstone' without touching the surrounding data.

The view's WHERE clause filters to the last 24 hours, which is the only window the alert router's suppression rule cares about. The 24-hour window keeps the view small (a few thousand rows even during a registry-wide outage) and the refresh fast (under 200 ms in our deployment). A separate cron job runs every hour, snapshots suppression_clusters into a permanent suppression_clusters_history table, and that permanent table inherits the daily rollup's retention (730+ days). The split between the live materialised view (24-hour window, fast) and the permanent history table (long retention, append-only) is the same split as probe_minute vs probe_day; we apply the same shape twice, intentionally.

One operational observation: the uptime-monitoring read pattern the alert router executes against this view is "in the last alert minute, was there any cluster with contributing_tenant_count > 0.10 * total_tenant_count?" — the planner does this with one indexed lookup against the most recent minute and a constant comparison against the global tenant count cached in a separate config row. We measured the query at p99 under 12 ms across our production load; we have not had to tune it further.
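
Reduced to a sketch, the per-alert-minute check looks like this, with the cached tenant count read from an illustrative global_config row.

-- is any cluster in the alert minute being evaluated over the 10% threshold?
SELECT error_kind, asn, registry_of_origin, contributing_tenant_count
FROM suppression_clusters
WHERE minute_bucket = $1
  AND contributing_tenant_count >
      0.10 * (SELECT total_tenant_count FROM global_config LIMIT 1);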

Six archiver-specific failure modes and the structural fix for each

1. Redis eviction overtakes the archiver watermark

The verdict-minute Redis evicts unaccessed keys on a 96-hour TTL and on memory-pressure LRU. If the archiver falls behind (because of a long Postgres I/O stall, a watermark-lock contention, or a deployment that paused the worker for ten minutes), Redis can evict verdict keys before the archiver reads them. The result is a permanent gap in the per-minute history — that minute is gone from Redis and was never written to Postgres.

The structural fix is a TTL safety margin and an alert. The TTL on verdict keys is 96 hours; the archiver's worst-case lag (verdict-coalescer-unsealed window plus deployment-pause budget plus failover budget) is 30 minutes. The 96-hour TTL gives a margin of ~190x; we monitor (now() - archiver_watermark) as a Prometheus metric, page on > 5 minutes, and consider a sustained > 30 minutes a Sev-1. The 96-hour TTL itself is a defensive choice — we sized it from "if the archiver replica is dead and the on-call is asleep on a Friday night, how long do we have before data loss?" and answered "until Monday morning, with margin". The TTL is configurable per-deployment and we ship 96 hours as the default; we have not had to lower it.

2. Watermark drift after a Postgres failover

Postgres failover is rare but real. If the primary fails over to a replica that was 200 ms behind on apply, and the watermark row was committed on the primary but not yet replicated, the new primary's watermark is one minute behind the actual archived state. The archiver sees the watermark as one minute behind, attempts to ingest a minute that is already in Postgres, hits the primary-key conflict, no-ops, and advances to the same minute it would have advanced to anyway. The conflict-handling absorbs the drift.

The structural fix is the conflict-handling itself. The deterministic primary key plus ON CONFLICT DO NOTHING means the archiver is safe under any watermark drift short of the partition-boundary edge case (drifting back into a partition that has already been dropped, which is failure mode #1 above and a Sev-1). We exercise the case in our alert tests; we have not seen it in production.

3. Partition drop races a slow read

Partition drops are scheduled at 03:00 UTC on the first of the month. A long-running analytics read against the partition that is about to be dropped — for example, a Q1 ecosystem-health dataset rebuild against partition probe_minute_2025_07 at 03:00 on 2026-08-01 — can hold a lock that the partition drop has to wait for. The wait is bounded; the partition drop will happen, but the analytics read may be cancelled or may run to completion depending on lock priority.

The structural fix is to never run analytics reads against partitions in the drop window. We have a simple convention: any analytics job that needs a per-minute table read snapshots the table to a read-only export at the start of the job and operates against the snapshot. The snapshot is a CREATE TABLE ... AS SELECT ... FROM probe_minute WHERE minute_bucket BETWEEN ... AND ... that runs in a transaction with SET LOCAL lock_timeout = '5 seconds', fails fast if the partition is being held, and retries from a one-minute-newer floor on failure. The convention is enforced in code review; analytics jobs that read probe_minute directly are rejected at PR time.
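
A sketch of the snapshot convention, with an illustrative export-table name.

BEGIN;
SET LOCAL lock_timeout = '5s';
-- fail fast if the partition is locked; the job retries from a
-- one-minute-newer floor on failure
CREATE TABLE analytics_export_2025_07 AS
SELECT *
FROM probe_minute
WHERE minute_bucket BETWEEN '2025-07-01' AND '2025-08-01';
COMMIT;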

4. JSONB schema bumps re-write history

The extra JSONB column is the only field whose schema is not enforced by the table definition. It is also, by design, the field where new per-step probe diagnostics get added without a migration. The risk is that a downstream consumer — for example, a daily rollup that reads extra->'cdn_pop' for cache-divergence aggregation — can be broken by a schema bump that renames or restructures the field, and the breakage manifests not as a Postgres error but as silently-wrong rollup numbers.

The structural fix is to keep the extra JSONB strictly read-by-operator-only: no rollup, no read-side API, no alert router, no suppression-cluster view reads from extra. The contract is enforced by code review and by a unit test that walks the codebase looking for extra-> reads outside the operator-dashboard route. If a field needs to feed a rollup or an alert path, it gets promoted to a native column with a CHECK constraint and a backfill job. The backfill job is its own ceremony — we have done it three times for fields that started in extra and earned a column — and the ceremony is the right level of friction for "this field is now load-bearing".

5. Suppression-cluster query plan flips under load

The suppression-cluster query is the alert router's hot-path query; if its plan flips from "use the partial index on (registry_of_origin, asn, error_kind, minute_bucket)" to a sequential scan, the alert router's per-minute query goes from 12 ms to 30 seconds and the alert path falls behind the verdict-minute boundary. The plan flip can happen if Postgres's row estimates drift enough to make the planner think a sequential scan is cheaper.

The structural fix is two-fold. First, we run ANALYZE probe_minute on a 5-minute cron during the day-in-progress so the planner's row estimates stay current. Second, we have a synthetic test that runs the suppression-cluster query against a known-good fixture every minute and asserts the plan via EXPLAIN; if the plan flips, the synthetic test fails before the production query notices. The synthetic test caught one plan flip in production, during a database upgrade window where the planner's costing constants changed; we forced the index with a SET LOCAL enable_seqscan = off override scoped to the alert router's query transaction, and the plan stayed pinned until we updated the costing.
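
A sketch of the synthetic plan assertion, assuming database/sql against a fixture database; rather than matching a partition-local index name, the crude but sufficient check is that no partition of probe_minute falls back to a sequential scan.

package archiver

import (
    "database/sql"
    "fmt"
    "strings"
)

// assertClusterPlanUsesIndex runs EXPLAIN on the suppression-cluster shape and
// fails if any probe_minute partition is read with a sequential scan.
func assertClusterPlanUsesIndex(db *sql.DB) error {
    rows, err := db.Query(`EXPLAIN
        SELECT error_kind, asn, registry_of_origin, minute_bucket,
               count(DISTINCT tenant_id)
        FROM probe_minute
        WHERE state IN ('down','degraded')
          AND minute_bucket >= now() - interval '24 hours'
        GROUP BY error_kind, asn, registry_of_origin, minute_bucket`)
    if err != nil {
        return err
    }
    defer rows.Close()

    var plan strings.Builder
    for rows.Next() {
        var line string
        if err := rows.Scan(&line); err != nil {
            return err
        }
        plan.WriteString(line + "\n")
    }
    if err := rows.Err(); err != nil {
        return err
    }
    if strings.Contains(plan.String(), "Seq Scan on probe_minute") {
        return fmt.Errorf("suppression-cluster plan fell back to a sequential scan:\n%s", plan.String())
    }
    return nil
}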

6. GDPR delete fans out to a stale rollup

The GDPR delete path deletes from probe_day and probe_month in steps 3 and 4. If a daily rollup recompute job was running concurrently — for example, a backfill triggered by a schema bump (failure-mode #4 above) — the recompute job can re-create the deleted rollup row by reading from probe_minute rows that the GDPR delete has not yet processed (the GDPR delete processes probe_minute in step 2; if the rollup recompute starts between the lock acquisition and step 2, it sees the not-yet-deleted data).

The structural fix is the advisory lock. The GDPR delete takes a per-server advisory lock at the very start of the transaction; the rollup recompute job takes the same advisory lock per server before reading. If the GDPR delete is running, the rollup recompute waits; if the rollup recompute is running, the GDPR delete waits. The lock is per-server, so a GDPR delete on one server does not block rollup recomputes on others; we have not observed measurable contention. The lock is held inside the Postgres transaction so a worker crash releases it automatically. We added the lock after a single near-miss in staging where a backfill ran concurrent with a GDPR-Art-17 delete and produced one orphan rollup row that we cleaned up by hand; the lock is the structural fix that ensures we never have to clean up by hand again.
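
The pairing is literal: the recompute job takes the same two-key advisory lock per server before it reads, with the key derivation matching the delete path exactly. A sketch:

-- taken by the rollup recompute job, per server, before reading probe_minute;
-- identical key derivation to the GDPR delete so the two serialize
BEGIN;
SELECT pg_advisory_xact_lock(hashtext($1::text), hashtext($2::text));
-- ... recompute probe_day / probe_month rows for ($1, $2) ...
COMMIT;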

Reference recipes

The recipes below are deliberately compact. Production code in our deployment is longer because it carries metrics, structured logging, retry policies, and a circuit breaker for the Postgres write path; the compact form below is the one we sketch on a whiteboard when explaining the archiver to a new engineer.

Table DDL

-- watermark
CREATE TABLE archive_watermark (
    shard_id    int           PRIMARY KEY,
    last_minute timestamptz   NOT NULL,
    owner       uuid          NOT NULL,
    taken_at    timestamptz   NOT NULL DEFAULT now(),
    expires_at  timestamptz   NOT NULL
);

-- per-minute (partitioned monthly; see "schema choice" section above)
CREATE TABLE probe_minute (
    tenant_id          uuid          NOT NULL,
    server_slug        text          NOT NULL,
    region             text          NOT NULL CHECK (region IN ('us-east','us-west','eu-west','ap-southeast','sa-east')),
    minute_bucket      timestamptz   NOT NULL,
    state              text          NOT NULL CHECK (state IN ('up','down','degraded','auth-walled','unknown')),
    p95_ms             integer       NULL,
    tool_list_hash     bytea         NULL,
    error_kind         text          NULL,
    asn                integer       NULL,
    registry_of_origin text          NULL,
    archived_at        timestamptz   NOT NULL DEFAULT now(),
    extra              jsonb         NOT NULL DEFAULT '{}'::jsonb,
    PRIMARY KEY (tenant_id, server_slug, region, minute_bucket)
) PARTITION BY RANGE (minute_bucket);

-- daily and monthly rollups (see "rollups" section above for full DDL)
-- suppression-cluster materialised view (see "suppression-cluster log" section)

-- GDPR tombstone log
CREATE TABLE data_deletion_log (
    id          bigserial PRIMARY KEY,
    tenant_id   uuid        NOT NULL,
    server_slug text        NOT NULL,
    deleted_at  timestamptz NOT NULL DEFAULT now(),
    reason      text        NOT NULL,
    requester   uuid        NOT NULL
);
CREATE INDEX ON data_deletion_log (tenant_id, server_slug);

Archiver worker — pseudocode

// archiver-worker.go (sketch; ~800 lines in production)
const TICK_OFFSET_S = 5
const SHARD_LOCK_TTL = 90 * time.Second

func ArchiverLoop(shardID int) {
    for {
        sleepUntilNextMinuteBoundary(TICK_OFFSET_S)

        tx := db.Begin()
        wm, ok := tx.SelectForUpdateSkipLocked("archive_watermark", shardID)
        if !ok {
            tx.Rollback()
            continue // another archiver holds the lock
        }
        if wm.expires_at.Before(now()) && previousOwnerIsDead(wm.owner) {
            tx.Update("archive_watermark", shardID,
                "owner", self.UUID,
                "taken_at", now(),
                "expires_at", now().Add(SHARD_LOCK_TTL))
        }

        nextMinute := wm.last_minute.Add(time.Minute)
        if nextMinute.After(now().Truncate(time.Minute)) {
            tx.Commit() // nothing to do this tick
            continue
        }

        keys := redis.ScanMatch(
            fmt.Sprintf("verdict:*:*:*:%d", nextMinute.Unix()),
            shardID)

        rows := []probeMinute{}
        unsealed := false
        for _, key := range keys {
            blob := redis.HGetAll(key)
            if blob["verdict_sealed"] != "1" {
                unsealed = true // a region is still reporting
                break
            }
            row, err := parseVerdict(blob, nextMinute)
            if err != nil { logAndSkip(key, err); continue }
            rows = append(rows, row)
        }
        if unsealed {
            tx.Rollback() // leave the watermark where it is; retry next tick
            continue
        }

        if len(rows) > 0 {
            // bulk write via COPY into a staging table, then INSERT ... ON CONFLICT DO NOTHING
            tx.CopyInto("probe_minute_stage", rows)
            tx.Exec(`INSERT INTO probe_minute SELECT * FROM probe_minute_stage
                     ON CONFLICT (tenant_id, server_slug, region, minute_bucket) DO NOTHING`)
            tx.Exec(`TRUNCATE probe_minute_stage`)
        }

        tx.Update("archive_watermark", shardID, "last_minute", nextMinute)
        tx.Commit()

        for _, key := range keys {
            redis.HSet(key, "archived_at", now().Unix())
        }
    }
}

Daily rollup — incremental update query

INSERT INTO probe_day (
    tenant_id, server_slug, day,
    minutes_total, minutes_up, minutes_down, minutes_degraded,
    minutes_auth_walled, minutes_unknown,
    p95_ms_p50_of_day, p95_ms_p95_of_day, incident_count
)
SELECT
    tenant_id,
    server_slug,
    date_trunc('day', minute_bucket)::date AS day,
    count(*) AS minutes_total,
    count(*) FILTER (WHERE state = 'up') AS minutes_up,
    count(*) FILTER (WHERE state = 'down') AS minutes_down,
    count(*) FILTER (WHERE state = 'degraded') AS minutes_degraded,
    count(*) FILTER (WHERE state = 'auth-walled') AS minutes_auth_walled,
    count(*) FILTER (WHERE state = 'unknown') AS minutes_unknown,
    percentile_disc(0.5)  WITHIN GROUP (ORDER BY p95_ms) AS p95_ms_p50_of_day,
    percentile_disc(0.95) WITHIN GROUP (ORDER BY p95_ms) AS p95_ms_p95_of_day,
    -- a transition is a minute that is not 'up' whose previous minute was 'up';
    -- the lag() runs in the subquery because window functions cannot appear
    -- inside an aggregate FILTER clause
    count(*) FILTER (WHERE state != 'up' AND prev_state = 'up') AS incident_count
FROM (
    SELECT
        tenant_id, server_slug, minute_bucket, state, p95_ms,
        lag(state) OVER (
            PARTITION BY tenant_id, server_slug
            ORDER BY minute_bucket
        ) AS prev_state
    FROM probe_minute
    WHERE minute_bucket >= date_trunc('day', now())
) per_minute
GROUP BY tenant_id, server_slug, date_trunc('day', minute_bucket)
ON CONFLICT (tenant_id, server_slug, day) DO UPDATE
SET minutes_total       = EXCLUDED.minutes_total,
    minutes_up          = EXCLUDED.minutes_up,
    minutes_down        = EXCLUDED.minutes_down,
    minutes_degraded    = EXCLUDED.minutes_degraded,
    minutes_auth_walled = EXCLUDED.minutes_auth_walled,
    minutes_unknown     = EXCLUDED.minutes_unknown,
    p95_ms_p50_of_day   = EXCLUDED.p95_ms_p50_of_day,
    p95_ms_p95_of_day   = EXCLUDED.p95_ms_p95_of_day,
    incident_count      = EXCLUDED.incident_count;

GDPR delete — single-transaction sequence

BEGIN;
SELECT pg_advisory_xact_lock(
    hashtext($1::text),         -- tenant_id
    hashtext($2::text)          -- server_slug
);

INSERT INTO data_deletion_log (tenant_id, server_slug, reason, requester)
VALUES ($1, $2, $3, $4);

DELETE FROM probe_minute
 WHERE tenant_id = $1 AND server_slug = $2;

DELETE FROM probe_day
 WHERE tenant_id = $1 AND server_slug = $2;

DELETE FROM probe_month
 WHERE tenant_id = $1 AND server_slug = $2;

-- the suppression-cluster contributing-tenant set is salted-hashed,
-- so we replace this tenant's hash with a tombstone in any cluster
-- that referenced it; the cluster row itself survives (it documents
-- a registry-wide event that genuinely happened to other tenants).
UPDATE suppression_clusters_history
   SET contributing_tenant_hashes = array_replace(
        contributing_tenant_hashes,
        encode(digest($1::text || current_setting('app.salt'), 'sha256'), 'hex'),
        'tombstone'
       )
 WHERE encode(digest($1::text || current_setting('app.salt'), 'sha256'), 'hex')
       = ANY (contributing_tenant_hashes);

-- the redis deletes (steps 6 and 7) are dispatched here and awaited before
-- COMMIT; the five-minute reconciliation job re-runs the deletes for any
-- tombstone whose Postgres rows or Redis keys resurface.
COMMIT;

Where this fits in the scale sub-series

The scale sub-series so far has three posts. The collector walkthrough (post #10) built the write side — supervisor, workers, per-region queues, per-tenant secret store, verdict-minute coalescer. The alert routing walkthrough (post #11) built the paging side — sink-ownership verification, tenant-scoped configuration, cross-tenant suppression, per-tenant alert budgets, payload-shape boundaries. This post built the persistence side — the Redis-to-Postgres archiver, the per-minute schema, the daily and monthly rollups, retention by tier, the GDPR-shaped delete path, and the suppression-cluster log as a derived view of the archiver. Together they describe a multi-tenant MCP uptime stack that probes from many regions, alerts safely across many tenants, and persists the canonical history in a shape that survives every retention cap and every Article 17 request.

The next instalment of the scale sub-series — the post that closes out the scale arc before the Q3 2026 audit re-run — is the operator-dashboard walkthrough. The collector, the alert router, and the archiver each emit metrics, surface tenant configuration, accept admin operations, and produce audit logs. The single tenant operator wires those four surfaces into a single Grafana board and a few command-line scripts; the multi-tenant operator needs an admin dashboard with per-tenant scoping, role-based access, an audit log that survives retention, and a customer-facing self-serve surface for alert-sink configuration, retention preferences, and Article 17 requests. The walkthrough will cover the four-layer admin permission model, the audit-log schema that survives every other retention cap, the customer-facing self-serve UX, and the failure modes specific to operating one console for many tenants.

After that, the Q3 2026 registry audit re-runs in mid-July, this time from the multi-tenant collector designed in post #10, with verdicts archived through the system designed in this post, and with cross-tenant suppression measured against the cluster log designed in post #11. The audit will report bucket-by-bucket movement vs the Q2 baseline — including the new regionally degraded bucket the multi-region rollout from post #7 surfaces, whether the credentialed-probe rollout from post #6 shrinks the auth-walled 16.8% bucket as expected, and whether the schema-drift detector from post #4 caught the same 7.1%/48h drift rate or a different one. The archiver designed in this post is what makes the comparison possible at all.


Want to be told before your MCP server dies silently?

AliveMCP probes every public MCP endpoint every 60 seconds, archives the verdict for as long as your tier specifies, and exposes the history to your dashboard, your README badge, your CI guardrail, and your runtime liveness check — all from the same canonical Postgres-backed history described in this post. Public servers are free; private servers start at $9/mo.

Join the waitlist