Deep dive · 2026-04-30 · Scale sub-series — archiver companion
Operating the shared-state archiver with five staff or fewer
The shared-state archiver walkthrough built five new layers on top of the verdict-minute Redis: a native-column-plus-small-JSONB schema partitioned monthly, idempotent ingestion behind a watermark with a 5-second offset to the verdict-coalescer, daily and monthly rollups that the read-side API and the alert router consume, a GDPR-shaped delete that fans out across probe_minute + probe_day + probe_month + suppression_clusters + the verdict-minute Redis prefix in one Postgres transaction, and a suppression-cluster log exposed as a materialised view of the canonical history. That post answered "what does the archiver look like at scale." This post answers a different question: "how does a small team actually run that archiver every day, when the platform has 80 paying tenants instead of 50,000, the founder is also the on-call and the data protection officer, the offsite backup is one S3 bucket the founder once configured and has never since restored from, and the GDPR delete fan-out has to work the first time it runs because the customer asked for it on a Sunday afternoon." The five layers from the architectural walkthrough survive the small-team setting unchanged — the threat model and the legal exposure are the same — but the human routines that operate them are very different from the routines a fifty-person platform team runs. This post is the operator's guide. It maps headcount to archiver ownership for one-, two-, three-, four-, and five-person deployments, walks the week-1 setup checklist that turns the archiver architecture into a working deployment, sketches the daily and weekly and monthly and quarterly drills that keep the watermark and the retention boundary and the rollups and the GDPR delete fan-out honest, names the seven failure modes that show up specifically when the archiver is operated by a small team, and gives the reference recipes — the daily watermark check script, the GDPR delete fan-out drill harness, the founder-as-DPO Article 17 response template, the S3-bucket-versioning offsite-backup runbook — that turn the routine into something a one-to-five-person team can actually run without the archiver collapsing into "we did the schema once and now we hope nobody asks for a delete."
TL;DR
A five-or-fewer team operating a multi-tenant shared-state archiver is not a smaller version of an enterprise data-platform team; it has its own shape and its own failure modes. The five-layer archiver architecture from the previous post still applies — schema with native columns and CHECK constraints, idempotent ingestion behind a watermark, retention by tier with two mechanisms, GDPR-shaped delete in one transaction, and a suppression-cluster log as a materialised view — but the way the layers map onto humans, the cadence at which they are exercised, and the failure modes the team has to watch for change with team size. The headcount-to-archiver-ownership mapping is the first decision: in a one-person deployment the founder owns the watermark health, the partition-rotation cron, the retention boundary, the GDPR delete fan-out, the offsite-backup restore drill, and the data protection officer (DPO) seat under GDPR Article 37 by virtue of being the only human; in a two-person deployment the ops hire takes the watermark health and the partition cron and the founder retains the retention-boundary decisions and the DPO hat; in a three-person deployment the third slot is the schema reviewer who exists structurally to refuse the founder's "just drop this column, it's small" requests; the four- and five-person deployments add a dedicated DPO-cover and a third-party SOC-2 reviewer rotation. The week-1 setup is the minimum-viable boundary: pick the retention boundary per tier with no contractual SLA at the free tier (Public 7 days per-minute / 365 daily / 7-year monthly — same as the architectural walkthrough's Public row), schedule the daily watermark health check with a 180-second lag SLO, set the GDPR delete fan-out drill calendar (one synthetic-tenant delete per quarter), configure the offsite-backup S3 bucket with versioning + Object Lock in compliance mode, decide the founder-as-DPO pattern with a 30-day Article 17 response window written into the privacy policy, lock the partition-rotation cron's calendar, configure the suppression-cluster materialised view's refresh-on-the-minute boundary, and stand up a synthetic deletion-target tenant whose entire purpose is being deliberately deleted once a quarter. The daily routine is one line: the on-call reads the watermark-lag dashboard end-to-end, looks for one anomaly, and either notes it as benign or escalates. The weekly routine is the partition-coverage check (next month's partition exists; the oldest partition has not been dropped early) and the suppression-cluster materialised-view refresh latency review. The monthly routine is the partition-rotation: create next month's partition one week ahead of need, drop the partition that has aged past every tier's retention cap. The quarterly routine is the GDPR delete fan-out drill against the synthetic deletion-target tenant and the offsite-backup restore drill against an empty Postgres instance. 
Seven small-team failure modes with structural fixes — the daily watermark check that no one runs (a calendar-bound task on the rotation, with the read-receipt logged), the retention boundary that drifts past free-tier customers (a checked-in tier-defaults YAML file with a quarterly review by the schema reviewer), the GDPR delete that misses a derived view (a single DELETE function that fans out in one transaction, with the function tested by the synthetic deletion drill), the offsite backup that has never been restored (a quarterly restore-into-an-empty-Postgres drill that ends with a row-count diff against the live database), the founder-as-DPO and the response-window failure mode (a 30-day calendar timer the dashboard surfaces from the moment the request is filed; a hardware-failover plan for the founder's email; a written response template the founder uses without re-drafting), the schema migration that breaks the archiver mid-flight (a three-stage migration where the new column is added with NULL allowed in stage 1, populated in stage 2, and made NOT NULL in stage 3, with each stage gated by the schema reviewer's MFA approval), and the partition-roll cron that accidentally drops the wrong month (the cron is structurally barred from dropping any partition newer than 365 days plus the highest-tier retention cap; the drop is logged before it runs, dry-run-verified before it commits, and surfaces a P1 on any drop where the row-count of the partition is not zero in the read-side cache). The recipe section sketches the daily watermark check script, the GDPR delete fan-out drill harness, the founder-as-DPO Article 17 response template, and the S3-bucket-versioning offsite-backup runbook in copy-pasteable form. This post is the practical companion to the shared-state archiver architectural walkthrough; together they describe both halves of how a small multi-tenant MCP-monitoring team operates the persistence side of the stack. One more small-team-companion post is scheduled before the Q3 2026 audit lands mid-July: the small-team companion to the multi-tenant probe collector.
Why five-or-fewer changes the archiver
The five layers in the architectural walkthrough are shaped by data-shape constraints and by legal exposure, not by team size. The defences against a JSONB schema that hides typos in region, against an ingestion path that loses a minute when Redis evicts before Postgres ingests, against a per-row delete that scans the entire partition, against a GDPR delete that leaves a row in probe_day after deleting from probe_minute, against a suppression-cluster query that flips its plan under load — all of those defences sit in the schema's CHECK constraints and the watermark's idempotency contract and the partitioning strategy and the materialised view's refresh cadence, and they are the same whether the team is one person or fifty. What changes with team size is who watches the watermark, who runs the GDPR delete drill, who reviews the schema migrations, and how the data protection officer hat works when there is no DPO because there is one founder and one inbox.
Three things are different at small scale and they cascade. The first is the watermark health check. The architectural walkthrough has a 60-second cadence and a 180-second lag SLO; the supervisor pages on-call when the lag exceeds 180 seconds for more than five minutes. With fifty operators on a tiered on-call rotation, "archiver_lag_seconds exceeded SLO at 03:14 UTC" is a routine page that wakes the right human and is acknowledged within the SLA. With one operator the same page arrives as a Slack DM and gets snoozed at 03:14 because the founder is asleep; the lag accumulates to 30 minutes by 04:00 because Redis has evicted the un-archived minute, and the archiver is silently behind on a now-uncoverable gap. The structural answer is to make the watermark check run on a cadence the small team can actually keep up with, with the on-call (or the founder, in a one-person deployment) given an explicit calendar slot every morning for a watermark review, and to make the dashboard refuse to advance the watermark across a Redis-evicted gap without an explicit operator acknowledgement that the gap is accepted as data loss. We will name this concretely in §5.
The second is the retention-boundary drift. The architectural walkthrough fixes the per-tier retention values at Public 7 days, Author 90 days, Team 180 days, Enterprise 365 days (contractual cap 730). With fifty thousand tenants the per-tier mix is stable on a quarterly time scale and the rollup retention is a property the platform never has to think about. With eighty tenants the same per-tier mix can shift by 20% in a single sales month — a single Enterprise sign-up moves the mix more visibly than a year's worth of marketing — and the team is tempted to "extend the free-tier retention to 30 days, since we have the disk space." That extension is the structural drift: the privacy policy says 7 days, the database holds 30 days, the GDPR delete fan-out walks 30 days of rows, and the customer's reasonable expectation (the privacy policy contract) and the platform's actual behaviour have diverged. The structural answer is to make the tier-defaults a checked-in YAML file in the operator-config repo, gated by the schema reviewer's MFA approval, reviewed quarterly with the actual tenant-mix, and enforced by the archiver's per-row delete cron that uses the YAML as its sole source of truth. The privacy policy and the database are the same number because they are both projections of the same YAML. We will name this concretely in §4.
The third is the data protection officer seat that does not exist. With one operator the DPO hat is a fiction. The privacy policy cannot pretend the seat is staffed; it has to know that "the DPO" means "the founder" and that the founder's email is the only Article 17 inbox the platform has. The right way to handle this is not to fake a DPO function but to be explicit: the privacy policy names the founder by name and email as the DPO, the platform's Article 17 process is documented as "founder responds within 30 days, escalation to a parked external-DPO contact whose only job is to take over the inbox if the founder is unreachable for 7 days," and the dashboard surfaces the 30-day timer for every Article 17 request from the moment it is filed. Pretending the DPO seat is fully staffed when it is not produces worse outcomes than admitting it is the founder and writing down the bus-factor plan that compensates. The right shape for a small team is explicit single-point-of-contact DPO, not simulated DPO function: the privacy policy knows there is one DPO, the dashboard's deletion log records that the request was acknowledged and resolved by that one DPO, and the team writes down what happens when the one DPO is genuinely unreachable.
None of those three are reasons to abandon the five-layer archiver. They are reasons to operate it deliberately. The rest of this post walks how.
Mapping headcount to archiver ownership
The decision of who owns which piece of the archiver is the most consequential staffing choice the deployment makes after the four-layer permission model from the first companion post and the five-layer alert router from the second companion post. The right answer depends on team size and on what other systems your team already uses (your IdP, your support queue, your privacy-request inbox). The five-team-size mapping below is what we have run; treat it as a starting point that you adapt to your team's actual shape, not as a rule.
One-person deployment — the founder is the DPO
The single operator owns the watermark health check, the partition-rotation cron, the retention boundary, the GDPR delete fan-out, the offsite-backup restore drill, the schema migrations, and the data protection officer seat. The schema-reviewer seat from the permission-model companion exists on paper but is unstaffed; we provision the schema-reviewer account, leave it parked at zero permissions, and use that parked seat to grant schema-migration approval rights to a SOC-2 reviewer or a part-time data-platform advisor when the team has one to give the seat to. The DPO function is named by the founder's full legal name and personal email in the privacy policy; the dashboard's Article 17 inbox forwards to that email and surfaces the 30-day response timer in the founder's private operator view.
The discipline that prevents the archiver from collapsing into "founder writes the schema once, drops a partition by hand once a year, and prays" is explicit calendar-binding on the dashboard. The archiver-config UI in the dashboard has a routine selector at the top of the navigation that shows the operator's current routine; the default is "no routine selected." To run the daily watermark health check the founder selects the daily-watermark routine; to run the quarterly delete-drill the founder selects the quarterly-delete-drill routine and the click is gated by an MFA prompt with the hardware token, with the synthetic deletion-target tenant pre-loaded. Every routine completion is in the audit log with the routine name and the actor and a JSON summary of the routine's outcome (rows deleted, partitions verified, restore verified). Every Article 17 request is in the audit log with the requesting customer's tenant ID, the timestamp, the 30-day deadline, and the resolution timestamp. The single founder's own audit log is what they read on the first day of every quarter to verify that the previous quarter's routines all ran; the discipline is to read it. If the founder does not read their own audit log, the audit log is a forensic-only artefact and the archiver operations have effectively collapsed to "one founder, one schema, no log."
One non-obvious choice for the one-person deployment: the synthetic deletion-target tenant is provisioned at week one, even though the deployment has zero paying tenants. The reason is that the day a real Article 17 request arrives, the founder needs to know that the GDPR delete fan-out actually walks probe_minute + probe_day + probe_month + suppression_clusters + the verdict-minute Redis prefix in one transaction, and that the read-side API and the alert router both honour the tombstone within the next minute. Provisioning the synthetic deletion-target tenant at week one means the GDPR delete fan-out drill has been exercised, the founder has a calibration baseline for "what does a real-but-deliberately-injected delete look like in this dashboard," and the drill's output is the receipt the founder hands to the SOC-2 auditor when the auditor asks for evidence that the platform's GDPR claims are real. None of that is true on day one if the synthetic tenant is added the morning of the first real Article 17 request — and Article 17's 30-day clock does not pause for "we are still configuring the drill harness."
Two-person deployment — founder and first ops hire
The founder retains the retention-boundary decisions, the schema-migration approvals, and the DPO seat. The first ops hire takes the watermark health check, the partition-rotation cron, the offsite-backup restore drill, and the day-to-day archiver-worker monitoring. The schema-reviewer seat is still parked, for the same reason as the one-person deployment — the day a security advisor or a SOC-2 reviewer arrives, the seat is the first thing they need.
The two-person deployment introduces routine-rotation discipline on the watermark health check, even though the rotation has only one human on the daily slot. The first ops hire is the daily-watermark reader; the founder is the secondary check on the weekly partition-coverage review. The founder is not on the day-to-day watermark rotation. If the first ops hire is on holiday, the founder picks up the watermark check but does so by switching to the daily-watermark routine for the duration of the cover, the same way the founder picks up tenant-scoped support actions during a cover. The routine switch is one click and is captured in the audit log. The founder does not "log in as root and just check the watermark dashboard" during the cover, because the structural defence — the dashboard refuses tenant-scoped actions from a root-operator session — holds. The structural defence is what makes the routine discipline survive a two-week holiday for the first ops hire.
The single most important thing the two-person deployment does is elect a partition-roll reviewer. With one operator, the founder reviews their own partition-rotation cron output and that is the routine. With two, there is a temptation for the first ops hire to drop a month's partition without the founder reviewing the row-count diff first, and the partition drops silently become unwitnessed. The fix is to put the partition-roll review on the rotation, not on the operator: every partition drop is approved by a different human from the one who scheduled it, the dashboard refuses to apply the drop until the second human has approved the dry-run row-count diff, and the audit log records both rows. The rotation is what keeps the partition drops real.
Three-person deployment — adding the schema reviewer
The founder retains the retention-boundary decisions and the DPO seat. The first ops hire is on the daily-watermark rotation. The third hire — call them the schema reviewer — is the structural counterweight whose only job is to refuse the founder's "just drop this column, it's small" requests. The schema-reviewer seat moves from "parked" to "staffed by an internal hire or a quarterly-rotated external advisor." The schema reviewer is staffed before the platform reaches 50 tenants because that is roughly the size at which the founder's schema changes start affecting other people's customers in non-obvious ways (a column rename that breaks the read-side API's uptime_30d contract, a CHECK-constraint relaxation that lets typos through ingestion, an index drop that flips a hot-path query plan).
The schema reviewer is not a senior hire; they are a discipline. The role can be the third human on the team regardless of seniority, provided they have refusal rights structurally enforced in the dashboard: any schema migration goes through a pull-request-style two-stage flow where the founder proposes the migration, the schema reviewer reviews the diff against the schema-definitions repo, and the dashboard refuses to apply the migration without the reviewer's MFA-gated approval. The audit log records both the proposal and the approval. The reviewer's job is not to be smarter than the founder; it is to be a different pair of eyes whose first job is to ask "is this migration going to break the archiver mid-flight" and refuse if the answer is unclear. The migrations that get past the reviewer are the ones the founder bothered to explain.
Four-person deployment — DPO cover and offsite-backup owner
The fourth hire is the second human on the DPO rotation cover and the offsite-backup owner. With two humans on the DPO cover (the founder primary, the fourth hire as the holiday-and-illness backup), the 30-day Article 17 response window becomes survivable; with the founder alone the window is fragile because a two-week founder vacation eats two-thirds of the response budget for any request filed on day one of the vacation. The fourth hire flips the DPO function from fragile to robust. The dashboard's Article 17 inbox routing is rewritten: primary recipient is the founder's personal email, secondary recipient is the fourth hire's personal email, the dashboard's 30-day timer surfaces in both inboxes from the moment the request is filed.
The four-person deployment is also the size at which the offsite-backup ownership splits from the founder. With three or fewer staff the founder is the only human who has ever logged into the S3 bucket; if the founder is unreachable on the day a restore is needed, the platform's offsite backup is effectively unrecoverable. The fourth hire takes ownership of the offsite-backup S3 bucket — the bucket's IAM policy is updated so the fourth hire's IAM user can read the bucket, the bucket's MFA-delete is enabled with a hardware token the fourth hire owns, the quarterly restore drill is run by the fourth hire from week one of the new role. The offsite-backup ownership transfer is documented in the operator handover runbook; the founder's IAM user retains read access for the duration of the transfer plus 90 days, then is reduced to the same audit-only access as every other operator.
Five-person deployment — the largest size the model is calibrated for
The fifth hire is the third-party SOC-2 reviewer or the fractional security engineer — the human whose explicit role is to run the quarterly SOC-2 audit cycle, to sign off on the GDPR delete drill receipts, to be the standing reviewer on the schema migrations when the schema reviewer is unavailable, and to be the off-site escrow contact for the backup-recovery hardware tokens. With five humans and four hats (founder for retention decisions and DPO primary, two ops on watermark rotation, schema reviewer, and the fifth hire wearing the SOC-2 / DPO-secondary / backup-escrow hat) the model is at its calibrated size. Beyond five humans the model still works but the hats start to specialise — the schema reviewer becomes a data engineer, the offsite-backup owner becomes a release engineer, the SOC-2 reviewer becomes a compliance lead — and at that point the deployment has crossed out of the small-team companion's scope and into the architecture's enterprise-team default.
The week-1 archiver setup checklist
The week-1 boundary is the minimum-viable line that converts the archiver architecture into a running deployment. Every item below is required at any team size; the difference at smaller team sizes is not which items get done but how they are split across humans. The list is calibrated for a one-person deployment to be able to complete in one full working day; larger teams parallelise.
1. Pick the retention boundary per tier with no contractual SLA at the free tier
The architectural walkthrough fixes the per-tier retention values at Public 7 days per-minute / 365 daily / 7-year monthly, Author 90 / 730 / 7-year, Team 180 / 1,095 / 7-year, Enterprise 365 (contractual cap 730) / 7-year / 7-year. The week-1 decision is not which numbers to pick; it is the explicit, written-down statement that only those four tiers exist and only those numbers apply, and that the free tier has no contractual retention SLA. The privacy policy lists the values from a single source of truth (the tier-defaults YAML in the operator-config repo), and the privacy policy refuses to assert any retention longer than the YAML's value. The structural reason the free tier has no contractual SLA is that the small team's bus factor and the free tier's cost structure cannot jointly absorb a contractual obligation that survives a founder vacation; the privacy policy says "best-effort 7 days" rather than "guaranteed 7 days" and the dashboard's customer-facing surfaces match. Locking the list at four tiers is not a feature ceiling; it is a retention-enforcement ceiling, and the retention-enforcement ceiling is what makes the archiver's per-row-delete cron survivable to operate with five humans.
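One way to keep the privacy policy, the customer-facing surfaces, and the per-row delete cron on literally the same numbers is to render the tier-defaults YAML into a small reference table at deploy time and join every retention delete against it. The sketch below is one such rendering, not the walkthrough's schema: tier_defaults, tenants.tier, and probe_minute.minute_bucket are illustrative names.
-- Sketch: tier-defaults rendered from the YAML at deploy time (names illustrative).
CREATE TABLE IF NOT EXISTS tier_defaults (
    tier             text PRIMARY KEY
                     CHECK (tier IN ('public', 'author', 'team', 'enterprise')),
    minute_retention interval NOT NULL,  -- e.g. '7 days' for public
    day_retention    interval NOT NULL,  -- e.g. '365 days' for public
    month_retention  interval NOT NULL   -- e.g. '7 years' for every tier
);

-- The nightly per-row delete joins against tier_defaults, so the YAML stays the
-- single source of truth; one statement like this exists per rolled-up table.
DELETE FROM probe_minute pm
 USING tenants t
  JOIN tier_defaults td ON td.tier = t.tier
 WHERE pm.tenant_id = t.tenant_id
   AND pm.minute_bucket < now() - td.minute_retention;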
2. Schedule the daily watermark health check with a 180-second lag SLO
The architectural walkthrough fixes the watermark cadence at 60 seconds and the lag SLO at 180 seconds. The week-1 decision is to write down who reads the watermark-lag dashboard every morning and what they do when they see a lag breach. The dashboard's watermark-lag widget is on the operator home page; the daily routine is one click ("acknowledge today's watermark") that is gated by the operator selecting the daily-watermark routine and entering one line of free-text that the dashboard refuses to accept as a copy-paste of the previous day's line. The acknowledgement is in the audit log. A lag breach is a P2 ticket: the on-call investigates the cause (Redis eviction storm, Postgres slow plan flip, partition-roll cron mis-scheduled), surfaces the cause in the ticket, and either advances the watermark across an explicitly-acknowledged data-loss gap or restores the archiver to under-SLO. The 180-second SLO is conservative for a one-person deployment because a single missed minute is recoverable from Redis (96-hour TTL) and a one-day gap is recoverable from the offsite backup; the SLO is what the daily routine defends against, not the SLA the team commits to.
3. Set the GDPR delete fan-out drill calendar — one synthetic-tenant delete per quarter
The architectural walkthrough fixes the GDPR delete fan-out as a single API call DELETE /api/admin/data/(tenant)/(server) that walks probe_minute + probe_day + probe_month + suppression_clusters + the verdict-minute Redis prefix in one Postgres transaction and writes a tombstone to data_deletion_log. The week-1 decision is to schedule the drill calendar: the second Tuesday of each quarter's first month (the same calendar slot as the alert-router companion's sink-rotation drill from post #15, just six hours later in the day). The drill is run against the synthetic deletion-target tenant — a fully provisioned tenant whose only purpose is to be deliberately deleted once a quarter. The drill's expected outcome is a row-count diff that goes from N rows in probe_minute + M in probe_day + K in probe_month + L in suppression_clusters + R Redis keys to zero across all five surfaces in one transaction; the drill receipt is committed to the team's drill-log repo and signed by the rotation owner and the founder. We give the harness for the drill in §9.
4. Configure the offsite-backup S3 bucket with versioning + Object Lock in compliance mode
The architectural walkthrough does not name the offsite-backup choice; the week-1 decision is to lock it down. The bucket lives at s3://alivemcp-archive-offsite-(region)/ with: versioning enabled (so a deleted backup is recoverable for the bucket's retention period), Object Lock in compliance mode with a 90-day default retention (so a compromised IAM role cannot prematurely-delete a backup), MFA-delete enabled with a hardware token the offsite-backup owner holds (so a compromised root account cannot bypass Object Lock), server-side encryption with a customer-managed KMS key whose grants are logged (so a compromised AWS account cannot read the backup), bucket-replication to a second region (so a single-region S3 outage does not lose the backup), and a lifecycle rule that transitions backups older than 30 days to S3 Glacier Deep Archive (so the cost is bounded at scale). The bucket is populated by a nightly pg_dump --format=custom of the canonical Postgres database, encrypted at rest, with the backup file's SHA-256 hash committed to the team's drill-log repo. The bucket is exercised by the quarterly restore drill (§7) which restores the most recent backup into an empty Postgres instance and runs a row-count diff against the live database.
5. Decide the founder-as-DPO pattern with a 30-day Article 17 response window
The week-1 decision is the privacy policy's wording on Article 17 and Article 30. The privacy policy names the founder by full legal name and personal email as the data protection officer. The Article 17 response window is documented as 30 days from the day the request is filed, with one possible 60-day extension if the request is "complex" (in the GDPR's language) and an explicit communication to the requestor on day 30 if the extension is invoked. The Article 30 record-of-processing-activities is maintained as a checked-in markdown file in the operator-config repo, updated whenever the schema changes (gated by the schema reviewer's approval). The dashboard's Article 17 inbox routes new requests to the founder's personal email and starts a 30-day timer that surfaces on the operator home page; the timer counts down in real time and turns red at day 25, the dashboard refuses to mark the request as resolved without the founder filling in a free-text resolution summary that the audit log captures.
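A minimal sketch of the 30-day timer query the operator home page could run. The article17_requests table and its filed_at / resolved_at columns are assumptions for illustration; the dashboard's real schema may differ.
-- Sketch: open Article 17 requests with their deadlines (table name hypothetical).
SELECT request_id,
       tenant_id,
       filed_at,
       filed_at + interval '30 days' AS response_deadline,
       GREATEST(0, EXTRACT(DAY FROM (filed_at + interval '30 days') - now()))::int
                                     AS days_remaining,
       now() > filed_at + interval '25 days' AS turns_red
  FROM article17_requests
 WHERE resolved_at IS NULL
 ORDER BY filed_at;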
6. Lock the partition-rotation cron's calendar
The architectural walkthrough fixes partition rotation at "create one week ahead of need, drop on the first of the month when the highest-tier retention has expired across every row." The week-1 decision is the cron's calendar: partition creation runs on the 24th of each month at 02:00 UTC (one week before the 1st), partition drop runs on the 1st of each month at 02:30 UTC (after the new partition is verified live for 30 minutes). Both jobs run as a single transaction; both write to the audit log; both are gated by a structural assertion before commit (creation refuses if a partition for the target month already exists; drop refuses if any row in the candidate partition has an effective retention longer than the partition's age). The drop job dry-runs the row-count of the candidate partition before commit and refuses to proceed if the row-count is not consistent with the read-side cache's idea of the same data — i.e., if the read-side cache thinks there is data the drop job would delete, the drop is aborted and the founder is paged. The dry-run-and-abort is what defends against the seventh failure mode (§7.7).
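A sketch of the two structural assertions, assuming month-suffixed partition names and a stand-in read_side_rowcount() function for the read-side cache lookup; the production cron wraps the same checks in error handling and audit-log writes.
-- Creation (24th, 02:00 UTC): refuse if the target month's partition already exists.
DO $$
BEGIN
  IF EXISTS (SELECT 1 FROM pg_class WHERE relname = 'probe_minute_2026_06') THEN
    RAISE EXCEPTION 'probe_minute_2026_06 already exists; refusing to re-create';
  END IF;
  EXECUTE $ddl$CREATE TABLE probe_minute_2026_06 PARTITION OF probe_minute
              FOR VALUES FROM ('2026-06-01') TO ('2026-07-01')$ddl$;
END $$;

-- Drop (1st, 02:30 UTC): dry-run the row-count and refuse if the read-side cache
-- still sees live data for the candidate partition.
DO $$
DECLARE
  cache_rows bigint := read_side_rowcount('probe_minute_2025_04');  -- stand-in lookup
BEGIN
  IF cache_rows <> 0 THEN
    RAISE EXCEPTION 'drop aborted: read-side cache still sees % rows', cache_rows;
  END IF;
  EXECUTE 'DROP TABLE probe_minute_2025_04';
END $$;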
7. Configure the suppression-cluster materialised view's refresh-on-the-minute boundary
The architectural walkthrough specifies that the suppression-cluster log is a materialised view of the canonical history, refreshed on the verdict-minute boundary. The week-1 decision is the refresh statement and its concurrency: REFRESH MATERIALIZED VIEW CONCURRENTLY suppression_clusters runs every 60 seconds aligned to the verdict-minute, fired by the same supervisor that fires the archiver worker. The CONCURRENTLY mode is required because the alert router's cross-tenant suppression rule reads from this view on every alert minute and a non-concurrent refresh would block the read; the trade-off is that CONCURRENTLY requires a unique index on the view, which we provide via a composite index on (registry_of_origin, asn, error_kind, minute_bucket). The week-1 decision also fixes the view's freshness SLO at 90 seconds (one minute boundary plus 30-second margin); a freshness breach is a P2 ticket on the same cadence as the watermark-lag breach.
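The two statements involved, using the view and index column names described above; a sketch rather than the walkthrough's full worker code.
-- One-time: the unique index that REFRESH ... CONCURRENTLY requires.
CREATE UNIQUE INDEX IF NOT EXISTS suppression_clusters_uniq
    ON suppression_clusters (registry_of_origin, asn, error_kind, minute_bucket);

-- Every 60 seconds, aligned to the verdict-minute, fired by the same supervisor
-- that fires the archiver worker; readers are never blocked.
REFRESH MATERIALIZED VIEW CONCURRENTLY suppression_clusters;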
8. Stand up the synthetic deletion-target tenant
The synthetic deletion-target tenant is provisioned at week one for the same reason the synthetic-outage drill tenant from the alert-router companion is provisioned at week one: the day a real Article 17 request arrives, the team needs to know that the GDPR delete fan-out actually works. The synthetic deletion-target is a fully provisioned tenant with a distinctive tenant-id prefix (drill-deletion-), one server slug, sample rows in probe_minute + probe_day + probe_month, an entry in suppression_clusters, and a verdict-minute Redis prefix. The drill-deletion run against this tenant exercises the GDPR delete fan-out end-to-end every quarter; the receipt is committed to the team's drill-log repo. The synthetic tenant is reseeded after each drill so the next quarter's drill has rows to delete. The reseeding is a one-line invocation against the same drill harness; the architectural walkthrough's emphasis on idempotent ingestion is what makes the reseeding cheap.
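A sketch of one reseed statement the drill harness might run, assuming psql-style interpolation of the synthetic tenant's ID (:drill_tenant) and illustrative column names (minute_bucket, verdict); the walkthrough's idempotent-ingestion contract is what makes the ON CONFLICT no-op safe.
-- Reseed probe_minute for the synthetic tenant (columns illustrative).
INSERT INTO probe_minute (tenant_id, server_slug, minute_bucket, verdict)
SELECT :'drill_tenant', 'drill-server',
       date_trunc('minute', now()) - make_interval(mins => n), 'up'
  FROM generate_series(1, 60) AS n
    ON CONFLICT DO NOTHING;  -- re-running the seed is cheap and harmless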
Daily, weekly, monthly, quarterly archiver routines
The week-1 setup is what gets the archiver running. The routines are what keep it running. The routines are calibrated for the on-call to be one human on a one-person deployment, two humans on a 7-day rotation on a four-person deployment, and to scale linearly between. The cadence below is the cadence we have run; treat the times as a starting point and adapt them to your team's actual day length.
Daily — the watermark-lag review
Every day, at the start of the operator's working day (08:00 UTC for a Europe-based founder, 09:00 UTC for a UK-based one, 16:00 UTC for a US-East-based one — the time is operator preference, the discipline is calendar-binding it), the on-call reads the watermark-lag dashboard end-to-end. The dashboard shows the last 24 hours of archiver_lag_seconds as a single line chart, the previous day's longest lag, the previous day's count of lag-breach events (lags exceeding the 180-second SLO for more than five minutes), and the rate of verdict_sealed=1 Redis keys arriving (a proxy for the archiver's input rate). The on-call's job is to look for one anomaly. An anomaly is anything that looks unfamiliar: a lag-breach event whose root-cause was not investigated, a sudden drop in the input rate that suggests a coalescer regression, a sustained walk-up in the input rate that suggests a tenant has misconfigured a probe to fire 100× more often than intended.
The discipline is the one-anomaly-per-day rule, the same rule from the alert-router companion's daily routine. The on-call is not asked to read the full watermark dashboard and remember everything; they are asked to find one anomaly and either note it as benign in the dashboard's anomaly journal or escalate it. The benign annotation is one click and the escalation is one click and the dashboard records both with the actor and the timestamp. If the on-call finds zero anomalies on a quiet day, they record "zero anomalies" with one click and the routine is logged. The structural reason the rule is one-per-day rather than zero-or-many is that "find at least one" forces the on-call to engage with the dashboard rather than scrolling past it; the dashboard's anomaly journal is what the team reviews on Friday to see how the week's signal looked.
Weekly — partition-coverage check and materialised-view refresh-latency review
Every Friday, the on-call (or the partition-roll reviewer on the larger deployments) runs the partition-coverage check. The check has three assertions: next month's partition exists in the catalog (the cron runs on the 24th, but the cron has failure modes; the Friday before the 24th is the last sane chance to notice the cron has not run yet), the oldest existing partition is the partition we expect to be live (no partition has been dropped early or merged accidentally), and the per-partition row-count is within the expected band for the partition's age (a ten-fold growth in row-count for last month vs the month before is a P2 anomaly worth investigating). The check is a single SQL statement (sketched below) that the dashboard's partition-coverage widget executes on demand; the on-call clicks the widget and reads three green checkmarks or one red one. A red checkmark is a P2 ticket that the partition-roll reviewer (or the founder, on smaller deployments) addresses on Monday.
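The widget's query is roughly this shape — a catalog walk over the probe_minute partitions, assuming the walkthrough's month-suffixed partition naming; the three assertions are evaluated against its output.
-- Sketch: list probe_minute partitions with bounds and estimated row-counts.
SELECT c.relname                           AS partition_name,
       pg_get_expr(c.relpartbound, c.oid)  AS bounds,
       c.reltuples::bigint                 AS approx_rows
  FROM pg_inherits i
  JOIN pg_class c ON c.oid = i.inhrelid
  JOIN pg_class p ON p.oid = i.inhparent
 WHERE p.relname = 'probe_minute'
 ORDER BY c.relname;
-- Assertion 1: a partition for next month appears in the output.
-- Assertion 2: the oldest partition listed is the one the retention caps still require.
-- Assertion 3: last month's approx_rows is within the expected band of the month before.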
The same Friday slot is also when the materialised-view refresh-latency is reviewed. The dashboard's view-latency widget shows the last 7 days of suppression_clusters refresh durations; the median should be under 5 seconds, the p99 under 30 seconds. A p99 walk-up is the structural early-warning that the view's query plan has flipped under load (typically a planner regression after a stats refresh on the underlying probe_minute partition). The fix is a manual ANALYZE on the most recent partition followed by a re-run of the refresh; the discipline is reading the widget every Friday rather than waiting for the alert router's cross-tenant suppression rule to start firing late. We give the harness for the weekly check in §9.
Monthly — the partition rotation
Every month, on calendar-pinned days (the 24th at 02:00 UTC for the create job, the 1st at 02:30 UTC for the drop job), the partition-rotation cron runs. The create job creates next month's partition, runs an ANALYZE against the new partition's empty rowset (so the planner has fresh stats from minute one), and writes the create event to the audit log. The drop job dry-runs the row-count of the candidate partition, refuses to proceed if the row-count is non-zero in the read-side cache's idea of the same data (the drop's structural defence against the seventh failure mode, §7.7), drops the partition in a transaction that takes a small lock and releases it within seconds, and writes the drop event to the audit log. The rotation takes ~5 minutes if everything works and up to 2 hours if the drop job's dry-run flags a row-count anomaly that requires investigation. The rotation receipt is a one-line note posted to the team's channel: "April 2026 partition rotation: create-2026-05 ok, drop-2025-04 ok, 0 anomalies."
The rotation's structural-defence assertions are what make it survive a small team's review cadence. The create job's structural defence is "refuses if a partition for the target month already exists" — the assertion is what catches a failed October rotation that re-runs in November and would otherwise create a duplicate partition with overlapping bounds. The drop job's structural defence is the row-count dry-run-and-abort — the assertion is what catches a partition-bounds misconfiguration that would drop the wrong month. Both assertions live in the rotation cron's code; both are tested by a unit test that runs in CI on every change to the rotation cron's source. The unit test is the small team's structural defence against a cron change that breaks one of the assertions.
Quarterly — the GDPR delete fan-out drill and the offsite-backup restore drill
Every quarter, on calendar-pinned days (the second Tuesday of each quarter's first month at 14:00 UTC for the GDPR drill, the third Wednesday of each quarter's first month at 14:00 UTC for the restore drill), the team runs the two quarterly drills. The two drills are intentionally separated by a week to keep their failure modes from compounding into one bad day; the second-Tuesday slot for the GDPR drill matches the alert-router companion's sink-rotation drill calendar (just six hours later) so the team's "drill day" muscle memory carries over.
The GDPR delete fan-out drill exercises the synthetic deletion-target tenant. The drill has six steps: (1) reseed the synthetic tenant with row-count assertions in all five surfaces (probe_minute, probe_day, probe_month, suppression_clusters, the verdict-minute Redis prefix); (2) confirm the row-counts match the seed expectations; (3) issue the delete via DELETE /api/admin/data/(tenant)/(server) against the dashboard; (4) within 60 seconds confirm the read-side API and the alert router both honour the tombstone (the dashboard's customer-facing surfaces refuse to serve any cached row for the deleted server within the next minute); (5) confirm the row-counts in all five surfaces are zero; (6) confirm the tombstone row is in data_deletion_log. The drill takes 30 minutes if everything works and up to 2 hours if a defect is found. The drill receipt is committed to the team's drill-log repo and signed by the rotation owner and the founder. A failed assertion is a P1; a successful drill is a one-line note in the team's channel.
The offsite-backup restore drill exercises the S3 bucket from §4. The drill has six steps: (1) spin up an empty Postgres instance in the team's drill VPC; (2) download the most recent backup from the offsite-backup S3 bucket via the offsite-backup owner's IAM credentials; (3) verify the backup's SHA-256 hash against the value committed to the drill-log repo; (4) restore the backup into the empty instance via pg_restore; (5) run a row-count diff against the live database, scoped to the most recent rolled-up table (probe_day) so the diff fits in one query; (6) destroy the drill instance and write the receipt. The restore drill is the structural defence against the fourth failure mode (§7.4): the offsite backup that has never been restored. The drill takes 1 hour if everything works and up to a day if the restore reveals a backup-corruption issue, in which case the team's on-call burden for the rest of the day is the corruption investigation. The drill's quarterly cadence matches the typical SOC-2 cadence the platform's auditors expect; the alignment is what makes the auditor's restore-evidence request match a receipt the team already has.
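A sketch of the step-5 diff: the same query is run once against the restored instance and once against the live database, and the two outputs are compared; probe_day's column name (day_bucket) is illustrative.
-- Sketch: per-month row-counts for probe_day; run on both instances and diff.
SELECT date_trunc('month', day_bucket) AS month,
       count(*)                        AS probe_day_rows
  FROM probe_day
 GROUP BY 1
 ORDER BY 1;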
The contractor and external-handshake pattern
One of the under-discussed features of small-team operations is that not every role on the archiver rotation has to be a full-time employee. The roles that show up specifically with five-or-fewer staff and that map well onto contractor or advisor relationships are: the part-time data-platform advisor (the schema reviewer's seat at quarter-time intensity), the fractional DPO (the data protection officer's secondary cover at 20% intensity, typically a privacy lawyer with GDPR experience who can take the inbox on the holiday weeks), and the third-party SOC-2 reviewer (the compliance auditor or a paid security-engineering firm that audits the quarterly drill receipts once a year and signs off on the GDPR fan-out evidence).
Each contractor pattern has a structural shape that mirrors the corresponding employee role from §3. The part-time data-platform advisor lives in the schema-reviewer IdP group with full refusal rights on schema migrations, but a tag in the IdP group says "fractional, 12-month renewal" and the dashboard's session timeout is 8 hours instead of the employee default of 30 days. The fractional DPO lives in the DPO IdP group with a tag that says "secondary, holiday cover only" and a calendar binding that activates the seat for the cover week and deactivates it after; the privacy policy lists the fractional DPO by name and email as the secondary contact, with the same 30-day Article 17 response window applying. The third-party SOC-2 reviewer lives in the auditor IdP group (the same parked seat from the permission-model companion) with a scope-restricting tag that says "audit-the-drill-receipts" and a one-time receipt CSV export at the end of the audit.
The structural decision for each contractor role is the same: the role lives in the IdP, the dashboard refuses to grant the role permissions outside the IdP-bound scope, and the role's expiry is calendar-bound at week one. The fractional DPO role has one additional structural defence: the privacy policy's claim that the secondary contact "responds within 30 days of escalation" must be backed by a written contract with the fractional DPO that obliges the response, signed before the privacy policy is published. The contractor pattern is what makes the small-team archiver operation survive the human reality that the DPO function in particular cannot be staffed full-time at week one and the GDPR Article 17 30-day clock does not pause for hiring.
Seven failure modes specific to small-team archiver operations
The architectural walkthrough listed six archiver-specific failure modes (Redis eviction overtaking the watermark, watermark drift after a Postgres failover, partition-drop racing a slow read, JSONB schema bumps that re-write history, a suppression-cluster query plan flipping under load, and a GDPR delete that fans out to a stale rollup). All six survive at small-team scale unchanged. What follows are seven additional failure modes that show up specifically when the team is small and the rotation is one or two humans deep. Each has a structural fix that does not depend on team discipline alone.
1. The daily watermark check no one runs
The watermark dashboard is on the operator home page; the daily routine is one click. On Tuesday the on-call is mid-stream on a customer support ticket and the dashboard's watermark widget is dismissed without being clicked. The lag-breach event from Tuesday at 03:14 UTC went unacknowledged; the next time the on-call reads the dashboard is Friday morning, by which time three more lag-breaches have stacked. The team learns about a sustained archiver regression three days late.
The structural fix is the calendar-bound daily routine from §5. The on-call's selection of the daily-watermark routine is required before any other archiver-related action in the dashboard is allowed; the dashboard refuses to surface the partition-coverage widget, the rollup-latency widget, or the GDPR-inbox widget until the daily-watermark routine has been acknowledged for the current calendar day. The acknowledgement is in the audit log. The structural reason the routine is calendar-gated rather than passive is that "look at the dashboard sometime today" produces the failure mode; an explicit gate that blocks every other archiver action until the routine has been acknowledged is what turns the routine from a Slack notification into a structural signal.
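The gate reduces to one query against the routine log — the same operator_routine_log table the daily script in §9 writes to. A sketch of the check the dashboard could evaluate before surfacing any other archiver widget:
-- Has today's daily-watermark routine been acknowledged yet?
SELECT EXISTS (
         SELECT 1
           FROM operator_routine_log
          WHERE routine = 'daily-watermark'
            AND ack_at >= date_trunc('day', now())
       ) AS daily_watermark_acknowledged;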
2. The retention boundary that drifts past free-tier customers
The platform launches with the per-tier retention values from §1: Public 7 days, Author 90 days, Team 180 days, Enterprise 365 days. Six months in, the founder notices the disk usage is well below capacity and decides to extend the free-tier retention to 30 days "since we have the disk space." The privacy policy is not updated; the database now holds 30 days of free-tier data while the privacy policy says 7 days. A free-tier customer's GDPR Article 15 access request returns 30 days of data the customer thought was deleted at day 7. The customer files a complaint with the data protection authority.
The structural fix is the tier-defaults YAML from §3. The retention values live in a single source of truth (a checked-in YAML file in the operator-config repo); the privacy policy is rendered from the YAML at build time, the archiver's per-row-delete cron reads from the YAML at runtime, and the dashboard's customer-facing surfaces query the YAML for the displayed retention values. Any change to the YAML goes through the schema reviewer's MFA-gated approval flow; the founder cannot extend the free-tier retention without the schema reviewer's explicit approval, and the approval is what triggers the privacy-policy rebuild and republish. The structural defence is the single source of truth: the privacy policy and the database cannot disagree because they are both projections of the same YAML.
3. The GDPR delete that misses a derived view
A real Article 17 request arrives. The founder issues the delete via DELETE /api/admin/data/(tenant)/(server). The fan-out walks probe_minute + probe_day + probe_month + the verdict-minute Redis prefix. The customer's data is gone from those four surfaces. Two weeks later the same customer notices their (error.kind, asn, registry_of_origin) still appears in a publicly-accessible aggregate visualisation on the platform's blog post about the Q3 audit. The fan-out missed the suppression-cluster materialised view because the view was added two months after the delete function was written, and the function was never updated.
The structural fix is the single DELETE function from §3 that fans out in one Postgres transaction. The function is a stored procedure (delete_tenant_data(tenant_id uuid, server_slug text)) whose source is in the schema-definitions repo; every surface that holds tenant data has its DELETE in the function, and every new surface (a new materialised view, a new rollup table, a new derived index) is added to the function as part of the schema migration that introduces the surface. The schema reviewer's MFA-gated approval on the migration is the structural defence: the migration is rejected if the diff against the schema-definitions repo introduces a new tenant-data-bearing surface without a corresponding addition to the DELETE function. The synthetic deletion drill (§6) exercises the function quarterly against the synthetic deletion-target tenant; the receipt is what proves the function still walks every surface the schema currently has.
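A sketch of the function's shape — the walkthrough has the full version. Table names follow the post; the parameter names are adjusted (p_tenant, p_server) to avoid shadowing the column names inside plpgsql; the plain refresh of the suppression-cluster view inside the transaction is one way to make that surface go to zero immediately (the walkthrough may handle it differently); and the verdict-minute Redis prefix is cleared by the API handler after commit, since Postgres cannot reach into Redis from inside the function.
-- Sketch: the single fan-out function (signature from the post, body abbreviated).
CREATE OR REPLACE FUNCTION delete_tenant_data(p_tenant uuid, p_server text)
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
  DELETE FROM probe_minute WHERE tenant_id = p_tenant AND server_slug = p_server;
  DELETE FROM probe_day    WHERE tenant_id = p_tenant AND server_slug = p_server;
  DELETE FROM probe_month  WHERE tenant_id = p_tenant AND server_slug = p_server;
  -- suppression_clusters derives from the canonical history; a plain
  -- (non-concurrent) refresh drops the deleted tenant's clusters now rather
  -- than waiting for the next minute tick.
  REFRESH MATERIALIZED VIEW suppression_clusters;
  -- Every new tenant-data-bearing surface gets its DELETE added here as part of
  -- the migration that introduces it; the schema reviewer's approval enforces that.
  INSERT INTO data_deletion_log (tenant_id, server_slug, deleted_at)
       VALUES (p_tenant, p_server, now());
END
$$;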
4. The offsite backup that has never been restored
The offsite-backup S3 bucket is configured at week one. The nightly pg_dump runs every day, the SHA-256 hash is committed to the drill-log repo, the disk usage in the bucket grows on the expected curve. Eighteen months in, a Postgres failure requires a restore. The team logs into the S3 bucket, downloads the most recent backup, runs pg_restore, and the restore fails because the pg_dump binary the team used at week one has since been upgraded to a major version that produces a backup format the older pg_restore cannot read; the team's drill VPC is on the older Postgres major version because the team has not run a restore drill since week one. The team spends a day rebuilding the drill VPC at the new Postgres major version before the restore can complete.
The structural fix is the quarterly restore drill from §6. The drill is calendar-bound to the third Wednesday of each quarter's first month; the drill's six steps include downloading the backup, restoring into an empty Postgres instance, and running a row-count diff against the live database. The drill's receipt is what proves the backup is restorable on the team's current Postgres major version. The structural reason the drill is quarterly rather than annual is that Postgres major-version upgrades happen on a roughly-annual cadence and the restore drill is the team's structural defence against the upgrade cadence overtaking the backup-restore compatibility cadence; quarterly cadence keeps the gap to one quarter at most.
5. The founder-as-DPO and the response-window failure mode
An Article 17 request is filed at 09:00 UTC on the day the founder leaves for a two-week holiday in a country with no signal. The request lands in the founder's personal inbox; the dashboard's 30-day timer surfaces in the founder's operator home page but the founder is not logged in to read it. By the time the founder returns, 14 of the 30 days are gone; the request is complex and would need the 60-day extension that has to be invoked on day 30. The team's response window is now structurally squeezed.
The structural fix is the fractional DPO from §8 plus the dashboard's secondary-routing from §5. The privacy policy lists both the founder and the fractional DPO as Article 17 contacts; the dashboard's Article 17 inbox routes new requests to both the founder's personal email and the fractional DPO's personal email simultaneously, and the dashboard's 30-day timer surfaces in both inboxes from the moment the request is filed. The fractional DPO's contract obliges them to acknowledge the request within 48 hours of arrival, regardless of whether the founder has acknowledged it. The structural defence is what makes the founder's two-week holiday survivable; the founder is not the single point of failure on the response window because the fractional DPO is the structural backup. The hardware-failover plan for the founder's email is the same shape as the alert-router companion's hardware-failover for the on-call channel: a secondary device with the founder's mailbox synced and a recovery-codes-off-site plan; the structural fix is the same shape, just specialised to the DPO inbox's email dependency.
6. The schema migration that breaks the archiver mid-flight
The founder ships a schema migration that adds a NOT NULL column to probe_minute. The migration runs in production at 14:00 UTC; the archiver-worker pipeline is mid-batch, the next batch's INSERT fails because the new NOT NULL column has no value in the in-flight CBOR blobs. The archiver-worker's idempotent-retry loop now retries the batch on every tick, the watermark stops advancing, the lag accumulates to 20 minutes before the daily watermark check on Friday morning surfaces it.
The structural fix is the three-stage migration discipline. Stage 1: add the new column with NULL allowed. Stage 2: populate the new column for new ingestions (the archiver-worker's CBOR-parse layer adds the field, but old rows remain NULL). Stage 3 (run no earlier than one full archiver-cadence-window after stage 2): make the column NOT NULL once the team has verified that every new row has the field populated. Each stage is a separate migration; each stage is gated by the schema reviewer's MFA approval; the dashboard refuses to apply any single-stage migration that adds a NOT NULL column to an already-populated table. The three-stage discipline is what makes the migration survive a small team's change cadence; the structural defence is the dashboard's refusal of the single-stage form, not the team's discipline alone.
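The three stages as separate migrations, with an illustrative column name (probe_region); stage 2's backfill runs in bounded batches in practice so the archiver worker never waits on a long UPDATE.
-- Stage 1: additive and NULL-able, safe against in-flight archiver batches.
ALTER TABLE probe_minute ADD COLUMN probe_region text;

-- Stage 2: new ingestions populate the field; old rows are backfilled.
UPDATE probe_minute SET probe_region = 'unknown' WHERE probe_region IS NULL;

-- Stage 3: no earlier than one full archiver-cadence window after stage 2.
ALTER TABLE probe_minute ALTER COLUMN probe_region SET NOT NULL;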
7. The partition-roll cron that accidentally drops the wrong month
The partition-roll cron is configured to drop partitions older than 365 days plus the highest-tier retention cap. The cron's date arithmetic has a bug that, on a leap-year boundary, computes a drop cutoff one month more recent than intended. The drop job runs on the 1st of February in a leap year and drops the partition for January of the previous year — a partition that was not yet eligible for drop. The drop is committed; the data is gone from probe_minute for that month; the team learns about it the next time a customer queries the relevant rollup and finds it empty.
The structural fix is the dry-run-and-abort from §6. The drop job computes the partition's row-count and queries the read-side cache for the row-count of the same partition; if the read-side cache thinks there is data the drop job would delete, the drop is aborted and the founder is paged. In the leap-year case the read-side cache holds 30-day-rollup queries that touch the soon-to-be-dropped partition; the row-count check would refuse the drop because the cache's idea of the data and the drop job's idea of the data disagree. The structural defence is the disagreement: the drop job and the read-side cache are independent computations of the same retention rule; the defence kicks in when they disagree, regardless of which one is wrong. A pre-commit alarm with operator escalation is what gets the founder out of the leap-year-boundary failure mode without losing the month's data.
Reference recipes
The recipes below are calibrated for a small team — short, copy-pasteable, and defensible against the failure modes named above. They are not full implementations; the architectural walkthrough has the full Go and SQL. These are the small-team operator's drop-in scaffolds.
The daily watermark check (bash + psql)
#!/usr/bin/env bash
# daily-watermark-check.sh — read the watermark lag, refuse to proceed past
# 180s without an explicit acknowledgement. Run from the operator's morning.
set -euo pipefail
PG_DSN="${PG_DSN:?missing}"
SLO_SECONDS=180
# 1. Read the current watermark lag.
lag=$(psql "${PG_DSN}" -tAc \
  "SELECT EXTRACT(EPOCH FROM (now() - last_minute))::int
     FROM archive_watermark
    WHERE shard_id = 'primary'")
# Refuse to continue if the watermark row is missing — an empty result would
# otherwise break the numeric comparison below under set -euo pipefail.
if [ -z "${lag}" ]; then
  echo "ERROR: no archive_watermark row for shard 'primary'"
  exit 1
fi
echo "$(date -Iseconds) watermark_lag_seconds=${lag} slo=${SLO_SECONDS}"
# 2. Compare against SLO.
if [ "${lag}" -gt "${SLO_SECONDS}" ]; then
echo "WARN: watermark lag ${lag}s exceeds SLO ${SLO_SECONDS}s"
echo " investigate via the dashboard's watermark-lag widget"
echo " log the incident in the dashboard's anomaly journal"
echo " escalate via PagerDuty if the lag is sustained"
exit 2
fi
# 3. Read the previous day's longest lag and the breach count.
yesterday_stats=$(psql "${PG_DSN}" -tAc \
"SELECT max(lag_seconds), count(*) FILTER (WHERE lag_seconds > ${SLO_SECONDS})
FROM archive_lag_log
WHERE recorded_at >= date_trunc('day', now() - interval '1 day')
AND recorded_at < date_trunc('day', now())")
echo "yesterday: max_lag=${yesterday_stats}"
# 4. Acknowledge.
psql "${PG_DSN}" -c \
"INSERT INTO operator_routine_log (routine, actor, payload, ack_at)
VALUES ('daily-watermark', current_user,
jsonb_build_object('lag_seconds', ${lag},
'yesterday', '${yesterday_stats}'),
now())"
echo "ack: daily-watermark routine acknowledged at $(date -Iseconds)"
The GDPR delete fan-out drill harness (Go)
// gdpr_drill_test.go — quarterly synthetic-tenant deletion drill.
// Run via: go test -run TestDrill_GDPRDeleteFanOut -tags=drill -v
// The synthetic tenant ID is loaded from the DRILL_DELETION_TENANT env var.
package drill

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"testing"
	"time"
)

func TestDrill_GDPRDeleteFanOut(t *testing.T) {
	tenantID := os.Getenv("DRILL_DELETION_TENANT")
	if tenantID == "" {
		t.Fatal("DRILL_DELETION_TENANT not set")
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	h := newGDPRDrillHarness(t, tenantID)
	h.recordReceiptStart("gdpr-delete-fan-out")

	// 1. Reseed the synthetic tenant.
	seed := h.reseedSyntheticTenant(ctx)
	h.recordStep("reseed", "ok", time.Now())
	if seed.ProbeMinuteRows == 0 || seed.ProbeDayRows == 0 ||
		seed.ProbeMonthRows == 0 || seed.SuppressionRows == 0 ||
		seed.RedisKeys == 0 {
		t.Fatalf("seed produced empty surface: %+v", seed)
	}

	// 2. Confirm pre-delete counts.
	pre := h.countAllSurfaces(ctx, tenantID)
	h.recordStep("pre-count", "ok", time.Now())
	if !pre.Equal(seed) {
		t.Fatalf("pre-count != seed: pre=%+v seed=%+v", pre, seed)
	}

	// 3. Issue the delete.
	if err := h.issueDelete(ctx, tenantID); err != nil {
		t.Fatalf("issue delete: %v", err)
	}
	h.recordStep("delete", "ok", time.Now())

	// 4. Within 60s, confirm the read side honours the tombstone.
	deadline := time.Now().Add(60 * time.Second)
	var honoured bool
	for time.Now().Before(deadline) {
		honoured = h.checkReadSideHonoursTombstone(ctx, tenantID)
		if honoured {
			break
		}
		time.Sleep(2 * time.Second)
	}
	// Record the actual outcome rather than an unconditional "ok".
	readSideStatus := "ok"
	if !honoured {
		readSideStatus = "fail"
		t.Errorf("read-side did not honour tombstone within 60s")
	}
	h.recordStep("read-side-tombstone", readSideStatus, time.Now())

	// 5. Confirm post-delete counts are zero across all five surfaces.
	post := h.countAllSurfaces(ctx, tenantID)
	postStatus := "ok"
	if post.ProbeMinuteRows != 0 || post.ProbeDayRows != 0 ||
		post.ProbeMonthRows != 0 || post.SuppressionRows != 0 ||
		post.RedisKeys != 0 {
		postStatus = "fail"
		t.Errorf("post-delete counts non-zero: %+v", post)
	}
	h.recordStep("post-count-zero", postStatus, time.Now())

	// 6. Confirm the tombstone row.
	tomb, err := h.readTombstone(ctx, tenantID)
	if err != nil {
		t.Fatalf("read tombstone: %v", err)
	}
	if tomb == nil {
		t.Fatalf("tombstone row missing")
	}
	h.recordStep("tombstone", "ok", time.Now())

	// 7. Write the receipt.
	receipt := h.finalReceipt()
	receiptJSON, _ := json.MarshalIndent(receipt, "", " ")
	fmt.Printf("GDPR DRILL RECEIPT\n%s\n", receiptJSON)
	if !receipt.AllPassed() {
		t.Fatalf("drill failed: %+v", receipt.Failures())
	}
}
The founder-as-DPO Article 17 response template (markdown)
# OPERATIONS.md — Article 17 response template
# Drop into the operator-config repo. Render via the dashboard
# when an Article 17 request arrives.
Subject: Re: Your data deletion request — {tenant_name}
Hi {requestor_name},
Thank you for your data deletion request. I'm writing to confirm
receipt of your request filed on {request_filed_at} and to outline
the next steps.
Under GDPR Article 17 we will delete the personal data we hold for
your AliveMCP account ({tenant_id}) within 30 days of your request.
This 30-day window ends on {response_deadline}.
What we will delete:
- Per-minute uptime history for your monitored servers ({server_slugs})
- Daily and monthly rollups derived from that history
- Suppression-cluster log entries that reference your servers
- Any cached read-side data for your servers
- Backup copies will be aged out per our published retention policy
(90 days for offsite backups; the deleted data will not be
restored from any older backup)
What we will retain:
- The tombstone record proving your request was processed (legal basis:
evidence of GDPR compliance under Article 5(2) accountability)
- Aggregated, non-personal-data ecosystem statistics (legal basis:
legitimate interest in research and reporting under Article 6(1)(f))
What we may need from you:
- {if_complex} A clarification of which servers are in scope. Please
reply to this email with the list of server slugs you want deleted.
- {if_simple} Nothing further; we will proceed with the deletion of
all your account's data within 7 days.
Confirmation of completion will be sent to this email address by
{response_deadline}.
If you have any questions before then, you can reply to this email
or contact me directly at {founder_email}.
In the rare case I am unreachable, the secondary DPO contact for
AliveMCP is {fractional_dpo_name} at {fractional_dpo_email}, who
can answer questions and confirm the request's status under the
same 30-day window.
Best regards,
{founder_name}
Founder and Data Protection Officer, AliveMCP
{founder_email}
---
## Audit-log linkage
Every Article 17 response is logged in the dashboard's
operator_routine_log table with:
- routine = 'article-17-response'
- actor = founder OR fractional_dpo
- payload = jsonb { request_filed_at, response_sent_at, deadline,
requestor_name, tenant_id, complex (bool) }
- ack_at = timestamp of send
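For reference, a sketch of that audit-log insert in SQL; the column shape matches the operator_routine_log insert used by the daily watermark recipe above, and every literal value is a placeholder for the rendered template's fields.
-- Audit-log insert for a sent Article 17 response (all values are placeholders).
INSERT INTO operator_routine_log (routine, actor, payload, ack_at)
VALUES (
  'article-17-response',
  'founder',                                   -- or 'fractional_dpo'
  jsonb_build_object(
    'request_filed_at', '2026-04-26T14:02:00Z',
    'response_sent_at', '2026-04-27T09:15:00Z',
    'deadline',         '2026-05-26T14:02:00Z',
    'requestor_name',   'requestor name here',
    'tenant_id',        'tenant id here',
    'complex',          false
  ),
  now());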
The S3-bucket-versioning offsite-backup runbook (markdown)
# OPERATIONS.md — S3-bucket-versioning offsite-backup runbook
# Drop into operator-config/runbooks/. Pin in the team channel.
## Calendar
- Nightly: pg_dump runs at 03:00 UTC.
- Weekly: SHA-256 hash of newest backup is committed to drill-log/.
- Quarterly: restore drill from this runbook (third Wednesday).
## Bucket configuration (set at week one, audited quarterly)
- Bucket name: alivemcp-archive-offsite-(region)
- Versioning: ENABLED
- Object Lock: ENABLED, compliance mode, default retention 90 days
- MFA-delete: ENABLED, hardware token held by the
offsite-backup owner (fourth hire on five-person deployments,
founder on smaller deployments — see §3)
- SSE: aws:kms with customer-managed key alivemcp-archive-cmk;
KMS grants logged in CloudTrail
- Cross-region replication: REPLICATING to (other-region) bucket
- Lifecycle: TRANSITION to S3 Glacier Deep Archive at age 30 days
## Quarterly restore drill (third Wednesday of quarter's first month)
### Step 1 — Provision the drill instance (~10 min)
- [ ] Spin up a Postgres instance in the team's drill VPC at the
same major version as the live database
- [ ] Verify the drill instance has zero existing data
- [ ] Verify the drill instance is in a private VPC with no inbound
public traffic
### Step 2 — Download the backup (~10 min)
- [ ] Use the offsite-backup owner's IAM credentials
- [ ] Download the most recent pg_dump file from the offsite-backup
bucket via aws s3 cp
- [ ] Verify the SHA-256 hash against the value committed to
drill-log/{quarter}-backup-hashes.md
### Step 3 — Restore (~30 min)
- [ ] Run pg_restore --jobs=4 --no-owner --no-acl into the drill
instance
- [ ] Verify the restore exits with status 0
- [ ] If pg_restore complains about format incompatibility, abort,
file P1, page founder — Postgres major-version mismatch
between the dump host and the restore host
### Step 4 — Row-count diff (~10 min)
- [ ] Run row-count diff against the live database for probe_day
(the canonical rollup)
- [ ] Expected: drill instance row-count is within 1 day of live
(because the backup is from last night)
- [ ] Run row-count diff for probe_month — should be exactly equal
### Step 5 — Destroy the drill instance (~5 min)
- [ ] Tear down the drill VPC's Postgres instance
- [ ] Verify the drill VPC is fully torn down (no residual storage,
no residual snapshots)
### Step 6 — Receipt
- [ ] Drill receipt committed to drill-log/{quarter}-restore.md
- [ ] Receipt signed by the offsite-backup owner and the founder
- [ ] One-line note posted to the team's channel:
"Q{n} restore drill ran. Backup restorable. Row-count diff ok."
- [ ] If a P1 was raised: receipt includes the corruption details
and the next-day re-attempt plan.
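The step-4 row-count diff is simple enough to sketch inline: run the same query against the live database and against the drill instance and compare the two outputs, expecting the drill-side probe_day count to trail live by at most one day of rows and probe_month to match exactly.
-- Run against both the live database and the drill instance; diff the outputs.
SELECT 'probe_day'   AS surface, count(*) AS row_count FROM probe_day
UNION ALL
SELECT 'probe_month' AS surface, count(*) AS row_count FROM probe_month;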
Where this fits — archiver companion
This post is the archiver-side companion to the architectural walkthrough. The architectural walkthrough described the five layers — schema with native columns and CHECK constraints, idempotent ingestion behind a watermark, retention by tier with two mechanisms, GDPR-shaped delete in one transaction, and a suppression-cluster log as a materialised view — that make a single archiver safe across many tenants. This post described how to actually operate that architecture with one to five humans on the team — the headcount-to-archiver-ownership mapping, the week-1 setup checklist, the daily and weekly and monthly and quarterly drill cadence, the contractor and external-handshake pattern, and seven small-team-specific failure modes with structural fixes. Together they form the two halves of how a small multi-tenant MCP-monitoring team operates the persistence side of the stack. The architectural side and the operational side reinforce each other; neither stands alone.
The small-team-companion arc is now three posts deep. The first companion (post #14) paired with the operator-dashboard architectural walkthrough and described the four-layer permission model in operation. The second companion (post #15) paired with the per-tenant alert routing architectural walkthrough and described the five-layer alert router in operation. This post pairs with the shared-state archiver architectural walkthrough and describes the five-layer archiver in operation. One more companion is scheduled: the small-team companion to the multi-tenant probe collector — supervisor, workers, queues, secret store, coalescer — when operated with five-or-fewer staff. After that the small-team-companion arc closes and the next deliverable is the Q3 2026 audit.
The next deliverable after the small-team-companion arc is the Q3 2026 registry audit, landing mid-July 2026. The audit re-runs every probe from all five regions in parallel through the multi-tenant collector designed in post #10, with verdicts archived through the system designed in post #12 (the architectural reference this post operationalises), with cross-tenant suppression measured against the cluster log designed in post #11, and with operator actions during the audit window logged in the audit-log designed in post #13. The audit will report bucket-by-bucket movement vs the Q2 baseline — including how the archiver from this post's architectural counterpart held up under the Q3 audit's per-minute write rate; whether the credentialed-probe rollout from post #6 shrunk the auth-walled 16.8% bucket as expected; whether the schema-drift detector from post #4 caught the same 7.1%/48h drift rate or a different one; and the first end-to-end pass through the GDPR delete fan-out designed in post #12 at registry scale. Between now and then the small-team-companion arc continues — the practical guides that pair with each architectural walkthrough, calibrated for the team that has actually been doing the work the post describes.
Further reading on AliveMCP
- Shared-state archiver walkthrough — the architectural reference this post operationalises.
- Operating the four-layer permission model with five staff or fewer — the first small-team companion in the arc.
- Operating per-tenant alert routing with five staff or fewer — the second small-team companion in the arc; immediate predecessor of this post.
- Operator dashboard walkthrough — the operator-architecture side of the scale sub-series.
- Per-tenant alert routing at scale — the alert-router architecture; the suppression-cluster materialised view's downstream reader.
- Multi-tenant MCP probe collector — the write side of the scale sub-series; the verdict-minute Redis the archiver drains.
- State of the MCP Registry — Q2 2026 — the audit baseline the next quarterly audit will measure against.
- Why MCP servers die silently — 7 failure modes — the failure-class taxonomy the archived rows encode.
- JSON-RPC health checks vs HTTP probes — the protocol-aware probe whose verdicts the archiver persists.
- Schema drift in MCP tool definitions — the canonical-JSON SHA-256 hash that the archiver stores in tool_list_hash.
- MCP authentication primer — the four-posture decision tree that the credentialed-probe rows inherit from.
- Running a credentialed MCP health check, end to end — the per-region probe atom whose verdicts the archiver persists.
- Multi-region MCP probe deployment — the geographic-redundancy wrapper.
- Public status page for an MCP server — the human-facing reader of the rolled-up history.
- MCP uptime API and embeddable badge — the read-side surface that consumes the daily rollup.
- MCP server uptime monitoring — the whole stack
- MCP server health check — probe sequence explained
- MCP monitoring tool
- MCP endpoint not responding
- Check if your MCP server is alive
- UptimeRobot vs AliveMCP — a direct comparison
Want to be told before your MCP server dies silently?
AliveMCP probes every public MCP endpoint every 60 seconds, archives the canonical history through a Redis-to-Postgres pipeline that survives watermark drift and partition-drop races, fans GDPR Article 17 deletes across every derived view in one transaction, and gives your own staff a self-serve surface for retention preferences and deletion requests — all from the same multi-tenant stack described across the posts of the scale sub-series and operated by the small-team routines this post and its predecessors walk. Public servers are free; private servers start at $9/mo.