Deep dive · 2026-04-30 · Scale sub-series

Operator dashboard walkthrough — running one console safely for many MCP tenants

The scale sub-series so far has covered three sides of the multi-tenant MCP uptime stack. The collector walkthrough built the write side — supervisor, workers, per-region queues, per-tenant secret store, verdict-minute coalescer. The alert routing walkthrough built the paging side — sink-ownership verification, tenant-scoped configuration, cross-tenant suppression, per-tenant alert budgets, payload-shape boundaries. The shared-state archiver walkthrough built the persistence side — Redis-to-Postgres ingestion, retention by tier, daily and monthly rollups, the GDPR-shaped delete path, the suppression-cluster log as a derived view. Each of those three services emits metrics, surfaces tenant configuration, accepts admin operations, and produces audit logs. The single-tenant operator wires those four surfaces into a Grafana board and a handful of command-line scripts and ships the day. The multi-tenant operator needs a console — with per-tenant scoping, role-based access for staff and contractors and auditors, a customer-facing self-serve surface that lets tenants configure their own alert sinks and retention preferences and Article 17 requests without opening a support ticket, and an audit log that outlives every retention cap so that the question "who did what to which tenant on which minute, and from where" is answerable seven years later. This post is the architectural walkthrough.

TL;DR

The operator dashboard is the fourth load-bearing service in the multi-tenant MCP uptime stack — the one that operates the other three. Its job is to make safe, scoped, audited operations on the collector, the alert router, and the archiver possible without dropping every operator into a root-equivalent console with full read/write across every tenant. Five structural choices make it work. The four-layer admin permission model — root operator (full read/write across the platform, used by ~3 humans, MFA required, every action audited with a justification field), tenant-scoped operator (full read/write inside one tenant only, used by support and onboarding staff, Postgres row-security as the second-line defence, cannot self-elevate), read-only auditor (read-only across some or all tenants, used by SOC-2 auditors and security reviewers, cannot mutate, cannot impersonate, cannot read secrets — only metadata and aggregates), and customer self-serve (the tenant's own staff, scoped to one tenant, scoped further by role inside that tenant, scoped further again to a strict subset of the operator surface). The four layers are not a hierarchy; they are four different surfaces with four different threat models. The audit-log schema that outlives every other retention cap — one append-only Postgres table partitioned by month, retention 7 years for every tier, populated on every mutating operation by middleware not by the handler, with a strict schema (actor_role, actor_id, actor_ip, tenant_id, action, resource_kind, resource_id, justification, request_id, before_hash, after_hash, occurred_at) and a deliberate non-goal: the audit log does not store before/after content; it stores SHA-256 hashes of the canonical-JSON content, and the content lives in the per-resource history table where the per-tier retention policy applies.
The customer self-serve surface as a strict subset of the operator surface — every customer-self-serve route is implemented as a tenant-scoped-operator route with the tenant pinned by the session, plus a self-service-allowlist gate enforced in the same middleware that resolves the actor role; this means we never have a customer route that bypasses the operator audit log. The tenant-impersonation primitive — the operator can elevate to "view as <tenant>" with a 30-minute time-boxed session that runs every read through the customer self-serve renderer (so the operator sees what the customer sees), records the entire impersonation in the audit log with a load-bearing justification field, blocks every mutation by default, and requires a second human approval for any mutation performed during impersonation. The operator-vs-customer field cut — a tabular reference for what every dashboard surface shows at each layer, defaulting to "not shown" for fields not on the customer's side of the cut and "shown" for fields the operator needs to do their job; it covers verdict fields, probe-step fields, alert-router fields, archiver fields, billing fields, and infrastructure fields. The seven-failure-mode catalogue at the end is the safety net — admin role drift after staff turnover, customer-facing route drift past the self-service allowlist, impersonation session not properly closing on the operator's logout, audit-log write failure not failing the request, justification field becoming a vestigial copy-paste, role-leakage via cached sessions on dashboard-rebuild, and the tenant-pinned customer route that loses its tenant pin on a redirect. The recipe section sketches the permission middleware, the audit-log table DDL, the impersonation token flow, and the Article 17 self-serve workflow in copy-pasteable form.
This post closes the scale sub-series; the next deliverable is the Q3 2026 registry audit, which re-runs the probe stack designed in the four scale-sub-series posts against every endpoint in the public MCP registries and reports bucket-by-bucket movement vs the Q2 baseline.

Where the Grafana-plus-scripts setup stops being enough

The single-tenant operator runs the same probe stack the entire scale sub-series describes — credentialed probe, multi-region wrapper, status page, read-side API — but as a single tenant. The operator surface is whatever Grafana board they wired up, plus a handful of command-line scripts, plus direct Postgres and Redis access on a workstation that has the SSH key. There is no role model because there is one role; there is no audit log because the operator's shell history is the audit log; there is no customer self-serve because the operator is the customer. The whole machine has one user. None of that scales.

Three things break the moment a second tenant joins. The first is credential blast radius. The single-tenant operator has a workstation with a Postgres password that has full read/write on every table in the database; when there was one tenant's worth of secrets in the database, a leak of that password leaked one tenant's secrets. With many tenants, that workstation is now a single-credential single-host single-process compromise vector for every tenant the platform serves. The single-tenant operator's "log into the box and fix it" is a hard floor on the platform's credential blast radius. A multi-tenant operator's first job is to make the everyday case — a support engineer fixing an alert-sink misconfiguration for a Team-tier tenant — go through a path that is scoped to one tenant, audited per action, and unable to read or write any other tenant's data. Postgres row-security and a tenant-scoped admin role do that; a workstation with the master password does not.

The second thing that breaks is the customer-facing self-serve surface. The single-tenant operator can configure their own alert sink with a one-off Lua script that pushes a row into the alert_config table; if they get the SHA wrong they SSH back in and fix it. With many tenants, the support volume of "please change my Slack channel" alone is enough to swamp any small team. The first multi-tenant pivot is to give the customer a self-serve UI for the operations they perform most frequently — adding and verifying alert sinks (the verification handshake is per-tenant, not per-operator), changing retention preferences within their tier band, exporting their probe history, and submitting Article 17 deletion requests. That self-serve UI is not a separate product — building a separate product for it would mean a second copy of every authorisation rule, every audit-log writer, and every tenant-resolution helper. The second multi-tenant pivot is to make the customer self-serve surface a strict subset of the operator surface, sharing one middleware stack and one audit log.

The third thing that breaks is auditability. The single-tenant operator's audit trail is their shell history; SOC-2 reviewers don't accept that. The multi-tenant operator's audit trail has to answer "who did what to which tenant on which minute, from where, why, and what changed" for every mutating operation, has to outlive every retention cap on every other table (we delete the per-minute history at 7 days for Public-tier tenants but we keep the audit log of operations performed against that tenant for 7 years), and has to refuse to fail open — if the audit-log write fails, the operation that was being audited has to fail too. None of that is a Grafana feature. None of that is a shell-history feature. It is its own service and it is one of the load-bearing primitives the multi-tenant dashboard is built around.

Each of those three breaks has the same shape: it is fine for one tenant, intolerable for many, and the right answer is a small, boring, audited, scoped admin console — not a bigger Grafana board and not a more elaborate set of scripts. The operator dashboard is the service that makes safe operations on the multi-tenant collector, alert router, and archiver possible without dropping every operator into a root-equivalent console with full read/write across every tenant.

The four-layer admin permission model

The most consequential design choice in the multi-tenant operator dashboard is the role model — what set of distinct actor types we recognise, what each one can do, and how the model resists the inevitable scope creep of "this support engineer needs to be able to do X for one customer once". We landed on four layers. They are not a hierarchy; an operator at one layer cannot self-elevate to another, and the four layers each have a distinct threat model and a distinct surface. The four layers are the answer to four different questions.

Layer 1 — root operator

The root operator answers "who can perform a platform-wide change?" The platform-wide changes are migrations, partition rolls, retention policy changes, on-call paging schedule edits, KMS key rotations, billing-tier definition edits, and other actions that affect every tenant. Root operators are a small, named, fixed list — three humans on our deployment. The threat model is a compromised laptop or a coerced password. The defences are MFA on every login, hardware-token enforcement (no SMS, no TOTP-only), an audit log on every action with a load-bearing justification field that the UI refuses to accept as empty or copy-pasted from the previous action, a hard 60-minute idle timeout on the session, and a separate "break-glass" sub-role that requires a second root operator to approve any operation that touches more than 1% of tenants in a single transaction. Root operators are not the layer that does day-to-day support work; if a root operator finds themselves answering customer tickets, the role model has been deployed wrong.

Layer 2 — tenant-scoped operator

The tenant-scoped operator answers "who can perform a per-tenant operation on behalf of a customer?" These are support, onboarding, and account-management staff. Their session is bound to a specific tenant the moment they open a customer ticket; the binding is performed by an operator-onboarding flow that reads the ticket ID, resolves the tenant, and pins the session. Every read goes through the same Postgres row-security policy as the customer's own session would, with the actor role swapped to tenant_scoped_operator instead of customer. Every write goes through the same authorisation middleware the customer's own session would use, with the additional pre-condition that the operator's session is pinned to this tenant. The threat model is a support engineer who, in good faith, opens twenty tickets in one shift and fat-fingers a tenant ID; the defence is a tenant pin that the UI displays at the top of every page, a confirmation step on every mutating operation that re-displays the tenant name and asks the operator to type it, and a refusal in the database layer to perform any operation against a tenant that is not the pinned tenant. The more adversarial threat model — a malicious support engineer with read access to many tenants — is mitigated structurally by the audit log; the operator can read another tenant's data by repinning the session, but every repin is in the audit log, and the audit log is reviewed weekly with a mechanical "did the actor have an open ticket for this tenant within five minutes of the repin?" check. The check catches the only failure mode the layer can't structurally prevent — and catches it in a deterministic, repeatable way that does not depend on a human noticing.

Layer 3 — read-only auditor

The read-only auditor answers "who can read but never write?" These are SOC-2 auditors, security reviewers, internal compliance staff, and external pentesters during a scoped engagement. They are scoped to a subset of tenants by an explicit list — a SOC-2 auditor sees only the tenants that are in scope for the audit; a pentester during a scoped engagement sees only the synthetic tenants the engagement set up; an internal compliance reviewer sees an explicit list of tenants that opted in to the review. The defences are: the auditor session cannot mutate (every mutating endpoint refuses with a 403 even before the authorisation check runs); the auditor session cannot impersonate (the impersonation primitive in §6 refuses to start a session for an auditor role); the auditor session cannot read secrets (every secret-bearing field — alert-sink webhook URL, OAuth client secret, KMS-encrypted credential blob — is replaced with the SHA-256 fingerprint of the secret rather than the secret itself, so the auditor can verify rotation without ever seeing the cleartext); the auditor session is rate-limited and logged. The threat model is "an auditor's laptop is compromised and the attacker uses the session to learn things they shouldn't know"; the defence is the strict surface — the auditor sees the metadata and the aggregates, not the secrets and not the customer's content.
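
The secret-fingerprint substitution is small enough to pin down in code. A Go sketch — the helper name is an assumption; the point is that the auditor response carries the digest, never the cleartext:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint returns the SHA-256 hex digest of a secret. Auditor
// serializers replace every secret-bearing field with this digest, so
// an auditor can verify a rotation happened (the digest changed)
// without ever seeing the secret itself.
func fingerprint(secret string) string {
	sum := sha256.Sum256([]byte(secret))
	return hex.EncodeToString(sum[:])
}

func main() {
	before := fingerprint("whsec_old_example") // illustrative values, not real secrets
	after := fingerprint("whsec_new_example")
	fmt.Println(before != after) // true: rotation is visible as a digest change
}
```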

Layer 4 — customer self-serve

The customer self-serve actor answers "who can perform a per-tenant operation as the tenant?" These are the customer's own staff. They are scoped to one tenant by their authentication; they are scoped further by role inside that tenant (a customer admin, a customer member, a customer billing-only); they are scoped further again to a strict subset of the operator surface — there are operations the tenant-scoped operator can perform that the customer cannot (running an arbitrary SQL select, restoring a partition, force-rotating a tenant's KMS key) and there are operations the customer can perform that the tenant-scoped operator can also perform (configuring an alert sink, changing a retention preference within tier band, submitting an Article 17 request). The threat model has two parts: a malicious tenant trying to escalate to read another tenant's data (mitigated by the same row-security policy that backs the tenant-scoped operator), and a customer admin who fat-fingers their alert sink and pages on the wrong Slack channel (mitigated by the verification handshake in the alert routing walkthrough — every alert sink is unusable until verified). The customer self-serve surface is the most-used surface on the dashboard by call volume; the operations on it are the operations the platform stops needing to handle as support tickets. Every dollar of leverage in the multi-tenant operator dashboard comes from getting this layer right.

Two non-goals are worth naming. Tenant-scoped operator is not a superset of customer self-serve. A tenant-scoped operator helping a customer through an alert-sink configuration uses the impersonation primitive to see what the customer sees, not the operator surface; this is the only way to keep the bug-report loop tight ("the customer says X is broken; the operator opens the same screen and reproduces"). Customer self-serve is not a thin client over the operator surface. The implementation reuses the middleware and the audit log, but the route definitions are explicit allowlists — the customer self-serve router has its own list of routes it forwards to the operator handler, and any handler not on the list is unreachable from a customer session. The implementation strategy is "one service, two routers"; the route definition strategy is "explicit allowlist on the customer side, explicit denylist on the operator side, with a CI check that the two never disagree".
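
The CI check that the two route lists never disagree can be a pure set comparison. A minimal Go sketch with made-up route names — the real check walks the registered routers:

```go
package main

import "fmt"

// checkRouteCut verifies the customer allowlist and the operator-only
// denylist agree: every registered route must be on exactly one side
// of the cut. Anything on both sides, or on neither, is a violation.
func checkRouteCut(all, customerAllow, operatorDeny []string) []string {
	allow := toSet(customerAllow)
	deny := toSet(operatorDeny)
	var violations []string
	for _, r := range all {
		inAllow, inDeny := allow[r], deny[r]
		if inAllow && inDeny {
			violations = append(violations, r+": on both sides of the cut")
		}
		if !inAllow && !inDeny {
			violations = append(violations, r+": unclassified route")
		}
	}
	return violations
}

func toSet(xs []string) map[string]bool {
	s := make(map[string]bool, len(xs))
	for _, x := range xs {
		s[x] = true
	}
	return s
}

func main() {
	all := []string{"alert_sink.create", "partition.restore"}
	violations := checkRouteCut(all,
		[]string{"alert_sink.create"}, // customer may reach this
		[]string{"partition.restore"}, // customers must never reach this
	)
	fmt.Println(len(violations)) // 0: the two lists agree
}
```

The check fails the build on any non-empty result, which is what turns route drift from a security review finding into a red CI run.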

The audit-log schema that outlives every other retention cap

The audit log is the second load-bearing primitive in the multi-tenant operator dashboard. Three properties make it different from every other table in the platform. First, it is append-only; nothing ever updates a row, nothing ever deletes a row, and the table has no UPDATE grant on any operator role. Second, it has a uniform retention policy: 7 years for every tier, regardless of the per-tenant retention policy on the per-minute history table. Third, the audit log does not store content; it stores hashes of the canonical-JSON content, with the content itself living in the per-resource history table where the per-tier retention policy applies. The third property is what makes the uniform 7-year retention compatible with GDPR — when the customer submits an Article 17 request, the per-resource history is deleted, the verdict-minute Redis is purged, the suppression-cluster contributing-tenant set is salt-and-replaced, and the audit log is left intact with the resource hashes pointing at content that no longer exists. The audit log row says "the customer's admin reset their alert sink at 2026-04-30T17:42Z; the before-hash was X and the after-hash was Y"; if the resource has been deleted under Article 17, the hashes are now unresolveable, but the row stands.

The schema is small and load-bearing. Every column matters and the order of columns is deliberate.

CREATE TABLE audit_log (
    occurred_at      timestamptz NOT NULL,
    actor_role       text NOT NULL CHECK (actor_role IN
                       ('root_operator','tenant_scoped_operator',
                        'read_only_auditor','customer_self_serve','system')),
    actor_id         uuid NOT NULL,
    actor_ip         inet NOT NULL,
    tenant_id        uuid,             -- NULL for platform-wide actions
    action           text NOT NULL,    -- e.g. 'alert_sink.create'
    resource_kind    text NOT NULL,    -- e.g. 'alert_sink'
    resource_id      text,             -- composite-friendly text
    justification    text NOT NULL,    -- empty string allowed only for 'system'
    request_id       uuid NOT NULL,
    before_hash      bytea,            -- sha256 of canonical-JSON before, NULL on create
    after_hash       bytea,            -- sha256 of canonical-JSON after, NULL on delete
    PRIMARY KEY (occurred_at, request_id)
) PARTITION BY RANGE (occurred_at);

REVOKE UPDATE, DELETE ON audit_log FROM PUBLIC;
GRANT INSERT, SELECT ON audit_log TO operator_app;
GRANT SELECT ON audit_log TO auditor_app;
-- no UPDATE or DELETE grant to anyone, including root_operator

Every field earns its place. occurred_at is timestamptz (UTC always) and is part of the primary key — partition pruning depends on it, and the only safe ordering is "what the database recorded as the commit time". actor_role is the four-layer enum from the previous section, plus system for actions taken by the platform (for example, the archiver running an Article 17 fan-out delete it was instructed to run by a customer self-serve action — the audit log has both rows: the customer's data_deletion_request.create and the system's data_deletion.fan_out). actor_id is a UUID; for human actors it points at the operator-account or customer-account table; for system actions it points at the workload identity that ran the action. actor_ip is captured at the edge and forwarded into the operation; load balancers and proxies are configured to use forwarded-for in a way that the middleware can trust. tenant_id is nullable — platform-wide root-operator actions like a partition roll have no tenant — but for any per-tenant action the column is non-null and is matched by the row-security policy on every read. action is a dotted string with a stable enumeration committed to the codebase; the CI check verifies that every action string in the audit-log writes is in the enumeration. resource_kind and resource_id together identify the resource the action ran against; the resource_id is composite-friendly text rather than a UUID because some resources have natural composite keys (an alert sink is identified by tenant_id|sink_kind|sink_id). justification is a plain-text field that the UI refuses to accept as empty (except for the system role) or as a copy-paste of the previous justification; the rationale is in the failure modes section below. request_id is the platform-wide request UUID that is also written into the rest of the platform's logs; it is the join key from the audit log to the application logs and the Postgres slow query log. 
before_hash and after_hash are SHA-256 hashes of the canonical-JSON serialisation of the resource before and after the action; on a create, before is NULL; on a delete, after is NULL; on an update, both are non-NULL and the hashes can be matched against the per-resource history table.
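
The hash computation is deliberately boring. A Go sketch — note that encoding/json marshals map keys in sorted order, which is enough of a canonical form for the flat snapshot shown here; a real nested resource needs a full canonicalisation pass:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// contentHash computes the SHA-256 digest of a canonical-JSON
// serialisation of a resource snapshot. The audit row stores only
// this digest; the content itself lives in the per-resource history
// table under its own retention policy.
func contentHash(resource map[string]any) ([32]byte, error) {
	b, err := json.Marshal(resource) // map keys come out sorted
	if err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256(b), nil
}

func main() {
	// Illustrative alert-sink snapshots, not the real resource shape.
	before, _ := contentHash(map[string]any{"channel": "#alerts", "kind": "slack"})
	after, _ := contentHash(map[string]any{"channel": "#oncall", "kind": "slack"})
	fmt.Println(before != after) // true: the update is visible as a digest change
}
```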

Two design decisions are deliberate and worth calling out. The audit log does not store content. Storing the before and after content would mean the audit log inherits the strictest retention requirement of every other table on the platform — if the per-minute history is 7 days for Public tier, but the audit log has the per-minute verdict in its before/after, then a Public tier tenant's Article 17 request requires walking the audit log and selectively erasing the per-tenant rows, which breaks the append-only property and changes the table's threat model. Storing only the hash means the audit log is uniform 7-year retention for every tenant, the per-resource content is on its own retention policy, and the Article 17 fan-out from the archiver walkthrough only has to walk the per-resource history tables. The audit log is written by middleware, not by the handler. Every mutating route is wrapped by an authorisation+audit middleware that reads the resource's before-state, runs the handler, reads the resource's after-state, computes the hashes, and inserts the audit row in the same transaction as the handler's mutation. If the handler's mutation rolls back, the audit row rolls back too. If the audit insert fails, the transaction fails and the request returns 500 — there is no path where the mutation succeeds and the audit fails. This is the strictest possible mutate-and-audit invariant; the middleware is ~80 lines of Go and is the single most important defensive component in the dashboard.

The customer self-serve surface as a strict subset of the operator surface

The customer self-serve surface is the largest by call volume. It is also the surface most prone to drift — a feature added "just for the operator" gets exposed to the customer because the route happened to be reachable from a session both layers share, or a route exposed to the customer gets a new field that leaks information from another tenant because the new field's serializer didn't go through the customer field-cut. The defence is structural: the customer self-serve router is an explicit allowlist over the operator handlers, the customer surface is field-cut at the response serializer, and the customer surface and the operator surface are tested with a single test harness that verifies the cut for every endpoint.

The five customer self-serve operations that earn their place on the surface are: add and verify an alert sink (Slack inbound-proof handshake, webhook TXT-record proof, email per-recipient verification, PagerDuty OAuth-PKCE, all with the same flow as the alert routing walkthrough describes); change retention preference within tier band (the customer's tier sets the cap; the customer can request a value below the cap and the change takes effect at the next archiver tick — see the archiver walkthrough for the per-tier retention table); export probe history (the customer requests a CSV or JSON export of their probe history; the export job runs in a worker, the result lands in a tenant-scoped object-store bucket, and the customer downloads it via a signed URL with a 24-hour expiry); submit an Article 17 deletion request (the workflow is described below); change billing tier and seat count (a self-serve flow for upgrading and downgrading; downgrades take effect at the next billing cycle to avoid surprise data loss; the workflow is interlocked with the retention preference because a downgrade can lower the tier cap below the customer's current preference, in which case the preference is auto-trimmed to the new cap with an explicit confirmation step the customer has to acknowledge before the downgrade commits). Every other operation — anything involving a partition roll, anything involving more than the customer's own tenant, anything involving a read of secrets the customer's own staff don't manage — is operator-only.
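
The downgrade/retention interlock is a two-line decision worth writing down. A Go sketch with an illustrative per-tier cap table — the real table is in the archiver walkthrough:

```go
package main

import "fmt"

// tierCapDays is an illustrative per-tier retention cap, not the
// platform's real numbers.
var tierCapDays = map[string]int{"public": 7, "team": 90, "business": 365}

// trimOnDowngrade returns the retention preference that survives a
// tier change, plus whether the customer must acknowledge a trim
// before the downgrade is allowed to commit.
func trimOnDowngrade(prefDays int, newTier string) (days int, needsAck bool) {
	capDays := tierCapDays[newTier]
	if prefDays > capDays {
		return capDays, true // preference auto-trimmed to the new cap
	}
	return prefDays, false
}

func main() {
	days, ack := trimOnDowngrade(90, "public")
	fmt.Println(days, ack) // 7 true: the trim must be acknowledged before the downgrade commits
}
```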

The Article 17 deletion request workflow

The Article 17 workflow is the single most consequential customer self-serve flow because it is the workflow that exercises every other layer of the platform on a single customer action. The user clicks "delete this server's data" in the dashboard. The customer self-serve handler resolves the tenant, the customer's role inside the tenant (only customer admins can submit Article 17), and the resource identifier. It writes an article_17_request row in the database with status pending, justification field auto-populated with "customer self-serve Article 17 request", and a 7-day cooling-off period (the cooling-off is a regulatory option; for some jurisdictions it is required, for some it is optional, and the platform errs on the side of including it because customers occasionally submit deletion requests in haste and ask to roll them back). At the cooling-off boundary, a system worker picks up the request, runs the GDPR delete fan-out from the archiver walkthrough (advisory-lock per-server, write tombstone first, delete from probe_minute / probe_day / probe_month / suppression_clusters / verdict_minute Redis / alert-router Redis / read-side cache / object-store export bucket / this customer's audit-log content references — the audit-log row stands; the resource the row points at is gone), updates the request row to status completed, and writes a confirmation audit-log row with actor_role = 'system'. The customer receives an email confirming the deletion and a downloadable PDF receipt with the request ID, the cooling-off start, the completion timestamp, and the canonical hash of the deleted resource at deletion-time. The PDF is what the customer submits to a regulator if the regulator asks for proof.

Two failure modes for Article 17 are structurally mitigated. The customer changes their mind during the cooling-off period. The dashboard shows pending Article 17 requests with a "cancel" button; cancellation marks the row cancelled and the worker skips the row. The cancellation is itself an audited action with its own justification field. The cooling-off period elapses while the system is in a degraded state and the worker can't run the fan-out. The worker is idempotent (the per-server advisory lock, the tombstone-first ordering, the at-least-once delivery model from the alert routing walkthrough's queue infrastructure all carry over), and the worker retries with exponential backoff bounded at 24 hours; if the request is still pending 24 hours past the cooling-off boundary, the platform pages on-call and the failure is on the operator's surface, not the customer's.

The tenant-impersonation primitive

Every multi-tenant operator dashboard eventually needs an impersonation primitive — a way for a tenant-scoped operator to "view as the customer" so they can reproduce the customer's bug from the customer's session, see the customer's screen the way the customer sees it, and walk through the customer's flow without translating between operator and customer fields. The impersonation primitive is the single most dangerous primitive in the dashboard, and the failure mode is famous: an operator starts an impersonation session for one customer, gets distracted, and weeks later their cached session is still active and is used by a colleague to look at a different customer's data. The defences are layered and structural.

The flow is: the tenant-scoped operator opens a customer ticket, navigates to the tenant view, and clicks "view as customer". The operator is required to enter a justification (free text, refused if empty, refused if matched against the operator's last 10 justifications, refused if matched against the global blacklist of low-information justifications like "support" or "test") and a ticket-system reference. The dashboard server creates an impersonation token with a 30-minute hard expiry, a strict operator_id binding (the token is unusable from any other operator's session), a strict tenant_id binding (the token is unusable for any other tenant), a read_only=true default flag (which can be flipped to read_only=false only by a second human approval, fetched via a side-channel page that opens for the second approver and requires their MFA code), and a session-cookie fingerprint binding that locks the token to the operator's current browser session. Every read performed during the impersonation session is rendered through the customer self-serve renderer (so the operator sees what the customer sees, never what the operator would see). Every read is recorded in the audit log with actor_role = 'tenant_scoped_operator', tenant_id = <the impersonated tenant>, and an impersonation_request_id field that links every row in the impersonation session to the original token-issue row. The token is invalidated on operator logout, on the operator closing the impersonation banner at the top of every page, on the operator switching tenants, on a ten-minute idle timeout, on the 30-minute hard expiry, and on a watchdog that cancels every active impersonation token whenever the operator's primary session ends for any reason — the token cannot outlive the parent session under any circumstance. The audit-log entry on token issue includes the justification, the ticket-system reference, the operator's IP, and the second approver's identity (if the token was upgraded to read-write).

Two design choices are non-obvious and earn their place. The impersonation banner is non-dismissable. The dashboard shows a non-dismissable banner at the top of every page during an impersonation session, with the impersonated tenant's name in red, the time remaining on the token, and a one-click "end impersonation" button. The banner is non-dismissable because the most common operator failure mode — a long-lived impersonation session that the operator forgot was open — is the failure mode the banner exists to make impossible. The impersonation primitive does not exist for the read-only auditor or root operator roles. The auditor role cannot impersonate (auditors see metadata and aggregates, not customer screens). The root operator role cannot impersonate (root operators don't do customer-facing support; if they did, their session would not be the right surface). Impersonation is a tenant-scoped-operator-only primitive, by design.
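
The justification gate from the token-issue flow is a pure predicate. A Go sketch — the blacklist and the last-ten window are illustrative configuration:

```go
package main

import (
	"fmt"
	"strings"
)

// lowInformation is an illustrative global blacklist of stock
// justifications; the real list lives in configuration.
var lowInformation = map[string]bool{"support": true, "test": true, "debugging": true}

// acceptJustification applies the three refusals: empty, a repeat of
// the operator's recent justifications, or a low-information phrase.
func acceptJustification(j string, lastTen []string) bool {
	j = strings.TrimSpace(j)
	if j == "" || lowInformation[strings.ToLower(j)] {
		return false
	}
	for _, prev := range lastTen {
		if strings.EqualFold(strings.TrimSpace(prev), j) {
			return false
		}
	}
	return true
}

func main() {
	last := []string{"ticket #8841: sink misconfigured"} // hypothetical ticket reference
	fmt.Println(acceptJustification("ticket #8841: sink misconfigured", last)) // false: copy-paste
	fmt.Println(acceptJustification("support", nil))                           // false: low-information
	fmt.Println(acceptJustification("ticket #8902: verify retention trim", last)) // true
}
```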

The operator-vs-customer field cut

The operator-vs-customer field cut is the tabular reference that defines, per surface, what every layer of the dashboard sees. It exists because every endpoint that returns data has to make this decision and the decision has to be made the same way every time; a per-endpoint judgement call is the surest way to leak per-region detail or probe-step internals or supervisor-level stack traces into a customer response. The cut is one big table; the prose rendering below is enough to anchor the architecture.

The field-cut rule is: the customer sees what they need to operate their integration; the tenant-scoped operator sees what they need to operate their support ticket; the auditor sees what they need to verify the platform's compliance posture; the root operator sees everything. The cut is enforced at the response serializer, not at the database query — the database query returns the full row, the serializer applies the field cut for the actor role, and a CI check verifies for every endpoint that the customer-facing serializer never returns a field on the operator-only side of the cut. The cut is auditable because the serializer is one component, not many.
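In Go, the serializer-side cut is small enough to sketch. This is a minimal illustration of the single-component rule above, not the production serializer; the role names and field lists are illustrative stand-ins for the real cut tables:

```go
package main

// verdictCut maps actor role to the allowlist of fields that role may see
// on the verdict surface. An illustrative subset of the cut described above.
var verdictCut = map[string][]string{
	"customer_self_serve":    {"state", "uptime_30d", "p95_ms", "last_probe_ago", "as_of"},
	"tenant_scoped_operator": {"state", "uptime_30d", "p95_ms", "last_probe_ago", "as_of", "per_region", "probe_steps"},
}

// applyCut drops every field not on the role's allowlist. The database query
// returns the full row; this is the one place the cut is applied. An unknown
// role gets an empty allowlist, so the serializer is default-deny.
func applyCut(row map[string]any, role string) map[string]any {
	out := map[string]any{}
	for _, f := range verdictCut[role] {
		if v, ok := row[f]; ok {
			out[f] = v
		}
	}
	return out
}
```

The CI check mentioned above then reduces to asserting, per endpoint, that the customer role's allowlist contains no field on the operator-only side of the cut.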

The field cut for the verdict surface — the read-side API contract the customer's badge and CI guardrail and runtime liveness check all read — is: customer sees state ∈ {up, down, degraded}, uptime_30d, p95_ms, last_probe_ago, as_of; tenant-scoped operator sees the customer fields plus per-region cells, probe-step breakdown for the most recent minute, the tool-list-hash sequence, and the canonical-JSON of the most recent initialize response; auditor sees the customer fields plus per-region cells and probe-step breakdown but never the canonical-JSON (the body might contain customer-specific tool descriptions); root operator sees everything plus the underlying credential fingerprint, the worker pod ID, and the supervisor's queue position for the most recent probe.

The field cut for the alert-router surface — the alert routing walkthrough describes the events — is: customer sees the alert events that fired for their tenant in the last 30 days, the alert-sink configuration, the verification status; tenant-scoped operator sees the customer fields plus the suppression-cluster ID for any event that was suppressed (the cluster's contributing-tenant set is hashed, so the operator sees that the event was suppressed at the registry-wide level without seeing which other tenants were in the cluster), and the per-tenant budget consumption; auditor sees the alert-sink verification handshake history and the budget consumption aggregates but never the alert content; root operator sees everything plus the global suppression-cluster log and the budget definitions.

The field cut for the archiver surface is: customer sees their probe-minute count, daily and monthly rollups for their servers, the retention preference and current cap; tenant-scoped operator sees the customer fields plus the watermark position, the partition list, and the contributing-tenant hash for the suppression-cluster log; auditor sees the retention configuration and the GDPR delete log but never the per-minute content; root operator sees everything plus the partition-roll history and the schema version.

The field cut for the billing surface is the strictest: customer sees their own invoices, their tier, their seat count, their card on file (last-4 only), their billing email; tenant-scoped operator sees the customer fields plus the billing history, the tier-change log, the dunning state; auditor sees the tier-change log and the dunning state but no card information ever; root operator sees the platform's billing aggregates but no individual customer's card information ever (root operators are not the layer that handles billing; the billing layer has its own sub-role on top of root operator).

Two non-obvious cuts are worth naming. Customer-supplied content is not in the operator field cut by default. The customer's tool descriptions, alert payload templates, and self-supplied free-text fields are operator-only with explicit opt-in — a customer who needs operator help reproducing a bug in their own template flips an "allow operator to view this template" toggle, and the operator can see it for the duration of the open ticket; the audit log captures the toggle's flip, the operator's reads, and the toggle's auto-flip-back at ticket close. The default-deny is the right shape because most operator support tickets do not require reading the customer's tool description, and a few that do can use the explicit opt-in. The auditor field cut never crosses tenant boundaries. An auditor scoped to one tenant sees aggregates that include only that tenant; an auditor scoped to many tenants sees per-tenant data only when explicitly listed in the auditor's scope, and never sees a join across tenants the auditor is not in scope for. This is the rule that keeps "an auditor whose laptop is compromised" from becoming a many-tenant data leak.
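The opt-in toggle compresses to the same pseudocode shape as the reference recipes later in the post; the route, field names, and the ticket-close hook are illustrative:

```
POST /api/customer/templates/{template_id}/allow-operator-view
  preconditions:
    actor.role == 'customer_self_serve'
    actor.tenant_role == 'admin'
    open ticket_ref supplied
  side effects:
    set template.operator_viewable = true, bound to ticket_ref
    audit_log.insert(action='template.operator_view.opt_in', resource_id=template_id)

# Every operator read while the toggle is on is audited:
#   audit_log.insert(action='template.operator_view.read')
# On ticket close, the toggle auto-flips back:
#   set template.operator_viewable = false
#   audit_log.insert(action='template.operator_view.auto_revoke')
```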

Seven failure modes specific to a multi-tenant operator console

Each failure mode below is the kind of bug that is invisible while you have one tenant and obvious in retrospect once you have many. Each has a structural fix that belongs in the dashboard architecture, not in a runbook.

1. Admin role drift after staff turnover

Symptom: an operator leaves the company and their account remains active for weeks because deactivation is a manual step on someone's checklist; the account's session token is still valid; the audit log shows a read from the departed operator's account two weeks after their last day. Structural fix: every operator account is tied to a single source of truth (the company's identity provider in our deployment), the dashboard's authorisation middleware re-resolves the actor's status from the IdP on every session refresh (every 15 minutes), an account that is no longer present in the IdP is auto-deactivated and every active session is invalidated, and a weekly automated audit compares the dashboard's operator-account list to the IdP's group membership and pages on any drift. The check is mechanical and runs on a cron; it does not depend on a human noticing.
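The weekly drift audit is mechanical enough to sketch as a single query, assuming the IdP's group membership is mirrored into an idp_members table (table and column names here are illustrative):

```sql
-- Any active dashboard account with no matching IdP member is drift; the
-- cron pages on a non-empty result. Assumes idp_members is refreshed from
-- the identity provider before the comparison runs.
SELECT a.actor_id, a.email
  FROM operator_accounts a
  LEFT JOIN idp_members m ON m.email = a.email
 WHERE a.active
   AND m.email IS NULL;
```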

2. Customer-facing route drift past the self-service allowlist

Symptom: a developer adds a new operator handler, registers it in the operator router, and forgets to register it on the customer router; a customer issues a request that hits the route by guessing the URL; the route works because the same middleware allows it. Structural fix: the customer router is an explicit allowlist over the operator handler list, and a CI check fails the build if any operator handler exists that is not on either the customer allowlist or the operator-only denylist (every handler must be explicitly classified). The check refuses the build with a list of unclassified handlers; new handlers cannot ship without an explicit decision. The customer router refuses every route that is not on the allowlist, regardless of what the underlying handler does.
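The classification check is a few lines of Go; handler names and list shapes below are illustrative, and the real check would read the router registrations rather than take slices:

```go
package main

// unclassified returns every handler that appears on neither the customer
// allowlist nor the operator-only denylist. A non-empty result fails the
// build with the offending handler names, so new handlers cannot ship
// without an explicit classification decision.
func unclassified(handlers, customerAllow, operatorOnly []string) []string {
	classified := map[string]bool{}
	for _, h := range customerAllow {
		classified[h] = true
	}
	for _, h := range operatorOnly {
		classified[h] = true
	}
	var missing []string
	for _, h := range handlers {
		if !classified[h] {
			missing = append(missing, h)
		}
	}
	return missing
}
```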

3. Impersonation session not properly closing on operator logout

Symptom: an operator starts an impersonation session, then logs out without ending the impersonation; the impersonation token is still in the session store with 28 minutes remaining; a colleague who shares the laptop opens a new session, the password manager autofills, and the colleague authenticates against the operator's account; without further protection, the lingering impersonation token would now be usable in the colleague's session, with the audit log attributing everything to the original operator. Structural fix: a session watchdog that cancels every active impersonation token whenever the operator's primary session ends for any reason; the impersonation token cannot outlive the parent session under any circumstance. The watchdog runs in-process for sessions on the same node and as a Postgres-trigger-fired cleanup for sessions that ended on a different node. Combined with the session-cookie-fingerprint binding from the token-issue flow above, the failure mode is structurally impossible — the colleague's session has a different cookie fingerprint than the operator's, and the impersonation token refuses to authenticate.
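The cross-node half of the watchdog can be sketched as a Postgres trigger, assuming a sessions table whose rows are marked ended (or deleted) when the parent session ends, and an impersonation_tokens table carrying parent_sid; names are illustrative:

```sql
CREATE OR REPLACE FUNCTION revoke_orphan_impersonations() RETURNS trigger AS $$
BEGIN
    -- Any token whose parent session just ended is revoked immediately.
    UPDATE impersonation_tokens
       SET revoked_at = now()
     WHERE parent_sid = OLD.id
       AND revoked_at IS NULL;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER session_end_revokes_impersonation
    AFTER DELETE OR UPDATE OF ended_at ON sessions
    FOR EACH ROW EXECUTE FUNCTION revoke_orphan_impersonations();
```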

4. Audit-log write failure not failing the request

Symptom: a database is briefly partitioned; the dashboard's mutating endpoint succeeds (the row was written before the partition), the audit-log insert fails (the partition happened mid-transaction), the middleware logs the failure, and the request returns 200 to the operator anyway because the developer who wrote the middleware put the audit insert in a try/catch with a log statement and no re-raise. Six months later, an investigation needs to find out what changed on a tenant on a specific day and the audit log has gaps. Structural fix: the audit insert is in the same transaction as the handler's mutation; if the audit insert fails, the transaction fails and the handler's mutation rolls back. The middleware is ~80 lines of Go and the test for "audit failure rolls back the mutation" is the first test in the test file. There is no try/catch around the audit insert that absorbs the failure. There is no path where the mutation succeeds and the audit fails.

5. Justification field becoming a vestigial copy-paste

Symptom: the audit-log justification field exists, but every operator types "support ticket #1234" in the field every day, and the field is not load-bearing because the justifications are uniformly low-information. Six months later, the audit log's justification field is unhelpful for any forensic question. Structural fix: the UI refuses to accept an empty justification; the UI refuses to accept a justification matching the operator's last 10 justifications; the UI refuses to accept a justification matching a global blacklist of low-information justifications ("support", "test", "fix", "support ticket", "n/a"); the UI requires a ticket-system reference for any justification used in an impersonation session. The friction is intentional — the only way to make the justification field load-bearing is to make it expensive to type a low-information justification.
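The refusal rules compress to a small Go predicate. This is a sketch: the blacklist contents come from the list above, and the lastTen parameter stands in for a per-actor lookup against the audit log:

```go
package main

import "strings"

// lowInfo is the global blacklist of low-information justifications.
var lowInfo = map[string]bool{
	"support": true, "test": true, "fix": true,
	"support ticket": true, "n/a": true,
}

// validJustification refuses empty strings, blacklisted strings, and any
// justification matching the actor's last 10 (case-insensitively), which is
// what makes copy-paste justifications expensive.
func validJustification(just string, lastTen []string) bool {
	j := strings.ToLower(strings.TrimSpace(just))
	if j == "" || lowInfo[j] {
		return false
	}
	for _, prev := range lastTen {
		if strings.ToLower(strings.TrimSpace(prev)) == j {
			return false
		}
	}
	return true
}
```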

6. Role leakage via cached sessions on dashboard rebuild

Symptom: the dashboard's role definitions are bumped (a new role is added or a role's scope is reduced); the deployed binary is the new code but cached sessions still carry the old role's claims; for several minutes after deploy, a few sessions are authorised against rules that no longer apply. Structural fix: the session token includes a role-definitions hash in its signature; on every request, the middleware compares the session's role-definitions hash to the platform's current role-definitions hash; if they differ, the session is forcibly re-issued (the next request triggers a fresh role resolution from the IdP) before any authorisation check runs. The hash bump is automatic on any role-definitions change committed to the codebase. The same mechanism handles role-permission updates that take effect on a rolling deploy.
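The hash comparison can be sketched like this; role and permission names are illustrative, and the real implementation would hash the committed role-definitions file rather than an in-memory map:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// RoleDefsHash digests the sorted role-to-permissions mapping, so any
// committed change to the definitions changes the hash.
func RoleDefsHash(defs map[string][]string) string {
	roles := make([]string, 0, len(defs))
	for r := range defs {
		roles = append(roles, r)
	}
	sort.Strings(roles)
	h := sha256.New()
	for _, r := range roles {
		perms := append([]string(nil), defs[r]...)
		sort.Strings(perms)
		fmt.Fprintf(h, "%s=%s;", r, strings.Join(perms, ","))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// SessionStale reports whether a cached session was issued under an older
// role-definitions hash and must be re-issued before any authorisation check.
func SessionStale(sessionHash, currentHash string) bool {
	return sessionHash != currentHash
}
```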

7. Tenant-pinned customer route loses its tenant pin on a redirect

Symptom: a customer self-serve route forwards to a shared component that needs a tenant ID; the shared component reads the tenant ID from a query parameter rather than the session; a redirect strips the query parameter; the next request re-resolves the tenant from a default that turns out to be the platform's first tenant in the database. The customer's request now reads (or worse, writes) against the wrong tenant. Structural fix: the tenant ID is read from the session, never from a query parameter or a path segment, except in the operator-side tenant-switcher endpoint where the tenant ID is the explicit subject of the request and is enforced by a row-security policy that requires the operator's session to have the appropriate role; a CI check fails the build if any handler reads tenant_id from a query parameter or path segment outside the explicit allowlist; row-security is the second-line defence on top of the application-layer rule.
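The row-security second line of defence, sketched for one table with illustrative table, column, and setting names; the middleware sets the per-transaction tenant from the session, never from the URL:

```sql
ALTER TABLE alert_sinks ENABLE ROW LEVEL SECURITY;

-- Rows are visible only when the row's tenant matches the tenant the
-- application pinned for this transaction.
CREATE POLICY tenant_isolation ON alert_sinks
    USING (tenant_id = current_setting('app.session_tenant_id')::uuid);

-- Set once per transaction by the middleware, from the session:
--   SET LOCAL app.session_tenant_id = '<tenant uuid from session>';
```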

Reference recipes

Each recipe is small and copy-paste-friendly; the goal is to make the architecture concrete enough to start with, not to ship a production-ready service. Adapt to your stack, your IdP, and your retention regime.

The permission middleware (Go, ~80 lines)

// auth.go — applied to every mutating route on the dashboard server
func WithAuthAndAudit(next http.Handler, action string, kind string) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        actor, err := ResolveActor(r) // reads session, refreshes from IdP if stale
        if err != nil { http.Error(w, "unauthorized", 401); return }
        if !RoleAllows(actor, action, kind) {
            http.Error(w, "forbidden", 403); return
        }
        // Read tenant ID from the session, never from query/path (rule #7).
        tenantID := actor.SessionTenantID(r)
        if RouteRequiresTenant(action) && tenantID == uuid.Nil {
            http.Error(w, "no tenant in session", 400); return
        }
        // Justification: required for every mutating action; refused if empty
        // or if matched against the actor's last 10 or the global blacklist.
        just := r.Header.Get("X-Justification")
        if !ValidJustification(actor.ID, just, action) {
            http.Error(w, "invalid justification", 400); return
        }
        // Read the resource's before-state.
        before, _ := ReadResourceForAudit(r.Context(), kind, ResourceID(r))
        // Begin a Postgres transaction; the handler runs in this transaction.
        tx, err := db.BeginTx(r.Context(), nil)
        if err != nil { http.Error(w, "tx begin failed", 500); return }
        ctx := WithTx(r.Context(), tx)
        rec := NewResponseRecorder(w)
        next.ServeHTTP(rec, r.WithContext(ctx))
        // Read the resource's after-state in the same transaction.
        after, _ := ReadResourceForAudit(ctx, kind, ResourceID(r))
        // Insert the audit-log row in the same transaction.
        _, err = tx.ExecContext(ctx, `INSERT INTO audit_log
            (occurred_at, actor_role, actor_id, actor_ip,
             tenant_id, action, resource_kind, resource_id,
             justification, request_id, before_hash, after_hash)
             VALUES (now(), $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)`,
             actor.Role, actor.ID, ActorIP(r), nullableTenant(tenantID),
             action, kind, ResourceID(r), just, RequestID(r),
             canonicalSHA(before), canonicalSHA(after))
        if err != nil { tx.Rollback(); http.Error(w, "audit failed", 500); return }
        if rec.StatusCode() >= 500 { tx.Rollback(); rec.Flush(w); return }
        if err := tx.Commit(); err != nil {
            http.Error(w, "commit failed", 500); return
        }
        rec.Flush(w)
    })
}
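One plausible shape for the canonicalSHA helper the middleware calls, hedged as a sketch: encoding/json already emits map keys in sorted order, which is enough for map-shaped resources; struct-shaped resources would need a fuller canonical-JSON pass than this shows:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// canonicalSHA returns the SHA-256 of the canonical-JSON encoding of a
// resource. A nil resource hashes to nil, so the audit row can record
// "resource did not exist" on create and delete.
func canonicalSHA(v any) []byte {
	if v == nil {
		return nil
	}
	b, err := json.Marshal(v) // map keys are emitted in sorted order
	if err != nil {
		return nil
	}
	sum := sha256.Sum256(b)
	return sum[:]
}

// hexHash is a convenience wrapper for logging and comparison.
func hexHash(v any) string { return hex.EncodeToString(canonicalSHA(v)) }
```

The hash, not the content, is what the audit_log row stores; the content stays in the per-resource history table where the per-tier retention policy applies.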

The audit-log table and grants (Postgres)

-- One partition per month; the cron rolls a new partition 7 days before the boundary.
CREATE TABLE audit_log (
    occurred_at      timestamptz NOT NULL,
    actor_role       text NOT NULL,
    actor_id         uuid NOT NULL,
    actor_ip         inet NOT NULL,
    tenant_id        uuid,
    action           text NOT NULL,
    resource_kind    text NOT NULL,
    resource_id      text,
    justification    text NOT NULL,
    request_id       uuid NOT NULL,
    before_hash      bytea,
    after_hash       bytea,
    PRIMARY KEY (occurred_at, request_id)
) PARTITION BY RANGE (occurred_at);

-- Indexes for forensics: by actor (who did what), by tenant (what happened
-- to this tenant), and by request_id (join to application logs).
CREATE INDEX audit_actor_idx ON audit_log (actor_id, occurred_at DESC);
CREATE INDEX audit_tenant_idx ON audit_log (tenant_id, occurred_at DESC) WHERE tenant_id IS NOT NULL;
CREATE INDEX audit_request_idx ON audit_log (request_id);

-- No UPDATE or DELETE grant to anyone; the table is append-only by ACL,
-- not by convention.
REVOKE ALL ON audit_log FROM PUBLIC;
GRANT INSERT, SELECT ON audit_log TO operator_app;
GRANT SELECT ON audit_log TO auditor_app;
-- Even root_operator cannot delete or update; partition-drop after 7 years
-- is the only allowed mutation, and is performed by a separate retention job
-- that runs as a distinct database role with one privilege.

-- Retention: 7 years uniformly. The retention job runs monthly.
-- A partition older than 7 years is dropped. There is no SELECT-then-DELETE
-- path because there is no DELETE grant.
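The partition roll and the 7-year drop are one statement each; the year_month naming convention below is illustrative:

```sql
-- Run by the cron 7 days before the month boundary.
CREATE TABLE IF NOT EXISTS audit_log_2026_05
    PARTITION OF audit_log
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

-- Run monthly by the retention job, as the distinct role whose only
-- privilege is partition management.
DROP TABLE IF EXISTS audit_log_2019_04;
```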

The impersonation token flow (pseudocode)

POST /api/operator/impersonate
  body: { tenant_id, justification, ticket_ref }
  preconditions:
    actor.role == 'tenant_scoped_operator'
    actor.session.tenant_id == tenant_id    # operator already pinned to tenant
    valid_justification(justification)
    valid_ticket_ref(ticket_ref)
  side effects:
    issue impersonation_token with:
      operator_id   = actor.id
      tenant_id     = tenant_id
      read_only     = true
      cookie_fp     = sha256(actor.session.cookie)
      issued_at     = now()
      expires_at    = now() + 30m
      parent_sid    = actor.session.id
    audit_log.insert(action='impersonation.start', justification=justification,
                     resource_kind='tenant', resource_id=tenant_id)
  return: { token, expires_at }

GET /api/customer/* (during impersonation)
  preconditions:
    request.session has a parent operator session AND a valid impersonation_token
    impersonation_token.cookie_fp == sha256(request.session.cookie)
    impersonation_token.tenant_id == request.session.tenant_id
    impersonation_token.expires_at > now()
    impersonation_token.parent_sid is still alive
    request route is on the customer self-serve allowlist
  side effects:
    render via the customer self-serve renderer (not the operator renderer)
    audit_log.insert(actor_role='tenant_scoped_operator', tenant_id=...,
                     action='impersonation.read', impersonation_request_id=...)

POST /api/customer/* (during impersonation)
  preconditions:
    impersonation_token.read_only == false  # was upgraded by second approver
  side effects:
    audit_log.insert(... 'impersonation.write' ... second_approver_id=...)

POST /api/operator/impersonate/end
  side effects:
    revoke impersonation_token
    audit_log.insert(action='impersonation.end')

# Watchdog: any session-end event for the parent operator session triggers
# revocation of every active impersonation_token where parent_sid matches.
# In-process for same-node sessions; via Postgres trigger for cross-node.

The Article 17 self-serve workflow (pseudocode)

POST /api/customer/data/{server_slug}/delete
  preconditions:
    actor.role == 'customer_self_serve'
    actor.tenant_id is set
    actor.tenant_role == 'admin'
  side effects:
    insert article_17_request (
        tenant_id        = actor.tenant_id,
        server_slug      = server_slug,
        requested_by     = actor.id,
        requested_at     = now(),
        cooling_off_ends = now() + interval '7 days',
        status           = 'pending')
    audit_log.insert(action='article_17.create', justification='customer self-serve')
    send confirmation email to actor with cancel_url
  return: { request_id, cooling_off_ends }

POST /api/customer/data/{request_id}/cancel
  preconditions:
    article_17_request.requested_by == actor.id
    article_17_request.status == 'pending'
  side effects:
    update status = 'cancelled'
    audit_log.insert(action='article_17.cancel')

# Worker — runs every minute, picks up requests past their cooling-off boundary.
worker article_17_executor:
  for req in pending_requests where cooling_off_ends <= now():
      lock per-server advisory lock
      write tombstone in data_deletion_log (becomes the receipt PDF source)
      delete from probe_minute, probe_day, probe_month, suppression_clusters
              ... and verdict_minute Redis, alert-router Redis, read-side cache,
              and the tenant's export-bucket prefix, all in one transaction
      replace contributing-tenant set in suppression_clusters with salted-hash
              tombstone marker (cluster row survives, tenant identity replaced)
      update req.status = 'completed'
      audit_log.insert(actor_role='system', action='article_17.complete')
      send completion email with downloadable PDF receipt

Where this fits — the scale sub-series, complete

The scale sub-series so far has four posts. The collector walkthrough (post #10) built the write side — supervisor, workers, per-region queues, per-tenant secret store, verdict-minute coalescer. The alert routing walkthrough (post #11) built the paging side — sink-ownership verification, tenant-scoped configuration, cross-tenant suppression, per-tenant alert budgets, payload-shape boundaries. The shared-state archiver walkthrough (post #12) built the persistence side — Redis-to-Postgres ingestion, retention by tier, daily and monthly rollups, the GDPR-shaped delete path, the suppression-cluster log as a derived view. This post built the operator side — the four-layer admin permission model, the audit-log schema that outlives every retention cap, the customer self-serve surface as a strict subset of the operator surface, the tenant-impersonation primitive, the operator-vs-customer field cut, and the seven failure modes specific to operating one console on behalf of many tenants. Together they describe a multi-tenant MCP uptime stack that probes from many regions, alerts safely across many tenants, persists the canonical history in a shape that survives every retention cap and every Article 17 request, and is operated through one audited console with a self-serve surface for the customer's own staff.

The next deliverable is the Q3 2026 registry audit, landing mid-July 2026. The audit re-runs every probe from all five regions in parallel through the multi-tenant collector designed in post #10, with verdicts archived through the system designed in post #12, with cross-tenant suppression measured against the cluster log designed in post #11, and with operator actions during the audit window logged in the audit-log designed in this post. The audit will report bucket-by-bucket movement vs the Q2 baseline — including the new regionally degraded bucket the multi-region rollout from post #7 surfaces; whether the credentialed-probe rollout from post #6 shrinks the auth-walled 16.8% bucket as expected; whether the schema-drift detector from post #4 caught the same 7.1%/48h drift rate or a different one; how the cross-tenant alert-suppression rule from post #11 behaved on the registry-wide outages observed during the audit window; and the first end-to-end pass through the archiver designed in post #12 at registry scale. The next post after the audit, the first of a new sub-series, will be a hands-on guide to operating the four-layer permission model in an MCP-monitoring deployment with five-or-fewer staff — the smallest team size the model is calibrated for.

Want to be told before your MCP server dies silently?

AliveMCP probes every public MCP endpoint every 60 seconds, archives the verdict for as long as your tier specifies, surfaces the canonical history through an audited operator console, and gives your own staff a self-serve surface for alert sinks, retention preferences, and Article 17 requests — all from the same multi-tenant stack described across the four posts of this scale sub-series. Public servers are free; private servers start at $9/mo.

Join the waitlist