DevOps Tooling · 2026-07-04 · DevOps Tooling arc
Building MCP Tools for DevOps Platforms: The Three Patterns That Apply to CloudWatch, Jenkins, CircleCI, Vault, and ArgoCD
When you build your first MCP tool that touches a DevOps platform — a CloudWatch metrics query, a Jenkins build trigger, a Vault secret reader — three problems appear in the same sequence. You create an API client inside the tool handler and discover every invocation is re-authenticating. You get a 401 or a 403 mid-session and can't tell if it's a bad credential, an expired token, or a platform-specific session artifact. You wire a health check that confirms the MCP server process is alive and later find out your credentials expired six hours ago and every tool call since has been failing silently. By the second DevOps integration you recognize all three problems wearing a different platform's uniform. This synthesis covers five integrations — AWS CloudWatch, Jenkins, CircleCI, HashiCorp Vault, and ArgoCD — through the three patterns they all share, so you can identify and fix them before they show up in production.
TL;DR
Five different DevOps platforms, three shared patterns.
(1) The singleton client pattern: every DevOps integration has a client-creation overhead that makes creating clients inside tool handlers expensive or incorrect. The fix is always a singleton: dual CloudWatchClient + CloudWatchLogsClient instances for CloudWatch (the IAM credential chain resolves once, not per tool call), a pre-configured axios instance with Basic auth and CSRF crumb handling for Jenkins, an axios instance with the Circle-Token header for CircleCI, an axios instance with X-Vault-Token for Vault (no node-vault package dependency), and a factory function that returns a cached JWT with proactive refresh for ArgoCD. The common failure is creating or re-authenticating the client inside the tool handler, which adds 100–500ms per call and causes token exhaustion at scale.
(2) The credential lifecycle pattern: each DevOps platform has a different credential expiry model and a different failure signal when expiry happens. CloudWatch IAM session tokens expire and surface as ExpiredTokenException, which /health/cloudwatch reports before the calling agent runs into it. Jenkins CSRF crumbs are invalidated on every restart and surface as a 403 on POST mutations, handled by a retry-on-403 that fetches a fresh crumb and replays the request. CircleCI API tokens don't expire but can be revoked, which GET /me in /health/circleci detects proactively. Vault tokens have a TTL and the AppRole secret_id has a separate rotation schedule, handled by getValidToken() with a 30-second renewal threshold. ArgoCD JWTs expire at session end, handled by getArgoToken() with a 5-minute threshold and a two-consecutive-failure rule before alerting. The common failure is treating all 401s and 403s as equivalent authentication errors without understanding the platform-specific expiry model.
(3) The health transparency pattern: each DevOps platform has a different most-dangerous invisible failure mode — the one that lets the MCP server process keep running and returning 200 while all tool calls fail. CloudWatch IAM expiry passes process checks but makes all API calls return a 403 or ExpiredTokenException. Jenkins CSRF crumb staleness leaves the process healthy but breaks every POST mutation with a 403. CircleCI rate limit exhaustion returns 429 — not a credential error, not a downtime event — and can be confused with real failures in aggregate monitoring. Vault sealed state returns 503 from /v1/sys/health, but that 503 is a Vault-specific status code, not a network error. ArgoCD JWT expiry is transient — it causes a 401 on the health check endpoint, but only for a short window during token refresh — so alerting on a single 401 produces false positives. Wire AliveMCP to platform-specific health endpoints that probe these exact failure modes, not generic process health checks.
The five integrations at a glance
Before diving into each pattern, here's where the five DevOps platforms stand on the dimensions that matter most for MCP tool design:
| Integration | Singleton client | Credential expiry model | Invisible failure mode | Mutation safety |
|---|---|---|---|---|
| AWS CloudWatch | Dual CloudWatchClient + CloudWatchLogsClient | IAM session token TTL → ExpiredTokenException | IAM expiry: process up, all API calls fail | Alarms require explicit state check before mute |
| Jenkins | Pre-configured axios + CSRF crumb helper | CSRF crumb invalidated on restart → 403 on POST | Stale crumb: process up, all POST mutations fail | confirm: true guard on cancel_build |
| CircleCI | Axios with Circle-Token header | Token revocation → 401; rate limit → 429 | Rate exhaustion: looks like errors, not downtime | cancel_workflow is irreversible — confirm guard |
| HashiCorp Vault | Axios with X-Vault-Token; getValidToken() | Token TTL expiry + AppRole secret_id rotation | Sealed state: 503 from /v1/sys/health is Vault-specific | Secret reads return key names, not values by default |
| ArgoCD | JWT factory with proactive refresh; cached axios | JWT expiry at session end → 401 | JWT expiry: transient 401 on health endpoint | dry_run on sync_app; confirm guard on rollback |
All five divergences stem from the same root cause: MCP tools are short-lived request-response units called concurrently and with arguments that may come from an LLM, while DevOps platforms were designed for long-lived sessions, interactive CLI use, and trusted administrative clients. The friction points are all expressions of that mismatch — authentication state that was designed to last a session, rate limits designed for human pacing, and health semantics designed for operators, not monitors.
Pattern 1 — The singleton client pattern
DevOps platform clients are not cheap to create. Every client creation either makes a network round-trip (fetching IAM credentials from the metadata service, fetching an ArgoCD JWT, fetching a Jenkins CSRF crumb), carries initialization overhead (axios instance setup, header configuration), or establishes a session (Vault token validation). Creating the client inside the tool handler means paying that cost on every invocation — which for concurrent MCP tools can mean hundreds of parallel initializations under load. The correct pattern is a module-level singleton initialized at startup and reused by every tool handler.
CloudWatch: dual singleton clients
AWS CloudWatch splits across two API surface areas: the metrics and alarms API (CloudWatch namespace) and the logs API (CloudWatch Logs namespace). They use separate SDK clients: CloudWatchClient from @aws-sdk/client-cloudwatch and CloudWatchLogsClient from @aws-sdk/client-cloudwatch-logs. Each client resolves IAM credentials the first time it needs to sign a request, by calling the EC2 instance metadata service (or reading environment variables or the shared credentials file, depending on deployment), and then caches them. That resolution is not free:
import { CloudWatchClient } from '@aws-sdk/client-cloudwatch';
import { CloudWatchLogsClient } from '@aws-sdk/client-cloudwatch-logs';
// WRONG: a new client, and a new credential resolution, on every tool call
server.tool('get_metrics', schema, async ({ namespace, metricName }) => {
  const cw = new CloudWatchClient({ region: process.env.AWS_REGION });
  // ...
});
// RIGHT: create once at module load, reuse on every call
const cw = new CloudWatchClient({ region: process.env.AWS_REGION! });
const cwLogs = new CloudWatchLogsClient({ region: process.env.AWS_REGION! });
server.tool('get_metrics', schema, async ({ namespace, metricName }) => {
  // reuse cw: its credentials are resolved once and cached by the client
});
The credential chain also handles automatic refresh for instance profile credentials — the SDK refreshes them in the background before expiry. Creating a new client every invocation bypasses that automatic refresh mechanism and does a fresh metadata fetch instead, adding latency and potentially getting a pre-expiry credential snapshot rather than the most current one.
Jenkins: pre-configured axios with Basic auth
Jenkins authentication uses HTTP Basic auth with an API token (not a password — use the token generated under User → Configure → API Token). The credentials are static for the lifetime of the token, so the axios instance can be configured once with the Authorization header pre-wired:
import axios from 'axios';
// WRONG: configure auth on every tool call
server.tool('trigger_build', schema, async ({ jobName }) => {
  const http = axios.create({
    baseURL: process.env.JENKINS_URL,
    auth: { username: process.env.JENKINS_USER!, password: process.env.JENKINS_TOKEN! },
  });
  // ...
});
// RIGHT: singleton with auth pre-wired
const jenkins = axios.create({
  baseURL: process.env.JENKINS_URL!,
  auth: {
    username: process.env.JENKINS_USER!,
    password: process.env.JENKINS_TOKEN!,
  },
  timeout: 30_000,
});
The CSRF crumb is separate — it changes on every Jenkins restart — and must be fetched dynamically. But the axios instance that fetches the crumb and makes authenticated requests should itself be the singleton. The crumb is not part of the client construction; it's a request-level header obtained just before mutations. See the credential lifecycle section for the retry-on-403 pattern.
CircleCI: axios with Circle-Token header
CircleCI v2 API authentication uses a static Circle-Token header. Unlike OAuth or session tokens, CircleCI API tokens don't expire on a schedule — they're revocable by the user or organization admin, but not time-limited. This makes the singleton pattern straightforward:
const circleci = axios.create({
  baseURL: 'https://circleci.com/api/v2',
  headers: {
    'Circle-Token': process.env.CIRCLECI_TOKEN!,
    'Accept': 'application/json',
  },
  timeout: 30_000,
});
The project slug format (gh/org/repo for GitHub projects, bb/org/repo for Bitbucket) is worth pre-computing if your MCP server targets a fixed organization — it removes a string-join from every tool call. For multi-org servers, derive the slug from caller-provided arguments but validate the VCS prefix against an enum (gh or bb) before passing it to the API.
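The slug validation described above could look roughly like the sketch below. The names VCS_PREFIXES, isSafeSegment, and buildProjectSlug are hypothetical, and the character whitelist for org and repo names is an assumption rather than a CircleCI rule, so adjust it to the names your organization actually uses.

```typescript
// Hypothetical helper names; only the gh/bb prefixes come from the text above.
const VCS_PREFIXES = ['gh', 'bb'] as const;
type VcsPrefix = (typeof VCS_PREFIXES)[number];

function isVcsPrefix(value: string): value is VcsPrefix {
  return (VCS_PREFIXES as readonly string[]).includes(value);
}

// Assumption: org and repo names use only letters, digits, '.', '_' and '-'.
// '.' and '..' are rejected so a caller cannot smuggle in path traversal.
function isSafeSegment(segment: string): boolean {
  return /^[A-Za-z0-9._-]+$/.test(segment) && segment !== '.' && segment !== '..';
}

function buildProjectSlug(vcs: string, org: string, repo: string): string {
  if (!isVcsPrefix(vcs)) {
    throw new Error(`Unsupported VCS prefix: ${vcs}`);
  }
  if (!isSafeSegment(org) || !isSafeSegment(repo)) {
    throw new Error('org and repo must be plain name segments');
  }
  return `${vcs}/${org}/${repo}`;
}
```

Because the arguments may come from an LLM, rejecting anything outside the whitelist is cheaper than trying to escape it afterwards.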
Vault: direct axios, not a wrapper library
The common first instinct for HashiCorp Vault is to reach for node-vault or similar wrapper packages. The problem is that wrappers add a dependency that lags behind the Vault API version, often doesn't support Vault Enterprise namespaces, and introduces its own authentication state machine that conflicts with the proactive token refresh you need in an MCP server. Direct axios is simpler:
import axios from 'axios';
const vault = axios.create({
  baseURL: process.env.VAULT_ADDR!,
  headers: {
    'X-Vault-Token': process.env.VAULT_TOKEN!,
    ...(process.env.VAULT_NAMESPACE
      ? { 'X-Vault-Namespace': process.env.VAULT_NAMESPACE }
      : {}),
  },
  timeout: 10_000,
});
The X-Vault-Token header value will change when you implement AppRole token refresh (see Pattern 2). In that case, replace the static header with a helper that calls getValidToken() before each request — either via an axios request interceptor or by explicitly setting the header in each tool handler. The interceptor approach keeps token lifecycle logic out of tool handlers entirely:
vault.interceptors.request.use(async (config) => {
  config.headers['X-Vault-Token'] = await getValidToken();
  return config;
});
ArgoCD: JWT factory function
ArgoCD authentication requires a JWT obtained by posting credentials to /api/v1/session. JWTs expire at session end — the default is 24 hours, but administrators can set shorter TTLs. The singleton here is not the JWT itself (that changes on refresh) but the JWT factory function and the cached token it maintains:
let argoToken: string | null = null;
let argoTokenExpiry: number = 0;
async function getArgoToken(): Promise<string> {
  const fiveMinutes = 5 * 60 * 1000;
  if (argoToken && Date.now() < argoTokenExpiry - fiveMinutes) {
    return argoToken;
  }
  const res = await axios.post(`${process.env.ARGOCD_SERVER}/api/v1/session`, {
    username: process.env.ARGOCD_USER!,
    password: process.env.ARGOCD_PASS!,
  });
  argoToken = res.data.token;
  // Decode expiry from the JWT payload: the second dot-separated segment, base64-decoded JSON
  const payload = JSON.parse(Buffer.from(argoToken!.split('.')[1], 'base64').toString());
  argoTokenExpiry = payload.exp * 1000; // exp is in seconds
  return argoToken!;
}
// Every tool handler calls getArgoToken(), which hits the cache on warm calls
server.tool('get_app_status', schema, async ({ appName }) => {
  const token = await getArgoToken();
  // Encode the name: tool arguments may come from an LLM
  const res = await axios.get(
    `${process.env.ARGOCD_SERVER}/api/v1/applications/${encodeURIComponent(appName)}`,
    { headers: { Authorization: `Bearer ${token}` } }
  );
  // ...
});
The 5-minute pre-expiry refresh window means tool calls never see a JWT expiry mid-session under normal conditions. See Pattern 3 for handling the case where a restart puts you exactly at the expiry boundary.
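One gap a cached-token factory can leave open is concurrency: if several tool calls arrive while the cached token is inside the refresh window, each of them may post to /api/v1/session. The sketch below shows one possible way to collapse those into a single request. The names singleFlight, fakeLogin, and refreshToken are hypothetical, and fakeLogin stands in for the real session call.

```typescript
// Hypothetical helper: concurrent callers share one in-flight promise.
function singleFlight<T>(fetcher: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight) return inFlight;
    const pending = fetcher().finally(() => {
      inFlight = null; // allow a later refresh once this one settles
    });
    inFlight = pending;
    return pending;
  };
}

// Stand-in for the POST to /api/v1/session; counts how often it runs.
let loginCalls = 0;
const fakeLogin = async (): Promise<string> => {
  loginCalls += 1;
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  return `token-${loginCalls}`;
};

const refreshToken = singleFlight(fakeLogin);
```

Under this sketch, three simultaneous refreshToken() calls trigger one login and all receive the same token; a call made after that login settles starts a new one.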
Pattern 2 — The credential lifecycle pattern
Every DevOps platform has a different model for how credentials expire and what signal they emit when they do. The failure of treating all 401s and 403s as equivalent is that it conflates five distinct situations: an expired IAM session token (CloudWatch), a stale CSRF crumb (Jenkins), a revoked API token (CircleCI), an expired Vault token TTL (Vault), and an expired ArgoCD JWT (ArgoCD). Each requires a different recovery action. The credential lifecycle pattern means modeling the expiry mechanism for each platform explicitly, building automatic recovery into the client layer, and surfacing the expiry state proactively in the health endpoint rather than letting tool calls fail and propagate errors to the calling agent.
CloudWatch: IAM session token expiry
IAM session tokens — used when your MCP server runs with assumed roles, federated credentials, or EC2 instance profiles — have a TTL typically between 1 and 12 hours. When they expire, every CloudWatch API call returns an ExpiredTokenException (for assumed role credentials) or a 403 with an ExpiredToken error code (for non-refreshable credentials). The AWS SDK handles automatic refresh for instance profile credentials by polling the metadata service in the background — this happens transparently as long as you keep the singleton client alive. The problem is non-refreshable credentials (static access key + session token from sts:AssumeRole without automatic re-assumption): these expire without the SDK refreshing them.
The most robust approach for MCP servers that must run unattended is to configure them with IAM instance profile credentials (on EC2) or IRSA (on EKS) — credentials the SDK refreshes automatically. If you must use static assumed-role credentials, detect expiry in the health endpoint and alert before the token reaches zero:
import { ListMetricsCommand } from '@aws-sdk/client-cloudwatch';
// /health/cloudwatch: detect IAM expiry before tool calls fail
async function checkCloudWatchHealth() {
  try {
    // ListMetrics has no page-size parameter; the Namespace filter keeps the probe cheap
    await cw.send(new ListMetricsCommand({ Namespace: 'AWS/EC2' }));
    return { status: 'ok' };
  } catch (err: any) {
    // Accept both spellings of the expiry error described above
    if (err.name === 'ExpiredTokenException' || err.name === 'ExpiredToken') {
      return { status: 'error', reason: 'IAM session token expired: refresh credentials' };
    }
    if (err.name === 'AccessDeniedException') {
      return { status: 'error', reason: 'IAM permissions insufficient for ListMetrics' };
    }
    return { status: 'error', reason: err.message };
  }
}
Distinguish ExpiredTokenException from AccessDeniedException — they look similar in a generic error handler but require different responses: the first requires credential refresh, the second requires IAM policy changes.
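To keep that distinction out of individual handlers, the classification can live in one small function. The following is a sketch only: classifyAwsError and the remediation strings are hypothetical, and accepting the error names both with and without the "Exception" suffix is an assumption made because the name can differ between services and credential paths.

```typescript
type CredentialFailure = 'expired' | 'denied' | 'other';

// Assumption: accept both spellings rather than rely on one service's naming.
const EXPIRED_NAMES = new Set(['ExpiredToken', 'ExpiredTokenException']);
const DENIED_NAMES = new Set(['AccessDenied', 'AccessDeniedException']);

function classifyAwsError(err: unknown): CredentialFailure {
  const name =
    typeof err === 'object' && err !== null && 'name' in err
      ? String((err as { name: unknown }).name)
      : '';
  if (EXPIRED_NAMES.has(name)) return 'expired';
  if (DENIED_NAMES.has(name)) return 'denied';
  return 'other';
}

// Each class maps to a different remediation path.
const REMEDIATION: Record<CredentialFailure, string> = {
  expired: 'refresh credentials',
  denied: 'update the IAM policy',
  other: 'investigate connectivity or service errors',
};
```

A health endpoint or tool handler can then report REMEDIATION[classifyAwsError(err)] instead of a generic failure message.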
Jenkins: CSRF crumb invalidation
Jenkins requires a CSRF crumb header (Jenkins-Crumb) on every POST, PUT, and DELETE request. The crumb is obtained from /crumbIssuer/api/json and is tied to the session — when Jenkins restarts, all existing crumbs are invalidated, and any in-flight mutation attempt returns a 403 No valid crumb was included in the request. This is the most common mid-session failure for Jenkins MCP servers that run continuous operations during a deployment window that includes a Jenkins restart.
The correct response is not to cache the crumb indefinitely (that risks using a stale crumb) nor to fetch a new crumb before every mutation (that doubles the request count). The correct pattern is to cache the crumb and retry on 403 with a fresh crumb exactly once:
let cachedCrumb: string | null = null;
async function getCrumb(): Promise<string> {
  if (cachedCrumb) return cachedCrumb;
  const res = await jenkins.get('/crumbIssuer/api/json');
  cachedCrumb = res.data.crumb;
  return cachedCrumb!;
}
async function jenkinsPost(path: string, data?: unknown) {
  const crumb = await getCrumb();
  try {
    return await jenkins.post(path, data, {
      headers: { 'Jenkins-Crumb': crumb },
    });
  } catch (err: any) {
    if (err.response?.status === 403) {
      // Crumb may be stale: invalidate the cache, fetch a fresh one, retry once
      cachedCrumb = null;
      const freshCrumb = await getCrumb();
      return jenkins.post(path, data, {
        headers: { 'Jenkins-Crumb': freshCrumb },
      });
    }
    throw err;
  }
}
The retry is capped at one attempt. If the fresh crumb also fails, the error propagates — that means the problem is not crumb staleness but a permissions change or Jenkins configuration issue.
CircleCI: token revocation and rate limit exhaustion
CircleCI API tokens are statically issued and don't expire on a time schedule. The credential lifecycle event is revocation — an admin revokes the token, or the user rotates it as part of a security policy. Unlike time-expiry, revocation is unpredictable and the MCP server can't detect it proactively from the token itself. The health endpoint must probe the API with the current token and surface a revocation event as a clear error:
async function checkCircleCIHealth() {
  try {
    await circleci.get('/me');
    return { status: 'ok' };
  } catch (err: any) {
    if (err.response?.status === 401) {
      return { status: 'error', reason: 'CircleCI token is invalid or revoked: rotate CIRCLECI_TOKEN' };
    }
    if (err.response?.status === 429) {
      return { status: 'degraded', reason: 'CircleCI rate limit exhausted: 1000 req/min cap reached' };
    }
    return { status: 'error', reason: err.message };
  }
}
The 429 case is important to distinguish from a 401: rate limit exhaustion is not a credential problem and not a downtime event. It means the MCP server is making more than 1,000 requests per minute to the CircleCI API — an MCP server that triggers builds in a tight loop, or one that's called by an agentic loop without backoff, can hit this. Return a degraded status rather than an error status so upstream monitors don't page an on-call engineer for a rate limit event that will self-resolve in 60 seconds.
Vault: token TTL and AppRole rotation
HashiCorp Vault tokens have an explicit TTL — the default for tokens created by AppRole auth is 768 hours, but operators can configure shorter TTLs for compliance or security reasons. AppRole credentials add a second expiry dimension: the role_id is stable, but the secret_id can be configured with its own TTL and use-count limits. The getValidToken() function must handle both:
let vaultToken: string | null = null;
let vaultTokenExpiry: number = 0;
async function getValidToken(): Promise<string> {
  const thirtySeconds = 30_000;
  if (vaultToken && Date.now() < vaultTokenExpiry - thirtySeconds) {
    return vaultToken;
  }
  // Re-authenticate via AppRole
  const res = await axios.post(`${process.env.VAULT_ADDR}/v1/auth/approle/login`, {
    role_id: process.env.VAULT_ROLE_ID!,
    secret_id: process.env.VAULT_SECRET_ID!,
  });
  vaultToken = res.data.auth.client_token;
  const leaseDuration = res.data.auth.lease_duration; // seconds
  vaultTokenExpiry = Date.now() + leaseDuration * 1000;
  return vaultToken!;
}
The 30-second renewal threshold ensures the token is refreshed before the TTL expires — without it, a tool call that begins just before expiry will succeed with the cached token, but the token might expire before the Vault API call completes, especially under high latency conditions. The secret_id rotation problem is harder: if secret_id is configured with a TTL or use-count limit, getValidToken() will fail when it tries to re-authenticate. Build a dedicated /health/vault check that calls getValidToken() proactively — its failure will surface the AppRole misconfiguration before tool calls start failing:
async function checkVaultHealth() {
  try {
    // Throws if AppRole re-authentication fails (for example, an expired secret_id)
    await getValidToken();
    // Also check sealed state: 503 from /v1/sys/health means sealed
    const healthRes = await axios.get(`${process.env.VAULT_ADDR}/v1/sys/health`, {
      validateStatus: () => true, // don't throw on non-2xx; Vault uses 200/429/472/473/501/503
    });
    const statusMap: Record<number, string> = {
      200: 'active',
      429: 'standby',
      472: 'DR secondary',
      473: 'performance standby',
      501: 'uninitialized',
      503: 'sealed',
    };
    const vaultStatus = statusMap[healthRes.status] ?? `unknown (${healthRes.status})`;
    if (healthRes.status === 503) {
      return { status: 'error', reason: 'Vault is sealed: requires unseal operation' };
    }
    return { status: 'ok', vaultStatus };
  } catch (err: any) {
    return { status: 'error', reason: err.message };
  }
}
The Vault /v1/sys/health endpoint uses non-standard HTTP status codes to represent Vault internal state — 200 means active leader, 429 means standby, 503 means sealed. If you use a standard HTTP monitoring probe that treats non-200 as failure, you'll receive a false alert on every standby node and miss the actual sealed-state alert because 503 from Vault means something different than 503 from a crashed application server. Set validateStatus: () => true and interpret the status code explicitly.
ArgoCD: proactive JWT refresh with consecutive-failure threshold
ArgoCD JWTs expire when the ArgoCD session expires. The default is 24 hours, but shorter TTLs are common in security-conscious environments. The getArgoToken() function handles proactive refresh at 5 minutes before expiry (see Pattern 1). The remaining lifecycle concern is a JWT that becomes invalid before its recorded expiry, for example after an ArgoCD restart. getArgoToken() compares only the cached expiry timestamp, so it cannot see this on its own: it keeps returning the cached token until the pre-expiry threshold passes. To recover sooner, clear the cached token when a tool call receives a 401 so that the next getArgoToken() call re-authenticates and gets a new token. This works as long as ArgoCD is available; a restart that takes more than a few seconds to come back will cause tool call failures during the downtime window.
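Because a cache check based on the stored expiry time cannot notice a token that was invalidated early, one option is to react to the 401 itself. The sketch below assumes a hypothetical TokenSource wrapper around the token cache and reads the status the way axios errors expose it (err.response.status); withAuthRetry and statusOf are illustrative names, not part of any library.

```typescript
// Hypothetical wrapper around a cached-token factory such as getArgoToken().
interface TokenSource {
  get(): Promise<string>;
  invalidate(): void; // drop the cached token so the next get() re-authenticates
}

// Reads the HTTP status from an axios-style error; undefined for anything else.
function statusOf(err: unknown): number | undefined {
  return (err as { response?: { status?: number } } | null | undefined)?.response?.status;
}

// Run the call; on a 401, invalidate the token and retry exactly once.
async function withAuthRetry<T>(
  tokens: TokenSource,
  call: (token: string) => Promise<T>,
): Promise<T> {
  try {
    return await call(await tokens.get());
  } catch (err) {
    if (statusOf(err) !== 401) throw err;
    tokens.invalidate();
    return call(await tokens.get());
  }
}
```

The single retry mirrors the Jenkins crumb handling: if the fresh token is rejected too, the error propagates and the cause is something other than a stale token.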
The health transparency problem is that /api/v1/session/userinfo — the correct ArgoCD health probe endpoint — returns a 401 not just when the JWT is expired but also when ArgoCD is in the middle of returning from a restart and the session service hasn't warmed up yet. A single 401 from this endpoint is therefore ambiguous. The two-consecutive-failure rule resolves this:
let argoHealthFailures = 0;
async function checkArgoCDHealth() {
  try {
    const token = await getArgoToken();
    await axios.get(`${process.env.ARGOCD_SERVER}/api/v1/session/userinfo`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    argoHealthFailures = 0;
    return { status: 'ok' };
  } catch (err: any) {
    argoHealthFailures++;
    if (argoHealthFailures >= 2) {
      return { status: 'error', reason: `ArgoCD health check failed ${argoHealthFailures} consecutive times: ${err.message}` };
    }
    return { status: 'degraded', reason: 'ArgoCD health check failed once; may be transient' };
  }
}
Wire this to AliveMCP monitoring: a degraded status on the first failure does not trigger an alert; two consecutive failures escalate to an error and trigger a notification. This eliminates alert fatigue from restart-window 401s while still catching genuine JWT expiry or ArgoCD outages.
Pattern 3 — The health transparency pattern
A health endpoint that returns 200 whenever the MCP server process is running passes every monitoring check while all tool calls fail. The root cause is that DevOps platform integrations have failure modes that are invisible to process-level health checks: credential expiry, API-specific session state, platform-internal status (Vault sealed state), and soft failures like rate limit exhaustion. The health transparency pattern means building a platform-specific health probe for each integration that surfaces exactly the failure mode that's otherwise invisible — and returning a semantically correct response that lets monitoring systems distinguish recoverable degradation from hard failures.
CloudWatch: IAM expiry is the invisible failure
AWS CloudWatch's most common invisible failure for MCP servers is IAM credential expiry. The MCP server process stays running, the HTTP server continues to respond to health checks on port 3000 or whatever port you've configured, and from a process-monitoring perspective everything looks fine. Meanwhile, every call to GetMetricStatistics, FilterLogEvents, or DescribeAlarms returns an ExpiredTokenException that the tool handler surfaces to the calling agent as a tool error.
The correct /health/cloudwatch endpoint makes a real CloudWatch API call, specifically ListMetricsCommand with a tight namespace filter. ListMetrics takes no page-size parameter (MaxRecords belongs to calls such as DescribeAlarms), but the call is read-only, cheap, and doesn't require write permissions:
app.get('/health/cloudwatch', async (req, res) => {
  try {
    await cw.send(new ListMetricsCommand({
      Namespace: 'AWS/EC2',
    }));
    res.json({ status: 'ok' });
  } catch (err: any) {
    const reason = err.name === 'ExpiredTokenException'
      ? 'IAM session token expired'
      : err.name === 'AccessDeniedException'
        ? 'IAM policy lacks cloudwatch:ListMetrics'
        : err.message;
    res.status(503).json({ status: 'error', reason });
  }
});
Separate the error cases explicitly: ExpiredTokenException needs credential refresh (operational action), AccessDeniedException needs IAM policy changes (infrastructure action), and network errors need connectivity investigation. A generic "CloudWatch API call failed" message collapses three distinct remediation paths into one opaque error.
For Logs Insights queries, add a second sub-check that calls StartQueryCommand with a 1-second window on a low-traffic log group and immediately cancels it — this verifies that the CloudWatchLogsClient credentials are also healthy, since a dual-client setup can have one client's credentials expire while the other's remain valid if they were initialized at different times or use different IAM roles.
Jenkins: stale CSRF crumb is the invisible failure
Jenkins's most dangerous invisible failure is a stale CSRF crumb. After a Jenkins restart, the CSRF crumb changes. If the MCP server cached the pre-restart crumb, every subsequent POST mutation — triggering builds, canceling jobs, any write operation — will fail with a 403. The Jenkins process is running, the Jenkins UI works fine, and a naive health check that does a GET request to /api/json will return 200.
The correct Jenkins health check must verify that mutation operations are possible, not just that reads work:
app.get('/health/jenkins', async (req, res) => {
  try {
    // Verify read access
    await jenkins.get('/api/json');
    // Verify crumb endpoint is reachable (required for any POST)
    cachedCrumb = null; // force fresh fetch to detect crumb invalidation
    await getCrumb();
    res.json({ status: 'ok' });
  } catch (err: any) {
    if (err.response?.status === 403) {
      res.status(503).json({ status: 'error', reason: 'CSRF crumb fetch returned 403: Jenkins restart may have invalidated session' });
    } else if (err.response?.status === 401) {
      res.status(503).json({ status: 'error', reason: 'Jenkins authentication failed: check JENKINS_USER and JENKINS_TOKEN' });
    } else {
      res.status(503).json({ status: 'error', reason: err.message });
    }
  }
});
Setting cachedCrumb = null before the health check forces a fresh crumb fetch from the CSRF endpoint. This ensures the health check detects a stale crumb condition — if getCrumb() fails or returns a 403, the health check propagates that as a 503. If it succeeds, the crumb is refreshed in the cache as a side effect, which means the next mutation after a successful health check will use the fresh crumb without an additional network round-trip.
CircleCI: rate exhaustion looks like errors
CircleCI's invisible failure mode is subtler than credential expiry — it's the 429 rate limit, which at the API call level looks like a hard error but at the operational level is a soft, self-resolving condition. An MCP server that's being called in a tight loop — for example, an agentic orchestration that polls build status every few seconds across dozens of pipelines — can exhaust the 1,000 requests-per-minute cap and start receiving 429 responses on every call. From the calling agent's perspective, every tool call is failing. From the infrastructure perspective, nothing is wrong.
The distinction matters because the remediation is completely different: for a real error (401 invalid token, 404 project not found, 500 CircleCI API error) the response is human investigation; for a 429 the response is adding backoff and jitter to the polling logic. A health endpoint that conflates them creates unnecessary alert noise:
app.get('/health/circleci', async (req, res) => {
  try {
    await circleci.get('/me');
    res.json({ status: 'ok' });
  } catch (err: any) {
    const status = err.response?.status;
    if (status === 401) {
      res.status(503).json({ status: 'error', reason: 'CircleCI token is invalid or revoked' });
    } else if (status === 429) {
      res.status(200).json({ status: 'degraded', reason: 'CircleCI rate limit active: 1000 req/min cap reached, tool calls will retry with backoff' });
    } else if (status >= 500) {
      res.status(503).json({ status: 'error', reason: `CircleCI API error: ${status}` });
    } else {
      res.status(503).json({ status: 'error', reason: err.message });
    }
  }
});
Return HTTP 200 with a degraded JSON status for 429s — this prevents your uptime monitor from recording a downtime event for a rate limit window. Return HTTP 503 only for genuine failure states: invalid tokens, API server errors. Wire your health check endpoint to AliveMCP with distinct alert thresholds for error vs degraded states.
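The backoff and jitter mentioned above can be sketched as a small delay calculator. This is a minimal illustration using the "full jitter" variant of exponential backoff; backoffDelayMs and its default base and cap values are assumptions, not CircleCI recommendations, and the random source is injectable so the behavior can be checked deterministically.

```typescript
// Hypothetical helper: exponential backoff with full jitter.
// attempt is zero-based; the delay is drawn uniformly from [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Example: sleep between polling attempts after a 429.
async function sleepForAttempt(attempt: number): Promise<void> {
  const delay = backoffDelayMs(attempt);
  await new Promise<void>((resolve) => setTimeout(resolve, delay));
}
```

Drawing the delay at random, rather than always waiting the full ceiling, keeps many concurrent tool calls from retrying in lockstep when the rate limit window reopens.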
Vault: sealed state uses non-standard HTTP semantics
HashiCorp Vault uses HTTP status codes in a non-standard way on its /v1/sys/health endpoint. In most HTTP applications, 503 means "service unavailable — the server cannot handle the request." In Vault's health endpoint, 503 means specifically "Vault is sealed" — a Vault-specific operational state where the decryption keys have been cleared from memory and all secret operations are blocked until an unseal quorum provides the key shares. A standard uptime monitor that treats any non-200 as a downtime event will produce false alerts on every Vault standby node (which returns 429) and will correctly alert on sealed state but with a confusing generic "503 Service Unavailable" message rather than "Vault is sealed."
The full Vault /v1/sys/health status code table:
| HTTP Status | Vault State | Operational meaning |
|---|---|---|
| 200 | Active | Active leader; all operations available |
| 429 | Standby | Unsealed standby node; requests are forwarded to the active node |
| 472 | DR Secondary | Disaster Recovery secondary; does not serve client requests until promoted |
| 473 | Performance Standby | Performance standby; reads available |
| 501 | Uninitialized | Fresh Vault instance; needs vault operator init |
| 503 | Sealed | Sealed; needs unseal before any operations |
Probe this endpoint with validateStatus: () => true (axios) or the equivalent in your HTTP client, parse the status code explicitly against this table, and return a semantically correct health response. For an MCP server that only needs to read secrets (not write them), a 429 standby response is acceptable: the standby forwards requests to the active node, so reads still succeed. For a server that creates or rotates secrets, 429 likewise means operations are forwarded to the active node with additional latency, which may affect tool call timeouts.
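One way to express that interpretation is a pure function from the status code, plus whether the server needs writes, to a health result. This is a sketch under assumptions: interpretVaultHealth is a hypothetical name, and treating 472 and 501 as errors and a standby as degraded for write-heavy servers are policy choices made for illustration, not Vault requirements.

```typescript
type HealthStatus = 'ok' | 'degraded' | 'error';

interface VaultHealth {
  status: HealthStatus;
  vaultState: string;
}

// Map a /v1/sys/health status code to a health result.
// needsWrites: true if the MCP server creates or rotates secrets.
function interpretVaultHealth(httpStatus: number, needsWrites: boolean): VaultHealth {
  switch (httpStatus) {
    case 200:
      return { status: 'ok', vaultState: 'active' };
    case 429:
      // Assumption: forwarding latency only matters for write-heavy servers.
      return { status: needsWrites ? 'degraded' : 'ok', vaultState: 'standby' };
    case 473:
      return { status: needsWrites ? 'degraded' : 'ok', vaultState: 'performance standby' };
    case 472:
      return { status: 'error', vaultState: 'DR secondary' };
    case 501:
      return { status: 'error', vaultState: 'uninitialized' };
    case 503:
      return { status: 'error', vaultState: 'sealed' };
    default:
      return { status: 'error', vaultState: `unknown (${httpStatus})` };
  }
}
```

Keeping the mapping free of HTTP calls makes it easy to unit test and to reuse from both the health endpoint and any startup check.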
ArgoCD: JWT expiry creates transient 401s on the health endpoint
ArgoCD's invisible failure is the most nuanced: the /api/v1/session/userinfo health probe endpoint returns 401 not only when the JWT has genuinely expired (hard failure) but also when ArgoCD has just restarted and its session service is initializing (transient failure) and when the MCP server's JWT is at the pre-expiry boundary and getArgoToken() is in the middle of fetching a new one (race condition). A single-check health endpoint that pages on any 401 produces false alerts that erode trust in the monitoring system.
The two-consecutive-failure pattern (see Pattern 2) resolves this, but the health check itself also needs to be written correctly — it should always try to get a fresh token via getArgoToken() before calling the userinfo endpoint, because a cached expired token will produce a guaranteed 401 that the retry-on-stale-token path in the factory function would have fixed:
app.get('/health/argocd', async (req, res) => {
  try {
    // Always call getArgoToken(): it handles caching and proactive refresh
    const token = await getArgoToken();
    await axios.get(`${process.env.ARGOCD_SERVER}/api/v1/session/userinfo`, {
      headers: { Authorization: `Bearer ${token}` },
    });
    argoHealthFailures = 0;
    res.json({ status: 'ok' });
  } catch (err: any) {
    argoHealthFailures++;
    const httpStatus = argoHealthFailures >= 2 ? 503 : 200;
    const status = argoHealthFailures >= 2 ? 'error' : 'degraded';
    res.status(httpStatus).json({
      status,
      failureCount: argoHealthFailures,
      reason: err.message,
    });
  }
});
The argoHealthFailures counter resets to 0 on success, so a transient 401 during a restart window increments to 1 (degraded, no alert), recovers on the next health check, and resets — no page sent. A genuine JWT expiry that getArgoToken() also fails to recover from (because ArgoCD itself is unreachable) increments to 2 (error, page sent) and stays at 2 or higher until the service recovers.
Combine ArgoCD's health endpoint with monitoring for the two independent state machines in ArgoCD application state — health (Healthy/Progressing/Degraded/Missing/Unknown) and sync (Synced/OutOfSync/Unknown). A /health/argocd endpoint that also queries the health and sync status of your critical applications surfaces application-level degradation that's invisible to the JWT-only health check. See the ArgoCD MCP integration guide for the combined state table.
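One way to fold the two state machines into a single probe status is sketched below. The mapping is an assumption for illustration, not the combined state table from the ArgoCD guide: Degraded or Missing health is treated as an error, Healthy plus Synced as ok, and every other combination as degraded. The names classifyApp, summarizeApps, and AppState are hypothetical.

```typescript
type AppHealth = "Healthy" | "Progressing" | "Degraded" | "Missing" | "Unknown";
type AppSync = "Synced" | "OutOfSync" | "Unknown";
type AppProbeStatus = "ok" | "degraded" | "error";

interface AppState {
  name: string;
  health: AppHealth;
  sync: AppSync;
}

// Assumed mapping: health failures dominate; anything short of Healthy + Synced is degraded.
function classifyApp(app: AppState): AppProbeStatus {
  if (app.health === "Degraded" || app.health === "Missing") return "error";
  if (app.health === "Healthy" && app.sync === "Synced") return "ok";
  return "degraded";
}

// Report the worst status across the critical applications and name the offenders.
function summarizeApps(apps: AppState[]): { status: AppProbeStatus; reason: string } {
  const rank: Record<AppProbeStatus, number> = { ok: 0, degraded: 1, error: 2 };
  let worst: AppProbeStatus = "ok";
  const notOk: string[] = [];
  for (const app of apps) {
    const status = classifyApp(app);
    if (status !== "ok") notOk.push(`${app.name} (${app.health}/${app.sync})`);
    if (rank[status] > rank[worst]) worst = status;
  }
  const reason = notOk.length > 0 ? notOk.join(", ") : "all critical applications Healthy and Synced";
  return { status: worst, reason };
}
```

The reason string names each application that is not ok, so the alert identifies the failing application without a second query.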
Where the integrations diverge
The three patterns apply uniformly across all five platforms. Where the integrations diverge is on mutation safety conventions, rate limit characteristics, error shapes, and the async operation patterns that don't apply to synchronous CRUD APIs.
Mutation safety conventions
Each platform has different conventions for protecting destructive or irreversible operations from accidental tool calls by an LLM agent:
Jenkins — the highest-risk mutations are cancel_build (stops an in-progress build, which may roll back a deployment in progress) and delete_job (destroys the job configuration). Add a confirm: z.literal(true) parameter that the caller must explicitly pass. An LLM agent that calls cancel_build({ jobName: 'prod-deploy', buildNumber: 42 }) without confirm: true should receive a tool error, not a cancellation.
CircleCI: cancel_workflow is irreversible. Canceled workflows don't resume from the canceled step; they must be re-triggered from the beginning. Apply the same confirm guard pattern. Also validate that the caller-provided workflow ID actually belongs to a project the MCP server is authorized for. CircleCI API tokens are scoped to the token owner, but it's still good practice to check pipeline.project.slug on the fetched workflow before canceling.
Vault — the mutation convention is different: the read operations (read_vault_secret) should return key names by default, not values. This is a defense-in-depth measure — an LLM agent scanning secrets should see "this secret has keys: DB_PASSWORD, API_KEY" without automatically retrieving the values. The caller must explicitly specify which key to retrieve:
server.tool('read_vault_secret', {
  path: z.string(),
  key: z.string().optional(), // if omitted, return key names only
}, async ({ path, key }) => {
  const res = await vault.get(`/v1/${path}`);
  // KV v2 nests the secret under data.data; fall back to an empty object
  const data = res.data.data?.data ?? {};
  if (!key) {
    return { content: [{ type: 'text', text: `Keys at ${path}: ${Object.keys(data).join(', ')}` }] };
  }
  if (!(key in data)) {
    throw new McpError(ErrorCode.InvalidParams, `Key '${key}' not found at path '${path}'`);
  }
  // Secret values are not guaranteed to be strings, but MCP text content must be one
  return { content: [{ type: 'text', text: String(data[key]) }] };
});
ArgoCD: sync_app should expose a dry_run: boolean parameter that triggers ArgoCD's built-in dry-run mode (which shows what would be applied without applying it). Default dry_run to true so that a real sync requires the caller to pass dry_run: false explicitly; don't default to false silently. The rollback_app tool is the highest-risk operation: rolling back a running application to a previous history revision is immediate and can disrupt active traffic. Add a confirm: z.literal(true) guard and include the history revision ID in the confirmation to prevent accidental rollbacks to the wrong version.
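The confirm guard with a revision check can be sketched as a plain function. In a real MCP tool the z.literal(true) schema would reject a missing confirm before the handler runs; this dependency-free version only illustrates the logic. The RollbackRequest shape and the confirmRevisionId field are hypothetical names, not part of any ArgoCD or MCP API.

```typescript
interface RollbackRequest {
  appName: string;
  revisionId: number;          // history revision to roll back to
  confirm?: boolean;           // must be exactly true
  confirmRevisionId?: number;  // hypothetical: caller repeats the revision to confirm it
}

// Throws unless the caller explicitly confirmed this exact rollback target.
function assertRollbackConfirmed(req: RollbackRequest): void {
  if (req.confirm !== true) {
    throw new Error(`rollback_app for '${req.appName}' requires confirm: true`);
  }
  if (req.confirmRevisionId !== req.revisionId) {
    throw new Error(
      `rollback_app for '${req.appName}' requires confirmRevisionId to match revisionId ${req.revisionId}`,
    );
  }
}
```

Repeating the revision ID means a caller that confirmed a rollback to one revision cannot reuse that confirmation for another. The same shape without the revision field would cover cancel_build, delete_job, and cancel_workflow.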
Async operation patterns
CloudWatch Logs Insights queries and Jenkins build triggers are both async operations — the API returns immediately with a reference ID, and you must poll for the result. This pattern doesn't apply to the other three integrations (CircleCI, Vault, ArgoCD surface their async operations differently), so it's worth addressing explicitly for the two that need it.
CloudWatch Logs Insights: StartQueryCommand returns a queryId immediately and the query runs asynchronously. Poll GetQueryResultsCommand until status reaches a terminal value (Complete, Failed, Cancelled, or Timeout):
async function runLogsInsightsQuery(logGroupName: string, queryString: string, startTime: number, endTime: number) {
  // startTime and endTime are epoch seconds, as StartQueryCommand expects
  const { queryId } = await cwLogs.send(new StartQueryCommand({
    logGroupName,
    queryString,
    startTime,
    endTime,
  }));
  const deadline = Date.now() + 30_000; // 30s client-side timeout
  while (Date.now() < deadline) {
    await new Promise(r => setTimeout(r, 1500));
    const result = await cwLogs.send(new GetQueryResultsCommand({ queryId }));
    if (result.status === 'Complete') return result.results;
    if (result.status === 'Failed') throw new Error('Logs Insights query failed');
    if (result.status === 'Cancelled') throw new Error('Logs Insights query was cancelled');
    if (result.status === 'Timeout') throw new Error('Logs Insights query timed out on the service side');
  }
  // Stop the query before throwing to avoid orphaned queries consuming quota
  await cwLogs.send(new StopQueryCommand({ queryId }));
  throw new Error('Logs Insights query timed out after 30s');
}
Jenkins build trigger — POST /job/:name/build returns a 201 with a Location header pointing to the queue item (e.g., /queue/item/123/). Poll that queue item until executable.number appears, which is the build number you can then pass to get_build_status:
async function triggerBuild(jobName: string, params?: Record<string, string>) {
  const endpoint = params ? `/job/${jobName}/buildWithParameters` : `/job/${jobName}/build`;
  const res = await jenkinsPost(endpoint, params ? new URLSearchParams(params) : undefined);
  const queueUrl = res.headers['location']; // e.g. http://jenkins.example.com/queue/item/456/
  if (!queueUrl) {
    throw new Error('Jenkins did not return a Location header for the queued build');
  }
  // The queue item path ends with a trailing slash, so 'api/json' appends cleanly
  const queuePath = new URL(queueUrl).pathname + 'api/json';
  const deadline = Date.now() + 60_000;
  while (Date.now() < deadline) {
    await new Promise(r => setTimeout(r, 2_000));
    const queue = await jenkins.get(queuePath);
    if (queue.data.executable?.number) {
      return { buildNumber: queue.data.executable.number };
    }
    if (queue.data.cancelled) {
      throw new Error('Build was cancelled before starting; check the Jenkins queue');
    }
  }
  throw new Error('Build did not start within 60s; the Jenkins queue may be backed up');
}
Rate limits
Each platform has a different rate limit profile that affects polling strategies:
- CloudWatch metrics API: 400 requests per second globally. GetMetricStatistics also has a per-metric throttle. Aggregate metric queries using GetMetricData (batch up to 500 metrics per call) rather than multiple GetMetricStatistics calls when querying many metrics.
- CloudWatch Logs: 10 requests per second per log group for FilterLogEvents. For Logs Insights, limit concurrent queries to 10 per account.
- Jenkins: no API-level rate limit, but Jenkins is CPU-bound during heavy builds. Avoid polling build status faster than once per 2 seconds to prevent adding load during a build.
- CircleCI v2: 1,000 requests per minute per token. For status polling, use 5-second minimum intervals and implement exponential backoff on 429 responses.
- HashiCorp Vault: no documented rate limit, but Vault's performance degrades under high concurrent requests on a single node. The AppRole re-authentication should be throttled: never call getValidToken() from concurrent tool handlers without a mutex, or you'll generate multiple simultaneous AppRole logins.
- ArgoCD: no documented rate limit. Be conservative with list_apps calls that fetch full application state, because the response payload can be large on clusters with many applications.
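The mutex on getValidToken() mentioned for Vault can be approximated with a shared in-flight promise, so that concurrent tool handlers wait on one login instead of each starting their own. This is a minimal sketch under stated assumptions: createTokenProvider, the Token shape, and the injected login function are hypothetical, and the 30-second default mirrors the renewal threshold described earlier in this article.

```typescript
interface Token {
  value: string;
  expiresAt: number; // epoch milliseconds
}

// login is a caller-supplied function (hypothetical) that performs the AppRole login.
function createTokenProvider(
  login: () => Promise<Token>,
  renewThresholdMs: number = 30_000,
  now: () => number = Date.now,
): () => Promise<string> {
  let cached: Token | undefined;
  let inFlight: Promise<string> | undefined;

  return function getValidToken(): Promise<string> {
    // Fast path: the cached token is still outside the renewal window.
    if (cached !== undefined && cached.expiresAt - now() > renewThresholdMs) {
      return Promise.resolve(cached.value);
    }
    // Single-flight: every concurrent caller shares the same pending login.
    if (inFlight === undefined) {
      inFlight = login().then(
        (token) => {
          cached = token;
          inFlight = undefined;
          return token.value;
        },
        (err) => {
          inFlight = undefined; // allow a later call to retry after a failed login
          throw err;
        },
      );
    }
    return inFlight;
  };
}
```

Because the function is not declared async, concurrent callers receive the same promise object, which makes the single-login guarantee easy to verify in a unit test.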
Error shapes
Error handling cannot be abstracted across these five integrations because their error representations are entirely different:
- CloudWatch: AWS SDK throws named exception classes: ExpiredTokenException, AccessDeniedException, ThrottlingException, InvalidParameterValueException. Access via err.name for the class name.
- Jenkins: HTTP status codes with HTML error bodies. A 404 means the job doesn't exist; a 403 means CSRF crumb is stale or auth failed; a 500 means Jenkins threw an exception during API handling. Parse the body for details when status is 500.
- CircleCI: HTTP status codes with JSON error bodies. The body is { "message": "..." }. The 422 "Unprocessable Entity" status often means the project slug format is wrong.
- Vault: HTTP status codes with JSON error arrays. The body is { "errors": ["..."] }. Multiple errors can appear in the array (e.g., "1 error occurred: * permission denied"). The 404 status means the path doesn't exist or the token lacks permission to list it. Vault intentionally conflates these two cases to prevent enumeration attacks.
- ArgoCD: HTTP status codes with JSON error bodies following gRPC-gateway format: { "error": "...", "message": "...", "code": N }. The code field is a gRPC status code, not an HTTP status code, so don't confuse the two.
Implement per-integration error handlers that translate each platform's error representation into an McpError with a human-readable message. The calling agent receives the McpError message; make sure it contains enough information to distinguish "the job doesn't exist" from "you don't have permission to see it" — even if the underlying API intentionally conflates them, your tool can give better guidance based on context (e.g., whether the path was user-provided or hard-coded in config).
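A per-integration translator for the four HTTP-based platforms might look like the sketch below. It only builds the human-readable message; wrapping that message in an McpError is left to the tool handler. The function names and message wording are hypothetical, and CloudWatch is omitted because its errors arrive as SDK exceptions identified by err.name rather than as HTTP bodies.

```typescript
type HttpPlatform = "jenkins" | "circleci" | "vault" | "argocd";

// Safely read one field from a response body of unknown shape.
function field(body: unknown, key: string): unknown {
  if (typeof body !== "object" || body === null) return undefined;
  return (body as Record<string, unknown>)[key];
}

function describePlatformError(platform: HttpPlatform, httpStatus: number, body: unknown): string {
  switch (platform) {
    case "jenkins":
      // Jenkins bodies are HTML, so the status code carries the meaning.
      if (httpStatus === 404) return "Jenkins: job does not exist";
      if (httpStatus === 403) return "Jenkins: CSRF crumb is stale or authentication failed";
      return `Jenkins: HTTP ${httpStatus}`;
    case "circleci": {
      const message = field(body, "message");
      const text = typeof message === "string" ? message : "no message in body";
      return `CircleCI: HTTP ${httpStatus}: ${text}`;
    }
    case "vault": {
      const errors = field(body, "errors");
      const text = Array.isArray(errors) ? errors.map(String).join("; ") : "no errors array in body";
      // A 404 is ambiguous by design, so say so instead of guessing.
      const hint = httpStatus === 404 ? " (path missing, or the token may lack permission)" : "";
      return `Vault: HTTP ${httpStatus}: ${text}${hint}`;
    }
    case "argocd": {
      const message = field(body, "message");
      const code = field(body, "code");
      const text = typeof message === "string" ? message : "no message in body";
      // The code field is a gRPC status code, reported separately from the HTTP status.
      const grpc = typeof code === "number" ? ` (gRPC code ${code})` : "";
      return `ArgoCD: HTTP ${httpStatus}: ${text}${grpc}`;
    }
  }
}
```

Keeping the translation in one pure function per server makes it straightforward to unit-test the message for each status code before it ever reaches an agent.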
Putting it together: the reference checklist
The following checklist covers all three patterns across all five integrations. Use it when reviewing a DevOps platform MCP server before deployment:
| Check | CloudWatch | Jenkins | CircleCI | Vault | ArgoCD |
|---|---|---|---|---|---|
| Singleton client at module level? | Dual CWClient + CWLogsClient | Singleton axios instance | Singleton axios instance | Singleton axios + interceptor | JWT factory + cached axios |
| Credential expiry handled? | IAM auto-refresh (instance profile) or health-check expiry detection | Retry-on-403 for stale CSRF crumb | Proactive check via GET /me in health | getValidToken() with 30s threshold | getArgoToken() with 5min threshold |
| Platform-specific health probe? | ListMetricsCommand — catches IAM expiry | CSRF crumb fetch — catches stale crumb | GET /me — catches token revocation and rate limit | /v1/sys/health with full status-code parse | /session/userinfo with 2-failure threshold |
| Mutation safety? | Alarm mute: check current state first | confirm: true on cancel/delete | confirm: true on cancel_workflow | Read keys-only by default | dry_run on sync; confirm: true on rollback |
| Rate limit handling? | 400 req/s metrics; 10 req/s per log group | 2s minimum poll interval | 429 → degraded (not error); 5s minimum poll | Mutex on getValidToken() | Conservative list_apps polling |
Why these patterns don't appear in REST API wrapper guides
You won't find the singleton client pattern, the credential lifecycle pattern, or the health transparency pattern in most "how to call the Jenkins API" tutorials. That's because those tutorials are written for human-driven CLI scripts and one-shot integrations — contexts where the client is created once per process run, credentials are validated interactively before the script starts, and the health semantics of the target service are someone else's problem.
MCP tools invert all of these assumptions. The tool handler is called repeatedly and concurrently across the lifetime of a long-running server process. Credentials must be valid at every invocation, not just at startup. The health of the integration is the MCP server's responsibility to surface, because the calling agent has no other way to know whether a tool failure is transient (rate limit, brief restart) or permanent (credential expiry, sealed Vault). Every friction point in DevOps platform MCP integrations is a version of that inversion.
Recognizing the three patterns — singleton client, credential lifecycle, health transparency — as a unit means you can apply all three correctly to the next DevOps integration you build, even before you've seen it fail. The GitHub Actions, Kubernetes, and Datadog integrations all have the same three patterns in different uniforms. The health endpoint for GitHub Actions needs to probe the Actions API for rate limit headroom, not just check that the process is running. The Kubernetes integration needs a singleton client that uses the in-cluster service account token refresh mechanism, not a static token. The Datadog integration needs to distinguish API key errors (401) from application key errors (403) — two different credential types with different expiry and rotation policies.
The pattern doesn't change. The platform-specific details do. See the individual integration guides for CloudWatch, Jenkins, CircleCI, Vault, and ArgoCD for the full implementation details behind each entry in this synthesis.
For all five DevOps platform MCP servers: wire your platform-specific health endpoint to AliveMCP. A process health check alone misses IAM expiry, stale CSRF crumbs, rate limit exhaustion, sealed Vault state, and transient JWT 401s — every failure mode in the table above. Each endpoint should return a structured JSON response with a status field (ok, degraded, error) and a reason field that tells you exactly which failure mode was detected, so that when AliveMCP fires the alert you know the remediation before you open the terminal. See the MCP server health check guide and MCP server error handling guide for the full framework.
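The structured response described above can be kept consistent across all five endpoints with one small helper. This is a sketch with hypothetical names (ProbeStatus, buildProbeResponse); the choice to return HTTP 200 for ok and degraded and 503 for error follows the /health/argocd handler shown earlier and is an assumption for the other platforms.

```typescript
type ProbeStatus = "ok" | "degraded" | "error";

interface ProbeResponse {
  httpStatus: number;
  body: { status: ProbeStatus; reason: string };
}

// Only a confirmed error maps to a non-200 status, so degraded states do not page.
function buildProbeResponse(status: ProbeStatus, reason: string): ProbeResponse {
  return {
    httpStatus: status === "error" ? 503 : 200,
    body: { status, reason },
  };
}
```

For example, buildProbeResponse("error", "Vault is sealed") yields a 503 whose body names the failure mode directly.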