MCP server JWKS key rotation
JWKS key rotation is how authorization servers replace their JWT signing keys without permanently breaking clients that hold tokens signed by the old key. The critical insight for MCP servers is that a rotation that removes the old key immediately will invalidate every in-flight MCP session: once the MCP server re-fetches the JWKS, the old key is gone, and each session's existing token fails verification on its next request to /mcp. The solution is a grace period: publish the new key alongside the old key for at least as long as your longest token TTL or session duration, and remove the old key only after all tokens signed by it have expired. This guide covers the rotation mechanics, the grace period strategy, and how to monitor rotation events with AliveMCP.
TL;DR
When rotating: add the new key to JWKS first (do not remove the old key). Start signing new tokens with the new key. Keep the old key in JWKS for at least max(token_ttl, max_session_duration) — typically 24 hours for short-lived tokens, 7 days for long-lived sessions. Only remove the old key after that window. The kid field in the JWT header tells your MCP server which key to use for verification — jose's createRemoteJWKSet handles kid-based key selection automatically.
Why rotation breaks MCP sessions
MCP sessions are long-lived. A user might authenticate, receive a JWT, and then use that session for an hour or more. If the authorization server rotates signing keys mid-session and removes the old key from the JWKS endpoint, the next JWKS cache miss on the MCP server will fetch a JWKS that does not contain the key that signed the user's token. Validation fails because no key matches the token's kid (jose reports this as ERR_JWKS_NO_MATCHING_KEY rather than a bad signature), and the session receives a 401.
This is worse than a normal token expiry because:
- The token is still within its TTL — it has not expired
- The client cannot refresh its way out — it needs to re-authenticate from scratch
- The failure is silent until the next JWKS re-fetch, creating an unpredictable delay between the rotation event and the 401 spike
- All active sessions are affected simultaneously, not gradually as tokens expire
The MCP session model makes this worse than equivalent REST API failures because a REST client can immediately retry with a fresh token, whereas an MCP client must tear down the session, start a new initialize handshake, and rebuild all session state.
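To make the delay between a rotation and the first 401 concrete, here is a minimal sketch. It assumes the MCP server keeps using its cached JWKS until a fixed cache lifetime elapses and that requests arrive steadily. The function name and parameters are hypothetical and not part of any library.

```typescript
// Hypothetical helper: estimates the earliest moment at which tokens signed
// by a removed key start failing on one MCP server instance.
// Assumption: the verifier serves its cached JWKS until cacheMaxAgeMs elapses.
function firstFailureAt(
  lastFetchMs: number,   // when this instance last fetched the JWKS
  cacheMaxAgeMs: number, // how long a fetched JWKS is treated as fresh
  keyRemovedMs: number,  // when the old key was removed from the endpoint
): number {
  const cacheExpiresMs = lastFetchMs + cacheMaxAgeMs;
  // Failures cannot start before the key is actually removed, and they
  // cannot start before the cached copy (which still has the key) expires.
  return Math.max(cacheExpiresMs, keyRemovedMs);
}

// Cache filled at t=0 with a 10-minute lifetime, key removed at t=2min:
// the first failures appear at t=10min, eight minutes after the rotation.
const failureMs = firstFailureAt(0, 10 * 60 * 1000, 2 * 60 * 1000);
console.log(failureMs); // 600000
```

Because each server instance has its own cache timestamp, a fleet of instances will start failing at different moments, which is why the 401 spike can look erratic.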
The grace period strategy
The correct rotation procedure publishes both old and new keys simultaneously for a transition window:
// Phase 1: JWKS contains both old and new keys (grace period)
{
"keys": [
{ "kid": "key-2024-01", "alg": "RS256", "use": "sig", "kty": "RSA", ... }, // OLD
{ "kid": "key-2025-01", "alg": "RS256", "use": "sig", "kty": "RSA", ... } // NEW
]
}
// Phase 2: After grace period — JWKS contains only the new key
{
"keys": [
{ "kid": "key-2025-01", "alg": "RS256", "use": "sig", "kty": "RSA", ... } // NEW only
]
}
During Phase 1, the authorization server begins signing all new tokens with key-2025-01. Existing tokens signed with key-2024-01 are still verifiable because both keys are in JWKS. The JWKS response includes the new key, so any MCP server instance that re-fetches JWKS during the grace period will cache both keys and can verify both old-key and new-key tokens.
The grace period must be at least as long as the larger of the token TTL and the maximum session duration:
| Scenario | Minimum grace period |
|---|---|
| Short-lived tokens (15min), short sessions (<1h) | 1 hour |
| Short-lived tokens (15min), long sessions (up to 8h) | 8 hours (session duration is the constraint) |
| Long-lived tokens (24h), any session | 24 hours (token TTL is the constraint) |
| Refresh tokens issued as JWTs (30d) | 30 days, or revoke refresh tokens separately before key removal |
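The rule behind the table can be written as a small calculation. This is a sketch only: the helper names are hypothetical, and it assumes the grace period is measured from the moment the authorization server stops signing with the old key.

```typescript
// Hypothetical helper: the grace period is the longer of the token TTL
// and the longest session the deployment allows.
function minGracePeriodMs(tokenTtlMs: number, maxSessionMs: number): number {
  return Math.max(tokenTtlMs, maxSessionMs);
}

// Earliest time the old key may leave the JWKS, counted from the moment
// the auth server stopped signing new tokens with it (an assumption of this sketch).
function earliestRemoval(stoppedSigningAt: Date, tokenTtlMs: number, maxSessionMs: number): Date {
  return new Date(stoppedSigningAt.getTime() + minGracePeriodMs(tokenTtlMs, maxSessionMs));
}

const MINUTE = 60 * 1000;
const HOUR = 60 * MINUTE;

// 15-minute tokens with sessions of up to 8 hours: the session length wins.
console.log(minGracePeriodMs(15 * MINUTE, 8 * HOUR) / HOUR); // 8
```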
How jose handles kid-based key selection
createRemoteJWKSet from jose reads the kid field from the JWT header and selects the matching key from the JWKS. If the kid is not in the cached JWKS, it re-fetches the JWKS endpoint once (subject to the cooldownDuration) and retries. This means your MCP server handles rotation transparently — no restart required, no code change needed:
import { createRemoteJWKSet, jwtVerify } from 'jose';
// This code handles rotation automatically via kid-based selection
const JWKS = createRemoteJWKSet(
new URL(`${process.env.AUTH_ISSUER}/.well-known/jwks.json`),
{
cacheMaxAge: 10 * 60 * 1000, // 10 minutes — balance freshness vs. JWKS traffic
cooldownDuration: 30 * 1000, // 30 seconds — prevent flood on unknown kid
}
);
// jwtVerify selects the key matching the JWT's kid header automatically
const { payload } = await jwtVerify(token, JWKS, {
algorithms: ['RS256', 'ES256'],
issuer: process.env.AUTH_ISSUER,
audience: process.env.AUTH_AUDIENCE,
});
The cooldownDuration is your defence against re-fetch flooding: an attacker sending tokens with arbitrary kid values would otherwise trigger a JWKS re-fetch on every request, exhausting the authorization server's rate limits. The cooldown applies to the key set as a whole, so at most one re-fetch happens per cooldownDuration no matter how many unknown kid values arrive.
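The interaction between kid lookup and the cooldown can be illustrated with a toy cache. This is not jose's implementation; the class, its synchronous fetcher, and the injectable clock are all hypothetical simplifications chosen to keep the sketch self-contained.

```typescript
type CachedJwk = { kid: string };

// Toy illustration of kid-based lookup with a re-fetch cooldown.
class CooldownKeyCache {
  private keys = new Map<string, CachedJwk>();
  private lastFetchMs = -Infinity;
  private fetchKeys: () => CachedJwk[];
  private cooldownMs: number;
  private now: () => number;
  fetchCount = 0;

  constructor(fetchKeys: () => CachedJwk[], cooldownMs: number, now: () => number = Date.now) {
    this.fetchKeys = fetchKeys; // hypothetical synchronous stand-in for an HTTP fetch
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  lookup(kid: string): CachedJwk | undefined {
    const cached = this.keys.get(kid);
    if (cached) return cached;
    // Unknown kid: re-fetch only if the cooldown since the last fetch has elapsed.
    if (this.now() - this.lastFetchMs >= this.cooldownMs) {
      this.lastFetchMs = this.now();
      this.fetchCount += 1;
      const fresh = new Map<string, CachedJwk>();
      for (const key of this.fetchKeys()) fresh.set(key.kid, key);
      this.keys = fresh;
    }
    // Still undefined if the endpoint does not publish this kid.
    return this.keys.get(kid);
  }
}
```

With a 30-second cooldown, a burst of requests carrying made-up kid values causes one fetch and then only cache misses until the cooldown expires, while a genuinely new kid published during rotation is picked up on the first permitted re-fetch.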
Zero-downtime rotation procedure
Follow this sequence to rotate keys without any session disruption:
## Step 1: Generate the new key pair (on the auth server)
openssl genrsa -out new-private.pem 2048
openssl rsa -in new-private.pem -pubout -out new-public.pem
## Step 2: Add the new public key to JWKS with a new kid
## DO NOT remove the old key yet
## Auth server JWKS endpoint now returns both keys
## Step 3: Verify JWKS contains both keys
curl https://auth.example.com/.well-known/jwks.json | jq '.keys | length'
# Should return 2
## Step 4: Switch the auth server to sign new tokens with the new key
## Old tokens (signed by old key) remain verifiable for the grace period
## Step 5: Wait for grace period
## Duration = max(token_ttl, max_session_lifetime)
## For 1h tokens and 8h sessions: wait 8 hours
## Step 6: Verify no active sessions hold old-key tokens
## Check the auth server's session store, or wait out the full grace period to be certain
## Step 7: Remove the old key from JWKS
## JWKS endpoint now returns only the new key
## Step 8: Verify JWKS contains only the new key
curl https://auth.example.com/.well-known/jwks.json | jq '.keys | length'
# Should return 1
## Step 9: Archive the old private key securely (do not delete immediately)
## Required for forensic investigation if a token signed by the old key appears after rotation
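Counting keys in Steps 3 and 8 confirms how many keys are published but not which ones. A stricter check compares kid values by name. The sketch below assumes you have already fetched and parsed the JWKS document; the function and type names are hypothetical.

```typescript
type JwksDocument = { keys: { kid?: string }[] };
type RotationPhase = "grace" | "new-only" | "old-only" | "unexpected";

// Hypothetical helper: classifies a parsed JWKS by which of the two kids it publishes.
function rotationPhase(jwks: JwksDocument, oldKid: string, newKid: string): RotationPhase {
  const kids = new Set(jwks.keys.map((key) => key.kid));
  const hasOld = kids.has(oldKid);
  const hasNew = kids.has(newKid);
  if (hasOld && hasNew) return "grace"; // expected after Step 2
  if (hasNew) return "new-only";        // expected after Step 7
  if (hasOld) return "old-only";        // rotation has not started
  return "unexpected";                  // neither key is published
}

// Using the kid values from the JWKS examples earlier in this guide:
const duringGrace: JwksDocument = { keys: [{ kid: "key-2024-01" }, { kid: "key-2025-01" }] };
console.log(rotationPhase(duringGrace, "key-2024-01", "key-2025-01")); // grace
```

A rotation script could refuse to proceed to Step 4 unless the result is "grace", and refuse to archive the old key unless it is "new-only".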
Detecting bad rotations with AliveMCP
A misconfigured rotation (removing the old key before the grace period ends) produces a sudden 401 spike across all active MCP sessions. AliveMCP's continuous probes detect this as an authentication failure event: the probe token (signed by the old key, with a TTL longer than the rotation window) begins failing verification as soon as the MCP server re-fetches a JWKS that no longer contains the old key.
Because AliveMCP probes run every 60 seconds, the alert follows the first failing verification by at most 60 seconds. Measured from the rotation itself, the delay can also include the time until the MCP server's JWKS cache expires (up to cacheMaxAge). Without external probing, you would only discover the failure when users begin reporting errors, typically minutes to hours later depending on how many users are active.
The AliveMCP probe alert should name the expected behaviour: "HTTP 401 from a server that was healthy 60 seconds ago — likely key rotation without grace period." This context is included in the AliveMCP incident payload alongside the raw HTTP status code, so you immediately know what to check (your JWKS endpoint) rather than starting a blind investigation. See MCP server security monitoring for distinguishing rotation-induced 401 spikes from credential enumeration attacks.
Algorithm migration (RS256 to ES256)
Algorithm migration is a rotation where you change both the key and the algorithm. The procedure is the same as key rotation, with one addition: your MCP server must accept both algorithms during the grace period.
// During migration: accept both RS256 (old) and ES256 (new)
const { payload } = await jwtVerify(token, JWKS, {
algorithms: ['RS256', 'ES256'], // both accepted during grace period
issuer: process.env.AUTH_ISSUER,
audience: process.env.AUTH_AUDIENCE,
});
// After migration: restrict to ES256 only
// algorithms: ['ES256']
Do not remove RS256 from the algorithms list until after the grace period ends. Removing it early causes the same sudden 401 spike as removing the old key from JWKS prematurely.
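One way to avoid a manual code change at the end of the migration is to derive the accepted algorithms from the grace period deadline. This is a sketch under the assumption that the deadline is known to the MCP server (for example, from configuration); the helper name is hypothetical.

```typescript
// Hypothetical helper: accept both algorithms until the grace period ends,
// then restrict verification to the new algorithm only.
function acceptedAlgorithms(nowMs: number, gracePeriodEndsMs: number): string[] {
  return nowMs < gracePeriodEndsMs ? ["RS256", "ES256"] : ["ES256"];
}

// Example: a grace period ending at t=1000 (arbitrary units for illustration).
console.log(acceptedAlgorithms(500, 1000));  // [ 'RS256', 'ES256' ]
console.log(acceptedAlgorithms(1500, 1000)); // [ 'ES256' ]
```

The returned array can be passed as the algorithms option of jwtVerify, so the restriction to ES256 takes effect on schedule without a redeploy.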
Related questions
How often should I rotate JWKS signing keys?
At minimum, rotate annually; SOC 2 and ISO 27001 audits commonly expect a documented rotation schedule, and annual is a typical baseline. Rotate immediately if a private key is potentially compromised; do not wait for the scheduled rotation window. For high-security deployments, 90-day rotation is common. The overhead of following the grace period procedure is low, so automate it. Never leave a key in service for more than 2 years regardless of compliance requirements.
What if my authorization server doesn't support publishing multiple keys?
This is a real constraint with some older auth systems. Two options: (1) Use a short token TTL (5–15 minutes) so the grace period is short and you can accept a brief 401 window during the rotation — clients retry quickly with fresh tokens. (2) Build a proxy JWKS endpoint that merges the keys from the auth server with a manually-maintained set of retired keys — the proxy serves both during the grace period, then stops serving the old key. Option 2 is more complex but eliminates session disruption entirely.
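The merging step of option 2 might look like the following. It is a sketch, not a complete proxy: it assumes the upstream JWKS has already been fetched and that each retired key is stored with an expiry timestamp. All type and function names are hypothetical.

```typescript
type PublicJwk = { kid: string; [param: string]: unknown };
type RetiredKey = { jwk: PublicJwk; serveUntilMs: number };

// Hypothetical helper: combines the auth server's current keys with retired
// keys whose grace period has not yet ended.
function mergeJwks(
  upstream: { keys: PublicJwk[] },
  retired: RetiredKey[],
  nowMs: number,
): { keys: PublicJwk[] } {
  const merged = [...upstream.keys];
  const seen = new Set(upstream.keys.map((key) => key.kid));
  for (const entry of retired) {
    // Skip expired keys and any kid the upstream already publishes.
    if (entry.serveUntilMs > nowMs && !seen.has(entry.jwk.kid)) {
      merged.push(entry.jwk);
      seen.add(entry.jwk.kid);
    }
  }
  return { keys: merged };
}
```

The MCP server would then point its JWKS URL at the proxy instead of the auth server. Only public keys belong in the retired set; the proxy never needs the private halves.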
How do I test rotation in staging before doing it in production?
Set up a staging auth server with a 5-minute token TTL and 1-minute JWKS cache. Trigger a rotation (remove the old key immediately, without grace period) and verify that your MCP server produces the expected 401 spike. Then set up a proper grace period rotation and verify the spike does not occur. Run AliveMCP probes against your staging MCP server during both tests — this verifies that the alert fires when it should and does not fire when rotation is done correctly.
Further reading
- MCP server JWT validation — verifying tokens at the transport boundary
- MCP server authentication — API keys, OAuth 2.0, and session binding
- MCP server security monitoring — detecting auth anomalies
- MCP server secrets management — storing private keys securely
- MCP server circuit breaker — protecting against JWKS endpoint failures
- AliveMCP — uptime monitoring that detects rotation-induced 401 spikes within 60 seconds