MCP server flapping

Flapping is when your MCP server monitor alternates between "down" and "up" in rapid succession, generating a stream of alerts that train your team to ignore them. The symptom looks like an unreliable server, but the root cause is usually a misconfigured monitor, a cold-start latency spike, or a server teetering at the edge of a resource limit. Here's how to tell them apart and fix each one.

TL;DR

Most MCP server flapping is caused by: (1) a probe timeout set too short for cold-start latency, (2) a fire-on-first-failure policy with instant recovery, or (3) a server sitting at its memory or connection limit and shedding probes intermittently. Fix with a consecutive-failure threshold (fire after 3 failures, recover after 3 successes) and a cold-start exemption window (suppress the first probe after a 10-minute silence gap).

What flapping looks like in your alert log

The signature: incident fired at 14:02, resolved at 14:03, incident re-fired at 14:04, resolved at 14:05. Repeat for hours. Each cycle is a real probe result — the server failed one probe and passed the next — but there's no real outage. The server is marginal, not dead.

Flapping is expensive. It trains your team to treat all alerts as noise (alert fatigue), it generates PagerDuty incidents that push against your monthly incident-count SLO, and it makes it impossible to see real outages in the alert stream. A server that goes genuinely down during a flapping period is indistinguishable from another flap cycle — which is how real outages go unacknowledged for hours.

The four causes of MCP server flapping

1. Cold-start latency exceeding the probe timeout

Serverless MCP servers on Vercel, Railway, Render, or AWS Lambda cold-start on the first probe after a period of inactivity. Cold-start latency ranges from 500ms for a lightweight Node server to 8–12 seconds for a JVM-based server. If your probe timeout is 5 seconds and the cold-start takes 7 seconds, every probe that hits a cold instance fails — but the next probe hits a warm instance and passes. The result is an alternating fail/pass/fail/pass pattern keyed to your instance TTL. See MCP server cold start for the full treatment.

2. Aggressive fire-on-first-failure policy

Monitors that fire on the first failed probe and recover on the first passing probe produce flapping whenever there's any transient network noise. A single dropped packet between your probe origin and the server creates one failed probe, fires an incident, then resolves immediately on the next probe. On a 60-second probe cadence, you can get 10–20 spurious incidents per hour from a server that is entirely healthy.

3. Server at resource limit (memory, connections, file descriptors)

A server running at 95% memory utilization will occasionally fail probes when the garbage collector pauses or a spike in concurrent connections triggers a rejection. The failures are intermittent — most probes succeed, but a probe that arrives during a GC pause gets a timeout. This produces a low-frequency flap (1 failure per 10–20 probes) that is invisible on a per-probe chart but shows up clearly in a 24-hour uptime percentage (e.g., 94.8% uptime on an "up" server).
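To see how a low-frequency flap maps to an uptime percentage, here is a minimal sketch of the arithmetic. The function name `uptime_percent` is hypothetical, and the figures assume a 60-second probe cadence (1,440 probes per day).

```python
def uptime_percent(total_probes: int, failed_probes: int) -> float:
    """Share of probes that passed, as a percentage."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return 100.0 * (total_probes - failed_probes) / total_probes


# Assumed cadence: one probe every 60 seconds, so 1,440 probes in 24 hours.
PROBES_PER_DAY = 24 * 60

# 75 failed probes in a day is roughly 1 failure per 19 probes.
daily_uptime = round(uptime_percent(PROBES_PER_DAY, 75), 1)
```

Under these assumptions, about 75 scattered failures in a day is enough to pull a server that looks "up" on every chart down to roughly 94.8%.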

4. Probe origin network jitter

If your probe origin shares infrastructure with other services (a shared VPS, a CI runner, a Lambda in a noisy account), you can see bursts of network jitter that cause probes to time out. The server isn't failing — the path between probe origin and server is. This is detectable: if failures are correlated across multiple unrelated servers (all fail within the same 30-second window), the probe origin is the likely cause.
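The correlation check can be sketched as below. This is an illustrative helper, not part of any monitoring product: the function name, the `(server, timestamp)` input shape, and the `min_servers=3` threshold are all assumptions. It uses fixed 30-second buckets, so a burst that straddles a bucket boundary can be split in two; a sliding window would be stricter.

```python
from collections import defaultdict


def correlated_failure_windows(failures, window_seconds=30, min_servers=3):
    """Return start times of windows in which several distinct servers failed.

    `failures` is an iterable of (server_name, unix_timestamp) pairs.
    """
    buckets = defaultdict(set)
    for server, ts in failures:
        # Assign each failure to a fixed window of `window_seconds`.
        window_start = int(ts // window_seconds) * window_seconds
        buckets[window_start].add(server)
    return sorted(
        start for start, servers in buckets.items() if len(servers) >= min_servers
    )


sample = [("a", 100), ("b", 105), ("c", 110), ("a", 400), ("b", 1000)]
suspect_windows = correlated_failure_windows(sample)
```

Windows returned here are the timestamps at which to inspect the probe origin rather than the servers.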

Hysteresis: the fundamental fix

Hysteresis means requiring evidence of a state change before accepting it. Applied to uptime monitoring:

Fire condition: require N consecutive failed probes before transitioning from UP to DOWN and firing an alert.
Recovery condition: require M consecutive successful probes before transitioning from DOWN to UP and firing a recovery alert.

The typical values are N=3 and M=3 on a 60-second probe cadence. This means:

A genuine outage is detected in ≤3 minutes (3 × 60 seconds).
A transient failure — one or two bad probes — never fires an alert.
Recovery is confirmed after 3 minutes of consecutive success, not after the first passing probe.
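The fire and recovery conditions above amount to a small state machine. The following is a minimal sketch under the N=3, M=3 values described here; the class and method names (`HysteresisMonitor`, `observe`) are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class HysteresisMonitor:
    fire_after: int = 3      # N: consecutive failures required to go DOWN
    recover_after: int = 3   # M: consecutive successes required to go UP
    state: str = "UP"
    streak: int = 0          # consecutive probes contradicting the current state

    def observe(self, probe_ok: bool):
        """Feed one probe result; return "fire", "recover", or None."""
        contradicts = (self.state == "UP") != probe_ok
        if not contradicts:
            # The probe agrees with the current state, so any streak is broken.
            self.streak = 0
            return None
        self.streak += 1
        threshold = self.fire_after if self.state == "UP" else self.recover_after
        if self.streak < threshold:
            return None
        # Enough consecutive evidence: accept the state change.
        self.state = "DOWN" if self.state == "UP" else "UP"
        self.streak = 0
        return "fire" if self.state == "DOWN" else "recover"


monitor = HysteresisMonitor()
probes = [True, False, True, False, False, False, True, True, True]
events = [monitor.observe(ok) for ok in probes]
```

In this run the isolated failure at the second probe produces no event; the alert fires only on the third consecutive failure, and recovery is reported only after three consecutive successes.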

The tradeoff: with N=3, your time-to-alert on a genuine outage is two probe intervals longer than with N=1. For a 60-second-probe monitor, an outage that starts just after a passing probe alerts at minute 3 rather than minute 1. For most MCP server use cases, 3-minute detection is well within acceptable bounds, and the elimination of flapping alerts is worth the cost. If you need 1-minute detection for a business-critical server, use a 20-second probe cadence with N=3 (60-second detection, flap-free).
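The detection-time arithmetic can be checked with a short sketch. It assumes evenly spaced probes and ignores the probe timeout itself; `detection_latency_seconds` is a hypothetical helper name. The best case is an outage that begins just before a probe, the worst case one that begins just after a probe.

```python
def detection_latency_seconds(cadence_seconds: float, fire_after: int):
    """Return (best, worst) seconds from outage start to the alert firing."""
    best = (fire_after - 1) * cadence_seconds   # outage starts just before a probe
    worst = fire_after * cadence_seconds        # outage starts just after a probe
    return best, worst


default_policy = detection_latency_seconds(60, 3)   # 60-second cadence, N=3
first_failure = detection_latency_seconds(60, 1)    # 60-second cadence, N=1
fast_cadence = detection_latency_seconds(20, 3)     # 20-second cadence, N=3
```

Under these assumptions the 20-second cadence with N=3 has the same worst case as firing on the first failure at a 60-second cadence, while still requiring three consecutive failures.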

Cold-start exemption window

For serverless MCP servers, even N=3 hysteresis can produce flapping if the server's cold-start duration exceeds your probe timeout. The fix is a cold-start exemption window: if the last successful probe was more than T minutes ago (indicating the instance was shut down), suppress the first probe result after the gap and wait for the second probe to classify the server's state.

A gap of 10 minutes is a safe heuristic for most serverless platforms (Vercel Functions idle out after ~10 minutes without traffic; Railway and Render on free tiers shut down after 15 minutes of inactivity). If your server idles out faster, lower the window. The key invariant: the window should be no longer than your platform's idle timeout, so that every cold start is covered, but not much shorter, so that you don't suppress probes on genuinely down servers.
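One way to express the exemption is sketched below. The class name `ColdStartExemption`, the `classify` method, and the 600-second default are illustrative assumptions. Only the first failure that directly follows the gap is suppressed, so a server that stays down is still counted from its second failed probe onward.

```python
class ColdStartExemption:
    def __init__(self, gap_seconds: float = 600.0):
        self.gap_seconds = gap_seconds        # T: assumed idle gap, 10 minutes
        self.last_success = None              # time of the last passing probe
        self.failed_since_success = False     # has any failure followed it?

    def classify(self, probe_time: float, probe_ok: bool) -> str:
        """Return "ok", "fail", or "suppressed" for one probe result."""
        if probe_ok:
            self.last_success = probe_time
            self.failed_since_success = False
            return "ok"
        first_failure = not self.failed_since_success
        self.failed_since_success = True
        after_gap = (
            self.last_success is not None
            and probe_time - self.last_success > self.gap_seconds
        )
        # Exempt only the first failed probe after a long quiet period.
        return "suppressed" if first_failure and after_gap else "fail"


exemption = ColdStartExemption()
timeline = [(0, True), (900, False), (960, False), (1020, True), (1080, False)]
labels = [exemption.classify(t, ok) for t, ok in timeline]
```

A "suppressed" result would simply be skipped before the probe reaches the consecutive-failure counter; "ok" and "fail" results pass through unchanged.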

AliveMCP detects cold-start-associated failures separately from genuine failures, and does not count cold-start timeout spikes against your server's uptime SLO unless they exceed 3 consecutive probes.

Diagnosing which cause is yours

Given a flapping server, check in this order:

1. Plot failure timestamps against probe timestamps. If failures are regularly spaced (e.g., every 15 minutes) and your platform has a 15-minute idle timeout, it's cold-start flapping.
2. Check failure correlation across servers. If multiple unrelated servers fail in the same 30-second window, it's probe-origin jitter. Check the probe origin's CPU, network, and memory at those timestamps.
3. Check how the failures are distributed over time. A server with 94% uptime on a 60-second probe has roughly 86 failures in 24 hours. If those failures are clustered in short bursts (10 failures in 10 minutes, then 0 for 3 hours, then 10 again), it's resource exhaustion, most likely GC pauses or connection-limit events. If the failures are distributed roughly uniformly, it's probe timeout misconfiguration.
4. Widen the probe timeout and watch for improvement. If increasing your timeout from 5s to 15s eliminates the flapping, it was a timeout misconfiguration (almost certainly cold-start latency).
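The spacing checks in this list can be roughed out in code by looking at the gaps between consecutive failures. This is a heuristic sketch, not a definitive classifier: the function name and both thresholds (`periodic_cv`, `bursty_cv`) are assumptions chosen for illustration and should be tuned against your own alert log.

```python
from statistics import mean, pstdev


def classify_failure_spacing(failure_times, periodic_cv=0.25, bursty_cv=1.5):
    """Label failure timestamps (in seconds) by how their gaps are distributed."""
    times = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return "insufficient data"
    # Coefficient of variation: near 0 for even spacing, large for bursts.
    cv = pstdev(gaps) / mean(gaps)
    if cv < periodic_cv:
        return "periodic"     # consistent with cold-start flapping
    if cv > bursty_cv:
        return "bursty"       # consistent with resource exhaustion
    return "spread out"       # neither pattern: review the probe timeout


every_15_min = [i * 900 for i in range(5)]
two_bursts = [i * 60 for i in range(10)] + [11340 + i * 60 for i in range(10)]
scattered = [0, 100, 400, 500, 1100, 1300]
```

Here `every_15_min` classifies as "periodic", `two_bursts` (ten failures, a three-hour quiet period, ten more) as "bursty", and `scattered` as "spread out".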