MCP server flapping
Flapping is when your MCP server monitor alternates between "down" and "up" in rapid succession, generating a stream of alerts that train your team to ignore them. The symptom looks like an unreliable server, but the root cause is usually a misconfigured monitor, a cold-start latency spike, or a server teetering at the edge of a resource limit. Here's how to tell them apart and fix each one.
TL;DR
Most MCP server flapping is caused by: (1) a probe timeout set too short for cold-start latency, (2) a fire-on-first-failure policy with instant recovery, or (3) a server sitting at its memory or connection limit and shedding probes intermittently. Fix with a consecutive-failure threshold (fire after 3 failures, recover after 3 successes) and a cold-start exemption window (suppress the first probe after a 10-minute silence gap).
What flapping looks like in your alert log
The signature: incident fired at 14:02, resolved at 14:03, incident re-fired at 14:04, resolved at 14:05. Repeat for hours. Each cycle is a real probe result — the server failed one probe and passed the next — but there's no real outage. The server is marginal, not dead.
Flapping is expensive. It trains your team to treat all alerts as noise (alert fatigue), it generates PagerDuty incidents that push against your monthly incident-count SLO, and it makes it impossible to see real outages in the alert stream. A server that goes genuinely down during a flapping period is indistinguishable from another flap cycle — which is how real outages go unacknowledged for hours.
The four causes of MCP server flapping
1. Cold-start latency exceeding the probe timeout
Serverless MCP servers on Vercel, Railway, Render, or AWS Lambda cold-start on the first probe after a period of inactivity. Cold-start latency ranges from 500ms for a lightweight Node server to 8–12 seconds for a JVM-based server. If your probe timeout is 5 seconds and the cold-start takes 7 seconds, every probe that hits a cold instance fails — but the next probe hits a warm instance and passes. The result is an alternating fail/pass/fail/pass pattern keyed to your instance TTL. See MCP server cold start for the full treatment.
2. Aggressive fire-on-first-failure policy
Monitors that fire on the first failed probe and recover on the first passing probe produce flapping whenever there's any transient network noise. A single dropped packet between your probe origin and the server causes one failed probe, which fires an incident that resolves on the very next probe. On a 60-second probe cadence, a noisy path can produce 10–20 spurious incidents per hour from a server that is entirely healthy.
3. Server at resource limit (memory, connections, file descriptors)
A server running at 95% memory utilization will occasionally fail probes when the garbage collector pauses or a spike in concurrent connections triggers a rejection. The failures are intermittent — most probes succeed, but a probe that arrives during a GC pause gets a timeout. This produces a low-frequency flap (1 failure per 10–20 probes) that is invisible on a per-probe chart but shows up clearly in a 24-hour uptime percentage (e.g., 94.8% uptime on an "up" server).
4. Probe origin network jitter
If your probe origin shares infrastructure with other services (a shared VPS, a CI runner, a Lambda in a noisy account), you can see bursts of network jitter that cause probes to time out. The server isn't failing — the path between probe origin and server is. This is detectable: if failures are correlated across multiple unrelated servers (all fail within the same 30-second window), the probe origin is the likely cause.
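The correlation check described above can be sketched in a few lines. This is a minimal illustration, not any monitor's actual implementation: the function name, the `(server, timestamp)` input shape, and the `min_servers=3` cutoff are all assumptions made for the example. Only the 30-second window comes from the text.

```python
from collections import defaultdict


def correlated_failure_windows(failures, window_seconds=30, min_servers=3):
    """Return start times of windows in which several distinct servers failed.

    `failures` is an iterable of (server_name, unix_timestamp) pairs
    (a hypothetical input shape chosen for this sketch).
    """
    servers_by_window = defaultdict(set)
    for server, ts in failures:
        # Bucket each failure into a fixed-size window.
        bucket = int(ts // window_seconds) * window_seconds
        servers_by_window[bucket].add(server)
    return sorted(
        start
        for start, servers in servers_by_window.items()
        if len(servers) >= min_servers
    )


failures = [
    ("alpha", 1000), ("beta", 1010), ("gamma", 1015),  # same 30-second bucket
    ("alpha", 5000),                                   # isolated failure
]
print(correlated_failure_windows(failures))  # [990]
```

Fixed buckets can split a burst that straddles a boundary, so a sliding window would be more precise. The bucketed version is usually enough to tell whether the probe origin deserves a closer look.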
Hysteresis: the fundamental fix
Hysteresis means requiring evidence of a state change before accepting it. Applied to uptime monitoring:
- Fire condition: require N consecutive failed probes before transitioning from UP to DOWN and firing an alert.
- Recovery condition: require M consecutive successful probes before transitioning from DOWN to UP and firing a recovery alert.
The typical values are N=3 and M=3 on a 60-second probe cadence. This means:
- A genuine outage is detected in ≤3 minutes (3 × 60 seconds).
- A transient failure — one or two bad probes — never fires an alert.
- Recovery is confirmed after 3 minutes of consecutive success, not after the first passing probe.
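The fire and recovery conditions above amount to a small two-state machine. The sketch below is one way to write it, assuming a hypothetical `HysteresisMonitor` class with N and M exposed as `fail_threshold` and `recover_threshold`. It is an illustration of the technique, not AliveMCP's code.

```python
from dataclasses import dataclass


@dataclass
class HysteresisMonitor:
    fail_threshold: int = 3      # N: consecutive failures before UP -> DOWN
    recover_threshold: int = 3   # M: consecutive successes before DOWN -> UP
    state: str = "UP"
    streak: int = 0              # consecutive probes contradicting the current state

    def observe(self, probe_ok: bool):
        """Feed one probe result; return "DOWN" or "UP" on a transition, else None."""
        contradicts = (self.state == "UP") != probe_ok
        if not contradicts:
            self.streak = 0      # any agreeing probe resets the streak
            return None
        self.streak += 1
        needed = self.fail_threshold if self.state == "UP" else self.recover_threshold
        if self.streak < needed:
            return None
        self.state = "DOWN" if self.state == "UP" else "UP"
        self.streak = 0
        return self.state


monitor = HysteresisMonitor()
results = [True, False, True, False, False, False, True, True, True]
events = [monitor.observe(ok) for ok in results]
print(events)
# [None, None, None, None, None, 'DOWN', None, None, 'UP']
```

The lone failure at index 1 never fires. The alert fires only on the third consecutive failure, and recovery is announced only on the third consecutive success.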
The tradeoff: with N=3, your time-to-alert on a genuine outage is two probe intervals longer than with N=1. For a 60-second-probe monitor, that means the alert arrives at the third failed probe (around minute 3) rather than the first (around minute 1). For most MCP server use cases, 3-minute detection is well within acceptable bounds, and the elimination of flapping alerts is worth the cost. If you need 1-minute detection for a business-critical server, use a 20-second probe cadence with N=3 (60-second detection, flap-free).
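The detection-time arithmetic is simple enough to check directly. The helper names below are made up for this sketch, and the calculation ignores the probe timeout itself, which adds a few seconds to each failed probe.

```python
def worst_case_detection_seconds(cadence_seconds, fail_threshold):
    # An outage that begins just after a passing probe waits almost one full
    # interval for the first failure, then fail_threshold - 1 more intervals.
    return cadence_seconds * fail_threshold


def added_delay_seconds(cadence_seconds, fail_threshold):
    # Extra wait compared with firing on the first failed probe (N = 1).
    return cadence_seconds * (fail_threshold - 1)


for cadence, n in [(60, 1), (60, 3), (20, 3), (30, 2)]:
    print(
        f"cadence={cadence}s N={n}: "
        f"worst case {worst_case_detection_seconds(cadence, n)}s, "
        f"{added_delay_seconds(cadence, n)}s later than N=1"
    )
# cadence=60s N=1: worst case 60s, 0s later than N=1
# cadence=60s N=3: worst case 180s, 120s later than N=1
# cadence=20s N=3: worst case 60s, 40s later than N=1
# cadence=30s N=2: worst case 60s, 30s later than N=1
```

Shortening the cadence buys back detection time without giving up the consecutive-failure filter, at the cost of more probe traffic.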
Cold-start exemption window
For serverless MCP servers, even N=3 hysteresis can produce flapping if the server's cold-start duration exceeds your probe timeout. The fix is a cold-start exemption window: if the last successful probe was more than T minutes ago (indicating the instance was shut down), suppress the first probe result after the gap and wait for the second probe to classify the server's state.
A gap of 10 minutes is a safe heuristic for most serverless platforms (Vercel Functions idle out after ~10 minutes without traffic; Railway and Render on free tiers shut down after 15 minutes of inactivity). If your server idles out faster, lower the window. The key invariant: T should be no longer than your platform's idle timeout, so every cold start is covered, and the exemption should apply to a single probe only, so a genuinely down server is still caught on the second probe.
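As a rough sketch of the exemption logic, assuming a hypothetical `ColdStartFilter` class that sits in front of the alerting logic and sees each probe's timestamp and result: the 600-second default mirrors the 10-minute gap above. This version suppresses only a failed first probe after the gap, since a passing one needs no exemption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColdStartFilter:
    gap_seconds: float = 600.0             # T: silence gap that suggests a cold instance
    last_success_at: Optional[float] = None
    exemption_used: bool = False           # True once one probe was suppressed in this gap

    def classify(self, now: float, probe_ok: bool) -> str:
        """Return "ok", "fail", or "suppressed" for one probe result."""
        if probe_ok:
            self.last_success_at = now
            self.exemption_used = False
            return "ok"
        idle = (
            self.last_success_at is not None
            and now - self.last_success_at > self.gap_seconds
        )
        if idle and not self.exemption_used:
            # First failure after a long silence: treat it as a likely cold start.
            self.exemption_used = True
            return "suppressed"
        return "fail"


cold = ColdStartFilter()
print([cold.classify(t, ok) for t, ok in [(0, True), (900, False), (960, True)]])
# ['ok', 'suppressed', 'ok']

down = ColdStartFilter()
print([down.classify(t, ok) for t, ok in [(0, True), (900, False), (960, False), (1020, False)]])
# ['ok', 'suppressed', 'fail', 'fail']
```

Because the exemption is spent after one probe, a server that is genuinely down starts accumulating real failures on the very next probe.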
AliveMCP detects cold-start-associated failures separately from genuine failures, and does not count cold-start timeout spikes against your server's uptime SLO unless they exceed 3 consecutive probes.
Diagnosing which cause is yours
Given a flapping server, check in this order:
- Plot failure timestamps against probe timestamps. If failures are regularly spaced (e.g., every 15 minutes) and your platform has a 15-minute idle timeout, it's cold-start flapping.
- Check failure correlation across servers. If multiple unrelated servers fail in the same 30-second window, it's probe-origin jitter. Check the probe origin's CPU, network, and memory at those timestamps.
- Check how failures are distributed over time, not just the failure rate. A server with 94% uptime on a 60-second probe has roughly 86 failures in 24 hours. If those failures are clustered in short bursts (10 failures in 10 minutes, then 0 for 3 hours, then 10 again), it's resource exhaustion, most likely GC pauses or connection-limit events. If the failures are distributed roughly uniformly, it's probe timeout misconfiguration.
- Widen the probe timeout and watch for improvement. If increasing your timeout from 5s to 15s eliminates the flapping, it was a timeout misconfiguration (almost certainly cold-start latency).
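Two of these checks are easy to script against an exported list of failure timestamps. The sketch below is only a starting point: the function names are invented, and the coefficient of variation of the gaps between failures is a swapped-in heuristic, not something the checks above prescribe. A value near 0 suggests regular spacing (the cold-start pattern), and a value well above 1 suggests bursts (the resource-exhaustion pattern).

```python
from statistics import mean, pstdev


def expected_failures(uptime_fraction, cadence_seconds=60, hours=24):
    """Number of failed probes implied by an uptime percentage."""
    probes = hours * 3600 / cadence_seconds
    return probes * (1 - uptime_fraction)


def gap_variation(failure_times):
    """Coefficient of variation of the gaps between consecutive failures."""
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # not enough data to say anything
    return pstdev(gaps) / mean(gaps)


print(round(expected_failures(0.94)))          # 86

regular = [0, 900, 1800, 2700, 3600]           # one failure every 15 minutes
bursty = [0, 60, 120, 180, 10980, 11040, 11100]
print(gap_variation(regular))                  # 0.0
print(gap_variation(bursty) > 1)               # True
```

Treat the thresholds as rules of thumb. With only a handful of failures the statistic is noisy, so it is worth plotting the timestamps as well.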
Related questions
My server is genuinely marginal — it fails 5% of probes. Is that flapping?
Not exactly. That's a reliability issue, not a monitoring misconfiguration. With N=3 consecutive-failure hysteresis and a 5% failure rate, you're unlikely to get 3 failures in a row: if failures are independent, the chance that any given run of three probes all fail is 0.05³ = 0.0125%. Failures that cluster, such as GC-pause bursts, make runs more likely than this estimate suggests. Your uptime chart will show 95%, not 100%, but you will rarely generate spurious incidents. The root cause is still worth fixing: investigate memory, connection limits, or a degraded dependency.
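The arithmetic can be checked directly. This sketch assumes probe failures are independent, which will not hold if failures arrive in bursts, and the function name is made up for the example.

```python
def p_consecutive_failures(failure_rate, n):
    """Probability that n specific consecutive probes all fail, assuming independence."""
    return failure_rate ** n


p = p_consecutive_failures(0.05, 3)
print(f"{p:.4%}")  # 0.0125%

# Rough expected number of three-failure runs per day on a 60-second cadence,
# counting each of the ~1440 probes as a possible start of a run.
print(round(p * 1440, 2))  # 0.18
```

At roughly 0.18 expected runs per day, a 5% failure rate would still fire an occasional incident, on the order of one every five or six days, which is rare compared with firing on every failed probe.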
How does AliveMCP prevent flapping alerts?
AliveMCP uses N=3 consecutive-failure hysteresis on all public endpoint probes, a 15-minute dedup window after incident fire, and cold-start detection for serverless endpoints (identified by platform domain: vercel.app, railway.app, render.com, onrender.com). Cold-start probe spikes are logged but don't count toward SLO uptime calculations unless they exceed 3 consecutive probes.
Can I tune the hysteresis threshold in AliveMCP?
Author tier ($9/mo) supports configurable N (1–5) on claimed endpoints. The default is N=3 for new endpoints. If you run a latency-sensitive server where 3-minute detection is too slow, you can set N=1 and accept that transient noise will generate incidents — or switch to a 30-second probe cadence with N=2 (60-second detection with two-probe filtering).
Further reading
- MCP server cold start — how serverless latency spikes look like failures
- MCP server alerting — routing, escalation, and suppression
- MCP server health check — the four-layer probe sequence
- MCP server downtime — causes, detection, and recovery time
- AliveMCP — uptime monitoring for every public MCP endpoint