Guide · Security

MCP server security monitoring

Security monitoring and uptime monitoring ask different questions. Uptime monitoring asks: is the server running? Security monitoring asks: is the server being attacked, compromised, or misused? The two are complementary, not substitutable. A server can be fully available while under an active credential stuffing attack. A server can look healthy to an external probe while a dependency vulnerability allows privilege escalation from inside. This guide covers the security-specific signals MCP server operators should monitor, how to set baselines and alert thresholds for each, and where external probe monitoring like AliveMCP fits in the broader security picture.

TL;DR

Four security monitoring areas for MCP servers: (1) auth failure rate — a spike above your normal 2–5% baseline signals credential stuffing or misconfigured clients; (2) rate anomalies — abnormal call volume per session_id signals automated abuse; (3) tool schema integrity — a changed tools/list hash signals an unexpected update or dependency tampering; (4) TLS certificate expiry — AliveMCP's protocol-layer probing catches certificate issues at the handshake level, not just the port. External probing monitors availability, not security — use it as one layer in a layered security posture, not as a SIEM replacement.

Auth failure rate monitoring

Authentication failures are a normal part of any API's operation — misconfigured clients, expired tokens, and integration bugs all generate 401 and 403 responses. The question isn't whether auth failures occur; it's whether the rate is normal or elevated above baseline.

Establishing your baseline

Log every initialize request with its auth result: auth_result: "success" or auth_result: "failed" plus the failure reason (token_expired, token_invalid, scope_insufficient). Over one to two weeks of normal operation, measure your baseline auth failure rate as a percentage of total initialize attempts. Most well-configured MCP servers see a 2–5% auth failure rate (primarily from token expiry and first-time integration setup). Document this baseline.
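
As a rough illustration of the baseline measurement, the sketch below computes the failure rate and a per-reason tally from a batch of log records. The record shape (`method`, `auth_result`, `reason`) and the function name `authFailureBaseline` are assumptions made for this example, not part of any MCP SDK.

```javascript
// Hypothetical log record shape: { method, auth_result, reason }.
// Only initialize attempts count toward the baseline.
function authFailureBaseline(records) {
  const attempts = records.filter((r) => r.method === 'initialize');
  const failures = attempts.filter((r) => r.auth_result === 'failed');

  // Tally failure reasons so the baseline also documents why auth fails.
  const byReason = {};
  for (const f of failures) {
    byReason[f.reason] = (byReason[f.reason] || 0) + 1;
  }

  return {
    attempts: attempts.length,
    failures: failures.length,
    // Guard against division by zero on an empty log window.
    failureRate: attempts.length === 0 ? 0 : failures.length / attempts.length,
    byReason,
  };
}

const sample = [
  { method: 'initialize', auth_result: 'success' },
  { method: 'initialize', auth_result: 'success' },
  { method: 'initialize', auth_result: 'success' },
  { method: 'initialize', auth_result: 'failed', reason: 'token_expired' },
  { method: 'tools/call', auth_result: 'success' }, // ignored: not an initialize
];
const baseline = authFailureBaseline(sample);
console.log(baseline.failureRate); // 0.25
```

In practice you would feed this one to two weeks of records and store the result alongside the date range it covers.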

Anomaly detection and alerting

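Alert when auth failures exceed your baseline by a significant factor. Apply the multiples below to your baseline failure count per window (for example, failures per minute) rather than to the failure percentage, because a 2–5% baseline percentage cannot rise 50×: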

10× baseline over a 5-minute window: likely a single misconfigured client or a credential rotation event where old tokens weren't fully replaced. Investigate the top failing client_id values — is it one client (configuration issue) or many distinct IPs (credential stuffing)?
50× baseline or absolute rate >100 failures/minute: likely automated attack. Consider temporary IP-based rate limiting while investigating.
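
The two thresholds above can be sketched as a small classifier. This is a minimal example under stated assumptions: the multiples are applied to failures per minute, the function name `classifyAuthFailures` and its return labels are hypothetical, and the 0.1 failures/minute floor is an illustrative value to tune for your own traffic.

```javascript
// Classify an auth-failure window against the 10x / 50x / 100-per-minute thresholds.
// Assumption: the multiples apply to failures per minute, not to the failure percentage.
function classifyAuthFailures({ failures, windowMinutes, baselinePerMinute }) {
  if (failures === 0) return 'normal';
  const perMinute = failures / windowMinutes;

  // Hypothetical floor so a near-zero baseline does not turn one failure into an alert.
  const baseline = Math.max(baselinePerMinute, 0.1);
  const multiple = perMinute / baseline;

  if (multiple >= 50 || perMinute > 100) return 'likely_attack';
  if (multiple >= 10) return 'investigate';
  return 'normal';
}

// 60 failures in 5 minutes against a baseline of 1 per minute is a 12x spike.
console.log(classifyAuthFailures({ failures: 60, windowMinutes: 5, baselinePerMinute: 1 }));
```

An `investigate` result is the cue to group failures by client_id and source IP, as described above, before deciding whether to rate limit.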

Credential stuffing against MCP servers is less common than against web applications — MCP endpoints aren't browser-accessible and require protocol-level interaction. But for authenticated MCP servers that hold sensitive data or expensive-to-use capabilities, the attack surface is real. See MCP authentication primer for the full authentication pattern coverage, including the OAuth 2.0 Client Credentials flow that's most common for server-to-server MCP authentication.

Origin and client diversity monitoring

A normal MCP server traffic pattern shows requests from a consistent set of known agent client identifiers or IP ranges. The sudden appearance of large numbers of new, unknown client IDs, especially with rapid-fire initialize attempts in sequence, is a behavioral signal beyond just the failure rate. Log client_id and source IP on every initialize; alert when the unique-source count in a 5-minute window exceeds 3× the 30-day average for 5-minute windows, so that both sides of the comparison cover the same window length.
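
One way to compute the unique-source signal is sketched below. It assumes each logged initialize event carries `ts` (epoch milliseconds), `client_id`, and `ip`, treats a "source" as a client_id and IP pair, and expects the historical average to be measured over windows of the same length as the current one. The function names are hypothetical.

```javascript
// Count distinct sources (client_id + IP pairs) among initialize events in a window.
// The window is half-open: windowStartMs <= ts < windowEndMs.
function uniqueSources(events, windowStartMs, windowEndMs) {
  const seen = new Set();
  for (const e of events) {
    if (e.ts >= windowStartMs && e.ts < windowEndMs) {
      seen.add(`${e.client_id}|${e.ip}`);
    }
  }
  return seen.size;
}

// Alert when the current count exceeds factor x the historical average
// for windows of the same length (assumed to be precomputed elsewhere).
function sourceDiversityAlert(currentCount, historicalAverage, factor = 3) {
  return currentCount > factor * historicalAverage;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;
const initEvents = [
  { ts: 1000, client_id: 'agent-a', ip: '203.0.113.1' },
  { ts: 2000, client_id: 'agent-a', ip: '203.0.113.1' }, // same source, counted once
  { ts: 3000, client_id: 'agent-b', ip: '203.0.113.2' },
  { ts: 400000, client_id: 'agent-c', ip: '203.0.113.3' }, // outside the window
];
const currentSources = uniqueSources(initEvents, 0, FIVE_MINUTES_MS);
console.log(currentSources); // 2
```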

Rate anomaly detection

After authentication, a legitimate agent session has a characteristic tool call pattern: a burst of tool calls (2–10 in a few seconds as the LLM reasons through a task), then a quiet period, then another burst. An automated abuse pattern looks different: sustained high-cadence tool calls from a single session_id with no inter-call pauses, or many concurrent sessions all calling the same expensive tool simultaneously.

Per-session call rate

Track cumulative tool call count per session_id over the session lifetime. Alert when:

A single session exceeds 100 tool calls (unless your server is intentionally designed for high-call-count workflows — set your threshold based on your expected maximum per your product design).
Tool call cadence within a session exceeds 10 calls in 10 seconds (faster than any human-initiated agent workflow; indicates automated looping).

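When a session hits a threshold, you have two options: return a JSON-RPC error with code -32001 (inside the -32000 to -32099 range that JSON-RPC 2.0 reserves for implementation-defined server errors) and a "rate limit exceeded" message, or log and alert without blocking (monitoring-only mode, useful before you start enforcing limits).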
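
The per-session checks can be sketched as a small in-memory monitor. This is an illustration only: the class name `SessionCallMonitor`, its option names, and the violation labels are hypothetical, and the state lives in a single process (a multi-instance deployment would need shared storage).

```javascript
class SessionCallMonitor {
  constructor({ maxCalls = 100, burstCalls = 10, burstWindowMs = 10_000, enforce = false } = {}) {
    this.maxCalls = maxCalls;
    this.burstCalls = burstCalls;
    this.burstWindowMs = burstWindowMs;
    this.enforce = enforce; // false = monitoring-only: report violations, never block
    this.sessions = new Map(); // session_id -> { total, recent: timestamps in window }
  }

  // Record one tool call. Returns { violation, error }, where error is a JSON-RPC
  // error response to send back, or null when the call should proceed.
  record(sessionId, requestId, nowMs = Date.now()) {
    let s = this.sessions.get(sessionId);
    if (!s) {
      s = { total: 0, recent: [] };
      this.sessions.set(sessionId, s);
    }
    s.total += 1;
    s.recent.push(nowMs);
    // Drop timestamps that have fallen out of the sliding window.
    while (s.recent.length > 0 && s.recent[0] <= nowMs - this.burstWindowMs) {
      s.recent.shift();
    }

    let violation = null;
    if (s.total > this.maxCalls) violation = 'session_total';
    else if (s.recent.length > this.burstCalls) violation = 'burst_cadence';

    const error = violation && this.enforce
      ? { jsonrpc: '2.0', id: requestId, error: { code: -32001, message: 'rate limit exceeded' } }
      : null;
    return { violation, error };
  }

  // Call when a session closes so the map does not grow without bound.
  end(sessionId) {
    this.sessions.delete(sessionId);
  }
}
```

Starting with `enforce: false` lets you observe how often real sessions would trip the thresholds before any client sees an error.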

Cross-session fleet anomalies

A coordinated attack may use many short sessions to stay below per-session thresholds. Monitor the aggregate fleet rate: total tool calls per minute across all sessions. Alert when this exceeds 3× your expected peak. If you have IP-level data, check whether the aggregate spike is concentrated in a single IP or ASN. Concentration points to one automated source, whereas legitimate viral traffic arrives from many unrelated networks.

Note: legitimate viral traffic (your server gets featured in a blog post) can produce the same aggregate rate spike as a coordinated abuse event. Distinguish them by checking auth failure rate simultaneously — legitimate new users have a normal auth failure rate; credential stuffing has an elevated one. Real viral traffic also tends to produce varied tool call patterns; automated abuse tends to call the same tool repeatedly.
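
The triage logic from this section can be sketched as follows. The 3× peak multiplier comes from the text; the `authElevatedFactor` and `dominantToolShare` cutoffs are hypothetical values chosen for illustration, as are the function name and the assessment labels.

```javascript
// Triage a fleet-wide call spike using the two distinguishing signals in the text:
// an elevated auth failure rate, and one tool dominating the call mix.
function triageFleetSpike(
  { callsPerMinute, expectedPeakPerMinute, authFailureRate, baselineAuthFailureRate, toolCounts },
  { authElevatedFactor = 3, dominantToolShare = 0.8 } = {} // hypothetical cutoffs
) {
  if (callsPerMinute <= 3 * expectedPeakPerMinute) {
    return { spike: false, assessment: 'normal' };
  }

  const counts = Object.values(toolCounts);
  const total = counts.reduce((a, b) => a + b, 0);
  // Share of calls going to the single most-called tool.
  const topShare = total === 0 ? 0 : Math.max(...counts) / total;

  const authElevated = authFailureRate > authElevatedFactor * baselineAuthFailureRate;
  const repetitive = topShare >= dominantToolShare;

  return {
    spike: true,
    authElevated,
    repetitive,
    assessment: authElevated || repetitive ? 'suspected_abuse' : 'likely_organic',
  };
}

const viral = triageFleetSpike({
  callsPerMinute: 1000,
  expectedPeakPerMinute: 200,
  authFailureRate: 0.03,
  baselineAuthFailureRate: 0.03,
  toolCounts: { search: 400, fetch: 350, summarize: 250 },
});
console.log(viral.assessment); // likely_organic
```

The output is an investigation hint, not a verdict: a `likely_organic` spike still deserves a look at capacity, and a `suspected_abuse` spike still needs a human to confirm.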

Tool schema integrity monitoring

Your server's tools/list response defines the tool surface your clients see. A change to tools/list is expected when you deploy a new version — but an unexpected change (between deployments, during a period where no deploy occurred) is a signal worth investigating. It could indicate:

A dependency that dynamically mutates your tool list based on fetched configuration.
An unauthorized deployment or configuration change.
A supply chain compromise where an upstream package injects additional tools into your server's tool registry.

Schema hash monitoring

On every tools/list response, compute a hash of the canonical tool definitions (sorted tool names, sorted parameter schemas, stringified). Store the hash with a timestamp. Alert when the hash changes outside of a known deployment window:

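const crypto = require('node:crypto');

// Sort a copy by name: Array.prototype.sort would otherwise reorder toolList.tools in place.
const sortedTools = [...toolList.tools].sort((a, b) => a.name.localeCompare(b.name));

// Note: this sorts tools only; key order inside each schema still affects the hash.
const schemaHash = crypto
  .createHash('sha256')
  .update(JSON.stringify(sortedTools))
  .digest('hex')
  .slice(0, 16); // first 16 hex chars (64 bits) are enough for change detection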

AliveMCP's probe collects the tools/list response on every probe cycle and tracks schema drift. An unexpected tools/list change generates a schema_drift_detected event in the monitoring dashboard. This isn't a security alert in isolation — it's an investigation trigger. Check your deployment history first; if no deploy occurred in the window, escalate.
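
The hash snippet above sorts tools by name only, so a dependency that emits the same schema with a different key order would still change the hash. The sketch below is one possible way to canonicalize nested keys as well and to apply the "outside a known deployment window" check. The function names and the `deployWindows` shape (`{ startMs, endMs }` entries taken from your own deploy history) are assumptions for this example.

```javascript
const crypto = require('node:crypto');

// Recursively sort object keys so that key order cannot change the hash.
// Array order is preserved, since arrays are ordered by definition.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value).sort().map((k) => [k, canonicalize(value[k])])
    );
  }
  return value;
}

function hashToolList(toolList) {
  // Copy before sorting so the caller's tools array is not reordered.
  const tools = [...toolList.tools]
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0))
    .map(canonicalize);
  return crypto.createHash('sha256').update(JSON.stringify(tools)).digest('hex').slice(0, 16);
}

// deployWindows: hypothetical list of { startMs, endMs } from your deploy history.
function checkSchemaDrift(previousHash, toolList, nowMs, deployWindows) {
  const hash = hashToolList(toolList);
  const changed = previousHash !== null && hash !== previousHash;
  const inDeployWindow = deployWindows.some((w) => nowMs >= w.startMs && nowMs <= w.endMs);
  // A change inside a deploy window is expected; anything else needs investigation.
  return { hash, changed, unexpected: changed && !inDeployWindow };
}
```

Persist the returned `hash` with its timestamp on every check, and treat `unexpected: true` as the trigger to review deployment history before escalating.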

See schema drift in MCP tool definitions for the full schema drift detection and response pattern.

TLS certificate monitoring

An expired TLS certificate produces the same failure signature as a completely downed server: TLS handshake failure → transport-layer probe failure → alert. The difference lies in the error message and the remediation (renew the certificate vs restart the process). AliveMCP's protocol-layer probe performs the TLS handshake before the MCP protocol exchange begins, so it can detect an expired certificate that a plain port check, which only confirms that the port accepts connections, would miss.

AliveMCP Author tier shows certificate expiry date in the server monitoring dashboard and generates a warning alert 14 days before expiry and a critical alert 3 days before expiry. This gives you time to renew before the certificate actually expires, avoiding the outage.

For certificates with auto-renewal (Let's Encrypt via certbot or Caddy's built-in renewal, or AWS Certificate Manager's managed renewal for ACM-issued certificates), certificate expiry monitoring is a belt-and-suspenders check on whether the auto-renewal worked. Let's Encrypt certificates have a 90-day validity period; auto-renewal typically fires 30 days before expiry. If your monitoring shows a certificate with fewer than 30 days remaining, the renewal that should already have run has likely failed silently. See MCP server SSL certificate for the full TLS monitoring and renewal pattern.
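
If you want a self-hosted check alongside the dashboard, a minimal sketch using Node's built-in `tls` module might look like the following. The 14-day and 3-day thresholds mirror the alert levels described above; the `renewal_overdue` label (for a certificate inside the typical 30-day renewal window) and both function names are assumptions for this example.

```javascript
const tls = require('node:tls');

const DAY_MS = 24 * 60 * 60 * 1000;

// Map time-until-expiry to a severity label.
function classifyExpiry(validTo, now = new Date()) {
  const daysLeft = Math.floor((validTo.getTime() - now.getTime()) / DAY_MS);
  let severity = 'ok';
  if (daysLeft < 0) severity = 'expired';
  else if (daysLeft <= 3) severity = 'critical';
  else if (daysLeft <= 14) severity = 'warning';
  else if (daysLeft < 30) severity = 'renewal_overdue'; // auto-renewal should have run by now
  return { daysLeft, severity };
}

// Read the peer certificate's expiry date. rejectUnauthorized is disabled only so
// that an already-expired certificate can still be inspected; never send data
// over this socket.
function fetchCertExpiry(host, port = 443) {
  return new Promise((resolve, reject) => {
    const socket = tls.connect(
      { host, port, servername: host, rejectUnauthorized: false },
      () => {
        const cert = socket.getPeerCertificate();
        socket.end();
        if (!cert || !cert.valid_to) {
          reject(new Error('no certificate presented'));
          return;
        }
        // valid_to is a date string such as "Jan  1 00:00:00 2026 GMT".
        resolve(new Date(cert.valid_to));
      }
    );
    socket.on('error', reject);
  });
}

// Usage (hypothetical hostname):
// fetchCertExpiry('mcp.example.com').then((d) => console.log(classifyExpiry(d)));
```

Keeping the classification separate from the network call makes the thresholds easy to unit test and to adjust for certificates with different lifetimes.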

Dependency vulnerability scanning

Your MCP server's npm or pip dependencies are part of its attack surface. A high-severity vulnerability in a transitive dependency can expose your server to remote code execution even if your own code is clean.

Minimum viable dependency security:

Run npm audit or pip-audit in CI on every merge to main. Block merges on high-severity or critical findings.
Configure Dependabot (GitHub) or Renovate to open automatic PRs when vulnerability patches are available. Review and merge these promptly — a 30-day-old Dependabot PR sitting unmerged is an unmitigated vulnerability in production.
Subscribe to security advisories for your key dependencies (MCP SDK, HTTP framework, auth library). The npm security advisory RSS feed and GitHub Advisory Database both support email or webhook subscriptions.
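
For the CI gate, the simplest route is usually `npm audit --audit-level=high`, which exits non-zero when findings at that level or above exist. If you need custom logic (for example, reporting counts to a dashboard), the sketch below shows one possible approach: it inspects the parsed output of `npm audit --json`, assuming the report exposes per-severity counts under `metadata.vulnerabilities`. The function name `shouldBlockMerge` is hypothetical.

```javascript
// Decide whether to block a merge from a parsed `npm audit --json` report.
// Assumption: per-severity counts live under metadata.vulnerabilities.
function shouldBlockMerge(auditReport, blockingSeverities = ['high', 'critical']) {
  const counts = (auditReport.metadata && auditReport.metadata.vulnerabilities) || {};
  // Sum the findings at each blocking severity; missing keys count as zero.
  const blocking = blockingSeverities.reduce((n, sev) => n + (counts[sev] || 0), 0);
  return { block: blocking > 0, blocking };
}

const sampleReport = {
  metadata: { vulnerabilities: { info: 0, low: 2, moderate: 1, high: 1, critical: 0 } },
};
console.log(shouldBlockMerge(sampleReport)); // { block: true, blocking: 1 }

// In CI (illustrative): run `npm audit --json > audit.json`, parse the file with
// JSON.parse, and call process.exit(1) when shouldBlockMerge(...).block is true.
```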

If you're running an MCP server that handles sensitive data or has privileged access to downstream systems (calendar, email, financial APIs), treat dependency vulnerabilities as production incidents, not development backlog items.

Supply chain health monitoring

If your agents pull third-party MCP servers from registries (MCP.so, Smithery, Glama, the Official Registry), those third-party servers are part of your supply chain. A third-party MCP server that's been dormant for 6 months, has a compromised maintainer account, or is silently returning malformed tool definitions is a risk to your agent workflows.

AliveMCP's public registry audit monitors every listed MCP endpoint and tracks health over time. The Q2 2026 audit found 91% of public MCP endpoints either dead or returning protocol errors — see State of the MCP Registry Q2 2026 for the full methodology. For teams that depend on specific third-party MCP servers, monitoring those servers' health status in AliveMCP's public dashboard gives you advance warning when a dependency is degrading — before your agents start failing silently.

Supply chain security for MCP goes beyond uptime. Verify third-party MCP servers you depend on:

Is the source code public and auditable?
Is the maintainer active (recent commits, responsive to issues)?
Is the tool schema stable (no unexpected additions that could inject new tool behaviors into your agents)?
Does the server require broad permissions your use case doesn't need (OAuth scopes, API key access)?

What external probing cannot tell you

AliveMCP monitors the availability and protocol health of your MCP endpoint from outside. It is not a Security Information and Event Management (SIEM) system. It cannot:

Detect auth failure patterns inside the server (it only probes with its own monitoring credential, not with attacker credentials).
Detect unauthorized access that used valid credentials.
Analyze log patterns for anomalous behavior.
Scan your server's runtime memory or filesystem for indicators of compromise.
Detect exfiltration of data via tool call responses (it doesn't analyze tool call content).

For these capabilities, you need a dedicated security tool: server-side log analysis (Splunk, Elastic SIEM), runtime security monitoring (Falco for containers), or a managed security service. External probe monitoring from AliveMCP sits at the availability layer — the bottom of the security stack, not the top. A complete security posture layers external availability monitoring, server-side auth and rate monitoring, vulnerability scanning, and (for sensitive workloads) runtime security monitoring.