
Cloud monitoring for MCP servers

Every major cloud platform includes a monitoring service: AWS CloudWatch, GCP Cloud Monitoring, Azure Monitor. These are mature, well-documented, and already included in your cloud bill. They are also structurally unable to answer the question that matters most for MCP server operators: can agents on the public internet actually connect and use the tools right now? Understanding the three gaps they all share explains why cloud monitoring and external protocol monitoring are complements, not substitutes.

TL;DR

Cloud-native monitoring tools share three gaps for MCP: (1) they see HTTP, not JSON-RPC protocol compliance; (2) they measure from inside the cloud, not from agents on the public internet; (3) they cover only your MCPs, not third-party MCPs your agents depend on. None of these gaps are design flaws — they're structural properties of cloud-internal observability. AliveMCP fills them with external protocol probes. Join the waitlist to add external MCP monitoring alongside your existing cloud monitoring.

The three structural gaps

Gap 1 — HTTP ≠ JSON-RPC protocol

Cloud monitoring platforms operate at the HTTP layer. CloudWatch measures request counts, latency distributions, and HTTP status codes from the Application Load Balancer or API Gateway. GCP Cloud Monitoring measures the same from Cloud Load Balancing or Cloud Run's internal metrics. Azure Monitor measures the same from Application Gateway (the layer-7 counterpart to the layer-4 Azure Load Balancer) and Container Apps ingress.

An MCP server's primary communication is JSON-RPC over HTTP. The cloud monitoring platform sees the outer HTTP envelope: a POST request to /mcp with a 200 response and some latency value. It does not inspect the JSON-RPC payload inside. The response can be:

A valid JSON-RPC success response — the server is working
A JSON-RPC error (-32600 Invalid Request, -32601 Method not found) — the MCP router is broken
A serialization error that produces malformed JSON — a code bug hit by a recent deploy
A valid JSON body but with an empty tools array — the tool registry lost its connection

In all four cases, the cloud monitoring layer records: one successful HTTP 200 request, normal latency, no error. The first case is healthy. The other three are P1 incidents from the agent's perspective. Cloud monitoring cannot distinguish between them without custom code that inspects the response body — and even then, you're measuring from the cloud's side, not the client's side.
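The four cases above can be told apart only by parsing the response body. The following is a minimal sketch of that classification, assuming the server answers a tools/list request with a plain JSON body (not an event stream). The function name and the returned labels are hypothetical, chosen for illustration.

```python
import json


def classify_mcp_response(http_status: int, body: str) -> str:
    """Classify a tools/list response the way an MCP-aware probe might.

    The labels returned here are illustrative, not part of any standard.
    """
    if http_status != 200:
        return "http_error"
    try:
        msg = json.loads(body)
    except json.JSONDecodeError:
        # Case 3: the HTTP layer saw a 200, but the payload is not valid JSON.
        return "malformed_json"
    if not isinstance(msg, dict) or msg.get("jsonrpc") != "2.0":
        return "not_jsonrpc"
    if "error" in msg:
        # Case 2: a JSON-RPC error object such as -32600 or -32601.
        return "jsonrpc_error"
    result = msg.get("result")
    tools = result.get("tools") if isinstance(result, dict) else None
    if not isinstance(tools, list):
        return "missing_tools"
    if not tools:
        # Case 4: well-formed response, but no tools are registered.
        return "empty_tools"
    # Case 1: a valid success response with at least one tool.
    return "healthy"
```

All four inputs would be logged by the load balancer as the same HTTP 200; only the first yields "healthy" here.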

This is the fundamental difference between an HTTP probe and an MCP health check: an HTTP probe asks "did the server respond?"; an MCP health check asks "did the server respond with a valid MCP handshake, tool registry, and correct schema?"

Gap 2 — Internal ≠ external perspective

Cloud monitoring measures from inside the cloud platform. CloudWatch Canaries (Synthetics) run from Lambda functions inside AWS's network. GCP Cloud Monitoring Uptime Checks run from Google's probe nodes. Azure Application Insights availability tests run from Azure's probe infrastructure.

None of these probes are on the same network path that external agents traverse. An agent on a developer's laptop is connecting through their ISP, through whatever transit providers carry traffic to your cloud, through any CDN layer in front of your server, through TLS termination, and into the load balancer. Each of these hops can fail independently of what cloud-internal probes see.

Specific failure modes that cloud-internal probes miss:

CDN misconfiguration — a Cloudflare rule accidentally blocks POST requests to your MCP endpoint. CloudWatch sees the origin as healthy (Canaries can bypass CDN). External agents get 403s.
Regional routing issue — a BGP route change makes your server unreachable from a specific ISP or geographic region. Cloud-internal probes, which are on the same backbone as your server, don't experience the routing issue.
TLS certificate chain problem — cloud-internal probes may use a different trust store or certificate validation path than external clients. A certificate that passes internal validation may fail from external agent runtimes.
DNS misconfiguration — if your domain's DNS is broken (bad TTL, wrong record), external agents can't resolve the hostname. Cloud-internal probes may use an internal DNS that bypasses public resolution.

External monitoring from public internet probes is the only way to measure what agents actually experience.
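As a rough illustration of how an external probe might attribute a failure to one of the layers listed above, the sketch below maps the exceptions raised by Python's standard urllib client to a coarse failure layer. The function name and labels are hypothetical, and a real probe would likely need finer distinctions (for example, separating a routing timeout from a slow origin).

```python
import socket
import ssl
import urllib.error


def classify_probe_failure(exc: BaseException) -> str:
    """Map an exception from an external HTTP probe to a coarse failure layer."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_{exc.code}"  # e.g. a CDN rule returning 403
    # urllib wraps lower-level errors in URLError.reason; unwrap them.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    if isinstance(exc, socket.gaierror):
        return "dns"  # the hostname did not resolve
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "tls_certificate"  # chain or hostname validation failed
    if isinstance(exc, ssl.SSLError):
        return "tls"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"  # possibly a routing problem on the path
    if isinstance(exc, ConnectionError):
        return "connection"
    return "unknown"
```

The value of this classification depends on where it runs: the same code executed inside the cloud network would not see the DNS, CDN, or routing failures that a probe on the public internet does.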

Gap 3 — Your MCPs only, not third-party MCPs

If your agents use third-party MCP servers — tools from the MCP registry, partner integrations, open-source MCPs you host externally — cloud monitoring covers none of them. CloudWatch monitors AWS resources you own. Cloud Monitoring monitors GCP resources you own. You're dependent on the third party's status page, or on your own agents' error rates as a lagging indicator of their availability.

For agent applications that mix first-party and third-party MCP tools, external monitoring of all the MCP endpoints (yours and theirs) gives you a complete availability picture. When an agent turn fails, you can quickly determine whether the failure is in a tool you own and can fix, or in a third-party tool you depend on.
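One way to picture that triage step is the small sketch below. It assumes you keep an inventory of every MCP endpoint your agents call, tagged by ownership, and that a probe run yields the set of URLs currently failing. The Endpoint type, the owner labels, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    url: str
    owner: str  # "first_party" or "third_party" (illustrative labels)


def triage(endpoints, failing_urls):
    """Split failing endpoints into those you can fix and those you must escalate."""
    report = {"first_party": [], "third_party": []}
    for ep in endpoints:
        if ep.url in failing_urls:
            report[ep.owner].append(ep.url)
    return report


inventory = [
    Endpoint("https://mcp.example.com/mcp", "first_party"),
    Endpoint("https://partner.example.net/mcp", "third_party"),
]
report = triage(inventory, {"https://partner.example.net/mcp"})
```

Here the report shows no first-party failures and one third-party failure, so the next step is the partner's status page rather than your own deploy history.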

Cloud platform MCP deployment patterns

AWS

The most common AWS production patterns for MCP servers:

ECS Fargate + Application Load Balancer: A containerized MCP server running as an ECS service behind an ALB with HTTPS termination. CloudWatch integration is automatic: ALB access logs, target group health (HTTP), ECS container metrics (CPU, memory, network). The ALB health check pings /health with HTTP GET — not a JSON-RPC probe.
AWS App Runner: Managed container hosting with automatic HTTPS. Simpler than ECS for teams without container ops experience. App Runner health checks are HTTP-based. CloudWatch integration provides request metrics, latency, and 4xx/5xx rates.
Lambda + Function URL: HTTP Function URLs make Lambda directly accessible via HTTPS without an API Gateway in front. Cold starts are the primary concern. CloudWatch provides invocation count, error count, and duration as standard metrics; cold starts are not a standard metric and surface instead as Init Duration in the function's REPORT log lines. Duration is measured as the Lambda execution time, not the end-to-end latency from the caller's perspective (which includes the TLS handshake and network transit).
EC2 with nginx/Caddy: Manual deployment gives full control but requires you to configure CloudWatch Agent for OS-level metrics and set up CloudWatch alarms manually. More operational overhead but appropriate for specific compliance or network control requirements.

AWS CloudWatch Synthetics Canaries can be scripted to POST a JSON-RPC request and check the response body — this is the closest AWS gets to MCP-protocol monitoring natively. The scripting overhead is significant, and canaries run from within AWS's network, not from external agent locations.
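Independent of the Synthetics runtime, the messages such a scripted probe would send can be sketched as follows. This is a minimal sketch using only the Python standard library, not canary code: the protocol version string, the client name, and the helper names are assumptions, and the headers follow the MCP streamable HTTP transport as I understand it, so check them against the revision your server implements.

```python
import json
import urllib.request

# Assumption: replace with the protocol revision your server supports.
PROTOCOL_VERSION = "2025-03-26"


def build_probe_messages():
    """Return the JSON-RPC messages for a minimal handshake plus tools/list."""
    initialize = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {},
            "clientInfo": {"name": "example-probe", "version": "0.1"},
        },
    }
    # Notifications carry no id and expect no response.
    initialized = {"jsonrpc": "2.0", "method": "notifications/initialized"}
    tools_list = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}
    return [initialize, initialized, tools_list]


def build_request(url, message, session_id=None):
    """Wrap one JSON-RPC message in an HTTP POST; sending it is left to the caller."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    if session_id:
        # Servers that issue a session id on initialize expect it echoed back.
        headers["Mcp-Session-Id"] = session_id
    data = json.dumps(message).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

A probe would send each request in order with urllib.request.urlopen and inspect every response body; the point of the sketch is that a meaningful check needs the handshake and a body assertion, which is where the scripting overhead comes from.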

GCP

GCP Cloud Run is the dominant MCP server deployment platform on Google Cloud. Cloud Run provides automatic HTTPS, managed TLS certificates, scale-to-zero, and built-in Cloud Monitoring integration. Cloud Monitoring Uptime Checks can test HTTP endpoints with optional body matching — you can configure an uptime check to POST to your MCP endpoint and assert the response contains a specific string. This is a coarse MCP-protocol test but is better than an HTTP ping check.
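To see why a string match is only a coarse test, compare it with a parsed check on the same response body. The sketch below is illustrative: both function names are hypothetical, and the string matcher stands in for an uptime check's content assertion rather than reproducing its actual implementation.

```python
import json


def string_match_check(body: str, needle: str = '"tools"') -> bool:
    """Roughly what a 'response contains string' assertion can verify."""
    return needle in body


def parsed_check(body: str) -> bool:
    """Parse the body and require a non-empty tools array in the result."""
    try:
        msg = json.loads(body)
    except json.JSONDecodeError:
        return False
    result = msg.get("result") if isinstance(msg, dict) else None
    tools = result.get("tools") if isinstance(result, dict) else None
    return isinstance(tools, list) and len(tools) > 0


# A response from a server whose tool registry has gone empty.
empty_registry = '{"jsonrpc":"2.0","id":1,"result":{"tools":[]}}'
```

With the empty-registry body, string_match_check passes because the key is present, while parsed_check fails because the array holds no tools.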

Cloud Logging captures a request log entry for every request to a Cloud Run service, with metadata such as method, URL, status, latency, and response size, but not the request or response bodies. For deep protocol-level analysis, you need to log MCP-specific events from within your server code and query them in Cloud Logging.
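A minimal sketch of such in-server logging is shown below. It relies on Cloud Run treating JSON lines written to stdout as structured log entries, with severity and message as recognized fields; the remaining field names (mcp_method, mcp_ok, duration_ms, jsonrpc_error_code) and both function names are hypothetical and should be adapted to your own schema.

```python
import json
import sys


def mcp_log_line(method, ok, duration_ms, error_code=None):
    """Build one structured log line describing a handled MCP request."""
    entry = {
        "severity": "INFO" if ok else "ERROR",
        "message": f"mcp {method} {'ok' if ok else 'failed'}",
        "mcp_method": method,
        "mcp_ok": ok,
        "duration_ms": round(duration_ms, 1),
    }
    if error_code is not None:
        entry["jsonrpc_error_code"] = error_code
    return json.dumps(entry)


def log_mcp_event(method, ok, duration_ms, error_code=None):
    """Write the line to stdout, where the platform's log agent collects it."""
    print(mcp_log_line(method, ok, duration_ms, error_code), file=sys.stdout, flush=True)
```

Calling log_mcp_event("tools/list", False, 12.34, error_code=-32601) from the request handler would produce an ERROR-severity entry that can then be filtered on the custom fields in Cloud Logging.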

GKE (Google Kubernetes Engine) is appropriate for larger MCP fleets that need multi-container architectures or specific Kubernetes capabilities. Monitoring setup is more involved: you need Workload Identity for metric export, Cloud Monitoring's GKE integration, and custom ServiceMonitor resources if you're using Prometheus for in-process metrics.

Multi-cloud MCP fleets

Some agent applications run MCP servers across multiple clouds — a first-party MCP on AWS, a partner integration on GCP, and a data-access tool on Azure. Cloud-native monitoring is siloed by cloud: you'd need CloudWatch + Cloud Monitoring + Azure Monitor, three separate consoles and three separate alert pipelines, to get a complete picture. External monitoring that probes all endpoints regardless of cloud is a simpler operational model for multi-cloud fleets.

Where cloud monitoring and AliveMCP fit together

The clearest division of responsibility:

Signal	Cloud monitoring	AliveMCP
Container / function health	Primary	—
CPU / memory / resource usage	Primary	—
HTTP error rate (your code)	Primary	—
Application-level traces	Primary	—
JSON-RPC protocol compliance	Requires custom scripting	Native
tools/list completeness	Not supported	Native
External availability (public internet)	Approximate (cloud-internal probes)	Primary
TLS certificate expiry alert	Not standard	Native
Third-party MCP availability	Not supported	Supported
30-day availability dashboard	Requires custom query	Built-in
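For the last row, the arithmetic behind an availability figure is simple once probe results are stored. The sketch below assumes one boolean per probe over the window of interest and equal spacing between probes; the function name is hypothetical, and how any particular product computes its dashboard figure is not specified here.

```python
def availability_percent(probe_results):
    """Percentage of probes in the window that succeeded."""
    if not probe_results:
        raise ValueError("no probe results in the window")
    successes = sum(1 for ok in probe_results if ok)
    return 100.0 * successes / len(probe_results)


# Illustrative window: 1,000 probes with a single failure.
window = [True] * 999 + [False]
availability = availability_percent(window)
```

The custom-query route on a cloud platform computes the same ratio, but over whatever success signal the platform records, which is typically the HTTP status rather than a protocol-level result.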

A reasonable monitoring stack for a production MCP server on any cloud: cloud-native monitoring for infrastructure and application performance, AliveMCP for external protocol availability. The two alert pipelines handle different failure modes. Cloud monitoring fires when the infrastructure is stressed. AliveMCP fires when agents can't actually use the server — which is the failure your users experience.

The cost equation: cloud-native monitoring is included in your compute bill. AliveMCP Free tier monitors one public endpoint at no charge. Author tier ($9/mo) adds expiry alerts, custom thresholds, and webhook notifications. For most MCP authors, the monitoring cost is 1–5% of their compute cost, and it catches failures that spending more on compute cannot prevent.
