
Monitoring AWS-hosted MCP servers

AWS is the most common infrastructure choice for MCP servers that have graduated from Vercel or Railway to something more production-grade. But AWS-hosted MCPs fail in platform-specific ways — Lambda cold starts, API Gateway timeout limits, IAM role expiry, VPC egress filtering — that generic uptime monitors miss entirely. Here's what to watch, how each failure mode looks in a probe log, and how to wire alerts that catch the right thing.

TL;DR

The four most common failure modes for AWS-hosted MCP servers: (1) Lambda cold starts exceeding API Gateway's 29-second hard timeout, producing 504s, (2) IAM role or STS credential expiry producing 403s on downstream AWS service calls, (3) VPC egress rules blocking outbound requests that the MCP handler depends on, (4) Lambda concurrency limits causing queuing that looks like high latency before it looks like failures. Each has a distinct probe signature. The fix is an MCP-aware monitor that probes the full JSON-RPC handshake — not just an HTTP ping.
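As a rough illustration of what "probing the full JSON-RPC handshake" involves, the sketch below builds an MCP initialize request and checks that a response body is a genuine JSON-RPC result rather than just any HTTP 200. The protocol version string, the client name, and both helper names are assumptions chosen for the example; servers that answer over SSE would need the event payload extracted before this check.

```python
import json


def build_initialize_request(request_id=1, protocol_version="2024-11-05"):
    """Build the JSON-RPC body for an MCP initialize call (illustrative values)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": protocol_version,
            "capabilities": {},
            # Hypothetical client identity for the probe.
            "clientInfo": {"name": "example-probe", "version": "0.1"},
        },
    }


def is_valid_initialize_response(body_text, request_id=1):
    """Return True only if the body is a JSON-RPC result answering our request."""
    try:
        msg = json.loads(body_text)
    except (TypeError, ValueError):
        return False  # not JSON at all (HTML error page, empty body, ...)
    if not isinstance(msg, dict):
        return False
    if msg.get("jsonrpc") != "2.0" or msg.get("id") != request_id:
        return False
    result = msg.get("result")
    return isinstance(result, dict) and "protocolVersion" in result
```

An HTTP ping would accept any response with a success status; this check rejects gateway error bodies and JSON-RPC errors alike.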

AWS hosting patterns for MCP servers

MCP servers on AWS cluster around three hosting patterns, each with different monitoring characteristics:

Failure mode 1: Lambda cold start + API Gateway 29-second timeout

API Gateway REST APIs have a default maximum integration timeout of 29 seconds (long a hard limit; it can now be raised only for Regional and private REST APIs through a quota increase). Lambda cold starts, especially for JVM or Python runtimes with heavyweight MCP handlers, can exceed this limit, producing a 504 Gateway Timeout before the Lambda even finishes initializing. The probe sees a 504 with a body like {"message":"Endpoint request timed out"} rather than a JSON-RPC response.

Probe signature: a single 504 after an idle period, followed by recovery on the next probe (the Lambda is now warm). This looks identical to a flapping MCP server, but it is distinguishable in two ways: (1) the failure follows an idle stretch long enough for Lambda to reclaim the execution environment (AWS does not publish a fixed interval; it is typically a matter of minutes), (2) the response body is API Gateway's timeout message, not a JSON-RPC error.
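The two distinguishing checks can be sketched as a small classifier. This is an illustrative sketch only: the label strings and function names are made up for the example, and the only value taken from the text is API Gateway's timeout message.

```python
import json

GATEWAY_TIMEOUT_MESSAGE = "Endpoint request timed out"


def classify_probe_response(status, body_text):
    """Label one probe response by who produced it (gateway vs. MCP server)."""
    try:
        msg = json.loads(body_text)
    except (TypeError, ValueError):
        msg = None
    if not isinstance(msg, dict):
        msg = {}
    if status == 504 and msg.get("message") == GATEWAY_TIMEOUT_MESSAGE:
        return "api_gateway_timeout"
    if msg.get("jsonrpc") == "2.0":
        return "jsonrpc_error" if "error" in msg else "jsonrpc_result"
    return "other"


def cold_start_candidates(labels):
    """Indexes of isolated gateway timeouts that recover on the very next probe."""
    hits = []
    for i, label in enumerate(labels):
        isolated = i == 0 or labels[i - 1] != "api_gateway_timeout"
        recovered = i + 1 < len(labels) and labels[i + 1] == "jsonrpc_result"
        if label == "api_gateway_timeout" and isolated and recovered:
            hits.append(i)
    return hits
```

Repeated gateway timeouts in a row are deliberately excluded: a run of 504s points to something other than a one-off cold start.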

Fixes:

Failure mode 2: IAM role or STS credential expiry

MCP servers that call AWS services (DynamoDB, S3, Bedrock, etc.) use an execution role whose temporary credentials are issued and rotated through STS. In correctly-configured Lambda environments, this is transparent: the Lambda service injects the role's temporary credentials into the execution environment, and the SDK reads them from there (the instance metadata service plays this role on EC2, not on Lambda). It fails silently when:

Probe signature: the MCP initialize call succeeds (auth happens at the API Gateway layer, not the IAM layer), but tools/list or subsequent tool calls return a JSON-RPC error with a message containing "AccessDeniedException" or "ExpiredTokenException". The server is "up" at the protocol layer but cannot execute any tools — an HTTP uptime monitor shows green while every agent using the MCP is getting errors.
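A minimal sketch of detecting this signature from two parsed JSON-RPC messages follows. The function names are hypothetical. Besides the JSON-RPC error message described above, the sketch also inspects tool results flagged with isError, which is how the MCP specification lets a server report a tool-level failure; whether a given server uses one form or the other is an assumption to verify.

```python
CREDENTIAL_MARKERS = ("AccessDeniedException", "ExpiredTokenException")


def _error_text(msg):
    """Collect error text from a JSON-RPC error or an isError tool result."""
    parts = []
    error = msg.get("error")
    if isinstance(error, dict):
        parts.append(str(error.get("message", "")))
    result = msg.get("result")
    if isinstance(result, dict) and result.get("isError"):
        for item in result.get("content", []):
            if isinstance(item, dict) and item.get("type") == "text":
                parts.append(str(item.get("text", "")))
    return " ".join(parts)


def credential_failure(initialize_msg, tool_msg):
    """Return the matched marker if initialize succeeded but the tool call
    failed with an IAM/STS-style error; otherwise return None."""
    if "result" not in initialize_msg:
        return None  # the server is down at the protocol layer, a different failure
    text = _error_text(tool_msg)
    for marker in CREDENTIAL_MARKERS:
        if marker in text:
            return marker
    return None
```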

Fix: Use an MCP-aware monitor that probes tools/list and optionally runs a read-only test tool call. AliveMCP's four-layer probe catches this at layer 4 (tool surface) rather than layer 1 (TCP). On the AWS side: use execution roles with automatic credential rotation, never hardcode STS tokens in environment variables, and add STS VPC endpoints to any private Lambdas that call AWS services.

Failure mode 3: VPC egress filtering

MCP servers in a VPC (common for private enterprise MCPs) often have egress rules that restrict outbound traffic. When the MCP handler makes an outbound call (to a third-party API, to an S3 endpoint, to a DynamoDB endpoint in another region) and the security group or NACL blocks that egress, the call hangs until Lambda's function timeout (up to 15 minutes). From the probe's perspective: the MCP initialize returns quickly, but a tool call that triggers the blocked egress hangs until timeout.

Probe signature: initialize latency is normal (200–500ms), tools/list returns immediately (tools are statically registered), but any call to a tool that makes an outbound request times out. If your probe only checks initialize and tools/list, this failure mode is invisible — the server looks healthy while all meaningful tool calls fail for users.

Fix: Add VPC endpoints for every AWS service your Lambda calls (S3, DynamoDB, STS, Bedrock, etc.) so egress stays inside the AWS network and isn't subject to internet-bound security group rules. For third-party API calls, use a NAT Gateway. On the monitoring side: if your MCP has idempotent read-only tools, configure AliveMCP's credentialed probe (Author tier) to run one test tool call per probe and alert if it fails with a timeout while tools/list succeeds.
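The "tool call times out while tools/list succeeds" rule can be sketched as a comparison across probe steps. The StepResult type, the label strings, and the 500 ms handshake threshold (taken from the 200–500ms range quoted above) are assumptions for illustration, not part of any monitor's actual API.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    ok: bool
    latency_ms: float
    timed_out: bool = False


def diagnose_egress(initialize, tools_list, tool_call, max_handshake_ms=500.0):
    """Flag a likely egress block: healthy handshake, but the tool call hangs."""
    handshake_healthy = (
        initialize.ok
        and tools_list.ok
        and initialize.latency_ms <= max_handshake_ms
    )
    if handshake_healthy and tool_call.timed_out:
        return "possible_egress_block"
    if handshake_healthy and tool_call.ok:
        return "healthy"
    return "inconclusive"
```

The result is labelled "possible" on purpose: a tool that hangs on a slow downstream dependency produces the same pattern, so VPC flow logs or security group rules still need checking to confirm.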

Failure mode 4: Lambda concurrency exhaustion

AWS Lambda accounts have a default concurrency limit of 1000 concurrent executions per region, shared across all functions. If your MCP server is deployed in an account that also runs other high-traffic Lambdas, a traffic spike on those functions can exhaust the concurrency pool and throttle your MCP Lambda — producing 429s at the API Gateway level. At low concurrency limits (e.g., you've set a reserved concurrency limit on the MCP function itself), even a single slow tool call that holds a Lambda for 30 seconds can block all concurrent probes.

Probe signature: sporadic 429 responses at the API Gateway layer (before the MCP protocol layer is ever reached). The probe log shows HTTP 429 rather than a JSON-RPC error. Recovery is automatic once the throttled Lambdas finish, usually within a few seconds to a minute.

Fix: Set a reserved concurrency limit on your MCP Lambda that guarantees it a portion of the account pool (e.g., reserve 10 of your 1000 concurrent executions for the MCP function, preventing throttle-by-neighbor). Monitor Lambda throttle metrics in CloudWatch alongside your external probe results — throttle spikes that don't appear in external probe failures indicate the throttle resolved before your next probe; throttle spikes that do appear indicate the concurrency limit is too low for the probe interval.
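The comparison between throttle spikes and probe failures can be sketched as a simple timestamp match. The sketch assumes you have already exported two lists of timestamps: minutes in which the Lambda Throttles metric in CloudWatch was non-zero, and probe failures from your monitor's log. How those lists are fetched and the one-minute matching window are left as assumptions.

```python
from datetime import datetime, timedelta


def split_throttle_spikes(throttle_times, probe_failure_times,
                          window=timedelta(minutes=1)):
    """Partition throttle spikes into those a probe observed and those it missed."""
    seen, missed = [], []
    for spike in throttle_times:
        observed = any(
            abs(failure - spike) <= window for failure in probe_failure_times
        )
        (seen if observed else missed).append(spike)
    return seen, missed
```

Spikes in the "missed" list resolved before the next probe ran; spikes in the "seen" list are the ones suggesting the concurrency limit is too low for the probe interval.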

Monitoring ECS Fargate MCP servers

Fargate MCPs avoid cold starts but introduce a different failure mode: rolling deploys. During a Fargate service update, ECS registers the new task with the ALB and drains the old one. While the old task drains (the ALB deregistration delay defaults to 300 seconds, and a stopping container gets 30 seconds by default before it is killed), the endpoint is served by a mix of the draining task and a new task that is still completing its health check. If the new task's MCP server takes longer to become healthy than its health checks allow (e.g., it needs to load a model or warm a cache), it may fail its health check, causing ECS to replace the task or roll back, and leaving your MCP endpoint in an inconsistent state mid-probe.

Monitoring recommendation: set the ECS service's health check grace period (healthCheckGracePeriodSeconds) to at least 60 seconds for Fargate MCPs, and configure your external monitor to alert only after 3 consecutive failures (to absorb the rolling-deploy window). Tag deploy events in AliveMCP's maintenance window feature so that deploy-time failures don't count against your SLO uptime.
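The alerting rule can be sketched as follows: ignore probes that fall inside a maintenance window, then alert only when the most recent three counted probes all failed. The data shapes here (timestamps as plain numbers, windows as start/end pairs) are assumptions for the example and do not reflect AliveMCP's actual configuration format.

```python
def should_alert(results, maintenance_windows=(), threshold=3):
    """results: chronological (timestamp, ok) pairs.
    maintenance_windows: (start, end) pairs, inclusive, same time unit."""
    counted = [
        ok
        for ts, ok in results
        if not any(start <= ts <= end for start, end in maintenance_windows)
    ]
    if len(counted) < threshold:
        return False  # not enough evidence yet
    # Alert only if every one of the last `threshold` counted probes failed.
    return not any(counted[-threshold:])
```

With a 60-second probe interval, a threshold of 3 means an outage is reported roughly three minutes after it starts, which is the trade-off for not paging on every rolling deploy.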

Related questions

Should I use CloudWatch for MCP monitoring or an external monitor?

Both, for different failure modes. CloudWatch monitors internal Lambda metrics (errors, throttles, duration) but cannot probe the MCP protocol — it doesn't know if initialize returned a valid result or if tools/list returned the expected tool set. An external monitor like AliveMCP probes from outside AWS and catches failures that are invisible to CloudWatch: misconfigured API Gateway routes, TLS certificate expiry, IP blocklist hits, and DNS misconfiguration. The combination catches both internal resource failures and external-facing protocol failures.

How does AliveMCP handle Lambda endpoints behind authentication?

AliveMCP's Author tier ($9/mo) supports credential-based probing: you supply an API key, Bearer token, or AWS Signature V4 credential, and AliveMCP sends it on every probe. Credentials are stored encrypted and used only for your endpoint's probes. You can rotate them through the dashboard without interrupting probe coverage.

What's the right probe timeout for a Lambda MCP behind API Gateway?

Set 28 seconds, just under API Gateway's default 29-second integration limit. This ensures your probe times out before API Gateway does, so a stalled request is logged as a "timeout from monitor" rather than a "504 from API Gateway". The distinction matters for diagnosis: a monitor timeout at 28 seconds means the request was still in flight when the probe gave up (usually a Lambda cold start or a slow tool call), while a 504 that arrives before 28 seconds means a gateway-side limit shorter than 29 seconds was hit, such as a lowered integration timeout.
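A minimal sketch of a probe that applies its own timeout and labels the outcome, so that a monitor-side timeout and a gateway 504 stay distinct in the probe log. It uses only the Python standard library; the function name, the label strings, the request values, and the injectable opener parameter (there to make the function testable without a network) are assumptions. Note that urllib applies the timeout to individual socket operations rather than to the whole request, so 28 seconds is approximate here.

```python
import json
import socket
import urllib.error
import urllib.request


def probe_initialize(url, timeout_s=28.0, opener=urllib.request.urlopen):
    """POST an MCP initialize request and return an (outcome, http_status) pair."""
    body = json.dumps({
        "jsonrpc": "2.0", "id": 1, "method": "initialize",
        "params": {"protocolVersion": "2024-11-05", "capabilities": {},
                   "clientInfo": {"name": "example-probe", "version": "0.1"}},
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "Accept": "application/json, text/event-stream"},
    )
    try:
        with opener(request, timeout=timeout_s) as response:
            return ("response", response.status)
    except urllib.error.HTTPError as exc:
        # The gateway (or server) answered with an HTTP error status.
        label = "gateway_504" if exc.code == 504 else "http_error"
        return (label, exc.code)
    except (socket.timeout, TimeoutError):
        return ("monitor_timeout", None)  # our own timeout fired first
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return ("monitor_timeout", None)
        return ("connection_error", None)
```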

Further reading