Monitoring AWS-hosted MCP servers

AWS is the most common infrastructure choice for MCP servers that have graduated from Vercel or Railway to something more production-grade. But AWS-hosted MCPs fail in platform-specific ways — Lambda cold starts, API Gateway timeout limits, IAM role expiry, VPC egress filtering — that generic uptime monitors miss entirely. Here's what to watch, how each failure mode looks in a probe log, and how to wire alerts that catch the right thing.

TL;DR

The four most common failure modes for AWS-hosted MCP servers: (1) Lambda cold starts exceeding API Gateway's 29-second default integration timeout, producing 504s; (2) IAM role or STS credential expiry producing access-denied or expired-token errors on downstream AWS service calls; (3) VPC egress rules blocking outbound requests that the MCP handler depends on; (4) Lambda concurrency limits throttling requests, which shows up as sporadic errors before it becomes a sustained outage. Each has a distinct probe signature. The fix is an MCP-aware monitor that probes the full JSON-RPC handshake, not just an HTTP ping.
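As a rough illustration of what "the full JSON-RPC handshake" means for a probe, the sketch below builds the three messages an MCP client sends in order (initialize, the initialized notification, tools/list) and checks whether a response body is a JSON-RPC success. The protocol version string, client name, and function names are assumptions chosen for the example, not anything a particular monitor uses.

```python
import json

# Assumption: use whichever protocol revision your server actually negotiates.
PROTOCOL_VERSION = "2025-03-26"


def handshake_messages(client_name="example-probe", client_version="0.1"):
    """Return the JSON-RPC messages a probe sends, in order."""
    initialize = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": PROTOCOL_VERSION,
            "capabilities": {},
            "clientInfo": {"name": client_name, "version": client_version},
        },
    }
    # Notifications carry no "id" and expect no response.
    initialized = {"jsonrpc": "2.0", "method": "notifications/initialized"}
    tools_list = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}
    return [initialize, initialized, tools_list]


def is_jsonrpc_result(body, expected_id):
    """True only if body is a JSON-RPC 2.0 success response for expected_id."""
    try:
        msg = json.loads(body)
    except ValueError:
        return False  # HTML error page, empty body, etc.
    return (
        isinstance(msg, dict)
        and msg.get("jsonrpc") == "2.0"
        and msg.get("id") == expected_id
        and "result" in msg
    )
```

An HTTP ping would accept any 200; a check like is_jsonrpc_result also rejects gateway error bodies and JSON-RPC error objects.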

AWS hosting patterns for MCP servers

MCP servers on AWS cluster around three hosting patterns, each with different monitoring characteristics:

Lambda + API Gateway or Function URL: The most common pattern for hobby-to-small-production MCPs. Lambda runs the MCP request handler; API Gateway (or a Lambda Function URL) provides the HTTPS endpoint. Failure modes: cold start, 29-second default integration timeout (an API Gateway limitation), 6MB response payload limit (Lambda synchronous invocation), per-account concurrency limits.
ECS Fargate + Application Load Balancer: The step-up pattern for MCPs that need persistent connections, longer timeouts, or container-level dependencies (e.g., a local SQLite or embedded model). Failure modes: task health check failures, target group deregistration under rolling deploy, ALB idle timeout (60 seconds by default — shorter than long-running MCP tool calls).
EC2 + Caddy or nginx: The self-managed pattern, often used for MCPs that embed large language models or maintain warm in-memory state. Failure modes: instance stop/start losing in-memory state, security group rule changes blocking probe traffic, EIP detachment producing DNS propagation gaps.

Failure mode 1: Lambda cold start + API Gateway 29-second timeout

API Gateway has a default maximum integration timeout of 29 seconds (since mid-2024 it can be raised for Regional and private REST APIs through a quota increase, but most deployments still run with the default). Lambda cold starts, especially for JVM or Python runtimes with heavyweight MCP handlers, can exceed this limit, producing a 504 Gateway Timeout before the Lambda even finishes initializing. The probe sees a 504 with a body like {"message":"Endpoint request timed out"} rather than a JSON-RPC response.

Probe signature: a single 504 after an idle period, followed by recovery on the next probe (the Lambda is now warm). At first glance this looks identical to a flapping MCP server. It is distinguishable by: (1) the failure follows a stretch with no traffic long enough for Lambda to reclaim the execution environment (AWS neither documents nor lets you configure this idle period; in practice it is on the order of minutes), (2) the response body is API Gateway's timeout message, not a JSON-RPC error.
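The signature above can be turned into a simple check over three consecutive probe results. This is a minimal sketch, assuming the probe is the only traffic the function receives (so the gap between probes approximates idle time); ProbeResult, idle_threshold_s, and the 300-second default are hypothetical names and values for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ProbeResult:
    timestamp: float  # seconds since epoch
    status: int       # HTTP status code
    body: str         # raw response body


# The timeout body quoted in the text above.
GATEWAY_TIMEOUT_MESSAGE = "Endpoint request timed out"


def is_gateway_timeout(result):
    """True if the response is a 504 carrying the gateway's timeout message."""
    if result.status != 504:
        return False
    try:
        payload = json.loads(result.body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("message") == GATEWAY_TIMEOUT_MESSAGE


def looks_like_cold_start(prev, failed, nxt, idle_threshold_s=300.0):
    """Single gateway-timeout 504 after an idle gap, followed by recovery."""
    idle_gap = failed.timestamp - prev.timestamp
    return (
        is_gateway_timeout(failed)
        and idle_gap >= idle_threshold_s
        and nxt.status == 200
    )
```

A failure that matches this pattern is a candidate for a warning rather than a page; a 504 that repeats on the next probe is not a cold start.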

Fixes:

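Enable Lambda SnapStart (available for Java 11 and later, and more recently for Python 3.12+ and .NET 8 managed runtimes): the function restores from a pre-initialized snapshot, which typically cuts cold start from seconds to well under a second.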
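Set a Lambda provisioned concurrency allocation ≥ 1, which keeps one warm instance at all times. Cost scales with configured memory: at a rate of about $0.0000042 per GB-second (us-east-1, x86; check current pricing for your region) that is roughly $0.015/hour per GB, so ≈ $11/mo for a 1GB function and about half that for 512MB. A regular external probe such as AliveMCP's Author tier ($9/mo) can also keep an instance warm as a side effect, but provisioned concurrency is the right answer if you also need to protect against user-visible cold starts, not just probe cold starts.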
Use Lambda Function URLs instead of API Gateway: a Function URL is limited only by the function's own timeout (up to the 15-minute Lambda maximum), which removes the 29-second constraint for long tool calls.
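For the provisioned concurrency option, the cost estimate is plain arithmetic: configured memory in GB, times a per-GB-second rate, times the seconds the allocation is held. The sketch below assumes a rate of $0.0000041667 per GB-second and a 730-hour month; both are assumptions for illustration, so check the current Lambda pricing page for your region and architecture. It covers only the provisioned concurrency charge, not request or duration charges.

```python
HOURS_PER_MONTH = 730  # average month length in hours (assumption)


def provisioned_concurrency_cost(memory_mb, instances=1,
                                 rate_per_gb_second=0.0000041667):
    """Return (hourly_usd, monthly_usd) for keeping `instances` warm.

    rate_per_gb_second is an assumed price, not an authoritative one.
    """
    memory_gb = memory_mb / 1024
    hourly = memory_gb * rate_per_gb_second * 3600 * instances
    return hourly, hourly * HOURS_PER_MONTH


if __name__ == "__main__":
    for mb in (512, 1024):
        hourly, monthly = provisioned_concurrency_cost(mb)
        print(f"{mb} MB: ${hourly:.4f}/hour, ${monthly:.2f}/month")
```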

Failure mode 2: IAM role or STS credential expiry

MCP servers that call AWS services (DynamoDB, S3, Bedrock, etc.) use an execution role whose temporary credentials are issued through STS. In correctly configured Lambda environments this is transparent: the Lambda service injects the credentials into the execution environment and the SDK picks them up from there. It fails silently when:

The handler itself calls STS (e.g., AssumeRole for cross-account access) from a VPC without internet access and without an STS VPC endpoint. The call cannot reach STS, so the assumed-role credentials are never obtained or cannot be refreshed once they expire.
The execution role's trust policy has been modified and the role can no longer be assumed (e.g., a security audit removed the Lambda service principal from the trust).
An externally-provisioned STS token (from AssumeRole in a CI pipeline) was hardcoded in an environment variable and has since expired.

Probe signature: the MCP initialize call succeeds (client authentication is handled at the API Gateway layer, and the handshake itself makes no AWS service calls), but tools/list (if it is backed by an AWS resource) or subsequent tool calls return a JSON-RPC error with a message containing "AccessDeniedException" or "ExpiredTokenException". The server is "up" at the protocol layer but cannot execute any tools: an HTTP uptime monitor shows green while every agent using the MCP is getting errors.

Fix: Use an MCP-aware monitor that probes tools/list and optionally runs a read-only test tool call. AliveMCP's four-layer probe catches this at layer 4 (tool surface) rather than layer 1 (TCP). On the AWS side: rely on the execution role's automatically rotated credentials, never hardcode STS tokens in environment variables, and add an STS VPC endpoint to any private Lambda that calls STS itself.
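A probe can flag this failure mode by scanning the JSON-RPC response for the two exception names mentioned above. The sketch below is one way to do that, assuming the server passes the AWS SDK error text through to the client; it also looks inside tool results flagged with isError, since MCP servers commonly report tool execution failures that way. The function and constant names are hypothetical.

```python
import json

# Exception names taken from the probe signature described above.
CREDENTIAL_MARKERS = ("AccessDeniedException", "ExpiredTokenException")


def credential_failure(body):
    """Return the matched marker if the response reports a credential error, else None."""
    try:
        msg = json.loads(body)
    except ValueError:
        return None  # not JSON at all: a different failure mode
    if not isinstance(msg, dict):
        return None
    error = msg.get("error")
    result = msg.get("result")
    if isinstance(error, dict):
        haystack = json.dumps(error)    # covers message and data fields
    elif isinstance(result, dict) and result.get("isError") is True:
        haystack = json.dumps(result)   # tool-level error reported in the result
    else:
        return None
    for marker in CREDENTIAL_MARKERS:
        if marker in haystack:
            return marker
    return None
```

Alerting on a non-None return while initialize still succeeds separates "credentials broken" from "server down".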

Failure mode 3: VPC egress filtering

MCP servers in a VPC (common for private enterprise MCPs) often have egress rules that restrict outbound traffic. When the MCP handler makes an outbound call (to a third-party API, to an S3 endpoint, to a DynamoDB endpoint in another region) and the security group or NACL blocks that egress, the packets are silently dropped and the call hangs until the HTTP client or SDK timeout fires or, failing that, until the Lambda function timeout (configurable up to 15 minutes). From the probe's perspective: the MCP initialize returns quickly, but a tool call that triggers the blocked egress hangs until timeout.

Probe signature: initialize latency is normal (200–500ms), tools/list returns immediately (tools are statically registered), but any call to a tool that makes an outbound request times out. If your probe only checks initialize and tools/list, this failure mode is invisible — the server looks healthy while all meaningful tool calls fail for users.

Fix: Add VPC endpoints for every AWS service your Lambda calls (S3, DynamoDB, STS, Bedrock, etc.) so that traffic stays inside the AWS network and does not need an internet route. The function's security group still has to allow outbound HTTPS to those endpoints. For third-party API calls, route through a NAT Gateway. On the monitoring side: if your MCP has idempotent read-only tools, configure AliveMCP's credentialed probe (Author tier) to run one test tool call per probe and alert if it fails with a timeout while tools/list succeeds.
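The alert condition described here (tool call times out while initialize and tools/list succeed) amounts to a small decision table over the three probe steps. The following is a minimal sketch of that table; the Step enum and the diagnosis strings are hypothetical, and how each step's outcome is measured is left to the probe.

```python
from enum import Enum


class Step(Enum):
    OK = "ok"
    ERROR = "error"
    TIMEOUT = "timeout"


def diagnose(initialize, tools_list, tool_call):
    """Map the outcome of the three probe steps to a coarse diagnosis."""
    if initialize is not Step.OK:
        return "handshake failing"
    if tools_list is not Step.OK:
        return "tool surface failing"
    if tool_call is Step.TIMEOUT:
        # Handshake and listing are fine, only the outbound-dependent call hangs.
        return "suspected blocked egress"
    if tool_call is Step.ERROR:
        return "tool call error"
    return "healthy"
```

For example, diagnose(Step.OK, Step.OK, Step.TIMEOUT) yields "suspected blocked egress", the case that a probe limited to initialize and tools/list would report as healthy.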

Failure mode 4: Lambda concurrency exhaustion

AWS Lambda accounts have a default concurrency limit of 1000 concurrent executions per region, shared across all functions. If your MCP server is deployed in an account that also runs other high-traffic Lambdas, a traffic spike on those functions can exhaust the concurrency pool and throttle your MCP Lambda, producing 429s at the HTTP layer (API Gateway may surface a throttled Lambda invocation as a 5xx instead, depending on the integration type). At low concurrency limits (e.g., you've set a small reserved concurrency on the MCP function itself, which also acts as a cap), even a single slow tool call that holds a Lambda for 30 seconds can cause concurrent probes to be throttled.

Probe signature: sporadic throttle responses at the HTTP layer (before the MCP protocol layer is ever reached). The probe log shows HTTP 429, or in some API Gateway configurations a 5xx, rather than a JSON-RPC error. Recovery is automatic once the in-flight invocations finish and free up concurrency, usually within a few seconds to a minute.

Fix: Set a reserved concurrency limit on your MCP Lambda that guarantees it a portion of the account pool (e.g., reserve 10 of your 1000 concurrent executions for the MCP function, preventing throttle-by-neighbor). Reserved concurrency is also a ceiling, so size it for your peak traffic plus the probe. Monitor Lambda throttle metrics in CloudWatch alongside your external probe results: throttle spikes that don't appear in external probe failures resolved before your next probe; throttle spikes that do appear, especially repeatedly, indicate the concurrency limit is too low for the traffic the function receives.
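Comparing throttle metrics with probe failures is a set comparison once both are bucketed to the same granularity. The sketch below assumes you have already exported the timestamps of throttled minutes from CloudWatch and the timestamps of failed probes from your monitor; the function names and the one-minute bucket are illustrative choices.

```python
def to_minute(epoch_seconds):
    """Bucket an epoch timestamp to its minute index."""
    return int(epoch_seconds // 60)


def correlate(throttle_timestamps, probe_failure_timestamps):
    """Split events into throttles the probe saw, throttles it missed, and other failures."""
    throttled = {to_minute(t) for t in throttle_timestamps}
    failed = {to_minute(t) for t in probe_failure_timestamps}
    return {
        "seen_by_probe": sorted(throttled & failed),
        "resolved_before_probe": sorted(throttled - failed),
        "not_throttle_related": sorted(failed - throttled),
    }
```

A growing "seen_by_probe" list suggests the concurrency limit is too tight; entries under "not_throttle_related" point to one of the other failure modes.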

Monitoring ECS Fargate MCP servers

Fargate MCPs avoid cold starts but introduce a different failure mode: rolling deploys. During a Fargate service update, ECS starts the new task, registers it with the ALB, and drains the old one. While the old task drains (the target group's deregistration delay, 300 seconds by default), the ALB stops sending it new requests but lets in-flight ones finish, and the new task has to pass its health checks before it receives traffic. If the new task's MCP server takes longer to become healthy than the health check settings and grace period allow (e.g., it needs to load a model or warm a cache), it fails its health check and ECS stops and replaces it, or rolls the deployment back if the deployment circuit breaker is enabled, leaving your MCP endpoint in an inconsistent state mid-probe.

Monitoring recommendation: set the ECS service's health check grace period (healthCheckGracePeriodSeconds, 0 by default) to at least 60 seconds for Fargate MCPs, and configure your external monitor to alert only after 3 consecutive failures (to absorb the rolling-deploy window). Tag deploy events in AliveMCP's maintenance window feature so that deploy-time failures don't count against your SLO uptime.
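The "alert only after 3 consecutive failures" rule, combined with ignoring probes taken during a tagged deploy, can be sketched as a small state machine. This is an illustration of the rule rather than how any particular monitor implements it; AlertState and the in_maintenance flag are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class AlertState:
    threshold: int = 3             # consecutive failures required to alert
    consecutive_failures: int = 0

    def observe(self, ok, in_maintenance=False):
        """Record one probe result; return True when an alert should fire."""
        if in_maintenance:
            # Probes during a tagged deploy neither count nor reset the streak.
            return False
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire once, at the moment the streak reaches the threshold.
        return self.consecutive_failures == self.threshold
```

Feeding it the sequence fail, fail, fail, fail returns False, False, True, False: one alert per outage, and a single success in between resets the streak.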