Guide · Infrastructure as Code

MCP Server Infrastructure with Terraform — provision, deploy, and verify at scale

Terraform manages the cloud infrastructure that runs your MCP server — virtual machines, ECS tasks, load balancers, security groups, and IAM roles. Define everything in HCL, apply it with a single command, and prove the MCP protocol endpoint is live before the pipeline turns green.

TL;DR

An MCP server needs more infrastructure than a typical REST API: sticky sessions or Streamable HTTP transport for long-lived connections, security group rules that let external monitoring probes reach the endpoint, and IAM roles to pull secrets at runtime. Encoding all of that in Terraform HCL puts every environment in version control, makes manual drift visible, and lets you build staging and production from the same module. After terraform apply creates the resources, a null_resource provisioner fires a real MCP initialize request against the newly provisioned endpoint. If the JSON-RPC handshake fails, the apply exits non-zero and Terraform marks the null_resource as tainted, so the next apply re-runs the probe. Pair that gate with AliveMCP for continuous per-minute monitoring once the server is live.

Why infrastructure as code matters for MCP servers

When you provision an MCP server by clicking through the AWS console you create invisible risk. Security group rules get added one at a time and never documented. IAM role policies accumulate permissions that nobody remembers granting. The ALB target group health check path drifts from what the server actually exposes. A new engineer joins the team and cannot recreate a working environment without talking to the person who originally set things up.

MCP servers compound this problem compared to ordinary web services. The Model Context Protocol uses either a Server-Sent Events transport (which requires sticky sessions at the load balancer so that a client's follow-up HTTP requests land on the same backend that opened the SSE stream) or the newer Streamable HTTP transport (which can be run statelessly but still requires careful timeout configuration at every layer between the client and the server process). These details (ALB stickiness duration, idle timeout, connection draining time) are invisible once they exist only in the console. Terraform captures them as first-class HCL attributes that live next to the application code in the same git repository.

The following table contrasts the two approaches across four dimensions that matter for a team running MCP servers in production:

Dimension	Manual AWS console setup	Terraform IaC
Drift risk	High — any engineer can modify resources without a record	Low — `terraform plan` detects any out-of-band change before the next apply
Reproducibility	None — a new environment requires manual reconstruction from memory or runbooks	Full — `terraform apply -var-file=prod.tfvars` creates an identical environment from the same module
Audit trail	CloudTrail captures API calls but not the intent or ticket reference behind them	Every infrastructure change is a git commit with a PR, reviewer approval, and a CI plan artifact
Team access	Whoever has console IAM access can make changes; no review gate	Atlantis or GitHub Actions enforces plan-then-apply with required reviewers before any change reaches production

The reproducibility row is especially important for MCP servers. Because the protocol version, transport configuration, and IAM permissions all interact, a staging environment built from the same module as production, differing only in its variable file, makes it far more likely that a bug found in staging reproduces in production and that the fix carries over, rather than disappearing into "works on my machine" territory.

Terraform module structure for a Node.js MCP server on EC2

A minimal but production-ready Terraform module for an EC2-hosted MCP server spans four files: main.tf for resources, variables.tf for inputs, outputs.tf for values consumed by other modules or by the post-apply monitoring registration step, and user_data.sh for the instance bootstrap script. Here is the complete set.

variables.tf

variable "region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-1"
}

variable "instance_type" {
  description = "EC2 instance type for the MCP server"
  type        = string
  default     = "t3.small"
}

variable "ami_id" {
  description = "AMI ID (Ubuntu 24.04 LTS recommended)"
  type        = string
  # No default: pass an AMI ID explicitly, or resolve one with an aws_ami data source
}

variable "management_cidr" {
  description = "CIDR block allowed SSH access (your office or bastion IP)"
  type        = string
}

variable "key_pair_name" {
  description = "Name of an existing EC2 key pair for SSH access"
  type        = string
}

variable "mcp_repo_url" {
  description = "Git URL of the MCP server repository to clone and start"
  type        = string
}

variable "name_prefix" {
  description = "Short identifier prepended to all resource names"
  type        = string
  default     = "mcp"
}

main.tf

terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.region
}

# Security group — allow HTTPS from the world, SSH from management CIDR only
resource "aws_security_group" "mcp" {
  name        = "${var.name_prefix}-mcp-sg"
  description = "MCP server: HTTPS inbound, SSH from management CIDR"

  ingress {
    description = "HTTPS"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTP (redirect to HTTPS)"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "SSH from management CIDR"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.management_cidr]
  }

  egress {
    description = "Allow all outbound (npm install, Secrets Manager, monitoring probes)"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.name_prefix}-mcp-sg"
  }
}

# EC2 instance
resource "aws_instance" "mcp" {
  ami                    = var.ami_id
  instance_type          = var.instance_type
  key_name               = var.key_pair_name
  vpc_security_group_ids = [aws_security_group.mcp.id]
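  # templatefile() fills in the mcp_repo_url placeholder used inside user_data.sh;
  # plain file() would leave it unset and the script would abort under set -u
  user_data = templatefile("${path.module}/user_data.sh", {
    mcp_repo_url = var.mcp_repo_url
  })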

  root_block_device {
    volume_size = 20
    volume_type = "gp3"
    encrypted   = true
  }

  tags = {
    Name = "${var.name_prefix}-mcp-server"
  }
}

# Elastic IP — stable address for DNS and monitoring configuration
resource "aws_eip" "mcp" {
  domain = "vpc"

  tags = {
    Name = "${var.name_prefix}-mcp-eip"
  }
}

resource "aws_eip_association" "mcp" {
  instance_id   = aws_instance.mcp.id
  allocation_id = aws_eip.mcp.id
}
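
The comment on the ami_id variable mentions an aws_ami data source, but main.tf does not define one. A minimal sketch of such a lookup follows. The data source name ubuntu is illustrative, and the Canonical owner ID and image name pattern are assumptions that should be checked against the current Ubuntu image listing before use.

```hcl
# Sketch: resolve the most recent Ubuntu 24.04 LTS (amd64) AMI published by Canonical.
# The owner ID and the name pattern are assumptions; verify them for your region.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # assumed to be Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

With this in place the instance could set ami = data.aws_ami.ubuntu.id instead of var.ami_id. The trade-off is that a newly published image changes the resolved ID and can trigger an instance replacement on the next apply, so pinning an explicit AMI ID per environment is the more predictable choice for production.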

outputs.tf

output "instance_id" {
  description = "EC2 instance ID"
  value       = aws_instance.mcp.id
}

output "public_ip" {
  description = "Elastic IP address of the MCP server"
  value       = aws_eip.mcp.public_ip
}

output "mcp_endpoint" {
  description = "MCP JSON-RPC endpoint URL"
  value       = "https://${aws_eip.mcp.public_ip}/"
}

user_data.sh

The bootstrap script runs once when the instance starts. It installs Node.js 20 from NodeSource, clones the MCP server repository, installs dependencies, and starts the process under PM2 so it restarts automatically after a system reboot.

#!/bin/bash
set -euo pipefail

# Install Node.js 20 from NodeSource
curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt-get install -y nodejs

# Install PM2 process manager globally
npm install -g pm2

# Clone and start the MCP server
cd /opt
git clone "${mcp_repo_url}" mcp-server
cd mcp-server
npm ci --omit=dev

# Start via PM2 — listens on 0.0.0.0:3000 with Streamable HTTP transport
pm2 start index.js --name mcp-server \
  --env production \
  -e /var/log/mcp-server-err.log \
  -o /var/log/mcp-server-out.log

# Persist PM2 across reboots
pm2 startup systemd -u root --hp /root
pm2 save

# Install Caddy for TLS termination (auto-HTTPS via Let's Encrypt)
apt-get install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
  | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
  | tee /etc/apt/sources.list.d/caddy-stable.list
apt-get update && apt-get install -y caddy

cat > /etc/caddy/Caddyfile <<'CADDY'
:443 {
  reverse_proxy localhost:3000
}
CADDY

systemctl enable caddy
systemctl restart caddy

ECS/Fargate deployment for containerized MCP servers

EC2 is a solid choice for teams that want direct SSH access or need to control the exact runtime environment. For teams that prefer not to manage individual virtual machines, and especially for teams already running containers, ECS Fargate removes the instance-level operational burden entirely. Fargate starts tasks without any instance to boot or patch, bills only for the tasks that are running, and lets you focus on the MCP server container rather than the host it runs on.

The following HCL provisions a full ECS/Fargate stack: cluster, task definition, service with load balancer integration, an Application Load Balancer with an ACM TLS certificate, and the IAM execution role that allows ECS to pull images and write CloudWatch logs.

ECS cluster and task definition

resource "aws_ecs_cluster" "mcp" {
  name = "${var.name_prefix}-mcp-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_ecs_task_definition" "mcp" {
  family                   = "${var.name_prefix}-mcp"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.mcp_task.arn

  container_definitions = jsonencode([{
    name      = "mcp-server"
    image     = "${var.ecr_repository_url}:${var.image_tag}"
    essential = true

    portMappings = [{
      containerPort = 3000
      hostPort      = 3000
      protocol      = "tcp"
    }]

    environment = [
      { name = "NODE_ENV", value = "production" },
      { name = "PORT",     value = "3000" }
    ]

    secrets = [
      {
        name      = "API_KEY"
        valueFrom = aws_secretsmanager_secret.mcp_api_key.arn
      }
    ]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/${var.name_prefix}-mcp"
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "ecs"
      }
    }

    healthCheck = {
      # Assumes curl is installed in the image; slim and Alpine Node.js base images often omit it
      command     = ["CMD-SHELL", "curl -sf http://localhost:3000/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 15
    }
  }])
}

resource "aws_ecs_service" "mcp" {
  name            = "${var.name_prefix}-mcp-service"
  cluster         = aws_ecs_cluster.mcp.id
  task_definition = aws_ecs_task_definition.mcp.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  # Force a new deployment whenever this resource is applied
  force_new_deployment = true

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.mcp.arn
    container_name   = "mcp-server"
    container_port   = 3000
  }

  depends_on = [aws_lb_listener.mcp_https]
}
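
The ECS and load balancer blocks in this section reference several input variables and a CloudWatch log group that the guide does not declare. The sketch below fills those in. The variable names match the references in the HCL; the types, descriptions, default, and retention period are assumptions.

```hcl
# Inputs referenced by the ECS and ALB blocks. Types and defaults are assumptions.
variable "ecr_repository_url" {
  description = "ECR repository URL for the MCP server image, without a tag"
  type        = string
}

variable "image_tag" {
  description = "Image tag to deploy"
  type        = string
}

variable "desired_count" {
  description = "Number of Fargate tasks to keep running"
  type        = number
  default     = 2
}

variable "vpc_id" {
  description = "VPC that hosts the load balancer and tasks"
  type        = string
}

variable "public_subnet_ids" {
  description = "Public subnets for the Application Load Balancer"
  type        = list(string)
}

variable "private_subnet_ids" {
  description = "Private subnets for the Fargate tasks"
  type        = list(string)
}

variable "mcp_domain" {
  description = "DNS name clients use to reach the MCP server"
  type        = string
}

# The awslogs driver writes to this group. The managed execution role policy
# allows creating log streams and putting events, but not creating the group.
resource "aws_cloudwatch_log_group" "mcp" {
  name              = "/ecs/${var.name_prefix}-mcp"
  retention_in_days = 30
}
```

Because the tasks run in private subnets with assign_public_ip = false, they also need a route to ECR, Secrets Manager, and CloudWatch Logs, either through a NAT gateway or through VPC endpoints.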

Application Load Balancer, target group, and ACM certificate

resource "aws_lb" "mcp" {
  name               = "${var.name_prefix}-mcp-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids

  idle_timeout = 60  # Must exceed SSE keepalive interval if using SSE transport

  enable_deletion_protection = true
}

resource "aws_lb_target_group" "mcp" {
  name        = "${var.name_prefix}-mcp-tg"
  port        = 3000
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  # TCP health checks will not catch MCP protocol failures.
  # Use an HTTP path your MCP server exposes specifically for health probes.
  health_check {
    enabled             = true
    path                = "/health"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400  # Required for SSE transport; omit for Streamable HTTP
    enabled         = true
  }
}

resource "aws_acm_certificate" "mcp" {
  domain_name       = var.mcp_domain
  validation_method = "DNS"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_lb_listener" "mcp_https" {
  load_balancer_arn = aws_lb.mcp.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.mcp.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.mcp.arn
  }
}

resource "aws_lb_listener" "mcp_http_redirect" {
  load_balancer_arn = aws_lb.mcp.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
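
One gap in the listener configuration: a certificate requested with validation_method = "DNS" stays pending until its validation records exist, and an HTTPS listener cannot attach a certificate that has not been issued. The sketch below assumes the domain is hosted in Route 53; the route53_zone_id variable and the resource names are hypothetical.

```hcl
# Assumption: DNS for var.mcp_domain lives in a Route 53 hosted zone.
variable "route53_zone_id" {
  description = "Route 53 hosted zone ID that contains var.mcp_domain"
  type        = string
}

# One validation record per domain name on the certificate.
resource "aws_route53_record" "mcp_cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.mcp.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      type   = dvo.resource_record_type
      record = dvo.resource_record_value
    }
  }

  zone_id         = var.route53_zone_id
  name            = each.value.name
  type            = each.value.type
  records         = [each.value.record]
  ttl             = 60
  allow_overwrite = true
}

# Blocks until ACM reports the certificate as issued.
resource "aws_acm_certificate_validation" "mcp" {
  certificate_arn         = aws_acm_certificate.mcp.arn
  validation_record_fqdns = [for r in aws_route53_record.mcp_cert_validation : r.fqdn]
}
```

Pointing the listener's certificate_arn at aws_acm_certificate_validation.mcp.certificate_arn, rather than at the certificate directly, makes Terraform wait for issuance before it creates the listener.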

IAM execution role and task role

data "aws_iam_policy_document" "ecs_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_execution" {
  name               = "${var.name_prefix}-ecs-execution-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume_role.json
}

resource "aws_iam_role_policy_attachment" "ecs_execution_managed" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role" "mcp_task" {
  name               = "${var.name_prefix}-mcp-task-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume_role.json
}
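
The service and load balancer blocks above also reference two security groups, aws_security_group.alb and aws_security_group.ecs_tasks, that are not defined in this guide. A minimal sketch follows, assuming the load balancer is public and the tasks accept traffic only from it. The names and descriptions are illustrative.

```hcl
# ALB: accept HTTP and HTTPS from anywhere.
resource "aws_security_group" "alb" {
  name        = "${var.name_prefix}-mcp-alb-sg"
  description = "MCP ALB: HTTP and HTTPS inbound"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTP (redirected to HTTPS by the listener)"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "Allow all outbound so the ALB can reach its targets"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Tasks: accept traffic on the container port from the ALB only.
resource "aws_security_group" "ecs_tasks" {
  name        = "${var.name_prefix}-mcp-tasks-sg"
  description = "MCP tasks: port 3000 from the ALB"
  vpc_id      = var.vpc_id

  ingress {
    description     = "MCP traffic from the ALB"
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    description = "Allow all outbound (image pulls, Secrets Manager, logs)"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Referencing the ALB's group from the task group, and not the other way round, keeps the two resources free of a dependency cycle.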

A note on health checks: the ALB target group health check path /health must be an endpoint your MCP server actually implements. A TCP health check only confirms the process is accepting connections — it will not detect a hung Node.js event loop, a crashed tool runner, or an MCP protocol version mismatch. Keep the HTTP health check path lightweight (no tool calls, no external API calls) and treat it as a liveness signal, not a correctness signal. Correctness — that the server actually speaks MCP — is what the post-apply protocol probe and AliveMCP's continuous monitoring are for.

Secrets management with Terraform and AWS Secrets Manager

MCP servers frequently need API keys: upstream AI provider keys, database credentials, webhook signing secrets. Hardcoding these in terraform.tfvars is a common mistake that turns your infrastructure repository into a credential store visible to every engineer with git access. The correct pattern is to store secrets in AWS Secrets Manager and have the ECS task fetch them at runtime through the native ECS secrets injection mechanism.

resource "aws_secretsmanager_secret" "mcp_api_key" {
  name                    = "${var.name_prefix}/mcp/api-key"
  description             = "API key used by the MCP server to call upstream AI providers"
  recovery_window_in_days = 7
}

resource "aws_secretsmanager_secret_version" "mcp_api_key" {
  secret_id     = aws_secretsmanager_secret.mcp_api_key.id
  secret_string = var.mcp_api_key_value

  # In practice, rotate this manually or via a Lambda rotation function.
  # Never commit the actual key value; pass it via TF_VAR_mcp_api_key_value
  # environment variable in CI, not via a committed tfvars file.
  lifecycle {
    ignore_changes = [secret_string]
  }
}
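
The secret version reads var.mcp_api_key_value, which is not declared elsewhere in this guide. A minimal declaration might look like the following; the description text is an assumption.

```hcl
# Marked sensitive so Terraform redacts the value in plan and apply output.
variable "mcp_api_key_value" {
  description = "Initial value for the MCP API key secret; supply via TF_VAR_mcp_api_key_value"
  type        = string
  sensitive   = true
}
```

The sensitive flag only affects what Terraform prints. The value is still written to the state file, which is one more reason to encrypt and restrict access to the state backend.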

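For the secrets block in the container definition, it is the execution role that needs permission to read the secret, because ECS uses that role to inject the value before the container starts. The task role needs the same permission only if your application code calls Secrets Manager or Parameter Store itself at runtime. The policy document below grants that access; attach it to whichever role needs it: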

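# Supplies the account ID used in the SSM parameter ARN below
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "mcp_task_secrets" {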
  statement {
    sid    = "AllowReadMCPSecrets"
    effect = "Allow"
    actions = [
      "secretsmanager:GetSecretValue",
      "secretsmanager:DescribeSecret"
    ]
    resources = [
      aws_secretsmanager_secret.mcp_api_key.arn
    ]
  }

  # If you store the DB password in SSM Parameter Store instead:
  statement {
    sid    = "AllowReadSSMParameters"
    effect = "Allow"
    actions = [
      "ssm:GetParameter",
      "ssm:GetParameters"
    ]
    resources = [
      "arn:aws:ssm:${var.region}:${data.aws_caller_identity.current.account_id}:parameter/${var.name_prefix}/*"
    ]
  }
}

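# Task role: lets application code read the secret at runtime
resource "aws_iam_role_policy" "mcp_task_secrets" {
  name   = "${var.name_prefix}-mcp-task-secrets"
  role   = aws_iam_role.mcp_task.id
  policy = data.aws_iam_policy_document.mcp_task_secrets.json
}

# Execution role: required for the secrets injection declared in the container definition
resource "aws_iam_role_policy" "mcp_execution_secrets" {
  name   = "${var.name_prefix}-mcp-execution-secrets"
  role   = aws_iam_role.ecs_execution.id
  policy = data.aws_iam_policy_document.mcp_task_secrets.json
}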

Once the IAM policy is in place, the container definition references the secret by ARN rather than by value. ECS injects the secret as an environment variable named API_KEY at container start time. The task definition JSON stored in the Terraform state file contains only the ARN, not the secret value. The aws_secretsmanager_secret_version resource is different: its secret_string is written to the state file in plain text, as are sensitive values from other resources, so your S3 state bucket must have encryption at rest enabled and access restricted to the IAM roles used by your CI pipeline and engineers.

Post-apply MCP protocol verification with null_resource

Infrastructure provisioning and application correctness are two different things. Terraform can confirm that an EC2 instance is running and an ECS service has reached its desired count, but it cannot natively verify that the process listening on port 443 is responding to valid MCP JSON-RPC messages. A null_resource with a local-exec provisioner bridges that gap by running a real MCP initialize probe from the machine running terraform apply.

resource "null_resource" "mcp_health_probe" {
  # Re-run the probe any time the instance is replaced
  triggers = {
    instance_id   = aws_instance.mcp.id
    elastic_ip    = aws_eip.mcp.public_ip
  }

  provisioner "local-exec" {
    command = <<-EOT
      sleep 30
      # initialize requires a capabilities object; Streamable HTTP servers expect both Accept types
      curl -sf -X POST https://${aws_eip.mcp.public_ip}/ \
        -H 'Content-Type: application/json' \
        -H 'Accept: application/json, text/event-stream' \
        -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"terraform-probe","version":"1.0"}}}' \
        | jq -e '.result.protocolVersion == "2024-11-05"' \
        && echo "MCP server is live and speaking protocol version 2024-11-05" \
        || (echo "MCP protocol probe failed; the apply will exit non-zero" && exit 1)
    EOT
  }

  depends_on = [
    aws_instance.mcp,
    aws_eip_association.mcp
  ]
}

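The sleep 30 gives the user_data.sh bootstrap script time to install Node.js, clone the repository, install dependencies, start PM2, and bring Caddy up with a TLS certificate. In practice you may need to increase this to 60–90 seconds for a cold start on a new instance, particularly if Let's Encrypt needs to issue a certificate. You can make the wait dynamic by replacing the fixed sleep with a polling loop that retries every five seconds for up to two minutes. Certificate issuance also assumes a DNS name: as written, the Caddyfile's :443 site address names no host, and Caddy's automatic HTTPS needs a hostname to request a public certificate for. A real deployment would point a DNS record at the Elastic IP and use that hostname in both the Caddyfile and the probe URL, since curl will reject a certificate that does not match the address it connects to.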
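
The polling approach described above could be sketched as follows. The resource name mcp_health_probe_polling is hypothetical, and the block is meant as a replacement for the fixed-sleep probe, not an addition to it. The request includes the capabilities object and Accept header that the initialize handshake expects.

```hcl
resource "null_resource" "mcp_health_probe_polling" {
  # Re-run the probe any time the instance is replaced
  triggers = {
    instance_id = aws_instance.mcp.id
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = <<-EOT
      # Up to 24 attempts with a 5-second pause between them.
      # Each request is capped at 5 seconds, so the worst case is about four minutes.
      for attempt in $(seq 1 24); do
        if curl -sf --max-time 5 -X POST "https://${aws_eip.mcp.public_ip}/" \
            -H 'Content-Type: application/json' \
            -H 'Accept: application/json, text/event-stream' \
            -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"terraform-probe","version":"1.0"}}}' \
            | jq -e '.result.protocolVersion == "2024-11-05"' > /dev/null; then
          echo "MCP probe passed on attempt $attempt"
          exit 0
        fi
        echo "Attempt $attempt failed; retrying in 5 seconds"
        sleep 5
      done
      echo "MCP probe did not pass before the retry limit"
      exit 1
    EOT
  }

  depends_on = [aws_eip_association.mcp]
}
```

Shell variables are written as $attempt instead of the braced form because Terraform treats a dollar sign followed by an opening brace inside a heredoc as its own interpolation. The loop returns as soon as the handshake succeeds, so a fast bootstrap does not pay for the full wait.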

If the jq -e command exits non-zero (because the response body is not valid JSON, because the result.protocolVersion field does not match, or because curl could not connect), the provisioner exits with a non-zero status code. Terraform interprets that as a provisioner failure, fails the apply, and marks the null_resource as tainted. The next terraform apply destroys and re-creates the null_resource, which re-runs the probe. It does not replace the EC2 instance, because tainting a resource does not propagate to the resources named in its triggers. To rebuild the instance as well, run terraform apply -replace=aws_instance.mcp. The pattern is still valuable: a deploy that produces a server that does not speak MCP fails the pipeline instead of passing silently.

An equivalent probe for an ECS/Fargate deployment uses the ALB DNS name instead of a direct IP address, and may need a longer wait for the ECS task to reach a healthy state and for the ALB target group health check to mark it as such:

resource "null_resource" "mcp_ecs_probe" {
  triggers = {
    task_definition_arn = aws_ecs_task_definition.mcp.arn
  }

  provisioner "local-exec" {
    command = <<-EOT
      echo "Waiting for ECS service to stabilize..."
      aws ecs wait services-stable \
        --cluster ${aws_ecs_cluster.mcp.name} \
        --services ${aws_ecs_service.mcp.name} \
        --region ${var.region}

      echo "Probing MCP protocol endpoint..."
      # initialize requires a capabilities object; Streamable HTTP servers expect both Accept types
      curl -sf -X POST https://${var.mcp_domain}/ \
        -H 'Content-Type: application/json' \
        -H 'Accept: application/json, text/event-stream' \
        -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"terraform-probe","version":"1.0"}}}' \
        | jq -e '.result.protocolVersion' \
        && echo "MCP probe passed" \
        || (echo "MCP probe failed" && exit 1)
    EOT
  }

  depends_on = [aws_ecs_service.mcp]
}

The aws ecs wait services-stable command polls the ECS service until all running task counts match the desired count and all tasks have passed the container health check. Only then does the curl probe run. This combination catches both infrastructure-level failures (ECS task fails to start) and protocol-level failures (container starts but MCP handshake is broken).

Registering the deployed endpoint with AliveMCP after apply

The post-apply probe confirms the server is alive at the moment of deployment, but continuous monitoring requires an ongoing external probe. AliveMCP provides a REST API for programmatically registering a new monitor. You can call it from Terraform using the http provider's data source or from a second null_resource provisioner that runs after the protocol probe passes:

resource "null_resource" "register_alivemcp_monitor" {
  triggers = {
    mcp_endpoint = "https://${aws_eip.mcp.public_ip}/"
  }

  provisioner "local-exec" {
    command = <<-EOT
      curl -sf -X POST https://api.alivemcp.com/v1/monitors \
        -H 'Authorization: Bearer ${var.alivemcp_api_key}' \
        -H 'Content-Type: application/json' \
        -d '{
          "name": "${var.name_prefix} MCP server",
          "url":  "https://${aws_eip.mcp.public_ip}/",
          "interval_seconds": 60,
          "protocol_version": "2024-11-05"
        }' \
        && echo "AliveMCP monitor registered"
    EOT
  }

  depends_on = [null_resource.mcp_health_probe]
}

Storing the AliveMCP API key in var.alivemcp_api_key and passing it via the TF_VAR_alivemcp_api_key environment variable in CI keeps the key out of version control while making the registration step fully automated. Once registered, AliveMCP probes the endpoint every minute, checks that the MCP initialize handshake succeeds, and pages your on-call rotation if it detects a protocol failure, a transport error, or an HTTP 5xx.
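
One refinement worth considering is to hand the key to the shell through the provisioner's environment argument instead of interpolating it into the command string. The sketch below is a variant of the registration block, not an addition to it. The resource name register_alivemcp_monitor_env and the environment variable names are hypothetical; the endpoint and payload fields are the ones used above.

```hcl
variable "alivemcp_api_key" {
  description = "AliveMCP API key; supply via TF_VAR_alivemcp_api_key"
  type        = string
  sensitive   = true
}

resource "null_resource" "register_alivemcp_monitor_env" {
  triggers = {
    mcp_endpoint = "https://${aws_eip.mcp.public_ip}/"
  }

  provisioner "local-exec" {
    # Values reach the shell as environment variables, so the rendered
    # command text never contains the key itself.
    environment = {
      ALIVEMCP_API_KEY = var.alivemcp_api_key
      MONITOR_NAME     = "${var.name_prefix} MCP server"
      MONITOR_URL      = "https://${aws_eip.mcp.public_ip}/"
    }

    command = <<-EOT
      curl -sf -X POST https://api.alivemcp.com/v1/monitors \
        -H "Authorization: Bearer $ALIVEMCP_API_KEY" \
        -H 'Content-Type: application/json' \
        -d "$(jq -n --arg name "$MONITOR_NAME" --arg url "$MONITOR_URL" \
              '{name: $name, url: $url, interval_seconds: 60, protocol_version: "2024-11-05"}')" \
        && echo "AliveMCP monitor registered"
    EOT
  }

  depends_on = [null_resource.mcp_health_probe]
}
```

Building the JSON body with jq also avoids quoting problems if the name prefix ever contains characters that would break a hand-written JSON string.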

State management and team workflows

Local Terraform state is an antipattern for any resource that outlives a single engineer's laptop session. For MCP server infrastructure that runs continuously, the state file is the authoritative record of what Terraform believes is deployed. Losing it means Terraform will try to create duplicate resources or will be unable to modify or destroy existing ones. The solution is remote state stored in an S3 bucket with a DynamoDB lock table.

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "mcp-server/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "mycompany-terraform-locks"
  }
}

The DynamoDB lock table prevents two engineers (or two CI jobs) from running terraform apply simultaneously against the same state file, which would corrupt it. Create the table with a LockID string hash key — Terraform manages the lock records automatically.
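
The backend block assumes the bucket and lock table already exist. A minimal sketch of those two resources follows, using the names from the backend configuration above. They would normally live in a separate bootstrap configuration, since a backend cannot store state in a bucket it has yet to create. The billing mode and encryption algorithm are assumptions.

```hcl
# Lock table: Terraform expects a string hash key named LockID.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "mycompany-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
}

# Versioning allows recovery of an earlier state file after a bad write.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest; state can hold secrets in plain text.
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Newer Terraform releases also offer S3-native locking through the backend's use_lockfile argument, which may remove the need for the table; check the S3 backend documentation for the version you run.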

For teams running multiple environments — development, staging, and production — Terraform workspaces or separate state file keys per environment keep the state isolated while reusing the same module code. The workspace approach is simpler to set up:

# Switch to the staging workspace, creating it on first use (Terraform 1.4+)
terraform workspace select -or-create staging
terraform apply -var-file=staging.tfvars

# Production in its own workspace
terraform workspace select -or-create production
terraform apply -var-file=production.tfvars

A more explicit approach uses separate backend key paths and separate variable files without relying on workspace names in resource names. Either pattern works; what matters is that staging and production never share a state file.
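
The separate-key approach can be sketched with Terraform's partial backend configuration: the backend block omits the key, and each environment supplies its own at init time. The per-environment file names mentioned below are hypothetical.

```hcl
# backend.tf: settings shared by every environment; key is supplied at init time
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "mycompany-terraform-locks"
  }
}
```

Each environment then gets a small backend settings file, for example staging.s3.tfbackend containing the single line key = "mcp-server/staging/terraform.tfstate", and is initialized with terraform init -reconfigure -backend-config=staging.s3.tfbackend before running terraform apply -var-file=staging.tfvars. Because the state path is chosen at init time, a CI job for staging has no path to the production state unless it is explicitly given the production backend file.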

For CI/CD, the recommended workflow is:

Pull request opened: CI runs terraform init and terraform plan, posts the plan output as a PR comment
PR approved and merged: CI runs terraform apply -auto-approve against the target environment
Post-apply: the null_resource MCP protocol probe runs; if it fails, the apply exits non-zero and the CI job is marked failed

Atlantis is a widely used open-source tool for this PR-based Terraform workflow. It runs as a server that listens for GitHub or GitLab webhook events, comments plan output on pull requests, and applies on a specific comment command (atlantis apply) after the required number of approvals. For MCP server infrastructure this is a significant safety improvement over giving every engineer direct terraform apply access to the production state.

Monitoring Terraform-deployed MCP servers with AliveMCP

The post-apply probe described above is a point-in-time check. It confirms the server was alive immediately after provisioning, but it cannot detect a server that goes down six hours later due to a memory leak in a long-running tool call, a certificate expiry, an accidental security group rule change pushed outside of Terraform, or an upstream API rate limit that causes the MCP server process to crash.

AliveMCP fills this gap with continuous external monitoring. Every minute, AliveMCP sends a real MCP initialize request to your endpoint, exactly like the Terraform probe does. But AliveMCP does this indefinitely, from multiple geographic regions, and with configurable alert routing. If the handshake times out, returns an unexpected protocol version, or fails with an HTTP error, AliveMCP notifies your team via Slack, PagerDuty, email, or webhook before users notice the outage.

There are several categories of failure that only continuous monitoring can catch, and that a one-time post-deploy probe will miss entirely:

Memory exhaustion. Node.js MCP servers that handle large tool responses can exhaust heap memory hours after a clean deploy. The process crashes, PM2 restarts it, but there is a window of 10–30 seconds where the endpoint is unreachable. AliveMCP detects this window.
TLS certificate expiry. Even with automated certificate renewal via Let's Encrypt or ACM, renewal can fail silently. AliveMCP checks the certificate expiry date on every probe and alerts you days before expiry, giving you time to diagnose the renewal failure before clients start seeing handshake errors.
Out-of-band infrastructure changes. An engineer who bypasses Terraform and modifies a security group rule in the AWS console may inadvertently block the port that AliveMCP (or your users) connect on. AliveMCP's probe will fail immediately, alerting you to the drift before users experience it. Running terraform plan after the alert will show the drift explicitly.
Upstream dependency failures. If your MCP server proxies requests to an external API and that API starts returning errors, the MCP server's initialize response may still succeed (the handshake does not call upstream APIs) while tool calls fail silently. Consider adding a shallow tool call to your health check endpoint that exercises one upstream dependency, so that AliveMCP's HTTP health check catches these failures too.

The combination of Terraform for infrastructure definition and AliveMCP for runtime verification covers both halves of the problem: Terraform ensures the environment is built correctly and consistently; AliveMCP checks that it continues to operate correctly after every deploy and during steady-state operation. Together they address two common sources of MCP server outages, environment drift and silent runtime failures, without relying on someone to check manually.

To add your Terraform-deployed endpoint to AliveMCP, navigate to alivemcp.com, enter your MCP server URL, and select a one-minute check interval. AliveMCP will run the full MCP initialize handshake on the first probe and configure ongoing monitoring automatically. You can also use the API-based registration shown in the null_resource.register_alivemcp_monitor block above to automate this step as part of every terraform apply.

Frequently asked questions

Can I use Terraform to manage multiple MCP server environments — staging and production — from the same codebase?

Yes. The standard patterns are Terraform workspaces and separate state file paths. With workspaces, you run terraform workspace select staging before applying staging changes and terraform workspace select production for production. Each workspace has its own state file in the S3 backend, so a staging apply cannot accidentally modify production resources. Pair each workspace with a matching variable file (staging.tfvars, production.tfvars) that sets environment-specific values like instance type, desired task count, and domain name. For stricter isolation, some teams prefer separate backend key paths per environment, which removes the risk of applying staging changes to production because someone forgot to switch workspaces.

How do I handle Terraform state for a team with multiple engineers all deploying MCP server changes?

Store state in S3 with a DynamoDB lock table, as shown in the backend configuration above. The lock table ensures that only one terraform apply can run at a time against a given state file. Configure IAM policies so that individual engineers can run terraform plan (which needs read access to the state bucket, read-only access to the AWS APIs it refreshes against, and write access to the lock table, because plan takes the state lock by default) but only the CI/CD service role can run terraform apply. This pattern of planning in feature branches and applying in CI after merge means no engineer ever applies directly to production from a local machine, which eliminates an entire class of accidental changes. Never email, Slack, or commit a local terraform.tfstate file; if the team needs to inspect state, use terraform state list and terraform state show against the remote backend.

What is the difference between terraform plan and terraform apply for MCP server infrastructure?

terraform plan compares your HCL configuration against the current state and reports what would change (resources that would be created, modified, or destroyed) without changing any infrastructure. It does make read-only API calls to AWS to refresh its view of existing resources, so it needs credentials, but it is safe to run as many times as you like, including in CI on every pull request. terraform apply executes those changes against the real infrastructure. Always review the plan output before running apply, particularly for any lines that show resources being destroyed or replaced (indicated by the minus sign or the -/+ symbol), because replacing an EC2 instance or an ECS task definition may cause a brief interruption to your MCP server. In a good workflow, CI saves the plan with terraform plan -out=tfplan and later runs terraform apply tfplan, so the plan that was reviewed is exactly the plan that gets applied.

How do I update my MCP server's Docker image without downtime using Terraform?

For ECS/Fargate, updating the image_tag variable changes the task definition, so terraform apply registers a new revision and ECS starts a rolling deployment. Setting force_new_deployment = true on the aws_ecs_service resource, as shown in the ECS section above, additionally forces a fresh deployment when the image changes behind an unchanged tag. ECS starts new tasks with the updated image and waits for them to pass the target group health check before deregistering the old tasks; the target group's deregistration_delay (300 seconds by default) controls how long in-flight requests on the old tasks have to finish. For Streamable HTTP transport run statelessly, the default ECS rolling update (minimum healthy percent 100, maximum percent 200) typically achieves zero downtime without further configuration. For SSE transport with sticky sessions, plan a brief maintenance window during which new SSE connections are established to the new tasks, because existing long-lived SSE connections on old tasks will be closed when those tasks drain. After deploy, AliveMCP's continuous monitoring verifies the new tasks are responding to the MCP initialize handshake; use the AliveMCP status page as your deploy success signal rather than relying solely on ECS service health checks.

Should I use Terraform for every MCP server deployment, or only at scale?

Terraform pays for itself after the second server or the second environment. For a single prototype running on one EC2 instance that you will tear down in a week, Terraform is overhead — a shell script or a five-minute console session is faster. The break-even point arrives when you need any of the following: a staging environment that matches production exactly (so bugs do not hide until they reach production), audit trail requirements for compliance, team access controls that prevent individual engineers from making unreviewed production changes, or more than one or two engineers deploying infrastructure. Once you cross any of these thresholds, the time investment in writing Terraform modules pays back quickly because you stop debugging environment drift and spending engineering hours on "it works on staging but not on prod" incidents. Start with the EC2 module above and graduate to ECS/Fargate when your team outgrows managing individual instances.