MCP Server AWS CloudWatch — metrics, logs, alarms, and Logs Insights

AWS CloudWatch is the primary observability service for AWS workloads. This guide covers building TypeScript MCP tools that query CloudWatch metrics with GetMetricStatistics, filter log events from CloudWatch Logs, run Logs Insights queries, list and manage alarms, and expose a /health/cloudwatch endpoint so AliveMCP can detect IAM expiry and regional outages before your agent workflows time out.

TL;DR

Create singleton CloudWatchClient and CloudWatchLogsClient instances, never one per call. Use GetMetricStatisticsCommand for time-series data, FilterLogEventsCommand for log search, and StartQueryCommand + GetQueryResultsCommand (with polling) for Logs Insights. Scope IAM to the minimum read-only actions (for example cloudwatch:GetMetricStatistics, logs:FilterLogEvents, logs:StartQuery, and logs:GetQueryResults). Wire AliveMCP to /health/cloudwatch: a 503 from that endpoint, seen before any tool invocation fails, almost always means expired IAM credentials or a regional endpoint outage.

SDK setup: singleton clients

The AWS SDK v3 for JavaScript splits CloudWatch into two separate clients: @aws-sdk/client-cloudwatch for metrics and alarms, and @aws-sdk/client-cloudwatch-logs for log groups. Create both as module-level singletons. The SDK resolves credentials from the default provider chain (environment variables → ~/.aws/credentials → ECS task role or EC2 instance profile) when the first request is sent, caches them on the client, and reuses them for later commands until they approach expiry.

import { CloudWatchClient } from '@aws-sdk/client-cloudwatch';
import { CloudWatchLogsClient } from '@aws-sdk/client-cloudwatch-logs';

const REGION = process.env.AWS_REGION ?? 'us-east-1';

// Singleton clients — constructed once, shared across all tool calls
export const cw = new CloudWatchClient({ region: REGION });
export const cwl = new CloudWatchLogsClient({ region: REGION });

// Credentials are resolved from the credential chain:
// 1. AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_SESSION_TOKEN env vars
// 2. ~/.aws/credentials file (profile selected by AWS_PROFILE)
// 3. ECS task role / EC2 instance profile (automatically rotated by AWS)

Avoid constructing new clients inside tool handlers. Every new client starts with an empty credential cache, and resolving credentials for an IAM role involves network calls that can add 100–500ms of latency per tool invocation. Module-level singletons pay this cost once and then reuse the cached credentials for every later call.

Package	Client	Use for
`@aws-sdk/client-cloudwatch`	`CloudWatchClient`	Metrics, alarms, dashboards
`@aws-sdk/client-cloudwatch-logs`	`CloudWatchLogsClient`	Log groups, log streams, Logs Insights

query_metrics tool: GetMetricStatistics

GetMetricStatisticsCommand returns time-series data for a single metric. It requires a namespace, metric name, time window, and period (the granularity of each data point in seconds). Dimensions narrow the query to a specific resource — for example, InstanceId for EC2 CPUUtilization or FunctionName for Lambda errors.

import { z } from 'zod';
import {
  GetMetricStatisticsCommand,
  Statistic
} from '@aws-sdk/client-cloudwatch';
import { cw } from './aws-clients.js';

server.tool(
  'query_cloudwatch_metric',
  {
    namespace: z.string().min(1),        // e.g. 'AWS/EC2', 'AWS/Lambda', 'AWS/RDS'
    metric_name: z.string().min(1),      // e.g. 'CPUUtilization', 'Errors', 'Duration'
    dimension_name: z.string().optional(),   // e.g. 'InstanceId', 'FunctionName'
    dimension_value: z.string().optional(),  // e.g. 'i-0abc123', 'my-function'
    start_time: z.string().datetime(),   // ISO-8601, e.g. '2026-07-04T00:00:00Z'
    end_time: z.string().datetime(),
    period_seconds: z.number().int().min(60).multipleOf(60).default(300),  // standard resolution: min 60s, multiple of 60
    statistic: z.enum(['Average', 'Sum', 'Minimum', 'Maximum', 'SampleCount']).default('Average')
  },
  async ({ namespace, metric_name, dimension_name, dimension_value,
           start_time, end_time, period_seconds, statistic }) => {
    const dimensions = dimension_name && dimension_value
      ? [{ Name: dimension_name, Value: dimension_value }]
      : [];

    const result = await cw.send(new GetMetricStatisticsCommand({
      Namespace: namespace,
      MetricName: metric_name,
      Dimensions: dimensions,
      StartTime: new Date(start_time),
      EndTime: new Date(end_time),
      Period: period_seconds,
      Statistics: [statistic as Statistic]
    }));

    // Sort by timestamp ascending (CloudWatch returns in arbitrary order)
    const datapoints = (result.Datapoints ?? [])
      .sort((a, b) => (a.Timestamp?.getTime() ?? 0) - (b.Timestamp?.getTime() ?? 0))
      .map((dp) => ({
        timestamp: dp.Timestamp?.toISOString(),
        value: dp[statistic as keyof typeof dp],
        unit: dp.Unit
      }));

    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          namespace,
          metric_name,
          statistic,
          period_seconds,
          datapoints,
          datapoint_count: datapoints.length
        }, null, 2)
      }]
    };
  }
);

CloudWatch returns datapoints in arbitrary order, so always sort by Timestamp in your tool before returning. For standard-resolution metrics the Period must be a multiple of 60 seconds. For high-resolution metrics published at sub-minute intervals, GetMetricDataCommand with a MetricDataQuery is the better fit: it supports periods as low as 1 second for metrics stored at high resolution and can fetch several metrics in one call.
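GetMetricStatistics also caps a single response at 1,440 datapoints, so a long time window combined with a short period is rejected. The sketch below shows one way to widen the period before calling the API. choosePeriodSeconds is a hypothetical helper, not an SDK function, and it assumes a standard-resolution metric.

```typescript
// Hypothetical helper (not part of the AWS SDK).
// 1,440 is the documented per-request datapoint cap for GetMetricStatistics.
const MAX_DATAPOINTS = 1440;

function choosePeriodSeconds(start: Date, end: Date, requestedPeriod = 300): number {
  const windowSeconds = Math.ceil((end.getTime() - start.getTime()) / 1000);
  if (windowSeconds <= 0) {
    throw new RangeError('end must be after start');
  }

  // Smallest period that keeps the whole window within the datapoint cap.
  const minPeriod = Math.ceil(windowSeconds / MAX_DATAPOINTS);
  const period = Math.max(requestedPeriod, minPeriod, 60);

  // Round up to the next multiple of 60 (standard-resolution requirement).
  return Math.ceil(period / 60) * 60;
}
```

In the tool handler, the result would replace period_seconds in the command input, and returning the adjusted value in the response lets the caller see that the granularity changed.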

list_alarms and get_alarm_history tools

CloudWatch Alarms are the primary alerting mechanism for AWS metrics. The DescribeAlarmsCommand lists alarms with optional state and name prefix filters. Alarm state transitions are logged in alarm history.

import {
  DescribeAlarmsCommand,
  DescribeAlarmHistoryCommand,
  StateValue
} from '@aws-sdk/client-cloudwatch';

server.tool(
  'list_cloudwatch_alarms',
  {
    state: z.enum(['OK', 'ALARM', 'INSUFFICIENT_DATA']).optional(),
    name_prefix: z.string().optional(),
    max_results: z.number().int().min(1).max(100).default(20)
  },
  async ({ state, name_prefix, max_results }) => {
    const result = await cw.send(new DescribeAlarmsCommand({
      StateValue: state as StateValue | undefined,
      AlarmNamePrefix: name_prefix,
      MaxRecords: max_results
    }));

    const alarms = (result.MetricAlarms ?? []).map((alarm) => ({
      name: alarm.AlarmName,
      state: alarm.StateValue,
      reason: alarm.StateReason,
      updated_at: alarm.StateUpdatedTimestamp?.toISOString(),
      namespace: alarm.Namespace,
      metric: alarm.MetricName,
      threshold: alarm.Threshold,
      comparison: alarm.ComparisonOperator
    }));

    return {
      content: [{
        type: 'text',
        text: JSON.stringify({ alarms, count: alarms.length }, null, 2)
      }]
    };
  }
);

server.tool(
  'get_alarm_history',
  {
    alarm_name: z.string().min(1),
    start_time: z.string().datetime().optional(),
    end_time: z.string().datetime().optional()
  },
  async ({ alarm_name, start_time, end_time }) => {
    const result = await cw.send(new DescribeAlarmHistoryCommand({
      AlarmName: alarm_name,
      StartDate: start_time ? new Date(start_time) : undefined,
      EndDate: end_time ? new Date(end_time) : undefined,
      MaxRecords: 50
    }));

    const events = (result.AlarmHistoryItems ?? []).map((item) => ({
      timestamp: item.Timestamp?.toISOString(),
      type: item.HistoryItemType,
      summary: item.HistorySummary
    }));

    return {
      content: [{
        type: 'text',
        text: JSON.stringify({ alarm_name, events }, null, 2)
      }]
    };
  }
);

Alarm state INSUFFICIENT_DATA means CloudWatch has not received enough data points to evaluate the alarm — this is common when a metric is published at a lower frequency than the alarm's evaluation period. It does not mean the resource is down; the alarm is simply waiting for data.
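Because INSUFFICIENT_DATA and ALARM call for different reactions, it can help to group alarms by state before handing them to the agent. The following is a minimal sketch. AlarmSummary and groupAlarmsByState are hypothetical names, and the input is assumed to be the mapped output of list_cloudwatch_alarms with the state always present.

```typescript
type AlarmState = 'OK' | 'ALARM' | 'INSUFFICIENT_DATA';

interface AlarmSummary {
  name: string;
  state: AlarmState;
}

// Hypothetical helper: groups alarm names by state so firing alarms can be
// separated from alarms that are only waiting for data.
function groupAlarmsByState(alarms: AlarmSummary[]): Record<AlarmState, string[]> {
  const groups: Record<AlarmState, string[]> = {
    ALARM: [],
    INSUFFICIENT_DATA: [],
    OK: []
  };
  for (const alarm of alarms) {
    groups[alarm.state].push(alarm.name);
  }
  return groups;
}
```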

Logs Insights: start_query and poll for results

CloudWatch Logs Insights runs queries asynchronously, using its own pipe-delimited query language. The pattern is: StartQueryCommand to submit the query and get a query ID, then poll GetQueryResultsCommand until the status is Complete or another terminal state (Failed, Cancelled, or Timeout). Do not use FilterLogEventsCommand for complex queries: it is a simple text filter that scans the log stream sequentially and returns raw log events without aggregation or field parsing.

import {
  StartQueryCommand,
  GetQueryResultsCommand
} from '@aws-sdk/client-cloudwatch-logs';
import { cwl } from './aws-clients.js';

server.tool(
  'query_cloudwatch_logs',
  {
    log_group: z.string().min(1),       // e.g. '/aws/lambda/my-function'
    query_string: z.string().min(1),    // Logs Insights syntax
    start_time: z.string().datetime(),
    end_time: z.string().datetime(),
    limit: z.number().int().min(1).max(1000).default(100)
  },
  async ({ log_group, query_string, start_time, end_time, limit }) => {
    // Submit the query
    const startResult = await cwl.send(new StartQueryCommand({
      logGroupName: log_group,
      queryString: query_string,
      startTime: Math.floor(new Date(start_time).getTime() / 1000),
      endTime: Math.floor(new Date(end_time).getTime() / 1000),
      limit
    }));

    const queryId = startResult.queryId;
    if (!queryId) throw new Error('CloudWatch Logs Insights did not return a query ID');

    // Poll for completion (up to 30 seconds)
    let status = 'Running';
    let results: Record<string, string>[] = [];
    const deadline = Date.now() + 30_000;

    while (status === 'Running' || status === 'Scheduled') {
      if (Date.now() > deadline) {
        return {
          content: [{
            type: 'text',
            text: JSON.stringify({ error: 'Query timed out after 30s', queryId })
          }]
        };
      }
      await new Promise((r) => setTimeout(r, 1500));

      const pollResult = await cwl.send(new GetQueryResultsCommand({ queryId }));
      status = pollResult.status ?? 'Unknown';
      results = (pollResult.results ?? []).map((row) =>
        Object.fromEntries((row ?? []).map((f) => [f.field ?? '', f.value ?? '']))
      );
    }

    return {
      content: [{
        type: 'text',
        text: JSON.stringify({ queryId, status, row_count: results.length, results }, null, 2)
      }]
    };
  }
);
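The handler above polls at a fixed 1.5 second interval. An alternative is to start with short waits and back off, so fast queries return sooner and slow ones generate fewer GetQueryResults calls. The sketch below only computes the wait schedule. pollDelaysMs and its default values are assumptions for illustration, not SDK behavior.

```typescript
// Hypothetical helper: returns the sleep intervals (in ms) to use between
// polls, growing geometrically up to maxMs, and stopping before the total
// wait would exceed budgetMs.
function pollDelaysMs(
  budgetMs = 30_000,
  initialMs = 500,
  factor = 2,
  maxMs = 5_000
): number[] {
  if (initialMs <= 0 || factor < 1) {
    throw new RangeError('initialMs must be > 0 and factor must be >= 1');
  }
  const delays: number[] = [];
  let next = initialMs;
  let elapsed = 0;
  while (elapsed + next <= budgetMs) {
    delays.push(next);
    elapsed += next;
    next = Math.min(next * factor, maxMs);
  }
  return delays;
}
```

A handler could iterate over the returned array, sleep for each delay, call GetQueryResultsCommand, and break as soon as the status is no longer Running or Scheduled. If the array is exhausted first, the query has used up its time budget.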

Logs Insights query example	What it finds
`filter @message like /ERROR/ | stats count() by bin(5m)`	Error count per 5-minute window
`filter @type = "REPORT" | stats avg(@duration) by bin(1h)`	Lambda average duration per hour
`fields @timestamp, @message | sort @timestamp desc | limit 50`	50 most recent log events
`filter statusCode >= 500 | stats count() by requestPath`	500-error count by API route (requires structured JSON logs)
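For plain text search, FilterLogEventsCommand is still the simpler option. It takes startTime and endTime in epoch milliseconds, whereas StartQueryCommand takes epoch seconds, which is an easy source of empty results. The sketch below builds the input for a hypothetical filter_log_events tool. FilterLogEventsParams and buildFilterLogEventsParams are illustrative names covering only a subset of the real input fields.

```typescript
// Subset of the FilterLogEvents input fields used by this sketch.
interface FilterLogEventsParams {
  logGroupName: string;
  filterPattern?: string;
  startTime: number; // epoch milliseconds
  endTime: number;   // epoch milliseconds
  limit: number;
}

// Hypothetical helper: validates the time window and converts ISO-8601
// strings to the millisecond timestamps FilterLogEvents expects.
function buildFilterLogEventsParams(
  logGroup: string,
  startIso: string,
  endIso: string,
  filterPattern?: string,
  limit = 100
): FilterLogEventsParams {
  const startTime = Date.parse(startIso);
  const endTime = Date.parse(endIso);
  if (Number.isNaN(startTime) || Number.isNaN(endTime)) {
    throw new RangeError('start and end must be valid ISO-8601 timestamps');
  }
  if (endTime <= startTime) {
    throw new RangeError('end must be after start');
  }
  return {
    logGroupName: logGroup,
    ...(filterPattern ? { filterPattern } : {}),
    startTime,
    endTime,
    // FilterLogEvents accepts a limit between 1 and 10,000 events.
    limit: Math.min(Math.max(Math.trunc(limit), 1), 10_000)
  };
}
```

The returned object can be passed to new FilterLogEventsCommand(...) and sent with the shared cwl client. The response includes a nextToken when more events are available.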

Health endpoint: /health/cloudwatch

The most common failure mode for CloudWatch MCP tools is expired IAM credentials. A process-level HTTP check passes even when the IAM session token has expired — the process is running fine, but every CloudWatch API call will return a 403 ExpiredTokenException. A lightweight CloudWatch API probe catches this before tool callers see errors.

import express from 'express';
import { ListMetricsCommand } from '@aws-sdk/client-cloudwatch';
import { cw } from './aws-clients.js';  // shared singleton client from the SDK setup section

const app = express();

app.get('/health/cloudwatch', async (_req, res) => {
  const start = Date.now();
  try {
    // Minimal IAM action: cloudwatch:ListMetrics
    // Returns quickly when namespaced to a specific service
    await cw.send(new ListMetricsCommand({
      Namespace: 'AWS/EC2',
      RecentlyActive: 'PT3H'  // only metrics active in last 3 hours — limits response size
    }));

    res.status(200).json({
      status: 'ok',
      latency_ms: Date.now() - start,
      region: await cw.config.region()  // region the shared client resolved
    });
  } catch (err) {
    const error = err as { name?: string; message?: string };
    res.status(503).json({
      status: 'error',
      error_type: error.name,   // 'ExpiredTokenException', 'AccessDeniedException', etc.
      error: error.message,
      latency_ms: Date.now() - start
    });
  }
});

app.listen(3001);

Failure mode	error_type in /health response	Action
IAM session token expired	`ExpiredTokenException`	Rotate credentials or restart task to refresh instance profile
Wrong IAM permissions	`AccessDeniedException`	Add `cloudwatch:ListMetrics` to IAM policy
Regional endpoint unreachable	`NetworkingError`	Check VPC endpoint or NAT gateway for outbound HTTPS
API throttling	`ThrottlingException`	Back off; the health check is probing CloudWatch too frequently
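A monitor or wrapper script can turn the error_type field into a next step. The sketch below mirrors the table above. HealthAction and classifyHealthError are hypothetical names, and the error names are the ones listed in the table, so verify them against what your SDK version actually reports.

```typescript
type HealthAction =
  | 'refresh_credentials'
  | 'fix_iam_policy'
  | 'check_network'
  | 'back_off'
  | 'investigate';

// Hypothetical helper: maps the error_type from /health/cloudwatch to an action.
// Anything not listed in the table falls through to 'investigate'.
function classifyHealthError(errorType: string | undefined): HealthAction {
  switch (errorType) {
    case 'ExpiredTokenException':
      return 'refresh_credentials';
    case 'AccessDeniedException':
      return 'fix_iam_policy';
    case 'NetworkingError':
      return 'check_network';
    case 'ThrottlingException':
      return 'back_off';
    default:
      return 'investigate';
  }
}
```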

Minimum IAM policy

CloudWatch MCP tools should run with the minimum IAM permissions they need. The policy below covers every tool in this guide, with metric and alarm actions grouped separately from log actions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CloudWatchMetricsReadOnly",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:DescribeAlarmHistory"
      ],
      "Resource": "*"
    },
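    {
      "Sid": "CloudWatchLogsReadOnly",
      "Effect": "Allow",
      "Action": [
        "logs:FilterLogEvents",
        "logs:StartQuery",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:*"
    },
    {
      "Sid": "CloudWatchLogsQueryResults",
      "Effect": "Allow",
      "Action": [
        "logs:GetQueryResults"
      ],
      "Resource": "*"
    }
  ]
}

Do not grant cloudwatch:PutMetricAlarm or logs:PutLogEvents unless your tools explicitly need to create alarms or write logs. Read-only access is sufficient for all query and alert-inspection use cases. logs:GetQueryResults sits in its own statement because it does not support resource-level permissions and therefore needs "Resource": "*".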
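The log-group resource in the policy above matches every log group in the account. If the server only needs a known set of groups, the ARN can be narrowed to a region, account, and name prefix. The sketch below generates such a statement. PolicyStatement and scopedLogsStatement are hypothetical names, and the account ID and prefix in the usage note are placeholders.

```typescript
interface PolicyStatement {
  Sid: string;
  Effect: 'Allow';
  Action: string[];
  Resource: string[];
}

// Hypothetical helper: builds a read-only statement limited to log groups
// whose names start with logGroupPrefix. The trailing * also covers the
// ":*" suffix that log group ARNs carry in IAM policies.
function scopedLogsStatement(
  region: string,
  accountId: string,
  logGroupPrefix: string
): PolicyStatement {
  if (!/^\d{12}$/.test(accountId)) {
    throw new RangeError('accountId must be a 12-digit AWS account ID');
  }
  return {
    Sid: 'CloudWatchLogsScopedRead',
    Effect: 'Allow',
    Action: ['logs:FilterLogEvents', 'logs:StartQuery'],
    Resource: [`arn:aws:logs:${region}:${accountId}:log-group:${logGroupPrefix}*`]
  };
}
```

For example, scopedLogsStatement('us-east-1', '123456789012', '/aws/lambda/') limits log search and Insights queries to Lambda log groups in that one account and region.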

Frequently asked questions

What is the difference between GetMetricStatistics and GetMetricData?

GetMetricStatistics queries a single metric and returns an array of datapoints with statistics (Average, Sum, etc.) for each period. GetMetricData queries multiple metrics in one call using metric math expressions, supports higher-resolution data (sub-minute), and allows computed metrics like error rate = errors / requests. Use GetMetricStatistics for simple single-metric lookups; use GetMetricData when you need multiple metrics, math expressions, or sub-minute resolution.

How do I query multiple log groups in one Logs Insights query?

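Use the logGroupNames array parameter on StartQueryCommand instead of logGroupName (singular). You can query up to 50 log groups in a single Insights query. To reference log groups by ARN, for example groups in a linked source account, use logGroupIdentifiers instead. Only one of logGroupName, logGroupNames, or logGroupIdentifiers can be set per request. StartQueryCommand has no pattern-matching parameter, so to cover log groups that share a prefix (e.g., all Lambda function logs under /aws/lambda/), list them first with DescribeLogGroupsCommand and its logGroupNamePrefix filter, then pass the resulting names.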
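The one-parameter-per-request rule and the 50-group limit can be checked before the request is built. The sketch below picks the singular or plural field from a list of names. LogGroupSelector and logGroupSelector are hypothetical names, and the result is meant to be spread into the StartQueryCommand input.

```typescript
type LogGroupSelector = { logGroupName: string } | { logGroupNames: string[] };

// Hypothetical helper: de-duplicates the names, enforces the 50-group limit,
// and returns exactly one of the two mutually exclusive fields.
function logGroupSelector(groups: string[]): LogGroupSelector {
  const unique = Array.from(
    new Set(groups.map((g) => g.trim()).filter((g) => g.length > 0))
  );
  if (unique.length === 0) {
    throw new RangeError('at least one log group is required');
  }
  if (unique.length > 50) {
    throw new RangeError('a single Logs Insights query accepts at most 50 log groups');
  }
  return unique.length === 1
    ? { logGroupName: unique[0] }
    : { logGroupNames: unique };
}
```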

What are CloudWatch's API rate limits?

CloudWatch quotas apply per account per Region, and AWS revises them from time to time, so confirm current values in the Service Quotas console rather than hard-coding them. As a rough guide, GetMetricStatistics is throttled at around 400 requests per second, GetMetricData has its own separate request quota, FilterLogEvents is limited to a small number of requests per second, and Logs Insights caps the number of concurrent queries per account. For MCP tools that make frequent metric queries, consider caching results for 60 seconds in a Map keyed by query parameters to avoid hitting rate limits when multiple agent calls request the same metric in quick succession.
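A minimal sketch of the 60-second Map cache described above. TtlCache and metricCacheKey are hypothetical names, and the clock is injectable only to make the expiry logic easy to test.

```typescript
// Hypothetical in-memory TTL cache backed by a Map.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs = 60_000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired entries are evicted on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Builds a stable key: parameters are sorted so that argument order does not matter.
function metricCacheKey(params: Record<string, string | number | undefined>): string {
  const sorted = Object.keys(params).sort().map((k) => [k, params[k] ?? null]);
  return JSON.stringify(sorted);
}
```

In a tool handler, look up metricCacheKey(args) first and call CloudWatch only on a miss. Entries are removed only when read after expiry, so a long-running server with many distinct queries may also want a size cap.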

Can I use CloudWatch with cross-account assume-role?

Yes. Pass an STSClient-assumed role's credentials to the CloudWatchClient constructor: new CloudWatchClient({ region, credentials: fromTemporaryCredentials({ params: { RoleArn, RoleSessionName } }) }) using @aws-sdk/credential-providers. The temporary credentials are automatically refreshed by the SDK before they expire. Use this pattern when your MCP server needs to query CloudWatch in a different AWS account than where it runs.