MCP server observability

Observability is the ability to understand the internal state of a system from its external outputs. For MCP servers, this means knowing what happened when an agent session failed (logs), knowing whether performance is trending toward failure (metrics), and knowing where in a multi-server workflow a slow request spent its time (traces). External monitoring with probe-based uptime checks is a fourth pillar that the classic three-pillar model leaves out: it is the only signal taken from the user's perspective rather than the server's.

TL;DR

The MCP-adapted observability stack: structured JSON logs (every initialize, tools/list, and tool call request, with session ID, tool name, latency, and error code); metrics for four key signals (request rate, error rate per layer, latency percentiles, active sessions); distributed traces spanning the full agent session → MCP server → downstream API chain; and external probe monitoring for the outside-in view that internal instrumentation can't see (network reachability, SSL expiry, cold-start patterns). Start with logs and external probing: they add almost no infrastructure cost and cover 80% of incident investigations. Add metrics and traces as traffic grows.

Why standard observability frameworks need adaptation for MCP

Standard web service observability (OpenTelemetry, Prometheus, structured logging) was designed for request/response APIs: one request, one response, one latency measurement, one error or success. MCP has a different shape:

Session-level operations: each agent connection involves an initialize handshake, a tools/list fetch, and then N tool calls — potentially interleaved with other sessions. Latency and error signals need to be attributed to the correct phase (initialize vs. tools/list vs. tool call) to be useful.
Protocol-layer independence: the four MCP layers (transport, HTTP, initialize, tools/list) can fail independently. A standard HTTP error rate metric aggregates all errors into a single number; MCP observability requires per-layer error tracking.
Tool surface as a schema: the tools/list response defines the MCP server's "API surface." Observability for MCP includes tracking schema changes over time — a tools/list response that shrinks unexpectedly (fewer tools) is an important signal that standard metrics don't capture.
Stateless vs. stateful sessions: individual HTTP requests are stateless, but MCP sessions are stateful (the agent keeps the context established by initialize across multiple tool calls). Distributed traces need to span the entire session, not just individual HTTP requests.

Pillar 1: Structured logs

Logs are the highest-value first investment for MCP server observability. They require no external infrastructure and cover the most common post-incident question: "what exactly happened during that session?"

What to log

Every log entry should be structured JSON (not free-text), emitted to stdout, and include at minimum:

timestamp: ISO 8601 with milliseconds
level: info / warn / error
event: the thing that happened (initialize_request, tools_list_response, tool_call_start, tool_call_complete, tool_call_error, session_close)
session_id: unique ID for the agent session (the initialize handshake should assign this)
duration_ms: for every operation with measurable latency
error_code: JSON-RPC error code or HTTP status, for error events
tool_name: for tool call events
client_id: the agent client identifier from the initialize request, if available
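As a rough illustration, the fields above could be emitted with a small helper like the following. This is a minimal sketch using only the Python standard library; the names build_log_entry and log_event are hypothetical and not part of any MCP SDK.

```python
import json
import sys
from datetime import datetime, timezone


def build_log_entry(event, session_id, level="info", **fields):
    """Build one structured log entry as a dict (hypothetical helper)."""
    entry = {
        # ISO 8601 timestamp with millisecond precision, in UTC
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "level": level,
        "event": event,
        "session_id": session_id,
    }
    # Optional fields (duration_ms, error_code, tool_name, client_id) are
    # included only when the caller supplies a non-None value.
    entry.update({k: v for k, v in fields.items() if v is not None})
    return entry


def log_event(event, session_id, level="info", stream=sys.stdout, **fields):
    """Write the entry as a single JSON line (stdout by default)."""
    entry = build_log_entry(event, session_id, level, **fields)
    stream.write(json.dumps(entry) + "\n")
    return entry
```

A call such as log_event("tool_call_complete", "sess-123", tool_name="search", duration_ms=182) would then write one JSON line that a log aggregator can parse without custom patterns.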

What NOT to log

Never log tool call arguments or results in plaintext — they may contain user data, credentials, or PII. Log the tool name and execution outcome (success/error, latency, error code), not the input/output content. If you need debugging visibility into arguments, use a log level (debug) that's disabled in production and requires explicit opt-in per-session.
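One way to enforce this rule is to build the tool call log fields in a single place that drops arguments by default. The sketch below is an assumption about how that could look; the function name tool_call_fields and the debug_opt_in flag are hypothetical.

```python
def tool_call_fields(tool_name, arguments, outcome, duration_ms,
                     error_code=None, debug_opt_in=False):
    """Return the loggable fields for one tool call.

    Arguments are left out unless the session has explicitly opted in
    to debug logging (hypothetical flag).
    """
    fields = {
        "tool_name": tool_name,
        "outcome": outcome,          # "success" or "error"
        "duration_ms": duration_ms,
    }
    if error_code is not None:
        fields["error_code"] = error_code
    if debug_opt_in:
        # Only reached for sessions with explicit debug opt-in.
        fields["arguments"] = arguments
    return fields
```

Routing every tool call through one function like this makes it harder for a stray log statement to leak input content.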

Log retention and storage

For early-stage MCP servers, pipe stdout logs to a file or a log aggregator (CloudWatch Logs, GCP Cloud Logging, Datadog Logs, Logtail). Retain at least 30 days: most incident investigations happen within 24 hours, but SLO reviews need 30-day history. At low traffic (<10k sessions/day), log storage costs are negligible. At high traffic, consider logging every error event but only a sample of success events, for example 1 in 100.
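The high-traffic sampling rule can be sketched as a small predicate. The function name should_log is hypothetical, and always keeping warn events alongside error events is an assumption the text above does not settle.

```python
import random


def should_log(level, sample_rate=0.01, rng=random.random):
    """Decide whether to emit a log entry under high-traffic sampling.

    Error events are always kept. Keeping warn events too is an
    assumption. Everything else is kept with probability sample_rate
    (0.01 is roughly 1 in 100).
    """
    if level in ("warn", "error"):
        return True
    return rng() < sample_rate
```

The rng parameter exists only so the behaviour can be tested deterministically; in normal use the default random source is enough.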

Pillar 2: Metrics

Metrics are aggregated numerical signals over time — the foundation of dashboards and SLO tracking. The key MCP server metrics to instrument:

The four golden signals for MCP

Request rate (per layer): how many initialize requests per minute, how many tool calls per minute per tool. Rate changes indicate load changes or upstream agent behavior changes.
Error rate (per layer): fraction of requests failing at each MCP protocol layer. Track separately for transport, HTTP, initialize, and tools/list. See MCP server error rate for the measurement model.
Latency (p50, p95, p99 per operation): time to complete initialize, time to complete tools/list, time to complete each tool call by name. Separate percentile distributions for each tool — some tools are inherently slow (I/O-bound); others should always be fast. See MCP server latency.
Active sessions: how many active agent sessions at any moment. Sudden drops indicate sessions are timing out or being dropped. Sudden spikes may indicate runaway agent behavior or a load test.
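To make the per-layer error rate and latency percentile definitions concrete, here is a minimal sketch that computes them from raw samples with the Python standard library. A production setup would normally use a metrics library with histograms instead; the function names and the (layer, ok) record shape are assumptions for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def error_rate_per_layer(records):
    """records: iterable of (layer, ok) pairs -> {layer: fraction failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for layer, ok in records:
        totals[layer] += 1
        if not ok:
            failures[layer] += 1
    return {layer: failures[layer] / totals[layer] for layer in totals}


def latency_percentiles(durations_ms):
    """Return p50, p95, and p99 for one operation's latency samples.

    Needs at least two samples.
    """
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Calling latency_percentiles once per tool name, rather than once for all tool calls together, gives the separate distributions described above.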

MCP-specific metrics beyond the four signals

Tool surface size: count of tools returned by tools/list. Alert if this drops unexpectedly — a deployment that accidentally removes tools will show here before users complain.
Tool schema hash: a hash of the full tools/list JSON. Changes indicate schema drift. Tracking this as a metric (or logging it on every tools/list response) creates an audit trail of every tool schema change.
Downstream dependency error rate: if your tools call external APIs, instrument each call and track its error rate separately from your overall tool error rate. Separates "our server failed" from "the downstream API failed."
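Tool surface size and the schema hash can both be derived from a single tools/list result. The sketch below assumes the result is a dict with a "tools" array whose entries have a "name" field; sorting the tools and keys before hashing is a design choice of this example (so that reordering alone does not register as drift), not something the protocol prescribes.

```python
import hashlib
import json


def tool_surface_metrics(tools_list_result):
    """Return the tool count and a stable hash for a tools/list result."""
    tools = tools_list_result.get("tools", [])
    # Sort tools by name and serialize with sorted keys so that ordering
    # differences alone do not change the hash.
    canonical = json.dumps(
        sorted(tools, key=lambda t: t.get("name", "")),
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"tool_count": len(tools), "schema_hash": digest}
```

Comparing schema_hash against the previously recorded value on each tools/list response is enough to detect a change; logging both values leaves the audit trail.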

Instrumentation approaches

For Node.js MCP servers, use the Prometheus client library (prom-client) with custom counters and histograms, exposed on a /metrics endpoint. For Python, use prometheus_client. For serverless platforms (Lambda, Cloud Run), publish custom metrics through CloudWatch or GCP custom metrics instead, since Prometheus scraping doesn't work well with short-lived, stateless instances. The OpenTelemetry SDK works across all runtimes and handles correlation between the metrics, traces, and logs signals.

See Prometheus MCP monitoring for the scraping and alerting setup.

Pillar 3: Distributed traces

Traces answer the question: "where did this specific agent session spend its time?" They're most valuable for MCP servers that call multiple downstream services per tool invocation — without traces, a 2-second tool call might be slow because of your server, a database query, an external API call, or an LLM inference call. Traces show you exactly which span consumed the time.

MCP trace structure

A well-instrumented MCP session produces a trace with the following span structure:

agent_session (root span)
  ├── mcp_initialize (span, ~200-500ms)
  ├── mcp_tools_list (span, ~50-300ms)
  └── mcp_tool_call: tool_name (span, per call)
        ├── db_query: table_name (child span)
        ├── external_api_call: api.example.com (child span)
        └── llm_inference: claude-sonnet-4-6 (child span, if applicable)

Each span carries: trace ID (unique per session), span ID, parent span ID, start time, duration, status (OK/ERROR), and relevant attributes (tool name, error code, HTTP status).
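The span fields listed above map naturally onto a small data structure. The following is an in-memory sketch only, with a hypothetical Span class; a real deployment would use a tracing SDK such as OpenTelemetry rather than hand-rolled spans.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_time: float = field(default_factory=time.time)
    duration_ms: Optional[float] = None
    status: str = "OK"
    attributes: dict = field(default_factory=dict)

    def child(self, name, **attributes):
        """Create a child span that shares this span's trace ID."""
        return Span(name=name, trace_id=self.trace_id,
                    parent_span_id=self.span_id, attributes=attributes)

    def end(self, status="OK"):
        """Record the duration and final status (OK or ERROR)."""
        self.duration_ms = (time.time() - self.start_time) * 1000
        self.status = status
        return self


# One trace per session: root span, a tool call, and a downstream query.
root = Span(name="agent_session", trace_id=uuid.uuid4().hex)
call = root.child("mcp_tool_call", tool_name="search")
db = call.child("db_query", table="documents")
db.end()
call.end()
root.end()
```

Every span in the example shares root.trace_id, and each child's parent_span_id points at the span that created it, which is what lets a trace backend rebuild the tree.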

Propagating trace context through MCP

MCP does not currently have a standardized trace context propagation mechanism in the protocol spec. Practical approaches:

HTTP header propagation: agent clients can pass W3C traceparent headers in the HTTP requests to the MCP server. The server extracts the trace context and creates child spans. This requires the agent client to support trace context injection — current MCP SDKs vary in support.
Session ID correlation: a simpler approach is to include the session ID in all log and metric entries, and use it to correlate logs, metrics, and traces from the same session. This is less powerful than full trace propagation but requires no coordination with the agent client.
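Both approaches can be combined on the server side: use the traceparent header when the client sends one, and fall back to a fresh trace keyed by session ID when it does not. The sketch below parses the version 00 traceparent format (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags). The function names are hypothetical, and headers is assumed to be a dict with lowercased keys.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Parse a W3C traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip()) if header else None
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero trace or parent IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def server_span_context(headers, session_id):
    """Build the context for the server's span for one incoming request."""
    parent = parse_traceparent(headers.get("traceparent"))
    return {
        # Continue the client's trace if present, otherwise start a new one.
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "parent_span_id": parent["parent_span_id"] if parent else None,
        "span_id": secrets.token_hex(8),
        "session_id": session_id,
    }
```

Because session_id is always present in the returned context, logs and spans stay correlatable even for clients that never send trace headers.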

Traces are the highest-infrastructure-cost observability pillar, because they require a trace backend (Jaeger, Zipkin, Honeycomb, Datadog APM, GCP Cloud Trace). For early-stage MCP servers, logs plus external monitoring cover the critical use cases. Add traces when you're debugging latency issues in complex multi-hop tool calls.

Pillar 4: External probe monitoring

Internal instrumentation — logs, metrics, traces — requires your server to be running and responding to emit signals. When the server is completely down, your internal observability goes dark. External probe monitoring is the only signal visible from outside the server: "is this server reachable and responding correctly from the user's network perspective?"

External probing covers failure modes that internal instrumentation cannot:

Network-level reachability (TCP connection to the server's IP and port)
TLS certificate validity and expiry
DNS resolution (is the domain resolving to the right IP?)
CDN/proxy layer failures (Cloudflare down, wrong SSL certificate at the edge)
Complete process crash before any logs are emitted

The four-layer MCP probe (transport → HTTP → initialize → tools/list) is also an outside-in functional test: it verifies the full user-facing path, not just internal component health. A server whose internal metrics look healthy but whose tools/list check fails from the external probe origin has a gap between its internal and external views that internal metrics alone would miss.
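The layered order matters because it attributes a failure to the first layer that breaks. A minimal sketch of that control flow follows; the actual checks are injected as callables, since how each layer is tested (opening a TCP connection, sending an HTTP request, performing the initialize handshake, calling tools/list) depends on the deployment. The names LAYERS and run_probe are hypothetical.

```python
LAYERS = ("transport", "http", "initialize", "tools/list")


def run_probe(checks):
    """Run layer checks in order and report the first layer that fails.

    `checks` maps each layer name to a zero-argument callable that
    returns normally on success and raises an exception on failure.
    """
    for layer in LAYERS:
        try:
            checks[layer]()
        except Exception as exc:
            # Stop at the first failure: later layers depend on this one.
            return {"healthy": False, "failed_layer": layer, "error": str(exc)}
    return {"healthy": True, "failed_layer": None, "error": None}
```

Recording failed_layer with each probe result is what lets an alert say "initialize is failing" instead of only "the server is down".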

Combine internal instrumentation (logs and metrics for diagnosis and trend tracking) with external probe monitoring (AliveMCP for availability and functional health). They're complementary, not competing. See MCP server monitoring dashboard for how to visualize both signal types together.