MCP server observability

Observability is the ability to understand the internal state of a system from its external outputs. For MCP servers, this means: knowing what happened when an agent session failed (logs), knowing whether performance is trending toward failure (metrics), and knowing where in a multi-server workflow a slow request spent its time (traces). External monitoring with probe-based uptime checks is a fourth pillar that observability textbooks don't cover — it's the only signal visible from the user's perspective, not the server's perspective.

TL;DR

The MCP-adapted observability stack: structured JSON logs (every initialize, tools/list, and tools/call request, with session ID, tool name, latency, and error code); metrics for four key signals (request rate, error rate per layer, latency percentiles, active sessions); distributed traces spanning the full agent session → MCP server → downstream API chain; and external probe monitoring for the outside-in view that internal instrumentation can't see (network reachability, TLS certificate expiry, cold-start patterns). Start with logs and external probing: they need no extra server-side infrastructure and cover 80% of incident investigations. Add metrics and traces as traffic grows.

Why standard observability frameworks need adaptation for MCP

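Standard web service observability (OpenTelemetry, Prometheus, structured logging) was designed for request/response APIs: one request, one response, one latency measurement, one error or success. MCP has a different shape: a session opens with an initialize handshake, discovers tools through tools/list, and then makes any number of tool calls, each of which may fan out to downstream services. A failure or slowdown can therefore sit at any of several layers within a single session.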

Pillar 1: Structured logs

Logs are the highest-value first investment for MCP server observability. They require no external infrastructure and cover the most common post-incident question: "what exactly happened during that session?"

What to log

Every log entry should be structured JSON (not free text), emitted to stdout, and include at minimum the fields listed in the TL;DR: session ID, tool name, latency, and error code.

What NOT to log

Never log tool call arguments or results in plaintext — they may contain user data, credentials, or PII. Log the tool name and execution outcome (success/error, latency, error code), not the input/output content. If you need debugging visibility into arguments, use a log level (debug) that's disabled in production and requires explicit opt-in per-session.
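As a minimal sketch of the two logging sections above, the formatter below emits one JSON object per line and copies only an allow-list of fields, so tool arguments and results cannot leak into the log even if a caller passes them. The logger name, the `ALLOWED_FIELDS` tuple, and the field names are assumptions made for illustration, not part of MCP or any SDK.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Hypothetical allow-list: only these fields are ever written to the log.
ALLOWED_FIELDS = ("session_id", "method", "tool_name", "outcome", "latency_ms", "error_code")


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Copy allow-listed fields only; anything else passed via `extra` is dropped.
        for name in ALLOWED_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes structured JSON lines to `stream`."""
    logger = logging.getLogger("mcp_server")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("tool_call", extra={"session_id": "s-1", "tool_name": "search", "outcome": "success", "latency_ms": 42})` then produces one JSON line on stdout. An allow-list is used rather than a deny-list so that a newly added field stays out of the log until someone decides it is safe.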

Log retention and storage

For early-stage MCP servers: pipe stdout logs to a file or a log aggregator (CloudWatch Logs, GCP Cloud Logging, Datadog Logs, Logtail). Retain 30 days minimum: most incident investigations happen within 24 hours, but SLO reviews need 30-day history. At low traffic (<10k sessions/day), log storage costs are negligible. At high traffic, consider keeping every error event and sampling success events at a 1-in-100 rate.
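The high-traffic sampling rule can be sketched in a few lines: errors are always kept, successes are kept with probability 0.01. The function name and the `outcome` values are illustrative assumptions.

```python
import random
from typing import Optional

SUCCESS_SAMPLE_RATE = 0.01  # 1-in-100, the rate suggested above


def should_log(outcome: str, rng: Optional[random.Random] = None) -> bool:
    """Keep every non-success event; keep roughly 1 in 100 successes."""
    if outcome != "success":
        return True
    source = rng if rng is not None else random
    return source.random() < SUCCESS_SAMPLE_RATE
```

If you sample this way, it helps to write the sample rate into each kept entry so that success counts can be scaled back up during analysis.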

Pillar 2: Metrics

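Metrics are aggregated numerical signals over time, and they are the foundation of dashboards and SLO tracking. The key MCP server metrics to instrument are the four signals listed in the TL;DR: request rate, error rate per layer, latency percentiles, and active sessions.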

The four golden signals for MCP

MCP-specific metrics beyond the four signals

Instrumentation approaches

For Node.js MCP servers: the Prometheus client library (prom-client) with custom counters and histograms, exposing a /metrics endpoint. For Python: prometheus_client. For serverless (Lambda, Cloud Run): push custom metrics to CloudWatch or Google Cloud Monitoring, since Prometheus's pull-based scraping fits poorly with short-lived instances that may scale to zero between scrapes. The OpenTelemetry SDK works across all runtimes and handles correlation between the metrics, traces, and logs signals.
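To make the two metric types concrete, here is a library-free sketch of what a counter and a cumulative-bucket latency histogram record. It is not the prom-client or prometheus_client API; in practice those libraries provide these types for you. The bucket bounds and the `record` helper are assumptions for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in seconds; the last bucket catches everything.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf"))


class LatencyHistogram:
    """Counts observations per bucket, where each bucket means 'at most this bound'."""

    def __init__(self, bounds=BUCKETS):
        self.bounds = tuple(bounds)
        self.counts = [0] * len(self.bounds)
        self.total_seconds = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= the observation.
        self.counts[bisect_left(self.bounds, seconds)] += 1
        self.total_seconds += seconds

    def cumulative(self) -> dict:
        """Return {upper_bound: number of observations <= upper_bound}."""
        running, out = 0, {}
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out[bound] = running
        return out


requests_total = Counter()  # keyed by (method, outcome)
latency = LatencyHistogram()


def record(method: str, outcome: str, seconds: float) -> None:
    """Update the request counter and the latency histogram for one request."""
    requests_total[(method, outcome)] += 1
    latency.observe(seconds)
```

Request rate and error rate fall out of the counter (errors divided by total per method), and latency percentiles are estimated from the cumulative bucket counts, which is why bucket bounds should bracket the latencies you care about.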

See Prometheus MCP monitoring for the scraping and alerting setup.

Pillar 3: Distributed traces

Traces answer the question: "where did this specific agent session spend its time?" They're most valuable for MCP servers that call multiple downstream services per tool invocation — without traces, a 2-second tool call might be slow because of your server, a database query, an external API call, or an LLM inference call. Traces show you exactly which span consumed the time.

MCP trace structure

A well-instrumented MCP session produces a trace with the following span structure:

agent_session (root span)
  ├── mcp_initialize (span, ~200-500ms)
  ├── mcp_tools_list (span, ~50-300ms)
  └── mcp_tool_call: tool_name (span, per call)
        ├── db_query: table_name (child span)
        ├── external_api_call: api.example.com (child span)
        └── llm_inference: claude-sonnet-4-6 (child span, if applicable)

Each span carries: trace ID (unique per session), span ID, parent span ID, start time, duration, status (OK/ERROR), and relevant attributes (tool name, error code, HTTP status).
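The span fields listed above map naturally onto a small record type. The sketch below is a hand-rolled illustration of the parent/child relationship, not the OpenTelemetry API; the class, its method names, and the attribute values are assumptions.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str                           # shared by every span in the session
    parent_span_id: Optional[str] = None    # None for the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_time: float = field(default_factory=time.time)
    duration_ms: Optional[float] = None
    status: str = "OK"
    attributes: dict = field(default_factory=dict)

    def child(self, name: str, **attributes) -> "Span":
        """Create a span in the same trace whose parent is this span."""
        return Span(name=name, trace_id=self.trace_id,
                    parent_span_id=self.span_id, attributes=attributes)

    def end(self, status: str = "OK") -> None:
        """Record the duration and final status."""
        self.duration_ms = (time.time() - self.start_time) * 1000
        self.status = status


# Build a fragment of the structure shown above.
root = Span(name="agent_session", trace_id=uuid.uuid4().hex)
call = root.child("mcp_tool_call", tool_name="search")
query = call.child("db_query", table="documents")
query.end()
call.end()
root.end()
```

Because every child copies the trace ID and stores its parent's span ID, a trace backend can rebuild the tree from a flat list of spans.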

Propagating trace context through MCP

MCP does not currently have a standardized trace context propagation mechanism in the protocol spec. Practical approaches:
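One possible approach, sketched here under stated assumptions: carry a W3C Trace Context `traceparent` value inside the `_meta` object of each JSON-RPC request's `params`. The `traceparent` string format is defined by W3C; using `_meta.traceparent` as the carrier is a convention assumed for this example, not something the MCP spec mandates, so client and server must agree on it.

```python
import re
import secrets
from typing import Optional

# W3C traceparent, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Build a traceparent value, generating IDs where none are supplied."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: Optional[str]) -> Optional[dict]:
    """Return the parsed fields, or None if the value is missing or malformed."""
    match = TRACEPARENT_RE.match(value or "")
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"trace_id": trace_id, "parent_span_id": parent_id,
            "sampled": bool(int(flags, 16) & 1)}


def inject(request: dict, traceparent: str) -> dict:
    """Return a copy of a JSON-RPC request with traceparent under params._meta."""
    params = dict(request.get("params", {}))
    meta = dict(params.get("_meta", {}))
    meta["traceparent"] = traceparent  # assumed key name, see lead-in
    params["_meta"] = meta
    return {**request, "params": params}


def extract(request: dict) -> Optional[dict]:
    """Server side: read the trace context back out of an incoming request."""
    meta = request.get("params", {}).get("_meta", {})
    return parse_traceparent(meta.get("traceparent"))
```

On the server, the extracted trace ID becomes the trace ID of the tool-call span, and the extracted parent span ID becomes its parent, which stitches the agent's trace and the server's trace together.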

Traces are the highest-infrastructure-cost observability pillar: they require a trace backend (Jaeger, Zipkin, Honeycomb, Datadog APM, GCP Cloud Trace). For early-stage MCP servers, logs plus external monitoring cover the critical use cases. Add traces when you're debugging latency issues in complex multi-hop tool calls.

Pillar 4: External probe monitoring

Internal instrumentation — logs, metrics, traces — requires your server to be running and responding to emit signals. When the server is completely down, your internal observability goes dark. External probe monitoring is the only signal visible from outside the server: "is this server reachable and responding correctly from the user's network perspective?"

External probing covers failure modes that internal instrumentation cannot, such as loss of network reachability, TLS certificate expiry, and cold-start latency patterns.

The four-layer MCP probe (transport → HTTP → initialize → tools/list) is also an outside-in functional test: it verifies the full user-facing path, not just internal component health. If a server's internal metrics look healthy but its tools/list check fails from the external probe's vantage point, the fault lies somewhere on the user-facing path (for example in DNS, TLS, or routing), which internal metrics cannot see.
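The layered structure of such a probe can be sketched as an ordered list of checks that stops at the first failure, so the result names the layer that broke. This illustrates the idea only and makes no claim about how AliveMCP implements its probe. `run_probe` and `check_transport` are hypothetical names, and the HTTP, initialize, and tools/list checks are left as caller-supplied callables.

```python
import socket
from typing import Callable, Sequence, Tuple


def check_transport(host: str, port: int, timeout: float = 5.0) -> None:
    """Layer 1: raise if a TCP connection cannot be opened."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def run_probe(checks: Sequence[Tuple[str, Callable[[], None]]]) -> dict:
    """Run checks in order; a check signals failure by raising an exception."""
    passed = []
    for layer, check in checks:
        try:
            check()
        except Exception as exc:
            # Later layers depend on earlier ones, so stop at the first failure.
            return {"ok": False, "failed_layer": layer,
                    "passed": passed, "error": str(exc)}
        passed.append(layer)
    return {"ok": True, "failed_layer": None, "passed": passed, "error": None}
```

Reporting the failed layer, and not only up or down, is what makes the probe useful for triage: a transport failure points at the network, while a tools/list failure with earlier layers passing points at the application.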

Combine internal instrumentation (logs and metrics for diagnosis and trend tracking) with external probe monitoring (AliveMCP for availability and functional health). They're complementary, not competing. See MCP server monitoring dashboard for how to visualize both signal types together.

Related questions

What's the minimum viable observability setup for a new MCP server?

Two things: structured JSON logs to stdout (zero infrastructure) and external probe monitoring with AliveMCP (zero server-side code). This covers 80% of incident investigations and gives you availability tracking immediately. Add Prometheus metrics once you have traffic worth analyzing. Add distributed tracing once you have multi-hop tool calls with latency you need to attribute. The mistake is investing in distributed tracing infrastructure before you have enough traffic to make traces statistically useful.

How do I correlate internal logs with AliveMCP probe events?

Match by timestamp: when AliveMCP shows a probe failure at 14:32:05 UTC, look for log entries from your server around the same time. If your logs show nothing around 14:32 (the server emitted no entries), the failure was at the network/transport layer — the probe never reached your process. If logs show entries up to 14:31:58 and then nothing until 14:35:12, the process crashed or OOMed at 14:32 and restarted at 14:35. If logs show 14:32 entries with error responses, the server was up but returning errors — your logs have the error detail that the probe summary doesn't.
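The three cases above can be turned into a rough triage function. This is a sketch under assumptions: log entries are dicts with a `ts` datetime and an `outcome` field, the server writes an entry with `event` set to `"startup"` when it boots (that entry is how the sketch tells a crash from a network failure), and a 5-second window counts as "around the same time". None of these names come from AliveMCP.

```python
from datetime import datetime, timedelta


def classify_probe_failure(probe_time: datetime, entries: list,
                           window: timedelta = timedelta(seconds=5)) -> str:
    """Guess the failure class from server logs around a failed probe."""
    nearby = [e for e in entries if abs(e["ts"] - probe_time) <= window]
    if any(e.get("outcome") == "error" for e in nearby):
        return "server_up_returning_errors"
    if nearby:
        # The server was logging normally; the logs alone do not explain the failure.
        return "inconclusive"
    # Silence around the probe: check whether the next entry is a process start.
    later = [e for e in entries if e["ts"] > probe_time]
    first_after = min(later, key=lambda e: e["ts"], default=None)
    if first_after is not None and first_after.get("event") == "startup":
        return "crash_and_restart"
    return "network_or_transport"
```

On a low-traffic server, silence around the probe is weak evidence on its own, so treat the `network_or_transport` result as a starting hypothesis to confirm against load balancer or DNS logs.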

Do I need OpenTelemetry for MCP observability?

OpenTelemetry is useful if you want vendor-neutral instrumentation that works across multiple backends (send traces to Jaeger today, Honeycomb next year, without re-instrumentation). For early-stage MCP servers, OTel is over-engineered. Start with: structured logging via your runtime's built-in logger, a simple Prometheus counter/histogram setup, and AliveMCP for external monitoring. Migrate to OTel when you have multiple services, multiple teams, or want to consolidate signals in a single observability platform.

How should I instrument tool calls for observability?

Wrap every tool handler in a try/catch that records: start time, end time (for latency), success or error, and error code if applicable. Emit a structured log entry and increment a Prometheus counter/histogram. Keep the instrumentation logic in a decorator or middleware, not duplicated across every tool handler. In Python: a @instrument_tool decorator. In TypeScript: a withInstrumentation(toolName, handler) wrapper function. This pattern means adding a new tool gets instrumentation automatically without per-tool boilerplate.
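A minimal sketch of the Python decorator described above, assuming synchronous handlers. `ToolError` and the `record` callback are hypothetical: `record` is where a counter increment and histogram observation would go, kept as a plain callable here so the example does not depend on any metrics library.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("mcp_server.tools")


class ToolError(Exception):
    """Hypothetical error type carrying a machine-readable code."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code


def instrument_tool(tool_name: str, record=None):
    """Wrap a tool handler with timing, outcome logging, and a metrics hook."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome, error_code = "success", None
            try:
                return handler(*args, **kwargs)
            except Exception as exc:
                outcome = "error"
                # Fall back to the exception class name when no code is attached.
                error_code = getattr(exc, "code", type(exc).__name__)
                raise
            finally:
                entry = {
                    "tool_name": tool_name,
                    "outcome": outcome,
                    "error_code": error_code,
                    "latency_ms": round((time.perf_counter() - start) * 1000, 3),
                }
                # Arguments and results are deliberately left out of the entry.
                logger.info(json.dumps(entry))
                if record is not None:
                    record(entry)
        return wrapper
    return decorator
```

A handler is then declared as `@instrument_tool("search", record=my_metrics_hook)` above its `def`. The exception is re-raised after being recorded, so instrumentation does not change what the caller sees. Async handlers would need an `async def wrapper` variant that awaits the handler.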

Further reading