MCP server observability
Observability is the ability to understand the internal state of a system from its external outputs. For MCP servers, this means: knowing what happened when an agent session failed (logs), knowing whether performance is trending toward failure (metrics), and knowing where in a multi-server workflow a slow request spent its time (traces). External monitoring with probe-based uptime checks is a fourth pillar that observability textbooks don't cover — it's the only signal visible from the user's perspective, not the server's perspective.
TL;DR
The MCP-adapted observability stack has four parts: structured JSON logs (every initialize request, tools/list request, and tool call, with session ID, tool name, latency, and error code); metrics for four key signals (request rate, error rate per layer, latency percentiles, active sessions); distributed traces spanning the full agent session → MCP server → downstream API chain; and external probe monitoring for the outside-in view that internal instrumentation can't see (network reachability, TLS certificate expiry, cold-start patterns). Start with logs and external probing: they need no additional infrastructure and cover 80% of incident investigations. Add metrics and traces as traffic grows.
Why standard observability frameworks need adaptation for MCP
Standard web service observability (OpenTelemetry, Prometheus, structured logging) was designed for request/response APIs: one request, one response, one latency measurement, one error or success. MCP has a different shape:
- Session-level operations: each agent connection involves an initialize handshake, a tools/list fetch, and then N tool calls — potentially interleaved with other sessions. Latency and error signals need to be attributed to the correct phase (initialize vs. tools/list vs. tool call) to be useful.
- Protocol-layer independence: the four MCP layers (transport, HTTP, initialize, tools/list) can fail independently. A standard HTTP error rate metric aggregates all errors into a single number; MCP observability requires per-layer error tracking.
- Tool surface as a schema: the tools/list response defines the MCP server's "API surface." Observability for MCP includes tracking schema changes over time — a tools/list response that shrinks unexpectedly (fewer tools) is an important signal that standard metrics don't capture.
- Stateless vs. stateful sessions: HTTP probes are stateless. MCP sessions are stateful (the agent maintains an initialize context across multiple tool calls). Distributed traces need to span the entire session, not just individual HTTP requests.
Pillar 1: Structured logs
Logs are the highest-value first investment for MCP server observability. They require no external infrastructure and cover the most common post-incident question: "what exactly happened during that session?"
What to log
Every log entry should be structured JSON (not free-text), emitted to stdout, and include at minimum:
- timestamp: ISO 8601 with milliseconds
- level: info / warn / error
- event: the thing that happened (initialize_request, tools_list_response, tool_call_start, tool_call_complete, tool_call_error, session_close)
- session_id: unique ID for the agent session (the initialize handshake should assign this)
- duration_ms: for every operation with measurable latency
- error_code: JSON-RPC error code or HTTP status, for error events
- tool_name: for tool call events
- client_id: the agent client identifier from the initialize request, if available
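The field list above could be emitted with nothing beyond the standard library. The following is a minimal sketch, not a prescribed implementation: the function names `build_log_entry` and `log_event` are hypothetical, and it assumes one JSON object per line on stdout.

```python
import json
import sys
from datetime import datetime, timezone


def build_log_entry(level, event, session_id, **fields):
    """Build one structured log entry as a dict.

    Optional fields (duration_ms, error_code, tool_name, client_id) are
    included only when the caller passes a non-None value.
    """
    now = datetime.now(timezone.utc)
    entry = {
        # ISO 8601 in UTC with millisecond precision, e.g. 2025-01-01T12:00:00.123Z
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "event": event,
        "session_id": session_id,
    }
    entry.update({key: value for key, value in fields.items() if value is not None})
    return entry


def log_event(level, event, session_id, **fields):
    """Write the entry to stdout as a single JSON line and return it."""
    entry = build_log_entry(level, event, session_id, **fields)
    sys.stdout.write(json.dumps(entry) + "\n")
    return entry


# Example: a completed tool call (all values are illustrative).
log_event("info", "tool_call_complete", "sess-123", tool_name="search", duration_ms=42)
```

Keeping entry construction separate from the write makes the format easy to unit-test and lets the same dict be handed to a different sink later.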
What NOT to log
Never log tool call arguments or results in plaintext — they may contain user data, credentials, or PII. Log the tool name and execution outcome (success/error, latency, error code), not the input/output content. If you need debugging visibility into arguments, use a log level (debug) that's disabled in production and requires explicit opt-in per-session.
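One way to enforce this rule is to build the outcome entry from metadata only and put arguments in a separate debug-level entry that exists only for sessions that opted in. This is a sketch under assumptions: `DEBUG_SESSIONS` and `tool_call_log_entries` are hypothetical names, and how a session opts in is left to your configuration.

```python
# Hypothetical per-session opt-in set; empty in production by default.
DEBUG_SESSIONS = set()


def tool_call_log_entries(session_id, tool_name, arguments, outcome,
                          duration_ms, error_code=None):
    """Return the log entries for one tool call.

    The outcome entry never contains argument or result content. A separate
    debug entry carrying the arguments is added only for opted-in sessions.
    """
    failed = outcome == "error"
    outcome_entry = {
        "level": "error" if failed else "info",
        "event": "tool_call_error" if failed else "tool_call_complete",
        "session_id": session_id,
        "tool_name": tool_name,
        "duration_ms": duration_ms,
    }
    if error_code is not None:
        outcome_entry["error_code"] = error_code

    entries = [outcome_entry]
    if session_id in DEBUG_SESSIONS:
        entries.append({
            "level": "debug",
            "event": "tool_call_arguments",
            "session_id": session_id,
            "tool_name": tool_name,
            "arguments": arguments,
        })
    return entries
```

Because the arguments live in their own entry, a production log pipeline can drop everything at debug level without touching the outcome records.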
Log retention and storage
For early-stage MCP servers: stdout logs piped to a file or a log aggregator (CloudWatch Logs, GCP Cloud Logging, Datadog Logs, Logtail). Retain 30 days minimum — most incident investigations happen within 24 hours, but SLO reviews need 30-day history. At low traffic (<10k sessions/day), log storage costs are negligible. At high traffic, consider logging only error events and sampled success events at a 1-in-100 rate.
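The high-traffic sampling rule might look like the sketch below. It makes one design assumption not stated above: sampling is decided per session (by hashing the session ID) rather than per event, so a sampled session keeps all of its entries. `should_log` and `SUCCESS_SAMPLE_RATE` are hypothetical names.

```python
import hashlib

# Keep roughly 1 in 100 sessions' success events (assumed rate, tune as needed).
SUCCESS_SAMPLE_RATE = 100


def should_log(session_id, level):
    """Keep every warn/error event; keep success events for sampled sessions.

    Hashing the session ID makes the decision deterministic, so every entry
    from a sampled session is retained and the session can be read end to end.
    """
    if level in ("warn", "error"):
        return True
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % SUCCESS_SAMPLE_RATE
    return bucket == 0
```

Per-event random sampling would also hit the 1-in-100 target, but it tends to leave fragments of many sessions rather than complete records of a few.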
Pillar 2: Metrics
Metrics are aggregated numerical signals over time — the foundation of dashboards and SLO tracking. The key MCP server metrics to instrument:
The four golden signals for MCP
- Request rate (per layer): how many initialize requests per minute, how many tool calls per minute per tool. Rate changes indicate load changes or upstream agent behavior changes.
- Error rate (per layer): fraction of requests failing at each MCP protocol layer. Track separately for transport, HTTP, initialize, and tools/list. See MCP server error rate for the measurement model.
- Latency (p50, p95, p99 per operation): time to complete initialize, time to complete tools/list, time to complete each tool call by name. Separate percentile distributions for each tool — some tools are inherently slow (I/O-bound); others should always be fast. See MCP server latency.
- Active sessions: how many active agent sessions at any moment. Sudden drops indicate sessions are timing out or being dropped. Sudden spikes may indicate runaway agent behavior or a load test.
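To make the four signals concrete, here is a small in-memory sketch using only the standard library. It is illustrative rather than production-ready (a real deployment would export to a metrics backend, which also turns raw counts into per-minute rates). The class name `McpMetrics` and the operation keys such as "initialize" or "tool_call:search" are assumptions.

```python
from collections import defaultdict
from statistics import quantiles


class McpMetrics:
    """Per-operation request counts, error counts, latencies, and a session gauge."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latencies_ms = defaultdict(list)
        self.active_sessions = 0  # increment on initialize, decrement on close

    def record(self, operation, duration_ms, ok=True):
        """Record one request for an operation, e.g. 'initialize' or 'tool_call:search'."""
        self.requests[operation] += 1
        if not ok:
            self.errors[operation] += 1
        self.latencies_ms[operation].append(duration_ms)

    def error_rate(self, operation):
        """Fraction of failed requests for the operation (0.0 if none recorded)."""
        total = self.requests[operation]
        return self.errors[operation] / total if total else 0.0

    def percentiles(self, operation):
        """Return p50/p95/p99 latency, or None with fewer than two samples."""
        samples = self.latencies_ms[operation]
        if len(samples) < 2:
            return None
        cuts = quantiles(samples, n=100, method="inclusive")  # 99 cut points
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Keying everything by operation keeps initialize, tools/list, and each named tool in separate distributions, which is what lets a slow I/O-bound tool stand apart from one that should always be fast.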
MCP-specific metrics beyond the four signals
- Tool surface size: count of tools returned by tools/list. Alert if this drops unexpectedly — a deployment that accidentally removes tools will show here before users complain.
- Tool schema hash: a hash of the full tools/list JSON. Changes indicate schema drift. Tracking this as a metric (or logging it on every tools/list response) creates an audit trail of every tool schema change.
- Downstream dependency error rate: if your tools call external APIs, instrument each call and track its error rate separately from your overall tool error rate. Separates "our server failed" from "the downstream API failed."
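Tool surface size and schema hash can both be derived from a single tools/list result. The sketch below assumes the result is a dict with a "tools" array whose items have a "name" field; it sorts tools by name and serializes with sorted keys so that reordering alone does not change the hash. The function name `tool_surface_metrics` is hypothetical.

```python
import hashlib
import json


def tool_surface_metrics(tools_list_result):
    """Return the tool count and a stable SHA-256 hash of the tool definitions."""
    tools = tools_list_result.get("tools", [])
    # Sort tools by name and keys alphabetically so only real changes alter the hash.
    ordered = sorted(tools, key=lambda tool: tool.get("name", ""))
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return {
        "tool_count": len(tools),
        "schema_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```

Logging both values on every tools/list response gives the audit trail described above: a change in `schema_hash` marks the moment the schema drifted, and a drop in `tool_count` is the alertable case.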
Instrumentation approaches
For Node.js MCP servers: the Prometheus client library (prom-client) with custom counters and histograms, exposing a /metrics endpoint. For Python: prometheus_client. For serverless platforms (Lambda, Cloud Run): push custom metrics to CloudWatch or GCP Cloud Monitoring instead, since Prometheus's pull-based scraping fits poorly with short-lived, scale-to-zero instances. The OpenTelemetry SDK is available for all of these runtimes and handles correlation across the metrics, traces, and logs signals.
See Prometheus MCP monitoring for the scraping and alerting setup.
Pillar 3: Distributed traces
Traces answer the question: "where did this specific agent session spend its time?" They're most valuable for MCP servers that call multiple downstream services per tool invocation — without traces, a 2-second tool call might be slow because of your server, a database query, an external API call, or an LLM inference call. Traces show you exactly which span consumed the time.
MCP trace structure
A well-instrumented MCP session produces a trace with the following span structure:
agent_session (root span)
├── mcp_initialize (span, ~200-500ms)
├── mcp_tools_list (span, ~50-300ms)
└── mcp_tool_call: tool_name (span, per call)
    ├── db_query: table_name (child span)
    ├── external_api_call: api.example.com (child span)
    └── llm_inference: claude-sonnet-4-6 (child span, if applicable)
Each span carries: trace ID (unique per session), span ID, parent span ID, start time, duration, status (OK/ERROR), and relevant attributes (tool name, error code, HTTP status).
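As a rough illustration of that span shape, the sketch below models a span as a dataclass and finds the child that consumed the most time under a given parent. It is a toy data model, not a tracing SDK: `Span` and `slowest_child` are hypothetical names, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                  # unique per session
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    name: str
    start_ms: float
    duration_ms: float
    status: str = "OK"             # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)


def slowest_child(spans, parent_span_id):
    """Return the direct child of the given span with the longest duration, or None."""
    children = [span for span in spans if span.parent_span_id == parent_span_id]
    return max(children, key=lambda span: span.duration_ms, default=None)


# Invented example: a 1.5 s tool call dominated by the external API call.
spans = [
    Span("t1", "root", None, "agent_session", 0, 2100),
    Span("t1", "a", "root", "mcp_tool_call: search", 600, 1500),
    Span("t1", "b", "a", "db_query: docs", 610, 200),
    Span("t1", "c", "a", "external_api_call: api.example.com", 820, 1200, status="ERROR"),
]
print(slowest_child(spans, "a").name)
```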
Propagating trace context through MCP
MCP does not currently have a standardized trace context propagation mechanism in the protocol spec. Practical approaches:
- HTTP header propagation: agent clients can pass W3C traceparent headers in the HTTP requests to the MCP server. The server extracts the trace context and creates child spans. This requires the agent client to support trace context injection, and current MCP SDKs vary in support.
- Session ID correlation: a simpler approach is to include the session ID in all log and metric entries, and use it to correlate logs, metrics, and traces from the same session. This is less powerful than full trace propagation but requires no coordination with the agent client.
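On the server side, header propagation starts with parsing the incoming traceparent value. The sketch below follows the W3C Trace Context layout for version 00 (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, separated by hyphens). As a simplification it rejects any value with trailing fields, which a fully spec-compliant parser would tolerate for future versions. `parse_traceparent` is a hypothetical helper name.

```python
import re

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header_value):
    """Parse a W3C traceparent header; return None if it is missing or invalid."""
    if not header_value:
        return None
    match = _TRACEPARENT_RE.match(header_value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

When the function returns None, the server can fall back to session ID correlation and start a fresh trace, so clients without trace context support still produce usable telemetry.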
Traces are the highest-infrastructure-cost observability pillar (requires a trace backend: Jaeger, Zipkin, Honeycomb, Datadog APM, GCP Cloud Trace). For early-stage MCP servers, logs + external monitoring covers the critical use cases. Add traces when you're debugging latency issues in complex multi-hop tool calls.
Pillar 4: External probe monitoring
Internal instrumentation — logs, metrics, traces — requires your server to be running and responding to emit signals. When the server is completely down, your internal observability goes dark. External probe monitoring is the only signal visible from outside the server: "is this server reachable and responding correctly from the user's network perspective?"
External probing covers failure modes that internal instrumentation cannot:
- Network-level reachability (TCP connection to the server's IP and port)
- TLS certificate validity and expiry
- DNS resolution (is the domain resolving to the right IP?)
- CDN/proxy layer failures (Cloudflare down, wrong SSL certificate at the edge)
- Complete process crash before any logs are emitted
The four-layer MCP probe (transport → HTTP → initialize → tools/list) is also an outside-in functional test: it verifies the full user-facing path, not just internal component health. A server whose internal metrics look healthy but whose tools/list probe fails from the external probe origin has a fault somewhere between the process and the user, and internal metrics alone would miss it.
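The layered structure of such a probe can be sketched as an ordered sequence of checks that stops at the first failure, so the result names the layer that broke. This is an assumption-laden outline, not AliveMCP's implementation: the actual checks (TCP connect, HTTP request, initialize, tools/list) are passed in as hypothetical zero-argument callables.

```python
# Probe layers in the order a real client would hit them.
LAYERS = ("transport", "http", "initialize", "tools/list")


def run_layered_probe(checks):
    """Run one check per layer, in order, and stop at the first failure.

    `checks` maps each layer name to a zero-argument callable that returns
    True on success and returns False or raises on failure.
    """
    passed = []
    for layer in LAYERS:
        try:
            ok = bool(checks[layer]())
            detail = None if ok else "check returned False"
        except Exception as exc:  # timeouts, connection errors, bad responses
            ok = False
            detail = repr(exc)
        if not ok:
            return {"healthy": False, "failed_layer": layer,
                    "passed": passed, "detail": detail}
        passed.append(layer)
    return {"healthy": True, "failed_layer": None, "passed": passed, "detail": None}
```

Stopping at the first failing layer matters because later layers depend on earlier ones: a tools/list failure is only meaningful if transport, HTTP, and initialize already succeeded.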
Combine internal instrumentation (logs and metrics for diagnosis and trend tracking) with external probe monitoring (AliveMCP for availability and functional health). They're complementary, not competing. See MCP server monitoring dashboard for how to visualize both signal types together.
Related questions
What's the minimum viable observability setup for a new MCP server?
Two things: structured JSON logs to stdout (zero infrastructure) and external probe monitoring with AliveMCP (zero server-side code). This covers 80% of incident investigations and gives you availability tracking immediately. Add Prometheus metrics once you have traffic worth analyzing. Add distributed tracing once you have multi-hop tool calls with latency you need to attribute. The mistake is investing in distributed tracing infrastructure before you have enough traffic to make traces statistically useful.
How do I correlate internal logs with AliveMCP probe events?
Match by timestamp: when AliveMCP shows a probe failure at 14:32:05 UTC, look for log entries from your server around the same time. If your logs show nothing around 14:32 (the server emitted no entries), the failure was at the network/transport layer — the probe never reached your process. If logs show entries up to 14:31:58 and then nothing until 14:35:12, the process crashed or OOMed at 14:32 and restarted at 14:35. If logs show 14:32 entries with error responses, the server was up but returning errors — your logs have the error detail that the probe summary doesn't.
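The timestamp-matching procedure can be partly automated. The sketch below classifies a probe failure from parsed log entries; it cannot by itself tell an unreachable server from a crashed one, so for the "no log activity" case it returns the surrounding gap for a human to interpret. The function name, verdict strings, and the 5-second window are assumptions to adjust against your probe timeout.

```python
from datetime import datetime, timedelta


def classify_probe_failure(probe_time, log_entries, window_s=5):
    """Classify a probe failure using server logs near the probe timestamp.

    `log_entries` is an iterable of (timestamp, level) pairs in any order,
    using the same timezone convention as `probe_time`.
    """
    entries = list(log_entries)
    window = timedelta(seconds=window_s)
    nearby = [(ts, level) for ts, level in entries if abs(ts - probe_time) <= window]

    if any(level == "error" for _, level in nearby):
        # Server was up but returning errors; the logs hold the detail.
        return {"verdict": "server_returned_errors"}
    if nearby:
        # Server was active without errors; look at the network path instead.
        return {"verdict": "server_active_no_errors"}

    # No entries near the probe: unreachable server, or a crash and restart.
    # The size of the gap helps tell these apart.
    return {
        "verdict": "no_log_activity",
        "last_entry_before": max((ts for ts, _ in entries if ts < probe_time), default=None),
        "first_entry_after": min((ts for ts, _ in entries if ts > probe_time), default=None),
    }
```

For the example above, a failure at 14:32:05 with entries at 14:31:58 and 14:35:12 yields "no_log_activity" with those two timestamps as the gap boundaries, which points toward a crash and restart rather than a brief network blip.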
Do I need OpenTelemetry for MCP observability?
OpenTelemetry is useful if you want vendor-neutral instrumentation that works across multiple backends (send traces to Jaeger today, Honeycomb next year, without re-instrumentation). For early-stage MCP servers, OTel is over-engineered. Start with: structured logging via your runtime's built-in logger, a simple Prometheus counter/histogram setup, and AliveMCP for external monitoring. Migrate to OTel when you have multiple services, multiple teams, or want to consolidate signals in a single observability platform.
How should I instrument tool calls for observability?
Wrap every tool handler in a try/catch that records: start time, end time (for latency), success or error, and error code if applicable. Emit a structured log entry and increment a Prometheus counter/histogram. Keep the instrumentation logic in a decorator or middleware, not duplicated across every tool handler. In Python: a @instrument_tool decorator. In TypeScript: a withInstrumentation(toolName, handler) wrapper function. This pattern means adding a new tool gets instrumentation automatically without per-tool boilerplate.
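A Python version of the decorator might look like the sketch below. Several details are assumptions: it wraps synchronous handlers only (an async SDK would need an async variant), it reads an optional `code` attribute from the exception and otherwise falls back to the JSON-RPC internal error code -32603, and the `emit` callback stands in for whatever log and metrics sinks you use.

```python
import functools
import json
import sys
import time


def _emit_to_stdout(entry):
    """Default sink: one JSON line per tool call on stdout."""
    sys.stdout.write(json.dumps(entry) + "\n")


def instrument_tool(tool_name, emit=_emit_to_stdout):
    """Decorator that records latency and outcome for a synchronous tool handler."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            entry = {"level": "info", "event": "tool_call_complete", "tool_name": tool_name}
            try:
                return handler(*args, **kwargs)
            except Exception as exc:
                # Assumed convention: exceptions may carry a JSON-RPC `code`.
                entry.update(level="error", event="tool_call_error",
                             error_code=getattr(exc, "code", -32603))
                raise  # instrumentation must not swallow the failure
            finally:
                entry["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
                emit(entry)  # arguments and results are deliberately not recorded
        return wrapper
    return decorator


# Example handler (hypothetical tool).
@instrument_tool("echo")
def echo(text):
    return text
```

The `emit` callback is the single place to also update a counter or histogram, so a new tool gets both logging and metrics by adding one decorator line.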
Further reading
- MCP server health check — the four-layer probe at the foundation of external observability
- MCP server monitoring dashboard — aggregating all four observability pillars
- Prometheus MCP monitoring — metrics instrumentation and scraping setup
- MCP server error rate — the core metrics signal for SLO tracking
- MCP server latency — per-layer latency metrics and SLO alerting
- MCP server SLO — using observability data to track error budgets
- AliveMCP — external probe observability for every public MCP endpoint