MCP server tracing

Distributed tracing gives you a causally-linked timeline across every component involved in a single operation. For MCP servers, that timeline runs from the AI agent that initiated the session through the MCP protocol layers (transport, initialize, tools/list, tool call) into every downstream service your tools invoke. Without tracing, a slow agent session is a black box — you know something took too long, but you don't know whether the bottleneck was the network, the initialize handshake, a specific tool call, or a downstream API that tool depends on. With tracing, you have the full causal chain as a timeline with durations attached to each segment.

TL;DR

Use OpenTelemetry to instrument your MCP server. The trace has one root span per agent session (mcp.session), with a child span per protocol operation: mcp.initialize, mcp.tools_list, and one mcp.tool_call span per tool invocation. Each tool call span has child spans for its downstream API calls. Propagate W3C traceparent via HTTP headers for HTTP/SSE MCP servers, or via the _meta field of the JSON-RPC params for stdio-based servers. Never record tool call arguments as span attributes, because they may contain user PII. External probe monitoring from AliveMCP complements tracing by covering the case where the server is completely down and generating no traces at all.

Why standard distributed tracing needs MCP adaptation

Standard distributed tracing frameworks assume a request-response model: one request comes in, one response goes out, and the trace covers that lifecycle. MCP has a different shape:

Trace structure for MCP

The recommended span hierarchy for a complete MCP server trace:

mcp.session (root span, one per agent session)
  ├── mcp.initialize
  │     └── (optional: auth validation span if SSO/OAuth involved)
  ├── mcp.tools_list
  │     └── (optional: tool registry fetch if dynamic tools)
  ├── mcp.tool_call [tool_name="weather.get"]
  │     ├── downstream.http [url="https://api.weather.example/v1/current"]
  │     └── downstream.cache [operation="redis.get", key="weather:lat:long"]
  └── mcp.tool_call [tool_name="calendar.list"]
        └── downstream.http [url="https://calendar.google.com/api/v3/events"]

Span attribute naming conventions:

PII rule: never include tool call arguments as span attributes. Tool inputs frequently contain user-provided data (names, addresses, queries, API keys passed as parameters). Log tool call argument shapes in structured logs instead, stripped to their schema (key names only, not values). Traces flow to observability backends where the retention and access control model may differ from your primary data store — arguments in spans will persist longer than intended and in more places than expected.
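
One way to log argument shapes without values is to reduce each argument object to its key names before it reaches the logger. The sketch below is a minimal illustration under stated assumptions: argumentShape is a hypothetical helper name (not part of any MCP SDK or OpenTelemetry), the type labels next to each key are an addition beyond "key names only", and the depth limit of 4 is arbitrary.

```javascript
// Hypothetical helper: reduce tool arguments to key names and type labels.
// No argument value is ever copied into the result.
function argumentShape(value, depth = 0, maxDepth = 4) {
  if (value === null || typeof value !== 'object') {
    return value === null ? 'null' : typeof value; // 'string', 'number', ...
  }
  // Stop descending past maxDepth so deeply nested or cyclic input stays bounded.
  if (depth >= maxDepth) return Array.isArray(value) ? 'array' : 'object';
  if (Array.isArray(value)) {
    // Describe only the first element so long arrays stay small in logs.
    return value.length === 0 ? 'array' : [argumentShape(value[0], depth + 1, maxDepth)];
  }
  const shape = {};
  for (const key of Object.keys(value)) {
    shape[key] = argumentShape(value[key], depth + 1, maxDepth);
  }
  return shape;
}

// Example: the logged shape carries keys and types, never the address or key value.
const exampleShape = argumentShape({
  location: '221B Baker Street',
  units: 'metric',
  options: { days: 3, apiKey: 'secret-value' },
});
console.log(JSON.stringify({ event: 'tool_call', tool: 'weather.get', args_shape: exampleShape }));
```

The shape object belongs in the structured log entry, not on the span, so the PII rule above still holds.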

W3C traceparent propagation

Distributed tracing requires each component to pass context to the next so that spans from different services can be assembled into a single trace tree. W3C Trace Context defines the standard propagation format.

HTTP/SSE MCP servers

For MCP servers that communicate over HTTP (SSE transport or streamable HTTP), propagation uses standard HTTP headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: alivemcp=probe,vendor=otlp

The AI agent that initiates the session generates a root traceparent and includes it in the HTTP request to your MCP server. Your server reads it via the OpenTelemetry SDK's HTTP propagator, creates a child span, and passes the updated context to any downstream HTTP calls your tools make. This produces a trace tree that spans from the agent through your MCP server into your backend services — a single timeline for the full operation.
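
To make the header layout concrete, the sketch below splits a traceparent value into its four fields (version, trace ID, parent span ID, flags). It is an illustration only: in a real server the OpenTelemetry propagator does this parsing for you. parseTraceparent is a hypothetical name, and the sketch assumes only the version 00 layout, rejecting anything else.

```javascript
// Version 00 layout: 2 hex digits, 32 hex digits, 16 hex digits, 2 hex digits.
const TRACEPARENT_RE = /^(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns the parsed fields, or null if the value is invalid.
function parseTraceparent(header) {
  if (typeof header !== 'string') return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  if (/^0+$/.test(traceId)) return null;  // an all-zero trace ID is invalid
  if (/^0+$/.test(parentId)) return null; // an all-zero parent ID is invalid
  return {
    version,
    traceId,
    parentId,
    // The lowest bit of the flags byte is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

console.log(parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'));
```

Every span in the resulting trace shares the same traceId; each hop replaces parentId with its own span ID before calling the next service.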

stdio-based MCP servers

Stdio MCP servers communicate over stdin/stdout with JSON-RPC messages, not HTTP, so there are no headers. Use the _meta field inside the request params as the extension point to carry trace context:

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "weather.get",
    "arguments": { "location": "..." },
    "_meta": {
      "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    }
  },
  "id": 1
}

The _meta field is defined in the MCP specification as an extension point for non-semantic metadata. Your server reads params._meta.traceparent at the JSON-RPC layer and extracts context before processing the request. This requires a thin wrapper around your MCP SDK's tool dispatch that runs the OTel propagator on the _meta field before entering tool-specific code.
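
The two sides of that wrapper can be sketched as follows. Both function names (withTraceparent, traceCarrier) are hypothetical and not part of any MCP SDK; the sketch assumes the client is willing to attach the header value itself and that the server wants to hand the propagator only the two W3C fields rather than the whole _meta object.

```javascript
// Hypothetical client-side helper: return a copy of a JSON-RPC request with
// trace context attached under params._meta. The input message is not mutated.
function withTraceparent(message, traceparent) {
  const params = message.params ?? {};
  return {
    ...message,
    params: {
      ...params,
      _meta: { ...(params._meta ?? {}), traceparent },
    },
  };
}

// Hypothetical server-side helper: build the carrier object passed to a
// propagator, copying only string-valued W3C fields out of _meta.
function traceCarrier(message) {
  const meta = message.params?._meta ?? {};
  const carrier = {};
  if (typeof meta.traceparent === 'string') carrier.traceparent = meta.traceparent;
  if (typeof meta.tracestate === 'string') carrier.tracestate = meta.tracestate;
  return carrier;
}

const request = {
  jsonrpc: '2.0',
  method: 'tools/call',
  params: { name: 'weather.get', arguments: { location: '...' } },
  id: 1,
};
const traced = withTraceparent(
  request,
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
console.log(JSON.stringify(traced));
console.log(traceCarrier(traced));
```

Copying only the known fields keeps unrelated client metadata away from the propagator; passing the whole _meta object as the carrier also works if you trust its contents.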

See JSON-RPC health checks vs HTTP probes for deeper discussion of the protocol-layer differences between HTTP and JSON-RPC MCP transports.

OpenTelemetry SDK implementation

Minimal instrumentation for a Node.js MCP server using the OTel SDK:

import { trace, context, propagation, SpanStatusCode } from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

// Spans are recorded only after an SDK tracer provider is registered at startup.
const tracer = trace.getTracer('mcp-server', '1.0.0');
propagation.setGlobalPropagator(new W3CTraceContextPropagator());

// In your tool dispatch handler:
async function handleToolCall(req) {
  // For stdio, trace context arrives in params._meta rather than in HTTP headers.
  const carrier = req.params?._meta ?? {};
  const ctx = propagation.extract(context.active(), carrier);

  // startActiveSpan parents the span on the extracted context and makes it the
  // current span, so spans created inside executeTool can nest under it.
  return tracer.startActiveSpan(
    'mcp.tool_call',
    {
      attributes: {
        'mcp.tool_name': req.params.name,
        'mcp.session_id': req.sessionId,
        'mcp.operation': 'tool_call',
        // Deliberately no req.params.arguments here (see the PII rule).
      },
    },
    ctx,
    async (span) => {
      try {
        const result = await executeTool(req.params.name, req.params.arguments);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (err) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
        throw err;
      } finally {
        span.end();
      }
    }
  );
}

For Python MCP servers, the pattern is equivalent using opentelemetry-api and opentelemetry-sdk. Instrument each protocol operation (initialize, tools/list, tool call) with a span. Use context managers rather than try/finally where possible for cleaner error handling.

Sampling strategy

At high traffic volumes, tracing every operation is expensive in both CPU overhead and storage cost. A practical sampling strategy for MCP servers:

How external probes complement tracing

Distributed tracing has a fundamental blind spot: it only produces data when the server is running and receiving requests. When your MCP server is completely down — TCP refused, host unreachable, process crashed — no traces are generated. The absence of traces is not itself an alert signal in most tracing backends.

External probe monitoring from AliveMCP fills this gap. The probe initiates a real MCP protocol sequence from outside your infrastructure every 60 seconds, regardless of whether any agent is actually using the server. If the server is down, the next probe fails and raises an alert, without waiting for a user to experience a failed agent session. The probe also verifies that the server is reachable from the public internet, which internal health checks and traces cannot confirm.

The practical workflow: when an alert fires, check AliveMCP first to understand which protocol layer failed (transport/HTTP/initialize/tools_list). Then open your tracing backend and look for error spans in the 5-minute window before the alert timestamp. The probe alert gives you the what-layer and when; the traces give you the why-within-that-layer. See MCP server error rate for per-layer error classification and how probe data and trace data differ per layer.

Backend options for trace storage

Where to send your OTel spans:

Related questions

Do I need distributed tracing if I'm running a single MCP server?

Not initially. For a single server with no downstream API dependencies, structured logging gives you most of the diagnostic value at a fraction of the implementation cost. Add tracing when: (1) your tools depend on downstream APIs and you're seeing slow sessions you can't attribute to a specific component; (2) you're running multiple MCP servers and need cross-service correlation; (3) you have enough traffic that individual log lines are too noisy to read during an incident and you need the timeline view a trace provides. Start with MCP server observability basics (structured JSON logs + external probing) before adding tracing infrastructure.

Can I use tracing to replace external monitoring?

No. Tracing only produces data when the server is running and serving requests. When the server is completely down, tracing is silent. External monitoring from AliveMCP probes the server independently of any real traffic — it's the only way to know the server is down before a user hits it. The two signals are complementary: use external monitoring for availability (is the server up?), use tracing for root cause analysis (why is this session slow?). A trace tells you which tool call took 800ms; the probe alert tells you the server has been returning 503s for the last 6 minutes even though no agent has tried to use it yet.

How do I correlate an AliveMCP downtime alert with my trace data?

The AliveMCP probe alert includes a timestamp for the first failed probe. In your trace backend, search for error spans within a 5-minute window before that timestamp on the same server. If you see error spans, the failure was at the application layer (traces existed but had errors). If you see no spans at all, and the server normally receives traffic in a window of that length, the failure was most likely at the transport or HTTP layer: the server wasn't reachable enough to even receive the request that would generate a trace. This no-trace pattern is diagnostic in itself, because it narrows the investigation to infrastructure (host down, network, deployment) rather than application code.
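
The decision rule above can be written down as a small triage function. This is a sketch under assumptions: the span shape ({ startTime, status }) and the function name classifyOutage are hypothetical, timestamps are epoch milliseconds, and the "inconclusive" branch covers a case the rule above does not address (spans exist in the window but none of them errored).

```javascript
const WINDOW_MS = 5 * 60 * 1000; // the 5-minute lookback before the first failed probe

// Hypothetical span shape: { startTime: epoch milliseconds, status: 'ok' | 'error' }.
function classifyOutage(firstFailedProbeMs, spans) {
  const windowStart = firstFailedProbeMs - WINDOW_MS;
  const inWindow = spans.filter(
    (s) => s.startTime >= windowStart && s.startTime <= firstFailedProbeMs
  );
  // No spans at all: the server likely never received a request.
  if (inWindow.length === 0) return 'transport-or-http';
  // Requests arrived and failed: look at application code first.
  if (inWindow.some((s) => s.status === 'error')) return 'application';
  // Spans exist but none errored; not covered by the rule above.
  return 'inconclusive';
}

const alertAt = Date.parse('2024-01-01T12:00:00Z');
console.log(classifyOutage(alertAt, [
  { startTime: alertAt - 90_000, status: 'ok' },
  { startTime: alertAt - 30_000, status: 'error' },
]));
console.log(classifyOutage(alertAt, []));
```

On a low-traffic server an empty window may simply mean nobody called the server, so treat the 'transport-or-http' result as a starting hypothesis rather than a verdict.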

Further reading