Observability guide · 2026-06-03 · Production MCP servers

MCP Server Observability Stack Guide: OpenTelemetry, Prometheus Metrics, Structured Logging, Distributed Tracing, and Log Aggregation

Most MCP server developers start with console.log and a ping check. That is enough for local development. It is not enough for production, where an LLM client is calling your tools live, sessions fail silently, downstream APIs slow down without returning errors, and crashes happen between log lines. A complete observability stack for a production MCP server has five concerns: OpenTelemetry as the unifying SDK backbone, Prometheus metrics for the alerting tier, Pino structured logging for session-level debugging, distributed tracing for cross-service latency attribution, and log aggregation for durable queryable history. This guide covers them as a system — how each layer contributes something the others cannot, how they interconnect, and what the combined stack still cannot observe (and where an external probe fills that gap).

TL;DR

OpenTelemetry is the SDK layer that unifies all three signals. The NodeSDK must be imported before any other module; it instruments the runtime, exports traces via OTLP, exports metrics at a 15-second interval, and injects traceId and spanId into every Pino log line via a mixin. Without OTel, the three signals are disconnected and cannot be correlated after the fact.
Prometheus metrics provide the alerting tier. The four golden signals for MCP — traffic (mcp_tool_calls_total), latency (mcp_tool_duration_seconds histogram), errors (error rate %), and saturation (mcp_active_sessions, mcp_bulkhead_running) — give Alertmanager the data it needs to page on-call before users notice. Expose /metrics on a separate port so scrape traffic does not appear in your MCP latency histogram.
Pino structured logging provides session-level detail. Use AsyncLocalStorage to bind a child logger per session — the logger carries session_id and user_id through every await without parameter threading. redact.paths prevents credentials from reaching the log pipeline. Log Error objects as err fields, not err.message, to preserve stack traces.
Distributed tracing attributes latency to specific spans. Extract the W3C traceparent header at initialize, store the OTel context in AsyncLocalStorage per session, start a child span per tool call, and inject traceparent into outgoing HTTP headers. When a tool call is slow, the trace tells you whether the latency is in your code, in a downstream API, or in the database — without log-diving.
Log aggregation makes logs queryable at scale. Pino writes NDJSON to stdout; Docker captures it; Promtail ships it to Grafana Loki. LogQL lets you filter by session_id, compute error rates over time, and find slow calls (duration_ms > 1000). Grafana derived fields jump from a trace_id in a log line directly to the corresponding Tempo trace.
External synthetic probes cover what the internal stack cannot. Internal observability requires your process to be running and your log-shipping pipeline to be intact. AliveMCP probes from outside: process crashes before logger initialises, OOM kills, TLS expiry, and DNS failures are all invisible to internal signals but surface immediately on the external uptime dashboard.

Why Observability for MCP Servers Is Different

A conventional HTTP API has a simple observability model: one request maps to one log line and one span. MCP servers are different in two ways that make observability harder.

First, the session is long-lived. An MCP session opens with initialize and may last minutes or hours while an LLM iterates over a complex task. A single session generates many tool calls. The log lines, spans, and metrics for those tool calls must all be correlated to the same session context — otherwise debugging a session failure means sifting through thousands of unrelated lines from other sessions interleaved in the same log stream.

Second, MCP servers are almost always in the middle of a call graph. The LLM client calls your server; your server calls one or more downstream APIs; those may call databases or other services. When a tool call is slow or fails, the question is not just "what happened in my code" but "which hop in the call graph introduced the latency." Without distributed tracing, you cannot answer that question from logs alone.

These two characteristics — long-lived sessions with many tool calls, and a multi-hop call graph — mean that a complete observability stack for MCP servers must address:

Per-session context binding (so all events for a session are correlated)
Per-tool-call instrumentation (spans, metrics increments, structured log lines)
Cross-service trace propagation (so latency can be attributed to the correct hop)
An alerting tier that fires before users report failures (Prometheus rules)
Durable, queryable log storage (log aggregation)

The general observability overview covers the problem framing. This guide covers how the five specific concerns — OpenTelemetry, Prometheus metrics, structured logging, distributed tracing, and log aggregation — each address a subset of those requirements, and how they compose without redundancy.

The Five-Layer Stack and What Each Layer Contributes

Layer	Primary signal	Main use case	What it cannot do alone
OpenTelemetry SDK	Traces + metrics + log correlation	Unified instrumentation; correlates all signals via shared `traceId`	Does not store data — exports to backends only; does not aggregate logs
Prometheus metrics	Time-series metrics	Alerting (Alertmanager rules), dashboards (Grafana PromQL), saturation gauges	No per-request detail; counters and histograms, not individual traces
Pino structured logging	Structured logs (NDJSON)	Per-session debugging; per-request log lines with context fields	High cardinality; useful for debugging known sessions, not for aggregated alerting
Distributed tracing	Traces with spans	Cross-service latency attribution; waterfall view of per-tool-call call graph	Sampled (not every request); not suited for alerting (sparse data)
Log aggregation	Centralised log storage	Durable queryable history; multi-instance log merging; log-based alert rules	Not real-time; query latency on Loki/Elasticsearch is seconds, not milliseconds

The table makes the composition logic visible: Prometheus is fast for alerting but loses per-request detail; logs have full detail but cannot aggregate efficiently; traces have the full call graph but are sampled; OTel is the SDK that glues them together; log aggregation is the persistence tier that makes logs queryable after the fact.

OpenTelemetry: The Unifying Backbone

OpenTelemetry is the only layer that produces all three signals (traces, metrics, and correlated logs) from a single instrumentation pass. Its role in the stack is not to replace Prometheus or Pino — both continue to operate independently — but to provide a shared traceId that allows any log line and any metric exemplar to be linked back to the trace that generated them.

The NodeSDK setup has two hard requirements. First, the SDK initialisation file must be imported before any other module:

// instrumentation.ts: load before any application module, e.g.
//   node --import ./instrumentation.js server.js   (ESM; use --require for CommonJS)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  // `new Resource(...)` is the @opentelemetry/resources 1.x API; 2.x uses resourceFromAttributes(...)
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'my-mcp-server',
    [SEMRESATTRS_SERVICE_VERSION]: process.env.APP_VERSION ?? 'unknown',
    'deployment.environment': process.env.NODE_ENV ?? 'development',
  }),
  // No `url` option: the exporters read OTEL_EXPORTER_OTLP_ENDPOINT themselves and append
  // /v1/traces or /v1/metrics. An explicit `url` is used verbatim, without the signal path.
  traceExporter: new OTLPTraceExporter(),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 15_000,
  }),
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
  // Registers the HTTP and database client hooks that produce downstream child spans
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

If the NodeSDK is imported after application code, some auto-instrumentation hooks will have already missed their injection points and will silently not fire. The resource attributes — service.name, service.version, deployment.environment — propagate to every span and metric exported, making filtering by service or environment work correctly in the backend.

Second, each tool call gets a span:

import { trace, SpanStatusCode } from '@opentelemetry/api';
import { z } from 'zod';
const tracer = trace.getTracer('mcp-tools');

server.tool('search', { query: z.string() }, async ({ query }, { sessionId }) => {
  return tracer.startActiveSpan('mcp.tool.search', async span => {
    span.setAttributes({ 'mcp.tool.name': 'search', 'mcp.session.id': sessionId });
    try {
      const results = await searchApi.query(query);
      span.setAttribute('mcp.result.count', results.length);
      return { content: [{ type: 'text', text: JSON.stringify(results) }] };
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      throw err;
    } finally {
      span.end();
    }
  });
});

The startActiveSpan call sets the span as active in the current async context. Any child spans created within the async callback — including spans auto-instrumented by the SDK inside downstream HTTP calls — are automatically parented to this span. This is the mechanism that produces the waterfall view in Jaeger or Grafana Tempo: the tool call span at the top, downstream API call spans below it, database query spans below those.

The OTel-to-Pino bridge — a Pino mixin that reads the active span's traceId and spanId — is how log lines become navigable in Grafana:

import { trace } from '@opentelemetry/api';

const otelMixin = () => {
  const span = trace.getActiveSpan();
  if (!span?.isRecording()) return {};
  const { traceId, spanId } = span.spanContext();
  return { trace_id: traceId, span_id: spanId };
};

With this mixin, every Pino log line emitted inside a traced tool call carries trace_id and span_id. In Grafana Loki, a derived field on those fields creates a clickable link to the corresponding Tempo trace — the log-to-trace jump that makes cross-signal investigation fast.

Prometheus Metrics: The Alerting Tier

Traces and logs answer "what happened for this specific session." Prometheus metrics answer "what is happening across all sessions right now." That aggregate view is what makes alerting possible — you cannot write a meaningful Alertmanager rule from individual log lines, but you can write one from a histogram P99.

The four golden signals map directly to MCP server behaviour. Define them as module-scope singletons so they are shared across all concurrent sessions:

import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';

export const registry = new Registry();
collectDefaultMetrics({ register: registry }); // Node.js process metrics

export const toolCallsTotal = new Counter({
  name: 'mcp_tool_calls_total',
  help: 'Total tool calls by tool name and outcome',
  labelNames: ['tool_name', 'status', 'transport'],
  registers: [registry],
});

export const toolDuration = new Histogram({
  name: 'mcp_tool_duration_seconds',
  help: 'Tool call duration in seconds',
  labelNames: ['tool_name', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry],
});

export const activeSessions = new Gauge({
  name: 'mcp_active_sessions',
  help: 'Currently open MCP sessions',
  registers: [registry],
});

export const circuitBreakerOpen = new Gauge({
  name: 'mcp_circuit_breaker_open',
  help: 'Whether the circuit breaker is open (1) or closed (0)',
  labelNames: ['dependency'],
  registers: [registry],
});
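
One way to record these instruments around a handler is a small wrapper. The sketch below is an assumption, not part of the setup above: instrumentTool, CounterLike, and HistogramLike are hypothetical names, and the two interfaces only mirror the inc(labels) and observe(labels, value) shapes that prom-client's Counter and Histogram expose, so the snippet runs without the library installed.

```typescript
// Minimal shapes matching prom-client's Counter.inc(labels) and
// Histogram.observe(labels, value). Hypothetical names for this sketch.
type Labels = Record<string, string>;
interface CounterLike { inc(labels: Labels): void }
interface HistogramLike { observe(labels: Labels, seconds: number): void }

// Wraps a tool handler so every call records one counter increment and one
// duration observation, whether the handler resolves or throws.
function instrumentTool<A, R>(
  toolName: string,
  transport: string,
  calls: CounterLike,
  duration: HistogramLike,
  handler: (args: A) => Promise<R>,
): (args: A) => Promise<R> {
  return async (args: A): Promise<R> => {
    const start = Date.now(); // millisecond resolution is enough for a sketch
    let status = 'ok';
    try {
      return await handler(args);
    } catch (err) {
      status = 'error';
      throw err; // rethrow so the MCP layer still sees the failure
    } finally {
      const seconds = (Date.now() - start) / 1000;
      calls.inc({ tool_name: toolName, status, transport });
      duration.observe({ tool_name: toolName, status }, seconds);
    }
  };
}
```

With the instruments defined above, a call such as instrumentTool('search', 'http', toolCallsTotal, toolDuration, handler) would be the intended wiring. The 'http' transport label value is illustrative.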

The mcp_tool_duration_seconds histogram is the most important instrument. The explicit bucket list, covering five milliseconds to ten seconds, lets Prometheus compute histogram_quantile(0.99, sum by (le) (rate(mcp_tool_duration_seconds_bucket[5m]))) with useful resolution. histogram_quantile interpolates linearly inside the bucket that contains the target rank, so the estimate is only as precise as that bucket is narrow: with buckets that are too coarse or too low, most slow calls land in the last bucket and the reported P99 reflects the bucket bounds rather than the real latency.
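
To make the bucket-resolution point concrete, here is a minimal sketch of the linear interpolation that histogram_quantile is documented to perform over cumulative buckets. estimateQuantile and the sample counts are hypothetical, and edge cases are simplified (an out-of-range q returns NaN here).

```typescript
// One cumulative bucket: `le` is the upper bound, `count` is the number of
// observations less than or equal to it (as in the *_bucket series).
interface Bucket { le: number; count: number }

// Estimates quantile q (0 < q <= 1) from buckets sorted by `le`, where the
// last bucket is +Inf. Interpolates linearly inside the bucket that holds
// the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2 || q <= 0 || q > 1) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// The same 100 observations, recorded with two different bucket layouts.
const fineBuckets: Bucket[] = [
  { le: 0.1, count: 50 }, { le: 0.5, count: 90 }, { le: 1, count: 98 },
  { le: 2.5, count: 100 }, { le: Infinity, count: 100 },
];
const coarseBuckets: Bucket[] = [
  { le: 1, count: 98 }, { le: 10, count: 100 }, { le: Infinity, count: 100 },
];

const p99Fine = estimateQuantile(0.99, fineBuckets);     // about 1.75 s
const p99Coarse = estimateQuantile(0.99, coarseBuckets); // about 5.5 s
```

Both layouts describe identical traffic, yet the coarse layout reports a P99 more than three times higher, because the two slowest calls are spread across a 1 s to 10 s bucket instead of a 1 s to 2.5 s one.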

Three alert rules cover the most common production failure modes:

groups:
  - name: mcp_server_alerts
    rules:
      - alert: MCPToolHighErrorRate
        expr: |
          sum(rate(mcp_tool_calls_total{status="error"}[5m]))
          / sum(rate(mcp_tool_calls_total[5m])) > 0.05
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "MCP tool error rate above 5%"

      - alert: MCPToolHighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(mcp_tool_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "MCP tool P99 latency above 2 s"

      - alert: MCPCircuitBreakerOpen
        expr: mcp_circuit_breaker_open == 1
        for: 0m
        labels: { severity: critical }
        annotations:
          summary: "Circuit breaker open for {{ $labels.dependency }}"

One operational detail that is easy to get wrong: expose /metrics on a separate HTTP server from the MCP transport port. If both share a port, Prometheus scrape requests pass through the same listener and middleware as MCP traffic, so any request-level timing that feeds your latency histograms includes them, and the metrics endpoint becomes reachable by every MCP client. The fix is a second listener: metrics_server.listen(9090), with mcp_server.listen(3000) kept separate.
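
A minimal sketch of that separation, using only node:http. RenderMetrics is a hypothetical stand-in for prom-client's registry.metrics(), and metricsResponse and createMetricsServer are illustrative names. The listen calls are left as comments so that loading the snippet has no side effects.

```typescript
import { createServer } from 'node:http';
import type { Server } from 'node:http';

// Hypothetical stand-in for prom-client's `registry.metrics()`.
type RenderMetrics = () => Promise<string>;

interface MetricsResponse { status: number; body: string }

// Pure routing logic: only GET /metrics is served on this listener.
async function metricsResponse(
  method: string | undefined,
  url: string | undefined,
  render: RenderMetrics,
): Promise<MetricsResponse> {
  if (method !== 'GET' || url !== '/metrics') return { status: 404, body: '' };
  try {
    return { status: 200, body: await render() };
  } catch {
    return { status: 500, body: '' };
  }
}

// Dedicated metrics listener, separate from the MCP transport server.
function createMetricsServer(render: RenderMetrics): Server {
  return createServer((req, res) => {
    void metricsResponse(req.method, req.url, render).then(({ status, body }) => {
      res.writeHead(status, { 'Content-Type': 'text/plain; version=0.0.4; charset=utf-8' });
      res.end(body);
    });
  });
}

// Wiring (not executed here):
//   createMetricsServer(() => registry.metrics()).listen(9090);
//   mcp_server.listen(3000); // MCP transport keeps its own port
```

Because the metrics listener never routes to tool handlers, a scrape cannot touch tool-call instruments, and the port can be firewalled to the Prometheus host only.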

The circuit breaker gauges are also a useful leading indicator: the breaker opens before the error rate alert fires (the error rate alert needs two minutes of sustained failures; the breaker opens immediately on threshold). Watching mcp_circuit_breaker_open in a Grafana panel gives on-call engineers early warning without noise from transient failures.
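
The timing difference comes from the for clause on each rule. The sketch below is a toy model of how a rule moves from pending to firing, assuming a 30-second evaluation interval (the interval is an assumption, not a value taken from the rules above). alertState is a hypothetical helper, not Prometheus code.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// evaluations[i] says whether the alert expression was true at evaluation i.
// Returns the rule's state after the last evaluation.
function alertState(
  evaluations: boolean[],
  intervalSeconds: number,
  forSeconds: number,
): AlertState {
  let activeSince = -1; // index where the current run of true results began
  evaluations.forEach((isTrue, i) => {
    if (!isTrue) activeSince = -1;          // any false result resets the clock
    else if (activeSince === -1) activeSince = i;
  });
  if (activeSince === -1) return 'inactive';
  const heldFor = (evaluations.length - 1 - activeSince) * intervalSeconds;
  return heldFor >= forSeconds ? 'firing' : 'pending';
}

// Breaker rule (for: 0m): fires on the first true evaluation.
const breaker = alertState([true], 30, 0);
// Error-rate rule (for: 2m): still pending after one minute of failures...
const errorRateEarly = alertState([true, true, true], 30, 120);
// ...and firing only once the expression has held for the full two minutes.
const errorRateLate = alertState([true, true, true, true, true], 30, 120);
```

A single false evaluation in the middle of an incident restarts the two-minute clock, which is why a flapping dependency can keep the error-rate alert pending while the breaker gauge already shows the problem.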

Pino Structured Logging: Session-Level Debugging

Structured logging with Pino gives you per-session, per-tool-call detail at a cost that is orders of magnitude lower than capturing full traces for every request. The key architectural decision is binding session context to a child logger at session open time, so every subsequent log line carries session_id and user_id without manual threading.

AsyncLocalStorage is the mechanism that makes this work across await chains:

import { AsyncLocalStorage } from 'node:async_hooks';
import pino from 'pino';

const als = new AsyncLocalStorage<pino.Logger>();

export const rootLogger = pino({
  redact: {
    paths: ['DATABASE_URL', 'REDIS_URL', 'password', 'token',
            'api_key', 'secret', 'authorization', '*.authorization'],
    censor: '[REDACTED]',
  },
  mixin: otelMixin, // inject trace_id and span_id from active OTel span
  formatters: {
    level: (label) => ({ level: label }),
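    // Also applied to child-logger bindings, so pass unknown keys (session_id, user_id) through
    bindings: ({ hostname, ...rest }) => (hostname === undefined ? rest : { ...rest, host: hostname }),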
  },
});

export const withSessionLogger = <T>(sessionId: string, userId: string, fn: () => T): T =>
  als.run(rootLogger.child({ session_id: sessionId, user_id: userId }), fn);

export const getLogger = () => als.getStore() ?? rootLogger;

Call withSessionLogger at initialize and wrap the entire session handler. Every await that runs inside the callback inherits the session logger — tool call handlers, downstream API calls, error handlers, all of them. getLogger() retrieves the correct session-bound logger anywhere in the call stack without needing to pass it as a parameter.
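
The propagation behaviour can be checked without Pino. In the sketch below a plain object stands in for the child logger's bound fields. withSession, logLine, handleToolCall, and demo are hypothetical names, and the two sessions are interleaved on purpose.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Stand-in for a Pino child logger: just the fields it would bind.
interface SessionFields { session_id: string; user_id: string }

const sessionStore = new AsyncLocalStorage<SessionFields>();

const withSession = <T>(fields: SessionFields, fn: () => T): T =>
  sessionStore.run(fields, fn);

// Builds the JSON line a session-bound logger would emit.
const logLine = (msg: string): string =>
  JSON.stringify({ ...sessionStore.getStore(), msg });

// A handler that awaits before logging and never receives the session
// as a parameter.
async function handleToolCall(tool: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 5));
  return logLine(`tool ${tool} done`);
}

// Two sessions running concurrently: each line keeps its own session_id.
function demo(): Promise<string[]> {
  return Promise.all([
    withSession({ session_id: 'sess_a', user_id: 'u1' }, () => handleToolCall('search')),
    withSession({ session_id: 'sess_b', user_id: 'u2' }, () => handleToolCall('fetch')),
  ]);
}
```

Outside any withSession call the store is empty, so logLine emits only the message. That mirrors getLogger() falling back to rootLogger.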

Two logging mistakes are common in MCP servers. The first is logging err.message instead of the full err object. The message property is rarely sufficient to reproduce a failure: you lose the stack trace, the error code, and any custom properties on the error class. Log { err } and let Pino serialise the full object. The second is logging database errors verbatim. Database error messages frequently contain fragments of the failing query, and queries sometimes embed user-supplied values. Sanitise database error messages before logging them:

const sanitiseDbError = (err: Error): string =>
  err.message.replace(/\b[\w.]+@[\w.]+\b/g, '[REDACTED]')
             .replace(/password=[^\s&]*/gi, 'password=[REDACTED]');
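
A short usage sketch, repeating the function so the snippet runs standalone. The error text is a made-up example of a driver message that leaks an email address and a credential.

```typescript
const sanitiseDbError = (err: Error): string =>
  err.message.replace(/\b[\w.]+@[\w.]+\b/g, '[REDACTED]')
             .replace(/password=[^\s&]*/gi, 'password=[REDACTED]');

// Hypothetical driver error, for illustration only.
const dbErr = new Error('insert failed for bob@example.com (password=hunter2&ssl=true)');

const safeMessage = sanitiseDbError(dbErr);
// safeMessage is 'insert failed for [REDACTED] (password=[REDACTED]&ssl=true)'
```

Note that in V8 the stack property begins with the error name and the original message, so logging the raw stack next to a sanitised message would reintroduce the leaked values. The same replacement needs to be applied to the stack string before it is logged.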

The log level strategy for MCP servers follows a consistent table:

Level	When to use	Examples
`fatal`	Process is about to exit	Unhandled exception in startup, invalid config
`error`	Tool call failed; action needed	Downstream API returned 5xx, schema validation failed
`warn`	Recoverable degradation	Circuit breaker open, retry attempt, rate limit approaching
`info`	Normal session events	`initialize`, `close`, tool call start/end with duration
`debug`	Per-request detail; off in production	Full tool arguments, response bodies, cache hits
`trace`	Internal library detail; never in production	Buffer contents, socket state changes

A common mistake is setting level: 'debug' in production to get more detail during an incident. Debug logging on a busy MCP server can produce enough volume to make the log pipeline itself a bottleneck, delaying the log lines you actually need. Keep production at info; use targeted feature flags to enable debug for a specific session ID when investigating a reported issue.
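
A minimal sketch of that per-session override. The flag format (a comma-separated list of session IDs) and the names parseDebugSessions and levelForSession are assumptions. Where the flag value comes from, such as an environment variable or a flag service, is left to the caller.

```typescript
type Level = 'info' | 'debug';

// Parses a flag value such as "sess_abc, sess_def" into a set of IDs.
// Empty entries and surrounding whitespace are ignored.
function parseDebugSessions(flagValue: string | undefined): Set<string> {
  return new Set(
    (flagValue ?? '')
      .split(',')
      .map((id) => id.trim())
      .filter((id) => id.length > 0),
  );
}

// Production default stays at info; only flagged sessions get debug.
function levelForSession(sessionId: string, debugSessions: Set<string>): Level {
  return debugSessions.has(sessionId) ? 'debug' : 'info';
}

const flagged = parseDebugSessions('sess_abc, sess_def');
const noisy = levelForSession('sess_abc', flagged); // 'debug'
const quiet = levelForSession('sess_xyz', flagged); // 'info'
```

Pino's child() accepts a level in its options argument, so the result could be passed when the session logger is created, for example rootLogger.child({ session_id }, { level: levelForSession(session_id, flagged) }). Every other session keeps the root level.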

Distributed Tracing: Cross-Service Latency Attribution

Distributed tracing answers the question that neither logs nor metrics can: "Of the 800 ms this tool call took, how much was in my code, how much was the search API, and how much was the database?" The answer is in the trace waterfall — each span shows its start time, duration, and which service it ran in.

The propagation mechanism is W3C traceparent: a header containing the trace ID, parent span ID, and trace flags. When an upstream LLM orchestrator sets this header on the initialize request, your MCP server extracts it and creates child spans under the same trace. When your MCP server calls a downstream API, it injects a new traceparent into the outgoing headers so that service can create child spans under your tool call span.

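import { AsyncLocalStorage } from 'node:async_hooks';
import { propagation, context } from '@opentelemetry/api';
import type { Context } from '@opentelemetry/api';

// Module scope: one store shared by all sessions
const sessionContextStorage = new AsyncLocalStorage<Context>();

// At session initialize: extract upstream context
// (`headers` is the header map of the incoming initialize request)
const parentCtx = propagation.extract(context.active(), headers);

// Wrap the session handler to carry the parent context.
// context.with makes it the active OTel context, so spans started inside
// become children of the upstream span.
sessionContextStorage.run(parentCtx, () =>
  context.with(parentCtx, () => {
    // All tool call handlers now inherit this context
  }),
);

// When calling downstream APIs: inject outgoing context
const outgoingHeaders: Record<string, string> = {};
propagation.inject(context.active(), outgoingHeaders);
await fetch(downstreamUrl, { headers: outgoingHeaders });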

The ParentBasedSampler in the NodeSDK setup respects the sampled bit in the incoming traceparent. If the upstream LLM orchestrator decides to sample this trace (bit set to 1), your MCP server samples it too, even if its own root sampling rate would otherwise have rejected it. This ensures that a trace that starts at the LLM client and passes through your MCP server to a database is either fully sampled or fully dropped — not partially sampled with gaps in the waterfall.
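
For reference, the header and the sampled bit can be read by hand. The sketch below parses a version-00 traceparent value and applies a toy version of the parent-based decision. parseTraceparent and shouldSample are hypothetical helpers for illustration; in a real server the OTel propagator and sampler do this work.

```typescript
interface TraceParent { traceId: string; parentSpanId: string; sampled: boolean }

// Parses a W3C traceparent header (version 00):
//   "00" - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
// Returns null for malformed values, including all-zero IDs.
function parseTraceparent(header: string): TraceParent | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (m === null) return null;
  const [, traceId, parentSpanId, flags] = m;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of trace-flags is the sampled flag.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Toy parent-based decision: follow the parent's sampled flag when a parent
// exists, otherwise fall back to the root sampler's decision.
function shouldSample(parent: TraceParent | null, rootDecision: () => boolean): boolean {
  return parent !== null ? parent.sampled : rootDecision();
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
// The upstream caller sampled this trace, so it is kept even though the
// root decision passed here would have dropped it.
const kept = shouldSample(incoming, () => false); // true
```

The root decision only matters for calls that arrive without a traceparent header, which is where the 10% ratio from the NodeSDK setup applies.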

For the backend, Jaeger all-in-one (single Docker container, OTLP HTTP on port 4318) is sufficient for development. For production, Grafana Tempo with S3 or GCS storage is the operationally simpler choice — Tempo separates write (distributor/ingester) and read (querier/query-frontend) paths and integrates natively with Grafana for the trace viewer. The Grafana derived-field configuration that links from a trace_id in a Loki log line to the corresponding Tempo trace is the payoff: one click from a slow log line to its full waterfall.

What distributed tracing cannot cover: if the MCP server process crashes before emitting the spans, or if the OTLP export fails (network partition between the server and the collector), no trace data is recorded. Sampling also means that 90% of tool calls — at a 10% sampling rate — produce no trace at all. The alerting tier and metrics cover all requests; tracing covers the sampled subset with full detail.

Log Aggregation: Durable Queryable History

Log aggregation is the persistence tier. Pino writes NDJSON to stdout; a log shipper (Promtail for Loki, Filebeat for Elasticsearch, or the Docker log driver for CloudWatch) captures that output and forwards it to a centralised store. The centralised store is what makes logs useful at scale: when you have five MCP server replicas each writing 10,000 log lines per minute, you need to query them as a single stream, not SSH into each instance and grep individually.

The Grafana Loki + Promtail stack is the lowest-overhead option for teams already running Grafana. A minimal Promtail config for Docker Compose:

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_logging]
        action: keep
        regex: "promtail"  # opt-in — only containers with this label are shipped
      - source_labels: [__meta_docker_container_name]
        regex: "/(.*)"                   # Docker reports container names with a leading slash
        target_label: container
    pipeline_stages:
      # No docker stage: docker_sd_configs reads logs through the Docker API, so lines arrive without the json-file envelope
      - json: { expressions: { level: level, time: time, session_id: session_id, duration_ms: duration_ms, trace_id: trace_id } }
      - labels: { level: null }          # index only low-cardinality fields as labels
      - timestamp: { source: time, format: UnixMs }   # Pino's default `time` is epoch milliseconds

The label promotion rule is important: only level becomes an indexed label, so {level="error"} uses the label index rather than a full-text scan. session_id, trace_id, and duration_ms are high-cardinality (session and trace IDs are unbounded), and promoting any of them would create a new Loki stream per value, so they stay as log line fields. Filtering on them uses | json | trace_id = "abc123", which is slower but acceptable for the rare queries that need it.

The four LogQL queries you will use most often:

# All errors (the time window comes from the query's time range, e.g. last 1h)
{container="mcp-server"} | json | level = "error"

# All log lines for one session
{container="mcp-server"} | json | session_id = "sess_xyz"

# Slow tool calls (over 1 second)
{container="mcp-server"} | json | duration_ms > 1000

# Errors per minute (metric query)
sum(count_over_time({container="mcp-server"} | json | level = "error" [1m]))
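
Before a Loki backend exists, the first three filters can be approximated locally over Pino's NDJSON output. The sketch below is illustrative only: the field names match the log lines described in this guide, and parseNdjson, errors, forSession, and slowCalls are hypothetical helpers.

```typescript
interface LogRecord { level?: string; session_id?: string; duration_ms?: number; msg?: string }

// Parses NDJSON, skipping blank lines and anything that is not a JSON object
// (for example a stray stack trace written straight to stdout).
function parseNdjson(text: string): LogRecord[] {
  const records: LogRecord[] = [];
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed === 'object' && parsed !== null) records.push(parsed as LogRecord);
    } catch {
      // not JSON: skip the line
    }
  }
  return records;
}

// Local equivalents of the level, session_id, and duration_ms filters.
const errors = (r: LogRecord[]): LogRecord[] => r.filter((x) => x.level === 'error');
const forSession = (r: LogRecord[], id: string): LogRecord[] =>
  r.filter((x) => x.session_id === id);
const slowCalls = (r: LogRecord[], thresholdMs = 1000): LogRecord[] =>
  r.filter((x) => (x.duration_ms ?? 0) > thresholdMs);

const sample = [
  '{"level":"info","session_id":"sess_xyz","duration_ms":120,"msg":"tool call end"}',
  '{"level":"error","session_id":"sess_xyz","duration_ms":2400,"msg":"downstream 503"}',
  'Error: not a JSON line',
  '{"level":"info","session_id":"sess_abc","duration_ms":1500,"msg":"tool call end"}',
].join('\n');

const records = parseNdjson(sample);
```

Here errors(records) returns one record, forSession(records, 'sess_xyz') returns two, and slowCalls(records) returns two. The non-JSON line is dropped at parse time.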

Log aggregation can also source alert rules. A Loki alert rule that fires on a spike in error-level log lines provides a backstop that triggers even when the Prometheus metrics pipeline is degraded — a circuit breaker alert that relies only on mcp_circuit_breaker_open misses failures that kill the process before updating the gauge.

What log aggregation cannot catch: the cases where no log line is ever written. An OOM kill (the kernel sends SIGKILL, so no handler runs), a process crash at startup before the logger initialises, an expired TLS certificate that makes clients abort the handshake before any request reaches the process, a DNS resolution failure: none of these produce log lines. The health check endpoint and AliveMCP external probes cover exactly this gap.

How the Five Layers Compose

The five layers are designed to address non-overlapping failure detection scenarios, not to duplicate each other:

OpenTelemetry SDK provides the shared instrumentation that makes the other four layers correlatable. Without OTel, you have four disconnected data streams; with it, every trace, metric exemplar, and log line can be linked by traceId.
Prometheus metrics fire alerts within minutes of a degradation across all sessions. They are the fastest path from "something is wrong" to "page on-call." They do not tell you which session is affected.
Pino structured logging tells you exactly what happened in the specific session that was reported as broken. It is the debugging layer, not the alerting layer. It is always-on (not sampled) and produces a complete record of every session event.
Distributed tracing tells you where a slow or failed tool call spent its time. It operates on a sampled subset and produces a waterfall view that logs cannot replicate. It answers "which downstream API is the bottleneck" in a way that no metric or log can.
Log aggregation makes the Pino logs queryable at scale, across replicas, after the fact. It is the persistence and query tier, not the emission tier.

The practical starting sequence for teams building out observability from scratch:

Add prom-client and expose the four golden signal metrics. This gives you an alerting tier in under an hour.
Replace console.log with Pino. Add AsyncLocalStorage session binding and the redact.paths config. Structured logs are queryable immediately with grep -E '"level":"error"' before you have a log aggregation backend.
Add the OTel NodeSDK with the Pino mixin for trace correlation. You now have trace IDs on every log line — even before you have a tracing backend, this makes log investigation dramatically faster.
Add Grafana Loki + Promtail for log aggregation. The NDJSON output from Pino is already in the right format; you are adding the shipping and storage layer, not changing the emission layer.
Add Jaeger or Grafana Tempo and confirm traces are appearing. Configure the Grafana derived field to link from trace_id in Loki to Tempo.

At each step you have a usable, if partial, observability capability that you can operate in production. You are not blocked on all five layers being in place before getting value from any of them.

The Gap the Internal Stack Cannot Fill

There is a class of failures that the entire five-layer internal observability stack cannot detect: failures that prevent the stack itself from running.

The process crashed before the logger or OTel SDK initialised — no log lines, no spans
An OOM kill terminated the process between log lines — the last line says "tool call started," no "tool call ended"
The TLS certificate expired: clients abort the handshake, so no request ever reaches application code
DNS resolution for the server's hostname is failing — clients cannot connect, the process is healthy but unreachable
The log shipping pipeline itself failed — Promtail crashed, the Loki ingester is OOM — log lines are emitted but never stored

These failures are invisible to internal signals by definition: they either prevent the process from producing signals at all, or they affect the pipeline that carries signals to storage.

External synthetic monitoring — AliveMCP probes that connect to your MCP server from outside, send real tool calls, and verify the responses — is the only mechanism that catches these cases. The probe does not rely on the server's internal logging, tracing, or metrics pipelines. It connects via the MCP protocol, makes a tool call, and records success or failure. An uptime monitor that pings /healthz catches process-down but not tool-call failures; a protocol-level probe catches both.
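
To illustrate what a protocol-level check looks at, here is a hedged sketch of the request and the pass/fail decision. It is not AliveMCP's implementation. buildToolCall and classifyResponse are hypothetical helpers, and the transport (the initialize handshake, session headers, timeouts) is deliberately left out.

```typescript
// JSON-RPC 2.0 request for the MCP `tools/call` method.
function buildToolCall(id: number, name: string, args: Record<string, unknown>) {
  return {
    jsonrpc: '2.0' as const,
    id,
    method: 'tools/call' as const,
    params: { name, arguments: args },
  };
}

type ProbeResult = { ok: true } | { ok: false; reason: string };

// Classifies a decoded JSON-RPC response. Transport failures (DNS, TLS,
// connection refused, timeout) never reach this function; the caller records
// those as failures when the request itself throws.
function classifyResponse(requestId: number, response: unknown): ProbeResult {
  if (typeof response !== 'object' || response === null) {
    return { ok: false, reason: 'not a JSON object' };
  }
  const r = response as {
    id?: unknown;
    error?: { message?: string } | null;
    result?: { isError?: boolean } | null;
  };
  if (r.id !== requestId) return { ok: false, reason: 'response id mismatch' };
  if (r.error != null) return { ok: false, reason: `rpc error: ${r.error.message ?? 'unknown'}` };
  if (r.result == null) return { ok: false, reason: 'missing result' };
  // A tool can fail while the RPC itself succeeds: MCP reports that via isError.
  if (r.result.isError === true) return { ok: false, reason: 'tool reported isError' };
  return { ok: true };
}
```

The isError branch is the case a /healthz ping cannot see: the process is up and the RPC succeeds, yet the tool itself reports a failure.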

The combined picture — internal five-layer observability stack plus external synthetic probes — covers the full failure surface. Internal observability tells you about degradations that happen while the server is running and reachable. External probes tell you about the cases where it is not.