
MCP server error handling

MCP servers have two distinct error layers that are easy to confuse. Protocol errors are JSON-RPC level failures — malformed requests, unknown methods, invalid parameters — and are returned as JSON-RPC error objects with a numeric code. Application errors are business logic failures — a tool could not fetch the upstream API, a query returned zero results, a rate limit was hit — and are returned as successful JSON-RPC responses with isError: true in the result. Getting this distinction wrong causes one of two failure modes: surfacing application errors as protocol errors (breaks the session), or swallowing protocol errors as application errors (hides bugs). AliveMCP monitors the initialize exchange — a protocol error during initialization causes a probe failure; an isError result in a tool call does not.
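To make the two layers concrete, the sketch below shows one response of each kind and a small function that tells them apart. The type names (`JsonRpcErrorResponse`, `ToolCallResponse`) and `classifyResponse` are hypothetical, simplified stand-ins written for this illustration; the SDK ships its own, fuller definitions.

```typescript
// Simplified, hypothetical response shapes for illustration only.
type JsonRpcErrorResponse = {
  jsonrpc: '2.0';
  id: number | string | null;
  error: { code: number; message: string; data?: unknown };
};

type ToolCallResponse = {
  jsonrpc: '2.0';
  id: number | string;
  result: { content: { type: 'text'; text: string }[]; isError?: boolean };
};

type RpcResponse = JsonRpcErrorResponse | ToolCallResponse;

// Protocol error: the request itself was rejected.
const protocolError: JsonRpcErrorResponse = {
  jsonrpc: '2.0',
  id: 7,
  error: { code: -32601, message: 'Method not found' },
};

// Application error: the request succeeded at the protocol level,
// but the tool reports that its own work failed.
const applicationError: ToolCallResponse = {
  jsonrpc: '2.0',
  id: 8,
  result: {
    isError: true,
    content: [{ type: 'text', text: 'Upstream API returned 429' }],
  },
};

type Layer = 'protocol_error' | 'application_error' | 'success';

// An `error` member means the protocol layer failed; otherwise inspect isError.
function classifyResponse(res: RpcResponse): Layer {
  if ('error' in res) return 'protocol_error';
  return res.result.isError === true ? 'application_error' : 'success';
}
```

A client or test harness could use a check like this to confirm that expected failures arrive as `application_error` and never as `protocol_error`.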

TL;DR

Use return { isError: true, content: [...] } for application errors: the tool call failed but the MCP session continues. Use throw new McpError(ErrorCode.InternalError, 'message') for protocol-level problems that should fail the current request. Let genuinely unexpected exceptions propagate; the SDK wraps them in -32603 Internal error responses. Never throw from a tool handler to signal an application-level failure (like "file not found"): the session stays open, but the client receives an opaque internal error instead of a message the model can act on. Log all errors with structured fields so you can alert on error rate spikes.

JSON-RPC error codes

The MCP protocol is built on JSON-RPC 2.0. Standard error codes:

| Code | Name | When it occurs |
|---|---|---|
| -32700 | Parse error | Request body is not valid JSON |
| -32600 | Invalid request | JSON is valid but not a valid JSON-RPC request (missing jsonrpc, method, or id) |
| -32601 | Method not found | Client called a JSON-RPC method the server does not implement (e.g., a capability the server never declared) |
| -32602 | Invalid params | Method exists but the parameters fail validation (Zod schema mismatch, or a tools/call that names an unknown or removed tool) |
| -32603 | Internal error | Unhandled exception in a handler; the SDK catches thrown errors and wraps them |
| -32000 to -32099 | Server error (reserved) | Implementation-defined server errors |

The @modelcontextprotocol/sdk package exposes the standard codes as the ErrorCode enum and adds a few SDK-specific codes from the implementation-defined range, such as ConnectionClosed (-32000) and RequestTimeout (-32001). MCP-specific failures use the same JSON-RPC error structure and the same negative range; for example, the MCP specification uses -32002 for a resource that cannot be found. You rarely need these codes directly, because the SDK raises them internally for protocol-level failures.
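When triaging logs it can help to translate a numeric code back into its name. The following is a minimal sketch based only on the standard codes in the table above; `STANDARD_CODES` and `describeErrorCode` are hypothetical names, not part of the SDK.

```typescript
// Standard JSON-RPC 2.0 error codes, keyed by numeric code.
const STANDARD_CODES = new Map<number, string>([
  [-32700, 'Parse error'],
  [-32600, 'Invalid request'],
  [-32601, 'Method not found'],
  [-32602, 'Invalid params'],
  [-32603, 'Internal error'],
]);

// Returns a human-readable label for a JSON-RPC error code.
function describeErrorCode(code: number): string {
  const known = STANDARD_CODES.get(code);
  if (known !== undefined) return known;
  // -32000 to -32099 is reserved for implementation-defined server errors.
  if (code >= -32099 && code <= -32000) return 'Server error (implementation-defined)';
  return 'Application-defined or unknown code';
}
```

For example, `describeErrorCode(-32602)` yields `'Invalid params'`, and any code between -32099 and -32000 is reported as implementation-defined.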

isError vs McpError — which to use

The single most important error handling decision in an MCP server:

| Situation | Return type | Session after | Example |
|---|---|---|---|
| Tool's business logic failed | { isError: true, content: [...] } | Still open | API returned 429, file not found, query returned zero results, downstream timeout |
| Tool's input is semantically wrong (not caught by schema) | { isError: true, content: [...] } | Still open | URL parameter points to a private resource, date range is inverted |
| Protocol-level invariant violated | throw new McpError(...) | Request fails; session may continue | Tool handler cannot locate a required resource that the protocol guarantees exists |
| Unrecoverable server error | Let the exception propagate (uncaught throw) | Request fails with -32603; session may continue | Critical invariant broken, memory corruption suspected |

// Correct: application error → isError: true
server.tool(
  'fetch_weather',
  'Get current weather for a city',
  { city: z.string() },
  async (args) => {
    let data: WeatherData;
    try {
      data = await weatherApi.get(args.city);
    } catch (err: any) {
      // API failure = application error; session continues
      return {
        isError: true,
        content: [{ type: 'text', text: `Could not fetch weather: ${err.message}` }],
      };
    }

    if (!data) {
      return {
        isError: true,
        content: [{ type: 'text', text: `No weather data found for city: ${args.city}` }],
      };
    }

    return { content: [{ type: 'text', text: JSON.stringify(data) }] };
  }
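);

// Rare: protocol-level error → McpError
import { ResourceTemplate } from '@modelcontextprotocol/sdk/server/mcp.js';
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';

server.resource(
  'config',
  // Template variables such as {key} arrive as the callback's second argument
  new ResourceTemplate('config://{key}', { list: undefined }),
  { description: 'Read a server configuration value' },
  async (uri, { key }) => {
    const keyName = Array.isArray(key) ? key[0] : key;
    // If the config store is completely unavailable, throw McpError
    // so the SDK returns a well-formed JSON-RPC error response
    if (!configStore.isAvailable()) {
      throw new McpError(ErrorCode.InternalError, 'Configuration store unavailable');
    }
    const value = configStore.get(keyName);
    if (value === undefined) {
      // -32002 is the code the MCP specification uses for "resource not found";
      // the SDK's ErrorCode enum may not define a named member for it,
      // so the number is passed directly
      throw new McpError(-32002, `Config key not found: ${keyName}`);
    }
    return { contents: [{ uri: uri.href, text: value }] };
  }
);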

Retry-safe vs non-retry-safe errors

When returning isError: true, include enough information in the error message for the AI client to decide whether to retry. A well-designed error message distinguishes transient failures (retry after delay) from permanent failures (do not retry):

// Structured error results with retry guidance
type ToolErrorResult = { isError: true; content: { type: 'text'; text: string }[] };

function transientError(message: string): ToolErrorResult {
  return {
    isError: true,
    content: [{ type: 'text', text: `${message} (transient: please retry in a few seconds)` }],
  };
}

function permanentError(message: string): ToolErrorResult {
  return {
    isError: true,
    content: [{ type: 'text', text: `${message} (permanent: do not retry without changing the request)` }],
  };
}

// In a tool handler:
if (err.status === 429) return transientError('Rate limit exceeded');
if (err.status === 503) return transientError('Upstream service temporarily unavailable');
if (err.status === 404) return permanentError(`Resource not found: ${args.id}`);
if (err.status === 403) return permanentError('Insufficient permissions for this resource');

AI clients that retry failed tool calls can only base that decision on what the error result says: the model reads the error text and chooses between retrying, changing the request, and surfacing the error to the user. Clear retry guidance in the error text is one of the most impactful usability improvements you can make to a production MCP server.

Global error handler and unhandled rejections

Always register a global uncaught exception handler and unhandled rejection handler. The SDK wraps exceptions thrown from tool handlers in -32603 Internal error responses, but exceptions thrown from outside a handler (event emitters, setTimeout callbacks, async code not awaited in a handler) are not caught by the SDK:

// Global safety net — these errors are bugs, not expected failures
process.on('uncaughtException', (err) => {
  console.error({ event: 'uncaught_exception', error: err.message, stack: err.stack });
  // Exit so a process supervisor can restart the server; its state is unknown
  process.exit(1);
});

process.on('unhandledRejection', (reason) => {
  console.error({ event: 'unhandled_rejection', reason: String(reason) });
  // Depending on severity: may want to exit, or just log and continue
  // If this fires frequently, a bug is swallowing promise chains
});

Configure AliveMCP webhook alerts so that a process crash (connection refused on the next probe) immediately fires an incident notification. The probe will detect the outage within 60 seconds of the crash — the mean time to detection (MTTD) from an uncaught exception is at most 60 seconds plus alert delivery time.
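The global handlers are a last resort. For work that a handler starts but does not await, one option is to attach a catch at the point where the work is launched, so a failure is logged with context instead of surfacing as an unhandled rejection. The sketch below assumes a hypothetical helper named `runInBackground` and a caller-supplied `log` function; neither comes from the SDK.

```typescript
type BackgroundLogEntry = { event: string; label: string; error: string };

// Runs a fire-and-forget task and guarantees its failure is logged,
// whether the task throws synchronously or rejects later.
function runInBackground(
  label: string,
  task: () => Promise<unknown>,
  log: (entry: BackgroundLogEntry) => void,
): Promise<void> {
  return Promise.resolve()
    .then(task)
    .then(() => undefined)
    .catch((err: unknown) => {
      log({
        event: 'background_task_failed',
        label,
        error: err instanceof Error ? err.message : String(err),
      });
    });
}
```

The returned promise never rejects, so a caller can ignore it safely; the label makes the failing task identifiable in the logs.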

Structured error logging

Log every error at the right level with the fields needed to triage quickly:

// Error logging in tool handlers
// Log to stderr: on the stdio transport, stdout carries the JSON-RPC stream
const log = (fields: Record<string, unknown>) => console.error(JSON.stringify(fields));

server.tool('my_tool', 'Description', schema, async (args) => {
  const start = Date.now();
  try {
    const result = await doWork(args);
    log({ level: 'info', event: 'tool_success', tool: 'my_tool', duration_ms: Date.now() - start });
    return { content: [{ type: 'text', text: result }] };
  } catch (err: any) {
    // Rate limits, 5xx responses, and dropped connections are worth retrying
    const isTransient = err.status === 429 || err.status >= 500 || err.code === 'ECONNRESET';
    log({
      level: 'error',
      event: 'tool_error',
      tool: 'my_tool',
      error_code: err.status ?? err.code,
      transient: isTransient,
      duration_ms: Date.now() - start,
      // Do NOT log args: they may contain user PII
    });
    return {
      isError: true,
      content: [{ type: 'text', text: isTransient
        ? 'Temporary failure; please retry.'
        : `Operation failed: ${err.message}` }],
    };
  }
});

Alert thresholds to configure: (1) tool_error rate > 5% of tool_success events over a 5-minute window — indicates a degraded upstream or a bug introduced in a recent deploy; (2) any uncaught_exception event — should fire a P0 incident; (3) unhandled_rejection rate > 0 — indicates a code quality issue to address immediately.
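The first threshold can be sketched as a sliding-window check over logged events. This is an illustration under stated assumptions, not a monitoring product's implementation: the `ToolEvent` shape and `shouldAlert` are hypothetical, and the rule is read literally as the number of tool_error events exceeding 5% of the number of tool_success events in the last five minutes.

```typescript
type ToolEvent = { kind: 'tool_success' | 'tool_error'; atMs: number };

const WINDOW_MS = 5 * 60 * 1000;       // 5-minute window
const ERROR_RATIO_THRESHOLD = 0.05;    // errors > 5% of successes

// Returns true when the error-to-success ratio in the window exceeds the threshold.
function shouldAlert(events: ToolEvent[], nowMs: number): boolean {
  const recent = events.filter((e) => e.atMs > nowMs - WINDOW_MS && e.atMs <= nowMs);
  const errors = recent.filter((e) => e.kind === 'tool_error').length;
  const successes = recent.filter((e) => e.kind === 'tool_success').length;
  if (errors === 0) return false;
  if (successes === 0) return true; // only failures in the window
  return errors / successes > ERROR_RATIO_THRESHOLD;
}
```

In practice this calculation usually lives in the log or metrics platform as a query; the function only pins down the arithmetic, including the edge case where every call in the window failed.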

Related questions

Does throwing inside a tool handler close the MCP session?

No — the SDK catches exceptions thrown from tool handlers and returns them as -32603 Internal error JSON-RPC responses. The session remains open. The next tool call will succeed if the error was transient. However, throwing is not the right pattern for application errors — use isError: true instead. Reserve throwing for genuinely unexpected conditions that indicate a bug (null pointer, invariant violation) where continuing the session would produce incorrect results.

How does AliveMCP detect errors vs outages?

AliveMCP distinguishes three states: (1) up — the initialize handshake completes and the response matches the MCP protocol spec; (2) degraded — the server responds but with a non-200 HTTP status or a malformed MCP response; (3) down — connection refused, DNS failure, or TLS error. Protocol errors during initialize show as degraded. Application errors (returned via isError: true in tool calls) are not visible to the probe because AliveMCP does not call tools — only initialize and tools/list.

Should I validate parameters beyond what Zod handles?

Yes, for semantic validation that the parameter schema does not express. For example, a Zod field can validate that a date string matches YYYY-MM-DD format, but a check that spans two fields, such as a start date falling before an end date, does not fit in per-field rules (Zod's object-level .refine() can express it, but the raw parameter shape passed to server.tool has no place for one). Add semantic validation at the top of the tool handler and return isError: true for semantic failures. This keeps the Zod schema focused on structural validation and the handler focused on business logic validation; they are separate concerns.
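A minimal sketch of the date-range case follows. `checkDateRange` and `ToolErrorResult` are hypothetical names; the function assumes both strings have already passed the YYYY-MM-DD format check, and it treats equal start and end dates as valid.

```typescript
type ToolErrorResult = { isError: true; content: { type: 'text'; text: string }[] };

// Returns an error result when the range is inverted, or null when it is valid.
// YYYY-MM-DD strings sort lexicographically in chronological order,
// so a plain string comparison is enough here.
function checkDateRange(startDate: string, endDate: string): ToolErrorResult | null {
  if (startDate > endDate) {
    return {
      isError: true,
      content: [{
        type: 'text',
        text: `Invalid date range: start ${startDate} is after end ${endDate}. ` +
          'Swap the dates or correct one of them, then retry.',
      }],
    };
  }
  return null;
}
```

A tool handler could call this first and return the result immediately when it is not null, before doing any upstream work.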

What should the error message say to an AI client?

Write error messages as if you are talking to an LLM that will decide what to do next. Include: what failed, whether it is permanent or transient, and what (if anything) the client can change to succeed. Bad: "Error 500". Good: "The GitHub API returned a 500 error — this is likely transient. Retrying in 5 seconds may succeed. If the error persists, the repository URL may be unreachable." The AI client reads the error content and uses it to decide whether to retry, change the request, or surface the failure to the user.
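One way to keep messages consistent is to build them from the three ingredients listed above. The sketch below uses a hypothetical `FailureInfo` shape and `describeFailure` helper; the exact wording is an assumption and should be tuned to your tools.

```typescript
type FailureInfo = {
  what: string;               // what failed
  transient: boolean;         // whether a retry may succeed
  retryAfterSeconds?: number; // optional delay to suggest for transient failures
  hint?: string;              // what the client can change, if anything
};

// Assembles an error message aimed at a model deciding what to do next.
function describeFailure(f: FailureInfo): string {
  const parts: string[] = [f.what.endsWith('.') ? f.what : `${f.what}.`];
  if (f.transient) {
    parts.push(
      f.retryAfterSeconds !== undefined
        ? `This is likely transient; retrying in ${f.retryAfterSeconds} seconds may succeed.`
        : 'This is likely transient; a retry may succeed.',
    );
  } else {
    parts.push('This is permanent; retrying the same request will not help.');
  }
  if (f.hint) parts.push(f.hint);
  return parts.join(' ');
}
```

The returned string can be placed in the text content of an isError: true result.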
