Architecture guide · 2026-06-02 · Production operations

MCP Server Architecture Guide: Plugins, Middleware, Multi-Tenant Isolation, and Protocol Bridges

Most MCP server tutorials stop at the point where a single developer gets a tool list to respond. A production MCP server has four structural concerns that tutorials don't reach: how to layer the HTTP middleware stack so the ordering itself enforces your security model, how to compose tool handlers from a plugin system at startup so they can be deployed independently, how to serve multiple tenants from one process without module-scope contamination, and how to bridge existing WebSocket or gRPC backends without coupling their connection lifecycle to individual tool calls. This guide covers all four — not as abstract patterns, but as concrete decisions with specific consequences when you get them wrong.


Why architecture decisions that are optional for REST APIs are load-bearing for MCP servers

A REST API can survive a badly organized codebase for a long time. Routes are stateless, middleware composition is mostly additive, and errors fail fast with an HTTP status code. The production footprint of a REST endpoint is a single request/response cycle — nothing leaks between requests unless you explicitly share module-scope state.

MCP servers are structurally different in three ways that make architecture decisions consequential from the first deploy.

Sessions are stateful and long-lived. An initialize handshake begins a session that stays open until the client disconnects — in an agent context, that can be minutes or hours. Authentication evaluated at session creation is the authentication that covers every subsequent tool call in that session. If your auth middleware fires after the MCP transport handler, it fires after the session is already open, which means a session can be established without ever being authenticated. The middleware ordering is the authentication model; getting it wrong is a security boundary failure, not a performance problem.

Tool surface is the authorization layer. A REST API gates access to endpoints with route-level middleware. An MCP server's equivalent is the tool list returned by tools/list. If a tenant is not authorized to call a tool, the correct implementation is to not register that tool on their session's McpServer instance — not to register it and return an error when they call it. Registering a tool you don't intend callers to use creates audit surface and leaks implementation details to the tool list. Per-tenant plugin activation is where this is enforced.

Module scope persists for the process lifetime. REST APIs commonly use module-scope caches and configuration objects. In a multi-tenant MCP server, any module-scope value that is set during one tenant's session and read during another's is a data isolation failure. The failure mode is subtle: it works correctly under serial load (one tenant at a time), breaks silently under concurrent load (two tenants' sessions overlap), and the wrong tenant's data appears in the right tenant's tool results without any error or log entry.

The middleware stack: ordering is the security model

The five layers of a production MCP HTTP middleware stack, in the correct order:

  1. Correlation ID injection — generates a requestId and reads or creates a sessionId, stores both in AsyncLocalStorage so they are available throughout the request without parameter threading.
  2. Structured request logger — reads from AsyncLocalStorage and logs method, path, and eventually status code and duration_ms. Uses res.on('finish') to capture the final status after the response is sent. For SSE connections (MCP sessions), duration_ms equals the session lifetime — that's expected, not a bug.
  3. Auth guard — validates Bearer token or JWT, returns 401 before touching the MCP transport. The critical constraint: this middleware fires before transport.handleRequest(), which means an unauthenticated request never creates an MCP session. If auth runs inside a tool handler, an unauthenticated client has already passed initialize and can enumerate your tool list before being rejected.
  4. Rate limiter — enforces per-IP or per-key request rate before transport.handleRequest(). Returns 429 before the MCP transport allocates session resources. A rate limiter applied inside the transport (to individual tool calls) is useful but insufficient — it doesn't prevent a burst of initialize requests from exhausting session slots.
  5. MCP transport handler: transport.handleRequest(req, res, req.body). This is where the MCP session is created, where tools/list is served, and where tool calls are dispatched.
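Layers 1 and 2 can be sketched with Node's built-in AsyncLocalStorage. This is a minimal illustration, not the guide's reference implementation: the RequestContext shape and the withCorrelation and logLine helpers are hypothetical names, and incomingSessionId stands in for whatever header carries the session ID in your deployment.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Hypothetical per-request context; field names follow the list above.
interface RequestContext {
  requestId: string;
  sessionId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Layer 1: create the context and run the rest of the chain inside it.
// `incomingSessionId` is an assumption standing in for a session header value.
function withCorrelation<T>(incomingSessionId: string | undefined, next: () => T): T {
  const ctx: RequestContext = {
    requestId: randomUUID(),
    sessionId: incomingSessionId ?? randomUUID(),
  };
  return requestContext.run(ctx, next);
}

// Layer 2 and everything below: read the IDs without parameter threading.
function logLine(message: string): string {
  const ctx = requestContext.getStore();
  return JSON.stringify({ requestId: ctx?.requestId, sessionId: ctx?.sessionId, message });
}
```

Anything called inside withCorrelation, including code after an await, sees the same context. Code called outside it sees no store, which is why logLine uses optional chaining.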

The ordering is not arbitrary. Placing the logger before auth means you log every request, including those that fail authentication, which is useful for detecting credential stuffing. Placing auth before the rate limiter means rejected requests never consume quota and limits can be keyed on the authenticated identity. That is the correct behavior if rate limits are per authenticated identity, and the wrong behavior if they are per IP to prevent pre-auth flooding (in which case the rate limiter moves up above auth). Registering monitoring endpoints (/healthz, /metrics) per-route outside the auth middleware means they don't require credentials, which is correct for health checks that your uptime probe needs to reach without a Bearer token. The full pattern, with the AsyncLocalStorage context propagation that makes session IDs available in every log line, is in the MCP server middleware guide.

One pattern to avoid: app.use(authMiddleware) followed by individual app.get('/healthz', ...) exemptions that use a skip condition inside the middleware. Those skip conditions are brittle: they require exact path matching, can fail on query strings, and are easy to forget when adding new routes. Prefer explicit per-route registration: app.post('/mcp', authMiddleware, rateLimitMiddleware, (req, res) => transport.handleRequest(req, res, req.body)) and app.get('/healthz', healthzHandler), with no auth middleware on the health route at all. Wrapping the transport call in an arrow function keeps its this binding and passes the parsed body explicitly.
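The effect of the ordering can be shown without any HTTP framework. The sketch below uses hypothetical stand-ins throughout (Req, Res, Middleware, the hard-coded token, and a two-request budget with no time window are all assumptions for illustration). It only demonstrates that a layer which does not call next() prevents everything below it from running.

```typescript
// Minimal stand-ins for HTTP request/response; these are not Express types.
interface Req { headers: Record<string, string | undefined>; ip: string }
interface Res { status: number | null }
type Middleware = (req: Req, res: Res, next: () => void) => void;

let sessionsCreated = 0;
const hits = new Map<string, number>();

// Auth guard: reject with 401 before anything downstream runs.
const auth: Middleware = (req, res, next) => {
  if (req.headers['authorization'] !== 'Bearer valid-token') { res.status = 401; return; }
  next();
};

// Rate limiter: hypothetical fixed budget of 2 requests per IP, no time window.
const rateLimit: Middleware = (req, res, next) => {
  const count = (hits.get(req.ip) ?? 0) + 1;
  hits.set(req.ip, count);
  if (count > 2) { res.status = 429; return; }
  next();
};

// Stand-in for the MCP transport handler: the only place a session is created.
const transport: Middleware = (_req, res) => {
  sessionsCreated += 1;
  res.status = 200;
};

// Run middleware in array order; a layer that does not call next() ends the chain.
function handle(chain: Middleware[], req: Req): Res {
  const res: Res = { status: null };
  const run = (i: number): void => {
    const layer = chain[i];
    if (layer) layer(req, res, () => run(i + 1));
  };
  run(0);
  return res;
}

const mcpRoute: Middleware[] = [auth, rateLimit, transport];
```

With this order, an unauthenticated request ends at the auth guard: no quota is consumed and no session is created. Swapping auth and rateLimit in the array gives the per-IP, pre-auth variant described above.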

Plugin architecture: composition at startup

A monolithic MCP server where all tool handlers live in one file works until it doesn't — when the team grows past two people, when different tools need different dependencies, or when you need to enable different tool sets for different customers. The plugin pattern addresses all three.

The interface is deliberately minimal:

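import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import type { Pool } from 'pg'; // assumption: node-postgres; any shared pool type works here

// AppConfig and Logger are application-defined types.
interface McpPlugin {
  name: string;
  version: string;
  register(server: McpServer, deps: PluginDeps): void;
}

// Shared infrastructure, constructed once in the bootstrap and passed to every plugin.
interface PluginDeps {
  db: Pool;
  config: AppConfig;
  logger: Logger;
}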

Each plugin receives shared infrastructure via PluginDeps rather than constructing its own. This is the critical design constraint: if each plugin opens its own database pool, a server with ten plugins opens ten pools, each sized for peak concurrent tool calls. At node-postgres's default of 10 connections per pool, that is 100 connections from one process, which already equals Postgres's default max_connections of 100 before any other client connects. Infrastructure construction belongs in the application bootstrap, shared across all plugins via PluginDeps.

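The registry and the bootstrap sequence together enforce two invariants: no duplicate plugin names (checked in register), and all registration completes before the server starts accepting connections (guaranteed by calling listen only after registerAll returns).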

class PluginRegistry {
  private plugins = new Map<string, McpPlugin>();

  register(plugin: McpPlugin, server: McpServer, deps: PluginDeps) {
    if (this.plugins.has(plugin.name)) {
      throw new Error(`Duplicate plugin: ${plugin.name}`);
    }
    plugin.register(server, deps);
    this.plugins.set(plugin.name, plugin);
  }

  registerAll(plugins: McpPlugin[], server: McpServer, deps: PluginDeps) {
    for (const p of plugins) this.register(p, server, deps);
  }
}

// Bootstrap sequence: registerAll is synchronous, so listen() cannot run before it finishes
const registry = new PluginRegistry();
registry.registerAll([weatherPlugin, calendarPlugin, filesPlugin], server, deps);
app.listen(PORT); // Express-style listen returns a server, not a promise, so no await is needed

For larger teams, plugins can be discovered by directory scan rather than explicit import: read a plugins/ directory, dynamically import each index.ts, and call register. This allows plugin deployment as independent packages without changing the server entry point. The plugin contract (name, version, register method) is the stable API boundary.

Hot reload is not reliable in practice. The MCP specification has a notifications/tools/list_changed notification for signalling tool list changes, but many clients cache the tool list for the session lifetime and ignore the notification. The correct reload strategy is a rolling restart: replace the old process with a new one after graceful drain. In-process module swapping (deleting require.cache entries or using dynamic import() with a version query string) creates subtle state bugs when the old module has pending async operations, and it is not a supported pattern for production workloads.

Per-tenant plugin activation is where the plugin system pays its most important dividend: instead of registering all plugins on a shared McpServer instance, create a per-session McpServer and register only the plugins the tenant's feature set authorizes. A tenant without a calendar integration never sees calendar tools in their tools/list. The tool surface is the authorization boundary — it cannot be enforced after registration. The per-tenant activation pattern and its interaction with multi-tenant routing are covered in the MCP server plugins guide.
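A minimal sketch of per-session activation follows. The ToolHost and Plugin interfaces, the tool names, and buildSessionHost are hypothetical simplifications; a real server would register on its SDK's server type and look up the tenant's entitlements from its own store.

```typescript
// Hypothetical minimal interfaces; a real server would use its SDK's server type.
interface ToolHost { tools: string[] }
interface Plugin {
  name: string;
  register(host: ToolHost): void;
}

const weatherPlugin: Plugin = {
  name: 'weather',
  register: (host) => { host.tools.push('get_forecast'); },
};
const calendarPlugin: Plugin = {
  name: 'calendar',
  register: (host) => { host.tools.push('list_events', 'create_event'); },
};

const allPlugins: Plugin[] = [weatherPlugin, calendarPlugin];

// Build a fresh tool host per session and register only what the tenant is entitled to.
function buildSessionHost(enabledPlugins: ReadonlySet<string>): ToolHost {
  const host: ToolHost = { tools: [] };
  for (const plugin of allPlugins) {
    if (enabledPlugins.has(plugin.name)) plugin.register(host);
  }
  return host;
}
```

A tenant whose entitlement set is only 'weather' gets a host whose tool list contains get_forecast and nothing else. The calendar tools are never registered, so there is nothing to reject at call time.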

Multi-tenant isolation: the module-scope discipline

The fundamental rule for multi-tenant MCP servers: if a value differs between tenants, it must never live in module scope.

This rule is easy to state and easy to violate. Consider a typical pattern that works fine in a single-tenant server:

// WRONG for multi-tenant: module-scope tenant state
let currentTenantId: string;
let currentTenantConfig: TenantConfig;

server.setRequestHandler(InitializeRequestSchema, async (request) => {
  currentTenantId = extractTenantId(request);
  currentTenantConfig = await loadTenantConfig(currentTenantId);
  return { /* ... */ };
});

server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const config = currentTenantConfig; // RACE CONDITION under concurrent sessions
  // ...
});

Under serial load, this works. Under concurrent load — two tenants' sessions overlapping — Tenant A's initialize handler overwrites currentTenantConfig while Tenant B's tool handler is reading it. Tenant B sees Tenant A's config. No error is thrown. The wrong data appears in the right session's tool results.
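The failure can be reproduced deterministically by interleaving two sessions by hand. In the sketch below, onInitialize and onToolCall are hypothetical stand-ins for the two handlers, and the calls run sequentially rather than concurrently so that the outcome is repeatable. In a real server the same interleaving arises whenever one session's handler runs between another session's initialize and tool call.

```typescript
interface TenantConfig { tenantId: string; dbSchema: string }

// Module-scope state, as in the anti-pattern above.
let currentTenantConfig: TenantConfig | undefined;

// Hypothetical stand-in for the initialize handler.
function onInitialize(tenantId: string): void {
  currentTenantConfig = { tenantId, dbSchema: `tenant_${tenantId}` };
}

// Hypothetical stand-in for the tool-call handler: reads whatever was written last.
function onToolCall(): string {
  if (!currentTenantConfig) throw new Error('no session');
  return currentTenantConfig.dbSchema;
}

// Two sessions overlap: B initializes, then A initializes, then B calls a tool.
onInitialize('b');
onInitialize('a');
const schemaSeenByB = onToolCall(); // B's call reads A's config: 'tenant_a'
```

Nothing throws and nothing is logged; the only symptom is that B's tool call runs against A's schema.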

The correct pattern uses a Map<sessionId, TenantContext> and always pairs set operations with cleanup:

const sessions = new Map<string, TenantContext>();

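server.setRequestHandler(InitializeRequestSchema, async (request, extra) => {
  // extra.sessionId is optional in the SDK types; refuse to key the Map on undefined
  const sessionId = extra.sessionId;
  if (!sessionId) throw new McpError(ErrorCode.InvalidRequest, 'Missing session ID');

  const ctx = await buildTenantContext(request);
  sessions.set(sessionId, ctx);

  // Always pair set with cleanup to prevent unbounded map growth.
  // Caveat: confirm this signal fires at session end in your SDK version; if it is
  // scoped to the single request, attach the delete to the transport's close hook instead.
  extra.signal.addEventListener('abort', () => {
    sessions.delete(sessionId);
  });

  return { /* ... */ };
});

server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
  const ctx = extra.sessionId ? sessions.get(extra.sessionId) : undefined;
  if (!ctx) throw new McpError(ErrorCode.InternalError, 'Session not found');
  // ctx is isolated to this session; no sharing with other tenants
});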

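The cleanup handler is not optional, and it must be attached to something that actually fires at session end. In the official TypeScript SDK, extra.signal is documented as a per-request cancellation signal, so res.on('close') on the session's response or the transport's close callback is the safer hook. A Map that grows without cleanup will exhaust memory under load, since MCP sessions are long-lived and may not close promptly on client disconnection. The session Map is infrastructure, not a cache: it has no TTL-based eviction, only explicit delete on session end.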
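One way to keep the set and the delete from drifting apart is to register both in a single helper. The sketch below assumes a hypothetical Closeable interface standing in for whatever signals session end in your transport (an HTTP response's close event, a transport close callback). trackSession and fakeConnection are illustrative names, not SDK APIs.

```typescript
interface TenantContext { tenantId: string }

// Hypothetical minimal view of anything that can signal session end.
interface Closeable { onClose(listener: () => void): void }

const sessions = new Map<string, TenantContext>();

// Register the context and its cleanup in one step so they cannot drift apart.
function trackSession(sessionId: string, ctx: TenantContext, closeable: Closeable): void {
  sessions.set(sessionId, ctx);
  closeable.onClose(() => { sessions.delete(sessionId); });
}

// Tiny in-memory Closeable used only to exercise the sketch.
function fakeConnection(): Closeable & { close(): void } {
  const listeners: Array<() => void> = [];
  return {
    onClose: (listener) => { listeners.push(listener); },
    close: () => { for (const listener of listeners) listener(); },
  };
}
```

Because there is no code path that calls sessions.set without also registering the delete, a leak can only come from a close event that never fires, which is a narrower problem to monitor.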

Data isolation at the storage layer follows from the same principle. Row-level security (Postgres SET LOCAL app.tenant_id + an RLS policy) is the correct approach for most shared-schema architectures: it enforces isolation at the database level regardless of application bugs, and it survives developer errors like forgetting to add a WHERE tenant_id = ? clause. Schema-per-tenant (each tenant in a separate Postgres schema) provides stronger isolation at the cost of more complex migration management. Separate-databases-per-tenant is strongest but prohibitively expensive for long-tail tenants. Column-based filtering with no database-level enforcement is prototype territory — it works until the query is wrong. The full isolation pattern table and AliveMCP monitoring configuration for multi-tenant subdomains are in the MCP server multi-tenant guide.
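The row-level security approach can be sketched as a transaction wrapper. This is an illustration under stated assumptions: SqlClient is a hypothetical minimal interface (node-postgres' client.query has a compatible shape), withTenant is an invented helper name, and the documents table in the trailing comment is hypothetical. The one Postgres-specific detail is that SET LOCAL does not accept bind parameters, so the sketch uses set_config with its third argument set to true, which scopes the setting to the current transaction in the same way.

```typescript
// Hypothetical minimal client interface.
interface SqlClient {
  query(text: string, values?: unknown[]): Promise<{ rows: unknown[] }>;
}

// Run `work` inside a transaction whose app.tenant_id setting is scoped to that transaction.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query('BEGIN');
  try {
    // Parameterizable equivalent of SET LOCAL app.tenant_id = ...
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}

// Matching policy, assuming a hypothetical `documents` table with a text tenant_id column:
//   ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY tenant_isolation ON documents
//     USING (tenant_id = current_setting('app.tenant_id'));
```

Two details are easy to miss. The setting disappears at COMMIT or ROLLBACK, so a pooled connection cannot carry one tenant's ID into the next checkout. Table owners bypass row-level security unless the table also has FORCE ROW LEVEL SECURITY, so the application should connect as a non-owner role.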

Protocol bridges: connecting existing backends without coupling their lifecycle

Many MCP servers are adapters: they expose MCP tool interfaces to AI agents while delegating actual work to existing WebSocket services or gRPC microservices. The architecture question is where those backend connections live and how their lifecycle relates to MCP session lifecycle.

The answer in both cases: module scope, created at startup.

For gRPC backends:

import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Module scope — one channel per service, created at startup, reused across all tool calls
const packageDef = protoLoader.loadSync('service.proto', { keepCase: true });
const proto = grpc.loadPackageDefinition(packageDef) as any;

const serviceClient = new proto.mypackage.MyService(
  process.env.GRPC_SERVICE_ADDR,
  grpc.credentials.createInsecure()
);

// Per-tool-call: promisify and invoke, never create a new channel
function grpcCall<T>(method: string, request: unknown): Promise<T> {
  return new Promise((resolve, reject) => {
    (serviceClient as any)[method](request, (err: grpc.ServiceError | null, res: T) => {
      if (err) reject(err);
      else resolve(res);
    });
  });
}

Creating a new gRPC channel per tool call is the most common gRPC-to-MCP integration mistake. Establishing a channel involves a TCP connection, a TLS handshake (when TLS is enabled), and an HTTP/2 SETTINGS exchange: in aggregate, 50–200ms of latency that is paid on every call. Under a sustained tool call rate, the churn of short-lived connections can exhaust ephemeral ports before it exhausts CPU. The module-scope channel multiplexes all tool calls over the same HTTP/2 connection, amortizes establishment cost to startup, and is the standard gRPC client usage pattern.

Error mapping from gRPC status codes to MCP error responses requires a deliberate decision for each code:
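One possible mapping, offered as a sketch rather than a prescribed table. The numeric gRPC status codes and the JSON-RPC error codes are fixed by their specifications; the policy itself (which codes become retryable tool errors, which become protocol errors, and the message texts) is an assumption made for illustration, as are the Mapped type and mapGrpcError name.

```typescript
// gRPC status codes (numeric values are fixed by the gRPC specification).
const GrpcStatus = {
  INVALID_ARGUMENT: 3,
  DEADLINE_EXCEEDED: 4,
  NOT_FOUND: 5,
  PERMISSION_DENIED: 7,
  RESOURCE_EXHAUSTED: 8,
  INTERNAL: 13,
  UNAVAILABLE: 14,
  UNAUTHENTICATED: 16,
} as const;

// JSON-RPC 2.0 error codes used for protocol-level failures.
const INVALID_PARAMS = -32602;
const INTERNAL_ERROR = -32603;

type Mapped =
  | { kind: 'tool_error'; retryable: boolean; text: string } // returned as a tool result with isError: true
  | { kind: 'protocol_error'; code: number; text: string };  // thrown as a JSON-RPC error

// Hypothetical policy: caller mistakes and transient outages stay visible to the agent,
// everything unexpected collapses to a generic internal error.
function mapGrpcError(code: number, details: string): Mapped {
  switch (code) {
    case GrpcStatus.INVALID_ARGUMENT:
      return { kind: 'protocol_error', code: INVALID_PARAMS, text: details };
    case GrpcStatus.NOT_FOUND:
      return { kind: 'tool_error', retryable: false, text: `Not found: ${details}` };
    case GrpcStatus.DEADLINE_EXCEEDED:
    case GrpcStatus.UNAVAILABLE:
    case GrpcStatus.RESOURCE_EXHAUSTED:
      return { kind: 'tool_error', retryable: true, text: 'Backend temporarily unavailable' };
    case GrpcStatus.PERMISSION_DENIED:
    case GrpcStatus.UNAUTHENTICATED:
      return { kind: 'tool_error', retryable: false, text: 'Not authorized for this operation' };
    default:
      // Do not forward unknown backend details to the model.
      return { kind: 'protocol_error', code: INTERNAL_ERROR, text: 'Internal error' };
  }
}
```

The split matters because an agent can read a tool-level error and decide to retry or rephrase, while a protocol-level error usually ends the call. Whichever policy you choose, the default branch should not echo backend details.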

For WebSocket backends, the same module-scope principle applies: one WebSocket client per backend service, reconnecting on close events, never one connection per tool call or per MCP session. The MCP transport itself uses HTTP+SSE rather than WebSocket — an architectural choice the protocol made deliberately because standard HTTP infrastructure (load balancers, CDNs, proxies) handles POST requests without full sticky routing, and because a health probe is a plain HTTP POST that requires no WebSocket client library. The detailed configuration for proxy buffering, SSE infrastructure, and why the WebSocket-inside-tool-handler anti-pattern appears so often is in the MCP server WebSockets guide. The full gRPC bridge pattern — proto loading, metadata forwarding for end-to-end tracing, and the health_check tool that probes all gRPC dependencies — is in the MCP server gRPC guide.

How these four concerns interact: the AliveMCP probe view

An external uptime probe (AliveMCP's included) sends a real initialize handshake followed by a tools/list request and validates the response shape. This confirms your server is alive, your transport is responding, and your tool schema is well-formed. It does not confirm session-level correctness or per-tenant isolation, which are only observable from inside the server.

The full picture requires both external protocol-aware monitoring (for process-level health and schema validation) and internal structured logging (for session-level correctness and per-tenant isolation). For the complete production readiness checklist — authentication, rate limiting, error handling, connection pooling, schema governance, and CI gates — see the MCP server production checklist. For the distinction between HTTP-level health and protocol-level health — why a server can pass an HTTP 200 check and fail every real MCP tool call — see JSON-RPC health checks vs HTTP probes.
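The protocol-level half of that picture can be sketched as the JSON-RPC messages a probe sends plus a shape check on the tools/list response. This is not AliveMCP's implementation: the clientInfo values, the probeSequence name, and the isValidToolsListResponse helper are assumptions, and the check covers only the fields named in the comment.

```typescript
// JSON-RPC messages for the probe, in send order. clientInfo values are hypothetical.
const initializeRequest = {
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: {
    protocolVersion: '2025-06-18',
    capabilities: {},
    clientInfo: { name: 'example-probe', version: '0.0.1' },
  },
};
const initializedNotification = { jsonrpc: '2.0', method: 'notifications/initialized' };
const toolsListRequest = { jsonrpc: '2.0', id: 2, method: 'tools/list' };
const probeSequence = [initializeRequest, initializedNotification, toolsListRequest];

// Shape check for a tools/list response: a result with a tools array whose
// entries each carry a string name and an object-typed inputSchema.
function isValidToolsListResponse(body: unknown): boolean {
  if (typeof body !== 'object' || body === null) return false;
  const result = (body as { result?: unknown }).result;
  if (typeof result !== 'object' || result === null) return false;
  const tools = (result as { tools?: unknown }).tools;
  if (!Array.isArray(tools)) return false;
  return tools.every((tool) => {
    if (typeof tool !== 'object' || tool === null) return false;
    const { name, inputSchema } = tool as { name?: unknown; inputSchema?: unknown };
    return (
      typeof name === 'string' &&
      typeof inputSchema === 'object' &&
      inputSchema !== null &&
      (inputSchema as { type?: unknown }).type === 'object'
    );
  });
}
```

A check like this passes for any tenant's tool list as long as the list is well-formed, which is exactly why it cannot detect that a session was served the wrong tenant's tools.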

The order to address these concerns

If you are building a new MCP server, address these concerns in this order:

  1. Middleware stack first. Auth and rate limiting at the transport boundary are non-negotiable before any external traffic. The middleware ordering is one of two decisions in this list that are hard to change after the fact (the other is the module-scope discipline). Retrofitting auth into a server that was built without it requires changing every tool handler rather than inserting one middleware before the transport. Do it first.
  2. Module-scope discipline second. If the server will ever serve more than one tenant — even "us" and "them" — establish the Map<sessionId, TenantContext> pattern before writing any tool handlers. Module-scope state introduced during single-tenant development is deeply embedded by the time multi-tenant is added; tenant context must be threaded through every call site, and the refactor touches every tool handler. Design for it from session one.
  3. Plugin architecture when the team or tool set grows. A server with three tools and one developer does not need a plugin system. Add the plugin registry pattern when you have more than one developer contributing tool handlers, when different tools need to be deployed on different schedules, or when you need per-tenant tool activation. These are natural inflection points where the cost of the pattern is less than the cost of the coordination overhead it replaces.
  4. Protocol bridges when your tools need them. Module-scope gRPC channels and WebSocket clients are not inherently more complex than module-scope database pools. Add them when you have a backend to bridge; don't add them speculatively. The one constraint to introduce early: never allow per-call channel creation. If that pattern gets established in the codebase before load testing, it will persist until it causes a production incident.
