MCP server code execution tools
Code execution is the most powerful — and most dangerous — MCP tool category. When an LLM can run arbitrary code, it can process data, run calculations, test hypotheses, and automate tasks in ways no static tool can match. It can also execute rm -rf /, open reverse shells, and exhaust your server's CPU and memory. This guide covers how to build a safe execute_code tool using Docker container isolation, resource limits, network blocking, timeout enforcement, and output capture.
TL;DR
Never execute LLM-generated code with eval(), child_process.exec(), or a bare subprocess.run() on your MCP server host. Always run untrusted code inside a Docker container with --network none, --memory 256m, --cpus 0.5, --read-only, and --security-opt no-new-privileges. Set a hard wall-clock timeout on the docker run call. Capture stdout and stderr separately, truncate large outputs, and never return raw process output without length limits.
Why eval() and child_process are not enough
The naive approach (passing code to eval() or child_process.exec()) runs with the same privileges as your MCP server process. In production, that means filesystem access to secrets, network access to internal services, and the ability to crash the server process. A Node.js vm context (new vm.Script()) does not fix this: the Node.js documentation states that the vm module is not a security mechanism, and sandboxed code can reach host objects through the prototype and constructor chain or by exploiting native module boundaries.
| Isolation level | Escapes sandbox? | Network access? | Filesystem access? |
|---|---|---|---|
| Node.js eval() | Yes (full process) | Yes | Yes |
| Node.js vm.Script | Yes (prototype/constructor chain) | Yes (via require) | Yes (via require) |
| Node.js worker thread | Yes (shares the host process) | Yes | Yes |
| Docker container (default) | No (container boundary) | Yes | Container only |
| Docker + --network none + --read-only | No | No | Read-only, container only |
| gVisor / Firecracker microVM | No (kernel boundary) | Configurable | Configurable |
For production deployments, Docker with restrictive flags is the practical minimum. For multi-tenant or high-security scenarios, consider gVisor (runsc) or Firecracker microVMs that provide kernel-level isolation.
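To make the vm.Script row concrete, here is a minimal sketch of the well-known constructor-chain escape. It assumes a stock Node.js runtime; the variable names are illustrative only. The script receives an empty context object, yet it can still obtain the host's process object.

```typescript
import * as vm from 'node:vm';

// The sandboxed script only sees an empty context object. That object was
// created in the host realm, so its constructor chain leads to the host's
// Function constructor, which evaluates code outside the sandbox.
const escapeAttempt = "this.constructor.constructor('return process')()";
const leaked: unknown = vm.runInNewContext(escapeAttempt, {});

// If the escape worked, the sandboxed code now holds the real process object.
const escaped = leaked === process;
console.log(`sandboxed code reached the host process object: ${escaped}`);
```

Once code holds the host process object it can read environment variables and load modules, which is why a vm context should not be treated as an isolation layer.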
Building the execute_code tool
The tool writes code to a temp file, passes it to a sandboxed Docker container, captures output, and cleans up. The container is ephemeral — created and destroyed per execution:
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';
import { execFile } from 'child_process';
import { promisify } from 'util';
import fs from 'fs/promises';
import path from 'path';
import os from 'os';

const execFileAsync = promisify(execFile);
const server = new McpServer({ name: 'code-runner', version: '1.0.0' });

const RUNTIME_IMAGES: Record<string, string> = {
  python: 'python:3.12-slim',
  javascript: 'node:22-alpine',
  typescript: 'tsx:latest', // placeholder: in practice a custom image with ts-node/tsx installed
  bash: 'bash:5-alpine',
};

const EXECUTION_TIMEOUT_MS = 15_000; // 15 seconds wall-clock
const MAX_OUTPUT_CHARS = 20_000;     // truncate stdout+stderr beyond this
server.tool(
  'execute_code',
  'Execute code in an isolated sandbox and return stdout/stderr output',
  {
    language: z.enum(['python', 'javascript', 'typescript', 'bash']),
    code: z.string().max(50_000).describe('Code to execute'),
    stdin_input: z.string().max(10_000).default('').describe('Optional stdin to pass to the program'),
  },
  async ({ language, code, stdin_input }) => {
    const image = RUNTIME_IMAGES[language];
    const tmpDir = await fs.mkdtemp(path.join(os.tmpdir(), 'mcp-exec-'));
    const ext = { python: 'py', javascript: 'js', typescript: 'ts', bash: 'sh' }[language];
    const codeFile = path.join(tmpDir, `code.${ext}`);

    try {
      await fs.writeFile(codeFile, code, 'utf8');

      const entrypoint = {
        python: ['python', '/sandbox/code.py'],
        javascript: ['node', '/sandbox/code.js'],
        typescript: ['tsx', '/sandbox/code.ts'],
        bash: ['bash', '/sandbox/code.sh'],
      }[language];

      const dockerArgs = [
        'run', '--rm',
        '--network', 'none',             // no network access
        '--memory', '256m',              // 256 MB RAM limit
        '--memory-swap', '256m',         // disable swap (same as RAM limit)
        '--cpus', '0.5',                 // half a CPU core
        '--read-only',                   // read-only root filesystem
        // Block privilege escalation. `docker run` has no standalone
        // --no-new-privileges flag; the setting is passed via --security-opt.
        '--security-opt', 'no-new-privileges:true',
        '--tmpfs', '/tmp:size=64m',      // writable temp space (64 MB)
        '-v', `${tmpDir}:/sandbox:ro`,   // mount code as read-only
        '-i',                            // keep stdin open so input can be piped in
        image,
        ...entrypoint,
      ];

      // The async execFile() has no `input` option (only the *Sync variants do),
      // so stdin is written through the child attached to the promisified call.
      const run = execFileAsync('docker', dockerArgs, {
        timeout: EXECUTION_TIMEOUT_MS,
        maxBuffer: 1024 * 1024, // 1 MB limit, applied to stdout and stderr separately
      });
      run.child.stdin?.on('error', () => {}); // ignore EPIPE if the program exits without reading stdin
      run.child.stdin?.end(stdin_input);      // send the input, then close stdin so readers see EOF
      const { stdout, stderr } = await run;

      const combined = [
        stdout ? `STDOUT:\n${stdout}` : '',
        stderr ? `STDERR:\n${stderr}` : '',
      ].filter(Boolean).join('\n\n') || '(no output)';

      return {
        content: [{
          type: 'text',
          text: combined.length > MAX_OUTPUT_CHARS
            ? combined.slice(0, MAX_OUTPUT_CHARS) + `\n\n[truncated: ${combined.length} chars total]`
            : combined,
        }]
      };
    } catch (e) {
      const err = e as NodeJS.ErrnoException & { killed?: boolean; stdout?: string; stderr?: string };
      // `killed` is set when execFile() terminated the docker client after the timeout.
      if (err.killed || err.code === 'ETIMEDOUT') {
        return { isError: true, content: [{ type: 'text', text: `Execution timed out after ${EXECUTION_TIMEOUT_MS / 1000}s` }] };
      }
      // For a non-zero exit, `code` holds the process exit code.
      return {
        isError: true,
        content: [{ type: 'text', text: [
          `Execution failed (exit code: ${err.code ?? 'unknown'})`,
          err.stderr ? `STDERR:\n${String(err.stderr).slice(0, 5_000)}` : '',
        ].filter(Boolean).join('\n\n') }]
      };
    } finally {
      // Always remove the temp directory, whether the run succeeded or failed.
      await fs.rm(tmpDir, { recursive: true, force: true });
    }
  }
);
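One caveat about the timeout option above: it signals the docker client process, which is not guaranteed to stop the container itself. A common mitigation is to give each container a unique name and run docker kill against that name once the deadline passes. The sketch below assumes that approach. newContainerName, withName, and killOnDeadline are hypothetical helpers introduced for illustration, not part of Docker or the MCP SDK.

```typescript
import { execFile } from 'node:child_process';
import { randomUUID } from 'node:crypto';

// Hypothetical helper: a unique, Docker-safe container name per execution.
function newContainerName(prefix = 'mcp-exec'): string {
  return `${prefix}-${randomUUID()}`;
}

// Insert `--name <name>` right after the `run` subcommand of an existing arg list.
function withName(dockerArgs: string[], name: string): string[] {
  if (dockerArgs[0] !== 'run') throw new Error('expected args to start with "run"');
  return ['run', '--name', name, ...dockerArgs.slice(1)];
}

// Start a timer that force-kills the named container when the deadline passes.
// Returns a function that cancels the timer; call it once the run has finished.
function killOnDeadline(
  name: string,
  timeoutMs: number,
  kill: (name: string) => void = (n) => {
    // Errors are ignored: the container may already have exited and been removed.
    execFile('docker', ['kill', n], () => {});
  },
): () => void {
  const timer = setTimeout(() => kill(name), timeoutMs);
  return () => clearTimeout(timer);
}

// Assumed usage inside the handler (not executed here):
//   const name = newContainerName();
//   const cancel = killOnDeadline(name, EXECUTION_TIMEOUT_MS);
//   try { await execFileAsync('docker', withName(dockerArgs, name), { ... }); }
//   finally { cancel(); }
```

With this in place, the timeout passed to execFile acts as a backstop for a hung client, and the docker kill call is what actually ends the workload.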
Container resource limits explained
| Flag | What it prevents | Recommended value |
|---|---|---|
| --network none | Outbound HTTP, lateral movement to internal services | Always set for untrusted code |
| --memory 256m --memory-swap 256m | Memory exhaustion that triggers OOM kills on the host | 64–512 MB depending on workload |
| --cpus 0.5 | CPU saturation from busy loops or runaway processes | 0.25–1.0 CPU |
| --read-only | Persistent filesystem writes in the container layer | Always set; add --tmpfs for temp writes |
| --security-opt no-new-privileges | Privilege escalation via setuid binaries | Always set |
| --pids-limit 64 | Fork bombs that spawn unlimited child processes | 32–128 PIDs |
| --ulimit nofile=256 | File descriptor exhaustion | 64–256 open files |
Add --pids-limit 64 and --ulimit nofile=256 to the dockerArgs array for defense in depth against fork bombs and file descriptor exhaustion attacks.
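One way to keep these flags consistent across tools is to build the argument list from a single limits object. The following is a minimal sketch under that assumption. SandboxLimits, DEFAULT_LIMITS, and buildDockerArgs are hypothetical names, and the defaults simply mirror the values used in this guide.

```typescript
// Hypothetical configuration shape; values mirror the flags discussed above.
interface SandboxLimits {
  memoryMb: number;     // --memory and --memory-swap
  cpus: number;         // --cpus
  pidsLimit: number;    // --pids-limit
  maxOpenFiles: number; // --ulimit nofile
  tmpfsMb: number;      // size of the writable /tmp
}

const DEFAULT_LIMITS: SandboxLimits = {
  memoryMb: 256,
  cpus: 0.5,
  pidsLimit: 64,
  maxOpenFiles: 256,
  tmpfsMb: 64,
};

// Build the full `docker run` argument list for one sandboxed execution.
function buildDockerArgs(
  image: string,
  command: string[],
  hostDir: string,
  overrides: Partial<SandboxLimits> = {},
): string[] {
  const limits = { ...DEFAULT_LIMITS, ...overrides };
  return [
    'run', '--rm', '-i',
    '--network', 'none',
    '--memory', `${limits.memoryMb}m`,
    '--memory-swap', `${limits.memoryMb}m`, // same as --memory, so no swap is available
    '--cpus', String(limits.cpus),
    '--pids-limit', String(limits.pidsLimit),
    '--ulimit', `nofile=${limits.maxOpenFiles}`,
    '--read-only',
    '--security-opt', 'no-new-privileges:true',
    '--tmpfs', `/tmp:size=${limits.tmpfsMb}m`,
    '-v', `${hostDir}:/sandbox:ro`,
    image,
    ...command,
  ];
}

// Example: a tighter PID limit for a Python run.
const exampleArgs = buildDockerArgs(
  'python:3.12-slim',
  ['python', '/sandbox/code.py'],
  '/tmp/mcp-exec-example',
  { pidsLimit: 32 },
);
console.log(exampleArgs.join(' '));
```

Keeping the builder pure (no process spawning) makes the flag set easy to unit test and to audit in code review.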
Pre-pulling images to avoid cold-start latency
The first execution of a language pulls the Docker image — potentially 50–200 MB of download that adds 30+ seconds to the first tool call. Pre-pull all runtime images at server startup:
async function prePullImages(): Promise<void> {
  for (const [lang, image] of Object.entries(RUNTIME_IMAGES)) {
    try {
      // `docker image inspect` exits non-zero when the image is not present locally.
      await execFileAsync('docker', ['image', 'inspect', image], { timeout: 5_000 });
      console.error(`[executor] ${lang} image present: ${image}`);
    } catch {
      console.error(`[executor] pulling ${lang} image: ${image}`);
      await execFileAsync('docker', ['pull', image], { timeout: 120_000 });
    }
  }
}

// Call at startup, before registering the server transport
await prePullImages();
In Kubernetes, use an init container or DaemonSet to warm images on every node. In a PM2 setup, run the pre-pull from a wrapper script or deploy hook so that it finishes before the server process starts.
Output from long-running computations
Some computations produce output incrementally — a data processing script that prints progress every few seconds. The execFile pattern above buffers all output and returns it at completion. For streaming output, use MCP streaming responses or structure the tool to accept a time-budget parameter and return partial results on timeout.
// Partial-result pattern: run up to budget_seconds, return whatever completed
server.tool(
  'execute_code_partial',
  'Run code with a time budget; returns partial output on timeout',
  {
    language: z.enum(['python', 'javascript']),
    code: z.string().max(50_000),
    budget_seconds: z.number().min(1).max(30).default(10),
  },
  async ({ language, code, budget_seconds }) => {
    // ... same Docker setup as above ...
    // On ETIMEDOUT, return whatever stdout/stderr was captured before timeout
    // (requires using spawn() instead of execFile() to capture incremental output)
    return { content: [{ type: 'text', text: '...' }] };
  }
);
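The spawn() requirement mentioned in the stub can be sketched as follows. This is an illustration under stated assumptions, not the guide's implementation. BoundedOutput and runWithBudget are hypothetical names, and the command is generic, so when it is docker run the container itself should also be stopped on timeout, for example with docker kill against a named container.

```typescript
import { spawn } from 'node:child_process';

// Accumulates output up to a character cap and records whether anything was dropped.
class BoundedOutput {
  private text = '';
  private dropped = false;
  private readonly maxChars: number;

  constructor(maxChars: number) {
    this.maxChars = maxChars;
  }

  append(chunk: string): void {
    const room = this.maxChars - this.text.length;
    if (chunk.length > room) this.dropped = true;
    if (room > 0) this.text += chunk.slice(0, room);
  }

  get value(): string { return this.text; }
  get truncated(): boolean { return this.dropped; }
}

interface PartialResult {
  stdout: string;
  stderr: string;
  exitCode: number | null; // null when the process was killed by a signal
  timedOut: boolean;
}

// Run a command for at most budgetMs and resolve with whatever output arrived by then.
function runWithBudget(
  command: string,
  args: string[],
  budgetMs: number,
  maxChars = 20_000,
): Promise<PartialResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'pipe'] });
    const out = new BoundedOutput(maxChars);
    const err = new BoundedOutput(maxChars);
    let timedOut = false;

    // Chunks are collected as they arrive, so they survive a later kill.
    child.stdout.setEncoding('utf8').on('data', (c: string) => out.append(c));
    child.stderr.setEncoding('utf8').on('data', (c: string) => err.append(c));

    const timer = setTimeout(() => {
      timedOut = true;
      child.kill('SIGKILL');
    }, budgetMs);

    child.on('error', (e) => { clearTimeout(timer); reject(e); });
    child.on('close', (exitCode) => {
      clearTimeout(timer);
      resolve({ stdout: out.value, stderr: err.value, exitCode, timedOut });
    });
  });
}
```

A handler built on this would return the stdout field even when timedOut is true, and mention the timeout in the response text so the model knows the output is incomplete.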
Monitoring code-execution MCP servers
Code execution servers fail in ways that differ from typical MCP tool failures. A Docker daemon crash silently takes down all execution capability: the MCP transport still responds normally to initialize and tools/list, but every execute_code call fails with an error such as "Cannot connect to the Docker daemon". A full disk prevents temp file creation. An image pull failure makes one language unavailable while the others keep working.
Add a canary execution check to your health check endpoint: run a trivial code snippet (print("ok")) and verify output contains the expected string. This end-to-end check catches Docker daemon failures, image issues, and resource limits that a transport-only check misses. AliveMCP probes your full MCP endpoint every 60 seconds, catching protocol-level and handler-level failures before users encounter broken code execution in their AI workflows.
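The canary check described above might look like the sketch below. It assumes the health endpoint can call the same execution path as execute_code through an injected function. Executor, evaluateCanary, and runCanary are hypothetical names, and the HTTP wiring is left out.

```typescript
interface ExecResult {
  stdout: string;
  stderr: string;
  exitCode: number | null;
}

// Hypothetical signature for whatever runs code in the sandbox.
type Executor = (language: string, code: string) => Promise<ExecResult>;

interface CanaryStatus {
  healthy: boolean;
  reason: string;
}

const CANARY_CODE = 'print("ok")';
const CANARY_EXPECTED = 'ok';

// Pure check: decide health from the result of the canary run.
function evaluateCanary(result: ExecResult): CanaryStatus {
  if (result.exitCode !== 0) {
    return { healthy: false, reason: `canary exited with code ${result.exitCode}` };
  }
  if (!result.stdout.includes(CANARY_EXPECTED)) {
    return { healthy: false, reason: 'canary output did not contain the expected string' };
  }
  return { healthy: true, reason: 'ok' };
}

// Run the canary through the real executor; any thrown error counts as unhealthy.
async function runCanary(execute: Executor): Promise<CanaryStatus> {
  try {
    return evaluateCanary(await execute('python', CANARY_CODE));
  } catch (e) {
    const message = e instanceof Error ? e.message : String(e);
    return { healthy: false, reason: `canary failed to run: ${message}` };
  }
}
```

A health endpoint would call runCanary with the production executor and return a non-2xx status when healthy is false, so that an external probe sees Docker-level failures as well as transport failures.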
Further reading
- MCP server Docker deployment — containerizing your MCP server
- MCP tool design — argument schemas and error return shapes
- MCP server timeout — wall-clock and per-tool execution limits
- MCP server streaming — progressive output for long-running tools
- MCP server health check — canary tool calls for end-to-end validation
- MCP server error handling — isError patterns for subprocess failures
- MCP server filesystem tools — safe file access for LLM agents
- AliveMCP — uptime monitoring for HTTP-deployed MCP servers