
MCP Server PDF Tools — Text Extraction, Page Chunking, and RAG Pipelines

PDF extraction is one of the most common document processing tasks in agentic workflows: reading contracts, invoices, research papers, and manuals so an LLM can answer questions about them. This guide covers integrating PDF parsing into a TypeScript MCP server — choosing between pdf-parse and pdfjs-dist, extracting full text and per-page content, chunking for RAG pipelines, exposing pages as MCP resources, handling encrypted and scanned PDFs, and wiring a /health/pdf endpoint so AliveMCP detects when your document pipeline breaks.

TL;DR

Use pdf-parse for simple full-text extraction (fast, zero-config) and pdfjs-dist when you need per-page content, text position data, or link extraction. Cap PDF input at 50 MB and 500 pages; larger files risk exhausting memory or hanging the parser. Return extracted text as a TextContent block with clear page delimiters; for RAG pipelines, return an array of page chunks, each capped at a token budget (the chunking example in this guide defaults to 1,000 tokens, configurable up to 4,000), that the agent can embed individually. Wire /health/pdf to parse a minimal 1-page test PDF rather than just checking whether the process is alive: initialization failures, such as a missing Worker module in pdfjs-dist, break every PDF tool while the process itself still looks healthy.

Library comparison: pdf-parse vs pdfjs-dist

Two libraries dominate PDF extraction in Node.js, with different trade-offs for MCP tool development:

Feature	pdf-parse	pdfjs-dist
Full text extraction	Yes	Yes
Per-page content	Yes (via callback)	Yes (native)
Text position/layout	No	Yes (`getTextContent()`)
Link extraction	No	Yes (`getAnnotations()`)
Setup complexity	Zero-config	Requires Worker setup in Node.js
Bundle size	~5 MB	~50 MB
Cold start overhead	~30 ms	~200 ms first parse
Maintenance	Low activity	Active (Mozilla)

For most MCP server use cases, start with pdf-parse. Switch to pdfjs-dist if you need text position data (for table extraction or reading order correction) or link extraction.

Full-text extraction with pdf-parse

import pdfParse from 'pdf-parse';
import { z } from 'zod';
import { McpError, ErrorCode } from '@modelcontextprotocol/sdk/types.js';

const MAX_PDF_BYTES = 50 * 1024 * 1024;  // 50 MB
const MAX_PAGES = 500;

server.tool(
  'extract_pdf_text',
  {
    pdf_base64: z.string().min(1).describe('Base64-encoded PDF file contents'),
    max_pages: z.number().int().min(1).max(MAX_PAGES).default(MAX_PAGES),
    include_metadata: z.boolean().default(true)
  },
  async ({ pdf_base64, max_pages, include_metadata }) => {
    const pdfBuffer = Buffer.from(pdf_base64, 'base64');

    if (pdfBuffer.length > MAX_PDF_BYTES) {
      throw new McpError(
        ErrorCode.InvalidParams,
        `PDF too large: ${(pdfBuffer.length / 1e6).toFixed(1)} MB (max 50 MB)`
      );
    }

    // Validate PDF header magic bytes
    if (!pdfBuffer.slice(0, 5).equals(Buffer.from('%PDF-'))) {
      throw new McpError(ErrorCode.InvalidParams, 'Input does not appear to be a valid PDF');
    }

    let pageCount = 0;
    const pageTexts: string[] = [];

    const result = await pdfParse(pdfBuffer, {
      max: max_pages,
      pagerender: (pageData: { getTextContent: () => Promise<{ items: Array<{ str: string; hasEOL: boolean }> }>; pageIndex: number }) => {
        pageCount++;
        return pageData.getTextContent().then(content => {
          const text = content.items
            .map((item: { str: string; hasEOL: boolean }) => item.str + (item.hasEOL ? '\n' : ' '))
            .join('');
          pageTexts.push(text.trim());
          return text;
        });
      }
    });

    const output: Record<string, unknown> = {
      page_count: result.numpages,
      pages_extracted: pageCount,
      text: result.text,
      char_count: result.text.length
    };

    if (include_metadata) {
      output.metadata = {
        title: result.info?.Title ?? null,
        author: result.info?.Author ?? null,
        subject: result.info?.Subject ?? null,
        keywords: result.info?.Keywords ?? null,
        creator: result.info?.Creator ?? null,
        producer: result.info?.Producer ?? null,
        created: result.info?.CreationDate ?? null,
        modified: result.info?.ModDate ?? null,
        pdf_version: result.info?.PDFFormatVersion ?? null
      };
    }

    return {
      content: [{ type: 'text', text: JSON.stringify(output, null, 2) }]
    };
  }
);
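
The tool above returns `result.text` as one concatenated string, while the `pageTexts` array it collects goes unused. A minimal sketch of how those per-page strings could be joined with explicit page markers follows. The marker format and the helper name `joinPagesWithDelimiters` are illustrative assumptions, not part of pdf-parse, and the sketch assumes pages were pushed in document order.

```typescript
// Hypothetical helper: joins per-page text with an explicit page marker so an
// LLM (or a downstream parser) can tell where each page starts.
function joinPagesWithDelimiters(pageTexts: string[], totalPages: number): string {
  return pageTexts
    .map((text, i) => `--- Page ${i + 1} of ${totalPages} ---\n${text.trim()}`)
    .join('\n\n');
}

const sample = joinPagesWithDelimiters(['Invoice #1001', 'Terms and conditions'], 2);
```

In the tool, the result of this call could replace `result.text` in the output object, at the cost of a slightly longer response.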

Per-page chunking for RAG pipelines

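When building a RAG pipeline, you need page-level chunks rather than a single text blob: the agent needs to know which page a fact came from, and many embedding setups retrieve best with chunks of roughly 300–500 tokens. The chunk_pdf_pages tool below returns an array of chunk objects, each with a chunk ID, page number, sub-chunk index, text, and an estimated token count. Its max_tokens_per_chunk parameter defaults to 1,000; lower it if you want chunks in that smaller range.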

// Rough token estimator — 1 token ≈ 4 characters for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

server.tool(
  'chunk_pdf_pages',
  {
    pdf_base64: z.string().min(1),
    max_pages: z.number().int().min(1).max(MAX_PAGES).default(MAX_PAGES),
    max_tokens_per_chunk: z.number().int().min(100).max(4000).default(1000),
    include_page_numbers: z.boolean().default(true)
  },
  async ({ pdf_base64, max_pages, max_tokens_per_chunk, include_page_numbers }) => {
    const pdfBuffer = Buffer.from(pdf_base64, 'base64');
    if (pdfBuffer.length > MAX_PDF_BYTES) {
      throw new McpError(ErrorCode.InvalidParams, `PDF too large`);
    }

    const pageTexts: Array<{ page: number; text: string }> = [];
    let pageIndex = 0;

    await pdfParse(pdfBuffer, {
      max: max_pages,
      pagerender: (pageData: { getTextContent: () => Promise<{ items: Array<{ str: string; hasEOL: boolean }> }> }) => {
        const currentPage = ++pageIndex;
        return pageData.getTextContent().then(content => {
          const text = content.items
            .map((item: { str: string; hasEOL: boolean }) => item.str + (item.hasEOL ? '\n' : ' '))
            .join('')
            .trim();
          if (text.length > 0) {
            pageTexts.push({ page: currentPage, text });
          }
          return text;
        });
      }
    });

    // Split pages that exceed max_tokens_per_chunk into sub-chunks
    const chunks: Array<{ chunk_id: string; page: number; sub_chunk: number; text: string; tokens: number }> = [];
    let chunkIndex = 0;

    for (const { page, text } of pageTexts) {
      const tokens = estimateTokens(text);
      if (tokens <= max_tokens_per_chunk) {
        chunks.push({
          chunk_id: `chunk_${++chunkIndex}`,
          page,
          sub_chunk: 1,
          text,
          tokens
        });
      } else {
        // Split by paragraph boundaries
        const paragraphs = text.split(/\n{2,}/).filter(p => p.trim().length > 0);
        let currentChunk = '';
        let subChunk = 0;

        for (const para of paragraphs) {
          if (estimateTokens(currentChunk + '\n\n' + para) > max_tokens_per_chunk && currentChunk.length > 0) {
            chunks.push({
              chunk_id: `chunk_${++chunkIndex}`,
              page,
              sub_chunk: ++subChunk,
              text: currentChunk.trim(),
              tokens: estimateTokens(currentChunk)
            });
            currentChunk = para;
          } else {
            currentChunk = currentChunk ? currentChunk + '\n\n' + para : para;
          }
        }
        if (currentChunk.trim().length > 0) {
          chunks.push({
            chunk_id: `chunk_${++chunkIndex}`,
            page,
            sub_chunk: ++subChunk,
            text: currentChunk.trim(),
            tokens: estimateTokens(currentChunk)
          });
        }
      }
    }

    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          total_chunks: chunks.length,
          total_pages: pageTexts.length, // counts only pages with extractable text; blank pages are skipped
          chunks: include_page_numbers ? chunks : chunks.map(({ page: _p, ...c }) => c)
        }, null, 2)
      }]
    };
  }
);

The chunk result is designed for embedding: each chunk has a chunk_id (sequential, so stable only for the same PDF and the same max_tokens_per_chunk), the originating page number, and an estimated token count so the agent can verify the chunk fits within embedding model limits before calling the embedding API.
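
One gap in the splitter above: it only breaks at blank lines, so a single paragraph larger than max_tokens_per_chunk is emitted as one oversized chunk. A possible fallback, sketched under the same 4-characters-per-token assumption, breaks such a paragraph at sentence ends and hard-cuts anything still too long. `splitOversized` is a hypothetical helper name, and the sentence detection is deliberately naive (abbreviations such as "e.g." will also trigger a split).

```typescript
// Same rough estimator as above: 1 token is about 4 characters of English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical fallback: split text so that every piece fits within maxTokens.
// Prefers sentence boundaries, then falls back to a hard character cut.
function splitOversized(text: string, maxTokens: number): string[] {
  const maxChars = maxTokens * 4;
  // A "sentence" is a run ending in ., ! or ?, or the trailing remainder.
  const sentences = (text.match(/[^.!?]*[.!?]+|[^.!?]+$/g) ?? [])
    .map(s => s.trim())
    .filter(s => s.length > 0);

  const pieces: string[] = [];
  let current = '';

  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current) pieces.push(current);
    // A single sentence longer than the limit is cut at fixed offsets.
    let rest = sentence;
    while (rest.length > maxChars) {
      pieces.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) pieces.push(current);
  return pieces;
}

const parts = splitOversized('One two three. Four five six. ' + 'x'.repeat(50), 5);
```

In the chunking loop, this could be applied to any paragraph whose estimate exceeds the limit before it is appended to `currentChunk`.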

Exposing PDF pages as MCP resources

If an agent will make multiple queries against the same document, storing it as a resource and reading individual pages on demand is more efficient than re-parsing the entire PDF on each tool call.

import crypto from 'node:crypto';
import fs from 'node:fs/promises';
import path from 'node:path';
import { ReadResourceRequestSchema } from '@modelcontextprotocol/sdk/types.js';

const PDF_STORE_DIR = process.env.PDF_STORE_DIR ?? '/tmp/mcp-pdfs';

// Store a PDF and return its document ID
export async function storePdf(buffer: Buffer): Promise<string> {
  await fs.mkdir(PDF_STORE_DIR, { recursive: true });
  const docId = crypto.createHash('sha256').update(buffer).digest('hex').slice(0, 16);
  await fs.writeFile(path.join(PDF_STORE_DIR, `${docId}.pdf`), buffer);
  return docId;
}

server.tool(
  'store_pdf',
  { pdf_base64: z.string().min(1) },
  async ({ pdf_base64 }) => {
    const buffer = Buffer.from(pdf_base64, 'base64');
    if (buffer.length > MAX_PDF_BYTES) throw new McpError(ErrorCode.InvalidParams, 'PDF too large');
    const docId = await storePdf(buffer);
    const result = await pdfParse(buffer, { max: 1 }); // parse just for page count
    return {
      content: [{ type: 'text', text: JSON.stringify({ doc_id: docId, total_pages: result.numpages }) }]
    };
  }
);

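// Resource URI: pdf://{docId}/page/{pageNum}
// setRequestHandler belongs to the SDK's low-level Server class; if `server` is an
// McpServer, the underlying instance is available as `server.server`.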
server.setRequestHandler(ReadResourceRequestSchema, async (request) => {
  const { uri } = request.params;
  const match = uri.match(/^pdf:\/\/([a-f0-9]{16})\/page\/(\d+)$/);
  if (!match) throw new McpError(ErrorCode.InvalidParams, `Unknown URI: ${uri}`);

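  const [, docId, pageStr] = match;
  const pageNum = parseInt(pageStr, 10);
  // Pages are 1-based. Page 0 would become `max: 0`, which pdf-parse treats as "all pages".
  if (pageNum < 1) throw new McpError(ErrorCode.InvalidParams, `Page numbers start at 1: ${uri}`);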
  const filePath = path.join(PDF_STORE_DIR, `${docId}.pdf`);
  const buffer = await fs.readFile(filePath).catch(() => {
    throw new McpError(ErrorCode.InvalidParams, `Document ${docId} not found`);
  });

  let pageText = '';
  await pdfParse(buffer, {
    max: pageNum,
    pagerender: (pageData: { pageIndex: number; getTextContent: () => Promise<{ items: Array<{ str: string; hasEOL: boolean }> }> }) => {
      if (pageData.pageIndex === pageNum - 1) {
        return pageData.getTextContent().then(c => {
          pageText = c.items.map((i: { str: string; hasEOL: boolean }) => i.str + (i.hasEOL ? '\n' : ' ')).join('');
          return pageText;
        });
      }
      return Promise.resolve('');
    }
  });

  return { contents: [{ uri, mimeType: 'text/plain', text: pageText.trim() }] };
});
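
The handler above returns empty text, rather than an error, when the requested page lies beyond the end of the document. A stricter parse could reject such requests up front. The sketch below assumes the same `pdf://{docId}/page/{pageNum}` scheme; `parsePdfPageUri` and its `totalPages` argument (which the caller would have to look up, for example from the value `store_pdf` returned) are hypothetical.

```typescript
type PdfPageRef = { docId: string; page: number };

// Hypothetical helper: parses pdf://{docId}/page/{pageNum} and validates the
// 1-based page number against the document's page count.
function parsePdfPageUri(uri: string, totalPages: number): PdfPageRef {
  const match = /^pdf:\/\/([a-f0-9]{16})\/page\/(\d+)$/.exec(uri);
  if (!match) {
    throw new Error(`Unknown URI: ${uri}`);
  }
  const page = Number.parseInt(match[2], 10);
  if (page < 1 || page > totalPages) {
    throw new Error(`Page ${page} out of range (1-${totalPages})`);
  }
  return { docId: match[1], page };
}

const ref = parsePdfPageUri('pdf://0123456789abcdef/page/3', 10);
```

Inside an MCP handler, the thrown errors would typically be rethrown as `McpError(ErrorCode.InvalidParams, ...)` so the client sees a protocol-level error.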

Health endpoint for PDF monitoring

// Minimal valid 1-page PDF for health probes (base64-encoded)
const PROBE_PDF_BASE64 = 'JVBERi0xLjAKMSAwIG9iago8PCAvVHlwZSAvQ2F0YWxvZyAvUGFnZXMgMiAwIFIgPj4KZW5kb2JqCjIgMCBvYmoKPDwgL1R5cGUgL1BhZ2VzIC9LaWRzIFszIDAgUl0gL0NvdW50IDEgPj4KZW5kb2JqCjMgMCBvYmoKPDwgL1R5cGUgL1BhZ2UgL1BhcmVudCAyIDAgUiAvTWVkaWFCb3ggWzAgMCA2MTIgNzkyXSA+PgplbmRvYmoKeHJlZgowIDQKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDAwMDA5IDAwMDAwIG4gCjAwMDAwMDAwNTggMDAwMDAgbiAKMDAwMDAwMDExNSAwMDAwMCBuIAp0cmFpbGVyCjw8IC9TaXplIDQgL1Jvb3QgMSAwIFIgPj4Kc3RhcnR4cmVmCjE5MAolJUVPRgo=';

// `http` is assumed to be a Fastify-style HTTP server instance created elsewhere.
http.get('/health/pdf', async (req, reply) => {
  const start = Date.now();
  try {
    const probePdf = Buffer.from(PROBE_PDF_BASE64, 'base64');
    const result = await pdfParse(probePdf);
    return reply.send({
      status: 'ok',
      latency_ms: Date.now() - start,
      probe_pages: result.numpages,
      pdf_parse_version: require('pdf-parse/package.json').version
    });
  } catch (err) {
    return reply.code(503).send({
      status: 'error',
      detail: err instanceof Error ? err.message : String(err),
      latency_ms: Date.now() - start
    });
  }
});

The probe uses a minimal hand-crafted PDF (a few hundred bytes once decoded) embedded in the source rather than a stored fixture file, so it cannot go missing from a deployment. Parse latency for this probe should be under 10 ms; anything over 100 ms indicates a performance regression worth alerting on.
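
The two thresholds can be turned into a small classifier that a monitor or the endpoint itself might apply to `latency_ms`. This is only a sketch: the function name and the intermediate `slow` band between 10 ms and 100 ms are assumptions added for illustration.

```typescript
type ProbeHealth = 'ok' | 'slow' | 'degraded';

// Thresholds from the text: under 10 ms is expected, over 100 ms is worth an alert.
// The 'slow' band in between is an assumption, not something the guide defines.
function classifyProbeLatency(latencyMs: number): ProbeHealth {
  if (latencyMs > 100) return 'degraded';
  if (latencyMs >= 10) return 'slow';
  return 'ok';
}
```

Alerting on `degraded` only, and merely logging `slow`, keeps a single cold-start parse from paging anyone.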

Silent failure modes

Failure	Symptom	Caught by process ping?	Detection
Encrypted PDF (password-protected)	pdf-parse throws or returns empty text	No	Detect empty text on non-empty PDF; return structured error
Scanned PDF (image-only, no text layer)	Extraction returns blank text for all pages	No — tool call "succeeds"	Detect when char_count is 0 for a multi-page PDF; advise OCR
Corrupted PDF structure	pdf-parse throws `Invalid PDF structure`	No	Catch and re-throw as McpError(InvalidParams)
OOM on large PDF	Process crash or OOM kill	After crash	Cap at 50 MB / 500 pages; monitor container memory
pdfjs-dist Worker not found	All pdfjs operations throw immediately	No	`/health/pdf` with a probe parse that goes through pdfjs-dist (a pdf-parse-only probe does not exercise it)
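
The size cap in the OOM row only helps if it runs before the allocation it is meant to prevent. The tools above decode the base64 string first and check the buffer length afterwards, so an oversized payload is still fully decoded. One option, sketched here, is to compute the decoded size from the string length. `decodedBase64Size` and `isWithinPdfSizeCap` are hypothetical helpers, and the calculation assumes standard padded base64 without embedded whitespace.

```typescript
const MAX_PDF_BYTES = 50 * 1024 * 1024; // 50 MB, the same cap the tools above use

// Computes the decoded byte length of a padded base64 string without decoding it.
// Every 4 base64 characters encode 3 bytes; each trailing '=' removes one byte.
function decodedBase64Size(b64: string): number {
  const padding = b64.endsWith('==') ? 2 : b64.endsWith('=') ? 1 : 0;
  return Math.floor((b64.length * 3) / 4) - padding;
}

// True when the payload can be decoded without exceeding the cap.
function isWithinPdfSizeCap(b64: string): boolean {
  return decodedBase64Size(b64) <= MAX_PDF_BYTES;
}
```

Calling `isWithinPdfSizeCap(pdf_base64)` before `Buffer.from(...)` lets the tool reject the request without allocating the decoded buffer. A `.max(...)` constraint on the zod string would enforce a similar limit at the schema level.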

Frequently asked questions

How do I handle scanned PDFs that have no text layer?

pdf-parse returns an empty or near-empty string for scanned PDFs: there's no text to extract, only rasterized page images. Detect this case by checking result.text.trim().length === 0 when result.numpages > 0. When detected, return a structured error telling the caller the PDF appears to be scanned and requires OCR. For OCR, you can render each page to an image with pdfjs-dist's page.render() and a Canvas implementation, then send the image to an OCR service (Google Document AI or AWS Textract) or run Tesseract.js for offline processing.
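
The check described above fits in a small pure function, which also makes it easy to unit-test without a real PDF. The status names and the shape of the error payload are assumptions for illustration. Note that, per the failure table, an encrypted PDF can produce the same empty-text symptom, so the message is worded as "likely".

```typescript
type ExtractionStatus = 'ok' | 'no_text_layer' | 'empty_document';

// Mirrors the check in the text: pages exist but no text was extracted.
function classifyExtraction(numPages: number, text: string): ExtractionStatus {
  if (numPages === 0) return 'empty_document';
  return text.trim().length === 0 ? 'no_text_layer' : 'ok';
}

// Hypothetical structured payload a tool could return instead of blank text.
function scannedPdfError(numPages: number) {
  return {
    error: 'no_text_layer',
    message: `PDF has ${numPages} page(s) but no extractable text; it is likely scanned and needs OCR.`,
    suggestion: 'Run OCR on rendered page images, then retry with the OCR text.'
  };
}
```

With pdf-parse, the inputs would be `result.numpages` and `result.text`.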

How do I extract tables from PDFs?

Pure-text extraction (pdf-parse or pdfjs-dist text content) loses table structure because PDF tables are drawn as positioned text fragments, not as actual table elements. To reconstruct tables, use pdfjs-dist's getTextContent(), which returns each text item with a transform matrix whose last two entries (transform[4] and transform[5]) give its X/Y position, along with its width and height. Cluster items by Y coordinate to identify rows, then by X coordinate to identify columns. For complex tables, a commercial API (AWS Textract, Azure Document Intelligence, Google Document AI) will produce better results than pure coordinate clustering. Return extracted tables as JSON arrays in the tool response.
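
A minimal sketch of the row-clustering step, assuming each item carries the `str` and `transform` fields that pdfjs-dist text items expose (`transform[4]` is X, `transform[5]` is Y). The function name and the 2-unit Y tolerance are assumptions; aligning cells to shared column positions across rows is left out.

```typescript
// Only the fields this sketch needs from a pdfjs-dist text item.
type PositionedText = { str: string; transform: number[] };

// Groups items into rows by Y (within yTolerance), then orders each row by X.
// PDF coordinates grow upward, so rows are visited in descending Y to read top to bottom.
function clusterIntoRows(items: PositionedText[], yTolerance = 2): string[][] {
  const rows: Array<{ y: number; cells: PositionedText[] }> = [];
  const sorted = [...items]
    .filter(item => item.str.trim().length > 0)
    .sort((a, b) => b.transform[5] - a.transform[5]);

  for (const item of sorted) {
    const y = item.transform[5];
    const row = rows.find(r => Math.abs(r.y - y) <= yTolerance);
    if (row) {
      row.cells.push(item);
    } else {
      rows.push({ y, cells: [item] });
    }
  }

  return rows.map(r =>
    r.cells
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map(c => c.str.trim())
  );
}

const table = clusterIntoRows([
  { str: 'Qty', transform: [1, 0, 0, 1, 200, 700] },
  { str: 'Item', transform: [1, 0, 0, 1, 50, 700.5] },
  { str: '2', transform: [1, 0, 0, 1, 200, 680] },
  { str: 'Widget', transform: [1, 0, 0, 1, 50, 680] }
]);
// table is [['Item', 'Qty'], ['Widget', '2']]
```

The tolerance absorbs small baseline differences within a row; it usually needs tuning per document, since it depends on font size and line spacing.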

Can I extract images from a PDF?

Yes, as full-page renders. With pdfjs-dist, page.render() draws a page onto a Canvas at a chosen resolution (e.g., 150 DPI for thumbnails, 300 DPI for accurate reproduction), and the canvas can then be exported as PNG or JPEG. This renders the whole page as a raster image, which is useful for previewing pages but not for extracting embedded images individually. Extracting individual embedded images (e.g., photos inside a PDF) requires working with the PDF's XObject resources directly, for instance by walking the output of getOperatorList(); pdf.js has experimental support for this but it's not stable. For most MCP tool use cases, rendering the full page as an image is sufficient.
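
Rendering APIs in pdf.js take a scale factor rather than a DPI value. Because PDF user-space units are 1/72 inch, the conversion is a single division; the sketch below also estimates the pixel size and RGBA memory cost of a rendered page, which matters for the memory limits discussed earlier. The helper names are illustrative.

```typescript
// PDF user-space units are 1/72 inch, so a scale of 1 corresponds to 72 DPI.
function dpiToScale(dpi: number): number {
  return dpi / 72;
}

// Pixel dimensions of a page (given in PDF points) rendered at the given DPI.
function renderedSize(widthPt: number, heightPt: number, dpi: number) {
  return {
    width: Math.round((widthPt * dpi) / 72),
    height: Math.round((heightPt * dpi) / 72)
  };
}

// Approximate memory for an uncompressed RGBA bitmap of that size (4 bytes per pixel).
function rgbaBytes(size: { width: number; height: number }): number {
  return size.width * size.height * 4;
}

const letterAt150 = renderedSize(612, 792, 150); // { width: 1275, height: 1650 }
const letterAt300 = renderedSize(612, 792, 300); // { width: 2550, height: 3300 }
```

A US Letter page at 300 DPI needs about 33.7 MB as raw RGBA, so rendering many pages concurrently can exhaust memory faster than text extraction does.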

What's the right way to handle a 500-page PDF without overloading the agent context?

Never return all 500 pages in a single tool call; at that size the response is likely to overflow the model's context window. Instead, build a two-tool workflow: a store_pdf tool that stores the PDF and returns its page count and a document ID, followed by a read_pdf_pages tool that accepts the document ID plus a start/end page range and returns text for just those pages. This way the agent can read the table of contents (usually pages 1–3), identify the relevant section, then read only those pages. Structure the chunk response with page numbers so the agent can cite sources.
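
The read_pdf_pages tool is not implemented in this guide. As a starting point, its range handling might look like the sketch below, which validates a 1-based inclusive range and clamps it to the document length and to a per-call page budget. `clampPageRange` and the default budget of 10 pages are assumptions.

```typescript
type PageRange = { start: number; end: number };

// Hypothetical helper for a read_pdf_pages tool: clamps a requested 1-based,
// inclusive range to the document and to a per-call page budget.
function clampPageRange(
  start: number,
  end: number,
  totalPages: number,
  maxPagesPerCall = 10
): PageRange {
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < 1 || end < start) {
    throw new RangeError(`Invalid page range: ${start}-${end}`);
  }
  if (start > totalPages) {
    throw new RangeError(`Start page ${start} exceeds document length (${totalPages})`);
  }
  const clampedEnd = Math.min(end, totalPages, start + maxPagesPerCall - 1);
  return { start, end: clampedEnd };
}
```

Returning the clamped range alongside the text tells the agent which pages it actually received, so it can request the next window if the range was cut short.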