pdf-parse-new

simonegosetto · 54.4k · MIT · 2.0.0

Pure JavaScript cross-platform module to extract text from PDFs with AI-powered optimization and multi-core processing.

pdf-parse, pdf-parser, pdf-extractor, pdf-to-text

pdf-parse-new

Pure JavaScript cross-platform module to extract text from PDFs with intelligent performance optimization.

Version 2.0.0 - Release with SmartPDFParser, multi-core processing, and AI-powered method selection based on 9,417 real-world benchmarks.

Features

🎯 New in Version 2.0.0

✨ SmartPDFParser with AI-Powered Selection

Automatically selects optimal parsing method based on PDF characteristics
CPU-aware thresholds that adapt to available hardware (4 to 48+ cores)
Fast-path optimization: 50x lower method-selection overhead for small PDFs (25ms → 0.5ms)
LRU caching: 25x lower method-selection overhead on repeated similar PDFs (25ms → 1ms)
90%+ of PDFs hit a fast-path and skip the full decision tree

⚡ Multi-Core Performance

Child Processes: True multi-processing, 2-4x faster for huge PDFs
Worker Threads: Alternative multi-threading with lower memory overhead
Oversaturation: Use 1.5x-2x cores for maximum CPU utilization (I/O-bound optimization)
Automatic memory safety limits

📊 Battle-Tested Intelligence

Decision tree trained on 9,417 real-world PDF benchmarks
Tested on documents from 1 to 10,000+ pages
CPU normalization: adapts thresholds from 4-core laptops to 48-core servers
Production-ready with comprehensive error handling

🚀 Multiple Parsing Strategies

Batch Processing: Parallel page processing (optimal for 0-1000 pages)
Child Processes: Multi-processing (default for 1000+ pages, most consistent)
Worker Threads: Multi-threading (alternative, can be faster on some PDFs)
Streaming: Memory-efficient chunking for constrained environments
Aggressive: Combines streaming with large batches
Sequential: Traditional fallback

🔧 Developer Experience

Drop-in replacement for pdf-parse (backward compatible)
7 practical examples in test/examples/
Full TypeScript definitions with autocomplete
Comprehensive benchmarking tools included
Zero configuration required (paths resolved automatically)

Installation

npm install pdf-parse-new

What's New in 2.0.0

🎯 Major Features

SmartPDFParser - Intelligent automatic method selection

CPU-aware decision tree (adapts to 4-48+ cores)
Fast-path optimization (0.5ms overhead vs 25ms)
LRU caching for repeated PDFs
90%+ optimization rate

Multi-Core Processing

Child processes (default, most consistent)
Worker threads (alternative, can be faster)
Oversaturation factor (1.5x cores = better CPU utilization)
Automatic memory safety

Performance Improvements

2-4x faster for huge PDFs (1000+ pages)
50x lower selection overhead for tiny PDFs (< 0.5 MB)
25x lower selection overhead on cache hits
CPU normalization for any hardware

Better DX

7 practical examples with npm scripts
Full TypeScript definitions
Comprehensive benchmarking tools
Clean repository structure

📦 Migration from 1.x

Version 2.0.0 is backward compatible. Your existing code will continue to work:

// v1.x code still works
const pdf = require('pdf-parse-new');
pdf(buffer).then(data => console.log(data.text));

To take advantage of new features:

// Use SmartPDFParser for automatic optimization
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();

// await needs an async context in CommonJS
(async () => {
    const result = await parser.parse(buffer);
    console.log(`Used ${result._meta.method} in ${result._meta.duration}ms`);
})();

Quick Start

Basic Usage

const fs = require('fs');
const pdf = require('pdf-parse-new');

const dataBuffer = fs.readFileSync('path/to/file.pdf');

pdf(dataBuffer).then(function(data) {
    console.log(data.numpages);  // Number of pages
    console.log(data.text);       // Full text content
    console.log(data.info);       // PDF metadata
});

📚 Examples

See test/examples/ for practical examples:

# Try the examples
npm run example:basic      # Basic parsing
npm run example:smart      # SmartPDFParser (recommended)
npm run example:compare    # Compare all methods

# Or run directly
node test/examples/01-basic-parse.js
node test/examples/06-smart-parser.js

7 complete examples covering all parsing methods with real-world patterns!

With Smart Parser (Recommended)

const fs = require('fs');
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');

const parser = new SmartParser();
const dataBuffer = fs.readFileSync('large-document.pdf');

parser.parse(dataBuffer).then(function(result) {
    console.log(`Parsed ${result.numpages} pages in ${result._meta.duration}ms`);
    console.log(`Method used: ${result._meta.method}`);
    console.log(result.text);
});

Exception Handling

pdf(dataBuffer)
    .then(data => {
        // Process data
    })
    .catch(error => {
        console.error('Error parsing PDF:', error);
    });
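The same pattern can be written with async/await. The sketch below is illustrative only: `safeParse` is a hypothetical helper, and the parse function is passed in as an argument (standing in for `pdf()` or `parser.parse()`) so the snippet runs without the library installed.

```javascript
// Hypothetical helper: turn a rejected parse into a plain result object
// so callers can branch on `ok` instead of attaching .catch() everywhere.
// `parseFn` stands in for pdf() or parser.parse().
async function safeParse(parseFn, buffer) {
    try {
        const data = await parseFn(buffer);
        return { ok: true, data };
    } catch (error) {
        return { ok: false, error };
    }
}

// Usage with a stub that always fails, to show the error branch
safeParse(async () => { throw new Error('Invalid PDF structure'); }, Buffer.alloc(0))
    .then(result => {
        if (!result.ok) {
            console.error('Error parsing PDF:', result.error.message);
        }
    });
```

Returning a result object keeps the error next to the call site, which can be convenient when parsing many files in a loop and one bad file should not abort the rest.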

Smart Parser

The SmartPDFParser automatically selects the optimal parsing method based on PDF characteristics.

Decision Tree

Based on 9,417 real-world benchmarks (trained 2025-11-23):

Pages	Method	Avg Time	Best For
1-10	batch-5	~10ms	Tiny documents
11-50	batch-10	~107ms	Small documents
51-200	batch-20	~332ms	Medium documents
201-500	batch-50	~1102ms	Large documents
501-1000	batch-50	~1988ms	X-Large documents
1000+	processes*	~2355-4468ms	Huge documents (2-4x faster!)

*Both workers and processes are excellent for huge PDFs. Processes is the default due to better consistency, but workers can be faster in some cases. Use forceMethod: 'workers' to try workers.
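As a rough illustration, the table above can be read as a simple range lookup. This is a minimal sketch, not the library's internal code: `PAGE_RULES` and `pickMethodByPages` are hypothetical names, and the real parser also takes file size and CPU count into account.

```javascript
// Page-count ranges copied from the decision table; names are illustrative.
const PAGE_RULES = [
    { maxPages: 10, method: 'batch-5' },
    { maxPages: 50, method: 'batch-10' },
    { maxPages: 200, method: 'batch-20' },
    { maxPages: 1000, method: 'batch-50' }
];

// Return the first rule whose upper bound covers the page count;
// anything above 1000 pages falls through to child processes.
function pickMethodByPages(pages) {
    const rule = PAGE_RULES.find(r => pages <= r.maxPages);
    return rule ? rule.method : 'processes';
}

console.log(pickMethodByPages(8));     // batch-5
console.log(pickMethodByPages(1500));  // processes
```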

Usage Options

Automatic (Recommended)

const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();

(async () => {
    // Automatically selects best method
    const result = await parser.parse(pdfBuffer);
})();

Force Specific Method

const parser = new SmartParser({
    forceMethod: 'workers'  // 'batch', 'workers', 'processes', 'stream', 'sequential'
});

// Example: Compare workers vs processes for your specific PDFs
const testWorkers = new SmartParser({ forceMethod: 'workers' });
const testProcesses = new SmartParser({ forceMethod: 'processes' });

(async () => {
    const result1 = await testWorkers.parse(hugePdfBuffer);
    console.log(`Workers: ${result1._meta.duration}ms`);

    const result2 = await testProcesses.parse(hugePdfBuffer);
    console.log(`Processes: ${result2._meta.duration}ms`);
})();

Memory Limit

const parser = new SmartParser({
    maxMemoryUsage: 2e9  // 2GB max
});

Oversaturation for Maximum Performance

PDF parsing is I/O-bound. During I/O waits, CPU cores sit idle. Oversaturation keeps them busy:

const parser = new SmartParser({
    oversaturationFactor: 1.5  // Use 1.5x more workers than cores
});

// Example on 24-core system:
// - Default (1.5x): 36 workers (instead of 23 under the old cores - 1 rule)
// - Aggressive (2x): 48 workers
// - Conservative (1x): 24 workers

Why this works:

PDF parsing involves lots of I/O (reading data, decompressing)
During I/O, CPU cores are idle
More workers = cores stay busy = better throughput

Automatic memory limiting:

Parser automatically limits workers if memory is constrained
Each worker needs ~2x PDF size in memory
Safe default balances speed and memory
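The interaction between oversaturation and the memory limit can be sketched as taking the smaller of two caps. This is an assumption-based illustration: `estimateWorkerCount` is a hypothetical helper, and it uses only the figures stated above (cores × factor, and roughly 2x the PDF size per worker).

```javascript
const os = require('os');

// Hypothetical helper: workers = cores x factor, capped by how many
// workers fit in free memory at ~2x the PDF size each.
function estimateWorkerCount({
    cores = os.cpus().length,
    factor = 1.5,
    pdfBytes,
    freeBytes = os.freemem()
}) {
    const byCpu = Math.max(1, Math.floor(cores * factor));
    const byMemory = Math.max(1, Math.floor(freeBytes / (2 * pdfBytes)));
    return Math.min(byCpu, byMemory);
}

// 24 cores, 10 MB PDF, 8 GB free: the CPU-based cap applies
console.log(estimateWorkerCount({ cores: 24, factor: 1.5, pdfBytes: 10e6, freeBytes: 8e9 })); // 36

// Same machine, 500 MB PDF, 4 GB free: memory caps the count
console.log(estimateWorkerCount({ cores: 24, factor: 2, pdfBytes: 500e6, freeBytes: 4e9 }));  // 4
```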

Get Statistics

const stats = parser.getStats();
console.log(stats);
// {
//   totalParses: 10,
//   methodUsage: { batch: 8, workers: 2 },
//   averageTimes: { batch: 150.5, workers: 2300.1 },
//   failedParses: 0
// }
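For readers who want similar bookkeeping in their own code, here is a minimal sketch of a tracker that produces the same shape as the sample output. `ParseStats` is a hypothetical class written for illustration; it is not how the library implements `getStats()`.

```javascript
// Hypothetical tracker: counts parses per method and keeps running totals
// so per-method averages can be derived on demand.
class ParseStats {
    constructor() {
        this.totalParses = 0;
        this.failedParses = 0;
        this.methodUsage = {};
        this.totalTimes = {};
    }

    record(method, durationMs, failed = false) {
        this.totalParses += 1;
        if (failed) {
            this.failedParses += 1;
            return;
        }
        this.methodUsage[method] = (this.methodUsage[method] || 0) + 1;
        this.totalTimes[method] = (this.totalTimes[method] || 0) + durationMs;
    }

    snapshot() {
        const averageTimes = {};
        for (const [method, total] of Object.entries(this.totalTimes)) {
            averageTimes[method] = total / this.methodUsage[method];
        }
        return {
            totalParses: this.totalParses,
            methodUsage: { ...this.methodUsage },
            averageTimes,
            failedParses: this.failedParses
        };
    }
}

const tracker = new ParseStats();
tracker.record('batch', 100);
tracker.record('batch', 200);
tracker.record('workers', 2300);
console.log(tracker.snapshot());
```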

CPU-Aware Intelligence

SmartPDFParser automatically adapts to your CPU:

// On 4-core laptop
parser.parse(pdf500Pages);
// → Uses a multi-core method (processes by default; threshold: ~167 pages)

// On 48-core server
parser.parse(pdf500Pages);
// → Uses batch (threshold: ~2000 pages, multi-core overhead not worth it yet)

This keeps method selection appropriate across different hardware. The decision tree was trained on multiple machines with different core counts.
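The two thresholds quoted above (~167 pages on 4 cores, ~2000 pages on 48 cores) are consistent with scaling a 1000-page cutoff linearly from a 24-core reference machine. That relationship is inferred from the numbers, not confirmed by the library source; the sketch below uses hypothetical names to show the arithmetic.

```javascript
// Assumption: thresholds scale linearly from a 1000-page cutoff on a
// 24-core reference machine. Inferred from the quoted figures only.
const REFERENCE_CORES = 24;
const REFERENCE_THRESHOLD_PAGES = 1000;

function scaledThreshold(cores) {
    return Math.round(REFERENCE_THRESHOLD_PAGES * cores / REFERENCE_CORES);
}

console.log(scaledThreshold(4));   // 167
console.log(scaledThreshold(24));  // 1000
console.log(scaledThreshold(48));  // 2000
```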

Fast-Path Optimization

SmartPDFParser uses intelligent fast-paths to minimize overhead:

const parser = new SmartParser();

// Tiny PDF (< 0.5 MB)
await parser.parse(tiny_pdf);
// ⚡ Fast-path: ~0.5ms overhead (50x faster than tree navigation!)

// Small PDF (< 1 MB)
await parser.parse(small_pdf);
// ⚡ Fast-path: ~0.5ms overhead

// Medium PDF (already seen similar)
await parser.parse(medium_pdf);
// 💾 Cache hit: ~1ms overhead

// Common scenario (500 pages, 5MB)
await parser.parse(common_pdf);
// 📋 Common scenario: ~2ms overhead

// Rare case (unusual size/page ratio)
await parser.parse(unusual_pdf);
// 🌳 Full tree: ~25ms overhead (only for edge cases)

Overhead Comparison:

PDF Type	Before	After	Speedup
Tiny (< 0.5 MB)	25ms	0.5ms	50x faster ⚡
Small (< 1 MB)	25ms	0.5ms	50x faster ⚡
Cached	25ms	1ms	25x faster 💾
Common	25ms	2ms	12x faster 📋
Rare	25ms	25ms	Same 🌳

90%+ of PDFs hit a fast-path! This means minimal overhead even for tiny documents.
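The tiers above can be pictured as a chain of progressively more expensive checks. The following is a simplified sketch under stated assumptions: `LruCache`, `cacheKey`, and `selectMethod` are hypothetical names, the cache key (rounded size plus core count) is invented for illustration, the two size-based fast paths are merged into one check, and the "common scenario" tier is omitted.

```javascript
// Minimal LRU cache built on Map's insertion order.
class LruCache {
    constructor(limit) {
        this.limit = limit;
        this.map = new Map();
    }
    get(key) {
        if (!this.map.has(key)) return undefined;
        const value = this.map.get(key);
        this.map.delete(key);        // re-insert to mark as most recently used
        this.map.set(key, value);
        return value;
    }
    set(key, value) {
        this.map.delete(key);
        this.map.set(key, value);
        if (this.map.size > this.limit) {
            this.map.delete(this.map.keys().next().value); // evict oldest entry
        }
    }
}

const MB = 1024 * 1024;
const decisionCache = new LruCache(100);

// Hypothetical key: bucket by rounded size so "similar" PDFs share an entry.
function cacheKey(sizeBytes, cores) {
    return `${Math.round(sizeBytes / MB)}mb-${cores}c`;
}

// `fullDecision` stands in for the expensive decision-tree walk.
function selectMethod(sizeBytes, cores, fullDecision) {
    if (sizeBytes < 1 * MB) {
        return { method: 'batch-5', path: 'fast-path' };
    }
    const key = cacheKey(sizeBytes, cores);
    const cached = decisionCache.get(key);
    if (cached) {
        return { method: cached, path: 'cache' };
    }
    const method = fullDecision(sizeBytes, cores);
    decisionCache.set(key, method);
    return { method, path: 'tree' };
}

const tree = () => 'batch-50';
console.log(selectMethod(100 * 1024, 8, tree).path); // fast-path
console.log(selectMethod(5 * MB, 8, tree).path);     // tree
console.log(selectMethod(5 * MB, 8, tree).path);     // cache
```

Ordering the checks from cheapest to most expensive is what keeps the overhead low for the majority of inputs: the tree walk only runs when neither the size check nor the cache can answer.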

API Reference

pdf(dataBuffer, options)

Parse a PDF file and extract text content.

Parameters:

dataBuffer (Buffer): PDF file buffer
options (Object, optional):
- pagerender (Function): Custom page rendering function
- max (Number): Maximum number of pages to parse
- version (String): PDF.js version to use

Returns: Promise<Object>

numpages (Number): Total number of pages
numrender (Number): Number of rendered pages
info (Object): PDF document information (e.g. Title, Author, Producer)
metadata (Object): PDF metadata object (XMP), if present
text (String): Extracted text content
version (String): PDF.js version used

SmartPDFParser

constructor(options)

Options:

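forceMethod (String): Force a specific parsing method ('batch', 'workers', 'processes', 'stream', 'sequential')
maxMemoryUsage (Number): Maximum memory usage in bytes
availableCPUs (Number): Override CPU count detection
oversaturationFactor (Number): Workers per core for parallel methods (default: 1.5)
enableFastPath (Boolean): Enable fast-path selection for small PDFs (default: true)
enableCache (Boolean): Enable caching of decisions for similar PDFs (default: true)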

parse(dataBuffer, userOptions)

Parse PDF with automatic method selection.

Returns: Promise<Object> (same as pdf() with additional _meta field)

_meta.method (String): Parsing method used
_meta.duration (Number): Parse time in milliseconds
_meta.analysis (Object): PDF analysis data
_meta.fastPath (Boolean): Whether a fast-path was used for method selection

getStats()

Get parsing statistics for current session.

TypeScript and NestJS Support

This library includes full TypeScript definitions and works seamlessly with NestJS.

⚠️ Important: Correct TypeScript Import

// ✅ CORRECT: Use namespace import
import * as PdfParse from 'pdf-parse-new';

// Create parser instance
const parser = new PdfParse.SmartPDFParser({
  oversaturationFactor: 2.0,
  enableFastPath: true
});

// Parse PDF
const result = await parser.parse(pdfBuffer);
console.log(`Parsed ${result.numpages} pages using ${result._meta.method}`);

// ❌ WRONG: This will NOT work
import PdfParse from 'pdf-parse-new'; // Error: SmartPDFParser is not a constructor
import { SmartPDFParser } from 'pdf-parse-new'; // Error: No named export

NestJS Service Example

import { Injectable } from '@nestjs/common';
import * as PdfParse from 'pdf-parse-new';
import * as fs from 'fs';

@Injectable()
export class PdfService {
  private parser: PdfParse.SmartPDFParser;

  constructor() {
    // Initialize parser with custom options
    this.parser = new PdfParse.SmartPDFParser({
      oversaturationFactor: 2.0,
      enableFastPath: true,
      enableCache: true,
      maxWorkerLimit: 50
    });
  }

  async parsePdf(filePath: string): Promise<string> {
    // Read asynchronously so the request thread is not blocked
    const dataBuffer = await fs.promises.readFile(filePath);
    const result = await this.parser.parse(dataBuffer);

    console.log(`Pages: ${result.numpages}`);
    console.log(`Method: ${result._meta?.method}`);
    console.log(`Duration: ${result._meta?.duration?.toFixed(2)}ms`);

    return result.text;
  }

  getParserStats() {
    return this.parser.getStats();
  }
}

NestJS Controller with File Upload

import { Controller, Post, UploadedFile, UseInterceptors } from '@nestjs/common';
import { FileInterceptor } from '@nestjs/platform-express';
import * as PdfParse from 'pdf-parse-new';

@Controller('pdf')
export class PdfController {
  private parser = new PdfParse.SmartPDFParser({ oversaturationFactor: 2.0 });

  @Post('upload')
  @UseInterceptors(FileInterceptor('file'))
  async uploadPdf(@UploadedFile() file: Express.Multer.File) {
    const result = await this.parser.parse(file.buffer);

    return {
      pages: result.numpages,
      text: result.text,
      metadata: result.info,
      parsingInfo: {
        method: result._meta?.method,
        duration: result._meta?.duration,
        fastPath: result._meta?.fastPath || false
      }
    };
  }
}

Alternative Import Methods

// Method 1: Namespace import (recommended)
import * as PdfParse from 'pdf-parse-new';
const parser = new PdfParse.SmartPDFParser();

// Method 2: CommonJS require
const PdfParse = require('pdf-parse-new');
const parser = new PdfParse.SmartPDFParser();

// Method 3: Direct module import
import SmartPDFParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartPDFParser();

TypeScript Type Definitions

All types are fully documented and available:

import * as PdfParse from 'pdf-parse-new';

// Use types from the namespace
type Result = PdfParse.Result;
type Options = PdfParse.Options;
type SmartParserOptions = PdfParse.SmartParserOptions;

📖 For more detailed examples and troubleshooting, see NESTJS_USAGE.md

Performance Optimization

Performance Comparison

For a 1500-page PDF:

Method	Time (estimate)	Speed vs Batch	Notes
Workers	~2.4-7s	2-7x faster ✨	Faster startup, can vary by PDF
Processes	~4.2-4.5s	3-4x faster	More consistent, better isolation
Batch	~17.6s	baseline	Good up to 1000 pages
Sequential	~17.8s	0.99x	Fallback only

Note: Performance varies by PDF complexity, size, and system. Both workers and processes provide significant speedup - test both on your specific PDFs to find the best option.

Best Practices

Use SmartParser for large documents (100+ pages)
Batch processing is optimal for most use cases (0-1000 pages)
Both Processes and Workers excel at huge PDFs (1000+ pages)
- Processes (default): More consistent, better memory isolation, 2-4x faster than batch
- Workers: Can be faster on some PDFs, use forceMethod: 'workers' to test
Avoid sequential unless you have a specific reason
Monitor memory for PDFs over 500 pages

When to Use Each Method

Batch (default for most cases)

PDFs up to 1000 pages
Balanced speed and memory usage
Best all-around performance

Workers (alternative for huge PDFs)

PDFs over 1000 pages
Multi-core systems
When speed is critical (can beat processes on some PDFs, but timings vary more)
Note: Memory usage grows with PDF size × concurrent workers
For very large PDFs, limit maxWorkers to 2-4 to avoid memory issues

Processes (SmartParser default for huge PDFs)

Similar to workers but uses child processes
Better isolation and more consistent timings

Stream (memory constrained)

Very limited memory environments
When you need to process PDFs larger than available RAM

Sequential (fallback)

Single-core systems
When parallel processing causes issues
Debugging purposes

Benchmarking

The library includes comprehensive benchmarking tools for optimization.

Directory Structure

benchmark/
├── collect-benchmarks.js        # Collect performance data
├── train-smart-parser.js        # Train decision tree
├── test-pdfs.example.json       # Example PDF list
└── test-pdfs.json              # Your PDFs (gitignored)

Running Benchmarks

Setup test PDFs:

cp benchmark/test-pdfs.example.json benchmark/test-pdfs.json
# Edit test-pdfs.json with your PDF URLs/paths

Collect benchmark data:

node benchmark/collect-benchmarks.js

Features:

Tests all parsing methods on each PDF
Supports local files and remote URLs
Saves incrementally (no data loss on interruption)
Generates detailed performance reports

Train decision tree (library developers only):

node benchmark/train-smart-parser.js

Analyzes collected benchmarks and generates optimized parsing rules.

Example test-pdfs.json

{
  "note": "Add your PDF URLs or file paths here",
  "urls": [
    "./test/data/sample.pdf",
    "https://example.com/document.pdf",
    "/absolute/path/to/file.pdf"
  ]
}
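Since the list mixes remote URLs with relative and absolute file paths, a collector has to tell them apart before loading. A minimal sketch of that split, using a hypothetical `classifyEntries` helper (the benchmark script's actual logic is not shown in this document):

```javascript
// Hypothetical helper: entries starting with http:// or https:// are
// treated as remote; everything else is treated as a local path.
function classifyEntries(entries) {
    const remote = [];
    const local = [];
    for (const entry of entries) {
        (/^https?:\/\//i.test(entry) ? remote : local).push(entry);
    }
    return { remote, local };
}

const sample = {
    urls: [
        './test/data/sample.pdf',
        'https://example.com/document.pdf',
        '/absolute/path/to/file.pdf'
    ]
};

console.log(classifyEntries(sample.urls));
```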

Troubleshooting

Common Issues

Out of Memory

const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');

// Limit memory usage
const limitedParser = new SmartParser({ maxMemoryUsage: 2e9 });

// Or use streaming
const streamParser = new SmartParser({ forceMethod: 'stream' });

Slow Parsing

// For large PDFs, force workers
const parser = new SmartParser({ forceMethod: 'workers' });

Corrupted/Invalid PDFs

// More aggressive parsing
const pdf = require('pdf-parse-new/lib/pdf-parse-aggressive');
pdf(dataBuffer).then(data => console.log(data.text));

Debug Mode

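// Enable verbose logging. The debug package reads DEBUG when it is first
// loaded, so set this before requiring pdf-parse-new (or export DEBUG in the shell).
process.env.DEBUG = 'pdf-parse:*';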

Get Help

📝 Open an issue
💬 Check existing issues for solutions
📊 Include benchmark data when reporting performance issues

NPM Module Compatibility

This library is designed to work correctly when installed as an npm module.

Path Resolution

All internal paths use proper resolution:

✅ Worker threads: path.join(__dirname, 'pdf-worker.js')
✅ Child processes: path.join(__dirname, 'pdf-child.js')
✅ PDF.js: require('./pdf.js/v4.5.136/build/pdf.js')

This ensures the library works correctly:

When installed via npm install
In node_modules/ directory
Regardless of working directory
With or without symlinks

No Configuration Required

The library automatically resolves all internal paths - you don't need to configure anything!

Advanced Usage

Custom Page Renderer

function customPageRenderer(pageData) {
    const renderOptions = {
        normalizeWhitespace: true,
        disableCombineTextItems: false
    };

    return pageData.getTextContent(renderOptions).then(textContent => {
        let text = '';
        for (let item of textContent.items) {
            text += item.str + ' ';
        }
        return text;
    });
}

const options = { pagerender: customPageRenderer };
pdf(dataBuffer, options).then(data => console.log(data.text));
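The renderer above joins every item with a space, which loses line breaks. One common refinement is to start a new line whenever the vertical position changes; in PDF.js text items the y coordinate is `transform[5]`. The sketch below isolates that logic in a pure function (`itemsToLines` is a hypothetical name) and runs it on mock items so it can be tried without a PDF.

```javascript
// Group text items into lines: a change in the y coordinate
// (transform[5] in PDF.js text items) starts a new line.
function itemsToLines(items) {
    const lines = [];
    let lastY = null;
    for (const item of items) {
        const y = item.transform[5];
        if (lastY === null || y !== lastY) {
            lines.push(item.str);
        } else {
            lines[lines.length - 1] += item.str;
        }
        lastY = y;
    }
    return lines.join('\n');
}

// Mock items shaped like PDF.js text content items
const mockItems = [
    { str: 'Hello ', transform: [1, 0, 0, 1, 10, 700] },
    { str: 'world', transform: [1, 0, 0, 1, 50, 700] },
    { str: 'Second line', transform: [1, 0, 0, 1, 10, 680] }
];

console.log(itemsToLines(mockItems));
// Hello world
// Second line
```

Inside a custom `pagerender` function, this would be called as `itemsToLines(textContent.items)` in place of the concatenation loop.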

Limit Pages

// Parse only first 10 pages
pdf(dataBuffer, { max: 10 }).then(data => {
    console.log(`Parsed ${data.numrender} of ${data.numpages} pages`);
});

Parallel Processing

const PDFProcess = require('pdf-parse-new/lib/pdf-parse-processes');

PDFProcess(dataBuffer, {
    maxProcesses: 4,  // Use 4 parallel processes
    batchSize: 10     // Process 10 pages per batch
}).then(data => console.log(data.text));
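To make the two options concrete, here is one way page batches could be planned and spread across processes. This is an illustrative sketch only: `planBatches` and `assignRoundRobin` are hypothetical helpers, and the library's actual scheduling strategy is not described in this document.

```javascript
// Split pages 1..numPages into consecutive ranges of at most batchSize pages.
function planBatches(numPages, batchSize) {
    const batches = [];
    for (let start = 1; start <= numPages; start += batchSize) {
        batches.push({ start, end: Math.min(start + batchSize - 1, numPages) });
    }
    return batches;
}

// Deal batches out to processes in turn so each gets a similar share.
function assignRoundRobin(batches, maxProcesses) {
    const queues = Array.from({ length: maxProcesses }, () => []);
    batches.forEach((batch, i) => queues[i % maxProcesses].push(batch));
    return queues;
}

const batches = planBatches(95, 10);           // 10 batches, last one is pages 91-95
const queues = assignRoundRobin(batches, 4);
console.log(queues.map(q => q.length));        // [ 3, 3, 2, 2 ]
```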

Why pdf-parse-new?

vs. Original pdf-parse

Feature	pdf-parse	pdf-parse-new 2.0
Speed (huge PDFs)	Baseline	2-4x faster ⚡
Smart optimization	❌	✅ AI-powered
Multi-core support	❌	✅ Workers + Processes
CPU adaptation	❌	✅ 4-48+ cores
Fast-path	❌	✅ 50x faster overhead
Caching	❌	✅ LRU cache
TypeScript	Partial	✅ Complete
Examples	Basic	✅ 7 production-ready
Benchmarking	❌	✅ Tools included
Maintenance	Slow	✅ Active

vs. Other PDF Libraries

✅ Pure JavaScript (no native dependencies, no compilation)
✅ Cross-platform (Windows, Mac, Linux - same code)
✅ Zero configuration (paths auto-resolved, npm-safe)
✅ No memory leaks (proper cleanup, GC-friendly)
✅ Production-ready (comprehensive error handling)
✅ Well-tested (9,417 benchmark samples)
✅ Modern (async/await, Promises, ES6+)

Real-World Performance

9,924-page PDF (13.77 MB) on 24-core system:

Sequential: ~15,000ms
Batch-50:   ~11,723ms
Processes:   ~4,468ms  ✅ (2.6x faster than batch)
Workers:     ~6,963ms  ✅ (1.7x faster than batch)

SmartParser: Automatically chooses Processes ⚡
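The speedup figures are plain ratios of the timings listed above. A short sketch that reproduces them (the `speedup` helper is illustrative):

```javascript
// Timings from the 9,924-page benchmark above, in milliseconds
const timingsMs = { sequential: 15000, batch: 11723, processes: 4468, workers: 6963 };

// Speedup = baseline time / candidate time
function speedup(baselineMs, candidateMs) {
    return baselineMs / candidateMs;
}

console.log(speedup(timingsMs.batch, timingsMs.processes).toFixed(2));      // 2.62
console.log(speedup(timingsMs.batch, timingsMs.workers).toFixed(2));        // 1.68
console.log(speedup(timingsMs.sequential, timingsMs.processes).toFixed(2)); // 3.36
```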

100 KB PDF on any system:

Overhead:
- Without fast-path: 25ms
- With fast-path:    0.5ms ✅ (50x faster)

Contributing

Contributions are welcome! Please read our contributing guidelines.

Development Setup

git clone https://github.com/simonegosetto/pdf-parse-new.git
cd pdf-parse-new
npm install
npm test

Running Tests

npm test                    # Run all tests
npm run test:smart         # Test smart parser
npm run benchmark          # Run benchmarks

License

MIT License - see LICENSE file for details.

Credits

Based on pdf-parse by autokent
Powered by PDF.js v4.5.136 by Mozilla
Performance optimization and v2.0 development by Simone Gosetto

Changelog

Version 2.0.0 (2025-11-23)

Major Features:

✨ SmartPDFParser with AI-powered method selection
⚡ Multi-core processing (workers + processes)
🚀 Oversaturation for maximum CPU utilization
⚡ Fast-path optimization (50x faster overhead)
💾 LRU caching (25x faster on cache hits)
🎯 CPU-aware thresholds (4-48+ cores)
📊 Decision tree trained on 9,417 benchmarks
🔧 7 production-ready examples
📝 Complete TypeScript definitions
🧪 Comprehensive benchmarking tools

Performance:

2-4x faster for huge PDFs (1000+ pages)
50x lower selection overhead for tiny PDFs
25x lower selection overhead on repeated similar PDFs
90%+ of PDFs hit a fast-path

Breaking Changes:

None - fully backward compatible with 1.x

See CHANGELOG for complete version history.

Made with ❤️ for the JavaScript community

npm: pdf-parse-new
Repository: GitHub
Issues: Report bugs

Changelog

All notable changes to pdf-parse-new will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[2.0.0] - 2025-11-23

🎉 Major Release - Complete Rewrite with AI-Powered Optimization

This is a major release that introduces intelligent automatic method selection, multi-core processing, and comprehensive performance optimizations while maintaining 100% backward compatibility.

✨ Added

SmartPDFParser

Intelligent method selection based on PDF characteristics and system resources
CPU-aware thresholds that adapt from 4-core laptops to 48-core servers
Fast-path optimization: 50x faster overhead for small PDFs (25ms → 0.5ms)
LRU caching: 25x faster on repeated similar PDFs (cache hit in ~1ms)
Common scenario matching: 90%+ hit rate for typical PDFs
Decision tree trained on 9,417 real-world benchmark samples
Statistics tracking: method usage, cache hits, optimization rates

Multi-Core Processing

Child Processes (pdf-parse-processes.js): True multi-processing for maximum performance
Worker Threads (pdf-parse-workers.js): Alternative multi-threading with lower overhead
Oversaturation factor: Use 1.5x-2x cores for better CPU utilization (I/O-bound optimization)
Automatic memory limiting: Prevents OOM by monitoring available RAM
Progress callbacks: Real-time progress tracking for long-running tasks

Performance Optimizations

Fast-path for tiny PDFs (< 0.5 MB): Instant decision, no tree navigation
Fast-path for small PDFs (< 1 MB): Immediate batch-5 selection
Cache for similar PDFs: Second parse of similar PDF takes ~1ms
CPU normalization: Thresholds scale with available cores
Memory-safe: Automatic worker limiting based on available RAM

Developer Experience

7 production-ready examples in test/examples/:
- 01-basic-parse.js - Basic usage
- 02-batch-parse.js - Batch optimization
- 03-stream-parse.js - Memory-efficient streaming
- 04-workers-parse.js - Worker threads
- 05-processes-parse.js - Child processes
- 06-smart-parser.js - SmartPDFParser (recommended)
- 07-compare-all.js - Compare all methods
npm scripts for quick example execution (npm run example:smart)
Complete TypeScript definitions with all new features
Comprehensive benchmarking tools in benchmark/
Detailed documentation with real-world performance data

Infrastructure

CPU-aware benchmarking: Tools for collecting data across different CPUs
Training pipeline: Re-train decision tree from benchmark data
Incremental saving: No data loss during long benchmark runs
URL support: Benchmark remote PDFs via HTTP/HTTPS

🚀 Improved

Performance

2-4x faster for huge PDFs (1000+ pages) using processes/workers
50x faster overhead for tiny PDFs (< 0.5 MB) via fast-path
25x faster on cache hits for repeated similar PDFs
Better CPU utilization via oversaturation (1.5x cores)
Reduced memory usage with automatic worker limiting

API

Backward compatible: All v1.x code continues to work
New _meta field in results with method, duration, analysis
Progress callbacks for all parallel methods
Timeout support for child processes
Resource limits for worker threads

Code Quality

Organized structure: Examples in test/examples/, benchmarks in benchmark/
Clean root: No more scattered test files
TypeScript coverage: 100% of public API
Error handling: Comprehensive error messages with troubleshooting hints
Path resolution: NPM-safe, works in node_modules/

🔧 Changed

Default Behavior

SmartPDFParser now uses processes as default for huge PDFs (more consistent than workers)
Oversaturation factor default is 1.5x (was 1.0x, i.e., cores - 1)
Fast-path enabled by default (can disable with enableFastPath: false)
Caching enabled by default (can disable with enableCache: false)

Benchmarking

Moved all benchmark tools to benchmark/ directory
Private URLs/paths now in benchmark/test-pdfs.json (gitignored)
Template provided in benchmark/test-pdfs.example.json
Removed redundant intensive-benchmarks.json file

🗑️ Removed

Deprecated Files

Removed QUICKSTART.js (replaced by 7 focused examples)
Removed scattered test files from root (consolidated in test/examples/)
Removed redundant markdown files (consolidated in main README.md)
Removed intensive-benchmarks.json (kept only smart-parser-benchmarks.json)

📝 Documentation

New Documentation

Complete README.md: All features, examples, benchmarks
test/examples/README.md: Guide to all 7 examples
benchmark/README.md: Benchmarking guide
benchmark/CPU_BENCHMARKING_GUIDE.md: Multi-CPU testing guide
TypeScript definitions: Complete with JSDoc comments

Updated Documentation

Added "What's New in 2.0.0" section
Added migration guide from 1.x
Added real-world performance data
Added comparison table with original pdf-parse
Added troubleshooting section
Added oversaturation explanation

🐛 Fixed

Workers/Processes

Fixed worker exit code 1 error (Buffer serialization issue)
Fixed memory exhaustion on large PDFs (added safety limits)
Fixed path resolution for npm module installation
Fixed double processing on errors (added completion flags)
Fixed memory calculation for worker limiting

SmartPDFParser

Fixed hardcoded method selection (now respects benchmark data)
Fixed missing cpuCores in analysis
Fixed cache key generation
Fixed stats initialization

🔒 Security

No known security vulnerabilities
All dependencies updated to latest secure versions
Proper cleanup of worker threads and child processes
Memory limits prevent DoS via large PDFs

📊 Performance Data

Benchmark Results (9,924 pages, 13.77 MB, 24 cores)

Method          Time      vs Sequential  vs Batch
─────────────────────────────────────────────────
Sequential      ~15,000ms  1.00x         1.28x slower
Batch-50        ~11,723ms  1.28x faster  1.00x
Workers          ~6,963ms  2.15x faster  1.68x faster
Processes        ~4,468ms  3.36x faster  2.62x faster ⚡

SmartParser: Automatically selects Processes

Overhead Comparison

PDF Type         Before    After     Speedup
───────────────────────────────────────────────
Tiny (< 0.5 MB)  25ms      0.5ms     50x faster
Small (< 1 MB)   25ms      0.5ms     50x faster
Cached           25ms      1ms       25x faster
Common           25ms      2ms       12x faster
Rare             25ms      25ms      Same

⚠️ Breaking Changes

None - Version 2.0.0 is fully backward compatible with 1.x.

All existing code continues to work without modifications. New features are opt-in via SmartPDFParser.

🔄 Migration Guide

From 1.x to 2.0.0

No changes required - your code will continue to work:

// v1.x code (still works in v2.0.0)
const pdf = require('pdf-parse-new');
pdf(buffer).then(data => console.log(data.text));

To use new features:

// Use SmartPDFParser for automatic optimization
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');
const parser = new SmartParser();

(async () => {
    const result = await parser.parse(buffer);

    console.log(`Method: ${result._meta.method}`);
    console.log(`Duration: ${result._meta.duration}ms`);
    console.log(`Fast-path: ${result._meta.fastPath}`);
})();

To force specific method:

// Force processes for huge PDFs
const processParser = new SmartParser({ forceMethod: 'processes' });

// Force workers (alternative)
const workerParser = new SmartParser({ forceMethod: 'workers' });

// Adjust oversaturation
const saturatedParser = new SmartParser({ oversaturationFactor: 2.0 });

🙏 Contributors

Simone Gosetto - Lead developer, v2.0 implementation
autokent - Original pdf-parse library
Mozilla - PDF.js library

📦 Dependencies

debug: ^4.3.4
node-ensure: ^0.0.0
PDF.js: v4.5.136 (bundled)

No breaking dependency changes.

[1.x] - Previous Versions

For changelog of versions prior to 2.0.0, see the original pdf-parse changelog.

[Unreleased]: https://github.com/simonegosetto/pdf-parse-new/compare/v2.0.0...HEAD
[2.0.0]: https://github.com/simonegosetto/pdf-parse-new/releases/tag/v2.0.0
