A tool that converts regular PDFs into searchable PDF/A files.

Post author: @the_osps

OCRmyPDF: Turn Scanned PDFs into Searchable Documents

The Problem with Scanned PDFs

Ever tried to Ctrl+F through a scanned PDF only to realize it’s just an image? Or struggled to extract text from a document that’s technically digital but functionally a pile of pixels?

Enter OCRmyPDF, an open-source tool that slaps an OCR (Optical Character Recognition) layer onto your PDFs, making them searchable, copy-pasteable, and generally less frustrating. With 29.7k GitHub stars and a robust feature set, it’s a Swiss Army knife for PDF post-processing.


What It Does

OCRmyPDF takes scanned PDFs (or image-based PDFs) and:

  • Adds a hidden text layer using Tesseract OCR, preserving the original layout.
  • Outputs standards-compliant PDF/A files by default (great for archiving).
  • Optionally deskews, cleans up images, or rotates pages, because crooked scans happen.
  • Supports 100+ languages through Tesseract’s language packs (mix-and-match if needed).

It’s a command-line tool at heart, but it’s also scriptable for bulk processing.
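
Because everything goes through the command line, bulk processing can be a plain loop that calls the tool once per file. Here is a minimal Python sketch, assuming ocrmypdf is on your PATH. The helper names (build_command, plan_jobs, run_batch) are made up for illustration, and the only flags used are the -l and --deskew ones shown in this post.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, languages=("eng",)):
    """Build the ocrmypdf command line for a single file."""
    # Tesseract-style language lists are joined with "+", e.g. "eng+deu".
    return ["ocrmypdf", "-l", "+".join(languages), "--deskew", str(src), str(dst)]


def plan_jobs(in_dir, out_dir):
    """Pair every PDF in in_dir with an output path of the same name in out_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    return [(pdf, out_dir / pdf.name) for pdf in sorted(in_dir.glob("*.pdf"))]


def run_batch(in_dir, out_dir, languages=("eng",)):
    """Run ocrmypdf on each PDF and return the inputs that failed."""
    jobs = plan_jobs(in_dir, out_dir)
    if jobs:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for src, dst in jobs:
        result = subprocess.run(build_command(src, dst, languages))
        if result.returncode != 0:  # non-zero exit: ocrmypdf reported a problem
            failed.append(src)
    return failed
```

Calling run_batch("scans", "searchable") with two hypothetical folder names would OCR every PDF in the first folder into the second and hand back the ones that did not convert, so a retry pass only has to look at those.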


Why It’s Cool

  1. Lossless(ish) Workflow: Unlike some OCR tools that recompress images aggressively, OCRmyPDF tries to keep the original resolution intact while adding text invisibly underneath.
  2. Parallel Processing: Uses all your CPU cores by default, because nobody likes waiting for OCR.
  3. PDF/A by Default: PDF/A is an ISO-standardized archival format, so the output stays readable long-term with no proprietary lock-in.
  4. Battle-Tested: Claims to handle "millions of PDFs," so your 500-page manual won’t break it.
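
If grabbing every core is a problem on a shared machine, the worker count can be capped with the CLI’s --jobs option (-j for short). The sketch below only computes a value that leaves one core free and builds the arguments. The jobs_args helper and the "keep one core free" policy are illustrative assumptions, not something the project prescribes.

```python
import os


def jobs_args(total_cores=None, keep_free=1):
    """Return ["--jobs", N], with N chosen so keep_free cores stay idle."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1  # cpu_count() can return None
    workers = max(1, total_cores - keep_free)  # never drop below one worker
    return ["--jobs", str(workers)]


command = ["ocrmypdf", *jobs_args(total_cores=8), "input.pdf", "output.pdf"]
print(" ".join(command))  # ocrmypdf --jobs 7 input.pdf output.pdf
```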

How to Try It

  1. Install:
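    pip install ocrmypdf  # recent releases need Python 3.10+; Tesseract and Ghostscript are installed separately
    # Or:
    brew install ocrmypdf  # macOS; pulls in the dependencies for you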
    
  2. Run:
    ocrmypdf -l eng --deskew input.pdf output.pdf
    
    (Pro tip: Add --rotate-pages if some pages come out sideways or upside down; --deskew only straightens slightly crooked ones.)
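
The same run can also be started from Python, since OCRmyPDF ships an importable API alongside the CLI. This is a rough sketch, assuming that ocrmypdf.ocr() accepts language, deskew and rotate_pages keyword arguments that mirror the CLI flags above (check the API docs for your version). ocr_options and make_searchable are hypothetical helper names.

```python
def ocr_options(languages=("eng",), fix_rotation=False):
    """Translate the CLI flags from this post into keyword arguments."""
    options = {"language": list(languages), "deskew": True}
    if fix_rotation:
        options["rotate_pages"] = True  # assumed counterpart of --rotate-pages
    return options


def make_searchable(src, dst, languages=("eng",), fix_rotation=False):
    """OCR src into dst; needs ocrmypdf and its external tools installed."""
    import ocrmypdf  # imported lazily so the helpers load without the package

    return ocrmypdf.ocr(src, dst, **ocr_options(languages, fix_rotation))
```

In a real script, the call to make_searchable("input.pdf", "output.pdf", fix_rotation=True) probably belongs under an if __name__ == "__main__": guard, since OCRmyPDF may spawn worker processes.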

For a full feature tour, check the docs.


Final Thoughts

OCRmyPDF isn’t just for digitizing grandma’s recipes (though it’s great for that). It’s a legit tool for developers dealing with document pipelines, archiving systems, or preprocessing PDFs for ML datasets. The fact that it’s open-source and CLI-first makes it easy to slot into automated workflows.

Downsides? If you install with pip, you’ll need Tesseract and Ghostscript installed separately, and OCR quality depends on scan quality. But that’s true of any OCR tool.
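
A quick preflight check can catch a missing dependency before a long batch starts. This sketch just looks for executables on PATH. The default tool list is an assumption: ocrmypdf itself, tesseract, and gs, the usual name of the Ghostscript binary that OCRmyPDF also relies on. missing_tools is a hypothetical helper.

```python
import shutil


def missing_tools(tools=("ocrmypdf", "tesseract", "gs"), which=shutil.which):
    """Return the executables from tools that cannot be found on PATH."""
    # "gs" is the usual Ghostscript binary name on Linux/macOS (assumption;
    # the name differs on Windows).
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All set.")
```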

TL;DR: If you’ve ever cursed a scanned PDF, this tool is your revenge.

Project ID: 1943293705164554344
Last updated: July 10, 2025 at 12:58 PM