Surya: 650M OCR model hitting 83.3% on olmOCR-bench at 5 pages/s
Post Author: @githubprojects

Surya: A 650M OCR Model That Hits 83.3% at 5 Pages Per Second

If you've ever tried to extract text from scanned PDFs or images, you know the pain. Traditional OCR engines like Tesseract work okay for clean documents, but throw in poor lighting, curved text, or dense layouts, and accuracy drops fast. That's where Surya comes in.

This is a 650-million-parameter OCR model that scores 83.3% on the olmOCR-bench benchmark, and it does so at roughly 5 pages per second on a GPU. No fine-tuning needed. Just point it at your documents and let it rip.

What It Does

Surya is an open-source OCR pipeline built for real-world documents. It doesn't just recognize text in a straight line. It handles multi-column layouts, tables, headers, footnotes, and even images embedded inside documents. The model outputs structured text with bounding boxes for each line, making it easy to reconstruct the original layout.
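
To make the "reconstruct the original layout" point concrete, here is a minimal sketch that sorts recognized lines into a simple top-to-bottom, left-to-right reading order using only their bounding boxes. The `Line` type, its `(x0, y0, x1, y1)` box layout, and the `row_tolerance` value are assumptions made for this example; they are not Surya's actual output classes.

```python
from dataclasses import dataclass


@dataclass
class Line:
    # Hypothetical container for one OCR line; not a Surya class.
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def reading_order(lines: list[Line], row_tolerance: float = 10.0) -> list[Line]:
    """Sort lines top-to-bottom, then left-to-right within each row."""
    rows: list[list[Line]] = []
    for line in sorted(lines, key=lambda l: l.bbox[1]):
        # Treat a line as part of the current row if its top edge is
        # within row_tolerance pixels of the row's first line.
        if rows and abs(line.bbox[1] - rows[-1][0].bbox[1]) <= row_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    return [l for row in rows for l in sorted(row, key=lambda l: l.bbox[0])]


lines = [
    Line("world", (120, 12, 200, 30)),
    Line("Second line", (10, 50, 200, 70)),
    Line("Hello", (10, 10, 100, 30)),
]
ordered_text = [l.text for l in reading_order(lines)]
print(ordered_text)
```

This naive ordering only suits single-column pages. Multi-column documents need a layout step first so that each column is read on its own.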

Under the hood, it uses a transformer-based architecture trained on a diverse dataset of document images. The 650M parameter count means it's large enough to generalize well without being prohibitively slow.

Why It's Cool

A few things stand out:

Speed. Roughly 5 pages per second on a modern GPU is fast enough to make batch processing of large document collections practical, not just one-off scans.

Benchmark results. Surya scores 83.3% on olmOCR-bench. For context, that benchmark scores models on pass/fail checks over real-world PDFs, including old scans, tables, math, and multi-column pages. Surya beats most general-purpose OCR tools on that metric.

No training required. You download the model, install the dependencies, and run it. There's no need for domain-specific fine-tuning unless your use case is extremely niche.

Structured output. You get line-level bounding boxes with the recognized text. This is crucial for downstream tasks like RAG pipelines, document search, or data extraction from forms.

Handles complex layouts. Single-column text is easy. Surya handles multiple columns, nested tables, and floating sidebars without manual segmentation.
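
To put the speed figure in perspective, here is a small back-of-the-envelope sketch. It assumes a steady 5 pages per second on a single GPU, which is the rate quoted in this post; real throughput will vary with hardware, page density, and batch size. The function name `batch_seconds` is made up for this example.

```python
def batch_seconds(num_pages: int, pages_per_second: float = 5.0) -> float:
    """Estimated wall-clock seconds to OCR num_pages at a steady rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return num_pages / pages_per_second


for pages in (1_000, 100_000, 1_000_000):
    hours = batch_seconds(pages) / 3600
    print(f"{pages:>9,} pages: about {hours:.2f} GPU-hours")
```

At that rate a million pages works out to roughly 56 GPU-hours, so a large archive can be a weekend job on one card.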

How to Try It

The repo has clear installation instructions. Here's the quick start:

pip install surya-ocr

Then you can run from the command line:

surya_ocr input.pdf --output_dir output/

Or use the Python API directly:

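from PIL import Image  # Pillow, used to load the page images
from surya import ocr  # entry point as given in this post; check the README for the current import path

# Load each page as a PIL image
images = [Image.open("page1.png"), Image.open("page2.png")]
# Run OCR over the batch of pages
predictions = ocr(images)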

It works with PDFs and image files. The output is JSON with the recognized text and bounding box coordinates.
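
Once you have the JSON, pulling plain text out of it is a few lines of Python. The sketch below assumes a schema in which each document name maps to a list of pages, and each page carries a `text_lines` list of `text` and `bbox` entries. That shape is an assumption for illustration, as is the helper name `page_texts`, so inspect a real output file before relying on it.

```python
import json

# Inline sample standing in for an output file. The schema is assumed
# for illustration and may not match what the tool actually writes.
sample = """
{
  "input": [
    {
      "page": 1,
      "text_lines": [
        {"text": "First line", "bbox": [40, 30, 300, 60]},
        {"text": "Second line", "bbox": [40, 80, 260, 110]}
      ]
    }
  ]
}
"""


def page_texts(results: dict) -> dict[tuple[str, int], str]:
    """Join recognized lines into one string per (document, page)."""
    out: dict[tuple[str, int], str] = {}
    for doc_name, pages in results.items():
        for page in pages:
            lines = [line["text"] for line in page["text_lines"]]
            out[(doc_name, page["page"])] = "\n".join(lines)
    return out


texts = page_texts(json.loads(sample))
print(texts[("input", 1)])
```

Keeping the `bbox` values alongside the text is what makes later steps, such as highlighting a search hit on the page image, possible.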

If you want to see it in action before installing, there's an online demo link in the repo (check the README for the latest URL).

Final Thoughts

Surya is one of those tools that just works. No fussing with config files, no training pipeline setup. You install it, run it, and get reliable OCR output fast. For anyone building document processing pipelines, search indexes over scanned content, or just needing to extract text from messy PDFs, this is worth adding to your toolkit.

It's not perfect on every edge case, and 650M params means you'll want a GPU for the full speed. But for a free, open-source model that beats most general-purpose OCR tools on benchmark accuracy? This is a solid win.


Brought to you by @githubprojects

Last updated: June 5, 2026 at 09:15 AM