Chunkr: open-source document intelligence that outputs RAG-ready chunks

Document Intelligence That Actually Speaks to Your RAG Pipeline: Meet Chunkr

If you've ever tried building a RAG (Retrieval-Augmented Generation) system, you know the pain of preprocessing documents. PDFs, Word files, and even plain text often arrive as messy blobs—tables break, headers get lost, and semantic meaning gets mangled. You spend more time chunking and cleaning than actually building the retrieval logic.

That's where Chunkr comes in. It's an open-source tool from Lumina AI that takes documents and outputs chunks that are ready to drop into your vector database or retrieval pipeline. No hand-cranking, no fragile regex hacks. Just clean, context-aware chunks.

---

**What It Does**

Chunkr is a document intelligence utility that processes various file formats (PDF, DOCX, TXT, etc.) and splits them into semantically meaningful pieces. It's not just splitting by character count or sentence boundaries—it understands document structure: headings, paragraphs, lists, and tables. The output is a set of chunks that preserve context, so your RAG system can actually retrieve the right information without losing the surrounding logic.
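To make the difference concrete, here is a minimal sketch that contrasts fixed-size splitting with splitting at headings. It is not Chunkr's implementation: the function names `split_fixed` and `split_by_heading` are hypothetical, and the input is assumed to be Markdown-style text in which headings start with `#`.

```python
SAMPLE = """# Refund Policy
Refunds are issued within 30 days of purchase.

# Shipping
Orders ship within 2 business days.
"""


def split_fixed(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_heading(text: str) -> list[str]:
    """Illustrative only: start a new chunk at every line beginning with '#'."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


if __name__ == "__main__":
    print(split_fixed(SAMPLE, 40))    # cuts land mid-sentence
    print(split_by_heading(SAMPLE))   # one chunk per section
```

The fixed-size version separates the "Shipping" heading from the sentence it introduces, while the heading-based version keeps each section together. That is the property a structure-aware chunker aims for.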

---

**Why It’s Cool**

The standout feature here is the "RAG-ready" promise. Most chunking tools give you either too many small pieces or giant blocks that kill retrieval quality. Chunkr tries to be smarter:

- **Structure-aware splitting** – It respects your document's natural hierarchy. A chunk might end at a section break rather than mid-sentence.
- **Output format** – Chunks come with metadata (source, position, heading hierarchy). You can plug them directly into an embedding pipeline.
- **Open-source** – You can inspect, modify, or extend the logic. No black box.
- **No lock-in** – The chunks are plain text/markdown, so they work with any vector DB (Pinecone, Chroma, Weaviate, etc.).

This is especially useful if you work with legal contracts, academic papers, or anything where context matters beyond a paragraph boundary.
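As a rough illustration of what chunk metadata like this can look like, the sketch below tracks the heading hierarchy while walking a Markdown document. The `Chunk` dataclass, its field names, and the helper functions are assumptions made for this example, not Chunkr's actual output schema.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source: str                 # file the chunk came from
    position: int               # order of the chunk within the document
    heading_path: list[str] = field(default_factory=list)


def chunk_with_metadata(markdown: str, source: str) -> list[Chunk]:
    """Hypothetical helper: one chunk per section body, tagged with its heading path."""
    chunks: list[Chunk] = []
    path: list[str] = []   # current heading stack, e.g. ["Contract", "Termination"]
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(text, source, len(chunks), list(path)))
        body.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()  # emit the previous section under the old heading path
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]   # drop headings at the same or deeper level
            path.append(line[level:].strip())
        else:
            body.append(line)
    flush()
    return chunks


def embedding_input(chunk: Chunk) -> str:
    """Prefix the heading path so the embedded text carries its context."""
    return " > ".join(chunk.heading_path) + "\n" + chunk.text


DOC = """# Contract
Intro text.
## Termination
Either party may terminate with 30 days notice.
## Liability
Liability is capped.
"""

if __name__ == "__main__":
    for c in chunk_with_metadata(DOC, "contract.md"):
        print(c.position, c.heading_path, "|", c.text)
```

Storing the heading path next to the text means a retrieved clause such as the termination terms still records which contract section it belongs to, and the same record can be written to any vector store as plain text plus a metadata dictionary.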

---

**How to Try It**

Getting started is straightforward:

```bash
git clone https://github.com/lumina-ai-inc/chunkr
cd chunkr
pip install -r requirements.txt
```

Then run it on a document:

```bash
python chunkr.py path/to/your/file.pdf
```

Or use the Python API directly in your own script:

```python
from chunkr import chunk_document

chunks = chunk_document("my_report.pdf", strategy="hierarchical")
for chunk in chunks:
    print(chunk.text, chunk.metadata)
```

Check the repo for more options (custom chunk size, overlap control, output formats).

---

**Final Thoughts**

Chunkr isn't trying to be a full RAG platform. It's a focused helper that takes the dirty preprocessing work off your hands. If you've ever spent a weekend hand-tweaking chunk sizes to fix a retrieval quality issue, you'll appreciate this. It's the kind of tool that should be in every AI engineer's toolbox.

Give it a try on a messy document. The chunks will probably be better than what you'd produce by hand. And if you find a bug or have an idea for improvement, the repo is open, so contribute back.


Repository: https://github.com/lumina-ai-inc/chunkr
