Edit-Banana: Stop Copying Tables by Hand
We've all been there. You find a perfect table of data in a PDF, a research paper, or a webpage—maybe it's census data, financial results, or experimental findings. You need that data in a spreadsheet or a script, but it's trapped as a static image or in a messy, non-editable format. Your next hour is suddenly filled with the mind-numbing task of manual data entry. What if you could just… get the data?
That's the frustration Edit-Banana is built to solve. It's a command-line tool that takes those complex, formatted statistical tables (think PDFs, images, or messy text) and converts them into clean, editable data with a single command. It's like Ctrl+C, Ctrl+V for data that was never meant to be copied.
What It Does
In simple terms, Edit-Banana is an intelligent table extractor. You feed it a file containing a table—often from academic papers, reports, or official documents where data is presented for human reading, not machine processing. It then identifies the table structure, parses the rows and columns, and outputs the data into a usable format like CSV or Excel.
It goes beyond basic OCR by understanding the logic of statistical tables: merged headers, nested columns, footnotes, and units. It tries to reconstruct the intended dataset from the formatted presentation layer.
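To make "merged headers" concrete, here is a minimal sketch of the kind of flattening involved. It is not Edit-Banana's actual code: the function name `flatten_headers` and the sample header rows are made up for illustration, and it assumes a merged cell shows up as one filled cell followed by empty ones.

```python
from typing import List


def flatten_headers(top: List[str], bottom: List[str], sep: str = " / ") -> List[str]:
    """Combine a spanned top header row with the sub-header row beneath it.

    Assumption: an empty cell in `top` continues the last non-empty cell
    to its left, which is how a merged header tends to look once the
    table has been split into a plain grid.
    """
    if len(top) != len(bottom):
        raise ValueError("header rows must have the same number of cells")

    flat = []
    current = ""  # the most recent non-empty top-level header
    for upper, lower in zip(top, bottom):
        upper, lower = upper.strip(), lower.strip()
        if upper:
            current = upper
        if current and lower:
            flat.append(f"{current}{sep}{lower}")
        else:
            # Only one of the two levels is present, so use whichever exists.
            flat.append(current or lower)
    return flat


# Hypothetical header: "2022" and "2023" each span a Mean and an SD column.
top = ["Region", "2022", "", "2023", ""]
bottom = ["", "Mean", "SD", "Mean", "SD"]
print(flatten_headers(top, bottom))
# ['Region', '2022 / Mean', '2022 / SD', '2023 / Mean', '2023 / SD']
```

The result is one unambiguous name per column, which is what a CSV or a DataFrame needs.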
Why It's Cool
The magic of Edit-Banana isn't just that it extracts text; it's that it aims to extract meaningful structure. Here’s what makes it stand out:
- One-Command Simplicity: The core promise is real. A single command like edit-banana input.pdf -o data.csv can save an afternoon of tedious work.
- Handles the Messy Stuff: It's designed for the real world of data presentation. It doesn't just bail when it sees a spanned header or a superscript footnote symbol; it tries to integrate that information intelligently.
- Developer-Centric: It's a CLI tool, which means it slots perfectly into data processing pipelines. You can automate the extraction of hundreds of tables, hook it into a data scraping script, or use it as the first step in your ETL process.
- Fights PDF Hell: For anyone in research, data analysis, or journalism, getting data out of PDFs is a notorious pain point. Edit-Banana is a direct assault on that problem.
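As a rough sketch of the pipeline idea, the script below loops over a folder of PDFs and calls the tool once per file. It assumes the `edit-banana input.pdf -o data.csv` form quoted above. The post also shows a `python edit_banana.py ... --format csv` form, so check the README for the real invocation before relying on either. `build_command` and `batch_extract` are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence


def build_command(pdf: Path, out_dir: Path) -> List[str]:
    # Assumption: CLI name and -o flag are copied from the example in this
    # post, not verified against the repository.
    return ["edit-banana", str(pdf), "-o", str(out_dir / (pdf.stem + ".csv"))]


def _run(cmd: Sequence[str]) -> int:
    # Run one extraction and hand back its exit code.
    return subprocess.run(cmd, check=False).returncode


def batch_extract(
    in_dir: Path,
    out_dir: Path,
    run: Callable[[Sequence[str]], int] = _run,
) -> List[Path]:
    """Extract every PDF in `in_dir`; return the files whose run failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(in_dir.glob("*.pdf")):
        if run(build_command(pdf, out_dir)) != 0:
            failed.append(pdf)
    return failed
```

Calling `batch_extract(Path("reports"), Path("extracted"))` would then process a whole folder and give back the list of PDFs that need a second look. Passing the runner in as a parameter keeps the loop easy to dry-run without the tool installed.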
How to Try It
Ready to free some trapped data? Getting started is straightforward.
- Clone the repo:
    git clone https://github.com/BIT-DataLab/Edit-Banana.git
    cd Edit-Banana
- Set up a Python environment and install dependencies (check the repo's README.md for the most up-to-date list, as it may require Tesseract for OCR or other libs).
- Run it on a sample: The repository likely includes examples. Try it on a provided sample PDF or image to see it in action.
    python edit_banana.py path/to/your/table.pdf --format csv
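Once you have a CSV, a quick sanity check can catch the most common extraction slip: rows that ended up with too many or too few cells. The sketch below uses only Python's standard library and is independent of Edit-Banana. `ragged_rows` is a hypothetical helper and the sample data is invented.

```python
import csv
import io
from collections import Counter
from typing import List, Tuple


def ragged_rows(csv_text: str) -> List[Tuple[int, int]]:
    """Return (line_number, cell_count) for rows whose width is unusual.

    The expected width is taken to be the most common row width, so a
    single broken row does not hide behind a broken header.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    return [(i, len(row)) for i, row in enumerate(rows, start=1) if len(row) != expected]


# Invented sample: the "South" row lost its last cell.
sample = "Region,Mean,SD\nNorth,1.2,0.3\nSouth,1.4\nEast,1.1,0.2\n"
print(ragged_rows(sample))
# [(3, 2)]
```

An empty result does not prove the extraction is right, but a non-empty one tells you exactly which rows to compare against the original table.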
The project is on GitHub, where you can read the docs, browse the issues, and check the roadmap. It's under active development, so contributions and feedback are welcome.
Final Thoughts
Edit-Banana feels like one of those utilities that, once you use it, becomes an essential part of your toolkit. It solves a specific, widespread pain point without overcomplicating things. It won't be 100% perfect for every bizarrely formatted table—no tool is—but for the majority of standard statistical tables, it promises to be a massive time-saver.
If your work ever involves reclaiming data from the prison of a formatted document, this is absolutely worth a look. It turns a frustrating chore into a simple command.
Repository: https://github.com/BIT-DataLab/Edit-Banana