Koheesio: Finally, Data Pipelines That Don't Suck to Build
Let's be honest: building data pipelines is often a slog. You start with a simple ETL, then someone wants an extra transformation, then another source, then error handling, and before you know it you're staring at a 500-line monster that nobody wants to touch.
Nike open-sourced something that actually tries to fix this. Meet Koheesio.
What It Does
Koheesio is a Python framework for building data pipelines using simple, reusable building blocks. Think of it as Lego for data workflows. Each block is a small, focused component that does one thing well (read from S3, transform a column, write to Redshift). You chain these blocks together to build complex pipelines without writing spaghetti code.
The magic? It's built on Pydantic and PySpark, with pandas support for smaller jobs, so you get the flexibility of Python with the scalability of distributed processing when you need it.
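To make the Lego idea concrete, here is a minimal sketch in plain Python. It deliberately does not use Koheesio's own classes: `Block`, `DropNulls`, `Uppercase`, and `run_pipeline` are hypothetical names invented for illustration, and rows are plain dictionaries instead of DataFrames. It only shows the shape of the pattern: small components with one job each, chained in order.

```python
from dataclasses import dataclass
from typing import Protocol

Row = dict[str, object]


class Block(Protocol):
    """Hypothetical interface: anything with apply() that maps rows to rows."""

    def apply(self, rows: list[Row]) -> list[Row]: ...


@dataclass
class DropNulls:
    """Keep only rows where `column` is present and not None."""

    column: str

    def apply(self, rows: list[Row]) -> list[Row]:
        return [r for r in rows if r.get(self.column) is not None]


@dataclass
class Uppercase:
    """Upper-case the string value stored in `column`."""

    column: str

    def apply(self, rows: list[Row]) -> list[Row]:
        return [{**r, self.column: str(r[self.column]).upper()} for r in rows]


def run_pipeline(rows: list[Row], blocks: list[Block]) -> list[Row]:
    """Feed the output of each block into the next one, in order."""
    for block in blocks:
        rows = block.apply(rows)
    return rows


rows = [{"name": "ada"}, {"name": None}, {"name": "grace"}]
result = run_pipeline(rows, [DropNulls("name"), Uppercase("name")])
print(result)  # [{'name': 'ADA'}, {'name': 'GRACE'}]
```

Because every block shares the same tiny interface, adding "one more transformation" means appending one more object to the list instead of editing a 500-line function.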
Why It's Cool
Three things stand out:
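1. Components are dead simple to write. Each component is just a Python class with an execute method. No weird decorators, no complex inheritance. If you can write a function, you can write a Koheesio component.

    from koheesio.spark.transformations import Transformation


    class FilterOutNulls(Transformation):
        column: str  # name of the column to check for nulls

        def execute(self):
            # self.df is the input DataFrame; the result is stored on self.output.df
            self.output.df = self.df.filter(self.df[self.column].isNotNull())
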
2. Automatic error handling and retries. Failed pipeline at 2 AM? Koheesio logs exactly which component failed and why. It even supports automatic retries with exponential backoff built in.
3. First-class logging and observability. Every step automatically logs its input/output schemas, row counts, and execution time. You don't need to add your own logging everywhere. It just works.
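For readers who haven't seen retries with exponential backoff before, here is a generic sketch of the idea in plain Python. It is not Koheesio's API: `run_with_retries` and its parameters are hypothetical names, and the `sleep` argument exists only so the example can run without actually waiting. It also logs which step failed and how long a successful attempt took, in the spirit of points 2 and 3 above.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("pipeline")


def run_with_retries(
    name: str,
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call step(), retrying on any exception with delays of base_delay * 2**n."""
    for attempt in range(1, max_attempts + 1):
        started = time.perf_counter()
        try:
            result = step()
        except Exception as exc:
            logger.warning(
                "step %r failed on attempt %d/%d: %s", name, attempt, max_attempts, exc
            )
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1x, 2x, 4x, ... base_delay
        else:
            elapsed = time.perf_counter() - started
            logger.info("step %r succeeded in %.3fs", name, elapsed)
            return result
    raise ValueError("max_attempts must be at least 1")


# Usage: a hypothetical source that fails twice before succeeding.
calls = {"count": 0}
delays: list[float] = []


def flaky_read() -> list[str]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return ["row-1", "row-2"]


rows = run_with_retries("read_source", flaky_read, sleep=delays.append)
print(rows, delays)  # ['row-1', 'row-2'] [1.0, 2.0]
```

Doubling the delay after each failure gives a struggling upstream system room to recover instead of hammering it at a fixed interval.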
How to Try It
Getting started takes about 2 minutes:
pip install koheesio
Then check out the official examples in the GitHub repo. The README has a solid "getting started" section with a real pipeline example that reads from S3, transforms the data, and writes it to a Parquet file.
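If you want to build locally, clone the repo and install its dependencies. The last command assumes the project is managed with Poetry; if that has changed, the repo's contributing guide will list the current setup steps: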
git clone https://github.com/Nike-Inc/koheesio.git
cd koheesio
poetry install
Final Thoughts
Koheesio feels like someone actually sat down and thought about what makes data pipeline development painful, then fixed it. It doesn't try to be a hyper-opinionated framework that forces you into a specific way of thinking. Instead, it gives you a clean structure to organize your code and gets out of your way.
If you're tired of maintaining pipeline code that's brittle and hard to debug, give it a spin. It's one of those tools that you'll appreciate more the longer you use it.