opensourceprojects.dev

A broadsheet for software that doesn't ask for your email

Koheesio: Build complex data pipelines from simple, reusable components

Koheesio: Finally, Data Pipelines That Don't Suck to Build

Let's be honest: building data pipelines is often a slog. You start with a simple ETL, then someone wants an extra transformation, then another source, then error handling, and before you know it you're staring at a 500-line monster that nobody wants to touch.

Nike open-sourced something that actually tries to fix this. Meet Koheesio.

What It Does

Koheesio is a Python framework for building data pipelines using simple, reusable building blocks. Think of it as Lego for data workflows. Each block is a small, focused component that does one thing well (read from S3, transform a column, write to Redshift). You chain these blocks together to build complex pipelines without writing spaghetti code.
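To make the Lego idea concrete, here is a minimal, framework-free sketch of chaining small blocks. It does not use Koheesio's API: the names Block, DropNulls, RenameKey, and Pipeline are hypothetical, and rows are plain dicts rather than DataFrames.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol

Row = dict  # in this sketch a record is just a plain dict


class Block(Protocol):
    """Anything with an apply() that maps rows to rows can be chained."""

    def apply(self, rows: list[Row]) -> list[Row]: ...


@dataclass
class DropNulls:
    """Keep only rows where `column` is present and not None."""

    column: str

    def apply(self, rows: list[Row]) -> list[Row]:
        return [r for r in rows if r.get(self.column) is not None]


@dataclass
class RenameKey:
    """Rename one key in every row, leaving the other keys untouched."""

    old: str
    new: str

    def apply(self, rows: list[Row]) -> list[Row]:
        return [
            {(self.new if k == self.old else k): v for k, v in r.items()}
            for r in rows
        ]


@dataclass
class Pipeline:
    """Run blocks in order, feeding each block's output to the next."""

    blocks: list[Block]

    def apply(self, rows: list[Row]) -> list[Row]:
        for block in self.blocks:
            rows = block.apply(rows)
        return rows


pipeline = Pipeline([DropNulls("email"), RenameKey("email", "contact")])
result = pipeline.apply([
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
])
print(result)  # [{'id': 1, 'contact': 'a@example.com'}]
```

Because Pipeline exposes the same apply method as its blocks, a pipeline can itself be dropped into a larger pipeline, which is the property that keeps composition from turning into spaghetti.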

The magic? Its components are Pydantic models, so inputs are validated before anything runs, and it integrates with PySpark and pandas, so you get the flexibility of Python with the scalability of distributed processing when you need it.

Why It's Cool

Three things stand out:

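1. Components are dead simple to write. Each component is just a Python class with typed fields and an execute method that writes its result to self.output. No weird decorators, no complex inheritance. If you can write a function, you can write a Koheesio component.

# Sketch based on Koheesio's Spark Transformation base class;
# check the docs for the version you have installed.
from koheesio.spark.transformations import Transformation


class FilterOutNulls(Transformation):
    # Name of the column that must not be null
    column: str

    def execute(self):
        # self.df is the input DataFrame; the result goes on self.output.df
        self.output.df = self.df.filter(self.df[self.column].isNotNull())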

2. Automatic error handling and retries. Failed pipeline at 2 AM? Koheesio logs exactly which component failed and why. It even supports automatic retries with exponential backoff built in.

3. First-class logging and observability. Every step automatically logs its input/output schemas, row counts, and execution time. You don't need to add your own logging everywhere. It just works.
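If the retry and logging ideas above are new to you, the general pattern can be sketched in plain Python as follows. This is a generic illustration, not Koheesio's implementation: run_with_retries, its parameters, and flaky_read are hypothetical names invented for the example.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_with_retries(
    name: str,
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception; the wait doubles after each failure."""
    for attempt in range(1, max_attempts + 1):
        started = time.perf_counter()
        try:
            result = func()
        except Exception as exc:
            log.warning("step=%s attempt=%d failed: %r", name, attempt, exc)
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
        else:
            elapsed = time.perf_counter() - started
            log.info("step=%s attempt=%d ok in %.3fs", name, attempt, elapsed)
            return result
    raise ValueError("max_attempts must be at least 1")


# Usage: a hypothetical read that fails twice before succeeding.
calls = {"n": 0}


def flaky_read() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary outage")
    return "42 rows"


waits = []  # record the requested delays instead of actually sleeping
outcome = run_with_retries(
    "read_source", flaky_read, base_delay=0.01, sleep=waits.append
)
print(outcome, waits)  # 42 rows [0.01, 0.02]
```

Passing sleep in as a parameter keeps the sketch testable without real waiting. The log lines carry the step name and attempt number, which is the detail you want when something fails at 2 AM.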

How to Try It

Getting started takes about 2 minutes:

pip install koheesio

Then check out the official examples in the GitHub repo. The README has a solid "getting started" with a real pipeline example that reads from S3, transforms, and writes to a Parquet file.

If you want to build locally, clone the repo and run:

git clone https://github.com/Nike-Inc/koheesio.git
cd koheesio
pip install -e .

Final Thoughts

Koheesio feels like someone actually sat down and thought about what makes data pipeline development painful, then fixed it. It doesn't try to be a hyper-opinionated framework that forces you into a specific way of thinking. Instead, it gives you a clean structure to organize your code and gets out of your way.

If you're tired of maintaining pipeline code that's brittle and hard to debug, give it a spin. It's one of those tools that you'll appreciate more the longer you use it.


Last updated: July 5, 2026 at 02:43 AM