Build your projects thinking in terms of pipelines

Why think in "pipelines"?

Our data science projects, at some point, end up looking a lot like data pipelines. Data flows through a sequence of data preparation functions, which yield so-called "clean" data. That cleaned data then flows through a trained model, which then returns a prediction. The prediction then flows through some automated reporting system, giving end-users a way to consume the result.

The science portion of data science includes the art of figuring out what that pipeline should look like. Once the science portion of the work, which is essentially scoping out what we need to automate, is complete, we can move into an engineering paradigm where we build pipelines to automate the good things we've uncovered.

What to look out for in pipelining tools?

The biggest thing to look out for is the ability to avoid repeating unnecessary computations. Tools that do this well will provide you with a syntax for naming build steps and defining dependencies between them. They will also automatically cache intermediate results.

Other than that, some pipelining tools come with niceties. One example is a "graph view" that lets you see the dependency graph between steps. Another example is a library of built-in steps for accomplishing common tasks.

What pipelining tools exist?

Here's a quick overview of pipelining tools that are available. One thing to keep in mind is that the ecosystem is changing quickly. As such, I would advise two things: Firstly, treat this listing as an incomplete and evolving document. Secondly, be ready to learn multiple tools and scope out whether they work well for your use cases.

Make

make is the "granddaddy" of pipelining tools. It is also the lightest-weight tool that you can use. It's usually shipped with every UNIX-like system, making it ubiquitous and hence easy to get started with. make uses a Makefile that lives in the project repository's root directory. There's a delightful tutorial on Make that you can follow to learn how to use it.
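To give a flavour of what this looks like, here's a minimal sketch of a Makefile (the file and script names are hypothetical) where each step names its output and the inputs it depends on, so make only re-runs a step when its inputs change:

all: models/model.pkl

data/clean.csv: data/raw.csv scripts/clean.py
	python scripts/clean.py data/raw.csv data/clean.csv

models/model.pkl: data/clean.csv scripts/train.py
	python scripts/train.py data/clean.csv models/model.pkl

Running make a second time does no work at all, because each output is already newer than the inputs it depends on; that's the caching behaviour described above.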

While scoping out the tooling around Make, I learned that there are convenient companion tools available. One such example is the Python package makefile2dot, which lets you visualize the Makefile dependency graph. (This composability of tools that each do one thing, and one thing only, well is very much in line with the UNIX philosophy.)

Make is usually run on a local machine. In my experience, it's most convenient for providing a top-level command-line interface for interacting with the project's files. For example, I would put code style checks under a style target, allowing me to run make style to execute all of the code style checks conveniently.
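As a sketch, with hypothetical (but common) tool choices, that target might look like:

style:
	black .
	isort .
	flake8 .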

If you're starting with pipelining, I would recommend starting with Make ahead of the rest of the tools listed below, as its simplicity and ubiquity will help you master the concepts of pipelining.

Snakemake

Snakemake started as a bioinformatics pipelining tool but eventually grew into a general-purpose pipelining tool. If my recollection of history is correct, it was initially designed for use on "local" (though powerful) systems, such as the heavy-duty Linux workstations that are bioinformaticians' daily driver machines, and later grew to support distributed cluster/cloud workflows as well. You can check out Snakemake's website and docs.
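For a flavour of the syntax, here is a minimal, hypothetical Snakefile rule (file and script names are illustrative); Snakemake works out the dependency graph from the declared inputs and outputs:

rule clean_data:
    input:
        "data/raw.csv"
    output:
        "data/clean.csv"
    shell:
        "python scripts/clean.py {input} {output}"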

Kedro

Kedro is built by Quantum Black, McKinsey's specialized data science consultancy. Kedro is somewhat opinionated about certain things, and some of its suggested practices might look slightly different from the ways I suggest doing things here, but I believe the underlying philosophy does make sense. You can check out Kedro here.
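As a rough sketch (the exact API can vary between Kedro versions, and the function and dataset names here are purely illustrative), pipelines are assembled from plain Python functions wrapped in nodes:

from kedro.pipeline import Pipeline, node

def clean(raw_df):
    # hypothetical cleaning function
    return raw_df.dropna()

pipeline = Pipeline([
    node(clean, inputs="raw_data", outputs="clean_data"),
])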

Prefect

Prefect is an open-source pipeline orchestration tool with a commercial offering by the company of the same name. One nice thing about Prefect is that its syntax is entirely in Python code, and its orchestration server comes with a dashboard for live monitoring of the jobs you've defined.
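Here is a minimal sketch, assuming the decorator-based API of Prefect 2 (the function names and toy logic are purely illustrative):

from prefect import flow, task

@task
def clean(raw):
    # hypothetical cleaning step
    return [x for x in raw if x is not None]

@flow
def pipeline():
    return clean([1, None, 3])

if __name__ == "__main__":
    pipeline()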

Kubeflow

Kubeflow is a pipelining tool designed to work on Kubernetes. Its primary use case is machine learning pipelines, which are sometimes among the end products of data science projects. If your organization has already made significant investments in Kubernetes, then Kubeflow might be a viable option for you to consider.
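As a hedged sketch, assuming the kfp v2 SDK (names and the trivially small step are illustrative), pipeline steps are Python functions that get compiled into something Kubernetes can run:

from kfp import dsl, compiler

@dsl.component
def clean_data(msg: str) -> str:
    # hypothetical, trivially small step
    return msg.strip()

@dsl.pipeline(name="demo-pipeline")
def pipeline():
    clean_data(msg=" hello ")

compiler.Compiler().compile(pipeline, "pipeline.yaml")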

GitHub Actions

If you live in the GitHub ecosystem, then GitHub Actions is not a bad option to consider. Its syntax for configuring builds is easy to learn, it comes with a graph view, and its ability to trigger builds automatically is superb.
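For example, a minimal workflow file placed under .github/workflows/ might look like the sketch below, which runs a hypothetical build script on every push:

name: build
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bash scripts/ci/build.sh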

Handling data

How to handle data

Handling data in a data science project is very tricky. Primarily, we have to worry about the following:

  1. Availability: How do I make the data that my project depends on available to others who want to work on the project?
  2. Validation: How do I know whether my data are exactly what I think they should be? (A small sketch of such a check follows below.)
  3. Flow complexity: How do I combat the entropy (complexity) that grows as the project develops?
  4. Provenance: If I have a problem with the data, whom should I ask questions about it?

The notes linked in this section should give you an overview of how to approach handling data in a sane fashion on your project.
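To make the validation concern concrete, here's a minimal, hand-rolled sketch (pandas-based, with a hypothetical file path and column names). Dedicated data validation libraries exist, but even a check this small catches a lot:

import pandas as pd

df = pd.read_csv("data/clean.csv")  # hypothetical path
assert set(df.columns) >= {"age", "income"}, "missing expected columns"
assert df["age"].between(0, 120).all(), "age values out of range"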

Use scripts to automate routine execution of tasks

This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your project directory.

Where should these scripts live?

In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and create additional sub-categories inside it.
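As a purely illustrative example, that layout might look like:

scripts/
├── ci/
│   └── build.sh
├── data/
│   └── download.sh
└── setup.sh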

How do I decide what language to write those scripts in?

You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:

  • If you're doing text processing of files, or otherwise leveraging functions from your project's custom source, then you might want to write them in Python. (see: Place custom source code inside a lightweight package)
  • If you're doing filesystem manipulation, or repeated serial execution of command line tools, a bash script is a great idea.

What else should I pay attention to when building these scripts?

Design for project root execution

Most of the time, it's optimal to design these scripts assuming that the "current working directory" is the project root directory. This will simplify how you execute the scripts, and you'll avoid having to inject "cd" commands into the documentation that you build.

There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.
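If you want a script to behave the same no matter where it's invoked from, one option (a bash sketch, assuming the script lives two directories below the project root, e.g. in scripts/ci/) is to compute the project root from the script's own location:

#!/bin/bash
# hop to the project root, relative to this script's location
cd "$(dirname "$0")/../.." || exit 1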

Leverage Makefiles

If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:

bash scripts/ci/build.sh

can get boring over time. If you instead put that line in a Makefile as follows:


build:
	bash scripts/ci/build.sh

then you can execute the command make build from the project root, and save yourself keystrokes.
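One small detail: because build here names a command rather than a file that the recipe produces, it's worth declaring it as a phony target, so a stray file or directory named build doesn't stop it from running. Add this line to the Makefile:

.PHONY: build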

Help your colleagues with a "bootstrap" script

You can help your colleagues get set up by creating a script for them! For example, you can write one that contains the following commands:

# ./scripts/setup.sh

export PROJECT_ENV_NAME=______________  # replace with your env name

# make `conda activate` usable inside this non-interactive script
source "$(conda info --base)/etc/profile.d/conda.sh"

conda env create -f environment.yml || mamba env create -f environment.yml
conda activate "$PROJECT_ENV_NAME"

# Install custom source
pip install -e .

# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."

This script will help you:

  1. Create the conda environment. (see: Create one conda environment per project)
  2. Install the custom source
  3. Install the Jupyterlab IPywidgets extension (necessary for progress bars like tqdm!)
  4. Install pre-commit hooks (see: Set up pre-commit hooks to automate checks before making git commits)

Saves a bunch of time downstream!

Separate computationally expensive steps from computationally cheap steps

If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then set it up such that upstream computational steps, especially the computationally expensive ones, execute independently of the computationally cheap ones that depend on them. One example, provided by one of my reviewers, Simon, is "intermediate data generation" vs. "data visualization". To quote:

I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if it takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.
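To make that point concrete, here's a sketch in Make (the file and script names are hypothetical) where the expensive data generation and the cheap visualization are separate steps, so tweaking a plot never forces the data to be regenerated:

data/intermediate.parquet: data/raw.csv scripts/generate.py
	python scripts/generate.py data/raw.csv data/intermediate.parquet

figures/plot.png: data/intermediate.parquet scripts/plot.py
	python scripts/plot.py data/intermediate.parquet figures/plot.png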