Build your projects thinking in terms of pipelines
Our data science projects, at some point, end up looking a lot like data pipelines. Data flows through a sequence of data preparation functions, which yield so-called "clean" data. That cleaned data then flows through a trained model, which then returns a prediction. The prediction then flows through some automated reporting system, giving end-users a way to consume the result.
The science portion of data science includes the art of figuring out what that pipeline looks like. Once the science portion of the work, which is essentially scoping out what we need to automate, is complete, we can take things into an engineering paradigm where we build pipelines to automate the good things we've uncovered.
The most important capability to look for is the ability to avoid repeating unnecessary computations. Tools that do this well will give you a syntax for naming build steps and declaring dependencies between them, and they will automatically cache intermediate results.
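To make that concrete, here's a minimal sketch of such a dependency graph expressed as a Makefile (Make is covered below); the file names and scripts are hypothetical placeholders:

# Each target names a build step; the files after the colon are its dependencies.
# (Note: Make requires recipe lines to be indented with a tab.)
data/clean.csv: scripts/clean_data.py data/raw.csv
	python scripts/clean_data.py data/raw.csv data/clean.csv

models/model.pkl: scripts/train.py data/clean.csv
	python scripts/train.py data/clean.csv models/model.pkl

# Running `make models/model.pkl` reruns only the steps whose inputs have changed;
# intermediate files that are still up to date are reused rather than recomputed.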
Other than that, some pipelining tools come with niceties. One example is a "graph view" that lets you see the dependency graph between steps. Another is a library of built-in steps for accomplishing common tasks.
Here's a quick overview of pipelining tools that are available. One thing to keep in mind is that the ecosystem is changing quickly. As such, I would advise two things: Firstly, treat this listing as an incomplete and evolving document. Secondly, be ready to learn multiple tools and scope out whether they work well for your use cases.
make is the "big grand daddy" of pipelining tools. It is also the lightest-weight tool that you can use. It's usually shipped with every UNIX-like system, making it ubiquitous and hence easy to get started with. make uses a Makefile that lives in the project repository's root directory. There's a delightful tutorial on Make that you can follow to learn how to use it.
While scoping out the tooling ecosystem around Make, I learned that there are convenient companion tools available. One such example is the Python package makefile2dot, which lets you visualize the Makefile dependency graph. (This composability of tools that each do one and only one thing well is very much in line with the UNIX philosophy.)
Make is usually run on a local machine. In my experience, it's most convenient for providing a top-level command-line interface to interact with the project's files. For example, I would put code style checks under a style command, allowing me to execute the command make style to conveniently run all code style checks.
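As a sketch, such a target might look like the following, assuming black and flake8 are the style checkers in use (substitute whatever your project actually runs):

# `style` is not a file, so mark it as phony; `make style` then always executes it.
.PHONY: style
style:
	black --check .
	flake8 .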
If you're starting with pipelining, I would recommend starting with Make ahead of the rest of the tools listed below, as its simplicity and ubiquity will help you master the concepts of pipelining.
Snakemake started as a bioinformatics pipelining tool but eventually grew to be a general-purpose pipelining tool. If my recollection of history is correct, it was initially designed for use on "local" (though powerful) systems, such as the heavy-duty Linux workstations that are bioinformaticians' daily driver machines. Eventually, it grew to support distributed cluster/cloud workflows as well. You can check out Snakemake's website and docs.
Kedro is built by Quantum Black, McKinsey's specialized data science consultancy. Kedro is somewhat opinionated about certain things, and some of its suggested practices might look slightly different from the ways I suggest doing things here, but I believe the underlying philosophy makes sense. You can check out Kedro here.
Prefect is an open-source pipeline orchestration tool with a commercial offering by the company of the same name. One nice thing about Prefect is that its syntax is entirely in Python code, and its orchestration server comes with a dashboard for live monitoring of the jobs you've defined.
Kubeflow is a pipelining tool designed to work on Kubernetes. Its primary use case is machine learning pipelines, which are sometimes among the end products of data science projects. If your organization has already made significant investments in Kubernetes, then Kubeflow might be a viable option for you to consider.
If you live in the GitHub ecosystem, then GitHub Actions is not a bad option to consider. Its syntax for configuring builds is easy to learn, it comes with a graph view, and its ability to trigger builds automatically is superb.
Handling data
Handling data in a data science project is very tricky. Primarily, we have to worry about the following:
The notes linked in this section should give you an overview of how to approach handling data in a sane fashion on your project.
Use scripts to automate routine execution of tasks
This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your project directory.
In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and create additional sub-categories inside it as needed.
You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:
Most of the time, it's best to design these scripts assuming that the "current working directory" is the project root directory. This will simplify how you execute the scripts, and you'll save yourself from injecting "cd" commands into the documentation that you build.
There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.
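As a sketch of the first guideline, a hypothetical script like this one can reference paths relative to the project root, with no cd commands needed, as long as you invoke it from there:

# ./scripts/train.sh -- invoke from the project root as `bash scripts/train.sh`
# All paths below are relative to the project root (file names are hypothetical).
python src/train.py --input data/clean.csv --output models/model.pkl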
If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:
bash scripts/ci/build.sh
can get boring over time. If you instead put that line in a Makefile as follows:
build:
	bash scripts/ci/build.sh
then you can execute the command make build from the project root, and save yourself keystrokes.
You can help your colleagues get set up by creating a script for them! For example, you can write one that has the following commands:
# ./scripts/setup.sh
# Run this with `source scripts/setup.sh` so that `conda activate` takes effect in your current shell.
export PROJECT_ENV_NAME=______________  # replace with your env name
conda env create -f environment.yml || mamba env create -f environment.yml
conda activate $PROJECT_ENV_NAME
# Install custom source
pip install -e .
# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager
# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."
This script will help you create the project's conda environment, install your custom source package, install the Jupyter widget extensions (needed for niceties like tqdm progress bars in notebooks!), and install pre-commit hooks. Saves a bunch of time downstream!
If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then ensure that you have it set up such that upstream computational steps, especially those that are computationally expensive, execute independently of the computationally cheap ones that depend on them. One example, provided by one of my reviewers, Simon, is "intermediate data generation" vs. "data visualization". To quote:
I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if it takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.
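As a sketch of how to encode that separation in a Makefile (file and script names are hypothetical), the visualization step depends on the intermediate data file, so editing only the plotting code never triggers the expensive data-generation step:

# Expensive step: reruns only if the raw data or the generation script changes.
data/processed.csv: scripts/generate_data.py data/raw.csv
	python scripts/generate_data.py data/raw.csv data/processed.csv

# Cheap step: reruns when the plotting script or the processed data changes,
# and otherwise reuses data/processed.csv as-is.
figures/summary.png: scripts/plot.py data/processed.csv
	python scripts/plot.py data/processed.csv figures/summary.png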