Use scripts to automate routine execution of tasks

This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your project directory.

Where should these scripts live?

In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and create additional sub-directories inside it for further categorization.

How do I decide what language to write those scripts in?

You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:

  • If you're doing text processing of files, or otherwise leveraging functions from your project's custom source, then you might want to write them in Python. (see: Place custom source code inside a lightweight package)
  • If you're doing filesystem manipulation, or repeated serial execution of command line tools, a bash script is a great idea.

What else should I pay attention to when building these scripts?

Design for project root execution

Most of the time, it's optimal to design these scripts assuming that the "current working directory" is the project root directory. This will simplify how you execute the scripts, and it saves you from having to inject "cd" commands into the documentation that you build.

There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.
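For instance, a minimal sketch of such a script might look like the following (the script name, data paths, and flags are hypothetical):

# scripts/processing/clean_data.sh (hypothetical example)
# Assumes the current working directory is the project root.
set -euo pipefail

# Paths are written relative to the project root -- no "cd" needed.
python scripts/processing/clean.py \
    --input data/raw/measurements.csv \
    --output data/processed/measurements.parquet

Run it as bash scripts/processing/clean_data.sh from the project root, and every relative path resolves correctly.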

Leverage Makefiles

If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:

bash scripts/ci/build.sh

can get boring over time. If you instead put that line in a Makefile as follows:


.PHONY: build   # "build" is a command to run, not a file this rule produces
build:
	bash scripts/ci/build.sh

then you can execute the command make build from the project root, and save yourself keystrokes.

Help your colleagues with a "bootstrap" script

You can help your colleagues get set up by creating a script for them! For example, you can write one that has the following commands:

# ./scripts/setup.sh

export PROJECT_ENV_NAME=______________  # replace with your env name
conda env create -f environment.yml || mamba env create -f environment.yml
eval "$(conda shell.bash hook)"  # enables "conda activate" inside a non-interactive script
conda activate $PROJECT_ENV_NAME

# Install custom source
pip install -e .

# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."

This script will help you:

  1. Create the conda environment. (see: Create one conda environment per project)
  2. Install the custom source
  3. Install the Jupyterlab IPywidgets extension (necessary for progress bars like tqdm!)
  4. Install pre-commit hooks (see: Set up pre-commit hooks to automate checks before making git commits)

Saves a bunch of time downstream!

Separate computationally expensive steps from computationally cheap steps

If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then ensure that you have it set up such that upstream computational steps, especially those that are computationally expensive, execute independently of the computationally cheap ones that depend on them. One example, provided by one of my reviewers, Simon, is "intermediate data generation" vs. "data visualization". To quote:

I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if it takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.
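One lightweight way to honor this in a plain bash script (the script and file names here are hypothetical) is to regenerate the expensive intermediate data only when it is missing, while always running the cheap visualization step:

# scripts/analysis/figures.sh (hypothetical)
# Regenerate the expensive intermediate data only if it does not exist yet;
# the cheap visualization step always runs.
if [ ! -f data/processed/simulations.parquet ]; then
    python scripts/analysis/generate_data.py   # computationally expensive
fi
python scripts/analysis/plot_figures.py        # computationally cheap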

Set up pre-commit hooks to automate checks before making git commits

Why use pre-commit hooks?

One way to prevent yourself from committing code that is not properly checked is to use pre-commit hooks. This is a feature of Git that allows you to automatically run checks before changes are committed to the repository history. Because the checks run automatically, you set them up once, usually when you first clone the repository, and never need to think about them again.

How do I set up pre-commit hooks?

You can install the pre-commit framework, which lets you easily configure pre-commit hooks to run.

The gist of the installation steps is in the bash commands below, but you should read the pre-commit website for a fuller understanding.

conda install -c conda-forge pre-commit
pre-commit sample-config > .pre-commit-config.yaml

Now, go and edit .pre-commit-config.yaml -- add other pre-commit checks, for example. (See below for an example that you can use.) Then, run:

pre-commit install
pre-commit run --all-files   # run the checks against all of your files

What pre-commit hooks are good to install?

a.k.a. what would you put in your .pre-commit-config.yaml? Here's a sane collection of starter things that I usually include, taken from my Network Analysis Made Simple repository.

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/kynan/nbstripout
    rev: master
    hooks:
      - id: nbstripout
        files: ".ipynb"

nbstripout is a super important one -- it ensures that all of my notebook outputs are stripped before committing them to the repository! (Otherwise, you'll end up bloating your repository with large notebooks.)

How does this relate to continuous integration pipeline checks?

(For a refresher, or if you're not sure what CI pipeline checks are, see Build a continuous integration pipeline for your source.)

CI pipeline checks are also a form of automated checks that you can put into your workflow. Ideally, everything that is checked for in your pre-commit hooks should be checked for in your CI pipeline.

So what's the difference, then? Here are my thoughts on this:

In pre-commit hooks, you generally run the lightweight checks: the ones that are annoying to run manually all the time but that also execute very quickly. Things like code style checks, for example, or checks that ensure text files end with a single trailing newline.

In the CI system, you run those checks in addition to the longer-running test suite (see: Write tests that test your custom code). So the CI system behaves as a backup to the pre-commit hooks.
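In practice, the CI job often boils down to something like the following sketch (assuming pytest is your test runner, as in the baseline environment file later in these notes):

# lightweight checks: the same ones the pre-commit hooks run locally
pre-commit run --all-files

# heavyweight checks: the longer-running test suite
pytest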

Get prepped per project

Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.

Firstly, some overall ideas to ground the specifics:

Some ideas pertaining to Git:

Notes that pertain to organizing files:

Notes that pertain to your compute environment:

And notes that pertain to good coding practices:

Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.

Create one conda environment per project

Why use one conda environment per project

If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.

"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences of having his code not working in one project even as code in another does. And finding out one day before an important presentation, right when you need to put in new versions of figures that were made before. The horror!

You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.
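Concretely, that workflow looks something like this (the project names are hypothetical, and the environment names come from the name: field of each environment.yml):

# each project carries its own environment.yml (see the baseline below)
conda env create -f ~/projects/customer-churn/environment.yml    # creates env "customer-churn"
conda env create -f ~/projects/protein-design/environment.yml    # creates env "protein-design"

# activate whichever environment matches the project you are working on
conda activate customer-churn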

How do you set up your conda environment files

Here is a baseline that you can copy and modify at any time.

name: project-name-goes-here  ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels:      ## Add any other channels below if necessary
- conda-forge
dependencies:  ## Prioritize conda packages
- python=3.10
- jupyter
- conda
- mamba
- ipython
- ipykernel
- numpy
- matplotlib
- scipy
- pandas
- pip
- pre-commit
- black
- nbstripout
- mypy
- flake8
- pycodestyle
- pydocstyle
- pytest
- pytest-cov
- pytest-xdist
- pip:  ## Add in pip packages if necessary
  - mkdocs
  - mkdocs-material
  - mkdocstrings
  - mknotebooks

If a package exists in both conda-forge and pip and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without pip clobbering other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip also have a dependency solver; even so, staying consistent is preferable, though mixing-and-matching is alright if you know what you're doing.)

This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.

How do you decide which versions of packages to use?

Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.

However, there may come a time when a new version of a package brings a capability you need. That is when you may wish to pin that package to at least that version. (See below for the syntax needed to pin a version.) Conversely, a new package version may break compatibility -- in that case, you will want to pin it to a maximum version instead.

It's not always obvious when pinning is necessary, though, so be sure to keep your environment.yml under version control: if a pin turns out to be wrong, you can always revert it.

If you wish, you can pin versions to a minimum, a maximum, or a specific version, using version modifiers.

  • For conda, they are >, >=, =, <= and <. (You should be able to grok what is what!)
  • For pip, they are >, >=, ==, <= and <. (Note: for pip, it is double equals == and not single equals =.)

So when do you use each of the modifiers? (Concrete examples follow the list below.)

  • Use =/== sparingly while in development: being stuck on a particular version makes it difficult to update other packages alongside it.
  • Use <= and < to prevent conda/pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
  • Use >= and > to prevent conda/pip from installing a package below a certain version. This is helpful if you've come to depend on breaking API changes from older versions.
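To make the syntax concrete, here are a few command-line examples (the package name and version numbers are purely illustrative):

# conda uses a single equals sign; quote the spec so the shell ignores > and <
conda install "pandas>=1.5"
conda install "pandas=1.5.3"

# pip uses a double equals sign for exact pins
pip install "pandas>=1.5,<2.0"
pip install "pandas==1.5.3"

The same specifiers go directly into environment.yml, e.g. - pandas>=1.5 under dependencies, or - pandas==1.5.3 under the pip: section.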

When do you upgrade/install new packages?

Upgrading and/or installing packages should be done on an as-needed basis. There are two ways to upgrade or install packages that I have found:

The principled way

The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:

conda env update -f environment.yml

The hacky way

The hacky way to do the upgrade is to directly conda or pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!
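For example (the package name and version are purely illustrative, and the environment name is hypothetical):

# the hacky way: install straight into the active environment...
conda activate project_environment
conda install "pandas>=2.0"          # or: pip install "pandas>=2.0"
# ...then remember to record the new version in environment.yml by hand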

Ensure your environment kernels are available to Jupyter

By practicing "one project gets one environment", then ensuring that those environments' Python interpreters are available to Jupyter is going to be crucial. If you find that your project's environment Python is unavailable, then you'll need to ensure that it's available. To do so, ensure that the Python environment has the package ipykernel. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:

# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME

Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)

Further tips

Now, how should you name your conda environment? See the page: Sanely name things consistently!

Set up your project with a sane directory structure

Why set up your project with a sane directory structure

Doing so will help you quickly and easily find things. This is crucial when navigating your data project. If you don't do so, you will likely end up being utterly confused as to where things are located.

What does a sane directory structure look like

I am going to show you one particular example, but you can adapt it to however you like.

|- informative-project-name-here/
   |- data/          # never add anything here into source control
   |- notebooks/     # divide by usernames if needed
   |- scripts/       # basically for automation
   |- importable_name/
      |- __init__.py
      |-...
   |- tests/      # test suite
   |- README.md
   |- pyproject.toml # use this, not setup.py!
   |-...

The purpose of each directory is annotated in each line. That said, you can find relevant information in the following pages:

Organize your projects by leveraging categories

This philosophy is best reflected in software development: The best software developers are masters of organization. If you go into a GitHub repository and browse a few well-structured projects, you'll easily glean this point. These projects keep things simple, are modular, have awesome documentation, and rely on single sources of truth for everything. Plain, unambiguous, and organized -- these are the best adjectives to describe them.

At the core of this philosophy is the fact that these developers have thought carefully about categories of things. You can think of a project as being composed of a series of categories of distinct entities: data, notebooks, scripts, source code, and more. They relate to each other in unique ways: data are consumed by notebooks, notebooks import source code, etc. If we're extremely clear about the categories of things that exist for our project, and strive to cleanly describe the relationships between these categories of things, then our projects will become very well-organized.

I believe data science projects ought to be organized the same way, especially if they are collaborative projects involving more than one person. As such, it should be possible for us to adopt a sane way of working that is heavily inspired by the software development world. We thus inject structure into our projects.

Now, structure for the sake of structure is pointless; structure should exist for our utilitarian benefit. We impose a particular file structure so that we can navigate through it and find what we want quickly. We structure our source code so that we can find what we need more easily. With clearly defined categories of things and their relationships, we can more cleanly collaborate with others.

Place custom source code inside a lightweight package

Why write a package for your custom source code

Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?

As soon as you did that, you created two sources of truth for that one function.

Even if you intended to modify the function and test the effect of the modification on the rest of the code, you still could have done better -- now every change has to be kept in sync across both notebooks by hand.

A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.

How to create a custom source package for a project

Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.

Under your project root, ensure you have the following files and directories:

|- project_name/    # the package; should have the same name as the conda environment
   |- data/         # for all data-related functions
      |- loaders.py # convenience functions for loading data
      |- schemas.py # this is for pandera schemas
   |- __init__.py   # this is necessary to mark the directory as a package
   |- paths.py      # this is for path definitions
   |- utils.py      # utility functions that you might need
   |- ...
|- tests/
   |- test_utils.py # tests for utility functions
   |- ...
|- pyproject.toml   # replacement for setup.py

If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)

If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root

pyproject.toml should look like this:

[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."

Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:

conda activate project_environment
pip install -e .

This will install the source package in development mode. As you continue to add more code into the custom source package, it will be instantly available to you project-wide.

Now, in your notebooks and scripts, you can import anything from the custom source package.

Note: If you've read the official Python documentation on packages, you might see that there is nothing special about the name src/. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried organizing projects a few ways, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project, and we can keep setup.py inside src/ too, thus eliminating clutter from the top-level directory.

How often should the package be updated?

As often as you need it!

Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!

Is there an easier way to set this all up?

It feels like a lot to remember, right? Fret not! You can use pyds-cli to easily bootstrap a new project environment!

Build your projects thinking in terms of pipelines

Why think in "pipelines"?

Our data science projects, at some point, end up looking a lot like data pipelines. Data flows through a sequence of data preparation functions, which yield so-called "clean" data. That cleaned data then flows through a trained model, which then returns a prediction. The prediction then flows through some automated reporting system, giving end-users a way to consume the result.

The science portion of data science includes the art of figuring out how that pipeline looks. Once the science portion of the work, which is essentially scoping out what we need to automate, is complete, we can now take things into an engineering paradigm where we build pipelines to automate the good things we've uncovered.

What to look out for in pipelining tools?

The biggest thing to look out for is the ability to avoid repeating unnecessary computations. Tools that do this well will provide you with a syntax for naming build steps and defining dependencies between them. They will also automatically cache intermediate results.
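To see what that buys you, here is a hand-rolled sketch in bash of the caching behavior these tools automate (file and script names are hypothetical):

# rebuild the intermediate data only if it is missing or older than the raw input
if [ ! -f data/clean.parquet ] || [ data/raw.csv -nt data/clean.parquet ]; then
    python scripts/clean_data.py
fi
python scripts/plot.py   # the downstream step reuses the cached intermediate file

Pipelining tools generalize this pattern across an entire dependency graph, so you never have to write such checks by hand.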

Other than that, some pipelining tools come with niceties. One example is a "graph view" that lets you see the dependency graph between steps. Another example is a library of built-in steps for commonly-needed tasks.

What pipelining tools exist?

Here's a quick overview of pipelining tools that are available. One thing to keep in mind is that the ecosystem is changing quickly. As such, I would advise two things: Firstly, treat this listing as an incomplete and evolving document. Secondly, be ready to learn multiple tools and scope out whether they work well for your use cases.

Make

make is the "big grand daddy" of pipelining tools. It is also the lightest weight tool that you can use. It's usually shipped with every UNIX-like system, making it ubiquitous and hence easy to get started with. make uses a Makefile that lives in the project repository root directory. There's a delightful tutorial on how to use Make that you can follow to learn how to use Make.

While scoping out the tooling around Make, I learned that there are convenient companion tools available. One such example is the Python package makefile2dot, which lets you visualize the Makefile dependency graph. (This composability of tools that each do one and only one thing well is very much in line with the UNIX philosophy.)

Make is usually run on a local machine. In my experience, it's most convenient for providing a top-level command-line interface to interact with the project's files. For example, I would put code style checks under a style command, allowing me to execute the command make style to conveniently run all code style checks.

If you're starting with pipelining, I would recommend starting with Make ahead of the rest of the tools listed below, as its simplicity and ubiquity will help you master the concepts of pipelining.

Snakemake

Snakemake started as a bioinformatics pipelining tool but eventually grew to be a general-purpose pipelining tool. If my recollection of history is correct, it initially started as a tool designed for use on "local" (though powerful) systems, such as the heavy-duty Linux workstations that are bioinformaticians' daily driver machines. Eventually, it grew to support distributed cluster/cloud workflows as well. You can check out Snakemake's website and docs.

Kedro

Kedro is built by Quantum Black, which is McKinsey's specialized data science consultancy. Kedro is somewhat opinionated about certain things, and some of their suggested practices might look slightly different from the exact ways I suggest to do something here, but I believe the underlying philosophy does make sense. You can check out Kedro here.

Prefect

Prefect is an open-source pipeline orchestration tool with a commercial offering by the company of the same name. One nice thing about Prefect is that its syntax is entirely in Python code, and its orchestration server comes with a dashboard for live monitoring of the jobs you've defined.

Kubeflow

Kubeflow is a pipelining tool designed to work on Kubernetes. Its primary use case is machine learning pipelines, which sometimes is one of the end products of data science projects. If your organization has already made significant investments in using Kubernetes, then Kubeflow might be a viable option for you to consider.

GitHub Actions

If you live in the GitHub ecosystem, then GitHub Actions is not a bad option to consider. Its syntax for configuring builds is easy to learn, it comes with a graph view, and the ability to trigger builds automatically is superb.