Organize your projects by leveraging categories

This philosophy is best reflected in software development: the best software developers are masters of organization. Browse a few well-structured projects on GitHub and you'll see this immediately. These projects keep things simple, are modular, have excellent documentation, and rely on single sources of truth for everything. Plain, unambiguous, and organized -- these are the adjectives that best describe them.

At the core of this philosophy is the fact that these developers have thought carefully about categories of things. You can think of a project as being composed of a series of categories of distinct entities: data, notebooks, scripts, source code, and more. They relate to each other in unique ways: data are consumed by notebooks, notebooks import source code, etc. If we're extremely clear about the categories of things that exist for our project, and strive to cleanly describe the relationships between these categories of things, then our projects will become very well-organized.

I believe data science projects ought to be organized the same way, especially when they are collaborative projects involving more than one person. As such, we can adopt a sane way of working that is heavily inspired by the software development world. We thus inject structure into our projects.

Now, structure for the sake of structure is pointless; structure should exist for our utilitarian benefit. We impose a particular file structure so that we can navigate through it and find what we want quickly. We structure our source code so that we can find what we need more easily. With clearly defined categories of things and their relationships, we can more cleanly collaborate with others.

The philosophies that ground the bootstrap

Use scripts to automate routine execution of tasks

This idea should be pretty obvious. If you find yourself executing the exact same commands over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your project directory.
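
As a minimal sketch (the file name and commands below are illustrative, not prescribed), the routine "format the code, then run the tests" dance could be collected into a single script:

#!/usr/bin/env bash
# scripts/check.sh (hypothetical name): run the routine checks in one go.
set -euo pipefail

black .     # format the code
pytest -x   # run the test suite, stopping at the first failure

Now the routine is a single command, bash scripts/check.sh, rather than a sequence you have to remember.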

Where should these scripts live?

In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and create additional sub-categories inside it.
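
For example, one possible layout (the sub-category names here are illustrative, not prescribed) is:

mkdir -p scripts/ci      # continuous integration helpers, e.g. build and test scripts
mkdir -p scripts/data    # one-off data download and cleaning scripts
mkdir -p scripts/setup   # environment bootstrap scripts for new collaborators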

How do I decide what language to write those scripts in?

You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:

  • If you're doing text processing of files, or otherwise leveraging functions from your project's custom source, then you might want to write them in Python. (see: Place custom source code inside a lightweight package)
  • If you're doing filesystem manipulation, or repeated serial execution of command line tools, a bash script is a great idea (see the sketch after this list).
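
As a minimal sketch of the latter case (the script name, paths, and commands are assumptions for illustration):

#!/usr/bin/env bash
# scripts/data/archive_raw.sh (hypothetical): filesystem shuffling plus
# serial execution of command-line tools is a natural fit for bash.
set -euo pipefail

mkdir -p data/archive
cp data/raw/*.csv data/archive/
gzip -f data/archive/*.csv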

What else should I pay attention to when building these scripts?

Design for project root execution

Most of the time, it's optimal to design these scripts assuming that the "current working directory" is the project root directory. This will simplify how you execute the scripts, and you'll avoid having to sprinkle "cd" commands throughout the documentation that you build.
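
For instance, a data-download script written this way (the script name, paths, and URL below are hypothetical) refers to everything relative to the project root and is invoked from there:

#!/usr/bin/env bash
# scripts/data/download.sh (hypothetical): all paths are relative to the project root.
set -euo pipefail

mkdir -p data/raw
curl -L -o data/raw/dataset.csv https://example.com/dataset.csv

# Invoked from the project root as:
#   bash scripts/data/download.sh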

There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.
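
As a sketch of such an exception (the directory and commands are placeholders), a script whose every step operates on files in one subdirectory can reasonably switch into it up front:

#!/usr/bin/env bash
# Hypothetical: every subsequent command works on files under data/raw/.
set -euo pipefail
cd data/raw

gunzip -k *.gz
wc -l *.csv > line_counts.txt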

Leverage Makefiles

If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:

bash scripts/ci/build.sh

can get tedious over time. If you instead put that line in a Makefile as follows:


build:
	bash scripts/ci/build.sh

then you can execute the command make build from the project root, and save yourself keystrokes.

Help your colleagues with a "bootstrap" script

You can help your colleagues get set up by creating a script for them! For example, you can write one that has the following commands:

# ./scripts/setup.sh
# Run this with "source scripts/setup.sh" so that conda activate works
# and the activated environment persists in your shell.

export PROJECT_ENV_NAME=______________  # replace with your env name (no spaces around the =)
conda env create -f environment.yml || mamba env create -f environment.yml
conda activate $PROJECT_ENV_NAME

# Install custom source
pip install -e .

# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."

This script will help you:

  1. Create the conda environment. (see: Create one conda environment per project)
  2. Install the custom source
  3. Install the Jupyterlab IPywidgets extension (necessary for progress bars like tqdm!)
  4. Install pre-commit hooks (see: Set up pre-commit hooks to automate checks before making git commits)

Saves a bunch of time downstream!

Separate computationally expensive steps from computationally cheap steps

If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then set it up so that upstream computational steps, especially the computationally expensive ones, execute independently of the computationally cheap ones that depend on them. One example, provided by one of my reviewers, Simon, is "intermediate data generation" vs. "data visualization". To quote:

I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if it takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.
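
A minimal sketch of that separation (the script names, paths, and flags below are hypothetical): the expensive step writes an intermediate artifact and skips the work when that artifact already exists, while the cheap step only reads it.

#!/usr/bin/env bash
# scripts/pipeline/preprocess.sh (hypothetical): the expensive step.
# Writes an intermediate artifact; skips the work if it already exists.
set -euo pipefail
if [ ! -f data/processed/results.csv ]; then
    python scripts/pipeline/make_dataset.py --output data/processed/results.csv  # slow
fi

#!/usr/bin/env bash
# scripts/pipeline/visualize.sh (hypothetical): the cheap step.
# Reads the cached intermediate artifact; never triggers the expensive step.
set -euo pipefail
python scripts/pipeline/make_figures.py --input data/processed/results.csv --output figures/  # fast

With that split, the visualization script can be re-run as often as needed without paying the cost of regenerating the data.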