Organize your projects by leveraging categories
This philosophy is best reflected in software development: The best software developers are masters of organization. If you go into a GitHub repository and browse a few well-structured projects, you'll easily glean this point. These projects keep things simple, are modular, have awesome documentation, and rely on single sources of truth for everything. Plain, unambiguous, and organized -- these are the best adjectives to describe them.
At the core of this philosophy is the fact that these developers have thought carefully about categories of things. You can think of a project as being composed of a series of categories of distinct entities: data, notebooks, scripts, source code, and more. They relate to each other in unique ways: data are consumed by notebooks, notebooks import source code, etc. If we're extremely clear about the categories of things that exist for our project, and strive to cleanly describe the relationships between these categories of things, then our projects will become very well-organized.
I believe data science projects ought to be organized the same way, especially if they are collaborative projects involving more than one person. As such, it should be possible for us to adopt a sane way of working that is heavily inspired by the software development world. We thus inject structure into our projects.
Now, structure for the sake of structure is pointless; structure should exist for our utilitarian benefit. We impose a particular file structure so that we can navigate through it and find what we want quickly. We structure our source code so that we can find what we need more easily. With clearly defined categories of things and their relationships, we can more cleanly collaborate with others.
Use scripts to automate routine execution of tasks
This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your project directory.
In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place these scripts in the scripts/ directory, and provide additional sub-categories inside there.
You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:
Most of the time, it's optimal to design these scripts assuming that the "current working directory" is the project root directory. This will simplify how you execute the scripts. You'll also save on injecting "cd" commands into the documentation that you build.
There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.
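The two guidelines above can be sketched as a pair of small shell helpers. This is a minimal sketch, not a prescribed implementation: the function names `require_project_root` and `run_in_subdir` are hypothetical, and so is the assumption that a file named environment.yml marks the project root.

```shell
# Hypothetical helper: fail fast unless the script was launched from the
# project root. Assumption: the root is identified by a marker file
# (environment.yml by default; pass another name as the first argument).
require_project_root() {
    marker="${1:-environment.yml}"
    if [ ! -f "$marker" ]; then
        echo "Run this script from the project root (missing $marker)." >&2
        return 1
    fi
}

# Hypothetical helper for the exception to the rule: run a command inside a
# subdirectory. The parentheses start a subshell, so the caller's working
# directory is unchanged after the command finishes.
run_in_subdir() {
    subdir="$1"
    shift
    ( cd "$subdir" && "$@" )
}
```

A script could call `require_project_root || exit 1` as its first step, and use something like `run_in_subdir docs make html` (a made-up example) only for the steps that truly need to run elsewhere.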
If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:
bash scripts/ci/build.sh
can get boring over time. If you instead put that line in a Makefile as follows:
build:
	bash scripts/ci/build.sh
then you can execute the command make build from the project root, and save yourself keystrokes.
You can help your colleagues get set up by creating a script for them! For example, you can write one that has the following commands:
# ./scripts/setup.sh
# No spaces are allowed around "=" in a shell assignment.
export PROJECT_ENV_NAME=______________  # replace with your env name
# Create the environment from the spec file; fall back to mamba if conda fails.
conda env create -f environment.yml || mamba env create -f environment.yml
conda activate "$PROJECT_ENV_NAME"

# Install custom source
pip install -e .

# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# Install pre-commit hooks
pre-commit install

echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."
A script like this saves you and your colleagues a bunch of time downstream!
If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then set it up so that upstream computational steps, especially the computationally expensive ones, execute independently of the computationally cheap steps that depend on them. One example, provided by one of my reviewers, Simon, is "intermediate data generation" vs. "data visualization". To quote:
I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if it takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.
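One way to keep the expensive step separate from the cheap one is sketched below. Everything named here is an assumption made for illustration: the file paths, the `FORCE` switch, and the function names `generate_data` and `make_figure`. The bodies are stand-ins for a real computation and a real plotting step.

```shell
# Sketch only: the paths and both step bodies are hypothetical stand-ins.
DATA_FILE="${DATA_FILE:-data/intermediate.csv}"
FIGURE_FILE="${FIGURE_FILE:-figures/summary.txt}"

# Expensive upstream step: skipped when its output already exists.
# Set FORCE=1 to regenerate the data on purpose.
generate_data() {
    if [ -f "$DATA_FILE" ] && [ "${FORCE:-0}" != "1" ]; then
        echo "Reusing existing $DATA_FILE"
        return 0
    fi
    mkdir -p "$(dirname "$DATA_FILE")"
    echo "Generating $DATA_FILE"
    # Stand-in for the slow computation.
    printf 'x,y\n1,2\n3,4\n' > "$DATA_FILE"
}

# Cheap downstream step: reruns every time, reading only the saved data.
make_figure() {
    if [ ! -f "$DATA_FILE" ]; then
        echo "Missing $DATA_FILE; run generate_data first." >&2
        return 1
    fi
    mkdir -p "$(dirname "$FIGURE_FILE")"
    # Stand-in for plotting: count the data rows, excluding the header.
    rows=$(( $(wc -l < "$DATA_FILE") - 1 ))
    echo "rows=$rows" > "$FIGURE_FILE"
}
```

With this split, tweaking the visualization means rerunning only `make_figure`; the data is rebuilt only when its file is missing or when `FORCE=1` is set. The same dependency could also be expressed as file targets in a Makefile.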
The philosophies that ground the bootstrap
Here are the philosophies that ground the bootstrap. Internalizing these philosophies will help you understand where we're coming from.