Data Science Bootstrap

One project should get one git repository

Why one project should get one Git repository

This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:

source code (see: Place custom source code inside a lightweight package)
documentation (see: Write effective documentation for your projects)
data descriptors (see: Write data descriptor files for your data sources)
environment/configuration files (see: Create one conda environment per project and Create runtime environment variable configuration files for each of your projects)

In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the beginning, with docs in one place, out of sync with the source code in another place... you get where we're going with this point.

How to get this implemented

Easy! Create your Git repo for the project, and then start putting stuff in there :).

Enough said here!

What should you name the Git repo? See the page: Sanely name things consistently

After you have set up your Git repo, make sure to Set up your project with a sane directory structure.

Also, Set up an awesome default gitignore for your projects!

Get prepped per project

Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.

Firstly, some overall ideas to ground the specifics:

Some ideas pertaining to Git:

Notes that pertain to organizing files:

Notes that pertain to your compute environment:

And notes that pertain to good coding practices:

Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.

Define single sources of truth for your data sources

Why define single sources of truth for data

Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet. This was before collaboratively editing a single Excel spreadsheet was possible. To avoid conflicts, someone creates a spreadsheet_v2.xlsx, and then at the same time, another person creates spreadsheet_TE_edits.xlsx.

Which version do you trust?

The worst part? Neither of those spreadsheets contained purely raw data; they were a mix of both raw data and derived data (i.e. columns that are calculated from other columns). The derived data are not documented with why and how they were calculated; their provenance is unknown, in that we don't know who made those changes or whom to ask about those columns.

Rather than wrestling with multiple sources of truth, a data analysis workflow can be much more streamlined by defining a single source of truth for raw data that does not contain anything derived, followed by calculating the derived data in custom source code (see: Place custom source code inside a lightweight package), written in such a way that it yields logical derived data structures for the problem (see: Iteratively scope out and define the most appropriate data structures for your problem). Those single sources of truth can also be described by a ground truth data descriptor file (see: Write data descriptor files for your data sources), which gives you the provenance of the file and a human-readable description of each of the sources.
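As a minimal sketch of this split, assume a hypothetical raw table with `mass_g` and `volume_ml` columns. The column names and the functions `load_raw` and `add_density` are illustrative assumptions, not part of any real project. The raw data is read untouched, and the derived column is computed in code, where the "why and how" is written down:

```python
import csv
import io


def load_raw(handle):
    """Read the raw table exactly as stored; no derived columns are added here."""
    return list(csv.DictReader(handle))


def add_density(rows):
    """Return new rows with a derived `density_g_per_ml` column.

    The raw rows are left unmodified, so the derivation lives only in code.
    """
    derived = []
    for row in rows:
        new_row = dict(row)
        new_row["density_g_per_ml"] = float(row["mass_g"]) / float(row["volume_ml"])
        derived.append(new_row)
    return derived


# Stand-in for the single raw file; in practice this would be opened from
# the one agreed-upon location.
raw_file = io.StringIO("sample,mass_g,volume_ml\na,10.0,4.0\nb,3.0,2.0\n")
raw_rows = load_raw(raw_file)
derived_rows = add_density(raw_rows)
```

Because the derivation is a function rather than a hand-edited column, anyone can see how the derived values were produced and rerun it against the raw source.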

Examples of single sources of data truth in action

Data on an s3-like bucket

If your organization uses the cloud, then AWS S3 (or compatible bucket stores) might be available. A data source might be dumped on there and referenced by a single URL. That URL is your "single source of truth".

Data on an internal data store

Your organization might have the resources to build out a data store with proper access controls and the like. They might provide a unique key and a software API (RESTful, or a Python or R package) to download data in an easy fashion. That "unique key" + the API defines your single source of truth.

Data on a shared network store

Longer-lived organizations might have started out with a shared networked filesystem, with access controls granted by UNIX-style user groups. In this case, the /path/to/the/data/file + access to the shared filesystem is your source of truth.

Data on the internet

This one should be easy to grok: a URL that points to the exact CSV, Parquet, or Excel table, or a zip dump of images, is your unique identifier.
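Whichever kind of location applies, one way to keep it "single" is to record it once in code and have everything else look it up by name. The sketch below is one possible shape for that, under the assumption that a small in-code registry is acceptable for the project; the `DataSource` class, the `locate` helper, and every name and location in it are made-up placeholders:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One raw data source and the single location everyone reads it from."""

    name: str
    location: str  # URL, bucket URI, filesystem path, or data-store key
    description: str


# All names and locations below are hypothetical placeholders.
SOURCES = {
    source.name: source
    for source in [
        DataSource(
            "assay_results",
            "s3://example-bucket/assay/results.csv",
            "Raw output, stored on an S3-like bucket",
        ),
        DataSource(
            "sample_sheet",
            "/path/to/the/data/file",
            "Sample metadata on the shared network store",
        ),
    ]
}


def locate(name):
    """Look up the one agreed-upon location for a named data source."""
    return SOURCES[name].location
```

Notebooks and scripts would then call `locate("sample_sheet")` instead of hard-coding the path, so a change of location is made in exactly one place.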

Keep your notebooks organized with logical categories

In my experience, there are three types of notebooks that get written.

Prototyping notebooks go under `notebooks/`

These notebooks are drafting grounds for "production" code. We use Jupyter notebooks as an experimentation playground. (see: Use Jupyter as an experimentation playground). They do not need to be kept running reliably/reproducibly, and essentially are considered "disposable".

If you are collaborating with colleagues on a project, you can categorize notebooks by their primary author. For example, if I am working with Lily and Arkadij on a project, we can each get our own "user spaces" in there while agreeing not to touch each other's notebooks:

project/
- notebooks/
  - lily/     # lily's notebooks go here
  - arkadij/  # arkadij's notebooks go here
  - eric/     # eric's notebooks go here

Documentation notebooks go under `docs/`

These notebooks are written in the original spirit of Jupyter notebooks. They combine prose, code and code-generated figures. They contain a narrative, a data story. One may say they are "production", in that someone will read them and act on them. They need to execute reliably from top to bottom, usually in a continuous integration system (see: Build a continuous integration pipeline for your source), where they can be built into documentation using MkDocs and mknotebooks.

For these notebooks, we might choose to keep them in the docs/ directory:

project/
- docs/
  - some_notebook.ipynb

Application notebooks go under `app/`

Sometimes you might opt to use voila to build front-end applications for those whom you serve. This is a convenient option because you don't have to jump out of a Jupyter context if you're already in there. These notebooks are considered "production" as well. However, because they are code embedded in JSON, they are more difficult to diff with git.

For these notebooks, you probably want to keep them in a directory named app/, where anything that becomes front-facing to the clients we serve is stored:

project/
- app/
  - notebook_app.ipynb

Use scripts to automate routine execution of tasks

This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your directory.

Where should these scripts live?

In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and provide additional sub-categories inside there.

How do I decide what language to write those scripts in?

You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:

If you're doing text processing of files, or otherwise leveraging functions from your project's custom source, then you might want to write them in Python. (see: Place custom source code inside a lightweight package)
If you're doing filesystem manipulation, or repeated serial execution of command line tools, a bash script is a great idea.

What else should I pay attention to when building these scripts?

Design for project root execution

Most of the time, it's optimal to design these scripts assuming that the "current working directory" is the project root directory. This will simplify how you execute the scripts. You'll save on injecting "cd" commands into the documentation that you build.

There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.
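For scripts written in Python, a small guard can make the "run me from the project root" assumption explicit instead of silent. This is only a sketch: it assumes that a `pyproject.toml` file sits at the project root and nowhere else, and the helper names `is_project_root` and `require_project_root` are invented for illustration:

```python
import sys
from pathlib import Path

# Assumption: this file exists only at the project root.
ROOT_MARKER = "pyproject.toml"


def is_project_root(directory):
    """Return True if `directory` contains the root marker file."""
    return (Path(directory) / ROOT_MARKER).is_file()


def require_project_root():
    """Exit with a helpful message unless launched from the project root."""
    if not is_project_root(Path.cwd()):
        sys.exit(
            f"Run this script from the project root "
            f"(the directory containing {ROOT_MARKER})."
        )
```

Calling `require_project_root()` at the top of a script turns a confusing "file not found" error later on into an immediate, readable message.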

Leverage Makefiles

If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:

bash scripts/ci/build.sh

can get boring over time. If you instead put that line in a Makefile as follows:


build:
	bash scripts/ci/build.sh

then you can execute the command make build from the project root, and save yourself keystrokes.

Help your colleagues with a "bootstrap" script

You can help your colleagues get setup by creating a script for them! For example, you can write one that has the following commands:

# ./scripts/setup.sh

export PROJECT_ENV_NAME=______________  # replace with your env name (bash needs no spaces around "=")
conda env create -f environment.yml || mamba env create -f environment.yml
conda activate $PROJECT_ENV_NAME

# Install custom source
pip install -e .

# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."

This script will help you:

Create the conda environment. (see: Create one conda environment per project)
Install the custom source
Install the Jupyterlab IPywidgets extension (necessary for progress bars like tqdm!)
Install pre-commit hooks (see: Set up pre-commit hooks to automate checks before making git commits)

Saves a bunch of time downstream!

Separate computationally expensive steps from computationally cheap steps

If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then ensure that you have it set up such that upstream computational steps, especially those that are computationally expensive, execute independently of the computationally cheap ones that depend on them. One example, provided by one of my reviewers Simon, is "intermediate data generation" vs. "data visualization". To quote:

I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.
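One simple way to get this separation is to write the expensive step's output to disk and have the cheap step read it back. The sketch below assumes the intermediate result is JSON-serializable; the helper name `load_or_compute` and the cache location are hypothetical:

```python
import json
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return cached results if present; otherwise run `compute` and cache them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cheap path: reuse the intermediate data generated earlier.
        return json.loads(cache_path.read_text())
    # Expensive path: only taken when no cached result exists yet.
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result))
    return result
```

A visualization script could then call something like `load_or_compute("data/intermediate/summary.json", expensive_summary)` and redraw figures as often as needed without regenerating the data. Deleting the cached file forces a fresh computation.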

Place custom source code inside a lightweight package

Why write a package for your custom source code

Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?

As soon as you did that, you created two sources of truth for that one function.

Now... if you intended to modify the function and test the effect of the modification on the rest of the code, then you still could have done better.

A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.

How to create a custom source package for a project

Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.

In your project directory, ensure you have the following files:

|- project_name/     # should be the same name as the conda environment
  |- data/           # for all data-related functions
    |- loaders.py    # convenience functions for loading data
    |- schemas.py    # this is for pandera schemas
  |- __init__.py     # this is necessary
  |- paths.py        # this is for path definitions
  |- utils.py        # utility functions that you might need
  |- ...
|- tests/
  |- test_utils.py   # tests for utility functions
  |- ...
|- pyproject.toml    # replacement for setup.py

If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)

If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root
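To give a rough idea of what paths.py is for, here is a standard-library-only sketch. It is not how pyprojroot itself is implemented; it only illustrates the general idea of anchoring paths to the project root, and it assumes that `pyproject.toml` marks that root. The function name `find_project_root` is made up for this example:

```python
from pathlib import Path


def find_project_root(start, marker="pyproject.toml"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        if (directory / marker).exists():
            return directory
    raise FileNotFoundError(f"No {marker} found at or above {start}")


# Inside paths.py, one might then define (names are illustrative):
# ROOT = find_project_root(Path(__file__).parent)
# DATA_DIR = ROOT / "data"
# NOTEBOOKS_DIR = ROOT / "notebooks"
```

With paths defined once like this, notebooks in any subdirectory can import them instead of building fragile relative paths such as `../../data`.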

pyproject.toml should look like this:

[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."

Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:

conda activate project_environment
pip install -e .

This will install the source package in development mode. As you continue to add more code to the custom source package, it will be instantly available to you project-wide.

Now, in your projects, you can import anything from the custom source package.
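As an illustration of what this can look like, the sketch below shows a small function that might live in utils.py, together with a test for it. Both are hypothetical: `clean_column_name` is an invented example, and the two files are shown in one block only for brevity:

```python
# project_name/utils.py (hypothetical contents)
def clean_column_name(name):
    """Normalize a column name: trim, lowercase, and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


# tests/test_utils.py (hypothetical contents)
# In the real test file, the function would be imported from the package:
# from project_name.utils import clean_column_name
def test_clean_column_name():
    assert clean_column_name("  Sample ID ") == "sample_id"
```

Any notebook in the project can then use `from project_name.utils import clean_column_name`, and a fix to the function applies everywhere it is imported.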

Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried to organize a few ways, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project and we can keep the setup.py in src/ too, thus eliminating clutter from the top-level directory.

How often should the package be updated?

As often as you need it!

Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!

Is there an easier way to set this all up?

It feels like a lot to remember, right? Fret not! You can use pyds-cli to easily bootstrap a new project environment!
