Data Science Bootstrap

Place custom source code inside a lightweight package

Why write a package for your custom source code

Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?

As you as you did that, you created two sources of truth for that one function.

Now... if you intended to modify the function and test the effect of the modification on the rest of the code, then you still could have done better.

A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.

How to create a custom source package for a project

Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.

In your project project_name/ directory, ensure you have a few files:

|- project_name/   # should be the same name as the conda environment
  |- data/         # for all data-related functions
	 |- loaders.py # convenience functions for loading data
	 |- schemas.py # this is for pandera schemas
  |- __init__.py   # this is necessary
  |- paths.py      # this is for path definitions
  |- utils.py      # utiity functions that you might need
  |- ...
|- tests/
  |- test_utils.py # tests for utility functions
  |- ...
|- pyproject.toml. # replacement for setup.py

If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)

If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root

pyproject.toml should look like this:

[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."

Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:

conda activate project_environment
pip install -e .

This will install the source package in development mode. As you continue to add more code into the custom source package, they will be instantly available to you project-wide.

Now, in your projects, you can import anything from the custom source package.

Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried to organize a few ways, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project and we can keep the setup.py in src/ too, thus eliminating clutter from the top-level directory.

How often should the package be updated?

As often as you need it!

Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!

Is there an easier way to set this all up?

It feels like a lot to remember, right? Fret not! You can use pyds-cli to easily bootstrap a new project environment!

Pages that link here

Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness

Set up your project with a sane directory structure
Why setup your project with a sane directory structure Doing so will help you quickly and easily find things

Use scripts to automate routine execution of tasks
This idea should be pretty obvious

Build a continuous integration pipeline for your source
What is a continuous integration pipeline If you end up writing software (see: Place custom source code inside a lightweight package), especially code that you might need to depend on in the future, having a test suite is essential (see: Write tests that test your custom code)

Use pyprojroot to define relative paths to the project root
Why you should use pyprojroot If you follow the practice of One project should get one git repository, then everything related to the project will be housed inside that repository

Define single sources of truth for your data sources
Why define single sources of truth for data Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet

Follow the rule of one-to-one in managing your projects
What is this rule all about The one-to-one rule essentially means this

One project should get one git repository
Why one project should get one Git repository This helps a ton with organization

Define project-wide constants inside your custom package
Why you would want to define project-wide constants There are some "basic facts" about a project that you might want to be able to leverage project-wide

Adhere to best git practices
Why adhere to best Git practices? Git is a unique piece of software

Create one conda environment per project

Why use one conda environment per project

If you have multiple projects that you work on, but you install all project dependencies into a shared environment, then I guarantee you that at some point, you will run into dependency conflicts as you try to upgrade/update packages to try out new things.

"So what?" you might ask. Well, you'll end up breaking your code! Take this word of advice from someone who has had to deal with the consequences of having his code not working in one project even as code in another does. And finding out one day before an important presentation, right when you need to put in new versions of figures that were made before. The horror!

You will want to ensure that you have an isolated conda environment for each project to keep your projects insulated from one another.

How do you set up your conda environment files

Here is a baseline that you can copy and modify at any time.

name: project-name-goes-here  ## CHANGE THIS TO YOUR ACTUAL PROJECT
channels:      ## Add any other channels below if necessary
- conda-forge
dependencies:  ## Prioritize conda packages
- python=3.10
- jupyter
- conda
- mamba
- ipython
- ipykernel
- numpy
- matplotlib
- scipy
- pandas
- pip
- pre-commit
- black
- nbstripout
- mypy
- flake8
- pycodestyle
- pydocstyle
- pytest
- pytest-cov
- pytest-xdist
- pip:  ## Add in pip packages if necessary
  - mkdocs
  - mkdocs-material
  - mkdocstrings
  - mknotebooks

If a package exists in both conda-forge and pip and you rely primarily on conda, then I recommend prioritizing the conda package over the pip package. The advantage here is that conda's dependency solver can grab the latest compatible version without worrying about pip clobbering over other dependencies. (h/t my reviewer Simon, who pointed out that newer versions of pip have a dependency solver, though as far as possible, staying consistent is preferable, though mixing-and-matching is alright if you know what you're doing.)

This baseline helps me bootstrap conda environments. The packages that are in there each serve a purpose. You can read more about them on the page: Install code checking tools to help write better code.

How do you decide which versions of packages to use?

Initially, I only specify the version of Python I want, and allow the conda package manager to solve the environment.

However, there may come a time when a new package version brings a new capability. That is when you may wish to pin the version of that particular package to be at the minimum that version. (See below for the syntax needed to pin a version.) At the same time, the new package version may break compatibility -- in this case, you will want to pin it to a maximum package version.

It's not always obvious, though, so be sure to use version control

If you wish, you can also pin versions to a minimum, maximum, or specific one, using version modifiers.

For conda, they are >, >=, =, <= and <. (You should be able to grok what is what!)
For pip, they are >, >=, ==, <= and <. (Note: for pip, it is double equals == and not single equals =.)

So when do you use each of the modifiers?

Use =/== sparingly while in development: you will be stuck with a particular version and will find it difficult to update other packages together.
Use <= and < to prevent conda/pip from upgrading a package beyond a certain version. This can be helpful if new versions of packages you rely on have breaking API changes.
Use >= and > to prevent conda/pip from installing a package below a certain version. This is helpful if you've come to depend on breaking API changes from older versions.

When do you upgrade/install new packages?

Upgrading and/or installing packages should be done on an as-needed basis. There are two paths to do upgrade packages that I have found:

The principled way

The principled way to do an upgrade is to first pin the version inside environment.yml, and then use the following command to update the environment:

conda env update -f environment.yml

The hacky way

The hacky way to do the upgrade is to directly conda or pip install the package, and then add it (or modify its version) in the environment.yml file. Do this only if you know what you're doing!

Ensure your environment kernels are available to Jupyter

By practicing "one project gets one environment", then ensuring that those environments' Python interpreters are available to Jupyter is going to be crucial. If you find that your project's environment Python is unavailable, then you'll need to ensure that it's available. To do so, ensure that the Python environment has the package ipykernel. (If not, install it by hand and add it to the environment.yml file.) Then, run the following command:

# assuming you have already activated your environment,
# replace $ENVIRONMENT_NAME with your environment's name.
python -m ipykernel install --user --name $ENVIRONMENT_NAME

Now, it will show up as a "kernel" for executing Python code in your Jupyter notebooks. (see Configure Jupyter and Jupyter Lab for more information on how to configure it.)

Further tips

Now, how should you name your conda environment? See the page: Sanely name things consistently!

Set up your project with a sane directory structure

Why setup your project with a sane directory structure

Doing so will help you quickly and easily find things. This is crucial when navigating your data project. If you don't do so, you will likely end up being utterly confused as to where things are located.

What does a sane directory look like

I am going to show you one particular example, but you can adapt it to however you like.

|- informative-project-name-here/
   |- data/          # never add anything here into source control
   |- notebooks/     # divide by usernames if needed
   |- scripts/       # basically for automation
   |- importable_name/
      |- __init__.py
      |-...
   |- tests/      # test suite
   |- README.md
   |- pyproject.toml # use this, not setup.py!
   |-...

The purpose of each directory is annotated in each line. That said, you can find relevant information in the following pages:

data/: Define single sources of truth for your data sources
notebooks/: Keep your notebooks organized with logical categories
importable_name/: Place custom source code inside a lightweight package
scripts/: Use scripts to automate routine execution of tasks

Use scripts to automate routine execution of tasks

This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your directory.

Where should these scripts live?

In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and provide additional sub-categories inside there.

How do I decide what language to write those scripts in?

You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:

If you're doing text processing of files, or otherwise leveraging functions from your project's custom source, then you might want to write them in Python. (see: Place custom source code inside a lightweight package)
If you're doing filesystem manipulation, or repeated serial execution of command line tools, a bash script is a great idea.

What else should I pay attention to when building these scripts?

Design for project root execution

Most of the time, it's optimal to design these scripts assuming that the "current working directory" is project root directory. This will simplify how you execute the scripts. You'll save on injecting "cd" commands into the documentation that you build.

There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.

Leverage Makefiles

If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:

bash scripts/ci/build.sh

can get boring over time. If you instead put that line in a Makefile as follows:


build:
	bash scripts/ci/build.sh

then you can execute the command make build from the project root, and save yourself keystrokes.

Help your colleagues with a "bootstrap" script

You can help your colleagues get setup by creating a script for them! For example, you can write one that has the following commands:

# ./scripts/setup.sh

export PROJECT_ENV_NAME = ______________  # replace with your env name
conda env create -f environment.yml || mamba env create -f $PROJECT_ENV_NAME
conda activate $PROJECT_ENV_NAME

# Install custom source
pip install -e .

# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager

# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."

This script will help you:

Create the conda environment. (see: Create one conda environment per project)
Install the custom source
Install the Jupyterlab IPywidgets extension (necessary for progress bars like tqdm!)
Install pre-commit hooks (see: Set up pre-commit hooks to automate checks before making git commits)

Saves a bunch of time downstream!

Separate computationally expensive steps from computationally cheap steps

If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then ensure that you have it set up such that upstream computational steps, especially those that are computationally expensive, execute independent of computationally cheap ones that depend on them. One example, provided by one of my reviewers Simon, is "intermediate data generation" vs. "data visualization". To quote:

I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.

Sanely name things consistently

Why should you name things consistently

Think about the following scenario:

Your project is called sales-forecast
It lives in a Git repository hosted on GitHub called forecast-2020
Your conda environment is named something you copied and pasted from a tutorial, say my_env
Your custom source code is named my_source.

Are you going to be able to ever mentally map them to one another? Probably not, though maybe if you did put in the effort to do so, you might be able to. That said, if you work with someone else on the project, you're only going to increase the amount of mental work they need to do to keep things straight.

Now, consider a different scenario:

Your project is called Sales Forecast 2020
Your Git repository is called sales-forecast-2020
Your conda environment is called sales-forecast-2020-env
And your custom source code package is called sales_forecast_2020.

Does the latter seem saner? I think so too :).

What constitutes a "sane" name?

I think the following guidelines help:

2 words are preferred, 3 words are okay, 4 is bordering on verbose; 5 or more words is not really acceptable.
Explicit, precise, and well-defined for a "local" scope, where "local" depends on your definition.

I would add that learning how to name things precisely in English, and hence provide precise variable names in Python code, is a great way for English second language speakers to practice and expand their language vocabulary.

As one of my reviewers (Logan Thomas) pointed out, leveraging the name to help newcomers distinguish between entities is helpful too. For this reason, your environment can be suffixed with a consistent noun; for example, I have -dev as a suffix to make for software package-oriented projects; above, we used -env as a suffix (making sales-forecast-2020-env) to indicate to a newcomer that we're activating an environment when we conda activate sales-forecast-2020-env. As long as you're consistent, that's not a problem!

Use pyprojroot to define relative paths to the project root

Why you should use pyprojroot

If you follow the practice of One project should get one git repository, then everything related to the project will be housed inside that repository. Under this assumption, if you also develop a custom source code library for your project (see Place custom source code inside a lightweight package for why), then you'll likely encounter the need to find paths to things, such as data files, relative to the project root. Rather than hard-coding paths into your library and Jupyter notebooks, you can instead leverage pyprojroot to define a library of paths that are useful across the project.

How do you use pyprojroot effectively

Firstly, make sure you have an importable source_package.paths module. (I'm assuming you have written a custom source package!) In there, define project paths:

from pyprojroot import here

root = here(proj_files=[".git"])
notebooks_dir = root / "notebooks"
data_dir = root / "data"
timeseries_data_dir = data_dir / "timeseries"

here() returns a Python pathlib.Path object.

You can go as granular or as coarse-grained as you want.

Then, inside your Jupyter notebooks or Python scripts, you can import those paths as needed.

from source_package.paths import timeseries_data_dir
import pandas as pd

data = pd.read_csv(timeseries_data_dir / "2016-2019.csv")

Now, if for whatever reason you have to move the data files to a different subdirectory (say, to keep things even more organized than you already are, you awesome person!), then you just have to update one location in source_package.paths, and you're able to reference the data file in all of your scripts!

Build a continuous integration pipeline for your source

What is a continuous integration pipeline

If you end up writing software (see: Place custom source code inside a lightweight package), especially code that you might need to depend on in the future, having a test suite is essential (see: Write tests that test your custom code). However, the execution of the tests still needs to be triggered by you.

A continuous integration (CI) pipeline solves that problem for you. When configured correctly, on every commit you make to your codebase, it will automatically:

Build an environment that you configure
Execute all tests associated with your source code inside that environment

You can think of a continuous integration pipeline as a programmable bot that runs commands that you've configured it to run, except it does so automatically on every single commit.

Why write a continuous integration pipeline

You can configure a CI pipeline to automatically run code checks, thus preventing you from breaking something that you previously wrote on which you also depend.

You can also configure a CI pipeline to continuously run analyses that are crucial to the project. You essentially feed the CI pipeline the commands needed to re-run analyses that are important and deposit the results in a location that you get to configure.

If you don't build a CI pipeline, then you'll miss out on the benefits of automatically having a bot check your work for breakages.

How to build a CI pipeline

There's a myriad of CI providers. Here are a few examples:

Travis CI
Azure Pipelines
GitHub Actions
CircleCI

Because of the myriad of options available, it'd be futile to give you a tutorial. Instead, I'll show you what's common between them.

Firstly, you begin by writing a configuration file that lists out all of the build steps. Typically it's a YAML file (Travis CI, Azure Pipelines, and GitHub Actions all use this), but sometimes you'll have other formats, such as a Jenkinsfile for Jenkins. This file is, by convention, usually placed in the root of your project repository, but you can also opt to put it in another location if that helps with file organization.

Most commonly, the build steps will be nothing more than bash commands. For example, in Travis CI, each build step in the YAML file is a bash command used to execute the pipeline. Sometimes, to take advantage of the user-friendly UI elements provided by the CI provider, you'll be asked to supply a slightly more complex YAML file. There, you can group build steps into logical higher-order steps and provide human-readable descriptions for them; these get paired with a web UI that lets you easily debug a step when something goes wrong.

Secondly, there'll be a website (sometimes called a "control plane" in cloud jargon) where you go to configure the continuous integration bot. There, you'll typically configure:

The location of the Git repository
The exact configuration file(s) that contains the build steps.

If your company has set up internal systems slightly differently, you'll probably have to ask your IT department's DevOps team for help to accomplish your task. Ask nicely; they invest tons of time building out something usable, but sometimes the data scientist's level of expertise with these systems, which is usually beginner, is out of their radars.

Define single sources of truth for your data sources

Why define single sources of truth for data

Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet. This was before the days of collaboratively editing a single Excel spreadsheet was a possibility. To avoid conflicts, someone creates a spreadsheet_v2.xlsx, and then at the same time, another person creates spreadsheet_TE_edits.xlsx.

Which version do you trust?

The worst part? Neither of those spreadsheets contained purely raw data; they were a mix of both raw data and derived data (i.e. columns that are calculated off or from other columns). The derived data are not documented with why and how they were calculated; their provenance is unknown, in that we don't know who made those changes, and who to ask questions on those columns.

Rather than wrestling with multiple sources of truth, a data analysis workflow can be much more streamlined by defining a single source of truth for raw data that does not contain anything derived, followed by calculating the derived data in a custom source code (see: Place custom source code inside a lightweight package), written in such a way that they yield logical derived data structures for the problem (see: Iteratively scope out and define the most appropriate data structures for your problem). Those single sources of truth can also be described by a ground truth data descriptor file (see Write data descriptor files for your data sources), which give you the provenance of the file and a human-readable descriptor of each of the sources.

Examples of single sources of data truth in action

Data on an s3-like bucket

If your organization uses the cloud, then AWS S3 (or compatible bucket stores) might be available. A data source might be dumped on there and referenced by a single URL. That URL is your "single source of data"

Data on an internal data store

Your organization might have the resources to build out a data store with proper access controls and the likes. They might provide a unique key and a software API (RESTful, or Python or R package) to download data in an easy fashion. That "unique key" + the API defines your single source of truth.

Data on a shared network store

Longer-lived organizations might have started out with a shared networked filesystem, with access controls granted by UNIX-style user groups. In this case, the /path/to/the/data/file + access to the shared filesystem is your source of truth.

Data on the internet

This one should be easy to grok: a URL that points to the exact CSV, Parquet, or Excel table, or a zip dump of images, is your unique identifier.

Follow the rule of one-to-one in managing your projects

What is this rule all about

The one-to-one rule essentially means this. Each project that we work on gets:

One Git repository (see: One project should get one git repository)
One conda environment (see: Create one conda environment per project)
One custom source package inside the repo (see: Place custom source code inside a lightweight package)
One documentation source inside the repo (see: Write effective documentation for your projects), including a well-maintained README file
One continuous integration pipeline (see: Build a continuous integration pipeline for your source)

In addition, when we name things, such as environment names, repository names, and more, we choose names that are consistent with one another (see: Sanely name things consistently for the reasons why).

Why is this important

Conventions help act as a lubricant - a shortcut for us to interact with others. Adopting the convention of one-to-one mappings helps us manage some of the complexity that may arise in a project.

Some teams have a habit of putting source code in one place (e.g. Bitbucket) and documentation in another (e.g. Confluence). I would discourage this; placing source code and documentation on how to use it next to each other is a much better way to work, because it gives you and your project stakeholders one single source of truth to find information related to a project.

When can we break this rule

A few guidelines can help you decide.

When a source repository matures enough such that you see a submodule that is generalizable beyond the project itself, then it's time to engage the help of a real software developer to refactor that chunk of code out of the source file into a separate package.

When the project matures enough such that there's a natural bifurcation in work that needs more independence from the original repository, then it's time to split the repository into two. At that point, apply the same principles to the new repository.