Adhere to Git best practices
Git is a unique piece of software. It does one and only one thing well: store versions of hand-curated files. Adhering to Git best practices will ensure that you use Git in its intended fashion.
The most significant point to keep in mind: only commit files to Git that you had to create manually. That usually means version-controlling things like source code, configuration files, and documentation.
There are also things you should actively avoid committing.
For specific files, you can set up a .gitignore file. See the page Set up an awesome default gitignore for your projects for more information on how to prevent yourself from committing them accidentally.
For Jupyter notebooks, it is considered good practice to avoid committing notebooks that still have outputs. It is best to clear the outputs using nbstripout. That can be automated before committing through the use of pre-commit hooks. (See: Set up pre-commit hooks to automate checks before making Git commits.)
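If you want to strip a notebook's outputs by hand, nbstripout also works as a command-line tool. A minimal sketch, where the notebook filename is a placeholder:

conda install -c conda-forge nbstripout
nbstripout analysis.ipynb  # removes all cell outputs from the notebook in-place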
Create runtime environment variable configuration files for each of your projects
When you work on your projects, one assumption you will usually have is that your development environment will look like your project's runtime environment with all of its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables)
Here, I'm assuming that you follow the practice described in Use pyprojroot to define relative paths to the project root.
To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables like so:
export ENV_VAR_1="some_value"
export DATABASE_CONNECTION_STRING="some_database_connection_string"
export ENV_VAR_3="some_other_value"
We use the export syntax here because, in our shells, we can run the command source .env and have the environment variables defined there applied to our environment.
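As a quick sanity check, a minimal sketch of sourcing the file and inspecting one of the variables:

source .env
echo $DATABASE_CONNECTION_STRING  # should print the value defined in .env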
Now, if you're working on a Python project, make sure you have the package python-dotenv (GitHub repo here) installed in the conda environment. Then, in your Python .py source files:
from dotenv import load_dotenv
from pyprojroot import here
import os
dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path) # this will load the .env file in your project directory root.
# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")
In this way, your runtime environment variables get loaded into your shell environment and become available to all child processes started from within the shell (e.g. Jupyter Lab or Python).
Your .env file might contain some sensitive secrets. You should always ensure that your .gitignore file contains .env in it.
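If it isn't listed yet, appending it is a one-liner (run from the project root):

echo ".env" >> .gitignore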
See also: Set up an awesome default gitignore for your projects
Set up pre-commit hooks to automate checks before making Git commits
One way to prevent yourself from committing code that is not properly checked is to use pre-commit hooks. This is a feature of Git that allows you to automatically run checks before they are committed to the repository history. Because they are automatically run, you set them up once, usually when you first download the repository, and no longer need to think about them again.
You can install the pre-commit framework, which lets you easily configure pre-commit hooks to run. The gist of the installation steps is in the bash commands below, but you should read the website for a fuller understanding.
conda install -c conda-forge pre-commit
pre-commit sample-config > .pre-commit-config.yaml
Now, go and edit .pre-commit-config.yaml to add other pre-commit checks, for example. (See below for an example that you can use.) Then, run:
pre-commit install
pre-commit run --all-files # run the checks against all of your files
So what would you put in your .pre-commit-config.yaml? Here's a sane collection of starter checks that I usually include, taken from my Network Analysis Made Simple repository:
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/kynan/nbstripout
    rev: master
    hooks:
      - id: nbstripout
        files: ".ipynb"
nbstripout is a super important one: it ensures that all of my notebook outputs are stripped before committing them to the repository! (Otherwise, you'll end up bloating your repository with large notebooks.)
CI pipeline checks are also a form of automated checks that you can put into your workflow. (For a refresher, or if you're not sure what CI pipeline checks are, see Build a continuous integration pipeline for your source.) Ideally, everything that is checked for in your pre-commit hooks should also be checked for in your CI pipeline.
So what's the difference, then? Here are my thoughts:
In pre-commit hooks, you generally run the lightweight checks: the ones that are annoying to run manually all the time but that also execute very quickly. Think code style checks, for example, or checks that ensure text files end with a single trailing newline.
In the CI system, you run those checks in addition to the longer-running test suite. (See: Write tests that test your custom code.) The CI system thus behaves as a backup to the pre-commit hooks.
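As a concrete sketch of that backup, here is what a minimal CI job that re-runs your pre-commit hooks might look like, assuming GitHub Actions (the workflow name and Python version are placeholders):

name: ci-checks
on: [push, pull_request]
jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pre-commit
      - run: pre-commit run --all-files  # same checks as the local hooks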
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
The specifics fall into a few buckets: some overall ideas that ground everything else, practices pertaining to Git, notes on organizing files, notes on your compute environment, and notes on good coding practices. Each is covered in its own note in this collection.
Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.
Create configuration files for code checking tools
Configuration files give you the ability to declare your project's preferred configuration and distribute it to all participants in the project. It smooths out the practice of data science (and software development too), as these configurations represent the declared normative state of a project, answering questions such as how code ought to be formatted and which checks ought to be run.
Without these configuration files declaring how code checkers ought to behave, we leave it up to collaborators and contributors to manually configure their local systems, and without sufficient documentation, they may bug you over and over on how things ought to be configured. This increase in friction will inevitably lead to an increase in frustration with the project, and hence a decrease in engagement.
As such, you can think of these configuration files as part of your automation toolkit, thus satisfying the automation philosophy.
Because configuration files are so crucial to a project, I have collated them together on the Configuration file overview page.
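As an illustrative sketch (the specific tools and settings here are assumptions, not prescriptions), code-checker configuration often lives in pyproject.toml:

[tool.black]
line-length = 88  # the line length that black enforces

[tool.isort]
profile = "black"  # keep import sorting compatible with black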
When should you create these configuration files? As always, just-in-time, at the moment that you need them.
Write effective documentation for your projects
As your data science project progresses, you should be documenting your work somehow inside your project. Your future self and other colleagues will need mental context to help get up-to-speed with the project. That mental context can mean the difference between staying on course or veering off in unproductive directions.
Useful documentation helps you quickly onboard collaborators to the project. By reading it, they will get oriented and know how to get things done with your project. You won't be available forever to everyone who might come by, so your documentation effectively scales the longevity and impact of your work.
To write effective documentation, we first need to recognize that there are actually four types of documentation. They are, respectively: tutorials, how-to guides, explanations, and references.
This is not a new concept; it is actually well-documented (ahem!) in the Diataxis Framework.
Concretely, here are some kinds of documentation that you will want to focus on.
The first is custom source code docstrings (a type of Reference). We write docstrings inside Python functions to document what we intend to accomplish with the code block and why that code needs to exist. Be diligent about writing down the why behind the what; it will help you recall the "what" later on.
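For instance, a docstring might look like the following sketch (the function and its details are hypothetical):

def standardize_columns(df):
    """Standardize DataFrame column names to snake_case.

    Why: our upstream data sources name columns inconsistently
    (spaces, mixed case), and our pandera schemas expect one
    canonical form. What: lowercase each name and replace spaces
    with underscores.
    """
    df = df.copy()  # avoid mutating the caller's DataFrame
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df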
The second is how-to guides for newcomers to the project (obviously under the How-to Guides category). These guides help newcomers get up to speed on the commands needed to set up a local development environment for the project: essentially, the sequence of terminal incantations they would need to type to start hacking on it. As always, for non-obvious steps, document the why!
The third involves crafting and telling a story of the project's progress. (We may consider this to be Explanation-style documentation.) For those of you who have done scientific research before, you'll know how this goes: it's essentially the lab meeting presentations that you deliver! Early on, your progress will be granular, but it will gain momentum as the project moves along. Doing this is important because the act of reflecting on prior work, summarizing it, and linearizing it for yourself helps you catch logical gaps that need to be filled in, essentially identifying where you need to focus your project efforts.
The final one is the project README! The README usually exists as README.md or README.txt in the project root directory, and it usually serves dual purposes, as both a quick Tutorial and a How-to Guide.
As for tooling to build documentation, I would advocate that we simultaneously strive to be simple and automated. For Pythonistas, there are two predominant options that you can go with: Sphinx and MkDocs.
At first glance, most in the Python world would advocate for the use of Sphinx, which is the stalwart package used to generate documentation. Sphinx's power lies in its syntax and ecosystem of extensions: you can easily link out to other packages, build API documentation from docstrings, run examples in documentation as tests, and more.
However, if you're not already familiar with Sphinx, I would recommend getting started with MkDocs. Its core design is much simpler, relying only on Markdown files as the source for documentation. That is MkDocs' most significant advantage: from my vantage point, Markdown syntax knowledge is more widespread than Sphinx syntax knowledge; hence, it's much easier to invite collaborators to write documentation together. (Hint: the MkDocs Material theme by Squidfunk has a ton of super excellent features that easily enhance MkDocs!)
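To give a feel for that simplicity, a minimal mkdocs.yml might look like this (the site name, nav entries, and the Material theme choice are all placeholders):

site_name: My Data Project
theme:
  name: material
nav:
  - Home: index.md
  - How-to guides: how-to.md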
Firstly, you should define a single source of truth for statements that you make in your docs. If you can, avoid copy/pasting anything. Related ideas here are written in Define single sources of truth for your data sources.
Secondly, you'll want to pick from several styles of writing. One effective way is to think of documentation as answering critical questions for a project. The questions that commonly show up in data projects mirror those of a scientific research paper and include (but are not limited to) what problem is being solved, what data were used, what methods were applied, and what conclusions were drawn.
If your project also encompasses a tool that helps routinize the project in a production setting, the questions extend to how that tool is installed, configured, and operated.
As one of my reviewers, Simon Eng, mentioned, the overarching point is that your documentation should explain to someone else what's going on in the project.
Finally, it would be best if you used semantic line breaks, also known as semantic line feeds. Go ahead. I know you're curious; click on the links to learn why :).
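In brief, the idea is to break lines at clause boundaries rather than at an arbitrary column width. A hypothetical example of source text written this way:

Semantic line breaks place each clause on its own line,
which keeps version-control diffs small,
because editing one clause no longer reflows the entire paragraph.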
I strongly recommend reading the Write The Docs guide to writing technical documentation.
Additionally, Admond Lee has additional reasons for writing documentation.
Place custom source code inside a lightweight package
Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?
As soon as you did that, you created two sources of truth for that one function.
Even if you intended to modify the function and test the effect of the modification on the rest of the code, you could have done better: any change would now have to be made in two places.
A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.
Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure; specifically, that you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.
In your project's src/ directory, ensure you have a few files:
|- project_name/        # should be the same name as the conda environment
|  |- data/             # for all data-related functions
|  |  |- loaders.py     # convenience functions for loading data
|  |  |- schemas.py     # this is for pandera schemas
|  |- __init__.py       # this is necessary
|  |- paths.py          # this is for path definitions
|  |- utils.py          # utility functions that you might need
|  |- ...
|- tests/
|  |- test_utils.py     # tests for utility functions
|  |- ...
|- pyproject.toml       # replacement for setup.py
If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)
If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root.
pyproject.toml should look like this:
[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."
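Depending on your tooling, you may also need to declare a build backend so that pip knows how to build the package; a common sketch (the setuptools choice is an assumption, not a requirement):

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"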
Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:
conda activate project_environment
pip install -e .
This will install the source package in development mode. As you continue to add more code into the custom source package, it will be instantly available to you project-wide.
Now, in your projects, you can import anything from the custom source package.
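For example (the module and function names below are hypothetical, following the directory structure above):

# in a notebook or any .py file in the project
from project_name.data.loaders import load_raw_data
from project_name.utils import deduplicate_rows

df = load_raw_data()       # one source of truth for data loading
df = deduplicate_rows(df)  # and for utility transforms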
Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried a few ways of organizing things, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project, and we can keep the setup.py in src/ too, thus eliminating clutter from the top-level directory.
How often should you lean on the custom package? As often as you need it!
Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!
It feels like a lot to remember, right? Fret not! You can use pyds-cli to easily bootstrap a new project environment!
Set up an awesome default gitignore for your projects
There will be some files you'll never want to commit to Git: think secrets (such as your .env file), large data files, and machine-generated artifacts like notebook checkpoints. If you commit them, you risk leaking credentials, bloating your repository, and cluttering your collaborators' diffs.
Some believe that your .gitignore should be curated. I believe that you should use a good default one that is widely applicable. To do so, go to gitignore.io, fill in the languages and operating systems involved in your project, and copy/paste the one that fits. If you want an awesome default one for Python:
cd /path/to/project/root
curl -o .gitignore https://www.toptal.com/developers/gitignore/api/python
It will have .env available in there too! (See: Create runtime environment variable configuration files for each of your projects.)
How is the .gitignore file parsed?
A .gitignore file is parsed according to the rules on its documentation page. It essentially follows Unix glob syntax while adding on logical modifiers. Here are a few examples to get you oriented.
Ignoring .DS_Store files
These are files generated by macOS' Finder. You can ignore them by appending the following line to your .gitignore:
*.DS_Store
Ignoring the site/ directory
If you use MkDocs to build documentation, it will place the output into the directory site/. You will want to ignore the entire directory by appending the following line:
site/
Ignoring .ipynb_checkpoints directories
If you have Jupyter notebooks inside your repository, you can ignore any path containing .ipynb_checkpoints by appending the following line:
.ipynb_checkpoints
Adding this line will prevent your Jupyter notebook checkpoints from being committed into your Git repository.