Build a continuous integration pipeline for your source

What is a continuous integration pipeline

If you end up writing software (see: Place custom source code inside a lightweight package), especially code that you might need to depend on in the future, having a test suite is essential (see: Write tests that test your custom code). However, the execution of the tests still needs to be triggered by you.

A continuous integration (CI) pipeline solves that problem for you. When configured correctly, on every commit you make to your codebase, it will automatically:

  1. Build an environment that you configure
  2. Execute all tests associated with your source code inside that environment

You can think of a continuous integration pipeline as a programmable bot that runs commands that you've configured it to run, except it does so automatically on every single commit.

Why write a continuous integration pipeline

You can configure a CI pipeline to automatically run code checks, thus preventing you from breaking something that you previously wrote on which you also depend.

You can also configure a CI pipeline to continuously run analyses that are crucial to the project. You essentially feed the CI pipeline the commands needed to re-run analyses that are important and deposit the results in a location that you get to configure.

If you don't build a CI pipeline, then you'll miss out on the benefits of automatically having a bot check your work for breakages.

How to build a CI pipeline

There's a myriad of CI providers. Here are a few examples:

  • Travis CI
  • Azure Pipelines
  • GitHub Actions
  • CircleCI

Because of the myriad of options available, it'd be futile to give you a tutorial. Instead, I'll show you what's common between them.

Firstly, you begin by writing a configuration file that lists out all of the build steps. Typically it's a YAML file (Travis CI, Azure Pipelines, and GitHub Actions all use this), but sometimes you'll have other formats, such as a Jenkinsfile for Jenkins. This file is, by convention, usually placed in the root of your project repository, but you can also opt to put it in another location if that helps with file organization.

Most commonly, the build steps will be nothing more than bash commands. For example, in Travis CI, each build step in the YAML file is a bash command used to execute the pipeline. Sometimes, to take advantage of the user-friendly UI elements provided by the CI provider, you'll be asked to supply a slightly more complex YAML file. There, you can group build steps into logical higher-order steps and provide human-readable descriptions for them; these get paired with a web UI that lets you easily debug a step when something goes wrong.

Secondly, there'll be a website (also known as a "control plane" in cloud jargon) where you go to configure the continuous integration bot. There, you'll typically configure:

  1. The location of the Git repository
  2. The exact configuration file(s) that contains the build steps.

If your company has set up internal systems slightly differently, you'll probably have to ask your IT department's DevOps team for help to accomplish your task. Ask nicely; they invest tons of time building out something usable, but sometimes the data scientist's level of expertise with these systems, which is usually beginner, is out of their radars.

Keep your notebooks organized with logical categories

In my experience, there are three types of notebooks that get written.

Prototyping notebooks go under notebooks/

These notebooks are drafting grounds for "production" code. We use Jupyter notebooks as an experimentation playground. (see: Use Jupyter as an experimentation playground). They do not need to be kept running reliably/reproducibly, and essentially are considered "disposable".

If you are collaborating with colleagues on a project, you can categorize notebooks by their primary author. For example, if I am working with Lily and Arkadij on a project, we can each get our own "user spaces" in there while agreeing not to touch each other's notebooks:

project/
- notebooks/
  - lily/     # lily's notebooks go here
  - arkadij/  # arkadij's notebooks go here
  - eric/     # eric's notebooks go here

Documentation notebooks go under docs/

These notebooks are written in the original spirit of Jupyter notebooks. They combine prose, code and code-generated figures. They contain a narrative, a data story. One may say they are "production", in that someone will read them and act on them. They need to be reliably executed from top-to-bottom, usually in a continuous integration system. (see: Build a continuous integration pipeline for your source) using MkDocs and mknotebooks.

For these notebooks, we might choose to keep them in the docs/ directory:

project/
- docs/
  - some_notebook.ipynb

Application notebooks go under app/

Sometimes you might opt to use voila to build front-end applications for those whom you serve. This is a convenient option because you don't have to jump out of a Jupyter context if you're already in there. These notebooks are considered "production" as well, however because they are code embedded in JSON, they are more difficult to diff with git.

For these notebooks, you probably want to keep them in a directory named app, where anything that becomes front-facing to the clients we serve are stored:

project/
- apps/
  - notebook_app.ipynb

Place custom source code inside a lightweight package

Why write a package for your custom source code

Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?

As you as you did that, you created two sources of truth for that one function.

Now... if you intended to modify the function and test the effect of the modification on the rest of the code, then you still could have done better.

A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.

How to create a custom source package for a project

Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.

In your project src/ directory, ensure you have a few files:

|- src/
   |- setup.py
   |- source_package/  # rename this to the same name as the conda environment
      |- data/         # for all data-related functions
         |- loaders.py # convenience functions for loading data
         |- schemas.py # this is for pandera schemas
      |- __init__.py   # this is necessary
      |- paths.py      # this is for path definitionsme
      |- utils.py      # utiity functions that you might need
      |- ...
   |- tests/
      |- test_utils.py # tests for utility functions
      |- ...

If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)

If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root

setup.py should look like this:

import setuptools

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setuptools.setup(
    name="source_package", # Replace with your environment name
    version="0.1",         # Replace with anything that you need
    packages=setuptools.find_packages(),
)

Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:

conda activate project_environment
cd src
pip install -e .

This will install the source package in development mode. As you continue to add more code into the custom source package, they will be instantly available to you project-wide.

Now, in your projects, you can import anything from the custom source package.

Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried to organize a few ways, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project and we can keep the setup.py in src/ too, thus eliminating clutter from the top-level directory.

How often should the package be updated?

As often as you need it!

Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!

Write tests that test your custom code

Why write tests for your code

Writing tests for your code is a great practice. If you depend on a chunk of code, you should write tests for it.

As you develop a codebase, you might inadvertently modify an existing piece of code on which your project depends. This modification will break other analyses that rely on that piece of code. Writing tests that get automatically executed on every commit (see: Build a continuous integration pipeline for your source) will help you catch these changes before you merge them into your codebase.

How do you write tests

I could write a full-fledged testing tutorial, but because the intent here is to provide you with the "why"s followed by a quick guide, I would recommend reading an essay I wrote on this.

The general pattern to look out for is that:

  1. You first write a function that doesn't merely wrap another function but does a single unit of substantial work.
  2. Then, you test the function using examples that you provide to the test runner, i.e. the program that automatically finds all tests to run and then executes them.

In terms of test runners, I find pytest to be the fastest to get up and running with; through experience, I have also found it well-equipped to grow in complexity if my codebase necessitates it.

Follow the rule of one-to-one in managing your projects

What is this rule all about

The one-to-one rule essentially means this. Each project that we work on gets:

In addition, when we name things, such as environment names, repository names, and more, we choose names that are consistent with one another (see: Sanely name things consistently for the reasons why).

Why is this important

Conventions help act as a lubricant - a shortcut for us to interact with others. Adopting the convention of one-to-one mappings helps us manage some of the complexity that may arise in a project.

Some teams have a habit of putting source code in one place (e.g. Bitbucket) and documentation in another (e.g. Confluence). I would discourage this; placing source code and documentation on how to use it next to each other is a much better way to work, because it gives you and your project stakeholders one single source of truth to find information related to a project.

When can we break this rule

A few guidelines can help you decide.

When a source repository matures enough such that you see a submodule that is generalizable beyond the project itself, then it's time to engage the help of a real software developer to refactor that chunk of code out of the source file into a separate package.

When the project matures enough such that there's a natural bifurcation in work that needs more independence from the original repository, then it's time to split the repository into two. At that point, apply the same principles to the new repository.

Get prepped per project

Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.

Firstly, some overall ideas to ground the specifics:

Some ideas pertaining to Git:

Notes that pertain to organizing files:

Notes that pertain to your compute environment:

And notes that pertain to good coding practices:

Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.

Set up pre-commit hooks to automate checks before making git commits

Why use pre-commit hooks?

One way to prevent yourself from committing code that is not properly checked is to use pre-commit hooks. This is a feature of Git that allows you to automatically run checks before they are committed to the repository history. Because they are automatically run, you set them up once, usually when you first download the repository, and no longer need to think about them again.

How do I set up pre-commit hooks?

You can install the pre-commit framework, which lets you easily configure pre-commit hooks to run.

The gist of the installation steps are in the bash commands below, but you should read the website for a fuller understanding.

conda install -c conda-forge pre-commit
pre-commit sample-config > .pre-commit-config.yaml

Now, go and edit .pre-commit-config.yaml -- add other pre-commit checks, for example. (See below for an example that you can use.) Then, run:

pre-commit install
pre-commit run --all-files   # run the checks against all of your files

What pre-commit hooks are good to install?

a.k.a. what would you put in your .pre-commit-config.yaml? Here's a sane collection of starter things that I usually include, taken from my Network Analysis Made Simple repository.

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/kynan/nbstripout
    rev: master
    hooks:
      - id: nbstripout
        files: ".ipynb"

nbstripout is a super important one -- it ensures that all of my notebook outputs are stripped before committing them to the repository! (Otherwise, you'll end up bloating your repository with large notebooks.)

How does this relate to continuous integration pipeline checks?

(For a refresher, or if you're not sure what CI pipeline checks are, see Build a continuous integration pipeline for your source.)

CI pipeline checks are also a form of automated checks that you can put into your workflow. Ideally, everything that is checked for in your pre-commit hooks should be checked for in your CI pipeline.

So what's the difference, then? Here's my thoughts on this:

In pre-commit hooks, you generally run the lightweight checks: the ones that are annoying to run manually all the time but also execute very quickly. Things like code style checks, for example, or those that ensure there are only single trailing lines in text files.

In the CI system, you run those checks in addition to the longer-running test suite. (see: Write tests that test your custom code). So the CI system behaves as a backup to the pre-commit hooks.