Set up pre-commit hooks to automate checks before making Git commits
One way to prevent yourself from committing code that hasn't been properly checked is to use pre-commit hooks. These build on a feature of Git that automatically runs checks before your changes are committed to the repository history. Because the hooks run automatically, you set them up once, usually when you first clone the repository, and never need to think about them again.
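Under the hood, Git's native hook mechanism is simply an executable script that Git runs at a particular point in its lifecycle; if the pre-commit hook exits with a non-zero code, the commit is aborted. If you're curious, you can peek at the samples Git ships with every repository (a quick sketch, not something the framework below requires):
ls .git/hooks/                     # Git ships sample hook scripts here
cat .git/hooks/pre-commit.sample   # an executable named .git/hooks/pre-commit would run before each commit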
You can install the pre-commit framework, which lets you easily configure pre-commit hooks to run. The gist of the installation steps is in the bash commands below, but you should read the website for a fuller understanding.
conda install -c conda-forge pre-commit
pre-commit sample-config > .pre-commit-config.yaml
Now, go and edit .pre-commit-config.yaml -- add other pre-commit checks, for example. (See below for an example that you can use.) Then, run:
pre-commit install
pre-commit run --all-files # run the checks against all of your files
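From here on, the hooks run automatically on every git commit; if any hook fails (or modifies a file), the commit is blocked until you fix and re-stage the changes. A rough sketch of the day-to-day workflow, with a made-up commit message:
git add .
git commit -m "add data loader"                  # hooks run here; the commit is aborted if any hook fails
# fix (or re-stage) whatever the hooks changed or complained about, then commit again
pre-commit run trailing-whitespace --all-files   # you can also re-run a single hook by its id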
So what would you put in your .pre-commit-config.yaml? Here's a sane collection of starter hooks that I usually include, taken from my Network Analysis Made Simple repository.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/kynan/nbstripout
    rev: master
    hooks:
      - id: nbstripout
        files: ".ipynb"
nbstripout is a super important one -- it ensures that all of my notebook outputs are stripped before committing them to the repository! (Otherwise, you'll end up bloating your repository with large notebooks.)
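You can also run nbstripout by hand, which is handy for cleaning up notebooks that were committed before the hook existed. A minimal sketch, with a hypothetical notebook path:
pip install nbstripout                # or: conda install -c conda-forge nbstripout
nbstripout notebooks/analysis.ipynb   # strips outputs in place; the path here is just an example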
(For a refresher, or if you're not sure what CI pipeline checks are, see Build a continuous integration pipeline for your source.)
CI pipeline checks are also a form of automated checks that you can put into your workflow. Ideally, everything that is checked for in your pre-commit hooks should be checked for in your CI pipeline.
So what's the difference, then? Here are my thoughts on this:
In pre-commit hooks, you generally run the lightweight checks: the ones that are annoying to run manually all the time but that also execute very quickly. Think code style checks, for example, or checks that ensure text files end with a single trailing newline.
In the CI system, you run those checks in addition to the longer-running test suite (see: Write tests that test your custom code). So the CI system behaves as a backup to the pre-commit hooks.
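Concretely, that overlap might look like the following minimal pair of commands in your CI configuration: the first re-runs the same lightweight checks as your local hooks, and the second adds the slower test suite on top.
pre-commit run --all-files   # the same lightweight checks your local hooks run
pytest                       # the longer-running test suite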
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
Firstly, some overall ideas to ground the specifics:
Some ideas pertaining to Git:
Notes that pertain to organizing files:
Notes that pertain to your compute environment:
And notes that pertain to good coding practices:
Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.
Use scripts to automate routine execution of tasks
This idea should be pretty obvious. If you find yourself executing the exact same commands over and over and over, you should probably put them together into a bash, Python, or R script that you can call from the root of your directory.
In the spirit of putting things in categorically relevant places (see: Organize your projects by leveraging categories), you should place them in the scripts/ directory, and provide additional sub-categories inside there.
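As a purely hypothetical sketch of what those sub-categories might look like:
mkdir -p scripts/ci      # scripts that the CI pipeline calls (build, test, deploy)
mkdir -p scripts/data    # scripts that download or preprocess data
mkdir -p scripts/setup   # one-time setup scripts for new collaborators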
You should do what feels most comfortable for you, but there are still some idiomatic guidelines that can help you make a decision:
Most of the time, it's optimal to design these scripts assuming that the "current working directory" is the project root directory. This will simplify how you execute the scripts, and you'll save yourself from injecting "cd" commands into the documentation that you build.
There are exceptions to the rule. For example, if you know that every subsequent operation in the script depends on being in a subdirectory, then setting the current working directory to that subdirectory is a great idea! That age-old adage of "knowing when to break the rules judiciously" applies here.
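For instance, a docs-building script might reasonably operate entirely inside a subdirectory. A hypothetical sketch (the docs/ layout and the Sphinx-style make html target are assumptions, not prescriptions):
#!/bin/bash
# scripts/docs/build.sh -- hypothetical example of judiciously breaking the rule
cd docs       # every subsequent command operates inside docs/
make html     # assuming a Sphinx-style Makefile lives there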
If you put your scripts in a scripts/ directory, then constantly executing a command that looks like:
bash scripts/ci/build.sh
can get boring over time. If you instead put that line in a Makefile as follows:
build:
	bash scripts/ci/build.sh
then you can execute the command make build from the project root, and save yourself keystrokes.
You can help your colleagues get set up by creating a script for them! For example, you can write one that has the following commands:
# ./scripts/setup.sh
export PROJECT_ENV_NAME=______________ # replace with your env name
conda env create -f environment.yml || mamba env create -f environment.yml
conda activate $PROJECT_ENV_NAME
# Install custom source
pip install -e .
# Install Jupyter extensions (if relevant)
jupyter labextension install @jupyter-widgets/jupyterlab-manager
# Install pre-commit hooks
pre-commit install
echo "Setup complete! In the future, run 'conda activate $PROJECT_ENV_NAME' before you run your notebooks."
This script will help you (and your colleagues) create the project's conda environment, install your custom source package, install the relevant Jupyter extensions, and install the pre-commit hooks, all in one go -- that saves a bunch of time downstream!
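A new collaborator then only needs to run:
bash scripts/setup.sh   # run from the project root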
If a script is part of a pipeline (see: Build your projects thinking in terms of pipelines), then ensure that you have it set up such that upstream computational steps, especially those that are computationally expensive, execute independently of the computationally cheap ones that depend on them. One example, provided by one of my reviewers, Simon, is "intermediate data generation" vs. "data visualization". To quote:
I run under the philosophy of not unnecessarily regenerating data. Having to regenerate data -- especially if it takes a long time -- just to regenerate a visualization absolutely sucks and is a common cause of my annoyance when my underlings present data in meetings.
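In practice, this often just means splitting the pipeline into separate scripts so that the cheap steps can be re-run on their own. A hypothetical sketch (the script names are made up):
python scripts/pipeline/make_intermediate_data.py   # slow, expensive; run only when raw data or processing logic changes
python scripts/pipeline/make_figures.py             # fast, cheap; reads the intermediate data and can be re-run freely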
Adhere to best Git practices
Git is a unique piece of software. It does one and only one thing well: store versions of hand-curated files. Adhering to Git best practices will ensure that you use Git in its intended fashion.
The most significant point to keep in mind: only commit to Git files that you have had to create manually. That usually means version controlling:
There are also things you should actively avoid committing.
For specific files, you can set up a .gitignore file. See the page Set up an awesome default gitignore for your projects for more information on preventing yourself from committing them automatically.
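As a minimal sketch of the kinds of entries such a .gitignore might contain (these are common defaults, not an exhaustive or prescriptive list):
# append some common Python/Jupyter/data ignores to the project's .gitignore
cat >> .gitignore <<'EOF'
__pycache__/
.ipynb_checkpoints/
*.egg-info/
data/
EOF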
For Jupyter notebooks, it is considered good practice to avoid committing notebooks that still have outputs. It is best to clear them out using nbstripout. That can be automated before committing them through the use of pre-commit hooks. (See: Set up pre-commit hooks to automate checks before making Git commits)
Write tests that test your custom code
Writing tests for your code is a great practice. If you depend on a chunk of code, you should write tests for it.
As you develop a codebase, you might inadvertently modify an existing piece of code on which your project depends. This modification will break other analyses that rely on that piece of code. Writing tests that get automatically executed on every commit (see: Build a continuous integration pipeline for your source) will help you catch these changes before you merge them into your codebase.
I could write a full-fledged testing tutorial, but because the intent here is to provide you with the "why"s followed by a quick guide, I would recommend reading an essay I wrote on this.
The general pattern to look out for is that:
In terms of test runners, I find pytest to be the fastest to get up and running with; through experience, I have also found it well-equipped to grow in complexity if my codebase necessitates it.
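Getting up and running usually looks something like this (the test file path and the -k expression are hypothetical, just to show the shape of the commands):
pip install pytest                         # or: conda install -c conda-forge pytest
pytest                                     # discovers and runs tests in files named test_*.py
pytest tests/test_loaders.py -k "csv" -v   # hypothetical: run a subset of tests, verbosely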
Build a continuous integration pipeline for your source
If you end up writing software (see: Place custom source code inside a lightweight package), especially code that you might need to depend on in the future, having a test suite is essential (see: Write tests that test your custom code). However, the execution of the tests still needs to be triggered by you.
A continuous integration (CI) pipeline solves that problem for you. When configured correctly, on every commit you make to your codebase, it will automatically:
You can think of a continuous integration pipeline as a programmable bot that runs commands that you've configured it to run, except it does so automatically on every single commit.
You can configure a CI pipeline to automatically run code checks, thus preventing you from breaking something that you previously wrote on which you also depend.
You can also configure a CI pipeline to continuously run analyses that are crucial to the project. You essentially feed the CI pipeline the commands needed to re-run analyses that are important and deposit the results in a location that you get to configure.
If you don't build a CI pipeline, then you'll miss out on the benefits of automatically having a bot check your work for breakages.
There's a myriad of CI providers. Here are a few examples:
Because of the myriad of options available, it'd be futile to give you a tutorial. Instead, I'll show you what's common between them.
Firstly, you begin by writing a configuration file that lists out all of the build steps. Typically it's a YAML file (Travis CI, Azure Pipelines, and GitHub Actions all use this), but sometimes you'll have other formats, such as a Jenkinsfile for Jenkins. This file is, by convention, usually placed in the root of your project repository, but you can also opt to put it in another location if that helps with file organization.
Most commonly, the build steps will be nothing more than bash commands. For example, in Travis CI, each build step in the YAML file is a bash command used to execute the pipeline. Sometimes, to take advantage of the user-friendly UI elements provided by the CI provider, you'll be asked to supply a slightly more complex YAML file. There, you can group build steps into logical higher-order steps and provide human-readable descriptions for them; these get paired with a web UI that lets you easily debug a step when something goes wrong.
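To make that concrete, the build steps for a data science project often boil down to a handful of bash commands like the ones below (the 'dev' extra is an assumption about your project's packaging, not a requirement of any CI provider):
pip install -e ".[dev]"      # install your custom source plus development dependencies
pre-commit run --all-files   # the same lightweight checks as the local hooks
pytest                       # the longer-running test suite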
Secondly, there'll be a website (sometimes called a "control plane" in cloud jargon) where you go to configure the continuous integration bot. There, you'll typically configure:
If your company has set up its internal systems slightly differently, you'll probably have to ask your IT department's DevOps team for help to accomplish your task. Ask nicely; they invest tons of time building out something usable, but the fact that data scientists are usually beginners with these systems sometimes isn't on their radar.