Adhere to Git best practices
Git is a unique piece of software. It does one and only one thing well: store versions of hand-curated files. Adhering to Git best practices will ensure that you use Git in its intended fashion.
The most significant point to keep in mind: only commit files to Git that you had to create manually. That usually means version-controlling things like source code, configuration files, and documentation.
There are also things you should actively avoid committing.
For specific files, you can set up a .gitignore file. See the page Set up an awesome default gitignore for your projects for more information on how to prevent yourself from committing them accidentally.
For Jupyter notebooks, it is considered good practice to avoid committing notebooks that still have outputs. It is best to clear the outputs using nbstripout. That can be automated before committing through the use of pre-commit hooks. (See: Set up pre-commit hooks to automate checks before making Git commits.)
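If you want to strip a notebook's outputs by hand, nbstripout also works as a command-line tool. A minimal sketch, where the notebook filename is a placeholder:

conda install -c conda-forge nbstripout
nbstripout analysis.ipynb  # removes all cell outputs from the notebook in-place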
Create runtime environment variable configuration files for each of your projects
When you work on your projects, one assumption you will usually have is that your development environment will look like your project's runtime environment with all of its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables)
Here, I'm assuming that you follow the practice described in Use pyprojroot to define relative paths to the project root.
To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables like so:
export ENV_VAR_1="some_value"
export DATABASE_CONNECTION_STRING="some_database_connection_string"
export ENV_VAR_3="some_other_value"
We use the export syntax here because, in our shells, we can run the command source .env and have the environment variables defined there applied to our environment.
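As a quick sanity check, a minimal sketch of sourcing the file and inspecting one of the variables:

source .env
echo $DATABASE_CONNECTION_STRING  # should print the value defined in .env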
Now, if you're working on a Python project, make sure you have the package python-dotenv (GitHub repo here) installed in the conda environment. Then, in your Python .py source files:
from dotenv import load_dotenv
from pyprojroot import here
import os
dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path) # this will load the .env file in your project directory root.
# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")
In this way, your runtime environment variables get loaded into your shell environment and become available to all child processes started from within the shell (e.g. Jupyter Lab or Python).
Your .env file might contain some sensitive secrets. You should always ensure that your .gitignore file contains .env in it.
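If it isn't listed yet, appending it is a one-liner (run from the project root):

echo ".env" >> .gitignore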
See also: Set up an awesome default gitignore for your projects
Set up pre-commit hooks to automate checks before making Git commits
One way to prevent yourself from committing code that is not properly checked is to use pre-commit hooks. This is a feature of Git that allows you to automatically run checks before they are committed to the repository history. Because they are automatically run, you set them up once, usually when you first download the repository, and no longer need to think about them again.
You can install the pre-commit framework, which lets you easily configure pre-commit hooks to run. The gist of the installation steps is in the bash commands below, but you should read the website for a fuller understanding.
conda install -c conda-forge pre-commit
pre-commit sample-config > .pre-commit-config.yaml
Now, go and edit .pre-commit-config.yaml to add other pre-commit checks, for example. (See below for an example that you can use.) Then, run:
pre-commit install
pre-commit run --all-files # run the checks against all of your files
So what would you put in your .pre-commit-config.yaml? Here's a sane collection of starter checks that I usually include, taken from my Network Analysis Made Simple repository:
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/kynan/nbstripout
    rev: master
    hooks:
      - id: nbstripout
        files: ".ipynb"
nbstripout is a super important one: it ensures that all of my notebook outputs are stripped before committing them to the repository! (Otherwise, you'll end up bloating your repository with large notebooks.)
CI pipeline checks are also a form of automated checks that you can put into your workflow. (For a refresher, or if you're not sure what CI pipeline checks are, see Build a continuous integration pipeline for your source.) Ideally, everything that is checked for in your pre-commit hooks should also be checked for in your CI pipeline.
So what's the difference, then? Here are my thoughts:
In pre-commit hooks, you generally run the lightweight checks: the ones that are annoying to run manually all the time but that also execute very quickly. Think code style checks, for example, or checks that ensure text files end with a single trailing newline.
In the CI system, you run those checks in addition to the longer-running test suite. (See: Write tests that test your custom code.) The CI system thus behaves as a backup to the pre-commit hooks.
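As a concrete sketch of that backup, here is what a minimal CI job that re-runs your pre-commit hooks might look like, assuming GitHub Actions (the workflow name and Python version are placeholders):

name: ci-checks
on: [push, pull_request]
jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pre-commit
      - run: pre-commit run --all-files  # same checks as the local hooks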
Get prepped per project
Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.
The specifics fall into a few buckets: some overall ideas that ground everything else, practices pertaining to Git, notes on organizing files, notes on your compute environment, and notes on good coding practices. Each is covered in its own note in this collection.
Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.
Create configuration files for code checking tools
Configuration files give you the ability to declare your project's preferred configuration and distribute it to all participants in the project. It smooths out the practice of data science (and software development too), as these configurations represent the declared normative state of a project, answering questions such as how code ought to be formatted and which checks ought to be run.
Without these configuration files declaring how code checkers ought to behave, we leave it up to collaborators and contributors to manually configure their local systems, and without sufficient documentation, they may bug you over and over on how things ought to be configured. This increase in friction will inevitably lead to an increase in frustration with the project, and hence a decrease in engagement.
As such, you can think of these configuration files as part of your automation toolkit, thus satisfying the automation philosophy.
Because configuration files are so crucial to a project, I have collated them together on the Configuration file overview page.
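As an illustrative sketch (the specific tools and settings here are assumptions, not prescriptions), code-checker configuration often lives in pyproject.toml:

[tool.black]
line-length = 88  # the line length that black enforces

[tool.isort]
profile = "black"  # keep import sorting compatible with black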
When should you create these configuration files? As always, just-in-time, at the moment that you need them.
Write effective documentation for your projects
As your data science project progresses, you should be documenting your work somehow inside your project. Your future self and other colleagues will need mental context to help get up-to-speed with the project. That mental context can mean the difference between staying on course or veering off in unproductive directions.
Useful documentation helps you quickly onboard collaborators to the project. By reading it, they will get oriented and know how to get things done with your project. You won't be available forever to everyone who might come by, so your documentation effectively scales the longevity and impact of your work.
To write effective documentation, we first need to recognize that there are actually four types of documentation. They are, respectively: tutorials, how-to guides, explanations, and references.
This is not a new concept; it is actually well-documented (ahem!) in the Diataxis Framework.
Concretely, here are some kinds of documentation that you will want to focus on.
The first is custom source code docstrings (a type of Reference). We write docstrings inside Python functions to document what we intend to accomplish with the code block and why that code needs to exist. Be diligent about writing down the why behind the what; it will help you recall the "what" later on.
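For instance, a docstring might look like the following sketch (the function and its details are hypothetical):

def standardize_columns(df):
    """Standardize DataFrame column names to snake_case.

    Why: our upstream data sources name columns inconsistently
    (spaces, mixed case), and our pandera schemas expect one
    canonical form. What: lowercase each name and replace spaces
    with underscores.
    """
    df = df.copy()  # avoid mutating the caller's DataFrame
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df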
The second is how-to guides for newcomers to the project (obviously under the How-to Guides category). These guides help newcomers get up to speed on the commands needed to set up a local development environment for the project: essentially, the sequence of terminal incantations they would need to type to start hacking on it. As always, for non-obvious steps, document the why!
The third involves crafting and telling a story of the project's progress. (We may consider this to be Explanation-style documentation.) For those of you who have done scientific research before, you'll know how this goes: it's essentially the lab meeting presentations that you deliver! Early on, your progress will be granular, but it will gain momentum as the project moves along. Doing this is important because the act of reflecting on prior work, summarizing it, and linearizing it for yourself helps you catch logical gaps that need to be filled in, essentially identifying where you need to focus your project efforts.
The final one is the project README! The README usually exists as README.md or README.txt in the project root directory, and it usually serves dual purposes, as both a quick Tutorial and a How-to Guide.
As for tooling to build documentation, I would advocate that we simultaneously strive to be simple and automated. For Pythonistas, there are two predominant options that you can go with: Sphinx and MkDocs.
At first glance, most in the Python world would advocate for the use of Sphinx, which is the stalwart package used to generate documentation. Sphinx's power lies in its syntax and ecosystem of extensions: you can easily link out to other packages, build API documentation from docstrings, run examples in documentation as tests, and more.
However, if you're not already familiar with Sphinx, I would recommend getting started with MkDocs. Its core design is much simpler, relying only on Markdown files as the source for documentation. That is MkDocs' most significant advantage: from my vantage point, Markdown syntax knowledge is more widespread than Sphinx syntax knowledge; hence, it's much easier to invite collaborators to write documentation together. (Hint: the MkDocs Material theme by Squidfunk has a ton of super excellent features that easily enhance MkDocs!)
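To give a feel for that simplicity, a minimal mkdocs.yml might look like this (the site name, nav entries, and the Material theme choice are all placeholders):

site_name: My Data Project
theme:
  name: material
nav:
  - Home: index.md
  - How-to guides: how-to.md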
Firstly, you should define a single source of truth for statements that you make in your docs. If you can, avoid copy/pasting anything. Related ideas here are written in Define single sources of truth for your data sources.
Secondly, you'll want to pick from several styles of writing. One effective way is to think of documentation as answering critical questions for a project. The questions that commonly show up in data projects mirror those of a scientific research paper and include (but are not limited to) what problem is being solved, what data were used, what methods were applied, and what conclusions were drawn.
If your project also encompasses a tool that helps routinize the project in a production setting, the questions extend to how that tool is installed, configured, and operated.
As one of my reviewers, Simon Eng, mentioned, the overarching point is that your documentation should explain to someone else what's going on in the project.
Finally, it would be best if you used semantic line breaks, also known as semantic line feeds. Go ahead. I know you're curious; click on the links to learn why :).
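In brief, the idea is to break lines at clause boundaries rather than at an arbitrary column width. A hypothetical example of source text written this way:

Semantic line breaks place each clause on its own line,
which keeps version-control diffs small,
because editing one clause no longer reflows the entire paragraph.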
I strongly recommend reading the Write The Docs guide to writing technical documentation.
Additionally, Admond Lee has additional reasons for writing documentation.
Place custom source code inside a lightweight package
Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?
As soon as you did that, you created two sources of truth for that one function.
Even if you intended to modify the function and test the effect of the modification on the rest of the code, you could have done better: any change would now have to be made in two places.
A custom source package that is installed into the conda environment that you have set up will help you refactor code out of the notebook, and hence help you define one source of truth for the entire function, which you can then import anywhere.
Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure; specifically, that you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.
In your project's src/ directory, ensure you have a few files:
|- project_name/        # should be the same name as the conda environment
|  |- data/             # for all data-related functions
|  |  |- loaders.py     # convenience functions for loading data
|  |  |- schemas.py     # this is for pandera schemas
|  |- __init__.py       # this is necessary
|  |- paths.py          # this is for path definitions
|  |- utils.py          # utility functions that you might need
|  |- ...
|- tests/
|  |- test_utils.py     # tests for utility functions
|  |- ...
|- pyproject.toml       # replacement for setup.py
If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)
If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root.
pyproject.toml should look like this:
[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."
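Depending on your tooling, you may also need to declare a build backend so that pip knows how to build the package; a common sketch (the setuptools choice is an assumption, not a requirement):

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"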
Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:
conda activate project_environment
pip install -e .
This will install the source package in development mode. As you continue to add more code into the custom source package, it will be instantly available to you project-wide.
Now, in your projects, you can import anything from the custom source package.
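For example (the module and function names below are hypothetical, following the directory structure above):

# in a notebook or any .py file in the project
from project_name.data.loaders import load_raw_data
from project_name.utils import deduplicate_rows

df = load_raw_data()       # one source of truth for data loading
df = deduplicate_rows(df)  # and for utility transforms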
Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried a few ways of organizing things, I think having src/ is better for DS projects than having the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project, and we can keep the setup.py in src/ too, thus eliminating clutter from the top-level directory.
How often should you lean on the custom package? As often as you need it!
Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!
It feels like a lot to remember, right? Fret not! You can use pyds-cli to easily bootstrap a new project environment!
Set up an awesome default gitignore for your projects
There will be some files you'll never want to commit to Git: think secrets (such as your .env file), large data files, and machine-generated artifacts like notebook checkpoints. If you commit them, you risk leaking credentials, bloating your repository, and cluttering your collaborators' diffs.
Some believe that your .gitignore should be curated. I believe that you should use a good default one that is widely applicable. To do so, go to gitignore.io, fill in the languages and operating systems involved in your project, and copy/paste the one that fits. If you want an awesome default one for Python:
cd /path/to/project/root
curl -o .gitignore https://www.toptal.com/developers/gitignore/api/python
It will have .env available in there too! (See: Create runtime environment variable configuration files for each of your projects.)
How is the .gitignore file parsed?
A .gitignore file is parsed according to the rules on its documentation page. It essentially follows Unix glob syntax while adding on logical modifiers. Here are a few examples to get you oriented.
Ignoring .DS_Store files
These are files generated by macOS' Finder. You can ignore them by appending the following line to your .gitignore:
*.DS_Store
Ignoring the site/ directory
If you use MkDocs to build documentation, it will place the output into the directory site/. You will want to ignore the entire directory by appending the following line:
site/
Ignoring .ipynb_checkpoints directories
If you have Jupyter notebooks inside your repository, you can ignore any path containing .ipynb_checkpoints by appending the following line:
.ipynb_checkpoints
Adding this line will prevent your Jupyter notebook checkpoints from being committed into your Git repository.