Set up an awesome default gitignore for your projects

Why setup a "gitignore" file?

There will be some files you'll never want to commit to Git. Some include:

  • Files that contain passwords and other secrets.
  • Files that contain runtime environment variables (which themselves might be secrets).
  • Large files, such as images and binaries, unless they are essential assets. (A rule of thumb is anything >500 kb is "large" by Git standards.)
  • Jupyter notebooks that contain outputs.
  • Data file directories. (see: Never commit data into version control repositories)

If you commit them, then:

  1. Secrets and other sensitive runtime information may linger in your repository and become exposed to the world.
  2. Your repository will explode in history as changes happen to the large binary files.

How do I set up an awesome "gitignore" file?

Some believe that your .gitignore should be curated. I believe that you should use a good default one that is widely applicable. To do so, go to gitignore.io, fill in the languages and operating systems involved in your project, and copy/paste the one that fits you. If you want an awesome default one for Python:

cd /path/to/project/root
curl https://www.toptal.com/developers/gitignore/api/python

It will have .env available in there too! (see: Create runtime environment variable configuration files for each of your projects)

How is a .gitignore file parsed?

A .gitignore file is parsed according to the rules on its documentation page. It essentially follows the unix glob syntax while adding on logical modifiers. Here are a few examples to get you oriented:

Example 1: Ignore all .DS_Store files

These are files generated by macOS' Finder. You can ignore them by appending the following line to your .gitignore:

*.DS_Store

Example 2: Ignore all files under site/

If you use MkDocs to build documentation, it will place the output into the directory site/. You will want to ignore the entire directory appending the following line:

site/

Example 3: Ignore all .ipynb_checkpoints directories

If you have Jupyter notebooks inside your repository, you can ignore any path containing .ipynb_checkpoints.

.ipynb_checkpoints

Adding this line will prevent your Jupyter notebook checkpoints from being committed into your Git repository.

Create runtime environment variable configuration files for each of your projects

Why configure environment variables per project

When you work on your projects, one assumption you will usually have is that your development environment will look like your project's runtime environment with all of its environment variables. The runtime environment is usually your "production" setting: a web app or API, a model in a pipeline, or a software package that gets distributed. (For more on environment variables, see: Take full control of your shell environment variables)

How to configure environment variables for your project

Here, I'm assuming that you follow the practice of One project should get one git repository and that you Use pyprojroot to define relative paths to the project root.

To configure environment variables for your project, a recommended practice is to create a .env file in your project's root directory, which stores your environment variables as such:

export ENV_VAR_1 = "some_value"
export DATABASE_CONNECTION_STRING = "some_database_connection_string"
export ENV_VAR_3 = "some_other_value"

We use the export syntax here because we can, in our shells, run the command source .env and have the environment variables defined in there applied to our environment.

Now, if you're using a Python project, make sure you have the package python-dotenv (Github repo here) installed in the conda environment. Then, in your Python .py source files:

from dotenv import load_dotenv
from pyprojroot import here
import os

dotenv_path = here() / ".env"
load_dotenv(dotenv_path=dotenv_path)  # this will load the .env file in your project directory root.

# Now, get the environment variable.
DATABASE_CONNECTION_STRING = os.getenv("DATABASE_CONNECTION_STRING")

In this way, your runtime environment variables get loaded into the runtime environment, and become available to all child processes started from within the shell (e.g. Jupyter Lab, or Python, etc.).

Always gitignore your .env file

Your .env file might contain some sensitive secrets. You should always ensure that your .gitignore file contains .env in it.

See also: Set up an awesome default gitignore for your projects

Adhere to best git practices

Why adhere to best Git practices?

Git is a unique piece of software. It does one and only one thing well: store versions of hand-curated files. Adhering to Git best practices will ensure that you use Git in its intended fashion.

What best practices should we adhere to?

The most significant point to keep in mind: only commit to Git files that you have had to create manually. That usually means version controlling:

  1. Source code. See: Place custom source code inside a lightweight package)
  2. Configuration files. See:
    1. Create runtime environment variable configuration files for each of your projects
    2. Create configuration files for code checking tools
  3. Documentation. See: Write effective documentation for your projects

There are also things you should actively avoid committing.

For specific files, you can set up a .gitignore file. See the page Set up an awesome default gitignore for your projects for more information on preventing yourself from committing them automatically.

For Jupyter notebooks, it is considered good practice to avoid committing notebooks that still have outputs. It would be best if you cleared them out using nbstripout. If you need a way to automate the cleaning of cell outputs before committing Jupyter notebooks, consider setting up pre-commit hooks using pre-commit. (See: Set up pre-commit hooks to automate checks before making Git commits)

Never commit data into version control repositories

Why you should never commit data to Git

Data should never be committed into your Git repositories. This is because git was designed to version small files of source code; committing data, a different category of things from source code, into your repositories will first and foremost lead to repository size bloat. Also, committing data into repositories means the data get shipped alongside the source code to anybody who has access to the source code. This might not necessarily be in-line with organizational practices.

Add data to .gitignore

That said, in a pinch sometimes you need to work with data locally, so you might have a data/ directory underneath the project root in which you temporarily store data. You might have chosen data/ rather than /tmp/ because it is easier to reference. To avoid accidentally committing any data to the repository, you might want to add the data directory to your .gitignore file:

# Above is the rest of your .gitignore
data/

The alternative is to ignore any file extensions that you know exclusively belong to the category of things called "data":

# Above is the rest of your .gitignore
*.csv
*.xlsx
*.Rdata

See also

Get prepped per project

Treat your projects as if they were software projects for maximum organizational effectiveness. Why? The biggest reason is that it will nudge us towards getting organized. The "magic" behind well-constructed software projects is that someone sat down and thought clearly about how to organize things. The same principle can be applied to data analysis projects.

Firstly, some overall ideas to ground the specifics:

Some ideas pertaining to Git:

Notes that pertain to organizing files:

Notes that pertain to your compute environment:

And notes that pertain to good coding practices:

Treating projects as if they were software projects, but without software engineering's stricter practices, keeps us primed to think about the generalizability of what we do, but without the over-engineering that might constrain future flexibility.

One project should get one git repository

Why one project should get one Git repository

This helps a ton with organization. When you have one project targeted to one Git repository, you can easily house everything related to that project in that one Git repository. I mean everything. This includes:

In doing so, you have one mental location that you can point to for everything related to a project. This is a saner way of operating than over-engineering the separation of concerns at the beginning, with docs in one place and out-of-sync with the source code in another place... you get where we're going with this point.

How to get this implemented

Easy! Create your Git repo for the project, and then start putting stuff in there :).

Enough said here!

What should you name the Git repo? See the page: Sanely name things consistently

After you have set up your Git repo, make sure to Set up your project with a sane directory structure.

Also, Set up an awesome default gitignore for your projects!