Never commit data into version control repositories

Why you should never commit data to Git

Data should never be committed into your Git repositories. Git was designed to version small source code files, and data are a different category of thing entirely: committing data will, first and foremost, bloat your repository's size. Committing data also means the data get shipped alongside the source code to anybody who has access to the source code, which may not be in line with your organization's data access practices.

Add data to .gitignore

That said, sometimes you need to work with data locally in a pinch, so you might have a data/ directory underneath the project root in which you temporarily store data. You might choose data/ rather than /tmp/ because it is easier to reference. To avoid accidentally committing any data to the repository, add the data directory to your .gitignore file:

# Above is the rest of your .gitignore
data/

The alternative is to ignore any file extensions that you know belong exclusively to the category of things called "data":

# Above is the rest of your .gitignore
*.csv
*.xlsx
*.RData

Define single sources of truth for your data sources

Why define single sources of truth for data

Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet. This was before the days when collaboratively editing a single Excel spreadsheet was possible. To avoid conflicts, someone creates a spreadsheet_v2.xlsx, and then at the same time, another person creates spreadsheet_TE_edits.xlsx.

Which version do you trust?

The worst part? Neither of those spreadsheets contained purely raw data; they were a mix of both raw data and derived data (i.e. columns calculated from other columns). The derived data were not documented with why and how they were calculated; their provenance was unknown, in that we didn't know who made those changes and whom to ask questions about those columns.

Rather than wrestling with multiple sources of truth, a data analysis workflow can be much more streamlined by defining a single source of truth for raw data that does not contain anything derived, and then calculating the derived data in custom source code (see: Place custom source code inside a lightweight package), written in such a way that it yields logical derived data structures for the problem (see: Iteratively scope out and define the most appropriate data structures for your problem). Those single sources of truth can also be described by a ground truth data descriptor file (see: Write data descriptor files for your data sources), which gives you the provenance of the file and a human-readable description of each of the sources.
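
To make the split concrete, here is a minimal Python sketch of the pattern; the column names and the yield_fraction calculation are hypothetical:

import pandas as pd

def load_raw_data(source: str) -> pd.DataFrame:
    """Load raw data from the single source of truth; nothing derived lives here."""
    return pd.read_csv(source)

def add_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Compute all derived columns in code, so their provenance is documented here."""
    # Hypothetical derived column: a ratio of two raw columns.
    return df.assign(yield_fraction=df["product_mass"] / df["input_mass"])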

Examples of single sources of data truth in action

Data on an S3-like bucket

If your organization uses the cloud, then AWS S3 (or a compatible bucket store) might be available. A data source might be dumped there and referenced by a single URL. That URL is your single source of truth.
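
For example, loading from such a URL might look like the following sketch; the bucket and key are hypothetical, and pandas needs the s3fs package installed to read s3:// paths directly:

import pandas as pd

# Hypothetical bucket and key; this URL is the single source of truth.
df = pd.read_csv("s3://acme-data-lake/projects/widget-sales/raw/sales.csv")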

Data on an internal data store

Your organization might have the resources to build out a data store with proper access controls and the like. It might provide a unique key and a software API (RESTful, or a Python or R package) to download data in an easy fashion. That "unique key" + the API defines your single source of truth.
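
What this looks like in code depends entirely on the store; as a purely illustrative sketch, where the company_datastore package, its Client class, and the dataset key are all invented:

# Hypothetical internal client; your organization's actual API will differ.
from company_datastore import Client

client = Client()
# The unique key plus this API call together define the single source of truth.
df = client.get_dataset("widget-sales/raw/sales", version="2024-01-15")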

Data on a shared network store

Longer-lived organizations might have started out with a shared network filesystem, with access controls granted by UNIX-style user groups. In this case, the /path/to/the/data/file + access to the shared filesystem is your single source of truth.
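
In code, this can be as simple as reading from the agreed-upon path; the path below is hypothetical:

from pathlib import Path

import pandas as pd

# Hypothetical path on the shared filesystem; UNIX group membership gates access.
DATA_PATH = Path("/shared/projects/widget-sales/raw/sales.csv")
df = pd.read_csv(DATA_PATH)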

Data on the internet

This one should be easy to grok: a URL that points to the exact CSV, Parquet, or Excel table, or to a zip dump of images, is your single source of truth.
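
For example, pulling down a hypothetical zip dump using only the Python standard library:

import urllib.request

# Hypothetical URL; the exact link is the single source of truth.
url = "https://example.com/datasets/widget-images/raw/images.zip"
urllib.request.urlretrieve(url, "images.zip")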

Set up an awesome default gitignore for your projects

Why set up a "gitignore" file?

There will be some files you'll never want to commit to Git. These include:

  • Files that contain passwords and other secrets.
  • Files that contain runtime environment variables (which themselves might be secrets).
  • Large files, such as images and binaries, unless they are essential assets. (A rule of thumb is that anything >500 KB is "large" by Git standards.)
  • Jupyter notebooks that contain outputs.
  • Data file directories. (see: Never commit data into version control repositories)

If you commit them, then:

  1. Secrets and other sensitive runtime information may linger in your repository and become exposed to the world.
  2. Your repository's history (and hence its size) will explode as changes accumulate in large binary files.

How do I set up an awesome "gitignore" file?

Some believe that your .gitignore should be curated. I believe that you should use a good default that is widely applicable. To do so, go to gitignore.io, fill in the languages and operating systems involved in your project, and copy/paste the generated file. If you want an awesome default one for Python:

cd /path/to/project/root
curl https://www.toptal.com/developers/gitignore/api/python > .gitignore

It will have .env available in there too! (see: Create runtime environment variable configuration files for each of your projects)

How is a .gitignore file parsed?

A .gitignore file is parsed according to the rules on its documentation page. It essentially follows unix glob syntax, adding a few logical modifiers on top. Here are a few examples to get you oriented:

Example 1: Ignore all .DS_Store files

These are files generated by macOS' Finder. You can ignore them by appending the following line to your .gitignore:

*.DS_Store

Example 2: Ignore all files under site/

If you use MkDocs to build documentation, it will place the output into the directory site/. You will want to ignore the entire directory by appending the following line:

site/

Example 3: Ignore all .ipynb_checkpoints directories

If you have Jupyter notebooks inside your repository, you can ignore any path containing .ipynb_checkpoints by appending the following line:

.ipynb_checkpoints

Adding this line will prevent your Jupyter notebook checkpoints from being committed into your Git repository.

Handling data

How to handle data

Handling data in a data science project is very tricky. Primarily, we have to worry about the following:

  1. Availability: How do I make the data that my project depends on available to others who want to work on the project?
  2. Validation: How do I know whether my data are exactly what I think they should be? (A minimal sketch of such a check appears after this list.)
  3. Flow complexity: How do I combat the entropy (complexity) that grows as the project develops?
  4. Provenance: If I have a problem with the data, whom should I ask questions about it?
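
As a small illustration of the validation concern, a hand-rolled check, with hypothetical column names, might look like:

import pandas as pd

def validate_raw_data(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly if the data are not what we think they should be."""
    assert "sample_id" in df.columns, "expected a sample_id column"
    assert df["sample_id"].is_unique, "sample IDs should be unique"
    assert (df["input_mass"] >= 0).all(), "masses should be non-negative"
    return df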

The notes linked in this section should give you an overview of how to approach handling data in a sane fashion on your project.