Never commit data into version control repositories
Data should never be committed into your Git repositories, for two reasons. First, Git was designed to version small source code files; committing data, a different category of thing from source code, will bloat your repository's size. Second, committing data into repositories means the data get shipped alongside the source code to anybody who has access to the source code. This might not be in line with organizational practices.
That said, in a pinch you sometimes need to work with data locally, so you might have a `data/` directory underneath the project root in which you temporarily store data. You might have chosen `data/` rather than `/tmp/` because it is easier to reference. To avoid accidentally committing any data to the repository, add the data directory to your `.gitignore`:

```
# Above is the rest of your .gitignore
data/
```
The alternative is to ignore any file extensions that you know exclusively belong to the category of things called "data":
```
# Above is the rest of your .gitignore
*.csv
*.xlsx
*.Rdata
```
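To confirm that a pattern actually matches, `git check-ignore` reports which rule (if any) causes a given path to be ignored. A quick sketch, assuming a hypothetical `data/foo.csv` path and the `data/` rule above:

```shell
# Run from the repository root. The -v flag prints the matching
# .gitignore file, line number, and pattern for the given path.
git check-ignore -v data/foo.csv
```

If the command prints a matching rule (and exits with status 0), the path is safely ignored.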
Set up an awesome default gitignore for your projects
There will be some files you'll never want to commit to Git: secrets like `.env` files, operating system artifacts like macOS' `.DS_Store`, and generated outputs like MkDocs' `site/` directory or Jupyter's notebook checkpoints. Committing them bloats the repository with noise and, in the case of secrets, exposes them to anybody with access to the source code.
Some believe that your `.gitignore` should be hand-curated. I believe that you should use a good default one that is widely applicable. To do so, go to gitignore.io, fill in the languages and operating systems involved in your project, and copy/paste the one that fits you. If you want an awesome default one for Python:
```
cd /path/to/project/root
curl https://www.toptal.com/developers/gitignore/api/python
```
It will have `.env` available in there too! (see: Create runtime environment variable configuration files for each of your projects)
The `.gitignore` file is parsed according to the rules on its documentation page. It essentially follows UNIX glob syntax while adding logical modifiers. Here are a few examples to get you oriented:
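As a quick illustration of the syntax (the file names here are hypothetical), glob patterns combine with the `!` negation modifier and leading/trailing slashes like this:

```
# Ignore every .log file anywhere in the repository...
*.log
# ...except this one (the ! negation modifier re-includes it).
!keep-this.log
# A leading slash anchors the pattern: ignore build/ only at the
# repository root, not nested build/ directories.
/build/
# A trailing slash matches directories only, at any depth.
cache/
```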
These are files generated by macOS' Finder. You can ignore them by appending the following line to your `.gitignore`:

```
.DS_Store
```
If you use MkDocs to build documentation, it will place the output into the directory `site/`. You will want to ignore the entire directory by appending the following line:

```
site/
```
If you have Jupyter notebooks inside your repository, you can ignore any path containing `.ipynb_checkpoints/`:

```
.ipynb_checkpoints/
```

Adding this line will prevent your Jupyter notebook checkpoints from being committed into your Git repository.
Handling data in a data science project is very tricky. The notes linked in this section should give you an overview of how to approach handling data in a sane fashion on your project.
Define single sources of truth for your data sources
Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet. This was before collaboratively editing a single Excel spreadsheet was possible. To avoid conflicts, someone creates a `spreadsheet_v2.xlsx`, and at the same time, another person creates their own competing copy.

Which version do you trust?
The worst part? Neither of those spreadsheets contained purely raw data; they were a mix of raw data and derived data (i.e. columns calculated from other columns). The derived data are not documented with why and how they were calculated; their provenance is unknown, in that we don't know who made those changes and whom to ask about those columns.
Rather than wrestling with multiple sources of truth, a data analysis workflow can be much more streamlined by defining a single source of truth for raw data that does not contain anything derived, and then calculating the derived data in custom source code (see: Place custom source code inside a lightweight package) written such that it yields logical derived data structures for the problem (see: Iteratively scope out and define the most appropriate data structures for your problem). Each single source of truth can also be described by a ground truth data descriptor file (see: Write data descriptor files for your data sources), which gives you the provenance of the file and a human-readable descriptor of the source.
If your organization uses the cloud, then AWS S3 (or a compatible bucket store) might be available. A data source might be dumped there and referenced by a single URL. That URL is your single source of truth.
Your organization might have the resources to build out a data store with proper access controls and the like. They might provide a unique key and a software API (RESTful, or a Python or R package) to download data in an easy fashion. That unique key plus the API defines your single source of truth.
Longer-lived organizations might have started out with a shared networked filesystem, with access controls granted by UNIX-style user groups. In this case, the `/path/to/the/data/file` plus access to the shared filesystem is your single source of truth.
This one should be easy to grok: a URL that points to the exact CSV, Parquet, or Excel table, or a zip dump of images, is your unique identifier.
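Whichever of these storage options applies, the single-source-of-truth idea can be sketched as a small registry module that downstream code always goes through. All names, URLs, and paths below are hypothetical:

```python
# Hypothetical registry mapping each raw data source to its one
# canonical location (an S3-style URL, a shared filesystem path, etc.).
DATA_SOURCES = {
    "sales_raw": "https://bucket.example.com/project/raw/sales.csv",
    "customers_raw": "/shared/fs/project/customers.parquet",
}


def source_of_truth(name: str) -> str:
    """Return the canonical location for a named raw data source."""
    return DATA_SOURCES[name]


# Downstream code looks up the registry rather than pointing at
# ad hoc local copies of the data.
sales_location = source_of_truth("sales_raw")
```

Derived columns are then computed in source code that loads from these locations, keeping the raw sources pristine and their provenance unambiguous.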