Define single sources of truth for your data sources

Why define single sources of truth for data

Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet. This was before the days when collaboratively editing a single Excel spreadsheet was possible. To avoid conflicts, someone creates a spreadsheet_v2.xlsx, and then at the same time, another person creates spreadsheet_TE_edits.xlsx.

Which version do you trust?

The worst part? Neither of those spreadsheets contained purely raw data; they were a mix of raw data and derived data (i.e. columns calculated from other columns). The derived columns were not documented with why and how they were calculated; their provenance was unknown, in that we didn't know who made those changes or whom to ask questions about those columns.

Rather than wrestling with multiple sources of truth, a data analysis workflow becomes much more streamlined when you define a single source of truth for raw data that contains nothing derived, and then calculate the derived data in custom source code (see: Place custom source code inside a lightweight package), written so that it yields logical derived data structures for the problem (see: Iteratively scope out and define the most appropriate data structures for your problem). Those single sources of truth can also be described by a ground-truth data descriptor file (see: Write data descriptor files for your data sources), which gives you the provenance of the file and a human-readable description of each of the sources.

Examples of single sources of data truth in action

Data on an s3-like bucket

If your organization uses the cloud, then AWS S3 (or a compatible bucket store) might be available. A data source might be dumped there and referenced by a single URL. That URL is your "single source of truth".

Data on an internal data store

Your organization might have the resources to build out a data store with proper access controls and the like. It might provide a unique key and a software API (RESTful, or a Python or R package) for downloading data easily. That "unique key" + the API defines your single source of truth.

Data on a shared network store

Longer-lived organizations might have started out with a shared networked filesystem, with access controls granted by UNIX-style user groups. In this case, the /path/to/the/data/file + access to the shared filesystem is your source of truth.

Data on the internet

This one should be easy to grok: a URL that points to the exact CSV, Parquet, or Excel table, or a zip dump of images, is your unique identifier.
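For instance, here is a minimal sketch of loading data straight from such a single source of truth with pandas (the URL and bucket path below are hypothetical, and reading from s3:// paths additionally requires the s3fs package to be installed):

import pandas as pd

# A hypothetical public URL serving a CSV file.
measurements = pd.read_csv("https://example.com/datasets/measurements.csv")

# A hypothetical key in an S3-compatible bucket store.
manifest = pd.read_parquet("s3://my-org-bucket/project-data/manifest.parquet")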

Set up your project with a sane directory structure

Why set up your project with a sane directory structure

Doing so will help you find things quickly and easily, which is crucial when navigating your data project. If you don't, you will likely end up utterly confused about where things are located.

What does a sane directory structure look like

I am going to show you one particular example, but you can adapt it to however you like.

|- informative-project-name-here/
   |- data/          # never add anything here into source control
   |- notebooks/     # divide by usernames if needed
   |- scripts/       # basically for automation
   |- importable_name/
      |- __init__.py
      |-...
   |- tests/      # test suite
   |- README.md
   |- pyproject.toml # use this, not setup.py!
   |-...

The purpose of each directory is annotated on each line. That said, you can find relevant information in the pages that follow.

Iteratively scope out and define the most appropriate data structures for your problem

Why you need to define good data structures

Data structures are incredibly important to any modelling problem.

Data structures, when designed well, give us an efficient handle on the problem at hand, especially when a data structure is paired with a programmatic API.

How to design good data structures for a problem

Consider the example where you have a time series measurement. Here's a simple data structure you can use: two lists. It'd look like:

time_index = [0, 5, 10, ...]
values = [193, 283, 111, ...]

Now, while simple, this isn't ideal. The time values don't map onto list positions, so it's difficult to index into the values corresponding to a particular time step. Manipulating and analyzing this data is difficult because of a poor choice of data structures.
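To make the awkwardness concrete, here is what looking up the value recorded at time 10 looks like with the two-list structure above:

# Find the position of time 10 in one list, then use that position
# to index into the other list. Easy to get wrong, and a linear scan every time.
value_at_time_10 = values[time_index.index(10)]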

By contrast, if we instead stuck the data inside a dataframe, things would start to look a bit more sane.

import pandas as pd

df = pd.DataFrame({"time_index": time_index, "measurement": values})

Now, our time index and measurements are no longer divorced from one another. We can write queries against them easily, and plotting is a cinch because the dataframe API supports it directly. Hence, by choosing to structure our data in a dataframe rather than in two lists, we gain a world of capabilities afforded to us by the dataframe API.
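As a small illustration of those capabilities (standard pandas calls, continuing the example above):

# Query by time without any manual positional bookkeeping.
later_measurements = df[df["time_index"] >= 5]

# Plotting comes along for free with the dataframe API.
df.plot(x="time_index", y="measurement")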

Dataframe considerations

Designing a good "dataframe" takes effort too. Once you have your raw data loaded in memory from your single source of truth (see: Define single sources of truth for your data sources), you probably will end up defining new derived columns. These are columns that are calculated on the basis of, or otherwise "derived" from, the "raw data" columns. Examples include:

  • Binarization/quantization of a continuous column.
  • Joining two dataframes together on a key column.
  • Gaussian-standardization of a column.

The "raw data" form the baseline logical unit that can be validated (see: Validate your data wherever practically possible). On top of this baseline logical unit, you can make an arbitrary number of changes to the dataframe. How many changes form a new "logical unit" of changes for which you'll want to define new schema validation checks? This is an important question to think about, because after all, your dataframes form the "data API", and it'll be implicated in the pandera schemas and data descriptors you end up writing! (see: Write data descriptor files for your data sources).

Place custom source code inside a lightweight package

Why write a package for your custom source code

Have you encountered the situation where you create a new notebook, and then promptly copy code verbatim from another notebook with zero modifications?

As soon as you did that, you created two sources of truth for that one function.

Even if you intended to modify the function and test the effect of the modification on the rest of the code, you could still have done better.

A custom source package installed into the conda environment you have set up will help you refactor code out of the notebook, and hence define one source of truth for the function, which you can then import anywhere.

How to create a custom source package for a project

Firstly, I'm assuming you are following the ideas laid out in Set up your project with a sane directory structure. Specifically, you have a src/ directory under the project root. Here, I'm going to give you a summary of the official Python packaging tutorial.

In your project directory, ensure you have a few files:

|- project_name/     # should be the same name as the conda environment
   |- data/          # for all data-related functions
      |- loaders.py  # convenience functions for loading data
      |- schemas.py  # this is for pandera schemas
   |- __init__.py    # this is necessary
   |- paths.py       # this is for path definitions
   |- utils.py       # utility functions that you might need
   |- ...
|- tests/
   |- test_utils.py  # tests for utility functions
   |- ...
|- pyproject.toml    # replacement for setup.py

If you're wondering about why we name the source package the same name as our conda environment, it's for consistency purposes. (see: Sanely name things consistently)

If you're wondering about the purpose of paths.py, read this page: Use pyprojroot to define relative paths to the project root

pyproject.toml should look like this:

[project]
name = "my-package-name"
version = "0.1.0"
authors = [{name = "EM", email = "me@em.com"}]
description = "Something cool here."
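Depending on your pip and setuptools versions, you may also need to declare a build backend explicitly for the editable install below to work; a common (though not the only) choice is setuptools:

[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.build_meta"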

Now, you activate the environment dedicated to your project (see: Create one conda environment per project) and install the custom source package:

conda activate project_environment
pip install -e .

This will install the source package in development mode. As you continue to add more code to the custom source package, it will be instantly available to you project-wide.

Now, in your projects, you can import anything from the custom source package.
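For example (the module and function names here are hypothetical; substitute whatever you have defined in your own package):

# In any notebook, script, or module in the project:
from project_name.data.loaders import load_timeseries

df = load_timeseries()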

Note: If you've read the official Python documentation on packages, you might see that src/ has nothing special in its name. (Indeed, one of my reviewers, Arkadij Kummer, pointed this out to me.) Having tried organizing projects a few ways, I think having src/ is better for DS projects than putting the setup.py file and source_package/ directory in the top-level project directory. Those two are better isolated from the rest of the project, and we can keep setup.py inside src/ too, thus eliminating clutter from the top-level directory.

How often should the package be updated?

As often as you need it!

Also, I would encourage you to avoid releasing the package standalone until you know that it ought to be used as a standalone Python package. Otherwise, you might prematurely bring upon yourself a maintenance burden!

Is there an easier way to set this all up?

It feels like a lot to remember, right? Fret not! You can use pyds-cli to easily bootstrap a new project environment!

Use pyprojroot to define relative paths to the project root

Why you should use pyprojroot

If you follow the practice of One project should get one git repository, then everything related to the project will be housed inside that repository. Under this assumption, if you also develop a custom source code library for your project (see Place custom source code inside a lightweight package for why), then you'll likely encounter the need to find paths to things, such as data files, relative to the project root. Rather than hard-coding paths into your library and Jupyter notebooks, you can instead leverage pyprojroot to define a library of paths that are useful across the project.

How do you use pyprojroot effectively

Firstly, make sure you have an importable source_package.paths module. (I'm assuming you have written a custom source package!) In there, define project paths:

from pyprojroot import here

root = here(proj_files=[".git"])
notebooks_dir = root / "notebooks"
data_dir = root / "data"
timeseries_data_dir = data_dir / "timeseries"

here() returns a Python pathlib.Path object.

You can go as granular or as coarse-grained as you want.

Then, inside your Jupyter notebooks or Python scripts, you can import those paths as needed.

from source_package.paths import timeseries_data_dir
import pandas as pd

data = pd.read_csv(timeseries_data_dir / "2016-2019.csv")

Now, if for whatever reason you have to move the data files to a different subdirectory (say, to keep things even more organized than you already are, you awesome person!), then you just have to update one location in source_package.paths, and you're able to reference the data file in all of your scripts!

See also: Define single sources of truth for your data sources.

Never commit data into version control repositories

Why you should never commit data to Git

Data should never be committed into your Git repositories. Git was designed to version small source code files; committing data, which is a different category of thing from source code, will first and foremost bloat your repository. Also, committing data into repositories means the data gets shipped alongside the source code to anybody who has access to the source code, which might not be in line with organizational practices.

Add data to .gitignore

That said, in a pinch sometimes you need to work with data locally, so you might have a data/ directory underneath the project root in which you temporarily store data. You might have chosen data/ rather than /tmp/ because it is easier to reference. To avoid accidentally committing any data to the repository, you might want to add the data directory to your .gitignore file:

# Above is the rest of your .gitignore
data/

The alternative is to ignore any file extensions that you know exclusively belong to the category of things called "data":

# Above is the rest of your .gitignore
*.csv
*.xlsx
*.Rdata


Write data descriptor files for your data sources

Why write data descriptor files

When you get a new CSV file, how do you know what the semantic meaning of each column is, what the null values are, and other background information about that file?

Usually, we'd go in and ask another person. However, that's not scalable. Instead, if we had a human-readable text file that provided all of the aforementioned information, that would be awesome! Enter the data descriptor file. (In the clinical research world, these are also known as "data dictionaries".)

But beyond that, the data descriptor file has another benefit! It takes manual work to sit down, comb through each file, and describe each of its columns, where the data came from, and more. That manual work is part of understanding the data-generating process, which is incredibly helpful for downstream modelling efforts. In essence, writing a data descriptor file per data file is a great first step in the exploratory data analysis (EDA) stage, because you are literally exploring the structure of the data.

These are two great reasons to write descriptor files, which beat out the single downside: "it takes time".

How do you write data descriptor files

At its most basic form, you can simply write a README file for each data source. Plain text, fully customizable.

That said, some lightweight structure can help. I have previously opted for a YAML file format, which is both human readable and computer-parseable. In that YAML file, we can describe the table schema using the frictionless data TableSchema spec. One can also go for the full JSON that they specify (but it's not as easy to write by hand). In choosing to go with a specification, we effectively gain a checklist, helping us remember to describe everything that could be necessary!
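As a rough sketch of what such a YAML descriptor might look like for the time series example used elsewhere in these notes (the field names and values are hypothetical; fields and missingValues follow the Table Schema spec):

path: data/timeseries/2016-2019.csv
description: Hourly sensor measurements collected between 2016 and 2019.
provenance: Exported from the internal data store; ask the data engineering team about this file.
fields:
  - name: time_index
    type: integer
    description: Seconds elapsed since the start of each experiment.
  - name: measurement
    type: number
    description: Sensor reading, in arbitrary units.
missingValues: ["", "NA"]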

Alternatives to data descriptor files

If you primarily handle tabular data (which, if my understanding is correct, forms the vast majority of data science use cases), then I would strongly suggest using pandera to not only validate your data (see: Validate your data wherever practically possible) but also to generate dataframe schemas that you can store as code. Pandera comes with the ability to generate a starter dataframe schema that one can continually update as data arrive. Storing your data descriptor as code not only allows you to annotate it with comments but also use it for validation itself: a double win.
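For instance, here is a minimal sketch of that workflow, assuming df holds a dataframe you have already loaded:

import pandera as pa

# Infer a starter schema from data already in memory...
schema = pa.infer_schema(df)

# ...and emit it as Python code (e.g. to paste into schemas.py),
# where it lives under version control and can be refined by hand.
print(schema.to_script())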

Handling data

How to handle data

Handling data in a data science project is very tricky. Primarily, we have to worry about the following:

  1. Availability: How do I make the data that my project depends on available to others who want to work on the project?
  2. Validation: How do I know whether my data are exactly what I think they should be?
  3. Flow complexity: How do I combat the entropy (complexity) that grows as the project develops?
  4. Provenance: If I have a problem with the data, whom should I ask questions about it?

The notes linked in this section should give you an overview of how to approach handling data in a sane fashion on your project.

Write effective documentation for your projects

Why write documentation

As your data science project progresses, you should be documenting your work somehow inside your project. Your future self and other colleagues will need mental context to help get up-to-speed with the project. That mental context can mean the difference between staying on course or veering off in unproductive directions.

Useful documentation helps you quickly onboard collaborators to the project. It helps them get oriented and learn how to get things done with your project. You won't be available forever to everyone who might come by, so your documentation effectively scales the longevity and impact of your work.

How do you write useful documentation

To write effective documentation, we first need to recognize that there are actually four types of documentation. They are, respectively:

  1. Tutorials
  2. How-to Guides
  3. Explanations
  4. References

This is not a new concept; it is actually well-documented (ahem!) in the Diataxis Framework.

Concretely, here are some kinds of documentation that you will want to focus on.

The first is custom source code docstrings (a type of Reference). We write docstrings inside Python functions to document what we intend to accomplish with the code block and why that code needs to exist. Be diligent about writing down the why behind the what; it will help you recall the "what" later on.
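A short, hedged sketch of the kind of docstring this is pointing at (the function, column name, and threshold are hypothetical):

def drop_incomplete_runs(df, min_measurements=10):
    """Remove experimental runs with too few measurements.

    Why: very short runs typically come from aborted experiments,
    and keeping them would bias downstream estimates.
    """
    return df.groupby("run_id").filter(lambda run: len(run) >= min_measurements)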

The second is how-to guides for newcomers to the project (obviously under the How-to Guides category). These guides help newcomers get up to speed on the commands needed to set up a local development environment for the project: essentially the sequence of terminal incantations they need to type to start hacking on it. For non-obvious steps, always document the why!

The third involves crafting and telling a story of the project's progress. (We may consider this Explanation-style documentation.) For those of you who have done scientific research before, you'll know how this goes: it's essentially the lab meeting presentations that you deliver! Early on, your updates will be granular, but they will gain momentum as the project progresses. Doing this is important because the act of reflecting on prior work, summarizing it, and linearizing it for yourself helps you catch logical gaps that need to be filled in, essentially identifying where you need to focus your project efforts.

The final one is the project README! The README usually exists as README.md or README.txt in the project root directory and serves a few purposes:

  1. Giving an overview of why the project exists.
  2. Providing an overview of the "rules of engagement" with the project.
  3. Serving up a "Quickstart" or "Installation" section to guide users on how to get set up.
  4. Showing an example of what they can do with the project.

The README file usually serves a dual purpose, as both a quick Tutorial and a How-to Guide.

What tools should we use to write documentation?

On this matter, I would advocate that we simultaneously strive to be simple and automated. For Pythonistas, there are two predominant options that you can go with: Sphinx and MkDocs.

Sphinx

At first glance, most in the Python world would advocate for the use of Sphinx, which is the stalwart package used to generate documentation. Sphinx's power lies in its syntax and ecosystem of extensions: you can easily link out to other packages, build API documentation from docstrings, run examples in documentation as tests, and more.

MkDocs

However, if you're not already familiar with Sphinx, I would recommend getting started with MkDocs. Its core design is much simpler, relying only on Markdown files as the source for documentation. That is MkDocs' most significant advantage: from my vantage point, Markdown syntax knowledge is more widespread than Sphinx syntax knowledge; hence, it's much easier to invite collaborators to write documentation together. (Hint: the MkDocs Material theme by Squidfunk has a ton of excellent features that enhance MkDocs!)
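To make the simplicity concrete, here is a minimal sketch of an mkdocs.yml (the page names under nav are hypothetical):

site_name: Informative Project Name Here
theme:
  name: material   # the Material for MkDocs theme mentioned above
nav:
  - Home: index.md
  - Getting started: getting-started.md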

What principles should we keep in mind when writing docs?

Single source of truth

Firstly, you should define a single source of truth for statements that you make in your docs. If you can, avoid copy/pasting anything. Related ideas here are written in Define single sources of truth for your data sources.

Write to the audience

Secondly, you'll want to pick from several styles of writing. One effective way is to think of it in terms of answering critical questions for a project. An example list of questions that commonly show up in data projects mirrors that of a scientific research paper and includes (but is not limited to):

  • What question does this project answer? What problem are you solving through this project? What is the bigger context of this project?
  • What are the data backing the project, and from where do they come? Where is the data description? (see also: Write data descriptor files for your data sources)
  • What methods were used in the project?
  • What key insights should be gained from this project?

If your project also encompasses a tool that helps routinize the project in a production setting:

  • What is the deployment strategy for the project? What prerequisites are needed before we can "deploy" the project?
  • What code/commands need to be executed at the command line/REPL/Jupyter notebook to use the tools built in this project?
  • What are the tools available for the visualization of model results, and how should they be interpreted?

As one of my reviewers, Simon Eng, mentioned, the overarching point is that your documentation should explain to someone else what's going on in the project.

Use semantic line breaks

Finally, it would be best if you used semantic line breaks, also known as semantic line feeds. Go ahead. I know you're curious; click on the links to learn why :).
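For example, a paragraph written with semantic line breaks puts one sentence or clause per source line, while still rendering as ordinary flowing prose:

Useful documentation helps you onboard collaborators quickly.
It gives them the mental context they need,
so that they can get things done with your project
without having to wait for you to be available.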

Resources

I strongly recommend reading the Write The Docs guide to writing technical documentation.

Additionally, Admond Lee has additional reasons for writing documentation.