Iteratively scope out and define the most appropriate data structures for your problem

Why you need to define good data structures

Data structures are incredibly important to any modelling problem.

Data structures, when designed well, give us an efficient handle over the problem at hand, especially when a data structure is paired with a programmatic API.

How to design good data structures for a problem

Consider the example where you have a time series measurement. Here's a simple data structure you can use: two lists. It'd look like:

time_index = [0, 5, 10, ...]
values = [193, 283, 111, ...]

Now, while simple, it's not ideal. The time index and the values live in two separate lists, so nothing ties a measurement to its time step, and it's difficult to index into the values corresponding to a particular time step. Manipulating and analyzing these data is difficult because of a poor choice of data structure.

By contrast, if we instead stuck the data inside a dataframe, things would start to look a bit more sane.

import pandas as pd
df = pd.DataFrame({"time_index": time_index, "measurement": values})

Now, our time index and measurements are no longer divorced from one another. We can write queries against them easily. Plotting is a cinch too, because the dataframe API supports it. Hence, by choosing to structure our data in a dataframe rather than in two lists, we gain a world of capabilities afforded to us by the dataframe API.
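As a minimal sketch of what the dataframe API buys us, the snippet below uses only the first three time points and values shown in the lists above (the elided remainder is left out). The names `at_t5` and `high` are illustrative, not part of the original example.

```python
import pandas as pd

# Only the first three entries from the lists above; the rest is elided there.
time_index = [0, 5, 10]
values = [193, 283, 111]

df = pd.DataFrame({"time_index": time_index, "measurement": values})

# Look up the measurement at a particular time step by label, not by position.
at_t5 = df.set_index("time_index").loc[5, "measurement"]

# Filter rows with a query written against the column names.
high = df.query("measurement > 150")

print(at_t5)
print(high["time_index"].tolist())
```

Plotting would be a similar one-liner, such as `df.plot(x="time_index", y="measurement")`, which needs matplotlib installed.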

Dataframe considerations

Designing a good "dataframe" takes effort too. Once you have your raw data loaded in memory from your single source of truth (see: Define single sources of truth for your data sources), you probably will end up defining new derived columns. These are columns that are calculated on the basis of, or otherwise "derived" from, the "raw data" columns. Examples include:

  • Binarization/quantization of a continuous column.
  • Joining two dataframes together on a key column.
  • Gaussian-standardization of a column.
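The three kinds of derived columns listed above can be sketched in pandas as follows. All table names, column names, values, and the threshold here are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: column names and values are made up for illustration.
raw = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "concentration": [0.5, 1.5, 2.5, 3.5],
})
# Hypothetical second table that shares the "sample_id" key column.
metadata = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "batch": ["a", "a", "b", "b"],
})

derived = raw.assign(
    # Binarization: threshold a continuous column (the cutoff 2.0 is arbitrary).
    is_high=lambda d: d["concentration"] > 2.0,
    # Standardization: subtract the mean, divide by the sample standard deviation.
    concentration_z=lambda d: (d["concentration"] - d["concentration"].mean())
    / d["concentration"].std(),
).merge(metadata, on="sample_id", how="left")  # Join on the key column.

print(derived)
```

Keeping the raw columns untouched and adding derived ones alongside them makes it easy to see which columns can be traced straight back to the source.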

The "raw data" form the baseline logical unit that can be validated (see: Validate your data wherever practically possible). On top of this baseline logical unit, you can make an arbitrary number of changes to the dataframe. How many changes form a new "logical unit" of changes for which you'll want to define new schema validation checks? This is an important question to think about, because after all, your dataframes form the "data API", and they'll be reflected in the pandera schemas and data descriptors you end up writing! (see: Write data descriptor files for your data sources).

Write data descriptor files for your data sources

Why write data descriptor files

When you get a new CSV file, how do you know what the semantic meaning of each column is, what null values are, and other background information of that file?

Usually, we'd go in and ask another person. However, that's not scalable. Instead, if we provided a human-readable text file that provided all of the aforementioned information, that would be awesome! In comes the data descriptor file. (In the clinical research world, they are also known as "data dictionaries".)

But beyond that, the data descriptor file has another benefit! It takes manual work to sit down and comb through each file and provide a description of each of its columns, where the data came from, and more. This is all part of the process of understanding the data generating process, which is incredibly helpful for downstream modelling efforts. In essence, writing a data descriptor file per data file is a great first step in exploratory data analysis (EDA), because you are literally exploring the structure of the data.

These are two great reasons to write descriptor files, which beat out the single downside: "it takes time".

How do you write data descriptor files

At its most basic form, you can simply write a README file for each data source. Plain text, fully customizable.

That said, some lightweight structure can help. I have previously opted for a YAML file format, which is both human-readable and computer-parseable. In that YAML file, we can describe the table schema using the Frictionless Data Table Schema spec. One can also go for the full JSON that they specify (but it's not as easy to write by hand). In choosing to go with a specification, we effectively gain a checklist, helping us remember to describe everything that could be necessary!
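To give a feel for what such a descriptor might contain, here is a small sketch that builds one as a Python dictionary and serializes it to JSON. The file it describes, the column names, and the descriptions are hypothetical; the keys (`fields`, `type`, `constraints`, `missingValues`) follow the layout of the Table Schema spec as I understand it, so check the spec before relying on them.

```python
import json

# Hypothetical descriptor for a made-up measurements file; the keys follow
# the Table Schema layout (fields, types, constraints, missingValues).
descriptor = {
    "fields": [
        {
            "name": "time_index",
            "type": "integer",
            "description": "Time since the start of the experiment.",
            "constraints": {"required": True, "minimum": 0},
        },
        {
            "name": "measurement",
            "type": "number",
            "description": "Instrument reading at that time point.",
        },
    ],
    # Strings in the raw file that should be read as null.
    "missingValues": ["", "NA"],
}

# Serialize to JSON text; the same dictionary could be dumped to YAML instead.
text = json.dumps(descriptor, indent=2)
print(text)
```

The same structure written by hand in YAML tends to be easier on the eyes, which is the main reason to prefer YAML for descriptors that people edit directly.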

Alternatives to data descriptor files

If you primarily handle tabular data (which, if my understanding is correct, forms the vast majority of data science use cases), then I would strongly suggest using pandera to not only validate your data (see: Validate your data wherever practically possible) but also to generate dataframe schemas that you can store as code. Pandera comes with the ability to generate a starter dataframe schema that one can continually update as data arrive. Storing your data descriptor as code not only allows you to annotate it with comments but also use it for validation itself: a double win.

Handling data

How to handle data

Handling data in a data science project is very tricky. Primarily, we have to worry about the following:

  1. Availability: How do I make the data that my project depends on available to others who want to work on the project?
  2. Validation: How do I know whether my data are exactly what I think they should be?
  3. Flow complexity: How do I combat the entropy (complexity) that grows as the project develops?
  4. Provenance: If I have a problem with the data, whom should I ask questions about it?

The notes linked in this section should give you an overview on how to approach handling data in a sane fashion on your project.

Validate your data wherever practically possible

What is "data validation"?

To understand data validation, we have to back up a little bit and consider the simplest case of tabular data.

We canonically understand tabular data as having columns and rows. Rows are, in a statistical sense, "samples". Columns, then, are measured attributes of the samples. Each of the measured attributes has a range of values for which it is semantically valid. (In statistics, this is analogous to the support of a probability distribution, which is the set of values to which the distribution assigns non-zero probability or density.) Validation of tabular data, then, refers to the act of ensuring that the measured attributes are, for lack of a better word, valid.

To make this clearer, let me illustrate the ways that "validated" data might look.

From a statistical standpoint:

  1. For continuous measurement data, the measurement values fall within semantically valid ranges.
    1. Unbounded in statistical language means support from negative to positive infinity.
    2. Bounded data usually would have at least one of "minimum" or "maximum" values stated.
  2. For discrete measurement data, the measurement values fall within a set of semantically valid options.
  3. There are no null values present in columns that should not have them.

From a computational standpoint:

  1. Each column's data are of the correct data type (integer, float, categorical, object) for interoperability with other code that you might write.
  2. Columns are named precisely in line with their references in the codebase.
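The statistical and computational checks listed above can be sketched with plain pandas. This is a hand-rolled illustration, not the API of any validation library; the column names `temperature_c` and `status`, their bounds, and their allowed values are assumptions made up for the example.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    # Computational check: the expected columns are present under exact names.
    expected = {"temperature_c", "status"}
    missing = expected - set(df.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    # Computational check: the continuous column holds floats.
    if not pd.api.types.is_float_dtype(df["temperature_c"]):
        problems.append("temperature_c is not a float column")
    # Statistical check: no nulls where none are allowed.
    if df["temperature_c"].isna().any():
        problems.append("temperature_c contains nulls")
    # Statistical check: bounded continuous data stays inside its bounds.
    if (df["temperature_c"] < -273.15).any():
        problems.append("temperature_c below absolute zero")
    # Statistical check: discrete data falls within the allowed options.
    if not df["status"].isin(["ok", "error"]).all():
        problems.append("status has unexpected values")
    return problems

good = pd.DataFrame({"temperature_c": [20.5, 21.0], "status": ["ok", "error"]})
bad = pd.DataFrame({"temperature_c": [20.5, -300.0], "status": ["ok", "unknown"]})
print(validate(good))
print(validate(bad))
```

Returning a list of problems rather than raising on the first one is a design choice: it lets you see every violated assumption in a single pass.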

When to validate data

For interactive computational use cases, just-in-time checks are handy in helping you identify errors in data before using them. That means the verification ideally happens right before you consume the data and right after your data processing/handling function returns the data. You probably could call this runtime data validation.
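One way to get a check that runs "right after your data processing function returns" is a decorator. The sketch below is hand-rolled for illustration; `validate_output`, `no_negative_measurements`, and `load_measurements` are hypothetical names, and the non-negativity rule is an assumed example of an assumption about the data.

```python
import functools

import pandas as pd

def validate_output(validator):
    """Run `validator` on a function's returned dataframe before handing it on."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            validator(result)  # Raises if the data break an assumption.
            return result
        return wrapper
    return decorator

def no_negative_measurements(df: pd.DataFrame) -> None:
    # Hypothetical assumption: measurements are never negative.
    if (df["measurement"] < 0).any():
        raise ValueError("negative values found in 'measurement'")

@validate_output(no_negative_measurements)
def load_measurements(values):
    # Stand-in for a real data processing function.
    return pd.DataFrame({"measurement": values})

print(load_measurements([193, 283, 111]))
```

Calling `load_measurements([-1])` would raise a `ValueError` before the bad dataframe reaches any downstream code.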

On the other hand, if your project ends up being part of a more complex pipeline, especially one with continually updating data, you might want to validate the data at the point of ingestion. You could catch any data points that fail the validation checks that you have defined at the time of upload. You might even go further and rerun the validation checks at a regular interval. If the data source is large, you might opt to sample a small subset of data rather than perform full data scans. You might call this strategy storage time validation.
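The idea of checking a sample rather than scanning the full data source can be sketched as below. The function name `validate_sample`, the default sample size, and the unit-interval check are all assumptions for illustration; note that sampling can miss rare failures, which is the price paid for speed.

```python
import pandas as pd

def validate_sample(df, check, n=1000, seed=0):
    """Apply `check` to at most `n` randomly chosen rows instead of the full table."""
    sample = df if len(df) <= n else df.sample(n=n, random_state=seed)
    return check(sample)

def in_unit_interval(df):
    # Hypothetical rule: every measurement lies between 0 and 1 inclusive.
    return bool(df["measurement"].between(0, 1).all())

# Made-up stand-ins for a large stored table.
stored_ok = pd.DataFrame({"measurement": [0.5] * 10_000})
stored_bad = pd.DataFrame({"measurement": [2.0] * 10_000})

print(validate_sample(stored_ok, in_unit_interval, n=100))
print(validate_sample(stored_bad, in_unit_interval, n=100))
```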

Parallels to software testing

Software tests check that the functions that you write behave correctly. By contrast, data validation ensures that the input data to your functions satisfy the assumptions that you make in the data processing functions you write.

Just as you should be able to run tests to check your code automatically and continuously, you should be able to constantly check that the data you put into your functions satisfy the assumptions you hold about them.

Tools for validating data

At the moment, I see two open-source projects that are well-developed and maintained for data validation.

Pandera

Pandera targets validation of pandas dataframes in your Python code and comes with a very lightweight API for tacking on automatic runtime validation to your functions.

Great Expectations

Great Expectations is a bit more heavyweight than Pandera, and in my opinion, is more suitable for heavy-duty pipelines that continuously process data that gets continually fed (whether streamed or in batch) into the data storage system.

Your database system's schemas

If you are ingesting data into a database, which is inherently already structured, rather than being dumped into a data lake, which is intrinsically unstructured, then your database schemas can serve as an automated check for some parts of data validity, such as data being in the right range, or having the right data types.
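As a small illustration of a database schema acting as a validity check, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names are hypothetical. SQLite is used only because it ships with Python; it is lenient about declared column types by default, so other database systems will typically enforce types more strictly than this example shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        time_index INTEGER NOT NULL CHECK (time_index >= 0),
        measurement REAL NOT NULL
    )
    """
)

# A valid row is accepted.
conn.execute("INSERT INTO measurements VALUES (?, ?)", (0, 193.0))

# A row that violates the range constraint is rejected by the database itself.
try:
    conn.execute("INSERT INTO measurements VALUES (?, ?)", (-5, 283.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(rejected, row_count)
```

Because the constraint lives in the schema, every writer to the table gets the same check without having to remember to call it.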

Define single sources of truth for your data sources

Why define single sources of truth for data

Let me describe a scenario: there's a project you're working on with others, and everybody depends on an Excel spreadsheet. This was before collaboratively editing a single Excel spreadsheet was possible. To avoid conflicts, someone creates a spreadsheet_v2.xlsx, and then at the same time, another person creates spreadsheet_TE_edits.xlsx.

Which version do you trust?

The worst part? Neither of those spreadsheets contained purely raw data; they were a mix of both raw data and derived data (i.e. columns that are calculated from other columns). The derived data are not documented with why and how they were calculated; their provenance is unknown, in that we don't know who made those changes, or whom to ask about those columns.

Rather than wrestling with multiple sources of truth, a data analysis workflow can be much more streamlined by defining a single source of truth for raw data that does not contain anything derived, followed by calculating the derived data in custom source code (see: Place custom source code inside a lightweight package), written in such a way that it yields logical derived data structures for the problem (see: Iteratively scope out and define the most appropriate data structures for your problem). Those single sources of truth can also be described by a ground truth data descriptor file (see: Write data descriptor files for your data sources), which gives you the provenance of the file and a human-readable description of each of the sources.
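A minimal sketch of this split between raw and derived data might look like the following. The URL, function names, column names, and the threshold are all hypothetical, and an in-memory CSV stands in for the real source so the example runs without network access.

```python
import io

import pandas as pd

# Hypothetical location: the one place the raw data is read from.
RAW_DATA_SOURCE = "https://example.com/raw/measurements.csv"

def load_raw(source=RAW_DATA_SOURCE) -> pd.DataFrame:
    """Load the raw data, unmodified, from the single source of truth."""
    return pd.read_csv(source)

def add_derived(raw: pd.DataFrame) -> pd.DataFrame:
    """Compute derived columns in code so their provenance is the codebase."""
    # The cutoff of 200 is arbitrary and only for illustration.
    return raw.assign(is_high=raw["measurement"] > 200)

# For demonstration only, an in-memory CSV stands in for the real source.
demo_source = io.StringIO("time_index,measurement\n0,193\n5,283\n")
df = add_derived(load_raw(demo_source))
print(df)
```

Because derived columns only ever come out of `add_derived`, anyone wondering how a column was calculated can read the function rather than hunt for the person who edited a spreadsheet.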

Examples of single sources of data truth in action

Data on an s3-like bucket

If your organization uses the cloud, then AWS S3 (or compatible bucket stores) might be available. A data source might be dumped on there and referenced by a single URL. That URL is your "single source of truth".

Data on an internal data store

Your organization might have the resources to build out a data store with proper access controls and the like. They might provide a unique key and a software API (RESTful, or a Python or R package) to download data in an easy fashion. That "unique key" + the API defines your single source of truth.

Data on a shared network store

Longer-lived organizations might have started out with a shared networked filesystem, with access controls granted by UNIX-style user groups. In this case, the /path/to/the/data/file + access to the shared filesystem is your source of truth.

Data on the internet

This one should be easy to grok: a URL that points to the exact CSV, Parquet, or Excel table, or a zip dump of images, is your unique identifier.