Programmatically clean your data rather than manually

Why clean your data programmatically

To illustrate, let's look at a simple but common scenario.

Data came to you from an upstream source with some errors. You went into Excel or a text editor and manually corrected those errors. You then went about your analysis. At a later date, your upstream source provided you with an updated data file... and it still contained the exact same errors you fixed manually earlier.

This scenario makes the case for baking in all data cleaning (i.e. processing) steps into code. In doing so, we declare that "the source of truth for the state of my data is whatever the data source gives me", and our code will do the cleaning.

What tools can we use to programmatically clean our data?

In the Python world, the pandas library is one tool that is available. Its API has become the de facto standard that other libraries target, especially if they are trying to make accelerated data processing libraries. Some examples of dataframe libraries that adopt the pandas API include:

Building on top of the pandas API, I built a port of the R package janitor, called pyjanitor. Inside pyjanitor, you'll find a library of data cleaning routines, each routine having an API that is implemented as a DataFrame class method.

How do we design good data cleaning pipelines?

Here, it pays to think a bit like a software engineer. Your dataframe is basically your "data API" -- your downstream model, which is effectively a numerical program of some kind, depends on your input data having certain guarantees. What are these guarantees? I have described some of these considerations in Validate your data wherever practically possible, but here are some of the more common ones to consider:

  • Presence of certain columns.
  • Columns having certain dtypes.
  • Data being present/absent.

As mentioned in the data validation note, validate your data wherever you can!

One technique I find particularly handy is to map out all of the necessary data transformations as a graph. Nodes are the resulting dataframes and their columns; edges are the functions that take in dataframes and return dataframes. This map will be immensely helpful: it will help you think through where you could decompose larger data transformation functions to be more modular, and where you might be able to reuse transformed data.

Handling data

How to handle data

Handling data in a data science project is very tricky. Primarily, we have to worry about the following:

  1. Availability: How do I make the data that my project depends on available to others who want to work on the project?
  2. Validation: How do I know whether my data are exactly what I think it should be?
  3. Flow complexity: How do I combat the entropy (complexity) that grows as the project develops?
  4. Provenance: If I have a problem with the data, whom should I ask questions about it?

The notes linked in this section should give you an overview on how to approach handling data in a sane fashion on your project.

Validate your data wherever practically possible

What is "data validation"?

To understand data validation, we have to back up a little bit and consider the simplest case of tabular data.

We canonically understand tabular data as having columns and rows. Rows are, in a statistical sense, "samples". Columns, then, are measured attributes of the samples. Each of the measured attributes has a range of values for which it is semantically valid. (In statistics, this is analogous to the statistical support, which is the range of values that define the probability distribution.) Validation of tabular data, then, refers to the act of ensuring that the measured attributes are, for lack of a better word, valid.

To make this clearer, let me illustrate the ways that "validated" data might look.

From a statistical standpoint:

  1. For continuous measurement data, the measurement values fall within semantically valid ranges.
    1. Unbounded in statistical language means support from negative to positive infinity.
    2. Bounded data usually would have at least one of "minimum" or "maximum" values stated.
  2. For discrete measurement data, the measurement values fall within a set of semantically valid options.
  3. There are no null values present in columns that should not have them.

From a computational standpoint:

  1. Each column's data are of the correct data type (integer, float, categorical, object) for interoperability with other code that you might write.
  2. Column names are named precisely in line with their references in the codebase.

When to validate data

For interactive computational use cases, just-in-time checks are handy in helping you identify errors in data before using them. That means the verification ideally happens right before you consume the data and right after your data processing/handling function returns the data. You probably could call this runtime data validation.

On the other hand, if your project ends up being part of a more complex pipeline, especially one with continually updating data, you might want to validate the data at the point of ingestion. You could catch any data points that fail the validation checks that you have defined at the time of upload. You might even go further and periodically run the validation checks on a regular interval. If the data source is large, you might opt to sample a small subset of data rather than perform full data scans. For this strategy, you might want to call it storage time validation.

Parallels to software testing

Software tests check that the functions that you write behave correctly. By contrast, data validation ensures that the input data to your functions satisfy the assumptions that you make in the data processing functions you write.

Just as you should be able to run tests to check your data automatically and continuously, you should be able to constantly check that the data you put into your functions should satisfy the assumptions you possess about them.

Tools for validating data

At the moment, I see two open-source projects that are well-developed and maintained for data validation.

Pandera

Pandera targets validation of pandas dataframes in your Python code and comes with a very lightweight API for tacking on automatic runtime validation to your functions.

Great Expectations

Great Expectations is a bit more heavyweight than Pandera, and in my opinion, is more suitable for heavy-duty pipelines that continuously process data that gets continually fed (whether streamed or in batch) into the data storage system.

Your database system's schemas

If you are ingesting data into a database, which is inherently already structured, rather than being dumped into a data lake, which is intrinsically unstructured, then your database schemas can serve as an automated check for some parts of data validity, such as data being in the right range, or having the right data types.