
Data Testing Tutorial

You're a data scientist, and you're having fun with your data. You might be writing a lot of code in support of your project, and enjoying it a ton too!

But at some point, things get out of control.

The functions you've copied from one notebook to another proliferate across the project, and suddenly, you're not so sure which function is "the one" you trust.

Also, the data seem to be changing underneath you, with unexpected errors cropping up in the form of null values, invalid values, and more.

Sound familiar?

If you're on the lookout for a short guidebook to solving these problems, this one will give you the fundamentals. Come join me as I show you how two practices, namely software testing and schema validation, can help bring a measure of sanity to your data science workflow.

What to expect

Throughout these notebooks, we will be covering two key themes:

  1. Software testing, via pytest and hypothesis, and
  2. Data validation, using a mixture of custom Python functions and pandera.

Software testing involves running commands at the terminal. As such, the notebooks on software testing are designed to be read. By contrast, data validation involves running code in the notebook. As such, the notebooks on data validation are intended to be executed.
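To give a flavour of both themes before diving in, here is a minimal sketch in plain Python. The function names (`clean_temperatures`, `validate_temperatures`) and the temperature bounds are hypothetical, made up purely for illustration. They are not taken from the notebooks, which use hypothesis and pandera for the heavier lifting.

```python
import math


def clean_temperatures(readings):
    """Drop missing (None or NaN) readings from a list of temperatures."""
    return [r for r in readings if r is not None and not math.isnan(r)]


def validate_temperatures(readings, low=-90.0, high=60.0):
    """Raise ValueError if any reading is missing or outside [low, high].

    The default bounds are illustrative assumptions, not domain facts.
    """
    for i, r in enumerate(readings):
        if r is None or math.isnan(r):
            raise ValueError(f"reading {i} is null")
        if not low <= r <= high:
            raise ValueError(f"reading {i} is out of range: {r}")
    return readings


def test_clean_temperatures_drops_nulls():
    # pytest collects functions whose names start with test_
    # and reports any failing assert.
    assert clean_temperatures([21.5, None, float("nan"), 19.0]) == [21.5, 19.0]
```

Saved in a file such as `test_temperatures.py` (again, a made-up name), the test would be run from the terminal with `pytest test_temperatures.py`, while `validate_temperatures` is the kind of check you would call directly in a notebook cell right after loading your data.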

Let's get going!

Head over to the first chapter to learn how to get set up!