I’ve just wrapped up the data analysis for one of my projects, and have finished writing it up to the point where it is ready for submission. I’m going to keep it under wraps until it gets published somewhere; I don’t want to count my chickens before they hatch! But the paper isn’t the main point of this post. The main point is the importance of testing as part of the data analysis workflow, especially for large data analyses.
Data comes to us analytics-minded people, but rarely is it structured, cleaned, and ready for analysis and modelling. Even if it was generated by machines, there may be bugs in the data, and by bugs I mean things we didn’t expect to see in the data. Therefore, I think it’s important to check our assumptions about the data, prior to and during analysis, to catch anything that is out of whack. I would argue that the best way to do this is by employing an "automated testing" mindset and framework. Doing this builds up a list of data integrity checks that can speed up the catching of bugs during analysis and can provide feedback to the data providers/generators. In the event that the data gets updated, we can automatically check the integrity of the datasets that we work with before proceeding.
As I eventually found out through experience, doing data tests isn’t hard at all. The core requirements are:
test_data.py, with functions that express the assumptions one is making about the data.
To get started, ensure that you have Python installed. As usual, I recommend the Anaconda distribution of Python. Once that is installed, install the
pytest package by typing in your Terminal:
conda install pytest.
Once that is done, create a blank script in your project folder called
test_data.py. In this script, you will write functions that express the data integrity tests for your data.
To illustrate some of the basic logic behind testing, I have provided a GitHub repository with some test data and an example script. The material is available at https://github.com/ericmjl/data-testing-tutorial. To use this ‘tutorial’ material, clone the repo to disk and install the dependencies listed on the GitHub page.
The example data set provided is a Divvy data set, which is a simple data set on which we can do some tests. There is a clean data set, and a corrupt data set in which one cell of the CSV file has a "hello", and another cell has a "world" present in the
latitude columns, defying what one would expect about two of the data columns. If you inspect the data, you will see that there are different columns present.
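To see concretely what this kind of corruption looks like, one option is to build a tiny table in memory and ask pandas which cells fail to parse as numbers. This is only a sketch: the station names, coordinates, and the helper `find_non_numeric` are invented for illustration and are not taken from the Divvy files.

```python
import io

import pandas as pd

# Hypothetical miniature of a stations file; the values are made up.
CSV_TEXT = """id,name,latitude,longitude
1,Station A,41.88,-87.63
2,Station B,hello,-87.65
3,Station C,41.90,-87.62
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)


def find_non_numeric(series):
    """Return the entries of `series` that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Cells that were present but turned into NaN after coercion are the bad ones.
    return series[parsed.isna() & series.notna()]


bad_cells = find_non_numeric(df["latitude"])
print(df["latitude"].dtype)  # not a float dtype, because of the stray string
print(bad_cells.to_dict())   # maps the row index to the offending value
```

A single stray string is enough to stop pandas from reading the whole column as floats, which is exactly the kind of thing a dtype check will catch.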
To begin testing the data, we can write the following lines of code in the test_data.py script:

    import pandas as pd

    # Load the clean data set once, so every test function can use it.
    data = pd.read_csv('data/Divvy_Stations_2013.csv', index_col=0)


    def test_column_latitude_dtype():
        """
        Checks that the dtype of the 'Latitude' column is a float.
        """
        assert data['latitude'].dtype == float
If you fire up a Terminal window,
cd into the
data-testing-tutorial directory, and execute
py.test, you should see the following terminal output:
1 passed in 0.53 seconds.
If you now change the function to check the
data_corrupt DataFrame, such that the assert statement is:
assert data_corrupt['latitude'].dtype == float
The py.test output should include the error message:

    >       assert df['latitude'].dtype == float
    E       assert dtype('O') == float
At this point, because the assertion failed, you would know that the data had been corrupted: the column was read in as the generic object dtype, dtype('O'), rather than as floats.
In this case, there is only one data set of interest. However, if you need to test more than one file of a similar data set, you can encapsulate the test code in a helper function nested inside the test function, like so:
    def test_column_latitude_dtype():
        """
        Checks that the dtype of the 'Latitude' column is a float.
        """
        def column_latitude_dtype(df):
            assert df['latitude'].dtype == float

        # Run the same check against every data set of interest.
        column_latitude_dtype(data)
        column_latitude_dtype(data_corrupt)
In this way, you would be testing all of the data files together. You can also do similar encapsulation, abstraction, etc. if it helps with automating the test cases. As one speaker once said (as far as I can recall), if you use the same block of code twice, encapsulate it in a function.
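An alternative to the nested helper, which is not what the tutorial repository does but which pytest supports, is `pytest.mark.parametrize`. It runs the same test function once per data set and reports each one separately. The sketch below uses made-up in-memory tables and a hypothetical `datasets` dictionary in place of real files.

```python
import io

import pandas as pd
import pytest

# Hypothetical in-memory stand-ins for two data files; the values are made up.
CLEAN_CSV = "id,latitude\n1,41.88\n2,41.90\n"
SECOND_CSV = "id,latitude\n1,41.87\n2,41.91\n"

datasets = {
    "clean": pd.read_csv(io.StringIO(CLEAN_CSV), index_col=0),
    "second": pd.read_csv(io.StringIO(SECOND_CSV), index_col=0),
}


@pytest.mark.parametrize("name", sorted(datasets))
def test_column_latitude_dtype(name):
    """Checks that the 'latitude' column of each data set is a float."""
    assert datasets[name]["latitude"].dtype == float
```

With this layout, a failure names the data set that broke, rather than stopping at the first failing call inside a single test.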
You can extend the script by adding more test functions.
py.test is able to do automated test discovery, by looking for the
test_ prefix to a function. Therefore, simply make sure all of your test functions have the
test_ prefix before them.
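As a rough illustration of what additional test functions might look like, here is a sketch with a few more checks, each carrying the test_ prefix so that it would be discovered. The table is a made-up, in-memory stand-in for a stations file, and the specific checks (coordinate bounds, missing values, unique ids) are my own examples of assumptions one could encode.

```python
import io

import pandas as pd

# Hypothetical miniature of a stations table; the values are made up.
CSV_TEXT = """id,name,latitude,longitude
1,Station A,41.88,-87.63
2,Station B,41.89,-87.65
3,Station C,41.90,-87.62
"""

data = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)


def test_latitude_within_bounds():
    """Latitudes must lie between -90 and 90 degrees."""
    assert data["latitude"].between(-90, 90).all()


def test_longitude_within_bounds():
    """Longitudes must lie between -180 and 180 degrees."""
    assert data["longitude"].between(-180, 180).all()


def test_no_missing_coordinates():
    """Every station should have both coordinates filled in."""
    assert data[["latitude", "longitude"]].notna().all().all()


def test_station_ids_are_unique():
    """The index column should not contain duplicate station ids."""
    assert data.index.is_unique
```

Each function states one assumption about the data, so a failing test points directly at the assumption that was violated.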
Need some ideas on when to add more tests?
Happy Testing! :D