Iteratively scope out and define the most appropriate data structures for your problem

## Why you need to define good data structures

Data structures are incredibly important to any modelling problem.

Data structures, when designed well, give us an efficient handle over the problem at hand. Especially when a data structure is paired with a programmatic API.

## How to design good data structures for a problem

Consider the example where you have a time series measurement. Here's a simple data structure you can use: two lists. It'd look like:

time_index = [0, 5, 10, ...]
values = [193, 283, 111, ...]


Now, while simplistic, it's not ideal. The time index doesn't start at 1, and it's difficult to index into the values corresponding to a particular time step. Manipulating and analyzing this data is difficult, because of a poor choice of data structures.

By contrast, if we instead stuck the data inside a dataframe, things would start to look a bit more sane.

df = pd.DataFrame({"time_index": time_index, "measurement": values})


Now, our time index and measurements are no longer divorced from one another. We can write queries against them easily. Plotting is a cinch too, because the dataframe API supports it. Hence, by choosing to structure our data in a dataframe rather than in two lists, we gain a world of capabilities afforded to us from the dataframe API.

### Dataframe considerations

Designing a good "dataframe" takes effort too. Once you have your raw data loaded in memory from your single source of truth (see: Define single sources of truth for your data sources), you probably will end up defining new derived columns. These are columns that are calculated on the basis of, or otherwise "derived" from, the "raw data" columns. Examples include:

• Binarization/quantization of a continuous column.
• Joining two dataframes together on a key column.
• Gaussian-standardization of a column.

The "raw data" form the baseline logical unit that can be validated (see: Validate your data wherever practically possible). On top of this baseline logical unit, you can make an arbitrary number of changes to the dataframe. How many changes form a new "logical unit" of changes for which you'll want to define new schema validation checks? This is an important question to think about, because after all, your dataframes form the "data API", and it'll be implicated in the pandera schemas and data descriptors you end up writing! (see: Write data descriptor files for your data sources).