Eric's Notes

Data scientists should know the data generating process behind their data

I think a data scientist has a responsibility to understand how the numbers they use were generated. In other words, they should know the data generating process behind their inputs. We shouldn't take the inputs that arrive in our hands for granted.

Some key questions to always ask:

How was a number generated? What are the possible causal mechanisms?
What data are missing? Why are they missing? Are there possibly multiple causes for missing data?
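
One way to start on the missing-data questions is to check whether values go missing at different rates across some other variable, which can hint at more than one cause. Below is a minimal sketch in Python. The records, the field names `instrument` and `reading`, and the helper `missing_rate_by` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: None marks a missing reading; field names are made up.
records = [
    {"instrument": "A", "reading": 1.2},
    {"instrument": "A", "reading": None},
    {"instrument": "A", "reading": None},
    {"instrument": "B", "reading": 0.9},
    {"instrument": "B", "reading": 1.1},
    {"instrument": "B", "reading": None},
]


def missing_rate_by(records, group_key, value_key):
    """Return {group: fraction of records whose value is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by(records, "instrument", "reading")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} missing")
```

A gap in rates between groups doesn't say why the data are missing. It is only a prompt to go back and ask how each group's numbers were collected.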
