Data scientists should know the data generating process behind their data

I think a data scientist has a responsibility to be fully informed about how the numbers they use are generated. That means knowing the data generating process behind every input. We shouldn't take for granted the inputs that arrive in our hands.

Some key questions to always ask:

  • How was a number generated? What are the possible causal mechanisms?
  • What data are missing? Why are they missing? Are there possibly multiple causes for missing data?
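One way to start on the missing-data questions is to check whether missingness lines up with some other variable, which hints at a cause rather than random dropout. The following is a minimal sketch using only the standard library; the function name `missingness_report`, the field names, and the sensor-style rows are all made up for illustration, and it assumes missing values are encoded as `None`.

```python
from collections import Counter, defaultdict


def missingness_report(rows, group_key):
    """Count missing values per field, overall and as a fraction of each group.

    Assumes each row is a dict and that a missing value is stored as None.
    Fields that are never missing do not appear in the report.
    """
    overall = Counter()
    by_group = defaultdict(Counter)
    group_sizes = Counter()

    for row in rows:
        group = row.get(group_key)
        group_sizes[group] += 1
        for field, value in row.items():
            if field != group_key and value is None:
                overall[field] += 1
                by_group[field][group] += 1

    return {
        field: {
            "missing": overall[field],
            # Fraction of rows in each group where this field is missing.
            "by_group": {g: by_group[field][g] / group_sizes[g] for g in group_sizes},
        }
        for field in overall
    }


# Hypothetical readings from two sites; the values are invented.
rows = [
    {"site": "A", "temp_c": 21.5, "humidity": 0.40},
    {"site": "A", "temp_c": 22.1, "humidity": None},
    {"site": "B", "temp_c": None, "humidity": 0.55},
    {"site": "B", "temp_c": None, "humidity": None},
]

report = missingness_report(rows, group_key="site")
for field, stats in report.items():
    print(field, stats)
```

In this toy data, `temp_c` is missing only at site B, which suggests a site-specific cause (say, a broken sensor), while `humidity` is missing equally at both sites, which suggests a different mechanism. A table like this doesn't answer the "why", but it tells you where to go ask.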

See also: Embeddings should be treated with care

State of Data Science

Embeddings should be treated with care

Let's think about embeddings (see The craze with embeddings). Embeddings, at their core, are nothing more than numbers generated by another model. What that model was built to do, what input data were used to train it, and hence what caveats should be attached to the embeddings it generates, should not be black boxes to the data scientist using them.

I'm not saying that embeddings shouldn't be used, just that they should be treated with care. Any inductive biases they may introduce should be fully understood by tracing the path of the original data through every equation in the model that generated the embedding.
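One lightweight way to keep those caveats from getting lost is to store them next to the vectors themselves. This is only a sketch of the idea, not an established convention: the class names `EmbeddingProvenance` and `EmbeddingSet`, their fields, and the example model are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingProvenance:
    """What we know about the model that produced the embeddings."""
    model_name: str
    model_purpose: str   # what the model was built or intended to do
    training_data: str   # what input data were used to train it
    caveats: tuple = ()  # known limitations that follow from the above


@dataclass
class EmbeddingSet:
    """Embedding vectors bundled with their provenance."""
    vectors: dict  # item id -> list of floats
    provenance: EmbeddingProvenance

    def describe(self):
        p = self.provenance
        lines = [
            f"{len(self.vectors)} embeddings from {p.model_name}",
            f"model purpose: {p.model_purpose}",
            f"training data: {p.training_data}",
        ]
        lines += [f"caveat: {c}" for c in p.caveats]
        return "\n".join(lines)


# Hypothetical example; the model and its details are invented.
embeddings = EmbeddingSet(
    vectors={"doc-1": [0.1, 0.3], "doc-2": [0.2, 0.1]},
    provenance=EmbeddingProvenance(
        model_name="example-text-encoder",
        model_purpose="next-token prediction on general web text",
        training_data="English-language web crawl",
        caveats=("may represent non-English text poorly",),
    ),
)
print(embeddings.describe())
```

If a field can't be filled in, that is itself the signal: the embedding is still a black box to whoever is using it.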