Data scientists should know the data generating process behind their data
I think a data scientist has a responsibility to be fully informed about how the numbers they use are generated, that is, to know the data generating process behind whatever inputs they work with. We shouldn't take for granted the inputs that arrive in our hands.
Some key questions to always ask:
See also: Embeddings should be treated with care
State of Data Science
This was inspired by my participation in the TAO Data Science Panel.
I'm starting to see a bifurcation between research data science and business data science.
How this translates to training needs and hiring
And notes for managing data scientists:
Embeddings should be treated with care
Let's think about embeddings (see The craze with embeddings). Embeddings, at their core, are nothing more than numbers generated by another model. What that model was built to do, what input data were used to train it, and hence what caveats should be attached to the embeddings it generates, should not be black boxes to the data scientist using them.
I'm not saying that embeddings shouldn't be used, just that they should be treated with care. Any inductive biases they may introduce should be understood by tracing the path of the original data through every equation in the model that generated the embedding.
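To make the point concrete, here is a toy sketch of how the training data behind an embedding model shapes the numbers it hands you. Everything in it is an assumption for illustration: `cooccurrence_embedder`, the two tiny corpora, and the four-word vocabulary are hypothetical, and the "model" is plain co-occurrence counting rather than any real embedding method.

```python
from collections import Counter


def cooccurrence_embedder(corpus, vocab):
    """Toy 'model': a word's embedding is how often it shares a sentence
    with each vocabulary word in the training corpus."""
    counts = {w: Counter() for w in vocab}
    for sentence in corpus:
        tokens = sentence.lower().split()
        for w in tokens:
            if w in counts:
                for other in tokens:
                    if other != w and other in counts:
                        counts[w][other] += 1

    def embed(word):
        # One number per vocabulary word, in vocabulary order.
        return [counts[word][v] for v in vocab]

    return embed


# Hypothetical vocabulary and training corpora (illustrative only).
vocab = ["bank", "river", "money", "loan"]
corpus_finance = ["the bank approved the loan", "money in the bank", "bank loan money"]
corpus_nature = ["the river bank flooded", "a bank along the river"]

embed_finance = cooccurrence_embedder(corpus_finance, vocab)
embed_nature = cooccurrence_embedder(corpus_nature, vocab)

print(embed_finance("bank"))  # [0, 0, 2, 2]
print(embed_nature("bank"))   # [0, 2, 0, 0]
```

The same word gets a different vector depending on which corpus trained the model. Someone who receives only the vectors, without knowing the corpus or the counting rule, has no way to tell which sense of "bank" the numbers encode.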