Embeddings should be treated with care

Let's think about embeddings (see The craze with embeddings). At their core, embeddings are nothing more than numbers generated by another model. What that model was built to do, what data were used to train it, and hence what caveats attach to the embeddings it produces, should not be black boxes to the data scientist using them.

I'm not saying that embeddings shouldn't be used, just that they should be treated with care. Any inductive biases they introduce should be understood by tracing the path of the original data through every equation in the model that generated the embedding.

The craze with embeddings

Embeddings are appealing because they act like a general-purpose "data API". Their form is generic (an array of numbers, of whatever size you choose), which means they can be easily plugged into other models.
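To make the "data API" idea concrete, here is a minimal sketch. The `toy_embed` function is hypothetical and invented purely for illustration (it hashes tokens into buckets and counts them); it is not how any real embedding model works. The point is only that inputs of very different lengths come out as arrays of the same size, so any downstream code that accepts a fixed-length numeric vector can consume them.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical embedding: hash each token into one of `dim` buckets and count."""
    vector = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    return vector


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """A stand-in for 'another model' that only needs two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


short_vec = toy_embed("kinase inhibitor")
long_vec = toy_embed("a much longer description of a kinase inhibitor and its targets")

# Both inputs yield vectors of the same length, regardless of input size.
print(len(short_vec), len(long_vec))
print(cosine_similarity(short_vec, long_vec))
```

The downstream function never sees the original text, only the numbers, which is exactly why the construction of those numbers deserves scrutiny.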

However, the way they are constructed brings its own set of inductive biases to a problem, and if those inductive biases don't make sense for that problem, caution is warranted.
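As a small illustration of an inductive bias baked into an embedding, consider a hypothetical bag-of-characters embedding (the function name and design are assumptions made up for this sketch). By construction it ignores the order of characters, so it cannot distinguish anagrams. That is harmless if order is irrelevant to the problem, and a silent failure if it is not.

```python
def bag_of_chars_embed(text: str) -> list[int]:
    """Hypothetical embedding: count of each letter a-z, ignoring order and case."""
    counts = [0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1
    return counts


listen = bag_of_chars_embed("listen")
silent = bag_of_chars_embed("silent")

# Different words, identical embeddings: the order-invariance bias at work.
print(listen == silent)  # prints True
```

Someone handed only the vectors would have no way to discover this without knowing how they were built.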

I would always be in favour of the user of some embeddings being fully aware of how they were constructed. It's just like how the consumer of a dataset, e.g. a statistician or data scientist, should be fully aware of how the data were collected. Embeddings are an intermediate representation, one step removed from the rawest form of the original data. (See Embeddings should be treated with care)

Data scientists should know the data generating process behind their data

I think a data scientist has a responsibility to be fully informed about how the numbers they're using are generated. That means knowing the data generating process behind whatever inputs they are using. We shouldn't take for granted the inputs that arrive in our hands.

Some key questions to always ask:

  • How was a number generated? What are the possible causal mechanisms?
  • What data are missing? Why are they missing? Are there possibly multiple causes for missing data?
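One way to start on the missing-data questions is to count missing values per field within each group, as in the rough sketch below. The records, the field names, and the `site` grouping key are all made up for illustration. If missingness clusters by group, that hints at different causes in different places, which is worth chasing down with whoever collected the data.

```python
from collections import defaultdict

# Made-up records; None marks a missing value.
records = [
    {"site": "A", "age": 34, "weight_kg": 70.5},
    {"site": "A", "age": None, "weight_kg": 82.0},
    {"site": "B", "age": 51, "weight_kg": None},
    {"site": "B", "age": 29, "weight_kg": None},
]


def missing_counts(rows, group_key):
    """Count missing values per field, separately for each value of `group_key`."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for field, value in row.items():
            if field != group_key and value is None:
                counts[row[group_key]][field] += 1
    return {group: dict(fields) for group, fields in counts.items()}


summary = missing_counts(records, "site")
print(summary)  # prints {'A': {'age': 1}, 'B': {'weight_kg': 2}}
```

Here site B never reports weight while site A occasionally omits age: two patterns that likely have two different explanations. The counts only say where to look; the causal mechanism still has to come from understanding how the data were collected.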

See also: Embeddings should be treated with care