The craze with embeddings
Embeddings are cool because they act like a general-purpose "data API". Their form is generic (an array of numbers, of arbitrarily large or small size), which means they can easily be plugged into other models.
However... the way they are constructed brings its own set of inductive biases to a problem, and if those inductive biases don't make sense for the problem at hand, then caution is warranted.
I would always be in favour of the user of some embeddings being fully aware of how they were constructed. It's just like how the consumer of a dataset, e.g. a statistician or data scientist, should be fully aware of how the data were collected. In this case, embeddings are an intermediate representation, one step removed from the rawest form of the original data. (See Embeddings should be treated with care)
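To make the "data API" idea and the inductive-bias caveat concrete, here is a minimal sketch. Everything in it is made up for illustration: `toy_embed` is a hypothetical embedding (bucketed character counts), not any real model, and `downstream_score` stands in for whatever model consumes the vectors. The point is only that the downstream model sees an array of numbers and nothing else, so it silently inherits whatever the embedding threw away.

```python
import math


def toy_embed(text, dim=8):
    """Hypothetical embedding: character counts hashed into `dim` buckets,
    then L2-normalised. Character order is discarded by construction."""
    vec = [0.0] * dim
    for ch in text.lower():
        if ch.isalpha():
            vec[(ord(ch) - ord("a")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def downstream_score(embedding, weights):
    """Stand-in for any downstream model: it only ever sees the numbers."""
    return sum(e * w for e, w in zip(embedding, weights))


# Arbitrary illustrative weights; a real model would learn these.
weights = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -0.5]

emb_listen = toy_embed("listen")
emb_silent = toy_embed("silent")
emb_other = toy_embed("embedding")

# Anagrams collapse to the same vector, so no downstream model can
# tell them apart, however expressive it is.
same_vector = emb_listen == emb_silent
same_score = downstream_score(emb_listen, weights) == downstream_score(emb_silent, weights)
```

Here `same_vector` and `same_score` both come out `True`. If the downstream task depends on character order, this embedding's bias is the wrong one, and nothing in the array itself would tell you that.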
The impossibility of low-rank representations for triangle-rich complex networks
News article on ScienceDaily. The original paper backing the news article is published in PNAS.
Quotables from the news article:
He also noted that new embedding methods are mostly being compared to other embedding methods. Recent empirical work by other researchers, however, shows that different techniques can give better results for specific tasks.
Benchmarks are quite important! See also: The craze with embeddings.
Given the growing influence of machine learning in our society, Seshadhri said it is important to investigate whether the underlying assumptions behind the models are valid.
Relates to the idea that Finding the appropriate model to apply is key. This is because Every well-constructed model is leverage against a problem; when the underlying assumptions behind our models are valid for the specific problem at hand, we gain leverage to solve it. (We should also stay keenly aware of When is your statistical model right vs. good enough.)
Embeddings should be treated with care
Let's think about embeddings. (See The craze with embeddings.) Embeddings, at their core, are nothing more than numbers generated by another model. What that model was built or intended to do, what input data were used to train it, and hence what caveats should be attached to the embeddings it generates, should not be black boxes to the data scientist using them.
I'm not saying that embeddings shouldn't be used, just that they should be treated with care. Any inductive biases they may introduce should be understood by tracing the path of the original data through every equation in the model that generated the embedding.
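One lightweight way to keep embeddings from becoming black boxes might be to ship the provenance alongside the numbers. The sketch below is one possible shape for that, not an established convention: the `EmbeddingRecord` class, its field names, and the `undocumented_fields` helper are all hypothetical, as are the example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingRecord:
    """An embedding bundled with notes on where it came from (hypothetical schema)."""
    vector: tuple
    source_model: str    # what model produced the vector
    training_data: str   # what data that model was trained on
    intended_task: str   # what the model was built to do
    caveats: tuple = ()  # known limitations or biases


def undocumented_fields(record):
    """Return the names of provenance fields that were left blank."""
    required = ("source_model", "training_data", "intended_task")
    return [name for name in required if not getattr(record, name).strip()]


# Example values are invented for illustration only.
record = EmbeddingRecord(
    vector=(0.12, -0.48, 0.91),
    source_model="hypothetical-graph-embedder",
    training_data="",  # unknown: exactly the gap we want to surface
    intended_task="link prediction",
    caveats=("low-dimensional; may under-represent triangle-rich structure",),
)

missing = undocumented_fields(record)
```

In this example `missing` is `["training_data"]`, which is a prompt to go and find out how the data were collected before trusting the vector in a downstream model.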