The impossibility of low-rank representations for triangle-rich complex networks

News article on ScienceDaily. The original paper backing the news article is published in PNAS.

Quotables from the news article:

He also noted that new embedding methods are mostly being compared to other embedding methods. Recent empirical work by other researchers, however, shows that different techniques can give better results for specific tasks.

Benchmarks are quite important! See also: The craze with embeddings.

Given the growing influence of machine learning in our society, Seshadhri said it is important to investigate whether the underlying assumptions behind the models are valid.

Relates to the idea that Finding the appropriate model to apply is key. This is because Every well-constructed model is leverage against a problem: when the underlying assumptions behind our models are valid for the specific problem at hand, we gain leverage to solve it. (We should also stay keenly aware of When is your statistical model right vs. good enough.)

Every well-constructed model is leverage against a problem

Why so?

A well-constructed model, one whose residuals cannot be further accounted for (see When is your statistical model right vs. good enough), is one that gives us high explanatory power (see: The goal of scientific model building is high explanatory power). Using these models, we can:

  1. map its key parameters to values of interest, which can then be used in comparisons. This is the act of characterization.
  2. simulate what-if scenarios (including counterfactual scenarios). This is us thinking causally.

This is leverage because we can take these actions without spending real-world resources. (Apart from, of course, real-world validation.)
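As a minimal sketch of both actions, assuming a toy straight-line model (all data, doses, and names here are invented purely for illustration):

```python
# A toy illustration of the two kinds of leverage a fitted model gives us:
# 1. characterization (reading off key parameters), and
# 2. simulating what-if scenarios (projecting conditions we never ran).

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Observed data: dose vs. response (hypothetical numbers).
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

intercept, slope = fit_line(xs, ys)

# 1. Characterization: the slope is a value of interest we can compare
#    across conditions (e.g. "this condition responds ~2 units per dose").
print(f"slope = {slope:.2f}")

# 2. What-if simulation: project a counterfactual scenario at a dose we
#    never actually ran, without spending real-world resources.
counterfactual_dose = 10
predicted = intercept + slope * counterfactual_dose
print(f"predicted response at dose {counterfactual_dose}: {predicted:.2f}")
```

Neither step required collecting new data; validating the counterfactual prediction, of course, still would.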

Finding the appropriate model to apply is key

I think we need to develop a sense for "when to apply which model".

The key skill here is to be able to look at a problem and very quickly identify what class of problem it falls under, and what model classes are best suited for it.

By problem class, I mean things like:

  • Input/inverse design problems
  • Supervised learning problems
  • Unsupervised learning problems
  • Statistical inference problems
  • Pure prediction problems

I think that those who claim that "the end of programming is near" likely have a deeply flawed view of how models of all kinds are built.

The craze with embeddings

Embeddings are appealing because they act like a general-purpose "data API". Their form is general enough (an array of numbers, of arbitrarily large or small size) that they can easily be connected to other models.

However, the way embeddings are constructed brings its own set of inductive biases to a problem, and if those inductive biases don't make sense for the problem at hand, caution is warranted.

I would always be in favour of the user of some embeddings being fully aware of how they were constructed. It's just like how the consumer of a dataset, e.g. a statistician or data scientist, should be fully aware of how the data were collected. Embeddings, after all, are an intermediate representation some steps removed from the rawest form of the original data. (See Embeddings should be treated with care)
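A minimal sketch of the "data API" idea, using hand-made toy vectors in place of real learned embeddings (all names and numbers below are illustrative):

```python
import math

# Toy illustration of embeddings as a "data API": once an object is
# reduced to a fixed-size array of numbers, any downstream model can
# consume it. The vectors below are made up; real embeddings come from
# some trained model, whose construction bakes inductive biases into
# everything downstream.

embeddings = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "truck": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, table):
    """A trivial downstream 'model': nearest neighbour in embedding space."""
    return max((k for k in table if k != query),
               key=lambda k: cosine(table[query], table[k]))

print(nearest("cat", embeddings))  # "dog", with these toy vectors
```

The downstream model never needs to know what a "cat" is; it only sees the array. That is exactly why the biases baked in at construction time propagate silently into every consumer.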

When is your statistical model right vs. good enough

Came from Michael Betancourt on Twitter.

How do you know that your model is right?
When the residuals contain no information.

How do you know that your model is good enough?
When the residuals contain no information that you can resolve.
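One way to make the residual check concrete, as a toy sketch (invented data; a straight line deliberately fit to a quadratic signal):

```python
import math

# Toy illustration of "the residuals contain information": fit a straight
# line to data with a purely quadratic signal, then check whether the
# residuals still co-vary with a feature the model ignored.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def correlation(us, vs):
    """Pearson correlation between two equal-length sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = math.sqrt(sum((u - mu) ** 2 for u in us))
    sv = math.sqrt(sum((v - mv) ** 2 for v in vs))
    return cov / (su * sv)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # quadratic signal, no noise

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# The residuals still correlate perfectly with x^2: information remains
# that we *can* resolve, so the linear model is neither right nor good
# enough for this data.
leftover = correlation(residuals, [x * x for x in xs])
print(f"corr(residuals, x^2) = {leftover:.2f}")
```

Had we fit a quadratic model instead, the residuals here would be all zeros, leaving no information to resolve.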

Relates to this notion of The goal of scientific model building is high explanatory power, I think.

I'm going to let that one simmer for a bit before I comment further.