Experimentation is getting a bit out of hand

Practitioners experiment with overparameterized models in domains where they may not make sense, particularly from a first-principles perspective. One example: I have peer-reviewed a paper in which 1-dimensional convolutional neural networks were unleashed on tabular data. This makes no sense from a first-principles perspective, because in tabular data there are usually no semantically meaningful spatial correlations between columns. To the trained practitioner, these experiments leave the impression of an experimenter trying too hard to shoehorn a problem into a newly learned tool, without sufficiently understanding the model's assumptions before applying it.
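The ordering assumption can be made concrete with a tiny sketch (the row of features here is hypothetical): a 1-D convolution extracts features from *neighbouring* columns, so permuting the columns of a tabular row, an operation that changes nothing about the data's meaning, changes the extracted features entirely.

```python
import numpy as np

# One hypothetical tabular row; the column order carries no meaning.
row = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
kernel = np.array([1.0, -1.0])  # a tiny 1-D convolution filter

# Features extracted by the convolution depend on which columns
# happen to be adjacent...
features = np.convolve(row, kernel, mode="valid")

# ...so permuting the columns (meaning-preserving for tabular data)
# yields different features from the same underlying record.
permuted = row[[4, 2, 0, 3, 1]]
permuted_features = np.convolve(permuted, kernel, mode="valid")
```

A model whose representation changes under a transformation that should be a no-op is carrying an assumption the data does not satisfy.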

In the spirit of Finding the appropriate model to apply is key, I'd rather see models custom-built for each problem. After all, Every well-constructed model is leverage against a problem.

Every well-constructed model is leverage against a problem

Why so?

A well-constructed model, one whose residuals cannot be further accounted for (see When is your statistical model right vs. good enough), is one that gives us high explanatory power (see The goal of scientific model building is high explanatory power). In using these models, we can:

  1. map their key parameters to values of interest, which can then be used in comparisons. This is the act of characterization.
  2. simulate what-if scenarios (including counterfactual scenarios). This is us thinking causally.

This is leverage because we can engage in these actions without spending real-world resources (apart from, of course, real-world validation).
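Both uses can be sketched in a few lines. This is a minimal, hypothetical example, assuming a first-order decay process with synthetic "observed" data standing in for a real experiment: characterization maps the fitted rate constant to a comparable quantity (a half-life), and the counterfactual run asks what would happen if the rate were halved, all without new data collection.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical mechanistic model: first-order decay, y = y0 * exp(-k * t).
def decay(t, y0, k):
    return y0 * np.exp(-k * t)

# Synthetic "observed" data, standing in for a real experiment.
rng = np.random.default_rng(42)
t = np.linspace(0, 10, 50)
y_obs = decay(t, y0=5.0, k=0.3) + rng.normal(0, 0.05, size=t.size)

# Fit the model to the observations.
(y0_hat, k_hat), _ = curve_fit(decay, t, y_obs, p0=(1.0, 0.1))

# 1. Characterization: map the key parameter k to a value of interest
#    (here, a half-life) that can be compared across conditions.
half_life = np.log(2) / k_hat

# 2. Counterfactual simulation: what if the decay rate were halved?
y_counterfactual = decay(t, y0_hat, k_hat / 2)
```

The leverage is exactly as described above: once the model is fitted, comparisons and what-if scenarios cost only compute, not experiments.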

Research vs Business Data Science

One of my colleagues (well, strictly speaking my boss's boss) recently crystallized an important idea for our team: the difference between biomedical research data science and tech business data science. I gave his ideas some thought and decided to pen down the biggest similarities and differences as I see them.

The goals of the two "forms" of data science are different:

There are issues I'm seeing in the data science field; here are some of the problems I have observed thus far:

And what I think is needed:

The key difference, I think, is that the end goal of business data science is capturing value from existing processes, while the end goal of research data science is opening new avenues of value from unknown, undeveloped, and uncaptured business processes. The latter is, and has always been, an investment to make; in a well-oiled system, the former generates profit that can and should be invested in the latter.

Finding the appropriate model to apply is key

I think we need to develop a sense for "when to apply which model".

The key skill here is to be able to look at a problem and very quickly identify what class of problem it falls under, and what model classes are best suited for it.

By problem class, I mean things like:

  • Input/inverse design problems
  • Supervised learning problems
  • Unsupervised learning problems
  • Statistical inference problems
  • Pure prediction problems

I think that those who claim "the end of programming is near" likely have a deeply flawed view of how models of all kinds are built.