The goal of scientific model building is high explanatory power

Why does mechanistic thinking matter? As laid out in The end goals of research data science, we are in pursuit of invariants, i.e. knowledge that stands the test of time. (How our business contexts exploit that knowledge for the win-win benefit of society and the business is a matter to discuss another day.)

When we build models, particularly of natural systems, predictive power matters only in the context of explanatory power, where we can map phenomena of interest to key parameters in a model. For example, in an Autoregressive Hidden Markov Model, the autoregressive coefficient may correspond to a meaningful property in our research context.

Being able to look at a natural system and find the most appropriate model for the system is a key skill for winning the trust of the non-quantitative researchers that we serve. (ref: Finding the appropriate model to apply is key)

Finding the appropriate model to apply is key

I think we need to develop a sense for "when to apply which model".

The key skill here is to be able to look at a problem and very quickly identify what class of problem it falls under, and what model classes are best suited for it.

By problem class, I mean things like:

  • Input/inverse design problems
  • Supervised learning problems
  • Unsupervised learning problems
  • Statistical inference problems
  • Pure prediction problems

I think that those who claim that "the end of programming is near" likely have a deeply flawed view of how models of all kinds are built.

The place for overparametrized models is tooling

I think that the place where overparametrized models make the most sense is not in models with "high explanatory power" (see The goal of scientific model building is high explanatory power). Rather, their best realms of application are a bit more mundane. We don't always build these large, overparametrized models to explain the world, per se, but to automate some task that could have been done by a human being. In other words, we come right back to the notion of using computing to automate routine tasks. These tasks basically fall into one of two classes:

  1. Generating representations of input data, which falls under the umbrella term of Representation Learning.
  2. Automation of routine, manual tasks, via methods such as semantic segmentation.

Both of these are things that would have originally required human intervention.
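
As a minimal sketch of task 1 above, here is what "generating representations of input data" can look like in practice. PCA stands in for the (usually much larger, overparametrized) representation learner, and the data shapes are made up purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Pretend these are raw, high-dimensional measurements
# (1,000 samples x 500 features); values are made up.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 500))

# A representation learner compresses the raw inputs into a
# lower-dimensional embedding that downstream tasks can reuse.
# PCA is a stand-in here; in practice this might be a deep
# autoencoder or a pretrained neural network.
embedder = PCA(n_components=10)
Z = embedder.fit_transform(X)  # shape: (1000, 10)
```

The point is not the specific model, but that the output is a reusable representation rather than an explanation.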

Seems to me, then, that many so-called "AI" applications are merely (yes, merely) about automating tasks that would have otherwise taken much longer with humans.


Research vs Business Data Science

One of my colleagues (well, strictly speaking, my boss's boss) recently crystallized a very important idea for the rest of us: the difference between biomedical research data science and tech business data science. I gave his ideas some thought, and decided to pen down what I saw as the biggest similarities and differences.

The goals of the two "forms" of data science are different:

There are issues that I'm seeing in the data science field. Here are some of the problems I have seen thus far:

And what I think is needed:

The key difference, I think, is that The end goals of business data science is about capturing value from existing processes, while The end goals of research data science is about opening up new avenues of value from unknown, un-developed, and un-captured business processes. The latter is and has always been an investment to make; in a well-oiled system, the former likely generates profit that can and should be invested in the latter.

When is your statistical model right vs. good enough

Came from Michael Betancourt on Twitter.

How do you know that your model is right?
When the residuals contain no information.

How do you know that your model is good enough?
When the residuals contain no information that you can resolve.
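
To make "the residuals contain no information" a bit more concrete, here is a minimal sketch (my own illustration, not Betancourt's) that deliberately underfits curved data with a straight line and then checks the residuals for leftover structure via lag-1 autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)  # quadratic ground truth

# Deliberately underfit with a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# If the model were "right", the residuals would look like pure noise.
# Here the lag-1 autocorrelation is large, i.e. the residuals still
# contain information the model failed to explain.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"lag-1 residual autocorrelation: {lag1_corr:.2f}")
```

A value near zero wouldn't prove the model is right, but a large one is a clear sign that there is information left to resolve.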

Relates to this notion of The goal of scientific model building is high explanatory power, I think.

I'm going to let that one simmer for a bit before I comment further.

Every well-constructed model is leverage against a problem

Why so?

A well-constructed model, one whose residuals cannot be further accounted for (see When is your statistical model right vs. good enough), is one that gives us high explanatory power (see: The goal of scientific model building is high explanatory power). In using these models, we can:

  1. map its key parameters to values of interest, which can then be used in comparisons. This is the act of characterization.
  2. simulate what-if scenarios (including counterfactual scenarios). This is us thinking causally.

The reason this is leverage is that we can engage in these actions without needing to spend real-world resources (apart from, of course, real-world validation).
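
As a minimal sketch of point 2 above, here is what a what-if simulation can look like once a model has been fitted. The exponential growth model and its parameter values are hypothetical, purely for illustration:

```python
import numpy as np

# Suppose we fitted an exponential growth model y = y0 * exp(r * t)
# to experimental data and obtained these (hypothetical) estimates.
y0_hat, r_hat = 2.0, 0.3

def simulate(y0, r, t):
    """Forward-simulate the fitted model over time points t."""
    return y0 * np.exp(r * t)

t = np.linspace(0, 10, 50)
baseline = simulate(y0_hat, r_hat, t)

# What-if scenario: what would happen if an intervention halved the
# growth rate? No new experiment needed, just the fitted model.
counterfactual = simulate(y0_hat, r_hat / 2, t)
```

The counterfactual run costs nothing but compute; the real-world experiment it stands in for would not.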

The end goals of research data science

What are the end goals of research data science?

One of them is being able to explain the world, ideally in a causal fashion. (See: The goal of scientific model building is high explanatory power)

The kind of data scientist we need for this kind of work is different from that of business data science (see: The end goals of business data science).

The context in which research data science operates is one where business processes (and their corresponding business outcomes) are oftentimes not well defined. Thus, defining the ROI is a less straightforward task than it might otherwise be. As such, we need to view research data science as an investment in the future, just as any research organization is.

Also a thing to read: Researchers think mechanistically about the world

Autoregressive Hidden Markov Model

Gaussian AR-HMMs and their structure in equations.

First, the HMM piece. The state at time $t$ is $s_{t}$, and the transition matrix is $p_{tr}$.

$$s_{t} | s_{t-1} \sim \text{Categorical}(p_{tr}[s_{t-1}])$$

(Just expressing that we slice out the row belonging to state $s_{t-1}$ from the transition matrix.)
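
In code, that slice-and-sample step might look like this (a NumPy sketch with a made-up 2-state transition matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
p_tr = np.array([[0.9, 0.1],   # row 0: transition probabilities out of state 0
                 [0.2, 0.8]])  # row 1: transition probabilities out of state 1

s_prev = 0
# Slice out the row for the previous state and sample the next state
# from a Categorical distribution over that row.
s_next = rng.choice(p_tr.shape[0], p=p_tr[s_prev])
```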

Now, we have a conditional distribution: emission $y$ given state $s$.

$$y_t | s_t \sim \text{Normal}(\mu[s_t] + ky_{t-1}, \sigma[s_t])$$

This is the most common form: on top of the state-dependent mean $\mu[s_t]$, the mean depends on the previous emission $y_{t-1}$, scaled by the autoregressive coefficient $k$. We can also make the scale depend on the previous state too:

$$y_t | s_t \sim \text{Normal}(\mu[s_t] + k y_{t-1}, \sigma[s_t] \cdot k \sigma[s_{t-1}])$$

There are many other ways to establish the autoregressive dependency; the general form is that the parameters of the current emission distribution depend on the values of previous emissions.
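
Putting the pieces together, here is a minimal forward-simulation sketch of the Gaussian AR-HMM above, using the state-dependent mean plus $k y_{t-1}$ form; all parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 2-state parameters.
p_tr = np.array([[0.95, 0.05],
                 [0.10, 0.90]])   # transition matrix
mu = np.array([0.0, 3.0])         # state-dependent means
sigma = np.array([0.5, 1.0])      # state-dependent scales
k = 0.7                           # autoregressive coefficient

T = 200
s = np.zeros(T, dtype=int)  # state sequence, starting in state 0
y = np.zeros(T)             # emission sequence, starting at 0.0

for t in range(1, T):
    # s_t | s_{t-1} ~ Categorical(p_tr[s_{t-1}])
    s[t] = rng.choice(2, p=p_tr[s[t - 1]])
    # y_t | s_t ~ Normal(mu[s_t] + k * y_{t-1}, sigma[s_t])
    y[t] = rng.normal(loc=mu[s[t]] + k * y[t - 1], scale=sigma[s[t]])
```

Each emission feeds back into the mean of the next one, which is exactly the autoregressive dependency described above.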