When is your statistical model right vs. good enough

Came from Michael Betancourt on Twitter.

How do you know that your model is right?
When the residuals contain no information.

How do you know that your model is good enough?
When the residuals contain no information that you can resolve.
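To make "residuals contain no information" concrete, here is a minimal sketch (my own illustration, not Betancourt's): fit a deliberately mis-specified linear model to quadratic data, then check whether the residuals still correlate with a candidate structure.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.size)  # true relationship is quadratic

# Fit a deliberately mis-specified linear model.
slope, intercept = np.polyfit(x, y, deg=1)
residuals_linear = y - (slope * x + intercept)

# Fit a quadratic model matching the data-generating process.
coeffs = np.polyfit(x, y, deg=2)
residuals_quad = y - np.polyval(coeffs, x)

# Do the residuals still carry information about x**2?
corr_linear = np.corrcoef(residuals_linear, x**2)[0, 1]
corr_quad = np.corrcoef(residuals_quad, x**2)[0, 1]
print(corr_linear, corr_quad)
```

The mis-specified model leaves structure behind (a strong correlation); the correctly specified one leaves roughly pure noise, which is the "no information" condition.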

Relates, I think, to the notion that The goal of scientific model building is high explanatory power.

I'm going to let that one simmer for a bit before I comment further.

Every well-constructed model is leverage against a problem

Why so?

A well-constructed model, for which the residuals cannot be further accounted for (see: When is your statistical model right vs. good enough), is one that gives us high explanatory power (see: The goal of scientific model building is high explanatory power). Using these models, we can:

  1. map a model's key parameters to values of interest, which can then be used in comparisons. This is the act of characterization.
  2. simulate what-if scenarios (including counterfactual scenarios). This is us thinking causally.

The reason this is leverage is that we can take these actions without spending real-world resources (apart from, of course, real-world validation).
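As a toy illustration of both actions (entirely hypothetical data and model, not from the original note): fit a simple linear model, read off its key parameter, and then simulate a what-if scenario without running a new experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: reaction yield as a function of temperature.
temps = rng.uniform(20, 80, size=50)
true_slope, true_intercept = 0.8, 5.0
yields = true_intercept + true_slope * temps + rng.normal(scale=2.0, size=50)

# (1) Characterization: the fitted slope is a value of interest
#     we can compare across systems.
slope, intercept = np.polyfit(temps, yields, deg=1)

# (2) What-if simulation: predict yield at a temperature we never ran,
#     spending no real-world resources.
def predict(t):
    return intercept + slope * t

counterfactual = predict(95.0)
print(slope, counterfactual)
```

The point of the sketch: once the model is fitted, both characterization and counterfactual exploration are cheap, which is exactly the leverage described above.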

The impossibility of low-rank representations for triangle-rich complex networks

News article on ScienceDaily. The original paper backing the news article is published in PNAS.

Quotables from the news article:

He also noted that new embedding methods are mostly being compared to other embedding methods. Recent empirical work by other researchers, however, shows that different techniques can give better results for specific tasks.

Benchmarks are quite important! See also: The craze with embeddings.

Given the growing influence of machine learning in our society, Seshadhri said it is important to investigate whether the underlying assumptions behind the models are valid.

Relates to the idea that Finding the appropriate model to apply is key. This is because Every well-constructed model is leverage against a problem; when the underlying assumptions behind our models are valid for the specific problem at hand, we gain leverage to solve it. (We should also remain keenly aware of When is your statistical model right vs. good enough.)
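The paper's core claim can be illustrated with a small numpy sketch (my own, heavily simplified): build a triangle-rich graph, take a truncated-SVD low-rank approximation of its adjacency matrix, and compare trace-based triangle counts.

```python
import numpy as np

# A triangle-rich graph: 30 disjoint 3-cliques (one triangle each),
# assembled as a block-diagonal adjacency matrix.
n_triangles = 30
block = np.ones((3, 3)) - np.eye(3)
A = np.kron(np.eye(n_triangles), block)

def triangle_count(M):
    # For a 0/1 adjacency matrix, trace(M^3)/6 is the exact triangle count;
    # for a low-rank surrogate it serves as a proxy.
    return np.trace(M @ M @ M) / 6.0

# Rank-k approximation via truncated SVD.
U, s, Vt = np.linalg.svd(A)
k = 5
A_k = (U[:, :k] * s[:k]) @ Vt[:k]

print(triangle_count(A))    # 30 triangles in the true graph
print(triangle_count(A_k))  # far fewer survive the rank-5 approximation
```

The low-rank surrogate loses most of the triangle structure, which is a (much weaker, finite) cousin of the impossibility result the paper proves.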

The goal of scientific model building is high explanatory power

Why does mechanistic thinking matter? In The end goals of research data science, we are in pursuit of invariants, i.e. knowledge that stands the test of time. (How our business contexts exploit that knowledge for the win-win benefit of society and the business is a matter to discuss another day.)

When we build models, particularly of natural systems, predictive power matters only in the context of explanatory power, where we can map phenomena of interest to key parameters in a model. For example, in an Autoregressive Hidden Markov Model, the autoregressive coefficient may correspond to a meaningful property in our research context.
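A stripped-down example of this mapping (a plain AR(1) process rather than a full AR-HMM, and purely illustrative): the fitted autoregressive coefficient is directly interpretable as the persistence of the system from one time step to the next.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate an AR(1) process: x_t = phi * x_{t-1} + noise.
phi_true = 0.7  # hypothetical "true" persistence of the system
n = 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal()

# Estimate phi by lag-1 least squares. This parameter maps to a
# meaningful property: how strongly the system remembers its past.
phi_hat = (x[:-1] @ x[1:]) / (x[:-1] @ x[:-1])
print(phi_hat)  # close to 0.7
```

Here the estimated coefficient is not just a knob for prediction; it characterizes the system's memory, which is the explanatory-power framing above.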

Being able to look at a natural system and find the most appropriate model for the system is a key skill for winning the trust of the non-quantitative researchers that we serve. (ref: Finding the appropriate model to apply is key)