Notes on Statistics

Hidden Markov Model

Notes on Hidden Markov Models

Dirichlet Process

What exactly is a Dirichlet process?

We start from the Dirichlet Distribution, which provides a vector of probabilities over states. The vector of probabilities must sum to 1, and they give Categorical Distribution probabilities, i.e. the probability of getting a draw from that category.

In contrast to the Dirichlet distribution, which has a fixed number of categories, the Dirichlet process describes a generative process for an infinite number of categories. There are a few algorithmic variants of the Dirichlet process, including the Chinese Restaurant Process, the Indian Buffet Process, and the Stick Breaking Process.

In practice, when computing with Dirichlet processes, we tend to use the Stick Breaking Process with a large but finite number of states.

Multiple imputation is better than imputing single values

From the multiple imputation book.

The core idea here:

His solution was simple and brilliant: create multiple imputations that reflect the uncertainty of the missing data.


Drawing imputations from a distribution, instead of estimating the “best” value, was a drastic departure from everything that had been done before. Rubin’s original proposal did not include formulae for calculating combined estimates, but instead stressed the study of variation because of uncertainty in the imputed values. The idea was rooted in the Bayesian framework for inference, quite different from the dominant randomization-based framework in survey statistics.

So if we posit a distribution of values for missing data, we can run any analyses across each imputed version, thereby giving us a distribution over all possible outcomes.

Very Bayesian indeed!

Also, taken from Schafer (1999):

...Rubin recommends that imputations be created through Bayesian arguments: specify a parametric model for the complete data (and, if necessary, a model for the mechanism by which data become missing), apply a prior distribution to the unknown model parameters, and simulate $m$ independent draws from the conditional distribution of $Y_m$ is given $Y_{obs}$ by Bayes' theorem.

Anatomy of a probabilistic programming framework

From George Foo:

Key ingredients of a probabilistic programming framework:

  1. Language for specifying a model.
  2. Library of probability distributions + facilities to specify arbitrary distributions.
  3. Inference algorithm belonging to at least one of MCMC or VI.
  4. An optimizer, to compute mode of posterior density.
  5. Autodiff library, to compute gradients for items 3 and 4 (inference algo + optimizer)
  6. Diagnostics suite to analyze quality of inference.

PyMC3 provides a whole lot of these, alongside ArviZ!

When is your statistical model right vs. good enough

Came from Michael Betancourt on Twitter.

How do you know that your model is right?
When the residuals contain no information.

How do you know that your model is good enough?
When the residuals contain no information that you can resolve.

Relates to this notion of The goal of scientific model building is high explanatory power, I think.

I'm going to let that one simmer for a bit before I comment further.

Gaussians come from processes that are additive

I forgot where I read this, need to find the reference.

Gaussians are the result of processes that are additive. Therefore, if we have data that we think come from the addition of "stuff", a Gaussian might be a reasonable assumption. Standardizing the Gaussian might also be valid here.

Fermi estimation and Bayesian priors

Fermi estimation can probably be used to construct Bayesian priors in a rpincipled fashion.

A common critique of Bayesian estimation models is that "one can cook up priors to make the model say anything you want". While true, it ignores the reality that most of us will be able to to use Fermi estimation to come up with justified and non-informative priors.

Let me illustrate.

In biology, say we had a quantity to estimate, such as the average size of a species of bacteria, which we could reasonably assume to be Gaussian-distributed. Since we know that most other bacteria are 0.1 to 10 microns in size, we might set a the log(mean) with a $N(0, 0.5)$ prior, in the units of microns, which would put us in a good prior distribution range. The mean is at $10^0 = 1$ microns, which is smack in the middle. Three scale units out to the right, and we are at 30-ish microns; three scale units out to the left, and we are at about 0.03-ish microns. As a prior, it's uninformative enough for the problem that data can refine the distribution, and few would argue against this choice when framed the way above.

Let's consider counterfactually ridiculous priors. It would be unreasonable if I placed scale = 10 as the prior. Why? In that case, we are spanning orders of magnitude that are practically unreasonable for the category of things we call bacteria. We would go down to as small as $10^{-10}$ microns, which is less than the size of a bond between atoms, or as large as $10^{10}$ microns, which is on the scale of distances between towns on a map.

It would also be uncreasonable if I placed scale = 0.1 as the prior. Why? Because it ignores the one order of magnitude above and below our mean, which may be relevant for the category of things called bacteria. As a rule of thumb, that one order of magnitude above or below where our guess lies is a good thing to span for weakly informative priors.

So as you can see, Fermi estimation is a pretty principled way to guess at what priors should be used for a problem. It also helps us constrain those priors to be not-so-unreasonable. Yet another critique against Bayesian bites the dust, in my opinion.

Update, I saw this tweet:

the outcome variable is human body temperature measurements. you gotta assume any value from -infinity to +infinity is plausible. otherwise you're being subjective.

(Posted by a Bayesian, with an eyeroll emoji at the end.) I eyerolled myself. If we are being reasonable people, we put a small Gaussian centered on 37.0ºC as an extremely good enough prior model for human body temperatures.


This is the landing page for my notes.

This is 100% inspired by Andy Matuschak's famous notes page. I'm not technically skilled enough to replicate the full "Andy Mode", though, so I just did some simple hacks. If you're curious how these notes compiled, check out the summary in How these notes are made into HTML pages.

This is my "notes garden". I tend to it on a daily basis, and it contains some of my less fully-formed thoughts. Nothing here is intended to be cited, as the link structure evolves over time. The notes are best viewed on a desktop/laptop computer, because of the use of hovers for previews.

There's no formal "navigation", or "search" for these pages. To go somewhere, click on any of the "high-level" notes below, and enjoy.

  1. Notes on statistics
  2. Notes on differential computing
  3. The State of Data Science
  4. Network science
  5. Scholarly readings
  6. Software skills for data scientists
  7. The Data Science Programming Newsletter MOC
  8. Life and computer hacks
  9. Reading Bazaar
  10. Blog drafts
  11. Conference Proposals

Hierarchical Bayesian models for high throughput biological measurement

Key Information

Current Status

  • Paper draft needs to be written out on Overleaf.
  • Submit through company's approval system.
  • Then submit onto MDPI's website.

Probability distribution

A probability distribution is an object that assigns credibility values to discrete or continuous values. For parametrized distributions, there is usually a math function that takes in one or more parameters and returns a value across the number line.

Dirichlet Distribution

The Dirichlet distribution is often known as the multi-class (or multivariate) generalization of the Beta Distribution. In contrast to the Beta distribution, which provides a probability draw for one class, the Dirichlet distribution generalizes this to multi-class.

Maximum score estimator is used to maximize classification true positives and negatives

Taken from Stats StackExchange

Used when we want to maximize the difference between the true positives and false positives.

There is a derivation in there of how the thing works.

According to the derivation, this is also equivalent to maximizing the number of correct predictions.


PROBPROG2020 Key Information


  • Arkadij's ticket number: MA-6963-4407-835
  • My ticket number: MA-2612-6273-EFF

Conference Notes

Dates: 22-24 October
Journal publication delayed

Poster notes
- 14×10 (W×H) ratio
- 4 quadrants

Estimating a multivariate Gaussian's parameters by gradient descent

I've seen this idea in action before, but never tried my hand at it until yesterday at work.

Turns out, if you have data that can be modelled by a multivariate Gaussian distribution, you can optimize the parameters of that Gaussian to maximize the likelihood of data under the multivariate Gaussian model.

The key tricky piece of this problem is that a multivariate Gaussian distribution's covariance parameter has to be structured. (see Covariance matrix) When pairing with gradient descent though, a key problem we face is that we can never guarantee that the partial derivative of our likelihood function w.r.t. the square covariance-like matrix that we input will preserve its correct structure. What can we do to solve this problem?

We use the Cholesky decomposition.

For example, using JAX's scipy wrapper (for automatic differentiation-compatible Cholesky decomposition):

import jax.numpy as np
from jax.scipy.linalg import cholesky

a = np.array([
    [1, 0.8],
    [0.8, 1],

U = cholesky(a)  # returns the upper triangle

From U, we can reconstitute a by taking the dot-product of the the transpose of U against U:

a_hat =, U)

assert np.all(a_hat == a)

That's all cool and such, but how does this apply to fitting a multivariate Gaussian against data? As mentioned above, the key problem is that when we initialize our parameters to optimize, we oftentimes sample a number from a Gaussian (or other unbounded) distributions. To initialize a covariance matrix, we might be tempted to draw a square matrix, but there's no easy way to guarantee that it follows the desired structure of a covariance matrix.

That's where the reconstitution of a covariance matrix from upper triangle matrices can help. To sweeten the deal, because of JAX's fully NumPy-compatible API, we can instead initialize howeve we want, and then perform a transformation step to help us get back to a covariance matrix:

from jax import random as rnd

key = rnd.PRNGKey(42)
init_cov = rnd.normal(key, shape=(5, 5))  # let's make a 5x5 covariance matrix

def transform_to_covariance_matrix(sq_mat):
    U = np.triu(sq_mat)
    U_T = np.transpose(U)
    return, U)

And now with that, we can perform a full optimization of the init_cov parameter to its maximum likelihood. Let me show you how this works.

Firstly, we define the likelihood of observing our data under a multivariate Gaussian model:

from jax.scipy.stats import multivariate_normal
from jax import vmap

def loglike(params, data):
    mu, untransformed_cov = params
    cov = transform_to_covariance_matrix(untransformed_cov)
    def logpdf_func(datum):
        """logpdf of multivariate normal for one datum."""
        return multivariate_normal.logpdf(mu, cov, datum)
    logp = vmap(logpdf_func)(data)
    return np.sum(logp)

Notice two pieces:
1. We used some pretty cool/rad JAX tooling, like vmap, to eliminate sample dimensions and for-loops!
2. We also assumed that our covariance matrix is passed into the loglike function in an untransformed form, and we transform it to the correct form directly.

Now, we perform gradient-based optimization. We define the derivative function, which we'll use later for calculating gradients:

from jax import grad
dloglike = grad(loglike)

We then initialize our parameters:

mu = random.normal(key, shape=(5,))
untransformed_cov = random.normal(key, shape=(5, 5))

params = mu, untransformed_cov  # package them up into a convenient variable.

Finally, we use JAX's built-in optimizers, calling on jit to make computations fast:

from jax import jit
from jax.experimental.optimizers import adam

init, update, get_params = adam(step_size=0.005)
get_params = jit(get_params); update = jit(update)
dloglike = jit(dloglike); loglike = jit(loglike)

state = init(params)
for i in range(300):
    params = get_params(state)
    g = dloglike(params, data)
    state = update(i, g, state)
mu_opt, untransformed_cov_opt = get_params(state)
cov_opt = transform_to_covariance_matrix(untransformed_cov_opt)

The optimized covariance matrix will be pretty darn close to the true one!