Dirichlet Process

What exactly is a Dirichlet process?

We start from the Dirichlet Distribution, whose draws are vectors of probabilities over states. Each vector sums to 1 and can be used directly as Categorical Distribution probabilities, i.e. the probability of drawing each category.

In contrast to the Dirichlet distribution, which has a fixed number of categories, the Dirichlet process describes a generative process over an unbounded number of categories. There are a few algorithmic framings used to construct or approximate it, including the Chinese Restaurant Process and the Stick Breaking Process; the Indian Buffet Process is a closely related construction (strictly speaking, it corresponds to the Beta process rather than the Dirichlet process).

In practice, when computing with Dirichlet processes, we tend to use the Stick Breaking Process with a large but finite number of states.

Hidden Markov Model axes semantics

What are the semantics behind a Hidden Markov Model's matrices?

Previously, I wrote about the "event, batch, and sample" shape axes (see my blog post on probability distributions).

Building off that, here are a few more.

Input shape

First off, there's the transition matrix. It is a square matrix, and its axis semantics are (num_states, num_states). From here, we already know that there are at least two tensor axes that carry a states semantic. The transition matrix is necessary and (I think) sufficient for specifying a Markov Model. (The Hidden piece needs extra stuff.)

The transition matrix is one of the inputs, and we can control it using a Dirichlet Process prior (see also: Hierarchical Dirichlet Process Hidden Markov Model Transition Matrices).

There's also the emission parameters. Assume we have Gaussian-distributed observations from each of the hidden states, and that there are no autoregressive terms. Per state, we need a Gaussian central tendency (mean) vector $\mu$ of length n_dimensions, and a covariance matrix $\sigma$ of shape (n_dimensions, n_dimensions). Therefore, for an entire HMM, we also need to specify $\mu$ with shape (n_states, n_dimensions) and covariance $\sigma$ with shape (n_states, n_dimensions, n_dimensions).
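
To make these shape semantics concrete, here's a minimal NumPy sketch; the state count, dimensionality, and parameter values are placeholders for illustration:

```python
import numpy as np

n_states = 3       # number of hidden states (illustrative)
n_dimensions = 2   # dimensionality of each Gaussian emission (illustrative)

# Transition matrix: rows index the current state, columns the next state.
# Each row is a probability vector over next states, so rows sum to 1.
transition_matrix = np.full((n_states, n_states), 1.0 / n_states)
assert transition_matrix.shape == (n_states, n_states)
assert np.allclose(transition_matrix.sum(axis=1), 1.0)

# Emission parameters: one Gaussian per hidden state.
mu = np.zeros((n_states, n_dimensions))              # central tendency per state
sigma = np.stack([np.eye(n_dimensions)] * n_states)  # covariance per state
assert mu.shape == (n_states, n_dimensions)
assert sigma.shape == (n_states, n_dimensions, n_dimensions)
```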

Event shape

We can view the Hidden Markov Model as a probability distribution, by also specifying its output shapes. If we semantically define one draw of a hidden Markov model as a timeseries sequence of states drawn from the model, then that draw has an event shape of (n_timesteps,). Multiple draws stack along a sample axis, giving an overall shape of (n_samples, n_timesteps), where n_samples belongs to the sample shape rather than the event shape.

Assuming we have a Gaussian emission distribution, we now also have to account for the observed emissions: one draw of the observed timeseries has an event shape of (n_timesteps, n_dimensions).
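
A minimal forward-sampling sketch shows where those shapes come out (the function name is my own, and the parameters mirror the illustrative ones defined above):

```python
import numpy as np

def sample_hmm(transition_matrix, mu, sigma, n_timesteps, seed=None):
    """Draw one timeseries from a Gaussian-emission HMM."""
    rng = np.random.default_rng(seed)
    n_states = transition_matrix.shape[0]
    states = np.empty(n_timesteps, dtype=int)
    states[0] = rng.integers(n_states)  # uniform initial state, for simplicity
    for t in range(1, n_timesteps):
        states[t] = rng.choice(n_states, p=transition_matrix[states[t - 1]])
    emissions = np.stack([rng.multivariate_normal(mu[s], sigma[s]) for s in states])
    return states, emissions

# Illustrative parameters, as in the sketch above.
transition_matrix = np.full((3, 3), 1.0 / 3)
mu = np.zeros((3, 2))
sigma = np.stack([np.eye(2)] * 3)

states, emissions = sample_hmm(transition_matrix, mu, sigma, n_timesteps=100)
assert states.shape == (100,)        # (n_timesteps,)
assert emissions.shape == (100, 2)   # (n_timesteps, n_dimensions)
```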

Indian Buffet Process

Chinese Restaurant Process

From Wikipedia:

In probability theory, the Chinese restaurant process is a discrete-time stochastic process, analogous to seating customers at tables in a Chinese restaurant. Imagine a Chinese restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 sits at the first table. The next customer either sits at the same table as customer 1, or the next table. This continues, with each customer choosing to either sit at an occupied table with a probability proportional to the number of customers already there (i.e., they are more likely to sit at a table with many customers than few), or an unoccupied table. At time n, the n customers have been partitioned among m ≤ n tables (or blocks of the partition).

Let's try putting this into an algorithmic framing.

  1. Initialize with infinite tables (read: large number of vector slots), each with an infinite number of possible seats (i.e. integer counts per vector slot).
  2. Add first customer to first table.
  3. For each new customer,
    1. Calculate the probability of sitting at occupied table $k$ as $\frac{n_k}{(\sum_k n_k)+1}$, where $n_k$ is the number of customers already seated at table $k$, and the probability of sitting at a new table as $\frac{1}{(\sum_k n_k)+1}$. (This corresponds to a concentration parameter of 1.)
    2. Draw a table based on the probability vector above.
    3. Assign customer to that table.

The probability vector that is calculated on-the-fly effectively parametrizes a Categorical Distribution.

I think this description should be enough to implement it in Python, which would be an ultimate test of my understanding.
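
Here's a minimal sketch of that algorithm in NumPy, with the concentration parameter implicitly fixed at 1 to match the probabilities above (function and variable names are my own):

```python
import numpy as np

def chinese_restaurant_process(n_customers, seed=None):
    """Simulate CRP table assignments for n_customers (concentration = 1)."""
    rng = np.random.default_rng(seed)
    counts = [1]        # customer 1 sits at the first table
    assignments = [0]
    for _ in range(1, n_customers):
        total = sum(counts)
        # Probabilities for each occupied table, plus one slot for a new table.
        probs = np.array(counts + [1]) / (total + 1)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):   # the "new table" slot was chosen
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(int(table))
    return assignments, counts

assignments, counts = chinese_restaurant_process(100)
print(f"{len(counts)} tables occupied after 100 customers")
```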

Stick Breaking Process

One algorithmic protocol for generating Dirichlet Process draws.

Steps:

  1. Start with a stick of length 1
  2. Draw realization $i$ from a pre-configured (i.e. fixed-parameter) Beta Distribution. Call this realization $p_i$.
  3. Split the remaining stick into two, with fraction $p_i$ of it on the left, and fraction $1 - p_i$ on the right.
  4. Store the length of the left piece as $l_i$, and carry the right piece forward to the next round.
  5. Repeat this ad infinitum (if we're talking about it in the abstract), or up to a fixed number of draws.

We'll now have a series of draws for $p_i$ and $l_i$:

  • $l = (l_1, l_2, l_3, ... l_n)$
  • $p = (p_1, p_2, p_3, ... p_n)$

Each $p$ came from an independent Beta Distribution draw, while each $l$ was the result of breaking whatever was leftover from the previous round of stick breaking.

If we stop at a finite number of breaks, then $l$ is guaranteed not to sum to 1, because some length of stick is always left over after the final break. To use $l$ as a valid probability vector, it must be re-normalized to sum to 1, i.e.:

$$l_{norm} = \frac{l}{\sum{l}}$$
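
A minimal truncated version in NumPy, assuming the pre-configured distribution is $\text{Beta}(1, \alpha)$ (the standard choice for a Dirichlet process with concentration $\alpha$; the truncation level and $\alpha$ value below are arbitrary):

```python
import numpy as np

def stick_breaking(n_sticks, alpha=1.0, seed=None):
    """Truncated stick-breaking draw: n_sticks re-normalized weights."""
    rng = np.random.default_rng(seed)
    p = rng.beta(1.0, alpha, size=n_sticks)                       # independent Beta draws
    remaining = np.concatenate([[1.0], np.cumprod(1 - p)[:-1]])   # stick left before each break
    l = p * remaining                                             # length of each left piece
    return l / l.sum()                                            # re-normalize to sum to 1

weights = stick_breaking(n_sticks=50, alpha=3.0)
assert np.isclose(weights.sum(), 1.0)
```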

Categorical Distribution

The categorical distribution is a probability distribution that yields a category from a discrete set of categories.

It's parametrized by a probability vector, which has a length equal to the number of categories and sums to 1. The probability vector provides the probability of drawing each category.
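
For example, with NumPy (the probability values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])                 # one entry per category, sums to 1
draws = rng.choice(len(p), size=1000, p=p)    # each draw is a category index
print(np.bincount(draws) / len(draws))        # empirical frequencies approximate p
```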

Dirichlet Distribution

The Dirichlet distribution is often described as the multi-class (or multivariate) generalization of the Beta Distribution. Whereas a Beta draw is a single probability (for a two-outcome setting), a Dirichlet draw is a full probability vector over multiple classes.
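
A quick illustration with NumPy (the concentration values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0, 4.0])   # concentration parameters, one per class
p = rng.dirichlet(alpha)                 # a probability vector over 4 classes
assert np.isclose(p.sum(), 1.0)

# With two classes, the Dirichlet reduces to the Beta distribution:
# rng.dirichlet([a, b])[0] has the same distribution as rng.beta(a, b).
```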

Hierarchical Dirichlet Process Hidden Markov Model Transition Matrices

From Matt Johnson's thesis.

Section 3.2.3 says:

The generative process HDP-HMM($\gamma$, $\alpha$, $H$), given concentration parameters $\gamma, \alpha > 0$ and base measure (observation prior) $H$, can be summarized as:

$$\beta \sim GEM(\gamma)$$
$$\pi^{(i)} \sim DP(\alpha, \beta)$$
$$x_t \sim \pi^{(x_{t-1})}$$
$$y_t \sim f(\theta^{x_t})$$

...
The HDP plays the role of a prior over infinite transition matrices: each $\pi^{(i)}$ is a DP draw, and is interpreted as the transition distribution from state $i$. The $\pi^{(i)}$ are linked by being DP draws parametrized by the same discrete measure $\beta$, thus $E[\pi^{(i)}] = \beta$ and the transition distributions tend to have their mass concentrated around a typical set of states, providing the desired bias towards re-entering and re-using a consistent set of states.

The key here: make sure each row vector is generated from a Dirichlet Process. By generating every row as a DP draw centered on the same shared base measure $\beta$ (itself drawn from a stick-breaking GEM prior), we get a hierarchical Dirichlet process, in which the top-level draw governs the generation of state transition probabilities. The top-level concentration parameter $\gamma$ can be set low so that $\beta$ concentrates its mass on a few states, expressing the belief that the Markov chain should re-enter and re-use a consistent set of states.
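
A minimal sketch of this generative process using a finite "weak-limit" style approximation, where the GEM draw is truncated to n_states components and each DP row draw is replaced by a finite Dirichlet, $\pi^{(i)} \sim \text{Dirichlet}(\alpha\beta)$ (the function name and parameter values below are my own):

```python
import numpy as np

def hdp_hmm_transition_matrix(n_states, gamma=3.0, alpha=5.0, seed=None):
    """Sample a truncated HDP prior draw over an (n_states, n_states) transition matrix."""
    rng = np.random.default_rng(seed)
    # beta ~ GEM(gamma), truncated to n_states components via stick breaking.
    b = rng.beta(1.0, gamma, size=n_states)
    beta = b * np.concatenate([[1.0], np.cumprod(1 - b)[:-1]])
    beta /= beta.sum()
    # Each row pi^(i) is a finite-dimensional DP draw centered on beta: E[pi^(i)] = beta.
    return np.stack([rng.dirichlet(alpha * beta) for _ in range(n_states)])

pi = hdp_hmm_transition_matrix(n_states=10)
assert np.allclose(pi.sum(axis=1), 1.0)   # every row is a valid transition distribution
```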