Hierarchical Dirichlet Process Hidden Markov Model Transition Matrices

From Matt Johnson's thesis.

Section 3.2.3 says:

The generative process HDP-HMM($\gamma$, $\alpha$, $H$) given concentration parameters $\gamma, \alpha > 0$ and base measure (observation prior) $H$ can be summarized as:

$$\beta \sim GEM(\gamma)$$
$$\pi^{(i)} \sim DP(\alpha, \beta) \qquad \theta^{(i)} \sim H$$
$$x_t \sim \pi^{(x_{t-1})}$$
$$y_t \sim f(\theta^{(x_t)})$$

...
The HDP plays the role of a prior over infinite transition matrices: each $\pi^{(i)}$ is a DP draw, and is interpreted as the transition distribution from state $i$. The $\pi^{(i)}$ are linked by being DP draws parametrized by the same discrete measure $\beta$, thus $E[\pi^{(i)}] = \beta$ and the transition distributions tend to have their mass concentrated around a typical set of states, providing the desired bias towards re-entering and re-using a consistent set of states.

The key here: each row vector of the transition matrix is generated from a Dirichlet Process. Because every row is a DP draw parametrized by the same top-level measure $\beta$, we get a hierarchical Dirichlet process, in which the parental draw $\beta$ and the concentration parameters govern the generation of state transition probabilities. The concentration parameter can be "concentrated" (i.e. low), thus expressing the belief that the Markov chain should re-enter and re-use a consistent set of states.
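To make this concrete, here is a minimal NumPy sketch (not from the thesis) using the common weak-limit approximation: $\beta$ is approximated by a finite symmetric Dirichlet draw, and each transition row is then drawn as a Dirichlet parametrized by $\alpha \beta$. The truncation level and concentration values below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(42)

K = 20       # truncation level: finite approximation of the infinite state space
gamma = 2.0  # top-level concentration
alpha = 5.0  # per-row concentration

# Weak-limit approximation of beta ~ GEM(gamma): a symmetric Dirichlet draw.
beta = rng.dirichlet(np.full(K, gamma / K))

# Each row pi^(i) ~ DP(alpha, beta), approximated by Dirichlet(alpha * beta).
pi = rng.dirichlet(alpha * beta, size=K)  # transition matrix, shape (K, K)

# Every row has expectation beta, so the rows share a typical set of states.
print(pi.shape)        # (20, 20)
print(pi.sum(axis=1))  # each row sums to 1
```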

Dirichlet Process

What exactly is a Dirichlet process?

We start from the Dirichlet Distribution, which provides a vector of probabilities over a fixed number of states. The vector of probabilities must sum to 1, and its entries serve as Categorical Distribution probabilities, i.e. the probability of drawing each category.
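A tiny NumPy illustration of this (the concentration values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# One Dirichlet draw over 4 categories: a probability vector that sums to 1.
p = rng.dirichlet([0.5, 0.5, 0.5, 0.5])
print(p, p.sum())

# Use that vector as Categorical distribution probabilities.
draws = rng.choice(4, size=10, p=p)
print(draws)
```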

In contrast to the Dirichlet distribution, which has a fixed number of categories, the Dirichlet process describes a generative process for an infinite number of categories. There are a few equivalent constructions of the Dirichlet process, including the Chinese Restaurant Process and the Stick Breaking Process. (The Indian Buffet Process, though often mentioned alongside them, is the analogous construction for the Beta process rather than the Dirichlet process.)

In practice, when computing with Dirichlet processes, we tend to use the Stick Breaking Process with a large but finite number of states.
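Here is a minimal stick-breaking sketch in NumPy, assuming a fixed truncation level (the helper name and parameter values are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(concentration: float, num_sticks: int) -> np.ndarray:
    """Truncated stick-breaking draw: weights over `num_sticks` categories."""
    # Each fraction is the Beta-distributed proportion broken off the remaining stick.
    fractions = rng.beta(1.0, concentration, size=num_sticks)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - fractions[:-1])])
    return fractions * remaining

weights = stick_breaking(concentration=2.0, num_sticks=50)
print(weights.sum())  # close to, but slightly below, 1 because of the truncation
```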

Hidden Markov Model axes semantics

What are the semantics behind a Hidden Markov Model's matrices?

Previously, I wrote about the "event, batch, and sample" shape axes (see my blog post on probability distributions).

Building off that, here are a few more.

Input shape

First off, there's the transition matrix. It is a square matrix, and its axis semantics are (num_states, num_states). From here, we already know that there are at least two tensor axes that have a states semantic. The transition matrix is necessary for specifying a Markov Model, and together with an initial state distribution it is sufficient. (The Hidden piece needs extra stuff.)

The transition matrix is one of the inputs, and we can control it using a Dirichlet Process prior (see also: Hierarchical Dirichlet Process Hidden Markov Model Transition Matrices).

There are also the emission parameters. Assume we have Gaussian-distributed observations from each of the hidden states, and that there are no autoregressive terms. Per state, we need a Gaussian mean vector $\mu$ of length n_dimensions and a covariance matrix $\Sigma$ of shape (n_dimensions, n_dimensions). Therefore, for an entire HMM, we also need to specify $\mu$ with shape (n_states, n_dimensions) and a stack of covariance matrices $\Sigma$ with shape (n_states, n_dimensions, n_dimensions).
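A shape-only NumPy sketch of these inputs, using made-up values for n_states and n_dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)

n_states, n_dimensions = 5, 3

# Transition matrix: (num_states, num_states); each row is a categorical distribution.
transition = rng.dirichlet(np.ones(n_states), size=n_states)

# Gaussian emission parameters, one set per hidden state.
mu = rng.normal(size=(n_states, n_dimensions))
sigma = np.stack([np.eye(n_dimensions) for _ in range(n_states)])

assert transition.shape == (n_states, n_states)
assert mu.shape == (n_states, n_dimensions)
assert sigma.shape == (n_states, n_dimensions, n_dimensions)
```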

Event shape

We can view the Hidden Markov Model as a probability distribution, by also specifying its output shapes. If we semantically define one draw of a hidden Markov model as a timeseries sequence of states drawn from the model, then the draw should have an event shape of (n_timesteps,). Multiple draws would add a sample shape of (n_samples,), giving a total shape of (n_samples, n_timesteps), while the event shape stays (n_timesteps,).

Assuming we have a Gaussian emission distribution, we now also have to account for the observed outputs: each timestep emits a vector of length n_dimensions, so an observed draw has an event shape of (n_timesteps, n_dimensions).
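A self-contained NumPy sketch of one such draw, assuming a uniform initial state distribution (an extra ingredient not discussed above) and identity covariances for simplicity:

```python
import numpy as np

rng = np.random.default_rng(3)

n_states, n_dimensions, n_timesteps = 5, 3, 100

# HMM parameters, shaped as in the input-shape section.
transition = rng.dirichlet(np.ones(n_states), size=n_states)
mu = rng.normal(size=(n_states, n_dimensions))
sigma = np.stack([np.eye(n_dimensions)] * n_states)
initial = np.ones(n_states) / n_states  # assumed uniform initial state distribution

# One draw from the HMM: a hidden state sequence plus Gaussian observations.
states = np.empty(n_timesteps, dtype=int)
states[0] = rng.choice(n_states, p=initial)
for t in range(1, n_timesteps):
    states[t] = rng.choice(n_states, p=transition[states[t - 1]])

observations = np.array([rng.multivariate_normal(mu[s], sigma[s]) for s in states])

print(states.shape)        # (n_timesteps,)               -- latent event shape
print(observations.shape)  # (n_timesteps, n_dimensions)  -- observed event shape
```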

Hidden Markov Model

Notes on Hidden Markov Models