Hidden Markov Model

Notes on Hidden Markov Models

Hidden Markov Model axis semantics

What are the semantics behind a Hidden Markov Model's matrices?

Previously, I wrote about the "event, batch, and sample" shape axes (see my blog post on probability distributions).

Building off that, here are a few more.

Input shape

First off, there's the transition matrix. It is a square matrix, and its axis semantics are (num_states, num_states). From here, we already know that there are at least two tensor axes that carry a "states" semantic. The transition matrix is necessary and (I think) sufficient for specifying a Markov Model. (The "Hidden" piece needs extra machinery, covered below.)
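
To make the axis semantics concrete, here is a minimal NumPy sketch (the variable names and values are placeholders of my own choosing):

```python
import numpy as np

n_states = 3  # placeholder number of hidden states

# Transition matrix: axis semantics are (num_states, num_states).
# Row i is the probability distribution over the next state
# given that the current state is i, so each row sums to 1.
rng = np.random.default_rng(42)
transition_matrix = rng.dirichlet(alpha=np.ones(n_states), size=n_states)

assert transition_matrix.shape == (n_states, n_states)
assert np.allclose(transition_matrix.sum(axis=1), 1.0)
```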

The transition matrix is one of the inputs, and we can control it using a Dirichlet Process prior (see also: Hierarchical Dirichlet Process Hidden Markov Model Transition Matrices).

There are also the emission parameters. Assume we have Gaussian-distributed observations from each of the hidden states, and that there are no autoregressive terms. Per state, we need a Gaussian mean vector $\mu$ of length n_dimensions and a covariance matrix $\sigma$ of shape (n_dimensions, n_dimensions). Therefore, for an entire HMM, we also need to specify $\mu$ with shape (n_states, n_dimensions) and $\sigma$ with shape (n_states, n_dimensions, n_dimensions).
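
Correspondingly, a sketch of the emission parameter shapes (only the shapes matter here; the values are placeholders):

```python
import numpy as np

n_states, n_dimensions = 3, 2  # placeholder sizes

# Per-state Gaussian emission parameters for the whole HMM:
# one mean vector and one covariance matrix per hidden state.
mu = np.zeros((n_states, n_dimensions))
sigma = np.stack([np.eye(n_dimensions) for _ in range(n_states)])

assert mu.shape == (n_states, n_dimensions)
assert sigma.shape == (n_states, n_dimensions, n_dimensions)
```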

Event shape

We can view the Hidden Markov Model as a probability distribution by also specifying its output shapes. If we semantically define one draw of a Hidden Markov Model as a timeseries sequence of states drawn from the model, then that draw should have an event shape of (n_timesteps,). Multiple draws would stack along a sample axis, giving a shape of (n_samples, n_timesteps), while the event shape itself stays (n_timesteps,).

Assuming we have a Gaussian emission distribution, we now also have to account for the observed emissions: one draw of the observed sequence has event shape (n_timesteps, n_dimensions), and multiple draws stack to (n_samples, n_timesteps, n_dimensions).
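
A sketch of one ancestral draw, to make those shapes explicit (all parameter values here are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_dimensions, n_timesteps = 3, 2, 100

# Model parameters (arbitrary values, for illustration only).
transition_matrix = rng.dirichlet(np.ones(n_states), size=n_states)
initial_probs = np.ones(n_states) / n_states
mu = rng.normal(size=(n_states, n_dimensions))
sigma = np.stack([np.eye(n_dimensions)] * n_states)

# One ancestral draw from the HMM: states, then emissions.
states = np.empty(n_timesteps, dtype=int)
emissions = np.empty((n_timesteps, n_dimensions))
states[0] = rng.choice(n_states, p=initial_probs)
emissions[0] = rng.multivariate_normal(mu[states[0]], sigma[states[0]])
for t in range(1, n_timesteps):
    states[t] = rng.choice(n_states, p=transition_matrix[states[t - 1]])
    emissions[t] = rng.multivariate_normal(mu[states[t]], sigma[states[t]])

assert states.shape == (n_timesteps,)                  # event shape of the state sequence
assert emissions.shape == (n_timesteps, n_dimensions)  # event shape of the observed sequence
```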

Hierarchical Dirichlet Process Hidden Markov Model Transition Matrices

From Matt Johnson's thesis.

Section 3.2.3 says:

The generative process HDP-HMM($\gamma$, $\alpha$, $H$) given concentration parameters $\gamma, \alpha > 0$ and base measure (observation prior) $H$ can be summarized as:

$$\beta \sim GEM(\gamma)$$
$$\pi^{(i)} \sim DP(\alpha, \beta)$$
$$x_t \sim \pi^{(x_{t-1})}$$
$$y_t \sim f(\theta^{(x_t)})$$

...
The HDP plays the role of a prior over infinite transition matrices: each $\pi^{(i)}$ is a DP draw, and is interpreted as the transition distribution from state $i$. The $\pi^{(i)}$ are linked by being DP draws parametrized by the same discrete measure $\beta$, thus $E[\pi^{(i)}] = \beta$ and the transition distributions tend to have their mass concentrated around a typical set of states, providing the desired bias towards re-entering and re-using a consistent set of states.

The key here: make sure each row vector of the transition matrix is generated as a Dirichlet Process draw. By generating every row from DP draws that share the same base measure $\beta$, we get a hierarchical Dirichlet process, in which a parental concentration parameter governs the generation of the state transition probabilities. That concentration parameter can be "concentrated" (i.e. low), expressing the belief that the Markov chain should re-enter and re-use a consistent set of states.
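
A sketch of how I'd simulate this prior over transition matrices in practice, using a finite (weak-limit) truncation with K states; the truncation level and the concentration values below are my own placeholder choices, not prescribed by the thesis excerpt above:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10        # truncation level standing in for the "infinite" state space
gamma = 3.0   # top-level concentration parameter
alpha = 5.0   # per-row concentration parameter

# Finite approximation of beta ~ GEM(gamma).
beta = rng.dirichlet(np.full(K, gamma / K))

# Each transition row pi^(i) ~ DP(alpha, beta), approximated as Dirichlet(alpha * beta).
# Because every row shares the same base measure beta, E[pi^(i)] = beta,
# so the rows concentrate their mass on a shared, typical set of states.
transition_matrix = rng.dirichlet(alpha * beta, size=K)

assert transition_matrix.shape == (K, K)
assert np.allclose(transition_matrix.sum(axis=1), 1.0)
```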

Autoregressive Hidden Markov Model

Gaussian AR-HMMs and their structure in equations.

First, the HMM piece. The state at time $t$ is $s_{t}$, and the transition matrix is $p_{tr}$.

$$s_{t} | s_{t-1} \sim \text{Categorical}(p_{tr}[s_{t-1}])$$

(Just expressing that we slice out the row belonging to state $s_{t-1}$ from the transition matrix.)

Now, we have a conditional distribution: emission $y$ given state $s$.

$$y_t | s_t \sim \text{Normal}(\mu[s_t] + ky_{t-1}, \sigma[s_t])$$

This is the most common form: the mean depends on the previous timestep's emission $y_{t-1}$, scaled by an autoregressive coefficient $k$. We can also make the scale depend on the previous timestep:

$$y_t | s_t \sim \text{Normal}(\mu[s_t] + k y_{t-1},\ \sigma[s_t] \cdot k \sigma_{t-1})$$

There are many other ways to establish the autoregressive dependency; the general form is that the current emission distribution's parameters depend on the previous emission's output values.
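
A minimal simulation sketch of the Gaussian AR-HMM above, assuming one-dimensional emissions and the mean-only autoregressive dependence (the coefficient $k$ and all parameter values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_timesteps = 3, 200

# HMM piece: transition matrix p_tr, one row per current state.
p_tr = rng.dirichlet(np.ones(n_states), size=n_states)

# Emission piece: per-state mean and scale, plus an autoregressive coefficient k.
mu = np.array([-2.0, 0.0, 2.0])
sigma = np.array([0.5, 1.0, 0.3])
k = 0.8

s = np.empty(n_timesteps, dtype=int)
y = np.empty(n_timesteps)
s[0] = rng.choice(n_states)
y[0] = rng.normal(mu[s[0]], sigma[s[0]])
for t in range(1, n_timesteps):
    # s_t | s_{t-1} ~ Categorical(p_tr[s_{t-1}]): slice out the row for the previous state.
    s[t] = rng.choice(n_states, p=p_tr[s[t - 1]])
    # y_t | s_t ~ Normal(mu[s_t] + k * y_{t-1}, sigma[s_t]): the mean depends on the previous emission.
    y[t] = rng.normal(mu[s[t]] + k * y[t - 1], sigma[s[t]])
```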