Bayesian Estimation on Multiple Groups
Problem Type
The Bayesian estimation model is widely applicable across a number of scenarios. The classical scenario is an experimental design with a control vs. a treatment, where we want to know what the difference is between the two. Here, "estimation" refers to estimating the "true" value for the control and the "true" value for the treatment, and the "Bayesian" part refers to computing the uncertainty surrounding those parameters.
Bayesian estimation's advantages over the classical t-test were first described by John Kruschke (2013).
In this notebook, I provide a concise implementation suitable for two-sample and multi-sample inference, with data that don't necessarily fit Gaussian assumptions.
Data structure
To use this model, the data should be structured as follows:
- Each row is one measurement.
- The columns should indicate, at the minimum:
- What treatment group the sample belonged to.
- The measured value.
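For concreteness, here is a tiny hypothetical table in that shape (the values and group names are illustrative, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical long-format layout: one measurement per row,
# with a column indicating the group each row belongs to.
df_example = pd.DataFrame({
    "isolate": ["control", "control", "treatment_A", "treatment_A"],
    "normalized_measurement": [1.02, 0.98, 1.55, 1.47],
})
print(df_example)
```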
Extensions to the model
As of now, the model only samples posterior distributions of the measured values. The model may then be extended to compute differences in means (treatment vs. control) or effect sizes, complete with the uncertainty around them. Use pm.Deterministic(...)
to ensure that those statistics' posterior distributions, i.e. their uncertainty, are also computed.
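As a sketch of the idea, the same derived statistics can also be computed post hoc from posterior draws with plain NumPy; the arrays below are hypothetical stand-ins for sampled posteriors, not output of the model in this notebook:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical posterior draws for a treatment mean, a control mean,
# and a pooled standard deviation.
mu_treatment = rng.normal(1.5, 0.1, size=2000)
mu_control = rng.normal(1.0, 0.1, size=2000)
sigma_pooled = rng.normal(0.5, 0.05, size=2000)

# Difference of means and an effect-size-like quantity; because these are
# computed draw-by-draw, each carries full posterior uncertainty.
difference = mu_treatment - mu_control
effect_size = difference / sigma_pooled
print(difference.mean(), effect_size.mean())
```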
Reporting summarized findings
Here are examples of how to summarize the findings.
Treatment group A was greater than control by x units (95% HPD: [lower, upper]).
Treatment group A was higher than control (effect size 95% HPD: [lower, upper]).
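A minimal way to obtain the [lower, upper] numbers is to compute a highest-posterior-density interval from the posterior draws. ArviZ provides this; a hand-rolled sketch of the idea looks like:

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Narrowest interval containing `prob` of the posterior samples."""
    sorted_samples = np.sort(samples)
    n = len(sorted_samples)
    window = int(np.floor(prob * n))
    # Width of every candidate interval covering `window` samples.
    widths = sorted_samples[window:] - sorted_samples[:n - window]
    start = int(np.argmin(widths))
    return sorted_samples[start], sorted_samples[start + window]

rng = np.random.default_rng(0)
draws = rng.normal(2.0, 1.0, size=10_000)
lower, upper = hpd_interval(draws)
print(f"95% HPD: [{lower:.2f}, {upper:.2f}]")
```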
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pymc4 as pm4
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import janitor as jn
from utils import ecdf
from pyprojroot import here
# Read in the data
df = (
pd.read_csv(here() / "datasets/biofilm.csv")
.label_encode(columns=["isolate"]) # encode isolate as labels.
.transform_column("normalized_measurement", np.log, "log_normalized_measurement")
)
# Display a subset of the data.
df.head()
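If you don't have pyjanitor available, the two chained steps above can be sketched with plain pandas (the DataFrame here is a hypothetical stand-in for the real biofilm dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the biofilm data.
df = pd.DataFrame({
    "isolate": ["ATCC", "1", "1", "2"],
    "normalized_measurement": [1.0, 1.2, 1.1, 0.9],
})

# label_encode: map each isolate name to an integer code.
df["isolate_enc"] = df["isolate"].astype("category").cat.codes

# transform_column: log-transform into a new column.
df["log_normalized_measurement"] = np.log(df["normalized_measurement"])
print(df.head())
```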
Model Specification
We know that the OD600 and measurements columns are all positive-valued, and so the normalized_measurement column will also be positive-valued.
There are two ways to handle this situation:
- We can directly model the likelihood using a bounded, positive-support-only distribution, or
- We can model the log transformation of the normalized_measurement column using an unbounded, infinite-support distribution (e.g. the Student's T family, which includes the Gaussian and the Cauchy).
The former is ever so slightly more convenient to reason about, but the latter lets us use Gaussians, which have some nice properties when sampling.
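To see why the second option respects the positivity constraint, note that modelling log(y) with an unbounded distribution implies strictly positive values on the original scale. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
# Unbounded Gaussian draws on the log scale...
log_draws = rng.normal(loc=0.0, scale=2.0, size=10_000)
# ...exponentiating maps them back to strictly positive values
# on the original measurement scale.
draws = np.exp(log_draws)
print(draws.min())
```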
import tensorflow as tf
import numpy as np
num_isolates = len(set(df["isolate_enc"]))
@pm4.model
def bacteria_model():
mu_mean = yield pm4.Normal("mu_mean", loc=0, scale=1)
mu = yield pm4.Normal("mu", loc=mu_mean, scale=1, batch_stack=num_isolates)
mu_bounded = yield pm4.Deterministic("mu_bounded", tf.exp(mu))
# Because we use TFP, tf.gather now replaces the old numpy syntax.
# the following line is equivalent to:
# mu_all = mu[df["isolate_enc"]]
mu_all = tf.gather(mu, df["isolate_enc"])
sigma = yield pm4.HalfCauchy("sd", scale=1, batch_stack=num_isolates)
sigma_all = tf.gather(sigma, df["isolate_enc"])
nu = yield pm4.Exponential("nu", rate=1/30.)
like = yield pm4.StudentT("like", loc=mu_all, scale=sigma_all, df=nu, observed=df["log_normalized_measurement"])
# Take the difference against the ATCC strain, which is the control.
difference = yield pm4.Deterministic("difference", mu_bounded[:-1] - mu_bounded[-1])
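The tf.gather call in the model mirrors NumPy fancy indexing: it broadcasts one per-group parameter out to one value per measurement row. The same operation in NumPy (with hypothetical numbers):

```python
import numpy as np

# Per-group parameter values: one entry per isolate.
mu = np.array([0.5, 1.0, 1.5])
# Group encoding for each measurement row.
isolate_enc = np.array([0, 0, 2, 1, 2])

# Equivalent of tf.gather(mu, isolate_enc): one mu per row.
mu_all = mu[isolate_enc]
print(mu_all)
```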
Inference Button!
Now, we hit the Inference Button(tm) and sample from the posterior distribution.
trace = pm4.sample(bacteria_model())
Diagnostics
Our first diagnostic will be the trace plot. We expect the trace of the variables that we are most interested in to be a fuzzy caterpillar.
import arviz as az
axes = az.plot_trace(trace, var_names=["bacteria_model/mu"], compact=True)
Looking at the traces, yes, everything looks more or less like a fuzzy caterpillar. This means that sampling went well and has converged, so we have a good MCMC estimate of the posterior distribution.
I need a mapping from each isolate's encoding back to its name - it will come in handy below.
mapping = dict(zip(df["isolate_enc"], df["isolate"]))
yticklabels = list(reversed([mapping[i] for i in range(len(mapping))]))
Let's now plot the posterior distributions. We'll use a ridge plot, as it's both aesthetically pleasing and informative.
fig, ax = plt.subplots(figsize=(8, 6))
axes = az.plot_forest(trace, var_names=["bacteria_model/mu_bounded"], ax=ax, kind="ridgeplot")
axes[0].set_yticklabels(yticklabels);
On the basis of this, we would say that strain 5 was the most different from the other strains.
Let's now look at the differences directly.
fig, ax = plt.subplots(figsize=(8, 6))
axes = az.plot_forest(trace, var_names=["bacteria_model/difference"], ax=ax, kind="ridgeplot")
axes[0].axvline(0, color="black")
axes[0].set_yticklabels(yticklabels[1:]);
If we were in a binary decision-making mode, we would say that isolate 5 was the most "significantly" different from the ATCC strain.
trace_with_posterior = pm4.sample_posterior_predictive(bacteria_model(), trace=trace)
trace_with_posterior.posterior_predictive
# We want indices for each of the samples.
indices = dict()
for enc, iso in mapping.items():
idxs = list(df[df["isolate_enc"] == enc].index)
indices[iso] = idxs
indices
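The same lookup table can be built more idiomatically with pandas groupby, which maps each group key to the row index labels of its members. A sketch, using a hypothetical stand-in DataFrame with the same columns:

```python
import pandas as pd

df = pd.DataFrame({
    "isolate": ["ATCC", "1", "1", "2", "ATCC"],
    "isolate_enc": [0, 1, 1, 2, 0],
})

# groupby(...).groups maps each isolate name to its row index labels.
indices = {iso: list(idxs) for iso, idxs in df.groupby("isolate").groups.items()}
print(indices)
```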
# Make PPC plot for one of the groups.
fig = plt.figure(figsize=(16, 16))
gs = GridSpec(nrows=4, ncols=4)
axes = dict()
for i, (strain, idxs) in enumerate(indices.items()):
if i > 0:
ax = fig.add_subplot(gs[i], sharex=axes[0])
else:
ax = fig.add_subplot(gs[i])
x, y = ecdf(df.iloc[idxs]["log_normalized_measurement"])
ax.plot(x, y, label="data")
x, y = ecdf(
trace_with_posterior
.posterior_predictive["bacteria_model/like"]
.loc[:, :, idxs]
.mean(axis=(2))
.data
.flatten()
)
ax.plot(x, y, label="ppc")
ax.set_title(f"Strain {strain}")
axes[i] = ax
The PPC draws clearly have longer tails than the originals do. I chalk this up to the small number of samples. The central tendency is definitely modelled well, and I don't see wild deviations between the sampled posterior and the measured data.
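The ecdf helper imported from utils is not shown in this notebook; a common implementation, which I assume is what it does (sorted values against the cumulative fraction of points), is:

```python
import numpy as np

def ecdf(data):
    """Empirical CDF: returns sorted values and the cumulative
    fraction of points at or below each value."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf([3.0, 1.0, 2.0])
print(x, y)
```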