Introduction
Let's say there are three bacteria species that characterize the gut, and we hypothesize that they are ever so shifted off from one another, but we don't know how (i.e. ignore the data-generating distribution below). Can we figure out the proportion parameters and their uncertainty?
Generate Synthetic Data
In the synthetic dataset generated below, we pretend that every patient is one sample, and we are recording the number of sequencing reads corresponding to some OTUs (bacteria). Each row is one sample (patient), and each column is one OTU (sample).
Proportions
Firstly, let's generate the ground truth proportions that we will infer later on.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm
import numpy.random as npr
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
def proportion(arr):
arr = np.asarray(arr)
return arr / arr.sum()
healthy_proportions = proportion([10, 16, 2])
healthy_proportions
sick_proportions = proportion([10, 27, 15])
sick_proportions
Data
Now, given the proportions, let's generate data. Here, we are assuming that there are 10 patients per cohort (10 sick patients and 10 healthy patients), and that the number of counts in total is 50.
n_data_points = 10
def make_healthy_multinomial(arr):
n_sequencing_reads = 50 # npr.poisson(lam=50)
return npr.multinomial(n_sequencing_reads, healthy_proportions)
def make_sick_multinomial(arr):
n_sequencing_reads = 50 # npr.poisson(lam=50)
return npr.multinomial(n_sequencing_reads, sick_proportions)
# Generate healthy data
healthy_reads = np.zeros((n_data_points, 3))
healthy_reads = np.apply_along_axis(make_healthy_multinomial, axis=1, arr=healthy_reads)
# Generate sick reads
sick_reads = np.zeros((n_data_points, 3))
sick_reads = np.apply_along_axis(make_sick_multinomial, axis=1, arr=sick_reads)
# Make pandas dataframe
healthy_df = pd.DataFrame(healthy_reads)
healthy_df.columns = ["bacteria1", "bacteria2", "bacteria3"]
healthy_df = pm.floatX(healthy_df)
sick_df = pd.DataFrame(sick_reads)
sick_df.columns = ["bacteria1", "bacteria2", "bacteria3"]
sick_df = pm.floatX(sick_df)
healthy_df
sick_df
Model Construction
Here's an implementation of the model - Dirichlet prior with Multinomial likelihood.
There are 3 classes of bacteria, so the Dirichlet distribution serves as the prior probability mass over each of the classes in the multinomial distribution.
The multinomial distribution serves as the likelihood function.
with pm.Model() as dirichlet_model:
proportions_healthy = pm.Dirichlet(
"proportions_healthy",
a=np.array([1.0] * 3).astype("float32"),
shape=(3,),
testval=[0.1, 0.1, 0.1],
)
proportions_sick = pm.Dirichlet(
"proportions_sick",
a=np.array([1.0] * 3).astype("float32"),
shape=(3,),
testval=[0.1, 0.1, 0.1],
)
healthy_like = pm.Multinomial(
"like_healthy", n=50, p=proportions_healthy, observed=healthy_df.values
)
sick_like = pm.Multinomial(
"like_sick", n=50, p=proportions_sick, observed=sick_df.values
)
Sampling
import arviz as az
with dirichlet_model:
dirichlet_trace = pm.sample(2000)
az.plot_trace(dirichlet_trace)
Results
ylabels=[
"healthy_bacteria1", "healthy_bacteria2", "healthy_bacteria3",
"sick_bacteria1", "sick_bacteria2", "sick_bacteria3",
]
axes = az.plot_forest(dirichlet_trace)
healthy_proportions, sick_proportions
They match up with the original synthetic percentages!