This notebook shows how to do hierarchical modelling with Binomially-distributed random variables.
Baseball players have many metrics measured for them. Let's say we are on a baseball team, and would like to quantify player performance, one metric being their batting average (defined by how many times a batter hit a pitched ball, divided by the number of times they were up for batting ("at bat")). How would you go about this task?
We first need some measurements of batting data. Specifically, we need the number of times a player has batted and the number of times the player has hit the ball while batting. Let's see an example dataset below.
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
import theano.tensor as tt
from pyprojroot import here
df = pd.read_csv(here() / "datasets/baseballdb/core/Batting.csv")
df["AB"] = df["AB"].replace(0, np.nan)
df = df.dropna()
df["batting_avg"] = df["H"] / df["AB"]
df = df[df["yearID"] >= 2016]
# Select only the first 15 players, for illustration purposes.
df = df.iloc[0:15]
df.head(5)
5 rows × 23 columns
In this dataset, the columns AB and H are the most relevant.
AB is the number of times a player was At Bat.
H is the number of times a player hit the ball while batting.
The performance of a player can be defined by their batting percentage - essentially the number of hits divided by the number of times at bat. (Technically, a percentage should run from 0-100, but American sportspeople are apparently not very strict with how they approach these definitions.)
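As a quick check of this definition, the batting average can be computed directly with pandas. The counts below are made-up numbers for illustration, not real player data:

```python
import pandas as pd

# Hypothetical hit (H) and at-bat (AB) counts, for illustration only.
toy = pd.DataFrame({"H": [50, 30, 0], "AB": [200, 100, 10]})
toy["batting_avg"] = toy["H"] / toy["AB"]
print(toy["batting_avg"].tolist())  # [0.25, 0.3, 0.0]
```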
Model 1: Naive Model
One model that we can write is a model that assumes that each player has a batting percentage that is independent of the other players in the dataset.
A pictorial view of the model is as follows:
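Before turning to MCMC, note that because the Beta prior is conjugate to the Binomial likelihood, each player's posterior under this independent model has a closed form: Beta(alpha + H, beta + AB - H). A sketch using scipy, with hypothetical hit and at-bat counts rather than values from the dataset:

```python
from scipy import stats

alpha, beta = 0.5, 0.5  # same Beta(0.5, 0.5) prior used in the model
H, AB = 30, 100         # hypothetical hits and at-bats for one player

# Conjugate update: posterior is Beta(alpha + H, beta + AB - H).
posterior = stats.beta(alpha + H, beta + AB - H)
print(posterior.mean())  # (alpha + H) / (alpha + beta + AB) ≈ 0.302
```

Sampling with PyMC3 recovers the same posterior, and generalizes to hierarchical models where no closed form exists.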
Let's implement this model in PyMC3.
with pm.Model() as baseline_model:
    thetas = pm.Beta("thetas", alpha=0.5, beta=0.5, shape=len(df))
    like = pm.Binomial("likelihood", n=df["AB"], p=thetas, observed=df["H"])
with baseline_model:
    baseline_trace = pm.sample(2000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [thetas]
Sampling 4 chains, 16 divergences: 100%|██████████| 10000/10000 [00:03<00:00, 2887.97draws/s]
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 8 divergences after tuning. Increase `target_accept` or reparameterize.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
Let's view the posterior distribution traces.
import arviz as az

traceplot = az.plot_trace(baseline_trace)
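The trace plot gives a visual check; for numerical diagnostics (posterior means, r_hat), `az.summary` can be applied to the trace. Since re-running the sampler here is not practical, the sketch below substitutes synthetic Beta draws for `baseline_trace` (the array shapes and parameter values are made up, for illustration only):

```python
import arviz as az
import numpy as np

# Synthetic draws standing in for `baseline_trace`: 4 chains x 500 draws
# of thetas for 3 players (made-up values, for illustration only).
rng = np.random.default_rng(42)
fake_posterior = {"thetas": rng.beta(2.0, 5.0, size=(4, 500, 3))}
idata = az.from_dict(posterior=fake_posterior)

summary = az.summary(idata)
print(summary[["mean", "r_hat"]])
```

In the real notebook, `az.summary(baseline_trace)` would report one row per player's theta.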