Introduction
This notebook shows how to do hierarchical modelling with Binomially-distributed random variables.
Problem Setup
Baseball players have many metrics measured for them. Let's say we are on a baseball team and would like to quantify player performance. One such metric is the batting average, defined as the number of times a batter hit a pitched ball, divided by the number of times they were up for batting ("at bat"). How would you go about this task?
To answer this question, we first need batting data: the number of times each player has batted, and the number of times they hit the ball while batting. Let's look at an example dataset below.
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
import theano.tensor as tt
from pyprojroot import here
df = pd.read_csv(here() / "datasets/baseballdb/core/Batting.csv")
df["AB"] = df["AB"].replace(0, np.nan)
df = df.dropna()
df["batting_avg"] = df["H"] / df["AB"]
df = df[df["yearID"] >= 2016]
df = df.iloc[
0:15
] # select out only the first 15 players, just for illustration purposes.
df.head(5)
In this dataset, the columns AB and H are the most relevant.
AB is the number of times a player was At Bat.
H is the number of times a player hit the ball while batting.
The performance of a player can be defined by their batting percentage - essentially the number of hits divided by the number of times at bat. (Technically, a percentage should run from 0-100, but American sportspeople are apparently not very strict with how they approach these definitions.)
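Written as a formula, for player $i$ this is simply

$$\text{batting avg}_i = \frac{H_i}{AB_i},$$

which is exactly the batting_avg column computed above.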
Model 1: Naive Model
One model that we can write is a model that assumes that each player has a batting percentage that is independent of the other players in the dataset.
A pictorial view of the model is as follows:
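In symbols (mirroring the PyMC3 code below), each player $i$ gets an independent success probability $\theta_i$ with a Beta prior, and the observed number of hits is Binomial in the number of at-bats:

$$\theta_i \sim \mathrm{Beta}(\alpha=0.5,\ \beta=0.5)$$

$$H_i \sim \mathrm{Binomial}(n=AB_i,\ p=\theta_i)$$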
Let's implement this model in PyMC3.
with pm.Model() as baseline_model:
    # One success probability (theta) per player, each with a Beta(0.5, 0.5) prior.
    thetas = pm.Beta("thetas", alpha=0.5, beta=0.5, shape=(len(df)))
    # Binomial likelihood: each player's hits (H) out of their at-bats (AB).
    like = pm.Binomial("likelihood", n=df["AB"], p=thetas, observed=df["H"])
with baseline_model:
    baseline_trace = pm.sample(2000)
Let's view the posterior distribution traces.
import arviz as az
traceplot = az.plot_trace(baseline_trace)
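As an optional follow-up, a tabular summary of each player's posterior can also be pulled out with ArviZ; a minimal sketch:

# Posterior means, credible intervals, and sampling diagnostics for each theta.
az.summary(baseline_trace, var_names=["thetas"])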