Bayesian Data Science by Simulation (PyCon 2021)


Title: Bayesian Data Science by Probabilistic Programming

Description

This tutorial introduces participants to using a probabilistic programming language, PyMC3, to perform a variety of statistical inference tasks. Through hands-on instruction on real-world examples (simplified for pedagogical purposes), we will show you how to do parameter estimation and inference, with a specific focus on building towards generalized Bayesian A/B/C/D/E… testing, a.k.a. multi-group experimental comparison, hierarchical modelling, and arbitrary curve regression.

Audience
This tutorial is intended for Pythonistas who are interested in using a probabilistic programming language to learn how to do flexible Bayesian data analysis without needing to know the fancy math behind it. 

Tutorial participants should come equipped with working knowledge of numpy, probability distributions and where they are commonly used, and frequentist statistics. By the end of the tutorial, participants will have code that they can use and modify for their own problems. 

Participants will also have at least one round of practice with the Bayesian modelling loop, starting from model (re-)formulation and ending in model checking. More generally, by the end of both sessions, participants should be equipped with the ability to think through and describe a problem using arbitrary (but suitable) statistical distributions and link functions.

Format
This is a hands-on tutorial, with a series of exercises for each topic covered. Roughly 60% of the time will be spent on exercises, and 40% on lecture-style material and discussion.

Internet access is not strictly necessary, but can be useful (to access Binder, with its text editor and terminal emulator) if you encounter difficulties setting up locally.

The general educational strategy here is to study one model class for an extended period (the Beta-Binomial model), but use it to gradually introduce more advanced concepts, such as vectorization in PyMC3/Theano and hierarchical modelling. Repeating the workflow will also reinforce good practices. At the end, a lecture-style delivery will introduce Bayesian regression modelling.

Outline

The timings below indicate when each section begins, relative to the start of the tutorial (minute 0).

0th min: Introduction

In this section, we will cover some basic topics that are useful for the tutorial:

  • Probability as “assigning credibility over values”.
  • Probability distributions:
    • key parameters and shapes (i.e. the likelihood functions they define), and
    • the processes they model.

A combination of simulation and lectures will be used in this section.
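As a taste of the simulation-first approach, the "credibility over values" idea can be sketched in a few lines of plain numpy/scipy (a standalone illustration, not tutorial code; the counts are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 10,000 experiments of 20 fair-coin flips each.
n_flips, p_heads = 20, 0.5
heads = rng.binomial(n=n_flips, p=p_heads, size=10_000)

# Empirical frequency of each outcome vs. the analytic Binomial pmf:
# probability is "credibility assigned over possible values".
empirical = np.bincount(heads, minlength=n_flips + 1) / len(heads)
analytic = stats.binom.pmf(np.arange(n_flips + 1), n_flips, p_heads)

# The simulated and analytic pictures agree closely.
print(np.abs(empirical - analytic).max())
```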

15th min: Warm-Up With The Coin Flip

This section gets participants warmed up and familiar with PyMC3 and Theano syntax. We will use the classic coin flip to build intuition for the Beta-Binomial model, a simple model with broad applications. Along the way, we will learn the basics of PyMC3, including:

  • How to structure a Bayesian model in PyMC3: priors and likelihood.
  • How to sample from the posterior: the “Inference Button”.
  • How to check a model for correctness: posterior predictive checks.
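As a standalone preview of what this section builds towards: the Beta prior is conjugate to the Binomial likelihood, so the posterior that PyMC3's sampler recovers by MCMC can also be written in closed form and checked by simulation. A minimal sketch, with hypothetical counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior: Beta(2, 2) over p, the probability of heads.
a_prior, b_prior = 2.0, 2.0

# Data: 27 heads out of 50 flips (hypothetical numbers).
n, heads = 50, 27

# Conjugate update -- the same posterior PyMC3's sampler would find.
a_post, b_post = a_prior + heads, b_prior + n - heads
posterior_mean = a_post / (a_post + b_post)

# Posterior predictive check: simulate replicated datasets and see
# whether the observed head count is typical of them.
p_draws = rng.beta(a_post, b_post, size=5_000)
replicated_heads = rng.binomial(n=n, p=p_draws)

print(round(posterior_mean, 3))  # 0.537
```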

40th min: Break for 5 min.

45th min: Extending the coin flip to two groups

This section extends the Beta-Binomial model to two groups. Here, we will compare the results of A/B testing of e-commerce site design, and use it to introduce the use of Bayesian estimation to provide richer information than a simple t-test.

We will begin by implementing the model the “manual” way, in which there is explicit duplication of code. Then, we will see how to vectorize this model. Many visuals will be provided to help participants understand what is going on.

Through this example, we will learn to use vectorization to express what we would otherwise write in a for-loop.

More importantly, we will engage in a comparison with how we might do this in a frequentist setting, discover that the modelling assumptions of the t-test do not fit the problem setting, and conclude that a flexible modelling language is necessary.
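A plain-numpy sketch of the idea, with hypothetical conversion counts: because the Beta-Binomial posterior is available in closed form, the two-group comparison can be simulated directly, and the full posterior over the difference carries richer information than a single t-test p-value:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical A/B test data: visitors and conversions per design.
visitors = np.array([1000, 1000])
conversions = np.array([118, 145])

# One vectorized Beta(1, 1) prior update covers both groups -- the
# same shape trick used when a PyMC3 model is written without loops.
a_post = 1 + conversions
b_post = 1 + visitors - conversions
p_draws = rng.beta(a_post, b_post, size=(10_000, 2))

# Richer output than a t-test: a posterior over the difference.
diff = p_draws[:, 1] - p_draws[:, 0]
print(f"P(B > A) = {np.mean(diff > 0):.3f}")
```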

1 hr 30th min: Break for 10 min.

1 hr 40th min: Hierarchical Beta-Bernoulli

In this section, we will extend our use of the Beta-Binomial model to a multi-group setting. Here, we will use hockey goalies’ save percentages as an example. A particular quirk of this dataset is that some players have very few data points.

We will first implement the model using lessons learned from the two-group case (i.e. how to vectorize our model), but soon after fitting the model and critiquing it, participants should discover a qualitative issue: implausibly wide posterior distributions over the measured ability of players with little data. (Should be 15 minutes or so to reach here.)

We will then introduce the idea of a hierarchical model, and use a code-along format to show how to build up the model, mainly by working backwards from the likelihood up to the parent priors involved. (Should be 15 minutes to reach here.)

Following that, we will look at a comparison between the posterior distributions for the non-hierarchical and hierarchical model. (10 minutes to finish this).
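The shrinkage effect that this comparison reveals can be previewed with a small sketch (hypothetical goalie counts, and a hand-picked parent prior standing in for the one a hierarchical model would learn from the data):

```python
import numpy as np

# Hypothetical goalies: (shots faced, saves). The last goalie has
# almost no data.
shots = np.array([1800, 1500, 40])
saves = np.array([1650, 1370, 39])

# No pooling: each goalie's save percentage estimated independently.
raw_pct = saves / shots  # [0.917 0.913 0.975] after rounding

# Partial pooling, the effect a hierarchical model produces: each
# estimate is shrunk toward a shared Beta(a, b) parent prior, and
# less data means more shrinkage. The prior here is hand-picked for
# illustration (mean 0.9); the hierarchical model would learn it.
a, b = 90.0, 10.0
pooled_pct = (a + saves) / (a + b + shots)

# The 40-shot goalie's 0.975 is pulled strongly toward 0.9; the
# goalies with thousands of shots barely move.
print(np.round(raw_pct, 3))
print(np.round(pooled_pct, 3))
```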

By the end of this section, participants should have a fairly complete view of the Bayesian modelling workflow, and should have a well-grounded anchoring example of the general way to model the data.

2 hr 30th minute: Break for 10 minutes

2 hr 40th minute: Bayesian Regression Modelling

This section is more of a lecture than a hands-on section, though there will be pre-written code for participants to read and execute at their own pace if they prefer.

In this section, we will take the ideas learned before - that there are parameters in a model that are directly used in the likelihood distribution - and extend that idea to regression modelling, where model parameters are linked to the likelihood function parameters by an equation. In doing so, we will introduce the idea of a “link function”, and show how this is the general idea behind arbitrary curve regression.

There will be two examples: one for logistic regression and one for exponential decay. We intentionally avoid linear regression because the point is to show that any kind of “link function” is possible (including neural networks!).
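A minimal sketch of the link-function idea, using hypothetical parameters: a linear predictor is squashed through the inverse-logit link to produce the p parameter of a Bernoulli likelihood, and simulated data follow:

```python
import numpy as np

rng = np.random.default_rng(7)

def inv_logit(x):
    # The link: maps any real number into (0, 1), so a linear
    # predictor can parameterize a Bernoulli likelihood.
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical "true" parameters: p = inv_logit(intercept + slope*x).
intercept, slope = -1.0, 2.0
x = rng.uniform(-3, 3, size=2_000)
p = inv_logit(intercept + slope * x)
y = rng.binomial(n=1, p=p)  # Bernoulli outcomes

# Any monotone squashing curve could play the same role -- swap
# inv_logit for another function (even a neural network) and the
# same prior + likelihood machinery applies.
print(inv_logit(0.0))  # 0.5
```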

3 hr 10th minute: Conclusion

We will summarize by giving participants the following framework:

  • Model = Parameters + Priors + Data + Structure (Equations) + Likelihood
  • Model + Sampler -> Posterior
  • Bayesian Estimation -> Hierarchical Bayesian Estimation
  • Single Group -> Two Group Comparison -> Multi-Group Comparison
  • Direct Estimation vs. Arbitrary Link Functions

3 hr 20th minute: End