written by Eric J. Ma on 2018-02-07 | tags: bayesian statistics data science
Further thoughts on whether Bayesian models overfit.
This topic recently came up again on the PyMC3 Discourse, where I had an opportunity to further clarify what I was thinking when I first made the train/test split comment at PyData NYC.
After mulling it over for a while, I can now explain my thinking more clearly to a lay audience, so I thought I'd reiterate it here.
The uncertainty that a train/test split addresses and the uncertainty that Bayesian inference quantifies are different things. To be clear: where we are reasonably confident in the model specification, Bayesian inference is about quantifying the uncertainty in the parameter values. Under this paradigm, feeding the model more data gives us narrower posterior distributions, and feeding it less data gives us wider posterior distributions. If we split the data, we're simply fitting the model on fewer data points; if we don't, we're fitting it on more.
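To make that concrete, here's a minimal sketch in PyMC3 (a toy model of my own, not code from the Discourse thread; the names `fit_mean`, `mu`, and `obs` are just illustrative): we infer the mean of some simulated data twice, once on the full dataset and once on a smaller "training split", and compare how wide the posteriors are.

```python
import numpy as np
import pymc3 as pm

# Simulate 1,000 observations from a Normal(3, 1) distribution.
np.random.seed(42)
data = np.random.normal(loc=3.0, scale=1.0, size=1000)


def fit_mean(observations):
    """Infer the mean of the observations with a simple Normal model."""
    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sd=10.0)  # weakly informative prior
        pm.Normal("obs", mu=mu, sd=1.0, observed=observations)
        trace = pm.sample(2000, tune=1000)
    return trace


trace_full = fit_mean(data)         # all 1,000 points
trace_split = fit_mean(data[:200])  # a smaller "training split"

# More data -> narrower posterior on mu; less data -> wider posterior.
print("posterior sd, full data :", trace_full["mu"].std())
print("posterior sd, 200 points:", trace_split["mu"].std())
```

Under this model, the posterior standard deviation on `mu` shrinks roughly like 1/sqrt(n), so the full-data posterior should come out about sqrt(5) ≈ 2.2 times narrower than the 200-point one. Splitting the data doesn't break anything; it just means doing the same inference with less of it.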
@article{ericmjl-2018-bayesian-sets,
    author = {Eric J. Ma},
    title = {Bayesian Inference \& Testing Sets},
    year = {2018},
    month = {02},
    day = {07},
    howpublished = {\url{https://ericmjl.github.io}},
    journal = {Eric J. Ma's Blog},
    url = {https://ericmjl.github.io/blog/2018/2/7/bayesian-inference-and-testing-sets},
}
I send out a newsletter with tips and tools for data scientists. Come check it out at Substack.
If you would like to sponsor the coffee that goes into making my posts, please consider GitHub Sponsors!
Finally, I do free 30-minute GenAI strategy calls for teams that are looking to leverage GenAI for maximum impact. Consider booking a call on Calendly if you're interested!