
Model Baselines Are Important

written by Eric J. Ma on 2018-05-06 | tags: machine learning data science deep learning automl


For any problem that we think is machine learnable, having a sane baseline is really important. It is even more important to establish one early.

Today at ODSC, I had a chance to meet both Andreas Mueller and Randy Olson. Andreas leads scikit-learn development, and Randy was the lead developer of TPOT, an AutoML tool. I told each of them a variation of the following story:

I had spent about 1.5 months building and testing a graph convolutional neural network model to predict RNA cleavage by an enzyme. I was struggling with a generalization problem: this model class would never generalize beyond the training samples for the problem at hand, even though I had seen the same model class perform admirably on small molecules and proteins.

An engineer at NIBR and I brainstormed a baseline with some simple features and threw a random forest model at it. Three minutes of implementation later, we had a model that generalized and outperformed my implementation of graph CNNs. Three days later, we had an AutoML (TPOT) model that beat the random forest. After further discussion, we realized that the work was sufficiently publishable even without the fancy graph CNNs.
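For the curious, here is a minimal sketch of that baseline-first workflow. The data and features below are hypothetical placeholders (our real features came from the RNA problem), but the scikit-learn and TPOT calls are the actual library APIs:

```python
# Minimal sketch of a baseline-first workflow, assuming hypothetical data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

# Hypothetical stand-in for simple, hand-engineered features (X)
# and measured target values (y); substitute your own data here.
rng = np.random.default_rng(42)
X, y = rng.random((200, 8)), rng.random(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: a random forest on the simple features.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("random forest R^2:", r2_score(y_test, rf.predict(X_test)))

# AutoML: let TPOT search for a pipeline that beats the baseline.
automl = TPOTRegressor(generations=5, population_size=20, random_state=42)
automl.fit(X_train, y_train)
print("TPOT R^2:", automl.score(X_test, y_test))
automl.export("tpot_pipeline.py")  # writes the winning pipeline out as code
```

The point isn't the specific models; it's that a few minutes of random forest gives you a yardstick that an expensive model has to beat before it earns its complexity.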

I think there’s a lesson in establishing baselines and MVPs early on!


I send out a newsletter with tips and tools for data scientists. Come check it out at Substack.

If you would like to sponsor the coffee that goes into making my posts, please consider GitHub Sponsors!

Finally, I do free 30-minute GenAI strategy calls for organizations who are seeking guidance on how to best leverage this technology. Consider booking a call on Calendly if you're interested!