Data Science Programming October 2020 Newsletter

Hello fellow datanistas!

Welcome to the October edition of the programming-oriented data science newsletter. As the weather cools down, I hope you are all staying warm indoors, and safe both indoors and outdoors!

This edition of the Data Science Programming Newsletter has a particular focus on machine learning engineering, a discipline that is evolving out of the old "data science" umbrella into a field of its own.

Effective testing for machine learning systems

Recently at work, I've been building some bespoke machine learning models (autoregressive hidden Markov models and graph neural networks) for scientific problems that we encounter. In building those bespoke models, because we aren't using standard reference libraries, we have to build the model code from scratch. Since it's software, it needs tests, and Jeremy Jordan has a great blog post on how to effectively test ML systems. Definitely worth a read!
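To make "it needs tests" a little more concrete, here is a minimal sketch of what testing a small piece of hand-written model code might look like. It is not taken from Jeremy Jordan's post or from my work code: `normalize_rows` is a hypothetical helper that turns raw transition weights into the row-stochastic matrix a Markov model would need, and the tests check properties that should always hold.

```python
import math


def normalize_rows(matrix):
    """Scale each row of non-negative weights so that it sums to 1.

    Hypothetical helper, for illustration only: it turns raw transition
    weights into a row-stochastic transition matrix.
    """
    normalized = []
    for row in matrix:
        if any(value < 0 for value in row):
            raise ValueError("transition weights must be non-negative")
        total = sum(row)
        if total == 0:
            raise ValueError("each row needs at least one positive weight")
        normalized.append([value / total for value in row])
    return normalized


def test_rows_sum_to_one():
    # Property that must hold for any valid input: every row is a distribution.
    result = normalize_rows([[2.0, 2.0], [1.0, 3.0]])
    for row in result:
        assert math.isclose(sum(row), 1.0)


def test_known_values():
    # A small case that is easy to check by hand.
    assert normalize_rows([[1.0, 3.0]]) == [[0.25, 0.75]]


def test_rejects_all_zero_row():
    # An all-zero row cannot be normalized, so the helper should refuse it.
    try:
        normalize_rows([[0.0, 0.0]])
    except ValueError:
        return
    raise AssertionError("expected ValueError for an all-zero row")
```

The test functions follow the `test_*` naming that pytest collects by default, so a file like this could be run with pytest as-is, assuming pytest is the test runner in use.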

Software engineering fundamentals for Data Scientists

In his Medium article, Gonzalo Ferreiro Volpi shares some fundamental software skills for data scientists. For those of you who want to invest in levelling up your code-writing skills to reap multiplicative dividends in time saved, frustrations avoided, and happiness multiplied, come check it out!

Reflecting on a year of making machine learning actually useful

In her blog post, Shreya Shankar shares some extremely valuable insights into the practice of making ML useful in the real world, and I absolutely agree with them. One quote in particular stands out:

Outside of ML classes and research, I learned that often the most reliable way to get performance improvements is to find another piece of data which gives insight into a completely new aspect of the problem, rather than to add a tweak to the loss. Whenever model performance is bad, we (scientists and practitioners) shouldn’t only resort to investigating model architecture and parameters. We should also be thinking about “culprits” of bad performance in the data.

It reminds me of the power of finding "the invariants" of a problem. I hope that little teaser gives you enough impetus to read it!

The Multiplicative Power of Masks

This article is both topical and relevant, and I appreciated its illustrations! It's a blog post that highlights a really powerful model -- where powerful doesn't mean millions of parameters, but rather conceptually simple, easy to communicate, broadly applicable, and intensely relevant for the times. Aatish Bhatia has done a tremendous job with this explanation. It's a technical masterpiece.
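For a feel of how small such a model can be, here is a toy sketch of a multiplicative mask model. The formula and the numbers are my own assumptions for illustration rather than something lifted from the article: I assume a mask blocks the same share of virus going out as coming in, that the two effects are independent, and that mask wearers are spread evenly through the population. `transmission_factor` is a hypothetical name.

```python
def transmission_factor(mask_fraction, mask_effectiveness):
    """Relative transmission when a share of the population wears masks.

    Assumptions (illustrative only): a mask filters the same fraction of
    virus on the way out (infected wearer) and on the way in (susceptible
    wearer), and the two effects are independent, so they multiply.
    """
    for name, value in (
        ("mask_fraction", mask_fraction),
        ("mask_effectiveness", mask_effectiveness),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Average reduction on one side of an encounter (exhaling or inhaling).
    one_sided = 1.0 - mask_fraction * mask_effectiveness
    # Both sides of the encounter benefit, hence the square.
    return one_sided * one_sided


# Half the population wearing 50%-effective masks.
print(transmission_factor(0.5, 0.5))  # 0.5625
```

Under these assumptions, half the population wearing half-effective masks cuts transmission to about 56% of its unmasked level, a bigger drop than the 25% you would get if masks only helped in one direction. That squared term is the "multiplicative" part.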

From my collection

Some colleagues had questions about environment variables, so I decided to resurface an old post on the topic in my essays collection and spruce it up with more information.
I moved data across work sites securely, and as fast as commercial tools, using nothing but free and open source tooling. Come read how!
I also recently figured out how to directly open a Jupyter notebook in a Binder session. The hack is really cool!
Finally, some more Twitter humour from the ever-on-fire Kareem Carr :).