2020 10-October

Content to feature:

Recently at work, I've been building some bespoke machine learning models (autoregressive hidden Markov models and graph neural networks) for scientific problems that we encounter. Because we aren't using standard reference libraries, we have to write the model code from scratch. Since it's software, it needs tests, and Jeremy Jordan has a great blog post on how to effectively test ML systems. Definitely worth a read, in my opinion.

In his Medium article, Gonzalo Ferreiro Volpi shares some fundamental software skills for data scientists. If you want to invest in levelling up your code-writing skills and reap multiplicative dividends in time saved, frustrations avoided, and happiness, come check it out.

In her blog post, Shreya Shankar shares some extremely valuable insights into the practice of making ML useful in the real world, which I absolutely agree with. One in particular is captured in this quote:

Outside of ML classes and research, I learned that often the most reliable way to get performance improvements is to find another piece of data which gives insight into a completely new aspect of the problem, rather than to add a tweak to the loss. Whenever model performance is bad, we (scientists and practitioners) shouldn’t only resort to investigating model architecture and parameters. We should also be thinking about “culprits” of bad performance in the data.

I hope that little teaser gives you enough impetus to read it. :)

This article is topical and relevant, and I appreciated its illustrations. It's also a blog post that highlights a really powerful model, where powerful doesn't mean millions of parameters, but rather conceptually simple, easy to communicate, broadly applicable, and intensely relevant for the times. Aatish Bhatia has done a tremendous job with this explanation. It's a technical masterpiece.

  • From my collection:
    • Some colleagues had questions about environment variables, so I decided to resurface an old post on the topic and spruce it up with more information in my essays collection.
    • I moved data across work sites securely, and as fast as commercial tools, using nothing but free and open source tooling. Come read how.
    • I also recently figured out how to directly open a Jupyter notebook in a Binder session. The hack is super cool.

Finally, some more humour from the ever-on-fire Kareem Carr :).

Data Science Programming Newsletter MOC

With the Data Science Programming newsletter, I'm trying to share ideas on how to make

Key information

Protocol

  1. In the last week of the month, draft the newsletter.
  2. On the first Monday of each month, send out the newsletter.
  3. Cross-post to essays collection.

Newsletters

2020

2021