Data Science Programming Newsletter MOC

With the Data Science Programming newsletter, I'm trying to share ideas on how to make data science programming better.

Key information

Protocol

  1. In the last week of the month, draft the newsletter.
  2. On the first Monday of the month, send out the newsletter.
  3. Cross-post to the essays collection.

Newsletters

2020

2021

2021 05-May

On software quality. The same goes for a model: it doesn't exist in a vacuum, and context matters -- therefore, every layer of the model matters, from the equations up to the usage API. We try to do this properly with the unirep model.

https://martinfowler.com/articles/is-quality-worth-cost.html


Building tools to interact with your data

Very cool blog post documenting how to build interactive audio plots in Jupyter notebooks.

https://www.scottcondron.com/jupyter/visualisation/audio/2020/10/21/interactive-audio-plots-in-jupyter-notebook.html


Synthetic Data Vault

Generate synthetic data to help preserve privacy!

https://sdv.dev/


Deduplipy

https://www.deduplipy.com/

which uses modAL under the hood

https://modal-python.readthedocs.io/en/latest/index.html


2021 02-February

Notes about packaging: https://labs.quansight.org/blog/2021/01/python-packaging-brainstorm/

MLOps: https://ml-ops.org
- An awesome site with lots of informative articles for the data practitioner looking to take a model that has passed the MVP stage onto a production route.
- related: https://github.com/visenger/awesome-mlops

Transformers:
- https://theaisummer.com/transformer/
- http://jalammar.github.io/illustrated-transformer/

Model Search:
- https://ai.googleblog.com/2021/02/introducing-model-search-open-source.html?m=1.
- Repo: https://github.com/google/model_search

2020 12-December

Hello, datanistas!

This month is a special edition dedicated to JAX! It's a Python package built by some friends I made while they were at Harvard's Intelligent and Probabilistic Systems lab, and I was still in grad school.

I've been a fan of JAX ever since I started seriously developing array programs that required the use of automatic differentiation. What's up with JAX, you might ask? It's a library that brings automatic differentiation and many other composable program transformations to the NumPy API.

Why is automatic differentiation significant? The ability to calculate the derivative of a function, w.r.t. one or more of its arguments, is essential to many realms of computation. For example, we can use gradient-based optimization to train small and large models by maximum likelihood or maximum a posteriori estimation of their parameters. Modern MCMC samplers also leverage gradients to guide where to draw the next posterior sample. Input design problems can use gradient-based optimization too, in which we either optimize or sample new inputs to achieve some desired output.

At its core, JAX takes a function that returns a scalar value and gives you back the derivative of that function's output w.r.t. its inputs. It accomplishes this with the grad function, which takes the function passed into it and transforms it into another function that evaluates the gradient. Gradient transformations are one example of a broader class of program transformations, which take a program (e.g. a function implemented in NumPy code) and transform it into another program (e.g. its derivative function). JAX houses other program transformations too, including just-in-time compilation for speed-ups, loop-replacement functions, and more.
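To make that concrete, here is a minimal sketch (a toy example of my own, not lifted from the JAX docs) of what the grad and jit transformations look like in practice:

```python
import jax.numpy as np  # the NumPy-compatible API
from jax import grad, jit

def loss(w):
    # A toy scalar-valued function: sum of squared errors against a constant target.
    return np.sum((w - 3.0) ** 2)

dloss = grad(loss)        # a new function that evaluates d(loss)/dw
fast_dloss = jit(dloss)   # the same function, just-in-time compiled for speed

w = np.arange(4.0)
print(loss(w))        # a scalar
print(fast_dloss(w))  # the gradient, same shape as w: 2 * (w - 3.0)
```

Everything stays plain NumPy-style code; the transformations are just functions applied to functions.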

Here, I'm going to highlight a sampling of the JAX projects that have come up on my radar to showcase the diversity of numerical computation projects that you can build with it.

Neural network projects

Because differential programming is a broader thing than just neural networks, you can write neural networks and much more using JAX. If you're not used to writing neural network models from scratch, not an issue: there are a few neural network frontends built on top of JAX's NumPy API that implement PyTorch-like APIs.

  • flax: A neural network library focused on flexibility.
  • haiku: One developed by the fine folks at DeepMind, alongside their other JAX projects.
  • stax: JAX's internal experimental module for writing neural network models, which pairs well with its optimizers module (see the rough sketch below)!
  • neural-tangents: A research project I have been following, which provides "infinitely wide" versions of classical neural networks. It extends the stax API.

The best part of these projects? You never have to leave the idiomatic NumPy API :).
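If you're curious what the stax + optimizers combination looks like, here's a rough sketch of a tiny model and training step (module paths are from the JAX versions I was using around this time; later releases have moved these around):

```python
import jax.numpy as np
from jax import grad, jit, random
from jax.experimental import stax, optimizers

# A tiny MLP: stax layers compose into an (init_fn, apply_fn) pair.
init_fn, apply_fn = stax.serial(
    stax.Dense(64), stax.Relu,
    stax.Dense(1),
)
_, params = init_fn(random.PRNGKey(0), input_shape=(-1, 10))

def mse(params, x, y):
    # Mean squared error between predictions and targets.
    return np.mean((apply_fn(params, x) - y) ** 2)

opt_init, opt_update, get_params = optimizers.adam(step_size=1e-3)
opt_state = opt_init(params)

@jit
def step(i, opt_state, x, y):
    # One gradient-descent step, compiled end-to-end.
    g = grad(mse)(get_params(opt_state), x, y)
    return opt_update(i, g, opt_state)
```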

Probabilistic programming projects

As someone who has dabbled in Bayesian statistical modelling, probabilistic programming is high on my watch list.

The first one I want to highlight is PyMC3 -- more specifically, its computational backend, Theano. One of our PyMC devs, Brandon Willard, had the foresight to see that we could rewrite Theano to compile to JAX, giving Theano's symbolic graph manipulation capabilities a modernized array computation backend. It's in the works right now! Read more about it in a blog post written by the PyMC devs.

The second one I want to highlight is NumPyro, a JAX-backed version of the Pyro probabilistic programming language. A collection of Pyro enthusiasts built NumPyro; one of its most significant selling points is implementing the No-U-Turn Sampler (NUTS) in a performant fashion.
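To give a flavour of NumPyro, here's a minimal sketch of estimating a Gaussian's mean and scale with NUTS (a toy model of my own, with made-up data):

```python
import jax.numpy as np
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(y=None):
    # Priors over the unknown mean and scale.
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    sigma = numpyro.sample("sigma", dist.Exponential(1.0))
    # Likelihood of the observed data.
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=y)

y = np.array([1.1, 0.9, 1.3, 0.7, 1.0])
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), y=y)
mcmc.print_summary()
```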

The third one I want to highlight is mcx, a learning project built by Remi Louf, a software engineer in Paris. He has single-handedly implemented a probabilistic programming language leveraging JAX's idioms. I had the privilege of chatting with him about it and test-driving early versions of it.

Tutorials on JAX

Here are two tutorials on JAX that I have encountered, which helped me along the way.

Colin Raffel has a blog post on JAX that very much helped me understand how to use it. I highly recommend it!

Eric Jang has a blog post on meta-learning, with accompanying notebooks linked in the post, showing how to do it using JAX.

Beyond that, the JAX docs have a great tutorial to help get you up to speed.

From my collection

As I've experimented with JAX and used it in projects at work, here are things I've had a ton of fun building on top of JAX.

The first is jax-unirep, done together with one of my interns Arkadij Kummer, in which we took a recurrent neural network developed by the Church Lab at Harvard Medical School and accelerated it over 100X using JAX, while also extending its API for ease of use. You can check out the pre-print we wrote as well.

The second is a tutorial on differential programming, one that I'm continually building out as I learn more about the topic. There are a few rough edges in there post-rewrite, but I'm sharing it early in the spirit of working with the garage door open. In particular, I had a ton of fun walking through the math behind Dirichlet process Gaussian mixture model clustering.

2021 04-April_social-media

Hello fellow datanistas!

First off, if you are wondering what happened to the March edition: I was a bit overloaded at work in the lead-up to parental leave, so I intentionally took some time off from all things data related. To make up for it, though, there will be two editions of the newsletter this month, this being a special edition on Awesome Social Media Posts! (It was what I had planned for March, and I'm still excited to share it with you all.)

When to use what model?

Isabelle Ghement has a great tweet that lists out the factors influencing the choice of a statistical model. One thing I learned there is that practical matters, such as an individual's skill level, are real constraints on whether a model can be used. Models are tools, and require skill to wield!

Sponsor the people who make your tools

Samuel Colvin, who makes the awesome tool Pydantic, sponsored Ned Batchelder, who makes coverage.py, a tool for measuring code coverage. Financial support, even if just the price of a cup of coffee a month (or a latte, if you're feeling fancy), can help make maintenance financially viable for the people behind your favourite tools!

Data curation is a worthwhile infrastructural investment

The Protein Data Bank was instrumental in efforts to build vaccines and treatments against COVID-19; the fact that over 1000 such structures have now been deposited highlights for me how focused curation of one data modality, sustained over a long period of time, can be a worthwhile investment that pays dividends many times over.

How good are machine learning paper publication practices?

There's no doubt right now that machine learning, as a discipline, has intersected with many other disciplines. How do people outside of machine learning perceive the field? David Ha (@hardmaru) tweeted a Reddit thread that spells out some views.

Are two brains better than one in pair programming?

Those who have worked with me know that I like to work in pairs, solving problems together. My sense is that it makes for more robust projects; creativity is also sharpened by having pairs work together. Does this hold all the time? Jacqueline Smith shares her take on her blog.

Why I'm lukewarm on Graph Neural Networks

In this post shared by Andrew Fairless on LinkedIn, Matt Ranger talks about why research on graph neural networks appears to be "more of the same" from the academy. It's a simultaneously entertaining and sobering read :).

No COVID-19 models are clinic ready!

On Twitter, Eric Topol shared a link to a publication in Nature Machine Intelligence in which the authors found that none of the published models for using chest radiographs and CT scans to predict COVID-19 progression were ready for the clinic. Why? I won't spill the beans here, check out the paper linked in the tweet!

Berkson's Paradox

Also known as "how observational biases give rise to spurious correlations". Tweeted out by Lionel Page, there's a whole thread! Mathematician Hannah Fry explains further with more examples of Berkson's paradox in her Numberphile video (linked in the tweet).
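If you want to see the effect for yourself, here's a toy simulation (my own made-up numbers, not from the thread): two traits that are independent in the population become negatively correlated once you condition on a selection threshold.

```python
import numpy as np

rng = np.random.default_rng(42)
talent = rng.normal(size=100_000)
luck = rng.normal(size=100_000)  # independent of talent by construction

print(np.corrcoef(talent, luck)[0, 1])  # ~0 in the full population

# Look only at the "selected" subset, e.g. those whose combined score clears a bar.
selected = talent + luck > 2.0
print(np.corrcoef(talent[selected], luck[selected])[0, 1])  # clearly negative
```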

That ends this special social media edition of the Data Science Programming Newsletter. At the end of the month, we'll resume regular, ahem, programming.

Other cool stuff

  1. https://www.python-graph-gallery.com
  2. https://www.technologyreview.com/2021/03/11/1020600/facebook-responsible-ai-misinformation/amp/
  3. https://www.python.org/dev/peps/pep-0646/

index

This is the landing page for my notes.

This is 100% inspired by Andy Matuschak's famous notes page. I'm not technically skilled enough to replicate the full "Andy Mode", though, so I just did some simple hacks. If you're curious how these notes are compiled, check out the summary in How these notes are made into HTML pages.

This is my "notes garden". I tend to it on a daily basis, and it contains some of my less fully-formed thoughts. Nothing here is intended to be cited, as the link structure evolves over time. The notes are best viewed on a desktop/laptop computer, because of the use of hovers for previews.

There's no formal "navigation", or "search" for these pages. To go somewhere, click on any of the "high-level" notes below, and enjoy.

  1. Notes on statistics
  2. Notes on differential computing
  3. The State of Data Science
  4. Network science
  5. Scholarly readings
  6. Software skills for data scientists
  7. The Data Science Programming Newsletter MOC
  8. Life and computer hacks
  9. Reading Bazaar
  10. Blog drafts
  11. Conference Proposals

2020 10-October

Content to feature:

Recently at work, I've been building some bespoke machine learning models (autoregressive hidden Markov models and graph neural networks) for scientific problems that we encounter. In building those bespoke models, because we aren't using standard reference libraries, we have to build the model code from scratch. Since it's software, it needs tests, and Jeremy Jordan has a great blog post on how to effectively test ML systems. Definitely worth a read in my opinion.
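To give a sense of what I mean by testing these bespoke models, here's the flavour of test I end up writing (the MyHMM class, its methods, and the shapes are hypothetical, and this isn't taken from Jeremy's post):

```python
import numpy as np

def test_predictions_have_one_state_per_timestep():
    model = MyHMM(n_states=3, seed=42)  # hypothetical bespoke model class
    x = np.random.default_rng(0).normal(size=(10, 5))  # 10 timesteps, 5 features
    states = model.fit(x).predict(x)  # assumes fit() returns the fitted model
    assert states.shape == (10,)
    assert set(np.unique(states)) <= {0, 1, 2}

def test_predictions_are_reproducible_given_a_seed():
    x = np.random.default_rng(0).normal(size=(10, 5))
    m1, m2 = MyHMM(n_states=3, seed=42), MyHMM(n_states=3, seed=42)
    assert np.array_equal(m1.fit(x).predict(x), m2.fit(x).predict(x))
```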

In his Medium article, Gonzalo Ferreiro Volpi shares some fundamental software skills for data scientists. For those of you who want to invest in levelling up your code-writing skills to reap multiplicative dividends in time saved, frustration avoided, and happiness gained, come check it out.

In her blog post, Shreya Shankar has some extremely valuable insights into the practice of making ML useful in the real world, which I absolutely agree with. One, in particular, being the quote:

Outside of ML classes and research, I learned that often the most reliable way to get performance improvements is to find another piece of data which gives insight into a completely new aspect of the problem, rather than to add a tweak to the loss. Whenever model performance is bad, we (scientists and practitioners) shouldn’t only resort to investigating model architecture and parameters. We should also be thinking about “culprits” of bad performance in the data.

With that little teaser, I hope this gives you enough impetus to read it. :)

This article is topical and relevant, and I appreciated the illustrations in there. It highlights a really powerful model -- where powerful doesn't mean millions of parameters, but rather conceptually simple, easy to communicate, broadly applicable, and intensely relevant for the times. Aatish Bhatia has done a tremendously wonderful job with this explanation. It's a technical masterpiece.

  • From my collection:
    • Some colleagues had questions about environment variables, so I decided to surface up an old post on the topic and spruce it up with more information on my essays collection.
    • I moved data across work sites securely, and as fast as commercial tools, using nothing but free and open source tooling. Come read how.
    • I also recently figured out how to directly open a Jupyter notebook in a Binder session. The hack is super cool.

Finally, some more humour from the ever on fire Kareem Carr :).

2021 04-April_official

Hello fellow datanistas!

As promised, here is the official April edition of the Data Science Programming Newsletter.

Having been on paternal leave for over a month now, I have found some space to think strategically about data projects (and products) beyond just the fun part of coding. These were inspired by a range of articles that I have read, and I'd like to share them with everybody.

(1) Orphaned Analytics

The first article that I'd like to share is one about orphaned analytics. Orphaned analytics are defined as "one-off Machine Learning (ML) models written to address a specific business or operational problem, but never engineered for sharing, re-use and continuous-learning and adapting." The liability incurred by orphaned models (my term) is described in the article, and it essentially boils down to data systems being filled with implicit context that isn't explicitly recorded.

(2) Data Science Paint By Numbers

Reading the article on orphaned analytics led me to the next article about good data project workflow. In there, the author writes about the least developed part of data projects: "what it is we are trying to prove out with our data science engagement and how do we measure progress and success." At its core, it sounds a ton like how hypothesis-driven scholarly research ought to be done.

(3) The Machine Learning Canvas

Having read all of that led me to this awesome resource: the Machine Learning Canvas. In there lies a framework -- a collection of questions that need to be answered, which will help thoroughly flesh out a plan for how a machine learning project could develop. I imagine this will work for data projects in general too. The old adage holds true: failing to plan means planning to fail, and I think the ML Canvas is a great resource to help us data scientists work with our colleagues to build better data systems.

(4) Data projects and data products

This exploration around how to structure a data project reminded me of another article I had read before. This one is from the Harvard Business Review, which encourages us to approach our data projects with a product mindset. I particularly like the authors' definition of how "productization" happens:

Productization involves abstracting the underlying principles of successful point solutions until they can be used to solve an array of similar, but distinct, business problems.

and

...true productization involves taking the target end users into account.

(5) Run your data team like a product team

I also learned something from this article on locallyoptimistic about how to run a data team effectively. The key is to run it as if the team were building a product around the data. The product has features, with the heuristic, "if people are using it to make decisions, then it’s a feature of the Data Product".

(6) Some of my own thoughts

Reading through these articles has reinforced this next idea for me: it takes time to build a data project from conception to completion; leading that project well probably means leading one project with focus and managing all aspects of it properly.

My hope is that as you read through the articles, some thoughts bubble up in your mind as well. Having had some time to ponder these ideas, I'm thinking of hosting a discussion hour on this idea of "data science projects and products" to exchange ideas and learn from one another. If this is something you're interested in taking part in, please send me a message and let's flesh it out!

(7) The geeky stuff

Having unloaded my thoughts learning how to run a data project and team, let's turn our attention to the geeky stuff.

Firstly, you have to check out Rich by Will McGugan. It's an absolute baller of a package, especially for making rich command line interfaces. Also, Khuyen Tran has a wonderful article about how to use Rich.
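If you haven't tried it yet, here's a tiny taste, a sketch using Rich's Console and Table APIs (the table contents are made up for illustration):

```python
from rich.console import Console
from rich.table import Table

console = Console()

# Pretty-printed tables in the terminal, with almost no effort.
table = Table(title="Experiments")
table.add_column("run")
table.add_column("accuracy", justify="right")
table.add_row("baseline", "0.87")
table.add_row("with-augmentation", "0.91")

console.print(table)
console.print("[bold green]Done![/bold green] :rocket:")  # colour markup and emoji
```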

Secondly, the socially prolific Kareem Carr has the best relationship mic drop ever.

Finally, for those of you doing image machine learning and need a fast way to do cropping, you should check out inbac, a Python application for doing interactive batch cropping (where its name comes from, obviously).

(8) From my collection

I've been at work on nxviz, a Python package I developed in graduate school and subsequently neglected for over four years while at work. Now that I've been on a break for a while, I decided to do an upgrade to the API, working out the grammar needed to compose together beautiful network visualizations. Now it's basically ready to share, with a release coming early May! Meanwhile, please check out the docs for a preview of graph visualizations you'll be able to make!

Also, I will be teaching a tutorial called Magical NumPy with JAX at PyCon and SciPy this year. It's an extension of this tutorial repository I made a while ago, dl-workshop. Looking forward to teaching this new workshop!

Stay safe, keep having fun, and keep making cool and useful things!

Eric

2021 01-January

Hello datanistas!

A new year has started, and it's a new dawn! And with that new dawn is a move to Substack. The primary thing that you'll benefit from is that the newsletter archive will be more readily visible. (The newsletter archive was something I had questions from some of y'all about.) I also have a hunch that Substack supports newsletters better than Mailchimp does with Tinyletter.

In this edition of the newsletter, I wanted to share items on two themes. The first is on data, the second is on learning.

Pandera new releases

To kickstart, I wanted to share about Pandera, a runtime data validation library which, if my memory serves me right, I've highlighted before in the newsletter. That said, there have been new releases that I'm a fan of, and they contain exciting stuff! One of the things I especially like is the new pydantic-style class declarations. At work, I've used Pandera to help me gain clarity over my data processing functions: by pre-defining what dataframes I need in my project, I can explicitly state my assumptions about how my data ought to look, and the pydantic-style class declarations help with that. Additionally, you can now add them as part of your function annotations, which means we can write a program that performs static analysis of a Python codebase, leveraging dataframe type annotations, to automatically spit out the data pipeline expressed in the codebase. (I hacked that out one Monday afternoon on a private repo; it was a welcome distraction!) Definitely check out Pandera!
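To make the pydantic-style declarations concrete, here's a minimal sketch of how I use them (the schema and column names are made up for illustration; check the Pandera docs for the exact API of the version you're on):

```python
import pandera as pa
from pandera.typing import DataFrame, Series

class Measurements(pa.SchemaModel):
    """My explicit assumptions about the measurements dataframe."""
    sample_id: Series[str] = pa.Field(nullable=False)
    optical_density: Series[float] = pa.Field(ge=0)

@pa.check_types
def preprocess(df: DataFrame[Measurements]) -> DataFrame[Measurements]:
    # The annotation validates the dataframe at runtime; the body stays focused on logic.
    return df.sort_values("sample_id").reset_index(drop=True)
```

The annotations double as machine-readable documentation of what each function expects and returns, which is what makes the static-analysis trick above possible.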

Great Expectations

The other thing I wanted to share is a pair of blog posts by friends at Superconductive (developers of Great Expectations, another data validation library oriented towards large pipelines). Both posts are insightful, and I wanted to pass them along to you.

Think Bayes 2

Moving onto the theme of learning, Prof. Allen Downey of the Olin College of Engineering has been updating Think Bayes to include new material, including the use of PyMC3 inside there. I am excited to see it be released! The first version was foundational in my journey into Bayesian statistical modelling, and having seen the 2nd version's material online, I am confident that a newcomer to Bayesian inference will enjoy learning from it!

Causal Inference for the Brave and True

For nearly half a decade of observing the role of "data science", I've noticed a distinct lack of causal thinking incorporated into our data science projects. Part of that may be the hype surrounding "big models", but part of it may also be a lack of awesome introductory causal inference material. Fret not: Matheus Facure has your back covered with a lighthearted introduction to causal inference methods.

From my collection

Over the winter, I reflected on a year of getting newcomer and seasoned colleagues up to speed on modern data science tooling and project organization. The result is a new eBook I wrote, in which I pour out everything I know about getting your computer bootstrapped and organized to do awesome data science work. (I put it to use just recently in replacing a 4-year-old 12" MacBook with a new 13" M1 MacBook Air, so I'm dog-fooding the material myself!) In the spirit of democratizing knowledge, the website is freely available to all; if you would like to support the project as it gets updated with new things I learn, it's also available on LeanPub (which will also be continuously updated). My hope is that it becomes a great resource for you!

2020 11-November