Data Science Programming Newsletter: April 2021

Hello fellow datanistas!

As promised, here is the official April edition of the Data Science Programming Newsletter.

Having been on paternity leave for over a month now, I have found some space to think strategically about data projects (and products) beyond just the fun part of coding. These thoughts were inspired by a range of articles that I have read, and I'd like to share them with everybody.

(1) Orphaned Analytics

The first article that I'd like to share is one about orphaned analytics. Orphaned analytics are defined as "one-off Machine Learning (ML) models written to address a specific business or operational problem, but never engineered for sharing, re-use and continuous-learning and adapting." The article describes the liability incurred by these orphaned models (my term), which essentially boils down to data systems being filled with implicit context that isn't explicitly recorded.

(2) Data Science Paint By Numbers

Reading the article on orphaned analytics led me to the next article about good data project workflow. In there, the author writes about the least developed part of data projects: "what it is we are trying to prove out with our data science engagement and how do we measure progress and success." At its core, it sounds a ton like how hypothesis-driven scholarly research ought to be done.

(3) The Machine Learning Canvas

Having read all of that led me to this awesome resource: the Machine Learning Canvas. In there lies a framework -- a collection of questions that need to be answered, which will help thoroughly flesh out a plan for how a machine learning project could develop. I imagine this will work for data projects in general too. The old adage holds true: failing to plan is planning to fail, and I think the ML Canvas is a great resource to help us data scientists work with our colleagues to build better data systems.

(4) Data projects and data products

This exploration around how to structure a data project reminded me of another article I had read before. This one is from the Harvard Business Review, which encourages us to approach our data projects with a product mindset. I particularly like the authors' definition of how "productization" happens:

Productization involves abstracting the underlying principles of successful point solutions until they can be used to solve an array of similar, but distinct, business problems.

and

...true productization involves taking the target end users into account.

(5) Run your data team like a product team

I also learned something from this article on locallyoptimistic about how to run a data team effectively. The key is to run the team as if it were building a product around the data. The product has features, with the heuristic: "if people are using it to make decisions, then it’s a feature of the Data Product".

(6) Some of my own thoughts

Reading through these articles has reinforced this next idea for me: it takes time to build a data project from conception to completion, and leading that project well probably means focusing on one project at a time and managing all aspects of it properly.

My hope is that as you read through the articles, some thoughts bubble up in your mind as well. Having had some time to ponder these ideas, I'm thinking of hosting a discussion hour on this idea of "data science projects and products" to exchange ideas and learn from one another. If this is something you're interested in taking part in, please send me a message and let's flesh it out!

(7) The geeky stuff

Having unloaded my thoughts on how to run a data project and team, let's turn our attention to the geeky stuff.

Firstly, you have to check out Rich by Will McGugan. It's an absolute baller of a package, especially for making rich command line interfaces. Also, Khuyen Tran has a wonderful article about how to use Rich.
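
To give you a taste, here's a minimal sketch of my own (not taken from Khuyen's article) showing the kind of thing Rich makes easy: styled terminal output and quick tables.

```python
# A minimal taste of Rich: styled console output and a quick table.
from rich.console import Console
from rich.table import Table

console = Console()
console.print("[bold green]Model training complete![/bold green] :rocket:")

table = Table(title="Cross-validation results")
table.add_column("Fold")
table.add_column("Accuracy", justify="right")
table.add_row("1", "0.91")
table.add_row("2", "0.89")
console.print(table)
```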

Secondly, the socially prolific Kareem Carr has the best relationship mic drop ever.

Finally, for those of you doing image machine learning who need a fast way to do cropping, you should check out inbac, a Python application for interactive batch cropping (which, obviously, is where its name comes from).

(8) From my collection

I've been at work on nxviz, a Python package I developed in graduate school and subsequently neglected for over four years while working full-time. Now that I've been on a break for a while, I decided to upgrade the API, working out the grammar needed to compose beautiful network visualizations. It's now basically ready to share, with a release coming in early May! Meanwhile, please check out the docs for a preview of the graph visualizations you'll be able to make!

Also, I will be teaching a tutorial called Magical NumPy with JAX at PyCon and SciPy this year. It's an extension of this tutorial repository I made a while ago, dl-workshop. Looking forward to teaching this new workshop!
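
To give a flavour of why I call it "magical" (this is my own generic sketch, not material lifted from the workshop): with JAX you write plain NumPy-style code and get gradients and compilation essentially for free.

```python
# A generic JAX sketch: NumPy-style code plus automatic differentiation and JIT.
import jax.numpy as jnp
from jax import grad, jit

def loss(w, x, y):
    """Mean squared error of a simple linear model."""
    preds = jnp.dot(x, w)
    return jnp.mean((preds - y) ** 2)

# grad gives us d(loss)/d(w); jit compiles the whole thing with XLA.
grad_loss = jit(grad(loss))

x = jnp.ones((10, 3))
y = jnp.zeros(10)
w = jnp.array([0.1, 0.2, 0.3])
print(grad_loss(w, x, y))
```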

Stay safe, keep having fun, and keep making cool and useful things!

Eric
