January 2021
Hello datanistas!
A new year has started, and it's a new dawn! With that new dawn comes a move to Substack. The main benefit for you is that the newsletter archive will be much more readily visible. (Some of y'all had asked me about the archive.) I also have a hunch that Substack supports newsletters better than Mailchimp does with TinyLetter.
In this edition of the newsletter, I wanted to share items on two themes: the first is data, and the second is learning.
To kick things off, I wanted to share Pandera, a runtime data validation library that, if memory serves me right, I've highlighted in the newsletter before. That said, there have been new releases recently, and they contain exciting stuff! One thing I especially like is the new pydantic-style class declarations. At work, I've used Pandera to gain clarity over my data processing functions: by pre-defining the dataframes my project needs, I can explicitly state my assumptions about how my data ought to look, and the pydantic-style class declarations help with that. Additionally, you can now use these schemas in your function annotations, which means we can write a program that statically analyzes a Python codebase, leveraging the dataframe type annotations, to automatically spit out the data pipeline expressed in that codebase. (I hacked that out one Monday afternoon on a private repo; it was a welcome distraction!) Definitely check out Pandera!
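For the curious, here's a minimal sketch of what that looks like. The schema names, columns, and the drop_negative_values function are all hypothetical, just for illustration; the pydantic-style classes declare what columns a dataframe ought to have, and the check_types decorator validates dataframes as they flow in and out of the function at runtime.

```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


# Hypothetical schemas: declare the columns (and checks) we expect.
class RawMeasurements(pa.SchemaModel):
    sample_id: Series[str]
    value: Series[float]


class CleanMeasurements(pa.SchemaModel):
    sample_id: Series[str]
    value: Series[float] = pa.Field(ge=0.0)  # cleaned values must be non-negative


# check_types validates the input and output dataframes at call time,
# and the annotations document this pipeline step for static analysis.
@pa.check_types
def drop_negative_values(df: DataFrame[RawMeasurements]) -> DataFrame[CleanMeasurements]:
    return df.loc[df["value"] >= 0.0]


if __name__ == "__main__":
    raw = pd.DataFrame({"sample_id": ["a", "b"], "value": [1.2, -0.5]})
    print(drop_negative_values(raw))
```

Because each function's input and output schemas are right there in the annotations, walking a codebase and chaining those annotations together is enough to recover the pipeline's structure.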
The other thing I wanted to share is a pair of blog posts by friends at Superconductive (developers of Great Expectations, another data validation library, this one oriented towards large pipelines). I found both posts insightful and wanted to pass them along to you.
Moving on to the theme of learning: Prof. Allen Downey of the Olin College of Engineering has been updating Think Bayes with new material, including the use of PyMC3. I am excited to see it released! The first edition was foundational in my journey into Bayesian statistical modelling, and having seen the second edition's material online, I am confident that newcomers to Bayesian inference will enjoy learning from it!
Having observed the role of "data science" for nearly half a decade, I've noticed a distinct lack of causal thinking in our data science projects. Part of that may be the hype surrounding "big models", but part of it may also be a lack of awesome introductory causal inference material. Fret not: Matheus Facure has your back covered with a lighthearted introduction to causal inference methods.
Over the winter, I reflected on a year of getting newcomer and seasoned colleagues up to speed on modern data science tooling and project organization. The result is a new eBook in which I pour out everything I know about getting your computer bootstrapped and organized to do awesome data science work. (I put it to use just recently when replacing a 4-year-old 12" MacBook with a new 13" M1 MacBook Air, so I'm dog-fooding the material myself!) In the spirit of democratizing knowledge, the website is freely available to all; if you would like to support the project as it gets updated with new things I learn, it's also available on LeanPub, where it will be continuously updated as well. My hope is that it becomes a great resource for you!