PyCon 2019 Tutorial and Conference Days

written by Eric J. Ma on 2019-05-10

pycon 2019 conferences data science

It's just been days since I got back from PyCon, and I'm already looking forward to 2020! But I thought it'd be nice to continue the recap.

This year at PyCon, I co-led two tutorials, one with Hugo Bowne-Anderson on Bayesian Data Science by Simulation, and the other with Mridul Seth on Network Analysis Made Simple.

I always enjoy teaching with Hugo. He brought his giant sense of humour, both figuratively and literally, to this tutorial, and melded it with his deep grasp of the math behind Bayesian statistics, delivering a workshop that, judging by many points of feedback, was excellent. Having reviewed that feedback, we've got many ideas for our showcase of Part II at SciPy 2019!

This year was the first year that Mridul and I swapped roles. In previous years, he was my TA, helping tutorial participants while I did the main lecturing. This year, the apprentice became the master, and a really good one indeed! Looking forward to seeing him shine more in subsequent tutorial iterations.

During the conference days, I spent most of my time either helping out with Financial Aid or at the Microsoft booth. As I have alluded to in multiple tweets, Microsoft's swag this year was the best of them all: microelectronics kits in a blue Adafruit lunch box, and getting set up with Azure. In fact, this website is now re-built on each push with Azure Pipelines! Indeed, Steve Dower from Microsoft told me that this year's best swag was probably getting set up with Azure, and I'm 100% on board with that! (Fun fact: Steve told me that he's never been called by his Twitter handle (zooba) in real life... until we met.)

I also delivered a talk this year, which essentially amounted to a formal rant against canned statistical procedures. I had a ton of fun delivering this talk. The usual nerves still get to me, and I had to do a lot of talking-to-myself-style rehearsals to work off those nerves. For the first time, I also did office hours post-talk at an Open Space, where for one hour, we talked about all things Bayes. Happy to have met everybody who came by; I genuinely enjoyed the chat!



PyCon 2019 Sprints

written by Eric J. Ma on 2019-05-08

pycon software development sprint open source

This year was the first year that I decided to lead a sprint! The sprint I led was for pyjanitor, a package that I developed with my colleague, Zach Barry, and a remote collaborator in NYC, Sam Zuckerman (whom I've never met in person!). This being the first sprint I've ever led, I think I was lucky to stumble upon a few ideas that made for a productive, collaborative, and most importantly, fun sprint.

pyjanitor?

I'm going to deliver a talk on pyjanitor later in the year, so I'll save the details for that talk. The short description of pyjanitor is that if you have pandas one-liners that are difficult to remember, they should become a function in pyjanitor; if you have a 10-liner that you always copy/paste from another source, they should become a function in pyjanitor.
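
To make that concrete, here's a minimal sketch of the kind of API pyjanitor exposes. clean_names and remove_empty are real pyjanitor functions; the dataframe is made up for illustration:

import pandas as pd
import janitor  # importing janitor registers its methods on pandas DataFrames

df = pd.DataFrame({"First Name": ["Anna", None], "Total Score": [88.0, None]})

# Instead of remembering (or copy/pasting) the pandas incantations for
# cleaning column names and dropping all-empty rows/columns, chain the methods:
cleaned = df.clean_names().remove_empty()
print(cleaned.columns)  # ['first_name', 'total_score']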

Sprints?

Code sprints are a part of PyCon: they are basically one to four days of intense and focused software development on a single project.

Project sprint leads first pitch their projects at the "Sprintros", where they indicate which days and times they will be sprinting. The next day, we indicate which rooms our projects will be sprinting in. Volunteers from the conference who would like to make an open source contribution then identify projects they would like to come sprint with.

In some senses, there's no way for a sprint leader to know how popular their sprint will be a priori. We have to be prepared to handle a range of scenarios from sprinting with just one other person to sprinting with a crowd.

Structure

Preparation

In preparation for the sprint, I absorbed many lessons learned over the years of sprinting on others' projects.

The most obvious one was to ensure that every sprinter had something to do right from the get-go. Having a task from the start keeps sprinters, especially newcomers, engaged right from the beginning. This motivated the requirement to make a docs fix before making a code fix. (You can read more below on how we made this happen.) I wrote out this requirement in a number of places, and by the time the sprint day arrived, it came off pretty naturally.

The second thing that I did to prep was to triage existing issues and label them as being beginner- or intermediate-friendly, and whether they were doc, infrastructure, or code contributions.

Those two things were my highest-priority preparation for the sprint, and I think that helped a ton.

Doc Fixes

This sprint, I gave the structure some thought, and settled on the following: Before making a code contribution, I required a docs contribution.

Docs contributions could be of any scale.

I think this worked well for the following reasons:

  1. New contributors must read the docs before developing on the project, and hence become familiar with the project.
  2. There's always something that can be done better in the docs, and hence, there is something that can be immediately acted on.
  3. The task is a pain point personally uncovered by the contributor, and hence the contributor has the full context of the problem.
  4. The docs don't break the code/tests, and hence doc contributions are a great way to make a contribution without wrestling with more complex testing.
  5. Starting everybody who has never worked on pyjanitor on docs is an egalitarian way of on-boarding every newcomer, beginners and experienced individuals alike. Nobody gets special treatment.

For each individual's contribution, I asked them to first raise an issue on the GitHub issue tracker describing the contribution that they would like to make, and then clearly indicate in the comments that they would like to work on it. Then, they would go through the process of doing the documentation fix, from forking the repository, cloning it locally, creating a new branch, making edits, committing, pushing, and PR-ing.

If two people accidentally ended up working on the same docs issue, I would assess the effort of the later one, and if it was substantial enough, I would allow them to consider it done and move on to a different issue.

Going forward, as the group of contributors expands, I will enforce this "docs-first" requirement only for newcomer sprinters, and ask experienced ones to help manage the process.

Code Contributions

Once the docs contributions were done, sprinters were free to either continue with more docs contributions, or provide a code contribution.

Code contributions could be of one of the following:

  1. New function contributions.
  2. Cleaner implementations of existing functions.
  3. Restructuring of existing functions.
  4. Identification of functions to deprecate (very important!).
  5. Infrastructural changes to docs, build system, and more.

The process for doing this was identical to docs: raise an issue, claim it, and then make a new branch with edits, and finally PR it.

My Role

For both days, we had more than 10 people sprint on pyjanitor. Many who were present on the 1st day (and didn't have a flight to catch) came back on the 2nd day. As such, I actually didn't get much coding done. Instead, I took on the following roles:

  1. Q&A person: Answering questions about what would be acceptable contributions, technical (read: git) issues, and more.
  2. Issue labeller and triager: I spent the bulk of my downtime (and pre-/post-sprint time) tagging issues on the GitHub issue tracker, marking them as available or unavailable for hacking on, and labelling them as docs-related, infrastructure-related, or code enhancements.
  3. Code reviewer: As PRs came in, I would review each one and discuss with the contributor where to adjust the code to adhere to code style standards.
  4. Continuous integration pipeline babysitter: Because I had just switched us off from Travis CI to Azure Pipelines, I was babysitting the pipelines to make sure nothing went wrong. (Spoiler: something did!)
  5. Green button pusher: Once all tests passed, I would hit the big green button to merge PRs!

If I get to sprint with other experienced contributors at future sprints, I would definitely like to have some help with the above.

Thoughts

Making sprints human-friendly

I tried out a few ideas, which I hope made the sprints just that little bit more human-friendly.

  1. Used large Post-It easel pads to write out commonly-used commands at the terminal.
  2. Displayed claimed seating arrangements in the morning, and more importantly, got to know every sprinter's name.
  3. Announced every merged PR to the group, together with what it contained, followed by a round of applause.
  4. Set a timer for 5 minutes before lunch so that sprinters could all get ahead in the line.
  5. Used staff privileges to move the box of leftover bananas into our sprint room. :)

I think the applause is the most encouraging part of the process. Having struggled through a PR, however big or small, and having group recognition for that effort, is super encouraging, especially for first-time contributors. I think we need to encourage this more at sprints.

Relinquishing control

The only things about the pyjanitor project that I'm unwilling to give up on are these: good documentation of what each function does, that a function should do one thing well, and that it should be method-chainable. Everything else, including functionality, is an open discussion that we can have!
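
To give a flavour of what "method-chainable" means in practice, here's a minimal sketch of how a pyjanitor-style function can be registered as a DataFrame method using pandas_flavor, which pyjanitor uses under the hood. The drop_single_valued_columns function is a made-up example for illustration, not an actual pyjanitor function:

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def drop_single_valued_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that contain only a single unique value.

    It does one thing, it is documented, and it returns a DataFrame,
    so it can be chained with other methods.
    """
    keep = [col for col in df.columns if df[col].nunique(dropna=False) > 1]
    return df[keep]


df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0]})
print(df.drop_single_valued_columns().columns)  # only 'a' survives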

One PR I particularly enjoyed was that from Lucas, who PR'd in a logo for the project on the docs page. He had the idea to take the hacky broomstick I drew on the big sticky note (as a makeshift logo), redraw it as a vector graphic in Inkscape, and PR it in as the (current) official logo on the docs.

More broadly, I deferred to the sprinters' opinions on docs, because I recognized that my eyes had grown too accustomed to the docs, and I wouldn't be able to easily identify places where they could be written more clearly. Eventually, a small, self-organizing squad of 3-5 sprinters became the unofficial docs squad, rearranging the structure of the docs, building automation around them, and better organizing and summarizing the information they contain.

In more than a few places, if there was a well-justified choice for the API (which really meant naming the functions and keyword arguments), I was more than happy to see the PR happen. Even if it gets evolved away later, the present codebase and the PRs that led to it provide the substrate for better evolution of the API!

A new Microsoft

This year, I switched from Travis CI to Azure Pipelines. In particular, I was attracted to the ability to build on all three major operating systems, Windows, macOS, and Linux, on the free tier.

Microsoft had a booth at PyCon, where Steve Dower led an initiative to get us set up with Azure-related tools. Indeed, as a major sponsor of the conference, this was some of the best swag given to us: super practical, and relationship- and goodwill-building. Definitely not lesser than the Adafruit lunchboxes with electronics as swag!

Hiccups

Naturally, not everything was smooth sailing throughout. I did find myself expressing a tad of irritation at times with the amount of context switching I was doing, especially switching between talking to different sprinters one after another. (I am very used to long stretches of hacking on one problem.) One thing future sprinters could help with, which I will document, is to give me enough ramp-up context around their problem, so that I can quickly pinpoint what other information I might need.

The other not-so-smooth-sailing thing was finding out that Azure sometimes did not catch errors in a script block! My unproven hypothesis at this point is that if I have four commands executed in a script block, and any of the first three fail but the last one passes, the entire script block behaves as if it passed. This probably stems from the build system looking only at the last exit code to determine exit status. Eventually, after splitting each check into individual steps, linting and testing errors started getting caught automatically! (Automating this is much preferred to me running the black code formatter in my head.)
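
To illustrate that hypothesis outside of Azure, here's a minimal sketch using Python's subprocess to run a multi-command bash script. This is a generic demonstration of shell exit-code behaviour (it assumes a bash shell), not Azure's actual implementation:

import subprocess

# Two commands in one script: the first fails, the last succeeds.
# Without `set -e`, the script's exit code is that of the LAST command,
# so the earlier failure is silently swallowed.
script = "false\ntrue\n"
print(subprocess.run(["bash", "-c", script]).returncode)  # 0 -- failure hidden

# With `set -e` (or with each command split into its own step),
# the first failure surfaces as a non-zero exit code.
strict_script = "set -e\nfalse\ntrue\n"
print(subprocess.run(["bash", "-c", strict_script]).returncode)  # 1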

Though the above issue is fixed, I think I am still having issues getting pycodestyle and black to work on the Windows builds. Definitely looking forward to hearing from Azure devs what could be done here!

Suggestions

I'm sure there are ways I could have made the sprint a bit better. I'd love to hear them if there's something I've missed! Please feel free to comment below.

Sprinter Acknowledgement

I would like to thank all the sprinters who joined in this sprint. Their GitHub handles are below:

And as always, big thanks to my collaborators on the repository:



PyCon 2019 Pre-Journey

written by Eric J. Ma on 2019-04-29

pycon python data science conferences

I'm headed out to PyCon 2019! This year, I will be co-instructing two tutorials, one on network analysis and one on Bayesian statistics, and delivering one talk on Bayesian statistics.

The first tutorial, on network analysis, is based on material that I first developed 5 years ago and have continually updated. I've enjoyed teaching this tutorial because it represents a different way of thinking about data - in other words, relationally. This year, I will be co-instructing alongside Mridul, who has kindly agreed to step up and lead the tutorial at PyCon. The apprentice has exceeded the master!

The second tutorial, on Bayesian statistics, is based on material co-developed with Hugo Bowne-Anderson. Hugo is a mathematician by training, a pedagogy master, and a data science aficionado. Like myself, he is a fan of Bayesian statistical modelling methods, and we first debuted the tutorial last year at SciPy. We're super excited for this one!

The talk that I will deliver is on Bayesian statistical analysis of case/control tests. In particular, I noticed a content gap in data science talks, where case/control comparisons were limited to one case and one control. One epiphany I came to was that if we use Bayesian methods to analyze our data, there's no particular reason to limit ourselves to one case and one control; we can model multiple cases vs. one control, or even multiple cases vs. multiple different controls, in the same analysis, in a flexible and principled fashion.
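
To give a flavour of what that looks like, here's a minimal sketch of a multi-group estimation model in PyMC3. The data, variable names, and priors are illustrative placeholders, not the actual model from the talk:

import numpy as np
import pymc3 as pm

# Toy data: group 0 is the control; groups 1-3 are three different cases.
true_means = np.array([0.0, 0.5, 1.0, 0.2])
group_idx = np.repeat(np.arange(4), 20)
y = np.random.normal(loc=true_means[group_idx], scale=1.0)

with pm.Model():
    # One mean and one noise scale per group -- nothing forces us
    # to have exactly one case and one control.
    mu = pm.Normal("mu", mu=0.0, sd=5.0, shape=4)
    sigma = pm.HalfNormal("sigma", sd=5.0, shape=4)
    pm.Normal("obs", mu=mu[group_idx], sd=sigma[group_idx], observed=y)

    # Contrasts of interest: each case's mean minus the control's mean.
    pm.Deterministic("diff_from_control", mu[1:] - mu[0])

    trace = pm.sample(2000, tune=1000)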

My final involvement with PyCon this year is as Financial Aid Chair. This is the first year that I'm leading the FinAid effort; during previous years, I had learned a ton from the previous chair Karan Goel. My co-chairs this year are Denise Williams and Jigyasa Grover; I'm looking forward to meeting them in 3D!

All-in-all, I'm looking forward to another fun year at PyCon!



Variance Explained

written by Eric J. Ma on 2019-03-24

data science machine learning

Variance explained, as a regression quality metric, is one that I have begun to like a lot, especially when used in place of a metric like the correlation coefficient, $r^2$.

Here's variance explained defined:

$$1 - \frac{var(y_{true} - y_{pred})}{var(y_{true})}$$

Why do I like it? It’s because this metric gives us a measure of the scale of the error in predictions relative to the scale of the data.

The numerator in the fraction calculates the variance in the errors, in other words, the scale of the errors. The denominator in the fraction calculates the variance in the data, in other words, the scale of the data. By subtracting the fraction from 1, we get a number upper-bounded at 1 (best case), and unbounded towards negative infinity.
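
As a quick check of the definition, here's a minimal sketch in numpy with made-up numbers; scikit-learn's explained_variance_score, which implements the same quantity, is included as a sanity check:

import numpy as np
from sklearn.metrics import explained_variance_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Variance explained, straight from the definition above.
ve = 1 - np.var(y_true - y_pred) / np.var(y_true)
print(ve)  # ~0.957

# scikit-learn computes the same quantity.
print(explained_variance_score(y_true, y_pred))  # ~0.957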

Here are a few interesting scenarios.

A thing that is really nice about variance explained is that it can be used to compare related machine learning tasks that have different unit scales, when we want to compare how well one model performs across all of the tasks. Mean squared error makes this an apples-to-oranges comparison, because the unit scales of the tasks are different. On the other hand, variance explained is unit-less.
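
Here's a minimal sketch of that point, again with made-up numbers: rescaling a task's units (say, grams to kilograms) changes the mean squared error by a factor of a million, but leaves variance explained untouched:

import numpy as np

def variance_explained(y_true, y_pred):
    return 1 - np.var(y_true - y_pred) / np.var(y_true)

y_true_g = np.array([950.0, 1020.0, 880.0, 1110.0])  # measured in grams
y_pred_g = np.array([930.0, 1050.0, 900.0, 1080.0])

y_true_kg = y_true_g / 1000  # the same task, expressed in kilograms
y_pred_kg = y_pred_g / 1000

print(np.mean((y_true_g - y_pred_g) ** 2))      # MSE in grams^2
print(np.mean((y_true_kg - y_pred_kg) ** 2))    # 10^6 times smaller, in kg^2
print(variance_explained(y_true_g, y_pred_g))   # identical...
print(variance_explained(y_true_kg, y_pred_kg)) # ...to this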

Now, we know that single metrics can have failure points, as does the coefficient of correlation, $r^2$, as shown in Anscombe's quartet and the Datasaurus Dozen:

Fig. 1: Anscombe's quartet, taken from Autodesk Research

Fig. 2: Datasaurus Dozen, taken from Revolution Analytics

One place where variance explained can fail is if the predictions are systematically shifted off from the true values. Let's say every prediction was shifted off by 2 units.

$$var(y_{true} - y_{pred}) = var([2, 2, ..., 2]) = 0$$

There's no variance in the errors, even though the predictions are systematically shifted off from the true values. Like $r^2$, variance explained will fail here.
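
Here's a minimal sketch of this failure mode with made-up numbers: a constant offset of 2 units leaves the variance of the errors at zero, so variance explained still reports a perfect score:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 2  # every prediction systematically shifted by 2 units

# The errors are constant, so their variance is zero...
print(np.var(y_true - y_pred))  # 0.0

# ...and variance explained reports a perfect 1.0 despite the bias.
print(1 - np.var(y_true - y_pred) / np.var(y_true))  # 1.0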

As usual, Anscombe's quartet, like the Datasaurus Dozen, gives us a pertinent reminder that visually inspecting your model predictions is always a good thing!



Functools Partial

written by Eric J. Ma on 2019-03-22

python hacks tips and tricks data science productivity coding

If you’ve done Python programming for a while, I think it pays off to know some little tricks that can improve the readability of your code and decrease the amount of repetition that goes on.

One such tool is functools.partial. It took me a few years after my first introduction to partial before I finally understood why it was such a powerful tool.

Essentially, what partial does is wrap a function and set a keyword argument to a constant. That's it. What does that mean?

Here’s a minimal example. Let’s say we have a function f, not written by me, but provided by someone else.

def f(a, b):
    result = ...  # do something with a and b.
    return result

In my code, let’s say that I know that the value that b takes on in my app is always the tuple (1, 'A'). I now have a few options. The most obvious is assign the tuple (1, 'A') to a variable, and pass that in on every function call:

b = (1, 'A')
result1 = f(a=1, b=b)
# do some stuff.
result2 = f(a=15, b=b)
# do more stuff.
# ad nauseam
N = ...  # set value of N
resultN = f(a=N, b=b)

The other way I could do it is use functools.partial and just set the keyword argument b to equal to the tuple directly.

from functools import partial
f_ = partial(f, b=(1, 'A'))

Now, I can repeat the code above, but now only worrying about the keyword argument a:

result1 = f_(a=1)
# do some stuff.
result2 = f_(a=15)
# do more stuff.
# ad nauseam
N = ...  # set value of N
resultN = f_(a=N)

And there you go, that’s basically how functools.partial works in a nutshell.

Now, where have I used this in real life?

The most common place I have used it is in Flask. I have built Flask apps where I need to dynamically keep my Bokeh version synced up between the Python and JS libraries that get called. To ensure that my HTML templates have a consistent Bokeh version, I use the following pattern:

from bokeh import __version__ as bkversion
from flask import render_template, Flask
from functools import partial 

render_template = partial(render_template, bkversion=bkversion)

# Flask app boilerplate
app = Flask(__name__)


@app.route('/')
def home():
    return render_template('index.html.j2')

Now, because I always have bkversion pre-specified in render_template, I never have to repeat it in every render_template function call.
