PyCon 2019 Sprints

written by Eric J. Ma on 2019-05-08

pycon software development sprint open source

This year was the first year that I decided to lead a sprint! The sprint I led was for pyjanitor, a package that I developed with my colleague, Zach Barry, and a remote collaborator in NYC, Sam Zuckerman (whom I've never met in person!). This being the first sprint I've ever led, I think I was lucky to stumble upon a few ideas that made for a productive, collaborative, and most importantly, fun sprint.


I'm going to deliver a talk on pyjanitor later in the year, so I'll save the details for that talk. The short description of pyjanitor is that if you have pandas one-liners that are difficult to remember, they should become a function in pyjanitor; if you have a 10-liner that you always copy/paste from another source, they should become a function in pyjanitor.


Code sprints are a part of PyCon, and it's basically one to four days of intense and focused software development on a single project.

Project sprint leads first pitch their projects at the "Sprintros", where they indicate what days they will be sprinting, and at what times. The next day, we indicate on which rooms our projects will be sprinting. Volunteers from the conference, who would like to make an open source project contribution, then identify projects they would like to come sprint with.

In some senses, there's no way for a sprint leader to know how popular their sprint will be a priori. We have to be prepared to handle a range of scenarios from sprinting with just one other person to sprinting with a crowd.



In preparation for the sprint, I absorbed many lessons learned over the years of sprinting on others' projects.

The most obvious one was to ensure that every sprinter had something to do right from the get-go. Having a task from the get-go keeps sprinters, especially newcomers, engaged right from the beginning. This motivated the requirement to make a doc fix before making a code fix. (You can read more below on how we made this happen.) I wrote out this requirement in a number of places, and by the time the sprint day rolled by, this rolled off pretty naturally.

The second thing that I did to prep was to triage existing issues and label them as being beginner- or intermediate-friendly, and whether they were doc, infrastructure, or code contributions.

Those two things were the my highest priority preparation for the sprint, and I think that helped a ton.

Doc Fixes

This sprint, I gave the structure some thought, and settled on the following: Before making a code contribution, I required a docs contribution.

Docs contributions could be of any scale:

I think this worked well for the following reasons:

  1. New contributors must read the docs before developing on the project, and hence become familiar with the project.
  2. There's always something that can be done better in the docs, and hence, there is something that can be immediately acted on.
  3. The task is a pain point personally uncovered by the contributor, and hence the contributor has the full context of the problem.
  4. The docs don't break the code/tests, and hence doc contributions are a great way make a contribution without wrestling with more complex testing.
  5. Starting everybody who has never worked on pyjanitor on docs is an egalitarian way of on-boarding every newcomer, beginner and experienced individuals alike. Nobody gets special treatment.

For each individual's contribution, I asked them to first raise an issue on the GitHub issue tracker describing the contribution that they would like to make, and then clearly indicate in the comments that they would like to work on it. Then, they would go through the process of doing the documentation fix, from forking the repository, cloning it locally, creating a new branch, making edits, committing, pushing, and PR-ing.

If two people accidentally ended up working on the same docs issue, I would assess the effort of the later one, and if it was substantial enough, I would allow them to consider it done, and move onto a different issue.

Going forth, as the group of contributors expands, I will enforce this "docs-first" requirement only for newcomer sprinters, and request experienced ones to help manage the process.

Code Contributions

Once the docs contributions were done, sprinters were free to either continue with more docs contributions, or provide a code contribution.

Code contributions could be of one of the following:

  1. New function contributions.
  2. Cleaner implementations of existing functions.
  3. Restructuring of existing functions.
  4. Identification of functions to deprecate (very important!)
  5. Infrastructural changes to docs, build system, and more.

The process for doing this was identical to docs: raise an issue, claim it, and then make a new branch with edits, and finally PR it.

My Role

For both days, we had more than 10 people sprint on pyjanitor. Many who were present on the 1st day (and didn't have a flight to catch) came back on the 2nd day. As such, I actually didn't get much coding done. Instead, I took on the following roles:

  1. Q&A person: Answering questions about what would be acceptable contributions, technical (read: git) issues, and more.
  2. Issue labeller and triage-r: I spent the bulk of my downtime (and pre-/post-sprint time) tagging issues on the GitHub issue tracker and marking them as being available or unavailable for hacking on, and tagging them with whether they were docs-related, infrastructure-related, or code enhancements.
  3. Code reviewer: As PRs came in, I would conduct code reviews on each of them, and would discuss with them where to adjust the code to adhere to code style standards.
  4. Continuous integration pipeline babysitter: Because I had just switched us off from Travis CI to Azure Pipelines, I was babysitting the pipelines to make sure nothing went wrong. (Spoiler: something did!)
  5. Green button pusher: Once all tests passed, I would hit the big green button to merge PRs!

If I get to sprint with other experienced contributors at the sprints, I would definitely like to have some help with the above.


Making sprints human-friendly

I tried out a few ideas, which I hope made the sprints just that little bit more human-friendly.

  1. Used large Post-It easel pads to write out commonly-used commands at the terminal.
  2. Displayed claimed seating arrangements at the morning, and more importantly, get to know every sprinter's name.
  3. Announcing every PR to the group that was merged and what the content was, followed by a round of applause.
  4. Setting a timer for 5 minutes before lunch so that they could all get ahead in the line.
  5. I used staff privileges to move the box of leftover bananas into our sprint room. :)

I think the applause is the most encouraging part of the process. Having struggled through a PR, however big or small, and having group recognition for that effort, is super encouraging, especially for first-time contributors. I think we need to encourage this more at sprints.

Relinquishing control

The only things about the pyjanitor project that I'm unwilling to give up on are: good documentation of what a function does, that a function should do one thing well, and that it be method-chainable. Everything else, including functionality, is an open discussion that we can have!

One PR I particularly enjoyed was that from Lucas, who PR'd in a logo for the project on the docs page. He had the idea to take the hacky broomstick I drew on the big sticky note (as a makeshift logo), redraw it as a vector graphic in Inkscape, and PR it in as the (current) official logo on the docs.

More broadly, I deferred to the sprinters' opinions on docs, because I recognized that I'd have old eyes on the docs, and wouldn't be able to easily identify places where the docs could be written more clearly. Eventually, a small, self-organizing squad of 3-5 sprinters ended up becoming the unofficial docs squad, rearranging the structure of the docs, building automation around it, and better organizing and summarizing the information on the docs.

In more than a few places, if there were a well-justified choice for the API (which really meant naming the functions and keyword arguments), I'd be more than happy to see the PR happen. Even if it is evolved away later, the present codebase and PRs that led to it provided the substrate for better evolution of the API!

A new Microsoft

This year, I switched from Travis CI to Azure Pipelines. In particular, I was attracted to the ability to build on all three major operating systems, Windows, macOS, and Linux, on the free tier.

Microsoft had a booth at PyCon, in which Steve Dowell led an initiative to get us set up with Azure-related tools. Indeed, as a major sponsor of the conference, this was one of the best swag given to us. Super practical, relationship- and goodwill-building. Definitely not lesser than the Adafruit lunchboxes with electronics as swag!


Naturally, not everything was smooth sailing throughout. I did find myself a tad expressing myself in an irate fashion at times with the amount of context switching that I was doing, especially switching between talking to different sprinters one after another. (I am very used to long stretches hacking on one problem.) One thing future sprinters could help with, which I will document, is to give me enough ramp-up context around their problem, so that I can quickly pinpoint what other information I might need.

The other not-so-smooth-sailing thing was finding out that Azure sometimes did not catch errors in a script block! My unproven hypothesis at this point is that if I have four commands executed in a script block, and if any of the first three fail but the last one passes, the entire script block will behave as if it passes. This probably stems from the build system looking at only the last exit code to determine exit status. Eventually, after splitting each check into individual steps, linting and testing errors started getting caught automatically! (Automating this is much preferred to me running the black code formatter in my head.)

Though the above issue is fixed, I think I am still having issues getting pycodestyle and black to work on the Windows builds. Definitely looking forward to hearing from Azure devs what could be done here!


I'm sure there's ways I could have made the sprint a bit better. I'd love to hear them if there's something I've missed! Please feel free to comment below.

Sprinter Acknowledgement

I would like to thank all the sprinters who joined in this sprint. Their GitHub handles are below:

And as always, big thanks to my collaborators on the repository:

Did you enjoy this blog post? Let's discuss more!

PyCon 2019 Pre-Journey

written by Eric J. Ma on 2019-04-29

pycon python data science conferences

I'm headed out to PyCon 2019! This year, I will be co-instructing two tutorials, one on network analysis and one on Bayesian statistics, and delivering one talk on Bayesian statistics.

The first tutorial on network analysis is based on material that I first developed 5 years ago, and have continually updated. I've enjoyed teaching this tutorial because it represents a different way of thinking about data - in other words, relationally. This year, I will be a co-instructor for Mridul, who has kindly agreed to step up and teach it this year at PyCon. The apprentice has exceeded the master!

The second tutorial on Bayesian statistics is based on material co-developed with Hugo Bowne-Anderson. Hugo is a mathematician by training, a pedagogy master, and data science aficionado. Like myself, he is a fan of Bayesian statistical modelling methods, and we first debuted the tutorial past year at SciPy. We're super excited for this one!

The talk that I will deliver is on Bayesian statistical analysis of case/control tests. In particular, I noticed a content gap in the data science talks, where case/control comparisons were limited to one case and one control. One epiphany I came to was that if we use Bayesian methods to analyze our data, there's no particular reason to limit ourselves to one case and one control; we can flexibly model multiple cases vs. one control, or even multiple cases vs multiple different controls in the same analysis, in a fashion that is flexible and principled.

My final involvement with PyCon this year is as Financial Aid Chair. This is the first year that I'm leading the FinAid effort; during previous years, I had learned a ton from the previous chair Karan Goel. My co-chairs this year are Denise Williams and Jigyasa Grover; I'm looking forward to meeting them in 3D!

All-in-all, I'm looking forward to another fun year at PyCon!

Did you enjoy this blog post? Let's discuss more!

Variance Explained

written by Eric J. Ma on 2019-03-24

data science machine learning

Variance explained, as a regression quality metric, is one that I have begun to like a lot, especially when used in place of a metric like the correlation coefficient (r2).

Here's variance explained defined:

$$1 - \frac{var(y_{true} - y_{pred})}{var(y_{true})}$$

Why do I like it? It’s because this metric gives us a measure of the scale of the error in predictions relative to the scale of the data.

The numerator in the fraction calculates the variance in the errors, in other words, the scale of the errors. The denominator in the fraction calculates the variance in the data, in other words, the scale of the data. By subtracting the fraction from 1, we get a number upper-bounded at 1 (best case), and unbounded towards negative infinity.

Here's a few interesting scenarios.

A thing that is really nice about variance explained is that it can be used to compare related machine learning tasks that have different unit scales, for which we want to compare how good one model performs across all of the tasks. Mean squared error makes this an apples-to-oranges comparison, because the unit scales of each machine learning task is different. On the other hand, variance explained is unit-less.

Now, we know that single metrics can have failure points, as does the coefficient of correlation r^2^, as shown in Ansecombe's quartet and the Datasaurus Dozen:

Ansecombe's quartet, taken from Autodesk Research

Fig. 1: Ansecombe's quartet, taken from Autodesk Research

Datasaurus Dozen, taken from Revolution Analytics

Fig. 2: Datasaurus Dozen, taken from Revolution Analytics

One place where the variance explained can fail is if the predictions are systematically shifted off from the true values. Let's say prediction was shifted off by 2 units.

$$var(y_{true} - y_{pred}) = var([2, 2, ..., 2]) = 0$$

There's no variance in errors, even though they are systematically shifted off from the true prediction. Like r2, variance explained will fail here.

As usual, Ansecombe's quartet, as does The Datasaurus Dozen, gives us a pertinent reminder that visually inspecting your model predictions is always a good thing!

Did you enjoy this blog post? Let's discuss more!

Functools Partial

written by Eric J. Ma on 2019-03-22

python hacks tips and tricks data science productivity coding

If you’ve done Python programming for a while, I think it pays off to know some little tricks that can improve the readability of your code and decrease the amount of repetition that goes on.

One such tool is functools.partial. It took me a few years after my first introduction to partial before I finally understood why it was such a powerful tool.

Essentially, what partial does is it wraps a function and sets a keyword argument to a constant. That’s it. What do we mean?

Here’s a minimal example. Let’s say we have a function f, not written by me, but provided by someone else.

def f(a, b):
    result = # do something with a and b.
    return result

In my code, let’s say that I know that the value that b takes on in my app is always the tuple (1, 'A'). I now have a few options. The most obvious is assign the tuple (1, 'A') to a variable, and pass that in on every function call:

b = (1, 'A')
result1 = f(a=1, b=b)
# do some stuff.
result2 = f(a=15, b=b)
# do more stuff.
# ad nauseum
N = # set value of N
resultN = f(a=N, b=b)

The other way I could do it is use functools.partial and just set the keyword argument b to equal to the tuple directly.

from functools import partial
f_ = partial(f, b=(1, 'A'))

Now, I can repeat the code above, but now only worrying about the keyword argument a:

result1 = f_(a=1)
# do some stuff.
result2 = f_(a=15)
# do more stuff.
# ad nauseum
N = # set value of N
resultN = f_(a=N)

And there you go, that’s basically how functools.partial works in a nutshell.

Now, where have I used this in real life?

The most common place I have used it is in Flask. I have built Flask apps where I need to dynamically keep my Bokeh version synced up between the Python and JS libraries that get called. To ensure that my HTML templates have a consistent Bokeh version, I use the following pattern:

from bokeh import __version__ as bkversion
from flask import render_template, Flask
from functools import partial 

render_template = partial(render_template, bkversion=bkversion)

# Flask app boilerplate
app = Flask(__name__)

def home():
    return render_template('index.html.j2')

Now, because I always have bkversion pre-specified in render_template, I never have to repeat it over every render_template function call.

Did you enjoy this blog post? Let's discuss more!

How I Work

written by Eric J. Ma on 2019-03-20

data science productivity

I was inspired to write this because of Will Wolf’s interview with DeepLearning.AI, in which I found a ton of similarities between how both of us work. As such, I thought I’d write down what I use at work to get things done.


For a data scientist, I think tooling is of very high importance: mastery over our tools keeps us productive. Here’s a sampling of what I use at work:

As you probably can see, I’m a very Python-centric person!

Daily/Weekly Routines

Most of my work necessitates long stretches of thinking and hacking time. Without that, I’m unable to get into “the zone” to do anything productive. Hence, I have a habit of packing meetings onto Mondays (a.k.a. “Meeting Mondays”). Backup times for meetings, which I prefer to not do, are 11 am and 1 pm, bookending lunch time so that I don’t end up with a fragmented morning/afternoon. The only exceptions I make are for my two high-priority team meetings, for which I defer to the rest of the team. I’m glad that my managers understand the need for long stretches of hacking time, and have stuck to Monday one-on-one meetings.

Hence, almost every day from Tuesday through to Friday, I have long stretches of pre-allocated time for hacking. It’s data science scheduling bliss! It also means I turn down a lot of “can I meet you to chat” invites - unless we can pack them on Monday!

On Friday, I make a point to try to work remotely. It helps with sanity, particularly in the winter, when the commute gets harsh and I can’t bike. Fridays also are the days on which I try to do my open source work.

Pair Coding

Pair coding with others on mutual projects has been a very productive endeavor, which I have written about before. Unlike weekly update meetings, I plan for pair coding on an as-needed basis. We have a pre-defined goal for what we want to accomplish, including a conceivably achievable goal and a stretch goal; achieving the easier one keeps us motivated. It follows the “no agenda, no meeting” rule of thumb by which I protect my time.

I found that a good setup is really necessary for pair coding to be successful. A minimum is a dual-monitor setup, with one extra keyboard + mouse for my coding partner.

One thing I didn’t mention in my previous blog post was how knowledge transfer happens. Here’s how I think it works. We have one in the “driver’s seat”, and the other in the observer role. Knowledge transfer generally happens from the more experienced person to the less experienced one, and the driver doesn’t necessarily have to be the more experienced one. For example, when pair coding with my intern, I play the role of observer and may dictate code or outline what needs to be done, but I don’t actively take over on my keyboard unless there’s a situation that shows up that is irrelevant to the coding session goals. On the other hand, if there’s a codebase I’ve developed for which I need to play the tour guide role, I will be in the driver’s seat, while the observer will help me catch peripheral errors that I’m making.

Learning New Things

Pair coding has been one way I learn new things. For example, with my colleague Zach as the observer, we hacked together a simple dashboard project using Flask, Holoviews and Panel.

I’m not very mathematically-savvy, in that algebra is difficult for me to follow. (I’m mildly algebra-blind, but getting better now.) Ironically, code, which is algebraic in nature too, but works with plain English names, works much better for me. Implementing algorithms and statistical methods using jax (for things that involve differential computing) and PyMC3 (for all things Bayesian) has served to be very educational. While implementing, I also impose some software abstractions on the math, and this also forces me to organize my knowledge, which also helps learning. Implementing things on the computer is also the perfect way to learn by teaching: The computer is the ultimately dumb student, as it will execute exactly as you tell it, mistakes included!

Did you enjoy this blog post? Let's discuss more!