T-distributed likelihoods are kind of neat

written by Eric J. Ma on 2019-07-23

data science statistics distributions

The Student’s T distribution is a generalization of both the Gaussian and Cauchy distributions. How so? Through its “degrees of freedom” ($df$) parameter.

If we plot the probability density functions of the T distribution with varying degrees of freedom, and compare them to the Cauchy and Gaussian distributions, we get the following:

Student T distributions with varying degrees of freedom.

Notice that when $df=1$, the T distribution is identical to the Cauchy distribution, and that as $df$ increases, it gradually becomes more and more like the Normal distribution. At $df=30$, it is commonly treated as approximately Gaussian.
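To see this concretely, here is a minimal plotting sketch using scipy.stats; the $df$ values are illustrative choices, not canonical ones:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-6, 6, 500)

fig, ax = plt.subplots()
# The two "endpoints" of the family.
ax.plot(x, stats.cauchy.pdf(x), "k--", label="Cauchy")
ax.plot(x, stats.norm.pdf(x), "k:", label="Gaussian")

# Student's T with increasing degrees of freedom.
for df in [1, 2, 5, 30]:
    ax.plot(x, stats.t.pdf(x, df=df), label=f"T, df={df}")

ax.set_xlabel("x")
ax.set_ylabel("probability density")
ax.legend()
plt.show()
```

At $df=1$, the T curve sits exactly on top of the Cauchy curve, and by $df=30$ it is visually indistinguishable from the Gaussian.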

On its own, this is already quite useful; placed in the context of a hierarchical Bayesian model, it gets even more interesting! In a hierarchical Bayesian model, we use samples to estimate group-level parameters, while constraining the group parameters to vary mostly like one another, unless evidence in the data suggests otherwise. If we additionally allow the $df$ parameter to vary, then groups that look more Cauchy and groups that look more Gaussian can both be flexibly captured in the model.
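As a sketch of what this might look like in PyMC3 (v3.7-era API; the data, priors, and group structure below are illustrative assumptions, not a recommended recipe):

```python
import numpy as np
import pymc3 as pm

# Illustrative data: 20 noisy measurements from each of 5 groups.
n_groups = 5
group_idx = np.repeat(np.arange(n_groups), 20)
data = np.random.normal(loc=0.0, scale=1.0, size=100)

with pm.Model() as model:
    # Population-level priors constrain groups to vary like each other.
    mu_pop = pm.Normal("mu_pop", mu=0.0, sigma=1.0)
    sigma_pop = pm.HalfNormal("sigma_pop", sigma=1.0)

    # Group-level locations and scales.
    mu = pm.Normal("mu", mu=mu_pop, sigma=sigma_pop, shape=n_groups)
    sigma = pm.HalfNormal("sigma", sigma=1.0, shape=n_groups)

    # Per-group degrees of freedom: small nu lets a group look
    # Cauchy-like, large nu lets it look Gaussian-like.
    nu = pm.Exponential("nu_minus_one", lam=1 / 29.0, shape=n_groups) + 1

    obs = pm.StudentT(
        "obs",
        nu=nu[group_idx],
        mu=mu[group_idx],
        sigma=sigma[group_idx],
        observed=data,
    )
    trace = pm.sample()
```

The Exponential-plus-one prior on the degrees of freedom is a common idiom (it appears in PyMC3’s BEST model example) that keeps $df$ above 1 while still allowing heavy tails.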



How to lead a great code sprint

written by Eric J. Ma on 2019-07-21

software development sprint code sprint open source community community development

This blog post is the first in a series of two on participating in code sprints, and is the culmination of two other blog posts I’ve written on leading a sprint. In this post, I write from the perspective of what a sprint participant might appreciate from a sprint leader.

Write good docs

Documentation scales you, the package maintainer. Good docs let others get going without needing your intervention, while bad docs create more confusion. Write good docs ahead-of-time on:

Require documentation as a first contribution

This may not apply to all projects, but for small-ish projects, it can be highly relevant. Requiring documentation contributions as the first contribution has a few nice side effects for newcomer contributors:

  1. This enforces familiarity with the project before making contributions.
  2. It’s a very egalitarian way to kick off the sprints, reducing the chance that sprinters feel anxious about falling behind.
  3. This reduces the burden of new contributions for first-time sprinters: docs do not break code!
  4. Apart from being non-intimidating, it can sometimes give rise to repetitive tasks with which newcomers can practice their git workflow.

Make clear who should sprint with your project

Mismatched expectations breed frustration; hence, making clear what prerequisite knowledge participants should have can go a long way to reducing frustrations later on.

Drawing on my pyjanitor experience, I would want participants to, at a minimum, be one of the following:

  1. pandas users who have frustrations with the library, and would like to make contributions, or
  2. individuals who wish to make a documentation contribution and don’t mind doing a fine-toothed pass over the docs to figure out what is unclear.

Defining your so-called “sprint audience” ahead-of-time can go a long way to making the sprint productive and friendly.

Communicate priorities of the project

Sprinters want to know that their contributions are going to be valued. Though it is easy to say, “come talk with me before you embark on a task”, the reality is that you, the sprint lead, are likely going to be extremely overbooked. One way around this, I think, is to have a high-level list of priorities for the sprint, which can help sprinters better strategize which tasks to tackle. Communicate this on a whiteboard, on large sticky notes, or on an online wiki page that you can direct people to.

Have a publicly-viewable “file lock”

Merge conflicts will inevitably show up if multiple people contribute to the same file simultaneously. It helps to have a publicly-viewable “file lock” on, say, a whiteboard or large sticky notes, so that everyone knows who is working on what file. This prevents you, the sprint lead, from accidentally assigning two people to the same file and having to resolve merge conflicts later.

In the pyjanitor sprints, I frequently approved two people working on the same notebook; resolving merge conflicts in the notebook JSON later proved to be a big pain! This was a lesson learned the hard way.

Encourage contribution of life-like examples

If there are new package users in the crowd who want to get familiar with the package, then encouraging them to contribute life-like examples is a great way to have them make a contribution! This has some nice side effects:

  1. In creating the example, they may find limitations in the package that could form the substrate of future contributions.
  2. By using the library in the creation of an example, they become users of the project themselves.

Celebrate every contribution

This one is particularly important for first-time contributors. Oftentimes, they have never done standard Gitflow, and that is intimidating enough. So it doesn’t matter if the contribution is nothing more than deleting an unnecessary plural s or correcting a broken URL. We should celebrate that contribution, because they have now learned how to make a contribution (regardless of type), and can repeat the unfamiliar Gitflow pattern until it becomes muscle memory.

At the SciPy sprints, for the pyjanitor project, once a contributor’s PR finished building and passed all checks, I brought my iPad over to their table to let them hit the Big Green Button on GitHub. This is one touch I am quite confident our sprinters loved!

Recognize and assign non-code tasks

While code contributions are useful, a great way to engage sprinters who aren’t writing code is to have them help with non-code contributions. A few examples include:

This is because you, the sprint lead, might have your hands full helping beginners, and so having as much help as possible is extremely valuable. Be sure, of course, to acknowledge these contributors in some public way that expresses your appreciation of their effort, because their work might not necessarily show up in the commit record (which, by the way, is not the only way to keep track of contributions).

Stay humble and calm

In open source software development, it is hard to find contributors who are willing to sustain an effort, and so any contributions are generally welcome (barring those that are clearly out of scope). Hence, as far as it is humanly possible, I would be inclined to express appreciation for contributors’ contributions.

One of the PyMC maintainers, Colin Carroll, said something about a contribution I wanted to make, and it stuck with me. The gist of it was as follows:

It’s a contribution from someone who is willing, and I’d take that any day.

So yes, even though we may see our project as providing an opportunity for newcomers to contribute, the fact that they are willing to contribute is all the more important to recognize! Gratitude makes more sense than entitlement here.

Staying calm is also important. It’s easy to get irritated because of all of the context switching that happens. Leverage the help you can get from your sprint co-leads to help shoulder the load. If you take good care of your mental state, you can help make the sprints fun and productive for others.



SciPy 2019 Post-Conference

written by Eric J. Ma on 2019-07-15

conference scipy2019 data science

It’s my last day in Austin, TX, having finished a long week of conferencing at SciPy 2019. This trip was very fruitful and productive! At the same time, I’m ready for a quieter change - meeting and talking with people drains my brain, and I have a mildly strong preference for quiet time over interaction time.

Tutorials

I participated in the tutorials as an instructor for three of them, which I think have become my “data science toolkit”: Bayesian statistical modeling, network analysis, and deep learning.

Of the three, the one I had the most fun teaching was the deep learning one. The goal of that tutorial was to peel back a layer behind the frameworks and see what’s going on underneath. To reinforce this and make it all concrete, we live-coded a deep learning framework prototype, and it worked! (I didn’t plan for it, and so I was quite nervous while doing it, but we pulled it off as a class, and I think it reinforced the point about revealing what goes on underneath a framework.)

I also had a lot of fun teaching the Bayesian statistical modeling tutorial, which I co-created with Hugo Bowne-Anderson. As always, my personal “evergreen” tutorial on Network Analysis brings me joy, especially when we reach the end and talk about graphs and matrices. I think the material connecting linear algebra to graph concepts is something the crowd enjoys, and I might emphasize it more going forward at the SciPy tutorials.

Talks

This year, I delivered a talk on pyjanitor. Excluding lightning talks, this is probably the first time I’ve started my slides only one day before having to deliver them (yikes!). Granted, I’d had the outline in my head for a long time; having to do the talk was good impetus to actually get it done.

Apart from that, there’s a rich selection of talks at SciPy that I think we can screen at work over lunches (Data Science YouTube). I particularly liked the talk on Optuna, a framework for hyperparameter optimization, and I think I’ll be using this tool going forward.

Sprints

I did a sprint on pyjanitor with my colleague Zach Barry. This sprint, we had 20+ sprinters join us, the vast majority of them first-time sprinters.

One thing that stuck with me this time round is how even first-timers have different degrees of experience. Some know git, while most others don’t; most don’t have any prior experience with Gitflow. An interaction I had led me to realize that it’s very important to state in concrete terms what “beginner” means. For example, a “beginner” pyjanitor contributor is probably a pandas user, may or may not have used git before, and probably doesn’t know Gitflow. A common prerequisite amongst contributors would be the patience to

  1. Read the documentation,
  2. Attempt at least one pass digesting the documentation, and
  3. Ask questions regarding the intent behind something before asking for a change.

In terms of the things accomplished at this sprint, contributions mainly revolved around:

In addition to pyjanitor sprinting, special thanks go to Felipe Fernandes, who helped me get jax up onto conda-forge! SciPy is really the place where we get to meet people and get things done.

Career advice learned

While at SciPy, I had a chance to talk with Eric Jones, CEO of Enthought. After I described my current role at work, he mentioned how having a team like the one I’m on parked inside IT gives us a unique position to connect data science work across the organization to the consumers of our data products. When I raised my frustrations regarding our infatuation with vendors when FOSS alternatives clearly exist, his advice in return was essentially this:

Focus on leveling up your colleagues’ skills and knowledge, keep pushing the education piece at work, and don’t worry about the money that gets spent on tooling.

Having thought about this, I agree. Over time, we should let the results speak. At the same time, I want to help create the environment that I would like to work in: where my colleagues use the same tooling stack, are hacker-types, aren’t afraid to dig deep into the “computer stuff” and into the biology/chemistry, and have the necessary skill + desire to design machine learning systems to systematically accelerate discovery science.



Order of magnitude is more than accurate enough

written by Eric J. Ma on 2019-07-07

data science estimation statistics

When I was in Science One at UBC in 2006, our Physics professor, Mark Halpern, made a quotable statement that has stuck with me for many years.

Order of magnitude is more than accurate enough.

At the time, that statement rocked the class, myself included. We were classically taught that significant digits are significant, and that we have to keep track of them. But Mark’s quote seemed to throw all of that caution and precision in Physics to the wind. Did what we learned in Physics lab class not matter?

Turns out, there was one highly instructive activity that still hasn’t left my mind. During a recitation, we were asked to estimate how many days the city of Vancouver could be powered if we took a piece of chalk and converted its entire mass into energy. This clearly required estimating the mass of a piece of chalk and Vancouver’s daily energy consumption, neither of which we had any way of knowing accurately.

Regardless, I took it upon myself to carry significant digits in our calculation, while my recitation partner, Charles Au, was fully convinced that this wasn’t necessary, and so did all calculations order-of-magnitude. We debated and agreed upon what assumptions we needed to arrive at a solution, and then proceeded to do the same calculations, one with significant digits, the other without.

We reached the same conclusion.

More precisely, I remember obtaining a result along the lines of $6.2 \cdot 10^3$ days, while Charles obtained $10^4$ days. On an order-of-magnitude basis, the two are more or less equivalent.
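For fun, here is a back-of-the-envelope version of that calculation in Python. Every input below is an assumption for illustration (I don’t remember the exact numbers we used in 2006), so treat the result as order-of-magnitude only:

```python
# Days Vancouver could be powered by a piece of chalk
# converted entirely to energy via E = m * c^2.
# All inputs are assumptions for illustration.

c = 3e8            # speed of light, m/s (order-of-magnitude value)
chalk_mass = 1e-2  # kg; roughly one piece of chalk

energy = chalk_mass * c ** 2  # joules; ~9e14 J

# Assumed daily energy consumption of Vancouver, in joules.
# A guess: the exact figure doesn't matter at this precision.
vancouver_daily = 1e11

days = energy / vancouver_daily
print(f"{days:.1e} days")  # ~9.0e3 days, i.e. order of magnitude 10^4
```

Whether you carry the significant digits or round everything to powers of ten, you land in the same ballpark.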

In retrospect, I shouldn’t have been so surprised. Mark is an astrophysicist, and at that scale, 1 or 2 significant digits might not carry the most importance; rather, getting into the right ballpark might be more important. At the same time, the recitation activity was a powerful first-hand experience of that last point: getting into the right ballpark first.

At the same time, I was also missing a second perspective, which then explains my surprise at Mark’s quote. Now that I’ve gone the route of more statistics-oriented work, I see a similar theme showing up. John Tukey said something along these lines:

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

The connection to order of magnitude estimates should be quite clear here. If we’re correct to within an order of magnitude on the right question, we can always refine the answer further. If we’re precisely answering the wrong question, God help us.

What does this mean for a data scientist? For one, it means that approximate methods are usually good enough, practically speaking, to get ourselves into the right ballpark; we can use pragmatic considerations to decide whether we need a more complicated model or not. It also means that when we’re building data pipelines, minimum viable products, which help us test whether we’re answering the right question, matter more than the fanciest deep learning model.

So yes, to mash those two quotes together:

Order of magnitude estimates on the right question are more useful than precise quantifications on the wrong question.



SciPy 2019 Pre-Conference

written by Eric J. Ma on 2019-07-07

conferences scipy2019

For the 6th year running, I’m out at UT Austin for SciPy 2019! It’s one of my favorite conferences to attend, because the latest in data science tooling is well featured in the conference program, and I get to meet in-person a lot of the GitHub usernames that I interact with online.

I will be involved in three tutorials this year, which I think have become my data science toolkit: Bayesian stats, network science, and deep learning. I’m really excited to share my knowledge; my hope is that at least a few more people find the practical experience I’ve gained over the years useful, and that they can put it to good use in their own work too. This year is also the first year I’ve submitted a talk on pyjanitor, a package that I have developed with others for cleaning data; I’m excited to share this with the broader SciPy community!

I’m also looking forward to meeting the conference scholarship recipients. Scott Collis, Celia Cintas, and I have been managing the FinAid process for the past three years, and each year it heartens me to see the scholarship recipients in person.

Finally, this year’s SciPy is quite unique for me, as it is the first year that I’ll be here with colleagues at work! (In prior years, I came alone, and did networking on my own.) I hope they all have as much of a fun time as I have at SciPy!
