How to be a great code sprinter

written by Eric J. Ma on 2019-07-29

software development, sprint, code sprint, open source, community, community development

This blog post is the second in a series of two on participating in code sprints. The first one is here. In this post, I will write about how sprinters themselves can help contribute to a positive sprint experience all around.

Read the docs to understand the scope of the project

As sprinters, we often have preconceived notions about what a project is about. It helps to have an accurate view of what a project is and isn’t about. This is oftentimes best accomplished by reading the documentation of that project, assuming the docs are well-written. Doing so can help you better align what you think should be done on the project with what the package maintainer sees as priorities for the project.

Be ready to make documentation contributions of any scale

Documentation is oftentimes the hardest thing for a package maintainer to write, because it often entails slowing down to a beginner’s speed (an unnatural speed at this point), while knowing one’s own blind spots on where a beginner would stumble (also challenging to do).

If you are a newcomer sprinter, focusing on the sections of the docs that pertain to “processes” (e.g. getting a development environment set up), slowly working through them, and documenting what’s missing can go a long way toward helping other newcomers get set up as well. Anything that the maintainer leaves out may need to be made explicit, and you can help make it clear!

Package maintainers, and prior contributors, are human. That means errors in language may inadvertently have crept into the package. Any small patch that fixes the docs, even one that only corrects a small typographical error, can be very helpful in improving documentation quality.

Don’t be afraid to ask questions...

You will find that asking questions can really accelerate your own progress on the project. This is important for getting unstuck, wherever you might be stuck.

...but also try to keep your questions to the non-obvious things.

That said, asking too-simple questions that can be answered by a Google query is likely to steal time and attention away from other sprinters who might have more substantial questions on hand.

A pet peeve of mine is asking questions that can be answered in the docs. Asking these questions of the maintainer doesn’t reflect positively on you, the sprinter. Whether or not you intended it, what often gets communicated to the maintainer is carelessness and a lack of attention to detail; the opposites of both are generally good qualities to possess and project.

There’s a pretty broad balance point between the two, so don’t feel inhibited by fear of not hitting a precise balance between looking for docs and asking questions.

For any feature requests, try to be ready with a proposed implementation

This one I find very important. Having a proposed implementation on hand for a thing that you think should be in the library goes a long way to helping the package maintainer (or other contributors) see what exactly you’re trying to accomplish with that feature. Having a sketch on-hand makes it much easier for the package maintainer to say “yes” to the new feature, and having written the documentation and a proposed suite of tests for that new feature makes it even easier.

If you aren’t able to propose an implementation, then raising an inquiry rather than a request makes a world of difference in how a package maintainer perceives the communication of the issue at hand.

As an example:

The latter are more thoughtful, and communicate much less of a sense of entitlement on the part of the sprinter.

We’re all building mental maps of each other’s knowledge

When two colleagues meet for the first time, they have to build a mental model of each other’s strengths. At a sprint, the package maintainer has to multiply this by however many people are sprinting.

If they are making an effort to map your skills against theirs, they may be very verbose, asking lots of questions to clarify what you do and don’t know. It pays to be patient here.

If they don’t have the bandwidth to do so (and this is a charitable description for some maintainers), then they may be glossing over detail. Rather than being stuck, it pays to interrupt them gently and clarify. (Taking notes is a very good way of communicating that you’re treating this process seriously too!)

Give your sprint leader sufficient context

As mentioned above, the sprint leader will oftentimes be context switching from person to person. It’s mentally exhausting, so spoon-feeding a bit more context (such as the thing you’re working on), and condensing your question to the essentials and asking it very precisely can go a long way to helping your sprint leader help you better.

Did you enjoy this blog post? Let's discuss more!


PyViz Panel Apps

written by Eric J. Ma on 2019-07-26

data science, data products, app deployment

I finally learned how to build and serve apps with Panel!

Here are the key ideas:

  1. Prototype the app inside a Jupyter notebook. That gives you real-time feedback on whether your apps/widgets are working or not.
  2. The most important thing is that the final thing you package together is a .servable() object.
  3. Use Panel’s serve command to test the app locally. It’s quite magical - the serve command can actually parse a Jupyter notebook and serve it up on a local web server.
  4. When you’ve confirmed that everything is working properly locally, Heroku is a great deployment option. Using the default Python buildpack and a requirements.txt file, one can easily specify the exact Python environment for deployment.

As a pedagogical implementation, I put up a minimal panel app on GitHub, and also served it up on Heroku. Come check it out! I hope it’s useful for you.



T-distributed likelihoods are kind of neat

written by Eric J. Ma on 2019-07-23

data science, statistics, distributions

The Student’s T distribution is a generalization of the Gaussian and Cauchy distributions. How so? Through its “degrees of freedom” ($df$) parameter.

If we plot the probability density functions of the T distribution with varying degrees of freedom, and compare them to the Cauchy and Gaussian distributions, we get the following:

Student T distributions with varying degrees of freedom.

Notice that when $df=1$, the T distribution is identical to the Cauchy distribution, and that as $df$ increases, it gradually becomes more and more like the Normal distribution. At $df=30$, we can consider it to be approximately Gaussian.
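
For readers who want to check this numerically rather than visually, here is a small sketch that uses only the Python standard library. It writes out the standard T density by hand and measures the largest gap to the Cauchy and Gaussian densities on a grid; the helper names (`t_pdf`, `max_gap`) and the grid from -5 to 5 are my own choices for illustration, not something from the plot above.

```python
import math


def t_pdf(x, df):
    """Density of the standard Student's T distribution with `df` degrees of freedom."""
    # Work in log space so the gamma functions do not overflow for large df.
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def cauchy_pdf(x):
    """Density of the standard Cauchy distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))


def normal_pdf(x):
    """Density of the standard Gaussian distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Evaluation grid from -5 to 5 in steps of 0.1.
grid = [i / 10 for i in range(-50, 51)]


def max_gap(df, reference_pdf):
    """Largest absolute difference between the T density and a reference density."""
    return max(abs(t_pdf(x, df) - reference_pdf(x)) for x in grid)


for df in (1, 2, 5, 30):
    print(
        f"df={df:>2}  "
        f"gap to Cauchy: {max_gap(df, cauchy_pdf):.4f}  "
        f"gap to Gaussian: {max_gap(df, normal_pdf):.4f}"
    )
```

The gap to the Cauchy density should be zero (up to floating point error) at $df=1$, and the gap to the Gaussian density should shrink as $df$ grows.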

On its own, this is already quite useful; when placed in the context of a hierarchical Bayesian model, that’s when it gets even more interesting! In a hierarchical Bayesian model, we are using samples to estimate group-level parameters, but constraining group parameters to vary mostly like each other, unless evidence in the data suggests otherwise. If we allow the $df$ parameter to vary, then if some groups look more Cauchy while other groups look more Gaussian, this can be flexibly captured in the model.



How to lead a great code sprint

written by Eric J. Ma on 2019-07-21

software development, sprint, code sprint, open source, community, community development

This blog post is the first in a series of two blog posts on participating in code sprints, and is the culmination of two other blog posts I’ve written on leading a sprint. In this post, I’ll be writing from the perspective of what a sprint participant might appreciate from a sprint leader.

Write good docs

Documentation scales you, the package maintainer. Good docs let others get going without needing your intervention, while bad docs create more confusion. Write good docs ahead-of-time on:

Require documentation as a first contribution

This may not necessarily apply to all projects, but for small-ish enough projects, this might be highly relevant. Requiring documentation contributions as the first contribution has a few nice side effects for newcomer contributors:

  1. This enforces familiarity with the project before making contributions.
  2. It’s a very egalitarian way to kick off the sprints, reducing the probability of sprinter anxiety from falling behind.
  3. This reduces the burden of new contributions for first-time sprinters: docs do not break code!
  4. Apart from being non-intimidating, it can sometimes give rise to repetitive tasks with which newcomer sprinters can practice the git workflow.

Make clear who should sprint with your project

Mismatched expectations breed frustration; hence, making clear what pre-requisite knowledge participants should have can go a long way to reducing frustrations later on.

Drawing on my pyjanitor experience, I would want participants to at the minimum be one of the following:

  1. pandas users who have frustrations with the library, and would like to make contributions, or
  2. Individuals who wish to make a documentation contribution and don’t mind doing a fine-toothed pass over the docs to figure out what is unclear in the docs.

Defining your so-called “sprint audience” ahead-of-time can go a long way to making the sprint productive and friendly.

Communicate priorities of the project

Sprinters want to know that their contributions are going to be valued. Though it is easy to say, “come talk with me before you embark on a task”, the reality is that you, the sprint lead, are likely going to be extremely overbooked. One way to get around this, I think, is to have a high-level list of priorities for the sprint, which can help sprinters better strategize which tasks to tackle. Communicate this on a whiteboard, large sticky notes, or an online wiki page that you can direct people to.

Have a publicly-viewable “file lock”

Merge conflicts will inevitably show up if multiple people are contributing to the same file simultaneously. It helps to have a publicly-viewable “file lock” on, say, a whiteboard or large sticky notes, so that we know who is working on what file. This helps prevent you, the sprint lead, from accidentally getting two people to work on the same file, and then having to resolve merge conflicts later.

In the pyjanitor sprints, I frequently approved two people working on the same notebook; resolving merge conflicts in the notebook JSON later proved to be a big pain! This was a lesson learned the hard way.

Encourage contribution of life-like examples

If there are new package users in the crowd who want to get familiar with the package, then encouraging them to contribute life-like examples is a great way to have them make a contribution! This has some nice side effects:

  1. In creating the example, they may find limitations in the package that could form the substrate of future contributions.
  2. By using the library in the creation of an example, they become users of the project themselves.

Celebrate every contribution

This one is particularly important for first-time contributors. Oftentimes, they have never done standard Gitflow, and that is intimidating enough. So it doesn’t matter if the contribution is nothing more than deleting an unnecessary plural s or correcting a broken URL. We should celebrate that contribution, because they have now learned how to make a contribution (regardless of type), and can repeat the unfamiliar Gitflow pattern until they have committed it to muscle memory.

At the SciPy sprints, for the pyjanitor project, once a contributor’s PR finished building and passed all checks, I brought my iPad over to their table to let them hit the Big Green Button on GitHub. This is one touch I am quite confident our sprinters loved!

Recognize and assign non-code tasks

While code contributions are useful, I think a great way to encourage sprinters to help out would be to have them help with non-code contributions. A few examples include:

This is because you, the sprint lead, might have your hands full helping beginners, and so having as much help as possible is extremely helpful to you. Be sure, of course, to acknowledge them in some public way that expresses your appreciation of their effort, because they might not necessarily get into the commit record (which, by the way, is not the only way to keep track of contributions).

Stay humble and calm

In open source software development, it is hard to find contributors who are willing to sustain an effort, and so any contributions are generally welcome (barring those that are clearly out of scope). Hence, as far as it is humanly possible, I would be inclined to express appreciation for contributors’ contributions.

One of the PyMC maintainers, Colin Carroll, said something about a contribution I wanted to make that has stuck with me. The gist of it was as follows:

It’s a contribution from someone who is willing, and I’d take that any day.

So yes, even though we may see our project as providing an opportunity for newcomers to contribute, the fact that they are willing to contribute is an even more important thing to recognize! Gratitude makes more sense than entitlement here.

Staying calm is also important. It’s easy to get irritated because of all of the context switching that happens. Leverage the help you can get from your sprint co-leads to help shoulder the load. If you take good care of your mental state, you can help make the sprints fun and productive for others.



SciPy 2019 Post-Conference

written by Eric J. Ma on 2019-07-15

conference, scipy2019, data science

It’s my last day in Austin, TX, having finished a long week of conferencing at SciPy 2019. This trip was very fruitful and productive! At the same time, I’m ready for a quieter change - meeting and talking with people does drain my brain, and I have a mildly strong preference for quiet time over interaction time.

Tutorials

I participated in the tutorials as an instructor for three tutorials, which I think have become my “data science toolkit”: Bayesian statistical modeling, network analysis, and deep learning.

Of the three, the one I had the most fun teaching was the deep learning one. The goal of that tutorial was to peel back a layer behind the frameworks and see what’s going on. To reinforce this and make it all concrete, we live coded a deep learning framework prototype, and it worked! (I didn’t plan for it, and so I was quite nervous while doing it, but we pulled it off as a class, and I think it reinforced the point about revealing what goes on underneath a framework.)

I also had a lot of fun teaching the Bayesian statistical modeling tutorial, which I had co-created with Hugo Bowne-Anderson, and as always, my personal “evergreen” tutorial on Network Analysis brought me joy, especially when we reached the end and talked about graphs and matrices. I think the material connecting linear algebra to graph concepts is one that the crowd enjoys, and I might emphasize it more going forward at the SciPy tutorials.

Talks

This year, I delivered a talk on pyjanitor. Excluding lightning talks, this is probably the first time I’ve started my slides one day before having to deliver them (yikes!). Granted, I’ve had the outline in my head for a long time now; I guess having to deliver the talk was a good impetus to actually get it done.

Apart from that, there’s a rich selection of talks at SciPy from which I think we can pick some to screen at work over lunches (Data Science YouTube). I particularly like the talk on Optuna, a framework for hyperparameter optimization, and I think I’ll be using this tool going forward.

Sprints

I did a sprint on pyjanitor with my colleague Zach Barry. This sprint, we had 20+ sprinters join us, the vast majority of them being first-time sprinters.

One thing that stuck with me, this time round, is how even first-timers have different degrees of experience. Some know git while most others don’t; most don’t have any prior experience with Gitflow. I had an interaction that led me to realize it’s very important to state meaningfully what “beginner” means in concrete terms. For example, a “beginner” pyjanitor contributor is probably a pandas user, may or may not have used git before, and probably doesn’t know Gitflow. A common prerequisite quality amongst contributors would probably be that they would have the patience to

  1. Read the documentation,
  2. Attempt at least one pass digesting the documentation, and
  3. Ask questions regarding the intent behind something before asking for a change.

In terms of the things accomplished at this sprint, contributions mainly revolved around:

In addition to pyjanitor sprinting, special thanks go to Felipe Fernandes, who helped me get jax up onto conda-forge! SciPy is really the place where we can get to meet people and get things done.

Career advice learned

While at SciPy, I had a chance to talk with Eric Jones, CEO of Enthought. Having described my current role at work, he mentioned how having a team like the one I’m on parked inside IT gives us a unique position to connect data science work across the organization to the consumers of our data products. When I raised to him my frustrations regarding our infatuation with vendors when FOSS alternatives clearly exist, his advice in return was essentially this:

Focus on leveling up your colleagues’ skills and knowledge, keep pushing the education piece at work, and don’t worry about the money that gets spent on tooling.

Having thought about this, I agree. Over time, we should let the results speak. At the same time, I want to help create the environment that I would like to work in: where my colleagues use the same tooling stack, are hacker-types, aren’t afraid to dig deep into the “computer stuff” and into the biology/chemistry, and have the necessary skill + desire to design machine learning systems to systematically accelerate discovery science.
