Pair Coding: Why and How for Data Scientists

written by Eric J. Ma on 2019-03-01

Tags: data science, programming, best practices

Introduction

While at work, I've been experimenting with pair coding with other data science-oriented colleagues. My experience tells me that this is an extremely valuable practice. I'd like to share the "why" and the "how" of pair coding here, focused on data scientists.

What is pair coding?

Pair coding is a form of programming in which two people work on a single code base at the same time. It usually involves one person on the keyboard and another talking through the problem and watching for issues, such as syntax, logic, or code style. Occasionally, they swap who is on the keyboard. In other words, one is the "creator", and the other is the "critic" (but in a positive, constructive fashion).

What's your history with pair coding?

I was inspired by a few sources. Firstly, there is a wealth of blog posts detailing the potential benefits and pitfalls of pair coding in a software developer's context. (A quick Google search will lead you to them.) Secondly, I had experimented at work with "pair hacking" sessions, which involved more than coding, including white-boarding a problem to get a feel for its scope, and they turned out to be pretty productive. Thirdly, I was inspired by a New Yorker article on Jeff and Sanjay, part of which chronicled how they worked as a pair to solve the toughest problems at Google.

Now, because I'm not a software engineer by training, because I don't have extensive prior experience with pair coding, and because I haven't come across any data-science-oriented resources on it (I'd love to read them if you know of any!), I've had to adapt what I read for software development to a data science context.

What are the potential benefits of pair coding?

I can see at least the following benefits, if not more that I have yet to discover:

  1. Instant peer review over data science logic and code. Because we are talking through a problem while coding it up, we can instantly check whether our logic is correct against each other.
  2. Knowledge transfer. In my experience, I've had productive pair-coding sessions with a colleague who has a better grasp of the project than I do. Hence, I contribute and teach the technical component, while also learning the broader project context.
  3. Building trust. We all know that the more closely you work with someone, the more rough corners get rubbed off.

What pre-requisites do you see for a productive pair programming session?

  1. A long, uninterrupted time slot (at least 2-3 hours) to maintain continuity.
  2. A defined goal or question that we are seeking to answer - keeps us focused on what needs to be done.
  3. That goal should also be plausibly achievable within the 2-3 hour timeframe.
  4. Large monitors for both parties to look at, or a code-sharing platform where both can see the code without needing to physically huddle.
  5. A place where we can talk without feeling hindered.
  6. No impromptu interruptions from other individuals.
  7. Complementary and intersecting skillsets.
  8. Open-minded individuals who are willing to learn. (Ego-free.)

Where does pair coding differ for data scientists vs. software engineers?

I think the differences are subtle at best, not overt.

The biggest difference that I can think of might be in clarity. To the best of my knowledge, software engineers work with pretty well-defined requirements. The only hiccups I can imagine are unforeseen logic or code blockers. Data scientists, on the other hand, are often exploring and defining the requirements as they go along. In other words, we are working with more unknowns than a software engineer might.

An example is a model I built with a colleague at work that involved groups of groups of samples. We weren't able to envision the final model right at the beginning and code towards it. Rather, we built the model iteratively: starting with highly simplifying assumptions, discussing which ones to refine, and refining the model as we went forward.

Perhaps a related difference is that as data scientists, because of potentially greater uncertainty surrounding the final product, we may end up talking more about project direction than one would as a software engineer. But that's probably just a minor detail.

Do you have any memorable quotes from the New Yorker article?

Yes, a number of them.

One on scaling things up:

Alan Eustace became the head of the engineering team after Rosing left, in 2005. “To solve problems at scale, paradoxically, you have to know the smallest details,” Eustace said.

Another on pair programming as an uncommon practice:

“I don’t know why more people don’t do it,” Sanjay said, of programming with a partner.

“You need to find someone that you’re gonna pair-program with who’s compatible with your way of thinking, so that the two of you together are a complementary force,” Jeff said.



Minimum Viable Products (MVPs) Matter

written by Eric J. Ma on 2019-01-28

Tags: data science, data products, minimum viable products

MVPs matter because they afford us at least two things:

  1. Psychological safety
  2. Credibility

Psychological safety comes from knowing that we have at least a working prototype that we can deliver to whomever is going to consume our results. We aren't stuck in the land of imaginary ideas without something tangible for others to interact with.

Credibility comes about because, with the MVP in hand, others can now trust in our ability to execute on an idea. Prior to that, all they have to go on are promises of "a thing".

Build your MVPs. They're a good thing!



ADVI: Scalable Bayesian Inference

written by Eric J. Ma on 2019-01-21

Tags: bayesian, variational inference, data science

Introduction

You never know when scalability will become an issue, and when it does, it necessitates a whole different world of tooling.

While at work, I've been playing with a model - a Bayesian hierarchical four-parameter dose-response model, to be specific. With this model, the overall goal (without going into proprietary specifics) was parameter learning: what's the half-maximal (50%) concentration, what's the max, what's the minimum, and so on. Quantifying the uncertainty surrounding these parameters was also important.
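
To make the model concrete, below is a minimal sketch of a four-parameter dose-response (logistic) curve in NumPy. The function and parameter names are my own stand-ins for illustration, not the parameterization of the actual work model:

import numpy as np

def dose_response(log_conc, bottom, top, log_ec50, slope):
    """Four-parameter logistic dose-response curve.

    bottom, top: lower and upper plateaus of the response.
    log_ec50: log-concentration at which the response is halfway
        between bottom and top.
    slope: steepness of the transition between the plateaus.
    """
    return bottom + (top - bottom) / (1 + np.exp(slope * (log_ec50 - log_conc)))

# Example: responses at a few log-concentrations.
log_conc = np.linspace(-3, 3, 7)
print(dose_response(log_conc, bottom=0.0, top=1.0, log_ec50=0.0, slope=2.0))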

Prototype Phase

Originally, when I prototyped the model, I used just a few thousand samples, which were trivial to fit with NUTS. I also got the model specification (both the group-level and population-level priors) done using those same few thousand samples.

At some point, I was qualitatively and quantitatively comfortable with the model specification. Qualitatively, the model structure reflected prior biochemical knowledge. Quantitatively, I saw good convergence when examining the sampling traces, as well as the expected shrinkage phenomenon.

Scaling Up

Once I reached that point, I decided to scale up to the entire dataset: 400K+ samples, 3000+ groups.

Fitting this model with NUTS on the full dataset would have taken about a week, with no guarantees on when it would stop: when I left the overnight run going the day before, I was still hoping it would finish within five days. However, switching over to ADVI (automatic differentiation variational inference) was a key enabler for this model: I was able to finish fitting it in just 2.5 hours, with similar uncertainty estimates (they'll never end up being identical, given random sampling).
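
To make the NUTS-to-ADVI switch concrete, here is a sketch assuming a PyMC3-style workflow. The actual dose-response model and data are proprietary, so a toy hierarchical model on synthetic data stands in for them; the relevant change is swapping pm.sample (NUTS) for pm.fit with method="advi".

import numpy as np
import pymc3 as pm

# Synthetic data: 50 groups, 200 samples each.
rng = np.random.RandomState(42)
group_means = rng.normal(0, 1, size=50)
group_idx = np.repeat(np.arange(50), 200)
data = rng.normal(group_means[group_idx], 0.5)

with pm.Model() as model:
    # Population-level priors.
    mu = pm.Normal("mu", 0, 5)
    sigma = pm.HalfNormal("sigma", 2.5)
    # Group-level parameters.
    theta = pm.Normal("theta", mu, sigma, shape=50)
    obs = pm.Normal("obs", theta[group_idx], 0.5, observed=data)

    # NUTS: the gold standard, but slow on very large datasets.
    # trace = pm.sample(2000, tune=1000)

    # ADVI: much faster; sample from the fitted approximation to get
    # a trace-like object with uncertainty estimates.
    approx = pm.fit(n=50000, method="advi")
    trace = approx.sample(2000)

    print(pm.summary(trace))

On the toy data this runs in seconds; the point is that the model specification stays the same and only the inference call changes.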

Thoughts

I used to underappreciate how useful ADVI could be for simpler models; in the past, I thought that ADVI was mainly useful for Bayesian neural network applications - in other words, models with many parameters and lots of data.

With this example, I'm definitely more informed about what "scale" can mean: both in terms of the number of parameters in a model, and in terms of the number of samples the model is fitted on. In this particular example, the model is simple, but the number of samples is so large that ADVI becomes a feasible alternative to NUTS MCMC sampling.



Conda hacks for data science efficiency

written by Eric J. Ma on 2018-12-25

Tags: data science, conda, hacks

The conda package manager has, over the years, become an integral part of my workflow. I use it to manage project environments, and have built a bunch of very simple hacks around it that you can adopt too. I'd like to share them with you, alongside the rationale for using them.

Hack #1: Set up your .condarc

Why? It will save you a few keystrokes each time you want to do something with conda. For example, in my .condarc, I have the following:

# Set the channels that the `conda install` command will 
# automatically search through.
channels:
  - defaults
  - conda-forge
  - ericmjl

# Always say yes to installation prompts. This is convenient,
# but always double-check!
always_yes: true

For more information on how to configure your .condarc, check the online documentation!

Hack #2: Use one environment spec file per project

This assumes that you have the habit of putting all files related to one project inside one folder, using subdirectories for finer-grained organization.

Why? It will ensure that you have one version-controlled, authoritative specification for the packages that are associated with the project. This is good for (1) reproducibility, as you can send it to a colleague and have them reproduce the environment, and (2) will enable Hack #3, which I will showcase afterwards.

# file name: environment.yml

# Give your project an informative name
name: project-name

# Specify the conda channels that you wish to grab packages from, in order of priority.
channels:
- defaults
- conda-forge
- ericmjl

# Specify the packages that you would like to install inside your environment. Version numbers are allowed, and conda will automatically use its dependency solver to ensure that all packages work with one another.
dependencies:
- python=3.7
- conda
- jupyterlab
- scipy
- numpy
- pandas
- pyjanitor
# There are some packages which are not conda-installable. You can put the pip dependencies here instead.
- pip:
    - tqdm  # for example only; tqdm is actually conda-installable.

A related hack: I use a TextExpander shortcut to populate a starting environment spec file.

Additionally, if I want to install a new package, rather than simply typing conda install <packagename>, I add the package to the environment spec file and then run conda env update -f environment.yml, since more often than not my default is to continue using the package I added.

For more details on what the environment spec file is all about, read the online docs!

Hack #3: Use conda-auto-env

Written by Christine Doig, conda-auto-env is a bash hack that automatically activates an environment when you enter a project directory that contains an environment.yml file. If the environment does not already exist, conda-auto-env creates it from that environment.yml file first.

Why? If you have many projects that you are working on, it greatly reduces the effort needed to remember which project environment to activate.

conda-auto-env looks like this:

#!/bin/bash
# File: .conda-auto-env

function conda_auto_env() {
  if [ -e "environment.yml" ]; then
    # Extract the environment name from the first line of
    # environment.yml ("name: <env-name>").
    ENV=$(head -n 1 environment.yml | cut -f2 -d ' ')
    # Only act if we are not already in the environment.
    if [[ $PATH != *$ENV* ]]; then
      # Try to activate the environment; if that fails, it
      # doesn't exist yet, so create it and then activate it.
      if ! conda activate "$ENV"; then
        echo "Conda env '$ENV' doesn't exist."
        conda env create -q
        conda activate "$ENV"
      fi
    fi
  fi
}

export PROMPT_COMMAND=conda_auto_env

To use it, you have two options. You can either copy/paste the whole original script into your .bashrc, or you can put it in a file called .conda-auto-env, and source it from your .bashrc. I recommend the latter, as it makes managing your .bashrc easier:

# File: .bashrc
source /path/to/.conda-auto-env

Hack #4: Hijack bash aliases for conda commands

I use aliases to save myself a few keystrokes whenever I'm at the terminal. This is a generalizable bash hack, but here it is applied to conda commands.

Anyway, these are the commands that I use most often, which I have found useful to alias:

# File: .aliases
alias ceu="conda env update"
alias cl="conda list"
alias ci="conda install"
alias cr="conda remove"

Make sure your aliases don't clash with existing commands that you use!

Then, source .aliases in your .bashrc:

# File: .bashrc
source /path/to/.aliases

Now, all of your defined aliases will be available in your bash shell.

The idea/pattern, as I mentioned earlier, is generalizable beyond just conda commands. (I have ls aliased to exa, and l aliased to ls - the epitome of laziness!)

Conclusion

I hope you found these conda and bash hacks useful, and that they help you become more productive and efficient!



Gaussian Process Notes

written by Eric J. Ma on 2018-12-16

Tags: data science, bayesian

I first learned about Gaussian processes (GPs) about two years back, and have been fascinated by the idea. I learned them through a video by David MacKay, and managed to grok them enough to put them to use in simple settings. That was reflected in my Flu Forecaster project, in which my GPs were trained only on individual latent spaces.

Recently, though, I decided to seriously sit down and try to grok the math behind GPs (and other machine learning models). To do so, I worked through Nando de Freitas' YouTube videos on GPs. (Super thankful that he has opted to put these videos up online!)

The product of this learning is two-fold. Firstly, I have added a GP notebook to my Bayesian analysis recipes repository.

Secondly, I have also put together some hand-written notes on GPs. (For those who are curious, I first hand-wrote them on paper, then copied them into my iPad mini using a Wacom stylus. We don't have the budget at the moment for an iPad Pro!) They can be downloaded here.
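
To give a flavour of the math in the notes, here is a minimal NumPy sketch of GP regression with an RBF (squared-exponential) kernel. This is my own toy example on made-up data, not the notebook from the Bayesian analysis recipes repository:

import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between two sets of points."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Noisy training data and test locations.
rng = np.random.RandomState(0)
x_train = np.linspace(-3, 3, 20)
y_train = np.sin(x_train) + 0.1 * rng.randn(20)
x_test = np.linspace(-4, 4, 100)
noise = 0.1 ** 2

# Standard GP posterior equations:
#   mean = K(x*, x) [K(x, x) + noise*I]^-1 y
#   cov  = K(x*, x*) - K(x*, x) [K(x, x) + noise*I]^-1 K(x, x*)
K = rbf_kernel(x_train, x_train) + noise * np.eye(20)
K_s = rbf_kernel(x_test, x_train)
K_ss = rbf_kernel(x_test, x_test)
post_mean = K_s @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)

The two lines computing post_mean and post_cov are the standard GP posterior equations; everything else is bookkeeping.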

Some lessons learned:
