written by Eric J. Ma on 2019-03-01 | tags: data science programming best practices
In this Q&A-style blog post, I detail how data scientists can begin to engage in pair coding as a more common practice in our day-to-day work, and why we should spend the time to do it as much as we can afford.
While at work, I've been experimenting with pair coding with other data science-oriented colleagues. My experiences tell me that this is something extremely valuable to do. I'd like to share here the "why" and the "how" on pair coding, but focused towards data scientists.
Pair coding is a form of programming where two people work together on a single code base. It usually involves one person on the keyboard and another talking through the problem and watching for issues in syntax, logic, or code style. Occasionally, they may swap who is on the keyboard. In other words, one is the "creator", and the other is the "critic" (but in a positive, constructive fashion).
I was inspired by a few places. Firstly, there is a wealth of blog posts detailing the potential benefits and pitfalls of pair coding in a software developer's context. (A quick Google search will lead you to them.) Secondly, I had, at work, experimented with "pair hacking" sessions, which involved more than coding, including white-boarding a problem to get a feel for its scope, and they turned out to be pretty productive. Thirdly, I was inspired by a New Yorker article on Jeff and Sanjay, part of which chronicled how they worked as a pair to solve the toughest problems at Google.
Now, because I'm not a software engineer by training, because I don't have extensive prior experience with pair coding, and because I haven't come across any data-science-oriented resources on it (I'd love to read them if you know of any!), I've had to adapt what I read for software development to a data science context.
I can see at least the following benefits, if not more that I have yet to discover:
I think the differences at best are subtle, not necessarily overt.
The biggest difference that I can think of might be in clarity. To the best of my knowledge, software engineers work with pretty well-defined requirements. The only hiccups that I can imagine that may occur are in unforeseen logic/code blockers. Data scientists, on the other hand, often are exploring and defining the requirements as things go along. In other words, we are working with more unknowns than a software engineer might.
An example is a model I built with a colleague at work that involved groups of groups of samples. We weren't able to envision the final model right at the beginning, and code towards it. Rather, we built the model iteratively, starting with highly simplifying assumptions, discussing which ones to refine, and iteratively building the model as we went forward.
Perhaps a related difference is that as data scientists, because of potentially greater uncertainty surrounding the final product, we may end up talking more about project direction than one would as a software engineer. But that's probably just a minor detail.
Yes, a number of them.
One on scaling things up.
Alan Eustace became the head of the engineering team after Rosing left, in 2005. "To solve problems at scale, paradoxically, you have to know the smallest details," Eustace said.
Another on pair programming as an uncommon practice:
"I don’t know why more people don’t do it," Sanjay said, of programming with a partner.
"You need to find someone that you’re gonna pair-program with who’s compatible with your way of thinking, so that the two of you together are a complementary force," Jeff said.
@article{
ericmjl-2019-pair-scientists,
author = {Eric J. Ma},
title = {Pair Coding: Why and How for Data Scientists},
year = {2019},
month = {03},
day = {01},
howpublished = {\url{https://ericmjl.github.io}},
journal = {Eric J. Ma's Blog},
url = {https://ericmjl.github.io/blog/2019/3/1/pair-coding-why-and-how-for-data-scientists},
}