Code review in data science

written by Eric J. Ma on 2019-10-30 | tags: essays data science workflow good practices

My thoughts on how to do code review in data science projects effectively.

This blog post is cross-posted in my essays collection.

The practice of code review is extremely beneficial to the practice of software engineering. I believe it has its place in data science as well.

What code review is

Code review is the process by which a contributor's newly committed code is reviewed by one or more teammate(s). During the review process, the teammate(s) are tasked with ensuring that they

understand the code and are able to follow the logic,
find potential flaws in the newly contributed code,
identify poorly documented code and confusing use of variable names,
raise constructive questions and provide constructive feedback

on the codebase.

If you've done the practice of scientific research before, it is essentially identical to peer review, except with code being the thing being reviewed instead.

What code review isn't

Code review is not the time for a senior person to slam the contributions of a junior person, nor vice versa.

Why data scientists should do code review

The first reason is to ensure that project knowledge is shared amongst teammates. By doing this, we ensure that in case the original code creator needs to be offline for whatever reason, others on the team cover for that person and pick up the analysis. When N people review the code, N+1 people know what went on. (It does not necessarily have to be N == number of people on the team.)

In the context of notebooks, this is even more important. An analysis is complex, and involves multiple modelling decisions and assumptions. Raising these questions, and pointing out where those assumptions should be documented (particularly in the notebook) is a good way of ensuring that N+1 people know those implicit assumptions that go into the model.

The second reason is that even so-called "senior" data scientists are humans, and will make mistakes. With my interns and less-experienced colleagues, I will invite them to constructively raise queries about my code where it looks confusing to them. Sometimes, their lack of experience gives me an opportunity to explain and share design considerations during the code review process, but at other times, they are correct, and I have made a mistake in my code that should be rectified.

What code review can be

Code review can become a very productive time of learning for all parties. What it takes is the willingness to listen to the critique provided, and the willingness to raise issues on the codebase in a constructive fashion.

How code review happens

Code review happens usually in the context of a pull request to merge contributed code into the master branch. The major version control system hosting platforms (GitHub, BitBucket, GitLab) all provide an interface to show the "diff" (i.e. newly contributed or deleted code) and comment directly on the code, in context.

As such, code review can happen entirely asynchronously, across time zones, and without needing much in-person interaction.

Of course, being able to sync up either via a video call, or by meeting up in person, has numerous advantages by allowing non-verbal communication to take place. This helps with building trust between teammates, and hence doing even "virtual" in-person reviews can be a way of being inclusive towards remote colleagues.

Parting words

If your firm is set up to use a version control system, then you probably have the facilities to do code review available. I hope this essay encourages you to give it a try.

Cite this blog post:

@article{
    ericmjl-2019-code-review-in-data-science,
    author = {Eric J. Ma},
    title = {Code review in data science},
    year = {2019},
    month = {10},
    day = {30},
    howpublished = {\url{https://ericmjl.github.io}},
    journal = {Eric J. Ma's Blog},
    url = {https://ericmjl.github.io/blog/2019/10/30/code-review-in-data-science},
}

I send out a newsletter with tips and tools for data scientists. Come check it out at Substack.

If you would like to sponsor the coffee that goes into making my posts, please consider GitHub Sponsors!

Finally, I do free 30-minute GenAI strategy calls for teams that are looking to leverage GenAI for maximum impact. Consider booking a call on Calendly if you're interested!