Variance Explained

written by Eric J. Ma on 2019-03-24

data science, machine learning

Variance explained, as a regression quality metric, is one that I have begun to like a lot, especially when used in place of a metric like the coefficient of determination (r²).

Here's variance explained defined:

$$1 - \frac{var(y_{true} - y_{pred})}{var(y_{true})}$$

Why do I like it? Because this metric measures the scale of the errors in our predictions relative to the scale of the data.

The numerator in the fraction calculates the variance in the errors, in other words, the scale of the errors. The denominator in the fraction calculates the variance in the data, in other words, the scale of the data. By subtracting the fraction from 1, we get a number upper-bounded at 1 (best case) and unbounded towards negative infinity: a perfect model has zero error variance and scores 1, while a model whose errors vary more than the data itself scores below 0.
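To make this concrete, here's a minimal sketch of the metric in NumPy; scikit-learn also ships this same computation as sklearn.metrics.explained_variance_score.

import numpy as np

def variance_explained(y_true, y_pred):
    # 1 - var(errors) / var(data): upper-bounded at 1, unbounded below.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 1 - np.var(y_true - y_pred) / np.var(y_true)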

Here are a few interesting scenarios.

A really nice property of variance explained is that it can be used to compare related machine learning tasks that have different unit scales, when we want to see how well one model performs across all of the tasks. Mean squared error makes this an apples-to-oranges comparison, because the unit scale of each task is different. Variance explained, on the other hand, is unitless.
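As a quick sanity check of that claim, here's a sketch reusing the variance_explained function above, with made-up numbers: rescaling a task's units inflates the mean squared error a million-fold, while variance explained doesn't budge.

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9])

# The same task expressed in different units, e.g. grams instead of kilograms.
scale = 1000.0

mse_small = np.mean((y_true - y_pred) ** 2)
mse_large = np.mean((scale * y_true - scale * y_pred) ** 2)  # 10^6 times larger

ve_small = variance_explained(y_true, y_pred)
ve_large = variance_explained(scale * y_true, scale * y_pred)  # identical to ve_small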

Now, we know that single metrics can have failure points, as r² does, as shown in Anscombe's quartet and the Datasaurus Dozen:

Fig. 1: Anscombe's quartet, taken from Autodesk Research

Fig. 2: Datasaurus Dozen, taken from Revolution Analytics

One place where variance explained can fail is when the predictions are systematically shifted off from the true values. Let's say every prediction was shifted off by 2 units:

$$var(y_{true} - y_{pred}) = var([2, 2, ..., 2]) = 0$$

There's no variance in the errors, so variance explained evaluates to a perfect 1, even though every prediction is systematically shifted off from the true value. Like r², variance explained will fail here.
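A two-line check, using the variance_explained sketch from above, makes this failure concrete:

y_true = np.array([1.0, 2.0, 3.0, 4.0])
variance_explained(y_true, y_true + 2)  # returns 1.0: a 'perfect' score despite the bias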

As usual, Anscombe's quartet, like the Datasaurus Dozen, gives us a pertinent reminder that visually inspecting your model predictions is always a good thing!



Functools Partial

written by Eric J. Ma on 2019-03-22

python, hacks, tips and tricks, data science, productivity, coding

If you’ve done Python programming for a while, I think it pays off to know some little tricks that can improve the readability of your code and decrease the amount of repetition that goes on.

One such tool is functools.partial. It took me a few years after my first introduction to partial before I finally understood why it was such a powerful tool.

Essentially, what partial does is wrap a function and fix one of its keyword arguments to a constant. That’s it. What does that mean in practice?

Here’s a minimal example. Let’s say we have a function f, not written by me, but provided by someone else.

def f(a, b):
    result = ...  # do something with a and b.
    return result

In my code, let’s say that I know that the value that b takes on in my app is always the tuple (1, 'A'). I now have a few options. The most obvious is to assign the tuple (1, 'A') to a variable, and pass that in on every function call:

b = (1, 'A')
result1 = f(a=1, b=b)
# do some stuff.
result2 = f(a=15, b=b)
# do more stuff.
# ad nauseam
N = ...  # set the value of N
resultN = f(a=N, b=b)

The other way I could do it is to use functools.partial and set the keyword argument b to the tuple directly:

from functools import partial
f_ = partial(f, b=(1, 'A'))

I can now repeat the code above, worrying only about the keyword argument a:

result1 = f_(a=1)
# do some stuff.
result2 = f_(a=15)
# do more stuff.
# ad nauseam
N = ...  # set the value of N
resultN = f_(a=N)

And there you go, that’s basically how functools.partial works in a nutshell.
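One small aside worth knowing: a keyword argument fixed with partial acts as a default rather than a lock, so you can still override it on a per-call basis:

result = f_(a=1, b=(2, 'B'))  # overrides the pre-set b for this call only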

Now, where have I used this in real life?

The most common place I have used it is in Flask. I have built Flask apps where I need to keep the Bokeh version synced between the Python library and the JS assets that get loaded. To ensure that my HTML templates have a consistent Bokeh version, I use the following pattern:

from bokeh import __version__ as bkversion
from flask import render_template, Flask
from functools import partial 

render_template = partial(render_template, bkversion=bkversion)

# Flask app boilerplate
app = Flask(__name__)


@app.route('/')
def home():
    return render_template('index.html.j2')

Now, because bkversion is always pre-specified in render_template, I never have to repeat it in every render_template call.
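As an aside, Flask has a built-in alternative that achieves the same effect: a context processor, which injects variables into the context of every rendered template. A sketch of the same pattern using it:

@app.context_processor
def inject_bkversion():
    # Make bkversion available in every template without passing it explicitly.
    return dict(bkversion=bkversion)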



How I Work

written by Eric J. Ma on 2019-03-20

data science, productivity

I was inspired to write this because of Will Wolf’s interview with DeepLearning.AI, in which I found a ton of similarities between how both of us work. As such, I thought I’d write down what I use at work to get things done.

Tooling

For a data scientist, I think tooling is of very high importance: mastery over our tools keeps us productive. Here’s a sampling of what I use at work:

As you probably can see, I’m a very Python-centric person!

Daily/Weekly Routines

Most of my work necessitates long stretches of thinking and hacking time. Without that, I’m unable to get into “the zone” to do anything productive. Hence, I have a habit of packing meetings onto Mondays (a.k.a. “Meeting Mondays”). Backup meeting slots, which I prefer not to use, are 11 am and 1 pm, bookending lunchtime so that I don’t end up with a fragmented morning or afternoon. The only exceptions I make are for my two high-priority team meetings, for which I defer to the rest of the team. I’m glad that my managers understand the need for long stretches of hacking time, and have stuck to Monday one-on-one meetings.

Hence, almost every day from Tuesday through to Friday, I have long stretches of pre-allocated time for hacking. It’s data science scheduling bliss! It also means I turn down a lot of “can I meet you to chat” invites - unless we can pack them on Monday!

On Fridays, I make a point of trying to work remotely. It helps with sanity, particularly in the winter, when the commute gets harsh and I can’t bike. Fridays are also the days on which I try to do my open source work.

Pair Coding

Pair coding with others on mutual projects has been a very productive endeavor, one I have written about before. Unlike weekly update meetings, pair coding sessions are planned on an as-needed basis. We have a pre-defined goal for what we want to accomplish, including a conceivably achievable goal and a stretch goal; achieving the easier one keeps us motivated. It follows the “no agenda, no meeting” rule of thumb by which I protect my time.

I have found that a good setup is really necessary for pair coding to be successful. The minimum is a dual-monitor setup, with an extra keyboard and mouse for my coding partner.

One thing I didn’t mention in my previous blog post was how knowledge transfer happens. Here’s how I think it works. One of us is in the “driver’s seat”, and the other is in the observer role. Knowledge transfer generally happens from the more experienced person to the less experienced one, and the driver doesn’t necessarily have to be the more experienced one. For example, when pair coding with my intern, I play the role of observer and may dictate code or outline what needs to be done, but I don’t actively take over on my keyboard unless something comes up that is unrelated to the goals of the coding session. On the other hand, if there’s a codebase I’ve developed for which I need to play the tour guide role, I will be in the driver’s seat, while the observer will help me catch peripheral errors that I’m making.

Learning New Things

Pair coding has been one way I learn new things. For example, with my colleague Zach as the observer, we hacked together a simple dashboard project using Flask, Holoviews and Panel.

I’m not very mathematically savvy, in that algebra is difficult for me to follow. (I’m mildly algebra-blind, but getting better now.) Ironically, code, which is also algebraic in nature but works with plain-English names, works much better for me. Implementing algorithms and statistical methods using jax (for things that involve differential computing) and PyMC3 (for all things Bayesian) has proven very educational. While implementing, I also impose software abstractions on the math, which forces me to organize my knowledge and further helps the learning. Implementing things on the computer is also the perfect way to learn by teaching: the computer is the ultimate dumb student, as it will execute exactly what you tell it, mistakes included!



Pair Coding: Why and How for Data Scientists

written by Eric J. Ma on 2019-03-01

data science, programming, best practices

Introduction

While at work, I've been experimenting with pair coding with other data-science-oriented colleagues. My experience tells me that this is an extremely valuable thing to do. I'd like to share the "why" and the "how" of pair coding here, focused on data scientists.

What is pair coding?

Pair coding is a form of programming in which two people work on a single code base together. It usually involves one person on the keyboard and another talking through the problem and watching for issues, such as syntax, logic, or code style. Occasionally, they may swap who is on the keyboard. In other words, one is the "creator", and the other is the "critic" (but in a positive, constructive fashion).

What's your history with pair coding?

I was inspired by a few things. Firstly, there is a wealth of blog posts detailing the potential benefits and pitfalls of pair coding in a software developer's context. (A quick Google search will lead you to them.) Secondly, at work I had experimented with "pair hacking" sessions, which involved more than coding, including white-boarding a problem to get a feel for its scope, and they turned out to be pretty productive. Thirdly, I was inspired by a New Yorker article on Jeff and Sanjay, part of which chronicled how they worked as a pair to solve the toughest problems at Google.

Now, because I'm not a software engineer by training, because I don't have extensive prior experience with pair coding, and because I haven't come across any data-science-oriented resources on it (I'd love to read them if you know of any!), I've had to adapt what I read about software development to a data science context.

What are the potential benefits of pair coding?

I can see at least the following benefits, if not more that I have yet to discover:

  1. Instant peer review over data science logic and code. Because we are talking through a problem while coding it up, we can instantly check whether our logic is correct against each other.
  2. Knowledge transfer. In my experience, I've had productive pair-coding sessions with another colleague who has a better grasp of the project than I do. Hence, I contribute & teach the technical component, while I also learn the broader project context better.
  3. Building trust. We all know that the more closely you work with someone, the more rough corners get rubbed off.

What pre-requisites do you see for a productive pair programming session?

  1. A long, continuous, and uninterrupted time slot (at least 2-3 hours in length) to maintain continuity.
  2. A defined goal or question that we are seeking to answer - keeps us focused on what needs to be done.
  3. That goal should also be plausibly achievable within the 2-3 hour timeframe.
  4. Large monitors for both parties to look at, or a code-sharing platform where both can see the code without needing to physically huddle.
  5. A place where we can talk without feeling hindered.
  6. No impromptu interruptions from other individuals.
  7. Complementary and intersecting skillsets.
  8. Open-minded individuals who are willing to learn. (Ego-free.)

Where does pair coding differ for data scientists vs. software engineers?

I think the differences are subtle at best, not overt.

The biggest difference I can think of might be in clarity. To the best of my knowledge, software engineers work with pretty well-defined requirements. The only hiccups I can imagine are unforeseen logic or code blockers. Data scientists, on the other hand, are often exploring and defining the requirements as they go along. In other words, we work with more unknowns than a software engineer might.

An example is a model I built with a colleague at work that involved groups of groups of samples. We weren't able to envision the final model right at the beginning and code towards it. Rather, we built the model iteratively: starting with highly simplifying assumptions, discussing which ones to refine, and refining the model as we went.

Perhaps a related difference is that as data scientists, because of potentially greater uncertainty surrounding the final product, we may end up talking more about project direction than one would as a software engineer. But that's probably just a minor detail.

Do you have any memorable quotes from the New Yorker article?

Yes, a number of them.

One on scaling things up:

Alan Eustace became the head of the engineering team after Rosing left, in 2005. “To solve problems at scale, paradoxically, you have to know the smallest details,” Eustace said.

Another on pair programming as an uncommon practice:

“I don’t know why more people don’t do it,” Sanjay said, of programming with a partner.

“You need to find someone that you’re gonna pair-program with who’s compatible with your way of thinking, so that the two of you together are a complementary force,” Jeff said.



Minimum Viable Products (MVPs) Matter

written by Eric J. Ma on 2019-01-28

data science, data products, minimum viable products

MVPs matter because they afford us at least two things:

  1. Psychological safety
  2. Credibility

Psychological safety comes from knowing that we have at least a working prototype that we can deliver to whoever is going to consume our results. We aren't stuck in the land of imaginary ideas without something tangible for others to interact with.

Credibility comes about because, with the MVP in hand, others can now trust our ability to execute on an idea. Before that, all others have to go on are promises of "a thing".

Build your MVPs. They're a good thing!
