Variance Explained

written by Eric J. Ma on 2019-03-24

data science machine learning

Variance explained is a regression quality metric that I have begun to like a lot, especially when used in place of a metric like the squared correlation coefficient ($r^2$).

Variance explained is defined as:

$$1 - \frac{var(y_{true} - y_{pred})}{var(y_{true})}$$

Why do I like it? It’s because this metric gives us a measure of the scale of the error in predictions relative to the scale of the data.

The numerator in the fraction calculates the variance in the errors, in other words, the scale of the errors. The denominator in the fraction calculates the variance in the data, in other words, the scale of the data. By subtracting the fraction from 1, we get a number upper-bounded at 1 (best case), and unbounded towards negative infinity.
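To make the definition concrete, here is a minimal NumPy sketch of the formula above. The function name `variance_explained` and the toy arrays are my own illustrative choices, not anything canonical. As far as I know, scikit-learn's `explained_variance_score` computes the same quantity, so in practice you may prefer that.

```python
import numpy as np


def variance_explained(y_true, y_pred):
    """Compute 1 - var(y_true - y_pred) / var(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Numerator: scale of the errors. Denominator: scale of the data.
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)


# Toy data, made up purely for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect predictions: the errors have zero variance, so the score is 1.
perfect = variance_explained(y_true, y_true)

# Always predicting the mean: the errors vary as much as the data, so the score is 0.
mean_only = variance_explained(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that move the wrong way: the errors vary more than the data,
# so the score goes negative.
wrong_way = variance_explained(y_true, -y_true)

print(perfect, mean_only, wrong_way)
```

The three cases cover the range of the metric: 1 in the best case, 0 when the model does no better than a constant, and below 0 when the errors are larger in scale than the data itself.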

Here are a few interesting scenarios.

A really nice thing about variance explained is that it can be used to compare related machine learning tasks that have different unit scales, when we want to compare how well one model performs across all of the tasks. Mean squared error makes this an apples-to-oranges comparison, because the unit scales of the tasks are different. Variance explained, on the other hand, is unit-less.
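Here is a small sketch of that unit-less property, using the same synthetic data expressed in two different units. The kilometre and metre framing, the random data, and the helper names are all assumptions made up for this illustration.

```python
import numpy as np


def variance_explained(y_true, y_pred):
    """Compute 1 - var(y_true - y_pred) / var(y_true)."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)


def mean_squared_error(y_true, y_pred):
    """Plain mean squared error, in squared units of y."""
    return np.mean((y_true - y_pred) ** 2)


# Synthetic "task" measured in kilometres (hypothetical data).
rng = np.random.default_rng(0)
y_true_km = rng.normal(loc=10.0, scale=2.0, size=200)
y_pred_km = y_true_km + rng.normal(scale=0.5, size=200)

# The very same task, re-expressed in metres.
y_true_m = y_true_km * 1000.0
y_pred_m = y_pred_km * 1000.0

mse_km = mean_squared_error(y_true_km, y_pred_km)
mse_m = mean_squared_error(y_true_m, y_pred_m)
ve_km = variance_explained(y_true_km, y_pred_km)
ve_m = variance_explained(y_true_m, y_pred_m)

# MSE grows by a factor of 1000**2; variance explained does not change.
print(mse_m / mse_km)
print(ve_km, ve_m)
```

Rescaling multiplies both variances in the fraction by the same factor, so it cancels. That is why the scores stay comparable across tasks with different units, while the mean squared errors do not.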

Now, we know that any single metric can have failure points. The squared correlation coefficient $r^2$ has them too, as shown by Anscombe's quartet and the Datasaurus Dozen:


Fig. 1: Anscombe's quartet, taken from Autodesk Research


Fig. 2: Datasaurus Dozen, taken from Revolution Analytics

One place where variance explained can fail is when the predictions are systematically shifted away from the true values. Let's say every prediction is off by 2 units.

$$var(y_{true} - y_{pred}) = var([2, 2, ..., 2]) = 0$$

There's no variance in the errors, even though the predictions are systematically shifted away from the true values. Like $r^2$, variance explained will fail here.
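The shifted-prediction case is easy to reproduce. In this sketch the toy values are made up, and reporting the mean error and mean squared error alongside variance explained is only one possible safeguard, not a prescription.

```python
import numpy as np

# Toy data: every prediction sits exactly 2 units below the true value.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true - 2.0

errors = y_true - y_pred  # array of 2.0s

# The constant errors have zero variance, so the score looks perfect.
ve = 1.0 - np.var(errors) / np.var(y_true)

# Metrics that do not subtract the mean of the errors expose the shift.
mean_error = np.mean(errors)
mse = np.mean(errors ** 2)

print(ve, mean_error, mse)
```

Variance explained comes out as 1 even though every single prediction is wrong, while the mean error of 2 and the mean squared error of 4 make the offset visible.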

As usual, Anscombe's quartet and the Datasaurus Dozen give us a pertinent reminder that visually inspecting your model predictions is always a good thing!