Introduction
In this notebook, we will take a quick look at the "collider" effect.
Let's say we have the following causal graph: a → b ← c. Here b is the "collider", because both arrows point into it.
If we "condition" on b, then a and c become correlated, even though they are marginally independent.
import numpy as np
from causality_notes import noise
import pandas as pd
import seaborn as sns
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Generate Data
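Let's assume we have a causal model that follows the equations below:
$$a = \epsilon_a, \qquad c = \epsilon_c, \qquad b = 20a - 20c + \epsilon_b$$
where each $\epsilon$ is an independent draw from the noise function.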
This is expressed in the code below.
size = 1000
a = noise(size)
c = noise(size)
b = 20*a - 20*c + noise(size)
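If the causality_notes helper module is not at hand, the data can be generated with a stand-in. The sketch below assumes noise draws i.i.d. standard normal values; that distribution, and the seed argument, are assumptions made for illustration and may not match the real helper.

```python
import numpy as np


def noise(size, seed=None):
    """Hypothetical stand-in for causality_notes.noise.

    Assumes i.i.d. standard normal noise; the real helper may differ.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=size)


size = 1000
a = noise(size, seed=0)                     # exogenous parent
c = noise(size, seed=1)                     # exogenous parent
b = 20 * a - 20 * c + noise(size, seed=2)   # collider: child of a and c
```

Passing different seeds keeps the three noise draws distinct while making the run reproducible.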
We now make it into a pandas DataFrame.
df = pd.DataFrame({'a': a, 'b': b, 'c': c})
Let's view a pair plot to see the pairwise correlation (dependency) between the variables.
sns.pairplot(df)
As implied by the causal graph, a and c are independent of one another, so the a-versus-c panels show no trend. By contrast, b trends upward with a and downward with c.
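The pair plot can be backed up with numbers by computing the correlation matrix. This is a minimal sketch that regenerates the data with NumPy's random generator, assuming standard normal noise in place of the notebook's noise helper.

```python
import numpy as np
import pandas as pd

# Assumption: standard normal noise stands in for causality_notes.noise.
rng = np.random.default_rng(42)
size = 1000
a = rng.normal(size=size)
c = rng.normal(size=size)
b = 20 * a - 20 * c + rng.normal(size=size)
df = pd.DataFrame({'a': a, 'b': b, 'c': c})

# Pearson correlation between every pair of columns.
corr = df.corr()
r_ac = corr.loc['a', 'c']  # parents: expected to be near zero
r_ab = corr.loc['a', 'b']  # expected to be clearly positive
r_cb = corr.loc['c', 'b']  # expected to be clearly negative
print(corr.round(2))
```

The a-c entry should sit close to zero, while both parents correlate strongly with b, in opposite directions.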
Conditioning
When we "condition" on a variable, remember that we are essentially restricting that variable to a narrow "slice" of its values, and then looking at how the other variables are distributed within that slice. I illustrated this on my blog.
In our problem, this means that we have to slice out a range of the values of b. Here we keep the rows where b lies between its 25th percentile and its mean:
df_new = df[(df['b'] < df['b'].mean()) & (df['b'] > np.percentile(df['b'], 25))]
Now, let's visualize the relationship between a and c, now conditioned on b.
ax = df_new.plot(kind='scatter', x='a', y='c')
ax.set_aspect('equal')
ax.set_title('conditioned on b')
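To put a number on what the scatter plot shows, we can compare the correlation between a and c before and after slicing. This sketch rebuilds the data under the assumption of standard normal noise and applies the same slice of b (25th percentile up to the mean).

```python
import numpy as np
import pandas as pd

# Assumption: standard normal noise stands in for causality_notes.noise.
rng = np.random.default_rng(0)
size = 1000
a = rng.normal(size=size)
c = rng.normal(size=size)
b = 20 * a - 20 * c + rng.normal(size=size)
df = pd.DataFrame({'a': a, 'b': b, 'c': c})

# Same slice as in the notebook: 25th percentile of b up to its mean.
lower = np.percentile(df['b'], 25)
upper = df['b'].mean()
df_new = df[(df['b'] > lower) & (df['b'] < upper)]

r_all = df['a'].corr(df['c'])            # marginal correlation
r_slice = df_new['a'].corr(df_new['c'])  # correlation within the slice of b
print(f"all rows: {r_all:.2f}, conditioned on b: {r_slice:.2f}")
```

Within the slice, a and c are forced to differ by a roughly fixed amount, so the conditioned correlation comes out strongly positive.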
We can also look at the full joint distribution of a and c, colouring each point by its value of b to illustrate what would happen if we conditioned on particular values of b.
ax = sns.scatterplot(data=df, x='a', y='c', hue='b')
ax.set_aspect('equal')
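The coloured bands suggest that the effect does not depend on which slice of b we pick. One way to check this is to cut b into quantile bins and compute the a-c correlation inside each bin. This is a sketch under the assumption of standard normal noise; the choice of five bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Assumption: standard normal noise stands in for causality_notes.noise.
rng = np.random.default_rng(1)
size = 1000
a = rng.normal(size=size)
c = rng.normal(size=size)
b = 20 * a - 20 * c + rng.normal(size=size)
df = pd.DataFrame({'a': a, 'b': b, 'c': c})

# Split b into five equally populated bins (0 = lowest b, 4 = highest b).
df['b_bin'] = pd.qcut(df['b'], q=5, labels=False)

# Correlation between a and c within each bin of b.
within = {int(k): g['a'].corr(g['c']) for k, g in df.groupby('b_bin')}
for k, r in within.items():
    print(f"b bin {k}: corr(a, c) = {r:.2f}")
```

Narrower bins pin b down more tightly, so increasing the number of bins should push the within-bin correlations closer to 1.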
Conclusion
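Here, we see that in a collider situation, if we condition on the child variable (the collider b), its parents a and c become spuriously correlated, even though they are independent in the full data. In this example the induced correlation is positive: holding b = 20a - 20c + noise roughly fixed forces a and c to rise and fall together.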