
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
import pandas as pd
from causality_notes import draw_graph, noise

Introduction

In this notebook, I would like to simulate a causal process with only linear relationships between the variables, but on a DAG that is more complex than a simple triangle. Then, I would like to use the simulated data to convince myself that, with the right conditioning on variables, we can recover the correct parameters.

Graphical Structure

First off, let's assume that there is the following graphical structure.

G = nx.DiGraph()
G.add_edge('c', 'x', coeff=5)
G.add_edge('a', 'x', coeff=2)
G.add_edge('a', 'k', coeff=-3)
G.add_edge('x', 'f', coeff=6)
G.add_edge('x', 'd', coeff=-2)
G.add_edge('d', 'g', coeff=-8)
G.add_edge('k', 'y', coeff=3)
G.add_edge('d', 'y', coeff=5)
G.add_edge('y', 'h', coeff=-4)

draw_graph(G)

Written as a set of equations, it might look like the following:

n = 1000  # number of samples taken

c = noise(n)
a = noise(n)
x = 5 * c + 2 * a + noise(n)
k = -3 * a + noise(n)
f = 6 * x + noise(n)
d = -2 * x + noise(n)
g = -8 * d + noise(n)
y = 3 * k + 5 * d + noise(n)
h = -4 * y + noise(n)

data = dict(c=c, a=a, x=x, k=k, f=f, d=d, g=g, y=y, h=h)
df = pd.DataFrame(data)
df.sample(5)
c a x k f d g y h
798 -1.078732 0.794306 -3.212282 -0.610186 -19.582504 5.964205 -47.652161 29.263166 -116.295944
136 0.636438 -0.655031 3.899436 2.264376 24.951148 -8.262962 65.838390 -35.451501 142.794698
633 -0.363132 -0.304403 0.237570 1.299515 1.166533 -2.155883 16.125365 -6.459946 25.891101
311 -0.773808 0.740542 -2.616127 -3.296688 -12.936983 5.433091 -43.553788 16.290036 -63.290850
56 1.329955 -0.872511 5.286332 4.875391 31.694392 -9.472437 76.314045 -32.155163 129.815254

Note how the coefficients on the edges are exactly the linear multipliers in the equations above.
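As a quick sketch, we can read those multipliers straight off the graph object, and sanity-check one of them: f depends only on x, so a plain least-squares fit of f against x should recover a slope of roughly 6.

# The 'coeff' attribute on each edge is the multiplier used in the equations above.
nx.get_edge_attributes(G, 'coeff')

# f depends only on x, so a plain least-squares fit should give a slope near 6.
np.polyfit(df['x'], df['f'], deg=1)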

Before we go on, let's get a feel for how the data are distributed.

import seaborn as sns
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7f32be41dca0>

To quantify the relationships visible in the pair plot above, let's compute the pairwise correlation between each pair of variables.

plt.imshow(df.corr(method='pearson').values, cmap='RdBu')
plt.xticks(range(len(df.columns)), df.columns)
plt.yticks(range(len(df.columns)), df.columns)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x7f32abe632b0>

Compare the graphical structure below against the correlation plot above. A few things stand out:

  • c and k are uncorrelated, because there is no causal path from c to k. Same goes for c and a.
  • On the other hand, c is causally related to every variable downstream of x (that is, x, f, d, g, y and h). It is negatively correlated with d and y, and this follows directly from Sewall Wright's path rules: a positive coefficient times a negative coefficient gives a negative one (verified numerically in the sketch after the graph below).
draw_graph(G)
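To check those sign rules numerically, here is a small path-tracing sketch for this linear DAG: the contribution of each directed path is the product of its edge coefficients, summed over all paths. (The helper path_effect is my own naming for this sketch, not something provided by causality_notes or networkx.)

def path_effect(graph, source, target):
    """Sum, over every directed path from source to target, of the product
    of the edge coefficients along that path (Wright's path tracing)."""
    total = 0
    for path in nx.all_simple_paths(graph, source, target):
        total += np.prod([graph.edges[u, v]['coeff'] for u, v in zip(path, path[1:])])
    return total

# c reaches d and y only through x, and both products come out negative,
# matching the heatmap above: 5 * -2 = -10 and 5 * -2 * 5 = -50.
path_effect(G, 'c', 'd'), path_effect(G, 'c', 'y')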

From the graph, we know that the causal effect of x on y, carried entirely along the path x -> d -> y, is going to be -2 × 5 = -10 (Sewall Wright's path analysis). However, if we simply regress y on x, the coefficient will be wrong, because x and y are confounded, with the confounding originating from a.
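Using the path_effect sketch from above, the single path x -> d -> y gives exactly that product:

# Only one directed path runs from x to y (x -> d -> y), so this is just -2 * 5.
path_effect(G, 'x', 'y')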

Now, let's try naïvely regressing y on x, given the causal structure above.

from statsmodels.regression.linear_model import OLS

model = OLS.from_formula('y ~ x', data=df)
results = model.fit()
results.params
Intercept     0.450144
x           -10.458158
dtype: float64

We almost recover the correct coefficient, but we're still off, because we didn't condition on anything along the confounding (backdoor) path from x to y, that is, the path x <- a -> k -> y.
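If we want the graph itself to tell us about such backdoor paths, one rough sketch is to start from each parent of x and walk the undirected skeleton to y without revisiting x (this only lists candidate paths; it does not check whether colliders already block them):

# Undirected skeleton of the DAG, with x removed so paths cannot revisit it.
skeleton = G.to_undirected()
skeleton_without_x = skeleton.subgraph(set(skeleton.nodes) - {'x'})

# A backdoor path starts with an arrow *into* x, i.e. at one of x's parents.
backdoor_paths = []
for parent in G.predecessors('x'):
    for path in nx.all_simple_paths(skeleton_without_x, parent, 'y'):
        backdoor_paths.append(['x'] + path)
backdoor_paths

Here the only backdoor path is x <- a -> k -> y, exactly the one we failed to block above.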

What if we conditioned on a? Conditioning on a simply means adding it as a term in the linear regression.

model = OLS.from_formula('y ~ x + a', data=df)
results = model.fit()
results.params
Intercept    0.115918
x           -9.994394
a           -8.780201
dtype: float64

Much better! What if we conditioned on k only?

model = OLS.from_formula('y ~ x + k', data=df)
results = model.fit()
results.params
Intercept    0.119636
x           -9.978843
k            2.906956
dtype: float64

Wonderful! In both cases we get a coefficient on x that is much closer to -10, exactly as expected. Notice that we also recovered the effects of a and k respectively on y: with x held fixed, a influences y only through k, so its coefficient should be close to -3 × 3 = -9, while k's coefficient on y is 3.
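We can check that -9 with the path_effect sketch from earlier, dropping x from the graph to reflect that it is held fixed in the regression:

# With x held fixed, the only remaining route from a to y is a -> k -> y,
# whose product is -3 * 3 = -9.
path_effect(G.subgraph(set(G.nodes) - {'x'}), 'a', 'y')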

One nice thing about this scheme is that if we know the causal structure ahead of time, we need not condition on every last variable. We needn't even condition on every variable along a confounding path; conditioning on a single variable in each confounding path is sufficient.

This property comes in handy in scenarios where we don't have perfect information: if we weren't able to measure a, or just forgot to measure it, k is a sufficiently good variable to condition on.
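To make that concrete, here is a small loop over the two single-variable adjustments available on the backdoor path; either one should leave the coefficient on x close to -10.

# Adjust for one variable at a time along the backdoor path x <- a -> k -> y.
for z in ['a', 'k']:
    coeff_x = OLS.from_formula(f'y ~ x + {z}', data=df).fit().params['x']
    print(f'adjusting for {z}: coefficient on x = {coeff_x:.3f}')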

What would happen if we conditioned on a variable that wasn't involved in the causal path from x to y? Let's try conditioning on g.

model = OLS.from_formula('y ~ x + g', data=df)
results = model.fit()
results.params
Intercept    0.369591
x           -1.572082
g           -0.556936
dtype: float64

We are way off! g is not a confounder of x and y, so conditioning on it buys us nothing there; worse, g is a descendant of d, which sits on the causal path from x to y, so conditioning on it partially blocks the very effect we are trying to measure. Conditioning on g is the wrong thing to do.
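As a closing summary, here are the regressions from this notebook run side by side; only the adjustment sets that block the backdoor path, without conditioning on a descendant of the mediator d, recover a coefficient near -10 on x.

# Compare the coefficient on x under each adjustment strategy used above.
formulas = {
    'nothing': 'y ~ x',
    'a': 'y ~ x + a',
    'k': 'y ~ x + k',
    'g': 'y ~ x + g',
}
for name, formula in formulas.items():
    coeff_x = OLS.from_formula(formula, data=df).fit().params['x']
    print(f'adjusting for {name}: coefficient on x = {coeff_x:.3f}')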