In [1]:
# Load the data
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
from scipy.stats import norm, ks_2samp  # no scipy - comment out
from custom import load_data as cf
from custom import ecdf

%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

Introduction

In this notebook, we will walk through a hacker's approach to statistical thinking, as applied to network analysis.

Statistics in a Nutshell

All of statistics can be broken down into two activities:

  • Descriptively summarizing data. (a.k.a. "descriptive statistics")
  • Figuring out whether something happened by random chance. (a.k.a. "inferential statistics")

Descriptive Statistics

  • Centrality measures: mean, median, mode
  • Variance measures: inter-quartile range (IQR), variance and standard deviation

Inferential Statistics

  • Models of Randomness (see below)
  • Hypothesis Testing
  • Fitting Statistical Models

Load Data

Let's load a protein-protein interaction network dataset.

This undirected network contains protein interactions contained in yeast. Research showed that proteins with a high degree were more important for the surivial of the yeast than others. A node represents a protein and an edge represents a metabolic interaction between two proteins. The network contains loops.

In [2]:
# Read in the data.
# Note from above that we have to skip the first two rows, and that there's no header column,and that the edges are
# delimited by spaces in between the nodes. Hence the syntax below:
G = cf.load_propro_network()

Exercise

Compute some basic descriptive statistics about the graph, namely:

  • the number of nodes,
  • the number of edges,
  • the graph density,
  • the distribution of degree centralities in the graph,
In [3]:
# Number of nodes:
len(G.nodes())
Out[3]:
1870
In [4]:
# Number of edges:
len(G.edges())
Out[4]:
2277
In [5]:
# Graph density:
nx.density(G)
Out[5]:
0.0013029931073553016
In [6]:
# Degree centrality distribution:
list(nx.degree_centrality(G).values())[0:5]
Out[6]:
[0.002675227394328518,
 0.003210272873194221,
 0.010700909577314071,
 0.001070090957731407,
 0.002675227394328518]

How are protein-protein networks formed? Are they formed by an Erdos-Renyi process, or something else?

In the G(n, p) model, a graph is constructed by connecting nodes randomly. Each edge is included in the graph with probability p independent from every other edge.

If protein-protein networks are formed by an E-R process, then we would expect that properties of the protein-protein graph would look statistically similar to those of an actual E-R graph.

Exercise

Make an ECDF of the degree centralities for the protein-protein interaction graph, and the E-R graph.

  • The construction of an E-R graph requires a value for n and p.
  • A reasonable number for n is the number of nodes in our protein-protein graph.
  • A reasonable value for p might be the density of the protein-protein graph.
In [7]:
ppG_deg_centralities = list(nx.degree_centrality(G).values())
plt.plot(*ecdf(ppG_deg_centralities))

erG = nx.erdos_renyi_graph(n=len(G.nodes()), p=nx.density(G))
erG_deg_centralities = list(nx.degree_centrality(erG).values())
plt.plot(*ecdf(erG_deg_centralities))

plt.show()

From visualizing these two distributions, it is clear that they look very different. How do we quantify this difference, and statistically test whether the protein-protein graph could have arisen under an Erdos-Renyi model?

One thing we might observe is that the variance, that is the "spread" around the mean, differs between the E-R model compared to our data. Therefore, we can compare variance of the data to the distribtion of variances under an E-R model.

This is essentially following the logic of statistical inference by 'hacking' (not to be confused with the statistical bad practice of p-hacking).

Exercise

Fill in the skeleton code below to simulate 100 E-R graphs.

In [8]:
# 1. Generate 100 E-R graph degree centrality variance measurements and store them.
# Takes ~50 seconds or so.
n_sims = 100
er_vars = np.zeros(n_sims)  # variances for n simulaed E-R graphs.
for i in range(n_sims):
    erG = nx.erdos_renyi_graph(n=len(G.nodes()), p=nx.density(G))
    erG_deg_centralities = list(nx.degree_centrality(erG).values())
    er_vars[i] = np.var(erG_deg_centralities)
In [9]:
# 2. Compute the test statistic that is going to be used for the hypothesis test.
# Hint: numpy has a "var" function implemented that computes the variance of a distribution of data.
ppG_var = np.var(ppG_deg_centralities)
In [10]:
# Do a quick visual check
n, bins, patches = plt.hist(er_vars)
plt.vlines(ppG_var, ymin=0, ymax=max(n), color='red', lw=2)
Out[10]:
<matplotlib.collections.LineCollection at 0x10d54df98>

Visually, it should be quite evident that the protein-protein graph did not come from an E-R distribution. Statistically, we can also use the hypothesis test procedure to quantitatively test this, using our simulated E-R data.

In [11]:
# Conduct the hypothesis test.
ppG_var > np.percentile(er_vars, 99)  # we can only use the 99th percentile, because there are only 100 data points.
Out[11]:
True

Another way to do this is to use the 2-sample Kolmogorov-Smirnov test implemented in the scipy.stats module. From the docs:

This tests whether 2 samples are drawn from the same distribution. Note that, like in the case of the one-sample K-S test, the distribution is assumed to be continuous.

This is the two-sided test, one-sided tests are not implemented. The test uses the two-sided asymptotic Kolmogorov-Smirnov distribution.

If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same.

As an example to convince yourself that this test works, run the synthetic examples below.

In [12]:
# Scenario 1: Data come from the same distributions.
# Notice the size of the p-value.
dist1 = npr.random(size=(100))
dist2 = npr.random(size=(100))

ks_2samp(dist1, dist2)
# Note how the p-value, which ranges between 0 and 1, is likely to be greater than a commonly-accepted
# threshold of 0.05
Out[12]:
Ks_2sampResult(statistic=0.18, pvalue=0.06909243488939824)
In [13]:
# Scenario 2: Data come from different distributions. 
# Note the size of the KS statistic, and the p-value.

dist1 = norm(3, 1).rvs(100)
dist2 = norm(5, 1).rvs(100)

ks_2samp(dist1, dist2)
# Note how the p-value is likely to be less than 0.05, and even more stringent cut-offs of 0.01 or 0.001.
Out[13]:
Ks_2sampResult(statistic=0.65, pvalue=1.7451213199870204e-19)

Exercise

Now, conduct the K-S test for one synthetic graph and the data.

In [14]:
# Now try it on the data distribution
ks_2samp(erG_deg_centralities, ppG_deg_centralities)
Out[14]:
Ks_2sampResult(statistic=0.2438502673796792, pvalue=4.15103214600513e-49)

Networks may be high-dimensional objects, but the logic for inference on network data essentially follows the same logic as for 'regular' data:

  • Identify a model of 'randomness' that may model how your data may have been generated.
  • Compute a "test statistic" for your data and the model.
  • Compute the probability of observing the data's test statistic under the model.

Further Reading

Jake Vanderplas' "Statistics for Hackers" slides: https://speakerdeck.com/jakevdp/statistics-for-hackers

Allen Downey's "There is Only One Test": http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html

In [15]: