# Load the data import pandas as pd import networkx as nx import matplotlib.pyplot as plt import numpy as np import numpy.random as npr from scipy.stats import norm, ks_2samp # no scipy - comment out from custom import load_data as cf from custom import ecdf %matplotlib inline %load_ext autoreload %autoreload 2 %config InlineBackend.figure_format = 'retina'
In this notebook, we will walk through a hacker's approach to statistical thinking, as applied to network analysis.
All of statistics can be broken down into two activities:
Let's load a protein-protein interaction network dataset.
This undirected network contains protein interactions contained in yeast. Research showed that proteins with a high degree were more important for the surivial of the yeast than others. A node represents a protein and an edge represents a metabolic interaction between two proteins. The network contains loops.
# Read in the data. # Note from above that we have to skip the first two rows, and that there's no header column,and that the edges are # delimited by spaces in between the nodes. Hence the syntax below: G = cf.load_propro_network()
Compute some basic descriptive statistics about the graph, namely:
# Number of nodes: len(G.nodes())
# Number of edges: len(G.edges())
# Graph density: nx.density(G)
# Degree centrality distribution: list(nx.degree_centrality(G).values())[0:5]
[0.002675227394328518, 0.003210272873194221, 0.010700909577314071, 0.001070090957731407, 0.002675227394328518]
How are protein-protein networks formed? Are they formed by an Erdos-Renyi process, or something else?
In the G(n, p) model, a graph is constructed by connecting nodes randomly. Each edge is included in the graph with probability p independent from every other edge.
If protein-protein networks are formed by an E-R process, then we would expect that properties of the protein-protein graph would look statistically similar to those of an actual E-R graph.
Make an ECDF of the degree centralities for the protein-protein interaction graph, and the E-R graph.
nis the number of nodes in our protein-protein graph.
pmight be the density of the protein-protein graph.
ppG_deg_centralities = list(nx.degree_centrality(G).values()) plt.plot(*ecdf(ppG_deg_centralities)) erG = nx.erdos_renyi_graph(n=len(G.nodes()), p=nx.density(G)) erG_deg_centralities = list(nx.degree_centrality(erG).values()) plt.plot(*ecdf(erG_deg_centralities)) plt.show()
From visualizing these two distributions, it is clear that they look very different. How do we quantify this difference, and statistically test whether the protein-protein graph could have arisen under an Erdos-Renyi model?
One thing we might observe is that the variance, that is the "spread" around the mean, differs between the E-R model compared to our data. Therefore, we can compare variance of the data to the distribtion of variances under an E-R model.
This is essentially following the logic of statistical inference by 'hacking' (not to be confused with the statistical bad practice of p-hacking).
# 1. Generate 100 E-R graph degree centrality variance measurements and store them. # Takes ~50 seconds or so. n_sims = 100 er_vars = np.zeros(n_sims) # variances for n simulaed E-R graphs. for i in range(n_sims): erG = nx.erdos_renyi_graph(n=len(G.nodes()), p=nx.density(G)) erG_deg_centralities = list(nx.degree_centrality(erG).values()) er_vars[i] = np.var(erG_deg_centralities)
# 2. Compute the test statistic that is going to be used for the hypothesis test. # Hint: numpy has a "var" function implemented that computes the variance of a distribution of data. ppG_var = np.var(ppG_deg_centralities)
# Do a quick visual check n, bins, patches = plt.hist(er_vars) plt.vlines(ppG_var, ymin=0, ymax=max(n), color='red', lw=2)
<matplotlib.collections.LineCollection at 0x10d54df98>
Visually, it should be quite evident that the protein-protein graph did not come from an E-R distribution. Statistically, we can also use the hypothesis test procedure to quantitatively test this, using our simulated E-R data.
# Conduct the hypothesis test. ppG_var > np.percentile(er_vars, 99) # we can only use the 99th percentile, because there are only 100 data points.
Another way to do this is to use the 2-sample Kolmogorov-Smirnov test implemented in the
scipy.stats module. From the docs:
This tests whether 2 samples are drawn from the same distribution. Note that, like in the case of the one-sample K-S test, the distribution is assumed to be continuous.
This is the two-sided test, one-sided tests are not implemented. The test uses the two-sided asymptotic Kolmogorov-Smirnov distribution.
If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same.
As an example to convince yourself that this test works, run the synthetic examples below.
# Scenario 1: Data come from the same distributions. # Notice the size of the p-value. dist1 = npr.random(size=(100)) dist2 = npr.random(size=(100)) ks_2samp(dist1, dist2) # Note how the p-value, which ranges between 0 and 1, is likely to be greater than a commonly-accepted # threshold of 0.05
# Scenario 2: Data come from different distributions. # Note the size of the KS statistic, and the p-value. dist1 = norm(3, 1).rvs(100) dist2 = norm(5, 1).rvs(100) ks_2samp(dist1, dist2) # Note how the p-value is likely to be less than 0.05, and even more stringent cut-offs of 0.01 or 0.001.
# Now try it on the data distribution ks_2samp(erG_deg_centralities, ppG_deg_centralities)
Networks may be high-dimensional objects, but the logic for inference on network data essentially follows the same logic as for 'regular' data:
Jake Vanderplas' "Statistics for Hackers" slides: https://speakerdeck.com/jakevdp/statistics-for-hackers
Allen Downey's "There is Only One Test": http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html