Chapter 4: Hubs

%load_ext autoreload
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import warnings
warnings.filterwarnings('ignore')


Introduction

from IPython.display import YouTubeVideo



Because of the relational structure in a graph, we can begin to think about "importance" of a node that is induced because of its relationships to the rest of the nodes in the graph.

Before we go on, let's think about a pertinent and contemporary example.

An example: contact tracing

At the time of writing (April 2020), finding important nodes in a graph has actually taken on a measure of importance that we might not have appreciated before. With the COVID-19 virus spreading, contact tracing has become quite important. In an infectious disease contact network, where individuals are nodes and contact between individuals of some kind are the edges, an "important" node in this contact network would be an individual who was infected who also was in contact with many people during the time that they were infected.

Our dataset: "Sociopatterns"

The dataset that we will use in this chapter is the "sociopatterns network" dataset. Incidentally, it's also about infectious diseases.

Note to readers: We originally obtained the dataset in 2014 from the Konect website. It is unfortunately no longer available. The sociopatterns.org website hosts an edge list of a slightly different format, so it will look different from what we have here.

From the original description on Konect, here is the description of the dataset:

This network describes the face-to-face behavior of people during the exhibition INFECTIOUS: STAY AWAY in 2009 at the Science Gallery in Dublin. Nodes represent exhibition visitors; edges represent face-to-face contacts that were active for at least 20 seconds. Multiple edges between two nodes are possible and denote multiple contacts. The network contains the data from the day with the most interactions.

To simplify the network, we have represented only the last contact between individuals.

from nams import load_data as cf


It is loaded as an undirected graph object:

type(G)

networkx.classes.graph.Graph


As usual, before proceeding with any analysis, we should know basic graph statistics.

len(G.nodes()), len(G.edges())

(410, 2765)


A Measure of Importance: "Number of Neighbors"

One measure of importance of a node is the number of neighbors that the node has. What is a neighbor? We will work with the following definition:

The neighbor of a node is connected to that node by an edge.

Let's explore this concept, using the NetworkX API.

Every NetworkX graph provides a G.neighbors(node) class method, which lets us query a graph for the number of neighbors of a given node:

G.neighbors(7)

<dict_keyiterator at 0x7fd967486f90>


It returns a generator that doesn't immediately return the exact neighbors list. This means we cannot know its exact length, as it is a generator. If you tried to do:

len(G.neighbors(7))


you would get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-72c56971d077> in <module>
----> 1 len(G.neighbors(7))

TypeError: object of type 'dict_keyiterator' has no len()


Hence, we will need to cast it as a list in order to know both its length and its members:

list(G.neighbors(7))

[5, 6, 21, 22, 37, 48, 51]


In the event that some nodes have an extensive list of neighbors, then using the dict_keyiterator is potentially a good memory-saving technique, as it lazily yields the neighbors.

Exercise: Rank-ordering the number of neighbors a node has

Since we know how to get the list of nodes that are neighbors of a given node, try this following exercise:

Can you create a ranked list of the importance of each individual, based on the number of neighbors they have?

Here are a few hints to help:

• You could consider using a pandas Series. This would be a modern and idiomatic way of approaching the problem.
• You could also consider using Python's sorted function.
from nams.solutions.hubs import rank_ordered_neighbors



The original implementation looked like the following

from nams.solutions.hubs import rank_ordered_neighbors_original
# rank_ordered_neighbors_original??


And another implementation that uses generators:

from nams.solutions.hubs import rank_ordered_neighbors_generator
# rank_ordered_neighbors_generator??


Generalizing "neighbors" to arbitrarily-sized graphs

The concept of neighbors is simple and appealing, but it leaves us with a slight point of dissatisfaction: it is difficult to compare graphs of different sizes. Is a node more important solely because it has more neighbors? What if it were situated in an extremely large graph? Would we not expect it to have more neighbors?

As such, we need a normalization factor. One reasonable one, in fact, is the number of nodes that a given node could possibly be connected to. By taking the ratio of the number of neighbors a node has to the number of neighbors it could possibly have, we get the degree centrality metric.

Formally defined, the degree centrality of a node (let's call it $d$) is the number of neighbors that a node has (let's call it $n$) divided by the number of neighbors it could possibly have (let's call it $N$):

d = \frac{n}{N}

NetworkX provides a function for us to calculate degree centrality conveniently:

import networkx as nx
import pandas as pd
dcs = pd.Series(nx.degree_centrality(G))
dcs

100    0.070905
101    0.031785
102    0.039120
103    0.063570
104    0.041565
...
89     0.009780
91     0.051345
96     0.036675
99     0.034230
98     0.002445
Length: 410, dtype: float64


nx.degree_centrality(G) returns to us a dictionary of key-value pairs, where the keys are node IDs and values are the degree centrality score. To save on output length, I took the liberty of casting it as a pandas Series to make it easier to display.

Incidentally, we can also sort the series to find the nodes with the highest degree centralities:

dcs.sort_values(ascending=False)

51     0.122249
272    0.114914
235    0.105134
195    0.105134
265    0.083130
...
390    0.002445
135    0.002445
398    0.002445
186    0.002445
98     0.002445
Length: 410, dtype: float64


Does the list order look familiar? It should, since the numerator of the degree centrality metric is identical to the number of neighbors, and the denominator is a constant.

Distribution of graph metrics

One important concept that you should come to know is that the distribution of node-centric values can characterize classes of graphs.

What do we mean by "distribution of node-centric values"? One would be the degree distribution, that is, the collection of node degree values in a graph.

Generally, you might be familiar with plotting a histogram to visualize distributions of values, but in this book, we are going to avoid histograms like the plague. I detail a lot of reasons in a blog post I wrote in 2018, but the main points are that:

1. It's easier to lie with histograms.
2. You get informative statistical information (median, IQR, extremes/outliers) more easily.

Exercise: Degree distribution

In this next exercise, we are going to get practice visualizing these values using empirical cumulative distribution function plots.

I have written for you an ECDF function that you can use already. Its API looks like the following:

x, y = ecdf(list_of_values)


giving you x and y values that you can directly plot.

The exercise prompt is this:

Plot the ECDF of the degree centrality and degree distributions.

First do it for degree centrality:

from nams.functions import ecdf
from nams.solutions.hubs import ecdf_degree_centrality

ecdf_degree_centrality(G)


Now do it for degree:

from nams.solutions.hubs import ecdf_degree

ecdf_degree(G)


The fact that they are identically-shaped should not surprise you!

The denominator $N$ in the degree centrality definition is "the number of nodes that a node could possibly be connected to". Can you think of two ways $N$ be defined?

from nams.solutions.hubs import num_possible_neighbors

#### UNCOMMENT TO SEE MY ANSWER
# print(num_possible_neighbors())


Exercise: Circos Plotting

Let's get some practice with the nxviz API.

Visualize the graph G, while ordering and colouring them by the 'order' node attribute.

from nams.solutions.hubs import circos_plot

circos_plot(G)


And here's an alternative view using an arc plot.

import nxviz as nv
nv.arc(G, sort_by="order", node_color_by="order")

<AxesSubplot:>


Exercise: Visual insights

Since we know that node colour and order are by the "order" in which the person entered into the exhibit, what does this visualization tell you?

from nams.solutions.hubs import visual_insights

#### UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
# print(visual_insights())


Exercise: Investigating degree centrality and node order

One of the insights that we might have gleaned from visualizing the graph is that the nodes that have a high degree centrality might also be responsible for the edges that criss-cross the Circos plot. To test this, plot the following:

• x-axis: node degree centrality
• y-axis: maximum difference between the neighbors' orders (a node attribute) and the node's order.
from nams.solutions.hubs import dc_node_order

dc_node_order(G)


The somewhat positive correlation between the degree centrality might tell us that this trend holds true. A further applied question would be to ask what behaviour of these nodes would give rise to this pattern. Are these nodes actually exhibit staff? Or is there some other reason why they are staying so long? This, of course, would require joining in further information that we would overlay on top of the graph (by adding them as node or edge attributes) before we might make further statements.

Reflections

In this chapter, we defined a metric of node importance: the degree centrality metric. In the example we looked at, it could help us identify potential infectious agent superspreaders in a disease contact network. In other settings, it might help us spot:

• message amplifiers/influencers in a social network, and
• potentially crowded airports that have lots of connections into and out of it (still relevant to infectious disease spread!)
• and many more!

What other settings can you think of in which the number of neighbors that a node has can become a metric of importance for the node?

Solutions

Here are the solutions to the exercises above.

from nams.solutions import hubs
import inspect

print(inspect.getsource(hubs))

"""Solutions to Hubs chapter."""

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import nxviz as nv
from nxviz import annotate

from nams import ecdf

def rank_ordered_neighbors(G):
"""
Uses a pandas Series to help with sorting.
"""
s = pd.Series({n: len(list(G.neighbors(n))) for n in G.nodes()})
return s.sort_values(ascending=False)

def rank_ordered_neighbors_original(G):
"""Original implementation of rank-ordered number of neighbors."""
return sorted(G.nodes(), key=lambda x: len(list(G.neighbors(x))), reverse=True)

def rank_ordered_neighbors_generator(G):
"""
Rank-ordered generator of neighbors.

Contributed by @dgerlanc.

"""
gen = ((len(list(G.neighbors(x))), x) for x in G.nodes())
return sorted(gen, reverse=True)

def ecdf_degree_centrality(G):
"""ECDF of degree centrality."""
x, y = ecdf(list(nx.degree_centrality(G).values()))
plt.scatter(x, y)
plt.xlabel("degree centrality")
plt.ylabel("cumulative fraction")

def ecdf_degree(G):
"""ECDF of degree."""
num_neighbors = [len(list(G.neighbors(n))) for n in G.nodes()]
x, y = ecdf(num_neighbors)
plt.scatter(x, y)
plt.xlabel("degree")
plt.ylabel("cumulative fraction")

def num_possible_neighbors():
"""Answer to the number of possible neighbors for a node."""
return r"""
The number of possible neighbors can either be defined as:

1. All other nodes but myself
2. All other nodes and myself

If $K$ is the number of nodes in the graph,
then if defined as (1), $N$ (the denominator) is $K - 1$.
If defined as (2), $N$ is equal to $K$.
"""

def circos_plot(G):
"""Draw a Circos Plot of the graph."""
# c = CircosPlot(G, node_order="order", node_color="order")
# c.draw()
nv.circos(G, sort_by="order", node_color_by="order")
annotate.node_colormapping(G, color_by="order")

def visual_insights():
"""Visual insights from the Circos Plot."""
return """
We see that most edges are "local" with nodes
that are proximal in order.
The nodes that are weird are the ones that have connections
with individuals much later than itself,
crossing larger jumps in order/time.

Additionally, if you recall the ranked list of degree centralities,
it appears that these nodes that have the highest degree centrality scores
are also the ones that have edges that cross the circos plot.
"""

def dc_node_order(G):
"""Comparison of degree centrality by maximum difference in node order."""
import matplotlib.pyplot as plt
import pandas as pd
import networkx as nx

# Degree centralities
dcs = pd.Series(nx.degree_centrality(G))

# Maximum node order difference
maxdiffs = dict()
for n, d in G.nodes(data=True):
diffs = []
for nbr in G.neighbors(n):
diffs.append(abs(G.nodes[nbr]["order"] - d["order"]))
maxdiffs[n] = max(diffs)
maxdiffs = pd.Series(maxdiffs)

ax = pd.DataFrame(dict(degree_centrality=dcs, max_diff=maxdiffs)).plot(
x="degree_centrality", y="max_diff", kind="scatter"
)