Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 5-Spectral graph theory
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: July 15, 2024
Copyright: © 2024 Sebastien Roch
$\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bfbeta}{\boldsymbol{\beta}}$ $\newcommand{\bflambda}{\boldsymbol{\lambda}}$ $\newcommand{\bgamma}{\boldsymbol{\gamma}}$ $\newcommand{\bsigma}{{\boldsymbol{\sigma}}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\btheta}{{\boldsymbol{\theta}}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$ $\renewcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bp}{\mathbf{p}}$ $\renewcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\bfg}{\mathbf{g}}$ $\newcommand{\bfh}{\mathbf{h}}$ $\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}$ $\renewcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\pa}{\mathrm{pa}}$ $\newcommand{\Z}{\mathcal{Z}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\cE}{\mathcal{E}}$ $\newcommand{\cP}{\mathcal{P}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bLambda}{\boldsymbol{\Lambda}}$ $\newcommand{\cov}{\mathrm{Cov}}$ $\newcommand{\bfk}{\mathbf{k}}$ $\newcommand{\idx}[1]{}$ $\newcommand{\xdi}{}$
Motivating example: uncovering social groups¶
In this chapter, we analyze datasets in the form of networks. As motivation, we first look at the Karate Club dataset.
From Wikipedia:
A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "John A" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.
import networkx as nx
G = nx.karate_club_graph()
plt.figure(figsize=(6,6))
pos = nx.spring_layout(G, k=0.7, iterations=50, seed=42)
nx.draw_networkx(G, pos=pos, node_size=300, node_color='black', font_color='white')
plt.axis('off')
plt.show()
Our goal:
identify natural sub-groups in the network
That is, we want to find groups of nodes that have many links between them, but relatively few with the other nodes.
It will turn out that the eigenvectors of the Laplacian matrix, a matrix naturally associated to the graph, contain useful information about such communities.
Background: basic concepts in graph theory¶
NUMERICAL CORNER: In Python, the NetworkX
package provides many functionalities for defining, modifying and plotting graphs. For instance, many standard graphs can be defined conveniently. The petersen_graph()
function defines the Petersen graph.
G = nx.petersen_graph()
The graph can be plotted using the function networkx.draw_networkx()
. Recall that in NumPy array indices start at $0$. Consistently, NetworkX also names vertices starting at $0$. Note, however, that this conflicts with our mathematical conventions.
nx.draw_networkx(G, node_color='black', font_color='white', node_size=200)
plt.axis('off')
plt.show()
Other standard graphs can be generated with special functions, e.g. complete graphs using complete_graph()
. See here for a complete list.
G = nx.complete_graph(3)
nx.draw_networkx(G, node_color='black', font_color='white')
plt.axis('off')
plt.show()
G = nx.path_graph(10)
nx.draw_networkx(G, node_color='black', font_color='white')
plt.axis('off')
plt.show()
G.number_of_nodes() # number of nodes
10
G.number_of_edges() # number of edges
9
G.has_node(7) # checks whether the graph has a particular vertex
True
G.has_node(10)
False
G.has_edge(0, 1) # checks whether the graph has a particular edge
True
G.has_edge(0, 2)
False
[n for n in G.neighbors(2)] # returns a list of neighbors of the specified vertex
[1, 3]
nx.is_connected(G) # checks whether the graph is connected
True
[cc for cc in nx.connected_components(G)] # returns the connected components
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}]
for e in G.edges():
print(e)
(0, 1) (1, 2) (2, 3) (3, 4) (4, 5) (5, 6) (6, 7) (7, 8) (8, 9)
Another way of specifying a graph is to start with an empty graph with a given number of vertices and then add edges one by one. The following command creates a graph with $4$ vertices and no edge (see empty_graph()
).
G = nx.empty_graph(4)
G.add_edge(0, 1)
G.add_edge(2, 3)
G.add_edge(0, 3)
G.add_edge(3, 0)
nx.draw_networkx(G, node_color='black', font_color='white')
plt.axis('off')
plt.show()
NUMERICAL CORNER: The package NetworkX
also supports digraphs.
G = nx.DiGraph()
nx.add_star(G, [0, 1, 2, 3, 4])
nx.draw_networkx(G, node_color='black', font_color='white')
plt.axis('off')
plt.show()
Another way of specifying a digraph is to start with an empty graph with a given number of vertices and then add edges one by one (compare to the undirected case above).
G = nx.DiGraph()
G.add_edge(0, 1)
G.add_edge(2, 3)
G.add_edge(0, 3)
G.add_edge(3, 0)
G.add_edge(1, 1)
nx.draw_networkx(G, node_color='black', font_color='white')
plt.axis('off')
plt.show()
Note that edges in both directions are depicted here with a double-arrow. Also the self-loop is hard to see. We can use
networkx.draw_networkx_edges()
(together with
networkx.draw_networkx_nodes()
and
networkx.draw_networkx_labels()
)
to have more control over the drawing of the edges.
pos = nx.spring_layout(G, seed=42)
nx.draw_networkx_nodes(G, pos, node_color='black')
nx.draw_networkx_labels(G, pos, font_color='white')
nx.draw_networkx_edges(G, pos, connectionstyle="arc3,rad=0.3")
plt.axis('off')
plt.show()
NUMERICAL CORNER: Using NetworkX
, the adjacency matrix of a graph can be obtained with adjacency_matrix()
. By default, it returns a SciPy
sparse matrix. Alternatively, one can get a regular array with toarray()
.
G = nx.complete_graph(4)
A = nx.adjacency_matrix(G)
print(A)
(0, 1) 1 (0, 2) 1 (0, 3) 1 (1, 0) 1 (1, 2) 1 (1, 3) 1 (2, 0) 1 (2, 1) 1 (2, 3) 1 (3, 0) 1 (3, 1) 1 (3, 2) 1
A = nx.adjacency_matrix(G).toarray()
print(A)
[[0 1 1 1] [1 0 1 1] [1 1 0 1] [1 1 1 0]]
The incidence matrix is obtained with incidence_matrix()
-- again as a sparse array.
B = nx.incidence_matrix(G)
print(B)
(0, 0) 1.0 (1, 0) 1.0 (0, 1) 1.0 (2, 1) 1.0 (0, 2) 1.0 (3, 2) 1.0 (1, 3) 1.0 (2, 3) 1.0 (1, 4) 1.0 (3, 4) 1.0 (2, 5) 1.0 (3, 5) 1.0
B = nx.incidence_matrix(G).toarray()
print(B)
[[1. 1. 1. 0. 0. 0.] [1. 0. 0. 1. 1. 0.] [0. 1. 0. 1. 0. 1.] [0. 0. 1. 0. 1. 1.]]
NUMERICAL CORNER: We revisit an earlier directed graph.
G = nx.DiGraph()
G.add_edge(0, 1)
G.add_edge(2, 3)
G.add_edge(0, 3)
G.add_edge(3, 0)
G.add_edge(1,1)
We compute the adjacency and incidence matrices. For the incidence matrix, one must specify oriented=True
for the oriented version.
A = nx.adjacency_matrix(G).toarray()
print(A)
[[0 1 0 1] [0 1 0 0] [0 0 0 1] [1 0 0 0]]
B = nx.incidence_matrix(G, oriented=True).toarray()
print(B)
[[-1. -1. 0. 0. 1.] [ 1. 0. 0. 0. 0.] [ 0. 0. 0. -1. 0.] [ 0. 1. 0. 1. -1.]]
Revisiting an ealier undirected graph, we note that incidence_matrix()
can also produce an arbitrary oriented incidence matrix by using the oriented=True
option.
G = nx.empty_graph(4)
G.add_edge(0, 1)
G.add_edge(2, 3)
G.add_edge(0, 3)
G.add_edge(3, 0)
B = nx.incidence_matrix(G, oriented=True).toarray()
print(B)
[[-1. -1. 0.] [ 1. 0. 0.] [ 0. 0. -1.] [ 0. 1. 1.]]
Spectral properties of the Laplacian matrix¶
NUMERICAL CORNER: One use of the spectral decomposition of the Laplacian matrix is in graph drawing$\idx{graph drawing}\xdi$. We illustrate this next. Given a graph $G = (V, E)$, it is not clear a priori how to draw it in the plane since the only information available are adjacencies of vertices. One approach is just to position the vertices at random. The function networkx.draw()
or networkx.draw_networkx()
can take as input different graph layout functions that return an $x$ and $y$-coordinate for each vertex.
We will test this on a grid graph. We use grid_2d_graph()
to construct such a graph.
G = nx.grid_2d_graph(4,7)
One layout approach is to choose random locations for the nodes. Specifically, for every node, a position is generated by choosing each coordinate uniformly at random on the interval $[0,1]$.
nx.draw_networkx(G, pos=nx.random_layout(G, seed=535), with_labels=False,
node_size=50, node_color='black', width=0.5)
plt.axis('off')
plt.show()
Clearly, this is hard to read.
Another approach is to map the vertices to two eigenvectors, similarly to what we did for dimensionality reduction. The eigenvector associated to $\mu_1$ is constant and therefore not useful for drawing. We try the next two. We use the Laplacian matrix. This is done using spectral_layout()
.
nx.draw_networkx(G, pos=nx.spectral_layout(G), with_labels=False,
node_size=50, node_color='black', width=0.5)
plt.axis('off')
plt.show()
Interestingly, the outcome is provides a much more natural drawing of the graph, revealing its underlying structure as a grid. We will come back later to try to explain this, after we have developed further understanding of the spectral properties of the Laplacian matrix.
NUMERICAL CORNER: We construct a graph with two connected components and check the results above. We work directly with the adjacency matrix.
A = np.array([[0, 1, 1, 0, 0],
[1, 0, 1, 0, 0],
[1, 1, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 0, 1, 0]])
print(A)
[[0 1 1 0 0] [1 0 1 0 0] [1 1 0 0 0] [0 0 0 0 1] [0 0 0 1 0]]
Note the block structure.
The degrees can be obtained by summing the rows of the adjacency matrix.
degrees = A.sum(axis=1)
print(degrees)
[2 2 2 1 1]
D = np.diag(degrees)
print(D)
[[2 0 0 0 0] [0 2 0 0 0] [0 0 2 0 0] [0 0 0 1 0] [0 0 0 0 1]]
L = D - A
print(L)
[[ 2 -1 -1 0 0] [-1 2 -1 0 0] [-1 -1 2 0 0] [ 0 0 0 1 -1] [ 0 0 0 -1 1]]
print(LA.eigvals(L))
[ 3.00000000e+00 -3.77809194e-16 3.00000000e+00 2.00000000e+00 0.00000000e+00]
Observe that (up to numerical error) there are two $0$ eigenvalues and that the largest eigenvalue is greater or equal than the maximum degree plus one.
To compute the Laplacian matrix, one can also use the function laplacian_matrix()
. For example, the Laplacian of the Petersen graph is the following:
G = nx.petersen_graph()
L = nx.laplacian_matrix(G).toarray()
print(L)
[[ 3 -1 0 0 -1 -1 0 0 0 0] [-1 3 -1 0 0 0 -1 0 0 0] [ 0 -1 3 -1 0 0 0 -1 0 0] [ 0 0 -1 3 -1 0 0 0 -1 0] [-1 0 0 -1 3 0 0 0 0 -1] [-1 0 0 0 0 3 0 -1 -1 0] [ 0 -1 0 0 0 0 3 0 -1 -1] [ 0 0 -1 0 0 -1 0 3 0 -1] [ 0 0 0 -1 0 -1 -1 0 3 0] [ 0 0 0 0 -1 0 -1 -1 0 3]]
print(LA.eigvals(L))
[ 5.00000000e+00 2.00000000e+00 -2.80861083e-17 5.00000000e+00 5.00000000e+00 2.00000000e+00 2.00000000e+00 5.00000000e+00 2.00000000e+00 2.00000000e+00]
NUMERICAL CORNER: This is perhaps easiest to see on a path graph. Recall that NetworkX
numbers vertices $0,\ldots,n-1$.
G = nx.path_graph(10)
We plot the second Laplacian eigenvector (i.e., the eigenvector of the Laplacian matrix corresponding to the second smallest eigenvalue). We use numpy.argsort()
to find the index of the second smallest eigenvalue. Because indices start at 0
, we want entry 1
of the output of np.argsort()
.
L = nx.laplacian_matrix(G).toarray()
w, v = LA.eigh(L)
y2 = v[:,np.argsort(w)[1]]
plt.plot(y2, c='k')
plt.show()
Application: graph partitioning via spectral clustering¶
NUMERICAL CORNER: (A Random Tree) We illustrate the definitions above on a tree, that is, a connected graph with no cycle. The function networkx.random_tree()
can produce a random one. As before we use a seed
for reproducibility. Again, we use $0,\ldots,n-1$ for the vertex set.
G_tree = nx.random_tree(n=6, seed=111)
nx.draw_networkx(G_tree, pos=nx.circular_layout(G_tree),
node_color='black', font_color='white')
plt.axis('off')
plt.show()
Suppose we take $S = \{0,1,2,3\}$. Then $S^c = \{4,5\}$ and
$$ E(S,S^c) = \{\{1,5\}, \{2,4\}\}. $$The cut ratio is then
$$ \phi(S) = \frac{|E(S,S^c)|}{\min\{|S|,|S^c|\}} = \frac{2}{2} = 1. $$A better cut is given by $S = \{0,1,5\}$. In that case $S^c = \{2,3,4\}$,
$$ E(S,S^c)= \{\{1,3\}\}, $$and
$$ \phi(S) = \frac{|E(S,S^c)|}{\min\{|S|,|S^c|\}} = \frac{1}{3}. $$This is also equal to $\phi_G$. Indeed, in a connected graph with $n$ vertices, the numerator is at least $1$ and the denominator is at most $n/2$, which is achieved here.
NUMERICAL CORNER: We return to the random tree example above. We claimed that $\phi_G = 1/3$. The maximum degree is $\bar{\delta} = 3$. We now compute $\mu_2$. We first compute the Laplacian matrix.
phi_G = 1/3
max_deg = 3
We now compute $\mu_2$. We first compute the Laplacian matrix.
L_tree = nx.laplacian_matrix(G_tree).toarray()
print(L_tree)
[[ 1 -1 0 0 0 0] [-1 3 0 -1 0 -1] [ 0 0 2 -1 -1 0] [ 0 -1 -1 2 0 0] [ 0 0 -1 0 1 0] [ 0 -1 0 0 0 1]]
w, v = LA.eigh(L_tree)
mu_2 = np.sort(w)[1]
print(mu_2)
0.32486912943335317
We check Cheeger's inequalities. The left-hand side is:
(phi_G ** 2) / (2 * max_deg)
0.018518518518518517
The right-hand side is:
2 * phi_G
0.6666666666666666
We implement the graph cutting algorithm above.
We now implement this heuristic in Python. We first write an auxiliary function that takes as input an adjacency matrix, an ordering of the vertices and a value $k$. It returns the cut ratio for the first $k$ vertices in the order.
def cut_ratio(A, order, k):
n = A.shape[0]
edge_boundary = 0
for i in range(k+1):
for j in range(k+1,n):
edge_boundary += A[order[i],order[j]]
denominator = np.minimum(k+1, n-k-1)
return edge_boundary/denominator
Using the cut_ratio
function, we first compute the Laplacian, find the second eigenvector and corresponding order of vertices. Then we compute the cut ratio for every $k$. Finally we output the cut (both $S_k$ and $S_k^c$) corresponding to the minimum, as a tuple of arrays.
def spectral_cut2(A):
n = A.shape[0]
degrees = A.sum(axis=1)
D = np.diag(degrees)
L = D - A
w, v = LA.eigh(L)
order = np.argsort(v[:,np.argsort(w)[1]])
phi = np.zeros(n-1)
for k in range(n-1):
phi[k] = cut_ratio(A, order, k)
imin = np.argmin(phi)
return order[0:imin+1], order[imin+1:n]
Finally, to help visualize the output, we write a function coloring the vertices according to which side of the cut they are on.
def viz_cut(G, s, pos, node_size=100, with_labels=False):
n = G.number_of_nodes()
assign = np.zeros(n)
assign[s] = 1
nx.draw(G, node_color=assign, pos=pos, with_labels=with_labels,
cmap='spring', node_size=node_size, font_color='k')
plt.show()
NUMERICAL CORNER: We will illustrate this on the path graph.
n = 10
G = nx.path_graph(n)
nx.draw_networkx(G, pos=nx.spectral_layout(G),
node_color='black', font_color='white')
plt.axis('off')
plt.show()
We apply our spectral-based cutting algorihtm.
A = nx.adjacency_matrix(G).toarray()
s, sc = spectral_cut2(A)
print(s)
print(sc)
[0 1 2 3 4] [5 6 7 8 9]
pos = nx.spectral_layout(G)
viz_cut(G, s, pos)
Let's try it on the grid graph. Can you guess what the cut will be?
G = nx.grid_2d_graph(4,7)
A = nx.adjacency_matrix(G).toarray()
s, sc = spectral_cut2(A)
pos = nx.spectral_layout(G)
viz_cut(G, s, pos)
NUMERICAL CORNER: We return to the Karate Club dataset.
G = nx.karate_club_graph()
n = G.number_of_nodes()
A = nx.adjacency_matrix(G).toarray()
We seek to find natural sub-communities. We use the spectral properties of the Laplacian. Specifically, we use our spectral_cut2
and viz_cut
functions to compute a good cut and vizualize it.
s, sc = spectral_cut2(A)
print(s)
print(sc)
[18 26 20 14 29 22 24 15 23 25 32 27 9 33 31 28 30 8] [ 2 13 1 19 7 3 12 0 21 17 11 4 10 6 5 16]
plt.figure(figsize=(6,6))
pos = nx.spring_layout(G, k=0.7, iterations=50, seed=42)
viz_cut(G, s, pos, node_size=300, with_labels=True)
It is not trivial to assess the quality of the resulting cut. But this particular example has a known ground-truth community structure (which partly explains its widespread use). Quoting from Wikipedia:
A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "John A" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.
This ground truth is the following. We use numpy.nonzero
to convert it into a cut.
truth = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
s_truth = np.nonzero(truth)
plt.figure(figsize=(6,6))
viz_cut(G, s_truth, pos, node_size=300, with_labels=True)
You can check that our cut perfectly matches the ground truth.
Erdős-Rényi random graph and stochastic blockmodel¶
We implement the generation of an inhomogeneous ER graph using NetworkX. We first initialize a pseudorandom number generator rng
. To determine whether an edge is present between i
and j
, we generate a uniform random variable rng.random()
(see numpy.random.Generator.random
) and add the edge with G.add_edge(i, j)
if the random variable is < M[i, j]
-- an event which indeed occurs with the desired probability (check it!).
def inhomogeneous_er_random_graph(rng, n, M):
G = nx.Graph()
G.add_nodes_from(range(n))
for i in range(n):
for j in range(i + 1, n):
if rng.random() < M[i, j]:
G.add_edge(i, j)
return G
NUMERICAL CORNER: Here is an example usage. We generate probabilities $m_{i,j}$ uniformly at random between $0$ and $1$.
seed = 535
rng = np.random.default_rng(seed)
n = 20
M = rng.random([n, n])
M = (M + M.T) / 2 # ensures symmetry of M (why?)
G = inhomogeneous_er_random_graph(rng, n, M)
We draw the resulting graph.
nx.draw_networkx(G, node_color='black', font_color='white')
plt.axis('off')
plt.show()
The following subroutine generates an ER graph.
def er_random_graph(rng, n, p):
M = p * (np.ones((n, n)) - np.eye(n))
return inhomogeneous_er_random_graph(rng, n, M)
To confirm our previous calculations, below is the implementation of a routine to estimate the edge density for an ER graph with a fixed parameter $p$. Recall that the edge density is defined as the number of edges present divided by the number of possible edges (i.e., the number of pairs of distinct vertices). The routine takes advantage of the Law of Large Numbers by generating a large number of sample graphs, computing their edge density, and then taking the mean.
def estimate_edge_density(rng, n, p, num_samples=100):
total_edges = 0
total_possible_edges = n * (n - 1) / 2
for _ in range(num_samples):
G = er_random_graph(rng, n, p)
total_edges += G.number_of_edges()
average_edges = total_edges / num_samples
edge_density = average_edges / total_possible_edges
return edge_density
NUMERICAL CORNER: On a small example, we indeed get that the edge density is roughly $p$.
n = 10
p = 0.3
num_samples = 1000
estimated_density = estimate_edge_density(rng, n, p, num_samples)
print(f"Estimated edge density for an ER graph with n={n} and p={p}: {estimated_density}")
Estimated edge density for an ER graph with n=10 and p=0.3: 0.3004888888888889
When $n$, the number of vertices, is large, random graphs tend to exhibit large-scale emergent behavior. One classical example involves the probability of being connected in an ER graph. To illustrate, below is code to estimate that probability over a range of edge densities $p$ (with help from Claude and ChatGPT).
def estimate_connected_probability(rng, n, p, num_samples=100):
connected_count = 0
for _ in range(num_samples):
G = er_random_graph(rng, n, p)
if nx.is_connected(G):
connected_count += 1
connected_probability = connected_count / num_samples
return connected_probability
def plot_connected_probability(rng, n, p_values, num_samples=100):
probabilities = []
for p in p_values:
prob = estimate_connected_probability(rng, n, p, num_samples)
probabilities.append(prob)
plt.figure(figsize=(6, 4))
plt.plot(p_values, probabilities, marker='o', color='black')
plt.xlabel('$p$'), plt.ylabel('Estimated probability of being connected')
plt.show()
NUMERICAL CORNER: We run the code for n
equal to 100
. What do you observe?
n = 100
p_values = np.linspace(0, 0.1, 50)
num_samples = 250
plot_connected_probability(rng, n, p_values, num_samples)
The probability of being connected starts out at $0$ when $p$ is small, which is not surprising since it implies that the graph has a relatively small number of edges. But then that probability increases -- rapidly -- to $1$ as $p$ crosses a threshold. This is referred to as the phase transition of the ER graph.
It can be shown rigorously that the transition occurs at roughly $p = \log n/n$. That is:
np.log(n)/n
0.04605170185988092
which is consistent with the plot.
We implement the SBM model. We use blocks numbered $0$ and $1$.
def sbm_random_graph(rng, n, block_assignments, B):
num_blocks = B.shape[0]
Z = np.zeros((n, num_blocks))
for i in range(n):
Z[i, block_assignments[i]] = 1
M = Z @ B @ Z.T
return inhomogeneous_er_random_graph(rng, n, M)
NUMERICAL CORNER: Here is an example usage. We first pick a block assignment at random. Specifically, blocks are assigned randomly with numpy.random.Generator.choice
. It produces two blocks by assigning each vertex with equal probability to either block, independently of all other choices.
n = 50
block_assignments = rng.choice(2, n) # randomly assign vertices to two blocks
B = np.array([[0.8, 0.1], [0.1, 0.8]])
G = sbm_random_graph(rng, n, block_assignments, B)
We draw the graph with colored nodes based on block assignments. The "good" cut is clearly visible in this layout.
plt.figure(figsize=(6,6))
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color=block_assignments, cmap='rainbow',
node_size=200, font_size=10, font_color='white')
plt.show()
We introduce a subroutine which assigns blocks at random as follows. Let $\beta_1, \beta_2 \in [0,1]$ with $\beta_1 + \beta_2 = 1$ be the probability that a vertex belongs respectively to block $1$ and $2$. We collect these probabilities in the following vector
$$ \bbeta = (\beta_1, \beta_2). $$We pick block $z(i) \in \{1,2\}$ for each vertex $1 \leq i \leq n$ according to the distribution $\bbeta$, independently of all other vertices $\neq i$.
def generate_block_assignments(rng, n, beta):
return rng.choice(len(beta), size=n, p=beta)
NUMERICAL CORNER: Here is an example usage.
n = 50
beta = [0.33, 0.67]
B = np.array([[0.5, 0.03], [0.03, 0.4]])
block_assignments = generate_block_assignments(rng, n, beta)
G = sbm_random_graph(rng, n, block_assignments, B)
Observe that the blocks are more unbalanced this time.
plt.figure(figsize=(6,6))
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color=block_assignments, cmap=plt.cm.rainbow,
node_size=200, font_size=10, font_color='white')
plt.show()
To test our spectral partitioning algorithm, we run spectral_cut2
, which indeed recovers the ground truth.
A = nx.adjacency_matrix(G).toarray()
s, sc = mmids.spectral_cut2(A)
plt.figure(figsize=(6,6))
mmids.viz_cut(G, s, pos, node_size=200, with_labels=True)
The following code computes the fraction of incorrectly assigned vertices. Note that it considers two assignments corresponding to swapping the labels 0
and 1
which cannot be inferred.
def calculate_incorrect_fraction(block_assignments, inferred_s, inferred_sc):
n = len(block_assignments)
inferred_assignments = np.zeros(n)
for i in inferred_s:
inferred_assignments[i] = 0
for i in inferred_sc:
inferred_assignments[i] = 1
incorrect_assignments_1 = np.sum(block_assignments != inferred_assignments)/n
incorrect_assignments_2 = np.sum(block_assignments == inferred_assignments)/n
return np.minimum(incorrect_assignments_1, incorrect_assignments_2)
NUMERICAL CORNER: We confirm on our previous example that the ground truth was perfectly recovered.
fraction_incorrect = calculate_incorrect_fraction(block_assignments, s, sc)
print(f"Fraction of incorrectly assigned vertices: {fraction_incorrect}")
Fraction of incorrectly assigned vertices: 0.0
One expects that the ground truth is harder to recover if the probability of an edge between blocks is close to that within blocks, which makes the community structure more murky. To test this hypothesis, we modify our previous example by significantly increasing the inter-block probability.
n = 100
beta = [0.55, 0.45]
B = np.array([[0.55, 0.25], [0.25, 0.45]])
block_assignments = generate_block_assignments(rng, n, beta)
G = sbm_random_graph(rng, n, block_assignments, B)
We run spectral_cut2
. It recovers the ground truth only partially this time.
A = nx.adjacency_matrix(G).toarray()
s, sc = mmids.spectral_cut2(A)
fraction_incorrect = calculate_incorrect_fraction(block_assignments, s, sc)
print(f"Fraction of incorrectly assigned vertices: {fraction_incorrect}")
Fraction of incorrectly assigned vertices: 0.22