Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 7-Random walks on graphs and Markov chains
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: July 16, 2024
Copyright: © 2024 Sebastien Roch
$\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bfbeta}{\boldsymbol{\beta}}$ $\newcommand{\bflambda}{\boldsymbol{\lambda}}$ $\newcommand{\bgamma}{\boldsymbol{\gamma}}$ $\newcommand{\bsigma}{{\boldsymbol{\sigma}}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\btheta}{{\boldsymbol{\theta}}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$ $\renewcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bp}{\mathbf{p}}$ $\renewcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\bfg}{\mathbf{g}}$ $\newcommand{\bfh}{\mathbf{h}}$ $\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}$ $\renewcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\pa}{\mathrm{pa}}$ $\newcommand{\Z}{\mathcal{Z}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\cE}{\mathcal{E}}$ $\newcommand{\cP}{\mathcal{P}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bLambda}{\boldsymbol{\Lambda}}$ $\newcommand{\cov}{\mathrm{Cov}}$ $\newcommand{\bfk}{\mathbf{k}}$ $\newcommand{\idx}[1]{}$ $\newcommand{\xdi}{}$
Motivating example: discovering mathematical topics¶
A common task in network analysis is to identify "central" vertices in a graph. Centrality is a vague concept. It can be defined in many different ways depending on the context and the type of network. Quoting from Wikipedia:
In graph theory and network analysis, indicators of centrality assign numbers or rankings to nodes within a graph corresponding to their network position. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, super-spreaders of disease, and brain networks. [...] Centrality indices are answers to the question "What characterizes an important vertex?" The answer is given in terms of a real-valued function on the vertices of a graph, where the values produced are expected to provide a ranking which identifies the most important nodes. The word "importance" has a wide number of meanings, leading to many different definitions of centrality.
In an undirected graph, a natural approach is to look at the degree of a vertex as a measure of its importance (also referred to as degree centrality). But it is hardly the only one. One could for instance look at the average distance to all other nodes (its reciprocal is the closeness centrality) or at the number of shortest paths between pairs of vertices going through the vertex (known as betweenness centrality).
What if the graph is directed? Things are somewhat more complicated there. For instance, there is now the in-degree as well as the out-degree.
Let us look at a particular example of practical importance, the World Wide Web (from now on, the Web). In this case, the vertices are webpages and a directed edge from $u$ to $v$ indicates a hyperlink from page $u$ to page $v$. The Web is much too large to analyze here. Instead, we will consider a tiny (but still interesting!) subset of it, the pages of Wolfram's MathWorld, a wonderful mathematics resource.
Each page of MathWorld concerns a particular mathematical concept, e.g., scale-free network. A definition and notable properties are described. Importantly for us, in a section entitled "SEE ALSO", other related mathematical concepts are listed with a link to their MathWorld page. In the case of scale-free networks, the small world network topic is referenced, among others.
The resulting directed graph is available through the NetSet datasets and can be downloaded here. We load it now. For convenience, we have reformatted it into the files mathworld-adjacency.csv and mathworld-titles.csv, which are available on the GitHub of the book.
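The code in this section assumes the standard imports used throughout the book. For convenience, here is a minimal set that should suffice to run the cells below (the mmids module is the book's companion code, assumed to be available, which provides SamplePath and other helpers used later).
import numpy as np
from numpy import linalg as LA
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import mmids  # companion module of the book (assumption: available on the path)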
data_edges = pd.read_csv('mathworld-adjacency.csv')
data_edges.head()
|   | from | to |
|---|---|---|
| 0 | 0 | 2 |
| 1 | 1 | 47 |
| 2 | 1 | 404 |
| 3 | 1 | 2721 |
| 4 | 2 | 0 |
It consists of a list of directed edges. For example, the first one is an edge from vertex 0 to vertex 2. The second one is from 1 to 47, and so on.
There is a total of $49069$ edges.
data_edges.shape[0]
49069
The second file contains the titles of the pages.
data_titles = pd.read_csv('mathworld-titles.csv')
data_titles.head()
|   | title |
|---|---|
| 0 | Alexander's Horned Sphere |
| 1 | Exotic Sphere |
| 2 | Antoine's Horned Sphere |
| 3 | Flat |
| 4 | Poincaré Manifold |
n = data_titles.shape[0]
print(n)
12362
We construct the graph by adding the edges one by one. We first convert data_edges into a NumPy array.
edgelist = data_edges[['from','to']].to_numpy()
print(edgelist)
[[    0     2]
 [    1    47]
 [    1   404]
 ...
 [12361 12306]
 [12361 12310]
 [12361 12360]]
G = nx.empty_graph(n, create_using=nx.DiGraph)
for i in range(edgelist.shape[0]):
G.add_edge(edgelist[i,0], edgelist[i,1])
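Equivalently (a small convenience, not needed in what follows), NetworkX can add all the edges in a single call:
G.add_edges_from(edgelist)  # same result as the loop above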
Returning to the question of centrality, we can now try to measure the importance of different nodes. For instance, the in-degree of Alexander's Horned Sphere is:
G.in_degree(0)
5
while that of Antoine's Horned Sphere is:
G.in_degree(2)
1
suggesting that the former is more central than the latter, at least in the sense that it is referenced more often.
But is that the right measure? Consider the following: Antoine's Horned Sphere receives only one reference, but it is from a seemingly relatively important vertex, Alexander's Horned Sphere. How can one take this into account in quantifying its importance in the network?
We will come back to this question later in this chapter. To hint at things to come, it will turn out that "exploring the graph at random" provides a powerful perspective on centrality.
Background: elements of finite Markov chains¶
EXAMPLE: (Random Walk on the Petersen Graph) Let $G = (V,E)$ be the Petersen graph. Each vertex $i$ has degree $3$, that is, it has three neighbors which we denote $v_{i,1}, v_{i,2}, v_{i,3}$ in some arbitrary order. For instance, denoting the vertices by $1,\ldots, 10$ as above, vertex $9$ has neighbors $v_{9,1} = 4, v_{9,2} = 6, v_{9,3} = 7$.
We consider the following random walk on $G$. We start at $X_0 = 1$. Then, for each $t\geq 0$, we let $X_{t+1}$ be a uniformly chosen neighbor of $X_t$, independently of the previous history. That is, we jump at random from neighbor to neighbor. Formally, fix $X_0 = 1$ and let $(Z_t)_{t \geq 0}$ be an i.i.d. sequence of random variables taking values in $\{1,2,3\}$ satisfying
$$ \mathbb{P}[Z_t = 1] = \mathbb{P}[Z_t = 2] = \mathbb{P}[Z_t = 3] = 1/3. $$
Then define, for all $t \geq 0$, $X_{t+1} = f(X_t, Z_t) = v_{i,Z_t}$ if $X_t = i$.
By an argument similar to the previous example, $(X_t)_{t \geq 0}$ is a Markov chain. Also as in the previous example, one can pick $X_0$ according to an initial distribution, independently from the sequence $(Z_t)_{t \geq 0}$. $\lhd$
EXAMPLE: (Random Walk on the Petersen Graph, continued) Consider again the random walk on the Petersen graph $G = (V,E)$. We number the vertices $1, 2,\ldots, 10$. To compute the transition matrix, we list for each vertex its neighbors and put the value $1/3$ in the corresponding columns. For instance, vertex $1$ has neighbors $2$, $5$ and $6$, so row $1$ has $1/3$ in columns $2$, $5$, and $6$. And so on.
We get:
$$ P = \begin{pmatrix} 0 & 1/3 & 0 & 0 & 1/3 & 1/3 & 0 & 0 & 0 & 0\\ 1/3 & 0 & 1/3 & 0 & 0 & 0 & 1/3 & 0 & 0 & 0\\ 0 & 1/3 & 0 & 1/3 & 0 & 0 & 0 & 1/3 & 0 & 0\\ 0 & 0 & 1/3 & 0 & 1/3 & 0 & 0 & 0 & 1/3 & 0\\ 1/3 & 0 & 0 & 1/3 & 0 & 0 & 0 & 0 & 0 & 1/3\\ 1/3 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 0\\ 0 & 1/3 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3\\ 0 & 0 & 1/3 & 0 & 0 & 1/3 & 0 & 0 & 0 & 1/3\\ 0 & 0 & 0 & 1/3 & 0 & 1/3 & 1/3 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 1/3 & 0 & 1/3 & 1/3 & 0 & 0 \end{pmatrix} $$
We have already encountered a matrix that encodes the neighbors of each vertex, the adjacency matrix. Here we can recover the transition matrix by multiplying the adjacency matrix by $1/3$. $\lhd$
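To check this numerically (an addition to the text; NetworkX provides the Petersen graph via networkx.petersen_graph, whose vertex labeling may differ from the numbering used above), we can build the transition matrix directly from the adjacency matrix:
A_petersen = nx.adjacency_matrix(nx.petersen_graph()).toarray()
P_petersen = A_petersen / 3  # every vertex has degree 3
print(P_petersen.sum(axis=1))  # each row should sum to 1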
NUMERICAL CORNER: Returning to our Robot Vacuum Example, the transition graph of the chain can be obtained by thinking of $P$ as its weighted adjacency matrix.
P_robot = np.array([[0, 0.8, 0, 0.2, 0, 0, 0, 0, 0],
[0.3, 0, 0.2, 0, 0, 0.5, 0, 0, 0],
[0, 0.6, 0, 0, 0, 0.4, 0, 0, 0],
[0.1, 0.1, 0, 0, 0.8, 0, 0, 0, 0],
[0, 0, 0, 0.25, 0, 0, 0.75, 0, 0],
[0, 0.15, 0.15, 0, 0, 0, 0, 0.35, 0.35],
[0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0.3, 0.4, 0.2, 0, 0.1],
[0, 0, 0, 0, 0, 1, 0, 0, 0]])
We define a graph from its adjacency matrix. See networkx.from_numpy_array().
G_robot = nx.from_numpy_array(P_robot, create_using=nx.DiGraph)
Drawing edge weights on a directed graph in a readable fashion is not straightforward. We will not do this here.
n_robot = P_robot.shape[0]
nx.draw_networkx(G_robot, pos=nx.circular_layout(G_robot),
labels={i: i+1 for i in range(n_robot)},
node_color='black', font_color='white',
connectionstyle='arc3, rad = 0.2')
plt.axis('off')
plt.show()
Once we have specified a transition matrix (and an initial distribution), we can simulate the corresponding Markov chain. This is useful to compute (approximately) probabilities of complex events through the law of large numbers. Here is some code to generate one sample path up to some given time $T$. We assume that the state space is $[n]$. We use rng.choice to generate each transition.
def SamplePath(rng, mu, P, T):
    # sample path of length T+1 for a chain on [n] with initial distribution mu and transition matrix P
    n = mu.shape[0]
    X = np.zeros(T+1)
    for i in range(T+1):
        if i == 0:
            # draw the initial state from mu (states are labeled 1,...,n)
            X[i] = rng.choice(a=np.arange(start=1,stop=n+1),p=mu)
        else:
            # draw the next state from the row of P corresponding to the current state
            X[i] = rng.choice(a=np.arange(start=1,stop=n+1),p=P[int(X[i-1]-1),:])
    return X
NUMERICAL CORNER: Let's try with our Robot Vacuum. We take the initial distribution to be the uniform distribution.
seed = 535
rng = np.random.default_rng(seed)
mu = np.ones(n_robot) / n_robot
print(SamplePath(rng, mu, P_robot, 10))
[9. 6. 3. 6. 8. 6. 2. 1. 2. 6. 8.]
For example, we can use a simulation to approximate the expected number of times that room $9$ is visited up to time $10$. To do this, we run the simulation a large number of times (say $1000$), record the number of visits to $9$ in each run, and average.
z = 9
N_samples = 1000
visits_to_z = np.zeros(N_samples)
for i in range(N_samples):
visits_to_z[i] = np.count_nonzero(SamplePath(rng, mu, P_robot, 10) == z)
print(np.mean(visits_to_z))
1.193
Limit behavior 1: stationary distributions¶
EXAMPLE: (Robot Vacuum, continued) Going back to the Robot Vacuum Example, recall the transition graph. While there is no direct edge from $4$ to $3$, we do have $4 \to 3$ through the path $(4,2), (2,3)$. Do we have $3 \to 4$? $\lhd$
NUMERICAL CORNER: Consider random walk on the following digraph, which we refer to as the Two Sinks Example (can you see why?).
G_sinks = nx.DiGraph()
n_sinks = 5
for i in range(n_sinks):
G_sinks.add_node(i)
G_sinks.add_edge(0, 0, weight=1/3)
G_sinks.add_edge(0, 1, weight=1/3)
G_sinks.add_edge(1, 1, weight=1/3)
G_sinks.add_edge(1, 2, weight=1/3)
G_sinks.add_edge(2, 2, weight=1)
G_sinks.add_edge(3, 3, weight=1)
G_sinks.add_edge(0, 4, weight=1/3)
G_sinks.add_edge(1, 4, weight=1/3)
G_sinks.add_edge(4, 3, weight=1)
nx.draw_networkx(G_sinks, pos=nx.circular_layout(G_sinks),
labels={i: i+1 for i in range(n_sinks)},
node_color='black', font_color='white',
connectionstyle='arc3, rad = -0.2')
plt.axis('off')
plt.show()
Here we have $1 \to 4$ (Why?). The Communication Lemma implies that, when started at $1$, $(X_t)_{t \geq 0}$ visits $4$ with positive probability. But that probability is not one. Indeed we also have $1 \to 3$ (Why?), so there is a positive probability of visiting $3$ as well. But if we do so before visiting $4$, we stay at $3$ forever and hence can never reach $4$.
In fact, intuitively, if we run this chain long enough we will either get stuck at $3$ or get stuck at $4$. These give rise to different stationary distributions. The transition matrix is the following.
P_sinks = nx.adjacency_matrix(G_sinks).toarray()
print(P_sinks)
[[0.33333333 0.33333333 0.         0.         0.33333333]
 [0.         0.33333333 0.33333333 0.         0.33333333]
 [0.         0.         1.         0.         0.        ]
 [0.         0.         0.         1.         0.        ]
 [0.         0.         0.         1.         0.        ]]
It is easy to check that $\bpi = (0,0,1,0,0)$ and $\bpi' = (0,0,0,1,0)$ are both stationary distributions.
pi = np.array([0.,0.,1.,0.,0.])
pi_prime = np.array([0.,0.,0.,1.,0.])
P_sinks.T @ pi.T
array([0., 0., 1., 0., 0.])
P_sinks.T @ pi_prime.T
array([0., 0., 0., 1., 0.])
In fact, there are infinitely many stationary distributions in this case, since any convex combination of stationary distributions is again stationary.
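For instance (a quick check, not in the original text), the mixture $\frac{1}{2}\bpi + \frac{1}{2}\bpi'$ is again stationary:
pi_mix = 0.5 * pi + 0.5 * pi_prime
print(P_sinks.T @ pi_mix)  # should return pi_mix itself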
NUMERICAL CORNER: Because irreducibility is ultimately a graph-theoretic property, it is easy to check using NetworkX. For this, we use the function is_strongly_connected(). Revisiting the Robot Vacuum Example:
P_robot = np.array([[0, 0.8, 0, 0.2, 0, 0, 0, 0, 0],
[0.3, 0, 0.2, 0, 0, 0.5, 0, 0, 0],
[0, 0.6, 0, 0, 0, 0.4, 0, 0, 0],
[0.1, 0.1, 0, 0, 0.8, 0, 0, 0, 0],
[0, 0, 0, 0.25, 0, 0, 0.75, 0, 0],
[0, 0.15, 0.15, 0, 0, 0, 0, 0.35, 0.35],
[0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0.3, 0.4, 0.2, 0, 0.1],
[0, 0, 0, 0, 0, 1, 0, 0, 0]])
G_robot = nx.from_numpy_array(P_robot, create_using=nx.DiGraph)
print(nx.is_strongly_connected(G_robot))
True
Consider again the Two Sinks Example. It turns out not to be irreducible:
print(nx.is_strongly_connected(G_sinks))
False
NUMERICAL CORNER: In general, computing stationary distributions is not as straightforward as in the simple example we considered above. We conclude this subsection with some numerical recipes.
Going back to the Robot Vacuum, finding a solution to $\bpi P =\bpi$ in this case is not obvious. One way to do this is to note that, taking transposes, this condition is equivalent to $P^T \bpi^T = \bpi^T$. That is, $\bpi^T$ is an eigenvector of $P^T$ with eigenvalue $1$. (Or, as we noted previously, the row vector $\bpi$ is a left eigenvector of $P$ with eigenvalue $1$.) It must also satisfy $\bpi \geq 0$ with at least one entry non-zero. Here, we use NumPy.
w, v = LA.eig(P_robot.T)
The first eigenvalue is approximately $1$, as seen below.
print(w)
[ 1. +0.j 0.67955052+0.j 0.50519638+0.j -0.70014828+0.j -0.59989603+0.j -0.47710224+0.32524037j -0.47710224-0.32524037j 0.03475095+0.04000569j 0.03475095-0.04000569j]
The corresponding eigenvector is approximately non-negative.
print(v[:,0])
[0.08933591+0.j 0.27513917+0.j 0.15744007+0.j 0.06794162+0.j 0.20029774+0.j 0.68274825+0.j 0.24751961+0.j 0.48648149+0.j 0.28761004+0.j]
To obtain a stationary distribution, we remove the imaginary part and normalize it to sum to $1$.
pi_robot = np.real(v[:,0]) / np.sum(np.real(v[:,0]))
print(pi_robot)
[0.03581295 0.11029771 0.06311453 0.02723642 0.0802953 0.27369992 0.09922559 0.19502056 0.11529703]
Alternatively, we can solve the linear system
$$ \sum_{i=1}^n \pi_i p_{i,j} = \pi_j, \qquad \forall j \in [n]. $$
It turns out that the last equation is a linear combination of the other equations (see Exercise 3.48), so we remove it and replace it with the condition $\sum_{i=1}^n \pi_i = 1$.
The left-hand side of the resulting linear system is (after taking the transpose to work with column vectors):
n_robot = P_robot.shape[0]
A = np.copy(P_robot.T) - np.diag(np.ones(n_robot))
A[n_robot-1,:] = np.ones(n_robot)
print(A)
[[-1.    0.3   0.    0.1   0.    0.    0.    0.    0.  ]
 [ 0.8  -1.    0.6   0.1   0.    0.15  0.    0.    0.  ]
 [ 0.    0.2  -1.    0.    0.    0.15  0.    0.    0.  ]
 [ 0.2   0.    0.   -1.    0.25  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.8  -1.    0.    0.    0.3   0.  ]
 [ 0.    0.5   0.4   0.    0.   -1.    0.    0.4   1.  ]
 [ 0.    0.    0.    0.    0.75  0.   -1.    0.2   0.  ]
 [ 0.    0.    0.    0.    0.    0.35  1.   -1.    0.  ]
 [ 1.    1.    1.    1.    1.    1.    1.    1.    1.  ]]
The right-hand side of the resulting linear system is:
b = np.concatenate((np.zeros(n_robot-1),[1.]))
print(b)
[0. 0. 0. 0. 0. 0. 0. 0. 1.]
We solve the linear system using numpy.linalg.solve().
pi_robot_solve = LA.solve(A,b)
print(pi_robot_solve)
[0.03581295 0.11029771 0.06311453 0.02723642 0.0802953 0.27369992 0.09922559 0.19502056 0.11529703]
This last approach is known as "Replace an Equation".
Limit behavior 2: convergence to equilibrium¶
NUMERICAL CORNER: The Convergence to Equilibrium Theorem implies that we can use power iteration to compute the unique stationary distribution in the irreducible case. We revisit the Robot Vacuum Example. We initialize with the uniform distribution, then repeatedly multiply by $P$.
P_robot = np.array([[0, 0.8, 0, 0.2, 0, 0, 0, 0, 0],
[0.3, 0, 0.2, 0, 0, 0.5, 0, 0, 0],
[0, 0.6, 0, 0, 0, 0.4, 0, 0, 0],
[0.1, 0.1, 0, 0, 0.8, 0, 0, 0, 0],
[0, 0, 0, 0.25, 0, 0, 0.75, 0, 0],
[0, 0.15, 0.15, 0, 0, 0, 0, 0.35, 0.35],
[0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0.3, 0.4, 0.2, 0, 0.1],
[0, 0, 0, 0, 0, 1, 0, 0, 0]])
n_robot = P_robot.shape[0]
mu = np.ones(n_robot)/n_robot
print(mu)
[0.11111111 0.11111111 0.11111111 0.11111111 0.11111111 0.11111111 0.11111111 0.11111111 0.11111111]
mu = mu @ P_robot
print(mu)
[0.04444444 0.18333333 0.03888889 0.05 0.12222222 0.25555556 0.10555556 0.15 0.05 ]
mu = mu @ P_robot
print(mu)
[0.06 0.10222222 0.075 0.03944444 0.085 0.21722222 0.12166667 0.195 0.10444444]
We repeat, say, $10$ more times and compare to the truth pi_robot.
for _ in range(10):
mu = mu @ P_robot
print(mu)
[0.0358112 0.10982018 0.06297235 0.02721311 0.08055026 0.27393441 0.09944157 0.19521946 0.11503747]
w, v = LA.eig(P_robot.T)
pi_robot = np.real(v[:,0]) / np.sum(np.real(v[:,0]))
print(pi_robot)
[0.03581295 0.11029771 0.06311453 0.02723642 0.0802953 0.27369992 0.09922559 0.19502056 0.11529703]
We see that a small number of iterations sufficed to get an accurate answer. In general, the speed of convergence depends on the eigenvalues of $P$ that are strictly smaller than $1$ in absolute value.
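To quantify this (a check added here, reusing the eigenvalues w computed above), we can look at the second-largest eigenvalue modulus of $P$, which governs the geometric rate of convergence; for the Robot Vacuum it is about $0.70$.
print(np.sort(np.abs(w))[-2])  # second-largest eigenvalue modulus of P_robot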
We can also check the Ergodic Theorem through simulation. We generate a long sample path and compare the state visit frequencies to pi_robot.
seed = 535
rng = np.random.default_rng(seed)
mu = np.ones(n_robot) / n_robot
path_length = 50000
visit_freq = np.zeros(n_robot)
path = mmids.SamplePath(rng, mu, P_robot, path_length)
for i in range(n_robot):
visit_freq[i] = np.count_nonzero(path == i+1)/(path_length+1)
print(visit_freq)
[0.03627927 0.10927781 0.0601788 0.02645947 0.08045839 0.27359453 0.10119798 0.19693606 0.11561769]
print(pi_robot)
[0.03581295 0.11029771 0.06311453 0.02723642 0.0802953 0.27369992 0.09922559 0.19502056 0.11529703]
Application: random walks on graphs and PageRank¶
Here is an implementation of the PageRank algorithm. We will need a function that takes as input an adjacency matrix $A$ and returns the corresponding transition matrix $P$. Some vertices have no outgoing links. To avoid dividing by $0$, we add a self-loop to all vertices with out-degree $0$. We use numpy.fill_diagonal for this purpose.
Also, because the adjacency matrix and the vector of out-degrees have different shapes, we turn out_deg into a column vector using numpy.newaxis, so that each row of the matrix is divided by the corresponding out-degree. (There are many ways of doing this, but some are slower than others.)
def transition_from_adjacency(A):
n = A.shape[0]
sinks = (A @ np.ones(n)) == 0.
P = A.copy()
np.fill_diagonal(P, sinks)
out_deg = P @ np.ones(n)
P = P / out_deg[:, np.newaxis]
return P
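As a quick sanity check (a hypothetical toy example, not from the text), consider a two-vertex graph in which the second vertex is a sink; it receives a self-loop and both rows of the output sum to $1$:
A_toy = np.array([[0., 1.], [0., 0.]])  # vertex 2 has no outgoing edge
print(transition_from_adjacency(A_toy))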
The following function adds the damping factor. Here mu will be the uniform distribution. It gets added (after scaling by 1-alpha) one row at a time to P (again after scaling by alpha). This time we do not need to reshape mu.
def add_damping(P, alpha, mu):
Q = alpha * P + (1-alpha) * mu
return Q
When computing PageRank, we take the transpose of $Q$ to turn multiplication from the left into multiplication from the right.
def pagerank(A, alpha=0.85, max_iter=100):
n = A.shape[0]
mu = np.ones(n)/n
P = transition_from_adjacency(A)
Q = add_damping(P, alpha, mu)
v = mu
for _ in range(max_iter):
v = Q.T @ v
return v
NUMERICAL CORNER: Let's try a star with edges pointing out. Along the way, we check that our functions work as expected.
n = 8
G_outstar = nx.DiGraph()
for i in range(1,n):
G_outstar.add_edge(0,i)
nx.draw_networkx(G_outstar, labels={i: i+1 for i in range(n)},
node_color='black', font_color='white')
plt.axis('off')
plt.show()
A_outstar = nx.adjacency_matrix(G_outstar).toarray()
print(A_outstar)
[[0 1 1 1 1 1 1 1]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]]
We compute the matrices $P$ and $Q$. We use numpy.set_printoptions to condense the output.
P_outstar = transition_from_adjacency(A_outstar)
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print(P_outstar)
[[ 0.000  0.143  0.143  0.143  0.143  0.143  0.143  0.143]
 [ 0.000  1.000  0.000  0.000  0.000  0.000  0.000  0.000]
 [ 0.000  0.000  1.000  0.000  0.000  0.000  0.000  0.000]
 [ 0.000  0.000  0.000  1.000  0.000  0.000  0.000  0.000]
 [ 0.000  0.000  0.000  0.000  1.000  0.000  0.000  0.000]
 [ 0.000  0.000  0.000  0.000  0.000  1.000  0.000  0.000]
 [ 0.000  0.000  0.000  0.000  0.000  0.000  1.000  0.000]
 [ 0.000  0.000  0.000  0.000  0.000  0.000  0.000  1.000]]
alpha = 0.85
mu = np.ones(n)/n
Q_outstar = add_damping(P_outstar, alpha, mu)
print(Q_outstar)
[[ 0.019  0.140  0.140  0.140  0.140  0.140  0.140  0.140]
 [ 0.019  0.869  0.019  0.019  0.019  0.019  0.019  0.019]
 [ 0.019  0.019  0.869  0.019  0.019  0.019  0.019  0.019]
 [ 0.019  0.019  0.019  0.869  0.019  0.019  0.019  0.019]
 [ 0.019  0.019  0.019  0.019  0.869  0.019  0.019  0.019]
 [ 0.019  0.019  0.019  0.019  0.019  0.869  0.019  0.019]
 [ 0.019  0.019  0.019  0.019  0.019  0.019  0.869  0.019]
 [ 0.019  0.019  0.019  0.019  0.019  0.019  0.019  0.869]]
While it is tempting to guess that $1$ is the most central node of the network, no edge actually points to it. In this case, the center of the star has a low PageRank value.
print(pagerank(A_outstar))
[ 0.019 0.140 0.140 0.140 0.140 0.140 0.140 0.140]
We then try a star with edges pointing in.
n = 8
G_instar = nx.DiGraph()
G_instar.add_node(0)
for i in range(1,n):
G_instar.add_edge(i,0)
nx.draw_networkx(G_instar, labels={i: i+1 for i in range(n)},
node_color='black', font_color='white')
plt.axis('off')
plt.show()
A_instar = nx.adjacency_matrix(G_instar).toarray()
print(A_instar)
[[0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]]
P_instar = transition_from_adjacency(A_instar)
print(P_instar)
[[ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]
 [ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]
 [ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]
 [ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]
 [ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]
 [ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]
 [ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]
 [ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]]
Q_instar = add_damping(P_instar, alpha, mu)
print(Q_instar)
[[ 0.869  0.019  0.019  0.019  0.019  0.019  0.019  0.019]
 [ 0.869  0.019  0.019  0.019  0.019  0.019  0.019  0.019]
 [ 0.869  0.019  0.019  0.019  0.019  0.019  0.019  0.019]
 [ 0.869  0.019  0.019  0.019  0.019  0.019  0.019  0.019]
 [ 0.869  0.019  0.019  0.019  0.019  0.019  0.019  0.019]
 [ 0.869  0.019  0.019  0.019  0.019  0.019  0.019  0.019]
 [ 0.869  0.019  0.019  0.019  0.019  0.019  0.019  0.019]
 [ 0.869  0.019  0.019  0.019  0.019  0.019  0.019  0.019]]
In this case, the center of the star does indeed have a high PageRank value.
print(pagerank(A_instar))
[ 0.869 0.019 0.019 0.019 0.019 0.019 0.019 0.019]
NUMERICAL CORNER: We revisit the star example in the undirected case.
n = 8
G_star = nx.Graph()
for i in range(1,n):
G_star.add_edge(0,i)
nx.draw_networkx(G_star, labels={i: i+1 for i in range(n)},
node_color='black', font_color='white')
plt.axis('off')
plt.show()
We would like to compute the PageRank vector without damping. But here the random walk is periodic (Why?), so power iteration may fail (Try it!). Instead, we use a small amount of damping and increase the number of iterations.
A_star = nx.adjacency_matrix(G_star).toarray()
print(A_star)
[[0 1 1 1 1 1 1 1]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]]
print(pagerank(A_star, max_iter=10000, alpha=0.999))
[ 0.500 0.071 0.071 0.071 0.071 0.071 0.071 0.071]
The PageRank value for the center node is indeed roughly $7$ times larger than the other ones, as can be expected from the ratio of their degrees. (Recall that, for random walk on an undirected graph, the stationary distribution is proportional to the degrees.)
We try again with more damping (the default $\alpha = 0.85$). This time the ratio of PageRank values is not quite the same as the ratio of degrees, but the center node continues to have a higher value than the other nodes.
print(pagerank(A_star))
[ 0.470 0.076 0.076 0.076 0.076 0.076 0.076 0.076]
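As a sanity check (an addition to the text; NetworkX has its own implementation, networkx.pagerank, which handles dangling nodes and convergence tolerances slightly differently), we can compare with the built-in function:
print(nx.pagerank(G_star, alpha=0.85))
The resulting values should be close to those computed above.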
NUMERICAL CORNER: We load the dataset again.
data_edges = pd.read_csv('mathworld-adjacency.csv')
data_edges.head()
|   | from | to |
|---|---|---|
| 0 | 0 | 2 |
| 1 | 1 | 47 |
| 2 | 1 | 404 |
| 3 | 1 | 2721 |
| 4 | 2 | 0 |
The second file contains the titles of the pages.
data_titles = pd.read_csv('mathworld-titles.csv')
data_titles.head()
|   | title |
|---|---|
| 0 | Alexander's Horned Sphere |
| 1 | Exotic Sphere |
| 2 | Antoine's Horned Sphere |
| 3 | Flat |
| 4 | Poincaré Manifold |
We construct the graph by adding the edges one by one. We first convert data_edges into a NumPy array.
edgelist = data_edges[['from','to']].to_numpy()
print(edgelist)
[[    0     2]
 [    1    47]
 [    1   404]
 ...
 [12361 12306]
 [12361 12310]
 [12361 12360]]
n = 12362
G_mw = nx.empty_graph(n, create_using=nx.DiGraph)
for i in range(edgelist.shape[0]):
G_mw.add_edge(edgelist[i,0], edgelist[i,1])
To apply PageRank, we construct the adjacency matrix of the graph. We also define a vector of page titles.
A_mw = nx.adjacency_matrix(G_mw).toarray()
titles_mw = data_titles['title'].to_numpy()
pr_mw = pagerank(A_mw)
We use numpy.argsort to identify the pages with the highest scores. We apply it to -pr_mw to sort from highest to lowest value.
top_pages = np.argsort(-pr_mw)
The top 25 topics are:
print(titles_mw[top_pages[:25]])
['Sphere' 'Circle' 'Prime Number' 'Aleksandrov-Čech Cohomology' 'Centroid Hexagon' 'Group' 'Fourier Transform' 'Tree' 'Splitting Field' 'Archimedean Solid' 'Normal Distribution' 'Integer Sequence Primes' 'Perimeter Polynomial' 'Polygon' 'Finite Group' 'Large Number' 'Riemann Zeta Function' 'Chebyshev Approximation Formula' 'Vector' 'Ring' 'Fibonacci Number' 'Conic Section' 'Fourier Series' 'Derivative' 'Gamma Function']
We indeed get a list of central concepts in mathematics -- including several we have encountered previously such as Normal Distribution, Tree, Vector or Derivative.
There is a variant of PageRank, referred to as Personalized PageRank (PPR)$\idx{Personalized PageRank}\xdi$, which aims to tailor the outcome to specific interests. This is accomplished through a simple change to the algorithm. When teleporting, rather than jumping to a uniformly random page, we instead jump according to an arbitrary distribution which is meant to capture specific interests. In the context of the Web for instance, this distribution might be uniform over someone's bookmarks.
We adapt pagerank as follows:
def ppr(A, mu, alpha=0.85, max_iter=100):
n = A.shape[0]
P = transition_from_adjacency(A)
Q = add_damping(P, alpha, mu)
v = mu
for _ in range(max_iter):
v = Q.T @ v
return v
NUMERICAL CORNER: To test PPR, consider the distribution concentrated on a single topic, Normal Distribution. This is topic number 1270.
print(np.argwhere(titles_mw == 'Normal Distribution')[0][0])
1270
mu = np.zeros(n)
mu[1270] = 1
We now run PPR and list the top 25 pages.
ppr_mw = ppr(A_mw, mu)
top_pers_pages = np.argsort(-ppr_mw)
The top 25 topics are:
print(titles_mw[top_pers_pages[:25]])
['Normal Distribution' 'Pearson System' 'Logit Transformation' 'z-Score' 'Erf' 'Central Limit Theorem' 'Bivariate Normal Distribution' 'Normal Ratio Distribution' 'Normal Sum Distribution' 'Normal Distribution Function' 'Gaussian Function' 'Standard Normal Distribution' 'Normal Product Distribution' 'Binomial Distribution' 'Tetrachoric Function' 'Ratio Distribution' 'Kolmogorov-Smirnov Test' 'Box-Muller Transformation' 'Galton Board' 'Fisher-Behrens Problem' 'Erfc' 'Normal Difference Distribution' 'Half-Normal Distribution' 'Inverse Gaussian Distribution' 'Error Function Distribution']
This indeed returns various statistical concepts, particularly related to the normal distribution.
Further applications: Gibbs sampling and generating images¶
Sampling from simple distributions: When $\bpi$ is a standard distribution or $\S$ is relatively small, this can be done efficiently by using a random number generator, as we have done previously.
NUMERICAL CORNER: Recall how this works. We first initialize the random number generator and use a seed for reproducibility.
seed = 535
rng = np.random.default_rng(seed)
To generate, say, $1000$ samples from a multivariate normal, say with mean $(0, 0)$ and covariance $\begin{pmatrix}5 & 0\\0 & 1\end{pmatrix}$, we use numpy.random.Generator.multivariate_normal as follows.
mean = np.array([0., 0.])
cov = np.array([[5., 0.], [0., 1.]])
x, y = rng.multivariate_normal(mean, cov, 1000).T
Computing the mean of each component we get:
print(np.mean(x))
-0.035322561120667575
print(np.mean(y))
-0.009499619370100139
This is somewhat close to the expected answer: $(0,0)$.
Using a larger number of samples, say $10,000$, gives a better result.
x, y = rng.multivariate_normal(mean, cov, 10000).T
print(np.mean(x))
print(np.mean(y))
-0.0076273930440971215 -0.008874190869155479
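We can also compare the sample covariance matrix to the true covariance (a quick check, not in the original text):
print(np.cov(x, y))  # should be close to the true covariance matrix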
Sampling from an arbitrary distribution on a finite set is also straightforward -- as long as the set is not too big. This can be done using numpy.random.Generator.choice. Borrowing the example from the documentation, the following:
aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'christopher']
print(rng.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3]))
['pooh' 'pooh' 'piglet' 'christopher' 'piglet']
generates $5$ samples from the set $\S = \{\tt{pooh}, \tt{rabbit}, \tt{piglet}, \tt{christopher}\}$ with respective probabilities $0.5, 0.1, 0.1, 0.3$.
But this may not be practical when the state space $\S$ is very large. As an example, later in this section, we will learn a "realistic" distribution of handwritten digits. We will do so using the MNIST dataset.
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
mnist = datasets.MNIST(root='./data', train=True,
download=True, transform=transforms.ToTensor())
train_loader = DataLoader(mnist, batch_size=len(mnist), shuffle=False)
imgs, labels = next(iter(train_loader))
imgs = imgs.squeeze().numpy()
labels = labels.numpy()
imgs = np.round(imgs)
Each image is $28 \times 28$, so the total number of (black and white) pixels is $784$.
nx_pixels, ny_pixels = imgs[0].shape
nx_pixels, ny_pixels
(28, 28)
n_pixels = nx_pixels * ny_pixels
n_pixels
784
To specify a distribution over all possible black and white images of this size, we need in principle to assign a probability to a very large number of states. Our space here is $\S = \{0,1\}^{784}$, imagining that $0$ encodes white and $1$ encodes black and that we have ordered the pixels in some arbitrary way. How big is this space?
Answer: $2^{784}$.
Or in base $10$, we compute $\log_{10}(2^{784})$, which is:
784 * np.log(2) / np.log(10)
236.00751660056122
So a little more than $10^{236}$.
This is much too large to naively plug into rng.choice!
NUMERICAL CORNER: Suppose $\S = \{1,\cdots, n\} = [n]$ for some positive integer $n$ and $\bpi$ is proportional to a Poisson distribution with mean $\lambda > 0$. That is,
$$ \pi_i = C e^{-\lambda} \frac{\lambda^i}{i!}, \quad \forall i \in \S $$
for some constant $C$ chosen so that $\sum_{i=1}^{n} \pi_i = 1$. Recall that we do not need to determine $C$ as it is enough to know the target distribution up to a scaling factor by the previous remark.
To apply Metropolis-Hastings, we need a proposal chain. Consider the following choice. For each $1 < i < n$, move to $i+1$ or $i-1$ with probability $1/2$ each. For $i=1$ (respectively $i = n$), move to $2$ (respectively $n-1$) with probability $1/2$, otherwise stay where you are. For instance, if $n = 4$, then
$$ Q = \begin{pmatrix} 1/2 & 1/2 & 0 & 0\\ 1/2 & 0 & 1/2 & 0\\ 0 & 1/2 & 0 & 1/2\\ 0 & 0 & 1/2 & 1/2 \end{pmatrix}, $$
which is indeed a stochastic matrix. It is also symmetric, so it does not enter into the acceptance probability by the previous remark.
To compute the acceptance probability, we only need to consider pairs of adjacent integers, as they are the only ones that have non-zero probability under $Q$. Consider state $1 < i < n$. Observe that
$$ \frac{\pi_{i+1}}{\pi_{i}} = \frac{C e^{-\lambda} \lambda^{i+1}/(i+1)!}{C e^{-\lambda} \lambda^{i}/i!} = \frac{\lambda}{i+1} $$
so a move to $i+1$ happens with probability
$$ \frac{1}{2} \min\left\{1, \frac{\lambda}{i+1}\right\}, $$
where the $1/2$ factor comes from the proposal distribution. Similarly, it can be checked (try it!) that a move to $i-1$ occurs with probability
$$ \frac{1}{2} \min\left\{1, \frac{i}{\lambda}\right\}. $$
And we stay at $i$ with probability $1 - \frac{1}{2} \min\left\{1, \frac{\lambda}{i+1}\right\} - \frac{1}{2} \min\left\{1, \frac{i}{\lambda}\right\}$. (Why is this guaranteed to be a probability?)
A similar formula applies to $i = 1, n$. (Try it!)
We are ready to apply Metropolis-Hastings.
def mh_transition_poisson(lmbd, n):
P = np.zeros((n,n))
for idx in range(n):
i = idx + 1 # index starts at 0 rather than 1
if (i > 1 and i < n):
P[idx, idx+1] = (1/2) * np.min(np.array([1, lmbd/(i+1)]))
P[idx, idx-1] = (1/2) * np.min(np.array([1, i/lmbd]))
P[idx, idx] = 1 - P[idx, idx+1] - P[idx, idx-1]
elif i == 1:
P[idx, idx+1] = (1/2) * np.min(np.array([1, lmbd/(i+1)]))
P[idx, idx] = 1 - P[idx, idx+1]
elif i == n:
P[idx, idx-1] = (1/2) * np.min(np.array([1, i/lmbd]))
P[idx, idx] = 1 - P[idx, idx-1]
return P
Take $\lambda = 1$ and $n = 6$. We get the following transition matrix.
lmbd = 1
n = 6
P = mh_transition_poisson(lmbd, n)
print(P)
[[ 0.750  0.250  0.000  0.000  0.000  0.000]
 [ 0.500  0.333  0.167  0.000  0.000  0.000]
 [ 0.000  0.500  0.375  0.125  0.000  0.000]
 [ 0.000  0.000  0.500  0.400  0.100  0.000]
 [ 0.000  0.000  0.000  0.500  0.417  0.083]
 [ 0.000  0.000  0.000  0.000  0.500  0.500]]
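As a quick check (added here), every row of this transition matrix should sum to $1$:
print(P.sum(axis=1))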
We use our simulator from a previous chapter. We start from the uniform distribution and take $100$ steps.
seed = 535
rng = np.random.default_rng(seed)
mu = np.ones(n) / n
T = 100
X = mmids.SamplePath(rng, mu, P, T)
Our sample is the final state of the trajectory.
X[T]
2.0
We repeat $1000$ times.
N_samples = 1000 # number of repetitions
freq_z = np.zeros(n) # init of frequencies sampled
for i in range(N_samples):
X = mmids.SamplePath(rng, mu, P, T)
freq_z[int(X[T])-1] += 1 # adjust for index starting at 0
freq_z = freq_z/N_samples
We plot the frequencies.
plt.bar(range(1,n+1),freq_z, color='lightblue', edgecolor='black')
plt.show()
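For comparison (an addition to the text), we can compute the target distribution itself, that is, the Poisson($\lambda$) probabilities restricted to $\{1, \ldots, n\}$ and renormalized; the frequencies in the bar plot should be close to these values.
import math
target = np.array([lmbd**i / math.factorial(i) for i in range(1, n+1)])  # the factor exp(-lmbd) cancels after normalization
print(target / np.sum(target))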
If we increase the parameter $\lambda$ (which is not quite the mean; why?), what would you expect to happen to the sampled distribution?
Gibbs sampling: We sample from the joint distribution $\pi$ and observe only $\bv$.
We need to compute the conditional probabilities given every other variable. The sigmoid function $\sigma(x)$, which we have encountered previously, will once again make an appearance.
def sigmoid(x):
return 1/(1 + np.exp(-x))
We implement the Gibbs sampler for an RBM. Rather than updating the units at random, we use a block approach. Specifically, we update all hidden units independently, given the visible units; then we update all visible units independently, given the hidden units. In each case, this is warranted by the conditional independence structure revealed above.
We first implement the conditional means using the formulas previously derived.
def rbm_mean_hidden(v, W, c):
return sigmoid(W @ v + c)
def rbm_mean_visible(h, W, b):
return sigmoid(W.T @ h + b)
We next implement one step of the sampler, which consists of updating all hidden units, followed by updating all visible units.
def rbm_gibbs_update(rng, v, W, b, c):
p_hidden = rbm_mean_hidden(v, W, c)
h = rng.binomial(1, p_hidden, p_hidden.shape)
p_visible = rbm_mean_visible(h, W, b)
v = rng.binomial(1, p_visible, p_visible.shape)
return v
Finally, we repeat these steps k times. We only return the visible units v.
def rbm_gibbs_sampling(rng, k, v_0, W, b, c):
counter = 0
v = v_0
while counter < k:
v = rbm_gibbs_update(rng, v, W, b, c)
counter += 1
return v
Here v_0 contains the initial states of the visible units. We do not need to initialize the hidden ones as this is done automatically in the first update step. In the next subsection, we will take the initial distribution of $\bv$ to be independent Bernoullis with success probability $1/2$.
NUMERICAL CORNER: We apply our Gibbs sampler to generating images. As mentioned previously, we use the MNIST dataset to learn a "realistic" distribution of handwritten digit images. Here the images are encoded by the visible units of an RBM. Then we sample from this model.
We first need to train the model on the data. We will not show how this is done here, but instead use sklearn.neural_network.BernoulliRBM. (Some details of how this training is done are provided here.)
from sklearn.neural_network import BernoulliRBM
rbm = BernoulliRBM(random_state=seed, verbose=0)
To simplify the analysis and speed up the training, we only keep digits $0$, $1$ and $5$.
mask = (labels == 0) | (labels == 1) | (labels == 5)
imgs = imgs[mask]
labels = labels[mask]
We flatten the images (which have already been "rounded" to black-and-white; see the first subsection).
X = imgs.reshape(len(imgs), -1)
We now fit the model. Choosing the hyperparameters of the training algorithm is tricky. The following seem to work reasonably well. (For a more systematic approach to tuning hyperparameters, see here.)
rbm.n_components = 100
rbm.learning_rate = 0.02
rbm.batch_size = 50
rbm.n_iter = 20
rbm.fit(X)
BernoulliRBM(batch_size=50, learning_rate=0.02, n_components=100, n_iter=20, random_state=535)
We are ready to sample from the trained RBM. We extract the learned parameters from rbm.
W = rbm.components_
W.shape
(100, 784)
b = rbm.intercept_visible_
b.shape
(784,)
c = rbm.intercept_hidden_
c.shape
(100,)
To generate $25$ samples, we first generate $25$ independent initial states. We stack them into a matrix, where each row is a different flattened random noise image.
n_samples = 25
z = rng.binomial(1, 0.5, (n_samples, n_pixels))
To process all samples simultaneously, we make a small change to the code. We use numpy.reshape to make the offsets into column vectors, which are then automatically added to all columns of the resulting weighted sum. (This is known as broadcasting.)
def rbm_mean_hidden(v, W, c):
return sigmoid(W @ v + c.reshape(len(c),1))
def rbm_mean_visible(h, W, b):
return sigmoid(W.T @ h + b.reshape(len(b),1))
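As a quick check of the broadcasting (added here), feeding all $25$ initial noise states at once should produce one column of hidden-unit means per sample:
print(rbm_mean_hidden(z.T, W, c).shape)  # expect (100, 25): 100 hidden units, 25 samples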
For plotting, we use a script adapted from here (with help from ChatGPT).
def plot_imgs(z, n_imgs, nx_pixels, ny_pixels):
nx_imgs = np.floor(np.sqrt(n_imgs))
ny_imgs = np.ceil(np.sqrt(n_imgs))
plt.figure(figsize=(8, 8))
for i, comp in enumerate(z):
plt.subplot(int(nx_imgs), int(ny_imgs), i + 1)
plt.imshow(comp.reshape((nx_pixels, ny_pixels)), cmap='gray_r')
plt.xticks([]), plt.yticks([])
plt.show()
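Before running the sampler, we can visualize the initial random noise images with the helper above (an optional step, not in the original text):
plot_imgs(z, n_samples, nx_pixels, ny_pixels)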
We are now ready to run our Gibbs sampler. The outcome depends on the number of steps we take. After $100$ steps, the outcome is somewhat realistic.
v_0 = z.T
gen_v = rbm_gibbs_sampling(rng, 100, v_0, W, b, c)
plot_imgs(gen_v.T, n_samples, nx_pixels, ny_pixels)