Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 4-Singular value decomposition
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: July 15, 2024
Copyright: © 2024 Sebastien Roch
$\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bfbeta}{\boldsymbol{\beta}}$ $\newcommand{\bflambda}{\boldsymbol{\lambda}}$ $\newcommand{\bgamma}{\boldsymbol{\gamma}}$ $\newcommand{\bsigma}{{\boldsymbol{\sigma}}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\btheta}{{\boldsymbol{\theta}}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$ $\renewcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bp}{\mathbf{p}}$ $\renewcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\bfg}{\mathbf{g}}$ $\newcommand{\bfh}{\mathbf{h}}$ $\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}$ $\renewcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\pa}{\mathrm{pa}}$ $\newcommand{\Z}{\mathcal{Z}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\cE}{\mathcal{E}}$ $\newcommand{\cP}{\mathcal{P}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bLambda}{\boldsymbol{\Lambda}}$ $\newcommand{\cov}{\mathrm{Cov}}$ $\newcommand{\bfk}{\mathbf{k}}$ $\newcommand{\idx}[1]{}$ $\newcommand{\xdi}{}$
Motivating example: visualizing viral evolution
We consider an application of dimensionality reduction in biology. We will look at single-nucleotide polymorphism (SNP)$\idx{single-nucleotide polymorphism}\xdi$ data from viruses. A little background first. From Wikipedia:
A single-nucleotide polymorphism (SNP) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations -- C or A -- are said to be the alleles for this specific position.
Quoting Jombart et al., BMC Genetics (2010), we analyze:
the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. [...] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.
Some details about the Jombart et al. dataset:
For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics.
We load a dataset, which contains a subset of strains from the dataset mentioned above.
data = pd.read_csv('h3n2-snp.csv')
This is a large dataset. Here are the first five rows and the first ten columns.
print(data.iloc[:5, :10])
     strain  s6a  s6c  s6g  s17a  s17g  s17t  s39a  s39c  s39g
0  AB434107  1.0  0.0  0.0   1.0   0.0   0.0   0.0   0.0   1.0
1  AB434108  1.0  0.0  0.0   1.0   0.0   0.0   0.0   0.0   1.0
2  CY000113  1.0  0.0  0.0   1.0   0.0   0.0   0.0   0.0   1.0
3  CY000209  1.0  0.0  0.0   1.0   0.0   0.0   0.0   0.0   1.0
4  CY000217  1.0  0.0  0.0   1.0   0.0   0.0   0.0   0.0   1.0
For positions 6, 17, 39, etc., the corresponding columns indicate with a 1.0 which nucleotide (a, c, g, t) is present in the strain. For example, strain AB434107 has an a at positions 6 and 17, and a g at position 39.
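For instance, we can confirm this encoding directly from the dataframe. This is a quick check of ours (not part of the original analysis), using only column names that appear in the header above.

data.loc[data['strain'] == 'AB434107', ['s6a', 's6c', 's6g', 's17a', 's17g', 's17t', 's39a', 's39c', 's39g']]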
Overall, it contains $1642$ strains (whose names are listed in the first column). The data lives in a $317$-dimensional space (not counting the name of the strain, i.e., the first column).
data.shape
(1642, 318)
Obviously, visualizing this data is not straightforward. How can we make sense of it? More specifically, how can we explore any underlying structure it might have? Quoting Wikipedia:
In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. [...] Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.
In this chapter, we will encounter an important mathematical technique for dimension reduction, which allows us to explore this data -- and find interesting structure -- in $2$ (rather than $317$!) dimensions.
Background: review of matrix rank and spectral decomposition
NUMERICAL CORNER: In Numpy, one can compute the rank of a matrix using the function numpy.linalg.matrix_rank. We will see later in the chapter how to compute it using the singular value decomposition (which is how LA.matrix_rank does it). Let's try the example above.
w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
w3 = np.array([1., -1., 0.])
A = np.stack((w1, w2, w3), axis=-1)
print(A)
[[ 1.  0.  1.]
 [ 0.  1. -1.]
 [ 1.  1.  0.]]
We compute the rank of A.
LA.matrix_rank(A)
2
We take only the first two columns of A this time to form B.
B = np.stack((w1, w2),axis=-1)
print(B)
[[1. 0.]
 [0. 1.]
 [1. 1.]]
LA.matrix_rank(B)
2
Recall that, in Numpy, @ is used for matrix product.
C = np.array([[1., 0., 1.],[0., 1., -1.]])
print(C)
[[ 1.  0.  1.]
 [ 0.  1. -1.]]
LA.matrix_rank(C)
2
print(B @ C)
[[ 1.  0.  1.]
 [ 0.  1. -1.]
 [ 1.  1.  0.]]
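To preview the connection with the singular value decomposition mentioned above, here is a minimal sketch of how the rank can be read off from the singular values: it is the number of nonzero singular values which, in floating point, means the number above a small tolerance. (The helper rank_via_svd and the tolerance 1e-10 are ours for illustration; numpy.linalg.matrix_rank chooses its tolerance more carefully.)

def rank_via_svd(M, tol=1e-10):
    # singular values of M, in nonincreasing order
    s = LA.svd(M, compute_uv=False)
    # the rank is the number of singular values above the tolerance
    return int(np.sum(s > tol))

print(rank_via_svd(A), rank_via_svd(B), rank_via_svd(C))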
NUMERICAL CORNER: In Numpy, the eigenvalues and eigenvectors of a matrix can be computed using numpy.linalg.eig.
A = np.array([[2.5, -0.5], [-0.5, 2.5]])
w, v = LA.eig(A)
print(w)
print(v)
[3. 2.]
[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]
Above, w holds the eigenvalues in an array, whereas the columns of v are the corresponding eigenvectors.
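As a quick sanity check (ours, not in the original notebook), each eigenpair should satisfy the defining relation $A \mathbf{v}_i = \lambda_i \mathbf{v}_i$:

for i in range(len(w)):
    # column i of v should be an eigenvector of A with eigenvalue w[i]
    print(np.allclose(A @ v[:, i], w[i] * v[:, i]))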
NUMERICAL CORNER: Hence, we can check whether a matrix is positive semidefinite by computing its eigenvalues using numpy.linalg.eig.
A = np.array([[1, -1], [-1, 1]])
w, v = LA.eig(A)
print(w)
[2. 0.]
B = np.array([[1, -2], [-2, 1]])
z, u = LA.eig(B)
print(z)
[ 3. -1.]
KNOWLEDGE CHECK: Which one(s) of these matrices is positive semidefinite?
$$
A = \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix} \qquad B = \begin{pmatrix} 1 & -2\\ -2 & 1 \end{pmatrix}
$$

a) Both
b) $A$
c) $B$
d) Neither
$\checkmark$
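Here is a small helper along those lines (a sketch of ours, not part of mmids.py). Since the matrices of interest are symmetric, we can use numpy.linalg.eigvalsh, which is tailored to symmetric matrices and returns real eigenvalues, and declare the matrix positive semidefinite if all eigenvalues are nonnegative up to a small tolerance.

def is_psd(M, tol=1e-10):
    # eigvalsh assumes M is symmetric; eigenvalues are returned in ascending order
    return bool(np.all(LA.eigvalsh(M) >= -tol))

print(is_psd(A), is_psd(B))

Applied to the A and B above, this reproduces the conclusion suggested by the eigenvalues printed earlier.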
Power iteration
We implement the algorithm suggested by the Power Iteration Lemma. That is, we compute $B^{k} \mathbf{x}$, then normalize it. To obtain the corresponding singular value and left singular vector, we use that $\sigma_1 = \|A \mathbf{v}_1\|$ and $\mathbf{u}_1 = A \mathbf{v}_1/\sigma_1$.
def topsing(rng, A, maxiter=10):
    # random starting vector with dimension equal to the number of columns of A
    x = rng.normal(0, 1, np.shape(A)[1])
    B = A.T @ A
    # power iteration: repeatedly apply B = A^T A
    for _ in range(maxiter):
        x = B @ x
    # normalize to obtain the top right singular vector
    v = x / LA.norm(x)
    # corresponding singular value and left singular vector
    s = LA.norm(A @ v)
    u = A @ v / s
    return u, s, v
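One caveat (our remark, not from the book): the entries of $B^{k} \mathbf{x}$ grow or shrink geometrically with $k$, so for a much larger maxiter it is safer to renormalize at every iteration, which changes nothing mathematically since only the direction matters. A possible variant, under the same imports as above:

def topsing_stable(rng, A, maxiter=100):
    x = rng.normal(0, 1, np.shape(A)[1])
    B = A.T @ A
    for _ in range(maxiter):
        x = B @ x
        # renormalizing at each step avoids overflow or underflow
        x = x / LA.norm(x)
    v = x
    s = LA.norm(A @ v)
    u = A @ v / s
    return u, s, v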
NUMERICAL CORNER: We will apply it to our previous two-cluster example. The necessary functions are in mmids.py, which is available on the book's GitHub repository.
seed = 42
rng = np.random.default_rng(seed)
d, n, w = 10, 100, 3.
X = mmids.two_mixed_clusters(rng, d, n, w)
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c='k')
plt.axis([-6,6,-3,3])
plt.show()
Let's compute the top singular vector.
u, s, v = topsing(rng, X)
print(v)
[ 0.99257882 0.10164805 0.01581003 0.03202184 0.02075852 0.02798115 -0.02920916 -0.028189 -0.0166094 -0.00648726]
This is approximately $\mathbf{e}_1$. We get roughly the same answer (possibly up to sign) from Numpy's numpy.linalg.svd function.
u, s, vh = LA.svd(X)
print(vh.T[:,0])
[ 0.99257882 0.10164803 0.01581003 0.03202184 0.02075851 0.02798112 -0.02920917 -0.028189 -0.01660938 -0.00648724]
Recall that, when we applied $k$-means clustering to this example with $d=1000$ dimensions, we obtained a very poor clustering.
d, n, w = 1000, 100, 3.
X = mmids.two_mixed_clusters(rng, d, n, w)
assign = mmids.kmeans(rng, X, 2)
99423.42794703908
99423.42794703908
99423.42794703908
99423.42794703908
99423.42794703908
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-6,6,-3,3])
plt.show()
Let's try again, but after projecting onto the top singular vector. Recall that this corresponds to finding the best one-dimensional approximating subspace. The projection can be computed using the truncated SVD $Z= U_{(1)} \Sigma_{(1)} V_{(1)}^T$. We can interpret the rows of $U_{(1)} \Sigma_{(1)}$ as the coefficients of each data point in the basis $\mathbf{v}_1$. We will work in that basis. We need one small hack: because our implementation of $k$-means clustering expects data points in at least $2$ dimensions, we add a column of $0$'s.
u, s, v = topsing(rng, X)
Xproj = np.stack((u*s, np.zeros(np.shape(X)[0])), axis=-1)
fig = plt.figure()
ax = fig.add_subplot(111, aspect='equal')
ax.scatter(Xproj[:,0], Xproj[:,1], s=10, c='b', alpha=0.25)
plt.ylim([-3,3])
plt.show()
There is a small -- yet noticeable -- gap around 0. We run $k$-means clustering on the projected data.
assign = mmids.kmeans(rng, Xproj, 2)
1779.020119584778
514.1899426112672
514.1899426112672
514.1899426112672
514.1899426112672
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-6,6,-3,3])
plt.show()
Much better. We give a more formal explanation of this outcome in a subsequent section. In essence, quoting [BHK, Section 7.5.1]:
[...] let's understand the central advantage of doing the projection to [the top $k$ right singular vectors]. It is simply that for any reasonable (unknown) clustering of data points, the projection brings data points closer to their cluster centers.
Finally, looking at the top right singular vector (or its first ten entries for lack of space), we see that it does align quite well (but not perfectly) with the first dimension.
print(v[:10])
[-0.55564563 -0.02433674 0.02193487 -0.0333936 -0.00445505 -0.00243003 0.02576056 0.02523275 -0.00682153 0.02524646]
Application: principal components analysis
Having established a formal connection between PCA and SVD, we implement PCA using the SVD algorithm numpy.linalg.svd. We perform mean centering (now is the time to read that quote about the importance of mean centering again), but not the optional standardization. We use the fact that, in Numpy, subtracting from a matrix a vector whose length matches the number of columns subtracts it from every row.
def pca(X, l):
    # center the data: subtract the column means from every row
    mean = np.mean(X, axis=0)
    Y = X - mean
    # compact SVD of the centered data matrix
    U, S, Vt = LA.svd(Y, full_matrices=False)
    # the first l principal components are the rows of U_(l) Sigma_(l)
    return U[:, :l] @ np.diag(S[:l])
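As a quick consistency check of the implementation (ours; rng_check, M and Yc are just illustrative names, and M is a small random matrix), the output of pca should coincide with projecting the centered data onto the top $\ell$ right singular vectors, since $Y V_{(\ell)} = U_{(\ell)} \Sigma_{(\ell)}$ when $Y = U \Sigma V^T$:

rng_check = np.random.default_rng(0)
M = rng_check.normal(0, 1, (20, 5))
Yc = M - np.mean(M, axis=0)
U, S, Vt = LA.svd(Yc, full_matrices=False)
# the two ways of computing the first two principal components should agree
print(np.allclose(pca(M, 2), Yc @ Vt[:2, :].T))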
NUMERICAL CORNER: We apply it to data from our two-cluster Gaussian mixture model.
seed = 535
rng = np.random.default_rng(seed)
d, n, w = 1000, 100, 3.
X = mmids.two_mixed_clusters(rng, d, n, w)
T = pca(X, 2)
Plotting the result, we see that PCA does succeed in finding the main direction of variation. Note the gap in the middle.
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(T[:,0], T[:,1], s=5, c='k')
plt.show()
Note however that the first two principal components in fact "capture more noise" than what can be seen in the original first two coordinates, a form of overfitting.
NUMERICAL CORNER: We load the SNP dataset again. Recall that it contains $1642$ strains and lives in a $317$-dimensional space.
data = pd.read_csv('h3n2-snp.csv')
Our goal is to find a "good" low-dimensional representation of the data. We work with ten dimensions using PCA.
A = data[[data.columns[i] for i in range(1,len(data.columns))]].to_numpy()
n_dims = 10
T = pca(A, n_dims)
We plot the first two principal components, and see what appears to be some potential structure.
plt.figure(figsize=(5,3))
plt.scatter(T[:,0], T[:,1], s=10, c='k')
plt.axis([-3,6,-3,3])
plt.show()
There seem to be some reasonably well-defined clusters in this projection. We use $k$-means to identify them, taking advantage of the implementation in scikit-learn, sklearn.cluster.KMeans. By default, it finds $8$ clusters. The cluster assignments can be extracted from the attribute labels_.
from sklearn.cluster import KMeans
n_clusters = 8
kmeans = KMeans(n_clusters=n_clusters, init='k-means++',
random_state=seed, n_init=10).fit(T)
assign = kmeans.labels_
To further reveal the structure, we look at how the clusters spread out over the years. That information is in a separate file.
data_oth = pd.read_csv('h3n2-other.csv')
data_oth.head()
|   | strain | length | country | year | lon | lat | date |
|---|--------|--------|---------|------|-----|-----|------|
| 0 | AB434107 | 1701 | Japan | 2002 | 137.215474 | 35.584176 | 2002/02/25 |
| 1 | AB434108 | 1701 | Japan | 2002 | 137.215474 | 35.584176 | 2002/03/01 |
| 2 | CY000113 | 1762 | USA | 2002 | -73.940000 | 40.670000 | 2002/01/29 |
| 3 | CY000209 | 1760 | USA | 2002 | -73.940000 | 40.670000 | 2002/01/17 |
| 4 | CY000217 | 1760 | USA | 2002 | -73.940000 | 40.670000 | 2002/02/26 |
year = data_oth['year'].to_numpy()
For each cluster, we plot how many of its data points come from a specific year. Each cluster has a different color.
fig, ax = plt.subplots(figsize=(6,4))
for i in range(n_clusters):
unique, counts = np.unique(year[assign == i], return_counts=True)
ax.bar(unique, counts, label=i)
ax.set(xlim=(2001, 2007), xticks=np.arange(2002, 2007))
ax.legend()
plt.show()
Remarkably, we see that each cluster comes mostly from one year or two consecutive ones. In other words, the clustering in this low-dimensional projection captures some true underlying structure that is not explicitly in the genetic data on which it is computed.
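The same information can be summarized in a contingency table, for instance with pandas (a one-line sketch of ours; crosstab simply counts how many strains fall in each cluster-year combination):

print(pd.crosstab(assign, year, rownames=['cluster'], colnames=['year']))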
Going back to the first two principal components, we color the points on the scatterplot by year. (We use legend_elements() for automatic legend creation.)
fig = plt.figure(figsize=(5,3))
ax = fig.add_subplot(111, aspect='equal')
scatter = ax.scatter(T[:,0], T[:,1], s=10, c=year, label=year)
plt.legend(*scatter.legend_elements())
plt.show()
To some extent, one can "see" the virus evolving from year to year. The $x$-axis in particular seems to correlate strongly with the year, in the sense that samples from later years tend to be towards one side of the plot.
To further quantify this observation, we use numpy.corrcoef to compute the correlation coefficients between the year and the first $10$ principal components.
corr = np.zeros(n_dims)
for i in range(n_dims):
corr[i] = np.corrcoef(np.stack((T[:,i], year)))[0,1]
print(corr)
[-0.7905001 -0.42806325 0.0870437 -0.16839491 0.05757342 -0.06046913 -0.07920042 0.01436618 -0.02544749 0.04314641]
Indeed, we see that the first two principal components correlate (negatively) with the year, the first one quite strongly, while the remaining ones show little correlation.
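As a side note (ours), the same ten correlations can be obtained in a single call by stacking the principal components and the year into one array; this is just a compact way of writing the loop above.

# row -1 of the correlation matrix holds the correlations of year with each variable
corr_all = np.corrcoef(np.vstack((T.T, year)))[-1, :n_dims]
print(corr_all)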
Further applications of the SVD: low-rank approximations and ridge regression
NUMERICAL CORNER: In Numpy, the Frobenius norm of a matrix can be computed with the function numpy.linalg.norm using its default settings, while the induced $2$-norm can be computed with the same function by setting the ord parameter to 2.
A = np.array([[1., 0.],[0., 1.],[0., 0.]])
print(A)
[[1. 0.]
 [0. 1.]
 [0. 0.]]
LA.norm(A)
1.4142135623730951
LA.norm(A, 2)
1.0
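These values are consistent with the Matrix Norms and Singular Values Lemma used later in this section: the Frobenius norm is the square root of the sum of the squared singular values, while the induced $2$-norm is the largest singular value. A quick check of ours on the matrix A above:

s = LA.svd(A, compute_uv=False)
print(np.sqrt(np.sum(s**2)))   # should match LA.norm(A)
print(s[0])                    # should match LA.norm(A, 2)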
NUMERICAL CORNER: We return to our example with the two Gaussian clusters. This time, we use a function that produces the two clusters separately.
def two_separate_clusters(rng, d, n, w):
mu0 = np.concatenate(([w], np.zeros(d-1)))
mu1 = np.concatenate(([-w], np.zeros(d-1)))
X0 = mmids.spherical_gaussian(rng, d, n, mu0, 1)
X1 = mmids.spherical_gaussian(rng, d, n, mu1, 1)
return X0, X1
We first generate the data.
seed = 535
rng = np.random.default_rng(seed)
d, n, w = 1000, 100, 3.
X1, X2 = two_separate_clusters(rng, d, n, w)
X = np.vstack((X1, X2))
In reality, we cannot compute the matrix norms of $X-C$ and $X_k-C$ as the true centers are not known. But, because this is simulated data, we happen to know the truth and we can check the validity of our results in this case. The centers are:
C1 = np.stack([np.concatenate(([-w], np.zeros(d-1))) for _ in range(n)])
C2 = np.stack([np.concatenate(([w], np.zeros(d-1))) for _ in range(n)])
C = np.vstack((C1, C2))
We use the numpy.linalg.svd function to compute the norms from the formulas in the Matrix Norms and Singular Values Lemma. First, we observe that the singular values of $X-C$ decay slowly.
uc, sc, vhc = LA.svd(X-C)
plt.plot(sc, c='k')
plt.show()
The $k$-means objective with respect to the true centers under the full-dimensional data is:
print(np.sum(sc**2))
207905.47916782406
while the square of the top singular value of $X-C$ is only:
print(sc[0]**2)
8258.19604762502
Finally, we compute the $k$-means objective with respect to the true centers under the projected one-dimensional data:
u, s, vh = LA.svd(X)
print(np.sum((s[0] * np.outer(u[:,0], vh[0,:]) - C)**2))
8099.057045408984