Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 3-Singular value decomposition
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: March 4, 2024
Copyright: © 2024 Sebastien Roch
We consider an application of dimensionality reduction in biology. We will look at SNP data from viruses. A little background first. From Wikipedia:
A single-nucleotide polymorphism (SNP) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population. For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations -- C or A -- are said to be the alleles for this specific position.
Quoting Jombart et al., BMC Genetics (2010), we analyze:
the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic shift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter. These genetic changes also force influenza vaccines to be updated on a yearly basis. [...] Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next.
Some details about the Jombart et al. dataset:
For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank. Only sequences for which a location (country) and a date (year and month) were available were retained, which allowed us to classify strains into yearly winter epidemics. Because of the temporal lag between influenza epidemics in the two hemispheres, and given the fact that most available sequences were sampled in the northern hemisphere, we restricted our analysis to strains from the northern hemisphere (latitudes above 23.4°north). The final dataset included 1903 strains characterized by 125 SNPs which resulted in a total of 334 alleles. All strains from 2001 to 2007 were classified into six winter epidemics (2001-2006). This was done by assigning all strains from the second half of the year with those from the first half of the following year. For example, the 2005 winter epidemic comprises all strains collected between the 1st of July 2005 and the 30th of June 2006.
We load a dataset, which contains a subset of strains from the dataset mentioned above.
df = pd.read_csv('h3n2-snp.csv')
The first five rows are the following.
df.head()
strain | s6a | s6c | s6g | s17a | s17g | s17t | s39a | s39c | s39g | ... | s977g | s978a | s978c | s978g | s978t | s979a | s979c | s979t | s980a | s980c | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AB434107 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
1 | AB434108 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
2 | CY000113 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
3 | CY000209 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
4 | CY000217 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 rows × 318 columns
Overall it contains $1642$ strains.
df.shape[0]
1642
The data lives in a $318$-dimensional space.
df.shape[1]
318
Obviously, visualizing this data is not straightforward. How can we make sense of it? More specifically, how can we explore any underlying structure it might have? Quoting Wikipedia:
In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. [...] Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.
In this chapter we will encounter an important mathematical technique for dimension reduction, which allows us to explore this data -- and find interesting structure -- in $2$ (rather than $318$!) dimensions.
NUMERICAL CORNER: In Numpy, one can compute the rank of a matrix using the function numpy.linalg.matrix_rank. We will see later in the course how to compute it using the singular value decomposition (which is how LA.matrix_rank does it).
Let's try the example above.
w1 = np.array([1., 0., 1.])
w2 = np.array([0., 1., 1.])
w3 = np.array([1., -1., 0.])
A = np.stack((w1, w2, w3), axis=-1)
print(A)
[[ 1.  0.  1.]
 [ 0.  1. -1.]
 [ 1.  1.  0.]]
We compute the rank of A.
LA.matrix_rank(A)
2
We take only the first two columns of A this time to form B.
B = np.stack((w1, w2),axis=-1)
print(B)
[[1. 0.]
 [0. 1.]
 [1. 1.]]
LA.matrix_rank(B)
2
Recall that, in Numpy, @ is used for matrix product.
C = np.array([[1., 0., 1.],[0., 1., -1.]])
print(C)
[[ 1.  0.  1.]
 [ 0.  1. -1.]]
LA.matrix_rank(C)
2
print(B @ C)
[[ 1.  0.  1.]
 [ 0.  1. -1.]
 [ 1.  1.  0.]]
$\unlhd$
NUMERICAL CORNER: In Numpy, the outer product is computed using numpy.outer.
u = np.array([0., 2., -1.])
v = np.array([3., -2.])
Z = np.outer(u, v)
print(Z)
[[ 0. -0.]
 [ 6. -4.]
 [-3.  2.]]
print(LA.matrix_rank(Z))
1
$\unlhd$
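Outer products are the building blocks of low-rank matrices: a sum of $k$ outer products has rank at most $k$. A quick check of this fact, with hypothetical random vectors and a seeded generator (both assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(535)

# A sum of two outer products has rank at most 2
# (and exactly 2 for generic choices of vectors).
u1, u2 = rng.normal(size=4), rng.normal(size=4)
v1, v2 = rng.normal(size=3), rng.normal(size=3)
M = np.outer(u1, v1) + np.outer(u2, v2)
print(np.linalg.matrix_rank(M))  # 2 for generic vectors
```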
NUMERICAL CORNER: In Numpy, the eigenvalues and eigenvectors of a matrix can be computed using numpy.linalg.eig.
A = np.array([[2.5, -0.5], [-0.5, 2.5]])
w, v = LA.eig(A)
print(w)
print(v)
[3. 2.]
[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]
$\unlhd$
NUMERICAL CORNER: Hence, we can check whether a matrix is positive semidefinite by computing its eigenvalues using numpy.linalg.eig.
A = np.array([[1, -1], [-1, 1]])
w, v = LA.eig(A)
print(w)
[2. 0.]
B = np.array([[1, -2], [-2, 1]])
z, u = LA.eig(B)
print(z)
[ 3. -1.]
$\unlhd$
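This check can be packaged into a small helper (a hypothetical utility, not part of mmids.py). For symmetric matrices, numpy.linalg.eigvalsh is preferable to numpy.linalg.eig since it guarantees real eigenvalues, and a small tolerance guards against floating-point error:

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Check positive semidefiniteness of a symmetric matrix
    by testing that all eigenvalues are (numerically) nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

print(is_psd(np.array([[1., -1.], [-1., 1.]])))  # True
print(is_psd(np.array([[1., -2.], [-2., 1.]])))  # False
```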
NUMERICAL CORNER: We implement the algorithm suggested by the Power Iteration Lemma. That is, we compute $B^{k} \mathbf{x}$, then normalize it. To obtain the corresponding singular value and left singular vector, we use that $\sigma_1 = \|A \mathbf{v}_1\|$ and $\mathbf{u}_1 = A \mathbf{v}_1/\sigma_1$.
def topsing(A, maxiter=10):
    x = rng.normal(0,1,np.shape(A)[1])
    B = A.T @ A
    for _ in range(maxiter):
        x = B @ x
    v = x / LA.norm(x)
    s = LA.norm(A @ v)
    u = A @ v / s
    return u, s, v
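Before applying topsing to data, a sanity check on a matrix whose top singular value is known is reassuring. The snippet below repeats the function so that it runs standalone, with an explicitly seeded generator (an assumption here, since rng is set up elsewhere in the notes):

```python
import numpy as np
from numpy import linalg as LA

rng = np.random.default_rng(535)

def topsing(A, maxiter=10):
    # Power iteration on B = A^T A, then recover sigma_1 and u_1
    x = rng.normal(0, 1, np.shape(A)[1])
    B = A.T @ A
    for _ in range(maxiter):
        x = B @ x
    v = x / LA.norm(x)
    s = LA.norm(A @ v)
    u = A @ v / s
    return u, s, v

# A diagonal matrix with entries 3 and 1 has top singular value 3
A = np.diag([3., 1.])
u, s, v = topsing(A, maxiter=50)
print(np.round(s, 6))  # 3.0
```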
We will apply it to our previous two-cluster example. The necessary functions are in mmids.py, which is available from the GitHub repository of the notes.
d, n, w = 10, 100, 3.
X1, X2 = mmids.two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1])
plt.show()
Let's compute the top singular vector.
u, s, v = topsing(X)
print(v)
[-0.99758487 -0.00113548 -0.02959873 -0.01200961 -0.04158032 0.02730253 -0.02688588 0.00375839 -0.00481017 0.02384009]
This is approximately $-\mathbf{e}_1$. We get roughly the same answer (possibly up to sign) from Numpy's numpy.linalg.svd function.
u, s, vh = LA.svd(X)
print(vh.T[:,0])
[ 0.99758487 0.00113548 0.02959873 0.01200961 0.04158032 -0.02730253 0.02688588 -0.00375839 0.00481017 -0.02384009]
Recall that, when we applied $k$-means clustering to this example with $d=1000$ dimensions, we obtained a very poor clustering. Let's try again after projecting onto the top singular vector.
d, n, w = 1000, 100, 3.
X1, X2 = mmids.two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1])
plt.show()
assign = mmids.kmeans(X, 2)
200779.67991200055 200716.59990277822 200716.59990277822 200716.59990277822 200716.59990277822 200716.59990277822 200716.59990277822 200716.59990277822 200716.59990277822 200716.59990277822
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()
Let's try again, but after projecting onto the top singular vector. Recall that this corresponds to finding the best one-dimensional approximating subspace. The projection can be computed using the truncated SVD $Z= U_{(1)} \Sigma_{(1)} V_{(1)}^T$. We can interpret the rows of $U_{(1)} \Sigma_{(1)}$ as the coefficients of each data point in the basis $\mathbf{v}_1$. We will work in that basis. We need one small hack: because our implementation of $k$-means clustering expects data points in at least $2$ dimensions, we add a column of $0$'s.
u, s, v = topsing(X)
Xproj = np.stack((u*s, np.zeros(np.shape(X)[0])), axis=-1)
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(Xproj[:,0], Xproj[:,1])
plt.ylim([-3,3])
plt.show()
There is a small -- yet noticeable -- gap around 0.
A histogram of the first component of Xproj gives a better sense of the density of points.
plt.hist(Xproj[:,0])
plt.show()
We run $k$-means clustering on the projected data.
assign = mmids.kmeans(Xproj, 2)
3069.4402502007983 468.06675222564377 468.06675222564377 468.06675222564377 468.06675222564377 468.06675222564377 468.06675222564377 468.06675222564377 468.06675222564377 468.06675222564377
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()
Much better. We will give an explanation of this outcome in an upcoming (optional) subsection. In essence, quoting [BHK, Section 7.5.1]:
[...] let's understand the central advantage of doing the projection to [the top $k$ right singular vectors]. It is simply that for any reasonable (unknown) clustering of data points, the projection brings data points closer to their cluster centers.
Finally, looking at the top right singular vector (or its first ten entries for lack of space), we see that it does align quite well (but not perfectly) with the first dimension. In the next (optional) section, we try again with the top two singular vectors.
print(v[:10])
[ 0.77963237 -0.03550223 0.03731707 0.00200108 -0.00941479 -0.02415851 -0.00782393 0.03104106 0.00447986 -0.02359974]
NUMERICAL CORNER: We implement this last algorithm. We will need our previous implementation of Gram-Schmidt.
def svd(A, l, maxiter=100):
    V = rng.normal(0,1,(np.size(A,1),l))
    for _ in range(maxiter):
        W = A @ V
        Z = A.T @ W
        V, R = mmids.gramschmidt(Z)
    W = A @ V
    S = [LA.norm(W[:, i]) for i in range(np.size(W,1))]
    U = np.stack([W[:,i]/S[i] for i in range(np.size(W,1))],axis=-1)
    return U, S, V
Note that above we avoided forming the matrix $A^T A$. With a small number of iterations, that approach potentially requires fewer arithmetic operations overall, and it allows one to take advantage of the possible sparsity of $A$ (i.e., the fact that it may have many zeros).
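To see the point concretely: multiplying a vector by $A^T A$ via two matrix-vector products gives the same result as forming the Gram matrix first, but never materializes the $n \times n$ matrix. A small sketch with hypothetical dimensions and random data:

```python
import numpy as np

rng = np.random.default_rng(535)

A = rng.normal(size=(200, 50))
x = rng.normal(size=50)

y1 = A.T @ (A @ x)    # two matvecs; roughly 2mn flops, no Gram matrix formed
y2 = (A.T @ A) @ x    # forms the 50-by-50 Gram matrix explicitly first

print(np.allclose(y1, y2))  # True
```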
We apply it again to our two-cluster example.
d, n, w = 1000, 100, 3.
X1, X2 = mmids.two_clusters(d, n, w)
X = np.concatenate((X1, X2), axis=0)
Let's try again, but after projecting onto the top two singular vectors. Recall that this corresponds to finding the best two-dimensional approximating subspace. The projection can be computed using the truncated SVD $Z= U_{(2)} \Sigma_{(2)} V_{(2)}^T$. We can interpret the rows of $U_{(2)} \Sigma_{(2)}$ as the coefficients of each data point in the basis $\mathbf{v}_1,\mathbf{v}_2$. We will work in that basis.
U, S, V = svd(X, 2)
Xproj = np.stack((U[:,0]*S[0], U[:,1]*S[1]), axis=-1)
assign = mmids.kmeans(Xproj, 2)
4839.151922662225 3893.043231849657 3887.181432640043 3871.465672809303 3866.3657016218503 3832.797758573426 3614.612105279853 3137.4511765696116 2527.577531668082 2434.4101807079765
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], c=assign)
plt.show()
Finally, looking at the first two right singular vectors, we see that the first one does align quite well with the first dimension.
print(np.stack((V[:,0], V[:,1]), axis=-1))
[[ 0.7951525  -0.04955857]
 [-0.0071491  -0.0146887 ]
 [-0.00925286 -0.02036707]
 ...
 [-0.00749113  0.02588917]
 [ 0.01061442 -0.01162646]
 [ 0.00415543  0.01529427]]
NUMERICAL CORNER: Having established a formal connection between PCA and SVD, we implement PCA using our SVD algorithm (in mmids.py). We perform mean centering (now is the time to read that quote about the importance of mean centering again), but not the optional standardization. We use the fact that, in Numpy, subtracting from a matrix a vector whose dimension matches the number of columns broadcasts the subtraction across the rows.
def pca(X, l, maxiter=100):
    mean = np.mean(X, axis=0) # Compute mean of each column
    Y = X - mean # Mean center each column
    U, S, V = mmids.svd(Y, l, maxiter=maxiter)
    return U[:,:l] @ np.diag(S[:l])
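As a cross-check of the PCA-SVD connection (using Numpy's built-in numpy.linalg.svd rather than our iterative implementation, on hypothetical random data), the scores $U_{(l)} \Sigma_{(l)}$ coincide with the projection of the mean-centered data onto the top right singular vectors:

```python
import numpy as np
from numpy import linalg as LA

rng = np.random.default_rng(535)

X = rng.normal(size=(50, 5))
Y = X - np.mean(X, axis=0)              # mean center each column
U, S, Vh = LA.svd(Y, full_matrices=False)
T = U[:, :2] @ np.diag(S[:2])           # first two principal components

# T should equal the coefficients of Y in the basis v_1, v_2
print(np.allclose(T, Y @ Vh[:2].T))  # True
```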
We apply it to the Gaussian Mixture Model.
d, n, w = 1000, 100, 3.
X = mmids.two_mixed_clusters(d, n, w)
T = pca(X, 2)
Plotting the result, we see that PCA does succeed in finding the main direction of variation, with the first principal component aligning well with the first coordinate.
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(T[:,0], T[:,1], s=10)
plt.show()
Note however that the first two principal components (especially the second one) in fact "capture more noise" than what can be seen in the first two coordinates, a form of overfitting.
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(X[:,0], X[:,1], s=10)
plt.show()
We return to our motivating example. We apply PCA to our genetic dataset.
We load the dataset again and examine its first rows.
df = pd.read_csv('h3n2-snp.csv')
df.head()
strain | s6a | s6c | s6g | s17a | s17g | s17t | s39a | s39c | s39g | ... | s977g | s978a | s978c | s978g | s978t | s979a | s979c | s979t | s980a | s980c | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AB434107 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
1 | AB434108 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
2 | CY000113 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
3 | CY000209 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
4 | CY000217 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 rows × 318 columns
Recall that it contains $1642$ strains and lives in a $318$-dimensional space.
df.shape
(1642, 318)
Our goal is to find a "good" low-dimensional representation of the data. Ten dimensions will do here. We use PCA.
A = df[[df.columns[i] for i in range(1,len(df.columns))]].to_numpy()
n_dims = 10
T = pca(A, n_dims)
We plot the first two principal components, and see what appears to be some potential structure.
fig = plt.figure()
ax = fig.add_subplot(111,aspect='equal')
ax.scatter(T[:,0], T[:,1], s=10)
plt.show()
There seem to be some reasonably well-defined clusters in this projection. We use $k$-means to identify them. We take advantage of the implementation in scikit-learn, sklearn.cluster.KMeans. By default, it finds $8$ clusters. The cluster assignments can be extracted from the attribute labels_.
from sklearn.cluster import KMeans
n_clusters = 8
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10).fit(T)
assign = kmeans.labels_
To further reveal the structure, we look at how the clusters spread out over the years. That information is in a separate file.
dfoth = pd.read_csv('h3n2-other.csv')
dfoth.head()
strain | length | country | year | lon | lat | date | |
---|---|---|---|---|---|---|---|
0 | AB434107 | 1701 | Japan | 2002 | 137.215474 | 35.584176 | 2002/02/25 |
1 | AB434108 | 1701 | Japan | 2002 | 137.215474 | 35.584176 | 2002/03/01 |
2 | CY000113 | 1762 | USA | 2002 | -73.940000 | 40.670000 | 2002/01/29 |
3 | CY000209 | 1760 | USA | 2002 | -73.940000 | 40.670000 | 2002/01/17 |
4 | CY000217 | 1760 | USA | 2002 | -73.940000 | 40.670000 | 2002/02/26 |
year = dfoth['year'].to_numpy()
For each cluster, we plot how many of its data points come from a specific year. Each cluster has a different color.
fig, ax = plt.subplots()
for i in range(n_clusters):
    unique, counts = np.unique(year[assign == i], return_counts=True)
    ax.bar(unique, counts, label=i)
ax.set(xlim=(2001, 2007), xticks=np.arange(2002, 2007))
ax.legend()
plt.show()
Remarkably, we see that each cluster comes mostly from one year or two consecutive ones. In other words, the clustering in this low-dimensional projection captures some true underlying structure that is not explicitly in the genetic data on which it is computed.
Going back to the first two principal components, we color the points on the scatterplot by year. (We use legend_elements() for automatic legend creation.)
fig = plt.figure()
ax = fig.add_subplot(111, aspect='equal')
scatter = ax.scatter(T[:,0], T[:,1], s=10, c=year, label=year)
plt.legend(*scatter.legend_elements())
plt.show()
To some extent, one can "see" the virus evolving from year to year. The $x$-axis in particular seems to correlate strongly with the year, in the sense that samples from later years tend to be towards one side of the plot.
To further quantify this observation, we use numpy.corrcoef to compute the correlation coefficients between the year and the first $10$ principal components.
corr = np.zeros(n_dims)
for i in range(n_dims):
    corr[i] = np.corrcoef(np.stack((T[:,i], year)))[0,1]
print(corr)
[-0.7905001 0.42806325 0.0870437 -0.16840281 0.05754834 -0.06046913 -0.0791655 0.01456608 0.02544735 -0.04316141]
Indeed, we see that the first three or four principal components correlate well with the year.
NUMERICAL CORNER: In Numpy, matrix norms are computed with numpy.linalg.norm: by default it returns the Frobenius norm, while passing ord=2 gives the induced $2$-norm (the largest singular value). Let's try a simple example.
A = np.array([[1., 0.],[0., 1.],[0., 0.]])
print(A)
[[1. 0.]
 [0. 1.]
 [0. 0.]]
LA.norm(A)
1.4142135623730951
LA.norm(A, 2)
1.0
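The two values above are explained by the singular values of $A$: the Frobenius norm equals $\sqrt{\sum_j \sigma_j^2}$ while the induced $2$-norm equals $\sigma_1$. A quick verification:

```python
import numpy as np
from numpy import linalg as LA

A = np.array([[1., 0.], [0., 1.], [0., 0.]])
s = LA.svd(A, compute_uv=False)   # singular values of A: [1. 1.]

print(np.isclose(LA.norm(A), np.sqrt(np.sum(s**2))))  # True (Frobenius)
print(np.isclose(LA.norm(A, 2), s.max()))             # True (2-norm)
```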
NUMERICAL CORNER: In Numpy, the pseudoinverse of a matrix can be computed using the function numpy.linalg.pinv.
Let's try our previous example.
A = np.array([[1., 0.], [-1., 0.]])
print(A)
[[ 1.  0.]
 [-1.  0.]]
Ap = LA.pinv(A)
print(Ap)
[[ 0.5 -0.5]
 [ 0.   0. ]]
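One way to check a computed pseudoinverse is against the Moore-Penrose conditions, two of which are $A A^+ A = A$ and $A^+ A A^+ = A^+$:

```python
import numpy as np
from numpy import linalg as LA

A = np.array([[1., 0.], [-1., 0.]])
Ap = LA.pinv(A)

# Two of the Moore-Penrose conditions
print(np.allclose(A @ Ap @ A, A))    # True
print(np.allclose(Ap @ A @ Ap, Ap))  # True
```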
For example, orthogonal matrices have condition number $1$, the lowest possible value for it (Why?). That indicates that orthogonal matrices have good numerical properties.
q = 1/np.sqrt(2)
Q = np.array([[q, q], [q, -q]])
print(Q)
[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]
LA.cond(Q)
1.0000000000000002
In contrast, matrices with nearly linearly dependent columns have large condition numbers.
eps = 1e-6
A = np.array([[q, q], [q, q+eps]])
print(A)
[[0.70710678 0.70710678]
 [0.70710678 0.70710778]]
LA.cond(A)
2828429.1245844117
u, s, vh = LA.svd(A)
print(s)
[1.41421406e+00 4.99999823e-07]
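These singular values explain the condition number computed above: in the $2$-norm, $\kappa(A) = \sigma_{\max}/\sigma_{\min}$, which is how numpy.linalg.cond computes it (via the SVD):

```python
import numpy as np
from numpy import linalg as LA

q = 1 / np.sqrt(2)
eps = 1e-6
A = np.array([[q, q], [q, q + eps]])
s = LA.svd(A, compute_uv=False)

# cond in the 2-norm (the default) is the ratio of extreme singular values
print(np.isclose(LA.cond(A), s[0] / s[-1]))  # True
```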
We compute the solution to $A \mathbf{x} = \mathbf{b}$ when $\mathbf{b}$ is the left singular vector of $A$ corresponding to the largest singular value. Recall that, in the proof of the Conditioning of Matrix-Vector Multiplication Theorem, we showed that the worst-case bound is achieved when $\mathbf{z} = \mathbf{b}$ is the right singular vector of $M = A^{-1}$ corresponding to its lowest singular value. In a previous example, given a matrix $A = \sum_{j=1}^n \sigma_j \mathbf{u}_j \mathbf{v}_j^T$ in compact SVD form, we derived a compact SVD for the inverse as
$$
A^{-1} = \sigma_n^{-1} \mathbf{v}_n \mathbf{u}_n^T + \sigma_{n-1}^{-1} \mathbf{v}_{n-1} \mathbf{u}_{n-1}^T + \cdots + \sigma_1^{-1} \mathbf{v}_1 \mathbf{u}_1^T.
$$

Here, compared to the SVD of $A$, the order of the singular values is reversed and the roles of the left and right singular vectors are exchanged. So we take $\mathbf{b}$ to be the top left singular vector of $A$.
b = u[:,0]
print(b)
[-0.70710653 -0.70710703]
x = LA.solve(A,b)
print(x)
[-0.49999965 -0.5 ]
We make a small perturbation in the direction of the second right singular vector. Recall that, in the proof of the Conditioning of Matrix-Vector Multiplication Theorem, we showed that the worst case is achieved when $\mathbf{d} = \delta\mathbf{b}$ is the top right singular vector of $M = A^{-1}$. By the argument above, that is the left singular vector of $A$ corresponding to the lowest singular value.
delta = 1e-6
bp = b + delta*u[:,1]
print(bp)
[-0.70710724 -0.70710632]
The relative change in solution is:
xp = LA.solve(A,bp)
print(xp)
[-1.91421421 0.91421356]
(LA.norm(x-xp)/LA.norm(x))/(LA.norm(b-bp)/LA.norm(b))
2828429.1246659174
$\unlhd$