Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 1-Introduction: a first data science problem
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: July 15, 2024
Copyright: © 2024 Sebastien Roch
$\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bfbeta}{\boldsymbol{\beta}}$ $\newcommand{\bflambda}{\boldsymbol{\lambda}}$ $\newcommand{\bgamma}{\boldsymbol{\gamma}}$ $\newcommand{\bsigma}{{\boldsymbol{\sigma}}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\btheta}{{\boldsymbol{\theta}}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$ $\renewcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bp}{\mathbf{p}}$ $\renewcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\bfg}{\mathbf{g}}$ $\newcommand{\bfh}{\mathbf{h}}$ $\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}$ $\renewcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\pa}{\mathrm{pa}}$ $\newcommand{\Z}{\mathcal{Z}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\cE}{\mathcal{E}}$ $\newcommand{\cP}{\mathcal{P}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bLambda}{\boldsymbol{\Lambda}}$ $\newcommand{\cov}{\mathrm{Cov}}$ $\newcommand{\bfk}{\mathbf{k}}$ $\newcommand{\idx}[1]{}$ $\newcommand{\xdi}{}$
Motivating example: identifying penguin species
Imagine that you are an evolutionary biologist studying penguins. You have collected measurements on a large number of individual specimens. Your goal is to identify different species within this collection based on those measurements.
Here is a penguin dataset collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. We load the data in the form of a data table (similar to a spreadsheet), called a DataFrame in pandas, where the columns are different measurements (or features) and the rows are different samples. Specifically, below we load the data using pandas.read_csv and show the first $5$ lines of the dataset (see DataFrame.head). This dataset is a simplified version (i.e., with some columns removed) of the full dataset, which is maintained by Allison Horst at this GitHub page.
import pandas as pd
data = pd.read_csv('penguins-measurements.csv')
data.head()
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 |
| 3 | NaN | NaN | NaN | NaN |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 |
Observe that this dataset has missing values (i.e., the entries NaN above). A common way to deal with this issue is to remove all rows with missing values. This can be done using pandas.DataFrame.dropna. This kind of pre-processing is fundamental in data science, but we will not discuss it much in this book. It is, however, important to be aware of it.
data = data.dropna()
data.head()
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 |
| 5 | 39.3 | 20.6 | 190.0 | 3650.0 |
There are $342$ samples, as can be seen by using pandas.DataFrame.shape, which gives the dimensions of the DataFrame as a tuple.
data.shape
(342, 4)
Let us first extract the columns into a Numpy array using pandas.DataFrame.to_numpy(). We will have more to say later about Numpy, a numerical library for Python that, in essence, allows one to manipulate vectors and matrices.
X = data.to_numpy()
print(X)
[[  39.1   18.7  181.  3750. ]
 [  39.5   17.4  186.  3800. ]
 [  40.3   18.   195.  3250. ]
 ...
 [  50.4   15.7  222.  5750. ]
 [  45.2   14.8  212.  5200. ]
 [  49.9   16.1  213.  5400. ]]
We visualize two measurements in the data, the bill depth and flipper length. (The original dataset used the more precise term culmen depth.) Below, each point is a sample. This is called a scatter plot$\idx{scatter plot}\xdi$. Quoting Wikipedia:
The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
We use matplotlib.pyplot for most of our plotting needs in this book, with a few exceptions. Specifically, here we use the function matplotlib.pyplot.scatter.
import matplotlib.pyplot as plt
plt.scatter(X[:,1], X[:,2], s=5, c='k')
plt.xlabel('bill_depth_mm'), plt.ylabel('flipper_length_mm')
plt.show()
We observe what appears to be two fairly well-defined clusters of samples on the top left and bottom right respectively. What is a cluster? Intuitively, it is a group of samples that are close to each other, but far from every other sample. In this case, it may be an indication that the samples in different clusters come from different species.
Now let's look at the full dataset. Visualizing the full $4$-dimensional data is not straightforward. One way to do this is to consider all pairwise scatter plots. We use the function seaborn.pairplot from the library Seaborn.
import seaborn as sns
sns.pairplot(data, height=2)
plt.show()
How many species of penguins do you think there are in this dataset?
What would be useful is a method that automatically identifies clusters whatever the dimension of the data. In this chapter, we will discuss a standard way to do this: $k$-means clustering. We will come back to the penguins dataset later in the chapter.
But first we need to review some basic concepts about vectors and distances in order to formulate clustering as an appropriate optimization problem, a perspective that will recur throughout.
Background: quick refresher of matrix algebra, differential calculus, and elementary probability
NUMERICAL CORNER: In Numpy, a vector is defined as a 1d array. We must first import the Numpy package, which is often abbreviated as np.
import numpy as np
u = np.array([1., 3., 5. ,7.])
print(u)
[1. 3. 5. 7.]
We access the entries of u as follows. Note that indexing in Numpy starts at $0$.
print(u[0])
print(u[1])
1.0
3.0
To obtain the norm of a vector, we can use the function linalg.norm, which requires the numpy.linalg package (often abbreviated as LA):
from numpy import linalg as LA
LA.norm(u)
9.16515138991168
which we check next "by hand"
np.sqrt(np.sum(u ** 2))
9.16515138991168
In Numpy, ** indicates element-wise exponentiation.
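An alternative check, shown here as a small optional sketch, uses the fact that the squared norm of u is the inner product of u with itself, which numpy.dot computes (assuming numpy has been imported as np and u is defined as above).

# the norm is the square root of the inner product of u with itself
np.sqrt(np.dot(u, u))

This returns the same value as above.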
NUMERICAL CORNER: We will often work with collections of $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in $\mathbb{R}^d$ and it will be convenient to stack them up into a matrix

$$ X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix}. $$

To create a matrix out of two vectors, we use the function numpy.stack.
u = np.array([1., 3., 5., 7.])
v = np.array([2., 4., 6., 8.])
X = np.stack((u,v),axis=0)
print(X)
[[1. 3. 5. 7.]
 [2. 4. 6. 8.]]
Quoting the documentation:
The axis parameter specifies the index of the new axis in the dimensions of the result. For example, if axis=0 it will be the first dimension and if axis=-1 it will be the last dimension.
Alternatively, we can define the same matrix as follows.
Y = np.array([[1., 3., 5., 7.],[2., 4., 6., 8.]])
print(Y)
[[1. 3. 5. 7.]
 [2. 4. 6. 8.]]
We access the entries as follows.
print(Y[0,0])
print(Y[0,1])
1.0
3.0
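We will also often need whole rows or columns of a matrix. The following optional sketch (using the Y defined above) extracts them by slicing with the colon notation.

print(Y[0,:])   # first row of Y
print(Y[:,1])   # second column of Y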
NUMERICAL CORNER: In Numpy, the Frobenius norm of a matrix can be computed using the function numpy.linalg.norm.
A = np.array([[1., 0.],[0., 1.],[0., 0.]])
print(A)
[[1. 0.]
 [0. 1.]
 [0. 0.]]
LA.norm(A)
1.4142135623730951
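As with vectors, we can check this "by hand" by summing the squared entries and taking the square root (using the A defined above):

np.sqrt(np.sum(A ** 2))

which again gives $\sqrt{2} \approx 1.4142$.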
NUMERICAL CORNER: The function $f(x) = x^2$ over $\mathbb{R}$ has a global minimizer at $x^* = 0$. Indeed, we clearly have $f(x) \geq 0$ for all $x$ while $f(0) = 0$. To plot the function, we use the Matplotlib package again, and specifically its function matplotlib.pyplot.plot. We also use the function numpy.linspace to create an array of evenly spaced numbers at which we evaluate $f$.
import matplotlib.pyplot as plt
x = np.linspace(-2,2,100)
y = x ** 2
plt.plot(x, y, c='k')
plt.ylim(-0.25,4.25)
plt.show()
The function $f(x) = e^x$ over $\mathbb{R}$ does not have a global minimizer. Indeed, $f(x) > 0$ but no $x$ achieves $0$. And, for any $m > 0$, there is $x$ small enough such that $f(x) < m$. Note that $\mathbb{R}$ is not bounded, therefore the Extreme Value Theorem does not apply here.
x = np.linspace(-2,2,100)
y = np.exp(x)
plt.plot(x, y, c='k')
plt.ylim(-0.25,4.25)
plt.show()
The function $f(x) = (x+1)^2 (x-1)^2$ over $\mathbb{R}$ has two global minimizers, at $x^* = -1$ and $x^{**} = 1$. Indeed, $f(x) \geq 0$ and $f(x) = 0$ if and only if $x = x^*$ or $x = x^{**}$.
x = np.linspace(-2,2,100)
y = ((x+1)**2) * ((x-1)**2)
plt.plot(x,y,c='k')
plt.ylim(-0.25,4.25)
plt.show()
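As a rough numerical sanity check (an optional sketch, not a proof), we can locate a minimizer on the grid x defined above with numpy.argmin. Note that argmin returns only the index of the first minimum, so it finds the minimizer near $-1$ but not the one near $1$.

# index of the first smallest value of f on the grid
idx = np.argmin(y)
print(x[idx], y[idx])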
NUMERICAL CORNER: We can use simulations to confirm the Weak Law of Large Numbers. Recall that a uniform random variable over the interval $[a,b]$ has density

$$ f_{X}(x) = \begin{cases} \frac{1}{b-a} & x \in [a,b] \\ 0 & \text{o.w.} \end{cases} $$

We write $X \sim \mathrm{U}[a,b]$. We can obtain a sample from $\mathrm{U}[0,1]$ by using the function numpy.random.Generator.uniform. We must first instantiate a random number generator (RNG) with numpy.random.default_rng in Numpy. We provide a seed as an initial state for the RNG. Using the same seed again ensures reproducibility.
seed = 535
rng = np.random.default_rng(seed)
rng.uniform()
0.9836159914889122
Now we take $n$ samples from $\mathrm{U}[0,1]$ and compute their sample mean. We repeat $k$ times and display the empirical distribution of the sample means using a histogram. We start with $n=10$.
n, k = 10, 1000
sample_mean = [np.mean(rng.random(n)) for i in range(k)]
plt.hist(sample_mean, bins=10, color='lightblue', edgecolor='black')
plt.xlim(0,1)
plt.show()
Taking $n$ much larger leads to more concentration around the mean.
n, k = 100, 1000
sample_mean = [np.mean(rng.random(n)) for i in range(k)]
plt.hist(sample_mean, bins=10, color='lightblue', edgecolor='black')
plt.xlim(0,1)
plt.show()
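A complementary check (an optional sketch reusing the rng instantiated above) is to track how the spread of the sample means shrinks with $n$; the empirical standard deviation should decrease roughly like $1/\sqrt{n}$.

k = 1000
for n in [10, 100, 1000]:
    sample_mean = [np.mean(rng.random(n)) for i in range(k)]
    print(n, np.std(sample_mean))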
Normal distribution $\idx{normal or Gaussian distribution}\xdi$ Recall that a standard Normal variable $X$ has PDF

$$ f_X(x) = \frac{1}{\sqrt{2 \pi}} \exp\left( - x^2/2 \right). $$

Its mean is $0$ and its variance is $1$.
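As a quick optional illustration (using the rng instantiated above), we can draw a large sample of standard Normals with numpy.random.Generator.standard_normal and check that the empirical mean and variance are close to $0$ and $1$.

z = rng.standard_normal(100000)
print(np.mean(z), np.var(z))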
NUMERICAL CORNER: The following function generates $n$ data points from a spherical $d$-dimensional Gaussian with variance $\sigma^2$ and mean $\bmu$. Below, rng.normal(0,1,(n,d)) generates $n$ independent $d$-dimensional spherical Gaussians with mean $\mathbf{0}$ and variance $1$ (as rows). Throughout, when defining a function that uses a random number generator (RNG), we initialize the RNG outside the function and pass the RNG to it. This allows us to maintain control over the random number generation process at a higher level and ensures consistent results across multiple runs.
def spherical_gaussian(rng, d, n, mu, sig):
    # shift and scale n independent standard d-dimensional Gaussian samples
    return mu + sig * rng.normal(0,1,(n,d))
We generate $100$ data points in dimension $d=2$. We take $\sigma^2 = 1$ and $\bmu = w \mathbf{e}_1$. Below, we use the function numpy.hstack to create a vector by concatenating two given vectors. We use [w] to create a vector with a single entry w. We also use the function numpy.zeros to create an all-zero vector.
d, n, w, sig = 2, 100, 3., 1.
mu = np.hstack(([w], np.zeros(d-1)))
X = spherical_gaussian(rng, d, n, mu, sig)
plt.scatter(X[:,0], X[:,1], s=5, c='k')
plt.axis([-1, 7, -4, 4])
plt.show()
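As an optional sanity check on the X just generated, its empirical mean should be close to $\bmu = (3, 0)$.

print(np.mean(X, axis=0))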
This is straightforward to implement by using numpy.random.Generator.choice to choose the component of each sample. The code is the following. It returns an n by d array X, where each row is a sample from a 2-component spherical Gaussian mixture.
def gmm2spherical(rng, d, n, phi0, phi1, mu0, sig0, mu1, sig1):
    # stack the parameters of the two components
    phi, mu, sig = np.stack((phi0, phi1)), np.stack((mu0, mu1)), np.stack((sig0, sig1))
    X = np.zeros((n,d))
    # choose a component for each sample according to the mixture weights
    component = rng.choice(2, size=n, p=phi)
    for i in range(n):
        # draw the sample from the chosen spherical Gaussian
        X[i,:] = spherical_gaussian(rng, d, 1, mu[component[i],:], sig[component[i]])
    return X
NUMERICAL CORNER: Let us try it with the following parameters.
d, n, w, sig0, sig1, phi0, phi1 = 2, 1000, 3., 1.5, 0.5, 0.2, 0.8
mu0, mu1 = np.hstack(([w], np.zeros(d-1))), np.hstack(([-w], np.zeros(d-1)))
X = gmm2spherical(rng, d, n, phi0, phi1, mu0, sig0, mu1, sig1)
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=5, color='k')
plt.axis([-8, 8, -4, 4])
plt.show()
As expected, we observe two clusters. The one on the right (component $0$) is sparser (i.e., it contains fewer data points) since phi0 is much smaller than phi1. It is also larger, as its variance is larger.
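One rough way to check the mixing proportions empirically is the following optional sketch; since gmm2spherical does not return the component labels, we use the sign of the first coordinate as a proxy, which works here because the components are well separated.

# fraction of samples on the right, should be close to phi0 = 0.2
print(np.mean(X[:,0] > 0))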
Clustering: an objective, an algorithm and a guarantee
Lloyd's algorithm and its analysis
We are now ready to describe Lloyd's algorithm$\idx{Lloyd's algorithm}\xdi$. We start from a random assignment of clusters. (An alternative initialization strategy is to choose $k$ representatives at random among the data points.) We then alternate between the optimal choices in the lemmas. In lieu of pseudo-code, we write out the algorithm in Python. We will use this approach throughout the book.
The input X is assumed to be a collection of $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ stacked into a matrix, with one row for each data point. The other input, k, is the desired number of clusters. There is an optional input maxiter for the maximum number of iterations, which is set to $5$ by default.
We first define separate functions for the two main steps. To find the index of a minimum entry of an array, we use the function numpy.argmin. We also use numpy.linalg.norm to compute the Euclidean distance.
def opt_reps(X, k, assign):
    # compute the optimal representative (centroid) of each cluster
    (n, d) = X.shape
    reps = np.zeros((k, d))
    for i in range(k):
        in_i = [j for j in range(n) if assign[j] == i]
        reps[i,:] = np.sum(X[in_i,:], axis=0) / len(in_i)
    return reps
def opt_clust(X, k, reps):
    # assign each data point to its closest representative
    (n, d) = X.shape
    dist = np.zeros(n)
    assign = np.zeros(n, dtype=int)
    for j in range(n):
        dist_to_i = np.array([LA.norm(X[j,:] - reps[i,:]) for i in range(k)])
        assign[j] = np.argmin(dist_to_i)
        dist[j] = dist_to_i[assign[j]]
    G = np.sum(dist ** 2)
    print(G)   # print the current value of the objective
    return assign
The main function follows. Below, rng.integers(0,k,n) is an array of n uniformly chosen integers between 0 and k-1 (inclusive). See numpy.random.Generator.integers for details. Recall that, throughout, when defining a function that uses a random number generator (RNG), we initialize the RNG outside the function and pass the RNG to it. This allows us to maintain control over the random number generation process at a higher level and ensures consistent results across multiple runs.
def kmeans(rng, X, k, maxiter=5):
    (n, d) = X.shape
    # start from a uniformly random cluster assignment
    assign = rng.integers(0,k,n)
    reps = np.zeros((k, d), dtype=int)
    for iter in range(maxiter):
        # alternate between the two optimal choices
        reps = opt_reps(X, k, assign)
        assign = opt_clust(X, k, reps)
    return assign
NUMERICAL CORNER: We apply our implementation of $k$-means to the example above. We fix k to $3$. Here the data matrix X is the following:
seed = 535
rng = np.random.default_rng(seed)
X = np.array([[1., 0.],[-2., 0.],[-2.,1.],[1.,-3.],
[-10.,10.],[2.,-2.],[-3.,1.],[3.,-1.]])
assign = kmeans(rng, X, 3)
162.7
74.8611111111111
9.083333333333334
9.083333333333334
9.083333333333334
We visualize the output by coloring the points according to their cluster assignment.
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-11,4,-4,11])
plt.show()
We can compute the final representatives (optimal for the final assignment) by using the subroutine opt_reps.
print(opt_reps(X, 3, assign))
[[ -2.33333333   0.66666667]
 [  1.75        -1.5       ]
 [-10.          10.        ]]
Each row is the center of the corresponding cluster. Note that these match the ones we previously computed; indeed, the clustering is the same (although the clusters are not necessarily listed in the same order).
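As an optional check (a small sketch reusing the functions defined above), we can recompute the objective for this final solution by summing the squared distances from each point to its assigned representative; the result should agree with the last value printed by the algorithm.

reps = opt_reps(X, 3, assign)
print(np.sum([LA.norm(X[j,:] - reps[assign[j],:]) ** 2 for j in range(X.shape[0])]))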
NUMERICAL CORNER: We will test our implementation of $k$-means on the penguins dataset introduced earlier in the chapter. We first extract the columns and combine them into a data matrix X. As we did previously, we also remove the rows with missing values.
data = pd.read_csv('penguins-measurements.csv')
data = data.dropna()
X = data[['bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g']].to_numpy()
We visualize a two-dimensional slice of the data.
plt.scatter(X[:,1], X[:,3], s=5, c='k')
plt.xlabel('bill_depth_mm'), plt.ylabel('body_mass_g')
plt.show()
Observe that the features have quite different scales (tens versus thousands in the plot above). In such a case, it is common to standardize the data so that each feature has roughly the same scale. For each column of X, we subtract its empirical mean and divide by its empirical standard deviation.
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
X = (X - mean) / std
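To confirm that the standardization did what we intended (a quick optional check), each column of X should now have mean approximately $0$ and standard deviation approximately $1$.

print(np.mean(X, axis=0))
print(np.std(X, axis=0))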
Now we run Lloyd's algorithm with $k=2$ clusters.
assign = kmeans(rng, X, 2)
1338.2046936914157
820.9361062178352
603.8787658966849
575.2587351391593
567.7837494880662
We visualize the output as we did before, but this time coloring the data points by their cluster assignment.
plt.scatter(X[:,1], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_depth (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()
This clustering looks quite good. Nevertheless, recall that:

- in this plot we are looking at only two of the four variables while $k$-means uses all of them,
- we are not guaranteed to find the best solution,
- our objective function is somewhat arbitrary, and
- it is not clear what the right choice of $k$ is.
In fact, the original dataset contained the correct answer, as provided by biologists. Despite what the figure above may lead us to believe, there are in reality three separate species. So let us try with $k=3$ clusters.
assign = kmeans(rng, X, 3)
1312.344945158482
577.1700837839458
428.50397345437966
392.2616692426171
383.3452894259011
The output does not seem quite right.
plt.scatter(X[:,1], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_depth (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()
But, remembering the warnings mentioned previously, let us look at a different two-dimensional slice.
plt.scatter(X[:,0], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_length (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()
Let us load up the truth and compare. We only keep those samples that were not removed because of missing values (see pandas.DataFrame.iloc).
data_truth = pd.read_csv('penguins-species.csv')
data_truth = data_truth.iloc[data.index]
data_truth.head()
| | species |
|---|---|
| 0 | Adelie |
| 1 | Adelie |
| 2 | Adelie |
| 4 | Adelie |
| 5 | Adelie |
The species are:
species = data_truth['species']
print(species.unique())
['Adelie' 'Chinstrap' 'Gentoo']
To plot the outcome, we color the species blue-green-red using a dictionary.
species2color_dict = {'Adelie': 'blue', 'Chinstrap': 'lime', 'Gentoo': 'red'}
truth = [species2color_dict[a] for a in species]
Finally, we can compare the output to the truth. The match is quite good -- but certainly not perfect.
f, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(6.5, 3))
ax1.scatter(X[:,0], X[:,3], s=5, c=truth)
ax1.set_title('truth')
ax2.scatter(X[:,0], X[:,3], s=5, c=assign, cmap='brg')
ax2.set_title('kmeans')
plt.show()
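Beyond visual inspection, one optional way to quantify the agreement is to cross-tabulate the true species against the cluster assignments with pandas.crosstab; if the match is good, each cluster should be dominated by a single species. (The row and column names below are added only for readability.)

print(pd.crosstab(species.to_numpy(), assign, rownames=['species'], colnames=['cluster']))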
Determining the appropriate number of clusters is not a straightforward problem. To quote Wikipedia:
The correct choice of $k$ is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing $k$ without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when $k$ equals the number of data points, $n$). Intuitively then, the optimal choice of $k$ will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of $k$ is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.
In practice, several heuristics are in use (one simple one is sketched below). Other approaches to clustering, e.g., DBSCAN and hierarchical clustering, do not require the number of clusters as input.
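As a minimal sketch of such a heuristic (the so-called elbow method), we can run the algorithm for a range of values of $k$ on the standardized penguins data, record the final objective, and look for a value of $k$ beyond which the improvement slows down markedly. This reuses the kmeans and opt_reps functions defined above and assumes no cluster becomes empty along the way; note that kmeans prints the objective at every iteration, so this produces some extra output.

ks = range(1, 7)
objs = []
for k in ks:
    assign_k = kmeans(rng, X, k, maxiter=10)
    reps_k = opt_reps(X, k, assign_k)
    # final objective: sum of squared distances to assigned representatives
    objs.append(np.sum([LA.norm(X[j,:] - reps_k[assign_k[j],:]) ** 2
                        for j in range(X.shape[0])]))
plt.plot(list(ks), objs, c='k', marker='o')
plt.xlabel('k'), plt.ylabel('final objective')
plt.show()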
Some observations about high-dimensional data
Clustering in high dimension
In this section, we test our implementation of $k$-means on a simple simulated dataset in high dimension.
The following function generates $n$ data points from a mixture of two equally likely spherical $d$-dimensional Gaussians with variance $1$, one with mean $-w\mathbf{e}_1$ and one with mean $w \mathbf{e}_1$. We use gmm2spherical from a previous section. It is found in mmids.py.
import mmids  # module accompanying the book; provides gmm2spherical and kmeans

def two_mixed_clusters(rng, d, n, w):
    # two equally likely spherical Gaussian components with means +/- w e_1
    mu0 = np.hstack(([w], np.zeros(d-1)))
    mu1 = np.hstack(([-w], np.zeros(d-1)))
    return mmids.gmm2spherical(rng, d, n, 0.5, 0.5, mu0, 1, mu1, 1)
NUMERICAL CORNER: We start with $d=2$.
seed = 535
rng = np.random.default_rng(seed)
d, n, w = 2, 100, 3.
X = two_mixed_clusters(rng, d, n, w)
Let's run $k$-means on this dataset using $k=2$. We use kmeans() from the mmids.py file.
assign = mmids.kmeans(rng, X, 2)
1044.8267883490312
208.5284166285488
204.02397716710018
204.02397716710018
204.02397716710018
The default number of iterations seems to have been enough for the algorithm to converge. We can visualize the result by coloring the points according to the assignment.
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-6,6,-3,3])
plt.show()
Let's see what happens in higher dimension. We repeat our experiment with $d=1000$.
d, n, w = 1000, 100, 3.
X = two_mixed_clusters(rng, d, n, w)
Again, we observe two clearly delineated clusters.
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c='k')
plt.axis([-6,6,-3,3])
plt.show()
This dataset is in $1000$ dimensions, but we have plotted only the first two coordinates. If we instead plot any two coordinates not including the first, we see only one cluster.
plt.figure(figsize=(6,3))
plt.scatter(X[:,1], X[:,2], s=10, c='k')
plt.axis([-6,6,-3,3])
plt.show()
Let's see how $k$-means fares on this dataset.
assign = mmids.kmeans(rng, X, 2)
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
Our attempt at clustering does not appear to have been successful.
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-6,6,-3,3])
plt.show()
What happened? While the clusters are easy to tease apart if we know to look at the first coordinate only, in the full space the within-cluster and between-cluster distances become harder to distinguish: the noise overwhelms the signal.
As the dimension increases, the distributions of intra-cluster and inter-cluster distances overlap significantly and become more or less indistinguishable. That provides some insights into why clustering may fail here. Note that we used the same offset for all simulations. On the other hand, if the separation between the clusters is sufficiently large, one would expect clustering to work even in high dimension.
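To make this more concrete, the following optional sketch compares typical within-cluster and between-cluster distances in the two regimes; the labels are inferred from the sign of the first coordinate, which is only a rough proxy but adequate here. In low dimension the within-cluster distances are much smaller than the between-cluster ones, while in high dimension the two become nearly indistinguishable.

for d in [2, 1000]:
    Z = two_mixed_clusters(rng, d, 100, 3.)
    labels = (Z[:,0] < 0)   # rough component labels from the first coordinate
    within = [LA.norm(Z[i,:] - Z[j,:]) for i in range(100) for j in range(i) if labels[i] == labels[j]]
    between = [LA.norm(Z[i,:] - Z[j,:]) for i in range(100) for j in range(i) if labels[i] != labels[j]]
    print(d, np.mean(within), np.mean(between))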
TRY IT! What precedes (and what follows in the next subsection) is not a formal proof that $k$-means clustering will be unsuccessful here. The behavior of the algorithm is quite complex and depends, in particular, on the initialization and the density of points. Here, increasing the number of data points eventually leads to a much better performance. Explore this behavior on your own by modifying the code. (For some theoretical justifications (beyond this course), see here and here.)
NUMERICAL CORNER: We can check the theorem in a simulation. Here we pick $n$ points uniformly at random in the $d$-cube $\mathcal{C}$, for a range of dimensions up to dmax. We then plot the frequency of landing in the inscribed $d$-ball $\mathcal{B}$ and see that it rapidly converges to $0$. Alternatively, we could just plot the formula for the volume of $\mathcal{B}$. But knowing how to do simulations is useful in situations where explicit formulas are unavailable or intractable. We plot the result up to dimension $10$.
dmax, n = 10, 1000
in_ball = np.zeros(dmax)
for d in range(dmax):
    # fraction of n uniform points in the unit cube that land in the inscribed ball
    in_ball[d] = np.mean([(LA.norm(rng.random(d+1) - 1/2) < 1/2) for _ in range(n)])
plt.plot(np.arange(1,dmax+1), in_ball, c='k')
plt.show()
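For comparison, we can overlay the exact formula mentioned above: the volume of the $d$-ball of radius $1/2$ is $\frac{\pi^{d/2}}{\Gamma(d/2+1)} (1/2)^d$. The optional sketch below uses math.gamma for the Gamma function and reuses in_ball from the simulation.

from math import gamma, pi

# exact probability of landing in the inscribed ball, for dimensions 1 through dmax
vol = [pi ** ((d+1)/2) / gamma((d+1)/2 + 1) * (1/2) ** (d+1) for d in range(dmax)]
plt.plot(np.arange(1,dmax+1), in_ball, c='k', label='simulation')
plt.plot(np.arange(1,dmax+1), vol, c='k', linestyle='dashed', label='formula')
plt.legend()
plt.show()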