Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 1-Introduction: a first data science problem
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: July 15, 2024
Copyright: © 2024 Sebastien Roch


$\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bfbeta}{\boldsymbol{\beta}}$ $\newcommand{\bflambda}{\boldsymbol{\lambda}}$ $\newcommand{\bgamma}{\boldsymbol{\gamma}}$ $\newcommand{\bsigma}{{\boldsymbol{\sigma}}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\btheta}{{\boldsymbol{\theta}}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$ $\renewcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bp}{\mathbf{p}}$ $\renewcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\bfg}{\mathbf{g}}$ $\newcommand{\bfh}{\mathbf{h}}$ $\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}$ $\renewcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\pa}{\mathrm{pa}}$ $\newcommand{\Z}{\mathcal{Z}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\cE}{\mathcal{E}}$ $\newcommand{\cP}{\mathcal{P}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bLambda}{\boldsymbol{\Lambda}}$ $\newcommand{\cov}{\mathrm{Cov}}$ $\newcommand{\bfk}{\mathbf{k}}$ $\newcommand{\idx}[1]{}$ $\newcommand{\xdi}{}$

Motivating example: identifying penguin species¶

Imagine that you are an evolutionary biologist studying penguins. You have collected measurements on a large number of individual specimens. Your goal is to identify different species within this collection based on those measurements.

Here is a penguin dataset collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. We load the data into a data table (similar to a spreadsheet) called a DataFrame in pandas, where the columns are different measurements (or features) and the rows are different samples. Below, we read the file using pandas.read_csv and show the first $5$ lines of the dataset (see DataFrame.head). This dataset is a simplified version (i.e., with some columns removed) of the full dataset, which is maintained by Allison Horst at this GitHub page.

In [3]:
import pandas as pd
data = pd.read_csv('penguins-measurements.csv')
data.head()
Out[3]:
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
2 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
4 36.7 19.3 193.0 3450.0

Observe that this dataset has missing values (i.e., the entries NaN above). A common way to deal with this issue is to remove all rows with missing values. This can be done using pandas.DataFrame.dropna. This kind of pre-processing is fundamental in data science, but we will not discuss it much in this book. It is however important to be aware of it.

In [4]:
data = data.dropna()
data.head()
Out[4]:
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
2 40.3 18.0 195.0 3250.0
4 36.7 19.3 193.0 3450.0
5 39.3 20.6 190.0 3650.0

There are $342$ samples, as can be seen by using pandas.DataFrame.shape which gives the dimensions of the DataFrame as a tuple.

In [5]:
data.shape
Out[5]:
(342, 4)

Let us first extract the columns into a Numpy array using pandas.DataFrame.to_numpy(). We will have more to say later about Numpy, a numerical library for Python that, in essence, allows us to manipulate vectors and matrices.

In [6]:
X = data.to_numpy()
print(X)
[[  39.1   18.7  181.  3750. ]
 [  39.5   17.4  186.  3800. ]
 [  40.3   18.   195.  3250. ]
 ...
 [  50.4   15.7  222.  5750. ]
 [  45.2   14.8  212.  5200. ]
 [  49.9   16.1  213.  5400. ]]

We visualize two measurements in the data, the bill depth and flipper length. (The original dataset used the more precise term culmen depth.) Below, each point is a sample. This is called a scatter plot$\idx{scatter plot}\xdi$. Quoting Wikipedia:

The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

We use matplotlib.pyplot for most of our plotting needs in this book, with a few exceptions. Specifically, here we use the function matplotlib.pyplot.scatter.

In [7]:
import matplotlib.pyplot as plt
plt.scatter(X[:,1], X[:,2], s=5, c='k')
plt.xlabel('bill_depth_mm'), plt.ylabel('flipper_length_mm')
plt.show()

We observe what appear to be two fairly well-defined clusters of samples, on the top left and bottom right respectively. What is a cluster? Intuitively, it is a group of samples that are close to each other, but far from every other sample. In this case, it may be an indication that the two groups correspond to different species.

Now let's look at the full dataset. Visualizing the full $4$-dimensional data is not straightforward. One way to do this is to consider all pairwise scatter plots. We use the function seaborn.pairplot from the library Seaborn.

In [8]:
import seaborn as sns
sns.pairplot(data, height=2)
plt.show()

How many species of penguins do you think there are in this dataset?

What would be useful is a method that automatically identifies clusters whatever the dimension of the data. In this chapter, we will discuss a standard way to do this: $k$-means clustering. We will come back to the penguins dataset later in the chapter.

But first we need to review some basic concepts about vectors and distances in order to formulate clustering as an appropriate optimization problem, a perspective that will recur throughout.

Background: quick refresher of matrix algebra, differential calculus, and elementary probability¶

NUMERICAL CORNER: In Numpy, a vector is defined as a 1d array. We must first import the Numpy package, which is commonly abbreviated as np.

In [9]:
import numpy as np
u = np.array([1., 3., 5. ,7.])
print(u)
[1. 3. 5. 7.]

We access the entries of u as follows; note that indexing in Numpy starts at $0$.

In [10]:
print(u[0])
print(u[1])
1.0
3.0

To obtain the norm of a vector, we can use the function linalg.norm, which requires the numpy.linalg package (often abbreviated as LA):

In [11]:
from numpy import linalg as LA
LA.norm(u)
Out[11]:
9.16515138991168

which we check next "by hand"

In [12]:
np.sqrt(np.sum(u ** 2))
Out[12]:
9.16515138991168

In Numpy, ** indicates element-wise exponentiation.
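
For instance (a quick illustration, not in the original notes), applying ** to the vector u squares each entry separately:

print(u ** 2)   # [ 1.  9. 25. 49.]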

NUMERICAL CORNER: We will often work with collections of $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in $\mathbb{R}^d$ and it will be convenient to stack them up into a matrix

$$ X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \\ \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \\ \end{bmatrix}. $$

To create a matrix out of two vectors, we use the function numpy.stack.

In [13]:
u = np.array([1., 3., 5., 7.])
v = np.array([2., 4., 6., 8.])
X = np.stack((u,v),axis=0)
print(X)
[[1. 3. 5. 7.]
 [2. 4. 6. 8.]]

Quoting the documentation:

The axis parameter specifies the index of the new axis in the dimensions of the result. For example, if axis=0 it will be the first dimension and if axis=-1 it will be the last dimension.
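
For instance (a quick illustration, not in the original notes), stacking the same two vectors with axis=-1 places them as the columns of the result rather than its rows:

print(np.stack((u,v), axis=-1))   # a 4 by 2 matrix whose columns are u and v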

Alternatively, we can define the same matrix as follows.

In [14]:
Y = np.array([[1., 3., 5., 7.],[2., 4., 6., 8.]])
print(Y)
[[1. 3. 5. 7.]
 [2. 4. 6. 8.]]

We access the entries as follows.

In [15]:
print(Y[0,0])
print(Y[0,1])
1.0
3.0

NUMERICAL CORNER: In Numpy, the Frobenius norm of a matrix can be computed using the function numpy.linalg.norm.

In [16]:
A = np.array([[1., 0.],[0., 1.],[0., 0.]])
print(A)
[[1. 0.]
 [0. 1.]
 [0. 0.]]
In [17]:
LA.norm(A)
Out[17]:
1.4142135623730951
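
As with the vector norm above, we can check this "by hand" (a quick verification, not in the original notes):

np.sqrt(np.sum(A ** 2))   # equals sqrt(2), matching the value above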

NUMERICAL CORNER: The function $f(x) = x^2$ over $\mathbb{R}$ has a global minimizer at $x^* = 0$. Indeed, we clearly have $f(x) \geq 0$ for all $x$ while $f(0) = 0$. To plot the function, we use the Matplotlib package again, and specifically its function matplotlib.pyplot.plot. We also use the function numpy.linspace to create an array of evenly spaced numbers where we evaluate $f$.

In [18]:
import matplotlib.pyplot as plt

x = np.linspace(-2,2,100)
y = x ** 2

plt.plot(x, y, c='k')
plt.ylim(-0.25,4.25)
plt.show()

The function $f(x) = e^x$ over $\mathbb{R}$ does not have a global minimizer. Indeed, $f(x) > 0$ for all $x$, but no $x$ achieves the value $0$; and, for any $m > 0$, there is $x$ negative enough that $f(x) < m$, so the infimum $0$ is not attained. Note that $\mathbb{R}$ is not bounded, so the Extreme Value Theorem does not apply here.

In [19]:
x = np.linspace(-2,2,100)
y = np.exp(x)

plt.plot(x, y, c='k')
plt.ylim(-0.25,4.25)
plt.show()

The function $f(x) = (x+1)^2 (x-1)^2$ over $\mathbb{R}$ has two global minimizers at $x^* = -1$ and $x^{**} = 1$. Indeed, $f(x) \geq 0$ and $f(x) = 0$ if and only if $x = x^*$ or $x = x^{**}$.

In [20]:
x = np.linspace(-2,2,100)
y = ((x+1)**2) * ((x-1)**2)

plt.plot(x,y,c='k')
plt.ylim(-0.25,4.25)
plt.show()

NUMERICAL CORNER: We can use simulations to confirm the Weak Law of Large Numbers. Recall that a uniform random variable over the interval $[a,b]$ has density

$$ f_{X}(x) = \begin{cases} \frac{1}{b-a} & x \in [a,b] \\ 0 & \text{o.w.} \end{cases} $$

We write $X \sim \mathrm{U}[a,b]$. We can obtain a sample from $\mathrm{U}[0,1]$ by using the function numpy.random.Generator.uniform. We must first instantiate a random number generator (RNG) with numpy.random.default_rng in Numpy. We provide a seed as an initial state for the RNG. Using the same seed again ensures reproducibility.

In [21]:
seed = 535
rng = np.random.default_rng(seed)
rng.uniform()
Out[21]:
0.9836159914889122
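
As a quick sanity check (not in the original notes), re-instantiating a generator with the same seed reproduces the exact same draw:

rng2 = np.random.default_rng(seed)
rng2.uniform()   # returns the same value as above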

Now we take $n$ samples from $\mathrm{U}[0,1]$ and compute their sample mean. We repeat $k$ times and display the empirical distribution of the sample means using a histogram. We start with $n=10$.

In [22]:
n, k = 10, 1000
sample_mean = [np.mean(rng.random(n)) for i in range(k)]
plt.hist(sample_mean, bins=10, color='lightblue', edgecolor='black')
plt.xlim(0,1)
plt.show()

Taking $n$ much larger leads to more concentration around the mean.

In [23]:
n, k = 100, 1000
sample_mean = [np.mean(rng.random(n)) for i in range(k)]
plt.hist(sample_mean, bins=10, color='lightblue', edgecolor='black')
plt.xlim(0,1)
plt.show()

Normal distribution $\idx{normal or Gaussian distribution}\xdi$ Recall that a standard Normal variable $X$ has PDF

$$ f_X(x) = \frac{1}{\sqrt{2 \pi}} \exp\left( - x^2/2 \right). $$

Its mean is $0$ and its variance is $1$.
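
As a quick empirical check (a short sketch, not part of the original notes), we can draw a large standard Normal sample and confirm that its sample mean and variance are close to $0$ and $1$. We use a separate generator here so as not to disturb the state of rng.

rng_check = np.random.default_rng(0)   # separate generator so rng is undisturbed
z = rng_check.normal(0, 1, 100000)
print(np.mean(z), np.var(z))   # both should be close to 0 and 1, respectively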

NUMERICAL CORNER: The following function generates $n$ data points from a spherical $d$-dimensional Gaussian with variance $\sigma^2$ and mean $\bmu$.

Below, rng.normal(0,1,(n,d)) generates n independent d-dimensional spherical Gaussian vectors with mean $\mathbf{0}$ and identity covariance, stored as the rows of an n by d array.

Throughout, when defining a function that uses a random number generator (RNG), we initialize the RNG outside the function and pass it to the function. This allows us to maintain control over the random number generation process at a higher level and ensures consistent results across multiple runs.

In [25]:
def spherical_gaussian(rng, d, n, mu, sig):
    return mu + sig * rng.normal(0,1,(n,d))

We generate $100$ data points in dimension $d=2$. We take $\sigma^2 = 1$ and $\bmu = w \mathbf{e}_1$. Below we use the function numpy.hstack to create a vector by concatenating two given vectors. We use [w] to create a vector with a single entry w. We also use the function numpy.zeros to create an all-zero vector.

In [26]:
d, n, w, sig = 2, 100, 3., 1.
mu = np.hstack(([w], np.zeros(d-1)))
X = spherical_gaussian(rng, d, n, mu, sig)
plt.scatter(X[:,0], X[:,1], s=5, c='k')
plt.axis([-1, 7, -4, 4])
plt.show()

Sampling from a two-component mixture of spherical Gaussians is straightforward to implement: we use numpy.random.Generator.choice to choose the component of each sample.

The code is the following. It returns an n by d array X, where each row is a sample from a 2-component spherical Gaussian mixture.

In [27]:
def gmm2spherical(rng, d, n, phi0, phi1, mu0, sig0, mu1, sig1):

    # stack the parameters of the two components
    phi, mu, sig = np.stack((phi0, phi1)), np.stack((mu0, mu1)), np.stack((sig0, sig1))

    X = np.zeros((n,d))
    # choose a component for each sample, then draw from that component
    component = rng.choice(2, size=n, p=phi)
    for i in range(n):
        X[i,:] = spherical_gaussian(rng, d, 1, mu[component[i],:], sig[component[i]])

    return X

NUMERICAL CORNER: Let us try it with the following parameters.

In [28]:
d, n, w, sig0, sig1, phi0, phi1 = 2, 1000, 3., 1.5, 0.5, 0.2, 0.8
mu0, mu1 = np.hstack(([w], np.zeros(d-1))), np.hstack(([-w], np.zeros(d-1)))
X = gmm2spherical(rng, d, n, phi0, phi1, mu0, sig0, mu1, sig1)
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=5, color='k')
plt.axis([-8, 8, -4, 4])
plt.show()

As expected, we observe two clusters. The one on the right (component $0$) is sparser (i.e., it contains fewer data points) since phi0 is much smaller than phi1. It is also more spread out, since its variance is larger.

Clustering: an objective, an algorithm and a guarantee¶

Lloyd's algorithm and its analysis¶

We are now ready to describe Lloyd's algorithm$\idx{Lloyd's algorithm}\xdi$. We start from a random assignment of clusters. (An alternative initialization strategy is to choose $k$ representatives at random among the data points.) We then alternate between the optimal choices in the lemmas. In lieu of pseudo-code, we write out the algorithm in Python. We will use this approach throughout the book.

The input X is assumed to be a collection of $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ stacked into a matrix, with one row for each data point. The other input, k, is the desired number of clusters. There is an optional input maxiter for the maximum number of iterations, which is set to $5$ by default.

We first define separate functions for the two main steps. To find the index of the minimum entry of an array, we use the function numpy.argmin. We also use numpy.linalg.norm to compute the Euclidean distance.

In [29]:
def opt_reps(X, k, assign):
    # for a fixed assignment, the optimal representative of each cluster
    # is the mean (centroid) of the points assigned to it
    (n, d) = X.shape
    reps = np.zeros((k, d))
    for i in range(k):
        in_i = [j for j in range(n) if assign[j] == i]
        reps[i,:] = np.sum(X[in_i,:], axis=0) / len(in_i)
    return reps

def opt_clust(X, k, reps):
    # for fixed representatives, the optimal assignment sends each point
    # to its closest representative in Euclidean distance
    (n, d) = X.shape
    dist = np.zeros(n)
    assign = np.zeros(n, dtype=int)
    for j in range(n):
        dist_to_i = np.array([LA.norm(X[j,:] - reps[i,:]) for i in range(k)])
        assign[j] = np.argmin(dist_to_i)
        dist[j] = dist_to_i[assign[j]]
    G = np.sum(dist ** 2)
    print(G)  # print the current value of the k-means objective
    return assign

The main function follows. Below, rng.integers(0,k,n) is an array of n uniformly chosen integers between 0 and k-1 (inclusive); see numpy.random.Generator.integers for details. Recall that, throughout, when defining a function that uses a random number generator (RNG), we initialize the RNG outside the function and pass it in. This allows us to maintain control over the random number generation process at a higher level and ensures consistent results across multiple runs.

In [30]:
def kmeans(rng, X, k, maxiter=5):
    (n, d) = X.shape
    assign = rng.integers(0,k,n)  # random initial assignment
    reps = np.zeros((k, d), dtype=int)
    for iter in range(maxiter):
        # alternate between the two optimal steps
        reps = opt_reps(X, k, assign)
        assign = opt_clust(X, k, reps)
    return assign

NUMERICAL CORNER: We apply our implementation of $k$-means to the example above. We fix k to $3$. Here the data matrix X is the following:

In [31]:
seed = 535
rng = np.random.default_rng(seed)
X = np.array([[1., 0.],[-2., 0.],[-2.,1.],[1.,-3.],
              [-10.,10.],[2.,-2.],[-3.,1.],[3.,-1.]])
assign = kmeans(rng, X, 3)
162.7
74.8611111111111
9.083333333333334
9.083333333333334
9.083333333333334

We visualize the output by coloring the points according to their cluster assignment.

In [32]:
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-11,4,-4,11])
plt.show()

We can compute the final representatives (optimal for the final assignment) by using the subroutine opt_reps.

In [33]:
print(opt_reps(X, 3, assign))
[[ -2.33333333   0.66666667]
 [  1.75        -1.5       ]
 [-10.          10.        ]]

Each row is the center of the corresponding cluster. Note that these match the ones we previously computed. Indeed, the clustering is the same (although the clusters are not necessarily listed in the same order).

NUMERICAL CORNER: We will test our implementation of $k$-means on the penguins dataset introduced earlier in the chapter. We first extract the columns and combine them into a data matrix X. As we did previously, we also remove the rows with missing values.

In [34]:
data = pd.read_csv('penguins-measurements.csv')
data = data.dropna()
X = data[['bill_length_mm', 'bill_depth_mm', 
        'flipper_length_mm', 'body_mass_g']].to_numpy()

We visualize a two-dimensional slice of the data.

In [35]:
plt.scatter(X[:,1], X[:,3], s=5, c='k')
plt.xlabel('bill_depth_mm'), plt.ylabel('body_mass_g')
plt.show()

Observe that the features have quite different scales (tens versus thousands in the plot above). In such a case, it is common to standardize the data so that each feature has roughly the same scale. For each column of X, we subtract its empirical mean and divide by its empirical standard deviation.

In [36]:
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)
X = (X - mean) / std

Now we run Lloyd's algorithm with $k=2$ clusters.

In [37]:
assign = kmeans(rng, X, 2)
1338.2046936914157
820.9361062178352
603.8787658966849
575.2587351391593
567.7837494880662

We visualize the output as we did before, but this time coloring the data points by their cluster assignment.

In [38]:
plt.scatter(X[:,1], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_depth (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()

This clustering looks quite good. Nevertheless recall that:

  1. in this plot we are looking at only two of the four variables while $k$-means uses all of them,

  2. we are not guaranteed to find the best solution (see the brief illustration after this list),

  3. our objective function is somewhat arbitrary, and

  4. it is not clear what the right choice of $k$ is.
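
To illustrate point 2 (a brief sketch, not part of the original text), we can rerun Lloyd's algorithm with a few different random initializations; the final objective values printed by opt_clust may differ across runs, reflecting convergence to different local minima.

for trial_seed in [1, 2, 3]:
    trial_rng = np.random.default_rng(trial_seed)
    _ = kmeans(trial_rng, X, 2)   # watch the last printed objective of each run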

In fact, the original dataset contained the correct answer, as provided by biologists. Despite what the figure above may lead us to believe, there are in reality three separate species. So let us try with $k=3$ clusters.

In [39]:
assign = kmeans(rng, X, 3)
1312.344945158482
577.1700837839458
428.50397345437966
392.2616692426171
383.3452894259011

The output does not seem quite right.

In [40]:
plt.scatter(X[:,1], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_depth (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()

But, remembering the warnings mentioned previously, let us look at a different two-dimensional slice.

In [41]:
plt.scatter(X[:,0], X[:,3], s=5, c=assign, cmap='brg')
plt.xlabel('bill_length (standardized)'), plt.ylabel('body_mass (standardized)')
plt.show()

Let us load up the truth and compare. We only keep those samples that were not removed because of missing values (see pandas.DataFrame.iloc).

In [42]:
data_truth = pd.read_csv('penguins-species.csv') 
data_truth = data_truth.iloc[data.index]
data_truth.head()
Out[42]:
species
0 Adelie
1 Adelie
2 Adelie
4 Adelie
5 Adelie

The species are:

In [43]:
species = data_truth['species']
print(species.unique())
['Adelie' 'Chinstrap' 'Gentoo']

To plot the outcome, we color the species blue-green-red using a dictionary.

In [44]:
species2color_dict = {'Adelie': 'blue', 'Chinstrap': 'lime', 'Gentoo': 'red'}
truth = [species2color_dict[a] for a in species]

Finally, we can compare the output to the truth. The match is quite good -- but certainly not perfect.

In [45]:
f, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(6.5, 3))
ax1.scatter(X[:,0], X[:,3], s=5, c=truth)
ax1.set_title('truth')
ax2.scatter(X[:,0], X[:,3], s=5, c=assign, cmap='brg')
ax2.set_title('kmeans')
plt.show()

Determining the appropriate number of clusters is not a straightforward problem. To quote Wikipedia:

The correct choice of $k$ is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing $k$ without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when $k$ equals the number of data points, $n$). Intuitively then, the optimal choice of $k$ will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of $k$ is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.

In practice, several heuristics are in use. Other approaches to clustering, e.g., DBSCAN and hierarchical clustering, do not require the number of clusters as an input.
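
One common heuristic is the so-called elbow method: run $k$-means for a range of values of $k$ and plot the final objective value; the point where the curve stops decreasing sharply suggests a reasonable choice. Here is a minimal sketch (not part of the original text) on the standardized penguin data X, reusing the kmeans and opt_reps functions defined above; the helper kmeans_objective is ours, introduced only for this illustration.

def kmeans_objective(X, k, assign):
    # recompute the optimal representatives for this assignment and
    # return the sum of squared distances to them (the k-means objective)
    reps = opt_reps(X, k, assign)
    return np.sum([LA.norm(X[j,:] - reps[assign[j],:]) ** 2 for j in range(X.shape[0])])

objs = []
for k in range(1, 7):
    assign_k = kmeans(rng, X, k)   # prints the objective at each iteration
    objs.append(kmeans_objective(X, k, assign_k))

plt.plot(range(1, 7), objs, c='k', marker='o')
plt.xlabel('k'), plt.ylabel('final objective')
plt.show()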

Some observations about high-dimensional data¶

Clustering in high dimension¶

In this section, we test our implementation of $k$-means on a simple simulated dataset in high dimension.

The following function generates $n$ data points from a mixture of two equally likely spherical $d$-dimensional Gaussians with variance $1$, one with mean $w\mathbf{e}_1$ and one with mean $-w \mathbf{e}_1$. We use gmm2spherical from the previous section, which is found in mmids.py.

In [46]:
def two_mixed_clusters(rng, d, n, w):
    mu0 = np.hstack(([w], np.zeros(d-1)))
    mu1 = np.hstack(([-w], np.zeros(d-1)))
    return mmids.gmm2spherical(rng, d, n, 0.5, 0.5, mu0, 1, mu1, 1)

NUMERICAL CORNER: We start with $d=2$.

In [47]:
seed = 535
rng = np.random.default_rng(seed)
d, n, w = 2, 100, 3.
X = two_mixed_clusters(rng, d, n, w)

Let's run $k$-means on this dataset using $k=2$. We use kmeans() from the mmids.py file.

In [48]:
assign = mmids.kmeans(rng, X, 2)
1044.8267883490312
208.5284166285488
204.02397716710018
204.02397716710018
204.02397716710018

Our default number of iterations seems to have been enough for the algorithm to converge. We can visualize the result by coloring the points according to the assignment.

In [49]:
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-6,6,-3,3])
plt.show()

Let's see what happens in higher dimension. We repeat our experiment with $d=1000$.

In [50]:
d, n, w = 1000, 100, 3.
X = two_mixed_clusters(rng, d, n, w)

Again, we observe two clearly delineated clusters.

In [51]:
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c='k')
plt.axis([-6,6,-3,3])
plt.show()

This dataset is in $1000$ dimensions, but we have plotted only the first two coordinates. If we instead plot any two coordinates that do not include the first one, we see only one cluster.

In [52]:
plt.figure(figsize=(6,3))
plt.scatter(X[:,1], X[:,2], s=10, c='k')
plt.axis([-6,6,-3,3])
plt.show()

Let's see how $k$-means fares on this dataset.

In [53]:
assign = mmids.kmeans(rng, X, 2)
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592
99518.03165136592

Our attempt at clustering does not appear to have been successful.

In [54]:
plt.figure(figsize=(6,3))
plt.scatter(X[:,0], X[:,1], s=10, c=assign, cmap='brg')
plt.axis([-6,6,-3,3])
plt.show()

What happened? While the clusters are easy to tease apart if we know to look at the first coordinate only, in the full space the within-cluster and between-cluster distances become harder to distinguish: the noise overwhelms the signal.

As the dimension increases, the distributions of intra-cluster and inter-cluster distances overlap significantly and become more or less indistinguishable. That provides some insight into why clustering may fail here. Note that we used the same offset $w$ for all simulations. On the other hand, if the separation between the clusters were sufficiently large, one would expect clustering to work even in high dimension.
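
To see this concretely, here is a minimal sketch (not part of the original text) comparing within-cluster and between-cluster distances in $d = 1000$. It reuses the spherical_gaussian function defined earlier and generates the two components separately so that the true labels are known; we use a separate generator so the state of rng is undisturbed.

rng_hd = np.random.default_rng(0)   # separate generator, introduced only for this sketch
d, n, w = 1000, 100, 3.
X0 = spherical_gaussian(rng_hd, d, n, np.hstack(([w], np.zeros(d-1))), 1.)
X1 = spherical_gaussian(rng_hd, d, n, np.hstack(([-w], np.zeros(d-1))), 1.)

# pairwise distances within one component and across the two components
within = [LA.norm(X0[i,:] - X0[j,:]) for i in range(n) for j in range(i+1, n)]
between = [LA.norm(X0[i,:] - X1[j,:]) for i in range(n) for j in range(n)]

plt.hist(within, bins=30, alpha=0.5, label='within-cluster')
plt.hist(between, bins=30, alpha=0.5, label='between-cluster')
plt.xlabel('distance')
plt.legend()
plt.show()

The two histograms overlap heavily in high dimension, whereas repeating the experiment with $d = 2$ would show them well separated.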

TRY IT! What precedes (and what follows in the next subsection) is not a formal proof that $k$-means clustering will be unsuccessful here. The behavior of the algorithm is quite complex and depends, in particular, on the initialization and the density of points. Here, increasing the number of data points eventually leads to a much better performance. Explore this behavior on your own by modifying the code. (For some theoretical justifications (beyond this course), see here and here.)

NUMERICAL CORNER: We can check the theorem in a simulation. Here we pick $n$ points uniformly at random in the $d$-cube $\mathcal{C}$, for a range of dimensions up to dmax. We then plot the frequency of landing in the inscribed $d$-ball $\mathcal{B}$ and see that it rapidly converges to $0$. Alternatively, we could just plot the formula for the volume of $\mathcal{B}$. But knowing how to do simulations is useful in situations where explicit formulas are unavailable or intractable. We plot the result up to dimension $10$.
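
For reference (a standard formula, not stated explicitly in the original text), the exact probability of landing in $\mathcal{B}$ is the ratio of volumes of the inscribed ball of radius $1/2$ and the unit cube,

$$ \frac{\mathrm{vol}(\mathcal{B})}{\mathrm{vol}(\mathcal{C})} = \frac{\pi^{d/2}}{2^d \, \Gamma(d/2 + 1)}, $$

which indeed converges to $0$ rapidly as $d$ grows.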

In [55]:
dmax, n = 10, 1000

in_ball = np.zeros(dmax)
for d in range(dmax):
    in_ball[d] = np.mean([(LA.norm(rng.random(d+1) - 1/2) < 1/2) for _ in range(n)])
    
plt.plot(np.arange(1,dmax+1), in_ball, c='k') 
plt.show()