Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 6-Probabilistic models: from simple to complex
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: July 15, 2024
Copyright: © 2024 Sebastien Roch
$\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bfbeta}{\boldsymbol{\beta}}$ $\newcommand{\bflambda}{\boldsymbol{\lambda}}$ $\newcommand{\bgamma}{\boldsymbol{\gamma}}$ $\newcommand{\bsigma}{{\boldsymbol{\sigma}}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\btheta}{{\boldsymbol{\theta}}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$ $\renewcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bp}{\mathbf{p}}$ $\renewcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\bfg}{\mathbf{g}}$ $\newcommand{\bfh}{\mathbf{h}}$ $\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}$ $\renewcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\pa}{\mathrm{pa}}$ $\newcommand{\Z}{\mathcal{Z}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\cE}{\mathcal{E}}$ $\newcommand{\cP}{\mathcal{P}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bLambda}{\boldsymbol{\Lambda}}$ $\newcommand{\cov}{\mathrm{Cov}}$ $\newcommand{\bfk}{\mathbf{k}}$ $\newcommand{\idx}[1]{}$ $\newcommand{\xdi}{}$
Motivating example: tracking location¶
Suppose we let loose a cyborg corgi in a large park. We would like to know where it is at all times. For this purpose, it has an implanted location device that sends a signal to a tracking app.
Here is an example of the data we might have, through a simulation we will explain later on in this chapter. The dots are recorded locations at regular time intervals. The dotted line helps keep track of the time order of the recordings.
By convention, we start at $(0,0)$. Notice how squiggly the trajectory is. One issue might be that the times at which the location is recorded are too far apart. But, in fact, there is another issue: the tracking device is inaccurate.
To get a better estimate of the true trajectory, it is natural to try to model the noise in the measurement as well as the dynamics itself. Probabilistic models are perfectly suited for this.
In this chapter, we will encounter a variety of such models and show how to take advantage of them to estimate unknown states (or parameters). In particular, conditional independence will play a key role.
We will come back to location tracking later in the chapter.
Background: introduction to parametric families and maximum likelihood estimation¶
NUMERICAL CORNER: The following code, which plots the density in the bivariate case, was adapted from gauss_plot_2d.ipynb by ChatGPT.
CHAT & LEARN Ask your favorite AI chatbot to explain the code! Try different covariance matrices. $\ddagger$
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

def gaussian_pdf(X, Y, mean, cov):
    # evaluate the multivariate Gaussian density over a grid of points
    xy = np.stack([X.flatten(), Y.flatten()], axis=-1)
    return multivariate_normal.pdf(
        xy, mean=mean, cov=cov).reshape(X.shape)

def make_surface_plot(X, Y, Z):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    surf = ax.plot_surface(
        X, Y, Z, cmap=plt.cm.viridis, antialiased=False)
    plt.show()
We plot the density for mean $(0,0)$ with two different covariance matrices:
$$ \bSigma_1 = \begin{bmatrix} 1.0 & 0 \\ 0 & 1.0 \end{bmatrix} \quad \text{and} \quad \bSigma_2 = \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix} $$
where $\sigma_1 = 1.5$, $\sigma_2 = 0.5$ and $\rho = -0.75$.
start_point = 5
stop_point = 5
num_samples = 100
points = np.linspace(-start_point, stop_point, num_samples)
X, Y = np.meshgrid(points, points)
mean = np.array([0., 0.])
cov = np.array([[1., 0.], [0., 1.]])
make_surface_plot(X, Y, gaussian_pdf(X, Y, mean, cov))
mean = np.array([0., 0.])
cov = np.array([[1.5 ** 2., -0.75 * 1.5 * 0.5],
[-0.75 * 1.5 * 0.5, 0.5 ** 2.]])
make_surface_plot(X, Y, gaussian_pdf(X, Y, mean, cov))
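As a quick sanity check (our addition, not part of the original notebook), we can compare the SciPy-based density with the explicit bivariate formula $\frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{x_1^2}{\sigma_1^2} - \frac{2\rho x_1 x_2}{\sigma_1 \sigma_2} + \frac{x_2^2}{\sigma_2^2}\right]\right)$ at an arbitrarily chosen point.
# Sanity check (our addition): compare scipy's density with the explicit
# bivariate Gaussian formula for mean (0,0); the evaluation point is arbitrary.
sigma1, sigma2, rho = 1.5, 0.5, -0.75
x1, x2 = 0.3, -0.2
quad = (x1/sigma1)**2 - 2*rho*(x1/sigma1)*(x2/sigma2) + (x2/sigma2)**2
explicit = np.exp(-quad/(2*(1 - rho**2))) / (2*np.pi*sigma1*sigma2*np.sqrt(1 - rho**2))
scipy_val = multivariate_normal.pdf(
    [x1, x2], mean=[0., 0.],
    cov=[[sigma1**2, rho*sigma1*sigma2], [rho*sigma1*sigma2, sigma2**2]])
print(explicit, scipy_val)  # the two numbers should agree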
NUMERICAL CORNER: In Numpy, as we have seen before, the module numpy.random provides a way to sample from a variety of standard distributions. We first initialize the pseudorandom number generator$\idx{pseudorandom number generator}\xdi$ with a random seed. Recall that it allows the results to be reproducible: using the same seed produces the same results again.
seed = 535
rng = np.random.default_rng(seed)
Lists of the available probability distributions can be found in the numpy.random documentation. Here is an example with the binomial distribution.
p = 0.1
N = 5
print(rng.binomial(1, p, size=N))
[1 0 0 0 0]
Here are a few other examples.
p = [0.1, 0.2, 0.7]
n = 100
print(rng.multinomial(n, p, size=N))
[[ 9 12 79]
 [ 5 20 75]
 [13 18 69]
 [ 8 18 74]
 [ 8 24 68]]
mu = np.array([0.1, -0.3])
sig = np.array([[2., 0.],[0., 3.]])
print(rng.multivariate_normal(mu, sig, size=N))
[[-0.7275232   2.66555155]
 [ 0.45641186 -2.65834344]
 [ 1.13188325  0.43920735]
 [ 0.69846716  2.49891659]
 [ 0.91725117  1.89618733]]
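As a further check (our addition), we can draw a larger sample and compare the empirical mean and covariance to the parameters mu and sig; the sample size and the fresh seed below are arbitrary choices.
# Draw a larger sample with a fresh generator (arbitrary seed) and compare
# the empirical moments to mu and sig.
rng2 = np.random.default_rng(0)
Z = rng2.multivariate_normal(mu, sig, size=10000)
print(np.mean(Z, axis=0))       # should be close to [0.1, -0.3]
print(np.cov(Z, rowvar=False))  # should be close to [[2., 0.], [0., 3.]]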
Modeling more complex dependencies 1: using conditional independence¶
We implement the Naive Bayes model with Laplace smoothing.
We encode the data into a table, where the rows are the classes and the columns are the features. The entries are the corresponding $N_{k,m}$s. From this table, the function below also computes the vector $N_k$ of row sums and the value $N$, which is the sum of the entries of $N_k$.
def nb_fit_table(N_km, alpha=1., beta=1.):
    K, M = N_km.shape
    N_k = np.sum(N_km,axis=-1)
    N = np.sum(N_k)
    pi_k = (N_k+alpha) / (N+K*alpha)
    p_km = (N_km+beta) / (N_k[:,None]+2*beta)
    return pi_k, p_km
The next function computes the negative logarithm of $\pi_k \prod_{m=1}^M p_{k,m}^{x_m} (1-p_{k,m})^{1-x_m}$, that is, the score of $k$, and outputs a $k$ achieving the minimum score.
def nb_predict(pi_k, p_km, x, label_set):
    K = len(pi_k)
    score_k = np.zeros(K)
    for k in range(K):
        score_k[k] -= np.log(pi_k[k])
        score_k[k] -= np.sum(x * np.log(p_km[k,:])
                             + (1 - x)*np.log(1 - p_km[k,:]))
    return label_set[np.argmin(score_k, axis=0)]
NUMERICAL CORNER: We use a simple example from Towards Data Science:
Example: let’s say we have data on 1000 pieces of fruit. The fruit being a Banana, Orange or some other fruit and imagine we know 3 features of each fruit, whether it’s long or not, sweet or not and yellow or not, as displayed in the table below.
Fruit | Long | Sweet | Yellow | Total |
---|---|---|---|---|
Banana | 400 | 350 | 450 | 500 |
Orange | 0 | 150 | 300 | 300 |
Other | 100 | 150 | 50 | 200 |
Total | 500 | 650 | 800 | 1000 |
[...] Which should provide enough evidence to predict the class of another fruit as it’s introduced.
N_km = np.array([[400., 350., 450.],
[0., 150., 300.],
[100., 150., 50.]])
We run nb_fit_table on our simple dataset.
pi_k, p_km = nb_fit_table(N_km)
print(pi_k)
[0.61495136 0.23092678 0.15412186]
print(p_km)
[[0.33361065 0.29201331 0.37520799]
 [0.00221239 0.3340708  0.6659292 ]
 [0.33443709 0.5        0.16887417]]
Continuing on with our previous example:
So let’s say we’re given the features of a piece of fruit and we need to predict the class. If we’re told that the additional fruit is Long, Sweet and Yellow, we can classify it using the [prediction] formula and subbing in the values for each outcome, whether it’s a Banana, an Orange or Other Fruit. The one with the highest probability (score) being the winner.
We run nb_predict on our dataset with the additional fruit from the quote above.
label_set = ['Banana', 'Orange', 'Other']
x = np.array([1., 1., 1.])
nb_predict(pi_k, p_km, x, label_set)
'Banana'
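To see how decisive this prediction is, here is a small sketch (our addition) that turns the scores into posterior probabilities over the classes by normalizing; it reuses pi_k, p_km and x from above and mirrors the responsibility computation appearing later in this chapter.
# Posterior probabilities P(class | x): normalize the unnormalized
# class probabilities pi_k * prod_m p_km^{x_m} (1-p_km)^{1-x_m}.
score_k = - np.log(pi_k) - np.sum(x * np.log(p_km)
                                  + (1 - x) * np.log(1 - p_km), axis=-1)
post = np.exp(-score_k) / np.sum(np.exp(-score_k))
print(dict(zip(label_set, post)))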
Modeling more complex dependencies 2: marginalizing out an unobserved variable¶
EXAMPLE: (Gaussian mixture model) $\idx{Gaussian mixture model}\xdi$ For $i=1,\ldots,K$, let $\bmu_i$ and $\bSigma_i$ be the mean and covariance matrix of a multivariate Gaussian. Let $\bpi \in \Delta_K$. A Gaussian Mixture Model (GMM) is obtained as follows: take $Y \sim \mathrm{Cat}(\bpi)$ and
$$ \bX|\{Y=i\} \sim N_d(\bmu_i, \bSigma_i). $$
Its probability density function (PDF) takes the form
$$ f_\bX(\bx) = \sum_{i=1}^K \pi_i \frac{1}{(2\pi)^{d/2} \,|\bSigma_i|^{1/2}} \exp\left(-\frac{1}{2}(\bx - \bmu_i)^T \bSigma_i^{-1} (\bx - \bmu_i)\right). $$
$\lhd$
NUMERICAL CORNER: We plot the density for means $\bmu_1 = (-2,-2)$ and $\bmu_2 = (2,2)$ and covariance matrices
$$ \bSigma_1 = \begin{bmatrix} 1.0 & 0 \\ 0 & 1.0 \end{bmatrix} \quad \text{and} \quad \bSigma_2 = \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix} $$
where $\sigma_1 = 1.5$, $\sigma_2 = 0.5$ and $\rho = -0.75$. The mixing weights are $\pi_1 = 0.25$ and $\pi_2 = 0.75$.
from scipy.stats import multivariate_normal
def gmm2_pdf(X, Y, mean1, cov1, pi1, mean2, cov2, pi2):
    xy = np.stack([X.flatten(), Y.flatten()], axis=-1)
    Z1 = multivariate_normal.pdf(
        xy, mean=mean1, cov=cov1).reshape(X.shape)
    Z2 = multivariate_normal.pdf(
        xy, mean=mean2, cov=cov2).reshape(X.shape)
    return pi1 * Z1 + pi2 * Z2
start_point = 6
stop_point = 6
num_samples = 100
points = np.linspace(-start_point, stop_point, num_samples)
X, Y = np.meshgrid(points, points)
mean1 = np.array([-2., -2.])
cov1 = np.array([[1., 0.], [0., 1.]])
pi1 = 0.25
mean2 = np.array([2., 2.])
cov2 = np.array([[1.5 ** 2., -0.75 * 1.5 * 0.5],
[-0.75 * 1.5 * 0.5, 0.5 ** 2.]])
pi2 = 0.75
Z = gmm2_pdf(X, Y, mean1, cov1, pi1, mean2, cov2, pi2)
mmids.make_surface_plot(X, Y, Z)
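As a rough numerical check (our addition), the mixture density should integrate to approximately $1$; a Riemann sum over the plotting grid gives an estimate (slightly below $1$ because the grid is truncated to $[-6,6]^2$).
# Riemann-sum approximation of the integral of the density over the grid.
dx = points[1] - points[0]
print(np.sum(Z) * dx ** 2)  # should be close to 1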
In Numpy, as we have seen before, the module numpy.random also provides a way to sample from mixture models by using numpy.random.Generator.choice.
For instance, we consider mixtures of multivariate Gaussians. We change the notation slightly to track Python's indexing. For $i=0,1$, we have a mean $\bmu_i \in \mathbb{R}^d$ and a positive definite covariance matrix $\bSigma_i \in \mathbb{R}^{d \times d}$. We also have mixture weights $\phi_0, \phi_1 \in (0,1)$ such that $\phi_0 + \phi_1 = 1$. Suppose we want to generate a total of $n$ samples.
For each sample $j=1,\ldots, n$, independently from everything else:
We first pick a component $i \in \{0,1\}$ at random according to the mixture weights, that is, $i=0$ is chosen with probability $\phi_0$ and $i=1$ is chosen with probability $\phi_1$.
We generate a sample $\bX_j = (X_{j,1},\ldots,X_{j,d})$ according to a multivariate Gaussian with mean $\bmu_i$ and covariance $\bSigma_i$.
This is straightforward to implement by using again numpy.random.Generator.choice to choose the component of each sample and numpy.random.Generator.multivariate_normal to generate multivariate Gaussians. For convenience, we will stack the means and covariances into one array with a new dimension. So, for instance, the covariance matrices will now be in a 3d-array, that is, an array with $3$ indices. The first index corresponds to the component (here $0$ or $1$).
The code is the following. It returns an n by d array X, where each row is a sample from a 2-component Gaussian mixture.
def gmm2(rng, d, n, phi0, phi1, mu0, sigma0, mu1, sigma1):
    phi = np.stack((phi0, phi1))
    mu = np.stack((mu0, mu1))
    sigma = np.stack((sigma0,sigma1))
    X = np.zeros((n,d))
    component = rng.choice(2, size=n, p=phi)
    for i in range(n):
        X[i,:] = rng.multivariate_normal(
            mu[component[i],:],
            sigma[component[i],:,:])
    return X
NUMERICAL CORNER: Let us try it with the following parameters. We first define the covariance matrices and show what happens when they are stacked into a 3d array (as is done within gmm2).
d = 2
sigma0 = np.outer(np.array([2., 2.]), np.array([2., 2.]))
sigma0 += np.outer(np.array([-0.5, 0.5]), np.array([-0.5, 0.5]))
sigma1 = 2 * np.identity(d)
sigma = np.stack((sigma0,sigma1))
print(sigma[0,:,:])
[[4.25 3.75]
 [3.75 4.25]]
print(sigma[1,:,:])
[[2. 0.]
 [0. 2.]]
Then we define the rest of the parameters.
seed = 535
rng = np.random.default_rng(seed)
n, w = 200, 5.
phi0 = 0.8
phi1 = 0.2
mu0 = np.concatenate(([w], np.zeros(d-1)))
mu1 = np.concatenate(([-w], np.zeros(d-1)))
X = gmm2(rng, d, n, phi0, phi1, mu0, sigma0, mu1, sigma1)
plt.scatter(X[:,0], X[:,1], s=5, marker='o', c='k')
plt.axis('equal')
plt.show()
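Because the two component means are far apart in the first coordinate ($+w$ versus $-w$ with $w = 5$), a crude check of the mixing weights (our addition) is to count the fraction of samples falling on each side of zero.
# Crude estimate of phi0: fraction of samples with a positive first coordinate.
# This only works because the two components are well separated here.
print(np.mean(X[:, 0] > 0))  # should be close to phi0 = 0.8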
Example: Mixtures of multivariate Bernoullis and the EM algorithm¶
We implement the EM algorithm for mixtures of multivariate Bernoullis. For this purpose, we adapt our previous Naive Bayes routines. We also allow for the possibility of using Laplace smoothing.
def responsibility(pi_k, p_km, x):
    K = len(pi_k)
    score_k = np.zeros(K)
    for k in range(K):
        score_k[k] -= np.log(pi_k[k])
        score_k[k] -= np.sum(x * np.log(p_km[k,:])
                             + (1 - x) * np.log(1 - p_km[k,:]))
    r_k = np.exp(-score_k)/(np.sum(np.exp(-score_k)))
    return r_k

def update_parameters(eta_km, eta_k, eta, alpha, beta):
    K = len(eta_k)
    pi_k = (eta_k+alpha) / (eta+K*alpha)
    p_km = (eta_km+beta) / (eta_k[:,None]+2*beta)
    return pi_k, p_km
We implement the E and M steps next.
def em_bern(X, K, pi_0, p_0, maxiters = 10, alpha=0., beta=0.):
    n, M = X.shape
    pi_k = pi_0
    p_km = p_0
    for _ in range(maxiters):
        # E Step
        r_ki = np.zeros((K,n))
        for i in range(n):
            r_ki[:,i] = responsibility(pi_k, p_km, X[i,:])
        # M Step
        eta_km = np.zeros((K,M))
        eta_k = np.sum(r_ki, axis=-1)
        eta = np.sum(eta_k)
        for k in range(K):
            for m in range(M):
                eta_km[k,m] = np.sum(X[:,m] * r_ki[k,:])
        pi_k, p_km = update_parameters(
            eta_km, eta_k, eta, alpha, beta)
    return pi_k, p_km
NUMERICAL CORNER: We test the algorithm on a very simple dataset.
X = np.array([[1., 1., 1.],[1., 1., 1.],[1., 1., 1.],[1., 0., 1.],
[0., 1., 1.],[0., 0., 0.],[0., 0., 0.],[0., 0., 1.]])
n, M = X.shape
K = 2
pi_0 = np.ones(K)/K
p_0 = rng.random((K,M))
pi_k, p_km = em_bern(
X, K, pi_0, p_0, maxiters=100, alpha=0.01, beta=0.01)
print(pi_k)
[0.66500949 0.33499051]
print(p_km)
[[0.74982646 0.74982646 0.99800266]
 [0.00496739 0.00496739 0.25487292]]
We compute the probability that the vector $(0, 0, 1)$ is in each cluster.
x_test = np.array([0., 0., 1.])
print(responsibility(pi_k, p_km, x_test))
[0.32947702 0.67052298]
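A common diagnostic for EM (our addition) is the log-likelihood of the fitted mixture, $\sum_{i=1}^n \log \sum_{k=1}^K \pi_k \prod_{m=1}^M p_{k,m}^{x_{i,m}} (1-p_{k,m})^{1-x_{i,m}}$. On this small dataset it can be computed directly, without any special care about underflow.
# Log-likelihood of the fitted mixture on the small dataset
# (direct computation; safe here since there are only 3 features).
loglik = 0.
for i in range(n):
    like_k = pi_k * np.prod(p_km ** X[i,:] * (1 - p_km) ** (1 - X[i,:]), axis=-1)
    loglik += np.log(np.sum(like_k))
print(loglik)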
To give a more involved example, we use the MNIST dataset.
Quoting Wikipedia again:
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset.
NUMERICAL CORNER: We load it from PyTorch. The data can be accessed with torchvision.datasets.MNIST. The squeeze() below removes the color dimension in the image, which is grayscale. The numpy() converts the PyTorch tensors into Numpy arrays. See torch.utils.data.DataLoader for details on the data loading. We will say more about PyTorch in a later chapter.
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
mnist = datasets.MNIST(root='./data', train=True,
download=True, transform=transforms.ToTensor())
train_loader = DataLoader(mnist, batch_size=len(mnist), shuffle=False)
imgs, labels = next(iter(train_loader))
imgs = imgs.squeeze().numpy()
labels = labels.numpy()
We turn the grayscale images into black-and-white images by rounding the pixels. (Each pixel is originally an integer between $0$ and $255$; the ToTensor transform above has already rescaled it to $[0,1]$.)
imgs = np.round(imgs)
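A quick check (our addition) that the data has the expected shape -- 60,000 images of size 28 by 28, as in the quote above -- and that the pixels are now binary:
print(imgs.shape, labels.shape)  # expect (60000, 28, 28) and (60000,)
print(np.unique(imgs))           # expect [0. 1.]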
There are two common ways to write a $2$. Let's see if a mixture of multivariate Bernoullis can find them. We extract the images labelled $2$.
mask = labels == 2
imgs2 = imgs[mask]
labels2 = labels[mask]
The first image is the following.
plt.imshow(imgs2[0], cmap='gray_r')
plt.show()
Next, we transform the images into vectors.
X = imgs2.reshape(len(imgs2), -1)
We run the algorithm with $2$ clusters.
n, M = X.shape
K = 2
pi_0 = np.ones(K)/K
p_0 = rng.random((K,M))
pi_k, p_km = em_bern(
X, K, pi_0, p_0, maxiters=10, alpha=1., beta=1.)
print(pi_k)
[nan nan]
Uh-oh. Something went wrong. We encountered a numerical issue, underflow, which we discussed briefly previously. To confirm this, we run the code again but ask Python to warn us about it using numpy.seterr. (By default, warnings are turned off in the book, but they can be reactivated using warnings.resetwarnings.)
warnings.resetwarnings()
old_settings = np.seterr(all='warn')
pi_k, p_km = em_bern(
X, K, pi_0, p_0, maxiters=10, alpha=1., beta=1.)
/var/folders/k0/7k0fxl7j54q4k8dyqnrc6sz00000gr/T/ipykernel_87966/2844323350.py:10: RuntimeWarning: underflow encountered in exp
  r_k = np.exp(-score_k)/(np.sum(np.exp(-score_k)))
/var/folders/k0/7k0fxl7j54q4k8dyqnrc6sz00000gr/T/ipykernel_87966/2844323350.py:10: RuntimeWarning: invalid value encountered in divide
  r_k = np.exp(-score_k)/(np.sum(np.exp(-score_k)))
When we compute the responsibilities
$$ r_{k,i}^t = \frac{\pi_k^t \prod_{m=1}^M (p_{k,m}^t)^{x_{i,m}} (1-p_{k,m}^t)^{1-x_{i,m}}} {\sum_{k'=1}^K \pi_{k'}^t \prod_{m=1}^M (p_{k',m}^t)^{x_{i,m}} (1-p_{k',m}^t)^{1-x_{i,m}}}, $$
we first compute the negative logarithm of each term in the numerator, as we did in the Naive Bayes case. But we then apply the function $e^{-x}$, because this time we are not simply computing an optimal score. When all scores are high, this last step may result in underflow, that is, it produces numbers so small that they get rounded down to zero by Numpy. The ratio defining r_k is then not well-defined.
To deal with this, we introduce a technique called the log-sum-exp trick$\idx{log-sum-exp trick}\xdi$ (with some help from ChatGPT). Consider the computation of a function of $\mathbf{a} = (a_1, \ldots, a_n)$ of the form
$$ h(\mathbf{a}) = \log \left( \sum_{i=1}^{n} e^{-a_i} \right). $$
When the $a_i$ values are large positive numbers, the terms $e^{-a_i}$ can be so small that they underflow to zero. To counter this, the log-sum-exp trick involves a shift to bring these terms into a more favorable numerical range.
It proceeds as follows:
Identify the minimum value $M$ among the $a_i$s:
$$ M = \min\{a_1, a_2, \ldots, a_n\}. $$
Subtract $M$ from each $a_i$ before exponentiation:
$$ \log \left( \sum_{i=1}^{n} e^{-a_i} \right) = \log \left( e^{-M} \sum_{i=1}^{n} e^{- (a_i - M)} \right). $$
Rewrite using log properties:
$$ = -M + \log \left( \sum_{i=1}^{n} e^{-(a_i - M)} \right). $$
Why does this help with underflow? By subtracting $M$, the smallest value in the set, from each $a_i$: (i) the largest term in $\{e^{-(a_i - M)} : i = 1,\ldots,n\}$ becomes $e^0 = 1$; and (ii) all other terms are between 0 and 1, as they are exponentiations of negative numbers or zero. This manipulation avoids terms underflowing to zero because even very large values, when shifted by $M$, are less likely to hit the underflow threshold.
Here is an example. Imagine you have $\mathbf{a} = (1000, 1001, 1002)$.
Direct computation: $e^{-1000}$, $e^{-1001}$, and $e^{-1002}$ might all underflow to zero.
With the log-sum-exp trick: Subtract $M = 1000$, leading to $e^{0}$, $e^{-1}$, and $e^{-2}$, all meaningful, non-zero results that accurately contribute to the sum.
We implement it in Numpy.
def log_sum_exp_trick(a):
    min_val = np.min(a)
    return - min_val + np.log(np.sum(np.exp(- a + min_val)))
NUMERICAL CORNER: We try it on a simple example.
a = np.array([1000, 1001, 1002])
We first attempt a direct computation.
np.log(np.sum(np.exp(-a)))
/var/folders/k0/7k0fxl7j54q4k8dyqnrc6sz00000gr/T/ipykernel_87966/214275762.py:1: RuntimeWarning: underflow encountered in exp
  np.log(np.sum(np.exp(-a)))
/var/folders/k0/7k0fxl7j54q4k8dyqnrc6sz00000gr/T/ipykernel_87966/214275762.py:1: RuntimeWarning: divide by zero encountered in log
  np.log(np.sum(np.exp(-a)))
-inf
Predictably, we get an underflow error and a useless output.
Next, we try the log-sum-exp trick.
log_sum_exp_trick(a)
-999.5923940355556
This time we get an output which seems reasonable, something slightly larger than $-1000$ as expected (Why?).
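For comparison (our addition), SciPy provides scipy.special.logsumexp, which stably computes $\log\sum_i e^{b_i}$; applying it to $-\mathbf{a}$ should give the same answer as our function.
# Cross-check against SciPy's stable implementation: logsumexp(-a)
# computes log(sum(exp(-a))) using the same kind of shift.
from scipy.special import logsumexp
print(logsumexp(-a))  # should match log_sum_exp_trick(a)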
After this long -- but important! -- parenthesis, we return to the EM algorithm. We modify it by implementing the log-sum-exp trick in the subroutine responsibility.
def responsibility(pi_k, p_km, x):
    K = len(pi_k)
    score_k = np.zeros(K)
    for k in range(K):
        score_k[k] -= np.log(pi_k[k])
        score_k[k] -= np.sum(x * np.log(p_km[k,:])
                             + (1 - x) * np.log(1 - p_km[k,:]))
    r_k = np.exp(-score_k - log_sum_exp_trick(score_k))
    return r_k
NUMERICAL CORNER: We go back to the MNIST example with only the 2s.
pi_k, p_km = em_bern(X, K, pi_0, p_0, maxiters=10, alpha=1., beta=1.)
Here is one of the two estimated clusters, displayed as an image of its probabilities $p_{k,m}$.
plt.figure()
plt.imshow(p_km[0,:].reshape((28,28)))
plt.show()
Here is the other one.
plt.figure()
plt.imshow(p_km[1,:].reshape((28,28)))
plt.show()
Now that the model is trained, we compute the probability that an example image is in each cluster. We use the first image in the dataset that we plotted earlier.
responsibility(pi_k, p_km, X[0,:])
array([1.00000000e+00, 5.09357087e-17])
It indeed identifies the first cluster as significantly more likely.
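To get a sense of how the images split across the two clusters (our addition), we can assign each image to its most responsible cluster and compare the resulting fractions to pi_k.
# Hard cluster assignments: argmax of the responsibilities for each image.
assign = np.array([np.argmax(responsibility(pi_k, p_km, X[i,:]))
                   for i in range(n)])
print(np.bincount(assign, minlength=K) / n)  # compare with pi_k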
Application: linear-Gaussian models and Kalman filtering¶
We apply Kalman filtering to location tracking. Returning to our cyborg corgi example, we imagine that we get noisy observations about its successive positions in a park. (Think of GPS measurements.) We seek to get a better estimate of its location using the method above.
We model the true location as a linear-Gaussian system over the 2d position $(z_{1,t}, z_{2,t})_t$ and velocity $(\dot{z}_{1,t}, \dot{z}_{2,t})_t$ sampled at $\Delta$ intervals of time. Formally,
$$ \bX_t = (z_{1,t}, z_{2,t}, \dot{z}_{1,t}, \dot{z}_{2,t}), \quad F = \begin{pmatrix} 1 & 0 & \Delta & 0\\ 0 & 1 & 0 & \Delta\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}, $$
so the unobserved dynamics are
$$ \begin{pmatrix} z_{1,t+1}\\ z_{2,t+1}\\ \dot{z}_{1,t+1}\\ \dot{z}_{2,t+1} \end{pmatrix} = \bX_{t+1} = F \,\bX_t + \bW_t = \begin{pmatrix} z_{1,t} + \Delta \dot{z}_{1,t} + W_{1,t}\\ z_{2,t} + \Delta \dot{z}_{2,t} + W_{2,t}\\ \dot{z}_{1,t} + \dot{W}_{1,t}\\ \dot{z}_{2,t} + \dot{W}_{2,t} \end{pmatrix} $$
where the $\bW_t = (W_{1,t}, W_{2,t}, \dot{W}_{1,t}, \dot{W}_{2,t}) \sim N_{d_0}(\mathbf{0}, Q)$ with $Q$ known.
In words, the velocity is unchanged, up to Gaussian perturbation. The position changes proportionally to the velocity in the corresponding dimension.
The observations $(\tilde{z}_{1,t}, \tilde{z}_{2,t})_t$ are modeled as
$$ \bY_t = (\tilde{z}_{1,t}, \tilde{z}_{2,t}), \quad H = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{pmatrix}, $$
so the observed process satisfies
$$ \begin{pmatrix} \tilde{z}_{1,t}\\ \tilde{z}_{2,t} \end{pmatrix} = \bY_t = H\,\bX_t + \bV_t = \begin{pmatrix} z_{1,t} + \tilde{V}_{1,t}\\ z_{2,t} + \tilde{V}_{2,t} \end{pmatrix} $$
where the $\bV_t = (\tilde{V}_{1,t}, \tilde{V}_{2,t}) \sim N_d(\mathbf{0}, R)$ with $R$ known.
In words, we only observe the positions, up to Gaussian noise.
Implementing the Kalman filter
We implement the Kalman filter as described above with known covariance matrices. We take $\Delta = 1$ for simplicity. The code is adapted from [Mur].
We will test Kalman filtering on a simulated path drawn from the linear-Gaussian model above. The following function creates such a path and its noisy observations.
def lgSamplePath(rng, ss, os, F, H, Q, R, init_mu, init_Sig, T):
    x = np.zeros((ss,T))
    y = np.zeros((os,T))
    x[:,0] = rng.multivariate_normal(init_mu, init_Sig)
    for t in range(1,T):
        x[:,t] = rng.multivariate_normal(F @ x[:,t-1], Q)
        y[:,t] = rng.multivariate_normal(H @ x[:,t], R)
    return x, y
NUMERICAL CORNER: Here is an example. In the plots, the red dots are the noisy observations and the green dotted line is the unobserved true path.
seed = 535
rng = np.random.default_rng(seed)
ss = 4 # state size
os = 2 # observation size
F = np.array([[1., 0., 1., 0.],
[0., 1., 0., 1.],
[0., 0., 1., 0.],
[0., 0., 0., 1.]])
H = np.array([[1., 0., 0., 0.],
[0., 1, 0., 0.]])
Q = 0.1 * np.diag(np.ones(ss))
R = 10 * np.diag(np.ones(os))
init_mu = np.array([0., 0., 1., 1.])
init_Sig = 1 * np.diag(np.ones(ss))
T = 50
x, y = lgSamplePath(rng, ss, os, F, H, Q, R, init_mu, init_Sig, T)
In the next plot (and throughout this section), the dots are the noisy observations. The unobserved true path is also shown as a dotted line.
plt.scatter(y[0,:], y[1,:], s=5, c='r', alpha=0.5)
plt.plot(x[0,:], x[1,:], c='g', linestyle='dotted')
plt.xlim((np.min(y[0,:])-5, np.max(y[0,:])+5))
plt.ylim((np.min(y[1,:])-5, np.max(y[1,:])+5))
plt.show()
The following function implements the Kalman filter. The full recursion is broken up into several steps. We use numpy.linalg.inv to compute the Kalman gain matrix. Below, mu_pred is $F \bmu_{t-1}$ and Sig_pred is $P_{t-1} = F \bSigma_{t-1} F^T + Q$, which are the mean vector and covariance matrix of $\bX_{t}$ given $\bY_{1:t-1}$ as computed in the Predict step.
def kalmanUpdate(ss, F, H, Q, R, y_t, mu_prev, Sig_prev):
    mu_pred = F @ mu_prev
    Sig_pred = F @ Sig_prev @ F.T + Q
    e_t = y_t - H @ mu_pred
    S = H @ Sig_pred @ H.T + R
    Sinv = LA.inv(S)
    K = Sig_pred @ H.T @ Sinv
    mu_new = mu_pred + K @ e_t
    Sig_new = (np.diag(np.ones(ss)) - K @ H) @ Sig_pred
    return mu_new, Sig_new
def kalmanFilter(ss, os, y, F, H, Q, R, init_mu, init_Sig, T):
    mu = np.zeros((ss, T))
    Sig = np.zeros((ss, ss, T))
    mu[:,0] = init_mu
    Sig[:,:,0] = init_Sig
    for t in range(1,T):
        mu[:,t], Sig[:,:,t] = kalmanUpdate(
            ss, F, H, Q, R, y[:,t], mu[:,t-1], Sig[:,:,t-1])
    return mu, Sig
NUMERICAL CORNER: We apply this to the location tracking example. The inferred states, or more precisely their estimated mean, are in blue. Note that we also inferred the velocity at each time point, but we are not plotting that information.
init_mu = np.array([0., 0., 1., 1.])
init_Sig = 1 * np.diag(np.ones(ss))
mu, Sig = kalmanFilter(ss, os, y, F, H, Q, R, init_mu, init_Sig, T)
plt.plot(mu[0,:], mu[1,:], c='b', marker='s', markersize=2, linewidth=1)
plt.scatter(y[0,:], y[1,:], s=5, c='r', alpha=0.5)
plt.plot(x[0,:], x[1,:], c='g', linestyle='dotted', alpha=0.5)
plt.xlim((np.min(y[0,:])-5, np.max(y[0,:])+5))
plt.ylim((np.min(y[1,:])-5, np.max(y[1,:])+5))
plt.show()
To quantify the improvement in the inferred means compared to the observations, we compute the error in the first position coordinate, measured as the Euclidean norm of its difference with the true path, in both cases.
dobs = x[0:1,:] - y[0:1,:]
mse_obs = np.sqrt(np.sum(dobs**2))
print(mse_obs)
22.891982252201856
dfilt = x[0:1,:] - mu[0:1,:]
mse_filt = np.sqrt(np.sum(dfilt**2))
print(mse_filt)
9.778610100463018
We indeed observe a substantial reduction.
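As a variant (our addition), the same comparison can be carried out over both position coordinates at once; the filtered estimate should again come out ahead.
# Same comparison using both position coordinates (first two rows of x).
err_obs = np.sqrt(np.sum((x[0:2,:] - y) ** 2))
err_filt = np.sqrt(np.sum((x[0:2,:] - mu[0:2,:]) ** 2))
print(err_obs, err_filt)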