Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 6-Optimization theory and algorithms
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: April 12, 2024
Copyright: © 2024 Sebastien Roch
Figure: Helpful map of ML by scikit-learn (Source)
$\bowtie$
We now turn to classification.
Quoting Wikipedia:
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available.
We will illustrate this problem on the MNIST dataset. Quoting Wikipedia again:
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels. The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset.
Figure: MNIST sample images (Source)
$\bowtie$
We first load the data and convert it to an appropriate matrix representation. The data can be accessed with torchvision.datasets.MNIST.
import torch
from torchvision import datasets, transforms
# Download and load the MNIST dataset
mnist = datasets.MNIST(root='./data',
                       train=True,
                       download=True,
                       transform=transforms.ToTensor())
# Convert the dataset to a PyTorch DataLoader
train_loader = torch.utils.data.DataLoader(mnist,
                                           batch_size=len(mnist),
                                           shuffle=False)
# Extract images and labels from the DataLoader
imgs, labels = next(iter(train_loader))
imgs = imgs.squeeze().numpy()
labels = labels.numpy()
The squeeze() above removes the color dimension of the images, which are grayscale. The numpy() method converts the PyTorch tensors into NumPy arrays. See torch.utils.data.DataLoader for details on the data loading. We will say more about PyTorch later in this chapter.
For example, the first image and its label are:
plt.figure()
plt.imshow(imgs[0])
plt.show()
labels[0]
5
For now, we look at a subset of the samples: the 0's and 1's.
# Filter out images with labels 0 and 1
mask = (labels == 0) | (labels == 1)
imgs01 = imgs[mask]
labels01 = labels[mask]
In this new dataset, the first sample is:
plt.figure()
plt.imshow(imgs01[0])
plt.show()
labels01[0]
0
Next, we transform the images into vectors. For this we use the reshape() function, which changes the dimensions of an array without changing its data. Here the first dimension, which runs across the samples, remains of length len(imgs01). The -1 is understood to be whatever is needed to "fit" the remaining dimensions, here $28 \times 28 = 784$. In other words, we are effectively "flattening" each $28 \times 28$ image into a single vector of length $784$.
imgs01.shape
(12665, 28, 28)
X = imgs01.reshape(len(imgs01), -1)
X.shape
(12665, 784)
y = labels01
The input data is now of the form $\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}$ where $\mathbf{x}_i \in \mathbb{R}^d$ are the features and $y_i \in \{0,1\}$ is the label. Above we use the matrix representation $X \in \mathbb{R}^{n \times d}$ with rows $\mathbf{x}_i^T$, $i = 1,\ldots, n$ and $\mathbf{y} = (y_1, \ldots, y_n)^T \in \{0,1\}^n$.
Our goal:
learn a classifier from the examples $\{(\mathbf{x}_i, y_i) : i=1,\ldots, n\}$, that is, a function $\hat{f} : \mathbb{R}^d \to \mathbb{R}$ such that $\hat{f}(\mathbf{x}_i) \approx y_i$.
We may want to enforce that the output is in $\{0,1\}$ as well. This problem is referred to as binary classification.
A natural approach to this type of supervised learning problem is to define two objects:
Family of classifiers: A class $\widehat{\mathcal{F}}$ of classifiers from which to pick $\hat{f}$.
Loss function: A loss function $\ell(\hat{f}, (\mathbf{x},y))$ which quantifies how good of a fit $\hat{f}(\mathbf{x})$ is to $y$.
Our goal is then to solve
$$ \min_{\hat{f} \in \widehat{\mathcal{F}}} \frac{1}{n} \sum_{i=1}^n \ell(\hat{f}, (\mathbf{x}_i, y_i)), $$that is, we seek to find a classifier among $\widehat{\mathcal{F}}$ that minimizes the average loss over the examples.
For instance, in logistic regression, we consider linear classifiers of the form
$$ \hat{f}(\mathbf{x}) = \sigma(\mathbf{x}^T \boldsymbol{\theta}) \qquad \text{with} \qquad \sigma(t) = \frac{1}{1 + e^{-t}} $$where $\boldsymbol{\theta} \in \mathbb{R}^d$ is a parameter vector. And we use the cross-entropy loss
$$ \ell(\hat{f}, (\mathbf{x}, y)) = - y \log(\sigma(\mathbf{x}^T \boldsymbol{\theta})) - (1-y) \log(1- \sigma(\mathbf{x}^T \boldsymbol{\theta})). $$In parametric form, the problem boils down to
$$ \min_{\boldsymbol{\theta} \in \mathbb{R}^d} - \frac{1}{n} \sum_{i=1}^n y_i \log(\sigma(\mathbf{x}_i^T \boldsymbol{\theta})) - \frac{1}{n} \sum_{i=1}^n (1-y_i) \log(1- \sigma(\mathbf{x}_i^T \boldsymbol{\theta})). $$To obtain a prediction in $\{0,1\}$ here, we could cut off $\hat{f}(\mathbf{x})$ at a threshold $\tau \in [0,1]$, that is, return $\mathbf{1}\{\hat{f}(\mathbf{x}) > \tau\}$.
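To make the thresholding step concrete, here is a minimal sketch with a made-up parameter vector $\boldsymbol{\theta}$ and threshold $\tau = 1/2$ (the numbers are for illustration only, not fitted to any data):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

theta = np.array([1.0, -2.0])        # hypothetical parameters
X = np.array([[0.5, 0.1],            # three feature vectors x_i as rows
              [0.0, 1.0],
              [2.0, 0.2]])

probs = sigmoid(X @ theta)           # \hat{f}(x_i) = sigma(x_i^T theta)
preds = (probs > 0.5).astype(int)    # 1{ \hat{f}(x) > tau } with tau = 1/2
print(preds)  # [1 0 1]
```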
We will explain in a later chapter where this choice comes from.
The purpose of this chapter is to develop some of the mathematical theory and algorithms needed to solve this type of optimization formulation.
We illustrate the use of automatic differentiation to compute gradients.
Quoting Wikipedia:
In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. Automatic differentiation is distinct from symbolic differentiation and numerical differentiation (the method of finite differences). Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation.
We will use PyTorch. It uses tensors, which in many ways behave similarly to NumPy arrays. See here for a quick introduction. Here is an example. We first initialize the tensors at the values where the derivatives will be computed; here each corresponds to a single real variable. With the option requires_grad=True, we indicate that these are variables with respect to which a gradient will be taken later.
# Initialize variables
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
The function .backward() computes the gradient using backpropagation, to which we will return later. The partial derivatives are accessed with .grad.
# Perform automatic differentiation
f = 3 * x**2 + torch.exp(x) + y
f.backward() # Compute gradients
# Print gradients
print(x.grad) # df/dx
print(y.grad) # df/dy
tensor(8.7183)
tensor(1.)
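As a sanity check on the printed values: the partials computed by hand are $\partial f/\partial x = 6x + e^x$ and $\partial f/\partial y = 1$, and a centered finite difference at $(x, y) = (1, 2)$ agrees with the autograd output. A minimal sketch in plain Python:

```python
import math

# Same function as above, as a plain Python function
def f(x, y):
    return 3 * x**2 + math.exp(x) + y

# Centered finite differences at (1, 2)
h = 1e-5
dfdx = (f(1.0 + h, 2.0) - f(1.0 - h, 2.0)) / (2 * h)
dfdy = (f(1.0, 2.0 + h) - f(1.0, 2.0 - h)) / (2 * h)

print(round(dfdx, 4))  # 8.7183, i.e., 6 + e
print(round(dfdy, 4))  # 1.0
```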
The input parameters can also be vectors, which allows us to consider functions of a large number of variables.
# New variables for the second example
z = torch.tensor([1., 2., 3.], requires_grad=True)
# Perform automatic differentiation
g = torch.sum(z**2)
g.backward() # Compute gradients
# Print gradient
print(z.grad) # gradient is (2 z_1, 2 z_2, 2 z_3)
tensor([2., 4., 6.])
Here is another typical example in a data science context.
# Variables for the third example
X = torch.randn(3, 2) # Random dataset (features)
y = torch.tensor([[1.], [0.], [1.]]) # Dataset (labels); column shape (3, 1) to match the predictions below
theta = torch.ones(2, 1, requires_grad=True) # Parameter assignment
# Perform automatic differentiation
predict = X @ theta # Classifier with parameter vector theta
loss = torch.sum((predict - y)**2) # Loss function
loss.backward() # Compute gradients
# Print gradient
print(theta.grad) # gradient of loss
tensor([[ 2.1983],
        [40.7574]])
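For a sum-of-squares loss like this one, the gradient has the closed form $2 X^T(X\boldsymbol{\theta} - \mathbf{y})$ (with $\mathbf{y}$ a column vector). A small NumPy check with made-up numbers, independent of the random data above:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, -1.0]])             # 3 samples, 2 features
y = np.array([[1.0], [0.0], [1.0]])
theta = np.ones((2, 1))

grad_closed = 2 * X.T @ (X @ theta - y)  # closed-form gradient of the loss

# Finite-difference check of the first coordinate
def loss(th):
    return float(np.sum((X @ th - y)**2))

h = 1e-6
th_p = theta.copy(); th_p[0, 0] += h
th_m = theta.copy(); th_m[0, 0] -= h
fd = (loss(th_p) - loss(th_m)) / (2 * h)

print(grad_closed.ravel())  # [ 2. 12.]
print(round(fd, 6))         # 2.0
```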
$\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$
EXAMPLE: Consider $f(x) = e^x$. Then $f'(x) = f''(x) = e^x$. Suppose we are interested in approximating $f$ in the interval $[0,1]$. We take $a=0$ and $b=1$ in Taylor's Theorem. The linear term is
$$ f(a) + (x-a) f'(a) = 1 + x e^0 = 1 + x. $$Then for any $x \in [0,1]$
$$ f(x) = 1 + x + \frac{1}{2}x^2 e^{\xi_x} $$where $\xi_x \in (0,1)$ depends on $x$. We get a uniform bound on the error over $[0,1]$ by replacing $\xi_x$ with its worst possible value over $[0,1]$
$$ |f(x) - (1+x)| \leq \frac{1}{2}x^2 e^{\xi_x} \leq \frac{e}{2} x^2. $$
x = np.linspace(0,1,100)
y = np.exp(x)
taylor = 1 + x
err = (np.exp(1)/2) * x**2
plt.plot(x,y,label='f')
plt.plot(x,taylor,label='taylor')
plt.legend()
plt.show()
If we plot the upper and lower bounds, we see that $f$ indeed falls within them.
plt.plot(x,y,label='f')
plt.plot(x,taylor,label='taylor')
plt.plot(x,taylor-err,linestyle=':',color='green',label='lower')
plt.plot(x,taylor+err,linestyle='--',color='green',label='upper')
plt.legend()
plt.show()
EXAMPLE: Let $f(x) = x^3$. Then $f'(x) = 3 x^2$ and $f''(x) = 6 x$ so that $f'(0) = 0$ and $f''(0) \geq 0$. Hence $x=0$ is a stationary point. But $x=0$ is not a local minimizer. Indeed $f(0) = 0$ but, for any $\delta > 0$, $f(-\delta) < 0$.
x = np.linspace(-2,2,100)
y = x**3
plt.plot(x,y)
plt.ylim(-5,5)
plt.show()
EXAMPLE: If we want to minimize $2 x_1^2 + 3 x_2^2$ over all two-dimensional unit vectors $\mathbf{x} = (x_1, x_2)$, then we can let
$$ f(\mathbf{x}) = 2 x_1^2 + 3 x_2^2 $$and
$$ h_1(\mathbf{x}) = 1 - x_1^2 - x_2^2 = 1 - \|\mathbf{x}\|^2. $$Observe that we could have chosen a different equality constraint to express the same minimization problem. $\lhd$
EXAMPLE: (continued) Returning to the previous example,
$$ \nabla f(\mathbf{x}) = \left( \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2} \right) = (4 x_1, 6 x_2) $$and
$$ \nabla h_1(\mathbf{x}) = \left( \frac{\partial h_1(\mathbf{x})}{\partial x_1}, \frac{\partial h_1(\mathbf{x})}{\partial x_2} \right) = (- 2 x_1, - 2 x_2). $$The conditions in the theorem read
\begin{align*} &4 x_1 - 2 \lambda_1 x_1 = 0\\ &6 x_2 - 2 \lambda_1 x_2 = 0. \end{align*}The constraint $x_1^2 + x_2^2 = 1$ must also be satisfied. Observe that the linear independence condition is automatically satisfied since there is only one constraint.
There are several cases to consider.
1- If neither $x_1$ nor $x_2$ is $0$, then the first equation gives $\lambda_1 = 2$ while the second one gives $\lambda_1 = 3$. So that case cannot happen.
2- If $x_1 = 0$, then $x_2 = 1$ or $x_2 = -1$ by the constraint and the second equation gives $\lambda_1 = 3$ in either case.
3- If $x_2 = 0$, then $x_1 = 1$ or $x_1 = -1$ by the constraint and the first equation gives $\lambda_1 = 2$ in either case.
Does any of these last four solutions, i.e., $(x_1,x_2,\lambda_1) = (0,1,3)$, $(x_1,x_2,\lambda_1) = (0,-1,3)$, $(x_1,x_2,\lambda_1) = (1,0,2)$ and $(x_1,x_2,\lambda_1) = (-1,0,2)$, actually correspond to a local minimizer?
This problem can be solved manually. Indeed, replace $x_2^2 = 1 - x_1^2$ into the objective function to obtain
$$ 2 x_1^2 + 3(1 - x_1^2) = -x_1^2 + 3. $$This is minimized for the largest value that $x_1^2$ can take, namely when $x_1 = 1$ or $x_1 = -1$. Indeed, we must have $0 \leq x_1^2 \leq x_1^2 + x_2^2 = 1$. So both $(x_1, x_2) = (1,0)$ and $(x_1, x_2) = (-1,0)$ are global minimizers. A fortiori, they must be local minimizers.
What about $(x_1,x_2) = (0,1)$ and $(x_1,x_2) = (0,-1)$? Arguing as above, they in fact correspond to global maximizers of the objective function. $\lhd$
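The case analysis above can be double-checked numerically by evaluating the objective at the four candidate points (a quick sketch, with the objective hard-coded):

```python
# Objective from the example
def f(x1, x2):
    return 2 * x1**2 + 3 * x2**2

# The four candidate points found via the Lagrangian conditions
candidates = [(0, 1), (0, -1), (1, 0), (-1, 0)]
values = {pt: f(*pt) for pt in candidates}
print(values)  # {(0, 1): 3, (0, -1): 3, (1, 0): 2, (-1, 0): 2}
```

The minimum value $2$ is attained at $(\pm 1, 0)$, matching the manual solution.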
EXAMPLE: (continued) Returning to the previous example, the points satisfying $h_1(\mathbf{x}) = 0$ sit on the circle of radius $1$ around the origin. We have already seen that
$$ \nabla h_1(\mathbf{x}) = \left( \frac{\partial h_1(\mathbf{x})}{\partial x_1}, \frac{\partial h_1(\mathbf{x})}{\partial x_2} \right) = (- 2 x_1, - 2 x_2). $$
Here is code plotting these (courtesy of ChatGPT 4). It uses numpy.meshgrid to generate a grid of points for $x_1$ and $x_2$, and matplotlib.pyplot.contour to plot the constraint set as a contour line (for the constant value $0$) of $h_1$. The gradients are plotted with the matplotlib.pyplot.quiver function, which is used for plotting vectors as arrows.
# Define the constraint function
def h1(x1, x2):
    return 1 - x1**2 - x2**2
# Generate a grid of points for x1 and x2
x1 = np.linspace(-1.5, 1.5, 400)
x2 = np.linspace(-1.5, 1.5, 400)
X1, X2 = np.meshgrid(x1, x2)
# Compute constraint function on grid
H1 = h1(X1, X2)
# Points on the constraint where the gradients will be plotted
points = [
    (0.5, np.sqrt(3)/2),
    (-0.5, np.sqrt(3)/2),
    (0.5, -np.sqrt(3)/2),
    (-0.5, -np.sqrt(3)/2),
    (1, 0),
    (-1, 0),
    (0, 1),
    (0, -1)
]
plt.figure(figsize=(8, 6))
plt.grid(True)
plt.axis('equal')
# Plot the constraint set where h1(x1, x2) = 0
plt.contour(X1, X2, H1, levels=[0], colors='blue')
# Plot gradients of h1 (red) at specified points
for x1, x2 in points:
    plt.quiver(x1, x2, -2*x1, -2*x2, scale=10, color='red')
plt.show()
plt.figure(figsize=(8, 6))
plt.grid(True)
plt.axis('equal')
plt.contour(X1, X2, H1, levels=[0], colors='blue')
# Plot normalized gradients of h1 (red) and of f (green) at specified points
for x1, x2 in points:
    plt.quiver(x1, x2, -x1/np.sqrt(x1**2 + x2**2),
               -x2/np.sqrt(x1**2 + x2**2),
               scale=10, color='red')
    plt.quiver(x1, x2, 4*x1/np.sqrt(16 * x1**2 + 36 * x2**2),
               6*x2/np.sqrt(16 * x1**2 + 36 * x2**2),
               scale=10, color='green')
plt.show()
$\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bsigma}{\boldsymbol{\sigma}}$
NUMERICAL CORNER: We implement gradient descent in Python. We assume that a function f and its gradient grad_f are provided. We first code the basic steepest descent step with step size $\alpha =$ alpha.
def desc_update(grad_f, x, alpha):
    return x - alpha*grad_f(x)

def gd(f, grad_f, x0, alpha=1e-3, niters=int(1e6)):
    xk = x0
    for _ in range(niters):
        xk = desc_update(grad_f, xk, alpha)
    return xk, f(xk)
We illustrate on a simple example.
def f(x):
    return (x-1)**2 + 10
xgrid = np.linspace(-5,5,100)
plt.plot(xgrid, f(xgrid))
plt.show()
def grad_f(x):
    return 2*(x-1)
gd(f, grad_f, 0)
(0.9999999999999722, 10.0)
We found a global minimizer in this case.
The next example shows that a different local minimizer may be reached depending on the starting point.
def f(x):
    return 4 * (x-1)**2 * (x+1)**2 - 2*(x-1)
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.ylim((-1,10))
plt.legend()
plt.show()
CLICK ON TARGET: If we start gradient descent from $-2$, where will it converge? $\ddagger$
def grad_f(x):
    return 8 * (x-1) * (x+1)**2 + 8 * (x-1)**2 * (x+1) - 2
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-10,10))
plt.legend()
plt.show()
gd(f, grad_f, 0)
(1.057453770738375, -0.0590145651028224)
gd(f, grad_f, -2)
(-0.9304029265558538, 3.933005966859003)
In the final example, we end up at a stationary point that is not a local minimizer. Here both the first and second derivatives are zero. This is known as a saddle point.
def f(x):
    return x**3
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.ylim((-10,10))
plt.legend()
plt.show()
def grad_f(x):
    return 3 * x**2
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.plot(xgrid, grad_f(xgrid), label='grad_f')
plt.ylim((-10,10))
plt.legend()
plt.show()
gd(f, grad_f, 2)
(0.00033327488712690107, 3.701755838398568e-11)
gd(f, grad_f, -2, niters=100)
(-4.93350410883896, -120.0788396909241)
NUMERICAL CORNER: We revisit our first simple single-variable example.
def f(x):
    return (x-1)**2 + 10
xgrid = np.linspace(-5,5,100)
plt.plot(xgrid, f(xgrid))
plt.show()
Recall that the first derivative is:
def grad_f(x):
    return 2*(x-1)
So the second derivative is $f''(x) = 2$. Hence, this $f$ is $L$-smooth and $m$-strongly convex with $L = m = 2$. The theory we developed suggests taking step size $\alpha_t = \alpha = 1/L = 1/2$. It also implies that
$$ f(x^1) - f(x^*) \leq \left(1 - \frac{m}{L}\right) [f(x^0) - f(x^*)] = 0. $$We converge in one step! And that holds for any starting point $x^0$.
Let's try this!
gd(f, grad_f, 0, alpha=0.5, niters=1)
(1.0, 10.0)
Let's try a different starting point.
gd(f, grad_f, 100, alpha=0.5, niters=1)
(1.0, 10.0)
We return to logistic regression, which we alluded to in the motivating example of this chapter.
The input data is of the form $\{(\boldsymbol{\alpha}_i, b_i) : i=1,\ldots, n\}$ where $\boldsymbol{\alpha}_i = (\alpha_{i,1}, \ldots, \alpha_{i,d}) \in \mathbb{R}^d$ are the features and $b_i \in \{0,1\}$ is the label. As before we use a matrix representation: $A \in \mathbb{R}^{n \times d}$ has rows $\boldsymbol{\alpha}_i^T$, $i = 1,\ldots, n$ and $\mathbf{b} = (b_1, \ldots, b_n) \in \{0,1\}^n$.
Logistic model. We summarize the logistic regression approach. Our goal is to find a function of the features that approximates the probability of the label $1$. For this purpose, we model the log-odds (or logit function) of the probability of label $1$ as a linear function of the features $\boldsymbol{\alpha} \in \mathbb{R}^d$
$$ \log \frac{p(\boldsymbol{\alpha}; \mathbf{x})}{1-p(\boldsymbol{\alpha}; \mathbf{x})} = \boldsymbol{\alpha}^T \mathbf{x} $$where $\mathbf{x} \in \mathbb{R}^d$ is the vector of coefficients (i.e., parameters). Inverting this expression gives
$$ p(\boldsymbol{\alpha}; \mathbf{x}) = \sigma(\boldsymbol{\alpha}^T \mathbf{x}) $$where the sigmoid function is
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$for $z \in \mathbb{R}$.
We plot the sigmoid function.
def sigmoid(z):
    return 1/(1+np.exp(-z))
grid = np.linspace(-5, 5, 100)
plt.plot(grid,sigmoid(grid),'r')
plt.show()
We seek to maximize the probability of observing the data (also known as the likelihood function), assuming the labels are independent given the features, which is given by
$$ \mathcal{L}(\mathbf{x}; A, \mathbf{b}) = \prod_{i=1}^n p(\boldsymbol{\alpha}_i; \mathbf{x})^{b_i} (1- p(\boldsymbol{\alpha}_i; \mathbf{x}))^{1-b_i} $$Taking a logarithm, multiplying by $-1/n$ and substituting the sigmoid function, we want to minimize the cross-entropy loss
$$ \ell(\mathbf{x}; A, \mathbf{b}) = \frac{1}{n} \sum_{i=1}^n \left\{- b_i \log(\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) - (1-b_i) \log(1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\right\}. $$We used standard properties of the logarithm: for $x, y > 0$, $\log(xy) = \log x + \log y$ and $\log(x^y) = y \log x$.
Hence, we want to solve the minimization problem
$$ \min_{\mathbf{x} \in \mathbb{R}^d} \ell(\mathbf{x}; A, \mathbf{b}). $$We are implicitly using here that the logarithm is a strictly increasing function and therefore does not change the global maximum of a function. Multiplying by $-1$ changes the global maximum into a global minimum.
To use gradient descent, we need the gradient of $\ell$. We use the Chain Rule and first compute the derivative of $\sigma$ which is
$$ \sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right) = \sigma(z) (1 - \sigma(z)). $$The latter expression is known as the logistic differential equation. It arises in a variety of applications, including the modeling of population dynamics. Here it will be a convenient way to compute the gradient.
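The identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ can be checked numerically at a few points (a quick sketch using a centered finite difference):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.5])
h = 1e-5
fd = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # numerical sigma'(z)
identity = sigmoid(z) * (1 - sigmoid(z))           # sigma(z)(1 - sigma(z))
print(np.max(np.abs(fd - identity)))               # essentially zero
```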
Observe that, for $\boldsymbol{\alpha} = (\alpha_{1}, \ldots, \alpha_{d}) \in \mathbb{R}^d$, by the Chain Rule
$$ \nabla\sigma(\boldsymbol{\alpha}^T \mathbf{x}) = \sigma'(\boldsymbol{\alpha}^T \mathbf{x}) \nabla (\boldsymbol{\alpha}^T \mathbf{x}) = \sigma'(\boldsymbol{\alpha}^T \mathbf{x}) \boldsymbol{\alpha} $$where, throughout, the gradient is with respect to $\mathbf{x}$.
Alternatively, we can obtain the same formula by applying the single-variable Chain Rule
\begin{align*} \frac{\partial}{\partial x_j} \sigma(\boldsymbol{\alpha}^T \mathbf{x}) &= \sigma'(\boldsymbol{\alpha}^T \mathbf{x}) \frac{\partial}{\partial x_j}(\boldsymbol{\alpha}^T \mathbf{x})\\ &= \sigma'(\boldsymbol{\alpha}^T \mathbf{x}) \frac{\partial}{\partial x_j}\left(\alpha_{j} x_{j} + \sum_{\ell=1, \ell \neq j}^d \alpha_{\ell} x_{\ell}\right)\\ &= \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \alpha_{j} \end{align*}so that
\begin{align*} \nabla\sigma(\boldsymbol{\alpha}^T \mathbf{x}) &= \left(\sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \alpha_{1}, \ldots, \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \alpha_{d}\right)\\ &= \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, (\alpha_{1}, \ldots, \alpha_{d})\\ &= \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \boldsymbol{\alpha}. \end{align*}By another application of the Chain Rule, since $\frac{\mathrm{d}}{\mathrm{d} z} \log z = \frac{1}{z}$,
\begin{align*} &\nabla\ell(\mathbf{x}; A, \mathbf{b})\\ &= \nabla\left[\frac{1}{n} \sum_{i=1}^n \left\{- b_i \log(\sigma(\boldsymbol{\alpha_i}^T \mathbf{x})) - (1-b_i) \log(1- \sigma(\boldsymbol{\alpha_i}^T \mathbf{x}))\right\}\right]\\ &= - \frac{1}{n} \sum_{i=1}^n \frac{b_i}{\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \nabla\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) - \frac{1}{n} \sum_{i=1}^n \frac{1-b_i}{1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \nabla(1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\\ &= - \frac{1}{n} \sum_{i=1}^n \frac{b_i}{\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \nabla\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) + \frac{1}{n} \sum_{i=1}^n \frac{1-b_i}{1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \nabla\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}). \end{align*}Using the expression for the gradient of the sigmoid functions, this is
\begin{align*} &= - \frac{1}{n} \sum_{i=1}^n \frac{b_i}{\sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) \,\boldsymbol{\alpha}_i\\ &\quad\quad + \frac{1}{n} \sum_{i=1}^n \frac{1-b_i}{1- \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})} \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) \,\boldsymbol{\alpha}_i\\ &= - \frac{1}{n} \sum_{i=1}^n \left( b_i (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x})) - (1-b_i)\sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) \right)\,\boldsymbol{\alpha}_i\\ &= - \frac{1}{n} \sum_{i=1}^n ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) ) \,\boldsymbol{\alpha}_i. \end{align*}To implement this formula below, it will be useful to re-write it in terms of the matrix representation $A \in \mathbb{R}^{n \times d}$ (which has rows $\boldsymbol{\alpha}_i^T$, $i = 1,\ldots, n$) and $\mathbf{b} = (b_1, \ldots, b_n) \in \{0,1\}^n$. Let $\bsigma : \mathbb{R}^n \to \mathbb{R}^n$ be the vector-valued function that applies the sigmoid $\sigma$ entry-wise, i.e., $\bsigma(\mathbf{z}) = (\sigma(z_1),\ldots,\sigma(z_n))$ where $\mathbf{z} = (z_1,\ldots,z_n)$. Thinking of $\sum_{i=1}^n (b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\,\boldsymbol{\alpha}_i$ as a linear combination of the columns of $A^T$ with coefficients being the entries of the vector $\mathbf{b} - \bsigma(A \mathbf{x})$, we see that
$$ \nabla\ell(\mathbf{x}; A, \mathbf{b}) = - \frac{1}{n} \sum_{i=1}^n ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) ) \,\boldsymbol{\alpha}_i = -\frac{1}{n} A^T [\mathbf{b} - \bsigma(A \mathbf{x})]. $$We turn to the Hessian. By symmetry, we can think of the $j$-th column of the Hessian as the gradient of the partial derivative with respect to $x_j$. Hence we start by computing the gradient of the $j$-th entry of the summands in the gradient of $\ell$. We note that, for $\boldsymbol{\alpha} = (\alpha_{1}, \ldots, \alpha_{d}) \in \mathbb{R}^d$,
$$ \nabla [(b - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \alpha_{j}] = - \nabla [\sigma(\boldsymbol{\alpha}^T \mathbf{x})] \, \alpha_{j} = - \sigma(\boldsymbol{\alpha}^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}^T \mathbf{x}))\, \boldsymbol{\alpha}\alpha_{j}. $$Thus, using the fact that $\boldsymbol{\alpha} \alpha_{j}$ is the $j$-th column of the matrix $\boldsymbol{\alpha} \boldsymbol{\alpha}^T$, we get
$$ \mathbf{H}_{\ell}(\mathbf{x}; A, \mathbf{b}) = \frac{1}{n} \sum_{i=1}^n \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}) (1 - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\, \boldsymbol{\alpha}_i \boldsymbol{\alpha}_i^T $$where $\mathbf{H}_{\ell}(\mathbf{x}; A, \mathbf{b})$ indicates the Hessian with respect to the $\mathbf{x}$ variables, for fixed $A, \mathbf{b}$.
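Each term $\sigma(1-\sigma)\, \boldsymbol{\alpha}_i \boldsymbol{\alpha}_i^T$ is positive semidefinite, so the Hessian is positive semidefinite and $\ell$ is convex. A sketch verifying this on random data (the vectorized expression `(A.T * w) @ A` computes $\sum_i w_i \boldsymbol{\alpha}_i \boldsymbol{\alpha}_i^T$):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n, d = 50, 3
A = rng.normal(size=(n, d))                  # random features
x = rng.normal(size=d)                       # random parameter vector

w = sigmoid(A @ x) * (1 - sigmoid(A @ x))    # weights sigma(1 - sigma), shape (n,)
H = (A.T * w) @ A / n                        # Hessian formula above
eigvals = np.linalg.eigvalsh(H)
print(eigvals.min() >= -1e-12)               # True: positive semidefinite
```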
For step size $\beta$, one step of gradient descent is therefore
$$ \mathbf{x}^{t+1} = \mathbf{x}^{t} +\beta \frac{1}{n} \sum_{i=1}^n ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}^t) ) \,\boldsymbol{\alpha}_i. $$
NUMERICAL CORNER: Before implementing GD for logistic regression, we return to our proof of convergence for smooth functions using a special case. We illustrate it on a random dataset. The functions $\hat{f}$, $\mathcal{L}$ and $\frac{\partial}{\partial x}\mathcal{L}$ are defined next.
def fhat(x,a):
    return 1 / ( 1 + np.exp(-np.outer(x,a)) )

def loss(x,a,b):
    return np.mean(-b*np.log(fhat(x,a)) - (1 - b)*np.log(1 - fhat(x,a)), axis=1)

def grad(x,a,b):
    return -np.mean((b - fhat(x,a))*a, axis=1)
n = 10000
a = 2*rng.uniform(0,1,n) - 1
b = rng.integers(2, size=n)
x = np.linspace(-1,1,100)
plt.plot(x, loss(x,a,b), label='loss')
plt.legend()
plt.show()
We plot next the upper and lower bounds in the Quadratic Bound for Smooth Functions around $x = x_0$. It turns out we can take $L=1$: since all features lie in $[-1,1]$ and $\sigma(1-\sigma) \leq 1/4$, the second derivative of the loss is at most $1/4 \leq 1$. Observe that minimizing the upper quadratic bound leads to a decrease in $\mathcal{L}$.
x0 = -0.3
x = np.linspace(x0-0.05,x0+0.05,100)
upper = loss(x0,a,b) + (x - x0)*grad(x0,a,b) + (1/2)*(x - x0)**2 # upper approximation
lower = loss(x0,a,b) + (x - x0)*grad(x0,a,b) - (1/2)*(x - x0)**2 # lower approximation
plt.plot(x, loss(x,a,b), label='loss')
plt.plot(x, upper, label='upper')
plt.plot(x, lower, label='lower')
plt.legend()
plt.show()
We modify our implementation of gradient descent to take a dataset as input. (That will also be useful to generalize to so-called stochastic gradient descent; see below.) Recall that to run gradient descent, we first implement a function computing a descent update. It takes as input a function grad_fn computing the gradient, as well as a current iterate and a step size. We now also feed a dataset as additional input.
def desc_update_for_logreg(grad_fn, A, b, curr_x, beta):
    gradient = grad_fn(curr_x, A, b)
    return curr_x - beta*gradient
We are ready to implement GD. Our function takes as input a function loss_fn computing the objective, a function grad_fn computing the gradient, the dataset A and b, and an initial guess init_x. Optional parameters are the step size and the number of iterations.
def gd_for_logreg(loss_fn, grad_fn, A, b, init_x, beta=1e-3, niters=int(1e5)):
    # initialization
    curr_x = init_x
    # until the maximum number of iterations
    for _ in range(niters):
        curr_x = desc_update_for_logreg(grad_fn, A, b, curr_x, beta)
    return curr_x
To implement loss_fn and grad_fn, we define the sigmoid as above. Below, pred_fn is $\bsigma(A \mathbf{x})$. Here we write the loss function as
$$ \ell(\mathbf{x}; A, \mathbf{b}) = \mathrm{mean}\left(- \mathbf{b} \odot \boldsymbol{\log}(\bsigma(A \mathbf{x})) - (\mathbf{1} - \mathbf{b}) \odot \boldsymbol{\log}(\mathbf{1} - \bsigma(A \mathbf{x}))\right) $$
where $\odot$ is the Hadamard product, or element-wise product (for example $\mathbf{u} \odot \mathbf{v} = (u_1 v_1, \ldots,u_n v_n)^T$), the logarithm (denoted in bold) is applied element-wise and $\mathrm{mean}(\mathbf{z})$ is the mean of the entries of $\mathbf{z}$ (i.e., $\mathrm{mean}(\mathbf{z}) = n^{-1} \sum_{i=1}^n z_i$).
def pred_fn(x, A):
    return sigmoid(A @ x)

def loss_fn(x, A, b):
    return np.mean(-b*np.log(pred_fn(x, A)) - (1 - b)*np.log(1 - pred_fn(x, A)))

def grad_fn(x, A, b):
    return -A.T @ (b - pred_fn(x, A))/len(b)
We can choose a step size based on the smoothness of the objective as above. Recall that numpy.linalg.norm computes the Frobenius norm by default.
def stepsize_for_logreg(A, b):
    L = LA.norm(A)**2 / len(b)
    return 1/L
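A brief justification for this choice (stated here without proof): since $\sigma(1-\sigma) \leq 1/4$, the Hessian satisfies $\mathbf{H}_\ell \preceq \frac{1}{4n} A^T A$, whose largest eigenvalue is at most $\frac{1}{4n}\|A\|_F^2 \leq \frac{1}{n}\|A\|_F^2 = L$. A numerical spot check on random data:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n, d = 100, 4
A = rng.normal(size=(n, d))
x = rng.normal(size=d)

w = sigmoid(A @ x) * (1 - sigmoid(A @ x))
H = (A.T * w) @ A / n                       # Hessian of the loss at x
L = np.linalg.norm(A)**2 / n                # constant used in stepsize_for_logreg
print(np.linalg.eigvalsh(H).max() <= L)     # True
```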
LeBron James 2017 NBA Playoffs dataset

We start with a simple dataset from UC Berkeley's DS100 course. The file lebron.csv is available here. Quoting a previous version of the course's textbook:
In basketball, players score by shooting a ball through a hoop. One such player, LeBron James, is widely considered one of the best basketball players ever for his incredible ability to score. LeBron plays in the National Basketball Association (NBA), the United States's premier basketball league. We've collected a dataset of all of LeBron's attempts in the 2017 NBA Playoff Games using the NBA statistics website (https://stats.nba.com/).
We first load the data and look at its summary.
df = pd.read_csv('lebron.csv')
df.head()
| | game_date | minute | opponent | action_type | shot_type | shot_distance | shot_made |
|---|---|---|---|---|---|---|---|
| 0 | 20170415 | 10 | IND | Driving Layup Shot | 2PT Field Goal | 0 | 0 |
| 1 | 20170415 | 11 | IND | Driving Layup Shot | 2PT Field Goal | 0 | 1 |
| 2 | 20170415 | 14 | IND | Layup Shot | 2PT Field Goal | 0 | 1 |
| 3 | 20170415 | 15 | IND | Driving Layup Shot | 2PT Field Goal | 0 | 1 |
| 4 | 20170415 | 18 | IND | Alley Oop Dunk Shot | 2PT Field Goal | 0 | 1 |
df.describe()
| | game_date | minute | shot_distance | shot_made |
|---|---|---|---|---|
| count | 3.840000e+02 | 384.00000 | 384.000000 | 384.000000 |
| mean | 2.017052e+07 | 24.40625 | 10.695312 | 0.565104 |
| std | 6.948501e+01 | 13.67304 | 10.547586 | 0.496390 |
| min | 2.017042e+07 | 1.00000 | 0.000000 | 0.000000 |
| 25% | 2.017050e+07 | 13.00000 | 1.000000 | 0.000000 |
| 50% | 2.017052e+07 | 25.00000 | 6.500000 | 1.000000 |
| 75% | 2.017060e+07 | 35.00000 | 23.000000 | 1.000000 |
| max | 2.017061e+07 | 48.00000 | 31.000000 | 1.000000 |
The two columns we will be interested in are shot_distance (LeBron's distance from the basket when the shot was attempted, in feet) and shot_made (0 if the shot missed, 1 if the shot went in). As the summary table above indicates, the average distance was 10.6953 and the frequency of shots made was 0.565104. We extract those two columns and display them on a scatter plot.
feature = df['shot_distance']
label = df['shot_made']
plt.scatter(feature, label, alpha=0.2)
plt.show()
As you can see, this kind of data is hard to visualize because of the superposition of points with the same $x$ and $y$-values. One trick is to jitter the $y$'s a little bit by adding Gaussian noise. We do this next and plot again.
label_jitter = label + 0.05*rng.normal(0,1,len(label))
plt.scatter(feature, label_jitter, alpha=0.2)
plt.show()
We apply GD to logistic regression. We first construct the data matrices $A$ and $\mathbf{b}$. To allow an affine function of the features, we add a column of $1$'s as we have done before.
A = np.stack((np.ones(len(label)),feature),axis=-1)
b = label
We run GD starting from $(0,0)$ with a step size computed from the smoothness of the objective as above.
stepsize = stepsize_for_logreg(A, b)
print(stepsize)
0.0044179063265799194
init_x = np.zeros(A.shape[1])
best_x = gd_for_logreg(loss_fn, grad_fn, A, b, init_x, beta=stepsize)
print(best_x)
[ 0.90959003 -0.05890828]
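The helper functions `sigmoid`, `loss_fn`, `grad_fn`, `desc_update_for_logreg`, `stepsize_for_logreg`, and `gd_for_logreg` were defined earlier in the chapter. For readers jumping in here, a minimal numpy sketch consistent with how they are called in this section is below; this is a reconstruction, and the exact implementations in the notes (in particular the smoothness constant behind `stepsize_for_logreg`) may differ.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def pred_fn(x, A):
    # predicted probabilities sigma(alpha_i^T x), one per sample
    return sigmoid(A @ x)

def loss_fn(x, A, b):
    # average cross-entropy loss
    p = pred_fn(x, A)
    return -np.mean(b * np.log(p) + (1 - b) * np.log(1 - p))

def grad_fn(x, A, b):
    # gradient of the average cross-entropy loss
    return -A.T @ (b - pred_fn(x, A)) / len(b)

def desc_update_for_logreg(grad_fn, A, b, curr_x, beta):
    # one step of gradient descent
    return curr_x - beta * grad_fn(curr_x, A, b)

def stepsize_for_logreg(A, b):
    # step size from a smoothness bound:
    # the reciprocal of the average squared row norm of A
    return len(b) / np.linalg.norm(A) ** 2

def gd_for_logreg(loss_fn, grad_fn, A, b, init_x, beta=1e-3, niters=int(1e5)):
    curr_x = init_x
    for _ in range(niters):
        curr_x = desc_update_for_logreg(grad_fn, A, b, curr_x, beta)
    return curr_x

# sanity check on synthetic data: the loss should decrease from its starting value
rng_check = np.random.default_rng(0)
A_toy = np.column_stack((np.ones(100), rng_check.normal(size=100)))
b_toy = (A_toy[:, 1] > 0).astype(float)
x_hat = gd_for_logreg(loss_fn, grad_fn, A_toy, b_toy, np.zeros(2),
                      beta=stepsize_for_logreg(A_toy, b_toy), niters=5000)
```

The synthetic check at the end only verifies that gradient descent reduces the loss; it is not part of the LeBron example.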
Finally we plot the results.
grid = np.linspace(np.min(feature), np.max(feature), 100)
feature_grid = np.stack((np.ones(len(grid)),grid),axis=-1)
predict_grid = sigmoid(feature_grid @ best_x)
plt.scatter(feature, label_jitter, alpha=0.2)
plt.plot(grid,predict_grid,'r')
plt.show()
Stochastic gradient descent In stochastic gradient descent (SGD), a variant of gradient descent, we pick a sample $I_t$ uniformly at random in $\{1,\ldots,n\}$ and update as follows
$$ \mathbf{x}^{t+1} = \mathbf{x}^{t} +\beta \, ( b_{I_t} - \sigma(\boldsymbol{\alpha}_{I_t}^T \mathbf{x}^t) ) \, \boldsymbol{\alpha}_{I_t}. $$

For the mini-batch version of SGD, we pick a random sub-sample $\mathcal{B}_t \subseteq \{1,\ldots,n\}$ of size $B$ and update as follows

$$ \mathbf{x}^{t+1} = \mathbf{x}^{t} +\beta \frac{1}{B} \sum_{i\in \mathcal{B}_t} ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}^t) ) \,\boldsymbol{\alpha}_i. $$

The key observation about the two stochastic updates above is that, in expectation, they perform a step of gradient descent. That turns out to be enough for convergence, and it has significant computational advantages.
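The key observation can be checked numerically: averaging the single-sample direction $(b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}))\,\boldsymbol{\alpha}_i$ over all samples recovers the full-gradient direction exactly. A small numpy illustration on made-up data (all names below are new to this snippet):

```python
import numpy as np

rng_demo = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# made-up data: n samples with d features, binary labels
n, d = 100, 3
A_demo = rng_demo.normal(size=(n, d))
b_demo = rng_demo.integers(0, 2, size=n).astype(float)
x_demo = rng_demo.normal(size=d)

# full-gradient direction: (1/n) sum_i (b_i - sigma(alpha_i^T x)) alpha_i
full_dir = A_demo.T @ (b_demo - sigmoid(A_demo @ x_demo)) / n

# single-sample directions (b_i - sigma(alpha_i^T x)) alpha_i, one per row;
# their average over a uniformly random index equals the full-gradient direction
single_dirs = (b_demo - sigmoid(A_demo @ x_demo))[:, None] * A_demo
print(np.allclose(single_dirs.mean(axis=0), full_dir))  # True
```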
The only modification needed to the code is to pick a random mini-batch, which is then fed to the descent update subroutine in place of the full dataset.
def sgd_for_logreg(loss_fn, grad_fn, A, b,
                   init_x, beta=1e-3, niters=int(1e5), batch=40):
    # initialization
    curr_x = init_x
    nsamples = len(b)
    # until the maximum number of iterations
    for _ in range(niters):
        # pick a random mini-batch and take a descent step on it
        I = rng.integers(nsamples, size=batch)
        curr_x = desc_update_for_logreg(grad_fn, A[I,:], b[I], curr_x, beta)
    return curr_x
South African Heart Disease dataset We analyze a dataset from [ESL], which can be downloaded here. Quoting [ESL, Section 4.4.2]
The data [...] are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa (Rousseauw et al., 1983). The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. These data are described in more detail in Hastie and Tibshirani (1987).
We load the data, which we slightly reformatted, and look at a summary.
df = pd.read_csv('SAHeart.csv')
df.head()
| | sbp | tobacco | ldl | adiposity | typea | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 160.0 | 12.00 | 5.73 | 23.11 | 49.0 | 25.30 | 97.20 | 52.0 | 1.0 |
| 1 | 144.0 | 0.01 | 4.41 | 28.61 | 55.0 | 28.87 | 2.06 | 63.0 | 1.0 |
| 2 | 118.0 | 0.08 | 3.48 | 32.28 | 52.0 | 29.14 | 3.81 | 46.0 | 0.0 |
| 3 | 170.0 | 7.50 | 6.41 | 38.03 | 51.0 | 31.99 | 24.26 | 58.0 | 1.0 |
| 4 | 134.0 | 13.60 | 3.50 | 27.78 | 60.0 | 25.99 | 57.34 | 49.0 | 1.0 |
df.describe()
| | sbp | tobacco | ldl | adiposity | typea | obesity | alcohol | age | chd |
|---|---|---|---|---|---|---|---|---|---|
| count | 462.000000 | 462.000000 | 462.000000 | 462.000000 | 462.000000 | 462.000000 | 462.000000 | 462.000000 | 462.000000 |
| mean | 138.326840 | 3.635649 | 4.740325 | 25.406732 | 53.103896 | 26.044113 | 17.044394 | 42.816017 | 0.346320 |
| std | 20.496317 | 4.593024 | 2.070909 | 7.780699 | 9.817534 | 4.213680 | 24.481059 | 14.608956 | 0.476313 |
| min | 101.000000 | 0.000000 | 0.980000 | 6.740000 | 13.000000 | 14.700000 | 0.000000 | 15.000000 | 0.000000 |
| 25% | 124.000000 | 0.052500 | 3.282500 | 19.775000 | 47.000000 | 22.985000 | 0.510000 | 31.000000 | 0.000000 |
| 50% | 134.000000 | 2.000000 | 4.340000 | 26.115000 | 53.000000 | 25.805000 | 7.510000 | 45.000000 | 0.000000 |
| 75% | 148.000000 | 5.500000 | 5.790000 | 31.227500 | 60.000000 | 28.497500 | 23.892500 | 55.000000 | 1.000000 |
| max | 218.000000 | 31.200000 | 15.330000 | 42.490000 | 78.000000 | 46.580000 | 147.190000 | 64.000000 | 1.000000 |
Our goal is to predict `chd`, which stands for coronary heart disease, based on the other variables (which are briefly described here). We use logistic regression again.
We first construct the data matrices. We only use three of the predictors, as convergence with all of them is quite slow. Try it for yourself!
feature = df[['tobacco', 'ldl', 'age']].to_numpy()
print(feature)
[[1.200e+01 5.730e+00 5.200e+01]
 [1.000e-02 4.410e+00 6.300e+01]
 [8.000e-02 3.480e+00 4.600e+01]
 ...
 [3.000e+00 1.590e+00 5.500e+01]
 [5.400e+00 1.161e+01 4.000e+01]
 [0.000e+00 4.820e+00 4.600e+01]]
label = df['chd'].to_numpy()
A = np.concatenate((np.ones((len(label),1)),feature),axis=1)
print(A)
[[1.000e+00 1.200e+01 5.730e+00 5.200e+01]
 [1.000e+00 1.000e-02 4.410e+00 6.300e+01]
 [1.000e+00 8.000e-02 3.480e+00 4.600e+01]
 ...
 [1.000e+00 3.000e+00 1.590e+00 5.500e+01]
 [1.000e+00 5.400e+00 1.161e+01 4.000e+01]
 [1.000e+00 0.000e+00 4.820e+00 4.600e+01]]
b = label
We use the same functions `loss_fn` and `grad_fn`, which were written for general logistic regression problems.
init_x = np.zeros(A.shape[1])
stepsize = stepsize_for_logreg(A, b)
print(stepsize)
0.00047434072581205094
best_x = gd_for_logreg(loss_fn, grad_fn, A, b,
                       init_x, beta=stepsize, niters=1000000)
print(best_x)
[-4.01602283 0.07651133 0.1856912 0.04802978]
The outcome is harder to visualize. To get a sense of how accurate the result is, we compare our predictions to the true labels. By prediction, we mean that we predict label $1$ whenever $\sigma(\boldsymbol{\alpha}^T \mathbf{x}) > 1/2$, and label $0$ otherwise. We try this on the training set. (A better approach would be to split the data into training and testing sets, but we will not do this here.)
def logis_acc(x, A, b):
    # fraction of predictions that match the true labels
    return np.sum((pred_fn(x, A) > 0.5) == b)/len(b)
logis_acc(best_x, A, b)
0.7251082251082251
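For context, the summary table above shows that about $34.6\%$ of the observations are positive, so always predicting the majority class (no heart disease) would already be correct about $65.4\%$ of the time. A quick sanity check of this baseline (the value of `chd_mean` is copied from `df.describe()` above):

```python
# fraction of positive chd labels, taken from the df.describe() summary above
chd_mean = 0.346320
# accuracy of always predicting the more common label
majority_baseline = max(chd_mean, 1 - chd_mean)
print(majority_baseline)  # about 0.654
```

So the logistic regression fit improves on this trivial baseline, but only modestly.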
We also try mini-batch stochastic gradient descent (SGD).
init_x = np.zeros(A.shape[1])
best_x = sgd_for_logreg(loss_fn, grad_fn, A, b,
                        init_x, beta=stepsize, niters=1000000)
print(best_x)
[-4.02241376 0.07713229 0.18654377 0.04636768]
logis_acc(best_x, A, b)
0.7186147186147186
Rather than explicitly specifying the gradient function, we could use PyTorch to compute it automatically. This is done next. Note that the descent update is done within `with torch.no_grad()`, which ensures that the update operation itself is not tracked for gradient computation. Here the input `x0` as well as the output `xk.numpy(force=True)` are Numpy arrays. The method `numpy()` converts a PyTorch tensor to a Numpy array (see the documentation for an explanation of the `force=True` option).
def gd_with_ad(f, x0, alpha=1e-3, niters=int(1e6)):
    xk = torch.tensor(x0,
                      requires_grad=True,
                      dtype=torch.float)
    for _ in range(niters):
        # Compute the function value and its gradient
        value = f(xk)
        value.backward()
        # Perform a gradient descent step
        with torch.no_grad():  # Temporarily set all requires_grad flags to False
            xk -= alpha * xk.grad
        # Zero the gradients for the next iteration
        xk.grad.zero_()
    return xk.numpy(force=True), f(xk).item()
We revisit a previous example.
def f(x):
    return x**3
xgrid = np.linspace(-2,2,100)
plt.plot(xgrid, f(xgrid), label='f')
plt.ylim((-10,10))
plt.legend()
plt.show()
gd_with_ad(f, 2, niters=int(1e4))
(array(0.03277362, dtype=float32), 3.5202472645323724e-05)
gd_with_ad(f, -2, niters=100)
(array(-4.9335055, dtype=float32), -120.07894897460938)
The `Advertising` dataset and the least-squares solution We return to the `Advertising` dataset.
df = pd.read_csv('advertising.csv')
df.head()
| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |
We first compute the solution using the least-squares approach we detailed previously. We use `numpy.column_stack` to add a column of ones to the feature vectors.
TV = df['TV'].to_numpy()
radio = df['radio'].to_numpy()
newspaper = df['newspaper'].to_numpy()
sales = df['sales'].to_numpy()
features = np.stack((TV, radio, newspaper), axis=-1)
A = np.column_stack((np.ones(len(sales)), features))
coeff = mmids.ls_by_qr(A, sales)
print(coeff)
[ 2.93888937e+00 4.57646455e-02 1.88530017e-01 -1.03749304e-03]
np.mean((A @ coeff - sales)**2)
2.784126314510936
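The function `mmids.ls_by_qr` was introduced in an earlier chapter. As a reminder, a minimal sketch of a QR-based least-squares solver (the version in the `mmids` module may differ in details) is:

```python
import numpy as np

def ls_by_qr(A, b):
    # solve the least-squares problem min ||Ax - b||^2 via a QR decomposition:
    # with A = QR (reduced form), the solution satisfies R x = Q^T b
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T @ b)

# sanity check against numpy's built-in least-squares solver
rng_check = np.random.default_rng(0)
A_check = rng_check.normal(size=(30, 4))
b_check = rng_check.normal(size=30)
print(np.allclose(ls_by_qr(A_check, b_check),
                  np.linalg.lstsq(A_check, b_check, rcond=None)[0]))  # True
```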
Solving the problem using PyTorch We will be using PyTorch to implement the previous method. We first convert the data into PyTorch tensors. We then use `torch.utils.data.TensorDataset` to create the dataset. Finally, `torch.utils.data.DataLoader` provides the utilities to load the data in batches for training. We take mini-batches of size `BATCH_SIZE = 64` and we apply a random permutation of the samples on every pass (with the option `shuffle=True`).
# Convert data to PyTorch tensors
features_tensor = torch.tensor(features, dtype=torch.float32)
sales_tensor = torch.tensor(sales, dtype=torch.float32).view(-1, 1)

# Create a dataset and dataloader for training
BATCH_SIZE = 64
train_dataset = TensorDataset(features_tensor, sales_tensor)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
Now we construct our model. It is simply an affine map from $\mathbb{R}^3$ to $\mathbb{R}$. Note that there is no need to pre-process the inputs by adding $1$s. A constant term (or "bias variable") is automatically added by PyTorch (unless one chooses the option `bias=False`).
# Define the model using nn.Sequential
model = nn.Sequential(
    nn.Linear(3, 1)  # 3 input features, 1 output value
)
Finally, we are ready to run an optimization method of our choice on the loss function; both are specified next. There are many optimizers available. (See this post for a brief explanation of many common optimizers.) Here we use SGD as the optimizer, and the loss function is the MSE. A quick tutorial is here.
Choosing the right number of passes (i.e., epochs) through the data requires some experimenting. Here $10^4$ suffices. But in the interest of time, we will run it only for $100$ epochs. As you will see from the results, this is not quite enough. On each pass, we compute the output of the current model, use `backward()` to obtain the gradient, and then perform a descent update with `step()`. We also have to reset the gradients first (otherwise they accumulate by default).
# Compile the model: define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-5)

# Train the model
epochs = 100
for epoch in range(epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")
Epoch 10/100, Loss: 61.717803955078125
Epoch 20/100, Loss: 74.88373565673828
Epoch 30/100, Loss: 3.718038558959961
Epoch 40/100, Loss: 7.307799816131592
Epoch 50/100, Loss: 13.322959899902344
Epoch 60/100, Loss: 4.912189960479736
Epoch 70/100, Loss: 4.8989787101745605
Epoch 80/100, Loss: 8.02322769165039
Epoch 90/100, Loss: 5.188374042510986
Epoch 100/100, Loss: 2.5458970069885254
The final parameters and loss are:
# Get and print the model weights and bias
weights = model[0].weight.detach().numpy()
bias = model[0].bias.detach().numpy()
print("Weights:", weights)
print("Bias:", bias)
Weights: [[ 0.04833269 0.27859896 -0.01787573]]
Bias: [0.03412555]
# Evaluate the model
model.eval()
with torch.no_grad():
    total_loss = 0
    for inputs, targets in train_loader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        total_loss += loss.item()
print(f"Mean Squared Error on Training Set: {total_loss / len(train_loader)}")
Mean Squared Error on Training Set: 4.78020042181015
Figure: MNIST sample images (Source)
$\bowtie$
We first load the data. As before, the training dataset is a tensor -- think matrix with $3$ indices. One index runs through the $60,000$ training images, while the other two indices run through the horizontal and vertical pixel axes of each image. Here each image is $28 \times 28$. The training labels are between $0$ and $9$.
# Load and normalize the MNIST dataset
train_dataset = datasets.MNIST(root='./data',
                               train=True,
                               download=True,
                               transform=transforms.ToTensor())
test_dataset = datasets.MNIST(root='./data',
                              train=False,
                              download=True,
                              transform=transforms.ToTensor())

BATCH_SIZE = 32
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=BATCH_SIZE,
                                           shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset,
                                          batch_size=BATCH_SIZE,
                                          shuffle=False)
Implementation We implement multinomial logistic regression to learn a classifier for the MNIST data. We first check for the availability of GPUs.
# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
Using device: cpu
In PyTorch, composition of functions can be achieved with `torch.nn.Sequential`. Our model is:
# Define the model using nn.Sequential and move it to the device (GPU if available)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 10)
).to(device)
The `torch.nn.Flatten` layer turns each input image into a vector of size $784$ (where $784 = 28^2$ is the number of pixels in each image). The final output is $10$-dimensional.
Here we use the `torch.optim.Adam` optimizer (you can try SGD, but it is slow). The loss function is the cross-entropy, as implemented by `torch.nn.CrossEntropyLoss`.
# Compile the model: define loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
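Adam maintains exponentially weighted running averages of the gradient and of its entrywise square, and scales each coordinate's step by the inverse square root of the latter. A minimal numpy sketch of the update rule (the hyperparameter names mirror PyTorch's defaults `lr=1e-3`, `betas=(0.9, 0.999)`, `eps=1e-8`; the toy run below uses a larger `lr` so it converges in few iterations):

```python
import numpy as np

def adam(grad_fn, x0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, niters=2000):
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)  # running average of the gradient
    v = np.zeros_like(x)  # running average of the squared gradient (entrywise)
    for t in range(1, niters + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# minimize f(x) = ||x - 1||^2, whose gradient is 2(x - 1)
x_min = adam(lambda x: 2 * (x - np.ones(2)), np.zeros(2))
print(x_min)
```

This is only a sketch of the update rule; PyTorch's implementation handles tensors of parameters, optional weight decay, and other variants.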
In the interest of time, we train for 3 epochs only. An epoch is one pass through the training data in which every sample is visited once (in a randomly shuffled order).
# Training function
def train(dataloader, model, loss_fn, optimizer, device):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Training loop
epochs = 3  # Adjust the number of epochs as needed
for epoch in range(epochs):
    train(train_loader, model, loss_fn, optimizer, device)
    if (epoch+1) % 1 == 0:
        print(f"Epoch {epoch+1}/{epochs}")
Epoch 1/3
Epoch 2/3
Epoch 3/3
Because of the risk of overfitting, we use the test images to assess the performance of the final classifier.
# Evaluation function
def test(dataloader, model, device):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(dim=1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    accuracy = correct / size
    print(f"Test error: {(100*accuracy):>0.1f}% accuracy")
# Evaluate the model
test(test_loader, model, device)
Test error: 92.3% accuracy
To make a prediction, we take a `torch.nn.functional.softmax` of the output of our model. It transforms the output into a probability for each label. The softmax is implicitly included in the cross-entropy loss, but is not actually part of the trained model. (Note that the softmax itself has no parameters.)
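As a minimal illustration of what the softmax computes, here is a numpy sketch applied to a made-up vector of raw model outputs (logits):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(np.round(probs, 4))                 # entries are positive and sum to 1
print(probs.argmax() == logits.argmax())  # True: softmax preserves the ranking
```

In particular, taking the argmax of the softmax output gives the same prediction as taking the argmax of the raw logits.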
As an illustration, we do this for each test image. We use `torch.cat` to concatenate a sequence of tensors into a single tensor.
def predict_softmax(dataloader, model, device):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()  # Set the model to evaluation mode
    predictions = []
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            probabilities = F.softmax(pred, dim=1)
            predictions.append(probabilities.cpu())  # Move predictions to CPU
    return torch.cat(predictions, dim=0)
predictions = predict_softmax(test_loader, model, device).numpy()
The result for the first test image is shown below. To make a prediction, we choose the label with the highest probability.
print(predictions[0])
[2.7029557e-05 8.4270707e-10 5.6631994e-05 6.4428351e-03 1.7773822e-06
 3.7374924e-05 5.0613664e-09 9.9277437e-01 5.0100676e-05 6.0985261e-04]
predictions[0].argmax(0)
7
The truth is:
images, labels = next(iter(test_loader))
images = images.squeeze().numpy()
labels = labels.numpy()
labels[0]
7
Above, `next(iter(test_loader))` loads the first batch of test images. (See here for background on iterators in Python.)
Here's the first one.
# Visualization code for individual image
i = 0
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(predictions[i], labels[i], images[i])
plt.subplot(1,2,2)
plot_value_array(predictions[i], labels[i])
plt.show()
This one is a little less clear.
# Visualization code for individual and multiple images
i = 11
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(predictions[i], labels[i], images[i])
plt.subplot(1,2,2)
plot_value_array(predictions[i], labels[i])
plt.show()
This one is wrong.
# Visualization code for individual and multiple images
i = 8
plt.figure(figsize=(6,3))
plt.subplot(1,2,1)
plot_image(predictions[i], labels[i], images[i])
plt.subplot(1,2,2)
plot_value_array(predictions[i], labels[i])
plt.show()
Implementation We implement a neural network in PyTorch. We use the MNIST dataset again. We have already loaded it.
We construct a three-layer model.
# Define the model using nn.Sequential
model = nn.Sequential(
    nn.Flatten(),            # Flatten the input
    nn.Linear(28 * 28, 32),  # First Linear layer with 32 nodes
    nn.Sigmoid(),            # Sigmoid activation function
    nn.Linear(32, 10)        # Second Linear layer with 10 nodes (output layer)
).to(device)
As we did for multinomial logistic regression, we use the Adam optimizer and the cross-entropy loss.
# Define the loss function and the optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
Again, we train for 3 epochs.
# Training loop
epochs = 3  # Adjust the number of epochs as needed
for epoch in range(epochs):
    train(train_loader, model, loss_fn, optimizer, device)
    if (epoch+1) % 1 == 0:
        print(f"Epoch {epoch+1}/{epochs}")
Epoch 1/3
Epoch 2/3
Epoch 3/3
On the test data, we get:
# Evaluate the model
test(test_loader, model, device)
Test error: 94.3% accuracy
This is a significantly more accurate model than the one we obtained using multinomial logistic regression. One can do even better using a neural network architecture tailored to images, known as a convolutional neural network.