Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 8-Neural networks, backpropagation and stochastic gradient descent
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: July 16, 2024
Copyright: © 2024 Sebastien Roch
$\newcommand{\bmu}{\boldsymbol{\mu}}$ $\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bfbeta}{\boldsymbol{\beta}}$ $\newcommand{\bflambda}{\boldsymbol{\lambda}}$ $\newcommand{\bgamma}{\boldsymbol{\gamma}}$ $\newcommand{\bsigma}{{\boldsymbol{\sigma}}}$ $\newcommand{\bpi}{\boldsymbol{\pi}}$ $\newcommand{\btheta}{{\boldsymbol{\theta}}}$ $\newcommand{\bphi}{\boldsymbol{\phi}}$ $\newcommand{\balpha}{\boldsymbol{\alpha}}$ $\newcommand{\blambda}{\boldsymbol{\lambda}}$ $\renewcommand{\P}{\mathbb{P}}$ $\newcommand{\E}{\mathbb{E}}$ $\newcommand{\indep}{\perp\!\!\!\perp} \newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bp}{\mathbf{p}}$ $\renewcommand{\bx}{\mathbf{x}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bY}{\mathbf{Y}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bZ}{\mathbf{Z}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bV}{\mathbf{V}}$ $\newcommand{\bfg}{\mathbf{g}}$ $\newcommand{\bfh}{\mathbf{h}}$ $\newcommand{\horz}{\rule[.5ex]{2.5ex}{0.5pt}}$ $\renewcommand{\S}{\mathcal{S}}$ $\newcommand{\X}{\mathcal{X}}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\pa}{\mathrm{pa}}$ $\newcommand{\Z}{\mathcal{Z}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bc}{\mathbf{c}}$ $\newcommand{\cE}{\mathcal{E}}$ $\newcommand{\cP}{\mathcal{P}}$ $\newcommand{\bbeta}{\boldsymbol{\beta}}$ $\newcommand{\bLambda}{\boldsymbol{\Lambda}}$ $\newcommand{\cov}{\mathrm{Cov}}$ $\newcommand{\bfk}{\mathbf{k}}$ $\newcommand{\idx}[1]{}$ $\newcommand{\xdi}{}$
Motivating example: classifying natural images
In this chapter, we return to the classification problem. This time we consider more complex datasets involving natural images. We have seen an example previously, the MNIST dataset. We use a related dataset known as Fashion-MNIST, developed by Zalando Research. Quoting from their GitHub repository:
Fashion-MNIST is a dataset of Zalando's article images -- consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
We first load the data and convert it to an appropriate matrix representation. The data can be accessed with torchvision.datasets.FashionMNIST.
import matplotlib.pyplot as plt  # needed for plotting below
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, TensorDataset

fashion_mnist = datasets.FashionMNIST(root='./data', train=True,
                                      download=True, transform=transforms.ToTensor())
For example, the first image and its label are the following. The squeeze() below removes the color dimension in the image, which is grayscale.
img, label = fashion_mnist[0]
plt.figure()
plt.imshow(img.squeeze(), cmap='gray')
plt.show()
label
9
def FashionMNIST_get_class_name(label):
    class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress",
                   "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
    return class_names[label]
print(f"{label}: '{FashionMNIST_get_class_name(label)}'")
9: 'Ankle boot'
The purpose of this chapter is to develop some of the mathematical tools used to solve this kind of classification problem:
- neural networks,
- backpropagation,
- stochastic gradient descent.
Background: Jacobian, chain rule, and a brief introduction to automatic differentiation
Automatic differentiation in PyTorch We will use PyTorch. It uses tensors$\idx{tensor}\xdi$, which in many ways behave similarly to NumPy arrays. See here for a quick introduction. We first initialize the tensors. Here each tensor corresponds to a single real variable. With the option requires_grad=True, we indicate that these are variables with respect to which a gradient will be taken later. We initialize the tensors at the values where the derivatives will be computed. If derivatives need to be computed at different values, we need to repeat this process. The function .backward() computes the gradient using backpropagation, to which we will return later. The partial derivatives are accessed with .grad.
NUMERICAL CORNER: This is better understood through an example.
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)
We define the function. Note that we use torch.exp, the PyTorch implementation of the (element-wise) exponential function. Moreover, as in NumPy, PyTorch allows the use of ** for taking a power. Here is a list of operations on tensors in PyTorch.
f = 3 * (x1 ** 2) + x2 + torch.exp(x1 * x2)
f.backward()
print(x1.grad) # df/dx1
print(x2.grad) # df/dx2
tensor(20.7781)
tensor(8.3891)
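As a sanity check (this computation is not in the original notebook), the analytic partial derivatives are $\partial f/\partial x_1 = 6 x_1 + x_2 e^{x_1 x_2}$ and $\partial f/\partial x_2 = 1 + x_1 e^{x_1 x_2}$, which at $(x_1, x_2) = (1, 2)$ give the values above.

import math

print(6 * 1.0 + 2.0 * math.exp(1.0 * 2.0))  # ≈ 20.7781
print(1.0 + 1.0 * math.exp(1.0 * 2.0))      # ≈ 8.3891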
The input parameters can also be vectors, which allows us to consider functions of a large number of variables. Here we use torch.sum to take the sum of the entries.
z = torch.tensor([1., 2., 3.], requires_grad=True)
g = torch.sum(z ** 2)
g.backward()
print(z.grad) # gradient is (2 z_1, 2 z_2, 2 z_3)
tensor([2., 4., 6.])
Here is another typical example in a data science context.
X = torch.randn(3, 2) # Random dataset (features)
y = torch.tensor([[1.], [0.], [1.]]) # Dataset (labels), as a column vector matching predict
theta = torch.ones(2, 1, requires_grad=True) # Parameter assignment
predict = X @ theta # Classifier with parameter vector theta
loss = torch.sum((predict - y)**2) # Loss function
loss.backward() # Compute gradients
print(theta.grad) # gradient of loss
tensor([[11.9778], [89.0301]])
Implementing gradient descent in PyTorch Rather than explicitly specifying the gradient function, we could use PyTorch to compute it automatically. This is done next. Note that the descent update is done within with torch.no_grad(), which ensures that the update operation itself is not tracked for gradient computation. Here the input x0 as well as the output xk.numpy(force=True) are NumPy arrays. The function torch.Tensor.numpy() converts a PyTorch tensor to a NumPy array (see the documentation for an explanation of the force=True option). Also, quoting ChatGPT:

In the given code, .item() is used to extract the scalar value from a tensor. In PyTorch, when you perform operations on tensors, you get back tensors as results, even if the result is a single scalar value. .item() is used to extract this scalar value from the tensor.
def gd_with_ad(f, x0, alpha=1e-3, niters=int(1e6)):
    xk = torch.tensor(x0, requires_grad=True, dtype=torch.float)
    for _ in range(niters):
        value = f(xk)
        value.backward()
        with torch.no_grad():
            xk -= alpha * xk.grad
            xk.grad.zero_()
    return xk.numpy(force=True), f(xk).item()
NUMERICAL CORNER: We revisit a previous example.
def f(x):
return x**3
print(gd_with_ad(f, 2, niters=int(1e4)))
(array(0.03277362, dtype=float32), 3.5202472645323724e-05)
print(gd_with_ad(f, -2, niters=100))
(array(-4.9335055, dtype=float32), -120.07894897460938)
Building blocks of AI 1: backpropagation
NUMERICAL CORNER: To make things more concrete, we consider a specific example. We will use torch.linalg.vector_norm to compute the Euclidean norm in PyTorch. Suppose $d=3$, $L=1$, $n_1 = 2$, and $K = 2$ with the following choices:
x = torch.tensor([1.,0.,-1.], requires_grad=True)
y = torch.tensor([0.,1.])
W0 = torch.tensor([[0.,1.,-1.],[2.,0.,1.]])
W1 = torch.tensor([[-1.,0.],[2.,-1.]])
z0 = x
z1 = W0 @ z0
z2 = W1 @ z1
f = 0.5 * (torch.linalg.vector_norm(y-z2) ** 2)
print(z0)
tensor([ 1., 0., -1.], requires_grad=True)
print(z1)
tensor([1., 1.], grad_fn=<MvBackward0>)
print(z2)
tensor([-1., 1.], grad_fn=<MvBackward0>)
print(f)
tensor(0.5000, grad_fn=<MulBackward0>)
NUMERICAL CORNER: We return to our concrete example.
with torch.no_grad():
    F0 = W0
    F1 = W1 @ F0
    grad_f = ((z2 - y).unsqueeze(0)) @ F1
print(F0)
tensor([[ 0.,  1., -1.],
        [ 2.,  0.,  1.]])
print(F1)
tensor([[ 0., -1.,  1.],
        [-2.,  2., -3.]])
print(grad_f)
tensor([[ 0., 1., -1.]])
We can check that we get the same outcome using AD.
f.backward()
print(x.grad)
tensor([ 0., 1., -1.])
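As an extra check (this line is not in the original code), we can verify programmatically that the forward-mode computation and AD agree:

print(torch.allclose(grad_f.squeeze(0), x.grad))  # True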
NUMERICAL CORNER: We try our specific example.
with torch.no_grad():
    G2 = ((z2 - y).unsqueeze(0))
    G1 = G2 @ W1
    grad_f = G1 @ W0
print(G2)
tensor([[-1., 0.]])
print(G1)
tensor([[1., 0.]])
print(grad_f)
tensor([[ 0., 1., -1.]])
We indeed obtain the same answer yet again.
NUMERICAL CORNER: We return to the concrete example from the previous subsection. This time the matrices W0 and W1 require partial derivatives.
x = torch.tensor([1.,0.,-1.])
y = torch.tensor([0.,1.])
W0 = torch.tensor([[0.,1.,-1.],[2.,0.,1.]], requires_grad=True)
W1 = torch.tensor([[-1.,0.],[2.,-1.]], requires_grad=True)
z0 = x
z1 = W0 @ z0
z2 = W1 @ z1
f = 0.5 * (torch.linalg.vector_norm(y-z2) ** 2)
print(z0)
tensor([ 1., 0., -1.])
print(z1)
tensor([1., 1.], grad_fn=<MvBackward0>)
print(z2)
tensor([-1., 1.], grad_fn=<MvBackward0>)
print(f)
tensor(0.5000, grad_fn=<MulBackward0>)
We compute the gradient $\nabla f(\mathbf{w})$ using AD.
f.backward()
print(W0.grad)
tensor([[ 1.,  0., -1.],
        [ 0.,  0., -0.]])
print(W1.grad)
tensor([[-1., -1.],
        [-0., -0.]])
These are written in the form of matrix derivatives

$$ \frac{\partial f}{\partial \mathcal{W}_0} = \begin{pmatrix} \frac{\partial f}{\partial w_0} & \frac{\partial f}{\partial w_1} & \frac{\partial f}{\partial w_2} \\ \frac{\partial f}{\partial w_3} & \frac{\partial f}{\partial w_4} & \frac{\partial f}{\partial w_5} \end{pmatrix} \quad\text{and}\quad \frac{\partial f}{\partial \mathcal{W}_1} = \begin{pmatrix} \frac{\partial f}{\partial w_6} & \frac{\partial f}{\partial w_7} \\ \frac{\partial f}{\partial w_8} & \frac{\partial f}{\partial w_9} \end{pmatrix}. $$

We use our formulas to confirm that they match these results. We need the Kronecker product, which in PyTorch is implemented as torch.kron.
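As a quick illustration of what torch.kron computes (this small example is not part of the original notebook), the Kronecker product of two row vectors multiplies every entry of the first by the entire second vector:

print(torch.kron(torch.tensor([[1., 2.]]), torch.tensor([[3., 4., 5.]])))
# tensor([[ 3.,  4.,  5.,  6.,  8., 10.]])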
with torch.no_grad():
    grad_W0 = torch.kron(((z2 - y).unsqueeze(0)) @ W1, (z0.unsqueeze(0)))
    grad_W1 = torch.kron(((z2 - y).unsqueeze(0)), (z1.unsqueeze(0)))
print(grad_W0)
tensor([[ 1., 0., -1., 0., 0., -0.]])
print(grad_W1)
tensor([[-1., -1., 0., 0.]])
Observe that this time the results are written in vectorized form (i.e., obtained by concatenating the rows), but they do match the AD output.
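One way to see this concretely (a small check that is not in the original code) is to reshape the vectorized gradients back into the shapes of $\mathcal{W}_0$ and $\mathcal{W}_1$ and compare with the AD output.

print(torch.allclose(grad_W0.reshape(2, 3), W0.grad))  # True
print(torch.allclose(grad_W1.reshape(2, 2), W1.grad))  # True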
Building blocks of AI 2: stochastic gradient descent
For the mini-batch version of SGD, we pick a random sub-sample $\mathcal{B}_t \subseteq \{1,\ldots,n\}$ of size $B$ and take the step

$$ \mathbf{x}^{t+1} = \mathbf{x}^{t} + \beta \frac{1}{B} \sum_{i\in \mathcal{B}_t} ( b_i - \sigma(\boldsymbol{\alpha}_i^T \mathbf{x}^t) ) \,\boldsymbol{\alpha}_i. $$

We modify our previous code for logistic regression. The only change is to pick a random mini-batch, which is fed to the descent update sub-routine as the dataset.
import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z))

def pred_fn(x, A):
    return sigmoid(A @ x)

def loss_fn(x, A, b):
    return np.mean(-b*np.log(pred_fn(x, A)) - (1 - b)*np.log(1 - pred_fn(x, A)))

def grad_fn(x, A, b):
    return -A.T @ (b - pred_fn(x, A))/len(b)

def desc_update_for_logreg(grad_fn, A, b, curr_x, beta):
    gradient = grad_fn(curr_x, A, b)
    return curr_x - beta*gradient

def sgd_for_logreg(rng, loss_fn, grad_fn, A, b,
                   init_x, beta=1e-3, niters=int(1e5), batch=40):
    curr_x = init_x
    nsamples = len(b)
    for _ in range(niters):
        I = rng.integers(nsamples, size=batch)
        curr_x = desc_update_for_logreg(
            grad_fn, A[I,:], b[I], curr_x, beta)
    return curr_x
NUMERICAL CORNER: We analyze a dataset from [ESL], which can be downloaded here. Quoting [ESL, Section 4.4.2]:
The data [...] are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa (Rousseauw et al., 1983). The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. These data are described in more detail in Hastie and Tibshirani (1987).
We load the data, which we slightly reformatted, and look at a summary.
import pandas as pd

data = pd.read_csv('SAHeart.csv')
data.head()
|   | sbp   | tobacco | ldl  | adiposity | typea | obesity | alcohol | age  | chd |
|---|-------|---------|------|-----------|-------|---------|---------|------|-----|
| 0 | 160.0 | 12.00   | 5.73 | 23.11     | 49.0  | 25.30   | 97.20   | 52.0 | 1.0 |
| 1 | 144.0 | 0.01    | 4.41 | 28.61     | 55.0  | 28.87   | 2.06    | 63.0 | 1.0 |
| 2 | 118.0 | 0.08    | 3.48 | 32.28     | 52.0  | 29.14   | 3.81    | 46.0 | 0.0 |
| 3 | 170.0 | 7.50    | 6.41 | 38.03     | 51.0  | 31.99   | 24.26   | 58.0 | 1.0 |
| 4 | 134.0 | 13.60   | 3.50 | 27.78     | 60.0  | 25.99   | 57.34   | 49.0 | 1.0 |
Our goal is to predict chd, which stands for coronary heart disease, based on the other variables (which are briefly described here). We use logistic regression again.
We first construct the data matrices. We only use three of the predictors.
feature = data[['tobacco', 'ldl', 'age']].to_numpy()
print(feature)
[[1.200e+01 5.730e+00 5.200e+01]
 [1.000e-02 4.410e+00 6.300e+01]
 [8.000e-02 3.480e+00 4.600e+01]
 ...
 [3.000e+00 1.590e+00 5.500e+01]
 [5.400e+00 1.161e+01 4.000e+01]
 [0.000e+00 4.820e+00 4.600e+01]]
label = data['chd'].to_numpy()
A = np.concatenate((np.ones((len(label),1)),feature),axis=1)
b = label
We try mini-batch SGD.
seed = 535
rng = np.random.default_rng(seed)
init_x = np.zeros(A.shape[1])
best_x = sgd_for_logreg(rng, loss_fn, grad_fn, A, b, init_x, beta=1e-3, niters=int(1e6))
print(best_x)
[-4.06558071 0.07990955 0.18813635 0.04693118]
The outcome is harder to visualize. To get a sense of how accurate the result is, we compare our predictions to the true labels. By prediction, we mean that we predict label $1$ whenever $\sigma(\boldsymbol{\alpha}^T \mathbf{x}) > 1/2$. We try this on the training set. (A better approach would be to split the data into training and testing sets, but we will not do this here.)
def logis_acc(x, A, b):
    return np.sum((pred_fn(x, A) > 0.5) == b)/len(b)
logis_acc(best_x, A, b)
0.7207792207792207
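For comparison (this check is not part of the original analysis), a trivial classifier that always predicts label $0$ would achieve an accuracy equal to the fraction of controls, roughly $302/462 \approx 0.65$, so the fitted model does improve on that baseline.

print(np.mean(b == 0))  # accuracy of always predicting label 0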
We now return to the Fashion-MNIST classification problem and set things up in PyTorch, starting with a check for the availability of GPUs.

device = torch.device("cuda" if torch.cuda.is_available()
                      else ("mps" if torch.backends.mps.is_available()
                            else "cpu"))
print("Using device:", device)
Using device: mps
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.optim as optim
seed = 42
torch.manual_seed(seed)
if device.type == 'cuda': # device-specific seeding and settings
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) # for multi-GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
elif device.type == 'mps':
    torch.mps.manual_seed(seed) # MPS-specific seeding
g = torch.Generator()
g.manual_seed(seed)
train_dataset = datasets.FashionMNIST(root='./data', train=True,
                                      download=True, transform=transforms.ToTensor())
test_dataset = datasets.FashionMNIST(root='./data', train=False,
                                     download=True, transform=transforms.ToTensor())
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, generator=g)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
We use torch.utils.data.DataLoader, which provides utilities to load the data in batches for training. We take mini-batches of size BATCH_SIZE = 32 and apply a random permutation of the samples on every pass over the training data (with the option shuffle=True). The function torch.manual_seed() is used to set the global seed for PyTorch operations (e.g., weight initialization). The shuffling in DataLoader uses its own separate random number generator, which we initialize with torch.Generator() and manual_seed(). (You can tell from the fact that seed=42 that Claude explained that one to me...)
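To see what the loader produces (a quick check, not part of the original code), we can grab one mini-batch and look at its shape: a batch consists of 32 grayscale images of size 28 by 28 and the corresponding 32 labels.

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([32, 1, 28, 28])
print(labels.shape)  # torch.Size([32])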
We implement multinomial logistic regression to learn a classifier for the Fashion-MNIST data. In PyTorch, composition of functions can be achieved with torch.nn.Sequential. Our model is:
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 10)
).to(device)
The torch.nn.Flatten layer turns each input image into a vector of size $784$ (where $784 = 28^2$ is the number of pixels in each image). After the flattening, we have an affine map from $\mathbb{R}^{784}$ to $\mathbb{R}^{10}$. Note that there is no need to pre-process the inputs by adding $1$s. A constant term (or "bias variable") is automatically added by PyTorch (unless one chooses the option bias=False). The final output is $10$-dimensional.
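We can inspect the resulting parameters (a quick check, not in the original notebook): the affine map is stored as a weight matrix of shape (10, 784) together with a bias vector of length 10.

for name, param in model.named_parameters():
    print(name, param.shape)
# 1.weight torch.Size([10, 784])
# 1.bias torch.Size([10])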
Finally, we are ready to run an optimization method of our choice on the loss function; both are specified next. There are many optimizers available. (See this post for a brief explanation of many common optimizers.) Here we use SGD as the optimizer. A quick tutorial is here. The loss function is the cross-entropy, as implemented by torch.nn.CrossEntropyLoss, which first takes the softmax and expects the labels to be class indices rather than their one-hot encoding.
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
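To make concrete the point that the softmax is included in the loss, here is a small check (not part of the original notebook) comparing torch.nn.CrossEntropyLoss applied to raw scores with the negative log of the softmax probability of the true class; the two values agree.

import torch.nn.functional as F

scores = torch.tensor([[2.0, -1.0, 0.5]])  # raw model outputs ("logits") for one sample
target = torch.tensor([0])                 # true class, given as an index rather than one-hot
print(loss_fn(scores, target))                     # cross-entropy computed by PyTorch
print(-torch.log(F.softmax(scores, dim=1)[0, 0]))  # same value, computed by hand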
We implement special functions for training.
def train(dataloader, model, loss_fn, optimizer, device):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        pred = model(X)            # output of the current model
        loss = loss_fn(pred, y)    # loss on the mini-batch
        optimizer.zero_grad()      # reset the gradients (they add up by default)
        loss.backward()            # compute gradients by backpropagation
        optimizer.step()           # descent update

def training_loop(train_loader, model, loss_fn, optimizer, device, epochs=3):
    for epoch in range(epochs):
        train(train_loader, model, loss_fn, optimizer, device)
        print(f"Epoch {epoch+1}/{epochs}")
An epoch is one pass of training in which all samples are visited once (in a randomly shuffled order). In the interest of time, we train for 10 epochs only. But it does better if you train it longer (try it!). On each pass, we compute the output of the current model, use backward() to obtain the gradient, and then perform a descent update with step(). We also have to reset the gradients first (otherwise they add up by default, as the small example below illustrates).
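The fact that gradients accumulate until they are reset can be seen on a tiny example (not part of the original notebook): calling backward() twice without zeroing the gradient doubles it.

w = torch.tensor(1.0, requires_grad=True)
(w ** 2).backward()
print(w.grad)  # tensor(2.)
(w ** 2).backward()
print(w.grad)  # tensor(4.): the second gradient was added to the first, not substituted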
training_loop(train_loader, model, loss_fn, optimizer, device, epochs=10)
Epoch 1/10 Epoch 2/10 Epoch 3/10 Epoch 4/10 Epoch 5/10 Epoch 6/10 Epoch 7/10 Epoch 8/10 Epoch 9/10 Epoch 10/10
Because of the issue of overfitting, we use the test images to assess the performance of the final classifier.
def test(dataloader, model, loss_fn, device):
    size = len(dataloader.dataset)
    correct = 0
    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            correct += (pred.argmax(dim=1) == y).type(torch.float).sum().item()
    print(f"Test accuracy: {(100*(correct / size)):>0.1f}%")
test(test_loader, model, loss_fn, device)
Test accuracy: 78.7%
To make a prediction, we take a torch.nn.functional.softmax of the output of our model. Recall that it is implicitly included in torch.nn.CrossEntropyLoss, but is not actually part of model. (Note that the softmax itself has no parameter.)

As an illustration, we do this for each test image. We use torch.cat to concatenate a sequence of tensors into a single tensor.
import torch.nn.functional as F

def predict_softmax(dataloader, model, device):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    predictions = []
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            probabilities = F.softmax(pred, dim=1)
            predictions.append(probabilities.cpu())
    return torch.cat(predictions, dim=0)
predictions = predict_softmax(test_loader, model, device).numpy()
The result for the first test image is shown below. To make a prediction, we choose the label with the highest probability.
print(predictions[0])
[4.4307165e-04 3.8354204e-04 2.0886613e-03 8.8066678e-04 3.6079765e-03 1.7791630e-01 1.4651606e-03 2.2466542e-01 4.8245404e-02 5.4030383e-01]
predictions[0].argmax(0)
9
The truth is:
images, labels = next(iter(test_loader))
images = images.squeeze().numpy()
labels = labels.numpy()
print(f"{labels[0]}: '{mmids.FashionMNIST_get_class_name(labels[0])}'")
9: 'Ankle boot'
Above, next(iter(test_loader)) loads the first batch of test images. (See here for background on iterators in Python.)
Building blocks of AI 3: neural networks
NUMERICAL CORNER: We return to the concrete example from the previous section. We re-write the gradient as
\begin{align*} \nabla f(\mathbf{w})^T &= \begin{pmatrix} [(\mathbf{z}_2 - \mathbf{y})^T \mathcal{W}_{1} \mathrm{diag}(\mathbf{z}_1 \odot (\mathbf{1} - \mathbf{z}_1))] \otimes \mathbf{z}_0^T & (\mathbf{z}_2 - \mathbf{y})^T \otimes \mathbf{z}_1^T \end{pmatrix}. \end{align*}

We will use torch.nn.functional.sigmoid and torch.nn.functional.softmax for the sigmoid and softmax functions, respectively. We also use torch.dot for the inner product (i.e., dot product) of two vectors (as tensors) and torch.diag for the creation of a diagonal matrix with specified entries on its diagonal.
import torch.nn.functional as F
x = torch.tensor([1.,0.,-1.])
y = torch.tensor([0.,1.])
W0 = torch.tensor([[0.,1.,-1.],[2.,0.,1.]], requires_grad=True)
W1 = torch.tensor([[-1.,0.],[2.,-1.]], requires_grad=True)
z0 = x
z1 = F.sigmoid(W0 @ z0)
z2 = F.softmax(W1 @ z1, dim=0)
f = -torch.dot(torch.log(z2), y)
print(z0)
tensor([ 1., 0., -1.])
print(z1)
tensor([0.7311, 0.7311], grad_fn=<SigmoidBackward0>)
print(z2)
tensor([0.1881, 0.8119], grad_fn=<SoftmaxBackward0>)
print(f)
tensor(0.2084, grad_fn=<NegBackward0>)
We compute the gradient $\nabla f(\mathbf{w})$ using AD.
f.backward()
print(W0.grad)
tensor([[-0.1110, -0.0000,  0.1110],
        [ 0.0370,  0.0000, -0.0370]])
print(W1.grad)
tensor([[ 0.1375,  0.1375],
        [-0.1375, -0.1375]])
We use our formulas to confirm that they match these results.
with torch.no_grad():
    grad_W0 = torch.kron(((z2 - y).unsqueeze(0)) @ W1 @ torch.diag(z1 * (1-z1)), (z0.unsqueeze(0)))
    grad_W1 = torch.kron(((z2 - y).unsqueeze(0)), (z1.unsqueeze(0)))
print(grad_W0)
tensor([[-0.1110, -0.0000, 0.1110, 0.0370, 0.0000, -0.0370]])
print(grad_W1)
tensor([[ 0.1375, 0.1375, -0.1375, -0.1375]])
The results match the AD output.
NUMERICAL CORNER: We implement the training of a neural network in PyTorch. We use the Fashion-MNIST dataset again, loading it as before, and we check for the availability of GPUs.
device = torch.device('cuda' if torch.cuda.is_available()
                      else ('mps' if torch.backends.mps.is_available()
                            else 'cpu'))
print('Using device:', device)
Using device: mps
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.optim as optim
seed = 42
torch.manual_seed(seed)
if device.type == 'cuda': # device-specific seeding and settings
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed) # for multi-GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
elif device.type == 'mps':
    torch.mps.manual_seed(seed) # MPS-specific seeding
g = torch.Generator()
g.manual_seed(seed)
train_dataset = datasets.FashionMNIST(root='./data', train=True,
                                      download=True, transform=transforms.ToTensor())
test_dataset = datasets.FashionMNIST(root='./data', train=False,
                                     download=True, transform=transforms.ToTensor())
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, generator=g)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
We construct a two-layer model.
model = nn.Sequential(
    nn.Flatten(),              # flatten the input
    nn.Linear(28 * 28, 32),    # first linear layer with 32 nodes
    nn.Sigmoid(),              # sigmoid activation function
    nn.Linear(32, 10)          # second linear layer with 10 nodes (output layer)
).to(device)
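As a quick check on the size of this model (this computation is not in the original notebook), the first linear layer has $784 \times 32$ weights plus $32$ biases and the second has $32 \times 10$ weights plus $10$ biases, for $25{,}450$ parameters in total.

total = sum(param.numel() for param in model.parameters())
print(total)  # 25450 = (784*32 + 32) + (32*10 + 10)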
As we did for multinomial logistic regression, we use SGD as the optimizer and the cross-entropy loss (which in PyTorch includes the softmax function and expects labels to be class indices rather than one-hot encodings).
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)
We train for 10 epochs.
mmids.training_loop(train_loader, model, loss_fn, optimizer, device, epochs=10)
Epoch 1/10 Epoch 2/10 Epoch 3/10 Epoch 4/10 Epoch 5/10 Epoch 6/10 Epoch 7/10 Epoch 8/10 Epoch 9/10 Epoch 10/10
On the test data, we get:
mmids.test(test_loader, model, loss_fn, device)
Test accuracy: 64.0%
Disappointingly, this model is significantly less accurate than the one we obtained using multinomial logistic regression. It turns out that using a different optimizer gives much better results.
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
mmids.training_loop(train_loader, model, loss_fn, optimizer, device)
Epoch 1/3 Epoch 2/3 Epoch 3/3
mmids.test(test_loader, model, loss_fn, device)
Test accuracy: 85.0%