class: center, middle, inverse, title-slide

.title[
# Lecture .mono[011]
]
.subtitle[
## Neural networks
]
.author[
### Edward Rubin
]

---
exclude: true

---
name: admin
# Admin

## Today

.b[Topics:] We will start an introduction to neural networks:
- anatomy of a single neuron,
- logistic regression as a one-neuron network,
- activation functions and hidden layers,
- a brief overview of training,
- the return of MNIST.

.b[Reading and resources]
- .note[Mostly:] .it[Neural Networks and Deep Learning], Ch. [1](http://neuralnetworksanddeeplearning.com/chap1.html) (some [2](http://neuralnetworksanddeeplearning.com/chap2.html) + [3](http://neuralnetworksanddeeplearning.com/chap3.html)),
- .note[Also helpful:] .it[ISL] Ch. 11,
- .note[Fun, interactive tool:] TensorFlow's [Neural Network Playground](https://playground.tensorflow.org/).

---
layout: true
# Neural networks

---
class: inverse, middle
name: intro

## (Neural) Intuition

---
layout: false
class: clear, middle

<img src="slides_files/figure-html/logistic-neuron-1.svg" style="display: block; margin: auto;" />

.b[Anatomy] .attn[Neurons] are the building blocks of neural networks. They
- take .b.slate[in] a weighted sum of inputs, `\(w'x = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p\)`,
- apply a .b.slate[bias], `\(b\)`, that shifts the sum, `\(z = b + w'x\)`,
- pass `\(z\)` through an .b.pink[activation function] to produce .b.orange[output], `\(a = g(z)\)`.

---
layout: true
# Neural networks

---
## .slab.it.grey-light[History]

It all started with the perceptron.

The .attn[perceptron] uses a .b[hard-threshold] activation:

$$
a = `\begin{cases} 1 & \text{if } \color{#314f4f}{w'x} \ge \color{#e64173}{T} \iff \color{#314f4f}{w'x} + \color{#e64173}{b} \ge 0 \\ 0 & \text{if } \color{#314f4f}{w'x} < \color{#e64173}{T} \iff \color{#314f4f}{w'x} + \color{#e64173}{b} < 0 \end{cases}`
$$

The intuition literally came from neurons firing.
When the .slate.it[weighted sum of inputs] (`\(\color{#314f4f}{w'x}\)`) exceeds a .pink[threshold] (`\(\color{#e64173}{T}\)`), .b.pink[fire!]

.note[Intuition:] Compare .slate[weighted *evidence*] against the .pink[*burden of proof*].

--

We translate the .pink[threshold] into a .pink[bias] (`\(b = -T\)`) to get a more "standard" form.

---
## What do the weights and bias do?

$$
z = b + \sum_{j = 1}^p w_j x_j
$$

The weighted sum does two jobs:
- the .attn[weights] say which inputs matter and in which direction,
- the .attn[bias] shifts the cutoff left or right.

--

Even with its simple structure, the neuron is capable of learning
- to set the weights and bias to find a good decision boundary,
- to output `\(1\)` when `\(z \ge 0\)` and `\(0\)` when `\(z < 0\)`,
- to separate our two classes.

---
## *Too* binary?

Binary activation may be .pink[intuitive], but it's .purple[not great for learning].

.note[Problem] We need to .purple[*learn*] the weights and bias `\((w,b)\)`.
- When .b.purple[training], we want to .purple[adjust] `\(w\)` and `\(b\)` to reduce errors.
- If we make .purple[small changes] to `\(w\)` or `\(b\)`, the binary activation may change *a lot*.

`\(\implies\)` training with binary activations can be .purple[unstable] and .purple[inefficient].

---
## *Too* binary?

.note[Solution] Keep what we like about the perceptron
- the .b.slate[weighted sum of inputs shifted by a bias],
- .b.pink[restricting the output] to a certain range (∈[0, 1] for binary classification),

but replace the hard step with a .purple[smooth, differentiable activation].

.qa[Q] Any ideas? .grey-vlight[You already know one!]

--

The .attn[sigmoid neuron] keeps the same weighted-sum idea and replaces the hard step with a differentiable activation—the .attn[sigmoid function].

---
## Logistic regression or a tiny neural network?

<!-- `\(\implies\)` You already know how to build a (very simple) .b[neural network]!
-->

.attn[Logistic regression] For a binary outcome `\(Y\)`

`$$\Pr(Y = 1 \mid X = x) = \sigma(\beta_0 + x'\beta) = \sigma(w'x + b)$$`

where `\(\sigma(z) = \dfrac{1}{1 + e^{-z}}\)`.

.note[Also] .attn[Sigmoid neuron] with
- a .b.slate[weighted sum of inputs] and a .b.slate[bias],
- a .b.pink[sigmoid] activation function,
- one output.

--

Neural networks are not a sharp break from (logistic) regression. <br> They are a .it.slab[generalization].

---
layout: false
class: clear, middle, center

.b.slate[Perceptron] .note[vs.] .b.pink[sigmoid neuron]

<img src="slides_files/figure-html/step-v-sigmoid-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: clear, middle, center

.b[Common activation functions.] .grey-vlight[(Match activation to the problem.)]

<img src="slides_files/figure-html/activation-functions-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Neural networks

---
## Getting flexible

Adjusting weights and bias of a .b.slate[single neuron]—regardless of the activation function—is only going to get us so far. .grey-vlight[(Flavors of regression.)]

<img src="slides_files/figure-html/single-neuron-plot-1.svg" style="display: block; margin: auto;" />

.qa[Q:] What else can we do to make our model .it.pink[more flexible]? <br>.tran.slab[Q:] .grey-vlight[(Especially for complex, highly nonlinear problems?)]

---
layout: false
class: clear, middle

Adding a .attn[hidden layer] of neurons between the .slate[input] and .orange[output] layers

<img src="slides_files/figure-html/plot-intro-hidden-1.svg" style="display: block; margin: auto;" />

increases the model's ability to learn .pink[complex relationships] from .slate[x]→.orange[y].
- each .pink[hidden neuron] learns a different transformation of the input. <br> `\(\color{#e64173}{a_k} = a(w_k'x + b_k)\)`
- an .orange[output neuron] flexibly combines those transformations.
<br> `\(\color{#FFA500}{\hat{y}} = a_y(w_y'\color{#e64173}{a} + b_y)\)`

---
layout: false
class: clear, middle

Let's think about building a .b.purple[feed-forward network for MNIST].

.note[Note:] .attn[Feed-forward networks] are simply neural networks where information flows in one direction—from .it.slate[input] to .it.orange[output]. .grey-vlight[(No cycles or feedback loops.)]

---
class: clear, middle, center

.b[A feedforward network] with one .pink[hidden layer].

<img src="slides_files/figure-html/network-plot1-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle

There's really no reason to stop at .it[one] .pink[hidden layer]!

---
class: clear, middle, center

.b[A feedforward network] with two .pink[hidden layers]!

<img src="slides_files/figure-html/network-plot2-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle, center

.b[A feedforward network] with three .pink[hidden layers]! .grey-vlight[(Getting deep!)]

<img src="slides_files/figure-html/network-plot3-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Neural networks

---
## Neuron stacking in hidden layers

.note[Let's be clear:] Hidden layers are not about .it[more coefficients].

The .it[power of .pink[hidden layers]] comes from their ability to<br>.it[learn increasingly complex features] built from simpler ones.
- early neurons can respond to simple patterns;
- later neurons can combine those patterns into still more complex ones;
- each layer builds complexity on top of the previous layers.

Adding .attn[depth] (hidden layers) allows NNs to learn more levels of .note[abstraction]. <br> .grey-vlight[(It also adds more parameters to learn.)]

---
layout: false
class: clear

<img src="slides_files/figure-html/simple-hidden1-1.svg" style="display: block; margin: auto;" />

To see how .note[depth] and .note[activations] affect network flexibility and output, <br> let's write out the math for this simple neural network.
- .slate[2 inputs],
- .pink[2 hidden neurons], and
- .orange[1 output].

---
class: clear

<img src="slides_files/figure-html/simple-hidden2-1.svg" style="display: block; margin: auto;" />

In the .pink[hidden layer], neuron `\(\color{#e64173}{a_j}\)` learns a transformation `\(g\)` of the .it.slate[input]:

`$$\color{#e64173}{a_j} = g\big(w^1_{j,1} \color{#314f4f}{x_1} + w^1_{j,2} \color{#314f4f}{x_2} + b^1_j\big)$$`

--

The .orange[output] `\(\color{#FFA500}{\hat{y}}\)` is a transformation `\(h\)` of the .it.pink[hidden-layer output]:

`$$\color{#FFA500}{\hat{y}} = h\big(w^2_{1,1} \color{#e64173}{a_1} + w^2_{1,2} \color{#e64173}{a_2} + b^2_1\big)$$`

--

Combining the preceding two steps...

`\begin{align*} \color{#FFA500}{\hat{y}} & = h\Big(w^2_{1,1} g\big(w^1_{1,1} \color{#314f4f}{x_1} + w^1_{1,2} \color{#314f4f}{x_2} + b^1_1\big) + w^2_{1,2} g\big(w^1_{2,1} \color{#314f4f}{x_1} + w^1_{2,2} \color{#314f4f}{x_2} + b^1_2\big) + b^2_1\Big) \end{align*}`

---
class: clear

`\begin{align*} \color{#FFA500}{\hat{y}} & = h\Big(w^2_{1,1} g\big(w^1_{1,1} \color{#314f4f}{x_1} + w^1_{1,2} \color{#314f4f}{x_2} + b^1_1\big) + w^2_{1,2} g\big(w^1_{2,1} \color{#314f4f}{x_1} + w^1_{2,2} \color{#314f4f}{x_2} + b^1_2\big) + b^2_1\Big) \end{align*}`

--

.note[Case 1:] Suppose `\(g\)` and `\(h\)` are both the .attn[identity function] `\(f(z) = z\)`.
--

`\(\color{#FFA500}{\hat{y}} = \Big(w^2_{1,1} \big(w^1_{1,1} \color{#314f4f}{x_1} + w^1_{1,2} \color{#314f4f}{x_2} + b^1_1\big) + w^2_{1,2} \big(w^1_{2,1} \color{#314f4f}{x_1} + w^1_{2,2} \color{#314f4f}{x_2} + b^1_2\big) + b^2_1\Big)\)`

--

`\(\color{#FFFFFF}{\hat{y}} = w^2_{1,1} w^1_{1,1} \color{#314f4f}{x_1} + w^2_{1,1} w^1_{1,2} \color{#314f4f}{x_2} + w^2_{1,1} b^1_1 + w^2_{1,2} w^1_{2,1} \color{#314f4f}{x_1} + w^2_{1,2} w^1_{2,2} \color{#314f4f}{x_2} + w^2_{1,2} b^1_2 + b^2_1\)`

--

`\(\color{#FFFFFF}{\hat{y}} = \big(w^2_{1,1} w^1_{1,1} + w^2_{1,2} w^1_{2,1}\big) \color{#314f4f}{x_1} + \big(w^2_{1,1} w^1_{1,2} + w^2_{1,2} w^1_{2,2}\big) \color{#314f4f}{x_2} + \big(w^2_{1,1} b^1_1 + w^2_{1,2} b^1_2 + b^2_1\big)\)`

--

`\(\color{#FFFFFF}{\hat{y}} = \beta_{0} + \beta_{1} \color{#314f4f}{x_1} + \beta_{2} \color{#314f4f}{x_2}\)`

--

A lot of work to get back to a .b[linear model]!

--

.note[Case 2:] Suppose `\(g\)` and `\(h\)` are .attn[linear]/.attn[affine] functions.

--

.grey-vlight[(insert math)] ... Still a .b[linear model]!

---
layout: true
# Neural networks

---
## Hidden layers need nonlinearity

So hidden layers only "work" when we add .b.pink[nonlinear activations].

`\begin{align*} \color{#FFA500}{\hat{y}} & = h\Big(w^2_{1,1} g\big(w^1_{1,1} \color{#314f4f}{x_1} + w^1_{1,2} \color{#314f4f}{x_2} + b^1_1\big) + w^2_{1,2} g\big(w^1_{2,1} \color{#314f4f}{x_1} + w^1_{2,2} \color{#314f4f}{x_2} + b^1_2\big) + b^2_1\Big) \end{align*}`

If `\(g\)` and `\(h\)` are nonlinear, then `\(\color{#FFA500}{\hat{y}}\)` is
- a .orange[nonlinear combination] of the .pink[hidden-layer activations], which are
- .slate[nonlinear combinations of the inputs].

As you add more hidden layers, you *layer* more nonlinear transformations.

.note[Note] The .orange[output] activation function does not have to be nonlinear for the hidden layers to "work".
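The Case 1 algebra is easy to check numerically. A minimal sketch in R (all weights, biases, and inputs are made up for illustration): with identity activations, the 2-2-1 network reproduces the collapsed linear model exactly.

``` r
# Made-up parameters for the 2-2-1 network. (Illustration only.)
W1 = matrix(c(0.5, -1.2, 0.8, 0.3), nrow = 2)  # rows = hidden neurons
b1 = c(0.1, -0.4)
W2 = c(1.5, -0.7)
b2 = 0.2
x  = c(2, -1)

# Forward pass with identity activations: g(z) = z and h(z) = z.
a     = as.vector(W1 %*% x) + b1   # hidden layer
y_net = sum(W2 * a) + b2           # output

# The collapsed linear model: y = beta0 + beta1 x1 + beta2 x2.
beta  = as.vector(W2 %*% W1)       # slope coefficients
beta0 = sum(W2 * b1) + b2          # intercept
y_lin = beta0 + sum(beta * x)

all.equal(y_net, y_lin)  # TRUE: still just a linear model
```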
.grey-vlight[(No requirement that *g* and *h* match either.)]

---
layout: false
class: clear, middle

.attn[Universal approximation theorem] A feedforward network with one hidden layer can approximate *any* continuous function on a compact domain, given enough neurons in the hidden layer. .grey-vlight[(It doesn't say how many neurons we need.)]

---
layout: true
# Neural networks

---
## Welcome to the playground

TensorFlow's [*Playground*](https://playground.tensorflow.org/) is a fun way to see NNs in action.

Play with the
- activation function,
- number of hidden layers and neurons,
- learning rate,
- amount of regularization.

Watch the .b[boundary] change and the .b[representation] of hidden layers.

---
class: inverse, middle
name: training

## How the network learns

---
## The loss function still runs the show

.note[As always:] We choose weights and biases to minimize a .attn[loss function].
- .note[regression]: MSE, MAE, Huber .grey-vlight.it[(linear output)]
- .note[binary]: log loss (binary cross-entropy). .grey-vlight.it[(sigmoid output)]
`$$C = - \big[y \log(\hat{p}) + (1 - y) \log(1 - \hat{p})\big]$$`
- .note[multiclass]: multiclass cross-entropy .grey-vlight.it[(softmax output)]
`$$C = -\sum_{k = 1}^K y_k \log \hat p_k$$`

--

.note[As before:] We can .attn[penalize] (.attn[regularize]) the weights to prevent overfitting.

---
## Optimization

We don't have time to cover the details of optimization; here's the *gist*:
- .attn[gradient descent] moves parameters "downhill" fastest by following the negative gradient of the loss function
`$$\theta \leftarrow \theta - \eta \nabla_\theta C(\theta)$$`
- .attn[gradient] how each weight/bias changes the loss
- .attn[backpropagation] efficient chain rule to compute those gradients
- .attn[mini-batch] one small chunk of data used for one update
- .attn[epoch] one full pass through all training observations

It's a lot. In fact, this *gist* barely scratches the surface.
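---
## Optimization: a tiny example

To make the update rule concrete, here is a minimal sketch in R (data and learning rate are made up) of one full-batch gradient-descent step for logistic regression, i.e., a single sigmoid neuron with log loss.

``` r
sigmoid = function(z) 1 / (1 + exp(-z))

set.seed(101)
x = matrix(rnorm(200), ncol = 2)                     # 100 made-up observations, 2 inputs
y = as.numeric(x[, 1] - x[, 2] + rnorm(100) > 0)     # made-up binary outcome

w = c(0, 0); b = 0; eta = 0.1                        # start at zero; learning rate eta

p = sigmoid(as.vector(x %*% w) + b)                  # forward pass: predicted probabilities
grad_w = as.vector(crossprod(x, p - y)) / length(y)  # dC/dw for log loss
grad_b = mean(p - y)                                 # dC/db
w = w - eta * grad_w                                 # theta <- theta - eta * gradient
b = b - eta * grad_b
```

One step nudges `\(w\)` and `\(b\)` downhill; training repeats this many times, typically on mini-batches.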
The key is efficiently computing the gradients—backpropagation.

---
class: inverse, middle
name: mnist

## MNIST application

---
## Why MNIST again?

MNIST is a natural neural-net example
- high-dimensional inputs (784 predictors for each image),
- 10 output classes,
- nonlinear relationships between pixels and classes,
- lots of variation in handwriting.

Let's see how well a neural network can learn from the raw pixel data.

---
layout: false
class: clear, middle

The raw inputs are still images.

<img src="slides_files/figure-html/mnist-examples-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Neural networks

---
## One image, many outputs

MNIST is a .attn[multi-output classification] problem. For each image, the output layer should produce:

$$
\big( \hat p_0,\, \hat p_1,\, \ldots,\, \hat p_9 \big)
$$

If we remove the hidden layer and keep only the softmax output,

`$$\sigma(z_i) = \dfrac{e^{z_i}}{\sum_{k = 0}^{9} e^{z_k}}$$`

we are back at .attn[multinomial logistic regression].

---
exclude: true
## Fit the models

``` r
# Fit each workflow on the training split.
fit_logit_valid = workflow() |>
  add_recipe(rec_base) |>
  add_model(logit_spec) |>
  fit(mnist_train)
fit_mlp_start_valid = workflow() |>
  add_recipe(rec_base) |>
  add_model(mlp_start_spec) |>
  fit(mnist_train)
fit_mlp_tuned_valid = workflow() |>
  add_recipe(rec_tuned) |>
  add_model(mlp_tuned_spec) |>
  fit(mnist_train)
fit_cnn_valid = fit_simple_cnn(mnist_train, epochs = 6)
# Collect validation predictions for model comparison.
pred_logit_valid = bind_cols(
  mnist_valid |> select(label),
  predict(fit_logit_valid, mnist_valid)
)
pred_mlp_start_valid = bind_cols(
  mnist_valid |> select(label),
  predict(fit_mlp_start_valid, mnist_valid)
)
pred_mlp_tuned_valid = bind_cols(
  mnist_valid |> select(label),
  predict(fit_mlp_tuned_valid, mnist_valid)
)
pred_cnn_valid = bind_cols(
  mnist_valid |> select(label),
  predict_simple_cnn(fit_cnn_valid, mnist_valid)
)
validation_compare = tibble(
  model = c('Multinomial logit', 'Starter MLP', 'Tuned MLP', 'Simple CNN'),
  accuracy = c(
    pred_logit_valid |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_mlp_start_valid |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_mlp_tuned_valid |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_cnn_valid |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate)
  )
)
# Refit the strongest models on train + validation.
fit_logit_test = workflow() |>
  add_recipe(rec_base_tv) |>
  add_model(logit_spec) |>
  fit(mnist_train_valid)
fit_mlp_tuned_test = workflow() |>
  add_recipe(rec_tuned_tv) |>
  add_model(mlp_tuned_spec) |>
  fit(mnist_train_valid)
fit_cnn_test = fit_simple_cnn(mnist_train_valid, epochs = 6)
# Evaluate on the held-out test split.
pred_logit_test = bind_cols(
  mnist_test |> select(label),
  predict(fit_logit_test, mnist_test)
)
pred_mlp_tuned_test = bind_cols(
  mnist_test |> select(label),
  predict(fit_mlp_tuned_test, mnist_test),
  predict(fit_mlp_tuned_test, mnist_test, type = 'prob')
)
pred_cnn_test = bind_cols(
  mnist_test |> select(label),
  predict_simple_cnn(fit_cnn_test, mnist_test)
)
test_compare = tibble(
  model = c('Multinomial logit', 'Tuned MLP', 'Simple CNN'),
  accuracy = c(
    pred_logit_test |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_mlp_tuned_test |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_cnn_test |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate)
  )
)
# Build a confusion matrix and one probability-vector example for the CNN.
cm_cnn_test = pred_cnn_test |>
  count(label, .pred_class)
example_prob_row = pred_cnn_test |>
  mutate(case_id = row_number()) |>
  rowwise() |>
  mutate(max_prob = max(c_across(matches('^\\.pred_[0-9]+$')))) |>
  ungroup() |>
  filter(label == .pred_class) |>
  slice_max(order_by = max_prob, n = 1)
example_case = mnist_test |>
  slice(example_prob_row$case_id)
example_prob_plot_df = example_prob_row |>
  select(matches('^\\.pred_[0-9]+$')) |>
  pivot_longer(
    everything(),
    names_to = 'digit',
    values_to = 'probability'
  ) |>
  mutate(digit = gsub('.pred_', '', digit, fixed = TRUE))
```

---
exclude: true
## Validation results

The first pass looks like this:
- multinomial logit: .attn[87.5%]
- starter MLP: .attn[87.7%]
- tuned MLP: .attn[90.6%]
- simple CNN: .attn[95.2%]

So the lesson is not .it[any hidden layer wins]. The lesson is that .attn[capacity, preprocessing, and architecture all matter].
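---
## Softmax, concretely

The softmax output layer is easy to sketch in R. A minimal example (the scores in `z` are made up for illustration):

``` r
# Softmax over made-up output-layer scores for digits 0-9.
softmax = function(z) {
  z = z - max(z)        # subtract the max for numerical stability
  exp(z) / sum(exp(z))
}

z = c(1.2, -0.3, 0.5, 2.8, -1.1, 0.0, 0.7, -0.5, 1.9, 0.2)  # one image's scores
p_hat = softmax(z)
names(p_hat) = 0:9

round(p_hat, 3)  # probabilities sum to 1; digit 3 (z = 2.8) gets the largest share
```

Subtracting the maximum before exponentiating avoids overflow and leaves the probabilities unchanged.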
---
## Held-out test results

After tuning and fitting on the training data, here are the test accuracies
- .note[multinomial logit:] .attn[87.8%]
- .note[NN (1 hidden layer with 50 units):].super.pink[†] .attn[91.2%]
- .note[convolutional NN:].super.orange[††] .attn[95.2%]

Nonlinearity and architectural choices matter <br> ...even for a "simple" problem like MNIST. <br> .grey-vlight[(Simple is relative.)]

.footnote[.pink[†] You can use the [`brulee` package](https://brulee.tidymodels.org/reference/brulee_mlp.html) to fit a neural net using a `tidymodels` `recipe`. You'll also need to install [`torch`](https://torch.mlverse.org/).<br>.orange[††] I know, we didn't get to CNNs. They're an approach that *slides filters* across the input to detect local patterns.]

---
class: inverse, middle
name: wrap

## Wrap-up

---
## Main takeaways

NNs are a powerful class of models that extends many prior topics
- .attn[logistic regression] is a one-neuron network with sigmoid activation,
- .attn[hidden layers] allow NNs to learn complex representations of the data,
- .attn[nonlinear activations] are essential for hidden layers,
- .attn[training] involves minimizing loss via gradient descent and backpropagation,
- .attn[penalization] mirrors bias-variance intuition from regularized regression.

.note[As always] Careful tuning and construction are key to performance.
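---
## Coda: a sigmoid neuron by hand

To close the loop on the first takeaway, here is a minimal sketch in R (weights, bias, and input all made up) showing that a sigmoid neuron computes exactly a logistic-regression probability:

``` r
# A sigmoid neuron by hand. (Made-up weights/bias; illustration only.)
sigmoid = function(z) 1 / (1 + exp(-z))

w = c(0.8, -0.5)  # weights
b = 0.1           # bias
x = c(1.5, 2.0)   # one observation's inputs

z = sum(w * x) + b  # weighted sum plus bias
a = sigmoid(z)      # activation: the predicted probability

# Identical to the logistic-regression probability (plogis is R's logistic CDF):
all.equal(a, plogis(b + sum(w * x)))  # TRUE
```

Everything after this neuron (hidden layers, softmax, backpropagation) is a generalization of these few lines.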