class: center, middle, inverse, title-slide

.title[
# Lecture .mono[011]
]
.subtitle[
## Neural networks
]
.author[
### Edward Rubin
]

---
exclude: true

---
name: admin
# Admin

## Today

.b[Topics:] We will start an introduction to neural networks:
- anatomy of a single neuron,
- logistic regression as a one-neuron network,
- activation functions and hidden layers,
- a brief overview of training,
- the return of MNIST.

.b[Reading and resources]
- .note[Mostly:] .it[Neural Networks and Deep Learning], Ch. [1](http://neuralnetworksanddeeplearning.com/chap1.html) (some [2](http://neuralnetworksanddeeplearning.com/chap2.html) + [3](http://neuralnetworksanddeeplearning.com/chap3.html)),
- .note[Also helpful:] .it[ISL] Ch. 11,
- .note[Fun, interactive tool:] TensorFlow's [Neural Network Playground](https://playground.tensorflow.org/).

---
layout: true
# Neural networks

---
class: inverse, middle
name: intro

## (Neural) Intuition

---
layout: false
class: clear, middle

<img src="slides_files/figure-html/logistic-neuron-1.svg" style="display: block; margin: auto;" />

.b[Anatomy] .attn[Neurons] are the building blocks of neural networks. They
- take .b.slate[in] a weighted sum of inputs, `\(w'x = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p\)`,
- apply a .b.slate[bias], `\(b\)`, that shifts the sum, `\(z = b + w'x\)`,
- pass `\(z\)` through an .b.pink[activation function] to produce .b.orange[output], `\(a = g(z)\)`.

---
layout: true
# Neural networks

---
## .slab.it.grey-light[History]

It all started with the perceptron.

The .attn[perceptron] uses a .b[hard-threshold] activation:

$$
a = `\begin{cases} 1 & \text{if } \color{#314f4f}{w'x} \ge \color{#e64173}{T} \iff \color{#314f4f}{w'x} + \color{#e64173}{b} \ge 0 \\ 0 & \text{if } \color{#314f4f}{w'x} < \color{#e64173}{T} \iff \color{#314f4f}{w'x} + \color{#e64173}{b} < 0 \end{cases}`
$$

The intuition literally came from neurons firing.
When the .slate.it[weighted sum of inputs] (`\(\color{#314f4f}{w'x}\)`) exceeds a .pink[threshold] (`\(\color{#e64173}{T}\)`), .b.pink[fire!]

.note[Intuition:] Compare .slate[weighted *evidence*] against the .pink[*burden of proof*].

--

We translate the .pink[threshold] into a .pink[bias] (`\(b = -T\)`) to get a more "standard" form.

---
## What do the weights and bias do?

$$
z = b + \sum_{j = 1}^p w_j x_j
$$

The weighted sum does two jobs:
- the .attn[weights] say which inputs matter and in which direction,
- the .attn[bias] shifts the cutoff left or right.

--

Even with its simple structure, the neuron is capable of learning
- to set the weights and bias to find a good decision boundary,
- to output `\(1\)` when `\(z \ge 0\)` and `\(0\)` when `\(z < 0\)`,
- to separate our two classes.

---
## *Too* binary?

Binary activation may be .pink[intuitive], but it's .purple[not great for learning].

.note[Problem] We need to .purple[*learn*] the weights and bias `\((w,b)\)`.
- When .b.purple[training], we want to .purple[adjust] `\(w\)` and `\(b\)` to reduce errors.
- If we make .purple[small changes] to `\(w\)` or `\(b\)`, the binary activation may change *a lot*.

`\(\implies\)` training with binary activations can be .purple[unstable] and .purple[inefficient].

---
## *Too* binary?

.note[Solution] Keep what we like about the perceptron
- the .b.slate[weighted sum of inputs shifted by a bias],
- .b.pink[restricting the output] to a certain range (∈[0, 1] for binary classification),

but replace the hard step with a .purple[smooth, differentiable activation].

.qa[Q] Any ideas? .grey-vlight[You already know one!]

--

The .attn[sigmoid neuron] keeps the same weighted-sum idea and replaces the hard step with a differentiable activation—the .attn[sigmoid function].

---
## Logistic regression or a tiny neural network?

<!-- `\(\implies\)` You already know how to build a (very simple) .b[neural network]!
-->

.attn[Logistic regression] For a binary outcome `\(Y\)`

`$$\Pr(Y = 1 \mid X = x) = \sigma(\beta_0 + x'\beta) = \sigma(w'x + b)$$`

where `\(\sigma(z) = \dfrac{1}{1 + e^{-z}}\)`.

.note[Also] .attn[Sigmoid neuron] with
- a .b.slate[weighted sum of inputs] and a .b.slate[bias],
- a .b.pink[sigmoid] activation function,
- one output.

--

Neural networks are not a sharp break from (logistic) regression. <br> They are a .it.slab[generalization].

---
layout: false
class: clear, middle, center

.b.slate[Perceptron] .note[vs.] .b.pink[sigmoid neuron]

<img src="slides_files/figure-html/step-v-sigmoid-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: clear, middle, center

.b[Common activation functions.] .grey-vlight[(Match activation to the problem.)]

<img src="slides_files/figure-html/activation-functions-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Neural networks

---
## Getting flexible

Adjusting weights and bias of a .b.slate[single neuron]—regardless of the activation function—is only going to get us so far. .grey-vlight[(Flavors of regression.)]

<img src="slides_files/figure-html/single-neuron-plot-1.svg" style="display: block; margin: auto;" />

.qa[Q:] What else can we do to make our model .it.pink[more flexible]? <br>.tran.slab[Q:] .grey-vlight[(Especially for complex, highly nonlinear problems?)]

---
layout: false
class: clear, middle

Adding a .attn[hidden layer] of neurons between the .slate[input] and .orange[output] layers

<img src="slides_files/figure-html/plot-intro-hidden-1.svg" style="display: block; margin: auto;" />

increases the model's ability to learn .pink[complex relationships] from .slate[x]→.orange[y].
- each .pink[hidden neuron] learns a different transformation of the input. <br> `\(\color{#e64173}{a_k} = a(w_k'x + b_k)\)`
- an .orange[output neuron] flexibly combines those transformations.
<br> `\(\color{#FFA500}{\hat{y}} = a_y(w_y'\color{#e64173}{a} + b_y)\)`

---
layout: false
class: clear, middle

Let's think about building a .b.purple[feed-forward network for MNIST].

.note[Note:] .attn[Feed-forward networks] are simply neural networks where information flows in one direction—from .it.slate[input] to .it.orange[output]. .grey-vlight[(No cycles or feedback loops.)]

---
class: clear, middle, center

.b[A feedforward network] with one .pink[hidden layer].

<img src="slides_files/figure-html/network-plot1-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle

There's really no reason to stop at .it[one] .pink[hidden layer]!

---
class: clear, middle, center

.b[A feedforward network] with two .pink[hidden layers]!

<img src="slides_files/figure-html/network-plot2-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle, center

.b[A feedforward network] with three .pink[hidden layers]! .grey-vlight[(Getting deep!)]

<img src="slides_files/figure-html/network-plot3-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Neural networks

---
## Neuron stacking in hidden layers

.note[Let's be clear:] Hidden layers are not about .it[more coefficients].

The .it[power of .pink[hidden layers]] comes from their ability to<br>.it[learn increasingly complex features] built from simpler ones.
- early neurons can respond to simple patterns;
- later neurons can combine those patterns into still more complex ones;
- each layer builds complexity on top of the previous layers.

Adding .attn[depth] (hidden layers) allows NNs to learn more levels of .note[abstraction]. <br> .grey-vlight[(It also adds more parameters to learn.)]

---
layout: false
class: clear

<img src="slides_files/figure-html/simple-hidden1-1.svg" style="display: block; margin: auto;" />

To see how .note[depth] and .note[activations] affect network flexibility and output, <br> let's write out the math for this simple neural network.
- .slate[2 inputs],
- .pink[2 hidden neurons], and
- .orange[1 output].

---
class: clear

<img src="slides_files/figure-html/simple-hidden2-1.svg" style="display: block; margin: auto;" />

In the .pink[hidden layer], neuron `\(\color{#e64173}{a_j}\)` learns a transformation `\(g\)` of the .it.slate[input]:

`$$\color{#e64173}{a_j} = g\big(w^1_{j,1} \color{#314f4f}{x_1} + w^1_{j,2} \color{#314f4f}{x_2} + b^1_j\big)$$`

--

The .orange[output] `\(\color{#FFA500}{\hat{y}}\)` is a transformation `\(h\)` of the .it.pink[hidden-layer output]:

`$$\color{#FFA500}{\hat{y}} = h\big(w^2_{1,1} \color{#e64173}{a_1} + w^2_{1,2} \color{#e64173}{a_2} + b^2_1\big)$$`

--

Combining the preceding two steps...

`\begin{align*} \color{#FFA500}{\hat{y}} & = h\Big(w^2_{1,1} g\big(w^1_{1,1} \color{#314f4f}{x_1} + w^1_{1,2} \color{#314f4f}{x_2} + b^1_1\big) + w^2_{1,2} g\big(w^1_{2,1} \color{#314f4f}{x_1} + w^1_{2,2} \color{#314f4f}{x_2} + b^1_2\big) + b^2_1\Big) \end{align*}`

---
class: clear

`\begin{align*} \color{#FFA500}{\hat{y}} & = h\Big(w^2_{1,1} g\big(w^1_{1,1} \color{#314f4f}{x_1} + w^1_{1,2} \color{#314f4f}{x_2} + b^1_1\big) + w^2_{1,2} g\big(w^1_{2,1} \color{#314f4f}{x_1} + w^1_{2,2} \color{#314f4f}{x_2} + b^1_2\big) + b^2_1\Big) \end{align*}`

--

.note[Case 1:] Suppose `\(g\)` and `\(h\)` are both the .attn[identity function] `\(f(z) = z\)`.
--

`\(\color{#FFA500}{\hat{y}} = \Big(w^2_{1,1} \big(w^1_{1,1} \color{#314f4f}{x_1} + w^1_{1,2} \color{#314f4f}{x_2} + b^1_1\big) + w^2_{1,2} \big(w^1_{2,1} \color{#314f4f}{x_1} + w^1_{2,2} \color{#314f4f}{x_2} + b^1_2\big) + b^2_1\Big)\)`

--

`\(\color{#FFFFFF}{\hat{y}} = w^2_{1,1} w^1_{1,1} \color{#314f4f}{x_1} + w^2_{1,1} w^1_{1,2} \color{#314f4f}{x_2} + w^2_{1,1} b^1_1 + w^2_{1,2} w^1_{2,1} \color{#314f4f}{x_1} + w^2_{1,2} w^1_{2,2} \color{#314f4f}{x_2} + w^2_{1,2} b^1_2 + b^2_1\)`

--

`\(\color{#FFFFFF}{\hat{y}} = \big(w^2_{1,1} w^1_{1,1} + w^2_{1,2} w^1_{2,1}\big) \color{#314f4f}{x_1} + \big(w^2_{1,1} w^1_{1,2} + w^2_{1,2} w^1_{2,2}\big) \color{#314f4f}{x_2} + \big(w^2_{1,1} b^1_1 + w^2_{1,2} b^1_2 + b^2_1\big)\)`

--

`\(\color{#FFFFFF}{\hat{y}} = \beta_{0} + \beta_{1} \color{#314f4f}{x_1} + \beta_{2} \color{#314f4f}{x_2}\)`

--

A lot of work to get back to a .b[linear model]!

--

.note[Case 2:] Suppose `\(g\)` and `\(h\)` are .attn[linear]/.attn[affine] functions.

--

.grey-vlight[(insert math)] ... Still a .b[linear model]!

---
layout: true
# Neural networks

---
## Hidden layers need nonlinearity

So hidden layers only "work" when we add .b.pink[nonlinear activations].

`\begin{align*} \color{#FFA500}{\hat{y}} & = h\Big(w^2_{1,1} g\big(w^1_{1,1} \color{#314f4f}{x_1} + w^1_{1,2} \color{#314f4f}{x_2} + b^1_1\big) + w^2_{1,2} g\big(w^1_{2,1} \color{#314f4f}{x_1} + w^1_{2,2} \color{#314f4f}{x_2} + b^1_2\big) + b^2_1\Big) \end{align*}`

If `\(g\)` and `\(h\)` are nonlinear, then `\(\color{#FFA500}{\hat{y}}\)` is
- a .orange[nonlinear combination] of the .pink[hidden-layer activations], which are
- .slate[nonlinear combinations of the inputs].

As you add more hidden layers, you *layer* more nonlinear transformations.

.note[Note] The .orange[output] activation function does not have to be nonlinear for the hidden layers to "work".
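The Case 1 algebra is easy to check numerically. A minimal sketch in R (all weights, biases, and inputs are made up for illustration): with identity activations, the 2-2-1 network reproduces the collapsed linear model exactly.

``` r
# Made-up parameters for the 2-2-1 network. (Illustration only.)
W1 = matrix(c(0.5, -1.2, 0.8, 0.3), nrow = 2)  # rows = hidden neurons
b1 = c(0.1, -0.4)
W2 = c(1.5, -0.7)
b2 = 0.2
x  = c(2, -1)

# Forward pass with identity activations: g(z) = z and h(z) = z.
a     = as.vector(W1 %*% x) + b1   # hidden layer
y_net = sum(W2 * a) + b2           # output

# The collapsed linear model: y = beta0 + beta1 x1 + beta2 x2.
beta  = as.vector(W2 %*% W1)       # slope coefficients
beta0 = sum(W2 * b1) + b2          # intercept
y_lin = beta0 + sum(beta * x)

all.equal(y_net, y_lin)  # TRUE: still just a linear model
```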
.grey-vlight[(No requirement that *g* and *h* match either.)]

---
layout: false
class: clear, middle

.attn[Universal approximation theorem] A feedforward network with one hidden layer can approximate *any* continuous function on a compact domain, given enough neurons in the hidden layer. .grey-vlight[(It doesn't say how many neurons we need.)]

---
layout: true
# Neural networks

---
## Welcome to the playground

TensorFlow's [*Playground*](https://playground.tensorflow.org/) is a fun way to see NNs in action.

Play with the
- activation function,
- number of hidden layers and neurons,
- learning rate,
- amount of regularization.

Watch the .b[boundary] change and the .b[representation] of hidden layers.

---
class: inverse, middle
name: training

## How the network learns

---
## The loss function still runs the show

.note[As always:] We choose weights and biases to minimize a .attn[loss function].
- .note[regression]: MSE, MAE, Huber .grey-vlight.it[(linear output)]
- .note[binary]: log loss (binary cross-entropy). .grey-vlight.it[(sigmoid output)]
`$$C = - \big[y \log(\hat{p}) + (1 - y) \log(1 - \hat{p})\big]$$`
- .note[multiclass]: multiclass cross-entropy .grey-vlight.it[(softmax output)]
`$$C = -\sum_{k = 1}^K y_k \log \hat p_k$$`

--

.note[As before:] We can .attn[penalize] (.attn[regularize]) the weights to prevent overfitting.

---
## Optimization

We don't have time to cover the details of optimization; here's the *gist*:
- .attn[gradient descent] moves parameters "downhill" fastest by following the negative gradient of the loss function
`$$\theta \leftarrow \theta - \eta \nabla_\theta C(\theta)$$`
- .attn[gradient] how each weight/bias changes the loss
- .attn[backpropagation] efficient chain rule to compute those gradients
- .attn[mini-batch] one small chunk of data used for one update
- .attn[epoch] one full pass through all training observations

It's a lot. In fact, this *gist* barely scratches the surface.
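---
## Optimization: a tiny example

To make the update rule concrete, here is a minimal sketch in R (data and learning rate are made up) of one full-batch gradient-descent step for logistic regression, i.e., a single sigmoid neuron with log loss.

``` r
sigmoid = function(z) 1 / (1 + exp(-z))

set.seed(101)
x = matrix(rnorm(200), ncol = 2)                     # 100 made-up observations, 2 inputs
y = as.numeric(x[, 1] - x[, 2] + rnorm(100) > 0)     # made-up binary outcome

w = c(0, 0); b = 0; eta = 0.1                        # start at zero; learning rate eta

p = sigmoid(as.vector(x %*% w) + b)                  # forward pass: predicted probabilities
grad_w = as.vector(crossprod(x, p - y)) / length(y)  # dC/dw for log loss
grad_b = mean(p - y)                                 # dC/db
w = w - eta * grad_w                                 # theta <- theta - eta * gradient
b = b - eta * grad_b
```

One step nudges `\(w\)` and `\(b\)` downhill; training repeats this many times, typically on mini-batches.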
The key is efficiently computing the gradients—backpropagation.

---
class: inverse, middle
name: mnist

## MNIST application

---
## Why MNIST again?

MNIST is a natural neural-net example
- high-dimensional inputs (784 predictors for each image),
- 10 output classes,
- nonlinear relationships between pixels and classes,
- lots of variation in handwriting.

Let's see how well a neural network can learn from the raw pixel data.

---
layout: false
class: clear, middle

The raw inputs are still images.

<img src="slides_files/figure-html/mnist-examples-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Neural networks

---
## One image, many outputs

MNIST is a .attn[multi-output classification] problem. For each image, the output layer should produce:

$$
\big( \hat p_0,\, \hat p_1,\, \ldots,\, \hat p_9 \big)
$$

If we remove the hidden layer and keep only the softmax output,

`$$\sigma(z_i) = \dfrac{e^{z_i}}{\sum_{k = 0}^{9} e^{z_k}}$$`

we are back at .attn[multinomial logistic regression].

---
exclude: true
## Fit the models

``` r
# Fit each workflow on the training split.
fit_logit_valid = workflow() |>
  add_recipe(rec_base) |>
  add_model(logit_spec) |>
  fit(mnist_train)
fit_mlp_start_valid = workflow() |>
  add_recipe(rec_base) |>
  add_model(mlp_start_spec) |>
  fit(mnist_train)
fit_mlp_tuned_valid = workflow() |>
  add_recipe(rec_tuned) |>
  add_model(mlp_tuned_spec) |>
  fit(mnist_train)
fit_cnn_valid = fit_simple_cnn(mnist_train, epochs = 6)
# Collect validation predictions for model comparison.
pred_logit_valid = bind_cols(
  mnist_valid |> select(label),
  predict(fit_logit_valid, mnist_valid)
)
pred_mlp_start_valid = bind_cols(
  mnist_valid |> select(label),
  predict(fit_mlp_start_valid, mnist_valid)
)
pred_mlp_tuned_valid = bind_cols(
  mnist_valid |> select(label),
  predict(fit_mlp_tuned_valid, mnist_valid)
)
pred_cnn_valid = bind_cols(
  mnist_valid |> select(label),
  predict_simple_cnn(fit_cnn_valid, mnist_valid)
)
validation_compare = tibble(
  model = c('Multinomial logit', 'Starter MLP', 'Tuned MLP', 'Simple CNN'),
  accuracy = c(
    pred_logit_valid |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_mlp_start_valid |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_mlp_tuned_valid |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_cnn_valid |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate)
  )
)
# Refit the strongest models on train + validation.
fit_logit_test = workflow() |>
  add_recipe(rec_base_tv) |>
  add_model(logit_spec) |>
  fit(mnist_train_valid)
fit_mlp_tuned_test = workflow() |>
  add_recipe(rec_tuned_tv) |>
  add_model(mlp_tuned_spec) |>
  fit(mnist_train_valid)
fit_cnn_test = fit_simple_cnn(mnist_train_valid, epochs = 6)
# Evaluate on the held-out test split.
pred_logit_test = bind_cols(
  mnist_test |> select(label),
  predict(fit_logit_test, mnist_test)
)
pred_mlp_tuned_test = bind_cols(
  mnist_test |> select(label),
  predict(fit_mlp_tuned_test, mnist_test),
  predict(fit_mlp_tuned_test, mnist_test, type = 'prob')
)
pred_cnn_test = bind_cols(
  mnist_test |> select(label),
  predict_simple_cnn(fit_cnn_test, mnist_test)
)
test_compare = tibble(
  model = c('Multinomial logit', 'Tuned MLP', 'Simple CNN'),
  accuracy = c(
    pred_logit_test |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_mlp_tuned_test |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate),
    pred_cnn_test |>
      metrics(truth = label, estimate = .pred_class) |>
      filter(.metric == 'accuracy') |>
      pull(.estimate)
  )
)
# Build a confusion matrix and one probability-vector example for the CNN.
cm_cnn_test = pred_cnn_test |>
  count(label, .pred_class)
example_prob_row = pred_cnn_test |>
  mutate(case_id = row_number()) |>
  rowwise() |>
  mutate(max_prob = max(c_across(matches('^\\.pred_[0-9]+$')))) |>
  ungroup() |>
  filter(label == .pred_class) |>
  slice_max(order_by = max_prob, n = 1)
example_case = mnist_test |>
  slice(example_prob_row$case_id)
example_prob_plot_df = example_prob_row |>
  select(matches('^\\.pred_[0-9]+$')) |>
  pivot_longer(
    everything(),
    names_to = 'digit',
    values_to = 'probability'
  ) |>
  mutate(digit = gsub('.pred_', '', digit, fixed = TRUE))
```

---
exclude: true
## Validation results

The first pass looks like this:
- multinomial logit: .attn[87.5%]
- starter MLP: .attn[87.7%]
- tuned MLP: .attn[90.6%]
- simple CNN: .attn[95.2%]

So the lesson is not .it[any hidden layer wins]. The lesson is that .attn[capacity, preprocessing, and architecture all matter].
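---
## Softmax, concretely

The softmax output layer is easy to sketch in R. A minimal example (the scores in `z` are made up for illustration):

``` r
# Softmax over made-up output-layer scores for digits 0-9.
softmax = function(z) {
  z = z - max(z)        # subtract the max for numerical stability
  exp(z) / sum(exp(z))
}

z = c(1.2, -0.3, 0.5, 2.8, -1.1, 0.0, 0.7, -0.5, 1.9, 0.2)  # one image's scores
p_hat = softmax(z)
names(p_hat) = 0:9

round(p_hat, 3)  # probabilities sum to 1; digit 3 (z = 2.8) gets the largest share
```

Subtracting the maximum before exponentiating avoids overflow and leaves the probabilities unchanged.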
---
## Held-out test results

After tuning and fitting on the training data, here are the test accuracies
- .note[multinomial logit:] .attn[87.8%]
- .note[NN (1 hidden layer with 50 units):].super.pink[†] .attn[91.2%]
- .note[convolutional NN:].super.orange[††] .attn[95.2%]

Nonlinearity and architectural choices matter <br> ...even for a "simple" problem like MNIST. <br> .grey-vlight[(Simple is relative.)]

.footnote[.pink[†] You can use the [`brulee` package](https://brulee.tidymodels.org/reference/brulee_mlp.html) to fit a neural net using a `tidymodels` `recipe`. You'll also need to install [`torch`](https://torch.mlverse.org/).<br>.orange[††] I know, we didn't get to CNNs. They're an approach that *slides filters* across the input to detect local patterns.]

---
class: inverse, middle
name: wrap

## Wrap-up

---
## Main takeaways

NNs are a powerful class of models that extends many prior topics
- .attn[logistic regression] is a one-neuron network with sigmoid activation,
- .attn[hidden layers] allow NNs to learn complex representations of the data,
- .attn[nonlinear activations] are essential for hidden layers,
- .attn[training] involves minimizing loss via gradient descent and backpropagation,
- .attn[penalization] mirrors bias-variance intuition from regularized regression.

.note[As always] Careful tuning and construction are key to performance.
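---
## Coda: a sigmoid neuron by hand

To close the loop on the first takeaway, here is a minimal sketch in R (weights, bias, and input all made up) showing that a sigmoid neuron computes exactly a logistic-regression probability:

``` r
# A sigmoid neuron by hand. (Made-up weights/bias; illustration only.)
sigmoid = function(z) 1 / (1 + exp(-z))

w = c(0.8, -0.5)  # weights
b = 0.1           # bias
x = c(1.5, 2.0)   # one observation's inputs

z = sum(w * x) + b  # weighted sum plus bias
a = sigmoid(z)      # activation: the predicted probability

# Identical to the logistic-regression probability (plogis is R's logistic CDF):
all.equal(a, plogis(b + sum(w * x)))  # TRUE
```

Everything after this neuron (hidden layers, softmax, backpropagation) is a generalization of these few lines.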