---
title: "Lecture .mono[002]"
subtitle: "Model accuracy and selection"
author: "Edward Rubin"
#date: "`r format(Sys.time(), '%d %B %Y')`"
date: "14 January 2020"
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{R, setup, include = F}
library(pacman)
p_load(
  broom, tidyverse, ggplot2, ggthemes, ggforce, ggridges, cowplot,
  latex2exp, viridis, extrafont, gridExtra, plotly, ggformula, kableExtra,
  snakecase, janitor, data.table, dplyr, lubridate, knitr, future, furrr,
  MASS, estimatr, FNN, caret, parsnip, huxtable, here, magrittr, parallel
)
# Define colors
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Knitr options
opts_chunk$set(
  comment = "#>",
  fig.align = "center",
  fig.height = 7,
  fig.width = 10.5,
  warning = F,
  message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
  svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```

---
layout: true
# Admin

---
class: inverse, middle

---
name: admin-today

## Today

.b[In-class]
- Model accuracy
- Loss for regression and classification
- The bias-variance tradeoff
- The Bayes classifier
- KNN

---
name: admin-soon

## Upcoming

.b[Readings]
- .note[Today]
  - Finish .it[ISL] Ch. 2
  - [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015)
- .note[Next]
  - .it[ISL] Ch. 3–4

.b[Problem set] Today; due Tuesday before midnight (PST)

---
layout: true
# Model accuracy

---
class: inverse, middle

---
name: accuracy-review

## Review: Supervised learning

1. Using .hi-slate[training data] $\left( \color{#FFA500}{\mathbf{y}},\, \color{#6A5ACD}{\mathbf{X}} \right)$, we train $\hat{\color{#20B2AA}{f}}$, estimating $\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f}\!(\color{#6A5ACD}{\mathbf{X}}) + \varepsilon$.

--

1. Using this estimated model $\hat{\color{#20B2AA}{f}}$, we .it[can] calculate .hi-slate[training MSE]
$$\color{#314f4f}{\text{MSE}_\text{train}} = \dfrac{1}{n} \sum_{i=1}^n \underbrace{\left[ \color{#FFA500}{\mathbf{y}}_i - \hat{\color{#20B2AA}{f}}\!\left( \color{#6A5ACD}{x}_i \right) \right]^{2}}_{\text{Squared error}} = \dfrac{1}{n} \sum_{i=1}^n \left[ \color{#FFA500}{\mathbf{y}}_i - \hat{\color{#FFA500}{\mathbf{y}}}_i \right]^2$$
.note[Note:] Assuming $\color{#FFA500}{\mathbf{y}}$ is numeric (regression problem).

--

1. We want the model to accurately predict previously unseen (.hi-pink[test]) data. This goal is sometimes called .attn[generalization] or .attn[external validity].

.center[
Average $\left[ \color{#e64173}{y_0} - \hat{\color{#20B2AA}{f}}\!\left( \color{#e64173}{x_0} \right) \right]^2$ for obs. $\left( \color{#e64173}{y_0},\, \color{#e64173}{x_0} \right)$ in our .hi-pink[test data].
]

---

## Errors

The item at the center of our focus is the (test-sample) .b[prediction error]
$$\color{#FFA500}{\mathbf{y}}_i - \hat{\color{#20B2AA}{f}}\!\left( \color{#6A5ACD}{x}_i \right) = \color{#FFA500}{\mathbf{y}}_i - \hat{\color{#FFA500}{\mathbf{y}}}_i$$
the difference between the label $\left( \color{#FFA500}{\mathbf{y}} \right)$ and its prediction $\left( \hat{\color{#FFA500}{\mathbf{y}}} \right)$.

The distance (_i.e._, non-negative value) between a true value and its prediction is often called .b[loss].

---
name: loss-functions

## Loss functions

.b[Loss functions] aggregate and quantify loss.

.white[►] .b[L1] loss function: $\sum_i \big| y_i - \hat{y}_i \big|$     .b[Mean abs. error]: $\dfrac{1}{n}\sum_i \big| y_i - \hat{y}_i \big|$

.white[►] .b[L2] loss function: $\sum_i \left( y_i - \hat{y}_i \right)^2$   .b[Mean squared error]: $\dfrac{1}{n}\sum_i \left( y_i - \hat{y}_i \right)^2$

--

Notice that .b[both functions impose assumptions].

1. Both assume .b[overestimating] is just as bad as .b[underestimating].
2. Both assume errors are similarly bad for .b[all individuals] $(i)$.
3. They differ in their assumptions about the .b[magnitude of errors].
  - .b[L1]: an additional unit of error is .b[equally bad] everywhere.
  - .b[L2]: an additional unit of error is .b[worse] when the error is already big.

---
layout: true
class: clear, middle

---

A very simple, univariate dataset $\left(\mathbf{y},\, \mathbf{x} \right)$

```{R, data loss, echo = F, cache = T}
# Sample size
n = 30
# Sample
set.seed(12345)
loss_df = tibble(
  x = runif(n = n, min = 0, max = 5),
  y = 1 + x + rnorm(n, sd = 1.5),
  y_hat = lm(y ~ x)$fitted.values,
  loss = abs(y - y_hat)
)
# Base graph
loss_gg = ggplot(data = loss_df, aes(x = x, y = y)) +
  theme_void()
```

```{R, graph loss 1, echo = F, cache = T, dependson = "data loss"}
loss_gg + geom_point(size = 3.5)
```

---

... on which we run a .pink[simple linear regression].

```{R, graph loss 2, echo = F, cache = T, dependson = "data loss"}
loss_gg +
  geom_smooth(color = red_pink, se = F, method = lm, size = 1.3) +
  geom_point(size = 3.5)
```

---

Each point $\left( y_i,\, x_i \right)$ has an associated .grey-mid[loss] (error).

```{R, graph loss 3, echo = F, cache = T, dependson = "data loss"}
loss_gg +
  geom_segment(aes(x = x, xend = x, y = y, yend = y_hat), color = "grey70") +
  geom_smooth(color = red_pink, se = F, method = lm, size = 1.3) +
  geom_point(size = 3.5)
```

---

The L1 loss function weights all errors equally: $\sum_i \big| y_i - \hat{y}_i \big|$

```{R, graph loss 4, echo = F, cache = T, dependson = "data loss"}
loss_gg +
  geom_segment(aes(x = x, xend = x, y = y, yend = y_hat, color = abs(loss))) +
  geom_smooth(color = red_pink, se = F, method = lm, size = 1.3) +
  geom_point(size = 3.5) +
  scale_color_viridis_c(option = "magma", end = 0.95) +
  theme(legend.position = "none")
```

---

The L2 loss function .it[upweights] large errors: $\sum_i \left( y_i - \hat{y}_i \right)^2$

```{R, graph loss 5, echo = F, cache = T, dependson = "data loss"}
loss_gg +
  geom_segment(aes(x = x, xend = x, y = y, yend = y_hat, color = loss^2)) +
  geom_smooth(color = red_pink, se = F, method = lm, size = 1.3) +
  geom_point(size = 3.5) +
  scale_color_viridis_c(option = "magma", end = 0.95) +
  theme(legend.position = "none")
```

---
name: overfitting
layout: false

# Model accuracy
## Overfitting

So what's the big deal? (.attn[Hint:] Look up.)

--

We're facing a tradeoff—increasing model flexibility
- offers potential to better fit complex systems
- risks overfitting our model to the training data

--

We can see these tradeoffs in our .hi-pink[test MSE] (but not the .hi-slate[training MSE]).

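---

# Model accuracy
## Loss functions in code

Both the .b[L1]- and .b[L2]-based accuracy measures are one-liners in R. A minimal sketch (not evaluated here), reusing the simulated `loss_df` data from the figures above; `test_df` is a hypothetical test sample with the same variables:

```{R, ex loss functions, eval = F}
# Fit the simple linear regression from the figures above
est = lm(y ~ x, data = loss_df)
# L1-based accuracy, i.e., mean absolute error (training sample)
mean(abs(loss_df$y - predict(est, newdata = loss_df)))
# L2-based accuracy, i.e., mean squared error (training sample)
mean((loss_df$y - predict(est, newdata = loss_df))^2)
# Test MSE is the same calculation on new data, e.g.,
# mean((test_df$y - predict(est, newdata = test_df))^2)
```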
---
class: clear, middle
layout: true

---
exclude: true

```{R, sim flexibility, echo = F, cache = T, eval = T}
# Function to generate our data
sim_fun = function(x) (x - 3)^2 * (3 * x + 3) * (x + 5) / 100 + 7
# Generate data
set.seed(123)
flex_train = tibble(
  x = runif(n = 300, min = -4.25, max = 4.25),
  y = sim_fun(x) + rnorm(300, sd = 3)
) %>% data.matrix()
flex_test = tibble(
  x = runif(n = 300, min = -4.25, max = 4.25),
  y = sim_fun(x) + rnorm(300, sd = 3)
) %>% data.matrix()
flex_range = seq(from = -4.25, to = 4.25, by = 0.01)
# Iterate over flexibility parameter
flex_df = mclapply(
  X = seq(0.01, 1.5, 0.01),
  FUN = function(s) {
    # Fit spline on training data
    spline_s = smooth.spline(x = flex_train, spar = s)
    # MSE
    mse_train = (flex_train[,"y"] - predict(spline_s, x = flex_train[,"x"])$y)^2 %>% mean()
    mse_test = (flex_test[,"y"] - predict(spline_s, x = flex_test[,"x"])$y)^2 %>% mean()
    # Return data frame
    tibble(
      s = rep(s, 2),
      mse_type = c("train", "test"),
      mse_value = c(mse_train, mse_test)
    )
  },
  mc.cores = detectCores() - 2
) %>% bind_rows()
# Find minima
min_train = flex_df %>% filter(mse_type == "train") %>% filter(mse_value == min(mse_value))
min_test = flex_df %>% filter(mse_type == "test") %>% filter(mse_value == min(mse_value))
```

```{R, save flexibility simulation, include = F, cache = T, dependson = "sim flexibility"}
saveRDS(
  object = flex_df,
  file = here("other-files", "flex-sim.rds")
)
```

---
name: ex-nonlinear-splines

.hi-slate[Training data] and example models (splines)

```{R, plot data for flexibility, echo = F}
ggplot(data = flex_train %>% as.data.frame(), aes(x, y)) +
  stat_function(
    aes(color = "1", size = "1"),
    fun = sim_fun, linetype = "longdash"
  ) +
  stat_spline(
    aes(color = "2", size = "2"),
    spar = min_test$s
  ) +
  stat_spline(
    aes(color = "3", size = "3"),
    spar = 1.5
  ) +
  stat_spline(
    aes(color = "4", size = "4"),
    spar = min_train$s
  ) +
  geom_point(size = 3.5, shape = 1, color = "grey40") +
  xlab("x") + ylab("y") +
  theme_void(base_family = "Fira Sans Book") +
  scale_color_manual(
    "",
    values = c("black", magma(3, begin = 0.2, end = 0.9)),
    labels = c(
      "True model",
      "Spline: Test-based flexibility",
      "Linear fit",
      "Spline: Training-based flexibility"
    )
  ) +
  scale_size_manual(
    "",
    values = c(0.5, 0.9, 0.9, 1.1),
    labels = c(
      "True model",
      "Spline: Test-based flexibility",
      "Linear fit",
      "Spline: Training-based flexibility"
    )
  ) +
  theme(
    legend.position = c(0.05, 0.99),
    legend.justification = c(0,1),
    axis.title = element_text(size = 20),
    legend.text = element_text(size = 18)
  )
```

---

```{R, plot flexibility, echo = F}
ggplot(data = flex_df, aes(x = 1.5 - s, y = mse_value, color = mse_type)) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_line(size = 1.2) +
  geom_point(data = bind_rows(min_train, min_test), size = 3.5) +
  xlab("Model flexibility") + ylab("MSE") +
  scale_color_viridis_d(
    "", labels = c("Test MSE", "Train MSE"),
    option = "magma", begin = 0.2, end = 0.9
  ) +
  theme_void(base_family = "Fira Sans Book") +
  theme(
    legend.position = c(0.9, 0.65),
    axis.title = element_text(size = 20, vjust = 1),
    axis.title.y = element_text(angle = 90),
    legend.text = element_text(size = 18)
  )
```

---

The previous example has a pretty nonlinear relationship.

.qa[Q] What happens when truth is actually linear?

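---

.note[Aside] The MSE curves above come from a simulation. A minimal sketch of its core step, at one arbitrary flexibility level (`spar = 0.5` here is just an example value; .it[lower] `spar` gives a .it[more] flexible spline):

```{R, ex flexibility mse, eval = F}
# Fit a smoothing spline to the training data at one flexibility level
fit = smooth.spline(x = flex_train[, "x"], y = flex_train[, "y"], spar = 0.5)
# Training MSE
mean((flex_train[, "y"] - predict(fit, x = flex_train[, "x"])$y)^2)
# Test MSE: the same model evaluated on new data
mean((flex_test[, "y"] - predict(fit, x = flex_test[, "x"])$y)^2)
# Repeating over a grid of 'spar' values traces out the curves above.
```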
---
exclude: true

```{R, sim linear flexibility, echo = F, cache = T, eval = T}
# Function to generate our data
sim_linear = function(x) 7 + 3 * x
# Generate data
set.seed(123)
lin_train = tibble(
  x = runif(n = 300, min = -4.25, max = 4.25),
  y = sim_linear(x) + rnorm(300, sd = 3)
) %>% data.matrix()
lin_test = tibble(
  x = runif(n = 300, min = -4.25, max = 4.25),
  y = sim_linear(x) + rnorm(300, sd = 3)
) %>% data.matrix()
lin_range = seq(from = -4.25, to = 4.25, by = 0.01)
# Iterate over flexibility parameter
lin_df = mclapply(
  X = seq(0.01, 1.5, 0.01),
  FUN = function(s) {
    # Fit spline on training data
    spline_s = smooth.spline(x = lin_train, spar = s)
    # MSE
    mse_train = (lin_train[,"y"] - predict(spline_s, x = lin_train[,"x"])$y)^2 %>% mean()
    mse_test = (lin_test[,"y"] - predict(spline_s, x = lin_test[,"x"])$y)^2 %>% mean()
    # Return data frame
    tibble(
      s = rep(s, 2),
      mse_type = c("train", "test"),
      mse_value = c(mse_train, mse_test)
    )
  },
  mc.cores = detectCores() - 2
) %>% bind_rows()
# Find minima
min_train_lin = lin_df %>% filter(mse_type == "train") %>% filter(mse_value == min(mse_value))
min_test_lin = lin_df %>% filter(mse_type == "test") %>% filter(mse_value == min(mse_value))
```

---
name: ex-linear-splines

.hi-slate[Training data] and example models (splines)

```{R, plot data for linear flexibility, echo = F}
ggplot(data = lin_train %>% as.data.frame(), aes(x, y)) +
  stat_function(
    aes(color = "1", size = "1"),
    fun = sim_linear, linetype = "longdash"
  ) +
  stat_spline(
    aes(color = "2", size = "2"),
    spar = min_test_lin$s
  ) +
  stat_spline(
    aes(color = "3", size = "3"),
    spar = 1.5
  ) +
  stat_spline(
    aes(color = "4", size = "4"),
    spar = min_train_lin$s
  ) +
  geom_point(size = 3.5, shape = 1, color = "grey40") +
  xlab("x") + ylab("y") +
  theme_void(base_family = "Fira Sans Book") +
  scale_color_manual(
    "",
    values = c("black", magma(3, begin = 0.2, end = 0.9)),
    labels = c(
      "True model",
      "Spline: Test-based flexibility",
      "Linear fit",
      "Spline: Training-based flexibility"
    )
  ) +
  scale_size_manual(
    "",
    values = c(0.5, 0.9, 0.9, 1.1),
    labels = c(
      "True model",
      "Spline: Test-based flexibility",
      "Linear fit",
      "Spline: Training-based flexibility"
    )
  ) +
  theme(
    legend.position = c(0.05, 0.99),
    legend.justification = c(0,1),
    axis.title = element_text(size = 20),
    legend.text = element_text(size = 18)
  )
```

---

```{R, plot linear flexibility, echo = F}
ggplot(data = lin_df, aes(x = 1.5 - s, y = mse_value, color = mse_type)) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_line(size = 1.2) +
  geom_point(data = bind_rows(min_train_lin, min_test_lin), size = 3.5) +
  xlab("Model flexibility") + ylab("MSE") +
  scale_color_viridis_d(
    "", labels = c("Test MSE", "Train MSE"),
    option = "magma", begin = 0.2, end = 0.9
  ) +
  theme_void(base_family = "Fira Sans Book") +
  theme(
    legend.position = c(0.9, 0.65),
    axis.title = element_text(size = 20, vjust = 1),
    axis.title.y = element_text(angle = 90),
    legend.text = element_text(size = 18)
  )
```

---
layout: true
# Model accuracy

---

## Solutions?

Clearly we don't want to overfit our .hi-slate[training data].

It looks like our .hi-pink[testing data] can help.

.qa[Q] How about the following routine?

1. train a model $\hat{\color{#20B2AA}{f}}$ on the .hi-slate[training data]
2. use the .hi-pink[test data] to "tune" the model's flexibility
3. repeat steps 1–2 until we find the optimal level of flexibility

--

.qa[A] .b[No]

--

.b[!]

--

.b[!!]

--

This is an algorithm for .b[overfitting your .hi-pink[test data]].

Okay... so maybe that was an overreaction, but we need to be careful.

---
name: bias-variance

This tradeoff that we keep coming back to has an official name:

.b[bias-variance tradeoff]. .grey-light[(or the variance-bias tradeoff)]

--

.b[Variance] The amount $\hat{\color{#20B2AA}{f}}$ would change with a different .hi-slate[training sample]

- If new .hi-slate[training sets] drastically change $\hat{\color{#20B2AA}{f}}$, then we have a lot of uncertainty about $\color{#20B2AA}{f}$ (and, in general, $\hat{\color{#20B2AA}{f}} \not\approx \color{#20B2AA}{f}$).
- More flexible models generally add variance to $\hat{\color{#20B2AA}{f}}$.

--

.b[Bias] The error that comes from inaccurately estimating $\color{#20B2AA}{f}$.

- More flexible models are better equipped to recover complex relationships $\left( \color{#20B2AA}{f} \right)$, reducing bias. (Real life is seldom linear.)
- Simpler (less flexible) models typically increase bias.

---

## The bias-variance tradeoff, formally

The expected value.super[.pink[†]] of the .hi-pink[test MSE] can be written

$$
\begin{align}
  \mathop{E}\left[ \left(\color{#FFA500}{\mathbf{y}}_0 - \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}}_0) \right)^2 \right] =
  \underbrace{\mathop{\text{Var}} \left( \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}}_0) \right)}_{(1)} +
  \underbrace{\left[ \text{Bias}\left( \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}}_0) \right) \right]^2}_{(2)} +
  \underbrace{\mathop{\text{Var}} \left( \varepsilon \right)}_{(3)}
\end{align}
$$

.footnote[
.pink[†] Think: .it[mean] or .it[tendency]
]

--

.qa[Q.sub[1]] What does this formula tell us? (Think intuition/interpretation.)

.qa[Q.sub[2]] How does model flexibility feed into this formula?
.qa[Q.sub[3]] What does this formula say about minimizing .hi-pink[test MSE]?

--

.qa[A.sub[2]] In general, model flexibility increases (1) and decreases (2).

--

.qa[A.sub[3]] The rates at which variance and (squared) bias change determine the optimal level of flexibility.

We often see U-shaped curves of .hi-pink[test MSE] w.r.t. model flexibility.

---
layout: false
class: clear, middle

.b[U-shaped test MSE] w.r.t. model flexibility

Increases in variance eventually overcome reductions in (squared) bias.

```{R, plot flexibility again, echo = F, fig.height = 6}
ggplot(data = flex_df, aes(x = 1.5 - s, y = mse_value, color = mse_type)) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_line(size = 1.2) +
  geom_point(data = bind_rows(min_train, min_test), size = 3.5) +
  xlab("Model flexibility") + ylab("MSE") +
  scale_color_viridis_d(
    "", labels = c("Test MSE", "Train MSE"),
    option = "magma", begin = 0.2, end = 0.9
  ) +
  theme_void(base_family = "Fira Sans Book") +
  theme(
    legend.position = c(0.9, 0.65),
    axis.title = element_text(size = 20, vjust = 1),
    axis.title.y = element_text(angle = 90),
    legend.text = element_text(size = 18)
  )
```

---

# Model accuracy
## Bias-variance tradeoff

The bias-variance tradeoff is key to understanding many ML concepts.

- Loss functions and model performance
- Overfitting and model flexibility
- Training and testing (and cross-validating)

Spend some time thinking about it and building intuition.
It's time well spent.
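---

# Model accuracy
## Bias-variance tradeoff in code

One way to build that intuition: simulate it. A rough sketch (not the code behind these slides; `spar = 0.7` is an arbitrary flexibility level): redraw the .hi-slate[training data] many times from `sim_fun()` above, refit, and track $\hat{\color{#20B2AA}{f}}$ at a single point $x_0$.

```{R, ex bias variance, eval = F}
x0 = 1 # An arbitrary prediction point
f_hat_x0 = sapply(1:1e3, function(i) {
  # Draw a fresh training sample from the true model
  train_i = tibble(
    x = runif(n = 300, min = -4.25, max = 4.25),
    y = sim_fun(x) + rnorm(300, sd = 3)
  )
  # Fit a spline at a fixed flexibility; predict at x0
  predict(smooth.spline(train_i$x, train_i$y, spar = 0.7), x = x0)$y
})
var(f_hat_x0)                    # Term (1): variance of f-hat at x0
(mean(f_hat_x0) - sim_fun(x0))^2 # Term (2): squared bias at x0
# Lowering 'spar' (more flexibility) raises (1) and shrinks (2).
```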
---
class: clear, middle

So far we've focused on regression problems; what about classification?

---
layout: true
# Model accuracy

---
name: class-review

## Classification problems

.note[Recall] We're still supervised, but now we're predicting categorical labels.

With categorical variables, MSE doesn't work—_e.g._,

.center[
$\color{#FFA500}{\mathbf{y}} - \hat{\color{#FFA500}{\mathbf{y}}} =$ .orange[(Chihuahua)] - .orange[(Blueberry muffin)] $=$ not math (does not compute)
]

Clearly we need a different way to define model performance.

---

## Classification problems

The most common approach is exactly what you'd guess...

.hi-slate[Training error rate] The share of training predictions that we get wrong.

$$
\begin{align}
  \dfrac{1}{n} \sum_{i=1}^{n} \mathbb{I}\!\left( \color{#FFA500}{y}_i \neq \hat{\color{#FFA500}{y}}_i \right)
\end{align}
$$

where $\mathbb{I}\!\left( \color{#FFA500}{y}_i \neq \hat{\color{#FFA500}{y}}_i \right)$ is an indicator function that equals 1 whenever our prediction is wrong.

--

.hi-pink[Test error rate] The share of test predictions that we get wrong.

.center[
Average $\mathbb{I}\!\left( \color{#FFA500}{y}_0 \neq \hat{\color{#FFA500}{y}}_0 \right)$ in our .hi-pink[test data]
]

---
layout: false
class: clear

```{R, pic chihuahua, echo = F, out.width = '72%'}
include_graphics("images/chihuahua-muffin.jpg")
```

---
layout: true
# Model accuracy

---
name: bayes

## The Bayes classifier

.note[Recall] .hi-pink[Test error rate] is the share of test predictions that we get wrong.

.center[
Average $\mathbb{I}\!\left( \color{#FFA500}{y}_0 \neq \hat{\color{#FFA500}{y}}_0 \right)$ in our .hi-pink[test data]
]

The .b[Bayes classifier] is the classifier that assigns an observation to its most probable group, given the values of its predictors, _i.e._,

.center[
Assign obs. $i$ to the class $j$ for which $\mathop{\text{Pr}}\left(\color{#FFA500}{\mathbf{y}} = j | \color{#6A5ACD}{\mathbf{X}} = \mathbf{x}_0\right)$ is the largest
]

The .b[Bayes classifier] minimizes the .hi-pink[test error rate].

--

.note[Note] $\mathop{\text{Pr}}\left(\mathbf{y}=j|\mathbf{X}=x_0\right)$ is the probability that random variable $\mathbf{y}$ equals $j$, given.super[.pink[†]] the variable(s) $\mathbf{X} = x_0$.

.footnote[
.pink[†] The "given" is also read as "conditional on". Think of it as subsetting to where $X=x_0$.
]

---

## The Bayes classifier

.note[Example]
- Pr(.orange[y] = "chihuahua" | .purple[X] = "orange and purple") = 0.3
- Pr(.orange[y] = "blueberry muffin" | .purple[X] = "orange and purple") = 0.4
- Pr(.orange[y] = "squirrel" | .purple[X] = "orange and purple") = 0.2
- Pr(.orange[y] = "other" | .purple[X] = "orange and purple") = 0.1

Then the Bayes classifier says we should predict "blueberry muffin".
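---

## The Bayes classifier

In code, the Bayes prediction is simply an argmax over the conditional probabilities. A tiny sketch using the made-up probabilities from the example above:

```{R, ex bayes classifier, eval = F}
# Hypothetical conditional probabilities, given X = "orange and purple"
p = c(chihuahua = 0.3, `blueberry muffin` = 0.4, squirrel = 0.2, other = 0.1)
# The Bayes classifier picks the most probable class
names(p)[which.max(p)]
#> [1] "blueberry muffin"
# And any error rate is just a share of wrong predictions: mean(y != y_hat)
```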
---

## The Bayes classifier

More notes on the Bayes classifier:

1. In the .b[two-class case], we're basically looking for
$\text{Pr}(\color{#FFA500}{\mathbf{y}}=j|\color{#6A5ACD}{\mathbf{X}}=x_0)>0.5$ for one class.
1. The .b[Bayes decision boundary] is the set of points where the probability is equal between the most likely groups (_i.e._, exactly 50% for two groups).
1. The Bayes classifier produces the lowest possible .hi-pink[test error rate], which is called the .b[Bayes error rate].
1. Just as with $\color{#20B2AA}{f}$, the probabilities $\mathop{\text{Pr}}\left(\color{#FFA500}{\mathbf{y}}=j|\color{#6A5ACD}{\mathbf{X}}=x_0\right)$ that the Bayes classifier relies upon are .b[unknown]. We have to estimate.

---
exclude: true

```{R, gen bayes data, echo = F, cache = T}
# Generate data
set.seed(1234)
n_b = 70
bayes_gen = tibble(
  x1 = runif(n_b, 10, 90),
  x2 = x1 + rnorm(n_b, sd = 30),
  y = (x1 - 0.9 * x2 + rnorm(n_b) > 0) %>% as.numeric()
)
bayes_truth = expand.grid(x1 = 1:100, x2 = 1:100) %>% as_tibble()
est_knn = knn.reg(
  train = bayes_gen[,c("x1", "x2")],
  test = bayes_truth,
  y = bayes_gen$y,
  k = 5
)
bayes_truth$p = est_knn$pred
bayes_truth$y = as.numeric(est_knn$pred > 0.5)
# Sample data points
bayes_sample = sample_n(bayes_truth, size = 100)
bayes_sample %<>% mutate(y = rbernoulli(n = 100, p = p) * 1)
bayes_sample2 = sample_n(bayes_truth, size = 100)
bayes_sample2 %<>% mutate(y = rbernoulli(n = 100, p = p) * 1)
# Train kNN
est_boundary = knn.reg(
  train = bayes_sample[,c("x1", "x2")],
  test = bayes_truth[,c("x1", "x2")],
  y = bayes_sample$y,
  k = 5
)
est_boundary2 = knn.reg(
  train = bayes_sample2[,c("x1", "x2")],
  test = bayes_truth[,c("x1", "x2")],
  y = bayes_sample2$y,
  k = 5
)
est_boundary_k1 = knn.reg(
  train = bayes_sample[,c("x1", "x2")],
  test = bayes_truth[,c("x1", "x2")],
  y = bayes_sample$y,
  k = 1
)
est_boundary_k60 = knn.reg(
  train = bayes_sample[,c("x1", "x2")],
  test = bayes_truth[,c("x1", "x2")],
  y = bayes_sample$y,
  k = 60
)
# Now add estimates to full dataset
bayes_truth$y_hat = as.numeric(est_boundary$pred > 0.5)
bayes_truth$y_hat2 = as.numeric(est_boundary2$pred > 0.5)
bayes_truth$y_hat_k1 = as.numeric(est_boundary_k1$pred > 0.5)
bayes_truth$y_hat_k60 = as.numeric(est_boundary_k60$pred > 0.5)
```

---
layout: true
class: clear, middle

---
name: ex-bayes

The .hi-pink[Bayes decision boundary] between classes .orange[A] and .navy[B]

```{R, plot bayes boundary, echo = F, cache = T, dependson = "gen bayes data"}
ggplot(data = bayes_truth, aes(x1, x2, color = y)) +
  geom_point(shape = 20, size = 0.7) +
  geom_contour(
    aes(x = x1, y = x2, z = y),
    bins = 1, color = red_pink, size = 1.3
  ) +
  scale_color_viridis_c(option = "magma", begin = 0.1, end = 0.85) +
  theme_void() +
  theme(legend.position = "none")
```

---

Now we sample...

```{R, plot bayes sample, echo = F, cache = T, dependson = "gen bayes data"}
ggplot(data = bayes_truth, aes(x1, x2, color = y)) +
  geom_point(shape = 20, size = 0.5) +
  geom_contour(
    aes(x = x1, y = x2, z = y),
    bins = 1, color = red_pink, size = 1.3
  ) +
  geom_point(
    data = bayes_sample,
    aes(x1, x2, color = y),
    size = 2
  ) +
  scale_color_viridis_c(option = "magma", begin = 0.1, end = 0.85) +
  theme_void() +
  theme(legend.position = "none")
```

---

... and our sample gives us an .hi-purple[estimated decision boundary].

```{R, plot bayes est boundary, echo = F, cache = T, dependson = "gen bayes data"}
ggplot(data = bayes_truth, aes(x1, x2, color = y)) +
  geom_point(shape = 20, size = 0.5) +
  geom_contour(
    aes(x = x1, y = x2, z = y),
    bins = 1, color = red_pink, size = 1.3
  ) +
  geom_point(
    data = bayes_sample,
    aes(x1, x2, color = y),
    size = 2
  ) +
  geom_contour(
    aes(x = x1, y = x2, z = y_hat),
    bins = 1, color = purple, size = 1.3
  ) +
  scale_color_viridis_c(option = "magma", begin = 0.1, end = 0.85) +
  theme_void() +
  theme(legend.position = "none")
```

---

And a new sample gives us another .hi-turquoise[estimated decision boundary].

```{R, plot bayes est boundary 2, echo = F, cache = T, dependson = "gen bayes data"}
ggplot(data = bayes_truth, aes(x1, x2, color = y)) +
  geom_point(shape = 20, size = 0.5) +
  geom_contour(
    aes(x = x1, y = x2, z = y),
    bins = 1, color = red_pink, size = 1.3
  ) +
  geom_point(
    data = bayes_sample2,
    aes(x1, x2, color = y),
    size = 2
  ) +
  geom_contour(
    aes(x = x1, y = x2, z = y_hat),
    bins = 1, color = purple, size = 1.3
  ) +
  geom_contour(
    aes(x = x1, y = x2, z = y_hat2),
    bins = 1, color = turquoise, size = 1.3
  ) +
  scale_color_viridis_c(option = "magma", begin = 0.1, end = 0.85) +
  theme_void() +
  theme(legend.position = "none")
```

---

One non-parametric way to estimate these unknown conditional probabilities: K-nearest neighbors (KNN).

---
layout: true
# K-nearest neighbors

---
name: knn-setup

## Setup

K-nearest neighbors (KNN) simply assigns a category based upon the nearest K neighbors' votes (their values).

--

.note[More formally:] Using the K closest neighbors.super[.pink[†]] to test observation $\color{#6A5ACD}{\mathbf{x_0}}$, we calculate the share of the observations whose class equals $j$,

$$
\begin{align}
  \hat{\mathop{\text{Pr}}}\left(\mathbf{y} = j | \mathbf{X} = \color{#6A5ACD}{\mathbf{x_0}}\right) = \dfrac{1}{K} \sum_{i \in \mathcal{N}_0} \mathop{\mathbb{I}}\left( \color{#FFA500}{\mathbf{y}}_i = j \right)
\end{align}
$$

These shares are our estimates for the unknown conditional probabilities. We then assign observation $\color{#6A5ACD}{\mathbf{x_0}}$ to the class with the highest probability.

.footnote[
.pink[†] In $\color{#6A5ACD}{\mathbf{X}}$ space.
]
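---

## Setup in code

Packages like `FNN` (used for this lecture's figures) and `caret` implement KNN, but the estimator is simple enough to sketch by hand. A minimal, hypothetical version for two numeric predictors and Euclidean distance:

```{R, ex knn by hand, eval = F}
# 'train_df' is a hypothetical data frame with predictors x1, x2 and class y
knn_prob = function(train_df, x0, k = 5) {
  # Distance from each training observation to the test point x0
  dist = sqrt((train_df$x1 - x0[1])^2 + (train_df$x2 - x0[2])^2)
  # Classes of the K nearest neighbors
  neighbors = train_df$y[order(dist)[1:k]]
  # Estimated Pr(y = j | X = x0): each class's share of the K votes
  table(neighbors) / k
}
# Prediction: the class with the highest estimated probability, e.g.,
# names(which.max(knn_prob(train_df, x0 = c(50, 50), k = 5)))
```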
---
name: knn-fig
layout: false
class: clear, middle

.b[KNN in action]
.note[Left:] K=3 estimation for "x".       .note[Right:] KNN decision boundaries.

```{R, fig knn, echo = F}
include_graphics("images/isl-knn.png")
```

.smaller.it[Source: ISL]

---
class: clear, middle

The choice of K is very important—ranging from super flexible (K = 1) to inflexible (large K).

---
name: ex-knn
class: clear, middle

Decision boundaries: .hi-pink[Bayes], .hi-purple[K=1], and .hi-turquoise[K=60]

```{R, plot knn k, echo = F, cache = T, dependson = "gen bayes data"}
ggplot(data = bayes_truth, aes(x1, x2, color = y)) +
  geom_point(shape = 20, size = 0.5) +
  geom_contour(
    aes(x = x1, y = x2, z = y),
    bins = 1, color = red_pink, size = 1.3
  ) +
  geom_point(
    data = bayes_sample,
    aes(x1, x2, color = y),
    size = 2
  ) +
  geom_contour(
    aes(x = x1, y = x2, z = y_hat_k1),
    bins = 1, color = purple, size = 1.3
  ) +
  geom_contour(
    aes(x = x1, y = x2, z = y_hat_k60),
    bins = 1, color = turquoise, size = 1.3
  ) +
  scale_color_viridis_c(option = "magma", begin = 0.1, end = 0.85) +
  theme_void() +
  theme(legend.position = "none")
```

---
class: clear, middle

.b[KNN error rates], as K increases

```{R, fig knn error, echo = F, out.width = '85%'}
include_graphics("images/isl-knn-error.png")
```

.smaller.it[Source: ISL]

---

# Model accuracy
## Summary

The bias-variance tradeoff is central to quality prediction.

- Relevant for classification and regression settings
- Benefits and costs of increasing model flexibility
- U-shaped test error curves
- Avoid overfitting—including in test data

---
name: sources
layout: false

# Sources

These notes draw upon

- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)

James, Witten, Hastie, and Tibshirani
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

Jake VanderPlas
- ['Chihuahua or Muffin' is from Twitter](https://twitter.com/teenybiscuit/status/705232709220769792)

---

# Table of contents

.col-left[
.smallest[

#### Admin
- [Today](#admin-today)
- [Upcoming](#admin-soon)

#### Model accuracy: Regression
- [Review](#accuracy-review)
- [Loss (functions)](#loss-functions)
- [Overfitting](#overfitting)
- [The bias-variance tradeoff](#bias-variance)

#### Model accuracy: Classification
- [Returning to classification](#class-review)
- [The Bayes classifier](#bayes)

#### KNN
- [Setup](#knn-setup)
- [Figures](#knn-fig)
]
]

.col-right[
.smallest[

#### Examples
- [Train *vs.* test MSE: Nonlinear truth](#ex-nonlinear-splines)
- [Train *vs.* test MSE: Linear truth](#ex-linear-splines)
- [Bayes decision boundaries](#ex-bayes)
- [KNN choice of K](#ex-knn)

#### Other
- [Sources/references](#sources)
]
]

---
exclude: true

```{R, save pdfs, include = F, eval = F}
system("`npm bin`/decktape remark 002-slides.html 002-slides.pdf --chrome-arg=--allow-file-access-from-files --slides 1-200")
```