--- title: "Lecture .mono[001]" subtitle: "Statistical learning: Foundations" author: "Edward Rubin" #date: "`r format(Sys.time(), '%d %B %Y')`" date: "14 January 2020" output: xaringan::moon_reader: css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css'] # self_contained: true nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- exclude: true ```{R, setup, include = F} library(pacman) p_load( broom, tidyverse, ggplot2, ggthemes, ggforce, ggridges, cowplot, latex2exp, viridis, extrafont, gridExtra, plotly, kableExtra, snakecase, janitor, data.table, dplyr, lubridate, knitr, future, furrr, estimatr, FNN, caret, parsnip, huxtable, here, magrittr ) # Define colors red_pink = "#e64173" turquoise = "#20B2AA" orange = "#FFA500" red = "#fb6107" blue = "#3b3b9a" green = "#8bb174" grey_light = "grey70" grey_mid = "grey50" grey_dark = "grey20" purple = "#6A5ACD" slate = "#314f4f" # Knitr options opts_chunk$set( comment = "#>", fig.align = "center", fig.height = 7, fig.width = 10.5, warning = F, message = F ) opts_chunk$set(dev = "svg") options(device = function(file, width, height) { svg(tempfile(), width = width, height = height) }) options(knitr.table.format = "html") ``` --- layout: true # Admin --- class: inverse, middle --- name: admin-today ## Today .hi-slate[In-class] - .note[Course website:] [https://github.com/edrubin/EC524W20/](https://github.com/edrubin/EC524W20/) - .note[Resources] - [RStudio](https://education.rstudio.com/learn/) cheatsheets, books, and tutorials - [UO library](http://uoregon.libcal.com/calendar/dataservices/?cid=11979&t=g&d=0000-00-00&cal=11979,11173) - See course page for more... 
- Formalizing statistical learning, notation, goals (and problems) --- layout: false class: clear, middle ```{R, eugene r, echo = F} knitr::include_graphics("images/eugene-r.png") ``` .smaller[[Tweet](https://twitter.com/ryann_crowley/status/1216880767072002048); [h/t: Grant McDermott](https://grantmcdermott.com/)] --- name: admin-soon # Admin ## Upcoming .hi-slate[Readings] - .note[Today] - .it[ISL] Ch. 1–2 - [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015) - .note[Next] - .it[ISL] Ch. 3–4 .hi-slate[Problem set] Likely assigned Thursday and due Tuesday. --- layout: true # Statistical learning --- class: inverse, middle --- name: sl-definition ## What is it? -- .hi[Statistical learning] is a .attn[set of tools] developed .attn[to understand/model data]. -- Examples - .hi-slate[Regression analysis] quantifies the relationship between an outcome and a set of explanatory variables, most usefully in a causal setting. -- - .hi-slate[Exploratory data analysis] (EDA) is a preliminary, often graphical, "exploration" of data to understand levels, variation, missingness, *etc.* -- - .hi-slate[Classification trees] search through explanatory variables, splitting along the most "predictive" dimensions (random forests extend trees). -- - .hi-slate[Regression trees] extend *classification trees* to numerical outcomes (random forests extend, as well). -- - .hi-slate[K-means clustering] partitions observations into K groups (clusters) based upon a set of variables. --- name: sl-classes ## What is it good for? -- A lot of things. -- We tend to break statistical learning into two(-ish) classes: 1. 
.hi-slate[Supervised learning] builds ("learns") a statistical model for predicting an .hi-orange[output] $\left( \color{#FFA500}{\mathbf{y}} \right)$ given a set of .hi-purple[inputs] $\left( \color{#6A5ACD}{\mathbf{x}_{1},\, \ldots,\, \mathbf{x}_{p}} \right)$, -- _i.e._, we want to build a model/function $\color{#20B2AA}{f}$ $$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f}\!\left( \color{#6A5ACD}{\mathbf{x}_{1},\, \ldots,\, \mathbf{x}_{p}} \right)$$ that accurately describes $\color{#FFA500}{\mathbf{y}}$ given some values of $\color{#6A5ACD}{\mathbf{x}_{1},\, \ldots,\, \mathbf{x}_{p}}$. -- 2. .hi-slate[Unsupervised learning] learns relationships and structure using only .hi-purple[inputs] $\left( \color{#6A5ACD}{\mathbf{x}_{1},\, \ldots,\, \mathbf{x}_{p}} \right)$ without any *supervising* output -- (letting the data "speak for itself"). --- layout: false class: clear, middle .hi-slate[Semi-supervised learning] falls somewhere between supervised and unsupervised learning; it is generally applied to supervised tasks when labeled .hi-orange[outputs] are incomplete. --- class: clear, middle ```{R, comic, echo = F} knitr::include_graphics("images/comic-learning.jpg") ``` .it[.smaller[[Source](https://twitter.com/athena_schools/status/1063013435779223553)]] --- layout: true # Statistical learning --- ## Output We tend to further break .hi-slate[supervised learning] into two groups, based upon the .hi-orange[output] (the .orange[outcome] we want to predict): -- 1. .hi-slate[Classification tasks] for which the values of $\color{#FFA500}{\mathbf{y}}$ are discrete categories
*E.g.*, race, sex, loan default, hazard, disease, flight status 2. .hi-slate[Regression tasks] in which $\color{#FFA500}{\mathbf{y}}$ takes on continuous, numeric values.
*E.g.*, price, arrival time, number of emails, temperature .note[Note.sub[1]] The use of .it[regression] differs from our use of .it[linear regression]. -- .note[Note.sub[2]] Don't get tricked: Not all numbers represent continuous, numerical values—_e.g._, zip codes, industry codes, social security numbers..super[.pink[†]] .footnote[ .pink[†] .qa[Q] Where would you put responses to 5-item Likert scales? ] --- ## Why *Learning*? .qa[Q] What puts the "learning" in statistical/machine learning? -- .qa[A] Most learning models/algorithms will .attn[tune model parameters] based upon the observed dataset—learning from the data. --- layout: true # Notation --- name: notation-source class: inverse, middle Our class will typically follow the notation and definitions of [.it[ISL]](http://faculty.marshall.usc.edu/gareth-james/ISL/). --- name: notation-data ## Data $\color{#e64173}{n}$ gives the .pink[number of observations] $\color{#6A5ACD}{p}$ represents the .purple[number of variables] available for predictions -- $\mathbf{X}$ is our $\color{#e64173}{n}\times\color{#6A5ACD}{p}$ matrix of predictors - .note[Other names] ***features***, *inputs*, *independent/explanatory variables*, ... - $x_{\color{#e64173}{i},\color{#6A5ACD}{j}}$ is observation $\color{#e64173}{i}$ (in $\color{#e64173}{1,\ldots,n}$) on variable $\color{#6A5ACD}{j}$ (for $\color{#6A5ACD}{j}$ in $\color{#6A5ACD}{1,...,p}$) -- $$ \begin{align} \mathbf{X} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,\color{#6A5ACD}{p}} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,\color{#6A5ACD}{p}} \\ \vdots & \vdots & \ddots & \vdots \\ x_{\color{#e64173}{n},1} & x_{\color{#e64173}{n},2} & \cdots & x_{\color{#e64173}{n},\color{#6A5ACD}{p}} \end{bmatrix} \end{align} $$ --- name: notation-dimensions ## Dimensions of $\mathbf{X}$ Now let us split our $\mathbf{X}$ matrix of predictors by its two dimensions. 
-- .col-left[ .hi-pink[Observation] $\color{#e64173}{i}$ is a $\color{#6A5ACD}{p}$-length vector $$ \begin{align} x_{\color{#e64173}{i}} = \begin{bmatrix} x_{\color{#e64173}{i},\color{#6A5ACD}{1}} \\ x_{\color{#e64173}{i},\color{#6A5ACD}{2}} \\ \vdots \\ x_{\color{#e64173}{i},\color{#6A5ACD}{p}} \end{bmatrix} \end{align} $$ ] -- .col-right[ .hi-purple[Variable] $\color{#6A5ACD}{j}$ is an $\color{#e64173}{n}$-length vector $$ \begin{align} \mathbf{x}_{\color{#6A5ACD}{j}} = \begin{bmatrix} x_{\color{#e64173}{1},\color{#6A5ACD}{j}} \\ x_{\color{#e64173}{2},\color{#6A5ACD}{j}} \\ \vdots \\ x_{\color{#e64173}{n},\color{#6A5ACD}{j}} \end{bmatrix} \end{align} $$ ] -- Applied to .mono[R]: - `dim(x_df)` = $\color{#e64173}{n}$ $\color{#6A5ACD}{p}$ - `nrow(x_df)` $= \color{#e64173}{n}$; `ncol(x_df)` $= \color{#6A5ACD}{p}$ - `x_df[1,]` $\left( \color{#e64173}{i = 1} \right)$; `x_df[,1]` $\left( \color{#6A5ACD}{j = 1} \right)$ --- name: notation-outcomes ## Outcomes In supervised settings, we will denote our .hi-orange[outcome variable] as $\color{#FFA500}{\mathbf{y}}$. .note[Synonyms] *output*, *outcome*, *dependent/response variable*, ... -- The .orange[outcome] for our .pink[i.super[th]] observation is $\color{#FFA500}{y}_{\color{#e64173}{i}}$. Together the $\color{#e64173}{n}$ observations form $$ \begin{align} \color{#FFA500}{\mathbf{y}} = \begin{bmatrix} y_{\color{#e64173}{1}} \\ y_{\color{#e64173}{2}} \\ \vdots \\ y_{\color{#e64173}{n}} \end{bmatrix} \end{align} $$ -- and our full dataset is composed of $\bigg\{ \left( x_{\color{#e64173}{1}},\color{#FFA500}{y}_{\color{#e64173}{1}} \right),\, \left( x_{\color{#e64173}{2}},\color{#FFA500}{y}_{\color{#e64173}{2}} \right),\, \ldots,\, \left( x_{\color{#e64173}{n}},\color{#FFA500}{y}_{\color{#e64173}{n}} \right) \bigg\}$. --- layout: false class: clear, middle Back to the problem of (supervised) statistical learning... 
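---

# Notation

## A quick check in .mono[R]

Before moving on, the notation from the previous slides can be sketched in .mono[R]. This is only an illustration: the objects `x_df`, `y`, and `full_df` hold small, made-up values (they are not course data) and simply stand in for $\mathbf{X}$, $\mathbf{y}$, and the full dataset.

```r
# Hypothetical example data: n = 5 observations on p = 2 predictors
x_df = data.frame(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(10, 20, 30, 40, 50)
)
# Hypothetical outcome vector (one value per observation)
y = c(0.5, 1.5, 2.5, 3.5, 4.5)

# Dimensions of X: number of observations (n) and of variables (p)
n = nrow(x_df)
p = ncol(x_df)

# Observation i = 1: a single row containing p values
x_i = x_df[1, ]
# Variable j = 1: a vector of length n
x_j = x_df[, 1]

# The full dataset pairs each observation's predictors with its outcome
full_df = cbind(x_df, y = y)
```

Note that `x_df[1, ]` returns a one-row data frame, while `x_df[, 1]` returns a plain vector; that difference comes from how .mono[R] subsets data frames, not from the notation.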
--- layout: true # Statistical learning --- name: sl-goal ## The goal As defined before, we want to *learn* a model to understand our data. -- 1. Take our (numeric) .orange[output] $\color{#FFA500}{\mathbf{y}}$. 2. Imagine there is a .turquoise[function] $\color{#20B2AA}{f}$ that takes .purple[inputs] $\color{#6A5ACD}{\mathbf{X}} = \color{#6A5ACD}{\mathbf{x}_1}, \ldots, \color{#6A5ACD}{\mathbf{x}_p}$
and maps them, plus a random, mean-zero .pink[error term] $\color{#e64173}{\varepsilon}$, to the .orange[output]. $$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f} \! \left( \color{#6A5ACD}{\mathbf{X}} \right) + \color{#e64173}{\varepsilon}$$ -- .qa[Q] What is $\color{#20B2AA}{f}$? --
.qa[A] .note[ISL:] $\color{#20B2AA}{f}$ represents the *systematic* information that $\color{#6A5ACD}{\mathbf{X}}$ provides about $\color{#FFA500}{\mathbf{y}}$. -- .qa[Q] How else can you describe $\color{#20B2AA}{f}$? --- ## Our missing $f$ $$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f} \! \left( \color{#6A5ACD}{\mathbf{X}} \right) + \color{#e64173}{\varepsilon}$$ .qa[Q] $\color{#20B2AA}{f}$ is unknown (as is $\color{#e64173}{\varepsilon}$). What should we do? --
.qa[A] Use the observed data to learn/estimate $\color{#20B2AA}{f}(\cdot)$, _i.e._, construct $\widehat{\color{#20B2AA}{f}}$..super[.pink[†]] .footnote[ .pink[†] More notation: hats $\left( \hat{} \right)$ are estimators/estimates. ] -- .qa[Q] Okay. How? --
.qa[A] .it[How do I estimate] $\color{#20B2AA}{f}$.it[?] is one way to phrase *all questions* that underlie statistical learning: model selection, cross validation, evaluation, *etc.* -- All of the techniques, algorithms, and tools of statistical learning attempt to accurately recover $\color{#20B2AA}{f}$ based upon the setting's goals/limitations. -- .grey-light[You'll have to wait on any real/specific answers...] --- ## Learning from $\hat{f}$ There are two main reasons we want to learn about $\color{#20B2AA}{f}$: 1. .hi-slate[*Causal* inference settings] How do changes in $\color{#6A5ACD}{\mathbf{X}}$ affect $\color{#FFA500}{\mathbf{y}}$?
.grey-light[The territory of .mono[EC523] and .mono[EC525].] -- 1. .hi-slate[Prediction problems] Predict $\color{#FFA500}{\mathbf{y}}$ using our estimated $\color{#20B2AA}{f}$, _i.e._, $$\hat{\color{#FFA500}{\mathbf{y}}} = \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}})$$ our *black-box setting* where we care less about $\color{#20B2AA}{f}$ than $\hat{\color{#FFA500}{\mathbf{y}}}$..super[.pink[†]] .footnote[ .pink[†] You shouldn't actually treat your prediction methods as total black boxes. ] -- Similarly, in causal-inference settings, we don't particularly care about $\hat{\color{#FFA500}{\mathbf{y}}}$. --- name: sl-prediction ## Prediction errors As tends to be the case in life, you will make errors in predicting $\color{#FFA500}{\mathbf{y}}$. The accuracy of $\hat{\color{#FFA500}{\mathbf{y}}}$ depends upon .hi-slate[two errors]: -- 1. .hi-slate[Reducible error] The error due to $\hat{\color{#20B2AA}{f}}$ imperfectly estimating $\color{#20B2AA}{f}$.
*Reducible* in the sense that we could improve $\hat{\color{#20B2AA}{f}}$. -- 1. .hi-slate[Irreducible error] The error component that is outside of the model $\color{#20B2AA}{f}$.
*Irreducible* because we defined an error term $\color{#e64173}{\varepsilon}$ unexplained by $\color{#20B2AA}{f}$. -- .note[Note] As its name implies, you can't get rid of .it[irreducible] error, but we can try to get rid of .it[reducible] errors. --- ## Prediction errors Why we're stuck with .it[irreducible] error $$ \begin{aligned} \mathop{E}\left[ \left\{ \color{#FFA500}{\mathbf{y}} - \hat{\color{#FFA500}{\mathbf{y}}} \right\}^2 \right] &= \mathop{E}\left[ \left\{ \color{#20B2AA}{f}(\color{#6A5ACD}{\mathbf{X}}) + \color{#e64173}{\varepsilon} - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{\mathbf{X}}) \right\}^2 \right] \\ &= \underbrace{\left[ \color{#20B2AA}{f}(\color{#6A5ACD}{\mathbf{X}}) - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{\mathbf{X}}) \right]^2}_{\text{Reducible}} + \underbrace{\mathop{\text{Var}} \left( \color{#e64173}{\varepsilon} \right)}_{\text{Irreducible}} \end{aligned} $$ In less math: - If $\color{#e64173}{\varepsilon}$ exists, then $\color{#6A5ACD}{\mathbf{X}}$ cannot perfectly explain $\color{#FFA500}{\mathbf{y}}$. - So even if $\hat{\color{#20B2AA}{f}} = \color{#20B2AA}{f}$, we still have irreducible error. -- Thus, to form our .hi-slate[best predictors], we will .hi-slate[minimize reducible error]. --- name: sl-parameters ## Which type of $\hat{f}$? Once you have your .purple[inputs] $\left(\color{#6A5ACD}{\mathbf{X}} \right)$ and .orange[output] $\left( \color{#FFA500}{\mathbf{y}} \right)$ data, you still need to decide how parametric your $\hat{\color{#20B2AA}{f}}$ should be..super[.pink[†]] .footnote[ .pink[†] I'm saying "how parametric" b/c some methods are much more parametric than others. ] -- .hi-slate[Parametric methods] assume a functional form for $\color{#20B2AA}{f}$ and typically involve two steps: 1. Select a functional form (shape) to represent $\color{#20B2AA}{f}$. 2. Train your selected model on your data $\color{#FFA500}{\mathbf{y}}$ and $\color{#6A5ACD}{\mathbf{X}}$. 
-- .hi-slate[Non-parametric methods] avoid explicit assumptions about the shape of $\color{#20B2AA}{f}$.
Attempt to .pink[flexibly fit] the data, while trying to .pink[avoid overfitting]. --- ## Which type of $\hat{f}$? Methods' parametric assumptions come with tradeoffs. .hi-slate[Parametric methods]
 .pink.mono[+] Simpler to estimate and interpret.
 .purple.mono[-] If assumed functional form is bad, model performance will suffer. .hi-slate[Non-parametric methods]
 .pink.mono[+] Fewer assumptions. More flexibility.
 .purple.mono[-] Lower interpretability. Susceptible to overfitting. Want lots of data. --- layout: true class: clear, middle --- .hi-slate[Example:] Let's start with a pretty funky, nonlinear function. --- exclude: true ```{R, ex data, cache = T} # Sample size n = 70 # Set seed set.seed(12345) # Define function f = function(x1, x2, e) x1 + x2 - x1 * x2 + (x1 > x2) * x1 + (x1 < x2) * x2^2 + e # Generate data sample_df = tibble( x1 = runif(n = n, max = 10), x2 = runif(n = n, max = 10), e = rnorm(n = n, sd = 1), y = f(x1, x2, e) ) # Estimate linear-regression model est_lm = lm(y ~ x1 * x2, data = sample_df) # Estimate kNN models: k=10,5,1 est_knn10 = knnreg( y = sample_df$y, x = sample_df[, c("x1", "x2")], k = 10 ) est_knn5 = knnreg( y = sample_df$y, x = sample_df[, c("x1", "x2")], k = 5 ) est_knn1 = knnreg( y = sample_df$y, x = sample_df[, c("x1", "x2")], k = 1 ) # Add predictions sample_df %<>% mutate( y_lm = est_lm$fitted.values, y_knn10 = predict(est_knn10, newdata = sample_df[, c("x1", "x2")]), y_knn5 = predict(est_knn5, newdata = sample_df[, c("x1", "x2")]), y_knn1 = predict(est_knn1, newdata = sample_df[, c("x1", "x2")]) ) # Fit a linear-regression model # True data frame truth_df = tibble(x1 = seq(0, 10, 0.1), x2 = seq(0, 10, 0.1)) %>% expand(x1, x2) truth_df %<>% mutate( y = f(x1, x2, 0), y_lm = predict(est_lm, newdata = truth_df), y_knn10 = predict(est_knn10, newdata = truth_df[, c("x1", "x2")]), y_knn5 = predict(est_knn5, newdata = truth_df[, c("x1", "x2")]), y_knn1 = predict(est_knn1, newdata = truth_df[, c("x1", "x2")]) ) # Find range of x, y, and prediction errors range_x = c(0,10) range_y = c( min( sample_df %>% select(starts_with("y")), truth_df %>% select(starts_with("y")) ), max( sample_df %>% select(starts_with("y")), truth_df %>% select(starts_with("y")) ) ) range_error = c( min(sample_df %>% transmute(y - y_lm, y - y_knn10, y - y_knn5, y - y_knn1)), max(sample_df %>% transmute(y - y_lm, y - y_knn10, y - y_knn5, y - y_knn1)) ) ``` --- name: 
ex-truth .hi-slate[Truth:] The (nonlinear) $f(\mathbf{X})$ that we hope to recover. ```{R, ex truth, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # True 'f(X)' (surface) plot_ly( x = unique(truth_df$x1), y = unique(truth_df$x2), z = matrix(data = truth_df$y, ncol = sqrt(nrow(truth_df))), colors = colorRamp(viridis::magma(8)), cmin = range_y[1], cmax = range_y[2] ) %>% add_surface() %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "y", range = range_y) )) %>% hide_colorbar() ``` --- .hi-slate[The sample:] $n=70$ randomly drawn observations for $\mathbf{y} = f(\mathbf{x}_1,\, \mathbf{x}_2) + \varepsilon$ ```{R, ex sample, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # Sample observations (3d scatter) plot_ly( type = "scatter3d", x = sample_df$x1, y = sample_df$x2, z = sample_df$y, mode = "markers", color = sample_df$y, colors = colorRamp(viridis::magma(8)), ) %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "y", range = range_y) )) %>% hide_colorbar() ``` --- name: ex-lm .hi-slate[Estimated linear-regression model:] $\hat{\mathbf{y}} = \hat\beta_0 + \hat\beta_1 \mathbf{x}_1 + \hat\beta_2 \mathbf{x}_2 + \hat\beta_3 \mathbf{x}_1 \mathbf{x}_2$ ```{R, ex lm, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # Linear regression estimate (surface) plot_ly( x = unique(truth_df$x1), y = unique(truth_df$x2), z = matrix(data = truth_df$y_lm, ncol = sqrt(nrow(truth_df))), colors = colorRamp(viridis::magma(8)), cmin = range_y[1], cmax = range_y[2] ) %>% add_surface() %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "y", range = range_y) )) %>% colorbar(limits = range_y) %>% hide_colorbar() ``` --- .hi-slate[Prediction error] from our fitted linear regression model ```{R, ex lm 
errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # Regression error (3d scatter) plot_ly( type = "scatter3d", x = sample_df$x1, y = sample_df$x2, z = sample_df$y - sample_df$y_lm, mode = "markers", color = sample_df$y - sample_df$y_lm, colors = colorRamp(viridis::magma(8)) ) %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "error", range = range_error) )) %>% colorbar(limits = range_error) %>% hide_colorbar() ``` --- name: ex-knn .hi-slate[k-nearest neighbors] (kNN) using k=5 .grey-light[(a *non-parametric* method)] ```{R, ex knn5, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # knn model (k = 5) (surface) plot_ly( x = unique(truth_df$x1), y = unique(truth_df$x2), z = matrix(data = truth_df$y_knn5, ncol = sqrt(nrow(truth_df))), colors = colorRamp(viridis::magma(8)), cmin = range_y[1], cmax = range_y[2] ) %>% add_surface() %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "y", range = range_y) )) %>% hide_colorbar() ``` --- .hi-slate[k-nearest neighbors] (kNN) using k=10 .grey-light[(notice increased smoothness)] ```{R, ex knn10, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # knn model (k = 10) (surface) plot_ly( x = unique(truth_df$x1), y = unique(truth_df$x2), z = matrix(data = truth_df$y_knn10, ncol = sqrt(nrow(truth_df))), colors = colorRamp(viridis::magma(8)), cmin = range_y[1], cmax = range_y[2] ) %>% add_surface() %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "y", range = range_y) )) %>% hide_colorbar() ``` --- .hi-slate[k-nearest neighbors] (kNN) using k=1 .grey-light[(notice decreased smoothness)] ```{R, ex knn1, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # knn model (k = 1) (surface) plot_ly( x = unique(truth_df$x1), y = 
unique(truth_df$x2), z = matrix(data = truth_df$y_knn1, ncol = sqrt(nrow(truth_df))), colors = colorRamp(viridis::magma(8)), cmin = range_y[1], cmax = range_y[2] ) %>% add_surface() %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "y", range = range_y) )) %>% hide_colorbar() ``` --- .hi-slate[Prediction error] from our fitted kNN (k=5) model ```{R, ex knn 5 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # kNN 5 error (3d scatter) plot_ly( type = "scatter3d", x = sample_df$x1, y = sample_df$x2, z = sample_df$y - sample_df$y_knn5, mode = "markers", color = sample_df$y - sample_df$y_knn5, colors = colorRamp(viridis::magma(8)) ) %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "error", range = range_error) )) %>% colorbar(limits = range_error) %>% hide_colorbar() ``` --- .hi-slate[Prediction error] from our fitted kNN (k=10) model ```{R, ex knn 10 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # kNN 10 error (3d scatter) plot_ly( type = "scatter3d", x = sample_df$x1, y = sample_df$x2, z = sample_df$y - sample_df$y_knn10, mode = "markers", color = sample_df$y - sample_df$y_knn10, colors = colorRamp(viridis::magma(8)) ) %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "error", range = range_error) )) %>% colorbar(limits = range_error) %>% hide_colorbar() ``` --- .hi-slate[Prediction error] from our fitted kNN (k=1) model ```{R, ex knn 1 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # kNN 1 error (3d scatter) plot_ly( type = "scatter3d", x = sample_df$x1, y = sample_df$x2, z = sample_df$y - sample_df$y_knn1, mode = "markers", color = sample_df$y - sample_df$y_knn1, colors = colorRamp(viridis::magma(8)) ) %>% layout(scene = list( xaxis = list(title 
= "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "error", range = range_error) )) %>% colorbar(limits = range_error) %>% hide_colorbar() ``` --- .note[Recall] .hi-slate[Prediction error] from our fitted linear regression model ```{R, ex lm errors again, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"} # Regression error (3d scatter) plot_ly( type = "scatter3d", x = sample_df$x1, y = sample_df$x2, z = sample_df$y - sample_df$y_lm, mode = "markers", color = sample_df$y - sample_df$y_lm, colors = colorRamp(viridis::magma(8)) ) %>% layout(scene = list( xaxis = list(title = "x1", range = range_x), yaxis = list(title = "x2", range = range_x), zaxis = list(title = "error", range = range_error) )) %>% colorbar(limits = range_error) %>% hide_colorbar() ``` --- layout: true # Model accuracy --- name: accuracy-questions ## Questions 1. Which of the methods was the most flexible? Inflexible? 1. Why do you think kNN with k=1 had such low prediction error? 1. How could we (better) assess model/predictive performance? 1. Why would we ever want to choose a less flexible model? --- ## Measurement You probably will not be surprised to know that there is no one-size-fits-all solution in statistical learning. .qa[Q] How do we choose between competing models? -- .qa[A] We're a few steps away, but before we do anything, we need a way to .hi-slate[define model performance]. --- name: accuracy-subtlety ## Subtlety Defining performance can actually be quite tricky... .note[Regression setting, 1] Which do you prefer? 1. Lots of little errors and a few really large errors. 1. Medium-sized errors for everyone. .note[Regression setting, 2] Is a 1-unit error (*e.g.*, $1,000) equally bad for everyone? --- ## Subtlety Defining performance can actually be quite tricky... .note[Classification setting, 1] Which is worse? 1. False positive (*e.g.*, incorrectly diagnosing cancer) 1. 
False negative (*e.g.*, missing cancer) .note[Classification setting, 2] Which is more important? 1. True positive (*e.g.*, correct diagnosis of cancer) 1. True negative (*e.g.*, correct diagnosis of "no cancer") --- name: mse ## MSE .attn[Mean squared error (MSE)] is the most common.super[.pink[†]] way to measure model performance in a regression setting. .footnote[ .pink[†] *Most common* does not mean best—it just means lots of people use it. ] $$\text{MSE} = \dfrac{1}{n} \sum_{i=1}^n \left[ \color{#FFA500}{y}_i - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{x}_i) \right]^2$$ .note[Recall:] $\color{#FFA500}{y}_i - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{x}_i) = \color{#FFA500}{y}_i - \hat{\color{#FFA500}{y}}_i$ is our prediction error. -- Two notes about MSE 1. MSE will be (relatively) very small when .hi-slate[prediction error] is nearly zero. 1. MSE .hi-slate[penalizes] big errors more than little errors (the squared part). --- name: training-testing ## Training or testing? Low MSE (accurate performance) on the data that trained the model isn't actually impressive—maybe the model is just overfitting our data..super[.pink[†]] .footnote[ .pink[†] Recall the kNN performance for k=1. ] .note[What we want:] How well does the model perform .hi-slate[on data it has never seen]? -- This introduces an important distinction: 1. .hi-slate[Training data]: The observations $(\color{#FFA500}{y}_i,\color{#e64173}{x}_i)$ used to .hi-slate[train] our model $\hat{\color{#20B2AA}{f}}$. 1. .hi-slate[Testing data]: The observations $(\color{#FFA500}{y}_0,\color{#e64173}{x}_0)$ that our model has yet to see—and which we can use to evaluate the performance of $\hat{\color{#20B2AA}{f}}$. -- .hi-slate[Real goal: Low test-sample MSE] (not the training MSE from before). 
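---

## Training or testing?

The gap between training and testing MSE can be sketched with a small simulation. This is a minimal illustration under stated assumptions, not the example from earlier in the lecture: the data-generating process (`sin(x)` plus noise), the 150/50 split, and the helper functions `mse()` and `knn_predict()` are all hypothetical and written in base .mono[R] only.

```r
# Hypothetical simulated data (not the lecture's example data)
set.seed(123)
n = 200
sim_df = data.frame(x = runif(n, min = 0, max = 10))
sim_df$y = sin(sim_df$x) + rnorm(n, sd = 0.5)

# Split: the first 150 rows train the model; the last 50 are held out
train_df = sim_df[1:150, ]
test_df = sim_df[151:200, ]

# Mean squared error: the average squared prediction error
mse = function(y, y_hat) mean((y - y_hat)^2)

# Bare-bones kNN regression (illustrative helper):
# predict with the mean outcome of the k nearest training points
knn_predict = function(x_new, x_train, y_train, k) {
  sapply(x_new, function(x0) {
    nearest = order(abs(x_train - x0))[1:k]
    mean(y_train[nearest])
  })
}

# Training and testing MSE for k = 1 and k = 10
mse_df = data.frame(k = c(1, 10), mse_train = NA_real_, mse_test = NA_real_)
for (r in seq_len(nrow(mse_df))) {
  k = mse_df$k[r]
  pred_train = knn_predict(train_df$x, train_df$x, train_df$y, k)
  pred_test = knn_predict(test_df$x, train_df$x, train_df$y, k)
  mse_df$mse_train[r] = mse(train_df$y, pred_train)
  mse_df$mse_test[r] = mse(test_df$y, pred_test)
}
mse_df
```

With k=1, each training observation is its own nearest neighbor, so the training MSE is exactly zero, while the testing MSE is not. That is why training MSE alone says little about how the model performs on new data.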
--- layout: false class: clear, middle .hi-slate[Next time:] model performance, the bias-variance tradeoff, and kNN --- name: sources layout: false # Sources These notes draw upon - [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)
James, Witten, Hastie, and Tibshirani - [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
Jake VanderPlas I pulled the comic from [Twitter](https://twitter.com/athena_schools/status/1063013435779223553/photo/1). --- # Table of contents .col-left[ .smallest[ #### Admin - [Today](#admin-today) - [Upcoming](#admin-soon) #### Statistical learning - [Definition](#sl-definition) - [Classes](#sl-classes) #### Notation - [Source](#notation-source) - [Data](#notation-data) - [Dimensions of $\mathbf{X}$](#notation-dimensions) - [Outcomes](#notation-outcomes) #### Statistical learning, continued - [The goal](#sl-goal) - [Prediction](#sl-prediction) - [Parameterization](#sl-parameters) ] ] .col-right[ .smallest[ #### Example - [Data-generating process (truth)](#ex-truth) - [Regression model](#ex-lm) - [kNN model](#ex-knn) #### Model accuracy - [Questions](#accuracy-questions) - [Subtlety](#accuracy-subtlety) - [MSE](#mse) - [Training *vs.* testing](#training-testing) #### Other - [Sources/references](#sources) ] ] --- exclude: true ```{R, save pdfs, include = F, eval = F} system("`npm bin`/decktape remark 001-slides.html 001-slides.pdf --chrome-arg=--allow-file-access-from-files --slides 1-100") ```