class: center, middle, inverse, title-slide

.title[
# Lecture .mono[003]
]
.subtitle[
## Resampling
]
.author[
### Edward Rubin
]

---
exclude: true

---
layout: true
# Admin

---
class: inverse, middle

---
name: admin-today
## Class today

.b[Review]
- [Regression and loss](#review-loss-functions)
- [Classification](#review-classification)
- [KNN](#review-knn)
- [The bias-variance tradeoff](#review-bias-variance)

.b[Resampling methods]
- Cross validation 🚸
- The bootstrap 👢

---
name: admin-soon
# Admin

## Upcoming

.b[Readings]

Today: .it[ISL] Ch. 5

.b[Problem set]

---
layout: true
# Review

---
class: inverse, middle

---
name: review-loss-functions
## Regression and loss

For .b[regression settings], the loss is our .pink[prediction]'s distance from .orange[truth], _i.e._,

$$
`\begin{align}
  \text{error}_i = \color{#FFA500}{y_i} - \color{#e64173}{\hat{y}_i} && \text{loss}_i = \big| \color{#FFA500}{y_i} - \color{#e64173}{\hat{y}_i} \big| = \big| \text{error}_i \big|
\end{align}`
$$

Depending upon our ultimate goal, we choose .b[loss/objective functions].

$$
`\begin{align}
  \text{L1 loss} = \sum_i \big| \color{#FFA500}{y_i} - \color{#e64173}{\hat{y}_i} \big| &&&& \text{MAE} = \dfrac{1}{n}\sum_i \big| \color{#FFA500}{y_i} - \color{#e64173}{\hat{y}_i} \big| \\
  \text{L2 loss} = \sum_i \left( \color{#FFA500}{y_i} - \color{#e64173}{\hat{y}_i} \right)^2 &&&& \text{MSE} = \dfrac{1}{n}\sum_i \left( \color{#FFA500}{y_i} - \color{#e64173}{\hat{y}_i} \right)^2 \\
\end{align}`
$$

Whatever we're using, we care about .hi[test performance] (_e.g._, test MSE), rather than training performance.

---
name: review-classification
## Classification

For .b[classification problems], we often use the .hi[test error rate].

$$
`\begin{align}
  \dfrac{1}{n} \sum_{i=1}^{n} \mathop{\mathbb{I}}\left( \color{#FFA500}{y_i} \neq \color{#e64173}{\hat{y}_i} \right)
\end{align}`
$$

The .b[Bayes classifier]

1. predicts class `\(\color{#e64173}{j}\)` when `\(\mathop{\text{Pr}}\left(\color{#FFA500}{y_0} = \color{#e64173}{j} \big | \color{#6A5ACD}{\mathbf{X}} = \mathbf{x}_0 \right)\)` exceeds all other classes' probabilities.

2. produces the .b[Bayes decision boundary]—the decision boundary with the lowest test error rate.

3. is unknown: we must predict `\(\mathop{\text{Pr}}\left(\color{#FFA500}{y_0} = \color{#e64173}{j} \big | \color{#6A5ACD}{\mathbf{X}} = \mathbf{x}_0 \right)\)`.

---
name: review-knn
## KNN

.b[K-nearest neighbors] (KNN) is a non-parametric method for estimating

$$
`\begin{align}
  \mathop{\text{Pr}}\left(\color{#FFA500}{y_0} = \color{#e64173}{j} \big | \color{#6A5ACD}{\mathbf{X}} = \mathbf{x}_0 \right)
\end{align}`
$$

that makes a prediction using the most-common class among an observation's "nearest" K neighbors.

- .b[Low values of K] (_e.g._, 1) are extremely flexible but tend to overfit (increase variance).

- .b[Large values of K] (_e.g._, N) are very inflexible—essentially making the same prediction for each observation.

The .it[optimal] value of K will trade off between overfitting and accuracy.
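
---
## KNN in R

A quick sketch of KNN in R using the `class` package; the simulated data and object names here are just placeholders for illustration.

```r
# A rough sketch: KNN with a very flexible (K = 1) and a very inflexible (K = 100) choice
library(class)
set.seed(123)
# Simulate a training set: two features and a binary class
n <- 200
x_train <- matrix(rnorm(n * 2), ncol = 2)
y_train <- factor(x_train[, 1] + rnorm(n) > 0)
# Simulate a test set from the same process
x_test <- matrix(rnorm(n * 2), ncol = 2)
y_test <- factor(x_test[, 1] + rnorm(n) > 0)
# Predict test-set classes for each choice of K
yhat_k1 <- knn(train = x_train, test = x_test, cl = y_train, k = 1)
yhat_k100 <- knn(train = x_train, test = x_test, cl = y_train, k = 100)
# Compare test error rates for the two choices of K
mean(yhat_k1 != y_test)
mean(yhat_k100 != y_test)
```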

---
name: review-bias-variance
## The bias-variance tradeoff

Finding the optimal level of flexibility highlights the .hi-pink[bias]-.hi-purple[variance] .b[tradeoff].

.hi-pink[Bias] The error that comes from inaccurately estimating `\(\color{#20B2AA}{f}\)`.

- More flexible models are better equipped to recover complex relationships `\(\left( \color{#20B2AA}{f} \right)\)`, reducing bias. (Real life is seldom linear.)

- Simpler (less flexible) models typically increase bias.

.hi-purple[Variance] The amount `\(\hat{\color{#20B2AA}{f}}\)` would change with a different .hi-slate[training sample].

- If new .hi-slate[training sets] drastically change `\(\hat{\color{#20B2AA}{f}}\)`, then we have a lot of uncertainty about `\(\color{#20B2AA}{f}\)` (and, in general, `\(\hat{\color{#20B2AA}{f}} \not\approx \color{#20B2AA}{f}\)`).

- More flexible models generally add variance to `\(\hat{\color{#20B2AA}{f}}\)`.

---
## The bias-variance tradeoff

The expected value.super[.pink[†]] of the .hi-pink[test MSE] can be written

$$
`\begin{align}
  \mathop{E}\left[ \left(\color{#FFA500}{\mathbf{y_0}} - \mathop{\hat{\color{#20B2AA}{f}}}\left(\color{#6A5ACD}{\mathbf{X}_0}\right) \right)^2 \right] = \underbrace{\mathop{\text{Var}} \left( \mathop{\hat{\color{#20B2AA}{f}}}\left(\color{#6A5ACD}{\mathbf{X}_0}\right) \right)}_{\text{Variance}} + \underbrace{\left[ \text{Bias}\left( \mathop{\hat{\color{#20B2AA}{f}}}\left(\color{#6A5ACD}{\mathbf{X}_0}\right) \right) \right]^2}_{\text{Bias}} + \underbrace{\mathop{\text{Var}} \left( \varepsilon \right)}_{\text{Irr. error}}
\end{align}`
$$

.b[The tradeoff] in terms of model flexibility

- Increasing flexibility .it[from total inflexibility] generally .b[reduces bias more] than it increases variance (reducing test MSE).

- At some point, the marginal benefits of flexibility .b[equal] marginal costs.

- Past this point (optimal flexibility), we .b[increase variance more] than we reduce bias (increasing test MSE).

---
layout: false
class: clear, middle

.hi[U-shaped test MSE] with respect to model flexibility (KNN here).
<br>Increases in variance eventually overcome reductions in (squared) bias.

<img src="slides_files/figure-html/review-bias-variance-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Resampling methods

---
class: inverse, middle

---
name: resampling-intro
## Intro

.hi[Resampling methods] help us understand uncertainty in statistical modeling.

--

- .ex[Ex.] .it[Linear regression:] How precise is your `\(\hat{\beta}_1\)`?
- .ex[Ex.] .it[With KNN:] Which K minimizes (out-of-sample) test MSE?

--

The process behind the magic of resampling methods:

1. .b[Repeatedly draw samples] from the .b[training data].
1. .b[Fit your model](s) on each random sample.
1. .b[Compare] model performance (or estimates) .b[across samples].
1. Infer the .b[variability/uncertainty in your model] from (3).

--

.note[Warning.sub[1]] Resampling methods can be computationally intensive.
<br>.note[Warning.sub[2]] Certain methods don't work in certain settings.

---
## Today

Let's distinguish between two important .b[modeling tasks:]

- .hi-purple[Model selection] Choosing and tuning a model

- .hi-purple[Model assessment] Evaluating a model's accuracy

--

We're going to focus on two common .b[resampling methods:]

1. .hi[Cross validation] used to estimate test error, evaluating performance or selecting a model's flexibility

1. .hi[Bootstrap] used to assess accuracy—parameter estimates or methods

---
name: resampling-holdout
## Hold out

.note[Recall:] We want to find the model that .b[minimizes out-of-sample test error].

If we have a large test dataset, we can use it (once).

.qa[Q.sub[1]] What if we don't have a test set?
<br>.qa[Q.sub[2]] What if we need to select and train a model?
<br>.qa[Q.sub[3]] How can we avoid overfitting our training.super[.pink[†]] data during model selection?

.footnote[
.normal[.pink[†]] Also relevant for .it[testing] data.
]

--

.qa[A.sub[1,2,3]] .b[Hold-out methods] (_e.g._, cross validation) use training data to estimate test performance—.b[holding out] a mini "test" sample of the training data that we use to estimate the test error.

---
name: resampling-validation
layout: true
# Hold-out methods
## Option 1: The .it[validation set] approach

To estimate the .hi-pink[test error], we can .it[hold out] a subset of our .hi-purple[training data] and then .hi-slate[validate] (evaluate) our model on this held-out .hi-slate[validation set].

- The .hi-slate[validation error rate] estimates the .hi-pink[test error rate].
- The model only "sees" the non-validation subset of the .hi-purple[training data].

---

---

<img src="slides_files/figure-html/plot-validation-set-1.svg" style="display: block; margin: auto;" />

.col-left[.hi-purple[Initial training set]]

---

<img src="slides_files/figure-html/plot-validation-set-2-1.svg" style="display: block; margin: auto;" />

.col-left[.hi-slate[Validation (sub)set]]
.col-right[.hi-purple[Training set:] .purple[Model training]]

---

<img src="slides_files/figure-html/plot-validation-set-3-1.svg" style="display: block; margin: auto;" />

.col-left[.hi-slate[Validation (sub)set]]
.col-right[.hi-purple[Training set:] .purple[Model training]]

---
layout: true
# Hold-out methods
## Option 1: The .it[validation set] approach

---

.ex[Example] We could use the validation-set approach to help select the degree of a polynomial for a linear-regression model ([Kaggle](https://www.kaggle.com/edwardarubin/ec524-lecture-003/)).

--

The goal of the validation set is to .hi-pink[.it[estimate] out-of-sample (test) error.]

.qa[Q] So what?

--

- Estimates come with .b[uncertainty]—varying from sample to sample.

- Variability (standard errors) is larger with .b[smaller samples].

.qa[Problem] This estimated error is based upon a fairly small sample (often < 30% of our training data), so its variance can be large.

---
exclude: true

---
name: validation-simulation
layout: false
class: clear, middle

.b[Validation MSE] for 10 different validation samples

<img src="slides_files/figure-html/plot-vset-sim-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: clear, middle

.b[True test MSE] compared to validation-set estimates

<img src="slides_files/figure-html/plot-vset-sim-2-1.svg" style="display: block; margin: auto;" />

---
# Hold-out methods
## Option 1: The .it[validation set] approach

Put differently: The validation-set approach has (≥) two major drawbacks:

1. .hi[High variability] Which observations are included in the validation set can greatly affect the validation MSE.

2. .hi[Inefficiency in training our model] We're essentially throwing away the validation data when training the model—"wasting" observations.

--

(2) ⟹ validation MSE may overestimate test MSE.

Even if the validation-set approach provides an unbiased estimator for test error, it is likely a pretty noisy estimator.
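
---
# Hold-out methods
## Option 1: The .it[validation set] approach

A rough sketch of how the validation-set approach might look in R; the dataset `train_dt` and the model `y ~ x` are placeholders for illustration.

```r
# A rough sketch: estimate test MSE with a single validation split
set.seed(123)
n <- nrow(train_dt)
# Randomly hold out 30% of the training data as a validation set
v <- sample(seq_len(n), size = round(0.3 * n))
validation_dt <- train_dt[v, ]
training_dt <- train_dt[-v, ]
# Train the model on the non-validation subset
est <- lm(y ~ x, data = training_dt)
# The validation MSE estimates the test MSE
mean((validation_dt$y - predict(est, newdata = validation_dt))^2)
```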

---
layout: true
# Hold-out methods
## Option 2: Leave-one-out cross validation

---
name: resampling-loocv

.hi[Cross validation] solves the validation-set method's main problems.

- Use more (= all) of the data for training (lower variability; less bias).
- Still maintains separation between training and validation subsets.

--

.hi[Leave-one-out cross validation] (LOOCV) is perhaps the cross-validation method most similar to the validation-set approach.

- Your validation set is exactly one observation.
- .note[New] You repeat the validation exercise for every observation.
- .note[New] Estimate MSE as the mean across all observations.

---
layout: true
# Hold-out methods
## Option 2: Leave-one-out cross validation

Each observation takes a turn as the .hi-slate[validation set],
<br>while the other n-1 observations get to .hi-purple[train the model].
<br>
<br>

---
exclude: true

---

<img src="slides_files/figure-html/plot-loocv-1-1.svg" style="display: block; margin: auto;" />

.slate[Observation 1's turn for validation produces MSE.sub[1]].

---

<img src="slides_files/figure-html/plot-loocv-2-1.svg" style="display: block; margin: auto;" />

.slate[Observation 2's turn for validation produces MSE.sub[2]].

---

<img src="slides_files/figure-html/plot-loocv-3-1.svg" style="display: block; margin: auto;" />

.slate[Observation 3's turn for validation produces MSE.sub[3]].

---

<img src="slides_files/figure-html/plot-loocv-4-1.svg" style="display: block; margin: auto;" />

.slate[Observation 4's turn for validation produces MSE.sub[4]].

---

<img src="slides_files/figure-html/plot-loocv-5-1.svg" style="display: block; margin: auto;" />

.slate[Observation 5's turn for validation produces MSE.sub[5]].

---

<img src="slides_files/figure-html/plot-loocv-n-1.svg" style="display: block; margin: auto;" />

.slate[Observation n's turn for validation produces MSE.sub[n]].

---
layout: true
# Hold-out methods
## Option 2: Leave-one-out cross validation

---

Because .hi-pink[LOOCV uses n-1 observations] to train the model,.super[.pink[†]] MSE.sub[i] (validation MSE from observation i) is approximately unbiased for test MSE.

.footnote[
.pink[†] And because often n-1 ≈ n.
]

.qa[Problem] MSE.sub[i] is a terribly noisy estimator for test MSE (albeit ≈unbiased).

--

<br>.qa[Solution] Take the mean!

$$
`\begin{align}
  \text{CV}_{(n)} = \dfrac{1}{n} \sum_{i=1}^{n} \text{MSE}_i
\end{align}`
$$

--

1. LOOCV .b[reduces bias] by using n-1 (almost all) observations for training.

2. LOOCV .b[resolves variance]: it makes all possible comparisons<br>(no dependence upon which training-validation split you make).

---
exclude: true

---
name: ex-loocv
layout: false
class: clear, middle

.b[True test MSE] and .hi-orange[LOOCV MSE] compared to .hi-purple[validation-set estimates]

<img src="slides_files/figure-html/plot-loocv-mse-1.svg" style="display: block; margin: auto;" />
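
---
# Hold-out methods
## Option 2: Leave-one-out cross validation

A rough sketch of LOOCV in R; as before, `train_dt` and `y ~ x` are placeholders for illustration.

```r
# A rough sketch: LOOCV for a linear-regression model
n <- nrow(train_dt)
# Each observation i takes a turn as the one-observation validation set
mse_i <- sapply(seq_len(n), function(i) {
  # Train on the other n-1 observations
  est <- lm(y ~ x, data = train_dt[-i, ])
  # Squared error for the held-out observation i
  (train_dt$y[i] - predict(est, newdata = train_dt[i, ]))^2
})
# CV_(n): the mean of the n validation MSEs
mean(mse_i)
```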

---
layout: true
# Hold-out methods
## Option 3: k-fold cross validation

---
name: resampling-kcv

Leave-one-out cross validation is a special case of a broader strategy:
<br>.hi[k-fold cross validation].

1. .b[Divide] the training data into `\(k\)` equally sized groups (folds).

2. .b[Iterate] over the `\(k\)` folds, treating each as a validation set once<br>(training the model on the other `\(k-1\)` folds).

3. .b[Average] the folds' MSEs to estimate test MSE.

--

Benefits?

--

1. .b[Less computationally demanding] (fit model `\(k=\)` 5 or 10 times; not `\(n\)` times).

--

2. .b[Greater accuracy] (in general) due to bias-variance tradeoff!

--

- Somewhat higher bias, relative to LOOCV: `\(n-1\)` *vs.* `\((k-1)n/k\)` training observations.

--

- Lower variance due to the high degree of correlation among LOOCV's MSE.sub[i].

--

🤯

---
exclude: true

---
layout: true
# Hold-out methods
## Option 3: k-fold cross validation

With `\(k\)`-fold cross validation, we estimate test MSE as

$$
`\begin{align}
  \text{CV}_{(k)} = \dfrac{1}{k} \sum_{i=1}^{k} \text{MSE}_{i}
\end{align}`
$$

---

<img src="slides_files/figure-html/plot-cvk-0a-1.svg" style="display: block; margin: auto;" />

Our `\(k=\)` 5 folds.

---

<img src="slides_files/figure-html/plot-cvk-0b-1.svg" style="display: block; margin: auto;" />

Each fold takes a turn at .hi-slate[validation]. The other `\(k-1\)` folds .hi-purple[train].

---

<img src="slides_files/figure-html/plot-cvk-1-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(1\)` as the .hi-slate[validation set] produces MSE.sub[k=1].

---

<img src="slides_files/figure-html/plot-cvk-2-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(2\)` as the .hi-slate[validation set] produces MSE.sub[k=2].

---

<img src="slides_files/figure-html/plot-cvk-3-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(3\)` as the .hi-slate[validation set] produces MSE.sub[k=3].

---

<img src="slides_files/figure-html/plot-cvk-4-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(4\)` as the .hi-slate[validation set] produces MSE.sub[k=4].

---

<img src="slides_files/figure-html/plot-cvk-5-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(5\)` as the .hi-slate[validation set] produces MSE.sub[k=5].

---
exclude: true

---
name: ex-cv-sim
layout: false
class: clear, middle

.b[Test MSE] .it[vs.] estimates: .orange[LOOCV], .pink[5-fold CV] (20x), and .purple[validation set] (10x)

<img src="slides_files/figure-html/plot-cv-mse-1.svg" style="display: block; margin: auto;" />
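
---
# Hold-out methods
## Option 3: k-fold cross validation

A rough sketch of 5-fold cross validation in R; again, `train_dt` and `y ~ x` are placeholders for illustration.

```r
# A rough sketch: k-fold CV for a linear-regression model
set.seed(123)
k <- 5
n <- nrow(train_dt)
# Randomly assign each observation to one of k (roughly) equally sized folds
fold <- sample(rep(1:k, length.out = n))
# Each fold takes a turn as the validation set
mse_k <- sapply(1:k, function(j) {
  # Train on the other k-1 folds
  est <- lm(y ~ x, data = train_dt[fold != j, ])
  # Validation MSE for fold j
  mean((train_dt$y[fold == j] - predict(est, newdata = train_dt[fold == j, ]))^2)
})
# CV_(k): the average of the k folds' MSEs
mean(mse_k)
```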

---
layout: false
class: clear, middle

.note[Note:] Each of these methods extends to classification settings, _e.g._, LOOCV

$$
`\begin{align}
  \text{CV}_{(n)} = \dfrac{1}{n} \sum_{i=1}^{n} \mathop{\mathbb{I}}\left( \color{#FFA500}{y_i} \neq \color{#e64173}{\hat{y}_i} \right)
\end{align}`
$$

---
name: holdout-caveats
layout: false
# Hold-out methods
## Caveat

So far, we've treated each observation as separate from (independent of) every other observation.

The methods that we've defined assume this .b.slate[independence].

--

Make sure that you think about

- the .b.slate[structure] of your data
- the .b.slate[goal] of the prediction exercise

.note[E.g.,]

1. Are you trying to predict the behavior of .b.purple[existing] or .b.pink[new] customers?

2. Are you trying to predict .b.purple[historical] or .b.pink[future] recessions?

---
layout: true
# The bootstrap

---
class: inverse, middle

---
name: boot-intro
## Intro

The .b[bootstrap] is a resampling method often used to quantify the uncertainty (variability) underlying an estimator or learning method.

.hi-purple[Hold-out methods]
- randomly divide the sample into training and validation subsets
- train and validate ("test") the model on each subset/division

.hi-pink[Bootstrapping]
- randomly samples .b[with replacement] from the original sample
- estimates the model on each of the .it[bootstrap samples]

---
## Intro

Estimating an estimate's standard error involves assumptions and theory..super[.pink[†]]

.footnote[
.pink[†] Recall the standard-error estimator for OLS.
]

There are times this derivation is difficult or even impossible, *e.g.*,

$$
`\begin{align}
  \mathop{\text{Var}}\left(\dfrac{\hat{\beta}_1}{1-\hat{\beta}_2}\right)
\end{align}`
$$

The bootstrap can help in these situations. Rather than deriving an estimator's variance, we use bootstrapped samples to build a distribution and then learn about the estimator's variance.

---
layout: false
class: clear, middle

## Intuition

.note[Idea:] Bootstrapping builds a distribution for the estimate using the variability embedded in the training sample.

---
exclude: true

---
layout: true
# The bootstrap

---
name: boot-graph
## Graphically

.thin-left[
`$$Z$$`

<img src="slides_files/figure-html/g1-boot0-1.svg" width="100%" style="display: block; margin: auto;" />

`$$\hat\beta = 0.653$$`

<img src="slides_files/figure-html/g2-boot0-1.svg" width="100%" style="display: block; margin: auto;" />
]

--

.thin-left[
`$$Z^{\star 1}$$`

<img src="slides_files/figure-html/g1-boot1-1.svg" width="100%" style="display: block; margin: auto;" />

`$$\hat\beta = -0.96$$`

<img src="slides_files/figure-html/g2-boot1-1.svg" width="100%" style="display: block; margin: auto;" />
]

--

.thin-left[
`$$Z^{\star 2}$$`

<img src="slides_files/figure-html/g1-boot2-1.svg" width="100%" style="display: block; margin: auto;" />

`$$\hat\beta = 0.968$$`

<img src="slides_files/figure-html/g2-boot2-1.svg" width="100%" style="display: block; margin: auto;" />
]

--

.left5[
<br><br><br>⋯
]

.thin-left[
`$$Z^{\star B}$$`

<img src="slides_files/figure-html/g1-boot3-1.svg" width="100%" style="display: block; margin: auto;" />

`$$\hat\beta = 0.978$$`

<img src="slides_files/figure-html/g2-boot3-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

Running this bootstrap 10,000 times

```r
# Load furrr (and future) for parallelized mapping
library(future)
library(furrr)
# Parallelize across 10 workers
plan(multisession, workers = 10)
# Set a seed
set.seed(123)
# Run the simulation 1e4 times
boot_df <- future_map_dfr(
  # Repeat sample size 100 for 1e4 times
  rep(n, 1e4),
  # Our function
  function(n) {
    # Estimates via bootstrap
    est <- lm(y ~ x, data = z[sample(1:n, n, replace = T), ])
    # Return a data frame of the bootstrap estimates
    data.frame(int = est$coefficients[1], coef = est$coefficients[2])
  },
  # Let furrr know we want to set a seed
  .options = furrr_options(seed = TRUE)
)
```

---
name: boot-ex
layout: false
class: clear, middle

<img src="slides_files/figure-html/boot-full-graph-1.png" style="display: block; margin: auto;" />

---
layout: true
# The bootstrap

---
## Comparison: Standard-error estimates

The .attn[bootstrapped standard error] of `\(\hat\alpha\)` is the standard deviation of the `\(\hat\alpha^{\star b}\)`

$$
`\begin{align}
  \mathop{\text{SE}_{B}}\left( \hat\alpha \right) = \sqrt{\dfrac{1}{B} \sum_{b=1}^{B} \left( \hat\alpha^{\star b} - \dfrac{1}{B} \sum_{\ell=1}^{B} \hat\alpha^{\star \ell} \right)^2}
\end{align}`
$$

.pink[This 10,000-sample bootstrap estimates] `\(\color{#e64173}{\mathop{\text{S.E.}}\left( \hat\beta_1 \right)\approx}\)` .pink[0.77.]

--

.purple[If we go the old-fashioned OLS route, we estimate 0.673.]

---
layout: false
class: clear, middle

<img src="slides_files/figure-html/boot-dist-graph-1.svg" style="display: block; margin: auto;" />
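
---
# The bootstrap
## Computing the bootstrapped standard error

A rough sketch of applying the `\(\mathop{\text{SE}_{B}}\)` formula to the simulation's output, assuming `boot_df` holds one bootstrap estimate (`coef`) per row.

```r
# A rough sketch: the bootstrapped SE is the standard deviation of the B estimates
B <- nrow(boot_df)
# Mean of the B bootstrap estimates
coef_bar <- mean(boot_df$coef)
# Apply the SE_B formula directly (1/B inside the square root)
sqrt(sum((boot_df$coef - coef_bar)^2) / B)
# Nearly identical: sd() uses 1/(B - 1) instead of 1/B
sd(boot_df$coef)
```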

---
layout: false
# Resampling
## Review

.hi-purple[Previous resampling methods]
- Split data into .hi-purple[subsets]: `\(n_v\)` validation and `\(n_t\)` training `\((n_v + n_t = n)\)`.
- Repeat estimation on each subset.
- Estimate the true test error (to help tune flexibility).

.hi-pink[Bootstrap]
- Randomly samples from training data .hi-pink[with replacement] to generate `\(B\)` "samples", each of size `\(n\)`.
- Repeat estimation on each bootstrap sample.
- Estimate the estimator's variance using the variability across the `\(B\)` samples.

---
name: sources
layout: false

# Sources

These notes draw upon

- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)<br>James, Witten, Hastie, and Tibshirani

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<br>Jake VanderPlas

---
layout: false
# Table of contents

.col-left[
.smallest[
#### Admin
- [Today](#admin-today)
- [Upcoming](#admin-soon)

#### Review
- [Regression and loss](#review-loss-functions)
- [Classification](#review-classification)
- [KNN](#review-knn)
- [The bias-variance tradeoff](#review-bias-variance)

#### Examples
- [Validation-set simulation](#validation-simulation)
- [LOOCV MSE](#ex-loocv)
- [k-fold CV](#ex-cv-sim)
]
]

.col-right[
.smallest[
#### Resampling
- [Intro](#resampling-intro)
- [Hold-out methods](#resampling-holdout)
- [Validation sets](#resampling-validation)
- [LOOCV](#resampling-loocv)
- [k-fold cross validation](#resampling-kcv)
- [The bootstrap](#boot-intro)
  - [Intro](#boot-intro)
  - [Graphically](#boot-graph)
  - [Example](#boot-ex)

#### Other
- [Sources/references](#sources)
]
]