class: center, middle, inverse, title-slide .title[ # Lecture .mono[004] ] .subtitle[ ## Regression strikes back ] .author[ ### Edward Rubin ] --- exclude: true --- layout: true # Admin --- class: inverse, middle --- name: admin-today ## Today .b[In-class] - A roadmap (where are we going?) - Linear regression and model selection --- name: admin-soon # Admin ## Upcoming .b[Readings] - .note[Today] - .it[ISL] Ch. 3 and 6.1 - .note[Next] - .it[ISL] Ch. 6 and 4 .b[Problem set] Next problem set very soon! --- name: admin-roadmap layout: false # Roadmap ## Where are we? We've essentially covered the central topics in statistical learning.super[.pink[†]] - Prediction and inference - Supervised .it[vs.] unsupervised methods - Regression and classification problems - The dangers of overfitting - The bias-variance tradeoff - Model assessment - Holdouts, validation sets, and cross validation.super[.pink[††]] - Model training and tuning - Simulation .footnote[ .pink[†] Plus a few of the "basic" methods: OLS regression and KNN. <br>.pink[††] And the bootstrap! ] --- # Roadmap ## Where are we going? Next, we will cover many common machine-learning algorithms, _e.g._, - Decision trees - Random forests and ensemble techniques - SVM - Neural nets - Clustering -- But first, we return to good old .hi-orange[linear regression]—in a new light... - Linear regression - Variable/model selection and LASSO/Ridge regression - .it[Plus:] Logistic regression and discriminant analysis --- # Roadmap ## Why return to regression? .hi[Motivation 1] <br>We have new tools. It might help to first apply them in a .b[familiar] setting. -- .hi[Motivation 2] <br>We have new tools. Maybe linear regression will be (even) .b[better now?] <br>.it[E.g.], did (cross) validation help you beat your old model? -- .hi[Motivation 3] > many fancy statistical learning approaches can be seen as .b[generalizations or extensions of linear regression]. .pad-left[.grey-light.it[Source: ISL, p. 59; emphasis added]] --- layout: true # Linear regression --- class: inverse, middle --- name: regression-intro ## Regression regression .note[Recall] Linear regression "fits" coefficients `\(\color{#e64173}{\beta}_0,\, \ldots,\, \color{#e64173}{\beta}_p\)` for a model $$ `\begin{align} \color{#FFA500}{y}_i = \color{#e64173}{\beta}_0 + \color{#e64173}{\beta}_1 x_{1,i} + \color{#e64173}{\beta}_2 x_{2,i} + \cdots + \color{#e64173}{\beta}_p x_{p,i} + \varepsilon_i \end{align}` $$ and is often applied in two distinct settings with fairly distinct goals: 1. .hi[Causal inference] estimates and interprets .pink[the coefficients]. 1. .hi-orange[Prediction] focuses on accurately estimating .orange[outcomes]. Regardless of the goal, the way we "fit" (estimate) the model is the same. --- ## Fitting the regression line As is the case with many statistical learning methods, regression focuses on minimizing some measure of loss/error. $$ `\begin{align} e_i = \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \end{align}` $$ -- Linear regression uses the L.sub[2] loss function—also called .it[residual sum of squares] (RSS) or .it[sum of squared errors] (SSE) $$ `\begin{align} \text{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^n e_i^2 \end{align}` $$ Specifically: OLS chooses the `\(\color{#e64173}{\hat{\beta}_j}\)` that .hi[minimize RSS]. --- name: performance ## Performance There's a large variety of ways to assess the fit.super[.pink[†]] of linear-regression models. 
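Before the formal definitions below, a minimal sketch of computing two of these measures by hand (illustrative only; it uses R's built-in `mtcars` data rather than anything from the lecture):

```r
# Fit a small OLS model and compute RSE and R2 manually
fit = lm(mpg ~ wt + hp, data = mtcars)
rss = sum(residuals(fit)^2)                   # residual sum of squares
tss = sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
n = nobs(fit)
p = length(coef(fit)) - 1                     # number of predictors (excluding the intercept)
c(rse = sqrt(rss / (n - p - 1)), r2 = 1 - rss / tss)
# These match summary(fit)$sigma and summary(fit)$r.squared
```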
.footnote[ .pink[†] or predictive performance ] .hi[Residual standard error] (.hi[RSE]) $$ `\begin{align} \text{RSE}=\sqrt{\dfrac{1}{n-p-1}\text{RSS}}=\sqrt{\dfrac{1}{n-p-1}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \end{align}` $$ .hi[R-squared] (.hi[R.super[2]]) $$ `\begin{align} R^2 = \dfrac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \dfrac{\text{RSS}}{\text{TSS}} \quad \text{where} \quad \text{TSS} = \sum_{i=1}^{n} \left( y_i - \overline{y} \right)^2 \end{align}` $$ --- ## Performance and overfit As we've seen throughout the course, we need to be careful .b[not to overfit]. -- .hi[R.super[2]] provides no protection against overfitting—and actually encourages it. $$ `\begin{align} R^2 = 1 - \dfrac{\text{RSS}}{\text{TSS}} \end{align}` $$ .attn[Add a new variable:] RSS `\(\downarrow\)` and TSS is unchanged. Thus, R.super[2] increases. -- .hi[RSE] .it[slightly] penalizes additional variables: $$ `\begin{align} \text{RSE}=\sqrt{\dfrac{1}{n-p-1}\text{RSS}} \end{align}` $$ .attn[Add a new variable:] RSS `\(\downarrow\)` but `\(p\)` increases. Thus, RSE's change is uncertain. --- exclude: true
```r
library(pacman)
p_load(parallel, stringr, data.table, magrittr, here)
# Set parameters
set.seed(123)
N = 2e3
n = 500
p = n - 1
# Generate data
X = matrix(data = rnorm(n = N*p), ncol = p)
β = matrix(data = rnorm(p, sd = 0.005), ncol = 1)
y = X %*% β + matrix(rnorm(N, sd = 0.01), ncol = 1)
# Create a data table
pop_dt = X %>% data.matrix() %>% as.data.table()
setnames(pop_dt, paste0("x", str_pad(1:p, 4, "left", 0)))
pop_dt[, y := y %>% unlist()]
# Subset
sub_dt = pop_dt[1:n,]
out_dt = pop_dt[(n+1):N,]
Nn = N - n
# For j in seq(1, p, 5): fit a model with the first j predictors; record fit measures
fit_dt = mclapply(X = seq(1, p, by = 5), mc.cores = 12, FUN = function(j) {
  # Fit a model with the first j variables
  lm_j = lm(y ~ ., data = sub_dt[, c(1:j,p+1), with = F])
  # Unseen data performance
  y_hat = predict(lm_j, newdata = out_dt[, c(1:j,p+1), with = F])
  out_rss = sum((out_dt[,y] - y_hat)^2)
  out_tss = sum((out_dt[,y] - mean(out_dt[,y]))^2)
  # Return data table
  data.table(
    p = j,
    in_rse = summary(lm_j)$sigma,
    in_r2 = summary(lm_j)$r.squared,
    in_r2_adj = summary(lm_j)$adj.r.squared,
    in_aic = AIC(lm_j),
    in_bic = BIC(lm_j),
    out_rse = sqrt(1 / (Nn - j - 1) * out_rss),
    out_r2 = 1 - out_rss/out_tss,
    out_r2_adj = 1 - ((out_rss) / (Nn - j - 1)) / ((out_tss) / (Nn-1))
  )
}) %>% rbindlist()
# Save results
saveRDS(
  object = fit_dt,
  file = here("other-files", "overfit-data.rds")
)
```
```r
# Load the data
fit_dt = here("other-files", "overfit-data.rds") %>% readRDS()
```
--- layout: true class: clear --- name: overfit class: middle ## Example Let's see how .b[R.super[2]] and .b[RSE] perform with 500 very weak predictors. To address overfitting, we can compare .b[in-] .it[vs.] .hi[out-of-sample] performance. --- .b[In-sample R.super[2]] mechanically increases as we add predictors. <br> .white[.b[Out-of-sample R.super[2]] does not.] <img src="004-slides_files/figure-html/overfit-plot-r2-1.svg" style="display: block; margin: auto;" /> --- .b[In-sample R.super[2]] mechanically increases as we add predictors. <br> .hi[Out-of-sample R.super[2]] does not. <img src="004-slides_files/figure-html/overfit-plot-r2-both-1.svg" style="display: block; margin: auto;" /> --- class: middle What about RSE? Does its penalty .it[help]? --- Despite its penalty for adding variables, .b[in-sample RSE] still can overfit, <br> .white[as evidenced by .b[out-of-sample RSE].]
<img src="004-slides_files/figure-html/plot-overfit-rse-1.svg" style="display: block; margin: auto;" /> --- Despite its penalty for adding variables, .b[in-sample RSE] still can overfit, <br> as evidenced by .hi[out-of-sample RSE]. <img src="004-slides_files/figure-html/plot-overfit-rse-both-1.svg" style="display: block; margin: auto;" /> --- layout: false # Linear regression ## Penalization RSE is not the only way to penalization the addition of variables..super[.pink[†]] .footnote[ .pink[†] We'll talk about other penalization methods (LASSO and Ridge) shortly. ] .b[Adjusted R.super[2]] is another .it[classic] solution. $$ `\begin{align} \text{Adjusted }R^2 = 1 - \dfrac{\text{RSS}\color{#6A5ACD}{/(n - p - 1)}}{\text{TSS}\color{#6A5ACD}{/(n-1)}} \end{align}` $$ Adj. R.super[2] attempts to "fix" R.super[2] by .hi-purple[adding a penalty for the number of variables]. -- - `\(\text{RSS}\)` always decreases when a new variable is added. -- - `\(\color{#6A5ACD}{\text{RSS}/(n-p-1)}\)` may increase or decrease with a new variable. --- layout: true class: clear --- However, .b[in-sample adjusted R.super[2]] still can overfit. <br> .white[Illustrated by .b[out-of-sample R.super[2]].] <img src="004-slides_files/figure-html/plot-overfit-adjr2-1.svg" style="display: block; margin: auto;" /> --- However, .b[in-sample adjusted R.super[2]] still can overfit. <br> Illustrated by .hi[out-of-sample adjusted R.super[2]]. <img src="004-slides_files/figure-html/plot-overfit-adjr2-both-1.svg" style="display: block; margin: auto;" /> --- Here are .b[in-sample] .b.orange[AIC] and .b.purple[BIC]. <br> Neither in-sample metric seems to entirely guard against overfitting. <img src="004-slides_files/figure-html/test-1.svg" style="display: block; margin: auto;" /> --- layout: true # Model selection --- name: better ## A better way? R.super[2], adjusted R.super[2], and RSE each offer some flavor of model fit, but they appear .b[limited in their abilities to prevent overfitting]. -- We want a method to optimally select a (linear) model—balancing variance and bias and avoiding overfit. -- We'll discuss two (related) methods today: 1. .hi[Subset selection] chooses a (sub)set of our `\(p\)` potential predictors 2. .hi[Shrinkage] fits a model using all `\(p\)` variables but "shrinks" its coefficients --- name: subset-selection ## Subset selection In .attn[subset selection], we 1. whittle down the `\(p\)` potential predictors (using some magic/algorithm) 1. estimate the chosen linear model using OLS -- How do we do the *whittling* (selection)? -- We've got .b[options]. - .attn[Best subset selection] fits a model for every possible subset. - .attn[Forward stepwise selection] starts with only an intercept and tries to build up to the best model (using some fit criterion). - .attn[Backward stepwise selection] starts with all `\(p\)` variables and tries to drop variables until it hits the best model (using some fit criterion). - .attn[Hybrid approaches] are what their name implies (_i.e._, hybrids). --- name: best-subset ## Best subset selection .attn[Best subset selection] is based upon a simple idea: Estimate a model for every possible subset of variables; then compare their performances. -- .qa[Q] So what's the problem? (Why do we need other selection methods?) -- <br> .qa[A] "a model for .hi[every possible subset]" can mean .hi[a lot] `\(\left( 2^p \right)\)` of models. 
-- .note[E.g.,] - 10 predictors `\(\rightarrow\)` 1,024 models to fit - 25 predictors `\(\rightarrow\)` >33.5 million models to fit - 100 predictors `\(\rightarrow\)` `\(\approx 1.3 \times 10^{30}\)` models to fit -- Even with plentiful, cheap computational power, we can run into barriers. --- ## Best subset selection Computational constraints aside, we can implement .attn[best subset selection] as .mono-small[ 1. Define `\(\mathcal{M}_0\)` as the model with no predictors. 1. For `\(k\)` in 1 to `\(p\)`: - Fit every possible model with `\(k\)` predictors. - Define `\(\mathcal{M}_k\)` as the "best" model with `\(k\)` predictors. 1. Select the "best" model from `\(\mathcal{M}_0,\,\ldots,\,\mathcal{M}_p\)`. ] As we've seen, RSS declines (and R.super[2] increases) as we add predictors, so we should use a cross-validated measure of model performance in step .mono[3]..super[.pink[†]] .footnote[ .pink[†] Back to our distinction between test .it[vs.] training performance. ] --- ## Example dataset: `Credit` We're going to use the `Credit` dataset from .it[ISL]'s R package `ISLR`. --
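The data-preview chunk isn't reproduced in these notes; a minimal sketch of grabbing the raw data (assuming the `ISLR` package is installed):

```r
# Load the Credit dataset that ships with the ISLR package
library(ISLR)
data(Credit)
dim(Credit)      # 400 rows; an ID column plus 11 other variables
head(Credit, 3)  # quick look at the raw (unprocessed) data
```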
The `Credit` dataset has 400 observations on 12 variables. --- ## Example dataset: `Credit` We need to pre-process the dataset before we can select a model...
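The original pre-processing chunk isn't shown here. One way to do it (a sketch only; the exact recipe and variable names are illustrative): drop `ID`, expand the factors into 0/1 dummies, and lowercase the names so the `caret` calls on the following slides run as written.

```r
# Sketch of pre-processing ISLR::Credit for model selection (illustrative recipe)
library(data.table)
credit_raw = ISLR::Credit[, names(ISLR::Credit) != "ID"]   # drop the ID column
x = model.matrix(Balance ~ ., data = credit_raw)[, -1]     # factors -> dummies; drop intercept
credit_dt = as.data.table(x)
credit_dt[, balance := credit_raw$Balance]                 # add the outcome back
setnames(credit_dt, tolower(names(credit_dt)))             # lowercase names, e.g., 'income'
dim(credit_dt)   # 400 x 12: 11 candidate predictors plus 'balance'
```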
Now the dataset has 400 observations on 12 variables (2,048 subsets). --- exclude: true --- layout: false class: clear, middle <img src="004-slides_files/figure-html/bss-graph-rmse-1.svg" style="display: block; margin: auto;" /> --- layout: false class: clear, middle <img src="004-slides_files/figure-html/bss-graph-r2-1.svg" style="display: block; margin: auto;" /> --- layout: true # Model selection --- ## Best subset selection From here, you would 1. Estimate cross-validated error for each `\(\mathcal{M}_k\)`. 1. Choose the `\(\mathcal{M}_k\)` that minimizes the CV error. 1. Train the chosen model on the full dataset. --- ## Best subset selection .b[Warnings] - Computationally intensive - Selected models may not be "right" (_e.g._, a squared term without its linear term) - You need to protect against overfitting when choosing across `\(\mathcal{M}_k\)` - Also should worry about overfitting when `\(p\)` is "big" - Dependent upon the variables (transformations) you provide .b[Benefits] - Comprehensive search across provided variables - Resulting model—when estimated with OLS—has OLS properties - Can be applied to other (non-OLS) estimators --- name: stepwise ## Stepwise selection .attn[Stepwise selection] provides a less computationally intensive alternative to best subset selection. The basic idea behind .attn[stepwise selection]: .mono-small[ 1. Start with an arbitrary model. 1. Try to find a "better" model by adding/removing variables. 1. Repeat. 1. Stop when you have the best model. (Or choose the best model.) ] -- The two most-common varieties of stepwise selection: - .attn[Forward] starts with only an intercept `\(\left( \mathcal{M}_0 \right)\)` and adds variables - .attn[Backward] starts with all variables `\(\left( \mathcal{M}_p \right)\)` and removes variables --- name: forward ## Forward stepwise selection The process... .mono-small[ 1. Start with a model with only an intercept (no predictors), `\(\mathcal{M}_0\)`. 1. For `\(k=0,\,\ldots,\,p-1\)`: - Estimate a model for each of the remaining `\(p-k\)` predictors, separately adding the predictors to model `\(\mathcal{M}_k\)`. - Define `\(\mathcal{M}_{k+1}\)` as the "best" model of the `\(p-k\)` models. 1. Select the "best" model from `\(\mathcal{M}_0,\,\ldots,\, \mathcal{M}_p\)`. ] -- What do we mean by "best"? <br> .mono-small[2:] .it[best] is often RSS or R.super[2]. <br> .mono-small[3:] .it[best] should be a cross-validated fit criterion. --- layout: false class: clear .hi-purple[Forward stepwise selection] with `caret` in R
```r
# Forward stepwise selection via caret (assumes the caret and leaps packages are loaded)
train_forward = train(
  y = credit_dt[["balance"]],
  x = credit_dt %>% dplyr::select(-balance),
  trControl = trainControl(method = "cv", number = 5),
  method = "leapForward",
  tuneGrid = expand.grid(nvmax = 1:11)
)
```
-- .col-left[
] .col-right[ <img src="004-slides_files/figure-html/forward-train-plot-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- name: backward # Model selection ## Backward stepwise selection The process for .attn[backward stepwise selection] is quite similar... -- .mono-small[ 1. Start with a model that includes all `\(p\)` predictors: `\(\mathcal{M}_p\)`. 1. For `\(k=p,\, p-1,\, \ldots,\,1\)`: - Estimate `\(k\)` models, where each model removes exactly one of the `\(k\)` predictors from `\(\mathcal{M}_k\)`. - Define `\(\mathcal{M}_{k-1}\)` as the "best" of the `\(k\)` models. 1. Select the "best" model from `\(\mathcal{M}_0,\,\ldots,\, \mathcal{M}_p\)`. ] -- What do we mean by "best"? <br> .mono-small[2:] .it[best] is often RSS or R.super[2]. <br> .mono-small[3:] .it[best] should be a cross-validated fit criterion. --- layout: false class: clear .hi-pink[Backward stepwise selection] with `caret` in R
```r
# Backward stepwise selection via caret (assumes the caret and leaps packages are loaded)
train_backward = train(
  y = credit_dt[["balance"]],
  x = credit_dt %>% dplyr::select(-balance),
  trControl = trainControl(method = "cv", number = 5),
  method = "leapBackward",
  tuneGrid = expand.grid(nvmax = 1:11)
)
```
-- .col-left[
] .col-right[ <img src="004-slides_files/figure-html/backward-train-plot-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- class: clear, middle .note[Note:] .hi-purple[forward] and .hi-pink[backward] stepwise selection can choose different models. <img src="004-slides_files/figure-html/stepwise-plot-zoom-1.svg" style="display: block; margin: auto;" /> --- name: stepwise-notes # Model selection ## Stepwise selection Notes on stepwise selection - .b[Less computationally intensive] (relative to best subset selection) - With `\(p=20\)`, BSS fits 1,048,576 models. - With `\(p=20\)`, forward/backward selection fits 211 models. - There is .b[no guarantee] that stepwise selection finds the best model. - .b.it[Best] is defined by your fit criterion (as always). - Again, .b[cross validation is key] to avoiding overfitting. --- name: criteria # Model selection ## Criteria Which model you choose is a function of .b[how you define "best"]. -- And we have many options... -- We've seen RSS, (R)MSE, RSE, MAE, R.super[2], Adj. R.super[2]. -- Of course, there's more. Each .hi-purple[penalizes] the number of predictors `\(d\)` differently. $$ `\begin{align} C_p &= \frac{1}{n} \left( \text{RSS} + \color{#6A5ACD}{2 d \hat{\sigma}^2 }\right) \\[1ex] \text{AIC} &= \frac{1}{n\hat{\sigma}^2} \left( \text{RSS} + \color{#6A5ACD}{2 d \hat{\sigma}^2 }\right) \\[1ex] \text{BIC} &= \frac{1}{n\hat{\sigma}^2} \left( \text{RSS} + \color{#6A5ACD}{\log(n) d \hat{\sigma}^2 }\right) \end{align}` $$ --- # Model selection ## Criteria > `\(C_p\)`, `\(AIC\)`, and `\(BIC\)` all have rigorous theoretical justifications... the adjusted `\(R^2\)` is not as well motivated in statistical theory .grey-light.it[ISL, p. 213] In general, we will stick with cross-validated criteria, but you still need to choose a selection criterion. --- name: sources layout: false # Sources These notes draw upon - [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)<br>James, Witten, Hastie, and Tibshirani --- # Table of contents .col-left[ .smallest[ #### Admin - [Today](#admin-today) - [Upcoming](#admin-soon) - [Roadmap](#admin-roadmap) #### Linear regression - [Regression regression](#regression-intro) - [Performance](#performance) - [Overfit](#overfit) #### Model selection - [A better way?](#better) - [Subset selection](#subset-selection) - [Best subset selection](#best-subset) - [Stepwise selection](#stepwise) - [Forward](#forward) - [Backward](#backward) - [Notes](#stepwise-notes) - [Criteria](#criteria) ] ] .col-right[ .smallest[ #### Other - [Sources/references](#sources) ] ] --- exclude: true