class: center, middle, inverse, title-slide

# Lecture .mono[005]
## Shrinkage methods
### Edward Rubin
### 06 February 2020

---
exclude: true

---
layout: true
# Admin

---
class: inverse, middle

---
name: admin-today
## Material

.b[Last time]
- Linear regression
- Model selection
  - Best subset selection
  - Stepwise selection (forward/backward)

.b[Today] Shrinkage methods

---
name: admin-soon
# Admin
## Upcoming

.b[Readings]
- .note[Today] .it[ISL] Ch. 6
- .note[Next] .it[ISL] Ch. 4

.b[Problem sets] .it[Next:] After we finish this set of notes

---
layout: true
# Shrinkage methods

---
name: shrinkage-intro
## Intro

.note[Recap:] .attn[Subset-selection methods] (last time)

1. algorithmically search for the .pink["best" subset] of our `\(p\)` predictors
1. estimate the linear models via .pink[least squares]

--

These methods assume we need to choose a model before we fit it...

--

.note[Alternative approach:] .attn[Shrinkage methods]
- fit a model that contains .pink[all] `\(\color{#e64173}{p}\)` .pink[predictors]
- simultaneously: .pink[shrink.super[.pink[†]] coefficients] toward zero

.footnote[
.pink[†] Synonyms for .it[shrink]: constrain or regularize
]

--

.note[Idea:] Penalize the model as its coefficients move away from zero.

---
name: shrinkage-why
## Why?

.qa[Q] How could shrinking coefficients toward zero help our predictions?

--

.qa[A] Remember that we're generally facing a tradeoff between bias and variance.

--

- Shrinking our coefficients toward zero .hi[reduces the model's variance]..super[.pink[†]]
- .hi[Penalizing] our model for .hi[larger coefficients] shrinks them toward zero.
- The .hi[optimal penalty] will balance reduced variance with increased bias.

.footnote[
.pink[†] Imagine the extreme case: a model whose coefficients are all zeros has no variance.
]

--

Now you understand shrinkage methods.

- .attn[Ridge regression]
- .attn[Lasso]
- .attn[Elasticnet]

---
layout: true
# Ridge regression

---
class: inverse, middle

---
name: ridge
## Back to least squares (again)

.note[Recall] Least-squares regression gets `\(\hat{\beta}_j\)`'s by minimizing RSS, _i.e._,

$$
`\begin{align}
  \min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2
\end{align}`
$$

--

.attn[Ridge regression] makes a small change
- .pink[adds a shrinkage penalty] = the sum of squared coefficients `\(\left( \color{#e64173}{\lambda\sum_{j}\beta_j^2} \right)\)`
- .pink[minimizes] the (weighted) sum of .pink[RSS and the shrinkage penalty]

--

$$
`\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}`
$$
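--

To make the objective concrete, here is a minimal sketch of the quantity ridge minimizes (penalized RSS) written as an R function. The function and its argument names are purely illustrative; this is not how `glmnet` computes anything.

```r
# Ridge objective: RSS plus lambda times the sum of squared slope coefficients
# (the intercept b0 is not penalized)
ridge_objective = function(b0, b, y, X, lambda) {
  rss = sum((y - (b0 + X %*% b))^2)
  rss + lambda * sum(b^2)
}
```

Ridge chooses the coefficients that make this quantity as small as possible for a given `\(\lambda\)`.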
---
name: ridge-penalization

.col-left[
.hi[Ridge regression]
$$
`\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}`
$$
]

.col-right[
.b[Least squares]
$$
`\begin{align}
  \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2
\end{align}`
$$
]

<br><br><br><br>

`\(\color{#e64173}{\lambda}\enspace (\geq0)\)` is a tuning parameter for the harshness of the penalty.
<br> `\(\color{#e64173}{\lambda} = 0\)` implies no penalty: we are back to least squares.

--

<br> Each value of `\(\color{#e64173}{\lambda}\)` produces a new set of coefficients.

--

Ridge's approach to the bias-variance tradeoff: Balance
- reducing .b[RSS], _i.e._, `\(\sum_i\left( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \right)^2\)`
- reducing .b[coefficients] .grey-light[(ignoring the intercept)]

`\(\color{#e64173}{\lambda}\)` determines how much ridge "cares about" these two quantities..super[.pink[†]]

.footnote[
.pink[†] With `\(\lambda=0\)`, least-squares regression only "cares about" RSS.
]

---
## `\(\lambda\)` and penalization

Choosing a .it[good] value for `\(\lambda\)` is key.
- If `\(\lambda\)` is too small, then our model is essentially back to OLS.
- If `\(\lambda\)` is too large, then we shrink all of our coefficients too close to zero.

--

.qa[Q] So what do we do?
--
<br>
.qa[A] Cross validate! .grey-light[(You saw that coming, right?)]

---
## Penalization

.note[Note] Because we sum the .b[squared] coefficients, we penalize increasing .it[big] coefficients much more than increasing .it[small] coefficients.

.ex[Example] For a value of `\(\beta\)`, we pay a penalty of `\(2 \lambda \beta\)` for a small increase..super[.pink[†]]

.footnote[
.pink[†] This quantity comes from taking the derivative of `\(\lambda \beta^2\)` with respect to `\(\beta\)`.
]

- At `\(\beta = 0\)`, the penalty for a small increase is `\(0\)`.
- At `\(\beta = 1\)`, the penalty for a small increase is `\(2\lambda\)`.
- At `\(\beta = 2\)`, the penalty for a small increase is `\(4\lambda\)`.
- At `\(\beta = 3\)`, the penalty for a small increase is `\(6\lambda\)`.
- At `\(\beta = 10\)`, the penalty for a small increase is `\(20\lambda\)`.

Now you see why we call it .it[shrinkage]: it encourages small coefficients.

---
name: ridge-standardization
## Penalization and standardization

.attn[Important] Predictors' .hi[units] can drastically .hi[affect ridge regression results].

.b[Why?]

--

Because `\(\mathbf{x}_j\)`'s units affect `\(\beta_j\)`, and ridge is very sensitive to `\(\beta_j\)`.

--

.ex[Example] Let `\(x_1\)` denote distance.

.b[Least-squares regression]
<br> If `\(x_1\)` is .it[meters] and `\(\beta_1 = 3\)`, then when `\(x_1\)` is .it[km], `\(\beta_1 = 3,000\)`.
<br> The scale/units of predictors do not affect least squares' estimates.

--

.hi[Ridge regression] pays a much larger penalty for `\(\beta_1 = 3,000\)` than for `\(\beta_1 = 3\)`.
<br> You will not get the same (scaled) estimates when you change units.

--

.note[Solution] Standardize your variables, _i.e._, `x_stnd = (x - mean(x))/sd(x)`.

---
name: ridge-example
## Example

Let's return to the credit dataset.

.ex[Recall] We have 11 predictors and a numeric outcome `balance`.

I standardized our .b[predictors] using `preProcess()` from `caret`, _i.e._,

```r
# Standardize all variables except 'balance'
credit_stnd = preProcess(
  # Do not process the outcome 'balance'
  x = credit_dt %>% dplyr::select(-balance),
  # Standardizing means 'center' and 'scale'
  method = c("center", "scale")
)
# We have to pass the 'preProcess' object to 'predict' to get new data
credit_stnd %<>% predict(newdata = credit_dt)
```

---
## Example

For ridge regression.super[.pink[†]] in R, we will use `glmnet()` from the `glmnet` package.

.footnote[
.pink[†] And lasso!
]
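The examples in these slides assume the relevant packages are already attached. The exact list below is an assumption based on the functions we use; adjust as needed.

```r
# Packages assumed loaded throughout the examples
library(glmnet)   # glmnet(), cv.glmnet()
library(caret)    # preProcess(), train(), trainControl()
library(dplyr)    # select() and the %>% pipe
library(magrittr) # the %<>% pipe
```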
The .hi-slate[key arguments] for `glmnet()` are

.col-left[
- `x` a .b[matrix] of predictors
- `y` outcome variable as a vector
- `standardize` (`T` or `F`)
- `alpha` elasticnet parameter
  - `alpha=0` gives ridge
  - `alpha=1` gives lasso
]

.col-right[
- `lambda` tuning parameter (sequence of numbers)
- `nlambda` alternatively, R picks a sequence of values for `\(\lambda\)`
]

---
## Example

We just need to define a decreasing sequence for `\(\lambda\)`, and then we're set.

```r
# Define our range of lambdas (glmnet wants a decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit ridge regression
est_ridge = glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  standardize = T,
  alpha = 0,
  lambda = lambdas
)
```

The `glmnet` output (`est_ridge` here) contains estimated coefficients for each value of `\(\lambda\)`.

You can use `predict()` to get coefficients for additional values of `\(\lambda\)`.

---
layout: false
class: clear, middle

.b[Ridge regression coefficients] for `\(\lambda\)` between 0.01 and 100,000

<img src="005-slides_files/figure-html/plot-ridge-glmnet-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Ridge regression
## Example

---

`glmnet` also provides a convenient cross-validation function: `cv.glmnet()`.

```r
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv = cv.glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  alpha = 0,
  standardize = T,
  lambda = lambdas,
  # New: How we make decisions and the number of folds
  type.measure = "mse",
  nfolds = 5
)
```

---
layout: false
class: clear, middle

.b[Cross-validated RMSE and] `\(\lambda\)`: Which `\(\color{#e64173}{\lambda}\)` minimizes CV RMSE?

<img src="005-slides_files/figure-html/plot-cv-ridge-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle

Often, you will have a minimum farther away from your extremes...

---
layout: false
class: clear, middle

.b[Cross-validated RMSE and] `\(\lambda\)`: Which `\(\color{#e64173}{\lambda}\)` minimizes CV RMSE?

<img src="005-slides_files/figure-html/plot-cv-ridge2-1.svg" style="display: block; margin: auto;" />

---
# Ridge regression
## Example

We can also use `train()` from `caret` to cross validate ridge regression.

```r
# Our range of lambdas
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Ridge regression with cross validation
# (named differently so we keep the 'cv.glmnet' results in 'ridge_cv')
ridge_cv_caret = train(
  # The formula
  balance ~ .,
  # The dataset
  data = credit_stnd,
  # The 'glmnet' package does ridge and lasso
  method = "glmnet",
  # 10-fold cross validation
  trControl = trainControl("cv", number = 10),
  # The parameters of 'glmnet' (alpha = 0 gives ridge regression)
  tuneGrid = expand.grid(alpha = 0, lambda = lambdas)
)
```

---
name: ridge-predict
layout: false
# Ridge regression
## Prediction in R

Once you find your `\(\lambda\)` via cross validation,

1\. Fit your model on the full dataset using the optimal `\(\lambda\)`

```r
# Fit final model
final_ridge = glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  standardize = T,
  alpha = 0,
  lambda = ridge_cv$lambda.min
)
```
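As a quick check (and a preview of the next point), you can pull the fitted coefficients straight from the `glmnet` object. A one-line sketch, assuming `final_ridge` from above:

```r
# Ridge coefficients at the chosen lambda: shrunken toward zero, but none exactly zero
coef(final_ridge)
```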
---
# Ridge regression
## Prediction in R

Once you find your `\(\lambda\)` via cross validation,

1\. Fit your model on the full dataset using the optimal `\(\lambda\)`

2\. Make predictions

```r
predict(
  final_ridge,
  type = "response",
  # Our chosen lambda
  s = ridge_cv$lambda.min,
  # Our data
  newx = credit_stnd %>% dplyr::select(-balance) %>% as.matrix()
)
```

---
# Ridge regression
## Shrinking

While ridge regression .it[shrinks] coefficients close to zero, it never forces them to be equal to zero.

.b[Drawbacks]
1. We cannot use ridge regression for subset/feature selection.
1. We often end up with a bunch of tiny coefficients.

--

.qa[Q] Can't we just drive the coefficients to zero?
--
<br>
.qa[A] Yes. Just not with ridge (due to `\(\sum_j \hat{\beta}_j^2\)`).

---
layout: true
# Lasso

---
class: inverse, middle

---
name: lasso
## Intro

.attn[Lasso] simply replaces ridge's .it[squared] coefficients with absolute values.

--

.hi[Ridge regression]

$$
`\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}`
$$

.hi-grey[Lasso]

$$
`\begin{align}
  \min_{\hat{\beta}^L} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}`
$$

Everything else will be the same—except one aspect...

---
name: lasso-shrinkage
## Shrinkage

Unlike ridge, lasso's .it[marginal] penalty does not increase with the size of `\(\beta_j\)`.

You always pay `\(\color{#8AA19E}{\lambda}\)` to increase `\(\big|\beta_j\big|\)` by one unit.

--

The only way to avoid lasso's penalty is to .hi[set coefficients to zero].

--

This feature has two .hi-slate[benefits]
1. Some coefficients will be .hi[set to zero]—we get "sparse" models.
1. Lasso can be used for subset/feature .hi[selection].

--

We will still need to carefully select `\(\color{#8AA19E}{\lambda}\)`.

---
layout: true
# Lasso
## Example

---
name: lasso-example

We can also use `glmnet()` for lasso.

.ex[Recall] The .hi-slate[key arguments] for `glmnet()` are

.col-left[
- `x` a .b[matrix] of predictors
- `y` outcome variable as a vector
- `standardize` (`T` or `F`)
- `alpha` elasticnet parameter
  - `alpha=0` gives ridge
  - .hi[`alpha=1` gives lasso]
]

.col-right[
- `lambda` tuning parameter (sequence of numbers)
- `nlambda` alternatively, R picks a sequence of values for `\(\lambda\)`
]

---

Again, we define a decreasing sequence for `\(\lambda\)`, and we're set.

```r
# Define our range of lambdas (glmnet wants a decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit lasso regression
est_lasso = glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  standardize = T,
  alpha = 1,
  lambda = lambdas
)
```

The `glmnet` output (`est_lasso` here) contains estimated coefficients for each value of `\(\lambda\)`.

You can use `predict()` to get coefficients for additional values of `\(\lambda\)`.
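For example, here is a sketch of how you might grab the lasso coefficients at one value of `\(\lambda\)` and count how many lasso set exactly to zero. (The value `s = 10` is arbitrary and only for illustration.)

```r
# Lasso coefficients at lambda = 10 (an arbitrary, illustrative value)
lasso_coef = coef(est_lasso, s = 10)
# How many coefficients did lasso set exactly to zero?
sum(lasso_coef == 0)
```

Larger values of `\(\lambda\)` will generally produce more zeros.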
---
layout: false
class: clear, middle

.b[Lasso coefficients] for `\(\lambda\)` between 0.01 and 100,000

<img src="005-slides_files/figure-html/plot-lasso-glmnet-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle

Compare lasso's tendency to force coefficients to zero with our previous ridge-regression results.

---
class: clear, middle

.b[Ridge regression coefficients] for `\(\lambda\)` between 0.01 and 100,000

<img src="005-slides_files/figure-html/plot-ridge-glmnet-2-1.svg" style="display: block; margin: auto;" />

---
# Lasso
## Example

We can also cross validate `\(\lambda\)` with `cv.glmnet()`.

```r
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv = cv.glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  alpha = 1,
  standardize = T,
  lambda = lambdas,
  # New: How we make decisions and the number of folds
  type.measure = "mse",
  nfolds = 5
)
```

---
layout: false
class: clear, middle

.b[Cross-validated RMSE and] `\(\lambda\)`: Which `\(\color{#8AA19E}{\lambda}\)` minimizes CV RMSE?

<img src="005-slides_files/figure-html/plot-cv-lasso-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle

Again, you will have a minimum farther away from your extremes...

---
class: clear, middle

.b[Cross-validated RMSE and] `\(\lambda\)`: Which `\(\color{#8AA19E}{\lambda}\)` minimizes CV RMSE?

<img src="005-slides_files/figure-html/plot-cv-lasso2-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle

So which shrinkage method should you choose?

---
layout: true
# Ridge or lasso?

---
name: or

.col-left.pink[
.b[Ridge regression]
<br>
<br>.b.orange[+] shrinks `\(\hat{\beta}_j\)` .it[near] 0
<br>.b.orange[-] many small `\(\hat\beta_j\)`
<br>.b.orange[-] doesn't work for selection
<br>.b.orange[-] difficult to interpret output
<br>.b.orange[+] better when all `\(\beta_j\neq\)` 0
<br><br> .it[Best:] `\(p\)` is large & `\(\beta_j\approx\beta_k\)`
]

.col-right.purple[
.b[Lasso]
<br>
<br>.b.orange[+] shrinks `\(\hat{\beta}_j\)` to 0
<br>.b.orange[+] many `\(\hat\beta_j=\)` 0
<br>.b.orange[+] great for selection
<br>.b.orange[+] sparse models are easier to interpret
<br>.b.orange[-] implicitly assumes some `\(\beta=\)` 0
<br><br> .it[Best:] `\(p\)` is large & many `\(\beta_j\approx\)` 0
]

--

.left-full[
> [N]either ridge... nor the lasso will universally dominate the other. .ex[ISL, p. 224]
]

---
name: both
layout: false
# Ridge .it[and] lasso
## Why not both?

.hi-blue[Elasticnet] combines .pink[ridge regression] and .grey[lasso].

--

$$
`\begin{align}
  \min_{\beta^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}`
$$

We now have two tuning parameters: `\(\lambda\)` and `\(\color{#181485}{\alpha}\)`.

--

Remember the `alpha` argument in `glmnet()`?

- `\(\color{#e64173}{\alpha = 0}\)` specifies ridge
- `\(\color{#8AA19E}{\alpha=1}\)` specifies lasso

---
# Ridge .it[and] lasso
## Why not both?

We can use `train()` from `caret` to cross validate `\(\alpha\)` and `\(\lambda\)`.

.note[Note] You need to consider all combinations of the two parameters.
<br> Considering all combinations can create *a lot* of models to estimate.

For example,
- 1,000 values of `\(\lambda\)`
- 1,000 values of `\(\alpha\)`

leaves you with 1,000,000 models to estimate..super[.pink[†]]

.footnote[
.pink[†] 5,000,000 if you are doing 5-fold CV!
]
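If you want to sanity check that arithmetic, here is a quick sketch. The grid below is only for counting combinations, not a recommendation.

```r
# 1,000 lambdas x 1,000 alphas = 1,000,000 parameter combinations
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
alphas  = seq(from = 0, to = 1, length.out = 1e3)
nrow(expand.grid(alpha = alphas, lambda = lambdas))
#> [1] 1000000
```

The next slide keeps the `\(\alpha\)` grid much coarser (steps of 0.1) for exactly this reason.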
---
layout: false
class: clear, middle

```r
# Our range of λ
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Our range of α
alphas = seq(from = 0, to = 1, by = 0.1)
# Elasticnet with cross validation
net_cv = train(
  # The formula
  balance ~ .,
  # The dataset
  data = credit_stnd,
  # The 'glmnet' package does ridge and lasso
  method = "glmnet",
  # 10-fold cross validation
  trControl = trainControl("cv", number = 10),
  # The parameters of 'glmnet'
  tuneGrid = expand.grid(alpha = alphas, lambda = lambdas)
)
```

---
name: sources
layout: false
# Sources

These notes draw upon

- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)<br>James, Witten, Hastie, and Tibshirani

---
# Table of contents

.col-left[
.smallest[
#### Admin
- [Today](#admin-today)
- [Upcoming](#admin-soon)

#### Shrinkage
- [Introduction](#shrinkage-intro)
- [Why?](#shrinkage-why)

#### Ridge regression
- [Intro](#ridge)
- [Penalization](#ridge-penalization)
- [Standardization](#ridge-standardization)
- [Example](#ridge-example)
- [Prediction](#ridge-predict)
]
]

.col-right[
.smallest[
#### (The) lasso
- [Intro](#lasso)
- [Shrinkage](#lasso-shrinkage)
- [Example](#lasso-example)

#### Ridge or lasso
- [Plus/minus](#or)
- [Both?](#both)

#### Other
- [Sources/references](#sources)
]
]