--- title: "Lecture .mono[005]" subtitle: "Shrinkage methods" author: "Edward Rubin" #date: "`r format(Sys.time(), '%d %B %Y')`" date: "06 February 2020" output: xaringan::moon_reader: css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css'] # self_contained: true nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- exclude: true ```{R, setup, include = F} library(pacman) p_load( broom, tidyverse, ggplot2, ggthemes, ggforce, ggridges, cowplot, scales, latex2exp, viridis, extrafont, gridExtra, plotly, ggformula, kableExtra, DT, data.table, dplyr, snakecase, janitor, lubridate, knitr, future, furrr, MASS, estimatr, caret, glmnet, huxtable, here, magrittr, parallel ) # Define colors red_pink = "#e64173" turquoise = "#20B2AA" orange = "#FFA500" red = "#fb6107" blue = "#3b3b9a" green = "#8bb174" grey_light = "grey70" grey_mid = "grey50" grey_dark = "grey20" purple = "#6A5ACD" slate = "#314f4f" # Knitr options opts_chunk$set( comment = "#>", fig.align = "center", fig.height = 7, fig.width = 10.5, warning = F, message = F ) opts_chunk$set(dev = "svg") options(device = function(file, width, height) { svg(tempfile(), width = width, height = height) }) options(knitr.table.format = "html") ``` --- layout: true # Admin --- class: inverse, middle --- name: admin-today ## Material .b[Last time] - Linear regression - Model selection - Best subset selection - Stepwise selection (forward/backward) .b[Today] Shrinkage methods --- name: admin-soon # Admin ## Upcoming .b[Readings] - .note[Today] .it[ISL] Ch. 6 - .note[Next] .it[ISL] 4 .b[Problem sets] .it[Next:] After we finish this set of notes --- layout: true # Shrinkage methods --- name: shrinkage-intro ## Intro .note[Recap:] .attn[Subset-selection methods] (last time) 1. algorithmically search for the .pink["best" subset] of our $p$ predictors 1. estimate the linear models via .pink[least squares] -- These methods assume we need to choose a model before we fit it... -- .note[Alternative approach:] .attn[Shrinkage methods] - fit a model that contains .pink[all] $\color{#e64173}{p}$ .pink[predictors] - simultaneously: .pink[shrink.super[.pink[†]] coefficients] toward zero .footnote[ .pink[†] Synonyms for .it[shrink]: constrain or regularize ] -- .note[Idea:] Penalize the model for coefficients as they move away from zero. --- name: shrinkage-why ## Why? .qa[Q] How could shrinking coefficients twoard zero help or predictions? -- .qa[A] Remember we're generally facing a tradeoff between bias and variance. -- - Shrinking our coefficients toward zero .hi[reduces the model's variance]..super[.pink[†]] - .hi[Penalizing] our model for .hi[larger coefficients] shrinks them toward zero. - The .hi[optimal penalty] will balance reduced variance with increased bias. .footnote[ .pink[†] Imagine the extreme case: a model whose coefficients are all zeros has no variance. ] -- Now you understand shrinkage methods. 
- .attn[Ridge regression]
- .attn[Lasso]
- .attn[Elasticnet]
---
layout: true
# Ridge regression
---
class: inverse, middle
---
name: ridge
## Back to least squares (again)

.note[Recall] Least-squares regression gets $\hat{\beta}_j$'s by minimizing RSS, _i.e._,

$$
\begin{align}
  \min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2
\end{align}
$$

--

.attn[Ridge regression] makes a small change
- .pink[adds a shrinkage penalty] = the sum of squared coefficients $\left( \color{#e64173}{\lambda\sum_{j}\beta_j^2} \right)$
- .pink[minimizes] the (weighted) sum of .pink[RSS and the shrinkage penalty]

--

$$
\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$
---
name: ridge-penalization

.col-left[
.hi[Ridge regression]
$$
\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$
]

.col-right[
.b[Least squares]
$$
\begin{align}
  \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2
\end{align}
$$
]



$\color{#e64173}{\lambda}\enspace (\geq0)$ is a tuning parameter for the harshness of the penalty.
$\color{#e64173}{\lambda} = 0$ implies no penalty: we are back to least squares.

--
Each value of $\color{#e64173}{\lambda}$ produces a new set of coefficients.

--

Ridge's approach to the bias-variance tradeoff: balance
- reducing .b[RSS], _i.e._, $\sum_i\left( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \right)^2$
- reducing .b[coefficients] .grey-light[(ignoring the intercept)]

$\color{#e64173}{\lambda}$ determines how much ridge "cares about" these two quantities..super[.pink[†]]

.footnote[
.pink[†] With $\lambda=0$, least-squares regression only "cares about" RSS.
]
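---

To make this balance concrete, here is a minimal sketch (with made-up numbers) of the objective ridge minimizes:

```{R, ridge-objective-sketch}
# Hypothetical outcomes, fitted values, slope coefficients, and lambda
y = c(1, 3, 5)
y_hat = c(1.2, 2.7, 5.1)
beta = c(0.4, -1.1)
lambda = 10
# Ridge's objective: RSS (rewards fit) plus the shrinkage penalty
sum((y - y_hat)^2) + lambda * sum(beta^2)
```

Raising $\lambda$ puts more weight on the penalty, pushing the minimizing coefficients toward zero.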
---
## $\lambda$ and penalization

Choosing a .it[good] value for $\lambda$ is key.
- If $\lambda$ is too small, then our model is essentially back to OLS.
- If $\lambda$ is too large, then we shrink all of our coefficients too close to zero.

--

.qa[Q] So what do we do?

--

.qa[A] Cross validate! .grey-light[(You saw that coming, right?)]
---
## Penalization

.note[Note] Because we sum the .b[squared] coefficients, we penalize increasing .it[big] coefficients much more than increasing .it[small] coefficients.

.ex[Example] For a value of $\beta$, we pay a penalty of $2 \lambda \beta$ for a small increase..super[.pink[†]]

.footnote[
.pink[†] This quantity comes from taking the derivative of $\lambda \beta^2$ with respect to $\beta$.
]

- At $\beta = 0$, the penalty for a small increase is $0$.
- At $\beta = 1$, the penalty for a small increase is $2\lambda$.
- At $\beta = 2$, the penalty for a small increase is $4\lambda$.
- At $\beta = 3$, the penalty for a small increase is $6\lambda$.
- At $\beta = 10$, the penalty for a small increase is $20\lambda$.

Now you see why we call it .it[shrinkage]: it encourages small coefficients.
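---
## Penalization

A quick sketch of the marginal penalty from the previous slide (with $\lambda = 1$ for illustration):

```{R, marginal-penalty-sketch}
# Marginal ridge penalty: the derivative of lambda * beta^2 with respect to beta
marginal_penalty = function(beta, lambda) 2 * lambda * beta
marginal_penalty(beta = c(0, 1, 2, 3, 10), lambda = 1)
```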
---
name: ridge-standardization
## Penalization and standardization

.attn[Important] Predictors' .hi[units] can drastically .hi[affect ridge regression results].

.b[Why?]

--

Because $\mathbf{x}_j$'s units affect $\beta_j$, and ridge is very sensitive to $\beta_j$.

--

.ex[Example] Let $x_1$ denote distance.

.b[Least-squares regression]

If $x_1$ is .it[meters] and $\beta_1 = 3$, then when $x_1$ is .it[km], $\beta_1 = 3,000$.
Least squares is scale .it[equivariant]: changing $x_1$'s units simply rescales $\beta_1$, leaving the fitted values (and predictions) unchanged.

--

.hi[Ridge regression] pays a much larger penalty for $\beta_1=3,000$ than $\beta_1=3$.
You will not get the same (scaled) estimates when you change units.

--

.note[Solution] Standardize your variables, _i.e._, `x_stnd = (x - mean(x))/sd(x)`.
---
name: ridge-example
## Example

Let's return to the credit dataset.

.ex[Recall] We have 11 predictors and a numeric outcome `balance`.

I standardized our .b[predictors] using `preProcess()` from `caret`, _i.e._,

```{R, credit-data-work, include = F, cache = T}
# The Credit dataset
credit_dt = ISLR::Credit %>% clean_names() %T>% setDT()
# Clean variables
credit_dt[, `:=`(
  i_female = 1 * (gender == "Female"),
  i_student = 1 * (student == "Yes"),
  i_married = 1 * (married == "Yes"),
  i_asian_american = 1 * (ethnicity == "Asian"),
  i_african_american = 1 * (ethnicity == "African American")
)]
# Drop unwanted variables
credit_dt[, `:=`(id = NULL, gender = NULL, student = NULL, married = NULL, ethnicity = NULL)]
```

```{R, credit-data-preprocess, cache = T}
# Standardize all variables except 'balance'
credit_stnd = preProcess(
  # Do not process the outcome 'balance'
  x = credit_dt %>% dplyr::select(-balance),
  # Standardizing means 'center' and 'scale'
  method = c("center", "scale")
)
# We have to pass the 'preProcess' object to 'predict' to get new data
credit_stnd %<>% predict(newdata = credit_dt)
```
---
## Example

For ridge regression.super[.pink[†]] in R, we will use `glmnet()` from the `glmnet` package.

.footnote[
.pink[†] And lasso!
]

The .hi-slate[key arguments] for `glmnet()` are

.col-left[
- `x` a .b[matrix] of predictors
- `y` outcome variable as a vector
- `standardize` (`T` or `F`)
- `alpha` elasticnet parameter
  - `alpha=0` gives ridge
  - `alpha=1` gives lasso
]

.col-right[
- `lambda` tuning parameter (sequence of numbers)
- `nlambda` alternatively, R picks a sequence of values for $\lambda$
]
---
## Example

We just need to define a decreasing sequence for $\lambda$, and then we're set.

```{R, ex-ridge-glmnet}
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit ridge regression
est_ridge = glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  standardize = T,
  alpha = 0,
  lambda = lambdas
)
```

The `glmnet` output (`est_ridge` here) contains estimated coefficients for each value of $\lambda$.

You can use `predict()` to get coefficients for additional values of $\lambda$ (see the short example after the next figure).
---
layout: false
class: clear, middle

.b[Ridge regression coefficients] for $\lambda$ between 0.01 and 100,000

```{R, plot-ridge-glmnet, echo = F}
ridge_df = est_ridge %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
ridge_df %<>% dplyr::select(-1) %>% mutate(lambda = est_ridge$lambda)
ridge_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
  data = ridge_df,
  aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
  expression(lambda),
  labels = c("0.1", "10", "1,000", "100,000"),
  breaks = c(0.1, 10, 1000, 100000),
  trans = "log10"
) +
scale_y_continuous("Ridge coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
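---
# Ridge regression
## Example

As promised, `predict()` recovers coefficients at values of $\lambda$ outside our original sequence. A short sketch ($\lambda = 50$ is an arbitrary choice):

```{R, ex-ridge-predict-coef, eval = F}
# Ridge coefficients at lambda = 50 (glmnet interpolates between fitted lambdas)
predict(est_ridge, type = "coefficients", s = 50)
```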
---
layout: true
# Ridge regression
## Example
---

`glmnet` also provides a convenient cross-validation function: `cv.glmnet()`.

```{R, cv-ridge, cache = T}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv = cv.glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  alpha = 0,
  standardize = T,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
```
---
layout: false
class: clear, middle

.b[Cross-validated RMSE and] $\lambda$: Which $\color{#e64173}{\lambda}$ minimizes CV RMSE?

```{R, plot-cv-ridge, echo = F}
# Create data frame of our results
ridge_cv_df = data.frame(
  lambda = ridge_cv$lambda,
  rmse = sqrt(ridge_cv$cvm)
)
# Plot
ggplot(
  data = ridge_cv_df,
  aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
  data = ridge_cv_df %>% filter(rmse == min(rmse)),
  size = 3.5,
  color = red_pink
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
  expression(lambda),
  trans = "log10",
  labels = c("0.1", "10", "1,000", "100,000"),
  breaks = c(0.1, 10, 1000, 100000)
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle

Often, you will have a minimum farther away from your extremes...
---
layout: false
class: clear, middle

```{R, cv-ridge2, cache = T, include = F}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv2 = cv.glmnet(
  x = credit_stnd %>% dplyr::select(-balance, -rating, -limit, -income) %>% as.matrix(),
  y = credit_stnd$balance,
  alpha = 0,
  standardize = T,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
```

.b[Cross-validated RMSE and] $\lambda$: Which $\color{#e64173}{\lambda}$ minimizes CV RMSE?

```{R, plot-cv-ridge2, echo = F}
# Create data frame of our results
ridge_cv_df2 = data.frame(
  lambda = ridge_cv2$lambda,
  rmse = sqrt(ridge_cv2$cvm)
)
# Plot
ggplot(
  data = ridge_cv_df2,
  aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
  data = ridge_cv_df2 %>% filter(rmse == min(rmse)),
  size = 3.5,
  color = red_pink
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
  expression(lambda),
  trans = "log10",
  labels = c("0.1", "10", "1,000", "100,000"),
  breaks = c(0.1, 10, 1000, 100000)
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
# Ridge regression
## Example

We can also use `train()` from `caret` to cross validate ridge regression.

```{R, credit-ridge-ex, eval = F}
# Our range of lambdas
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Ridge regression with cross validation
ridge_cv = train(
  # The formula
  balance ~ .,
  # The dataset
  data = credit_stnd,
  # The 'glmnet' package does ridge and lasso
  method = "glmnet",
  # 10-fold cross validation
  trControl = trainControl("cv", number = 10),
  # The parameters of 'glmnet' (alpha = 0 gives ridge regression)
  tuneGrid = expand.grid(alpha = 0, lambda = lambdas)
)
```
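---
# Ridge regression
## Example

Either way, the winning penalty is stored in the fitted object. A quick sketch (assuming the `ridge_cv` objects from the previous slides):

```{R, ex-ridge-lambda-min, eval = F}
# From cv.glmnet(): the lambda that minimizes CV error (plus the 1-SE alternative)
ridge_cv$lambda.min
ridge_cv$lambda.1se
# From caret's train(): the best tuning-parameter combination
ridge_cv$bestTune
```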
---
name: ridge-predict
layout: false
# Ridge regression
## Prediction in R

Once you find your $\lambda$ via cross validation,

1\. Fit your model on the full dataset using the optimal $\lambda$

```{R, ridge-final-1, eval = F}
# Fit final model
final_ridge = glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  standardize = T,
  alpha = 0,
  lambda = ridge_cv$lambda.min
)
```
---
# Ridge regression
## Prediction in R

Once you find your $\lambda$ via cross validation,

1\. Fit your model on the full dataset using the optimal $\lambda$

2\. Make predictions

```{R, ridge-final-2, eval = F}
predict(
  final_ridge,
  type = "response",
  # Our chosen lambda
  s = ridge_cv$lambda.min,
  # Our data
  newx = credit_stnd %>% dplyr::select(-balance) %>% as.matrix()
)
```
---
# Ridge regression
## Shrinking

While ridge regression .it[shrinks] coefficients close to zero, it never forces them to be equal to zero.

.b[Drawbacks]

1. We cannot use ridge regression for subset/feature selection.
1. We often end up with a bunch of tiny coefficients.

--

.qa[Q] Can't we just drive the coefficients to zero?

--
.qa[A] Yes. Just not with ridge (due to $\sum_j \hat{\beta}_j^2$).
---
layout: true
# Lasso
---
class: inverse, middle
---
name: lasso
## Intro

.attn[Lasso] simply replaces ridge's .it[squared] coefficients with absolute values.

--

.hi[Ridge regression]

$$
\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$

.hi-grey[Lasso]

$$
\begin{align}
  \min_{\hat{\beta}^L} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}
$$

Everything else will be the same—except one aspect...
---
name: lasso-shrinkage
## Shrinkage

Unlike ridge, lasso's penalty does not increase with the size of $\beta_j$.

You always pay $\color{#8AA19E}{\lambda}$ to increase $\big|\beta_j\big|$ by one unit.

--

The only way to avoid lasso's penalty is to .hi[set coefficients to zero].

--

This feature has two .hi-slate[benefits]

1. Some coefficients will be .hi[set to zero]—we get "sparse" models.
1. Lasso can be used for subset/feature .hi[selection].

--

We will still need to carefully select $\color{#8AA19E}{\lambda}$.
---
layout: true
# Lasso
## Example
---
name: lasso-example

We can also use `glmnet()` for lasso.

.ex[Recall] The .hi-slate[key arguments] for `glmnet()` are

.col-left[
- `x` a .b[matrix] of predictors
- `y` outcome variable as a vector
- `standardize` (`T` or `F`)
- `alpha` elasticnet parameter
  - `alpha=0` gives ridge
  - .hi[`alpha=1` gives lasso]
]

.col-right[
- `lambda` tuning parameter (sequence of numbers)
- `nlambda` alternatively, R picks a sequence of values for $\lambda$
]
---

Again, we define a decreasing sequence for $\lambda$, and we're set.

```{R, ex-lasso-glmnet}
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit lasso regression
est_lasso = glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  standardize = T,
  alpha = 1,
  lambda = lambdas
)
```

The `glmnet` output (`est_lasso` here) contains estimated coefficients for each value of $\lambda$.

You can use `predict()` to get coefficients for additional values of $\lambda$.
---
layout: false
class: clear, middle

.b[Lasso coefficients] for $\lambda$ between 0.01 and 100,000

```{R, plot-lasso-glmnet, echo = F}
lasso_df = est_lasso %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
lasso_df %<>% dplyr::select(-1) %>% mutate(lambda = est_lasso$lambda)
lasso_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
  data = lasso_df,
  aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
  expression(lambda),
  labels = c("0.1", "10", "1,000", "100,000"),
  breaks = c(0.1, 10, 1000, 100000),
  trans = "log10"
) +
scale_y_continuous("Lasso coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
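---
# Lasso
## Example

We can count how many coefficients lasso sets exactly to zero at a given penalty. A sketch ($\lambda = 10$ is an arbitrary choice):

```{R, ex-lasso-zeros, eval = F}
# Lasso coefficients at lambda = 10, then the number of exact zeros
lasso_coef = predict(est_lasso, type = "coefficients", s = 10) %>% as.matrix()
sum(lasso_coef == 0)
```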
---
class: clear, middle

Compare lasso's tendency to force coefficients to zero with our previous ridge-regression results.
---
class: clear, middle

.b[Ridge regression coefficients] for $\lambda$ between 0.01 and 100,000

```{R, plot-ridge-glmnet-2, echo = F}
ridge_df = est_ridge %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
ridge_df %<>% dplyr::select(-1) %>% mutate(lambda = est_ridge$lambda)
ridge_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
  data = ridge_df,
  aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
  expression(lambda),
  labels = c("0.1", "10", "1,000", "100,000"),
  breaks = c(0.1, 10, 1000, 100000),
  trans = "log10"
) +
scale_y_continuous("Ridge coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
---
# Lasso
## Example

We can also cross validate $\lambda$ with `cv.glmnet()`.

```{R, cv-lasso, cache = T}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv = cv.glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  alpha = 1,
  standardize = T,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
```
---
layout: false
class: clear, middle

.b[Cross-validated RMSE and] $\lambda$: Which $\color{#8AA19E}{\lambda}$ minimizes CV RMSE?

```{R, plot-cv-lasso, echo = F}
# Create data frame of our results
lasso_cv_df = data.frame(
  lambda = lasso_cv$lambda,
  rmse = sqrt(lasso_cv$cvm)
)
# Plot
ggplot(
  data = lasso_cv_df,
  aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
  data = lasso_cv_df %>% filter(rmse == min(rmse)),
  size = 3.5,
  color = "#8AA19E"
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
  expression(lambda),
  trans = "log10",
  labels = c("0.1", "10", "1,000", "100,000"),
  breaks = c(0.1, 10, 1000, 100000)
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle

Again, you will have a minimum farther away from your extremes...
---
class: clear, middle

.b[Cross-validated RMSE and] $\lambda$: Which $\color{#8AA19E}{\lambda}$ minimizes CV RMSE?

```{R, cv-lasso2, cache = T, include = F}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv2 = cv.glmnet(
  x = credit_stnd %>% dplyr::select(-balance, -rating, -limit, -income) %>% as.matrix(),
  y = credit_stnd$balance,
  alpha = 1,
  standardize = T,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
```

```{R, plot-cv-lasso2, echo = F}
# Create data frame of our results
lasso_cv_df2 = data.frame(
  lambda = lasso_cv2$lambda,
  rmse = sqrt(lasso_cv2$cvm)
)
# Plot
ggplot(
  data = lasso_cv_df2,
  aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
  data = lasso_cv_df2 %>% filter(rmse == min(rmse)),
  size = 3.5,
  color = "#8AA19E"
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
  expression(lambda),
  trans = "log10",
  labels = c("0.1", "10", "1,000", "100,000"),
  breaks = c(0.1, 10, 1000, 100000)
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle

So which shrinkage method should you choose?
---
layout: true
# Ridge or lasso?
---
name: or

.col-left.pink[
.b[Ridge regression]

.b.orange[+] shrinks $\hat{\beta}_j$ .it[near] 0
.b.orange[-] many small $\hat\beta_j$
.b.orange[-] doesn't work for selection
.b.orange[-] difficult to interpret output
.b.orange[+] better when all $\beta_j\neq$ 0

.it[Best:] $p$ is large & $\beta_j\approx\beta_k$
]

.col-right.purple[
.b[Lasso]

.b.orange[+] shrinks $\hat{\beta}_j$ to 0
.b.orange[+] many $\hat\beta_j=$ 0
.b.orange[+] great for selection
.b.orange[+] sparse models easier to interpret
.b.orange[-] implicitly assumes some $\beta=$ 0

.it[Best:] $p$ is large & many $\beta_j\approx$ 0
]

--

.left-full[
> [N]either ridge... nor the lasso will universally dominate the other.

.ex[ISL, p. 224]
]
---
name: both
layout: false
# Ridge .it[and] lasso
## Why not both?

.hi-blue[Elasticnet] combines .pink[ridge regression] and .grey[lasso].

--

$$
\begin{align}
  \min_{\beta^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}
$$

We now have two tuning parameters: $\lambda$ and $\color{#181485}{\alpha}$.

--

Remember the `alpha` argument in `glmnet()`?

- $\color{#e64173}{\alpha = 0}$ specifies ridge
- $\color{#8AA19E}{\alpha=1}$ specifies lasso
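---
# Ridge .it[and] lasso
## Why not both?

A minimal sketch of elasticnet's penalty as written above (note: `glmnet`'s internal parameterization scales the ridge piece slightly differently):

```{R, ex-net-penalty}
# Elasticnet penalty: alpha mixes the ridge (squared) and lasso (absolute) pieces
net_penalty = function(beta, lambda, alpha) {
  (1 - alpha) * lambda * sum(beta^2) + alpha * lambda * sum(abs(beta))
}
# alpha = 0 recovers ridge's penalty; alpha = 1 recovers lasso's (hypothetical betas)
net_penalty(beta = c(0.4, -1.1), lambda = 10, alpha = 0.5)
```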
---
# Ridge .it[and] lasso
## Why not both?

We can use `train()` from `caret` to cross validate $\alpha$ and $\lambda$.

.note[Note] You need to consider all combinations of the two parameters.

This combination can create *a lot* of models to estimate. For example,

- 1,000 values of $\lambda$
- 1,000 values of $\alpha$

leaves you with 1,000,000 models to estimate..super[.pink[†]]

.footnote[
.pink[†] 5,000,000 if you are doing 5-fold CV!
]
---
layout: false
class: clear, middle

```{R, credit-net-ex, eval = F}
# Our range of λ
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Our range of α
alphas = seq(from = 0, to = 1, by = 0.1)
# Elasticnet with cross validation
net_cv = train(
  # The formula
  balance ~ .,
  # The dataset
  data = credit_stnd,
  # The 'glmnet' package does ridge and lasso
  method = "glmnet",
  # 10-fold cross validation
  trControl = trainControl("cv", number = 10),
  # The parameters of 'glmnet'
  tuneGrid = expand.grid(alpha = alphas, lambda = lambdas)
)
```
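---
# Ridge .it[and] lasso
## Why not both?

After training, `caret` stores the winning combination. A quick sketch (assuming `net_cv` from the previous slide):

```{R, ex-net-best, eval = F}
# The (alpha, lambda) pair with the lowest CV error
net_cv$bestTune
```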
---
name: sources
layout: false
# Sources

These notes draw upon

- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)
  James, Witten, Hastie, and Tibshirani
---
# Table of contents

.col-left[
.smallest[
#### Admin
- [Today](#admin-today)
- [Upcoming](#admin-soon)

#### Shrinkage
- [Introduction](#shrinkage-intro)
- [Why?](#shrinkage-why)

#### Ridge regression
- [Intro](#ridge)
- [Penalization](#ridge-penalization)
- [Standardization](#ridge-standardization)
- [Example](#ridge-example)
- [Prediction](#ridge-predict)
]
]

.col-right[
.smallest[
#### (The) lasso
- [Intro](#lasso)
- [Shrinkage](#lasso-shrinkage)
- [Example](#lasso-example)

#### Ridge or lasso
- [Plus/minus](#or)
- [Both?](#both)

#### Other
- [Sources/references](#sources)
]
]