---
title: "Lecture .mono[005]"
subtitle: "Shrinkage methods"
author: "Edward Rubin"
#date: "`r format(Sys.time(), '%d %B %Y')`"
date: "06 February 2020"
output:
xaringan::moon_reader:
css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
# self_contained: true
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
exclude: true
```{R, setup, include = F}
library(pacman)
p_load(
broom, tidyverse,
ggplot2, ggthemes, ggforce, ggridges, cowplot, scales,
latex2exp, viridis, extrafont, gridExtra, plotly, ggformula,
kableExtra, DT,
data.table, dplyr, snakecase, janitor,
lubridate, knitr, future, furrr,
MASS, estimatr, caret, glmnet,
huxtable, here, magrittr, parallel
)
# Define colors
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```
---
layout: true
# Admin
---
class: inverse, middle
---
name: admin-today
## Material
.b[Last time]
- Linear regression
- Model selection
- Best subset selection
- Stepwise selection (forward/backward)
.b[Today] Shrinkage methods
---
name: admin-soon
# Admin
## Upcoming
.b[Readings]
- .note[Today] .it[ISL] Ch. 6
- .note[Next] .it[ISL] Ch. 4
.b[Problem sets] .it[Next:] After we finish this set of notes
---
layout: true
# Shrinkage methods
---
name: shrinkage-intro
## Intro
.note[Recap:] .attn[Subset-selection methods] (last time)
1. algorithmically search for the .pink["best" subset] of our $p$ predictors
1. estimate the linear models via .pink[least squares]
--
These methods assume we need to choose a model before we fit it...
--
.note[Alternative approach:] .attn[Shrinkage methods]
- fit a model that contains .pink[all] $\color{#e64173}{p}$ .pink[predictors]
- simultaneously: .pink[shrink.super[.pink[†]] coefficients] toward zero
.footnote[
.pink[†] Synonyms for .it[shrink]: constrain or regularize
]
--
.note[Idea:] Penalize the model for coefficients that move away from zero.
---
name: shrinkage-why
## Why?
.qa[Q] How could shrinking coefficients toward zero help our predictions?
--
.qa[A] Remember we're generally facing a tradeoff between bias and variance.
--
- Shrinking our coefficients toward zero .hi[reduces the model's variance]..super[.pink[†]]
- .hi[Penalizing] our model for .hi[larger coefficients] shrinks them toward zero.
- The .hi[optimal penalty] will balance reduced variance with increased bias.
.footnote[
.pink[†] Imagine the extreme case: a model whose coefficients are all zeros has no variance.
]
--
Now you understand the essence of shrinkage methods. We'll cover three:
- .attn[Ridge regression]
- .attn[Lasso]
- .attn[Elasticnet]
---
layout: true
# Ridge regression
---
class: inverse, middle
---
name: ridge
## Back to least squares (again)
.note[Recall] Least-squares regression gets $\hat{\beta}_j$'s by minimizing RSS, _i.e._,
$$
\begin{align}
\min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2
\end{align}
$$
--
.attn[Ridge regression] makes a small change
- .pink[adds a shrinkage penalty] = the sum of squared coefficients $\left( \color{#e64173}{\lambda\sum_{j}\beta_j^2} \right)$
- .pink[minimizes] the (weighted) sum of .pink[RSS and the shrinkage penalty]
--
$$
\begin{align}
\min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$
---
name: ridge-penalization
.col-left[
.hi[Ridge regression]
$$
\begin{align}
\min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$
]
.col-right[
.b[Least squares]
$$
\begin{align}
\min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2
\end{align}
$$
]
$\color{#e64173}{\lambda}\enspace (\geq0)$ is a tuning parameter for the harshness of the penalty.
$\color{#e64173}{\lambda} = 0$ implies no penalty: we are back to least squares.
--
Each value of $\color{#e64173}{\lambda}$ produces a new set of coefficients.
--
Ridge's approach to the bias-variance tradeoff: Balance
- reducing .b[RSS], _i.e._, $\sum_i\left( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \right)^2$
- reducing .b[coefficients] .grey-light[(ignoring the intercept)]
$\color{#e64173}{\lambda}$ determines how much ridge "cares about" these two quantities..super[.pink[†]]
.footnote[
.pink[†] With $\lambda=0$, least-squares regression only "cares about" RSS.
]
---
## $\lambda$ and penalization
Choosing a .it[good] value for $\lambda$ is key.
- If $\lambda$ is too small, then our model is essentially back to OLS.
- If $\lambda$ is too large, then we shrink all of our coefficients too close to zero.
--
.qa[Q] So what do we do?
--
.qa[A] Cross validate!
.grey-light[(You saw that coming, right?)]
---
## Penalization
.note[Note] Because we sum the .b[squared] coefficients, we penalize increasing .it[big] coefficients much more than increasing .it[small] coefficients.
.ex[Example] For a value of $\beta$, we pay a penalty of $2 \lambda \beta$ for a small increase..super[.pink[†]]
.footnote[
.pink[†] This quantity comes from taking the derivative of $\lambda \beta^2$ with respect to $\beta$.
]
- At $\beta = 0$, the penalty for a small increase is $0$.
- At $\beta = 1$, the penalty for a small increase is $2\lambda$.
- At $\beta = 2$, the penalty for a small increase is $4\lambda$.
- At $\beta = 3$, the penalty for a small increase is $6\lambda$.
- At $\beta = 10$, the penalty for a small increase is $20\lambda$.
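A quick sanity check of these marginal penalties in R (taking $\lambda = 1$ purely for illustration):
```{R, ex-ridge-marginal-penalty}
# Marginal ridge penalty 2 * lambda * beta at a few values of beta (lambda = 1 is arbitrary)
lambda = 1
beta = c(0, 1, 2, 3, 10)
2 * lambda * beta
```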
Now you see why we call it .it[shrinkage]: it encourages small coefficients.
---
name: ridge-standardization
## Penalization and standardization
.attn[Important] Predictors' .hi[units] can drastically .hi[affect ridge regression results].
.b[Why?]
--
Because $\mathbf{x}_j$'s units affect $\beta_j$, and ridge is very sensitive to $\beta_j$.
--
.ex[Example] Let $x_1$ denote distance.
.b[Least-squares regression]
If $x_1$ is .it[meters] and $\beta_1 = 3$, then when $x_1$ is .it[km], $\beta_1 = 3,000$.
Rescaling a predictor simply rescales its coefficient; least squares' fit and predictions are unchanged.
--
.hi[Ridge regression] pays a much larger penalty for $\beta_1=3,000$ than $\beta_1=3$.
You will not get the same (scaled) estimates when you change units.
--
.note[Solution] Standardize your variables, _i.e._, `x_stnd = (x - mean(x))/sd(x)`.
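For instance, a minimal sketch with a made-up vector `x` (which `scale()` reproduces):
```{R, ex-standardize-sketch}
# A hypothetical predictor measured in meters
x = c(120, 340, 905, 1500)
# Standardize: subtract the mean, then divide by the standard deviation
x_stnd = (x - mean(x)) / sd(x)
# 'scale()' gives the same result
all.equal(as.numeric(scale(x)), x_stnd)
```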
---
name: ridge-example
## Example
Let's return to the credit dataset.
.ex[Recall] We have 11 predictors and a numeric outcome `balance`.
I standardized our .b[predictors] using `preProcess()` from `caret`, _i.e._,
```{R, credit-data-work, include = F, cache = T}
# The Credit dataset
credit_dt = ISLR::Credit %>% clean_names() %T>% setDT()
# Clean variables
credit_dt[, `:=`(
i_female = 1 * (gender == "Female"),
i_student = 1 * (student == "Yes"),
i_married = 1 * (married == "Yes"),
i_asian_american = 1 * (ethnicity == "Asian"),
i_african_american = 1 * (ethnicity == "African American")
)]
# Drop unwanted variables
credit_dt[, `:=`(id = NULL, gender = NULL, student = NULL, married = NULL, ethnicity = NULL)]
```
```{R, credit-data-preprocess, cache = T}
# Standardize all variables except 'balance'
credit_stnd = preProcess(
# Do not process the outcome 'balance'
x = credit_dt %>% dplyr::select(-balance),
# Standardizing means 'center' and 'scale'
method = c("center", "scale")
)
# We have to pass the 'preProcess' object to 'predict' to get new data
credit_stnd %<>% predict(newdata = credit_dt)
```
---
## Example
For ridge regression.super[.pink[†]] in R, we will use `glmnet()` from the `glmnet` package.
.footnote[
.pink[†] And lasso!
]
The .hi-slate[key arguments] for `glmnet()` are
.col-left[
- `x` a .b[matrix] of predictors
- `y` outcome variable as a vector
- `standardize` (`T` or `F`)
- `alpha` elasticnet parameter
- `alpha=0` gives ridge
- `alpha=1` gives lasso
]
.col-right[
- `lambda` tuning parameter (sequence of numbers)
- `nlambda` alternatively, R picks a sequence of values for $\lambda$
]
---
## Example
We just need to define a decreasing sequence for $\lambda$, and then we're set.
```{R, ex-ridge-glmnet}
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit ridge regression
est_ridge = glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
standardize = T,
alpha = 0,
lambda = lambdas
)
```
The `glmnet` output (`est_ridge` here) contains estimated coefficients for each value of $\lambda$. You can use `predict()` to get coefficients for additional values of $\lambda$.
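For example, a sketch pulling coefficients at $\lambda = 30$ (an arbitrary value, not necessarily in our sequence):
```{R, ex-ridge-coef-lambda, eval = F}
# Ridge coefficients at lambda = 30 (glmnet interpolates between fitted lambdas)
predict(est_ridge, type = "coefficients", s = 30)
```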
---
layout: false
class: clear, middle
.b[Ridge regression coefficients] for $\lambda$ between 0.01 and 100,000
```{R, plot-ridge-glmnet, echo = F}
ridge_df = est_ridge %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
ridge_df %<>% dplyr::select(-1) %>% mutate(lambda = est_ridge$lambda)
ridge_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
data = ridge_df,
aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
expression(lambda),
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
trans = "log10"
) +
scale_y_continuous("Ridge coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
---
layout: true
# Ridge regression
## Example
---
`glmnet` also provides a convenient cross-validation function: `cv.glmnet()`.
```{R, cv-ridge, cache = T}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv = cv.glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
alpha = 0,
standardize = T,
lambda = lambdas,
# New: How we make decisions and number of folds
type.measure = "mse",
nfolds = 5
)
```
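The fitted object stores the CV results, _e.g._, the $\lambda$ that minimizes CV MSE:
```{R, ex-ridge-cv-extract}
# The lambda that minimizes CV MSE and the CV RMSE at that lambda
ridge_cv$lambda.min
sqrt(min(ridge_cv$cvm))
```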
---
layout: false
class: clear, middle
.b[Cross-validated RMSE and] $\lambda$: Which $\color{#e64173}{\lambda}$ minimizes CV RMSE?
```{R, plot-cv-ridge, echo = F}
# Create data frame of our results
ridge_cv_df = data.frame(
lambda = ridge_cv$lambda,
rmse = sqrt(ridge_cv$cvm)
)
# Plot
ggplot(
data = ridge_cv_df,
aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
data = ridge_cv_df %>% filter(rmse == min(rmse)),
size = 3.5,
color = red_pink
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
expression(lambda),
trans = "log10",
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle
Often, the CV minimum will fall farther from the extremes of your $\lambda$ range...
---
layout: false
class: clear, middle
```{R, cv-ridge2, cache = T, include = F}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv2 = cv.glmnet(
x = credit_stnd %>% dplyr::select(-balance, -rating, -limit, -income) %>% as.matrix(),
y = credit_stnd$balance,
alpha = 0,
standardize = T,
lambda = lambdas,
# New: How we make decisions and number of folds
type.measure = "mse",
nfolds = 5
)
```
.b[Cross-validated RMSE and] $\lambda$: Which $\color{#e64173}{\lambda}$ minimizes CV RMSE?
```{R, plot-cv-ridge2, echo = F}
# Create data frame of our results
ridge_cv_df2 = data.frame(
lambda = ridge_cv2$lambda,
rmse = sqrt(ridge_cv2$cvm)
)
# Plot
ggplot(
data = ridge_cv_df2,
aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
data = ridge_cv_df2 %>% filter(rmse == min(rmse)),
size = 3.5,
color = red_pink
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
expression(lambda),
trans = "log10",
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
# Ridge regression
## Example
We can also use `train()` from `caret` to cross validate ridge regression.
```{R, credit-ridge-ex, eval = F}
# Our range of lambdas
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Ridge regression with cross validation
ridge_cv = train(
# The formula
balance ~ .,
# The dataset
data = credit_stnd,
# The 'glmnet' package does ridge and lasso
method = "glmnet",
# 10-fold cross validation
trControl = trainControl("cv", number = 10),
# The parameters of 'glmnet' (alpha = 0 gives ridge regression)
tuneGrid = expand.grid(alpha = 0, lambda = lambdas)
)
```
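After `train()` finishes, the winning $\lambda$ lives in `bestTune` (a sketch; assumes the chunk above was evaluated):
```{R, ex-ridge-caret-best, eval = F}
# The lambda caret's cross validation selected (assumes 'ridge_cv' from train() above)
ridge_cv$bestTune
```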
---
name: ridge-predict
layout: false
# Ridge regression
## Prediction in R
Once you find your $\lambda$ via cross validation
1\. Fit your model on the full dataset using the optimal $\lambda$
```{R, ridge-final-1, eval = F}
# Fit final model
final_ridge = glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
standardize = T,
alpha = 0,
lambda = ridge_cv$lambda.min
)
```
---
# Ridge regression
## Prediction in R
Once you find your $\lambda$ via cross validation
1\. Fit your model on the full dataset using the optimal $\lambda$
2\. Make predictions
```{R, ridge-final-2, eval = F}
predict(
final_ridge,
type = "response",
# Our chosen lambda
s = ridge_cv$lambda.min,
# Our data
newx = credit_stnd %>% dplyr::select(-balance) %>% as.matrix()
)
```
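If you want a quick check of fit, you can compare these predictions to the observed `balance`, _e.g._, a sketch of in-sample RMSE (which will understate test error):
```{R, ridge-final-3, eval = F}
# In-sample RMSE for the final ridge model (understates out-of-sample error)
y_hat = predict(
  final_ridge,
  s = ridge_cv$lambda.min,
  newx = credit_stnd %>% dplyr::select(-balance) %>% as.matrix()
)
sqrt(mean((credit_stnd$balance - y_hat)^2))
```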
---
# Ridge regression
## Shrinking
While ridge regression .it[shrinks] coefficients close to zero, it never forces them to be equal to zero.
.b[Drawbacks]
1. We cannot use ridge regression for subset/feature selection.
1. We often end up with a bunch of tiny coefficients.
--
.qa[Q] Can't we just drive the coefficients to zero?
--
.qa[A] Yes. Just not with ridge (due to $\sum_j \hat{\beta}_j^2$).
---
layout: true
# Lasso
---
class: inverse, middle
---
name: lasso
## Intro
.attn[Lasso] simply replaces ridge's .it[squared] coefficients with absolute values.
--
.hi[Ridge regression]
$$
\begin{align}
\min_{\hat{\beta}^R} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$
.hi-grey[Lasso]
$$
\begin{align}
\min_{\hat{\beta}^L} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}
$$
Everything else will be the same—except one aspect...
---
name: lasso-shrinkage
## Shrinkage
Unlike ridge, lasso's .it[marginal] penalty does not increase with the size of $\beta_j$.
You always pay $\color{#8AA19E}{\lambda}$ to increase $\big|\beta_j\big|$ by one unit.
--
The only way to avoid lasso's penalty is to .hi[set coefficients to zero].
--
This feature has two .hi-slate[benefits]
1. Some coefficients will be .hi[set to zero]—we get "sparse" models.
1. Lasso can be used for subset/feature .hi[selection].
--
We will still need to carefully select $\color{#8AA19E}{\lambda}$.
---
layout: true
# Lasso
## Example
---
name: lasso-example
We can also use `glmnet()` for lasso.
.ex[Recall] The .hi-slate[key arguments] for `glmnet()` are
.col-left[
- `x` a .b[matrix] of predictors
- `y` outcome variable as a vector
- `standardize` (`T` or `F`)
- `alpha` elasticnet parameter
- `alpha=0` gives ridge
- .hi[`alpha=1` gives lasso]
]
.col-right[
- `lambda` tuning parameter (sequence of numbers)
- `nlambda` alternatively, R picks a sequence of values for $\lambda$
]
---
Again, we define a decreasing sequence for $\lambda$, and we're set.
```{R, ex-lasso-glmnet}
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit lasso regression
est_lasso = glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
standardize = T,
alpha = 1,
lambda = lambdas
)
```
The `glmnet` output (`est_lasso` here) contains estimated coefficients for each value of $\lambda$. You can use `predict()` to get coefficients for additional values of $\lambda$.
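One way to see the sparsity: count how many coefficients lasso sets exactly to zero at a given $\lambda$ ($\lambda = 10$ here is arbitrary):
```{R, ex-lasso-zeros}
# Number of lasso coefficients that are exactly zero at lambda = 10 (arbitrary value)
b_lasso = predict(est_lasso, type = "coefficients", s = 10)
sum(as.matrix(b_lasso) == 0)
```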
---
layout: false
class: clear, middle
.b[Lasso coefficients] for $\lambda$ between 0.01 and 100,000
```{R, plot-lasso-glmnet, echo = F}
lasso_df = est_lasso %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
lasso_df %<>% dplyr::select(-1) %>% mutate(lambda = est_lasso$lambda)
lasso_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
data = lasso_df,
aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
expression(lambda),
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
trans = "log10"
) +
scale_y_continuous("Lasso coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
---
class: clear, middle
Compare lasso's tendency to force coefficients to zero with our previous ridge-regression results.
---
class: clear, middle
.b[Ridge regression coefficients] for $\lambda$ between 0.01 and 100,000
```{R, plot-ridge-glmnet-2, echo = F}
ridge_df = est_ridge %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
ridge_df %<>% dplyr::select(-1) %>% mutate(lambda = est_ridge$lambda)
ridge_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
data = ridge_df,
aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
expression(lambda),
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
trans = "log10"
) +
scale_y_continuous("Ridge coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
---
# Lasso
## Example
We can also cross validate $\lambda$ with `cv.glmnet()`.
```{R, cv-lasso, cache = T}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv = cv.glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
alpha = 1,
standardize = T,
lambda = lambdas,
# New: How we make decisions and number of folds
type.measure = "mse",
nfolds = 5
)
```
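As before, `lambda.min` gives the CV-minimizing $\lambda$, and `coef()` shows which predictors survive at that value (a sketch):
```{R, ex-lasso-cv-extract, eval = F}
# CV-minimizing lambda and the (sparse) coefficient vector lasso keeps there
lasso_cv$lambda.min
coef(lasso_cv, s = "lambda.min")
```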
---
layout: false
class: clear, middle
.b[Cross-validated RMSE and] $\lambda$: Which $\color{#8AA19E}{\lambda}$ minimizes CV RMSE?
```{R, plot-cv-lasso, echo = F}
# Create data frame of our results
lasso_cv_df = data.frame(
lambda = lasso_cv$lambda,
rmse = sqrt(lasso_cv$cvm)
)
# Plot
ggplot(
data = lasso_cv_df,
aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
data = lasso_cv_df %>% filter(rmse == min(rmse)),
size = 3.5,
color = "#8AA19E"
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
expression(lambda),
trans = "log10",
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle
Again, the CV minimum will often fall farther from the extremes of your $\lambda$ range...
---
class: clear, middle
.b[Cross-validated RMSE and] $\lambda$: Which $\color{#8AA19E}{\lambda}$ minimizes CV RMSE?
```{R, cv-lasso2, cache = T, include = F}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv2 = cv.glmnet(
x = credit_stnd %>% dplyr::select(-balance, -rating, -limit, -income) %>% as.matrix(),
y = credit_stnd$balance,
alpha = 1,
standardize = T,
lambda = lambdas,
# New: How we make decisions and number of folds
type.measure = "mse",
nfolds = 5
)
```
```{R, plot-cv-lasso2, echo = F}
# Create data frame of our results
lasso_cv_df2 = data.frame(
lambda = lasso_cv2$lambda,
rmse = sqrt(lasso_cv2$cvm)
)
# Plot
ggplot(
data = lasso_cv_df2,
aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
data = lasso_cv_df2 %>% filter(rmse == min(rmse)),
size = 3.5,
color = "#8AA19E"
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
expression(lambda),
trans = "log10",
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle
So which shrinkage method should you choose?
---
layout: true
# Ridge or lasso?
---
name: or
.col-left.pink[
.b[Ridge regression]
.b.orange[+] shrinks $\hat{\beta}_j$ .it[near] 0
.b.orange[-] many small $\hat\beta_j$
.b.orange[-] doesn't work for selection
.b.orange[-] difficult to interpret output
.b.orange[+] better when all $\beta_j\neq$ 0
.it[Best:] $p$ is large & $\beta_j\approx\beta_k$
]
.col-right.purple[
.b[Lasso]
.b.orange[+] shrinks $\hat{\beta}_j$ to 0
.b.orange[+] many $\hat\beta_j=$ 0
.b.orange[+] great for selection
.b.orange[+] sparse models easier to interpret
.b.orange[-] implicitly assumes some $\beta=$ 0
.it[Best:] $p$ is large & many $\beta_j\approx$ 0
]
--
.left-full[
> [N]either ridge... nor the lasso will universally dominate the other.
.ex[ISL, p. 224]
]
---
name: both
layout: false
# Ridge .it[and] lasso
## Why not both?
.hi-blue[Elasticnet] combines .pink[ridge regression] and .grey[lasso].
--
$$
\begin{align}
\min_{\beta^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}
$$
We now have two tuning parameters: $\lambda$ and $\color{#181485}{\alpha}$.
--
Remember the `alpha` argument in `glmnet()`?
- $\color{#e64173}{\alpha = 0}$ specifies ridge
- $\color{#8AA19E}{\alpha=1}$ specifies lasso
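For a single elasticnet fit with `glmnet()`, you pick an `alpha` strictly between 0 and 1, _e.g._, a sketch with an (arbitrary) even split:
```{R, ex-net-glmnet}
# Elasticnet: alpha = 0.5 is arbitrary; in practice we cross validate alpha too
est_net = glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  standardize = T,
  alpha = 0.5,
  lambda = lambdas
)
```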
---
# Ridge .it[and] lasso
## Why not both?
We can use `train()` from `caret` to cross validate $\alpha$ and $\lambda$.
.note[Note] You need to consider all combinations of the two parameters.
Considering every combination can create *a lot* of models to estimate.
For example,
- 1,000 values of $\lambda$
- 1,000 values of $\alpha$
leaves you with 1,000,000 models to estimate..super[.pink[†]]
.footnote[
.pink[†] 5,000,000 if you are doing 5-fold CV!
]
---
layout: false
class: clear, middle
```{R, credit-net-ex, eval = F}
# Our range of λ
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Our range of α
alphas = seq(from = 0, to = 1, by = 0.1)
# Elasticnet with cross validation
net_cv = train(
# The formula
balance ~ .,
# The dataset
data = credit_stnd,
# The 'glmnet' package does ridge and lasso
method = "glmnet",
# 10-fold cross validation
trControl = trainControl("cv", number = 10),
# The parameters of 'glmnet'
tuneGrid = expand.grid(alpha = alphas, lambda = lambdas)
)
```
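`train()` then reports the best $(\alpha, \lambda)$ pair (a sketch; assumes the chunk above was evaluated):
```{R, ex-net-best, eval = F}
# The (alpha, lambda) combination that minimized CV error (assumes 'net_cv' from above)
net_cv$bestTune
```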
---
name: sources
layout: false
# Sources
These notes draw upon
- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)
James, Witten, Hastie, and Tibshirani
---
# Table of contents
.col-left[
.smallest[
#### Admin
- [Today](#admin-today)
- [Upcoming](#admin-soon)
#### Shrinkage
- [Introduction](#shrinkage-intro)
- [Why?](#shrinkage-why)
#### Ridge regression
- [Intro](#ridge)
- [Penalization](#ridge-penalization)
- [Standardization](#ridge-standardization)
- [Example](#ridge-example)
- [Prediction](#ridge-predict)
]
]
.col-right[
.smallest[
#### (The) lasso
- [Intro](#lasso)
- [Shrinkage](#lasso-shrinkage)
- [Example](#lasso-example)
#### Ridge or lasso
- [Plus/minus](#or)
- [Both?](#both)
#### Other
- [Sources/references](#sources)
]
]