---
title: "Lecture .mono[005]"
subtitle: "Shrinkage methods"
author: "Edward Rubin"
#date: "`r format(Sys.time(), '%d %B %Y')`"
date: "06 February 2020"
output:
xaringan::moon_reader:
css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
# self_contained: true
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
exclude: true
```{R, setup, include = F}
library(pacman)
p_load(
broom, tidyverse,
ggplot2, ggthemes, ggforce, ggridges, cowplot, scales,
latex2exp, viridis, extrafont, gridExtra, plotly, ggformula,
kableExtra, DT,
data.table, dplyr, snakecase, janitor,
lubridate, knitr, future, furrr,
MASS, estimatr, caret, glmnet,
huxtable, here, magrittr, parallel
)
# Define colors
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```
---
layout: true
# Admin
---
class: inverse, middle
---
name: admin-today
## Material
.b[Last time]
- Linear regression
- Model selection
- Best subset selection
- Stepwise selection (forward/backward)
.b[Today] Shrinkage methods
---
name: admin-soon
# Admin
## Upcoming
.b[Readings]
- .note[Today] .it[ISL] Ch. 6
- .note[Next] .it[ISL] Ch. 4
.b[Problem sets] .it[Next:] After we finish this set of notes
---
layout: true
# Shrinkage methods
---
name: shrinkage-intro
## Intro
.note[Recap:] .attn[Subset-selection methods] (last time)
1. algorithmically search for the .pink["best" subset] of our $p$ predictors
1. estimate the linear models via .pink[least squares]
--
These methods assume we need to choose a model before we fit it...
--
.note[Alternative approach:] .attn[Shrinkage methods]
- fit a model that contains .pink[all] $\color{#e64173}{p}$ .pink[predictors]
- simultaneously: .pink[shrink.super[.pink[†]] coefficients] toward zero
.footnote[
.pink[†] Synonyms for .it[shrink]: constrain or regularize
]
--
.note[Idea:] Penalize the model for coefficients that move away from zero.
---
name: shrinkage-why
## Why?
.qa[Q] How could shrinking coefficients toward zero help our predictions?
--
.qa[A] Remember we're generally facing a tradeoff between bias and variance.
--
- Shrinking our coefficients toward zero .hi[reduces the model's variance]..super[.pink[†]]
- .hi[Penalizing] our model for .hi[larger coefficients] shrinks them toward zero.
- The .hi[optimal penalty] will balance reduced variance with increased bias.
.footnote[
.pink[†] Imagine the extreme case: a model whose coefficients are all zeros has no variance.
]
--
Now you understand the essence of shrinkage methods. We'll cover three:
- .attn[Ridge regression]
- .attn[Lasso]
- .attn[Elasticnet]
---
layout: true
# Ridge regression
---
class: inverse, middle
---
name: ridge
## Back to least squares (again)
.note[Recall] Least-squares regression gets $\hat{\beta}_j$'s by minimizing RSS, _i.e._,
$$
\begin{align}
\min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2
\end{align}
$$
--
.attn[Ridge regression] makes a small change
- .pink[adds a shrinkage penalty] = the sum of squared coefficients $\left( \color{#e64173}{\lambda\sum_{j}\beta_j^2} \right)$
- .pink[minimizes] the (weighted) sum of .pink[RSS and the shrinkage penalty]
--
$$
\begin{align}
\min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$
---
name: ridge-penalization
.col-left[
.hi[Ridge regression]
$$
\begin{align}
\min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$
]
.col-right[
.b[Least squares]
$$
\begin{align}
\min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2
\end{align}
$$
]
$\color{#e64173}{\lambda}\enspace (\geq0)$ is a tuning parameter for the harshness of the penalty.
$\color{#e64173}{\lambda} = 0$ implies no penalty: we are back to least squares.
--
Each value of $\color{#e64173}{\lambda}$ produces a new set of coefficients.
--
Ridge's approach to the bias-variance tradeoff: Balance
- reducing .b[RSS], _i.e._, $\sum_i\left( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \right)^2$
- reducing .b[coefficients] .grey-light[(ignoring the intercept)]
$\color{#e64173}{\lambda}$ determines how much ridge "cares about" these two quantities..super[.pink[†]]
.footnote[
.pink[†] With $\lambda=0$, least-squares regression only "cares about" RSS.
]
---
## $\lambda$ and penalization
Choosing a .it[good] value for $\lambda$ is key.
- If $\lambda$ is too small, then our model is essentially back to OLS.
- If $\lambda$ is too large, then we shrink all of our coefficients too close to zero.
--
.qa[Q] So what do we do?
--
.qa[A] Cross validate!
.grey-light[(You saw that coming, right?)]
---
## Penalization
.note[Note] Because we sum the .b[squared] coefficients, we penalize increasing .it[big] coefficients much more than increasing .it[small] coefficients.
.ex[Example] For a value of $\beta$, we pay a penalty of $2 \lambda \beta$ for a small increase..super[.pink[†]]
.footnote[
.pink[†] This quantity comes from taking the derivative of $\lambda \beta^2$ with respect to $\beta$.
]
- At $\beta = 0$, the penalty for a small increase is $0$.
- At $\beta = 1$, the penalty for a small increase is $2\lambda$.
- At $\beta = 2$, the penalty for a small increase is $4\lambda$.
- At $\beta = 3$, the penalty for a small increase is $6\lambda$.
- At $\beta = 10$, the penalty for a small increase is $20\lambda$.
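A quick sanity check of these marginal penalties in R (taking $\lambda = 1$ purely for illustration):
```{R, ex-ridge-marginal-penalty}
# Marginal ridge penalty 2 * lambda * beta at a few values of beta (lambda = 1 is arbitrary)
lambda = 1
beta = c(0, 1, 2, 3, 10)
2 * lambda * beta
```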
Now you see why we call it .it[shrinkage]: it encourages small coefficients.
---
name: ridge-standardization
## Penalization and standardization
.attn[Important] Predictors' .hi[units] can drastically .hi[affect ridge regression results].
.b[Why?]
--
Because $\mathbf{x}_j$'s units affect $\beta_j$, and ridge is very sensitive to $\beta_j$.
--
.ex[Example] Let $x_1$ denote distance.
.b[Least-squares regression]
If $x_1$ is .it[meters] and $\beta_1 = 3$, then when $x_1$ is .it[km], $\beta_1 = 3,000$.
Rescaling a predictor simply rescales its coefficient; least squares' fit and predictions are unchanged.
--
.hi[Ridge regression] pays a much larger penalty for $\beta_1=3,000$ than $\beta_1=3$.
You will not get the same (scaled) estimates when you change units.
--
.note[Solution] Standardize your variables, _i.e._, `x_stnd = (x - mean(x))/sd(x)`.
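For instance, a minimal sketch with a made-up vector `x` (which `scale()` reproduces):
```{R, ex-standardize-sketch}
# A hypothetical predictor measured in meters
x = c(120, 340, 905, 1500)
# Standardize: subtract the mean, then divide by the standard deviation
x_stnd = (x - mean(x)) / sd(x)
# 'scale()' gives the same result
all.equal(as.numeric(scale(x)), x_stnd)
```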
---
name: ridge-example
## Example
Let's return to the credit dataset.
.ex[Recall] We have 11 predictors and a numeric outcome `balance`.
I standardized our .b[predictors] using `preProcess()` from `caret`, _i.e._,
```{R, credit-data-work, include = F, cache = T}
# The Credit dataset
credit_dt = ISLR::Credit %>% clean_names() %T>% setDT()
# Clean variables
credit_dt[, `:=`(
i_female = 1 * (gender == "Female"),
i_student = 1 * (student == "Yes"),
i_married = 1 * (married == "Yes"),
i_asian_american = 1 * (ethnicity == "Asian"),
i_african_american = 1 * (ethnicity == "African American")
)]
# Drop unwanted variables
credit_dt[, `:=`(id = NULL, gender = NULL, student = NULL, married = NULL, ethnicity = NULL)]
```
```{R, credit-data-preprocess, cache = T}
# Standardize all variables except 'balance'
credit_stnd = preProcess(
# Do not process the outcome 'balance'
x = credit_dt %>% dplyr::select(-balance),
# Standardizing means 'center' and 'scale'
method = c("center", "scale")
)
# We have to pass the 'preProcess' object to 'predict' to get new data
credit_stnd %<>% predict(newdata = credit_dt)
```
---
## Example
For ridge regression.super[.pink[†]] in R, we will use `glmnet()` from the `glmnet` package.
.footnote[
.pink[†] And lasso!
]
The .hi-slate[key arguments] for `glmnet()` are
.col-left[
- `x` a .b[matrix] of predictors
- `y` outcome variable as a vector
- `standardize` (`T` or `F`)
- `alpha` elasticnet parameter
- `alpha=0` gives ridge
- `alpha=1` gives lasso
]
.col-right[
- `lambda` tuning parameter (sequence of numbers)
- `nlambda` alternatively, R picks a sequence of values for $\lambda$
]
---
## Example
We just need to define a decreasing sequence for $\lambda$, and then we're set.
```{R, ex-ridge-glmnet}
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit ridge regression
est_ridge = glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
standardize = T,
alpha = 0,
lambda = lambdas
)
```
The `glmnet` output (`est_ridge` here) contains estimated coefficients for each value of $\lambda$. You can use `predict()` to get coefficients for additional values of $\lambda$.
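For example, a sketch pulling coefficients at $\lambda = 30$ (an arbitrary value, not necessarily in our sequence):
```{R, ex-ridge-coef-lambda, eval = F}
# Ridge coefficients at lambda = 30 (glmnet interpolates between fitted lambdas)
predict(est_ridge, type = "coefficients", s = 30)
```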
---
layout: false
class: clear, middle
.b[Ridge regression coefficients] for $\lambda$ between 0.01 and 100,000
```{R, plot-ridge-glmnet, echo = F}
ridge_df = est_ridge %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
ridge_df %<>% dplyr::select(-1) %>% mutate(lambda = est_ridge$lambda)
ridge_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
data = ridge_df,
aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
expression(lambda),
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
trans = "log10"
) +
scale_y_continuous("Ridge coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
---
layout: true
# Ridge regression
## Example
---
`glmnet` also provides a convenient cross-validation function: `cv.glmnet()`.
```{R, cv-ridge, cache = T}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv = cv.glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
alpha = 0,
standardize = T,
lambda = lambdas,
# New: How we make decisions and number of folds
type.measure = "mse",
nfolds = 5
)
```
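The fitted object stores the CV results, _e.g._, the $\lambda$ that minimizes CV MSE:
```{R, ex-ridge-cv-extract}
# The lambda that minimizes CV MSE and the CV RMSE at that lambda
ridge_cv$lambda.min
sqrt(min(ridge_cv$cvm))
```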
---
layout: false
class: clear, middle
.b[Cross-validated RMSE and] $\lambda$: Which $\color{#e64173}{\lambda}$ minimizes CV RMSE?
```{R, plot-cv-ridge, echo = F}
# Create data frame of our results
ridge_cv_df = data.frame(
lambda = ridge_cv$lambda,
rmse = sqrt(ridge_cv$cvm)
)
# Plot
ggplot(
data = ridge_cv_df,
aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
data = ridge_cv_df %>% filter(rmse == min(rmse)),
size = 3.5,
color = red_pink
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
expression(lambda),
trans = "log10",
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle
Often, the CV minimum will fall farther from the extremes of your $\lambda$ range...
---
layout: false
class: clear, middle
```{R, cv-ridge2, cache = T, include = F}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv2 = cv.glmnet(
x = credit_stnd %>% dplyr::select(-balance, -rating, -limit, -income) %>% as.matrix(),
y = credit_stnd$balance,
alpha = 0,
standardize = T,
lambda = lambdas,
# New: How we make decisions and number of folds
type.measure = "mse",
nfolds = 5
)
```
.b[Cross-validated RMSE and] $\lambda$: Which $\color{#e64173}{\lambda}$ minimizes CV RMSE?
```{R, plot-cv-ridge2, echo = F}
# Create data frame of our results
ridge_cv_df2 = data.frame(
lambda = ridge_cv2$lambda,
rmse = sqrt(ridge_cv2$cvm)
)
# Plot
ggplot(
data = ridge_cv_df2,
aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
data = ridge_cv_df2 %>% filter(rmse == min(rmse)),
size = 3.5,
color = red_pink
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
expression(lambda),
trans = "log10",
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
# Ridge regression
## Example
We can also use `train()` from `caret` to cross validate ridge regression.
```{R, credit-ridge-ex, eval = F}
# Our range of lambdas
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Ridge regression with cross validation
ridge_cv = train(
# The formula
balance ~ .,
# The dataset
data = credit_stnd,
# The 'glmnet' package does ridge and lasso
method = "glmnet",
# 10-fold cross validation
trControl = trainControl("cv", number = 10),
# The parameters of 'glmnet' (alpha = 0 gives ridge regression)
tuneGrid = expand.grid(alpha = 0, lambda = lambdas)
)
```
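After `train()` finishes, the winning $\lambda$ lives in `bestTune` (a sketch; assumes the chunk above was evaluated):
```{R, ex-ridge-caret-best, eval = F}
# The lambda caret's cross validation selected (assumes 'ridge_cv' from train() above)
ridge_cv$bestTune
```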
---
name: ridge-predict
layout: false
# Ridge regression
## Prediction in R
Once you find your $\lambda$ via cross validation
1\. Fit your model on the full dataset using the optimal $\lambda$
```{R, ridge-final-1, eval = F}
# Fit final model
final_ridge = glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
standardize = T,
alpha = 0,
lambda = ridge_cv$lambda.min
)
```
---
# Ridge regression
## Prediction in R
Once you find your $\lambda$ via cross validation
1\. Fit your model on the full dataset using the optimal $\lambda$
2\. Make predictions
```{R, ridge-final-2, eval = F}
predict(
final_ridge,
type = "response",
# Our chosen lambda
s = ridge_cv$lambda.min,
# Our data
newx = credit_stnd %>% dplyr::select(-balance) %>% as.matrix()
)
```
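If you want a quick check of fit, you can compare these predictions to the observed `balance`, _e.g._, a sketch of in-sample RMSE (which will understate test error):
```{R, ridge-final-3, eval = F}
# In-sample RMSE for the final ridge model (understates out-of-sample error)
y_hat = predict(
  final_ridge,
  s = ridge_cv$lambda.min,
  newx = credit_stnd %>% dplyr::select(-balance) %>% as.matrix()
)
sqrt(mean((credit_stnd$balance - y_hat)^2))
```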
---
# Ridge regression
## Shrinking
While ridge regression .it[shrinks] coefficients close to zero, it never forces them to be equal to zero.
.b[Drawbacks]
1. We cannot use ridge regression for subset/feature selection.
1. We often end up with a bunch of tiny coefficients.
--
.qa[Q] Can't we just drive the coefficients to zero?
--
.qa[A] Yes. Just not with ridge (due to $\sum_j \hat{\beta}_j^2$).
---
layout: true
# Lasso
---
class: inverse, middle
---
name: lasso
## Intro
.attn[Lasso] simply replaces ridge's .it[squared] coefficients with absolute values.
--
.hi[Ridge regression]
$$
\begin{align}
\min_{\hat{\beta}^R} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}
$$
.hi-grey[Lasso]
$$
\begin{align}
\min_{\hat{\beta}^L} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}
$$
Everything else will be the same—except one aspect...
---
name: lasso-shrinkage
## Shrinkage
Unlike ridge, lasso's .it[marginal] penalty does not increase with the size of $\beta_j$.
You always pay $\color{#8AA19E}{\lambda}$ to increase $\big|\beta_j\big|$ by one unit.
--
The only way to avoid lasso's penalty is to .hi[set coefficients to zero].
--
This feature has two .hi-slate[benefits]
1. Some coefficients will be .hi[set to zero]—we get "sparse" models.
1. Lasso can be used for subset/feature .hi[selection].
--
We will still need to carefully select $\color{#8AA19E}{\lambda}$.
---
layout: true
# Lasso
## Example
---
name: lasso-example
We can also use `glmnet()` for lasso.
.ex[Recall] The .hi-slate[key arguments] for `glmnet()` are
.col-left[
- `x` a .b[matrix] of predictors
- `y` outcome variable as a vector
- `standardize` (`T` or `F`)
- `alpha` elasticnet parameter
- `alpha=0` gives ridge
- .hi[`alpha=1` gives lasso]
]
.col-right[
- `lambda` tuning parameter (sequence of numbers)
- `nlambda` alternatively, R picks a sequence of values for $\lambda$
]
---
Again, we define a decreasing sequence for $\lambda$, and we're set.
```{R, ex-lasso-glmnet}
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit lasso regression
est_lasso = glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
standardize = T,
alpha = 1,
lambda = lambdas
)
```
The `glmnet` output (`est_lasso` here) contains estimated coefficients for each value of $\lambda$. You can use `predict()` to get coefficients for additional values of $\lambda$.
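One way to see the sparsity: count how many coefficients lasso sets exactly to zero at a given $\lambda$ ($\lambda = 10$ here is arbitrary):
```{R, ex-lasso-zeros}
# Number of lasso coefficients that are exactly zero at lambda = 10 (arbitrary value)
b_lasso = predict(est_lasso, type = "coefficients", s = 10)
sum(as.matrix(b_lasso) == 0)
```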
---
layout: false
class: clear, middle
.b[Lasso coefficients] for $\lambda$ between 0.01 and 100,000
```{R, plot-lasso-glmnet, echo = F}
lasso_df = est_lasso %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
lasso_df %<>% dplyr::select(-1) %>% mutate(lambda = est_lasso$lambda)
lasso_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
data = lasso_df,
aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
expression(lambda),
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
trans = "log10"
) +
scale_y_continuous("Lasso coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
---
class: clear, middle
Compare lasso's tendency to force coefficients to zero with our previous ridge-regression results.
---
class: clear, middle
.b[Ridge regression coefficients] for $\lambda$ between 0.01 and 100,000
```{R, plot-ridge-glmnet-2, echo = F}
ridge_df = est_ridge %>% coef() %>% t() %>% as.matrix() %>% as.data.frame()
ridge_df %<>% dplyr::select(-1) %>% mutate(lambda = est_ridge$lambda)
ridge_df %<>% gather(key = "variable", value = "coefficient", -lambda)
ggplot(
data = ridge_df,
aes(x = lambda, y = coefficient, color = variable)
) +
geom_line() +
scale_x_continuous(
expression(lambda),
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
trans = "log10"
) +
scale_y_continuous("Ridge coefficient") +
scale_color_viridis_d("Predictor", option = "magma", end = 0.9) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "bottom")
```
---
# Lasso
## Example
We can also cross validate $\lambda$ with `cv.glmnet()`.
```{R, cv-lasso, cache = T}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv = cv.glmnet(
x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
y = credit_stnd$balance,
alpha = 1,
standardize = T,
lambda = lambdas,
# New: How we make decisions and number of folds
type.measure = "mse",
nfolds = 5
)
```
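As before, `lambda.min` gives the CV-minimizing $\lambda$, and `coef()` shows which predictors survive at that value (a sketch):
```{R, ex-lasso-cv-extract, eval = F}
# CV-minimizing lambda and the (sparse) coefficient vector lasso keeps there
lasso_cv$lambda.min
coef(lasso_cv, s = "lambda.min")
```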
---
layout: false
class: clear, middle
.b[Cross-validated RMSE and] $\lambda$: Which $\color{#8AA19E}{\lambda}$ minimizes CV RMSE?
```{R, plot-cv-lasso, echo = F}
# Create data frame of our results
lasso_cv_df = data.frame(
lambda = lasso_cv$lambda,
rmse = sqrt(lasso_cv$cvm)
)
# Plot
ggplot(
data = lasso_cv_df,
aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
data = lasso_cv_df %>% filter(rmse == min(rmse)),
size = 3.5,
color = "#8AA19E"
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
expression(lambda),
trans = "log10",
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle
Again, the CV minimum will often fall farther from the extremes of your $\lambda$ range...
---
class: clear, middle
.b[Cross-validated RMSE and] $\lambda$: Which $\color{#8AA19E}{\lambda}$ minimizes CV RMSE?
```{R, cv-lasso2, cache = T, include = F}
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv2 = cv.glmnet(
x = credit_stnd %>% dplyr::select(-balance, -rating, -limit, -income) %>% as.matrix(),
y = credit_stnd$balance,
alpha = 1,
standardize = T,
lambda = lambdas,
# New: How we make decisions and number of folds
type.measure = "mse",
nfolds = 5
)
```
```{R, plot-cv-lasso2, echo = F}
# Create data frame of our results
lasso_cv_df2 = data.frame(
lambda = lasso_cv2$lambda,
rmse = sqrt(lasso_cv2$cvm)
)
# Plot
ggplot(
data = lasso_cv_df2,
aes(x = lambda, y = rmse)
) +
geom_line() +
geom_point(
data = lasso_cv_df2 %>% filter(rmse == min(rmse)),
size = 3.5,
color = "#8AA19E"
) +
scale_y_continuous("RMSE") +
scale_x_continuous(
expression(lambda),
trans = "log10",
labels = c("0.1", "10", "1,000", "100,000"),
breaks = c(0.1, 10, 1000, 100000),
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
class: clear, middle
So which shrinkage method should you choose?
---
layout: true
# Ridge or lasso?
---
name: or
.col-left.pink[
.b[Ridge regression]
.b.orange[+] shrinks $\hat{\beta}_j$ .it[near] 0
.b.orange[-] many small $\hat\beta_j$
.b.orange[-] doesn't work for selection
.b.orange[-] difficult to interpret output
.b.orange[+] better when all $\beta_j\neq$ 0
.it[Best:] $p$ is large & $\beta_j\approx\beta_k$
]
.col-right.purple[
.b[Lasso]
.b.orange[+] shrinks $\hat{\beta}_j$ to 0
.b.orange[+] many $\hat\beta_j=$ 0
.b.orange[+] great for selection
.b.orange[+] sparse models easier to interpret
.b.orange[-] implicitly assumes some $\beta=$ 0
.it[Best:] $p$ is large & many $\beta_j\approx$ 0
]
--
.left-full[
> [N]either ridge... nor the lasso will universally dominate the other.
.ex[ISL, p. 224]
]
---
name: both
layout: false
# Ridge .it[and] lasso
## Why not both?
.hi-blue[Elasticnet] combines .pink[ridge regression] and .grey[lasso].
--
$$
\begin{align}
\min_{\beta^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}
$$
We now have two tuning parameters: $\lambda$ and $\color{#181485}{\alpha}$.
--
Remember the `alpha` argument in `glmnet()`?
- $\color{#e64173}{\alpha = 0}$ specifies ridge
- $\color{#8AA19E}{\alpha=1}$ specifies lasso
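For a single elasticnet fit with `glmnet()`, you pick an `alpha` strictly between 0 and 1, _e.g._, a sketch with an (arbitrary) even split:
```{R, ex-net-glmnet}
# Elasticnet: alpha = 0.5 is arbitrary; in practice we cross validate alpha too
est_net = glmnet(
  x = credit_stnd %>% dplyr::select(-balance) %>% as.matrix(),
  y = credit_stnd$balance,
  standardize = T,
  alpha = 0.5,
  lambda = lambdas
)
```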
---
# Ridge .it[and] lasso
## Why not both?
We can use `train()` from `caret` to cross validate $\alpha$ and $\lambda$.
.note[Note] You need to consider all combinations of the two parameters.
Considering every combination can create *a lot* of models to estimate.
For example,
- 1,000 values of $\lambda$
- 1,000 values of $\alpha$
leaves you with 1,000,000 models to estimate..super[.pink[†]]
.footnote[
.pink[†] 5,000,000 if you are doing 5-fold CV!
]
---
layout: false
class: clear, middle
```{R, credit-net-ex, eval = F}
# Our range of λ
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Our range of α
alphas = seq(from = 0, to = 1, by = 0.1)
# Elasticnet with cross validation
net_cv = train(
# The formula
balance ~ .,
# The dataset
data = credit_stnd,
# The 'glmnet' package does ridge and lasso
method = "glmnet",
# 10-fold cross validation
trControl = trainControl("cv", number = 10),
# The parameters of 'glmnet'
tuneGrid = expand.grid(alpha = alphas, lambda = lambdas)
)
```
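`train()` then reports the best $(\alpha, \lambda)$ pair (a sketch; assumes the chunk above was evaluated):
```{R, ex-net-best, eval = F}
# The (alpha, lambda) combination that minimized CV error (assumes 'net_cv' from above)
net_cv$bestTune
```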
---
name: sources
layout: false
# Sources
These notes draw upon
- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)
James, Witten, Hastie, and Tibshirani
---
# Table of contents
.col-left[
.smallest[
#### Admin
- [Today](#admin-today)
- [Upcoming](#admin-soon)
#### Shrinkage
- [Introduction](#shrinkage-intro)
- [Why?](#shrinkage-why)
#### Ridge regression
- [Intro](#ridge)
- [Penalization](#ridge-penalization)
- [Standardization](#ridge-standardization)
- [Example](#ridge-example)
- [Prediction](#ridge-predict)
]
]
.col-right[
.smallest[
#### (The) lasso
- [Intro](#lasso)
- [Shrinkage](#lasso-shrinkage)
- [Example](#lasso-example)
#### Ridge or lasso
- [Plus/minus](#or)
- [Both?](#both)
#### Other
- [Sources/references](#sources)
]
]