Last time
Today
tidymodels
Readings
Problem sets: Soon!
Recap: Subset-selection methods (last time)
These methods assume we need to choose a model before we fit it...
Alternative approach: Shrinkage methods
† Synonyms for shrink: constrain or regularize
Idea: Penalize the model for coefficients as they move away from zero.
Q How could shrinking coefficients toward zero help our predictions?
A Remember we're generally facing a tradeoff between bias and variance.
† Imagine the extreme case: a model whose coefficients are all zeros has no variance.
Now you understand shrinkage methods.
Recall Least-squares regression gets \hat{\beta}_j's by minimizing RSS, i.e., \begin{align} \min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2 \end{align}
Ridge regression makes a small change
Ridge regression \begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}
Least squares \begin{align} \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 \end{align}
\color{#e64173}{\lambda}\enspace (\geq0) is a tuning parameter for the harshness of the penalty.
\color{#e64173}{\lambda} = 0 implies no penalty: we are back to least squares.
Each value of \color{#e64173}{\lambda} produces a new set of coefficients.
Ridge's approach to the bias-variance tradeoff: Balance RSS against the penalty on coefficient size.
\color{#e64173}{\lambda} determines how much ridge "cares about" these two quantities.†
† With \lambda=0, least-squares regression only "cares about" RSS.
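To make the objective concrete, here is a minimal sketch (toy simulated data and hand-picked candidate coefficients, not from these notes) of the quantity ridge minimizes for a given \lambda:

# Sketch: the ridge objective (RSS + penalty) evaluated at candidate coefficients
ridge_objective = function(b, X, y, lambda) {
  resid = y - X %*% b                  # residuals at candidate coefficients b
  sum(resid^2) + lambda * sum(b^2)     # RSS + lambda * sum of squared coefficients
}
set.seed(1)
X = matrix(rnorm(100 * 3), ncol = 3)
y = X %*% c(2, -1, 0) + rnorm(100)
ridge_objective(b = c(2, -1, 0), X = X, y = y, lambda = 10)
ridge_objective(b = c(0, 0, 0),  X = X, y = y, lambda = 10)  # all-zero coefficients: only RSS remains

Larger \lambda shifts the balance toward the penalty term, so the minimizing coefficients shrink toward zero.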
Choosing a good value for \lambda is key.
Q So what do we do?
A Cross validate!
(You saw that coming, right?)
Note Because we sum the squared coefficients, we penalize increases to big coefficients much more than increases to small coefficients.
Example For a value of \beta, we pay a penalty of 2 \lambda \beta for a small increase.†
† This quantity comes from taking the derivative of \lambda \beta^2 with respect to \beta.
Now you see why we call it shrinkage: it encourages small coefficients.
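Writing out the footnote's derivative explicitly (an added step, not in the original slide):
\begin{align} \frac{\partial}{\partial \beta_j} \left( \lambda \beta_j^2 \right) = 2 \lambda \beta_j \end{align}
So the marginal cost of making a coefficient bigger grows with the coefficient itself, which is exactly why big coefficients get shrunk hardest.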
Important Predictors' units can drastically affect ridge regression results.
Why? Because \mathbf{x}_j's units affect \beta_j, and ridge is very sensitive to \beta_j.
Example Let x_1 denote distance.
Least-squares regression
If x_1 is meters and \beta_1 = 3, then when x_1 is km, \beta_1 = 3,000.
The scale/units of predictors do not affect least squares' estimates.
Ridge regression pays a much larger penalty for \beta_1=3,000 than \beta_1=3.
You will not get the same (scaled) estimates when you change units.
Solution Standardize your variables, i.e., x_stnd = (x - mean(x))/sd(x), e.g., with recipes::step_normalize().
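To see the units issue in action, here is a minimal simulated sketch (not the credit data; dist_m, dist_km, and x2 are made-up variables, and it assumes the glmnet package):

# Sketch: OLS coefficients rescale exactly with the units; unstandardized ridge does not
library(glmnet)
set.seed(1)
n = 200
dist_m  = runif(n, 0, 1000)                 # distance in meters
x2      = rnorm(n)                          # a second predictor
y       = 3 * dist_m + 5 * x2 + rnorm(n, sd = 50)
dist_km = dist_m / 1000                     # the same distance in kilometers
# Least squares: the km coefficient is exactly 1,000 times the meters coefficient
coef(lm(y ~ dist_m + x2))["dist_m"] * 1000
coef(lm(y ~ dist_km + x2))["dist_km"]
# Ridge with a fixed lambda and standardize = FALSE: the km coefficient gets shrunk much harder
coef(glmnet(cbind(dist_m, x2),  y, alpha = 0, lambda = 10, standardize = FALSE))["dist_m", ]
coef(glmnet(cbind(dist_km, x2), y, alpha = 0, lambda = 10, standardize = FALSE))["dist_km", ]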
Let's return to the credit dataset—and pre-processing with tidymodels.
Recall We have 11 predictors and a numeric outcome balance.
We can standardize our predictors using step_normalize() from recipes:
# Load packages (needed for the pipeline below)
library(tidyverse)
library(tidymodels)
library(janitor)
# Load the credit dataset
credit_df = ISLR::Credit %>% clean_names()
# Processing recipe: Define ID, standardize, create dummies, rename (lowercase)
credit_recipe = credit_df %>%
  recipe(balance ~ .) %>%
  update_role(id, new_role = "id variable") %>%
  step_normalize(all_predictors() & all_numeric()) %>%
  step_dummy(all_predictors() & all_nominal()) %>%
  step_rename_at(everything(), fn = str_to_lower)
# Time to juice
credit_clean = credit_recipe %>% prep() %>% juice()
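A quick sanity check you might run (a sketch; income and age are assumed to be two of the processed numeric predictors): the standardized predictors should now have mean roughly 0 and standard deviation 1.

# Sketch: check that standardization worked
credit_clean %>%
  summarize(across(c(income, age), list(mean = mean, sd = sd)))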
For ridge regression† in R, we will use glmnet() from the glmnet package.
† And lasso!
The key arguments for glmnet() are
x: a matrix of predictors
y: the outcome variable as a vector
standardize: (T or F)
alpha: the elasticnet parameter; alpha = 0 gives ridge, alpha = 1 gives lasso
lambda: the tuning parameter (a sequence of numbers)
nlambda: alternatively, R picks a sequence of values for \lambda
We just need to define a decreasing sequence for \lambda, and then we're set.
# Load glmnet
library(glmnet)
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit ridge regression
est_ridge = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = F,
  alpha = 0,
  lambda = lambdas
)
The glmnet output (est_ridge here) contains the estimated coefficients for each value of \lambda. You can use predict() to get coefficients for additional values of \lambda.
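For instance (a sketch, assuming est_ridge from above), you can pull the coefficients at a \lambda that was not in the original grid:

# Sketch: coefficients at lambda = 25, interpolated by glmnet
predict(est_ridge, type = "coefficients", s = 25)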
Ridge regression coefficients for \lambda between 0.01 and 100,000
glmnet also provides a convenient cross-validation function: cv.glmnet().
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv = cv.glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  alpha = 0,
  standardize = F,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
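Once cv.glmnet() finishes, the chosen penalties live on the fitted object (a sketch, assuming ridge_cv from above):

# Sketch: the lambda minimizing CV error, plus the more conservative 1-standard-error choice
ridge_cv$lambda.min
ridge_cv$lambda.1se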
Cross-validated RMSE and \lambda: Which \color{#e64173}{\lambda} minimizes CV RMSE?
Often, the minimum will sit more obviously away from the extremes of your \lambda range.
Recall: the bias-variance tradeoff.
tidymodels
tidymodels can also cross validate (and fit) ridge regression.
Back to our linear_reg() model 'specification'.
The penalty \lambda (what we want to tune) is penalty instead of lambda.
Set mixture = 0 inside linear_reg() (same as alpha = 0, above).
Use the glmnet engine.
# Define the model
model_ridge = linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet")
Example of ridge regression with tidymodels
# Our range of lambdas
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the model
model_ridge = linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet")
# Define our ridge workflow
workflow_ridge = workflow() %>% add_model(model_ridge) %>% add_recipe(credit_recipe)
# CV with our range of lambdas
cv_ridge = workflow_ridge %>% tune_grid(
  credit_cv,
  grid = data.frame(penalty = lambdas),
  metrics = metric_set(rmse)
)
# Show the best models
cv_ridge %>% show_best()
With tidymodels...
Next steps: Finalize your workflow and fit your last model.
Recall: finalize_workflow(), last_fit(), and collect_predictions().
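A minimal sketch of those finishing steps (assuming cv_ridge and workflow_ridge from above; credit_split is a hypothetical rsample::initial_split() of the credit data that we have not actually created in these notes):

# Sketch: finalize the workflow with the best penalty, then fit/predict via last_fit()
final_wf = workflow_ridge %>%
  finalize_workflow(select_best(cv_ridge, metric = "rmse"))
final_fit = final_wf %>% last_fit(credit_split)    # credit_split: hypothetical initial_split()
final_fit %>% collect_predictions() %>% head()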
Otherwise: Once you find your \lambda via cross validation,
1. Fit your model on the full dataset using the optimal \lambda
# Fit final model
final_ridge = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = T,
  alpha = 0,
  lambda = ridge_cv$lambda.min
)
2. Make predictions
predict(
  final_ridge,
  type = "response",
  # Our chosen lambda
  s = ridge_cv$lambda.min,
  # Our data
  newx = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix()
)
While ridge regression shrinks coefficients close to zero, it never forces them to be equal to zero.
Drawbacks
Q Can't we just drive the coefficients to zero?
A Yes. Just not with ridge (due to \sum_j \hat{\beta}_j^2).
Lasso simply replaces ridge's squared coefficients with absolute values.
Ridge regression \begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}
Lasso \begin{align} \min_{\hat{\beta}^L} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \end{align}
Everything else will be the same—except one aspect...
Unlike ridge, lasso's marginal penalty does not increase with the size of \beta_j.
You always pay \color{#8AA19E}{\lambda} to increase \big|\beta_j\big| by one unit.
The only way to avoid lasso's penalty is to set coefficients to zero.
This feature has two benefits: it can perform variable selection, and it tends to produce sparse, easier-to-interpret models.
We will still need to carefully select \color{#8AA19E}{\lambda}.
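Writing the two marginal penalties side by side (an added step, for \beta_j \neq 0):
\begin{align} \frac{\partial}{\partial \beta_j} \left( \color{#e64173}{\lambda \beta_j^2} \right) = 2 \lambda \beta_j \qquad\qquad \frac{\partial}{\partial \beta_j} \left( \color{#8AA19E}{\lambda \big|\beta_j\big|} \right) = \lambda \cdot \mathrm{sign}(\beta_j) \end{align}
Ridge's marginal penalty fades as \beta_j approaches zero, so it never pays to go all the way; lasso's stays at \lambda, which is what pushes small coefficients exactly to zero.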
We can also use glmnet() for lasso.
Recall The key arguments for glmnet() are
x: a matrix of predictors
y: the outcome variable as a vector
standardize: (T or F)
alpha: the elasticnet parameter; alpha = 0 gives ridge, alpha = 1 gives lasso
lambda: the tuning parameter (a sequence of numbers)
nlambda: alternatively, R picks a sequence of values for \lambda
Again, we define a decreasing sequence for \lambda, and we're set.
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit lasso regression
est_lasso = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = F,
  alpha = 1,
  lambda = lambdas
)
The glmnet output (est_lasso here) contains the estimated coefficients for each value of \lambda. You can use predict() to get coefficients for additional values of \lambda.
Lasso coefficients for \lambda between 0.01 and 100,000
Compare lasso's tendency to force coefficients to zero with our previous ridge-regression results.
Ridge regression coefficients for \lambda between 0.01 and 100,000
We can also cross validate \lambda with cv.glmnet().
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv = cv.glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  alpha = 1,
  standardize = F,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
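A sketch (assuming lasso_cv from above) of checking which coefficients lasso zeroes out at the CV-chosen penalty:

# Sketch: coefficients at the CV-minimizing lambda; dots in the sparse printout are exact zeros
coef(lasso_cv, s = "lambda.min")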
Cross-validated RMSE and \lambda: Which \color{#8AA19E}{\lambda} minimizes CV RMSE?
Again, you will have a minimum farther away from your extremes...
So which shrinkage method should you choose?
Ridge regression
+ shrinks \hat{\beta}_j near 0
- many small \hat\beta_j
- doesn't work for selection
- difficult to interpret output
+ better when all \beta_j\neq 0
Best: p is large & \beta_j\approx\beta_k
Lasso
+ shrinks \hat{\beta}_j to 0
+ many \hat\beta_j= 0
+ great for selection
+ sparse models easier to interpret
- implicitly assumes some \beta= 0
Best: p is large & many \beta_j\approx 0
[N]either ridge... nor the lasso will universally dominate the other.
ISL, p. 224
Elasticnet combines ridge regression and lasso.
\begin{align} \min_{\beta^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \end{align}
We now have two tuning parameters: \lambda (penalty) and \color{#181485}{\alpha} (mixture).
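As a quick check on the formula (an added note): the two extremes of the mixture recover the penalties we already know.
\begin{align} \color{#181485}{\alpha} = 0 \;\Rightarrow\; \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \;\; (\text{ridge}) \qquad\qquad \color{#181485}{\alpha} = 1 \;\Rightarrow\; \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \;\; (\text{lasso}) \end{align}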
Remember the alpha argument in glmnet()?
We can use tune() from tidymodels to cross validate both \alpha and \lambda.
Note You need to consider all combinations of the two parameters.
This combination can create a lot of models to estimate.
For example, a grid of 1,000 values for \lambda and 1,000 values for \alpha leaves you with 1,000,000 models to estimate.†
† 5,000,000 if you are doing 5-fold CV!
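The arithmetic behind that example, as a sketch (the two 1,000-value grids are illustrative choices):

# Sketch: a full cartesian grid over 1,000 penalties and 1,000 mixtures has 1,000,000 rows
library(tidyr)
nrow(expand_grid(
  penalty = 10^seq(from = 5, to = -2, length = 1e3),
  mixture = seq(from = 0, to = 1, length = 1e3)
))  # 1,000,000 candidate models (times 5 when doing 5-fold CV)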
Cross validating elasticnet in tidymodels
# Our range of λ and α
lambdas = 10^seq(from = 5, to = -2, length = 1e2)
alphas = seq(from = 0, to = 1, by = 0.1)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the elasticnet model
model_net = linear_reg(penalty = tune(), mixture = tune()) %>% set_engine("glmnet")
# Define our workflow
workflow_net = workflow() %>% add_model(model_net) %>% add_recipe(credit_recipe)
# CV elasticnet with our range of lambdas
cv_net = workflow_net %>% tune_grid(
  credit_cv,
  grid = expand_grid(mixture = alphas, penalty = lambdas),
  metrics = metric_set(rmse)
)
Cross validating elasticnet in tidymodels with grid_regular()
# Our range of λ and α
lambdas = 10^seq(from = 5, to = -2, length = 1e2)
alphas = seq(from = 0, to = 1, by = 0.1)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the elasticnet model
model_net = linear_reg(penalty = tune(), mixture = tune()) %>% set_engine("glmnet")
# Define our workflow
workflow_net = workflow() %>% add_model(model_net) %>% add_recipe(credit_recipe)
# CV elasticnet with grid_regular() choosing the values
cv_net = workflow_net %>% tune_grid(
  credit_cv,
  grid = grid_regular(mixture(), penalty(), levels = 100:100),
  metrics = metric_set(rmse)
)
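After tuning (a sketch, assuming cv_net from above), you can inspect the best penalty/mixture pairs:

# Sketch: best combinations of penalty and mixture by CV RMSE
cv_net %>% show_best(metric = "rmse")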
In case you are curious: The best model had \lambda\approx 0.628 and \alpha\approx 0.737.
CV estimates suggest elasticnet actually reduced RMSE from ridge's 118 to 101.
These notes draw upon An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani.