
Lecture 005

Shrinkage methods

Edward Rubin

1 / 48

Admin

2 / 48

Admin

Material

Last time

  • Linear regression
  • Model selection
    • Best subset selection
    • Stepwise selection (forward/backward)

Today

  • tidymodels
  • Shrinkage methods
3 / 48

Admin

Admin

Upcoming

Readings

  • Today ISL Ch. 6
  • Next ISL Ch. 4

Problem sets Soon!

4 / 48

Shrinkage methods

Intro

Recap: Subset-selection methods (last time)

  1. algorithmically search for the "best" subset of our p predictors
  2. estimate the linear models via least squares
5 / 48

Shrinkage methods

Intro

Recap: Subset-selection methods (last time)

  1. algorithmically search for the "best" subset of our p predictors
  2. estimate the linear models via least squares

These methods assume we need to choose a model before we fit it...

5 / 48

Shrinkage methods

Intro

Recap: Subset-selection methods (last time)

  1. algorithmically search for the "best" subset of our p predictors
  2. estimate the linear models via least squares

These methods assume we need to choose a model before we fit it...

Alternative approach: Shrinkage methods

  • fit a model that contains all p predictors
  • simultaneously: shrink coefficients toward zero

Synonyms for shrink: constrain or regularize

5 / 48

Shrinkage methods

Intro

Recap: Subset-selection methods (last time)

  1. algorithmically search for the "best" subset of our p predictors
  2. estimate the linear models via least squares

These methods assume we need to choose a model before we fit it...

Alternative approach: Shrinkage methods

  • fit a model that contains all p predictors
  • simultaneously: shrink coefficients toward zero

Synonyms for shrink: constrain or regularize

Idea: Penalize the model for coefficients as they move away from zero.

5 / 48

Shrinkage methods

Why?

Q How could shrinking coefficients toward zero help our predictions?

6 / 48

Shrinkage methods

Why?

Q How could shrinking coefficients toward zero help our predictions?

A Remember we're generally facing a tradeoff between bias and variance.

6 / 48

Shrinkage methods

Why?

Q How could shrinking coefficients toward zero help our predictions?

A Remember we're generally facing a tradeoff between bias and variance.

  • Shrinking our coefficients toward zero reduces the model's variance.
  • Penalizing our model for larger coefficients shrinks them toward zero.
  • The optimal penalty will balance reduced variance with increased bias.

Imagine the extreme case: a model whose coefficients are all zeros has no variance.

6 / 48

Shrinkage methods

Why?

Q How could shrinking coefficients toward zero help our predictions?

A Remember we're generally facing a tradeoff between bias and variance.

  • Shrinking our coefficients toward zero reduces the model's variance.
  • Penalizing our model for larger coefficients shrinks them toward zero.
  • The optimal penalty will balance reduced variance with increased bias.

Imagine the extreme case: a model whose coefficients are all zeros has no variance.

Now you understand shrinkage methods.

  • Ridge regression
  • Lasso
  • Elasticnet
6 / 48

Ridge regression

7 / 48

Ridge regression

Back to least squares (again)

Recall Least-squares regression gets \hat{\beta}_j's by minimizing RSS, i.e., \begin{align} \min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2 \end{align}

8 / 48

Ridge regression

Back to least squares (again)

Recall Least-squares regression gets \hat{\beta}_j's by minimizing RSS, i.e., \begin{align} \min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2 \end{align}

Ridge regression makes a small change

  • adds a shrinkage penalty = the sum of squared coefficients \left( \color{#e64173}{\lambda\sum_{j}\beta_j^2} \right)
  • minimizes the (weighted) sum of RSS and the shrinkage penalty
8 / 48

Ridge regression

Back to least squares (again)

Recall Least-squares regression gets \hat{\beta}_j's by minimizing RSS, i.e., \begin{align} \min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2 \end{align}

Ridge regression makes a small change

  • adds a shrinkage penalty = the sum of squared coefficients \left( \color{#e64173}{\lambda\sum_{j}\beta_j^2} \right)
  • minimizes the (weighted) sum of RSS and the shrinkage penalty

\begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}

8 / 48

Ridge regression

Ridge regression \begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}

Least squares \begin{align} \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 \end{align}





\color{#e64173}{\lambda}\enspace (\geq0) is a tuning parameter for the harshness of the penalty.
\color{#e64173}{\lambda} = 0 implies no penalty: we are back to least squares.

9 / 48

Ridge regression

Ridge regression \begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}

Least squares \begin{align} \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 \end{align}





\color{#e64173}{\lambda}\enspace (\geq0) is a tuning parameter for the harshness of the penalty.
\color{#e64173}{\lambda} = 0 implies no penalty: we are back to least squares.
Each value of \color{#e64173}{\lambda} produces a new set of coefficients.

9 / 48

Ridge regression

Ridge regression \begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}

Least squares \begin{align} \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 \end{align}





\color{#e64173}{\lambda}\enspace (\geq0) is a tuning parameter for the harshness of the penalty.
\color{#e64173}{\lambda} = 0 implies no penalty: we are back to least squares.
Each value of \color{#e64173}{\lambda} produces a new set of coefficients.

Ridge's approach to the bias-variance tradeoff: Balance

  • reducing RSS, i.e., \sum_i\left( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \right)^2
  • reducing coefficients (ignoring the intercept)

\color{#e64173}{\lambda} determines how much ridge "cares about" these two quantities.

With \lambda=0, least-squares regression only "cares about" RSS.

9 / 48

Ridge regression

\lambda and penalization

Choosing a good value for \lambda is key.

  • If \lambda is too small, then our model is essentially back to OLS.
  • If \lambda is too large, then we shrink all of our coefficients too close to zero.
10 / 48

Ridge regression

\lambda and penalization

Choosing a good value for \lambda is key.

  • If \lambda is too small, then our model is essentially back to OLS.
  • If \lambda is too large, then we shrink all of our coefficients too close to zero.

Q So what do we do?

10 / 48

Ridge regression

\lambda and penalization

Choosing a good value for \lambda is key.

  • If \lambda is too small, then our model is essentially back to OLS.
  • If \lambda is too large, then we shrink all of our coefficients too close to zero.

Q So what do we do?
A Cross validate!

(You saw that coming, right?)

10 / 48

Ridge regression

Penalization

Note Because we sum the squared coefficients, increasing an already-large coefficient is penalized much more than increasing a small one.

Example For a value of \beta, we pay a penalty of 2 \lambda \beta for a small increase.

This quantity comes from taking the derivative of \lambda \beta^2 with respect to \beta.
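In symbols (just restating that derivative): \begin{align} \dfrac{\partial}{\partial \beta} \left( \lambda \beta^2 \right) = 2 \lambda \beta \end{align}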

  • At \beta = 0, the penalty for a small increase is 0.
  • At \beta = 1, the penalty for a small increase is 2\lambda.
  • At \beta = 2, the penalty for a small increase is 4\lambda.
  • At \beta = 3, the penalty for a small increase is 6\lambda.
  • At \beta = 10, the penalty for a small increase is 20\lambda.

Now you see why we call it shrinkage: it encourages small coefficients.

11 / 48

Ridge regression

Penalization and standardization

Important Predictors' units can drastically affect ridge regression results.

Why?

12 / 48

Ridge regression

Penalization and standardization

Important Predictors' units can drastically affect ridge regression results.

Why? Because \mathbf{x}_j's units affect \beta_j, and ridge is very sensitive to \beta_j.

12 / 48

Ridge regression

Penalization and standardization

Important Predictors' units can drastically affect ridge regression results.

Why? Because \mathbf{x}_j's units affect \beta_j, and ridge is very sensitive to \beta_j.

Example Let x_1 denote distance.

Least-squares regression
If x_1 is in meters and \beta_1 = 3, then when x_1 is in km, \beta_1 = 3,000.
The scale/units of predictors do not affect the least-squares fit; the coefficients simply rescale.

12 / 48

Ridge regression

Penalization and standardization

Important Predictors' units can drastically affect ridge regression results.

Why? Because \mathbf{x}_j's units affect \beta_j, and ridge is very sensitive to \beta_j.

Example Let x_1 denote distance.

Least-squares regression
If x_1 is in meters and \beta_1 = 3, then when x_1 is in km, \beta_1 = 3,000.
The scale/units of predictors do not affect the least-squares fit; the coefficients simply rescale.

Ridge regression pays a much larger penalty for \beta_1=3,000 than \beta_1=3.
You will not get the same (scaled) estimates when you change units.

12 / 48

Ridge regression

Penalization and standardization

Important Predictors' units can drastically affect ridge regression results.

Why? Because \mathbf{x}_j's units affect \beta_j, and ridge is very sensitive to \beta_j.

Example Let x_1 denote distance.

Least-squares regression
If x_1 is in meters and \beta_1 = 3, then when x_1 is in km, \beta_1 = 3,000.
The scale/units of predictors do not affect the least-squares fit; the coefficients simply rescale.

Ridge regression pays a much larger penalty for \beta_1=3,000 than \beta_1=3.
You will not get the same (scaled) estimates when you change units.

Solution Standardize your variables, i.e., x_stnd = (x - mean(x))/sd(x).
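A minimal base-R sketch (x here is a made-up example vector; scale() gives the same result):

# Hypothetical predictor: standardize by hand and with scale()
x = c(100, 250, 400, 800)                # e.g., distances in meters
x_stnd = (x - mean(x)) / sd(x)           # now mean 0 and sd 1; units no longer matter
all.equal(as.numeric(scale(x)), x_stnd)  # TRUE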

12 / 48

Ridge regression

Penalization and standardization

Important Predictors' units can drastically affect ridge regression results.

Why? Because \mathbf{x}_j's units affect \beta_j, and ridge is very sensitive to \beta_j.

Example Let x_1 denote distance.

Least-squares regression
If x_1 is in meters and \beta_1 = 3, then when x_1 is in km, \beta_1 = 3,000.
The scale/units of predictors do not affect the least-squares fit; the coefficients simply rescale.

Ridge regression pays a much larger penalty for \beta_1=3,000 than \beta_1=3.
You will not get the same (scaled) estimates when you change units.

Solution Standardize your variables, i.e., recipes::step_normalize().

13 / 48

Ridge regression

Example

Let's return to the credit dataset—and pre-processing with tidymodels.

Recall We have 11 predictors and a numeric outcome balance.

We can standardize our predictors using step_normalize() from recipes:

# Load the credit dataset
credit_df = ISLR::Credit %>% clean_names()
# Processing recipe: Define ID, standardize, create dummies, rename (lowercase)
credit_recipe = credit_df %>%
  recipe(balance ~ .) %>%
  update_role(id, new_role = "id variable") %>%
  step_normalize(all_predictors() & all_numeric()) %>%
  step_dummy(all_predictors() & all_nominal()) %>%
  step_rename_at(everything(), fn = str_to_lower)
# Time to juice
credit_clean = credit_recipe %>% prep() %>% juice()
14 / 48

Ridge regression

Example

For ridge regression in R, we will use glmnet() from the glmnet package.

And lasso!

The key arguments for glmnet() are

  • x a matrix of predictors
  • y outcome variable as a vector
  • standardize (T or F)
  • alpha elasticnet parameter
    • alpha=0 gives ridge
    • alpha=1 gives lasso
  • lambda tuning parameter (sequence of numbers)
  • nlambda alternatively, R picks a sequence of values for \lambda
15 / 48

Ridge regression

Example

We just need to define a decreasing sequence for \lambda, and then we're set.

# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit ridge regression
est_ridge = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = F,
  alpha = 0,
  lambda = lambdas
)

The glmnet output (est_ridge here) contains estimated coefficients for each value of \lambda. You can use predict() to get coefficients for additional values of \lambda.
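For example, a quick sketch (s = 10 is just an arbitrary illustrative value within our \lambda range):

# Coefficients at a specific λ (glmnet interpolates if s was not in the fitted sequence)
predict(est_ridge, s = 10, type = "coefficients")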

16 / 48

Ridge regression coefficients for \lambda between 0.01 and 100,000

17 / 48

Ridge regression

Example

glmnet also provides a convenient cross-validation function: cv.glmnet().

# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv = cv.glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  alpha = 0,
  standardize = F,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
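Two values you will likely want from the resulting object (a quick sketch; both are standard cv.glmnet outputs):

# The λ that minimizes CV error, and a more conservative alternative
ridge_cv$lambda.min  # λ with the lowest cross-validated error
ridge_cv$lambda.1se  # largest λ within one standard error of that minimum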
18 / 48

Cross-validated RMSE and \lambda: Which \color{#e64173}{\lambda} minimizes CV RMSE?

19 / 48

Often, you will have a minimum that sits more obviously away from the extremes.

Recall: Variance-bias tradeoff.

20 / 48

Cross-validated RMSE and \lambda: Which \color{#e64173}{\lambda} minimizes CV RMSE?

21 / 48

Ridge regression

In tidymodels

tidymodels can also cross validate (and fit) ridge regression.

  • Back to our linear_reg() model 'specification'.

  • The penalty \lambda (what we want to tune) is penalty instead of lambda.

  • Set mixture = 0 inside linear_reg() (same as alpha = 0, above).

  • Use the glmnet engine.

# Define the model
model_ridge = linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet")
22 / 48

Example of ridge regression with tidymodels

# Our range of lambdas
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the model
model_ridge = linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet")
# Define our ridge workflow
workflow_ridge = workflow() %>%
  add_model(model_ridge) %>%
  add_recipe(credit_recipe)
# CV with our range of lambdas
cv_ridge = workflow_ridge %>%
  tune_grid(
    credit_cv,
    grid = data.frame(penalty = lambdas),
    metrics = metric_set(rmse)
  )
# Show the best models
cv_ridge %>% show_best()
23 / 48

With tidymodels...

Next steps: Finalize your workflow and fit your last model.

Recall: finalize_workflow(), last_fit(), and collect_predictions()
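A minimal sketch of those steps (credit_split is assumed here, e.g., from initial_split(credit_df); it is not defined in these slides):

# Finalize the workflow with the best penalty, fit on training data, predict on test data
best_penalty = cv_ridge %>% select_best(metric = "rmse")
final_ridge_wf = workflow_ridge %>% finalize_workflow(best_penalty)
final_ridge_fit = final_ridge_wf %>% last_fit(credit_split)  # credit_split: assumed rsplit object
final_ridge_fit %>% collect_predictions() %>% head()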

24 / 48

Ridge regression

Prediction in R

Otherwise (i.e., working with glmnet directly): Once you find your \lambda via cross validation,

1. Fit your model on the full dataset using the optimal \lambda

# Fit final model (match the CV settings: the data are already standardized)
final_ridge = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = F,
  alpha = 0,
  lambda = ridge_cv$lambda.min
)
25 / 48

Ridge regression

Prediction in R

Once you find your \lambda via cross validation

1. Fit your model on the full dataset using the optimal \lambda

2. Make predictions

predict(
  final_ridge,
  type = "response",
  # Our chosen lambda
  s = ridge_cv$lambda.min,
  # Our data
  newx = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix()
)
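For instance, a rough sketch of checking fit with these predictions (in-sample, so it understates test error):

# In-sample RMSE of the final ridge fit
x_mat = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix()
y_hat = predict(final_ridge, type = "response", s = ridge_cv$lambda.min, newx = x_mat)
sqrt(mean((credit_clean$balance - y_hat)^2))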
26 / 48

Ridge regression

Shrinking

While ridge regression shrinks coefficients close to zero, it never forces them to be equal to zero.

Drawbacks

  1. We cannot use ridge regression for subset/feature selection.
  2. We often end up with a bunch of tiny coefficients.
27 / 48

Ridge regression

Shrinking

While ridge regression shrinks coefficients close to zero, it never forces them to be equal to zero.

Drawbacks

  1. We cannot use ridge regression for subset/feature selection.
  2. We often end up with a bunch of tiny coefficients.

Q Can't we just drive the coefficients to zero?

27 / 48

Ridge regression

Shrinking

While ridge regression shrinks coefficients close to zero, it never forces them to be equal to zero.

Drawbacks

  1. We cannot use ridge regression for subset/feature selection.
  2. We often end up with a bunch of tiny coefficients.

Q Can't we just drive the coefficients to zero?
A Yes. Just not with ridge (due to \sum_j \hat{\beta}_j^2).

27 / 48

Lasso

28 / 48

Lasso

Intro

Lasso simply replaces ridge's squared coefficients with absolute values.

29 / 48

Lasso

Intro

Lasso simply replaces ridge's squared coefficients with absolute values.

Ridge regression \begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}

Lasso \begin{align} \min_{\hat{\beta}^L} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \end{align}

Everything else will be the same—except one aspect...

29 / 48

Lasso

Shrinkage

Unlike ridge, lasso's marginal penalty does not increase with the size of \beta_j.

You always pay \color{#8AA19E}{\lambda} to increase \big|\beta_j\big| by one unit.

30 / 48

Lasso

Shrinkage

Unlike ridge, lasso's marginal penalty does not increase with the size of \beta_j.

You always pay \color{#8AA19E}{\lambda} to increase \big|\beta_j\big| by one unit.

The only way to avoid lasso's penalty is to set coefficients to zero.

30 / 48

Lasso

Shrinkage

Unlike ridge, lasso's marginal penalty does not increase with the size of \beta_j.

You always pay \color{#8AA19E}{\lambda} to increase \big|\beta_j\big| by one unit.

The only way to avoid lasso's penalty is to set coefficients to zero.

This feature has two benefits

  1. Some coefficients will be set to zero—we get "sparse" models.
  2. Lasso can be used for subset/feature selection.
30 / 48

Lasso

Shrinkage

Unlike ridge, lasso's marginal penalty does not increase with the size of \beta_j.

You always pay \color{#8AA19E}{\lambda} to increase \big|\beta_j\big| by one unit.
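In symbols (contrast with ridge's 2\lambda\beta): for \beta_j \neq 0, \begin{align} \dfrac{\partial}{\partial \beta_j} \left( \lambda \big|\beta_j\big| \right) = \lambda \, \mathrm{sign}(\beta_j) \end{align} so the marginal penalty is \lambda no matter how large \big|\beta_j\big| already is.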

The only way to avoid lasso's penalty is to set coefficients to zero.

This feature has two benefits

  1. Some coefficients will be set to zero—we get "sparse" models.
  2. Lasso can be used for subset/feature selection.

We will still need to carefully select \color{#8AA19E}{\lambda}.

30 / 48

Lasso

Example

We can also use glmnet() for lasso.

Recall The key arguments for glmnet() are

  • x a matrix of predictors
  • y outcome variable as a vector
  • standardize (T or F)
  • alpha elasticnet parameter
    • alpha=0 gives ridge
    • alpha=1 gives lasso
  • lambda tuning parameter (sequence of numbers)
  • nlambda alternatively, R picks a sequence of values for \lambda
31 / 48

Lasso

Example

Again, we define a decreasing sequence for \lambda, and we're set.

# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit lasso regression
est_lasso = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = F,
  alpha = 1,
  lambda = lambdas
)

The glmnet output (est_lasso here) contains estimated coefficients for each value of \lambda. You can use predict() to get coefficients for additional values of \lambda.

32 / 48

Lasso coefficients for \lambda between 0.01 and 100,000

33 / 48

Compare lasso's tendency to force coefficients to zero with our previous ridge-regression results.

34 / 48

Ridge regression coefficients for \lambda between 0.01 and 100,000

35 / 48

Lasso

Example

We can also cross validate \lambda with cv.glmnet().

# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv = cv.glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  alpha = 1,
  standardize = F,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
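Because lasso zeroes out coefficients, here is a quick sketch of checking which ones survive at the CV-chosen \lambda (lambda.min comes straight from the cv.glmnet object):

# Coefficients at the CV-minimizing λ; count how many are nonzero
lasso_coefs = coef(lasso_cv, s = "lambda.min")
sum(lasso_coefs != 0)  # number of nonzero coefficients (includes the intercept)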
36 / 48

Cross-validated RMSE and \lambda: Which \color{#8AA19E}{\lambda} minimizes CV RMSE?

37 / 48

Again, you will often have a minimum farther away from the extremes...

38 / 48

Cross-validated RMSE and \lambda: Which \color{#8AA19E}{\lambda} minimizes CV RMSE?

39 / 48

So which shrinkage method should you choose?

40 / 48

Ridge or lasso?

Ridge regression

+ shrinks \hat{\beta}_j near 0
- many small \hat\beta_j
- doesn't work for selection
- difficult to interpret output
+ better when all \beta_j\neq 0

Best: p is large & \beta_j\approx\beta_k

Lasso

+ shrinks \hat{\beta}_j to 0
+ many \hat\beta_j= 0
+ great for selection
+ sparse models easier to interpret
- implicitly assumes some \beta= 0

Best: p is large & many \beta_j\approx 0

41 / 48

Ridge or lasso?

Ridge regression

+ shrinks \hat{\beta}_j near 0
- many small \hat\beta_j
- doesn't work for selection
- difficult to interpret output
+ better when all \beta_j\neq 0

Best: p is large & \beta_j\approx\beta_k

Lasso

+ shrinks \hat{\beta}_j to 0
+ many \hat\beta_j= 0
+ great for selection
+ sparse models easier to interpret
- implicitly assumes some \beta= 0

Best: p is large & many \beta_j\approx 0

[N]either ridge... nor the lasso will universally dominate the other.

ISL, p. 224

41 / 48

Ridge and lasso

Why not both?

Elasticnet combines ridge regression and lasso.

42 / 48

Ridge and lasso

Why not both?

Elasticnet combines ridge regression and lasso.

\begin{align} \min_{\beta^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \end{align}

We now have two tuning parameters: \lambda (penalty) and \color{#181485}{\alpha} (mixture).

42 / 48

Ridge and lasso

Why not both?

Elasticnet combines ridge regression and lasso.

\begin{align} \min_{\beta^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \end{align}

We now have two tuning parameters: \lambda (penalty) and \color{#181485}{\alpha} (mixture).

Remember the alpha argument in glmnet()?

  • \color{#e64173}{\alpha = 0} specifies ridge
  • \color{#8AA19E}{\alpha=1} specifies lasso
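Anything strictly between 0 and 1 mixes the two penalties. A minimal sketch of fitting elasticnet directly with glmnet() (alpha = 0.5 is an arbitrary 50/50 mix; lambdas and credit_clean as defined earlier):

# Elasticnet via glmnet: an α strictly between 0 and 1 (0.5 chosen purely for illustration)
est_net = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = F,
  alpha = 0.5,
  lambda = lambdas
)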
42 / 48

Ridge and lasso

Why not both?

We can use tune() from tidymodels to cross validate both \alpha and \lambda.

Note You need to consider all combinations of the two parameters.
This combination can create a lot of models to estimate.

For example,

  • 1,000 values of \lambda
  • 1,000 values of \alpha

leaves you with 1,000,000 models to estimate.

5,000,000 if you are doing 5-fold CV!

43 / 48

Cross validating elasticnet in tidymodels

# Our range of λ and α
lambdas = 10^seq(from = 5, to = -2, length = 1e2)
alphas = seq(from = 0, to = 1, by = 0.1)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the elasticnet model
model_net = linear_reg(
  penalty = tune(), mixture = tune()
) %>% set_engine("glmnet")
# Define our workflow
workflow_net = workflow() %>%
  add_model(model_net) %>%
  add_recipe(credit_recipe)
# CV elasticnet with our range of lambdas
cv_net = workflow_net %>%
  tune_grid(
    credit_cv,
    grid = expand_grid(mixture = alphas, penalty = lambdas),
    metrics = metric_set(rmse)
  )
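As with ridge, you can then peek at the top (\lambda, \alpha) combinations (a quick sketch):

# Show the best-performing penalty/mixture combinations by CV RMSE
cv_net %>% show_best(metric = "rmse")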
44 / 48

Cross validating elasticnet in tidymodels with grid_regular()

# Our range of λ and α
lambdas = 10^seq(from = 5, to = -2, length = 1e2)
alphas = seq(from = 0, to = 1, by = 0.1)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the elasticnet model
model_net = linear_reg(
  penalty = tune(), mixture = tune()
) %>% set_engine("glmnet")
# Define our workflow
workflow_net = workflow() %>%
  add_model(model_net) %>%
  add_recipe(credit_recipe)
# CV elasticnet with our range of lambdas
cv_net = workflow_net %>%
  tune_grid(
    credit_cv,
    grid = grid_regular(mixture(), penalty(), levels = 100:100),
    metrics = metric_set(rmse)
  )
45 / 48

In case you are curious: The best model had \lambda\approx 0.628 and \alpha\approx 0.737.

CV estimates suggest that elasticnet actually reduced RMSE from ridge's 118 to 101.

46 / 48

Sources

These notes draw upon An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani.

47 / 48

Table of contents

(The) lasso

Ridge or lasso

Other

48 / 48
