Last time
Today
tidymodels
Readings
Problem sets: Soon!
Recap: Subset-selection methods (last time)
These methods assume we need to choose a model before we fit it...
Alternative approach: Shrinkage methods
† Synonyms for shrink: constrain or regularize
Idea: Penalize the model for coefficients as they move away from zero.
Q How could shrinking coefficients toward zero help our predictions?
A Remember we're generally facing a tradeoff between bias and variance.
† Imagine the extreme case: a model whose coefficients are all zeros has no variance.
Now you understand shrinkage methods.
Recall Least-squares regression gets \hat{\beta}_j's by minimizing RSS, i.e., \begin{align} \min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2 \end{align}
Ridge regression makes a small change
Ridge regression \begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}
Least squares \begin{align} \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 \end{align}
\color{#e64173}{\lambda}\enspace (\geq0) is a tuning parameter for the harshness of the penalty.
\color{#e64173}{\lambda} = 0 implies no penalty: we are back to least squares.
Each value of \color{#e64173}{\lambda} produces a new set of coefficients.
Ridge's approach to the bias-variance tradeoff: Balance RSS against the penalty on coefficient size.
\color{#e64173}{\lambda} determines how much ridge "cares about" these two quantities.†
† With \lambda=0, least-squares regression only "cares about" RSS.
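To make the objective concrete, here is a minimal sketch (toy simulated data and hand-picked candidate coefficients, not from these notes) of the quantity ridge minimizes for a given \lambda:

# Sketch: the ridge objective (RSS + penalty) evaluated at candidate coefficients
ridge_objective = function(b, X, y, lambda) {
  resid = y - X %*% b                  # residuals at candidate coefficients b
  sum(resid^2) + lambda * sum(b^2)     # RSS + lambda * sum of squared coefficients
}
set.seed(1)
X = matrix(rnorm(100 * 3), ncol = 3)
y = X %*% c(2, -1, 0) + rnorm(100)
ridge_objective(b = c(2, -1, 0), X = X, y = y, lambda = 10)
ridge_objective(b = c(0, 0, 0),  X = X, y = y, lambda = 10)  # all-zero coefficients: only RSS remains

Larger \lambda shifts the balance toward the penalty term, so the minimizing coefficients shrink toward zero.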
Choosing a good value for \lambda is key.
Q So what do we do?
A Cross validate!
(You saw that coming, right?)
Note Because we sum the squared coefficients, we penalize increases to big coefficients much more than increases to small coefficients.
Example For a value of \beta, we pay a penalty of 2 \lambda \beta for a small increase.†
† This quantity comes from taking the derivative of \lambda \beta^2 with respect to \beta.
Now you see why we call it shrinkage: it encourages small coefficients.
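Writing out the footnote's derivative explicitly (an added step, not in the original slide):
\begin{align} \frac{\partial}{\partial \beta_j} \left( \lambda \beta_j^2 \right) = 2 \lambda \beta_j \end{align}
So the marginal cost of making a coefficient bigger grows with the coefficient itself, which is exactly why big coefficients get shrunk hardest.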
Important Predictors' units can drastically affect ridge regression results.
Why? Because \mathbf{x}_j's units affect \beta_j, and ridge is very sensitive to \beta_j.
Example Let x_1 denote distance.
Least-squares regression
If x_1 is meters and \beta_1 = 3, then when x_1 is km, \beta_1 = 3,000.
The scale/units of predictors do not affect least squares' estimates.
Ridge regression pays a much larger penalty for \beta_1=3,000 than \beta_1=3.
You will not get the same (scaled) estimates when you change units.
Solution Standardize your variables, i.e., x_stnd = (x - mean(x))/sd(x), e.g., with recipes::step_normalize().
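To see the units issue in action, here is a minimal simulated sketch (not the credit data; dist_m, dist_km, and x2 are made-up variables, and it assumes the glmnet package):

# Sketch: OLS coefficients rescale exactly with the units; unstandardized ridge does not
library(glmnet)
set.seed(1)
n = 200
dist_m  = runif(n, 0, 1000)                 # distance in meters
x2      = rnorm(n)                          # a second predictor
y       = 3 * dist_m + 5 * x2 + rnorm(n, sd = 50)
dist_km = dist_m / 1000                     # the same distance in kilometers
# Least squares: the km coefficient is exactly 1,000 times the meters coefficient
coef(lm(y ~ dist_m + x2))["dist_m"] * 1000
coef(lm(y ~ dist_km + x2))["dist_km"]
# Ridge with a fixed lambda and standardize = FALSE: the km coefficient gets shrunk much harder
coef(glmnet(cbind(dist_m, x2),  y, alpha = 0, lambda = 10, standardize = FALSE))["dist_m", ]
coef(glmnet(cbind(dist_km, x2), y, alpha = 0, lambda = 10, standardize = FALSE))["dist_km", ]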
Let's return to the credit dataset—and pre-processing with tidymodels.
Recall We have 11 predictors and a numeric outcome balance.
We can standardize our predictors using step_normalize() from recipes:
# Load packages (needed for the pipeline below)
library(tidyverse)
library(tidymodels)
library(janitor)
# Load the credit dataset
credit_df = ISLR::Credit %>% clean_names()
# Processing recipe: Define ID, standardize, create dummies, rename (lowercase)
credit_recipe = credit_df %>%
  recipe(balance ~ .) %>%
  update_role(id, new_role = "id variable") %>%
  step_normalize(all_predictors() & all_numeric()) %>%
  step_dummy(all_predictors() & all_nominal()) %>%
  step_rename_at(everything(), fn = str_to_lower)
# Time to juice
credit_clean = credit_recipe %>% prep() %>% juice()
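A quick sanity check you might run (a sketch; income and age are assumed to be two of the processed numeric predictors): the standardized predictors should now have mean roughly 0 and standard deviation 1.

# Sketch: check that standardization worked
credit_clean %>%
  summarize(across(c(income, age), list(mean = mean, sd = sd)))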
For ridge regression† in R, we will use glmnet() from the glmnet package.
† And lasso!
The key arguments for glmnet() are
x: a matrix of predictors
y: the outcome variable as a vector
standardize: (T or F)
alpha: the elasticnet parameter; alpha = 0 gives ridge, alpha = 1 gives lasso
lambda: the tuning parameter (a sequence of numbers)
nlambda: alternatively, R picks a sequence of values for \lambda
We just need to define a decreasing sequence for \lambda, and then we're set.
# Load glmnet
library(glmnet)
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit ridge regression
est_ridge = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = F,
  alpha = 0,
  lambda = lambdas
)
The glmnet output (est_ridge here) contains the estimated coefficients for each value of \lambda. You can use predict() to get coefficients for additional values of \lambda.
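For instance (a sketch, assuming est_ridge from above), you can pull the coefficients at a \lambda that was not in the original grid:

# Sketch: coefficients at lambda = 25, interpolated by glmnet
predict(est_ridge, type = "coefficients", s = 25)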
Ridge regression coefficients for \lambda between 0.01 and 100,000
glmnet also provides a convenient cross-validation function: cv.glmnet().
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
ridge_cv = cv.glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  alpha = 0,
  standardize = F,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
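Once cv.glmnet() finishes, the chosen penalties live on the fitted object (a sketch, assuming ridge_cv from above):

# Sketch: the lambda minimizing CV error, plus the more conservative 1-standard-error choice
ridge_cv$lambda.min
ridge_cv$lambda.1se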
Cross-validated RMSE and \lambda: Which \color{#e64173}{\lambda} minimizes CV RMSE?
Often, the minimum will sit more obviously away from the extremes of your \lambda range.
Recall: the bias-variance tradeoff.
tidymodels
tidymodels can also cross validate (and fit) ridge regression.
Back to our linear_reg() model 'specification'.
The penalty \lambda (what we want to tune) is penalty instead of lambda.
Set mixture = 0 inside linear_reg() (same as alpha = 0, above).
Use the glmnet engine.
# Define the model
model_ridge = linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet")
Example of ridge regression with tidymodels
# Our range of lambdas
lambdas = 10^seq(from = 5, to = -2, length = 1e3)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the model
model_ridge = linear_reg(penalty = tune(), mixture = 0) %>% set_engine("glmnet")
# Define our ridge workflow
workflow_ridge = workflow() %>% add_model(model_ridge) %>% add_recipe(credit_recipe)
# CV with our range of lambdas
cv_ridge = workflow_ridge %>% tune_grid(
  credit_cv,
  grid = data.frame(penalty = lambdas),
  metrics = metric_set(rmse)
)
# Show the best models
cv_ridge %>% show_best()
With tidymodels...
Next steps: Finalize your workflow and fit your last model.
Recall: finalize_workflow(), last_fit(), and collect_predictions().
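A minimal sketch of those finishing steps (assuming cv_ridge and workflow_ridge from above; credit_split is a hypothetical rsample::initial_split() of the credit data that we have not actually created in these notes):

# Sketch: finalize the workflow with the best penalty, then fit/predict via last_fit()
final_wf = workflow_ridge %>%
  finalize_workflow(select_best(cv_ridge, metric = "rmse"))
final_fit = final_wf %>% last_fit(credit_split)    # credit_split: hypothetical initial_split()
final_fit %>% collect_predictions() %>% head()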
Otherwise: Once you find your \lambda via cross validation,
1. Fit your model on the full dataset using the optimal \lambda
# Fit final model
final_ridge = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = T,
  alpha = 0,
  lambda = ridge_cv$lambda.min
)
2. Make predictions
predict(
  final_ridge,
  type = "response",
  # Our chosen lambda
  s = ridge_cv$lambda.min,
  # Our data
  newx = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix()
)
While ridge regression shrinks coefficients close to zero, it never forces them to be equal to zero.
Drawbacks
Q Can't we just drive the coefficients to zero?
A Yes. Just not with ridge (due to \sum_j \hat{\beta}_j^2).
Lasso simply replaces ridge's squared coefficients with absolute values.
Ridge regression \begin{align} \min_{\hat{\beta}^R} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \end{align}
Lasso \begin{align} \min_{\hat{\beta}^L} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \end{align}
Everything else will be the same—except one aspect...
Unlike ridge, lasso's marginal penalty does not increase with the size of \beta_j.
You always pay \color{#8AA19E}{\lambda} to increase \big|\beta_j\big| by one unit.
The only way to avoid lasso's penalty is to set coefficients to zero.
This feature has two benefits: it can perform variable selection, and it tends to produce sparse, easier-to-interpret models.
We will still need to carefully select \color{#8AA19E}{\lambda}.
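Writing the two marginal penalties side by side (an added step, for \beta_j \neq 0):
\begin{align} \frac{\partial}{\partial \beta_j} \left( \color{#e64173}{\lambda \beta_j^2} \right) = 2 \lambda \beta_j \qquad\qquad \frac{\partial}{\partial \beta_j} \left( \color{#8AA19E}{\lambda \big|\beta_j\big|} \right) = \lambda \cdot \mathrm{sign}(\beta_j) \end{align}
Ridge's marginal penalty fades as \beta_j approaches zero, so it never pays to go all the way; lasso's stays at \lambda, which is what pushes small coefficients exactly to zero.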
We can also use glmnet() for lasso.
Recall The key arguments for glmnet() are
x: a matrix of predictors
y: the outcome variable as a vector
standardize: (T or F)
alpha: the elasticnet parameter; alpha = 0 gives ridge, alpha = 1 gives lasso
lambda: the tuning parameter (a sequence of numbers)
nlambda: alternatively, R picks a sequence of values for \lambda
Again, we define a decreasing sequence for \lambda, and we're set.
# Define our range of lambdas (glmnet wants decreasing range)
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Fit lasso regression
est_lasso = glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  standardize = F,
  alpha = 1,
  lambda = lambdas
)
The glmnet output (est_lasso here) contains the estimated coefficients for each value of \lambda. You can use predict() to get coefficients for additional values of \lambda.
Lasso coefficients for \lambda between 0.01 and 100,000
Compare lasso's tendency to force coefficients to zero with our previous ridge-regression results.
Ridge regression coefficients for \lambda between 0.01 and 100,000
We can also cross validate \lambda with cv.glmnet().
# Define our lambdas
lambdas = 10^seq(from = 5, to = -2, length = 100)
# Cross validation
lasso_cv = cv.glmnet(
  x = credit_clean %>% dplyr::select(-balance, -id) %>% as.matrix(),
  y = credit_clean$balance,
  alpha = 1,
  standardize = F,
  lambda = lambdas,
  # New: How we make decisions and number of folds
  type.measure = "mse",
  nfolds = 5
)
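A sketch (assuming lasso_cv from above) of checking which coefficients lasso zeroes out at the CV-chosen penalty:

# Sketch: coefficients at the CV-minimizing lambda; dots in the sparse printout are exact zeros
coef(lasso_cv, s = "lambda.min")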
Cross-validated RMSE and \lambda: Which \color{#8AA19E}{\lambda} minimizes CV RMSE?
Again, you will have a minimum farther away from your extremes...
So which shrinkage method should you choose?
Ridge regression
+ shrinks \hat{\beta}_j near 0
- many small \hat\beta_j
- doesn't work for selection
- difficult to interpret output
+ better when all \beta_j\neq 0
Best: p is large & \beta_j\approx\beta_k
Lasso
+ shrinks \hat{\beta}_j to 0
+ many \hat\beta_j= 0
+ great for selection
+ sparse models easier to interpret
- implicitly assumes some \beta= 0
Best: p is large & many \beta_j\approx 0
[N]either ridge... nor the lasso will universally dominate the other.
ISL, p. 224
Elasticnet combines ridge regression and lasso.
\begin{align} \min_{\beta^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \end{align}
We now have two tuning parameters: \lambda (penalty) and \color{#181485}{\alpha} (mixture).
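As a quick check on the formula (an added note): the two extremes of the mixture recover the penalties we already know.
\begin{align} \color{#181485}{\alpha} = 0 \;\Rightarrow\; \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} \;\; (\text{ridge}) \qquad\qquad \color{#181485}{\alpha} = 1 \;\Rightarrow\; \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|} \;\; (\text{lasso}) \end{align}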
Remember the alpha argument in glmnet()?
We can use tune() from tidymodels to cross validate both \alpha and \lambda.
Note You need to consider all combinations of the two parameters.
This combination can create a lot of models to estimate.
For example, a grid of 1,000 values for \lambda and 1,000 values for \alpha leaves you with 1,000,000 models to estimate.†
† 5,000,000 if you are doing 5-fold CV!
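The arithmetic behind that example, as a sketch (the two 1,000-value grids are illustrative choices):

# Sketch: a full cartesian grid over 1,000 penalties and 1,000 mixtures has 1,000,000 rows
library(tidyr)
nrow(expand_grid(
  penalty = 10^seq(from = 5, to = -2, length = 1e3),
  mixture = seq(from = 0, to = 1, length = 1e3)
))  # 1,000,000 candidate models (times 5 when doing 5-fold CV)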
Cross validating elasticnet in tidymodels
# Our range of λ and α
lambdas = 10^seq(from = 5, to = -2, length = 1e2)
alphas = seq(from = 0, to = 1, by = 0.1)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the elasticnet model
model_net = linear_reg(penalty = tune(), mixture = tune()) %>% set_engine("glmnet")
# Define our workflow
workflow_net = workflow() %>% add_model(model_net) %>% add_recipe(credit_recipe)
# CV elasticnet with our range of lambdas
cv_net = workflow_net %>% tune_grid(
  credit_cv,
  grid = expand_grid(mixture = alphas, penalty = lambdas),
  metrics = metric_set(rmse)
)
Cross validating elasticnet in tidymodels with grid_regular()
# Our range of λ and α
lambdas = 10^seq(from = 5, to = -2, length = 1e2)
alphas = seq(from = 0, to = 1, by = 0.1)
# Define the 5-fold split
set.seed(12345)
credit_cv = credit_df %>% vfold_cv(v = 5)
# Define the elasticnet model
model_net = linear_reg(penalty = tune(), mixture = tune()) %>% set_engine("glmnet")
# Define our workflow
workflow_net = workflow() %>% add_model(model_net) %>% add_recipe(credit_recipe)
# CV elasticnet with grid_regular() choosing the values
cv_net = workflow_net %>% tune_grid(
  credit_cv,
  grid = grid_regular(mixture(), penalty(), levels = 100:100),
  metrics = metric_set(rmse)
)
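After tuning (a sketch, assuming cv_net from above), you can inspect the best penalty/mixture pairs:

# Sketch: best combinations of penalty and mixture by CV RMSE
cv_net %>% show_best(metric = "rmse")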
In case you are curious: The best model had \lambda\approx 0.628 and \alpha\approx 0.737.
CV estimates suggest elasticnet actually reduced RMSE from ridge's 118 to 101.
These notes draw upon An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani.