Load packages

Loading the packages for our regularized regression analysis. The tidyverse provides data wrangling and visualization functions, tidymodels offers a unified framework for modeling, plsmod enables partial least squares regression, and here builds file paths relative to the project root.

library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for data modeling
library(plsmod)      # for estimating pls regression
library(here)        # for referencing files and folders

The following code installs BiocManager and then uses it to install the mixOmics package, which will be required later for PLS regression. BiocManager provides access to Bioconductor packages, and mixOmics is distributed through Bioconductor rather than CRAN.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("mixOmics")

Setting a seed ensures reproducibility of our results. This is crucial for random processes like data splitting and cross-validation.

set.seed(1203)

Read data

Loading the browser dataset, which records users' visits to various websites along with their (log) spending. We use the here() function for path-independent file referencing.

browser <- here("05-regression-regularization/data", "browser-sample.csv") %>% 
  read_csv()
## Rows: 1200 Columns: 201
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (201): log_spend, americansingles.com, opm.gov, netzerovoice.com, mobile...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
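
A quick, purely exploratory look at the outcome and the first few predictors (not required for anything that follows):

browser %>% 
  select(1:5) %>% 
  glimpse()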

Sample splitting

Dividing our data into training (75%) and test (25%) sets. The training set is further split into 5 cross-validation folds so that model hyperparameters can be tuned without ever touching the test set.

browser_split <- browser %>% 
  initial_split(prop = 0.75)

browser_train <- training(browser_split)
browser_test  <- testing(browser_split)

browser_folds <- browser_train %>% 
  vfold_cv(v = 5)
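
As a sanity check, we can count each fold's analysis and assessment rows; with v = 5, each assessment set should hold roughly a fifth of the training data:

browser_folds %>% 
  mutate(
    n_analysis   = map_int(splits, ~ nrow(analysis(.x))),
    n_assessment = map_int(splits, ~ nrow(assessment(.x)))
  )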

Lasso or ridge

Specify model recipe and workflow

Set mixture = 0 for ridge and mixture = 1 for lasso; the penalty argument is \(\lambda\). The nlambda argument is passed through to glmnet and sets how many points along the regularization path are fit. We’ll go with ridge:

lm_spec <- linear_reg() %>% 
  set_args(penalty = tune(), mixture = 0, nlambda = 10) %>% 
  set_engine("glmnet") %>% 
  set_mode("regression")
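
For reference, the corresponding lasso specification would differ only in the mixture argument (shown as a sketch; we do not fit it here):

lasso_spec <- linear_reg() %>% 
  set_args(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet") %>% 
  set_mode("regression")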

Creating a recipe that describes our preprocessing steps. We first remove zero-variance predictors, which carry no information (and would break the scaling step), then center and scale the remaining predictors (standardization).

browser_rec <- recipe(log_spend ~ ., data = browser_train) %>% 
  step_zv(all_predictors()) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())
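
To inspect what the preprocessing produces, the recipe can be prepped and baked on the training set; this is for exploration only, since the workflow below applies these steps automatically:

browser_rec %>% 
  prep(training = browser_train) %>% 
  bake(new_data = NULL) %>%   # returns the preprocessed training data
  select(1:5) %>% 
  glimpse()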

Combining our model specification and preprocessing recipe into a workflow, which provides a unified interface for model fitting and prediction.

lm_wfl <- workflow() %>% 
  add_recipe(browser_rec) %>% 
  add_model(lm_spec)

Tune hyperparameter

Performing cross-validation to tune the penalty parameter \(\lambda\). This evaluates each candidate \(\lambda\) value across all folds to find the optimal regularization strength.

lm_results <- lm_wfl %>% 
  tune_grid(
    resamples = browser_folds
  )
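
With no grid supplied, tune_grid() falls back on a default space-filling grid of 10 candidate penalty values. An explicit grid could be passed via the grid argument instead; a hypothetical example:

lambda_grid <- grid_regular(penalty(range = c(-5, 1)), levels = 20)  # range is on the log10 scale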

Displaying the top-performing \(\lambda\) values ranked by root mean squared error (RMSE). Lower RMSE indicates better predictive performance.

lm_results %>% 
  show_best(metric = "rmse")
## # A tibble: 5 x 7
##    penalty .metric .estimator  mean     n std_err .config              
##      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1 9.29e- 1 rmse    standard    1.86     5  0.0589 Preprocessor1_Model10
## 2 8.86e- 2 rmse    standard    2.22     5  0.102  Preprocessor1_Model09
## 3 1.37e-10 rmse    standard    2.31     5  0.109  Preprocessor1_Model01
## 4 2.35e- 9 rmse    standard    2.31     5  0.109  Preprocessor1_Model02
## 5 5.45e- 8 rmse    standard    2.31     5  0.109  Preprocessor1_Model03
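
The full tuning profile can also be visualized; autoplot() plots the resampled metrics against the candidate penalty values:

autoplot(lm_results)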

Set best lambda

Selecting the optimal \(\lambda\) using two common approaches: minimizing RMSE directly, or the one-standard-error rule, which picks the most parsimonious model whose mean RMSE falls within one standard error of the minimum. For ridge, more regularization means a simpler model, so the candidates are sorted by descending penalty.

lambda_min <- lm_results %>% 
  select_best(metric = "rmse")

lambda_1se <- lm_results %>% 
  select_by_one_std_err(
    metric = "rmse",
    desc(penalty)
  ) %>% 
  select(penalty)
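
To make the one-standard-error rule concrete, here is a rough hand-rolled equivalent (select_by_one_std_err() remains the canonical tool): keep every candidate whose mean RMSE lies within one standard error of the best, then take the most regularized of those.

lm_results %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  mutate(threshold = min(mean) + std_err[which.min(mean)]) %>% 
  filter(mean <= threshold) %>% 
  slice_max(penalty, n = 1)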

Evaluate model on the test set

Finalizing the workflow by updating it with our chosen \(\lambda\) value (using the one-standard-error rule). This locks in the tuning parameter for final model fitting.

lm_wfl_final <- lm_wfl %>%
  finalize_workflow(lambda_1se)

Fitting the finalized model to the entire training set and evaluating it on the previously unseen test set to get an unbiased estimate of performance.

browser_test_results <- lm_wfl_final %>% 
  last_fit(split = browser_split)

Extracting and labeling the ridge regression performance metrics for later comparison with other models.

ridge_results <- browser_test_results %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  mutate(method = "ridge")
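
Optionally, the fitted ridge coefficients at the chosen \(\lambda\) can be pulled from the last_fit() object for inspection (a sketch):

browser_test_results %>% 
  extract_workflow() %>%        # the fitted workflow from last_fit()
  extract_fit_parsnip() %>%     # the underlying glmnet fit
  tidy() %>%                    # coefficients at the chosen penalty
  arrange(desc(abs(estimate)))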

PCR

Specify model recipe and workflow

Defining a standard linear regression model for PCR. Unlike ridge/lasso, the regularization in PCR comes from dimensionality reduction, not from penalizing coefficients.

lm_spec <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

Creating a recipe that includes principal component analysis (PCA) transformation. The number of principal components to retain is a tuning parameter that controls model complexity.

browser_rec <- recipe(log_spend ~ ., data = browser_train) %>% 
  step_zv(all_predictors()) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = tune())
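
A recipe containing num_comp = tune() cannot be prepped directly, but plugging in a fixed value lets us peek at how much variance the leading components capture. A sketch with an arbitrary choice of 5 components:

browser_rec %>% 
  finalize_recipe(tibble(num_comp = 5)) %>% 
  prep(training = browser_train) %>% 
  tidy(number = 4, type = "variance")  # step_pca is the 4th step in this recipe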

Combining the PCR preprocessing recipe and model specification into a unified workflow.

lm_wfl <- workflow() %>% 
  add_recipe(browser_rec) %>% 
  add_model(lm_spec)

Tune hyperparameter

Creating a grid of potential values for the number of principal components to evaluate. We’ll consider using 1 to 10 components.

num_comp_grid <- expand_grid(num_comp = 1:10)

Performing cross-validation to determine the optimal number of principal components that minimizes prediction error.

lm_results <- lm_wfl %>% 
  tune_grid(
    resamples = browser_folds,
    grid = num_comp_grid
  )

Displaying the top-performing configurations ranked by RMSE to see how model performance varies with the number of components.

lm_results %>% 
  show_best(metric = "rmse")
## # A tibble: 5 x 7
##   num_comp .metric .estimator  mean     n std_err .config              
##      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1        3 rmse    standard    1.65     5  0.0233 Preprocessor03_Model1
## 2        1 rmse    standard    1.65     5  0.0221 Preprocessor01_Model1
## 3        2 rmse    standard    1.65     5  0.0224 Preprocessor02_Model1
## 4        4 rmse    standard    1.65     5  0.0223 Preprocessor04_Model1
## 5        5 rmse    standard    1.65     5  0.0220 Preprocessor05_Model1

Set best number of components

Selecting the optimal number of components using two approaches: minimizing RMSE directly, or using the one-standard-error rule, which chooses the simplest model within one standard error of the minimum. Here the simplest model is the one with the fewest components, so the candidates are sorted by ascending num_comp (in contrast to ridge, where a larger penalty is simpler).

num_comp_min <- lm_results %>% 
  select_best(metric = "rmse")

num_comp_1se <- lm_results %>% 
  select_by_one_std_err(
    metric = "rmse",
    num_comp
  ) %>% 
  select(num_comp)

Evaluate model on the test set

Finalizing the PCR workflow with the selected number of principal components based on the one-standard-error rule.

lm_wfl_final <- lm_wfl %>%
  finalize_workflow(num_comp_1se)

Fitting the finalized PCR model to the entire training set and evaluating it on the test set to get an unbiased performance estimate.

browser_test_results <- lm_wfl_final %>% 
  last_fit(split = browser_split)

Extracting and labeling the PCR performance metrics for later comparison with other models.

pcr_results <- browser_test_results %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  mutate(method = "pcr")

PLS

PLS regression relies on the mixOmics package. If you skipped the installation step at the top of this document, install it now; BiocManager is used because mixOmics is distributed through Bioconductor, a repository focused on tools for analyzing genomic data, rather than CRAN.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("mixOmics")

Specify model recipe and workflow

Creating a partial least squares regression model specification. Like PCR, PLS uses dimension reduction, but it constructs components that maximize covariance with the response variable rather than merely summarizing the predictors.

pls_spec <- pls() %>% 
  set_args(num_comp = tune()) %>% 
  set_engine("mixOmics") %>% 
  set_mode("regression")

Setting up a recipe for data preprocessing without PCA transformation, as PLS performs its own dimension reduction internally.

browser_rec <- recipe(log_spend ~ ., data = browser_train) %>% 
  step_zv(all_predictors()) %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())

Combining the PLS model specification and preprocessing recipe into a unified workflow.

pls_wfl <- workflow() %>% 
  add_recipe(browser_rec) %>% 
  add_model(pls_spec)

Tune hyperparameter

Creating a grid of potential values for the number of PLS components to evaluate. As with PCR, we’ll consider using 1 to 10 components.

num_comp_grid <- expand_grid(num_comp = 1:10)

Performing cross-validation to determine the optimal number of PLS components that minimizes prediction error.

pls_results <- pls_wfl %>% 
  tune_grid(
    resamples = browser_folds,
    grid = num_comp_grid
  )

Displaying the top-performing configurations to see how model performance varies with the number of PLS components.

pls_results %>% 
  show_best(metric = "rmse")
## # A tibble: 5 x 7
##   num_comp .metric .estimator  mean     n std_err .config              
##      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1        1 rmse    standard    2.06     5   0.170 Preprocessor1_Model01
## 2        2 rmse    standard    2.08     5   0.114 Preprocessor1_Model02
## 3        3 rmse    standard    2.15     5   0.121 Preprocessor1_Model03
## 4        4 rmse    standard    2.23     5   0.106 Preprocessor1_Model04
## 5        5 rmse    standard    2.29     5   0.111 Preprocessor1_Model05

Set best number of components

Selecting the optimal number of PLS components using the same two approaches as before: minimizing RMSE directly, or using the one-standard-error rule.

num_comp_min <- pls_results %>% 
  select_best(metric = "rmse")

num_comp_1se <- pls_results %>% 
  select_by_one_std_err(
    metric = "rmse",
    num_comp
  ) %>% 
  select(num_comp)

Evaluate model on the test set

Finalizing the PLS workflow with the selected number of components based on the one-standard-error rule.

pls_wfl_final <- pls_wfl %>%
  finalize_workflow(num_comp_1se)

Fitting the finalized PLS model to the entire training set and evaluating it on the test set.

browser_test_results <- pls_wfl_final %>% 
  last_fit(split = browser_split)

Extracting and labeling the PLS performance metrics for later comparison.

pls_results <- browser_test_results %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  mutate(method = "pls")

Summary

Combining and comparing the test set RMSE results from all three regularization methods to determine which approach performs best for this dataset.

bind_rows(
  ridge_results,
  pcr_results,
  pls_results
) %>% 
  arrange(.estimate)
## # A tibble: 3 x 5
##   .metric .estimator .estimate .config              method
##   <chr>   <chr>          <dbl> <chr>                <chr> 
## 1 rmse    standard        2.15 Preprocessor1_Model1 pcr   
## 2 rmse    standard        4.89 Preprocessor1_Model1 ridge 
## 3 rmse    standard        5.36 Preprocessor1_Model1 pls
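
A simple chart makes the same comparison easy to read (optional):

bind_rows(ridge_results, pcr_results, pls_results) %>% 
  ggplot(aes(x = reorder(method, .estimate), y = .estimate)) +
  geom_col() +
  labs(x = NULL, y = "test-set RMSE")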