Abstract
This document demonstrates a tidymodels workflow for ridge, lasso, principal component regression (PCR), and partial least squares (PLS) regression. We will use the browser dataset to estimate the regularization path, tune the hyperparameter \(\lambda\) for ridge and lasso and the number of components for PCR and PLS, and evaluate the models on the test set.
Loading essential packages for our regularized regression analysis. The tidyverse provides data manipulation functions, tidymodels offers a unified framework for modeling, and plsmod enables partial least squares regression.
library(tidyverse) # for data wrangling and visualization
library(tidymodels) # for data modeling
library(plsmod) # for estimating pls regression
library(here) # for referencing files and folders
The following code installs BiocManager and the mixOmics package, which will be required later for PLS regression. The BiocManager package provides access to Bioconductor packages.
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("mixOmics")
Setting a seed ensures reproducibility of our results. This is crucial for random processes like data splitting and cross-validation.
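A minimal sketch of this step; the seed value itself is arbitrary and only illustrative:
set.seed(1203) # any fixed value works; 1203 is a placeholder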
Loading the browser dataset, which contains information about user behavior and spending patterns. We use the here() function for path-independent file referencing.
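A sketch of the loading step, assuming the CSV sits in a data folder at the project root; the folder and file names are placeholders:
browser <- read_csv(here("data", "browser.csv")) # placeholder path and object name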
## Rows: 1200 Columns: 201
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (201): log_spend, americansingles.com, opm.gov, netzerovoice.com, mobile...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Dividing our data into training (75%) and test (25%) sets. The training set is further split into 5 cross-validation folds to properly tune model hyperparameters without overfitting.
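A sketch of the split, assuming the data frame from the loading sketch is named browser; browser_train is the name the recipes below expect, and the remaining object names are placeholders reused in later sketches:
browser_split <- initial_split(browser, prop = 0.75) # 75% train / 25% test
browser_train <- training(browser_split)
browser_test  <- testing(browser_split)

cv_folds <- vfold_cv(browser_train, v = 5) # 5 cross-validation folds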
Set mixture = 0 for ridge and mixture = 1 for lasso; the penalty argument is \(\lambda\). We'll go with ridge:
lm_spec <- linear_reg() %>%
  set_args(penalty = tune(), mixture = 0, nlambda = 10) %>%
  set_engine("glmnet") %>%
  set_mode("regression")
Creating a recipe that describes our preprocessing steps. We center and scale all predictors (standardization) and remove zero-variance predictors that provide no information.
browser_rec <- recipe(log_spend ~ ., data = browser_train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_zv(all_predictors())
Combining our model specification and preprocessing recipe into a workflow, which provides a unified interface for model fitting and prediction.
Performing cross-validation to tune the penalty parameter \(\lambda\). This evaluates each candidate \(\lambda\) value across all folds to find the optimal regularization strength.
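A sketch covering both steps, reusing cv_folds from the splitting sketch; the workflow and tuning-result names are placeholders, and the grid of 10 candidate \(\lambda\) values matches the ten model configurations shown below, although the exact grid used originally may differ:
lm_wf <- workflow() %>%
  add_recipe(browser_rec) %>%
  add_model(lm_spec)

lm_tune <- lm_wf %>%
  tune_grid(
    resamples = cv_folds, # 5-fold CV on the training set
    grid = 10             # 10 candidate penalty values
  )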
Displaying the top-performing \(\lambda\) values ranked by root mean squared error (RMSE). Lower RMSE indicates better predictive performance.
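For example, using the tuning results from the sketch above:
lm_tune %>% show_best(metric = "rmse", n = 5)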
## # A tibble: 5 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9.29e- 1 rmse standard 1.86 5 0.0589 Preprocessor1_Model10
## 2 8.86e- 2 rmse standard 2.22 5 0.102 Preprocessor1_Model09
## 3 1.37e-10 rmse standard 2.31 5 0.109 Preprocessor1_Model01
## 4 2.35e- 9 rmse standard 2.31 5 0.109 Preprocessor1_Model02
## 5 5.45e- 8 rmse standard 2.31 5 0.109 Preprocessor1_Model03
Selecting the optimal \(\lambda\) using two common approaches: minimizing RMSE directly, or using the one-standard-error rule which chooses the most parsimonious model within one standard error of the minimum RMSE.
Finalizing the workflow by updating it with our chosen \(\lambda\) value (using the one-standard-error rule). This locks in the tuning parameter for final model fitting.
Fitting the finalized model to the entire training set and evaluating it on the previously unseen test set to get an unbiased estimate of performance.
Extracting and labeling the ridge regression performance metrics for later comparison with other models.
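A sketch of the four steps just described (selection, finalization, final fit, and metric extraction); all object names are placeholders, and the one-standard-error rule treats larger penalties as simpler models:
best_lambda     <- lm_tune %>% select_best(metric = "rmse")
best_lambda_1se <- lm_tune %>% select_by_one_std_err(desc(penalty), metric = "rmse")

lm_wf_final <- lm_wf %>% finalize_workflow(best_lambda_1se)

lm_final_fit <- lm_wf_final %>% last_fit(browser_split) # fit on train, evaluate on test

ridge_results <- lm_final_fit %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  mutate(method = "ridge")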
Defining a standard linear regression model for PCR. Unlike ridge/lasso, the regularization in PCR comes from dimensionality reduction, not from penalizing coefficients.
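A sketch, assuming the plain lm engine; the specification name is a placeholder:
lm_pcr_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")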
Creating a recipe that includes principal component analysis (PCA) transformation. The number of principal components to retain is a tuning parameter that controls model complexity.
browser_rec <- recipe(log_spend ~ ., data = browser_train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())
Combining the PCR preprocessing recipe and model specification into a unified workflow.
Creating a grid of potential values for the number of principal components to evaluate. We’ll consider using 1 to 10 components.
Performing cross-validation to determine the optimal number of principal components that minimizes prediction error.
Displaying the top-performing configurations ranked by RMSE to see how model performance varies with the number of components.
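A sketch of these steps, reusing the PCA recipe defined above, the lm specification for PCR, and cv_folds; object names are placeholders, and the grid simply lists 1 through 10 components:
pcr_wf <- workflow() %>%
  add_recipe(browser_rec) %>% # the recipe with step_pca() above
  add_model(lm_pcr_spec)

num_comp_grid <- tibble(num_comp = 1:10)

pcr_tune <- pcr_wf %>%
  tune_grid(resamples = cv_folds, grid = num_comp_grid)

pcr_tune %>% show_best(metric = "rmse", n = 5)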
## # A tibble: 5 x 7
## num_comp .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 rmse standard 1.65 5 0.0233 Preprocessor03_Model1
## 2 1 rmse standard 1.65 5 0.0221 Preprocessor01_Model1
## 3 2 rmse standard 1.65 5 0.0224 Preprocessor02_Model1
## 4 4 rmse standard 1.65 5 0.0223 Preprocessor04_Model1
## 5 5 rmse standard 1.65 5 0.0220 Preprocessor05_Model1
Selecting the optimal number of components using two approaches: minimizing RMSE directly, or using the one-standard-error rule which chooses the simplest model (fewer components) within one standard error of the minimum.
Finalizing the PCR workflow with the selected number of principal components based on the one-standard-error rule.
Fitting the finalized PCR model to the entire training set and evaluating it on the test set to get an unbiased performance estimate.
Extracting and labeling the PCR performance metrics for later comparison with other models.
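A sketch of the same four steps for PCR; object names are placeholders, and the one-standard-error rule treats fewer components as simpler:
best_ncomp_pcr     <- pcr_tune %>% select_best(metric = "rmse")
best_ncomp_pcr_1se <- pcr_tune %>% select_by_one_std_err(num_comp, metric = "rmse")

pcr_wf_final  <- pcr_wf %>% finalize_workflow(best_ncomp_pcr_1se)
pcr_final_fit <- pcr_wf_final %>% last_fit(browser_split)

pcr_results <- pcr_final_fit %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  mutate(method = "pcr")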
Installing the mixOmics package required for PLS regression. BiocManager is used because mixOmics is part of the Bioconductor project, which focuses on tools for analyzing genomic data.
Creating a partial least squares regression model specification. PLS, like PCR, uses dimension reduction but creates components that are specifically designed to maximize correlation with the response variable.
pls_spec <- pls() %>%
  set_args(num_comp = tune()) %>%
  set_engine("mixOmics") %>%
  set_mode("regression")
Setting up a recipe for data preprocessing without PCA transformation, as PLS performs its own dimension reduction internally.
browser_rec <- recipe(log_spend ~ ., data = browser_train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_zv(all_predictors())
Combining the PLS model specification and preprocessing recipe into a unified workflow.
Creating a grid of potential values for the number of PLS components to evaluate. As with PCR, we’ll consider using 1 to 10 components.
Performing cross-validation to determine the optimal number of PLS components that minimizes prediction error.
Displaying the top-performing configurations to see how model performance varies with the number of PLS components.
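A sketch of these steps, mirroring the PCR code above; object names are placeholders:
pls_wf <- workflow() %>%
  add_recipe(browser_rec) %>% # the recipe without step_pca() above
  add_model(pls_spec)

pls_grid <- tibble(num_comp = 1:10)

pls_tune <- pls_wf %>%
  tune_grid(resamples = cv_folds, grid = pls_grid)

pls_tune %>% show_best(metric = "rmse", n = 5)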
## # A tibble: 5 x 7
## num_comp .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 2.06 5 0.170 Preprocessor1_Model01
## 2 2 rmse standard 2.08 5 0.114 Preprocessor1_Model02
## 3 3 rmse standard 2.15 5 0.121 Preprocessor1_Model03
## 4 4 rmse standard 2.23 5 0.106 Preprocessor1_Model04
## 5 5 rmse standard 2.29 5 0.111 Preprocessor1_Model05
Selecting the optimal number of PLS components using the same two approaches as before: minimizing RMSE directly, or using the one-standard-error rule.
Finalizing the PLS workflow with the selected number of components based on the one-standard-error rule.
Fitting the finalized PLS model to the entire training set and evaluating it on the test set.
Extracting and labeling the PLS performance metrics for later comparison.
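A sketch of the final PLS steps; object names are placeholders:
best_ncomp_pls     <- pls_tune %>% select_best(metric = "rmse")
best_ncomp_pls_1se <- pls_tune %>% select_by_one_std_err(num_comp, metric = "rmse")

pls_wf_final  <- pls_wf %>% finalize_workflow(best_ncomp_pls_1se)
pls_final_fit <- pls_wf_final %>% last_fit(browser_split)

pls_results <- pls_final_fit %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  mutate(method = "pls")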
Combining and comparing the test set RMSE results from all three regularization methods to determine which approach performs best for this dataset.
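A sketch of the comparison, binding the labeled result tibbles from the sketches above and sorting by test-set RMSE:
bind_rows(pcr_results, ridge_results, pls_results) %>%
  arrange(.estimate)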
## # A tibble: 3 x 5
## .metric .estimator .estimate .config method
## <chr> <chr> <dbl> <chr> <chr>
## 1 rmse standard 2.15 Preprocessor1_Model1 pcr
## 2 rmse standard 4.89 Preprocessor1_Model1 ridge
## 3 rmse standard 5.36 Preprocessor1_Model1 pls