Abstract
This document demonstrates a tidymodels workflow for ridge, lasso,
principal component regression (PCR), and partial least squares (PLS)
regression. We will use the browser dataset to estimate the
regularization path, tune the hyperparameter \(\lambda\) for ridge and lasso and the
number of components for PCR and PLS, and evaluate the models on the
test set.
Load packages
library(tidyverse) # for data wrangling and visualization
library(tidymodels) # for data modeling
library(plsmod) # for estimating pls regression
library(here) # for referencing files and folders
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("mixOmics")
Set the seed for replication.
Read the browser dataset from the data folder.
## Rows: 1200 Columns: 201
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (201): log_spend, americansingles.com, opm.gov, netzerovoice.com, mobile...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Split the data into training and test sets, and split the training set into 10 folds for cross-validation.
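A minimal sketch of this step, using simulated data in place of the browser file. The object names, the seed value, and the 75/25 split proportion are assumptions for illustration, not the original settings.

```r
library(tidymodels)

set.seed(1203) # hypothetical seed, used only to make this sketch reproducible

# Simulated stand-in for the browser data: 200 rows, 20 numeric predictors
x <- matrix(rnorm(200 * 20), nrow = 200)
colnames(x) <- paste0("site_", seq_len(ncol(x)))
browser_sim <- as_tibble(x) %>%
  mutate(log_spend = site_1 - 0.5 * site_2 + rnorm(200))

# Hold out 25% of the rows as a test set (assumed proportion)
browser_split <- initial_split(browser_sim, prop = 0.75)
browser_train <- training(browser_split)
browser_test  <- testing(browser_split)

# Partition the training set into 10 folds for cross-validation
browser_folds <- vfold_cv(browser_train, v = 10)

nrow(browser_train)
nrow(browser_test)
nrow(browser_folds)
```

Only the training set is resampled; the test set stays untouched until the final evaluation.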
Set mixture = 0 for ridge and mixture = 1 for lasso; penalty is the regularization parameter \(\lambda\). We’ll go for ridge:
lm_spec <- linear_reg() %>%
  set_args(penalty = tune(), mixture = 0, nlambda = 10) %>%
  set_engine("glmnet") %>%
  set_mode("regression")
## Warning: package 'glmnet' was built under R version 4.3.3
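One way to tune a specification like this is to bundle it with a recipe in a workflow and pass it to tune_grid(). The sketch below is not the original code: it assumes simulated data, 5-fold resampling, a normalizing recipe, and a regular grid of 10 penalty values, and every object name in it is illustrative.

```r
library(tidymodels) # the glmnet package must also be installed

set.seed(1203) # hypothetical seed

# Simulated stand-in for the training data
x <- matrix(rnorm(200 * 20), nrow = 200)
colnames(x) <- paste0("site_", seq_len(ncol(x)))
sim <- as_tibble(x) %>%
  mutate(log_spend = site_1 - 0.5 * site_2 + rnorm(200))
sim_folds <- vfold_cv(sim, v = 5)

# Ridge specification: mixture = 0, penalty left open for tuning
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

# The ridge penalty depends on predictor scale, so center and scale first
ridge_rec <- recipe(log_spend ~ ., data = sim) %>%
  step_normalize(all_predictors())

ridge_wf <- workflow() %>%
  add_recipe(ridge_rec) %>%
  add_model(ridge_spec)

# Ten candidate penalties, evenly spaced on the log10 scale
ridge_grid <- grid_regular(penalty(), levels = 10)

ridge_res <- tune_grid(
  ridge_wf,
  resamples = sim_folds,
  grid = ridge_grid,
  metrics = metric_set(rmse)
)

show_best(ridge_res, metric = "rmse")
```

Switching to lasso in this sketch only requires changing mixture to 1.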
Show the best results:
## # A tibble: 5 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2.97e- 1 rmse standard 2.06 5 0.0868 Preprocessor1_Model10
## 2 2.08e- 2 rmse standard 2.31 5 0.109 Preprocessor1_Model09
## 3 4.57e-10 rmse standard 2.31 5 0.109 Preprocessor1_Model01
## 4 1.34e- 9 rmse standard 2.31 5 0.109 Preprocessor1_Model02
## 5 4.02e- 8 rmse standard 2.31 5 0.109 Preprocessor1_Model03
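There are two options for choosing \(\lambda\): the value that minimizes RMSE, or the one-standard-error rule of thumb, which picks the largest \(\lambda\) whose RMSE is within one standard error of the minimum: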
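Both rules can be checked by hand against the five rows printed above. The sketch below retypes those rows as displayed (rounded, and only the top five of the tuned candidates), so it illustrates the logic rather than reproducing the full selection; the object names are illustrative.

```r
library(dplyr)
library(tibble)

# The five candidates printed by show_best() above (values as displayed)
ridge_top <- tribble(
  ~penalty, ~mean, ~std_err,
  2.97e-1,  2.06,  0.0868,
  2.08e-2,  2.31,  0.109,
  4.57e-10, 2.31,  0.109,
  1.34e-9,  2.31,  0.109,
  4.02e-8,  2.31,  0.109
)

# Option 1: the penalty with the lowest cross-validated RMSE
best <- ridge_top %>%
  slice_min(mean, n = 1, with_ties = FALSE)

# Option 2: the largest penalty whose RMSE is within one
# standard error of the best candidate's RMSE
one_se <- ridge_top %>%
  filter(mean <= best$mean + best$std_err) %>%
  slice_max(penalty, n = 1, with_ties = FALSE)

best$penalty
one_se$penalty
```

With these five rows the two rules agree, because no other candidate falls within one standard error of the minimum. On a tune_grid() result, select_best() and select_by_one_std_err() from the tune package make the same two selections over the full grid.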
For PCR, specify a linear model. Note that in this case the model itself has no tuning parameters.
Note how the number of principal components (PCs) is instead determined inside the recipe object.
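A minimal sketch of how the number of PCs can be tuned through the recipe, assuming simulated data, 5-fold resampling, and a grid of 1 to 5 components. The object names are illustrative and this is not the original code.

```r
library(tidymodels)

set.seed(1203) # hypothetical seed

# Simulated stand-in for the training data
x <- matrix(rnorm(200 * 20), nrow = 200)
colnames(x) <- paste0("site_", seq_len(ncol(x)))
sim <- as_tibble(x) %>%
  mutate(log_spend = site_1 - 0.5 * site_2 + rnorm(200))
sim_folds <- vfold_cv(sim, v = 5)

# Plain linear regression: nothing to tune in the model itself
pcr_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# The tuning parameter lives in the preprocessing step
pcr_rec <- recipe(log_spend ~ ., data = sim) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())

pcr_wf <- workflow() %>%
  add_recipe(pcr_rec) %>%
  add_model(pcr_spec)

# Candidate numbers of components
pcr_grid <- tibble(num_comp = 1:5)

pcr_res <- tune_grid(
  pcr_wf,
  resamples = sim_folds,
  grid = pcr_grid,
  metrics = metric_set(rmse)
)

show_best(pcr_res, metric = "rmse")
```

Because the recipe is re-estimated within each fold, the principal components are computed from that fold's analysis set only, which keeps the assessment data out of the preprocessing.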
Show the best results:
## # A tibble: 5 x 7
## num_comp .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 rmse standard 1.65 5 0.0233 Preprocessor03_Model1
## 2 1 rmse standard 1.65 5 0.0221 Preprocessor01_Model1
## 3 2 rmse standard 1.65 5 0.0224 Preprocessor02_Model1
## 4 4 rmse standard 1.65 5 0.0223 Preprocessor04_Model1
## 5 5 rmse standard 1.65 5 0.0220 Preprocessor05_Model1
Two options: the num_comp that minimizes RMSE, or the one-standard-error rule of thumb, which picks the smallest num_comp whose RMSE is within one standard error of the minimum:
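The sketch below shows how the two choices, and the test-set evaluation that follows from them, might look with the tune package's selection helpers. It assumes simulated data, a 75/25 split, and 5-fold resampling; all object names are illustrative and this is not the original code.

```r
library(tidymodels)

set.seed(1203) # hypothetical seed

# Simulated stand-in for the browser data
x <- matrix(rnorm(200 * 20), nrow = 200)
colnames(x) <- paste0("site_", seq_len(ncol(x)))
sim <- as_tibble(x) %>%
  mutate(log_spend = site_1 - 0.5 * site_2 + rnorm(200))

sim_split <- initial_split(sim, prop = 0.75)
sim_train <- training(sim_split)
sim_folds <- vfold_cv(sim_train, v = 5)

pcr_rec <- recipe(log_spend ~ ., data = sim_train) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())

pcr_wf <- workflow() %>%
  add_recipe(pcr_rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

pcr_res <- tune_grid(
  pcr_wf,
  resamples = sim_folds,
  grid = tibble(num_comp = 1:5),
  metrics = metric_set(rmse)
)

# Option 1: the number of components with the lowest RMSE
pcr_best <- select_best(pcr_res, metric = "rmse")

# Option 2: the fewest components within one standard error of the best;
# sorting by num_comp ranks simpler models first
pcr_1se <- select_by_one_std_err(pcr_res, num_comp, metric = "rmse")

# Plug the chosen value into the workflow, refit on the full
# training set, and evaluate once on the test set
pcr_final <- finalize_workflow(pcr_wf, pcr_1se)
pcr_fit <- last_fit(pcr_final, split = sim_split, metrics = metric_set(rmse))

collect_metrics(pcr_fit)
```

The one-standard-error choice never uses more components than the RMSE-minimizing one, since the best candidate always satisfies the rule itself.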
To run tidymodels’ plsmod, we need to make sure the {mixOmics} package is installed.
Specify the plsmod model:
pls_spec <- pls() %>%
  set_args(num_comp = tune()) %>%
  set_engine("mixOmics") %>%
  set_mode("regression")
Show the best results:
## # A tibble: 5 x 7
## num_comp .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 2.06 5 0.170 Preprocessor1_Model01
## 2 2 rmse standard 2.08 5 0.114 Preprocessor1_Model02
## 3 3 rmse standard 2.15 5 0.121 Preprocessor1_Model03
## 4 4 rmse standard 2.23 5 0.106 Preprocessor1_Model04
## 5 5 rmse standard 2.29 5 0.111 Preprocessor1_Model05
Two options: the num_comp that minimizes RMSE, or the one-standard-error rule of thumb, which picks the smallest num_comp whose RMSE is within one standard error of the minimum:
## # A tibble: 3 x 5
## .metric .estimator .estimate .config method
## <chr> <chr> <dbl> <chr> <chr>
## 1 rmse standard 2.15 Preprocessor1_Model1 pcr
## 2 rmse standard 4.96 Preprocessor1_Model1 ridge
## 3 rmse standard 5.36 Preprocessor1_Model1 pls