EC524W25: Lab 003
Practice with tidymodels
Introduction
In this document, we are going to keep predicting housing prices. I have set up a tidymodels workflow, including:
- a recipe
- a linear regression model
- a cross validation strategy
- a workflow to fit the model and to make predictions on new data
Task: Take this document, change it, and make it better. Instead of using basic lm, switch to a penalized regression model using glmnet. Use Ridge, Lasso, or Elastic Net. Change the model specification to include more variables, but make sure to start simple. Tune the model using a grid search. Make your CPU work.
Everything should work right out of the gate. Before getting started, make sure the document compiles.
Resources:
Everything in this document is pulled from the Kaggle notebook from the previous lab. Use it for help.
However, missing from this notebook is how to use glmnet. Use the internet to find out how to use glmnet with tidymodels. Tuning the model works the same way as the knn example in the Kaggle notebook.
Lastly, ask me for help if you need it. Feel free to work together!
tidymodels workflow
Setup
First, set up the document: load packages, set a random seed, load the data, and adjust the column names.
# Load packages
pacman::p_load(tidyverse, tidymodels, janitor, skimr, magrittr, here)
# Set random seed
set.seed(42)
# Load the data
train_df = read_csv(here("data", "train.csv")) %>% clean_names()
# Rename first and second floor square footage
train_df %<>%
  rename(first_floor_sqft = x1st_flr_sf,
         second_floor_sqft = x2nd_flr_sf)
# Create a variable for total square footage
train_df %<>%
  mutate(total_sqft = first_floor_sqft + second_floor_sqft)
# Drop a few categorical variables
train_df %<>% select(-roof_matl, -exterior1st, -heating_qc, -electrical)
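Since skimr is already loaded, a quick optional way to get an overview of the cleaned training data is skim():
# Optional: summarize the cleaned training data
train_df %>% skim()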
Recipe
Let’s define the recipe for pre-processing.
price_recipe = recipe(
  sale_price ~ ., data = train_df) %>%
  step_impute_mean(all_numeric(), -all_outcomes()) %>%
  step_select(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  step_interact(~ all_predictors() : all_predictors()) %>%
  step_lincomb(all_predictors())
One important note: preprocessing with recipes helps keep our analysis unadulterated when we use a resampling step. The preprocessing steps are applied to each fold of the data separately, ensuring that the model is not getting additional information from the data in the validation sets.
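If you want to see what the recipe actually produces, one option is to prep() it on the training data and bake() the result. This is only a sanity check (the workflow applies these steps automatically within each fold), and it can take a moment because of the interaction step:
# Optional sanity check: apply the recipe to the training data and peek at the output
price_recipe %>%
  prep(training = train_df) %>%
  bake(new_data = NULL) %>%
  glimpse()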
Model
The model is where we define the type of model we want to use for the task at hand.
# Define our penalized linear regression model (with the 'glmnet' engine)
model_lm =
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
# Define a penalized logistic regression model (for classification tasks)
model_log =
  logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
# Check the result
model_lm
Linear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
model_log
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
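For reference, in this parameterization mixture = 0 corresponds to ridge regression, mixture = 1 to lasso, and values in between to elastic net. If you want to see how parsnip will map the specification onto glmnet, translate() shows the fit template:
# Show how parsnip translates the specification into a glmnet call
model_lm %>% translate()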
Workflow
The model and the recipe fit together in a workflow. This is where we define the steps to fit the model to the data. These workflows make our prediction tasks easier.
workflow() %>%
  add_model(model_lm) %>%
  add_recipe(price_recipe)
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_impute_mean()
• step_select()
• step_normalize()
• step_interact()
• step_lincomb()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
Let’s define our workflow as price_workflow
# Define our workflow
price_workflow = workflow() %>%
  add_model(model_lm) %>%
  add_recipe(price_recipe)
Fit the model
Simple fit
As an example of how workflows fit a model with a recipe, let's fit the model to the data using a workflow, but without the cross-validation strategy.
# Fit the model without cross-validation
# price_workflow %>% fit(train_df)
Adding cross-validation strategy
Now let’s add a cross-validation strategy to our workflow. First we have to create a cross-validation strategy object. Feel free to tinker with the number of folds.
# Create a 5-fold cross validation strategy
rsmp_cv = train_df %>% vfold_cv(v = 5)
To look at the resampling splits across the data:
rsmp_cv %>% tidy()
Fit the model with cross-validation
Now let’s fit the model to the data using a workflow and a cross-validation strategy.
# Define a grid of penalty and mixture values to search over
parameter_grid = expand.grid(
  penalty = c(10^seq(2, -3, length.out = 25)),
  mixture = seq(0, 1, 0.01)
)
# Run the tuning in parallel across 8 workers
future::plan("multisession", workers = 8)
# Tune the workflow over the grid using the cross-validation folds
fit_cv = price_workflow %>%
  tune_grid(
    rsmp_cv,
    grid = parameter_grid,
    metrics = metric_set(rmse)
  )
One can specify which metrics to compute within the fit_resamples() function with:
... %>%
  fit_resamples(resamples = rsmp_cv, metrics = metric_set(rsq))
Now it is a good idea to assess model performance. How else do you know which model is best? The following code will summarize the results of the cross-validation. Play around until you get some good results.
# Collect the rmse metric
metric_df = fit_cv %>% collect_metrics(summarize = TRUE) %>% arrange(mean)
# fit_cv %>% collect_metrics(summarize = TRUE) %>% select(mean) %>% fndistinct()
# Plot the results
ggplot(metric_df %>% filter(mixture < 0.1),
       aes(x = penalty, y = mean, color = mixture, group = mixture)) +
  # geom_point() +
  geom_line(lwd = 1) +
  labs(title = "Model Performance",
       x = "Penalty",
       y = "RMSE") +
  scale_color_viridis_c(option = "magma") +
  # Log scale for x-axis
  scale_x_log10()
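Another quick way to inspect the tuning results is show_best(), which returns the top parameter combinations for a chosen metric (RMSE here):
# Show the five best penalty/mixture combinations by RMSE
fit_cv %>% show_best(metric = "rmse", n = 5)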
Once you have a good model, we will want to predict on new data.
Before we can predict on new data, we need to finalize our workflow. Finalizing plugs the best tuning-parameter values from the grid search into the workflow; we then fit that finalized workflow to the entire training dataset. The following code block finalizes the workflow.
# Finalize workflow
best_model = price_workflow %>%
  finalize_workflow(select_best(fit_cv, metric = "rmse"))
# Check the result
best_model
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_impute_mean()
• step_select()
• step_normalize()
• step_interact()
• step_lincomb()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 0.0287298483335366
mixture = 0.02
Computational engine: glmnet
Finally, now that we have a finalized workflow that has been tuned to the best model, we can fit the model to the entire dataset. We always want to use all of our training data to fit the model before making predictions on new data.
# Fit the final workflow
final_fit = best_model %>% fit(train_df)
# Examine the final fit
final_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_impute_mean()
• step_select()
• step_normalize()
• step_interact()
• step_lincomb()
── Model ───────────────────────────────────────────────────────────────────────
Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian", alpha = ~0.02)
Df %Dev Lambda
1 0 0.00 39.540
2 1 0.31 36.020
3 1 0.64 32.820
4 4 1.25 29.910
5 5 2.34 27.250
6 7 3.79 24.830
7 10 5.67 22.620
8 11 7.97 20.610
9 11 10.38 18.780
10 15 13.13 17.110
11 18 16.12 15.590
12 25 19.49 14.210
13 29 23.13 12.950
14 31 26.86 11.800
15 36 30.61 10.750
16 39 34.40 9.793
17 43 38.10 8.923
18 45 41.73 8.131
19 46 45.20 7.408
20 50 48.52 6.750
21 55 51.67 6.150
22 58 54.67 5.604
23 59 57.47 5.106
24 63 60.07 4.653
25 64 62.51 4.239
26 70 64.79 3.863
27 72 67.03 3.520
28 77 69.15 3.207
29 83 71.13 2.922
30 84 72.96 2.662
31 90 74.65 2.426
32 94 76.22 2.210
33 99 77.66 2.014
34 107 79.00 1.835
35 112 80.22 1.672
36 128 81.35 1.524
37 132 82.40 1.388
38 141 83.36 1.265
39 146 84.25 1.152
40 156 85.07 1.050
41 167 85.84 0.957
42 179 86.55 0.872
43 187 87.20 0.794
44 193 87.80 0.724
45 200 88.35 0.659
46 210 88.86 0.601
...
and 54 more lines.
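If you want to inspect the estimated coefficients at the tuned penalty, one option is to pull the parsnip fit out of the workflow and tidy() it:
# Extract the fitted glmnet model and view coefficients at the selected penalty
final_fit %>%
  extract_fit_parsnip() %>%
  tidy()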
Prediction
First, let's load in the testing data to make our predictions on.
# Load and clean the test set
test_df = read_csv(here("data", "test.csv")) %>% clean_names()
# Rename first and second floor square footage
test_df %<>%
  rename(first_floor_sqft = x1st_flr_sf,
         second_floor_sqft = x2nd_flr_sf)
# Create a variable for total square footage
test_df %<>%
  mutate(total_sqft = first_floor_sqft + second_floor_sqft)
# Create an NA column for sale_price
test_df$sale_price = NA
Now we can make predictions on the test data. It is pretty simple.
# Predict the sale price
# predictions = predict(final_fit, new_data = test_df)
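From here, you would typically bind the predictions to the test-set IDs and write out a submission file. Below is a minimal sketch, assuming the test data has an id column (after clean_names()) and that the submission expects Id and SalePrice columns:
# Generate predictions and build a submission file
# (the Id / SalePrice column names are assumptions about the submission format)
predictions = predict(final_fit, new_data = test_df)
submission = tibble(
  Id = test_df$id,
  SalePrice = predictions$.pred
)
write_csv(submission, here("data", "submission.csv"))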