tidymodels
knitr::opts_chunk$set(warning = FALSE) ## Limit warnings
knitr::opts_chunk$set(message = FALSE) ## Limit messages
knitr::opts_chunk$set(echo = TRUE) ## Do not limit echo
options(htmltools.dir.version = FALSE)
# setup; load packages, data; set seed
options(scipen = 999)
library(pacman)
p_load(tidyverse, modeldata, skimr, janitor, tidymodels, magrittr, data.table, here)
set.seed(12345)
source("~/Documents/scripts/colors/colors.R")
We have spent a lot of time over the last few weeks learning tidymodels. But there is one thing that we only glossed over briefly that is super important for your final projects in this class.
Once we cross-validate, fit a model within a workflow, and pick the best parameters, how do we use that model to predict onto our testing data? This set of notes shows you how to do this last step, mimicking project 000, using housing data.
Let’s load the housing data (training and testing) from Ames, Iowa. Recall that sales prices are only reported in the training data set.
# Load data ------------------------------------------------------------------------------
# Training data
train_dt = here('lab/001-cleaning/data', 'train.csv') %>%
  fread() %>%
  janitor::clean_names() %>%
  mutate(sqft = x1st_flr_sf + x2nd_flr_sf)
# Testing data
test_dt = here('lab/001-cleaning/data', 'test.csv') %>%
  fread() %>%
  janitor::clean_names() %>%
  mutate(sqft = x1st_flr_sf + x2nd_flr_sf)
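Before skimming, a quick sanity check never hurts (this line is my addition, not part of the original notes): the testing data should contain every training column except the outcome.
# Columns in training but not testing -- should be just sale_price
setdiff(names(train_dt), names(test_dt))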
Always a good idea to skim the data to check for missing variables and other info. Here I avoided using any variables with lots of missing values as predictors.
train_dt %>% skimr::skim()
Name | Piped data |
Number of rows | 1460 |
Number of columns | 82 |
Key | NULL |
_______________________ | |
Column type frequency: | |
character | 43 |
numeric | 39 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ms_zoning | 0 | 1.00 | 2 | 7 | 0 | 5 | 0 |
street | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
alley | 1369 | 0.06 | 4 | 4 | 0 | 2 | 0 |
lot_shape | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
land_contour | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
utilities | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
lot_config | 0 | 1.00 | 3 | 7 | 0 | 5 | 0 |
land_slope | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
neighborhood | 0 | 1.00 | 5 | 7 | 0 | 25 | 0 |
condition1 | 0 | 1.00 | 4 | 6 | 0 | 9 | 0 |
condition2 | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
bldg_type | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
house_style | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
roof_style | 0 | 1.00 | 3 | 7 | 0 | 6 | 0 |
roof_matl | 0 | 1.00 | 4 | 7 | 0 | 8 | 0 |
exterior1st | 0 | 1.00 | 5 | 7 | 0 | 15 | 0 |
exterior2nd | 0 | 1.00 | 5 | 7 | 0 | 16 | 0 |
mas_vnr_type | 8 | 0.99 | 4 | 7 | 0 | 4 | 0 |
exter_qual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
exter_cond | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
foundation | 0 | 1.00 | 4 | 6 | 0 | 6 | 0 |
bsmt_qual | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_cond | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_exposure | 38 | 0.97 | 2 | 2 | 0 | 4 | 0 |
bsmt_fin_type1 | 37 | 0.97 | 3 | 3 | 0 | 6 | 0 |
bsmt_fin_type2 | 38 | 0.97 | 3 | 3 | 0 | 6 | 0 |
heating | 0 | 1.00 | 4 | 5 | 0 | 6 | 0 |
heating_qc | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
central_air | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
kitchen_qual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
functional | 0 | 1.00 | 3 | 4 | 0 | 7 | 0 |
fireplace_qu | 690 | 0.53 | 2 | 2 | 0 | 5 | 0 |
garage_type | 81 | 0.94 | 6 | 7 | 0 | 6 | 0 |
garage_finish | 81 | 0.94 | 3 | 3 | 0 | 3 | 0 |
garage_qual | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
garage_cond | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
paved_drive | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
pool_qc | 1453 | 0.00 | 2 | 2 | 0 | 3 | 0 |
fence | 1179 | 0.19 | 4 | 5 | 0 | 4 | 0 |
misc_feature | 1406 | 0.04 | 4 | 4 | 0 | 4 | 0 |
sale_type | 0 | 1.00 | 2 | 5 | 0 | 9 | 0 |
sale_condition | 0 | 1.00 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 | ▇▇▇▇▇ |
ms_sub_class | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
lot_frontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
lot_area | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
overall_qual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
overall_cond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
year_built | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
year_remod_add | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
mas_vnr_area | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 | ▇▁▁▁▁ |
bsmt_fin_sf1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 | ▇▁▁▁▁ |
bsmt_fin_sf2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 | ▇▁▁▁▁ |
bsmt_unf_sf | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 | ▇▅▂▁▁ |
total_bsmt_sf | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
x1st_flr_sf | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 | ▇▅▁▁▁ |
x2nd_flr_sf | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 | ▇▃▂▁▁ |
low_qual_fin_sf | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 | ▇▁▁▁▁ |
gr_liv_area | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 | ▇▇▁▁▁ |
bsmt_full_bath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
bsmt_half_bath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
full_bath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▁▇▁▇▁ |
half_bath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
bedroom_abv_gr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
kitchen_abv_gr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
tot_rms_abv_grd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▁▁ |
garage_yr_blt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 | ▁▁▅▅▇ |
garage_cars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
garage_area | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
wood_deck_sf | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 | ▇▂▁▁▁ |
open_porch_sf | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 | ▇▁▁▁▁ |
enclosed_porch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 | ▇▁▁▁▁ |
x3ssn_porch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
screen_porch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 | ▇▁▁▁▁ |
pool_area | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 | ▇▁▁▁▁ |
misc_val | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 | ▇▁▁▁▁ |
mo_sold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 | ▃▆▇▃▃ |
yr_sold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
sale_price | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 | ▇▅▁▁▁ |
sqft | 0 | 1.00 | 1509.62 | 521.16 | 334 | 1123.75 | 1458.0 | 1775.25 | 5642 | ▇▇▁▁▁ |
Set up our cross-validation splits
# 5-fold CV on the training dataset
train_cv = train_dt %>% vfold_cv(v = 5)
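Printing the rsample object is a cheap way to confirm the fold sizes (this check is my addition); the commented line sketches a stratified alternative, which I am not using here.
# Each of the 5 folds holds out ~292 of the 1,460 training rows
train_cv
# Optional variant (not used in these notes): stratify folds on the outcome
# train_cv = train_dt %>% vfold_cv(v = 5, strata = sale_price)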
tidymodeling
recipe
I create a very simple recipe here without much forethought. I'm not really trying my hardest with the model formula, but I could easily start changing my predictors by altering my recipe (a hypothetical variant is sketched after the printed recipe below).
recipe = recipe(sale_price ~ id +
    full_bath + year_built + lot_area + sqft +
    yr_sold + half_bath +
    central_air + overall_qual + overall_cond + pool_area,
  data = train_dt) %>%
  # step_rm(state, county, pop_pct_veteran, dir_2016) %>%
  update_role(id, new_role = "id var") %>%
  step_normalize(all_predictors() & all_numeric())
recipe
## Recipe
##
## Inputs:
##
## role #variables
## id var 1
## outcome 1
## predictor 10
##
## Operations:
##
## Centering and scaling for all_predictors() & all_numeric()
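To make "altering the recipe" concrete, here is one hypothetical variant; the name recipe_alt, the kitchen-sink formula, and the step choices are all my additions, not part of these notes. It keeps every numeric predictor and imputes the gaps instead of hand-picking complete variables.
# Hypothetical alternative recipe: all numeric predictors, with imputation
recipe_alt = recipe(sale_price ~ ., data = train_dt) %>%
  update_role(id, new_role = "id var") %>%
  step_rm(all_nominal_predictors()) %>%            # drop the 43 character columns
  step_impute_median(all_numeric_predictors()) %>% # fill e.g. lot_frontage NAs
  step_normalize(all_numeric_predictors())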
model
I’m going to give a random forest a shot for this prediction. Using 5-fold CV, I want to tune three parameters in my random forest:
- min_n: minimum number of data points in a node that are required for the node to be split further
- mtry: number of predictors that will be randomly sampled at each split
- trees: number of trees contained in the ensemble
model = rand_forest(
mode = 'regression',
engine = 'ranger',
mtry = tune(),
trees = tune(),
min_n = tune()
)
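If you want to confirm what tune() flagged, you can list the spec's tunable parameters (my addition, assuming a recent tidymodels where extract_parameter_set_dials() is available):
# List the parameters marked with tune() in the model spec
model %>% extract_parameter_set_dials()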
workflow
Set up my workflow with my model and recipe.
wf = workflow() %>%
  add_model(model) %>%
  add_recipe(recipe)
fit
Fit my model, tuning my parameters across several values. The measure of error I care about is the RMSE.
fit = wf %>%
  tune_grid(
    train_cv,
    grid = expand_grid(mtry = c(1, 2, 3, 5, 10),
                       min_n = c(1, 3, 5, 10, 25),
                       trees = c(50, 100, 150, 300)),
    metrics = metric_set(rmse)
  )
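Before plotting, a quick tabular look never hurts; show_best() from tune (my addition here) ranks parameter combinations by mean cross-validated RMSE.
# Top 5 parameter combinations by mean CV RMSE
fit %>% show_best(metric = "rmse", n = 5)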
After tuning, I plot my RMSE to check and see how well my model did for each variation in my parameters.
fit %>%
  collect_metrics(summarize = TRUE) %>%
  ggplot(aes(x = min_n, y = mean, group = factor(mtry), color = factor(mtry))) +
  geom_line(size = 0.7, alpha = 0.6) +
  geom_point(size = 2.5) +
  facet_wrap(~trees, nrow = 1) +
  labs(
    title = "Random forest model performance",
    subtitle = "Predicting housing prices",
    caption = "",
    x = "Min data points per node",
    y = 'RMSE',
    color = 'Number of predictors'
  ) +
  hrbrthemes::theme_ipsum() +
  scale_color_manual(values = c(grant_pal_1, grant_pal_3, grant_pal_5,
                                grant_pal_6, grant_pal_9)) +
  scale_x_continuous(breaks = c(1, 3, 5, 10, 25)) +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    legend.position = 'bottom'
  )
finalize_workflow
Now I want to finalize my workflow and select my best model parameters. This will output the model I want to use to predict housing prices in the testing data.
final_rf = wf %>%
  finalize_workflow(select_best(fit, metric = "rmse"))
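For the curious, select_best() simply returns the winning parameter values as a one-row tibble (this inspection line is my addition):
# The mtry/trees/min_n combination that minimized mean CV RMSE
select_best(fit, metric = "rmse")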
The following code chunk actually fits that model.
# Fitting our final workflow
final = final_rf %>% fit(data = train_dt)
# Examine the final workflow
final
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3, x), num.trees = ~50, min.node.size = min_rows(~1, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 50
## Sample size: 1460
## Number of independent variables: 10
## Mtry: 3
## Target node size: 1
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 1123092792
## R squared (OOB): 0.8220452
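If you ever need the underlying ranger fit itself, say to grab the OOB error programmatically, you can pull it out of the trained workflow (my addition, assuming a workflows version with extract_fit_engine()):
# Extract the raw ranger object from the fitted workflow
final %>% extract_fit_engine()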
predict
Using the finalized workflow, I now predict housing prices in the testing data set with the familiar predict
function. The result is a vector of \(\hat{y}\)s: the sale price my model predicts for each house.
# Predict onto the test data
yhat = final %>% predict(new_data = test_dt)
# Check out predictions
yhat %>% head()
## # A tibble: 6 × 1
## .pred
## <dbl>
## 1 127339.
## 2 151166.
## 3 185422.
## 4 191608.
## 5 201316.
## 6 190181.
test_dt$yhat <- yhat
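One hedge on that last line: predict() returns a one-column tibble named .pred, so the assignment above stores a tibble inside the column. Pulling out the numeric vector is usually what you want, and from there a submission file is one fwrite() away. The Id/SalePrice column names below are my assumption about the project 000 format, not something stated in these notes.
# Store the predictions as a plain numeric vector instead of a tibble
test_dt$yhat = yhat$.pred
# Hypothetical submission file; column names are assumed
fwrite(test_dt[, .(Id = id, SalePrice = yhat)], 'submission.csv')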