class: center, middle, inverse, title-slide

# Lab .mono[004]
## Data Cleaning and Workflow with Tidymodels
### Connor Lennon
### 21 February 2020

---
exclude: true

---
class: inverse, middle
# Welcome to Parsnip

---
name: admin
# Welcome to Parsnip
## Why bother with Tidymodels?

- As Ed said:

--

.hi-orange[ It's the future, Marty]

--

<img src="images/ec524-heart-disease.jpg" width="60%" style="display: block; margin: auto;" />

No but really: it is extremely intuitive, and streamlines a lot of the annoying parts of data science (data cleaning, cross validation, multiple model testing, etc.)

---
# Welcome to Parsnip
## Why bother with Tidymodels?

- Why use this over `caret`?

--

Two reasons:

1.) It plays well with the tidyverse, and uses the very nice `%>%` pipe you already know from dplyr

--

2.) It lets you quickly substitute in new parts to reuse old code - making your life a LOT simpler

--

But, to use `parsnip` or `tidymodels`, you first have to have a good sense of the 'data science workflow' according to the makers

---
class: inverse, middle
# Workflow

---
name: admin
# Workflow
## Workflows

<img src="images/Workflows.png" width="80%" style="display: block; margin: auto;" />

.footnote[
.orange[*] Source: https://education.rstudio.com/blog/2020/02/conf20-intro-ml/.
]

---
name: parsnip
## Example with just parsnip

.hi-blue[Parsnip] lets us build a model much like Caret, but in the .hi-orange[smallest possible] building blocks. These include, in order:

--

- A model type. This is a broad category of model, like .hi[linear regression] or .hi-blue[classification tree].

--

- An "engine." This is the package that will run the model type you provided above. This could be either `lm` or `glmnet` for linear regression.

--

- A fit object. This fits the model you built above to your data.

--

Let's do a worked example, using your homework data from the last project.
---
# Models with Parsnip

Let's read in the heart data, and then we can use the `parsnip` package to run a simple regression

```r
#Let's build a model in steps
lazzo = linear_reg()
```

---
# Models with Parsnip

Let's read in the heart data, and then we can use the `parsnip` package to run a simple regression

```r
#Let's build a model in steps
lazzo = linear_reg() %>%
  set_engine('lm')
```

---
# Models with Parsnip

Let's read in the heart data, and then we can use the `parsnip` package to run a simple regression

```r
#Let's build a model in steps
lazzo = linear_reg() %>%
  set_engine('lm') %>%
  parsnip::fit(
    heart_disease ~ sex + chest_pain + resting_bp,
    data = heart_data_tr
  )

#predict in sample, then round to 0/1 so the squared error is the share of wrong calls
rez = predict(lazzo, new_data = heart_data_tr) %>% rename(lm = .pred)
mean((rez[[1]] %>% round() - heart_data_tr$heart_disease)^2, na.rm = TRUE)
```

```
#> [1] 0.2587719
```

Not great. Maybe a lasso would help us? All that work again?

--

.hi[No!]

---
# Models with Parsnip

I can just overwrite an engine/model and then rerun all of this with one simple update

```r
#build initial model
lazzo_mod = linear_reg() %>%
  set_engine('lm')
```

--

```r
#switch to lasso/ridge/elasticnet. We can set hyperparameters in 'set_args'
results = lazzo_mod %>%
  set_engine('glmnet') %>%
  set_args(penalty = 0.001, mixture = 0.5) %>%
  parsnip::fit(
    heart_disease ~ sex + chest_pain + resting_bp,
    data = heart_data_tr
  )

rez = predict(results, new_data = heart_data_tr)
mean((rez[[1]] %>% round() - heart_data_tr$heart_disease)^2, na.rm = TRUE)
```

```
#> [1] 0.2587719
```

---
# Models with Parsnip

But, from your homework, you know you .hi[really] need to do this using classification methods.

--

And that requires data cleaning, to at the very least get some factors built.

--

.hi-orange[Ugh]

--

That's where `recipes` is going to save the day. But - quickly, we need to revisit our workflow diagram, because we're going to organize everything according to a .hi[workflow] object.
--

Yup, that's a package too

---
# Workflows()

<img src="images/Workflows.png" width="80%" style="display: block; margin: auto;" />

.footnote[
.orange[*] Source: https://education.rstudio.com/blog/2020/02/conf20-intro-ml/.
]

---
class: inverse, middle
# Improving your workflow with workflow

---
# Improving your workflow with workflow

Using `workflows` is quite simple. You just make a `workflow()` object, and tack on steps using a `%>%`.

```r
#workflow always starts with workflow
my_big_tuna = workflow()
```

---
# Improving your workflow with workflow

Using `workflows` is quite simple. You just make a `workflow()` object, and tack on steps using a `%>%`.

```r
#let's use a formula to tell our model what to do
#(%<>% is magrittr's assignment pipe: pipe, then overwrite the left-hand side)
my_big_tuna %<>% add_formula(heart_disease ~ sex + chest_pain + resting_bp)
```

---
# Improving your workflow with workflow

Using `workflows` is quite simple. You just make a `workflow()` object, and tack on steps using a `%>%`.

```r
#now we can add a parsnip model from before onto our workflow to predict some stuff
my_big_tuna %<>% add_model(lazzo_mod)
```

---
# Improving your workflow with workflow

Using `workflows` is quite simple. You just make a `workflow()` object, and tack on steps using a `%>%`.

```r
#now we can just pass this thing to fit, and get our fitted object back.
parsnip::fit(my_big_tuna, data = heart_data_tr)
```

```
#> ══ Workflow [trained] ═══════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#>
#> ── Preprocessor ─────────────────────────────────────────────────────────
#> heart_disease ~ sex + chest_pain + resting_bp
#>
#> ── Model ────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept)          sex   chest_pain   resting_bp
#>   -1.134294     0.304522     0.219247     0.005328
```

---
# Improving your workflow with workflow

Oh oops. That was an `lm()` model. We might want our `glmnet` back.
Ok, no problem

--

```r
#Just create a new model
real_lazzo = lazzo_mod %>%
  set_engine('glmnet') %>%
  set_args(penalty = 0.001, mixture = 0.5)
```

```r
#...and update our workflow accordingly
my_big_tuna %<>% update_model(real_lazzo)

#fit, predict in sample, and attach the truth so rmse() can use column names
parsnip::fit(my_big_tuna, data = heart_data_tr) %>%
  predict(heart_data_tr) %>%
  mutate(truth = heart_data_tr$heart_disease) %>%
  rmse(truth = truth, estimate = .pred)
```
| .metric | .estimator | .estimate |
|:--------|:-----------|----------:|
| rmse    | standard   |     0.425 |
--

Right... back to `recipes`

---
class: inverse, middle
# Recipes

---
# Recipes
## Imagine you are baking muffins...

The idea behind recipes is to treat .hi-orange[data processing] and .hi-blue[model building] as an understandable process (.hi[recipe]) for a program to follow to 'bake' your predictions into the model you want.

--

- .hi-blue[Step 1:] Separate your ingredients: tell your program what variables are which. This is done using formula and model objects

--

- .hi-orange[Step 2:] Provide a list of directions to mix your ingredients (change them) in order.

--

- .hi-orange[Step 3:] Prep your data using your instructions (mixing the recipe)

Recipes essentially act like super sophisticated .hi-orange[formula] objects, and let you tell your model how it needs to treat any new data it sees.

---
# Recipes
## Example

If you recall, we don't want to use ID to do our predicting! Maybe we could keep that variable, and exclude it from all of our data cleaning, but still hold onto it in our data frame...

Yup. We can do that.

--

`recipes` uses .hi-orange[roles] to separate your .hi-blue[ingredients (variables)] into separate buckets.

--

I'll give in and use the baking analogy: you don't want to sift your eggs and beat your salt.

--

First though - here's an adorable graphic.

---
# Recipe

<img src="images/socute.png" style="display: block; margin: auto;" />

.footnote[
.orange[*] Source: https://education.rstudio.com/blog/2020/02/conf20-intro-ml/.
]

---
# Recipe

A recipe usually begins its life as a formula object that takes some information from our data.

--

Let's make one and see what it looks like

```r
hd_recipe = recipe(heart_disease ~ ., data = heart_data_tr)
hd_recipe %>% summary() %>% head(3)
```
| variable | type    | role      | source   |
|:---------|:--------|:----------|:---------|
| id       | numeric | predictor | original |
| age      | numeric | predictor | original |
| sex      | numeric | predictor | original |
---
# Recipe
## Roles

We can change or add roles (variables can have multiple) with either `update_role` or `add_role`

```r
hd_recipe %<>%
  update_role(id, new_role = 'id_variables') %>%
  add_role(sex, exercise_angina, high_sugar, chest_pain,
           slope, vessels, thalium_scan, ecg,
           new_role = 'sneaky_dummies')
```

---
# Recipes
## Now what?

Ok, this might not seem useful to you... .hi[but just wait.]

We are essentially creating different 'dry ingredients/wet ingredients' shelves in our cabinet so we can tell R to treat them differently in preprocessing (or for another reason).

--

- Let's do our preprocessing using this recipe analogy.

We can use what are called .hi[selectors] to pick and choose which variables to transform and how. Here's a brief snapshot of some of the ones available

---
# Recipes

<img src="images/selectors.png" width="80%" style="display: block; margin: auto;" />

These then get passed to .hi-orange['step_*'] methods, of which there are a TON. The cool thing is that these roughly follow the naming convention:

step_

--

your_friendly_preProcess_method_from_caret()

--

Let's see these in action

---
# Recipes

- We want to build our preprocessing as if we were writing a recipe for someone else to replicate, using a series of steps.

```r
hd_recipe %<>%
  step_mutate_at(has_role('sneaky_dummies'), -all_outcomes(), fn = as.factor)
```

---
# Recipes

- We want to build our preprocessing as if we were writing a recipe for someone else to replicate, using a series of steps.

```r
hd_recipe %<>%
  step_mutate_at(has_role('sneaky_dummies'), fn = as.factor) %>%
  step_center(all_predictors(), -all_nominal())
```

---
# Recipes

- We want to build our preprocessing as if we were writing a recipe for someone else to replicate, using a series of steps.
```r
hd_recipe %<>%
  step_mutate_at(has_role('sneaky_dummies'), fn = as.factor) %>%
  step_center(all_predictors(), -all_nominal()) %>%
  step_scale(all_predictors(), -all_nominal()) %>%
  step_nzv(all_predictors()) %>%
  step_mutate_at(all_outcomes(), fn = as.factor)
```

---
# Recipes

However, we probably also want to do imputation on our data. Let's do that too.

```r
#note: newer recipes releases rename these to step_impute_median() and step_impute_mode()
hd_recipe %<>%
  step_medianimpute(all_numeric()) %>%
  step_modeimpute(all_nominal())
```

Great, we have done a ton of...

--

.hi-orange[what?]

--

- We've now given a set of directions to follow when we train a model.

--

- What model?

--

Any model we want, so long as we are using the data cleaning process we just described.

---
# Recipes
## Training preprocessing?

Before we do anything more, we actually have to set up our ingredients on our shelves. This is what `prep()` does.

--

```r
prepped_recipe = prep(hd_recipe)
```

--

Why do we need to train a preprocessing step, you might ask?

--

.hi[So we can use the SAME process to transform any dataset we like] - for instance, our testing data.

---
# Recipes

```r
#placeholder outcome so the test set has every column the recipe expects
heart_data_ts$heart_disease = -2
bake(prepped_recipe, new_data = heart_data_ts) %>% head(2)
```
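| id  | age   | sex | chest_pain | resting_bp | cholestoral | high_sugar | ecg | max_rate | exercise_angina | st_depression | slope | vessels | thalium_scan | heart_disease |
|----:|------:|:----|:-----------|-----------:|------------:|:-----------|:----|---------:|:----------------|--------------:|:------|:--------|:-------------|:--------------|
| 179 | -1.26 | 1   | 3          | -0.111     | 1.35        | 0          | 0   | 0.556    | 0               | 0.776         | 1     | 1       | 3            |               |
| 14  | -1.15 | 1   | 2          | -0.672     | 0.297       | 0          | 0   | 1.01     | 0               | -0.882        | 1     | 0       | 7            |               |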
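
To make the `prep()`/`bake()` split concrete, here is a rough base R sketch of the same idea for centering and scaling only. It is not how `recipes` is implemented; `prep_by_hand()` and `bake_by_hand()` are hypothetical helpers, and the built-in `mtcars` data stands in for the heart data.

```r
# Base R sketch of the prep()/bake() idea: estimate on training data, reuse elsewhere.
# `train` and `test` are hypothetical splits of the built-in mtcars data.
train = mtcars[1:24, c("mpg", "hp", "wt")]
test  = mtcars[25:32, c("mpg", "hp", "wt")]

# "prep": learn the centering and scaling constants from the training rows only
prep_by_hand = function(data) {
  list(center = colMeans(data), scale = apply(data, 2, sd))
}

# "bake": apply those stored constants to any data set with the same columns
bake_by_hand = function(prepped, new_data) {
  as.data.frame(scale(new_data, center = prepped$center, scale = prepped$scale))
}

prepped    = prep_by_hand(train)
train_done = bake_by_hand(prepped, train)
test_done  = bake_by_hand(prepped, test)

# Training columns end up centered at zero; test columns generally do not,
# because they are shifted by the training means rather than their own
round(colMeans(train_done), 10)
colMeans(test_done)
```

The point of the design is that the test rows never influence the constants, which is the same reason we `prep()` on training data and only `bake()` the test set.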
---
# Recipes
## Getting the data out of the tidymodels sphere

Now, if we're just interested in the training data, we can actually get the transformed data using the `juice()` function. I'll show you that in a second. (Newer `recipes` releases prefer `bake(prepped_recipe, new_data = NULL)` for the same job.)

--

- Think of `juice()` as .hi[bake light]

--

.hi[Let's bring this all together]

---
# Tidymodel process

Just like with our model object, we can pass our recipe object to the workflow directly, in place of the formula we had before.

```r
#Classification tree?
ctree_mod = decision_tree() %>%
  set_engine('rpart') %>%
  set_args(min_n = 8) %>%
  set_mode('classification')

my_big_tuna = workflow() %>%
  add_recipe(hd_recipe) %>%
  add_model(ctree_mod)
```

---
# Tidymodel process

Now, we can just fit this data here! We need the true heart disease values as a factor (here taken straight from the training data; `juice()` would work too), and then we put both our predictions and the true values in a ROC curve.

```r
#Classification tree?
truths = as.factor(heart_data_tr$heart_disease)

hdpred = parsnip::fit(my_big_tuna, data = heart_data_tr) %>%
  predict(new_data = heart_data_tr, type = 'prob') %>%
  mutate(true_pred = truths, pred = as.numeric(.pred_1))
```

---
# Tidymodel process

```r
roc_curve(hdpred, truth = true_pred, estimate = pred) %>% autoplot()
```

<img src="004-slides_files/figure-html/unnamed-chunk-20-1.svg" width="80%" style="display: block; margin: auto;" />

---
# Tidymodel process

With all this in place though, where's the upside?

--

.hi[We can swap models, recipes, formulas, datasets and steps without rewriting code]

```r
#Random Forest?
rfmod = rand_forest() %>%
  set_engine('ranger') %>%
  set_args(min_n = 8, trees = 10000, mtry = 6) %>%
  set_mode('classification')

my_big_tuna %<>% update_model(rfmod)
```

---
# Tidymodel process

Now just refit/repredict and...
--

```r
#this time the truth comes from the juiced (preprocessed) training data
hdpred = parsnip::fit(my_big_tuna, data = heart_data_tr) %>%
  predict(new_data = heart_data_tr, type = 'prob') %>%
  mutate(true_pred = (prepped_recipe %>% juice())$heart_disease %>%
           as.numeric() %>% as.factor(),
         pred = as.numeric(.pred_1))
```

---
# Tidymodel process

```r
roc_curve(hdpred, truth = true_pred, estimate = pred) %>% autoplot()
```

<img src="004-slides_files/figure-html/unnamed-chunk-23-1.svg" width="80%" style="display: block; margin: auto;" />

---
# Tidymodel process

But we're missing one key ingredient...

--

.hi-orange[cross validation]

--

- You've likely noticed that everything I've done so far has been only .hi-blue[in sample]

--

- That's because to do cross validation, we need to know how to create splits - that's where `rsample` comes in.

---
class: inverse, middle
# Cross Validation in tidymodels

---

---
class: inverse, middle
# Pop Quiz!

---
# Cross Validation in Tidymodels

As you've learned already, .hi-orange[cross validation] is critical for good .hi-orange[oos estimates]

--

As you probably have also learned, doing cross validation by hand can be .hi[difficult].

--

```r
splitcount = 30
splits = list()

#rep_sample_n() comes from the infer package; each replicate is drawn
#independently, so these hand-made 'folds' can overlap
split = rep_sample_n(heart_data_tr,
                     size = nrow(heart_data_tr)/splitcount,
                     reps = splitcount)

for(i in 1:splitcount){
  splits[[i]] = filter(split, replicate == i)
}
```

---
# Cross Validation in Tidymodels

```r
forml = heart_disease ~ .

atrtestfunc = function(i){
  #train on every split except split i, then predict the held-out split
  train = bind_rows(splits[-i])
  mod = lm(data = train, formula = forml)
  pred = predict(mod, newdata = splits[[i]])
  return(pred)
}

list_of_preds = lapply(seq_along(splits), FUN = atrtestfunc)
```

--

And now we need to provide a grid to search over... there must be a better way...

---
# Cross Validation in Tidymodels

Just like in `caret`, there is a set of functions that will help us cross validate.
--

This comes from the package .hi[rsample]

<img src="images/rsample.jpeg" style="display: block; margin: auto;" />

---
# Cross Validation in Tidymodels

This feeds into the `tidymodels` process we've already begun to flesh out, and lets us do our cross validation for prediction however we prefer.

--

Another advantage of rsample:

--

- cross validation is .hi[memory intensive], which rsample avoids by creating objects called .hi-blue[rsplits]. These keep track of which rows belong to each split instead of copying the data for every fold.

--

- however, you don't need to work with these objects directly, because tidymodels will do that for you.

---
# Cross Validation in Tidymodels

Let's use our recipe from above and a .hi-orange[gradient boosted tree] to try and predict heart disease.

--

- A .hi-orange[gradient boosted tree] is essentially a sequence of small .hi-blue[decision trees], where each new tree is fit to the residuals left over by the trees before it.

--

```r
#Boosted tree?
gbtree_mod = boost_tree() %>%
  set_engine('xgboost') %>%
  set_args(min_n = 8) %>%
  set_mode('classification')

my_big_tuna = workflow() %>%
  add_recipe(hd_recipe) %>%
  add_model(gbtree_mod)
```

---
# Cross Validation in Tidymodels

But now, we need to create our splits

```r
cv_folds = vfold_cv(heart_data_tr, v = 10, strata = heart_disease)
```

--

We also have access to all of the other methods of cross validation you have with caret, for instance, .hi[monte carlo]

```r
mch = mc_cv(heart_data_tr, times = 25, strata = heart_disease)
```

You can see these are a new object type, which would make them a pain to learn, but .hi[we don't have to]

---
# Cross Validation in Tidymodels

Instead, we can use the `fit_resamples` function, and then use a new function to collect our results

```r
fit_resamples(my_big_tuna, resamples = cv_folds) %>% collect_metrics()
```

--

Unfortunately, .hi[this won't work.]

--

This is because XGBoost needs dummy variables rather than factors. Luckily, we were using tidymodels.

--

.hi[Pop-quiz:] How do we add a new step to our `recipe`?
---
# Cross Validation in Tidymodels

```r
xgrec = hd_recipe %>%
  step_dummy(all_nominal(), -all_outcomes())

my_big_tuna_xg = my_big_tuna %>%
  update_recipe(xgrec) %>%
  update_model(gbtree_mod)
```

--

```r
fit_resamples(my_big_tuna_xg, resamples = cv_folds) %>% collect_metrics()
```
| .metric  | .estimator | mean  | n  | std_err |
|:---------|:-----------|------:|---:|--------:|
| accuracy | binary     | 0.759 | 10 | 0.0235  |
| roc_auc  | binary     | 0.874 | 10 | 0.0194  |
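
For intuition about what a table like this summarizes, here is a rough base R sketch of 10-fold cross validation done by hand. It makes several assumptions that differ from the slides: it uses the built-in `mtcars` data, a plain `lm()` instead of a boosted tree, and RMSE instead of accuracy. Treating `std_err` as the standard deviation across folds divided by the square root of the number of folds is also an assumption about how the summary is computed.

```r
# Base R sketch of 10-fold cross validation on the built-in mtcars data.
set.seed(524)
k = 10
n = nrow(mtcars)

# Assign every row to exactly one of k folds at random
fold_id = sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_rmse = sapply(1:k, function(i) {
  analysis   = mtcars[fold_id != i, ]
  assessment = mtcars[fold_id == i, ]
  mod  = lm(mpg ~ wt + hp, data = analysis)
  pred = predict(mod, newdata = assessment)
  sqrt(mean((assessment$mpg - pred)^2))
})

# One row of a collect_metrics()-style summary
cv_summary = data.frame(
  .metric = "rmse",
  mean    = mean(fold_rmse),
  n       = k,
  std_err = sd(fold_rmse) / sqrt(k)
)
cv_summary
```

Because every row lands in exactly one fold, each observation is predicted once by a model that never saw it, which is what makes the averaged metric an out-of-sample estimate.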
---
# Crossvalidation in Tidymodels

Now, we have all the tools to compare out-of-sample performance across multiple models.

```r
model_rand_forest = rand_forest() %>%
  set_engine('ranger') %>%
  set_args(trees = 10000, min_n = 5, mtry = 4) %>%
  set_mode('classification')
```

```r
my_big_tuna_forest = my_big_tuna %>% update_model(model_rand_forest)

#boosted tree results again, for comparison
fit_resamples(my_big_tuna_xg, resamples = cv_folds) %>% collect_metrics()
```
| .metric  | .estimator | mean  | n  | std_err |
|:---------|:-----------|------:|---:|--------:|
| accuracy | binary     | 0.759 | 10 | 0.0235  |
| roc_auc  | binary     | 0.874 | 10 | 0.0194  |
---

```r
#and now the random forest
fit_resamples(my_big_tuna_forest, resamples = cv_folds) %>% collect_metrics()
```
| .metric  | .estimator | mean  | n  | std_err |
|:---------|:-----------|------:|---:|--------:|
| accuracy | binary     | 0.793 | 10 | 0.0233  |
| roc_auc  | binary     | 0.895 | 10 | 0.0182  |
---
# Crossvalidation in Tidymodels

.hi[Boosted Trees]

<img src="004-slides_files/figure-html/unnamed-chunk-36-1.svg" style="display: block; margin: auto;" />

---
# Crossvalidation in Tidymodels

.hi[Random Forest]

<img src="004-slides_files/figure-html/unnamed-chunk-37-1.svg" style="display: block; margin: auto;" />

---
# Crossvalidation in Tidymodels
## Using tune and yardstick

`tune` and `yardstick` work in almost perfect tandem to allow us to cross-validate models using whatever metrics we may want.

--

This goes beyond the `tuneGrid` argument in `caret` by allowing us to tune not only our models, but also our recipe/data cleaning process.

--

It also lets us choose HOW we want to define .hi[best] in terms of our model. Area under the curve for ROC?

--

We have it.

--

RMSE?

--

We have it. Specificity?

--

You get the idea

Let's see how we can "tune" our cleaning process

---
# Crossvalidation in Tidymodels
## Using tune for recipes

.hi-orange[Ok.]

--

I'm going to do something a little risky. Bear with me.

- How many of you know what PCA is?

--

- PCA stands for .hi-blue[Principal Component Analysis], and without going into exactly how it works...

--

that would require me to explain how eigenvectors and eigenvalues work. I'd really rather not...

--

- So, informally, PCA takes a group of variables and re-expresses the data along consecutive 'directions of highest variation.'

--

- In practice, this allows us to turn our existing variables into a smaller set of variables that summarize most of the variation in the larger group.

---
# Crossvalidation in Tidymodels
## Using tune for recipes

First, we are going to add a step that will collapse our numeric data into fewer variables that summarize our existing data.

```r
pca_rec = hd_recipe %>%
  step_pca(all_numeric(), num_comp = tune())

big_tuna_pca = my_big_tuna_forest %>%
  update_recipe(pca_rec)
```

Like in `caret`, we need to create a grid to search over.
--

We'll be searching over `num_comp`, which is the number of directions we want to summarize our data by. Let's look at 1-9.

--

The best part is this grid can search over both your model and your recipe. For now, let's just do the recipe.

```r
grd = expand_grid(num_comp = 1:9)
```

---
# Crossvalidation in Tidymodels
## Using tune for recipes

Now that we have a grid, we can search over it for the optimal combination of hyperparameters, and get some results.

--

We do this using the `tune_grid` function of the `tune` package. This lets us choose a metric to define how we want to select our best tune, and also whether we want to .hi-orange[maximize] or .hi-blue[minimize] our value

```r
pca_results = big_tuna_pca %>%
  tune_grid(resamples = cv_folds, grid = grd)

#note: newer tune releases drop `maximize` and infer the direction from the metric;
#there, use show_best(metric = "roc_auc", n = 3)
pca_results %>% show_best(metric = "roc_auc", maximize = T, 3)
```
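| num_comp | .metric | .estimator | mean  | n  | std_err |
|---------:|:--------|:-----------|------:|---:|--------:|
| 2        | roc_auc | binary     | 0.899 | 10 | 0.0175  |
| 3        | roc_auc | binary     | 0.894 | 10 | 0.0194  |
| 6        | roc_auc | binary     | 0.889 | 10 | 0.0212  |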
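
If `num_comp` still feels abstract, this base R sketch shows roughly what a PCA step does, using `prcomp()` on the built-in `mtcars` data (an assumption for illustration; `step_pca()` handles the details inside the recipe, and the column choice here is arbitrary).

```r
# Base R sketch of PCA: turn six correlated columns into a few summary columns.
num_vars = mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# Center and scale first so no single variable dominates because of its units
pca = prcomp(num_vars, center = TRUE, scale. = TRUE)

# Share of total variation captured by each direction, largest first
var_share = pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_share), 3)

# Keeping num_comp directions means keeping the first num_comp score columns
num_comp = 2
scores = pca$x[, 1:num_comp]
head(scores)
```

Tuning `num_comp` is a trade-off: fewer components mean a simpler input for the model, more components keep more of the original variation.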
---
# Crossvalidation in Tidymodels

.hi[Neat!] We added a tiny bit of performance to our model. Where we really will get gains though is by tuning our other hyperparameters.

---
# Crossvalidation in Tidymodels
## Using tune for models

So, now we want to edit our hyperparameters. We already set these in prior lectures, but let's remind ourselves of what parameters we can use for tree ensemble models.

--

- mtry: number of variables to be candidates for each split choice
- trees: number of trees
- min_n: minimum number of observations in each end-point
- tree_depth: max number of splits
- learn_rate: learning rate for each iteration .hi[(boosting)]
- loss_reduction: the reduction in the loss function required to perform a further split

---
# Crossvalidation in Tidymodels
## Using tune for models

The difference is we need to tell tune which parameters live in which functions, because there are several 'models' that can be tuned.

- We do that by passing `tune()` in the arguments

--

```r
model_rand_forest = rand_forest() %>%
  set_engine('ranger') %>%
  set_args(trees = 1500, min_n = tune(), mtry = tune()) %>%
  set_mode('classification')
```

--

```r
full_grd = expand_grid(mtry = 1:8, min_n = 8:25)
```

---
# Crossvalidation in Tidymodels
## Using tune for models

Now, we can do the exact same thing as before.

```r
big_tuna_full = my_big_tuna %>%
  update_model(model_rand_forest) %>%
  update_recipe(hd_recipe %>% step_pca(all_numeric(), num_comp = 6))

results = big_tuna_full %>%
  tune_grid(resamples = cv_folds,
            metrics = metric_set(roc_auc, f_meas),
            grid = full_grd)

results %>% show_best(metric = "f_meas", maximize = T, 3)
```
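| mtry | min_n | .metric | .estimator | mean  | n  | std_err |
|-----:|------:|:--------|:-----------|------:|---:|--------:|
| 1    | 14    | f_meas  | binary     | 0.86  | 10 | 0.0234  |
| 1    | 22    | f_meas  | binary     | 0.86  | 10 | 0.0234  |
| 1    | 10    | f_meas  | binary     | 0.856 | 10 | 0.0228  |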
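
It is worth knowing how much work a grid like this asks for before you run it. A quick base R sketch (using `expand.grid()` as a stand-in for `expand_grid()`, and assuming the 10 folds from `cv_folds`):

```r
# Count how many models the random forest grid search has to fit.
full_grd_base = expand.grid(mtry = 1:8, min_n = 8:25)
n_folds = 10

n_candidates = nrow(full_grd_base)    # 8 values of mtry x 18 values of min_n
n_fits       = n_candidates * n_folds # every candidate is refit on every fold

n_candidates
n_fits
```

That is 144 candidate settings and 1440 model fits, each with 1500 trees, so trimming the grid is usually the cheapest way to speed up tuning.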
---
# Crossvalidation in Tidymodels
## Using tune for models

And, for `xgboost`

```r
gbtree_mod = boost_tree() %>%
  set_engine('xgboost') %>%
  set_args(learn_rate = tune(), tree_depth = tune()) %>%
  set_mode('classification')

lrn = c(.15, .25, .3, .4, .45, .5)

full_grd = expand_grid(
  learn_rate = lrn,
  tree_depth = 2:9
)
```

---
# Crossvalidation in Tidymodels
## Using tune for models

And, for `xgboost`...

--

```r
results_xg = big_tuna_full %>%
  update_model(gbtree_mod) %>%
  update_recipe(hd_recipe %>% step_dummy(all_nominal(), -all_outcomes())) %>%
  tune_grid(resamples = cv_folds,
            metrics = metric_set(roc_auc, f_meas),
            grid = full_grd)

results_xg %>% show_best(metric = "f_meas", maximize = T, 3)
```
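| tree_depth | learn_rate | .metric | .estimator | mean  | n  | std_err |
|-----------:|-----------:|:--------|:-----------|------:|---:|--------:|
| 3          | 0.15       | f_meas  | binary     | 0.822 | 10 | 0.0167  |
| 2          | 0.4        | f_meas  | binary     | 0.822 | 10 | 0.0168  |
| 2          | 0.45       | f_meas  | binary     | 0.817 | 10 | 0.0193  |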
---
# Crossvalidation in Tidymodels
## Using tune for models

We can also get the .hi[best] model directly by using

### select_best()

```r
best_xg = results_xg %>% select_best(metric = 'f_meas')
best_xg
```
| tree_depth | learn_rate |
|-----------:|-----------:|
| 3          | 0.15       |
```r
best_rf = results %>% select_best(metric = 'f_meas')
best_rf
```
| mtry | min_n |
|-----:|------:|
| 1    | 14    |
---
# Crossvalidation in Tidymodels
## Finalizing workflow

We're almost there!

--

!!

--

!!

All we need to do now is fix our selected hyperparameters in place.

--

.hi-orange[Of course], we could just read them from our table and update the model directly, but that would take longer in compute time and work time.

--

Instead, we will use the last piece of .hi-blue[workflow] to finalize it, using...

--

`finalize_workflow`

Which does exactly what you think it will.

---
# Crossvalidation in Tidymodels
## Finalizing workflow

We can finalize our workflow by passing the .hi[best] objects defined a few slides ago to the function `finalize_workflow()`

```r
final_tuna_xg = big_tuna_full %>%
  update_model(gbtree_mod) %>%
  update_recipe(hd_recipe %>% step_dummy(all_nominal(), -all_outcomes())) %>%
  finalize_workflow(best_xg)
```

--

```r
final_tuna_rf = my_big_tuna %>%
  update_model(model_rand_forest) %>%
  update_recipe(hd_recipe) %>%
  finalize_workflow(best_rf)
```

---
class: inverse, middle
# The test set

---
# Testing

Ok. Now we're ready for the .hi-orange[big finale.]

--

Let's prep and get our data ready for submission on Kaggle

```r
#placeholder outcome so the test set has every column the recipe expects
heart_data_ts$heart_disease = -2

finfitrf = final_tuna_rf %>% parsnip::fit(data = heart_data_tr)
finfitxg = final_tuna_xg %>% parsnip::fit(data = heart_data_tr)

#name the random forest block `heart_disease` so its columns become
#heart_disease..pred_0 / heart_disease..pred_1, matching the training frame used later
out = data.frame(id = heart_data_ts$id,
                 heart_disease = predict(finfitrf, new_data = heart_data_ts, type = 'prob'),
                 predict(finfitxg, new_data = heart_data_ts, type = 'prob'))
```

Wait wait... what's this? Why am I doing multiple predictions?

--

We've got one model left for us. This is what's called a .hi[stacked regression], and it works by running a regression of y on the predicted values of y from other models.

--

These types of models .hi[frequently] win Kaggle competitions.
--

We're going to take our new predictions, and produce a third prediction based on them

---
# Grouped models

```r
#in-sample predictions from both models, plus the truth as a factor
ins = data.frame(id = heart_data_tr$id,
                 heart_disease = predict(finfitrf, new_data = heart_data_tr, type = 'prob'),
                 predict(finfitxg, new_data = heart_data_tr, type = 'prob'),
                 truth = heart_data_tr$heart_disease %>% as.factor())

final_logit = logistic_reg() %>% set_engine('glm')

#heart_disease..pred_1 is the random forest probability, .pred_1 the xgboost one
recipefin = recipe(truth ~ heart_disease..pred_1 + .pred_1, data = ins)

final_logit_wf = workflow() %>%
  add_recipe(recipefin) %>%
  add_model(final_logit)

fitfinlm = parsnip::fit(final_logit_wf, data = ins)
```

---
# Grouped models 2

```r
submit = data.frame(id = heart_data_ts$id,
                    predict(fitfinlm, new_data = out))
names(submit) = c('id', 'heart_disease')
```

---
# Submission

<img src="images/kag_private_leaderboard.png" style="display: block; margin: auto;" />