0.1 Use skimr to check out the data. We’re predicting penguin species in the palmerpenguins dataset. How many species do we have? Are they fairly balanced in the data? Why does it matter?
# Load packageslibrary(pacman)p_load(palmerpenguins, data.table, collapse, tidymodels, ggplot2)# Load the penguin datadata('penguins')# Counts of penguin speciespenguins$species |>qtab()
Answer There are three penguin species in the dataset, and they are not very equally distributed. The imbalance of our outcome matters because models can achieve high levels of accuracy by only learning one class when the classes are vey imbalanced.
0.2 What is imputation?
Answer Imputation basically means filling in missing values.
0.3 Hopefully you noticed we have some missing data. Let’s impute. By hand. Since we’re doing it by hand, let’s start simply. For categorical variables, you can use the modal class. For numeric, let’s use the median.
Answer
# Make a copy of the dataset for these imputed valuespenguins_copy = penguins# Find the mode of 'sex'mode_index = penguins_copy$sex |>table() |>which.max()mode_sex = mode_index |>names()# Find the penguins with NA for 'sex'na_sex = penguins_copy$sex |>is.na() |>which()penguins_copy$sex[na_sex] = mode_sex
Note: It would have been easier to just use fmode() from the collapse package. But now you have options.
0.4 Now let’s repeat using tidymodels. Make a recipe and then prep and bake it. Check out the objects!
0.5 How could we be more sophisticated/clever about imputation? Why would it matter?
Answer Our median/mode-based imputation ignores all of the information that we have on penguins that have missing values. Just as we want to use covariates to predict the outcome, we probably could do a better job predicting the missing values by using the non-missing covariates.
Part 1: A short tree
Now we’ll grow a (short) tree. By hand. In this section, we’re going to focus on the mechanics of growing a tree. We will ignore CV for now.
1.1 For the moment, let’s focus on the possible splits of the island variable. There are three islands. How many possible splits are there?
Answer There are three possible splits (for three islands):
{Biscoe} v. {Dream, Torgersen}
{Dream} v. {Biscoe, Torgersen}
{Torgersen} v. {Biscoe, Dream}
1.2 Try each split of island and then calculate the split’s accuracy and gini.
Answer
# Find the islandsislands = basic_bake$island |>unique()# Iterating over islandslapply(X = islands, FUN =function(i) {# Split into Biscoe and not Biscoe grp1 = basic_bake |>filter(island == i) grp2 = basic_bake |>filter(island != i)# Find the modal species in each group species1 = grp1$species |>fmode() species2 = grp2$species |>fmode()# Calculate accuracy acc1 =fmean(grp1$species == species1) acc2 =fmean(grp2$species == species2)# Weighted accuracy acc = (acc1 *nrow(grp1) + acc2 *nrow(grp2)) / (nrow(grp1) +nrow(grp2))# Calculate gini g1 = grp1$species |>table() |>prop.table() g2 = grp2$species |>table() |>prop.table() gini1 =sum(g1 * (1- g1)) gini2 =sum(g2 * (1- g2))# Weighted gini (using different approach) gini =fmean(c(gini1, gini2), w =c(nrow(grp1), nrow(grp2)))# Return a data tabledata.table(split =paste0( i, ' v. ','{', paste(setdiff(islands, i), collapse =', '), '}' ),accuracy = acc,gini = gini )}) |>rbindlist()
split accuracy gini
<char> <num> <num>
1: Torgersen v. {Biscoe, Dream} 0.5116279 0.5501752
2: Biscoe v. {Torgersen, Dream} 0.6744186 0.4314155
3: Dream v. {Torgersen, Biscoe} 0.5581395 0.4931324
1.3 Repeat 1.2 but for the sex variable.
Answer
# Find the islandssexes = basic_bake$sex |>unique()# NOTE There is only one possible split# Iterating over islandslapply(X = sexes[1], FUN =function(i) {# Split into Biscoe and not Biscoe grp1 = basic_bake |>filter(sex == i) grp2 = basic_bake |>filter(sex != i)# Find the modal species in each group species1 = grp1$species |>fmode() species2 = grp2$species |>fmode()# Calculate accuracy acc1 =fmean(grp1$species == species1) acc2 =fmean(grp2$species == species2)# Weighted accuracy acc = (acc1 *nrow(grp1) + acc2 *nrow(grp2)) / (nrow(grp1) +nrow(grp2))# Calculate gini g1 = grp1$species |>table() |>prop.table() g2 = grp2$species |>table() |>prop.table() gini1 =sum(g1 * (1- g1)) gini2 =sum(g2 * (1- g2))# Weighted gini (using different approach) gini =fmean(c(gini1, gini2), w =c(nrow(grp1), nrow(grp2)))# Return a data tabledata.table(split =paste0( i, ' v. ', setdiff(sexes, i) ),accuracy = acc,gini = gini )}) |>rbindlist()
split accuracy gini
<char> <num> <num>
1: male v. female 0.4418605 0.6356101
1.4 Which variable and split would you make? Why?
Answer I would choose the Biscoe vs. {Torgersen, Dream} island-based split due to its low gini.
1.5 We would typically want to consider all possible splits for the numeric variables as well. Explain (or code) an approach that would try all of the splits for bill_length_mm.
Answer I would use the following steps:
Find all of the unique numeric values.
Sort these values smallest to largest.
Calculate the midpoint between each pair of consecutive values.
We can use these midpoints for the splits.
Part 2: A bigger tree
Now that you know how to grow a tree, we can just let tidymodels do it for you. However, we now want to (1) add cross validation and (2) bring our imputation step into the workflow that we’re validating. Stick with mode and median imputation.
2.1 Use tidymodels to grow a tree (actually multiple trees). Remember: We can (need to) tune the cost complexity (cost_complexity) to try to restrict the tree from overfitting.
Answer We already have a basic recipe (basic_recipe), so now we need a model, a workflow, and then we will be ready to tune.
Start with the model…
# Define the modelbasic_tree =decision_tree(mode ='classification',engine ='rpart',cost_complexity =tune(),tree_depth =30,min_n =3)
Now the workflow…
# Define the splitsset.seed(123)split = penguins |>vfold_cv(v =5)# Define the modelwf_tree =workflow() |>add_model(basic_tree) |>add_recipe(basic_recipe)
It doesn’t actually look like we want to enforce too much cost complexity.
Part 3: More trees and more randomness
The first “more trees” option that we saw in class was bagging—bootstrap aggregation—where we boostrap our training data and then train a big tree on each of the bootstrapped datasets. Random forests repeat this idea, but they also only let the tree see a random subset of variables at each split.
3.1 Briefly explain how you could build upon your work in Part 1 to grow a random forest.
Answer We would need to do two things:
Create a bunch of bootstrapped samples.
Change our algorithm to only consider a random subset of features for each split.
3.2 Now use tidymodels to tune (with CV) a random forest. Make sure you tune the relevant parameters (what are they?).
Answer The most relevant hyperparameter is the number of variables considered in a given split (mtry). We probably also want to tune the minimum leaf size (min_n).
3.5 Explain why it might be important to include imputation in cross validation (rather than finishing all of the data cleaning before you tune/train/validate).
Answer Imputation (and other parts of the recipe) are all part of the learning process. You want to capture how accurate (and how uncertain/variable) your whole learning process is—not just one part of it.