Lecture 20

class: center, middle, inverse, title-slide

.title[
# Lecture 20
]
.subtitle[
## Using <code>tidymodels</code> to train ML models
]
.author[
### Tyler Ransom
]
.date[
### ECON 5253, University of Oklahoma
]

---

# Plan for the day

- Review topics from last class (bias, variance, regularization)

- Practice training and tuning ML models using R's `tidymodels` package

---
# Quick review

- The main goal of machine learning is to maximize out-of-sample prediction accuracy for the target variable `$y$`

- `$y$` can be continuous or categorical

- We measure prediction accuracy differently depending on whether `$y$` is continuous or categorical

- A model is .hi[overfit] or has .hi[high variance] if it predicts well in-sample but poorly out-of-sample

- A model is .hi[underfit] or has .hi[high bias] if it predicts poorly both in- and out-of-sample

- We optimally trade off bias and variance with regularization
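
As a reminder of what "measuring accuracy differently" looks like, here is a minimal base R sketch. All numbers are made up for illustration: RMSE is one common choice for a continuous target, and accuracy is one common choice for a categorical one.

```r
# Continuous target: root mean squared error (RMSE)
y_cont    <- c(3.0, 5.0, 2.5, 7.0)   # made-up true values
yhat_cont <- c(2.5, 5.0, 4.0, 8.0)   # made-up predictions
rmse_value <- sqrt(mean((y_cont - yhat_cont)^2))

# Categorical target: accuracy (share of correct predictions)
y_cat    <- c("yes", "no", "no",  "yes", "no")   # made-up true labels
yhat_cat <- c("yes", "no", "yes", "yes", "no")   # made-up predicted labels
accuracy_value <- mean(y_cat == yhat_cat)

print(c(rmse = rmse_value, accuracy = accuracy_value))
```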

---
# Regularization

- Regularization is a way of penalizing model complexity

- For models where `$y$` is continuous, we can use L1 or L2 regularization

- e.g. with linear regression, we penalize the objective function:

`\begin{align*}
\min_\beta (y-X\beta)'(y-X\beta) + \lambda \sum_k \vert\beta_k\vert & &(\text{L1})\\
\min_\beta (y-X\beta)'(y-X\beta) + \lambda \sum_k (\beta_k)^2       & &(\text{L2})
\end{align*}`

- We obtain estimates of the `$\beta$`'s in the .hi[training] phase

- We obtain the best value of `$\lambda$` in the .hi[validation] phase
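
To see the penalty at work, the L2 problem above has a closed-form solution, `$(X'X + \lambda I)^{-1}X'y$`. The sketch below uses simulated data (not the course data) and a hypothetical helper `ridge_beta()`. The L1 problem has no closed form in general, so it is solved numerically.

```r
set.seed(1)
n <- 100
X <- cbind(rnorm(n), rnorm(n), rnorm(n))   # simulated regressors, no intercept
beta_true <- c(2, -1, 0)                   # made-up "true" coefficients
y <- drop(X %*% beta_true + rnorm(n))

# Hypothetical helper: L2-penalized estimator (X'X + lambda * I)^{-1} X'y
ridge_beta <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

beta_ols   <- ridge_beta(X, y, lambda = 0)    # lambda = 0 reproduces OLS
beta_ridge <- ridge_beta(X, y, lambda = 50)   # larger lambda shrinks toward zero

print(rbind(ols = beta_ols, ridge = beta_ridge))
```

Here `$\lambda = 50$` is an arbitrary value chosen by hand; picking it well is the job of the validation phase.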

---
# k-fold cross-validation
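Take the 80% of your data that isn't the test set and randomly split it into `$k$` folds of roughly equal size. Each fold serves once as the validation set while the model is trained on the other `$k-1$` folds (with `$k=5$`, every round is an 80/20 training/validation split). Typically `$k$` is between 3 and 10. (See graphic below)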

.center[
<img src="k-foldCV.png" width="642" />
]
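
The fold bookkeeping can be sketched in a few lines of base R. This is only an illustration with a pretend training set of 20 rows; `tidymodels` handles this step for you.

```r
set.seed(1)
n <- 20   # pretend the training set has 20 rows
k <- 5    # number of folds

# Randomly assign each row to exactly one of k folds of equal size
fold_id <- sample(rep(1:k, length.out = n))

for (fold in 1:k) {
  validation_rows <- which(fold_id == fold)   # held out in this round
  training_rows   <- which(fold_id != fold)   # used to fit the model
  cat("Fold", fold, ": train on", length(training_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

With 20 rows and 5 folds, every round trains on 16 rows and validates on 4, and each row is used for validation exactly once.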

---
# How to do ML in R

Two well-built R packages will do k-fold cross-validation for you and work with many commonly used prediction algorithms:

1. `tidymodels`

2. `mlr3`

Both packages are well built and do many of the same things; we'll be using `tidymodels` for the rest of this unit of the course

In Python, [`scikit-learn`](https://scikit-learn.org/stable/) is the workhorse machine learning frontend

In Julia, [`MLJ.jl`](https://alan-turing-institute.github.io/MLJ.jl/dev/) is the workhorse machine learning frontend

---
# `tidymodels`

- Like `tidyverse`, `tidymodels` is a collection of packages:

    - [`rsample`](https://rsample.tidymodels.org/): for data splitting
    
    - [`recipes`](https://recipes.tidymodels.org/index.html): for pre-processing
    
    - [`parsnip`](https://www.tidyverse.org/blog/2018/11/parsnip-0-0-1/): for model building 
    
    - [`dials`](https://github.com/tidymodels/dials) and [`tune`](https://github.com/tidymodels/tune): parameter tuning 
    
    - [`yardstick`](https://github.com/tidymodels/yardstick): for model evaluations 
    
    - [`workflows`](https://github.com/tidymodels/workflows): for bundling pre-processing, modeling, and post-processing requests into a single pipeline

---
# How to use `tidymodels`

You need to tell `tidymodels` the following information before it will run:

1. Training data (which will end up being split `$k$` ways when doing cross-validation)
2. Testing data
3. The task at hand (e.g. regression or classification)
4. The validation scheme (e.g. 3-fold, 5-fold, or 10-fold CV)
5. The method used to tune the parameters of the algorithm (e.g. `$\lambda$` for LASSO)
6. The prediction algorithm (e.g. "decision tree")
7. The parameters of the prediction algorithm that need to be cross-validated (e.g. tree depth, `$\lambda$`, etc.)

For a list of prediction algorithms available in `caret` (the predecessor of `tidymodels`, which wraps many of the same algorithms), see [here](https://topepo.github.io/caret/available-models.html)

---
# Example datasets

- UC Irvine hosts a number of ML databases

- These are similar to R datasets `iris`, `mtcars`, etc.

- But they are used in ML applications

- Today and in the problem set, you'll use the `housing` data

---
# Packages you'll need to install

- `tidymodels`

- This should install all of the sub-packages (`parsnip`, `recipes`, `rsample`, `tune`, `workflows`, `yardstick`)

- But in case it doesn't, you'll need to install those as well

- Also install `glmnet` for LASSO, ridge regression and elastic net

---
# Simple example
Let's start by doing linear regression on a well-known dataset: the Boston housing data from UC Irvine (complete R script [here](https://github.com/tyleransom/DScourseS22/blob/master/LectureNotes/20-tidymodels/tidymodels_example.R))

First step: read in the data and assign column names

.scroll-box-10[
```r
library(tidyverse)
library(tidymodels)
library(magrittr)

housing <- read_table("http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", col_names = FALSE)
names(housing) <- c("crim","zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","b","lstat","medv")

# From UC Irvine's website (http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names)
#    1. CRIM      per capita crime rate by town
#    2. ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
#    3. INDUS     proportion of non-retail business acres per town
#    4. CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
#    5. NOX       nitric oxides concentration (parts per 10 million)
#    6. RM        average number of rooms per dwelling
#    7. AGE       proportion of owner-occupied units built prior to 1940
#    8. DIS       weighted distances to five Boston employment centres
#    9. RAD       index of accessibility to radial highways
#    10. TAX      full-value property-tax rate per $10,000
#    11. PTRATIO  pupil-teacher ratio by town
#    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
#    13. LSTAT    lower status of the population
#    14. MEDV     Median value of owner-occupied homes in $1000's
```
]

---
# Divide data into training and test sets

- Before we do anything else, we should demarcate the training and test sets

- We can use `rsample` for this

```r
set.seed(123456) # fix the seed (any value works) so the random split is reproducible
housing_split <- initial_split(housing, prop = 0.8)
housing_train <- training(housing_split)
housing_test  <- testing(housing_split)
```

---
# Data pre-processing with `recipes`

- Before we do any modeling, we might need to pre-process the data

- We talked about this earlier in the semester

- Some common pre-processing steps:

    - log transformation
    
    - convert categorical variables to `factor`s
    
    - create interaction terms
    
    - data imputation (should we use mean/mode imputation?)
    
    - ...

---
# Using `recipes` on the UCI `housing` data

```r
housing_recipe <- recipe(medv ~ ., data = housing_train) %>%
                  # convert outcome variable to logs
                  step_log(all_outcomes()) %>%
                  # convert 0/1 chas to a factor
                  step_bin2factor(chas) %>%
                  # create interaction term between crime and nox
                  step_interact(terms = ~ crim:nox) %>%
                  # add degree-2 orthogonal polynomial terms for dis and nox
                  step_poly(dis,nox) %>%
                  prep()
```

---
# `bake()` your `recipe()`

- With your recipe created and prepped, now you can apply it to the training and test sets

- Two steps:

1. Apply your "prepped" recipe to the training data with `juice()`
2. Apply your "prepped" recipe to the testing data with `bake()`

```r
housing_train_prepped <- housing_recipe %>% juice
housing_test_prepped  <- housing_recipe %>% bake(new_data = housing_test)
```
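
The key point is that `prep()` learns any needed quantities from the training data only, and `bake()` reuses them on new data. A small sketch using `mtcars` as stand-in data (the `toy_` names and the row split are hypothetical, and `step_normalize()` is used only because its effect is easy to check). It also shows that `bake(new_data = NULL)`, which the `recipes` documentation now recommends over `juice()`, returns the same prepped training data.

```r
library(tidymodels)

# Stand-in data: first 24 rows of mtcars as "training", the rest as "test"
toy_train <- mtcars[1:24, ]
toy_test  <- mtcars[25:32, ]

toy_recipe <- recipe(mpg ~ hp + wt, data = toy_train) %>%
  step_normalize(all_predictors()) %>%   # center/scale with training means and sds
  prep()

toy_train_prepped <- bake(toy_recipe, new_data = NULL)      # same result as juice()
toy_test_prepped  <- bake(toy_recipe, new_data = toy_test)  # reuses training means and sds

# Training predictors have mean 0 by construction; test predictors generally do not
colMeans(toy_train_prepped[, c("hp", "wt")])
colMeans(toy_test_prepped[, c("hp", "wt")])
```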

---
# Now train the model with OLS

```r
housing_train_x <- housing_train_prepped %>% select(-medv)
housing_test_x  <- housing_test_prepped  %>% select(-medv)
housing_train_y <- housing_train_prepped %>% select( medv)
housing_test_y  <- housing_test_prepped  %>% select( medv)

# Fit the regression model
est.ols <- lm(housing_train_y$medv ~ ., data = housing_train_x)
# Predict outcome for the test data
ols_predicted <- predict(est.ols, newdata = housing_test_x)
# Root mean-squared error
sqrt(mean((housing_test_y$medv - ols_predicted)^2))
```

---
# OLS the easy way

- The previous slide was a lot of work to do something simple!

- We could have more easily done it this way:

```r
# easy way
est.ols.easy <- lm(log(medv) ~ crim + zn + indus + as.factor(chas) + 
                     rm + age + rad + tax + ptratio + b + 
                     lstat + crim:nox + poly(dis,2) + poly(nox,2), 
                   data = housing_train)
# Predict outcome for the test data
ols_easy_predicted <- predict(est.ols.easy, newdata = housing_test)
# Root mean-squared error
sqrt(mean((housing_test_y$medv - ols_easy_predicted)^2))
```

- But doing it the "roundabout" way will pay dividends, because the computer can then automatically run the many different regressions we need

---
# Using `parsnip` to train models

- `parsnip` is a package for training models

- You specify the following up-front and it takes care of the rest:

1. Specify a model (e.g. "linear regression" or "decision tree")

2. Specify an engine (e.g. "lm" or "glmnet", etc.)

3. Specify a mode (e.g. "regression" or "classification")

---
# OLS training using `parsnip`

```r
ols_spec <- linear_reg() %>%       # Specify a model
  set_engine("lm") %>%   # Specify an engine: lm, glmnet, stan, keras, spark
  set_mode("regression") # Declare a mode: regression or classification

ols_fit <- ols_spec %>%
          fit(medv ~ ., data=juice(housing_recipe))

# inspect coefficients
tidy(ols_fit) %>% print
tidy(est.ols) %>% print
```
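
Because the specification is separate from the fitting call, swapping algorithms should only require changing the spec. A minimal sketch, assuming the `rpart` package is installed and using `mtcars` as stand-in data (the `tree_example` names are hypothetical):

```r
library(tidymodels)

# Same three-part pattern as OLS, but with a different model and engine
tree_example <- decision_tree() %>%   # Specify a model
  set_engine("rpart") %>%             # Specify an engine
  set_mode("regression")              # Declare a mode

# The fitting and prediction calls look the same as before
tree_example_fit <- tree_example %>%
  fit(mpg ~ hp + wt, data = mtcars)

tree_preds <- predict(tree_example_fit, new_data = head(mtcars, 3))
print(tree_preds)
```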

---
# Measuring model performance with `yardstick`

- `yardstick` allows the user to define performance metrics

- e.g. `rmse`, `mae`, `rsq`
    
    - or `accuracy`, `precision`, `recall` for classification tasks

```r
# compute RMSE in sample
ols_fit %>% predict(housing_train_prepped) %>%
            mutate(truth = housing_train_prepped$medv) %>%
            rmse(truth,`.pred`) %>%
            print

# compute RMSE out of sample
ols_fit %>% predict(housing_test_prepped) %>%
            mutate(truth = housing_test_prepped$medv) %>%
            rmse(truth,`.pred`) %>%
            print
```
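
If you want several metrics at once, `yardstick` provides `metric_set()` to bundle them. A small sketch on made-up numbers (the `toy_preds` tibble is hypothetical, not the housing results):

```r
library(tidymodels)

toy_preds <- tibble(
  truth = c(1.0, 2.0, 3.0, 4.0),   # made-up true values
  .pred = c(1.5, 2.0, 2.5, 5.0)    # made-up predictions
)

# Bundle several regression metrics into one function
reg_metrics <- metric_set(rmse, mae, rsq)

toy_results <- reg_metrics(toy_preds, truth = truth, estimate = .pred)
print(toy_results)
```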

---
# Now with LASSO instead of OLS

- We've already got our training and test data from `recipes`

- Now we want to train LASSO with `parsnip`

- We also want to use `tune` to get the optimal value of `$\lambda$`

- To do so, we adjust the inputs to the `linear_reg()` function of `parsnip`

---
# Setting up the LASSO task

```r
tune_spec <- linear_reg(
  penalty = tune(), # tuning parameter
  mixture = 1       # 1 = lasso, 0 = ridge
) %>% 
  set_engine("glmnet") %>%
  set_mode("regression")

# define a grid over which to try different values of lambda
lambda_grid <- grid_regular(penalty(), levels = 50)

# 10-fold cross-validation
rec_folds <- vfold_cv(housing_train, v = 10) # resample the raw training data: the workflow formula applies log(medv) itself
```
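
It helps to know what the grid looks like. Assuming the default `penalty()` range in `dials` is `c(-10, 0)` in log10 units (check `penalty()` in your installed version), a regular grid with 50 levels should be equivalent to this base R sketch:

```r
# Assumption: penalty() defaults to the range 10^-10 to 10^0 on a log10 scale
lambda_by_hand <- 10^seq(-10, 0, length.out = 50)

range(lambda_by_hand)     # smallest and largest candidate lambda
head(lambda_by_hand, 3)   # values are evenly spaced in logs, not in levels
```

Most of these candidates are tiny, so the grid spends many points on models that are nearly OLS; a narrower range such as `penalty(range = c(-4, 0))` is one possible alternative.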

---
# Create a workflow to do k-fold CV

```r
# Workflow
rec_wf <- workflow() %>%
  add_formula(log(medv) ~ .) %>%
  add_model(tune_spec) #%>%
  #add_recipe(housing_recipe)

# Tuning results
rec_res <- rec_wf %>%
  tune_grid(
    resamples = rec_folds,
    grid = lambda_grid
  )
```

---
# Optimal `$\lambda$`

```r
top_rmse  <- show_best(rec_res, metric = "rmse")
best_rmse <- select_best(rec_res, metric = "rmse")
```
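
Conceptually, choosing the optimal `$\lambda$` for RMSE amounts to picking the candidate with the lowest average validation error across folds. A base R illustration with made-up numbers (the `cv_summary` table is hypothetical, not output from the housing data):

```r
# Made-up cross-validation summary: one row per candidate lambda
cv_summary <- data.frame(
  penalty   = c(1e-4, 1e-3, 1e-2, 1e-1, 1),
  mean_rmse = c(0.210, 0.205, 0.198, 0.240, 0.410)   # average RMSE across folds
)

# Keep the row with the smallest average validation RMSE
best_row <- cv_summary[which.min(cv_summary$mean_rmse), ]
print(best_row)
```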

---
# Optimally cross-validated model

```r
# Now train with tuned lambda
final_lasso <- finalize_workflow(rec_wf, best_rmse)

# Print out results in test set
last_fit(final_lasso, split = housing_split) %>%
         collect_metrics() %>% print

# show best RMSE
top_rmse %>% print(n = 1)
```