04 - A Typical (Supervised) ML Workflow

class: center, middle, inverse, title-slide

# 04 - A Typical (Supervised) ML Workflow
## ml4econ, HUJI 2021
### Itamar Caspi
### April 4, 2021 (updated: 2021-04-01)

---

# Packages and setup

Use the [pacman](https://cran.r-project.org/web/packages/pacman/vignettes/Introduction_to_pacman.html) package that automatically loads and installs packages:

```r
if (!require("pacman")) install.packages("pacman")

pacman::p_load(
  tidyverse,   # for data wrangling and visualization
  tidymodels,  # for data modeling
  GGally,      # for pairs plot
  skimr,       # for summary statistics
  here         # for referencing folders and files
  )
```

Set a theme for `ggplot` (Relevant only for the presentation)

```r
theme_set(theme_grey(20))
```

---
# The `tidymodels` package

.pull-left[
<img src="figs/tidymodels.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[

>"[`tidymodels`](https://github.com/tidymodels/tidymodels) is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse."

]

---
# Supervised ML Workflow

Step 1: [Define the Prediton Task](#background)

Step 2: [Explore the Data](#eda)

Step 3: [Set Model and Tuning Parameters](#model)

Step 4: [Cross-validation](#cv)

Step 5: [Evaluate the Model](#eval)

---
class: title-slide-section-blue, center, middle
name: background

# Step 1: Define the Prediction Task

---

# Predicting Boston Housing Prices

.pull-left[
We will use the `BostonHousing`: housing data for 506 census tracts of Boston from the 1970 census (Harrison and Rubinfeld, 1978).

- `medv` (target): median value of owner-occupied homes in USD 1000's.
- `lstat`(predictor): percentage of lower status of the
population.
- `chas` (predictor): Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

__OBJECTIVE:__ Predict `medv`.
]
.pull-right[
<img src="figs/boston.jpg" width="1000%" style="display: block; margin: auto;" />
Source: [https://www.bostonusa.com/](https://www.bostonusa.com/)
]

---
# Load the Data

Load the data

```r
boston_raw <- read_csv(here("04-ml-workflow/data","BostonHousing.csv"))
```

```
## Parsed with column specification:
## cols(
##   crim = col_double(),
##   zn = col_double(),
##   indus = col_double(),
##   chas = col_double(),
##   nox = col_double(),
##   rm = col_double(),
##   age = col_double(),
##   dis = col_double(),
##   rad = col_double(),
##   tax = col_double(),
##   ptratio = col_double(),
##   b = col_double(),
##   lstat = col_double(),
##   medv = col_double()
## )
```

---
# What Type of Data?

We can use the `glimpse()` function in order to better understand the data structure:

```r
glimpse(boston_raw)
```

```
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.8...
## $ chas    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524, ...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, ...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 94...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505,...
## $ rad     <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 311...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15....
## $ b       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90,...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10,...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15....
```

The `chas` variable is mostly zero `$\Rightarrow$` should be a factor.

---
# Initial Data Filtering

Select `medv` and `lstat`

```r
boston <- boston_raw %>% 
  as_tibble() %>% 
  select(medv, lstat, chas) %>% 
  mutate(chas = as_factor(chas))

head(boston)
```

```
## # A tibble: 6 x 3
##    medv lstat chas 
##   <dbl> <dbl> <fct>
## 1  24    4.98 0    
## 2  21.6  9.14 0    
## 3  34.7  4.03 0    
## 4  33.4  2.94 0    
## 5  36.2  5.33 0    
## 6  28.7  5.21 0
```

---
class: title-slide-section-blue, center, middle
name: split

# Step 2: Split the Data

---
# Initial Split

We will use the `initial_split()`, `training()` and `testing()` functions from the [rsample](https://tidymodels.github.io/rsample/) package to perform an initial train-test split

Set seed for reproducibility

```r
set.seed(1203) 
```

Initial split:

```r
boston_split <- boston %>% 
  initial_split(prop = 2/3, strata = medv)

boston_split
```

```
## <Training/Validation/Total>
## <338/168/506>
```

---
# Prepare Training and Test Sets

```r
boston_train_raw <- training(boston_split)
boston_test_raw  <- testing(boston_split)

head(boston_train_raw, 5)
```

```
## # A tibble: 5 x 3
##    medv lstat chas 
##   <dbl> <dbl> <fct>
## 1  24    4.98 0    
## 2  34.7  4.03 0    
## 3  33.4  2.94 0    
## 4  36.2  5.33 0    
## 5  28.7  5.21 0
```

```r
head(boston_test_raw, 5)
```

```
## # A tibble: 5 x 3
##    medv lstat chas 
##   <dbl> <dbl> <fct>
## 1  21.6  9.14 0    
## 2  16.5 29.9  0    
## 3  18.9 17.1  0    
## 4  18.9 13.3  0    
## 5  20.4  8.26 0
```

---
class: title-slide-section-blue, center, middle
name: eda

# Step 3: Explore the Data

---
# Summary Statistics Using `skimr`

```r
boston_train_raw %>% 
  skim()
```

(Does not come out well on these slides)

---
# Pairs Plot Using `GGally`

.pull-left[

We now use a __pairs plot__ which compactly plots every variable in a dataset against every other one.

```r
boston_train_raw %>% ggpairs()
```
]
.pull-right[
![](04-ml-workflow_files/figure-html/unnamed-chunk-1-1..svg)
]
 
---
# Select a Model

.pull-left[

We choose the class of polynomial models:

`$$medv_i = \beta_0 + \sum_{j=1}^{\lambda}\beta_j lstat_i^j+\varepsilon_i$$`

```r
boston_train_raw %>% ggplot(aes(lstat, medv)) +
  geom_point() +
  geom_smooth(
    method = lm,
    formula = y ~ poly(x,1),
    se = FALSE,
    color = "blue"
  ) +
  geom_smooth(
    method = lm,
    formula = y ~ poly(x,10),
    se = FALSE,
    color = "red"
  )
```
]

.pull-right[
In blueת `$\lambda=1$`; in red, `$\lambda = 10$`.
![](04-ml-workflow_files/figure-html/unnamed-chunk-2-1..svg)
]
 
 
---
class: title-slide-section-blue, center, middle
name: model

# Step 4: Set Model and Tuning Parameters

---

# Data Preprocessing using `recipes`

The `recipes` package is a great tool for data preprocessing that fits in naturally with the tidy approach to ML.

```r
boston_rec <- 
  recipe(medv ~ lstat + chas, data = boston_train_raw) %>% 
  step_poly(lstat, degree = tune("lambda")) %>% 
  step_dummy(chas)

boston_rec
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          2
## 
## Operations:
## 
## Orthogonal polynomials on lstat
## Dummy variables from chas
```

---
# Set a Grid for `$\lambda$`

What are our tuning parameters?

```r
boston_rec %>% parameters()
```

```
## Collection of 1 parameters for tuning
## 
##      id parameter type object class
##  lambda         degree    nparam[+]
```

We need to tune the polynomial degree parameter `$(\lambda)$` when building our models on the train data. In this example, we will set the range between 1 and 8:

```r
lambda_grid <- expand_grid("lambda" = 1:8)
```

---
# Define the Model

We will use the linear regression model

```r
lm_mod <- linear_reg()%>%
  set_engine("lm")

lm_mod
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
Note that there are no tuning parameters here.

---
class: title-slide-section-blue, center, middle
name: cv

# Step 5: Cross-validation

---
# Split the Training Set to 5-folds

We will use the `vfold-cv()` function from the [rsample](https://tidymodels.github.io/rsample/) package to split the training set to 5-folds:

```r
cv_splits <- boston_train_raw %>% 
  vfold_cv(v = 5)
  
cv_splits
```

```
## #  5-fold cross-validation 
## # A tibble: 5 x 2
##   splits           id   
##   <named list>     <chr>
## 1 <split [270/68]> Fold1
## 2 <split [270/68]> Fold2
## 3 <split [270/68]> Fold3
## 4 <split [271/67]> Fold4
## 5 <split [271/67]> Fold5
```

---
# Estimate CV-RMSE Over the `$\lambda$` Grid

We now estimate the CV-RMSE for each value of `$\lambda$`.

```r
boston_results <- tune_grid(
  boston_rec,
  model     = lm_mod,
  resamples = cv_splits,
  grid      = lambda_grid
)

boston_results
```

```
## #  5-fold cross-validation 
## # A tibble: 5 x 4
##   splits           id    .metrics          .notes          
##   <named list>     <chr> <list>            <list>          
## 1 <split [270/68]> Fold1 <tibble [16 x 4]> <tibble [0 x 1]>
## 2 <split [270/68]> Fold2 <tibble [16 x 4]> <tibble [0 x 1]>
## 3 <split [270/68]> Fold3 <tibble [16 x 4]> <tibble [0 x 1]>
## 4 <split [271/67]> Fold4 <tibble [16 x 4]> <tibble [0 x 1]>
## 5 <split [271/67]> Fold5 <tibble [16 x 4]> <tibble [0 x 1]>
```

---
# Find the Optimal `$\lambda$`

Let's find the top-3 performing models

```r
 boston_results %>% 
  show_best(metric = "rmse", n = 3, maximize = FALSE)
```

```
## # A tibble: 3 x 6
##   lambda .metric .estimator  mean     n std_err
##    <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1      4 rmse    standard    4.79     5   0.260
## 2      5 rmse    standard    4.81     5   0.273
## 3      6 rmse    standard    4.82     5   0.268
```

<midd-blockquote> _"[I]n reality there is rarely if ever a true underlying model, and even if there was a true underlying model, selecting that model will not necessarily give the best forecasts..."_ .right[&mdash; [__Rob J. Hyndman__](https://robjhyndman.com/hyndsight/crossvalidation/)] </midd-blockquote>

---
# And Now Using a Graph

.pull-left[

```r
boston_results %>% 
  autoplot()
```
]
.pull-right[
![](04-ml-workflow_files/figure-html/unnamed-chunk-3-1..svg)

]

---
class: title-slide-section-blue, center, middle
name: eval

# Step 6: Evaluate the Model

---

# Use the Test Set to Evaluate the Best Model

Select the optimal `$\lambda$`

```r
best_lambda <- boston_results %>% 
  select_best(metric = "rmse", maximize = FALSE)

best_lambda
```

```
## # A tibble: 1 x 1
##   lambda
##    <int>
## 1      4
```

Prepare a recipe with the optimal `$\lambda = 4$`

```r
boston_final <- boston_rec %>% 
  finalize_recipe(best_lambda)
```

---
# Apply the Recipe to the Training and Test Sets

`juice()` applies the recipe to the training set and `bake()` to the test set.

```r
boston_train <- boston_final %>% 
  prep() %>% 
  juice()

boston_test <- boston_final %>% 
  prep() %>% 
  bake(new_data = boston_test_raw)
```

For example, let's take a look at the training set:

```r
head(boston_train, 3)
```

```
## # A tibble: 3 x 6
##    medv lstat_poly_1 lstat_poly_2 lstat_poly_3 lstat_poly_4 chas_X1
##   <dbl>        <dbl>        <dbl>        <dbl>        <dbl>   <dbl>
## 1  24        -0.0586       0.0449      -0.0214     -0.00896       0
## 2  34.7      -0.0660       0.0627      -0.0515      0.0283        0
## 3  33.4      -0.0744       0.0851      -0.0940      0.0917        0
```

---

# Fit the Model to the Training Set

Fit the optimal model `$(\lambda = 4)$`) to the training set:

```r
boston_fit <- lm_mod %>% 
  fit(medv ~ ., data = boston_train)
```

Here are the estimated coefficients:

```r
boston_fit %>% tidy()
```

```
## # A tibble: 6 x 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     22.1      0.272     81.3  3.10e-221
## 2 lstat_poly_1  -123.       4.78     -25.8  3.59e- 81
## 3 lstat_poly_2    53.3      4.78      11.1  9.85e- 25
## 4 lstat_poly_3   -20.1      4.80      -4.19 3.59e-  5
## 5 lstat_poly_4    20.3      4.79       4.24 2.94e-  5
## 6 chas_X1          4.76     0.924      5.15 4.43e-  7
```

---
# Make Predictions Using the Test Set

Create a tibble with the predictions and ground-truth

```r
boston_pred <- boston_fit %>% 
* predict(new_data = boston_test) %>%
  bind_cols(boston_test) %>% 
  select(medv, .pred)

head(boston_pred)
```

```
## # A tibble: 6 x 2
##    medv .pred
##   <dbl> <dbl>
## 1  21.6 22.9 
## 2  16.5  9.80
## 3  18.9 16.6 
## 4  18.9 18.7 
## 5  20.4 24.3 
## 6  18.2 21.4
```

Note that this is the first time we make use of the test set!

---
# Test-RMSE

Calculate the test root mean square error (test-RMSE):

```r
boston_pred %>% 
  rmse(medv, .pred)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.94
```

The above is a measure of our model's performance on "general" data.

<midd-blockquote>__NOTE:__ the test set RMSE estimates the expected squared prediction error on unseen data _given_ the best model.</midd-blockquote>

---
# Always plot your prediction errors

.pull-left[

Plotting the prediction errors `$(y_i-\hat{y}_i)$` vs. the target provides valuable information about prediction quality.

```r
boston_pred %>% 
  mutate(resid = medv - .pred) %>% 
  ggplot(aes(medv, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")
```

For example, our predictions for high-end levels of `medv` are extremely biased `$\Rightarrow$` there's room for improvement...

]

.pull-right[
![](04-ml-workflow_files/figure-html/unnamed-chunk-4-1..svg)
]

---
# (A shortcut)

A much quicker way to get the test-set RMSE is using `tune`'s `final_fit()` function:

```r
boston_final %>% 
* last_fit(lm_mod, split = boston_split) %>%
  collect_metrics() %>% 
  filter(.metric == "rmse")
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.94
```

---
class: .title-slide-final, center, inverse, middle

# `slides::end()`

[<i class="fa fa-github"></i> Source code](https://github.com/ml4econ/lecture-notes-2021/tree/master/04-ml-workflow)