class: center, middle, inverse, title-slide

.title[
# 03 - A Typical (Supervised) ML Workflow
]
.subtitle[
## ml4econ, HUJI 2025
]
.author[
### Itamar Caspi
]
.date[
### March 30, 2025 (updated: 2025-03-30)
]

---

# Packages and setup

We will use the following packages during the presentation:

``` r
library(tidyverse)   # for data wrangling and visualization
library(tidymodels)  # for data modeling
library(GGally)      # for pairs plot
library(skimr)       # for summary statistics
library(here)        # for referencing folders and files
```

For the presentation, we will select a specific `ggplot` theme (not relevant otherwise):

``` r
theme_set(theme_grey(20))
```

---

# The `tidymodels` package

.pull-left[
<img src="figs/tidymodels.png" width="80%" style="display: block; margin: auto;" />
]

.pull-right[

>"[`tidymodels`](https://github.com/tidymodels/tidymodels) is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse."

[GET STARTED](https://www.tidymodels.org/start/)

]

---

# Supervised Machine Learning Workflow

1. [Define the Prediction Task](#background)

2. [Split the Data](#split)

3. [Explore the Data](#eda)

4. [Set Model and Tuning Parameters](#model)

5. [Perform Cross-Validation](#cv)

6. [Evaluate the Model](#eval)

---
class: title-slide-section-blue, center, middle
name: background

# Step 1: Define the Prediction Task

---

# The `BostonHousing` dataset

.pull-left[

- Dataset: 506 census tracts from the 1970 Boston census (Harrison & Rubinfeld, 1978)

- The authors examined how air pollution and proximity to the Charles River affected housing prices.

- They quantified how much residents were willing to pay for cleaner air—a key early example of valuing environmental amenities.

]

.pull-right[
<img src="figs/boston.jpg" width="100%" style="display: block; margin: auto;" />

Source: [https://www.bostonusa.com/](https://www.bostonusa.com/)
]

---

# A bird's-eye view of Boston

<img src="https://media.wbur.org/wp/2019/11/1125_bosheatmap-2-1000x510.jpg" width="80%" style="display: block; margin: auto;" />

Source: [https://www.wbur.org/news/2019/11/25/heat-mapping-boston-museum-of-science](https://www.wbur.org/news/2019/11/25/heat-mapping-boston-museum-of-science)

---

# Load the Data

We will use the `read_csv()` function to import the raw dataset.

``` r
boston_raw <- read_csv(here("03-ml-workflow/data","BostonHousing.csv"))

boston_raw |> head(5)
```

```
## # A tibble: 5 x 14
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397. 4.98
## 2 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397. 9.14
## 3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393. 4.03
## 4 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3 222 18.7 395. 2.94
## 5 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3 222 18.7 397. 5.33
## # i 1 more variable: medv <dbl>
```

---

# Variables in the Dataset

.pull-left[

* `crim` – Per capita crime rate by town

* `zn` – Proportion of residential land zoned for lots over 25,000 sq. ft.
* `indus` – Proportion of non-retail business acres per town

* `chas` – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

* `nox` – Nitric oxides concentration (parts per 10 million)

* `rm` – Average number of rooms per dwelling

* `age` – Proportion of owner-occupied units built prior to 1940

]

.pull-right[

* `dis` – Weighted distances to five Boston employment centers

* `rad` – Index of accessibility to radial highways

* `tax` – Full-value property tax rate per $10,000

* `ptratio` – Pupil-teacher ratio by town

* `b` – 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town

* `lstat` – % lower status of the population

* **`medv` – Median value of owner-occupied homes in $1000s**

]

---

# What Type of Data?

For a better understanding of the data structure, apply the `glimpse()` function:

``` r
glimpse(boston_raw)
```

```
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.1445~
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12~
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 7.8~
## $ chas    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524,~
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 6.004,~
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 94.3, 82~
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505, 6.082~
## $ rad     <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ~
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 311, 307,~
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 15.~
## $ b       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90, 386.6~
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10, 20.45~
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.~
```

The `chas` variable takes only the values 0 and 1 (a dummy variable), so it should be treated as a categorical factor rather than as numeric.

---

# Our Task

**Components**

- `medv` (target): Median home value in thousands of dollars
- `lstat` (predictor): Percentage of lower status population
- `chas` (predictor): Proximity to Charles River (1 = yes, 0 = no)

**Objective**

- Predict `medv` based on the given predictors

**Model**

- Polynomial regression: `\(medv_i = \beta_0 + \sum_{j=1}^{\lambda}\beta_j lstat_i^j+\varepsilon_i\)`

---

# Initial Data Filtering

Select `medv`, `lstat`, and `chas`

``` r
boston <- boston_raw %>% 
  as_tibble() %>% 
  select(medv, lstat, chas) %>% 
  mutate(chas = as_factor(chas))

head(boston)
```

```
## # A tibble: 6 x 3
## medv lstat chas
## <dbl> <dbl> <fct>
## 1 24 4.98 0
## 2 21.6 9.14 0
## 3 34.7 4.03 0
## 4 33.4 2.94 0
## 5 36.2 5.33 0
## 6 28.7 5.21 0
```

---
class: title-slide-section-blue, center, middle
name: split

# Step 2: Split the Data

---

# Initial Split

To perform an initial train-test split, we will use the `initial_split()`, `training()`, and `testing()` functions from the [rsample](https://tidymodels.github.io/rsample/) package.

Remember to set a seed for reproducibility.
``` r
set.seed(1203)
```

Initial split:

``` r
boston_split <- boston %>% 
  initial_split(prop = 2/3, strata = medv)

boston_split
```

```
## <Training/Testing/Total>
## <336/170/506>
```

---

# Preparing Training and Test Sets

``` r
boston_train_raw <- training(boston_split)
boston_test_raw  <- testing(boston_split)

head(boston_train_raw, 5)
```

```
## # A tibble: 5 x 3
## medv lstat chas
## <dbl> <dbl> <fct>
## 1 16.5 29.9 0
## 2 15 20.4 0
## 3 13.6 21.0 0
## 4 15.2 18.7 0
## 5 14.5 19.9 0
```

``` r
head(boston_test_raw, 5)
```

```
## # A tibble: 5 x 3
## medv lstat chas
## <dbl> <dbl> <fct>
## 1 24 4.98 0
## 2 21.6 9.14 0
## 3 27.1 19.2 0
## 4 18.9 17.1 0
## 5 18.2 10.3 0
```

---
class: title-slide-section-blue, center, middle
name: eda

# Step 3: Explore the Data

---

# Summary Statistics Using `skimr`

``` r
boston_train_raw %>% 
  skim()
```

(Not visually appealing on these slides)

---

# Pairs Plot Using `GGally`

.pull-left[

We will now create a __pairs plot__, which efficiently displays every variable in a dataset against all the others.

``` r
boston_train_raw %>% ggpairs()
```

]

.pull-right[

<!-- -->

]

---

# Select a Model

.pull-left[

We will select the class of polynomial models, represented as follows:

`$$medv_i = \beta_0 + \sum_{j=1}^{\lambda}\beta_j lstat_i^j+\varepsilon_i$$`

``` r
boston_train_raw %>% ggplot(aes(lstat, medv)) +
  geom_point() +
  geom_smooth(
    method = lm,
    formula = y ~ poly(x,1),
    se = FALSE,
    color = "blue"
  ) +
  geom_smooth(
    method = lm,
    formula = y ~ poly(x,10),
    se = FALSE,
    color = "red"
  )
```

]

.pull-right[

In blue, `\(\lambda=1\)`; in red, `\(\lambda = 10\)`.

<!-- -->

]

---
class: title-slide-section-blue, center, middle
name: model

# Step 4: Set Model and Tuning Parameters

---

# Data Preprocessing Using `recipes`

The `recipes` package is an excellent resource for data preprocessing, seamlessly integrating with the tidy approach to machine learning.

``` r
boston_rec <- 
  recipe(medv ~ lstat + chas, data = boston_train_raw) %>% 
  step_poly(lstat, degree = tune("lambda")) %>% 
  step_dummy(chas)

boston_rec
```

Explanation of the code:

- `recipe()`: Create a recipe object that specifies the target, the predictors, and the training data
- `step_poly()`: Apply a polynomial transformation to `lstat`
- `step_dummy()`: Convert `chas` to a dummy variable

---

# Set a Grid for `\(\lambda\)`

What are the tuning parameters we need to consider?

``` r
boston_rec %>% 
  extract_parameter_set_dials()
```

We must tune the polynomial degree parameter `\((\lambda)\)` while constructing our models using the training data. In this example, we will establish a range between 1 and 8:

``` r
lambda_grid <- expand_grid("lambda" = 1:8)
```

---

# Define the Model

Using the linear regression model:

``` r
lm_mod <- linear_reg() %>% 
  set_engine("lm")

lm_mod
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

Note that the model specification itself involves no tuning parameters; the only tuning parameter, `\(\lambda\)`, lives in the recipe.
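---

# Aside: Tuning Parameters in the Model Specification

Tuning parameters do not have to live in the recipe. As a minimal, hypothetical sketch (not part of our workflow), a penalized regression would declare its own tuning parameter with `tune()`, exactly as we did for the polynomial degree:

``` r
# Hypothetical example (not used in this workflow): a model specification
# with its own tunable parameter, marked with tune()
ridge_mod <- 
  linear_reg(penalty = tune("penalty"), mixture = 0) %>% 
  set_engine("glmnet")
```

Such a specification plugs into the same `workflow()` and `tune_grid()` machinery we use next.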
---
class: title-slide-section-blue, center, middle
name: cv

# Step 5: Cross-validation

---

# Split the Training Set into 5 Folds

We will apply the `vfold_cv()` function from the [rsample](https://tidymodels.github.io/rsample/) package to divide the training set into 5 folds:

``` r
cv_splits <- boston_train_raw %>% 
  vfold_cv(v = 5)

cv_splits
```

```
## # 5-fold cross-validation
## # A tibble: 5 x 2
## splits id
## <list> <chr>
## 1 <split [268/68]> Fold1
## 2 <split [269/67]> Fold2
## 3 <split [269/67]> Fold3
## 4 <split [269/67]> Fold4
## 5 <split [269/67]> Fold5
```

---

# Define the Workflow

Next, we define a `workflow()` that combines a model specification with a recipe or model preprocessor.

``` r
boston_wf <- 
  workflow() %>% 
  add_model(lm_mod) %>% 
  add_recipe(boston_rec)
```

Note that the workflow carries the recipe's tuning parameter, `\(\lambda\)`, which we tune in the next step.

---

# Estimate CV-RMSE Over the `\(\lambda\)` Grid

We will now calculate the cross-validated root mean squared error (CV-RMSE) for each value of `\(\lambda\)`.

``` r
boston_results <- 
  boston_wf %>% 
  tune_grid(
    resamples = cv_splits,
    grid = lambda_grid
  )

boston_results
```

```
## # Tuning results
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [268/68]> Fold1 <tibble [16 x 5]> <tibble [0 x 3]>
## 2 <split [269/67]> Fold2 <tibble [16 x 5]> <tibble [0 x 3]>
## 3 <split [269/67]> Fold3 <tibble [16 x 5]> <tibble [0 x 3]>
## 4 <split [269/67]> Fold4 <tibble [16 x 5]> <tibble [0 x 3]>
## 5 <split [269/67]> Fold5 <tibble [16 x 5]> <tibble [0 x 3]>
```

---

# Find the Optimal `\(\lambda\)`

Let's identify the top-3 best-performing models.

``` r
boston_results %>% 
  show_best(metric = "rmse", n = 3)
```

```
## # A tibble: 3 x 7
## lambda .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 rmse standard 5.29 5 0.273 Preprocessor6_Model1
## 2 5 rmse standard 5.29 5 0.279 Preprocessor5_Model1
## 3 7 rmse standard 5.33 5 0.293 Preprocessor7_Model1
```

<midd-blockquote> _"[I]n reality there is rarely if ever a true underlying model, and even if there was a true underlying model, selecting that model will not necessarily give the best forecasts..."_ .right[— [__Rob J. Hyndman__](https://robjhyndman.com/hyndsight/crossvalidation/)] </midd-blockquote>

---

# And Now Using a Graph

.pull-left[

``` r
boston_results %>% 
  autoplot()
```

]

.pull-right[

<!-- -->

]

---
class: title-slide-section-blue, center, middle
name: eval

# Step 6: Evaluate the Model

---

# Use the Test Set to Evaluate the Best Model

Choose the optimal value of `\(\lambda\)`

``` r
best_lambda <- boston_results %>% 
  select_best(metric = "rmse")

best_lambda
```

```
## # A tibble: 1 x 2
## lambda .config
## <int> <chr>
## 1 6 Preprocessor6_Model1
```

Create a recipe using the optimal `\(\lambda = 6\)`

``` r
boston_final <- boston_rec %>% 
  finalize_recipe(best_lambda)
```

---

# Apply the Recipe to the Training and Test Sets

The `prep()` function estimates the recipe's transformations from the training data. The `juice()` function then returns the processed training set, while the `bake()` function applies the same transformations to the test set.
``` r
boston_train <- boston_final %>% 
  prep() %>% 
  juice()

boston_test <- boston_final %>% 
  prep() %>% 
  bake(new_data = boston_test_raw)
```

For instance, let's examine the training set:

``` r
head(boston_train, 3)
```

```
## # A tibble: 3 x 8
## medv lstat_poly_1 lstat_poly_2 lstat_poly_3 lstat_poly_4 lstat_poly_5 lstat_poly_6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 16.5 0.126 0.0942 -0.0311 -0.118 -0.0932 0.0108
## 2 15 0.0565 -0.0399 -0.0549 0.0406 0.0604 -0.0342
## 3 13.6 0.0606 -0.0358 -0.0613 0.0335 0.0693 -0.0218
## # i 1 more variable: chas_X1 <dbl>
```

---

# Fit the Model to the Training Set

Fit the optimal model (with `\(\lambda = 6\)`) to the training set:

``` r
boston_fit <- lm_mod %>% 
  fit(medv ~ ., data = boston_train)
```

The following are the estimated coefficients:

``` r
boston_fit %>% tidy()
```

```
## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 22.3 0.295 75.6 1.41e-209
## 2 lstat_poly_1 -126. 5.21 -24.2 4.33e- 75
## 3 lstat_poly_2 52.8 5.21 10.1 3.49e- 21
## 4 lstat_poly_3 -21.4 5.23 -4.09 5.36e- 5
## 5 lstat_poly_4 20.9 5.23 3.99 8.29e- 5
## 6 lstat_poly_5 -14.7 5.23 -2.80 5.34e- 3
## 7 lstat_poly_6 4.22 5.22 0.807 4.20e- 1
## 8 chas_X1 4.45 1.12 3.96 9.27e- 5
```

---

# Make Predictions Using the Test Set

Generate a tibble that includes the predictions and the actual values:

``` r
boston_pred <- 
  boston_fit %>% 
* predict(new_data = boston_test) %>% 
  bind_cols(boston_test) %>% 
  select(medv, .pred)

head(boston_pred)
```

```
## # A tibble: 6 x 2
## medv .pred
## <dbl> <dbl>
## 1 24 31.3
## 2 21.6 23.3
## 3 27.1 15.0
## 4 18.9 16.7
## 5 18.2 22.2
## 6 19.9 24.0
```

It's worth noting that this is the first time we are utilizing the test set!

---

# Test-RMSE

Calculate the root mean square error (RMSE) for the test set (test-RMSE):

``` r
boston_pred %>% 
  rmse(medv, .pred)
```

```
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5.00
```

The above is a measure of our model's performance on "general" (unseen) data.

<midd-blockquote>__NOTE:__ The test-set RMSE estimates the prediction error we should expect on unseen data, given the model we selected.</midd-blockquote>

---

# Always plot your prediction errors

.pull-left[

Plotting the prediction errors `\((y_i-\hat{y}_i)\)` against the target variable provides critical information regarding prediction quality.

``` r
boston_pred %>% 
  mutate(resid = medv - .pred) %>% 
  ggplot(aes(medv, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")
```

For example, our predictions for high values of `medv` are highly biased, indicating that there's potential for improvement...

]

.pull-right[

<!-- -->

]

---

# (A shortcut)

The `last_fit()` function from `tune` is a much quicker way to obtain the test-set RMSE.

First, we need to update our workflow to use the optimal `\(\lambda\)` value.

``` r
boston_wf <- 
  workflow() %>% 
  add_model(lm_mod) %>% 
* add_recipe(boston_final)
```

We will now use the optimal model to estimate the out-of-sample RMSE.
``` r
boston_wf %>% 
* last_fit(split = boston_split) %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse")
```

```
## # A tibble: 1 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 5.00 Preprocessor1_Model1
```

---
class: title-slide-section-blue, center, middle
name: extensions

# Extensions

---

# Feature Engineering

.pull-left[

We could enhance the model with additional transformations, for example:

**Interaction terms** between status and location:

`$$medv_i = \beta_0 + \beta_1 lstat_i + \beta_2 chas_i$$`
`$$+ \beta_3 \cdot (lstat_i \times chas_i) + \varepsilon_i$$`

(A recipe sketch of this model appears in the appendix slide at the end.)

]

.pull-right[
<img src="03-ml-workflow_files/figure-html/feature_viz-1..svg" width="432" />
]

<midd-blockquote>Feature engineering represents the convergence of economic theory and data science: this is where domain expertise significantly impacts model performance.</midd-blockquote>

---

# Learning Curves: Real Data Diagnostics

.pull-left[

**Learning curves** provide critical model diagnostics by showing:

* High-degree `\((\lambda=10)\)`: Low training/high validation error; high variance

* Linear `\((\lambda=1)\)`: Close errors but both high; high bias

* Optimal `\((\lambda=6)\)`: Balanced bias/variance; approaches best performance

]

.pull-right[
<img src="03-ml-workflow_files/figure-html/plot_learning_curves-1..svg" width="432" style="display: block; margin: auto;" />
]

<midd-blockquote>Learning curves inform both model selection and data collection strategies—more data only helps if variance (overfitting) is the primary issue.</midd-blockquote>

---

# Predict vs Explain (vs. Intervene)

.pull-left[

**Polynomial “Black Box”**
- Captures complex, non-linear effects
- Coefficients are not directly interpretable
- Often needs post-hoc visualization

**Transparent Models**
- Linear regressions offer constant marginal effects
- Directly link predictors to outcomes
- Easier to communicate in economic terms

**None of the above guarantee causal interpretation**

]

.pull-right[
<img src="figs/explain.png" width="100%" style="display: block; margin: auto;" />
]

---
class: .title-slide-final, center, inverse, middle

# `slides |> end()`

[<i class="fa fa-github"></i> Source code](https://github.com/ml4econ/lecture-notes-2025/tree/master/03-ml-workflow)
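---

# Appendix: A Recipe Sketch for the Interaction Model

The interaction model from the Feature Engineering slide could be expressed with `recipes`. This is only a sketch: `step_interact()` is used here for illustration, and the object names below (e.g., `boston_int_rec`) are assumptions, not part of the workflow above.

``` r
# Sketch: add an lstat-by-chas interaction to the preprocessing recipe.
# step_dummy() comes first so the interaction uses the dummy column chas_X1.
boston_int_rec <- 
  recipe(medv ~ lstat + chas, data = boston_train_raw) %>% 
  step_dummy(chas) %>% 
  step_interact(terms = ~ lstat:starts_with("chas"))
```

The rest of the workflow is unchanged: bundle the recipe with `lm_mod` in a `workflow()`, cross-validate, and evaluate on the test set.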