class: center, middle, inverse, title-slide

# Basic Machine Learning Concepts
### Itamar Caspi
### March 10, 2019 (updated: 2019-05-02)

---

# Replicating this presentation

R packages used to produce this presentation:

```r
library(tidyverse)   # for data wrangling and plotting
library(svglite)     # for better looking plots
library(kableExtra)  # for better looking tables
library(RefManageR)  # for referencing
library(truncnorm)   # for drawing from a truncated normal distribution
library(tidymodels)  # for modelling
library(knitr)       # for presenting tables
library(mlbench)     # for the Boston Housing data
library(ggdag)       # for plotting DAGs
```

If you are missing a package, run the following command:

```
install.packages("package_name")
```

Alternatively, you can just use the [pacman](https://cran.r-project.org/web/packages/pacman/vignettes/Introduction_to_pacman.html) package, which loads and installs packages:

```r
if (!require("pacman")) install.packages("pacman")

pacman::p_load(tidyverse, svglite, kableExtra, RefManageR, truncnorm,
               tidymodels, knitr, mlbench, ggdag)
```

???

Hat tip to Grant McDermott, who introduced me to the awesome __pacman__ package.

---

# First things first: "Big Data"

<midd-blockquote>"_A billion years ago modern homo sapiens emerged. A billion minutes ago, Christianity began. A billion seconds ago, the IBM PC was released. A billion Google searches ago ... was this morning._" .right[Hal Varian (2013)]</midd-blockquote>

The 4 Vs of big data:

+ Volume - Scale of data.
+ Velocity - Analysis of streaming data.
+ Variety - Different forms of data.
+ Veracity - Uncertainty of data.

---

# "Data Science"

<img src="figs/venn.jpg" width="50%" style="display: block; margin: auto;" />

[*] Hacking `\(\approx\)` coding

---

# Outline

1. [What is ML?](#concepts)

2. [The problem of overfitting](#overfitting)

3. [Too complex? Regularize!](#regularization)

4. [Putting it All Together](#ml_workflow)

5. [What's in it for economists?](#economics)

---
class: title-slide-section-blue, center, middle
name: concepts

# What is ML?

---

# So, what is ML?

A concise definition by <a name=cite-athey2018the></a>[Athey (2018)](#bib-athey2018the):

<midd-blockquote> "...[M]achine learning is a field that develops algorithms designed to be applied to datasets, with the main areas of focus being prediction (regression), classification, and clustering or grouping tasks." </midd-blockquote>

Specifically, there are three broad classes of ML problems:

+ supervised learning.
+ unsupervised learning.
+ reinforcement learning.

> Most of the hype you hear about in recent years relates to supervised learning, and in particular, deep learning.

---

<img src="figs/mltypes.png" width="80%" style="display: block; margin: auto;" />

???

Source: [https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/machine-learning.png](https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/machine-learning.png)

---

# An aside: ML and artificial intelligence (AI)

<img src="figs/DL-ML-AI.png" width="70%" style="display: block; margin: auto;" />

???

Source: [https://www.magnetic.com/blog/explaining-ai-machine-learning-vs-deep-learning-post/](https://www.magnetic.com/blog/explaining-ai-machine-learning-vs-deep-learning-post/)

---

# Unsupervised learning

In _unsupervised_ learning, the goal is to divide high-dimensional data into clusters that are __similar__ in their set of features `\((X)\)`.

Examples of algorithms:
- principal component analysis
- `\(k\)`-means clustering
- the Latent Dirichlet Allocation model

Applications:
- image recognition
- cluster analysis
- topic modelling

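---

# Unsupervised learning: a quick sketch

To make the idea concrete, here is a minimal, hypothetical `\(k\)`-means example using base R's `kmeans()` and the packages loaded above. The simulated data and the choice of `\(k=3\)` are made up for illustration; note that only the features are used, there are no labels.

```r
set.seed(1)  # arbitrary seed, for illustration only

# simulate 300 observations with two features and three latent groups
X <- tibble(x1 = rnorm(300, mean = rep(c(0, 3, 6), each = 100)),
            x2 = rnorm(300, mean = rep(c(0, 3, 0), each = 100)))

km <- kmeans(scale(X), centers = 3, nstart = 20)  # cluster on the features only

X %>% 
  mutate(cluster = factor(km$cluster)) %>%  # attach the estimated cluster labels
  ggplot(aes(x1, x2, color = cluster)) +
  geom_point()
```
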
---

# Example: Clustering OECD Inflation Rates

<img src="figs/clustInflationCropped.jpg" width="80%" style="display: block; margin: auto;" />

.footnote[_Source_: [Baudot-Trajtenberg and Caspi (2018)](https://www.bis.org/publ/bppdf/bispap100_l.pdf).]

---

# Reinforcement learning (RL)

A definition by <a name=cite-sutton2018rli></a>[Sutton and Barto (2018)](#bib-sutton2018rli):

<midd-blockquote> _"Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them."_ </midd-blockquote>

Prominent examples:

- Games (e.g., Chess, AlphaGo).
- Self-driving cars.

---

# Supervised learning

Consider the following data generating process (DGP):

`$$Y=f(\boldsymbol{X})+\epsilon$$`

where `\(Y\)` is the outcome variable, `\(\boldsymbol{X}\)` is a `\(1\times p\)` vector of "features", and `\(\epsilon\)` is the irreducible error.

- __Training set__ ("in-sample"): `\(\{(x_i,y_i)\}_{i=1}^{n}\)`
- __Test set__ ("out-of-sample"): `\(\{(x_i,y_i)\}_{i=n+1}^{m}\)`

<img src="figs/train_test.png" width="50%" style="display: block; margin: auto;" />

<midd-blockquote> Typical assumptions: (1) independent observations; (2) stable DGP across training _and_ test sets. </midd-blockquote>

---

# The goal of supervised learning

Use a labelled training set ( `\(X\)` and `\(Y\)` are known) to construct `\(\hat{f}(X)\)` such that it _generalizes_ to the unseen test set (only `\(X\)` is known).

__EXAMPLE:__ Consider the task of spam detection:

<img src="figs/spam.png" width="100%" style="display: block; margin: auto;" />

In this case, `\(Y=\{spam, ham\}\)`, and `\(X\)` is the email text.

---

# Traditional vs. modern approach to supervised learning

<iframe width="100%" height="400" src="https://www.youtube.com/embed/xl3yQBhI6vY?start=405" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---

# Traditional vs. modern approach to supervised learning

- Traditional: rules-based, e.g., define dictionaries of "good" and "bad" words, and use them to classify text.

- Modern: learn from data, e.g., label text as "good" or "bad" and let the model estimate rules from (training) data.

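---

# Rules vs. learning: a toy sketch

A minimal, hypothetical illustration of the two approaches, using made-up texts and a made-up dictionary (stringr's `str_count()` is loaded with the tidyverse above):

```r
docs <- tibble(
  text = c("win money now", "free money offer", "you win",
           "meeting at noon", "money for lunch?", "see you tomorrow"),
  spam = c(1, 1, 1, 0, 0, 0)
)

bad_words <- c("win", "free", "money", "offer")  # hand-crafted dictionary

docs <- docs %>% 
  mutate(n_bad = str_count(text, str_c(bad_words, collapse = "|")))

# traditional, rules-based: flag any text that contains a "bad" word
docs %>% mutate(rule_pred = as.numeric(n_bad > 0))

# modern, learned: let a logistic regression estimate the rule from labelled data
fit <- glm(spam ~ n_bad, data = docs, family = binomial)
predict(fit, type = "response")
```

Note how the hard-coded rule flags "money for lunch?" as spam, while the learned model weighs the word counts against the labels it was trained on.
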
---

# More real-world applications of ML

| task | outcome `\((Y)\)` | features `\((X)\)` |
|--------------------|-------------------|-------------------------------------------|
| credit score | probability of default | loan history, payment record... |
| fraud detection | fraud / no fraud | transaction history, timing, amount... |
| voice recognition | word | recordings |
| sentiment analysis | good / bad review | text |
| image classification | cat / not cat | image |
| overdraft prediction | yes / no | bank-account history |
| customer churn prediction | churn / no churn | customer features |

> A rather mind-blowing example: Amazon's ["Anticipatory Package Shipping"](https://pdfpiw.uspto.gov/.piw?docid=08615473&SectionNum=1&IDKey=28097F792F1E&HomeUrl=http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2%2526Sect2=HITOFF%2526p=1%2526u%2025252Fnetahtml%2025252FPTO%20=%25%25%25%25%2025252Fsearch%20bool.html%202526r-2526f%20=%20G%20=%201%25%20=%2050%25%202526l%25%202526d%25%20AND%202526co1%20=%20=%20=%20PTXT%202526s1%25%25%25%20252522anticipatory%20252Bpackage%25%20=%25%20252522%25%202526OS%20252522anticipatory%20252Bpackage%25%25%20252%20522%25%202526RS%20252522anticipatory%25%20=%25%20252522%25%20252Bpackage) patent (December 2013): Imagine Amazon's algorithms reaching such levels of accuracy, causing it to change its business model from shopping-then-shipping to shipping-then-shopping!

???

A fascinating discussion of Amazon's shipping-then-shopping business model appears in the book ["Prediction Machines: The Simple Economics of Artificial Intelligence"](https://books.google.com/books/about/Prediction_Machines.html?id=wJY4DwAAQBAJ) <a name=cite-agrawal2018prediction></a>([Agrawal, Gans, and Goldfarb, 2018](#bib-agrawal2018prediction)).

---

# Supervised learning algorithms

ML comes with a rich set of parametric and non-parametric prediction algorithms (approximate year of introduction in parentheses):

- Linear and logistic regression (1805, 1958).
- Decision and regression trees (1984).
- K-nearest neighbors (1967).
- Support vector machines (1990s).
- Neural networks (1940s, 1970s, 1980s, 1990s).
- Ensemble methods (random forests (2001), bagging (1996), boosting (1990)).
- etc.

---

# So, why now?

- ML methods are data-hungry and computationally expensive. Hence,

$$ \text{big data} + \text{computational advancements} = \text{the rise of ML}$$

--

- Nevertheless,

<midd-blockquote> "_[S]upervised learning [...] may involve high dimensions, non-linearities, binary variables, etc., but at the end of the day it’s still just regression._" .right[— [__Francis X. Diebold__](https://fxdiebold.blogspot.com/2018/06/machines-learning-finance.html)]</midd-blockquote>

<img src="figs/matrix.jpeg" width="25%" style="display: block; margin: auto;" />

---

# Wait, is ML just glorified statistics?

.pull-left[
<img src="figs/frames.jpg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
The "two cultures" <a name=cite-breiman2001statistical></a>([Breiman, 2001](#bib-breiman2001statistical)):

- Statisticians assume a data generating process and try to learn about it using data (parameters, confidence intervals, assumptions).

- Computer scientists treat the data mechanism as unknown and try to predict or classify with the most accuracy.
]

???

See further discussions here:

- [https://towardsdatascience.com/no-machine-learning-is-not-just-glorified-statistics-26d3952234e3](https://towardsdatascience.com/no-machine-learning-is-not-just-glorified-statistics-26d3952234e3)

- [https://www.quora.com/Is-Machine-Learning-just-glorified-statistics](https://www.quora.com/Is-Machine-Learning-just-glorified-statistics)

---
class: title-slide-section-blue, center, middle
name: overfitting

# Overfitting

---

# Prediction accuracy

Before we define overfitting, we need to be more explicit about what we mean by "good prediction."

- Let `\((x^0,y^0)\)` denote a single realization from the (unseen) test set.

- Define a __loss function__ `\(L\)` in terms of predictions `\(\hat{y}^0=\hat{f}(x^0)\)` and the "ground truth" `\(y^0\)`, where `\(\hat{f}\)` is estimated using the _training_ set.

- Examples

  - squared error (SE): `\(L(\hat{y}^0, y^0)=(y^0-\hat{f}(x^0))^2\)`
  - absolute error (AE): `\(L(\hat{y}^0, y^0)=|y^0-\hat{f}(x^0)|\)`

- There are other possible loss functions (e.g., in classification, the probability of misclassifying a case, or an economic cost function).

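---

# Loss functions: a toy example

A tiny, made-up numerical example of the two loss functions above (the values are arbitrary):

```r
y0     <- 23.5  # "ground truth" for a single test-set realization
y0_hat <- 25.0  # the model's prediction, y0_hat = f_hat(x0)

c(SE = (y0 - y0_hat)^2,   # squared error:  (23.5 - 25)^2 = 2.25
  AE = abs(y0 - y0_hat))  # absolute error: |23.5 - 25|   = 1.5
```
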
---

# The bias-variance decomposition

Under a __squared error loss function__, an optimal predictive model is one that minimizes the _expected_ squared prediction error.

It can be shown that if the true model is `\(Y=f(X)+\epsilon\)`, then

`$$\begin{aligned}[t] \mathbb{E}\left[\text{SE}^0\right] &= \mathbb{E}\left[(y^0 - \hat{f}(x^0))^2\right] \\ &= \underbrace{\left(\mathbb{E}(\hat{f}(x^0)) - f(x^0)\right)^{2}}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\hat{f}(x^0) - \mathbb{E}(\hat{f}(x^0))\right]^2}_{\text{variance}} + \underbrace{\mathbb{E}\left[y^0 - f(x^0)\right]^{2}}_{\text{irreducible error}} \\ &= \underbrace{\mathrm{Bias}^2 + \mathbb{V}[\hat{f}(x^0)]}_{\text{reducible error}} + \sigma^2_{\epsilon} \end{aligned}$$`

where the expectation is over the training set _and_ `\((x^0,y^0)\)`.

---

# Intuition behind the bias-variance trade-off

Imagine you are a teaching assistant grading exams. You grade the first exam. What is your best prediction of the next exam's grade?

+ The first test score is an unbiased estimator of the mean grade.

+ But it is extremely variable.

+ Any solution?

Let's simulate it!

???

This example is taken from Susan Athey's AEA 2018 lecture, ["Machine Learning and Econometrics"](https://www.aeaweb.org/conference/cont-ed/2018-webcasts) (Athey and Imbens).

---

# Exam grade prediction simulation

Let's draw 1,000 pairs of grades from the following truncated normal distribution

`$$g_i \sim truncN(\mu = 75, \sigma = 15, a=0, b=100),\quad i=1,2$$`

Next, calculate two types of predictions

- `unbiased_pred` is the first exam's grade.
- `shrinked_pred` is an average of the first exam's grade and a _prior_ mean grade of 70.

Here is a small sample from our simulated table:

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> attempt </th>
   <th style="text-align:right;"> grade1 </th>
   <th style="text-align:right;"> grade2 </th>
   <th style="text-align:right;"> unbiased_pred </th>
   <th style="text-align:right;"> shrinked_pred </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 86 </td>
   <td style="text-align:right;"> 81 </td>
   <td style="text-align:right;"> 72 </td>
   <td style="text-align:right;"> 81 </td>
   <td style="text-align:right;"> 75.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 306 </td>
   <td style="text-align:right;"> 73 </td>
   <td style="text-align:right;"> 78 </td>
   <td style="text-align:right;"> 73 </td>
   <td style="text-align:right;"> 71.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 822 </td>
   <td style="text-align:right;"> 97 </td>
   <td style="text-align:right;"> 100 </td>
   <td style="text-align:right;"> 97 </td>
   <td style="text-align:right;"> 83.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 297 </td>
   <td style="text-align:right;"> 87 </td>
   <td style="text-align:right;"> 95 </td>
   <td style="text-align:right;"> 87 </td>
   <td style="text-align:right;"> 78.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 456 </td>
   <td style="text-align:right;"> 83 </td>
   <td style="text-align:right;"> 78 </td>
   <td style="text-align:right;"> 83 </td>
   <td style="text-align:right;"> 76.5 </td>
  </tr>
</tbody>
</table>

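---

# Exam grade simulation: a sketch

The exact simulation code is not shown on the previous slide, so here is a minimal sketch of how such a table could be generated. The distribution and the two predictors follow the slide; the seed, the rounding of grades, and the column names (taken from the table) are assumptions.

```r
set.seed(42)  # arbitrary seed

sim <- tibble(attempt = 1:1000) %>% 
  mutate(grade1 = round(rtruncnorm(n(), a = 0, b = 100, mean = 75, sd = 15)),
         grade2 = round(rtruncnorm(n(), a = 0, b = 100, mean = 75, sd = 15))) %>% 
  mutate(unbiased_pred = grade1,              # the first grade, as is
         shrinked_pred = (grade1 + 70) / 2)   # shrink it toward a prior mean of 70

sample_n(sim, 5)
```
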
---

# The distribution of predictions

<img src="02-basic-ml-concepts_files/figure-html/unnamed-chunk-11-1..svg" style="display: block; margin: auto;" />

---

# The MSE of grade predictions

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> unbiased_MSE </th>
   <th style="text-align:right;"> shrinked_MSE </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 325.895 </td>
   <td style="text-align:right;"> 213.9627 </td>
  </tr>
</tbody>
</table>

Hence, the shrinked prediction turns out to be better (in the sense of MSE) than the unbiased one!

__QUESTION:__ Is this a general result? Why?

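---

# Computing the two MSEs

A sketch of how the two MSEs could be computed, continuing the hypothetical `sim` tibble from the earlier sketch (the numbers will differ from the table on the previous slide because the seed there was arbitrary):

```r
sim %>% 
  summarise(unbiased_MSE = mean((grade2 - unbiased_pred)^2),
            shrinked_MSE = mean((grade2 - shrinked_pred)^2))
```

The shrunk predictor trades a small amount of bias for a large reduction in variance, which is what drives its lower MSE here.
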
---

# Regressions and the bias-variance trade-off

Consider the following hypothetical DGP:

`$$consumption_i=\beta_0+\beta_1 \times income_i+\varepsilon_i$$`

```r
set.seed(1505)  # for replicating the simulation

df <- crossing(economist = c("A", "B", "C"), obs = 1:20) %>% 
  mutate(economist = as.factor(economist)) %>% 
  mutate(income = rnorm(n(), mean = 100, sd = 10)) %>% 
  mutate(consumption = 10 + 0.5 * income + rnorm(n(), sd = 10))
```

---

# Scatterplot of the data

.pull-left[

```r
df %>% 
  ggplot(aes(y = income, x = consumption)) +
  geom_point()
```
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-13-1..svg)<!-- -->
]

---

# Split the sample between 3 economists

.pull-left[

```r
df %>% 
  ggplot(aes(x = consumption, y = income, color = economist)) +
  geom_point()
```

```r
knitr::kable(sample_n(df, 6), format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> economist </th>
   <th style="text-align:right;"> obs </th>
   <th style="text-align:right;"> income </th>
   <th style="text-align:right;"> consumption </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 98.89330 </td>
   <td style="text-align:right;"> 49.65828 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:right;"> 84.94055 </td>
   <td style="text-align:right;"> 35.73597 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 97.83834 </td>
   <td style="text-align:right;"> 43.40877 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 102.69686 </td>
   <td style="text-align:right;"> 53.47575 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 109.78834 </td>
   <td style="text-align:right;"> 67.43844 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:right;"> 103.71018 </td>
   <td style="text-align:right;"> 52.37188 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-14-1..svg)<!-- -->
]

---

# Underfitting: high bias, low variance

.pull-left[
The model: unconditional mean

`$$Y_i = \beta_0+\varepsilon_i$$`

```r
df %>% 
  ggplot(aes(y = consumption,
             x = income,
             color = economist)) +
  geom_point() +
  geom_smooth(method = lm,
*             formula = y ~ 1,
              se = FALSE,
              color = "black") +
  facet_wrap(~ economist) +
  geom_vline(xintercept = 70, linetype = "dashed") +
  theme(legend.position = "bottom")
```
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-15-1..svg)<!-- -->
]

---

# Overfitting: low bias, high variance

.pull-left[
The model: a high-degree polynomial

`$$Y_i = \beta_0+\sum_{j=1}^{\lambda}\beta_jX_i^{j}+\varepsilon_i$$`

```r
df %>% 
  ggplot(aes(y = consumption,
             x = income,
             color = economist)) +
  geom_point() +
  geom_smooth(method = lm,
*             formula = y ~ poly(x, 5),
              se = FALSE,
              color = "black") +
  facet_wrap(~ economist) +
  geom_vline(xintercept = 70, linetype = "dashed") +
  theme(legend.position = "bottom")
```
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-16-1..svg)<!-- -->
]

---

# "Justfitting": bias and variance are just right

.pull-left[
The model: linear regression

`$$Y_i = \beta_0+\beta_1 X_i + \varepsilon_i$$`

```r
df %>% 
  ggplot(aes(y = consumption,
             x = income,
             color = economist)) +
  geom_point() +
  geom_smooth(method = lm,
*             formula = y ~ x,
              se = FALSE,
              color = "black") +
  facet_wrap(~ economist) +
  geom_vline(xintercept = 70, linetype = "dashed") +
  theme(legend.position = "bottom")
```
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-17-1..svg)<!-- -->
]

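---

# The trade-off in numbers (a sketch)

A small sketch, not part of the original simulation: fit the three specifications above to economist A's 20 observations and compare their MSE on a fresh draw from the same DGP (the exact numbers vary with the seed).

```r
train_A <- df %>% filter(economist == "A")

# a fresh "test" sample from the same DGP
test_new <- tibble(income = rnorm(1000, mean = 100, sd = 10)) %>% 
  mutate(consumption = 10 + 0.5 * income + rnorm(n(), sd = 10))

models <- list(underfit = lm(consumption ~ 1, data = train_A),
               justfit  = lm(consumption ~ income, data = train_A),
               overfit  = lm(consumption ~ poly(income, 5), data = train_A))

# out-of-sample MSE for each specification
map_dbl(models, ~ mean((test_new$consumption - predict(.x, newdata = test_new))^2))
```

Typically the linear specification, which matches the DGP, does best out of sample.
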
---

# The typical bias-variance trade-off in ML

Typically, ML models strive to find levels of bias and variance that are "just right":

<img src="figs/biasvariance1.png" width="60%" style="display: block; margin: auto;" />

---

# When is the bias-variance trade-off important?

In low-dimensional settings ( `\(n\gg p\)` )

+ overfitting is highly __unlikely__
+ training MSE closely approximates test MSE
+ conventional tools (e.g., OLS) will perform well on a test set

__INTUITION:__ As `\(n\rightarrow\infty\)`, insignificant terms will converge to their true value (zero).

In high-dimensional settings ( `\(n\ll p\)` )

+ overfitting is highly __likely__
+ training MSE poorly approximates test MSE
+ conventional tools tend to overfit

<midd-blockquote> `\(n\ll p\)` is prevalent in big data </midd-blockquote>

---

# Bias-variance trade-off in low-dimensional settings

.pull-left[
The model is a 3rd-degree polynomial

`$$Y_i = \beta_0+\beta_1X_i+\beta_2X^2_i+\beta_3X_i^3+\varepsilon_i$$`

only now, the sample size for each economist increases to `\(N=500\)`.

> __INTUITION:__ as `\(n\rightarrow\infty\)`, `\(\hat{\beta}_2\)` and `\(\hat{\beta}_3\)` converge to their true value, zero.
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-19-1..svg)<!-- -->
]

---
class: title-slide-section-blue, center, middle
name: regularization

# Regularization

---

# Regularization

- As the _complexity_ of our model goes up, it will tend to overfit.

- __Regularization__ is the act of penalizing models for their complexity level.

<midd-blockquote> Regularization typically results in simpler and more accurate (though "wrong") models. </midd-blockquote>

---

# How to penalize overfit?

The test set MSE is _unobservable_. How can we tell if a model overfits?

- Main idea in machine learning: use _less_ data!

- Key point: fit the model to a subset of the training set, and __validate__ the model using the subset that was _not_ used to fit the model.

- How can this "magic" work? Recall the stable DGP assumption.

---

# Validation

Split the sample into three parts: a training set, a validation set, and a test set:

<img src="figs/train_validate.png" width="50%" style="display: block; margin: auto;" />

The algorithm:

1. Fit a model to the training set.

2. Use the model to predict outcomes from the validation set.

3. Use the mean of the squared prediction errors to approximate the test-MSE.

__CONCERNS__: (1) the algorithm might be sensitive to the choice of training and validation sets; (2) the algorithm does not use all of the available information.

---

# k-fold cross-validation

Split the training set into `\(k\)` roughly equal-sized parts ( `\(k=5\)` in this example):

<img src="figs/train_cross_validate.png" width="50%" style="display: block; margin: auto;" />

Approximate the test-MSE using the mean of the `\(k\)` split-MSEs

`$$\text{CV-MSE} = \frac{1}{k}\sum_{j=1}^{k}\text{MSE}_j$$`

---

# Which model to choose?

- Recall that the test-MSE is unobservable.

- CV-MSE is our best guess.

- CV-MSE is also a function of model complexity.

- Hence, model selection amounts to choosing the complexity level that minimizes CV-MSE.

---

# Sounds familiar?

- In a way, you probably already know this:

  - Adjusted `\(R^2\)`.
  - AIC, BIC (time series models).

- The above two measures indirectly account for the overfitting that may occur due to the complexity of the model (i.e., adding too many covariates or lags).

- In ML, we use the data to tune the level of complexity such that it maximizes prediction accuracy.

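---

# k-fold CV by hand (a sketch)

Before turning to the tidymodels workflow in the next section, here is a minimal sketch of the CV-MSE formula, applied to the simulated `df` from the bias-variance section (assuming it is still in memory); `k = 5` and the fold assignment are arbitrary choices.

```r
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold assignment

fold_mse <- map_dbl(1:k, function(j) {
  fit  <- lm(consumption ~ income, data = df[folds != j, ])  # fit on k - 1 folds
  pred <- predict(fit, newdata = df[folds == j, ])           # predict the held-out fold
  mean((df$consumption[folds == j] - pred)^2)                # MSE_j
})

mean(fold_mse)  # CV-MSE = the average of the k fold-specific MSEs
```
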
---
class: title-slide-section-blue, center, middle
name: ml_workflow

# Putting it All Together

---

# Toy problem: Predicting Boston housing prices

.pull-left[
We will use the `BostonHousing` dataset: housing data for 506 census tracts of Boston from the 1970 census <a name=cite-harrison1978hedonic></a>([Harrison Jr and Rubinfeld, 1978](#bib-harrison1978hedonic))

- `medv` (outcome): median value of owner-occupied homes in USD 1000's.
- `lstat` (predictor): percentage of lower status of the population.

__OBJECTIVE:__ Find the best prediction model within the class of polynomial regressions.

Examples: in green, a linear relation `\((\lambda=1)\)`; in blue, `\(\lambda = 10\)`.
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-22-1..svg)<!-- -->
]

---

# Step 1: The train-test split

We will use the `initial_split()`, `training()` and `testing()` functions from the [rsample](https://tidymodels.github.io/rsample/) package to perform an initial train-test split

```r
set.seed(1203)  # for reproducibility

df_split <- initial_split(df, prop = 0.75)
df_split
```

```
## <380/126/506>
```

```r
training_df <- training(df_split)
testing_df  <- testing(df_split)

head(training_df, 5)
```

```
## # A tibble: 5 x 2
##    medv lstat
##   <dbl> <dbl>
## 1  21.6  9.14
## 2  34.7  4.03
## 3  36.2  5.33
## 4  28.7  5.21
## 5  22.9 12.4
```

---

# Step 2: Prepare 10 folds for cross-validation

We will use the `vfold_cv()` function from the [rsample](https://tidymodels.github.io/rsample/) package to split the training set into 10 folds:

```r
cv_data <- training_df %>% 
  vfold_cv(v = 10) %>% 
  mutate(train    = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))

cv_data
```

```
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits           id     train               validate         
##  * <list>           <chr>  <list>              <list>           
##  1 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>
##  2 <split [342/38]> Fold02 <tibble [342 x 2]> <tibble [38 x 2]>
##  3 <split [342/38]> Fold03 <tibble [342 x 2]> <tibble [38 x 2]>
##  4 <split [342/38]> Fold04 <tibble [342 x 2]> <tibble [38 x 2]>
##  5 <split [342/38]> Fold05 <tibble [342 x 2]> <tibble [38 x 2]>
##  6 <split [342/38]> Fold06 <tibble [342 x 2]> <tibble [38 x 2]>
##  7 <split [342/38]> Fold07 <tibble [342 x 2]> <tibble [38 x 2]>
##  8 <split [342/38]> Fold08 <tibble [342 x 2]> <tibble [38 x 2]>
##  9 <split [342/38]> Fold09 <tibble [342 x 2]> <tibble [38 x 2]>
## 10 <split [342/38]> Fold10 <tibble [342 x 2]> <tibble [38 x 2]>
```

---

# Step 3: Set the search range for lambda

We need to vary the polynomial degree parameter `\((\lambda)\)` when building our models on the training data.

In this example, we will set the range between 1 and 10:

```r
cv_tune <- cv_data %>% 
  crossing(lambda = 1:10)

cv_tune
```

```
## # A tibble: 100 x 5
##    splits           id     train               validate          lambda
##    <list>           <chr>  <list>              <list>             <int>
##  1 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       1
##  2 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       2
##  3 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       3
##  4 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       4
##  5 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       5
##  6 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       6
##  7 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       7
##  8 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       8
##  9 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       9
## 10 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>      10
## # ... with 90 more rows
```

---

# Step 4: Estimate CV-MSE

We now estimate the CV-MSE for each value of `\(\lambda\)`.

```r
cv_mse <- cv_tune %>% 
  mutate(model = map2(lambda, train, ~ lm(medv ~ poly(lstat, .x), data = .y))) %>% 
  mutate(predicted = map2(model, validate, ~ augment(.x, newdata = .y))) %>% 
  unnest(predicted) %>% 
  group_by(lambda) %>% 
  summarise(mse = mean((.fitted - medv)^2))

cv_mse
```

```
## # A tibble: 10 x 2
##    lambda   mse
##     <int> <dbl>
##  1      1  39.0
##  2      2  31.3
##  3      3  30.1
##  4      4  28.6
##  5      5  28.1
##  6      6  28.3
##  7      7  28.6
##  8      8  29.3
##  9      9  28.3
## 10     10  34.5
```

---

# Step 5: Find the best model

Recall that the best performing model minimizes the CV-MSE.

<img src="02-basic-ml-concepts_files/figure-html/unnamed-chunk-28-1..svg" width="25%" style="display: block; margin: auto;" />

<midd-blockquote> _"[I]n reality there is rarely if ever a true underlying model, and even if there was a true underlying model, selecting that model will not necessarily give the best forecasts..."_ .right[— [__Rob J. Hyndman__](https://robjhyndman.com/hyndsight/crossvalidation/)] </midd-blockquote>

---

# Step 6: Use the test set to evaluate the best model

Fit the best model ( `\(\lambda = 5\)` ) to the training set, make predictions on the test set, and calculate the test root mean squared error (test-RMSE):

<img src="figs/train_test.png" width="50%" style="display: block; margin: auto;" />

```r
training_df %>% 
  lm(medv ~ poly(lstat, 5), data = .) %>%  # fit model
  augment(newdata = testing_df) %>%        # predict unseen data
  rmse(medv, .fitted)                      # evaluate accuracy
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.11
```

> __NOTE__: the test-set RMSE estimates the root of the expected squared prediction error on unseen data, _given_ the best model.

---

# An aside: plot your residuals

.pull-left[
The distribution of the prediction errors `\((y_i-\hat{y}_i)\)` is another important source of information about prediction quality.

```r
training_df %>% 
  lm(medv ~ poly(lstat, 5), data = .) %>% 
  augment(newdata = testing_df) %>% 
  mutate(error = medv - .fitted) %>% 
  ggplot(aes(medv, error)) +
  geom_point() +
  labs(y = expression(y[i] - hat(y)[i]))
```

For example, see how biased the predictions for high levels of `medv` are.
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-30-1..svg)<!-- -->
]

---
class: title-slide-section-blue, center, middle
name: economics

# ML and econometrics

---

# ML vs. econometrics

Apart from jargon (training set vs. in-sample, test set vs. out-of-sample, learn vs. estimate, etc.), here is a summary of some of the key differences between ML and econometrics:

| Machine Learning | Econometrics |
| :----------------- | :---------------------- |
| prediction | causal inference |
| `\(\hat{Y}\)` | `\(\hat{\beta}\)` |
| minimize prediction error | unbiasedness, consistency, efficiency |
| --- | statistical inference |
| stable environment | counterfactual analysis |
| black box | structural |

---

# ML and causal inference

<midd-blockquote> _"Data are profoundly dumb about causal relationships."_ .right[— [__Pearl and Mackenzie, _The Book of Why___](http://bayes.cs.ucla.edu/WHY/)] </midd-blockquote>

Intervention violates the stable DGP assumption:

`$$P(Y|X=x) \neq P(Y|do(X=x)),$$`

where

- `\(P(Y|X=x)\)` is the probability of `\(Y\)` given that we _observe_ `\(X=x\)`.
- `\(P(Y|do(X=x))\)` is the probability of `\(Y\)` given that we _intervene_ to set `\(X=x\)`.

---

# "A new study shows that..."

.pull-left[
__TOY PROBLEM:__ Say that we find in the data that `coffee` is a good predictor of `dementia`. Is avoiding `coffee` a good idea?
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-31-1..svg)<!-- -->
]

???

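---

# An aside: drawing the DAG

One way to draw the kind of DAG used in this section is with the __ggdag__ package loaded at the start. The exact code behind the original figures is not shown; the sketch below encodes the story developed on the next slide, where `smoking` drives both `coffee` and `dementia`.

```r
coffee_dag <- dagify(dementia ~ smoking,   # smoking causes dementia
                     coffee   ~ smoking,   # smoking also causes coffee drinking
                     exposure = "coffee",
                     outcome  = "dementia")

ggdag(coffee_dag) + theme_dag()
```
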
---

# To explain or to predict?

.pull-left[
- In this example, `coffee` is a good _predictor_ of `dementia`, despite not having any causal link to it.

- Controlling for `smoking` will give us the causal effect of `coffee` on `dementia`, which is zero.

> In general, causal inference always and everywhere depends on _assumptions_ about the DGP (i.e., the data never "speaks for itself").
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-32-1..svg)<!-- -->
]

???

The title of this slide, "To Explain or to Predict?", is the name of a must-read [paper](https://projecteuclid.org/euclid.ss/1294167961) by <a name=cite-shmueli2010explain></a>[Shmueli (2010)](#bib-shmueli2010explain) that clarifies the distinction between predicting and explaining.

---

# ML in aid of econometrics

Consider the standard "treatment effect regression":

`$$Y_i=\alpha+\underbrace{\tau D_i}_{\text{low dimensional}} +\underbrace{\sum_{j=1}^{p}\beta_{j}X_{ij}}_{\text{high dimensional}}+\varepsilon_i,\quad\text{for }i=1,\dots,n$$`

where

+ An outcome `\(Y_i\)`
+ A treatment assignment `\(D_i\in\{0,1\}\)`
+ A vector of `\(p\)` control variables `\(X_i\)`

Our object of interest is often `\(\hat{\tau}\)`, the average treatment effect (ATE).

---
class: .title-slide-final, center, inverse, middle

# `slides::end()`

[<i class="fa fa-github"></i> Source code](https://github.com/ml4econ/notes-spring2019/tree/master/02-basic-ml-concepts)

---

# References

<a name=bib-agrawal2018prediction></a>[[1]](#cite-agrawal2018prediction) A. Agrawal, J. Gans and A. Goldfarb. _Prediction Machines: The Simple Economics of Artificial Intelligence_. Harvard Business Review Press, 2018.

<a name=bib-athey2018the></a>[[2]](#cite-athey2018the) S. Athey. "The impact of machine learning on economics". In: _The Economics of Artificial Intelligence: An Agenda_. University of Chicago Press, 2018.

<a name=bib-breiman2001statistical></a>[[3]](#cite-breiman2001statistical) L. Breiman. "Statistical modeling: The two cultures (with comments and a rejoinder by the author)". In: _Statistical science_ 16.3 (2001), pp. 199-231.

<a name=bib-harrison1978hedonic></a>[[4]](#cite-harrison1978hedonic) D. Harrison Jr and D. L. Rubinfeld. "Hedonic housing prices and the demand for clean air". In: _Journal of environmental economics and management_ 5.1 (1978), pp. 81-102.

<a name=bib-shmueli2010explain></a>[[5]](#cite-shmueli2010explain) G. Shmueli. "To explain or to predict?" In: _Statistical science_ 25.3 (2010), pp. 289-310. --- # References <a name=bib-sutton2018rli></a>[[1]](#cite-sutton2018rli) R. S. Sutton and A. G. Barto. _Reinforcement Learning: An Introduction_. MIT Press, 2018.