class: center, middle, inverse, title-slide

# Basic Machine Learning Concepts
### Itamar Caspi
### March 10, 2019 (updated: 2019-05-02)

---

# Replicating this presentation

R packages used to produce this presentation:

```r
library(tidyverse)   # for data wrangling and plotting
library(svglite)     # for better looking plots
library(kableExtra)  # for better looking tables
library(RefManageR)  # for referencing
library(truncnorm)   # for drawing from a truncated normal distribution
library(tidymodels)  # for modelling
library(knitr)       # for presenting tables
library(mlbench)     # for the Boston Housing data
library(ggdag)       # for plotting DAGs
```

If you are missing a package, run the following command:

```
install.packages("package_name")
```

Alternatively, you can just use the [pacman](https://cran.r-project.org/web/packages/pacman/vignettes/Introduction_to_pacman.html) package, which loads and installs packages:

```r
if (!require("pacman")) install.packages("pacman")

pacman::p_load(tidyverse, svglite, kableExtra, RefManageR, truncnorm,
               tidymodels, knitr, mlbench, ggdag)
```

???

Hat tip to Grant McDermott, who introduced me to the awesome __pacman__ package.

---

# First things first: "Big Data"

<midd-blockquote>"_A billion years ago modern homo sapiens emerged. A billion minutes ago, Christianity began. A billion seconds ago, the IBM PC was released. A billion Google searches ago ... was this morning._" .right[Hal Varian (2013)]</midd-blockquote>

The 4 Vs of big data:

+ Volume - Scale of data.
+ Velocity - Analysis of streaming data.
+ Variety - Different forms of data.
+ Veracity - Uncertainty of data.

---

# "Data Science"

<img src="figs/venn.jpg" width="50%" style="display: block; margin: auto;" />

[*] Hacking `\(\approx\)` coding

---

# Outline

1. [What is ML?](#concepts)

2. [The problem of overfitting](#overfitting)

3. [Too complex? Regularize!](#regularization)

4. [Putting it All Together](#ml_workflow)

5. [What's in it for economists?](#economics)

---
class: title-slide-section-blue, center, middle
name: concepts

# What is ML?

---

# So, what is ML?

A concise definition by <a name=cite-athey2018the></a>[Athey (2018)](#bib-athey2018the):

<midd-blockquote> "...[M]achine learning is a field that develops algorithms designed to be applied to datasets, with the main areas of focus being prediction (regression), classification, and clustering or grouping tasks." </midd-blockquote>

Specifically, there are three broad classes of ML problems:

+ supervised learning.
+ unsupervised learning.
+ reinforcement learning.

> Most of the hype you hear about in recent years relates to supervised learning, and in particular, deep learning.

---

<img src="figs/mltypes.png" width="80%" style="display: block; margin: auto;" />

???

Source: [https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/machine-learning.png](https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/machine-learning.png)

---

# An aside: ML and artificial intelligence (AI)

<img src="figs/DL-ML-AI.png" width="70%" style="display: block; margin: auto;" />

???

Source: [https://www.magnetic.com/blog/explaining-ai-machine-learning-vs-deep-learning-post/](https://www.magnetic.com/blog/explaining-ai-machine-learning-vs-deep-learning-post/)

---

# Unsupervised learning

In _unsupervised_ learning, the goal is to divide high-dimensional data into clusters that are __similar__ in their set of features `\((X)\)`.

Examples of algorithms:
- principal component analysis
- `\(k\)`-means clustering
- the Latent Dirichlet Allocation model

Applications:
- image recognition
- cluster analysis
- topic modelling

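---

# Unsupervised learning: a quick sketch

To make the idea concrete, here is a minimal, hypothetical `\(k\)`-means example using base R's `kmeans()` and the packages loaded above. The simulated data and the choice of `\(k=3\)` are made up for illustration; note that only the features are used, there are no labels.

```r
set.seed(1)  # arbitrary seed, for illustration only

# simulate 300 observations with two features and three latent groups
X <- tibble(x1 = rnorm(300, mean = rep(c(0, 3, 6), each = 100)),
            x2 = rnorm(300, mean = rep(c(0, 3, 0), each = 100)))

km <- kmeans(scale(X), centers = 3, nstart = 20)  # cluster on the features only

X %>% 
  mutate(cluster = factor(km$cluster)) %>%  # attach the estimated cluster labels
  ggplot(aes(x1, x2, color = cluster)) +
  geom_point()
```
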
---

# Example: Clustering OECD Inflation Rates

<img src="figs/clustInflationCropped.jpg" width="80%" style="display: block; margin: auto;" />

.footnote[_Source_: [Baudot-Trajtenberg and Caspi (2018)](https://www.bis.org/publ/bppdf/bispap100_l.pdf).]

---

# Reinforcement learning (RL)

A definition by <a name=cite-sutton2018rli></a>[Sutton and Barto (2018)](#bib-sutton2018rli):

<midd-blockquote> _"Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them."_ </midd-blockquote>

Prominent examples:

- Games (e.g., Chess, AlphaGo).
- Self-driving cars.

---

# Supervised learning

Consider the following data generating process (DGP):

`$$Y=f(\boldsymbol{X})+\epsilon$$`

where `\(Y\)` is the outcome variable, `\(\boldsymbol{X}\)` is a `\(1\times p\)` vector of "features", and `\(\epsilon\)` is the irreducible error.

- __Training set__ ("in-sample"): `\(\{(x_i,y_i)\}_{i=1}^{n}\)`
- __Test set__ ("out-of-sample"): `\(\{(x_i,y_i)\}_{i=n+1}^{m}\)`

<img src="figs/train_test.png" width="50%" style="display: block; margin: auto;" />

<midd-blockquote> Typical assumptions: (1) independent observations; (2) stable DGP across training _and_ test sets. </midd-blockquote>

---

# The goal of supervised learning

Use a labelled training set ( `\(X\)` and `\(Y\)` are known) to construct `\(\hat{f}(X)\)` such that it _generalizes_ to the unseen test set (only `\(X\)` is known).

__EXAMPLE:__ Consider the task of spam detection:

<img src="figs/spam.png" width="100%" style="display: block; margin: auto;" />

In this case, `\(Y=\{spam, ham\}\)`, and `\(X\)` is the email text.

---

# Traditional vs. modern approach to supervised learning

<iframe width="100%" height="400" src="https://www.youtube.com/embed/xl3yQBhI6vY?start=405" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---

# Traditional vs. modern approach to supervised learning

- Traditional: rules-based, e.g., define dictionaries of "good" and "bad" words, and use them to classify text.

- Modern: learn from data, e.g., label text as "good" or "bad" and let the model estimate rules from (training) data.

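---

# Rules vs. learning: a toy sketch

A minimal, hypothetical illustration of the two approaches, using made-up texts and a made-up dictionary (stringr's `str_count()` is loaded with the tidyverse above):

```r
docs <- tibble(
  text = c("win money now", "free money offer", "you win",
           "meeting at noon", "money for lunch?", "see you tomorrow"),
  spam = c(1, 1, 1, 0, 0, 0)
)

bad_words <- c("win", "free", "money", "offer")  # hand-crafted dictionary

docs <- docs %>% 
  mutate(n_bad = str_count(text, str_c(bad_words, collapse = "|")))

# traditional, rules-based: flag any text that contains a "bad" word
docs %>% mutate(rule_pred = as.numeric(n_bad > 0))

# modern, learned: let a logistic regression estimate the rule from labelled data
fit <- glm(spam ~ n_bad, data = docs, family = binomial)
predict(fit, type = "response")
```

Note how the hard-coded rule flags "money for lunch?" as spam, while the learned model weighs the word counts against the labels it was trained on.
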
---

# More real-world applications of ML

| task | outcome `\((Y)\)` | features `\((X)\)` |
|--------------------|-------------------|-------------------------------------------|
| credit score | probability of default | loan history, payment record... |
| fraud detection | fraud / no fraud | transaction history, timing, amount... |
| voice recognition | word | recordings |
| sentiment analysis | good / bad review | text |
| image classification | cat / not cat | image |
| overdraft prediction | yes / no | bank-account history |
| customer churn prediction | churn / no churn | customer features |

> A rather mind-blowing example: Amazon's ["Anticipatory Package Shipping"](https://pdfpiw.uspto.gov/.piw?docid=08615473&SectionNum=1&IDKey=28097F792F1E&HomeUrl=http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2%2526Sect2=HITOFF%2526p=1%2526u%2025252Fnetahtml%2025252FPTO%20=%25%25%25%25%2025252Fsearch%20bool.html%202526r-2526f%20=%20G%20=%201%25%20=%2050%25%202526l%25%202526d%25%20AND%202526co1%20=%20=%20=%20PTXT%202526s1%25%25%25%20252522anticipatory%20252Bpackage%25%20=%25%20252522%25%202526OS%20252522anticipatory%20252Bpackage%25%25%20252%20522%25%202526RS%20252522anticipatory%25%20=%25%20252522%25%20252Bpackage) patent (December 2013): Imagine Amazon's algorithms reaching such levels of accuracy, causing it to change its business model from shopping-then-shipping to shipping-then-shopping!

???

A fascinating discussion of Amazon's shipping-then-shopping business model appears in the book ["Prediction Machines: The Simple Economics of Artificial Intelligence"](https://books.google.com/books/about/Prediction_Machines.html?id=wJY4DwAAQBAJ) <a name=cite-agrawal2018prediction></a>([Agrawal, Gans, and Goldfarb, 2018](#bib-agrawal2018prediction)).

---

# Supervised learning algorithms

ML comes with a rich set of parametric and non-parametric prediction algorithms (approximate year of introduction in parentheses):

- Linear and logistic regression (1805, 1958).
- Decision and regression trees (1984).
- K-nearest neighbors (1967).
- Support vector machines (1990s).
- Neural networks (1940s, 1970s, 1980s, 1990s).
- Ensemble methods (random forests (2001), bagging (1996), boosting (1990)).
- etc.

---

# So, why now?

- ML methods are data-hungry and computationally expensive. Hence,

$$ \text{big data} + \text{computational advancements} = \text{the rise of ML}$$

--

- Nevertheless,

<midd-blockquote> "_[S]upervised learning [...] may involve high dimensions, non-linearities, binary variables, etc., but at the end of the day it’s still just regression._" .right[— [__Francis X. Diebold__](https://fxdiebold.blogspot.com/2018/06/machines-learning-finance.html)]</midd-blockquote>

<img src="figs/matrix.jpeg" width="25%" style="display: block; margin: auto;" />

---

# Wait, is ML just glorified statistics?

.pull-left[
<img src="figs/frames.jpg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
The "two cultures" <a name=cite-breiman2001statistical></a>([Breiman, 2001](#bib-breiman2001statistical)):

- Statisticians assume a data generating process and try to learn about it using data (parameters, confidence intervals, assumptions).

- Computer scientists treat the data mechanism as unknown and try to predict or classify with the most accuracy.
]

???

See further discussions here:

- [https://towardsdatascience.com/no-machine-learning-is-not-just-glorified-statistics-26d3952234e3](https://towardsdatascience.com/no-machine-learning-is-not-just-glorified-statistics-26d3952234e3)

- [https://www.quora.com/Is-Machine-Learning-just-glorified-statistics](https://www.quora.com/Is-Machine-Learning-just-glorified-statistics)

---
class: title-slide-section-blue, center, middle
name: overfitting

# Overfitting

---

# Prediction accuracy

Before we define overfitting, we need to be more explicit about what we mean by "good prediction."

- Let `\((x^0,y^0)\)` denote a single realization from the (unseen) test set.

- Define a __loss function__ `\(L\)` in terms of predictions `\(\hat{y}^0=\hat{f}(x^0)\)` and the "ground truth" `\(y^0\)`, where `\(\hat{f}\)` is estimated using the _training_ set.

- Examples

  - squared error (SE): `\(L(\hat{y}^0, y^0)=(y^0-\hat{f}(x^0))^2\)`
  - absolute error (AE): `\(L(\hat{y}^0, y^0)=|y^0-\hat{f}(x^0)|\)`

- There are other possible loss functions (e.g., in classification, the probability of misclassifying a case, or an economic cost function).

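---

# Loss functions: a toy example

A tiny, made-up numerical example of the two loss functions above (the values are arbitrary):

```r
y0     <- 23.5  # "ground truth" for a single test-set realization
y0_hat <- 25.0  # the model's prediction, y0_hat = f_hat(x0)

c(SE = (y0 - y0_hat)^2,   # squared error:  (23.5 - 25)^2 = 2.25
  AE = abs(y0 - y0_hat))  # absolute error: |23.5 - 25|   = 1.5
```
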
---

# The bias-variance decomposition

Under a __squared error loss function__, an optimal predictive model is one that minimizes the _expected_ squared prediction error.

It can be shown that if the true model is `\(Y=f(X)+\epsilon\)`, then

`$$\begin{aligned}[t] \mathbb{E}\left[\text{SE}^0\right] &= \mathbb{E}\left[(y^0 - \hat{f}(x^0))^2\right] \\ &= \underbrace{\left(\mathbb{E}(\hat{f}(x^0)) - f(x^0)\right)^{2}}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\hat{f}(x^0) - \mathbb{E}(\hat{f}(x^0))\right]^2}_{\text{variance}} + \underbrace{\mathbb{E}\left[y^0 - f(x^0)\right]^{2}}_{\text{irreducible error}} \\ &= \underbrace{\mathrm{Bias}^2 + \mathbb{V}[\hat{f}(x^0)]}_{\text{reducible error}} + \sigma^2_{\epsilon} \end{aligned}$$`

where the expectation is over the training set _and_ `\((x^0,y^0)\)`.

---

# Intuition behind the bias-variance trade-off

Imagine you are a teaching assistant grading exams. You grade the first exam. What is your best prediction of the next exam's grade?

+ The first test score is an unbiased estimator of the mean grade.

+ But it is extremely variable.

+ Any solution?

Let's simulate it!

???

This example is taken from Susan Athey's AEA 2018 lecture, ["Machine Learning and Econometrics"](https://www.aeaweb.org/conference/cont-ed/2018-webcasts) (Athey and Imbens).

---

# Exam grade prediction simulation

Let's draw 1,000 pairs of grades from the following truncated normal distribution

`$$g_i \sim truncN(\mu = 75, \sigma = 15, a=0, b=100),\quad i=1,2$$`

Next, calculate two types of predictions

- `unbiased_pred` is the first exam's grade.
- `shrinked_pred` is an average of the first exam's grade and a _prior_ mean grade of 70.

Here is a small sample from our simulated table:

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> attempt </th>
   <th style="text-align:right;"> grade1 </th>
   <th style="text-align:right;"> grade2 </th>
   <th style="text-align:right;"> unbiased_pred </th>
   <th style="text-align:right;"> shrinked_pred </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 86 </td>
   <td style="text-align:right;"> 81 </td>
   <td style="text-align:right;"> 72 </td>
   <td style="text-align:right;"> 81 </td>
   <td style="text-align:right;"> 75.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 306 </td>
   <td style="text-align:right;"> 73 </td>
   <td style="text-align:right;"> 78 </td>
   <td style="text-align:right;"> 73 </td>
   <td style="text-align:right;"> 71.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 822 </td>
   <td style="text-align:right;"> 97 </td>
   <td style="text-align:right;"> 100 </td>
   <td style="text-align:right;"> 97 </td>
   <td style="text-align:right;"> 83.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 297 </td>
   <td style="text-align:right;"> 87 </td>
   <td style="text-align:right;"> 95 </td>
   <td style="text-align:right;"> 87 </td>
   <td style="text-align:right;"> 78.5 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 456 </td>
   <td style="text-align:right;"> 83 </td>
   <td style="text-align:right;"> 78 </td>
   <td style="text-align:right;"> 83 </td>
   <td style="text-align:right;"> 76.5 </td>
  </tr>
</tbody>
</table>

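---

# Exam grade simulation: a sketch

The exact simulation code is not shown on the previous slide, so here is a minimal sketch of how such a table could be generated. The distribution and the two predictors follow the slide; the seed, the rounding of grades, and the column names (taken from the table) are assumptions.

```r
set.seed(42)  # arbitrary seed

sim <- tibble(attempt = 1:1000) %>% 
  mutate(grade1 = round(rtruncnorm(n(), a = 0, b = 100, mean = 75, sd = 15)),
         grade2 = round(rtruncnorm(n(), a = 0, b = 100, mean = 75, sd = 15))) %>% 
  mutate(unbiased_pred = grade1,              # the first grade, as is
         shrinked_pred = (grade1 + 70) / 2)   # shrink it toward a prior mean of 70

sample_n(sim, 5)
```
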
---

# The distribution of predictions

<img src="02-basic-ml-concepts_files/figure-html/unnamed-chunk-11-1..svg" style="display: block; margin: auto;" />

---

# The MSE of grade predictions

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> unbiased_MSE </th>
   <th style="text-align:right;"> shrinked_MSE </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 325.895 </td>
   <td style="text-align:right;"> 213.9627 </td>
  </tr>
</tbody>
</table>

Hence, the shrinked prediction turns out to be better (in the sense of MSE) than the unbiased one!

__QUESTION:__ Is this a general result? Why?

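---

# Computing the two MSEs

A sketch of how the two MSEs could be computed, continuing the hypothetical `sim` tibble from the earlier sketch (the numbers will differ from the table on the previous slide because the seed there was arbitrary):

```r
sim %>% 
  summarise(unbiased_MSE = mean((grade2 - unbiased_pred)^2),
            shrinked_MSE = mean((grade2 - shrinked_pred)^2))
```

The shrunk predictor trades a small amount of bias for a large reduction in variance, which is what drives its lower MSE here.
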
---

# Regressions and the bias-variance trade-off

Consider the following hypothetical DGP:

`$$consumption_i=\beta_0+\beta_1 \times income_i+\varepsilon_i$$`

```r
set.seed(1505)  # for replicating the simulation

df <- crossing(economist = c("A", "B", "C"), obs = 1:20) %>% 
  mutate(economist = as.factor(economist)) %>% 
  mutate(income = rnorm(n(), mean = 100, sd = 10)) %>% 
  mutate(consumption = 10 + 0.5 * income + rnorm(n(), sd = 10))
```

---

# Scatterplot of the data

.pull-left[

```r
df %>% 
  ggplot(aes(y = income, x = consumption)) +
  geom_point()
```
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-13-1..svg)<!-- -->
]

---

# Split the sample between 3 economists

.pull-left[

```r
df %>% 
  ggplot(aes(x = consumption, y = income, color = economist)) +
  geom_point()
```

```r
knitr::kable(sample_n(df, 6), format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> economist </th>
   <th style="text-align:right;"> obs </th>
   <th style="text-align:right;"> income </th>
   <th style="text-align:right;"> consumption </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 98.89330 </td>
   <td style="text-align:right;"> 49.65828 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 14 </td>
   <td style="text-align:right;"> 84.94055 </td>
   <td style="text-align:right;"> 35.73597 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 97.83834 </td>
   <td style="text-align:right;"> 43.40877 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 102.69686 </td>
   <td style="text-align:right;"> 53.47575 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 109.78834 </td>
   <td style="text-align:right;"> 67.43844 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:right;"> 103.71018 </td>
   <td style="text-align:right;"> 52.37188 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-14-1..svg)<!-- -->
]

---

# Underfitting: high bias, low variance

.pull-left[
The model: unconditional mean

`$$Y_i = \beta_0+\varepsilon_i$$`

```r
df %>% 
  ggplot(aes(y = consumption,
             x = income,
             color = economist)) +
  geom_point() +
  geom_smooth(method = lm,
*             formula = y ~ 1,
              se = FALSE,
              color = "black") +
  facet_wrap(~ economist) +
  geom_vline(xintercept = 70, linetype = "dashed") +
  theme(legend.position = "bottom")
```
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-15-1..svg)<!-- -->
]

---

# Overfitting: low bias, high variance

.pull-left[
The model: a high-degree polynomial

`$$Y_i = \beta_0+\sum_{j=1}^{\lambda}\beta_jX_i^{j}+\varepsilon_i$$`

```r
df %>% 
  ggplot(aes(y = consumption,
             x = income,
             color = economist)) +
  geom_point() +
  geom_smooth(method = lm,
*             formula = y ~ poly(x, 5),
              se = FALSE,
              color = "black") +
  facet_wrap(~ economist) +
  geom_vline(xintercept = 70, linetype = "dashed") +
  theme(legend.position = "bottom")
```
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-16-1..svg)<!-- -->
]

---

# "Justfitting": bias and variance are just right

.pull-left[
The model: linear regression

`$$Y_i = \beta_0+\beta_1 X_i + \varepsilon_i$$`

```r
df %>% 
  ggplot(aes(y = consumption,
             x = income,
             color = economist)) +
  geom_point() +
  geom_smooth(method = lm,
*             formula = y ~ x,
              se = FALSE,
              color = "black") +
  facet_wrap(~ economist) +
  geom_vline(xintercept = 70, linetype = "dashed") +
  theme(legend.position = "bottom")
```
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-17-1..svg)<!-- -->
]

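---

# The trade-off in numbers (a sketch)

A small sketch, not part of the original simulation: fit the three specifications above to economist A's 20 observations and compare their MSE on a fresh draw from the same DGP (the exact numbers vary with the seed).

```r
train_A <- df %>% filter(economist == "A")

# a fresh "test" sample from the same DGP
test_new <- tibble(income = rnorm(1000, mean = 100, sd = 10)) %>% 
  mutate(consumption = 10 + 0.5 * income + rnorm(n(), sd = 10))

models <- list(underfit = lm(consumption ~ 1, data = train_A),
               justfit  = lm(consumption ~ income, data = train_A),
               overfit  = lm(consumption ~ poly(income, 5), data = train_A))

# out-of-sample MSE for each specification
map_dbl(models, ~ mean((test_new$consumption - predict(.x, newdata = test_new))^2))
```

Typically the linear specification, which matches the DGP, does best out of sample.
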
---

# The typical bias-variance trade-off in ML

Typically, ML models strive to find levels of bias and variance that are "just right":

<img src="figs/biasvariance1.png" width="60%" style="display: block; margin: auto;" />

---

# When is the bias-variance trade-off important?

In low-dimensional settings ( `\(n\gg p\)` )

+ overfitting is highly __unlikely__
+ training MSE closely approximates test MSE
+ conventional tools (e.g., OLS) will perform well on a test set

__INTUITION:__ As `\(n\rightarrow\infty\)`, insignificant terms will converge to their true value (zero).

In high-dimensional settings ( `\(n\ll p\)` )

+ overfitting is highly __likely__
+ training MSE poorly approximates test MSE
+ conventional tools tend to overfit

<midd-blockquote> `\(n\ll p\)` is prevalent in big data </midd-blockquote>

---

# Bias-variance trade-off in low-dimensional settings

.pull-left[
The model is a 3rd-degree polynomial

`$$Y_i = \beta_0+\beta_1X_i+\beta_2X^2_i+\beta_3X_i^3+\varepsilon_i$$`

only now, the sample size for each economist increases to `\(N=500\)`.

> __INTUITION:__ as `\(n\rightarrow\infty\)`, `\(\hat{\beta}_2\)` and `\(\hat{\beta}_3\)` converge to their true value, zero.
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-19-1..svg)<!-- -->
]

---
class: title-slide-section-blue, center, middle
name: regularization

# Regularization

---

# Regularization

- As the _complexity_ of our model goes up, it will tend to overfit.

- __Regularization__ is the act of penalizing models for their complexity level.

<midd-blockquote> Regularization typically results in simpler and more accurate (though "wrong") models. </midd-blockquote>

---

# How to penalize overfit?

The test set MSE is _unobservable_. How can we tell if a model overfits?

- Main idea in machine learning: use _less_ data!

- Key point: fit the model to a subset of the training set, and __validate__ the model using the subset that was _not_ used to fit the model.

- How can this "magic" work? Recall the stable DGP assumption.

---

# Validation

Split the sample into three parts: a training set, a validation set, and a test set:

<img src="figs/train_validate.png" width="50%" style="display: block; margin: auto;" />

The algorithm:

1. Fit a model to the training set.

2. Use the model to predict outcomes from the validation set.

3. Use the mean of the squared prediction errors to approximate the test-MSE.

__CONCERNS__: (1) the algorithm might be sensitive to the choice of training and validation sets; (2) the algorithm does not use all of the available information.

---

# k-fold cross-validation

Split the training set into `\(k\)` roughly equal-sized parts ( `\(k=5\)` in this example):

<img src="figs/train_cross_validate.png" width="50%" style="display: block; margin: auto;" />

Approximate the test-MSE using the mean of the `\(k\)` split-MSEs

`$$\text{CV-MSE} = \frac{1}{k}\sum_{j=1}^{k}\text{MSE}_j$$`

---

# Which model to choose?

- Recall that the test-MSE is unobservable.

- CV-MSE is our best guess.

- CV-MSE is also a function of model complexity.

- Hence, model selection amounts to choosing the complexity level that minimizes CV-MSE.

---

# Sounds familiar?

- In a way, you probably already know this:

  - Adjusted `\(R^2\)`.
  - AIC, BIC (time series models).

- The above two measures indirectly account for the overfitting that may occur due to the complexity of the model (i.e., adding too many covariates or lags).

- In ML, we use the data to tune the level of complexity such that it maximizes prediction accuracy.

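---

# k-fold CV by hand (a sketch)

Before turning to the tidymodels workflow in the next section, here is a minimal sketch of the CV-MSE formula, applied to the simulated `df` from the bias-variance section (assuming it is still in memory); `k = 5` and the fold assignment are arbitrary choices.

```r
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold assignment

fold_mse <- map_dbl(1:k, function(j) {
  fit  <- lm(consumption ~ income, data = df[folds != j, ])  # fit on k - 1 folds
  pred <- predict(fit, newdata = df[folds == j, ])           # predict the held-out fold
  mean((df$consumption[folds == j] - pred)^2)                # MSE_j
})

mean(fold_mse)  # CV-MSE = the average of the k fold-specific MSEs
```
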
---
class: title-slide-section-blue, center, middle
name: ml_workflow

# Putting it All Together

---

# Toy problem: Predicting Boston housing prices

.pull-left[
We will use the `BostonHousing` dataset: housing data for 506 census tracts of Boston from the 1970 census <a name=cite-harrison1978hedonic></a>([Harrison Jr and Rubinfeld, 1978](#bib-harrison1978hedonic))

- `medv` (outcome): median value of owner-occupied homes in USD 1000's.
- `lstat` (predictor): percentage of lower status of the population.

__OBJECTIVE:__ Find the best prediction model within the class of polynomial regressions.

Examples: in green, a linear relation `\((\lambda=1)\)`; in blue, `\(\lambda = 10\)`.
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-22-1..svg)<!-- -->
]

---

# Step 1: The train-test split

We will use the `initial_split()`, `training()` and `testing()` functions from the [rsample](https://tidymodels.github.io/rsample/) package to perform an initial train-test split

```r
set.seed(1203)  # for reproducibility

df_split <- initial_split(df, prop = 0.75)
df_split
```

```
## <380/126/506>
```

```r
training_df <- training(df_split)
testing_df  <- testing(df_split)

head(training_df, 5)
```

```
## # A tibble: 5 x 2
##    medv lstat
##   <dbl> <dbl>
## 1  21.6  9.14
## 2  34.7  4.03
## 3  36.2  5.33
## 4  28.7  5.21
## 5  22.9 12.4
```

---

# Step 2: Prepare 10 folds for cross-validation

We will use the `vfold_cv()` function from the [rsample](https://tidymodels.github.io/rsample/) package to split the training set into 10 folds:

```r
cv_data <- training_df %>% 
  vfold_cv(v = 10) %>% 
  mutate(train    = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))

cv_data
```

```
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits           id     train               validate         
##  * <list>           <chr>  <list>              <list>           
##  1 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>
##  2 <split [342/38]> Fold02 <tibble [342 x 2]> <tibble [38 x 2]>
##  3 <split [342/38]> Fold03 <tibble [342 x 2]> <tibble [38 x 2]>
##  4 <split [342/38]> Fold04 <tibble [342 x 2]> <tibble [38 x 2]>
##  5 <split [342/38]> Fold05 <tibble [342 x 2]> <tibble [38 x 2]>
##  6 <split [342/38]> Fold06 <tibble [342 x 2]> <tibble [38 x 2]>
##  7 <split [342/38]> Fold07 <tibble [342 x 2]> <tibble [38 x 2]>
##  8 <split [342/38]> Fold08 <tibble [342 x 2]> <tibble [38 x 2]>
##  9 <split [342/38]> Fold09 <tibble [342 x 2]> <tibble [38 x 2]>
## 10 <split [342/38]> Fold10 <tibble [342 x 2]> <tibble [38 x 2]>
```

---

# Step 3: Set the search range for lambda

We need to vary the polynomial degree parameter `\((\lambda)\)` when building our models on the training data.

In this example, we will set the range between 1 and 10:

```r
cv_tune <- cv_data %>% 
  crossing(lambda = 1:10)

cv_tune
```

```
## # A tibble: 100 x 5
##    splits           id     train               validate          lambda
##    <list>           <chr>  <list>              <list>             <int>
##  1 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       1
##  2 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       2
##  3 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       3
##  4 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       4
##  5 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       5
##  6 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       6
##  7 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       7
##  8 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       8
##  9 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>       9
## 10 <split [342/38]> Fold01 <tibble [342 x 2]> <tibble [38 x 2]>      10
## # ... with 90 more rows
```

---

# Step 4: Estimate CV-MSE

We now estimate the CV-MSE for each value of `\(\lambda\)`.

```r
cv_mse <- cv_tune %>% 
  mutate(model = map2(lambda, train, ~ lm(medv ~ poly(lstat, .x), data = .y))) %>% 
  mutate(predicted = map2(model, validate, ~ augment(.x, newdata = .y))) %>% 
  unnest(predicted) %>% 
  group_by(lambda) %>% 
  summarise(mse = mean((.fitted - medv)^2))

cv_mse
```

```
## # A tibble: 10 x 2
##    lambda   mse
##     <int> <dbl>
##  1      1  39.0
##  2      2  31.3
##  3      3  30.1
##  4      4  28.6
##  5      5  28.1
##  6      6  28.3
##  7      7  28.6
##  8      8  29.3
##  9      9  28.3
## 10     10  34.5
```

---

# Step 5: Find the best model

Recall that the best performing model minimizes the CV-MSE.

<img src="02-basic-ml-concepts_files/figure-html/unnamed-chunk-28-1..svg" width="25%" style="display: block; margin: auto;" />

<midd-blockquote> _"[I]n reality there is rarely if ever a true underlying model, and even if there was a true underlying model, selecting that model will not necessarily give the best forecasts..."_ .right[— [__Rob J. Hyndman__](https://robjhyndman.com/hyndsight/crossvalidation/)] </midd-blockquote>

---

# Step 6: Use the test set to evaluate the best model

Fit the best model ( `\(\lambda = 5\)` ) to the training set, make predictions on the test set, and calculate the test root mean squared error (test-RMSE):

<img src="figs/train_test.png" width="50%" style="display: block; margin: auto;" />

```r
training_df %>% 
  lm(medv ~ poly(lstat, 5), data = .) %>%  # fit model
  augment(newdata = testing_df) %>%        # predict unseen data
  rmse(medv, .fitted)                      # evaluate accuracy
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        5.11
```

> __NOTE__: the test-set RMSE estimates the root of the expected squared prediction error on unseen data, _given_ the best model.

---

# An aside: plot your residuals

.pull-left[
The distribution of the prediction errors `\((y_i-\hat{y}_i)\)` is another important source of information about prediction quality.

```r
training_df %>% 
  lm(medv ~ poly(lstat, 5), data = .) %>% 
  augment(newdata = testing_df) %>% 
  mutate(error = medv - .fitted) %>% 
  ggplot(aes(medv, error)) +
  geom_point() +
  labs(y = expression(y[i] - hat(y)[i]))
```

For example, see how biased the predictions for high levels of `medv` are.
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-30-1..svg)<!-- -->
]

---
class: title-slide-section-blue, center, middle
name: economics

# ML and econometrics

---

# ML vs. econometrics

Apart from jargon (training set vs. in-sample, test set vs. out-of-sample, learn vs. estimate, etc.), here is a summary of some of the key differences between ML and econometrics:

| Machine Learning | Econometrics |
| :----------------- | :---------------------- |
| prediction | causal inference |
| `\(\hat{Y}\)` | `\(\hat{\beta}\)` |
| minimize prediction error | unbiasedness, consistency, efficiency |
| --- | statistical inference |
| stable environment | counterfactual analysis |
| black box | structural |

---

# ML and causal inference

<midd-blockquote> _"Data are profoundly dumb about causal relationships."_ .right[— [__Pearl and Mackenzie, _The Book of Why___](http://bayes.cs.ucla.edu/WHY/)] </midd-blockquote>

Intervention violates the stable DGP assumption:

`$$P(Y|X=x) \neq P(Y|do(X=x)),$$`

where

- `\(P(Y|X=x)\)` is the probability of `\(Y\)` given that we _observe_ `\(X=x\)`.
- `\(P(Y|do(X=x))\)` is the probability of `\(Y\)` given that we _intervene_ to set `\(X=x\)`.

---

# "A new study shows that..."

.pull-left[
__TOY PROBLEM:__ Say that we find in the data that `coffee` is a good predictor of `dementia`. Is avoiding `coffee` a good idea?
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-31-1..svg)<!-- -->
]

???

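---

# An aside: drawing the DAG

One way to draw the kind of DAG used in this section is with the __ggdag__ package loaded at the start. The exact code behind the original figures is not shown; the sketch below encodes the story developed on the next slide, where `smoking` drives both `coffee` and `dementia`.

```r
coffee_dag <- dagify(dementia ~ smoking,   # smoking causes dementia
                     coffee   ~ smoking,   # smoking also causes coffee drinking
                     exposure = "coffee",
                     outcome  = "dementia")

ggdag(coffee_dag) + theme_dag()
```
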
---

# To explain or to predict?

.pull-left[
- In this example, `coffee` is a good _predictor_ of `dementia`, despite not having any causal link to it.

- Controlling for `smoking` will give us the causal effect of `coffee` on `dementia`, which is zero.

> In general, causal inference always and everywhere depends on _assumptions_ about the DGP (i.e., the data never "speaks for itself").
]

.pull-right[
![](02-basic-ml-concepts_files/figure-html/unnamed-chunk-32-1..svg)<!-- -->
]

???

The title of this slide, "To Explain or to Predict?", is the name of a must-read [paper](https://projecteuclid.org/euclid.ss/1294167961) by <a name=cite-shmueli2010explain></a>[Shmueli (2010)](#bib-shmueli2010explain) that clarifies the distinction between predicting and explaining.

---

# ML in aid of econometrics

Consider the standard "treatment effect regression":

`$$Y_i=\alpha+\underbrace{\tau D_i}_{\text{low dimensional}} +\underbrace{\sum_{j=1}^{p}\beta_{j}X_{ij}}_{\text{high dimensional}}+\varepsilon_i,\quad\text{for }i=1,\dots,n$$`

where

+ An outcome `\(Y_i\)`
+ A treatment assignment `\(D_i\in\{0,1\}\)`
+ A vector of `\(p\)` control variables `\(X_i\)`

Our object of interest is often `\(\hat{\tau}\)`, the average treatment effect (ATE).

---
class: .title-slide-final, center, inverse, middle

# `slides::end()`

[<i class="fa fa-github"></i> Source code](https://github.com/ml4econ/notes-spring2019/tree/master/02-basic-ml-concepts)

---

# References

<a name=bib-agrawal2018prediction></a>[[1]](#cite-agrawal2018prediction) A. Agrawal, J. Gans and A. Goldfarb. _Prediction Machines: The Simple Economics of Artificial Intelligence_. Harvard Business Review Press, 2018.

<a name=bib-athey2018the></a>[[2]](#cite-athey2018the) S. Athey. "The impact of machine learning on economics". In: _The Economics of Artificial Intelligence: An Agenda_. University of Chicago Press, 2018.

<a name=bib-breiman2001statistical></a>[[3]](#cite-breiman2001statistical) L. Breiman. "Statistical modeling: The two cultures (with comments and a rejoinder by the author)". In: _Statistical science_ 16.3 (2001), pp. 199-231.

<a name=bib-harrison1978hedonic></a>[[4]](#cite-harrison1978hedonic) D. Harrison Jr and D. L. Rubinfeld. "Hedonic housing prices and the demand for clean air". In: _Journal of environmental economics and management_ 5.1 (1978), pp. 81-102.

<a name=bib-shmueli2010explain></a>[[5]](#cite-shmueli2010explain) G. Shmueli. "To explain or to predict?" In: _Statistical science_ 25.3 (2010), pp. 289-310. --- # References <a name=bib-sutton2018rli></a>[[1]](#cite-sutton2018rli) R. S. Sutton and A. G. Barto. _Reinforcement Learning: An Introduction_. MIT Press, 2018.