class: center, middle, inverse, title-slide .title[ # Topic 13:
Machine Learning Fundamentals ] .author[ ### Nick Hagerty*
ECNS 460/560
Montana State University ] .date[ ###
.small[*Slides 6-17 and 27-43 are adapted from
“Prediction and machine-learning in econometrics”
by Ed Rubin, used with permission, and excluded from this resource’s overall CC license.] ] --- name: toc <style type="text/css"> .text-small { font-size: 50%; } .text-big { font-size: 150%; } .small { font-size: 75%; } </style> # Table of contents 1. [Overview: Statistical learning](#learning) 1. [Assessing model accuracy](#accuracy) 1. [Cross-validation](#resampling) --- class: inverse, middle name: learning # Overview: Statistical learning <!-- --- --> <!-- # Review: Goals of data analysis --> <!-- There are **3 main purposes** of data analysis: --> <!-- 1. .hi-turquoise[**Descriptive analysis:**] Characterize observed patterns among variables. --> <!-- 2. .hi[**Causal inference:**] Learn how Y changes as a result of an active intervention to change X. --> <!-- 3. .hi-purple[**Prediction:**] Predict the value of one variable from other information. --> <!-- Key differences among the 3 goals: --> <!-- 1. **Focus:** Prediction focuses on the outcome `\(\color{#6A5ACD}{\hat{Y_{i}}}\)`, causal inference on a coefficient `\(\color{#e64173}{\hat{\beta_0}}\)`. --> <!-- 2. **Selection bias** is a problem for causal inference, but not predictive or descriptive analysis. --> <!-- 3. **Interpretion:** Coefficients have meaning in causal inference and descriptive analysis, but little meaning in prediction. --> --- # Prediction `$$\color{#6A5ACD}{Y_i} = f(x_{0i}, x_{1i}, ..., x_{Ni}) = \color{#e64173}{\beta_0} x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i$$` .hi-purple[**Prediction:**] Want to estimate `\(\color{#6A5ACD}{\hat{Y_{i}}}\)` given observed data `\(x_{0i}, x_{1i}, x_{2i}, ....\)`. - Does not matter whether your model is "correct" (the true DGP), only whether it *works.* -- The idea is that we will: 1. **Train** a model on data for which we know both `\(X\)` and `\(Y\)`. 2. **Apply** the model to new situations where we know `\(X\)` but not `\(Y\)`. --- layout: true # Statistical learning --- The class of methods for doing prediction is called **statistical learning** or **machine learning.** First, a few definitions... --- ## Supervised vs. unsupervised Statistical learning is generally divided into two classes: 1. .hi-slate[Supervised learning] builds ("learns") a statistical model for predicting an .hi-orange[output] `\(\left( \color{#FFA500}{\mathbf{y}} \right)\)` given a set of .hi-purple[inputs] `\(\left( \color{#6A5ACD}{x_{1},\, \ldots,\, x_{p}} \right)\)`. -- 2. .hi-slate[Unsupervised learning] learns relationships and structure using only .hi-purple[inputs] `\(\left( \color{#6A5ACD}{x_{1},\, \ldots,\, x_{p}} \right)\)` without any *supervising* output — letting the data "speak for itself." --- class: clear, middle layout: false <img src="images/comic-learning.jpg" style="display: block; margin: auto;" /> .it[.smaller[[Source](https://twitter.com/athena_schools/status/1063013435779223553)]] --- layout: true # Statistical learning --- ## Classification vs. Regression .hi-slate[Supervised learning] is broken into two types, based on what kind of .hi-orange[output] we want to predict: 1. .hi-slate[Classification tasks] for which the values of `\(\color{#FFA500}{\mathbf{y}}\)` are discrete categories. <br>*E.g.*, race, sex, loan default, hazard, disease, flight status 2. .hi-slate[Regression tasks] in which `\(\color{#FFA500}{\mathbf{y}}\)` takes on continuous, numeric values. <br>*E.g.*, price, arrival time, number of emails, temperature .note[Note] The use of .it[regression] differs from our use of .it[linear regression]. 
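---

## Classification vs. Regression in code

To make the two task types concrete, here is a minimal sketch in R, using the built-in `mtcars` data purely for illustration (these are just one possible method for each task type, not the only ones):

```r
data(mtcars)

# Regression task: the target (mpg) takes continuous, numeric values
reg_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Classification task: the target (am, transmission type) is a discrete category
cls_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
```

Here `glm()` with `family = binomial` is logistic regression: a method with *regression* in its name that nonetheless solves a *classification* task, which is exactly the naming distinction flagged in the note above.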
---

## Translating terms

`\(\color{#6A5ACD}{\mathbf{X}}\)` (treatment variable/covariates, independent variables, regressors)
- Now: **predictors, features.**

`\(\color{#FFA500}{\mathbf{Y}}\)` (outcome variable, dependent variable)
- Now: **target, label.**

"Estimate a model" or "fit a model"
- Now: **train** a model, **learn** a model.

---

## General framework

A .turquoise[function] `\(\color{#20B2AA}{f}\)` takes .purple[inputs] `\(\color{#6A5ACD}{\mathbf{X}} = \color{#6A5ACD}{\mathbf{x}_1}, \ldots, \color{#6A5ACD}{\mathbf{x}_p}\)` and maps them to the .orange[output], along with a random, mean-zero .pink[error term] `\(\color{#e64173}{\varepsilon}\)`.

`$$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f} \! \left( \color{#6A5ACD}{\mathbf{X}} \right) + \color{#e64173}{\varepsilon}$$`

If we can estimate `\(\hat{\color{#20B2AA}{f}}\)`, then we can use `\(\color{#6A5ACD}{\mathbf{X}}_i\)` to generate predictions `\(\hat{\color{#FFA500}{\mathbf{y}}}_i\)`.

<!-- The accuracy of `\(\hat{\color{#FFA500}{\mathbf{y}}}\)` depends upon .hi-slate[two errors]: -->
<!-- 1. .hi-slate[Reducible error] The error due to `\(\hat{\color{#20B2AA}{f}}\)` imperfectly estimating `\(\color{#20B2AA}{f}\)`. -->
<!-- <br>*Reducible* in the sense that we can improve `\(\hat{\color{#20B2AA}{f}}\)`. -->
<!-- 1. .hi-slate[Irreducible error] The error component that is outside of the model `\(\color{#20B2AA}{f}\)`. -->
<!-- <br>*Irreducible* because we defined an error term `\(\color{#e64173}{\varepsilon}\)` unexplained by `\(\color{#20B2AA}{f}\)`. -->
<!-- Our goal is to minimize .hi-slate[reducible error]. -->
<!-- --- -->
<!-- ## How to predict -->
<!-- **How do we minimize reducible error** (and form the best predictions)? -->
<!-- The basic workflow for predictive analysis: -->
<!-- 1. **Choose** a model. -->
<!-- 2. **Train** the model (estimate its parameters). -->
<!-- 3. **Assess** its performance. -->
<!-- 4. Repeat steps 1-3 for different models, and choose the **best** model. -->
<!-- -- -->
<!-- .note[Note:] "Different models" can mean: -->
<!-- * Either completely distinct models (e.g., OLS vs. *k*-nearest neighbors). -->
<!-- * Or a set of closely related models differing by a **hyperparameter**. -->
<!-- - E.g., polynomial regression (hyperparameter: polynomial degree). -->

---

## How to predict

The basic workflow for predictive analysis:

1. **Choose** a model.
2. **Train** the model (estimate its parameters).
3. **Assess** its performance.
4. Repeat steps 1-3 for different models, and choose the **best** model.

<!-- -- -->
<!-- .qa[Q] Why don't we do this for causal inference? -->
<!-- - In causal inference, the best model is determined by outside knowledge (i.e., which research design is most plausible). -->
<!-- - In prediction, the best model is simply the one that performs best. -->

---
layout: false
class: inverse, middle
name: accuracy

# Assessing model accuracy

---
layout: true

# Model accuracy

---

## Loss

*Prediction error* is defined as:

`$$\color{#FFA500}{\mathbf{y}}_i - \hat{\color{#FFA500}{\mathbf{y}}}_i$$`

<!-- `$$\color{#FFA500}{\mathbf{y}}_i - \hat{\color{#20B2AA}{f}}\!\left( \color{#6A5ACD}{x}_i \right) = \color{#FFA500}{\mathbf{y}}_i - \hat{\color{#FFA500}{\mathbf{y}}}_i$$` -->

the difference between the label `\(\left( \color{#FFA500}{\mathbf{y}} \right)\)` and its prediction `\(\left( \hat{\color{#FFA500}{\mathbf{y}}} \right)\)`.

The (absolute) distance between a true value and its prediction is often called .b[loss].
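For concreteness, a minimal sketch in base R (the simulated data and the simple linear model are purely illustrative assumptions):

```r
set.seed(1)
train   <- data.frame(x = runif(100))        # simulated predictor
train$y <- 2 + 3 * train$x + rnorm(100)      # simulated outcome

fit    <- lm(y ~ x, data = train)            # train a simple model
y_hat  <- predict(fit, newdata = train)      # predictions
error  <- train$y - y_hat                    # prediction error: y_i - y_hat_i
loss   <- abs(error)                         # (absolute) loss per observation
```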
The way you choose to use loss to measure model performance is called a **loss function.** --- name: accuracy-subtlety ## What loss function should we choose? How do we compare loss across observations? Many questions: - Which do you prefer? 1. Lots of little errors and a few really large errors. 1. Medium-sized errors for everyone. - Is a 1-unit error (*e.g.*, $1,000) equally bad for everyone? - Is an overestimate equally bad as an underestimate? <!-- --- --> <!-- ## Subtlety --> <!-- Defining performance can be tricky... --> <!-- *Classification:* --> <!-- - Which is worse? --> <!-- 1. False positive (*e.g.*, incorrectly diagnosing cancer) --> <!-- 1. False negative (*e.g.*, missing cancer) --> <!-- - Which is more important? --> <!-- 1. True positive (*e.g.*, correct diagnosis of cancer) --> <!-- 1. True negative (*e.g.*, correct diagnosis of "no cancer") --> --- name: mse ## Most common: MSE .attn[Mean squared error (MSE)] is the most common loss function in a regression setting. (Not necessarily the best.) `$$\text{MSE} = \dfrac{1}{n} \sum_{i=1}^n \left[ \color{#FFA500}{y}_i - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{x}_i) \right]^2$$` Note: 1. MSE is small when .hi-slate[prediction error] is small. 1. MSE .hi-slate[penalizes] big errors more than small errors (the squared part). --- name: overfitting layout: false # Model accuracy ## Overfitting Low MSE on the data that trained the model is not necessarily impressive — maybe the model is just **overfitting** our data. **Tradeoff:** More flexible models... - might better fit complex systems (lower bias). - but also might falsely interpret noise as signal (higher variance). --- name: training-testing # Model accuracy ## Training vs. testing data **Our real goal:** A model that performs well *on data it has never seen.* To avoid overfitting, we must: 1. Split our data into a **training set** and a **test set**. 2. Use the **training set** only to train our model. 3. Use the **test set** only to measure model performance. --- class: clear, middle layout: true --- Fitting a polynomial of degree **1** <img src="13-Learning_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> --- Fitting a polynomial of degree **2** <img src="13-Learning_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> --- Fitting a polynomial of degree **3** <img src="13-Learning_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" /> --- Fitting a polynomial of degree **4** <img src="13-Learning_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" /> --- Fitting a polynomial of degree **5** <img src="13-Learning_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> --- Fitting a polynomial of degree **7** <img src="13-Learning_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> --- Fitting a polynomial of degree **11** <img src="13-Learning_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> --- Fitting a polynomial of degree **17** <img src="13-Learning_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" /> --- Fitting a polynomial of degree **25** <img src="13-Learning_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> --- layout: false name: bias-variance # Model accuracy ## The bias-variance tradeoff Finding the optimal level of flexibility highlights the .hi-pink[bias]-.hi-purple[variance] .b[tradeoff.] 
.hi-pink[Bias:] The error that comes from modeling `\(\color{#20B2AA}{f}\)` with the wrong structure. - More flexible models are better equipped to recover complex relationships `\(\left( \color{#20B2AA}{f} \right)\)`, reducing bias. - Models that are too simple have high bias. .hi-purple[Variance:] The amount `\(\hat{\color{#20B2AA}{f}}\)` would change with a different .hi-slate[training sample] - If new .hi-slate[training sets] drastically change `\(\hat{\color{#20B2AA}{f}}\)`, then we have a lot of uncertainty about `\(\color{#20B2AA}{f}\)`. - Models that are too flexible have high variance. --- # Model accuracy ## The bias-variance tradeoff The expected value of the .hi-pink[test MSE] can be written $$ `\begin{align} \mathop{E}\left[ \left(\color{#FFA500}{\mathbf{y_0}} - \mathop{\hat{\color{#20B2AA}{f}}}\left(\color{#6A5ACD}{\mathbf{X}_0}\right) \right)^2 \right] = \underbrace{\mathop{\text{Var}} \left( \mathop{\hat{\color{#20B2AA}{f}}}\left(\color{#6A5ACD}{\mathbf{X}_0}\right) \right)}_{\text{Variance}} + \underbrace{\left[ \text{Bias}\left( \mathop{\hat{\color{#20B2AA}{f}}}\left(\color{#6A5ACD}{\mathbf{X}_0}\right) \right) \right]^2}_{\text{Bias}} + \underbrace{\mathop{\text{Var}} \left( \varepsilon \right)}_{\text{Irr. error}} \end{align}` $$ .b[The tradeoff] in terms of model flexibility: - At first, adding flexibility reduces bias more than it increases variance. - Later on, the bias reduction gets swamped out by increases in variance. - At some point, the marginal benefits of flexibility equal marginal costs. --- class: inverse, middle name: resampling # Cross-validation --- layout: false # Cross-validation **How do we choose the best model?** (e.g., degree of polynomial) - Based on test-set MSE? -- - **NO!!!** We will just overfit the test set itself! - The test set must be used *only for evaluation.* Instead, we must conduct model selection *entirely within the training set.* --- name: resampling-validation layout: true # Cross-validation ## The .it[validation set] approach **Hold out** a subset of our training set to *estimate* the test error. 1. Train each model using the .hi-purple[rest of the training set]. 2. Calculate MSE in the .hi-slate[validation set]. 3. Compare performance and choose the best model. --- --- <img src="13-Learning_files/figure-html/plot-validation-set-1.svg" style="display: block; margin: auto;" /> .col-left[.hi-purple[Full training set]] --- <img src="13-Learning_files/figure-html/plot-validation-set-2-1.svg" style="display: block; margin: auto;" /> .col-left[.hi-slate[Validation set]] .col-right[.hi-purple[Rest of training set]] --- <img src="13-Learning_files/figure-html/plot-validation-set-3-1.svg" style="display: block; margin: auto;" /> .col-left[.hi-slate[Validation set]] .col-right[.hi-purple[Rest of training set]] --- layout: true # Cross-validation ## *k*-fold cross-validation --- name: resampling-kcv Even better is to use .hi[k-fold cross validation]. 1. .b[Divide] the training data into `\(k\)` equally sized groups (folds). 2. .b[Iterate] over the `\(k\)` folds, treating each as a validation set once<br>(training the model on the other `\(k-1\)` folds). 3. .b[Average] the folds' MSEs to estimate test MSE. 
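---

A minimal sketch of these three steps in base R (the simulated data, the cubic fit, and `k = 5` are illustrative assumptions, not the only possible choices):

```r
set.seed(2)
train   <- data.frame(x = runif(200))
train$y <- sin(2 * pi * train$x) + rnorm(200, sd = 0.3)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))     # 1. divide rows into k folds

mse <- numeric(k)
for (i in 1:k) {                                        # 2. iterate over the folds
  fit    <- lm(y ~ poly(x, 3), data = train[folds != i, ])   # train on the other k-1 folds
  pred   <- predict(fit, newdata = train[folds == i, ])      # predict the held-out fold
  mse[i] <- mean((train$y[folds == i] - pred)^2)             # fold-specific MSE
}

cv_mse <- mean(mse)                                     # 3. average to estimate test MSE
```

The final line, averaging the fold-specific MSEs, is the estimate of test MSE that the following slides write out as a formula.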
--- exclude: true --- layout: true # Cross-validation ## *k*-fold cross-validation With `\(k\)`-fold cross validation, we estimate test MSE as $$ `\begin{align} \text{CV}_{(k)} = \dfrac{1}{k} \sum_{i=1}^{k} \text{MSE}_{i} \end{align}` $$ --- <img src="13-Learning_files/figure-html/plot-cvk-0a-1.svg" style="display: block; margin: auto;" /> Our `\(k=\)` 5 folds. --- <img src="13-Learning_files/figure-html/plot-cvk-0b-1.svg" style="display: block; margin: auto;" /> Each fold takes a turn at .hi-slate[validation]. The other `\(k-1\)` folds .hi-purple[train]. --- <img src="13-Learning_files/figure-html/plot-cvk-1-1.svg" style="display: block; margin: auto;" /> For `\(k=5\)`, fold number `\(1\)` as the .hi-slate[validation set] produces MSE.sub[k=1]. --- <img src="13-Learning_files/figure-html/plot-cvk-2-1.svg" style="display: block; margin: auto;" /> For `\(k=5\)`, fold number `\(2\)` as the .hi-slate[validation set] produces MSE.sub[k=2]. --- <img src="13-Learning_files/figure-html/plot-cvk-3-1.svg" style="display: block; margin: auto;" /> For `\(k=5\)`, fold number `\(3\)` as the .hi-slate[validation set] produces MSE.sub[k=3]. --- <img src="13-Learning_files/figure-html/plot-cvk-4-1.svg" style="display: block; margin: auto;" /> For `\(k=5\)`, fold number `\(4\)` as the .hi-slate[validation set] produces MSE.sub[k=4]. --- <img src="13-Learning_files/figure-html/plot-cvk-5-1.svg" style="display: block; margin: auto;" /> For `\(k=5\)`, fold number `\(5\)` as the .hi-slate[validation set] produces MSE.sub[k=5]. --- Relative to the validation set approach: - Uses all the data, so not as sensitive to exactly which observations end up in the validation set. - So produces a lower-variance estimate of the test-set MSE. --- layout: false # Summary .smaller[ .pull-left[ [**Prediction vs. causal inference**](#goal) - In causal inference we want to estimate the treatment effect `\(\hat{\beta}\)`. - In prediction problems we want to estimate the outcome value `\(\hat{Y_i}\)`. **[Statistical learning](#learning)** - Supervised vs. unsupervised learning. - Regression vs. classification. <!-- - Reducible vs. irreducible error. --> ] .pull-right[ **[Model accuracy](#accuracy)** - Models can be assessed using loss functions that combine prediction errors. - For regression problems, MSE is the most common loss function. - Using separate testing and training data avoids overfitting. **[Cross-validation](#resampling)** - Resampling methods avoid overfitting in model assessment and selection. - Validation set approach. <!-- - Leave-one-out cross-validation. --> - *k*-fold cross-validation. ] ]
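---

# Example: overfitting in R

A compact sketch tying together the train/test split and the polynomial-flexibility slides. All data here are simulated, and the 80/20 split and the grid of degrees are arbitrary illustrative choices:

```r
set.seed(3)
n     <- 300
dat   <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

# Split into a training set and a test set (the test set is used only for evaluation)
test_idx <- sample(n, size = round(0.2 * n))
test     <- dat[test_idx, ]
train    <- dat[-test_idx, ]

# Training vs. test MSE for polynomials of increasing flexibility
degrees <- c(1, 2, 3, 5, 11, 17)
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean((train$y - predict(fit, newdata = train))^2),
    test  = mean((test$y  - predict(fit, newdata = test))^2))
}))
rownames(mse) <- degrees
mse
```

Training MSE never increases as the degree rises, but test MSE typically falls at first and then climbs once the extra flexibility starts fitting noise.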