class: center, middle, inverse, title-slide

.title[
# Lecture .mono[002]
]

.subtitle[
## Model accuracy and selection
]

.author[
### Edward Rubin
]

---
exclude: true

---
layout: true
# Admin

---
class: inverse, middle

---
name: admin-today

## Today

.b[In-class]
- Model accuracy
- Loss for regression and classification
- The bias-variance tradeoff
- The Bayes classifier
- KNN

---
name: admin-soon
# Admin

## Upcoming

.b[Readings]
- .note[Today]
  - Finish .it[ISL] Ch. 2
  - [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015)
- .note[Next]
  - .it[ISL] Ch. 3–4

---
layout: true
# Model accuracy

---
class: inverse, middle

---
name: accuracy-review
## Review: Supervised learning

1. Using .hi-slate[training data] `\(\left( \color{#FFA500}{\mathbf{y}},\, \color{#6A5ACD}{\mathbf{X}} \right)\)`, we train `\(\hat{\color{#20B2AA}{f}}\)`, estimating `\(\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f}\!(\color{#6A5ACD}{\mathbf{X}}) + \varepsilon\)`.

--

1. Using this estimated model `\(\hat{\color{#20B2AA}{f}}\)`, we .it[can] calculate .hi-slate[training MSE]
`$$\color{#314f4f}{\text{MSE}_\text{train}} = \dfrac{1}{n} \sum_{i=1}^n \underbrace{\left[ \color{#FFA500}{\mathbf{y}}_i - \hat{\color{#20B2AA}{f}}\!\left( \color{#6A5ACD}{x}_i \right) \right]^{2}}_{\text{Squared error}} = \dfrac{1}{n} \sum_{i=1}^n \left[ \color{#FFA500}{\mathbf{y}}_i - \hat{\color{#FFA500}{\mathbf{y}}}_i \right]^2$$`
.note[Note:] Assuming `\(\color{#FFA500}{\mathbf{y}}\)` is numeric (regression problem).

--

1. We want the model to accurately predict previously unseen (.hi-pink[test]) data. This goal is sometimes called .attn[generalization] or .attn[external validity].

.center[
Average `\(\left[ \color{#e64173}{y_0} - \hat{\color{#20B2AA}{f}}\!\left( \color{#e64173}{x_0} \right) \right]^2\)` for obs. `\(\left( \color{#e64173}{y_0},\, \color{#e64173}{x_0} \right)\)` in our .hi-pink[test data].
]

---
## Errors

The item at the center of our focus is the (test-sample) .b[prediction error]
`$$\color{#FFA500}{\mathbf{y}}_i - \hat{\color{#20B2AA}{f}}\!\left( \color{#6A5ACD}{x}_i \right) = \color{#FFA500}{\mathbf{y}}_i - \hat{\color{#FFA500}{\mathbf{y}}}_i$$`
the difference between the label `\(\left( \color{#FFA500}{\mathbf{y}} \right)\)` and its prediction `\(\left( \hat{\color{#FFA500}{\mathbf{y}}} \right)\)`.

The distance (_i.e._, non-negative value) between a true value and its prediction is often called .b[loss].

---
name: loss-functions
## Loss functions

.b[Loss functions] aggregate and quantify loss.

.white[►] .b[L1] loss function: `\(\sum_i \big| y_i - \hat{y}_i \big|\)` .b[Mean abs. error]: `\(\dfrac{1}{n}\sum_i \big| y_i - \hat{y}_i \big|\)`

.white[►] .b[L2] loss function: `\(\sum_i \left( y_i - \hat{y}_i \right)^2\)` .b[Mean squared error]: `\(\dfrac{1}{n}\sum_i \left( y_i - \hat{y}_i \right)^2\)`

--

Notice that .b[both functions impose assumptions].
1. Both assume .b[overestimating] is just as bad as .b[underestimating].
2. Both assume errors are similarly bad for .b[all individuals] `\((i)\)`.
3. They differ in their assumptions about the .b[magnitude of errors].
  - .b[L1] an additional unit of error is .b[equally bad] everywhere.
  - .b[L2] an additional unit of error is .b[worse] when the error is already big.
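---
## Loss functions in code

Here is a minimal sketch of these two loss functions in R. (The simulated data and variable names below are just for illustration; this is not the dataset used in the figures that follow.)

```r
# Simulate a toy training dataset
set.seed(123)
x = runif(100)
y = 3 + 2 * x + rnorm(100, sd = 0.5)
# Fit a simple linear regression and grab its fitted values
est = lm(y ~ x)
y_hat = predict(est)
# L1 loss and its average (mean absolute error)
sum(abs(y - y_hat))
mean(abs(y - y_hat))
# L2 loss and its average (mean squared error)
sum((y - y_hat)^2)
mean((y - y_hat)^2)
```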
---
layout: true
class: clear, middle

---
A very simple, univariate dataset `\(\left(\mathbf{y},\, \mathbf{x} \right)\)`

<img src="slides_files/figure-html/graph loss 1-1.svg" style="display: block; margin: auto;" />

---
... on which we run a .pink[simple linear regression].

<img src="slides_files/figure-html/graph loss 2-1.svg" style="display: block; margin: auto;" />

---
Each point `\(\left( y_i,\, x_i \right)\)` has an associated .grey-mid[loss] (error).

<img src="slides_files/figure-html/graph loss 3-1.svg" style="display: block; margin: auto;" />

---
The L1 loss function weights all errors equally: `\(\sum_i \big| y_i - \hat{y}_i \big|\)`

<img src="slides_files/figure-html/graph loss 4-1.svg" style="display: block; margin: auto;" />

---
The L2 loss function .it[upweights] large errors: `\(\sum_i \left( y_i - \hat{y}_i \right)^2\)`

<img src="slides_files/figure-html/graph loss 5-1.svg" style="display: block; margin: auto;" />

---
name: overfitting
layout: false
# Model accuracy
## Overfitting

So what's the big deal? (.attn[Hint:] Look up.)

--

We're facing a tradeoff—increasing model flexibility
- offers potential to better fit complex systems
- risks overfitting our model to the training data

--

We can see these tradeoffs in our .hi-pink[test MSE] (but not the .hi-slate[training MSE]).

---
class: clear, middle
layout: true

---
exclude: true

---
name: ex-nonlinear-splines

.hi-slate[Training data] and example models (splines)

<img src="slides_files/figure-html/plot data for flexibility-1.svg" style="display: block; margin: auto;" />

---

<img src="slides_files/figure-html/plot flexibility-1.svg" style="display: block; margin: auto;" />

---
The previous example has a pretty nonlinear relationship.

.qa[Q] What happens when the truth is actually linear?

---
exclude: true

---
name: ex-linear-splines

.hi-slate[Training data] and example models (splines)

<img src="slides_files/figure-html/plot data for linear flexibility-1.svg" style="display: block; margin: auto;" />

---

<img src="slides_files/figure-html/plot linear flexibility-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Model accuracy

---
## Solutions?

Clearly we don't want to overfit our .hi-slate[training data].
<br>It looks like our .hi-pink[testing data] can help.

.qa[Q] How about the following routine?

1. train a model `\(\hat{\color{#20B2AA}{f}}\)` on the .hi-slate[training data]
2. use the .hi-pink[test data] to "tune" the model's flexibility
3. repeat steps 1–2 until we find the optimal level of flexibility

--

.qa[A] .b[No]
--
.b[!]
--
.b[!!]
--

This is an algorithm for .b[overfitting your .hi-pink[test data]].

Okay... so maybe that was an overreaction, but we need to be careful.

---
name: bias-variance

This tradeoff that we keep coming back to has an official name:
<br>.b[bias-variance tradeoff]. .grey-light[(or the variance-bias tradeoff)]

--

.b[Variance] The amount `\(\hat{\color{#20B2AA}{f}}\)` would change with a different .hi-slate[training sample].
- If new .hi-slate[training sets] drastically change `\(\hat{\color{#20B2AA}{f}}\)`, then we have a lot of uncertainty about `\(\color{#20B2AA}{f}\)` (and, in general, `\(\hat{\color{#20B2AA}{f}} \not\approx \color{#20B2AA}{f}\)`).
- More flexible models generally add variance to `\(\hat{\color{#20B2AA}{f}}\)`.

--

.b[Bias] The error that comes from inaccurately estimating `\(\color{#20B2AA}{f}\)`.
- More flexible models are better equipped to recover complex relationships `\(\left( \color{#20B2AA}{f} \right)\)`, reducing bias. (Real life is seldom linear.)
- Simpler (less flexible) models typically increase bias.
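---
## Seeing the tradeoff in code

A small simulation sketch (a toy example, not the spline exercise from the figures above): fit polynomials of increasing degree to a nonlinear truth, then compare .hi-slate[training MSE] and .hi-pink[test MSE].

```r
# Simulate training and test samples from the same nonlinear truth
set.seed(101)
n = 100
x_train = runif(n); y_train = sin(2 * pi * x_train) + rnorm(n, sd = 0.3)
x_test  = runif(n); y_test  = sin(2 * pi * x_test)  + rnorm(n, sd = 0.3)
mse = function(y, y_hat) mean((y - y_hat)^2)
# Flexibility = polynomial degree (1 = linear; 15 = very flexible)
sapply(1:15, function(d) {
  fit = lm(y_train ~ poly(x_train, d))
  c(train = mse(y_train, predict(fit)),
    test  = mse(y_test,  predict(fit, newdata = data.frame(x_train = x_test))))
})
```

Training MSE keeps falling as the degree rises; test MSE should eventually turn back up (the U shape we keep describing).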
---
## The bias-variance tradeoff, formally

The expected value.super[.pink[†]] of the .hi-pink[test MSE] can be written

$$ `\begin{align} \mathop{E}\left[ \left(\color{#FFA500}{\mathbf{y}}_0 - \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}}_0) \right)^2 \right] = \underbrace{\mathop{\text{Var}} \left( \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}}_0) \right)}_{(1)} + \underbrace{\left[ \text{Bias}\left( \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}}_0) \right) \right]^2}_{(2)} + \underbrace{\mathop{\text{Var}} \left( \varepsilon \right)}_{(3)} \end{align}` $$

.footnote[ .pink[†] Think: .it[mean] or .it[tendency] ]

--

.qa[Q.sub[1]] What does this formula tell us? (Think intuition/interpretation.)
<br>.qa[Q.sub[2]] How does model flexibility feed into this formula?
<br>.qa[Q.sub[3]] What does this formula say about minimizing .hi-pink[test MSE]?

--

.qa[A.sub[2]] In general, model flexibility increases (1) and decreases (2).

--

<br>.qa[A.sub[3]] The rates at which variance rises and (squared) bias falls pin down the optimal flexibility.
<br> We often see U-shaped curves of .hi-pink[test MSE] w.r.t. model flexibility.

---
layout: false
class: clear, middle

.b[U-shaped test MSE] w.r.t. model flexibility
<br>Increases in variance eventually overcome reductions in (squared) bias.

<img src="slides_files/figure-html/plot flexibility again-1.svg" style="display: block; margin: auto;" />

---
# Model accuracy
## Bias-variance tradeoff

The bias-variance tradeoff is key to understanding many ML concepts.

- Loss functions and model performance
- Overfitting and model flexibility
- Training and testing (and cross validating)

Spend some time thinking about it and building intuition.
<br>It's time well spent.

---
class: clear, middle

So far we've focused on regression problems; what about classification?

---
layout: true
# Model accuracy

---
name: class-review
## Classification problems

.note[Recall] We're still supervised, but now we're predicting categorical labels.

With categorical variables, MSE doesn't work—_e.g._,

.center[
`\(\color{#FFA500}{\mathbf{y}} - \hat{\color{#FFA500}{\mathbf{y}}} =\)` .orange[(Chihuahua)] - .orange[(Blueberry muffin)] `\(=\)` not math (does not compute)
]

Clearly we need a different way to define model performance.

---
## Classification problems

The most common approach is exactly what you'd guess...

.hi-slate[Training error rate] The share of training predictions that we get wrong.

$$ `\begin{align} \dfrac{1}{n} \sum_{i=1}^{n} \mathbb{I}\!\left( \color{#FFA500}{y}_i \neq \hat{\color{#FFA500}{y}}_i \right) \end{align}` $$

where `\(\mathbb{I}\!\left( \color{#FFA500}{y}_i \neq \hat{\color{#FFA500}{y}}_i \right)\)` is an indicator function that equals 1 whenever our prediction is wrong.

--

.hi-pink[Test error rate] The share of test predictions that we get wrong.

.center[
Average `\(\mathbb{I}\!\left( \color{#FFA500}{y}_0 \neq \hat{\color{#FFA500}{y}}_0 \right)\)` in our .hi-pink[test data]
]

---
layout: false
class: clear

<img src="images/chihuahua-muffin.jpg" width="72%" style="display: block; margin: auto;" />
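---
# Model accuracy
## Error rates in code

The error rate is just the share of predictions we get wrong. A tiny hand-built sketch in R (the labels below are made up for illustration):

```r
# True labels and a classifier's predictions
y     = c("muffin", "muffin", "chihuahua", "chihuahua", "muffin")
y_hat = c("muffin", "chihuahua", "chihuahua", "chihuahua", "muffin")
# Error rate: (1/n) times the sum of I(y != y_hat)
mean(y != y_hat)
#> [1] 0.2
```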
---
layout: true
# Model accuracy

---
name: bayes
## The Bayes classifier

.note[Recall] .hi-pink[Test error rate] is the share of test predictions that we get wrong.

.center[
Average `\(\mathbb{I}\!\left( \color{#FFA500}{y}_0 \neq \hat{\color{#FFA500}{y}}_0 \right)\)` in our .hi-pink[test data]
]

The .b[Bayes classifier] is the classifier that assigns an observation to its most probable group, given the values of its predictors, _i.e._,

.center[
Assign obs. `\(i\)` to the class `\(j\)` for which `\(\mathop{\text{Pr}}\left(\color{#FFA500}{\mathbf{y}} = j | \color{#6A5ACD}{\mathbf{X}} = \mathbf{x}_0\right)\)` is the largest
]

The .b[Bayes classifier] minimizes the .hi-pink[test error rate].

--

.note[Note] `\(\mathop{\text{Pr}}\left(\mathbf{y}=j|\mathbf{X}=x_0\right)\)` is the probability that random variable `\(\mathbf{y}\)` equals `\(j\)`, given.super[.pink[†]] the variable(s) `\(\mathbf{X} = x_0\)`.

.footnote[ .pink[†] The "given" is also read as "conditional on". Think of it as subsetting to where `\(X=x_0\)`. ]

---
## The Bayes classifier

.note[Example]
- Pr(.orange[y] = "chihuahua" | .purple[X] = "orange and purple") = 0.3
- Pr(.orange[y] = "blueberry muffin" | .purple[X] = "orange and purple") = 0.4
- Pr(.orange[y] = "squirrel" | .purple[X] = "orange and purple") = 0.2
- Pr(.orange[y] = "other" | .purple[X] = "orange and purple") = 0.1

Then the Bayes classifier says we should predict "blueberry muffin".

---
## The Bayes classifier

More notes on the Bayes classifier

1. In the .b[two-class case], we're basically looking for<br> `\(\text{Pr}(\color{#FFA500}{\mathbf{y}}=j|\color{#6A5ACD}{\mathbf{X}}=x_0)>0.5\)` for one class.
1. The .b[Bayes decision boundary] is the set of points where the most likely groups are equally probable (_i.e._, exactly 50% each with two groups).
1. The Bayes classifier produces the lowest possible .hi-pink[test error rate], which is called the .b[Bayes error rate].
1. Just as with `\(\color{#20B2AA}{f}\)`, the probabilities `\(\mathop{\text{Pr}}\left(\color{#FFA500}{\mathbf{y}}=j|\color{#6A5ACD}{\mathbf{X}}=x_0\right)\)` that the Bayes classifier relies upon are .b[unknown]. We have to estimate them.

---
exclude: true

---
layout: true
class: clear, middle

---
name: ex-bayes

The .hi-pink[Bayes decision boundary] between classes .orange[A] and .navy[B]

<img src="slides_files/figure-html/plot bayes boundary-1.svg" style="display: block; margin: auto;" />

---
Now we sample...

<img src="slides_files/figure-html/plot bayes sample-1.svg" style="display: block; margin: auto;" />

---
... and our sample gives us an .hi-purple[estimated decision boundary].

<img src="slides_files/figure-html/plot bayes est boundary-1.svg" style="display: block; margin: auto;" />

---
And a new sample gives us another .hi-turquoise[estimated decision boundary].

<img src="slides_files/figure-html/plot bayes est boundary 2-1.svg" style="display: block; margin: auto;" />

---
One non-parametric way to estimate these unknown conditional probabilities: K-nearest neighbors (KNN).

---
layout: true
# K-nearest neighbors

---
name: knn-setup
## Setup

K-nearest neighbors (KNN) simply assigns a category based upon the votes of the K nearest neighbors (their observed classes).

--

.note[More formally:] Using the K closest neighbors.super[.pink[†]] to test observation `\(\color{#6A5ACD}{\mathbf{x_0}}\)`, we calculate the share of the observations whose class equals `\(j\)`,

$$ `\begin{align} \hat{\mathop{\text{Pr}}}\left(\mathbf{y} = j | \mathbf{X} = \color{#6A5ACD}{\mathbf{x_0}}\right) = \dfrac{1}{K} \sum_{i \in \mathcal{N}_0} \mathop{\mathbb{I}}\left( \color{#FFA500}{\mathbf{y}}_i = j \right) \end{align}` $$

These shares are our estimates for the unknown conditional probabilities.

We then assign observation `\(\color{#6A5ACD}{\mathbf{x_0}}\)` to the class with the highest probability.

.footnote[ .pink[†] In `\(\color{#6A5ACD}{\mathbf{X}}\)` space. ]
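---
## KNN by hand

A minimal sketch of the estimator above for a single test observation, using simulated data. (In practice you would reach for an existing implementation, _e.g._, `class::knn()`; everything below is illustrative.)

```r
# Simulated training data: two predictors and a two-class label
set.seed(102)
n = 200
X = matrix(runif(n * 2), ncol = 2)
y = ifelse(X[, 1] + X[, 2] + rnorm(n, sd = 0.3) > 1, "A", "B")
# A test observation and the number of neighbors
x0 = c(0.5, 0.5)
K = 5
# 1. Find the K nearest neighbors (Euclidean distance in X space)
dist_to_x0 = sqrt(rowSums((X - matrix(x0, n, 2, byrow = TRUE))^2))
neighbors = order(dist_to_x0)[1:K]
# 2. Estimate Pr(y = j | X = x0) as each class's share of the neighbors
prop.table(table(y[neighbors]))
# 3. Assign x0 to the class with the highest estimated share
names(which.max(table(y[neighbors])))
```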
---
name: knn-fig
layout: false
class: clear, middle

.b[KNN in action]
<br>.note[Left:] K=3 estimation for "x". .note[Right:] KNN decision boundaries.

<img src="images/isl-knn.png" width="2867" style="display: block; margin: auto;" />

.smaller.it[Source: ISL]

---
class: clear, middle

The choice of K is very important: small K yields a very flexible classifier; large K yields an inflexible one.

---
name: ex-knn
class: clear, middle

Decision boundaries: .hi-pink[Bayes], .hi-purple[K=1], and .hi-turquoise[K=60]

<img src="slides_files/figure-html/plot knn k-1.svg" style="display: block; margin: auto;" />

---
class: clear, middle

.b[KNN error rates], as K increases

<img src="images/isl-knn-error.png" width="85%" style="display: block; margin: auto;" />

.smaller.it[Source: ISL]

---
# Model accuracy
## Summary

The bias-variance tradeoff is central to quality prediction.

- Relevant for classification and regression settings
- Benefits and costs of increasing model flexibility
- U-shaped test error curves
- Avoid overfitting—including in test data

---
name: sources
layout: false
# Sources

These notes draw upon

- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)<br>James, Witten, Hastie, and Tibshirani
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<br>Jake VanderPlas
- ['Chihuahua or Muffin' is from Twitter](https://twitter.com/teenybiscuit/status/705232709220769792)

---
# Table of contents

.col-left[
.smallest[

#### Admin
- [Today](#admin-today)
- [Upcoming](#admin-soon)

#### Model accuracy: Regression
- [Review](#accuracy-review)
- [Loss (functions)](#loss-functions)
- [Overfitting](#overfitting)
- [The bias-variance tradeoff](#bias-variance)

#### Model accuracy: Classification
- [Returning to classification](#class-review)
- [The Bayes classifier](#bayes)

#### KNN
- [Setup](#knn-setup)
- [Figures](#knn-fig)
]
]

.col-right[
.smallest[

#### Examples
- [Train *vs.* test MSE: Nonlinear truth](#ex-nonlinear-splines)
- [Train *vs.* test MSE: Linear truth](#ex-linear-splines)
- [Bayes decision boundaries](#ex-bayes)
- [KNN choice of K](#ex-knn)

#### Other
- [Sources/references](#sources)
]
]