---
title: "Lecture .mono[001]"
subtitle: "Statistical learning: Foundations"
author: "Edward Rubin"
#date: "`r format(Sys.time(), '%d %B %Y')`"
date: "14 January 2020"
output:
xaringan::moon_reader:
css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
# self_contained: true
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
exclude: true
```{R, setup, include = F}
library(pacman)
p_load(
broom, tidyverse,
ggplot2, ggthemes, ggforce, ggridges, cowplot,
latex2exp, viridis, extrafont, gridExtra, plotly,
kableExtra, snakecase, janitor,
data.table, dplyr,
lubridate, knitr, future, furrr,
estimatr, FNN, caret, parsnip,
huxtable, here, magrittr
)
# Define colors
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```
---
layout: true
# Admin
---
class: inverse, middle
---
name: admin-today
## Today
.hi-slate[In-class]
- .note[Course website:] [https://github.com/edrubin/EC524W20/](https://github.com/edrubin/EC524W20/)
- .note[Resources]
- [RStudio](https://education.rstudio.com/learn/) cheatsheets, books, and tutorials
- [UO library](http://uoregon.libcal.com/calendar/dataservices/?cid=11979&t=g&d=0000-00-00&cal=11979,11173)
- See course page for more...
- Formalizing statistical learning, notation, goals (and problems)
---
layout: false
class: clear, middle
```{R, eugene r, echo = F}
knitr::include_graphics("images/eugene-r.png")
```
.smaller[[Tweet](https://twitter.com/ryann_crowley/status/1216880767072002048); [h/t: Grant McDermott](https://grantmcdermott.com/)]
---
name: admin-soon
# Admin
## Upcoming
.hi-slate[Readings]
- .note[Today]
- .it[ISL] Ch1–Ch2
- [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015)
- .note[Next]
- .it[ISL] Ch. 3–4
.hi-slate[Problem set] Likely assigned Thursday and due Tuesday.
---
layout: true
# Statistical learning
---
class: inverse, middle
---
name: sl-definition
## What is it?
--
.hi[Statistical learning] is a .attn[set of tools] developed .attn[to understand/model data].
--
Examples
- .hi-slate[Regression analysis] quantifies the relationship between an outcome and a set of explanatory variables—most usefully in a causal setting.
--
- .hi-slate[Exploratory data analysis] (EDA) is a preliminary, often graphical, "exploration" of data to understand levels, variation, missingness, *etc.*
--
- .hi-slate[Classification trees] search through explanatory variables, splitting along the most "predictive" dimensions (random forests extend trees).
--
- .hi-slate[Regression trees] extend *classification trees* to numerical outcomes (random forests extend, as well).
--
- .hi-slate[K-means clustering] partitions observations into K groups (clusters) based upon a set of variables.
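For a concrete taste of the last example, base .mono[R]'s `kmeans()` partitions observations into K clusters. A minimal sketch on simulated data (the data and names here are purely illustrative):
```{R, ex-kmeans}
# Simulate two loose clouds of points (illustrative data)
set.seed(101)
toy_df = data.frame(
  x1 = c(rnorm(50, mean = 0), rnorm(50, mean = 4)),
  x2 = c(rnorm(50, mean = 0), rnorm(50, mean = 4))
)
# Partition the 100 observations into K = 2 clusters
km = kmeans(toy_df, centers = 2)
# How many observations landed in each cluster?
table(km$cluster)
```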
---
name: sl-classes
## What is it good for?
--
A lot of things.
--
We tend to break statistical learning into two(-ish) classes:
1. .hi-slate[Supervised learning] builds ("learns") a statistical model for predicting an .hi-orange[output] $\left( \color{#FFA500}{\mathbf{y}} \right)$ given a set of .hi-purple[inputs] $\left( \color{#6A5ACD}{\mathbf{x}_{1},\, \ldots,\, \mathbf{x}_{p}} \right)$,
--
_i.e._, we want to build a model/function $\color{#20B2AA}{f}$
$$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f}\!\left( \color{#6A5ACD}{\mathbf{x}_{1},\, \ldots,\, \mathbf{x}_{p}} \right)$$
that accurately describes $\color{#FFA500}{\mathbf{y}}$ given some values of $\color{#6A5ACD}{\mathbf{x}_{1},\, \ldots,\, \mathbf{x}_{p}}$.
--
2. .hi-slate[Unsupervised learning] learns relationships and structure using only .hi-purple[inputs] $\left( \color{#6A5ACD}{\mathbf{x}_{1},\, \ldots,\, \mathbf{x}_{p}} \right)$ without any *supervising* output
--
—letting the data "speak for itself."
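A minimal sketch of the first (supervised) class in .mono[R]: simulate .purple[inputs] and a labeled .orange[output], then *learn* $\hat{\color{#20B2AA}{f}}$ with `lm()` (the DGP and its coefficients are invented for illustration):
```{R, ex-supervised}
# Simulate inputs and a labeled output: y = 1 + 2*x1 - x2 + noise
set.seed(101)
sim_df = data.frame(x1 = runif(100), x2 = runif(100))
sim_df$y = 1 + 2 * sim_df$x1 - 1 * sim_df$x2 + rnorm(100, sd = 0.1)
# Supervised learning: use the labeled output y to estimate f
f_hat = lm(y ~ x1 + x2, data = sim_df)
coef(f_hat)  # estimates should land near (1, 2, -1)
```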
---
layout: false
class: clear, middle
.hi-slate[Semi-supervised learning] falls somewhere between supervised and unsupervised learning—it is generally applied to supervised tasks when labeled .hi-orange[outputs] are incomplete.
---
class: clear, middle
```{R, comic, echo = F}
knitr::include_graphics("images/comic-learning.jpg")
```
.it[.smaller[[Source](https://twitter.com/athena_schools/status/1063013435779223553)]]
---
layout: true
# Statistical learning
---
## Output
We tend to further break .hi-slate[supervised learning] into two groups, based upon the .hi-orange[output] (the .orange[outcome] we want to predict):
--
1. .hi-slate[Classification tasks] for which the values of $\color{#FFA500}{\mathbf{y}}$ are discrete categories
*E.g.*, race, sex, loan default, hazard, disease, flight status
2. .hi-slate[Regression tasks] in which $\color{#FFA500}{\mathbf{y}}$ takes on continuous, numeric values.
*E.g.*, price, arrival time, number of emails, temperature
.note[Note.sub[1]] This use of .it[regression] (a task with a numeric outcome) differs from our use of .it[linear regression] (a specific model).
--
.note[Note.sub[2]] Don't get tricked: Not all numbers represent continuous, numerical values—_e.g._, zip codes, industry codes, social security numbers..super[.pink[†]]
.footnote[
.pink[†] .qa[Q] Where would you put responses to 5-item Likert scales?
]
---
## Why *Learning*?
.qa[Q] What puts the "learning" in statistical/machine learning?
--
.qa[A] Most learning models/algorithms will .attn[tune model parameters] based upon the observed dataset—learning from the data.
---
layout: true
# Notation
---
name: notation-source
class: inverse, middle
Our class will typically follow the notation and definitions of [.it[ISL]](http://faculty.marshall.usc.edu/gareth-james/ISL/).
---
name: notation-data
## Data
$\color{#e64173}{n}$ gives the .pink[number of observations]
$\color{#6A5ACD}{p}$ represents the .purple[number of variables] available for predictions
--
$\mathbf{X}$ is our $\color{#e64173}{n}\times\color{#6A5ACD}{p}$ matrix of predictors
- .note[Other names] ***features***, *inputs*, *independent/explanatory variables*, ...
- $x_{\color{#e64173}{i},\color{#6A5ACD}{j}}$ is observation $\color{#e64173}{i}$ (for $\color{#e64173}{i}$ in $\color{#e64173}{1,\ldots,n}$) on variable $\color{#6A5ACD}{j}$ (for $\color{#6A5ACD}{j}$ in $\color{#6A5ACD}{1,\ldots,p}$)
--
$$
\begin{align}
\mathbf{X} =
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,\color{#6A5ACD}{p}} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,\color{#6A5ACD}{p}} \\
\vdots & \vdots & \ddots & \vdots \\
x_{\color{#e64173}{n},1} & x_{\color{#e64173}{n},2} & \cdots & x_{\color{#e64173}{n},\color{#6A5ACD}{p}}
\end{bmatrix}
\end{align}
$$
---
name: notation-dimensions
## Dimensions of $\mathbf{X}$
Now let us split our $\mathbf{X}$ matrix of predictors by its two dimensions.
--
.col-left[
.hi-pink[Observation] $\color{#e64173}{i}$ is a $\color{#6A5ACD}{p}$-length vector
$$
\begin{align}
x_{\color{#e64173}{i}} =
\begin{bmatrix}
x_{\color{#e64173}{i},\color{#6A5ACD}{1}} \\
x_{\color{#e64173}{i},\color{#6A5ACD}{2}} \\
\vdots \\
x_{\color{#e64173}{i},\color{#6A5ACD}{p}}
\end{bmatrix}
\end{align}
$$
]
--
.col-right[
.hi-purple[Variable] $\color{#6A5ACD}{j}$ is an $\color{#e64173}{n}$-length vector
$$
\begin{align}
\mathbf{x}_{\color{#6A5ACD}{j}} =
\begin{bmatrix}
x_{\color{#e64173}{1},\color{#6A5ACD}{j}} \\
x_{\color{#e64173}{2},\color{#6A5ACD}{j}} \\
\vdots \\
x_{\color{#e64173}{n},\color{#6A5ACD}{j}}
\end{bmatrix}
\end{align}
$$
]
--
Applied to .mono[R]:
- `dim(x_df)` $= \left( \color{#e64173}{n},\, \color{#6A5ACD}{p} \right)$
- `nrow(x_df)` $= \color{#e64173}{n}$; `ncol(x_df)` $= \color{#6A5ACD}{p}$
- `x_df[1,]` $\left( \color{#e64173}{i = 1} \right)$; `x_df[,1]` $\left( \color{#6A5ACD}{j = 1} \right)$
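To check these claims on a real object, the built-in `mtcars` data frame can stand in for $\mathbf{X}$:
```{R, ex-dims}
x_df = mtcars  # built-in data: n = 32 observations, p = 11 variables
dim(x_df)      # c(n, p)
nrow(x_df)     # n
ncol(x_df)     # p
x_df[1, ]      # observation i = 1: a p-length row
x_df[, 1]      # variable j = 1: an n-length column
```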
---
name: notation-outcomes
## Outcomes
In supervised settings, we will denote our .hi-orange[outcome variable] as $\color{#FFA500}{\mathbf{y}}$.
.note[Synonyms] *output*, *outcome*, *dependent/response variable*, ...
--
The .orange[outcome] for our .pink[i.super[th]] observation is $\color{#FFA500}{y}_{\color{#e64173}{i}}$. Together the $\color{#e64173}{n}$ observations form
$$
\begin{align}
\color{#FFA500}{\mathbf{y}} =
\begin{bmatrix}
y_{\color{#e64173}{1}} \\
y_{\color{#e64173}{2}} \\
\vdots \\
y_{\color{#e64173}{n}}
\end{bmatrix}
\end{align}
$$
--
and our full dataset is composed of $\bigg\{ \left( x_{\color{#e64173}{1}},\color{#FFA500}{y}_{\color{#e64173}{1}} \right),\, \left( x_{\color{#e64173}{2}},\color{#FFA500}{y}_{\color{#e64173}{2}} \right),\, \ldots,\, \left( x_{\color{#e64173}{n}},\color{#FFA500}{y}_{\color{#e64173}{n}} \right) \bigg\}$.
---
layout: false
class: clear, middle
Back to the problem of (supervised) statistical learning...
---
layout: true
# Statistical learning
---
name: sl-goal
## The goal
As defined before, we want to *learn* a model to understand our data.
--
1. Take our (numeric) .orange[output] $\color{#FFA500}{\mathbf{y}}$.
2. Imagine there is a .turquoise[function] $\color{#20B2AA}{f}$ that takes .purple[inputs] $\color{#6A5ACD}{\mathbf{X}} = \color{#6A5ACD}{\mathbf{x}_1}, \ldots, \color{#6A5ACD}{\mathbf{x}_p}$
and maps them, plus a random, mean-zero .pink[error term] $\color{#e64173}{\varepsilon}$, to the .orange[output].
$$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f} \! \left( \color{#6A5ACD}{\mathbf{X}} \right) + \color{#e64173}{\varepsilon}$$
--
.qa[Q] What is $\color{#20B2AA}{f}$?
--
.qa[A] .note[ISL:] $\color{#20B2AA}{f}$ represents the *systematic* information that $\color{#6A5ACD}{\mathbf{X}}$ provides about $\color{#FFA500}{\mathbf{y}}$.
--
.qa[Q] How else can you describe $\color{#20B2AA}{f}$?
---
## Our missing $f$
$$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f} \! \left( \color{#6A5ACD}{\mathbf{X}} \right) + \color{#e64173}{\varepsilon}$$
.qa[Q] $\color{#20B2AA}{f}$ is unknown (as is $\color{#e64173}{\varepsilon}$). What should we do?
--
.qa[A] Use the observed data to learn/estimate $\color{#20B2AA}{f}(\cdot)$, _i.e._, construct $\widehat{\color{#20B2AA}{f}}$..super[.pink[†]]
.footnote[
.pink[†] More notation: hats $\left( \hat{} \right)$ are estimators/estimates.
]
--
.qa[Q] Okay. How?
--
.qa[A] .it[How do I estimate] $\color{#20B2AA}{f}$.it[?] is one way to phrase *all questions* that underlie statistical learning—model selection, cross validation, evaluation, *etc.*
--
All of the techniques, algorithms, and tools of stat. learning attempt to accurately recover $\color{#20B2AA}{f}$ based upon the setting's goals/limitations.
--
.grey-light[You'll have to wait on any real/specific answers...]
---
## Learning from $\hat{f}$
There are two main reasons we want to learn about $\color{#20B2AA}{f}$
1. .hi-slate[*Causal* inference settings] How do changes in $\color{#6A5ACD}{\mathbf{X}}$ affect $\color{#FFA500}{\mathbf{y}}$?
.grey-light[The territory of .mono[EC523] and .mono[EC525].]
--
1. .hi-slate[Prediction problems] Predict $\color{#FFA500}{\mathbf{y}}$ using our estimated $\color{#20B2AA}{f}$, _i.e._,
$$\hat{\color{#FFA500}{\mathbf{y}}} = \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}})$$
our *black-box setting* where we care less about $\color{#20B2AA}{f}$ than $\hat{\color{#FFA500}{\mathbf{y}}}$..super[.pink[†]]
.footnote[
.pink[†] You shouldn't actually treat your prediction methods as total black boxes.
]
--
Similarly, in causal-inference settings, we don't particularly care about $\hat{\color{#FFA500}{\mathbf{y}}}$.
---
name: sl-prediction
## Prediction errors
As tends to be the case in life, you will make errors in predicting $\color{#FFA500}{\mathbf{y}}$.
The accuracy of $\hat{\color{#FFA500}{\mathbf{y}}}$ depends upon .hi-slate[two errors]:
--
1. .hi-slate[Reducible error] The error due to $\hat{\color{#20B2AA}{f}}$ imperfectly estimating $\color{#20B2AA}{f}$.
*Reducible* in the sense that we could improve $\hat{\color{#20B2AA}{f}}$.
--
1. .hi-slate[Irreducible error] The error component that is outside of the model $\color{#20B2AA}{f}$.
*Irreducible* because we defined an error term $\color{#e64173}{\varepsilon}$ unexplained by $\color{#20B2AA}{f}$.
--
.note[Note] As its name implies, you can't get rid of .it[irreducible] error—but we can try to get rid of .it[reducible] errors.
---
## Prediction errors
Why we're stuck with .it[irreducible] error
$$
\begin{aligned}
\mathop{E}\left[ \left\{ \color{#FFA500}{\mathbf{y}} - \hat{\color{#FFA500}{\mathbf{y}}} \right\}^2 \right]
&=
\mathop{E}\left[ \left\{ \color{#20B2AA}{f}(\color{#6A5ACD}{\mathbf{X}}) + \color{#e64173}{\varepsilon} - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{\mathbf{X}}) \right\}^2 \right] \\
&= \underbrace{\left[ \color{#20B2AA}{f}(\color{#6A5ACD}{\mathbf{X}}) - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{\mathbf{X}}) \right]^2}_{\text{Reducible}} + \underbrace{\mathop{\text{Var}} \left( \color{#e64173}{\varepsilon} \right)}_{\text{Irreducible}}
\end{aligned}
$$
In less math:
- If $\color{#e64173}{\varepsilon}$ exists, then $\color{#6A5ACD}{\mathbf{X}}$ cannot perfectly explain $\color{#FFA500}{\mathbf{y}}$.
- So even if $\hat{\color{#20B2AA}{f}} = \color{#20B2AA}{f}$, we still have irreducible error.
--
Thus, to form our .hi-slate[best predictors], we will .hi-slate[minimize reducible error].
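A small simulation illustrates the floor: even when we predict with the *true* $\color{#20B2AA}{f}$, MSE settles at $\mathop{\text{Var}} \left( \color{#e64173}{\varepsilon} \right)$ (the DGP below is illustrative):
```{R, ex-irreducible}
# Simulate y = f(x) + e with f(x) = 3x and Var(e) = 4
set.seed(101)
n = 1e5
x = runif(n)
e = rnorm(n, sd = 2)
y = 3 * x + e
# Predict with the true f: no reducible error remains
y_hat = 3 * x
# The mean squared prediction error converges to Var(e) = 4
mean((y - y_hat)^2)
```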
---
name: sl-parameters
## Which type of $\hat{f}$?
Once you have your .purple[inputs] $\left(\color{#6A5ACD}{\mathbf{X}} \right)$ and .orange[output] $\left( \color{#FFA500}{\mathbf{y}} \right)$ data, you still need to decide how parametric your $\hat{\color{#20B2AA}{f}}$ should be..super[.pink[†]]
.footnote[
.pink[†] I'm saying "how parametric" b/c some methods are much more parametric than others.
]
--
.hi-slate[Parametric methods] assume a functional form for $\color{#20B2AA}{f}$ and typically involve two steps
1. Select a functional form (shape) to represent $\color{#20B2AA}{f}$.
2. Train your selected model on your data $\color{#FFA500}{\mathbf{y}}$ and $\color{#6A5ACD}{\mathbf{X}}$.
--
.hi-slate[Non-parametric methods] avoid explicit assumptions about the shape of $\color{#20B2AA}{f}$.
Attempt to .pink[flexibly fit] the data, while trying to .pink[avoid overfitting].
---
## Which type of $\hat{f}$?
Methods' parametric assumptions come with tradeoffs.
.hi-slate[Parametric methods]
.pink.mono[+] Simpler to estimate and interpret.
.purple.mono[-] If assumed functional form is bad, model performance will suffer.
.hi-slate[Non-parametric methods]
.pink.mono[+] Fewer assumptions. More flexibility.
.purple.mono[-] Lower interpretability. Susceptible to overfitting. Requires lots of data.
---
layout: true
class: clear, middle
---
.hi-slate[Example:] Let's start with a pretty funky, nonlinear function.
---
exclude: true
```{R, ex data, cache = T}
# Sample size
n = 70
# Set seed
set.seed(12345)
# Define function
f = function(x1, x2, e) x1 + x2 - x1 * x2 + (x1 > x2) * x1 + (x1 < x2) * x2^2 + e
# Generate data
sample_df = tibble(
x1 = runif(n = n, max = 10),
x2 = runif(n = n, max = 10),
e = rnorm(n = n, sd = 1),
y = f(x1, x2, e)
)
# Estimate linear-regression model
est_lm = lm(y ~ x1 * x2, data = sample_df)
# Estimate kNN models: k=10,5,1
est_knn10 = knnreg(
y = sample_df$y,
x = sample_df[, c("x1", "x2")],
k = 10
)
est_knn5 = knnreg(
y = sample_df$y,
x = sample_df[, c("x1", "x2")],
k = 5
)
est_knn1 = knnreg(
y = sample_df$y,
x = sample_df[, c("x1", "x2")],
k = 1
)
# Add predictions
sample_df %<>% mutate(
y_lm = est_lm$fitted.values,
y_knn10 = predict(est_knn10, newdata = sample_df[, c("x1", "x2")]),
y_knn5 = predict(est_knn5, newdata = sample_df[, c("x1", "x2")]),
y_knn1 = predict(est_knn1, newdata = sample_df[, c("x1", "x2")])
)
# True data frame
truth_df = tibble(x1 = seq(0, 10, 0.1), x2 = seq(0, 10, 0.1)) %>%
expand(x1, x2)
truth_df %<>% mutate(
y = f(x1, x2, 0),
y_lm = predict(est_lm, newdata = truth_df),
y_knn10 = predict(est_knn10, newdata = truth_df[, c("x1", "x2")]),
y_knn5 = predict(est_knn5, newdata = truth_df[, c("x1", "x2")]),
y_knn1 = predict(est_knn1, newdata = truth_df[, c("x1", "x2")])
)
# Find range of x, y, and prediction errors
range_x = c(0,10)
range_y = c(
min(
sample_df %>% select(starts_with("y")),
truth_df %>% select(starts_with("y"))
),
max(
sample_df %>% select(starts_with("y")),
truth_df %>% select(starts_with("y"))
)
)
range_error = c(
min(sample_df %>% transmute(y - y_lm, y - y_knn10, y - y_knn5, y - y_knn1)),
max(sample_df %>% transmute(y - y_lm, y - y_knn10, y - y_knn5, y - y_knn1))
)
```
---
name: ex-truth
.hi-slate[Truth:] The (nonlinear) $f(\mathbf{X})$ that we hope to recover.
```{R, ex truth, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# True 'f(X)' (surface)
plot_ly(
x = unique(truth_df$x1),
y = unique(truth_df$x2),
z = matrix(data = truth_df$y, ncol = sqrt(nrow(truth_df))),
colors = colorRamp(viridis::magma(8)),
cmin = range_y[1],
cmax = range_y[2]
) %>% add_surface() %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "y", range = range_y)
)) %>% hide_colorbar()
```
---
.hi-slate[The sample:] $n=70$ randomly drawn observations for $\mathbf{y} = f(\mathbf{x}_1,\, \mathbf{x}_2) + \varepsilon$
```{R, ex sample, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# Sample observations (3d scatter)
plot_ly(
type = "scatter3d",
x = sample_df$x1,
y = sample_df$x2,
z = sample_df$y,
mode = "markers",
color = sample_df$y,
colors = colorRamp(viridis::magma(8))
) %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "y", range = range_y)
)) %>% hide_colorbar()
```
---
name: ex-lm
.hi-slate[Estimated linear-regression model:] $\hat{\mathbf{y}} = \hat\beta_0 + \hat\beta_1 \mathbf{x}_1 + \hat\beta_2 \mathbf{x}_2 + \hat\beta_3 \mathbf{x}_1 \mathbf{x}_2$
```{R, ex lm, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# Linear regression estimate (surface)
plot_ly(
x = unique(truth_df$x1),
y = unique(truth_df$x2),
z = matrix(data = truth_df$y_lm, ncol = sqrt(nrow(truth_df))),
colors = colorRamp(viridis::magma(8)),
cmin = range_y[1],
cmax = range_y[2]
) %>% add_surface() %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "y", range = range_y)
)) %>% colorbar(limits = range_y) %>% hide_colorbar()
```
---
.hi-slate[Prediction error] from our fitted linear regression model
```{R, ex lm errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# Regression error (3d scatter)
plot_ly(
type = "scatter3d",
x = sample_df$x1,
y = sample_df$x2,
z = sample_df$y - sample_df$y_lm,
mode = "markers",
color = sample_df$y - sample_df$y_lm,
colors = colorRamp(viridis::magma(8))
) %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "error", range = range_error)
)) %>% colorbar(limits = range_error) %>% hide_colorbar()
```
---
name: ex-knn
.hi-slate[k-nearest neighbors] (kNN) using k=5 .grey-light[(a *non-parametric* method)]
```{R, ex knn5, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# knn model (k = 5) (surface)
plot_ly(
x = unique(truth_df$x1),
y = unique(truth_df$x2),
z = matrix(data = truth_df$y_knn5, ncol = sqrt(nrow(truth_df))),
colors = colorRamp(viridis::magma(8)),
cmin = range_y[1],
cmax = range_y[2]
) %>% add_surface() %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "y", range = range_y)
)) %>% hide_colorbar()
```
---
.hi-slate[k-nearest neighbors] (kNN) using k=10 .grey-light[(notice increased smoothness)]
```{R, ex knn10, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# knn model (k = 10) (surface)
plot_ly(
x = unique(truth_df$x1),
y = unique(truth_df$x2),
z = matrix(data = truth_df$y_knn10, ncol = sqrt(nrow(truth_df))),
colors = colorRamp(viridis::magma(8)),
cmin = range_y[1],
cmax = range_y[2]
) %>% add_surface() %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "y", range = range_y)
)) %>% hide_colorbar()
```
---
.hi-slate[k-nearest neighbors] (kNN) using k=1 .grey-light[(notice decreased smoothness)]
```{R, ex knn1, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# knn model (k = 1) (surface)
plot_ly(
x = unique(truth_df$x1),
y = unique(truth_df$x2),
z = matrix(data = truth_df$y_knn1, ncol = sqrt(nrow(truth_df))),
colors = colorRamp(viridis::magma(8)),
cmin = range_y[1],
cmax = range_y[2]
) %>% add_surface() %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "y", range = range_y)
)) %>% hide_colorbar()
```
---
.hi-slate[Prediction error] from our fitted kNN (k=5) model
```{R, ex knn 5 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# kNN 5 error (3d scatter)
plot_ly(
type = "scatter3d",
x = sample_df$x1,
y = sample_df$x2,
z = sample_df$y - sample_df$y_knn5,
mode = "markers",
color = sample_df$y - sample_df$y_knn5,
colors = colorRamp(viridis::magma(8))
) %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "error", range = range_error)
)) %>% colorbar(limits = range_error) %>% hide_colorbar()
```
---
.hi-slate[Prediction error] from our fitted kNN (k=10) model
```{R, ex knn 10 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# kNN 10 error (3d scatter)
plot_ly(
type = "scatter3d",
x = sample_df$x1,
y = sample_df$x2,
z = sample_df$y - sample_df$y_knn10,
mode = "markers",
color = sample_df$y - sample_df$y_knn10,
colors = colorRamp(viridis::magma(8))
) %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "error", range = range_error)
)) %>% colorbar(limits = range_error) %>% hide_colorbar()
```
---
.hi-slate[Prediction error] from our fitted kNN (k=1) model
```{R, ex knn 1 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# kNN 1 error (3d scatter)
plot_ly(
type = "scatter3d",
x = sample_df$x1,
y = sample_df$x2,
z = sample_df$y - sample_df$y_knn1,
mode = "markers",
color = sample_df$y - sample_df$y_knn1,
colors = colorRamp(viridis::magma(8))
) %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "error", range = range_error)
)) %>% colorbar(limits = range_error) %>% hide_colorbar()
```
---
.note[Recall] .hi-slate[Prediction error] from our fitted linear regression model
```{R, ex lm errors again, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# Regression error (3d scatter)
plot_ly(
type = "scatter3d",
x = sample_df$x1,
y = sample_df$x2,
z = sample_df$y - sample_df$y_lm,
mode = "markers",
color = sample_df$y - sample_df$y_lm,
colors = colorRamp(viridis::magma(8))
) %>%
layout(scene = list(
xaxis = list(title = "x1", range = range_x),
yaxis = list(title = "x2", range = range_x),
zaxis = list(title = "error", range = range_error)
)) %>% colorbar(limits = range_error) %>% hide_colorbar()
```
---
layout: true
# Model accuracy
---
name: accuracy-questions
## Questions
1. Which of the methods was the most flexible? Inflexible?
1. Why do you think kNN with k=1 had such low prediction error?
1. How could we (better) assess model/predictive performance?
1. Why would we ever want to choose a less flexible model?
---
## Measurement
You probably will not be surprised to know that there is no one-size-fits-all solution in statistical learning.
.qa[Q] How do we choose between competing models?
--
.qa[A] We're a few steps away, but before we do anything, we need a way to .hi-slate[define model performance].
---
name: accuracy-subtlety
## Subtlety
Defining performance can actually be quite tricky...
.note[Regression setting, 1] Which do you prefer?
1. Lots of little errors and a few really large errors.
1. Medium-sized errors for everyone.
.note[Regression setting, 2] Is a 1-unit error (*e.g.*, $1,000) equally bad for everyone?
---
## Subtlety
Defining performance can actually be quite tricky...
.note[Classification setting, 1] Which is worse?
1. False positive (*e.g.*, incorrectly diagnosing cancer)
1. False negative (*e.g.*, missing cancer)
.note[Classification setting, 2] Which is more important?
1. True positive (*e.g.*, correct diagnosis of cancer)
1. True negative (*e.g.*, correct diagnosis of "no cancer")
---
name: mse
## MSE
.attn[Mean squared error (MSE)] is the most common.super[.pink[†]] way to measure model performance in a regression setting.
.footnote[
.pink[†] *Most common* does not mean best—it just means lots of people use it.
]
$$\text{MSE} = \dfrac{1}{n} \sum_{i=1}^n \left[ \color{#FFA500}{y}_i - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{x}_i) \right]^2$$
.note[Recall:] $\color{#FFA500}{y}_i - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{x}_i) = \color{#FFA500}{y}_i - \hat{\color{#FFA500}{y}}_i$ is our prediction error.
--
Two notes about MSE
1. MSE will be (relatively) very small when .hi-slate[prediction error] is nearly zero.
1. MSE .hi-slate[penalizes] big errors more than little errors (the squared part).
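Computing MSE by hand in .mono[R], on a toy vector of outcomes and predictions (the numbers are made up):
```{R, ex-mse}
y     = c(1.0, 2.0, 3.0, 4.0)  # observed outcomes (illustrative)
y_hat = c(1.1, 1.9, 3.5, 3.0)  # predictions (illustrative)
# MSE: the average of squared prediction errors
mse = mean((y - y_hat)^2)
mse  # 0.3175; the single error of 1.0 dominates (the squared part)
```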
---
name: training-testing
## Training or testing?
Low MSE (accurate performance) on the data that trained the model isn't actually impressive—maybe the model is just overfitting our data..super[.pink[†]]
.footnote[
.pink[†] Recall the kNN performance for k=1.
]
.note[What we want:] How well does the model perform .hi-slate[on data it has never seen]?
--
This introduces an important distinction:
1. .hi-slate[Training data]: The observations $(\color{#FFA500}{y}_i,\color{#e64173}{x}_i)$ used to .hi-slate[train] our model $\hat{\color{#20B2AA}{f}}$.
1. .hi-slate[Testing data]: The observations $(\color{#FFA500}{y}_0,\color{#e64173}{x}_0)$ that our model has yet to see—and which we can use to evaluate the performance of $\hat{\color{#20B2AA}{f}}$.
--
.hi-slate[Real goal: Low test-sample MSE] (not the training MSE from before).
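One way to see the distinction in .mono[R]: hold out part of a simulated sample, (over)fit a model on the rest, and compare MSEs. The DGP and the degree-15 polynomial are illustrative choices:
```{R, ex-train-test}
# Simulate a nonlinear DGP, then split into training and testing sets
set.seed(101)
n = 200
full_df = data.frame(x = runif(n))
full_df$y = sin(2 * pi * full_df$x) + rnorm(n, sd = 0.3)
train_i  = sample(n, size = 150)
train_df = full_df[train_i, ]
test_df  = full_df[-train_i, ]
# A very flexible model, fit on the training data only
fit = lm(y ~ poly(x, 15), data = train_df)
mse_train = mean((train_df$y - predict(fit, newdata = train_df))^2)
mse_test  = mean((test_df$y  - predict(fit, newdata = test_df))^2)
c(train = mse_train, test = mse_test)
```
With this seed, the flexible model's training MSE understates its test MSE—exactly the overfitting concern above.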
---
layout: false
class: clear, middle
.hi-slate[Next time:] model performance, the variance-bias tradeoff, and kNN
---
name: sources
layout: false
# Sources
These notes draw upon
- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)
James, Witten, Hastie, and Tibshirani
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
Jake VanderPlas
I pulled the comic from [Twitter](https://twitter.com/athena_schools/status/1063013435779223553/photo/1).
---
# Table of contents
.col-left[
.smallest[
#### Admin
- [Today](#admin-today)
- [Upcoming](#admin-soon)
#### Statistical learning
- [Definition](#sl-definition)
- [Classes](#sl-classes)
#### Notation
- [Source](#notation-source)
- [Data](#notation-data)
- [Dimensions of $\mathbf{X}$](#notation-dimensions)
- [Outcomes](#notation-outcomes)
#### Statistical learning, continued
- [The goal](#sl-goal)
- [Prediction](#sl-prediction)
- [Parameterization](#sl-parameters)
]
]
.col-right[
.smallest[
#### Example
- [Data-generating process (truth)](#ex-truth)
- [Regression model](#ex-lm)
- [kNN model](#ex-knn)
#### Model accuracy
- [Questions](#accuracy-questions)
- [Subtlety](#accuracy-subtlety)
- [MSE](#mse)
- [Training *vs.* testing](#training-testing)
#### Other
- [Sources/references](#sources)
]
]
---
exclude: true
```{R, save pdfs, include = F, eval = F}
system("`npm bin`/decktape remark 001-slides.html 001-slides.pdf --chrome-arg=--allow-file-access-from-files --slides 1-100")
```