*E.g.*, race, sex, loan default, hazard, disease, flight status

2. .hi-slate[Regression tasks] in which $\color{#FFA500}{\mathbf{y}}$ takes on continuous, numeric values.
*E.g.*, price, arrival time, number of emails, temperature

.note[Note.sub[1]] The use of .it[regression] differs from our use of .it[linear regression].

--

.note[Note.sub[2]] Don't get tricked: Not all numbers represent continuous, numerical values, _e.g._, zip codes, industry codes, social security numbers..super[.pink[†]]

.footnote[
.pink[†] .qa[Q] Where would you put responses to 5-item Likert scales?
]

---

## Why *Learning*?

.qa[Q] What puts the "learning" in statistical/machine learning?

--

.qa[A] Most learning models/algorithms will .attn[tune model parameters] based upon the observed dataset, _i.e._, they learn from the data.

---
layout: true
# Notation

---
name: notation-source
class: inverse, middle

Our class will typically follow the notation and definitions of [.it[ISL]](http://faculty.marshall.usc.edu/gareth-james/ISL/).

---
name: notation-data

## Data

$\color{#e64173}{n}$ gives the .pink[number of observations]

$\color{#6A5ACD}{p}$ represents the .purple[number of variables] available for predictions

--

$\mathbf{X}$ is our $\color{#e64173}{n}\times\color{#6A5ACD}{p}$ matrix of predictors

- .note[Other names] ***features***, *inputs*, *independent/explanatory variables*, ...
- $x_{\color{#e64173}{i},\color{#6A5ACD}{j}}$ is observation $\color{#e64173}{i}$ (in $\color{#e64173}{1,\ldots,n}$) on variable $\color{#6A5ACD}{j}$ (for $\color{#6A5ACD}{j}$ in $\color{#6A5ACD}{1,\ldots,p}$)

--

$$
\begin{align}
  \mathbf{X} = \begin{bmatrix}
    x_{1,1} & x_{1,2} & \cdots & x_{1,\color{#6A5ACD}{p}} \\
    x_{2,1} & x_{2,2} & \cdots & x_{2,\color{#6A5ACD}{p}} \\
    \vdots & \vdots & \ddots & \vdots \\
    x_{\color{#e64173}{n},1} & x_{\color{#e64173}{n},2} & \cdots & x_{\color{#e64173}{n},\color{#6A5ACD}{p}}
  \end{bmatrix}
\end{align}
$$

---
name: notation-dimensions

## Dimensions of $\mathbf{X}$

Now let us split our $\mathbf{X}$ matrix of predictors by its two dimensions.
--

.col-left[
.hi-pink[Observation] $\color{#e64173}{i}$ is a $\color{#6A5ACD}{p}$-length vector

$$
\begin{align}
  x_{\color{#e64173}{i}} = \begin{bmatrix}
    x_{\color{#e64173}{i},\color{#6A5ACD}{1}} \\
    x_{\color{#e64173}{i},\color{#6A5ACD}{2}} \\
    \vdots \\
    x_{\color{#e64173}{i},\color{#6A5ACD}{p}}
  \end{bmatrix}
\end{align}
$$
]

--

.col-right[
.hi-purple[Variable] $\color{#6A5ACD}{j}$ is an $\color{#e64173}{n}$-length vector

$$
\begin{align}
  \mathbf{x}_{\color{#6A5ACD}{j}} = \begin{bmatrix}
    x_{\color{#e64173}{1},\color{#6A5ACD}{j}} \\
    x_{\color{#e64173}{2},\color{#6A5ACD}{j}} \\
    \vdots \\
    x_{\color{#e64173}{n},\color{#6A5ACD}{j}}
  \end{bmatrix}
\end{align}
$$
]

--

Applied to .mono[R]:

- `dim(x_df)` = $\color{#e64173}{n}$ $\color{#6A5ACD}{p}$
- `nrow(x_df)` $= \color{#e64173}{n}$; `ncol(x_df)` $= \color{#6A5ACD}{p}$
- `x_df[1,]` $\left( \color{#e64173}{i = 1} \right)$; `x_df[,1]` $\left( \color{#6A5ACD}{j = 1} \right)$

---
name: notation-outcomes

## Outcomes

In supervised settings, we will denote our .hi-orange[outcome variable] as $\color{#FFA500}{\mathbf{y}}$.

.note[Synonyms] *output*, *outcome*, *dependent/response variable*, ...

--

The .orange[outcome] for our .pink[i.super[th]] observation is $\color{#FFA500}{y}_{\color{#e64173}{i}}$. Together the $\color{#e64173}{n}$ observations form

$$
\begin{align}
  \color{#FFA500}{\mathbf{y}} = \begin{bmatrix}
    y_{\color{#e64173}{1}} \\
    y_{\color{#e64173}{2}} \\
    \vdots \\
    y_{\color{#e64173}{n}}
  \end{bmatrix}
\end{align}
$$

--

and our full dataset is composed of $\bigg\{ \left( x_{\color{#e64173}{1}},\color{#FFA500}{y}_{\color{#e64173}{1}} \right),\, \left( x_{\color{#e64173}{2}},\color{#FFA500}{y}_{\color{#e64173}{2}} \right),\, \ldots,\, \left( x_{\color{#e64173}{n}},\color{#FFA500}{y}_{\color{#e64173}{n}} \right) \bigg\}$.

---
layout: false
class: clear, middle

Back to the problem of (supervised) statistical learning...
---
layout: true
# Statistical learning

---
name: sl-goal

## The goal

As defined before, we want to *learn* a model to understand our data.

--

1. Take our (numeric) .orange[output] $\color{#FFA500}{\mathbf{y}}$.
2. Imagine there is a .turquoise[function] $\color{#20B2AA}{f}$ that takes .purple[inputs] $\color{#6A5ACD}{\mathbf{X}} = \color{#6A5ACD}{\mathbf{x}_1}, \ldots, \color{#6A5ACD}{\mathbf{x}_p}$
and maps them, plus a random, mean-zero .pink[error term] $\color{#e64173}{\varepsilon}$, to the .orange[output].

$$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f} \! \left( \color{#6A5ACD}{\mathbf{X}} \right) + \color{#e64173}{\varepsilon}$$

--

.qa[Q] What is $\color{#20B2AA}{f}$?

--
.qa[A] .note[ISL:] $\color{#20B2AA}{f}$ represents the *systematic* information that $\color{#6A5ACD}{\mathbf{X}}$ provides about $\color{#FFA500}{\mathbf{y}}$.

--

.qa[Q] How else can you describe $\color{#20B2AA}{f}$?

---

## Our missing $f$

$$\color{#FFA500}{\mathbf{y}} = \color{#20B2AA}{f} \! \left( \color{#6A5ACD}{\mathbf{X}} \right) + \color{#e64173}{\varepsilon}$$

.qa[Q] $\color{#20B2AA}{f}$ is unknown (as is $\color{#e64173}{\varepsilon}$). What should we do?

--
.qa[A] Use the observed data to learn/estimate $\color{#20B2AA}{f}(\cdot)$, _i.e._, construct $\widehat{\color{#20B2AA}{f}}$..super[.pink[†]]

.footnote[
.pink[†] More notation: hats $\left( \hat{} \right)$ are estimators/estimates.
]

--

.qa[Q] Okay. How?

--
.qa[A] .it[How do I estimate] $\color{#20B2AA}{f}$.it[?] is one way to phrase *all questions* that underlie statistical learning: model selection, cross validation, evaluation, *etc.*

--

All of the techniques, algorithms, and tools of statistical learning attempt to accurately recover $\color{#20B2AA}{f}$ based upon each setting's goals/limitations.

--

.grey-light[You'll have to wait on any real/specific answers...]

---

## Learning from $\hat{f}$

There are two main reasons we want to learn about $\color{#20B2AA}{f}$:

1. .hi-slate[*Causal* inference settings] How do changes in $\color{#6A5ACD}{\mathbf{X}}$ affect $\color{#FFA500}{\mathbf{y}}$?
.grey-light[The territory of .mono[EC523] and .mono[EC525].]

--

1. .hi-slate[Prediction problems] Predict $\color{#FFA500}{\mathbf{y}}$ using our estimated $\color{#20B2AA}{f}$, _i.e._,

   $$\hat{\color{#FFA500}{\mathbf{y}}} = \hat{\color{#20B2AA}{f}}\!(\color{#6A5ACD}{\mathbf{X}})$$

   our *black-box setting* where we care less about $\color{#20B2AA}{f}$ than $\hat{\color{#FFA500}{\mathbf{y}}}$..super[.pink[†]]

.footnote[
.pink[†] You shouldn't actually treat your prediction methods as total black boxes.
]

--

Similarly, in causal-inference settings, we don't particularly care about $\hat{\color{#FFA500}{\mathbf{y}}}$.

---
name: sl-prediction

## Prediction errors

As tends to be the case in life, you will make errors in predicting $\color{#FFA500}{\mathbf{y}}$.

The accuracy of $\hat{\color{#FFA500}{\mathbf{y}}}$ depends upon .hi-slate[two errors]:

--

1. .hi-slate[Reducible error] The error due to $\hat{\color{#20B2AA}{f}}$ imperfectly estimating $\color{#20B2AA}{f}$.
*Reducible* in the sense that we could improve $\hat{\color{#20B2AA}{f}}$.

--

1. .hi-slate[Irreducible error] The error component that is outside of the model $\color{#20B2AA}{f}$.
*Irreducible* because we defined an error term $\color{#e64173}{\varepsilon}$ unexplained by $\color{#20B2AA}{f}$.

--

.note[Note] As its name implies, you can't get rid of .it[irreducible] error, but we can try to get rid of .it[reducible] errors.

---

## Prediction errors

Why we're stuck with .it[irreducible] error (treating $\hat{\color{#20B2AA}{f}}$ and $\color{#6A5ACD}{\mathbf{X}}$ as fixed)

$$
\begin{aligned}
  \mathop{E}\left[ \left\{ \color{#FFA500}{\mathbf{y}} - \hat{\color{#FFA500}{\mathbf{y}}} \right\}^2 \right]
  &=
  \mathop{E}\left[ \left\{ \color{#20B2AA}{f}(\color{#6A5ACD}{\mathbf{X}}) + \color{#e64173}{\varepsilon} - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{\mathbf{X}}) \right\}^2 \right] \\
  &= \underbrace{\left[ \color{#20B2AA}{f}(\color{#6A5ACD}{\mathbf{X}}) - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{\mathbf{X}}) \right]^2}_{\text{Reducible}} + \underbrace{\mathop{\text{Var}} \left( \color{#e64173}{\varepsilon} \right)}_{\text{Irreducible}}
\end{aligned}
$$

In less math:

- If $\color{#e64173}{\varepsilon}$ exists, then $\color{#6A5ACD}{\mathbf{X}}$ cannot perfectly explain $\color{#FFA500}{\mathbf{y}}$.
- So even if $\hat{\color{#20B2AA}{f}} = \color{#20B2AA}{f}$, we still have irreducible error.

--

Thus, to form our .hi-slate[best predictors], we will .hi-slate[minimize reducible error].

---
name: sl-parameters

## Which type of $\hat{f}$?

Once you have your .purple[inputs] $\left(\color{#6A5ACD}{\mathbf{X}} \right)$ and .orange[output] $\left( \color{#FFA500}{\mathbf{y}} \right)$ data, you still need to decide how parametric your $\hat{\color{#20B2AA}{f}}$ should be..super[.pink[†]]

.footnote[
.pink[†] I'm saying "how parametric" b/c some methods are much more parametric than others.
]

--

.hi-slate[Parametric methods] assume a functional form for $\color{#20B2AA}{f}$ and typically involve two steps:

1. Select a functional form (shape) to represent $\color{#20B2AA}{f}$.
2. Train your selected model on your data $\color{#FFA500}{\mathbf{y}}$ and $\color{#6A5ACD}{\mathbf{X}}$.
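
.note[Sketch] The two steps above might look as follows in .mono[R]. This is a minimal sketch on simulated data: `sim_df`, its variables, and the chosen functional form (linear with an interaction) are assumptions made for illustration, not the data or code behind these slides.

```r
# Simulated data (hypothetical; not the dataset used in these slides)
set.seed(123)
n <- 100
sim_df <- data.frame(x1 = runif(n), x2 = runif(n))
sim_df$y <- 1 + 2 * sim_df$x1 - 3 * sim_df$x2 + rnorm(n, sd = 0.5)

# Step 1: Select a functional form (linear in x1, x2, and their interaction)
form <- y ~ x1 + x2 + x1:x2

# Step 2: Train the selected model, i.e., estimate its parameters from the data
est_lm <- lm(form, data = sim_df)

# The estimated parameters (intercept plus three slopes)
coef(est_lm)

# Predictions from the trained model
y_hat <- predict(est_lm, newdata = sim_df)
```

Once the form is fixed, "learning" reduces to estimating a handful of coefficients, which is what makes this approach simple to estimate and also what makes it fragile if the form is wrong.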
--

.hi-slate[Non-parametric methods] avoid explicit assumptions about the shape of $\color{#20B2AA}{f}$.
They attempt to .pink[flexibly fit] the data while trying to .pink[avoid overfitting].

---

## Which type of $\hat{f}$?

Methods' parametric assumptions come with tradeoffs.

.hi-slate[Parametric methods]
 .pink.mono[+] Simpler to estimate and interpret.
 .purple.mono[-] If the assumed functional form is bad, model performance will suffer.

.hi-slate[Non-parametric methods]
 .pink.mono[+] Fewer assumptions. More flexibility.
 .purple.mono[-] Lower interpretability. Susceptible to overfitting.

y_lm, y - y_knn10, y - y_knn5, y - y_knn1)), max(sample_df %>% transmute(y - y_lm, y - y_knn10, y - y_knn5, y - y_knn1)) ) ```

---
name: ex-truth

.hi-slate[Truth:] The (nonlinear) $f(\mathbf{X})$ that we hope to recover.

```{R, ex truth, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# True 'f(X)' (surface)
plot_ly(
  x = unique(truth_df$x1),
  y = unique(truth_df$x2),
  z = matrix(data = truth_df$y, ncol = sqrt(nrow(truth_df))),
  colors = colorRamp(viridis::magma(8)),
  cmin = range_y[1],
  cmax = range_y[2]
) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "y", range = range_y)
  )) %>%
  hide_colorbar()
```

---

.hi-slate[The sample:] $n=70$ randomly drawn observations for $\mathbf{y} = f(\mathbf{x}_1,\, \mathbf{x}_2) + \varepsilon$

```{R, ex sample, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# Sample observations (3d scatter)
plot_ly(
  type = "scatter3d",
  x = sample_df$x1,
  y = sample_df$x2,
  z = sample_df$y,
  mode = "markers",
  color = sample_df$y,
  colors = colorRamp(viridis::magma(8))
) %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "y", range = range_y)
  )) %>%
  hide_colorbar()
```

---
name: ex-lm

.hi-slate[Estimated linear-regression model:] $\hat{\mathbf{y}} = \hat\beta_0 + \hat\beta_1 \mathbf{x}_1 + \hat\beta_2 \mathbf{x}_2 + \hat\beta_3 \mathbf{x}_1 \mathbf{x}_2$

```{R, ex lm, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# Linear regression estimate (surface)
plot_ly(
  x = unique(truth_df$x1),
  y = unique(truth_df$x2),
  z = matrix(data = truth_df$y_lm, ncol = sqrt(nrow(truth_df))),
  colors = colorRamp(viridis::magma(8)),
  cmin = range_y[1],
  cmax = range_y[2]
) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "y", range = range_y)
  )) %>%
  colorbar(limits = range_y) %>%
  hide_colorbar()
```

---

.hi-slate[Prediction error] from our fitted linear regression model

```{R, ex lm errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# Regression error (3d scatter)
plot_ly(
  type = "scatter3d",
  x = sample_df$x1,
  y = sample_df$x2,
  z = sample_df$y - sample_df$y_lm,
  mode = "markers",
  color = sample_df$y - sample_df$y_lm,
  colors = colorRamp(viridis::magma(8))
) %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "error", range = range_error)
  )) %>%
  colorbar(limits = range_error) %>%
  hide_colorbar()
```

---
name: ex-knn

.hi-slate[k-nearest neighbors] (kNN) using k=5 .grey-light[(a *non-parametric* method)]

```{R, ex knn5, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# knn model (k = 5) (surface)
plot_ly(
  x = unique(truth_df$x1),
  y = unique(truth_df$x2),
  z = matrix(data = truth_df$y_knn5, ncol = sqrt(nrow(truth_df))),
  colors = colorRamp(viridis::magma(8)),
  cmin = range_y[1],
  cmax = range_y[2]
) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "y", range = range_y)
  )) %>%
  hide_colorbar()
```

---

.hi-slate[k-nearest neighbors] (kNN) using k=10 .grey-light[(notice increased smoothness)]

```{R, ex knn10, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# knn model (k = 10) (surface)
plot_ly(
  x = unique(truth_df$x1),
  y = unique(truth_df$x2),
  z = matrix(data = truth_df$y_knn10, ncol = sqrt(nrow(truth_df))),
  colors = colorRamp(viridis::magma(8)),
  cmin = range_y[1],
  cmax = range_y[2]
) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "y", range = range_y)
  )) %>%
  hide_colorbar()
```

---

.hi-slate[k-nearest neighbors]
(kNN) using k=1 .grey-light[(notice decreased smoothness)]

```{R, ex knn1, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# knn model (k = 1) (surface)
plot_ly(
  x = unique(truth_df$x1),
  y = unique(truth_df$x2),
  z = matrix(data = truth_df$y_knn1, ncol = sqrt(nrow(truth_df))),
  colors = colorRamp(viridis::magma(8)),
  cmin = range_y[1],
  cmax = range_y[2]
) %>%
  add_surface() %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "y", range = range_y)
  )) %>%
  hide_colorbar()
```

---

.hi-slate[Prediction error] from our fitted kNN (k=5) model

```{R, ex knn 5 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# kNN 5 error (3d scatter)
plot_ly(
  type = "scatter3d",
  x = sample_df$x1,
  y = sample_df$x2,
  z = sample_df$y - sample_df$y_knn5,
  mode = "markers",
  color = sample_df$y - sample_df$y_knn5,
  colors = colorRamp(viridis::magma(8))
) %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "error", range = range_error)
  )) %>%
  colorbar(limits = range_error) %>%
  hide_colorbar()
```

---

.hi-slate[Prediction error] from our fitted kNN (k=10) model

```{R, ex knn 10 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# kNN 10 error (3d scatter)
plot_ly(
  type = "scatter3d",
  x = sample_df$x1,
  y = sample_df$x2,
  z = sample_df$y - sample_df$y_knn10,
  mode = "markers",
  color = sample_df$y - sample_df$y_knn10,
  colors = colorRamp(viridis::magma(8))
) %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "error", range = range_error)
  )) %>%
  colorbar(limits = range_error) %>%
  hide_colorbar()
```

---

.hi-slate[Prediction error] from our fitted kNN (k=1) model

```{R, ex knn 1 errors, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# kNN 1 error (3d scatter)
plot_ly(
  type = "scatter3d",
  x = sample_df$x1,
  y = sample_df$x2,
  z = sample_df$y - sample_df$y_knn1,
  mode = "markers",
  color = sample_df$y - sample_df$y_knn1,
  colors = colorRamp(viridis::magma(8))
) %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "error", range = range_error)
  )) %>%
  colorbar(limits = range_error) %>%
  hide_colorbar()
```

---

.note[Recall] .hi-slate[Prediction error] from our fitted linear regression model

```{R, ex lm errors again, echo = F, fig.height = 7.5, cache = T, dependson = "ex data"}
# Regression error (3d scatter)
plot_ly(
  type = "scatter3d",
  x = sample_df$x1,
  y = sample_df$x2,
  z = sample_df$y - sample_df$y_lm,
  mode = "markers",
  color = sample_df$y - sample_df$y_lm,
  colors = colorRamp(viridis::magma(8))
) %>%
  layout(scene = list(
    xaxis = list(title = "x1", range = range_x),
    yaxis = list(title = "x2", range = range_x),
    zaxis = list(title = "error", range = range_error)
  )) %>%
  colorbar(limits = range_error) %>%
  hide_colorbar()
```

---
layout: true
# Model accuracy

---
name: accuracy-questions

## Questions

1. Which of the methods was the most flexible? Inflexible?
1. Why do you think kNN with k=1 had such low prediction error?
1. How could we (better) assess model/predictive performance?
1. Why would we ever want to choose a less flexible model?

---

## Measurement

You probably will not be surprised to know that there is no one-size-fits-all solution in statistical learning.

.qa[Q] How do we choose between competing models?

--

.qa[A] We're a few steps away, but before we do anything, we need a way to .hi-slate[define model performance].

---
name: accuracy-subtlety

## Subtlety

Defining performance can actually be quite tricky...

.note[Regression setting, 1] Which do you prefer?

1. Lots of little errors and a few really large errors.
1. Medium-sized errors for everyone.

.note[Regression setting, 2] Is a 1-unit error (*e.g.*, $1,000) equally bad for everyone?
---

## Subtlety

Defining performance can actually be quite tricky...

.note[Classification setting, 1] Which is worse?

1. False positive (*e.g.*, incorrectly diagnosing cancer)
1. False negative (*e.g.*, missing cancer)

.note[Classification setting, 2] Which is more important?

1. True positive (*e.g.*, correct diagnosis of cancer)
1. True negative (*e.g.*, correct diagnosis of "no cancer")

---
name: mse

## MSE

.attn[Mean squared error (MSE)] is the most common.super[.pink[†]] way to measure model performance in a regression setting.

.footnote[
.pink[†] *Most common* does not mean best; it just means lots of people use it.
]

$$\text{MSE} = \dfrac{1}{n} \sum_{i=1}^n \left[ \color{#FFA500}{y}_i - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{x}_i) \right]^2$$

.note[Recall:] $\color{#FFA500}{y}_i - \hat{\color{#20B2AA}{f}}(\color{#6A5ACD}{x}_i) = \color{#FFA500}{y}_i - \hat{\color{#FFA500}{y}}_i$ is our prediction error.

--

Two notes about MSE:

1. MSE will be (relatively) very small when .hi-slate[prediction error] is nearly zero.
1. MSE .hi-slate[penalizes] big errors more than little errors (the squared part).

---
name: training-testing

## Training or testing?

Low MSE (accurate performance) on the data that trained the model isn't actually impressive: maybe the model is just overfitting our data..super[.pink[†]]

.footnote[
.pink[†] Recall the kNN performance for k=1.
]

.note[What we want:] How well does the model perform .hi-slate[on data it has never seen]?

--

This introduces an important distinction:

1. .hi-slate[Training data]: The observations $(\color{#FFA500}{y}_i,\color{#e64173}{x}_i)$ used to .hi-slate[train] our model $\hat{\color{#20B2AA}{f}}$.
1. .hi-slate[Testing data]: The observations $(\color{#FFA500}{y}_0,\color{#e64173}{x}_0)$ that our model has yet to see, and which we can use to evaluate the performance of $\hat{\color{#20B2AA}{f}}$.

--

.hi-slate[Real goal: Low test-sample MSE] (not the training MSE from before).
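---

## Training vs. testing MSE: a sketch

.note[Sketch] The training/testing distinction above can be illustrated in a few lines of .mono[R]. This is a minimal sketch, assuming simulated data: `sim_df`, the helper `mse()`, the 70/30 split, and both models are assumptions made for illustration, not part of the course materials.

```r
# Simulated data (hypothetical; not the data used in these slides)
set.seed(524)
n <- 200
sim_df <- data.frame(x = runif(n, min = -2, max = 2))
sim_df$y <- sin(2 * sim_df$x) + rnorm(n, sd = 0.3)

# Randomly split the observations: 70% training, 30% testing
train_i <- sample(n, size = round(0.7 * n))
train_df <- sim_df[train_i, ]
test_df <- sim_df[-train_i, ]

# Helper (hypothetical name): MSE is the mean of the squared prediction errors
mse <- function(y, y_hat) mean((y - y_hat)^2)

# Train an inflexible model and a very flexible model on the training data only
est_linear <- lm(y ~ x, data = train_df)
est_flexible <- lm(y ~ poly(x, 15), data = train_df)

# Compare each model's training MSE to its testing MSE
mse_df <- data.frame(
  model = c("linear", "15th-degree polynomial"),
  mse_train = c(
    mse(train_df$y, predict(est_linear, newdata = train_df)),
    mse(train_df$y, predict(est_flexible, newdata = train_df))
  ),
  mse_test = c(
    mse(test_df$y, predict(est_linear, newdata = test_df)),
    mse(test_df$y, predict(est_flexible, newdata = test_df))
  )
)
mse_df
```

Because the polynomial nests the linear model, its training MSE can never be larger than the linear model's. The testing MSE carries no such guarantee, which is why it is the more honest yardstick.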
---
layout: false
class: clear, middle

.hi-slate[Next time:] model performance, the variance-bias tradeoff, and kNN

---
name: sources
layout: false

# Sources

These notes draw upon

- [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)
  James, Witten, Hastie, and Tibshirani
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
