In-class
Readings
Problem sets
We've essentially covered the central topics in statistical learning†
† Plus a few of the "basic" methods: OLS regression and KNN.
†† And the bootstrap!
Next, we will cover many common machine-learning algorithms, e.g.,
But first, we return to good old linear regression—in a new light...
Motivation 1
We have new tools. It might help to first apply them in a familiar setting.
Motivation 2
We have new tools. Maybe linear regression will be (even) better now?
Motivation 3
"many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression."
Source: ISL, p. 59; emphasis added
Recall Linear regression "fits" coefficients $\beta_0, \ldots, \beta_p$ for a model $$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_p x_{p,i} + \varepsilon_i$$ and is often applied in two distinct settings with fairly distinct goals:
Causal inference estimates and interprets the coefficients.
Prediction focuses on accurately estimating outcomes.
Regardless of the goal, the way we "fit" (estimate) the model is the same.
As is the case with many statistical learning methods, regression focuses on minimizing some measure of loss/error.
$$e_i = y_i - \hat{y}_i$$
Linear regression uses the L2 loss function—also called residual sum of squares (RSS) or sum of squared errors (SSE)
$$\text{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^n e_i^2$$
Specifically: OLS chooses the $\hat{\beta}_j$ that minimize RSS.
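For concreteness, here is a minimal sketch in R (using the built-in mtcars data, purely for illustration) that fits a model with lm() and computes RSS by hand:

```r
# Minimal sketch: OLS via lm(), then RSS computed from the residuals
fit = lm(mpg ~ wt + hp, data = mtcars)

# e_i = y_i - y_hat_i
e = residuals(fit)

# RSS = sum of squared errors; lm() chooses coefficients to minimize this
rss = sum(e^2)
rss
```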
There's a large variety of ways to assess the fit† of linear-regression models.
† or predictive performance
Residual standard error (RSE) $$\text{RSE} = \sqrt{\frac{1}{n-p-1}\,\text{RSS}} = \sqrt{\frac{1}{n-p-1}\sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2}$$
R-squared ($R^2$) $$R^2 = \frac{\text{TSS}-\text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \quad\text{where}\quad \text{TSS} = \sum_{i=1}^n \left(y_i - \bar{y}\right)^2$$
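Continuing the mtcars sketch from above, both statistics are easy to compute directly and to check against lm()'s summary():

```r
# n observations and p predictors from the fitted model
n = nobs(fit)
p = length(coef(fit)) - 1

# RSE = sqrt(RSS / (n - p - 1)); should match summary(fit)$sigma
rse = sqrt(rss / (n - p - 1))

# R2 = 1 - RSS/TSS; should match summary(fit)$r.squared
tss = sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 = 1 - rss / tss

c(RSE = rse, R2 = r2)
```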
As we've seen throughout the course, we need to be careful not to overfit.
$R^2$ provides no protection against overfitting—and actually encourages it. $$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$ Add a new variable: RSS ↓ and TSS is unchanged. Thus, $R^2$ increases.
RSE slightly penalizes additional variables: $$\text{RSE} = \sqrt{\frac{1}{n-p-1}\,\text{RSS}}$$ Add a new variable: RSS ↓ but $p$ increases. Thus, RSE's change is uncertain.
Let's see how $R^2$ and RSE perform with 500 very weak predictors.
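The simulation in the slides isn't reproduced here, but a minimal sketch in the same spirit (sample size, coefficient sizes, and seed are all illustrative assumptions) could be:

```r
set.seed(101)

# n observations of y generated by 500 very weak predictors plus noise
n = 1e3
k = 500
x = matrix(rnorm(n * k), nrow = n)
y = drop(5 + x %*% rnorm(k, sd = 0.05) + rnorm(n))

# In-sample R2 and RSE as we add predictors one at a time (takes a moment)
in_sample = sapply(1:k, function(j) {
  f = summary(lm(y ~ x[, 1:j, drop = FALSE]))
  c(r2 = f$r.squared, rse = f$sigma)
})
# in_sample["r2", ] rises mechanically with j; in_sample["rse", ] need not
```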
To address overfitting, we can compare in- vs. out-of-sample performance.
In-sample $R^2$ mechanically increases as we add predictors.
Out-of-sample $R^2$ does not.
What about RSE? Does its penalty help?
Despite its penalty for adding variables, in-sample RSE can still overfit, as evidenced by out-of-sample RSE.
RSE is not the only way to penalize the addition of variables.†
† We'll talk about other penalization methods (LASSO and Ridge) shortly.
Adjusted $R^2$ is another classic solution. $$\text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)}$$ Adj. $R^2$ attempts to "fix" $R^2$ by adding a penalty for the number of variables.
RSS always decreases when a new variable is added.
$\text{RSS}/(n-p-1)$ may increase or decrease with a new variable.
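In R, summary() reports adjusted $R^2$ directly; a quick check against the formula (reusing the mtcars sketch from above):

```r
# Adjusted R2 by the formula vs. lm's built-in value
adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
all.equal(adj_r2, summary(fit)$adj.r.squared)  # TRUE
```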
However, in-sample adjusted $R^2$ can still overfit, as illustrated by out-of-sample adjusted $R^2$.
$R^2$, adjusted $R^2$, and RSE each offer some flavor of model fit, but they appear limited in their abilities to prevent overfitting.
We want a method to optimally select a (linear) model—balancing variance and bias and avoiding overfit.
We'll discuss two (related) methods today:
Subset selection chooses a (sub)set of our p potential predictors
Shrinkage fits a model using all p variables but "shrinks" its coefficients
In subset selection, we whittle down our $p$ potential predictors to a smaller subset.
How do we do the whittling (selection)? We've got options.
Best subset selection is based upon a simple idea: Estimate a model for every possible subset of variables; then compare their performances.
Q So what's the problem? (Why do we need other selection methods?)
A "A model for every possible subset" can mean a lot ($2^p$) of models.
E.g., $p = 10$ gives 1,024 models; $p = 25$ gives over 33 million; $p = 50$ gives over $10^{15}$.
Even with plentiful, cheap computational power, we can run into barriers.
Computational constraints aside, we can implement best subset selection as
1. Define $M_0$ as the model with no predictors.
2. For $k$ in 1 to $p$:
   - Fit every possible model with $k$ predictors.
   - Define $M_k$ as the "best" model with $k$ predictors.
3. Select the "best" model from $M_0, \ldots, M_p$.
As we've seen, RSS declines (and $R^2$ increases) with $p$, so we should use a cross-validated measure of model performance in step 3.†
† Back to our distinction between test vs. training performance.
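In R, the leaps package implements this enumeration. A sketch, assuming the pre-processed credit_dt data frame introduced below:

```r
library(leaps)

# Enumerate subsets: for each size k, regsubsets() keeps the lowest-RSS model
best_sub = regsubsets(
  balance ~ .,        # outcome vs. all candidate predictors
  data = credit_dt,   # pre-processed Credit data (introduced below)
  nvmax = 11          # allow subsets up to all 11 predictors
)

# Which variables enter the best model of each size
summary(best_sub)
```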
Credit
We're going to use the Credit dataset from ISL's R package ISLR.
ID | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 14.891 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
2 | 106.025 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
3 | 104.593 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
4 | 148.924 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
5 | 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |
6 | 80.18 | 8047 | 569 | 4 | 77 | 10 | Male | No | No | Caucasian | 1151 |
7 | 20.996 | 3388 | 259 | 2 | 37 | 12 | Female | No | No | African American | 203 |
8 | 71.408 | 7114 | 512 | 2 | 87 | 9 | Male | No | No | Asian | 872 |
9 | 15.125 | 3300 | 266 | 5 | 66 | 13 | Female | No | No | Caucasian | 279 |
10 | 71.061 | 6819 | 491 | 3 | 41 | 19 | Female | Yes | Yes | African American | 1350 |
The Credit dataset has 400 observations on 12 variables.
We need to pre-process the dataset before we can select a model...
income | limit | rating | cards | age | education | i_female | i_student | i_married | i_asian | i_african_american | balance |
---|---|---|---|---|---|---|---|---|---|---|---|
14.891 | 3606 | 283 | 2 | 34 | 11 | 0 | 0 | 1 | 0 | 0 | 333 |
106.025 | 6645 | 483 | 3 | 82 | 15 | 1 | 1 | 1 | 1 | 0 | 903 |
104.593 | 7075 | 514 | 4 | 71 | 11 | 0 | 0 | 0 | 1 | 0 | 580 |
148.924 | 9504 | 681 | 3 | 36 | 11 | 1 | 0 | 0 | 1 | 0 | 964 |
55.882 | 4897 | 357 | 2 | 68 | 16 | 0 | 0 | 1 | 0 | 0 | 331 |
80.18 | 8047 | 569 | 4 | 77 | 10 | 0 | 0 | 0 | 0 | 0 | 1151 |
20.996 | 3388 | 259 | 2 | 37 | 12 | 1 | 0 | 0 | 0 | 1 | 203 |
71.408 | 7114 | 512 | 2 | 87 | 9 | 0 | 0 | 0 | 1 | 0 | 872 |
15.125 | 3300 | 266 | 5 | 66 | 13 | 1 | 0 | 0 | 0 | 0 | 279 |
71.061 | 6819 | 491 | 3 | 41 | 19 | 1 | 1 | 1 | 0 | 1 | 1350 |
Now the dataset has 400 observations on 12 variables, i.e., 2,048 ($2^{11}$) possible subsets of the 11 predictors.
From here, you would
1. Estimate cross-validated error for each $M_k$.
2. Choose the $M_k$ that minimizes the CV error.
3. Train the chosen model on the full dataset.
Warnings
Benefits
Stepwise selection provides a less computationally intensive alternative to best subset selection.
The basic idea behind stepwise selection: start from a baseline model and add (or remove) one predictor at a time.
The two most-common varieties of stepwise selection: forward and backward.
The process...
1. Start with a model with only an intercept (no predictors), $M_0$.
2. For $k = 0, \ldots, p-1$:
   - Estimate a model for each of the remaining $p-k$ predictors, separately adding each predictor to model $M_k$.
   - Define $M_{k+1}$ as the "best" of these $p-k$ models.
3. Select the "best" model from $M_0, \ldots, M_p$.
What do we mean by "best"?
In step 2, best is often lowest RSS or highest $R^2$.
In step 3, best should be a cross-validated fit criterion.
Forward stepwise selection with caret in R

```r
library(caret)
library(dplyr)

train_forward = train(
  y = credit_dt[["balance"]],
  x = credit_dt %>% dplyr::select(-balance),
  trControl = trainControl(method = "cv", number = 5),
  method = "leapForward",
  tuneGrid = expand.grid(nvmax = 1:11)
)
```
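The table below can be pulled straight from the fitted train object; for example:

```r
# Cross-validated RMSE, R2, and MAE for each value of nvmax
train_forward$results

# The nvmax that caret selected (lowest CV RMSE)
train_forward$bestTune
```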
N vars | RMSE | R2 | MAE |
---|---|---|---|
1 | 232.57 | 0.745 | 175.2 |
2 | 163.13 | 0.874 | 121.9 |
3 | 103.31 | 0.950 | 83.8 |
4 | 101.04 | 0.952 | 81.8 |
5 | 99.32 | 0.954 | 79.6 |
6 | 99.68 | 0.953 | 80.0 |
7 | 99.96 | 0.953 | 80.4 |
8 | 99.99 | 0.953 | 80.4 |
9 | 99.85 | 0.953 | 80.2 |
10 | 99.79 | 0.953 | 80.2 |
The process for backward stepwise selection is quite similar...
1. Start with a model that includes all $p$ predictors: $M_p$.
2. For $k = p, p-1, \ldots, 1$:
   - Estimate $k$ models, where each model removes exactly one of the $k$ predictors from $M_k$.
   - Define $M_{k-1}$ as the "best" of the $k$ models.
3. Select the "best" model from $M_0, \ldots, M_p$.
What do we mean by "best"?
In step 2, best is often lowest RSS or highest $R^2$.
In step 3, best should be a cross-validated fit criterion.
Backward stepwise selection with caret in R

```r
train_backward = train(
  y = credit_dt[["balance"]],
  x = credit_dt %>% dplyr::select(-balance),
  trControl = trainControl(method = "cv", number = 5),
  method = "leapBackward",
  tuneGrid = expand.grid(nvmax = 1:11)
)
```
N vars | RMSE | R2 | MAE |
---|---|---|---|
1 | 233.06 | 0.743 | 177.6 |
2 | 165.41 | 0.871 | 124.9 |
3 | 104.30 | 0.949 | 83.8 |
4 | 99.88 | 0.954 | 79.5 |
5 | 99.40 | 0.954 | 79.4 |
6 | 99.41 | 0.954 | 79.4 |
7 | 99.64 | 0.954 | 79.5 |
8 | 100.02 | 0.953 | 79.7 |
9 | 100.00 | 0.953 | 79.9 |
10 | 99.84 | 0.954 | 79.7 |
Note: forward and backward stepwise selection can choose different models.
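One way to check on the Credit example: caret's finalModel here is a leaps::regsubsets object, so coef() with the chosen size lists each direction's selected variables (a sketch, assuming the train objects from above):

```r
# Variables (and coefficients) chosen by each direction at its best size
coef(train_forward$finalModel, id = train_forward$bestTune$nvmax)
coef(train_backward$finalModel, id = train_backward$bestTune$nvmax)
```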
Notes on stepwise selection
Less computationally intensive (relative to best subset selection)
There is no guarantee that stepwise selection finds the best model.
Best is defined by your fit criterion (as always).
Again, cross-validation is key to avoiding overfitting.
Which model you choose is a function of how you define "best".
And we have many options... We've seen RSS, (R)MSE, RSE, MAE, $R^2$, Adj. $R^2$.
Of course, there's more. Each penalizes the $d$ predictors differently. $$C_p = \frac{1}{n}\left(\text{RSS} + 2d\hat{\sigma}^2\right) \qquad \text{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\text{RSS} + 2d\hat{\sigma}^2\right) \qquad \text{BIC} = \frac{1}{n\hat{\sigma}^2}\left(\text{RSS} + \log(n)\, d\hat{\sigma}^2\right)$$
"$C_p$, AIC, and BIC all have rigorous theoretical justifications... the adjusted $R^2$ is not as well motivated in statistical theory"
ISL, p. 213
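In R, base stats provides AIC() and BIC() for fitted models. Note that base R computes them from the log-likelihood, so the numbers differ from ISL's least-squares versions above, though both penalize model size. A sketch, again assuming credit_dt:

```r
# Compare a small and a large model by AIC and BIC (smaller is better)
fit_small = lm(balance ~ income + rating, data = credit_dt)
fit_big = lm(balance ~ ., data = credit_dt)

AIC(fit_small); AIC(fit_big)
BIC(fit_small); BIC(fit_big)
```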
In general, we will stick with cross-validated criteria, but you still need to choose a selection criterion.
These notes draw upon An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, and Tibshirani.