
Lecture 004

Regression strikes back

Edward Rubin

04 February 2020

1 / 42

Admin

2 / 42

Admin

Today

In-class

  • A roadmap (where are we going?)
  • Linear regression and model selection
3 / 42

Admin

Admin

Upcoming

Readings

  • Today
    • ISL Ch. 3 and 6.1
  • Next
    • ISL Ch. 6 and 4

Problem sets

  • Due tonight! (How did it go?)
  • Next: After we finish this set of notes
4 / 42

Roadmap

Where are we?

We've essentially covered the central topics in statistical learning

  • Prediction and inference
  • Supervised vs. unsupervised methods
  • Regression and classification problems
  • The dangers of overfitting
  • The bias-variance tradeoff
  • Model assessment
  • Holdouts, validation sets, and cross validation††
  • Model training and tuning
  • Simulation

Plus a few of the "basic" methods: OLS regression and KNN.
†† And the bootstrap!

5 / 42

Roadmap

Where are we going?

Next, we will cover many common machine-learning algorithms, e.g.,

  • Decision trees and random forests
  • SVM
  • Neural nets
  • Clustering
  • Ensemble techniques

But first, we return to good old linear regression—in a new light...

  • Linear regression
  • Variable/model selection and LASSO/Ridge regression
  • Plus: Logistic regression and discriminant analysis
6 / 42

Roadmap

Why return to regression?

Motivation 1
We have new tools. It might help to first apply them in a familiar setting.

Motivation 2
We have new tools. Maybe linear regression will be (even) better now?

Motivation 3

many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.

Source: ISL, p. 59; emphasis added

7 / 42

Linear regression

8 / 42

Linear regression

Regression regression

Recall Linear regression "fits" coefficients $\beta_0, \ldots, \beta_p$ for a model $$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_p x_{p,i} + \varepsilon_i$$ and is often applied in two distinct settings with fairly distinct goals:

  1. Causal inference estimates and interprets the coefficients.

  2. Prediction focuses on accurately estimating outcomes.

Regardless of the goal, the way we "fit" (estimate) the model is the same.

9 / 42

Linear regression

Fitting the regression line

As is the case with many statistical learning methods, regression focuses on minimizing some measure of loss/error.

$e_i = y_i - \hat{y}_i$

Linear regression uses the $L_2$ loss function—also called the residual sum of squares (RSS) or sum of squared errors (SSE)

$$\text{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2$$

Specifically: OLS chooses the $\hat{\beta}_j$ that minimize RSS.

10 / 42

Linear regression

Performance

There are many ways to assess the fit† of a linear-regression model.

† Or predictive performance.

Residual standard error (RSE)
$$\text{RSE} = \sqrt{\frac{1}{n-p-1}\,\text{RSS}} = \sqrt{\frac{1}{n-p-1}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

R-squared ($R^2$)
$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \quad \text{where} \quad \text{TSS} = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
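As a quick check, here is a small R sketch (mine, not from the original slides) that computes RSS, RSE, and $R^2$ by hand for an OLS fit on the built-in mtcars data and compares them with what summary(lm) reports.

# Fit an OLS model and recover RSS, RSE, and R² from its residuals
fit = lm(mpg ~ wt + hp, data = mtcars)
e   = residuals(fit)                              # e_i = y_i - y_hat_i
n   = nrow(mtcars); p = 2                         # p = number of predictors
rss = sum(e^2)                                    # residual sum of squares
tss = sum((mtcars$mpg - mean(mtcars$mpg))^2)      # total sum of squares
rse = sqrt(rss / (n - p - 1))                     # residual standard error
r2  = 1 - rss / tss                               # R-squared
c(rse = rse, r2 = r2)                             # matches summary(fit)$sigma and $r.squared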

11 / 42

Linear regression

Performance and overfit

As we've seen throughout the course, we need to be careful not to overfit.

$R^2$ provides no protection against overfitting—and actually encourages it. $$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$ Add a new variable: RSS decreases and TSS is unchanged. Thus, $R^2$ increases.

RSE slightly penalizes additional variables: $$\text{RSE} = \sqrt{\frac{1}{n-p-1}\,\text{RSS}}$$ Add a new variable: RSS decreases, but $p$ increases. Thus, the direction of the change in RSE is ambiguous.
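A tiny illustration of this point (my own example, not the one behind the upcoming figures): add a pure-noise predictor to a regression and watch $R^2$ rise mechanically while RSE can move in either direction.

set.seed(101)
mt = mtcars
mt$noise = rnorm(nrow(mt))                              # a predictor with no real signal
fit1 = lm(mpg ~ wt, data = mt)
fit2 = lm(mpg ~ wt + noise, data = mt)
c(summary(fit1)$r.squared, summary(fit2)$r.squared)     # R² (weakly) increases
c(summary(fit1)$sigma, summary(fit2)$sigma)             # RSE can go either way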

12 / 42

Example

Let's see how $R^2$ and RSE perform with 500 very weak predictors.

To address overfitting, we can compare in- vs. out-of-sample performance.
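The original simulation code isn't shown here, but a minimal sketch of the idea might look like this: draw 500 very weak predictors, fit models that use more and more of them, and compare in-sample $R^2$ to $R^2$ computed on a fresh sample. (The sample size, coefficient scale, and seed below are assumptions of mine.)

set.seed(1)
n = 100; p = 500
b = rnorm(p, sd = 0.02)                                   # 500 very weak coefficients
x = matrix(rnorm(n * p), ncol = p);     y = drop(x %*% b + rnorm(n))
x_new = matrix(rnorm(n * p), ncol = p); y_new = drop(x_new %*% b + rnorm(n))
r2 = function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
# In- vs. out-of-sample R² as we add the first k predictors
sapply(seq(5, 95, 10), function(k) {
  fit = lm(y ~ x[, 1:k])
  y_hat_new = cbind(1, x_new[, 1:k]) %*% coef(fit)
  c(k = k, in_sample = r2(y, fitted(fit)), out_of_sample = r2(y_new, y_hat_new))
})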

13 / 42

In-sample $R^2$ mechanically increases as we add predictors.
Out-of-sample $R^2$ does not.

14 / 42


What about RSE? Does its penalty help?

16 / 42

Despite its penalty for adding variables, in-sample RSE still can overfit,
as evidenced by out-of-sample RSE.

17 / 42


Linear regression

Penalization

RSE is not the only way to penalize the addition of variables.

We'll talk about other penalization methods (LASSO and Ridge) shortly.

Adjusted $R^2$ is another classic solution.

$$\text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)}$$

Adj. $R^2$ attempts to "fix" $R^2$ by adding a penalty for the number of variables.

  • RSS always decreases when a new variable is added.

  • $\text{RSS}/(n-p-1)$ may increase or decrease with a new variable.
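Here's a quick sketch (again using mtcars rather than the original data) that computes adjusted $R^2$ from the formula above and checks it against summary(lm).

fit = lm(mpg ~ wt + hp, data = mtcars)
n = nrow(mtcars); p = 2
rss = sum(residuals(fit)^2)
tss = sum((mtcars$mpg - mean(mtcars$mpg))^2)
adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
c(by_hand = adj_r2, from_lm = summary(fit)$adj.r.squared)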

19 / 42

However, in-sample adjusted $R^2$ can still overfit,
as illustrated by out-of-sample $R^2$.

20 / 42

However, in-sample adjusted $R^2$ can still overfit,
as illustrated by out-of-sample adjusted $R^2$.

21 / 42

Model selection

A better way?

$R^2$, adjusted $R^2$, and RSE each offer some flavor of model fit, but they appear limited in their abilities to prevent overfitting.

We want a method to optimally select a (linear) model—balancing variance and bias and avoiding overfit.

We'll discuss two (related) methods today:

  1. Subset selection chooses a (sub)set of our p potential predictors

  2. Shrinkage fits a model using all p variables but "shrinks" its coefficients

22 / 42

Model selection

Subset selection

In subset selection, we

  1. whittle down the p potential predictors (using some magic/algorithm)
  2. estimate the chosen linear model using OLS

How do we do the whittling (selection)? We've got options.

  • Best subset selection fits a model for every possible subset.
  • Forward stepwise selection starts with only an intercept and tries to build up to the best model (using some fit criterion).
  • Backward stepwise selection starts with all p variables and tries to drop variables until it hits the best model (using some fit criterion).
  • Hybrid approaches are what their name implies (i.e., hybrids).
23 / 42

Model selection

Best subset selection

Best subset selection is based upon a simple idea: Estimate a model for every possible subset of variables; then compare their performances.

Q So what's the problem? (Why do we need other selection methods?)
A "a model for every possible subset" can mean a lot (2p) of models.

E.g.,

  • 10 predictors 1,024 models to fit
  • 25 predictors >33.5 million models to fit
  • 100 predictors ~1.5 trillion models to fit

Even with plentiful, cheap computational power, we can run into barriers.
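The counts above come straight from $2^p$; a one-liner in R reproduces them (the last one is roughly $1.27 \times 10^{30}$, far beyond anything we could actually fit).

sapply(c(10, 25, 100), function(p) 2^p)   # 1,024;  33,554,432;  ~1.27e30 models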

24 / 42

Model selection

Best subset selection

Computational constraints aside, we can implement best subset selection as follows:

  1. Define $M_0$ as the model with no predictors.

  2. For $k = 1, \ldots, p$:

    • Fit every possible model with $k$ predictors.

    • Define $M_k$ as the "best" model with $k$ predictors.

  3. Select the "best" model from $M_0, \ldots, M_p$.

As we've seen, RSS declines (and $R^2$ increases) with $p$, so we should use a cross-validated measure of model performance in step 3.

Back to our distinction between test vs. training performance.
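If you want to try this, one option (an illustrative sketch, not the code used for the slides that follow) is the leaps package, whose regsubsets() function performs the exhaustive search over the candidate predictors in ISLR's Credit data.

library(leaps)
# Exhaustive (best subset) search over the 11 candidate predictors in Credit
best_sub = regsubsets(
  Balance ~ . - ID, data = ISLR::Credit,
  nvmax = 11, method = "exhaustive"
)
summary(best_sub)$which   # which predictors appear in each "best" M_k

Note that regsubsets() picks each $M_k$ by RSS; choosing among $M_0, \ldots, M_p$ should still rely on a cross-validated criterion.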

25 / 42

Model selection

Example dataset: Credit

We're going to use the Credit dataset from ISL's R package ISLR.

The Credit dataset has 400 observations on 12 variables.

26 / 42

Model selection

Example dataset: Credit

We need to pre-process the dataset before we can select a model...

Now the dataset has 400 observations on 12 variables (2,048 possible subsets).
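The pre-processing code isn't shown in these notes; one plausible version (an assumption on my part) drops the ID column and expands the factor variables into 0/1 dummies so that every predictor is numeric, storing the outcome as balance:

library(dplyr)
credit_raw = ISLR::Credit
credit_dt = model.matrix(Balance ~ . - ID, data = credit_raw)[, -1] %>%   # drop the intercept column
  as.data.frame() %>%
  mutate(balance = credit_raw$Balance)
dim(credit_dt)   # 400 observations, 12 columns (11 predictors + balance)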

27 / 42

Model selection

Best subset selection

From here, you would

  1. Estimate cross-validated error for each $M_k$.

  2. Choose the $M_k$ that minimizes the CV error.

  3. Train the chosen model on the full dataset.

30 / 42

Model selection

Best subset selection

Warnings

  • Computationally intensive
  • Selected models may not be "right" (e.g., a squared term can be selected without its linear term)
  • You need to protect against overfitting when choosing across the $M_k$
  • You should also worry about overfitting when $p$ is "big"
  • Dependent upon the variables you provide

Benefits

  • Comprehensive search across provided variables
  • Resulting model—when estimated with OLS—has OLS properties
  • Can be applied to other (non-OLS) estimators
31 / 42

Model selection

Stepwise selection

Stepwise selection provides a less computationally intensive alternative to best subset selection.

The basic idea behind stepwise selection

  1. Start with an arbitrary model.
  2. Try to find a "better" model by adding/removing variables.
  3. Repeat.
  4. Stop when you have the best model. (Or choose the best model.)

The two most-common varieties of stepwise selection:

  • Forward starts with only an intercept ($M_0$) and adds variables
  • Backward starts with all $p$ variables ($M_p$) and removes variables
32 / 42

Model selection

Forward stepwise selection

The process...

  1. Start with a model with only an intercept (no predictors), $M_0$.

  2. For $k = 0, \ldots, p - 1$:

    • Estimate a model for each of the remaining $p - k$ predictors, separately adding each predictor to model $M_k$.

    • Define $M_{k+1}$ as the "best" of these $p - k$ models.

  3. Select the "best" model from $M_0, \ldots, M_p$.

What do we mean by "best"?
In step 2, "best" is often defined by RSS or $R^2$.
In step 3, "best" should be a cross-validated fit criterion.

33 / 42

Forward stepwise selection with caret in R

# caret's train() runs 5-fold CV over models with 1 to 11 variables chosen by
# forward stepwise selection (method = "leapForward").
library(caret)
library(dplyr)

train_forward = train(
  y = credit_dt[["balance"]],
  x = credit_dt %>% dplyr::select(-balance),
  trControl = trainControl(method = "cv", number = 5),
  method = "leapForward",
  tuneGrid = expand.grid(nvmax = 1:11)
)
34 / 42

Model selection

Backward stepwise selection

The process for backward stepwise selection is quite similar...

35 / 42

Model selection

Backward stepwise selection

The process for backward stepwise selection is quite similar...

  1. Start with a model that includes all p predictors: Mp.

  2. For k=p,p1,,1:

    • Estimate k models, where each model removes exactly one of the k predictors from Mk.

    • Define Mk1 as the "best" of the k models.

  3. Select the "best" model from M0,,Mp.

35 / 42

Model selection

Backward stepwise selection

The process for backward stepwise selection is quite similar...

  1. Start with a model that includes all $p$ predictors: $M_p$.

  2. For $k = p, p - 1, \ldots, 1$:

    • Estimate $k$ models, where each model removes exactly one of the $k$ predictors from $M_k$.

    • Define $M_{k-1}$ as the "best" of the $k$ models.

  3. Select the "best" model from $M_0, \ldots, M_p$.

What do we mean by "best"?
In step 2, "best" is often defined by RSS or $R^2$.
In step 3, "best" should be a cross-validated fit criterion.

35 / 42

Backward stepwise selection with caret in R

# Identical setup, but method = "leapBackward" selects variables by
# backward stepwise selection.
train_backward = train(
  y = credit_dt[["balance"]],
  x = credit_dt %>% dplyr::select(-balance),
  trControl = trainControl(method = "cv", number = 5),
  method = "leapBackward",
  tuneGrid = expand.grid(nvmax = 1:11)
)
36 / 42

Note: forward and backward stepwise selection can choose different models.

37 / 42

Model selection

Stepwise selection

Notes on stepwise selection

  • Less computationally intensive (relative to best subset selection)

    • With $p = 20$, best subset selection fits $2^{20}$ = 1,048,576 models.
    • With $p = 20$, forward/backward stepwise selection fits only 211 models (see the short calculation after this list).
  • There is no guarantee that stepwise selection finds the best model.

  • Best is defined by your fit criterion (as always).

  • Again, cross validation is key to avoiding overfitting.
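For reference, the 211 comes from fitting the null model plus $p - k$ candidate models at each step, i.e., $1 + p(p+1)/2$ models in total:

p = 20
2^p                    # best subset selection: 1,048,576 models
1 + p * (p + 1) / 2    # forward or backward stepwise selection: 211 models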

38 / 42

Model selection

Criteria

Which model you choose is a function of how you define "best".

And we have many options... We've seen RSS, (R)MSE, RSE, MAE, $R^2$, and adjusted $R^2$.

Of course, there's more. Each penalizes the $d$ predictors differently.
$$C_p = \frac{1}{n}\left(\text{RSS} + 2 d \hat{\sigma}^2\right) \qquad \text{AIC} = \frac{1}{n \hat{\sigma}^2}\left(\text{RSS} + 2 d \hat{\sigma}^2\right) \qquad \text{BIC} = \frac{1}{n \hat{\sigma}^2}\left(\text{RSS} + \log(n)\, d \hat{\sigma}^2\right)$$
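As a rough sketch of how these criteria could be computed in practice (using mtcars, and estimating $\hat{\sigma}^2$ from the fitted model itself, which simplifies ISL's recommendation to estimate it from the full model):

fit  = lm(mpg ~ wt + hp, data = mtcars)
n    = nrow(mtcars); d = 2                      # d = number of predictors
rss  = sum(residuals(fit)^2)
sig2 = rss / (n - d - 1)                        # an estimate of sigma^2
c(
  Cp  = (rss + 2 * d * sig2) / n,
  AIC = (rss + 2 * d * sig2) / (n * sig2),
  BIC = (rss + log(n) * d * sig2) / (n * sig2)
)   # note: these are scaled differently from base R's AIC() and BIC()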

39 / 42

Model selection

Criteria

$C_p$, AIC, and BIC all have rigorous theoretical justifications... the adjusted $R^2$ is not as well motivated in statistical theory

ISL, p. 213

In general, we will stick with cross-validated criteria, but you still need to choose a selection criterion.

40 / 42

Sources

These notes draw upon

41 / 42
