--- title: "Lecture .mono[004]" subtitle: "Regression strikes back" author: "Edward Rubin" #date: "`r format(Sys.time(), '%d %B %Y')`" date: "04 February 2020" output: xaringan::moon_reader: css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css'] # self_contained: true nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- exclude: true ```{R, setup, include = F} library(pacman) p_load( broom, tidyverse, ggplot2, ggthemes, ggforce, ggridges, cowplot, scales, latex2exp, viridis, extrafont, gridExtra, plotly, ggformula, kableExtra, DT, data.table, dplyr, snakecase, janitor, lubridate, knitr, future, furrr, MASS, estimatr, FNN, caret, parsnip, huxtable, here, magrittr, parallel ) # Define colors red_pink = "#e64173" turquoise = "#20B2AA" orange = "#FFA500" red = "#fb6107" blue = "#3b3b9a" green = "#8bb174" grey_light = "grey70" grey_mid = "grey50" grey_dark = "grey20" purple = "#6A5ACD" slate = "#314f4f" # Knitr options opts_chunk$set( comment = "#>", fig.align = "center", fig.height = 7, fig.width = 10.5, warning = F, message = F ) opts_chunk$set(dev = "svg") options(device = function(file, width, height) { svg(tempfile(), width = width, height = height) }) options(knitr.table.format = "html") ``` --- layout: true # Admin --- class: inverse, middle --- name: admin-today ## Today .b[In-class] - A roadmap (where are we going?) - Linear regression and model selection --- name: admin-soon # Admin ## Upcoming .b[Readings] - .note[Today] - .it[ISL] Ch. 3 and 6.1 - .note[Next] - .it[ISL] Ch. 6 and 4 .b[Problem sets] - Due tonight! (How did it go?) - .it[Next:] After we finish this set of notes --- name: admin-roadmap layout: false # Roadmap ## Where are we? We've essentially covered the central topics in statistical learning.super[.pink[†]] - Prediction and inference - Supervised .it[vs.] unsupervised methods - Regression and classification problems - The dangers of overfitting - The bias-variance tradeoff - Model assessment - Holdouts, validation sets, and cross validation.super[.pink[††]] - Model training and tuning - Simulation .footnote[ .pink[†] Plus a few of the "basic" methods: OLS regression and KNN.
.pink[††] And the bootstrap! ] --- # Roadmap ## Where are we going? Next, we will cover many common machine-learning algorithms, _e.g._, - Decision trees and random forests - SVM - Neural nets - Clustering - Ensemble techniques -- But first, we return to good old .hi-orange[linear regression]—in a new light... - Linear regression - Variable/model selection and LASSO/Ridge regression - .it[Plus:] Logistic regression and discriminant analysis --- # Roadmap ## Why return to regression? .hi[Motivation 1]
We have new tools. It might help to first apply them in a .b[familiar] setting. -- .hi[Motivation 2]
We have new tools. Maybe linear regression will be (even) .b[better now?] -- .hi[Motivation 3] > many fancy statistical learning approaches can be seen as .b[generalizations or extensions of linear regression]. .pad-left[.grey-light.it[Source: ISL, p. 59; emphasis added]] --- layout: true # Linear regression --- class: inverse, middle --- name: regression-intro ## Regression regression .note[Recall] Linear regression "fits" coefficients $\color{#e64173}{\beta}_0,\, \ldots,\, \color{#e64173}{\beta}_p$ for a model $$ \begin{align} \color{#FFA500}{y}_i = \color{#e64173}{\beta}_0 + \color{#e64173}{\beta}_1 x_{1,i} + \color{#e64173}{\beta}_2 x_{2,i} + \cdots + \color{#e64173}{\beta}_p x_{p,i} + \varepsilon_i \end{align} $$ and is often applied in two distinct settings with fairly distinct goals: 1. .hi[Causal inference] estimates and interprets .pink[the coefficients]. 1. .hi-orange[Prediction] focuses on accurately estimating .orange[outcomes]. Regardless of the goal, the way we "fit" (estimate) the model is the same. --- ## Fitting the regression line As is the case with many statistical learning methods, regression focuses on minimizing some measure of loss/error. $$ \begin{align} e_i = \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \end{align} $$ -- Linear regression uses the L.sub[2] loss function—also called .it[residual sum of squares] (RSS) or .it[sum of squared errors] (SSE) $$ \begin{align} \text{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^n e_i^2 \end{align} $$ Specifically: OLS chooses the $\color{#e64173}{\hat{\beta}_j}$ that .hi[minimize RSS]. --- name: performance ## Performance There's a large variety of ways to assess the fit.super[.pink[†]] of linear-regression models. .footnote[ .pink[†] or predictive performance ] .hi[Residual standard error] (.hi[RSE]) $$ \begin{align} \text{RSE}=\sqrt{\dfrac{1}{n-p-1}\text{RSS}}=\sqrt{\dfrac{1}{n-p-1}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \end{align} $$ .hi[R-squared] (.hi[R.super[2]]) $$ \begin{align} R^2 = \dfrac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \dfrac{\text{RSS}}{\text{TSS}} \quad \text{where} \quad \text{TSS} = \sum_{i=1}^{n} \left( y_i - \overline{y} \right)^2 \end{align} $$ --- ## Performance and overfit As we've seen throughout the course, we need to be careful .b[not to overfit]. -- .hi[R.super[2]] provides no protection against overfitting—and actually encourages it. $$ \begin{align} R^2 = 1 - \dfrac{\text{RSS}}{\text{TSS}} \end{align} $$ .attn[Add a new variable:] RSS $\downarrow$ and TSS is unchanged. Thus, R.super[2] increases. -- .hi[RSE] .it[slightly] penalizes additional variables: $$ \begin{align} \text{RSE}=\sqrt{\dfrac{1}{n-p-1}\text{RSS}} \end{align} $$ .attn[Add a new variable:] RSS $\downarrow$ but $p$ increases. Thus, RSE's change is uncertain. 
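---
## Performance: a quick check in R

An aside (this chunk is an addition, not part of the original notes): a minimal sketch that computes RSS, RSE, and R.super[2] by hand and checks them against `lm()` output, using the built-in `mtcars` data.

```{R, performance-by-hand, echo = T}
# Fit a small linear model on built-in data (p = 2 predictors)
fit = lm(mpg ~ wt + hp, data = mtcars)
n = nrow(mtcars); p = 2
# Residuals, RSS, and TSS
e = residuals(fit)
rss = sum(e^2)
tss = sum((mtcars$mpg - mean(mtcars$mpg))^2)
# RSE and R2 "by hand"...
c(rse = sqrt(rss / (n - p - 1)), r2 = 1 - rss / tss)
# ...match what summary() reports
c(rse = summary(fit)$sigma, r2 = summary(fit)$r.squared)
```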
--- exclude: true ```{R, overfit-data-gen, eval = F} library(pacman) p_load(parallel, stringr, data.table, magrittr, here) # Set parameters set.seed(123) N = 2e3 n = 500 p = n - 1 # Generate data X = matrix(data = rnorm(n = N*p), ncol = p) β = matrix(data = rnorm(p, sd = 0.005), ncol = 1) y = X %*% β + matrix(rnorm(N, sd = 0.01), ncol = 1) # Create a data table pop_dt = X %>% data.matrix() %>% as.data.table() setnames(pop_dt, paste0("x", str_pad(1:p, 4, "left", 0))) pop_dt[, y := y %>% unlist()] # Subset sub_dt = pop_dt[1:n,] out_dt = pop_dt[(n+1):N,] Nn = N - n # For j in 1 to p: fit a model, record R2 and RSE fit_dt = mclapply(X = seq(1, p, by = 5), mc.cores = 12, FUN = function(j) { # Fit a model with the first j variables lm_j = lm(y ~ ., data = sub_dt[, c(1:j,p+1), with = F]) # Unseen data performance y_hat = predict(lm_j, newdata = out_dt[, c(1:j,p+1), with = F]) out_rss = sum((out_dt[,y] - y_hat)^2) out_tss = sum((out_dt[,y] - mean(out_dt[,y]))^2) # Return data table data.table( p = j, in_rse = summary(lm_j)$sigma, in_r2 = summary(lm_j)$r.squared, in_r2_adj = summary(lm_j)$adj.r.squared, out_rse = sqrt(1 / (Nn - j - 1) * out_rss), out_r2 = 1 - out_rss/out_tss, out_r2_adj = 1 - ((out_rss) / (Nn - j - 1)) / ((out_tss) / (Nn-1)) ) }) %>% rbindlist() # Save results saveRDS( object = fit_dt, file = here("other-files", "overfit-data.rds") ) ``` ```{R, overfit-data-load} # Load the data fit_dt = here("other-files", "overfit-data.rds") %>% read_rds() ``` --- layout: true class: clear --- name: overfit class: middle ## Example Let's see how .b[R.super[2]] and .b[RSE] perform with 500 very weak predictors. To address overfitting, we can compare .b[in-] .it[vs.] .hi[out-of-sample] performance. --- .b[In-sample R.super[2]] mechanically increases as we add predictors.
.white[.b[Out-of-sample R.super[2]] does not.] ```{R, overfit-plot-r2, echo = F, fig.height = 6} ggplot(data = fit_dt, aes(x = p, y = in_r2)) + geom_hline(yintercept = 0) + geom_line() + geom_point() + geom_line(aes(y = out_r2), color = NA) + geom_point(aes(y = out_r2), color = NA) + scale_y_continuous(expression(R^2)) + scale_x_continuous("Number of predictors", labels = comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- .b[In-sample R.super[2]] mechanically increases as we add predictors.
.hi[Out-of-sample R.super[2]] does not. ```{R, overfit-plot-r2-both, echo = F, fig.height = 6} ggplot(data = fit_dt, aes(x = p, y = in_r2)) + geom_hline(yintercept = 0) + geom_line() + geom_point() + geom_line(aes(y = out_r2), color = red_pink) + geom_point(aes(y = out_r2), color = red_pink) + scale_y_continuous(expression(R^2)) + scale_x_continuous("Number of predictors", labels = comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- class: middle What about RSE? Does its penalty .it[help]? --- Despite its penalty for adding variables, .b[in-sample RSE] still can overfit,
.white[as evidenced by .b[out-of-sample RSE].] ```{R, plot-overfit-rse, echo = F, fig.height = 6} ggplot(data = fit_dt, aes(x = p, y = in_rse)) + geom_hline(yintercept = 0) + geom_line() + geom_point() + geom_line(aes(y = out_rse), color = NA) + geom_point(aes(y = out_rse), color = NA) + scale_y_continuous("RSE") + scale_x_continuous("Number of predictors", labels = comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- Despite its penalty for adding variables, .b[in-sample RSE] still can overfit,
as evidenced by .hi[out-of-sample RSE]. ```{R, plot-overfit-rse-both, echo = F, fig.height = 6} ggplot(data = fit_dt, aes(x = p, y = in_rse)) + geom_hline(yintercept = 0) + geom_line() + geom_point() + geom_line(aes(y = out_rse), color = red_pink) + geom_point(aes(y = out_rse), color = red_pink) + scale_y_continuous("RSE") + scale_x_continuous("Number of predictors", labels = comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- layout: false # Linear regression ## Penalization RSE is not the only way to penalize the addition of variables..super[.pink[†]] .footnote[ .pink[†] We'll talk about other penalization methods (LASSO and Ridge) shortly. ] .b[Adjusted R.super[2]] is another .it[classic] solution. $$ \begin{align} \text{Adjusted }R^2 = 1 - \dfrac{\text{RSS}\color{#6A5ACD}{/(n - p - 1)}}{\text{TSS}\color{#6A5ACD}{/(n-1)}} \end{align} $$ Adj. R.super[2] attempts to "fix" R.super[2] by .hi-purple[adding a penalty for the number of variables]. -- - $\text{RSS}$ always decreases when a new variable is added. -- - $\color{#6A5ACD}{\text{RSS}/(n-p-1)}$ may increase or decrease with a new variable. --- layout: true class: clear --- However, .b[in-sample adjusted R.super[2]] still can overfit.
.white[Illustrated by .b[out-of-sample adjusted R.super[2]].] ```{R, plot-overfit-adjr2, echo = F, fig.height = 6} ggplot(data = fit_dt, aes(x = p, y = in_r2_adj)) + geom_hline(yintercept = 0) + geom_line() + geom_point() + geom_line(aes(y = out_r2_adj), color = NA) + geom_point(aes(y = out_r2_adj), color = NA) + scale_y_continuous(expression(Adjusted~R^2)) + scale_x_continuous("Number of predictors", labels = comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- However, .b[in-sample adjusted R.super[2]] still can overfit.
Illustrated by .hi[out-of-sample adjusted R.super[2]]. ```{R, plot-overfit-adjr2-both, echo = F, fig.height = 6} ggplot(data = fit_dt, aes(x = p, y = in_r2_adj)) + geom_hline(yintercept = 0) + geom_line() + geom_point() + geom_line(aes(y = out_r2_adj), color = red_pink) + geom_point(aes(y = out_r2_adj), color = red_pink) + scale_y_continuous(expression(Adjusted~R^2)) + scale_x_continuous("Number of predictors", labels = comma) + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- layout: true # Model selection --- name: better ## A better way? R.super[2], adjusted R.super[2], and RSE each offer some flavor of model fit, but they appear .b[limited in their abilities to prevent overfitting]. -- We want a method to optimally select a (linear) model—balancing variance and bias and avoiding overfit. -- We'll discuss two (related) methods today: 1. .hi[Subset selection] chooses a (sub)set of our $p$ potential predictors 1. .hi[Shrinkage] fits a model using all $p$ variables but "shrinks" its coefficients --- name: subset-selection ## Subset selection In .attn[subset selection], we 1. whittle down the $p$ potential predictors (using some magic/algorithm) 1. estimate the chosen linear model using OLS -- How do we do the *whittling* (selection)? -- We've got .b[options]. - .attn[Best subset selection] fits a model for every possible subset. - .attn[Forward stepwise selection] starts with only an intercept and tries to build up to the best model (using some fit criterion). - .attn[Backward stepwise selection] starts with all $p$ variables and tries to drop variables until it hits the best model (using some fit criterion). - .attn[Hybrid approaches] are what their name implies (_i.e._, hybrids). --- name: best-subset ## Best subset selection .attn[Best subset selection] is based upon a simple idea: Estimate a model for every possible subset of variables; then compare their performances. -- .qa[Q] So what's the problem? (Why do we need other selection methods?) --
.qa[A] "a model for .hi[every possible subset]" can mean .hi[a lot] $\left( 2^p \right)$ of models. -- .note[E.g.,] - 10 predictors $\rightarrow$ 1,024 models to fit - 25 predictors $\rightarrow$ >33.5 million models to fit - 100 predictors $\rightarrow$ about $1.3 \times 10^{30}$ models to fit -- Even with plentiful, cheap computational power, we can run into barriers. --- ## Best subset selection Computational constraints aside, we can implement .attn[best subset selection] as .mono-small[ 1. Define $\mathcal{M}_0$ as the model with no predictors. 1. For $k$ in 1 to $p$: - Fit every possible model with $k$ predictors. - Define $\mathcal{M}_k$ as the "best" model with $k$ predictors. 1. Select the "best" model from $\mathcal{M}_0,\,\ldots,\,\mathcal{M}_p$. ] As we've seen, RSS declines (and R.super[2] increases) with the number of predictors, so we should use a cross-validated measure of model performance in step .mono[3]..super[.pink[†]] .footnote[ .pink[†] Back to our distinction between test .it[vs.] training performance. ] --- ## Example dataset: `Credit` We're going to use the `Credit` dataset from .it[ISL]'s R package `ISLR`. -- ```{R, credit-data, echo = F} # Print it datatable( ISLR::Credit, height = '40%', rownames = F, options = list(dom = 't') ) ``` The `Credit` dataset has `r nrow(ISLR::Credit) %>% comma()` observations on `r ncol(ISLR::Credit)` variables. --- ## Example dataset: `Credit` ```{R, credit-data-work, include = F, cache = T} # The Credit dataset credit_dt = ISLR::Credit %>% clean_names() %T>% setDT() # Clean variables credit_dt[, `:=`( i_female = 1 * (gender == "Female"), i_student = 1 * (student == "Yes"), i_married = 1 * (married == "Yes"), i_asian = 1 * (ethnicity == "Asian"), i_african_american = 1 * (ethnicity == "African American") )] # Drop unwanted variables credit_dt[, `:=`(id = NULL, gender = NULL, student = NULL, married = NULL, ethnicity = NULL)] # Rearrange credit_dt = cbind(credit_dt[,-"balance"], credit_dt[,"balance"]) ``` We need to pre-process the dataset before we can select a model... ```{R, credit-data-cleaned, echo = F} # Print it datatable( credit_dt, height = '40%', rownames = F, options = list(dom = 't') ) ``` Now the dataset has `r credit_dt %>% nrow() %>% comma()` observations on `r ncol(credit_dt)` variables (2,048 subsets).
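---
## Best subset selection in R

The next two figures fit every candidate model directly with `lm()`. An aside (this chunk is an addition for illustration; it assumes the `leaps` package is installed, which `caret`'s `leapForward`/`leapBackward` methods below also rely on): the same exhaustive search can be run with `regsubsets()`.

```{R, bss-leaps-sketch, echo = T, eval = F}
library(leaps)
# Exhaustive search over all subsets of the 11 predictors of 'balance'
bss_leaps = regsubsets(
  balance ~ ., data = credit_dt, nvmax = 11, method = "exhaustive"
)
bss_sum = summary(bss_leaps)
# In-sample fit of the best model at each size (choosing the size still calls for CV)
data.frame(size = 1:11, rss = bss_sum$rss, adj_r2 = bss_sum$adjr2)
# Variables included in the best three-predictor model
bss_sum$which[3, ]
```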
--- exclude: true ```{R, bss-data, include = F, cache = T} # Find all possible subsets of the variables var_dt = expand_grid( income = c(0,1), limit = c(0,1), rating = c(0,1), cards = c(0,1), age = c(0,1), education = c(0,1), i_female = c(0,1), i_student = c(0,1), i_married = c(0,1), i_asian = c(0,1), i_african_american = c(0,1) ) %T>% setDT() # Fit each model bss_dt = lapply(X = 2:nrow(var_dt), FUN = function(i) { # Find the variables vars_i = which(var_dt[i,] %>% equals(T)) # Estimate the model model_i = lm( balance ~ ., data = credit_dt[, c(vars_i, ncol(credit_dt)), with = F] ) # Output summaries data.table( n_variables = length(vars_i), r2 = summary(model_i)$r.squared, rmse = mean(model_i$residuals^2) %>% sqrt() ) }) %>% rbindlist() ``` --- layout: false class: clear, middle ```{R, bss-graph-rmse, echo = F} ggplot(data = bss_dt, aes(x = n_variables, y = rmse)) + geom_point(alpha = 0.5, size = 2.5) + geom_line( data = bss_dt[, .(rmse = min(rmse)), by = n_variables], color = red_pink, size = 1, alpha = 0.8 ) + scale_y_continuous("RMSE") + scale_x_continuous("Number of predictors") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- layout: false class: clear, middle ```{R, bss-graph-r2, echo = F} ggplot(data = bss_dt, aes(x = n_variables, y = r2)) + geom_point(alpha = 0.5, size = 2.5) + geom_line( data = bss_dt[, .(r2 = max(r2)), by = n_variables], color = red_pink, size = 1, alpha = 0.8 ) + scale_y_continuous(expression(R^2)) + scale_x_continuous("Number of predictors") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- layout: true # Model selection --- ## Best subset selection From here, you would 1. Estimate cross-validated error for each $\mathcal{M}_k$. 1. Choose the $\mathcal{M}_k$ that minimizes the CV error. 1. Train the chosen model on the full dataset. --- ## Best subset selection .b[Warnings] - Computationally intensive - Selected models may not be "right" (_e.g._, keeping a squared term without its linear term) - You need to protect against overfitting when choosing across $\mathcal{M}_k$ - Also should worry about overfitting when $p$ is "big" - Dependent upon the variables you provide .b[Benefits] - Comprehensive search across provided variables - Resulting model—when estimated with OLS—has OLS properties - Can be applied to other (non-OLS) estimators --- name: stepwise ## Stepwise selection .attn[Stepwise selection] provides a less computationally intensive alternative to best subset selection. The basic idea behind .attn[stepwise selection]: .mono-small[ 1. Start with an arbitrary model. 1. Try to find a "better" model by adding/removing variables. 1. Repeat. 1. Stop when you have the best model. (Or choose the best model.) ] -- The two most-common varieties of stepwise selection: - .attn[Forward] starts with only an intercept $\left( \mathcal{M}_0 \right)$ and adds variables - .attn[Backward] starts with all variables $\left( \mathcal{M}_p \right)$ and removes variables --- name: forward ## Forward stepwise selection The process... .mono-small[ 1. Start with a model with only an intercept (no predictors), $\mathcal{M}_0$. 1. For $k=0,\,\ldots,\,p-1$: - Estimate a model for each of the remaining $p-k$ predictors, separately adding the predictors to model $\mathcal{M}_k$. - Define $\mathcal{M}_{k+1}$ as the "best" model of the $p-k$ models. 1. Select the "best" model from $\mathcal{M}_0,\,\ldots,\, \mathcal{M}_p$. ] -- What do we mean by "best"?
 .mono-small[2:] .it[best] is often RSS or R.super[2].
 .mono-small[3:] .it[best] should be a cross-validated fit criterion. --- layout: false class: clear .hi-purple[Forward stepwise selection] with `caret` in R ```{R, forward-train, echo = T} train_forward = train( y = credit_dt[["balance"]], x = credit_dt %>% dplyr::select(-balance), trControl = trainControl(method = "cv", number = 5), method = "leapForward", tuneGrid = expand.grid(nvmax = 1:11) ) ``` -- .col-left[ ```{R, forward-train-results, echo = F} datatable( train_forward$results[,1:4], height = '50%', width = '100%', rownames = F, options = list(dom = 't'), colnames = c("N vars", "RMSE", "R2", "MAE") ) %>% formatRound(columns = 2:4, digits = c(2, 3, 1)) ``` ] .col-right[ ```{R, forward-train-plot, echo = F, fig.height = 9, out.width = '100%'} ggplot(data = train_forward$results, aes(x = nvmax, y = RMSE)) + geom_line(color = purple, size = 1) + geom_point(color = purple, size = 4.5) + scale_y_continuous("RMSE") + scale_x_continuous("Number of variables") + theme_minimal(base_size = 32, base_family = "Fira Sans Book") ``` ] --- name: backward # Model selection ## Backward stepwise selection The process for .attn[backward stepwise selection] is quite similar... -- .mono-small[ 1. Start with a model that includes all $p$ predictors: $\mathcal{M}_p$. 1. For $k=p,\, p-1,\, \ldots,\,1$: - Estimate $k$ models, where each model removes exactly one of the $k$ predictors from $\mathcal{M}_k$. - Define $\mathcal{M}_{k-1}$ as the "best" of the $k$ models. 1. Select the "best" model from $\mathcal{M}_0,\,\ldots,\, \mathcal{M}_p$. ] -- What do we mean by "best"?
 .mono-small[2:] .it[best] is often RSS or R.super[2].
 .mono-small[3:] .it[best] should be a cross-validated fit criterion. --- layout: false class: clear .hi-pink[Backward stepwise selection] with `caret` in R ```{R, backward-train, echo = T} train_backward = train( y = credit_dt[["balance"]], x = credit_dt %>% dplyr::select(-balance), trControl = trainControl(method = "cv", number = 5), method = "leapBackward", tuneGrid = expand.grid(nvmax = 1:11) ) ``` -- .col-left[ ```{R, backward-train-results, echo = F} datatable( train_backward$results[,1:4], height = '50%', width = '100%', rownames = F, options = list(dom = 't'), colnames = c("N vars", "RMSE", "R2", "MAE") ) %>% formatRound(columns = 2:4, digits = c(2, 3, 1)) ``` ] .col-right[ ```{R, backward-train-plot, echo = F, fig.height = 9, out.width = '100%'} ggplot(data = train_backward$results, aes(x = nvmax, y = RMSE)) + geom_line(data = train_forward$results, color = purple, size = 1, alpha = 0.3) + geom_point(data = train_forward$results, color = purple, size = 4.5, alpha = 0.3) + geom_line(color = red_pink, size = 1) + geom_point(color = red_pink, size = 4.5) + scale_y_continuous("RMSE") + scale_x_continuous("Number of variables") + theme_minimal(base_size = 32, base_family = "Fira Sans Book") ``` ] --- class: clear, middle .note[Note:] .hi-purple[forward] and .hi-pink[backward] step. selection can choose different models. ```{R, stepwise-plot-zoom, echo = F} ggplot(data = train_backward$results, aes(x = nvmax, y = RMSE)) + geom_line(data = train_forward$results, color = purple, alpha = 0.3, size = 0.9) + geom_point(data = train_forward$results, color = purple, size = 3.5, alpha = 0.3) + geom_line(color = red_pink, size = 0.8) + geom_point(color = red_pink, size = 4.5, shape = 1) + scale_y_continuous("RMSE") + scale_x_continuous("Number of variables") + theme_minimal(base_size = 20, base_family = "Fira Sans Book") ``` --- name: stepwise-notes # Model selection ## Stepwise selection Notes on stepwise selection - .b[Less computationally intensive] (relative to best subset selection) - With $p=20$, BSS fits 1,048,576 models. - With $p=20$, forward/backward selection fits 211 models. - There is .b[no guarantee] that stepwise selection finds the best model. - .b.it[Best] is defined by your fit criterion (as always). - Again, .b[cross validation is key] to avoiding overfitting. --- name: criteria # Model selection ## Criteria Which model you choose is a function of .b[how you define "best"]. -- And we have many options... -- We've seen RSS, (R)MSE, RSE, MAE, R.super[2], Adj. R.super[2]. -- Of course, there's more. Each .hi-purple[penalizes] the number of predictors $d$ differently. $$ \begin{align} C_p &= \frac{1}{n} \left( \text{RSS} + \color{#6A5ACD}{2 d \hat{\sigma}^2 }\right) \\[1ex] \text{AIC} &= \frac{1}{n\hat{\sigma}^2} \left( \text{RSS} + \color{#6A5ACD}{2 d \hat{\sigma}^2 }\right) \\[1ex] \text{BIC} &= \frac{1}{n\hat{\sigma}^2} \left( \text{RSS} + \color{#6A5ACD}{\log(n) d \hat{\sigma}^2 }\right) \end{align} $$ --- # Model selection ## Criteria > $C_p$, $AIC$, and $BIC$ all have rigorous theoretical justifications... the adjusted $R^2$ is not as well motivated in statistical theory .grey-light.it[ISL, p. 213] In general, we will stick with cross-validated criteria, but you still need to choose a selection criterion. --- name: sources layout: false # Sources These notes draw upon - [An Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) (*ISL*)
James, Witten, Hastie, and Tibshirani --- # Table of contents .col-left[ .smallest[ #### Admin - [Today](#admin-today) - [Upcoming](#admin-soon) - [Roadmap](#admin-roadmap) #### Linear regression - [Regression regression](#regression-intro) - [Performance](#performance) - [Overfit](#overfit) #### Model selection - [A better way?](#better) - [Subset selection](#subset-selection) - [Best subset selection](#best-subset) - [Stepwise selection](#stepwise) - [Forward](#forward) - [Backward](#backward) - [Notes](#stepwise-notes) - [Criteria](#criteria) ] ] .col-right[ .smallest[ #### Other - [Sources/references](#sources) ] ]