class: center, middle, inverse, title-slide

.title[
# Topic 14: Prediction Methods
]
.subtitle[
## Part 1: Shrinkage Methods
]
.author[
### Nick Hagerty
ECNS 460/560
Montana State University
]
.date[
###
.smallest[*Adapted from
“Prediction and machine-learning in econometrics”
by Ed Rubin, used with permission. These slides are excluded from this resource’s overall CC license.]
]

---
exclude: true

---
# Table of contents

1. [Shrinkage methods](#intro)
1. [Ridge regression](#ridge)
1. [Lasso (and elasticnet)](#lasso)

---
layout: true
# Shrinkage methods

---
name: intro
## Intro

Linear regression is great for causal inference.
* Best linear approximation to the CEF.
* Best linear unbiased estimator (BLUE) (under certain assumptions).
* Can give us unbiased treatment effects (under certain assumptions).

--

Is it good for prediction too?
* More advanced, nonlinear methods sometimes work better.
* But linear regression can still work well even on fairly complex problems.

---

To use regression effectively in prediction, we have to deal with two issues:

**Model selection.** How do we choose the `\(X_i\)`'s?
- Which variables do you put in your regression?
- How do you parameterize/transform/interact them?
- What if you have more variables than observations?
- More variables improve fit, but how do you avoid overfitting?

**Variance reduction.** Can we improve predictions with some tweaks?
* OLS predictions tend to have low bias and relatively low variance.
* But if we accept a little more bias, we can get big reductions in variance, and therefore better predictions (lower MSE).

---
## Shrinkage methods

- fit a model that contains .pink[all] `\(\color{#e64173}{p}\)` .pink[predictors]
- simultaneously: .pink[**shrink** coefficients] toward zero
- Shrink: constrain, regularize.

.note[Idea:] Penalize the model for coefficients as they move away from zero.

---
## Intuition

When we estimate a large, imprecise coefficient, should we just accept it?
- Is the pattern likely to come up again, or idiosyncratic to this sample?
- Maybe we should demand stronger evidence for larger coefficients.

<img src="14a-Shrinkage_files/figure-html/jabba-1.svg" style="display: block; margin: auto;" />

---
name: shrinkage-why
## Overall idea

- Shrinking our coefficients toward zero .hi[reduces the model's variance]..super[.pink[†]]
- .hi[Penalizing] our model for .hi[larger coefficients] shrinks them toward zero.
- The .hi[optimal penalty] will balance reduced variance with increased bias.

.footnote[
.pink[†] Imagine the extreme case: a model whose coefficients are all zeros has no variance.
]

--

This is the basic idea behind all shrinkage methods.
We'll cover 3:
- .attn[Ridge regression]
- .attn[Lasso]
- .attn[Elasticnet]
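---
## A quick illustration

The gains from shrinkage are easiest to see in a simulation. Below is a minimal sketch in R using the `glmnet` package and made-up data (50 predictors, only 5 of which matter), comparing out-of-sample MSE for plain OLS and a ridge-type penalized fit that previews the methods defined next. All object names here are hypothetical.

```r
library(glmnet)

set.seed(123)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))       # only 5 predictors truly matter
y <- drop(x %*% beta) + rnorm(n, sd = 3)

train <- 1:100                            # half to fit, half to test
fit_ols   <- lm(y[train] ~ x[train, ])
fit_ridge <- cv.glmnet(x[train, ], y[train], alpha = 0)  # penalty chosen by CV

mse <- function(pred, truth) mean((truth - pred)^2)
mse(cbind(1, x[-train, ]) %*% coef(fit_ols), y[-train])                   # OLS test MSE
mse(predict(fit_ridge, newx = x[-train, ], s = "lambda.min"), y[-train])  # ridge test MSE
```

In settings like this, with many weak or irrelevant predictors, the penalized fit will often (though not always) deliver the lower test MSE.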
---
layout: true
# Ridge regression

---
class: inverse, middle

---
name: ridge
## Back to least squares

.note[Recall] Least-squares regression gets `\(\hat{\beta}_j\)`'s by minimizing RSS, _i.e._,

$$
`\begin{align}
  \min_{\hat{\beta}} \text{RSS} = \min_{\hat{\beta}} \sum_{i=1}^{n} e_i^2 = \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\underbrace{\left[ \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_p x_{i,p} \right]}_{=\hat{y}_i}} \bigg)^2
\end{align}`
$$

--

.attn[Ridge regression] makes a small change
- .pink[adds a shrinkage penalty] = the sum of squared coefficients `\(\left( \color{#e64173}{\lambda\sum_{j}\beta_j^2} \right)\)`
- .pink[minimizes] the (weighted) sum of .pink[RSS and the shrinkage penalty]

--

$$
`\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}`
$$

---
name: ridge-penalization

.col-left[
.hi[Ridge regression]
$$
`\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}`
$$
]

.col-right[
.b[Least squares]
$$
`\begin{align}
  \min_{\hat{\beta}} \sum_{i=1}^{n} \bigg( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \bigg)^2
\end{align}`
$$
]

<br><br><br><br>

`\(\color{#e64173}{\lambda}\enspace (\geq0)\)` is a tuning parameter for the harshness of the penalty.
<br> `\(\color{#e64173}{\lambda} = 0\)` implies no penalty: we are back to least squares.

--

<br> Each value of `\(\color{#e64173}{\lambda}\)` produces a new set of coefficients.

--

Ridge's approach to the bias-variance tradeoff: Balance
- reducing .b[RSS], _i.e._, `\(\sum_i\left( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \right)^2\)`
- reducing .b[coefficients] .grey-light[(ignoring the intercept)]

`\(\color{#e64173}{\lambda}\)` determines how much ridge "cares about" these two quantities..super[.pink[†]]

.footnote[
.pink[†] With `\(\lambda=0\)`, least-squares regression only "cares about" RSS.
]

---
layout: false
class: clear, middle

.b[Ridge regression coefficients] for `\(\lambda\)` between 0.01 and 100,000

<img src="14a-Shrinkage_files/figure-html/plot-ridge-glmnet-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Ridge regression

---
## `\(\lambda\)` and penalization

Choosing a .it[good] value for `\(\lambda\)` is key.
- If `\(\lambda\)` is too small, then our model is essentially back to OLS.
- If `\(\lambda\)` is too large, then we shrink all of our coefficients too close to zero.

--

.qa[Q] So what do we do?

--

<br> .qa[A] Cross validate!

---
## Penalization

.note[Note] Because we sum the .b[squared] coefficients, we penalize increasing .it[big] coefficients much more than increasing .it[small] coefficients.

.ex[Example] For a value of `\(\beta\)`, we pay a penalty of `\(2 \lambda \beta\)` for a small increase..super[.pink[†]]

.footnote[
.pink[†] This quantity comes from taking the derivative of `\(\lambda \beta^2\)` with respect to `\(\beta\)`.
]

- At `\(\beta = 0\)`, the penalty for a small increase is `\(0\)`.
- At `\(\beta = 1\)`, the penalty for a small increase is `\(2\lambda\)`.
- At `\(\beta = 2\)`, the penalty for a small increase is `\(4\lambda\)`.
- At `\(\beta = 3\)`, the penalty for a small increase is `\(6\lambda\)`.
- At `\(\beta = 10\)`, the penalty for a small increase is `\(20\lambda\)`.

Now you see why we call it .it[shrinkage]: it encourages small coefficients.
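---
## Ridge in R

A rough sketch of how ridge regression might be run with the `glmnet` package (not the only option). It assumes a numeric predictor matrix `x` and an outcome vector `y`, both hypothetical; `alpha = 0` requests ridge, and `glmnet` standardizes predictors internally by default.

```r
library(glmnet)

lambdas   <- 10^seq(5, -2, length.out = 100)   # grid from 100,000 down to 0.01
fit_ridge <- glmnet(x, y, alpha = 0, lambda = lambdas)
plot(fit_ridge, xvar = "lambda")               # coefficient paths across the lambda grid

# Choose lambda by k-fold cross validation
cv_ridge <- cv.glmnet(x, y, alpha = 0, lambda = lambdas, nfolds = 10)
cv_ridge$lambda.min                            # lambda with the lowest CV error
coef(cv_ridge, s = "lambda.min")               # coefficients at that lambda
```

Each `\(\lambda\)` in the grid gives its own set of coefficients, which is what the coefficient-path figure above plots.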
---
name: standardization
## Penalization and standardization

.attn[Important] Predictors' .hi[units] can drastically .hi[affect ridge regression results].

.b[Why?]

--

Because `\(\mathbf{x}_j\)`'s units affect `\(\beta_j\)`, and ridge is very sensitive to `\(\beta_j\)`.

--

.ex[Example] Let `\(x_1\)` denote distance.

.b[Least-squares regression]
<br> If `\(x_1\)` is .it[meters] and `\(\beta_1 = 3\)`, then when `\(x_1\)` is .it[km], `\(\beta_1 = 3,000\)`.
<br> The scale/units of predictors do not affect least squares' estimates.

--

.hi[Ridge regression] pays a much larger penalty for `\(\beta_1=3,000\)` than `\(\beta_1=3\)`.
<br>You will not get the same (scaled) estimates when you change units.

--

.note[Solution] Standardize your variables, _i.e._, `x_stnd = (x - mean(x))/sd(x)`.

---
layout: true
# Lasso

---
class: inverse, middle

---
name: lasso
## Intro

.attn[Lasso] simply replaces ridge's .it[squared] coefficients with absolute values.

--

.hi[Ridge regression]
$$
`\begin{align}
  \min_{\hat{\beta}^R} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2}
\end{align}`
$$

.hi-grey[Lasso]
$$
`\begin{align}
  \min_{\hat{\beta}^L} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}`
$$

Everything else will be the same—except one aspect...

---
name: lasso-shrinkage
## Shrinkage

Unlike ridge, lasso's .it[marginal] penalty does not increase with the size of `\(\beta_j\)`: you always pay `\(\color{#8AA19E}{\lambda}\)` to increase `\(\big|\beta_j\big|\)` by one unit.

--

The only way to avoid lasso's penalty is to .hi[set coefficients to zero].

--

This feature has two .hi-slate[benefits]

1. Some coefficients will be .hi[set to zero]—we get "sparse" models.
1. Lasso can be used for subset/feature .hi[selection].

--

We will still need to carefully select `\(\color{#8AA19E}{\lambda}\)`.

---
layout: false
class: clear, middle

.b[Lasso coefficients] for `\(\lambda\)` between 0.01 and 100,000

<img src="14a-Shrinkage_files/figure-html/plot-lasso-glmnet-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Ridge or lasso?

---
name: or

So which shrinkage method should you choose?

--

.col-left.pink[
.b[Ridge regression]
<br>
<br>.b.orange[+] shrinks `\(\hat{\beta}_j\)` .it[near] 0
<br>.b.orange[-] many small `\(\hat\beta_j\)`
<br>.b.orange[-] doesn't work for selection
<br>.b.orange[-] difficult to interpret output
<br>.b.orange[+] better when all `\(\beta_j\neq\)` 0
<br>.b.orange[+] still works when `\(p>n\)` (more predictors than data points)
]

.col-right.purple[
.b[Lasso]
<br>
<br>.b.orange[+] shrinks `\(\hat{\beta}_j\)` to 0
<br>.b.orange[+] many `\(\hat\beta_j=\)` 0
<br>.b.orange[+] great for selection
<br>.b.orange[+] sparse models easier to interpret
<br>.b.orange[-] implicitly assumes some `\(\beta=\)` 0
<br>.b.orange[+] still works when `\(p>n\)` (though it selects at most `\(n\)` variables)
]

.left-full[
> [N]either ridge... nor the lasso will universally dominate the other. .ex[ISL, p. 224]
]
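---
## Seeing the difference

One way to see the selection contrast in practice: fit both with a cross-validated `\(\lambda\)` and count how many coefficients end up exactly zero. A minimal sketch, again assuming a hypothetical predictor matrix `x` and outcome vector `y`:

```r
library(glmnet)

cv_ridge <- cv.glmnet(x, y, alpha = 0)   # ridge
cv_lasso <- cv.glmnet(x, y, alpha = 1)   # lasso (glmnet's default)

# Coefficients at the CV-chosen penalty, dropping the intercept
b_ridge <- as.matrix(coef(cv_ridge, s = "lambda.min"))[-1, ]
b_lasso <- as.matrix(coef(cv_lasso, s = "lambda.min"))[-1, ]

sum(b_ridge == 0)   # typically 0: ridge only shrinks coefficients toward zero
sum(b_lasso == 0)   # often many: lasso sets some coefficients exactly to zero
```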
---
name: both
layout: false
# Ridge .it[and] lasso
## Why not both?

.hi-blue[Elasticnet] combines .pink[ridge regression] and .grey[lasso].

--

$$
`\begin{align}
  \min_{\hat{\beta}^E} \sum_{i=1}^{n} \big( \color{#FFA500}{y_i} - \color{#6A5ACD}{\hat{y}_i} \big)^2 + \color{#181485}{(1-\alpha)} \color{#e64173}{\lambda \sum_{j=1}^{p} \beta_j^2} + \color{#181485}{\alpha} \color{#8AA19E}{\lambda \sum_{j=1}^{p} \big|\beta_j\big|}
\end{align}`
$$

We now have two tuning parameters: `\(\lambda\)` (penalty) and `\(\color{#181485}{\alpha}\)` (mixture).

- `\(\color{#e64173}{\alpha = 0}\)` specifies ridge
- `\(\color{#8AA19E}{\alpha=1}\)` specifies lasso
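---
# Ridge .it[and] lasso
## Tuning `\(\alpha\)` and `\(\lambda\)`

`cv.glmnet()` cross-validates `\(\lambda\)` for a .it[fixed] `\(\alpha\)`, so one simple (not the only) way to tune elasticnet is to loop over a small grid of `\(\alpha\)` values and keep the combination with the lowest CV error. A sketch, assuming a hypothetical `x` and `y`; the shared `foldid` keeps folds comparable across values of `\(\alpha\)`:

```r
library(glmnet)

set.seed(123)
foldid <- sample(rep(1:10, length.out = nrow(x)))   # same folds for every alpha

alphas  <- seq(0, 1, by = 0.1)                      # 0 = ridge, ..., 1 = lasso
cv_fits <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a, foldid = foldid))

cv_errs <- sapply(cv_fits, function(f) min(f$cvm))  # best CV error for each alpha
best    <- which.min(cv_errs)
alphas[best]                                        # chosen mixture
cv_fits[[best]]$lambda.min                          # chosen penalty
```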