class: center, middle, inverse, title-slide .title[ # Why Regression? ] .subtitle[ ## EC 607, Set 03 ] .author[ ### Edward Rubin ] --- class: inverse, middle # Prologue --- name: schedule # Schedule ### Last time - The Experimental Ideal - Fundamentals of .mono[R] ### Today What's so great about linear regression and OLS? <br>.hi-slate[Read] *MHE* 3.1 ### Upcoming .hi-slate[Assignment].sub[1] Custom OLS function fun. <br>.hi-slate[Assignment].sub[2] First step of project proposal. --- layout: true # Regression --- class: inverse, middle --- name: why ## Why? In our previous discussion, we began moving from simple differences to a regression framework. -- .hi-slate[Q] Why do we<sup>.pink[†]</sup> care so much about linear regression and OLS? .footnote[.pink[†] *we* = empirically inclined applied economists] -- .hi-slate[A] As we discussed, regression allows us to control for covariates that *can* assist with (.slate[1]) causal identification and (.slate[2]) inference. -- There's a deeper reason that we care about *linear* regression and ordinary least squares (OLS): .hi-pink[*the conditional expectation function (CEF).*] --- ## Why? Even ignoring causality, we can show important relationships between 1. .hi-pink[the CEF] (the conditional expectation function), 2. the .hi-purple[population regression function], 3. and the .hi-slate[sampling distribution of regression estimates]. --- layout: true # Regression ## The *CEF* --- name: cef .hi-slate[Definition] The .hi[conditional expectation function] for a dependent variable `\(\text{Y}_{i}\)`, given a `\(\text{K}\times 1\)` vector of covariates `\(\text{X}_{i}\)`, tells us .pink[the expected value (population average) of] `\(\color{#e64173}{\text{Y}_{i}}\)` .pink[with] `\(\color{#e64173}{\text{X}_{i}}\)` .pink[held constant.] -- Written as `\(\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]\)`, the CEF is a function of `\(\text{X}_{i}\)`.<sup>.pink[†]</sup> .footnote[ .pink[†] We'll generally assume `\(\text{X}_{i}\)` is a random variable, which implies that `\(\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]\)` is also a random variable. ] -- .hi-slate[Examples] - `\(\mathop{E}\left[ \text{Income}_i \mid \text{Education}_i \right]\)` -- - `\(\mathop{E}\left[ \text{Wage}_i \mid \text{Gender}_i \right]\)` -- - `\(\mathop{E}\left[ \text{Birth weight}_i \mid \text{Air quality}_i \right]\)` --- Formally, for continuous `\(\text{Y}_{i}\)` with conditional density `\(f_y(t|\text{X}_{i}=x)\)`, $$ `\begin{align} \mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} = x \right] = \int t f_y(t|\text{X}_{i}=x)dt \end{align}` $$ -- and for discrete `\(\text{Y}_{i}\)` with conditional p.m.f. `\(\mathop{\text{Pr}}\left(\text{Y}_{i}=t|\text{X}_{i}=x\right)\)`, $$ `\begin{align} \mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i}=x \right] = \sum_t t \mathop{\text{Pr}}\left(\text{Y}_{i}=t|\text{X}_{i}=x\right) \end{align}` $$ .hi-slate[*Notice*] We are focusing on the .hi-pink[population]. -- We want to build our intuition about the parameters that we will eventually estimate. --- layout: false class: clear, middle Graphically... --- class: clear, center, middle name: graphically The conditional distributions of `\(\text{Y}_{i}\)` for `\(\text{X}_{i}=x\)` in 8, ..., 22. <img src="03-why-regression_files/figure-html/fig_cef_dist-1.svg" style="display: block; margin: auto;" /> --- class: clear, middle, center The CEF, `\(\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]\)`, connects these conditional distributions' means. 
<img src="03-why-regression_files/figure-html/fig_cef-1.svg" style="display: block; margin: auto;" /> --- class: clear, middle, center Focusing in on the CEF, `\(\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]\)`... <img src="03-why-regression_files/figure-html/fig_cef_only-1.svg" style="display: block; margin: auto;" /> --- class: clear, middle .hi-slate[Q] How does the CEF relate to/inform regression? --- layout: true # Regression ## The *CEF* --- name: lie As we derive the properties and relationships associated with the CEF, regression, and a host of other estimators, we will frequently rely upon<br>.hi-slate[*the Law of Iterated Expectations*] (LIE). -- $$ `\begin{align} \color{#6A5ACD}{\mathop{E}\left[ \text{Y}_{i} \right]} = \mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg) \end{align}` $$ -- which says that the .hi-purple[unconditional expectation] is equal to the .b[unconditional average] of the .hi-pink[conditional expectation function]. --- layout: false # Regression .hi-slate[A proof of the LIE] First, we need notation... Let `\(\mathop{f_{x,y}}(u,t)\)` denote the joint density for continuous RVs `\(\left( \text{X}_{i},\text{Y}_{i} \right)\)`. Let `\(\mathop{f_{y|x}}(t\mid \text{X}_{i}=u)\)` denote the conditional distribution of `\(\text{Y}_{i}\)` given `\(\text{X}_{i}=u\)`. And let `\(\mathop{g_y}(t)\)` and `\(\mathop{g_x}(u)\)` denote the marginal densities of `\(\text{Y}_{i}\)` and `\(\text{X}_{i}\)`. --- # Regression .hi-slate[A proof of the LIE] `\(\mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg)\)` -- <br> `\(= {\displaystyle\int} \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} = u \right]} \mathop{g_x}(u) du\)` -- <br> `\(={\displaystyle\int} \color{#e64173}{\left[{\displaystyle\int} t \mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right) dt\right]} \mathop{g_x}(u) du\)` -- <br> `\(={\displaystyle\int} {\displaystyle\int} \color{#e64173}{t \mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right)} \mathop{g_x}(u) du \, \color{#e64173}{dt}\)` -- <br> `\(={\displaystyle\int} \color{#e64173}{t} \left[ {\displaystyle\int} \color{#e64173}{\mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right)} \mathop{g_x}(u) du \right] \color{#e64173}{dt}\)` -- <br> `\(={\displaystyle\int} \color{#e64173}{t} \left[ {\displaystyle\int} \mathop{f_{x,y}}(u,t) du \right] \color{#e64173}{dt}\)` -- <br> `\(={\displaystyle\int} \color{#e64173}{t} \mathop{g_y(t)} \color{#e64173}{dt}\)` -- <br> `\(=\mathop{E}\left[ \text{Y}_{i} \right]\)` -- .bigger[🥳] --- layout: false class: clear, middle Great. What's the point? --- layout: true # Regression ## The *LIE* and the *CEF* --- name: decomposition .hi-slate[Theorem] The CEF decomposition property (3.1.1) The LIE allows us to **decompose random variables** into two pieces -- $$ `\begin{align} \text{Y}_{i} = \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} + \color{#6A5ACD}{\varepsilon_i} \end{align}` $$ 1. .hi-pink[the conditional expectation function] 2. .hi-purple[a residual] with special powers<sup>.pink[†]</sup> <br> i. `\(\color{#6A5ACD}{\varepsilon_i}\)` is mean independent of `\(\text{X}_{i}\)`, _i.e._, `\(\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] = 0\)`. <br> ii. `\(\color{#6A5ACD}{\varepsilon_i}\)` is uncorrelated with any function of `\(\text{X}_{i}\)`. .footnote[.pink[†] Angrist and Pischke go with *special properties*.] 
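--

A minimal .mono[R] sketch of these two properties (simulated, hypothetical data; within-`\(x\)` sample means stand in for the CEF):

```r
# Simulated data (hypothetical); within-x sample means serve as the CEF
set.seed(123)
x <- sample(8:22, size = 1e4, replace = TRUE)
y <- 2 + 0.5 * x + rnorm(1e4)
cef <- ave(y, x)       # sample CEF: mean of y at each value of x
e   <- y - cef         # the residual from the decomposition
tapply(e, x, mean)     # i.  mean independence: (essentially) zero at every x
cov(e, x^2)            # ii. uncorrelated with functions of x, e.g., h(x) = x^2
```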
-- .hi-orange[*Important*] It might not seem like much, but these results are .hi-orange[huge] for building intuition, theory, *and* application. -- Put a ⭐ here! --- .hi-slate[Proof] The CEF decomposition property (properties i. and ii. of `\(\color{#6A5ACD}{\varepsilon_i}\)`) -- .pull-left[ .hi-slate[Mean independence], `\(\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] = 0\)` $$ `\begin{align} &\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] \\[0.6em] &= \mathop{E}\!\bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg| \text{X}_{i} \bigg) \\[0.6em] &= \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg| \text{X}_{i} \bigg) \\[0.6em] &= \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \\[0.6em] &= 0 \end{align}` $$ ] -- .pull-right[ .hi-slate[Zero correlation] between `\(\color{#6A5ACD}{\varepsilon_i}\)` and `\(\mathop{h}\left( \text{X}_{i} \right)\)` $$ `\begin{align} &\mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \color{#6A5ACD}{\varepsilon_i}\right] \\[0.6em] &= \mathop{E}\!\bigg( \mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \color{#6A5ACD}{\varepsilon_i}\mid \text{X}_{i} \right]\bigg) \\[0.6em] &= \mathop{E}\!\bigg( \mathop{h}\left( \text{X}_{i} \right) \mathop{E}\left[\color{#6A5ACD}{\varepsilon_i}\mid \text{X}_{i} \right]\bigg) \\[0.6em] &= \mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \times 0\right] \\[0.6em] &= 0 \end{align}` $$ ] --- .hi-slate[The CEF decomposition property] <br>says that we can decompose any random variable (_e.g._, `\(\text{Y}_{i}\)`) into 1. a part that is .pink[explained by] `\(\color{#e64173}{\text{X}_{i}}\)` (_i.e._, the CEF `\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}\)`), 2. a part that is .purple[*orthogonal to*<sup>.purple[†]</sup> any function of] `\(\color{#6A5ACD}{\text{X}_{i}}\)` (_i.e._, `\(\color{#6A5ACD}{\varepsilon_i}\)`). .footnote[.purple[†] "orthogonal to" = "uncorrelated with"] -- .hi-slate[Why the CEF?] <br>The .pink[CEF] also presents an intuitive summary of the relationship between `\(\text{Y}_{i}\)` and `\(\text{X}_{i}\)`, since we often use means to characterize random variables. -- But (of course) there are more reasons to use the CEF... --- name: prediction .hi-slate[Theorem] The CEF prediction property (3.1.2) Let `\(\mathop{m}\left( \text{X}_{i} \right)\)` be *any* function of `\(\text{X}_{i}\)`. The CEF solves $$ `\begin{align} \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \underset{\mathop{m}\left( \text{X}_{i} \right)}{\text{arg min}}\enspace \mathop{E}\left[ \left( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \right)^2 \right] \end{align}` $$ In other words, the .hi-pink[CEF] is the minimum mean-squared error (MMSE) predictor of `\(\text{Y}_{i}\)` given `\(\text{X}_{i}\)`. -- .hi-slate[*Notice*] 1. We haven't restricted `\(m\)` to any class of functions—it can be nonlinear. 2. We're talking about *prediction* (specifically predicting `\(\text{Y}_{i}\)`). 
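---

A sketch of the prediction property in .mono[R] (simulated, hypothetical data): the sample CEF attains a weakly smaller mean squared error than any other function of `\(\text{X}_{i}\)` we try.

```r
# Simulated data (hypothetical); compare MSE across candidate predictors m(x)
set.seed(123)
x <- sample(8:22, size = 1e4, replace = TRUE)
y <- 2 + 0.5 * x + rnorm(1e4)
cef <- ave(y, x)                    # sample CEF: within-x means of y
mse <- function(m) mean((y - m)^2)  # sample analog of E[(Y - m(X))^2]
mse(cef)                            # the CEF
mse(mean(y))                        # m(x) = the unconditional mean
mse(1 + 0.6 * x)                    # an arbitrary linear guess
```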
--- layout: false class: clear .hi-slate[Proof] The CEF prediction property `\(\bigg( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \bigg)^2\)` .right10[.orange[(**1**)]] -- <br> `\(= \bigg( \big\{ \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \big\} + \big\{ \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{m}\left( \text{X}_{i} \right) \big\} \bigg)^2\)` -- <br> `\(= \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)^2\)` .right10[.turquoise[(**a**)]] <br> `\(+ 2 \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] - \mathop{m}\left( \text{X}_{i} \right)}\bigg)\times \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)\)` .right10[.turquoise[(**b**)]] <br> `\(+ \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{m}\left( \text{X}_{i} \right) \bigg)^2\)` .right10[.turquoise[(**c**)]] -- .hi-slate[Recall:] We want to choose the `\(\mathop{m}\left( \text{X}_{i} \right)\)` that minimizes .orange[(**1**)] in expectation. -- <br> .turquoise[(**a**)] is irrelevant, _i.e._, it does not depend upon `\(\mathop{m}\left( \text{X}_{i} \right)\)`. -- <br> .turquoise[(**b**)] equals zero in expectation: `\(\mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right)\times \color{#6A5ACD}{\varepsilon_i} \right] = 0\)`. -- <br> .turquoise[(**c**)] is minimized by `\(\mathop{m}\left( \text{X}_{i} \right) = \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}\)`, _i.e._, when `\(\mathop{m}\left( \text{X}_{i} \right)\)` is the .pink[CEF]. --- layout: true # Regression ## The *LIE* and the *CEF* --- ∴ the .pink[CEF] is the function that minimizes the mean-squared error (MSE) $$ `\begin{align} \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \underset{\mathop{m}\left( \text{X}_{i} \right)}{\text{arg min}}\enspace \mathop{E}\left[ \left( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \right)^2 \right] \end{align}` $$ --- One final property of the .pink[CEF] (very similar to the decomposition property) .hi-slate[Theorem] The ANOVA theorem (3.1.3) $$ `\begin{align} \mathop{\text{Var}} \left( \text{Y}_{i} \right) = \mathop{\text{Var}} \left( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right) + \mathop{E}\left[ \mathop{\text{Var}} \left( \text{Y}_{i} \mid \text{X}_{i} \right) \right] \end{align}` $$ which says that we can decompose the variance in `\(\text{Y}_{i}\)` into 1. the variance in the .pink[CEF] 2. the variance of the residual -- .hi-slate[*Example*] Decomposing wage variation into (.hi-slate[1]) variation explained by workers' characteristics and (.hi-slate[2]) unexplained (residual) variation -- The proof centers on the mean independence of the residual, which follows from the CEF decomposition property. --- layout: false class: clear, middle We now understand the CEF a bit better. <br>But how does the CEF actually relate to regression? --- layout: true # Regression ## The *CEF* and regression --- We've discussed how the .pink[CEF] summarizes empirical relationships. *Previously* we discussed how regression provides simple empirical insights. Let's link these two concepts. 
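--

Before we do, a quick numerical check of the ANOVA theorem (an .mono[R] sketch with simulated, hypothetical data; within-`\(x\)` sample means stand in for the CEF):

```r
# Simulated data (hypothetical); check Var(Y) = Var(CEF) + E[Var(Y|X)]
set.seed(123)
x <- sample(8:22, size = 1e4, replace = TRUE)
y <- 2 + 0.5 * x + rnorm(1e4)
cef <- ave(y, x)                        # sample CEF: within-x means of y
v <- function(z) mean((z - mean(z))^2)  # population-style variance
v(y)                                    # Var(Y)
v(cef) + mean((y - cef)^2)              # Var(CEF) + E[Var(Y|X)]: matches
```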
--- name: pop_ls .hi-slate[Population least-squares regression] We will focus on `\(\beta\)`, the vector (a `\(K\times 1\)` matrix) of population least-squares regression coefficients, _i.e._, $$ `\begin{align} \beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\left[ \left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \right] \end{align}` $$ where `\(b\)` and `\(\text{X}_{i}\)` are also `\(K\times 1\)`, and `\(\text{Y}_{i}\)` is a scalar. -- Taking the first-order condition gives $$ `\begin{align} \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'b \right) \right] = 0 \end{align}` $$ --- From the first-order condition $$ `\begin{align} \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'b \right) \right] = 0 \end{align}` $$ we can solve for `\(b\)`. We've defined the optimum as `\(\beta\)`. Thus, $$ `\begin{align} \beta = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \text{Y}_{i} \right] \end{align}` $$ -- .hi-slate[*Note*] The first-order conditions tell us that our least-squares population regression residuals `\(\left(e_i = \text{Y}_{i} - \text{X}_{i}'\beta \right)\)` are uncorrelated with `\(\text{X}_i\)`. --- layout: true # Regression ## Anatomy --- name: anatomy Our "new" result: `\(\beta = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \text{Y}_{i} \right]\)` In .hi-slate[simple linear regression] (an intercept and one regressor `\(x_i\)`), $$ `\begin{align} \beta_1 &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, x_i \right)}{\mathop{\text{Var}} \left( x_i \right)} & \beta_0 = \mathop{E}\left[ \text{Y}_{i} \right] - \beta_1 \mathop{E}\left[ x_i \right] \end{align}` $$ -- For .hi-slate[multivariate regression], the coefficient on the k.super[th] regressor `\(x_{ki}\)` is $$ `\begin{align} \beta_k &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)} \end{align}` $$ where `\(\widetilde{x}_{ki}\)` is the residual from a regression of `\(x_{ki}\)` on all other covariates. --- This alternative formulation of least-squares coefficients is quite powerful. $$ `\begin{align} \beta_k &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)} \end{align}` $$ -- .hi-slate[Why?] -- This expression illustrates how each coefficient in a least-squares regression represents the bivariate slope coefficient .pink[after controlling for the other covariates]. --- In fact, we can re-write our coefficients to further emphasize this point $$ `\begin{align} \beta_k &= \dfrac{\mathop{\text{Cov}} \left( \widetilde{\text{Y}}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)} \end{align}` $$ `\(\widetilde{\text{Y}}_{i}\)` denotes the residual from regressing `\(\text{Y}_{i}\)` on all regressors except `\(x_{ki}\)`. 
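---

An .mono[R] sketch of both results (simulated, hypothetical data and variable names): the sample analog of `\(\mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \text{Y}_{i} \right]\)` and the anatomy formula for the coefficient on `\(x_{1i}\)`.

```r
# Simulated data (hypothetical): y depends on correlated regressors x1 and x2
set.seed(123)
n  <- 1e4
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
# Sample analog of E[XX']^{-1} E[XY] (X includes a column of ones)
X <- cbind(1, x1, x2)
solve(crossprod(X), crossprod(X, y))
# Anatomy: residualize x1 on the other covariate(s), then take Cov/Var
x1_tilde <- resid(lm(x1 ~ x2))
cov(y, x1_tilde) / var(x1_tilde)  # matches the coefficient on x1...
coef(lm(y ~ x1 + x2))             # ...from the full regression
```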
--- layout: false class: clear, middle Graphical example --- class: clear, middle, center `\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i\)` <img src="03-why-regression_files/figure-html/fig_anatomy1-1.svg" style="display: block; margin: auto;" /> --- class: clear, middle, center count: false `\(\beta_1\)` gives the relationship between `\(y\)` and `\(x_1\)` *after controlling for* `\(x_2\)` <img src="03-why-regression_files/figure-html/fig_anatomy2-1.svg" style="display: block; margin: auto;" /> --- class: clear, middle, center count: false `\(\beta_1\)` gives the relationship between `\(y\)` and `\(x_1\)` *after controlling for* `\(x_2\)` <img src="03-why-regression_files/figure-html/fig_anatomy3-1.svg" style="display: block; margin: auto;" /> --- class: clear, middle, center count: false `\(\beta_1\)` gives the relationship between `\(y\)` and `\(x_1\)` *after controlling for* `\(x_2\)` <img src="03-why-regression_files/figure-html/fig_anatomy4-1.svg" style="display: block; margin: auto;" /> --- class: clear, middle, center count: false `\(\beta_1\)` gives the relationship between `\(y\)` and `\(x_1\)` *after controlling for* `\(x_2\)` <img src="03-why-regression_files/figure-html/fig_anatomy5-1.svg" style="display: block; margin: auto;" /> --- class: clear, middle, center count: false `\(\beta_1\)` gives the relationship between `\(y\)` and `\(x_1\)` *after controlling for* `\(x_2\)` <img src="03-why-regression_files/figure-html/fig_anatomy6-1.svg" style="display: block; margin: auto;" /> --- layout: false class: clear, middle Now that we've refreshed/deepened our regression knowledge, let's connect regression and the CEF. --- layout: true # Regression ## Regression and the *CEF* --- Angrist and Pischke make the case that > ... you should be interested in regression parameters if you are interested in the CEF. (*MHE*, p.36) .hi-slate[Q] What is the reasoning/connection? -- .hi-slate[A] We'll cover three reasons. 1. *If the CEF is linear*, then the population regression line is the CEF. 2. The function `\(\text{X}_{i}' \beta\)` is the min. MSE *linear* predictor of `\(\text{Y}_{i}\)` given `\(\text{X}_{i}\)`. 3. The function `\(\text{X}_{i}' \beta\)` gives the min. MSE *linear* approximation to the CEF. --- layout: true # Regression ## Regression and the *CEF* --- .hi-slate[Theorem] The linear CEF theorem (3.1.4) If the CEF is linear, then the population regression is the CEF. -- .hi-slate[Proof] Let the CEF equal some linear function, _i.e._, `\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \text{X}_{i}' \beta^\star\)`. From the CEF decomposition property, we know `\(\mathop{E}\left[ \text{X}_{i} \color{#6A5ACD}{\varepsilon_i} \right] = 0\)`. -- `\(\implies \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right) \right] = 0\)` -- `\(\implies \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'\beta^\star \right) \right] = 0\)` -- `\(\implies \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right] - \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \beta^\star \right] = 0\)` -- `\(\implies \beta^\star = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right]\)` -- `\(=\beta\)`, our population regression coefficients. --- >.hi-slate[Theorem] The linear CEF theorem (3.1.4) > >If the CEF is linear, then the population regression is the CEF. Linearity can be a strong assumption. When might we expect linearity? -- 1. 
Situations in which `\(\left( \text{Y}_{i},\, \text{X}_{i} \right)\)` follows a multivariate normal distribution. <br>.hi-slate[*Concern*] Might be limited—especially when `\(\text{Y}_{i}\)` or `\(\text{X}_{i}\)` are not continuous. -- 2. Saturated regression models <br>.hi-slate[*Example*] A model with two binary indicators and their interaction. --- .hi-slate[Theorem] The best linear predictor theorem (3.1.5) `\(\text{X}_{i}' \beta\)` is the best *linear* predictor of `\(\text{Y}_{i}\)` given `\(\text{X}_{i}\)` (minimizes MSE). -- .hi-slate[Proof] We defined `\(\beta\)` as the vector that minimizes MSE, _i.e._, $$ `\begin{align} \beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\left[ \left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \right] \end{align}` $$ so `\(\text{X}_{i}'\beta\)` is literally defined as the minimum MSE linear predictor of `\(\text{Y}_{i}\)`. -- - The population-regression function `\(\left(\text{X}_{i}'\beta\right)\)` is the best (min. MSE) *linear* predictor of `\(\text{Y}_{i}\)` given `\(\text{X}_{i}\)`. - The CEF `\(\left( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] \right)\)` is the best predictor (min. MSE) of `\(\text{Y}_{i}\)` given `\(\text{X}_{i}\)` across *all classes* of functions. --- .hi-slate[Q] If `\(\text{X}_{i}'\beta\)` is .hi[the best linear predictor] of `\(\text{Y}_{i}\)` given `\(\text{X}_{i}\)`, then why is there so much interest in machine learning for prediction (as opposed to regression)? -- .hi-slate[A] A few reasons 1. Relax *linearity* 2. Model selection - choosing `\(\text{X}_{i}\)` is not always obvious - overfitting is bad (bias-variance tradeoff) 3. It's fancy, shiny, and new 4. Some ML methods boil down to regression 5. Others? -- .hi-slate[Counter Q] Why are we (still) using regression? --- name: reg_cef_theorem .hi-slate[Theorem] The regression CEF theorem (3.1.6) The population regression function `\(\text{X}_{i}'\beta\)` provides the minimum MSE linear approximation to the CEF `\(\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]\)`, _i.e._, $$ `\begin{align} \beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\!\left\{ \bigg( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] - \text{X}_{i}'b \bigg)^2 \right\} \end{align}` $$ -- .hi-slate[*Put simply*] Regression gives us the *best* linear approximation to the CEF. 
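---

An .mono[R] sketch of this idea (simulated, hypothetical data with a deliberately nonlinear CEF): regressing `\(\text{Y}_{i}\)` on `\(\text{X}_{i}\)` and regressing the sample CEF on `\(\text{X}_{i}\)` return the same line.

```r
# Simulated data (hypothetical) with a nonlinear CEF
set.seed(123)
x <- sample(8:22, size = 1e4, replace = TRUE)
y <- 2 + 0.5 * x + 0.05 * (x - 15)^2 + rnorm(1e4)
cef <- ave(y, x)   # sample CEF: within-x means of y
coef(lm(y ~ x))    # OLS of y on x
coef(lm(cef ~ x))  # OLS of the CEF on x: same coefficients
```

That is, least squares only uses `\(\text{Y}_{i}\)` through its conditional means (given the distribution of `\(\text{X}_{i}\)`).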
--- layout: false class: clear .hi-slate[Proof] First, recall that, .orange[in expectation], `\(\beta\)` is the `\(b\)` that minimizes `\(\left( \text{Y}_{i} - \text{X}_{i}'b \right)^2\)` -- `\(\left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \color{#ffffff}{\bigg|}\)` .right10[.orange[(**1**)]] -- <br> `\(= \bigg( \left\{ \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right\} + \left\{ \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \right\} \bigg)^2\)` -- <br> `\(= \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)^2\)` .right10[.turquoise[(**a**)]] <br> `\(\color{#ffffff}{=}+\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \bigg)^2\)` .right10[.turquoise[(**b**)]] <br> `\(\color{#ffffff}{=}+ 2 \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg) \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \bigg)\)` .right10[.turquoise[(**c**)]] -- We want to minimize .turquoise[(**b**)], and we know `\(\beta\)` minimizes .orange[(**1**)]. -- <br> .turquoise[(**a**)] is irrelevant, _i.e._, it does not depend upon `\(b\)`. -- <br> .turquoise[(**c**)] can be written as `\(2 \color{#6A5ACD}{\varepsilon_i} \mathop{h}(\text{X}_{i})\)`, which equals zero in expectation. -- ∴ (In expectation) If `\(b=\beta\)` minimizes .orange[(**1**)], then `\(b=\beta\)` minimizes .turquoise[(**b**)]. --- layout: true # Regression ## Regression and the *CEF* --- Let's review our new(-ish) regression results 1. When the CEF is linear, the regression function *is* the CEF. <br>.hi-slate[Too small] Very specific circumstances—or big assumptions. 1. Regression gives us the best *linear* predictor of `\(\text{Y}_{i}\)` (given `\(\text{X}_{i}\)`) <br>.hi-slate[Off point] We're often interested in `\(\beta\)`—not `\(\widehat{\text{Y}}_{i}\)`. 1. Regression provides the best *linear* approximation of the CEF. <br>.hi-slate[Just right?] (Depends on your goals) --- Motivation (**3**) tends to be the most compelling. Even when the CEF is not linear, regression recovers the best linear approximation to the CEF. -- > The statement that .pink[regression approximates the CEF] lines up with our view of .purple[empirical work as an effort to describe the essential features of statistical relationships] without necessarily trying to pin them down exactly. (*MHE*, p.39, emphasis added) --- layout: false class: clear, middle Let's dig into this linear approximation to the CEF a little more... --- class: clear, middle, center Returning to our **CEF** <img src="03-why-regression_files/figure-html/fig_reg_cef-1.svg" style="display: block; margin: auto;" /> --- class: clear, center, middle Adding the population .hi-orange[regression function] <img src="03-why-regression_files/figure-html/fig_reg_cef2-1.svg" style="display: block; margin: auto;" /> --- layout: true # Regression ## Regression and the *CEF* --- name: wls As the previous figure suggests, one way to think about least-squares regression is .hi-slate[estimating a weighted regression on the CEF] rather than on the individual observations. -- .slate[.mono[TLDR]] Use `\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}\)` as the outcome, rather than `\(\text{Y}_{i}\)`, and properly weight. 
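--

For instance, a grouped-data sketch in .mono[R] (simulated, hypothetical data; the formal statement follows): weighted least squares on the conditional means reproduces the micro-data coefficients.

```r
# Simulated data (hypothetical); collapse to one row per value of x
set.seed(123)
x <- sample(8:22, size = 1e4, replace = TRUE)
y <- 2 + 0.5 * x + rnorm(1e4)
grp <- data.frame(
  x    = sort(unique(x)),
  ybar = tapply(y, x, mean),   # the sample CEF at each value of x
  n    = as.vector(table(x))   # cell counts (the empirical pmf, up to n)
)
coef(lm(y ~ x))                              # micro-data OLS
coef(lm(ybar ~ x, data = grp, weights = n))  # weighted LS on the CEF: same
```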
-- Suppose `\(\text{X}_{i}\)` is discrete with pmf `\(\mathop{g_x}(u)\)` $$ `\begin{align} \mathop{E}\!\bigg[ \left( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \right)^2 \bigg] = \sum_u \left( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} = u \right] - u'b \right)^2 \mathop{g_x}(u) \end{align}` $$ _i.e._, `\(\beta\)` can be expressed as a weighted least-squares regression of `\(\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} = u\right]\)` on `\(u\)` (the values of `\(\text{X}_{i}\)`) weighted by `\(\mathop{g_x}(u)\)`. --- We can also use the LIE here: `\(\beta\)` -- <br>.pad-left[] `\(= \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right]\)` -- <br>.pad-left[] `\(= \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \mathop{E}\left( \text{Y}_{i}\mid \text{X}_{i} \right) \right]\)` -- .hi-pink[Pro] Useful for aggregated data when microdata are sensitive/big. -- .hi-pink[Con] You .hi-slate[will not] get the same standard errors. --- layout: false # Table of contents .pull-left[ ### Admin .smaller[ 1. [Schedule](#schedule) ] ] .pull-right[ ### Regression .smaller[ 1. [Why?](#why) 1. [The CEF](#cef) - [Definition](#cef) - [Graphically](#graphically) - [Law of iterated expectations](#lie) - [Decomposition](#decomposition) - [Prediction](#prediction) 1. [Population least squares](#pop_ls) 1. [Anatomy](#anatomy) 1. [Regression-CEF theorem](#reg_cef_theorem) 1. [WLS](#wls) ] ] --- exclude: true