---
title: "Why Regression?"
subtitle: "EC 607, Set 03"
author: "Edward Rubin"
date: ""
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: inverse, middle
```{r, setup, include = F}
# devtools::install_github("dill/emoGG")
library(pacman)
p_load(
broom, tidyverse,
ggplot2, ggthemes, ggforce, ggridges,
latex2exp, viridis, extrafont, gridExtra,
kableExtra, snakecase, janitor,
data.table, dplyr, estimatr,
lubridate, knitr, parallel,
lfe,
here, magrittr
)
# Define pink color
red_pink <- "#e64173"
turquoise <- "#20B2AA"
orange <- "#FFA500"
red <- "#fb6107"
blue <- "#3b3b9a"
green <- "#8bb174"
grey_light <- "grey70"
grey_mid <- "grey50"
grey_dark <- "grey20"
purple <- "#6A5ACD"
slate <- "#314f4f"
# Dark slate grey: #314f4f
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(crayon.enabled = F)
options(knitr.table.format = "html")
# A blank theme for ggplot
theme_empty <- theme_bw() + theme(
line = element_blank(),
rect = element_blank(),
strip.text = element_blank(),
axis.text = element_blank(),
plot.title = element_blank(),
axis.title = element_blank(),
plot.margin = structure(c(0, 0, -0.5, -1), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_simple <- theme_bw() + theme(
line = element_blank(),
panel.grid = element_blank(),
rect = element_blank(),
strip.text = element_blank(),
axis.text.x = element_text(size = 18, family = "STIXGeneral"),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
plot.title = element_blank(),
axis.title = element_blank(),
# plot.margin = structure(c(0, 0, -1, -1), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes_math <- theme_void() + theme(
text = element_text(family = "MathJax_Math"),
axis.title = element_text(size = 22),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = "grey70",
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes_serif <- theme_void() + theme(
text = element_text(family = "MathJax_Main"),
axis.title = element_text(size = 22),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = "grey70",
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes <- theme_void() + theme(
text = element_text(family = "Fira Sans Book"),
axis.title = element_text(size = 18),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = grey_light,
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_set(theme_gray(base_size = 20))
# Column names for regression results
reg_columns <- c("Term", "Est.", "S.E.", "t stat.", "p-Value")
# Function for formatting p values
format_pvi <- function(pv) {
return(ifelse(
pv < 0.0001,
"<0.0001",
round(pv, 4) %>% format(scientific = F)
))
}
format_pv <- function(pvs) lapply(X = pvs, FUN = format_pvi) %>% unlist()
# Tidy regression results table
tidy_table <- function(x, terms, highlight_row = 1, highlight_color = "black", highlight_bold = T, digits = c(NA, 3, 3, 2, 5), title = NULL) {
x %>%
tidy() %>%
select(1:5) %>%
mutate(
term = terms,
p.value = p.value %>% format_pv()
) %>%
kable(
col.names = reg_columns,
escape = F,
digits = digits,
caption = title
) %>%
kable_styling(font_size = 20) %>%
row_spec(1:nrow(tidy(x)), background = "white") %>%
row_spec(highlight_row, bold = highlight_bold, color = highlight_color)
}
# A few extras
xaringanExtra::use_xaringan_extra(c("tile_view", "fit_screen"))
```
# Prologue
---
name: schedule
# Schedule
### Last time
- The Experimental Ideal
- Fundamentals of .mono[R]
### Today
What's so great about linear regression and OLS?
.hi-slate[Read] *MHE* 3.1
### Upcoming
.hi-slate[Assignment].sub[1] Custom OLS function fun.
.hi-slate[Assignment].sub[2] First step of project proposal.
---
layout: true
# Regression
---
class: inverse, middle
---
name: why
## Why?
In our previous discussion, we began moving from simple differences to a regression framework.
--
.hi-slate[Q] Why do we.pink[†] care so much about linear regression and OLS?
.footnote[.pink[†] *we* = empirically inclined applied economists]
--
.hi-slate[A] As we discussed, regression allows us to control for covariates that *can* assist with (.slate[1]) causal identification and (.slate[2]) inference.
--
There's a deeper reason that we care about *linear* regression and ordinary least squares (OLS): .hi-pink[*the conditional expectation function (CEF).*]
---
## Why?
Even ignoring causality, we can show important relationships between
1. .hi-pink[the CEF] (the conditional expectation function),
2. the .hi-purple[population regression function],
3. and the .hi-slate[sampling distribution of regression estimates].
---
layout: true
# Regression
## The *CEF*
---
name: cef
.hi-slate[Definition] The .hi[conditional expectation function] for a dependent variable $\text{Y}_{i}$, given a $\text{K}\times 1$ vector of covariates $\text{X}_{i}$, tells us .pink[the expected value (population average) of] $\color{#e64173}{\text{Y}_{i}}$ .pink[with] $\color{#e64173}{\text{X}_{i}}$ .pink[held constant.]
--
Written as $\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]$, the CEF is a function of $\text{X}_{i}$..pink[†]
.footnote[
.pink[†] We'll generally assume $\text{X}_{i}$ is a random variable, which implies that $\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]$ is also a random variable.
]
--
.hi-slate[Examples]
- $\mathop{E}\left[ \text{Income}_i \mid \text{Education}_i \right]$
--
- $\mathop{E}\left[ \text{Wage}_i \mid \text{Gender}_i \right]$
--
- $\mathop{E}\left[ \text{Birth weight}_i \mid \text{Air quality}_i \right]$
---
Formally, for continuous $\text{Y}_{i}$ with conditional density $f_y(t|\text{X}_{i}=x)$,
$$
\begin{align}
\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} = x \right] = \int t f_y(t|\text{X}_{i}=x)dt
\end{align}
$$
--
and for discrete $\text{Y}_{i}$ with conditional p.m.f. $\mathop{\text{Pr}}\left(\text{Y}_{i}=t|\text{X}_{i}=x\right)$,
$$
\begin{align}
\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i}=x \right] = \sum_t t \mathop{\text{Pr}}\left(\text{Y}_{i}=t|\text{X}_{i}=x\right)
\end{align}
$$
.hi-slate[*Notice*] We are focusing on the .hi-pink[population].
--
We want to build our intuition about the parameters that we will eventually estimate.
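---
.hi-slate[Quick check] A sketch of the discrete-case formula with simulated (made-up) data: the sum over the conditional p.m.f. is just a conditional mean.
```{r, ex_cef_discrete}
# Simulate a binary X and a discrete Y whose distribution depends on X
set.seed(123)
x <- rbinom(1e6, size = 1, prob = 0.5)
y <- rbinom(1e6, size = 2, prob = 0.3 + 0.4 * x)
# Pr(Y = t | X = 1), estimated from the simulated sample
pmf <- prop.table(table(y[x == 1]))
# The formula: sum over t of t * Pr(Y = t | X = 1)
sum(as.numeric(names(pmf)) * pmf)
# ...matches the conditional mean directly
mean(y[x == 1])
```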
---
layout: false
class: clear, middle
Graphically...
---
class: clear, center, middle
name: graphically
The conditional distributions of $\text{Y}_{i}$ given $\text{X}_{i} = x$, for $x = 8, \ldots, 22$.
```{r, data_cef, echo = F, cache = T}
# Set seed
set.seed(12345)
# Sample size
n <- 1e4
# Generate extra disturbances
u <- sample(-2:2, size = 22, replace = T) * 1e3
# Generate data
cef_df <- tibble(
x = sample(x = seq(8, 22, 1), size = n, replace = T),
y = 15000 + 3000 * x + 1e3 * (x %% 3) + 500 * (x %% 2) + rnorm(n, sd = 1e4) + u[x]
) %>% mutate(x = round(x)) %>%
filter(y > 0)
# Means
means_df <- cef_df %>% group_by(x) %>% summarize(y = mean(y))
# The CEF in ggplot
gg_cef <- ggplot(data = cef_df, aes(x = y, y = x %>% as.factor())) +
geom_density_ridges_gradient(
aes(fill = ..x..),
rel_min_height = 0.003,
color = "white",
scale = 2.5,
size = 0.3
) +
scale_x_continuous(
"Annual income",
labels = scales::dollar
) +
ylab("Years of education") +
scale_fill_viridis(option = "magma") +
theme_pander(base_family = "Fira Sans Book", base_size = 18) +
theme(
legend.position = "none"
) +
coord_flip()
```
```{r, fig_cef_dist, echo = F, cache = T}
gg_cef
```
---
class: clear, middle, center
The CEF, $\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]$, connects these conditional distributions' means.
```{r, fig_cef, echo = F, cache = T}
gg_cef +
geom_path(
data = means_df,
aes(x = y, y = x %>% as.factor(), group = 1),
color = "white",
alpha = 0.85
) +
geom_point(
data = means_df,
aes(x = y, y = x %>% as.factor()),
color = "white",
shape = 16,
size = 3.5
)
```
---
class: clear, middle, center
Focusing in on the CEF, $\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]$...
```{r, fig_cef_only, echo = F, cache = T}
ggplot(data = cef_df, aes(x = y, y = x %>% as.factor())) +
geom_density_ridges(
rel_min_height = 0.003,
color = "grey85",
fill = NA,
scale = 2.5,
size = 0.3
) +
scale_x_continuous(
"Annual income",
labels = scales::dollar
) +
ylab("Years of education") +
scale_fill_viridis(option = "magma") +
theme_pander(base_family = "Fira Sans Book", base_size = 18) +
theme(
legend.position = "none"
) +
geom_path(
data = means_df,
aes(x = y, y = x %>% as.factor(), group = 1),
color = "grey20",
alpha = 0.85
) +
geom_point(
data = means_df,
aes(x = y, y = x %>% as.factor()),
color = "grey20",
shape = 16,
size = 3.5
) +
coord_flip()
```
---
class: clear, middle
.hi-slate[Q] How does the CEF relate to/inform regression?
---
layout: true
# Regression
## The *CEF*
---
name: lie
As we derive the properties and relationships associated with the CEF, regression, and a host of other estimators, we will frequently rely upon
.hi-slate[*the Law of Iterated Expectations*] (LIE).
--
$$
\begin{align}
\color{#6A5ACD}{\mathop{E}\left[ \text{Y}_{i} \right]} = \mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg)
\end{align}
$$
--
which says that the .hi-purple[unconditional expectation] is equal to the .b[unconditional average] of the .hi-pink[conditional expectation function].
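---
.hi-slate[Quick check] A numerical sketch of the LIE with simulated (made-up) data.
```{r, ex_lie}
# Simulate a discrete X and a Y whose mean depends on X
set.seed(123)
x <- sample(1:5, size = 1e6, replace = TRUE)
y <- 2 + 3 * x + rnorm(1e6)
# The unconditional expectation, E[Y]
mean(y)
# The unconditional average of the CEF: weight E[Y|X = x] by Pr(X = x)
cond_means <- tapply(y, x, mean)
pr_x <- prop.table(table(x))
sum(cond_means * pr_x)
```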
---
layout: false
# Regression
.hi-slate[A proof of the LIE]
First, we need notation...
Let $\mathop{f_{x,y}}(u,t)$ denote the joint density for continuous RVs $\left( \text{X}_{i},\text{Y}_{i} \right)$.
Let $\mathop{f_{y|x}}(t\mid \text{X}_{i}=u)$ denote the conditional distribution of $\text{Y}_{i}$ given $\text{X}_{i}=u$.
And let $\mathop{g_y}(t)$ and $\mathop{g_x}(u)$ denote the marginal densities of $\text{Y}_{i}$ and $\text{X}_{i}$.
---
# Regression
.hi-slate[A proof of the LIE]
$\mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg)$
--
$= {\displaystyle\int} \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} = u \right]} \mathop{g_x}(u) du$
--
$={\displaystyle\int} \color{#e64173}{\left[{\displaystyle\int} t \mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right) dt\right]} \mathop{g_x}(u) du$
--
$={\displaystyle\int} {\displaystyle\int} \color{#e64173}{t \mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right)} \mathop{g_x}(u) du \, \color{#e64173}{dt}$
--
$={\displaystyle\int} \color{#e64173}{t} \left[ {\displaystyle\int} \color{#e64173}{\mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right)} \mathop{g_x}(u) du \right] \color{#e64173}{dt}$
--
$={\displaystyle\int} \color{#e64173}{t} \left[ {\displaystyle\int} \mathop{f_{x,y}}(u,t) du \right] \color{#e64173}{dt}$
--
$={\displaystyle\int} \color{#e64173}{t} \mathop{g_y(t)} \color{#e64173}{dt}$
--
$=\mathop{E}\left[ \text{Y}_{i} \right]$
--
.bigger[🥳]
---
layout: false
class: clear, middle
Great. What's the point?
---
layout: true
# Regression
## The *LIE* and the *CEF*
---
name: decomposition
.hi-slate[Theorem] The CEF decomposition property (3.1.1)
The LIE allows us to **decompose random variables** into two pieces
--
$$
\begin{align}
\text{Y}_{i} = \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} + \color{#6A5ACD}{\varepsilon_i}
\end{align}
$$
1. .hi-pink[the conditional expectation function]
2. .hi-purple[a residual] with special powers.pink[†]
i. $\color{#6A5ACD}{\varepsilon_i}$ is mean independent of $\text{X}_{i}$, _i.e._, $\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] = 0$.
ii. $\color{#6A5ACD}{\varepsilon_i}$ is uncorrelated with any function of $\text{X}_{i}$.
.footnote[.pink[†] Angrist and Pischke go with *special properties*.]
--
.hi-orange[*Important*] It might not seem like much, but these results are .hi-orange[huge] for building intuition, theory, *and* application.
--
Put a ⭐ here!
---
.hi-slate[Proof] The CEF decomposition property (properties i. and ii. of $\color{#6A5ACD}{\varepsilon_i}$)
--
.pull-left[
.hi-slate[Mean independence], $\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] = 0$
$$
\begin{align}
&\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] \\[0.6em]
&= \mathop{E}\!\bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg| \text{X}_{i} \bigg) \\[0.6em]
&= \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg| \text{X}_{i} \bigg) \\[0.6em]
&= \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \\[0.6em]
&= 0
\end{align}
$$
]
--
.pull-right[
.hi-slate[Zero correlation] btn. $\color{#6A5ACD}{\varepsilon_i}$ and $\mathop{h}\left( \text{X}_{i} \right)$
$$
\begin{align}
&\mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \color{#6A5ACD}{\varepsilon_i}\right] \\[0.6em]
&= \mathop{E}\!\bigg( \mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \color{#6A5ACD}{\varepsilon_i}\mid \text{X}_{i} \right]\bigg) \\[0.6em]
&= \mathop{E}\!\bigg( \mathop{h}\left( \text{X}_{i} \right) \mathop{E}\left[\color{#6A5ACD}{\varepsilon_i}\mid \text{X}_{i} \right]\bigg) \\[0.6em]
&= \mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \times 0\right] \\[0.6em]
&= 0
\end{align}
$$
]
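---
.hi-slate[Quick check] A sketch of properties i. and ii. with simulated data and an intentionally nonlinear CEF.
```{r, ex_decomposition}
# Simulate a discrete X and a Y with a nonlinear CEF: E[Y|X] = exp(X)
set.seed(123)
x <- sample(1:4, size = 1e6, replace = TRUE)
y <- exp(x) + rnorm(1e6, sd = 5)
# Sample analogs of the CEF and the residual
cef <- ave(y, x)
e <- y - cef
# i. Mean independence: E[e|X = x] is (essentially) zero at every x
tapply(e, x, mean) %>% round(3)
# ii. Uncorrelated with functions of X, e.g., h(X) = X^2
cor(e, x^2) %>% round(3)
```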
---
.hi-slate[The CEF decomposition property]
says that we can decompose any random variable (_e.g._, $\text{Y}_{i}$) into
1. a part that is .pink[explained by] $\color{#e64173}{\text{X}_{i}}$ (_i.e._, the CEF $\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}$),
2. a part that is .purple[*orthogonal to*.purple[†] any function of] $\color{#6A5ACD}{\text{X}_{i}}$ (_i.e._, $\color{#6A5ACD}{\varepsilon_i}$).
.footnote[.purple[†] "orthogonal to" = "uncorrelated with"]
--
.hi-slate[Why the CEF?]
The .pink[CEF] also presents an intuitive summary of the relationship between $\text{Y}_{i}$ and $\text{X}_{i}$, since we often use means to characterize random variables.
--
But (of course) there are more reasons to use the CEF...
---
name: prediction
.hi-slate[Theorem] The CEF prediction property (3.1.2)
Let $\mathop{m}\left( \text{X}_{i} \right)$ be *any* function of $\text{X}_{i}$. The CEF solves
$$
\begin{align}
\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \underset{\mathop{m}\left( \text{X}_{i} \right)}{\text{arg min}}\enspace \mathop{E}\left[ \left( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \right)^2 \right]
\end{align}
$$
In other words, the .hi-pink[CEF] is the minimum mean-squared error (MMSE) predictor of $\text{Y}_{i}$ given $\text{X}_{i}$.
--
.hi-slate[*Notice*]
1. We haven't restricted $m$ to any class of functions—it can be nonlinear.
2. We're talking about *prediction* (specifically predicting $\text{Y}_{i}$).
---
layout: false
class: clear
.hi-slate[Proof] The CEF prediction property
$\bigg( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \bigg)^2$ .right10[.orange[(**1**)]]
--
$= \bigg( \big\{ \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \big\} + \big\{ \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{m}\left( \text{X}_{i} \right) \big\} \bigg)^2$
--
$= \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)^2$ .right10[.turquoise[(**a**)]]
$+ 2 \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] - \mathop{m}\left( \text{X}_{i} \right)}\bigg)\times \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)$ .right10[.turquoise[(**b**)]]
$+ \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{m}\left( \text{X}_{i} \right) \bigg)^2$ .right10[.turquoise[(**c**)]]
--
.hi-slate[Recall:] We want to choose the $\mathop{m}\left( \text{X}_{i} \right)$ that minimizes .orange[(**1**)] in expectation.
--
.turquoise[(**a**)] is irrelevant, _i.e._, it does not depend upon $\mathop{m}\left( \text{X}_{i} \right)$.
--
.turquoise[(**b**)] equals zero in expectation: $\mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right)\times \color{#6A5ACD}{\varepsilon_i} \right] = 0$.
--
.turquoise[(**c**)] is minimized by $\mathop{m}\left( \text{X}_{i} \right) = \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}$, _i.e._, when $\mathop{m}\left( \text{X}_{i} \right)$ is the .pink[CEF].
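---
class: clear
.hi-slate[Quick check] In-sample, the conditional means beat an arbitrary alternative function of $\text{X}_{i}$ on mean squared error (a sketch with simulated data).
```{r, ex_cef_mmse}
# Simulate data with CEF E[Y|X] = log(X)
set.seed(123)
x <- sample(1:5, size = 1e6, replace = TRUE)
y <- log(x) + rnorm(1e6)
# MSE using the (sample) conditional means vs. some other function of x
mean((y - ave(y, x))^2)    # the CEF's sample analog
mean((y - 0.5 * x)^2)      # an arbitrary alternative m(x)
```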
---
layout: true
# Regression
## The *LIE* and the *CEF*
---
∴ the .pink[CEF] is the function that minimizes the mean-squared error (MSE)
$$
\begin{align}
\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \underset{\mathop{m}\left( \text{X}_{i} \right)}{\text{arg min}}\enspace \mathop{E}\left[ \left( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \right)^2 \right]
\end{align}
$$
---
One final property of the .pink[CEF] (very similar to the decomposition property)
.hi-slate[Theorem] The ANOVA theorem (3.1.3)
$$
\begin{align}
\mathop{\text{Var}} \left( \text{Y}_{i} \right) = \mathop{\text{Var}} \left( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right) + \mathop{E}\left[ \mathop{\text{Var}} \left( \text{Y}_{i} \mid \text{X}_{i} \right) \right]
\end{align}
$$
which says that we can decompose the variance in $\text{Y}_{i}$ into
1. the variance in the .pink[CEF]
2. the variance of the residual
--
.hi-slate[*Example*] Decomposing wage variation into (.hi-slate[1]) variation explained by workers' characteristics and (.hi-slate[2]) unexplained (residual) variation
--
The proof hinges on the mean independence of the residual, $\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] = 0$, which comes from the CEF decomposition property.
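---
.hi-slate[Quick check] The variance decomposition on simulated data (equality holds up to sampling error and the sample-variance degrees-of-freedom correction).
```{r, ex_anova}
# Simulated data with a nonlinear CEF
set.seed(123)
x <- sample(1:4, size = 1e6, replace = TRUE)
y <- exp(x) + rnorm(1e6, sd = 5)
# Var(Y)
var(y)
# Var(E[Y|X]) + E[Var(Y|X)]
var(ave(y, x)) + mean(ave(y, x, FUN = var))
```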
---
layout: false
class: clear, middle
We now understand the CEF a bit better.
But how does the CEF actually relate to regression?
---
layout: true
# Regression
## The *CEF* and regression
---
We've discussed how the .pink[CEF] summarizes empirical relationships.
*Previously* we discussed how regression provides simple empirical insights.
Let's link these two concepts.
---
name: pop_ls
.hi-slate[Population least-squares regression]
We will focus on $\beta$, the $K\times 1$ vector of population least-squares regression coefficients, _i.e._,
$$
\begin{align}
\beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\left[ \left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \right]
\end{align}
$$
where $b$ and $\text{X}_{i}$ are also $K\times 1$, and $\text{Y}_{i}$ is a scalar.
--
Taking the first-order condition gives
$$
\begin{align}
\mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'b \right) \right] = 0
\end{align}
$$
---
From the first-order condition
$$
\begin{align}
\mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'b \right) \right] = 0
\end{align}
$$
we can solve for $b$. We've defined the optimum as $\beta$. Thus,
$$
\begin{align}
\beta = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \text{Y}_{i} \right]
\end{align}
$$
--
.hi-slate[*Note*] The first-order conditions tell us that our least-squares population regression residuals $\left(e_i = \text{Y}_{i} - \text{X}_{i}'\beta \right)$ are uncorrelated with $\text{X}_i$.
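---
.hi-slate[Quick check] The sample analog of $\beta = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \text{Y}_{i} \right]$ is $\left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{y}$. A sketch with made-up data:
```{r, ex_ols_by_hand}
# Simulate a small dataset
set.seed(123)
n <- 1e3
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
# Least squares 'by hand': (X'X)^{-1} X'y
X <- cbind(1, x1, x2)
solve(t(X) %*% X, t(X) %*% y) %>% t()
# ...matches lm()
coef(lm(y ~ x1 + x2))
```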
---
layout: true
# Regression
## Anatomy
---
name: anatomy
Our "new" result: $\beta = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \text{Y}_{i} \right]$
In .hi-slate[simple linear regression] (an intercept and one regressor $x_i$),
$$
\begin{align}
\beta_1 &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, x_i \right)}{\mathop{\text{Var}} \left( x_i \right)}
& \beta_0 = \mathop{E}\left[ \text{Y}_{i} \right] - \beta_1 \mathop{E}\left[ x_i \right]
\end{align}
$$
--
For .hi-slate[multivariate regression], the coefficient on the k.super[th] regressor $x_{ki}$ is
$$
\begin{align}
\beta_k &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)}
\end{align}
$$
where $\widetilde{x}_{ki}$ is the residual from a regression of $x_{ki}$ on all other covariates.
---
This alternative formulation of least-squares coefficients is quite powerful.
$$
\begin{align}
\beta_k &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)}
\end{align}
$$
--
.hi-slate[Why?]
--
This expression illustrates how each coefficient in a least-squares regression represents the bivariate slope coefficient .pink[after controlling for the other covariates].
---
In fact, we can re-write our coefficients to further emphasize this point
$$
\begin{align}
\beta_k &= \dfrac{\mathop{\text{Cov}} \left( \widetilde{\text{Y}}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)}
\end{align}
$$
$\widetilde{\text{Y}}_{i}$ denotes the residual from regressing $\text{Y}_{i}$ on all regressors except $x_{ki}$.
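---
.hi-slate[Quick check] A sketch of the anatomy formula with simulated data: residualize $x_{1i}$ on the other covariate, then apply the bivariate formula.
```{r, ex_anatomy}
# Simulate correlated regressors
set.seed(123)
n <- 1e3
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
# The coefficient on x1 from the full regression
coef(lm(y ~ x1 + x2))["x1"]
# Residualize x1 on x2 (and an intercept); apply Cov(y, x1-tilde) / Var(x1-tilde)
x1_tilde <- residuals(lm(x1 ~ x2))
cov(y, x1_tilde) / var(x1_tilde)
```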
---
layout: false
class: clear, middle
Graphical example
---
class: clear, middle, center
```{r, data_anatomy, cache = F, include = F}
n <- 1e2
set.seed(1234)
gen_df <- tibble(
x2 = sample(x = c(F, T), size = n, replace = T),
x1 = runif(n = n, min = -3, max = 3) + x2 - 0.5,
u = rnorm(n = n, mean = 0, sd = 1),
y = -3 + x1 + x2 * 4 + u
)
ya_mean <- filter(gen_df, x2 == F)$y %>% mean()
yb_mean <- filter(gen_df, x2 == T)$y %>% mean()
x1a_mean <- filter(gen_df, x2 == F)$x1 %>% mean()
x1b_mean <- filter(gen_df, x2 == T)$x1 %>% mean()
gen_df %<>% mutate(
y_dm = y - ya_mean * (x2 == F) - yb_mean * (x2 == T),
x1_dm = x1 - x1a_mean * (x2 == F) - x1b_mean * (x2 == T)
)
```
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$
```{r, fig_anatomy1, echo = F}
ggplot(data = gen_df, aes(y = y, x = x1, color = x2, shape = x2)) +
geom_hline(yintercept = 0, color = "grey85") +
geom_vline(xintercept = 0, color = "grey85") +
geom_point(size = 3) +
annotate("text", x = -0.075, y = 6.75, label = TeX("$y$"), size = 8) +
annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1$"), size = 8) +
ylim(c(-7, 7)) +
xlim(c(-3.5, 3.5)) +
theme_empty +
scale_color_manual(
expression(x[2]),
values = c("darkslategrey", red_pink),
labels = c("A", "B")
) +
scale_shape_manual(
expression(x[2]),
values = c(1, 19),
labels = c("A", "B")
) +
theme(
legend.position = "bottom",
text = element_text(size = 22)
)
```
---
class: clear, middle, center
count: false
$\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$
```{r, fig_anatomy2, echo = F}
ggplot(data = gen_df, aes(y = y, x = x1, color = x2, shape = x2)) +
geom_hline(yintercept = ya_mean, color = "darkslategrey", alpha = 0.5, size = 2) +
geom_hline(yintercept = yb_mean, color = red_pink, alpha = 0.5, size = 2) +
geom_hline(yintercept = 0, color = "grey85") +
geom_vline(xintercept = 0, color = "grey85") +
geom_point(size = 3) +
annotate("text", x = -0.075, y = 6.75, label = TeX("$y$"), size = 8) +
annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1$"), size = 8) +
ylim(c(-7, 7)) +
xlim(c(-3.5, 3.5)) +
theme_empty +
scale_color_manual(
expression(x[2]),
values = c("darkslategrey", red_pink),
labels = c("A", "B")
) +
scale_shape_manual(
expression(x[2]),
values = c(1, 19),
labels = c("A", "B")
) +
theme(
legend.position = "bottom",
text = element_text(size = 22)
)
```
---
class: clear, middle, center
count: false
$\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$
```{r, fig_anatomy3, echo = F}
ggplot(data = gen_df %>% mutate(y = y - 4 * x2), aes(y = y_dm, x = x1)) +
geom_hline(yintercept = 0, color = "grey85") +
geom_vline(xintercept = 0, color = "grey85") +
geom_point(size = 3, aes(color = x2, shape = x2)) +
annotate("text", x = -0.075, y = 6.75, label = TeX("$y|x_2$"), size = 8, hjust = 1) +
annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1$"), size = 8) +
ylim(c(-7, 7)) +
xlim(c(-3.5, 3.5)) +
theme_empty +
scale_color_manual(
expression(x[2]),
values = c("darkslategrey", red_pink),
labels = c("A", "B")
) +
scale_shape_manual(
expression(x[2]),
values = c(1, 19),
labels = c("A", "B")
) +
theme(
legend.position = "bottom",
text = element_text(size = 22)
)
```
---
class: clear, middle, center
count: false
$\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$
```{r, fig_anatomy4, echo = F}
ggplot(data = gen_df %>% mutate(y = y - 4 * x2), aes(y = y_dm, x = x1)) +
geom_hline(yintercept = 0, color = "grey85") +
geom_vline(xintercept = 0, color = "grey85") +
geom_vline(xintercept = x1a_mean, color = "darkslategrey", alpha = 0.5, size = 2) +
geom_vline(xintercept = x1b_mean, color = red_pink, alpha = 0.5, size = 2) +
geom_point(size = 3, aes(color = x2, shape = x2)) +
annotate("text", x = -0.075, y = 6.75, label = TeX("$y|x_2$"), size = 8, hjust = 1) +
annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1$"), size = 8) +
ylim(c(-7, 7)) +
xlim(c(-3.5, 3.5)) +
theme_empty +
scale_color_manual(
expression(x[2]),
values = c("darkslategrey", red_pink),
labels = c("A", "B")
) +
scale_shape_manual(
expression(x[2]),
values = c(1, 19),
labels = c("A", "B")
) +
theme(
legend.position = "bottom",
text = element_text(size = 22)
)
```
---
class: clear, middle, center
count: false
$\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$
```{r, fig_anatomy5, echo = F}
ggplot(data = gen_df %>% mutate(y = y - 4 * x2), aes(y = y_dm, x = x1_dm)) +
geom_hline(yintercept = 0, color = "grey85") +
geom_vline(xintercept = 0, color = "grey85") +
geom_point(size = 3, aes(color = x2, shape = x2)) +
annotate("text", x = -0.075, y = 6.75, label = TeX("$y|x_2$"), size = 8, hjust = 1) +
annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1|x_2$"), size = 8, hjust = 1) +
ylim(c(-7, 7)) +
xlim(c(-3.5, 3.5)) +
theme_empty +
scale_color_manual(
expression(x[2]),
values = c("darkslategrey", red_pink),
labels = c("A", "B")
) +
scale_shape_manual(
expression(x[2]),
values = c(1, 19),
labels = c("A", "B")
) +
theme(
legend.position = "bottom",
text = element_text(size = 22)
)
```
---
class: clear, middle, center
count: false
$\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$
```{r, fig_anatomy6, echo = F}
ggplot(data = gen_df %>% mutate(y = y - 4 * x2), aes(y = y_dm, x = x1_dm)) +
geom_hline(yintercept = 0, color = "grey85") +
geom_vline(xintercept = 0, color = "grey85") +
geom_smooth(method = lm, se = F, color = purple, alpha = 1, size = 2) +
geom_point(size = 3, aes(color = x2, shape = x2)) +
annotate("text", x = -0.075, y = 6.75, label = TeX("$y|x_2$"), size = 8, hjust = 1) +
annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1|x_2$"), size = 8, hjust = 1) +
ylim(c(-7, 7)) +
xlim(c(-3.5, 3.5)) +
theme_empty +
scale_color_manual(
expression(x[2]),
values = c("darkslategrey", red_pink),
labels = c("A", "B")
) +
scale_shape_manual(
expression(x[2]),
values = c(1, 19),
labels = c("A", "B")
) +
theme(
legend.position = "bottom",
text = element_text(size = 22)
)
```
---
layout: false
class: clear, middle
Now that we've refreshed/deepened our regression knowledge, let's connect regression and the CEF.
---
layout: true
# Regression
## Regression and the *CEF*
---
Angrist and Pischke make the case that
> ... you should be interested in regression parameters if you are interested in the CEF. (*MHE*, p.36)
.hi-slate[Q] What is the reasoning/connection?
--
.hi-slate[A] We'll cover three reasons.
1. *If the CEF is linear*, then the population regression line is the CEF.
2. The function $\text{X}_{i}' \beta$ is the min. MSE *linear* predictor of $\text{Y}_{i}$ given $\text{X}_{i}$.
3. The function $\text{X}_{i}' \beta$ gives the min. MSE *linear* approximation to the CEF.
---
layout: true
# Regression
## Regression and the *CEF*
---
.hi-slate[Theorem] The linear CEF theorem (3.1.4)
If the CEF is linear, then the population regression is the CEF.
--
.hi-slate[Proof] Let the CEF equal some linear function, _i.e._, $\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \text{X}_{i}' \beta^\star$.
From the CEF decomposition property, we know $\mathop{E}\left[ \text{X}_{i} \color{#6A5ACD}{\varepsilon_i} \right] = 0$.
--
$\implies \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right) \right] = 0$
--
$\implies \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'\beta^\star \right) \right] = 0$
--
$\implies \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right] - \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \beta^\star \right] = 0$
--
$\implies \beta^\star = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right]$
--
$=\beta$, our population regression coefficients.
---
>.hi-slate[Theorem] The linear CEF theorem (3.1.4)
>
>If the CEF is linear, then the population regression is the CEF.
Linearity can be a strong assumption. When might we expect linearity?
--
1. Situations in which $\left( \text{Y}_{i},\, \text{X}_{i} \right)$ follows a multivariate normal distribution.
.hi-slate[*Concern*] Might be limited—especially when $\text{Y}_{i}$ or $\text{X}_{i}$ are not continuous.
--
2. Saturated regression models
.hi-slate[*Example*] A model with two binary indicators and their interaction (see the sketch on the next slide).
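---
.hi-slate[Quick check] A sketch of the saturated case with simulated data: the fitted values from a fully interacted model of two binary indicators reproduce the conditional means exactly.
```{r, ex_saturated}
# Two binary indicators and any CEF you like
set.seed(123)
n <- 1e4
d1 <- rbinom(n, 1, 0.5)
d2 <- rbinom(n, 1, 0.4)
y <- exp(d1 + 2 * d2) + rnorm(n)
# Saturated regression: intercept, d1, d2, and d1:d2
sat <- lm(y ~ d1 * d2)
# Fitted values vs. conditional means, cell by cell
grp <- interaction(d1, d2, sep = "-")
cbind(
  fitted = tapply(fitted(sat), grp, mean),
  cell_mean = tapply(y, grp, mean)
)
```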
---
.hi-slate[Theorem] The best linear predictor theorem (3.1.5)
$\text{X}_{i}' \beta$ is the best *linear* predictor of $\text{Y}_{i}$ given $\text{X}_{i}$ (minimizes MSE).
--
.hi-slate[Proof] We defined $\beta$ as the vector that minimizes MSE, _i.e._,
$$
\begin{align}
\beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\left[ \left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \right]
\end{align}
$$
so $\text{X}_{i}'\beta$ is literally defined as the minimum MSE linear predictor of $\text{Y}_{i}$.
--
- The population-regression function $\left(\text{X}_{i}'\beta\right)$ is the best (min. MSE) *linear* predictor of $\text{Y}_{i}$ given $\text{X}_{i}$.
- The CEF $\left( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] \right)$ is the best predictor (min. MSE) of $\text{Y}_{i}$ given $\text{X}_{i}$ across *all classes* of functions.
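---
.hi-slate[Quick check] In-sample, nudging the least-squares coefficients away from their optimum can only increase the MSE (a sketch using the simulated income-education data from the earlier figures).
```{r, ex_blp}
# Least-squares coefficients from the simulated CEF data
b <- coef(lm(y ~ x, data = cef_df))
# In-sample MSE for an arbitrary linear predictor b0 + b1 * x
mse <- function(b0, b1) with(cef_df, mean((y - b0 - b1 * x)^2))
mse(b[1], b[2])              # the least-squares fit
mse(b[1] + 2e3, b[2] - 200)  # an arbitrary alternative linear predictor
```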
---
.hi-slate[Q] If $\text{X}_{i}'\beta$ is .hi[the best linear predictor] of $\text{Y}_{i}$ given $\text{X}_{i}$, then why is there so much interest in machine learning for prediction (as opposed to regression)?
--
.hi-slate[A] A few reasons
1. Relax *linearity*
2. Model selection
- choosing $\text{X}_{i}$ is not always obvious
- overfitting is bad (bias-variance tradeoff)
3. It's fancy, shiny, and new
4. Some ML methods boil down to regression
5. Others?
--
.hi-slate[Counter Q] Why are we (still) using regression?
---
name: reg_cef_theorem
.hi-slate[Theorem] The regression CEF theorem (3.1.6)
The population regression function $\text{X}_{i}'\beta$ provides the minimum MSE linear approximation to the CEF $\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]$, _i.e._,
$$
\begin{align}
\beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\!\left\{ \bigg( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] - \text{X}_{i}'b \bigg)^2 \right\}
\end{align}
$$
--
.hi-slate[*Put simply*] Regression gives us the *best* linear approximation to the CEF.
---
layout: false
class: clear
.hi-slate[Proof] First, recall that, .orange[in expectation], $\beta$ is the $b$ that minimizes $\left( \text{Y}_{i} - \text{X}_{i}'b \right)^2$
--
$\left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \color{#ffffff}{\bigg|}$ .right10[.orange[(**1**)]]
--
$= \bigg( \left\{ \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right\} + \left\{ \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \right\} \bigg)^2$
--
$= \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)^2$ .right10[.turquoise[(**a**)]]
$\color{#ffffff}{=}+\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \bigg)^2$ .right10[.turquoise[(**b**)]]
$\color{#ffffff}{=}+ 2 \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg) \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \bigg)$ .right10[.turquoise[(**c**)]]
--
We want to minimize .turquoise[(**b**)], and we know $\beta$ minimizes .orange[(**1**)].
--
.turquoise[(**a**)] is irrelevant, _i.e._, it does not depend upon $b$.
--
.turquoise[(**c**)] can be written as $2 \color{#6A5ACD}{\varepsilon_i} \mathop{h}(\text{X}_{i})$, which equals zero in expectation.
--
∴ (In expectation) If $b=\beta$ minimizes .orange[(**1**)], then $b=\beta$ minimizes .turquoise[(**b**)].
---
layout: true
# Regression
## Regression and the *CEF*
---
Let's review our new(-ish) regression results
1. When the CEF is linear, the regression function *is* the CEF.
.hi-slate[Too small] Very specific circumstances—or big assumptions.
1. Regression gives us the best *linear* predictor of $\text{Y}_{i}$ (given $\text{X}_{i}$)
.hi-slate[Off point] We're often interested in $\beta$—not $\widehat{\text{Y}}_{i}$.
1. Regression provides the best *linear* approximation of the CEF.
.hi-slate[Just right?] (Depends on your goals)
---
Motivation (**3**) tends to be the most compelling.
Even when the CEF is not linear, regression recovers the best linear approximation to the CEF.
--
> The statement that .pink[regression approximates the CEF] lines up with our view of .purple[empirical work as an effort to describe the essential features of statistical relationships] without necessarily trying to pin them down exactly. (*MHE*, p.39, emphasis added)
---
layout: false
class: clear, middle
Let's dig into this linear approximation to the CEF a little more...
---
class: clear, middle, center
Returning to our **CEF**
```{r, fig_reg_cef, echo = F, cache = T}
# Estimate the relationship
cef_lm <- lm(y ~ x, data = cef_df)
# Find the regression points
lm_df <- tibble(
x = 8:22,
y = predict(object = cef_lm, newdata = data.frame(x = 8:22))
)
# Create the figure
gg_cef <- ggplot(data = cef_df, aes(x = y, y = x %>% as.factor())) +
geom_density_ridges(
rel_min_height = 0.003,
color = "grey85",
fill = NA,
scale = 2.5,
size = 0.3
) +
scale_x_continuous(
"Annual income",
labels = scales::dollar
) +
ylab("Years of education") +
scale_fill_viridis(option = "magma") +
theme_pander(base_family = "Fira Sans Book", base_size = 18) +
theme(
legend.position = "none"
) +
geom_path(
data = means_df,
aes(x = y, y = x %>% as.factor(), group = 1),
color = "grey20",
alpha = 0.85
) +
geom_point(
data = means_df,
aes(x = y, y = x %>% as.factor()),
color = "grey20",
shape = 16,
size = 3.5
) +
coord_flip()
# Plot it
gg_cef
```
---
class: clear, center, middle
Adding the population .hi-orange[regression function]
```{r, fig_reg_cef2, echo = F, cache = T}
# Figure
gg_cef +
geom_path(
data = lm_df,
aes(x = y, y = x %>% as.factor(), group = 1),
color = orange,
alpha = 0.5,
size = 2
)
```
---
layout: true
# Regression
## Regression and the *CEF*
---
name: wls
As the previous figure suggests, one way to think about least-squares regression is .hi-slate[estimating a weighted regression on the CEF] rather than on the individual observations.
--
.slate[.mono[TLDR]] Use $\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}$ as the outcome, rather than $\text{Y}_{i}$, and properly weight.
--
Suppose $\text{X}_{i}$ is discrete with pmf $\mathop{g_x}(u)$
$$
\begin{align}
\mathop{E}\!\bigg[ \left( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \right)^2 \bigg] = \sum_u \left( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} = u \right] - u'b \right)^2 \mathop{g_x}(u)
\end{align}
$$
_i.e._, $\beta$ can be expressed as a weighted least-squares regression of $\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} = u\right]$ on $u$ (the values of $\text{X}_{i}$), weighted by $\mathop{g_x}(u)$.
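---
.hi-slate[Quick check] Using the simulated income-education data from earlier: weighted least squares on the conditional means (one per value of $x$, weighted by the share of observations at that $x$) recovers the same coefficients as OLS on the microdata.
```{r, ex_wls_cef}
# Collapse the microdata to conditional means and cell counts
grouped <- cef_df %>%
  group_by(x) %>%
  summarize(y_bar = mean(y), n_x = n())
# OLS on the microdata
coef(lm(y ~ x, data = cef_df))
# WLS on the conditional means, weighted by cell size
coef(lm(y_bar ~ x, data = grouped, weights = n_x))
```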
---
We can also use LIE here
$\beta$
--
.pad-left[] $= \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right]$
--
.pad-left[] $= \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \mathop{E}\left( \text{Y}_{i}\mid \text{X}_{i} \right) \right]$
--
.hi-pink[Pro] Useful for aggregated data when microdata are sensitive/big.
--
.hi-pink[Con] You .hi-slate[will not] get the same standard errors.
---
layout: false
# Table of contents
.pull-left[
### Admin
.smaller[
1. [Schedule](#schedule)
]
]
.pull-right[
### Regression
.smaller[
1. [Why?](#why)
1. [The CEF](#cef)
- [Definition](#cef)
- [Graphically](#graphically)
- [Law of iterated expectations](#lie)
- [Decomposition](#decomposition)
- [Prediction](#prediction)
1. [Population least squares](#pop_ls)
1. [Anatomy](#anatomy)
1. [Regression-CEF theorem](#reg_cef_theorem)
1. [WLS](#wls)
]
]
---
exclude: true
```{r, generate pdfs, include = F, eval = F}
pagedown::chrome_print("03-why-regression.html", output = "03-why-regression.pdf")
pagedown::chrome_print("03-why-regression.html", output = "03-why-regression-nopause.pdf")
```