--- title: "Why Regression?" subtitle: "EC 425/525, Set 3" author: "Edward Rubin" date: "`r format(Sys.time(), '%d %B %Y')`" output: xaringan::moon_reader: css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css'] # self_contained: true nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- class: inverse, middle ```{R, setup, include = F} # devtools::install_github("dill/emoGG") library(pacman) p_load( broom, tidyverse, ggplot2, ggthemes, ggforce, ggridges, latex2exp, viridis, extrafont, gridExtra, kableExtra, snakecase, janitor, data.table, dplyr, estimatr, lubridate, knitr, parallel, lfe, here, magrittr ) # Define pink color red_pink <- "#e64173" turquoise <- "#20B2AA" orange <- "#FFA500" red <- "#fb6107" blue <- "#3b3b9a" green <- "#8bb174" grey_light <- "grey70" grey_mid <- "grey50" grey_dark <- "grey20" purple <- "#6A5ACD" slate <- "#314f4f" # Dark slate grey: #314f4f # Knitr options opts_chunk$set( comment = "#>", fig.align = "center", fig.height = 7, fig.width = 10.5, warning = F, message = F ) opts_chunk$set(dev = "svg") options(device = function(file, width, height) { svg(tempfile(), width = width, height = height) }) options(crayon.enabled = F) options(knitr.table.format = "html") # A blank theme for ggplot theme_empty <- theme_bw() + theme( line = element_blank(), rect = element_blank(), strip.text = element_blank(), axis.text = element_blank(), plot.title = element_blank(), axis.title = element_blank(), plot.margin = structure(c(0, 0, -0.5, -1), unit = "lines", valid.unit = 3L, class = "unit"), legend.position = "none" ) theme_simple <- theme_bw() + theme( line = element_blank(), panel.grid = element_blank(), rect = element_blank(), strip.text = element_blank(), axis.text.x = element_text(size = 18, family = "STIXGeneral"), axis.text.y = element_blank(), axis.ticks = element_blank(), plot.title = element_blank(), axis.title = element_blank(), # plot.margin = structure(c(0, 0, -1, -1), unit = "lines", valid.unit = 3L, class = "unit"), legend.position = "none" ) theme_axes_math <- theme_void() + theme( text = element_text(family = "MathJax_Math"), axis.title = element_text(size = 22), axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")), axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")), axis.line = element_line( color = "grey70", size = 0.25, arrow = arrow(angle = 30, length = unit(0.15, "inches") )), plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"), legend.position = "none" ) theme_axes_serif <- theme_void() + theme( text = element_text(family = "MathJax_Main"), axis.title = element_text(size = 22), axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")), axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")), axis.line = element_line( color = "grey70", size = 0.25, arrow = arrow(angle = 30, length = unit(0.15, "inches") )), plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"), legend.position = "none" ) theme_axes <- theme_void() + theme( text = element_text(family = "Fira Sans Book"), axis.title = element_text(size = 18), axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")), axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")), axis.line = element_line( color = grey_light, size = 0.25, arrow = arrow(angle = 30, length = unit(0.15, "inches") )), 
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"), legend.position = "none" ) theme_set(theme_gray(base_size = 20)) # Column names for regression results reg_columns <- c("Term", "Est.", "S.E.", "t stat.", "p-Value") # Function for formatting p values format_pvi <- function(pv) { return(ifelse( pv < 0.0001, "<0.0001", round(pv, 4) %>% format(scientific = F) )) } format_pv <- function(pvs) lapply(X = pvs, FUN = format_pvi) %>% unlist() # Tidy regression results table tidy_table <- function(x, terms, highlight_row = 1, highlight_color = "black", highlight_bold = T, digits = c(NA, 3, 3, 2, 5), title = NULL) { x %>% tidy() %>% select(1:5) %>% mutate( term = terms, p.value = p.value %>% format_pv() ) %>% kable( col.names = reg_columns, escape = F, digits = digits, caption = title ) %>% kable_styling(font_size = 20) %>% row_spec(1:nrow(tidy(x)), background = "white") %>% row_spec(highlight_row, bold = highlight_bold, color = highlight_color) } ``` # Prologue --- name: schedule # Schedule ### Last time - The Experimental Ideal - Fundamentals of .mono[R] (wrap up Lab 1). ### Today What's so great about linear regression and OLS?
.hi-slate[Read] *MHE* 3.1 ### Upcoming .hi-slate[Assignment] [First step of project proposal due April 15.super[th]](https://github.com/edrubin/EC525S19/#project). --- name: return # Follow up ## `return()` 1. `function()` automatically returns the last evaluated value—regardless of `return()`. 2. Hadley Wickham.pink[†] suggests reserving `return` for ["early" returns](http://adv-r.had.co.nz/Functions.html#return-values). .footnote[.pink[†] An .mono[R] god] --- layout: true # Regression --- class: inverse, middle --- name: why ## Why? In our previous discussion, we began moving from simple differences to a regression framework. -- .hi-slate[Q] Why do we.pink[†] care so much about linear regression and OLS? .footnote[.pink[†] *we* = empirically inclined applied economists] -- .hi-slate[A] As we discussed, regression allows us to control for covariates that *can* assist with (.slate[1]) causal identification and (.slate[2]) inference. -- There's a deeper reason that we care about *linear* regression and ordinary least squares (OLS): .hi-pink[*the conditional expectation function (CEF).*] --- ## Why? Even ignoring causality, we can show important relationships between 1. .hi-pink[the CEF] (the conditional expectation function), 2. the .hi-purple[population regression function], 3. and the .hi-slate[sampling distribution of regression estimates]. --- layout: true # Regression ## The *CEF* --- name: cef .hi-slate[Definition] The .hi[conditional expectation function] for a dependent variable $\text{Y}_{i}$, given a $\text{K}\times 1$ vector of covariates $\text{X}_{i}$, tells us .pink[the expected value (population average) of] $\color{#e64173}{\text{Y}_{i}}$ .pink[with] $\color{#e64173}{\text{X}_{i}}$ .pink[held constant.] -- Written as $\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]$, the CEF is a function of $\text{X}_{i}$..pink[†] .footnote[ .pink[†] We'll generally assume $\text{X}_{i}$ is a random variable, which implies that $\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]$ is also a random variable. ] -- .hi-slate[Examples] - $\mathop{E}\left[ \text{Income}_i \mid \text{Education}_i \right]$ -- - $\mathop{E}\left[ \text{Wage}_i \mid \text{Gender}_i \right]$ -- - $\mathop{E}\left[ \text{Birth weight}_i \mid \text{Air quality}_i \right]$ --- Formally, for continuous $\text{Y}_{i}$ with conditional density $f_y(t|\text{X}_{i}=x)$, $$ \begin{align} \mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} = x \right] = \int t f_y(t|\text{X}_{i}=x)dt \end{align} $$ -- and for discrete $\text{Y}_{i}$ with conditional p.m.f. $\mathop{\text{Pr}}\left(\text{Y}_{i}=t|\text{X}_{i}=x\right)$, $$ \begin{align} \mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i}=x \right] = \sum_t t \mathop{\text{Pr}}\left(\text{Y}_{i}=t|\text{X}_{i}=x\right) \end{align} $$ .hi-slate[*Notice*] We are focusing on the .hi-pink[population]. -- We want to build our intuition about the parameters that we will eventually estimate. --- layout: false class: clear, middle Graphically... --- class: clear, center, middle name: graphically The conditional distributions of $\text{Y}_{i}$ for $\text{X}_{i}=x$ in 8, ..., 22. 
```{R, data_cef, echo = F, cache = T} # Set seed set.seed(12345) # Sample size n <- 1e4 # Generate extra disturbances u <- sample(-2:2, size = 22, replace = T) * 1e3 # Generate data cef_df <- tibble( x = sample(x = seq(8, 22, 1), size = n, replace = T), y = 15000 + 3000 * x + 1e3 * (x %% 3) + 500 * (x %% 2) + rnorm(n, sd = 1e4) + u[x] ) %>% mutate(x = round(x)) %>% filter(y > 0) # Means means_df <- cef_df %>% group_by(x) %>% summarize(y = mean(y)) # The CEF in ggplot gg_cef <- ggplot(data = cef_df, aes(x = y, y = x %>% as.factor())) + geom_density_ridges_gradient( aes(fill = ..x..), rel_min_height = 0.003, color = "white", scale = 2.5, size = 0.3 ) + scale_x_continuous( "Annual income", labels = scales::dollar ) + ylab("Years of education") + scale_fill_viridis(option = "magma") + theme_pander(base_family = "Fira Sans Book", base_size = 18) + theme( legend.position = "none" ) + coord_flip() ``` ```{R, fig_cef_dist, echo = F, cache = T} gg_cef ``` --- class: clear, middle, center The CEF, $\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]$, connects these conditional distributions' means. ```{R, fig_cef, echo = F, cache = T} gg_cef + geom_path( data = means_df, aes(x = y, y = x %>% as.factor(), group = 1), color = "white", alpha = 0.85 ) + geom_point( data = means_df, aes(x = y, y = x %>% as.factor()), color = "white", shape = 16, size = 3.5 ) ``` --- class: clear, middle, center Focusing in on the CEF, $\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]$... ```{R, fig_cef_only, echo = F, cache = T} ggplot(data = cef_df, aes(x = y, y = x %>% as.factor())) + geom_density_ridges( rel_min_height = 0.003, color = "grey85", fill = NA, scale = 2.5, size = 0.3 ) + scale_x_continuous( "Annual income", labels = scales::dollar ) + ylab("Years of education") + scale_fill_viridis(option = "magma") + theme_pander(base_family = "Fira Sans Book", base_size = 18) + theme( legend.position = "none" ) + geom_path( data = means_df, aes(x = y, y = x %>% as.factor(), group = 1), color = "grey20", alpha = 0.85 ) + geom_point( data = means_df, aes(x = y, y = x %>% as.factor()), color = "grey20", shape = 16, size = 3.5 ) + coord_flip() ``` --- class: clear, middle .hi-slate[Q] How does the CEF relate to/inform regression? --- layout: true # Regression ## The *CEF* --- name: lie As we derive the properties and relationships associated with the CEF, regression, and a host of other estimators, we will frequently rely upon
.hi-slate[*the Law of Iterated Expectations*] (LIE). -- $$ \begin{align} \color{#6A5ACD}{\mathop{E}\left[ \text{Y}_{i} \right]} = \mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg) \end{align} $$ -- which says that the .hi-purple[unconditional expectation] is equal to the .b[unconditional average] of the .hi-pink[conditional expectation function]. --- layout: false # Regression .hi-slate[A proof of the LIE] First, we need notation... Let $\mathop{f_{x,y}}(u,t)$ denote the joint density for continuous RVs $\left( \text{X}_{i},\text{Y}_{i} \right)$. Let $\mathop{f_{y|x}}(t\mid \text{X}_{i}=u)$ denote the conditional distribution of $\text{Y}_{i}$ given $\text{X}_{i}=u$. And let $\mathop{g_y}(t)$ and $\mathop{g_x}(u)$ denote the marginal densities of $\text{Y}_{i}$ and $\text{X}_{i}$. --- # Regression .hi-slate[A proof of the LIE] $\mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg)$ --
  $= {\displaystyle\int} \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} = u \right]} \mathop{g_x}(u) du$ --
  $={\displaystyle\int} \color{#e64173}{\left[{\displaystyle\int} t \mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right) dt\right]} \mathop{g_x}(u) du$ --
  $={\displaystyle\int} {\displaystyle\int} \color{#e64173}{t \mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right)} \mathop{g_x}(u) du \, \color{#e64173}{dt}$ --
  $={\displaystyle\int} \color{#e64173}{t} \left[ {\displaystyle\int} \color{#e64173}{\mathop{f_{y|x}}\left( t\mid \text{X}_{i}=u \right)} \mathop{g_x}(u) du \right] \color{#e64173}{dt}$ --
  $={\displaystyle\int} \color{#e64173}{t} \left[ {\displaystyle\int} \mathop{f_{x,y}}(u,t) du \right] \color{#e64173}{dt}$ --
  $={\displaystyle\int} \color{#e64173}{t} \mathop{g_y(t)} \color{#e64173}{dt}$ --
  $=\mathop{E}\left[ \text{Y}_{i} \right]$ --  .bigger[🥳] --- layout: false class: clear, middle Great. What's the point? --- layout: true # Regression ## The *LIE* and the *CEF* --- name: decomposition .hi-slate[Theorem] The CEF decomposition property (3.1.1) The LIE allows us to **decompose random variables** into two pieces -- $$ \begin{align} \text{Y}_{i} = \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} + \color{#6A5ACD}{\varepsilon_i} \end{align} $$ 1. .hi-pink[the conditional expectation function] 2. .hi-purple[a residual] with special powers.pink[†]
 i.  $\color{#6A5ACD}{\varepsilon_i}$ is mean independent of $\text{X}_{i}$, _i.e._, $\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] = 0$.
 ii.  $\color{#6A5ACD}{\varepsilon_i}$ is uncorrelated with any function of $\text{X}_{i}$. .footnote[.pink[†] Angrist and Pischke go with *special properties*.] -- .hi-orange[*Important*] It might not seem like much, but these results are .hi-orange[huge] for building intuition, theory, *and* application. -- Put a ⭐ here! --- .hi-slate[Proof] The CEF decomposition property (properties i. and ii. of $\color{#6A5ACD}{\varepsilon_i}$) -- .pull-left[ .hi-slate[Mean independence], $\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] = 0$ $$ \begin{align} &\mathop{E}\left[ \color{#6A5ACD}{\varepsilon_i} \mid \text{X}_{i} \right] \\[0.6em] &= \mathop{E}\!\bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i}\mid \text{X}_{i} \right]} \bigg| \text{X}_{i} \bigg) \\[0.6em] &= \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{E}\!\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg| \text{X}_{i} \bigg) \\[0.6em] &= \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \\[0.6em] &= 0 \end{align} $$ ] -- .pull-right[ .hi-slate[Zero correlation] btn. $\color{#6A5ACD}{\varepsilon_i}$ and $\mathop{h}\left( \text{X}_{i} \right)$ $$ \begin{align} &\mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \color{#6A5ACD}{\varepsilon_i}\right] \\[0.6em] &= \mathop{E}\!\bigg( \mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \color{#6A5ACD}{\varepsilon_i}\mid \text{X}_{i} \right]\bigg) \\[0.6em] &= \mathop{E}\!\bigg( \mathop{h}\left( \text{X}_{i} \right) \mathop{E}\left[\color{#6A5ACD}{\varepsilon_i}\mid \text{X}_{i} \right]\bigg) \\[0.6em] &= \mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right) \times 0\right] \\[0.6em] &= 0 \end{align} $$ ] --- .hi-slate[The CEF decomposition property]
says that we can decompose any random variable (_e.g._, $\text{Y}_{i}$) into 1. a part that is .pink[explained by] $\color{#e64173}{\text{X}_{i}}$ (_i.e._, the CEF $\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}$), 2. a part that is .purple[*orthogonal to*.purple[†] any function of] $\color{#6A5ACD}{\text{X}_{i}}$ (_i.e._, $\color{#6A5ACD}{\varepsilon_i}$). .footnote[.purple[†] "orthogonal to" = "uncorrelated with"] -- .hi-slate[Why the CEF?]
The .pink[CEF] also presents an intuitive summary of the relationship between $\text{Y}_{i}$ and $\text{X}_{i}$, since we often use means to characterize random variables. -- But (of course) there are more reasons to use the CEF... --- name: prediction .hi-slate[Theorem] The CEF prediction property (3.1.2) Let $\mathop{m}\left( \text{X}_{i} \right)$ be *any* function of $\text{X}_{i}$. The CEF solves $$ \begin{align} \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \underset{\mathop{m}\left( \text{X}_{i} \right)}{\text{arg min}}\enspace \mathop{E}\left[ \left( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \right)^2 \right] \end{align} $$ In other words, the .hi-pink[CEF] is the minimum mean-squared error (MMSE) predictor of $\text{Y}_{i}$ given $\text{X}_{i}$. -- .hi-slate[*Notice*] 1. We haven't restricted $m$ to any class of functions—it can be nonlinear. 2. We're talking about *prediction* (specifically predicting $\text{Y}_{i}$). --- layout: false class: clear .hi-slate[Proof] The CEF prediction property $\bigg( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \bigg)^2$ .right10[.orange[(**1**)]] --
  $= \bigg( \big\{ \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \big\} + \big\{ \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{m}\left( \text{X}_{i} \right) \big\} \bigg)^2$ --
  $= \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)^2$  .right10[.turquoise[(**a**)]]
   $+ 2 \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] - \mathop{m}\left( \text{X}_{i} \right)}\bigg)\times \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)$  .right10[.turquoise[(**b**)]]
   $+ \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \mathop{m}\left( \text{X}_{i} \right) \bigg)^2$  .right10[.turquoise[(**c**)]] -- .hi-slate[Recall:] We want to choose the $\mathop{m}\left( \text{X}_{i} \right)$ that minimizes .orange[(**1**)] in expectation. --
 .turquoise[(**a**)] is irrelevant, _i.e._, it does not depend upon $\mathop{m}\left( \text{X}_{i} \right)$. --
 .turquoise[(**b**)] equals zero in expectation: $\mathop{E}\left[ \mathop{h}\left( \text{X}_{i} \right)\times \color{#6A5ACD}{\varepsilon_i} \right] = 0$. --
 .turquoise[(**c**)] is minimized by $\mathop{m}\left( \text{X}_{i} \right) = \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}$, _i.e._, when $\mathop{m}\left( \text{X}_{i} \right)$ is the .pink[CEF]. --- layout: true # Regression ## The *LIE* and the *CEF* --- ∴ the .pink[CEF] is the function that minimizes the mean-squared error (MSE) $$ \begin{align} \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \underset{\mathop{m}\left( \text{X}_{i} \right)}{\text{arg min}}\enspace \mathop{E}\left[ \left( \text{Y}_{i} - \mathop{m}\left( \text{X}_{i} \right) \right)^2 \right] \end{align} $$ --- One final property of the .pink[CEF] (very similar to the decomposition property) .hi-slate[Theorem] The ANOVA theorem (3.1.3) $$ \begin{align} \mathop{\text{Var}} \left( \text{Y}_{i} \right) = \mathop{\text{Var}} \left( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right) + \mathop{E}\left[ \mathop{\text{Var}} \left( \text{Y}_{i} \mid \text{X}_{i} \right) \right] \end{align} $$ which says that we can decompose the variance in $\text{Y}_{i}$ into 1. the variance in the .pink[CEF] 2. the variance of the residual -- .hi-slate[*Example*] Decomposing wage variation into (.hi-slate[1]) variation explained by workers' characteristics and (.hi-slate[2]) unexplained (residual) variation -- The proof centers on the decomposition property of the CEF. --- layout: false class: clear, middle We now understand the CEF a bit better.
But how does the CEF actually relate to regression? --- layout: true # Regression ## The *CEF* and regression --- We've discussed how the .pink[CEF] summarizes empirical relationships. *Previously* we discussed how regression provides simple empirical insights. Let's link these two concepts. --- name: pop_ls .hi-slate[Population least-squares regression] We will focus on $\beta$, the vector (a $K\times 1$ matrix) of population, least-squares regression coefficients, _i.e._, $$ \begin{align} \beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\left[ \left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \right] \end{align} $$ where $b$ and $\text{X}_{i}$ are also $K\times 1$, and $\text{Y}_{i}$ is a scalar. -- Taking the first-order condition gives $$ \begin{align} \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'b \right) \right] = 0 \end{align} $$ --- From the first-order condition $$ \begin{align} \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'b \right) \right] = 0 \end{align} $$ we can solve for $b$. We've defined the optimum as $\beta$. Thus, $$ \begin{align} \beta = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \text{Y}_{i} \right] \end{align} $$ -- .hi-slate[*Note*] The first-order conditions tell us that our least-squares population regression residuals $\left(e_i = \text{Y}_{i} - \text{X}_{i}'\beta \right)$ are uncorrelated with $\text{X}_i$. --- layout: true # Regression ## Anatomy --- name: anatomy Our "new" result: $\beta = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \text{Y}_{i} \right]$ In .hi-slate[simple linear regression] (an intercept and one regressor $x_i$), $$ \begin{align} \beta_1 &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, x_i \right)}{\mathop{\text{Var}} \left( x_i \right)} & \beta_0 = \mathop{E}\left[ \text{Y}_{i} \right] - \beta_1 \mathop{E}\left[ x_i \right] \end{align} $$ -- For .hi-slate[multivariate regression], the coefficient on the k.super[th] regressor $x_{ki}$ is $$ \begin{align} \beta_k &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)} \end{align} $$ where $\widetilde{x}_{ki}$ is the residual from a regression of $x_{ki}$ on all other covariates. --- This alternative formulation of least-squares coefficients is quite powerful. $$ \begin{align} \beta_k &= \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)} \end{align} $$ -- .hi-slate[Why?] -- This expression illustrates how each coefficient in a least-squares regression represents the bivariate slope coefficient .pink[after controlling for the other covariates]. --- In fact, we can re-write our coefficients to further emphasize this point $$ \begin{align} \beta_k &= \dfrac{\mathop{\text{Cov}} \left( \widetilde{\text{Y}}_{i},\, \widetilde{x}_{ki} \right)}{\mathop{\text{Var}} \left( \widetilde{x}_{ki} \right)} \end{align} $$ $\widetilde{\text{Y}}_{i}$ denotes the residual from regressing $\text{Y}_{i}$ on all regressors except $x_{ki}$. 
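---

.hi-slate[A quick numerical check] Below is a minimal .mono[R] sketch of this anatomy result. The simulated data and the names (`anatomy_df`, `x1`, `x2`, `x1_tilde`) are purely illustrative.

```{R, anatomy_check, echo = T, cache = T}
# Simulate a small, illustrative dataset
set.seed(123)
anatomy_df <- tibble(
  x2 = rnorm(1e3),
  x1 = 0.5 * x2 + rnorm(1e3),
  y  = 1 + 2 * x1 - 3 * x2 + rnorm(1e3)
)
# Coefficient on x1 from the full (multivariate) regression
coef(lm(y ~ x1 + x2, data = anatomy_df))["x1"]
# Residualize x1 on the other covariate; then apply the anatomy formula
anatomy_df$x1_tilde <- residuals(lm(x1 ~ x2, data = anatomy_df))
with(anatomy_df, cov(y, x1_tilde) / var(x1_tilde))
```

The two numbers match: regression anatomy at work.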
--- layout: false class: clear, middle Graphical example --- class: clear, middle, center ```{R, data_anatomy, cache = F, include = F} n <- 1e2 set.seed(1234) gen_df <- tibble( x2 = sample(x = c(F, T), size = n, replace = T), x1 = runif(n = n, min = -3, max = 3) + x2 - 0.5, u = rnorm(n = n, mean = 0, sd = 1), y = -3 + x1 + x2 * 4 + u ) ya_mean <- filter(gen_df, x2 == F)$y %>% mean() yb_mean <- filter(gen_df, x2 == T)$y %>% mean() x1a_mean <- filter(gen_df, x2 == F)$x1 %>% mean() x1b_mean <- filter(gen_df, x2 == T)$x1 %>% mean() gen_df %<>% mutate( y_dm = y - ya_mean * (x2 == F) - yb_mean * (x2 == T), x1_dm = x1 - x1a_mean * (x2 == F) - x1b_mean * (x2 == T) ) ``` $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$ ```{R, fig_anatomy1, echo = F} ggplot(data = gen_df, aes(y = y, x = x1, color = x2, shape = x2)) + geom_hline(yintercept = 0, color = "grey85") + geom_vline(xintercept = 0, color = "grey85") + geom_point(size = 3) + annotate("text", x = -0.075, y = 6.75, label = TeX("$y$"), size = 8) + annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1$"), size = 8) + ylim(c(-7, 7)) + xlim(c(-3.5, 3.5)) + theme_empty + scale_color_manual( expression(x[2]), values = c("darkslategrey", red_pink), labels = c("A", "B") ) + scale_shape_manual( expression(x[2]), values = c(1, 19), labels = c("A", "B") ) + theme( legend.position = "bottom", text = element_text(size = 22) ) ``` --- class: clear, middle, center count: false $\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$ ```{R, fig_anatomy2, echo = F} ggplot(data = gen_df, aes(y = y, x = x1, color = x2, shape = x2)) + geom_hline(yintercept = ya_mean, color = "darkslategrey", alpha = 0.5, size = 2) + geom_hline(yintercept = yb_mean, color = red_pink, alpha = 0.5, size = 2) + geom_hline(yintercept = 0, color = "grey85") + geom_vline(xintercept = 0, color = "grey85") + geom_point(size = 3) + annotate("text", x = -0.075, y = 6.75, label = TeX("$y$"), size = 8) + annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1$"), size = 8) + ylim(c(-7, 7)) + xlim(c(-3.5, 3.5)) + theme_empty + scale_color_manual( expression(x[2]), values = c("darkslategrey", red_pink), labels = c("A", "B") ) + scale_shape_manual( expression(x[2]), values = c(1, 19), labels = c("A", "B") ) + theme( legend.position = "bottom", text = element_text(size = 22) ) ``` --- class: clear, middle, center count: false $\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$ ```{R, fig_anatomy3, echo = F} ggplot(data = gen_df %>% mutate(y = y - 4 * x2), aes(y = y_dm, x = x1)) + geom_hline(yintercept = 0, color = "grey85") + geom_vline(xintercept = 0, color = "grey85") + geom_point(size = 3, aes(color = x2, shape = x2)) + annotate("text", x = -0.075, y = 6.75, label = TeX("$y|x_2$"), size = 8, hjust = 1) + annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1$"), size = 8) + ylim(c(-7, 7)) + xlim(c(-3.5, 3.5)) + theme_empty + scale_color_manual( expression(x[2]), values = c("darkslategrey", red_pink), labels = c("A", "B") ) + scale_shape_manual( expression(x[2]), values = c(1, 19), labels = c("A", "B") ) + theme( legend.position = "bottom", text = element_text(size = 22) ) ``` --- class: clear, middle, center count: false $\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$ ```{R, fig_anatomy4, echo = F} ggplot(data = gen_df %>% mutate(y = y - 4 * x2), aes(y = y_dm, x = x1)) + geom_hline(yintercept = 0, color = "grey85") + geom_vline(xintercept = 0, color = "grey85") + geom_vline(xintercept 
= x1a_mean, color = "darkslategrey", alpha = 0.5, size = 2) + geom_vline(xintercept = x1b_mean, color = red_pink, alpha = 0.5, size = 2) + geom_point(size = 3, aes(color = x2, shape = x2)) + annotate("text", x = -0.075, y = 6.75, label = TeX("$y|x_2$"), size = 8, hjust = 1) + annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1$"), size = 8) + ylim(c(-7, 7)) + xlim(c(-3.5, 3.5)) + theme_empty + scale_color_manual( expression(x[2]), values = c("darkslategrey", red_pink), labels = c("A", "B") ) + scale_shape_manual( expression(x[2]), values = c(1, 19), labels = c("A", "B") ) + theme( legend.position = "bottom", text = element_text(size = 22) ) ``` --- class: clear, middle, center count: false $\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$ ```{R, fig_anatomy5, echo = F} ggplot(data = gen_df %>% mutate(y = y - 4 * x2), aes(y = y_dm, x = x1_dm)) + geom_hline(yintercept = 0, color = "grey85") + geom_vline(xintercept = 0, color = "grey85") + geom_point(size = 3, aes(color = x2, shape = x2)) + annotate("text", x = -0.075, y = 6.75, label = TeX("$y|x_2$"), size = 8, hjust = 1) + annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1|x_2$"), size = 8, hjust = 1) + ylim(c(-7, 7)) + xlim(c(-3.5, 3.5)) + theme_empty + scale_color_manual( expression(x[2]), values = c("darkslategrey", red_pink), labels = c("A", "B") ) + scale_shape_manual( expression(x[2]), values = c(1, 19), labels = c("A", "B") ) + theme( legend.position = "bottom", text = element_text(size = 22) ) ``` --- class: clear, middle, center count: false $\beta_1$ gives the relationship between $y$ and $x_1$ *after controlling for* $x_2$ ```{R, fig_anatomy6, echo = F} ggplot(data = gen_df %>% mutate(y = y - 4 * x2), aes(y = y_dm, x = x1_dm)) + geom_hline(yintercept = 0, color = "grey85") + geom_vline(xintercept = 0, color = "grey85") + geom_smooth(method = lm, se = F, color = purple, alpha = 1, size = 2) + geom_point(size = 3, aes(color = x2, shape = x2)) + annotate("text", x = -0.075, y = 6.75, label = TeX("$y|x_2$"), size = 8, hjust = 1) + annotate("text", x = 2.95, y = 0.3, label = TeX("$x_1|x_2$"), size = 8, hjust = 1) + ylim(c(-7, 7)) + xlim(c(-3.5, 3.5)) + theme_empty + scale_color_manual( expression(x[2]), values = c("darkslategrey", red_pink), labels = c("A", "B") ) + scale_shape_manual( expression(x[2]), values = c(1, 19), labels = c("A", "B") ) + theme( legend.position = "bottom", text = element_text(size = 22) ) ``` --- layout: false class: clear, middle Now that we've refreshed/deepened our regression knowledge, let's connect regression and the CEF. --- layout: true # Regression ## Regression and the *CEF* --- Angrist and Pischke make the case that > ... you should be interested in regression parameters if you are interested in the CEF. (*MHE*, p.36) .hi-slate[Q] What is the reasoning/connection? -- .hi-slate[A] We'll cover three reasons. 1. *If the CEF is linear*, then the population regression line is the CEF. 2. The function $\text{X}_{i}' \beta$ is the min. MSE *linear* predictor of $\text{Y}_{i}$ given $\text{X}_{i}$. 3. The function $\text{X}_{i}' \beta$ gives the min. MSE *linear* approximation to the CEF. --- layout: true # Regression ## Regression and the *CEF* --- .hi-slate[Theorem] The linear CEF theorem (3.1.4) If the CEF is linear, then the population regression is the CEF. -- .hi-slate[Proof] Let the CEF equal some linear function, _i.e._, $\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} = \text{X}_{i}' \beta^\star$. 
From the CEF decomposition property, we know $\mathop{E}\left[ \text{X}_{i} \color{#6A5ACD}{\varepsilon_i} \right] = 0$. -- $\implies \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right) \right] = 0$ -- $\implies \mathop{E}\left[ \text{X}_{i} \left( \text{Y}_{i} - \text{X}_{i}'\beta^\star \right) \right] = 0$ -- $\implies \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right] - \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \beta^\star \right] = 0$ -- $\implies \beta^\star = \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right]$ -- $=\beta$, our population regression coefficients. --- >.hi-slate[Theorem] The linear CEF theorem (3.1.4) > >If the CEF is linear, then the population regression is the CEF. Linearity can be a strong assumption. When might we expect linearity? -- 1. Situations in which $\left( \text{Y}_{i},\, \text{X}_{i} \right)$ follows a multivariate normal distribution.
.hi-slate[*Concern*] Might be limited—especially when $\text{Y}_{i}$ or $\text{X}_{i}$ are not continuous. -- 2. Saturated regression models
.hi-slate[*Example*] A model with two binary indicators and their interaction. --- .hi-slate[Theorem] The best linear predictor theorem (3.1.5) The function $\text{X}_{i}' \beta$ is the best linear predictor of $\text{Y}_{i}$ given $\text{X}_{i}$ (minimizes MSE). -- .hi-slate[Proof] We defined $\beta$ as the vector that minimizes MSE, _i.e._, $$ \begin{align} \beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\left[ \left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \right] \end{align} $$ so $\text{X}_{i}'\beta$ is literally defined as the minimum MSE linear predictor of $\text{Y}_{i}$. -- - The population-regression function $\left(\text{X}_{i}'\beta\right)$ is the best (min. MSE) *linear* predictor of $\text{Y}_{i}$ given $\text{X}_{i}$. - The CEF $\left( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] \right)$ is the best predictor (min. MSE) of $\text{Y}_{i}$ given $\text{X}_{i}$ across *all classes* of functions. --- .hi-slate[Q] If $\text{X}_{i}'\beta$ is .hi[the best linear predictor] of $\text{Y}_{i}$ given $\text{X}_{i}$, then why is there so much interest in machine learning for prediction (as opposed to regression)? -- .hi-slate[A] A few reasons 1. Relax *linearity* 2. Model selection - choosing $\text{X}_{i}$ is not always obvious - overfitting is bad 3. It's fancy/shiny and new 4. Some ML methods boil down to regression 5. Others? -- .hi-slate[Counter Q] Why are we (still) using regression? --- name: reg_cef_theorem .hi-slate[Theorem] The regression CEF theorem (3.1.6) The population regression function $\text{X}_{i}'\beta$ provides the minimum MSE linear approximation to the CEF $\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]$, _i.e._, $$ \begin{align} \beta = \underset{b}{\text{arg min}}\thinspace \mathop{E}\!\left\{ \bigg( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right] - \text{X}_{i}'b \bigg)^2 \right\} \end{align} $$ -- .hi-slate[*Put simply*] Regression gives us the *best* linear approximation to the CEF. --- layout: false class: clear .hi-slate[Proof] First, recall that, .orange[in expectation], $\beta$ is the $b$ that minimizes $\left( \text{Y}_{i} - \text{X}_{i}'b \right)^2$ -- $\left( \text{Y}_{i} - \text{X}_{i}'b \right)^2 \color{#ffffff}{\bigg|}$ .right10[.orange[(**1**)]] --
  $= \bigg( \left\{ \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \right\} + \left\{ \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \right\} \bigg)^2$ --
  $= \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg)^2$ .right10[.turquoise[(**a**)]]
  $\color{#ffffff}{=}+\bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \bigg)^2$ .right10[.turquoise[(**b**)]]
  $\color{#ffffff}{=}+ 2 \bigg( \text{Y}_{i} - \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} \bigg) \bigg( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \bigg)$ .right10[.turquoise[(**c**)]] -- We want to minimize .turquoise[(**b**)], and we know $\beta$ minimizes .orange[(**1**)]. --
 .turquoise[(**a**)] is irrelevant, _i.e._, it does not depend upon $b$. --
 .turquoise[(**c**)] can be written as $2 \color{#6A5ACD}{\varepsilon_i} \mathop{h}(\text{X}_{i})$, which equals zero in expectation. -- ∴ (In expectation) If $b=\beta$ minimizes .orange[(**1**)], then $b=\beta$ minimizes .turquoise[(**b**)]. --- layout: true # Regression ## Regression and the *CEF* --- Let's review our new(-ish) regression results 1. When the CEF is linear, the regression function *is* the CEF.
.hi-slate[Too small] Very specific circumstances—or big assumptions. 1. Regression gives us the best *linear* predictor of $\text{Y}_{i}$ (given $\text{X}_{i}$)
.hi-slate[Off point] We're often interested in $\beta$—not $\widehat{\text{Y}}_{i}$. 1. Regression provides the best *linear* approximation of the CEF.
.hi-slate[Just right?] (Depends on your goals) --- Motivation (**3**) tends to be the most compelling. Even when the CEF is not linear, regression recovers the best linear approximation to the CEF. -- > The statement that .pink[regression approximates the CEF] lines up with our view of .purple[empirical work as an effort to describe the essential features of statistical relationships] without necessarily trying to pin them down exactly. (*MHE*, p.39, emphasis added) --- layout: false class: clear, middle Let's dig into this linear approximation to the CEF a little more... --- class: clear, middle, center Returning to our **CEF** ```{R, fig_reg_cef, echo = F, cache = T} # Estimate the relationship cef_lm <- lm(y ~ x, data = cef_df) # Find the regression points lm_df <- tibble( x = 8:22, y = predict(object = cef_lm, newdata = data.frame(x = 8:22)) ) # Create the figure gg_cef <- ggplot(data = cef_df, aes(x = y, y = x %>% as.factor())) + geom_density_ridges( rel_min_height = 0.003, color = "grey85", fill = NA, scale = 2.5, size = 0.3 ) + scale_x_continuous( "Annual income", labels = scales::dollar ) + ylab("Years of education") + scale_fill_viridis(option = "magma") + theme_pander(base_family = "Fira Sans Book", base_size = 18) + theme( legend.position = "none" ) + geom_path( data = means_df, aes(x = y, y = x %>% as.factor(), group = 1), color = "grey20", alpha = 0.85 ) + geom_point( data = means_df, aes(x = y, y = x %>% as.factor()), color = "grey20", shape = 16, size = 3.5 ) + coord_flip() # Plot it gg_cef ``` --- class: clear, center, middle Adding the population .hi-orange[regression function] ```{R, fig_reg_cef2, echo = F, cache = T} # Figure gg_cef + geom_path( data = lm_df, aes(x = y, y = x %>% as.factor(), group = 1), color = orange, alpha = 0.5, size = 2 ) ``` --- layout: true # Regression ## Regression and the *CEF* --- name: wls As the previous figure suggests, one way to think about least-squares regression is .hi-slate[estimating a weighted regression on the CEF] rather than the individual observations. -- .slate[.mono[TLDR]] Use $\color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]}$ as the outcome, rather than $\text{Y}_{i}$, and properly weight. -- Suppose $\text{X}_{i}$ is discrete with pmf $\mathop{g_x}(u)$ $$ \begin{align} \mathop{E}\!\bigg[ \left( \color{#e64173}{\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} \right]} - \text{X}_{i}'b \right)^2 \bigg] = \sum_u \left( \mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} = u \right] - u'b \right)^2 \mathop{g_x}(u) \end{align} $$ _i.e._, $\beta$ can be expressed as a weighted least-squares regression of $\mathop{E}\left[ \text{Y}_{i} \mid \text{X}_{i} = u\right]$ on $u$ (the values of $\text{X}_{i}$) weighted by $\mathop{g_x}(u)$. --- We can also use LIE here $\beta$ --
.pad-left[] $= \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i}\text{Y}_{i} \right]$ --
.pad-left[] $= \mathop{E}\left[ \text{X}_{i} \text{X}_{i}' \right]^{-1} \mathop{E}\left[ \text{X}_{i} \mathop{E}\left( \text{Y}_{i}\mid \text{X}_{i} \right) \right]$ -- .hi-pink[Pro] Useful for aggregated data when microdata are sensitive/big. -- .hi-pink[Con] You .hi-slate[will not] get the same standard errors. -- .hi-slate[*Aside*] A brief numerical sketch of this grouped-data result appears at the end of the deck. --- layout: false # Table of contents .pull-left[ ### Admin .smaller[ 1. [Schedule](#schedule) 1. [`return`](#return) ] ] .pull-right[ ### Regression .smaller[ 1. [Why?](#why) 1. [The CEF](#cef) - [Definition](#cef) - [Graphically](#graphically) - [Law of iterated expectations](#lie) - [Decomposition](#decomposition) - [Prediction](#prediction) 1. [Population least squares](#pop_ls) 1. [Anatomy](#anatomy) 1. [Regression-CEF theorem](#reg_cef_theorem) 1. [WLS](#wls) ] ] --- exclude: true ```{R, generate pdfs, include = F, eval = T} source("../../ScriptsR/unpause.R") unpause("03WhyRegression.Rmd", ".", T, T) ```
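---
layout: false
class: clear

.hi-slate[Sketch] A minimal .mono[R] illustration of the grouped-data point from the [WLS slide](#wls). The simulated data and names (`micro_df`, `grouped_df`, `n_u`) are purely illustrative.

```{R, wls_sketch, echo = T, cache = T}
# Simulate illustrative microdata with a discrete regressor
set.seed(123)
micro_df <- tibble(
  x = sample(8:22, size = 1e4, replace = T),
  y = 10 + 3 * x + rnorm(1e4, sd = 5)
)
# OLS on the microdata
coef(lm(y ~ x, data = micro_df))
# Collapse to conditional means (the CEF at each value of x) and cell counts
grouped_df <- micro_df %>% group_by(x) %>% summarize(y_bar = mean(y), n_u = n())
# Weighted least squares of the conditional means on x, weighted by cell counts
coef(lm(y_bar ~ x, data = grouped_df, weights = n_u))
```

The coefficients match the microdata regression; the standard errors would not.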