---
title: "Controls"
subtitle: "EC 607, Set 06"
author: "Edward Rubin"
date: "Spring 2021"
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: inverse, middle

```{r, setup, include = F}
# devtools::install_github("dill/emoGG")
library(pacman)
p_load(
  broom, tidyverse,
  ggplot2, ggthemes, ggforce, ggridges,
  latex2exp, viridis, extrafont, gridExtra,
  kableExtra, snakecase, janitor,
  data.table, dplyr, lubridate, knitr,
  estimatr, here, magrittr
)
# Define pink color
red_pink <- "#e64173"
turquoise <- "#20B2AA"
orange <- "#FFA500"
red <- "#fb6107"
blue <- "#3b3b9a"
green <- "#8bb174"
grey_light <- "grey70"
grey_mid <- "grey50"
grey_dark <- "grey20"
purple <- "#6A5ACD"
slate <- "#314f4f"
# Dark slate grey: #314f4f
# Knitr options
opts_chunk$set(
  comment = "#>",
  fig.align = "center",
  fig.height = 7,
  fig.width = 10.5,
  warning = F,
  message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
  svg(tempfile(), width = width, height = height)
})
options(crayon.enabled = F)
options(knitr.table.format = "html")
# A blank theme for ggplot
theme_empty <- theme_bw() + theme(
  line = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text = element_blank(),
  plot.title = element_blank(),
  axis.title = element_blank(),
  plot.margin = structure(c(0, 0, -0.5, -1), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_simple <- theme_bw() + theme(
  line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 18, family = "STIXGeneral"),
  axis.text.y = element_blank(),
  axis.ticks = element_blank(),
  plot.title = element_blank(),
  axis.title = element_blank(),
  # plot.margin = structure(c(0, 0, -1, -1), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_axes_math <- theme_void() + theme(
  text = element_text(family = "MathJax_Math"),
  axis.title = element_text(size = 22),
  axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
  axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
  axis.line = element_line(
    color = "grey70",
    size = 0.25,
    arrow = arrow(angle = 30, length = unit(0.15, "inches")
  )),
  plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_axes_serif <- theme_void() + theme(
  text = element_text(family = "MathJax_Main"),
  axis.title = element_text(size = 22),
  axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
  axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
  axis.line = element_line(
    color = "grey70",
    size = 0.25,
    arrow = arrow(angle = 30, length = unit(0.15, "inches")
  )),
  plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_axes <- theme_void() + theme(
  text = element_text(family = "Fira Sans Book"),
  axis.title = element_text(size = 18),
  axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
  axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
  axis.line = element_line(
    color = grey_light,
    size = 0.25,
    arrow = arrow(angle = 30, length = unit(0.15, "inches")
  )),
  plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_set(theme_gray(base_size = 20))
# Column names for regression results
reg_columns <- c("Term", "Est.", "S.E.", "t stat.", "p-Value")
# Function for formatting p values
format_pvi <- function(pv) {
  return(ifelse(
    pv < 0.0001,
    "<0.0001",
    round(pv, 4) %>% format(scientific = F)
  ))
}
format_pv <- function(pvs) lapply(X = pvs, FUN = format_pvi) %>% unlist()
# Tidy regression results table
tidy_table <- function(x, terms, highlight_row = 1, highlight_color = "black", highlight_bold = T, digits = c(NA, 3, 3, 2, 5), title = NULL) {
  x %>%
    tidy() %>%
    select(1:5) %>%
    mutate(
      term = terms,
      p.value = p.value %>% format_pv()
    ) %>%
    kable(
      col.names = reg_columns,
      escape = F,
      digits = digits,
      caption = title
    ) %>%
    kable_styling(font_size = 20) %>%
    row_spec(1:nrow(tidy(x)), background = "white") %>%
    row_spec(highlight_row, bold = highlight_bold, color = highlight_color)
}
```

```{css, echo = F, eval = T}
@media print {
  .has-continuation {
    display: block !important;
  }
}
```

$$
\begin{align}
  \def\ci{\perp\mkern-10mu\perp}
\end{align}
$$

# Prologue

---
name: schedule

# Schedule

## Last time

The conditional independence assumption: $\left\{ \text{Y}_{0i},\, \text{Y}_{1i}\right\} \ci \text{D}_{i}\big| \text{X}_{i}$
_I.e._, conditional on some controls $\left( \text{X}_{i} \right)$, treatment is as-good-as random.

## Today

- Omitted-variable bias
- Good *vs.* bad controls

## Upcoming

- Topics: Matching estimators

---
layout: true
# Omitted-variable bias

---
class: inverse, middle
name: OVB

---

## Revisiting an old friend

Let's start where we left off: Returns to schooling.

We have two linear, population models

$$
\begin{align}
  \text{Y}_{i} &= \alpha + \rho \text{s}_i + \eta_i \tag{1} \\
  \text{Y}_{i} &= \alpha + \rho \text{s}_i + \text{X}_{i}'\gamma + \nu_i \tag{2}
\end{align}
$$

--

We should not interpret $\hat{\rho}$ causally in model $\left( 1 \right)$ (for fear of selection bias).

--

For model $\left( 2 \right)$, we can interpret $\hat{\rho}$ causally .b[*if*] $\thinspace\text{Y}_{si}\ci \text{s}_i\big|\text{X}_{i}\thinspace$ (CIA).

--

In other words, the CIA says that our .hi[observable vector] $\color{#e64173}{\text{X}_{i}}$ .hi[must explain all of the correlation between] $\color{#e64173}{\text{s}_i}$ .hi[and] $\color{#e64173}{\eta_i}$.

---
name: ovb_formula

## The OVB formula

We can use the omitted-variable bias (OVB) formula to compare regression estimates from .hi-slate[models with different sets of control variables].

--

We're concerned about selection and want to use a set of control variables to account for ability $\left( \text{A}_i \right)$: family background, motivation, intelligence.

$$
\begin{align}
  \text{Y}_{i} &= \alpha + \beta \text{s}_i + v_i \tag{1} \\
  \text{Y}_{i} &= \pi + \rho \text{s}_i + \text{A}_{i}'\gamma + e_i \tag{2}
\end{align}
$$

--

What happens if we can't get data on $\text{A}_i$ and opt for $\left( 1 \right)$?

--

$$
\begin{align}
  \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$

where $\delta_{As}$ is the vector of coefficients from regressing each element of $\text{A}_i$ on $\text{s}_i$.
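---

## The OVB formula, simulated

A minimal sketch of the formula at work, assuming made-up data rather than the NLSY: the variables `a`, `s`, and `y` and every parameter value below are illustrative assumptions. With a single omitted variable, the short-regression coefficient should equal $\rho + \gamma \delta_{As}$, and for OLS with intercepts this holds exactly in sample.

```r
# Simulated check of the OVB formula (all values are illustrative assumptions)
set.seed(607)
n <- 1e4
# Ability (unobserved in practice) and schooling, positively related
a <- rnorm(n)
s <- 12 + 2 * a + rnorm(n)
# Log wages with rho = 0.08 (schooling) and gamma = 0.10 (ability)
y <- 1 + 0.08 * s + 0.10 * a + rnorm(n, sd = 0.5)
# Short regression omits ability; long regression includes it
short <- lm(y ~ s)
long <- lm(y ~ s + a)
# Auxiliary regression of the omitted variable on schooling
aux <- lm(a ~ s)
# Pieces of the OVB formula
beta_short <- unname(coef(short)["s"])
rho_long <- unname(coef(long)["s"])
gamma_long <- unname(coef(long)["a"])
delta_as <- unname(coef(aux)["s"])
# The two entries should match
c(short = beta_short, long_plus_ovb = rho_long + gamma_long * delta_as)
```

Under these assumed values, $\delta_{As} = 2/5 = 0.4$ in the population, so the short regression should land near $0.08 + 0.10 \times 0.4 = 0.12$ rather than $0.08$.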
---

## Interpretation

Our two regressions

$$
\begin{align}
  \text{Y}_{i} &= \alpha + \beta \text{s}_i + v_i \tag{1} \\
  \text{Y}_{i} &= \pi + \rho \text{s}_i + \text{A}_{i}'\gamma + e_i \tag{2}
\end{align}
$$

will yield the same estimates for the returns to schooling

$$
\begin{align}
  \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$

if (.hi-slate[a]) schooling is uncorrelated with ability $\left( \delta_{As} = 0 \right)$ *or* (.hi-slate[b]) ability is uncorrelated with earnings, conditional on schooling $\left( \gamma = 0 \right)$.

---
name: ovb_ex

## Example

```{r, table_321, echo = F}
coef_v <- c("0.132", "0.131", "0.114", "0.087", "0.066")
se_v <- c(rep("0.007", 3), "0.009", "0.010") %>% paste0("(", ., ")")
control_v <- c(
  "None", "Age Dum.", "2 + Add'l", "3 + AFQT", "4 + Occupation"
)
names_v <- 1:5
tab_mat <- matrix(c(coef_v, se_v, control_v), nrow = 3, byrow = T)[,1:4]
row.names(tab_mat) <- c("Schooling", "", "Controls")
tab321 <- kable(
  x = tab_mat,
  col.names = names_v[1:4],
  caption = "Table 3.2.1, The returns to schooling",
  align = "c"
) %>%
column_spec(1, bold = T, italic = F)
# Print the table
tab321
```

Here we have four specifications of controls for a regression of log wages on years of schooling (from the NLSY).

---

## Example

```{r, table_321_1, echo = F}
tab321 %>% column_spec(2, color = red_pink)
```

.hi[Column 1] (no control variables) suggests a 13.2% increase in wages for an additional year of schooling.

---

## Example

```{r, table_321_2, echo = F}
tab321 %>% column_spec(3, color = red_pink)
```

.hi[Column 2] (age dummies) suggests a 13.1% increase in wages for an additional year of schooling.

---

## Example

```{r, table_321_3, echo = F}
tab321 %>% column_spec(4, color = red_pink)
```

.hi[Column 3] (column 2 controls plus parents' ed. and self demographics) suggests an 11.4% increase in wages for an additional year of schooling.
---

## Example

```{r, table_321_4, echo = F}
tab321 %>% column_spec(5, color = red_pink)
```

.hi[Column 4] (column 3 controls plus AFQT.super[.pink[†]] score) suggests an 8.7% increase in wages for an additional year of schooling.

.footnote[.pink[†] *AFQT* is *Armed Forces Qualification Test*.]

---

## Example

```{r, table_321_5, echo = F}
tab321 %>% column_spec(5, color = red_pink) %>% column_spec(2, color = purple)
```

As we ratchet up controls, the estimated returns to schooling drop by 4.5 percentage points (34% drop in the coefficient) from .hi-purple[Column 1] to .hi[Column 4].

--

$$
\begin{align}
  \color{#6A5ACD}{\dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)}} = \color{#e64173}{\rho} + \color{#20B2AA}{\gamma'} \color{#FFA500}{\delta_{As}}
\end{align}
$$

--

If we think .hi-turquoise[ability positively affects wages], then it looks like we also have .hi-orange[positive selection into schooling].

---
layout: false
class: clear, center, middle
name: ovb_venn

```{r, venn_iv, echo = F, fig.height = 7.5}
# Colors (order: X1, X2, X3, Y)
venn_colors <- c(purple, red, "grey60", orange)
# Line types (order: X1, X2, X3, Y)
venn_lines <- c("solid", "dotted", "dotted", "solid")
# Locations of circles
venn_df <- tibble(
  x  = c( 0.0,   -0.5,    1.5,   -1.0),
  y  = c( 0.0,   -2.5,   -1.8,    2.0),
  r  = c( 1.9,    1.5,    1.5,    1.3),
  l  = c( "Y", "X[1]", "X[2]", "X[3]"),
  xl = c( 0.0,   -0.6,    1.6,   -1.0),
  yl = c( 0.0,   -2.6,   -1.9,    2.2)
)
# Venn
ggplot(data = venn_df, aes(x0 = x, y0 = y, r = r, fill = l, color = l)) +
geom_circle(aes(linetype = l), alpha = 0.3, size = 0.75) +
theme_void() +
theme(legend.position = "none") +
scale_fill_manual(values = venn_colors) +
scale_color_manual(values = venn_colors) +
scale_linetype_manual(values = venn_lines) +
geom_text(aes(x = xl, y = yl, label = l), size = 9, family = "Fira Sans Book", parse = T) +
annotate(
  x = -6, y = 0,
  geom = "text", label = TeX("\\textit{Omitted:} $X_2$ and $X_3$"), size = 9, family = "Fira Sans Book", hjust = 0
) +
xlim(-6, 4.5) +
ylim(-4.2, 3.4) +
coord_equal()
```

---
layout: true
# Omitted-variable bias

---

## Note

This OVB formula .hi-slate[does not] require either of the models to be causal.

The formula compares the regression coefficient in a .hi-purple[short model] to the regression coefficient on the same variable in a .hi-pink[long model]..super[.pink[†]]

.footnote[.pink[†] Here, .hi-pink[*long model*] refers to a model with more controls than the .hi-purple[*short model*].]

---
name: ovb_cia

## The OVB formula and the CIA.super[.pink[†]]

.footnote[.pink[†] The title for my first spy novel.]

In addition to helping us think through and sign OVB, the formula

$$
\begin{align}
  \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$

drives home the point that we're leaning .it[very] hard on the conditional independence assumption to be able to interpret our coefficients as causal.

--

.qa[Q] When is the CIA plausible?

--

.qa[A] Two potential answers

1. Randomized experiments
2. Programs with arbitrary cutoffs/lotteries

---
layout: false
class: clear, middle

Control variables play an enormous role in our quest for causality (the CIA).

.qa[Q] Are "more controls" always better (or at least never worse)?

---
class: clear, middle

.qa[A] No. There are such things as...

---
layout: true
# Bad controls

---
name: bad_controls
class: inverse, middle

---
name: bad_def

## Defined

.qa[Q] What's a *bad* control? When can a control make a bad situation worse?

--

.qa[A] *Bad controls* are variables that are (also) affected by treatment.

--

.note[Note] There are other types of .it[bad controls] too. More soon (DAGs).

--

.qa[Q] Okay, so why is it bad to control using a variable affected by treatment?

--

.note[Hint] It's a flavor of selection bias.

--

Let's consider an example...

---
name: bad_ex

## Example

Suppose we want to know the .hi-slate[effect of college graduation on wages].

1. There are only two types of jobs: blue collar and white collar.
1. White-collar jobs, on average, pay more than blue-collar jobs.
1. Graduating college increases the likelihood of a white-collar job.

--

.qa[Q] Should we control for occupation type when considering the effect of college graduation on wages? (Will occupation be an omitted variable?)

--

.qa[A] No.

--

Imagine college degrees are randomly assigned.

--

When we condition on occupation,
--
 we compare degree-earners who chose blue-collar jobs to non-degree-earners who chose blue-collar jobs.

--

Our assumption of random degrees says .b[nothing] about random job selection.

---
layout: false
class: clear, middle

Bad controls can undo valid randomizations.
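---
# Bad controls

## A simulated sketch

A minimal sketch of how an occupation control can undo a randomized degree, assuming made-up data: the variables `ability`, `college`, `white`, and `wage` and all numbers below are illustrative assumptions, not estimates from any dataset. The degree is randomly assigned and raises log wages by 0.10 for everyone, while white-collar status depends on both the degree and unobserved ability.

```r
# Simulated bad control; every number here is an illustrative assumption
set.seed(607)
n <- 1e5
# Unobserved ability and a randomly assigned college degree
ability <- rnorm(n)
college <- rbinom(n, size = 1, prob = 0.5)
# White-collar status depends on the degree and on ability
white <- as.integer(ability + college + rnorm(n) > 0.5)
# Log wages: the true effect of college is 0.10 for everyone
wage <- 1 + 0.10 * college + 0.50 * ability + rnorm(n, sd = 0.5)
# Without controls, randomization recovers the causal effect
est_none <- unname(coef(lm(wage ~ college))["college"])
# With the occupation control, graduates and non-graduates differ in ability
est_bad <- unname(coef(lm(wage ~ college + white))["college"])
c(no_control = est_none, bad_control = est_bad)
```

Within each occupation, graduates needed less ability to land there than non-graduates did, so the estimate with the occupation control should fall well below 0.10 even though degrees were randomly assigned.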
---
layout: true
# Bad controls

---
name: bad_formal

## Formal-ish derivation

More formally, let

- $\text{W}_i$ be a dummy for whether $i$ has a white-collar job
- $\text{Y}_i$ denote $i$'s earnings
- $\text{C}_i$ refer to $i$'s .hi-slate[randomly assigned] college-graduation status

--

$$
\begin{align}
  \text{Y}_{i} &= \text{C}_{i} \color{#e64173}{\text{Y}_{1i}} + \left( 1 - \text{C}_{i} \right) \color{#6A5ACD}{\text{Y}_{0i}} \\
  \text{W}_{i} &= \text{C}_{i} \color{#e64173}{\text{W}_{1i}} + \left( 1 - \text{C}_{i} \right) \color{#6A5ACD}{\text{W}_{0i}}
\end{align}
$$

--

Because we've assumed $\text{C}_i$ is randomly assigned, differences in means yield causal estimates, _i.e._,

$$
\begin{align}
  \mathop{E}\left[ \text{Y}_{i}\mid \color{#e64173}{\text{C}_{i} = 1} \right] - \mathop{E}\left[ \text{Y}_{i} \mid \color{#6A5ACD}{\text{C}_{i} = 0} \right] &= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \right] \\
  \mathop{E}\left[ \text{W}_{i}\mid \color{#e64173}{\text{C}_{i} = 1} \right] - \mathop{E}\left[ \text{W}_{i} \mid \color{#6A5ACD}{\text{C}_{i} = 0} \right] &= \mathop{E}\left[ \color{#e64173}{\text{W}_{1i}} - \color{#6A5ACD}{\text{W}_{0i}} \right]
\end{align}
$$

---

## Formal-ish derivation, continued

Let's see what happens when we throw in some controls, _e.g._, focusing on the wage effect of college graduation for white-collar jobs.

--

$\mathop{E}\left[ \text{Y}_{i} \mid \text{W}_i = 1,\, \color{#e64173}{\text{C}_i = 1} \right] - \mathop{E}\left[ \text{Y}_{i} \mid \text{W}_i = 1,\, \color{#6A5ACD}{\text{C}_i = 0} \right]$

--

.pad-left[
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1,\, \color{#e64173}{\text{C}_i = 1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1,\, \color{#6A5ACD}{\text{C}_i = 0} \right]$
]

--

.pad-left[
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]$
]

--

.pad-left[
$=\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right]$
$\color{#ffffff}{=} + \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]$
]

--

.pad-left[
$= \underbrace{\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right]}_{\text{Causal effect on white-collar workers}} + \underbrace{\mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]}_{\text{Selection bias}}$
]

---

## Formal-ish derivation, continued

By introducing a bad control, we introduced selection bias into a setting that did not have selection bias without controls.

--

Specifically, the selection bias term

$$
\begin{align}
  \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]
\end{align}
$$

describes how college graduation changes the composition of the pool of white-collar workers.

--

.note[Note] Even if the causal effect is zero, this selection bias need not be zero.

---
name: bad_tricky_ex

## A trickier example

A timely/trickier example: Wage gaps (_e.g._, female-male or black-white).

--

.qa[Q] Should we control for occupation when we consider wage gaps?

--

- What are we trying to capture?
- If we're concerned about discrimination, it seems likely that discrimination also affects occupational choice and hiring outcomes.
- Some motivate occupation controls with groups' differential preferences.

--

What's the answer?

---
name: bad_proxy

## Proxy variables

Angrist and Pischke bring up an interesting scenario that intersects omitted-variable bias and bad controls.

- We want to estimate the returns to education.
- Ability is omitted.
- We have a proxy for ability: a test taken after schooling finishes.

--

We're a bit stuck.

1. If we omit the test altogether, we've got omitted-variable bias.
1. If we include our proxy, we've got a bad control.

--

With some math/luck, we can bound the true effect with these estimates.

---
name: bad_emp

## Example

Returning to our OVB-motivated example, we control for occupation.

```{r, table_bad_control, echo = F}
coef_v <- c("0.132", "0.131", "0.114", "0.087", "0.066")
se_v <- c(rep("0.007", 3), "0.009", "0.010") %>% paste0("(", ., ")")
control_v <- c(
  "None", "Age Dum.", "2 + Add'l", "3 + AFQT", "4 + Occupation"
)
names_v <- 1:5
tab_mat <- matrix(c(coef_v, se_v, control_v), nrow = 3, byrow = T)
row.names(tab_mat) <- c("Schooling", "", "Controls")
kable(
  x = tab_mat,
  col.names = names_v,
  caption = "Table 3.2.1, The returns to schooling",
  align = "c"
) %>%
column_spec(1, bold = T, italic = F) %>%
column_spec(6, color = red_pink)
```

Schooling likely affects occupation; how do we interpret the new results?

---

## Conclusion

Timing matters.

The right controls can help tremendously, but bad controls hurt.

---
layout: false

# Table of contents

.pull-left[
### Admin
.smaller[

1. [Schedule](#schedule)
]
]

.pull-right[
### Controls
.smaller[

1. [Omitted-variable bias](#OVB)
  - [The formula](#ovb_formula)
  - [Example](#ovb_ex)
  - [OVB Venn](#ovb_venn)
  - [OVB and the CIA](#ovb_cia)
1. [Bad controls](#bad_controls)
  - [Defined](#bad_def)
  - [Example](#bad_ex)
  - [Formalization(ish)](#bad_formal)
  - [Trickier example](#bad_tricky_ex)
  - [Bad proxy conundrum](#bad_proxy)
  - [Empirical example](#bad_emp)
]
]

---
exclude: true

```{r, generate pdfs, include = F, eval = F}
pagedown::chrome_print("06-controls.html", output = "06-controls.pdf")
pagedown::chrome_print("06-controls.html", output = "06-controls-nopause.pdf")
```