---
title: "Controls"
subtitle: "EC 607, Set 06"
author: "Edward Rubin"
date: ""
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: inverse, middle
```{r, setup, include = F}
# devtools::install_github("dill/emoGG")
library(pacman)
p_load(
broom, tidyverse,
ggplot2, ggthemes, ggforce, ggridges,
latex2exp, viridis, extrafont, gridExtra,
kableExtra, snakecase, janitor,
data.table, dplyr,
lubridate, knitr,
estimatr, here, magrittr
)
# Define colors
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Dark slate grey: #314f4f
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(crayon.enabled = F)
options(knitr.table.format = "html")
# A blank theme for ggplot
theme_empty = theme_bw() + theme(
line = element_blank(),
rect = element_blank(),
strip.text = element_blank(),
axis.text = element_blank(),
plot.title = element_blank(),
axis.title = element_blank(),
plot.margin = structure(c(0, 0, -0.5, -1), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_simple = theme_bw() + theme(
line = element_blank(),
panel.grid = element_blank(),
rect = element_blank(),
strip.text = element_blank(),
axis.text.x = element_text(size = 18, family = "STIXGeneral"),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
plot.title = element_blank(),
axis.title = element_blank(),
# plot.margin = structure(c(0, 0, -1, -1), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes_math = theme_void() + theme(
text = element_text(family = "MathJax_Math"),
axis.title = element_text(size = 22),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = "grey70",
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes_serif = theme_void() + theme(
text = element_text(family = "MathJax_Main"),
axis.title = element_text(size = 22),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = "grey70",
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes = theme_void() + theme(
text = element_text(family = "Fira Sans Book"),
axis.title = element_text(size = 18),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = grey_light,
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_set(theme_gray(base_size = 20))
# Column names for regression results
reg_columns = c("Term", "Est.", "S.E.", "t stat.", "p-Value")
# Function for formatting p values
format_pvi = function(pv) {
return(ifelse(
pv < 0.0001,
"<0.0001",
round(pv, 4) %>% format(scientific = F)
))
}
format_pv = function(pvs) lapply(X = pvs, FUN = format_pvi) %>% unlist()
# Tidy regression results table
tidy_table = function(x, terms, highlight_row = 1, highlight_color = "black", highlight_bold = T, digits = c(NA, 3, 3, 2, 5), title = NULL) {
x %>%
tidy() %>%
select(1:5) %>%
mutate(
term = terms,
p.value = p.value %>% format_pv()
) %>%
kable(
col.names = reg_columns,
escape = F,
digits = digits,
caption = title
) %>%
kable_styling(font_size = 20) %>%
row_spec(1:nrow(tidy(x)), background = "white") %>%
row_spec(highlight_row, bold = highlight_bold, color = highlight_color)
}
```
$$
\begin{align}
\def\ci{\perp\mkern-10mu\perp}
\end{align}
$$
# Prologue
---
name: schedule
# Schedule
## Last time
The conditional independence assumption: $\left\{ \text{Y}_{0i},\, \text{Y}_{1i}\right\} \ci \text{D}_{i}\big| \text{X}_{i}$
_I.e._, conditional on some controls $\left( \text{X}_{i} \right)$, treatment is as-good-as random.
## Today
- Omitted variable bias
- Good *vs.* bad controls
## Upcoming
- DAGs
- Matching estimators
---
layout: true
# Omitted-variable bias
---
class: inverse, middle
name: ovb
---
## Revisiting an old friend
Let's start where we left off: Returns to schooling.
We have two linear, population models
$$
\begin{align}
\text{Y}_{i} &= \alpha + \rho \text{s}_i + \eta_i \tag{1}
\\
\text{Y}_{i} &= \alpha + \rho \text{s}_i + \text{X}_{i}'\gamma + \nu_i \tag{2}
\end{align}
$$
--
We should not interpret $\hat{\rho}$ causally in model $\left( 1 \right)$ (for fear of selection bias).
--
For model $\left( 2 \right)$, we can interpret $\hat{\rho}$ causally .b[*if*] $\thinspace\text{Y}_{si}\ci \text{s}_i\big|\text{X}_{i}\thinspace$ (CIA).
--
In other words, the CIA says that our .hi[observable vector] $\color{#e64173}{\text{X}_{i}}$ .hi[must explain all of the correlation between] $\color{#e64173}{\text{s}_i}$ .hi[and] $\color{#e64173}{\eta_i}$.
---
name: ovb_formula
## The OVB formula
We can use the omitted-variable bias (OVB) formula to compare regression estimates from .hi-slate[models with different sets of control variables].
--
We're concerned about selection and want to use a set of control variables to account for ability $\left( \text{A}_i \right)$—family background, motivation, intelligence.
$$
\begin{align}
\text{Y}_{i} &= \alpha + \beta \text{s}_i + v_i \tag{1}
\\
\text{Y}_{i} &= \pi + \rho \text{s}_i + \text{A}_{i}'\gamma + e_i \tag{2}
\end{align}
$$
--
What happens if we can't get data on $\text{A}_i$ and opt for $\left( 1 \right)$?
--
$$
\begin{align}
\dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$
where $\delta_{As}$ is the vector of coefficients from regressing each element of $\text{A}_i$ on $\text{s}_i$.
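---
## The OVB formula, sketched
A sketch of where the formula comes from. It assumes that $e_i$ is the population-regression residual from $\left( 2 \right)$, so it is uncorrelated with $\text{s}_i$ by construction, and that $\delta_{As} = \mathop{\text{Cov}} \left( \text{A}_{i},\, \text{s}_i \right) / \mathop{\text{Var}} \left( \text{s}_i \right)$. Substitute $\left( 2 \right)$ into the slope from $\left( 1 \right)$:

$$
\begin{align}
\dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)}
&= \dfrac{\mathop{\text{Cov}} \left( \pi + \rho \text{s}_i + \text{A}_{i}'\gamma + e_i,\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)}
\\
&= \rho \dfrac{\mathop{\text{Var}} \left( \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} + \gamma' \dfrac{\mathop{\text{Cov}} \left( \text{A}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} + \dfrac{\mathop{\text{Cov}} \left( e_i,\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)}
\\
&= \rho + \gamma' \delta_{As}
\end{align}
$$

The last term on the second line is zero because regression residuals are uncorrelated with the regressors.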
---
## Interpretation
Our two regressions
$$
\begin{align}
\text{Y}_{i} &= \alpha + \beta \text{s}_i + v_i \tag{1}
\\
\text{Y}_{i} &= \pi + \rho \text{s}_i + \text{A}_{i}'\gamma + e_i \tag{2}
\end{align}
$$
will yield the same estimates for the returns to schooling
$$
\begin{align}
\dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$
if (.hi-slate[a]) schooling is uncorrelated with ability $\left( \delta_{As} = 0 \right)$ *or* (.hi-slate[b]) ability is uncorrelated with earnings, conditional on schooling $\left( \gamma = 0 \right)$.
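---
## Interpretation, simulated
A minimal simulation sketch of the OVB formula. The data-generating process and every parameter value below are illustrative assumptions (they are not estimates from the NLSY). Ability is a single variable here, so $\gamma$ and $\delta_{As}$ are scalars.

```r
# Simulated data; every parameter value here is an illustrative assumption
set.seed(607)
n = 1e4
a = rnorm(n)                                      # Ability
s = 12 + 2 * a + rnorm(n)                         # Schooling rises with ability
y = 1 + 0.10 * s + 0.05 * a + rnorm(n, sd = 0.5)  # Log wage
# Short (1), long (2), and auxiliary (a on s) regressions
short = lm(y ~ s)
long = lm(y ~ s + a)
aux = lm(a ~ s)
# OVB formula: short slope = rho + gamma * delta_As
b_short = unname(coef(short)["s"])
b_formula = unname(coef(long)["s"] + coef(long)["a"] * coef(aux)["s"])
c(short = b_short, formula = b_formula)
```

The two numbers agree up to floating-point error: for OLS, the formula is an algebraic identity in the sample, not just in the population. Setting the `2 * a` term in `s` to zero would illustrate case (.hi-slate[a]).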
---
name: ovb_ex
## Example
```{r, table_321, echo = F}
coef_v = c("0.132", "0.131", "0.114", "0.087", "0.066")
se_v = c(rep("0.007", 3), "0.009", "0.010") %>% paste0("(", ., ")")
control_v = c(
"None", "Age Dum.", "2 + Add'l",
"3 + AFQT", "4 + Occupation"
)
names_v = 1:5
tab_mat = matrix(c(coef_v, se_v, control_v), nrow = 3, byrow = T)[,1:4]
row.names(tab_mat) = c("Schooling", "", "Controls")
tab321 = kable(
x = tab_mat,
col.names = names_v[1:4],
caption = "Table 3.2.1, The returns to schooling",
align = "c"
) %>%
column_spec(1, bold = T, italic = F)
# Print the table
tab321
```
Here we have four specifications of controls for a regression of log wages on years of schooling (from the NLSY).
---
## Example
```{r, table_321_1, echo = F}
tab321 %>% column_spec(2, color = red_pink)
```
.hi[Column 1] (no control variables) suggests a 13.2% increase in wages for an additional year of schooling.
---
## Example
```{r, table_321_2, echo = F}
tab321 %>% column_spec(3, color = red_pink)
```
.hi[Column 2] (age dummies) suggests a 13.1% increase in wages for an additional year of schooling.
---
## Example
```{r, table_321_3, echo = F}
tab321 %>% column_spec(4, color = red_pink)
```
.hi[Column 3] (column 2 controls plus parents' education and own demographics) suggests an 11.4% increase in wages for an additional year of schooling.
---
## Example
```{r, table_321_4, echo = F}
tab321 %>% column_spec(5, color = red_pink)
```
.hi[Column 4] (column 3 controls plus AFQT.super[.pink[†]] score) suggests an 8.7% increase in wages for an additional year of schooling.
.footnote[.pink[†] *AFQT* is *Armed Forces Qualification Test*.]
---
## Example
```{r, table_321_5, echo = F}
tab321 %>%
column_spec(5, color = red_pink) %>%
column_spec(2, color = purple)
```
As we ratchet up controls, the estimated returns to schooling drop by 4.5 percentage points (34% drop in the coefficient) from .hi-purple[Column 1] to .hi[Column 4].
--
$$
\begin{align}
\color{#6A5ACD}{\dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)}} = \color{#e64173}{\rho} + \color{#20B2AA}{\gamma'} \color{#FFA500}{\delta_{As}}
\end{align}
$$
--
If we think .hi-turquoise[ability positively affects wages], then it looks like we also have .hi-orange[positive selection into schooling].
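---
## Example, arithmetic
The size of the drop can be checked directly from the table's coefficients. The snippet below only redoes the arithmetic, and the variable names are made up for illustration.

```r
# Schooling coefficients from columns 1-4 of Table 3.2.1 (as shown in the table)
coef_tab = c(0.132, 0.131, 0.114, 0.087)
# Drop from column 1 to column 4: in levels and as a share of column 1
drop_pp = coef_tab[1] - coef_tab[4]
drop_share = drop_pp / coef_tab[1]
round(c(drop = drop_pp, share = drop_share), 3)
```

If column 1 is the short model and column 4 is the long model, the OVB formula says this 0.045 gap equals $\gamma' \delta_{As}$ for the controls added between the two columns.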
---
layout: false
class: clear, center, middle
name: ovb_venn
```{r, venn_iv, echo = F, fig.height = 7.5}
# Colors (order follows the sorted labels: X[1], X[2], X[3], Y)
venn_colors = c(purple, red, "grey60", orange)
# Line types (same order: X[1], X[2], X[3], Y)
venn_lines = c("solid", "dotted", "dotted", "solid")
# Locations of circles
venn_df = tibble(
x = c( 0.0, -0.5, 1.5, -1.0),
y = c( 0.0, -2.5, -1.8, 2.0),
r = c( 1.9, 1.5, 1.5, 1.3),
l = c( "Y", "X[1]", "X[2]", "X[3]"),
xl = c( 0.0, -0.6, 1.6, -1.0),
yl = c( 0.0, -2.6, -1.9, 2.2)
)
# Venn
ggplot(data = venn_df, aes(x0 = x, y0 = y, r = r, fill = l, color = l)) +
geom_circle(aes(linetype = l), alpha = 0.3, size = 0.75) +
theme_void() +
theme(legend.position = "none") +
scale_fill_manual(values = venn_colors) +
scale_color_manual(values = venn_colors) +
scale_linetype_manual(values = venn_lines) +
geom_text(aes(x = xl, y = yl, label = l), size = 9, family = "Fira Sans Book", parse = T) +
annotate(
x = -6, y = 0,
geom = "text", label = TeX("\\textit{Omitted:} $X_2$ and $X_3$"), size = 9, family = "Fira Sans Book", hjust = 0
) +
xlim(-6, 4.5) +
ylim(-4.2, 3.4) +
coord_equal()
```
---
layout: true
# Omitted-variable bias
---
## Note
This OVB formula .hi-slate[does not] require either of the models to be causal.
The formula compares the regression coefficient in a .hi-purple[short model] to the regression coefficient on the same variable in a .hi-pink[long model]..super[.pink[†]]
.footnote[.pink[†] Here, .hi-pink[*long model*] refers to a model with more controls than the .hi-purple[*short model*].]
---
name: ovb_cia
## The OVB formula and the CIA.super[.pink[†]]
.footnote[.pink[†] The title for my first spy novel.]
In addition to helping us think through and sign OVB, the formula
$$
\begin{align}
\dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$
drives home the point that we're leaning .it[very] hard on the conditional independence assumption to be able to interpret our coefficients as causal.
--
.qa[Q] When is the CIA plausible?
--
.qa[A] Two potential answers
1. Randomized experiments
2. Programs with arbitrary cutoffs/lotteries
---
layout: false
class: clear, middle
Control variables play an enormous role in our quest for causality (the CIA).
.qa[Q] Are "more controls" always better (or at least never worse)?
---
class: clear, middle
.qa[A] No. There are such things as...
---
layout: true
# Bad controls
---
name: bad_controls
class: inverse, middle
---
name: bad_def
## Defined
.qa[Q] What's a *bad* control—when can a control make a bad situation worse?
--
.qa[A] *Bad controls* are variables that are (also) affected by treatment.
--
.note[Note] There are other types of .it[bad controls] too. More soon (DAGs).
--
.qa[Q] Okay, so why is it bad to control using a variable affected by treatment?
--
.note[Hint] It's a flavor of selection bias.
--
Let's consider an example...
---
name: bad_ex
## Example
Suppose we want to know the .hi-slate[effect of college graduation on wages].
1. There are only two types of jobs: blue collar and white collar.
1. White-collar jobs, on average, pay more than blue-collar jobs.
1. Graduating college increases the likelihood of a white-collar job.
--
.qa[Q] Should we control for occupation type when considering the effect of college graduation on wages? (Will occupation be an omitted variable?)
--
.qa[A] No.
--
Imagine college degrees are randomly assigned.
--
When we condition on occupation,
--
we compare degree-earners who chose blue-collar jobs to non-degree-earners who chose blue-collar jobs.
--
Our assumption of random degrees says .b[nothing] about random job selection.
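---
## Example, simulated
A simulation sketch of the same story. The data-generating process is an assumption made for illustration: degrees are randomly assigned, unobserved ability raises both wages and the chance of a white-collar job, and the true effect of a degree on wages is set to zero.

```r
# Simulated data; all parameter values are illustrative assumptions
set.seed(607)
n = 1e5
ability = rnorm(n)
college = rbinom(n, 1, 0.5)    # Randomly assigned degree
# White-collar job: more likely with a degree and with higher ability
white = as.integer(0.8 * college + ability + rnorm(n) > 0.5)
# Wages: the true effect of college is zero in this sketch
wage = 10 + 0 * college + 2 * ability + rnorm(n)
# Without the bad control: close to the true effect (zero)
b_random = unname(coef(lm(wage ~ college))["college"])
# With the bad control: pushed away from zero
b_bad = unname(coef(lm(wage ~ college + white))["college"])
c(no_control = b_random, bad_control = b_bad)
```

Within each occupation, degree-holders have lower average ability than non-degree-holders in this setup, so the estimate with the occupation control turns negative even though the true effect is zero.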
---
layout: false
class: clear, middle
Bad controls can undo valid randomizations.
---
layout: true
# Bad controls
---
name: bad_formal
## Formal-ish derivation
More formally, let
- $\text{W}_i$ be a dummy for whether $i$ has a white-collar job
- $\text{Y}_i$ denote $i$'s earnings
- $\text{C}_i$ refer to $i$'s .hi-slate[randomly assigned] college-graduation status
--
$$
\begin{align}
\text{Y}_{i} &= \text{C}_{i} \color{#e64173}{\text{Y}_{1i}} + \left( 1 - \text{C}_{i} \right) \color{#6A5ACD}{\text{Y}_{0i}}
\\
\text{W}_{i} &= \text{C}_{i} \color{#e64173}{\text{W}_{1i}} + \left( 1 - \text{C}_{i} \right) \color{#6A5ACD}{\text{W}_{0i}}
\end{align}
$$
--
Because we've assumed $\text{C}_i$ is randomly assigned, differences in means yield causal estimates, _i.e._,
$$
\begin{align}
\mathop{E}\left[ \text{Y}_{i}\mid \color{#e64173}{\text{C}_{i} = 1} \right] - \mathop{E}\left[ \text{Y}_{i} \mid \color{#6A5ACD}{\text{C}_{i} = 0} \right] &= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \right]
\\
\mathop{E}\left[ \text{W}_{i}\mid \color{#e64173}{\text{C}_{i} = 1} \right] - \mathop{E}\left[ \text{W}_{i} \mid \color{#6A5ACD}{\text{C}_{i} = 0} \right] &= \mathop{E}\left[ \color{#e64173}{\text{W}_{1i}} - \color{#6A5ACD}{\text{W}_{0i}} \right]
\end{align}
$$
---
## Formal-ish derivation, continued
Let's see what happens when we throw in some controls—_e.g._, focusing on the wage-effect of college graduation for white-collar jobs.
--
$\mathop{E}\left[ \text{Y}_{i} \mid \text{W}_i = 1,\, \color{#e64173}{\text{C}_i = 1} \right] - \mathop{E}\left[ \text{Y}_{i} \mid \text{W}_i = 1,\, \color{#6A5ACD}{\text{C}_i = 0} \right]$
--
.pad-left[
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1,\, \color{#e64173}{\text{C}_i = 1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1,\, \color{#6A5ACD}{\text{C}_i = 0} \right]$
]
--
.pad-left[
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]$
]
--
.pad-left[
$=\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right]$
$\color{#ffffff}{=} + \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]$
]
--
.pad-left[
$= \underbrace{\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right]}_{\text{Causal effect on white-collar workers}} + \underbrace{\mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]}_{\text{Selection bias}}$
]
---
## Formal-ish derivation, continued
By adding a bad control, we introduced selection bias into a setting that had none.
--
Specifically, the selection bias term
$$
\begin{align}
\mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]
\end{align}
$$
describes how college graduation changes the composition of the pool of white-collar workers.
--
.note[Note] Even if the causal effect is zero, this selection bias need not be zero.
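---
## Formal-ish derivation, simulated
A sketch that builds the potential outcomes directly, so each term in the decomposition can be computed. The thresholds, coefficients, and the zero causal effect are all assumptions chosen for illustration.

```r
# Simulated potential outcomes; all parameter values are assumptions
set.seed(607)
n = 2e5
ability = rnorm(n)
u = rnorm(n)
# Potential occupations: a degree lowers the bar for a white-collar job
w0 = as.integer(ability + u > 0.5)
w1 = as.integer(ability + u > -0.3)
# Potential earnings: zero causal effect of college, by assumption
y0 = 10 + 2 * ability + rnorm(n)
y1 = y0
# Random assignment and observed outcomes
c_i = rbinom(n, 1, 0.5)
w = ifelse(c_i == 1, w1, w0)
y = ifelse(c_i == 1, y1, y0)
# Observed gap among white-collar workers vs. the two terms
observed = mean(y[w == 1 & c_i == 1]) - mean(y[w == 1 & c_i == 0])
causal = mean((y1 - y0)[w1 == 1])
selection = mean(y0[w1 == 1]) - mean(y0[w0 == 1])
c(observed = observed, causal = causal, selection = selection)
```

Up to sampling noise, `observed` equals `causal + selection`. The causal term is exactly zero here, so the entire observed gap is selection bias.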
---
name: bad_tricky_ex
## A trickier example
A timely/trickier example: Wage gaps (_e.g._, female-male or black-white).
--
.qa[Q] Should we control for occupation when we consider wage gaps?
--
- What are we trying to capture?
- If we're concerned about discrimination, it seems likely that discrimination also affects occupational choice and hiring outcomes.
- Some motivate occupation controls with groups' differential preferences.
--
What's the answer?
---
name: bad_proxy
## Proxy variables
Angrist and Pischke bring up an interesting scenario that intersects omitted-variable bias and bad controls.
- We want to estimate the returns to education.
- Ability is omitted.
- We have a proxy for ability—a test taken after schooling finishes.
--
We're a bit stuck.
1. If we omit the test altogether, we've got omitted-variable bias.
1. If we include our proxy, we've got a bad control.
--
With some math/luck, we can bound the true effect with these estimates.
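---
## Proxy variables, simulated
A simulation sketch of the bounding idea. It assumes that ability raises both schooling and wages and that the late test score rises with both schooling and ability. These sign assumptions are what make the two estimates bracket the truth here; they are not guaranteed in general. The true return is set to 0.10.

```r
# Simulated data; all parameter values are illustrative assumptions
set.seed(607)
n = 1e5
ability = rnorm(n)
school = 12 + ability + rnorm(n)
# Test taken after schooling: reflects both schooling and ability
test = 0.5 * school + ability + rnorm(n, sd = 0.5)
# Log wage with a true return to schooling of 0.10
wage = 1 + 0.10 * school + 0.20 * ability + rnorm(n, sd = 0.5)
# Omit the proxy: omitted-variable bias pushes the estimate up
b_omit = unname(coef(lm(wage ~ school))["school"])
# Include the proxy: the bad control pushes the estimate down
b_proxy = unname(coef(lm(wage ~ school + test))["school"])
c(with_proxy = b_proxy, truth = 0.10, omitted = b_omit)
```

In this setup the estimate with the proxy falls below 0.10 and the estimate without it lands above 0.10.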
---
name: bad_emp
## Example
Returning to our OVB-motivated example, we control for occupation.
```{r, table_bad_control, echo = F}
coef_v = c("0.132", "0.131", "0.114", "0.087", "0.066")
se_v = c(rep("0.007", 3), "0.009", "0.010") %>% paste0("(", ., ")")
control_v = c(
"None", "Age Dum.", "2 + Add'l",
"3 + AFQT", "4 + Occupation"
)
names_v = 1:5
tab_mat = matrix(c(coef_v, se_v, control_v), nrow = 3, byrow = T)
row.names(tab_mat) = c("Schooling", "", "Controls")
kable(
x = tab_mat,
col.names = names_v,
caption = "Table 3.2.1, The returns to schooling",
align = "c"
) %>%
column_spec(1, bold = T, italic = F) %>%
column_spec(6, color = red_pink)
```
Schooling likely affects occupation; how do we interpret the new results?
---
## Conclusion
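Timing matters: variables determined before treatment cannot be affected by it, while variables determined after treatment may be bad controls.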
The right controls can help tremendously, but bad controls hurt.
---
layout: false
# Table of contents
.pull-left[
### Admin
.smaller[
1. [Schedule](#schedule)
]
]
.pull-right[
### Controls
.smaller[
1. [Omitted-variable bias](#ovb)
- [The formula](#ovb_formula)
- [Example](#ovb_ex)
- [OVB Venn](#ovb_venn)
- [OVB and the CIA](#ovb_cia)
1. [Bad controls](#bad_controls)
- [Defined](#bad_def)
- [Example](#bad_ex)
- [Formalization(ish)](#bad_formal)
- [Trickier example](#bad_tricky_ex)
- [Bad proxy conundrum](#bad_proxy)
- [Empirical example](#bad_emp)
]
]
---
exclude: true
```{r, generate pdfs, include = F, eval = F}
pagedown::chrome_print("06-controls.html", output = "06-controls.pdf")
```