---
title: "Controls"
subtitle: "EC 607, Set 06"
author: "Edward Rubin"
date: "Spring 2020"
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: inverse, middle

```{r, setup, include = F}
# devtools::install_github("dill/emoGG")
library(pacman)
p_load(
  broom, tidyverse,
  ggplot2, ggthemes, ggforce, ggridges,
  latex2exp, viridis, extrafont, gridExtra,
  kableExtra, snakecase, janitor,
  data.table, dplyr,
  lubridate, knitr,
  estimatr, here, magrittr
)
# Define pink color
red_pink <- "#e64173"
turquoise <- "#20B2AA"
orange <- "#FFA500"
red <- "#fb6107"
blue <- "#3b3b9a"
green <- "#8bb174"
grey_light <- "grey70"
grey_mid <- "grey50"
grey_dark <- "grey20"
purple <- "#6A5ACD"
slate <- "#314f4f"
# Dark slate grey: #314f4f
# Knitr options
opts_chunk$set(
  comment = "#>",
  fig.align = "center",
  fig.height = 7,
  fig.width = 10.5,
  warning = F,
  message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
  svg(tempfile(), width = width, height = height)
})
options(crayon.enabled = F)
options(knitr.table.format = "html")
# A blank theme for ggplot
theme_empty <- theme_bw() + theme(
  line = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text = element_blank(),
  plot.title = element_blank(),
  axis.title = element_blank(),
  plot.margin = structure(c(0, 0, -0.5, -1), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_simple <- theme_bw() + theme(
  line = element_blank(),
  panel.grid = element_blank(),
  rect = element_blank(),
  strip.text = element_blank(),
  axis.text.x = element_text(size = 18, family = "STIXGeneral"),
  axis.text.y = element_blank(),
  axis.ticks = element_blank(),
  plot.title = element_blank(),
  axis.title = element_blank(),
  # plot.margin = structure(c(0, 0, -1, -1), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_axes_math <- theme_void() + theme(
  text = element_text(family = "MathJax_Math"),
  axis.title = element_text(size = 22),
  axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
  axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
  axis.line = element_line(
    color = "grey70",
    size = 0.25,
    arrow = arrow(angle = 30, length = unit(0.15, "inches")
  )),
  plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_axes_serif <- theme_void() + theme(
  text = element_text(family = "MathJax_Main"),
  axis.title = element_text(size = 22),
  axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
  axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
  axis.line = element_line(
    color = "grey70",
    size = 0.25,
    arrow = arrow(angle = 30, length = unit(0.15, "inches")
  )),
  plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_axes <- theme_void() + theme(
  text = element_text(family = "Fira Sans Book"),
  axis.title = element_text(size = 18),
  axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
  axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
  axis.line = element_line(
    color = grey_light,
    size = 0.25,
    arrow = arrow(angle = 30, length = unit(0.15, "inches")
  )),
  plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
  legend.position = "none"
)
theme_set(theme_gray(base_size = 20))
# Column names for regression results
reg_columns <- c("Term", "Est.", "S.E.", "t stat.", "p-Value")
# Function for formatting p values
format_pvi <- function(pv) {
  return(ifelse(
    pv < 0.0001,
    "<0.0001",
    round(pv, 4) %>% format(scientific = F)
  ))
}
format_pv <- function(pvs) lapply(X = pvs, FUN = format_pvi) %>% unlist()
# Tidy regression results table
tidy_table <- function(x, terms, highlight_row = 1, highlight_color = "black", highlight_bold = T, digits = c(NA, 3, 3, 2, 5), title = NULL) {
  x %>%
    tidy() %>%
    select(1:5) %>%
    mutate(
      term = terms,
      p.value = p.value %>% format_pv()
    ) %>%
    kable(
      col.names = reg_columns,
      escape = F,
      digits = digits,
      caption = title
    ) %>%
    kable_styling(font_size = 20) %>%
    row_spec(1:nrow(tidy(x)), background = "white") %>%
    row_spec(highlight_row, bold = highlight_bold, color = highlight_color)
}
```

```{css, echo = F, eval = T}
@media print {
  .has-continuation {
    display: block !important;
  }
}
```

$$
\begin{align}
  \def\ci{\perp\mkern-10mu\perp}
\end{align}
$$

# Prologue

---
name: schedule

# Schedule

## Last time

The conditional independence assumption: $\left\{ \text{Y}_{0i},\, \text{Y}_{1i}\right\} \ci \text{D}_{i}\big| \text{X}_{i}$
<br>_I.e._, conditional on some controls $\left( \text{X}_{i} \right)$, treatment is as-good-as random.

## Today

- Omitted variable bias
- Good *vs.* bad controls

## Upcoming

- Topics: Matching estimators
- Admin: Assignment and midterm

---
layout: true
# Omitted-variable bias

---
class: inverse, middle
name: OVB
---
## Revisiting an old friend

Let's start where we left off: Returns to schooling.

We have two linear, population models
$$
\begin{align}
  \text{Y}_{i} &= \alpha + \rho \text{s}_i + \eta_i \tag{1}
  \\
  \text{Y}_{i} &= \alpha + \rho \text{s}_i + \text{X}_{i}'\gamma + \nu_i \tag{2}
\end{align}
$$

--

We should not interpret $\hat{\rho}$ causally in model $\left( 1 \right)$ (for fear of selection bias).

--

For model $\left( 2 \right)$, we can interpret $\hat{\rho}$ causally .b[*if*] $\thinspace\text{Y}_{si}\ci \text{s}_i\big|\text{X}_{i}\thinspace$ (CIA).

--

In other words, the CIA says that our .hi[observable vector] $\color{#e64173}{\text{X}_{i}}$ .hi[must explain all of correlation between] $\color{#e64173}{s_i}$ .hi[and] $\color{#e64173}{\eta_i}$.

---
name: ovb_formula
## The OVB formula

We can use the omitted-variable bias (OVB) formula to compare regression estimates from .hi-slate[models with different sets of control variables].

--

We're concerned about selection and want to use a set of control variables to account for ability $\left( \text{A}_i \right)$—family background, motivation, intelligence.

$$
\begin{align}
  \text{Y}_{i} &= \alpha + \beta \text{s}_i + v_i \tag{1}
  \\
  \text{Y}_{i} &= \pi + \rho \text{s}_i + \text{A}_{i}'\gamma + e_i \tag{2}
\end{align}
$$

--

What happens if we can't get data on $\text{A}_i$ and opt for $\left( 1 \right)$?

--

$$
\begin{align}
  \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$

where $\delta_{As}$ are coefficients from regressing $\text{A}_i$ on $\text{s}_i$.
---
## Interpretation

Our two regressions

$$
\begin{align}
  \text{Y}_{i} &= \alpha + \beta \text{s}_i + v_i \tag{1}
  \\
  \text{Y}_{i} &= \pi + \rho \text{s}_i + \text{A}_{i}'\gamma + e_i \tag{2}
\end{align}
$$

will yield the same estimates for the returns to schooling

$$
\begin{align}
  \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$

if (.hi-slate[a]) schooling is uncorrelated with ability $\left( \delta_{As} = 0 \right)$ *or* (.hi-slate[b]) ability is uncorrelated with earnings, conditional on schooling $\left( \gamma = 0 \right)$.
---
name: ovb_ex
## Example

```{r, table_321, echo = F}
coef_v <- c("0.132", "0.131", "0.114", "0.087", "0.066")
se_v <- c(rep("0.007", 3), "0.009", "0.010") %>% paste0("(", ., ")")
control_v <- c(
  "None", "Age Dum.", "2 + Add'l",
  "3 + AFQT", "4 + Occupation"
)
names_v <- 1:5
tab_mat <- matrix(c(coef_v, se_v, control_v), nrow = 3, byrow = T)[,1:4]
row.names(tab_mat) <- c("Schooling", "", "Controls")
tab321 <- kable(
  x = tab_mat,
  col.names = names_v[1:4],
  caption = "Table 3.2.1, The returns to schooling",
  align = "c"
) %>%
column_spec(1, bold = T, italic = F)
# Print the table
tab321
```

Here we have four specifications of controls for a regression of log wages on years of schooling (from the NLSY).

---
## Example

```{r, table_321_1, echo = F}
tab321 %>% column_spec(2, color = red_pink)
```

.hi[Column 1] (no control variables) suggests a 13.2% increase in wages for an additional year of schooling.
---
## Example

```{r, table_321_2, echo = F}
tab321 %>% column_spec(3, color = red_pink)
```

.hi[Column 2] (age dummies) suggests a 13.1% increase in wages for an additional year of schooling.
---
## Example

```{r, table_321_3, echo = F}
tab321 %>% column_spec(4, color = red_pink)
```

.hi[Column 3] (column 2 controls plus parents' ed. and self demographics) suggests a 11.4% increase in wages for an additional year of schooling.
---
## Example

```{r, table_321_4, echo = F}
tab321 %>% column_spec(5, color = red_pink)
```

.hi[Column 4] (column 3 controls plus AFQT.super[.pink[†]] score) suggests a 8.7% increase in wages for an additional year of schooling.

.footnote[.pink[†] *AFQT* is *Armed Forces Qualification Test*.]
---
## Example

```{r, table_321_5, echo = F}
tab321 %>%
  column_spec(5, color = red_pink) %>%
  column_spec(2, color = purple)
```

As we ratchet up controls, the estimated returns to schooling drop by 4.5 percentage points (34% drop in the coefficient) from .hi-purple[Column 1] to .hi[Column 4].

--

$$
\begin{align}
  \color{#6A5ACD}{\dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)}} = \color{#e64173}{\rho} + \color{#20B2AA}{\gamma'} \color{#FFA500}{\delta_{As}}
\end{align}
$$

--

If we think .hi-turquoise[ability positively affects wages], then it looks like we also have .hi-orange[positive selection into schooling].
---
layout: false
class: clear, center, middle
name: ovb_venn

```{r, venn_iv, echo = F, fig.height = 7.5}
# Colors (order: x1, x2, x3, y, z)
venn_colors <- c(purple, red, "grey60", orange)
# Line types (order: x1, x2, x3, y, z)
venn_lines <- c("solid", "dotted", "dotted", "solid")
# Locations of circles
venn_df <- tibble(
  x  = c( 0.0,   -0.5,    1.5,   -1.0),
  y  = c( 0.0,   -2.5,   -1.8,    2.0),
  r  = c( 1.9,    1.5,    1.5,    1.3),
  l  = c( "Y", "X[1]", "X[2]", "X[3]"),
  xl = c( 0.0,   -0.6,    1.6,   -1.0),
  yl = c( 0.0,   -2.6,   -1.9,    2.2)
)
# Venn
ggplot(data = venn_df, aes(x0 = x, y0 = y, r = r, fill = l, color = l)) +
geom_circle(aes(linetype = l), alpha = 0.3, size = 0.75) +
theme_void() +
theme(legend.position = "none") +
scale_fill_manual(values = venn_colors) +
scale_color_manual(values = venn_colors) +
scale_linetype_manual(values = venn_lines) +
geom_text(aes(x = xl, y = yl, label = l), size = 9, family = "Fira Sans Book", parse = T) +
annotate(
  x = -6, y = 0,
  geom = "text", label = TeX("\\textit{Omitted:} $X_2$ and $X_3$"), size = 9, family = "Fira Sans Book", hjust = 0
) +
xlim(-6, 4.5) +
ylim(-4.2, 3.4) +
coord_equal()
```
---
layout: true
# Omitted-variable bias

---
## Note

This OVB formula .hi-slate[does not] require either of the models to be causal.

The formula compares the regression coefficient in a .hi-purple[short model] to the regression coefficient on the same variable in a .hi-pink[long model]..super[.pink[†]]

.footnote[.pink[†] Here, .hi-pink[*long model*] refers to a model with more controls than the .hi-purple[*short model*].]

---
name: ovb_cia
## The OVB formula and the CIA.super[.pink[†]]

.footnote[.pink[†] The title for my first spy novel.]

In addition to helping us think through and sign OVB, the formula

$$
\begin{align}
  \dfrac{\mathop{\text{Cov}} \left( \text{Y}_{i},\, \text{s}_i \right)}{\mathop{\text{Var}} \left( \text{s}_i \right)} = \rho + \gamma' \delta_{As}
\end{align}
$$

drives home the point that we're leaning .it[very] hard on the conditional independence assumption to be able to interpret our coefficients as causal.

--

.qa[Q] When is the CIA plausible?

--

.qa[A] Two potential answers
1. Randomized experiments
2. Programs with arbitrary cutoffs/lotteries
---
layout: false
class: clear, middle

Control variables play an enormous role in our quest for causality (the CIA).

.qa[Q] Are "more controls" always better (or at least never worse)?
---
class: clear, middle

.qa[A] No. There are such things as...
---
layout: true
# Bad controls

---
name: bad_controls
class: inverse, middle
---
name: bad_def
## Defined

.qa[Q] What's a *bad* control—when can a control make a bad situation worse?

--

.qa[A] *Bad controls* are variables that are (also) affected by treatment.

--

.qa[Q] Okay, so why is it bad to control using a variable affected by treatment?

--

.note[Hint] It's a flavor of selection bias.

--

Let's consider an example...
---
name: bad_ex
## Example

Suppose we want to know the .hi-slate[effect of college graduation on wages].

1. There are only two types of jobs: blue collar and white collar.
1. White-collar jobs, on averge, pay more than blue-collar jobs.
1. Graduating college increases the likelihood of a white-collar job.

--

.qa[Q] Should we control for occupation type when considering the effect of college graduation on wages? (Will occupation be an omitted variable?)

--

.qa[A] No.
--
 Imagine college degrees are randomly assigned.
--
 When we condition on occupation,
--
 we compare degree-earners who chose blue-collar jobs to non-degree-earners who chose blue-collar jobs.
--
 Our assumption of random degrees says .b[nothing] about random job selection.
---
name: bad_formal
## Formal-ish derivation

More formally, let

- $\text{W}_i$ be a dummy for whether $i$ has a white-collar job
- $\text{Y}_i$ denote $i$'s earnings
- $\text{C}_i$ refer to $i$'s .hi-slate[randomly assigned] college-graduation status

--

$$
\begin{align}
  \text{Y}_{i} &= \text{C}_{i} \color{#e64173}{\text{Y}_{1i}} + \left( 1 - \text{C}_{i} \right) \color{#6A5ACD}{\text{Y}_{0i}}
  \\
  \text{W}_{i} &= \text{C}_{i} \color{#e64173}{\text{W}_{1i}} + \left( 1 - \text{C}_{i} \right) \color{#6A5ACD}{\text{W}_{0i}}
\end{align}
$$

--

Becuase we've assumed $\text{C}_i$ is randomly assigned, differences in means yield causal estimates, _i.e._,

$$
\begin{align}
   \mathop{E}\left[ \text{Y}_{i}\mid \color{#e64173}{\text{C}_{i} = 1} \right] - \mathop{E}\left[ \text{Y}_{i} \mid \color{#6A5ACD}{\text{C}_{i} = 0} \right] &= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \right]
   \\
   \mathop{E}\left[ \text{W}_{i}\mid \color{#e64173}{\text{C}_{i} = 1} \right] - \mathop{E}\left[ \text{W}_{i} \mid \color{#6A5ACD}{\text{C}_{i} = 0} \right] &= \mathop{E}\left[ \color{#e64173}{\text{W}_{1i}} - \color{#6A5ACD}{\text{W}_{0i}} \right]
\end{align}
$$
---
## Formal-ish derivation, continued

Let's see what happens when we throw in some controls—_e.g._, focusing on the the wage-effect of college graduation for white-collar jobs.

--

$\mathop{E}\left[ \text{Y}_{i} \mid \text{W}_i = 1,\, \color{#e64173}{\text{C}_i = 1} \right] - \mathop{E}\left[ \text{Y}_{i} \mid \text{W}_i = 1,\, \color{#6A5ACD}{\text{C}_i = 0} \right]$

--
.pad-left[
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1,\, \color{#e64173}{\text{C}_i = 1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1,\, \color{#6A5ACD}{\text{C}_i = 0} \right]$
]

--
.pad-left[
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]$
]

--
.pad-left[
$=\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right]$
<br> $\color{#ffffff}{=} + \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]$
]

--
.pad-left[
$= \underbrace{\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right]}_{\text{Causal effect on white-collar workers}} + \underbrace{\mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]}_{\text{Selection bias}}$
]
---
## Formal-ish derivation, continued

By introducing a bad control, we introduced selection bias into a setting that did not have selection bias without controls.

--

Specifically, the selection bias term
$$
\begin{align}
  \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{W}_{1i}} = 1\right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#6A5ACD}{\text{W}_{0i}} = 1\right]
\end{align}
$$
describes how college graduation changes the composition of the pool of white-class workers.

--

.note[Note] Even if the causal effect is zero, this selection bias need not be zero.
---
name: bad_tricky_ex
## A trickier example

A timely/trickier example: Wage gaps (_e.g._, female-male or black-white).

--

.qa[Q] Should we control for occupation when we consider wage gaps?

--

- What are we trying to capture?

- If we're concerned about discrimination, it seems likely that discrimination also affects occupational choice and hiring outcomes.

- Some motivate occuption controls with groups' differential preferences.

--

What's the answer?
---
name: bad_proxy
## Proxy variables

Angrist and Pischke bring up an interesting scenario that intersects omitted-variable bias and bad controls.

- We want to estimate the returns to education.
- Ability is omitted.
- We have a proxy for ability—a test taken after schooling finishes.

--

We're a bit stuck.

1. If we omit the test altogether, we've got omitted-variable bias.
1. If we include our proxy, we've got a back control.

--

With some math/luck, we can bound the true effect with these estimates.
---
name: bad_emp
## Example

Returning to our OVB-motivated example, we control for occupation.

```{r, table_bad_control, echo = F}
coef_v <- c("0.132", "0.131", "0.114", "0.087", "0.066")
se_v <- c(rep("0.007", 3), "0.009", "0.010") %>% paste0("(", ., ")")
control_v <- c(
  "None", "Age Dum.", "2 + Add'l",
  "3 + AFQT", "4 + Occupation"
)
names_v <- 1:5
tab_mat <- matrix(c(coef_v, se_v, control_v), nrow = 3, byrow = T)
row.names(tab_mat) <- c("Schooling", "", "Controls")
kable(
  x = tab_mat,
  col.names = names_v,
  caption = "Table 3.2.1, The returns to schooling",
  align = "c"
) %>%
column_spec(1, bold = T, italic = F) %>%
column_spec(6, color = red_pink)
```

Schooling likely affects occupation; how do we interpret the new results?
---
## Conclusion

Timing matters.

The right controls can help tremendously, but bad controls hurt.
---
layout: false
# Table of contents

.pull-left[
### Admin
.smaller[

1. [Schedule](#schedule)
]

]

.pull-right[
### Controls
.smaller[

1. [Omitted-variable bias](#ovb)
  - [The formula](#ovb_formula)
  - [Example](#ovb_ex)
  - [OVB Venn](#ovb_venn)
  - [OVB and the CIA](#ovb_cia)
1. [Bad controls](#bad_controls)
  - [Defined](#bad_def)
  - [Example](#bad_ex)
  - [Formalization(ish)](#bad_formal)
  - [Trickier example](#bad_tricky_ex)
  - [Bad proxy conundrum](#bad_proxy)
  - [Empirical example](#bad_emp)
]
]
---
exclude: true

```{r, generate pdfs, include = F, eval = F}
pagedown::chrome_print("06-controls.html", output = "06-controls.pdf")
pagedown::chrome_print("06-controls.html", output = "06-controls-nopause.pdf")
```