---
title: "The Experimental Ideal"
subtitle: "EC 607, Set 02"
author: "Edward Rubin"
date: ""
output:
xaringan::moon_reader:
css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
# self_contained: true
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
class: inverse, middle
```{r, setup, include = F}
# devtools::install_github("dill/emoGG")
library(pacman)
p_load(
broom, tidyverse,
latex2exp, ggplot2, ggthemes, ggforce, viridis, extrafont, gridExtra,
kableExtra, snakecase, janitor,
data.table, dplyr, estimatr,
lubridate, knitr, parallel,
lfe,
here, magrittr
)
# Define pink color
red_pink <- "#e64173"
turquoise <- "#20B2AA"
orange <- "#FFA500"
red <- "#fb6107"
blue <- "#3b3b9a"
green <- "#8bb174"
grey_light <- "grey70"
grey_mid <- "grey50"
grey_dark <- "grey20"
purple <- "#6A5ACD"
slate <- "#314f4f"
# Dark slate grey: #314f4f
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(crayon.enabled = F)
options(knitr.table.format = "html")
# A blank theme for ggplot
theme_empty <- theme_bw() + theme(
line = element_blank(),
rect = element_blank(),
strip.text = element_blank(),
axis.text = element_blank(),
plot.title = element_blank(),
axis.title = element_blank(),
plot.margin = structure(c(0, 0, -0.5, -1), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_simple <- theme_bw() + theme(
line = element_blank(),
panel.grid = element_blank(),
rect = element_blank(),
strip.text = element_blank(),
axis.text.x = element_text(size = 18, family = "STIXGeneral"),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
plot.title = element_blank(),
axis.title = element_blank(),
# plot.margin = structure(c(0, 0, -1, -1), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes_math <- theme_void() + theme(
text = element_text(family = "MathJax_Math"),
axis.title = element_text(size = 22),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = "grey70",
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes_serif <- theme_void() + theme(
text = element_text(family = "MathJax_Main"),
axis.title = element_text(size = 22),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = "grey70",
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_axes <- theme_void() + theme(
text = element_text(family = "Fira Sans Book"),
axis.title = element_text(size = 18),
axis.title.x = element_text(hjust = .95, margin = margin(0.15, 0, 0, 0, unit = "lines")),
axis.title.y = element_text(vjust = .95, margin = margin(0, 0.15, 0, 0, unit = "lines")),
axis.line = element_line(
color = grey_light,
size = 0.25,
arrow = arrow(angle = 30, length = unit(0.15, "inches")
)),
plot.margin = structure(c(1, 0, 1, 0), unit = "lines", valid.unit = 3L, class = "unit"),
legend.position = "none"
)
theme_set(theme_gray(base_size = 20))
# Column names for regression results
reg_columns <- c("Term", "Est.", "S.E.", "t stat.", "p-Value")
# Function for formatting p values
format_pvi <- function(pv) {
return(ifelse(
pv < 0.0001,
"<0.0001",
round(pv, 4) %>% format(scientific = F)
))
}
format_pv <- function(pvs) lapply(X = pvs, FUN = format_pvi) %>% unlist()
# Tidy regression results table
tidy_table <- function(x, terms, highlight_row = 1, highlight_color = "black", highlight_bold = T, digits = c(NA, 3, 3, 2, 5), title = NULL) {
x %>%
tidy() %>%
select(1:5) %>%
mutate(
term = terms,
p.value = p.value %>% format_pv()
) %>%
kable(
col.names = reg_columns,
escape = F,
digits = digits,
caption = title
) %>%
kable_styling(font_size = 20) %>%
row_spec(1:nrow(tidy(x)), background = "white") %>%
row_spec(highlight_row, bold = highlight_bold, color = highlight_color)
}
# A few extras
xaringanExtra::use_xaringan_extra(c("tile_view", "fit_screen"))
```
# Prologue
---
name: schedule
# Schedule
### Last time
Research basics, our class, and .mono[R]
### Today
.hi-slate[Material:] The Rubin causal model (not mine), .orange[Chapter 2 MHE].
--
.hi-purple[Assignment.sub[1]] Make sure [.mono[R]](https://www.r-project.org/) and [.mono[RStudio]](https://www.rstudio.com/products/rstudio/download/#download) are running on your computer.
--
.hi-purple[Assignment.sub[2]] Take 15 minutes to quietly think about your interests.
--
.hi-purple[Assignment.sub[3]] First formal assignment.
--
### Future
.hi-slate[Lab:] Meet Kyu and start deepening R knowledge.
.hi[Long run:] Deepen understandings/intuitions for causality and inference.
---
layout: false
class: inverse, middle
name: review
# Review
## Research fundamentals
---
# Review
## Research fundamentals
Angrist and Pischke provide four .hi-slate[fundamental questions for research:]
1. What is the .hi[causal relationship of interest]?
2. How would an .hi[ideal experiment] capture this causal effect of interest?
3. What is your .hi[identification strategy]?
4. What is your .hi[mode of inference]?
--
Seemingly straightforward questions can be fundamentally unanswerable.
---
# Review
## General research recommendations
More unsolicited advice:
- Be curious.
- Ask questions.
- Attend seminars.
- Meet faculty (UO + visitors).
- Focus on learning—especially intuition..pink[†]
- .hi-pink[Be kind and constructive.]
.footnote[
.pink[†] *Learning* is not always the same as getting good grades.
]
---
layout: true
# The experimental ideal
---
class: inverse, middle
---
## What's so great about experiments?
Science widely regards .hi[experiments as the gold standard] for research.
*But why?* The costs can be substantial.
.hi-slate[Costs]
- slow and expensive
- heavily regulated by (risk-averse?) review boards
- can abstract away from the actual question/setting
.hi-slate[Benefits]
So the benefits need to be pretty large, right?
---
## Example: Hospitals and health
Imagine we want to know the .hi[causal effect of hospitals on health].
--
.hi-slate[Research question]
Within the population of poor, elderly individuals, does visiting the emergency room for primary care improve health?
--
.hi-slate[Empirical exercise]
1. Collect data on *health status* and *hospital visits*.
1. Summarize health status by hospital-visit group.
---
## Example: Hospitals and health
Our empirical exercise from the 2005 National Health Inteview Survey:
```{r, table1, echo = F}
data.frame(
v1 = c("Hospital", "No hospital"),
v2 = c("7,774", "90,049"),
v3 = c(3.21, 3.93),
v4 = c(0.014, 0.003)
) %>% kable(
col.names = c("Group", "Sample Size", "Mean Health Status", "Std. Error"),
align = c("l", "c", "c", "c")
) %>%
kable_styling(font_size = 22) %>%
row_spec(1:2, background = "white", color = slate)
```
--
We get a $t$ statistic of 58.9 when testing a difference in groups' means (0.72).
--
.hi[Conclusion?] Hospitals make folks worse. Hospitals make sick people sicker.
--
.hi[Alternative conclusion:] Perhaps we're making a mistake in our analysis...
--
maybe sick people go to hospitals?
---
name: framework
## Potential outcomes framework
Let's develop a framework to better discuss the problem here.
--
- Binary treatment variable (_e.g._, hospitalized): $\text{D}_i = {0,1}$
- Outcome for individual $i$ (_e.g._, health): $\text{Y}_i$
This framework has a few names...
- Neyman potential outcomes framework
- Rubin causal model
- Neyman-Rubin .pseudocode-small["potential outcome"|"causal" "framework"|"model"]
---
## Potential outcomes framework
.hi-slate[Research question:] Does $\text{D}_i$ affect $\text{Y}_i$?
--
For each individual $i$, there are two .b[potential outcomes] (w/ binary $\text{D}_i$)
--
1. $\color{#e64173}{\text{Y}_{1i}}$ .pink[if] $\color{#e64173}{\text{D}_i = 1}$
$\color{#e64173}{i}$.pink['s health outcome if she went to the hospital]
--
1. $\color{#6A5ACD}{\text{Y}_{0i}}$ .purple[if] $\color{#6A5ACD}{\text{D}_i = 0}$
$\color{#6A5ACD}{i}$.purple['s health outcome if she did not go to the hospital]
--
The difference between these two outcomes gives us the .hi-orange[causal effect of hospital treatment], _i.e._,
$$
\begin{align}
\color{#FFA500}{\tau_i} = \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}}
\end{align}
$$
---
## #problems
This simple equation
$$
\begin{align}
\color{#FFA500}{\tau_i} = \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}}
\end{align}
$$
leads us to .hi-slate[*the fundamental problem of causal inference.*]
--
> We can never simultaneously observe $\color{#e64173}{\text{Y}_{1i}}$ and $\color{#6A5ACD}{\text{Y}_{0i}}$.
--
Most of applied econometrics focuses on addressing this simple problem.
--
Accordingly, our methods try to address the related question
> For each $\color{#e64173}{\text{Y}_{1i}}$, what is a (reasonably) good counterfactual?
---
## Solutions?
.hi-slate[Problem] We cannot directly calculate $\color{#FFA500}{\tau_i} = \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}}$.
--
.hi-slate[Proposed solution]
Compare .pink[outcomes for people who visited the hospital] $\left( \color{#e64173}{\text{Y}_{1i}\mid \color{#e64173}{\text{D}_{i}=1}} \right)$
to .purple[outcomes for people who did not visit the hospital] $\left( \color{#6A5ACD}{\text{Y}_{0j}\mid \color{#6A5ACD}{\text{D}_{j}=0}} \right)$.
--
$$
\begin{align}
\mathop{E}\left[ \text{Y}_{i} \mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \text{Y}_{i}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]
\end{align}
$$
which gives us the *observed difference in health outcomes*.
--
.hi-slate[Q] This comparison will return *an* answer, but is it *the* answer we want?
---
name: selection
## Selection
.hi-slate[Q] What does $\mathop{E}\left[ \text{Y}_{i} \mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \text{Y}_{i}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$ actually tell us?
--
.hi-slate[A] First notice that we can write $i$'s outcome $\text{Y}_{i}$ as
$$
\begin{align}
\text{Y}_{i}
&= \color{#6A5ACD}{\text{Y}_{0i}} + \text{D}_{i} \underbrace{\left( \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \right)}_\color{#FFA500}{\tau_i}
\end{align}
$$
--
Now write out our expectation, apply this definition, do creative math.
$\mathop{E}\left[ \text{Y}_{i} \mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \text{Y}_{i}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
--
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
--
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] + \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
---
count: false
## Selection
.hi-slate[Q] What does $\mathop{E}\left[ \text{Y}_{i} \mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \text{Y}_{i}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$ actually tell us?
.hi-slate[A] First notice that we can write $i$'s outcome $\text{Y}_{i}$ as
$$
\begin{align}
\text{Y}_{i}
&= \color{#6A5ACD}{\text{Y}_{0i}} + \text{D}_{i} \underbrace{\left( \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \right)}_\color{#FFA500}{\tau_i}
\end{align}
$$
Now write out our expectation, apply this definition, do creative math.
$\mathop{E}\left[ \text{Y}_{i} \mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \text{Y}_{i}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
$= \underbrace{\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right]}_\text{Average treatment effect on the treated 😀} + \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
---
count: false
## Selection
.hi-slate[Q] What does $\mathop{E}\left[ \text{Y}_{i} \mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \text{Y}_{i}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$ actually tell us?
.hi-slate[A] First notice that we can write $i$'s outcome $\text{Y}_{i}$ as
$$
\begin{align}
\text{Y}_{i}
&= \color{#6A5ACD}{\text{Y}_{0i}} + \text{D}_{i} \underbrace{\left( \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \right)}_\color{#FFA500}{\tau_i}
\end{align}
$$
Now write out our expectation, apply this definition, do creative math.
$\mathop{E}\left[ \text{Y}_{i} \mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \text{Y}_{i}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
$= \underbrace{\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right]}_\text{Average treatment effect on the treated 😀} + \underbrace{\mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]}_\text{Selection bias 😞}$
---
## Selection
The .b[first term] is *good variation*—essentially the answer that we want.
$\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right]$
--
$=\mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{D}_{i}=1} \right]$
--
$=\mathop{E}\left[ \color{#FFA500}{\tau_i} \mid \color{#e64173}{\text{D}_{i}=1} \right]$
--
The .hi-orange[average causal effect] of hospitalization *for hospitalized individuals*.
--
The .b[second term] is bad variation—preventing us from knowing the answer.
$\mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
--
The difference in the average untreated outcome between the treatment and control groups.
--
.hi-slate[*Selection bias*] The extent to which the "control group" provides a bad counterfactual for the treated individuals.
---
## Selection
Angrist and Pischke (MHE, p. 15),
> The goal of most empirical economic research is to overcome selection bias, and therefore to say something about the causal effect of a variable like $\text{D}_{i}$.
--
.hi-slate[Q] So how do experiments—the gold standard of empirical economic (and scientific) research—accomplish this goal and overcome selection bias?
---
name: experiments
## Back to experiments
.hi-slate[Q] How do experiments overcome selection bias?
--
.hi-slate[A] Experiments break the link between potential outcomes and treatment.
*In other words:* Randomly assigning $\text{D}_{i}$ makes $\text{D}_{i}$ independent of which outcome we observe (meaning $\color{#e64173}{\text{Y}_{1i}}$ or $\color{#6A5ACD}{\text{Y}_{0i}}$).
--
.hi-slate[Difference in means] with random assignment of $\text{D}_{i}$
$\mathop{E}\left[ \text{Y}_{i} \mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \text{Y}_{i}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
--
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right]$
--
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}}\mid \color{#e64173}{\text{D}_{i}=1} \right] - \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right]$
--
from random assignment of ${\text{D}_{i}}$
--
$= \mathop{E}\left[ \color{#e64173}{\text{Y}_{1i}} - \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#e64173}{\text{D}_{i}=1} \right]$
--
$= \mathop{E}\left[ \color{#FFA500}{\tau_i}\mid \color{#e64173}{\text{D}_{i}=1} \right]$
--
$= \mathop{E}\left[ \color{#FFA500}{\tau_i} \right]$
--
Random assignment of $\text{D}_{i}$ breaks selection bias.
---
## Randomly assigned treatment
The key to avoiding selection bias: .hi-slate[random assignment of treatment]
--
(or .slate[*as-good-as random assignment*], _e.g._, natural experiments).
--
Random assignment of treatment gives us
$$
\begin{align}
\mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}}\mid \color{#6A5ACD}{\text{D}_{i}=0} \right] = \mathop{E}\left[ \color{#6A5ACD}{\text{Y}_{0i}} \mid \color{#e64173}{\text{D}_{i}=1} \right]
\end{align}
$$
meaning the control group's mean now provides a good counterfactual for the treatment group's mean.
--
In other words, there is no selection bias, _i.e._,