class: center, middle, inverse, title-slide

# Learning from Observational Data
## EC 350: Labor Economics
### <a href="https://kyleraze.com">Kyle Raze</a>
### Winter 2022

---

# Learning from Observational Data

1. A taxonomy of data
    - Experimental *vs.* observational data
2. Directed acyclic graphs
    - Causal paths
    - Backdoor paths
    - Backdoor criterion
3. Regression discontinuity

---
class: inverse, middle

# A taxonomy of data

---
# A taxonomy of data

## .hi-pink[Experimental]

Data generated from a .hi-pink[randomized] experiment.

- Treatment assigned at .hi-pink[random]
- The **gold standard** of social science research
- Often difficult/impractical/unethical to conduct

## .hi-purple[Observational (non-experimental)]

Data generated from the .hi-purple[decisions] of various individuals in the "real world."

- Treatment is occasionally assigned at random (*e.g.,* by lottery), but usually it is not .hi-purple[(non-random!)]
- Prone to selection bias and omitted-variable bias
- Must rely on natural experiments to identify causal relationships

---
# A taxonomy of data

## **Example: Effect of job training on unemployment status**

.pull-left[
### .hi-pink[Experimental sample]

<style>
.remark-slide table{
        width: 100%;
    }

/* Change the background color to white for shaded rows (even rows) */

.remark-slide thead, .remark-slide tr:nth-child(2n) {
        background-color: white;
    }
</style>

<table>
<caption>
<br>**Unemployed?** (.mono[=] 1 if yes, .mono[=] 0 if no)</caption>
 <thead>
  <tr>
   <th style="text-align:left;color: #708090 !important;">  </th>
   <th style="text-align:center;color: #708090 !important;"> 1 </th>
   <th style="text-align:center;color: #708090 !important;"> 2 </th>
   <th style="text-align:center;color: #708090 !important;"> 3 </th>
   <th style="text-align:center;color: #708090 !important;"> 4 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> .hi-pink[Training?] </td>
   <td style="text-align:center;color: #272822 !important;"> -0.111 </td>
   <td style="text-align:center;color: #272822 !important;"> -0.116 </td>
   <td style="text-align:center;color: #272822 !important;"> -0.115 </td>
   <td style="text-align:center;color: #272822 !important;"> -0.113 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.044) </td>
   <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.044) </td>
   <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.044) </td>
   <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.044) </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Control mean** </td>
   <td style="text-align:center;color: #272822 !important;"> 0.354 </td>
   <td style="text-align:center;color: #272822 !important;"> 0.354 </td>
   <td style="text-align:center;color: #272822 !important;"> 0.354 </td>
   <td style="text-align:center;color: #272822 !important;"> 0.354 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Demographics** </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Education** </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Unemployed?.sub[t-1]** </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
  </tr>
</tbody>
</table>
.smallest[*Note:* Standard errors in parentheses.]
]

.pull-right[
### .hi-purple[Non-experimental sample]

<style>
.remark-slide table{
        width: 100%;
    }

/* Change the background color to white for shaded rows (even rows) */

.remark-slide thead, .remark-slide tr:nth-child(2n) {
        background-color: white;
    }
</style>

<table>
<caption>
<br>**Unemployed?** (.mono[=] 1 if yes, .mono[=] 0 if no)</caption>
 <thead>
  <tr>
   <th style="text-align:left;color: #708090 !important;">  </th>
   <th style="text-align:center;color: #708090 !important;"> 1 </th>
   <th style="text-align:center;color: #708090 !important;"> 2 </th>
   <th style="text-align:center;color: #708090 !important;"> 3 </th>
   <th style="text-align:center;color: #708090 !important;"> 4 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> .hi-purple[Training?] </td>
   <td style="text-align:center;color: #272822 !important;"> 0.128 </td>
   <td style="text-align:center;color: #272822 !important;"> 0.164 </td>
   <td style="text-align:center;color: #272822 !important;"> 0.160 </td>
   <td style="text-align:center;color: #272822 !important;"> -0.182 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.025) </td>
   <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.027) </td>
   <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.027) </td>
   <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.027) </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Control mean** </td>
   <td style="text-align:center;color: #272822 !important;"> 0.115 </td>
   <td style="text-align:center;color: #272822 !important;"> 0.115 </td>
   <td style="text-align:center;color: #272822 !important;"> 0.115 </td>
   <td style="text-align:center;color: #272822 !important;"> 0.115 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Demographics** </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Education** </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Unemployed?.sub[t-1]** </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;">  </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
  </tr>
</tbody>
</table>
.smallest[*Note:* Standard errors in parentheses.]

]
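
---
# A taxonomy of data

The sign flip in the non-experimental sample is what selection bias can look like. Below is a minimal simulation sketch, not the data behind the tables: every number in it (the 10-point true effect, the enrollment probabilities, the role of prior unemployment) is a hypothetical assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(350)
n = 100_000

# Hypothetical data-generating process: prior unemployment raises current
# unemployment, and training lowers it by 10 percentage points for everyone.
prior_unemp = rng.random(n) < 0.3
true_effect = -0.10


def simulate_outcome(trained):
    """Draw current unemployment status (1 = unemployed) given training status."""
    p_unemp = 0.15 + 0.40 * prior_unemp + true_effect * trained
    return (rng.random(n) < p_unemp).astype(float)


def diff_in_means(y, d):
    """Mean outcome of the trained minus mean outcome of the untrained."""
    return y[d].mean() - y[~d].mean()


# Experimental sample: a coin flip decides who gets training.
d_random = rng.random(n) < 0.5
est_random = diff_in_means(simulate_outcome(d_random), d_random)

# Observational sample: previously unemployed people are far more likely to enroll.
d_selected = rng.random(n) < np.where(prior_unemp, 0.8, 0.1)
est_selected = diff_in_means(simulate_outcome(d_selected), d_selected)

print(f"Randomized assignment: {est_random:+.3f}")
print(f"Self-selected sample:  {est_selected:+.3f}")
```

Under these assumptions, the randomized comparison should land near the true effect, while the self-selected comparison should come out positive because the trained group starts out with worse employment histories.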

---
class: inverse, middle

# Directed acyclic graphs

---
# Directed acyclic graphs

.pull-left[

A directed acyclic graph (DAG) can help us visualize the assumptions necessary to estimate causal relationships using observational data.

.hi-pink[Nodes] represent .hi-pink[variables].

.hi-black[Arrows] represent .hi-black[causal relationships] between variables.

]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

]

---
# DAGs follow two rules
.pull-left[
.center[**Rule 1 ("directed"):** No bidirectional arrows!]

<img src="05-Observational_Data_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />
.center[.hi-red[Illegal!]]
]
--
.pull-right[
.center[**Rule 2 ("acyclic"):** No feedback loops!]

<img src="05-Observational_Data_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />
.center[.hi-red[Illegal!]]
]

---
# Causal paths

Our objective is to **identify the causal effect** of a treatment variable .hi[D] on an outcome variable .hi[Y].

- The treatment could have a **direct effect** on the outcome: .hi[D] .mono[-->] .hi[Y].
- Alternatively, the treatment could have an **indirect effect** on the outcome through .hi[X], a mediator variable: .hi[D] .mono[-->] .hi[X] .mono[-->] .hi[Y].

---
# Backdoor paths

The presence of a confounder variable .hi[W] opens a **backdoor path** from the treatment to the outcome:<br>.center[.hi[D] .mono[<--] .hi[W] .mono[-->] .hi[Y]]

An open backdoor path creates a **spurious correlation** between the treatment and the outcome!
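
To see where the spurious correlation comes from, suppose (as a simplifying assumption that goes beyond the DAG itself) that the relationships are linear and that `\(\varepsilon\)`, `\(\nu\)`, and .hi[W] are mutually uncorrelated:
.center[`\(Y = \beta D + \gamma W + \varepsilon, \qquad D = \delta W + \nu\)`]

The slope from comparing .hi[Y] across values of .hi[D] without controlling for .hi[W] is then
.center[`\(\dfrac{\text{Cov}(D, Y)}{\text{Var}(D)} = \beta + \gamma \delta \, \dfrac{\text{Var}(W)}{\text{Var}(D)}\)`]

The first term is the causal effect. The second term is the contribution of the backdoor path, which vanishes only if .hi[W] does not affect the treatment (`\(\delta = 0\)`) or does not affect the outcome (`\(\gamma = 0\)`).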

---
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** How does education affect earnings?

- .hi[D] .mono[=] Education (*e.g.,* going to college or not)
- .hi[Y] .mono[=] Earnings as an adult
- .hi[PE] .mono[=] Parental education
- .hi[I] .mono[=] Family income
- .hi[U] .mono[=] Unobserved characteristics (*e.g.,* family background)

]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />
]

---
count: false
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** How does education affect earnings?

The presence&mdash;*or absence*&mdash;of an arrow illustrates our **causal assumptions** about how education affects earnings!
]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />
]

---
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** What are the paths through which education affects earnings?

]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />
]

---
count: false
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** What are the paths through which education affects earnings?

- .hi[D] .mono[-->] .hi[Y] (causal effect)
- .hi[D] .mono[<--] .hi[I] .mono[-->] .hi[Y] (backdoor path)
- .hi[D] .mono[<--] .hi[PE] .mono[-->] .hi[I] .mono[-->] .hi[Y] (backdoor path)
- .hi[D] .mono[<--] .hi[U] .mono[-->] .hi[PE] .mono[-->] .hi[I] .mono[-->] .hi[Y] (backdoor path)

]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />
]
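
---
# Backdoor paths

Listing paths by hand gets tedious for larger graphs. The sketch below enumerates them mechanically. The edge list is an assumption, read off the four paths listed above rather than taken from the figure's source, and the helper names are hypothetical.

```python
# Assumed arrows (cause, effect), inferred from the four paths on the slide.
EDGES = {("D", "Y"), ("I", "D"), ("I", "Y"), ("PE", "D"),
         ("PE", "I"), ("U", "D"), ("U", "PE")}


def all_paths(edges, start, end):
    """Enumerate simple paths from start to end, ignoring arrow direction."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    paths = []

    def walk(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbors[path[-1]]):
            if nxt not in path:  # never revisit a node
                walk(path + [nxt])

    walk([start])
    return paths


def render(path, edges):
    """Write a path with each arrow pointing from cause to effect."""
    pieces = [path[0]]
    for a, b in zip(path, path[1:]):
        pieces += ["-->" if (a, b) in edges else "<--", b]
    return " ".join(pieces)


def classify(path, edges):
    """Backdoor paths begin with an arrow pointing into the treatment."""
    if (path[1], path[0]) in edges:
        return "backdoor path"
    if all((a, b) in edges for a, b in zip(path, path[1:])):
        return "causal path"
    return "other non-causal path"


labeled = [(render(p, EDGES), classify(p, EDGES))
           for p in all_paths(EDGES, "D", "Y")]
for text, kind in labeled:
    print(f"{text}  ({kind})")
```

Adding or removing a tuple in `EDGES` changes the causal assumptions, and the list of paths changes with it.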

---
# Backdoor paths

## **Backdoor criterion**

> The observed correlation between .hi[Y] and .hi[D] isolates the causal effect of .hi[D] on .hi[Y] if and only if all backdoor paths from .hi[D] to .hi[Y] are closed.

**Q:** What closes a backdoor path?

- **A.sub[1]:** *Conditioning* or *controlling for* the confounder variable on the path.
--

- **A.sub[2]:** The presence of a collider variable on the path.

---
# Backdoor paths

The presence of a collider variable .hi[C] closes a backdoor path from the treatment to the outcome:<br>.center[.hi[D] .mono[-->] .hi[C] .mono[<--] .hi[Y]]

**The implication?** We don't want to control for collider variables!

- Conditioning on a collider can open up new backdoor paths. (More on this later.)
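
---
# Backdoor paths

A small simulation can make the collider warning concrete. This is a sketch under assumed linear relationships with made-up coefficients: .hi[D] and .hi[Y] are unrelated by construction, and both cause .hi[C].

```python
import numpy as np

rng = np.random.default_rng(350)
n = 100_000

# D and Y are independent by construction; both feed into the collider C.
d = rng.normal(size=n)
y = rng.normal(size=n)
c = d + y + rng.normal(size=n)


def ols(outcome, *regressors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(outcome)), *regressors])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef


slope_without_c = ols(y, d)[1]    # the path through C is closed
slope_with_c = ols(y, d, c)[1]    # conditioning on C opens the path

print(f"Slope on D, not controlling for C: {slope_without_c:+.3f}")
print(f"Slope on D, controlling for C:     {slope_with_c:+.3f}")
```

Without the control, the slope should be close to zero, which is the truth here. With the control, a sizable negative association should appear even though .hi[D] has no effect on .hi[Y].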

---
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** How could we satisfy the backdoor criterion given our assumptions about the effect of education on earnings?

]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />
]

---
count: false
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** How could we satisfy the backdoor criterion given our assumptions about the effect of education on earnings?

**A:** Control for family income (.hi[I])

- **Why?** Family income appears as a non-collider on each backdoor path:<br>.center[.hi[D] .mono[<--] .hi[I] .mono[-->] .hi[Y]] .center[.hi[D] .mono[<--] .hi[PE] .mono[-->] .hi[I] .mono[-->] .hi[Y]] .center[.hi[D] .mono[<--] .hi[U] .mono[-->] .hi[PE] .mono[-->] .hi[I] .mono[-->] .hi[Y]]

]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />
]

---
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** Would controlling for family income isolate the causal effect of education on earnings if unobserved family background (.hi[U]) has a direct effect on earnings (.hi[Y])?

]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />
]

---
count: false
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** Would controlling for family income isolate the causal effect of education on earnings if unobserved family background (.hi[U]) has a direct effect on earnings (.hi[Y])?

**A:** No!

- .hi[U] is unobserved, so we can't control for it.
- The backdoor path .hi[D] .mono[<--] .hi[U] .mono[-->] .hi[Y] would stay open.
]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />
]

---
count: false
# Backdoor paths

## **Example: Returns to education**

.pull-left[
**Q:** Would controlling for family income isolate the causal effect of education on earnings if unobserved family background (.hi[U]) has a direct effect on earnings (.hi[Y])?

**A:** No!

- .hi[U] is unobserved, so we can't control for it.
- The backdoor path .hi[D] .mono[<--] .hi[U] .mono[-->] .hi[Y] would stay open.

**The takeaway?**<br>.hi-pink[ALL causal inference is by assumption!]
]
.pull-right[
<img src="05-Observational_Data_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" />
]
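
---
# Backdoor paths

The two versions of the returns-to-education DAG can be compared in a simulation. The sketch below assumes linear relationships with every coefficient set to 1, which is a hypothetical choice made only for illustration. The switch `u_to_y` turns the direct arrow from .hi[U] to .hi[Y] on or off.

```python
import numpy as np

rng = np.random.default_rng(350)
n = 200_000
BETA = 1.0  # hypothetical true effect of education (D) on earnings (Y)


def slope_on_d(y, d, *controls):
    """Coefficient on D from a least-squares regression of Y on D and controls."""
    X = np.column_stack([np.ones(len(y)), d, *controls])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


def simulate(u_to_y):
    """Generate data from the assumed DAG and estimate the effect two ways."""
    u = rng.normal(size=n)                 # unobserved family background
    pe = u + rng.normal(size=n)            # U --> PE
    inc = pe + rng.normal(size=n)          # PE --> I
    d = inc + pe + u + rng.normal(size=n)  # I, PE, U --> D
    y = BETA * d + inc + u_to_y * u + rng.normal(size=n)
    return {
        "no controls": slope_on_d(y, d),
        "control for I": slope_on_d(y, d, inc),
    }


original_dag = simulate(u_to_y=0.0)  # U affects Y only through PE and I
modified_dag = simulate(u_to_y=1.0)  # U also affects Y directly

for name, est in [("Original DAG", original_dag), ("With U --> Y", modified_dag)]:
    for spec, value in est.items():
        print(f"{name:13s} {spec:14s} {value:.3f}")
```

Under the original assumptions, controlling for family income should recover an estimate close to `BETA`. Once .hi[U] affects earnings directly, the same regression should overstate the effect, and no control that we can observe fixes it.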

---
class: inverse, middle

# Regression discontinuity

---
# Regression discontinuity

There are situations in the real world where treatment is assigned in a way that is **as good as random.**

- These situations can provide **valid comparison groups**, just like the ones you'd find in a randomized controlled trial!

**Examples?** When some arbitrary threshold triggers a change in treatment:

- Anti-discrimination laws only apply to firms with more than 15 employees.
- Prisoners are eligible for early parole if some score exceeds a threshold.
- An individual has legal access to alcohol if they are 21 or older.
- You get a ticket if your speed exceeds the speed limit.
- A candidate for governor wins if her vote share exceeds that of her competitors.

Economists can (and often do) use these situations to estimate causal effects.

---
# Regression discontinuity

**Example:** Effect of merit scholarships on graduation

- Outcome variable .mono[=] probability of graduation
- Treatment .mono[=] scholarship money
- "Assignment variable" .mono[=] admissions test score (*e.g.,* the SAT)
- "Cutoff/threshold"  .mono[=] minimum score for getting a scholarship (*e.g.,* SAT score of 1200 or higher)

**Assumption:** Students *just below* the cutoff are comparable to those *just above* the cutoff.

---
layout: true
class: clear-slide

---
Let's start with potential graduation rates: `\(\color{#9370DB}{\mathop{E}\left[ \text{Y}_{0,i} \mid \text{SAT}_{i} \right]}\)`

<img src="05-Observational_Data_files/figure-html/s1-1.svg" style="display: block; margin: auto;" />
---
count: false
Let's start with potential graduation rates: `\(\color{#9370DB}{\mathop{E}\left[ \text{Y}_{0,i} \mid \text{SAT}_{i} \right]}\)` and `\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{1,i} \mid \text{SAT}_{i} \right]}\)`.

<img src="05-Observational_Data_files/figure-html/s2-1.svg" style="display: block; margin: auto;" />
---
You only get a scholarship if your .hi-slate[SAT score exceeds the cutoff score].

<img src="05-Observational_Data_files/figure-html/s3-1.svg" style="display: block; margin: auto;" />
---
`\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{1,i} \mid \text{SAT}_{i} = 1200 \right]} - \color{#9370DB}{\mathop{E}\left[ \text{Y}_{0,i} \mid \text{SAT}_{i} = 1200 \right]}\)` gives the .hi-orange[causal effect] .hi-slate[at the cutoff].

<img src="05-Observational_Data_files/figure-html/s4-1.svg" style="display: block; margin: auto;" />
---

Using real data, researchers have to estimate `\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{1,i} \mid \text{SAT}_{i} \right]}\)` and `\(\color{#9370DB}{\mathop{E}\left[ \text{Y}_{0,i} \mid \text{SAT}_{i} \right]}\)`.
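
The difficulty is that each student is observed on only one side of the cutoff. If we assume that both potential-outcome curves are continuous at the cutoff (a formal version of the comparability assumption), the effect at the cutoff can be written using observed outcomes only. Here `\(\tau\)` is notation introduced for this sketch, and `\(\text{Y}_{i}\)` is the observed outcome:

.center[`\(\tau = \lim_{s \downarrow 1200} \mathop{E}\left[ \text{Y}_{i} \mid \text{SAT}_{i} = s \right] - \lim_{s \uparrow 1200} \mathop{E}\left[ \text{Y}_{i} \mid \text{SAT}_{i} = s \right]\)`]

The first limit approaches the cutoff from above, where everyone is treated. The second approaches from below, where no one is.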

---
One way to estimate the .hi-orange[jump] is to estimate a regression on each side of the cutoff.

<img src="05-Observational_Data_files/figure-html/s6-1.svg" style="display: block; margin: auto;" />
---
count: false
One way to estimate the .hi-orange[jump] is to estimate a regression on each side of the cutoff.

<img src="05-Observational_Data_files/figure-html/s7-1.svg" style="display: block; margin: auto;" />
---
Another way is to estimate regressions using only data closer to the cutoff.

<img src="05-Observational_Data_files/figure-html/s8-1.svg" style="display: block; margin: auto;" />
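
---
Both approaches can be sketched on simulated data. Everything in the data-generating process below is a hypothetical assumption (uniform scores, a linear graduation curve, a 10-point jump), and `rd_jump` is an illustrative helper rather than a standard library function.

```python
import numpy as np

rng = np.random.default_rng(350)
n = 200_000
CUTOFF = 1200
TRUE_JUMP = 0.10  # hypothetical effect of the scholarship at the cutoff

sat = rng.uniform(800, 1600, size=n)
scholarship = sat >= CUTOFF
p_grad = 0.30 + 0.0005 * (sat - 800) + TRUE_JUMP * scholarship
graduated = (rng.random(n) < p_grad).astype(float)


def rd_jump(x, y, cutoff, bandwidth=None):
    """Fit a line on each side of the cutoff and compare the two fits at the cutoff.

    If bandwidth is given, only observations within that distance of the
    cutoff are used.
    """
    if bandwidth is None:
        keep = np.ones_like(x, dtype=bool)
    else:
        keep = np.abs(x - cutoff) <= bandwidth
    # Center the score so each intercept is the fitted value at the cutoff.
    x_centered, y_kept = x[keep] - cutoff, y[keep]
    above = x_centered >= 0
    # np.polyfit returns [slope, intercept] for deg=1.
    fit_above = np.polyfit(x_centered[above], y_kept[above], deg=1)[1]
    fit_below = np.polyfit(x_centered[~above], y_kept[~above], deg=1)[1]
    return fit_above - fit_below


jump_full = rd_jump(sat, graduated, CUTOFF)
jump_local = rd_jump(sat, graduated, CUTOFF, bandwidth=100)

print(f"All data, one line per side:  {jump_full:.3f}")
print(f"Within 100 points of cutoff:  {jump_local:.3f}")
```

Because the simulated curve really is linear, both estimates should be close to `TRUE_JUMP`. The narrower sample uses fewer observations, so its estimate is noisier. With a curved relationship, the trade-off runs the other way.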
---
Different choices of samples and models can lead to different estimates of the treatment effect!

<img src="05-Observational_Data_files/figure-html/s9-1.svg" style="display: block; margin: auto;" />
---
Different choices of samples and models can lead to different estimates of the treatment effect!

<img src="05-Observational_Data_files/figure-html/s10-1.svg" style="display: block; margin: auto;" />
---
Different choices of samples and models can lead to different estimates of the treatment effect!

<img src="05-Observational_Data_files/figure-html/s11-1.svg" style="display: block; margin: auto;" />
---
Different choices of samples and models can lead to different estimates of the treatment effect!

<img src="05-Observational_Data_files/figure-html/s12-1.svg" style="display: block; margin: auto;" />
---
Different choices of samples and models can lead to different estimates of the treatment effect!

<img src="05-Observational_Data_files/figure-html/s13-1.svg" style="display: block; margin: auto;" />
---
Different choices of samples and models can lead to different estimates of the treatment effect!

<img src="05-Observational_Data_files/figure-html/s14-1.svg" style="display: block; margin: auto;" />
---
Some modeling choices can find an effect even if none exists!

<img src="05-Observational_Data_files/figure-html/s15a-1.svg" style="display: block; margin: auto;" />
---
count: false
Some modeling choices can find an effect even if none exists!

<img src="05-Observational_Data_files/figure-html/s15b-1.svg" style="display: block; margin: auto;" />
---
count: false
Some modeling choices can find an effect even if none exists!

<img src="05-Observational_Data_files/figure-html/s15c-1.svg" style="display: block; margin: auto;" />
---
count: false
Some modeling choices can find an effect even if none exists!

<img src="05-Observational_Data_files/figure-html/s15d-1.svg" style="display: block; margin: auto;" />
---
count: false
Some modeling choices can find an effect even if none exists!

---
layout: false
# Regression discontinuity

**Q:** When should we trust a regression discontinuity comparison?

- When is the comparison *internally valid*?

**A:** When we believe that **treatment is the only thing that changes** (other than observed outcomes) at the cutoff.

1. We don't want to see evidence of people **bunching** on one side of the threshold.
    - This could mean that people are **manipulating the assignment variable** near the cutoff so that they get the treatment.
    - Example: cheating among students who anticipate being close to the cutoff as a way to increase their score just enough to get the scholarship.
2. We don't want to see a **"jump" in other variables** at the cutoff.
    - This would mean that people on one side of the cutoff are **no longer comparable** to people on the other side!

---
# Regression discontinuity

**Q:** How can we tell if the treatment actually has a causal effect on the outcome?

**A:** We can conclude that the treatment has an effect if **all three** of the statements below are true.

1. We believe that the regression discontinuity comparison is **internally valid.**
2. We can see that the **outcome variable "jumps"** at the cutoff ***when we look at the raw data.***
3. The estimate of the "jump" is **precise enough** to conclude that the effect is statistically significant.
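
---
# Regression discontinuity

The third condition can be checked by attaching a standard error to the estimated jump. The sketch below uses simulated data with a hypothetical 10-point jump and classical (homoskedastic) standard errors, an assumption made for brevity. Applied work would usually report robust standard errors.

```python
import numpy as np

rng = np.random.default_rng(350)
n = 20_000
CUTOFF, TRUE_JUMP = 1200, 0.10  # hypothetical values

sat = rng.uniform(800, 1600, size=n)
treated = (sat >= CUTOFF).astype(float)
p_grad = 0.30 + 0.0005 * (sat - 800) + TRUE_JUMP * treated
graduated = (rng.random(n) < p_grad).astype(float)

# One regression with a separate slope on each side of the cutoff:
# graduated = a + jump * treated + b * x + c * treated * x, where x = sat - cutoff
x = sat - CUTOFF
X = np.column_stack([np.ones(n), treated, x, treated * x])
coef, *_ = np.linalg.lstsq(X, graduated, rcond=None)
resid = graduated - X @ coef

# Classical variance estimate: sigma^2 * (X'X)^{-1}
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)

jump = coef[1]
se = np.sqrt(cov[1, 1])
t_stat = jump / se

print(f"Estimated jump: {jump:.3f} (SE {se:.3f}), t = {t_stat:.1f}")
```

A common rule of thumb treats an estimate as statistically significant at the 5 percent level when the absolute value of the t-statistic exceeds 1.96.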