The Fundamental Problem of Econometrics

# The Fundamental Problem of Econometrics
## EC 320: Introduction to Econometrics
### Philip Economides
### Winter 2022

---

# Housekeeping

* First Quiz available from midnight, __due 11:59pm January 17th__

* Consists of 8 questions, a mix of multiple-choice and true/false, no calculations

* Reminder: Problem Set 2 is online, along with the necessary data. It has analytical and computational portions and is __due 5:00pm January 24th__

* This lecture will be relevant to all the work described above

* Next lecture is _intended_ to be in person

---

# Prologue

---

# Statistics Inform Policy

__Policy:__ In 2017, the University of Oregon started requiring first-year students to live on campus.

__Rationale:__ First-year students who live on campus fare better than those who live off campus.

- _80 percent more likely_ to graduate in four years.
- Second-year retention rate _5 percentage points higher_.
- GPAs _0.13 points higher_, on average.

---
# Other Things Equal

The UO's interpretation of those comparisons warrants skepticism.

- The decision to live on campus is probably related to family wealth and interest in school.

- Family wealth and interest in school are also related to academic achievement.

__Why?__ The difference in outcomes between those on and off campus is not an .hi-purple[_all else equal_] comparison.

__Upshot:__ We can't attribute the difference in outcomes solely to living on campus.

---
# Other Things Equal

## A high bar

When all other factors are held constant, statistical comparisons detect causal relationships.

(Micro)economics has developed a comparative advantage in understanding where .hi-purple[_all else equal_] comparisons can and cannot be made.

- Anyone can retort "_correlation doesn't necessarily imply causation_."

- Understanding _why_ is difficult, but useful for learning from data.

---
class: inverse, middle

# The Fundamental Problem of Econometrics

---
# Causal Identification

## Goal

Identify the effect of a .hi[treatment], `$D_i$`, on individual `$i$`'s .hi[outcome], `$Y_{D,i}$`.

## Ideal data

Ideally, we could calculate the .hi[treatment effect] *for each individual* `$i$` as

`$$Y_{1,i} - Y_{0,i}$$`

- `$Y_{1,i}$` is the outcome for person `$i$` when she receives the treatment.
- `$Y_{0,i}$` is the outcome for person `$i$` when she does not receive the treatment.
- Known as .pink[potential outcomes].

---
# Causal Identification

## Ideal data

.pull-left[

```
#>     i trt  y1i  y0i
#> 1   1   1 5.01 2.56
#> 2   2   1 8.85 2.53
#> 3   3   1 6.31 2.67
#> 4   4   1 5.97 2.79
#> 5   5   1 7.61 4.34
#> 6   6   0 7.63 4.15
#> 7   7   0 4.75 0.56
#> 8   8   0 5.77 3.52
#> 9   9   0 7.47 4.49
#> 10 10   0 7.79 1.40
```
]

.pull-right[
Calculate the causal effect of treatment.
$$
`\begin{align}
  \tau_i = y_{1,i} -  y_{0,i}
\end{align}`
$$
for each individual `$i$`.
]
---
count: false
# Causal Identification

## Ideal data

.pull-left[

```
#>     i trt  y1i  y0i effect_i
#> 1   1   1 5.01 2.56     2.45
#> 2   2   1 8.85 2.53     6.32
#> 3   3   1 6.31 2.67     3.64
#> 4   4   1 5.97 2.79     3.18
#> 5   5   1 7.61 4.34     3.27
#> 6   6   0 7.63 4.15     3.48
#> 7   7   0 4.75 0.56     4.19
#> 8   8   0 5.77 3.52     2.25
#> 9   9   0 7.47 4.49     2.98
#> 10 10   0 7.79 1.40     6.39
```
]

.pull-right[
Calculate the causal effect of treatment.
$$
`\begin{align}
  \tau_i = y_{1,i} -  y_{0,i}
\end{align}`
$$
for each individual `$i$`.
]
---
count: false
# Causal Identification

## Ideal data

.pull-right[
Calculate the causal effect of treatment.
$$
`\begin{align}
  \tau_i = y_{1,i} -  y_{0,i}
\end{align}`
$$
for each individual `$i$`.

The mean of `$\tau_i$` is the .hi[average treatment effect] (.pink[ATE]).

Thus, `$\color{#e64173}{\overline{\tau} = 3.82}$`
]

Notice that the assignment of treatment does not influence the ATE. With both potential outcomes known for everyone, information on who was treated is irrelevant.
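
The calculation can be reproduced from the ten-person table. The sketch below uses Python for illustration; the variable names are hypothetical, and the numbers are copied from the table.

```python
# Potential outcomes for the ten people in the ideal-data table.
y1 = [5.01, 8.85, 6.31, 5.97, 7.61, 7.63, 4.75, 5.77, 7.47, 7.79]
y0 = [2.56, 2.53, 2.67, 2.79, 4.34, 4.15, 0.56, 3.52, 4.49, 1.40]

# Individual treatment effects: tau_i = y1_i - y0_i (rounded like the table).
tau = [round(a - b, 2) for a, b in zip(y1, y0)]

# Average treatment effect: the mean of the individual effects.
ate = sum(tau) / len(tau)

print(tau)
print(f"{ate:.3f}")
```

The treatment indicator never enters the calculation. The mean works out to 3.815, which the slide reports as 3.82.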

---
# Fundamental Problem of Econometrics

## Ideal comparison
$$
`\begin{align}
  \tau_i = \color{#e64173}{y_{1,i}} &- \color{#9370DB}{y_{0,i}}
\end{align}`
$$

This comparison highlights the fundamental problem of econometrics, much like a traveller weighing two separate roads when only one can be taken.

## The problem

- If we observe `$\color{#e64173}{y_{1,i}}$`, then we cannot observe `$\color{#9370DB}{y_{0,i}}$`.

- If we observe `$\color{#9370DB}{y_{0,i}}$`, then we cannot observe `$\color{#e64173}{y_{1,i}}$`.

- Can only observe what actually happened; cannot observe the **counterfactual**.

The traveller's alternative outcome is forever unknown to them.

---
# Fundamental Problem of Econometrics

A dataset that we can observe for 10 people looks something like
.pull-left[

```
#>     i trt  y1i  y0i
#> 1   1   1 5.01   NA
#> 2   2   1 8.85   NA
#> 3   3   1 6.31   NA
#> 4   4   1 5.97   NA
#> 5   5   1 7.61   NA
#> 6   6   0   NA 4.15
#> 7   7   0   NA 0.56
#> 8   8   0   NA 3.52
#> 9   9   0   NA 4.49
#> 10 10   0   NA 1.40
```
]

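.pull-right[
We cannot observe both `$\color{#e64173}{y_{1,i}}$` and `$\color{#9370DB}{y_{0,i}}$` for the same person.

But we do observe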
- `$\color{#e64173}{y_{1,i}}$` for `$i$` in 1, 2, 3, 4, 5
- `$\color{#9370DB}{y_{0,i}}$` for `$i$` in 6, 7, 8, 9, 10

]

**Q:** How do we "fill in" the `NA`s and estimate `$\overline{\tau}$`?
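
As a rough sketch of how the observed table arises from the ideal one, the Python snippet below masks each person's counterfactual. `None` stands in for R's `NA`, and the variable names are hypothetical.

```python
# Ideal (unobservable) data: both potential outcomes for each person.
y1 = [5.01, 8.85, 6.31, 5.97, 7.61, 7.63, 4.75, 5.77, 7.47, 7.79]
y0 = [2.56, 2.53, 2.67, 2.79, 4.34, 4.15, 0.56, 3.52, 4.49, 1.40]
trt = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# What we actually see: y1 for the treated, y0 for the untreated.
y1_obs = [a if d == 1 else None for a, d in zip(y1, trt)]
y0_obs = [b if d == 0 else None for b, d in zip(y0, trt)]

# An individual effect needs both outcomes, and one is always missing.
computable = [a is not None and b is not None for a, b in zip(y1_obs, y0_obs)]

print(y1_obs)
print(y0_obs)
print(any(computable))
```

No row has both outcomes, so no individual effect can be computed directly.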

---
# Estimating Causal Effects

**Notation:** `$D_i$` is a binary indicator variable such that

- `$\color{#e64173}{D_i=1}$` .pink[if individual] `$\color{#e64173}{i}$` .pink[is treated].
- `$\color{#9370DB}{D_i=0}$` .purple[if individual] `$\color{#9370DB}{i}$` .purple[is not treated (*control* group).]

Then, rephrasing the previous slide,

- We only observe `$\color{#e64173}{y_{1,i}}$` when `$\color{#e64173}{D_{i}=1}$`.
- We only observe `$\color{#9370DB}{y_{0,i}}$` when `$\color{#9370DB}{D_{i}=0}$`.

**Q:** How can we estimate `$\overline{\tau}$` using only `$\left(\color{#e64173}{y_{1,i}|D_i=1}\right)$` and `$\left(\color{#9370DB}{y_{0,i}|D_i=0}\right)$`?
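
One compact way to write this observation rule, assuming `$y_i$` denotes the outcome we actually record for individual `$i$`, is

$$
`\begin{align}
  y_i = D_i \, \color{#e64173}{y_{1,i}} + (1 - D_i) \, \color{#9370DB}{y_{0,i}}
\end{align}`
$$

Setting `$D_i = 1$` returns `$\color{#e64173}{y_{1,i}}$`, and setting `$D_i = 0$` returns `$\color{#9370DB}{y_{0,i}}$`.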

---
# Estimating Causal Effects

**Q:** How can we estimate `$\overline{\tau}$` using only `$\left(\color{#e64173}{y_{1,i}|D_i=1}\right)$` and `$\left(\color{#9370DB}{y_{0,i}|D_i=0}\right)$`?

**Idea:** What if we compare the two groups' means across our `$n$` people? _I.e._,
$$
`\begin{aligned}
  \text{Difference-in-means} =&\color{#e64173}{\mathop{Avg_n}\left( y_i\mid D_i = 1 \right)} - \color{#9370DB}{\mathop{Avg_n}\left( y_i\mid D_i =0 \right)}\\
  =&\color{#e64173}{\mathop{Avg_n}\left( y_{1,i}\mid D_i = 1 \right)} - \color{#9370DB}{\mathop{Avg_n}\left( y_{0,i}\mid D_i =0 \right)}
\end{aligned}`
$$

**Q:** When does a simple difference-in-means provide information on the .hi-slate[causal effect] of the treatment?

**Q.sub[2.0]:** Is `$\color{#e64173}{\mathop{Avg}\left( y_i\mid D_i = 1 \right)} - \color{#9370DB}{\mathop{Avg}\left( y_i\mid D_i =0 \right)}$` a *good* estimator for `$\overline{\tau}$`?
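
For a concrete number, the difference-in-means can be computed from the observed ten-person data. This is a minimal Python sketch with hypothetical variable names; `y_obs` holds the one outcome recorded for each person.

```python
from statistics import mean

# Observed outcomes: y1 for persons 1-5 (treated), y0 for persons 6-10 (control).
y_obs = [5.01, 8.85, 6.31, 5.97, 7.61, 4.15, 0.56, 3.52, 4.49, 1.40]
trt = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Group means, conditional on treatment status.
treated_mean = mean(y for y, d in zip(y_obs, trt) if d == 1)
control_mean = mean(y for y, d in zip(y_obs, trt) if d == 0)

diff_in_means = treated_mean - control_mean

print(f"{treated_mean:.3f} - {control_mean:.3f} = {diff_in_means:.3f}")
```

The result is about 3.93, close to but not equal to the ATE of 3.82 from the ideal data.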

---
# Estimating Causal Effects

**Assumption:** Let `$\tau_i = \tau$` for all `$i$`.

- The treatment effect is equal (constant) across all individuals `$i$`.

**Note:** We defined

$$
`\begin{align}
  \tau_i = \tau = \color{#e64173}{y_{1,i}} - \color{#9370DB}{y_{0,i}}
\end{align}`
$$

which implies

$$
`\begin{align}
   \color{#e64173}{y_{1,i}} = \color{#9370DB}{y_{0,i}} + \tau
\end{align}`
$$

---
class: clear-slide

**Q:** Is `$\color{#e64173}{\mathop{Avg}\left( y_i\mid D_i = 1 \right)} - \color{#9370DB}{\mathop{Avg}\left( y_i\mid D_i =0 \right)}$` a *good* estimator for `$\tau$`?

Difference-in-means
--
 `$\quad \color{#ffffff}{\Bigg|}=\color{#e64173}{\mathop{Avg}\left( y_i\mid D_i = 1 \right)} - \color{#9370DB}{\mathop{Avg}\left( y_i\mid D_i =0 \right)}$`
--
 `$\quad \color{#ffffff}{\Bigg|}=\color{#e64173}{\mathop{Avg}\left( y_{1,i}\mid D_i = 1 \right)} - \color{#9370DB}{\mathop{Avg}\left( y_{0,i}\mid D_i =0 \right)}$`
--
 `$\quad \color{#ffffff}{\Bigg|}=\color{#e64173}{\mathop{Avg}\left( \color{#000000}{\tau \: +} \: \color{#9370DB}{y_{0,i}} \mid D_i = 1 \right)} - \color{#9370DB}{\mathop{Avg}\left( y_{0,i}\mid D_i =0 \right)}$`
--
 `$\quad \color{#ffffff}{\Bigg|}=\tau + \color{#e64173}{\mathop{Avg}\left(\color{#9370DB}{y_{0,i}} \mid D_i = 1 \right)} - \color{#9370DB}{\mathop{Avg}\left( y_{0,i}\mid D_i =0 \right)}$`
--
 `$\quad \color{#ffffff}{\Bigg|}= \text{Average causal effect} + \color{#FD5F00}{\text{Selection bias}}$`

Our proposed difference-in-means estimator gives us the sum of

1. `$\tau$`, the .hi-slate[causal, average treatment effect] that we want.
2. .hi-orange[Selection bias:] How much the treatment and control groups would differ, on average, even without treatment (the gap in their `$\color{#9370DB}{y_{0,i}}$`).
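
The decomposition can be checked numerically with the ideal ten-person data, where both potential outcomes are known. This is a Python sketch with hypothetical names. One caveat: effects in that table are not constant across people, so the first term is the average effect among the treated, which equals `$\tau$` only under the constant-effect assumption.

```python
from statistics import mean

y1 = [5.01, 8.85, 6.31, 5.97, 7.61, 7.63, 4.75, 5.77, 7.47, 7.79]
y0 = [2.56, 2.53, 2.67, 2.79, 4.34, 4.15, 0.56, 3.52, 4.49, 1.40]
trt = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

treated = [i for i, d in enumerate(trt) if d == 1]
control = [i for i, d in enumerate(trt) if d == 0]

# Computable from observed data alone.
diff_in_means = mean(y1[i] for i in treated) - mean(y0[i] for i in control)

# Needs the unobservable y0 of the treated group.
selection_bias = mean(y0[i] for i in treated) - mean(y0[i] for i in control)

# Average effect among the treated (equals tau if effects are constant).
effect_on_treated = mean(y1[i] - y0[i] for i in treated)

print(f"{diff_in_means:.3f} = {effect_on_treated:.3f} + {selection_bias:.3f}")
```

Here the selection bias is 0.154: the treated group would have had slightly higher outcomes even without treatment.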

---
class: inverse, middle

# Randomized Control Trials

---
# Selection Bias

**Problem:** Existence of selection bias precludes *all else equal* comparisons.

- To make valid comparisons that yield causal effects, we need to shut down the bias term.

**Potential solution:** Conduct an experiment.

- How? .hi[Random assignment of treatment].

- Hence the name, .hi[*randomized* control trial] (RCT).

Groups need to be large enough to be comparable. As we saw when drawing small samples, estimates can land far from their population values. By the law of large numbers (LLN), as we increase `$n$` in both groups, the estimate from randomly assigned treatment is .hi-blue[more likely] to be close to the population value.
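
A small simulation can illustrate the sample-size point. Everything in this Python sketch is an assumption made for illustration: the function name `mean_y0_gap`, the normal distribution of untreated outcomes (mean 3, standard deviation 1.5), and the seed.

```python
import random
from statistics import fmean

def mean_y0_gap(n, seed=320):
    """Gap in average untreated outcomes between two randomly assigned groups."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    # Hypothetical population of untreated outcomes.
    y0 = [rng.gauss(3.0, 1.5) for _ in range(n)]
    rng.shuffle(y0)  # random assignment: shuffle, then split in half
    treated, control = y0[: n // 2], y0[n // 2:]
    return fmean(treated) - fmean(control)

# The gap is the selection-bias term; it should shrink as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, round(mean_y0_gap(n), 3))
```

With small `n` the gap can be sizeable by chance alone; with large `n` it should sit close to zero.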

---
# Randomized Control Trials

## Example: Effect of de-worming on attendance

**Motivation:** Intestinal worms are common among children in less-developed countries. The symptoms of these parasites can keep school-aged children at home, disrupting human capital accumulation.

**Policy Question:** Do school-based de-worming interventions provide a cost-effective way to increase school attendance?

---
# Randomized Control Trials

## Example: Effect of de-worming on attendance

**Research Question:** How much do de-worming interventions increase school attendance?

**Q:** Could we simply compare average attendance among children with and without access to de-worming medication?
--
 **A:** If we're after the causal effect, probably not.
--
 **Q:** Why not?
--
 **A:** Selection bias: Families with access to de-worming medication probably have healthier children for other reasons, too (wealth, access to clean drinking water, *etc.*). .pink[Can't make an *all else equal* comparison. Biased and/or spurious results.]

---
# Randomized Control Trials

## Example: Effect of de-worming on attendance

**Solution:** Run an experiment.

Imagine an RCT where we have two groups:

- .hi-slate[Treatment:] Villages where children get de-worming medication in school.
- .hi-slate[Control:] Villages where children don't get de-worming medication in school (status quo).

By randomizing villages into .hi-slate[treatment] or .hi-slate[control], we will, on average, include all kinds of villages (poor _vs._ less poor, access to clean water _vs._ contaminated water, hospital _vs._ no hospital, *etc.*) in both groups.

*All else equal*!
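
As a sketch of the mechanics, random assignment can be as simple as shuffling the list of villages and splitting it in half. In this Python example the village IDs, the development index, and the seed are all made up for illustration.

```python
import random
from statistics import fmean

rng = random.Random(2022)  # fixed seed so the assignment is reproducible

# Hypothetical development index (between 0 and 1) for each of 54 villages.
villages = {v: rng.random() for v in range(1, 55)}

ids = list(villages)
rng.shuffle(ids)  # put the villages in random order
treatment, control = ids[:27], ids[27:]  # first half treated, rest control

# Balance check: average development should be similar across groups.
print(round(fmean(villages[v] for v in treatment), 3))
print(round(fmean(villages[v] for v in control), 3))
```

With only 54 villages the two averages will rarely match exactly, so checking balance after assignment is a sensible habit.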

---
class: clear-slide

.hi-slate[54 villages]
<img src="04-Fun_Problem_files/figure-html/plot1-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development]
<img src="04-Fun_Problem_files/figure-html/plot2-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_1-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_2-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_3-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_4-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_5-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_6-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_7-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_8-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_9-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_10-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_11-1.png" style="display: block; margin: auto;" />
---
class: clear-slide
count: false

.hi-slate[54 villages] .hi[of varying levels of development] .hi-orange[plus randomly assigned treatment]
<img src="04-Fun_Problem_files/figure-html/plot3_12-1.png" style="display: block; margin: auto;" />

---
# Randomized Control Trials

## Example: Effect of de-worming on attendance

We can estimate the .hi[causal effect] of de-worming on school attendance by comparing the average attendance rates in the treatment group (💊) with those in the control group (no 💊).

$$
`\begin{align}
  \overline{\text{Attendance}}_\text{Treatment} - \overline{\text{Attendance}}_\text{Control}
\end{align}`
$$

Alternatively, we can use the regression

$$
`\begin{align}
  \text{Attendance}_i = \beta_0 + \beta_1 \text{Treatment}_i + u_i \tag{1}
\end{align}`
$$

where `$\text{Treatment}_i$` is a binary variable (=1 if village `$i$` received the de-worming treatment).

--
**Q:** Should we trust the results of `$(1)$`? Why?
 **A:** On average, .hi[randomly assigning treatment should balance] treatment and control across the other dimensions that affect school attendance.
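
With a single binary regressor, the OLS slope `$\beta_1$` equals the difference in group means and the intercept `$\beta_0$` equals the control-group mean. The Python sketch below checks this with the textbook OLS formulas. No attendance data appear in these slides, so the observed ten-person outcomes from earlier serve as a stand-in.

```python
from statistics import mean

# Stand-in data: observed outcomes and treatment status for ten units.
y = [5.01, 8.85, 6.31, 5.97, 7.61, 4.15, 0.56, 3.52, 4.49, 1.40]
d = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

y_bar, d_bar = mean(y), mean(d)

# OLS for y = b0 + b1 * d + u: slope = cov(d, y) / var(d).
b1 = (sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
      / sum((di - d_bar) ** 2 for di in d))
b0 = y_bar - b1 * d_bar

# Difference in group means, for comparison.
control_mean = mean(yi for yi, di in zip(y, d) if di == 0)
diff_in_means = mean(yi for yi, di in zip(y, d) if di == 1) - control_mean

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, diff = {diff_in_means:.3f}")
```

The regression does not remove selection bias by itself; it is trustworthy here only because treatment was randomly assigned.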

---
class: clear-slide

.hi[Randomization can go wrong!]
<img src="04-Fun_Problem_files/figure-html/fertilizer_plot3_bad-1.png" style="display: block; margin: auto;" />

---
# Causality

## Example: Returns to education

The optimal investment in education by students, parents, and legislators depends in part on the monetary *return to education*.

.hi-purple[Thought experiment:]
- Randomly select an individual.
- Give her an additional year of education.
- How much do her earnings increase?

The change in her earnings describes the .hi-slate[causal effect] of education on earnings.

---
# Causality

## Example: Returns to education

**Q:** Could we simply compare the earnings of those with more education to the earnings of those with less?
--
 **A:** If we want to measure the causal effect, probably not.

--
1. People *choose* education based on their ability and other factors.
1. High-ability people tend to earn more *and* stay in school longer.
1. Education likely reduces experience (time out of the workforce).

Point (3) also illustrates the difficulty in learning about the effect of education while *holding all else constant*.

Many important variables pose the same challenge: gender, race, income.
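
Points (1) and (2) can be illustrated with a simulation in which ability raises both schooling and earnings. Every number in this Python sketch is hypothetical, including the "true" return of 2, the weight of 3 on ability, and the seed. It describes a stylized world, not real earnings data.

```python
import random
from statistics import fmean

rng = random.Random(320)  # fixed seed so the draws are reproducible
TRUE_RETURN = 2.0         # hypothetical causal effect of more schooling

people = []
for _ in range(50_000):
    ability = rng.gauss(0, 1)
    # Higher-ability people are more likely to stay in school longer.
    more_school = ability + rng.gauss(0, 1) > 0
    # Ability also raises earnings directly, apart from schooling.
    earnings = 20 + TRUE_RETURN * more_school + 3 * ability + rng.gauss(0, 1)
    people.append((more_school, earnings))

# Naive comparison: average earnings by schooling group.
naive_gap = (fmean(e for s, e in people if s)
             - fmean(e for s, e in people if not s))

print(f"true return: {TRUE_RETURN:.2f}, naive gap: {naive_gap:.2f}")
```

The naive gap overstates the true return because the more-schooled group also has higher average ability, which is selection bias at work.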

---
# Causality

## Example: Returns to education

**Q:** How can we estimate the returns to education?

.hi-slate[Option 1:] Run an experiment.

- Randomly .pink[assign education] (might be difficult).
- Randomly .pink[encourage education] (might work).
- Randomly .pink[assign programs] that affect education (*e.g.*, mentoring).

.hi-slate[Option 2:] Look for a .hi-purple[*natural experiment*] (a policy or accident in society that arbitrarily increased education for one subset of people).

- Admissions .purple[cutoffs]
- .purple[Lottery] enrollment and/or capacity .purple[constraints]

---
# Summary

* The fundamental problem of econometrics is that we can never observe the same unit's outcome under both treatment and control

* This makes it challenging to construct a valid counterfactual from group means

* Comparing group means is valid only if there is no selection bias, or if we control for selection bias well enough to eliminate it

* RCTs and natural experiments are two ways to avoid selection bias

* Since these events are often rare or infeasible, we will explore many other ways to approach this challenge

---