ECON 4050: Introduction to Econometrics

.title[
# ECON 4050: Introduction to Econometrics
]
.subtitle[
## Regression Discontinuity Design
]
.author[
### Adam Soliman, PhD
]
.date[
### Clemson University
]

---

## Course Transition: ***Research Methods***

* Three main causal evaluation methods used in economics:
  
  - ***regression discontinuity designs (RDD)***
  - ***differences-in-differences (DiD)***
  - ***instrumental variables (IV)*** (not covered in this course)

* These methods are used to identify __causal relationships__ between treatments and outcomes.

---

# Today: **Regression Discontinuity Design (RDD)**

* Very common research design in applied research because it provides credible causal estimates

* **Idea**: life is full of random rules which assign some to treatment and others to control, and we exploit knowledge of this assignment rule

* **Empirical application:** effect of alcohol consumption on mortality

* **Starting point**: subjects are ***not*** randomly allocated to treatment ⚠️

* RDD can be applied when we have specific information about the rules determining treatment

* Key assumption: variable which assigns treatment cannot be manipulated by individuals

---

# Discontinuities are Everywhere

There are many arbitrary rules in life that determine assignment to some treatment:
  
--
  
* In North Carolina, you used to have to have reached the age of five by October 16 in the relevant year to be eligible to enter kindergarten [(Cook and Kang, 2016)](https://pubs.aeaweb.org/doi/pdfplus/10.1257/app.20140323);
  
--
  
* In the US, a new born baby weighing less than 1,500 grams is considered to be of "very low birth weight" and receive additional treatment [(Almond et al., 2010)](https://academic.oup.com/qje/article/125/2/591/1882183);
  
--

* Flagship state universities use a certain SAT cutoff level to select their students [(Hoekstra, 2009)](https://cdn.theatlantic.com/static/mt/assets/business/Hoekstra_Flagship.pdf);

* In Italy, there are quotas of residence permits for illegal immigrants that are allocated on a first-come first-served basis until quota is exhausted [(Pinotti, 2017)](https://pubs.aeaweb.org/doi/pdfplus/10.1257/aer.20150355);

We will focus our analysis on the following discontinuity:

* In the US, the legal drinking age is 21 years old [(Carpenter and Dobkin, 2009)](http://masteringmetrics.com/wp-content/uploads/2015/01/Carpenter-and-Dobkin-2009.pdf).

---

# An Example: Alcohol Consumption and Mortality

* Imagine you are interested in assessing the __causal__ impact of alcohol consumption by young adults on mortality.

* Why is this not that straightforward? Why can't you just regress alcohol consumption on dying age and cause of death?

* Because there may be unobserved selection into alcohol consumption that may also be a determinant of mortality.
  
--

* In the US, alcohol consumption is prohibited before the age of 21.

* Debate on whether the minimum legal drinking age (MLDA) should be lowered to 18, as was the case in the Vietnam-era.

---

# Key Terms and Intuition

> ***Running variable:*** variable that determines assignment to treatment.

`$\rightarrow$` `$a$` = age

> ***Cutoff level:*** level of the ***running variable*** above (or below) which individuals are treated (or not).

`$\rightarrow$` `$c = 21$` year old birthday

Causal intuition:

* How different are individuals *just before* and *just after* their 21st birthday, other than legal access to alcohol?

* Around the threshold, allocation to treatment is ***as good as random***.

* 👉 ***Regression discontinuity design*** exploits this allocation to treatment!

---

# Carpenter and Dobkin's data

* Let's take a closer at the data used in the paper

``` r
# load package
library(masteringmetrics)
# load data
data("mlda", package = "masteringmetrics")
```
]

```
## # A tibble: 6 × 7
## agecell all internal external alcohol homicide suicide
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.1 92.8 16.6 76.2 0.639 16.3 11.2
## 2 19.2 95.1 18.3 76.8 0.677 16.9 12.2
## 3 19.2 92.1 18.9 73.2 0.866 15.2 11.7
## 4 19.3 88.4 16.1 72.3 0.867 16.7 11.3
## 5 19.4 88.7 17.4 71.3 1.02 14.9 11.0
## 6 19.5 90.2 17.9 72.3 1.17 15.6 12.2
```
]

* This dataset contains aggregate death rates (and their causes) for different age groups (`agecell`) between 19 and 23 years old.

---

# Sharp Discontinuity at Cutoff

At the threshold, the probability of being treated jumps from 0 to 1.

---

# Sharp Discontinuity at Cutoff

---

# Sharp Discontinuity at Cutoff

---

# RDD Framework

* ***Treatment variable***: `$D_a$`

- `$D_a$` = 1 if individual is over 21 years old, `$D_a$` = 0 if not.

- `$D_a$` is a function of the individual's age, `$a$`, which is the ***running variable***.
  
--

* The ***cutoff*** age, 21, separates those who can drink legally and those who can't:
 $$
 D_a = \begin{cases}\begin{array}{lcl}
 1 \quad \text{if } a \geq 21 \\\
 0 \quad \text{if } a < 21
 \end{array}\end{cases}
 $$
 
--
 
## Key features of RD designs

1. Treatment status is a __deterministic__ function of `$a$` `$\rightarrow$` we know the assignment rule

1. Treatment status is a __discontinuous__ function of `$a$` `$\rightarrow$` there is some cutoff level

---

# Graphical Results: All Death Rates

---

# Graphical Results: All Death Rates

---

# Graphical Results: All Death Rates

---

# Graphical Results: All Death Rates

---

# RDD as Local Average Treatment Effect (LATE)

* The RD estimator is a __local average treatment effect (LATE)__.

* It only tells you the impact of treatment `$D$` on outcome `$Y$` ***around*** the cutoff value of the running variable.

* Limited ***external validity*** `$\rightarrow$` you cannot extrapolate to the entire population.

* Using the 21 year old alcohol restriction age in the RD context will only tell you the effect of this restriction on death rates but not the general effect of alcohol consumption.

* One may easily argue that all results from quantitative empirical analyses have a local nature.

---

# Estimation

---

# Estimation

* *Objective:* measure ***gap*** between the two lines at the cutoff.

* In its simplest form, we can write the following regression model:
    `$$DEATHRATE_a = \alpha + \delta D_a + \beta a + \varepsilon_i,$$`
  where `$DEATHRATE_a$` is the death rate at age `$a$`, `$D_a$` is the treatment dummy, and `$a$` is age (defined in months relative to 21st birthday).

`$\rightarrow$` `$\delta$` captures the **jump in death rate** between individuals above and below 21 years old.

* The RDD estimator exploits a discontinuity at `$a = 21$` in the conditional expectation function:
`$$\underbrace{\lim_{c \to 21^+} \mathbb{E}[DEATHRATE_a|a = c]}_{\alpha + \delta} - \underbrace{\lim_{c \to 21^-} \mathbb{E}[DEATHRATE_a|a = c]}_{\alpha} = \delta$$`

---

# Estimation #1: Simple Linear Model (all cause mortality)

`$$DEATHRATE_a = \alpha + \delta D_a + \beta a + \varepsilon_a,$$`

``` r
mlda <- mlda %>%
 mutate(over21 = (agecell >= 21),
 agecell_21 = agecell - 21)
rdd <- lm(all ~ agecell_21 + over21, mlda)

tidy(rdd)
```
]

```
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 91.8 0.805 114. 4.59e-57
## 2 agecell_21 -0.975 0.632 -1.54 1.30e- 1
## 3 over21TRUE 7.66 1.44 5.32 3.15e- 6
```
]

***Interpretation:***

On average, the MLDA increases deaths from all causes by 7.66 per 100,000, or put differently, turning 21 increases the mortality rate by approximately 8.35% relative to those just under 21 (7.66/91.8*100).

This is a big effect considering the average death rate for individuals between 19 and 22 is:

``` r
mean(mlda$all, na.rm = TRUE)
```

```
## [1] 95.67272
```

---

# Estimation #1: Simple Linear Model (motor vehicle accidents)

`$$DEATHRATE_a = \alpha + \delta D_a + \beta a + \varepsilon_a,$$`

``` r
mlda_alt <- mlda %>% mutate(over21_bin = ifelse(agecell >= 21, 1, 0),
 agecell_21 = agecell - 21)
tidy(lm(mva ~ agecell_21 + over21_bin, mlda_alt))
```

```
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 29.4 0.429 68.4 4.03e-47
## 2 agecell_21 -3.15 0.337 -9.34 4.26e-12
## 3 over21_bin 4.53 0.768 5.90 4.34e- 7
```

***Interpretation:***

On average, the MLDA increases motor vehicle deaths by 4.53 per 100,000 relative to those just under the age of 21. We can also express this as a share: 4.53/29.4*100=15.4%. It appears, then, that a lot of the increase in all cause mortality is driven by changes in drunk driving.

---

# Estimation Issues

* The ***functional form*** used to approximate the lines really matters!

`$\rightarrow$` an insufficiently flexible specification runs the risk of mistaking nonlinearity for treatment effect;
   
   `$\rightarrow$` an overly flexible specification reduces precision and runs the risk of overfitting.

---

# Simulations - Linear Relationship and Clear Discontinuity

`$$outcome_i = \alpha + \delta treatment_i + \beta running_i + e_i,$$`

---

# Simulations - Linear Relationship and Clear Discontinuity

`$$outcome_i = \alpha + \color{#d90502}\delta treatment_i + \beta running_i + e_i,$$`

---

# Simulations - Quadratic Relationship and Clear Discontinuity

`$$outcome_i = \alpha + \delta treatment_i + \beta_1 running_i + \color{#d90502}{\beta_2 running_i^2} + e_i,$$`

---

# Simulations - Quadratic Relationship and Clear Discontinuity

`$$outcome_i = \alpha + \color{#d90502}\delta treatment_i + \beta_1 running_i + \beta_2 running_i^2 + e_i,$$`

---

# Simulations - Linear Relationship but NO Discontinuity

---

# How to Choose Appropriate Functional Form?

* Essential to __visualise__ the data!

* Coefficients across models shouldn't vary too much.

* Should we expect the relationship between the outcome variable and the running variable to be nonlinear? Should we expect it to differ around the cutoff?

* [Gelman and Imbens (2019)](https://www.tandfonline.com/doi/abs/10.1080/07350015.2017.1366909), "Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs":  
  *"We recommend researchers [...] use estimators based on local linear or quadratic polynomials or other smooth functions."*

---

# Going Back to our Example: Nonlinearities / `$\neq$` Slopes?

---

# Going Back to our Example: Nonlinearities / `$\neq$` Slopes?

Gap between the lines is roughly the same for both specifications.

---

# Graphical Representation of the Regression Results

---

# Nonparametric Estimation

* Give more weight to observations close to the cutoff level

2 settings:

* How much more weight?
  
  `$\rightarrow$` depends on the chosen ***kernel***.

* How far away from the cutoff do observations need to be to be discarded?

`$\rightarrow$` depends on the chosen ***bandwidth***.

Luckily there's an `R` package that chooses these settings optimally based on fancy algorythms: `rdrobust`.

---

# Identifying Assumptions

---

# RDD Assumptions

> *Key assumption*: ***Potential outcomes are smooth at the threshold.***

`$\rightarrow$` assignment variable cannot be manipulated!

Formally:

`$$\lim_{r \to c+} E[Y_i^d|r] = \lim_{r \to c-} E[Y_i^d|r], d \in \{0,1\}$$`

* The population just below must not be discretely different from the population just above the cutoff.

* Assumption is violated if people can manipulate the running variable because they know the cutoff value.

* Knowing the cutoff value in itself does not violate the assumption, only ability to manipulate running variable does.

---

# RDD Assumptions

> *Key assumption*: ***Potential outcomes are smooth at the threshold.***

If the assumption holds, we have:

$$
`\begin{align}
&\lim_{r \to c^+} \mathbb{E}[Y_i | R_i = r] - \lim_{r \to c^-} \mathbb{E}[Y_i | R_i = r] \\
= &\lim_{r \to c^+} \mathbb{E}[Y_i^1 | R_i = r] - \lim_{r \to c^-} \mathbb{E}[Y_i^0 | R_i = r] \\
= &\mathbb{E}[Y_i^1 | R_i = c] - \mathbb{E}[Y_i^0 | R_i = c] \\
= &\mathbb{E}[Y_i^1 - Y_i^0 | R_i = c] \\
\end{align}`
$$

---

# RDD Assumptions

> *Key assumption*: ***Potential outcomes are smooth at the threshold.***

If the assumption holds, we have:

$$
`\begin{align}
&\lim_{c \to 21^+} \mathbb{E}[Y_i | a_i = c] - \lim_{a \to 21^-} \mathbb{E}[Y_i | a_i = c] \\
= &\lim_{c \to 21^+} \mathbb{E}[Y_i^1 | a_i = c] - \lim_{c \to 21^-} \mathbb{E}[Y_i^0 | a_i = c] \\
= &\mathbb{E}[Y_i^1 | a_i = 21] - \mathbb{E}[Y_i^0 | a_i = 21] \\
= &\underbrace{\mathbb{E}[Y_i^1 - Y_i^0}_\text{ATE} | a_i = 21] \\
\end{align}`
$$

---

# Example of Manipulation: [Camacho and Conover (2011)](https://pubs.aeaweb.org/doi/pdfplus/10.1257/pol.3.2.41)

What happens when threshold for eligibility to social assistance programs becomes known?

.pull-left[
<img src="../img/photos/manip_1.png" width="700px" style="display: block; margin: auto;" />
]

.pull-right[
<img src="../img/photos/manip_2.png" width="700px" style="display: block; margin: auto;" />
]
---

# Noncompliance

What if the running variable does not *fully* determine assignment to treatment?

`$\rightarrow$` ***Fuzzy RDD***

* Even if all observations that satisfy the treatment condition are not treated, there is still a jump in the probability of being treated.

* For you, just know that problem of imperfect determination of allocation to treatment can still be solved

---

# 5 Steps for Conducting RDD in Practice1

.footnote[
1 Taken from [Andrew Heiss' wonderful course on RDD](https://evalsp20.classes.andrewheiss.com/class/11-class/).
]

### Step #1: ***Is assignment to treatment rule-based?***

### Step #2: ***Is design sharp or fuzzy?***

### Step #3: ***Is there a discontinuity in running variable at cutoff?***

### Step #4: ***Is there a discontinuity in outcome variable at cutoff in running variable?***

### Step #5: ***How big is the gap?***

---

# Task

1. Import the dataset after downloading it from [here](https://github.com/adamsoliman/IntroEconometrics/blob/master/data_for_tasks/mlda.csv). How many age cells are there?

1. Create a dummy variable for individuals over 21 years old.

1. Plot the death rate for all causes (`all`) on the y-axis as a function of age (`agecell`) on the x-axis, coloring observations above and below 21 years old. Does anything seem striking?

1. Add a regression line to the plot with `geom_smooth(method = `lm`)`. What do you observe?

1. Do the same for motor vehicle-related causes (`mva`) and alcohol-related causes (`alcohol`) as a function of age.

1. Estimate the following model on all death causes.
`$$DEATHRATE_a = \alpha + \delta D_a + \beta a + \varepsilon_i,$$`

Does the RDD coefficient correspond to the graphical illustration?
 
1. How do you interpret each coefficient?

1. What is the causal effect of legal access to alcohol on death rates?