Intro to Causality

Task 1: SDO, ATE and Randomization

Let’s compute these various quantities and biases with some toy data (i.e. data we generated ourselves).

1. Load the data here. The group variable corresponds to whether the individual has been treated or not, the Y0 variable corresponds to the potential outcome if the individual does not receive the treatment \((Y_i^0)\) while Y1 corresponds to the potential outcome if the individual receives the treatment \((Y_i^1)\). Create a new variable containing the observed outcome \((Y_i)\) and the individual treatment effect \((\delta_i)\). (Recall that \(Y_i = D_i * Y_i^1 + (1 - D_i) Y_i^0\) and \(\delta_i = Y_i^1 - Y_i^0\).)

link <- "https://www.dropbox.com/s/oqc3vq1m9eyp1d6/toy_data_2.csv?dl=1"
toy_data <- read.csv(link)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

toy_data <- toy_data %>%
    mutate(Y = Y1 * group_dummy + (1 - group_dummy) * Y0,
           delta = Y1 - Y0)

2. Compute the ATE and the SDO. Is there is any bias? Is it large in magnitude?

ATE = mean(toy_data$delta)
ATE

## [1] 2.10418

SDO = mean(toy_data$Y[toy_data$group == "treatment"]) - mean(toy_data$Y[toy_data$group == "control"])
SDO

## [1] 3.208584

The SDO is about 50% larger than the ATE, implying that the bias is quite large.

3. In this new dataset we’ve randomly assigned the same individuals to the treatment and control groups. Compute the SDO under randomization. Remember that you need to recompute \(Y_i\) because the assignment is not the same anymore. If you got it right, the bias should be very close to 0. Why is it not exactly 0?

link_rand <- "https://www.dropbox.com/s/0qfsonzz1t9gkxi/toy_data_random.csv?dl=1"
toy_data_random <- read.csv(link_rand)

toy_data_random <- toy_data_random %>%
    mutate(Y = Y1 * group_random_dummy + (1 - group_random_dummy) * Y0)

SDO_random = mean(toy_data_random$Y[toy_data_random$group_random == "treatment"]) - mean(toy_data_random$Y[toy_data_random$group_random == "control"])
SDO_random

## [1] 2.047158

ATE - SDO_random

## [1] 0.05702187

The bias is equal to 0.057. It is not equal to zero because our sample size is not large enough to completely negate random variations in the two group’s differences.

4. To do at home: Compute the value of the selection bias and the heterogenous treatment effect bias and check that we have \[SDO = ATE + \textrm{selection bias} + \textrm{heterogenous treatment effect bias}\]

selection_bias = mean(toy_data$Y0[toy_data$group == "treatment"]) - mean(toy_data$Y0[toy_data$group == "control"])

het_trt_effect_bias = (1 - sum(toy_data$group == "treatment") / nrow(toy_data)) * (mean(toy_data$delta[toy_data$group == "treatment"]) - mean(toy_data$delta[toy_data$group == "control"]))

SDO

## [1] 3.208584

ATE + selection_bias + het_trt_effect_bias

## [1] 3.208584

Task 2: STAR data

1. Load the STAR data from here and assign it to an object called star_df.

star_df <- read.csv("https://www.dropbox.com/s/bf1fog8yasw3wjj/star_data.csv?dl=1")

2. Read the help for AER::STAR to understand what the variables correspond to. (Note: the data has been reshaped so don’t mind the “k”, “1”, etc. in the variable names in the help.)

library(AER)
?AER::STAR

3. What’s the unit of observation? Which variable contains: (i) the (random) class assignment, (ii) the student’s class grade, (iii) the outcomes of interest?

The unit of observation is student-grade. The variable containing the (random) class assignment is star, that containing the student’s class grade is grade, and that containing the outcomes of interest are read and math.

4. How many observations are there? Why so many?

There are so many observations because the data has been reshaped such that each row corresponds to a student in a grade rather than just to a student.

5. Why are there so many NAs? What do they correspond to?

The NAs correspond to children who left the experiment for various reasons. This is called attrition.

6. Keep only cases with no NAs with the following code:
star_df <- star_df[complete.cases(star_df),]

star_df <- star_df[complete.cases(star_df),]

7. Let’s check how well the randomization was done by doing balancing checks.
Compute the average percentage of girls, african americans, and free lunch qualifiers by grade and treatment class. (Hint: The following computes the percentage of girls (without the relevant dplyr verbs): share_female = mean(gender == "female") * 100.)

star_df %>%
    group_by(grade, star) %>%
    summarise(
        share_female = mean(gender == "female") * 100,
        share_african_american = mean(ethnicity == "afam") * 100,
        share_free_lunch = mean(lunch == "free") * 100)

## `summarise()` has grouped output by 'grade'. You can override using the
## `.groups` argument.

## # A tibble: 12 × 5
## # Groups:   grade [4]
##    grade star         share_female share_african_american share_free_lunch
##    <chr> <chr>               <dbl>                  <dbl>            <dbl>
##  1 1     regular              48.7                   37.3             51.5
##  2 1     regular+aide         47.8                   29.8             50.8
##  3 1     small                48.6                   31.9             47.9
##  4 2     regular              48.3                   36.7             50.6
##  5 2     regular+aide         47.7                   33.8             48.2
##  6 2     small                49.1                   33.3             46.6
##  7 3     regular              48.9                   35.2             49.7
##  8 3     regular+aide         47.2                   34.3             49.1
##  9 3     small                50.1                   31.6             46.8
## 10 k     regular              48.5                   28.5             46.0
## 11 k     regular+aide         48.8                   32.2             49.3
## 12 k     small                48.0                   30.2             46.7

Task 3: Your Turn!

1. Filter the dataset to only keep first graders and the small class and regular class groups.

star_df_1_small <- star_df %>%
    filter(star %in% c("small","regular") & grade == "1")

2. Compute the average math score for both groups, and the difference between the two. (Use base R.)

mean_small = mean(star_df_1_small$math[star_df_1_small$star == "small"])
mean_small

## [1] 539.0885

mean_regular = mean(star_df_1_small$math[star_df_1_small$star == "regular"])
mean_regular

## [1] 526.4434

ATE = mean_small - mean_regular
ATE

## [1] 12.64506

3. Create a dummy variable treatment equal to TRUE if student is in treatment group (i.e. small class size) and FALSE if in control group (i.e. regular class size). See slide 33 for how to create a dummy variable.

star_df_1_small <- star_df_1_small %>%
    mutate(treatment = (star == "small"))
table(star_df_1_small$treatment)

## 
## FALSE  TRUE 
##  2359  1786

4. Regress math score on the treatment dummy variable. Are the results in line with question 2?

lm(math ~ treatment, star_df_1_small)

## 
## Call:
## lm(formula = math ~ treatment, data = star_df_1_small)
## 
## Coefficients:
##   (Intercept)  treatmentTRUE  
##        526.44          12.65

How do you interpret these coefficients?

The intercept corresponds to the expected math score for first graders in the control group, defined as a regular-sized class. In other words the expected math score of first grades in the control group is 526,44. You can see this by yourself by comparing the coefficient to the same average computed in question 2. The slope coefficient corresponds the difference in expected math scores between first graders in the treatment and in the control group. In other words, first graders in small classes are expected to score 12.65 points higher than students in regular-sized classes. Again, you can compare this coefficient with the difference in averages computed in question 2.

Intro to Causality - Tasks

Florian Oswald, Gustave Kenedi and Pierre Villedieu

2025-10-28

Task 1: SDO, ATE and Randomization

Task 2: STAR data

Task 3: Your Turn!