Let’s compute these various quantities and biases with some toy data (i.e. data we generated ourselves).
1. Load the data here.
The group variable corresponds to whether the individual
has been treated or not, the Y0 variable corresponds to the
potential outcome if the individual does not receive the treatment \((Y_i^0)\) while Y1 corresponds
to the potential outcome if the individual receives the treatment \((Y_i^1)\). Create a new variable containing
the observed outcome \((Y_i)\) and the
individual treatment effect \((\delta_i)\). (Recall that \(Y_i = D_i * Y_i^1 + (1 - D_i) Y_i^0\) and
\(\delta_i = Y_i^1 - Y_i^0\).)
link <- "https://www.dropbox.com/s/oqc3vq1m9eyp1d6/toy_data_2.csv?dl=1"
toy_data <- read.csv(link)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
toy_data <- toy_data %>%
mutate(Y = Y1 * group_dummy + (1 - group_dummy) * Y0,
delta = Y1 - Y0)
2. Compute the ATE and the SDO. Is there is any bias? Is it large in magnitude?
ATE = mean(toy_data$delta)
ATE
## [1] 2.10418
SDO = mean(toy_data$Y[toy_data$group == "treatment"]) - mean(toy_data$Y[toy_data$group == "control"])
SDO
## [1] 3.208584
The SDO is about 50% larger than the ATE, implying that the bias is quite large.
3. In this new dataset we’ve randomly assigned the same individuals to the treatment and control groups. Compute the SDO under randomization. Remember that you need to recompute \(Y_i\) because the assignment is not the same anymore. If you got it right, the bias should be very close to 0. Why is it not exactly 0?
link_rand <- "https://www.dropbox.com/s/0qfsonzz1t9gkxi/toy_data_random.csv?dl=1"
toy_data_random <- read.csv(link_rand)
toy_data_random <- toy_data_random %>%
mutate(Y = Y1 * group_random_dummy + (1 - group_random_dummy) * Y0)
SDO_random = mean(toy_data_random$Y[toy_data_random$group_random == "treatment"]) - mean(toy_data_random$Y[toy_data_random$group_random == "control"])
SDO_random
## [1] 2.047158
ATE - SDO_random
## [1] 0.05702187
The bias is equal to 0.057. It is not equal to zero because our sample size is not large enough to completely negate random variations in the two group’s differences.
4. To do at home: Compute the value of the selection bias and the heterogenous treatment effect bias and check that we have \[SDO = ATE + \textrm{selection bias} + \textrm{heterogenous treatment effect bias}\]
selection_bias = mean(toy_data$Y0[toy_data$group == "treatment"]) - mean(toy_data$Y0[toy_data$group == "control"])
het_trt_effect_bias = (1 - sum(toy_data$group == "treatment") / nrow(toy_data)) * (mean(toy_data$delta[toy_data$group == "treatment"]) - mean(toy_data$delta[toy_data$group == "control"]))
SDO
## [1] 3.208584
ATE + selection_bias + het_trt_effect_bias
## [1] 3.208584
1. Load the STAR data from here
and assign it to an object called star_df.
star_df <- read.csv("https://www.dropbox.com/s/bf1fog8yasw3wjj/star_data.csv?dl=1")
2. Read the help for AER::STAR to understand what the
variables correspond to. (Note: the data has been reshaped so
don’t mind the “k”, “1”, etc. in the variable names in the help.)
library(AER)
?AER::STAR
3. What’s the unit of observation? Which variable contains: (i) the (random) class assignment, (ii) the student’s class grade, (iii) the outcomes of interest?
The unit of observation is student-grade. The variable
containing the (random) class assignment is star, that
containing the student’s class grade is grade, and that
containing the outcomes of interest are read and
math.
4. How many observations are there? Why so many?
There are so many observations because the data has been reshaped such that each row corresponds to a student in a grade rather than just to a student.
5. Why are there so many NAs? What do they correspond
to?
The NAs correspond to children who left the
experiment for various reasons. This is called attrition.
6. Keep only cases with no NAs with the following
code:
star_df <- star_df[complete.cases(star_df),]
star_df <- star_df[complete.cases(star_df),]
7. Let’s check how well the randomization was done by doing
balancing checks.
Compute the average percentage of girls, african americans, and free
lunch qualifiers by grade and treatment class. (Hint:
The following computes the percentage of girls (without the relevant
dplyr verbs):
share_female = mean(gender == "female") * 100.)
star_df %>%
group_by(grade, star) %>%
summarise(
share_female = mean(gender == "female") * 100,
share_african_american = mean(ethnicity == "afam") * 100,
share_free_lunch = mean(lunch == "free") * 100)
## `summarise()` has grouped output by 'grade'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 5
## # Groups: grade [4]
## grade star share_female share_african_american share_free_lunch
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 regular 48.7 37.3 51.5
## 2 1 regular+aide 47.8 29.8 50.8
## 3 1 small 48.6 31.9 47.9
## 4 2 regular 48.3 36.7 50.6
## 5 2 regular+aide 47.7 33.8 48.2
## 6 2 small 49.1 33.3 46.6
## 7 3 regular 48.9 35.2 49.7
## 8 3 regular+aide 47.2 34.3 49.1
## 9 3 small 50.1 31.6 46.8
## 10 k regular 48.5 28.5 46.0
## 11 k regular+aide 48.8 32.2 49.3
## 12 k small 48.0 30.2 46.7
1. Filter the dataset to only keep first graders and the small class and regular class groups.
star_df_1_small <- star_df %>%
filter(star %in% c("small","regular") & grade == "1")
2. Compute the average math score for both groups, and the difference
between the two. (Use base R.)
mean_small = mean(star_df_1_small$math[star_df_1_small$star == "small"])
mean_small
## [1] 539.0885
mean_regular = mean(star_df_1_small$math[star_df_1_small$star == "regular"])
mean_regular
## [1] 526.4434
ATE = mean_small - mean_regular
ATE
## [1] 12.64506
3. Create a dummy variable treatment equal to
TRUE if student is in treatment group (i.e. small class
size) and FALSE if in control group (i.e. regular class
size). See slide 33 for how to create a dummy variable.
star_df_1_small <- star_df_1_small %>%
mutate(treatment = (star == "small"))
table(star_df_1_small$treatment)
##
## FALSE TRUE
## 2359 1786
4. Regress math score on the treatment dummy variable. Are the results in line with question 2?
lm(math ~ treatment, star_df_1_small)
##
## Call:
## lm(formula = math ~ treatment, data = star_df_1_small)
##
## Coefficients:
## (Intercept) treatmentTRUE
## 526.44 12.65
The intercept corresponds to the expected math score for first graders in the control group, defined as a regular-sized class. In other words the expected math score of first grades in the control group is 526,44. You can see this by yourself by comparing the coefficient to the same average computed in question 2. The slope coefficient corresponds the difference in expected math scores between first graders in the treatment and in the control group. In other words, first graders in small classes are expected to score 12.65 points higher than students in regular-sized classes. Again, you can compare this coefficient with the difference in averages computed in question 2.