library(data.table)
library(fixest)
library(kableExtra)
library(modelsummary)
library(hdm)
library(ggplot2)
Selection on Observables Framework
Now we will depart from experiments and start to use observational data to infer causality. In that case, we need the treatment assignment mechanism to be regular. There are three conditions that an assignment mechanism must satisfy in order to be considered regular:
- Individualistic assignment: This limits the dependence of a particular unit’s assignment probability on the values of covariates and potential outcomes for other units.
An assignment mechanism \(\operatorname{Pr}(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))\) is individualistic if, for some function \(q(\cdot) \in[0,1],\)
\[ p_{i}(\mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))=q\left(X_{i}, Y_{i}(0), Y_{i}(1)\right), \text { for all } i=1, \ldots, N \]
and
\[ \operatorname{Pr}(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))=c \cdot \prod_{i=1}^{N} q\left(X_{i}, Y_{i}(0), Y_{i}(1)\right)^{W_{i}}\left(1-q\left(X_{i}, Y_{i}(0), Y_{i}(1)\right)\right)^{1-W_{i}} \]
for \((\mathbf{W}, \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1)) \in \mathbb{A},\) for some set \(\mathbb{A},\) and zero elsewhere (\(c\) is the constant that ensures that the probabilities sum to unity).
- Probabilistic assignment: This requires the assignment mechanism to imply a nonzero probability for each treatment value, for every unit.
An assignment mechanism \(\operatorname{Pr}(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))\) is probabilistic if the probability of assignment to treatment for unit \(i\) is strictly between zero and one:
\[ 0<p_{i}(\mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))<1, \text { for each possible } \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1) \] for all \(i=1, \ldots, N.\)
- Unconfounded assignment: This disallows dependence of the assignment mechanism on the potential outcomes.
An assignment mechanism is unconfounded if it does not depend on the potential outcomes: \[\begin{equation*} \operatorname{Pr}(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))=\operatorname{Pr}\left(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}^{\prime}(0), \mathbf{Y}^{\prime}(1)\right) \end{equation*}\] for all \(\mathbf{W}, \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1), \mathbf{Y}^{\prime}(0),\) and \(\mathbf{Y}^{\prime}(1)\)
The unconfoundedness assumption is also known as the conditional independence assumption (CIA; Angrist and Pischke 2008). Thus, if the assignment mechanism is unconfounded and individualistic, the probability of assignment is the individual propensity score. Also, given individualistic assignment, a mechanism that is both probabilistic and unconfounded is referred to as a strongly ignorable treatment assignment.
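To make the three conditions concrete, here is a minimal simulation of a regular assignment mechanism. All names and numbers are illustrative only, not part of the analysis below:

```r
# Illustrative sketch (hypothetical data): a regular assignment mechanism.
# Individualistic: each unit's probability depends only on its own X.
# Probabilistic: q(X) = plogis(.) is strictly inside (0, 1).
# Unconfounded: q(X) does not use the potential outcomes Y0, Y1.
set.seed(123)
N <- 1e4
X <- rnorm(N)
Y0 <- 1 + X + rnorm(N)   # potential outcome under control
Y1 <- Y0 + 2             # constant treatment effect of 2
q <- plogis(-0.5 + X)    # unit-level assignment probability: a function of X only
W <- rbinom(N, 1, q)     # Bernoulli trials, independent across units
range(q)                 # strictly between 0 and 1
# Even under a regular mechanism, the naive difference in means is biased,
# because X shifts both the assignment probability and the outcome:
mean(Y1[W == 1]) - mean(Y0[W == 0])
```

Note that regularity alone does not make the naive comparison valid; it is what licenses the covariate-adjustment methods discussed below.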
Imbens (2015) lays out an overall strategy for flexibly and robustly estimating the average effect of the treatment. The strategy consists of two, sometimes three, distinct stages to estimate the average treatment effect from observational studies. The first stage is referred to as the design stage, where the full sample is trimmed to improve overlap of covariates. In the second stage, the plausibility of the unconfoundedness assumption is assessed through supplementary analyses. Finally, the third stage is the analysis stage, where the estimator of choice is applied to the trimmed data set and the average effect is thus estimated.
Design Stage
In this stage we do not use the outcome data and focus solely on the treatment indicators and covariates, \((X,W)\). Because it does not involve the outcomes, this analysis cannot be “contaminated” by knowledge of estimated outcome distributions, or by preferences, conscious or unconscious, for particular results.
Initial Covariate Balance
We will use the CPS union membership dataset.
dt <- fread("Data/cps_union_data.csv")
#' Cleaning. Do not use the outcome!
dt[, c("earnings", "V1", "CPSID", "CPSIDP", "public_housing", "employed") := NULL]
dt[, `:=`(
  marital_status = ifelse(marital_status %in% c(1, 2), 1, 0),
  race = ifelse(race == 1, 1, 0),
  age_2 = age^2
)]
cat_cols <- c("class_of_worker", "class_of_worker_last_year")
dt[, (cat_cols) := lapply(.SD, factor), .SDcols = cat_cols]
dt <- dt[complete.cases(dt)]
# From modelsummary package: pre-formatted balance table
datasummary_balance(~union, data = dt,
title = "Descriptive statistics by treatment group for an observational study.")
Variable | Mean (union = 0) | Std. Dev. (union = 0) | Mean (union = 1) | Std. Dev. (union = 1) | Diff. in Means | Std. Error
---|---|---|---|---|---|---
age | 42.3 | 14.6 | 45.7 | 13.3 | 3.5 | 0.4
female | 0.5 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0
race | 0.8 | 0.4 | 0.8 | 0.4 | 0.0 | 0.0
marital_status | 0.5 | 0.5 | 0.6 | 0.5 | 0.1 | 0.0
veteran | 0.1 | 0.2 | 0.1 | 0.3 | 0.0 | 0.0
education | 90.9 | 22.9 | 94.8 | 21.7 | 3.9 | 0.6
worked_last_year | 1.0 | 0.2 | 1.0 | 0.1 | 0.0 | 0.0
total_income_last_year | 60986.3 | 78361.2 | 65972.3 | 61781.2 | 4986.0 | 1768.9
wage_income_last_year | 55485.1 | 71788.6 | 60617.7 | 57985.8 | 5132.6 | 1653.4
own_farm_income_last_year | 125.8 | 10392.1 | 29.0 | 818.7 | -96.8 | 100.2
private_health_insurance | 0.8 | 0.4 | 0.9 | 0.3 | 0.1 | 0.0
medicaid | 0.1 | 0.3 | 0.1 | 0.2 | 0.0 | 0.0
age_2 | 1999.2 | 1305.7 | 2269.5 | 1240.6 | 270.4 | 34.5

Variable | Level | N (union = 0) | % (union = 0) | N (union = 1) | % (union = 1)
---|---|---|---|---|---
class_of_worker | 21 | 9916 | 88.0 | 761 | 51.5
 | 25 | 297 | 2.6 | 116 | 7.9
 | 27 | 514 | 4.6 | 208 | 14.1
 | 28 | 547 | 4.9 | 392 | 26.5
class_of_worker_last_year | 0 | 278 | 2.5 | 16 | 1.1
 | 13 | 21 | 0.2 | 1 | 0.1
 | 14 | 99 | 0.9 | 11 | 0.7
 | 22 | 9496 | 84.2 | 789 | 53.4
 | 25 | 299 | 2.7 | 109 | 7.4
 | 27 | 541 | 4.8 | 191 | 12.9
 | 28 | 539 | 4.8 | 360 | 24.4
 | 29 | 1 | 0.0 | 0 | 0.0
Imbens and Rubin (2015) point out that the goal is, at least at this point, to assess whether the differences between the two sample distributions are so large that simple adjustment methods are unlikely to remove biases in estimated treatment/control average differences. So, they suggest reporting the normalized differences:
\[\begin{equation} \hat\Delta_{ct}= \frac{\bar X_t - \bar X_c}{\sqrt{(s^2_t+s^2_c)/2}} \end{equation}\]
where \(\bar X_t\) and \(\bar X_c\) denote the sample means for the treatment and control groups, and \(s^2_t\) and \(s^2_c\) are the sample variances. They claim that, given a large enough sample size, the t-statistic of the difference in means will always be large enough to reject the null hypothesis, whereas the normalized difference, \(\hat\Delta_{ct}\), remains unchanged as the sample size increases. The challenge of adjusting for differences in the covariates should be simpler when one has more observations, but that is not what a t-test is telling us. An extra benefit of the normalized difference is that it is expressed in standard deviation units. Differences of 0.25 or less seem to indicate good balance1.
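The sample-size argument is easy to verify in a small simulation (a sketch with made-up data, unrelated to the CPS sample): holding the true mean difference fixed, the t-statistic diverges as \(N\) grows while the normalized difference stays near its population value.

```r
# Sketch: t-statistic vs. normalized difference as N grows (simulated data)
set.seed(42)
balance_stats <- function(N) {
  xt <- rnorm(N, mean = 0.1)  # "treated" covariate, slightly shifted
  xc <- rnorm(N, mean = 0)    # "control" covariate
  c(t_stat = (mean(xt) - mean(xc))/sqrt(var(xt)/N + var(xc)/N),
    norm_diff = (mean(xt) - mean(xc))/sqrt((var(xt) + var(xc))/2))
}
res <- sapply(c(100, 1e4, 1e6), balance_stats)
colnames(res) <- c("N=100", "N=1e4", "N=1e6")
round(res, 3)  # t-stat explodes with N; norm_diff hovers around 0.1
```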
We may also wish to compare measures of dispersion in the two distributions. This can be done with the logarithm of the ratio of standard deviations, which is invariant to scale:
\[\hat\Gamma_{ct}=\ln(s_t)-\ln(s_c)\]
As a second approach to comparing distributions, one can investigate what fraction of the treated (control) units have covariate values in the tails of the distribution of the controls (treated). For example, we can compute the probability mass of the covariate distribution for the treated that lies outside the \(\alpha/2\) and \(1-\alpha/2\) quantiles of the covariate distribution for the controls:
\[\hat\pi_t^\alpha=\left(1-\left(\hat F_t(\hat F_c^{-1}(1-\alpha/2))\right)+\hat F_t(\hat F_c^{-1}(\alpha/2))\right)\]
and the analogous quantity for the control distribution:
\[\hat\pi_c^\alpha=\left(1-\left(\hat F_c(\hat F_t^{-1}(1-\alpha/2))\right)+\hat F_c(\hat F_t^{-1}(\alpha/2))\right)\]
where \(\hat F_c(\cdot)\) and \(\hat F_t(\cdot)\) are the empirical distribution functions of a covariate in the control and treated subsamples, respectively:
\[\hat F_j(x)=\frac{1}{N_j}\sum_{i:\, W_i\in j}\mathbf{1}\{X_i\leq x\}\, , \quad\text{for } j\in\{c, t\}\] and \(\hat F^{-1}(\cdot)\) denotes the quantile function of the empirical distribution.
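For a single covariate these coverage measures reduce to a couple of lines of base R with `ecdf()` and `quantile()`; the `coverage()` function below generalizes the computation to every covariate at once. A sketch on simulated (hypothetical) samples:

```r
# \hat\pi_t^{0.05}: treated mass outside the central 95% of the control
# distribution (simulated, illustrative samples)
set.seed(1)
alpha <- 0.05
x_c <- rnorm(5000)              # control sample
x_t <- rnorm(5000, mean = 0.5)  # treated sample, shifted right
F_t <- ecdf(x_t)                # empirical CDF of the treated
pi_t <- (1 - F_t(quantile(x_c, 1 - alpha/2))) + F_t(quantile(x_c, alpha/2))
pi_t  # exceeds alpha = 0.05 because the treated distribution is shifted
```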
# Compute the normalized difference
norm_diff <- function(xt, xc, sdt, sdc) {
  (xt - xc)/(sqrt((sdt^2 + sdc^2)/2))
}
#' Tail overlap for treatment and control groups
#'
#' @param dt data.table containing your dataset
#' @param treat_var character, name of treatment variable
#' @param variables character, column name containing all variables
#' @param values character, column name containing values for variables
#' @param alpha float, significance level
#'
#' @return data.table with three columns, one for variable and two for the coverages of treatment and control
coverage <- function(dt, treat_var, variables, values, alpha = 0.05) {
  stopifnot(is.data.table(dt))
  quantiles <- dt[, by = .(get(treat_var), get(variables)),
                  lapply(.SD, function(x){
                    c(quantile(x, 1 - alpha/2),
                      quantile(x, alpha/2))
                  }),
                  .SDcols = values
  ][, q := rep(c("high", "low"), .N/2)]
  quantiles <- dcast(quantiles, get+get.1~q, value.var = "value")
  setnames(quantiles, c("get", "get.1"), c(treat_var, variables))
  quantiles[, not_treat_var := 1*(!get(treat_var))]
  # Merge each group with the *other* group's quantiles
  # (use the function argument dt, not a global data.table)
  pi_dt <- merge(dt, quantiles[, !c(..treat_var)],
                 by.x = c(treat_var, variables),
                 by.y = c("not_treat_var", variables),
                 all.x = TRUE)
  pi_dt <- pi_dt[, by = .(get(treat_var), get(variables)),
                 .(pi = (1 - mean(value <= high)) + mean(value <= low))]
  pi_dt <- dcast(pi_dt, get.1~get, value.var = "pi")
  setnames(pi_dt, c("get.1", "0", "1"), c(variables, "pi_cont", "pi_treat"))
  return(pi_dt)
}
#' Select the numerical columns and compute the normalized difference
#' and log ratio
num_cols <- colnames(dt)[sapply(dt, is.numeric)]
dt_long <- melt(dt[, ..num_cols], id.vars = "union",
                variable.factor = FALSE)
## Warning in melt.data.table(dt[, ..num_cols], id.vars = "union", variable.factor
## = FALSE): 'measure.vars' [age, female, race, marital_status, ...] are not all
## of the same type. By order of hierarchy, the molten data value column will be of
## type 'double'. All measure variables not of type 'double' will be coerced too.
## Check DETAILS in ?melt.data.table for more on coercion.
head(dt_long)
## union variable value
## 1: 0 age 63
## 2: 0 age 43
## 3: 0 age 32
## 4: 0 age 38
## 5: 1 age 39
## 6: 0 age 29
dt_summa <- dt_long[, by = .(variable, union),
                    .(N = .N,
                      mean = mean(value, na.rm = TRUE),
                      sd = sd(value, na.rm = TRUE)
                    )]
dt_wide <- dcast(dt_summa, variable~union,
                 value.var = c("mean", "sd", "N"))
head(dt_wide)
## variable mean_0 mean_1 sd_0 sd_1 N_0
## 1: age 4.226095e+01 45.7400135 14.6015359 13.3227642 11274
## 2: age_2 1.999174e+03 2269.5247123 1305.6744610 1240.6458013 11274
## 3: education 9.091174e+01 94.8043331 22.8525160 21.7442397 11274
## 4: female 4.976938e-01 0.4847664 0.5000169 0.4999372 11274
## 5: marital_status 5.462125e-01 0.6452268 0.4978819 0.4786066 11274
## 6: medicaid 8.781267e-02 0.0534868 0.2830348 0.2250783 11274
## N_1
## 1: 1477
## 2: 1477
## 3: 1477
## 4: 1477
## 5: 1477
## 6: 1477
# Creates new columns
dt_wide[, `:=`(
  diff_mean = mean_1 - mean_0,
  std_err = sqrt(sd_1^2/N_1 + sd_0^2/N_0),
  t_stat = (mean_1 - mean_0)/sqrt(sd_1^2/N_1 + sd_0^2/N_0),
  norm_diff = norm_diff(mean_1, mean_0, sd_1, sd_0),
  log_ratio = log(sd_1) - log(sd_0)
)]
#' New data.table for coverages
pi_dt <- coverage(dt_long, treat_var = "union",
                  variables = "variable",
                  values = "value")
# Join coverages to dt_wide
bal_dt <- dt_wide[pi_dt, on = "variable"]
# Reorder the columns by their index
vari <- which(colnames(dt_wide) == "variable")
treat <- grep("_1", colnames(dt_wide))
cont <- grep("_0", colnames(dt_wide))
#' Rearrange columns
setcolorder(bal_dt, c(vari, treat, cont))
And now we can present the normalized difference alongside other metrics in the balance table.
#' Show the data.table as a pretty table using kableExtra package
Nt <- bal_dt[1, N_1]
Nc <- bal_dt[1, N_0]
# Insert the number of observations in column name
t_title <- sprintf("Treatment ($N_t = %s$)", Nt)
c_title <- sprintf("Control ($N_c = %s$)", Nc)
kbl_head <- c(1, 2, 2, 5)
names(kbl_head) <- c(" ", t_title, c_title, "Overlap Measures")
#' Generates the table. You can use the format = 'latex' if compiling a
#' PDF. Note: The native pipe operator |> works only for R 4.1 or newer
kbl(bal_dt[, -c("N_1", "N_0", "diff_mean", "std_err")],
    digits = 2,
    col.names = c("Variable", "Mean", "Std.Dev",
                  "Mean", "Std.Dev",
                  "t-stat", "Norm. Diff.", "Log Ratio", "$\\pi_c^{0.05}$", "$\\pi_t^{0.05}$"),
    escape = FALSE) |>
  kable_classic(full_width = FALSE) |>
  add_header_above(header = kbl_head)
The first Mean/Std.Dev pair refers to the Treatment group (\(N_t = 1477\)), the second to the Control group (\(N_c = 11274\)); the last five columns are the overlap measures.

Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\)
---|---|---|---|---|---|---|---|---|---
age | 45.74 | 13.32 | 42.26 | 14.60 | 9.33 | 0.25 | -0.09 | 0.11 | 0.03 |
age_2 | 2269.52 | 1240.65 | 1999.17 | 1305.67 | 7.83 | 0.21 | -0.05 | 0.11 | 0.03 |
education | 94.80 | 21.74 | 90.91 | 22.85 | 6.43 | 0.17 | -0.05 | 0.08 | 0.04 |
female | 0.48 | 0.50 | 0.50 | 0.50 | -0.93 | -0.03 | 0.00 | 0.50 | 0.52 |
marital_status | 0.65 | 0.48 | 0.55 | 0.50 | 7.44 | 0.20 | -0.04 | 0.45 | 0.35 |
medicaid | 0.05 | 0.23 | 0.09 | 0.28 | -5.33 | -0.13 | -0.23 | 0.91 | 0.95 |
own_farm_income_last_year | 28.99 | 818.73 | 125.83 | 10392.05 | -0.97 | -0.01 | -2.54 | 1.00 | 1.00 |
private_health_insurance | 0.91 | 0.29 | 0.80 | 0.40 | 12.54 | 0.30 | -0.32 | 0.20 | 0.09 |
race | 0.79 | 0.40 | 0.81 | 0.39 | -1.89 | -0.05 | 0.04 | 0.19 | 0.21 |
total_income_last_year | 65972.27 | 61781.18 | 60986.32 | 78361.19 | 2.82 | 0.07 | -0.24 | 0.11 | 0.03 |
veteran | 0.07 | 0.26 | 0.06 | 0.23 | 2.67 | 0.08 | 0.14 | 0.94 | 0.93 |
wage_income_last_year | 60617.73 | 57985.77 | 55485.09 | 71788.61 | 3.10 | 0.08 | -0.21 | 0.11 | 0.03 |
worked_last_year | 0.99 | 0.10 | 0.98 | 0.16 | 4.51 | 0.10 | -0.40 | 1.00 | 1.00 |
Since we have multiple covariates, it is also useful to understand their joint distribution. In this case we want a single measure of the difference between the treatment-arm distributions. Let \(K\) be the number of covariates, i.e. the number of components of the vector of pre-treatment variables \(X_i\); then we have \(K\)-vectors of sample means, \(\bar X_t\) and \(\bar X_c\), and \(K\times K\) covariance matrices, \(\hat\Sigma_t\) and \(\hat\Sigma_c\):
\[ \hat\Sigma_j=\frac{1}{N_j-1}\sum_{i\in j}(X_i-\bar X_j)(X_i-\bar X_j)^\prime \]
That single measure is the Mahalanobis distance:
\[ \hat\Delta_{ct}^{mv}=\sqrt{(\bar X_t-\bar X_c)^\prime\left(\frac{\hat\Sigma_c+\hat\Sigma_t}{2}\right)^{-1}(\bar X_t-\bar X_c)} \]
means <- dt[, ..num_cols
            ][, by = union,
              lapply(.SD, mean, na.rm = TRUE)]
covars <- dt[, ..num_cols
             ][, by = union,
               .(sigma = list(var(.SD, na.rm = TRUE)))]
# Average of Sigma matrices
m_sig <- (covars$sigma[[1]] + covars$sigma[[2]])/2
# Use unlist to convert data.table rows into vectors
# Note: mahalanobis() returns the *squared* distance; take sqrt() of it
# to match the formula above
mahalo <- mahalanobis(unlist(means[union == 1, -c("union")]),
                      unlist(means[union == 0, -c("union")]),
                      m_sig)
mahalo
## [1] 0.2062444
Propensity Score
Here we focus on the statistical problem of estimating the conditional probability of receiving the treatment given the observed covariates. The end goal is to obtain estimates of the propensity score that balance the covariates between the treated and control subsamples. In the design stage, the propensity score is useful to trim the data set in order to improve covariate balance.
We have seen in class that the propensity score is a balancing score and those have an important property: treatment assignment is unconfounded when conditioned on a balancing score:
\[W_i\perp Y_i(0), Y_i(1)|e(X_i)\]
and the propensity score is the probability of being assigned to treatment given the full set of observed covariates:
\[e(X_i)=Pr(W_i=1|X_i)\]
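The balancing-score property can be checked in a toy simulation (hypothetical data, for intuition only): within a narrow stratum of the true propensity score, the covariate is nearly balanced across arms, even though it is badly imbalanced overall.

```r
# Sketch: conditioning on the propensity score balances the covariate
set.seed(7)
N <- 1e5
x <- rnorm(N)
e <- plogis(x)          # true propensity score, increasing in x
w <- rbinom(N, 1, e)
# Raw imbalance in x between arms ...
raw_diff <- mean(x[w == 1]) - mean(x[w == 0])
# ... mostly disappears inside a thin propensity-score stratum
s <- e > 0.45 & e < 0.55
strat_diff <- mean(x[w == 1 & s]) - mean(x[w == 0 & s])
c(raw = raw_diff, stratified = strat_diff)
```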
Typical binary outcome models used to estimate a propensity score are the logit and probit (Cameron and Trivedi 2005). In the following sections we will proceed by estimating logit models.
Covariate selection for PS estimation
Before estimating the propensity score, the researcher must specify which covariates, and which transformations of them, enter the vector \(X\). It may not be sufficient to include them only linearly, but we also do not want a specification so flexible that it overfits, which would spoil the overlap assumption.
So we must have a principled method of selecting variables for the propensity score, one that is able to balance the two forces discussed. Imbens and Rubin (2015) propose a stepwise procedure for selecting covariates and higher-order terms2, but there is also a growing literature on automatic selection, mainly using the Lasso algorithm, e.g. Hastie, Tibshirani, and Wainwright (2019) and Belloni et al. (2017).
Stepwise Selection
#' Stepwise model selection - Imbens and Rubin -----------------------------
#' Imbens and Rubin's stepwise selection algorithm
#' treatment: character variable for treatment indicator variable
#' Xb: character vector with names of basic covariates: you may pass it as c()
#'     if you do not want any basic covariate
#' Xt: character vector with names for covariates to be tested for inclusion
#' data: dataframe with variables
#' Clinear: threshold, in terms of likelihood ratio statistics, for inclusion of
#'          linear terms
#' Cquadratic: threshold, in terms of likelihood ratio statistics, for inclusion
#'             of quadratic/interaction terms
#' intercept: does the model include an intercept?
#' Author: Luis Alvarez
#' Modifications: Rafael F. Bressan
ir_stepwise <- function(treatment, Xb, Xt, data,
                        Clinear = 1,
                        Cquadratic = 2.71,
                        intercept = TRUE)
{
  # Add or not intercept
  if (intercept)
    inter.add = "1"
  else inter.add = "-1"
  # Formula for model
  if (length(Xb) == 0)
    formula <- paste(treatment, inter.add, sep = " ~ ")
  else formula <- paste(treatment, paste(c(inter.add, Xb), collapse = " + "),
                        sep = " ~ ")
  continue <- TRUE
  Xt_left <- Xt
  # First order inclusion
  while (continue) {
    null.model <- fixest::feglm(as.formula(formula), data, family = "binomial")
    null.lkl <- logLik(null.model)
    test.stats <- c()
    for (covariate in Xt_left)
    {
      formula.test <- paste(formula, covariate, sep = " + ")
      test.model <- fixest::feglm(as.formula(formula.test), data,
                                  family = "binomial")
      lkl.ratio <- 2*(logLik(test.model) - null.lkl)
      test.stats <- c(test.stats, lkl.ratio)
    }
    if (max(test.stats, na.rm = TRUE) < Clinear)
      continue <- FALSE
    else {
      add.coef <- Xt_left[which.max(test.stats)]
      formula <- paste(formula, add.coef, sep = " + ")
      Xt_left <- Xt_left[-which.max(test.stats)]
    }
  }
  # Defining Xstar set. Set of first order included variables
  Xstar <- c(Xb, Xt[!(Xt %in% Xt_left)])

  # Creating all combinations of Xstar interactions
  combinations <- expand.grid(Xstar, Xstar)
  Xcomb <- paste(combinations[, 1], combinations[, 2], sep = ":")
  continue <- TRUE
  Xcomb_left <- Xcomb

  while (continue) {
    null.model <- fixest::feglm(as.formula(formula), data, family = "binomial")
    null.lkl <- logLik(null.model)
    test.stats <- c()
    for (covariate in Xcomb_left)
    {
      formula.test <- paste(formula, covariate, sep = " + ")
      test.model <- fixest::feglm(as.formula(formula.test), data,
                                  family = "binomial")
      lkl.ratio <- 2*(logLik(test.model) - null.lkl)
      test.stats <- c(test.stats, lkl.ratio)
    }
    if (max(test.stats, na.rm = TRUE) < Cquadratic)
      continue <- FALSE
    else {
      add.coef <- Xcomb_left[which.max(test.stats)]
      formula <- paste(formula, add.coef, sep = " + ")
      Xcomb_left <- Xcomb_left[-which.max(test.stats)]
    }
  }
  return(list(formula = formula, inc_x = Xstar))
}
The stepwise method asks for a basic set of covariates, \(X_b\), which will always be included in the specification and an additional set of covariates, \(X_t\), that should be tested for inclusion. The key characteristic of these covariates is that they are known not to be affected by the treatment. Then, the algorithm tests for the interactions and quadratic terms of variables in \(X^∗\), the set of included variables. All tests are based on likelihood ratio statistics and thresholds in terms of these statistics for inclusion (see Appendix A of Imbens (2015) for details).
#' Basic covariates
xb <- c("age", "age_2", "female", "race", "marital_status",
        "veteran", "education", "class_of_worker")
#' Test covariates
xt <- c("private_health_insurance", "medicaid")
#' TEST ONLY: Select a random sample of full data for quick results
# set.seed(1234)
# ds <- dt[sample(.N, 2000)]
ir_tic <- Sys.time()
ir_form <- ir_stepwise("union", xb, xt, data = dt)
## Warning in max(test.stats, na.rm = TRUE): no non-missing arguments to max;
## returning -Inf
ps_ir <- feglm(as.formula(ir_form$formula), family = "binomial",
               data = dt[, -c("earnings")])
## Warning in `[.data.table`(dt, , -c("earnings")): column(s) not removed because
## not found: [earnings]
ir_proc_time <- round(Sys.time() - ir_tic, 2)
# attr(terms(ps_ir), "term.labels")
Lasso Selection
#' Model selection via Lasso
lasso_tic <- Sys.time()
ps_lasso <- rlassologit(union~(.)^2,
                        data = dt[, c("union", ..xb, ..xt)])
lasso_proc_time <- round(Sys.time() - lasso_tic, 2)
# names(coef(ps_lasso))[ps_lasso$index]
Finally, the third model for estimating the propensity score was a full logit model with all meaningful variables and their interactions up to second order.
#' Model with full set of covariates!
fml <- paste0("union~(", paste(c(xb, xt), collapse = "+"), ")^2")
full_tic <- Sys.time()
ps_full <- feglm(as.formula(fml), family = "binomial",
                 data = dt[, c("union", ..xb, ..xt)])
full_proc_time <- round(Sys.time() - full_tic, 2)
# attr(terms(ps_full), "term.labels")
Assessing PS Overlap
#' Add columns with PS from each model
dt[, `:=`(
  ir_ps = predict(ps_ir),
  lasso_ps = predict(ps_lasso),
  full_ps = predict(ps_full)
)]
#' Make a long format dt
dt_models <- melt(dt,
                  measure.vars = c("ir_ps", "lasso_ps", "full_ps"),
                  variable.name = "model",
                  value.name = "ps")
ol_tbl <- dcast(dt_models[, by = .(union, model),
                          .(min = min(ps),
                            max = max(ps),
                            mean = mean(ps))],
                model~union, value.var = c("min", "mean", "max"))
# Reorder the columns by their index
treat <- grep("_1", colnames(ol_tbl))
cont <- grep("_0", colnames(ol_tbl))
setcolorder(ol_tbl, c(1, treat, cont))
ol_tbl
## model min_1 mean_1 max_1 min_0 mean_0 max_0
## 1: ir_ps 0.009918264 0.2293002 0.7494858 0.003333567 0.1009689 0.7420043
## 2: lasso_ps 0.035331068 0.2195084 0.6355415 0.035331068 0.1022517 0.6269871
## 3: full_ps 0.011251529 0.2324742 0.7958217 0.004475085 0.1005531 0.7620283
#' Label each facet including runtime
model_labels <- c(ir_ps = sprintf("Imbens-Rubin %s %s.",
                                  ir_proc_time, attr(ir_proc_time, "units")),
                  lasso_ps = sprintf("Lasso %s %s.",
                                     lasso_proc_time, attr(lasso_proc_time, "units")),
                  full_ps = sprintf("Full %s %s.",
                                    full_proc_time, attr(full_proc_time, "units")))
#' Plot the PS by model and treatment group
ggplot(dt_models, aes(ps)) +
  geom_histogram(aes(y = stat(density)), color = "White", bins = 50) +
  facet_grid(union~model,
             labeller = labeller(union = c(`0` = "Not Union",
                                           `1` = "Union"),
                                 model = model_labels)) +
  labs(x = "Propensity score") +
  theme_light()
Trimming to improve PS balance
#' Author: Luis Alvarez
#' Adapted by: Rafael F. Bressan
trimming <- function(ps)
{
  inv.vec = 1/(ps*(1 - ps))
  if (max(inv.vec) <= 2*mean(inv.vec))
  {
    print("No trimming")
    return(rep(TRUE, length(ps)))
  } else {
    # value function. eq. 16.8 Imbens and Rubin
    value_fun <- function(gamma) {
      2*sum(inv.vec[inv.vec <= gamma]) - gamma*sum(inv.vec <= gamma)
    }
    # root finding. g is a list with root
    g <- uniroot(value_fun, c(min(inv.vec), max(inv.vec)))
    alpha.trim <- 1/2 - sqrt(1/4 - 1/g$root)
    keep_vec <- ps <= 1 - alpha.trim & ps >= alpha.trim
    trim_obs <- length(ps) - sum(keep_vec)
    print(sprintf("Trimming threshold alpha is %s. Trimmed %s observations.",
                  alpha.trim, trim_obs))
    return(keep_vec)
  }
}
dt_models[, by = model, trim_keep := trimming(ps)]
## [1] "Trimming threshold alpha is 0.0400294186024297. Trimmed 1926 observations."
## [1] "No trimming"
## [1] "Trimming threshold alpha is 0.0409316530369026. Trimmed 2119 observations."
ol_tbl <- dcast(dt_models[(trim_keep), by = .(union, model),
                          .(min = min(ps),
                            max = max(ps),
                            mean = mean(ps))],
                model~union, value.var = c("min", "mean", "max"))
# Reorder the columns by their index
treat <- grep("_1", colnames(ol_tbl))
cont <- grep("_0", colnames(ol_tbl))
setcolorder(ol_tbl, c(1, treat, cont))
ol_tbl
## model min_1 mean_1 max_1 min_0 mean_0 max_0
## 1: ir_ps 0.04083745 0.2391832 0.7494858 0.04005112 0.1154320 0.7420043
## 2: lasso_ps 0.03533107 0.2195084 0.6355415 0.03533107 0.1022517 0.6269871
## 3: full_ps 0.04149871 0.2419742 0.7958217 0.04096000 0.1171516 0.7620283
ggplot(dt_models[(trim_keep)], aes(ps)) +
  geom_histogram(aes(y = stat(density)), color = "White", bins = 50) +
  facet_grid(union~model,
             labeller = labeller(union = c(`0` = "Not Union",
                                           `1` = "Union"),
                                 model = model_labels)) +
  labs(x = "Propensity score",
       title = "Propensity score distribution after trimming") +
  theme_light()
dt_trim <- dt_models[(trim_keep) & model == "full_ps"]
balance_kbl(balance_tbl(dt_trim, "union"))
## Warning in melt.data.table(dt[, ..num_cols], id.vars = treat, variable.factor =
## FALSE): 'measure.vars' [age, female, race, marital_status, ...] are not all of
## the same type. By order of hierarchy, the molten data value column will be of
## type 'double'. All measure variables not of type 'double' will be coerced too.
## Check DETAILS in ?melt.data.table for more on coercion.
The first Mean/Std.Dev pair refers to the Treatment group (\(N_t = 1411\)), the second to the Control group (\(N_c = 9221\)); the last five columns are the overlap measures.

Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\)
---|---|---|---|---|---|---|---|---|---
age | 46.38 | 12.95 | 44.61 | 13.76 | 4.74 | 0.13 | -0.06 | 0.09 | 0.03 |
age_2 | 2319.00 | 1221.94 | 2179.56 | 1269.62 | 3.97 | 0.11 | -0.04 | 0.09 | 0.03 |
education | 95.43 | 21.59 | 92.96 | 22.49 | 3.97 | 0.11 | -0.04 | 0.07 | 0.02 |
female | 0.47 | 0.50 | 0.44 | 0.50 | 2.42 | 0.07 | 0.01 | 0.56 | 0.53 |
marital_status | 0.67 | 0.47 | 0.61 | 0.49 | 4.17 | 0.12 | -0.03 | 0.39 | 0.33 |
medicaid | 0.05 | 0.22 | 0.08 | 0.27 | -4.32 | -0.11 | -0.21 | 0.92 | 0.95 |
own_farm_income_last_year | 30.34 | 837.65 | 151.02 | 11489.42 | -0.99 | -0.01 | -2.62 | 1.00 | 1.00 |
private_health_insurance | 0.93 | 0.26 | 0.88 | 0.32 | 6.03 | 0.16 | -0.22 | 0.12 | 0.07 |
ps | 0.24 | 0.17 | 0.12 | 0.10 | 27.42 | 0.91 | 0.50 | 0.12 | 0.17 |
race | 0.79 | 0.41 | 0.80 | 0.40 | -0.66 | -0.02 | 0.01 | 0.20 | 0.21 |
total_income_last_year | 67465.77 | 62569.48 | 67665.56 | 83301.62 | -0.11 | 0.00 | -0.29 | 0.11 | 0.03 |
veteran | 0.08 | 0.26 | 0.06 | 0.24 | 1.86 | 0.05 | 0.09 | 0.94 | 0.92 |
wage_income_last_year | 62062.09 | 58727.78 | 61608.21 | 76251.83 | 0.26 | 0.01 | -0.26 | 0.11 | 0.02 |
worked_last_year | 0.99 | 0.09 | 0.98 | 0.13 | 3.05 | 0.08 | -0.34 | 1.00 | 1.00 |
Assessing Unconfoundedness
In the second stage, analyses are carried out to assess the plausibility of unconfoundedness. Again, it should be stressed that these analyses can be carried out without access to the outcome data. The strategy relies on splitting the set of covariates into a pseudo-outcome, \(X_p\), and a matrix of remaining covariates, \(\mathbf{X}_r\). We then estimate the treatment effect on the pseudo-sample \((X_p, W, \mathbf{X}_r)\). This estimates a “pseudo” causal effect which we know a priori to be zero. If the estimate is indeed close to zero, we take it as evidence that the unconfoundedness assumption is valid.
We chose the variable wage_income_last_year to be our pseudo-outcome3. Since this variable was not part of our propensity score model, we are ready to estimate our pseudo-effect. We will use the Horvitz-Thompson (also known as Inverse Probability Weighting - IPW) estimator just for the sake of demonstration. Ideally one should use the same estimator intended for the real outcome, but we will cover other estimators like Matching, Subclassification and Doubly-Robust only in the next session.
Hence, we already have a trimmed dataset containing the pseudo-outcome, treatment status and covariates, \((X_p, W, X_r)\) to run the regression:
\[X_{p,i}=\alpha+\beta W_i+\gamma^\prime X_{r,i}+\varepsilon_i\]
# Creates the weights. Next TA class we will discuss this in more detail
dt_trim[, weight := union/ps + (1 - union)/(1 - ps)]
reg1 <- feols(wage_income_last_year~union, data = dt_trim)
reg2 <- feols(wage_income_last_year~union, data = dt_trim, weights = ~weight)
reg3 <- feols(wage_income_last_year~union+age+age_2+education+female,
              data = dt_trim, weights = ~weight)
setFixest_dict(c(wage_income_last_year = "Wage_{t-1}"))
etable(list(reg1, reg2, reg3), se = "hetero", sdBelow = TRUE)
## model 1 model 2 model 3
## Dependent Var.: Wage_{t-1} Wage_{t-1} Wage_{t-1}
##
## (Intercept) 61,608.2*** 61,352.7*** -118,233.3***
## (794.1) (771.9) (8,746.6)
## union 453.9 -365.2 -1,210.7
## (1,753.2) (1,829.3) (1,664.3)
## age 4,596.9***
## (435.3)
## age_2 -43.86***
## (5.116)
## education 868.5***
## (43.32)
## female -25,207.1***
## (1,840.1)
## _______________ ___________ ___________ ______________
## S.E. type Heter.-rob. Heter.-rob. Heterosk.-rob.
## Observations 10,632 10,632 10,632
## R2 4.31e-6 7.76e-6 0.12940
## Adj. R2 -8.98e-5 -8.63e-5 0.12899
Ignore for the moment the fact that union and education may be endogenous to wages, and we have our evidence of unconfoundedness.
As a final note, Imbens and Rubin (2015) make it clear that the covariates included in a propensity score model, and as controls in any estimation, should be pre-treatment variables not affected by the treatment assignment. Regressors in a causal model that are affected by the treatment are called bad controls, and we should avoid including them in a regression.
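The bad-control problem is easy to see in a simulation (all names and numbers illustrative): controlling for a variable that the treatment itself affects changes the estimand from the total effect to something else.

```r
# Sketch: "bad control" bias from a post-treatment variable (simulated data)
set.seed(99)
n <- 1e5
w <- rbinom(n, 1, 0.5)   # randomized treatment
m <- 0.8*w + rnorm(n)    # post-treatment variable, affected by w
y <- 2*w + m + rnorm(n)  # total effect of w on y is 2 + 0.8 = 2.8
coef(lm(y ~ w))["w"]     # recovers the total effect, about 2.8
coef(lm(y ~ w + m))["w"] # conditioning on m leaves only the direct effect, about 2
```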
References
I refer you to the following reading: Should we require balance t-tests of baseline observables in randomized experiments?↩︎
Be warned. In all my tests, even with a moderately sized dataset this procedure is extremely inefficient and slow.↩︎
Keep in mind this is a “toy” regression, an overly simplistic example. It may be argued that past wages are related to current union membership. The ideal pseudo-outcome is one that is not economically (and causally) related to the treatment.↩︎