library(data.table)
library(fixest)
library(kableExtra)
library(modelsummary)
library(MatchIt)
library(ggplot2)
<- fread("Data/cps_union_data.csv")
dt #' Cleaning.
c("V1", "CPSID", "CPSIDP", "public_housing", "employed") := NULL]
dt[, `:=`(
dt[, marital_status = ifelse(marital_status %in% c(1, 2), 1, 0),
race = ifelse(race == 1, 1, 0),
age_2 = age^2
)]<- c("class_of_worker", "class_of_worker_last_year")
cat_cols := lapply(.SD, factor), .SDcols = cat_cols]
dt[, (cat_cols) <- dt[complete.cases(dt)] dt
Matching
Given that CIA holds and asserting that there is overlap between both groups (i.e. strong ignorability assumption), the Propensity Score Matching - PSM - estimator for ATT can be written as:
\[ \begin{equation} \tau_{A T T}^{P S M}=E_{e(X) \mid D=1}\{E[Y(1) \mid D=1, e(X)]-E[Y(0) \mid D=0, e(X)]\} \end{equation} \]
where we draw the controls from a pool of untreated units and we aim to choose a subset such that its performance as a control group is better than the whole sample.
From Caliendo and Kopeinig (2008) we have the following figure, depicting the steps for implementing a PSM estimation.
We are now at step 3, and need to choose an adequate matching algorithm. There are several ways to do this, we will outline some of the most used methods.
Nearest neighbor matching (NNM) is a widely used method, and is composed of two steps. First, the algorithm sorts treated units in a specified order. Then, for each unit in the treatement group, the program finds one or more units in the control group with the nearest PS (or linearized PS). Matching with replacement means that after a treated unit is matched to another from the control group, the last is reintroduced to the control group for next-round matching.
It is worth noting the matching can be done based on the Mahalanobis metric distance, which is not a method based on the PS. It plays a similar role as PS in assessing the similarities of different units in terms of the joint distribution of their covariates.
We will use the package MatchIt
to perform PSM estimation1. I suggest you thoroughly read the package’s website, especially the articles. Let’s estimate the ATT matching on the propensity score at the 1:1 ratio.
#' Basic covariates
<- c("age", "age_2", "female", "race", "marital_status",
xb "veteran", "education", "class_of_worker")
#' Test covariates
<- c("private_health_insurance", "medicaid")
xt #' Model with full set of covariates!
<- paste0("union~(", paste(c(xb, xt), collapse = "+"), ")^2")
fml <- feglm(as.formula(fml), family = "binomial",
ps_full data = dt[, c("union", ..xb, ..xt)])
# Include the PS into original data
`:=`(
dt[, ps = predict(ps_full)
)]
# Matching units
<- matchit(as.formula(fml), method = "nearest",
psm_model data = dt[, c("earnings", "union", ..xb, ..xt)],
distance = dt$ps,
replace = FALSE,
m.order = "smallest",
ratio = 1)
psm_model
## A matchit object
## - method: 1:1 nearest neighbor matching without replacement
## - distance: User-defined
## - number of obs.: 12732 (original), 2954 (matched)
## - target estimand: ATT
## - covariates: age, age_2, female, race, marital_status, veteran, education, class_of_worker, private_health_insurance, medicaid
And we can access the matched data with the following command:
<- match.data(psm_model)
dta head(dta[, c(1:2, 13:15)])
## earnings union distance weights subclass
## 1: 865 0 0.46576368 1 501
## 2: 2884 0 0.07473265 1 1410
## 3: 965 1 0.47847617 1 1020
## 4: 800 1 0.47936622 1 1
## 5: 756 1 0.37228106 1 136
## 6: 473 1 0.40904697 1 347
The match.data
function returns the matched units, with all their original covariates, plus the columns distance, weights and subclass. Distance refers to the propensity score (or Mahalanobis if asked), weights and subclass refer to sample weights to be used in effect estimation and the “class” (i.e. block) that unit has been assigned. With this matched data in hand it is straightforward to estimate the ATT with a linear regression.
<- feols(earnings~union, data = dta, weights = ~weights)
psm_reg etable(psm_reg, se = "hetero", keep = "union")
## psm_reg
## Dependent Var.: earnings
##
## union 102.5*** (24.22)
## _______________ ________________
## S.E. type Heterosked.-rob.
## Observations 2,954
## R2 0.00604
## Adj. R2 0.00570
Adjustment for covariates inclusion is easily handled in this approach, you just include them on the linear regression. Also, CRVE may be used when necessary.
If replacement is used during the matching, you must use the get_matches
function to get the matched data and use this in the linear regression. The standard errors should be computed using as clusters the subclass and unity id.
# Avoiding get_matches bug
<- dt[, c("earnings", "union", ..xb, ..xt)]
dt_gm # Matching units
<- matchit(as.formula(fml), method = "nearest",
psm_rep data = dt_gm,
distance = dt$ps,
replace = TRUE,
m.order = "smallest",
ratio = 3)
# Get matched data
<- get_matches(psm_rep)
dtb head(dtb[, 1:5])
## id subclass weights earnings union
## 1 5 1 1.0000000 965 1
## 2 6540 1 0.3333333 1153 0
## 3 12413 1 0.3333333 2885 0
## 4 7258 1 0.3333333 800 0
## 5 10 2 1.0000000 800 1
## 6 4909 2 0.3333333 490 0
<- feols(earnings~union, data = dtb, weights = ~weights)
psm_rep_reg etable(psm_rep_reg, cluster = ~ subclass + id, keep = "union")
## psm_rep_reg
## Dependent Var.: earnings
##
## union 94.48*** (22.36)
## _______________ ________________
## S.E.: Clustered by: subclass & id
## Observations 5,908
## R2 0.00507
## Adj. R2 0.00490
Subclassification
Subclassification on the estimated propensity score is also refered as blocking or stratification. The sample is partitioned into subclasses (also referred to as strata or blocks), based on the values of the estimated propensity scores, so that within the subclasses, the estimated propensity scores are approximately constant.
The causal effects are estimated within each subclass as if assignment was completely at random within each subclass. If our sample is divided in \(B\) blocks, then Imbens and Rubin suggest estimating treatment effects within each block, \(\tau_{b}\), by fitting a linear model for each block \(b=1,2 \ldots B\) :
\[ y_{i b}=\alpha_{b}+\tau_{b} W_{i b}+\gamma_{b}^{\prime} X_{i b}+\varepsilon_{i b} \]
where \(y_{i b}\) is the outcome of observation \(i\) in block \(b ;\) and \(X_{i b}\) is an optional subset of covariates included. Blocking typically does not eliminate all biases associated with differences in the covariates, thus, remaining imbalances within each block can be accounted for if covariates are included. Second, these adjustments can improve the precision of estimators for causal effects.
After estimating the treatment effect within each block, we aggreate these as:
\[ \begin{equation} \hat{\tau}=\sum_{b=1}^{B} \omega_{b} \hat{\tau}_{b} \end{equation} \]
where the weights depend on the estimand of interest. For the ATE, then \(\omega_{b}=n_{b} / n\), where \(n_{b}\) is the number of observations in block \(b\). If we are interested in the average treatment effect on the treated (ATT), then \(\omega_{b}=n_{b}^{t} / n^{t}\), where \(n_{b}^{t}\) is the number of treated units in block \(b\), and \(n^{t}\) is the overall number of treated units.
Creating the blocks
Imbens and Rubin (2015) propose an algorithm to split the range of PS into \(J\) blocks in order to estimate the causal effect through the subclassification method. We create intervals of the type \([b_{j−1} ,b_j)\) such that we analyze units with propensity scores within an interval as if they have identical propensity scores.
We start with a single block: \(J = 1\), with boundaries equal to \(b_0 = 0\) and \(b_j = b_1 = 1\) (if no trimming is selected). Then we have two steps. First we assess the adequacy of the current number of blocks by computing a t-statistic for the null hypothesis that the average value of the estimated linearized propensity score is the same for treated and control units.
\[ t_{\ell}(j)=\frac{\bar{\ell}_{t}(j)-\bar{\ell}_{c}(j)}{\sqrt{s_{\ell}^{2}(j) \cdot\left(1 / N_{\mathrm{c}}(j)+1 / N_{\mathrm{t}}(j)\right)}} \]
Second, we check whether the number of controls and treated, \(N_c(j)\) and \(N_t(j)\), and the total number of units, \(N(j)\), in each new block, would be greater than some minimum. If a block is not adequately balanced, and if splitting that block would lead to two new ones each with a sufficient number of units, then, we split at the median value and the new blocks are assessed for adequacy.
#' Function that subdivides a given propensity score vector in sub-blocks
#' treat = vector with treatment assignments
#' ps = vector with propensity scores
#' K = how many covariates will we want to test/use in bias correction of
#' estimates later on?
#' t.max = threshold for tstat in making a further subdivide
#' trim = should we discard extreme observations so there is overlap?
#' Author: Luis Alvarez
#' Modifications: Rafael F. Bressan
<- function(treat, ps, K, t.max = 1.96, trim = FALSE)
ps_blocks
{# Linearized propensity score function
<- function(ps) {
lps log(ps/(1 - ps))
}
if (trim) {
= min(ps[treat == 1])
b0 = max(ps[treat == 0])
b1 else
}
{= 0
b0 = 1
b1
}= c(b0, b1)
b_vec while (TRUE)
{<- length(b_vec) - 1
J <- do.call(c, lapply(1:J, function(j){
b_vec_new <- (b_vec[j] <= ps) & (ps < b_vec[j + 1])
sample
<- lps(ps[sample & treat == 1])
ps.treat <- lps(ps[sample & treat == 0])
ps.control
#print(length(ps.control))
#print(length(ps.treat))
<- tryCatch(
t.test.pass abs(t.test(ps.control, ps.treat)$statistic) > t.max},
{error = function(e){return(FALSE)}
)
= median(c(ps.treat, ps.control))
med.val
= sum(ps.treat < med.val)
Nt.below = sum(ps.treat >= med.val)
Nt.above = sum(ps.control < med.val)
Nc.below = sum(ps.control >= med.val)
Nc.above
<- min(Nt.below, Nt.above, Nc.below, Nc.above) >= max(3, K + 2)
s.crit
if (t.test.pass & s.crit)
return(c(b_vec[j], plogis(med.val), b_vec[j + 1]))
else return(c(b_vec[j], b_vec[j + 1]))
# end of do.call
}))
= unique(b_vec_new)
b_vec_new
#print(length(b_vec_new))
if (length(b_vec_new) == length(b_vec))
break
else b_vec = b_vec_new
# end of while loop
}
#Constructing blocking variable now
= rep(NA, length(treat))
block_var
for (j in 1:(length(b_vec) - 1))
<= ps) & (ps < b_vec[j + 1])] = j
block_var[(b_vec[j]
return(block_var)
}
:= ps_blocks(union, ps, length(c(xb, xt)),
dt[, block t.max = 2.66,
trim = TRUE)]
#' Packing data into blocks
<- dt[!is.na(block), keyby = block,
dt_block data = list(.SD))]
.(#' Balance tables by blocks
:= lapply(data, balance_tbl, treat = "union")]
dt_block[, bal_tbl #' Check tables
lapply(dt_block$bal_tbl, balance_kbl)
Treatment (\(N_t = 103\))
|
Control (\(N_c = 3051\))
|
Overlap Measures
|
|||||||
---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\) |
age | 33.54 | 14.03 | 33.11 | 13.84 | 0.31 | 0.03 | 0.01 | 0.11 | 0.02 |
age_2 | 1320.17 | 1203.16 | 1288.01 | 1210.38 | 0.27 | 0.03 | -0.01 | 0.11 | 0.02 |
earnings | 832.57 | 578.21 | 674.55 | 514.70 | 2.74 | 0.29 | 0.12 | 0.05 | 0.07 |
education | 87.39 | 21.82 | 86.70 | 22.01 | 0.32 | 0.03 | -0.01 | 0.06 | 0.06 |
female | 0.71 | 0.46 | 0.68 | 0.47 | 0.73 | 0.07 | -0.03 | 0.32 | 0.29 |
marital_status | 0.33 | 0.47 | 0.33 | 0.47 | -0.05 | 0.00 | 0.00 | 0.67 | 0.67 |
medicaid | 0.16 | 0.36 | 0.16 | 0.36 | -0.05 | 0.00 | 0.00 | 0.84 | 0.84 |
own_farm_income_last_year | 0.10 | 0.99 | 21.35 | 645.19 | -1.82 | -0.05 | -6.48 | 1.00 | 1.00 |
private_health_insurance | 0.49 | 0.50 | 0.52 | 0.50 | -0.63 | -0.06 | 0.00 | 0.48 | 0.51 |
ps | 0.04 | 0.01 | 0.03 | 0.01 | 1.52 | 0.15 | -0.06 | 0.07 | 0.09 |
race | 0.83 | 0.37 | 0.87 | 0.34 | -0.97 | -0.10 | 0.11 | 0.13 | 0.17 |
total_income_last_year | 38599.91 | 31115.69 | 36857.61 | 46876.84 | 0.55 | 0.04 | -0.41 | 0.06 | 0.08 |
veteran | 0.05 | 0.22 | 0.04 | 0.19 | 0.50 | 0.05 | 0.13 | 0.96 | 0.95 |
wage_income_last_year | 34536.77 | 29439.08 | 33395.99 | 44332.65 | 0.38 | 0.03 | -0.41 | 0.07 | 0.09 |
worked_last_year | 0.94 | 0.24 | 0.95 | 0.22 | -0.39 | -0.04 | 0.08 | 0.05 | 0.06 |
Treatment (\(N_t = 207\))
|
Control (\(N_c = 2949\))
|
Overlap Measures
|
|||||||
---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\) |
age | 43.48 | 14.47 | 42.89 | 14.29 | 0.57 | 0.04 | 0.01 | 0.08 | 0.04 |
age_2 | 2098.75 | 1335.86 | 2043.48 | 1324.24 | 0.58 | 0.04 | 0.01 | 0.08 | 0.04 |
earnings | 1038.98 | 638.25 | 1036.82 | 761.29 | 0.05 | 0.00 | -0.18 | 0.13 | 0.01 |
education | 92.05 | 22.62 | 93.23 | 23.76 | -0.72 | -0.05 | -0.05 | 0.06 | 0.03 |
female | 0.52 | 0.50 | 0.49 | 0.50 | 0.76 | 0.05 | 0.00 | 0.51 | 0.48 |
marital_status | 0.49 | 0.50 | 0.49 | 0.50 | -0.04 | 0.00 | 0.00 | 0.51 | 0.51 |
medicaid | 0.14 | 0.34 | 0.11 | 0.31 | 1.19 | 0.09 | 0.11 | 0.89 | 0.86 |
own_farm_income_last_year | 0.00 | 0.00 | 13.37 | 492.22 | -1.48 | -0.04 | -Inf | 1.00 | 1.00 |
private_health_insurance | 0.86 | 0.35 | 0.85 | 0.36 | 0.36 | 0.03 | -0.02 | 0.15 | 0.14 |
ps | 0.07 | 0.01 | 0.07 | 0.01 | 1.73 | 0.12 | 0.00 | 0.07 | 0.04 |
race | 0.89 | 0.31 | 0.87 | 0.34 | 1.01 | 0.07 | -0.08 | 0.13 | 0.11 |
total_income_last_year | 57886.11 | 50401.69 | 67975.00 | 92121.69 | -2.59 | -0.14 | -0.60 | 0.07 | 0.05 |
veteran | 0.04 | 0.19 | 0.04 | 0.20 | -0.07 | -0.01 | -0.01 | 0.96 | 0.96 |
wage_income_last_year | 52616.00 | 45717.86 | 61483.78 | 86758.87 | -2.49 | -0.13 | -0.64 | 0.08 | 0.04 |
worked_last_year | 0.98 | 0.15 | 0.98 | 0.15 | -0.22 | -0.02 | 0.05 | 1.00 | 1.00 |
Treatment (\(N_t = 44\))
|
Control (\(N_c = 748\))
|
Overlap Measures
|
|||||||
---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\) |
age | 51.61 | 11.75 | 49.67 | 11.10 | 1.07 | 0.17 | 0.06 | 0.04 | 0.05 |
age_2 | 2798.93 | 1248.11 | 2590.01 | 1022.98 | 1.09 | 0.18 | 0.20 | 0.04 | 0.05 |
earnings | 1187.80 | 688.00 | 1110.69 | 762.68 | 0.72 | 0.11 | -0.10 | 0.24 | 0.00 |
education | 104.59 | 19.79 | 95.77 | 21.04 | 2.86 | 0.43 | -0.06 | 0.23 | 0.23 |
female | 0.64 | 0.49 | 0.66 | 0.47 | -0.37 | -0.06 | 0.03 | 0.34 | 0.36 |
marital_status | 0.77 | 0.42 | 0.77 | 0.42 | 0.00 | 0.00 | 0.01 | 0.23 | 0.23 |
medicaid | 0.00 | 0.00 | 0.03 | 0.18 | -5.08 | -0.26 | -Inf | 1.00 | 1.00 |
own_farm_income_last_year | 0.00 | 0.00 | 24.06 | 476.44 | -1.38 | -0.07 | -Inf | 1.00 | 1.00 |
private_health_insurance | 0.95 | 0.21 | 0.95 | 0.23 | 0.28 | 0.04 | -0.08 | 0.05 | 0.05 |
ps | 0.08 | 0.00 | 0.08 | 0.00 | 1.84 | 0.30 | 0.08 | 0.09 | 0.09 |
race | 0.89 | 0.32 | 0.90 | 0.30 | -0.24 | -0.04 | 0.06 | 0.10 | 0.11 |
total_income_last_year | 83222.20 | 76412.63 | 71255.32 | 79359.09 | 1.01 | 0.15 | -0.04 | 0.14 | 0.05 |
veteran | 0.02 | 0.15 | 0.02 | 0.14 | 0.06 | 0.01 | 0.04 | 1.00 | 1.00 |
wage_income_last_year | 77834.27 | 73042.72 | 65106.75 | 74359.95 | 1.12 | 0.17 | -0.02 | 0.16 | 0.05 |
worked_last_year | 1.00 | 0.00 | 0.99 | 0.10 | 2.66 | 0.14 | -Inf | 1.00 | 1.00 |
Treatment (\(N_t = 76\))
|
Control (\(N_c = 712\))
|
Overlap Measures
|
|||||||
---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\) |
age | 47.21 | 10.74 | 46.83 | 11.27 | 0.29 | 0.03 | -0.05 | 0.09 | 0.03 |
age_2 | 2342.74 | 957.67 | 2320.22 | 1004.92 | 0.19 | 0.02 | -0.05 | 0.09 | 0.03 |
earnings | 1393.11 | 765.81 | 1362.36 | 835.06 | 0.33 | 0.04 | -0.09 | 0.09 | 0.00 |
education | 105.78 | 19.71 | 105.72 | 16.10 | 0.03 | 0.00 | 0.20 | 0.09 | 0.09 |
female | 0.51 | 0.50 | 0.46 | 0.50 | 0.93 | 0.11 | 0.01 | 0.54 | 0.49 |
marital_status | 0.75 | 0.44 | 0.74 | 0.44 | 0.11 | 0.01 | 0.00 | 0.26 | 0.25 |
medicaid | 0.07 | 0.25 | 0.04 | 0.19 | 0.94 | 0.13 | 0.27 | 0.96 | 0.93 |
own_farm_income_last_year | 0.00 | 0.00 | 191.01 | 2373.68 | -2.15 | -0.11 | -Inf | 1.00 | 1.00 |
private_health_insurance | 0.96 | 0.20 | 0.96 | 0.20 | 0.05 | 0.01 | -0.01 | 0.04 | 0.04 |
ps | 0.08 | 0.00 | 0.08 | 0.00 | 0.00 | 0.00 | 0.01 | 0.04 | 0.07 |
race | 0.84 | 0.37 | 0.83 | 0.37 | 0.24 | 0.03 | -0.02 | 0.17 | 0.16 |
total_income_last_year | 80370.89 | 75102.68 | 86945.45 | 75254.55 | -0.73 | -0.09 | 0.00 | 0.05 | 0.05 |
veteran | 0.04 | 0.20 | 0.03 | 0.18 | 0.25 | 0.03 | 0.08 | 0.97 | 0.96 |
wage_income_last_year | 73969.71 | 62181.04 | 78832.80 | 65729.39 | -0.64 | -0.08 | -0.06 | 0.05 | 0.07 |
worked_last_year | 0.99 | 0.11 | 0.99 | 0.11 | -0.14 | -0.02 | 0.08 | 1.00 | 1.00 |
Treatment (\(N_t = 169\))
|
Control (\(N_c = 1412\))
|
Overlap Measures
|
|||||||
---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\) |
age | 45.38 | 12.26 | 46.59 | 11.78 | -1.21 | -0.10 | 0.04 | 0.07 | 0.05 |
age_2 | 2209.23 | 1140.80 | 2308.91 | 1134.99 | -1.07 | -0.09 | 0.01 | 0.07 | 0.05 |
earnings | 1258.40 | 669.70 | 1220.67 | 785.32 | 0.68 | 0.05 | -0.16 | 0.08 | 0.01 |
education | 87.91 | 17.42 | 90.34 | 18.73 | -1.70 | -0.13 | -0.07 | 0.06 | 0.05 |
female | 0.18 | 0.38 | 0.21 | 0.40 | -0.89 | -0.07 | -0.05 | 0.79 | 0.82 |
marital_status | 0.64 | 0.48 | 0.65 | 0.48 | -0.15 | -0.01 | 0.01 | 0.35 | 0.36 |
medicaid | 0.03 | 0.17 | 0.04 | 0.20 | -0.77 | -0.06 | -0.15 | 1.00 | 0.97 |
own_farm_income_last_year | 230.78 | 2405.35 | 811.63 | 29288.49 | -0.73 | -0.03 | -2.50 | 1.00 | 1.00 |
private_health_insurance | 0.97 | 0.17 | 0.96 | 0.21 | 1.06 | 0.08 | -0.19 | 0.04 | 0.03 |
ps | 0.10 | 0.01 | 0.10 | 0.01 | 1.44 | 0.12 | -0.01 | 0.09 | 0.04 |
race | 0.62 | 0.49 | 0.61 | 0.49 | 0.05 | 0.00 | 0.00 | 0.39 | 0.38 |
total_income_last_year | 72285.69 | 43214.51 | 80041.91 | 105324.69 | -1.78 | -0.10 | -0.89 | 0.13 | 0.01 |
veteran | 0.05 | 0.21 | 0.08 | 0.26 | -1.56 | -0.12 | -0.21 | 0.92 | 0.95 |
wage_income_last_year | 68841.37 | 41655.16 | 72868.47 | 87632.62 | -1.02 | -0.06 | -0.74 | 0.13 | 0.02 |
worked_last_year | 0.99 | 0.08 | 0.99 | 0.09 | 0.40 | 0.03 | -0.18 | 1.00 | 1.00 |
Treatment (\(N_t = 244\))
|
Control (\(N_c = 1337\))
|
Overlap Measures
|
|||||||
---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\) |
age | 48.00 | 12.53 | 48.22 | 12.78 | -0.25 | -0.02 | -0.02 | 0.08 | 0.04 |
age_2 | 2460.73 | 1194.62 | 2488.61 | 1234.29 | -0.33 | -0.02 | -0.03 | 0.08 | 0.04 |
earnings | 1221.63 | 636.17 | 1067.21 | 702.28 | 3.43 | 0.23 | -0.10 | 0.14 | 0.00 |
education | 84.04 | 20.57 | 81.66 | 22.99 | 1.63 | 0.11 | -0.11 | 0.10 | 0.03 |
female | 0.28 | 0.45 | 0.25 | 0.43 | 0.90 | 0.06 | 0.04 | 0.75 | 0.72 |
marital_status | 0.75 | 0.43 | 0.75 | 0.43 | 0.03 | 0.00 | 0.00 | 0.25 | 0.25 |
medicaid | 0.03 | 0.18 | 0.04 | 0.19 | -0.49 | -0.03 | -0.08 | 0.96 | 0.97 |
own_farm_income_last_year | 0.00 | 0.00 | 9.73 | 282.80 | -1.26 | -0.05 | -Inf | 1.00 | 1.00 |
private_health_insurance | 0.93 | 0.26 | 0.92 | 0.27 | 0.18 | 0.01 | -0.02 | 0.08 | 0.07 |
ps | 0.16 | 0.04 | 0.15 | 0.04 | 1.58 | 0.11 | 0.06 | 0.07 | 0.04 |
race | 0.70 | 0.46 | 0.72 | 0.45 | -0.60 | -0.04 | 0.02 | 0.28 | 0.30 |
total_income_last_year | 71728.98 | 45025.76 | 64531.37 | 71549.65 | 2.07 | 0.12 | -0.46 | 0.06 | 0.04 |
veteran | 0.14 | 0.34 | 0.13 | 0.34 | 0.15 | 0.01 | 0.01 | 0.87 | 0.86 |
wage_income_last_year | 64494.05 | 38379.75 | 60168.60 | 69077.85 | 1.40 | 0.08 | -0.59 | 0.10 | 0.03 |
worked_last_year | 1.00 | 0.06 | 0.99 | 0.11 | 1.42 | 0.08 | -0.50 | 1.00 | 1.00 |
Treatment (\(N_t = 265\))
|
Control (\(N_c = 526\))
|
Overlap Measures
|
|||||||
---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\) |
age | 48.83 | 13.52 | 48.21 | 13.34 | 0.61 | 0.05 | 0.01 | 0.07 | 0.06 |
age_2 | 2566.89 | 1332.10 | 2501.79 | 1290.21 | 0.66 | 0.05 | 0.03 | 0.07 | 0.06 |
earnings | 1095.82 | 586.35 | 996.68 | 632.79 | 2.19 | 0.16 | -0.08 | 0.03 | 0.05 |
education | 97.75 | 20.43 | 99.48 | 21.68 | -1.10 | -0.08 | -0.06 | 0.21 | 0.22 |
female | 0.56 | 0.50 | 0.58 | 0.49 | -0.42 | -0.03 | 0.01 | 0.42 | 0.44 |
marital_status | 0.67 | 0.47 | 0.63 | 0.48 | 0.97 | 0.07 | -0.02 | 0.37 | 0.33 |
medicaid | 0.01 | 0.11 | 0.02 | 0.13 | -0.67 | -0.05 | -0.20 | 1.00 | 1.00 |
own_farm_income_last_year | 0.00 | 0.09 | 0.00 | 0.11 | -0.54 | -0.04 | -0.20 | 1.00 | 1.00 |
private_health_insurance | 0.95 | 0.22 | 0.95 | 0.22 | 0.02 | 0.00 | 0.00 | 0.05 | 0.05 |
ps | 0.33 | 0.03 | 0.33 | 0.03 | 0.74 | 0.06 | -0.03 | 0.07 | 0.05 |
race | 0.75 | 0.43 | 0.72 | 0.45 | 0.86 | 0.06 | -0.03 | 0.28 | 0.25 |
total_income_last_year | 69449.44 | 102854.22 | 62829.92 | 66162.49 | 0.95 | 0.08 | 0.44 | 0.08 | 0.03 |
veteran | 0.08 | 0.26 | 0.07 | 0.25 | 0.46 | 0.03 | 0.06 | 0.93 | 0.92 |
wage_income_last_year | 64438.74 | 101556.84 | 56627.92 | 64313.39 | 1.14 | 0.09 | 0.46 | 0.09 | 0.04 |
worked_last_year | 1.00 | 0.06 | 0.99 | 0.11 | 1.28 | 0.09 | -0.55 | 1.00 | 1.00 |
Treatment (\(N_t = 368\))
|
Control (\(N_c = 423\))
|
Overlap Measures
|
|||||||
---|---|---|---|---|---|---|---|---|---|
Variable | Mean | Std.Dev | Mean | Std.Dev | t-stat | Norm. Diff. | Log Ratio | \(\pi_c^{0.05}\) | \(\pi_t^{0.05}\) |
age | 45.88 | 11.73 | 46.61 | 11.94 | -0.87 | -0.06 | -0.02 | 0.06 | 0.03 |
age_2 | 2242.10 | 1096.53 | 2314.99 | 1119.03 | -0.92 | -0.07 | -0.02 | 0.06 | 0.03 |
earnings | 1161.78 | 578.59 | 1026.36 | 638.95 | 3.13 | 0.22 | -0.10 | 0.11 | 0.03 |
education | 103.14 | 20.14 | 100.32 | 20.85 | 1.93 | 0.14 | -0.03 | 0.22 | 0.17 |
female | 0.60 | 0.49 | 0.56 | 0.50 | 1.20 | 0.09 | -0.01 | 0.44 | 0.40 |
marital_status | 0.70 | 0.46 | 0.72 | 0.45 | -0.64 | -0.05 | 0.02 | 0.28 | 0.30 |
medicaid | 0.04 | 0.18 | 0.03 | 0.17 | 0.55 | 0.04 | 0.11 | 0.97 | 0.96 |
own_farm_income_last_year | 10.33 | 161.74 | 2.36 | 48.62 | 0.91 | 0.07 | 1.20 | 1.00 | 1.00 |
private_health_insurance | 0.96 | 0.19 | 0.98 | 0.14 | -1.37 | -0.10 | 0.28 | 0.02 | 1.00 |
ps | 0.47 | 0.06 | 0.46 | 0.05 | 2.54 | 0.18 | 0.09 | 0.05 | 0.07 |
race | 0.88 | 0.33 | 0.86 | 0.34 | 0.50 | 0.04 | -0.04 | 0.14 | 0.12 |
total_income_last_year | 63900.30 | 40403.92 | 59276.68 | 40104.78 | 1.61 | 0.11 | 0.01 | 0.11 | 0.02 |
veteran | 0.09 | 0.28 | 0.08 | 0.28 | 0.21 | 0.02 | 0.02 | 0.92 | 0.91 |
wage_income_last_year | 58489.62 | 33893.87 | 53629.72 | 36825.84 | 1.93 | 0.14 | -0.08 | 0.14 | 0.02 |
worked_last_year | 1.00 | 0.05 | 1.00 | 0.05 | -0.10 | -0.01 | 0.07 | 1.00 | 1.00 |
`:=`(N = sapply(data, nrow),
dt_block[, Nt = sapply(data, function(x) sum(x$union)))]
`:=`(w = N/sum(N),
dt_block[, wt = Nt/sum(Nt))]
:= lapply(data, function(x){
dt_block[, estimates <- feols(earnings~union, data = x)
reg <- coeftable(reg, se = "hetero")["union", c("Estimate", "Std. Error")]
est return(list(est[1], est[2]))
})]# Unpack estimates
= block, c("tau", "sigma") := estimates[[1]]]
dt_block[, by <- dt_block[, .(
bl_res estimand = c("ATE", "ATT"),
Estimate = c(sum(tau*w), sum(tau*wt)),
`Std. Error` = c(sqrt(sum(w^2 * sigma^2)), sqrt(sum(wt^2 * sigma^2)))
)] bl_res
## estimand Estimate Std. Error
## 1: ATE 85.47282 22.66528
## 2: ATT 96.62407 19.23292
Or, we can use the MatchIt
package.
<- matchit(as.formula(fml), data = dt,
mS method = "subclass",
distance = dt$ps,
estimand = "ATT",
subclass = 8,
discard = "both")
<- match.data(mS)
dtc <- feols(earnings~union, data = dtc, weights = ~weights)
sub_reg etable(sub_reg, se = "hetero", keep = "union")
## sub_reg
## Dependent Var.: earnings
##
## union 97.99*** (19.63)
## _______________ ________________
## S.E. type Heterosked.-rob.
## Observations 12,635
## R2 0.00201
## Adj. R2 0.00193
Doubly-Robust
The basic Horvitz-Thompson (IPW) estimator exploits the following two equalities:
\[\begin{equation} \mathbb{E}\left[\frac{Y_{i}\cdot W_{i} }{e\left(X_{i}\right)}\right]=\mathbb{E}\left[Y_{i}(1)\right] \quad \text { and } \quad \mathbb{E}\left[\frac{Y_{i} \cdot\left(1-W_{i}\right) }{1-e\left(X_{i}\right)}\right]=\mathbb{E}\left[Y_{i}(0)\right] \end{equation}\]
For ATE we have the following weighting scheme:
\[\omega_{i}^{ate}=\frac{1}{e\left(X_{i}\right)^{W_{i}} \cdot\left(1-e\left(X_{i}\right)\right)^{1-W_{i}}}=\left\{\begin{array}{ll} 1 /\left(1-e\left(X_{i}\right)\right) & \text { if } W_{i}=0 \\ 1 / e\left(X_{i}\right) & \text { if } W_{i}=1 \end{array}\right.\]
and for ATT the weights are slightly changed,
\[\omega_{i}^{att}=\frac{1}{P[W_i=1]}\frac{e(X_i)^{1-W_i}}{\left(1-e\left(X_{i}\right)\right)^{1-W_{i}}}=\left\{\begin{array}{ll} P[W_i=1]^{-1}\cdot e(X_i) /\left(1-e\left(X_{i}\right)\right) & \text { if } W_{i}=0 \\ P[W_i=1]^{-1} & \text { if } W_{i}=1 \end{array}\right.\]
Hence, our estimands are given by:
\[\begin{align} \tau^{ATE}=\mathbb{E}\left[Y_{i}(1)-Y_{i}(0)\right]&=\mathbb{E}\left[\frac{Y_{i} \cdot W_{i}}{e(X_i)}-\frac{Y_{i} \cdot\left(1-W_{i}\right)}{1-e(X_i)}\right] \\ \tau^{ATT}=\mathbb{E}\left[Y_{i}(1)-Y_{i}(0) \mid W_{i}=1\right]&=\mathbb{E}\left[\frac{Y_{i} \cdot W_{i}}{\mathbb{P}\left[W_{i}=1\right]}-\frac{Y_{i} \cdot\left(1-W_{i}\right) \cdot e(X_i)}{\mathbb{P}\left[W_{i}=1\right]\left(1-e(X_i)\right)}\right] \end{align}\]
and the estimator will be the sample counterpart, considering the estimated propensity score, \(\hat e(X_i)\). It is useful to write this estimator as a weighted regression estimator. Take the model for \(\tau^{ATE}\), for example. The regression specification and the weights are, respectively:
\[ \begin{align} Y_i&=\alpha+\tau\cdot W_i+\varepsilon_i, \\ \omega_i &= \frac{W_i}{\hat e(X_i)}+\frac{1-W_i}{1-\hat e(X_i)} \end{align} \] and it is easy to see that \(\mathbb{E}[\hat\tau] = \tau^{ATE}\). This estimator can be modified easily to incorporate covariates. One can estimate a regression function that includes additional covariates the researcher believe help explain the outcome and use the same weights as before.
\[ \begin{align} Y_i&=\alpha+\tau\cdot W_i+X_i\beta+\varepsilon_i \\ \end{align} \]
This weighted regression estimator will be consistent for the true ATE as long as either the specification of the propensity score or the regression function is correct, yielding the property known as double-robustness.
# Overall probability of treatment
<- mean(dt$union)
p_treat # Creating the weights
`:=`(w = union/ps + (1 - union)/(1 - ps),
dt[, wt = (1/p_treat)*(ps/(1 - ps))^(1 - union))]
# ATE regression
<- feols(earnings~union+age+age_2+education+female,
reg data = dt, weights = ~w)
# ATT regression
<- feols(earnings~union+age+age_2+education+female,
reg2 data = dt, weights = ~wt)
etable(list(reg, reg2), se = "hetero", keep = "union")
## model 1 model 2
## Dependent Var.: earnings earnings
##
## union 85.74*** (19.69) 93.31*** (17.92)
## _______________ ________________ ________________
## S.E. type Heterosked.-rob. Heterosked.-rob.
## Observations 12,732 12,732
## R2 0.29100 0.22192
## Adj. R2 0.29072 0.22161
Useful links
References
A very nice article about the usage of
MatchIt
package and others inR
is Zhao et al. (2021).↩︎