library(data.table)
library(fixest)
library(kableExtra)
library(modelsummary)
library(MatchIt)
library(ggplot2)
dt <- fread("Data/cps_union_data.csv")
#' Cleaning.
dt[, c("V1", "CPSID", "CPSIDP", "public_housing", "employed") := NULL]
dt[, `:=`(
  marital_status = ifelse(marital_status %in% c(1, 2), 1, 0),
  race = ifelse(race == 1, 1, 0),
  age_2 = age^2
)]
cat_cols <- c("class_of_worker", "class_of_worker_last_year")
dt[, (cat_cols) := lapply(.SD, factor), .SDcols = cat_cols]
dt <- dt[complete.cases(dt)]

Matching

Given that the CIA holds and that there is overlap between the two groups (together, the strong ignorability assumption), the propensity score matching (PSM) estimator for the ATT can be written as:

\[ \begin{equation} \tau_{ATT}^{PSM}=E_{e(X) \mid D=1}\left\{E[Y(1) \mid D=1, e(X)]-E[Y(0) \mid D=0, e(X)]\right\} \end{equation} \]

where the controls are drawn from the pool of untreated units, and the goal is to select a subset that serves as a better control group than the untreated sample as a whole.

Caliendo and Kopeinig (2008) provide a figure depicting the steps for implementing a PSM estimation.

We are now at step 3 and need to choose an adequate matching algorithm. There are several ways to do this; we outline some of the most commonly used methods.

Nearest neighbor matching (NNM) is a widely used method composed of two steps. First, the algorithm sorts the treated units in a specified order. Then, for each unit in the treatment group, it finds one or more units in the control group with the nearest PS (or linearized PS). Matching with replacement means that after a control unit is matched to a treated unit, it is returned to the control pool and remains available for subsequent matches. A toy example of the greedy, without-replacement version is sketched below.
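To fix ideas, here is a minimal, self-contained sketch of the greedy 1:1 without-replacement version of the algorithm, using made-up propensity score vectors (all objects below are illustrative, not part of the analysis):

# Toy example: greedy 1:1 nearest-neighbor matching on the PS, without replacement
ps_t <- c(0.62, 0.35, 0.48)            # treated units (illustrative values)
ps_c <- c(0.30, 0.50, 0.60, 0.45)      # control units (illustrative values)
available <- rep(TRUE, length(ps_c))   # controls not yet used
match_idx <- integer(length(ps_t))
for (i in order(ps_t)) {               # visit treated units from smallest PS up
  d <- abs(ps_c - ps_t[i])
  d[!available] <- Inf                 # without replacement: used controls are out
  match_idx[i] <- which.min(d)
  available[match_idx[i]] <- FALSE
}
match_idx                              # index of the control matched to each treated unit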

It is worth noting that matching can also be done on the Mahalanobis distance, which is not based on the PS. It plays a similar role to the PS in assessing how similar two units are in terms of the joint distribution of their covariates.
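For intuition, base R's mahalanobis() returns the squared Mahalanobis distance of each observation to a reference point, here the covariate means; matching procedures use the analogous pairwise distance between units. A small sketch on an illustrative covariate subset of our data:

# Squared Mahalanobis distance of each unit to the covariate means
# (matching uses the analogous pairwise distance between units)
X <- as.matrix(dt[, .(age, education, earnings)])  # illustrative covariate subset
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
head(d2)

In matchit() this metric can be requested with distance = "mahalanobis" instead of a propensity score.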

We will use the MatchIt package to perform the PSM estimation1. I suggest you thoroughly read the package’s website, especially the articles. Let’s estimate the ATT by matching on the propensity score at a 1:1 ratio.

#' Basic covariates
xb <- c("age", "age_2", "female", "race", "marital_status", 
        "veteran", "education", "class_of_worker")
#' Test covariates
xt <- c("private_health_insurance", "medicaid")
#' PS model formula: full set of covariates plus all pairwise interactions
fml <- paste0("union~(", paste(c(xb, xt), collapse = "+"), ")^2")
ps_full <- feglm(as.formula(fml), family = "binomial",
                data = dt[, c("union", ..xb, ..xt)])
# Add the estimated propensity score (response scale) to the original data
dt[, ps := predict(ps_full)]

# Matching units
psm_model <- matchit(as.formula(fml), method = "nearest",
                     data = dt[, c("earnings", "union", ..xb, ..xt)],
                     distance = dt$ps,
                     replace = FALSE,
                     m.order = "smallest",
                     ratio = 1)
psm_model
## A matchit object
##  - method: 1:1 nearest neighbor matching without replacement
##  - distance: User-defined
##  - number of obs.: 12732 (original), 2954 (matched)
##  - target estimand: ATT
##  - covariates: age, age_2, female, race, marital_status, veteran, education, class_of_worker, private_health_insurance, medicaid

And we can access the matched data with the following command:

dta <- match.data(psm_model)
head(dta[, c(1:2, 13:15)])
##    earnings union   distance weights subclass
## 1:      865     0 0.46576368       1      501
## 2:     2884     0 0.07473265       1     1410
## 3:      965     1 0.47847617       1     1020
## 4:      800     1 0.47936622       1        1
## 5:      756     1 0.37228106       1      136
## 6:      473     1 0.40904697       1      347

The match.data function returns the matched units, with all their original covariates, plus three new columns: distance, weights and subclass. distance holds the propensity score (or the Mahalanobis distance, if requested), weights holds the matching weights to be used in effect estimation, and subclass identifies the “class” (i.e. the matched pair or block) to which each unit has been assigned. With this matched data in hand it is straightforward to estimate the ATT with a linear regression.
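Before running the outcome regression, it is worth inspecting covariate balance in the matched sample, for instance with MatchIt's summary method:

# Covariate balance in the matched sample (standardized mean differences, etc.)
summary(psm_model)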

psm_reg <- feols(earnings~union, data = dta, weights = ~weights)
etable(psm_reg, se = "hetero", keep = "union")
##                          psm_reg
## Dependent Var.:         earnings
##                                 
## union           102.5*** (24.22)
## _______________ ________________
## S.E. type       Heterosked.-rob.
## Observations               2,954
## R2                       0.00604
## Adj. R2                  0.00570

Covariate adjustment is easily handled in this approach: you just include the covariates in the linear regression. Cluster-robust variance estimation (CRVE) may also be used when necessary, as in the sketch below.
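As an illustration (the covariate set and clustering level below are a plausible choice for this sketch, not a prescribed specification), one could run:

# ATT with covariate adjustment; SEs clustered on the matched pair (subclass)
psm_reg_adj <- feols(earnings ~ union + age + age_2 + education + female,
                     data = dta, weights = ~weights)
etable(psm_reg_adj, cluster = ~subclass, keep = "union")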

If replacement is used during the matching, you must use the get_matches function to retrieve the matched data and use it in the linear regression. The standard errors should then be clustered on both the subclass and the unit id.

# Avoiding get_matches bug
dt_gm <- dt[, c("earnings", "union", ..xb, ..xt)]
# Matching units
psm_rep <- matchit(as.formula(fml), method = "nearest",
                     data = dt_gm,
                     distance = dt$ps,
                     replace = TRUE,
                     m.order = "smallest",
                     ratio = 3)
# Get matched data
dtb <- get_matches(psm_rep)
head(dtb[, 1:5])
##      id subclass   weights earnings union
## 1     5        1 1.0000000      965     1
## 2  6540        1 0.3333333     1153     0
## 3 12413        1 0.3333333     2885     0
## 4  7258        1 0.3333333      800     0
## 5    10        2 1.0000000      800     1
## 6  4909        2 0.3333333      490     0
psm_rep_reg <- feols(earnings~union, data = dtb, weights = ~weights)
etable(psm_rep_reg, cluster = ~ subclass + id, keep = "union")
##                       psm_rep_reg
## Dependent Var.:          earnings
##                                  
## union            94.48*** (22.36)
## _______________  ________________
## S.E.: Clustered by: subclass & id
## Observations                5,908
## R2                        0.00507
## Adj. R2                   0.00490

Subclassification

Subclassification on the estimated propensity score is also referred to as blocking or stratification. The sample is partitioned into subclasses (also called strata or blocks) based on the values of the estimated propensity score, so that within each subclass the estimated propensity scores are approximately constant.

The causal effect is then estimated within each subclass as if assignment were completely random there. If our sample is divided into \(B\) blocks, Imbens and Rubin (2015) suggest estimating the treatment effect within each block, \(\tau_{b}\), by fitting a linear model for each block \(b=1,2, \ldots, B\):

\[ y_{i b}=\alpha_{b}+\tau_{b} W_{i b}+\gamma_{b}^{\prime} X_{i b}+\varepsilon_{i b} \]

where \(y_{ib}\) is the outcome of observation \(i\) in block \(b\) and \(X_{ib}\) is an optional subset of covariates. Including covariates serves two purposes. First, blocking typically does not eliminate all biases associated with differences in the covariates, so remaining imbalances within each block can be adjusted for. Second, these adjustments can improve the precision of the causal effect estimators. A sketch of such a within-block regression is given below.
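As a sketch of this within-block adjustment, assuming a hypothetical data.table block_b that holds the observations of a single block (the covariate set is also illustrative), the regression would look like:

# Hypothetical: `block_b` holds the observations of a single block b
reg_b <- feols(earnings ~ union + age + age_2 + education + female, data = block_b)
coeftable(reg_b, se = "hetero")["union", c("Estimate", "Std. Error")]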

After estimating the treatment effect within each block, we aggregate the estimates as:

\[ \begin{equation} \hat{\tau}=\sum_{b=1}^{B} \omega_{b} \hat{\tau}_{b} \end{equation} \]

where the weights depend on the estimand of interest. For the ATE, \(\omega_{b}=n_{b} / n\), where \(n_{b}\) is the number of observations in block \(b\) and \(n\) the total sample size. For the average treatment effect on the treated (ATT), \(\omega_{b}=n_{b}^{t} / n^{t}\), where \(n_{b}^{t}\) is the number of treated units in block \(b\) and \(n^{t}\) is the overall number of treated units.
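As a small numerical illustration of this aggregation (all numbers below are made up), with standard errors that treat the block-level estimates as independent:

# Hypothetical block-level estimates, SEs and counts (illustrative numbers only)
tau_b <- c(120, 80, 95)    # block-level effect estimates
se_b  <- c(30, 25, 28)     # their standard errors
n_b   <- c(400, 300, 300)  # observations per block
nt_b  <- c(60, 90, 50)     # treated units per block
w_ate <- n_b/sum(n_b)      # ATE weights
w_att <- nt_b/sum(nt_b)    # ATT weights
c(ATE = sum(w_ate*tau_b), ATT = sum(w_att*tau_b))
c(se_ATE = sqrt(sum(w_ate^2*se_b^2)), se_ATT = sqrt(sum(w_att^2*se_b^2)))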

Creating the blocks

Imbens and Rubin (2015) propose an algorithm to split the range of the PS into \(J\) blocks in order to estimate the causal effect through subclassification. We create intervals of the type \([b_{j-1}, b_j)\) and analyze units whose propensity scores fall in the same interval as if they had identical propensity scores.

We start with a single block, \(J = 1\), with boundaries \(b_0 = 0\) and \(b_J = b_1 = 1\) (if no trimming is selected). The algorithm then iterates over two steps. First, we assess the adequacy of the current blocks by computing, within each block \(j\), a t-statistic for the null hypothesis that the average estimated linearized propensity score is the same for treated and control units:

\[ t_{\ell}(j)=\frac{\bar{\ell}_{t}(j)-\bar{\ell}_{c}(j)}{\sqrt{s_{\ell}^{2}(j) \cdot\left(1 / N_{\mathrm{c}}(j)+1 / N_{\mathrm{t}}(j)\right)}} \]

Second, we check whether the numbers of controls and treated units, \(N_c(j)\) and \(N_t(j)\), and the total number of units, \(N(j)\), in each candidate new block would exceed some minimum. If a block is not adequately balanced, and if splitting it would produce two new blocks each with a sufficient number of units, we split at the median value of the (linearized) propensity score and assess the new blocks for adequacy.

#' Function that subdivides a given propensity score vector in sub-blocks
#' treat = vector with treatment assignments
#' ps = vector with propensity scores
#' K = how many covariates will we want to test/use in bias correction of 
#' estimates later on? 
#' t.max = threshold for tstat in making a further subdivide 
#' trim = should we discard extreme observations so there is overlap?
#' Author: Luis Alvarez
#' Modifications: Rafael F. Bressan
ps_blocks <- function(treat, ps, K, t.max = 1.96,  trim = FALSE)
{
  # Linearized propensity score function
  lps <- function(ps) {
    log(ps/(1 - ps))
  }
  
  # With trimming, restrict to the region of common support: use the lowest
  # treated PS and the highest control PS as the outer block boundaries
  if (trim) {
    b0 = min(ps[treat == 1])
    b1 = max(ps[treat == 0])
  } else {
    b0 = 0
    b1 = 1
  }
  b_vec = c(b0, b1)
  while (TRUE)
  {
    J <- length(b_vec) - 1
    b_vec_new <- do.call(c, lapply(1:J, function(j){
      sample <- (b_vec[j] <= ps) & (ps < b_vec[j + 1])
      
      ps.treat <- lps(ps[sample & treat == 1])
      ps.control <- lps(ps[sample & treat == 0])
      
      # TRUE when the treated/control lps means differ (|t| > t.max), i.e. the
      # block is imbalanced; if the test cannot be computed, do not split
      t.test.pass <- tryCatch(
        {abs(t.test(ps.control, ps.treat)$statistic) > t.max}, 
        error = function(e){return(FALSE)}
        )
      
      med.val = median(c(ps.treat, ps.control))
      
      Nt.below = sum(ps.treat < med.val)
      Nt.above = sum(ps.treat >= med.val)
      Nc.below = sum(ps.control < med.val)
      Nc.above = sum(ps.control >= med.val)
      
      # A split is admissible only if each side of the median keeps enough
      # treated and control units (at least max(3, K + 2))
      s.crit <- min(Nt.below, Nt.above, Nc.below, Nc.above) >= max(3, K + 2)
      
      # If imbalanced and both halves have enough units, split the block at the
      # median of the linearized PS (mapped back to the PS scale with plogis)
      if (t.test.pass & s.crit)
        return(c(b_vec[j], plogis(med.val), b_vec[j + 1])) 
      else return(c(b_vec[j], b_vec[j + 1]))
    })) # end of do.call
    
    b_vec_new = unique(b_vec_new)
    
    if (length(b_vec_new) == length(b_vec))
      break 
    else b_vec = b_vec_new
  } # end of while loop
  
  #Constructing blocking variable now
  block_var = rep(NA, length(treat))
  
  for (j in 1:(length(b_vec) - 1))
    block_var[(b_vec[j] <= ps) & (ps < b_vec[j + 1])] = j
  
  return(block_var)
}
dt[, block := ps_blocks(union, ps, length(c(xb, xt)), 
                        t.max = 2.66,
                        trim = TRUE)]
#' Packing data into blocks
dt_block <- dt[!is.na(block), keyby = block,
               .(data = list(.SD))]
#' Balance tables by blocks
dt_block[, bal_tbl := lapply(data, balance_tbl, treat = "union")]
#' Check tables
lapply(dt_block$bal_tbl, balance_kbl)
[[1]]
Block 1 balance table: Treatment (\(N_t = 103\)) vs. Control (\(N_c = 3051\)), with overlap measures
Variable  Mean (T)  Std.Dev (T)  Mean (C)  Std.Dev (C)  t-stat  Norm. Diff.  Log Ratio  \(\pi_c^{0.05}\)  \(\pi_t^{0.05}\)
age 33.54 14.03 33.11 13.84 0.31 0.03 0.01 0.11 0.02
age_2 1320.17 1203.16 1288.01 1210.38 0.27 0.03 -0.01 0.11 0.02
earnings 832.57 578.21 674.55 514.70 2.74 0.29 0.12 0.05 0.07
education 87.39 21.82 86.70 22.01 0.32 0.03 -0.01 0.06 0.06
female 0.71 0.46 0.68 0.47 0.73 0.07 -0.03 0.32 0.29
marital_status 0.33 0.47 0.33 0.47 -0.05 0.00 0.00 0.67 0.67
medicaid 0.16 0.36 0.16 0.36 -0.05 0.00 0.00 0.84 0.84
own_farm_income_last_year 0.10 0.99 21.35 645.19 -1.82 -0.05 -6.48 1.00 1.00
private_health_insurance 0.49 0.50 0.52 0.50 -0.63 -0.06 0.00 0.48 0.51
ps 0.04 0.01 0.03 0.01 1.52 0.15 -0.06 0.07 0.09
race 0.83 0.37 0.87 0.34 -0.97 -0.10 0.11 0.13 0.17
total_income_last_year 38599.91 31115.69 36857.61 46876.84 0.55 0.04 -0.41 0.06 0.08
veteran 0.05 0.22 0.04 0.19 0.50 0.05 0.13 0.96 0.95
wage_income_last_year 34536.77 29439.08 33395.99 44332.65 0.38 0.03 -0.41 0.07 0.09
worked_last_year 0.94 0.24 0.95 0.22 -0.39 -0.04 0.08 0.05 0.06
[[2]]
Block 2 balance table: Treatment (\(N_t = 207\)) vs. Control (\(N_c = 2949\)), with overlap measures
Variable  Mean (T)  Std.Dev (T)  Mean (C)  Std.Dev (C)  t-stat  Norm. Diff.  Log Ratio  \(\pi_c^{0.05}\)  \(\pi_t^{0.05}\)
age 43.48 14.47 42.89 14.29 0.57 0.04 0.01 0.08 0.04
age_2 2098.75 1335.86 2043.48 1324.24 0.58 0.04 0.01 0.08 0.04
earnings 1038.98 638.25 1036.82 761.29 0.05 0.00 -0.18 0.13 0.01
education 92.05 22.62 93.23 23.76 -0.72 -0.05 -0.05 0.06 0.03
female 0.52 0.50 0.49 0.50 0.76 0.05 0.00 0.51 0.48
marital_status 0.49 0.50 0.49 0.50 -0.04 0.00 0.00 0.51 0.51
medicaid 0.14 0.34 0.11 0.31 1.19 0.09 0.11 0.89 0.86
own_farm_income_last_year 0.00 0.00 13.37 492.22 -1.48 -0.04 -Inf 1.00 1.00
private_health_insurance 0.86 0.35 0.85 0.36 0.36 0.03 -0.02 0.15 0.14
ps 0.07 0.01 0.07 0.01 1.73 0.12 0.00 0.07 0.04
race 0.89 0.31 0.87 0.34 1.01 0.07 -0.08 0.13 0.11
total_income_last_year 57886.11 50401.69 67975.00 92121.69 -2.59 -0.14 -0.60 0.07 0.05
veteran 0.04 0.19 0.04 0.20 -0.07 -0.01 -0.01 0.96 0.96
wage_income_last_year 52616.00 45717.86 61483.78 86758.87 -2.49 -0.13 -0.64 0.08 0.04
worked_last_year 0.98 0.15 0.98 0.15 -0.22 -0.02 0.05 1.00 1.00
[[3]]
Block 3 balance table: Treatment (\(N_t = 44\)) vs. Control (\(N_c = 748\)), with overlap measures
Variable  Mean (T)  Std.Dev (T)  Mean (C)  Std.Dev (C)  t-stat  Norm. Diff.  Log Ratio  \(\pi_c^{0.05}\)  \(\pi_t^{0.05}\)
age 51.61 11.75 49.67 11.10 1.07 0.17 0.06 0.04 0.05
age_2 2798.93 1248.11 2590.01 1022.98 1.09 0.18 0.20 0.04 0.05
earnings 1187.80 688.00 1110.69 762.68 0.72 0.11 -0.10 0.24 0.00
education 104.59 19.79 95.77 21.04 2.86 0.43 -0.06 0.23 0.23
female 0.64 0.49 0.66 0.47 -0.37 -0.06 0.03 0.34 0.36
marital_status 0.77 0.42 0.77 0.42 0.00 0.00 0.01 0.23 0.23
medicaid 0.00 0.00 0.03 0.18 -5.08 -0.26 -Inf 1.00 1.00
own_farm_income_last_year 0.00 0.00 24.06 476.44 -1.38 -0.07 -Inf 1.00 1.00
private_health_insurance 0.95 0.21 0.95 0.23 0.28 0.04 -0.08 0.05 0.05
ps 0.08 0.00 0.08 0.00 1.84 0.30 0.08 0.09 0.09
race 0.89 0.32 0.90 0.30 -0.24 -0.04 0.06 0.10 0.11
total_income_last_year 83222.20 76412.63 71255.32 79359.09 1.01 0.15 -0.04 0.14 0.05
veteran 0.02 0.15 0.02 0.14 0.06 0.01 0.04 1.00 1.00
wage_income_last_year 77834.27 73042.72 65106.75 74359.95 1.12 0.17 -0.02 0.16 0.05
worked_last_year 1.00 0.00 0.99 0.10 2.66 0.14 -Inf 1.00 1.00
[[4]]
Block 4 balance table: Treatment (\(N_t = 76\)) vs. Control (\(N_c = 712\)), with overlap measures
Variable  Mean (T)  Std.Dev (T)  Mean (C)  Std.Dev (C)  t-stat  Norm. Diff.  Log Ratio  \(\pi_c^{0.05}\)  \(\pi_t^{0.05}\)
age 47.21 10.74 46.83 11.27 0.29 0.03 -0.05 0.09 0.03
age_2 2342.74 957.67 2320.22 1004.92 0.19 0.02 -0.05 0.09 0.03
earnings 1393.11 765.81 1362.36 835.06 0.33 0.04 -0.09 0.09 0.00
education 105.78 19.71 105.72 16.10 0.03 0.00 0.20 0.09 0.09
female 0.51 0.50 0.46 0.50 0.93 0.11 0.01 0.54 0.49
marital_status 0.75 0.44 0.74 0.44 0.11 0.01 0.00 0.26 0.25
medicaid 0.07 0.25 0.04 0.19 0.94 0.13 0.27 0.96 0.93
own_farm_income_last_year 0.00 0.00 191.01 2373.68 -2.15 -0.11 -Inf 1.00 1.00
private_health_insurance 0.96 0.20 0.96 0.20 0.05 0.01 -0.01 0.04 0.04
ps 0.08 0.00 0.08 0.00 0.00 0.00 0.01 0.04 0.07
race 0.84 0.37 0.83 0.37 0.24 0.03 -0.02 0.17 0.16
total_income_last_year 80370.89 75102.68 86945.45 75254.55 -0.73 -0.09 0.00 0.05 0.05
veteran 0.04 0.20 0.03 0.18 0.25 0.03 0.08 0.97 0.96
wage_income_last_year 73969.71 62181.04 78832.80 65729.39 -0.64 -0.08 -0.06 0.05 0.07
worked_last_year 0.99 0.11 0.99 0.11 -0.14 -0.02 0.08 1.00 1.00
[[5]]
Block 5 balance table: Treatment (\(N_t = 169\)) vs. Control (\(N_c = 1412\)), with overlap measures
Variable  Mean (T)  Std.Dev (T)  Mean (C)  Std.Dev (C)  t-stat  Norm. Diff.  Log Ratio  \(\pi_c^{0.05}\)  \(\pi_t^{0.05}\)
age 45.38 12.26 46.59 11.78 -1.21 -0.10 0.04 0.07 0.05
age_2 2209.23 1140.80 2308.91 1134.99 -1.07 -0.09 0.01 0.07 0.05
earnings 1258.40 669.70 1220.67 785.32 0.68 0.05 -0.16 0.08 0.01
education 87.91 17.42 90.34 18.73 -1.70 -0.13 -0.07 0.06 0.05
female 0.18 0.38 0.21 0.40 -0.89 -0.07 -0.05 0.79 0.82
marital_status 0.64 0.48 0.65 0.48 -0.15 -0.01 0.01 0.35 0.36
medicaid 0.03 0.17 0.04 0.20 -0.77 -0.06 -0.15 1.00 0.97
own_farm_income_last_year 230.78 2405.35 811.63 29288.49 -0.73 -0.03 -2.50 1.00 1.00
private_health_insurance 0.97 0.17 0.96 0.21 1.06 0.08 -0.19 0.04 0.03
ps 0.10 0.01 0.10 0.01 1.44 0.12 -0.01 0.09 0.04
race 0.62 0.49 0.61 0.49 0.05 0.00 0.00 0.39 0.38
total_income_last_year 72285.69 43214.51 80041.91 105324.69 -1.78 -0.10 -0.89 0.13 0.01
veteran 0.05 0.21 0.08 0.26 -1.56 -0.12 -0.21 0.92 0.95
wage_income_last_year 68841.37 41655.16 72868.47 87632.62 -1.02 -0.06 -0.74 0.13 0.02
worked_last_year 0.99 0.08 0.99 0.09 0.40 0.03 -0.18 1.00 1.00
[[6]]
Block 6 balance table: Treatment (\(N_t = 244\)) vs. Control (\(N_c = 1337\)), with overlap measures
Variable  Mean (T)  Std.Dev (T)  Mean (C)  Std.Dev (C)  t-stat  Norm. Diff.  Log Ratio  \(\pi_c^{0.05}\)  \(\pi_t^{0.05}\)
age 48.00 12.53 48.22 12.78 -0.25 -0.02 -0.02 0.08 0.04
age_2 2460.73 1194.62 2488.61 1234.29 -0.33 -0.02 -0.03 0.08 0.04
earnings 1221.63 636.17 1067.21 702.28 3.43 0.23 -0.10 0.14 0.00
education 84.04 20.57 81.66 22.99 1.63 0.11 -0.11 0.10 0.03
female 0.28 0.45 0.25 0.43 0.90 0.06 0.04 0.75 0.72
marital_status 0.75 0.43 0.75 0.43 0.03 0.00 0.00 0.25 0.25
medicaid 0.03 0.18 0.04 0.19 -0.49 -0.03 -0.08 0.96 0.97
own_farm_income_last_year 0.00 0.00 9.73 282.80 -1.26 -0.05 -Inf 1.00 1.00
private_health_insurance 0.93 0.26 0.92 0.27 0.18 0.01 -0.02 0.08 0.07
ps 0.16 0.04 0.15 0.04 1.58 0.11 0.06 0.07 0.04
race 0.70 0.46 0.72 0.45 -0.60 -0.04 0.02 0.28 0.30
total_income_last_year 71728.98 45025.76 64531.37 71549.65 2.07 0.12 -0.46 0.06 0.04
veteran 0.14 0.34 0.13 0.34 0.15 0.01 0.01 0.87 0.86
wage_income_last_year 64494.05 38379.75 60168.60 69077.85 1.40 0.08 -0.59 0.10 0.03
worked_last_year 1.00 0.06 0.99 0.11 1.42 0.08 -0.50 1.00 1.00
[[7]]
Block 7 balance table: Treatment (\(N_t = 265\)) vs. Control (\(N_c = 526\)), with overlap measures
Variable  Mean (T)  Std.Dev (T)  Mean (C)  Std.Dev (C)  t-stat  Norm. Diff.  Log Ratio  \(\pi_c^{0.05}\)  \(\pi_t^{0.05}\)
age 48.83 13.52 48.21 13.34 0.61 0.05 0.01 0.07 0.06
age_2 2566.89 1332.10 2501.79 1290.21 0.66 0.05 0.03 0.07 0.06
earnings 1095.82 586.35 996.68 632.79 2.19 0.16 -0.08 0.03 0.05
education 97.75 20.43 99.48 21.68 -1.10 -0.08 -0.06 0.21 0.22
female 0.56 0.50 0.58 0.49 -0.42 -0.03 0.01 0.42 0.44
marital_status 0.67 0.47 0.63 0.48 0.97 0.07 -0.02 0.37 0.33
medicaid 0.01 0.11 0.02 0.13 -0.67 -0.05 -0.20 1.00 1.00
own_farm_income_last_year 0.00 0.09 0.00 0.11 -0.54 -0.04 -0.20 1.00 1.00
private_health_insurance 0.95 0.22 0.95 0.22 0.02 0.00 0.00 0.05 0.05
ps 0.33 0.03 0.33 0.03 0.74 0.06 -0.03 0.07 0.05
race 0.75 0.43 0.72 0.45 0.86 0.06 -0.03 0.28 0.25
total_income_last_year 69449.44 102854.22 62829.92 66162.49 0.95 0.08 0.44 0.08 0.03
veteran 0.08 0.26 0.07 0.25 0.46 0.03 0.06 0.93 0.92
wage_income_last_year 64438.74 101556.84 56627.92 64313.39 1.14 0.09 0.46 0.09 0.04
worked_last_year 1.00 0.06 0.99 0.11 1.28 0.09 -0.55 1.00 1.00
[[8]]
Block 8 balance table: Treatment (\(N_t = 368\)) vs. Control (\(N_c = 423\)), with overlap measures
Variable  Mean (T)  Std.Dev (T)  Mean (C)  Std.Dev (C)  t-stat  Norm. Diff.  Log Ratio  \(\pi_c^{0.05}\)  \(\pi_t^{0.05}\)
age 45.88 11.73 46.61 11.94 -0.87 -0.06 -0.02 0.06 0.03
age_2 2242.10 1096.53 2314.99 1119.03 -0.92 -0.07 -0.02 0.06 0.03
earnings 1161.78 578.59 1026.36 638.95 3.13 0.22 -0.10 0.11 0.03
education 103.14 20.14 100.32 20.85 1.93 0.14 -0.03 0.22 0.17
female 0.60 0.49 0.56 0.50 1.20 0.09 -0.01 0.44 0.40
marital_status 0.70 0.46 0.72 0.45 -0.64 -0.05 0.02 0.28 0.30
medicaid 0.04 0.18 0.03 0.17 0.55 0.04 0.11 0.97 0.96
own_farm_income_last_year 10.33 161.74 2.36 48.62 0.91 0.07 1.20 1.00 1.00
private_health_insurance 0.96 0.19 0.98 0.14 -1.37 -0.10 0.28 0.02 1.00
ps 0.47 0.06 0.46 0.05 2.54 0.18 0.09 0.05 0.07
race 0.88 0.33 0.86 0.34 0.50 0.04 -0.04 0.14 0.12
total_income_last_year 63900.30 40403.92 59276.68 40104.78 1.61 0.11 0.01 0.11 0.02
veteran 0.09 0.28 0.08 0.28 0.21 0.02 0.02 0.92 0.91
wage_income_last_year 58489.62 33893.87 53629.72 36825.84 1.93 0.14 -0.08 0.14 0.02
worked_last_year 1.00 0.05 1.00 0.05 -0.10 -0.01 0.07 1.00 1.00
# Block sizes: total observations (N) and treated units (Nt)
dt_block[, `:=`(N = sapply(data, nrow),
                Nt = sapply(data, function(x) sum(x$union)))]
# Aggregation weights: w for the ATE (share of obs.), wt for the ATT (share of treated)
dt_block[, `:=`(w = N/sum(N),
                wt = Nt/sum(Nt))]
# Within-block regressions of earnings on union (heteroskedasticity-robust SEs)
dt_block[, estimates := lapply(data, function(x){
  reg <- feols(earnings~union, data = x)
  est <- coeftable(reg, se = "hetero")["union", c("Estimate", "Std. Error")]
  return(list(est[1], est[2]))
})]
# Unpack the block-level estimates into tau (point estimate) and sigma (std. error)
dt_block[, by = block, c("tau", "sigma") := estimates[[1]]]
# Aggregate across blocks; the SEs treat the block-level estimates as independent
bl_res <- dt_block[, .(
  estimand = c("ATE", "ATT"),
  Estimate = c(sum(tau*w), sum(tau*wt)),
  `Std. Error` = c(sqrt(sum(w^2 * sigma^2)), sqrt(sum(wt^2 * sigma^2)))
  )]
bl_res
##    estimand Estimate Std. Error
## 1:      ATE 85.47282   22.66528
## 2:      ATT 96.62407   19.23292

Or, we can use the MatchIt package.

mS <- matchit(as.formula(fml), data = dt,
              method = "subclass", 
              distance = dt$ps,
              estimand = "ATT",
              subclass = 8,
              discard = "both")

dtc <- match.data(mS)
sub_reg <- feols(earnings~union, data = dtc, weights = ~weights)
etable(sub_reg, se = "hetero", keep = "union")
##                          sub_reg
## Dependent Var.:         earnings
##                                 
## union           97.99*** (19.63)
## _______________ ________________
## S.E. type       Heterosked.-rob.
## Observations              12,635
## R2                       0.00201
## Adj. R2                  0.00193
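To check how well the subclassification balances the covariates, one can again rely on MatchIt's summary method and its associated Love plot; a minimal sketch:

# Balance after subclassification: standardized mean differences and a Love plot
summary(mS)
plot(summary(mS))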

Doubly-Robust

The basic Horvitz-Thompson (IPW) estimator exploits the following two equalities:

\[\begin{equation} \mathbb{E}\left[\frac{Y_{i}\cdot W_{i} }{e\left(X_{i}\right)}\right]=\mathbb{E}\left[Y_{i}(1)\right] \quad \text { and } \quad \mathbb{E}\left[\frac{Y_{i} \cdot\left(1-W_{i}\right) }{1-e\left(X_{i}\right)}\right]=\mathbb{E}\left[Y_{i}(0)\right] \end{equation}\]

For ATE we have the following weighting scheme:

\[\omega_{i}^{ate}=\frac{1}{e\left(X_{i}\right)^{W_{i}} \cdot\left(1-e\left(X_{i}\right)\right)^{1-W_{i}}}=\left\{\begin{array}{ll} 1 /\left(1-e\left(X_{i}\right)\right) & \text { if } W_{i}=0 \\ 1 / e\left(X_{i}\right) & \text { if } W_{i}=1 \end{array}\right.\]

and for ATT the weights are slightly changed,

\[\omega_{i}^{att}=\frac{1}{P[W_i=1]}\frac{e(X_i)^{1-W_i}}{\left(1-e\left(X_{i}\right)\right)^{1-W_{i}}}=\left\{\begin{array}{ll} P[W_i=1]^{-1}\cdot e(X_i) /\left(1-e\left(X_{i}\right)\right) & \text { if } W_{i}=0 \\ P[W_i=1]^{-1} & \text { if } W_{i}=1 \end{array}\right.\]

Hence, our estimands are given by:

\[\begin{align} \tau^{ATE}=\mathbb{E}\left[Y_{i}(1)-Y_{i}(0)\right]&=\mathbb{E}\left[\frac{Y_{i} \cdot W_{i}}{e(X_i)}-\frac{Y_{i} \cdot\left(1-W_{i}\right)}{1-e(X_i)}\right] \\ \tau^{ATT}=\mathbb{E}\left[Y_{i}(1)-Y_{i}(0) \mid W_{i}=1\right]&=\mathbb{E}\left[\frac{Y_{i} \cdot W_{i}}{\mathbb{P}\left[W_{i}=1\right]}-\frac{Y_{i} \cdot\left(1-W_{i}\right) \cdot e(X_i)}{\mathbb{P}\left[W_{i}=1\right]\left(1-e(X_i)\right)}\right] \end{align}\]

and the estimator will be the sample counterpart, considering the estimated propensity score, \(\hat e(X_i)\). It is useful to write this estimator as a weighted regression estimator. Take the model for \(\tau^{ATE}\), for example. The regression specification and the weights are, respectively:

\[ \begin{align} Y_i&=\alpha+\tau\cdot W_i+\varepsilon_i, \\ \omega_i &= \frac{W_i}{\hat e(X_i)}+\frac{1-W_i}{1-\hat e(X_i)} \end{align} \] and it is easy to see that \(\mathbb{E}[\hat\tau] = \tau^{ATE}\). This estimator can easily be modified to incorporate covariates: one estimates a regression function that includes additional covariates the researcher believes help explain the outcome, using the same weights as before.

\[ \begin{align} Y_i&=\alpha+\tau\cdot W_i+X_i\beta+\varepsilon_i \\ \end{align} \]

This weighted regression estimator is consistent for the true ATE as long as either the propensity score model or the regression function is correctly specified, a property known as double robustness.
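For comparison with the weighted regressions below, the plain Horvitz-Thompson plug-in versions of the two estimands can be computed directly as sample means; a minimal sketch using the columns already in dt:

# Plug-in (Horvitz-Thompson) versions of the ATE and ATT, computed directly
phat <- mean(dt$union)  # P[W = 1]
tau_ate <- with(dt, mean(earnings*union/ps - earnings*(1 - union)/(1 - ps)))
tau_att <- with(dt, mean(earnings*union/phat -
                           earnings*(1 - union)*ps/(phat*(1 - ps))))
c(ATE = tau_ate, ATT = tau_att)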

# Overall probability of treatment
p_treat <- mean(dt$union)
# Creating the weights
dt[, `:=`(w = union/ps + (1 - union)/(1 - ps),
          wt = (1/p_treat)*(ps/(1 - ps))^(1 - union))]
# ATE regression
reg <- feols(earnings~union+age+age_2+education+female, 
              data = dt, weights = ~w)
# ATT regression
reg2 <- feols(earnings~union+age+age_2+education+female, 
              data = dt, weights = ~wt)
etable(list(reg, reg2), se = "hetero", keep = "union")
##                          model 1          model 2
## Dependent Var.:         earnings         earnings
##                                                  
## union           85.74*** (19.69) 93.31*** (17.92)
## _______________ ________________ ________________
## S.E. type       Heterosked.-rob. Heterosked.-rob.
## Observations              12,732           12,732
## R2                       0.29100          0.22192
## Adj. R2                  0.29072          0.22161

References

Caliendo, Marco, and Sabine Kopeinig. 2008. “Some Practical Guidance for the Implementation of Propensity Score Matching.” Journal of Economic Surveys 22 (1): 31–72. https://doi.org/10.1111/j.1467-6419.2007.00527.x.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Zhao, Qin-Yu, Jing-Chao Luo, Ying Su, Yi-Jie Zhang, Guo-Wei Tu, and Zhe Luo. 2021. “Propensity Score Matching with R: Conventional Methods and New Features.” Annals of Translational Medicine 9 (9). https://atm.amegroups.com/article/view/61857.

  1. A very nice article about the usage of the MatchIt package and other matching packages in R is Zhao et al. (2021).↩︎