1 Introduction to estimating treatment effects from observational studies
A common aim in social science is to estimate the causal effect of a treatment or exposure or condition on an outcome of interest.
Effect of a medical treatment on a disease outcome (e.g. survival, time to recovery)
Effect of race on a criminal justice outcome (e.g. arrest, sentence)
Effect of an educational program on reading or math scores
Ideally, we would conduct a randomized controlled trial (RCT) in which we randomize a sample of individuals to either receive the treatment condition or to a control condition, which could be nothing, business as usual, or some placebo treatment. This is the gold standard approach in science. However, we have numerous questions that are difficult to assess with an RCT for reasons such as ethical considerations, cost, and rarity of outcomes. Particularly in the last several decades, we have access to large administrative datasets with numerous features on individuals and numerous outcomes. Even if we could technically conduct an RCT, it is worth exploring these datasets to assess which questions might have interesting findings.
“Observational studies” aim to draw causal conclusions from data in which the treatment or exposure was not under the control of the researcher. The most common form of “observational data” is data that were not collected for the purpose of analysis, but for some basic record-keeping purpose, such as health insurance claims, arrest records, mortgage applications, or school attendance and grades.
This section will focus on one particular method of causal estimation in which machine learning plays a starring role: propensity score estimation.
1.1 Example: Florida murderers
Radelet (1981) collected data on convicted murderers in Florida in 1976-1977, including their race, their victim’s race, and whether or not they received the death penalty. Our primary interest here is to assess whether there is a racial disparity in the imposition of the death penalty.
# A tibble: 7 × 4
# Groups: murderer, victim, death [7]
murderer victim death n
<chr> <chr> <dbl> <int>
1 B B 0 97
2 B B 1 6
3 B W 0 52
4 B W 1 11
5 W B 0 9
6 W W 0 132
7 W W 1 19
The simplest comparison would be to contrast the fraction of black murderers receiving the death penalty with the fraction of white murderers receiving the death penalty.
# A tibble: 2 × 2
murderer deathPenalty
<chr> <dbl>
1 B 0.102
2 W 0.119
So, 10.2% of black murderers received the death penalty compared with 11.9% of white murderers. It appears that white murderers are actually 1.6 percentage points more likely to receive the death penalty than black murderers.
By far the most common way social scientists have traditionally explored such questions is to use a regression model. With regression we can formally test whether the outcome differs by “treatment,” which in this case is the race of the murderer.
# compare death sentence for B vs. W murderers
lm1 <- lm(death ~ murderer, data=dFLmurder)
summary(lm1)
Call:
lm(formula = death ~ murderer, data = dFLmurder)
Residuals:
Min 1Q Median 3Q Max
-0.1187 -0.1187 -0.1024 -0.1024 0.8976
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10241 0.02439 4.198 3.48e-05 ***
murdererW 0.01634 0.03482 0.469 0.639
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3143 on 324 degrees of freedom
Multiple R-squared: 0.0006793, Adjusted R-squared: -0.002405
F-statistic: 0.2202 on 1 and 324 DF, p-value: 0.6392
At first glance, the race of the murderer does not seem to matter. However, there is another variable in the dataset, the race of the victim. Let’s explore it.
# do B & W murderers differ by the victim's race?
dFLmurder |>
  group_by(murderer) |>
  summarize(raceVictim = mean(victim=="B"))
# A tibble: 2 × 2
murderer raceVictim
<chr> <dbl>
1 B 0.620
2 W 0.0562
It does seem that there are large differences in the race of the victims. The majority (62%) of black murderers had black victims. White murderers rarely had black victims (5.6%). Maybe this should not matter since we are really only interested in the race of the murderer, but let’s try “controlling” for the race of the victim. I put “controlling” in quotes because social scientists often say “I controlled for the race of the victim,” but what they really mean is that they put +victim race in a regression model, which is not the same and, as we will see, does not offer complete adjustment for the race of the victim.
# include race of victim in regression
lm3 <- lm(death ~ murderer + victim, data=dFLmurder)
summary(lm3)
Call:
lm(formula = death ~ murderer + victim, data = dFLmurder)
Residuals:
Min 1Q Median 3Q Max
-0.17565 -0.12539 -0.12539 -0.05761 0.94239
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05761 0.02963 1.944 0.05276 .
murdererW -0.05026 0.04290 -1.172 0.24217
victimW 0.11804 0.04516 2.614 0.00937 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3115 on 323 degrees of freedom
Multiple R-squared: 0.02138, Adjusted R-squared: 0.01532
F-statistic: 3.529 on 2 and 323 DF, p-value: 0.03047
Now that we have included race of the victim, the sign on the race of the murderer has changed from positive to negative. Suddenly it seems that black murderers are more likely to get the death penalty by 5 percentage points. Why would including the race of the victim suddenly change our understanding of the relationship between the race of the murderer and receiving the death penalty?
# A tibble: 2 × 2
murderer deathPenalty
<chr> <dbl>
1 B 0.102
2 W 0.119
Here we observe the 1.6 percentage point difference in the rate of death penalty imposition with white murderers being more likely to receive the death penalty.
# A tibble: 4 × 3
# Groups: murderer [2]
murderer victim deathPenalty
<chr> <chr> <dbl>
1 B B 0.0583
2 B W 0.175
3 W B 0
4 W W 0.126
Once we break it down by race of the victim, we see that black murderers are more likely to receive the death penalty when their victim is black (by 5.8 percentage points) and when their victim is white (by 4.9 percentage points). Our conclusions have completely changed after we introduced the race of the victim.
Simpson’s Paradox
In some datasets, after incorporating a third variable into the analysis, the conclusions can completely reverse. This is sometimes also called Yule’s Paradox or the Yule-Simpson effect.
Why does this happen? In this example, the real driver of the death penalty seems to be the death of a white victim. When the victim is white, the murderer is much more likely to receive the death penalty. And who is most likely to have white victims?… white murderers. As a result, when we just compare black and white murderers it seems that white murderers are more likely to get the death penalty (which they are), but the reason we observe this is because they are the most likely to have a white victim.
Confounder
A confounder is any variable that is associated with both the outcome of interest and the treatment. We are typically uninterested in the confounder’s association with the outcome, but instead just want to account for it in some way.
In the Florida murderers data, the race of the victim is associated with both the race of the murderer and the likelihood of receiving the death penalty. Our research question here focuses on the race of the murderer. We need to account for the race of the victim because we would like to think of two murderers of different races but with similar circumstances and determine who is more likely to receive the death penalty.
Race of the victim is only one confounder. Other mitigating and aggravating factors not recorded in these data could also explain the imposition of the death penalty, such as age of the victim, relationship to the victim, and intentionality of the homicide. Our thoughts about the racial differences in the death penalty could be reversed yet again with another confounder.
1.2 Example: Simulated data (\(\mathbf{x}\) correlated with \(t\))
Now that we know confounders are important, let’s see whether a tool like regression is effective at accounting for potential confounders. Here I simulate some data where \[
y_i = 0 + 0\times t_i + x_{1i} +x_{2i}+4x_{1i}x_{2i} + N(0,1)
\] where \(t_i\) is a 0/1 treatment assignment indicator and \(x_1\) and \(x_2\) are 0/1 confounders. I have simulated \(x_1\) and \(x_2\) so that their distributions differ by treatment group; as we will see shortly, both are more likely to be 1 when \(t=0\). Importantly, the outcome does not depend on \(t\) since the coefficient in front of \(t\) is 0, meaning that in this simulated example there is no treatment effect. Lastly, note that there is a rather large interaction effect. When \(x_1\) and \(x_2\) are both equal to 1, then the expected value of \(y\) increases by 4.
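The exact simulation code is not essential, but a minimal sketch along these lines reproduces the setup. The specific probabilities used for \(x_1\) and \(x_2\) are assumptions, chosen only so that both are more common in the control group than in the treated group.
# a minimal sketch of the simulation; the sample size of 200 matches the
# regression output below, but the x1/x2 probabilities are assumptions
set.seed(1)
n     <- 200
treat <- rep(0:1, each = n/2)
x1    <- rbinom(n, 1, ifelse(treat == 1, 0.3, 0.9))
x2    <- rbinom(n, 1, ifelse(treat == 1, 0.4, 0.9))
y     <- 0*treat + x1 + x2 + 4*x1*x2 + rnorm(n)
dataCor <- data.frame(y, treat, x1, x2)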
The traditional social science approach is the regression model with additive terms for the treatment assignment, \(x_1\), and \(x_2\), expecting the \(+x_1+x_2\) part to “control for” the confounders.
summary(lm(y~treat+x1+x2, data=dataCor))
Call:
lm(formula = y ~ treat + x1 + x2, data = dataCor)
Residuals:
Min 1Q Median 3Q Max
-3.5435 -0.8583 0.1989 0.9039 2.9764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08564 0.34940 0.245 0.80663
treat -0.79105 0.29322 -2.698 0.00759 **
x1 2.65377 0.25975 10.217 < 2e-16 ***
x2 2.95462 0.24213 12.202 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.39 on 196 degrees of freedom
Multiple R-squared: 0.7763, Adjusted R-squared: 0.7729
F-statistic: 226.8 on 3 and 196 DF, p-value: < 2.2e-16
Yikes! This regression model found a significant treatment effect! But we simulated these data precisely so that there would be no treatment effect. Revisit how we simulated these data. \(treat\) is not involved in determining the value of \(y\). \(y\) only depends on \(x_1\) and \(x_2\), yet the regression model is telling us that \(treat\) has a statistically significant negative effect on \(y\).
In this case, the failure to include an important \(x_1x_2\) interaction term caused us to make an incorrect inference about the effect of the treatment on the outcome. If we were clever enough to realize that there was an important interaction term, then we could simply add it in and correctly conclude that there was no treatment effect.
summary(lm(y~treat+x1+x2+x1:x2, data=dataCor))
Call:
lm(formula = y ~ treat + x1 + x2 + x1:x2, data = dataCor)
Residuals:
Min 1Q Median 3Q Max
-2.88473 -0.60375 -0.01706 0.63581 2.33238
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03804 0.24511 0.155 0.87681
treat 0.16258 0.21628 0.752 0.45313
x1 0.75209 0.22579 3.331 0.00104 **
x2 0.90262 0.22261 4.055 7.25e-05 ***
x1:x2 4.46055 0.31281 14.260 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9751 on 195 degrees of freedom
Multiple R-squared: 0.8905, Adjusted R-squared: 0.8883
F-statistic: 396.5 on 4 and 195 DF, p-value: < 2.2e-16
This is the problem. In order to be sure to fully account for confounders, linear regression requires that we get the model structure right. This is very hard. There are a lot of methods for regression diagnostics, but they are complicated and become more complicated as the number of features increases. In datasets with dozens or hundreds of features, it becomes challenging to consider fully all of the potential interaction effects and non-linear effects. Neglecting an important one can drastically change your conclusions.
Before we move on to a different simulation, note that this dataset had a strong correlation between \(\mathbf{x}\) and \(t\) and, at the same time, \(t\) was strongly associated with \(y\).
cor(dataCor$x1, dataCor$treat)
[1] -0.6393057
cor(dataCor$x2, dataCor$treat)
[1] -0.5673086
aggregate(y~treat, data=dataCor, mean)
treat y
1 0 5.015015
2 1 0.897501
1.3 Example: Simulated data (\(\mathbf{x}\) uncorrelated with \(t\))
Here we will explore a similar simulation. However, this time \(\mathbf{x}\) will be uncorrelated with \(t\). I added some correlation between \(x_1\) and \(x_2\), but both are completely unassociated with \(t\). As before, \(y\) has no association with \(t\), but a particularly strong association with the \(x_1x_2\) interaction.
This simulation is like what happens in a randomized trial. In randomized trials, the treatment assignment is unrelated to any of the \(x_j\) by the researcher’s design.
If we use the simplest regression approach to check whether there is a treatment effect, we correctly conclude that there is no treatment effect. When \(t\) was correlated with \(\mathbf{x}\) in our previous simulation we found the opposite.
summary(lm(y~treat, data=dataInd))
Call:
lm(formula = y ~ treat, data = dataInd)
Residuals:
Min 1Q Median 3Q Max
-3.6364 -1.5653 -0.7089 0.5261 6.1881
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9050 0.2295 8.301 1.59e-14 ***
treat 0.3425 0.3245 1.055 0.293
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.295 on 198 degrees of freedom
Multiple R-squared: 0.005593, Adjusted R-squared: 0.0005707
F-statistic: 1.114 on 1 and 198 DF, p-value: 0.2926
Let’s go ahead and adjust for \(\mathbf{x}\) and see if it matters.
summary(lm(y~treat+x1+x2, data=dataInd))
Call:
lm(formula = y ~ treat + x1 + x2, data = dataInd)
Residuals:
Min 1Q Median 3Q Max
-3.3634 -0.8332 -0.0822 0.6639 4.3394
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.35221 0.23566 -9.981 <2e-16 ***
treat -0.07211 0.18023 -0.400 0.69
x1 3.98421 0.21135 18.851 <2e-16 ***
x2 3.80216 0.20841 18.243 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.262 on 196 degrees of freedom
Multiple R-squared: 0.7024, Adjusted R-squared: 0.6978
F-statistic: 154.2 on 3 and 196 DF, p-value: < 2.2e-16
Still a null treatment effect, one that is even closer to 0 and with a smaller standard error.
Let’s add in the important missing interaction term.
summary(lm(y~treat+x1*x2, data=dataInd))
Call:
lm(formula = y ~ treat + x1 * x2, data = dataInd)
Residuals:
Min 1Q Median 3Q Max
-2.92707 -0.62959 -0.04675 0.69941 2.27152
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.16154 0.28336 0.570 0.569275
treat 0.05738 0.13958 0.411 0.681445
x1 1.03411 0.30269 3.416 0.000772 ***
x2 0.75288 0.30876 2.438 0.015648 *
x1:x2 4.15820 0.35936 11.571 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9741 on 195 degrees of freedom
Multiple R-squared: 0.8235, Adjusted R-squared: 0.8199
F-statistic: 227.5 on 4 and 195 DF, p-value: < 2.2e-16
We get a treatment effect estimate that is even smaller and a standard error that shows even more precision.
Note
When the treatment assignment (\(t\)) is uncorrelated with the other features in the dataset (\(\mathbf{x}\)), then the inclusion or exclusion of \(\mathbf{x}\) has little effect on the overall conclusion. Including \(+x_1+x_2+\ldots\) and correctly guessing their functional form can increase precision.
What if we could first eliminate the correlation between \(t\) and \(\mathbf{x}\) before fitting the regression model? In that case we would not have to worry so much if we include or fail to include certain features or transform them in a particular way.
2 Weights to uncorrelate \(t\) and \(\mathbf{x}\)
The distribution of \(x_1\) is completely different among the treated cases compared to the control cases. Among those where \(t=0\) we need a lot more \(x_1=0\) and a lot fewer \(x_1=1\).
a <- dataCor |> count(x1, treat)
a
x1 treat n
1 0 0 10
2 0 1 73
3 1 0 90
4 1 1 27
If each of the 10 control cases with \(x_1=0\) were weighted by 73/10 and each of the 90 control cases with \(x_1=1\) were weighted by 27/90, then the treatment and control groups would resemble each other. The 10 control cases with weights 73/10 would have total weight 73. The 90 control cases with weights 27/90 would have total weight 27.
Let’s add a new weight column to our data to reflect this idea and see what happens to the table and to the correlation between \(x_1\) and \(t\).
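A sketch of that step, using the counts in the table above (the choice of xtabs() and cov.wt() for checking the result is mine):
# treated cases get weight 1; control cases get the treated/control count
# ratio for their value of x1
dataCor$w <- 1
dataCor$w[dataCor$treat==0 & dataCor$x1==0] <- 73/10
dataCor$w[dataCor$treat==0 & dataCor$x1==1] <- 27/90

# weighted counts by x1 and treatment, and the weighted correlation matrix
xtabs(w ~ x1 + treat, data=dataCor)
cov.wt(dataCor[,c("treat","x1","x2")], wt=dataCor$w, cor=TRUE)$cor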
Now \(t\) is completely uncorrelated with \(x_1\). The correlation matrix still shows very large correlation between \(t\) and \(x_2\). Unfortunately, we are only part of the way there since we need to uncorrelate \(x_1\) and \(x_2\) at the same time.
Now it seems that we need 4 different weights for control cases to deal with all the \((x_1,x_2)\) combinations. You can probably see that this is going to get exponentially more complex as we add more features, possibly discrete features with many categorical levels (like state or country) or continuous features (like age, SES, test scores).
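As a first step, a count of every \((x_1, x_2, t)\) combination gives everything needed to work out these weights by hand:
# counts of each (x1, x2, treat) cell; the control-case weight for a cell is
# its treated count divided by its control count
dataCor |> count(x1, x2, treat)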
Let’s rework the weight for control cases where \(\mathbf{x}'=[0\; 0]\). From the table above we can see that this weight should be 48/3.
An equivalent way of thinking about the weight is the odds that an observation with features \(\mathbf{x}\) is a member of the treatment group. Instead of building a table of all the combinations of \(\mathbf{x}\) and \(t\), we can compute the odds of treatment assignment. Logistic regression is a perfect tool for computing the odds.
# fit a fully interacted logistic regression model
logit1 <- glm(treat ~ x1*x2, data=dataCor, family=binomial)
# generate a table of all combinations of x1 and x2
wCheck <- expand.grid(x1=0:1, x2=0:1) |> arrange(x1, x2)
# predict by default is on log odds scale, exp() to get odds scale
wCheck$w <- predict(logit1, newdata=wCheck) |> exp() |> zapsmall()
wCheck
With very little work we have produced all the weights we need. Check that the weights are the same as the weights you would have computed directly from the earlier table a. Let’s insert the appropriate weights into dataCor and check the correlations now.
# treatment cases all get weight 1
dataCor$w <- 1
# control cases get weights p/(1-p)
dataCor$w[dataCor$treat==0] <- predict(logit1, newdata=subset(dataCor, treat==0)) |> exp()
# check that the total weights are equal now
dataCor |> group_by(x1, x2, treat) |> summarize(sum(w))
Now we have a weighted dataset in which the weighted comparison cases collectively resemble the treatment cases. With this weighted dataset, we do not need to be concerned about \(x_1\) or \(x_2\) being confounders. They cannot be confounders because they are not associated with the treatment assignment.
What happens if we neglect to include \(x_1\) and \(x_2\) in our analysis?
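The weighted regressions below use the survey package. The design object just tells R which column holds the weights; the svydesign() call is echoed in the output that follows.
library(survey)

# weighted analyses use the weights we just constructed
sdesign <- svydesign(ids = ~1, weights = ~w, data = dataCor)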
summary(svyglm(y~treat, design=sdesign))
Call:
svyglm(formula = y ~ treat, design = sdesign)
Survey design:
svydesign(ids = ~1, weights = ~w, data = dataCor)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.82561 0.22964 3.595 0.000409 ***
treat 0.07189 0.28742 0.250 0.802741
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 2.576261)
Number of Fisher Scoring iterations: 2
We correctly conclude that there is no treatment effect. Even though we simulated data in which \(x_1\) and \(x_2\) had strong effects on \(y\), neglecting their inclusion in our analysis did not really affect our findings. Including \(x_1\) and \(x_2\) may still be worthwhile to gain better precision.
summary(svyglm(y~treat+x1+x2, design=sdesign))
Call:
svyglm(formula = y ~ treat + x1 + x2, design = sdesign)
Survey design:
svydesign(ids = ~1, weights = ~w, data = dataCor)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.14895 0.25873 -0.576 0.565
treat 0.07189 0.24829 0.290 0.772
x1 1.76389 0.28375 6.216 3.00e-09 ***
x2 1.66102 0.31198 5.324 2.76e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1.560209)
Number of Fisher Scoring iterations: 2
Note that the treatment effect estimate is identical. Including \(x_1\) and \(x_2\) did nothing to change our treatment effect estimate. However, we did decrease the standard error of the treatment effect estimate. And if we consider all main effects and the interaction effect?
summary(svyglm(y~treat+x1*x2, design=sdesign))
Call:
svyglm(formula = y ~ treat + x1 * x2, design = sdesign)
Survey design:
svydesign(ids = ~1, weights = ~w, data = dataCor)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.17465 0.18067 0.967 0.33489
treat 0.07189 0.18848 0.381 0.70329
x1 0.73425 0.22057 3.329 0.00104 **
x2 0.71610 0.26768 2.675 0.00810 **
x1:x2 4.75751 0.40928 11.624 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.8175212)
Number of Fisher Scoring iterations: 2
Still, the treatment effect estimate is unchanged, but the standard error is even smaller.
Note
Analytical strategies that uncorrelate the treatment assignment (\(t\)) from potential confounders (\(\mathbf{x}\)) are robust to the inclusion or exclusion of \(\mathbf{x}\) or any of their transformations.
A powerful method for uncorrelating the treatment assignment is to build observation weights based on \(P(t=1|\mathbf{x})\), which is known as the propensity score.
Propensity score
The propensity score is the probability of assignment to treatment conditional on the observation features, \(P(t=1|\mathbf{x})\)
3 Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATT)
In the previous simulated example, we reweighted the control cases so that the distribution of their features aligned with the feature distribution of the treated cases. We accomplished this by weighting cases as \[
w_i = t_i + (1-t_i)\frac{P(t=1|\mathbf{x}_i)}{1-P(t=1|\mathbf{x}_i)}
\]
This assigns weight 1 to treatment cases and weights equal to the odds of treatment assignment to the comparison cases.
Because these weights make the control cases “look like” the treatment cases, the overall distribution of \(x_1\) and \(x_2\) match the distribution among the treatment cases. As a result, our treatment effect estimate describes what the treatment effect would be for the kinds of individuals that receive the treatment. If there is a particular value of \(\mathbf{x}\) that made it impossible for someone to be in the treatment group, then \(P(t=1|\mathbf{x})=0\) and any control case with that feature would receive weight 0 and be eliminated from the analysis.
Since these weights make the control cases resemble the treatment cases, our treatment effect estimate is called the average treatment effect on the treated (ATT). The ATT compares what actually happened to the treatment group to what would have happened had they not been treated. In the majority of social science inquiries, this is the primary quantity of interest.
When examining racial disparities in the justice system, we ask what would happen to members of a racial minority (“treated”) if their cases were handled the same way as the cases of white defendants. We reweight cases involving white defendants to collectively resemble cases involving minority defendants. That’s an ATT question.
When evaluating a reading program designed for schools in low income neighborhoods, we reweight students without access to the reading program to resemble the students in the schools in the low income neighborhoods with the reading program. …ATT.
When testing an urban health initiative, we would reweight residents in neighborhoods without the initiative to collectively look like the residents in the neighborhoods with the urban health initiative. Again… ATT.
In each of these questions we are not interested in evaluating the treatment on the kinds of individuals or the kinds of places for which the treatment is not relevant. We focus our assessment on the kinds of cases, students, and places that concern us most for our investigation.
In some cases, we may be interested in the average treatment effect on the population (ATE). The ATE estimates the difference between what would happen if the entire population of interest was treated versus the entire population was left untreated. Several public health questions fall into this category, like vaccinations. The weights for estimating the ATE are \[
w_i = t_i\frac{1}{P(t=1|\mathbf{x}_i)} + (1-t_i)\frac{1}{1-P(t=1|\mathbf{x}_i)}
\] Treated cases receive a weight of \(\frac{1}{p}\) while control cases receive a weight of \(\frac{1}{1-p}\). These weights reweight observations so that the distribution of \(\mathbf{x}\) is the same for the treatment and control groups (just as with the ATT weights), but now both groups also match the distribution of \(\mathbf{x}\) in the population (rather than just the treatment group).
4 Neyman-Rubin-Holland causal model
Often simply called the “Rubin causal model,” the foundational idea in its simplest form is that every unit of analysis has two values
\(Y_i(1)\), outcome of a case if assigned to treatment
\(Y_i(0)\), outcome of a case if assigned to control
We envision that every observation in our dataset has two potential outcomes, one that we would observe if they were assigned to the treatment and one that we would observe if they were assigned to the control. Then, ideally, we could compute the difference for each of them and average. This is the perfect way to estimate a causal effect.
Structure of the Rubin causal model

| Observation | Covariates | Treatment | Control | Unit causal effect |
|-------------|------------|-----------|---------|--------------------|
| 1 | \(\mathbf{x}_1\) | \(Y_1(1)\) | \(Y_1(0)\) | \(Y_1(1)-Y_1(0)\) |
| 2 | \(\mathbf{x}_2\) | \(Y_2(1)\) | \(Y_2(0)\) | \(Y_2(1)-Y_2(0)\) |
| 3 | \(\mathbf{x}_3\) | \(Y_3(1)\) | \(Y_3(0)\) | \(Y_3(1)-Y_3(0)\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| n | \(\mathbf{x}_n\) | \(Y_n(1)\) | \(Y_n(0)\) | \(Y_n(1)-Y_n(0)\) |
|   |   |   | \(ATE=\) | \(\frac{1}{n} \sum Y_i(1)-Y_i(0)\) |
This idealized model, however, is faced with the fundamental problem of causal inference (Rubin 1974).
Fundamental problem of causal inference
For any observation we only get to observe \(Y_i(1)\) or \(Y_i(0)\), but never both.
In spite of this challenge, under certain assumptions we can estimate some key quantities of interest.
\[
\begin{split}
ATT &= \mathbb{E}(Y(1)-Y(0)|t=1) \\
&= \mathbb{E}(Y(1)|t=1) - \mathbb{E}(Y(0)|t=1) \\
ATE &= \mathbb{E}(Y(1) - Y(0)) \\
\end{split}
\]
Remember that for each observation we only observe one of the potential outcomes. So, we can easily estimate two quantities: \(\mathbb{E}(Y(1)|t=1)\) and \(\mathbb{E}(Y(0)|t=0)\), the average outcomes among those actually treated and those actually untreated.
In general \[
\mathbb{E}(Y(0)|t=0) \neq \mathbb{E}(Y(0)|t=1)
\]
But if cases are randomized to treatment then \[
\mathbb{E}(Y(0)|t=0) = \mathbb{E}(Y(0)|t=1)
\]
The assumption of strong ignorability is essential for causal inference from observational data.
Strong ignorability
Treatment assignment is strongly ignorable if the potential outcomes are independent of treatment assignment, conditional on \(\mathbf{x}\). \[
(Y(1),Y(0)) \perp\!\!\!\!\perp t\, |\, \mathbf{x}
\] or \[
P(t=1|\mathbf{x},Y(1),Y(0)) = P(t=1|\mathbf{x})
\]
Strong ignorability holds if
\(\mathbf{x}\) contains all of the confounders, or
treatment assignment depends on \(\mathbf{x}\) alone, or
treatment is randomly assigned (special case of the previous)
In addition to strong ignorability, we have to correctly specify the form of the relationships in our models. As we saw previously, regression analysis is not kind to researchers when they neglect to include important terms.
5 Estimation of ATT with propensity scores
With a data set with \((y_i, t_i, \mathbf{x}_i)\) for \(i=1,\ldots,n\), we want to estimate \[
ATT = \mathbb{E}(Y(1)|t=1) - \mathbb{E}(Y(0)|t=1)
\] The first term is actually easy to estimate. It asks us “among all the individuals who received the treatment, what would their average outcome be if they received the treatment?” That is exactly what we observed. \[
\mathbb{\hat E}(Y(1)|t=1) = \frac{\sum t_iy_i}{\sum t_i}
\]
The harder one is the second term. It asks “for all the individuals who received treatment, what would their average outcome have been if they had not received the treatment?” This is a counterfactual expected value. We have no direct observations of this. But let’s work through some mathematics to see how we might estimate this.
Three items to recall from our probability discussion at the beginning of the course.
Expected value of a random variable
If \(Y\) is a random variable with probability distribution \(p(y)\), then the expected value of \(Y\) is \[
\mathbb{E}(Y) = \int_{-\infty}^\infty y\; p(y)\, dy
\]
Law of large numbers
If \(y_1,\ldots,y_n\) is a random sample from \(p(y)\), then as \(n\rightarrow\infty\)\[
\frac{1}{n} \sum_{i=1}^n g(y_i) \rightarrow \mathbb{E}(g(Y))
\]
Bayes Theorem
\[
p(y|t) = \frac{p(t|y)p(y)}{p(t)}
\]
We will use all of these properties to work through how to estimate \(\mathbb{E}(Y(0)|t=1)\). As you go through this line-by-line, ask yourself why we are able to move from one line to the next. Sometimes, it is simply algebra. Other steps involve the definition of expected value, Bayes Theorem, the assumption of strong ignorability, and the law of large numbers.
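Here is a sketch of that derivation. The first line is the definition of the expected value, the second uses Bayes Theorem, the third uses strong ignorability, the fourth uses both, the fifth applies the law of large numbers to the control sample, and the last uses the identity detailed just below. \[
\begin{split}
\mathbb{E}(Y(0)|t=1) &= \iint y_0\, p(y_0,\mathbf{x}|t=1)\; d\mathbf{x}\,dy_0 \\
&= \iint y_0\, \frac{p(t=1|y_0,\mathbf{x})\,p(y_0,\mathbf{x})}{p(t=1)}\; d\mathbf{x}\,dy_0 \\
&= \frac{1}{p(t=1)}\iint y_0\, p(t=1|\mathbf{x})\,p(y_0,\mathbf{x})\; d\mathbf{x}\,dy_0 \\
&= \frac{p(t=0)}{p(t=1)}\iint y_0\, \frac{p(t=1|\mathbf{x})}{p(t=0|\mathbf{x})}\, p(y_0,\mathbf{x}|t=0)\; d\mathbf{x}\,dy_0 \\
&\approx \frac{p(t=0)}{p(t=1)}\,
   \frac{\sum_{i=1}^n (1-t_i)\frac{p(t=1|\mathbf{x}_i)}{1-p(t=1|\mathbf{x}_i)}y_i}
        {\sum_{i=1}^n (1-t_i)} \\
&\approx \frac{\sum_{i=1}^n (1-t_i)\frac{p(t=1|\mathbf{x}_i)}{1-p(t=1|\mathbf{x}_i)}y_i}
              {\sum_{i=1}^n (1-t_i)\frac{p(t=1|\mathbf{x}_i)}{1-p(t=1|\mathbf{x}_i)}}
\end{split}
\]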
The final step replacing \(\frac{p(t=1)}{1-p(t=1)}\sum (1-t_i)\) with \(\sum (1-t_i)\frac{p(t=1|\mathbf{x}_i)}{1-p(t=1|\mathbf{x}_i)}\) is not necessarily obvious, but here are the details. \[
\begin{split}
1 &= \iint p(y_0,\mathbf{x}|t=1)\; d\mathbf{x}\,dy_0 \\
&= \iint p(y_0,\mathbf{x}|t=1)
\frac{p(y_0,\mathbf{x}|t=0)}{p(y_0,\mathbf{x}|t=0)}\; d\mathbf{x}\,dy_0 \\
&= \iint
\frac{p(t=1|y_0,\mathbf{x})p(y_0,\mathbf{x})}
{p(t=0|y_0,\mathbf{x})p(y_0,\mathbf{x})}
\frac{p(t=0)}{p(t=1)}
p(y_0,\mathbf{x}|t=0) \; d\mathbf{x}\,dy_0 \\
&= \frac{p(t=0)}{p(t=1)}\iint
\frac{p(t=1|\mathbf{x})}
{p(t=0|\mathbf{x})}
p(y_0,\mathbf{x}|t=0) \; d\mathbf{x}\,dy_0 \\
&\approx \frac{p(t=0)}{p(t=1)}
\frac{\sum_{i=1}^n (1-t_i)
\frac{p(t=1|\mathbf{x}_i)}{1-p(t=1|\mathbf{x}_i)}}
{\sum_{i=1}^n (1-t_i)} \\
\frac{p(t=1)}{p(t=0)}\sum_{i=1}^n (1-t_i)&\approx \sum_{i=1}^n (1-t_i)
\frac{p(t=1|\mathbf{x}_i)}{1-p(t=1|\mathbf{x}_i)}
\end{split}
\]
This formally shows that the approach we took earlier of weighting comparison cases with \(\frac{p}{1-p}\) is a general approach for estimating ATT. The propensity score estimator for ATT is \[
\begin{split}
\widehat{ATT} &= \mathbb{\hat E}(Y(1)|t=1) - \mathbb{\hat E}(Y(0)|t=1) \\
&= \frac{\sum t_iy_i}{\sum t_i} - \frac{\sum (1-t_i)w_iy_i}
{\sum (1-t_i)w_i}
\end{split}
\] where \(w_i=\frac{p(t=1|\mathbf{x}_i)}{1-p(t=1|\mathbf{x}_i)}\).
The remaining problem is that we need to estimate the propensity score, \(p(t=1|\mathbf{x}_i)\). We will first work through an example with the Florida murderers dataset and then we will work on a larger dataset.
6 Example: Florida murderers revisited
Let’s start with the easy part. \[
\mathbb{\hat E}(Y(1)|t=1) = \frac{\sum t_iy_i}{\sum t_i}
\]
We have all of our “treated” cases and we get to observe the outcome for them when they are in the treatment.
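For the harder second term we need weights for the white-murderer (comparison) cases. A minimal sketch, treating black murderers as the “treated” group and the victim’s race as the only confounder, using the same logistic regression trick as before (psFL is an illustrative name); the weighted totals shown below reflect these weights.
# propensity model: probability the murderer is black given the victim's race
psFL <- glm(I(murderer=="B") ~ victim, data=dFLmurder, family=binomial)

# black murderers get weight 1; white murderers get the odds P(B|victim)/(1-P(B|victim))
dFLmurder$w <- 1
i <- dFLmurder$murderer == "W"
dFLmurder$w[i] <- exp(predict(psFL, newdata=dFLmurder[i, ]))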
# A tibble: 4 × 4
# Groups: murderer [2]
murderer victim totW pct
<chr> <chr> <dbl> <dbl>
1 B B 103 0.620
2 B W 63 0.380
3 W B 103. 0.620
4 W W 63.0 0.380
The distribution of victim race is now identical for black and white murderers. We are now set to compute \[
\mathbb{\hat E}(Y(0)|t=1) = \frac{\sum (1-t_i)w_iy_i}{\sum (1-t_i)w_i}
\]
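With the weights in place, a weighted regression does the work. The svydesign() call is echoed in the output below, and glm1 matches the name used in the confidence-interval code that follows.
library(survey)

sdesign <- svydesign(ids = ~1, weights = ~w, data = dFLmurder)
glm1 <- svyglm(death ~ murderer, design = sdesign)
summary(glm1)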
Call:
svyglm(formula = death ~ murderer, design = sdesign)
Survey design:
svydesign(ids = ~1, weights = ~w, data = dFLmurder)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10241 0.02357 4.345 1.86e-05 ***
murdererW -0.05466 0.02765 -1.977 0.0489 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.06890908)
Number of Fisher Scoring iterations: 2
Our estimate of the ATT is -0.055, suggesting that the death penalty rate for black murderers would have been 5.5 percentage points lower had their cases been handled the way white murderers’ cases were.
We can extract the ATT and a 95% confidence interval from the regression model as
a <- coef(summary(glm1))
(a["murdererW","Estimate"] + c(0,-1,1)*1.96*a["murdererW","Std. Error"]) |> round(3)
[1] -0.055 -0.109 0.000
Could we get a more precise estimate for the ATT? Recall that in our simulation we tried adding some features to our model even when they were uncorrelated with the treatment assignment. When features are predictive of the outcome, then we can get more precision. Let’s give it a try.
Call:
svyglm(formula = death ~ murderer + victim, design = sdesign)
Survey design:
svydesign(ids = ~1, weights = ~w, data = dFLmurder)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.05645 0.02144 2.633 0.00887 **
murdererW -0.05466 0.02533 -2.158 0.03168 *
victimW 0.12109 0.02983 4.059 6.19e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.06544564)
Number of Fisher Scoring iterations: 2
a <- coef(summary(glm1))
(a["murdererW","Estimate"] + c(0,-1,1)*1.96*a["murdererW","Std. Error"]) |> round(3)
[1] -0.055 -0.104 -0.005
Including the victim’s race as a covariate has no effect on the ATT, but it does increase the precision of the ATT estimate.
It turns out that for least squares and Poisson regression, you can simply add covariates of interest to the regression model and the treatment effect estimate will simply be the coefficient in front of the treatment indicator. For other regression models, and specifically for logistic regression, extracting the ATT takes a little more work.
7 Boosted propensity score estimation with NSDUH
In this section we are going to use propensity score estimation with the National Survey on Drug Use and Health (NSDUH) to estimate the effect of past heroin use on current year arrest.
NSDUH provides national data on tobacco, alcohol, and drug use, mental health, and related health topics. Data collection began in 1971 and the survey has been conducted every year since. The most recent data collection interviewed 70,000 people. The Substance Abuse and Mental Health Services Administration (SAMHSA), within the Department of Health and Human Services, manages the data collection. RTI International has been collecting the data since 1988.
Previously we used logistic regression to estimate propensity scores. All the complaints I had about regression being insufficient for estimating causal effects map over to using logistic regression for propensity scores. That is, failure to include important non-linear and interaction effects in the propensity score model will result in bad propensity score estimates and incorrect treatment effect estimates. For this reason, we will move beyond logistic regression and use boosting to estimate the propensity score. The use of boosting for propensity score estimation was originally developed in McCaffrey, Ridgeway, and Morral (2004).
Let’s start by loading up the NSDUH 2010 dataset.
load("data/nsduh.Rdata")nsduh <- da32722.0001
We’re going to study whether past heroin use is associated with arrest in the last year. We will set heroin to 1 if the respondent indicated that they had last used heroin more than 12 months ago (and none in the last year). I exclude respondents reporting heroin use in the current year. We will set arrest to 1 if the respondent reports being arrested in the prior 12 months. Responses that are NA indicate no heroin use or no arrests.
# NOBOOKY measures number of times arrested and booked in previous 12 months
#   NA = logical skip, blank, don't know... here assume no arrest
#   1 = 1 arrest
#   2 = 2 arrests
#   3 = 3 or more arrests
nsduh <- nsduh |>
  mutate(arrest = as.numeric(!is.na(NOBOOKY2) & NOBOOKY2>0)) |>
# HERREC measures last time used
#   1 = Within the past 30 days
#   2 = More than 30 days ago but within the past 12 mos
#   3 = More than 12 months ago
#   8 = Used at some point in the past 12 mos
#   9 = Used sometime in lifetime
#   NA = never used, refused, blank
  filter(!(HERREC %in% c(1,2,8))) |>
  mutate(heroin = as.numeric(!is.na(HERREC) & HERREC %in% c(3,9)))
We will consider a handful of potential confounders, race, sex, marital status, and age. Note that a marital status of NA is considered a legitimate value and those respondents are not dropped from the analysis.
NSDUH has sampling weights ANALWT_C. These reweight the observations to collectively resemble the United States population. NSDUH oversamples people and oversamples in places where they believe they are more likely to get respondents answering yes to drug use. The weights are essential for adjusting the sample back to reflecting the population. To properly account for sampling weights in propensity score analysis, you should include them as weights in the propensity score model and multiply the resulting propensity score weights by the sampling weight in the outcome model (Ridgeway et al. 2015). We will do this in the example below.
Let’s see how the treatment (prior heroin use) and control (no heroin use) groups differ on these features.
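The comparisons below can be produced along these lines; this is a sketch, and whether the original tables use the NSDUH sampling weights is an assumption here. The other tables swap in IRSEX or IRMARIT for NEWRACE2.
library(dplyr)
library(tidyr)

# share of each race group within the heroin (1) and non-heroin (0) groups,
# weighted by the sampling weights
nsduh |>
  count(heroin, NEWRACE2, wt = ANALWT_C) |>
  group_by(heroin) |>
  mutate(p = n/sum(n)) |>
  ungroup() |>
  select(-n) |>
  pivot_wider(names_from = heroin, values_from = p)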
# A tibble: 7 × 3
NEWRACE2 `0` `1`
<fct> <dbl> <dbl>
1 W 0.670 0.710
2 B 0.120 0.0957
3 NA.AN 0.00473 0.0160
4 PI 0.00298 0
5 A 0.0458 0.00435
6 Multi 0.0123 0.0151
7 H 0.145 0.159
Note that 71% of those reporting heroin use are white, while 67% of those reporting no heroin use are white. There are smaller differences for black and Hispanic respondents, but clear evidence that race is associated with heroin use.
# A tibble: 2 × 3
IRSEX `0` `1`
<fct> <dbl> <dbl>
1 M 0.483 0.710
2 F 0.517 0.290
Also, sex of the respondent appears to be associated with heroin use with men making up 71% of the heroin users compared with 48% of the non-heroin users.
# A tibble: 5 × 3
IRMARIT `0` `1`
<fct> <dbl> <dbl>
1 M 0.479 0.302
2 W 0.0585 0.0220
3 D 0.126 0.309
4 N 0.289 0.367
5 NA 0.0472 0.000816
Marital status seems strongly associated with heroin use, with 30% of heroin users reporting being married compared with 48% of non-heroin users. 68% of heroin users reported being divorced or never married compared to 42% of non-heroin users.
Heroin use disproportionately afflicts 26-34 year olds. All of these respondent features (race, sex, marital status, and age) are associated with prior heroin use, and all of them could plausibly be related to arrest as well. Therefore, it is essential that we properly account for their potential confounding effects.
7.1 Propensity score estimation with logistic regression
We begin by estimating the propensity score. Since NSDUH has sampling weights, we will first create a survey design object so that R knows that ANALWT_C is a special column in nsduh that needs to be incorporated into all analyses (weighted means, weighted variances, weighted regressions, etc.).
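The design object is one line with the survey package; the svydesign() call is echoed in the output below.
library(survey)

# tell R that ANALWT_C is the sampling weight for all subsequent analyses
sdesign <- svydesign(ids = ~1, weights = ~ANALWT_C, data = nsduh)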
What would we conclude without any adjustment for demographic characteristics? Respondents with prior heroin use are about 9 percentage points more likely to report being arrested in the last 12 months compared with non-heroin respondents.
svyglm(arrest ~ heroin, design=sdesign) |> summary()
Call:
svyglm(formula = arrest ~ heroin, design = sdesign)
Survey design:
svydesign(ids = ~1, weights = ~ANALWT_C, data = nsduh)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.027004 0.001084 24.906 < 2e-16 ***
heroin 0.089607 0.018818 4.762 1.92e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.02731343)
Number of Fisher Scoring iterations: 2
Let’s pursue our propensity score strategy by fitting a standard logistic regression model predicting prior heroin use from the four demographic variables of interest. We use quasibinomial here just so that R does not complain about non-integer weights.
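A minimal sketch of this step; the model and weight construction mirror what we did with the simulated data, and the object name ps1 is just for illustration (this also assumes no missing values in these columns).
# propensity score model: prior heroin use as a function of the demographics,
# fit with the NSDUH sampling weights; quasibinomial avoids warnings about
# non-integer weights
ps1 <- glm(heroin ~ NEWRACE2 + IRSEX + IRMARIT + CATAG7,
           data = nsduh, weights = ANALWT_C, family = quasibinomial)

# ATT weights: 1 for the heroin ("treated") cases, p/(1-p) for the rest
p <- predict(ps1, type = "response")
nsduh$w <- ifelse(nsduh$heroin == 1, 1, p/(1-p))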
Even though we already used ANALWT_C in constructing the propensity scores, we need to multiply the propensity score weights by ANALWT_C to get the right answers. Gory details are in Ridgeway et al. (2015).
nsduh <- nsduh |> mutate(w = w*ANALWT_C)
If all of our prior reasoning is correct, then the propensity score weights should uncorrelate heroin and each of the demographic variables. Let’s check their weighted distributions.
Everything appears to have worked out! The heroin and non-heroin groups now have the same race distribution, sex distribution, distribution of marital status, and age distribution. This is what we should see if heroin usage were randomly assigned to respondents. Now that heroin is uncorrelated with the demographic features, any regression analysis we conduct will be more robust to their inclusion or exclusion or use of interactions or lack of interaction effects.
7.2 Outcome analysis
The column \(w\) contains our propensity score weights multiplied by the NSDUH sampling weights. We first set up a new survey design object with these new weights and proceed with a weighted regression model to compute the ATT.
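The outcome analysis then looks like this; the names sdesign and lm1 match the output and the confidence-interval code below.
# new design object: propensity score weights times sampling weights
sdesign <- svydesign(ids = ~1, weights = ~w, data = nsduh)

# weighted outcome model; the heroin coefficient is the ATT estimate
lm1 <- svyglm(arrest ~ heroin, design = sdesign)
summary(lm1)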
Call:
svyglm(formula = arrest ~ heroin, design = sdesign)
Survey design:
svydesign(ids = ~1, weights = ~w, data = nsduh)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.044266 0.002941 15.052 < 2e-16 ***
heroin 0.072345 0.019016 3.804 0.000142 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.0726703)
Number of Fisher Scoring iterations: 2
# report a 95% confidence interval
a <- coef(summary(lm1))
round(a["heroin","Estimate"] + c(0,-1,1)*1.96*a["heroin","Std. Error"], 3)
[1] 0.072 0.035 0.110
This is only a modest change from our original estimate that involved no adjustment for potential confounders. As before we can also include the demographic features in an effort to increase precision of our ATT.
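A sketch of that model (lm2 is an illustrative name):
# add the demographic features to the outcome model to try to improve
# the precision of the ATT estimate
lm2 <- svyglm(arrest ~ heroin + NEWRACE2 + IRSEX + IRMARIT + CATAG7,
              design = sdesign)
a <- coef(summary(lm2))
round(a["heroin","Estimate"] + c(0,-1,1)*1.96*a["heroin","Std. Error"], 3)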
We get a modest reduction in the width of the confidence interval and, as expected, no change in the actual estimate of the ATT.
All of this analysis is based on the assumption that we really uncorrelated those demographic features from prior heroin use. But did we? The balance tables shown previously just looked at the marginal distribution of the demographic features and not their interactions. Perhaps we are still doing the analysis wrong and neglecting to account for important interactions.
If we compare the treatment and comparison groups in terms of a race/sex interaction, we see some differences.
# A tibble: 8 × 4
NEWRACE2 IRSEX `0` `1`
<fct> <fct> <dbl> <dbl>
1 W M 0.510 0.462
2 W F 0.202 0.248
3 B M 0.0635 0.0764
4 B F 0.0325 0.0193
5 NA.AN M 0.00940 0.0136
6 Multi M 0.0102 0.0127
7 H M 0.114 0.142
8 H F 0.0440 0.0173
It is clear that we have not really matched well on this interaction. Whether the differences observed here matter depends on how much the interaction affects the arrest outcome. Even if it does not matter in this example, we need to be concerned about research questions with a large number of potential confounders. Also, all of the demographic features included here are categorical; with continuous confounders, non-linear effects can arise in the propensity score model, and failure to capture them can result in incorrect estimates. We need a more robust method for estimating propensity scores.
7.3 Propensity score estimation with boosting
We will first walk through an example of obtaining boosted propensity scores directly using gbmt(). Then we will use the fastDR package that integrates a lot of these ideas.
library(gbm3)
We start by estimating a propensity score model, predicting heroin from the demographic features. Some aspects of fitting the generalized boosted model change when estimating propensity scores. The primary change is our motivation. We are no longer interested in out-of-sample predictive performance; there is no future dataset that we are trying to predict. Instead, we are trying to explain, for this specific dataset, which observations are in the treatment group and which ones are in the comparison group. The implication is that we do not do any cross-validation and we set bag_fraction=1.0. Both of these changes result in better in-sample predictive performance.
set.seed(20240316)
gbm1 <- gbmt(heroin ~ NEWRACE2 + IRSEX + IRMARIT + CATAG7,
             data = nsduh,
             weights = nsduh$ANALWT_C,
             distribution = gbm_dist("Bernoulli"),
             train_params = training_params(
                num_trees = 3000,          # number of trees
                shrinkage = 0.003,         # lambda
                bag_fraction = 1.0,
                num_train = nrow(nsduh),
                min_num_obs_in_node = 10,
                interaction_depth = 3,     # number of splits
                num_features = 4),         # number of features
             par_details = gbmParallel(num_threads = 8),
             is_verbose = FALSE)
Now that we have a model fit, we need to decide the optimal number of iterations. It is somewhat of an open question about what it means to optimize the propensity score model. There is general consensus that the optimal propensity score model should produce propensity scores that makes the balance tables “look good.” There is also no consensus on what “looks good” means. I characterize the balance table based on the single largest difference in the table. The propensity score model is only as good as how well it balances the worst balanced feature. If there is one feature that does not align across the treatment and control groups, then there is a risk that we have not adequately accounted for its potential confounding effect.
For now we will use all 3,000 trees of our GBM model. We will consider optimizing the number of trees when we use fastDR().
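A sketch of turning the boosted model into weights, assuming gbm3’s predict() method behaves like the original gbm package (log odds on the link scale, with an n.trees argument); the exact call may differ.
# predicted log odds of prior heroin use using all 3000 trees
logodds <- predict(gbm1, newdata = nsduh, n.trees = 3000)

# ATT weights: 1 for heroin cases, odds for comparison cases,
# multiplied by the NSDUH sampling weights as before
nsduh$w <- ifelse(nsduh$heroin == 1, 1, exp(logodds)) * nsduh$ANALWT_C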
Let’s check the balance that we get from gbmt(). All of the marginal distributions look good. The largest difference in the treatment and control columns is about 0.006, less than a percentage point.
And let’s check that race/sex interaction. There are some modest differences with the largest about 0.025, much smaller than the 0.048 difference we got when we used logistic regression to estimate the propensity scores.
# A tibble: 8 × 4
NEWRACE2 IRSEX `0` `1`
<fct> <fct> <dbl> <dbl>
1 W M 0.490 0.462
2 W F 0.227 0.248
3 B M 0.0656 0.0764
4 B F 0.0280 0.0193
5 NA.AN M 0.0122 0.0136
6 Multi M 0.00995 0.0127
7 H M 0.121 0.142
8 H F 0.0328 0.0173
7.4 fastDR to compute propensity score estimates (and doubly robust estimates)
fastDR grew out of the Toolkit for Weighting and Analysis of Non-equivalent Groups (twang), a package I wrote some time ago. RAND continues to develop and maintain the twang package (https://www.rand.org/statistics/twang.html). twang has many options and those options come with some computational burdens. For fastDR I stripped the package down to the features that I use all the time, upgraded the boosting algorithm to gbm3, and added the ability to compute doubly robust estimates.
To compute the propensity score weights and estimate the ATT, run the fastDR() function. You just need to separate out the outcome, treatment indicator, confounders, sampling weights, and an observation identifier.
The balance.tab.un displays the unadjusted balance table. Here we can examine how the treatment and control groups differ before propensity score weighting. The column on the far right shows the Kolmogorov-Smirnov statistic, which for these variables is just the difference in their rates. It is a little more complicated for continuous variables.
Table 2: Balance table for propensity score weighted observations

|                | control | treatment |     KS |
|----------------|--------:|----------:|-------:|
| NEWRACE2:W     |   0.711 |     0.710 | -0.002 |
| NEWRACE2:B     |   0.094 |     0.096 |  0.002 |
| NEWRACE2:NA.AN |   0.016 |     0.016 |  0.000 |
| NEWRACE2:A     |   0.006 |     0.004 | -0.001 |
| NEWRACE2:Multi |   0.015 |     0.015 |  0.000 |
| NEWRACE2:H     |   0.158 |     0.159 |  0.001 |
| IRSEX:M        |   0.709 |     0.710 |  0.001 |
| IRSEX:F        |   0.291 |     0.290 | -0.001 |
| IRMARIT:M      |   0.303 |     0.302 | -0.001 |
| IRMARIT:W      |   0.022 |     0.022 |  0.000 |
| IRMARIT:D      |   0.308 |     0.309 |  0.000 |
| IRMARIT:N      |   0.364 |     0.367 |  0.002 |
| IRMARIT:NA     |   0.002 |     0.001 | -0.001 |
| CATAG7:12-13   |   0.001 |     0.000 | -0.001 |
| CATAG7:14-15   |   0.003 |     0.003 |  0.000 |
| CATAG7:16-17   |   0.005 |     0.004 |  0.000 |
| CATAG7:18-20   |   0.031 |     0.031 |  0.000 |
| CATAG7:21-25   |   0.090 |     0.090 |  0.000 |
| CATAG7:26-34   |   0.208 |     0.210 |  0.002 |
| CATAG7:35+     |   0.662 |     0.662 |  0.000 |
Note that after weighting, the treatment and control groups look very similar. The largest difference in the table above is 0.002. You can access this largest KS statistic from the ps2 object.
ps2$ks
[1] 0.002178828
We can also extract the number of treatment cases and the effective sample size of the weighted control cases.
ps2$n1
[1] 599
ps2$ESS
[1] 4223.858
The effective sample size (ESS) is the equivalent number of independent, unweighted observations that would give you the same precision as the observed, weighted data. If \(\mathrm{Var}(y)=\sigma^2\), then we are looking for \(ESS\) that solves \[
\begin{split}
\frac{\sigma^2}{ESS} &= \mathrm{Var}\left(\frac{\sum w_iy_i}{\sum w_i}\right) \\
&= \frac{\mathrm{Var}(\sum w_iy_i)}{(\sum w_i)^2} \\
&= \frac{\sum w_i^2\mathrm{Var}(y_i)}{(\sum w_i)^2} \\
&= \sigma^2\frac{\sum w_i^2}{(\sum w_i)^2} \\
&= \frac{\sigma^2}{\frac{(\sum w_i)^2}{\sum w_i^2}} \\
ESS &= \frac{(\sum w_i)^2}{\sum w_i^2}
\end{split}
\]
Since the cases all have weights, many of which could be near 0, the ESS gives a useful way of gauging the amount of information in the data for the estimate of \(\mathbb{\hat E}(Y(0)|t=1)\).
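The ESS is simple to compute directly from any weight vector; this is just the formula above, not a fastDR function, and applying it to the weights we constructed by hand will not exactly reproduce ps2$ESS since fastDR chooses its own number of trees.
# effective sample size of a set of weights: (sum w)^2 / sum(w^2)
ess <- function(w) sum(w)^2 / sum(w^2)

# for example, the ESS of the weighted control cases
ess(nsduh$w[nsduh$heroin == 0])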
Table 3: Results table showing unweighted, propensity score weighted, and DR estimates

|    |  E.y1 |  E.y0 | se.y1 | se.y0 |    TE | se.TE | p |
|----|------:|------:|------:|------:|------:|------:|--:|
| un | 0.117 | 0.027 | 0.019 | 0.001 | 0.090 | 0.019 | 0 |
| ps | 0.117 | 0.045 | 0.019 | 0.006 | 0.072 | 0.020 | 0 |
| dr | 0.117 | 0.056 | 0.019 | 0.006 | 0.061 | 0.023 | 0 |
The first row of Table 3 shows the unadjusted analysis. The second row shows a propensity score analysis. The third row shows a doubly robust analysis, one that includes the features in the outcome regression model. The first column is the arrest rate for the heroin group. This remains unchanged regardless of the estimator. The second column is the arrest rate for the comparison group that is either unadjusted, propensity score weighted, or a doubly robust estimate. The next two columns show their standard errors. TE and se.TE show the treatment effect (ATT) and the standard error of the treatment effect. The final column shows the p-value.
Heroin users had an arrest rate of 11.7%. Based on the doubly robust analysis, respondents who were similar to the heroin users (based on age, race, sex, and marital status) had an arrest rate of 5.6%. The ATT is the difference in these two percentages, 6.1 percentage points. We can get a 95% confidence interval by taking the TE and adding/subtracting 1.96 times the se.TE, which gives us
att <- ps2$effects$arrest["dr","TE"]
se  <- ps2$effects$arrest["dr","se.TE"]
# ATT estimate and 95% confidence interval
100*round(att + 1.96*c(0,-1,1)*se, 3)
[1] 6.1 1.6 10.6
These are powerful tools, but can only give results that are as good as the data. If the data have limited features, then you will be limited in how much confounding you can address. The most common failure of propensity score methods is the inability to measure important confounders in the first place. Propensity score methods only produce causal effects if strong ignorability holds. To get there often means very comprehensive data collection on the units of analysis and accounting for all of them in the propensity score model.
8 Summary
We covered a few ways of estimating treatment effects in observational studies, noting the fundamental problem of causal inference: inferring what would have happened in the absence of treatment. We covered propensity score weighting to adjust for confounders and the use of machine learning (boosting) to estimate the propensity scores.
Causal Estimation:
Simpson’s Paradox: after incorporating a third variable in an analysis, the conclusions can reverse
Observational studies aim to infer causal relationships where randomization is impractical
Key challenges include addressing confounding variables
Neyman-Rubin Causal Model: Framework underpinning causal inference methods, formalizing the challenge of unobserved counterfactuals
Propensity Score Methods:
Probability of treatment assignment given observed covariates
Balances covariate distributions between treatment and control groups to estimate causal effects
Use boosting to improve estimates compared to logistic regression
Practical Applications:
Florida murderers analysis showed the importance of adjusting for the race of the victim, an important confounding variable
NSDUH analysis studied the effect of prior heroin use on past-year arrest
demonstrated how to combine propensity score weights with sampling weights
References
McCaffrey, Daniel F., Greg Ridgeway, and Andrew R. Morral. 2004. “Propensity Score Estimation with Boosted Regression for Evaluating Causal Effects in Observational Studies.” Psychological Methods 9 (4): 403–25.
Radelet, Michael L. 1981. “Racial Characteristics and the Imposition of the Death Penalty.” American Sociological Review 46 (6): 918–27. http://www.jstor.org/stable/2095088.
Ridgeway, Greg, Stephanie A. Kovalchik, Beth Ann Griffin, and Mohammed U. Kabeto. 2015. “Propensity Score Analysis with Survey Weighted Data.” Journal of Causal Inference 3: 237–49. https://api.semanticscholar.org/CorpusID:3490612.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688.