QTM 385 - Experimental Methods

Lecture 18 - Heterogeneous Effects

Danilo Freire

Emory University

Hello, everyone! 👋
How are you today?

Brief recap 📚

Brief recap 📚

Centola (2010): Online Behaviour Spread

  • Experimentally varied network structure (clustered vs. random) in an online health community
  • Measured spread of adopting a health forum registration
  • Finding: Behaviour spread significantly farther and faster in clustered (high reinforcement) networks compared to random networks
  • Implication: Network structure critically affects diffusion, especially for behaviours requiring social proof. Weak ties might be less effective for complex contagions

Paluk et al. (2021): Conflict Climate Change

  • Field experiment in 56 US middle schools using social networks to reduce conflict
  • Randomly assigned schools to intervention or control
  • Intervention: Student-nominated “social referents” to model anti-conflict norms.
  • Finding: Targeting influential social referents was effective in reducing overall student conflict (by ~30%) than targeting random students
  • Mechanism: Changed perceived social norms around conflict within the school network (spillover effect)

Today’s plan 📅

Heterogeneous Treatment Effects (HTE)

Moving Beyond the Average

  • Why average treatment effects (ATEs) aren’t the whole story
  • Defining and understanding treatment effect variability
  • Fundamental challenge: We cannot directly observe individual treatment effects or their variance
  • Methods for exploring HTE:
    • Bounding the variance of treatment effects
    • Testing for the presence of heterogeneity
  • Structured approaches:
    • Treatment-by-Covariate Interactions (Subgroup Analysis / CATEs)
    • Treatment-by-Treatment Interactions (Factorial Designs)
  • Modelling HTE using regression
  • Pitfalls and best practices:
    • Multiple comparisons
    • Causal interpretation challenges

Why Look Beyond the ATE?

The Limits of Averages

  • To this point, we’ve focused mainly on estimating the Average Treatment Effect (ATE)
  • Assumption: The treatment effect might be similar across individuals
  • Reality: Interventions rarely affect everyone identically. Effects often vary across individuals, groups, or contexts
  • Practical Value: Policymakers want to know who benefits most (or least) from an intervention and under what conditions
    • Targeting resources effectively
    • Tailoring programs
  • Scientific Value: Understanding why effects vary helps uncover causal mechanisms
    • What makes an intervention work (or not)?

Examples of Heterogeneity

  • School Choice: Policies allowing school choice might benefit children whose parents prioritise academic quality more than those whose parents prioritise proximity
  • Job Training: Programs might increase earnings for voluntary participants but have no effect on those required to attend for public assistance eligibility
  • Tax Compliance: Efforts to increase compliance might work differently depending on whether taxpayers can easily conceal income
  • Advertising: A campaign targeting teenagers might alienate older consumers

The Fundamental Problem of HTE Inference

Defining Treatment Effect Heterogeneity

  • Individual treatment effect: \(\tau_i = Y_i(1) - Y_i(0)\)
  • Treatment effect heterogeneity refers to the variance of individual treatment effects across subjects: \(Var(\tau)\)
  • We want to estimate \(Var(\tau)\) and test if \(Var(\tau) > 0\)
  • Recall the relationship:

\[Var(\tau) = Var(Y_i(1) - Y_i(0))\] \[Var(\tau) = Var(Y_i(1)) + Var(Y_i(0)) - 2Cov(Y_i(1), Y_i(0))\]

  • This equation highlights the challenge…

What Experimental Data Tell Us (and What They Don’t)

  • Experiments give us samples of \(Y_i(1)\) (from the treatment group) and \(Y_i(0)\) (from the control group)
  • From these, we can estimate the marginal distributions of \(Y(1)\) and \(Y(0)\)
  • We can estimate \(E[Y(1)]\), \(E[Y(0)]\), \(Var(Y(1))\), and \(Var(Y(0))\)
  • BUT: We never observe both \(Y_i(1)\) and \(Y_i(0)\) for the same individual \(i\)
  • Therefore, we cannot directly estimate the joint distribution of potential outcomes
  • So we cannot estimate \(Cov(Y_i(1), Y_i(0))\)
  • Without the covariance, we cannot directly calculate \(Var(\tau)\)!

Simple Example Illustrating the Problem

  • Suppose \(N=6\) subjects. We observe outcomes:
    • Control Group (\(Y(0)\)): {1, 2, 3} -> \(\hat{Var}(Y(0))=1\)
    • Treatment Group (\(Y(1)\)): {4, 5, 6} -> \(\hat{Var}(Y(1))=1\)
  • Estimated ATE = \(E[Y(1)] - E[Y(0)] = 5 - 2 = 3\)
  • What is \(Var(\tau)\)? It depends on how potential outcomes are paired (which we don’t know):
    • Scenario 1 (Perfect Positive Correlation): Pairs are {(1,4), (2,5), (3,6)}
      • Individual effects \(\tau_i\) are {3, 3, 3}
      • \(Var(\tau) = 0\). Homogeneous effect
    • Scenario 2 (Perfect Negative Correlation): Pairs are {(1,6), (2,5), (3,4)}
      • Individual effects \(\tau_i\) are {5, 3, 1}
      • \(Var(\tau) = Var(\{5,3,1\}) = 4\). Heterogeneous effect
    • Scenario 3 (Mixed): Pairs are {(1,6), (2,4), (3,5)}
      • Individual effects \(\tau_i\) are {5, 2, 2}
      • \(Var(\tau) = Var(\{5,2,2\}) = 3\). Heterogeneous effect
  • Experimental data alone do not distinguish these scenarios!

Detecting Heterogeneity: Bounds & Tests

Bounding the Variance of Treatment Effects

  • Since we can’t estimate \(Cov(Y(1), Y(0))\), we can’t pinpoint \(Var(\tau)\).
  • However, we can estimate bounds on \(Var(\tau)\) by considering the most extreme possible correlations (Heckman, Smith, Clements 1997)
  • Procedure (for equal group sizes):
    1. Sort observed \(Y(0)\) values ascendingly
    2. Sort observed \(Y(1)\) values ascendingly
    3. Lower Bound for \(Var(\tau)\): Pair the sorted values rank-by-rank (1st with 1st, 2nd with 2nd, …). Calculate the variance of the resulting \(\tau_i\) estimates. This assumes maximum possible positive covariance
    4. Upper Bound for \(Var(\tau)\): Pair the sorted \(Y(0)\) values (ascending) with sorted \(Y(1)\) values in descending order (1st \(Y(0)\) with last \(Y(1)\), …). Calculate the variance of the resulting \(\tau_i\). This assumes maximum possible negative covariance
  • If the number of subjects differs, pair percentiles instead of ranks
  • A lower bound substantially > 0 suggests heterogeneity

Testing for Heterogeneity: Comparing Variances

  • A simpler approach: Test the null hypothesis \(H_0: Var(\tau) = 0\) (homogeneous effects)
  • If \(\tau_i = \tau\) (constant) for all \(i\), then \(Var(\tau)=0\)
  • Also, if \(\tau\) is constant, \(Cov(Y(0), \tau) = Cov(Y(0), \text{constant}) = 0\)
  • Recall: \(Var(Y(1)) = Var(Y(0)) + Var(\tau) + 2Cov(Y(0), \tau)\)
  • Under \(H_0: Var(\tau)=0\), this simplifies to \(Var(Y(1)) = Var(Y(0))\)
  • Test Idea: If we observe significantly different variances in the treatment and control groups, we can reject the null hypothesis of homogeneous effects
  • Caution: Equality of variances does not strictly prove homogeneity (could have \(Var(\tau) = -2Cov(Y(0), \tau)\)), but large differences suggest heterogeneity

Example: Teacher Incentives Variance Test

  • Revisit the teacher incentives experiment (Muralidharan & Sundararaman 2011)
  • Outcome: Change in school test scores (post-pre)
  • Observed variances:
    • Control group: \(\hat{Var}(Y(0)) = 59.29\)
    • Treatment group: \(\hat{Var}(Y(1)) = 91.20\)
  • Observed difference: \(|91.20 - 59.29| = 31.91\). Is this difference large enough to reject \(H_0\)?
  • Randomisation Inference:
    1. Assume constant treatment effect (ATE = 3.50) for all schools (generate full potential outcomes schedule)
    2. Repeat random assignment to treatment/control 100,000 times
    3. For each simulation, calculate the absolute difference in variances between the simulated T/C groups
    4. Find the proportion of simulated differences >= 31.91
  • Result: p-value = 0.088
  • Conclusion: Cannot reject \(H_0\) at \(\alpha=0.05\), but the borderline p-value hints at possible heterogeneity. Incentives might affect schools differently

Limitations of Basic Methods

  • Bounds on \(Var(\tau)\) are often very wide, making them uninformative
  • Tests comparing \(Var(Y(1))\) and \(Var(Y(0))\) often lack statistical power, especially with smaller samples
  • These methods don’t tell us why effects might vary or who is affected differently
  • They serve as a preliminary step. Significant results encourage more structured investigation

➡️ Need approaches that examine variation across specific subgroups or conditions

Treatment-by-Covariate Interactions (CATEs) 📊

Conditional Average Treatment Effects (CATEs)

  • Idea: Partition subjects into subgroups based on pre-treatment covariates (X) and estimate the ATE within each subgroup
  • CATE: Conditional Average Treatment Effect
    • \(CATE(X=x) = E[Y_i(1) - Y_i(0) | X_i = x]\)
  • Example: Estimate the ATE for men vs. women, old vs. young, high income vs. low income
  • Interaction Effect: The difference between CATEs across subgroups
    • \(Interaction = CATE(X=x_1) - CATE(X=x_2)\)
  • This approach uses observable characteristics to explore potential sources of heterogeneity
  • Often guided by theory about why effects might differ

Example: Teacher Incentives & Parent Literacy

  • Researchers explored if teacher incentive effects varied by school characteristics (Muralidharan & Sundararaman 2011)
  • Covariate: Average parent literacy level (pre-treatment). Partitioned schools into below-median and above-median literacy
  • Estimated CATEs:
    • Low Literacy Schools: CATE = 11.14 - 7.83 = 3.31
    • High Literacy Schools: CATE = 12.26 - 8.57 = 3.69
  • Interaction Effect: 3.69 - 3.31 = 0.38.
  • Hypothesis Test: Is this difference statistically significant?
    • Using randomisation inference (or regression F-test), p = 0.88
    • Conclusion: No significant evidence that the treatment effect differs based on parental literacy levels in the school

Regression Framework for CATEs

  • Regression provides a flexible way to estimate and test CATEs and interactions
  • Let \(I_i\) be the treatment indicator (1=Treat, 0=Control)
  • Let \(P_i\) be the covariate indicator (e.g., 1=High Literacy, 0=Low Literacy)
  • Null Model (Homogeneous Effect): \(Y_i = \alpha + \beta I_i + \gamma P_i + u_i\)
    • Assumes a single ATE (\(\beta\))
  • Alternative Model (Interaction): \(Y_i = \alpha + \beta I_i + \gamma P_i + \delta (I_i \times P_i) + u_i\)
    • CATE for Low Literacy (\(P_i=0\)): \(\beta\)
    • CATE for High Literacy (\(P_i=1\)): \(\beta + \delta\)
    • Interaction Effect: \(\delta\)
  • Testing: Use an F-test to see if the model with the interaction term (\(\delta\)) fits significantly better than the null model (i.e., test \(H_0: \delta = 0\))
  • In the teacher incentive example, the F-test yields p = 0.88, matching the randomisation inference

Cautions with CATEs ⚠️

The Multiple Comparisons Problem

  • Datasets often contain many potential covariates (age, gender, income, education, location, etc.)
  • If we test for interactions with many covariates, the chance of finding a “significant” interaction purely by chance increases
  • Example: Test 20 uncorrelated covariates for interaction at \(\alpha=0.05\). Assume the true effect is homogeneous
    • Probability of finding at least one significant interaction: \(1 - (1 - 0.05)^{20} \approx 0.64\). (High risk of false discovery!)
  • Solution 1: Bonferroni Correction:
    • If conducting \(q\) tests, adjust significance level to \(\alpha / q\)
    • E.g., for 20 tests at \(\alpha=0.05\), require \(p < 0.05 / 20 = 0.0025\)
    • Simple, but can be overly conservative (reduces power)
  • Solution 2: Pre-specification:
    • Specify the few theoretically motivated interactions you will test before looking at the data (e.g., in a Pre-Analysis Plan)

Correlation vs. Causation in Subgroups

  • A fundamental limitation: Covariates used for subgroup analysis are observed characteristics, not randomly assigned treatments
  • Finding that CATEs differ between, say, high-education and low-education groups means the treatment effect correlates with education
  • It does not necessarily mean that changing someone’s education level would change how they respond to the treatment
  • Education might be a marker for other underlying factors (e.g., income, access to information, underlying health) that are the “true” drivers of the heterogeneity
  • Subgroup analysis is essentially observational/descriptive regarding the source of heterogeneity
  • It’s useful for prediction (who is likely to respond more?) and generating hypotheses, but not for establishing the causal impact of the covariate itself on the treatment effect

Example: Interpreting Voter Turnout Interaction

  • Gerber, Green, Larimer (2008) sent mailers showing recipients’ own (and neighbours’) past voting records to encourage turnout
  • Finding: Mailer effect (ATE) was larger among people who had voted in a previous election (2004) compared to those who hadn’t
  • Superficial interpretation: Seeing your past voting record is more motivating than seeing your past non-voting record
  • Problem: Past voting behaviour (the covariate) is not random. People who voted in 2004 are different from those who didn’t in many ways (more politically engaged, different demographics, etc.). The mailer content also differs (shows voting vs. non-voting)
  • Follow-up Experiment (Gerber, Green, Larimer 2010): Randomly varied whether the mailer showed a record of voting or a record of abstention
  • Result: Showing a record of abstention was significantly more motivating
  • Lesson: Subgroup analysis conflated the type of person with the content of the message. Direct experimental manipulation was needed to isolate the causal effect of the message content

Treatment-by-Treatment Interactions (Factorial Designs) 🧪

Beyond Covariates: Manipulating Multiple Factors

  • To make causal claims about why treatment effects vary, we need to experimentally manipulate the hypothesised moderating factors
  • Factorial Design: An experimental design that includes two or more “factors” (independent variables), where subjects are randomly assigned to a combination of the levels of these factors
  • Example: Factor A (Treatment vs. Control), Factor B (Context 1 vs. Context 2).
  • Subjects randomly assigned to one of four groups: (Treat, Cxt 1), (Treat, Cxt 2), (Control, Cxt 1), (Control, Cxt 2).
  • Allows us to estimate:
    • Main effect of Factor A
    • Main effect of Factor B
    • Interaction effect: Does the effect of Factor A depend on the level of Factor B?
  • This provides stronger causal evidence about moderators than treatment-by-covariate analysis

Examples of Factorial Designs

  • Gerber & Green (2000) Voter Mobilisation: Crossed multiple communication methods (Factor 1: Canvassing Y/N, Factor 2: Phone Call Y/N, Factor 3: Direct Mail Y/N) to see if effects were additive or interactive
  • Olken (2007) Corruption Monitoring: Crossed top-down audits (Factor 1: Audit Y/N) with bottom-up community monitoring (Factor 2: Invitations to meetings Y/N) in Indonesian road projects. Tested if audits were more effective when community was involved
  • Rosen (2010) Discrimination: Crossed putative email sender ethnicity (Factor 1: Hispanic/Non-Hispanic name) with email grammar quality (Factor 2: Good/Bad grammar) sent to state legislators. Tested if ethnic discrimination depended on perceived social class (signalled by grammar)

Rosen (2010) Example: Ethnicity and Grammar

Research Question: Do US state legislators respond differently to constituent emails based on perceived ethnicity and writing quality?

2x2 Factorial Design: - Factor 1: Sender Name (Colin Smith vs. José Ramirez) - Factor 2: Email Grammar (Good vs. Bad)

Outcome: Percentage of emails receiving a reply

Results (Subset, N=100 per cell):

Colin Smith (Non-Hispanic) José Ramirez (Hispanic) Difference (José - Colin)
Good Grammar 52% 37% -15%
Bad Grammar 29% 34% +5%
Difference (Bad-Good) -23% -3%
  • Interpretation:
    • With good grammar, Colin gets more replies (-15% effect for José)
    • With bad grammar, José gets slightly more replies (+5% effect)
    • The effect of ethnicity depends on grammar quality (Interaction)
    • Poor grammar hurts Colin much more (-23%) than José (-3%)

Advantages and Caveats of Factorial Designs

  • Advantages:
    • Allows causal estimation of interaction effects
    • Efficient way to study multiple factors simultaneously
    • Can reveal complex relationships (e.g., conditions where a treatment is effective/ineffective)
  • Caveats:
    • Require larger sample sizes to detect interactions (interactions are often smaller than main effects)
    • Can become complex logistically with many factors/levels
    • Non-compliance can be problematic: estimating effects for those receiving multiple treatments requires sufficient compliers in those specific cells, which can be rare

Modelling HTE with Regression 📈

Extending the Regression Framework

  • Regression is a powerful tool for modelling both treatment-by-covariate and treatment-by-treatment interactions.
  • Systematic Heterogeneity: Variation in \(\tau\) that can be predicted by observed covariates or experimental factors (modelled by interaction terms).
  • Idiosyncratic Heterogeneity: Residual variation in \(\tau\) not explained by the model (part of the error term \(u_i\)).
  • We use interaction terms to capture systematic heterogeneity.

Modelling Treatment-by-Treatment Interaction

  • Let \(J_i = 1\) if sender is José, 0 if Colin
  • Let \(G_i = 1\) if grammar is bad, 0 if good
  • Model: \(Y_i = \alpha + \beta J_i + \gamma G_i + \delta (J_i \times G_i) + u_i\)
  • Interpreting Coefficients:
    • \(\alpha\): Mean outcome for baseline (Colin, Good Grammar) = 52%
    • \(\beta\): Effect of José vs. Colin, when grammar is good = 37% - 52% = -15%
    • \(\gamma\): Effect of Bad vs. Good grammar, when sender is Colin = 29% - 52% = -23%
    • \(\delta\): Interaction effect. How the effect of José changes when grammar goes from good to bad.
      • Effect of José (Bad Grammar) = (\(\alpha + \gamma\)) + (\(\beta + \delta\))
      • Effect of José (Good Grammar) = \(\alpha + \beta\)
      • Difference = \(\gamma + \delta\)
      • Alternatively: \(\delta\) = (Effect of Bad Grammar for José) - (Effect of Bad Grammar for Colin) = (-3%) - (-23%) = +20%
  • Estimates for Rosen data: \(\hat{Y}_i = 0.52 - 0.15 J_i - 0.23 G_i + 0.20 (J_i \times G_i)\)
  • Testing the Interaction: F-test for \(H_0: \delta = 0\). In Rosen’s data, p = 0.037. Significant interaction

Automating the Search for Interactions

  • Manually specifying and testing interactions becomes infeasible with many covariates/factors.
  • Machine Learning Approaches: Algorithms designed to automatically search for interactions.
    • E.g., Generalised Random Forests (Athey & Imbens)
    • Methodically partition the data based on covariates to find subgroups with different treatment effects
    • Often use techniques like cross-validation to avoid overfitting and false discoveries
  • Can be a powerful exploratory tool, especially in high-dimensional settings
  • Still requires careful interpretation and follow-up validation
  • Beyond the scope of this lecture, but good to be aware of

Summary

Investigating Heterogeneous Treatment Effects

  • ATEs provide an incomplete picture; treatment effects often vary systematically
  • Fundamental challenge: \(Cov(Y(1), Y(0))\) is unidentified, so \(Var(\tau)\) cannot be directly estimated from experimental data alone
  • Preliminary checks:
    • Bounding \(Var(\tau)\): Often yields wide, uninformative bounds
    • Comparing \(Var(Y(1))\) and \(Var(Y(0))\): Low power test for \(H_0: Var(\tau)=0\).
  • Structured approaches are needed for deeper insights
  • Treatment-by-Covariate (Subgroup/CATEs):
    • Estimates effects within subgroups defined by pre-treatment X
    • Useful for description and prediction
    • Interpretation caution: Correlational regarding source of HTE; risk of multiple comparisons
  • Treatment-by-Treatment (Factorial Designs):
    • Experimentally manipulate multiple factors
    • Allows causal inference about interactions
    • More complex design, requires larger N
  • Regression Modelling: Flexible tool for estimating and testing interactions
  • Best Practice: Pre-specify theoretically motivated interactions to test; treat exploratory findings cautiously pending replication

And that’s all for today! 🎉

See you next time! 😉