DATASCI 385 - Experimental Methods

Lecture 18 - Heterogeneous Effects

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hello, everyone! 👋
How are you today?

Brief recap 📚

Centola (2010): Online Behaviour Spread

  • Experimentally varied network structure (clustered vs. random) in an online health community
  • Measured spread of adopting a health forum registration
  • Finding: Behaviour spread significantly farther and faster in clustered (high reinforcement) networks compared to random networks
  • Implication: Network structure critically affects diffusion, especially for behaviours requiring social proof. Weak ties might be less effective for complex contagions

Paluck et al. (2016): Changing Climates of Conflict

  • Field experiment in 56 U.S. middle schools using social networks to reduce conflict
  • Randomly assigned schools to intervention or control
  • Intervention: Student-nominated “social referents” to model anti-conflict norms
  • Finding: Targeting influential social referents was more effective at reducing overall student conflict (by ~30%) than targeting random students
  • Mechanism: Changed perceived social norms around conflict within the school network (spillover effect)

Today’s plan 📅

Heterogeneous treatment effects (HTE)

Moving beyond the average

  • Why average treatment effects (ATEs) aren’t the whole story
  • Defining and understanding treatment effect variability
  • Fundamental challenge: We cannot directly observe individual treatment effects or their variance
  • Methods for exploring HTE:
    • Bounding the variance of treatment effects
    • Testing for the presence of heterogeneity
  • Structured approaches:
    • Treatment-by-Covariate Interactions (Subgroup Analysis / CATEs)
    • Treatment-by-Treatment Interactions (Factorial Designs)
  • Modelling HTE using regression
  • Pitfalls and best practices:
    • Multiple comparisons
    • Causal interpretation challenges

Why look beyond the ATE?

The limits of averages

  • To this point, we’ve focused mainly on estimating the Average Treatment Effect (ATE)
  • Implicit assumption so far: the treatment effect is roughly the same across individuals
  • Reality: Interventions rarely affect everyone identically. Effects often vary across individuals, groups, or contexts
  • Practical Value: Policymakers want to know who benefits most (or least) from an intervention and under what conditions
    • Targeting resources effectively
    • Tailoring programmes
  • Scientific Value: Understanding why effects vary helps uncover causal mechanisms
    • What makes an intervention work (or not)?

Examples of heterogeneity

  • School Choice: Policies allowing school choice might benefit children whose parents prioritise academic quality more than those whose parents prioritise proximity
  • Job Training: Programmes might increase earnings for voluntary participants but have no effect on those required to attend for public assistance eligibility
  • Tax Compliance: Efforts to increase compliance might work differently depending on whether taxpayers can easily conceal income
  • Advertising: A campaign targeting teenagers might alienate older consumers

The fundamental problem of HTE inference

Defining treatment effect heterogeneity

  • Individual treatment effect: \(\tau_i = Y_i(1) - Y_i(0)\)
  • Treatment effect heterogeneity refers to the variance of individual treatment effects across subjects: \(Var(\tau)\)
  • We want to estimate \(Var(\tau)\) and test if \(Var(\tau) > 0\) (indicating heterogeneity)
  • Recall the relationship:

\[Var(\tau) = Var(Y_i(1) - Y_i(0))\]

\[Var(\tau) = Var(Y_i(1)) + Var(Y_i(0)) - 2Cov(Y_i(1), Y_i(0))\]

  • This equation highlights the challenge! Let’s see why… 🤓

Decomposing the variance

  • \(Var(Y(1))\) + \(Var(Y(0))\): Total variance in outcomes if we could see both worlds
  • \(Cov(Y_i(1), Y_i(0))\): The covariance term. This is the key!
    • It asks: Do people who would have high outcomes without the treatment also have high outcomes with it?
  • Why subtract the covariance? It accounts for stable differences between subjects
  • Scenario: high positive covariance (common case)
    • People who are “high-achievers” (\(Y_i(0)\) is high) may also be “high-achievers” after treatment (\(Y_i(1)\) is high)
    • This stable difference inflates \(Var(Y(1))\) and \(Var(Y(0))\) even when the effect itself hardly varies
    • Subtracting \(2Cov(\cdot)\) corrects for this stable variance, isolating the true variance of the effect, \(Var(\tau)\)
  • Example: If the effect is constant (\(\tau_i = \tau\) for all \(i\)), then \(Var(\tau) = 0\): the \(2Cov(Y_i(1), Y_i(0))\) term exactly offsets the sum of the two variance terms, as the short sketch below illustrates
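
A minimal numerical check of this point, using simulated data (not from any study discussed here): when the individual effect is a constant, the decomposition returns (essentially) zero despite large outcome variances.

```python
import numpy as np

rng = np.random.default_rng(385)
y0 = rng.normal(50, 10, size=10_000)  # hypothetical control potential outcomes Y_i(0)
tau = 3.0                             # constant individual effect
y1 = y0 + tau                         # treated potential outcomes Y_i(1)

var_tau = (np.var(y1, ddof=1) + np.var(y0, ddof=1)
           - 2 * np.cov(y0, y1, ddof=1)[0, 1])
print(round(var_tau, 6))  # ~0: large Var(Y(1)) and Var(Y(0)), yet no effect heterogeneity
```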

What experimental data tell us (and what they don’t)

  • Experiments give us samples of \(Y_i(1)\) (from the treatment group) and \(Y_i(0)\) (from the control group)
  • From these, we can estimate the marginal distributions of \(Y(1)\) and \(Y(0)\)
  • We can estimate, for instance, \(E[Y(1)]\), \(E[Y(0)]\), \(Var(Y(1))\), and \(Var(Y(0))\)
  • BUT: We never observe both \(Y_i(1)\) and \(Y_i(0)\) for the same individual \(i\)
  • Therefore, we cannot directly estimate the joint distribution of potential outcomes
  • So we cannot estimate \(Cov(Y_i(1), Y_i(0))\)
  • Without the covariance, we cannot directly calculate \(Var(\tau)\)!

Illustrating the problem

  • Suppose \(N=6\) subjects. We observe outcomes:
    • Control Group (\(Y(0)\)): \(\{1, 2, 3\}\) → \(\hat{Var}(Y(0)) = 1\)
    • Treatment Group (\(Y(1)\)): \(\{4, 5, 6\}\) → \(\hat{Var}(Y(1)) = 1\)
  • Estimated ATE = \(E[Y(1)] - E[Y(0)] = 5 - 2 = 3\)
  • What is \(Var(\tau)\)? It depends on how potential outcomes are paired (which we don’t know):
  • Scenario 1 (perfect positive correlation): Pairs are \(\{(1,4), (2,5), (3,6)\}\)
    • Individual effects \(\tau_i\) are \({3, 3, 3}\)
    • \(Var(\tau) = 0\). Homogeneous effect
  • Scenario 2 (perfect negative correlation): Pairs are \(\{(1,6), (2,5), (3,4)\}\)
    • Individual effects \(\tau_i\) are \({5, 3, 1}\)
    • \(Var(\tau) = Var(\{5,3,1\}) = 4\). Heterogeneous effect
  • Scenario 3 (Mixed): Pairs are \(\{(1,6), (2,4), (3,5)\}\)
    • Individual effects \(\tau_i\) are \({5, 2, 2}\)
    • \(Var(\tau) = Var(\{5,2,2\}) = 3\). Heterogeneous effect
  • Experimental data alone do not distinguish these scenarios!
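
A short sketch that reproduces the three scenarios: the observed outcomes are identical in every case, yet the implied \(Var(\tau)\) ranges from 0 to 4 depending on the (unobservable) pairing.

```python
import numpy as np

y0 = [1, 2, 3]
pairings = {
    "perfect positive": [4, 5, 6],   # (1,4), (2,5), (3,6)
    "perfect negative": [6, 5, 4],   # (1,6), (2,5), (3,4)
    "mixed":            [6, 4, 5],   # (1,6), (2,4), (3,5)
}
for label, y1 in pairings.items():
    tau = np.array(y1) - np.array(y0)
    print(label, "tau =", tau, "Var(tau) =", np.var(tau, ddof=1))
```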

Detecting heterogeneity: bounds & tests

Bounding the variance of treatment effects

  • Since we can’t estimate \(Cov(Y(1), Y(0))\), we can’t pinpoint \(Var(\tau)\)
  • However, we can estimate bounds on \(Var(\tau)\) by considering the most extreme possible correlations (Heckman, Smith, Clements 1997)
  • We can use the same pairing logic from the previous example
  • Procedure (for equal group sizes):
    1. Sort observed \(Y(0)\) values ascendingly
    2. Sort observed \(Y(1)\) values ascendingly
    3. Lower Bound for \(Var(\tau)\): Pair the sorted values rank-by-rank (1st with 1st, 2nd with 2nd, …). Calculate the variance of the resulting \(\tau_i\) estimates. This assumes maximum possible positive covariance
    4. Upper Bound for \(Var(\tau)\): Pair the sorted \(Y(0)\) values (ascending) with sorted \(Y(1)\) values in descending order (1st \(Y(0)\) with last \(Y(1)\), …). Calculate the variance of the resulting \(\tau_i\). This assumes maximum possible negative covariance
  • If the number of subjects differs, pair percentiles instead of ranks
  • A lower bound substantially greater than 0 suggests heterogeneity
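
A hedged sketch of this sort-and-pair procedure (the function name and interface are my own; it assumes equal group sizes and is shown on the toy \(N = 6\) data from the previous slide):

```python
import numpy as np

def var_tau_bounds(y_control, y_treated):
    y0 = np.sort(np.asarray(y_control, dtype=float))
    y1 = np.sort(np.asarray(y_treated, dtype=float))
    lower = np.var(y1 - y0, ddof=1)        # rank-with-rank pairing: maximal positive covariance
    upper = np.var(y1[::-1] - y0, ddof=1)  # rank-with-reversed-rank: maximal negative covariance
    return lower, upper

print(var_tau_bounds([1, 2, 3], [4, 5, 6]))  # (0.0, 4.0): matches Scenarios 1 and 2 above
```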

Testing for heterogeneity: comparing variances

  • A simpler approach: Test the null hypothesis \(H_0: Var(\tau) = 0\) (homogeneous effects)
  • If \(\tau_i = \tau\) (constant) for all \(i\), then \(Var(\tau)=0\)
  • Also, if \(\tau\) is constant, \(Cov(Y(0), \tau) = Cov(Y(0), \text{constant}) = 0\)
  • Recall: \(Var(Y(1)) = Var(Y(0)) + Var(\tau) + 2Cov(Y(0), \tau)\)
  • Under \(H_0: Var(\tau)=0\), this simplifies to \(Var(Y(1)) = Var(Y(0))\)
  • Test Idea: If we observe significantly different variances in the treatment and control groups, we can reject the null hypothesis of homogeneous effects
  • Caution: Equality of variances does not strictly prove homogeneity, but large differences suggest heterogeneity

Example: teacher incentives variance test

  • Revisit the teacher incentives experiment (Muralidharan & Sundararaman 2011)
  • Outcome: Change in school test scores (post-pre)
  • Observed variances:
    • Control group: \(\hat{Var}(Y(0)) = 59.29\)
    • Treatment group: \(\hat{Var}(Y(1)) = 91.20\)
  • Observed difference: \(|91.20 - 59.29| = 31.91\); is this difference large enough to reject \(H_0\)?
  • Randomisation Inference:
    1. Assume constant treatment effect (ATE = 3.50) for all schools (generate full potential outcomes schedule)
    2. Repeat random assignment to treatment/control 100,000 times
    3. For each simulation, calculate the absolute difference in variances between the simulated T/C groups
    4. Find the proportion of simulated differences >= 31.91
  • Result: p-value = 0.088
  • Conclusion: Cannot reject \(H_0\) at \(\alpha=0.05\), but the borderline p-value hints at possible heterogeneity. Incentives might affect schools differently
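
A sketch of the randomisation-inference recipe above. The function and variable names are placeholders; the school-level data are not reproduced here, so treat this as an illustration of the procedure rather than a replication.

```python
import numpy as np

def ri_variance_test(y_obs, treated, ate, n_sims=100_000, seed=385):
    """Randomisation p-value for |Var(Y_treated) - Var(Y_control)| under a constant-effect null."""
    rng = np.random.default_rng(seed)
    y_obs = np.asarray(y_obs, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    observed = abs(np.var(y_obs[treated], ddof=1) - np.var(y_obs[~treated], ddof=1))

    # Step 1: build the full potential-outcome schedule implied by a constant effect
    y0_full = np.where(treated, y_obs - ate, y_obs)
    y1_full = y0_full + ate

    # Steps 2-4: re-randomise assignment, recompute the variance gap, compare with observed
    exceed = 0
    for _ in range(n_sims):
        z = rng.permutation(treated)
        gap = abs(np.var(y1_full[z], ddof=1) - np.var(y0_full[~z], ddof=1))
        exceed += gap >= observed
    return exceed / n_sims
```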

Limitations of basic methods

  • Bounds on \(Var(\tau)\) are often very wide, making them uninformative
  • Tests comparing \(Var(Y(1))\) and \(Var(Y(0))\) often lack statistical power, especially with smaller samples
  • These methods don’t tell us why effects might vary or who is affected differently
  • They serve as a preliminary step. Significant results encourage more structured investigation

➡️ Need approaches that examine variation across specific subgroups or conditions

Treatment-by-Covariate Interactions (CATEs) 📊

Conditional Average Treatment Effects (CATEs)

  • We can partition subjects into subgroups based on pre-treatment covariates (\(X\)) and estimate the ATE within each subgroup
  • CATE: Conditional Average Treatment Effect
    • \(CATE(X=x) = E[Y_i(1) - Y_i(0) | X_i = x]\)
  • Example: Estimate the ATE for men vs. women, old vs. young, high income vs. low income
  • Interaction Effect: The difference between CATEs across subgroups
    • \(Interaction = CATE(X=x_1) - CATE(X=x_2)\)
  • This approach uses observable characteristics to explore potential sources of heterogeneity
  • Often guided by theory about why effects might differ
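
As a quick illustration, subgroup CATEs are just within-group differences in means. The data frame and column names below are hypothetical placeholders:

```python
import pandas as pd

def cate_by_group(df, outcome="y", treat="treat", covariate="x"):
    # mean outcome for each (covariate level, treatment arm) cell
    means = df.groupby([covariate, treat])[outcome].mean().unstack(treat)
    return means[1] - means[0]  # CATE(X = x): treated minus control mean, per level of x

# With a binary covariate, the interaction is simply the difference of the two CATEs:
# cates = cate_by_group(df); interaction = cates.loc[1] - cates.loc[0]
```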

Example: teacher incentives & parent literacy

  • Researchers explored if teacher incentive effects varied by school characteristics (Muralidharan & Sundararaman 2011)
  • Covariate: Average parent literacy level (pre-treatment). Partitioned schools into below-median and above-median literacy
  • Estimated CATEs:
    • Low Literacy Schools: CATE = 11.14 - 7.83 = 3.31
    • High Literacy Schools: CATE = 12.26 - 8.57 = 3.69
  • Interaction Effect: 3.69 - 3.31 = 0.38
  • Hypothesis Test: Is this difference statistically significant?
    • Using randomisation inference (or regression F-test), p = 0.88
    • No significant evidence that the treatment effect differs based on parental literacy levels in the school

Regression framework for CATEs

  • Regression provides a flexible way to estimate and test CATEs and interactions
  • Let \(I_i\) be the treatment indicator (1=Treat, 0=Control)
  • Let \(P_i\) be the covariate indicator (e.g., 1=High Literacy, 0=Low Literacy)
  • Null Model (Homogeneous Effect): \(Y_i = \alpha + \beta I_i + \gamma P_i + u_i\)
    • Assumes a single ATE (\(\beta\))
  • Alternative Model (Interaction): \(Y_i = \alpha + \beta I_i + \gamma P_i + \delta (I_i \times P_i) + u_i\)
    • CATE for Low Literacy (\(P_i=0\)): \(\beta\)
    • CATE for High Literacy (\(P_i=1\)): \(\beta + \delta\)
    • Interaction Effect: \(\delta\)
  • Use an F-test to see if the model with the interaction term (\(\delta\)) fits significantly better than the null model (i.e., test \(H_0: \delta = 0\))
  • In the teacher incentive example, the F-test yields \(p = 0.88\), matching the randomisation inference
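
A sketch of this null-vs-interaction comparison with statsmodels, using simulated data (the coefficients and sample size below are illustrative, not the Muralidharan & Sundararaman estimates):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(385)
n = 400
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),    # randomised treatment indicator I_i
    "highlit": rng.integers(0, 2, n),  # covariate indicator P_i
})
# Simulated outcome with a homogeneous effect, so the interaction test should rarely reject
df["y"] = 8 + 3.5 * df["treat"] + 1.0 * df["highlit"] + rng.normal(0, 8, n)

null_model = smf.ols("y ~ treat + highlit", data=df).fit()                   # single ATE (beta)
full_model = smf.ols("y ~ treat + highlit + treat:highlit", data=df).fit()   # adds delta

print(anova_lm(null_model, full_model))  # F-test of H0: delta = 0
print(full_model.params)                 # beta = CATE(P=0); beta + delta = CATE(P=1)
```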

Cautions with CATEs ⚠️

The multiple comparisons problem

  • Datasets often contain many potential covariates (age, gender, income, education, location, etc.)
  • If we test for interactions with many covariates, the chance of finding a significant interaction purely by chance increases
  • Example: Test 20 uncorrelated covariates for interaction at \(\alpha=0.05\). Assume the true effect is homogeneous
    • \((1 - \alpha)^q\) gives the probability of no false positives across \(q\) tests, so \(1 - (1 - \alpha)^q\) is the probability of at least one false positive
    • Probability of finding at least one significant interaction: \(1 - (1 - 0.05)^{20} \approx 0.64\). (High risk of Type I error!)
  • Solution 1: Bonferroni Correction:
    • If conducting \(q\) tests, adjust significance level to \(\alpha / q\)
    • E.g., for 20 tests at \(\alpha=0.05\), require \(p < 0.05 / 20 = 0.0025\)
    • Simple, but can be overly conservative (reduces power)
    • Other similar solutions exist (e.g., False Discovery Rate control, Holm-Bonferroni, etc.)
  • Solution 2: Pre-specification:
    • Specify the few theoretically motivated interactions you will test before looking at the data (e.g., in a Pre-Analysis Plan)
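
A two-line check of the arithmetic above:

```python
# Family-wise false-positive risk with q independent tests at level alpha,
# and the corresponding Bonferroni-adjusted per-test threshold
alpha, q = 0.05, 20
print(1 - (1 - alpha) ** q)  # ~0.64: chance of at least one spurious "significant" interaction
print(alpha / q)             # 0.0025: Bonferroni threshold
```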

Correlation vs. causation in subgroups

  • Covariates used for subgroup analysis are observed characteristics, not randomly assigned treatments
  • Finding that CATEs differ between, say, high-education and low-education groups means the treatment effect correlates with education
  • It does not necessarily mean that changing someone’s education level would change how they respond to the treatment
  • Education might be a marker for other underlying factors (e.g., income, access to information, underlying health) that are the “true” drivers of the heterogeneity
  • Subgroup analysis is essentially observational/descriptive regarding the source of heterogeneity
  • It’s useful for prediction (who is likely to respond more?) and generating hypotheses, but not for establishing the causal impact of the covariate itself on the treatment effect

Example: interpreting voter turnout interaction

  • Gerber, Green, Larimer (2008) sent mailers showing recipients’ own (and neighbours’) past voting records to encourage turnout
  • Finding: Mailer effect (ATE) was larger among people who had voted in a previous election (2004) compared to those who hadn’t
  • Superficial interpretation: Seeing your past voting record is more motivating than seeing your past non-voting record
  • However, past voting behaviour (the covariate) is not random. People who voted in 2004 are different from those who didn’t in many ways (more politically engaged, different demographics, etc.)
  • Follow-up Experiment (Gerber, Green, Larimer 2010): Randomly varied whether the mailer showed a record of voting or a record of abstention
  • Result: Showing a record of abstention was significantly more motivating!
  • The subgroup analysis conflated the type of person with the content of the message. Direct experimental manipulation was needed to isolate the causal effect of the message content

Treatment-by-Treatment Interactions (Factorial Designs) 🧪

Beyond covariates: manipulating multiple factors

  • To make causal claims about why treatment effects vary, we need to experimentally manipulate the hypothesised moderating factors
  • This is done using factorial experimental designs
  • An experimental design that includes two or more “factors” (independent variables), where subjects are randomly assigned to a combination of the levels of these factors
  • Example: Factor A (Treatment vs. Control), Factor B (Context 1 vs. Context 2)
  • Subjects randomly assigned to one of four groups: (Treat, Cxt 1), (Treat, Cxt 2), (Control, Cxt 1), (Control, Cxt 2)
  • Allows us to estimate:
    • Main effect of Factor A
    • Main effect of Factor B
    • Interaction effect: Does the effect of Factor A depend on the level of Factor B?
  • This provides stronger causal evidence about moderators than treatment-by-covariate analysis
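
A minimal sketch of complete random assignment to the four cells of a 2x2 factorial (sample size and factor names are illustrative, not tied to any study above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(385)
n = 400  # divisible by 4 so each cell gets n/4 subjects

# the four (Factor A, Factor B) combinations, each repeated n/4 times, then shuffled
cells = [(a, b) for a in (0, 1) for b in (0, 1)]
order = rng.permutation(np.repeat(np.arange(4), n // 4))
design = pd.DataFrame([cells[c] for c in order], columns=["factor_a", "factor_b"])

print(design.value_counts())  # 100 subjects assigned to each of the four cells
```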

Examples of factorial designs

  • Gerber & Green (2000) Voter Mobilisation: Crossed multiple communication methods (Factor 1: Canvassing Y/N, Factor 2: Phone Call Y/N, Factor 3: Direct Mail Y/N) to see if effects were additive or interactive
  • Olken (2007) Corruption Monitoring: Crossed top-down audits (Factor 1: Audit Y/N) with bottom-up community monitoring (Factor 2: Invitations to meetings Y/N) in Indonesian road projects. Tested if audits were more effective when community was involved
  • Rosen (2010) Discrimination: Crossed putative email sender ethnicity (Factor 1: Hispanic/Non-Hispanic name) with email grammar quality (Factor 2: Good/Bad grammar) sent to state legislators. Tested if ethnic discrimination depended on perceived social class (signalled by grammar)

Rosen (2010) example: ethnicity and grammar

  • Do U.S. state legislators respond differently to constituent emails based on perceived ethnicity and writing quality?

  • 2x2 Factorial Design:

    • Factor 1: Sender Name (Colin Smith vs. José Ramirez)
    • Factor 2: Email Grammar (Good vs. Bad)
  • Outcome: Percentage of emails receiving a reply

  • Results (Subset, N=100 per cell):

                           Colin Smith (Non-Hispanic)   José Ramirez (Hispanic)   Difference (José - Colin)
Good Grammar                          52%                        37%                       -15%
Bad Grammar                           29%                        34%                        +5%
Difference (Bad - Good)              -23%                        -3%
  • Interpretation:
    • With good grammar, Colin gets more replies (-15% effect for José)
    • With bad grammar, José gets slightly more replies (+5% effect)
    • The effect of ethnicity depends on grammar quality (Interaction)
    • Poor grammar hurts Colin much more (-23%) than José (-3%)

Advantages and caveats of factorial designs

  • Advantages:
    • Allows causal estimation of interaction effects
    • Efficient way to study multiple factors simultaneously
    • Can reveal complex relationships (e.g., conditions where a treatment is effective/ineffective)
  • Caveats:
    • Require larger sample sizes to detect interactions (interactions are often smaller than main effects)
    • Can become complex logistically with many factors/levels
    • Non-compliance can be problematic: estimating effects for those receiving multiple treatments requires sufficient compliers in those specific cells, which can be rare

Modelling HTE with regression 📈

Extending the regression framework

  • Regression is a powerful tool for modelling both treatment-by-covariate and treatment-by-treatment interactions. And you already know how to use it!
  • Systematic Heterogeneity: Variation in \(\tau\) that can be predicted by observed covariates or experimental factors (modelled by interaction terms)
  • Idiosyncratic Heterogeneity: Residual variation in \(\tau\) not explained by the model (part of the error term \(u_i\))
  • We use interaction terms to capture systematic heterogeneity

Modelling treatment-by-treatment interaction

  • Let \(J_i = 1\) if sender is José, 0 if Colin
  • Let \(G_i = 1\) if grammar is bad, 0 if good
  • Model: \(Y_i = \alpha + \beta J_i + \gamma G_i + \delta (J_i \times G_i) + u_i\)
  • Interpreting Coefficients:
    • \(\alpha\): Mean outcome for baseline (Colin, Good Grammar) = 52%
    • \(\beta\): Effect of José vs. Colin, when grammar is good = 37% - 52% = -15%
    • \(\gamma\): Effect of Bad vs. Good grammar, when sender is Colin = 29% - 52% = -23%
    • \(\delta\): Interaction effect. How the effect of being José (vs. Colin) changes when grammar goes from good to bad:
      • Effect of José (Good Grammar) = \((\alpha + \beta) - \alpha = \beta\)
      • Effect of José (Bad Grammar) = \((\alpha + \beta + \gamma + \delta) - (\alpha + \gamma) = \beta + \delta\)
      • Difference = \(\delta\)
      • Equivalently: \(\delta\) = (Effect of Bad Grammar for José) - (Effect of Bad Grammar for Colin) = (-3%) - (-23%) = +20%
  • Estimates for Rosen data: \(\hat{Y}_i = 0.52 - 0.15 J_i - 0.23 G_i + 0.20 (J_i \times G_i)\)
  • Testing the Interaction: F-test for \(H_0: \delta = 0\). In Rosen’s data, p = 0.037. Significant interaction
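
The four coefficients can be recovered directly from the cell means in the table above; a quick check:

```python
# Cell means (reply proportions) from the Rosen table shown earlier
colin_good, jose_good = 0.52, 0.37
colin_bad, jose_bad = 0.29, 0.34

alpha = colin_good                     # baseline: Colin, good grammar
beta = jose_good - colin_good          # Jose vs. Colin under good grammar
gamma = colin_bad - colin_good         # bad vs. good grammar for Colin
delta = (jose_bad - colin_bad) - beta  # interaction

print(round(alpha, 2), round(beta, 2), round(gamma, 2), round(delta, 2))  # 0.52 -0.15 -0.23 0.2
```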

Automating the search for interactions

  • Manually specifying and testing interactions becomes infeasible with many covariates/factors
  • Machine Learning Approaches: Algorithms designed to automatically search for interactions
    • E.g., causal trees (Athey & Imbens 2016) and generalised random forests (Athey, Tibshirani & Wager 2019)
    • Recursively partition the data based on covariates to find subgroups with different treatment effects
    • Often use techniques like cross-validation to avoid overfitting and false discoveries
  • Can be a great exploratory tool, especially in high-dimensional settings
  • Still requires careful interpretation and follow-up validation
  • Beyond the scope of this lecture, but good to be aware of! 🤓
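
For a flavour of what such methods do, here is a much simpler exploratory sketch (a "T-learner" with off-the-shelf random forests, not the generalised random forest algorithm itself): fit separate outcome models for treated and control units and difference their predictions. Inputs are placeholders, and any subgroups it suggests should be validated on held-out data or in a follow-up experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def t_learner_effects(X, y, treat, seed=385):
    treat = np.asarray(treat, dtype=bool)
    m1 = RandomForestRegressor(random_state=seed).fit(X[treat], y[treat])    # model of E[Y | X, treated]
    m0 = RandomForestRegressor(random_state=seed).fit(X[~treat], y[~treat])  # model of E[Y | X, control]
    return m1.predict(X) - m0.predict(X)  # estimated unit-level effects (exploratory only)
```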

Summary

Investigating heterogeneous treatment effects

  • ATEs provide an incomplete picture; treatment effects often vary systematically
  • Fundamental challenge: \(Cov(Y(1), Y(0))\) is unidentified, so \(Var(\tau)\) cannot be directly estimated from experimental data alone
  • Preliminary checks:
    • Bounding \(Var(\tau)\): Often yields wide, uninformative bounds
    • Comparing \(Var(Y(1))\) and \(Var(Y(0))\): Low power test for \(H_0: Var(\tau)=0\).
  • Structured approaches are needed for deeper insights
  • Treatment-by-Covariate (Subgroup/CATEs):
    • Estimates effects within subgroups defined by pre-treatment X
    • Useful for description and prediction
    • Interpretation caution: Correlational regarding source of HTE; risk of multiple comparisons
  • Treatment-by-Treatment (Factorial Designs):
    • Experimentally manipulate multiple factors
    • Allows causal inference about interactions
    • More complex design, requires larger N
  • Regression Modelling: Flexible tool for estimating and testing interactions
  • Best Practice: Pre-specify theoretically motivated interactions to test; treat exploratory findings cautiously pending replication

And that’s all for today! 🎉

See you next time! 😉

Appendix 01: why \(- 2Cov\)?

The 2 comes from the algebraic expansion of a squared difference. More formally:

Let \(X = Y_i(1)\) and \(Y = Y_i(0)\). We want \(Var(X - Y)\).

  1. Start with the definition of Variance:

\(Var(Z) = E[ (Z - E[Z])^2 ]\)

So, \(Var(X - Y) = E[ ( (X - Y) - E[X - Y] )^2 ]\)

  2. Rearrange the terms inside:

\(Var(X - Y) = E[ ( (X - E[X]) - (Y - E[Y]) )^2 ]\)

  3. Expand the square (like \((a - b)^2 = a^2 - 2ab + b^2\)):

Let \(a = (X - E[X])\) and \(b = (Y - E[Y])\).

\(Var(X - Y) = E[ (X-E[X])^2 \mathbf{- 2}(X-E[X])(Y-E[Y]) + (Y-E[Y])^2 ]\)

  4. Distribute the Expectation ( \(E[A - B + C] = E[A] - E[B] + E[C]\) ):

\(Var(X-Y) = E[(X-E[X])^2] \mathbf{- 2}E[(X-E[X])(Y-E[Y])] + E[(Y-E[Y])^2]\)

  5. Recognise the definitions of Variance and Covariance:

\(E[(X-E[X])^2] = Var(X)\)

\(E[(Y-E[Y])^2] = Var(Y)\)

\(E[(X-E[X])(Y-E[Y])] = Cov(X, Y)\)

  6. Substitute back:

\(Var(X - Y) = Var(X) - 2Cov(X, Y) + Var(Y)\)

Since \(X = Y_i(1)\), \(Y = Y_i(0)\), and \(\tau = X - Y\), this gives:

\(Var(\tau) = Var(Y_i(1)) + Var(Y_i(0)) - 2Cov(Y_i(1), Y_i(0))\)