QTM 385 - Experimental Methods

Lecture 25 - Course Revision

Danilo Freire

Emory University

Welcome back! 🤓
Course revision session 📚

Today’s goal: connecting the dots

  • We have covered a lot of ground in experimental methods this semester! 🥳
  • Today, we will review some key concepts and methods from the course
  • The aim is to see how different topics link together
  • And feel free to ask me questions about the group project, or anything else! 😉

Foundations 🏗️

The research design process (Lec 02)

  • Good research questions produce knowledge people care about, help solve practical problems, or inform policy
  • Research questions should be clear, specific, and answerable
  • No experiment is theory-free, even if not explicitly stated
  • Operationalisation involves translating abstract concepts (e.g., social isolation) into measurable variables (e.g., frequency of social interactions)
    • Construct validity ensures the measure accurately reflects the concept
  • Credible designs yield practical answers, are transparent via pre-registration (PAPs), and are replicable

The MIDA framework (Lec 02 cont)

  • The MIDA framework provides a structure for declaring and diagnosing any research design:
    • Model: Assumptions about how the world works (potential outcomes, relationships)
    • Inquiry: The specific question (estimand) we want to answer (e.g., the ATE)
    • Data Strategy: How data are generated (sampling, treatment assignment)
    • Answer Strategy: The estimator used to answer the inquiry from the data (e.g., difference-in-means, regression)
  • Using MIDA in code (with DeclareDesign) allows simulating the design to understand its properties (bias, power, etc.) before implementation; see the sketch below

Source: DeclareDesign
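
A minimal sketch of how such a design could be declared and diagnosed in code (the N, effect size, and variable names are illustrative assumptions; argument names follow recent DeclareDesign releases):

    library(DeclareDesign)

    design <-
      # M: model -- 500 units, noise U, and a hypothesised potential-outcomes function
      declare_model(
        N = 500,
        U = rnorm(N),
        potential_outcomes(Y ~ 0.25 * Z + U)   # assumed true effect of 0.25
      ) +
      # I: inquiry -- the ATE
      declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
      # D: data strategy -- sample 250 units, assign half of them to treatment
      declare_sampling(S = complete_rs(N, n = 250)) +
      declare_assignment(Z = complete_ra(N, prob = 0.5)) +
      declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
      # A: answer strategy -- difference-in-means via lm_robust
      declare_estimator(Y ~ Z, .method = lm_robust, inquiry = "ATE")

    diagnose_design(design)   # simulates the design: bias, power, coverage, etc.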

Potential outcomes & causality (Lec 03)

  • The Potential Outcomes (PO) framework gives us the formal language for defining causal effects
    • For each unit \(i\), there’s an outcome if treated (\(Y_i(1)\)) and an outcome if untreated (\(Y_i(0)\))
    • The individual treatment effect is \(\tau_i = Y_i(1) - Y_i(0)\)
  • The Fundamental Problem of Causal Inference states we only observe one potential outcome per unit (\(Y_i = Y_i(1)Z_i + Y_i(0)(1-Z_i)\))
  • Causality is inherently a missing data problem
  • Our goal is often to estimate population averages, like the Average Treatment Effect (ATE): \(ATE = E[Y_i(1) - Y_i(0)]\) (a tiny simulation follows below)
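
A tiny simulation (with made-up numbers) makes the fundamental problem concrete: the true ATE is computable only because we simulate both potential outcomes, while each unit reveals just one of them:

    set.seed(385)
    n  <- 1000
    Y0 <- rnorm(n, mean = 10)            # potential outcome if untreated
    Y1 <- Y0 + 2                         # potential outcome if treated (true tau_i = 2)
    Z  <- rbinom(n, 1, 0.5)              # random assignment
    Y  <- Z * Y1 + (1 - Z) * Y0          # switching equation: only one PO is observed

    mean(Y1 - Y0)                        # true ATE (knowable only in simulation)
    mean(Y[Z == 1]) - mean(Y[Z == 0])    # difference-in-means estimate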

Selection bias & SUTVA (Lec 03 cont)

  • Selection bias arises when comparing groups that differ systematically before treatment (e.g., sicker people choosing hospitals), biasing simple comparisons
  • Randomisation breaks the link between potential outcomes and treatment receipt, making groups comparable on average (in expectation) on all pre-treatment characteristics
  • This allows unbiased ATE estimation via difference-in-means
  • SUTVA (Stable Unit Treatment Value Assumption):
    • No interference between units (one unit’s treatment doesn’t affect another’s outcome)
    • Consistent treatment value (the treatment is the same for all units receiving it)

  • Three causal structures to keep in mind:
  1. Confounding: Underlying illness (\(Z\)) causes both taking a specific drug (\(X\)) and experiencing a bad outcome (\(Y\))
  2. Mediation: Smoking (\(X\)) causes lung damage (\(Z\)), which then leads to breathing difficulties (\(Y\))
  3. Collision: A student’s talent (\(X\)) influences their grades (\(Y\)); both talent (\(X\)) and grades (\(Y\)) influence getting a scholarship (\(Z\)) – selecting only scholarship winners (\(Z\)) distorts the observed talent-grade relationship

Selection bias (Lec 04)

  1. Start with the observed difference in means: \(E[Y_i|Z_i=1] - E[Y_i|Z_i=0]\)
  2. Substitute observed outcomes with potential outcomes based on treatment status: \(= E[Y_i(1)|Z_i=1] - E[Y_i(0)|Z_i=0]\)
  3. Add and subtract the counterfactual for the treated group (\(E[Y_i(0)|Z_i=1]\)): \(= E[Y_i(1)|Z_i=1] - E[Y_i(0)|Z_i=1] + E[Y_i(0)|Z_i=1] - E[Y_i(0)|Z_i=0]\)
  4. Group the terms: \(= \{ E[Y_i(1)|Z_i=1] - E[Y_i(0)|Z_i=1] \} + \{ E[Y_i(0)|Z_i=1] - E[Y_i(0)|Z_i=0] \}\)
  5. Identify the components: \(= ATT + \text{Selection Bias}\)
    • Where ATT is the Average Treatment Effect on the Treated
    • And Selection Bias is \(\{ E[Y_i(0)|Z_i=1] - E[Y_i(0)|Z_i=0] \}\), which is the difference in the potential untreated outcomes between those who selected treatment and those who did not
  6. Under randomisation, \(ATT = ATE\), and the selection bias term averages to zero (a simulation sketch follows after this list)
  • Balance tests check pre-treatment covariate balance; useful diagnostic, but doesn’t guarantee balance on unobservables
  • Beware common biases like attrition bias, survivorship bias, and post-treatment bias (controlling for mediators)
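
A short simulation (hypothetical health example with assumed numbers) shows the decomposition in action: when sicker people select into treatment, the naive comparison absorbs the selection-bias term, while randomisation removes it:

    set.seed(385)
    n    <- 10000
    sick <- rnorm(n)                     # pre-treatment illness severity
    Y0   <- -sick + rnorm(n)             # sicker people have worse untreated outcomes
    Y1   <- Y0 + 1                       # true treatment effect = 1

    # Self-selection: sicker people are more likely to take the treatment
    Z_self <- rbinom(n, 1, plogis(sick))
    Y_self <- Z_self * Y1 + (1 - Z_self) * Y0
    mean(Y_self[Z_self == 1]) - mean(Y_self[Z_self == 0])   # biased: ATT + negative selection bias

    # Randomisation: groups are comparable in expectation, so the estimate is unbiased
    Z_rand <- rbinom(n, 1, 0.5)
    Y_rand <- Z_rand * Y1 + (1 - Z_rand) * Y0
    mean(Y_rand[Z_rand == 1]) - mean(Y_rand[Z_rand == 0])    # close to 1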

Hypothesis testing: Neyman vs Fisher (Lec 05)

Neyman Approach (ATE)

  • Focuses on estimating the average effect in the population
  • Tests hypotheses like \(H_0: ATE = 0\) vs \(H_a: ATE \neq 0\)
  • Uses test statistics (e.g., \(t\)-stat = estimate / SE) and \(p\)-values; rejects \(H_0\) if \(p\)-value < \(\alpha\)
  • Confidence Intervals provide a range of plausible values for the ATE
  • Relies on large sample approximations (Central Limit Theorem)
  • Considers Type I (\(\alpha\)) and Type II (\(\beta\)) errors; Power = 1 - \(\beta\)

Fisher Approach (Randomisation Inference)

  • Uses the random assignment process itself as the basis for inference
  • Tests the sharp null hypothesis (\(H_0: Y_i(1) = Y_i(0)\) for all \(i\))
  • Simulates all possible random assignments under \(H_0\) to build a reference distribution
  • The \(p\)-value is the proportion of simulated statistics as extreme as the observed one
  • Requires fewer assumptions (no normality) and yields exact \(p\)-values; good for small samples (e.g., via the ri2 package in R)

Randomisation inference details (Lec 05 cont)

  • The core idea of RI is to ask: “Assuming the treatment had absolutely no effect on anyone (the sharp null), how likely were we to get a difference-in-means as large as the one we actually observed, just by the random chance of assignment?”
  • We generate the randomisation distribution by:
    • Assuming \(H_0\) is true, so \(Y_i(1)=Y_i(0)=Y_i^{obs}\) for all \(i\)
    • Recalculating the difference-in-means (or other test statistic) for many (or all) possible ways the units could have been randomly assigned to \(Z=1\) and \(Z=0\)
    • Plotting these simulated differences
  • The \(p\)-value is the fraction of simulated differences that are as large or larger in magnitude than our actual observed difference
  • This avoids the distributional assumptions about outcomes that t-tests require (a bare-bones permutation sketch follows below)
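
A bare-bones permutation version of this procedure (the ri2 package automates and generalises it; the toy data below are assumed):

    set.seed(385)
    Z <- rep(c(0, 1), each = 20)               # actual assignment
    Y <- rnorm(40) + 0.5 * Z                   # observed outcomes

    obs_diff <- mean(Y[Z == 1]) - mean(Y[Z == 0])

    # Under the sharp null the outcomes are fixed; only the assignment could have differed
    sims <- replicate(10000, {
      Z_sim <- sample(Z)                       # shuffle the assignment labels
      mean(Y[Z_sim == 1]) - mean(Y[Z_sim == 0])
    })

    p_value <- mean(abs(sims) >= abs(obs_diff))   # two-sided RI p-value
    p_value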

Key experimental findings (Lec 06)

  • Discussed influential studies applying experimental methods:
    • Kalla & Broockman (2015): Used a field experiment (blocked randomisation) to show revealing donor status significantly increased political access to US congressional officials
    • Bertrand & Mullainathan (2004): Employed a correspondence study (field experiment) randomising names on CVs, finding significant callback gaps favouring White-sounding names in the US labour market
    • Chattopadhyay & Duflo (2004): Leveraged a natural experiment (randomised council seat reservations in India) showing female leaders prioritised different public goods (water vs roads) compared to male leaders

Design Challenges & Solutions 🛠️

Blocking and clustering (Lec 07 & 08)

Blocking

  • Group units by pre-treatment covariates (\(X\)) related to outcome (\(Y\)); randomise within blocks
  • Increases precision (removes between-block variance), ensures balance on \(X\)
  • Include block fixed effects or use interaction estimators (estimatr::lm_lin)

Clustering

  • Treatment assigned at group level (village, school); outcomes measured at individual level
  • Often necessary due to practical constraints or spillovers
  • Challenge: Intra-Cluster Correlation (ICC) violates independence (\(\rho = \frac{\sigma^2_{between}}{\sigma^2_{between} + \sigma^2_{within}}\))

Clustering Consequences & Power

  • Use Cluster-Robust Standard Errors (CRSE) (estimatr::lm_robust(..., clusters = ...)), requires sufficient clusters
  • Power analysis must account for ICC; driven more by number of clusters
  • Improve designs via pair-matching or blocking at the cluster level (see the estimatr sketch below)
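
A sketch with estimatr, assuming a data frame dat with outcome Y, treatment Z, a pre-treatment covariate X, a blocking variable block, and cluster IDs village (all names hypothetical):

    library(estimatr)

    # Blocked design: block fixed effects, or the Lin interaction estimator
    m_fe  <- lm_robust(Y ~ Z + factor(block), data = dat)
    m_lin <- lm_lin(Y ~ Z, covariates = ~ X + block, data = dat)

    # Clustered design: cluster-robust standard errors at the level of assignment
    m_cl  <- lm_robust(Y ~ Z, clusters = village, data = dat)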

Power analysis principles (Lec 08 cont)

  • Statistical power is the probability of correctly rejecting a false null hypothesis (detecting a true effect)
  • Conventionally aim for power \(\ge 0.80\)
  • Power depends on:
    • Effect size: Larger effects are easier to detect
    • Sample size (N): Larger N increases power
    • Significance level (\(\alpha\)): Lower \(\alpha\) reduces power
    • Outcome variance (\(\sigma^2\)): Lower variance increases power
    • Proportion treated: Power is maximised with equal group sizes (50/50 split)
    • Design features: Blocking increases power; clustering decreases power (account for the effective sample size, \(n_{ESS}\))
  • Conduct power analysis before the experiment using tools like DeclareDesign or power calculators, making reasonable assumptions about effect size and variance (a quick example follows below)
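
A quick illustration with base R's power.t.test plus the usual cluster design-effect adjustment (the effect size, ICC, and cluster size are assumed for illustration):

    # Per-arm n needed to detect a 0.3 SD effect with 80% power at alpha = 0.05
    pt <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80)
    pt$n

    # Clustering inflates the required n via the design effect
    icc  <- 0.05                       # assumed intra-cluster correlation
    m    <- 20                         # assumed individuals per cluster
    deff <- 1 + (m - 1) * icc          # design effect
    ceiling(pt$n * deff)               # approximate per-arm n once clustering is accounted for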

One-sided non-compliance (Lec 09)

  • Some assigned to treatment (\(Z=1\)) don’t receive it (\(D=0\)), but control compliance (\(Z=0 \implies D=0\)) is perfect
  • Compliers (\(D_i(1)=1, D_i(0)=0\)) and Never-takers (\(D_i(1)=0, D_i(0)=0\))
  • Intent-to-Treat (ITT) effect (\(E[Y|Z=1] - E[Y|Z=0]\)) estimates the effect of assignment; it’s unbiased but diluted
  • Complier Average Causal Effect (CACE/LATE) (\(E[Y_i(1) - Y_i(0) | D_i(1)>D_i(0)]\)) estimates the effect of treatment on compliers
  • Estimation uses Instrumental Variables (IV) / 2SLS, with assignment (\(Z\)) as instrument for treatment receipt (\(D\))
  • \(CACE = ITT_Y / ITT_D\)
  • Requires the relevance, exclusion, and independence assumptions (estimation sketch below, after the compliance table)
Compliance types

                 Dᵢ(0) = 0       Dᵢ(0) = 1
  Dᵢ(1) = 0     never-taker      defier
  Dᵢ(1) = 1     complier         always-taker
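
A sketch of the CACE/LATE estimation, assuming a data frame dat with outcome Y, assignment Z, and treatment received D (names hypothetical); the Wald ratio and 2SLS via estimatr::iv_robust coincide in this simple case:

    library(estimatr)

    # ITT effects of assignment on the outcome and on treatment receipt
    itt_y <- coef(lm_robust(Y ~ Z, data = dat))["Z"]
    itt_d <- coef(lm_robust(D ~ Z, data = dat))["Z"]   # first stage / compliance rate

    itt_y / itt_d                                      # Wald estimator: CACE = ITT_Y / ITT_D

    # Equivalent 2SLS: instrument treatment received D with assignment Z
    iv_robust(Y ~ D | Z, data = dat)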

Two-sided non-compliance (Lec 10)

  • Non-compliance occurs in both arms: some \(Z=1\) don’t get \(D=1\); some \(Z=0\) do get \(D=1\) (e.g., the control group finds alternative access)
  • Adds potential for Always-takers (\(D_i(1)=1, D_i(0)=1\)) and Defiers (\(D_i(1)=0, D_i(0)=1\))
  • Observed groups become mixtures of compliance types
  • Requires the Monotonicity Assumption (assume no Defiers) to identify CACE; this implies \(D_i(1) \ge D_i(0)\) for all \(i\)
  • Estimation still uses IV/2SLS (\(CACE = ITT_Y / ITT_D\))
  • Always-takers don’t bias the IV estimate under monotonicity

Source: Facure (2022)

Attrition: missing outcome data (Lec 11)

  • Attrition involves missing outcome data post-randomisation (e.g., participants drop out)
  • Bias occurs if attrition is non-random (differential attrition related to treatment or potential outcomes)
  • Handling Options:
    • Assume MCAR (unlikely); analyse complete cases (reduces power)
    • Assume MAR / Conditional Ignorability (\(MIPO|X\)): Missingness depends only on observed pre-treatment \(X\); use Inverse Probability Weighting (IPW) to upweight observed units similar to missing ones (a short IPW sketch follows after this list)
    • Assume MNAR: Missingness depends on unobservables; use Bounds Analysis to estimate range of possible ATEs under worst-case (Manski bounds) or monotonicity assumptions (Lee bounds)
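
A minimal IPW sketch under the MAR/\(MIPO|X\) assumption, assuming a data frame dat with treatment Z, a pre-treatment covariate X, an indicator observed for non-missing outcomes, and outcome Y (names hypothetical):

    library(estimatr)

    # 1. Model the probability of being observed, given treatment and pre-treatment X
    p_obs <- glm(observed ~ Z + X, data = dat, family = binomial)
    dat$w <- 1 / predict(p_obs, type = "response")     # inverse probability weights

    # 2. Re-estimate the ATE on complete cases, upweighting units that resemble the missing
    lm_robust(Y ~ Z, weights = w, data = subset(dat, observed == 1))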

Ethics in research design (Lec 12)

  • Ethical conduct is integral to good science
  • Core Principles from the Belmont Report:
    • Respect for Persons: Requires informed consent, autonomy, and protection for vulnerable groups
    • Beneficence: Involves minimising harm and maximising potential benefits through careful risk-benefit assessment; Equipoise (genuine uncertainty) is key
    • Justice: Demands fair participant selection and equitable distribution of research burdens/benefits
  • Practical implementation involves Institutional Review Boards (IRBs), clear consent processes, data protection, and considering staff/community well-being
  • Adaptive designs can enhance ethics by allocating more participants to effective treatments sooner

Six components of a DeclareDesign study (Lec 13)

  • DeclareDesign formalises research plans using six key components, specified using declare_* functions:
    • Population: Defines units and their characteristics (declare_model)
    • Potential outcomes: Specifies how outcomes depend on treatments (declare_model)
    • Sampling strategy: How units are selected (declare_sampling)
    • Assignment: How units are assigned to treatment (declare_assignment)
    • Estimand: The target quantity of interest (declare_inquiry)
    • Estimator: The procedure/model used for estimation (declare_estimator)
  • The DesignLibrary package provides pre-built templates for common designs (a simulation and diagnosis sketch follows below)

Source: DeclareDesign
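
Once a design is declared (as in the MIDA sketch earlier), its properties can be simulated and summarised; simulate_design and diagnose_design are standard DeclareDesign functions:

    simulations <- simulate_design(design)         # one row per simulated estimate
    diagnosis   <- diagnose_design(design, sims = 500)
    diagnosis                                      # bias, RMSE, power, coverage, ...

    # DesignLibrary offers pre-built, parameterised versions of common designs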

Pre-analysis plans (PAPs) in practice (Lec 14)

  • PAPs detail the research plan (hypotheses, design, analysis) before data analysis
  • Aim to increase transparency, reduce bias (p-hacking, HARKing), enhance credibility
  • Stemmed from reproducibility crisis
  • Key components: Motivation, Hypotheses, Population/Sampling, Intervention, Outcomes/Covariates, Randomisation, Analysis Plan (estimators, SEs, power, missing data, subgroups), Implementation details
  • Should distinguish confirmatory (pre-specified) from exploratory analyses
  • Pros: Credibility, transparency, limits researcher degrees of freedom
  • Cons: Time-consuming, potentially inflexible (mitigated by allowing pre-specified exploratory analysis or clear justifications for deviations)
  • Registries like OSF, AEA, EGAP host PAPs
  • SOPs (Standard Operating Procedures) offer a potentially more flexible alternative but are less common

Advanced Methods & Applications 🔬

Natural & quasi-experiments (Lec 15)

  • Used when RCTs aren’t feasible/ethical, leveraging “as-if” random assignment
  • Natural Experiments rely on assignment outside researcher control (e.g., lotteries); require strong exogeneity arguments
  • Quasi-Experiments are a broader category; common designs include:
    • Regression Discontinuity (RDD): Exploits sharp cutoff rules (e.g., Mignozzetti et al.); assumes continuity of potential outcomes at the cutoff
    • Difference-in-Differences (DID): Compares changes over time for treated vs control (e.g., Card & Krueger); assumes parallel trends in the absence of treatment
  • Validity depends heavily on the plausibility of underlying assumptions

Interference & spillovers (Lec 16)

  • Interference occurs when one unit’s treatment affects another’s outcome (a SUTVA violation); common in social/network settings
  • Standard ATE estimates become biased
  • Requires expanding potential outcomes notation (e.g., \(Y_{i}(Z_i, Z_{-i})\))
  • Designs to address/estimate interference:
    • Clustered randomisation: Randomise at a level high enough to contain spillovers
    • Multi-level designs: Randomise at multiple levels (e.g., household & individual) to separate direct/indirect effects

Heterogeneous treatment effects (HTE) (Lec 18)

  • Effects often vary; ATE is just the average; understanding variability is key
  • Challenge: \(Var(\tau)\) depends on unidentifiable \(Cov(Y(1), Y(0))\)
  • Exploring HTE:
    • Treatment-by-Covariate Interactions (CATEs): Estimate the ATE within subgroups defined by pre-treatment \(X\); use regression interactions (\(Y \sim Z * X\)); caution: such interactions are descriptive about the source of heterogeneity, and testing many subgroups invites false positives
    • Treatment-by-Treatment Interactions (Factorial Designs): Experimentally manipulate multiple factors (\(Z_1, Z_2\)); allows causal inference about interactions; requires larger N
  • Beware the multiple comparisons problem when testing many subgroups; use corrections (e.g., Bonferroni) or pre-specification (an interaction sketch follows below)
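
A sketch of a treatment-by-covariate interaction, assuming dat contains outcome Y, treatment Z, and a binary pre-treatment moderator female (names hypothetical):

    library(estimatr)

    # The Z:female coefficient is the difference between subgroup CATEs
    lm_robust(Y ~ Z * female, data = dat)

    # Equivalently, subgroup ATEs (remember the multiple-comparisons caveat)
    lm_robust(Y ~ Z, data = subset(dat, female == 1))
    lm_robust(Y ~ Z, data = subset(dat, female == 0))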

Mediation analysis (Lec 20)

  • Seeks to understand how \(Z\) affects \(Y\) via mediator \(M\) (\(Z \to M \to Y\))
  • Traditional regression methods are often biased by unmeasured \(M\)-\(Y\) confounding (omitted variables affecting both \(M\) and \(Y\))
  • Potential outcomes approach reveals fundamental identification problem requiring strong ‘sequential ignorability’ assumptions (\(M\) is ‘as-if’ random conditional on \(Z\) and \(X\))
  • Randomising \(Z\) alone is insufficient
  • Using \(Z\) as an IV for \(M\) requires the strong exclusion restriction (no direct \(Z \to Y\) path), which is often violated
  • Implicit Mediation is a robust design-based alternative:
    • Experimentally vary treatment components (\(Z\) vs \(Z'\))
    • Compare effects of different bundles to infer mechanism importance without measuring \(M\)

Survey experiments (Lec 21 & 22)

Survey Experiments (Lec 21)

  • Random assignment embedded within survey instruments
  • Ideal for studying attitudes, preferences, information effects
  • Common designs: Question wording/framing, order effects, vignettes
  • Trade-offs: High internal validity vs potential external validity/demand effect concerns

Sensitive Topics (Lec 22)

  • Challenge: Social desirability bias
  • Goal: Elicit truthful responses while protecting privacy
  • Techniques: List Experiment, Randomised Response (RRT), Endorsement Experiment, Conjoint Analysis

Survey experiments: validation & design (Lec 21 cont)

  • Ensuring construct validity (manipulating what you intend):
    • Pilot testing treatments before main study
    • Manipulation checks (post-treatment questions assessing if manipulation worked)
    • Placebo conditions (similar task/info but without key manipulation)
    • Non-equivalent outcomes (outcomes that shouldn’t be affected)
  • Design considerations:
    • Comparability across conditions (length, complexity)
    • Realism of vignettes/stimuli
    • Respondent burden and attention (timers, forced exposure)
    • Device compatibility (mobile vs desktop)

Sensitive survey techniques details (Lec 22 cont)

  • List Experiment: Compare the mean item count between T (list + sensitive item) and C (list only); the difference estimates prevalence; assumes no design effects and no liars; watch for floor/ceiling effects (a short sketch follows after this list)
  • RRT: Respondent uses a random device (e.g., a coin flip) to determine whether to answer truthfully or give a fixed response; known probabilities allow estimation; can be confusing for respondents but often performs well in validation
  • Endorsement Experiment: Randomly associate policy/statement with endorsing group; difference in support reveals implicit attitude towards endorser; analysis complex with multiple endorsers
  • Conjoint Analysis: Respondents choose between profiles with multiple randomised attributes; estimates importance of each attribute (including sensitive ones) via trade-offs; powerful but complex design/analysis
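
A back-of-the-envelope list-experiment estimate, assuming dat has the count of affirmative items count and a treatment indicator treat (1 = list includes the sensitive item); the difference in mean counts estimates prevalence:

    library(estimatr)

    with(dat, mean(count[treat == 1]) - mean(count[treat == 0]))   # estimated prevalence
    lm_robust(count ~ treat, data = dat)                           # same estimate, with a standard error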

Discussions & Integration 🌍

Key paper discussions (Lec 06, 17, 19, 23)

Foundational Ideas & Design

  • Lec 06: Kalla & Broockman (access), Bertrand & Mullainathan (discrimination), Chattopadhyay & Duflo (representation)
    • Classic examples of field & natural experiments demonstrating core concepts
  • Lec 17: Centola (networks/contagion), Paluck (network intervention/climate), Gerber & Green (GOTV/interference/IV)
    • Focused on exploring interference, network structure, and spillover effects experimentally and analytically

Identification & Complex Settings

  • Lec 19: Munshi (networks/IV/FE), Miguel & Kremer (externalities/cluster RCT/spillovers)
    • Showcased clever identification strategies using IVs and cluster-randomisation for observational data and spillovers
  • Lec 23: Druckman (list/sensitive), Blair (list+endorsement/sensitive), Rosenfeld (validation/sensitive), Freire & Skarbek (conjoint/sensitive)
    • Illustrated application and validation of methods for sensitive topics

Integration of research findings (Lec 24)

  • Generalising results (extrapolation) is challenging; distinguish Sample ATE (SATE) from Population ATE (PATE); PATE estimation adds sampling uncertainty
  • The Bayesian framework formally updates prior beliefs with new evidence and can incorporate beliefs about potential bias (e.g., sampling bias); the posterior is a precision-weighted average
  • Meta-Analysis pools results from multiple studies
    • Fixed Effects assumes one true effect; weights by precision (\(1/SE^2\))
    • Random Effects allows true effect to vary across studies (more realistic); accounts for between-study heterogeneity
    • Beware publication bias and study heterogeneity

Generalisation & meta-analysis details (Lec 24 cont)

  • PATE standard error includes sampling variance: \(SE(\widehat{PATE}) = \sqrt{ \frac{Var(Y_i(1))}{m} + \frac{Var(Y_i(0))}{N-m} }\)
  • Bayesian updating combines prior precision (\(\rho_{prior}\)) and data precision (\(\rho_{data}\)) for posterior precision (\(\rho_{posterior} = \rho_{prior} + \rho_{data}\))
  • Posterior mean is precision-weighted average: \(\mu_{posterior} = \frac{\rho_{prior} \mu_{prior} + \rho_{data} x_e}{\rho_{posterior}}\)
  • Incorporating bias (\(B \sim N(\beta, \sigma^2_B)\)) reduces effective data precision: \(\rho_{effective\_data} = \frac{1}{\sigma^2_B + \sigma^2_{xe}}\)
  • Meta-analysis requires careful study selection (PRISMA), data extraction, and a choice between fixed- and random-effects models; meta-regression explores sources of heterogeneity (a worked numeric sketch follows below)
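
A worked numeric sketch of both formulas (all numbers invented for illustration): a precision-weighted Bayesian update, then fixed-effects pooling with weights \(1/SE^2\):

    # --- Bayesian updating: prior belief combined with one experimental result ---
    mu_prior  <- 0.10; se_prior <- 0.05
    x_e       <- 0.30; se_e     <- 0.10
    rho_prior <- 1 / se_prior^2
    rho_data  <- 1 / se_e^2
    rho_post  <- rho_prior + rho_data
    mu_post   <- (rho_prior * mu_prior + rho_data * x_e) / rho_post   # precision-weighted mean

    # --- Fixed-effects meta-analysis: three hypothetical studies ---
    est <- c(0.20, 0.35, 0.10)
    se  <- c(0.08, 0.15, 0.05)
    w   <- 1 / se^2                                  # precision weights
    pooled    <- sum(w * est) / sum(w)               # pooled estimate
    pooled_se <- sqrt(1 / sum(w))                    # pooled standard error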

Wrapping up

  • Phew, that was a lot of content! 😅
  • We’ve moved from basic experimental design (MIDA, potential outcomes) through implementation challenges (compliance, attrition, ethics, interference) to advanced analysis (HTE, mediation, quasi-experimental methods) and synthesis (meta-analysis)
  • Key theme: Understanding assumptions, potential biases, and choosing appropriate experimental designs and methods
  • Experiments are amazing 🤩! But they require careful thought and execution
  • Thank you for your engagement throughout the course 🙌
  • Any questions?

Thank you for your attention! 🙏