QTM 385 - Experimental Methods

Lecture 03 - Potential Outcomes Framework

Danilo Freire

Emory University

Hello, everyone! 👋
Hope you’re all doing well! 😉

Brief recap 📚

Last time, we saw that…

  • A good research question should produce knowledge that solves real-world problems and guides policy decisions, with a practical and credible research design

  • Theory is essential in research design, whether implicit or explicit, as it helps generate hypotheses, informs design choices, and guides inference strategies

  • Operationalisation is the process of translating theoretical concepts into measurable variables, such as turning “social isolation” into the frequency of social interactions

  • Pre-registration involves filing research designs and hypotheses publicly to reduce bias, improve credibility, and distinguish between pre-planned and exploratory analyses

  • The reproducibility crisis in science highlights the need for transparent and replicable research, with pre-registration and pre-analysis plans helping address this issue

  • The EGAP research design form provides a blueprint for creating robust research designs, covering key components like hypotheses, population, intervention, outcomes, and analysis

  • The MIDA framework (Model, Inquiry, Data, Answer) helps researchers simulate and diagnose their designs before implementation, ensuring they can answer their research questions effectively

  • DeclareDesign is a tool that allows researchers to declare, diagnose, and design their studies using the MIDA framework, improving the quality and credibility of research

  • A two-arm trial is a common experimental design where units are randomly assigned to treatment or control groups, and the average treatment effect (ATE) is estimated

  • Alignment between research design and theoretical frameworks is necessary for generating credible and actionable results, even when experiments are not feasible

Today, we will discuss…

  • The concept of potential outcomes and how it helps us understand causal inference
  • The fundamental problem of causal inference
  • The Rubin Causal Model
  • The Stable Unit Treatment Value Assumption (SUTVA) and its importance in causal inference
  • Why random assignment removes selection bias and allows us to estimate causal effects
  • But first… an example

Do hospitals make people sicker? 🏥

Imagine you are a health economist…

  • … and you want to know whether the government should invest in building more hospitals
  • You measure each respondent’s health status on a five-point scale: poor, fair, good, very good, or excellent (the higher, the better)
  • And you find this:
                  Hospital    No Hospital    Difference
Health status       3.21         3.93         −0.72∗∗∗
                   (0.014)      (0.003)
Observations        7,774       90,049

Standard errors in parentheses; ∗∗∗ p < 0.01


  • A simple comparison of means suggests that going to the hospital makes people worse off: those who had a hospital stay in the last 6 months are, on average, less healthy than those who were not admitted to the hospital
  • The difference is statistically significant at the 1% level
  • But don’t dismiss the idea just yet!
  • What could be going on here?

Potential outcomes

  • We are interested in the relationship between treatment \(T\) and some outcome that may be impacted by the treatment (e.g., health)
  • Outcome of interest:
    • \(Y\) = outcome we are interested in studying (e.g. health)
    • \(Y_i\) = value of outcome of interest for individual \(i\)
  • For each individual, there are two potential outcomes:
    • \(Y_{0,i}\) = outcome if individual \(i\) does not receive treatment
    • \(Y_{1,i}\) = outcome if individual \(i\) does receive treatment
  • So the treatment effect for individual \(i\) is:
    • \(\tau_{i} = Y_{1,i} - Y_{0,i}\)

Potential outcomes

  • Alejandro has a broken leg
  • He has two potential outcomes:
    • \(Y_{0,a}\) = If he doesn’t go to the hospital, his leg doesn’t heal properly
    • \(Y_{1,a}\) = If he goes to the hospital, his leg heals completely
  • Benicio doesn’t have any broken bones. His health is fine
  • He also has two potential outcomes:
    • \(Y_{0,b}\) = If he doesn’t go to the hospital, his health is still fine
    • \(Y_{1,b}\) = If he goes to the hospital, his health is still fine
  • The fundamental problem of causal inference:
    • We never observe both potential outcomes for the same individual
    • Creates a missing data problem if we compare treated to untreated

The fundamental problem of causal inference

  • For any individual, we can only observe one potential outcome:

\[ Y_i = \begin{cases} Y_{0i} & \text{if } T_i = 0 \\ Y_{1i} & \text{if } T_i = 1 \end{cases} \]

where \(T_i\) is a treatment indicator equal to 1 if \(i\) was treated and 0 otherwise

  • Each individual either participates in the programme or not

  • The causal impact of programme \(T\) on \(i\) is:

\[ Y_{1i} - Y_{0i} \]

  • We only observe \(i\)’s actual outcome:

\[ Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) T_i \]

Example: Alejandro goes to the hospital, Benicio does not.
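The switching equation above can be sketched in a few lines of Python, using made-up health scores for the two men (all numbers here are illustrative, not from any dataset):

```python
# Toy illustration of the switching equation Y_i = Y0_i + (Y1_i - Y0_i) * T_i,
# with hypothetical potential outcomes for Alejandro (treated) and Benicio (untreated).
def observed_outcome(y0, y1, t):
    """Return the observed outcome given both potential outcomes and treatment indicator t."""
    return y0 + (y1 - y0) * t

# Hypothetical health scores (higher = healthier)
alejandro = {"y0": 1, "y1": 4, "t": 1}  # goes to the hospital; broken leg heals
benicio   = {"y0": 5, "y1": 5, "t": 0}  # stays home; health fine either way

for name, p in [("Alejandro", alejandro), ("Benicio", benicio)]:
    y = observed_outcome(p["y0"], p["y1"], p["t"])
    tau = p["y1"] - p["y0"]  # individual treatment effect -- never fully observed in practice
    print(name, "observed:", y, "true individual effect:", tau)
```

Note that the code can print `tau` only because we invented both potential outcomes; in real data, one of `y0` and `y1` is always missing.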

Establishing causality

  • In an ideal world (research-wise), we could clone each treated individual and observe the impacts of treatment on the outcomes of interest

  • But we can’t clone people (yet 😂), so we need to find a way to estimate the treatment effect using the data we have
  • What is the impact of giving Lisa a textbook on her test scores? 📚
    • Impact = Lisa’s score with a book - Lisa’s score without a book
  • In the real world, we either observe Lisa with a textbook or without a textbook, but not both
  • We never observe the counterfactual

Establishing causality

  • To measure the causal impact of giving Lisa a book on her test score, we need to find a similar child that did not receive a book

  • Our estimate of the impact of the book is then the difference in test scores between the treatment group and the comparison group
    • Impact = Lisa’s score with a book - Bart’s score without a book
  • As this example illustrates, finding a good comparison group is hard
    • Your research design is your counterfactual

Average causal effects

  • What we actually want to know is the average causal effect, but that is not what we get from a difference in means comparison:

Difference in group means = average causal effect of the programme on participants + selection bias

Even in a large sample:

  • People will choose to participate in a program when they expect the program to make them better off (i.e., when \(Y_{1,i} - Y_{0,i} > 0\)).

  • The people who choose to participate are likely to be different than those who choose not to… even in the absence of the program

Selection bias: example


Mathematically speaking…

  • The mathematical expectation of \(Y_i\) is \(E[Y_i]\)
  • Equivalent to sample average in an infinite population
    • Example: probability coin flipped lands heads = 0.5
    • Expected value of a six-sided die roll = 3.5 (although no side has 3.5 dots)
    • Equivalent to fraction heads after a (very) large number of flips (“long-run average”)
    • In a sample, the analogue is the sample average \(\bar{Y} = \sum_{i=1}^{N} Y_i/N\)
  • Law of large numbers: sample average converges to population average as sample size increases
    • \(E[Y_i] = \lim_{N \to \infty} \sum_{i=1}^{N} Y_i/N\)
    • \(E[Y_i] = \lim_{N \to \infty} \bar{Y}\)
  • In small samples, average of \(Y_i\) might be anything
  • Average of \(Y_i\) gets very close to \(E[Y_i]\) as number of observations (that we are averaging over) gets large
  • Bias = difference between expected value of an estimator and the true value of the parameter being estimated
    • \(Bias = E[\hat{\theta}] - \theta\)

The Law of Large Numbers visualised

  • Imagine we simulate a uniform distribution of numbers from 0 to 9
  • The expected value of this distribution is 4.5
  • As we increase the number of simulations, you’ll see that the average of the numbers converges to 4.5
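A minimal simulation of this convergence, in plain Python (the seed is arbitrary):

```python
import random

random.seed(42)

# Law of large numbers: the average of uniform draws from {0, ..., 9}
# converges to the expected value 4.5 as the number of draws grows.
def running_average(n):
    """Average of n independent uniform draws from the integers 0-9."""
    draws = [random.randint(0, 9) for _ in range(n)]
    return sum(draws) / n

for n in (10, 1_000, 100_000):
    print(n, "draws -> average:", round(running_average(n), 3))
```

With 10 draws the average can be almost anything; with 100,000 it is reliably within a few hundredths of 4.5.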

Conditional expectations

  • Conditional expectation of \(Y_i\) given \(X_i\) is \(E[Y_i|X_i]\)
  • The conditional expectation of \(Y_i\) given a variable \(X_i = x\), is the average of \(Y_i\) in an infinite population that has \(X_i = x\)
  • \(E[Y_i|X_i = x] = \frac{1}{N_x}\sum_{i:\, X_i = x} Y_i\), the average over the \(N_x\) individuals with \(X_i = x\)


  • Example:
    • \(X_i =1\) if individual \(i\) is treated, 0 otherwise
    • \(E[Y_i|X_i = 1]\) = average outcome for treated individuals
    • \(E[Y_i|X_i = 0]\) = average outcome for untreated individuals
    • \(E[Y_i|X_i = 1] - E[Y_i|X_i = 0]\) = difference in group means (this equals the average treatment effect only when treatment is as-good-as-random)

Selection bias (this time with maths)

  • When we compare (many) participants to (many) non-participants:

\[ \underbrace{E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0]}_{\text{difference in group means}} = \underbrace{E[Y_{1i} - Y_{0i} \mid T_i = 1]}_{\text{average effect on participants}} + \underbrace{E[Y_{0i} \mid T_i = 1] - E[Y_{0i} \mid T_i = 0]}_{\text{selection bias}} \]
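A small simulation makes selection bias concrete. Here sicker people choose to go to the hospital, and the hospital genuinely helps everyone; all numbers (baseline health, effect size, sickness cutoff) are invented for illustration:

```python
import random

random.seed(1)

# Simulated population where sicker people self-select into treatment,
# so the naive difference in means = effect on participants + selection bias.
n = 100_000
people = []
for _ in range(n):
    y0 = random.gauss(4, 1)      # health without hospital (higher = healthier)
    y1 = y0 + 0.5                # hospital improves everyone by 0.5 (true effect)
    t = 1 if y0 < 3.5 else 0     # only the sick (low y0) go to the hospital
    people.append((y0, y1, t))

treated   = [y1 for y0, y1, t in people if t == 1]
untreated = [y0 for y0, y1, t in people if t == 0]

naive = sum(treated) / len(treated) - sum(untreated) / len(untreated)
print("true effect: 0.5, naive difference in means:", round(naive, 2))
```

The naive comparison comes out strongly negative even though the hospital helps every single person: the treated group was much sicker to begin with, exactly as in the hospital table at the start of the lecture.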

The experimental ideal 👩🏻‍🔬

Two types of counterfactuals

  • Participant vs. Non-Participant comparisons (we just saw why this is problematic)


  • Pre-treatment vs. Post-treatment comparisons (why?)


  • Extremely strong (read: often completely unreasonable) assumptions are required to make either of these impact evaluation approaches credible

Pre- vs. post-treatment comparisons


Millennium development villages

  • Millennium Villages Project (MVP) (2004-2015) was a UN project to demonstrate that integrated, community-led development could improve health, education, and agriculture in rural Africa
  • First evaluation relied on data on pre-treatment and post-treatment outcomes in Bar Sauri, Kenya
  • On most outcomes, people living in the MVP villages looked better off after 3-5 years
  • But have a look at this 🤨

How to estimate causal effects

Quasi-experimental approaches

  • Difference-in-difference estimation
    • Idea: Combine pre/post + treated/untreated designs
    • Requirement: Common trends in treatment, comparison groups
  • Instrumental variables
    • Idea: Find a source of exogenous variation in treatment
    • Requirement: A valid instrument (satisfying the exclusion restriction)
  • Regression discontinuity
    • Idea: Exploit explicit rules (cutoffs) for assigning treatment
    • Requirement: Treatment assignment changes discontinuously at the cutoff
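The difference-in-differences idea can be shown with a toy calculation; the group means below are made up:

```python
# Minimal difference-in-differences sketch with hypothetical group means.
# Under the common-trends assumption, DiD = (post_T - pre_T) - (post_C - pre_C):
# the comparison group's trend stands in for the treated group's missing counterfactual.
pre_treated, post_treated = 2.0, 3.5   # treated group, before and after
pre_control, post_control = 2.2, 2.7   # comparison group, before and after

did = (post_treated - pre_treated) - (post_control - pre_control)
print("DiD estimate:", round(did, 2))  # (1.5) - (0.5) = 1.0
```

The treated group improved by 1.5, but 0.5 of that is the common trend shared with the comparison group, leaving an estimated effect of 1.0.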

Alternative approaches

  • Conditional independence assumption (CIA) approaches

    • Matching estimators, e.g., propensity score matching
    • Coefficient stability robustness to controls
  • Explicit models (structural or not) of selection into treatment

  • Natural experiments when treatment is as-good-as-random

    • Example: Rainfall shocks in childhood (Maccini and Yang 2009)
    • Closely related to instrumental variables approach

How to estimate causal effects

Experimental approach

  • Random assignment to treatment
    • Eligibility for program is determined at random, e.g., via pulling names out of a hat
  • The Law of Large Numbers
    • A sample average can be brought as close as we like to the population mean just by enlarging the sample
    • Mathematically: \(\bar{Y} \to E[Y_i] \quad \text{as} \quad N \to \infty\)
  • Treatment and control groups
  • When treatment status is randomly assigned:
    • Treatment and control groups are random samples of a single population
    • Example: The population of all eligible applicants for the program.
  • Expected outcomes
  • In the absence of the program, expected outcomes are equal for both groups
    • Formally: \(E[Y_i | \text{Treatment}] = E[Y_i | \text{Control}]\)
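A minimal sketch of why this works, reusing the kind of simulated population from the selection-bias example (a constant individual effect of 0.5 and the baseline health distribution are invented for illustration):

```python
import random

random.seed(7)

# With random assignment, the difference in group means recovers the
# average treatment effect, even though individuals differ from one another.
n = 100_000
treated, control = [], []
for _ in range(n):
    y0 = random.gauss(4, 1)
    y1 = y0 + 0.5                 # true individual effect = 0.5 for everyone
    if random.random() < 0.5:     # coin-flip assignment, independent of y0
        treated.append(y1)
    else:
        control.append(y0)

diff = sum(treated) / len(treated) - sum(control) / len(control)
print("difference in means:", round(diff, 2))  # close to the true 0.5
```

Because assignment is independent of \(Y_{0i}\), the two groups have the same expected baseline health, so the naive difference in means is no longer contaminated by selection bias.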

Random assignment

  • Randomisation does not eliminate individual differences (we still can’t identify individual treatment effects)
  • On average, individuals in the treatment and control groups are similar (in large samples)
  • Need Stable Unit Treatment Value Assumption (SUTVA): Potential outcomes for any unit do not vary with the treatments assigned to other units (more later)
  • See Imbens and Rubin (2015) for more details
  • When is SUTVA violated?
    • Spillovers
    • Non-compliance
    • Interference
    • Contamination
    • Heterogeneous treatment effects
  • We will see all that in the next lectures! 🤓

Violations of SUTVA

SUTVA: “stable unit” part

  • Individuals are receiving the same treatment – i.e., the “dose” of the treatment to each member of the treatment group is the same
  • If we are estimating the effect of hospitalisation on health status, we assume everyone is getting the same dose of the hospitalisation treatment
  • Easy to imagine violations if hospital quality varies across individuals
  • Have to be careful what we are and are not defining as the treatment

SUTVA: “treatment value” part

  • The potential outcomes for any unit do not vary with the treatments assigned to other units
  • No spillovers: The treatment of one individual does not affect the outcomes of others (e.g., a vaccinated person cannot indirectly protect unvaccinated neighbours)
  • No interference: Units’ treatments do not influence each other’s outcomes through direct interaction or competition
  • No contamination: Control group members do not receive any form of the treatment, either intentionally or accidentally

Partial equilibrium

  • Partial equilibrium is a concept from economics
  • It refers to the idea that the effects of a policy or intervention are limited to the individuals directly affected by it
  • In the context of SUTVA, partial equilibrium means that the treatment effect for one group may not generalise to other groups
  • Issue of external validity
  • Let’s say we estimate a causal effect of early childhood intervention in some area
  • Now adopt it for the whole country – will it have the same effect as we found?
    • Expansion may create general equilibrium effects
    • Have different effects due to economies of scale
    • The effect might be different if the population is different

Recap: back to our hospital example 🏥

Potential outcomes revisited

  • Main question: do hospitals make people sicker?

  • Heterogeneous treatment effects: treatments work differently for different people

    • Sick people benefit more from hospitals (e.g., broken bones, infections)
    • Healthy people might even be harmed by hospitals (e.g., stress, exposure to germs)
  • Real-World Example:

    • Imagine two people:
      • Alejandro (sick): Hospital fixes his broken leg → improves his health
      • Benicio (healthy): Hospital visit causes anxiety → slightly worsens his health
  • Takeaway:

    • Treatments don’t work the same way for everyone
    • If we only study people who choose treatment (like sick people), results are misleading.

Potential outcomes for hospital visits

  • What would happen if everyone went to the hospital vs. no one?

  • Potential outcomes table:


Person     Outcome if Hospitalised     Outcome if Not Hospitalised
Sick       Health improves slightly    Health stays poor
Healthy    Health slightly worsens     Health stays fine


  • The Problem:
    • We only observe one outcome per person (e.g., we see Alejandro’s health after he goes to the hospital, but not what would’ve happened if he didn’t)
  • Selection Bias: Sick people choose hospitals, making comparisons unfair

How random assignment fixes this

  • Randomisation breaks the link between treatment choice and outcomes

  • (An absurd) experiment:

    • What if we randomly assign everyone (sick and healthy) to two groups?
      • Treatment group: Must go to the hospital
      • Control group: Cannot go to the hospital
    • Afterward, compare average health of both groups
  • Randomisation ensures groups are similar on average (similar mix of sick/healthy people)

  • Any difference in outcomes is caused by the hospital, not pre-existing health

  • But what are we measuring here?

  • The average effect of hospitalisation on the population

Randomising amongst the sick

  • What if we only randomise treatment for people who need it?

  • Better Experiment:

    • Only include sick people in the study
    • Randomly assign half to hospitals, half to stay home
  • Result:

    • Compare health of hospitalised sick people vs. non-hospitalised sick people
  • What are we measuring now? Is this the ideal experiment?

  • The average effect of hospitalisation on sick people
  • Limitation:
    • Doesn’t tell us how hospitals affect healthy people (but they don’t need hospitals anyway!)

People do not always follow the rules

  • We might consider randomising access to hospitals

  • But what if people ignore their random assignment?

  • Example:

    • Assign 100 sick people to hospitals, but 40 refuse to go
    • Assign 100 sick people to stay home, but 20 go to hospitals anyway
  • Problem:

    • The “treated” group now includes only those who complied (e.g., the most severe cases)
    • The “control” group includes some people who sought treatment
  • Result:

    • The measured effect applies only to compliers (e.g., very sick people)
    • The effect might not apply to everyone
  • This is the compliance problem in experiments

Why compliance matters for real-world policy

  • Even with random assignment, human behaviour (often) complicates things

  • Example: A government offers free training to help people find jobs

    • Random assignment: Some people get training, others do not
    • Reality: Only 30% of the treatment group actually attend the training and get jobs
  • Should we continue the programme?

  • Two ways to analyse:

    • Intent-to-Treat (ITT): Compare all people assigned to treatment vs. control (even if they didn’t go).
      • Shows the effect of offering the program.
    • Treatment-on-the-Treated (TOT): Compare only those who actually took the treatment vs. control
      • Shows the effect for compliers
  • Takeaway: Compliance affects how we interpret results and design policies.
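A toy simulation of the training example shows how non-compliance dilutes the ITT estimate. The 30% compliance rate comes from the slide; the baseline distribution and the effect size of 2.0 for compliers are invented:

```python
import random

random.seed(3)

# One-sided non-compliance: everyone assigned to treatment is *offered*
# training, but only 30% attend. ITT compares groups by *assignment*,
# so it estimates the effect of offering the programme, not of attending it.
n = 100_000
effect = 2.0                       # hypothetical effect of training on compliers
assigned_group, control_group = [], []
for _ in range(n):
    baseline = random.gauss(10, 2)
    if random.random() < 0.5:      # random assignment to the offer
        complied = random.random() < 0.3
        assigned_group.append(baseline + (effect if complied else 0.0))
    else:
        control_group.append(baseline)

itt = sum(assigned_group) / len(assigned_group) - sum(control_group) / len(control_group)
print("ITT estimate:", round(itt, 2))  # roughly 0.3 * 2.0 = 0.6
```

Dividing the ITT by the compliance rate (0.6 / 0.3 = 2.0) recovers the effect on compliers, which previews the instrumental-variables logic we will cover in a later lecture.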

To sum up…

  • Potential outcomes help us understand the causal effects of treatments
  • The fundamental problem of causal inference is that we never observe both potential outcomes for the same individual
  • Random assignment helps us estimate causal effects by breaking the link between treatment choice and outcomes
  • Selection bias occurs when treated and untreated groups are different in the absence of treatment
  • SUTVA is a key assumption in causal inference that ensures potential outcomes do not vary with the treatments assigned to other units
  • Compliance affects how we interpret results and design policies
  • Partial equilibrium is the idea that the effects of a policy or intervention are limited to the individuals directly affected by it
  • External validity is the extent to which the results of a study can be generalised to other populations, settings, and times

That’s all for today! 🎉

Thank you for your attention! 🙏
See you all soon! 😊