QTM 385 - Experimental Methods

Lecture 05 - Sampling Distributions, Statistical Inference, and Hypothesis Testing

Danilo Freire

Emory University

Hello, everyone! 👋
How are you doing today? 😁

Brief recap 📚

Last time, we saw that…

  • Selection bias arises when treatment groups differ initially
  • Balance tests assess pre-treatment group comparability
    • Are they useful? Sometimes. Do they solve the problem? No. 😂
  • Causal diagrams (DAGs) illustrate variable relationships and biases
    • Useful for identifying confounders and mediators
  • Loss to follow-up can introduce bias in treatment effects
  • Survivorship bias focuses on successful outcomes, ignoring failures
  • Post-treatment bias occurs when controlling for variables affected by treatment
  • Standard deviation quantifies the spread of data
  • Standard error measures the precision of sample means (or other test statistics)
  • Regression models are the most convenient way to estimate treatment effects
    • They can handle multiple covariates and interactions
  • DeclareDesign is a package for designing and analysing experiments
    • It helps you through the entire process! 😎

Today, we will discuss…

  • Regression analysis in R (from last week)
  • Sampling distributions of experimental estimates
  • The importance of randomisation inference
  • Hypothesis testing
    • Sharp null hypothesis
    • Null hypothesis of no average treatment effect
    • Confidence intervals
  • But first… interesting experiment of the day! 🎉
  • And a few ideas about your pre-analysis plans and final project 😊

Source: Alex Coppock

Interesting experiment of the day! 🎉

Source: Knittel et al (2024)

Pre-analysis plan and final project 😊

  • Now we have the final list of students enrolled in the course!
  • So we can start thinking about your project 🤓
  • It will be a group project, with 3-4 students per group
  • You will submit a pre-analysis plan, I will create a dataset for you to work on, and you will write a report
  • I also have a few questions for you (don’t worry, they’re easy! 😂)

Regression analysis in R (recap) 📊

Loading the packages and simulating the data

# Load the required packages
library(fabricatr)
library(estimatr)
library(randomizr)

# Set the seed for reproducibility
set.seed(385)

data2 <- fabricate(
  N = 1000,
  treat = complete_ra(N, m = 500),
  age = round(rnorm(N, mean = 30, sd = 5)),
  education = round(rnorm(N, mean = 12, sd = 2)),
  interviews = round(rnorm(N, mean = 10, sd = 2) + 5 * treat)
)

head(data2)
    ID treat age education interviews
1 0001     0  32        11          9
2 0002     1  30        13         16
3 0003     1  31        12         16
4 0004     1  33        13         15
5 0005     0  27        15          7
6 0006     0  33        12          9

Adding covariates

  • We can add covariates to experimental models to increase precision
  • The same way we do in any regression:
    • \(Y_i = \alpha + \beta \text{T}_i + \gamma \text{X}_i + \epsilon_i\)
  • Freedman (2008) demonstrated that conventional covariate adjustment can bias estimates of average treatment effects, mainly in small samples
  • Lin (2013) showed that centring the covariates and interacting them with the treatment indicator addresses this critique and cannot hurt asymptotic precision
  • This is equivalent to adjusting for the covariates separately within the treatment and control groups:
    • \(Y_i = \alpha + \beta \text{T}_i + \gamma \text{W}_i + \delta \text{T}_i \times \text{W}_i + \epsilon_i\)
    • Where \(W_i = \text{X}_i - \bar{\text{X}}\) and \(\bar{\text{X}}\) is the mean of \(\text{X}_i\)
    • Centring the covariates at zero means \(\beta\) can still be read as the average treatment effect
  • Anyway, worry not! 😅 Our friends at DeclareDesign have done all the work for us and created a function called lm_lin() that does everything automatically!

Estimating the second model

model2 <- lm_lin(interviews ~ treat,
                covariates = ~ age + education,
                data = data2)
summary(model2)

Call:
lm_lin(formula = interviews ~ treat, covariates = ~age + education, 
    data = data2)

Standard error type:  HC2 

Coefficients:
                  Estimate Std. Error  t value   Pr(>|t|)  CI Lower CI Upper
(Intercept)        9.88659    0.09213 107.3127  0.000e+00  9.705804 10.06738
treat              5.13084    0.12546  40.8956 3.281e-215  4.884642  5.37704
age_c              0.02845    0.01771   1.6062  1.085e-01 -0.006307  0.06320
education_c       -0.04550    0.04721  -0.9636  3.355e-01 -0.138150  0.04716
treat:age_c       -0.03324    0.02546  -1.3054  1.921e-01 -0.083206  0.01673
treat:education_c  0.04324    0.06140   0.7043  4.814e-01 -0.077248  0.16373
                   DF
(Intercept)       994
treat             994
age_c             994
education_c       994
treat:age_c       994
treat:education_c 994

Multiple R-squared:  0.6272 ,   Adjusted R-squared:  0.6254 
F-statistic: 344.4 on 5 and 994 DF,  p-value: < 2.2e-16
  • The lm_lin() function is similar to the lm_robust() function, but it includes the covariates argument
  • As you can see, the results are pretty similar to the previous model
  • The treatment effect is slightly higher, and the standard error is slightly smaller
  • The _c part of the variable names indicates that the variables are centred
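
To see what lm_lin() does under the hood, here is a minimal sketch that centres the covariates by hand and fits the interacted model with lm_robust(); the coefficients should reproduce the output above:

# Centre the covariates by hand
data2$age_c <- data2$age - mean(data2$age)
data2$education_c <- data2$education - mean(data2$education)

# Interact the centred covariates with the treatment indicator;
# this is the Lin (2013) estimator that lm_lin() automates
model2_manual <- lm_robust(
  interviews ~ treat * (age_c + education_c),
  data = data2)
summary(model2_manual)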

Sub-group analysis

  • We can also estimate the treatment effect for sub-groups of the population
  • This is useful when we suspect that the treatment effect may vary across known dimensions
  • For instance, we can estimate the treatment effect for people with high and low levels of education
  • We can do this by including an interaction term between the treatment variable and the covariate of interest
  • Interpretation (see the output below): the treatment effect for the high-education subgroup (high_edu = 1) is 0.097 larger than for the low-education subgroup
  • Total treatment effect for high_edu = 1: 5.101 + 0.097 = 5.198
# Create a binary subgroup (e.g., "high" vs. "low" education)
data2$high_edu <- ifelse(data2$education > median(data2$education), 1, 0)

# Fit an interaction model with covariates
model_interaction <- lm_robust(
  interviews ~ treat * high_edu + age + education,
  data = data2)

# Summarize results
summary(model_interaction)

Call:
lm_robust(formula = interviews ~ treat * high_edu + age + education, 
    data = data2)

Standard error type:  HC2 

Coefficients:
               Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
(Intercept)     9.39778    0.69048 13.6104  8.255e-39  8.04281 10.75275 994
treat           5.10068    0.16905 30.1726 1.682e-142  4.76894  5.43241 994
high_edu       -0.26179    0.24257 -1.0792  2.808e-01 -0.73780  0.21422 994
age             0.01141    0.01276  0.8938  3.716e-01 -0.01363  0.03644 994
education       0.02033    0.05026  0.4046  6.859e-01 -0.07829  0.11895 994
treat:high_edu  0.09721    0.25093  0.3874  6.985e-01 -0.39521  0.58962 994

Multiple R-squared:  0.6269 ,   Adjusted R-squared:  0.625 
F-statistic: 341.1 on 5 and 994 DF,  p-value: < 2.2e-16

Hypothesis Testing 👩🏻‍🔬

What is a hypothesis test? 🤔

  • A hypothesis test is an evaluation of a particular hypothesis about the population distribution
  • Statistical thought experiments:
    • Assume we know (part of) the true data-generating process (DGP)
    • Use tools of probability to see what types of data we should see under this assumption
    • Compare our observed data to this thought experiment.
  • Statistical proof by contradiction:
    • We assume the null hypothesis is true
    • We calculate the probability of observing our data under this assumption
    • If this probability is very low, we reject the null hypothesis
  • Null hypothesis (\(H_0\)): A proposed value for a population parameter
    • This is usually “no effect/difference/relationship”
    • We denote this as \(H_0: \theta = \theta_0\)
    • For instance, information about user ratings does not affect cancellation rates: \(H_0: \mu_y - \mu_n = 0\)
  • Alternative hypothesis (\(H_1\)): A different value for the population parameter, which we are interested in
    • This is usually “there is an effect”
    • \(H_1: \theta \neq \theta_0\)
    • For instance, information about user ratings does affect cancellation rates: \(H_1: \mu_y - \mu_n \neq 0\)
  • Always mutually exclusive

General framework for hypothesis testing

  • A hypothesis test chooses whether or not to reject the null hypothesis based on the data we observe.
  • Rejection based on a test statistic, \(T_n = T(Y_1, ..., Y_n)\)
    • Will help us adjudicate between the null and the alternative
    • Typically, larger values of \(T_n\) make the null less plausible
  • The rejection region, \(R\), contains the values of \(T_n\) for which we reject the null.
    • These are the areas that indicate that there is evidence against the null.
  • Two-sided alternative (our focus):
    • \(H_0 : \mu_y - \mu_x = 0\) and \(H_a : \mu_y - \mu_x \neq 0\)
    • Implies that \(T_n >> 0\) or \(T_n << 0\) will be evidence against the null
    • Rejection regions: \(|T_n| > c\) for some value \(c\)
  • How to determine these regions?

Type I and Type II errors

  • A Type I error is when we reject the null hypothesis when it is in fact true
    • We say that information has an effect when it does not
    • A false discovery (very bad, thus type I)
  • A Type II error is when we fail to reject the null hypothesis when it is false
    • We say that information has no effect when it does
    • A missed opportunity (not as bad, thus type II)
  • The probability of a Type I error is denoted by \(\alpha\)
  • Choose a level of significance \(\alpha\)
    • Convention in social sciences is \(\alpha = 0.05\), but nothing magical there
    • Particle physicists at CERN use \(\alpha \approx \frac{1}{1,750,000}\)
    • Lower values of \(\alpha\) guard against “flukes” but increase barriers to discovery
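
As a quick illustration, the two-sided critical value implied by a given \(\alpha\) comes straight from the normal quantile function. A minimal sketch in R:

# Two-sided critical values implied by different significance levels
alpha <- c(0.10, 0.05, 0.01, 1 / 1750000)
round(qnorm(1 - alpha / 2), 2)
# ~ 1.64, 1.96, 2.58, and about 5.00 (the "five sigma" convention)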

Source: Scribbr

Type I and Type II errors 😂

Hypothesis testing steps

  1. Choose null and alternative hypotheses (e.g., \(H_0 : \mu_y - \mu_x = 0\) vs. \(H_a : \mu_y - \mu_x \neq 0\)).
  2. Choose a test statistic (e.g., \(T_n = \frac{\hat{D}_n}{se[\hat{D}_n]}\)).
  3. Choose a significance level, \(\alpha\) (e.g., \(\alpha = 0.05\)).
  4. Determine the rejection region (e.g., \(|T_n| > 1.96\)).
  5. Reject the null hypothesis if the test statistic falls within the rejection region; otherwise, fail to reject.
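
To make these steps concrete, here is a minimal sketch in R that walks through all five for a difference in means (the data are simulated, so the numbers are purely illustrative):

# Step 1: H0: mu_y - mu_x = 0 vs. Ha: mu_y - mu_x != 0
set.seed(385)
y <- rnorm(100, mean = 10.5, sd = 2)  # simulated treatment group
x <- rnorm(100, mean = 10.0, sd = 2)  # simulated control group

# Step 2: test statistic T_n = D_hat / se(D_hat)
d_hat <- mean(y) - mean(x)
se_hat <- sqrt(var(y) / length(y) + var(x) / length(x))
t_n <- d_hat / se_hat

# Steps 3 and 4: alpha = 0.05 implies the rejection region |T_n| > 1.96
alpha <- 0.05
c_val <- qnorm(1 - alpha / 2)

# Step 5: reject H0 if the statistic falls in the rejection region
abs(t_n) > c_val
2 * (1 - pnorm(abs(t_n)))  # the corresponding p-value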

Rejection regions

  • The rejection region is determined by the critical values, \(c\), which are defined by the significance level, \(\alpha\) (e.g., \(|T_n| > 1.96\) for \(\alpha = 0.05\) under the normal approximation)

Average treatment effect (ATE) hypothesis testing

  • We have already seen one way to test hypotheses using the potential outcomes framework
    • We estimate the average treatment effect (ATE) and see if the confidence interval includes zero
  • This is known as the Jerzy Neyman approach to hypothesis testing
  • As potential outcomes are unobservable, we use the observed data to estimate the ATE after randomisation
  • Neyman had such a huge impact on experimental methods that many scholars call the potential outcomes framework the Neyman-Rubin model.
    • See here for more details, and here for a paper by Rubin on the importance of Neyman’s work
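
As a quick sketch of this approach, we can reuse the simulated data2 from the regression recap: estimatr’s difference_in_means() reports the estimated ATE with its confidence interval, and rejecting whenever the 95% CI excludes zero is the Neyman-style test at \(\alpha = 0.05\):

library(estimatr)

# Neyman approach: estimate the ATE and check whether the 95%
# confidence interval includes zero (data2 was simulated earlier)
difference_in_means(interviews ~ treat, data = data2)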

Sharp null hypothesis 🧐

  • But there is another interesting way to test hypotheses!
  • The sharp null hypothesis specifies the treatment effect for every single unit in the population
    • The most common version is that the treatment has no effect for anyone: \(Y_i(1) = Y_i(0)\) for all \(i\)
    • This is a very strong assumption, but under it every missing potential outcome is known, so we can compute the exact distribution of our test statistic
    • It is also known as the Ronald Fisher approach to hypothesis testing
  • It is based on \(p\)-values and sampling distributions
  • For a comprehensive discussion of the sharp null hypothesis, see Imbens and Rubin (2015)
  • Let’s see how it works! 😊

Hypothesis testing with the sharp null hypothesis

Randomisation inference

  • Sampling variability is a crucial topic in experimental design and analysis
  • A single experiment provides just one of many possible datasets generated by random assignment
  • The estimate of the average treatment effect can vary substantially depending on the random assignment
  • The sampling distribution (or randomisation distribution) refers to the collection of estimates from every possible random assignment
  • This hypothesis testing approach is broadly applicable, not limited by sample size or outcome distribution, and can be used with counts, durations, or ranks
  • In theory, this calculation can be done for any experiment size, but the number of possible random assignments becomes very large as \(N\) increases
  • For example, an experiment with \(N = 50\) and half assigned to treatment has more than 126 trillion randomisations 😮
    • \(\frac{50!}{25! \times 25!} = 126,410,606,437,752\)
  • But we can approximate the sampling distribution by sampling from all possible random assignments
  • Calculating p-values based on an inventory of possible randomisations is called randomisation inference
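
To fix ideas before introducing a dedicated package, here is a hand-rolled sketch using the seven-village example from Gerber and Green (2012) that we analyse more formally below:

# Outcomes and observed assignment (2 of 7 units treated)
Y <- c(15, 15, 20, 20, 10, 15, 30)
T_obs <- c(1, 0, 0, 0, 0, 0, 1)

# Observed difference in means
obs_est <- mean(Y[T_obs == 1]) - mean(Y[T_obs == 0])  # 6.5

# Under the sharp null the outcomes are fixed, so we can enumerate
# all choose(7, 2) = 21 possible assignments and re-estimate each time
pairs <- combn(7, 2)
ri_ests <- apply(pairs, 2, function(idx) {
  T_sim <- as.integer(seq_along(Y) %in% idx)
  mean(Y[T_sim == 1]) - mean(Y[T_sim == 0])
})

# Two-tailed p-value: share of assignments at least as extreme as observed
mean(abs(ri_ests) >= abs(obs_est))  # 8/21 = 0.381

# Enumeration quickly becomes infeasible, which is why we sample instead
choose(50, 25)  # 126,410,606,437,752 possible assignments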

Randomisation inference in R

  • There is an R package for that! 😊
  • The ri2 package provides tools for randomisation inference, and it was made by the same people who created the estimatr package
  • You can find more information about the package here
  • Just install it with install.packages("ri2") and load it with library(ri2)
  • The package supports all the randomisation procedures that can be described by the randomizr package
  • Let’s see how it works!
  • Another package that I found useful (and broadly similar to ri2) is the ritest package, available on GitHub: https://github.com/grantmcdermott/ritest

Example

Do female council heads allocate more resources to water sanitation?

  • Gerber and Green (2012) describe a hypothetical experiment in which 2 of 7 villages are assigned a female council head and the outcome is the share of the local budget allocated to water sanitation
  • Their table 2.2 describes one way the experiment could have come out
library(ri2)
library(estimatr)

# Create the data
table_2_2 <- data.frame(T = c(1, 0, 0, 0, 0, 0, 1),
                        Y = c(15, 15, 20, 20, 10, 15, 30))

summary(lm_robust(Y ~ T, data = table_2_2))

Call:
lm_robust(formula = Y ~ T, data = table_2_2)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept)     16.0      1.871  8.5524  0.00036    11.19    20.81  5
T                6.5      7.730  0.8409  0.43876   -13.37    26.37  5

Multiple R-squared:  0.2485 ,   Adjusted R-squared:  0.09824 
F-statistic: 0.7071 on 1 and 5 DF,  p-value: 0.4388

Randomisation inference

# Declare randomisation procedure
declaration <- declare_ra(N = 7, m = 2)

# Conduct Randomisation Inference
ri2_out <- conduct_ri(
  formula = Y ~ T,
  assignment = "T",
  declaration = declaration,
  sharp_hypothesis = 0,
  data = table_2_2,
  sims = 1000
)

summary(ri2_out)
  term estimate two_tailed_p_value
1    T      6.5          0.3809524
plot(ri2_out)

Another example!

Effect of not having a runoff in sub-Saharan Africa

  • The table below, from Glynn and Ichino (2015), examines the relationship between the absence of a runoff election (\(A_i = 1\)) and harassment of opposition parties (\(Y_i\)).
  • Data was collected from 10 sub-Saharan African countries
  • The study suggests that without runoff elections, incumbents can win with a simple plurality, reducing their need to court smaller parties
  • This creates incentives to suppress turnout through intimidation
  • Conversely, with a runoff (\(A_i = 0\)), incumbents need broader support and are more likely to engage smaller parties rather than intimidate them
Unit \(Y_i(0)\) \(Y_i(1)\) \(A_i\) \(Y_i\)
Cameroon ? 1 1 1
Kenya ? 1 1 1
Malawi ? 1 1 1
Nigeria ? 1 1 1
Tanzania ? 0 1 0
Congo 0 ? 0 0
Madagascar 0 ? 0 0
Central African Republic 0 ? 0 0
Ghana 0 ? 0 0
Guinea-Bissau 0 ? 0 0

Source: Blackwell (2013)

Sharp null hypothesis

  • Under the sharp null of no effect for any unit, each missing potential outcome equals the unit’s observed outcome, so the full table can be filled in:

Unit \(Y_i(0)\) \(Y_i(1)\) \(A_i\) \(Y_i\)
Cameroon 1 1 1 1
Kenya 1 1 1 1
Malawi 1 1 1 1
Nigeria 1 1 1 1
Tanzania 0 0 1 0
Congo 0 0 0 0
Madagascar 0 0 0 0
Central African Republic 0 0 0 0
Ghana 0 0 0 0
Guinea-Bissau 0 0 0 0
gi_data <- data.frame(Y = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
                      A = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0))

declaration <- declare_ra(N = 10, m = 5) 

ri2_out <- conduct_ri(
  formula = Y ~ A,
  assignment = "A",
  declaration = declaration,
  sharp_hypothesis = 0,
  data = gi_data,
  sims = 1000
)

summary(ri2_out)
  term estimate two_tailed_p_value
1    A      0.8         0.04761905

So far, so good? 😃

Let’s discuss a real experiment! 🤓

Bottom-up accountability and public service provision: Evidence from a field experiment in Brazil

By yours truly! 😊 https://doi.org/10.1177/2053168020914444

Outline of the paper: Introduction

Based on the EGAP Research Design Form

  • Researcher names: Danilo Freire, Manoel Galdino, and Umberto Mignozzetti
  • Research project title: Bottom-up accountability and public service provision: Evidence from a field experiment in Brazil
  • One sentence summary of your specific research question: Does a mobile phone application that enables citizen monitoring of school construction projects improve public service delivery?
  • General motivation: Accountability systems are crucial for efficient public service provision, and there is a belief that bottom-up monitoring can help with the principal-agent dilemma
  • Why should someone who is not an academic care about the results of this research?: How technology can help citizens ensure government accountability in young democracies
  • Theoretical motivation: Debate between studies showing the positive impacts of information-based interventions and those that find little evidence
  • Primary hypotheses:
    • Key parameters/estimands: The key parameters/estimands are the impacts of the TDP app on: (1) percentage of project completed; (2) difference in completion before and after the intervention; (3) whether the construction is finished; (4) whether the construction is cancelled; and (5) the number of updated dates.
    • Predicted sign/magnitude: The primary hypothesis is that the TDP app will have a positive effect on (1), (2), (3), and (5), leading to higher completion rates, and a negative effect on (4), leading to fewer cancellations.
    • Logic or theory of change: Providing citizens with information about school construction projects empowers them to pressure officials, leading to faster and better completion.

Outline of the paper: Introduction

Based on the EGAP Research Design Form

  • Alternative explanations if results are consistent with hypotheses:

    • Alternative theories: Increased community engagement and public awareness, which generate other forms of engagement and pressure on officials. Increased media coverage of school construction could also coincide with the programme
    • Hypothesis for alternative outcome: With an increase in media attention regarding public works, we would expect an overall improvement in public services, not a change that is specific to the treatment group.

  • Alternative explanations if results are inconsistent with hypotheses:

    • Alternative theories: Difficulties in differentiating political corruption from budget issues; the lack of political pressure; the lack of trust in representatives
    • Hypothesis for alternative outcome: Citizens may not trust their representatives and think the problems are related to austerity instead of corruption

Section 2: Population and Sample

  • Population of interest: Brazilian municipalities & citizens; school construction projects.
  • Where and when: Brazil; Intervention 1: Aug 2017-July 2018; Intervention 2: Aug 2018-July 2019.
  • Context match?: Matches, but economic crisis & electoral cycle are specific conditions.
  • Sample size:
    • Intervention 1: 344 control, 2642 treatment municipalities.
    • Intervention 2: 659 control, 3717 treatment schools.
  • Sample selection:
    • Intervention 1: Random assignment of municipalities.
    • Intervention 2: Random assignment of schools, stratified by state, construction status, spending.
  • Consent
    • Obtained?: No. App use was voluntary, data focused on public works.
    • Vulnerable population?: No coercion risk; participation was voluntary.
  • Ethics:
    • Power sufficient?: Yes. Sample size provides power to detect plausible effects.
    • Size necessary?: Yes. Sample was determined by power analysis, not unnecessarily large.
    • Risk of targeting?: No. Results inform general policy, not targeted at individuals.

Section 3: Intervention

  • Status Quo:
    • School construction projects in Brazil often face delays.
    • No prior system for citizens to directly report problems and pressure authorities in this way.
  • Intervention:
    • Mobile app (Tá de Pé - TDP) enabling citizens to monitor school construction sites.
    • Features: submit photos, check status, send requests to mayor’s offices.
  • Control:
    • Intervention 1: No access to TDP app in control municipalities.
    • Intervention 2: No access to TDP app at control schools.
    • Pure control (no intervention); controls for external factors and placebo effects associated with knowing about the existence of the app.
  • Units:
    • Intervention 1: Municipality level.
    • Intervention 2: School level.
    • Outcomes measured at the same level (municipality and school, respectively).
  • Compliance:
    • “Taking” the intervention means the municipality/school had access to the app.
    • Compliance is whether the app was accessible, not level of engagement.
  • Non-Compliance:
    • Potential for users in the control group to download the app, but the data analysis accounts for this.
    • No concern about those in the treatment group not using the app, as participation is voluntary.
  • Ethics:
    • Control is status quo, so no worse than current conditions.
    • No coercion as app is voluntary.

Section 4: Outcomes and Covariates

  • Primary Outcome: Completion status of school construction projects.
  • Measurement:
    • Percentage of project completed (continuous).
    • Difference in completion before/after intervention (continuous).
    • Finished construction (binary).
    • Cancelled construction (binary).
    • Number of updated completion dates by firms (count).
    • Data comes from the Ministry of Education’s SIMEC platform, and was accessed via their API.
  • Priors: Prior studies showed variation in project completion rates, with delays common.
  • Validity and measurement error:
    • Data is based on official government records, reducing untruthful reporting concerns.
    • Potential for time lags in updating information, but not a major concern for the study design.
  • Covariates:
    • Municipal population, poverty, federal transfers, primary and secondary school quality.
    • Data is from the Brazilian Ministry of Education and the Brazilian Census.
    • No additional outcomes or covariates were collected to distinguish between explanations and alternatives; however, the randomisation at the municipality/school level is designed to account for these biases.
  • Ethics:
    • Data collection from SIMEC data is minimal in burden and has clear benefits for public policy and accountability.

Section 5: Randomisation

  • Randomisation strategy:
    • Intervention 1: Cluster randomisation at the municipal level.
    • Intervention 2: Blocked randomisation at the school level.
  • Blocks:
    • Intervention 1: No blocks.
    • Intervention 2: Blocks created based on the strata (Brazilian states, construction status, and municipal spending median).
  • Clusters:
    • Intervention 1: Clusters are municipalities. 344 control clusters and 2642 treatment clusters.
    • Intervention 2: Not clustered.
    • Randomising at the individual level was not feasible.
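
As a rough sketch of what these two procedures look like in randomizr (the identifiers below are simulated stand-ins, not the study’s actual data):

library(randomizr)
set.seed(385)

# Hypothetical identifiers: 100 municipalities with 5 schools each
municipality <- rep(paste0("muni_", 1:100), each = 5)
state <- sample(c("SP", "BA", "MG"), 500, replace = TRUE)
status <- sample(c("ongoing", "stalled"), 500, replace = TRUE)
high_spend <- sample(0:1, 500, replace = TRUE)

# Intervention 1: cluster randomisation at the municipality level
Z1 <- cluster_ra(clusters = municipality, m = 50)

# Intervention 2: blocked randomisation at the school level, with blocks
# built from state, construction status, and a spending median split
Z2 <- block_ra(blocks = interaction(state, status, high_spend))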

Section 6: Analysis

  • Estimator:
    • Linear Regression Model with a treatment indicator and control variables.
  • Standard Errors:
    • Cluster-robust standard errors at the municipality level for intervention 1.
    • Cluster-robust standard errors at the school level for intervention 2.
  • Test:
    • Randomisation Inference tests are employed in addition to standard t-tests for p-values
  • Missing Data:
    • Missing data is not a significant concern.
  • Effect size:
    • Expected effect size is based on prior studies and theoretical expectations.
    • Minimum effect sizes are not determined a priori.
    • Similar studies had mixed effect sizes.
  • Power:
    • Sample sizes were chosen to detect effect sizes that are likely to have practical implications on policy.
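
A minimal sketch of this estimation strategy for Intervention 1, using fabricated data as a stand-in for the study’s (all variable names here are hypothetical):

library(fabricatr)
library(randomizr)
library(estimatr)
set.seed(385)

# Fabricated stand-in: 100 municipalities (50 treated), 5 schools each
tdp <- fabricate(
  municipality = add_level(N = 100, treat = complete_ra(N, m = 50)),
  school = add_level(N = 5,
                     completion = round(50 + 10 * treat + rnorm(N, sd = 15)))
)

# Treatment indicator with cluster-robust standard errors
# at the municipality level
m1 <- lm_robust(completion ~ treat, clusters = municipality, data = tdp)
summary(m1)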

Section 7: Implementation

  • Randomisation:
    • Randomisation was conducted on a computer.
  • Implementation:
    • Transparência Brasil implemented the intervention by creating the app and by making it available in a subset of municipalities/schools.
    • No direct dangers to the research team or enumerators, as all data was collected through APIs and public databases.
    • Implementation was tracked via app downloads and user sessions.
  • Compliance:
    • Compliance measured by analysing whether the app was available for use in a given location.
    • All compliance data came from the app store and from usage logs.
  • Data management:
    • Data was stored securely on cloud servers.
    • Data was anonymised by using city/school identifiers rather than users’ personal information.
    • Publicly available data was collected directly via APIs or downloaded directly from the ministry’s site.

Results

Randomisation inference! 🤓

…and that’s it! 😊

Thank you, and see you soon! 🙏🏼