QTM 385 - Experimental Methods

Lecture 05 - Sampling Distributions, Statistical Inference, and Hypothesis Testing

Danilo Freire

Emory University

Hello, everyone! 👋
How are you doing today? 😁

Brief recap 📚

Last time, we saw that…

  • Selection bias arises when treatment groups differ initially
  • Balance tests assess pre-treatment group comparability
    • Are they useful? Sometimes. Do they solve the problem? No. 😂
  • Causal diagrams (DAGs) illustrate variable relationships and biases
    • Useful for identifying confounders and mediators
  • Loss to follow-up can introduce bias in treatment effects
  • Survivorship bias focuses on successful outcomes, ignoring failures
  • Post-treatment bias occurs when controlling for variables affected by treatment
  • Standard deviation quantifies the spread of data
  • Standard error measures the precision of sample means (or other test statistics)
  • Regression models are the most convenient way to estimate treatment effects
    • They can handle multiple covariates and interactions
  • DeclareDesign is a package for designing and analysing experiments
    • It helps you through the entire process! 😎

Today, we will discuss…

  • Regression analysis in R (from last week)
  • Sampling distributions of experimental estimates
  • The importance of randomisation inference
  • Hypothesis testing
    • Sharp null hypothesis
    • Null hypothesis of no average treatment effect
    • Confidence intervals
  • But first… interesting experiment of the day! 🎉
  • And a few ideas about your pre-analysis plans and final project 😊

Source: Alex Coppock

Interesting experiment of the day! 🎉

Source: Knittel et al (2024)

Pre-analysis plan and final project 😊

  • Now we have the final list of students enrolled in the course!
  • So we can start thinking about your project 🤓
  • It will be a group project, with 3-4 students per group
  • You will submit a pre-analysis plan, I will create a dataset for you to work on, and you will write a report
  • I also have a few questions for you (don’t worry, they’re easy! 😂)

Regression analysis in R (recap) 📊

Loading the packages and simulating the data

# Load the required packages
library(fabricatr)
library(estimatr)
library(randomizr)

# Set the seed for reproducibility
set.seed(385)

data2 <- fabricate(
  N = 1000,
  treat = complete_ra(N, m = 500),
  age = round(rnorm(N, mean = 30, sd = 5)),
  education = round(rnorm(N, mean = 12, sd = 2)),
  interviews = round(rnorm(N, mean = 10, sd = 2) + 5 * treat)
)

head(data2)
    ID treat age education interviews
1 0001     0  32        11          9
2 0002     1  30        13         16
3 0003     1  31        12         16
4 0004     1  33        13         15
5 0005     0  27        15          7
6 0006     0  33        12          9

Adding covariates

  • We can add covariates to experimental models to increase precision
  • The same way we do in any regression:
    • \(Y_i = \alpha + \beta \text{T}_i + \gamma \text{X}_i + \epsilon_i\)
  • Freedman (2008) demonstrated that conventional covariate adjustment can bias estimates of average treatment effects, mainly in small samples
  • Lin (2013) showed that centring the covariates and interacting them with the treatment indicator addresses this critique and cannot hurt asymptotic precision
  • This is equivalent to adjusting for the covariates separately within the treatment and control groups:
    • \(Y_i = \alpha + \beta \text{T}_i + \gamma \text{W}_i + \delta \text{T}_i \times \text{W}_i + \epsilon_i\)
    • Where \(W_i = \text{X}_i - \bar{\text{X}}\) and \(\bar{\text{X}}\) is the mean of \(\text{X}_i\)
    • Centring the covariates at zero means \(\beta\) can still be read as the average treatment effect
  • Anyway, worry not! 😅 Our friends at DeclareDesign have done all the work for us and created a function called lm_lin() that does everything automatically!

Estimating the second model

model2 <- lm_lin(interviews ~ treat,
                covariates = ~ age + education,
                data = data2)
summary(model2)

Call:
lm_lin(formula = interviews ~ treat, covariates = ~age + education, 
    data = data2)

Standard error type:  HC2 

Coefficients:
                  Estimate Std. Error  t value   Pr(>|t|)  CI Lower CI Upper
(Intercept)        9.88659    0.09213 107.3127  0.000e+00  9.705804 10.06738
treat              5.13084    0.12546  40.8956 3.281e-215  4.884642  5.37704
age_c              0.02845    0.01771   1.6062  1.085e-01 -0.006307  0.06320
education_c       -0.04550    0.04721  -0.9636  3.355e-01 -0.138150  0.04716
treat:age_c       -0.03324    0.02546  -1.3054  1.921e-01 -0.083206  0.01673
treat:education_c  0.04324    0.06140   0.7043  4.814e-01 -0.077248  0.16373
                   DF
(Intercept)       994
treat             994
age_c             994
education_c       994
treat:age_c       994
treat:education_c 994

Multiple R-squared:  0.6272 ,   Adjusted R-squared:  0.6254 
F-statistic: 344.4 on 5 and 994 DF,  p-value: < 2.2e-16
  • The lm_lin() function is similar to the lm_robust() function, but it includes the covariates argument
  • As you can see, the results are pretty similar to the previous model
  • The treatment effect is slightly higher, and the standard error is slightly smaller
  • The _c part of the variable names indicates that the variables are centred
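
To see what lm_lin() does under the hood, here is a minimal sketch that centres the covariates by hand and fits the interacted model with lm_robust(); the coefficients should reproduce the output above:

# Centre the covariates by hand
data2$age_c <- data2$age - mean(data2$age)
data2$education_c <- data2$education - mean(data2$education)

# Interact the centred covariates with the treatment indicator;
# this is the Lin (2013) estimator that lm_lin() automates
model2_manual <- lm_robust(
  interviews ~ treat * (age_c + education_c),
  data = data2)
summary(model2_manual)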

Sub-group analysis

  • We can also estimate the treatment effect for sub-groups of the population
  • This is useful when we suspect that the treatment effect may vary across known dimensions
  • For instance, we can estimate the treatment effect for people with high and low levels of education
  • We can do this by including an interaction term between the treatment variable and the covariate of interest
  • Interpretation (see the output below): the treatment effect for the high-education subgroup (high_edu = 1) is 0.097 larger than for the low-education subgroup
  • Total treatment effect for high_edu = 1: 5.101 + 0.097 = 5.198
# Create a binary subgroup (e.g., "high" vs. "low" education)
data2$high_edu <- ifelse(data2$education > median(data2$education), 1, 0)

# Fit an interaction model with covariates
model_interaction <- lm_robust(
  interviews ~ treat * high_edu + age + education,
  data = data2)

# Summarize results
summary(model_interaction)

Call:
lm_robust(formula = interviews ~ treat * high_edu + age + education, 
    data = data2)

Standard error type:  HC2 

Coefficients:
               Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
(Intercept)     9.39778    0.69048 13.6104  8.255e-39  8.04281 10.75275 994
treat           5.10068    0.16905 30.1726 1.682e-142  4.76894  5.43241 994
high_edu       -0.26179    0.24257 -1.0792  2.808e-01 -0.73780  0.21422 994
age             0.01141    0.01276  0.8938  3.716e-01 -0.01363  0.03644 994
education       0.02033    0.05026  0.4046  6.859e-01 -0.07829  0.11895 994
treat:high_edu  0.09721    0.25093  0.3874  6.985e-01 -0.39521  0.58962 994

Multiple R-squared:  0.6269 ,   Adjusted R-squared:  0.625 
F-statistic: 341.1 on 5 and 994 DF,  p-value: < 2.2e-16

Hypothesis Testing 👩🏻‍🔬

What is a hypothesis test? 🤔

  • A hypothesis test is an evaluation of a particular hypothesis about the population distribution
  • Statistical thought experiments:
    • Assume we know (part of) the true data-generating process (DGP)
    • Use tools of probability to see what types of data we should see under this assumption
    • Compare our observed data to this thought experiment.
  • Statistical proof by contradiction:
    • We assume the null hypothesis is true
    • We calculate the probability of observing our data under this assumption
    • If this probability is very low, we reject the null hypothesis
  • Null hypothesis (\(H_0\)): A proposed value for a population parameter
    • This is usually “no effect/difference/relationship”
    • We denote this as \(H_0: \theta = \theta_0\)
    • For instance, information about user ratings does not affect cancellation rates: \(H_0: \mu_y - \mu_n = 0\)
  • Alternative hypothesis (\(H_1\)): A different value for the population parameter, which we are interested in
    • This is usually “there is an effect”
    • \(H_1: \theta \neq \theta_0\)
    • For instance, information about user ratings does affect cancellation rates: \(H_1: \mu_y - \mu_n \neq 0\)
  • Always mutually exclusive

General framework for hypothesis testing

  • A hypothesis test chooses whether or not to reject the null hypothesis based on the data we observe.
  • Rejection based on a test statistic, \(T_n = T(Y_1, ..., Y_n)\)
    • Will help us adjudicate between the null and the alternative
    • Typically, larger values of \(T_n\) make the null less plausible
  • The rejection region, \(R\), contains the values of \(T_n\) for which we reject the null.
    • These are the areas that indicate that there is evidence against the null.
  • Two-sided alternative (our focus):
    • \(H_0 : \mu_y - \mu_x = 0\) and \(H_a : \mu_y - \mu_x \neq 0\)
    • Implies that \(T_n >> 0\) or \(T_n << 0\) will be evidence against the null
    • Rejection regions: \(|T_n| > c\) for some value \(c\)
  • How to determine these regions?

Type I and Type II errors

  • A Type I error is when we reject the null hypothesis when it is in fact true
    • We say that information has an effect when it does not
    • A false discovery (very bad, thus type I)
  • A Type II error is when we fail to reject the null hypothesis when it is false
    • We say that information has no effect when it does
    • A missed opportunity (not as bad, thus type II)
  • The probability of a Type I error is denoted by \(\alpha\)
  • Choose a level of significance \(\alpha\)
    • Convention in social sciences is \(\alpha = 0.05\), but nothing magical there
    • Particle physicists at CERN use \(\alpha \approx \frac{1}{1,750,000}\)
    • Lower values of \(\alpha\) guard against “flukes” but increase barriers to discovery
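
As a quick illustration, the two-sided critical value implied by a given \(\alpha\) comes straight from the normal quantile function. A minimal sketch in R:

# Two-sided critical values implied by different significance levels
alpha <- c(0.10, 0.05, 0.01, 1 / 1750000)
round(qnorm(1 - alpha / 2), 2)
# ~ 1.64, 1.96, 2.58, and about 5.00 (the "five sigma" convention)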

Source: Scribbr

Type I and Type II errors 😂

Hypothesis testing steps

  1. Choose null and alternative hypotheses (e.g., \(H_0 : \mu_y - \mu_x = 0\) vs. \(H_a : \mu_y - \mu_x \neq 0\)).
  2. Choose a test statistic (e.g., \(T_n = \frac{\hat{D}_n}{se[\hat{D}_n]}\)).
  3. Choose a significance level, \(\alpha\) (e.g., \(\alpha = 0.05\)).
  4. Determine the rejection region (e.g., \(|T_n| > 1.96\)).
  5. Reject the null hypothesis if the test statistic falls within the rejection region; otherwise, fail to reject.
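
To make these steps concrete, here is a minimal sketch in R that walks through all five for a difference in means (the data are simulated, so the numbers are purely illustrative):

# Step 1: H0: mu_y - mu_x = 0 vs. Ha: mu_y - mu_x != 0
set.seed(385)
y <- rnorm(100, mean = 10.5, sd = 2)  # simulated treatment group
x <- rnorm(100, mean = 10.0, sd = 2)  # simulated control group

# Step 2: test statistic T_n = D_hat / se(D_hat)
d_hat <- mean(y) - mean(x)
se_hat <- sqrt(var(y) / length(y) + var(x) / length(x))
t_n <- d_hat / se_hat

# Steps 3 and 4: alpha = 0.05 implies the rejection region |T_n| > 1.96
alpha <- 0.05
c_val <- qnorm(1 - alpha / 2)

# Step 5: reject H0 if the statistic falls in the rejection region
abs(t_n) > c_val
2 * (1 - pnorm(abs(t_n)))  # the corresponding p-value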

Rejection regions

  • The rejection region is determined by the critical values, \(c\), which are defined by the significance level, \(\alpha\) (e.g., \(|T_n| > 1.96\) for \(\alpha = 0.05\) under the normal approximation)

Average treatment effect (ATE) hypothesis testing

  • We have already seen one way to test hypotheses using the potential outcomes framework
    • We estimate the average treatment effect (ATE) and see if the confidence interval includes zero
  • This is known as the Jerzy Neyman approach to hypothesis testing
  • As potential outcomes are unobservable, we use the observed data to estimate the ATE after randomisation
  • Neyman had such a huge impact on experimental methods that many scholars call the potential outcomes framework the Neyman-Rubin model.
    • See here for more details, and here for a paper by Rubin on the importance of Neyman’s work
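
As a quick sketch of this approach, we can reuse the simulated data2 from the regression recap: estimatr’s difference_in_means() reports the estimated ATE with its confidence interval, and rejecting whenever the 95% CI excludes zero is the Neyman-style test at \(\alpha = 0.05\):

library(estimatr)

# Neyman approach: estimate the ATE and check whether the 95%
# confidence interval includes zero (data2 was simulated earlier)
difference_in_means(interviews ~ treat, data = data2)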

Sharp null hypothesis 🧐

  • But there is another interesting way to test hypotheses!
  • The sharp null hypothesis specifies the treatment effect for every single unit in the population
    • The most common version is that the treatment has no effect for anyone: \(Y_i(1) = Y_i(0)\) for all \(i\)
    • This is a very strong assumption, but under it every missing potential outcome is known, so we can compute the exact distribution of our test statistic
    • It is also known as the Ronald Fisher approach to hypothesis testing
  • It is based on \(p\)-values and sampling distributions
  • For a comprehensive discussion of the sharp null hypothesis, see Imbens and Rubin (2015)
  • Let’s see how it works! 😊

Hypothesis testing with the sharp null hypothesis

Randomisation inference

  • Sampling variability is a crucial topic in experimental design and analysis
  • A single experiment provides just one of many possible datasets generated by random assignment
  • The estimate of the average treatment effect can vary substantially depending on the random assignment
  • The sampling distribution (or randomisation distribution) refers to the collection of estimates from every possible random assignment
  • This hypothesis testing approach is broadly applicable, not limited by sample size or outcome distribution, and can be used with counts, durations, or ranks
  • In theory, this calculation can be done for any experiment size, but the number of possible random assignments becomes very large as \(N\) increases
  • For example, an experiment with \(N = 50\) and half assigned to treatment has more than 126 trillion randomisations 😮
    • \(\frac{50!}{25! \times 25!} = 126,410,606,437,752\)
  • But we can approximate the sampling distribution by sampling from all possible random assignments
  • Calculating p-values based on an inventory of possible randomisations is called randomisation inference
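
To fix ideas before introducing a dedicated package, here is a hand-rolled sketch using the seven-village example from Gerber and Green (2012) that we analyse more formally below:

# Outcomes and observed assignment (2 of 7 units treated)
Y <- c(15, 15, 20, 20, 10, 15, 30)
T_obs <- c(1, 0, 0, 0, 0, 0, 1)

# Observed difference in means
obs_est <- mean(Y[T_obs == 1]) - mean(Y[T_obs == 0])  # 6.5

# Under the sharp null the outcomes are fixed, so we can enumerate
# all choose(7, 2) = 21 possible assignments and re-estimate each time
pairs <- combn(7, 2)
ri_ests <- apply(pairs, 2, function(idx) {
  T_sim <- as.integer(seq_along(Y) %in% idx)
  mean(Y[T_sim == 1]) - mean(Y[T_sim == 0])
})

# Two-tailed p-value: share of assignments at least as extreme as observed
mean(abs(ri_ests) >= abs(obs_est))  # 8/21 = 0.381

# Enumeration quickly becomes infeasible, which is why we sample instead
choose(50, 25)  # 126,410,606,437,752 possible assignments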

Randomisation inference in R

  • There is an R package for that! 😊
  • The ri2 package provides tools for randomisation inference, and it was made by the same people who created the estimatr package
  • You can find more information about the package here
  • Just install it with install.packages("ri2") and load it with library(ri2)
  • The package supports all the randomisation procedures that can be described by the randomizr package
  • Let’s see how it works!
  • Another package that I found useful (and broadly similar to ri2) is the ritest package, available on GitHub: https://github.com/grantmcdermott/ritest

Example

Do female council heads allocate more resources to water sanitation?

  • Gerber and Green (2012) describe a hypothetical experiment in which 2 of 7 villages are assigned a female council head and the outcome is the share of the local budget allocated to water sanitation
  • Their table 2.2 describes one way the experiment could have come out
library(ri2)
library(estimatr)

# Create the data
table_2_2 <- data.frame(T = c(1, 0, 0, 0, 0, 0, 1),
                        Y = c(15, 15, 20, 20, 10, 15, 30))

summary(lm_robust(Y ~ T, data = table_2_2))

Call:
lm_robust(formula = Y ~ T, data = table_2_2)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept)     16.0      1.871  8.5524  0.00036    11.19    20.81  5
T                6.5      7.730  0.8409  0.43876   -13.37    26.37  5

Multiple R-squared:  0.2485 ,   Adjusted R-squared:  0.09824 
F-statistic: 0.7071 on 1 and 5 DF,  p-value: 0.4388

Randomisation inference

# Declare randomisation procedure
declaration <- declare_ra(N = 7, m = 2)

# Conduct Randomisation Inference
ri2_out <- conduct_ri(
  formula = Y ~ T,
  assignment = "T",
  declaration = declaration,
  sharp_hypothesis = 0,
  data = table_2_2,
  sims = 1000
)

summary(ri2_out)
  term estimate two_tailed_p_value
1    T      6.5          0.3809524
plot(ri2_out)

Another example!

Effect of not having a runoff in sub-Saharan Africa

  • The table below, from Glynn and Ichino (2015), examines the relationship between the absence of a runoff election (\(A_i = 1\)) and harassment of opposition parties (\(Y_i\)).
  • Data was collected from 10 sub-Saharan African countries
  • The study suggests that without runoff elections, incumbents can win with a simple plurality, reducing their need to court smaller parties
  • This creates incentives to suppress turnout through intimidation
  • Conversely, with a runoff (\(A_i = 0\)), incumbents need broader support and are more likely to engage smaller parties rather than intimidate them
Unit \(Y_i(0)\) \(Y_i(1)\) \(A_i\) \(Y_i\)
Cameroon ? 1 1 1
Kenya ? 1 1 1
Malawi ? 1 1 1
Nigeria ? 1 1 1
Tanzania ? 0 1 0
Congo 0 ? 0 0
Madagascar 0 ? 0 0
Central African Republic 0 ? 0 0
Ghana 0 ? 0 0
Guinea-Bissau 0 ? 0 0

Source: Blackwell (2013)

Sharp null hypothesis

  • Under the sharp null of no effect for any unit, each missing potential outcome equals the unit’s observed outcome, so the full table can be filled in:

Unit \(Y_i(0)\) \(Y_i(1)\) \(A_i\) \(Y_i\)
Cameroon 1 1 1 1
Kenya 1 1 1 1
Malawi 1 1 1 1
Nigeria 1 1 1 1
Tanzania 0 0 1 0
Congo 0 0 0 0
Madagascar 0 0 0 0
Central African Republic 0 0 0 0
Ghana 0 0 0 0
Guinea-Bissau 0 0 0 0
gi_data <- data.frame(Y = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
                      A = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0))

declaration <- declare_ra(N = 10, m = 5) 

ri2_out <- conduct_ri(
  formula = Y ~ A,
  assignment = "A",
  declaration = declaration,
  sharp_hypothesis = 0,
  data = gi_data,
  sims = 1000
)

summary(ri2_out)
  term estimate two_tailed_p_value
1    A      0.8         0.04761905

So far, so good? 😃

Let’s discuss a real experiment! 🤓

Bottom-up accountability and public service provision: Evidence from a field experiment in Brazil

By yours truly! 😊 https://doi.org/10.1177/2053168020914444

Outline of the paper: Introduction

Based on the EGAP Research Design Form

  • Researcher names: Danilo Freire, Manoel Galdino, and Umberto Mignozzetti
  • Research project title: Bottom-up accountability and public service provision: Evidence from a field experiment in Brazil
  • One sentence summary of your specific research question: Does a mobile phone application that enables citizen monitoring of school construction projects improve public service delivery?
  • General motivation: Accountability systems are crucial for efficient public service provision, and there is a belief that bottom-up monitoring can help with the principal-agent dilemma
  • Why should someone who is not an academic care about the results of this research?: How technology can help citizens ensure government accountability in young democracies
  • Theoretical motivation: Debate between studies showing the positive impacts of information-based interventions and those that find little evidence
  • Primary hypotheses:
    • Key parameters/estimands: The key parameters/estimands are the impacts of the TDP app on: (1) percentage of project completed; (2) difference in completion before and after the intervention; (3) whether the construction is finished; (4) whether the construction is cancelled; and (5) the number of updated dates.
    • Predicted sign/magnitude: The primary hypothesis is that the TDP app will have a positive effect on (1), (2), (3), and (5), leading to higher completion rates, and a negative effect on (4), leading to fewer cancellations.
    • Logic or theory of change: Providing citizens with information about school construction projects empowers them to pressure officials, leading to faster and better completion.

Outline of the paper: Introduction

Based on the EGAP Research Design Form

  • Alternative explanations if results are consistent with hypotheses:

    • Alternative theories: Increased community engagement and public awareness, which generate other forms of engagement and pressure on officials. Increased media coverage of school construction could also coincide with the programme
    • Hypothesis for alternative outcome: With an increase in media attention regarding public works, we would expect an overall improvement in public services, not a change that is specific to the treatment group.

  • Alternative explanations if results are inconsistent with hypotheses:

    • Alternative theories: Difficulties in differentiating political corruption from budget issues; the lack of political pressure; the lack of trust in representatives
    • Hypothesis for alternative outcome: Citizens may not trust their representatives and think the problems are related to austerity instead of corruption

Section 2: Population and Sample

  • Population of interest: Brazilian municipalities & citizens; school construction projects.
  • Where and when: Brazil; Intervention 1: Aug 2017-July 2018; Intervention 2: Aug 2018-July 2019.
  • Context match?: Matches, but economic crisis & electoral cycle are specific conditions.
  • Sample size:
    • Intervention 1: 344 control, 2642 treatment municipalities.
    • Intervention 2: 659 control, 3717 treatment schools.
  • Sample selection:
    • Intervention 1: Random assignment of municipalities.
    • Intervention 2: Random assignment of schools, stratified by state, construction status, spending.
  • Consent
    • Obtained?: No. App use was voluntary, data focused on public works.
    • Vulnerable population?: No coercion risk; participation was voluntary.
  • Ethics:
    • Power sufficient?: Yes. Sample size provides power to detect plausible effects.
    • Size necessary?: Yes. Sample was determined by power analysis, not unnecessarily large.
    • Risk of targeting?: No. Results inform general policy, not targeted at individuals.

Section 3: Intervention

  • Status Quo:
    • School construction projects in Brazil often face delays.
    • No prior system for citizens to directly report problems and pressure authorities in this way.
  • Intervention:
    • Mobile app (Tá de Pé - TDP) enabling citizens to monitor school construction sites.
    • Features: submit photos, check status, send requests to mayor’s offices.
  • Control:
    • Intervention 1: No access to TDP app in control municipalities.
    • Intervention 2: No access to TDP app at control schools.
    • Pure control (no intervention); controls for external factors and placebo effects associated with knowing about the existence of the app.
  • Units:
    • Intervention 1: Municipality level.
    • Intervention 2: School level.
    • Outcomes measured at the same level (municipality and school, respectively).
  • Compliance:
    • “Taking” the intervention means the municipality/school had access to the app.
    • Compliance is whether the app was accessible, not level of engagement.
  • Non-Compliance:
    • Potential for users in the control group to download the app, but the data analysis accounts for this.
    • No concern about those in the treatment group not using the app, as participation is voluntary.
  • Ethics:
    • Control is status quo, so no worse than current conditions.
    • No coercion as app is voluntary.

Section 4: Outcomes and Covariates

  • Primary Outcome: Completion status of school construction projects.
  • Measurement:
    • Percentage of project completed (continuous).
    • Difference in completion before/after intervention (continuous).
    • Finished construction (binary).
    • Cancelled construction (binary).
    • Number of updated completion dates by firms (count).
    • Data comes from the Ministry of Education’s SIMEC platform, and was accessed via their API.
  • Priors: Prior studies showed variation in project completion rates, with delays common.
  • Validity and measurement error:
    • Data is based on official government records, reducing untruthful reporting concerns.
    • Potential for time lags in updating information, but not a major concern for the study design.
  • Covariates:
    • Municipal population, poverty, federal transfers, primary and secondary school quality.
    • Data is from the Brazilian Ministry of Education and the Brazilian Census.
    • No additional outcomes or covariates were collected to distinguish between explanations and alternatives; however, the randomisation at the municipality/school level is designed to account for these biases.
  • Ethics:
    • Data collection from SIMEC data is minimal in burden and has clear benefits for public policy and accountability.

Section 5: Randomisation

  • Randomisation strategy:
    • Intervention 1: Cluster randomisation at the municipal level.
    • Intervention 2: Blocked randomisation at the school level.
  • Blocks:
    • Intervention 1: No blocks.
    • Intervention 2: Blocks created based on the strata (Brazilian states, construction status, and municipal spending median).
  • Clusters:
    • Intervention 1: Clusters are municipalities. 344 control clusters and 2642 treatment clusters.
    • Intervention 2: Not clustered.
    • Randomising at the individual level was not feasible.
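
As a rough sketch of what these two procedures look like in randomizr (the identifiers below are simulated stand-ins, not the study’s actual data):

library(randomizr)
set.seed(385)

# Hypothetical identifiers: 100 municipalities with 5 schools each
municipality <- rep(paste0("muni_", 1:100), each = 5)
state <- sample(c("SP", "BA", "MG"), 500, replace = TRUE)
status <- sample(c("ongoing", "stalled"), 500, replace = TRUE)
high_spend <- sample(0:1, 500, replace = TRUE)

# Intervention 1: cluster randomisation at the municipality level
Z1 <- cluster_ra(clusters = municipality, m = 50)

# Intervention 2: blocked randomisation at the school level, with blocks
# built from state, construction status, and a spending median split
Z2 <- block_ra(blocks = interaction(state, status, high_spend))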

Section 6: Analysis

  • Estimator:
    • Linear Regression Model with a treatment indicator and control variables.
  • Standard Errors:
    • Cluster-robust standard errors at the municipality level for intervention 1.
    • Cluster-robust standard errors at the school level for intervention 2.
  • Test:
    • Randomisation Inference tests are employed in addition to standard t-tests for p-values
  • Missing Data:
    • Missing data is not a significant concern.
  • Effect size:
    • Expected effect size is based on prior studies and theoretical expectations.
    • Minimum effect sizes are not determined a priori.
    • Similar studies had mixed effect sizes.
  • Power:
    • Sample sizes were chosen to detect effect sizes that are likely to have practical implications on policy.
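
A minimal sketch of this estimation strategy for Intervention 1, using fabricated data as a stand-in for the study’s (all variable names here are hypothetical):

library(fabricatr)
library(randomizr)
library(estimatr)
set.seed(385)

# Fabricated stand-in: 100 municipalities (50 treated), 5 schools each
tdp <- fabricate(
  municipality = add_level(N = 100, treat = complete_ra(N, m = 50)),
  school = add_level(N = 5,
                     completion = round(50 + 10 * treat + rnorm(N, sd = 15)))
)

# Treatment indicator with cluster-robust standard errors
# at the municipality level
m1 <- lm_robust(completion ~ treat, clusters = municipality, data = tdp)
summary(m1)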

Section 7: Implementation

  • Randomisation:
    • Randomisation was conducted on a computer.
  • Implementation:
    • Transparência Brasil implemented the intervention by creating the app and by making it available in a subset of municipalities/schools.
    • No direct dangers to the research team or enumerators, as all data was collected through APIs and public databases.
    • Implementation was tracked via app downloads and user sessions.
  • Compliance:
    • Compliance measured by analysing whether the app was available for use in a given location.
    • All compliance data came from the app store and from usage logs.
  • Data management:
    • Data was stored securely on cloud servers.
    • Data was anonymised by using city/school identifiers rather than users’ personal information.
    • Publicly available data was collected directly via APIs or downloaded directly from the ministry’s site.

Results

Randomisation inference! 🤓

…and that’s it! 😊

Thank you, and see you soon! 🙏🏼