QTM 385 - Experimental Methods

Lecture 07 - Blocking and Clustering

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hi there!
Hope all is well 😉

Brief recap 📚

Last time, we…

  • Discussed Kalla and Broockman (2015): Campaign Contributions Facilitate Access to Congressional Officials: A Randomized Field Experiment
  • They conducted a field experiment demonstrating that informing congressional offices that prospective meeting attendees were donors made securing a meeting with senior policymakers 3-4 times more likely
  • Their treatment was a simple email sent to congressional offices with information about the meeting attendees (donors vs. local constituents)
  • The experiment also had a good theoretical grounding, as it was based on the idea that money distorts political representation
  • The results indicate this is true, at least in the American context (at the time of the experiment)

Last time, we…

  • The findings replicate elsewhere, too: for instance, applicants with German-sounding names received 14% more callbacks than those with Turkish-sounding names (Kaas and Manger 2012)
  • Yemane and Fernández-Reino (2019) find that Latinos are also discriminated against in the US labor market, but not in Spain. Why is that?

Today, we will…

  • Understand how to improve our experiments by using blocking
  • Learn why blocking reduces variance and increases precision
  • See how blocking solves some practical issues in field experiments
  • Learn about, and how to deal with, clustering and intra-cluster correlation
  • Understand why clustering increases variance and reduces statistical power
  • Discuss the differences between blocking and clustering
  • But first, let’s talk about your group work again!

Group project 👥

Group project

  • Thanks to everyone who has emailed me with their group preferences 😉
  • If you haven’t emailed me yet, I will soon assign you to a group (randomly)
  • Group numbers have also been randomised to avoid any selection bias! 😂
  • I will also create the groups on Canvas in case you need to refer to it later
  • Please let me know if you have already formed a group and want to keep it!

Questions?

Group project 🤝

Next steps (from our previous lecture)

  • I’ll give you some time to discuss your ideas and start writing your plan
  • What do you think about sending this to me next week/in two weeks?
    • Submit at most 2 paragraphs summarising an experiment that you want to develop in this course. At minimum, your summary should include a research question, why the question is important, and a rough sketch of how you plan to answer the question
  • In two weeks:
    • Write a title and abstract for a paper you imagine writing based on your proposed experiment. Assume that your findings align with your theoretical predictions. Remember to establish why the findings matter for your intended audience
  • In three weeks:
    • Outline your pre-analysis plan. Your outline should include sections on the research question, the experimental design, the data you will collect, and the analysis you will conduct

Blocking and Clustering

What is blocking?

  • Blocking is a procedure that involves grouping experimental units based on certain characteristics
  • These groups are called blocks or strata and are formed based on variables that are expected to affect the outcome of the experiment (heterogeneous treatment effects)
  • Within each block, units are randomly assigned to treatment or control groups
  • So we have “experiments within experiments”!
  • This approach helps to ensure that the treatment and control groups are comparable within each block, reducing the potential for confounding variables to affect the results

Why is blocking important?

  • Blocking can also help us ensure that an equal number of people from each group are assigned to the treatment and control groups
  • For instance, imagine an experiment that includes 20 people, 10 men and 10 women
  • If we randomly assign people to the treatment and control groups, we might end up with 7 or 8 women in the treatment group, or even a treatment group made up only of men (although the chance of this happening is quite low)
  • Blocking removes this risk entirely
  • And as groups will be more homogeneous, blocking also reduces variance and increases precision
  • “Block what you can, and randomise what you cannot” (Gerber and Green 2012, p. 110)
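  • As a minimal sketch, blocked assignment guarantees this balance; here using the randomizr package (the gender labels and sizes are illustrative):
# Blocked random assignment with randomizr
library(randomizr)
gender <- rep(c("man", "woman"), each = 10)
# Randomise to treatment/control separately within each gender block
Z <- block_ra(blocks = gender)
# Exactly 5 treated and 5 control in each block
table(gender, Z)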

One or many blocks?

  • With enough time and resources, we could create a block for every possible characteristic that might affect the outcome of the experiment
  • But this would be impractical and unnecessary
  • Instead, we should focus on the most important characteristics that are likely to have the greatest impact on the outcome
  • But how do we know which characteristics are important if we haven’t run the experiment yet?
  • Two strategies:
    • Use previous research (quantitative or qualitative) to identify important characteristics
    • Pilot the experiment with a small sample to test for important characteristics

How does blocking help?

  • Let’s say we are interested in testing the effect of a new drug on blood pressure
  • We know that age is an important factor that affects blood pressure, so we decide to block our sample by age
  • To simplify, we create two blocks: one for people under 50 and another for people over 50
  • Imagine that we have 12 people in our sample (\(N = 12\)) and want to assign 6 of them (\(m = 6\)) to the treatment group and 6 to the control group
  • How many possible ways are there to assign people to the treatment and control groups?
# Random assignment
choose(12, 6)
[1] 924
# Two blocks with 6 people each
# 3 in the treatment group
choose(6, 3) * choose(6, 3)
[1] 400

How does blocking help?

  • The assignments that are ruled out are those in which too many or too few units in a block are assigned to treatment
  • Those “extreme” assignments produce estimates that are in the tails of the sampling distribution
  • The figure shows the sampling distribution of the difference-in-means estimator under complete random assignment
  • The histogram is shaded according to whether the particular random assignment is permissible under a procedure that blocks on the binary covariate \(X\) (age, in our case)
  • After many simulations, we see that blocking rules out by design those assignments that are not well balanced
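  • A minimal simulation sketch (with illustrative numbers) makes the variance reduction visible:
# Compare the spread of the difference-in-means estimator
# under complete vs blocked random assignment
set.seed(385)
X  <- rep(c(0, 1), each = 6)    # binary covariate (e.g. under/over 50)
y0 <- rnorm(12, mean = 10 * X)  # control outcomes depend on X
y1 <- y0 + 2                    # constant treatment effect of 2
dim_est <- function(Z) mean(y1[Z == 1]) - mean(y0[Z == 0])
# Complete randomisation: any 6 of the 12 can be treated
complete <- replicate(5000, dim_est(sample(rep(0:1, 6))))
# Blocked randomisation: exactly 3 treated within each block of 6
blocked <- replicate(5000, dim_est(c(sample(rep(0:1, 3)), sample(rep(0:1, 3)))))
c(sd_complete = sd(complete), sd_blocked = sd(blocked))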

Is there any disadvantage to blocking?

  • In general, no!
  • Gerber and Green argue that, even if you create blocks at random, you will do no worse than if you had not blocked at all (simple randomisation)
  • So there is no real disadvantage to blocking according to them
  • Others, such as Pashley and Miratrix (2021), argue that under some specific conditions (such as poorly chosen blocks), blocking will not be very beneficial, but it will not be very harmful either
  • In the vast majority of cases, you have a lot of gain from blocking and very little to lose
  • The only risk, in fact, is analysing the data incorrectly
  • And that can happen! We will see how soon 😉

When is blocking useful?

  • Blocking can be particularly useful when:
    • The sample size is small
    • There are important characteristics that are likely to affect the outcome of the experiment
    • The cost of blocking is low compared to the potential benefits
    • When you want to state in advance that heterogeneous effects are expected and will be explored (a defence against p-hacking)
  • We should not overstate its benefits, as much of the same gain can be obtained with covariate adjustment. But the reduction in variance is real!

Kalla and Broockman (2015)

How to define ATE in blocked experiments?

  • If we consider the ATE at the unit level:

\[ATE = \frac{1}{N}\sum_{i=1}^N (y_{i,1} - y_{i,0})\]

  • We could re-express this quantity equivalently using the ATE of block \(j\), \(ATE_j\), as follows:

\[ATE = \sum_{j=1}^J \frac{N_j}{N}\left(\frac{1}{N_j}\sum_{i=1}^{N_j} (y_{i,1} - y_{i,0})\right) = \sum_{j=1}^J \frac{N_j}{N}ATE_j\]

  • And it is natural to estimate this quantity by plugging in the block-level estimates we can actually compute:

\[\widehat{ATE} = \sum_{j=1}^J \frac{N_j}{N}\widehat{ATE_j}\]

How to define ATE in blocked experiments?

  • We can obtain the standard error of the estimator by combining the standard errors within each block (if our blocks are sufficiently large):

  • Square each block’s standard error
  • Weight it by the squared share of that block’s size, \((N_j/N)^2\)
  • Add up all these weighted squared standard errors
  • Take the square root of the sum
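  • In symbols, the four steps above amount to:

\[SE(\widehat{ATE}) = \sqrt{\sum_{j=1}^J \left(\frac{N_j}{N}\right)^2 SE(\widehat{ATE_j})^2}\]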

How to define ATE in blocked experiments?

Let’s simulate some data

set.seed(12345)
# We have 10 units
N <- 10
# y0 is the potential outcome under control
y0 <- c(0, 0, 0, 1, 1, 3, 4, 5, 190, 200)
# tau is each unit's own (intrinsic) treatment effect
tau <- c(10, 30, 200, 90, 10, 20, 30, 40, 90, 20)
# y1 is the potential outcome under treatment
y1 <- y0 + tau
# Two blocks: a and b
block <- c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b")
# Z is the treatment assignment
# (in the code we use Z instead of T)
Z <- c(0, 0, 0, 0, 1, 1, 0, 0, 1, 1)
# Y is the observed outcome
Y <- Z * y1 + (1 - Z) * y0
# The data
dat <- data.frame(Z = Z, y0 = y0, y1 = y1, tau = tau, b = block, Y = Y)
head(dat)
  Z y0  y1 tau b  Y
1 0  0  10  10 a  0
2 0  0  30  30 a  0
3 0  0 200 200 a  0
4 0  1  91  90 a  1
5 1  1  11  10 a 11
6 1  3  23  20 a 23

How to define ATE in blocked experiments?

  • One option is to estimate each block’s \(ATE_j\) with its sample analogue, \(\widehat{ATE_j}\), and take the weighted average
with(dat, table(b, Z))
   Z
b   0 1
  a 4 2
  b 2 2

As we can see, we have 6 units in block \(a\), 2 of which are assigned to the treatment, and 4 units in block \(b\), 2 of which are assigned to the treatment.

Estimating ATE in blocked experiments

  • First, let’s look at some possible estimators
# The ATE
library(estimatr)
lm_robust(Y ~ Z, data = dat)
              Estimate Std. Error  t value   Pr(>|t|)    CI Lower   CI Upper DF
(Intercept)   1.666667  0.9189366 1.813691 0.10728314  -0.4524049   3.785738  8
Z           131.833333 68.4173061 1.926900 0.09015082 -25.9372575 289.603924  8
lm_robust(Y ~ Z + block, data = dat)
             Estimate Std. Error   t value   Pr(>|t|)  CI Lower  CI Upper DF
(Intercept) -32.42857   21.63212 -1.499093 0.17752759 -83.58042  18.72327  7
Z           114.78571   50.53715  2.271314 0.05736626  -4.71565 234.28708  7
blockb      102.28571   50.49783  2.025547 0.08245280 -17.12268 221.69411  7
  • How are they different?

Why are they different?

  • How are they different? (The first one ignores the blocks entirely. The second adjusts for block membership with an indicator/dummy variable, a fixed effect, which implicitly weights the blocks differently from \(N_j/N\))

  • And we can estimate the total ATE by adjusting the weights according to the size of the blocks:

lm_lin(Y ~ Z, covariates = ~ block, data = dat)
            Estimate Std. Error  t value     Pr(>|t|)   CI Lower   CI Upper DF
(Intercept)     1.95   0.250000 7.800000 0.0002340912   1.338272   2.561728  6
Z             108.25  12.530862 8.638672 0.0001325490  77.588086 138.911914  6
blockb_c        4.25   0.559017 7.602631 0.0002696413   2.882135   5.617865  6
Z:blockb_c    228.75  30.599224 7.475680 0.0002957945 153.876397 303.623603  6
  • Which one should we use? Any thoughts?

Weighted average of block ATEs

  • The weighted average of the block ATEs is the best estimator
  • The weights are the proportion of units in each block, \(N_j/N\)
  • If the likelihood of being assigned to the treatment group differs by block, comparing the means across all subjects will lead to a biased estimate of the ATE
    • Unless the probability of assignment to the treatment group is identical for every block
  • In a nutshell: when estimating the ATE in a blocked experiment, just do it the old-fashioned way! 😅
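  • As a check, here is a minimal sketch that computes the weighted average by hand for the dat simulated earlier; it reproduces the 108.25 that lm_lin reported:
# Difference in means within each block
block_ates <- sapply(split(dat, dat$b), function(d) {
  mean(d$Y[d$Z == 1]) - mean(d$Y[d$Z == 0])
})
# Weight each block's estimate by its share of the sample, N_j/N
block_weights <- table(dat$b) / nrow(dat)
sum(block_ates * block_weights)
[1] 108.25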

Clustering

What is clustering?

  • So far, we have only allocated individual units to treatment and control groups
  • But in some cases, we might want to allocate whole groups of units to treatment and control conditions
  • Usually, we do this out of practical necessity:
    • It is impossible to randomly assign individuals to treatment and control groups (TV markets, for instance)
    • We cannot isolate individuals from each other (same household, same school, etc.)
    • We have strong priors about possible spillover effects
  • However, we still measure effects at the individual level (or any level smaller than the randomisation unit)
  • Let’s see how that impacts our analyses 😉

Source: TMG Research


Clustering

  • A common example is an education experiment in which the treatment is randomised at the classroom level
  • All students in a classroom are assigned to either treatment or control together, as it is impossible for students in the same classroom to be assigned to different conditions (different teachers, materials, etc.)
  • Assignments do not vary within the classroom
  • Clusters can be localities, like villages, precincts, or neighbourhoods
  • When clusters are all the same size, the standard difference-in-means estimator we usually employ is unbiased
  • However, caution is needed when clusters have different numbers of units or when there are very few clusters, as the treatment effects could be correlated with cluster size
  • When cluster size is related to potential outcomes, the usual difference-in-means estimator is often biased


Differences between blocking and clustering

| Aspect | Blocking | Clustering |
|---|---|---|
| Purpose | To reduce variance and increase precision | Practical necessity, not choice |
| Unit of randomisation | Individual units | Groups of units |
| Grouping | Based on pre-treatment characteristics | Based on natural or administrative groups |
| Analysis | Compare within blocks, then weight | Must account for within-cluster similarity |
| Example | Block by age, then randomise individuals within age groups | Randomise schools, but measure student outcomes |

Intra-cluster correlation

  • Typically, cluster randomised trials have higher variance than individually randomised trials
  • Why? Because individuals within the same cluster tend to be more similar to each other than to individuals in different clusters
  • How much higher the variance is depends on a statistic that can be hard to reason about: the intra-cluster correlation (ICC) of the outcome, \(\rho\)
  • Intra-cluster correlation (ICC) measures how similar individuals are within the same cluster compared to individuals in different clusters

Key Points:

  • Ranges from 0 to 1
  • Higher values indicate greater similarity within clusters
  • Formula: \(\rho = \frac{\sigma^2_{between}}{\sigma^2_{between} + \sigma^2_{within}}\)

Intuitive Explanation:

  • ICC = 0: Individuals within clusters are no more similar than individuals from different clusters (\(\sigma^2_{between} = 0\))
  • ICC = 1: All individuals within a cluster are identical to each other (\(\sigma^2_{within} = 0\))
  • Typical values: Usually between 0.01 and 0.2 in social science research

Example:

Imagine we’re studying student test scores in 10 schools with 30 students each:

  • Between-school variance (\(\sigma^2_{between}\)) = 25
  • Within-school variance (\(\sigma^2_{within}\)) = 75
  • ICC = \(\frac{25}{25 + 75} = 0.25\)

This means 25% of the total variance in test scores is due to differences between schools, and 75% is due to differences between students within the same school
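As a rough sketch (simulated data, balanced clusters, variance components chosen to match this example), the ICC can be estimated from a one-way ANOVA decomposition:

# Simulate 10 schools x 30 students with between-school variance 25
# and within-school variance 75, then recover the ICC
set.seed(385)
n_schools <- 10; n_students <- 30
school <- rep(1:n_schools, each = n_students)
score <- rnorm(n_schools, sd = 5)[school] + rnorm(n_schools * n_students, sd = sqrt(75))
ms <- summary(aov(score ~ factor(school)))[[1]]$"Mean Sq"
sigma2_between <- (ms[1] - ms[2]) / n_students  # ANOVA estimator of between variance
sigma2_within  <- ms[2]
sigma2_between / (sigma2_between + sigma2_within)  # roughly 0.25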

Information reduction

Clustering reduces the number of possible treatment assignments, which reduces statistical information

Example:

With 10 individuals (5 with black hair, 5 with other colours):

Individual Randomisation:

  • Each person can be independently assigned to treatment or control
  • Total possible assignments: \(\binom{10}{5} = 252\) different combinations

Cluster Randomisation:

  • All individuals in a cluster receive the same treatment

  • Only 2 possible assignments:

    1. Black hair cluster = treatment, Other colour cluster = control
    2. Black hair cluster = control, Other colour cluster = treatment
  • The more clustered our design, the fewer possible randomisation combinations we have, leading to less statistical information for estimating treatment effects

  • This is why clustered designs typically require larger sample sizes to achieve the same statistical power as individually randomised experiments

Source: Stata Guide

Information reduction

Design Effects in Clustered Designs

  • The design effect (DEFF) quantifies how much clustering increases the variance of our estimates compared to simple random sampling

\[DEFF = 1 + (m - 1) \times \rho\]

  • Where \(m\) = average cluster size and \(\rho\) = intra-cluster correlation (ICC)

  • DEFF = 1: Clustering has no effect on variance (\(\rho = 0\))

  • DEFF > 1: Clustering increases variance; need larger sample size

  • DEFF < 1: Clustering decreases variance (rare)

Example:

  • Average cluster size: 30 students per classroom
  • ICC: 0.1 (10% of total variance is between classrooms)
  • DEFF = 1 + (30 - 1) × 0.1 = 3.9

This means we need nearly 4 times as many observations as in a simple random sample to achieve the same statistical power!
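In code, using the numbers above and an assumed total of 600 students (e.g. 20 classrooms of 30):

# Design effect and effective sample size
m   <- 30        # average cluster size
rho <- 0.1       # intra-cluster correlation
deff <- 1 + (m - 1) * rho
deff             # 3.9
n_actual <- 600  # observations actually collected
n_actual / deff  # about 154 effectively independent observations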

Practical Recommendations for Clustered Designs

Before Conducting the Experiment:

  1. Estimate the ICC from previous studies or pilot data
  2. Calculate required sample size accounting for design effects
  3. Plan for adequate cluster numbers (typically need at least 15-20 clusters per arm)
  4. Consider blocking at higher levels to improve precision

During Analysis:

  1. Always use clustered standard errors at the level of randomisation
  2. Report both clustered and unclustered standard errors for comparison
  3. Check balance across clusters to ensure randomisation worked
  4. Consider heterogeneous treatment effects across clusters

Reporting Results:

  1. Clearly state the unit of randomisation
  2. Report the ICC and design effect
  3. Describe the clustering structure in your methodology
  4. Explain how standard errors were calculated

What to do about information reduction

Robust clustered standard errors

When dealing with clustered data, we need to adjust our standard errors to account for the within-cluster correlation

  • Classical standard errors assume independent observations

  • In clustered data, observations within clusters are correlated

  • This leads to underestimated standard errors and inflated Type I error rates

  • Solution: Robust Clustered Standard Errors

    • Adjust standard errors to account for clustering
    • Use the “sandwich” estimator to correct for within-cluster correlation
    • More conservative and accurate inference
    • Implemented in the estimatr package in R

R example

library(estimatr)

# Simulate clustered data
set.seed(123)
n_clusters <- 20
n_per_cluster <- 30
total_n <- n_clusters * n_per_cluster

# Create cluster IDs
cluster_id <- rep(1:n_clusters, each = n_per_cluster)

# Treatment assigned at cluster level
treatment <- rep(rbinom(n_clusters, 1, 0.5), each = n_per_cluster)

# Create outcome with cluster effects
cluster_effect <- rnorm(n_clusters, 0, 2)
individual_effect <- rnorm(total_n, 0, 1)
outcome <- 2 * treatment + rep(cluster_effect, each = n_per_cluster) + individual_effect

# Create data frame
data <- data.frame(
  cluster_id = factor(cluster_id),
  treatment = treatment,
  outcome = outcome
)

# Compare standard errors
model_naive <- lm_robust(outcome ~ treatment, data = data)
model_clustered <- lm_robust(outcome ~ treatment, clusters = cluster_id, data = data)

# Display results
summary(model_naive)

Call:
lm_robust(formula = outcome ~ treatment, data = data)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
(Intercept)   0.4872     0.1239   3.932 9.397e-05   0.2439   0.7305 598
treatment     0.7484     0.1737   4.308 1.927e-05   0.4072   1.0896 598

Multiple R-squared:  0.02962 ,  Adjusted R-squared:  0.02799 
F-statistic: 18.56 on 1 and 598 DF,  p-value: 1.927e-05
summary(model_clustered)

Call:
lm_robust(formula = outcome ~ treatment, data = data, clusters = cluster_id)

Standard error type:  CR2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper    DF
(Intercept)   0.4872     0.6400  0.7613   0.4683  -0.9886    1.963  8.00
treatment     0.7484     0.8941  0.8370   0.4140  -1.1361    2.633 17.22

Multiple R-squared:  0.02962 ,  Adjusted R-squared:  0.02799 
F-statistic: 0.7006 on 1 and 19 DF,  p-value: 0.413

Combining blocking and clustering

We can improve the efficiency of clustered designs by incorporating blocking at a higher level

Example:

  • Clusters: Classrooms (level where treatment is assigned)
  • Blocks: Schools (higher-level grouping)
  • Blocking strategy: Stratify randomisation by school characteristics

Benefits:

  • Reduces variance by ensuring similar schools are distributed across treatment arms
  • Improves precision of treatment effect estimates
  • Accounts for both clustering and heterogeneity between blocks

Implementation:

  1. Group clusters into blocks based on relevant characteristics
  2. Randomise treatment within each block
  3. Analyse using appropriate clustered standard errors
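A minimal sketch of this design with randomizr’s block_and_cluster_ra() (school and classroom sizes are illustrative):

library(randomizr)
# 4 schools (blocks), each with 2 classrooms (clusters) of 30 students
school    <- rep(c("S1", "S2", "S3", "S4"), each = 60)
classroom <- rep(1:8, each = 30)
# Randomise whole classrooms to treatment, separately within each school
Z <- block_and_cluster_ra(blocks = school, clusters = classroom)
table(school, Z) / 30  # one treated and one control classroom per school

The analysis could then use lm_robust(outcome ~ Z, clusters = classroom, fixed_effects = ~ school) to account for both the blocks and the clusters.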

Summary

  • Blocking is a technique to reduce variance and increase precision by grouping similar units before randomisation
  • It is especially useful in small samples or when important covariates are known
  • We run a series of mini-experiments within blocks, then combine results
  • Results can be estimated using weighted averages of block-specific ATEs
  • Clustering is a practical necessity when randomising groups of units together
  • It requires adjusting standard errors to account for within-cluster correlation
  • The intra-cluster correlation (ICC) quantifies similarity within clusters and impacts variance
  • Design effects from clustering often require larger sample sizes
  • Always use robust clustered standard errors when analysing clustered data
  • Combining blocking and clustering can further improve efficiency in experimental designs

Next lecture

  • We will see more on blocking and clustering in R with the estimatr package
  • Estimate models with robust clustered standard errors and see how they differ from regular standard errors
  • More on covariate adjustment
  • Learn about required sample sizes and power analysis
  • And more! 😂

Thanks very much! 😊

See you next time! 👋