QTM 385 - Experimental Methods

Lecture 08 - Blocking and Clustering (cont.), and Statistical Power

Danilo Freire

Emory University

Hi, there!
Nice to see you again! 😉

Brief recap 📚

Blocking recap

  • Blocking involves grouping experimental units based on certain characteristics to ensure comparability between treatment and control groups
  • Blocks are formed based on variables expected to affect the outcome, and within each block, units are randomly assigned to treatment or control
  • Blocking reduces variance and increases precision by ensuring balanced groups within each block
  • Key benefits:
    • Ensures equal representation of important subgroups in treatment and control
    • Reduces the risk of confounding variables affecting results
    • Particularly useful for small sample sizes or when heterogeneity is expected

Source: Bobbit (2020)

Clustering recap

  • Clustering involves assigning whole groups of units to treatment and control, often due to practical constraints
  • Common in experiments where individual randomisation is impossible (e.g., classrooms, villages)
  • Clustering introduces intra-cluster correlation (ICC), which measures how similar individuals within the same cluster are
  • Challenges:
    • Higher variance compared to individual randomisation
    • Requires cluster-robust standard errors to avoid underestimating uncertainty (a one-line sketch follows this list)
  • Blocking can be used within clusters to further reduce variance and improve precision
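  • As a quick reminder of what that looks like in code, here is a one-line sketch of cluster-robust standard errors with estimatr (df, outcome, and cluster_id are hypothetical placeholder names):

# Minimal sketch: cluster-robust standard errors with estimatr
# (df, outcome, and cluster_id are placeholders, not course data)
library(estimatr)
lm_robust(outcome ~ Z, clusters = cluster_id, data = df)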

Blocking (cont.)

Let’s declare the design of a hypothetical experiment!

Community gardening programme

  • Imagine that we are evaluating a “Community Gardening Programme” (treatment) that is designed to increase social cohesion within a city
  • We believe that different neighbourhoods start with varying levels of social cohesion
  • So we will block by “Neighbourhood Type” (i.e. established versus new development) to improve the precision of our estimated programme effect

Community gardening programme

Declare Model

  • Simulating our experiment using DeclareDesign:
# Load packages
library(DeclareDesign)
library(tidyverse) 

# Set seed for reproducibility
set.seed(385)

# Simulate our data
N = 100 # Total individuals

# Declare Model: what we think about the world
model_blocks <- declare_model(
  N = N,
  # Two neighbourhood types
  neighbourhood_type = sample(rep(c("Established", "New Development"), each = N/2)), 
  # Use neighbourhood type as blocks
  blocks = neighbourhood_type, 

  # Established neighbourhoods start with slightly higher cohesion
  baseline_cohesion_effect = ifelse(neighbourhood_type == "Established", 0.5, 0), 
  u = rnorm(N), # Random noise

  potential_outcomes(
    social_cohesion ~ Z * 1.5 + # Treatment effect = 1.5 units
      baseline_cohesion_effect + # Baseline cohesion varies by neighbourhood type
      u/10 # Random noise
  )
)

Community gardening programme

Declare Inquiry

# Declare Inquiry: what we want to learn

# Average Treatment Effect (ATE)
inquiry_blocks <- declare_inquiry(ATE = mean(social_cohesion_Z_1 - social_cohesion_Z_0))

Community gardening programme

Declare Assignment

# Declare Assignment: how we assign treatment

# Blocked RA by neighbourhood type
assignment_blocks  <- declare_assignment(Z = block_ra(blocks = blocks)) 

Community gardening programme

Declare Measurement

# Declare Measurement: how we measure outcomes
measurement_blocks <- declare_measurement(social_cohesion = reveal_outcomes(social_cohesion ~ Z))

Community gardening programme

Declare Estimator

# Declare Estimator: how we estimate treatment effects
estimator_blocks <- declare_estimator(social_cohesion ~ Z) 
  • Now we have all the components of our design
  • Let’s put them together!
design_blocks <- model_blocks + inquiry_blocks +
                 assignment_blocks + measurement_blocks + estimator_blocks

Community gardening programme

Model without blocking

# Re-use the same model and inquiry for a fair comparison
model_no_blocks <- model_blocks
inquiry_no_blocks <- inquiry_blocks
assignment_no_blocks <- declare_assignment(Z = complete_ra(N = N, m = 50)) # No blocks
measurement_no_blocks <- measurement_blocks # Re-use measurement
estimator_no_blocks <- estimator_blocks     # Re-use estimator

design_no_blocks <- model_no_blocks + inquiry_no_blocks +
                    assignment_no_blocks + measurement_no_blocks + estimator_no_blocks 

Community gardening programme

Simulate and compare

# Simulate the designs
simulations <- simulate_design(design_blocks, design_no_blocks, sims = 1000) 
# Plot the results
ggplot(simulations, aes(estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap( ~ design) +
  theme_bw() +
  theme(strip.background = element_blank()) +
  labs(
    title = "Distribution of Estimated Community Gardening Programme Effect",
    x = "Estimated Treatment Effect (Social Cohesion)",
    y = "Frequency"
  )

Community gardening programme

Simulate and compare

Cool, isn’t it? 🤓

Clustering (cont.)

Clustering (cont.)

  • Last class, we discussed the concept of clustering in experiments
  • Clustering is often used when individual-level random assignment is not feasible, such as in:
    • Education (e.g., classrooms)
    • Healthcare (e.g., hospitals)
    • Community interventions (e.g., neighbourhoods)
  • Clustering can be beneficial, but it also comes with challenges, such as the need for robust statistical methods to account for intra-cluster correlation
  • We also discussed the importance of considering the design effect when planning a clustered experiment
    • We unfortunately lose statistical power when we cluster
  • Clustering is more of a necessity than a choice

Intra-cluster correlation (ICC)

  • The intra-cluster correlation (ICC) is a measure of the similarity of individuals within the same cluster, compared to individuals in different clusters
  • The ICC is defined as:
    • \(ICC = \frac{\sigma^2_{between}}{\sigma^2_{between} + \sigma^2_{within}}\)
  • The measure goes from 0 to 1. When it is closer to 0, it means that clusters have no influence on the outcome, so we can treat individuals as independent
    • This is the ideal situation!
  • When it is closer to 1, it means that all units within the same cluster have the same outcome
    • This is not good because it implies that units are so similar that the effective sample size is equal to the number of clusters
  • In social sciences, education, or public health, ICC values often range from 0.05 to 0.20
  • Even seemingly “small” ICC values can have a noticeable impact, especially with large clusters! (The short sketch below shows how the formula maps to code)
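  • To make the formula concrete, here is a minimal sketch with made-up variance components (the numbers are assumptions, not estimates from real data):

# Minimal sketch: ICC from hypothetical variance components
sigma2_between <- 0.05 # assumed between-cluster variance
sigma2_within  <- 0.45 # assumed within-cluster variance

icc <- sigma2_between / (sigma2_between + sigma2_within)
icc # 0.1: 10% of the total variance is between clusters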

Effective sample size

Loss of power due to clustering

  • The effective sample size helps us understand the impact of clustering on our statistical power (Kish 1965)
  • It is defined as:

\[n_{ESS} = \frac{n}{1 + (k - 1) \cdot ICC}\]

  • Where:
    • \(n\) = total sample size
    • \(k\) = average cluster size
    • \(ICC\) = intra-cluster correlation
  • Let’s see what each step means

Effective sample size (cont.)

Explanation

\[n_{ESS} = \frac{n}{1 + (k - 1) \cdot ICC}\]

  • Numerator (\(n\)): Sample size. If there were no clustering effect (\(ICC = 0\)), the denominator would be 1, and ESS would equal \(n\)
  • Denominator \(1 + (k - 1) \cdot ICC\), the “variance inflation factor”: this is the key part that adjusts for clustering
  • Let’s look at it piece by piece:
    • \((k - 1)\): This reflects the fact that the “loss” of information increases with cluster size (\(k\)). If \(k=1\) (clusters of size 1, i.e., no clustering), then \((k-1) = 0\), and the denominator becomes 1. As \(k\) increases, more redundancy is introduced, and the denominator increases
    • \((k - 1) \cdot ICC\): Total amount of correlation effect within a cluster. ICC multiplied by approximately \(k-1\) because each individual is correlated with \(k-1\) other individuals in their cluster
    • Dividing by the denominator: dividing the nominal sample size \(n\) by this variance inflation factor acknowledges that a clustered sample provides less independent information than \(n\) fully independent observations

Effective sample size (cont.)

Example

  • Let’s say we have a sample of 1000 individuals, with an average cluster size of 20 (\(k=20\)) and an ICC of 0.2:

\[n_{ESS} = \frac{1000}{1 + (20 - 1) \cdot 0.2} = \frac{1000}{1 + 3.8} = \frac{1000}{4.8} \approx 208.33\]

  • This means that, despite having 1000 individuals, our clustered sample is only about as informative as 208 independent observations – we lose roughly 80% of our effective sample size to clustering!
  • And since the effective sample size is what enters the power calculation, we need to adjust our calculations accordingly
  • This is why we need to be careful when designing clustered experiments! 😅 (A quick check of these numbers in R is sketched below)
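  • Here is that check as a small R function (effective_n is our own name, not a package function):

# Minimal sketch: Kish's effective sample size under clustering
effective_n <- function(n, k, icc) {
  n / (1 + (k - 1) * icc) # divide by the variance inflation factor
}

effective_n(n = 1000, k = 20, icc = 0.2) # ~208.33, as in the example above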

What to do in such situations?

  • As we’ve seen, cluster randomised trials entail a series of specific challenges for standard estimation and testing methods
  • If randomisation is conducted at the cluster level, the uncertainty arising from this process is also at the cluster level
  • When we have a sufficient number of clusters, cluster-robust standard errors produce confidence intervals with approximately correct coverage; with only a few clusters, they tend to understate uncertainty
  • If the cluster size (or any related characteristic) is linked to the effect magnitude, then the estimation may be biased (and adjustments are required)
  • So, what can we do? 🤷🏻‍♂️

What to do in such situations?

Increase sample size

  • One option is to increase the sample size to account for the loss of power due to clustering
  • This can be done by:
    • Adding more clusters
    • Increasing the number of units within each cluster
  • However, this can be challenging in practice, as it may not always be feasible to add more clusters or units

What to do in such situations?

Pair matching

  • Imai et al. (2009) proposed a design strategy to improve the efficiency of cluster randomised trials
  • The strategy has three steps:
    • First, choose the causal quantity of interest (usually the individual-level difference in means)
    • Then, identify available pre-treatment covariates likely to affect the outcome variable (these form the blocks) and, if possible, pair clusters based on the similarity of these covariates and on cluster size
    • They show that this step is usually overlooked and can be worth many additional observations
    • Finally, randomly assign one cluster in each pair to treatment and the other to control
  • This design is particularly useful when the number of clusters is small, and it is an extreme example of blocking. Use blockTools to implement it!
  • Again, block what you can!

Example

Clustered randomised trial with and without blocking

  • Let’s declare a design with heterogeneous cluster sizes
  • There are 300 units in 12 clusters
  • Two bigger clusters are of size 100 and 10 smaller clusters are of size 10
  • The 200 units in clusters of size 100 have a 0 treatment effect, the other 100 in clusters of size 10 have an effect of 3. This means that the average treatment effect is 1
  • Example adapted from DeclareDesign

Clustered randomised trial with and without blocking

# Load packages
library(DeclareDesign)
library(tidyverse)

# Set seed for reproducibility
set.seed(385)

# Design
N_clusters <- 12

cluster_design <- 
  # M: Model
  declare_model(clusters = add_level(N = N_clusters, 
                                     # 2 clusters of size 100 and 10 clusters of size 10
                                     cl_size = rep(c(100, 10), c(N/6, N - N/6)),
                                     effect = ifelse(cl_size == 100, 0, 3)),
                units    = add_level(N = cl_size, u = rnorm(N, sd = .2),
                                     Y_Z_0 = u, Y_Z_1 = u + effect)) +
  
  # I: Inquiry
  declare_inquiry(ATE_i = mean(Y_Z_1 - Y_Z_0)) +

  # D: Data Strategy
  declare_assignment(Z = cluster_ra(clusters = clusters)) +
  
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +

  # A: Answer Strategy
  declare_estimator(Y ~ Z, inquiry = "ATE_i", clusters = clusters,
                    model = lm_robust, label = "Clustered")

Clustered randomised trial with and without blocking

# Block on cluster size and simulate
# Re-use the same model and inquiry for a fair comparison
# Modified cluster design with blocking
matched_cluster_design <- 
  # M: Model
  declare_model(clusters = add_level(N = N_clusters, 
                                     cl_size = rep(c(100, 10), c(N/6, N - N/6)),
                                     # ATE = 1
                                     effect = ifelse(cl_size == 100, 0, 3)),
                units    = add_level(N = cl_size, 
                                     u = rnorm(N, sd = .2),
                                     Y_Z_0 = u, 
                                     Y_Z_1 = u + effect)) +
  
  # I: Inquiry
  declare_inquiry(ATE_i = mean(Y_Z_1 - Y_Z_0)) +
  
  # D: Data Strategy (modified assignment)
  declare_assignment(Z = block_and_cluster_ra(clusters = clusters, blocks = cl_size)) +
  
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  
  # A: Answer Strategy (clustered SEs as before; the label reflects the blocked assignment)
  declare_estimator(Y ~ Z, inquiry = "ATE_i", clusters = clusters, label = "Blocks")

simulations <- simulate_design(cluster_design, matched_cluster_design, sims = 1000)

Plot the results
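One way to see the gain from blocking on cluster size is to plot the sampling distributions of the estimates from both designs, mirroring the earlier blocking example (a sketch that assumes the simulations object created on the previous slide):

# Compare the sampling distributions of the two clustered designs
ggplot(simulations, aes(estimate)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 1, linetype = "dashed") + # true ATE = 1
  facet_wrap(~ design) +
  theme_bw() +
  theme(strip.background = element_blank()) +
  labs(
    title = "Estimated ATE with and without blocking on cluster size",
    x = "Estimated Treatment Effect",
    y = "Frequency"
  )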

Statistical power 💪

Statistical power

What is it?

  • Statistical power is the probability of correctly rejecting the null hypothesis when it is false
  • In other words, it is the probability of detecting an effect when there is one
  • Statistical power is influenced by several factors, including:
    • Sample size
    • Effect size
    • Significance level (alpha)
    • Variability in the data
  • Power calculations were often overlooked in the past, but they are now considered an essential part of the research design process
  • They must be included in any pre-analysis plan (PAP), and they are often required by journals and funding agencies
  • So let’s see what they are all about! 😅
  • (Based on Coppock (2019))

Statistical power

Why you should care

  • Experimenters often guard against false positives with statistical significance tests
  • Power analysis is the opposite: it is designed to guard against false negatives
  • How likely is your experiment to detect an effect if there is one? Answering this question always requires some guesswork
  • You’ll have to supply guesses about how big your treatment effect might be, how many subjects will answer your survey, and how many subjects you can afford to treat
  • Where do these guesses come from? As we saw before:
    • Previous studies
    • Pilot studies
    • Subject-matter expertise
  • A rule of thumb is that a power of 0.80 (or 80%) is generally considered acceptable
  • There is no one-size-fits-all answer, and the acceptable level of power may vary depending on the context and consequences of Type I and Type II errors
  • So if you’re running a study on a very sensitive topic, you might want to aim for a higher power level (e.g., 90% or 95%)

How to calculate power

The three steps

  • Three things determine how highly powered your experiment will be
  • The first two you can’t really control – they are the realities of your experimental environment
  • The last – the sample size, which is part of your experimental design – is the only thing you have power over, so use it!
    • Strength of the treatment effect: This is the size of the effect you expect to observe in your experiment. The larger the effect, the easier it is to detect, and the higher the power
    • Variability in the data: The more variability there is in your data, the harder it is to detect an effect, and the lower the power. To the extent that it is possible, try to select outcome variables that have low variability
    • Sample size: The larger the sample size, the more likely you are to detect an effect if it exists. Increasing the sample size generally increases power

How to calculate power

Formula

  • Here is a common formula used to calculate power:

\[ \beta = \Phi \left( \frac{|\mu_t - \mu_c|\sqrt{N}}{2\sigma} - \Phi^{-1} \left( 1 - \frac{\alpha}{2} \right) \right) \]

  • \(\beta\) (Beta): Power, a value between 0 and 1; researchers typically aim for at least 0.80
  • \(\Phi\) (Phi): Cumulative distribution function of a standard normal distribution. \(\Phi^{-1}\) is its inverse function
  • \(\mu_t - \mu_c\): Assumed ATE
  • \(\sigma\): Standard Deviation of Outcomes. This represents the assumed variability or “noisiness” of the outcome measurements. Assumed to be the same for both treatment and control
  • \(\alpha\): Significance level. Commonly set to 0.05 in many fields
  • \(N\): Sample size, the only factor in this formula directly controlled by the researcher. The formula assumes units are assigned to treatment and control with equal probability (the short sketch below implements the formula in R)
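  • The formula translates directly into a few lines of R (a minimal sketch; analytic_power is our own name, not a DeclareDesign function):

# Minimal sketch of the analytical power formula above
analytic_power <- function(ate, sigma, N, alpha = 0.05) {
  pnorm(abs(ate) * sqrt(N) / (2 * sigma) - qnorm(1 - alpha / 2))
}

# ATE = 0.2, sd = 1, N = 100 (as in the design on the next slide)
analytic_power(ate = 0.2, sigma = 1, N = 100) # ~0.17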

Calculate power using DeclareDesign

# N
sample_size <- 100

# Declare a two-arm trial in code
two_arm_trial <-
  declare_model(N = sample_size,
                U = rnorm(N),
                potential_outcomes(Y ~ 0.2 * Z + U)) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")

diagnose_design(two_arm_trial)

Research design diagnosis based on 500 simulations. Diagnosis completed in 1 secs. Diagnosand estimates with bootstrapped standard errors in parentheses (100 replicates).

        Design Inquiry Estimator Outcome Term N Sims Mean Estimand
 two_arm_trial     ATE estimator       Y    Z    500          0.20
                                                            (0.00)
 Mean Estimate   Bias SD Estimate   RMSE  Power Coverage
          0.20  -0.00        0.20   0.20   0.17     0.94
        (0.01) (0.01)      (0.01) (0.01) (0.02)   (0.01)

Calculate power using DeclareDesign

# Redesign over sample size and calculate power
diagnosis <-
  two_arm_trial |>
  redesign(sample_size = c(250, 500, 750, 1000, 1250)) |>
  diagnose_designs() |>
  tidy() |>
  filter(diagnosand == "power")

# Visualize power curve over sample sizes
ggplot(diagnosis, aes(sample_size, estimate)) + 
  geom_point() +
  geom_line()

Calculate power using DeclareDesign

Using different models

  • Now let’s compare several estimation strategies at the same time!
  • We start with a two-arm trial with 40 units in which 20 units are assigned to treatment, blocking on a binary pre-treatment covariate \(W\)
  • We’ll let the treatment effect vary with \(W\), but the true average treatment effect (our estimand in this case) is equal to 1.
N = 40

# Model
model <- declare_model(N = N, 
                       W = rep(0:1, N / 2), # binary pre-treatment covariate
                       u = rnorm(N), 
                       potential_outcomes(Y ~ 2 * Z * W + u)) # effect is 0 if W = 0 and 2 if W = 1

# Inquiry
inquiry     <- declare_inquiry(ate = 1)  

# Data strategy
assignment   <- declare_assignment(Z = block_ra(blocks = W))
measurement  <- declare_measurement(Y = reveal_outcomes(Y ~ Z))

# Answer strategy
estimator  <- declare_estimator(Y ~ Z, inquiry = "ate", label = "Simple D-I-M")

# Declare the design
design <- model + inquiry + assignment + measurement + estimator

diagnose_design(design)

Research design diagnosis based on 500 simulations. Diagnosis completed in 2 secs. Diagnosand estimates with bootstrapped standard errors in parentheses (100 replicates).

 Design Inquiry    Estimator Outcome Term N Sims Mean Estimand Mean Estimate
 design     ate Simple D-I-M       Y    Z    500          1.00          1.01
                                                        (0.00)        (0.01)
   Bias SD Estimate   RMSE  Power Coverage
   0.01        0.31   0.31   0.76     0.98
 (0.01)      (0.01) (0.01) (0.02)   (0.01)

Using different models

  • Let’s consider two additional estimation strategies
  • The first controls for the pre-treatment covariate \(W\) in an OLS
  • An alternative is the “Lin estimator”, which centres and interacts the variables
new_design <- design +
              declare_estimator(Y ~ Z + W,  model = lm_robust,
                                inquiry = "ate", label = "OLS: Control for W") +
              declare_estimator(Y ~ Z, covariates = ~ W, model = lm_lin,
                                inquiry = "ate", label = "Lin: Control + Interaction")

diagnose_design(new_design)

Research design diagnosis based on 500 simulations. Diagnosis completed in 3 secs. Diagnosand estimates with bootstrapped standard errors in parentheses (100 replicates).

     Design Inquiry                  Estimator Outcome Term N Sims
 new_design     ate Lin: Control + Interaction       Y    Z    500
                                                                  
 new_design     ate         OLS: Control for W       Y    Z    500
                                                                  
 new_design     ate               Simple D-I-M       Y    Z    500
                                                                  
 Mean Estimand Mean Estimate   Bias SD Estimate   RMSE  Power Coverage
          1.00          1.00  -0.00        0.33   0.33   0.87     0.94
        (0.00)        (0.01) (0.01)      (0.01) (0.01) (0.01)   (0.01)
          1.00          1.00  -0.00        0.33   0.33   0.81     0.95
        (0.00)        (0.01) (0.01)      (0.01) (0.01) (0.02)   (0.01)
          1.00          1.00  -0.00        0.33   0.33   0.75     0.98
        (0.00)        (0.01) (0.01)      (0.01) (0.01) (0.02)   (0.01)

Using different models

  • To figure out how these gains in power from switching up estimation strategies compare with gains from increasing \(N\), we declare a sequence of designs, differing only in values for \(N\)
designs   <- redesign(new_design, N = seq(30, 80, 10))
diagnoses <- diagnose_design(designs)
diagnoses$diagnosands_df %>%
  ggplot(aes(N, power)) +
  geom_line(aes(color = estimator))

Using different models

Final thoughts 💭

Final thoughts

  • We saw how to use blocking to improve the precision of our estimates, including those of clustered experiments
    • Pair matching is a great way to do this!
  • We also (briefly) introduced DeclareDesign and how to use it to calculate power
    • We will cover this in more detail in a dedicated class!
  • We saw how to calculate power using DeclareDesign and how to use it to compare different designs
  • And that’s all for today! 🎉

Have a great day! 👋