Let’s declare the design of a hypothetical experiment!
Community gardening programme
Imagine that we are evaluating a “Community Gardening Programme” (treatment) that is designed to increase social cohesion within a city
We believe that different neighbourhoods start with varying levels of social cohesion
So we will block by “Neighbourhood Type” (i.e. established versus new development) to improve the precision of our estimated programme effect
Community gardening programme
Declare Model
Simulating our experiment using DeclareDesign:
# Load packages
library(DeclareDesign)
library(tidyverse)

# Set seed for reproducibility
set.seed(385)

# Simulate our data
N = 100 # Total individuals

# Declare Model: what we think about the world
model_blocks <- declare_model(
  N = N,
  # Two neighbourhood types
  neighbourhood_type = sample(rep(c("Established", "New Development"), each = N / 2)),
  # Use neighbourhood type as blocks
  blocks = neighbourhood_type,
  # Established neighbourhoods start with slightly higher cohesion
  baseline_cohesion_effect = ifelse(neighbourhood_type == "Established", 0.5, 0),
  u = rnorm(N), # Random noise
  potential_outcomes(
    social_cohesion ~ Z * 1.5 +        # Treatment effect = 1.5 units
      baseline_cohesion_effect +       # Baseline cohesion varies by neighbourhood type
      u / 10                           # Random noise
  )
)
Community gardening programme
Declare Inquiry
# Declare Inquiry: what we want to learn
# Average Treatment Effect (ATE)
inquiry_blocks <- declare_inquiry(
  ATE = mean(social_cohesion_Z_1 - social_cohesion_Z_0)
)
Community gardening programme
Declare Assignment
# Declare Assignment: how we assign treatment
# Blocked RA by neighbourhood type
assignment_blocks <- declare_assignment(Z = block_ra(blocks = blocks))
Community gardening programme
Declare Measurement
# Declare Measurement: how we measure outcomes
measurement_blocks <- declare_measurement(
  social_cohesion = reveal_outcomes(social_cohesion ~ Z)
)
Community gardening programme
Declare Estimator
# Declare Estimator: how we estimate treatment effects
estimator_blocks <- declare_estimator(social_cohesion ~ Z)
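The simulate_design() call on the later slides also needs the blocked design assembled into a single object; a minimal sketch, using the same + composition as design_no_blocks below:

# Assemble the blocked design (assumed here; used by simulate_design() later)
design_blocks <- model_blocks + inquiry_blocks + assignment_blocks +
  measurement_blocks + estimator_blocks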
# Re-use the same model and inquiry for a fair comparison
model_no_blocks <- model_blocks
inquiry_no_blocks <- inquiry_blocks
assignment_no_blocks <- declare_assignment(Z = complete_ra(N = N, m = 50)) # No blocks
measurement_no_blocks <- measurement_blocks # Re-use measurement
estimator_no_blocks <- estimator_blocks     # Re-use estimator

design_no_blocks <- model_no_blocks + inquiry_no_blocks + assignment_no_blocks +
  measurement_no_blocks + estimator_no_blocks
Community gardening programme
Simulate and compare
# Simulate the designs
simulations <- simulate_design(design_blocks, design_no_blocks, sims = 1000)
# Plot the results
ggplot(simulations, aes(estimate)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ design) +
  theme_bw() +
  theme(strip.background = element_blank()) +
  labs(
    title = "Distribution of Estimated Community Gardening Programme Effect",
    x = "Estimated Treatment Effect (Social Cohesion)",
    y = "Frequency"
  )
Community gardening programme
Simulate and compare
Cool, isn’t it? 🤓
Clustering (cont.)
Clustering (cont.)
Last class, we discussed the concept of clustering in experiments
Clustering is often used when random assignment is not feasible, such as in:
Education (e.g., classrooms)
Healthcare (e.g., hospitals)
Community interventions (e.g., neighbourhoods)
Clustering can be beneficial, but it also comes with challenges, such as the need for robust statistical methods to account for intra-cluster correlation
We also discussed the importance of considering the design effect when planning a clustered experiment
We unfortunately lose statistical power when we cluster
Clustering is more of a necessity than a choice
Intra-cluster correlation (ICC)
The intra-cluster correlation (ICC) is a measure of the similarity of individuals within the same cluster, compared to individuals in different clusters
The measure goes from 0 to 1. When it is closer to 0, it means that clusters have no influence on the outcome, so we can treat individuals as independent
This is the ideal situation!
When it is closer to 1, it means that all units within the same cluster have the same outcome
This is not good because it implies that units are so similar that the effective sample size is equal to the number of clusters
In social sciences, education, or public health, ICC values often range from 0.05 to 0.20
Even seemingly “small” ICC values can have a noticeable impact, especially with large clusters!
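A minimal sketch of estimating the ICC from clustered data with a random-intercept model (this assumes the lme4 package and made-up clusters, neither of which appears in the designs above):

# Hypothetical illustration: estimate the ICC via variance components (lme4)
library(lme4)

set.seed(385)
n_clusters <- 50
k <- 20                                        # individuals per cluster
cluster_effect <- rnorm(n_clusters, sd = 0.5)  # between-cluster variation
dat <- data.frame(
  cluster = factor(rep(1:n_clusters, each = k)),
  y = rep(cluster_effect, each = k) + rnorm(n_clusters * k, sd = 1)
)

fit <- lmer(y ~ 1 + (1 | cluster), data = dat)
vc  <- as.data.frame(VarCorr(fit))
icc <- vc$vcov[vc$grp == "cluster"] / sum(vc$vcov)
icc  # between-cluster variance / total variance; roughly 0.2 here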
Effective sample size
Loss of power due to clustering
The effective sample size is a concept that helps us understand the impact of clustering on our statistical power (Kish 1965)
It is defined as:
\[n_{ESS} = \frac{n}{1 + (k - 1) \cdot ICC}\]
Where:
\(n\) = total sample size
\(k\) = average cluster size
\(ICC\) = intra-cluster correlation
Let’s see what each step means
Effective sample size (cont.)
Explanation
\[n_{ESS} = \frac{n}{1 + (k - 1) \cdot ICC}\]
Numerator (\(n\)): Sample size. If there were no clustering effect (\(ICC = 0\)), the denominator would be 1, and ESS would equal \(n\)
Denominator \([1 + (k - 1) \cdot ICC]\): the “variance inflation factor” (also known as the design effect). This is the key part that adjusts for clustering
Let’s look at it piece by piece:
\((k - 1)\): This reflects the fact that the “loss” of information increases with cluster size (\(k\)). If \(k=1\) (clusters of size 1, i.e., no clustering), then \((k-1) = 0\), and the denominator becomes 1. As \(k\) increases, more redundancy is introduced, and the denominator increases
\((k - 1) \cdot ICC\): The total amount of within-cluster correlation. The ICC is multiplied by \(k - 1\) because each individual is correlated with the \(k - 1\) other individuals in their cluster
Dividing by the denominator: By dividing the nominal sample size \(n\) by this variance inflation factor, we indicate that our clustered sample provides less independent information
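As a sketch, the formula translates directly into a small R helper (the name n_ess is purely illustrative):

# Effective sample size under clustering (Kish 1965)
n_ess <- function(n, k, icc) n / (1 + (k - 1) * icc)

n_ess(n = 1000, k = 1,  icc = 0.2)  # 1000: clusters of size 1, no loss
n_ess(n = 1000, k = 20, icc = 0)    # 1000: no intra-cluster correlation, no loss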
Effective sample size (cont.)
Example
Let’s say we have a sample of 1000 individuals, with an average cluster size of 20 (\(k=20\)) and an ICC of 0.2:
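Plugging these numbers into the formula:

\[n_{ESS} = \frac{1000}{1 + (20 - 1) \cdot 0.2} = \frac{1000}{4.8} \approx 208\]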
This means that, despite having 1000 individuals, we lose about 80% of our sample size due to clustering!
And since the effective sample size is what we use to calculate power, we need to adjust our calculations accordingly
This is why we need to be careful when designing clustered experiments! 😅
What to do in such situations?
As we’ve seen, cluster randomised trials entail a series of specific challenges for standard estimation and testing methods
If randomisation is conducted at the cluster level, the uncertainty arising from this process is also at the cluster level
Cluster-robust standard errors can produce confidence intervals with the correct coverage, but only when the number of clusters is sufficiently large
If the cluster size (or any related characteristic) is linked to the effect magnitude, then the estimation may be biased (and adjustments are required)
So, what can we do? 🤷🏻♂️
What to do in such situations?
Increase sample size
One option is to increase the sample size to account for the loss of power due to clustering
This can be done by:
Adding more clusters
Increasing the number of units within each cluster
However, this can be challenging in practice, as it may not always be feasible to add more clusters or units
What to do in such situations?
Pair matching
Imai et al. (2009) proposed a design strategy to improve the efficiency of cluster randomised trials
The strategy has three steps:
First, choose the causal quantity of interest (usually, individual difference in means)
Then, identify available pre-treatment covariates likely to affect the outcome variable (blocks), and, if possible, pair clusters based on the similarity of these covariates and cluster sizes
They show that this step is usually overlooked, even though it can improve efficiency as much as adding many additional observations would
Finally, researchers should randomly choose one treated and one control cluster within each pair
This design is particularly useful when the number of clusters is small, and it is an extreme example of blocking. Use blockTools to implement it (see the sketch below)!
Again, block what you can!
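A hedged sketch of steps two and three using blockTools, with an invented cluster-level data set for illustration:

# Hypothetical cluster-level data: 12 clusters with a size and a baseline covariate
library(blockTools)

set.seed(385)
cluster_data <- data.frame(
  cluster_id = as.character(1:12),
  size       = c(100, 100, rep(10, 10)),
  baseline   = rnorm(12)
)

# Step 2: pair clusters (n.tr = 2) on size and the baseline covariate
pairs <- block(cluster_data, id.vars = "cluster_id",
               block.vars = c("size", "baseline"), n.tr = 2)

# Step 3: randomly assign one cluster in each pair to treatment
assignment(pairs, seed = 385)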
Example
Clustered randomised trial with and without blocking
Let’s declare a design with heterogeneous cluster sizes
There are 300 units in 12 clusters
Two bigger clusters are of size 100 and 10 smaller clusters are of size 10
The 200 units in clusters of size 100 have a 0 treatment effect, the other 100 in clusters of size 10 have an effect of 3. This means that the average treatment effect is 1
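The declaration itself is not shown on this slide; a minimal sketch of what it could look like, using add_level() inside declare_model() and cluster_ra() for cluster-level assignment (variable names are illustrative):

library(DeclareDesign)

model_clusters <- declare_model(
  cluster = add_level(
    N = 12,
    cluster_size = rep(c(100, 10), c(2, 10)),          # 2 big + 10 small clusters
    cluster_effect = ifelse(cluster_size == 100, 0, 3)  # heterogeneous effects
  ),
  individual = add_level(
    N = cluster_size,
    u = rnorm(N),
    potential_outcomes(Y ~ cluster_effect * Z + u)
  )
)

design_clusters <- model_clusters +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +             # equals 1 by construction
  declare_assignment(Z = cluster_ra(clusters = cluster)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, clusters = cluster, inquiry = "ATE")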
Experimenters often guard against false positives with statistical significance tests
Power analysis is the opposite: it is designed to guard against false negatives
Answering the question of whether your design is powerful enough always requires some guesswork
You’ll have to supply guesses as to how big your treatment effect can be, how many subjects will answer your survey, and how many subjects you can afford to treat
Where do these guesses come from? As we saw before:
Previous studies
Pilot studies
Subject-matter expertise
A rule of thumb is that a power of 0.80 (or 80%) is generally considered acceptable
There is no one-size-fits-all answer, and the acceptable level of power may vary depending on the context and consequences of Type I and Type II errors
So if you’re running a study on a very sensitive topic, you might want to aim for a higher power level (e.g., 90% or 95%)
How to calculate power
The three steps
There are three things that determine how highly powered your experiment will be
The first two you can’t really control – these are the realities of your experimental environment
The last, the experimental design (chiefly your sample size), is the only thing that you have power over – use it!
Strength of the treatment effect: This is the size of the effect you expect to observe in your experiment. The larger the effect, the easier it is to detect, and the higher the power
Variability in the data: The more variability there is in your data, the harder it is to detect an effect, and the lower the power. To the extent that it is possible, try to select outcome variables that have low variability
Sample size: The larger the sample size, the more likely you are to detect an effect if it exists. Increasing the sample size generally increases power
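The quantities listed below enter the standard closed-form power formula for a two-arm trial with equal-sized arms:

\[\beta = \Phi\left(\frac{|\mu_t - \mu_c|\sqrt{N}}{2\sigma} - \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right)\]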
\(\beta\) (Beta): Power. A value between 0 and 1, usually set to 0.80
\(\Phi\) (Phi): Cumulative distribution function of a standard normal distribution. \(\Phi^{-1}\) is its inverse function
\(\mu_t - \mu_c\): Assumed ATE
\(\sigma\): Standard Deviation of Outcomes. This represents the assumed variability or “noisiness” of the outcome measurements. Assumed to be the same for both treatment and control
\(\alpha\): Significance level. Commonly set to 0.05 in many fields
\(N\): Sample size, the only factor directly controlled by the researcher in this formula. The formula assumes a 50/50 chance of being in control
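As a sketch, this formula is easy to implement in R (analytic_power is an illustrative name, not a library function); with an assumed effect of 0.2, \(\sigma = 1\), \(\alpha = 0.05\), and \(N = 100\), it gives roughly the 0.17 power reported in the diagnosis below:

# Closed-form power for a two-arm trial with a 50/50 split (illustrative helper)
analytic_power <- function(ate, sigma, alpha = 0.05, N) {
  pnorm(abs(ate) * sqrt(N) / (2 * sigma) - qnorm(1 - alpha / 2))
}

analytic_power(ate = 0.2, sigma = 1, alpha = 0.05, N = 100)  # about 0.17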
Research design diagnosis based on 500 simulations. Diagnosis completed in 1 secs. Diagnosand estimates with bootstrapped standard errors in parentheses (100 replicates).

Design: two_arm_trial | Inquiry: ATE | Estimator: estimator | Outcome: Y | Term: Z | N Sims: 500

Mean Estimand: 0.20 (0.00) | Mean Estimate: 0.20 (0.01) | Bias: -0.00 (0.01) | SD Estimate: 0.20 (0.01) | RMSE: 0.20 (0.01) | Power: 0.17 (0.02) | Coverage: 0.94 (0.01)
Calculate power using DeclareDesign
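The redesign() call below assumes an existing two_arm_trial design declared with a sample_size parameter; a minimal sketch consistent with the diagnosis above (a 0.2 treatment effect, standard normal noise, and sample_size = 100) could be:

sample_size <- 100

two_arm_trial <-
  declare_model(
    N = sample_size,
    U = rnorm(N),
    potential_outcomes(Y ~ 0.2 * Z + U)
  ) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = complete_ra(N)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  declare_estimator(Y ~ Z, inquiry = "ATE")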
# Redesign over sample size and calculate power
diagnosis <- two_arm_trial |>
  redesign(sample_size = c(250, 500, 750, 1000, 1250)) |>
  diagnose_designs() |>
  tidy() |>
  filter(diagnosand == "power")

# Visualize power curve over sample sizes
ggplot(diagnosis, aes(sample_size, estimate)) +
  geom_point() +
  geom_line()
Calculate power using DeclareDesign
Using different models
Now we will test several estimation strategies at the same time!
We start with a two-arm trial with 40 units, in which 20 units are assigned to treatment, blocking on a binary pre-treatment covariate \(W\)
We’ll let the treatment effects vary according to \(W\), but the true average treatment effect (our estimand in this case) is equal to 1.
N = 40

# Model
model <- declare_model(
  N,
  W = rep(0:1, N / 2),
  u = rnorm(N),
  potential_outcomes(Y ~ 2 * Z * W + u)
)

# Inquiry
inquiry <- declare_inquiry(ate = 1)

# Data strategy
assignment <- declare_assignment(Z = block_ra(blocks = W))
measurement <- declare_measurement(Y = reveal_outcomes(Y ~ Z))

# Answer strategy
estimator <- declare_estimator(Y ~ Z, inquiry = "ate", label = "Simple D-I-M")

# Declare the design
design <- model + inquiry + assignment + measurement + estimator

diagnose_design(design)
Research design diagnosis based on 500 simulations. Diagnosis completed in 2 secs. Diagnosand estimates with bootstrapped standard errors in parentheses (100 replicates).

Design: design | Inquiry: ate | Estimator: Simple D-I-M | Outcome: Y | Term: Z | N Sims: 500

Mean Estimand: 1.00 (0.00) | Mean Estimate: 1.01 (0.01) | Bias: 0.01 (0.01) | SD Estimate: 0.31 (0.01) | RMSE: 0.31 (0.01) | Power: 0.76 (0.02) | Coverage: 0.98 (0.01)
Using different models
Let’s consider two additional estimation strategies
The first controls for the pre-treatment covariate \(W\) in an OLS
An alternative is the “Lin estimator”, which centres the covariate and interacts it with the treatment indicator
new_design <- design +
  declare_estimator(Y ~ Z + W, model = lm_robust,
                    inquiry = "ate", label = "OLS: Control for W") +
  declare_estimator(Y ~ Z, covariates = ~ W, model = lm_lin,
                    inquiry = "ate", label = "Lin: Control + Interaction")

diagnose_design(new_design)
Research design diagnosis based on 500 simulations. Diagnosis completed in 3 secs. Diagnosand estimates with bootstrapped standard errors in parentheses (100 replicates).

Design: new_design | Inquiry: ate | Outcome: Y | Term: Z | N Sims: 500

Estimator: Lin: Control + Interaction | Mean Estimand: 1.00 (0.00) | Mean Estimate: 1.00 (0.01) | Bias: -0.00 (0.01) | SD Estimate: 0.33 (0.01) | RMSE: 0.33 (0.01) | Power: 0.87 (0.01) | Coverage: 0.94 (0.01)

Estimator: OLS: Control for W | Mean Estimand: 1.00 (0.00) | Mean Estimate: 1.00 (0.01) | Bias: -0.00 (0.01) | SD Estimate: 0.33 (0.01) | RMSE: 0.33 (0.01) | Power: 0.81 (0.02) | Coverage: 0.95 (0.01)

Estimator: Simple D-I-M | Mean Estimand: 1.00 (0.00) | Mean Estimate: 1.00 (0.01) | Bias: -0.00 (0.01) | SD Estimate: 0.33 (0.01) | RMSE: 0.33 (0.01) | Power: 0.75 (0.02) | Coverage: 0.98 (0.01)
Using different models
To figure out how these gains in power from switching up estimation strategies compare with gains from increasing \(N\), we declare a sequence of designs, differing only in values for \(N\)
designs <- redesign(new_design, N = seq(30, 80, 10))
diagnoses <- diagnose_design(designs)
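As a sketch, the resulting diagnoses can then be compared across estimators and sample sizes, mirroring the earlier power curve (this assumes the varying parameter N appears as a column in the tidied diagnosis):

diagnoses |>
  tidy() |>
  filter(diagnosand == "power") |>
  ggplot(aes(N, estimate, colour = estimator)) +
  geom_point() +
  geom_line() +
  theme_bw() +
  labs(x = "Sample size (N)", y = "Power")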