Druckman et al. (2015) used a list experiment on athlete drug/alcohol use, showing significant underreporting compared to direct questions
Blair et al. (2014) compared list and endorsement experiments to measure support for NATO forces in Afghanistan
Rosenfeld et al. (2016) validated list experiments, endorsement experiments, and the randomised response technique (RRT) against Mississippi referendum results, finding RRT performed pretty well
Freire & Skarbek (2023) employed a conjoint experiment to understand attitudes towards lynching in Brazil
Today’s plan 📅
Integration of research findings
Discuss the challenge of generalising experimental results
Distinguish between Sample ATE (SATE) and Population ATE (PATE)
Understand the calculation and interpretation of standard errors for the PATE
Introduce a Bayesian framework for formally updating beliefs with new experimental evidence
Consider how to incorporate potential bias (like sampling bias) into our analysis and uncertainty estimates
Introduce meta-analysis as a technique for pooling results from multiple studies
Examine the assumptions and limitations of common meta-analytic approaches
Replication: Conducting a new experiment that uses the same design (treatments, outcome measures, general procedures) as an earlier study, often in a similar setting with similar subjects
The goal is to assess the robustness and generalisability of the original finding
Verification: Taking the original data from a completed study and attempting to reproduce the reported statistical results.
Here, the goal is to check for clerical errors, understand the impact of specific analytical choices, and assess the sensitivity of results to different modelling decisions
Estimating population average treatment effects (PATE) 🌎
From sample to population
Sample Average Treatment Effect (SATE): Average causal effect specifically within the group of individuals included in our experiment
Often, our real interest lies in the Population Average Treatment Effect (PATE): This represents the average effect we would expect if the treatment were applied to the entire population from which our sample was drawn
Thinking about replication studies (which involve drawing new samples) highlights the shift from considering only within-sample randomisation variability (for SATE) to considering sample-to-sample variability (for PATE).
If experimental subjects are selected via random sampling from a well-defined, large population, we can statistically estimate the PATE and quantify the uncertainty around that estimate
This involves an additional layer of uncertainty compared to just estimating the SATE, because our specific sample might not perfectly mirror the population due to random chance
The standard error of the PATE is usually larger than that of the SATE, reflecting this added uncertainty
Quantifying uncertainty
Under ideal conditions (independent random sampling from a fixed, large population \(N*\)), the standard error of the estimated PATE (\(SE(\widehat{PATE})\)) is given by: \[SE(\widehat{PATE}) = \sqrt{\frac{Var(Y_i(1))}{m} + \frac{Var(Y_i(0))}{N - m}}\]
\(Var(Y_i(1))\) and \(Var(Y_i(0))\) are the variances of potential outcomes in the population under treatment and control
These are typically estimated using the variances observed in the treatment and control groups of our sample (as in the R sketch after this list)
\(m\) is the number of subjects in the treatment group, and \(N-m\) is the number in the control group
This formula differs from the SATE standard error because it relies on population variances and, due to the assumption of a large population \(N*\), the covariance between potential outcomes is assumed to be zero
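As a minimal R sketch of the formula above (all numbers are hypothetical, with the sample group variances used as plug-in estimates of the population variances):

m <- 50            # treatment group size (assumed)
n_control <- 50    # control group size, N - m (assumed)
var_treat <- 25    # sample variance in the treatment group, plug-in for Var(Y_i(1))
var_control <- 16  # sample variance in the control group, plug-in for Var(Y_i(0))

# SE of the estimated PATE under independent random sampling from a large population
se_pate <- sqrt(var_treat / m + var_control / n_control)
se_pate
#> [1] 0.9055385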
A Bayesian framework for interpretation 📈
Updating beliefs with evidence
Because generalisation involves inference beyond the data at hand, a Bayesian approach provides a useful conceptual and analytical framework
Bayesian inference views probability as representing a subjective degree of belief about a hypothesis or parameter value
The process involves the following steps:
Starting with prior beliefs about the parameter (e.g., the PATE) before seeing the current study’s results
Observing the evidence from the new experiment
Using Bayes’ Rule to combine the prior beliefs and the evidence to form updated posterior beliefs
This framework explicitly models how our understanding should rationally change as new information becomes available
Priors and posteriors
Characterising beliefs
Prior Beliefs: Our knowledge or belief about a parameter (like the PATE) before considering the evidence from the current experiment
Often expressed as a probability distribution (e.g., assuming the PATE follows a normal distribution with a certain mean and variance)
The mean reflects the best guess, and the variance reflects the degree of prior uncertainty
Can be based on previous studies, theory, or expert judgement
Posterior Beliefs: Our updated state of knowledge or belief after incorporating the evidence from the experiment
Also expressed as a probability distribution
Mathematically derived by combining the prior distribution and the likelihood of the observed data (which reflects the experimental result and its precision)
Represents a compromise (weighted average) between prior beliefs and the new evidence. Stronger priors or less precise evidence lead to smaller updates
Bayes’ rule for discrete outcomes
Updating beliefs about a hypothesis
Consider a simple hypothesis \(H\) (e.g., the treatment has a positive effect) and evidence \(E\) from an experiment (e.g., the result is statistically significant at the \(0.05\) level)
We need:
Prior P(H): Our belief in \(H\) before the experiment
P(E|H): The probability of observing evidence \(E\) if \(H\) is true (this is related to the statistical power of the test)
P(E|~H): The probability of observing \(E\) if \(H\) is false (this is the Type I error rate, e.g., \(0.05\))
Bayes’ Rule gives the Posterior P(H|E), our belief in \(H\) after observing \(E\): \[P(H|E) = \frac{P(E|H)\,P(H)}{P(E|H)\,P(H) + P(E|\neg H)\,P(\neg H)}\]
Example: If \(P(H)=0.5\) (a 50/50 prior), \(P(E|H)=0.45\) (power), and \(P(E|\neg H)=0.05\) (significance level), then \(P(H|E) = 0.90\); a quick R check follows below
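A minimal R check of this example, plugging the assumed prior, power, and significance level into Bayes’ Rule:

p_h       <- 0.5   # prior P(H)
p_e_h     <- 0.45  # P(E|H): statistical power of the test
p_e_not_h <- 0.05  # P(E|~H): Type I error rate

# Posterior P(H|E) via Bayes' Rule
posterior <- (p_e_h * p_h) / (p_e_h * p_h + p_e_not_h * (1 - p_h))
posterior
#> [1] 0.9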
Bayesian updating with normal distributions
Estimating a continuous parameter (PATE)
Let’s apply this to estimate a continuous parameter, the PATE, denoted \(\tau\)
Prior beliefs: Assume our prior beliefs about \(\tau\) can be represented by a normal distribution with mean \(g\) (our best guess) and variance \(\sigma^2_g\)
Experimental result: The experiment yields an estimate \(x_e\) with a standard error \(\sigma_{xe}\), so its variance is \(\sigma^2_{xe}\)
Likelihood: Assuming the experimental estimate \(X_e\) follows a normal sampling distribution centred on the true PATE (\(\tau\)), the likelihood is also normal (conjugate priors)
Posterior beliefs: When both the prior and the likelihood are normal, the resulting posterior distribution for \(\tau\) is also normal
The mean and variance of this posterior distribution are calculated as a weighted average of the prior information and the experimental evidence, reflecting the relative precision of each
We will then incorporate the possibility of bias in the experimental estimate of the PATE
Visualising the Bayesian update
Consider a simple case (initially ignoring bias):
Prior belief about PATE: Centred at \(g=0\) with standard deviation \(\sigma_g=2\) (variance \(\sigma^2_g=4\))
Experimental result: Estimate \(x_e=10\) with standard error \(SE_{xe}=1\) (variance \(\sigma^2_{xe}=1\))
The Posterior belief about PATE will be:
A normal distribution
Centred between the prior mean (\(0\)) and the data (\(10\))
More precise (smaller variance) than the prior in our example
In this example, the posterior mean is \(8\), and the posterior standard deviation is approximately \(0.89\). The data, being more precise than the prior, pulls the posterior belief strongly towards it
Source: FEDAI Figure 11.1. Prior (dotted), likelihood based on experimental result (dashed), and posterior (solid) distributions.
Calculating the posterior (normal-normal case)
It is often easier to work with precision, which is defined as the inverse of the variance: \(\text{precision} = 1 / \sigma^2\)
Small variance means low uncertainty and high precision
Large variance means the opposite: high uncertainty and low precision
Prior precision: \(\rho_{prior} = 1 / \sigma^2_{prior}\); data precision: \(\rho_{data} = 1 / \sigma^2_{data}\)
The posterior precision is simply the sum of the prior and data precisions: \[\rho_{posterior} = \rho_{prior} + \rho_{data}\]
This addition reflects how combining information increases our overall certainty
The posterior mean is a precision-weighted average of the prior mean and the data: \[\mu_{posterior} = \frac{\rho_{prior} \mu_{prior} + \rho_{data} x_e}{\rho_{posterior}} = \frac{(1/\sigma^2_{prior}) \mu_{prior} + (1/\sigma^2_{data}) x_e}{(1/\sigma^2_{prior}) + (1/\sigma^2_{data})}\]
Information sources with higher precision (lower variance) receive more weight in determining the final posterior belief
The posterior variance is the inverse of the posterior precision: \(\sigma^2_{posterior} = 1 / \rho_{posterior}\)
R code for Bayesian update (normal-normal)
Let’s apply this to the example from Figure 11.1, calculating precisions first
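A minimal sketch of that calculation, using the prior (mean 0, SD 2) and the experimental estimate (10, SE 1) from the Figure 11.1 example:

# Prior and data from the Figure 11.1 example
prior_mean <- 0
prior_se   <- 2
data_mean  <- 10
data_se    <- 1

# Precisions (inverse variances)
prior_precision <- 1 / prior_se^2  # 0.25
data_precision  <- 1 / data_se^2   # 1

# Posterior precision is the sum of the two precisions
posterior_precision <- prior_precision + data_precision  # 1.25
posterior_var <- 1 / posterior_precision                 # 0.8
posterior_se  <- sqrt(posterior_var)                     # ~0.89

# Posterior mean is the precision-weighted average of the prior mean and the data
posterior_mean <- (prior_precision * prior_mean + data_precision * data_mean) /
  posterior_precision
posterior_mean  # 8
posterior_se    # ~0.894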
Now, let’s consider the more realistic scenario where our experiment might yield an unbiased estimate of the Sample ATE, but due to non-random sampling or other factors, it might be a biased estimate of the Population ATE (\(\tau\))
Let \(B\) represent this potential sampling bias, such that the expected experimental result, given the true PATE \(\tau\), is \(\tau + B\)
We need to expand our Bayesian model to include prior beliefs about this bias term \(B\)
Assume \(B \sim N(\beta, \sigma^2_B)\), where \(\beta\) is our prior belief about the bias and \(\sigma^2_B\) is our uncertainty about it
We typically assume that our prior beliefs about the true effect \(\tau\) and the bias \(B\) are independent
Posterior mean and variance with bias
Updating beliefs about PATE (\(\tau\))
We define the data’s effective precision for estimating \(\tau\) as: \[\rho_{effective\_data} = \frac{1}{\sigma^2_{xe} + \sigma^2_B}\]
Higher uncertainty about the bias (\(\sigma^2_B\)) directly reduces the effective precision of the data
Think of bias uncertainty as adding noise to \(\tau\)
Researchers implicitly setting \(\beta = 0\) and \(\sigma^2_B = 0\) when extrapolating from convenience samples risk significant overconfidence
The posterior mean becomes a weighted average of the prior mean and the bias-corrected experimental estimate, using the prior precision (\(\rho_{prior}\)) and the data's effective precision (\(\rho_{effective\_data}\)) as weights: \[\mu_{posterior} = \frac{\rho_{prior}\, \mu_{prior} + \rho_{effective\_data}\, (x_e - \beta)}{\rho_{prior} + \rho_{effective\_data}}\]
Now we add prior beliefs about the bias (mean bias_mean = \(\beta\), standard error bias_se = \(\sigma_B\)) and test how different levels of uncertainty about the bias affect the posterior mean and variance
R Code
calculate_posterior <- function(prior_mean, prior_se, data_mean, data_se, bias_mean, bias_se) {
  # Precisions for prior and data measurement
  prior_var <- prior_se^2
  prior_precision <- 1 / prior_var
  data_var <- data_se^2
  data_precision <- 1 / data_var  # Precision of the measurement itself

  # Precision related to bias belief
  bias_var <- bias_se^2
  # Handle case where bias_se is 0 (certainty about bias) to avoid division by zero
  if (bias_var == 0) {
    bias_precision <- Inf
  } else {
    bias_precision <- 1 / bias_var  # How certain are we about the bias value
  }

  # Calculate data's effective precision for estimating tau
  # Uncertainty about bias adds to the data's measurement variance
  total_data_uncertainty_variance <- bias_var + data_var
  if (total_data_uncertainty_variance == 0) {  # Only happens if bias_se = 0 and data_se = 0
    effective_data_precision <- Inf
  } else {
    effective_data_precision <- 1 / total_data_uncertainty_variance
  }

  # Calculate posterior precision (certainty about tau)
  posterior_precision <- prior_precision + effective_data_precision

  # Calculate posterior variance and SE for tau
  if (is.infinite(posterior_precision)) {  # Handle certainty case
    posterior_var <- 0
  } else {
    posterior_var <- 1 / posterior_precision
  }
  posterior_se <- sqrt(posterior_var)

  # Calculate posterior mean for tau
  # Weighted average using prior precision and data's *effective* precision
  # Note: Data point is corrected by expected bias (data_mean - bias_mean)
  # Handle edge cases of infinite precisions
  if (is.infinite(prior_precision) && is.infinite(effective_data_precision)) {
    # Ill-defined unless prior_mean equals (data_mean - bias_mean);
    # for simplicity, prioritise the data if both are infinitely precise
    posterior_mean <- data_mean - bias_mean
  } else if (is.infinite(prior_precision)) {
    posterior_mean <- prior_mean
  } else if (is.infinite(effective_data_precision)) {
    posterior_mean <- data_mean - bias_mean
  } else if (posterior_precision == 0) {
    # Both prior and effective data precision are 0 (infinite variance)
    posterior_mean <- NA  # Undefined / non-informative
  } else {
    posterior_mean <- (prior_precision * prior_mean +
                         effective_data_precision * (data_mean - bias_mean)) / posterior_precision
  }

  # Return results
  list(prior_mean = prior_mean, prior_se = prior_se, prior_precision = prior_precision,
       data_mean = data_mean, data_se = data_se, data_precision = data_precision,
       bias_mean = bias_mean, bias_se = bias_se, bias_precision = bias_precision,
       effective_data_precision = effective_data_precision,
       posterior_mean = posterior_mean, posterior_se = posterior_se,
       posterior_precision = posterior_precision)
}

# --- Example cases ---
# Base info from Fig 11.1
prior_mean_base <- 0
prior_se_base <- 2
data_mean_base <- 10
data_se_base <- 1

# Case A: Certainty of no bias (Bias SE = 0)
results_A <- calculate_posterior(prior_mean_base, prior_se_base,
                                 data_mean_base, data_se_base,
                                 bias_mean = 0, bias_se = 0)

# Case B: Moderate uncertainty about bias (Bias SE = 2)
results_B <- calculate_posterior(prior_mean_base, prior_se_base,
                                 data_mean_base, data_se_base,
                                 bias_mean = 0, bias_se = 2)

# Case C: High uncertainty about bias (Bias SE = 10)
results_C <- calculate_posterior(prior_mean_base, prior_se_base,
                                 data_mean_base, data_se_base,
                                 bias_mean = 0, bias_se = 10)

# --- Print results ---
print_results <- function(results, case_label) {
  cat("--- Case:", case_label, "---\n")
  cat(sprintf("Bias Beliefs: Mean=%.1f, SE=%.1f (Bias Var=%.1f, Bias Precision=%.4f)\n",
              results$bias_mean, results$bias_se, results$bias_se^2, results$bias_precision))
  cat(sprintf("Data Effective Precision: %.4f\n", results$effective_data_precision))
  cat(sprintf("Posterior: Mean=%.2f, SE=%.2f (Posterior Precision=%.4f)\n\n",
              results$posterior_mean, results$posterior_se, results$posterior_precision))
}

print_results(results_A, "A: Certain No Bias")
--- Case: A: Certain No Bias ---
Bias Beliefs: Mean=0.0, SE=0.0 (Bias Var=0.0, Bias Precision=Inf)
Data Effective Precision: 1.0000
Posterior: Mean=8.00, SE=0.89 (Posterior Precision=1.2500)
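Printing the remaining cases (computed but not shown above) illustrates how uncertainty about the bias pulls the posterior back towards the prior and widens it; under these inputs, Case B yields a posterior mean of roughly 4.44 (SE ≈ 1.49) and Case C roughly 0.38 (SE ≈ 1.96):

print_results(results_B, "B: Moderate Bias Uncertainty")
print_results(results_C, "C: High Bias Uncertainty")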
This Bayesian framework is flexible and can integrate different kinds of evidence.
Imagine using results from a high-quality, unbiased randomised experiment to establish your prior beliefs about the PATE (\(g, \sigma^2_g\))
You could then update these priors using findings from a large, but potentially biased observational study (\(x_e, \sigma^2_{xe}\))
To do this rigorously, you must explicitly state your prior beliefs about the bias of the observational study (\(\beta, \sigma^2_B\)); the sketch below reuses calculate_posterior() for this scenario
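As a purely hypothetical illustration (the numbers below are assumptions, not results from any actual study), the calculate_posterior() function defined earlier can combine an experiment-based prior with a precise but potentially biased observational estimate:

# Experiment-based prior for the PATE, plus a large observational study
# suspected of overstating the effect (all values assumed)
posterior_obs <- calculate_posterior(
  prior_mean = 5,  prior_se = 3,    # prior beliefs about tau from the experiment
  data_mean  = 12, data_se  = 0.5,  # precise but possibly biased observational estimate
  bias_mean  = 4,  bias_se  = 3     # prior beliefs about the observational study's bias B
)
posterior_obs$posterior_mean  # moves from the prior (5) towards the bias-corrected estimate (12 - 4 = 8)
posterior_obs$posterior_se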
Meta-Analysis 📊
Meta-Analysis
Systematically combining results
Meta-analysis is a set of statistical techniques designed to synthesise or pool results from multiple related studies
Often, individual studies might be small or lack statistical power, but combining their results can lead to a more precise and clearer conclusion
It is a widely used method for summarising research literatures across many disciplines, particularly in the health sciences
The most basic form, known as fixed effects meta-analysis, has strong parallels with the Bayesian updating process described earlier, particularly under the assumption of no bias
Assumption: All studies (\(k=1, ..., K\)) included in the analysis are estimating the exact same underlying true population treatment effect, \(\tau\). Any variation in results is due purely to random sampling error.
Method: Calculate a pooled estimate (\(ATE_{pooled}\)) as a weighted average of the individual study estimates (\(ATE_k\)).
Weighting: Each study’s contribution is weighted by its precision; studies with smaller standard errors (higher precision) receive more weight (a short R sketch follows this list).
The resulting pooled standard error will be smaller than the standard error of any individual study included in the meta-analysis, reflecting the gain in precision from combining evidence
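A minimal R sketch of this inverse-variance pooling, using hypothetical study estimates and standard errors:

# Hypothetical ATE estimates and standard errors from K = 4 studies (assumed values)
ate <- c(2.1, 3.5, 1.2, 2.8)
se  <- c(1.0, 1.5, 0.8, 2.0)

# Fixed effects weights: each study's precision (inverse variance)
w <- 1 / se^2

# Precision-weighted pooled estimate and its standard error
ate_pooled <- sum(w * ate) / sum(w)
se_pooled  <- sqrt(1 / sum(w))

ate_pooled  # pooled ATE
se_pooled   # smaller than the SE of any individual study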
When is fixed effects meta-analysis appropriate?
This simple pooling method rests on several strong and often unrealistic assumptions:
Homogeneity of Effect: All studies are estimating precisely the same true effect size \(\tau\). There is no real variation in the effect across different populations, settings, or minor treatment variations
Independence of Studies: The results of the studies are statistically independent (often violated if studies share researchers, settings, or are aware of each other)
No Bias: Each individual study estimate (\(ATE_k\)) is assumed to be an unbiased estimate of the common true effect \(\tau\). Potential biases (sampling, publication, etc.) are ignored
These assumptions are particularly questionable when pooling studies across diverse contexts or designs
These assumptions can be relaxed with more complex models:
Random effects meta-analysis: Assumes that the true effect size varies across studies, allowing for a distribution of effect sizes rather than a single common effect
Hierarchical models: Can model both within-study and between-study variability
However, results can be heavily skewed by publication bias (the “file drawer problem”)
Techniques exist to detect this (e.g., funnel plots), but they cannot fully correct for it; a short R sketch follows below
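As an illustration of these diagnostics, here is a sketch using the metafor package and its bundled dat.bcg example dataset (not the data from our case study):

library(metafor)

# Compute effect sizes (log risk ratios) and fit a random effects model
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg, data = dat.bcg)
res <- rma(yi, vi, data = dat)

# Funnel plot: marked asymmetry can signal publication bias or small-study effects
funnel(res)

# Egger-style regression test for funnel plot asymmetry
regtest(res)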
Let’s see how to conduct one in practice (all of it!)
Case study: Legislature size and government spending 💰
Step 01: Scrape data
My colleagues and I conducted a meta-analysis of the effects of legislature size on government spending
To find the articles, we used Google Scholar (n = 1001); Microsoft Academic (n = 927); and Scopus (n = 736) and searched for the following keywords:
("upper chamber size" OR "lower chamber size" OR "council size" OR "parliament size" OR "legislature size" OR "number of legislators" OR "legislative size") AND ("spending" OR "expenditure" OR "government size")
We also searched all articles that cited the original work on the effects of legislature size on government spending, and ended up with 5,705 records
These variables are useful for estimating meta-regression models, which allow us to estimate the effects of moderators on the ATE
Step 04: Conduct meta-analysis
We used the metaR package to conduct the meta-analysis
Although we did estimate fixed effects models, the main models used random effects, which are more appropriate for our data
Instead of having \(\tau_i = \theta + \varepsilon_i\) (true effect + within-study error), we assumed that \(\theta\) varies across studies, with a true overall parameter \(\mu\) and a between-study error \(\xi_i\):
\[\tau_i = \mu + \xi_i + \varepsilon_i\]
We added two levels to the models: publication ID for each paper, and a common index for papers that share the same data (see the sketch below for one way to specify such a model)
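As a sketch of one possible specification, here is a hypothetical example using the metafor package (which may differ from the package used in our analysis), with made-up data and variable names:

library(metafor)

# Hypothetical coded estimates: one row per reported effect, with its sampling
# variance, the publication it comes from, and an index for shared datasets
meta_df <- data.frame(
  ate     = c(0.10, 0.05, -0.02, 0.20, 0.15, 0.01),
  var_ate = c(0.010, 0.012, 0.008, 0.020, 0.015, 0.011),
  pub_id  = c("p1", "p1", "p2", "p3", "p4", "p4"),
  data_id = c("d1", "d1", "d1", "d2", "d3", "d3")
)

# Multilevel random effects model: estimates nested within publications,
# publications nested within shared datasets (one possible specification)
res_ml <- rma.mv(yi = ate, V = var_ate,
                 random = ~ 1 | data_id / pub_id,
                 data = meta_df)
summary(res_ml)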
We found a lot of heterogeneity in the data!
But the meta-regressions also helped us understand the main drivers of the differences in the ATE 🤓
Here is one of the many figures we produced. As you can see, the main effect is close to zero and there is a lot of variability in the estimates
Meta-regression
Testing for method heterogeneity
Conclusion 📝
Key takeaways
Generalising experimental findings is inherently challenging, as it introduces additional uncertainty and the potential for bias
It’s important to distinguish between the SATE (effect in the sample) and the PATE (effect in the target population). Estimating the PATE requires accounting for sampling variability.
Convenience sampling complicates PATE estimation and makes standard error calculation based on random sampling assumptions inappropriate.
The Bayesian framework offers a coherent way to integrate prior knowledge with new evidence and explicitly model uncertainty about potential biases
Meta-analysis provides tools for pooling study results, but standard methods (like fixed effects) rely on strong, often unmet, assumptions (e.g., homogeneity of effects, no publication bias, no sampling bias).
Statistical models facilitate interpolation and extrapolation but depend on correct specification. Prediction uncertainty escalates rapidly as one moves further from the range of observed data
Transparency about assumptions is required when integrating findings or extrapolating results