QTM 385 - Experimental Methods

Lecture 14 - Writing Pre-Analysis Plans

Danilo Freire

Emory University

Hello, everyone! 😊

Brief recap 📚

Last class, we discussed…

  • Quarto for reproducible research and document authoring
  • Literate programming principles combining code and documentation
  • Version control integration with Git
  • Multi-format publishing (HTML, PDF, slides)
  • DeclareDesign simulation workflow components
  • Research design fundamentals and simulation workflows
  • Six key components: Population, Outcomes, Sampling, Assignment, Estimand, Estimator
  • Diagnostic analysis with power calculations

Today’s plan 📅

A closer look at pre-analysis plans

Writing and executing PAPs

  • We have briefly discussed the importance of PAPs before
  • Today, we will dive a little deeper into the topic
  • Discuss their pros and cons
  • We will talk about their components
    • Research questions, hypotheses, variables, estimations, threats to validity
  • In your group, you will work on a PAP template and discuss how its components apply to your research project
  • Finally, we will see some examples of PAPs you can use as a reference

The brief history of PAPs 📜

Why do we bother with PAPs?

  • The idea of PAPs is actually more recent than you might think
  • While RCTs have been around for decades, the first PAPs were written in the early 2000s in response to a growing concern about false results in medicine
  • In 1997, the U.S. Food and Drug Administration Modernization Act (FDAMA) mandated the public registration of clinical trials, including protocols for data collection and analysis
  • This led to ClinicalTrials.gov in 2000, a registry requiring researchers to outline primary outcomes, sample sizes, and statistical methods before patient enrollment
  • As you might expect, these statistical analysis plans (SAPs) were designed to prevent data mining and selective reporting
  • The International Council for Harmonisation (ICH) in Madrid further expanded these requirements to include handling of missing data and statistical models to reduce Type I errors

The rise of PAPs: Ulysses pacts for researchers?

  • The 2010s were a decade of reproducibility crises
  • Several studies tried to replicate famous experiments and failed
  • The Open Science Framework (OSF) was created in 2012 to promote open science practices
  • The American Economic Association (AEA) launched the RCT Registry in 2013
  • Casey et al. (2012) demonstrated how PAPs could “bind researchers’ hands” against data mining by pre-specifying outcomes, covariates, and subgroup analyses
  • Olken (2015) argued that comprehensive PAPs, like a “Ulysses pact”, were necessary to limit researcher discretion
  • Simmons et al. (2011) introduced the idea of “researcher degrees of freedom”, arguing that “it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis”

Ulysses/Odysseus and the Sirens

Pros and cons of PAPs

  • The two main advantages of PAPs are transparency and credibility
    • They prevent p-hacking and HARKing (hypothesising after results are known)
  • But what about the cons?
  • Ofosu and Posner (2021) mention some of the main criticisms:
    • PAPs are time-consuming: 50% of researchers spend more than 2 weeks writing them, 1/4 spend more than a month
    • They are inflexible and limit the scope for breakthroughs
    • PAPs force researchers to run sub-optimal analyses to avoid deviations, thus creating boring and uninformative research
    • People can steal your work and publish it before you do
    • Finally, some say that PAPs don’t even work, as they require constant policing and this is something academia does not reward

More than 50 hypotheses specified? Really? How can this prevent data mining?

Source: Ofosu and Posner (2021)
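
The worry about PAPs with 50 or more hypotheses can be made concrete with a little arithmetic. The sketch below assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function names are hypothetical, and the Bonferroni threshold is just one common correction, not something the survey prescribes.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 5, 20, 50):
        fwer = familywise_error_rate(k)
        print(f"{k:>2} tests: FWER = {fwer:.3f}, Bonferroni alpha = {bonferroni_threshold(k):.4f}")
```

Under these assumptions, 50 pre-specified tests at the 5% level give roughly a 92% chance of at least one false positive, so a long list of hypotheses only disciplines the analysis if the PAP also says how multiple comparisons will be handled.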

Responding to criticisms

  • PAPs are time-consuming: True, but the authors mention that 64% of respondents said that writing a PAP was useful
  • They are inflexible: Not necessarily
    • A good approach can be “to freely undertake exploratory investigations that go beyond the PAP, clearly labeling the results of such investigations in the paper as coming from analyses that were not pre-specified, with an explanation provided for why they were added”
  • PAPs are worthless without policing: About 40% of respondents said the reviewers mentioned the PAP in their reviews
  • People can steal your work: This seems to be a minor concern, as only 15% of respondents said they cared about this

SOPs as a flexible alternative to PAPs?

  • Lin and Green (2016) argue that we should adopt standard operating procedures (SOPs) as much as possible to avoid the pitfalls of PAPs
  • SOPs are more flexible because they only specify what you plan to do if something unanticipated happens, so you don’t need to spell out every detail of your analysis
  • They are more like a “safety net” for your research
  • They can save time if you work in a research group, as you can share the same SOP with your colleagues and avoid writing multiple PAPs
  • You can see an example here: https://github.com/acoppock/Green-Lab-SOP
  • While interesting in principle, SOPs apparently never really took off
  • Ofosu and Posner (2021) mention that only 3% of respondents said they used SOPs
  • Maybe it’s time to revisit this idea?

SOPs as a flexible alternative to PAPs?

Source: Lin et al. (2016)

Components of a PAP 📝

What should a PAP contain?

  • Scholars do not fully agree about how long or detailed a PAP should be
  • Uri (2017) argues that PAPs should not contain anything that is not essential to the analysis
  • McKenzie (2012) has a helpful pre-analysis plan checklist that includes only a few points
  • EGAP (2017) proposes a more comprehensive list of components, with 7 sections
    • This is the one we will use in this class!
  • From experience, PAPs can be as short as 2 pages or as long as 50 pages
    • The average is probably between 10 and 20 pages
  • At a minimum, PAPs should include 4 sections:
    • Unit of analysis, population, and inclusion/exclusion criteria
    • Method (observational, experimental, quasi-experimental)
    • Experimental intervention or explanatory variable
    • Outcomes of interest
  • Let’s see the EGAP template in more detail
  • Available at https://danilofreire.github.io/qtm385/design-form.html

Group activity

How would you organise your PAP?

  • Together with your group members, you will work on the EGAP template
  • The idea is to discuss how you would organise your PAP and fill out the template
  • You don’t need to complete all of it, just a brief summary of each section
  • You have a few minutes to discuss each section with your group, then we will share our thoughts with the class
  • Let’s start! 🚀

Section 01 - Introduction

Section 1: Introduction
  1. Researcher name
  2. Research project title
  3. One sentence summary of your specific research question
  4. General motivation
    1. Why should someone who is not an academic care about the results of this research? [1 paragraph]
    2. What policy decision(s) will your research help inform? [1 paragraph]
  5. Theoretical motivation
    1. What theoretical questions can this research shed light on? [1 paragraph]
    2. Key debate(s)/literature(s) that will be informed by the answer to your research question [1 paragraph]

Hypotheses

Section 1: Introduction
  1. Primary hypotheses
    1. What are the key parameters/estimands the research design seeks to estimate? What sign and/or magnitude is predicted by the primary hypotheses for each parameter/estimand? [1-2 paragraphs]
    2. What is the logic or theory of change behind the primary hypotheses? [1-2 paragraphs]
    3. What are the key pieces in the relevant academic literature that inform your hypotheses? [2-3 pieces]
  2. Secondary hypotheses
    1. What are the secondary parameters/estimands the research design seeks to estimate? What sign and/or magnitude is predicted by the secondary hypotheses for each parameter/estimand? [These may be conditional effects for subgroups or hypotheses about additional outcomes or cross-randomized treatments.]
    2. What is the logic or theory of change behind each secondary hypothesis? [Explain what effects we should expect if the theory behind your primary hypothesis is correct.]
  3. Alternative explanations if results are consistent with hypotheses
    1. What alternative theories could explain the results?
    2. Hypothesis for an alternative outcome (or other subgroups) that would be consistent only with the alternative explanation and not the logic behind your primary hypothesis.
  4. Alternative explanations if results are inconsistent with hypotheses
    1. What alternative theories could explain the results?

Population and Sample

Section 2: Population and Sample
  1. Population of interest
  2. Where and when will your study take place?
    1. Does this match up to your population of interest, or are there conditions that make this study context different?
  3. Sample size
    1. How is this sample selected? Be specific about the procedure.
  4. Consent
    1. How will you obtain informed consent? If you will not, what is the justification?
    2. Is this population vulnerable to being coerced into participating in the study?
  5. Ethics
    1. Is the sample size large enough that you have sufficient power for your research conclusions to be credible and useful?
    2. Is the sample size no larger than necessary for the research?
    3. Can the research (results) be used to target people or make people more vulnerable?

Section 03 - Treatment and Randomisation

Section 3: Intervention
  1. Status quo
    1. Describe the status quo: what are the current conditions in terms of the outcomes you hope to change? What aspects of the intervention already exist, if any?
  2. Intervention
    1. Describe your intervention(s)
    2. What is already known about the effect of the proposed intervention relative to the status quo? Is there credible evidence on the question?
  3. Control
    1. Describe the control condition
    2. Is the control condition a pure control (no intervention at all) or a placebo? What is the placebo condition designed to control for?
  4. Units
    1. To what units (level) will the intervention be applied? Individual, classroom, school, village, municipality, etc.
    2. Is this the same level at which outcomes will be measured? If not, how will you address the different levels if they do not perfectly overlap?

Threats to Validity

Section 3: Intervention
  1. Compliance
    1. What does it mean to “take” (comply with) the intervention?
    2. If the intervention is a program, how much does someone need to attend (showing up once? finishing the program?) in order to count as having attended?
  2. Non-compliance
    1. Is there any concern with non-compliance (either taking the intervention if assigned to control/placebo or failing to take the intervention if assigned to treatment)?
  3. Ethics
    1. Is the control condition no worse than the status quo, according to the best evidence available?
    2. Are there concerns that participants may be forced to comply with the intervention?
    3. What are the risks and magnitude of potentially negative effects of the treatment? Are such risks concentrated on a particular subset of your population?

Outcomes and Covariates

Section 4: Outcome and Covariates
  1. Primary outcome
    1. What is your primary outcome?
  2. Measurement
    1. How will it be measured? (Give the actual text of the survey question and response options, if using a survey measure. Is the outcome continuous, binary, etc.?)
  3. Priors
    1. What is the expected distribution of the primary outcome? (This may come from a prior study on a similar population or you may have to make an educated guess).
  4. Validity and measurement error
    1. Is there any concern with untruthful reporting? If so, how will you address it?
  5. Stages
    1. Will you collect a baseline?
    2. Will you collect a midline?
    3. Will you collect multiple waves of endline measurement?
    4. If you will collect a baseline or midline, how will you find the same respondents again (that is, minimize attrition)?
  6. Covariates
    1. What covariate data do you need, including for subgroup analysis? How will covariates be measured?
    2. What additional covariates (if any) will you measure?
    3. What additional outcomes or covariates will you collect to distinguish between your explanation and alternatives if your findings are consistent with your hypothesis?
  7. Ethics
    1. Will data collection be onerous (time, effort) or painful (physically, emotionally) for any respondents?
    2. Are these costs necessary? Have they been minimized?
    3. Are they outweighed by the potential benefits of the research to society?

Randomisation

Section 5: Randomisation
  1. Randomisation strategy
    1. Complete/simple, block, cluster, factorial, etc.
  2. Blocks
    1. What are they, how many blocks, how many units per block?
  3. Clusters
    1. What are they, how many clusters, how many units per cluster?
    2. If you have clusters, what is the intra-class correlation (ICC)?
    3. Is clustering strictly necessary, or could you randomize at the individual level?
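
To make the randomisation questions more tangible, here is a minimal sketch of block randomisation with a fixed seed, plus the standard design-effect formula that shows why the ICC matters for clustered designs. Everything in it is illustrative: the function names, the student and school labels, and the seed value are assumptions, and the design effect formula assumes equal cluster sizes.

```python
import random
from collections import defaultdict


def block_randomise(units, block_of, seed):
    """Within each block, assign half of the units (rounding down) to
    treatment (1) and the rest to control (0). A fixed seed makes the
    assignment reproducible, so it can be reported in the PAP."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of[unit]].append(unit)

    assignment = {}
    for block in sorted(blocks):          # sort so the result does not depend on input order
        members = sorted(blocks[block])
        rng.shuffle(members)
        n_treated = len(members) // 2
        for i, unit in enumerate(members):
            assignment[unit] = 1 if i < n_treated else 0
    return assignment


def design_effect(cluster_size, icc):
    """Variance inflation from cluster assignment: 1 + (m - 1) * ICC,
    assuming all clusters have the same size m."""
    return 1.0 + (cluster_size - 1) * icc


if __name__ == "__main__":
    # Hypothetical sample: 12 students in two schools used as blocks.
    units = [f"student_{i:02d}" for i in range(12)]
    block_of = {u: ("school_A" if i < 6 else "school_B") for i, u in enumerate(units)}
    assignment = block_randomise(units, block_of, seed=385)
    for unit in units:
        print(unit, block_of[unit], assignment[unit])
    print("Design effect with 20 units per cluster and ICC = 0.05:", design_effect(20, 0.05))
```

Even a modest ICC can noticeably shrink the effective sample size, which is one reason the template asks whether clustering is strictly necessary.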

Analysis

Section 6: Analysis
  1. Estimator
    1. What is your estimator?
  2. Standard errors
    1. What kind of standard errors will you use?
  3. Test
    1. If you plan to report a p-value, what kind of test will you use?
  4. Missing data
    1. How will you handle missing data?
  5. Effect size
    1. What is the expected effect size? What is the minimum effect size that would make the study worth running? What effect sizes have similar studies found?
  6. What is your power?
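
The effect size and power questions above can be answered with a quick back-of-the-envelope calculation before running a full simulation. The sketch below uses a normal approximation for a two-arm experiment with equal group sizes, a common outcome standard deviation, and a two-sided difference-in-means test; these are simplifying assumptions, and the function names and example numbers are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_two_arm(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided difference-in-means test with
    equal arms and a common outcome standard deviation."""
    se = sd * (2.0 / n_per_arm) ** 0.5
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_effect = abs(effect) / se
    # Probability that the test statistic falls in either rejection region.
    return (1 - STD_NORMAL.cdf(z_crit - z_effect)) + STD_NORMAL.cdf(-z_crit - z_effect)


def minimum_detectable_effect(sd, n_per_arm, alpha=0.05, power=0.8):
    """Smallest true effect detectable with the target power (normal approximation)."""
    se = sd * (2.0 / n_per_arm) ** 0.5
    return (STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)) * se


if __name__ == "__main__":
    # Hypothetical inputs: outcome sd of 1, expected effect of 0.2 sd.
    for n in (50, 100, 200, 400):
        print(f"n per arm = {n}: power = {power_two_arm(0.2, 1.0, n):.2f}, "
              f"MDE = {minimum_detectable_effect(1.0, n):.2f}")
```

A simulation-based diagnosis of the full design will usually be more faithful to blocking, clustering, and covariate adjustment, but this kind of calculation is a useful sanity check on the numbers reported in the PAP.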

Implementation

Section 7: Implementation
  1. Randomisation
    1. How will you conduct the randomisation? (on a computer in advance, drawing from an urn in public, etc.)
  2. Implementation
    1. Who will implement the intervention?
    2. Are there any dangers to your research team, including enumerators? How will you minimize them?
    3. How will you track the quality of the implementation of the intervention?
  3. Compliance
    1. Who will measure compliance?
  4. Data management
    1. How will you manage the data? (security, anonymity, etc.)

PAP examples you can use 📝

Some examples

Dates for your PAPs 📅

Important dates

  • Wednesday, March 26: PAP draft due (10-15 pages)
  • Monday, March 31: You will receive feedback on your PAP
  • Monday, April 7: Final version of your PAP due
  • Monday, April 14: You will receive your dataset
  • Wednesday, April 23 and Monday, April 28: You will present your results in class (about 15 minutes) and submit your final report (about 15 pages)

And that’s all for today! 🎉

Have a great week! 🎉