QTM 385 - Experimental Methods

Lecture 04 - Selection Bias, Variability, and Regression Analysis

Danilo Freire

Emory University

Hello, everyone! 👋
Nice to see you all again! 😉

Brief recap 📚

Last time, we saw that…

  • The potential outcomes framework helps us understand treatment effects
  • Fundamental problem of causal inference: we only see one potential outcome
  • Treatment effect is the difference between potential outcomes (\(\tau_{i} = Y_{1,i} - Y_{0,i}\))
  • Average causal effect is the goal, but simple means comparison is usually biased
  • Selection bias: treated/untreated groups may differ even before treatment
  • Bias does not vanish in large samples
  • Mathematical expectation \(E[Y_i]\) is the population average
  • Law of large numbers: sample average converges to population average
  • Conditional expectation \(E[Y_i|X_i]\) is average outcome given \(X_i\)
  • SUTVA (Stable Unit Treatment Value Assumption)
  • Violations of SUTVA: spillovers, contamination
  • External validity: generalising results to other settings
  • Heterogeneous treatment effects: effects vary across people
  • Compliance problem: people may not follow random assignment
  • Intent-to-Treat (ITT) and Treatment-on-the-Treated (TOT) analyses address compliance

Today, we will discuss…

  • A little more about potential outcomes
  • Measuring variability
  • Different types of selection bias, e.g.:
    • Inappropriate controls
    • Loss to follow-up
    • Attrition bias
    • Survivorship bias
    • Post-treatment bias
  • How to use R and DeclareDesign to estimate regression models
  • But first…

Source: xkcd (where else? 😄)

Funny correlation of the day!

Let’s get started! 🚀

Potential outcomes revisited

  • Remember from last time: we have two potential outcomes for each unit
  • We only observe one of them: \(Y_i = Y_{1,i} \cdot D_i + Y_{0,i} \cdot (1 - D_i)\)
    • \(D_i\) is the treatment indicator = 1 if treated, 0 if not
  • So we need to estimate the average treatment effect (ATE):
    • \(E[Y_{1,i} - Y_{0,i}] = E[Y_{1,i}] - E[Y_{0,i}]\)
  • The problem is, not all comparisons are valid
  • I mean, they can be, but only under the heroic assumption that the groups are similar on average before any adjustment
  • Otherwise, we will have selection bias
  • Let’s see how this works in practice! 🤓

Khuzdar and Maria

  • Selection bias occurs when the groups are systematically different, often before the treatment (but not exclusively!)
  • Using an example from Angrist and Pischke (2021), let’s say we have a student called Khuzdar from Kazakhstan, who is considering studying in the US and is worried about the cold weather
  • Should he get health insurance? 🤔
  • Let’s imagine that, without insurance, Khuzdar has a potential outcome of \(Y_{0,i} = 3\) and, with insurance, \(Y_{1,i} = 4\). So the treatment effect is \(\tau_i = 1\), that is, he gains 1 “health point” by getting insurance
  • Now, let’s imagine that Khuzdar has a Chilean colleague called Maria Moreno, who is also considering studying in the US
  • But since she comes from chilly Santiago, she is not worried about the cold weather
  • So, without insurance, Maria has a potential outcome of \(Y_{0,i} = 5\) and, with insurance, \(Y_{1,i} = 5\). So the treatment effect is \(\tau_i = 0\), that is, she gains no “health points” by getting insurance

Khuzdar and Maria

  • In fact, the comparison between frail Khuzdar and hearty Maria tells us little about the causal effects of their choices!
  • Why is that? Because they were different to begin with
  • Imagine that Khuzdar got insurance and Maria did not
  • Let’s do a little mathematical trick here: we will add and subtract \(Y_{0, Khuzdar}\) from the treatment effect (they cancel each other out, right?)
  • So we have the following:
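\[
\begin{aligned}
Y_{Khuzdar} - Y_{Maria} &= Y_{1, Khuzdar} - Y_{0, Maria} \\
&= \underbrace{\left(Y_{1, Khuzdar} - Y_{0, Khuzdar}\right)}_{\text{causal effect} \, = \, 1} + \underbrace{\left(Y_{0, Khuzdar} - Y_{0, Maria}\right)}_{\text{selection bias} \, = \, 3 - 5 \, = \, -2} = 4 - 5 = -1
\end{aligned}
\]

  • Generalising to group averages (the standard decomposition in Angrist and Pischke, 2021):

\[
\underbrace{Avg_n[Y_i|D_i = 1] - Avg_n[Y_i|D_i = 0]}_{\text{difference in means}} = \underbrace{Avg_n[Y_{1i}|D_i = 1] - Avg_n[Y_{0i}|D_i = 1]}_{\text{average treatment effect (on the treated)}} + \underbrace{Avg_n[Y_{0i}|D_i = 1] - Avg_n[Y_{0i}|D_i = 0]}_{\text{selection bias}}
\]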

Difference in means = average treatment effect + selection bias

  • So far, so good? 🤓
  • If we assume that the treatment has a constant effect (i.e. the treatment effect is the same for everyone), we can rewrite the equation as:
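\[
Y_{1i} = Y_{0i} + k \quad \Longrightarrow \quad k = Y_{1i} - Y_{0i} = Avg_n[Y_{1i}] - Avg_n[Y_{0i}]
\]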

  • Where \(k\) is both the individual and average causal effect of insurance on health
  • Using the constant-effects model to substitute for \(Avg_n[Y_{1i}|D_i = 1]\), we have
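\[
Avg_n[Y_i|D_i = 1] - Avg_n[Y_i|D_i = 0] = k + \underbrace{Avg_n[Y_{0i}|D_i = 1] - Avg_n[Y_{0i}|D_i = 0]}_{\text{selection bias}}
\]

  • That is, any gap between the two groups beyond \(k\) comes from differences in their untreated outcomes, i.e. selection bias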

How to check for selection bias?

Balance tests

  • We use balance tests (or randomisation checks) to check whether the groups are similar before treatment
  • The idea is to compare the means of the covariates for the treated and untreated groups
  • If the means are similar, it provides evidence that nothing systematic is driving the treatment effect
  • We are never 100% sure, but we trust the random assignment
  • Small differences in means are acceptable, as long as they are not systematic
  • Why? Because some variation can happen only by chance
  • 100% personal opinion: I think they aren’t very useful 😂
  • Mutz et al. (2018) make a good case for that
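  • As a quick illustration, here is a minimal sketch with simulated data (hypothetical variable names): a simple t.test() comparison of a pre-treatment covariate across arms is one common way to run the check
# Minimal balance-check sketch with simulated data: under random assignment,
# a pre-treatment covariate such as age should have roughly the same mean in both arms
set.seed(385)
d <- data.frame(treat = rbinom(1000, 1, 0.5),
                age   = rnorm(1000, mean = 30, sd = 5))
t.test(age ~ treat, data = d) # any small difference here is due to chance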

Angrist and Pischke (2021), p. 20 (selected parts)

Questions? 🤔

Different types of selection bias 🎯

Causal diagrams and associations

  • Causal diagrams are a great way to visualise the relationships between variables
    • They are also called directed acyclic graphs (DAGs)
    • They are “directed” because they show the direction of the relationships, and “acyclic” because they do not have loops (one cause cannot cause itself)
  • They were introduced by Judea Pearl and are widely used in causal inference (quick guide here)
  • They help us understand the associations between variables and identify potential sources of bias
  • And they are super easy to draw! 🤓
  • For instance, let’s illustrate a classic example of selection bias with a causal diagram (borrowed from Hernán et al., 2019)

  • Figure 1 shows how the common cause \(L\) (smoking) makes \(E\) (carrying matches) and \(D\) (lung cancer) associated with each other
  • Figure 2 shows how randomisation breaks the association between \(L\) and \(D\), so the treatment effect is unbiased
  • In this case, if you know what \(L\) is, you can also control for it in your analysis
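  • As an aside, DAGs like the one in Figure 1 are easy to write down in R; here is a minimal sketch, assuming the dagitty package is installed
# Minimal sketch of the Figure 1 DAG (assumes the dagitty package is installed):
# smoking (L) causes both carrying matches (E) and lung cancer (D)
library(dagitty)
g <- dagitty("dag { L -> E ; L -> D }")
plot(graphLayout(g)) # quick automatic layout of the graph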

Some sources of selection bias

Losses to follow-up

  • One problem that we sometimes have with experiments is individuals dropping out
  • That is, when individuals leave the study before the end of follow-up
  • Why is that an issue?
  • Because the treatment effect may be different for those who leave
  • Imagine a follow-up study of the effect of antiretroviral therapy (\(E\)) on AIDS (\(D\)) risk among HIV-infected patients
  • \(L\) is the presence of other symptoms of HIV infection
  • The greater the true level of immunosuppression (\(U\)), the greater the risk of AIDS. \(U\) is unmeasured
  • If a patient drops out from the study, his AIDS status cannot be assessed and we say that he is censored (\(C=1\))
  • Patients with greater values of \(U\) are more likely to be lost to follow up
  • The causal diagram shows the relationship between the variables

  • How can we correct for this bias?
  • One way is to use stratified analysis or inverse probability weighting
  • We will discuss these methods in further detail in the next lectures
  • But stratification is quite simple:
    • Assuming that \(L\) is measured, we can estimate the treatment effect within each stratum of \(L\)
    • Then we can average the stratum-specific estimates to get the overall treatment effect
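  • Here is a minimal base R sketch of that logic with simulated data (the variable names and numbers are made up for illustration)
# Stratification sketch: estimate the effect within each stratum of L, then
# take a weighted average of the stratum-specific estimates
set.seed(385)
n <- 1000
L <- rbinom(n, 1, 0.4)                # measured stratification variable
treat <- rbinom(n, 1, 0.5)
y <- 2 + 1 * treat + 3 * L + rnorm(n) # true treatment effect = 1 in both strata
d <- data.frame(y, treat, L)

strata  <- split(d, d$L)
effects <- sapply(strata, function(s) mean(s$y[s$treat == 1]) - mean(s$y[s$treat == 0]))
weights <- sapply(strata, nrow) / nrow(d)
sum(effects * weights)                # stratum-weighted overall estimate, close to 1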

Attrition bias

  • Attrition bias is a type of selection bias that occurs when individuals drop out of a study in a way that is related to the treatment or outcome
  • It is more systematic than losses to follow-up, which can happen by chance (e.g., disengagement, relocation, or other factors that lead to incomplete data)
  • For example, in a clinical trial, if participants experiencing side effects disproportionately drop out of the treatment group compared to the placebo group, this difference introduces attrition bias
  • In short, attrition bias specifically addresses the bias that arises when these departures are non-random
  • As with losses to follow-up, this is a type of selection bias that can be caused by the experiment itself, not only by pre-existing differences between groups

Survivorship bias

  • In my view, one of the coolest examples of selection bias
  • Survivorship bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not
  • It is also closely related to attrition bias and follow-up bias (the latter is more often mentioned in observational studies)
    • Attrition bias emphasises the process of losing data, while survivorship bias highlights the exclusion of failures in analysis
  • For instance, imagine that you are studying the effect of education on income
  • If you only look at people who have completed college, you are ignoring those who dropped out
  • This can lead to an overestimation of the effect of education on income
  • Why? Because those who dropped out may have done so because they were not good students, which can affect their income
  • Have a look at this example (you may have seen it before!)

Survivorship bias in World War II aircraft

  • During World War II, the US military wanted to improve the armour on its aircraft
  • They analysed the bullet holes on returning aircraft and decided to add more armour to the areas with the most bullet holes
  • Considering the picture you see here, where would you add the armour?
  • More here
  • Another example: Imagine you are studying mutual fund performance, and you only look at funds that are still in operation. You see that they beat the S&P 500 by 2% per year
  • What’s the problem here?

Post-treatment bias

  • Post-treatment bias (or overcontrol) occurs when we control for variables that are affected by the treatment
  • Imagine a research study evaluating a job training programme for unemployed individuals
  • The treatment is the job training programme and the outcome is finding employment after programme completion
  • Imagine also that people who take the programme submit more job applications after training
  • This is a post-treatment variable, as it is affected by the treatment
  • If we control for the number of job applications in our analysis, we (usually) underestimate the programme’s effectiveness, because we are removing one of the mechanisms through which the treatment works
  • This bias, however, can go in either direction, depending on the context
  • Is there a statistical fix for it? Not really!
  • The best way to avoid post-treatment bias is to think carefully about the variables you include in your analysis
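  • For intuition, here is a simulated sketch (hypothetical numbers) of the job training example: conditioning on the mediator absorbs part of the effect we want to measure
# Post-treatment bias sketch: training raises applications, and applications raise
# employment, so controlling for applications removes part of the programme's effect
library(estimatr)
set.seed(385)
n <- 1000
treat        <- rbinom(n, 1, 0.5)
applications <- 5 + 3 * treat + rnorm(n)        # post-treatment variable (mediator)
employed     <- 0.2 * treat + 0.1 * applications + rnorm(n)
d <- data.frame(treat, applications, employed)

coef(lm_robust(employed ~ treat, data = d))["treat"]                 # ~0.5, the total effect
coef(lm_robust(employed ~ treat + applications, data = d))["treat"]  # ~0.2, understated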

Examples of post-treatment bias

Avoidable post-treatment bias

  • Causal effect of party ID on the vote
    • Do control for race
    • Do not control for voting intentions five minutes before voting
  • Causal effect of race on salary in a firm
    • Do control for qualifications
    • Don’t control for position in the firm
  • Causal effect of medicine on health
    • Do control for health prior to the treatment decision
    • Do not control for side effects or other medicines taken later

Unavoidable post-treatment bias

  • Causal effect of democratisation on civil war; do we control for GDP?
    • Yes, since GDP → democratisation we must control to avoid omitted variable bias
    • No, since democratisation → GDP, we would have post-treatment bias
  • Causal effect of legislative effectiveness on state failure; do we control for trade openness?
    • Yes, since trade openness → legislative effectiveness, we must control to avoid omitted variable bias
    • No, since legislative effectiveness → trade openness, we would have post-treatment bias

Now it’s your turn! 🤓

  • I will list some factors and you tell me if they are likely to be sources of selection bias (and why)
    • Telephone random sampling
    • Ascertainment: when the investigator responsible for assessing the outcome of interest is also aware of the group assignment
    • WEIRD samples: Western, Educated, Industrialised, Rich, and Democratic
    • Referral filter: when the study population is recruited through referrals, often from health care providers or NGOs
    • When you ask socially sensitive questions in a survey

Measuring variability 📏

Measuring variability

Variance and standard deviation

  • Apart from averages, we are also interested in variability
  • We usually look at average squared deviations from the mean (variance)
  • The sample variance of \(Y_i\) in a sample of size \(n\) is defined as:
    • \(S(Y_i)^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2\)
  • The population variance, which is a fixed parameter, is defined as:
    • \(Var(Y_i) = E[(Y_i - E[Y_i])^2]\)
    • Here we just replaced averages with expectations
  • Since the variance can be quite large, we often look at the standard deviation (square root of the variance), written as \(\sigma(Y_i)\)
  • These measures are descriptive, as they summarise the variability in your dataset
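  • In R, var() and sd() compute exactly these sample quantities (with the \(n-1\) denominator)
# Sample variance and standard deviation in base R
y <- c(2, 4, 4, 4, 5, 5, 7, 9)
var(y)                          # sample variance, uses the n - 1 denominator
sd(y)                           # sample standard deviation
all.equal(sd(y), sqrt(var(y)))  # TRUE: the SD is the square root of the variance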

Standard error

  • The standard deviation of an inferential statistic is called its standard error
  • Standard error measures how precise your sample mean is as an estimate of the true population mean (which we usually don’t know)
  • The standard error of the sample mean can be written as:
    • \(SE(\bar{Y}) = \frac{\sigma(Y_i)}{\sqrt{n}}\)
  • But since we usually don’t have the population standard deviation, we use the sample standard deviation instead
    • \(SE(\bar{Y}) = \frac{S(Y_i)}{\sqrt{n}}\)
  • The standard error is not the same as the standard deviation
  • The standard error is a measure of the precision of an estimate, while the standard deviation is a measure of the variability of the data
  • Confused? 🤔 Don’t worry, let’s see their differences in further detail 😉

Comparison between standard deviation and standard error

  • A high SD indicates more spread in the data points. It does not inherently change with sample size
  • The standard error decreases as the sample size increases, indicating that larger samples yield more reliable estimates of the mean
    • That’s the central limit theorem at work! 🤓
  • Confidence intervals in statistical models are often constructed using the SE, while descriptive statistics often report the SD
  • Does it sound clearer now? 🙂
  • Suppose you measure the IQ of 50 students:
    • \(\bar{IQ} = 100\)
    • \(SD = 15\)
    • \(SE = \frac{15}{\sqrt{50}} \approx 2.12\)
  • \(SD = 15\): Individual IQs vary by 15 points from the mean (e.g., most are between 85 and 115).
  • \(SE = 2.12\): If you repeated the study, the average IQ of your samples would typically vary by about 2 points around 100
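  • You can see the distinction by simulation: draw many samples of 50 "students" and look at how much the sample means themselves vary
# SD vs SE by simulation: the spread of sample means across repeated samples of
# n = 50 matches the analytical standard error
set.seed(385)
sample_means <- replicate(10000, mean(rnorm(50, mean = 100, sd = 15)))
sd(sample_means)   # close to the analytical SE below
15 / sqrt(50)      # ~2.12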

Standard error of a difference in means

  • We are usually interested not in one mean, but in the difference between two means
  • Let’s assume we have a treatment group mean of \(\bar{Y}_{1}\) and a control group mean of \(\bar{Y}_{0}\), with sample sizes \(n_1\) and \(n_0\), respectively
  • The total sample size is \(n = n_1 + n_0\)
  • Using the fact that the variance of a difference between two statistically independent variables is the sum of their variances, we have:
    • \(Var(\bar{Y}_{1} - \bar{Y}_{0}) = Var(\bar{Y}_{1}) + Var(\bar{Y}_{0})\)
    • \(Var(\bar{Y}_{1} - \bar{Y}_{0}) = \frac{\sigma(Y_{1})^2}{n_1} + \frac{\sigma(Y_{0})^2}{n_0}\)
  • The standard error of the difference in means is then:
    • \(SE(\bar{Y}_{1} - \bar{Y}_{0}) = \sqrt{\frac{\sigma(Y_{1})^2}{n_1} + \frac{\sigma(Y_{0})^2}{n_0}}\)
  • Again, since we don’t usually have the population standard deviation, we use the sample standard deviation instead
    • \(SE(\bar{Y}_{1} - \bar{Y}_{0}) = \sqrt{\frac{S(Y_{1})^2}{n_1} + \frac{S(Y_{0})^2}{n_0}}\)
  • This is the standard error of the difference in means, which is used to calculate confidence intervals and hypothesis tests: \(t = \frac{\text{Difference}}{\text{SE}_{\text{diff}}}\) and \(CI = \text{Difference} \pm t \times \text{SE}_{\text{diff}}\)
  • Do you have to worry about this? Not really, but it’s good to know what’s going on under the hood 😉
  • Your software will do the heavy lifting for you 🤖
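  • Still, it helps to compute it once by hand; here is a minimal sketch with two simulated groups
# SE of a difference in means "by hand", with simulated treatment and control groups
set.seed(385)
y1 <- rnorm(500, mean = 15, sd = 2)   # hypothetical treatment group
y0 <- rnorm(500, mean = 10, sd = 2)   # hypothetical control group
diff_means <- mean(y1) - mean(y0)
se_diff    <- sqrt(var(y1) / length(y1) + var(y0) / length(y0))
c(difference = diff_means, se = se_diff, t = diff_means / se_diff)
# t.test(y1, y0) reports the same (Welch) standard error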

Regression analysis of experiments 📊

Regression analysis of experiments

  • It is super easy to estimate the average treatment effect using regression analysis
  • Although some areas (e.g. medicine, psychology) often use \(t\)-tests for difference-in-means, regression is more flexible and gives you exactly the same results
  • The idea is to regress the outcome variable on the treatment variable and some covariates if necessary
  • But always report the ATE without covariates first, as it is the most interpretable
  • Let’s see an example using R! 🤓
  • We will use the fabricatr, estimatr, and randomizr packages, which are part of the DeclareDesign suite
  • More information about the packages is available here, here, and here
  • We will start slowly, don’t worry, and build up from there in the next lectures 😉

Simulating data in R

  • Imagine that we want to estimate the effect of a job training programme on the number of interviews that unemployed individuals get, similar to the example we discussed earlier
  • The treatment variable is treat, which takes the value 1 if the individual received the training and 0 otherwise
  • We will assign 1000 individuals to the treatment and control groups, with exactly 500 in each group (complete random assignment)
  • The outcome variable is interviews, which is the number of interviews the individual got after the training
  • The treatment effect is 5, that is, individuals who received the training got 5 more interviews than those who did not
# Load the required packages
library(fabricatr)
library(estimatr)
library(randomizr)

# Set the seed for reproducibility
set.seed(385)

# Simulate the data
data <- fabricate( # fabricatr
  N = 1000, # base
  treat = complete_ra(N, m = 500), # randomizr
  interviews = round(rnorm(N, mean = 10, sd = 2) + 5 * treat, digits = 0) # base
)

# Check the first few rows of the data
head(data)
    ID treat interviews
1 0001     0         11
2 0002     1         15
3 0003     1         15
4 0004     1         16
5 0005     0          9
6 0006     0         11

Step-by-step: using fabricatr to simulate data

  • We used the fabricate() function from the fabricatr package to simulate the data
  • The package is very flexible and was built for all types of data simulations, mainly those used by social scientists (e.g. experiments, surveys)
    • We can draw from Likert scales, create binary variables, ordinal variables, categorical variables, etc
    • More about how to create different types of variables here
  • The fabricate() function takes a series of arguments, which are the variables we want to simulate
  • It uses base R functions too, which makes it easy to use
  • For instance, N = 1000 is the number of individuals in the sample and we created it using just base R
  • To simulate the interviews variable, we used the rnorm() function to draw from a normal distribution with mean 10 and standard deviation 2
  • We add 5 * treat because 5 is the treatment effect (treated individuals get five extra interviews on average)
  • As we cannot have a fraction of an interview, we used the round() function to round the number of interviews to the nearest integer (digits = 0)
  • So far so good? 🤓

Randomising treatment assignment

  • We used the complete_ra() function from the randomizr package to randomise the treatment assignment
  • The function has a few arguments, such as
    • N: the number of individuals in the sample (required)
    • m: the number of individuals assigned to the treatment group (optional)
    • m_each: the number of individuals assigned to each treatment group (optional)
  • We could have used simple_ra() for simple random assignment, which is a simple coin flip. The arguments are:
    • N: the number of individuals in the sample (required)
    • prob: the probability of being assigned to the treatment group (optional)
    • prob_each: the probability of being assigned to each treatment group (optional)
  • We could also use block_ra() for block random assignment, which is when we divide the sample into blocks and then randomise within each block
  • cluster_ra() is used for cluster random assignment, which is when we randomise at the cluster level (e.g. schools, hospitals)
  • Finally, block_and_cluster_ra() is used for block and cluster random assignment, which is a combination of the two
  • We will discuss these functions in further detail in the next lectures 😉
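  • For reference, here is a minimal sketch of these functions with made-up blocks and clusters
# Other randomizr assignment functions (hypothetical inputs, for illustration)
library(randomizr)
set.seed(385)
simple_ra(N = 10, prob = 0.5)                   # coin-flip assignment; group sizes vary
block_ra(blocks = rep(c("A", "B"), each = 5))   # randomise within each block
cluster_ra(clusters = rep(1:5, each = 2))       # assign whole clusters together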

Estimating the treatment effect

  • Now that we have the data, we can estimate the treatment effect using the lm_robust() function from the estimatr package
  • The function is similar to the lm() function, but it reports robust standard errors, which are more appropriate for experiments
  • Why? Because robust standard errors correct for heteroskedasticity (and, via the clusters argument, for clustered assignment)
  • This is close to a free lunch: if your data are not heteroskedastic or clustered, the robust standard errors will be essentially the same as the regular ones
  • The function takes a formula as an argument, which is the outcome variable regressed on the treatment variable
model <- lm_robust(interviews ~ treat, data = data)
summary(model)

Call:
lm_robust(formula = interviews ~ treat, data = data)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
(Intercept)    10.18     0.0922  110.39  0.000e+00    9.997   10.359 998
treat           4.91     0.1305   37.63 1.205e-193    4.654    5.166 998

Multiple R-squared:  0.5866 ,   Adjusted R-squared:  0.5862 
F-statistic:  1416 on 1 and 998 DF,  p-value: < 2.2e-16

Difference in means

  • For those who would like to estimate the difference in means using the t.test() function, we’ve got you covered! 😉
  • estimatr also has a function that does that, called difference_in_means() (what a surprise! 😂)
  • It is actually more flexible than the t.test() function, as it can handle unit-randomised, cluster-randomised, block-randomised, matched-pair randomised, and matched-pair clustered designs. More here
  • Here is how to use it:
difference_in_means(interviews ~ treat, data = data)
Design:  Standard 
      Estimate Std. Error  t value      Pr(>|t|) CI Lower CI Upper       DF
treat     4.91  0.1304729 37.63232 1.204798e-193 4.653967 5.166033 997.9986
  • As you can see, the results are exactly the same as the regression model
  • Just the output looks a little different, but the interpretation is the same
  • Use whatever you feel more comfortable with! 😊
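  • And if you prefer base R, a quick t.test() (Welch's test by default) gives essentially the same answer, just reported as group means rather than a single difference
# Base R comparison: Welch's t-test on the same simulated data
t.test(interviews ~ treat, data = data)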

Interpreting the results

  • The output of the summary() function gives us the coefficient estimate of the treatment variable
  • The intercept is 10.18, which is the average number of interviews for the control group
  • The coefficient of the treatment variable is 4.91, which is the average treatment effect
  • The standard error of the treatment effect is 0.13, which is the standard error of the difference in means
  • Our results are statistically significant at the 5% level, as the \(p\)-value is less than 0.05
  • The \(R^2\) and its adjusted version are measures of the goodness of fit of the model
    • They are the proportion of the variance in the outcome variable explained by the model
  • The F statistic is a test of the overall significance of the model
    • The null hypothesis is that all coefficients are zero
    • But as we see, the treatment variable is significant

Including pre-treatment covariates in the model

  • We can also include covariates in the model to increase precision
  • In contrast with the treatment, the coefficients on the covariates have no causal interpretation; the covariates should be measured before treatment, so they cannot be affected by it
  • Even with randomisation, there can be small imbalances in covariates between treatment and control groups, especially in small samples
  • Including these covariates in the regression helps to adjust for any residual imbalances that may exist due to random chance, making the analysis more accurate
  • Including covariates can also help bolster the credibility of the results by demonstrating that the treatment effect persists even after controlling for relevant variables
  • But remember to include only pre-treatment covariates in the model, as post-treatment covariates can introduce bias
  • Let’s simulate a new dataset with two pre-treatment covariates: age and education
set.seed(385)

data2 <- fabricate(
  N = 1000,
  treat = complete_ra(N, m = 500),
  age = round(rnorm(N, mean = 30, sd = 5)),
  education = round(rnorm(N, mean = 12, sd = 2)),
  interviews = round(rnorm(N, mean = 10, sd = 2) + 5 * treat)
)

head(data2)
    ID treat age education interviews
1 0001     0  32        11          9
2 0002     1  30        13         16
3 0003     1  31        12         16
4 0004     1  33        13         15
5 0005     0  27        15          7
6 0006     0  33        12          9

Estimating the second model

  • We can add covariates to experimental models to increase precision
  • The same way we do in any regression:
    • \(Y_i = \alpha + \beta \text{T}_i + \gamma \text{X}_i + \epsilon_i\)
  • Freedman (2008) demonstrated that pre-treatment covariate adjustment may bias estimates of average treatment effects, mainly in small samples
  • Lin (2013) showed that centring the covariates and fully interacting them with the treatment variable avoids this small-sample concern and cannot hurt precision in large samples
  • This adjusts for covariates separately in each group, which is equivalent to including treatment-by-covariate interactions
    • \(Y_i = \alpha + \beta \text{T}_i + \gamma \text{W}_i + \delta \text{T}_i \times \text{W}_i + \epsilon_i\)
    • Where \(W_i = \text{X}_i - \bar{\text{X}}\) and \(\bar{\text{X}}\) is the mean of \(\text{X}_i\)
    • Centring the covariates to have mean zero keeps \(\beta\) directly interpretable as the average treatment effect
  • Anyway, worry not! 😅 Our friends at DeclareDesign have done all the work for us and created a function called lm_lin() that does everything automatically!

Estimating the second model

model2 <- lm_lin(interviews ~ treat,
                covariates = ~ age + education,
                data = data2)
summary(model2)

Call:
lm_lin(formula = interviews ~ treat, covariates = ~age + education, 
    data = data2)

Standard error type:  HC2 

Coefficients:
                  Estimate Std. Error  t value   Pr(>|t|)  CI Lower CI Upper
(Intercept)        9.88659    0.09213 107.3127  0.000e+00  9.705804 10.06738
treat              5.13084    0.12546  40.8956 3.281e-215  4.884642  5.37704
age_c              0.02845    0.01771   1.6062  1.085e-01 -0.006307  0.06320
education_c       -0.04550    0.04721  -0.9636  3.355e-01 -0.138150  0.04716
treat:age_c       -0.03324    0.02546  -1.3054  1.921e-01 -0.083206  0.01673
treat:education_c  0.04324    0.06140   0.7043  4.814e-01 -0.077248  0.16373
                   DF
(Intercept)       994
treat             994
age_c             994
education_c       994
treat:age_c       994
treat:education_c 994

Multiple R-squared:  0.6272 ,   Adjusted R-squared:  0.6254 
F-statistic: 344.4 on 5 and 994 DF,  p-value: < 2.2e-16
  • The lm_lin() function is similar to the lm_robust() function, but it includes the covariates argument
  • We pass a simple R formula to it
  • As you can see, the results are pretty similar to the previous model
  • The treatment effect is slightly higher, and the standard error is slightly lower
  • The _c part of the variable names indicates that the variables are centred, as recommended by Lin (2013).
  • Lin’s method addresses covariate imbalance by adjusting for covariates separately within each treatment group, which is equivalent to including all treatment-by-covariate interactions.
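  • If you want to see what lm_lin() does under the hood, here is a minimal sketch of the manual equivalent: centre the covariates yourself and interact them with the treatment (it should reproduce the estimates above)
# Manual version of the Lin estimator: centred covariates fully interacted with treatment
data2$age_c       <- data2$age - mean(data2$age)
data2$education_c <- data2$education - mean(data2$education)
lm_robust(interviews ~ treat * (age_c + education_c), data = data2)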

Sub-group analysis

  • We can also estimate the treatment effect for sub-groups of the population
  • This is useful when we suspect that the treatment effect may vary across known dimensions
  • For instance, we can estimate the treatment effect for people with high and low levels of education
  • We can do this by including an interaction term between the treatment variable and the covariate of interest
  • We will discuss this in further detail in the next lectures, but here is how to do it
# Create a binary subgroup (e.g., "high" vs. "low" education)
data2$high_edu <- ifelse(data2$education > median(data2$education), 1, 0)

# Fit an interaction model with covariates
model_interaction <- lm_robust(
  interviews ~ treat * high_edu + age + education,
  data = data2)

# Summarize results
summary(model_interaction)

Call:
lm_robust(formula = interviews ~ treat * high_edu + age + education, 
    data = data2)

Standard error type:  HC2 

Coefficients:
               Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
(Intercept)     9.39778    0.69048 13.6104  8.255e-39  8.04281 10.75275 994
treat           5.10068    0.16905 30.1726 1.682e-142  4.76894  5.43241 994
high_edu       -0.26179    0.24257 -1.0792  2.808e-01 -0.73780  0.21422 994
age             0.01141    0.01276  0.8938  3.716e-01 -0.01363  0.03644 994
education       0.02033    0.05026  0.4046  6.859e-01 -0.07829  0.11895 994
treat:high_edu  0.09721    0.25093  0.3874  6.985e-01 -0.39521  0.58962 994

Multiple R-squared:  0.6269 ,   Adjusted R-squared:  0.625 
F-statistic: 341.1 on 5 and 994 DF,  p-value: < 2.2e-16

Sub-group analysis: interpreting the results

  • The output of the summary() function gives us the coefficient estimate of the treatment variable for the high education group
    • It is in the last line, treat:high_edu
  • Interpretation: The treatment effect for the high-education subgroup (high_edu = 1) is 0.097 interviews larger than for the low-education subgroup
  • Total treatment effect for high_edu = 1: 5.101 + 0.097 = 5.198
  • Significance: The interaction is not significant (CI: -0.395 to 0.590), meaning there’s no evidence the treatment effect differs between education subgroups
  • This is expected, as we simulated the data to have the same treatment effect for all individuals
  • But it is good to know how to do it, right? 😉
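  • If you want the subgroup estimate and its standard error directly, here is a small sketch using the coefficient estimates and their variance-covariance matrix (coef() and vcov() work on lm_robust objects)
# Total treatment effect for the high-education group and its standard error
b <- coef(model_interaction)
V <- vcov(model_interaction)
est <- b["treat"] + b["treat:high_edu"]
se  <- sqrt(V["treat", "treat"] + V["treat:high_edu", "treat:high_edu"] +
            2 * V["treat", "treat:high_edu"])
c(estimate = unname(est), std_error = unname(se))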

Cool, isn’t it? 😎

Questions?

Summary

  • We discussed the potential outcomes framework in further detail, focusing on issues of selection bias
  • We talked about different types of selection bias, such as inappropriate controls, loss to follow-up, attrition bias, survivorship bias, and post-treatment bias
  • We also discussed how to measure variability using the standard deviation and standard error
  • Then, we talked about regression analysis of experiments, including how to simulate data, randomise treatment assignment, estimate the treatment effect, and include pre-treatment covariates in the model
  • We also introduced the fabricatr, estimatr, and randomizr packages, which are part of the DeclareDesign suite
  • Finally, we discussed sub-group analysis and how to estimate the treatment effect for different sub-groups of the population
  • I hope you enjoyed the lecture! 🤓
  • Next time, we will discuss more about hypothesis testing and confidence intervals, so stay tuned! 😉

Thank you and see you soon! 😃