library(data.table)
library(fixest)
library(ggplot2)
library(broom)
setDTthreads(100) # threads used by data.table
setFixest_nthreads(4) # threads used by fixest
Bootstrap
Exact finite-sample results are unavailable for most microeconometric estimators and related test statistics, so the researcher usually relies on asymptotic theory to draw inference. An alternative is provided by the bootstrap (Efron and Tibshirani 1994). It approximates the distribution of a statistic by Monte Carlo simulation, with sampling done from the empirical distribution of the observed data.
Suppose we have an iid sample \(\{\mathbf{w_1}, \ldots, \mathbf{w_N}\}\), where \(\mathbf{w_i} = (y_i , \mathbf{x_i})\), and let \(\hat\theta\) be an estimator that is \(\sqrt{N}\)-consistent and asymptotically normally distributed.
A general bootstrap algorithm is:

1. Given data \(\mathbf{w_1},\ldots, \mathbf{w_N}\), draw a bootstrap sample of size \(N\) and denote it \(\mathbf{w^*_1},\ldots, \mathbf{w^*_N}\).
2. Calculate an appropriate statistic using the bootstrap sample, for example the studentized statistic \(t^*=(\hat\theta^*-\hat\theta)/s_{\hat\theta^*}\).
3. Repeat steps 1 and 2 \(B\) independent times, obtaining \(B\) bootstrap replications of the statistic.
4. Use these \(B\) bootstrap replications to obtain a bootstrapped version of the statistic.
Consider the usual test statistic \(t_N=(\hat\theta-\theta_0)/s_{\hat\theta}\) for testing the hypothesis \(H_0: \theta = \theta_0\) against \(H_a:\theta\neq\theta_0\). We perform \(B\) bootstrap replications, gathering studentized test statistics \(t_1^*, \ldots, t_B^*\) centered around \(\hat\theta\), since resampling is from a distribution centered around \(\hat\theta\). We prefer the studentized statistic since it provides inference with asymptotic refinement (Cameron and Trivedi 2005).
\[ \begin{equation} t_b^*=(\hat\theta_b^*-\hat\theta)/s_{\hat\theta_b^*} \end{equation} \]
The distribution of \(t_1^*, \ldots, t_B^*\) is used to approximate the distribution of \(t_N\). For an upper one-sided alternative, the bootstrap critical value is the upper \(\alpha\) quantile of the \(B\) ordered test statistics, the \((1-\alpha)(B+1)^{th}\) bootstrap statistic. For a two-sided nonsymmetrical test, the bootstrap critical values are the lower \(\alpha/2\) and upper \(\alpha/2\) quantiles of the ordered test statistics, and the null hypothesis is rejected if the original \(t_N\) statistic lies outside this range. For a symmetrical test we instead order \(|t^*|\), and the critical value is the \(1-\alpha\) quantile; the null hypothesis is rejected if \(|t_N|\) exceeds this critical value.
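The whole procedure can be sketched in base R for the simplest case, the sample mean, with no modeling packages (toy simulated data, not the schools dataset used below):

```r
# Minimal sketch, base R only: bootstrap the studentized t for a sample mean
set.seed(1)
w <- rnorm(50, mean = 1)    # toy iid sample
N <- length(w)
theta_hat <- mean(w)
se_hat    <- sd(w)/sqrt(N)

B <- 399
t_star <- replicate(B, {
  ws <- sample(w, N, replace = TRUE)         # step 1: resample with replacement
  (mean(ws) - theta_hat)/(sd(ws)/sqrt(N))    # step 2: studentized statistic
})                                           # step 3: B replications

# Step 4: symmetrical two-sided test of H0: theta = 0 at the 5% level
alpha  <- 0.05
crit   <- sort(abs(t_star))[ceiling((1 - alpha)*B)]  # bootstrap critical value
t_N    <- (theta_hat - 0)/se_hat
reject <- abs(t_N) > crit
```

Here `crit` plays the role of the \((1-\alpha)(B+1)^{th}\) ordered statistic discussed above; with \(B = 399\) and \(\alpha = 0.05\) it is the 380th value of the ordered \(|t^*|\).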
Let’s make an example with our schools dataset. For now we will ignore the possible clustering of the errors and compare the bootstrapped standard error and studentized t-statistic to the default ones. We start by loading and cleaning the data; then we create a function1 to compute \(R\) bootstrap replications plus the estimates based on the original sample.
dt <- fread("Data/escolas.csv")[, -c("V1")]
# Clean up the data. Never forget it!!
dt[escola == "S/escola", escola := "99"][, escola := as.integer(escola)]
dt <- dt[!is.na(logico)]
head(dt)
## escola aluno logico mulher idade
## 1: 1 89 -0.6110555 1 10.30411
## 2: 1 151 -1.0899623 0 10.64384
## 3: 1 152 -0.1321487 1 11.06027
## 4: 1 235 0.5063938 0 11.26301
## 5: 1 281 1.6238429 1 11.04658
## 6: 1 290 0.1871226 1 10.62466
boot_func <- function(dt, formula, coef, b = 0, R = 399){
  stopifnot(is.data.table(dt))
  # Model estimated from data
  m0 <- feols(as.formula(formula), data = dt)
  beta0 <- coefficients(m0)[coef]
  se0 <- se(m0)[coef]
  t0 <- (beta0 - b)/se0
  names(beta0) <- NULL
  names(se0) <- NULL
  names(t0) <- NULL
  # Bootstrap
  reps_dt <- data.table(replication = seq_len(R))
  reps_dt[, bootstrap := lapply(replication, function(x){
    # Resampling
    idx <- sample(dt[, .N], dt[, .N], replace = TRUE)
    mb <- feols(as.formula(formula), data = dt[idx])
    beta <- coefficients(mb)[coef]
    se <- se(mb)[coef]
    t <- (beta - beta0)/se # studentized t-stat
    names(beta) <- NULL
    names(se) <- NULL
    names(t) <- NULL
    return(list(beta = beta, se = se, t = t))
  })]
  # Insert m0 at first row
  reps_dt <- rbind(
    list(replication = 0,
         bootstrap = list(list(beta = beta0, se = se0, t = t0))),
    reps_dt
  )
  # Unpack the list-column
  reps_dt <- reps_dt[, by = replication, bootstrap[[1]]]
  return(reps_dt)
}
boot_dt <- boot_func(dt, "logico~mulher+idade", "mulher", R = 199)
head(boot_dt)
## replication beta se t
## 1: 0 0.2448560 0.06361732 3.84888958
## 2: 1 0.3169498 0.06266485 1.15046547
## 3: 2 0.1906231 0.06535310 -0.82984413
## 4: 3 0.2749047 0.05999099 0.50088699
## 5: 4 0.2963712 0.06528726 0.78905453
## 6: 5 0.2406282 0.06269209 -0.06743728
The results of our bootstrap are contained in this data.table, including replication zero, which holds the estimates from the original sample. We can visualize the distributions of the coefficients and studentized statistics.
boot_long <- melt(boot_dt, id.vars = "replication", variable.name = "stat")
ggplot(boot_long[replication != 0 & stat != "se"],
aes(value, fill = stat)) +
geom_histogram(bins = 20) +
facet_wrap(~stat, scales = "free") +
geom_vline(data = boot_long[replication == 0 & stat != "se"],
aes(xintercept = value, color = "darkred")) +
labs(subtitle = sprintf("Bootstrap estimations, R = %s", nrow(boot_dt)-1)) +
guides(colour = "none") +
theme_classic()
Next we create two more functions to compute critical values for the statistic and an asymptotically refined confidence interval.
t_boot <- function(t_vec, alpha = 0.05, type = c("two-sided", "one-sided")) {
  type <- type[1]
  if (type == "two-sided") {
    ord <- sort(abs(t_vec))
    uc <- ord[ceiling((1 - alpha)*length(ord))]
    lc <- -uc
  } else {
    ord <- sort(t_vec)
    lc <- ord[floor(alpha*length(ord))]
    uc <- ord[ceiling((1 - alpha)*length(ord))]
  }
  return(c(lower = lc, upper = uc))
}
ci_boot <- function(beta0, se0, t_vec, ci_level = 0.95) {
  t_critical <- t_boot(t_vec, alpha = 1 - ci_level)
  lower <- beta0 + t_critical["lower"]*se0
  upper <- beta0 + t_critical["upper"]*se0
  return(c(lower, upper))
}
t_boot(boot_dt$t)
## lower upper
## -2.053182 2.053182
ci_boot(boot_dt[replication == 0, beta], boot_dt[replication == 0, se],
        boot_dt$t)
## lower upper
## 0.1142381 0.3754739
Compare the results above to the default values we get from regression output.
tidy(feols(logico~mulher+idade, data = dt),
conf.int = TRUE)[c("term", "estimate", "statistic",
"conf.low", "conf.high")]
## # A tibble: 3 x 5
## term estimate statistic conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.24 8.23 2.47 4.01
## 2 mulher 0.245 3.85 0.120 0.370
## 3 idade -0.300 -8.64 -0.367 -0.232
Wild Bootstrap2
Wild Bootstrap is a variant that has been shown to be a good alternative to cluster-robust standard errors when there are not many clusters in the sample. Canay, Santos, and Shaikh (2021) provide conditions under which this method works well when there are few clusters but each cluster has many observations. These conditions include homogeneity-like restrictions on the distribution of covariates. Recall that, for a model:
\[Y_{ic} = \beta X_{ic} + \gamma' Z_{ic} + \epsilon_{ic} \] where \(c \in \{1,2\ldots C\}\) indexes clusters; a Wild BS level \(\alpha\) test for \(H_0: \beta = b\) against \(H_1: \beta \neq b\) consists of:
1. Estimate the model imposing the null \(\beta = b\). Let \(\tilde{\gamma}\) and \(\tilde{\epsilon}_{ic}\) denote the estimated coefficients and residuals from this step.
2. For replications \(s \in \{1,2 \ldots S\}\):
   - For each \(c \in \{1,2\ldots C\}\), draw \(e_c^s = \pm 1\) with probability \(1/2\).
   - Generate an artificial outcome using the formula \(\tilde{y}^s_{ic} = bX_{ic} + \tilde{\gamma}'Z_{ic} + e_c^s \tilde{\epsilon}_{ic}\).
   - Run the unrestricted regression of \(\tilde{y}^s_{ic}\) on \(X_{ic}\) and \(Z_{ic}\). Let \(\hat{\beta}_s\) denote the OLS estimator obtained in this step.
   - Store the \(s\)-th coefficient \(\hat{\beta}_s\) or its studentized version, \(\hat{t}_s = (\hat{\beta}_s-b)/\sqrt{\hat{V}(\hat{\beta}_s)}\), where \(\hat{V}(\hat{\beta}_s)\) is an estimate of the variance of the OLS coefficient (say, a CR standard error).
3. Compute the \(1-\alpha\) quantile of \(\{|\hat{\beta}_s-b|:s \in 1,2\ldots S\}\) (respectively \(\{|\hat{t}_s|:s \in 1,2\ldots S\}\)).
4. Reject the null if the absolute value of the unrestricted regression coefficient minus the value under the null, \(|\hat{\beta}-b|\) (respectively the unrestricted \(|t|\)-stat), in the data strictly exceeds this \((1-\alpha)\) quantile.
If we want to compute a p-value, we just need to find the smallest \(\alpha\) for which the null is rejected. Canay, Santos, and Shaikh (2021) recommend studentization using cluster-robust standard errors, as it works better when there are many clusters.
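That rejection rule maps directly into a p-value: the share of draws (plus the observed one) whose absolute statistic is at least as large as the observed one. A toy sketch with hypothetical studentized statistics, not taken from the schools data:

```r
# Wild-bootstrap-style p-value from studentized statistics (hypothetical numbers)
t_obs <- 2.1                                   # unrestricted t-stat in the data
t_s <- c(-1.8, 0.4, 2.5, -0.9, 1.2, -2.3, 0.7, 1.9, -0.2, 2.8)  # S = 10 draws

# Smallest alpha at which |t_obs| exceeds the (1 - alpha) quantile of the |t_s|
p_value <- (1 + sum(abs(t_s) >= abs(t_obs)))/(length(t_s) + 1)
p_value  # 4/11, about 0.36: the null is not rejected at the usual levels
```

In practice \(S\) is in the hundreds; the tiny vector here only illustrates the counting rule.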
Let’s create a function that implements Wild BS:
wild_bs_func <- function(dt, formula, coef, cluster, b = 0, R = 399){
  stopifnot(is.data.table(dt))
  # No spaces in formula
  formula <- gsub("\\s*", "", formula)
  # Impose the null
  pattern <- sprintf("%s\\+|\\+%s$", coef, coef)
  null_form <- sub(pattern, "", formula)
  off <- sprintf("~%d*%s", b, coef)
  dep_var <- sub("^(.*)~(.*)", "\\1", formula)
  m0 <- feols(as.formula(null_form), data = dt,
              cluster = cluster,
              offset = as.formula(off))
  fitted0 <- fitted(m0)
  resid0 <- resid(m0)
  # Copy dt to create artificial outcomes
  art_dt <- copy(dt)
  # Bootstrap
  reps_dt <- data.table(replication = seq_len(R))
  reps_dt[, bootstrap := lapply(replication, function(x){
    # Resampling
    art_dt[, by = get(cluster),
           e := 1 - 2*rbinom(1, 1, 0.5)]
    art_dt[, c(dep_var) := fitted0 + e*resid0]
    mb <- feols(as.formula(formula), cluster = cluster, data = art_dt)
    betam <- coefficients(mb)[coef]
    sem <- se(mb)[coef]
    tstat <- (betam - b)/sem # studentized t-stat
    names(betam) <- NULL
    names(sem) <- NULL
    names(tstat) <- NULL
    return(list(beta = betam, se = sem, t = tstat))
  })]
  # Insert data estimation at first row
  model <- feols(as.formula(formula), cluster = cluster, data = dt)
  betam <- coefficients(model)[coef]
  sem <- se(model)[coef]
  tstat <- (betam - b)/sem
  names(betam) <- NULL
  names(sem) <- NULL
  names(tstat) <- NULL
  reps_dt <- rbind(
    list(replication = 0,
         bootstrap = list(list(beta = betam, se = sem, t = tstat))),
    reps_dt
  )
  # Unpack the list-column
  reps_dt <- reps_dt[, by = replication, bootstrap[[1]]]
  return(reps_dt)
}
wild_dt <- wild_bs_func(dt, "logico~mulher+idade", "mulher", "escola")
head(wild_dt)
## replication beta se t
## 1: 0 0.244856027 0.06322075 3.87303233
## 2: 1 0.089661779 0.08283529 1.08241034
## 3: 2 0.001722727 0.08604382 0.02002151
## 4: 3 0.029281719 0.08242660 0.35524599
## 5: 4 -0.133281627 0.08081290 -1.64926185
## 6: 5 0.122836329 0.07974120 1.54043741
wild_long <- melt(wild_dt, id.vars = "replication", variable.name = "stat")
ggplot(wild_long[replication != 0 & stat != "se"],
aes(value, fill = stat)) +
geom_histogram(bins = 20) +
facet_wrap(~stat, scales = "free") +
geom_vline(data = wild_long[replication == 0 & stat != "se"],
aes(xintercept = value, color = "darkred")) +
labs(subtitle = sprintf("Wild Bootstrap estimations, R = %s", nrow(wild_dt)-1)) +
guides(colour = "none") +
theme_classic()
t_boot(wild_dt$t)
## lower upper
## -2.046417 2.046417
ci_boot(wild_dt[replication == 0, beta], wild_dt[replication == 0, se],
        wild_dt$t)
## lower upper
## 0.1154800 0.3742321
tidy(feols(logico~mulher+idade, cluster = ~escola, data = dt),
conf.int = TRUE)[c("term", "estimate", "statistic",
"conf.low", "conf.high")]
## # A tibble: 3 x 5
## term estimate statistic conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.24 7.12 2.35 4.13
## 2 mulher 0.245 3.87 0.121 0.369
## 3 idade -0.300 -8.02 -0.373 -0.226
Randomization Inference
This method derives from Fisher’s exact p-values (FEP). Given data from a completely randomized experiment, our goal is to assess the sharp null hypothesis of no effect of the active treatment versus control, that is, the null hypothesis under which, for each unit in the experiment, both values of the potential outcomes are identical. The most general case that fits into the FEP framework is the null hypothesis:
\[ \begin{equation} H_0: Y_i(1) = Y_i(0) + \delta_i \quad i = 1,2\ldots N \end{equation} \]
for some set of pre-specified treatment effects \(\delta_i\). Let \(\mathbf{W}\in\{0,1\}^N\) be the stochastic assignment vector; then, as usual, \(\mathbf{Y^{obs}=Y(1)*W+Y(0)*(1-W)}\), and let \(\mathbf{T(Y^{obs}, W)}\) be a test statistic. Under the sharp null, the distribution of any statistic \(\mathbf{T(Y^{obs}, W)}\) is completely known, since all potential outcomes are revealed. The test statistic is stochastic solely through the stochastic nature of the assignment vector, thus we can infer exact p-values through (randomized) treatment assignment simulation:
- Compute and store \(T^{obs} = T(\mathbf{Y}^{obs}, \mathbf{W}^{obs})\)
- For \(p \in \{1,2\ldots P\}\):
- Draw \(\mathbf{W}^p\) from the random, but possibly clustered, treatment assignment mechanism
- Compute and store \(T^p = T(\mathbf{Y}^{obs}, \mathbf{W}^p)\)
- The p-value is the proportion of simulated statistics that exceeds the observed one, \(\hat p=1/(P+1)\left(1+\sum_{k=1}^P\mathbb{1}\{T^k\geq T^{obs}\}\right)\)
If we were to draw all possible combinations of assignment vectors this p-value would be exact. An \(\alpha\)-level test can be conducted by computing the test statistic in the observed data, and rejecting the null if this value exceeds the \(1-\alpha\) quantile of the simulated distribution.
Which statistic should we use? We also care about power, so we should use statistics for which high values indicate evidence against the null. We want the test statistic to have the ability to distinguish between the null hypothesis and an interesting alternative hypothesis. For example, the absolute value of the difference in means between treated and control units; the studentized version of this statistic; coefficients of a treatment effect regression, etc.
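For a small experiment we can enumerate every possible assignment, making the p-value exact rather than simulated. A base R toy example with 6 units, 3 of them treated, using the absolute difference in means (hypothetical outcomes):

```r
# Exact FEP by full enumeration: choose(6, 3) = 20 possible assignments
y     <- c(3.1, 2.4, 4.0, 1.7, 2.2, 1.5)   # observed outcomes
w_obs <- c(1, 1, 1, 0, 0, 0)               # observed assignment

stat  <- function(w) abs(mean(y[w == 1]) - mean(y[w == 0]))
t_obs <- stat(w_obs)

# Every assignment with exactly 3 treated units (6 x 20 matrix of 0/1 columns)
assigns <- combn(6, 3, function(idx) replace(rep(0, 6), idx, 1))
t_all   <- apply(assigns, 2, stat)

# Exact p-value: share of assignments with a statistic at least as extreme
p_exact <- mean(t_all >= t_obs)
```

With moderate \(N\) the number of assignments explodes, which is why the function below samples assignments instead of enumerating them.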
Let’s show an example using the absolute value of the difference in means as the statistic. First we fake a school-clustered treatment effect added to the schools dataset.
dt2 <- copy(dt)
dt2[, by = escola,
    treat := rbinom(1, 1, 0.5)]
dt2[, logico := 0.5*treat + logico]
Now dt2 holds the treatment assignment in the column treat, and the logical exam scores have been increased by 0.5 points for each treated unit. We will now create a function to perform the randomization inference3, returning a vector with the significance level, the statistic’s critical value, the observed statistic, and the p-value associated with it.
permutation <- function(dt, dep_var, treatment, cluster, nper = 999, alpha = 0.05){
  stopifnot(is.data.table(dt))
  tobs <- dt[, mean(get(dep_var)), by = get(treatment)]
  tobs <- abs(tobs[get == 1, V1] - tobs[get == 0, V1])
  # Copy dt to create artificial outcomes
  art_dt <- copy(dt)
  # Permutations
  perm_dt <- data.table(permutation = seq_len(nper))
  perm_dt[, statistic := sapply(permutation, function(x){
    # Random treatment assignment
    art_dt[, by = get(cluster), c(treatment) := rbinom(1, 1, 0.5)]
    tp <- art_dt[, mean(get(dep_var)), by = get(treatment)]
    # Return the statistic
    abs(tp[get == 1, V1] - tp[get == 0, V1])
  })]
  # Include tobs as the first statistic
  perm_dt <- rbind(
    list(
      permutation = 0,
      statistic = tobs),
    perm_dt
  )
  pval <- perm_dt[, mean(statistic >= tobs)]
  tc <- perm_dt[order(statistic)][(1 - alpha)*(nper + 1), statistic]
  return(c(alpha = alpha, t_critical = tc, t_obs = tobs, p_value = pval))
}
rand_inf <- permutation(dt2, "logico", "treat", "escola")
rand_inf
## alpha t_critical t_obs p_value
## 0.0500000 0.4597388 0.7140106 0.0010000
A challenge for you!
The permutation function above does not account for possible repetition of treatment assignments. Your challenge is to create a function that performs randomization inference where all permutations are unique. As a hint, you may want to take a look at the permutations function from the arrangements package.
permutation2 <- function(dt, dep_var, treatment, cluster, nper = 999, alpha = 0.05){
  stopifnot(is.data.table(dt))
  tobs <- dt[, mean(get(dep_var)), by = get(treatment)]
  tobs <- abs(tobs[get == 1, V1] - tobs[get == 0, V1])
  # Permutation by cluster, without repetition
  treat_names <- paste0("treat", seq_len(nper))
  cl <- dt[, by = get(cluster), .(treat = first(get(treatment)))]
  cl[, (treat_names) := arrangements::permutations(
    treat, nsample = nper,
    layout = "list")
  ]
  # dt_join holds the dependent var and ALL permuted treatment assignments
  dt_join <- dt[cl[, -c("treat")], on = "escola==get"]

  perm_dt <- data.table(permutation = seq_len(nper))
  perm_dt[, statistic := sapply(treat_names, function(x){
    tp <- dt_join[, mean(get(dep_var)), by = get(x)]
    abs(tp[get == 1, V1] - tp[get == 0, V1])
  })]
  # Include tobs as the first statistic
  perm_dt <- rbind(
    list(
      permutation = 0,
      statistic = tobs),
    perm_dt
  )
  pval <- perm_dt[, mean(statistic >= tobs)]
  tc <- perm_dt[order(statistic)][(1 - alpha)*(nper + 1), statistic]
  return(c(alpha = alpha, t_critical = tc, t_obs = tobs, p_value = pval))
}
rand_inf2 <- permutation2(dt2, "logico", "treat", "escola")
rand_inf2
## alpha t_critical t_obs p_value
## 0.0500000 0.4611940 0.7140106 0.0010000
References
I am creating a function for demonstration purposes; you could use the boot package to create the bootstrap statistics and confidence intervals.↩︎

The algorithm and the implementation of the wild bootstrap presented here were largely based on the work of previous teaching assistant Luis Alvarez. His code in base R can be accessed at GitHub.↩︎

Note we are eventually repeating some treatment assignments in our “permutations.” According to Imbens and Rubin (2015), for an approximation of the p-value it does not matter whether we sample with or without replacement, though the latter provides more accurate p-values when \(P\) is low.↩︎