---
title: "Inference: Clustering"
subtitle: "EC 607, Set 11"
author: "Edward Rubin"
date: ""
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: inverse, middle
```{r, setup, include = F}
# devtools::install_github("dill/emoGG")
library(pacman)
p_load(
broom, tidyverse,
latex2exp, ggplot2, ggthemes, ggforce, viridis, extrafont, gridExtra,
kableExtra, snakecase, janitor,
data.table, dplyr, fixest, estimatr,
lubridate, knitr, parallel, furrr,
lfe,
huxtable,
here, magrittr
)
# Define pink color
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Dark slate grey: #314f4f
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
# A few extras
xaringanExtra::use_xaringan_extra(c('tile_view', 'fit_screen'))
```
# Prologue
---
name: schedule
# Schedule
## Last time
Regression discontinuities
## Today
Inference and clustering
## Upcoming
- Problem set #3 due Sunday?
- Project step 2 coming this week too (with feedback from step 1).
---
layout: true
# Inference
---
class: inverse, middle
---
name: motive
## Motivation
So far, we've focused on carefully .hi[obtaining causal estimates] of the effect of some treatment $\text{D}_{i}$ on our outcome $\text{Y}_{i}$.
--
Our discussion of research designs and their requirements/assumptions has centered on .hi[avoiding selection and securing unbiased and/or consistent estimates] for $\tau$.
--
In other words, we've concentrated on .hi[point estimates].
What about .hi-purple[inference]?
---
## Shminference .super[.pink[†]]
.footnote[
.pink[†] What is [*shminference*](https://en.wikipedia.org/wiki/Shm-reduplication)?
]
.qa[Q] Why care about inference?
--
.qa[A] I'll give you two reasons.
--
1. We often want to .hi[test theories/hypotheses]. Point estimates (*i.e.*, $\hat{\beta}$) can't do this alone. Inference finishes the job.
--
2. Other times, we want to .hi[*measure* the effect] of a treatment. Inference helps us think about the .hi[precision] of our estimates.
--
.note[Note:] Similar reasoning applies to bounding forecasts/predictions.
--
If you want answers, then you need to do inference
--
correctly.
---
## What's so complicated?
Angrist and Pischke told us that "correcting" our standard errors for heteroskedasticity may increase them by up to 25%.
What else are we worried about?
---
## What we're worried about
- .hi-slate[Transformations of estimators], *i.e.*, $\mathop{\text{Var}} \left[ f\left(\hat\beta\right) \right] \neq f\!\left(\mathop{\text{Var}} \left[ \hat\beta \right]\right)$
--
- .hi-slate[Dependence/correlation in our disturbance], *i.e.*, $\mathop{\text{Cov}} \left(\varepsilon_i,\,\varepsilon_j \right)\neq 0$
- .slate[Autocorrelation] $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$
- .slate[Correlated shocks within groups] $\varepsilon_i = \nu_{g(i)} + \eta_i$
--
- .hi-slate[Finite-sample properties] *vs.* asymptotic properties
--
- .hi-slate[Power] and .hi-slate[minimal detectable effects]
--
- .hi-slate[Multiple-hypothesis testing] and .hi-slate[*p-hacking*]
--
.it[In other words:] We've got a lot to worry/think about.
---
layout: true
# Clustering
---
name: clustering
class: inverse, middle
---
## Setup
Many studies—observational and experimental—have a treatment that is assigned to all/most individuals within a group.
- Classrooms/schools
- Households
- Villages/counties/states
--
Furthermore, we might imagine that individuals within the same group have correlated disturbances. For $i$ and $j$ in group $g$ (with $i \neq j$)
$$
\begin{align}
\mathop{\text{Cov}} \left( \varepsilon_{i},\, \varepsilon_{j} \right) = \mathop{E}\left[ \varepsilon_{i} \varepsilon_{j} \right] = \rho_\varepsilon \sigma^2_\varepsilon
\end{align}
$$
where $\rho_\varepsilon$ gives the within-group correlation of disturbances—what *MHE* calls the .attn[intraclass correlation coefficient].
---
## Setup
In other words, we have a regression
$$
\begin{align}
y_{i} = \beta_0 + \beta_1 x_{g(i)} + \varepsilon_i
\end{align}
$$
where individual $i$ is in group $g$, and $x_{g(i)}$ only varies across groups.
--
For within-group correlation, we can use an additive random-effects model
$$
\begin{align}
\varepsilon_i = \nu_{g(i)} + \eta_i
\end{align}
$$
meaning group members all receive a common shock $\nu_{g(i)}$, and individuals receive independent shocks $\eta_i$.
--
.note[Note] We assume $\eta_i$ is independent of $\eta_j$ $\left( i\neq j \right)$ and $\nu_g$ $\left( \forall g \right)$.
---
## Additive random effects
Based upon this model we've set up
$$
\begin{align}
\varepsilon_i = \nu_{g(i)} + \eta_i
\end{align}
$$
the covariance between individuals $i$ and $j$ (with $i \neq j$) in group $g$ is
$$
\begin{align}
\mathop{\text{Cov}} \left( \varepsilon_i,\, \varepsilon_j \right) &= \mathop{E}\left[ \varepsilon_i \varepsilon_j \right] = \mathop{E}\left[ \left( \nu_g + \eta_i \right) \left( \nu_g + \eta_j \right) \right] = \mathop{E}\left[ \nu_g^2 \right] = \sigma^2_\nu
\\ &= \rho_\varepsilon\sigma^2_\varepsilon
\\ &= \rho_\varepsilon \left( \sigma^2_\nu + \sigma^2_\eta \right)
\end{align}
$$
Thus, we can write the intraclass correlation coefficient as
$$
\begin{align}
\rho_\varepsilon = \dfrac{\sigma^2_\nu}{\sigma^2_\varepsilon} = \dfrac{\sigma^2_\nu}{\sigma^2_\nu + \sigma^2_\eta}
\end{align}
$$
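---
## Additive random effects

We can check this decomposition by simulation. .note[Note:] a minimal sketch; the variance parameters below are illustrative, not taken from any real dataset.

```{r, sim-icc, eval = F}
# Simulate the additive random-effects model: epsilon_i = nu_g(i) + eta_i
set.seed(12345)
n_groups = 1000   # number of groups
n = 30            # individuals per group
sigma_nu = 1      # sd of the shared, group-level shock
sigma_eta = 2     # sd of the individual-level shock
sim_df = tibble(
  g = rep(1:n_groups, each = n),
  nu = rep(rnorm(n_groups, sd = sigma_nu), each = n),
  eta = rnorm(n_groups * n, sd = sigma_eta),
  e = nu + eta
)
# Theoretical intraclass correlation: sigma_nu^2 / (sigma_nu^2 + sigma_eta^2) = 0.2
sigma_nu^2 / (sigma_nu^2 + sigma_eta^2)
# Empirical share of Var(e) accounted for by the shared shock (should be near 0.2)
sim_df %>% summarize(rho = var(nu) / var(e))
```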
---
## What is $\rho_\varepsilon$?
Let's review what we know.
$$
\begin{align}
\varepsilon_i = \nu_{g(i)} + \eta_i \quad\quad \color{#6A5ACD}{\text{and}} \quad\quad \rho_\varepsilon = \dfrac{\sigma^2_\nu}{\sigma^2_\varepsilon} = \dfrac{\sigma^2_\nu}{\sigma^2_\nu + \sigma^2_\eta}
\end{align}
$$
--
One way to think about $\rho_\varepsilon$ is as the .hi[share of the variance of the disturbance] $\varepsilon_i$ .hi[accounted for by the shared disturbance] $\nu_{g(i)}$.
--
How much of the disturbance's variance only varies at the .hi[group] level?
--
As $\nu_{g(i)}$ accounts for more and more of the variation in $\varepsilon_i$, $\rho_\varepsilon\rightarrow 1$.
---
## So...
.qa[Q] Why do we care about $\rho_\varepsilon$?
--
.qa[A] It tells us how wrong our standard errors can be if we treat all observations as independent.
--
Let $\mathop{\text{Var}_o} \left( \hat\beta_1 \right)$ denote the conventional variance formula for the OLS estimator..super[.pink[†]]
.footnote[
.pink[†] which treats all disturbances as independent (and identically distributed).
]
--
Let $\mathop{\text{Var}} \left( \hat\beta_1 \right)$ denote the actual variance of $\hat\beta_1$.
---
name: moulton
## So....
With (.hi-slate[1]) nonstochastic regressors fixed by group *and* (.hi-slate[2]) groups of size $n$
$\quad\dfrac{\mathop{\text{Var}} \left( \hat\beta_1 \right)}{\mathop{\text{Var}_o} \left( \hat\beta_1 \right)} = 1 + (n-1) \rho_\varepsilon\quad$
--
$\implies \quad \dfrac{\mathop{\text{S.E.}} \left( \hat\beta_1 \right)}{\mathop{\text{S.E.}_o} \left( \hat\beta_1 \right)} = \sqrt{1 + (n-1) \rho_\varepsilon}$
--
The term $\sqrt{1 + (n-1) \rho_\varepsilon}$ is called the .attn[Moulton factor].super[.pink[†]].
The .attn[Moulton factor] tells us by what factor standard errors will be wrong if we ignore within-group correlation (conditional on assumptions .hi-slate[1] and .hi-slate[2]).
.footnote[
.pink[†] After [Moulton (1986)](https://www.sciencedirect.com/science/article/pii/0304407686900217). Derivation: *MHE* 323–325.
]
--
.qa[Q] What happens if $\rho_\varepsilon =$ 1?
--
What if you duplicated your dataset?
--
.qa[Q] What happens as $n$ increases?
---
exclude: true
```{r, dup-data, echo = F, cache = TRUE}
# Load data
test_df = Ecdat::Caschool %>% select(
test_score = testscr, ratio = str, income = avginc, enrollment = enrltot
) %>% as_tibble()
# And an id
test_df %<>% mutate(id = 1:nrow(test_df))
# Stack the dataset a bunch of times
dup_df = lapply(X = 1:20, FUN = function(x) mutate(test_df, dup = x)) %>% bind_rows()
# Now run a regression
dup_est = lapply(
X = 1:20,
FUN = function(x) feols(test_score ~ ratio + income, data = dup_df %>% filter(dup <= x))
)
# Grab standard errors
dup_se = lapply(
X = seq_along(dup_est),
FUN = function(x) {
data.frame(
dups = x - 1,
se_iid = dup_est[[x]] %>% summary(vcov = 'iid') %>% coeftable() %>% magrittr::extract(2,2),
se_cl = dup_est[[x]] %>% summary(cluster = 'id') %>% coeftable() %>% magrittr::extract(2,2)
)
}
) %>% bind_rows()
```
---
layout: false
class: clear, middle
The effect of duplicating a dataset on the .pink[OLS standard error]
```{r, dup-plot1, echo = FALSE}
ggplot(
data = dup_se,
aes(x = dups, y = se_iid)
) +
geom_hline(yintercept = 0) +
geom_line(color = red_pink, size = 0.5) +
geom_point(color = red_pink, size = 3.5) +
labs(x = "Number of times dataset is duplicated", y = "Standard error") +
theme_pander(base_size = 22, base_family = "Fira Sans Condensed")
```
---
layout: false
class: clear, middle
Correcting our .orange[standard errors for clustering] (correlation between observations within a cluster).
```{r, dup-plot2, echo = FALSE}
ggplot(
data = dup_se,
aes(x = dups)
) +
geom_hline(yintercept = 0) +
geom_line(aes(y = se_iid), color = red_pink, size = 0.5, alpha = 0.25) +
geom_point(aes(y = se_iid), color = red_pink, size = 3.5, alpha = 0.25) +
geom_line(aes(y = se_cl), color = orange, size = 0.5) +
geom_point(aes(y = se_cl), color = orange, size = 3.5) +
labs(x = "Number of times dataset is duplicated", y = "Standard error") +
theme_pander(base_size = 22, base_family = "Fira Sans Condensed")
```
---
layout: true
# Clustering
---
name: moulton-example
## The Moulton factor
The Moulton factor
$$
\begin{align}
\dfrac{\mathop{\text{S.E.}} \left( \hat\beta_1 \right)}{\mathop{\text{S.E.}_o} \left( \hat\beta_1 \right)} = \sqrt{1 + (n-1) \rho_\varepsilon}
\end{align}
$$
shows that even when $\rho_\varepsilon$ is small, we can have very large standard-error problems.
--
.ex[Ex] An experiment on 400 schools, each with 1,000 students.
--
If $\rho_\varepsilon = 0.01$, the Moulton factor is $\sqrt{1 + (1,000-1)\times0.01}\approx 3.32$.
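---
## The Moulton factor

A back-of-the-envelope check of this example in R (a quick sketch of the formula above):

```{r, moulton-calc, eval = F}
# Moulton factor: sqrt(1 + (n - 1) * rho_e) for groups of size n
moulton = function(n, rho_e) sqrt(1 + (n - 1) * rho_e)
# The example above: 1,000 students per school, rho_e = 0.01
moulton(n = 1000, rho_e = 0.01)
# Even a tiny within-group correlation matters once groups get large
moulton(n = c(10, 100, 1000), rho_e = 0.01)
```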
---
name: moulton-tstat
## Test statistics
.note[Recall] $t_\text{stat} = \dfrac{\hat{\beta}_1}{\mathop{\text{S.E.}} \left( \hat\beta_1 \right)}$.
--
$\therefore \dfrac{t_o}{t}$
--
$= \dfrac{\hat{\beta}_1/\mathop{\text{S.E.}_o} \left( \hat\beta_1 \right)}{\hat{\beta}_1/\mathop{\text{S.E.}} \left( \hat\beta_1 \right)}$
--
$=\dfrac{\mathop{\text{S.E.}} \left( \hat\beta_1 \right)}{\mathop{\text{S.E.}_o} \left( \hat\beta_1 \right)}$
--
$=$ the Moulton factor.
--
.ex[Ex] Thus, in our example of 400 schools with 1,000 students, ignoring within-school correlation of $\rho_\varepsilon=$ 0.01 would leave us with test statistics that are more than 3 times as large as they should be.
--
This is why economics seminars have standard-error police.
---
## Relaxing assumptions
If we allow regressors to vary by individual and groups to differ in size $\left( n_g \right)$,
$$
\begin{align}
\dfrac{\mathop{\text{Var}} \left( \hat\beta_1 \right)}{\mathop{\text{Var}_o} \left( \hat\beta_1 \right)} =
1 + \left[ \dfrac{\mathop{\text{Var}} \left( n_g \right)}{\overline{n}} + \overline{n} - 1 \right]\rho_x \rho_\varepsilon
\end{align}
$$
where $\rho_x$ denotes the intraclass (within-group) correlation of $x_i$..super[.pink[†]]
.footnote[
.pink[†] See *MHE* for mathematical definitions and the derivation.
]
--
.attn[Important] The Moulton factor for this general model depends upon the amount of within-group correlation in $x_i$ *and* $\varepsilon_i$.
--
The special case is also important, as treatment is often assigned at the group level (and thus fixed within groups).
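---
## Relaxing assumptions

A sketch of the general Moulton factor as an R function. The inputs $\rho_x$ and $\rho_\varepsilon$ are taken as given here; see *MHE* for how they are defined and estimated.

```{r, moulton-general, eval = F}
# General Moulton factor: unequal group sizes and within-group variation in x
moulton_general = function(n_g, rho_x, rho_e) {
  n_bar = mean(n_g)
  sqrt(1 + (var(n_g) / n_bar + n_bar - 1) * rho_x * rho_e)
}
# With equal group sizes and x fixed within groups (rho_x = 1),
# we recover the special case: 400 schools of 1,000 students, rho_e = 0.01
moulton_general(n_g = rep(1000, 400), rho_x = 1, rho_e = 0.01)
```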
---
name: cluster_answers
## The answer
.qa[Q] So what do we do now?
--
.qa[A] We've got options (as usual)
1. Parametrically model the random effects
2. Cluster-robust standard error (estimator)
3. Aggregate up to the group (or a similar method)
4. Block (group-based) bootstrap
5. GLS/MLE modeling $y_i$ and $\varepsilon_i$
--
.hi-orange[Most common:] Cluster-robust standard errors
.hi-red[Runner up:] Block bootstrap
.hi-pink[Second runner up:] Group-level analysis
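---
## The answer

A minimal sketch of options 2 and 3 in R with `fixest` (the data below are simulated for illustration).

```{r, cluster-options, eval = F}
# Simulated data: x and the shock nu vary only at the school level
set.seed(101)
student_df = tibble(
  school = rep(1:50, each = 20),
  x = rep(rnorm(50), each = 20),
  e = rep(rnorm(50), each = 20) + rnorm(1000),
  y = 1 + 2 * x + e
)
# Option 2: cluster-robust standard errors, clustering on school
feols(y ~ x, data = student_df, cluster = ~school)
# Option 3: aggregate to the group level, then run OLS on group means
school_df = student_df %>% group_by(school) %>% summarize(y = mean(y), x = mean(x))
feols(y ~ x, data = school_df)
```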
---
name: cluster-robust
## Cluster-robust standard errors
[Liang and Zeger (1986)](https://doi.org/10.1093/biomet/73.1.13) extend White's heteroskedasticity-robust covariance matrix to allow for both clustering and heteroskedasticity..super[.pink[†]]
.footnote[
.pink[†] When people say *clustering*, they typically mean *correlated disturbances within a group.*
]
--
.smaller[
$$
\begin{align}
\hat{\Omega}_\text{cl} = \left( \text{X}^\prime \text{X} \right)^{-1} \left( \sum_g \text{X}_g' \hat{\Psi}_g \text{X}_g \right) \left( \text{X}^\prime \text{X} \right)^{-1}
\end{align}
$$
]
--
.smaller[
$$
\begin{align}
\hat{\Psi}_g = a e_g e_g' = a
\begin{bmatrix}
e_{1g}^2 & e_{1g} e_{2g} & \cdots & e_{1g} e_{n_g g} \\
e_{1g} e_{2g} & e_{2g}^2 & \cdots & e_{2g} e_{n_g g} \\
\vdots & \vdots & \ddots & \vdots \\
e_{1g} e_{n_g g} & e_{2g} e_{n_g g} & \cdots & e_{n_g g}^2
\end{bmatrix}
\end{align}
$$
]
where $e_{g}$ is the vector of OLS residuals for group $g$, $e_{ig}$ is the residual for individual $i$ in group $g$, and $a$ is a degrees-of-freedom adjustment.
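---
## Cluster-robust standard errors

To see the formula in action, here is a sketch that builds $\hat{\Omega}_\text{cl}$ by hand (on simulated data) and compares it to `fixest`. The adjustment $a = \frac{G}{G-1} \cdot \frac{N-1}{N-K}$ used below is a common choice; software defaults can differ.

```{r, crve-hand, eval = F}
# Simulated data with a school-level shock
set.seed(102)
df = tibble(
  g = rep(1:40, each = 25),
  x = rep(rnorm(40), each = 25),
  y = 1 + 2 * x + rep(rnorm(40), each = 25) + rnorm(1000)
)
est = feols(y ~ x, data = df)
X = cbind(1, df$x)            # design matrix: intercept and x
e = resid(est)                # OLS residuals
bread = solve(crossprod(X))   # (X'X)^(-1)
# The 'meat': sum over groups of X_g' e_g e_g' X_g
meat = Reduce(`+`, lapply(split(1:nrow(df), df$g), function(rows) {
  u = crossprod(X[rows, , drop = FALSE], e[rows])   # X_g' e_g
  tcrossprod(u)
}))
# Degrees-of-freedom adjustment and the sandwich
G = 40; N = nrow(df); K = ncol(X)
a = G / (G - 1) * (N - 1) / (N - K)
omega_cl = a * bread %*% meat %*% bread
sqrt(diag(omega_cl))          # by-hand clustered standard errors
summary(est, cluster = ~g)    # fixest's clustered standard errors, for comparison
```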
---
## Cluster-robust standard errors
.note[Derivation] Let $\text{x}_i$ denote observation $i$ (row) from $\text{X}$.
--