class: center, middle, inverse, title-slide

.title[
# Heteroskedasticity, Part II
]
.subtitle[
## EC 421, Set 05
]
.author[
### Edward Rubin
]

---
class: inverse, middle

# Prologue

---
# Schedule

## Last Time

Heteroskedasticity: Issues and tests

## Today

- .hi[First assignment] due today
- Living with heteroskedasticity

## Upcoming

- .hi[Second assignment] released soon (next week?)

---
# EC 421

## Goals

- Develop .hi[intuition] for econometrics.
- Learn how to .hi[*apply*] econometrics—strengths, weaknesses, *etc.*
- Learn .hi[.mono[R]].

--

.mono[R] does the calculations and has already memorized the formulas. I want you to know what the formulas mean, when/why we use them, and when they fail/work.

--

This course has the potential to be one of the most useful/valuable/applicable/marketable classes that you take at UO.

---
layout: true
# Heteroskedasticity
## Review

---
class: inverse, middle

---

Three review questions

**Question 1:** What is the difference between `\(u_i\)` and `\(e_i\)`?

**Question 2:** We spend *a lot* of time discussing `\(u_i^2\)`. Why?

**Question 3:** We also spend *a lot* of time discussing `\(e_i^2\)`. Why?

---

**Question 1:** What is the difference between `\(u_i\)` and `\(e_i\)`?

**Answer 1:**

--

<br> `\(\color{#e64173}{u_i}\)` gives the .hi[population disturbance] for the `\(i\)`.super[th] observation.

--

`\(u_i\)` measures how far the `\(i\)`.super[th] observation is from the .hi[population] line, _i.e._,

$$ u\_i = y\_i - \color{#e64173}{\underbrace{\left(\beta\_0 + \beta\_1 x\_i\right)}\_{\text{Population line}}} $$

--

`\(\color{#6A5ACD}{e_i}\)` gives the .hi-purple[regression residual (error)] for the `\(i\)`.super[th] observation.

--

`\(e_i\)` measures how far the `\(i\)`.super[th] observation is from the .hi-purple[sample regression] line, _i.e._,

$$ e\_i = y\_i - \color{#6A5ACD}{\underbrace{\left(\hat{\beta}\_0 + \hat{\beta}\_1 x\_i\right)}\_{\text{Sample reg. line} = \hat{y}}} = y\_i - \color{#6A5ACD}{\hat{y}\_i} $$

---

**Question 2:** We spend *a lot* of time discussing `\(u_i^2\)`. Why?

**Answer 2:** One of our major assumptions is that our disturbances (the `\(u_i\)`'s) are homoskedastic (they have constant variance), _i.e._, `\(\mathop{\text{Var}} \left( u_i \middle| x_i \right) = \sigma^2\)`. We also assume that the mean of these disturbances is zero, `\(\color{#e64173}{\mathop{\boldsymbol{E}}\left[ u_i \middle| x_i \right] = 0}\)`.

By definition, `\(\mathop{\text{Var}} \left( u_i \middle| x_i \right) = \mathop{\boldsymbol{E}}\Big[ u_i^2 - \underbrace{\color{#e64173}{\mathop{\boldsymbol{E}}\left[ u_i \middle| x_i \right]}^2}_{\color{#e64173}{=0}} \Big| x_i \Big] = \mathop{\boldsymbol{E}}\left[ u_i^2 \middle| x_i \right]\)`

Thus, if we want to learn about the variance of `\(u_i\)`, we can focus on `\(u_i^2\)`.

---

**Question 3:** We also spend *a lot* of time discussing `\(e_i^2\)`. Why?

**Answer 3:** We cannot observe `\(u_i\)` (or `\(u_i^2\)`). But `\(u_i^2\)` tells us about the variance of `\(u_i\)`. We use `\(e_i^2\)` to learn about `\(u_i^2\)` and, consequently, `\(\sigma_i^2\)`.
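---

Because we can't observe the `\(u_i\)` in real data, a simulation helps fix ideas. Below is a minimal sketch (all names hypothetical): generate data from a known population line, estimate the sample regression line, and compare the residuals `\(e_i\)` to the true disturbances `\(u_i\)`.

``` r
# Load the tidyverse and set a seed for reproducibility
library(tidyverse)
set.seed(421)
# Simulate 100 observations from a known population line (b0 = 1, b1 = 2)
n <- 100
sim_df <- tibble(
  x = runif(n, 0, 10),
  u = rnorm(n, 0, 2),   # the disturbances: unobservable in real data
  y = 1 + 2 * x + u
)
# Estimate the sample regression line and grab the residuals
sim_reg <- lm(y ~ x, data = sim_df)
sim_df <- mutate(sim_df, e = residuals(sim_reg))
# The residuals e_i closely track the disturbances u_i
cor(sim_df$u, sim_df$e)
```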
---
layout: false
# Heteroskedasticity
## Review: Current assumptions

--

1. Our sample (the `\(x_k\)`'s and `\(y_i\)`) was .hi[randomly drawn] from the population.

--

2. `\(y\)` is a .hi[linear function] of the `\(\beta_k\)`'s and `\(u_i\)`.

--

3. There is no perfect .hi[multicollinearity] in our sample.

--

4. The explanatory variables are .hi[exogenous]: `\(\mathop{\boldsymbol{E}}\left[ u \middle| X \right] = 0 \left(\implies \mathop{\boldsymbol{E}}\left[ u \right] = 0\right)\)`.

--

5. The disturbances have .hi[constant variance] `\(\sigma^2\)` and .hi[zero covariance], _i.e._,
  - `\(\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X_i \right] = \mathop{\text{Var}} \left( u_i \middle| X_i \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2\)`
  - `\(\mathop{\text{Cov}} \left( u_i, \, u_j \middle| X_i,\,X_j \right) = \mathop{\boldsymbol{E}}\left[ u_i u_j \middle| X_i,\,X_j \right] = 0\)` for `\(i\neq j\)`

--

6. The disturbances come from a .hi[Normal] distribution, _i.e._, `\(u_i \overset{\text{iid}}{\sim} \mathop{\text{N}}\left( 0, \sigma^2 \right)\)`.

---
layout: true
# Heteroskedasticity
## Review

---

Today we're focusing on assumption \#5:

> 5\. The disturbances have .hi[constant variance] `\(\sigma^2\)` and .hi[zero covariance], _i.e._,
> - `\(\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X_i \right] = \mathop{\text{Var}} \left( u_i \middle| X_i \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2\)`
> - `\(\mathop{\text{Cov}} \left( u_i, \, u_j \middle| X_i,\,X_j \right) = \mathop{\boldsymbol{E}}\left[ u_i u_j \middle| X_i,\,X_j \right] = 0\)` for `\(i\neq j\)`

--

Specifically, we will focus on the assumption of .hi[constant variance] (also known as *homoskedasticity*).

--

**Violation of this assumption:** Our disturbances have different variances.

--

.hi[Heteroskedasticity:] `\(\mathop{\text{Var}} \left( u_i \right) = \sigma^2_i\)` and `\(\sigma^2_i \neq \sigma^2_j\)` for some `\(i\neq j\)`.

---

Classic example of heteroskedasticity: The funnel

Variance of `\(u\)` increases with `\(x\)`

<img src="slides_files/figure-html/het ex1-1.svg" style="display: block; margin: auto;" />

---

Another example of heteroskedasticity: (double funnel?)

Variance of `\(u\)` increasing at the extremes of `\(x\)`

<img src="slides_files/figure-html/het ex2 -1.svg" style="display: block; margin: auto;" />

---

Another example of heteroskedasticity: Differing variances of `\(u\)` by group

<img src="slides_files/figure-html/het ex3 -1.svg" style="display: block; margin: auto;" />

---

.hi[Heteroskedasticity] is present when the variance of `\(u\)` changes with any combination of our explanatory variables `\(x_1\)` through `\(x_k\)`.

---
layout: false
# Testing for heteroskedasticity

We have some tests that may help us detect heteroskedasticity.

- Goldfeld-Quandt
- White
- (There are others, *e.g.*, Breusch-Pagan)

--

What do we do if we detect it?

---
layout: true
# Living with heteroskedasticity

---
class: inverse, middle

---

In the presence of heteroskedasticity, OLS is

- still .hi[unbiased]
- .hi[no longer the most efficient] unbiased linear estimator

On average, we get the right answer but with more noise (less precision). <br> *Also:* Our standard errors are biased.

--

**Options:**

1. Check regression .hi[specification].
2. Find a new, more efficient .hi[unbiased estimator] for the `\(\beta_j\)`'s.
3. Live with OLS's inefficiency; find a .hi[new variance estimator].
  - Standard errors
  - Confidence intervals
  - Hypothesis tests

---
layout: true
# Living with heteroskedasticity
## Misspecification

---

As we've discussed, the specification.pink[<sup>†</sup>] of your regression model matters a lot for the unbiasedness and efficiency of your estimator.

**Response \#1:** Ensure your specification doesn't cause heteroskedasticity.

.footnote[.pink[†] *Specification:* Functional form and included variables.]
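---

Before the algebra, a simulated illustration may help. A minimal sketch (all names hypothetical): when we omit a squared term, it gets absorbed into the error term, so the residuals vary systematically with `\(x\)`.

``` r
# Load the tidyverse and set a seed
library(tidyverse)
set.seed(421)
# Simulate a quadratic population relationship with homoskedastic u
n <- 200
mis_df <- tibble(
  x = runif(n, 0, 5),
  y = 1 + 2 * x + 3 * x^2 + rnorm(n, 0, 1)   # Var(u|x) is constant
)
# Misspecified model: omit the squared term, so w = u + 3x^2
mis_reg <- lm(y ~ x, data = mis_df)
# The residuals now change systematically with x (heteroskedasticity)
plot(x = mis_df$x, y = residuals(mis_reg), xlab = "x", ylab = "Residual")
```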
---

*Example:* Let the population relationship be

$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + u_i $$

with `\(\mathop{\boldsymbol{E}}\left[ u_i \middle| x_i \right] = 0\)` and `\(\mathop{\text{Var}} \left( u_i \middle| x_i \right) = \sigma^2\)`.

However, we omit `\(x^2\)` and estimate

$$ y_i = \gamma_0 + \gamma_1 x_i + w_i $$

Then

$$ w_i = u_i + \beta_2 x_i^2 \implies \mathop{\text{Var}} \left( w_i \right) = f(x_i) $$

_I.e._, the variance of `\(w_i\)` changes systematically with `\(x_i\)` (heteroskedasticity).

---

.pink[Truth:] `\(\color{#e64173}{\log\left(y_i\right) = \beta_0 + \beta_1 x_i + u_i}\)`

.slate[**Misspecification:**] `\(\color{#314f4f}{y_i = \beta_0 + \beta_1 x_i + v_i}\)`

<img src="slides_files/figure-html/spec plot1-1.svg" style="display: block; margin: auto;" />

---

.pink[**Truth:**] `\(\color{#e64173}{\log\left(y_i\right) = \beta_0 + \beta_1 x_i + u_i}\)`

.slate[Misspecification:] `\(\color{#314f4f}{y_i = \beta_0 + \beta_1 x_i + v_i}\)`

<img src="slides_files/figure-html/spec plot2-1.svg" style="display: block; margin: auto;" />

---

More generally:

**Misspecification problem:** Incorrect specification of the regression model can cause heteroskedasticity (among other problems).

--

**Solution:** 💡 Get it right (_e.g._, don't omit `\(x^2\)`).

--

**New problems:**

- We often don't know the *right* specification.
- We'd like a more formal process for addressing heteroskedasticity.

--

**Conclusion:** Specification often will not "solve" heteroskedasticity. <br> However, correctly specifying your model is still really important.

---
layout: true
# Living with heteroskedasticity
## Weighted least squares

---

Weighted least squares (WLS) presents another approach.

**Response \#2:** Increase efficiency by weighting our observations.

--

Let the true population relationship be

$$
`\begin{align}
  y_i = \beta_0 + \beta_1 x_{i} + u_i \tag{1}
\end{align}`
$$

with `\(u_i \sim \mathop{N} \left( 0,\, \sigma_i^2 \right)\)`.

--

Now transform `\((1)\)` by dividing each observation's data by `\(\sigma_i\)`, _i.e._,

$$
`\begin{align}
  \dfrac{y_i}{\sigma_i} &= \beta_0 \dfrac{1}{\sigma_i} + \beta_1 \dfrac{x_{i}}{\sigma_i} + \dfrac{u_i}{\sigma_i} \tag{2}
\end{align}`
$$

---

$$
`\begin{align}
  y_i &= \beta_0 + \beta_1 x_{i} + u_i \tag{1} \\[1em]
  \dfrac{y_i}{\sigma_i} &= \beta_0 \dfrac{1}{\sigma_i} + \beta_1 \dfrac{x_{i}}{\sigma_i} + \dfrac{u_i}{\sigma_i} \tag{2}
\end{align}`
$$

Whereas `\((1)\)` is heteroskedastic,

--

`\(\color{#e64173}{(2)}\)` .hi[is homoskedastic]. ∴ OLS is efficient and unbiased for estimating the `\(\beta_k\)` in `\((2)\)`!

--

Why is `\((2)\)` homoskedastic?

--

`\(\mathop{\text{Var}} \left( \dfrac{u_i}{\sigma_i} \middle| x_i \right) =\)`

--

`\(\dfrac{1}{\sigma_i^2} \mathop{\text{Var}} \left( u_i \middle| x_i \right) =\)`

--

`\(\dfrac{1}{\sigma_i^2} \sigma_i^2 =\)`

--

`\(1\)`

---

WLS is great, but we need to know `\(\sigma_i^2\)`, which we generally do not.

We can *slightly* relax this requirement—instead requiring

1. `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma_i^2 = \sigma^2 h(x_i)\)`
2. We know `\(h(x)\)`.

--

As before, we transform our heteroskedastic model into a homoskedastic model. This time we divide each observation's data<sup>.pink[†]</sup> by `\(\sqrt{h(x_i)}\)`.

.footnote[
.pink[†] Divide *all* of the data by `\(\sqrt{h(x_i)}\)`, including the intercept.
]
---

$$
`\begin{align}
  y_i &= \beta_0 + \beta_1 x_{i} + u_i \tag{1} \\[1em]
  \dfrac{y_i}{\sqrt{h(x_i)}} &= \beta_0 \dfrac{1}{\sqrt{h(x_i)}} + \beta_1 \dfrac{x_{i}}{\sqrt{h(x_i)}} + \dfrac{u_i}{\sqrt{h(x_i)}} \tag{2}
\end{align}`
$$

with `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma^2 h(x_i)\)`.

--

Now let's check that `\((2)\)` is indeed homoskedastic.

`\(\mathop{\text{Var}} \left( \dfrac{u_i}{\sqrt{h(x_i)}} \middle| x_i \right) =\)`

--

`\(\dfrac{1}{h(x_i)} \mathop{\text{Var}} \left( u_i \middle| x_i \right) =\)`

--

`\(\dfrac{1}{h(x_i)} \sigma^2 h(x_i) =\)`

--

`\(\color{#e64173}{\sigma^2}\)` .hi[Homoskedasticity!]

---

.hi[Weighted least squares] (WLS) estimators are a special class of .hi[generalized least squares] (GLS) estimators focused on heteroskedasticity.

--

$$ y\_i = \beta\_0 + \beta\_1 x\_{1i} + u\_i \quad \color{#e64173}{\text{ vs. }} \quad \dfrac{y\_i}{\sigma\_i} = \beta\_0 \dfrac{1}{\sigma\_i} + \beta\_1 \dfrac{x\_{1i}}{\sigma\_i} + \dfrac{u\_i}{\sigma\_i} $$

*Notes:*

1. WLS **transforms** a heteroskedastic model into a homoskedastic model.
2. **Weighting:** WLS downweights observations with higher variance `\(u_i\)`'s.
3. **Big requirement:** WLS requires that we *know* `\(\sigma_i^2\)` for each observation.
4. WLS is generally **infeasible**. *Feasible* GLS (FGLS) offers a solution.
5. Under its assumptions: WLS is the **best linear unbiased estimator**.

---
layout: true
# Living with heteroskedasticity
## Heteroskedasticity-robust standard errors

---

**Response \#3:**

- Ignore OLS's inefficiency (in the presence of heteroskedasticity).
- Focus on .hi[unbiased estimates for our standard errors].
- In the process: Correct inference.

--

**Q:** What is a standard error?

--

<br>**A:** The .hi[standard deviation of an estimator's distribution].

--

Estimators (like `\(\hat{\beta}_1\)`) are random variables, so they have distributions. Standard errors give us a sense of how much variability is in our estimator.

---

*Recall*: We can write the OLS estimator for `\(\beta_1\)` as

$$ \hat{\beta}\_1 = \beta\_1 + \dfrac{\sum_i \left( x\_i - \overline{x} \right) u\_i}{\sum\_i \left( x\_i - \overline{x} \right)^2} = \beta\_1 + \dfrac{\sum_i \left( x\_i - \overline{x} \right) u\_i}{\text{SST}\_x} \tag{3} $$

--

Let `\(\mathop{\text{Var}} \left( u_i \middle| x_i \right) = \sigma_i^2\)`.

--

We can use `\((3)\)` to write the variance of `\(\hat{\beta}_1\)`, _i.e._,

$$ \mathop{\text{Var}} \left( \hat{\beta}_1 \middle| x_i \right) = \dfrac{\sum_i \left( x_i - \overline{x} \right)^2 \sigma_i^2}{\text{SST}_x^2} \tag{4} $$

---

If we want unbiased estimates for our standard errors, we need an unbiased estimate for

$$ \dfrac{\sum_i \left( x_i - \overline{x} \right)^2 \sigma_i^2}{\text{SST}_x^2} $$

Our old friend Hal White provided such an estimator.pink[<sup>†</sup>]:

$$ \widehat{\mathop{\text{Var}}} \left( \hat{\beta}_1 \right) = \dfrac{\sum_i \left( x_i - \overline{x} \right)^2 e_i^2}{\text{SST}_x^2} $$

where the `\(e_i\)` come from the OLS regression of interest.

.footnote[
.pink[†] This specific equation is for simple linear regression.
]
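---

This estimator is just arithmetic on things we already have. A minimal sketch (simulated data; all names hypothetical) that computes White's het.-robust standard error for `\(\hat{\beta}_1\)` by hand:

``` r
# Load the tidyverse and set a seed
library(tidyverse)
set.seed(421)
# Simulate data with heteroskedastic disturbances: sd grows with x
n <- 100
white_df <- tibble(
  x = runif(n, 1, 10),
  y = 1 + 2 * x + rnorm(n, 0, sd = x)
)
# Run the OLS regression of interest and grab its residuals
white_reg <- lm(y ~ x, data = white_df)
e <- residuals(white_reg)
# White's estimator: Sum((x - xbar)^2 * e^2) / SST_x^2
sst_x <- sum((white_df$x - mean(white_df$x))^2)
var_b1 <- sum((white_df$x - mean(white_df$x))^2 * e^2) / sst_x^2
# The het.-robust standard error for the estimated slope
sqrt(var_b1)
```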
---

Our heteroskedasticity-robust estimators for the standard error of `\(\hat{\beta}_j\)`:

.hi[Case 1] Simple linear regression, `\(y_i = \beta_0 + \beta_1 x_i + u_i\)`

$$ \widehat{\mathop{\text{Var}}} \left( \hat{\beta}_1 \right) = \dfrac{\sum_i \left( x_i - \overline{x} \right)^2 e_i^2}{\text{SST}_x^2} $$

.hi[Case 2] Multiple (linear) regression, `\(y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + u_i\)`

$$ \widehat{\mathop{\text{Var}}} \left( \hat{\beta}\_j \right) = \dfrac{\sum\_i \hat{r}\_{ij}^2 e\_i^2}{\text{SST}\_{x\_j}^2} $$

where `\(\hat{r}_{ij}\)` denotes the `\(i\)`.super[th] residual from regressing `\(x_j\)` on all other explanatory variables.

---

With these standard errors, we can return to correct statistical inference.

_E.g._, we can update our previous `\(t\)` statistic formula with our new heteroskedasticity-robust standard errors.

$$ t = \dfrac{\text{Estimate} - \text{Hypothesized value}}{\text{Standard error}} $$

---

**Notes**

- We are still using .hi[OLS estimates for] `\(\color{#e64173}{\beta_j}\)`
- Our het.-robust standard errors use a .hi[different estimator].
- Homoskedasticity
  - Plain OLS variance estimator is more efficient.
  - Het.-robust is still unbiased.
- Heteroskedasticity
  - Plain OLS variance estimator is biased.
  - Het.-robust variance estimator is unbiased.

---

These standard errors go by many names

- Heteroskedasticity-robust standard errors
- Het.-robust standard errors
- White standard errors
- Eicker-White standard errors
- Huber standard errors
- Eicker-Huber-White standard errors
- (some other combination of Eicker, Huber, and White)

**Do not say:** "Robust standard errors". The problem: "robust" to what?

---
layout: false
class: inverse, middle
# Living with heteroskedasticity
## Examples

---
layout: true
# Living with heteroskedasticity

---

## Examples

Back to our test-scores dataset…

``` r
# Load packages
library(pacman)
p_load(tidyverse, Ecdat)
# Select and rename desired variables; assign to new dataset; format as tibble
test_df <- Caschool %>% select(
  test_score = testscr,
  ratio = str,
  income = avginc,
  enrollment = enrltot
) %>% as_tibble()
# View first 2 rows of the dataset
head(test_df, 2)
```

```
#> # A tibble: 2 × 4
#>   test_score ratio income enrollment
#>        <dbl> <dbl>  <dbl>      <int>
#> 1       691.  17.9  22.7         195
#> 2       661.  21.5   9.82        240
```

---
layout: true
# Living with heteroskedasticity
## Example: Model specification

---

We found significant evidence of heteroskedasticity. Let's check if it was due to misspecifying our model.

---

Model.sub[1]: `\(\text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`
<br>`lm(test_score ~ ratio + income, data = test_df)`

<img src="slides_files/figure-html/ex spec1-1.svg" style="display: block; margin: auto;" />

---

Model.sub[2]: `\(\log\left(\text{Score}_i\right) = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`
<br>`lm(log(test_score) ~ ratio + income, data = test_df)`

<img src="slides_files/figure-html/ex spec2-1.svg" style="display: block; margin: auto;" />

---

Model.sub[3]: `\(\log\left(\text{Score}_i\right) = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \log\left(\text{Income}_i\right) + u_i\)`
<br>`lm(log(test_score) ~ ratio + log(income), data = test_df)`

<img src="slides_files/figure-html/ex spec3-1.svg" style="display: block; margin: auto;" />
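---

Before walking through the numbers, here is a minimal sketch of the White-test mechanics in .mono[R] (assuming `test_df` from above); the next slide walks through the results:

``` r
# Fit Model 3 and grab its residuals
model3 <- lm(log(test_score) ~ ratio + log(income), data = test_df)
test_df$e <- residuals(model3)
# White test: regress e^2 on levels, squares, and the interaction
white_lm <- lm(
  I(e^2) ~ ratio + log(income) + I(ratio^2) + I(log(income)^2) +
    I(ratio * log(income)),
  data = test_df
)
# LM statistic: n times the R-squared of the White-test regression
lm_stat <- nrow(test_df) * summary(white_lm)$r.squared
# Under H0 (homoskedasticity), LM is distributed Chi-squared(5)
pchisq(lm_stat, df = 5, lower.tail = FALSE)
```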
---

Let's test this new specification with the White test for heteroskedasticity.

.center[
Model.sub[3]: `\(\log\left(\text{Score}_i\right) = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \log\left(\text{Income}_i\right) + u_i\)`
]

--

The regression for the White test

--

$$
`\begin{align}
  e_i^2 = &\alpha_0 + \alpha_1 \text{Ratio}_i + \alpha_2 \log\left(\text{Income}_i\right) + \alpha_3 \text{Ratio}_i^2 + \alpha_4 \left(\log\left(\text{Income}_i\right)\right)^2 \\
  &+ \alpha_5 \left(\text{Ratio}_i\times\log\left(\text{Income}_i\right)\right) + v_i
\end{align}`
$$

--

yields `\(R_e^2\approx0.029\)`

--

and a test statistic of

--

`\(\widehat{\text{LM}} = n\times R_e^2 \approx 12.2\)`.

--

Under H.sub[0], `\(\text{LM}\)` is distributed as

--

`\(\chi_5^2\)`

--

`\(\implies\)` *p*-value `\(\approx\)` 0.033.

--

∴

--

.hi[Reject H.sub[0].]

--

.hi[Conclusion:]

--

There is statistically significant evidence of heteroskedasticity at the five-percent level.

---

Okay, we tried adjusting our specification, but there is still evidence of heteroskedasticity.

**Next:** In general, you will turn to heteroskedasticity-robust standard errors.

- OLS is still unbiased for the .hi[coefficients] (the `\(\beta_j\)`'s)
- Heteroskedasticity-robust standard errors are unbiased for the .hi[standard errors] of the `\(\hat{\beta}_j\)`'s, _i.e._, `\(\sqrt{\mathop{\text{Var}} \left( \hat{\beta}_j \right)}\)`.

---
layout: true
# Living with heteroskedasticity
## Example: Het.-robust standard errors

---

Let's return to our model

$$ \text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i $$

We can use the `fixest` package in .mono[R] to calculate het.-robust standard errors.

---

$$ \text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i $$

1\. Run the regression with `feols()` (instead of `lm()`)

``` r
# Load 'fixest' package
p_load(fixest)
# Regress score on ratio and income
test_reg <- feols(test_score ~ ratio + income, data = test_df)
```

--

*Notice* that `feols()` uses the same syntax as `lm()` for this regression.

---

$$ \text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i $$

2\. Estimate het.-robust standard errors with the `vcov = 'hetero'` option in `summary()`

``` r
# Het.-robust standard errors via vcov = 'hetero'
summary(test_reg, vcov = 'hetero')
```

```
#>               Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) 638.729155   7.301234 87.48235  < 2.2e-16 ***
#> ratio        -0.648740   0.353340 -1.83602   0.067066 .  
#> income        1.839112   0.114733 16.02949  < 2.2e-16 ***
```

---

Coefficients and **heteroskedasticity-robust standard errors**:

``` r
summary(test_reg, vcov = 'hetero')
```

```
#>               Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) 638.729155   7.301234 87.48235  < 2.2e-16 ***
#> ratio        -0.648740   0.353340 -1.83602   0.067066 .  
#> income        1.839112   0.114733 16.02949  < 2.2e-16 ***
```

Coefficients and **plain OLS standard errors** (assumes homoskedasticity):

``` r
summary(test_reg, vcov = 'iid')
```

```
#>               Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) 638.729155   7.449077 85.74608  < 2.2e-16 ***
#> ratio        -0.648740   0.354405 -1.83051   0.067888 .  
#> income        1.839112   0.092787 19.82083  < 2.2e-16 ***
```
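---

An aside: if you are working with `lm()` rather than `fixest`, the `sandwich` and `lmtest` packages offer similar het.-robust standard errors. A minimal sketch (assuming both packages are installed; `'HC1'` is one common flavor of the White estimator):

``` r
# Load packages for robust covariance matrices and robust testing
p_load(sandwich, lmtest)
# The same regression with lm()
test_lm <- lm(test_score ~ ratio + income, data = test_df)
# Het.-robust (HC1) standard errors, similar in spirit to vcov = 'hetero'
coeftest(test_lm, vcov = vcovHC(test_lm, type = "HC1"))
```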
---
layout: true
# Living with heteroskedasticity
## Example: WLS

---

We mentioned that WLS is often not possible—we need to know the functional form of the heteroskedasticity—either

**A**\. `\(\sigma_i^2\)`

or

**B**\. `\(h(x_i)\)`, where `\(\sigma_i^2 = \sigma^2 h(x_i)\)`

--

There *are* occasions in which we can know `\(h(x_i)\)`.

---

Imagine individuals in a population have homoskedastic disturbances.

However, instead of observing individuals' data, we only observe group averages in our data (_e.g._, cities, counties, school districts).

If these groups have different sizes, then our dataset will be heteroskedastic—in a predictable fashion.

**Recall:** The variance of the sample mean depends upon the sample size,

$$ \mathop{\text{Var}} \left( \overline{x} \right) = \dfrac{\sigma_x^2}{n} $$

--

**Example:** Our school testing data is averaged at the school level.

---

*Example:* Our school testing data is averaged at the school level.

Even if individual students have homoskedastic disturbances, the schools would have heteroskedastic disturbances, _i.e._,

**Individual-level model:** `\(\quad \text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`

**School-level model:** `\(\quad \overline{\text{Score}}_s = \beta_0 + \beta_1 \overline{\text{Ratio}}_s + \beta_2 \overline{\text{Income}}_s + \overline{u}_s\)`

where the `\(s\)` subscript denotes an individual school (just as `\(i\)` indexes an individual person).

$$ \mathop{\text{Var}} \left( \overline{u}_s \right) = \dfrac{\sigma^2}{n_s} $$

---

For WLS, we're looking for a function `\(h(x_s)\)` such that `\(\mathop{\text{Var}} \left( \overline{u}_s | x_s \right) = \sigma^2 h(x_s)\)`.

--

We just showed<sup>.pink[†]</sup> that `\(\mathop{\text{Var}} \left( \overline{u}_s |x_s \right) = \dfrac{\sigma^2}{n_s}\)`.

.footnote[
.pink[†] Assuming the individuals' disturbances are homoskedastic.
]

--

Thus, `\(h(x_s) = 1/n_s\)`, where `\(n_s\)` is the number of students in school `\(s\)`.

--

To implement WLS, we divide each observation's data by `\(\sqrt{h(x_s)}\)`. Equivalently, we multiply each school's data by `\(1/\sqrt{h(x_s)} = \sqrt{n_s}\)`.

--

The variable .mono[enrollment] in the .mono[test_df] dataset is our `\(n_s\)`.

---

Using WLS to estimate `\(\text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`

**Step 1:** Multiply each variable by `\(1/\sqrt{h(x_i)} = \sqrt{\text{Enrollment}_i}\)`

``` r
# Create WLS-transformed variables, multiplying by the square root of enrollment
test_df <- mutate(test_df,
  test_score_wls = test_score * sqrt(enrollment),
  ratio_wls = ratio * sqrt(enrollment),
  income_wls = income * sqrt(enrollment),
  intercept_wls = 1 * sqrt(enrollment)
)
```

Notice that we are creating a transformed intercept.

---

Using WLS to estimate `\(\text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`

**Step 2:** Run our WLS (transformed) regression

``` r
# WLS regression
wls_reg <- lm(
  test_score_wls ~ -1 + intercept_wls + ratio_wls + income_wls,
  data = test_df
)
```

--

*Note:* The `-1` in our regression tells .mono[R] not to add an intercept, since we are adding a transformed intercept (`intercept_wls`).

---

The .hi[WLS estimates and standard errors:]

```
#>                Estimate Std. Error t value Pr(>|t|)    
#> intercept_wls 618.78331    8.26929  74.829   <2e-16 ***
#> ratio_wls      -0.21314    0.37676  -0.566    0.572    
#> income_wls      2.26493    0.09065  24.985   <2e-16 ***
```

--

<br> The .hi[OLS estimates] and .hi[het.-robust standard errors:]

```
#>               Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) 638.729155   7.301234 87.48235  < 2.2e-16 ***
#> ratio        -0.648740   0.353340 -1.83602   0.067066 .  
#> income        1.839112   0.114733 16.02949  < 2.2e-16 ***
```

---

An alternative to doing your own weighting: feed `lm()` some `weights`.
``` r
# WLS via lm's 'weights' argument (weights are proportional to 1/variance)
lm(test_score ~ ratio + income, data = test_df, weights = enrollment)
```

---
layout: false
# Living with heteroskedasticity

In this example

- .hi[Heteroskedasticity-robust standard errors] did not change our standard errors very much (relative to plain OLS standard errors).
- .hi[WLS] changed our answers a bit—coefficients and standard errors.

--

These examples highlighted a few things:

1. Using the correct estimator for your standard errors really matters.<sup>.pink[†]</sup>
2. Econometrics doesn't always offer an obviously *correct* route.

.footnote[
.pink[†] Sit in on an economics seminar, and you will see what I mean.
]

--

To see \#1, let's run a simulation.

---
layout: true
# Living with heteroskedasticity
## Simulation

---

Let's examine a simple linear regression model with heteroskedasticity.

$$ y\_i = \underbrace{\beta\_0}\_{=1} + \underbrace{\beta\_1}\_{=10} x\_i + u\_i $$

where `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma_i^2 = \sigma^2 x_i^2\)`.

--

<img src="slides_files/figure-html/sim plot y-1.svg" style="display: block; margin: auto;" />

---

Let's examine a simple linear regression model with heteroskedasticity.

$$ y\_i = \underbrace{\beta\_0}\_{=1} + \underbrace{\beta\_1}\_{=10} x\_i + u\_i $$

where `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma_i^2 = \sigma^2 x_i^2\)`.

<img src="slides_files/figure-html/sim plot u-1.svg" style="display: block; margin: auto;" />

---

*Note regarding WLS:* Since `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma^2 x_i^2\)`,

$$ \mathop{\text{Var}} \left( u_i | x_i \right) = \sigma^2 h(x_i) \implies h(x_i) = x_i^2 $$

WLS multiplies each variable by `\(1/\sqrt{h(x_i)} = 1/x_i\)`.

---

In this simulation, we want to compare

1. The **efficiency** of
  - OLS
  - WLS with correct weights: `\(h(x_i) = x_i^2\)`
  - WLS with incorrect weights: `\(h(x_i) = x_i\)`
2. How well our **standard errors** perform (via confidence intervals) with
  - Plain OLS standard errors
  - Heteroskedasticity-robust standard errors
  - WLS standard errors

---

The simulation plan:

.pseudocode-small[

Do 10,000 times:

1. Generate a sample of size 30 from the population
2. Calculate/save OLS and WLS (×2) estimates for β.sub[1]
3. Calculate/save standard errors for β.sub[1] using
  - Plain OLS standard errors
  - Heteroskedasticity-robust standard errors
  - WLS (correct)
  - WLS (incorrect)

]

---

**For one iteration of the simulation:** Code to generate the data...

``` r
# Parameters
b0 <- 1
b1 <- 10
s2 <- 1
# Sample size
n <- 30
# Generate data; sd = sqrt(s2) * x gives Var(u|x) = s2 * x^2
sample_df <- tibble(
  x = runif(n, 0.5, 1.5),
  y = b0 + b1 * x + rnorm(n, 0, sd = sqrt(s2) * x)
)
```

---

**For one iteration of the simulation:** Code to estimate our coefficients and standard errors...

``` r
# OLS
ols <- feols(y ~ x, data = sample_df)
# WLS: Correct weights, 1/h(x) = 1/x^2
wls_t <- lm(y ~ x, data = sample_df, weights = 1/x^2)
# WLS: Incorrect weights, 1/h(x) = 1/x
wls_f <- lm(y ~ x, data = sample_df, weights = 1/x)
# Coefficients and standard errors
summary(ols, vcov = 'iid')
summary(ols, vcov = 'hetero')
summary(wls_t)
summary(wls_f)
```

Then save the results.
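--

To repeat this 10,000 times, one approach is to wrap a single iteration in a function and stack the results. A minimal sketch (assuming the parameters above; function and column names hypothetical):

``` r
# One iteration: draw a sample, return the three slope estimates
one_iter <- function(i) {
  sample_df <- tibble(
    x = runif(n, 0.5, 1.5),
    y = b0 + b1 * x + rnorm(n, 0, sd = sqrt(s2) * x)
  )
  tibble(
    iter  = i,
    ols   = coef(feols(y ~ x, data = sample_df))["x"],
    wls_t = coef(lm(y ~ x, data = sample_df, weights = 1/x^2))["x"],
    wls_f = coef(lm(y ~ x, data = sample_df, weights = 1/x))["x"]
  )
}
# Repeat 10,000 times and stack the results into one data frame
sim_results <- map_dfr(1:1e4, one_iter)
```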
---
layout: false
# Living with heteroskedasticity
## Simulation: Coefficients

<img src="slides_files/figure-html/sim plot efficiency-1.svg" style="display: block; margin: auto;" />

---
layout: false
# Living with heteroskedasticity
## Simulation: Inference

<img src="slides_files/figure-html/sim plot t stat-1.svg" style="display: block; margin: auto;" />

---
layout: false
# Living with heteroskedasticity
## Simulation: Results

Summarizing our simulation results (10,000 iterations)

.pull-left.center[

**Estimation**: Summary of `\(\hat{\beta}_1\)`'s

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Estimator </th>
   <th style="text-align:right;"> Mean </th>
   <th style="text-align:right;"> S.D. </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> OLS </td>
   <td style="text-align:right;"> 10.028 </td>
   <td style="text-align:right;"> 0.897 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> WLS Correct </td>
   <td style="text-align:right;"> 10.021 </td>
   <td style="text-align:right;"> 0.675 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> WLS Incorrect </td>
   <td style="text-align:right;"> 10.024 </td>
   <td style="text-align:right;"> 0.766 </td>
  </tr>
</tbody>
</table>

]

--

.pull-right.center[

**Inference:** % of times we reject the true `\(\beta_1\)`

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Estimators </th>
   <th style="text-align:right;"> % Reject </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> OLS + Het.-robust </td>
   <td style="text-align:right;"> 7.5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> OLS + Homosk. </td>
   <td style="text-align:right;"> 8.5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> WLS Correct </td>
   <td style="text-align:right;"> 7.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> WLS Incorrect </td>
   <td style="text-align:right;"> 8.1 </td>
  </tr>
</tbody>
</table>

]

---
layout: false
class: clear, middle

Going further...

---
# Similar violations
## Assumptions

Recall our old assumption that led to this heteroskedasticity discussion:

> 5\. The disturbances have .b.slate[constant variance] `\(\sigma^2\)` <br>and .hi[zero covariance], _i.e._,

- `\(\color{#314f4f}{\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X_i \right] = \mathop{\text{Var}} \left( u_i \middle| X_i \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2}\)`
- `\(\color{#e64173}{\mathop{\text{Cov}} \left( u_i, \, u_j \middle| X_i,\,X_j \right) = \mathop{\boldsymbol{E}}\left[ u_i u_j \middle| X_i,\,X_j \right] = 0}\)` for `\(i\neq j\)`

--

.b.slate[Violation of constant variance] `\(=\)` .b.slate[heteroskedasticity]

It's also possible (likely) that the .b.pink[disturbances are correlated]. Ignoring this correlation is even more problematic for inference.

---
# Correlated disturbances
## The problem

In many cases, .b.pink[observations' disturbances] `\(\left(u_i,\,u_j\right)\)` .b.pink[can be correlated].

*Remember*

- The .b.pink[disturbance] represents the un-included variables that affect `\(y\)`.
- Some .b.pink[observations] in the sample may *relate* to other observations.

If these observation-level relationships extend to the variables in the disturbance, then disturbances can correlate. `\(\implies \mathop{\text{Cov}} \left( u_i, \, u_j \middle| X_i,\,X_j \right) \color{#e64173}{\neq} 0\)`.

--

Ignoring this correlation can cause **big problems** in your inference.

---
# Correlated disturbances
## The intuition

Why is ignoring this correlation problematic?
**False precision:** We can get "overconfident" in our knowledge. When we treat correlated observations as independent, OLS thinks we're learning more than we actually are.

**Extreme example:** If you duplicate your dataset (stack it on top of itself), plain OLS standard errors would decrease every time you duplicated the dataset.

---
exclude: true

---
layout: false
class: clear, middle

The effect of duplicating our data on the .pink[OLS standard error] of `ratio`.

<img src="slides_files/figure-html/dup-plot1-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: clear, middle

Correcting our .orange[standard errors for clustering] (observations' correlation).

<img src="slides_files/figure-html/dup-plot2-1.svg" style="display: block; margin: auto;" />

---
# Correlated disturbances
## Examples

"Real" examples where disturbances might correlate:

- Students in a classroom (share teacher, curriculum, etc.)
- Counties in a state (share state-level policies/laws)
- Businesses in a city (share local economic shocks)
- Consecutive days in a sample (share events, weather, etc.)

---
# Correlated disturbances
## The solution

Just like we calculate *heteroskedasticity*-robust standard errors, <br>we can also calculate standard errors robust to correlated disturbances.

People call these .pink[*cluster-robust standard errors*] (or just *clustered*).

From the `fixest` package:

`feols(y ~ x, data = fake_data, cluster = 'cluster_var')`

or even

`feols(y ~ x, data = fake_data, cluster = c('cluster1', 'cluster2'))`

---
# Final word
## Better inference

1. You should default to assuming your data are heteroskedastic.
2. Think about how your explanatory variables and/or disturbances correlate across observations. A short simulated illustration follows.
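---
# Final word
## A closing illustration

To make the clustering point concrete, here is a minimal sketch (simulated data; all names hypothetical): when both `\(x\)` and `\(u\)` share cluster-level shocks, iid standard errors overstate our precision.

``` r
# Load packages and set a seed
library(pacman)
p_load(tidyverse, fixest)
set.seed(421)
# 40 clusters of 25 observations; x and u each share a cluster-level shock
cluster_df <- tibble(
  cluster_id = rep(1:40, each = 25),
  x = rnorm(40)[cluster_id] + rnorm(1000),
  u = rnorm(40)[cluster_id] + rnorm(1000),
  y = 1 + 2 * x + u
)
# Estimate the regression
cluster_reg <- feols(y ~ x, data = cluster_df)
# iid standard errors ignore the within-cluster correlation
summary(cluster_reg, vcov = 'iid')
# Cluster-robust standard errors account for it (expect larger SEs)
summary(cluster_reg, cluster = 'cluster_id')
```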