class: center, middle, inverse, title-slide

.title[
# Heteroskedasticity
]
.subtitle[
## EC 421, Set 04
]
.author[
### Edward Rubin
]

---
class: inverse, middle
# Prologue

---
# .mono[R] showcase

**[.mono[R Markdown]](https://rmarkdown.rstudio.com)**

- Simple mark-up language for combining/creating documents, equations, figures, .mono[R], and more
- [Basics of .mono[Markdown]](https://rmarkdown.rstudio.com/authoring_basics.html)
- _E.g._, `**I'm bold**`, `*I'm italic*`, `I = "code"`

**[Econometrics with .mono[R]](https://www.econometrics-with-r.org)**

- (Currently) free, online textbook
- Written and published using .mono[R] (and probably .mono[R Markdown])
- *Warning:* I haven't read the full book.

Related: Tyler Ransom has a [great cheatsheet for econometrics](https://github.com/tyleransom/EconometricsLabs/blob/master/econometricsCheatSheet.pdf).

---
# Schedule

## Last Time

We wrapped up our review.

## Today

Heteroskedasticity

---
# Schedule

## This week

First assignment! Submit **one file** (HTML, PDF, or Word) that includes

1. Your written answers
1. Figures and regression output
1. The .mono[R] code that generated your answers.

This file should be rendered from an R Markdown or Quarto source file (`.rmd` or `.qmd`).

.hi[Important]
- We should be able to easily find your answers for each question.
- **Do not copy.** (You .attn[will] receive a zero.)

---
class: inverse, middle
# Prerequisite material

---
layout: true
# Prerequisite material
## The `\(\chi^2\)` distribution

---

Some test statistics are distributed as `\(\chi^2\)` random variables.

The `\(\chi^2\)` distribution is just another example of a common (named) distribution (like the Normal distribution, the `\(t\)` distribution, and the `\(F\)` distribution).

The shape of the `\(\chi^2\)` distribution depends on a single parameter:

- We will call this parameter `\(k\)`
- Our test statistics will refer to `\(k\)` as *degrees of freedom*.

---

Three examples of `\(\chi_k^2\)`: `\(\color{#314f4f}{k = 1}\)`, `\(\color{#e64173}{k = 2}\)`, and `\(\color{orange}{k = 9}\)`

<img src="slides_files/figure-html/chisq1-1.svg" style="display: block; margin: auto;" />

---

Probability of observing a more extreme test statistic `\(\widehat{\text{LM}}\)` under H.sub[0]

<img src="slides_files/figure-html/chisq2-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Heteroskedasticity

---
class: inverse, middle

---

Let's write down our **current assumptions**

--

1. Our sample (the `\(x_k\)`'s and `\(y_i\)`) was .hi[randomly drawn] from the population.

--

2. `\(y\)` is a .hi[linear function] of the `\(\beta_k\)`'s and `\(u_i\)`.

--

3. There is no perfect .hi[multicollinearity] in our sample.

--

4. The explanatory variables are .hi[exogenous]: `\(\mathop{\boldsymbol{E}}\left[ u \middle| X \right] = 0 \left(\implies \mathop{\boldsymbol{E}}\left[ u \right] = 0\right)\)`.

--

5. The disturbances have .hi[constant variance] `\(\sigma^2\)` and .hi[zero covariance], _i.e._,
  - `\(\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X \right] = \mathop{\text{Var}} \left( u_i \middle| X \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2\)`
  - `\(\mathop{\text{Cov}} \left( u_i, \, u_j \middle| X \right) = \mathop{\boldsymbol{E}}\left[ u_i u_j \middle| X \right] = 0\)` for `\(i\neq j\)`

--

6. The disturbances come from a .hi[Normal] distribution, _i.e._, `\(u_i \overset{\text{iid}}{\sim} \mathop{\text{N}}\left( 0, \sigma^2 \right)\)`.

---

Today we're focusing on assumption \#5:

> 5\.
The disturbances have .hi[constant variance] `\(\sigma^2\)` and .hi[zero covariance], _i.e._,
> - `\(\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X \right] = \mathop{\text{Var}} \left( u_i \middle| X \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2\)`
> - `\(\mathop{\text{Cov}} \left( u_i, \, u_j \middle| X \right) = \mathop{\boldsymbol{E}}\left[ u_i u_j \middle| X \right] = 0\)` for `\(i\neq j\)`

--

Specifically, we will focus on the assumption of .hi[constant variance] (also known as *homoskedasticity*).

--

**Violation of this assumption:** .hi[Heteroskedasticity:] `\(\mathop{\text{Var}} \left( u_i \right) = \sigma^2_i\)` and `\(\sigma^2_i \neq \sigma^2_j\)` for some `\(i\neq j\)`.

--

In other words: Our disturbances have different variances.

---

Classic example of heteroskedasticity: The funnel

Variance of `\(u\)` increases with `\(x\)`

<img src="slides_files/figure-html/het ex1-1.svg" style="display: block; margin: auto;" />

---

Another example of heteroskedasticity: (double funnel?)

Variance of `\(u\)` increasing at the extremes of `\(x\)`

<img src="slides_files/figure-html/het ex2 -1.svg" style="display: block; margin: auto;" />

---

Another example of heteroskedasticity:

Differing variances of `\(u\)` by group

<img src="slides_files/figure-html/het ex3 -1.svg" style="display: block; margin: auto;" />

---

## What and why?

.hi[Heteroskedasticity] is present when the variance of `\(u\)` changes with any combination of our explanatory variables `\(x_1\)` through `\(x_k\)` (henceforth: `\(X\)`).

--

It's very common in practice, and it should probably be our default assumption.

--

.hi[Why we care:] Heteroskedasticity shows us how small violations of our assumptions can affect OLS's performance.

---

## Consequences

So what are the consequences of heteroskedasticity? Bias? Inefficiency?

First, let's check if it has consequences for the unbiasedness of OLS.

--

**Recall<sub>1</sub>:** OLS being unbiased means `\(\mathop{\boldsymbol{E}}\left[ \hat{\beta}_k \middle| X \right] = \beta_k\)` for all `\(k\)`.
-- **Recall<sub>2</sub>:** We previously showed `\(\hat{\beta}_1 = \dfrac{\sum_i\left(y_i-\overline{y}\right)\left(x_i-\overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2}\)` -- It will actually help us to rewrite this estimator as $$ \hat{\beta}_1 = \beta_1 + \dfrac{\sum_i \left( x_i - \overline{x} \right) u_i}{\sum_i \left( x_i - \overline{x} \right)^2} $$ --- **Proof:** Assuming `\(y_i = \beta_0 + \beta_1 x_i + u_i\)` $$ `\begin{aligned} \hat{\beta}_1 &= \dfrac{\sum_i\left(y_i-\overline{y}\right)\left(x_i-\overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \dfrac{\sum_i\left(\left[ \beta_0 + \beta_1 x_i + u_i \right]- \left[ \beta_0 + \beta_1 \overline{x} + \overline{u} \right] \right)\left(x_i-\overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \dfrac{\sum_i\left(\beta_1 \left[ x_i - \overline{x} \right] + \left[u_i - \overline{u}\right] \right)\left(x_i-\overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \dfrac{\sum_i\left(\beta_1 \left[ x_i - \overline{x} \right]^2 + \left[ x_i - \overline{x} \right] \left[u_i - \overline{u}\right]\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) \left(u_i - \overline{u}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \end{aligned}` $$ --- $$ `\begin{aligned} \hat{\beta}_1 &= \cdots = \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) \left(u_i - \overline{u}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i - \overline{u} \sum_i\left(x_i - \overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i - \overline{u} \left(\sum_i x_i - \sum_i \overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i - \overline{u} \left(\sum_i x_i - n \overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i - \overline{u} \color{#e64173}{\left(\sum_i x_i - \sum_i x_i\right)}}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i}{\sum_i\left(x_i -\overline{x}\right)^2} \quad \text{😅} \end{aligned}` $$ --- ## Consequences: Bias We now want to see if heteroskedasticity biases the OLS estimator for `\(\beta_1\)`. -- $$ `\begin{aligned} \mathop{\boldsymbol{E}}\left[ \hat{\beta}_1 \middle| X \right] &= \mathop{\boldsymbol{E}}\left[ \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i}{\sum_i\left(x_i -\overline{x}\right)^2} \middle| X \right] \\[0.5em] &= \beta_1 + \mathop{\boldsymbol{E}}\left[ \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i}{\sum_i\left(x_i -\overline{x}\right)^2} \middle| X \right] \\[0.5em] &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \color{#e64173}{\underbrace{\mathop{\boldsymbol{E}}\left[ u_i \middle| X \right]}_{=0}} \\[0.5em] &= \beta_1 \end{aligned}` $$ -- Phew. .hi[OLS is still unbiased] for the `\(\beta_k\)`. --- ## Consequences: Efficiency OLS's **efficiency** and **inference** do not survive heteroskedasticity. - In the presence of heteroskedasticity, OLS is .hi[no longer the most efficient] (best) linear unbiased estimator. -- - It would be more informative (efficient) to .hi[weight observations] inversely to their `\(u_i\)`'s variance. - Downweight high-variance `\(u_i\)`'s (too noisy to learn much). 
- Upweight observations with low-variance `\(u_i\)`'s (more 'trustworthy').
- Now you have the idea behind weighted least squares (WLS).

---

## Consequences: Inference

OLS .hi[standard errors are biased] in the presence of heteroskedasticity.

- Wrong confidence intervals
- Problems for hypothesis testing (both `\(t\)` and `\(F\)` tests)

--

- It's hard to learn much without sound inference.

---

## Solutions

1. **Tests** to determine whether heteroskedasticity is present.

2. **Remedies** for (1) efficiency and (2) inference

---
layout: true
# Testing for heteroskedasticity

---
class: inverse, middle

---

While we *might* have solutions for heteroskedasticity, the efficiency of our estimators depends upon whether or not heteroskedasticity is present. We will consider two tests:

1. The **Goldfeld-Quandt test**
1. The **White test**

--

Each of these tests.super[.pink[†]] centers on the fact that we can .hi[use the OLS residual] `\(\color{#e64173}{e_i}\)` .hi[to estimate the population disturbance] `\(\color{#e64173}{u_i}\)`.

.footnote[.pink[†] There are many other options for testing, *e.g.*, the **Breusch-Pagan test**.]

---
layout: true
# Testing for heteroskedasticity
## The Goldfeld-Quandt test

---

Focuses on a specific type of heteroskedasticity: whether the variance of `\(u_i\)` differs .hi[between two groups].<sup>†</sup>

Remember how we used our residuals to estimate `\(\sigma^2\)`?

$$
s^2 = \dfrac{\text{SSE}}{n-1} = \dfrac{\sum_i e_i^2}{n-1}
$$

We will use this same idea to determine whether there is evidence that our two groups differ in the variances of their disturbances, effectively comparing `\(s^2_1\)` and `\(s^2_2\)` from our two groups.

.footnote[
[†]: The G-Q test was one of the early tests of heteroskedasticity (1965).
]

---

Operationally,

.pseudocode-small[

1. Order the observations by `\(x\)`

2. Split the data into two groups of size n.super[⭑]
  - G<sub>1</sub>: The first third
  - G<sub>2</sub>: The last third

3. Run separate regressions of `\(y\)` on `\(x\)` for G.sub[1] and G.sub[2]

4. Record SSE.sub[1] and SSE.sub[2]

5. Calculate the G-Q test statistic

]

---

The G-Q test statistic

$$
F_{\left(n^{\star}-k,\, n^{\star}-k\right)} = \dfrac{\text{SSE}_2/(n^\star-k)}{\text{SSE}_1/(n^\star-k)} = \dfrac{\text{SSE}_2}{\text{SSE}_1}
$$

follows an `\(F\)` distribution (under the null hypothesis) with `\(n^{\star}-k\)` and `\(n^{\star}-k\)` degrees of freedom.<sup>†</sup>

--

**Notes**

- The G-Q test requires that the disturbances follow normal distributions.
- The G-Q test assumes a very specific type/form of heteroskedasticity.
- It performs very well if we know the form of the potential heteroskedasticity.

.footnote[
[†]: Goldfeld and Quandt suggested `\(n^{\star}\)` of `\((3/8)n\)`. `\(k\)` gives the number of estimated parameters (_i.e._, `\(\hat{\beta}_j\)`'s).
]

---

<img src="slides_files/figure-html/gq1a-1.svg" style="display: block; margin: auto;" />

---

<img src="slides_files/figure-html/gq1b-1.svg" style="display: block; margin: auto;" />

`\(F_{375,\,375} = \dfrac{\color{#e64173}{\text{SSE}_2 = 18,203.4}}{\color{#314f4f}{\text{SSE}_1 = 1,039.5}} \approx 17.5 \implies\)` *p*-value `\(< 0.001\)`

`\(\therefore\)` We reject H.sub[0]: `\(\sigma^2_1 = \sigma^2_2\)` and conclude there is statistically significant evidence of heteroskedasticity.

---

The problem...
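To preview the problem illustrated by the figures on the next slides, here is a minimal simulation sketch (a made-up data-generating process of my own, not the slides' data, base .mono[R] only): when the variance of `\(u\)` is large at *both* extremes of `\(x\)`, the two tail groups have similar SSEs, so the G-Q ratio hovers near 1 even though heteroskedasticity is severe.

``` r
# Sketch: the G-Q test can miss non-monotonic heteroskedasticity
# (hypothetical DGP chosen purely for illustration)
set.seed(421)
n = 999
x = runif(n, min = -3, max = 3)
u = rnorm(n, mean = 0, sd = 1 + 2 * abs(x)) # variance largest at both extremes of x
y = 1 + 2 * x + u
sim_df = data.frame(x, y)
# Order the observations by x; grab the first and last thirds
sim_df = sim_df[order(sim_df$x), ]
n_star = n / 3
g1 = head(sim_df, n_star)
g2 = tail(sim_df, n_star)
# Separate regressions; record each SSE
sse1 = sum(residuals(lm(y ~ x, data = g1))^2)
sse2 = sum(residuals(lm(y ~ x, data = g2))^2)
# G-Q statistic: should land near 1 here, so the test has little power
sse2 / sse1
```

Because the two tails are (by construction) equally noisy, the SSE ratio carries essentially no signal about the very real heteroskedasticity.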
--- <img src="slides_files/figure-html/gq2a-1.svg" style="display: block; margin: auto;" /> --- <img src="slides_files/figure-html/gq2b-1.svg" style="display: block; margin: auto;" /> `\(F_{375,\,375} = \dfrac{\color{#e64173}{\text{SSE}_2 = 14,516.8}}{\color{#314f4f}{\text{SSE}_1 = 14,937.1}} \approx 1 \implies\)` *p*-value `\(\approx 0.609\)` `\(\therefore\)` We fail to reject H.sub[0]: `\(\sigma^2_1 = \sigma^2_2\)` while heteroskedasticity is present. --- layout: true # Testing for heteroskedasticity ## The White test --- **Breusch and Pagan (1981)** attempted to solve this issue of being too specific with the functional form of the heteroskedasticity. - Regress `\(e_i^2\)` on `\(X = \left[ 1,\, x_1,\, x_2,\, \ldots,\, x_k \right]\)` and test for joint significance. - Allows the data to show if/how the variance of `\(u_i\)` correlates with `\(X\)`. - If `\(\sigma_i^2\)` correlates with `\(X\)`, then we have heteroskedasticity. -- However, we actually want to know if $$ \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 $$ **Q:** Can't we just test this hypothesis? -- **A:** Sort of. --- Toward this goal, Hal White took advantage of the fact that we can .hi[replace the homoskedasticity requirement with a weaker assumption]: - **Old:** `\(\mathop{\text{Var}} \left( u_i \middle| X \right) = \sigma^2\)` - **New:** `\(u^2\)` is *uncorrelated* with the explanatory variables (_i.e._, `\(x_j\)` for all `\(j\)`), their squares (_i.e._, `\(x_j^2\)`), and the first-degree interactions (_i.e._, `\(x_j x_h\)`). -- This new assumption is easier to explicitly test (*hint:* regression). --- An outline of White's test for heteroskedasticity: .pseudocode-small[ 1\. Regress y on x.sub[1], x.sub[2], …, x.sub[k]. Save residuals e. 2\. Regress squared residuals on all explanatory variables, their squares, and interactions. $$ e^2 = \alpha\_0 + \sum\_{h=1}^k \alpha\_h x\_h + \sum\_{j=1}^k \alpha\_{k+j} x\_j^2 + \sum\_{\ell = 1}^{k-1} \sum\_{m = \ell + 1}^k \alpha\_{\ell,m} x\_\ell x\_m + v\_i $$ 3\. Record R.sub[e].super[2]. 4\. Calculate test statistic to test H.sub[0]: `\(\alpha_p = 0\)` for all `\(p\neq0\)`. ] --- White's test statistic is $$ \text{LM} = n \times R_e^2 \qquad \text{Under H}_0,\, \text{LM} \overset{\text{d}}{\sim} \chi_k^2 $$ where `\(R^2_e\)` comes from the regression of `\(e^2\)` on the explanatory variables, their squares, and their interactions. $$ e^2 = \alpha\_0 + \underbrace{\sum\_{h=1}^k \alpha\_h x\_h}\_{\text{Expl. variables}} + \underbrace{\sum\_{j=1}^k \alpha\_{k+j} x\_j^2}\_{\text{Squared terms}} + \underbrace{\sum\_{\ell = 1}^{k-1} \sum\_{m = \ell + 1}^k \alpha\_{\ell,m} x\_\ell x\_m}\_{\text{Interactions}} + v\_i $$ **Note:** The `\(k\)` (for our `\(\chi_k^2\)`) equals the number of estimated parameters in the regression above (the `\(\alpha_j\)`), excluding the intercept `\(\left( \alpha_0 \right)\)`. --- **Practical note:** If a variable is equal to its square (_e.g._, binary variables), then you don't (can't) include it. The same rule applies for interactions. --- *Example:* Consider the model.super[†] `\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u\)` **Step 1:** Estimate the model; obtain residuals `\((e)\)`. **Step 2:** Regress `\(e^2\)` on explanatory variables, squares, and interactions. $$ `\begin{aligned} e^2 = &\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_1^2 + \alpha_5 x_2^2 + \alpha_6 x_3^2 \\ &+ \alpha_7 x_1 x_2 + \alpha_8 x_1 x_3 + \alpha_9 x_2 x_3 + v \end{aligned}` $$ Record the R.super[2] from this equation (call it `\(R_e^2\)`). 
**Step 3:** Test H.sub[0]: `\(\alpha_1 = \alpha_2 = \cdots = \alpha_9 = 0\)` using `\(\text{LM} = n R^2_e \overset{\text{d}}{\sim} \chi_9^2\)`.

.footnote[
[†]: To simplify notation here, I'm dropping the `\(i\)` subscripts.
]

---

<img src="slides_files/figure-html/white1-1.svg" style="display: block; margin: auto;" />

The White test for this simple linear regression.

$$
`\begin{aligned}
  e_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} \color{#e64173}{+ \hat{\alpha}_2 x^2_{1i}} & \widehat{\text{LM}} &= 185.8 &\mathit{p}\text{-value} < 0.001
\end{aligned}`
$$

---
layout: false
class: inverse, middle
# Testing for Heteroskedasticity
## Examples

---
layout: true
# Testing for heteroskedasticity
## Examples

---

**Goal:** Estimate the relationship between standardized test scores (outcome variable) and (1) student-teacher ratio and (2) income, _i.e._,

$$
\left(\text{Test score}\right)\_i = \beta\_0 + \beta\_1 \text{Ratio}\_{i} + \beta\_2 \text{Income}\_{i} + u\_i \tag{1}
$$

**Potential issue:** Heteroskedasticity... and we do not observe `\(u_i\)`.

**Solution:**

1. Estimate the relationship in `\((1)\)` using OLS.
2. Test for heteroskedasticity.
  - Goldfeld-Quandt
  - White

---

We will use test-score data from the dataset `Caschool` in the `Ecdat` .mono[R] package.

``` r
# Load packages (broom provides tidy(), which we use below)
library(pacman)
p_load(broom, tidyverse, Ecdat)
# Select and rename desired variables; assign to new dataset
test_df = select(Caschool, test_score = testscr, ratio = str, income = avginc)
# Format as tibble
test_df = as_tibble(test_df)
# View first 2 rows of the dataset
head(test_df, 2)
```

```
#> # A tibble: 2 × 3
#>   test_score ratio income
#>        <dbl> <dbl>  <dbl>
#> 1       691.  17.9  22.7
#> 2       661.  21.5   9.82
```

---

Let's begin by estimating our model

$$
\left(\text{Test score}\right)\_i = \beta\_0 + \beta\_1 \text{Ratio}\_{i} + \beta\_2 \text{Income}\_{i} + u\_i
$$

``` r
# Estimate the model
est_model = lm(test_score ~ ratio + income, data = test_df)
# Summary of the estimate
tidy(est_model)
```

```
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)  639.       7.45       85.7  5.70e-267
#> 2 ratio         -0.649    0.354      -1.83 6.79e-  2
#> 3 income         1.84     0.0928     19.8  4.38e- 62
```

---

Now, let's see what the residuals suggest about heteroskedasticity

``` r
# Add the residuals to our dataset
test_df$e = residuals(est_model)
```

<img src="slides_files/figure-html/gq3-1.svg" style="display: block; margin: auto;" />

---
layout: true
# Testing for heteroskedasticity
## Example: Goldfeld-Quandt

---

Income looks potentially heteroskedastic; let's test via Goldfeld-Quandt.

``` r
# Arrange the data by income
test_df = arrange(test_df, income)
```

---
count: false

Income looks potentially heteroskedastic; let's test via Goldfeld-Quandt.

``` r
# Arrange the data by income
test_df = arrange(test_df, income)
# Re-estimate the model for the last and first 158 observations
est_model1 = lm(test_score ~ ratio + income, data = tail(test_df, 158))
est_model2 = lm(test_score ~ ratio + income, data = head(test_df, 158))
```

---
count: false

Income looks potentially heteroskedastic; let's test via Goldfeld-Quandt.
``` r # Arrange the data by income test_df = arrange(test_df, income) # Re-estimate the model for the last and first 158 observations est_model1 = lm(test_score ~ ratio + income, data = tail(test_df, 158)) est_model2 = lm(test_score ~ ratio + income, data = head(test_df, 158)) # Grab the residuals from each regression e_model1 = residuals(est_model1) e_model2 = residuals(est_model2) ``` --- count: false Income looks potentially heteroskedastic; let's test via Goldfeld-Quandt. ``` r # Arrange the data by income test_df = arrange(test_df, income) # Re-estimate the model for the last and first 158 observations est_model1 = lm(test_score ~ ratio + income, data = tail(test_df, 158)) est_model2 = lm(test_score ~ ratio + income, data = head(test_df, 158)) # Grab the residuals from each regression e_model1 = residuals(est_model1) e_model2 = residuals(est_model2) # Calculate SSE for each regression (sse_model1 = sum(e_model1^2)) ``` ``` #> [1] 19305.01 ``` ``` r (sse_model2 = sum(e_model2^2)) ``` ``` #> [1] 29537.83 ``` --- Remember the Goldfeld-Quandt test statistic? `\(F_{n^\star-k,\,n^\star-k} = \dfrac{\text{SSE}_2}{\text{SSE}_1}\)` -- `\(\approx\dfrac{29,537.83}{19,305.01}\)` -- `\(\approx1.53\)` -- Test via `\(F_{158-3,\,158-3}\)` -- ``` r # G-Q test statistic (f_gq = sse_model2/sse_model1) ``` ``` #> [1] 1.530061 ``` -- ``` r # p-value pf(q = f_gq, df1 = 158-3, df2 = 158-3, lower.tail = F) ``` ``` #> [1] 0.004226666 ``` --- The Goldfeld-Quandt test statistic and its null distribution <img src="slides_files/figure-html/ex gq10-1.svg" style="display: block; margin: auto;" /> --- Putting it all together: H.sub[0]: `\(\sigma_1^2 = \sigma_2^2\)` *vs.* H.sub[A]: `\(\sigma_1^2 \neq \sigma_2^2\)` Goldfeld-Quandt test statistic: `\(F \approx 1.53\)` *p*-value `\(\approx 0.00423\)` ∴ Reject H.sub[0] (*p*-value is less than 0.05). **Conclusion:** There is statistically significant evidence that `\(\sigma_1^2 \neq \sigma_2^2\)`. Therefore, we find statistically significant evidence of heteroskedasticity (at the 5-percent level). --- What if we had chosen to focus on student-to-teacher ratio? -- ``` r # Arrange the data by ratio test_df = arrange(test_df, ratio) # Re-estimate the model for the last and first 158 observations est_model3 = lm(test_score ~ ratio + income, data = tail(test_df, 158)) est_model4 = lm(test_score ~ ratio + income, data = head(test_df, 158)) # Grab the residuals from each regression e_model3 = residuals(est_model3) e_model4 = residuals(est_model4) # Calculate SSE for each regression (sse_model3 = sum(e_model3^2)) ``` ``` #> [1] 26243.52 ``` ``` r (sse_model4 = sum(e_model4^2)) ``` ``` #> [1] 29101.52 ``` --- `\(F_{n^\star-k,\,n^\star-k} = \dfrac{\text{SSE}_4}{\text{SSE}_3} \approx\dfrac{29,101.52}{26,243.52} \approx1.11\)` which has a *p*-value of approximately 0.2603. -- ∴ We would have failed to reject H.sub[0], concluding that we failed to find statistically significant evidence of heteroskedasticity. -- **Lesson:** Understand the limitations of estimators, tests, *etc.* --- layout: true # Heteroskedasticity ## Example: White --- Let's test the same model and data with the White test. 
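*Aside:* if you prefer a canned routine, the `lmtest` package (not used elsewhere in these slides, so treat this sketch as an optional extra that assumes the package is installed) can run a White-style test: hand `bptest()` the squares and interactions via `varformula`, and its default (studentized) statistic takes the same `\(n \times R^2_e\)` form we compute by hand below.

``` r
# Sketch of a canned White-style test (assumes the lmtest package is installed)
p_load(lmtest)
bptest(
  est_model,
  varformula = ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income,
  data = test_df
)
```

Either way, the manual walkthrough below shows exactly what such a function is doing.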
*Recall:* We saved our residuals as `e` in our dataset, _i.e._, ``` r # Estimate the model est_model = lm(test_score ~ ratio + income, data = test_df) # Add the residuals to our dataset test_df$e = residuals(est_model) ``` --- The .pink[**White test** adds squared terms and interactions] to initial regression specification (the right-hand side) $$ `\begin{align} \color{#314f4f}{u_{i}^2} =& \color{#314f4f}{\alpha_{0} + \alpha_{1} \text{Ratio}_{i} + \alpha_{2} \text{Income}_{i} } \\ &+ \color{#e64173}{\alpha_{3} \text{Ratio}_{i}^2 + \alpha_{4} \text{Income}_{i}^2 + \alpha_{5} \text{Ratio}_{i}\times\text{Income}_{i}} \\ &+ \color{#314f4f}{w_{i}} \end{align}` $$ The .pink[**White test**] tests the null hypothesis <br>.pink[H.sub[0]:] `\(\color{#e64173}{\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0}\)` We just need to write some .mono[R] code to test .pink[H.sub[0]]. --- *Aside:* .mono[R] has funky notation for squared terms and interactions in `lm()`: - **Squared terms** use `I()`, _e.g._, `lm(y ~ I(x^2))` - **Interactions** use `:` between the variables, _e.g._, `lm(y ~ x1:x2)` *Example:* Regress `y` on quadratic of `x1` and `x2`: ``` r # Pretend quadratic regression w/ interactions lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = pretend_df) ``` --- **Step 1:** Regress `\(e_i^2\)` on 1.super[st] degree, 2.super[nd] degree, and interactions ``` r # Regress squared residuals on quadratic of explanatory variables white_model = lm( I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income, data = test_df ) # Grab the R-squared (white_r2 = summary(white_model)$r.squared) ``` --- count: false **Step 2:** Collect `\(R_e^2\)` from the regression. ``` r # Regress squared residuals on quadratic of explanatory variables white_model = lm( I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income, data = test_df ) # Grab the R-squared (white_r2 = summary(white_model)$r.squared) ``` ``` #> [1] 0.07332222 ``` --- count: false **Step 3:** Calculate White test statistic `\(\text{LM} = n \times R_e^2 \approx 420 \times 0.073\)` ``` r # Regress squared residuals on quadratic of explanatory variables white_model = lm( I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income, data = test_df ) # Grab the R-squared white_r2 = summary(white_model)$r.squared # Calculate the White test statistic (white_stat = 420 * white_r2) ``` ``` #> [1] 30.79533 ``` --- count: false **Step 4:** Calculate the associated *p*-value (where `\(\text{LM} \overset{d}{\sim} \chi_k^2\)`); here, `\(k=5\)` ``` r # Regress squared residuals on quadratic of explanatory variables white_model = lm( I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income, data = test_df ) # Grab the R-squared white_r2 = summary(white_model)$r.squared # Calculate the White test statistic white_stat = 420 * white_r2 # Calculate the p-value pchisq(q = white_stat, df = 5, lower.tail = F) ``` ``` #> [1] 1.028039e-05 ``` --- Putting everything together... 
-- H.sub[0]: `\(\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0\)` -- *vs.* H.sub[A]: `\(\alpha_i \neq 0\)` for some `\(i \in \{1,\, 2,\,\ldots,\, 5\}\)` -- $$ `\begin{align} u_i^2 =& \alpha_0 + \alpha_1 \text{Ratio}_i + \alpha_2 \text{Income}_i \\ &+ \alpha_3 \text{Ratio}_i^2 + \alpha_4 \text{Income}_i^2 \\ &+ \alpha_5 \text{Ratio}_i \times \text{Income}_i + w_i \end{align}` $$ -- Our White test statistic: `\(\text{LM} = n \times R_e^2 \approx 420 \times 0.073 \approx 30.8\)` -- Under the `\(\chi_5^2\)` distribution, this `\(\widehat{\text{LM}}\)` has a *p*-value less than 0.001. -- ∴ We .hi[reject H.sub[0]] -- and conclude there is .hi[statistically significant evidence of heteroskedasticity] (at the 5-percent level). --- The White test statistic and its null distribution <img src="slides_files/figure-html/ex white plot-1.svg" style="display: block; margin: auto;" /> --- layout: true # Heteroskedasticity ## Review questions --- -- - **Q:** What is the definition of heteroskedasticity? - **Q:** Why are we concerned about heteroskedasticity? - **Q:** Does plotting `\(y\)` against `\(x\)`, tell us anything about heteroskedasticity? - **Q:** Does plotting `\(e\)` against `\(x\)`, tell us anything about heteroskedasticity? - **Q:** Since we cannot observe the `\(u_i\)`'s, what do we use to *learn about* heteroskedasticity? - **Q:** Which test do you recommend to test for heteroskedasticity? Why? --- count: false - **Q:** What is the definition of heteroskedasticity? -- - **A:** <br>.hi[Math:] `\(\mathop{\text{Var}} \left( u_i | X \right) \neq \mathop{\text{Var}} \left( u_j | X \right)\)` for some `\(i\neq j\)`. <br>.hi[Words:] There is a systematic relationship between the variance of `\(u_i\)` and our explanatory variables. --- count: false .grey-vlight[ - **Q:** What is the definition of heteroskedasticity? ] - **Q:** Why are we concerned about heteroskedasticity? -- - **A:** It biases our standard errors—wrecking our statistical tests and confidence intervals. Also: OLS is no longer the most efficient (best) linear unbiased estimator. --- count: false .grey-vlight[ - **Q:** What is the definition of heteroskedasticity? - **Q:** Why are we concerned about heteroskedasticity? ] - **Q:** Does plotting `\(y\)` against `\(x\)`, tell us anything about heteroskedasticity? -- - **A:** It's not exactly what we want, but since `\(y\)` is a function of `\(x\)` and `\(u\)`, it can still be informative. If `\(y\)` becomes more/less disperse as `\(x\)` changes, we likely have heteroskedasticity. --- count: false .grey-vlight[ - **Q:** What is the definition of heteroskedasticity? - **Q:** Why are we concerned about heteroskedasticity? - **Q:** Does plotting `\(y\)` against `\(x\)`, tell us anything about heteroskedasticity? ] - **Q:** Does plotting `\(e\)` against `\(x\)`, tell us anything about heteroskedasticity? -- - **A:** Yes. The spread of `\(e\)` depicts its variance—and tells us something about the variance of `\(u\)`. Trends in this variance, along `\(x\)`, suggest heteroskedasticity. --- count: false .grey-vlight[ - **Q:** What is the definition of heteroskedasticity? - **Q:** Why are we concerned about heteroskedasticity? - **Q:** Does plotting `\(y\)` against `\(x\)`, tell us anything about heteroskedasticity? - **Q:** Does plotting `\(e\)` against `\(x\)`, tell us anything about heteroskedasticity? ] - **Q:** Since we cannot observe the `\(u_i\)`'s, what do we use to *learn about* heteroskedasticity? -- - **A:** We use the `\(e_i\)`'s to predict/learn about the `\(u_i\)`'s. 
This trick is key for almost everything we do with heteroskedasticity testing/correction.

---
count: false

.grey-vlight[

- **Q:** What is the definition of heteroskedasticity?

- **Q:** Why are we concerned about heteroskedasticity?

- **Q:** Does plotting `\(y\)` against `\(x\)` tell us anything about heteroskedasticity?

- **Q:** Does plotting `\(e\)` against `\(x\)` tell us anything about heteroskedasticity?

- **Q:** Since we cannot observe the `\(u_i\)`'s, what do we use to *learn about* heteroskedasticity?

]

- **Q:** Which test do you recommend to test for heteroskedasticity? Why?

--

- **A:** I like White. Fewer assumptions. Fewer issues.

---
layout: false
class: clear, middle

*Next time:* Living/working with heteroskedasticity.

---
layout: false
class: inverse, middle
# Appendix

---
class: clear, middle

One more test...

---
layout: true
# Testing for heteroskedasticity
## The Breusch-Pagan test

---

Breusch and Pagan (1981) attempted to solve this issue of being too specific with the functional form of the heteroskedasticity.

- Allows the data to show if/how the variance of `\(u_i\)` correlates with `\(X\)`.
- If `\(\sigma_i^2\)` correlates with `\(X\)`, then we have heteroskedasticity.
- Regresses `\(e_i^2\)` on `\(X = \left[ 1,\, x_1,\, x_2,\, \ldots,\, x_k \right]\)` and tests for joint significance.

---

How to implement:

.pseudocode-small[

1\. Regress y on an intercept, x.sub[1], x.sub[2], …, x.sub[k].

2\. Record residuals e.

3\. Regress e.super[2] on an intercept, x.sub[1], x.sub[2], …, x.sub[k].

$$ e\_i^2 = \alpha\_0 + \alpha\_1 x\_{1i} + \alpha\_2 x\_{2i} + \cdots + \alpha\_k x\_{ki} + v\_i $$

4\. Record R.super[2].

5\. Test hypothesis H.sub[0]: `\(\alpha_1 = \alpha_2 = \cdots = \alpha_k = 0\)`

]

---

The B-P test statistic<sup>†</sup> is

$$ \text{LM} = n \times R^2_{e} $$

where `\(R^2_e\)` is the `\(R^2\)` from the regression

$$ e\_i^2 = \alpha\_0 + \alpha\_1 x\_{1i} + \alpha\_2 x\_{2i} + \cdots + \alpha\_k x\_{ki} + v\_i $$

Under the null, `\(\text{LM}\)` is asymptotically distributed as `\(\chi^2_k\)`.

--

This test statistic tests H.sub[0]: `\(\alpha_1 = \alpha_2 = \cdots = \alpha_k = 0\)`.

Rejecting the null hypothesis implies evidence of heteroskedasticity.

.footnote[
[†]: This specific form of the test statistic actually comes from Koenker (1981).
]

---
layout: true
# Testing for heteroskedasticity
## The Breusch-Pagan test

---

**Problem:** We're still assuming a fairly restrictive .hi[functional form] between our explanatory variables `\(X\)` and the variances of our disturbances `\(\sigma^2_i\)`.

--

**Result:** B-P *may* still miss fairly simple forms of heteroskedasticity.

---

Breusch-Pagan tests are still .hi[sensitive to functional form].

<img src="slides_files/figure-html/bp1-1.svg" style="display: block; margin: auto;" />

$$
`\begin{aligned}
  e_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} & \widehat{\text{LM}} &= 1.26 &\mathit{p}\text{-value} \approx 0.261 \\
  e_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} \color{#e64173}{+ \hat{\alpha}_2 x^2_{1i}} & \widehat{\text{LM}} &= 185.8 &\mathit{p}\text{-value} < 0.001
\end{aligned}`
$$

---
layout: true
# Testing for heteroskedasticity
## Example: Breusch-Pagan

---

Let's test the same model with the Breusch-Pagan test.
*Recall:* We saved our residuals as `e` in our dataset, _i.e._, ``` r test_df$e = residuals(est_model) ``` --- In B-P, we first regress `\(e_i^2\)` on the explanatory variables, ``` r # Regress squared residuals on explanatory variables bp_model = lm(I(e^2) ~ ratio + income, data = test_df) ``` --- count: false and use the resulting `\(R^2\)` to calculate a test statistic. ``` r # Regress squared residuals on explanatory variables bp_model = lm(I(e^2) ~ ratio + income, data = test_df) # Grab the R-squared (bp_r2 = summary(bp_model)$r.squared) ``` ``` #> [1] 3.23205e-05 ``` --- The Breusch-Pagan test statistic is `\(\text{LM} = n \times R^2_e\)` -- `\(\approx 420 \times 0.0000323\)` -- `\(\approx 0.0136\)` which we test against a `\(\chi_k^2\)` distribution (here: `\(k=2\)`).<sup>†</sup> -- ``` r # B-P test statistic bp_stat = 420 * bp_r2 # Calculate the p-value pchisq(q = bp_stat, df = 2, lower.tail = F) ``` ``` #> [1] 0.9932357 ``` .footnote[ [†]: `\(k\)` is the number of explanatory variables (excluding the intercept). ] --- H.sub[0]: `\(\alpha_1 = \alpha_2 = 0\)` *vs.* H.sub[A]: `\(\alpha_1 \neq 0\)` and/or `\(\alpha_2 \neq 0\)` for the model `\(u_i^2 = \alpha_0 + \alpha_1 \text{Ratio}_i + \alpha_2 \text{Income}_i + w_i\)` -- Breusch-Pagan test statistic: `\(\widehat{\text{LM}} \approx 0.014\)` -- *p*-value `\(\approx 0.993\)` -- ∴ Fail to reject H.sub[0] (the *p*-value is greater than 0.05) -- **Conclusion:** We do not find statistically significant evidence of heteroskedasticity at the 5-percent level. -- (We find no evidence of a *linear* relationship between `\(u_i^2\)` and the explanatory variables.) --- The Breusch-Pagan test statistic and its null distribution <img src="slides_files/figure-html/ex bp plot-1.svg" style="display: block; margin: auto;" />
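For completeness, a canned version of the Breusch-Pagan test. This is a minimal sketch that assumes the `lmtest` package is installed (it is not used elsewhere in these slides). With its defaults, `bptest()` regresses the squared residuals on the model's own regressors and reports the Koenker (1981) studentized statistic, i.e., the same `\(n \times R^2_e\)` calculation we just did by hand.

``` r
# Canned Breusch-Pagan test (sketch; assumes the lmtest package is installed)
# Defaults: variance regressors = the model's regressors; studentized (Koenker) n*R² form
p_load(lmtest)
bptest(est_model)
```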