class: center, middle, inverse, title-slide # Heteroskedasticity ## EC 320: Introduction to Econometrics ### Philip Economides ### Winter 2022 --- class: inverse, middle # Prologue --- # Housekeeping <br> .hi-pink[Final Exam] Review lecture this Wednesday. - Come prepared with questions. **Exam:** Wednesday, March 16 at 14:45pm (here). Office hours Tuesday, March 14th at 12:00pm. .hi-pink[Problem Set 5] Due today by 11:59pm. --- layout: true # Heteroskedasticity --- class: inverse, middle --- Let's revisit assumption \#4: > 4\. __Homoskedasticity:__ The disurbances have .hi[constant variance] `\(\sigma^2\)`, > - `\(\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X \right] = \mathop{\text{Var}} \left( u_i \middle| X \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2\)` -- Specifically, we will focus on the assumption of .hi[constant variance] (also known as *homoskedasticity*). -- **Violation of this assumption:** .hi[Heteroskedasticity:] `\(\mathop{\text{Var}} \left( u_i \right) = \sigma^2_i\)` and `\(\sigma^2_i \neq \sigma^2_j\)` for some `\(i\neq j\)`. -- In other words: Our disturbances have different variances. --- An easy way to spot this: plot your residuals against your covariates! For example, variance of `\(u_i\)` decreases with `\(X_i\)`, non-constant. <img src="16-Heteroskedasticity_files/figure-html/het ex1-1.svg" style="display: block; margin: auto;" /> --- Another example of heteroskedasticity: residuals increase in variation as X deviates further from the mean. Variance of `\(u_i\)` increasing at the extremes of `\(X_i\)`. <img src="16-Heteroskedasticity_files/figure-html/het ex2 -1.svg" style="display: block; margin: auto;" /> --- Another example of heteroskedasticity: Differing variances of `\(u_i\)` by group <img src="16-Heteroskedasticity_files/figure-html/het ex3 -1.svg" style="display: block; margin: auto;" /> --- .hi-pink[Heteroskedasticity] is present when the variance of `\(u_i\)` changes with any combination of our explanatory variables `\(X_1\)`, through `\(X_j\)`. -- (Very common in practice) -- .hi-blue[Why we care:] Heteroskedasticity shows us how small violations of our assumptions can affect OLS's performance. -- * If we do not account for heteroskedasticity, our standard error are mispecified --- ## Consequences So what are the consquences of heteroskedasticity? Bias? Inefficiency? First, let's check if it has consquences for the the unbiasedness of OLS. -- **Recall<sub>1</sub>:** OLS being unbiased means `\(\mathop{\boldsymbol{E}}\left[ \hat{\beta}_j \middle| X \right] = \beta_j\)` for all `\(j\)` covariates. -- **Recall<sub>2</sub>:** We previously showed `\(\hat{\beta}_1 = \dfrac{\sum_i\left(y_i-\overline{y}\right)\left(x_{1i}-\overline{x}_1\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2}\)` -- It will actually help us to rewrite this estimator as `$$\hat{\beta}_1 = \beta_1 + \dfrac{\sum_{i} \left( x_{1i} - \overline{x}_1 \right) u_i}{\sum_i \left( x_{1i} - \overline{x}_1 \right)^2}$$` --- **Proof:** Assuming `\(y_i = \beta_0 + \beta_1 x_i + u_i\)` $$ `\begin{aligned} \hat{\beta}_1 &= \dfrac{\sum_i\left(y_i-\overline{y}\right)\left(x_{1i}-\overline{x}_1\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \dfrac{\sum_i\left(\left[ \beta_0 + \beta_1 x_{1i} + u_i \right]- \left[ \beta_0 + \beta_1 \overline{x}_1 + \overline{u} \right] \right)\left(x_{1i}-\overline{x}_1\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \dfrac{\sum_i\left(\beta_1 \left[ x_{1i} - \overline{x}_1 \right] + \left[u_i - \overline{u}\right] \right)\left(x_{1i}-\overline{x}_1\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \dfrac{\sum_i\left(\beta_1 \left[ x_{1i} - \overline{x}_1 \right]^2 + \left[ x_{1i} - \overline{x}_1 \right] \left[u_i - \overline{u}\right]\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_{1i} - \overline{x}_1\right) \left(u_i - \overline{u}\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \end{aligned}` $$ --- $$ `\begin{aligned} \hat{\beta}_1 &= \cdots = \beta_1 + \dfrac{\sum_i\left(x_{1i} - \overline{x}_1\right) \left(u_i - \overline{u}\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_{1i} - \overline{x}_1\right) u_i - \overline{u} \sum_i\left(x_{1i} - \overline{x}_1\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_{1i} - \overline{x}_1\right) u_i - \overline{u} \left(\sum_i x_{1i} - \sum_i \overline{x}_1\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_{1i} - \overline{x}_1\right) u_i - \overline{u} \left(\sum_i x_{1i} - n \overline{x}_1\right)}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_{1i} - \overline{x}_1\right) u_i - \overline{u} \color{#e64173}{\left(\sum_i x_{1i} - \sum_i x_{1i}\right)}}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_{1i} - \overline{x}_1\right) u_i}{\sum_i\left(x_{1i} -\overline{x}_1\right)^2} \quad \text{😅} \end{aligned}` $$ --- ## Consequences: Bias We now want to see if heteroskedasticity biases the OLS estimator for `\(\beta_1\)`. -- $$ `\begin{aligned} \mathop{\boldsymbol{E}}\left[ \hat{\beta}_1 \middle| X \right] &= \mathop{\boldsymbol{E}}\left[ \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i}{\sum_i\left(x_i -\overline{x}\right)^2} \middle| X \right] \\[0.5em] &= \beta_1 + \mathop{\boldsymbol{E}}\left[ \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i}{\sum_i\left(x_i -\overline{x}\right)^2} \middle| X \right] \\[0.5em] &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \color{#e64173}{\underbrace{\mathop{\boldsymbol{E}}\left[ u_i \middle| X \right]}_{=0}} \\[0.5em] &= \beta_1 \end{aligned}` $$ -- .hi[OLS is still unbiased] for the `\(\beta_k\)`. --- ## Consequences: Efficiency OLS's efficiency and inference do not survive heteroskedasticity. - In the presence of heteroskedasticity, OLS is .hi[no longer the most efficient] (best) linear unbiased estimator. -- - It would be more informative (efficient) to .hi[weight observations] inversely to their `\(u_i\)`'s variance. - Downweight high-variance `\(u_i\)`'s (too noisy to learn much). - Upweight observations with low-variance `\(u_i\)`'s (more 'trustworthy'). - Now you have the idea of weighted least squares (WLS) --- ## Consequences: Inference OLS .hi[standard errors are biased] in the presence of heteroskedasticity. - Wrong confidence intervals - Problems for hypothesis testing (both `\(t\)` and `\(F\)` tests) -- - It's hard to learn much without sound inference. --- ## Solutions 1. **Tests** to determine whether heteroskedasticity is present. 2. **Remedies** for (1) efficiency and (2) inference --- layout: true # Testing for heteroskedasticity --- class: inverse, middle --- While we *might* have solutions for heteroskedasticity, the efficiency of our estimators depends upon whether or not heteroskedasticity is present. 1. The **Goldfeld-Quandt test** 1. The **Breusch-Pagan test** 1. The **White test** -- Each of these tests centers on the fact that we can .hi[use the OLS residual] `\(\color{#e64173}{\hat u_i}\)` .hi[to estimate the population disturbance] `\(\color{#e64173}{u_i}\)`. --- layout: true # Testing for heteroskedasticity ## The Goldfeld-Quandt test --- Focuses on a specific type of heteroskedasticity: whether the variance of `\(u_i\)` differs .hi[between two groups].<sup>†</sup> Remember how we used our residuals to estimate the `\(\sigma^2\)`? $$ s^2 = \dfrac{\text{RSS}}{n-1} = \dfrac{\sum_i \hat u_i^2}{n-1} $$ We will use this same idea to determine whether there is evidence that our two groups differ in the variances of their disturbances, effectively comparing `\(s^2_1\)` and `\(s^2_2\)` from our two groups. .footnote[ [†]: The G-Q test was one of the early tests of heteroskedasticity (1965). ] --- Operationally, .pseudocode-small[ 1. Order your the observations by `\(x\)` 2. Split the data into two groups of size n.super[â‘] - G<sub>1</sub>: The first third - G<sub>2</sub>: The last third 3. Run separate regressions of `\(y\)` on `\(x\)` for G.sub[1] and G.sub[2] 4. Record RSS.sub[1] and RSS.sub[2] 5. Calculate the G-Q test statistic ] --- The G-Q test statistic $$ F_{\left(n^{\star}-k,\, n^{\star}-k\right)} = \dfrac{\text{RSS}_2/(n^\star-k)}{\text{RSS}_1/(n^\star-k)} = \dfrac{\text{RSS}_2}{\text{RSS}_1} $$ follows an `\(F\)` distribution (under the null hypothesis) with `\(n^{\star}-k\)` and `\(n^{\star}-k\)` degrees of freedom.<sup>†</sup> -- **Notes** - The G-Q test requires the disturbances follow normal distributions. - The G-Q assumes a very specific type/form of heteroskedasticity. - Performs very well if we know the form of potentially heteroskedasticity. .footnote[ [†]: Goldfeld and Quandt suggested `\(n^{\star}\)` of `\((3/8)n\)`. `\(k\)` gives number of estimated parameters (_i.e._, `\(\hat{\beta}_j\)`'s). ] --- <img src="16-Heteroskedasticity_files/figure-html/gq1a-1.svg" style="display: block; margin: auto;" /> --- <img src="16-Heteroskedasticity_files/figure-html/gq1b-1.svg" style="display: block; margin: auto;" /> `\(F_{375,\,375} = \dfrac{\color{#e64173}{\text{RSS}_2 = 18,203.4}}{\color{#314f4f}{\text{RSS}_1 = 1,039.5}} \approx 17.5 \implies\)` *p*-value `\(< 0.001\)` `\(\therefore\)` We reject H.sub[0]: `\(\sigma^2_1 = \sigma^2_2\)` and conclude there is statistically significant evidence of heteroskedasticity. --- <img src="16-Heteroskedasticity_files/figure-html/gq2-1.svg" style="display: block; margin: auto;" /> `\(F_{375,\,375} = \dfrac{\color{#e64173}{\text{RSS}_2 = 14,516.8}}{\color{#314f4f}{\text{RSS}_1 = 14,937.1}} \approx 1 \implies\)` *p*-value `\(\approx 0.609\)` `\(\therefore\)` We fail to reject H.sub[0]: `\(\sigma^2_1 = \sigma^2_2\)` while heteroskedasticity is present. --- layout: true # Testing for heteroskedasticity ## The Breusch-Pagan test --- Breusch and Pagan (1981) attempted to solve this issue of being too specific with the functional form of the heteroskedasticity. - Allows the data to show if/how the variance of `\(u_i\)` correlates with `\(X\)`. - If `\(\sigma_i^2\)` correlates with `\(X\)`, then we have heteroskedasticity. - Regresses `\(\hat u_i^2\)` on `\(X = \left[ 1,\, x_1,\, x_2,\, \ldots,\, x_k \right]\)` and tests for joint significance. --- How to implement: .pseudocode-small[ 1\. Regress y on an intercept, x.sub[1], x.sub[2], …, x.sub[k]. 2\. Record residuals `\(\hat u_i\)`. 3\. Regress `\(\hat u_i\)`.super[2] on an intercept, x.sub[1], x.sub[2], …, x.sub[k]. `$$\hat u_i^2 = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + \cdots + \alpha_k x_{ki} + v_i$$` 4\. Record R.super[2]. 5\. Test hypothesis H.sub[0]: `\(\alpha_1 = \alpha_2 = \cdots = \alpha_k = 0\)` ] --- The B-P test statistic<sup>†</sup> is $$ \text{LM} = n \times R^2_{u} $$ where `\(R^2_u\)` is the `\(R^2\)` from the regression $$ \hat u\_i^2 = \alpha\_0 + \alpha\_1 x\_{1i} + \alpha\_2 x\_{2i} + \cdots + \alpha\_k x\_{ki} + v\_i $$ Under the null, `\(\text{LM}\)` is asymptotically distributed as `\(\chi^2_k\)`. -- This test statistic tests H.sub[0]: `\(\alpha_1 = \alpha_2 = \cdots = \alpha_k = 0\)`. Rejecting the null hypothesis implies evidence of heteroskedasticity. .footnote[ [†]: This specific form of the test statistic actually comes form Koenker (1981). ] --- layout: true # Testing for heteroskedasticity ## The `\(\chi^2\)` distribution --- We just mentioned that under the null, the B-P test statistic is distributed as a `\(\chi^2\)` random variable with `\(k\)` degrees of freedom. The `\(\chi^2\)` distribution is just another example of a common (named) distribution (like the Normal distribution, the `\(t\)` distribution, and the `\(F\)`). --- Three examples of `\(\chi_k^2\)`: `\(\color{#314f4f}{k = 1}\)`, `\(\color{#e64173}{k = 2}\)`, and `\(\color{orange}{k = 9}\)` <img src="16-Heteroskedasticity_files/figure-html/chisq1-1.svg" style="display: block; margin: auto;" /> --- Probability of observing a more extreme test statistic `\(\widehat{\text{LM}}\)` under H.sub[0] <img src="16-Heteroskedasticity_files/figure-html/chisq2-1.svg" style="display: block; margin: auto;" /> --- layout: true # Testing for heteroskedasticity ## The Breusch-Pagan test --- **Problem:** We're still assuming a fairly restrictive .hi[functional form] between our explanatory variables `\(X\)` and the variances of our disturbances `\(\sigma^2_i\)`. -- **Result:** B-P *may* still miss fairly simple forms of heteroskedasticity. --- Breusch-Pagan tests are still .hi[sensitive to functional form]. <img src="16-Heteroskedasticity_files/figure-html/bp1-1.svg" style="display: block; margin: auto;" /> $$ `\begin{aligned} \hat u_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} & \widehat{\text{LM}} &= 1.26 &\mathit{p}\text{-value} \approx 0.261 \\ \hat u_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} \color{#e64173}{+ \hat{\alpha}_2 x^2_{1i}} & \widehat{\text{LM}} &= 185.8 &\mathit{p}\text{-value} < 0.001 \end{aligned}` $$ --- layout: true # Testing for heteroskedasticity ## The White test --- So far we've been testing for specific relationships between our explanatory variables and the variances of the disturbances, _e.g._, - H.sub[0]: `\(\sigma_1^2 = \sigma_2^2\)` for two groups based upon `\(x_j\)` (**G-Q**) - H.sub[0]: `\(\alpha_1 = \cdots = \alpha_k = 0\)` from `\(\hat u_i^2 = \alpha_0 + \alpha_1 x_{1i} + \cdots + \alpha_k x_{ki} + v_i\)` (**B-P**) -- However, we actually want to know if $$ \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 $$ **Q:** Can't we just test this hypothesis? -- **A:** Sort of. --- Toward this goal, Hal White took advantage of the fact that we can .hi[replace the homoskedasticity requirement with a weaker assumption]: - **Old:** `\(\mathop{\text{Var}} \left( u_i \middle| X \right) = \sigma^2\)` - **New:** `\(u^2\)` is *uncorrelated* with the explanatory variables (_i.e._, `\(x_j\)` for all `\(j\)`), their squares (_i.e._, `\(x_j^2\)`), and the first-degree interactions (_i.e._, `\(x_j x_h\)`). -- This new assumption is easier to explicitly test (*hint:* regression). --- An outline of White's test for heteroskedasticity: .pseudocode-small[ 1\. Regress y on x.sub[1], x.sub[2], …, x.sub[k]. Save residuals e. 2\. Regress squared residuals on all explanatory variables, their squares, and interactions. $$ \hat u^2 = \alpha\_0 + \sum\_{h=1}^k \alpha\_h x\_h + \sum\_{j=1}^k \alpha\_{k+j} x\_j^2 + \sum\_{\ell = 1}^{k-1} \sum\_{m = \ell + 1}^k \alpha\_{\ell,m} x\_\ell x\_m + v\_i $$ 3\. Record R.sub[u].super[2]. 4\. Calculate test statistic to test H.sub[0]: `\(\alpha_p = 0\)` for all `\(p\neq0\)`. ] --- Just as with the Breusch-Pagan test, White's test statistic is $$ \text{LM} = n \times R_u^2 \qquad \text{Under H}_0,\, \text{LM} \overset{\text{d}}{\sim} \chi_k^2 $$ but now the `\(R^2_u\)` comes from the regression of `\(\hat u_i^2\)` on the explanatory variables, their squares, and their interactions. $$ \hat u^2 = \alpha\_0 + \underbrace{\sum\_{h=1}^k \alpha\_h x\_h}\_{\text{Expl. variables}} + \underbrace{\sum\_{j=1}^k \alpha\_{k+j} x\_j^2}\_{\text{Squared terms}} + \underbrace{\sum\_{\ell = 1}^{k-1} \sum\_{m = \ell + 1}^k \alpha\_{\ell,m} x\_\ell x\_m}\_{\text{Interactions}} + v\_i $$ **Note:** The `\(k\)` (for our `\(\chi_k^2\)`) equals the number of estimated parameters in the regression above (the `\(\alpha_j\)`), excluding the intercept `\(\left( \alpha_0 \right)\)`. --- **Practical note:** If a variable is equal to its square (_e.g._, binary variables), then you don't (can't) include it. The same rule applies for interactions. --- *Example:* Consider the model.super[†] `\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u\)` **Step 1:** Estimate the model; obtain residuals `\((\hat u_i)\)`. **Step 2:** Regress `\(\hat u^2\)` on explanatory variables, squares, and interactions. $$ `\begin{aligned} \hat u^2 = &\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_1^2 + \alpha_5 x_2^2 + \alpha_6 x_3^2 \\ &+ \alpha_7 x_1 x_2 + \alpha_8 x_1 x_3 + \alpha_9 x_2 x_3 + v \end{aligned}` $$ Record the R.super[2] from this equation (call it `\(R_u^2\)`). **Step 3:** Test H.sub[0]: `\(\alpha_1 = \alpha_2 = \cdots = \alpha_9 = 0\)` using `\(\text{LM} = n R^2_u \overset{\text{d}}{\sim} \chi_9^2\)`. .footnote[ [†]: To simplify notation here, I'm dropping the `\(i\)` subscripts. ] --- <img src="16-Heteroskedasticity_files/figure-html/white1-1.svg" style="display: block; margin: auto;" /> We've already done the White test for this simple linear regression. $$ `\begin{aligned} \hat u_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} \color{#e64173}{+ \hat{\alpha}_2 x^2_{1i}} & \widehat{\text{LM}} &= 185.8 &\mathit{p}\text{-value} < 0.001 \end{aligned}` $$ --- layout: false class: inverse, middle # Testing for Heteroskedasticity ## Examples --- layout: true # Testing for heteroskedasticity ## Examples --- **Goal:** Estimate the relationship between standardized test scores (outcome variable) and (1) student-teacher ratio and (2) income, _i.e._, $$ \left(\text{Test score}\right)\_i = \beta\_0 + \beta\_1 \text{Ratio}\_{i} + \beta\_2 \text{Income}\_{i} + u\_i \tag{1} $$ **Potential issue:** Heteroskedasticity... and we do not observe `\(u_i\)`. **Solution:** 1. Estimate the relationship in `\((1)\)` using OLS 2. Use the residuals `\(\left( \hat u_i \right)\)` to test for heteroskedasticity - Goldfeld-Quandt - Breusch-Pagan - White --- We will use testing data from the dataset `Caschool` in the `Ecdat` .mono[R] package. ```r # Load packages library(pacman) p_load(tidyverse, Ecdat) # Select and rename desired variables; assign to new dataset test_df <- select(Caschool, test_score = testscr, ratio = str, income = avginc) # Format as tibble test_df <- as_tibble(test_df) # View first 2 rows of the dataset head(test_df, 2) ``` ``` #> # A tibble: 2 x 3 #> test_score ratio income #> <dbl> <dbl> <dbl> #> 1 691. 17.9 22.7 #> 2 661. 21.5 9.82 ``` --- Let's begin by estimating our model $$ \left(\text{Test score}\right)\_i = \beta\_0 + \beta\_1 \text{Ratio}\_{i} + \beta\_2 \text{Income}\_{i} + u\_i $$ ```r # Estimate the model est_model <- lm(test_score ~ ratio + income, data = test_df) # Summary of the estimate tidy(est_model) ``` ``` #> # A tibble: 3 x 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 639. 7.45 85.7 5.70e-267 #> 2 ratio -0.649 0.354 -1.83 6.79e- 2 #> 3 income 1.84 0.0928 19.8 4.38e- 62 ``` --- Now, let's see what the residuals suggest about heteroskedasticity ```r # Add the residuals to our dataset test_df$e <- residuals(est_model) ``` <img src="16-Heteroskedasticity_files/figure-html/gq3-1.svg" style="display: block; margin: auto;" /> --- layout: true # Testing for heteroskedasticity ## Example: Goldfeld-Quandt --- Income looks potentially heteroskedastic; let's test via Goldfeld-Quandt. ```r # Arrange the data by income test_df <- arrange(test_df, income) ``` --- count: false Income looks potentially heteroskedastic; let's test via Goldfeld-Quandt. ```r # Arrange the data by income test_df <- arrange(test_df, income) # Re-estimate the model for the last and first 158 observations est_model1 <- lm(test_score ~ ratio + income, data = tail(test_df, 158)) est_model2 <- lm(test_score ~ ratio + income, data = head(test_df, 158)) ``` --- count: false Income looks potentially heteroskedastic; let's test via Goldfeld-Quandt. ```r # Arrange the data by income test_df <- arrange(test_df, income) # Re-estimate the model for the last and first 158 observations est_model1 <- lm(test_score ~ ratio + income, data = tail(test_df, 158)) est_model2 <- lm(test_score ~ ratio + income, data = head(test_df, 158)) # Grab the residuals from each regression e_model1 <- residuals(est_model1) e_model2 <- residuals(est_model2) ``` --- count: false Income looks potentially heteroskedastic; let's test via Goldfeld-Quandt. ```r # Arrange the data by income test_df <- arrange(test_df, income) # Re-estimate the model for the last and first 158 observations est_model1 <- lm(test_score ~ ratio + income, data = tail(test_df, 158)) est_model2 <- lm(test_score ~ ratio + income, data = head(test_df, 158)) # Grab the residuals from each regression e_model1 <- residuals(est_model1) e_model2 <- residuals(est_model2) # Calculate SSE for each regression (sse_model1 <- sum(e_model1^2)) ``` ``` #> [1] 19305.01 ``` ```r (sse_model2 <- sum(e_model2^2)) ``` ``` #> [1] 29537.83 ``` --- Remember the Goldfeld-Quandt test statistic? `\(F_{n^\star-k,\,n^\star-k} = \dfrac{\text{RSS}_2}{\text{RSS}_1}\)` -- `\(\approx\dfrac{29,537.83}{19,305.01}\)` -- `\(\approx1.53\)` --  Test via `\(F_{158-3,\,158-3}\)` -- ```r # G-Q test statistic (f_gq <- sse_model2/sse_model1) ``` ``` #> [1] 1.530061 ``` -- ```r # p-value pf(q = f_gq, df1 = 158-3, df2 = 158-3, lower.tail = F) ``` ``` #> [1] 0.004226666 ``` --- The Goldfeld-Quandt test statistic and its null distribution <img src="16-Heteroskedasticity_files/figure-html/ex gq10-1.svg" style="display: block; margin: auto;" /> --- Putting it all together: H.sub[0]: `\(\sigma_1^2 = \sigma_2^2\)` *vs.* H.sub[A]: `\(\sigma_1^2 \neq \sigma_2^2\)` Goldfeld-Quandt test statistic: `\(F \approx 1.53\)` *p*-value `\(\approx 0.00423\)` ∴ Reject H.sub[0] (*p*-value is less than 0.05). **Conclusion:** There is statistically significant evidence that `\(\sigma_1^2 \neq \sigma_2^2\)`. Therefore, we find statistically significant evidence of heteroskedasticity (at the 5-percent level). --- What if we had chosen to focus on student-to-teacher ratio? -- ```r # Arrange the data by ratio test_df <- arrange(test_df, ratio) # Re-estimate the model for the last and first 158 observations est_model3 <- lm(test_score ~ ratio + income, data = tail(test_df, 158)) est_model4 <- lm(test_score ~ ratio + income, data = head(test_df, 158)) # Grab the residuals from each regression e_model3 <- residuals(est_model3) e_model4 <- residuals(est_model4) # Calculate SSE for each regression (sse_model3 <- sum(e_model3^2)) ``` ``` #> [1] 26243.52 ``` ```r (sse_model4 <- sum(e_model4^2)) ``` ``` #> [1] 29101.52 ``` --- `\(F_{n^\star-k,\,n^\star-k} = \dfrac{\text{RSS}_4}{\text{RSS}_3} \approx\dfrac{29,101.52}{26,243.52} \approx1.11\)` which has a *p*-value of approximately 0.2603. -- ∴ We would have failed to reject H.sub[0], concluding that we failed to find statistically significant evidence of heteroskedasticity. -- **Lesson:** Understand the limitations of estimators, tests, *etc.* --- layout: true # Testing for heteroskedasticity ## Example: Breusch-Pagan --- Let's test the same model with the Breusch Pagan. *Recall:* We saved our residuals as `e` in our dataset, _i.e._, ```r test_df$e <- residuals(est_model) ``` --- In B-P, we first regress `\(e_i^2\)` on the explanatory variables, ```r # Regress squared residuals on explanatory variables bp_model <- lm(I(e^2) ~ ratio + income, data = test_df) ``` --- count: false and use the resulting `\(R^2\)` to calculate a test statistic. ```r # Regress squared residuals on explanatory variables bp_model <- lm(I(e^2) ~ ratio + income, data = test_df) # Grab the R-squared (bp_r2 <- summary(bp_model)$r.squared) ``` ``` #> [1] 3.23205e-05 ``` --- The Breusch-Pagan test statistic is `\(\text{LM} = n \times R^2_u\)` -- `\(\approx 420 \times 0.0000323\)` -- `\(\approx 0.0136\)` which we test against a `\(\chi_k^2\)` distribution (here: `\(k=2\)`).<sup>†</sup> -- ```r # B-P test statistic bp_stat <- 420 * bp_r2 # Calculate the p-value pchisq(q = bp_stat, df = 2, lower.tail = F) ``` ``` #> [1] 0.9932357 ``` .footnote[ [†]: `\(k\)` is the number of explanatory variables (excluding the intercept). ] --- H.sub[0]: `\(\alpha_1 = \alpha_2 = 0\)` *vs.* H.sub[A]: `\(\alpha_1 \neq 0\)` and/or `\(\alpha_2 \neq 0\)` for the model `\(u_i^2 = \alpha_0 + \alpha_1 \text{Ratio}_i + \alpha_2 \text{Income}_i + w_i\)` -- Breusch-Pagan test statistic: `\(\widehat{\text{LM}} \approx 0.014\)` -- *p*-value `\(\approx 0.993\)` -- ∴ Fail to reject H.sub[0] (the *p*-value is greater than 0.05) -- **Conclusion:** We do not find statistically significant evidence of heteroskedasticity at the 5-percent level. -- (We find no evidence of a *linear* relationship between `\(u_i^2\)` and the explanatory variables.) --- The Breusch-Pagan test statistic and its null distribution <img src="16-Heteroskedasticity_files/figure-html/ex bp plot-1.svg" style="display: block; margin: auto;" /> --- layout: true # Heteroskedasticity ## Example: White --- The .pink[**White test** adds squared terms and interactions] to the .slate[**B-P test**]. $$ `\begin{align} \color{#314f4f}{u_{i}^2} =& \color{#314f4f}{\alpha_{0} + \alpha_{1} \text{Ratio}_{i} + \alpha_{2} \text{Income}_{i} } \\ &+ \color{#e64173}{\alpha_{3} \text{Ratio}_{i}^2 + \alpha_{4} \text{Income}_{i}^2 + \alpha_{5} \text{Ratio}_{i}\times\text{Income}_{i}} \\ &+ \color{#314f4f}{w_{i}} \end{align}` $$ which moves the null hypothesis from <br>.slate[H.sub[0]:] `\(\color{#314f4f}{\alpha_1 = \alpha_2 = 0}\)` to <br>.pink[H.sub[0]:] `\(\color{#e64173}{\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0}\)` -- So we just need to update our .mono[R] code, and we're set. --- *Aside:* .mono[R] has funky notation for squared terms and interactions in `lm()`: - **Squared terms** use `I()`, _e.g._, `lm(y ~ I(x^2))` - **Interactions** use `:` between the variables, _e.g._, `lm(y ~ x1:x2)` *Example:* Regress `y` on quadratic of `x1` and `x2`: ```r # Pretend quadratic regression w/ interactions lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = pretend_df) ``` --- **Step 1:** Regress `\(\hat u_i^2\)` on 1.super[st] degree, 2.super[nd] degree, and interactions ```r # Regress squared residuals on quadratic of explanatory variables white_model <- lm( I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income, data = test_df ) # Grab the R-squared (white_r2 <- summary(white_model)$r.squared) ``` --- count: false **Step 2:** Collect `\(R_u^2\)` from the regression. ```r # Regress squared residuals on quadratic of explanatory variables white_model <- lm( I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income, data = test_df ) # Grab the R-squared (white_r2 <- summary(white_model)$r.squared) ``` ``` #> [1] 0.07332222 ``` --- count: false **Step 3:** Calculate White test statistic `\(\text{LM} = n \times R_u^2 \approx 420 \times 0.073\)` ```r # Regress squared residuals on quadratic of explanatory variables white_model <- lm( I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income, data = test_df ) # Grab the R-squared white_r2 <- summary(white_model)$r.squared # Calculate the White test statistic (white_stat <- 420 * white_r2) ``` ``` #> [1] 30.79533 ``` --- count: false **Step 4:** Calculate the associated *p*-value (where `\(\text{LM} \overset{d}{\sim} \chi_k^2\)`); here, `\(k=5\)` ```r # Regress squared residuals on quadratic of explanatory variables white_model <- lm( I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income, data = test_df ) # Grab the R-squared white_r2 <- summary(white_model)$r.squared # Calculate the White test statistic white_stat <- 420 * white_r2 # Calculate the p-value pchisq(q = white_stat, df = 5, lower.tail = F) ``` ``` #> [1] 1.028039e-05 ``` --- Putting everything together... -- H.sub[0]: `\(\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0\)` -- *vs.* H.sub[A]: `\(\alpha_i \neq 0\)` for some `\(i \in \{1,\, 2,\,\ldots,\, 5\}\)` -- $$ `\begin{align} u_i^2 =& \alpha_0 + \alpha_1 \text{Ratio}_i + \alpha_2 \text{Income}_i \\ &+ \alpha_3 \text{Ratio}_i^2 + \alpha_4 \text{Income}_i^2 \\ &+ \alpha_5 \text{Ratio}_i \times \text{Income}_i + w_i \end{align}` $$ -- Our White test statistic: `\(\text{LM} = n \times R_e^2 \approx 420 \times 0.073 \approx 30.8\)` -- Under the `\(\chi_5^2\)` distribution, this `\(\widehat{\text{LM}}\)` has a *p*-value less than 0.001. -- ∴ We .hi[reject H.sub[0]] -- and conclude there is .hi[statistically significant evidence of heteroskedasticity] (at the 5-percent level). --- The White test statistic and its null distribution <img src="16-Heteroskedasticity_files/figure-html/ex white plot-1.svg" style="display: block; margin: auto;" /> --- layout: true # Heteroskedasticity ## Review questions --- -- - **Q:** What is the definition of heteroskedasticity? - **Q:** Why are we concerned about heteroskedasticity? - **Q:** Does plotting `\(y\)` against `\(x\)`, tell us anything about heteroskedasticity? - **Q:** Does plotting `\(\hat u_i\)` against `\(x\)`, tell us anything about heteroskedasticity? - **Q:** Since we cannot observe the `\(u_i\)`'s, what do we use to *learn about* heteroskedasticity? - **Q:** Which test do you recommend to test for heteroskedasticity? Why? --- layout: true # Heteroskedasticity ## Remedies --- 1. Ensure your specification doesn't cause heteroskedasticity 1. Increase efficiency by weighting our observations 1. Ignore OLS's inefficiency, focus on unbiased estimates for our standard errors which leads to correct inference .hi-pink[For details], see Chapter 7 of Dougherty. Future metrics classes such as EC421 will go into the weeds on this matter. --- exclude: true