class: center, middle, inverse, title-slide

.title[
# Heteroskedasticity, Part II
]
.subtitle[
## EC 421, Set 05
]
.author[
### Edward Rubin
]

---
class: inverse, middle

# Prologue

---
# Schedule

## Last Time

Heteroskedasticity: Issues and tests

## Today

- .hi[First assignment] due today
- Living with heteroskedasticity

## Upcoming

- .hi[Second assignment] released soon (next week?)

---
# EC 421

## Goals

- Develop .hi[intuition] for econometrics.
- Learn how to .hi[*apply*] econometrics—strengths, weaknesses, *etc.*
- Learn .hi[.mono[R]].

--

.mono[R] does the calculations and has already memorized the formulas. I want you to know what the formulas mean, when/why we use them, and when they fail/work.

--

This course has the potential to be one of the most useful/valuable/applicable/marketable classes that you take at UO.

---
layout: true
# Heteroskedasticity
## Review

---
class: inverse, middle

---

Three review questions

**Question 1:** What is the difference between `\(u_i\)` and `\(e_i\)`?

**Question 2:** We spend *a lot* of time discussing `\(u_i^2\)`. Why?

**Question 3:** We also spend *a lot* of time discussing `\(e_i^2\)`. Why?

---

**Question 1:** What is the difference between `\(u_i\)` and `\(e_i\)`?

**Answer 1:**

--

<br> `\(\color{#e64173}{u_i}\)` gives the .hi[population disturbance] for the `\(i\)`.super[th] observation.

--

`\(u_i\)` measures how far the `\(i\)`.super[th] observation is from the .hi[population] line, _i.e._,

$$ u\_i = y\_i - \color{#e64173}{\underbrace{\left(\beta\_0 + \beta\_1 x\_i\right)}\_{\text{Population line}}} $$

--

`\(\color{#6A5ACD}{e_i}\)` gives the .hi-purple[regression residual (error)] for the `\(i\)`.super[th] observation.

--

`\(e_i\)` measures how far the `\(i\)`.super[th] observation is from the .hi-purple[sample regression] line, _i.e._,

$$ e\_i = y\_i - \color{#6A5ACD}{\underbrace{\left(\hat{\beta}\_0 + \hat{\beta}\_1 x\_i\right)}\_{\text{Sample reg. line} = \hat{y}}} = y\_i - \color{#6A5ACD}{\hat{y}\_i} $$

---

**Question 2:** We spend *a lot* of time discussing `\(u_i^2\)`. Why?

**Answer 2:** One of our major assumptions is that our disturbances (the `\(u_i\)`'s) are homoskedastic (they have constant variance), _i.e._, `\(\mathop{\text{Var}} \left( u_i \middle| x_i \right) = \sigma^2\)`. We also assume that the mean of these disturbances is zero, `\(\color{#e64173}{\mathop{\boldsymbol{E}}\left[ u_i \middle| x_i \right] = 0}\)`.

By definition, `\(\mathop{\text{Var}} \left( u_i \middle| x_i \right) = \mathop{\boldsymbol{E}}\Big[ u_i^2 - \underbrace{\color{#e64173}{\mathop{\boldsymbol{E}}\left[ u_i \middle| x_i \right]}^2}_{\color{#e64173}{=0}} \Big| x_i \Big] = \mathop{\boldsymbol{E}}\left[ u_i^2 \middle| x_i \right]\)`

Thus, if we want to learn about the variance of `\(u_i\)`, we can focus on `\(u_i^2\)`.

---

**Question 3:** We also spend *a lot* of time discussing `\(e_i^2\)`. Why?

**Answer 3:** We cannot observe `\(u_i\)` (or `\(u_i^2\)`). But `\(u_i^2\)` tells us about the variance of `\(u_i\)`. We use `\(e_i^2\)` to learn about `\(u_i^2\)` and, consequently, `\(\sigma_i^2\)`.
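---

Because we can't observe the `\(u_i\)` in real data, a simulation helps fix ideas. Below is a minimal sketch (all names hypothetical): generate data from a known population line, estimate the sample regression line, and compare the residuals `\(e_i\)` to the true disturbances `\(u_i\)`.

``` r
# Load the tidyverse and set a seed for reproducibility
library(tidyverse)
set.seed(421)
# Simulate 100 observations from a known population line (b0 = 1, b1 = 2)
n <- 100
sim_df <- tibble(
  x = runif(n, 0, 10),
  u = rnorm(n, 0, 2),   # the disturbances: unobservable in real data
  y = 1 + 2 * x + u
)
# Estimate the sample regression line and grab the residuals
sim_reg <- lm(y ~ x, data = sim_df)
sim_df <- mutate(sim_df, e = residuals(sim_reg))
# The residuals e_i closely track the disturbances u_i
cor(sim_df$u, sim_df$e)
```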
---
layout: false
# Heteroskedasticity
## Review: Current assumptions

--

1. Our sample (the `\(x_k\)`'s and `\(y_i\)`) was .hi[randomly drawn] from the population.

--

2. `\(y\)` is a .hi[linear function] of the `\(\beta_k\)`'s and `\(u_i\)`.

--

3. There is no perfect .hi[multicollinearity] in our sample.

--

4. The explanatory variables are .hi[exogenous]: `\(\mathop{\boldsymbol{E}}\left[ u \middle| X \right] = 0 \left(\implies \mathop{\boldsymbol{E}}\left[ u \right] = 0\right)\)`.

--

5. The disturbances have .hi[constant variance] `\(\sigma^2\)` and .hi[zero covariance], _i.e._,
  - `\(\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X_i \right] = \mathop{\text{Var}} \left( u_i \middle| X_i \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2\)`
  - `\(\mathop{\text{Cov}} \left( u_i, \, u_j \middle| X_i,\,X_j \right) = \mathop{\boldsymbol{E}}\left[ u_i u_j \middle| X_i,\,X_j \right] = 0\)` for `\(i\neq j\)`

--

6. The disturbances come from a .hi[Normal] distribution, _i.e._, `\(u_i \overset{\text{iid}}{\sim} \mathop{\text{N}}\left( 0, \sigma^2 \right)\)`.

---
layout: true
# Heteroskedasticity
## Review

---

Today we're focusing on assumption \#5:

> 5\. The disturbances have .hi[constant variance] `\(\sigma^2\)` and .hi[zero covariance], _i.e._,
> - `\(\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X_i \right] = \mathop{\text{Var}} \left( u_i \middle| X_i \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2\)`
> - `\(\mathop{\text{Cov}} \left( u_i, \, u_j \middle| X_i,\,X_j \right) = \mathop{\boldsymbol{E}}\left[ u_i u_j \middle| X_i,\,X_j \right] = 0\)` for `\(i\neq j\)`

--

Specifically, we will focus on the assumption of .hi[constant variance] (also known as *homoskedasticity*).

--

**Violation of this assumption:** Our disturbances have different variances.

--

.hi[Heteroskedasticity:] `\(\mathop{\text{Var}} \left( u_i \right) = \sigma^2_i\)` and `\(\sigma^2_i \neq \sigma^2_j\)` for some `\(i\neq j\)`.

---

Classic example of heteroskedasticity: The funnel

Variance of `\(u\)` increases with `\(x\)`

<img src="slides_files/figure-html/het ex1-1.svg" style="display: block; margin: auto;" />

---

Another example of heteroskedasticity: (double funnel?)

Variance of `\(u\)` increasing at the extremes of `\(x\)`

<img src="slides_files/figure-html/het ex2 -1.svg" style="display: block; margin: auto;" />

---

Another example of heteroskedasticity: Differing variances of `\(u\)` by group

<img src="slides_files/figure-html/het ex3 -1.svg" style="display: block; margin: auto;" />

---

.hi[Heteroskedasticity] is present when the variance of `\(u\)` changes with any combination of our explanatory variables `\(x_1\)` through `\(x_k\)`.

---
layout: false
# Testing for heteroskedasticity

We have some tests that may help us detect heteroskedasticity.

- Goldfeld-Quandt
- White
- (There are others, *e.g.*, Breusch-Pagan)

--

What do we do if we detect it?

---
layout: true
# Living with heteroskedasticity

---
class: inverse, middle

---

In the presence of heteroskedasticity, OLS is

- still .hi[unbiased]
- .hi[no longer the most efficient] unbiased linear estimator

On average, we get the right answer but with more noise (less precision). <br> *Also:* Our standard errors are biased.

--

**Options:**

1. Check regression .hi[specification].
2. Find a new, more efficient .hi[unbiased estimator] for the `\(\beta_j\)`'s.
3. Live with OLS's inefficiency; find a .hi[new variance estimator].
  - Standard errors
  - Confidence intervals
  - Hypothesis tests

---
layout: true
# Living with heteroskedasticity
## Misspecification

---

As we've discussed, the specification.pink[<sup>†</sup>] of your regression model matters a lot for the unbiasedness and efficiency of your estimator.

**Response \#1:** Ensure your specification doesn't cause heteroskedasticity.

.footnote[.pink[†] *Specification:* Functional form and included variables.]
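---

Before the algebra, a simulated illustration may help. A minimal sketch (all names hypothetical): when we omit a squared term, it gets absorbed into the error term, so the residuals vary systematically with `\(x\)`.

``` r
# Load the tidyverse and set a seed
library(tidyverse)
set.seed(421)
# Simulate a quadratic population relationship with homoskedastic u
n <- 200
mis_df <- tibble(
  x = runif(n, 0, 5),
  y = 1 + 2 * x + 3 * x^2 + rnorm(n, 0, 1)   # Var(u|x) is constant
)
# Misspecified model: omit the squared term, so w = u + 3x^2
mis_reg <- lm(y ~ x, data = mis_df)
# The residuals now change systematically with x (heteroskedasticity)
plot(x = mis_df$x, y = residuals(mis_reg), xlab = "x", ylab = "Residual")
```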
---

*Example:* Let the population relationship be

$$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + u_i $$

with `\(\mathop{\boldsymbol{E}}\left[ u_i \middle| x_i \right] = 0\)` and `\(\mathop{\text{Var}} \left( u_i \middle| x_i \right) = \sigma^2\)`.

However, we omit `\(x^2\)` and estimate

$$ y_i = \gamma_0 + \gamma_1 x_i + w_i $$

Then

$$ w_i = u_i + \beta_2 x_i^2 \implies \mathop{\text{Var}} \left( w_i \right) = f(x_i) $$

_I.e._, the variance of `\(w_i\)` changes systematically with `\(x_i\)` (heteroskedasticity).

---

.pink[Truth:] `\(\color{#e64173}{\log\left(y_i\right) = \beta_0 + \beta_1 x_i + u_i}\)`

.slate[**Misspecification:**] `\(\color{#314f4f}{y_i = \beta_0 + \beta_1 x_i + v_i}\)`

<img src="slides_files/figure-html/spec plot1-1.svg" style="display: block; margin: auto;" />

---

.pink[**Truth:**] `\(\color{#e64173}{\log\left(y_i\right) = \beta_0 + \beta_1 x_i + u_i}\)`

.slate[Misspecification:] `\(\color{#314f4f}{y_i = \beta_0 + \beta_1 x_i + v_i}\)`

<img src="slides_files/figure-html/spec plot2-1.svg" style="display: block; margin: auto;" />

---

More generally:

**Misspecification problem:** Incorrect specification of the regression model can cause heteroskedasticity (among other problems).

--

**Solution:** 💡 Get it right (_e.g._, don't omit `\(x^2\)`).

--

**New problems:**

- We often don't know the *right* specification.
- We'd like a more formal process for addressing heteroskedasticity.

--

**Conclusion:** Specification often will not "solve" heteroskedasticity. <br> However, correctly specifying your model is still really important.

---
layout: true
# Living with heteroskedasticity
## Weighted least squares

---

Weighted least squares (WLS) presents another approach.

**Response \#2:** Increase efficiency by weighting our observations.

--

Let the true population relationship be

$$
`\begin{align}
  y_i = \beta_0 + \beta_1 x_{i} + u_i \tag{1}
\end{align}`
$$

with `\(u_i \sim \mathop{N} \left( 0,\, \sigma_i^2 \right)\)`.

--

Now transform `\((1)\)` by dividing each observation's data by `\(\sigma_i\)`, _i.e._,

$$
`\begin{align}
  \dfrac{y_i}{\sigma_i} &= \beta_0 \dfrac{1}{\sigma_i} + \beta_1 \dfrac{x_{i}}{\sigma_i} + \dfrac{u_i}{\sigma_i} \tag{2}
\end{align}`
$$

---

$$
`\begin{align}
  y_i &= \beta_0 + \beta_1 x_{i} + u_i \tag{1} \\[1em]
  \dfrac{y_i}{\sigma_i} &= \beta_0 \dfrac{1}{\sigma_i} + \beta_1 \dfrac{x_{i}}{\sigma_i} + \dfrac{u_i}{\sigma_i} \tag{2}
\end{align}`
$$

Whereas `\((1)\)` is heteroskedastic,

--

`\(\color{#e64173}{(2)}\)` .hi[is homoskedastic]. ∴ OLS is efficient and unbiased for estimating the `\(\beta_k\)` in `\((2)\)`!

--

Why is `\((2)\)` homoskedastic?

--

`\(\mathop{\text{Var}} \left( \dfrac{u_i}{\sigma_i} \middle| x_i \right) =\)`

--

`\(\dfrac{1}{\sigma_i^2} \mathop{\text{Var}} \left( u_i \middle| x_i \right) =\)`

--

`\(\dfrac{1}{\sigma_i^2} \sigma_i^2 =\)`

--

`\(1\)`

---

WLS is great, but we need to know `\(\sigma_i^2\)`, which we generally do not.

We can *slightly* relax this requirement—instead requiring

1. `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma_i^2 = \sigma^2 h(x_i)\)`
2. We know `\(h(x)\)`.

--

As before, we transform our heteroskedastic model into a homoskedastic model. This time we divide each observation's data<sup>.pink[†]</sup> by `\(\sqrt{h(x_i)}\)`.

.footnote[
.pink[†] Divide *all* of the data by `\(\sqrt{h(x_i)}\)`, including the intercept.
]
---

$$
`\begin{align}
  y_i &= \beta_0 + \beta_1 x_{i} + u_i \tag{1} \\[1em]
  \dfrac{y_i}{\sqrt{h(x_i)}} &= \beta_0 \dfrac{1}{\sqrt{h(x_i)}} + \beta_1 \dfrac{x_{i}}{\sqrt{h(x_i)}} + \dfrac{u_i}{\sqrt{h(x_i)}} \tag{2}
\end{align}`
$$

with `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma^2 h(x_i)\)`.

--

Now let's check that `\((2)\)` is indeed homoskedastic.

`\(\mathop{\text{Var}} \left( \dfrac{u_i}{\sqrt{h(x_i)}} \middle| x_i \right) =\)`

--

`\(\dfrac{1}{h(x_i)} \mathop{\text{Var}} \left( u_i \middle| x_i \right) =\)`

--

`\(\dfrac{1}{h(x_i)} \sigma^2 h(x_i) =\)`

--

`\(\color{#e64173}{\sigma^2}\)` .hi[Homoskedasticity!]

---

.hi[Weighted least squares] (WLS) estimators are a special class of .hi[generalized least squares] (GLS) estimators focused on heteroskedasticity.

--

$$ y\_i = \beta\_0 + \beta\_1 x\_{1i} + u\_i \quad \color{#e64173}{\text{ vs. }} \quad \dfrac{y\_i}{\sigma\_i} = \beta\_0 \dfrac{1}{\sigma\_i} + \beta\_1 \dfrac{x\_{1i}}{\sigma\_i} + \dfrac{u\_i}{\sigma\_i} $$

*Notes:*

1. WLS **transforms** a heteroskedastic model into a homoskedastic model.
2. **Weighting:** WLS downweights observations with higher variance `\(u_i\)`'s.
3. **Big requirement:** WLS requires that we *know* `\(\sigma_i^2\)` for each observation.
4. WLS is generally **infeasible**. *Feasible* GLS (FGLS) offers a solution.
5. Under its assumptions: WLS is the **best linear unbiased estimator**.

---
layout: true
# Living with heteroskedasticity
## Heteroskedasticity-robust standard errors

---

**Response \#3:**

- Ignore OLS's inefficiency (in the presence of heteroskedasticity).
- Focus on .hi[unbiased estimates for our standard errors].
- In the process: Correct inference.

--

**Q:** What is a standard error?

--

<br>**A:** The .hi[standard deviation of an estimator's distribution].

--

Estimators (like `\(\hat{\beta}_1\)`) are random variables, so they have distributions. Standard errors give us a sense of how much variability is in our estimator.

---

*Recall*: We can write the OLS estimator for `\(\beta_1\)` as

$$ \hat{\beta}\_1 = \beta\_1 + \dfrac{\sum_i \left( x\_i - \overline{x} \right) u\_i}{\sum\_i \left( x\_i - \overline{x} \right)^2} = \beta\_1 + \dfrac{\sum_i \left( x\_i - \overline{x} \right) u\_i}{\text{SST}\_x} \tag{3} $$

--

Let `\(\mathop{\text{Var}} \left( u_i \middle| x_i \right) = \sigma_i^2\)`.

--

We can use `\((3)\)` to write the variance of `\(\hat{\beta}_1\)`, _i.e._,

$$ \mathop{\text{Var}} \left( \hat{\beta}_1 \middle| x_i \right) = \dfrac{\sum_i \left( x_i - \overline{x} \right)^2 \sigma_i^2}{\text{SST}_x^2} \tag{4} $$

---

If we want unbiased estimates for our standard errors, we need an unbiased estimate for

$$ \dfrac{\sum_i \left( x_i - \overline{x} \right)^2 \sigma_i^2}{\text{SST}_x^2} $$

Our old friend Hal White provided such an estimator.pink[<sup>†</sup>]:

$$ \widehat{\mathop{\text{Var}}} \left( \hat{\beta}_1 \right) = \dfrac{\sum_i \left( x_i - \overline{x} \right)^2 e_i^2}{\text{SST}_x^2} $$

where the `\(e_i\)` come from the OLS regression of interest.

.footnote[
.pink[†] This specific equation is for simple linear regression.
]
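---

This estimator is just arithmetic on things we already have. A minimal sketch (simulated data; all names hypothetical) that computes White's het.-robust standard error for `\(\hat{\beta}_1\)` by hand:

``` r
# Load the tidyverse and set a seed
library(tidyverse)
set.seed(421)
# Simulate data with heteroskedastic disturbances: sd grows with x
n <- 100
white_df <- tibble(
  x = runif(n, 1, 10),
  y = 1 + 2 * x + rnorm(n, 0, sd = x)
)
# Run the OLS regression of interest and grab its residuals
white_reg <- lm(y ~ x, data = white_df)
e <- residuals(white_reg)
# White's estimator: Sum((x - xbar)^2 * e^2) / SST_x^2
sst_x <- sum((white_df$x - mean(white_df$x))^2)
var_b1 <- sum((white_df$x - mean(white_df$x))^2 * e^2) / sst_x^2
# The het.-robust standard error for the estimated slope
sqrt(var_b1)
```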
---

Our heteroskedasticity-robust estimators for the standard error of `\(\hat{\beta}_j\)`:

.hi[Case 1] Simple linear regression, `\(y_i = \beta_0 + \beta_1 x_i + u_i\)`

$$ \widehat{\mathop{\text{Var}}} \left( \hat{\beta}_1 \right) = \dfrac{\sum_i \left( x_i - \overline{x} \right)^2 e_i^2}{\text{SST}_x^2} $$

.hi[Case 2] Multiple (linear) regression, `\(y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + u_i\)`

$$ \widehat{\mathop{\text{Var}}} \left( \hat{\beta}\_j \right) = \dfrac{\sum\_i \hat{r}\_{ij}^2 e\_i^2}{\text{SST}\_{x\_j}^2} $$

where `\(\hat{r}_{ij}\)` denotes the `\(i\)`.super[th] residual from regressing `\(x_j\)` on all other explanatory variables.

---

With these standard errors, we can return to correct statistical inference.

_E.g._, we can update our previous `\(t\)` statistic formula with our new heteroskedasticity-robust standard errors.

$$ t = \dfrac{\text{Estimate} - \text{Hypothesized value}}{\text{Standard error}} $$

---

**Notes**

- We are still using .hi[OLS estimates for] `\(\color{#e64173}{\beta_j}\)`
- Our het.-robust standard errors use a .hi[different estimator].
- Homoskedasticity
  - Plain OLS variance estimator is more efficient.
  - Het.-robust is still unbiased.
- Heteroskedasticity
  - Plain OLS variance estimator is biased.
  - Het.-robust variance estimator is unbiased.

---

These standard errors go by many names

- Heteroskedasticity-robust standard errors
- Het.-robust standard errors
- White standard errors
- Eicker-White standard errors
- Huber standard errors
- Eicker-Huber-White standard errors
- (some other combination of Eicker, Huber, and White)

**Do not say:** "Robust standard errors". The problem: "robust" to what?

---
layout: false
class: inverse, middle
# Living with heteroskedasticity
## Examples

---
layout: true
# Living with heteroskedasticity

---

## Examples

Back to our test-scores dataset…

``` r
# Load packages
library(pacman)
p_load(tidyverse, Ecdat)
# Select and rename desired variables; assign to new dataset; format as tibble
test_df <- Caschool %>% select(
  test_score = testscr,
  ratio = str,
  income = avginc,
  enrollment = enrltot
) %>% as_tibble()
# View first 2 rows of the dataset
head(test_df, 2)
```

```
#> # A tibble: 2 × 4
#>   test_score ratio income enrollment
#>        <dbl> <dbl>  <dbl>      <int>
#> 1       691.  17.9  22.7         195
#> 2       661.  21.5   9.82        240
```

---
layout: true
# Living with heteroskedasticity
## Example: Model specification

---

We found significant evidence of heteroskedasticity. Let's check if it was due to misspecifying our model.

---

Model.sub[1]: `\(\text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`
<br>`lm(test_score ~ ratio + income, data = test_df)`

<img src="slides_files/figure-html/ex spec1-1.svg" style="display: block; margin: auto;" />

---

Model.sub[2]: `\(\log\left(\text{Score}_i\right) = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`
<br>`lm(log(test_score) ~ ratio + income, data = test_df)`

<img src="slides_files/figure-html/ex spec2-1.svg" style="display: block; margin: auto;" />

---

Model.sub[3]: `\(\log\left(\text{Score}_i\right) = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \log\left(\text{Income}_i\right) + u_i\)`
<br>`lm(log(test_score) ~ ratio + log(income), data = test_df)`

<img src="slides_files/figure-html/ex spec3-1.svg" style="display: block; margin: auto;" />
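---

Before walking through the numbers, here is a minimal sketch of the White-test mechanics in .mono[R] (assuming `test_df` from above); the next slide walks through the results:

``` r
# Fit Model 3 and grab its residuals
model3 <- lm(log(test_score) ~ ratio + log(income), data = test_df)
test_df$e <- residuals(model3)
# White test: regress e^2 on levels, squares, and the interaction
white_lm <- lm(
  I(e^2) ~ ratio + log(income) + I(ratio^2) + I(log(income)^2) +
    I(ratio * log(income)),
  data = test_df
)
# LM statistic: n times the R-squared of the White-test regression
lm_stat <- nrow(test_df) * summary(white_lm)$r.squared
# Under H0 (homoskedasticity), LM is distributed Chi-squared(5)
pchisq(lm_stat, df = 5, lower.tail = FALSE)
```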
---

Let's test this new specification with the White test for heteroskedasticity.

.center[
Model.sub[3]: `\(\log\left(\text{Score}_i\right) = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \log\left(\text{Income}_i\right) + u_i\)`
]

--

The regression for the White test

--

$$
`\begin{align}
  e_i^2 = &\alpha_0 + \alpha_1 \text{Ratio}_i + \alpha_2 \log\left(\text{Income}_i\right) + \alpha_3 \text{Ratio}_i^2 + \alpha_4 \left(\log\left(\text{Income}_i\right)\right)^2 \\
  &+ \alpha_5 \left(\text{Ratio}_i\times\log\left(\text{Income}_i\right)\right) + v_i
\end{align}`
$$

--

yields `\(R_e^2\approx0.029\)`

--

and a test statistic of

--

`\(\widehat{\text{LM}} = n\times R_e^2 \approx 12.2\)`.

--

Under H.sub[0], `\(\text{LM}\)` is distributed as

--

`\(\chi_5^2\)`

--

`\(\implies\)` *p*-value `\(\approx\)` 0.033.

--

∴

--

.hi[Reject H.sub[0].]

--

.hi[Conclusion:]

--

There is statistically significant evidence of heteroskedasticity at the five-percent level.

---

Okay, we tried adjusting our specification, but there is still evidence of heteroskedasticity.

**Next:** In general, you will turn to heteroskedasticity-robust standard errors.

- OLS is still unbiased for the .hi[coefficients] (the `\(\beta_j\)`'s)
- Heteroskedasticity-robust standard errors are unbiased for the .hi[standard errors] of the `\(\hat{\beta}_j\)`'s, _i.e._, `\(\sqrt{\mathop{\text{Var}} \left( \hat{\beta}_j \right)}\)`.

---
layout: true
# Living with heteroskedasticity
## Example: Het.-robust standard errors

---

Let's return to our model

$$ \text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i $$

We can use the `fixest` package in .mono[R] to calculate het.-robust standard errors.

---

$$ \text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i $$

1\. Run the regression with `feols()` (instead of `lm()`)

``` r
# Load 'fixest' package
p_load(fixest)
# Regress score on ratio and income
test_reg <- feols(test_score ~ ratio + income, data = test_df)
```

--

*Notice* that `feols()` uses the same syntax as `lm()` for this regression.

---

$$ \text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i $$

2\. Estimate het.-robust standard errors with the `vcov = 'hetero'` option in `summary()`

``` r
# Het.-robust standard errors via vcov = 'hetero'
summary(test_reg, vcov = 'hetero')
```

```
#>               Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) 638.729155   7.301234 87.48235  < 2.2e-16 ***
#> ratio        -0.648740   0.353340 -1.83602   0.067066 .  
#> income        1.839112   0.114733 16.02949  < 2.2e-16 ***
```

---

Coefficients and **heteroskedasticity-robust standard errors**:

``` r
summary(test_reg, vcov = 'hetero')
```

```
#>               Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) 638.729155   7.301234 87.48235  < 2.2e-16 ***
#> ratio        -0.648740   0.353340 -1.83602   0.067066 .  
#> income        1.839112   0.114733 16.02949  < 2.2e-16 ***
```

Coefficients and **plain OLS standard errors** (assumes homoskedasticity):

``` r
summary(test_reg, vcov = 'iid')
```

```
#>               Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) 638.729155   7.449077 85.74608  < 2.2e-16 ***
#> ratio        -0.648740   0.354405 -1.83051   0.067888 .  
#> income        1.839112   0.092787 19.82083  < 2.2e-16 ***
```
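---

An aside: if you are working with `lm()` rather than `fixest`, the `sandwich` and `lmtest` packages offer similar het.-robust standard errors. A minimal sketch (assuming both packages are installed; `'HC1'` is one common flavor of the White estimator):

``` r
# Load packages for robust covariance matrices and robust testing
p_load(sandwich, lmtest)
# The same regression with lm()
test_lm <- lm(test_score ~ ratio + income, data = test_df)
# Het.-robust (HC1) standard errors, similar in spirit to vcov = 'hetero'
coeftest(test_lm, vcov = vcovHC(test_lm, type = "HC1"))
```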
---
layout: true
# Living with heteroskedasticity
## Example: WLS

---

We mentioned that WLS is often not possible—we need to know the functional form of the heteroskedasticity—either

**A**\. `\(\sigma_i^2\)`

or

**B**\. `\(h(x_i)\)`, where `\(\sigma_i^2 = \sigma^2 h(x_i)\)`

--

There *are* occasions in which we can know `\(h(x_i)\)`.

---

Imagine individuals in a population have homoskedastic disturbances.

However, instead of observing individuals' data, we only observe group averages in our data (_e.g._, cities, counties, school districts).

If these groups have different sizes, then our dataset will be heteroskedastic—in a predictable fashion.

**Recall:** The variance of the sample mean depends upon the sample size,

$$ \mathop{\text{Var}} \left( \overline{x} \right) = \dfrac{\sigma_x^2}{n} $$

--

**Example:** Our school testing data is averaged at the school level.

---

*Example:* Our school testing data is averaged at the school level.

Even if individual students have homoskedastic disturbances, the schools would have heteroskedastic disturbances, _i.e._,

**Individual-level model:** `\(\quad \text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`

**School-level model:** `\(\quad \overline{\text{Score}}_s = \beta_0 + \beta_1 \overline{\text{Ratio}}_s + \beta_2 \overline{\text{Income}}_s + \overline{u}_s\)`

where the `\(s\)` subscript denotes an individual school (just as `\(i\)` indexes an individual person).

$$ \mathop{\text{Var}} \left( \overline{u}_s \right) = \dfrac{\sigma^2}{n_s} $$

---

For WLS, we're looking for a function `\(h(x_s)\)` such that `\(\mathop{\text{Var}} \left( \overline{u}_s | x_s \right) = \sigma^2 h(x_s)\)`.

--

We just showed<sup>.pink[†]</sup> that `\(\mathop{\text{Var}} \left( \overline{u}_s |x_s \right) = \dfrac{\sigma^2}{n_s}\)`.

.footnote[
.pink[†] Assuming the individuals' disturbances are homoskedastic.
]

--

Thus, `\(h(x_s) = 1/n_s\)`, where `\(n_s\)` is the number of students in school `\(s\)`.

--

To implement WLS, we divide each observation's data by `\(\sqrt{h(x_s)}\)`. Equivalently, we multiply each school's data by `\(1/\sqrt{h(x_s)} = \sqrt{n_s}\)`.

--

The variable .mono[enrollment] in the .mono[test_df] dataset is our `\(n_s\)`.

---

Using WLS to estimate `\(\text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`

**Step 1:** Multiply each variable by `\(1/\sqrt{h(x_i)} = \sqrt{\text{Enrollment}_i}\)`

``` r
# Create WLS-transformed variables, multiplying by the square root of enrollment
test_df <- mutate(test_df,
  test_score_wls = test_score * sqrt(enrollment),
  ratio_wls = ratio * sqrt(enrollment),
  income_wls = income * sqrt(enrollment),
  intercept_wls = 1 * sqrt(enrollment)
)
```

Notice that we are creating a transformed intercept.

---

Using WLS to estimate `\(\text{Score}_i = \beta_0 + \beta_1 \text{Ratio}_i + \beta_2 \text{Income}_i + u_i\)`

**Step 2:** Run our WLS (transformed) regression

``` r
# WLS regression
wls_reg <- lm(
  test_score_wls ~ -1 + intercept_wls + ratio_wls + income_wls,
  data = test_df
)
```

--

*Note:* The `-1` in our regression tells .mono[R] not to add an intercept, since we are adding a transformed intercept (`intercept_wls`).

---

The .hi[WLS estimates and standard errors:]

```
#>                Estimate Std. Error t value Pr(>|t|)    
#> intercept_wls 618.78331    8.26929  74.829   <2e-16 ***
#> ratio_wls      -0.21314    0.37676  -0.566    0.572    
#> income_wls      2.26493    0.09065  24.985   <2e-16 ***
```

--

<br> The .hi[OLS estimates] and .hi[het.-robust standard errors:]

```
#>               Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) 638.729155   7.301234 87.48235  < 2.2e-16 ***
#> ratio        -0.648740   0.353340 -1.83602   0.067066 .  
#> income        1.839112   0.114733 16.02949  < 2.2e-16 ***
```

---

An alternative to doing your own weighting: feed `lm()` some `weights`.
``` r
# WLS via lm's 'weights' argument (weights are proportional to 1/variance)
lm(test_score ~ ratio + income, data = test_df, weights = enrollment)
```

---
layout: false
# Living with heteroskedasticity

In this example

- .hi[Heteroskedasticity-robust standard errors] did not change our standard errors very much (relative to plain OLS standard errors).
- .hi[WLS] changed our answers a bit—coefficients and standard errors.

--

These examples highlighted a few things:

1. Using the correct estimator for your standard errors really matters.<sup>.pink[†]</sup>
2. Econometrics doesn't always offer an obviously *correct* route.

.footnote[
.pink[†] Sit in on an economics seminar, and you will see what I mean.
]

--

To see \#1, let's run a simulation.

---
layout: true
# Living with heteroskedasticity
## Simulation

---

Let's examine a simple linear regression model with heteroskedasticity.

$$ y\_i = \underbrace{\beta\_0}\_{=1} + \underbrace{\beta\_1}\_{=10} x\_i + u\_i $$

where `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma_i^2 = \sigma^2 x_i^2\)`.

--

<img src="slides_files/figure-html/sim plot y-1.svg" style="display: block; margin: auto;" />

---

Let's examine a simple linear regression model with heteroskedasticity.

$$ y\_i = \underbrace{\beta\_0}\_{=1} + \underbrace{\beta\_1}\_{=10} x\_i + u\_i $$

where `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma_i^2 = \sigma^2 x_i^2\)`.

<img src="slides_files/figure-html/sim plot u-1.svg" style="display: block; margin: auto;" />

---

*Note regarding WLS:* Since `\(\mathop{\text{Var}} \left( u_i | x_i \right) = \sigma^2 x_i^2\)`,

$$ \mathop{\text{Var}} \left( u_i | x_i \right) = \sigma^2 h(x_i) \implies h(x_i) = x_i^2 $$

WLS multiplies each variable by `\(1/\sqrt{h(x_i)} = 1/x_i\)`.

---

In this simulation, we want to compare

1. The **efficiency** of
  - OLS
  - WLS with correct weights: `\(h(x_i) = x_i^2\)`
  - WLS with incorrect weights: `\(h(x_i) = x_i\)`
2. How well our **standard errors** perform (via confidence intervals) with
  - Plain OLS standard errors
  - Heteroskedasticity-robust standard errors
  - WLS standard errors

---

The simulation plan:

.pseudocode-small[

Do 10,000 times:

1. Generate a sample of size 30 from the population
2. Calculate/save OLS and WLS (×2) estimates for β.sub[1]
3. Calculate/save standard errors for β.sub[1] using
  - Plain OLS standard errors
  - Heteroskedasticity-robust standard errors
  - WLS (correct)
  - WLS (incorrect)

]

---

**For one iteration of the simulation:** Code to generate the data...

``` r
# Parameters
b0 <- 1
b1 <- 10
s2 <- 1
# Sample size
n <- 30
# Generate data; sd = sqrt(s2) * x gives Var(u|x) = s2 * x^2
sample_df <- tibble(
  x = runif(n, 0.5, 1.5),
  y = b0 + b1 * x + rnorm(n, 0, sd = sqrt(s2) * x)
)
```

---

**For one iteration of the simulation:** Code to estimate our coefficients and standard errors...

``` r
# OLS
ols <- feols(y ~ x, data = sample_df)
# WLS: Correct weights, 1/h(x) = 1/x^2
wls_t <- lm(y ~ x, data = sample_df, weights = 1/x^2)
# WLS: Incorrect weights, 1/h(x) = 1/x
wls_f <- lm(y ~ x, data = sample_df, weights = 1/x)
# Coefficients and standard errors
summary(ols, vcov = 'iid')
summary(ols, vcov = 'hetero')
summary(wls_t)
summary(wls_f)
```

Then save the results.
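--

To repeat this 10,000 times, one approach is to wrap a single iteration in a function and stack the results. A minimal sketch (assuming the parameters above; function and column names hypothetical):

``` r
# One iteration: draw a sample, return the three slope estimates
one_iter <- function(i) {
  sample_df <- tibble(
    x = runif(n, 0.5, 1.5),
    y = b0 + b1 * x + rnorm(n, 0, sd = sqrt(s2) * x)
  )
  tibble(
    iter  = i,
    ols   = coef(feols(y ~ x, data = sample_df))["x"],
    wls_t = coef(lm(y ~ x, data = sample_df, weights = 1/x^2))["x"],
    wls_f = coef(lm(y ~ x, data = sample_df, weights = 1/x))["x"]
  )
}
# Repeat 10,000 times and stack the results into one data frame
sim_results <- map_dfr(1:1e4, one_iter)
```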
---
layout: false
# Living with heteroskedasticity
## Simulation: Coefficients

<img src="slides_files/figure-html/sim plot efficiency-1.svg" style="display: block; margin: auto;" />

---
layout: false
# Living with heteroskedasticity
## Simulation: Inference

<img src="slides_files/figure-html/sim plot t stat-1.svg" style="display: block; margin: auto;" />

---
layout: false
# Living with heteroskedasticity
## Simulation: Results

Summarizing our simulation results (10,000 iterations)

.pull-left.center[

**Estimation**: Summary of `\(\hat{\beta}_1\)`'s

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Estimator </th>
   <th style="text-align:right;"> Mean </th>
   <th style="text-align:right;"> S.D. </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> OLS </td>
   <td style="text-align:right;"> 10.028 </td>
   <td style="text-align:right;"> 0.897 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> WLS Correct </td>
   <td style="text-align:right;"> 10.021 </td>
   <td style="text-align:right;"> 0.675 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> WLS Incorrect </td>
   <td style="text-align:right;"> 10.024 </td>
   <td style="text-align:right;"> 0.766 </td>
  </tr>
</tbody>
</table>

]

--

.pull-right.center[

**Inference:** % of times we reject the true `\(\beta_1\)`

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Estimators </th>
   <th style="text-align:right;"> % Reject </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> OLS + Het.-robust </td>
   <td style="text-align:right;"> 7.5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> OLS + Homosk. </td>
   <td style="text-align:right;"> 8.5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> WLS Correct </td>
   <td style="text-align:right;"> 7.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> WLS Incorrect </td>
   <td style="text-align:right;"> 8.1 </td>
  </tr>
</tbody>
</table>

]

---
layout: false
class: clear, middle

Going further...

---
# Similar violations
## Assumptions

Recall our old assumption that led to this heteroskedasticity discussion:

> 5\. The disturbances have .b.slate[constant variance] `\(\sigma^2\)` <br>and .hi[zero covariance], _i.e._,

- `\(\color{#314f4f}{\mathop{\boldsymbol{E}}\left[ u_i^2 \middle| X_i \right] = \mathop{\text{Var}} \left( u_i \middle| X_i \right) = \sigma^2 \implies \mathop{\text{Var}} \left( u_i \right) = \sigma^2}\)`
- `\(\color{#e64173}{\mathop{\text{Cov}} \left( u_i, \, u_j \middle| X_i,\,X_j \right) = \mathop{\boldsymbol{E}}\left[ u_i u_j \middle| X_i,\,X_j \right] = 0}\)` for `\(i\neq j\)`

--

.b.slate[Violation of constant variance] `\(=\)` .b.slate[heteroskedasticity]

It's also possible (likely) that the .b.pink[disturbances are correlated]. Ignoring this correlation is even more problematic for inference.

---
# Correlated disturbances
## The problem

In many cases, .b.pink[observations' disturbances] `\(\left(u_i,\,u_j\right)\)` .b.pink[can be correlated].

*Remember*

- The .b.pink[disturbance] represents the un-included variables that affect `\(y\)`.
- Some .b.pink[observations] in the sample may *relate* to other observations.

If these observation-level relationships extend to the variables in the disturbance, then disturbances can correlate. `\(\implies \mathop{\text{Cov}} \left( u_i, \, u_j \middle| X_i,\,X_j \right) \color{#e64173}{\neq} 0\)`.

--

Ignoring this correlation can cause **big problems** in your inference.

---
# Correlated disturbances
## The intuition

Why is ignoring this correlation problematic?
**False precision:** We can get "overconfident" in our knowledge. When we treat correlated observations as independent, OLS thinks we're learning more than we actually are.

**Extreme example:** If you duplicate your dataset (stack it on top of itself), plain OLS standard errors would decrease every time you duplicated the dataset.

---
exclude: true

---
layout: false
class: clear, middle

The effect of duplicating our data on the .pink[OLS standard error] of `ratio`.

<img src="slides_files/figure-html/dup-plot1-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: clear, middle

Correcting our .orange[standard errors for clustering] (observations' correlation).

<img src="slides_files/figure-html/dup-plot2-1.svg" style="display: block; margin: auto;" />

---
# Correlated disturbances
## Examples

"Real" examples where disturbances might correlate:

- Students in a classroom (share teacher, curriculum, etc.)
- Counties in a state (share state-level policies/laws)
- Businesses in a city (share local economic shocks)
- Consecutive days in a sample (share events, weather, etc.)

---
# Correlated disturbances
## The solution

Just like we calculate *heteroskedasticity*-robust standard errors, <br>we can also calculate standard errors robust to correlated disturbances.

People call these .pink[*cluster-robust standard errors*] (or just *clustered*).

From the `fixest` package:

`feols(y ~ x, data = fake_data, cluster = 'cluster_var')`

or even

`feols(y ~ x, data = fake_data, cluster = c('cluster1', 'cluster2'))`

---
# Final word
## Better inference

1. You should default to assuming your data are heteroskedastic.
2. Think about how your explanatory variables and/or disturbances correlate across observations. A short simulated illustration follows.
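---
# Final word
## A closing illustration

To make the clustering point concrete, here is a minimal sketch (simulated data; all names hypothetical): when both `\(x\)` and `\(u\)` share cluster-level shocks, iid standard errors overstate our precision.

``` r
# Load packages and set a seed
library(pacman)
p_load(tidyverse, fixest)
set.seed(421)
# 40 clusters of 25 observations; x and u each share a cluster-level shock
cluster_df <- tibble(
  cluster_id = rep(1:40, each = 25),
  x = rnorm(40)[cluster_id] + rnorm(1000),
  u = rnorm(40)[cluster_id] + rnorm(1000),
  y = 1 + 2 * x + u
)
# Estimate the regression
cluster_reg <- feols(y ~ x, data = cluster_df)
# iid standard errors ignore the within-cluster correlation
summary(cluster_reg, vcov = 'iid')
# Cluster-robust standard errors account for it (expect larger SEs)
summary(cluster_reg, cluster = 'cluster_id')
```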