EC421, Set 04
Prologue
Quarto - Simple markup language for combining/creating documents, equations, figures, R, and more - Basics of Markdown - E.g., **I'm bold**, *I'm italic*, `I'm code`
Interactive data visualization beyond shiny: plotly, crosstalk, echarts4r, highcharter.
Econometrics with R - (Currently) free, online textbook - Written and published using R (and probably R Markdown) - Warning: I haven’t read the full book.
Related: Tyler Ransom has a great cheatsheet for econometrics.
We wrapped up our review.
Heteroskedasticity
First assignment!
Submit one file (HTML, PDF, or Word) that includes
This file should be a rendered RMarkdown or Quarto file (.rmd or .qmd).
Important
Prerequisite material
Some test statistics are distributed as \(\chi^2\) random variables.
The \(\chi^2\) distribution is just another example of a common (named) distribution (like the Normal distribution, the \(t\) distribution, and the \(F\)).
The shape of the \(\chi^2\) distribution depends on a single parameter: - We will call this parameter \(k\) - Our test statistics will refer to \(k\) as degrees of freedom.
Three examples of \(\chi_k^2\): \(\textcolor{#314f4f}{k = 1}\), \(\textcolor{#e64173}{k = 2}\), and \(\textcolor{orange}{k = 9}\)
Probability of observing a more extreme test statistic \(\widehat{\text{LM}}\) under H0
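These p-values are just upper-tail \(\chi^2\) probabilities, which base R's `pchisq()` computes directly. A quick sketch (the statistic and degrees of freedom below are made-up illustration values):

```r
# Upper-tail chi-squared probability for an LM-type statistic
# (lm_stat = 6.5 and df = 2 are made-up illustration values)
lm_stat = 6.5
p_value = pchisq(q = lm_stat, df = 2, lower.tail = FALSE)
round(p_value, 4)
#> [1] 0.0388
```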
Heteroskedasticity
Let’s write down our current assumptions
Our sample (the \(x_{ki}\)'s and \(y_i\)'s) was randomly drawn from the population.
\(y\) is a linear function of the \(\beta_k\)’s and \(u_i\).
There is no perfect multicollinearity in our sample.
The explanatory variables are exogenous: \(\mathop{\boldsymbol{E}}\left[ u \middle| X \right] = 0 \left(\implies \mathop{\boldsymbol{E}}\left[ u \right] = 0\right)\).
The disturbances have constant variance \(\sigma^2\) and zero covariance, i.e.,
The disturbances come from a Normal distribution, i.e., \(u_i \overset{\text{iid}}{\sim} \mathop{\text{N}}\left( 0, \sigma^2 \right)\).
Today we’re focusing on assumption #5:
5. The disturbances have constant variance \(\sigma^2\) and zero covariance, i.e.,
We will focus on the assumption of constant variance
(also known as homoskedasticity).
Violation of this assumption:
Heteroskedasticity: \(\mathop{\text{Var}} \left( u_i \right) = \sigma^2_i\) and \(\sigma^2_i \neq \sigma^2_j\) for some \(i\neq j\).
In other words: Our disturbances have different variances.
Classic example of heteroskedasticity: The funnel
Variance of \(u\) increases with \(x\)
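A funnel like this is easy to simulate; the data-generating process below is purely illustrative:

```r
# Simulated funnel: the disturbance's standard deviation grows with x
# (all numbers illustrative)
set.seed(421)
x = runif(1000, min = 0, max = 10)
u = rnorm(1000, mean = 0, sd = x)  # Var(u_i) = x_i^2
y = 1 + 2 * x + u
# plot(x, y) shows the funnel: spread around the line widens with x
```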
Another example of heteroskedasticity: (double funnel?)
Variance of \(u\) increasing at the extremes of \(x\)
Another example of heteroskedasticity:
Differing variances of \(u\) by group
Heteroskedasticity is present when the variance of \(u\) changes with any combination of our explanatory variables \(x_1\), through \(x_k\) (henceforth: \(X\)).
It’s very common in practice—and should probably be our default.
Why we care: Heteroskedasticity shows us how small violations of our assumptions can affect OLS’s performance.
So what are the consequences of heteroskedasticity? Bias? Inefficiency?
First, let’s check whether it has consequences for the unbiasedness of OLS.
Recall1: OLS being unbiased means \(\mathop{\boldsymbol{E}}\left[ \hat{\beta}_k \middle| X \right] = \beta_k\) for all \(k\).
Recall2: We previously showed \(\hat{\beta}_1 = \dfrac{\sum_i\left(y_i-\overline{y}\right)\left(x_i-\overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2}\)
It will actually help us to rewrite this estimator as
\[ \hat{\beta}_1 = \beta_1 + \dfrac{\sum_i \left( x_i - \overline{x} \right) u_i}{\sum_i \left( x_i - \overline{x} \right)^2} \]
Proof: Assuming \(y_i = \beta_0 + \beta_1 x_i + u_i\)
\[ \begin{aligned} \hat{\beta}_1 &= \dfrac{\sum_i\left(y_i-\overline{y}\right)\left(x_i-\overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \dfrac{\sum_i\left(\left[ \beta_0 + \beta_1 x_i + u_i \right]- \left[ \beta_0 + \beta_1 \overline{x} + \overline{u} \right] \right)\left(x_i-\overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \dfrac{\sum_i\left(\beta_1 \left[ x_i - \overline{x} \right] + \left[u_i - \overline{u}\right] \right)\left(x_i-\overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \dfrac{\sum_i\left(\beta_1 \left[ x_i - \overline{x} \right]^2 + \left[ x_i - \overline{x} \right] \left[u_i - \overline{u}\right]\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) \left(u_i - \overline{u}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \end{aligned} \]
\[ \begin{aligned} \hat{\beta}_1 &= \cdots = \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) \left(u_i - \overline{u}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i - \overline{u} \sum_i\left(x_i - \overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i - \overline{u} \left(\sum_i x_i - \sum_i \overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i - \overline{u} \left(\sum_i x_i - n \overline{x}\right)}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i - \overline{u} \textcolor{#e64173}{\left(\sum_i x_i - \sum_i x_i\right)}}{\sum_i\left(x_i -\overline{x}\right)^2} \\ &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i}{\sum_i\left(x_i -\overline{x}\right)^2} \quad \text{😅} \end{aligned} \]
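The algebra above can be checked numerically on simulated data (true \(\beta_1 = 2\) by construction):

```r
# Numerical check: beta1_hat equals beta1 + sum((x - xbar) * u) / sum((x - xbar)^2)
set.seed(421)
n = 50
x = runif(n)
u = rnorm(n)
y = 3 + 2 * x + u  # true beta1 = 2
b1_ols = coef(lm(y ~ x))[["x"]]
b1_decomp = 2 + sum((x - mean(x)) * u) / sum((x - mean(x))^2)
all.equal(b1_ols, b1_decomp)
#> [1] TRUE
```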
We now want to see if heteroskedasticity biases the OLS estimator for \(\beta_1\).
\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \hat{\beta}_1 \middle| X \right] &= \mathop{\boldsymbol{E}}\left[ \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i}{\sum_i\left(x_i -\overline{x}\right)^2} \middle| X \right] \\[0.5em] &= \beta_1 + \mathop{\boldsymbol{E}}\left[ \dfrac{\sum_i\left(x_i - \overline{x}\right) u_i}{\sum_i\left(x_i -\overline{x}\right)^2} \middle| X \right] \\[0.5em] &= \beta_1 + \dfrac{\sum_i\left(x_i - \overline{x}\right) \textcolor{#e64173}{\underbrace{\mathop{\boldsymbol{E}}\left[ u_i \middle| X \right]}_{=0}}}{\sum_i\left(x_i -\overline{x}\right)^2} \\[0.5em] &= \beta_1 \end{aligned} \]
Phew. OLS is still unbiased for the \(\beta_k\).
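A small simulation (illustrative numbers) makes the point concrete: even when the variance of \(u\) grows with \(x\), the sampling distribution of \(\hat{\beta}_1\) centers on the truth.

```r
# Simulation: OLS stays unbiased even when Var(u_i) grows with x_i
# (2,000 replications; all numbers illustrative)
set.seed(421)
b1_hat = replicate(2000, {
  x = runif(100, 0, 10)
  u = rnorm(100, mean = 0, sd = 0.5 * x)  # heteroskedastic disturbances
  y = 1 + 2 * x + u                       # true beta1 = 2
  coef(lm(y ~ x))[["x"]]
})
mean(b1_hat)  # close to 2
```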
OLS’s efficiency and inference do not survive heteroskedasticity.
In the presence of heteroskedasticity, OLS is no longer the most efficient (best) linear unbiased estimator.
It would be more informative (efficient) to weight observations inversely to their \(u_i\)’s variance.
OLS standard errors are biased in the presence of heteroskedasticity.
Wrong confidence intervals
Problems for hypothesis testing (both \(t\) and \(F\) tests)
It’s hard to learn much without sound inference.
Tests to determine whether heteroskedasticity is present.
Remedies for (1) efficiency and (2) inference
Testing for heteroskedasticity
While we have remedies for heteroskedasticity, which estimator is efficient depends on whether heteroskedasticity is actually present.
The Goldfeld-Quandt test
The White test
Each of these tests1 centers on the fact that we can use the OLS residual \(\textcolor{#e64173}{e_i}\) to estimate the population disturbance \(\textcolor{#e64173}{u_i}\).
Focuses on a specific type of heteroskedasticity: whether the variance of \(u_i\) differs between two groups.1
Remember how we used our residuals to estimate the \(\sigma^2\)?
\[ s^2 = \dfrac{\text{SSE}}{n-1} = \dfrac{\sum_i e_i^2}{n-1} \]
We will use this same idea to determine whether there is evidence that our two groups differ in the variances of their disturbances, effectively comparing \(s^2_1\) and \(s^2_2\) from our two groups.
Operationally,
Order the observations by \(x\)
Split the data into two groups of size \(n^{\star}\)
Run separate regressions of \(y\) on \(x\) for G1 and G2
Record SSE1 and SSE2
Calculate the G-Q test statistic
The G-Q test statistic
\[ F_{\left(n^{\star}-k,\, n^{\star}-k\right)} = \dfrac{\text{SSE}_2/(n^\star-k)}{\text{SSE}_1/(n^\star-k)} = \dfrac{\text{SSE}_2}{\text{SSE}_1} \]
follows an \(F\) distribution (under the null hypothesis) with \(n^{\star}-k\) and \(n^{\star}-k\) degrees of freedom.1
Notes
\(F_{375,\,375} = \dfrac{\textcolor{#e64173}{\text{SSE}_2 = 18,203.4}}{\textcolor{#314f4f}{\text{SSE}_1 = 1,039.5}} \approx 17.5 \implies\) p-value \(< 0.001\)
\(\therefore\) We reject H0: \(\sigma^2_1 = \sigma^2_2\) and conclude there is statistically significant evidence of heteroskedasticity.
The problem…
\(F_{375,\,375} = \dfrac{\textcolor{#e64173}{\text{SSE}_2 = 14,516.8}}{\textcolor{#314f4f}{\text{SSE}_1 = 14,937.1}} \approx 1 \implies\) p-value \(\approx 0.609\)
\(\therefore\) We fail to reject H0: \(\sigma^2_1 = \sigma^2_2\) while heteroskedasticity is present.
Breusch and Pagan (1979) attempted to solve this issue of being too specific with the functional form of the heteroskedasticity.
However, we actually want to know if
\[ \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 \]
Q: Can’t we just test this hypothesis?
A: Sort of.
Toward this goal, Hal White took advantage of the fact that we can replace the homoskedasticity requirement with a weaker assumption:
Old: \(\mathop{\text{Var}} \left( u_i \middle| X \right) = \sigma^2\)
New: \(u^2\) is uncorrelated with the explanatory variables (i.e., \(x_j\) for all \(j\)), their squares (i.e., \(x_j^2\)), and the first-degree interactions (i.e., \(x_j x_h\)).
This new assumption is easier to explicitly test (hint: regression).
An outline of White’s test for heteroskedasticity:
1. Regress y on x1, x2, …, xk. Save residuals e.
2. Regress squared residuals on all explanatory variables, their squares, and interactions.
\[ e^2 = \alpha_0 + \sum_{h=1}^k \alpha_h x_h + \sum_{j=1}^k \alpha_{k+j} x_j^2 + \sum_{\ell = 1}^{k-1} \sum_{m = \ell + 1}^k \alpha_{\ell,m} x_\ell x_m + v_i \]
3. Record \(R_e^2\).
4. Calculate test statistic to test H0: \(\alpha_p = 0\) for all \(p\neq0\).
White’s test statistic is
\[ \text{LM} = n \times R_e^2 \qquad \text{Under H}_0,\, \text{LM} \overset{\text{d}}{\sim} \chi_k^2 \]
where \(R^2_e\) comes from the regression of \(e^2\) on the explanatory variables, their squares, and their interactions.
\[ e^2 = \alpha_0 + \underbrace{\sum_{h=1}^k \alpha_h x_h}_{\text{Expl. variables}} + \underbrace{\sum_{j=1}^k \alpha_{k+j} x_j^2}_{\text{Squared terms}} + \underbrace{\sum_{\ell = 1}^{k-1} \sum_{m = \ell + 1}^k \alpha_{\ell,m} x_\ell x_m}_{\text{Interactions}} + v_i \]
Note: The \(k\) (for our \(\chi_k^2\)) equals the number of estimated parameters in the regression above (the \(\alpha_j\)), excluding the intercept \(\left( \alpha_0 \right)\).
Practical note: If a variable is equal to its square (e.g., binary variables), then you don’t (can’t) include it. The same rule applies for interactions.
Example: Consider the model1 \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u\)
Step 1: Estimate the model; obtain residuals \((e)\).
Step 2: Regress \(e^2\) on explanatory variables, squares, and interactions. \[ \begin{aligned} e^2 = &\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_1^2 + \alpha_5 x_2^2 + \alpha_6 x_3^2 \\ &+ \alpha_7 x_1 x_2 + \alpha_8 x_1 x_3 + \alpha_9 x_2 x_3 + v \end{aligned} \]
Record the \(R^2\) from this equation (call it \(R_e^2\)).
Step 3: Test H0: \(\alpha_1 = \alpha_2 = \cdots = \alpha_9 = 0\) using \(\text{LM} = n R^2_e \overset{\text{d}}{\sim} \chi_9^2\).
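The steps above can be sketched in base R on simulated data (all names and numbers below are illustrative, not the course dataset):

```r
# White test by hand on simulated data (base R; names and numbers illustrative)
set.seed(421)
n = 500
x1 = runif(n)
x2 = runif(n)
y = 1 + x1 + x2 + rnorm(n, sd = 1 + 2 * x1)  # heteroskedastic by construction
# Step 1: estimate the model; save residuals
e = residuals(lm(y ~ x1 + x2))
# Step 2: regress e^2 on levels, squares, and the interaction; grab R^2
r2_e = summary(lm(I(e^2) ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2))$r.squared
# Step 3: LM = n * R^2, compared against chi-squared with k = 5
lm_stat = n * r2_e
p_value = pchisq(lm_stat, df = 5, lower.tail = FALSE)
```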
The White test for this simple linear regression.
\[ \begin{aligned} e_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} \textcolor{#e64173}{+ \hat{\alpha}_2 x^2_{1i}} & \widehat{\text{LM}} &= 185.8 &\mathit{p}\text{-value} < 0.001 \end{aligned} \]
Testing for Heteroskedasticity
Examples
Goal: Estimate the relationship between standardized test scores (outcome variable) and (1) student-teacher ratio and (2) income, i.e.,
\[ \left(\text{Test score}\right)_i = \beta_0 + \beta_1 \text{Ratio}_{i} + \beta_2 \text{Income}_{i} + u_i \tag{1} \]
Potential issue: Heteroskedasticity… and we do not observe \(u_i\).
Solution:
We will use testing data from the dataset Caschool in the Ecdat R package.
#> # A tibble: 2 × 3
#> test_score ratio income
#> <dbl> <dbl> <dbl>
#> 1 691. 17.9 22.7
#> 2 661. 21.5 9.82
Let’s begin by estimating our model
\[ \left(\text{Test score}\right)_i = \beta_0 + \beta_1 \text{Ratio}_{i} + \beta_2 \text{Income}_{i} + u_i \]
#> # A tibble: 3 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 639. 7.45 85.7 5.70e-267
#> 2 ratio -0.649 0.354 -1.83 6.79e- 2
#> 3 income 1.84 0.0928 19.8 4.38e- 62
Now, let’s see what the residuals suggest about heteroskedasticity
Income looks potentially heteroskedastic; let’s test via Goldfeld-Quandt.
# Arrange the data by income
test_df = arrange(test_df, income)
# Re-estimate the model for the last and first 158 observations
est_model1 = lm(test_score ~ ratio + income, data = tail(test_df, 158))
est_model2 = lm(test_score ~ ratio + income, data = head(test_df, 158))
# Grab the residuals from each regression
e_model1 = residuals(est_model1)
e_model2 = residuals(est_model2)
# Calculate SSE for each regression
(sse_model1 = sum(e_model1^2))
#> [1] 19305.01
(sse_model2 = sum(e_model2^2))
#> [1] 29537.83
Remember the Goldfeld-Quandt test statistic?
\(F_{n^\star-k,\,n^\star-k} = \dfrac{\text{SSE}_2}{\text{SSE}_1}\) \(\approx\dfrac{29,537.83}{19,305.01}\) \(\approx1.53\)
Test via \(F_{158-3,\,158-3}\)
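With the SSE values printed above, the p-value is an upper-tail F probability via `pf()`:

```r
# Goldfeld-Quandt p-value: P(F > SSE_2 / SSE_1) with (155, 155) degrees of freedom
sse_1 = 19305.01
sse_2 = 29537.83
f_stat = sse_2 / sse_1  # roughly 1.53
pf(q = f_stat, df1 = 158 - 3, df2 = 158 - 3, lower.tail = FALSE)
# approximately 0.0042
```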
The Goldfeld-Quandt test statistic and its null distribution
Putting it all together:
H0: \(\sigma_1^2 = \sigma_2^2\) vs. HA: \(\sigma_1^2 \neq \sigma_2^2\)
Goldfeld-Quandt test statistic: \(F \approx 1.53\)
p-value \(\approx 0.00423\)
∴ Reject H0 (p-value is less than 0.05).
Conclusion: We find statistically significant evidence that \(\sigma_1^2 \neq \sigma_2^2\).
There is statistically significant evidence of heteroskedasticity (at the 5% level).
What if we had chosen to focus on student-to-teacher ratio?
# Arrange the data by ratio
test_df = arrange(test_df, ratio)
# Re-estimate the model for the last and first 158 observations
est_model3 = lm(test_score ~ ratio + income, data = tail(test_df, 158))
est_model4 = lm(test_score ~ ratio + income, data = head(test_df, 158))
# Grab the residuals from each regression
e_model3 = residuals(est_model3)
e_model4 = residuals(est_model4)
# Calculate SSE for each regression
(sse_model3 = sum(e_model3^2))
#> [1] 26243.52
(sse_model4 = sum(e_model4^2))
#> [1] 29101.52
\(F_{n^\star-k,\,n^\star-k} = \dfrac{\text{SSE}_4}{\text{SSE}_3} \approx\dfrac{29,101.52}{26,243.52} \approx1.11\)
which has a p-value of approximately 0.2603.
∴ We would have failed to reject H0, concluding that we failed to find statistically significant evidence of heteroskedasticity.
Lesson: Understand the limitations of estimators, tests, etc.
Let’s test the same model and data with the White test.
Recall: We saved our residuals as e in our dataset, i.e.,
The White test adds squared terms and interactions to the initial regression specification (the right-hand side) \[ \begin{align} \textcolor{#314f4f}{u_{i}^2} =& \textcolor{#314f4f}{\alpha_{0} + \alpha_{1} \text{Ratio}_{i} + \alpha_{2} \text{Income}_{i} } \\ &+ \textcolor{#e64173}{\alpha_{3} \text{Ratio}_{i}^2 + \alpha_{4} \text{Income}_{i}^2 + \alpha_{5} \text{Ratio}_{i}\times\text{Income}_{i}} \\ &+ \textcolor{#314f4f}{w_{i}} \end{align} \]
The White test tests the null hypothesis
H0: \(\textcolor{#e64173}{\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0}\)
We just need to write some R code to test H0.
Aside: R has funky notation for squared terms and interactions in lm():
Squared terms use I(), e.g., lm(y ~ I(x^2))
Interactions use : between the variables, e.g., lm(y ~ x1:x2)
Example: Regress y on quadratic of x1 and x2:
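A minimal, runnable sketch of that call (`df`, `x1`, `x2`, and `y` are hypothetical placeholders):

```r
# Hypothetical data so the call runs
df = data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y = 1 + df$x1 + df$x2 + rnorm(50)
# Quadratic terms need I(); the interaction uses ':'
est = lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = df)
length(coef(est))  # 6: intercept plus five slope terms
```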
Step 1: Regress \(e_i^2\) on 1st degree, 2nd degree, and interactions
Step 2: Collect \(R_e^2\) from the regression.
Step 3: Calculate the White test statistic \(\text{LM} = n \times R_e^2 \approx 420 \times 0.073\)
#> [1] 30.79533
Step 4: Calculate the associated p-value (where \(\text{LM} \overset{d}{\sim} \chi_k^2\)); here, \(k=5\)
# Regress squared residuals on quadratic of explanatory variables
white_model = lm(
I(e^2) ~ ratio + income + I(ratio^2) + I(income^2) + ratio:income,
data = test_df
)
# Grab the R-squared
white_r2 = summary(white_model)$r.squared
# Calculate the White test statistic
white_stat = 420 * white_r2
# Calculate the p-value
pchisq(q = white_stat, df = 5, lower.tail = F)
#> [1] 1.028039e-05
Putting everything together…
H0: \(\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = 0\) vs. HA: \(\alpha_i \neq 0\) for some \(i \in \{1,\, 2,\,\ldots,\, 5\}\)
from the model \[ \begin{align} u_i^2 =& \alpha_0 + \alpha_1 \text{Ratio}_i + \alpha_2 \text{Income}_i \\ &+ \alpha_3 \text{Ratio}_i^2 + \alpha_4 \text{Income}_i^2 \\ &+ \alpha_5 \text{Ratio}_i \times \text{Income}_i + w_i \end{align} \]
Our White test statistic: \(\text{LM} = n \times R_e^2 \approx 420 \times 0.073 \approx 30.8\)
Under the \(\chi_5^2\) distribution, this \(\widehat{\text{LM}}\) has a p-value less than 0.001.
∴ We reject H0 and conclude there is statistically significant evidence of heteroskedasticity (at the 5-percent level).
The White test statistic and its null distribution
Q: What is the definition of heteroskedasticity?
Q: Why are we concerned about heteroskedasticity?
Q: Does plotting \(y\) against \(x\) tell us anything about heteroskedasticity?
Q: Does plotting \(e\) against \(x\) tell us anything about heteroskedasticity?
Q: Since we cannot observe the \(u_i\)’s, what do we use to learn about heteroskedasticity?
Q: Which test do you recommend to test for heteroskedasticity? Why?
Next time: Living/working with heteroskedasticity.
Appendix
One more test…
Breusch and Pagan (1979) attempted to solve this issue of being too specific with the functional form of the heteroskedasticity.
Allows the data to show if/how the variance of \(u_i\) correlates with \(X\).
If \(\sigma_i^2\) correlates with \(X\), then we have heteroskedasticity.
Regresses \(e_i^2\) on \(X = \left[ 1,\, x_1,\, x_2,\, \ldots,\, x_k \right]\) and tests for joint significance.
How to implement:
1. Regress y on an intercept, x1, x2, …, xk.
2. Record residuals e.
3. Regress e2 on an intercept, x1, x2, …, xk.
\[ e_i^2 = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + \cdots + \alpha_k x_{ki} + v_i \]
4. Record R2.
5. Test hypothesis H0: \(\alpha_1 = \alpha_2 = \cdots = \alpha_k = 0\)
The B-P test statistic1 is
\[ \text{LM} = n \times R^2_{e} \]
where \(R^2_e\) is the \(R^2\) from the regression
\[ e_i^2 = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + \cdots + \alpha_k x_{ki} + v_i \]
Under the null, \(\text{LM}\) is asymptotically distributed as \(\chi^2_k\).
This test statistic tests H0: \(\alpha_1 = \alpha_2 = \cdots = \alpha_k = 0\).
Rejecting the null hypothesis implies evidence of heteroskedasticity.
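The five steps above can be sketched in base R on simulated data (all values illustrative; the `lmtest` package's `bptest()` offers a canned, studentized version of this test):

```r
# Breusch-Pagan by hand on simulated data (base R; numbers illustrative)
set.seed(421)
n = 500
x1 = runif(n)
x2 = runif(n)
y = 1 + x1 + x2 + rnorm(n, sd = 1 + 2 * x1)  # Var(u) rises with x1
e = residuals(lm(y ~ x1 + x2))
# Regress e^2 on the explanatory variables only (no squares or interactions)
r2_e = summary(lm(I(e^2) ~ x1 + x2))$r.squared
bp_stat = n * r2_e
p_value = pchisq(bp_stat, df = 2, lower.tail = FALSE)
```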
Problem: We’re still assuming a fairly restrictive functional form between our explanatory variables \(X\) and the variances of our disturbances \(\sigma^2_i\).
Result: B-P may still miss fairly simple forms of heteroskedasticity.
Breusch-Pagan tests are still sensitive to functional form.
\[ \begin{aligned} e_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} & \widehat{\text{LM}} &= 1.26 &\mathit{p}\text{-value} \approx 0.261 \\ e_i^2 &= \hat{\alpha}_0 + \hat{\alpha}_1 x_{1i} \textcolor{#e64173}{+ \hat{\alpha}_2 x^2_{1i}} & \widehat{\text{LM}} &= 185.8 &\mathit{p}\text{-value} < 0.001 \end{aligned} \]
Let’s test the same model and data with the Breusch-Pagan test.
Recall: We saved our residuals as e in our dataset, i.e.,
In B-P, we first regress \(e_i^2\) on the explanatory variables,
and use the resulting \(R^2\) to calculate a test statistic.
The Breusch-Pagan test statistic is
\(\text{LM} = n \times R^2_e\)
\(\approx 420 \times 0.0000323\)
\(\approx 0.0136\)
which we test against a \(\chi_k^2\) distribution (here: \(k=2\)).1
H0: \(\alpha_1 = \alpha_2 = 0\) vs. HA: \(\alpha_1 \neq 0\) and/or \(\alpha_2 \neq 0\) for the model \(u_i^2 = \alpha_0 + \alpha_1 \text{Ratio}_i + \alpha_2 \text{Income}_i + w_i\)
Breusch-Pagan test statistic: \(\widehat{\text{LM}} \approx 0.014\)
p-value \(\approx 0.993\)
∴ Fail to reject H0 (the p-value is greater than 0.05)
Conclusion: We do not find statistically significant evidence of heteroskedasticity at the 5-percent level.
(We find no evidence of a linear relationship between \(u_i^2\) and the explanatory variables.)
The Breusch-Pagan test statistic and its null distribution