Linear Regression Model and the OLS Estimator

Effect of the Parametrization of the DGP

Topic of the module

Understand the effect of ...

(1) increasing the sample size \(n\),...

(2) changing the variance of \(u_{i}\), i.e., \(\sigma_{u}^{2}\),...

(3) changing the variance of \(X_{i}\), i.e., \(\sigma_{X}^{2}\),...

... on the sampling distribution of the OLS estimator for the slope coefficient of a simple linear regression model.

To conduct hypothesis tests for the slope coefficient, we have to standardize the OLS estimate by accounting for (1) the sample size \(n\), (2) the variance of \(u_{i}\), and (3) the variance of \(X_{i}\).


Data generating process (DGP)

Assume that \(n\) observations are generated from the simple regression model,

$$ \begin{align} Y_i = \beta_{0} + \beta_{1} X_{i} + u_{i}, \end{align} $$

where the intercept is set to \(\beta_{0}=0\) and \(\beta_{1}\) is the slope parameter.

Furthermore, assume \(X_{i}\) and \(u_{i}\) are i.i.d. normally distributed, i.e.,

$$ \begin{align} X_{i} \sim N\left(0, \sigma_{X}^{2}\right), \;\;\;\;\; u_{i} \sim N\left(0, \sigma_{u}^{2}\right), \end{align} $$

where \(\sigma_{X}^{2}\) and \(\sigma_{u}^{2}\) are the variances of \(X_{i}\) and \(u_{i}\), respectively.
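A single realization of this DGP can be simulated in a few lines of Python. This is a minimal sketch; the concrete parameter values below (\(n\), \(\beta_{1}\), \(\sigma_{X}\), \(\sigma_{u}\)) are illustrative choices, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter choices (not prescribed by the model)
n = 100
beta0, beta1 = 0.0, 2.0          # beta0 = 0 as in the DGP above
sigma_X, sigma_u = 1.0, 1.0

# Draw i.i.d. normal realizations of X_i and u_i
X = rng.normal(0.0, sigma_X, size=n)
u = rng.normal(0.0, sigma_u, size=n)

# Construct Y_i from the regression model
Y = beta0 + beta1 * X + u
```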


Estimator and parameter of interest

We are interested in the sampling properties of the OLS estimator \(\widehat{\beta}_{1}\) given by,

$$ \begin{align} \widehat{\beta}_{1} = \frac{\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)\left(Y_{i} - \overline{Y}\right)}{\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)^{2}}, \end{align} $$

as estimator for the slope parameter \(\beta_{1}\) of the regression model above.
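The estimator translates directly into a one-line function. As a sanity check, on noise-free toy data (a hypothetical input, not from the module) it recovers the slope exactly; the function name `ols_slope` is ours.

```python
import numpy as np

def ols_slope(X, Y):
    """OLS estimator of the slope: ratio of demeaned cross products."""
    Xd = X - X.mean()
    return (Xd * (Y - Y.mean())).sum() / (Xd ** 2).sum()

# Noise-free toy data: the estimator recovers the slope exactly
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 0.5 + 2.0 * X
print(ols_slope(X, Y))  # → 2.0
```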

Illustration

Change the parameters and see the effect on the properties of the OLS estimator \(\widehat{\beta}_{1}\) as estimator for \(\beta_{1}\).

Parameters


Sample Size \(n\)


Slope Coef. \(\beta_{1}\)


Variance of \(u_{i}\)


Variance of \(X_{i}\)


Scatter plot (realizations)

The red fitted regression line is based on the regression of,

$$ \begin{align} Y_{i} \;\;\;\;\; \text{on} \;\;\;\;\; X_{i}. \end{align} $$

The scatter plots and the fitted regression lines represent the result for one realization of the DGP. The shaded area illustrates the range of all fitted regression lines across all \(10,\!000\) realizations of the DGP.

Scatter plot (fitted residuals)

The fitted residuals, i.e., the estimates of the unobserved errors \(u_{i}\), are constructed as,

$$ \begin{align} \widehat{u}_{i} = Y_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1} X_{i}, \end{align} $$

and illustrated for one realization of the DGP, where \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) are the respective OLS estimates.
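The two by-construction properties noted in the figure explanations (zero mean and zero sample covariance with x when an intercept is estimated) can be verified numerically. A sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(0.0, 1.0, size=n)
Y = 2.0 * X + rng.normal(0.0, 1.0, size=n)  # DGP with beta0 = 0, beta1 = 2

# OLS estimates of intercept and slope
Xd = X - X.mean()
b1 = (Xd * (Y - Y.mean())).sum() / (Xd ** 2).sum()
b0 = Y.mean() - b1 * X.mean()

# Fitted residuals
u_hat = Y - b0 - b1 * X

# Zero mean and zero sample covariance with X, up to floating-point error
print(abs(u_hat.mean()) < 1e-10, abs((u_hat * Xd).mean()) < 1e-10)  # → True True
```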

Histogram of the OLS estimates \(\widehat{\beta}_{1}\)
Consistency:

As the sample size \(n\) grows, the OLS estimator \(\widehat{\beta}_{1}\) gets closer to \(\beta_{1}\),

$$ \begin{align} \widehat{\beta}_{1} \overset{p}{\to} \beta_{1}. \end{align} $$
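A small simulation illustrates this: the spread of the slope estimates across realizations shrinks as \(n\) grows, roughly like \(1/\sqrt{n}\). The function name `slope_spread` and all parameter values are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def slope_spread(n, reps=1_000, beta1=2.0, sigma_X=1.0, sigma_u=1.0):
    """Standard deviation of the OLS slope estimates across DGP realizations."""
    b1s = np.empty(reps)
    for r in range(reps):
        X = rng.normal(0.0, sigma_X, size=n)
        Y = beta1 * X + rng.normal(0.0, sigma_u, size=n)  # beta0 = 0
        Xd = X - X.mean()
        b1s[r] = (Xd * (Y - Y.mean())).sum() / (Xd ** 2).sum()
    return b1s.std()

# Spread of the estimates shrinks as n grows (roughly like 1 / sqrt(n))
sds = {n: slope_spread(n) for n in (25, 100, 400)}
print(sds)
```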
Histogram of the standardized OLS estimates \(z_{\widehat{\beta}_{1}}\)
Asymptotic Normality:

As the sample size \(n\) grows, the distribution of the standardized OLS estimator,

$$ \begin{align} z_{\widehat{\beta}_1} &= \frac{\widehat{\beta}_1 - \beta_1}{\sigma_{\widehat{\beta}_{1}}}, \end{align} $$

gets closer to the standard normal distribution \(N\left(0, 1\right)\).

More Details

  1. A realization of the DGP specified above is simulated:
    1. I.i.d. sequences of realizations \(X_{1}, X_{2}, ..., X_{n}\) and \(u_{1}, u_{2}, ..., u_{n}\) are drawn from the distributions of \(X_{i}\) and \(u_{i}\) above.
    2. Based on the sequences \(X_{1}, X_{2}, ..., X_{n}\) and \(u_{1}, u_{2}, ..., u_{n}\), a sequence of observations \(Y_{1}, Y_{2}, ..., Y_{n}\) is constructed using the regression model above.
  2. Based on the observations \(Y_{1}, Y_{2}, ..., Y_{n}\) and \(X_{1}, X_{2}, ..., X_{n}\), the OLS estimates and standardized OLS estimates for the intercept \(\beta_{0}\) and the slope parameter \(\beta_{1}\) are calculated.
  3. The values of the OLS estimate \(\widehat{\beta}_{1}\) and the standardized OLS estimate \(z_{\widehat{\beta}_{1}}\) are stored.
  4. Steps 1 to 3 are repeated \(10,\!000\) times, resulting in \(10,\!000\) OLS and standardized OLS estimates.
  5. The distributions of the OLS and standardized OLS estimates are illustrated using histograms.
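The steps above can be sketched as one Monte Carlo loop. The parameter values are illustrative, and the variance of \(\widehat{\beta}_{1}\) is estimated with the heteroskedasticity-robust formula discussed in this section.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameter choices
n, beta1 = 100, 2.0
sigma_X, sigma_u = 1.0, 1.0
reps = 10_000

estimates = np.empty(reps)
z_scores = np.empty(reps)

for r in range(reps):
    # Step 1: draw X_i, u_i and construct Y_i (beta0 = 0 in the DGP)
    X = rng.normal(0.0, sigma_X, size=n)
    u = rng.normal(0.0, sigma_u, size=n)
    Y = beta1 * X + u

    # Step 2: OLS estimates and fitted residuals
    Xd = X - X.mean()
    b1 = (Xd * (Y - Y.mean())).sum() / (Xd ** 2).sum()
    b0 = Y.mean() - b1 * X.mean()
    u_hat = Y - b0 - b1 * X

    # Heteroskedasticity-robust estimate of Var(beta1_hat)
    var_b1 = ((Xd**2 * u_hat**2).sum() / (n - 2)) / (n * (Xd**2).mean() ** 2)

    # Step 3: store the OLS estimate and the standardized estimate
    estimates[r] = b1
    z_scores[r] = (b1 - beta1) / np.sqrt(var_b1)

# Step 5: histograms of `estimates` and `z_scores` would be plotted here
print(estimates.mean(), z_scores.std())
```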

For the construction of the standardized OLS estimator \(z_{\widehat{\beta}_{1}}\), the variance of \(\widehat{\beta}_{1}\), i.e., \(\sigma_{\widehat{\beta}_{1}}^{2}\), has to be estimated.

The variance of \(\widehat{\beta}_1\), i.e., \(\sigma_{\widehat{\beta}_{1}}^{2}\), can be robustly estimated by,

$$ \begin{align} \widehat{\sigma}_{\widehat{\beta}_{1}}^{2} = \frac{1}{n} \times \frac{\frac{1}{n-2}\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)^{2}\widehat{u}_{i}^{2}}{\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)^{2}\right]^{2}}, \end{align} $$

where \(\widehat{u}_{i}\) are the residuals of the estimated regression line.

Note, the estimator for \(\sigma_{\widehat{\beta}_{1}}^{2}\) above is robust w.r.t. heteroskedasticity, i.e., it does not rely on the assumption of homoskedasticity.

In contrast, some statistical software reports estimates of \(\sigma_{\widehat{\beta}_{1}}^{2}\) based on the assumption of homoskedasticity.

The so-called homoskedasticity-only estimator of \(\sigma_{\widehat{\beta}_{1}}^{2}\) is given by,

$$ \begin{align} \widetilde{\sigma}_{\widehat{\beta}_{1}}^{2} = \frac{\frac{1}{n-2}\sum_{i=1}^{n}\widehat{u}_{i}^{2}}{\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)^{2}}. \end{align} $$
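Both variance estimators can be written as short functions. Under homoskedastic errors, as in the DGP above where \(\sigma_{u}^{2}\) does not depend on x, the two estimates should be close in large samples; the helper names `robust_var` and `homosk_var` are ours.

```python
import numpy as np

def robust_var(X, u_hat):
    """Heteroskedasticity-robust estimator of Var(beta1_hat)."""
    n = X.size
    Xd = X - X.mean()
    num = (Xd ** 2 * u_hat ** 2).sum() / (n - 2)
    den = ((Xd ** 2).mean()) ** 2
    return num / (n * den)

def homosk_var(X, u_hat):
    """Homoskedasticity-only estimator of Var(beta1_hat)."""
    n = X.size
    Xd = X - X.mean()
    return ((u_hat ** 2).sum() / (n - 2)) / (Xd ** 2).sum()

# Compare the two on one simulated sample with homoskedastic errors
rng = np.random.default_rng(3)
n = 5_000
X = rng.normal(0.0, 1.0, size=n)
Y = 2.0 * X + rng.normal(0.0, 1.0, size=n)
Xd = X - X.mean()
b1 = (Xd * (Y - Y.mean())).sum() / (Xd ** 2).sum()
b0 = Y.mean() - b1 * X.mean()
u_hat = Y - b0 - b1 * X
print(robust_var(X, u_hat), homosk_var(X, u_hat))  # close under homoskedasticity
```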
The figure shows:
The scatter plot of the values of x and y and the corresponding fitted regression line estimated by OLS for one particular realization of the underlying DGP.
The red shaded area illustrates the range of all fitted regression lines estimated by OLS across all realizations for the underlying DGP.
Increasing the sample size decreases the range of the different fitted regression lines estimated by OLS across different realizations of the underlying DGP, illustrated by the red shaded area.
Changing the slope coefficient of the underlying DGP rotates the fitted regression lines estimated by OLS.
Increasing the variance of u of the underlying DGP increases the range of the different fitted regression lines estimated by OLS across different realizations of the underlying DGP.
Thus, a high variance of the unobserved component u in the regression model of the underlying DGP increases the sampling uncertainty of the estimated effect of the variable x.
Increasing the variance of x of the underlying DGP decreases the range of the different fitted regression lines estimated by OLS across different realizations of the underlying DGP.
Thus, a high variance of the variable of interest x of the underlying DGP decreases the sampling uncertainty of the estimated effect of the variable x.
The figure shows:
The scatter plot of the values of x and the fitted residuals of a simple linear regression model estimated by OLS.
Note, since the estimated parameters include an intercept and a slope coefficient, the fitted residuals have a mean equal to zero and are uncorrelated with x by construction of the OLS estimator.
Increasing the sample size increases the number of fitted residuals estimated by OLS.
Changing the slope coefficient does not change the relationship between the fitted residuals and x.
Increasing the variance of u of the underlying DGP increases the spread of the fitted residuals along the y-axis, i.e., the residuals get larger for all values of x.
Increasing the variance of x of the underlying DGP increases the spread of the fitted residuals along the x-axis; their spread along the y-axis is governed by the variance of u and remains unchanged.
The figure shows:
The histogram of the estimated slope coefficient across all realizations of the underlying DGP.
The red vertical dashed line represents the estimated slope coefficient for one particular realization of the underlying DGP. The green vertical dashed line represents the slope coefficient of the underlying DGP.
By increasing the sample size the estimated slope coefficients concentrate more around the slope coefficient of the underlying DGP.
This is the result of the law of large numbers.
Thus, to conduct hypothesis tests we need a sampling distribution which is stable across sample sizes. For this we have to standardize the estimate using the sample size.
Changing the slope coefficient of the underlying DGP shifts the histogram of the estimated slope coefficient.
Thus, to conduct hypothesis tests that the slope coefficient is equal to a particular hypothetical value, we have to standardize the estimate using that hypothetical value.
Increasing the variance of u of the underlying DGP increases the spread of the OLS estimates of the slope coefficient.
Thus, to conduct hypothesis tests we need a sampling distribution which is stable for different values of the variance of u of the underlying DGP. For this we have to standardize the estimate using the spread of u.
Increasing the variance of x of the underlying DGP decreases the spread of the OLS estimates of the slope coefficient.
Thus, to conduct hypothesis tests we need a sampling distribution which is stable for different values of the variance of x of the underlying DGP. For this we have to standardize the estimate using the spread of x.
The figure shows:
The histogram of the standardized estimated slope coefficient across all realizations of the DGP.
For the standardization we subtract the slope coefficient and divide by the standard error, i.e., the square root of an estimate of the variance of the estimated slope coefficient.
The red vertical dashed line represents the standardized estimated slope coefficient for one particular realization of the DGP.
The green dashed curve represents the pdf of the standard normal distribution.
Note, the estimate for the variance of the estimated slope coefficient is explained below and is a function of the sample size, the variance of x, and the fitted residuals.
By increasing the sample size, the sampling distribution of the standardized estimated slope coefficient gets closer to the standard normal distribution, whose pdf is illustrated by the green dashed curve.
This is the result of the central limit theorem.
Note, the sampling distribution of the standardized estimate is stable across sample sizes.
Thus, for large n the standardized estimate can be used to conduct hypothesis tests.
Changing the slope coefficient of the underlying DGP does not affect the limiting distribution of the standardized estimated slope coefficient, i.e., for large n.
This is true since we subtract the slope coefficient of the underlying DGP for the standardization.
Thus, for large n the standardized estimate can be used to conduct hypothesis tests.
Increasing the variance of u of the underlying DGP does not affect the limiting distribution of the standardized estimated slope coefficient for large n.
This is true since we account for the variance of u for the standardization.
Thus, for large n the standardized estimate can be used to conduct hypothesis tests.
Increasing the variance of x of the underlying DGP does not affect the limiting distribution of the standardized estimated slope coefficient for large n.
This is true since we account for the variance of x for the standardization.
Thus, for large n the standardized estimate can be used to conduct hypothesis tests.

This module is part of the DeLLFi project of the University of Hohenheim and funded by the
Foundation for Innovation in University Teaching

Creative Commons License