Linear Regression Model and the OLS Estimator

Effect of the Sample Size

Topic of the module

Understand the procedure to simulate the sampling distribution of the OLS estimator for the slope parameter of a linear regression model.

A realization of the DGP (see below) is simulated and the value of the estimator (see below), i.e., an estimate, is calculated.

This exercise is repeated \(10,\!000\) times resulting in \(10,\!000\) values for the estimators, i.e., \(10,\!000\) estimates. The distribution of these \(10,\!000\) estimates is the sampling distribution of the estimator.

Based on the simulation understand the effect of increasing the sample size \(n\) on the sampling distribution of the OLS estimator.


Data generating process (DGP)

Consider \(n\) observations generated from the simple regression model,

$$ \begin{align} Y_i = \beta_{0} + \beta_{1} X_{i} + u_{i}, \end{align} $$

where \(\beta_{0}=0\) is the intercept and \(\beta_{1}=1\) is the slope parameter.

Furthermore, \(X_{i}\) and \(u_{i}\) are i.i.d. normally distributed, i.e.,

$$ \begin{align} X_{i} \sim N\left(0, \sigma_{X}^{2}\right), \;\;\;\;\; u_{i} \sim N\left(0, \sigma_{u}^{2}\right), \end{align} $$

where \(\sigma_{X}=5\) and \(\sigma_{u}=5\) are the standard deviations of \(X_{i}\) and \(u_{i}\), respectively.
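
As a rough illustration, one realization of this DGP can be simulated in a few lines of NumPy. This is a minimal sketch, not part of the module itself; the variable names, the seed, and the choice of \(n=100\) are our own, while the parameter values follow the DGP above.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seed chosen only for reproducibility

n = 100                      # sample size (illustrative choice)
beta0, beta1 = 0.0, 1.0      # intercept and slope of the DGP
sigma_X, sigma_u = 5.0, 5.0  # standard deviations of X_i and u_i

X = rng.normal(loc=0.0, scale=sigma_X, size=n)  # X_i ~ N(0, sigma_X^2), i.i.d.
u = rng.normal(loc=0.0, scale=sigma_u, size=n)  # u_i ~ N(0, sigma_u^2), i.i.d.
Y = beta0 + beta1 * X + u                       # Y_i from the regression model
```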


Estimator and parameter of interest

We are interested in the sampling properties of the OLS estimator \(\widehat{\beta}_{1}\) given by,

$$ \begin{align} \widehat{\beta}_{1} = \frac{\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)\left(Y_{i} - \overline{Y}\right)}{\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)^{2}}, \end{align} $$

as estimator for the slope parameter \(\beta_{1}\) of the regression model above.
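
In code, the estimator translates directly from this formula. The helper below is a minimal sketch (the function name is ours, not part of the module):

```python
import numpy as np

def ols_slope(X: np.ndarray, Y: np.ndarray) -> float:
    """OLS slope estimator: sum((X_i - mean(X)) * (Y_i - mean(Y))) / sum((X_i - mean(X))**2)."""
    x_dev = X - X.mean()
    return float(np.sum(x_dev * (Y - Y.mean())) / np.sum(x_dev ** 2))
```

Applied to a single realization of the DGP, this returns one estimate \(\widehat{\beta}_{1}\); repeating it across many realizations yields the sampling distribution studied below.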

Illustration

Change the parameters and see the effect on the properties of the OLS estimator \(\widehat{\beta}_{1}\) as estimator for \(\beta_{1}\).

Parameters

  1. Realization \(r\) of the DGP
  2. Sample size \(n\)


Scatter plot (realizations)

The red fitted regression line is based on the regression of \(Y_{i}\) on \(X_{i}\).

The scatter plots and the fitted regression lines represent the result for one realization of the DGP. The shaded area illustrates the range of all fitted regression lines across all \(10,\!000\) realizations of the DGP.

Scatter plot (fitted residuals)

The fitted OLS residuals are

$$ \begin{align} \widehat{u}_{i} = Y_{i} - \widehat{\beta}_{0} - \widehat{\beta}_{1} X_{i}, \end{align} $$

and are illustrated for one realization of the DGP, where \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) are the respective OLS estimates.
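
A corresponding sketch for the fitted residuals is given below. It recomputes the OLS intercept and slope so that it runs on its own; again, the function name is ours.

```python
import numpy as np

def ols_residuals(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Fitted residuals from an OLS regression of Y on a constant and X."""
    x_dev = X - X.mean()
    beta1_hat = np.sum(x_dev * (Y - Y.mean())) / np.sum(x_dev ** 2)  # OLS slope
    beta0_hat = Y.mean() - beta1_hat * X.mean()                      # OLS intercept
    return Y - beta0_hat - beta1_hat * X
```

By construction, these residuals have mean zero and are uncorrelated with \(X\) in the sample.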

Histogram of the OLS estimates \(\widehat{\beta}_{1}\)
Consistency:

As the sample size \(n\) grows, the OLS estimator \(\widehat{\beta}_{1}\) gets closer to \(\beta_1\),

$$ \begin{align} \widehat{\beta}_{1} \overset{p}{\to} \beta_{1}. \end{align} $$

Note that to conduct hypothesis tests we need a sampling distribution which is stable across \(n\). For this we have to standardize or scale the estimate.

Histogram of the standardized OLS estimates \(z_{\widehat{\beta}_{1}}\)
Asymptotic Normality:

As the sample size \(n\) grows, the distribution of the standardized OLS estimator,

$$ \begin{align} z_{\widehat{\beta}_1} &= \frac{\widehat{\beta}_1 - \beta_1}{\sigma_{\widehat{\beta}_{1}}}, \end{align} $$

gets closer to the standard normal distribution \(N\left(0, 1\right)\).

Note that the sampling distribution of the standardized estimates is stable across \(n\). Thus, it can be used to conduct hypothesis tests.
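
In code, the standardization itself is a one-liner once an estimate of \(\sigma_{\widehat{\beta}_{1}}\) is available (its construction is described under "More Details" below); the function and argument names here are ours:

```python
def standardize_slope(beta1_hat: float, beta1: float, se_beta1_hat: float) -> float:
    """Standardized OLS estimate: subtract the true slope, divide by the estimated standard deviation."""
    return (beta1_hat - beta1) / se_beta1_hat
```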

More Details

  1. A realization of the DGP specified above is simulated:
    1. An i.i.d. sequence of realizations \(X_{1}, X_{2}, ..., X_{n}\) and \(u_{1}, u_{2}, ..., u_{n}\) is drawn from the distributions of \(X_{i}\) and \(u_{i}\) above.
    2. Based on \(X_{1}, X_{2}, ..., X_{n}\) and \(u_{1}, u_{2}, ..., u_{n}\), a sequence of observations \(Y_{1}, Y_{2}, ..., Y_{n}\) is constructed using the regression model above.
  2. Based on the observations \(Y_{1}, Y_{2}, ..., Y_{n}\) and \(X_{1}, X_{2}, ..., X_{n}\), the OLS estimates and standardized OLS estimates for the intercept \(\beta_{0}\) and slope parameter \(\beta_{1}\) are calculated.
  3. The values of the OLS estimate \(\widehat{\beta}_{1}\) and the standardized OLS estimate \(z_{\widehat{\beta}_{1}}\) are stored.
  4. Steps 1 to 3 are repeated \(10,\!000\) times, resulting in \(10,\!000\) OLS and standardized OLS estimates.
  5. The distributions of the OLS and standardized OLS estimates are illustrated using histograms (see the code sketch after this list).
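
The whole procedure can be sketched as a short Monte Carlo loop. This is a minimal, self-contained sketch of steps 1 to 5; the seed, the choice of \(n=100\), and all variable names are our own, and the variance estimate inside the loop uses the heteroskedasticity-robust formula given further below.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, reps = 100, 10_000
beta0, beta1 = 0.0, 1.0
sigma_X, sigma_u = 5.0, 5.0

estimates = np.empty(reps)  # beta1_hat for every realization of the DGP
z_values = np.empty(reps)   # standardized estimates

for r in range(reps):
    # Step 1: simulate one realization of the DGP.
    X = rng.normal(0.0, sigma_X, size=n)
    u = rng.normal(0.0, sigma_u, size=n)
    Y = beta0 + beta1 * X + u

    # Step 2: OLS estimates of the intercept and slope.
    x_dev = X - X.mean()
    b1 = np.sum(x_dev * (Y - Y.mean())) / np.sum(x_dev ** 2)
    b0 = Y.mean() - b1 * X.mean()

    # Heteroskedasticity-robust variance estimate of beta1_hat (formula given below).
    u_hat = Y - b0 - b1 * X
    var_b1 = (np.sum(x_dev ** 2 * u_hat ** 2) / (n - 2)) / (n * np.mean(x_dev ** 2) ** 2)

    # Steps 3-4: store the estimate and its standardized counterpart.
    estimates[r] = b1
    z_values[r] = (b1 - beta1) / np.sqrt(var_b1)

# Step 5: histograms of `estimates` and `z_values` approximate the two sampling distributions.
```

Rerunning the loop for larger \(n\) should reproduce the two effects discussed above: the histogram of `estimates` concentrates around \(\beta_{1}=1\) (consistency), while the histogram of `z_values` stays close to the standard normal density (asymptotic normality).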

For the construction of the standardized OLS estimator \(z_{\widehat{\beta}_{1}}\), the variance of \(\widehat{\beta}_{1}\), i.e., \(\sigma_{\widehat{\beta}_{1}}^{2}\), has to be estimated.

The variance of \(\widehat{\beta}_1\), i.e., \(\sigma_{\widehat{\beta}_{1}}^{2}\), can be robustly estimated by,

$$ \begin{align} \widehat{\sigma}_{\widehat{\beta}_{1}}^{2} = \frac{1}{n} \times \frac{\frac{1}{n-2}\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)^{2}\widehat{u}_{i}^{2}}{\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)^{2}\right]^{2}}, \end{align} $$

where \(\widehat{u}_{i}\) are the residuals of the estimated regression line.

Note that the estimator for \(\sigma_{\widehat{\beta}_{1}}^{2}\) above is robust with respect to heteroskedasticity, i.e., it does not rely on the assumption of homoskedasticity.
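
As a sketch, the robust estimator above could be computed as follows; the function and variable names are illustrative, and the residuals are recomputed inside the function so that the snippet runs on its own.

```python
import numpy as np

def robust_var_beta1(X: np.ndarray, Y: np.ndarray) -> float:
    """Heteroskedasticity-robust estimate of Var(beta1_hat) for an OLS regression of Y on a constant and X."""
    n = len(X)
    x_dev = X - X.mean()
    b1 = np.sum(x_dev * (Y - Y.mean())) / np.sum(x_dev ** 2)
    b0 = Y.mean() - b1 * X.mean()
    u_hat = Y - b0 - b1 * X
    numerator = np.sum(x_dev ** 2 * u_hat ** 2) / (n - 2)
    denominator = np.mean(x_dev ** 2) ** 2
    return float(numerator / denominator / n)
```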

In contrast, some statistical software reports estimates of \(\sigma_{\widehat{\beta}_{1}}^{2}\) based on the assumption of homoskedasticity.

The so-called homoskedasticity-only estimator of \(\sigma_{\widehat{\beta}_{1}}^{2}\) is given by,

$$ \begin{align} \widetilde{\sigma}_{\widehat{\beta}_{1}}^{2} = \frac{\frac{1}{n-2}\sum_{i=1}^{n}\widehat{u}_{i}^{2}}{\sum_{i=1}^{n}\left(X_{i} - \overline{X}\right)^{2}}. \end{align} $$
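
A corresponding sketch of the homoskedasticity-only estimator, under the same illustrative naming conventions as above:

```python
import numpy as np

def homoskedastic_var_beta1(X: np.ndarray, Y: np.ndarray) -> float:
    """Homoskedasticity-only estimate of Var(beta1_hat) for an OLS regression of Y on a constant and X."""
    n = len(X)
    x_dev = X - X.mean()
    b1 = np.sum(x_dev * (Y - Y.mean())) / np.sum(x_dev ** 2)
    b0 = Y.mean() - b1 * X.mean()
    u_hat = Y - b0 - b1 * X
    return float(np.sum(u_hat ** 2) / (n - 2) / np.sum(x_dev ** 2))
```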

The figure shows:
The scatter plot of the values of \(X\) and \(Y\) and the corresponding fitted regression line estimated by OLS for one particular realization of the underlying DGP.
The red shaded area illustrates the range of all fitted regression lines estimated by OLS across all realizations of the underlying DGP.
A different realization of the DGP results in a different estimate and thus a different fitted regression line estimated by OLS, illustrated by the red dashed line.
Increasing the sample size decreases the range of the different fitted regression lines across all realizations of the underlying DGP, illustrated by the red shaded area.

The figure shows the scatter plot of the values of \(X\) and the fitted residuals of a simple linear regression model estimated by OLS.
Note that since the estimated parameters include an intercept and a slope coefficient, the fitted residuals have a mean equal to zero and are uncorrelated with \(X\) by construction of the OLS estimator.
A different realization of the DGP results in a different estimate and thus different fitted residuals estimated by OLS, illustrated by the scatter plot.
Increasing the sample size increases the number of fitted residuals estimated by OLS.

The figure shows:
The histogram of the estimated slope coefficient across all realizations of the underlying DGP.
The red vertical dashed line represents the estimated slope coefficient for one particular realization of the underlying DGP. The green vertical dashed line represents the slope coefficient of the underlying DGP.
A different realization of the underlying DGP results in a different estimate for the slope coefficient, represented by the red vertical dashed line.
By increasing the sample size, the estimated slope coefficients concentrate more around the slope coefficient of the underlying DGP.
This is the result of the law of large numbers.
Thus, to conduct hypothesis tests we need a sampling distribution which is stable across sample sizes. For this we have to standardize the estimate using the sample size.

The figure shows:
The histogram of the standardized estimated slope coefficient across all realizations of the DGP.
For the standardization we subtract the slope coefficient and divide by an estimate of the standard deviation of the estimated slope coefficient.
The red vertical dashed line represents the standardized estimated slope coefficient for one particular realization of the DGP.
The green dashed curve represents the pdf of the standard normal distribution.
Note that the estimate for the variance of the estimated slope coefficient is explained in the "More Details" section and is a function of the sample size, the variation in \(X\), and the fitted residuals.
A different realization of the underlying DGP results in a different standardized estimate for the slope coefficient, represented by the red vertical dashed line.
By increasing the sample size, the sampling distribution of the standardized estimated slope coefficient gets closer to the standard normal distribution, whose pdf is illustrated by the green dashed curve.
This is the result of the central limit theorem. Note that the sampling distribution of the standardized estimate is stable across the sample size. Thus, for large \(n\) the standardized estimate can be used to conduct hypothesis tests.

This module is part of the DeLLFi project of the University of Hohenheim and is funded by the Foundation for Innovation in University Teaching.

Creative Commons License