Linear Regression Model and the OLS Estimator

Effect of Omitted Variable Bias (OVB)

Topic of the module

Understand the effect of omitted variable bias (OVB) on the sampling distribution of the OLS estimator.


Data generating process (DGP)

Observable and Unobservable Statistical Model

Suppose we are interested in the relationship between the variables \(Y\) and \(X_{1}\).

However, suppose the variable \(Y\) is determined by an additional unobservable variable \(X_{2}\).

To estimate the relationship between \(Y\) and \(X_{1}\) we estimate the following observable linear regression model,

$$ \begin{align} Y_i = \widetilde{\beta}_{0} + \widetilde{\beta}_{1} X_{1i} + \widetilde{u}_{i}. \end{align} $$

However, the unobservable relationship is given by,

$$ \begin{align} Y_{i} = \beta_{0} + \beta_{1} X_{1i} + \beta_{2} X_{2i} + u_{i}. \end{align} $$

Note, if the effect of \(X_{2}\) is known, the unobservable relationship can also be written as,

$$ \begin{align} \left(Y_{i} - \beta_{2} X_{2i}\right) &= \beta_{0} + \beta_{1} X_{1i} + u_{i} \\ Y_{i}^{adj} &= \beta_{0} + \beta_{1} X_{1i} + u_{i}, \end{align} $$

where \(Y_{i}^{adj} = \left(Y_{i} - \beta_{2} X_{2i}\right)\) is the dependent variable adjusted for the effect of \(X_{2}\).

Furthermore, the unobserved component \(\widetilde{u}_{i}\) of the observable regression model can be written as,

$$ \begin{align} \widetilde{u}_{i} = \beta_{2} X_{2i} + u_{i}. \end{align} $$

Finally, \(X_{1i}\), \(X_{2i}\) and \(u_{i}\) are generated by i.i.d. draws from the following distributions

$$ \begin{align} \left(\begin{array}{c} X_{1i} \\ X_{2i} \end{array}\right) \sim N\left( \left(\begin{array}{c} 0 \\ 0 \end{array}\right), \; \left(\begin{array}{cc} \sigma_{X_{1}}^{2} & \rho_{X_{1}X_{2}}\sigma_{X_{1}}\sigma_{X_{2}} \\ \rho_{X_{1}X_{2}}\sigma_{X_{1}}\sigma_{X_{2}} & \sigma_{X_{2}}^{2} \end{array}\right) \right), \qquad u_{i} \sim N\left(0, \sigma_{u}^{2}\right). \end{align} $$
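The DGP above can be simulated directly. A minimal sketch in Python with NumPy, assuming hypothetical parameter values (\(n = 1000\), unit standard deviations, \(\rho_{X_{1}X_{2}} = 0.5\), \(\beta_{0} = 0\), \(\beta_{1} = \beta_{2} = 1\)); the module lets you vary these interactively:

```python
import numpy as np

# Hypothetical parameter values (assumptions for illustration only).
n = 1_000
sigma_x1, sigma_x2, sigma_u = 1.0, 1.0, 1.0
rho = 0.5                      # Corr(X1, X2)
beta0, beta1, beta2 = 0.0, 1.0, 1.0

rng = np.random.default_rng(42)

# Joint normal draw of (X1, X2) with the covariance matrix from the DGP.
cov = np.array([[sigma_x1**2, rho * sigma_x1 * sigma_x2],
                [rho * sigma_x1 * sigma_x2, sigma_x2**2]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
x1, x2 = X[:, 0], X[:, 1]

# Error term and outcome from the unobservable (true) model.
u = rng.normal(0.0, sigma_u, size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + u
```

The econometrician then observes only `y` and `x1`, while `x2` and `u` stay hidden.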

Conditions for and Effects of Omitted Variable Bias

Omitted variable bias (OVB) occurs when the following two conditions are true:

  1. \(X_{1}\) is correlated with the omitted variable \(X_{2}\).
  2. The omitted variable is a determinant of the dependent variable, \(Y\).

The effect of omitted variable bias (OVB) can be quantified as,

$$ \begin{align} \widehat{\beta}_{1} &\overset{p}{\rightarrow} \beta_{1} + \beta_{2} \rho_{X_{1}X_{2}} \frac{\sigma_{X_{2}}}{\sigma_{X_{1}}}. \end{align} $$

Check the details about the effects of OVB below.
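The probability limit above can be checked numerically with a single large sample. A sketch under assumed parameter values (\(\beta_{1} = \beta_{2} = 1\), \(\rho_{X_{1}X_{2}} = 0.5\), \(\sigma_{X_{2}} = 2\), \(\sigma_{X_{1}} = \sigma_{u} = 1\)), for which the limit equals \(1 + 1 \cdot 0.5 \cdot 2 / 1 = 2\):

```python
import numpy as np

# Assumed parameter values: plim(b1_hat) = beta1 + beta2 * rho * sigma_x2 / sigma_x1.
n = 200_000
sigma_x1, sigma_x2, sigma_u = 1.0, 2.0, 1.0
rho, beta1, beta2 = 0.5, 1.0, 1.0

rng = np.random.default_rng(0)
cov = [[sigma_x1**2, rho * sigma_x1 * sigma_x2],
       [rho * sigma_x1 * sigma_x2, sigma_x2**2]]
x1, x2 = rng.multivariate_normal([0, 0], cov, size=n).T
y = beta1 * x1 + beta2 * x2 + rng.normal(0, sigma_u, size=n)

# OLS slope of the short regression of Y on X1 (with intercept).
b1_hat = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

plim = beta1 + beta2 * rho * sigma_x2 / sigma_x1   # = 2.0 here
```

With these values, `b1_hat` lands near 2 rather than the true \(\beta_{1} = 1\).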

Illustration

Change the parameters and see the effect on the properties of the OLS estimator \(\widehat{\beta}_{1}\) as estimator for \(\beta_{1}\).

Parameters


  - Sample size \(n\)
  - Effect of \(X_{1}\)
  - Effect of \(X_{2}\)
  - Corr. \(X_{1}\) & \(X_{2}\)


Scatter plot (observed and adjusted realizations)

The green fitted regression line is based on the regression of,

$$ \begin{align} Y_{i}^{adj} \;\;\;\;\; \text{on} \;\;\;\;\; X_{1i}, \end{align} $$

and represents the unobservable relationship.

The red fitted regression line is based on the regression of,

$$ \begin{align} Y_{i} \;\;\;\;\; \text{on} \;\;\;\;\; X_{1i}, \end{align} $$

and represents the observable relationship.

The scatter plots and the fitted regression lines represent the result for only one simulation. The shaded areas illustrate the range of all fitted regression lines across all simulation outcomes.

Scatter plot (fitted residuals)

The fitted unobserved residuals are constructed as,

$$ \begin{align} \widehat{u}_{i} = \widehat{\widetilde{u}}_{i} - \beta_{2} X_{2i}. \end{align} $$

The green fitted regression line is based on the regression of,

$$ \begin{align} \widehat{u}_{i} \;\;\;\;\; \text{on} \;\;\;\;\; X_{1i}. \end{align} $$

The scatter plot and the fitted regression line represent the result for only one simulation. The shaded area illustrates the range of all fitted regression lines across all simulation outcomes.
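This residual check can be sketched in simulation. Assuming unit variances and hypothetical parameters \(\beta_{1} = \beta_{2} = 1\), \(\rho_{X_{1}X_{2}} = 0.5\), the residuals of the observed (short) regression are uncorrelated with \(X_{1}\) by construction, while the adjusted residuals \(\widehat{u}_{i}\) are not:

```python
import numpy as np

# Assumed parameters for illustration: unit variances, beta1 = beta2 = 1, rho = 0.5.
n, rho, beta1, beta2 = 5_000, 0.5, 1.0, 1.0
rng = np.random.default_rng(3)
x1, x2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# Fit the observed (short) regression of Y on X1 and take its residuals.
x1c = x1 - x1.mean()
b1 = (x1c @ y) / (x1c @ x1c)
b0 = y.mean() - b1 * x1.mean()
resid_tilde = y - b0 - b1 * x1      # fitted residuals of the short model

# By construction these residuals are (numerically) uncorrelated with X1 ...
corr_short = np.corrcoef(x1, resid_tilde)[0, 1]

# ... but the residuals adjusted for the effect of X2 are not,
# revealing the violated exogeneity condition under OVB.
resid_adj = resid_tilde - beta2 * x2
corr_adj = np.corrcoef(x1, resid_adj)[0, 1]
```

Setting `rho = 0.0` or `beta2 = 0.0` instead makes `corr_adj` vanish as well, mirroring the no-OVB cases described above.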

Histogram of the OLS estimates \(\widehat{\beta}_{1}\)
Consistency:

As the sample size \(n\) grows, the OLS estimator \(\widehat{\beta}_{1}\) converges in probability to,

$$ \begin{align} \widehat{\beta}_{1} &\overset{p}{\rightarrow} \beta_{1} + \beta_{2} \rho_{X_{1}X_{2}} \frac{\sigma_{X_{2}}}{\sigma_{X_{1}}}, \end{align} $$

i.e., the OLS estimator is biased if \(\beta_{2} \neq 0\) and \(\rho_{X_{1}X_{2}} \neq 0\).

Histogram of the standardized OLS estimates \(z_{\widehat{\beta}_{1}}\)
Asymptotic Normality:

In the case of omitted variable bias (OVB), i.e., if \(\beta_{2} \neq 0\) and \(\rho_{X_{1}X_{2}} \neq 0\),

$$ \begin{align} z_{\widehat{\beta}_{1}} &= \frac{\widehat{\beta}_{1} - \beta_{1}}{\sigma_{\widehat{\beta}_{1}}}, \end{align} $$

does not converge to the standard normal distribution \(N\left(0, 1\right)\).
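The divergence of \(z_{\widehat{\beta}_{1}}\) under OVB can be illustrated by computing the statistic at two sample sizes: the bias is fixed while the standard error shrinks at rate \(1/\sqrt{n}\), so the statistic grows roughly like \(\sqrt{n}\). A sketch under assumed unit variances with \(\beta_{1} = \beta_{2} = 1\) and \(\rho_{X_{1}X_{2}} = 0.5\):

```python
import numpy as np

def z_stat(n, rng, beta1=1.0, beta2=1.0, rho=0.5):
    """t-type statistic (b1_hat - beta1) / se(b1_hat) for the short regression."""
    # Draw one sample from the DGP (unit variances assumed).
    x1, x2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    # OLS with intercept; slope standard error from the usual homoskedastic formula.
    x1c = x1 - x1.mean()
    b1 = (x1c @ y) / (x1c @ x1c)
    b0 = y.mean() - b1 * x1.mean()
    resid = y - b0 - b1 * x1
    se = np.sqrt(resid @ resid / (n - 2) / (x1c @ x1c))
    return (b1 - beta1) / se

rng = np.random.default_rng(1)
z_small, z_large = z_stat(100, rng), z_stat(10_000, rng)
# Under OVB the statistic drifts away from N(0, 1) as n grows.
```

Setting `beta2 = 0.0` or `rho = 0.0` instead keeps both statistics in the usual \(N(0,1)\) range regardless of \(n\).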

More Details

Suppose we are interested in the relationship between individual earnings \(\left(Y\right)\) and years of education \(\left(X_{1}\right)\).

However, suppose individual earnings \(\left(Y\right)\) are additionally determined by individual ability \(\left(X_{2}\right)\), which is unobservable, or hard to measure, by the econometrician.

The OLS estimator of regressing individual earnings on years of education is biased if,

  1. Years of education is correlated with individual ability, e.g., if individuals who are smarter (higher individual ability) learn longer (more years of education).
  2. Individual ability is a determinant of individual earnings, e.g., if individuals who are smarter earn more.
$$ \begin{align} \widehat{\beta}_{1} &= \beta_{1} + \frac{\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)\widetilde{u}_{i}}{\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)^{2}} \\ &= \beta_{1} + \frac{\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)\left(u_{i} + \beta_{2} X_{2i}\right)}{\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)^{2}} \\ &= \beta_{1} + \frac{\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)u_{i}}{\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)^{2}} + \beta_{2} \frac{\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)X_{2i}}{\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)^{2}}, \end{align} $$

where the first ratio converges in probability to zero since \(X_{1i}\) and \(u_{i}\) are uncorrelated.

Under the assumptions, (1) \(\left(X_{1i}, Y_{i}\right)\), \(i = 1,2,...,n\), are i.i.d., (2) \(X_{1i}\) and \(Y_{i}\) have nonzero finite fourth moments and (3) the omitted variable is centered, i.e., has mean zero, it follows that,

$$ \begin{align} \frac{1}{n}\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)^{2} &\overset{p}{\rightarrow} \sigma_{X_{1}}^{2} \\ \frac{1}{n}\sum_{i=1}^{n}\left(X_{1i}-\overline{X}_{1}\right)X_{2i} &\overset{p}{\rightarrow} \text{Cov}\left(X_{1i}, X_{2i}\right) = \rho_{X_{1}X_{2}}\sigma_{X_{1}}\sigma_{X_{2}}. \end{align} $$

Thus,

$$ \begin{align} \widehat{\beta}_{1} &\overset{p}{\rightarrow} \beta_{1} + \beta_{2} \rho_{X_{1}X_{2}} \frac{\sigma_{X_{2}}}{\sigma_{X_{1}}}. \end{align} $$

See also: Wooldridge, J. M. (2019). Introductory Econometrics: A Modern Approach. Cengage Learning. Ch. 3-3b - 3-3c.

  1. A realization of the DGP specified above is simulated:
    1. An i.i.d. sequence of realizations \(X_{11}, X_{12}, ..., X_{1n}\) and \(X_{21}, X_{22}, ..., X_{2n}\) is drawn from a multivariate normal distribution with the correlation structure \(\rho_{X_{1}X_{2}}\) specified above.
    2. In addition, an i.i.d. sequence of realizations \(u_{1}, u_{2}, ..., u_{n}\) is drawn from the distribution of \(u_{i}\) specified above.
    3. Based on the sequences \(X_{11}, X_{12}, ..., X_{1n}\), \(X_{21}, X_{22}, ..., X_{2n}\) and \(u_{1}, u_{2}, ..., u_{n}\), a sequence of observations \(Y_{1}, Y_{2}, ..., Y_{n}\) is constructed from the unobserved regression model above.
  2. Based on the observations \(Y_{1}, Y_{2}, ..., Y_{n}\) and \(X_{11}, X_{12}, ..., X_{1n}\), the OLS estimates and standardized OLS estimates for the intercept and slope parameter of the observed regression model are calculated.
  3. The values of the OLS estimate \(\widehat{\beta}_{1}\) and the standardized OLS estimate \(z_{\widehat{\beta}_{1}}\) are stored.
  4. Steps 1 to 3 are repeated \(10,\!000\) times, resulting in \(10,\!000\) OLS and standardized OLS estimates.
  5. The distributions of the OLS and standardized OLS estimates are illustrated using histograms.
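The steps above can be sketched as a Monte Carlo loop. A minimal version with assumed parameter values and fewer replications for speed:

```python
import numpy as np

# Assumed parameters for illustration; the module repeats this 10,000 times.
n, reps = 200, 2_000
beta0, beta1, beta2, rho = 0.0, 1.0, 1.0, 0.5
rng = np.random.default_rng(7)
cov = [[1.0, rho], [rho, 1.0]]

b1_hats = np.empty(reps)
for r in range(reps):
    # Steps 1a-1c: draw (X1, X2) and u, construct Y from the unobserved model.
    x1, x2 = rng.multivariate_normal([0, 0], cov, size=n).T
    u = rng.normal(size=n)
    y = beta0 + beta1 * x1 + beta2 * x2 + u
    # Steps 2-3: OLS slope of the observed regression of Y on X1, stored.
    x1c = x1 - x1.mean()
    b1_hats[r] = (x1c @ y) / (x1c @ x1c)

# Step 5: a histogram of b1_hats centers near beta1 + beta2 * rho = 1.5,
# not near the true beta1 = 1, illustrating the bias.
```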
The figure shows:
The scatter plot of the observed variables \(Y\) and \(X_{1}\) and the corresponding fitted regression line, in red, for one particular realization of the underlying DGP.
The scatter plot of \(Y\) and \(X_{1}\) adjusted for the effect of the unobserved variable \(X_{2}\) and the corresponding fitted regression line, in green, for one particular realization of the underlying DGP.
The red and green shaded areas illustrate the range of all fitted regression lines estimated by OLS across all realizations of the underlying DGP.
Note, the green fitted regression line represents the correct, or unbiased, relationship between \(Y\) and \(X_{1}\), whereas the red fitted regression line is potentially biased due to the omission of the unobserved variable \(X_{2}\).
Increasing the sample size decreases the range of the fitted regression lines across different realizations of the underlying DGP, illustrated by the red and green shaded areas.
Changing the slope coefficient of \(X_{1}\) in the underlying DGP rotates the fitted regression lines.
As long as the correlation between \(X_{1}\) and \(X_{2}\) is zero, moving the slope coefficient of \(X_{2}\) away from zero only increases the range of the fitted regression lines of the observed \(Y\) on \(X_{1}\) across different realizations of the underlying DGP, illustrated by the red shaded area.
This is the case of omitting a variable that is relevant for the determination of \(Y\) but unrelated to the variable of interest.
Including such a variable improves the fit of the regression model, but it is not necessary for causal inference with respect to the variable of interest.
However, if the correlation between \(X_{1}\) and \(X_{2}\) is not zero, moving the slope coefficient of \(X_{2}\) away from zero rotates the fitted regression lines of the observed \(Y\) on \(X_{1}\), while the fitted regression lines based on the adjusted variables are not affected.
Note, in this case the red shaded area no longer includes the green shaded area, i.e., the regression of the observed \(Y\) on \(X_{1}\) is biased.
As long as the slope coefficient of \(X_{2}\) is zero, moving the correlation between \(X_{1}\) and \(X_{2}\) away from zero does not affect the range of the fitted regression lines of the observed \(Y\) on \(X_{1}\) across different realizations of the underlying DGP, illustrated by the red shaded area.
This is the case of an irrelevant variable; including it in the regression would only increase the variance of the estimate for the variable of interest.
However, if the slope coefficient of \(X_{2}\) is not zero, moving the correlation between \(X_{1}\) and \(X_{2}\) away from zero rotates the fitted regression lines of the observed \(Y\) on \(X_{1}\), while the fitted regression lines based on the adjusted variables are not affected.
Note, in this case the red shaded area no longer includes the green shaded area, i.e., the regression of the observed \(Y\) on \(X_{1}\) is biased.
The figure shows:
The scatter plot of the observed values of \(X_{1}\) against the fitted residuals adjusted for the effect of the unobserved variable \(X_{2}\).
The corresponding fitted regression line for a particular realization of the DGP is illustrated by the green dashed line. The green shaded area illustrates the different fitted regression lines across different realizations of the underlying DGP.
Note, the correlation between the observed variable \(X_{1}\) and the fitted residuals of the regression model is zero by construction of the OLS estimator. However, for unbiasedness this must also hold after adjusting for all potentially unobserved variables.
Note, this condition cannot be checked using observed variables, but we can examine it here using the simulated observed variable \(X_{1}\) and the simulated unobserved variable \(X_{2}\).
Increasing the sample size increases the number of fitted residuals estimated by OLS.
Changing the slope coefficient of \(X_{1}\) does not change the relationship between the fitted residuals and \(X_{1}\).
As long as the correlation between \(X_{1}\) and \(X_{2}\) is zero, moving the slope coefficient of \(X_{2}\) away from zero only increases the range of the fitted regression lines of the residuals on \(X_{1}\) across different realizations of the underlying DGP, illustrated by the green shaded area.
Note here, the range of the fitted regression lines spreads symmetrically around the zero line, and the lines converge to the zero line as the sample size increases. Thus, for a large number of observations the condition that the residuals \(u\) and the explanatory variable \(X_{1}\) are uncorrelated holds.
However, if the correlation between \(X_{1}\) and \(X_{2}\) is not zero, moving the slope coefficient of \(X_{2}\) away from zero rotates the fitted regression lines of the residuals on \(X_{1}\).
Note here, the range of the fitted regression lines no longer spreads symmetrically around the zero line, and the lines no longer converge to the zero line as the sample size increases. Thus, the condition that the residuals \(u\) and the explanatory variable \(X_{1}\) are uncorrelated is violated.
As long as the slope coefficient of \(X_{2}\) is zero, moving the correlation between \(X_{1}\) and \(X_{2}\) away from zero does not affect the fitted residuals of the estimated observed regression relationship. Thus, the condition that the residuals \(u\) and the explanatory variable \(X_{1}\) are uncorrelated holds.
However, if the slope coefficient of \(X_{2}\) is not zero, moving the correlation between \(X_{1}\) and \(X_{2}\) away from zero rotates the fitted regression lines of the residuals on \(X_{1}\). Note here, the range of the fitted regression lines no longer spreads symmetrically around the zero line, and the lines no longer converge to the zero line as the sample size increases. Thus, the condition that the residuals \(u\) and the explanatory variable \(X_{1}\) are uncorrelated is violated.
The figure shows:
The histogram of the estimated slope coefficients across all realizations of the underlying DGP.
The red vertical dashed line represents the estimated slope coefficient of regressing the observed \(Y\) on \(X_{1}\) for one particular realization of the underlying DGP.
The green vertical dashed line represents the slope coefficient of the underlying DGP, i.e., the slope coefficient of regressing \(Y\) adjusted for \(X_{2}\) on \(X_{1}\).
Increasing the sample size concentrates the estimated slope coefficients more tightly around their probability limit.
This is the result of the law of large numbers.
Changing the slope coefficient of \(X_{1}\) in the underlying DGP shifts the histogram of the estimated slope coefficients.
As long as the correlation between \(X_{1}\) and \(X_{2}\) is zero, moving the slope coefficient of \(X_{2}\) away from zero only increases the range of the OLS estimates for the slope coefficient of \(X_{1}\).
However, if the correlation between \(X_{1}\) and \(X_{2}\) is not zero, moving the slope coefficient of \(X_{2}\) away from zero shifts the histogram of the OLS estimates of the observed relationship away from the green vertical dashed line.
Thus, the OLS estimator for the slope coefficient based on the observed variables \(Y\) and \(X_{1}\) is biased and inconsistent.
As long as the slope coefficient of \(X_{2}\) is zero, moving the correlation between \(X_{1}\) and \(X_{2}\) away from zero does not affect the range of the OLS estimates for the slope coefficient of \(X_{1}\).
However, if the slope coefficient of \(X_{2}\) is not zero, moving the correlation between \(X_{1}\) and \(X_{2}\) away from zero shifts the histogram of the OLS estimates of the observed relationship away from the green vertical dashed line.
Thus, the OLS estimator for the slope coefficient based on the observed variables \(Y\) and \(X_{1}\) is biased and inconsistent.
The figure shows:
The histogram of the standardized estimated slope coefficients across all realizations of the DGP.
For the standardization we subtract the slope coefficient \(\beta_{1}\) of the variable \(X_{1}\) and divide by an estimate of the standard deviation of the estimated slope coefficient.
The red vertical dashed line represents the standardized estimated slope coefficient of regressing the observed \(Y\) on \(X_{1}\) for one particular realization of the underlying DGP.
The green vertical dashed line represents the standardized estimated slope coefficient of regressing \(Y\) adjusted for the unobserved variable \(X_{2}\) on \(X_{1}\) for one particular realization of the underlying DGP.
The green dashed curve represents the pdf of the standard normal distribution.
If the effect of \(X_{2}\) or the correlation between \(X_{1}\) and \(X_{2}\) is zero, the sampling distribution of the standardized estimated slope coefficient of regressing the observed \(Y\) on \(X_{1}\) approaches the standard normal distribution, whose pdf is illustrated by the green dashed curve, as the sample size increases.
However, if the effect of \(X_{2}\) and the correlation between \(X_{1}\) and \(X_{2}\) are both different from zero, the standardized estimated slope coefficient shifts further away from the standard normal distribution as the sample size increases.
The latter case illustrates the effect of omitted variable bias on the sampling distribution of the standardized estimated slope coefficient.
If the effect of \(X_{2}\) or the correlation between \(X_{1}\) and \(X_{2}\) is zero, the sampling distribution of the standardized estimated slope coefficient of regressing the observed \(Y\) on \(X_{1}\) is not affected by the effect of \(X_{1}\) on \(Y\).
However, if the effect of \(X_{2}\) and the correlation between \(X_{1}\) and \(X_{2}\) are both different from zero, the standardized estimated slope coefficient is shifted away from the standard normal distribution as the sample size increases.
The latter case illustrates the effect of omitted variable bias on the sampling distribution of the standardized estimated slope coefficient.
As long as the correlation between \(X_{1}\) and \(X_{2}\) is zero, moving the slope coefficient of \(X_{2}\) away from zero does not affect the sampling distribution of the standardized estimated slope coefficient of the relationship between the observed \(Y\) and \(X_{1}\).
However, if the correlation between \(X_{1}\) and \(X_{2}\) is not zero, moving the slope coefficient of \(X_{2}\) away from zero shifts the histogram of the standardized OLS estimates of the observed relationship away from the standard normal distribution.
The latter case illustrates the effect of omitted variable bias on the sampling distribution of the standardized estimated slope coefficient.
As long as the slope coefficient of \(X_{2}\) is zero, moving the correlation between \(X_{1}\) and \(X_{2}\) away from zero does not affect the sampling distribution of the standardized estimated slope coefficient of the relationship between the observed \(Y\) and \(X_{1}\).
However, if the slope coefficient of \(X_{2}\) is not zero, moving the correlation between \(X_{1}\) and \(X_{2}\) away from zero shifts the histogram of the standardized OLS estimates of the observed relationship away from the standard normal distribution.
The latter case illustrates the effect of omitted variable bias on the sampling distribution of the standardized estimated slope coefficient.

This module is part of the DeLLFi project of the University of Hohenheim and funded by the
Foundation for Innovation in University Teaching

Creative Commons License