class: center, middle, inverse, title-slide

.title[ # .b[Omitted Variables Bias (OVB)] ]
.subtitle[ ## .b[.green[EC 339]] ]
.author[ ### Marcio Santetti ]
.date[ ### Fall 2022 ]

---

class: inverse, middle

# Motivation

---

# Well-specified models

Recall .hi[CLRM Assumption I]:

> "*The regression model is .green[linear], .green[correctly specified], and has an .green[additive] stochastic error term*."

--

<br>

The .green[hardest] part of satisfying this assumption is having a .hi[well-specified model].

--

Suppose we have the following model:

$$
`\begin{align}
y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3x_{3i} + u_i
\end{align}`
$$

--

<br>

- How can we .green[evaluate] whether this is a well-specified model?
- Does it have the appropriate .green[functional form]?
- Is this model in accordance with .green[economic theory]?

---

# Well-specified models

In fact, we can .green[never know for sure] if we have the most appropriate model.

--

.hi[Theory] is always (and will always be) the best guide.

--

<br>

In addition, we must always .hi[visualize] our data, getting to know it better in order to choose the model's .green[functional form].

--

<br>

- __A different functional form may also be an omitted variable!__
- For instance, if the .green['true'] model contains a squared term and we omit it from our sample regression model, the latter will be .hi[misspecified].

---

layout: false
class: inverse, middle

# The nature of the problem

---

# Recalling bias

An estimator is .hi[biased] if its expected value differs from the *true* population parameter.

--

When considering our slope coefficients `\((\hat{\beta}_i)\)`, we expect them to equal, on average, the .green["true"] population parameter, `\(\beta_{pop}\)`.
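In symbols, the .hi[bias] of an estimator is the gap between its expected value and the population parameter:

$$
`\begin{align}
\text{Bias}(\hat{\beta}_{OLS}) = \mathop{\mathbb{E}}\left[ \hat{\beta}_{OLS} \right] - \beta_{pop}
\end{align}`
$$

An estimator is .green[unbiased] whenever this difference is zero.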
--

.pull-left[

**Unbiased:** `\(\mathop{\mathbb{E}}\left[ \hat{\beta}_{OLS} \right] = \beta_{pop}\)`

<img src="006-ovb_files/figure-html/unbiased pdf-1.svg" style="display: block; margin: auto;" />

]

--

.pull-right[

**Biased:** `\(\mathop{\mathbb{E}}\left[ \hat{\beta}_{OLS} \right] \neq \beta_{pop}\)`

<img src="006-ovb_files/figure-html/biased pdf-1.svg" style="display: block; margin: auto;" />

]

---

# Omitting a relevant variable

- Assume we know the .hi[true] population model:

$$
`\begin{align}
y_i^{true} = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i
\end{align}`
$$

--

<br>

- And we estimate the following model:

$$
`\begin{align}
y_i = \beta_0 + \beta_1x_{1i} + u_i^*
\end{align}`
$$

with

$$
`\begin{align}
u_i^* = u_i + \beta_2x_{2i}
\end{align}`
$$

--

- Assuming that `\(x_1\)` and `\(x_2\)` (the omitted variable) share some degree of .green[correlation] (which is usually the case), the error term is no longer .hi[independent] of an explanatory variable, violating .green[CLRM Assumption III].

---

# Omitting a relevant variable

<br>

- Consider a simple .green[demand model]:

$$
\small
`\begin{align}
log(qchicken_i) = {} & \beta_0 + \beta_1pchicken_{i} + \beta_2pbeef_{i} + \beta_3dispinc_{i} + \beta_4log(xchicken_i) + u_i
\end{align}`
$$

--

<br>

- And we .green[estimate] it:

$$
\small
`\begin{align}
\widehat{log(qchicken_i)} = {} & 2.95 - 0.23 \ pchicken_{i} + 0.18 \ pbeef_{i} + \\
& + 0.000036 \ dispinc_{i} + 0.75 \ log(xchicken_i)
\end{align}`
$$

---

# Omitting a relevant variable

- And now we omit `dispinc` from the model:

$$
\small
`\begin{align}
\widehat{log(qchicken_i)} = 3.49 - 0.30 \ pchicken_{i} + 0.25 \ pbeef_{i} + 1.65 \ log(xchicken_i)
\end{align}`
$$

--

<br>

- This model's .green[residual] term contains `dispinc`.
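The direction and size of the bias follow a standard formula. In the one-regressor case, with true model `\(y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\)` and the auxiliary regression `\(x_{2i} = \delta_0 + \delta_1x_{1i} + v_i\)`, omitting `\(x_2\)` yields

$$
`\begin{align}
\mathop{\mathbb{E}}\left[ \hat{\beta}_1 \right] = \beta_1 + \beta_2\delta_1
\end{align}`
$$

so the bias, `\(\beta_2\delta_1\)`, vanishes only when the omitted variable does not belong in the model `\((\beta_2 = 0)\)` or is uncorrelated with the included regressor `\((\delta_1 = 0)\)`.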
<br>

--

- Let us check out the .green[correlation coefficient] between `dispinc` (labeled `y` in the data set) and the other variables:

<br>

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> corr_y_pchicken </th>
   <th style="text-align:right;"> corr_y_pbeef </th>
   <th style="text-align:right;"> corr_y_x </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> -0.8552982 </td>
   <td style="text-align:right;"> -0.6940004 </td>
   <td style="text-align:right;"> NA </td>
  </tr>
</tbody>
</table>

---

# Omitting a relevant variable

<br>

<table>
<caption>'True' model</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 2.9575599 </td>
   <td style="text-align:right;"> 0.0951466 </td>
   <td style="text-align:right;"> 31.084255 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> p </td>
   <td style="text-align:right;"> -0.2342880 </td>
   <td style="text-align:right;"> 0.0176617 </td>
   <td style="text-align:right;"> -13.265322 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> pb </td>
   <td style="text-align:right;"> 0.1814819 </td>
   <td style="text-align:right;"> 0.0509694 </td>
   <td style="text-align:right;"> 3.560608 </td>
   <td style="text-align:right;"> 0.0008732 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lexpts </td>
   <td style="text-align:right;"> 0.7526487 </td>
   <td style="text-align:right;"> 0.1404342 </td>
   <td style="text-align:right;"> 5.359440 </td>
   <td style="text-align:right;"> 0.0000026 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> y </td>
   <td style="text-align:right;"> 0.0000361 </td>
   <td style="text-align:right;"> 0.0000052 </td>
   <td style="text-align:right;"> 6.986129 </td>
   <td style="text-align:right;"> 0.0000000
</td> </tr> </tbody> </table> <br> <table> <caption>Biased model</caption> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3.4926329 </td> <td style="text-align:right;"> 0.0801754 </td> <td style="text-align:right;"> 43.562414 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> p </td> <td style="text-align:right;"> -0.3045472 </td> <td style="text-align:right;"> 0.0206204 </td> <td style="text-align:right;"> -14.769222 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> pb </td> <td style="text-align:right;"> 0.2551898 </td> <td style="text-align:right;"> 0.0708221 </td> <td style="text-align:right;"> 3.603253 </td> <td style="text-align:right;"> 0.0007563 </td> </tr> <tr> <td style="text-align:left;"> lexpts </td> <td style="text-align:right;"> 1.6504674 </td> <td style="text-align:right;"> 0.0804149 </td> <td style="text-align:right;"> 20.524400 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> </tbody> </table> --- layout: false class: inverse, middle # Including irrelevant variables --- # Including irrelevant variables <br> - Now assume that the .hi[true] model is: $$ `\begin{align} y_i^{true} = \beta_0 + \beta_1x_{1i} + u_i \end{align}` $$ -- <br> - And, instead, we estimate $$ `\begin{align} y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i^* \end{align}` $$ with $$ `\begin{align} u_i^* = u_i - \beta_2x_{2i} \end{align}` $$ --- # Including irrelevant variables <br> - Suppose we add `popgro`, a variable measuring .green[population growth], to our original model: <br> $$ `\begin{align} \small \widehat{log(qchicken_i)} = {} & 2.89 - 0.23 \ pchicken_{i} + 0.19 \ pbeef_{i} + \\ 
& + 0.000038 \ dispinc_{i} + 0.69 \ log(xchicken_i) + \\
& + 0.017 \ popgro_{i} \end{align}`
$$

---

# Including irrelevant variables

<table>
<caption>'True' model</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 2.9575599 </td>
   <td style="text-align:right;"> 0.0951466 </td>
   <td style="text-align:right;"> 31.084255 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> p </td>
   <td style="text-align:right;"> -0.2342880 </td>
   <td style="text-align:right;"> 0.0176617 </td>
   <td style="text-align:right;"> -13.265322 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> pb </td>
   <td style="text-align:right;"> 0.1814819 </td>
   <td style="text-align:right;"> 0.0509694 </td>
   <td style="text-align:right;"> 3.560608 </td>
   <td style="text-align:right;"> 0.0008732 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lexpts </td>
   <td style="text-align:right;"> 0.7526487 </td>
   <td style="text-align:right;"> 0.1404342 </td>
   <td style="text-align:right;"> 5.359440 </td>
   <td style="text-align:right;"> 0.0000026 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> y </td>
   <td style="text-align:right;"> 0.0000361 </td>
   <td style="text-align:right;"> 0.0000052 </td>
   <td style="text-align:right;"> 6.986129 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
</tbody>
</table>

<br>

<table>
<caption>Model with irrelevant variable</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td
style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 2.8951497 </td> <td style="text-align:right;"> 0.1353082 </td> <td style="text-align:right;"> 21.3967020 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> p </td> <td style="text-align:right;"> -0.2369439 </td> <td style="text-align:right;"> 0.0211080 </td> <td style="text-align:right;"> -11.2253171 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> pb </td> <td style="text-align:right;"> 0.1914541 </td> <td style="text-align:right;"> 0.0537460 </td> <td style="text-align:right;"> 3.5622008 </td> <td style="text-align:right;"> 0.0008984 </td> </tr> <tr> <td style="text-align:left;"> lexpts </td> <td style="text-align:right;"> 0.6996547 </td> <td style="text-align:right;"> 0.1722889 </td> <td style="text-align:right;"> 4.0609386 </td> <td style="text-align:right;"> 0.0001978 </td> </tr> <tr> <td style="text-align:left;"> y </td> <td style="text-align:right;"> 0.0000385 </td> <td style="text-align:right;"> 0.0000065 </td> <td style="text-align:right;"> 5.9044418 </td> <td style="text-align:right;"> 0.0000005 </td> </tr> <tr> <td style="text-align:left;"> popgro </td> <td style="text-align:right;"> 0.0177147 </td> <td style="text-align:right;"> 0.0300050 </td> <td style="text-align:right;"> 0.5903904 </td> <td style="text-align:right;"> 0.5579493 </td> </tr> </tbody> </table> --- layout: false class: inverse, middle # The .mono[RESET] test --- # The .mono[RESET] test <br> Knowing for sure whether our models suffer from Omitted Variables Bias (OVB) is .green[hard]. -- However, the .green[.mono[RESET] test for functional form misspecification] can help us. -- <br> It consists of running an .hi[F-test] on .hi[functional forms] of the .hi[fitted values] of the dependent variable `\((\hat{y})\)`. -- These functional forms `\((\hat{y}^2, \hat{y}^3, etc.)\)` serve as _.hi[proxies]_ for potentially omitted variables. 
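The mechanics of this F-test can also be sketched outside the deck's .mono[R] and .mono[Stata] examples. Below is a minimal Python illustration with simulated data; the function names and the data-generating process are hypothetical, not taken from the chicken-demand model:

```python
import numpy as np

def ols(X, y):
    """Least-squares fit; returns coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

def reset_F(X, y, powers=(2, 3)):
    """Manual RESET: add powers of the fitted values as extra regressors
    and compare the restricted and augmented fits with an F-statistic."""
    beta, rss_r = ols(X, y)                        # restricted (original) model
    yhat = X @ beta                                # fitted values
    Z = np.column_stack([X] + [yhat**p for p in powers])
    _, rss_u = ols(Z, y)                           # unrestricted (augmented) model
    q = len(powers)                                # number of restrictions tested
    df2 = len(y) - Z.shape[1]                      # residual df of augmented model
    return ((rss_r - rss_u) / q) / (rss_u / df2)

# Simulated data whose 'true' model contains a squared term
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(size=200)

X_short = np.column_stack([np.ones(200), x])        # omits the squared term
X_full = np.column_stack([np.ones(200), x, x**2])   # well specified

F_mis = reset_F(X_short, y)   # typically large: powers of yhat pick up the omitted x^2
F_ok = reset_F(X_full, y)     # typically small: no evidence of omitted terms
```

Comparing `F_mis` against an `\(F(2, 196)\)` critical value rejects the null of no omitted variables for the short model, while `F_ok` does not, mirroring the fitted-values logic just described.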
--

<br>

Recall that .green[functional forms] of .green[already included independent variables] can also be omitted variables!

---

# The .mono[RESET] test

### The .hi[recipe] 👩‍🍳 👨‍🍳:

<br>

.pseudocode-small[

1. Estimate the regression model via OLS;

2. Store the regression's fitted values `\((\hat{y}_i)\)`;

3. Use functional forms of `\(\hat{y}_i\)` (squared, cubic terms, etc.) as **independent variables** in a new model;

4. Compare the fits of the models from steps **1** and **3** through an *F-test*;

5. In case these additional terms are **not** jointly significant, we do not suspect omitted variables.

6. In case these terms are *jointly significant*, we should consider adding new regressors to the original model.

]

---

# The .mono[RESET] test

.pseudocode-small[ Estimate the regression model via OLS ]

$$
`\begin{align}
y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i
\end{align}`
$$

--

.pseudocode-small[ Store the regression's fitted values `\((\hat{y}_i)\)` ]

$$
`\begin{align}
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{1i} + \hat{\beta}_2x_{2i}
\end{align}`
$$

--

.pseudocode-small[ Use functional forms of `\(\hat{y}_i\)` (squared, cubic terms, etc.) as **independent variables** in a new model ]

$$
`\begin{align}
y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3\hat{y}_i^2 + \beta_4\hat{y}_i^3 + u_i
\end{align}`
$$

--

.pseudocode-small[ Compare the fits of the models from steps **1** and **3** through an *F-test* ]

- `\(H_0: \beta_3 = \beta_4 = 0\)`
- `\(H_a: H_0\)` is not true

---

# The .mono[RESET] test

- In case the .hi[null hypothesis] is .hi[rejected], then we have evidence of omitted variables.

--

- In case we .hi[do not reject] `\(H_0\)`, then we can stick with the original model.

--

In .mono[R]...

```r
library(lmtest)  # resettest() comes from the lmtest package
resettest(model_true, power = 2:4)
```

```
#> 
#> RESET test
#> 
#> data: model_true
#> RESET = 1.6352, df1 = 3, df2 = 43, p-value = 0.1953
```

<br>

What do we conclude?

---

# The .mono[RESET] test

<br>

In .mono[Stata]...
```{}
estat ovtest

Ramsey RESET test for omitted variables
Omitted: Powers of fitted values of lq
H0: Model has no omitted variables

F(3, 43) = 1.64
Prob > F = 0.1953
```

<br>

What do we conclude?

---

layout: false
class: inverse, middle

# Next time: OVB in practice

---

exclude: true