class: center, middle, inverse, title-slide

.title[ # .b[Omitted Variables Bias (OVB)] ]
.subtitle[ ## .b[.green[EC 339]] ]
.author[ ### Marcio Santetti ]
.date[ ### Fall 2022 ]

---

class: inverse, middle

# Motivation

---

# Well-specified models

Recall .hi[CLRM Assumption I]:

> "*The regression model is .green[linear], .green[correctly specified], and has an .green[additive] stochastic error term*."

--

<br>

The .green[hardest] part of satisfying this assumption is having a .hi[well-specified model].

--

Suppose we have the following model:

$$
`\begin{align}
y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3x_{3i} + u_i
\end{align}`
$$

--

<br>

- How can we .green[evaluate] whether this is a well-specified model?
- Does it have the appropriate .green[functional form]?
- Is this model in accordance with .green[economic theory]?

---

# Well-specified models

In fact, we can .green[never know for sure] if we have the most appropriate model.

--

.hi[Theory] is always (and will always be) the best guide.

--

<br>

In addition, we must always .hi[visualize] our data, getting to know it better in order to choose the model's .green[functional form].

--

<br>

- __A different functional form may also be an omitted variable!__
- For instance, if the .green['true'] model contains a squared term and we omit it from our sample regression model, the latter will be .hi[misspecified].

---

layout: false
class: inverse, middle

# The nature of the problem

---

# Recalling bias

An estimator is .hi[biased] if its expected value differs from the *true* population parameter.

--

When considering our slope coefficients `\((\hat{\beta}_i)\)`, we expect them to equal, on average, the .green["true"] population parameter, `\(\beta_{pop}\)`.
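In symbols, the .hi[bias] of an estimator is the gap between its expected value and the population parameter:

$$
`\begin{align}
\text{Bias}(\hat{\beta}_{OLS}) = \mathop{\mathbb{E}}\left[ \hat{\beta}_{OLS} \right] - \beta_{pop}
\end{align}`
$$

An estimator is .green[unbiased] whenever this difference is zero.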
--

.pull-left[

**Unbiased:** `\(\mathop{\mathbb{E}}\left[ \hat{\beta}_{OLS} \right] = \beta_{pop}\)`

<img src="006-ovb_files/figure-html/unbiased pdf-1.svg" style="display: block; margin: auto;" />

]

--

.pull-right[

**Biased:** `\(\mathop{\mathbb{E}}\left[ \hat{\beta}_{OLS} \right] \neq \beta_{pop}\)`

<img src="006-ovb_files/figure-html/biased pdf-1.svg" style="display: block; margin: auto;" />

]

---

# Omitting a relevant variable

- Assume we know the .hi[true] population model:

$$
`\begin{align}
y_i^{true} = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i
\end{align}`
$$

--

<br>

- And we estimate the following model:

$$
`\begin{align}
y_i = \beta_0 + \beta_1x_{1i} + u_i^*
\end{align}`
$$

with

$$
`\begin{align}
u_i^* = u_i + \beta_2x_{2i}
\end{align}`
$$

--

- Assuming that `\(x_1\)` and `\(x_2\)` (the omitted variable) share some degree of .green[correlation] (which is usually the case), the error term is no longer .hi[independent] of an explanatory variable, violating .green[CLRM Assumption III].

---

# Omitting a relevant variable

<br>

- Consider a simple .green[demand model]:

$$
\small
`\begin{align}
log(qchicken_i) = {} & \beta_0 + \beta_1pchicken_{i} + \beta_2pbeef_{i} + \beta_3dispinc_{i} + \beta_4log(xchicken_i) + u_i
\end{align}`
$$

--

<br>

- And we .green[estimate] it:

$$
\small
`\begin{align}
\widehat{log(qchicken_i)} = {} & 2.95 - 0.23 \ pchicken_{i} + 0.18 \ pbeef_{i} + \\
& + 0.000036 \ dispinc_{i} + 0.75 \ log(xchicken_i)
\end{align}`
$$

---

# Omitting a relevant variable

- And now we omit `dispinc` from the model:

$$
\small
`\begin{align}
\widehat{log(qchicken_i)} = 3.49 - 0.30 \ pchicken_{i} + 0.25 \ pbeef_{i} + 1.65 \ log(xchicken_i)
\end{align}`
$$

--

<br>

- This model's .green[residual] term contains `dispinc`.
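The direction and size of the bias follow a standard formula. In the one-regressor case, with true model `\(y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i\)` and the auxiliary regression `\(x_{2i} = \delta_0 + \delta_1x_{1i} + v_i\)`, omitting `\(x_2\)` yields

$$
`\begin{align}
\mathop{\mathbb{E}}\left[ \hat{\beta}_1 \right] = \beta_1 + \beta_2\delta_1
\end{align}`
$$

so the bias, `\(\beta_2\delta_1\)`, vanishes only when the omitted variable does not belong in the model `\((\beta_2 = 0)\)` or is uncorrelated with the included regressor `\((\delta_1 = 0)\)`.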
<br>

--

- Let us check out the .green[correlation coefficient] between `dispinc` (labeled `y` in the data set) and the other variables:

<br>

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> corr_y_pchicken </th>
   <th style="text-align:right;"> corr_y_pbeef </th>
   <th style="text-align:right;"> corr_y_x </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> -0.8552982 </td>
   <td style="text-align:right;"> -0.6940004 </td>
   <td style="text-align:right;"> NA </td>
  </tr>
</tbody>
</table>

---

# Omitting a relevant variable

<br>

<table>
<caption>'True' model</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 2.9575599 </td>
   <td style="text-align:right;"> 0.0951466 </td>
   <td style="text-align:right;"> 31.084255 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> p </td>
   <td style="text-align:right;"> -0.2342880 </td>
   <td style="text-align:right;"> 0.0176617 </td>
   <td style="text-align:right;"> -13.265322 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> pb </td>
   <td style="text-align:right;"> 0.1814819 </td>
   <td style="text-align:right;"> 0.0509694 </td>
   <td style="text-align:right;"> 3.560608 </td>
   <td style="text-align:right;"> 0.0008732 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lexpts </td>
   <td style="text-align:right;"> 0.7526487 </td>
   <td style="text-align:right;"> 0.1404342 </td>
   <td style="text-align:right;"> 5.359440 </td>
   <td style="text-align:right;"> 0.0000026 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> y </td>
   <td style="text-align:right;"> 0.0000361 </td>
   <td style="text-align:right;"> 0.0000052 </td>
   <td style="text-align:right;"> 6.986129 </td>
   <td style="text-align:right;"> 0.0000000
</td> </tr> </tbody> </table> <br> <table> <caption>Biased model</caption> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3.4926329 </td> <td style="text-align:right;"> 0.0801754 </td> <td style="text-align:right;"> 43.562414 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> p </td> <td style="text-align:right;"> -0.3045472 </td> <td style="text-align:right;"> 0.0206204 </td> <td style="text-align:right;"> -14.769222 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> pb </td> <td style="text-align:right;"> 0.2551898 </td> <td style="text-align:right;"> 0.0708221 </td> <td style="text-align:right;"> 3.603253 </td> <td style="text-align:right;"> 0.0007563 </td> </tr> <tr> <td style="text-align:left;"> lexpts </td> <td style="text-align:right;"> 1.6504674 </td> <td style="text-align:right;"> 0.0804149 </td> <td style="text-align:right;"> 20.524400 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> </tbody> </table> --- layout: false class: inverse, middle # Including irrelevant variables --- # Including irrelevant variables <br> - Now assume that the .hi[true] model is: $$ `\begin{align} y_i^{true} = \beta_0 + \beta_1x_{1i} + u_i \end{align}` $$ -- <br> - And, instead, we estimate $$ `\begin{align} y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i^* \end{align}` $$ with $$ `\begin{align} u_i^* = u_i - \beta_2x_{2i} \end{align}` $$ --- # Including irrelevant variables <br> - Suppose we add `popgro`, a variable measuring .green[population growth], to our original model: <br> $$ `\begin{align} \small \widehat{log(qchicken_i)} = {} & 2.89 - 0.23 \ pchicken_{i} + 0.19 \ pbeef_{i} + \\ 
& + 0.000038 \ dispinc_{i} + 0.69 \ log(xchicken_i) + \\
& + 0.017 \ popgro_{i} \end{align}`
$$

---

# Including irrelevant variables

<table>
<caption>'True' model</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 2.9575599 </td>
   <td style="text-align:right;"> 0.0951466 </td>
   <td style="text-align:right;"> 31.084255 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> p </td>
   <td style="text-align:right;"> -0.2342880 </td>
   <td style="text-align:right;"> 0.0176617 </td>
   <td style="text-align:right;"> -13.265322 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> pb </td>
   <td style="text-align:right;"> 0.1814819 </td>
   <td style="text-align:right;"> 0.0509694 </td>
   <td style="text-align:right;"> 3.560608 </td>
   <td style="text-align:right;"> 0.0008732 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lexpts </td>
   <td style="text-align:right;"> 0.7526487 </td>
   <td style="text-align:right;"> 0.1404342 </td>
   <td style="text-align:right;"> 5.359440 </td>
   <td style="text-align:right;"> 0.0000026 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> y </td>
   <td style="text-align:right;"> 0.0000361 </td>
   <td style="text-align:right;"> 0.0000052 </td>
   <td style="text-align:right;"> 6.986129 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
</tbody>
</table>

<br>

<table>
<caption>Model with irrelevant variable</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td
style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 2.8951497 </td> <td style="text-align:right;"> 0.1353082 </td> <td style="text-align:right;"> 21.3967020 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> p </td> <td style="text-align:right;"> -0.2369439 </td> <td style="text-align:right;"> 0.0211080 </td> <td style="text-align:right;"> -11.2253171 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> <tr> <td style="text-align:left;"> pb </td> <td style="text-align:right;"> 0.1914541 </td> <td style="text-align:right;"> 0.0537460 </td> <td style="text-align:right;"> 3.5622008 </td> <td style="text-align:right;"> 0.0008984 </td> </tr> <tr> <td style="text-align:left;"> lexpts </td> <td style="text-align:right;"> 0.6996547 </td> <td style="text-align:right;"> 0.1722889 </td> <td style="text-align:right;"> 4.0609386 </td> <td style="text-align:right;"> 0.0001978 </td> </tr> <tr> <td style="text-align:left;"> y </td> <td style="text-align:right;"> 0.0000385 </td> <td style="text-align:right;"> 0.0000065 </td> <td style="text-align:right;"> 5.9044418 </td> <td style="text-align:right;"> 0.0000005 </td> </tr> <tr> <td style="text-align:left;"> popgro </td> <td style="text-align:right;"> 0.0177147 </td> <td style="text-align:right;"> 0.0300050 </td> <td style="text-align:right;"> 0.5903904 </td> <td style="text-align:right;"> 0.5579493 </td> </tr> </tbody> </table> --- layout: false class: inverse, middle # The .mono[RESET] test --- # The .mono[RESET] test <br> Knowing for sure whether our models suffer from Omitted Variables Bias (OVB) is .green[hard]. -- However, the .green[.mono[RESET] test for functional form misspecification] can help us. -- <br> It consists of running an .hi[F-test] on .hi[functional forms] of the .hi[fitted values] of the dependent variable `\((\hat{y})\)`. -- These functional forms `\((\hat{y}^2, \hat{y}^3, etc.)\)` serve as _.hi[proxies]_ for potentially omitted variables. 
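The mechanics of this F-test can also be sketched outside the deck's .mono[R] and .mono[Stata] examples. Below is a minimal Python illustration with simulated data; the function names and the data-generating process are hypothetical, not taken from the chicken-demand model:

```python
import numpy as np

def ols(X, y):
    """Least-squares fit; returns coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

def reset_F(X, y, powers=(2, 3)):
    """Manual RESET: add powers of the fitted values as extra regressors
    and compare the restricted and augmented fits with an F-statistic."""
    beta, rss_r = ols(X, y)                        # restricted (original) model
    yhat = X @ beta                                # fitted values
    Z = np.column_stack([X] + [yhat**p for p in powers])
    _, rss_u = ols(Z, y)                           # unrestricted (augmented) model
    q = len(powers)                                # number of restrictions tested
    df2 = len(y) - Z.shape[1]                      # residual df of augmented model
    return ((rss_r - rss_u) / q) / (rss_u / df2)

# Simulated data whose 'true' model contains a squared term
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(size=200)

X_short = np.column_stack([np.ones(200), x])        # omits the squared term
X_full = np.column_stack([np.ones(200), x, x**2])   # well specified

F_mis = reset_F(X_short, y)   # typically large: powers of yhat pick up the omitted x^2
F_ok = reset_F(X_full, y)     # typically small: no evidence of omitted terms
```

Comparing `F_mis` against an `\(F(2, 196)\)` critical value rejects the null of no omitted variables for the short model, while `F_ok` does not, mirroring the fitted-values logic just described.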
--

<br>

Recall that .green[functional forms] of .green[already included independent variables] can also be omitted variables!

---

# The .mono[RESET] test

### The .hi[recipe] 👩‍🍳 👨‍🍳:

<br>

.pseudocode-small[

1. Estimate the regression model via OLS;

2. Store the regression's fitted values `\((\hat{y}_i)\)`;

3. Use functional forms of `\(\hat{y}_i\)` (squared, cubic terms, etc.) as **independent variables** in a new model;

4. Compare the fits of the models from steps **1** and **3** through an *F-test*;

5. In case these additional terms are **not** jointly significant, we do not suspect omitted variables.

6. In case these terms are *jointly significant*, we should consider adding new regressors to the original model.

]

---

# The .mono[RESET] test

.pseudocode-small[ Estimate the regression model via OLS ]

$$
`\begin{align}
y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + u_i
\end{align}`
$$

--

.pseudocode-small[ Store the regression's fitted values `\((\hat{y}_i)\)` ]

$$
`\begin{align}
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{1i} + \hat{\beta}_2x_{2i}
\end{align}`
$$

--

.pseudocode-small[ Use functional forms of `\(\hat{y}_i\)` (squared, cubic terms, etc.) as **independent variables** in a new model ]

$$
`\begin{align}
y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3\hat{y}_i^2 + \beta_4\hat{y}_i^3 + u_i
\end{align}`
$$

--

.pseudocode-small[ Compare the fits of the models from steps **1** and **3** through an *F-test* ]

- `\(H_0: \beta_3 = \beta_4 = 0\)`
- `\(H_a: H_0\)` is not true

---

# The .mono[RESET] test

- In case the .hi[null hypothesis] is .hi[rejected], then we have evidence of omitted variables.

--

- In case we .hi[do not reject] `\(H_0\)`, then we can stick with the original model.

--

In .mono[R]...

```r
library(lmtest)  # resettest() comes from the lmtest package
resettest(model_true, power = 2:4)
```

```
#> 
#> RESET test
#> 
#> data: model_true
#> RESET = 1.6352, df1 = 3, df2 = 43, p-value = 0.1953
```

<br>

What do we conclude?

---

# The .mono[RESET] test

<br>

In .mono[Stata]...
```{}
estat ovtest

Ramsey RESET test for omitted variables
Omitted: Powers of fitted values of lq
H0: Model has no omitted variables

F(3, 43) = 1.64
Prob > F = 0.1953
```

<br>

What do we conclude?

---

layout: false
class: inverse, middle

# Next time: OVB in practice

---

exclude: true