class: center, middle, inverse, title-slide .title[ # .b[Multiple Linear Regression] ] .subtitle[ ## .b[.green[EC 339]] ] .author[ ### Marcio Santetti ] .date[ ### Fall 2022 ] --- class: inverse, middle # Motivation --- # Beyond simple regression <br> Simple regression models may not be .b[sufficient] to describe the relationships we are interested in. -- <br> A few reasons: -- - Avoiding .b[bias] due to *omitted variables*; - More consistency with .b[economic theory]; - Usually, relationships we study are a product of .b[several different events]. --- # Multiple regression models In .b[standard] notation: $$ `\begin{align} y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3x_{3i} + ... + \beta_kx_{ki} + u_i \hspace{.7cm}\ \\ \forall \ i = 1, 2, 3,..., n \end{align}` $$ -- <br> - From last week... $$ `\begin{align} wage_i = \beta_0 + \beta_1educ_i + u_i \end{align}` $$ -- - And now... $$ `\begin{align} wage_i = \beta_0 + \beta_1educ_i + \beta_2exper_i + \beta_3tenure_i + \beta_4gender_i + u_i \end{align}` $$ -- <br> .small[.b[Important:] even if we are only interested in the effect of *educ* on *wage*, the model above is more consistent with theoretical priors.] --- # An example .center[ ``` #> #> =============================================== #> Dependent variable: #> --------------------------- #> wage #> ----------------------------------------------- #> educ 0.541*** #> (0.053) #> #> Constant -0.905 #> (0.685) #> #> ----------------------------------------------- #> Observations 526 #> R2 0.165 #> Adjusted R2 0.163 #> Residual Std. 
Error          3.378 (df = 524)
#> F Statistic         103.363*** (df = 1; 524)
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```
]

---

# An example

.smaller[.center[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                                wage            
#> -----------------------------------------------
#> educ                          0.572***         
#>                               (0.049)          
#> exper                         0.025**          
#>                               (0.012)          
#> tenure                        0.141***         
#>                               (0.021)          
#> female                       -1.811***         
#>                               (0.265)          
#> Constant                      -1.568**         
#>                               (0.725)          
#> -----------------------------------------------
#> Observations                    526            
#> R2                             0.364           
#> Adjusted R2                    0.359           
#> Residual Std. Error       2.958 (df = 521)     
#> F Statistic            74.398*** (df = 4; 521) 
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```
]]

---
layout: false
class: inverse, middle

# Interpreting multiple coefficients

---

# The *ceteris paribus* assumption

When .b[interpreting] multiple regression models, we .b[isolate] the effect of one independent variable on the dependent variable.

--

Therefore, the estimated .b[slope parameters] `\((\hat{\beta}_1,...,\hat{\beta}_k)\)` give the change in `\(y\)` resulting from a one-unit change in the corresponding regressor, .it[holding all other independent variables constant].

--

.it[Mathematically] speaking...

<br>

$$
`\begin{align}
wage_i = \beta_0 + \beta_1educ_i + \beta_2exper_i + \beta_3tenure_i + \beta_4gender_i + u_i
\end{align}`
$$

$$
`\begin{align}
\dfrac{\partial wage_i}{\partial educ_i} = \beta_1
\end{align}`
$$

$$
`\begin{align}
\dfrac{\partial wage_i}{\partial exper_i} = \beta_2
\end{align}`
$$

---
layout: false
class: inverse, middle

# Goodness-of-fit

---

# Goodness-of-fit

As more variables are added to our model, the *R*<sup>2</sup> increases in a .b[mechanical] fashion.

--

- .b[Problem!]
<br>

--

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Simple regression wage model </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.16 </td>
  </tr>
</tbody>
</table>

<br>

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Multiple regression wage model </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.36 </td>
  </tr>
</tbody>
</table>

---

# Goodness-of-fit

<br><br>

- Let us add a `construc` indicator variable to our previous model.

<br>

- `construc == 1` if working in the construction sector;
- `construc == 0` otherwise.

--

<br><br>

$$
`\begin{align}
wage_i = \beta_0 + \beta_1educ_i + \beta_2exper_i + \beta_3tenure_i + \beta_4gender_i + \beta_5construc_i + u_i
\end{align}`
$$

---

# Goodness-of-fit

.smaller[.center[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                                wage            
#> -----------------------------------------------
#> educ                          0.577***         
#>                               (0.050)          
#> exper                         0.026**          
#>                               (0.012)          
#> tenure                        0.141***         
#>                               (0.021)          
#> female                       -1.788***         
#>                               (0.266)          
#> construc                       0.563           
#>                               (0.626)          
#> Constant                      -1.685**         
#>                               (0.736)          
#> -----------------------------------------------
#> Observations                    526            
#> R2                             0.365           
#> Adjusted R2                    0.358           
#> Residual Std. Error       2.958 (df = 520)     
#> F Statistic           59.658*** (df = 5; 520)  
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```
]]

---

# Goodness-of-fit

Before adding *construc*, the *R*<sup>2</sup> was .b[.364]; now it is .b[.365]. Why did it increase?

--

Let us have a closer look at its .b[formula]:

$$
`\begin{align}
R^2 = 1 - \dfrac{RSS}{TSS} = 1- \dfrac{\sum_{i=1}^{n}\hat{u}_i^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
\end{align}`
$$

--

- The .b[denominator] (*TSS*) does not change as we add variables, while the .b[numerator] (*RSS*) can only decrease or stay the same. Hence, the *R*<sup>2</sup> can never fall.
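This is easy to verify numerically. A minimal sketch in Python with .b[simulated] data (not the wage dataset): even a regressor of pure noise weakly raises the *R*<sup>2</sup>, because OLS can always set its coefficient to zero and do no worse.

```python
import numpy as np

# Simulated data: y depends on x1 only; "noise" is unrelated by construction.
rng = np.random.default_rng(339)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (X already contains the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

X_short = np.column_stack([np.ones(n), x1])
X_long = np.column_stack([np.ones(n), x1, noise])  # add the irrelevant regressor

r2_short = r_squared(X_short, y)
r2_long = r_squared(X_long, y)
print(r2_short <= r2_long)  # True: R^2 never falls when a regressor is added
```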
-- - .b[Solution]: the .it[adjusted] *R*<sup>2</sup>, <SPAN STYLE="text-decoration:overline">*R*</SPAN><sup>2</sup>: $$ `\begin{align} \bar{R}^2 = 1 - \dfrac{\sum_{i=1}^{n}\hat{u}_i^2/(n-k-1)}{\sum_{i=1}^{n}(y_i - \bar{y})^2/(n-1)} \end{align}` $$ -- - `\(k=\)` # independent variables; - `\((n-k-1) =\)` # degrees-of-freedom. --- # Goodness-of-fit <br> Multiple regression model .b[without] *construc*: <table> <thead> <tr> <th style="text-align:right;"> R-squared </th> <th style="text-align:right;"> Adjusted R-squared </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.36354 </td> <td style="text-align:right;"> 0.35865 </td> </tr> </tbody> </table> <br> Multiple regression model .b[with] *construc*: <table> <thead> <tr> <th style="text-align:right;"> R-squared </th> <th style="text-align:right;"> Adjusted R-squared </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.36453 </td> <td style="text-align:right;"> 0.35842 </td> </tr> </tbody> </table> -- .right[What happened?] --- layout: false class: inverse, middle # Functional forms --- # Nonlinear relationships .smaller[Many times, the relationships we are interested in .b[do not] follow a linear pattern.] 
<img src="002-multiple-regression_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> --- # A level-level model <br> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 53.955561 </td> <td style="text-align:right;"> 0.314995 </td> <td style="text-align:right;"> 171.29025 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> gdpPercap </td> <td style="text-align:right;"> 0.000765 </td> <td style="text-align:right;"> 0.000026 </td> <td style="text-align:right;"> 29.65766 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> -- <br> - .b[Interpretation:] - A 10,000-dollar increase in GDP per capita _.b[increases]_ life expectancy by 7.65 years. 
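---

# A level-level model

A quick sanity check on the interpretation above, as a Python sketch (the coefficient comes from the table on the previous slide; the 10,000-dollar change is just an illustrative value):

```python
# Level-level model: delta_y = beta_1 * delta_x.
beta_1 = 0.000765   # estimated effect on life expectancy (years) per dollar
delta_x = 10_000    # an illustrative 10,000-dollar increase in GDP per capita

delta_y = beta_1 * delta_x
print(delta_y)      # about 7.65 additional years of life expectancy, on average
```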
--- # Nonlinear relationships <img src="002-multiple-regression_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> --- # A log-level model <br> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3.9666387 </td> <td style="text-align:right;"> 0.0058346 </td> <td style="text-align:right;"> 679.85339 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> gdpPercap </td> <td style="text-align:right;"> 0.0000129 </td> <td style="text-align:right;"> 0.0000005 </td> <td style="text-align:right;"> 27.03958 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> -- <br> - .b[Interpretation:] - A one-unit increase in the explanatory variable increases the dependent variable by approximately `\(\beta_1 \times 100\)` percent, on average. - A 1,000-dollar increase in GDP per capita _.b[increases]_ life expectancy by 1.29%. 
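---

# A log-level model

The `\(\beta_1 \times 100\)` rule is an .b[approximation] that is accurate for small changes. A Python sketch using the coefficient from the previous slide; the exact percentage change exponentiates the change in logs:

```python
import math

# Log-level model: %delta_y is approximately 100 * beta_1 * delta_x.
beta_1 = 0.0000129   # estimated effect on log(life expectancy) per dollar
delta_x = 1_000      # an illustrative 1,000-dollar increase in GDP per capita

approx_pct = 100 * beta_1 * delta_x                 # rule of thumb: about 1.29%
exact_pct = 100 * (math.exp(beta_1 * delta_x) - 1)  # exact: about 1.30%

print(approx_pct, exact_pct)
```

The two numbers are nearly identical here, which is why the rule of thumb is safe for small changes.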
--- # Nonlinear relationships <img src="002-multiple-regression_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> --- # A log-log model <br> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 2.864177 </td> <td style="text-align:right;"> 0.0232827 </td> <td style="text-align:right;"> 123.01718 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> log(gdpPercap) </td> <td style="text-align:right;"> 0.146549 </td> <td style="text-align:right;"> 0.0028213 </td> <td style="text-align:right;"> 51.94452 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> -- <br> - .b[Interpretation:] - A one-percent increase in the independent variable results in a `\(\beta_1\)` percent change in the dependent variable, on average. - A 1 % increase in GDP per capita _.b[increases]_ life expectancy by 0.147 %. 
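---

# A log-log model

In a log-log model, `\(\beta_1\)` is an .b[elasticity]. A Python sketch with the coefficient from the previous slide, comparing the rule of thumb with the exact effect of a one-percent increase in `\(x\)`:

```python
# Log-log model: %delta_y is approximately beta_1 * %delta_x.
beta_1 = 0.146549   # estimated elasticity of life expectancy w.r.t. GDP per capita

approx_pct = beta_1 * 1.0               # a 1% increase in x: about 0.147% in y

# Exact: scaling x by 1.01 raises log(x) by log(1.01), so y scales by 1.01^beta_1.
exact_pct = 100 * (1.01 ** beta_1 - 1)  # about 0.146%: nearly identical

print(approx_pct, exact_pct)
```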
--- # Nonlinear relationships <img src="002-multiple-regression_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" /> --- # A level-log model <br> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -9.100889 </td> <td style="text-align:right;"> 1.227674 </td> <td style="text-align:right;"> -7.413117 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> log(gdpPercap) </td> <td style="text-align:right;"> 8.405085 </td> <td style="text-align:right;"> 0.148762 </td> <td style="text-align:right;"> 56.500206 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> -- <br> - .b[Interpretation:] - A one-percent change in the independent variable leads to a `\(\beta_1 \div 100\)` change in the dependent variable, on average. - A 1 % increase in GDP per capita _.b[increases]_ life expectancy by 0.0841 years. --- # Quick summary <br> .center[**A nice interpretation reference**<sup>*</sup>].footnote[ by Kyle Raze] <table> <thead> <tr> <th style="text-align:left;"> Model's functional form </th> <th style="text-align:left;"> How to interpret \(\beta_1\)? 
</th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;vertical-align:top;"> Level-level <br> \(y_i = \beta_0 + \beta_1 x_i + u_i\) </td>
   <td style="text-align:left;font-style: italic;color: black !important;"> \(\Delta y = \beta_1 \cdot \Delta x\) <br> A one-unit increase in \(x\) leads to a <br> \(\beta_1\)-unit increase in \(y\) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;vertical-align:top;"> Log-level <br> \(\log(y_i) = \beta_0 + \beta_1 x_i + u_i\) </td>
   <td style="text-align:left;font-style: italic;color: black !important;"> \(\%\Delta y = 100 \cdot \beta_1 \cdot \Delta x\) <br> A one-unit increase in \(x\) leads to a <br> \(\beta_1 \cdot 100\)-percent increase in \(y\) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;vertical-align:top;"> Log-log <br> \(\log(y_i) = \beta_0 + \beta_1 \log(x_i) + u_i\) </td>
   <td style="text-align:left;font-style: italic;color: black !important;"> \(\%\Delta y = \beta_1 \cdot \%\Delta x\) <br> A one-percent increase in \(x\) leads to a <br> \(\beta_1\)-percent increase in \(y\) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;vertical-align:top;"> Level-log <br> \(y_i = \beta_0 + \beta_1 \log(x_i) + u_i\) </td>
   <td style="text-align:left;font-style: italic;color: black !important;"> \(\Delta y = (\beta_1 \div 100) \cdot \%\Delta x\) <br> A one-percent increase in \(x\) leads to a <br> \(\beta_1 \div 100\)-unit increase in \(y\) </td>
  </tr>
</tbody>
</table>

---

# The meaning of linear regression

If we are able to use these nonlinear functional forms, what does *linear* regression mean after all?

--

- As long as the model remains .b[linear in parameters], it is still a *linear* regression model.

- This means that we cannot .b[mess around] with our `\(\beta\)` coefficients!
-- <br> - **Examples**: $$ `\begin{align} log(wage_i) = \beta_0 + \beta_1educ_i + \beta_2exper_i + \beta_3tenure_i + \beta_4gender_i + u_i \end{align}` $$ -- $$ `\begin{align} log(wage_i) = \beta_0 + log(\beta_1)educ_i + \beta_2exper_i + \beta_3^2tenure_i + \beta_4gender_i + u_i \end{align}` $$ -- <br> - Which one is .b[not] linear in parameters? --- layout: false class: inverse, middle # Next time: Multiple Regression in practice --- exclude: true