class: center, middle, inverse, title-slide .title[ # .b[Multiple Linear Regression] ] .subtitle[ ## .b[.green[EC 339]] ] .author[ ### Marcio Santetti ] .date[ ### Fall 2022 ] --- class: inverse, middle # Motivation --- # Beyond simple regression <br> Simple regression models may not be .b[sufficient] to describe the relationships we are interested in. -- <br> A few reasons: -- - Avoiding .b[bias] due to *omitted variables*; - More consistency with .b[economic theory]; - Usually, relationships we study are a product of .b[several different events]. --- # Multiple regression models In .b[standard] notation: $$ `\begin{align} y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3x_{3i} + ... + \beta_kx_{ki} + u_i \hspace{.7cm}\ \\ \forall \ i = 1, 2, 3,..., n \end{align}` $$ -- <br> - From last week... $$ `\begin{align} wage_i = \beta_0 + \beta_1educ_i + u_i \end{align}` $$ -- - And now... $$ `\begin{align} wage_i = \beta_0 + \beta_1educ_i + \beta_2exper_i + \beta_3tenure_i + \beta_4gender_i + u_i \end{align}` $$ -- <br> .small[.b[Important:] even if we are only interested in the effect of *educ* on *wage*, the model above is more consistent with theoretical priors.] --- # An example .center[ ``` #> #> =============================================== #> Dependent variable: #> --------------------------- #> wage #> ----------------------------------------------- #> educ 0.541*** #> (0.053) #> #> Constant -0.905 #> (0.685) #> #> ----------------------------------------------- #> Observations 526 #> R2 0.165 #> Adjusted R2 0.163 #> Residual Std. 
Error          3.378 (df = 524)
#> F Statistic         103.363*** (df = 1; 524)
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```
]

---

# An example

.smaller[.center[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                                wage            
#> -----------------------------------------------
#> educ                          0.572***         
#>                               (0.049)          
#> exper                         0.025**          
#>                               (0.012)          
#> tenure                        0.141***         
#>                               (0.021)          
#> female                       -1.811***         
#>                               (0.265)          
#> Constant                      -1.568**         
#>                               (0.725)          
#> -----------------------------------------------
#> Observations                    526            
#> R2                             0.364           
#> Adjusted R2                    0.359           
#> Residual Std. Error       2.958 (df = 521)     
#> F Statistic            74.398*** (df = 4; 521) 
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```
]]

---
layout: false
class: inverse, middle

# Interpreting multiple coefficients

---

# The *ceteris paribus* assumption

When .b[interpreting] multiple regression models, we .b[isolate] the effect of one independent variable on the dependent variable.

--

Therefore, the estimated .b[slope parameters] `\((\hat{\beta}_1,...,\hat{\beta}_k)\)` give the change in `\(y\)` resulting from a one-unit change in the corresponding regressor, .it[holding all other independent variables constant].

--

.it[Mathematically] speaking...

<br>

$$
`\begin{align}
wage_i = \beta_0 + \beta_1educ_i + \beta_2exper_i + \beta_3tenure_i + \beta_4gender_i + u_i
\end{align}`
$$

$$
`\begin{align}
\dfrac{\partial wage_i}{\partial educ_i} = \beta_1
\end{align}`
$$

$$
`\begin{align}
\dfrac{\partial wage_i}{\partial exper_i} = \beta_2
\end{align}`
$$

---
layout: false
class: inverse, middle

# Goodness-of-fit

---

# Goodness-of-fit

As more variables are added to our model, the *R*<sup>2</sup> increases in a .b[mechanical] fashion.

--

- .b[Problem!]
<br>

--

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Simple regression wage model </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.16 </td>
  </tr>
</tbody>
</table>

<br>

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Multiple regression wage model </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.36 </td>
  </tr>
</tbody>
</table>

---

# Goodness-of-fit

<br><br>

- Let us add a `construc` indicator variable to our previous model.

<br>

- `construc == 1` if working in the construction sector;
- `construc == 0` otherwise.

--

<br><br>

$$
`\begin{align}
wage_i = \beta_0 + \beta_1educ_i + \beta_2exper_i + \beta_3tenure_i + \beta_4gender_i + \beta_5construc_i + u_i
\end{align}`
$$

---

# Goodness-of-fit

.smaller[.center[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                                wage            
#> -----------------------------------------------
#> educ                          0.577***         
#>                               (0.050)          
#> exper                         0.026**          
#>                               (0.012)          
#> tenure                        0.141***         
#>                               (0.021)          
#> female                       -1.788***         
#>                               (0.266)          
#> construc                       0.563           
#>                               (0.626)          
#> Constant                      -1.685**         
#>                               (0.736)          
#> -----------------------------------------------
#> Observations                    526            
#> R2                             0.365           
#> Adjusted R2                    0.358           
#> Residual Std. Error       2.958 (df = 520)     
#> F Statistic           59.658*** (df = 5; 520)  
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```
]]

---

# Goodness-of-fit

Before adding *construc*, the *R*<sup>2</sup> was .b[.364]; now it is .b[.365]. Why did it increase?

--

Let us have a closer look at its .b[formula]:

$$
`\begin{align}
R^2 = 1 - \dfrac{RSS}{TSS} = 1- \dfrac{\sum_{i=1}^{n}\hat{u}_i^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
\end{align}`
$$

--

- The .b[denominator] (*TSS*) does not change as we add variables, while the .b[numerator] (*RSS*) can only decrease or stay the same. Hence, the *R*<sup>2</sup> can never fall.
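This is easy to verify numerically. A minimal sketch in Python with .b[simulated] data (not the wage dataset): even a regressor of pure noise weakly raises the *R*<sup>2</sup>, because OLS can always set its coefficient to zero and do no worse.

```python
import numpy as np

# Simulated data: y depends on x1 only; "noise" is unrelated by construction.
rng = np.random.default_rng(339)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (X already contains the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

X_short = np.column_stack([np.ones(n), x1])
X_long = np.column_stack([np.ones(n), x1, noise])  # add the irrelevant regressor

r2_short = r_squared(X_short, y)
r2_long = r_squared(X_long, y)
print(r2_short <= r2_long)  # True: R^2 never falls when a regressor is added
```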
-- - .b[Solution]: the .it[adjusted] *R*<sup>2</sup>, <SPAN STYLE="text-decoration:overline">*R*</SPAN><sup>2</sup>: $$ `\begin{align} \bar{R}^2 = 1 - \dfrac{\sum_{i=1}^{n}\hat{u}_i^2/(n-k-1)}{\sum_{i=1}^{n}(y_i - \bar{y})^2/(n-1)} \end{align}` $$ -- - `\(k=\)` # independent variables; - `\((n-k-1) =\)` # degrees-of-freedom. --- # Goodness-of-fit <br> Multiple regression model .b[without] *construc*: <table> <thead> <tr> <th style="text-align:right;"> R-squared </th> <th style="text-align:right;"> Adjusted R-squared </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.36354 </td> <td style="text-align:right;"> 0.35865 </td> </tr> </tbody> </table> <br> Multiple regression model .b[with] *construc*: <table> <thead> <tr> <th style="text-align:right;"> R-squared </th> <th style="text-align:right;"> Adjusted R-squared </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.36453 </td> <td style="text-align:right;"> 0.35842 </td> </tr> </tbody> </table> -- .right[What happened?] --- layout: false class: inverse, middle # Functional forms --- # Nonlinear relationships .smaller[Many times, the relationships we are interested in .b[do not] follow a linear pattern.] 
<img src="002-multiple-regression_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> --- # A level-level model <br> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 53.955561 </td> <td style="text-align:right;"> 0.314995 </td> <td style="text-align:right;"> 171.29025 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> gdpPercap </td> <td style="text-align:right;"> 0.000765 </td> <td style="text-align:right;"> 0.000026 </td> <td style="text-align:right;"> 29.65766 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> -- <br> - .b[Interpretation:] - A 10,000-dollar increase in GDP per capita _.b[increases]_ life expectancy by 7.65 years. 
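---

# A level-level model

A quick sanity check on the interpretation above, as a Python sketch (the coefficient comes from the table on the previous slide; the 10,000-dollar change is just an illustrative value):

```python
# Level-level model: delta_y = beta_1 * delta_x.
beta_1 = 0.000765   # estimated effect on life expectancy (years) per dollar
delta_x = 10_000    # an illustrative 10,000-dollar increase in GDP per capita

delta_y = beta_1 * delta_x
print(delta_y)      # about 7.65 additional years of life expectancy, on average
```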
--- # Nonlinear relationships <img src="002-multiple-regression_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> --- # A log-level model <br> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 3.9666387 </td> <td style="text-align:right;"> 0.0058346 </td> <td style="text-align:right;"> 679.85339 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> gdpPercap </td> <td style="text-align:right;"> 0.0000129 </td> <td style="text-align:right;"> 0.0000005 </td> <td style="text-align:right;"> 27.03958 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> -- <br> - .b[Interpretation:] - A one-unit increase in the explanatory variable increases the dependent variable by approximately `\(\beta_1 \times 100\)` percent, on average. - A 1,000-dollar increase in GDP per capita _.b[increases]_ life expectancy by 1.29%. 
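---

# A log-level model

The `\(\beta_1 \times 100\)` rule is an .b[approximation] that is accurate for small changes. A Python sketch using the coefficient from the previous slide; the exact percentage change exponentiates the change in logs:

```python
import math

# Log-level model: %delta_y is approximately 100 * beta_1 * delta_x.
beta_1 = 0.0000129   # estimated effect on log(life expectancy) per dollar
delta_x = 1_000      # an illustrative 1,000-dollar increase in GDP per capita

approx_pct = 100 * beta_1 * delta_x                 # rule of thumb: about 1.29%
exact_pct = 100 * (math.exp(beta_1 * delta_x) - 1)  # exact: about 1.30%

print(approx_pct, exact_pct)
```

The two numbers are nearly identical here, which is why the rule of thumb is safe for small changes.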
--- # Nonlinear relationships <img src="002-multiple-regression_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> --- # A log-log model <br> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 2.864177 </td> <td style="text-align:right;"> 0.0232827 </td> <td style="text-align:right;"> 123.01718 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> log(gdpPercap) </td> <td style="text-align:right;"> 0.146549 </td> <td style="text-align:right;"> 0.0028213 </td> <td style="text-align:right;"> 51.94452 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> -- <br> - .b[Interpretation:] - A one-percent increase in the independent variable results in a `\(\beta_1\)` percent change in the dependent variable, on average. - A 1 % increase in GDP per capita _.b[increases]_ life expectancy by 0.147 %. 
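---

# A log-log model

In a log-log model, `\(\beta_1\)` is an .b[elasticity]. A Python sketch with the coefficient from the previous slide, comparing the rule of thumb with the exact effect of a one-percent increase in `\(x\)`:

```python
# Log-log model: %delta_y is approximately beta_1 * %delta_x.
beta_1 = 0.146549   # estimated elasticity of life expectancy w.r.t. GDP per capita

approx_pct = beta_1 * 1.0               # a 1% increase in x: about 0.147% in y

# Exact: scaling x by 1.01 raises log(x) by log(1.01), so y scales by 1.01^beta_1.
exact_pct = 100 * (1.01 ** beta_1 - 1)  # about 0.146%: nearly identical

print(approx_pct, exact_pct)
```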
--- # Nonlinear relationships <img src="002-multiple-regression_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" /> --- # A level-log model <br> <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -9.100889 </td> <td style="text-align:right;"> 1.227674 </td> <td style="text-align:right;"> -7.413117 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> log(gdpPercap) </td> <td style="text-align:right;"> 8.405085 </td> <td style="text-align:right;"> 0.148762 </td> <td style="text-align:right;"> 56.500206 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> -- <br> - .b[Interpretation:] - A one-percent change in the independent variable leads to a `\(\beta_1 \div 100\)` change in the dependent variable, on average. - A 1 % increase in GDP per capita _.b[increases]_ life expectancy by 0.0841 years. --- # Quick summary <br> .center[**A nice interpretation reference**<sup>*</sup>].footnote[ by Kyle Raze] <table> <thead> <tr> <th style="text-align:left;"> Model's functional form </th> <th style="text-align:left;"> How to interpret \(\beta_1\)? 
</th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;vertical-align:top;"> Level-level <br> \(y_i = \beta_0 + \beta_1 x_i + u_i\) </td>
   <td style="text-align:left;font-style: italic;color: black !important;"> \(\Delta y = \beta_1 \cdot \Delta x\) <br> A one-unit increase in \(x\) leads to a <br> \(\beta_1\)-unit increase in \(y\) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;vertical-align:top;"> Log-level <br> \(\log(y_i) = \beta_0 + \beta_1 x_i + u_i\) </td>
   <td style="text-align:left;font-style: italic;color: black !important;"> \(\%\Delta y = 100 \cdot \beta_1 \cdot \Delta x\) <br> A one-unit increase in \(x\) leads to a <br> \(\beta_1 \cdot 100\)-percent increase in \(y\) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;vertical-align:top;"> Log-log <br> \(\log(y_i) = \beta_0 + \beta_1 \log(x_i) + u_i\) </td>
   <td style="text-align:left;font-style: italic;color: black !important;"> \(\%\Delta y = \beta_1 \cdot \%\Delta x\) <br> A one-percent increase in \(x\) leads to a <br> \(\beta_1\)-percent increase in \(y\) </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;vertical-align:top;"> Level-log <br> \(y_i = \beta_0 + \beta_1 \log(x_i) + u_i\) </td>
   <td style="text-align:left;font-style: italic;color: black !important;"> \(\Delta y = (\beta_1 \div 100) \cdot \%\Delta x\) <br> A one-percent increase in \(x\) leads to a <br> \(\beta_1 \div 100\)-unit increase in \(y\) </td>
  </tr>
</tbody>
</table>

---

# The meaning of linear regression

If we are able to use these nonlinear functional forms, what does *linear* regression mean after all?

--

- As long as the model remains .b[linear in parameters], it is still a *linear* regression model.

- This means that we cannot .b[mess around] with our `\(\beta\)` coefficients!
-- <br> - **Examples**: $$ `\begin{align} log(wage_i) = \beta_0 + \beta_1educ_i + \beta_2exper_i + \beta_3tenure_i + \beta_4gender_i + u_i \end{align}` $$ -- $$ `\begin{align} log(wage_i) = \beta_0 + log(\beta_1)educ_i + \beta_2exper_i + \beta_3^2tenure_i + \beta_4gender_i + u_i \end{align}` $$ -- <br> - Which one is .b[not] linear in parameters? --- layout: false class: inverse, middle # Next time: Multiple Regression in practice --- exclude: true