class: center, middle, inverse, title-slide

.title[
# .b[Multicollinearity]
]
.subtitle[
## .b[.green[EC 339]]
]
.author[
### Marcio Santetti
]
.date[
### Fall 2022
]

---
class: inverse, middle

# Motivation

---

# Linear relationships

Let us recall .hi[CLRM Assumption VI]:

--

<br>

> *No explanatory variable is a .red[perfect linear function] of any other explanatory variable.*

--

<br>

This assumption rules out a .hi[deterministic] relationship between two independent variables, such as

$$
`\begin{align}
x_1 = \alpha_0 + \alpha_1x_3
\end{align}`
$$

--

In practice, however, we should worry more about strong .hi[stochastic] relationships between two independent variables:

$$
`\begin{align}
x_1 = \alpha_0 + \alpha_1x_3 + \epsilon_i
\end{align}`
$$

---

# Linear relationships

What does a linear relationship between two independent variables mean in practice?

- If two variables (say, `\(x_1\)` and `\(x_3\)`) move .hi[together], how can OLS .hi-orange[distinguish] between the effects of these two on `\(y\)`?

--

- It .hi[cannot]!

--

<img src="007-multicollinearity_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: inverse, middle

# Perfect multicollinearity

---

# Perfect multicollinearity

<br>

CLRM Assumption VI only refers to .hi[perfect] multicollinearity.

--

In its presence, OLS estimation is .hi[indeterminate].

--

- Why?

--

How can we .red[disentangle] the effect of each independent variable on `\(y\)`?

--

<br>

The _.red[ceteris paribus]_ assumption no longer holds.

--

- .hi[Good news]: it is _rare_ in practice.

---
layout: false
class: inverse, middle

# Imperfect multicollinearity

---

# Imperfect multicollinearity

Even though CLRM Assumption VI .hi[does not] cover this version of multicollinearity, it is a real problem for OLS estimation.

--

Strong .hi-orange[stochastic] relationships imply strong .hi[correlation coefficients] between two independent variables.

--

<img src="007-multicollinearity_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

# Imperfect multicollinearity

Even though CLRM Assumption VI .hi[does not] cover this version of multicollinearity, it is a real problem for OLS estimation.

Strong .hi-orange[stochastic] relationships imply strong .hi[correlation coefficients] between two independent variables.

<img src="007-multicollinearity_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: inverse, middle

# Consequences of multicollinearity

---

# Consequences of multicollinearity

<br>

By itself, multicollinearity .hi[does not] cause .hi-orange[bias] in OLS `\(\beta\)` coefficients.

--

However, it affects OLS .hi[standard errors].

--

Recall that standard errors are part of the .hi[t-test formula]:

<br>

$$
`\begin{align}
t = \dfrac{\hat{\beta}_k}{SE(\hat{\beta}_k)}
\end{align}`
$$

--

<br>

Therefore, it affects OLS .hi-orange[inference].
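
---

# Consequences of multicollinearity

A small simulation helps to see both points at once. The sketch below uses *made-up* data (not any dataset from the course): the slope estimates stay centered on their true values, but the standard errors inflate when the two regressors move together.

```r
set.seed(123)

n  <- 200
x1 <- rnorm(n)
x3_collinear <- 0.95 * x1 + rnorm(n, sd = 0.1)  # strong stochastic relationship with x1
x3_unrelated <- rnorm(n)                        # no relationship with x1

y_col <- 1 + 2 * x1 + 2 * x3_collinear + rnorm(n)
y_unr <- 1 + 2 * x1 + 2 * x3_unrelated + rnorm(n)

# On average, the coefficients are unbiased in both cases, but the standard
# errors (and hence the t-statistics) are much worse under collinearity:
summary(lm(y_col ~ x1 + x3_collinear))$coefficients
summary(lm(y_unr ~ x1 + x3_unrelated))$coefficients
```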

---

# Consequences of multicollinearity

Visually:

- Which estimate is *relatively more efficient*?

<img src="007-multicollinearity_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: inverse, middle

# Dealing with multicollinearity

---

# Dealing with multicollinearity

Consider the following model:

<br>

$$
`\begin{aligned}
log(rgdpna_i) = \beta_0 + \beta_1pop_i + \beta_2emp_i + \beta_3ck_i + \beta_4ccon_i + u_i
\end{aligned}`
$$

<br>

where (for each country *i*):

- `rgdpna`: real GDP (millions of 2011 USD)
- `pop`: population (millions)
- `emp`: number of employed persons (millions)
- `ck`: capital services levels (index, USA = 1)
- `ccon`: real consumption (households and government)

---

# Dealing with multicollinearity

.center[
.small[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                             log(rgdpna)        
#> -----------------------------------------------
#> pop                           0.050***         
#>                               (0.018)          
#> emp                           -0.069           
#>                               (0.042)          
#> ck                           26.632***         
#>                               (6.518)          
#> ccon                        -0.00000***        
#>                              (0.00000)         
#> Constant                     10.785***         
#>                               (0.145)          
#> -----------------------------------------------
#> Observations                    130            
#> R2                             0.478           
#> Adjusted R2                    0.461           
#> Residual Std. Error       1.404 (df = 125)     
#> F Statistic           28.605*** (df = 4; 125)  
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```

]]

---

# Dealing with multicollinearity

<br>

A little modification:

<br>

$$
`\begin{aligned}
log(rgdpna_i) = \beta_0 + \beta_1log(emp_i) + \beta_2ck_i + \beta_3log(ccon_i) + u_i
\end{aligned}`
$$

---

# Dealing with multicollinearity

.center[
.small[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                             log(rgdpna)        
#> -----------------------------------------------
#> log(emp)                      -0.059**         
#>                               (0.029)          
#> ck                            -0.206           
#>                               (0.288)          
#> log(ccon)                     1.076***         
#>                               (0.027)          
#> Constant                      -0.487*          
#>                               (0.275)          
#> -----------------------------------------------
#> Observations                    130            
#> R2                             0.979           
#> Adjusted R2                    0.979           
#> Residual Std. Error       0.277 (df = 126)     
#> F Statistic          2,001.826*** (df = 3; 126)
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```

]]

---

# Dealing with multicollinearity

<br>

Checking .hi[correlation] coefficients:

<br>

- *Corr(pop<sub>i</sub>, emp<sub>i</sub>) = 0.987*

- *Corr(ccon<sub>i</sub>, emp<sub>i</sub>) = 0.980*

--

<br><br>

- *Corr(log(ccon<sub>i</sub>), emp<sub>i</sub>) = 0.584*

---

# Dealing with multicollinearity

A recommended procedure is to always check the .hi[correlation coefficients] among the chosen independent variables.

--

- In addition, we can calculate .hi-orange[Variance Inflation Factors] (VIFs):

<br>

$$
`\begin{align}
VIF(\hat{\beta_i}) = \dfrac{1}{(1-R_i^2)}
\end{align}`
$$

<br>

where `\(R_i^2\)` is the coefficient of determination from the *auxiliary regression* of the `\(i\)`-th independent variable on all the others.

--

- The procedure is to estimate one auxiliary regression for *each* independent variable.

- Then, store the `\(R^2\)` from each regression.

- A *VIF* greater than 5 is already sufficient to indicate high multicollinearity.

---

# Dealing with multicollinearity

In .mono[R]...

```r
model_1 %>% vif()
```

```
#>      pop      emp       ck     ccon 
#> 42.68883 48.52425 30.43790 27.30301
```

```r
model_2 %>% vif()
```

```
#>  log(emp)        ck log(ccon) 
#>  3.717818  1.516566  4.236570
```

<br>

- What do we conclude?
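
---

# Dealing with multicollinearity

To see where these numbers come from (the `vif()` function is available, e.g., in the .mono[car] package), a VIF can be reproduced "by hand" from its auxiliary regression. A minimal sketch, assuming the estimation sample is stored in a data frame called `my_data` (a placeholder name):

```r
# 1. Auxiliary regression: regress pop on the remaining regressors
aux_pop <- lm(pop ~ emp + ck + ccon, data = my_data)

# 2. Store its R-squared
r2_pop <- summary(aux_pop)$r.squared

# 3. Apply the VIF formula: 1 / (1 - R^2)
1 / (1 - r2_pop)  # should match the vif() value for pop (about 42.7)
```

Repeating these steps for `emp`, `ck`, and `ccon` reproduces the full `vif()` table.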

---

# Dealing with multicollinearity

In .mono[Stata]...

```{}
reg lrgdpna pop emp ck ccon
vif

    Variable |       VIF       1/VIF  
-------------+----------------------
         emp |     48.52    0.020608
         pop |     42.69    0.023425
          ck |     30.44    0.032854
        ccon |     27.30    0.036626
-------------+----------------------
    Mean VIF |     37.24
```

<br>

- What do we conclude?

---

# Dealing with multicollinearity

In .mono[Stata]...

```{}
reg lrgdpna lemp ck lccon
vif

    Variable |       VIF       1/VIF  
-------------+----------------------
       lccon |      4.24    0.236040
        lemp |      3.72    0.268975
          ck |      1.52    0.659385
-------------+----------------------
    Mean VIF |      3.16
```

<br>

- What do we conclude?

---
layout: false
class: inverse, middle

# Next time: Multicollinearity in practice

---
exclude: true