class: center, middle, inverse, title-slide

.title[
# .b[Multicollinearity]
]
.subtitle[
## .b[.green[EC 339]]
]
.author[
### Marcio Santetti
]
.date[
### Fall 2022
]

---
class: inverse, middle

# Motivation

---

# Linear relationships

Let us recall .hi[CLRM Assumption VI]:

--

<br>

> *No explanatory variable is a .red[perfect linear function] of any other explanatory variable.*

--

<br>

This assumption rules out a .hi[deterministic] relationship between two independent variables, such as

$$
`\begin{align}
x_1 = \alpha_0 + \alpha_1x_3
\end{align}`
$$

--

In practice, however, we should worry more about strong .hi[stochastic] relationships between two independent variables:

$$
`\begin{align}
x_1 = \alpha_0 + \alpha_1x_3 + \epsilon_i
\end{align}`
$$

---

# Linear relationships

What does a linear relationship between two independent variables mean in practice?

- If two variables (say, `\(x_1\)` and `\(x_3\)`) move .hi[together], how can OLS .hi-orange[distinguish] between the effects of these two on `\(y\)`?

--

- It .hi[cannot]!

--

<img src="007-multicollinearity_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: inverse, middle

# Perfect multicollinearity

---

# Perfect multicollinearity

<br>

CLRM Assumption VI only refers to .hi[perfect] multicollinearity.

--

In its presence, OLS estimation is .hi[indeterminate].

--

- Why?

--

How can we .red[disentangle] the effect of each independent variable on `\(y\)`?

--

<br>

The _.red[ceteris paribus]_ assumption no longer holds.

--

- .hi[Good news]: it is _rare_ in practice.

---
layout: false
class: inverse, middle

# Imperfect multicollinearity

---

# Imperfect multicollinearity

Even though CLRM Assumption VI .hi[does not] cover this version of multicollinearity, it is a real problem for OLS estimation.

--

Strong .hi-orange[stochastic] relationships imply strong .hi[correlation coefficients] between two independent variables.

--

<img src="007-multicollinearity_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

# Imperfect multicollinearity

Even though CLRM Assumption VI .hi[does not] cover this version of multicollinearity, it is a real problem for OLS estimation.

Strong .hi-orange[stochastic] relationships imply strong .hi[correlation coefficients] between two independent variables.

<img src="007-multicollinearity_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: inverse, middle

# Consequences of multicollinearity

---

# Consequences of multicollinearity

<br>

By itself, multicollinearity .hi[does not] cause .hi-orange[bias] in OLS `\(\beta\)` coefficients.

--

However, it affects OLS .hi[standard errors].

--

Recall that standard errors are part of the .hi[t-test formula]:

<br>

$$
`\begin{align}
t = \dfrac{\hat{\beta}_k}{SE(\hat{\beta}_k)}
\end{align}`
$$

--

<br>

Therefore, it affects OLS .hi-orange[inference].
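
---

# Consequences of multicollinearity

A small simulation helps to see both points at once. The sketch below uses *made-up* data (not any dataset from the course): the slope estimates stay centered on their true values, but the standard errors inflate when the two regressors move together.

```r
set.seed(123)

n  <- 200
x1 <- rnorm(n)
x3_collinear <- 0.95 * x1 + rnorm(n, sd = 0.1)  # strong stochastic relationship with x1
x3_unrelated <- rnorm(n)                        # no relationship with x1

y_col <- 1 + 2 * x1 + 2 * x3_collinear + rnorm(n)
y_unr <- 1 + 2 * x1 + 2 * x3_unrelated + rnorm(n)

# On average, the coefficients are unbiased in both cases, but the standard
# errors (and hence the t-statistics) are much worse under collinearity:
summary(lm(y_col ~ x1 + x3_collinear))$coefficients
summary(lm(y_unr ~ x1 + x3_unrelated))$coefficients
```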

---

# Consequences of multicollinearity

Visually:

- Which estimate is *relatively more efficient*?

<img src="007-multicollinearity_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

---
layout: false
class: inverse, middle

# Dealing with multicollinearity

---

# Dealing with multicollinearity

Consider the following model:

<br>

$$
`\begin{aligned}
log(rgdpna_i) = \beta_0 + \beta_1pop_i + \beta_2emp_i + \beta_3ck_i + \beta_4ccon_i + u_i
\end{aligned}`
$$

<br>

where (for each country *i*):

- `rgdpna`: real GDP (millions of 2011 USD)
- `pop`: population (millions)
- `emp`: number of employed persons (millions)
- `ck`: capital services levels (index, USA = 1)
- `ccon`: real consumption (households and government)

---

# Dealing with multicollinearity

.center[
.small[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                             log(rgdpna)        
#> -----------------------------------------------
#> pop                           0.050***         
#>                               (0.018)          
#> emp                           -0.069           
#>                               (0.042)          
#> ck                           26.632***         
#>                               (6.518)          
#> ccon                        -0.00000***        
#>                              (0.00000)         
#> Constant                     10.785***         
#>                               (0.145)          
#> -----------------------------------------------
#> Observations                    130            
#> R2                             0.478           
#> Adjusted R2                    0.461           
#> Residual Std. Error       1.404 (df = 125)     
#> F Statistic           28.605*** (df = 4; 125)  
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```

]]

---

# Dealing with multicollinearity

<br>

A little modification:

<br>

$$
`\begin{aligned}
log(rgdpna_i) = \beta_0 + \beta_1log(emp_i) + \beta_2ck_i + \beta_3log(ccon_i) + u_i
\end{aligned}`
$$

---

# Dealing with multicollinearity

.center[
.small[

```
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                             log(rgdpna)        
#> -----------------------------------------------
#> log(emp)                      -0.059**         
#>                               (0.029)          
#> ck                            -0.206           
#>                               (0.288)          
#> log(ccon)                     1.076***         
#>                               (0.027)          
#> Constant                      -0.487*          
#>                               (0.275)          
#> -----------------------------------------------
#> Observations                    130            
#> R2                             0.979           
#> Adjusted R2                    0.979           
#> Residual Std. Error       0.277 (df = 126)     
#> F Statistic          2,001.826*** (df = 3; 126)
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```

]]

---

# Dealing with multicollinearity

<br>

Checking .hi[correlation] coefficients:

<br>

- *Corr(pop<sub>i</sub>, emp<sub>i</sub>) = 0.987*

- *Corr(ccon<sub>i</sub>, emp<sub>i</sub>) = 0.980*

--

<br><br>

- *Corr(log(ccon<sub>i</sub>), emp<sub>i</sub>) = 0.584*

---

# Dealing with multicollinearity

A recommended procedure is to always check the .hi[correlation coefficients] among the chosen independent variables.

--

- In addition, we can calculate .hi-orange[Variance Inflation Factors] (VIFs):

<br>

$$
`\begin{align}
VIF(\hat{\beta_i}) = \dfrac{1}{(1-R_i^2)}
\end{align}`
$$

<br>

where `\(R_i^2\)` is the coefficient of determination from the *auxiliary regression* of the `\(i\)`-th independent variable on all the others.

--

- The procedure is to estimate one auxiliary regression for *each* independent variable.

- Then, store the `\(R^2\)` from each regression.

- A *VIF* greater than 5 is already sufficient to indicate high multicollinearity.

---

# Dealing with multicollinearity

In .mono[R]...

```r
model_1 %>% vif()
```

```
#>      pop      emp       ck     ccon 
#> 42.68883 48.52425 30.43790 27.30301
```

```r
model_2 %>% vif()
```

```
#>  log(emp)        ck log(ccon) 
#>  3.717818  1.516566  4.236570
```

<br>

- What do we conclude?
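
---

# Dealing with multicollinearity

To see where these numbers come from (the `vif()` function is available, e.g., in the .mono[car] package), a VIF can be reproduced "by hand" from its auxiliary regression. A minimal sketch, assuming the estimation sample is stored in a data frame called `my_data` (a placeholder name):

```r
# 1. Auxiliary regression: regress pop on the remaining regressors
aux_pop <- lm(pop ~ emp + ck + ccon, data = my_data)

# 2. Store its R-squared
r2_pop <- summary(aux_pop)$r.squared

# 3. Apply the VIF formula: 1 / (1 - R^2)
1 / (1 - r2_pop)  # should match the vif() value for pop (about 42.7)
```

Repeating these steps for `emp`, `ck`, and `ccon` reproduces the full `vif()` table.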

---

# Dealing with multicollinearity

In .mono[Stata]...

```{}
reg lrgdpna pop emp ck ccon
vif

    Variable |       VIF       1/VIF  
-------------+----------------------
         emp |     48.52    0.020608
         pop |     42.69    0.023425
          ck |     30.44    0.032854
        ccon |     27.30    0.036626
-------------+----------------------
    Mean VIF |     37.24
```

<br>

- What do we conclude?

---

# Dealing with multicollinearity

In .mono[Stata]...

```{}
reg lrgdpna lemp ck lccon
vif

    Variable |       VIF       1/VIF  
-------------+----------------------
       lccon |      4.24    0.236040
        lemp |      3.72    0.268975
          ck |      1.52    0.659385
-------------+----------------------
    Mean VIF |      3.16
```

<br>

- What do we conclude?

---
layout: false
class: inverse, middle

# Next time: Multicollinearity in practice

---
exclude: true