class: center, middle, inverse, title-slide

# Multiple Linear Regression: Estimation
## EC 320: Introduction to Econometrics
### Winter 2022

---
class: inverse, middle

# Prologue

---
# Other Things Being Equal

**Goal:** Isolate the effect of one variable on another.

- All else equal, how does increasing `\(X\)` affect `\(Y\)`?

--

**Challenge:** Changes in `\(X\)` often coincide with changes in other variables.

- Failure to account for other changes can _bias_ OLS estimates of the effect of `\(X\)` on `\(Y\)`.

--

**A potential solution:** Account for other variables using .hi[multiple linear regression].

- Easier to defend the exogeneity assumption.

---
# Other Things Equal?

OLS picks `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` that trace out the line of best fit.

Ideally, we would like to interpret the slope of the line as the causal effect of `\(X\)` on `\(Y\)`.

<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---
# Confounders

However, the data are grouped by a third variable `\(W\)`.

How would omitting `\(W\)` from the regression model affect the slope estimator?

<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---
# Confounders

The problem with `\(W\)` is that it affects both `\(Y\)` and `\(X\)`. Without adjusting for `\(W\)`, we cannot isolate the causal effect of `\(X\)` on `\(Y\)`.

<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

---
# Controlling for Confounders

.center[![Control](control.gif)]

---
# Controlling for Confounders


```r
lm(Y ~ X, data = df) %>% tidy()
```

```
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    1.51     0.169       8.91 3.36e-16
#> 2 X              0.494    0.0811      6.10 5.53e- 9
```

--


```r
lm(Y ~ X + W, data = df) %>% tidy()
```

```
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    1.11     0.104      10.6  3.57e-21
#> 2 X             -0.518    0.0731     -7.09 2.32e-11
#> 3 W              3.88     0.208      18.6  2.32e-45
```

---
class: inverse, middle

# Multiple Linear Regression

---
# Multiple Linear Regression
## More explanatory variables

**Simple linear regression** features one .pink[outcome variable] and one .purple[explanatory variable]:

`$$\color{#e64173}{Y_i} = \beta_0 + \beta_1 \color{#9370DB}{X_i} + u_i.$$`

**Multiple linear regression** features one .pink[outcome variable] and multiple .purple[explanatory variables]:

`$$\color{#e64173}{Y_i} = \beta_0 + \beta_1 \color{#9370DB}{X_{1i}} + \beta_2 \color{#9370DB}{X_{2i}} + \cdots + \beta_{m} \color{#9370DB}{X_{mi}} + u_i.$$`

--

**Why?**

--

- Better explain the variation in `\(Y\)`.
- Improve predictions.
- Avoid bias.

---
# Multiple Linear Regression

<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />

---
# OLS Estimation

As in simple linear regression, OLS minimizes the sum of squared residuals (RSS). However, residuals are now defined as

`$$\hat{u}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i} - \cdots - \hat{\beta}_{m} X_{mi}.$$`

--

To obtain estimates, take partial derivatives of RSS with respect to each `\(\hat{\beta}\)`, set each derivative equal to zero, and solve the resulting system of `\(m+1\)` equations.

- Without matrices, the algebra is difficult (see the sketch on the next slide).
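---
# OLS Estimation

With matrices, the solution is compact: `\(\hat{\beta} = (X'X)^{-1}X'Y\)`. Below is a minimal sketch on made-up data (`x1`, `x2`, `y`, and the coefficient values are simulated purely for illustration) showing that solving the normal equations directly reproduces the coefficients that `lm()` reports.


```r
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

# Design matrix: a column of ones (intercept) plus the regressors
X <- cbind(1, x1, x2)

# Normal equations: beta_hat = (X'X)^(-1) X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
drop(beta_hat)

# Same numbers from lm()
coef(lm(y ~ x1 + x2))
```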
For the remainder of this course, we will let .mono[R] do the work for us.

---
# Coefficient Interpretation

**Model**

`$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_{m} X_{mi} + u_i.$$`

**Interpretation**

- The intercept `\(\hat{\beta}_0\)` is the average value of `\(Y_i\)` when all of the explanatory variables are equal to zero.
- The slope parameters `\(\hat{\beta}_1, \dots, \hat{\beta}_{m}\)` give us the change in `\(Y_i\)` from a one-unit change in `\(X_j\)`, holding the other `\(X\)` variables constant.

---
# Algebraic Properties of OLS

The OLS first-order conditions yield the same properties as before.

1. Residuals sum to zero: `\(\sum_{i=1}^n \hat{u}_i = 0\)`.

2. The sample covariance between each independent variable and the residuals is zero.

3. The point `\((\bar{X}_1, \bar{X}_2, \dots, \bar{X}_m, \bar{Y})\)` is always on the fitted regression "line."

---
# Goodness of Fit

Fitted values are defined similarly:

`$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_{m} X_{mi}.$$`

The formula for `\(R^2\)` is the same as before:

`$$R^2 = \frac{\sum(\hat{Y}_i-\bar{Y})^2}{\sum(Y_i-\bar{Y})^2}.$$`

---
# Goodness of Fit

**Model 1:** `\(Y_i = \beta_0 + \beta_1 X_{1i} + u_i\)`.

**Model 2:** `\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + v_i\)`.

--

.hi-green[True or false?] Model 2 will yield a lower `\(R^2\)` than Model 1.

- Hint: Think of `\(R^2\)` as `\(R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}\)`.

---
# Goodness of Fit

<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

---
# Goodness of Fit

<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---
# Goodness of Fit

**Problem:** As we add variables to our model, `\(R^2\)` *mechanically* increases.

--

To see this problem, we can simulate a dataset of 10,000 observations on `\(y\)` and 100 random `\(x\)` variables. **There is no relationship between `\(y\)` and any of the `\(x\)` variables!**

Pseudo-code outline of the simulation:

--

.pseudocode-small[

- Generate 10,000 observations on `\(y\)`
- Generate 10,000 observations on variables `\(x_1\)` through `\(x_{100}\)`
- Regressions
  - LM<sub>1</sub>: Regress `\(y\)` on `\(x_1\)`; record R<sup>2</sup>
  - LM<sub>2</sub>: Regress `\(y\)` on `\(x_1\)` and `\(x_2\)`; record R<sup>2</sup>
  - ...
  - LM<sub>99</sub>: Regress `\(y\)` on `\(x_1\)`, `\(x_2\)`, ..., `\(x_{99}\)`; record R<sup>2</sup>

]

---
# Goodness of Fit

**Problem:** As we add variables to our model, `\(R^2\)` *mechanically* increases.

.mono[R] code for the simulation:


```r
library(dplyr)  # pipes and bind_rows()
library(furrr)  # future_map()

set.seed(1234)
# plan(multisession)  # optional: run the regressions in parallel

y <- rnorm(1e4)                             # 10,000 obs on y
x <- matrix(data = rnorm(1e6), nrow = 1e4)  # 10,000 by 100 matrix of x's
x <- cbind(matrix(data = 1, nrow = 1e4, ncol = 1), x)  # prepend a column of ones

# Regress y on the first i x variables; record R2 and adjusted R2
r_fun <- function(i) {
  tmp_reg <- lm(y ~ x[, 1:(i + 1)]) %>% summary()
  data.frame(
    k = i + 1,
    r2 = tmp_reg$r.squared,
    r2_adj = tmp_reg$adj.r.squared
  )
}

r_df <- future_map(1:(1e2 - 1), r_fun) %>% bind_rows()
r_df
```

---
# Goodness of Fit

**Problem:** As we add variables to our model, `\(\color{#314f4f}{R^2}\)` *mechanically* increases.
<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/r2 plot-1.svg" style="display: block; margin: auto;" /> --- # Goodness of Fit **One solution:** .pink[Adjusted] `\(\color{#e64173}{R^2}\)` <img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/adjusted r2 plot-1.svg" style="display: block; margin: auto;" /> --- # Goodness of Fit **Problem:** As we add variables to our model, `\(R^2\)` *mechanically* increases. **One solution:** Penalize for the number of variables, _e.g._, adjusted `\(R^2\)`: $$ \bar{R}^2 = 1 - \dfrac{\sum_i \left( Y_i - \hat{Y}_i \right)^2/(n-m-1)}{\sum_i \left( Y_i - \bar{Y} \right)^2/(n-1)} $$ *Note:* Adjusted `\(R^2\)` need not be between 0 and 1. --- # Goodness of Fit ## Example: 2016 Election ```r lm(trump_margin ~ white, data = election) %>% glance() ``` ``` #> # A tibble: 1 × 12 #> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 0.320 0.320 25.4 1462. 1.51e-262 1 -14472. 28950. 28969. #> # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` -- ```r lm(trump_margin ~ white + poverty, data = election) %>% glance() ``` ``` #> # A tibble: 1 × 12 #> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 0.345 0.344 24.9 818. 4.20e-286 2 -14414. 28836. 28860. #> # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` --- # OLS Assumptions Same as before, except for .pink[assumption 2]: 1. **Linearity:** The population relationship is linear in parameters with an additive error term. 2. .hi[No perfect collinearity:] .pink[No] `\(\color{#e64173}{X}\)` .pink[variable is a perfect linear combination of the others.] <!-- 3. **Random Sampling:** We have a random sample from the population of interest. --> 3. **Exogeneity:** The `\(X\)` variable is exogenous (*i.e.,* `\(\mathop{\mathbb{E}}\left( u|X \right) = 0\)`). 4. **Homoskedasticity:** The error term has the same variance for each value of the independent variable (*i.e.,* `\(\mathop{\text{Var}}(u|X) = \sigma^2\)`). 5. **Non-autocorrelation:** The values of error terms are independent from one another (*i.e.,* `\(E[u_i u_j]=0, \forall i \text{ s.t. } i \neq j\)`) 6. **Normality:** The population error term is normally distributed with mean zero and variance `\(\sigma^2\)` (*i.e.,* `\(u \sim N(0,\sigma^2)\)`) --- # Perfect Collinearity ## Example: 2016 Election OLS cannot estimate parameters for .mono[white] and .mono[nonwhite] simultaneously. - .mono[white] .mono[=] 100 .mono[-] .mono[nonwhite]. -- ```r lm(trump_margin ~ white + nonwhite, data = election) %>% tidy() ``` ``` #> # A tibble: 3 × 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) -40.7 1.95 -20.9 6.82e- 91 #> 2 white 0.910 0.0238 38.2 1.51e-262 #> 3 nonwhite NA NA NA NA ``` .mono[R] drops perfectly collinear variables for you. --- # Multiple Linear Regression ## Tradeoffs There are tradeoffs to remember as we add/remove variables: **Fewer variables** - Generally explain less variation in `\(y\)`. - Provide simple interpretations and visualizations (*parsimonious*). - May need to worry about omitted-variable bias. **More variables** - More likely to find *spurious* relationships (statistically significant due to chance; do not reflect true, population-level relationships). - More difficult to interpret the model. - May still leave out important variables. 
---
# Omitted Variables

<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/venn2-1.svg" style="display: block; margin: auto;" />

---
# Omitted Variables

<img src="11-Multiple_Linear_Regression_Estimation_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

---
# Omitted Variables

<table>
<caption>Math Score</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Explanatory variable </th>
   <th style="text-align:center;"> 1 </th>
   <th style="text-align:center;"> 2 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Intercept </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> -84.84 </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;"> -6.34 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;">  </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (18.57) </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (15.00) </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> log(Spend) </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> -1.52 </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;"> 11.34 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;">  </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (2.18) </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (1.77) </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Lunch </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;">  </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;"> -0.47 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;">  </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;">  </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (0.01) </td>
  </tr>
</tbody>
</table>

Data from 1,823 elementary schools in Michigan.

- *Math Score* is the average fourth-grade state math test score.
- *log(Spend)* is the natural logarithm of spending per pupil.
- *Lunch* is the percentage of students eligible for free or reduced-price lunch.
---
count: false

# Omitted Variables

<table>
<caption>Math Score</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> Explanatory variable </th>
   <th style="text-align:center;"> 1 </th>
   <th style="text-align:center;"> 2 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Intercept </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;"> -84.84 </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> -6.34 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;">  </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (18.57) </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (15.00) </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> log(Spend) </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;"> -1.52 </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> 11.34 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;">  </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;"> (2.18) </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (1.77) </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;line-height: 110%;font-style: italic;color: black !important;"> Lunch </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;">  </td>
   <td style="text-align:center;color: #23373b !important;line-height: 110%;font-weight: bold;"> -0.47 </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-style: italic;color: black !important;">  </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;">  </td>
   <td style="text-align:center;color: #23373b !important;color: #c2bebe !important;line-height: 110%;font-weight: bold;"> (0.01) </td>
  </tr>
</tbody>
</table>

Data from 1,823 elementary schools in Michigan.

- *Math Score* is the average fourth-grade state math test score.
- *log(Spend)* is the natural logarithm of spending per pupil.
- *Lunch* is the percentage of students eligible for free or reduced-price lunch.

---
# Omitted-Variable Bias

**Model 1:** `\(Y_i = \beta_0 + \beta_1 X_{1i} + u_i\)`.

**Model 2:** `\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + v_i\)`.

Estimating Model 1 (without `\(X_2\)`) yields .hi[omitted-variable bias:]

`$$\color{#e64173}{\text{Bias} = \beta_2 \frac{\mathop{\text{Cov}}(X_{1i}, X_{2i})}{\mathop{\text{Var}}(X_{1i})}}.$$`

--

The sign of the bias depends on

1. The effect of `\(X_2\)` on `\(Y\)`, _i.e._, `\(\beta_2\)`.

2. The correlation between `\(X_1\)` and `\(X_2\)`, _i.e._, `\(\mathop{\text{Cov}}(X_{1i}, X_{2i})\)`.
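---
# Omitted-Variable Bias
## Example: Simulating the bias

The bias formula is easy to verify by simulation. Here is a minimal sketch (all variable names and parameter values are made up for illustration): `\(X_2\)` affects `\(Y\)` and is correlated with `\(X_1\)`, so the short regression of `\(Y\)` on `\(X_1\)` alone is off by roughly `\(\beta_2 \mathop{\text{Cov}}(X_1, X_2)/\mathop{\text{Var}}(X_1)\)`.


```r
set.seed(320)
n <- 1e5

# X1 and X2 are positively correlated; both affect Y
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # beta1 = 2, beta2 = 3

# Short regression: omits X2
coef(lm(y ~ x1))["x1"]

# Long regression: recovers beta1
coef(lm(y ~ x1 + x2))["x1"]

# beta1 plus the bias implied by the formula
2 + 3 * cov(x1, x2) / var(x1)
```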