Metrics review, part 2

EC421, Set 03

Prologue

R showcase

ggplot2

  • Powerful graphing and mapping package for R.
  • Idea: Build your figures layer by layer.
  • Exportable to many applications; part of the tidyverse.

shiny

Schedule

Last Time

We reviewed the fundamentals of statistics and econometrics.

Today

We review more of the main/basic results in metrics.

This week

We will post the first assignment (focused on review) soon.
First we need to finish more (of this) review.

Multiple regression

Multiple regression

More explanatory variables

We’re moving from simple linear regression
(one outcome variable and one explanatory variable)

\[ \textcolor{#e64173}{y_i} = \beta_0 + \beta_1 \textcolor{#6A5ACD}{x_i} + u_i \]

to the land of multiple linear regression
(one outcome variable and multiple explanatory variables)

\[ \textcolor{#e64173}{y_i} = \beta_0 + \beta_1 \textcolor{#6A5ACD}{x_{1i}} + \beta_2 \textcolor{#6A5ACD}{x_{2i}} + \cdots + \beta_k \textcolor{#6A5ACD}{x_{ki}} + u_i \]

Why?

We can better explain variation in \(y\), improve predictions, avoid omitted-variable bias, …

Multiple regression

\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i \quad\) \(x_1\) is continuous \(\quad x_2\) is categorical

Multiple regression

The intercept and categorical variable \(x_2\) control for the groups’ means.

Multiple regression

With groups’ means removed:

Multiple regression

\(\hat{\beta}_1\) estimates the relationship between \(y\) and \(x_1\) after controlling for \(x_2\).

Multiple regression

Another way to think about it: We’re estimating two (parallel) lines.

Looking at our estimator can also help.

Multiple regression

For the simple linear regression \(y_i = \beta_0 + \beta_1 x_i + u_i\)

\[ \begin{aligned} \hat{\beta}_1 &= \dfrac{\sum_i \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)}{\sum_i \left( x_i -\overline{x} \right)^2} \\[0.3em] &= \dfrac{\sum_i \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)/(n-1)}{\sum_i \left( x_i -\overline{x} \right)^2 / (n-1)} \\[0.3em] &= \dfrac{\mathop{\hat{\text{Cov}}}(x,\,y)}{\mathop{\hat{\text{Var}}} \left( x \right)} \end{aligned} \]

Multiple regression

Simple linear regression estimator:

\[ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(x,\,y)}{\mathop{\hat{\text{Var}}} \left( x \right)} \]

Moving to multiple linear regression, the estimator changes slightly:

\[ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(\textcolor{#e64173}{\tilde{x}_1},\,y)}{\mathop{\hat{\text{Var}}} \left( \textcolor{#e64173}{\tilde{x}_1} \right)} \]

where \(\textcolor{#e64173}{\tilde{x}_1}\) is the residualized \(x_1\) variable—the variation remaining in \(x_1\) after controlling for the other explanatory variables.

Multiple regression

Consider the multiple-regression model

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u \]

Residualized \(x_{1}\) (\(\textcolor{#e64173}{\tilde{x}_1}\)) comes from regressing \(x_1\) on an intercept and all other explanatory variables (then collecting the residuals), i.e.,

\[ \begin{aligned} \hat{x}_{1} &= \hat{\gamma}_0 + \hat{\gamma}_2 \, x_{2} + \hat{\gamma}_3 \, x_{3} \\ \textcolor{#e64173}{\tilde{x}_{1}} &= x_{1} - \hat{x}_{1} \end{aligned} \]

allowing us to better understand our OLS multiple-regression estimator

\[ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(\textcolor{#e64173}{\tilde{x}_1},\,y)}{\mathop{\hat{\text{Var}}} \left( \textcolor{#e64173}{\tilde{x}_1} \right)} \]
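This residual-regression (Frisch–Waugh–Lovell) result is easy to check with a quick simulation; the data and coefficients below are made up purely for illustration:

```r
# Made-up data: verify that residualizing x1 recovers the multiple-regression slope
set.seed(42)
n  = 1e3
x2 = rnorm(n)
x3 = rnorm(n)
x1 = 0.5 * x2 - 0.3 * x3 + rnorm(n)  # x1 correlates with x2 and x3
y  = 1 + 2 * x1 + x2 - x3 + rnorm(n)
# Slope on x1 from the full multiple regression
b1_full = unname(coef(lm(y ~ x1 + x2 + x3))["x1"])
# Residualized x1: what's left of x1 after regressing it on the other regressors
x1_tilde = residuals(lm(x1 ~ x2 + x3))
# Cov(x1_tilde, y) / Var(x1_tilde) gives the same estimate
all.equal(b1_full, cov(x1_tilde, y) / var(x1_tilde))  # TRUE
```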

Multiple regression

Model fit

Measures of goodness of fit quantify how well a model describes/fits the data.

Common measure: \(R^2\) [R-squared] (a.k.a. coefficient of determination)

\[ R^2 = \dfrac{\sum_i (\hat{y}_i - \overline{y})^2}{\sum_i \left( y_i - \overline{y} \right)^2} = 1 - \dfrac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \overline{y} \right)^2} \]

Notice our old friend SSE: \(\sum_i \left( y_i - \hat{y}_i \right)^2 = \sum_i e_i^2\).

\(R^2\) literally tells us the share of the variation in \(y\) that our current model accounts for.
Thus \(0 \leq R^2 \leq 1\).
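To make the formula concrete, here is \(R^2\) computed by hand and checked against `lm()`'s summary (simulated data):

```r
# R2 by hand vs. lm()'s summary
set.seed(101)
x = rnorm(100)
y = 1 + 2 * x + rnorm(100)
fit = lm(y ~ x)
sse = sum(residuals(fit)^2)   # sum of squared errors
tss = sum((y - mean(y))^2)    # total sum of squares
all.equal(1 - sse / tss, summary(fit)$r.squared)  # TRUE
```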

Multiple regression

The problem: As we add variables to our model, \(R^2\) mechanically increases.

To see this problem, we can simulate a dataset of 10,000 observations on \(y\) and 1,000 random \(x_k\) variables—with no relationship between \(y\) and the \(x_k\)!

Pseudo-code outline of the simulation:

  • Generate 10,000 observations on \(y\)
  • Generate 10,000 observations on variables \(x_1\) through \(x_{1000}\)
  • Regressions
    • LM1: Regress \(y\) on \(x_1\); record \(R^2\)
    • LM2: Regress \(y\) on \(x_1\) and \(x_2\); record \(R^2\)
    • LM3: Regress \(y\) on \(x_1\), \(x_2\), and \(x_3\); record \(R^2\)
    • ⋮
    • LM1000: Regress \(y\) on \(x_1\), \(x_2\), …, \(x_{1000}\); record \(R^2\)

Multiple regression

The problem: As we add variables to our model, \(R^2\) mechanically increases.

R code for the simulation:

library(parallel)  # mclapply(), detectCores()
library(magrittr)  # %>% and %$% pipes
library(dplyr)     # bind_rows()
set.seed(1234)
y = rnorm(1e4)
x = matrix(data = rnorm(1e7), nrow = 1e4)
r_df = mclapply(X = 1:1e3, mc.cores = detectCores() - 1, FUN = function(i) {
  # Regress y on the first i columns of x (lm() adds the intercept)
  tmp_reg = lm(y ~ x[, 1:i]) %>% summary()
  data.frame(
    k = i,
    r2 = tmp_reg %$% r.squared,
    r2_adj = tmp_reg %$% adj.r.squared
  )
}) %>% bind_rows()

Multiple regression

The problem: As we add variables to our model, \(\textcolor{#314f4f}{R^2}\) mechanically increases.

Multiple regression

One solution: Adjusted \(\textcolor{#e64173}{R^2}\)

Multiple regression

The problem: As we add variables to our model, \(R^2\) mechanically increases.

One solution: Penalize for the number of variables, e.g., adjusted \(R^2\):

\[ \overline{R}^2 = 1 - \dfrac{\sum_i \left( y_i - \hat{y}_i \right)^2/(n-k-1)}{\sum_i \left( y_i - \overline{y} \right)^2/(n-1)} \]

Note: Adjusted \(R^2\) need not be between 0 and 1.
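We can verify the adjusted-\(R^2\) formula against `lm()`'s summary with simulated data (\(n\) and \(k\) chosen arbitrarily):

```r
# Adjusted R2 by hand vs. lm()'s summary
set.seed(202)
n = 100; k = 3
X = matrix(rnorm(n * k), nrow = n)
y = drop(1 + X %*% c(1, 0.5, 0) + rnorm(n))
fit = lm(y ~ X)
sse = sum(residuals(fit)^2)
tss = sum((y - mean(y))^2)
r2_adj = 1 - (sse / (n - k - 1)) / (tss / (n - 1))
all.equal(r2_adj, summary(fit)$adj.r.squared)  # TRUE
```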

Multiple regression

Tradeoffs

There are tradeoffs to remember as we add/remove variables:

Fewer variables

  • generally explain less variation in \(y\),
  • provide simple interpretations and visualizations (parsimonious),
  • may need to worry about omitted-variable bias (OVB).

More variables

  • more likely to find spurious relationships (statistically significant due to chance—does not reflect a true, population-level relationship),
  • more difficult to interpret the model,
  • may still miss important variables—still OVB.

Omitted-variable bias

Omitted-variable bias

We’ll go deeper into this issue in a few weeks, but as a refresher:

Omitted-variable bias (OVB) arises when we omit a variable that

  1. affects our outcome variable \(y\)

  2. correlates with an explanatory variable \(x_j\)

As its name suggests, this situation leads to bias in our estimate of \(\beta_j\).

Note: OVB is not exclusive to multiple linear regression, but it does require that multiple variables affect \(y\).

Omitted-variable bias

Example

Let’s imagine a simple model of the returns to schooling \[ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i \] where \(\text{School}_i\) gives \(i\)’s years of schooling; \(\text{Male}_i\) represents an indicator variable for whether individual \(i\) is male.

Thus

  • \(\beta_1\): the returns to an additional year of schooling (ceteris paribus)
  • \(\beta_2\): the “premium” for being male (ceteris paribus)

If \(\beta_2 > 0\), then males are favored in the labor market
(discrimination, all else equal).

Omitted-variable bias

Example, continued

From our population model

\[ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i \]

If a study focuses on the relationship between pay and schooling, i.e., \[ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \left(\beta_2 \text{Male}_i + u_i\right) \] \[ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \varepsilon_i \] the “disturbance” becomes \(\varepsilon_i = \beta_2 \text{Male}_i + u_i\).

OLS needs exogeneity to be unbiased. Exogeneity is likely violated here.

But even if \(\mathop{\boldsymbol{E}}\left[ u | X \right] = 0\), it is not true that \(\mathop{\boldsymbol{E}}\left[ \varepsilon | X \right] = 0\) so long as \(\beta_2 \neq 0\).

Unless \(\text{School}\) and \(\text{Male}\) are unrelated, OLS is biased.

Omitted-variable bias

Example, continued

Let’s try to see this result graphically.

Population model:

\[ \text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i \]

Our regression model that suffers from omitted-variable bias:

\[ \text{Pay}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \text{School}_i + e_i \]

Finally, imagine that women, on average, receive more schooling than men.

Omitted-variable bias

Example, continued: \(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)

The relationship between pay and schooling.

Omitted-variable bias

Example, continued: \(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)

Biased regression estimate: \(\widehat{\text{Pay}}_i = 31.3 - 0.9 \times \text{School}_i\)

Omitted-variable bias

Example, continued: \(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)

Recalling the omitted variable: Gender (female and male)

Omitted-variable bias

Example, continued: \(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)

Unbiased regression estimate: \(\widehat{\text{Pay}}_i = 20.9 + 0.4 \times \text{School}_i + 9.1 \times \text{Male}_i\)

Omitted-variable bias

Solutions

  1. Don’t omit variables

  2. Instrumental variables and two-stage least squares

Warning: There are situations in which neither solution is possible.

  1. Proceed with caution (sometimes you can sign the bias).

  2. Maybe just stop.

Interpreting coefficients

Interpreting coefficients

Continuous variables

Consider the relationship

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + u_i \]

where

  • \(\text{Pay}_i\) is a continuous variable measuring an individual’s pay
  • \(\text{School}_i\) is a continuous variable that measures years of education

Interpretations

  • \(\beta_0\): the \(y\)-intercept, i.e., \(\text{Pay}\) when \(\text{School} = 0\)
  • \(\beta_1\): the expected increase in \(\text{Pay}\) for a one-unit increase in \(\text{School}\)

Interpreting coefficients

Continuous variables

Deriving the slope’s interpretation:

\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \right] - \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell \right] &= \\[0.5em] \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + u \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + u \right] &= \\[0.5em] \left[ \beta_0 + \beta_1 (\ell + 1) \right] - \left[ \beta_0 + \beta_1 \ell \right] &= \\[0.5em] \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 &= \beta_1 \end{aligned} \]

I.e., the slope gives the expected increase in our outcome variable for a one-unit increase in the explanatory variable.

Interpreting coefficients

Continuous variables

If we have multiple explanatory variables, e.g.,

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Ability}_i + u_i \]

then the interpretation changes slightly.

\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \land \text{Ability} = \alpha \right] - & \\ \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell \land \text{Ability} = \alpha \right] &= \\ \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha + u \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha + u \right] &= \\ \left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha \right] - \left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha \right] &= \\ \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 + \beta_2 \alpha - \beta_2 \alpha &= \beta_1 \end{aligned} \]

I.e., the slope gives the expected increase in our outcome variable for a one-unit increase in the explanatory variable, holding all other variables constant (ceteris paribus).

Interpreting coefficients

Continuous variables

Alternative derivation

Consider the model

\[ y = \beta_0 + \beta_1 \, x + u \]

Differentiate the model:

\[ \dfrac{dy}{dx} = \beta_1 \]

Interpreting coefficients

Categorical variables

Consider the relationship

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{Female}_i + u_i \]

where

  • \(\text{Pay}_i\) is a continuous variable measuring an individual’s pay
  • \(\text{Female}_i\) is a binary/indicator variable taking \(1\) when \(i\) is female

Interpretations

  • \(\beta_0\): the expected \(\text{Pay}\) for non-females (i.e., when \(\text{Female} = 0\))
  • \(\beta_1\): the expected difference in \(\text{Pay}\) between females and non-females
  • \(\beta_0 + \beta_1\): the expected \(\text{Pay}\) for females

Interpreting coefficients

Categorical variables

Derivations

\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{Non-female} \right] &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right] \\ &= \mathop{\boldsymbol{E}}\left[ \beta_0 + 0 + u_i \right] \\ &= \beta_0 \end{aligned} \]

\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{Female} \right] &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right] \\ &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 + u_i \right] \\ &= \beta_0 + \beta_1 \end{aligned} \]

Note: If there are no other variables to condition on, then \(\hat{\beta}_1\) equals the difference in group means, e.g., \(\overline{\text{Pay}}_\text{Female} - \overline{\text{Pay}}_\text{Non-female}\).

Note 2: The holding-all-other-variables-constant interpretation also applies to categorical variables in multiple-regression settings.
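The group-means equivalence is easy to confirm with made-up data:

```r
# Made-up data: the OLS slope on a 0/1 variable equals the difference in group means
set.seed(303)
female = rep(c(0, 1), each = 50)
pay = 30 + 5 * female + rnorm(100)
b1 = unname(coef(lm(pay ~ female))["female"])
all.equal(b1, mean(pay[female == 1]) - mean(pay[female == 0]))  # TRUE
```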

Interpreting coefficients

Categorical variables

\(y_i = \beta_0 + \beta_1 x_i + u_i\) for binary variable \(x_i = \{\textcolor{#314f4f}{0}, \, \textcolor{#e64173}{1}\}\)


Interpreting coefficients

Interactions

Interactions allow the effect of one variable to change based upon the level of another variable.

Examples

  1. Does the effect of schooling on pay change by gender?

  2. Does the effect of gender on pay change by race?

  3. Does the effect of schooling on pay change by experience?

Interpreting coefficients

Interactions

Previously, we considered a model that allowed women and men to have different wages, but the model assumed the effect of school on pay was the same for everyone:

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + u_i \]

but we can also allow the effect of school to vary by gender:

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + \beta_3 \, \text{School}_i\times\text{Female}_i + u_i \]

Interpreting coefficients

Interactions

The model where schooling has the same effect for everyone (F and M):

Interpreting coefficients

Interactions

The model where schooling’s effect can differ by gender (F and M):

Interpreting coefficients

Interactions

Interpreting coefficients can be a little tricky with interactions, but the key is to carefully work through the math.

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + \beta_3 \, \text{School}_i\times\text{Female}_i + u_i \]

Expected returns for an additional year of schooling for women:

\[ \begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay}_i | \text{Female} \land \text{School} = \ell + 1 \right] - \mathop{\boldsymbol{E}}\left[ \text{Pay}_i | \text{Female} \land \text{School} = \ell \right] &= \\[.5em] \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell+1) + \beta_2 + \beta_3 (\ell + 1) + u_i \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 + \beta_3 \ell + u_i \right] &= \\[.5em] \beta_1 + \beta_3 \end{aligned} \]

Expected returns for an additional year of schooling for women: \(\beta_1 + \beta_3\)

  • \(\beta_1\): the expected return for an add. yr. of schooling for non-females;

  • \(\beta_3\): the difference in the returns to schooling for females vs. non-females.

Interpreting coefficients

Interactions

The previous slides focused on interactions where one variable was binary.

If both variables are continuous, then the interpretation is slightly trickier.

Key: Interactions simply mean that the effect of one variable depends on the level of another variable.

Interpreting coefficients

Interactions

Suppose we’re interested in the model

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Experience}_i + \beta_3 \, \text{School}_i\times\text{Experience}_i + u_i \]

where \(\text{School}_i\) and \(\text{Experience}_i\) are both continuous variables (in years).

How do we interpret the interaction here?

School’s effect on pay now depends on the level of experience.

Interpretation: Consider the partial derivative:

\[ \dfrac{\partial\text{Pay}_i}{\partial\text{School}_i} = \beta_1 + \beta_3 \text{Experience}_i \]

Interpreting coefficients

Interactions

In the model

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Experience}_i + \beta_3 \, \text{School}_i\times\text{Experience}_i + u_i \]

all else equal, an additional year of school changes pay by

\[ \beta_1 + \beta_3 \text{Experience} \]

Interpreting coefficients

Polynomials

Polynomials are just interactions: they interact a variable with itself.

\[ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{School}_i^2 + u_i \]

Here the effect of schooling depends on an individual’s level of schooling.

Interpretation: Back to the partial derivative:

\[ \dfrac{\partial\text{Pay}_i}{\partial\text{School}_i} = \beta_1 + 2 \beta_2 \text{School}_i \]

all else equal, an additional year of school changes pay by

\[ \beta_1 + 2 \beta_2 \text{School}_i \]
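With simulated data (hypothetical coefficients), we can evaluate this marginal effect at different levels of schooling:

```r
# Simulated data: the marginal effect of schooling varies with schooling itself
set.seed(505)
school = runif(500, min = 8, max = 20)
pay = 20 + 3 * school - 0.08 * school^2 + rnorm(500)
fit = lm(pay ~ school + I(school^2))
b1 = unname(coef(fit)["school"])
b2 = unname(coef(fit)["I(school^2)"])
# Estimated effect of one more year, evaluated at 10 and 18 years of schooling
b1 + 2 * b2 * c(10, 18)
```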

Interpreting coefficients

Binary outcomes

When your outcome variable is binary, the interpretation changes slightly.

Recall: The avg. of a binary variable gives the % of observations with a ‘1’.

Example: Avg(0, 0, 0, 1, 1) = 0.40 \(\implies\) 40% of observations = 1.

If your outcome is binary, then you are modeling the probability (percent) that the outcome equals one.

\[ \text{Employed}_i = \beta_0 + \beta_1 \text{School}_i + u_i \]

Interpretation: \(\beta_1\) is the effect of one additional year of schooling on the probability an individual is employed (all else equal).

Interpreting coefficients

Log-linear specification

In economics, you will frequently see logged outcome variables with linear (non-logged) explanatory variables, e.g.,

\[ \log(\text{Pay}_i) = \beta_0 + \beta_1 \, \text{School}_i + u_i \]

This specification changes our interpretation of the slope coefficients.

Interpretation

  • A one-unit increase in our explanatory variable changes the outcome variable by approximately \(100 \times \beta_1\) percent.

  • Example: An additional year of schooling increases pay by approximately 3 percent (for \(\beta_1 = 0.03\)).

Interpreting coefficients

Log-linear specification

Derivation

Consider the log-linear model

\[ \log(y) = \beta_0 + \beta_1 \, x + u \]

and differentiate

\[ \dfrac{dy}{y} = \beta_1 dx \]

So a marginal change in \(x\) (i.e., \(dx\)) leads to a \(\beta_1 dx\) percentage change in \(y\).

Interpreting coefficients

Log-linear specification

Because the log-linear specification comes with a different interpretation, you need to make sure it fits your data-generating process/model.

Does \(x\) change \(y\) in levels (e.g., a 3-unit increase) or percentages (e.g., a 10-percent increase)?

I.e., you need to be sure an exponential relationship makes sense:

\[ \log(y_i) = \beta_0 + \beta_1 \, x_i + u_i \iff y_i = e^{\beta_0 + \beta_1 x_i + u_i} \]

Interpreting coefficients

Log-linear specification

Interpreting coefficients

Log-log specification

Similarly, econometricians frequently employ log-log models, in which the outcome variable and at least one explanatory variable are logged, e.g.,

\[ \log(\text{Pay}_i) = \beta_0 + \beta_1 \, \log(\text{School}_i) + u_i \]

Interpretation:

  • A one-percent increase in \(x\) will lead to a \(\beta_1\) percent change in \(y\).
  • Often interpreted as an elasticity.

Interpreting coefficients

Log-log specification

Derivation

Consider the log-log model

\[ \log(y) = \beta_0 + \beta_1 \, \log(x) + u \]

and differentiate

\[ \dfrac{dy}{y} = \beta_1 \dfrac{dx}{x} \]

which says that for a one-percent increase in \(x\), we will see a \(\beta_1\) percent increase in \(y\). As an elasticity:

\[ \dfrac{dy}{dx} \dfrac{x}{y} = \beta_1 \]

Interpreting coefficients

Log-linear with a binary variable

Note: If you have a log-linear model with a binary indicator variable, the interpretation for the coefficient on that variable changes.

Consider

\[ \log(y_i) = \beta_0 + \beta_1 x_1 + u_i \]

for binary variable \(x_1\).

The interpretation of \(\beta_1\) is now

  • When \(x_1\) changes from 0 to 1, \(y\) will change by \(100 \times \left( e^{\beta_1} -1 \right)\) percent.
  • When \(x_1\) changes from 1 to 0, \(y\) will change by \(100 \times \left( e^{-\beta_1} -1 \right)\) percent.
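Both bullets follow from exponentiating the model; e.g., for the 0-to-1 change (holding \(u\) fixed):

\[ \dfrac{y \mid x_1 = 1}{y \mid x_1 = 0} = \dfrac{e^{\beta_0 + \beta_1 + u}}{e^{\beta_0 + u}} = e^{\beta_1} \quad\implies\quad \%\Delta y = 100 \times \left( e^{\beta_1} - 1 \right) \]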

Interpreting coefficients

Log-log specification

Additional topics

Additional topics

Inference vs. prediction

So far, we’ve focused mainly on statistical (causal) inference—using estimators and their distributional properties to learn about underlying, unknown population parameters.

\[ y_i = \textcolor{#e64173}{\hat{\beta}_0} + \textcolor{#e64173}{\hat{\beta}_1} \, x_{1i} + \textcolor{#e64173}{\hat{\beta}_2} \, x_{2i} + \cdots + \textcolor{#e64173}{\hat{\beta}_k} \, x_{ki} + e_i \]

Prediction includes a fairly different set of topics/tools within econometrics (and data science/machine learning)—creating models that accurately estimate individual observations.

\[ \textcolor{#e64173}{\hat{y}_i} = \mathop{\hat{f}}\left( x_1,\, x_2,\, \ldots x_k \right) \]

Additional topics

Inference vs. prediction

Succinctly

  • Inference: causality, \(\hat{\beta}_k\) (consistent and efficient), standard errors/hypothesis tests for \(\hat{\beta}_k\), generally OLS

  • Prediction: correlation, \(\hat{y}_i\) (low error), model selection, nonlinear models are much more common

Additional topics

Treatment effects and causality

Much of modern (micro)econometrics focuses on causally estimating (identifying) the effects of programs/policies.

In this literature, the program is often a binary variable, and we place high importance on finding an unbiased estimate for the program’s effect.

Additional topics

Transformations

Our linearity assumption requires

  1. parameters enter linearly (i.e., the \(\beta_k\) multiplied by variables)
  2. the \(u_i\) disturbances enter additively

We allow nonlinear relationships between \(y\) and the explanatory variables.

Examples

  • Polynomials and interactions: \(y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_2^2 + \beta_5 \left( x_1 x_2 \right) + u_i\)

  • Exponentials and logs: \(\log(y_i) = \beta_0 + \beta_1 x_1 + \beta_2 e^{x_2} + u_i\)

  • Indicators and thresholds: \(y_i = \beta_0 + \beta_1 x_1 + \beta_2 \, \mathbb{I}(x_1 \geq 100) + u_i\)

Additional topics

Transformation challenge: (literally) infinite possibilities. What do we pick?

Additional topics

\(y_i = \beta_0 + u_i\)

Additional topics

\(y_i = \beta_0 + \beta_1 x + u_i\)

Additional topics

\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + u_i\)

Additional topics

\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + u_i\)

Additional topics

\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + u_i\)

Additional topics

\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + u_i\)

Additional topics

Truth: \(y_i = 2 e^{x} + u_i\)

Additional topics

Outliers

Because OLS minimizes the sum of the squared errors, outliers can play a large role in our estimates.

Additional topics

Outliers

Because OLS minimizes the sum of the squared errors, outliers can play a large role in our estimates.

Common responses

  • remove the outliers from the dataset;

  • related: leave-one-out regression to identify influential observations;

  • replace outliers with the 99th percentile of their variable (winsorize);

  • take the log of the variable to “take care of” outliers.

Another option
Do nothing. Outliers are not always bad. Some people are “far” from the average. It may not make sense to try to change this variation.
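A minimal winsorizing sketch, using hypothetical data and the 99th-percentile cap mentioned above:

```r
# Hypothetical data with one extreme outlier
set.seed(404)
pay = c(rnorm(99, mean = 50, sd = 5), 500)
# Winsorize: replace values above the 99th percentile with that percentile
cap = quantile(pay, probs = 0.99)
pay_w = pmin(pay, cap)
c(before = max(pay), after = max(pay_w))
```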

Additional topics

Missing data

Similarly, missing data can affect your results.

R doesn’t know how to deal with a missing observation.

#> [1] NA

If you run a regression with missing values, R drops the observations missing those values.

If observations are missing in a nonrandom way, then dropping them can leave you with a nonrandom sample—even if the original sample was random.
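A small illustration of both behaviors (hypothetical values):

```r
x = c(1, 2, NA, 4)
mean(x)                # NA: missing values propagate by default
mean(x, na.rm = TRUE)  # drop the NA before averaging
# lm() silently drops incomplete rows:
fit = lm(y ~ x, data = data.frame(y = c(1, 2, 3, 4), x = x))
nobs(fit)              # 3 of the 4 observations were used
```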

Wrapping up

Wrapping up

We’ve refreshed the main ingredients for OLS regression:

  • OLS estimates unknown population parameters using a random sample;
    uncertainty is unavoidable.
  • standard errors, conf. intervals, and hyp. tests help us infer from sample to population (while accounting for sampling variability/uncertainty).
  • omitted variables bias \(\beta\) estimates when they affect \(y\) and correlate with regressors.
  • coef. interpretation depends on the variable type & model spec.

Wrapping up

Moving forward…

So far, the big message has been

  • OLS is powerful when we satisfy its assumptions.
  • Different assumptions protect different results
    • exogeneity provides unbiasedness,
    • the disturbance’s variance/covariance affects efficiency and inference.
  • Residuals \(e_i\) help us learn about unobserved disturbances \(u_i\).

Next What happens when we violate \(u_i\) var. or cov. assumptions?

  • Is OLS biased? What happens to efficiency and inference?
  • What can we do about it?