class: center, middle, inverse, title-slide .title[ # Metrics Review, Part 2 ] .subtitle[ ## EC 421, Set 3 ] .author[ ### Edward Rubin ] --- class: inverse, middle <style type="text/css"> @media print { .has-continuation { display: block !important; } } </style> # Prologue --- # .mono[R] showcase **[.mono[ggplot2]](https://ggplot2.tidyverse.org/reference/index.html)** - Incredibly powerful graphing and mapping package for .mono[R]. - Written in a way that helps you build your figures layer by layer. - Exportable to many applications. - Part of the `tidyverse`. **[.mono[shiny]](https://shiny.rstudio.com)** - Export your figures and code to interactive web apps. - Enormous range of applications - [Distribution calculator](https://gallery.shinyapps.io/dist_calc/) - [Tabsets](https://shiny.rstudio.com/gallery/tabsets.html) - [Traveling salesman](https://gallery.shinyapps.io/shiny-salesman/) --- name: schedule # Schedule ### Last Time We reviewed the fundamentals of statistics and econometrics. ### Today We review more of the main/basic results in metrics. ### This week We will post the **first assignment** (focused on *review*) soon. <br>First we need to finish more (of this) review. --- class: inverse, middle # Multiple regression --- layout: true # Multiple regression --- ## More explanatory variables We're moving from **simple linear regression** (one .pink[outcome variable] and one .purple[explanatory variable]) $$ \color{#e64173}{y_i} = \beta_0 + \beta_1 \color{#6A5ACD}{x_i} + u_i $$ to the land of **multiple linear regression** (one .pink[outcome variable] and multiple .purple[explanatory variables]) $$ \color{#e64173}{y\_i} = \beta\_0 + \beta\_1 \color{#6A5ACD}{x\_{1i}} + \beta\_2 \color{#6A5ACD}{x\_{2i}} + \cdots + \beta\_k \color{#6A5ACD}{x\_{ki}} + u\_i $$ -- **Why?** -- We can better explain the variation in `\(y\)`, improve predictions, avoid omitted-variable bias, ... --- `\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i \quad\)` `\(x_1\)` is continuous `\(\quad x_2\)` is categorical <img src="slides_files/figure-html/mult reg plot 1-1.svg" style="display: block; margin: auto;" /> --- count: false The intercept and categorical variable `\(x_2\)` control for the groups' means. <img src="slides_files/figure-html/mult reg plot 2-1.svg" style="display: block; margin: auto;" /> --- count: false With groups' means removed: <img src="slides_files/figure-html/mult reg plot 3-1.svg" style="display: block; margin: auto;" /> --- count: false `\(\hat{\beta}_1\)` estimates the relationship between `\(y\)` and `\(x_1\)` after controlling for `\(x_2\)`. <img src="slides_files/figure-html/mult reg plot 4-1.svg" style="display: block; margin: auto;" /> --- count: false Another way to think about it: <img src="slides_files/figure-html/mult reg plot 5-1.svg" style="display: block; margin: auto;" />
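---

We can also run this two-regressor setup in .mono[R]. A minimal sketch with simulated data (the variable names and coefficient values below are invented for illustration):

``` r
# Simulate a continuous regressor, a binary regressor, and an outcome
set.seed(123)
n  <- 1e3
x1 <- rnorm(n)                               # continuous explanatory variable
x2 <- sample(0:1, size = n, replace = TRUE)  # categorical (binary) explanatory variable
y  <- 1 + 2 * x1 + 4 * x2 + rnorm(n)         # assumed 'true' coefficients: 1, 2, and 4
# Estimate the multiple linear regression of y on x1 and x2
coef(lm(y ~ x1 + x2))
```

The estimate on `x1` should land near 2: the relationship between `\(y\)` and `\(x_1\)` after accounting for the group means captured by `\(x_2\)`.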
--- Looking at our estimator can also help. For the simple linear regression `\(y_i = \beta_0 + \beta_1 x_i + u_i\)` $$ `\begin{aligned} \hat{\beta}_1 &= \dfrac{\sum_i \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)}{\sum_i \left( x_i -\overline{x} \right)^2} \\[0.3em] &= \dfrac{\sum_i \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)/(n-1)}{\sum_i \left( x_i -\overline{x} \right)^2 / (n-1)} \\[0.3em] &= \dfrac{\mathop{\hat{\text{Cov}}}(x,\,y)}{\mathop{\hat{\text{Var}}} \left( x \right)} \end{aligned}` $$ --- Simple linear regression estimator: $$ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(x,\,y)}{\mathop{\hat{\text{Var}}} \left( x \right)} $$ moving to multiple linear regression, the estimator changes slightly: $$ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(\color{#e64173}{\tilde{x}_1},\,y)}{\mathop{\hat{\text{Var}}} \left( \color{#e64173}{\tilde{x}_1} \right)} $$ where `\(\color{#e64173}{\tilde{x}_1}\)` is the *residualized* `\(x_1\)` variable—the variation remaining in `\(x_1\)` after controlling for the other explanatory variables. --- More formally, consider the multiple-regression model $$ y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u_i $$ Our residualized `\(x_{1}\)` (which we named `\(\color{#e64173}{\tilde{x}_1}\)`) comes from regressing `\(x_1\)` on an intercept and all of the other explanatory variables and collecting the residuals, _i.e._, $$ `\begin{aligned} \hat{x}_{1i} &= \hat{\gamma}_0 + \hat{\gamma}_2 \, x_{2i} + \hat{\gamma}_3 \, x_{3i} \\ \color{#e64173}{\tilde{x}_{1i}} &= x_{1i} - \hat{x}_{1i} \end{aligned}` $$ -- allowing us to better understand our OLS multiple-regression estimator $$ \hat{\beta}_1 = \dfrac{\mathop{\hat{\text{Cov}}}(\color{#e64173}{\tilde{x}_1},\,y)}{\mathop{\hat{\text{Var}}} \left( \color{#e64173}{\tilde{x}_1} \right)} $$ --- layout: false class: clear, middle <img src="slides_files/figure-html/venn_iv-1.svg" style="display: block; margin: auto;" /> --- layout: true # Multiple regression --- ## Model fit Measures of *goodness of fit* try to analyze how well our model describes (*fits*) the data. **Common measure:** `\(R^2\)` [R-squared] (*a.k.a.* coefficient of determination) $$ R^2 = \dfrac{\sum_i (\hat{y}_i - \overline{y})^2}{\sum_i \left( y_i - \overline{y} \right)^2} = 1 - \dfrac{\sum_i \left( y_i - \hat{y}_i \right)^2}{\sum_i \left( y_i - \overline{y} \right)^2} $$ Notice our old friend SSE: `\(\sum_i \left( y_i - \hat{y}_i \right)^2 = \sum_i e_i^2\)`. -- `\(R^2\)` literally tells us the share of the variance in `\(y\)` that our current model accounts for. Thus `\(0 \leq R^2 \leq 1\)`. --- **The problem:** As we add variables to our model, `\(R^2\)` *mechanically* increases. -- To see this problem, we can simulate a dataset of 10,000 observations on `\(y\)` and 1,000 random `\(x_k\)` variables. **No relationship between `\(y\)` and the `\(x_k\)`!** Pseudo-code outline of the simulation: -- .pseudocode-small[ - Generate 10,000 observations on `\(y\)` - Generate 10,000 observations on variables `\(x_1\)` through `\(x_{1000}\)` - Regressions - LM<sub>1</sub>: Regress `\(y\)` on `\(x_1\)`; record R<sup>2</sup> - LM<sub>2</sub>: Regress `\(y\)` on `\(x_1\)` and `\(x_2\)`; record R<sup>2</sup> - LM<sub>3</sub>: Regress `\(y\)` on `\(x_1\)`, `\(x_2\)`, and `\(x_3\)`; record R<sup>2</sup> - ... - LM<sub>1000</sub>: Regress `\(y\)` on `\(x_1\)`, `\(x_2\)`, ..., `\(x_{1000}\)`; record R<sup>2</sup> ] --- **The problem:** As we add variables to our model, `\(R^2\)` *mechanically* increases.
.mono[R] code for the simulation:

``` r
# Packages: magrittr (%<>%, %$%), dplyr (%>%, bind_rows), parallel (mclapply, detectCores)
library(magrittr)
library(dplyr)
library(parallel)
# Set a seed for reproducibility
set.seed(1234)
# 10,000 draws of y and a 10,000-by-1,000 matrix of unrelated x variables
y <- rnorm(1e4)
x <- matrix(data = rnorm(1e7), nrow = 1e4)
# Prepend a column of ones (an intercept) to the x matrix
x %<>% cbind(matrix(data = 1, nrow = 1e4, ncol = 1), .)
# Regress y on the intercept and the first i x variables, for i = 1, ..., 999;
# record R2 and adjusted R2 from each regression
r_df <- mclapply(X = 1:(1e3-1), mc.cores = detectCores() - 1, FUN = function(i) {
  tmp_reg <- lm(y ~ x[,1:(i+1)]) %>% summary()
  data.frame(
    k = i + 1,
    r2 = tmp_reg %$% r.squared,
    r2_adj = tmp_reg %$% adj.r.squared
  )
}) %>% bind_rows()
```
--- **The problem:** As we add variables to our model, `\(\color{#314f4f}{R^2}\)` *mechanically* increases. <img src="slides_files/figure-html/r2 plot-1.svg" style="display: block; margin: auto;" /> --- **One solution:** .pink[Adjusted] `\(\color{#e64173}{R^2}\)` <img src="slides_files/figure-html/adjusted r2 plot-1.svg" style="display: block; margin: auto;" /> --- **The problem:** As we add variables to our model, `\(R^2\)` *mechanically* increases. **One solution:** Penalize for the number of variables, _e.g._, adjusted `\(R^2\)`: $$ \overline{R}^2 = 1 - \dfrac{\sum_i \left( y_i - \hat{y}_i \right)^2/(n-k-1)}{\sum_i \left( y_i - \overline{y} \right)^2/(n-1)} $$ *Note:* Adjusted `\(R^2\)` need not be between 0 and 1. --- ## Tradeoffs There are tradeoffs to remember as we add/remove variables: **Fewer variables** - Generally explain less variation in `\(y\)` - Provide simple interpretations and visualizations (*parsimonious*) - May need to worry about omitted-variable bias **More variables** - More likely to find *spurious* relationships (statistically significant due to chance—does not reflect a true, population-level relationship) - More difficult to interpret the model - You may still miss important variables—still omitted-variable bias --- layout: true # Omitted-variable bias --- class: inverse, middle --- We'll go deeper into this issue in a few weeks, but as a refresher: **Omitted-variable bias** (OVB) arises when we omit a variable that 1. affects our outcome variable `\(y\)` 2. correlates with an explanatory variable `\(x_j\)` As its name suggests, this situation leads to bias in our estimate of `\(\beta_j\)`. -- **Note:** OVB is not exclusive to multiple linear regression, but it does require that multiple variables affect `\(y\)`. --- **Example** Let's imagine a simple model for the amount individual `\(i\)` gets paid $$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$ where - `\(\text{School}_i\)` gives `\(i\)`'s years of schooling - `\(\text{Male}_i\)` denotes an indicator variable for whether individual `\(i\)` is male. thus - `\(\beta_1\)`: the returns to an additional year of schooling (*ceteris paribus*) - `\(\beta_2\)`: the "premium" for being male (*ceteris paribus*) <br>If `\(\beta_2 > 0\)`, then there is discrimination against women—women receive less pay based upon gender. --- layout: true # Omitted-variable bias **Example, continued** --- From our population model $$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$ If a study focuses on the relationship between pay and schooling, _i.e._, $$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \left(\beta_2 \text{Male}_i + u_i\right) $$ $$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \varepsilon_i $$ where `\(\varepsilon_i = \beta_2 \text{Male}_i + u_i\)`. We used our exogeneity assumption to derive OLS' unbiasedness. But even if `\(\mathop{\boldsymbol{E}}\left[ u | X \right] = 0\)`, it is not true that `\(\mathop{\boldsymbol{E}}\left[ \varepsilon | X \right] = 0\)` so long as `\(\beta_2 \neq 0\)` and `\(\text{Male}\)` correlates with `\(\text{School}\)`.
--- From our population model $$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$ If a study focuses on the relationship between pay and schooling, _i.e._, $$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \left(\beta_2 \text{Male}_i + u_i\right) $$ $$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \varepsilon_i $$ where `\(\varepsilon_i = \beta_2 \text{Male}_i + u_i\)`. Specifically, exogeneity requires that `\(\text{School}\)` and `\(\text{Male}\)` are unrelated. <br>**Otherwise OLS is biased.** --- Let's try to see this result graphically. The population model: $$ \text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i $$ Our regression model that suffers from omitted-variable bias: $$ \text{Pay}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \text{School}_i + e_i $$ Finally, imagine that women, on average, receive more schooling than men. --- layout: true # Omitted-variable bias **Example, continued:** `\(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)` --- The relationship between pay and schooling. <img src="slides_files/figure-html/plot ovb 1-1.svg" style="display: block; margin: auto;" /> --- count: false Biased regression estimate: `\(\widehat{\text{Pay}}_i = 31.3 + -0.9 \times \text{School}_i\)` <img src="slides_files/figure-html/plot ovb 2-1.svg" style="display: block; margin: auto;" /> --- count: false Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**) <img src="slides_files/figure-html/plot ovb 3-1.svg" style="display: block; margin: auto;" /> --- count: false Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**) <img src="slides_files/figure-html/plot ovb 4-1.svg" style="display: block; margin: auto;" /> --- count: false Unbiased regression estimate: `\(\widehat{\text{Pay}}_i = 20.9 + 0.4 \times \text{School}_i + 9.1 \times \text{Male}_i\)` <img src="slides_files/figure-html/plot ovb 5-1.svg" style="display: block; margin: auto;" /> --- layout: false class: clear, middle, center <img src="slides_files/figure-html/venn2-1.svg" style="display: block; margin: auto;" /> --- layout: false # Omitted-variable bias ## Solutions 1. Don't omit variables 2. Instrumental variables and two-stage least squares<sup>†</sup> **Warning:** There are situations in which neither solution is possible. -- 1. Proceed with caution (sometimes you can sign the bias). 2. Maybe just stop. .footnote[ [†]: Coming soon to our lectures. 
] --- layout: true # Interpreting coefficients --- class: inverse, middle --- ## Continuous variables Consider the relationship $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + u_i $$ where - `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay - `\(\text{School}_i\)` is a continuous variable that measures years of education -- **Interpretations** - `\(\beta_0\)`: the `\(y\)`-intercept, _i.e._, `\(\text{Pay}\)` when `\(\text{School} = 0\)` - `\(\beta_1\)`: the expected increase in `\(\text{Pay}\)` for a one-unit increase in `\(\text{School}\)` --- ## Continuous variables Deriving the slope's interpretation: $$ `\begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \right] - \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell \right] &= \\[0.5em] \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + u \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + u \right] &= \\[0.5em] \left[ \beta_0 + \beta_1 (\ell + 1) \right] - \left[ \beta_0 + \beta_1 \ell \right] &= \\[0.5em] \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 &= \beta_1 \end{aligned}` $$ -- _I.e._, the slope gives the expected increase in our outcome variable for a one-unit increase in the explanatory variable. --- ## Continuous variables If we have multiple explanatory variables, _e.g._, $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Ability}_i + u_i $$ then the interpretation changes slightly. -- $$ `\begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \land \text{Ability} = \alpha \right] - & \\ \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{School} = \ell \land \text{Ability} = \alpha \right] &= \\ \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha + u \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha + u \right] &= \\ \left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha \right] - \left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha \right] &= \\ \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 + \beta_2 \alpha - \beta_2 \alpha &= \beta_1 \end{aligned}` $$ -- _I.e._, the slope gives the expected increase in our outcome variable for a one-unit increase in the explanatory variable, **holding all other variables constant** (*ceteris paribus*). 
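---

## Continuous variables

A quick numerical check of this interpretation, and of the residualized `\(\tilde{x}_1\)` result from the multiple-regression section, using simulated data (the variable names and coefficient values below are invented for illustration):

``` r
# Simulate data in which School and Ability both affect Pay and correlate with each other
set.seed(42)
n       <- 1e4
ability <- rnorm(n)
school  <- 12 + 2 * ability + rnorm(n)
pay     <- 20 + 3 * school + 5 * ability + rnorm(n)
# Multiple regression: the coefficient on school holds ability constant
coef(lm(pay ~ school + ability))["school"]
# The same number via the residualized regressor
school_tilde <- resid(lm(school ~ ability))
coef(lm(pay ~ school_tilde))["school_tilde"]
```

Both estimates should sit near 3: controlling for ability in the regression is equivalent to first removing the part of schooling that ability explains.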
--- ## Continuous variables Alternative derivation Consider the model $$ y = \beta_0 + \beta_1 \, x + u $$ Differentiate the model: $$ \dfrac{dy}{dx} = \beta_1 $$ --- ## Categorical variables Consider the relationship $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{Female}_i + u_i $$ where - `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay - `\(\text{Female}_i\)` is a binary/indicator variable taking `\(1\)` when `\(i\)` is female -- **Interpretations** - `\(\beta_0\)`: the expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`) - `\(\beta_1\)`: the expected difference in `\(\text{Pay}\)` between females and males - `\(\beta_0 + \beta_1\)`: the expected `\(\text{Pay}\)` for females --- ## Categorical variables Derivations $$ `\begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{Male} \right] &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right] \\ &= \mathop{\boldsymbol{E}}\left[ \beta_0 + 0 + u_i \right] \\ &= \beta_0 \end{aligned}` $$ -- $$ `\begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay} | \text{Female} \right] &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right] \\ &= \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 + u_i \right] \\ &= \beta_0 + \beta_1 \end{aligned}` $$ -- **Note:** If there are no other variables to condition on, then `\(\hat{\beta}_1\)` equals the difference in group means of the outcome, _e.g._, `\(\overline{\text{Pay}}_\text{Female} - \overline{\text{Pay}}_\text{Male}\)`. -- **Note<sub>2</sub>:** The *holding all other variables constant* interpretation also applies for categorical variables in multiple regression settings. --- ## Categorical variables `\(y_i = \beta_0 + \beta_1 x_i + u_i\)` for binary variable `\(x_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)` <img src="slides_files/figure-html/dat plot 1-1.svg" style="display: block; margin: auto;" /> --- ## Categorical variables `\(y_i = \beta_0 + \beta_1 x_i + u_i\)` for binary variable `\(x_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)` <img src="slides_files/figure-html/dat plot 2-1.svg" style="display: block; margin: auto;" /> --- ## Interactions Interactions allow the effect of one variable to change based upon the level of another variable. **Examples** 1. Does the effect of schooling on pay change by gender? 1. Does the effect of gender on pay change by race? 1. Does the effect of schooling on pay change by experience?
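---

## Interactions

The first question above, for example, boils down to an interaction between schooling and a gender indicator. A minimal .mono[R] sketch with simulated data (the coefficient values below are invented; the next slides formalize this model):

``` r
# Simulate data where the return to schooling differs by gender
set.seed(321)
n      <- 1e4
female <- sample(0:1, size = n, replace = TRUE)
school <- rnorm(n, mean = 13, sd = 2)
pay    <- 20 + 1 * school + 5 * female + 0.5 * school * female + rnorm(n)
# In a formula, school * female expands to school + female + school:female
coef(lm(pay ~ school * female))
```

The coefficient on `school:female` estimates how the return to schooling differs for women relative to men (around 0.5 here).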
--- ## Interactions Previously, we considered a model that allowed women and men to have different wages, but the model assumed the effect of school on pay was the same for everyone: $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + u_i $$ but we can also allow the effect of school to vary by gender: $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + \beta_3 \, \text{School}_i\times\text{Female}_i + u_i $$ --- ## Interactions The model where schooling has the same effect for everyone (**<font color="#e64173">F</font>** and **<font color="#314f4f">M</font>**): <img src="slides_files/figure-html/int plot 1-1.svg" style="display: block; margin: auto;" /> --- ## Interactions The model where schooling's effect can differ by gender (**<font color="#e64173">F</font>** and **<font color="#314f4f">M</font>**): <img src="slides_files/figure-html/int plot 2-1.svg" style="display: block; margin: auto;" /> --- ## Interactions Interpreting coefficients can be a little tricky with interactions, but the key<sup>.pink[†]</sup> is to carefully work through the math. .footnote[.pink[†] As is often the case with econometrics.] $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Female}_i + \beta_3 \, \text{School}_i\times\text{Female}_i + u_i $$ Expected returns for an additional year of schooling for women: $$ `\begin{aligned} \mathop{\boldsymbol{E}}\left[ \text{Pay}_i | \text{Female} \land \text{School} = \ell + 1 \right] - \mathop{\boldsymbol{E}}\left[ \text{Pay}_i | \text{Female} \land \text{School} = \ell \right] &= \\ \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 (\ell+1) + \beta_2 + \beta_3 (\ell + 1) + u_i \right] - \mathop{\boldsymbol{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 + \beta_3 \ell + u_i \right] &= \\ \beta_1 + \beta_3 \end{aligned}` $$ -- Similarly, `\(\beta_1\)` gives the expected return to an additional year of schooling for men. Thus, `\(\beta_3\)` gives the **difference in the returns to schooling** for women and men. --- ## Interactions The previous slides focused on interactions where one variable was **binary**. If both variables are continuous, then the interpretation is slightly trickier. *Remember:* Interactions simply mean the effect of one variable depends on the level of another variable. --- ## Interactions Suppose we're interested in the model $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Experience}_i + \beta_3 \, \text{School}_i\times\text{Experience}_i + u_i $$ where `\(\text{School}_i\)` and `\(\text{Experience}_i\)` are both continuous variables (in years). -- How do we interpret the interaction here? -- School's effect on pay now depends on the level of experience. **Interpretation** Consider the partial derivative: $$ \dfrac{\partial\text{Pay}_i}{\partial\text{School}_i} = \beta_1 + \beta_3 \text{Experience}_i $$ --- ## Interactions In the model $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{Experience}_i + \beta_3 \, \text{School}_i\times\text{Experience}_i + u_i $$ all else equal, an additional year of school changes pay by $$ \beta_1 + \beta_3 \text{Experience} $$ --- ## Polynomials Polynomials are just interactions: they interact a variable with itself. $$ \text{Pay}_i = \beta_0 + \beta_1 \, \text{School}_i + \beta_2 \, \text{School}_i^2 + u_i $$ Here the effect of schooling depends on an individual's level of schooling. 
**Interpretation** Back to the partial derivative: $$ \dfrac{\partial\text{Pay}_i}{\partial\text{School}_i} = \beta_1 + 2 \beta_2 \text{School}_i $$ all else equal, an additional year of school changes pay by $$ \beta_1 + 2 \beta_2 \text{School}_i $$ --- ## Binary outcomes When your **outcome variable** is binary, the interpretation changes slightly. *Recall:* The avg. of a binary variable gives the % of observations with a '1'. *Example:* Avg(0, 0, 0, 1, 1) = 0.40 `\(\implies\)` 40% of observations = 1. -- If your outcome is binary, then you are modeling the probability (percent) that the outcome equals one. $$ \text{Employed}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$ **Interpretation** `\(\beta_1\)` is the effect of one additional year of schooling on the probability an individual is employed (all else equal). --- ## Log-linear specification In economics, you will frequently see logged outcome variables with linear (non-logged) explanatory variables, _e.g._, $$ \log(\text{Pay}_i) = \beta_0 + \beta_1 \, \text{School}_i + u_i $$ This specification changes our interpretation of the slope coefficients. **Interpretation** - A one-unit increase in our explanatory variable increases the outcome variable by approximately `\(\beta_1\times 100\)` percent. - *Example:* An additional year of schooling increases pay by approximately 3 percent (for `\(\beta_1 = 0.03\)`). --- ## Log-linear specification **Derivation** Consider the log-linear model $$ \log(y) = \beta_0 + \beta_1 \, x + u $$ and differentiate $$ \dfrac{dy}{y} = \beta_1 dx $$ So a marginal change in `\(x\)` (_i.e._, `\(dx\)`) leads to a `\(\beta_1 dx\)` **percentage change** in `\(y\)`. --- ## Log-linear specification Because the log-linear specification comes with a different interpretation, you need to make sure it fits your data-generating process/model. Does `\(x\)` change `\(y\)` in levels (_e.g._, a 3-unit increase) or percentages (_e.g._, a 10-percent increase)? -- _I.e._, you need to be sure an exponential relationship makes sense: $$ \log(y_i) = \beta_0 + \beta_1 \, x_i + u_i \iff y_i = e^{\beta_0 + \beta_1 x_i + u_i} $$ --- ## Log-linear specification <img src="slides_files/figure-html/log linear plot-1.svg" style="display: block; margin: auto;" /> --- ## Log-log specification Similarly, econometricians frequently employ log-log models, in which the outcome variable is logged *and* at least one explanatory variable is logged $$ \log(\text{Pay}_i) = \beta_0 + \beta_1 \, \log(\text{School}_i) + u_i $$ **Interpretation:** - A one-percent increase in `\(x\)` will lead to a `\(\beta_1\)` percent change in `\(y\)`. - Often interpreted as an elasticity. --- ## Log-log specification **Derivation** Consider the log-log model $$ \log(y) = \beta_0 + \beta_1 \, \log(x) + u $$ and differentiate $$ \dfrac{dy}{y} = \beta_1 \dfrac{dx}{x} $$ which says that for a one-percent increase in `\(x\)`, we will see a `\(\beta_1\)` percent increase in `\(y\)`. As an elasticity: $$ \dfrac{dy}{dx} \dfrac{x}{y} = \beta_1 $$ --- ## Log-linear with a binary variable **Note:** If you have a log-linear model with a binary indicator variable, the interpretation for the coefficient on that variable changes. Consider $$ \log(y_i) = \beta_0 + \beta_1 x_1 + u_i $$ for binary variable `\(x_1\)`. The interpretation of `\(\beta_1\)` is now - When `\(x_1\)` changes from 0 to 1, `\(y\)` will change by `\(100 \times \left( e^{\beta_1} -1 \right)\)` percent. - When `\(x_1\)` changes from 1 to 0, `\(y\)` will change by `\(100 \times \left( e^{-\beta_1} -1 \right)\)` percent.
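---

## Log-linear with a binary variable

A small numerical illustration of this exact adjustment (the coefficient value of 0.25 is made up):

``` r
# Suppose a log-linear regression returns a coefficient of 0.25 on a binary variable
b1 <- 0.25
# Approximate interpretation: roughly 25 percent
100 * b1
# Exact percent change in y when x1 moves from 0 to 1
100 * (exp(b1) - 1)   # about 28.4 percent
# Exact percent change in y when x1 moves from 1 to 0
100 * (exp(-b1) - 1)  # about -22.1 percent
```

The approximation `\(100 \times \beta_1\)` works well for small coefficients; the gap grows as `\(|\beta_1|\)` gets larger.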
--- ## Log-log specification <img src="slides_files/figure-html/log log plot-1.svg" style="display: block; margin: auto;" /> --- layout: true # Additional topics --- class: inverse, middle --- ## Inference *vs.* prediction So far, we've focused mainly on **statistical inference**—using estimators and their distributional properties to try to learn about underlying, unknown population parameters. $$ y\_i = \color{#e64173}{\hat{\beta}\_{0}} + \color{#e64173}{\hat{\beta\_1}} \, x\_{1i} + \color{#e64173}{\hat{\beta\_2}} \, x\_{2i} + \cdots + \color{#e64173}{\hat{\beta}\_{k}} \, x\_{ki} + e\_i $$ -- **Prediction** includes a fairly different set of topics/tools within econometrics (and data science/machine learning)—creating models that accurately predict individual observations. $$ \color{#e64173}{\hat{y}\_i} = \mathop{\hat{f}}\left( x_1,\, x_2,\, \ldots x_k \right) $$ --- ## Inference *vs.* prediction Succinctly - **Inference:** causality, `\(\hat{\beta}_k\)` (consistent and efficient), standard errors/hypothesis tests for `\(\hat{\beta}_k\)`, generally OLS - **Prediction:**<sup>.pink[†]</sup> correlation, `\(\hat{y}_i\)` (low error), model selection, nonlinear models are much more common .footnote[ .pink[†] Includes forecasting. ] --- ## Treatment effects and causality Much of modern (micro)econometrics focuses on causally estimating (*identifying*) the effect of programs/policies, _e.g._, - [Government shutdowns](https://www.sciencedirect.com/science/article/pii/S004727271830118X) - [The minimum wage](https://www.jstor.org/stable/2118030) - [Recreational-cannabis legalization](https://pages.uoregon.edu/bchansen/working.html) - [Salary-history bans](http://www.drewmcnichols.com/research) - [Preschool](https://amstat.tandfonline.com/doi/abs/10.1198/016214508000000841#.XD4PVy2ZNO4) - [The Clean Water Act](https://academic.oup.com/qje/article-abstract/134/1/349/5092609) -- In this literature, the program is often a binary variable, and we place high importance on finding an unbiased estimate for the program's effect, `\(\hat{\tau}\)`. $$ \text{Outcome}_i = \beta_0 + \tau \, \text{Program}_i + u_i $$ --- ## Transformations Our linearity assumption requires 1. **parameters enter linearly** (_i.e._, the `\(\beta_k\)` are multiplied by variables) 2. the `\(u_i\)` **disturbances enter additively** We allow nonlinear relationships between `\(y\)` and the explanatory variables. -- **Examples** - **Polynomials** and **interactions:** `\(y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_2^2 + \beta_5 \left( x_1 x_2 \right) + u_i\)` - **Exponentials** and **logs:** `\(\log(y_i) = \beta_0 + \beta_1 x_1 + \beta_2 e^{x_2} + u_i\)` - **Indicators** and **thresholds:** `\(y_i = \beta_0 + \beta_1 x_1 + \beta_2 \, \mathbb{I}(x_1 \geq 100) + u_i\)`
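---

## Transformations

In .mono[R]'s formula syntax, these transformations are straightforward to write down. A sketch, assuming a data frame named `my_df` that contains `y`, `x1`, and `x2` (all placeholder names):

``` r
# Polynomials and an interaction: I() tells R to treat ^ as arithmetic, not formula syntax
lm(y ~ x1 + I(x1^2) + x2 + I(x2^2) + x1:x2, data = my_df)
# Logged outcome with an exponential term
lm(log(y) ~ x1 + exp(x2), data = my_df)
# An indicator/threshold term
lm(y ~ x1 + I(x1 >= 100), data = my_df)
```

The transformed variables enter the model, but the `\(\beta_k\)` still enter linearly, so OLS applies as usual.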
<img src="slides_files/figure-html/trans figure start-1.svg" style="display: block; margin: auto;" /> --- `\(y_i = \beta_0 + u_i\)` <img src="slides_files/figure-html/trans figure 0-1.svg" style="display: block; margin: auto;" /> --- count: false `\(y_i = \beta_0 + \beta_1 x + u_i\)` <img src="slides_files/figure-html/trans figure 1-1.svg" style="display: block; margin: auto;" /> --- count: false `\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + u_i\)` <img src="slides_files/figure-html/trans figure 2-1.svg" style="display: block; margin: auto;" /> --- count: false `\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + u_i\)` <img src="slides_files/figure-html/trans figure 3-1.svg" style="display: block; margin: auto;" /> --- count: false `\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + u_i\)` <img src="slides_files/figure-html/trans figure 4-1.svg" style="display: block; margin: auto;" /> --- count: false `\(y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + u_i\)` <img src="slides_files/figure-html/trans figure 5-1.svg" style="display: block; margin: auto;" /> --- count: false **Truth:** `\(y_i = 2 e^{x} + u_i\)` <img src="slides_files/figure-html/trans figure 6-1.svg" style="display: block; margin: auto;" /> --- ## Outliers Because OLS minimizes the sum of the **squared** errors, outliers can play a large role in our estimates. **Common responses** - Remove the outliers from the dataset - Replace outliers with the 99<sup>th</sup> percentile of their variable (*Windsorize*) - Take the log of the variable to "take care of" outliers - Do nothing. Outliers are not always bad. Some people are "far" from the average. It may not make sense to try to change this variation. --- ## Missing data Similarly, missing data can affect your results. .mono[R] doesn't know how to deal with a missing observation. ``` r 1 + 2 + 3 + NA + 5 ``` ``` #> [1] NA ``` If you run a regression<sup>†</sup> with missing values, .mono[R] drops the observations missing those values. If the observations are missing in a nonrandom way, a random sample may end up nonrandom. .footnote[ [†]: Or perform almost any operation/function ] --- exclude: true