class: center, middle, inverse, title-slide

# Categorical Variables
## EC 320: Introduction to Econometrics
### Winter 2022

---
class: inverse, middle

# Prologue

---

# Housekeeping

1. .pink[**Problem Set 3**] grade posted
2. .pink[**Midterm**] grade to be posted by Wednesday
3. .pink[**Problem Set 4**] to be posted by tomorrow, stay tuned
  - Due next Monday
4. .pink[**Lab**]
  - **Lab** held on Wednesday
  - **Lab material** available on Github, **Ex7** available on Canvas
  - **Ex7** due Wednesday

---
class: inverse, middle

# Categorical Variables

---

# Categorical Variables

**Goal:** Make quantitative statements about .pink[qualitative information].

- *e.g.,* race, gender, being employed, living in Oregon, *etc.*

--

**Approach:** Construct .pink[binary variables].

- _a.k.a._ .pink[dummy variables] or .pink[indicator variables].
- Value equals 1 if the observation is in the category and 0 otherwise.

--

**Regression implications**

1. Binary variables change the interpretation of the intercept.
2. Coefficients on binary variables have different interpretations than those on continuous variables.
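In .mono[R], an indicator can be constructed directly from the qualitative variable. A minimal sketch, assuming a hypothetical data frame with a character column `state`:

```r
# hypothetical data: four individuals and the state they live in
df <- data.frame(state = c("Oregon", "Washington", "Oregon", "California"))

# indicator variable: 1 if the observation lives in Oregon, 0 otherwise
df$oregon <- as.numeric(df$state == "Oregon")

df$oregon
#> [1] 1 0 1 0
```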
---

# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

where

- `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay
- `\(\text{School}_i\)` is a continuous variable that measures years of education

--

**Interpretation**

- `\(\beta_0\)`: `\(y\)`-intercept, _i.e._, `\(\text{Pay}\)` when `\(\text{School} = 0\)`
- `\(\beta_1\)`: expected increase in `\(\text{Pay}\)` for a one-unit increase in `\(\text{School}\)`

---

# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

**Derive the slope's interpretation:**

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + u \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 \ell + u \right]\)`

--

<br> `\(\quad = \left[ \beta_0 + \beta_1 (\ell + 1) \right] - \left[ \beta_0 + \beta_1 \ell \right]\)`

--

<br> `\(\quad = \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1\)`

--

<br> `\(\quad = \beta_1\)`.

--

The slope gives the expected increase in pay for an additional year of schooling.

---

# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

**Alternative derivation**

Differentiate the model with respect to schooling:

$$ \dfrac{d\text{Pay}}{d\text{School}} = \beta_1 $$

The slope gives the expected increase in pay for an additional year of schooling.

---

# Continuous Variables

If we have multiple explanatory variables, _e.g._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Ability}_i + u_i $$

then the interpretation changes slightly.
--

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \land \text{Ability} = \alpha \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell \land \text{Ability} = \alpha \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha + u \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha + u \right]\)`

--

<br> `\(\quad = \left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha \right] - \left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha \right]\)`

--

<br> `\(\quad = \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 + \beta_2 \alpha - \beta_2 \alpha\)`

--

<br> `\(\quad = \beta_1\)`

--

The slope gives the expected increase in pay for an additional year of schooling, **holding ability constant**.

---

# Continuous Variables

If we have multiple explanatory variables, _e.g._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Ability}_i + u_i $$

then the interpretation changes slightly.

--

**Alternative derivation**

Differentiate the model with respect to schooling:

$$ \dfrac{\partial\text{Pay}}{\partial\text{School}} = \beta_1 $$

The slope gives the expected increase in pay for an additional year of schooling, **holding ability constant**.

---

# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.
**Interpretation**

`\(\beta_0\)` is the expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`):

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Male} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + 0 + u_i \right]\)`

--

<br> `\(\quad = \beta_0\)`

---

# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.

**Interpretation**

`\(\beta_1\)` is the expected difference in `\(\text{Pay}\)` between females and males:

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Female} \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{Male} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 + u_i \right] - \mathop{\mathbb{E}}\left[ \beta_0 + 0 + u_i \right]\)`

--

<br> `\(\quad = \beta_0 + \beta_1 - \beta_0\)`

--

<br> `\(\quad = \beta_1\)`

---

# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.
**Interpretation**

`\(\beta_0 + \beta_1\)` is the expected `\(\text{Pay}\)` for females:

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Female} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 + u_i \right]\)`

--

<br> `\(\quad = \beta_0 + \beta_1\)`

---

# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

**Interpretation**

- `\(\beta_0\)`: expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`)
- `\(\beta_1\)`: expected difference in `\(\text{Pay}\)` between females and males
- `\(\beta_0 + \beta_1\)`: expected `\(\text{Pay}\)` for females
- Males are the **reference group**

--

**Note:** If there are no other variables to condition on, then `\(\hat{\beta}_1\)` equals the difference in group means of the outcome, _i.e._, `\(\bar{Y}_\text{Female} - \bar{Y}_\text{Male}\)`.

--

**Note<sub>2</sub>:** The *holding all other variables constant* interpretation also applies to categorical variables in multiple regression settings.
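The first note, that `\(\hat{\beta}_1\)` equals the difference in group means, is easy to check in .mono[R]. A quick sketch with simulated data (the variable names and numbers below are made up for illustration):

```r
set.seed(42)

# simulated data: 100 males (female = 0) and 100 females (female = 1)
female <- rep(c(0, 1), each = 100)
pay    <- 600 - 150 * female + rnorm(200, sd = 50)

# slope from the simple regression of pay on the indicator
b1 <- unname(coef(lm(pay ~ female))["female"])

# difference in group means of the outcome
diff_means <- mean(pay[female == 1]) - mean(pay[female == 0])

all.equal(b1, diff_means)
#> [1] TRUE
```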
---

# Categorical Variables

`\(Y_i = \beta_0 + \beta_1 X_i + u_i\)` for binary variable `\(X_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)`

<img src="14-Categorical_Variables_files/figure-html/dat plot 1-1.svg" style="display: block; margin: auto;" />

---

# Categorical Variables

`\(Y_i = \beta_0 + \beta_1 X_i + u_i\)` for binary variable `\(X_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)`

<img src="14-Categorical_Variables_files/figure-html/dat plot 2-1.svg" style="display: block; margin: auto;" />

---

# Multiple Regression

`\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \quad\)` `\(X_1\)` is continuous `\(\quad X_2\)` is categorical

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 1-1.svg" style="display: block; margin: auto;" />

---
count: false

# Multiple Regression

The intercept and categorical variable `\(X_2\)` control for the groups' means.

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 2-1.svg" style="display: block; margin: auto;" />

---
count: false

# Multiple Regression

With groups' means removed:

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 3-1.svg" style="display: block; margin: auto;" />

---
count: false

# Multiple Regression

`\(\hat{\beta}_1\)` estimates the relationship between `\(Y\)` and `\(X_1\)` after controlling for `\(X_2\)`.

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 4-1.svg" style="display: block; margin: auto;" />

---
count: false

# Multiple Regression

Another way to think about it:

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 5-1.svg" style="display: block; margin: auto;" />

---
class: white-slide

**Question:** Why not estimate `\(\text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Male}_i + u_i\)`?

--

**Answer:** The intercept is a perfect linear combination of `\(\text{Male}_i\)` and `\(\text{Female}_i\)`.

- Violates the .pink[no perfect collinearity] assumption.
- OLS can't estimate all three parameters simultaneously.
- Known as the .hi[dummy variable trap].

**Practical solution:** Select a reference category and drop its indicator.

---

# Dummy Variable _Trap?_

Don't worry, .mono[R] will bail you out if you include perfectly collinear indicators.

**Example**

```r
lm(wage ~ black + nonblack, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)     617.      5.27     117.  0
#> 2 black          -168.     10.9      -15.4 7.78e-52
#> 3 nonblack          NA      NA        NA   NA
```

--

Thanks, .mono[R].

---

# Omitted Variable Bias

**Omitted variable bias** (OVB) arises when we omit a variable that

1. Affects the outcome variable `\(Y\)`
2. Correlates with an explanatory variable `\(X_j\)`

OVB biases the OLS estimator of `\(\beta_j\)`.

---

# Omitted Variable Bias

**Example**

Let's imagine a simple population model for the amount individual `\(i\)` gets paid

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$

where `\(\text{School}_i\)` gives `\(i\)`'s years of schooling and `\(\text{Male}_i\)` denotes an indicator variable for whether individual `\(i\)` is male.

**Interpretation**

- `\(\beta_1\)`: returns to an additional year of schooling (*ceteris paribus*)
- `\(\beta_2\)`: premium for being male (*ceteris paribus*)

--

<br> If `\(\beta_2 > 0\)`, then there is discrimination against women.

---

# Omitted Variable Bias

**Example, continued**

From the population model

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$

an analyst focuses on the relationship between pay and schooling, _i.e._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \left(\beta_2 \text{Male}_i + u_i\right) $$

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \varepsilon_i $$

where `\(\varepsilon_i = \beta_2 \text{Male}_i + u_i\)`.

--

We assumed exogeneity to show that OLS is unbiased.
But even if `\(\mathop{\mathbb{E}}\left[ u | X \right] = 0\)`, it is not necessarily true that `\(\mathop{\mathbb{E}}\left[ \varepsilon | X \right] = 0\)` (false if `\(\beta_2 \neq 0\)`).

--

Specifically, `\(\mathop{\mathbb{E}}\left[ \varepsilon | \text{Male} = 1 \right] = \beta_2 + \mathop{\mathbb{E}}\left[ u | \text{Male} = 1 \right] \neq 0\)`.

--

**Now OLS is biased.**

---

# Omitted Variable Bias

Let's try to see this result graphically.

The true population model:

$$ \text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i $$

The regression model that suffers from omitted-variable bias:

$$ \text{Pay}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \text{School}_i + e_i $$

Finally, imagine that women, on average, receive more schooling than men.

---

# Omitted Variable Bias

True model: `\(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)`

<img src="14-Categorical_Variables_files/figure-html/plot ovb 1-1.svg" style="display: block; margin: auto;" />

---
count: false

# Omitted Variable Bias

Biased regression: `\(\widehat{\text{Pay}}_i = 31.3 - 0.9 \times \text{School}_i\)`

<img src="14-Categorical_Variables_files/figure-html/plot ovb 2-1.svg" style="display: block; margin: auto;" />

---
count: false

# Omitted Variable Bias

Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**)

<img src="14-Categorical_Variables_files/figure-html/plot ovb 3-1.svg" style="display: block; margin: auto;" />

---
count: false

# Omitted Variable Bias

Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**)

<img src="14-Categorical_Variables_files/figure-html/plot ovb 4-1.svg" style="display: block; margin: auto;" />

---
count: false

# Omitted Variable Bias

Unbiased regression: `\(\widehat{\text{Pay}}_i = 20.9 + 0.4 \times \text{School}_i + 9.1 \times \text{Male}_i\)`

<img
src="14-Categorical_Variables_files/figure-html/plot ovb 5-1.svg" style="display: block; margin: auto;" />

---

# Categorical Variables
## Example: Weekly Wages

```r
lm(wage ~ south, data = wage_data) %>% tidy()
```

```
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)     632.      6.00     105.  0
#> 2 south          -137.      9.45     -14.5 6.21e-46
```

--

**Q.sub[1]:** What is the reference category?

**Q.sub[2]:** Interpret the coefficients.

**Q.sub[3]:** Suppose you ran `lm(wage ~ nonsouth, data = wage_data)` instead. What is the coefficient estimate on `nonsouth`? What is the intercept estimate?

---

# Categorical Variables
## Example: Weekly Wages

```r
lm(wage ~ south + black, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    647.       6.02     107.  0
#> 2 south          -98.6      9.84     -10.0 2.89e-23
#> 3 black         -129.      11.4      -11.3 3.43e-29
```

--

**Q.sub[1]:** What is the reference category?

**Q.sub[2]:** Interpret the coefficients.

**Q.sub[3]:** Suppose you ran `lm(wage ~ south + nonblack, data = wage_data)` instead. What is the coefficient estimate on `nonblack`? What is the coefficient estimate on `south`? What is the intercept estimate?

---

# Categorical Variables
## Example: Weekly Wages

**Answer to Q.sub[3]:**

```r
lm(wage ~ south + nonblack, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    518.      11.7       44.3 0
#> 2 south          -98.6      9.84     -10.0 2.89e-23
#> 3 nonblack       129.      11.4       11.3 3.43e-29
```
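
---

# Omitted Variable Bias
## Simulation sketch

The biased and unbiased OVB regressions from earlier slides can be reproduced in .mono[R]. A minimal sketch, simulating the true model with women receiving more schooling on average (the data here are freshly simulated, so the estimates will differ from the numbers plotted on the slides):

```r
set.seed(1)
n <- 1000

# true model: Pay = 20 + 0.5 School + 10 Male + u,
# where women receive about two more years of schooling on average
male   <- rep(c(0, 1), each = n / 2)
school <- 12 + 2 * (1 - male) + rnorm(n)
pay    <- 20 + 0.5 * school + 10 * male + rnorm(n)

# biased regression: omits the gender indicator
b_short <- coef(lm(pay ~ school))["school"]

# unbiased regression: controls for gender
b_long  <- coef(lm(pay ~ school + male))["school"]

c(biased = unname(b_short), unbiased = unname(b_long))
```

The short regression's slope lands well below the true 0.5, because `school` is negatively correlated with the omitted, positively paid `male` indicator.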