class: center, middle, inverse, title-slide

# Categorical Variables
## EC 320: Introduction to Econometrics
### Philip Economides
### Winter 2022

---
class: inverse, middle

# Prologue

---
class: inverse, middle

# Categorical Variables

---
# Categorical Variables

**Goal:** Make quantitative statements about .pink[qualitative information].

- *e.g.,* race, gender, being employed, living in Oregon, *etc.*

--

**Approach:** Construct .pink[binary variables].

- _a.k.a._ .pink[dummy variables] or .pink[indicator variables].
- Equal to 1 if the observation is in the category and 0 otherwise.

--

**Regression implications**

1. Binary variables change the interpretation of the intercept.
2. Coefficients on binary variables have different interpretations than those on continuous variables.

---
# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

where

- `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay
- `\(\text{School}_i\)` is a continuous variable that measures years of education

--

**Interpretation**

- `\(\beta_0\)`: `\(y\)`-intercept, _i.e._, `\(\text{Pay}\)` when `\(\text{School} = 0\)`
- `\(\beta_1\)`: expected increase in `\(\text{Pay}\)` for a one-unit increase in `\(\text{School}\)`

---
# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

**Derive the slope's interpretation:**

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + u \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 \ell + u \right]\)`

--

<br> `\(\quad = \left[ \beta_0 + \beta_1 (\ell + 1) \right] - \left[ \beta_0 + \beta_1 \ell \right]\)`

--

<br> `\(\quad = \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1\)`

--

<br> `\(\quad = \beta_1\)`.
--

The slope gives the expected increase in pay for an additional year of schooling.

---
# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

**Alternative derivation**

Differentiate the model with respect to schooling:

$$ \dfrac{d\text{Pay}}{d\text{School}} = \beta_1 $$

The slope gives the expected increase in pay for an additional year of schooling.

---
# Continuous Variables

If we have multiple explanatory variables, _e.g._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Ability}_i + u_i $$

then the interpretation changes slightly.

--

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \land \text{Ability} = \alpha \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell \land \text{Ability} = \alpha \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha + u \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha + u \right]\)`

--

<br> `\(\quad = \left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha \right] - \left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha \right]\)`

--

<br> `\(\quad = \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 + \beta_2 \alpha - \beta_2 \alpha\)`

--

<br> `\(\quad = \beta_1\)`

--

The slope gives the expected increase in pay for an additional year of schooling, **holding ability constant**.

---
# Continuous Variables

If we have multiple explanatory variables, _e.g._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Ability}_i + u_i $$

then the interpretation changes slightly.

--

**Alternative derivation**

Differentiate the model with respect to schooling:

$$ \dfrac{\partial\text{Pay}}{\partial\text{School}} = \beta_1 $$

The slope gives the expected increase in pay for an additional year of schooling, **holding ability constant**.
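---
# Continuous Variables

The partial-effect interpretation can be checked numerically. A minimal sketch with hypothetical, noiseless data (the objects `school`, `ability`, and `pay` are made up for illustration, so OLS recovers the coefficients exactly):

```r
# Noiseless data generated from pay = 20 + 0.5*school + 2*ability
school  <- c(10, 11, 12, 13, 14, 15)
ability <- c( 1,  3,  2,  5,  4,  6)
pay     <- 20 + 0.5 * school + 2 * ability

fit <- lm(pay ~ school + ability)
coef(fit)[["school"]]  # 0.5: pay increase per extra year of school, ability held fixed
```

Because the fitted slope on `school` matches the 0.5 used to build the data, it is exactly the *ceteris paribus* effect derived above.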
---
# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.

**Interpretation**

`\(\beta_0\)` is the expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`):

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Male} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + 0 + u_i \right]\)`

--

<br> `\(\quad = \beta_0\)`

---
# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.

**Interpretation**

`\(\beta_1\)` is the expected difference in `\(\text{Pay}\)` between females and males:

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Female} \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{Male} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 + u_i \right] - \mathop{\mathbb{E}}\left[ \beta_0 + 0 + u_i \right]\)`

--

<br> `\(\quad = \beta_0 + \beta_1 - \beta_0\)`

--

<br> `\(\quad = \beta_1\)`

---
# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.
**Interpretation**

`\(\beta_0 + \beta_1\)` is the expected `\(\text{Pay}\)` for females:

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Female} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 + u_i \right]\)`

--

<br> `\(\quad = \beta_0 + \beta_1\)`

---
# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

**Interpretation**

- `\(\beta_0\)`: expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`)
- `\(\beta_1\)`: expected difference in `\(\text{Pay}\)` between females and males
- `\(\beta_0 + \beta_1\)`: expected `\(\text{Pay}\)` for females
- Males are the **reference group**

--

**Note:** If there are no other variables to condition on, then `\(\hat{\beta}_1\)` equals the difference in group means, _i.e._, `\(\overline{\text{Pay}}_\text{Female} - \overline{\text{Pay}}_\text{Male}\)`.

--

**Note<sub>2</sub>:** The *holding all other variables constant* interpretation also applies for categorical variables in multiple regression settings.
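---
# Categorical Variables

The group-means claim is easy to verify in .mono[R]. A small sketch with made-up numbers (not the course's `wage_data`):

```r
# First three observations are male (female = 0), last three are female
pay    <- c(10, 12, 14, 20, 22, 24)
female <- c(0, 0, 0, 1, 1, 1)

fit <- lm(pay ~ female)
coef(fit)[["(Intercept)"]]  # 12 = mean pay for males, the reference group
coef(fit)[["female"]]       # 10 = mean pay for females (22) minus mean for males (12)
```

With a single binary regressor, OLS simply reproduces the two group means: the intercept is the reference-group mean and the slope is the gap between groups.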
---
# Categorical Variables

`\(Y_i = \beta_0 + \beta_1 X_i + u_i\)` for binary variable `\(X_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)`

<img src="13-Categorical_Variables_files/figure-html/dat plot 1-1.svg" style="display: block; margin: auto;" />

---
# Categorical Variables

`\(Y_i = \beta_0 + \beta_1 X_i + u_i\)` for binary variable `\(X_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)`

<img src="13-Categorical_Variables_files/figure-html/dat plot 2-1.svg" style="display: block; margin: auto;" />

---
# Multiple Regression

`\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \quad\)` `\(X_1\)` is continuous `\(\quad X_2\)` is categorical

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 1-1.svg" style="display: block; margin: auto;" />

---
count: false
# Multiple Regression

The intercept and categorical variable `\(X_2\)` control for the groups' means.

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 2-1.svg" style="display: block; margin: auto;" />

---
count: false
# Multiple Regression

With groups' means removed:

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 3-1.svg" style="display: block; margin: auto;" />

---
count: false
# Multiple Regression

`\(\hat{\beta}_1\)` estimates the relationship between `\(Y\)` and `\(X_1\)` after controlling for `\(X_2\)`.

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 4-1.svg" style="display: block; margin: auto;" />

---
count: false
# Multiple Regression

Another way to think about it:

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 5-1.svg" style="display: block; margin: auto;" />

---
class: white-slide

**Question:** Why not estimate `\(\text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Male}_i + u_i\)`?

--

**Answer:** The intercept is a perfect linear combination of `\(\text{Male}_i\)` and `\(\text{Female}_i\)`.

- Violates .pink[no perfect collinearity] assumption.
- OLS can't estimate all three parameters simultaneously.
- This is known as the .hi[dummy variable trap].

**Practical solution:** Select a reference category and drop its indicator.

---
# Dummy Variable _Trap?_

Don't worry, .mono[R] will bail you out if you include perfectly collinear indicators.

**Example**

```r
lm(wage ~ black + nonblack, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)     617.      5.27     117.   0
#> 2 black          -168.     10.9      -15.4  7.78e-52
#> 3 nonblack         NA       NA        NA   NA
```

--

Thanks, .mono[R].

---
# Multiple Categories

So far we have only discussed **binary** categorical variables represented by dummies.

In many cases, there is a wide variety of categories by which we can characterize a set of observations.

**For example**

* Transport Modes: Rail, Highway, Air, Water
* Income Range: 1st quartile, 2nd quartile, 3rd quartile, 4th quartile
* Geographic Regions: Alabama, Idaho, Oregon, *etc.*

When addressing product diversification and trade, we can end up with an incredible number of categories to consider.
[Trade Statistics by Product (HS 6-digit)](wits.worldbank.org/trade/country-byhs6product/aspx?lang=en)

---
class: white-slide

.center[**Categorical Variable Types**]

<table>
<thead>
<tr>
<th style="text-align:left;"> Type of Variable </th>
<th style="text-align:left;"> Represents </th>
<th style="text-align:left;"> Examples </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;"> Binary Variables </td>
<td style="text-align:left;font-style: italic;color: black !important;"> Yes/no outcomes </td>
<td style="text-align:left;"> Heads/tails in a coin flip <br> Win/lose in a football game </td>
</tr>
<tr>
<td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;"> Nominal Variables </td>
<td style="text-align:left;font-style: italic;color: black !important;"> Groups with no rank or order between them </td>
<td style="text-align:left;"> Specific names <br> Colors <br> Brands </td>
</tr>
<tr>
<td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;"> Ordinal Variables </td>
<td style="text-align:left;font-style: italic;color: black !important;"> Groups that are ranked in a specific order </td>
<td style="text-align:left;"> Rankings in a competition <br> Rating-scale responses in a survey </td>
</tr>
</tbody>
</table>

---
# Beyond Binary

How do we deal with heaps of categories? .hi-pink[It depends].

--

Are these categories your .hi-pink[outcome variable]?

--

**Binary:** *Logistic Regression Model*, where we are determining the probability of an event, given individual characteristics of `\(i\)`.

--

**Ordinal:** *Cumulative/Ordered Logit Model* for categorical variables with an implied order and `\(J\)` choices.

--

**Nominal:** *Generalized Logit Model*, which holds characteristics fixed across choices, and *Multinomial/Conditional Logit Model*, which allows characteristics to differ across choices.
--

These items .hi-pink[will not be covered] in this class, nor will you be tested on their descriptions. This is guidance for those interested in reading further and understanding what future econometrics classes deliver.

---
# Beyond Binary

Are these categories part of an .hi-pink[explanatory variable]?

--

__Approach I:__ Apply a unique dummy variable for each category

--

For example, consider

`\(\text{earn}_i = \alpha + \beta_1 \text{HS}_i + \beta_2\text{UG}_i + \beta_3\text{MS}_i + \beta_4\text{PhD}_i + u_i\)`

--

In this case I may have a single categorical variable, `\(DEG_i\)`, that lists degree types of individual `\(i\)` across my sample.

```r
educ_df %<>%
  mutate(HS  = if_else(DEG == "Highschool", 1, 0),
         UG  = if_else(DEG == "Undergraduate", 1, 0),
         MS  = if_else(DEG == "Masters of Science", 1, 0),
         PhD = if_else(DEG == "Doctorate", 1, 0))

educ_reg <- lm(data = educ_df, earn ~ HS + UG + MS + PhD)
```

Individuals without any degree form the reference group: for each of them, `\(HS + UG + MS + PhD = 0\)`.

---
# Beyond Binary

What if there are __too many__ categories but I want to create individual dummies?
--

Jacob Kaplan (Princeton) created the `fastDummies` package, which provides a useful function `dummy_cols()` [LINK](https://jacobkap.github.io/fastDummies/)

--

Consider the following example

<table>
<thead>
<tr>
<th style="text-align:right;"> numbers </th>
<th style="text-align:left;"> gender </th>
<th style="text-align:left;"> animals </th>
<th style="text-align:left;"> dates </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 1 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2012-01-01 </td>
</tr>
<tr>
<td style="text-align:right;"> 2 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2011-12-31 </td>
</tr>
<tr>
<td style="text-align:right;"> 3 </td>
<td style="text-align:left;"> female </td>
<td style="text-align:left;"> cat </td>
<td style="text-align:left;"> 2012-01-01 </td>
</tr>
</tbody>
</table>

---
# Beyond Binary

```r
results <- fastDummies::dummy_cols(fastDummies_example)
knitr::kable(results) %>% kable_styling(font_size = 10)
```

<table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> numbers </th>
<th style="text-align:left;"> gender </th>
<th style="text-align:left;"> animals </th>
<th style="text-align:left;"> dates </th>
<th style="text-align:right;"> gender_female </th>
<th style="text-align:right;"> gender_male </th>
<th style="text-align:right;"> animals_cat </th>
<th style="text-align:right;"> animals_dog </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 1 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2012-01-01 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 2 </td>
<td
style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2011-12-31 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 3 </td>
<td style="text-align:left;"> female </td>
<td style="text-align:left;"> cat </td>
<td style="text-align:left;"> 2012-01-01 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
</tr>
</tbody>
</table>

--

```r
results <- fastDummies::dummy_cols(fastDummies_example,
                                   select_columns = c("animals", "gender"))
knitr::kable(results) %>% kable_styling(font_size = 10)
```

<table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> numbers </th>
<th style="text-align:left;"> gender </th>
<th style="text-align:left;"> animals </th>
<th style="text-align:left;"> dates </th>
<th style="text-align:right;"> animals_cat </th>
<th style="text-align:right;"> animals_dog </th>
<th style="text-align:right;"> gender_female </th>
<th style="text-align:right;"> gender_male </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 1 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2012-01-01 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 2 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2011-12-31 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;">
3 </td>
<td style="text-align:left;"> female </td>
<td style="text-align:left;"> cat </td>
<td style="text-align:left;"> 2012-01-01 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
</tr>
</tbody>
</table>

---
# Beyond Binary

Are these categories part of an .hi-pink[explanatory variable]?

--

__Approach II:__ Apply a fixed effect to your model

--

Consider the following model

`$$\text{earn}_{ij} = \alpha + \beta_1 \text{Age}_i + \beta_2\text{AgeSq}_i + \beta_3\text{Educ}_i + \beta_4\text{Female}_i + u_{ij}$$`

--

There may be .hi-pink[unobservable] aspects related to groups defined by `\(j\)` that are **fixed** across individuals in each group `\(j\in\{1,2,\dots,J\}\)`.

--

For example, if we were regressing the earnings of service staff across `\(J\)` countries, the USA may see unobserved `\(\text{tips}_{ij}\)` contributing more significantly towards income due to underlying cultural/professional norms.

--

In this case a **country fixed effect** in our regression would do wonders.

---
# Beyond Binary

Where `\(u_{ij} = \phi_j + \nu_{ij}\)`, our new regression would look like

`$$\text{earn}_{ij} = \alpha + \beta_1 \text{Age}_i + \beta_2\text{AgeSq}_i + \beta_3\text{Educ}_i + \beta_4\text{Female}_i + \phi_j + \nu_{ij},$$`

--

Any __unobserved__ contribution towards earnings that .hi-pink[varies across] the `\(J\)` groups __but__ is .hi-pink[constant within] each `\(j\)` is controlled for.
--

**How do we run regressions with fixed effects?**

```r
fixed.dum <- lm(data = dataset, Y ~ X + factor(category_variable))
```

Turning your character variables into factors automatically makes .mono[R] treat each `\(j\)`th category as if it had its own dummy variable. [Example](https://rstudio-pubs-static.s3.amazonaws.com/372492_3e05f38dd3f248e89cdedd317d603b9a.html)

`plm`, maintained by Yves Croissant: [Example Code](https://www.econometrics-with-r.org/10-3-fixed-effects-regression.html)

`fixest`, maintained by Laurent Berge and Grant McDermott: [Example Code](https://cran.r-project.org/web/packages/fixest/vignettes/fixest_walkthrough.html)

---
# Beyond Binary

I estimate whether productivity rankings across different categories of firm types, represented by .hi-pink[dummy variables], are consistent with .hi-blue[Melitz (2003)]. Unobserved .hi-pink[fixed effects] within industries, years, and countries are controlled for.

<img src="research_results.png" width="80%" style="display: block; margin: auto;" />

---
# Omitted Variable Bias

**Omitted variable bias** (OVB) arises when we omit a variable that

1. Affects the outcome variable `\(Y\)`
2. Correlates with an explanatory variable `\(X_j\)`

This biases the OLS estimator of `\(\beta_j\)`.

---
# Omitted Variable Bias

**Example**

Let's imagine a simple population model for the amount individual `\(i\)` gets paid

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$

where `\(\text{School}_i\)` gives `\(i\)`'s years of schooling and `\(\text{Male}_i\)` denotes an indicator variable for whether individual `\(i\)` is male.

**Interpretation**

- `\(\beta_1\)`: returns to an additional year of schooling (*ceteris paribus*)
- `\(\beta_2\)`: premium for being male (*ceteris paribus*)

--

<br>If `\(\beta_2 > 0\)`, then there is discrimination against women.
---
# Omitted Variable Bias

**Example, continued**

From the population model

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$

an analyst focuses on the relationship between pay and schooling, _i.e._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \left(\beta_2 \text{Male}_i + u_i\right) $$

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \varepsilon_i $$

where `\(\varepsilon_i = \beta_2 \text{Male}_i + u_i\)`.

--

We assumed exogeneity to show that OLS is unbiased. But even if `\(\mathop{\mathbb{E}}\left[ u | X \right] = 0\)`, it is not necessarily true that `\(\mathop{\mathbb{E}}\left[ \varepsilon | X \right] = 0\)` (false if `\(\beta_2 \neq 0\)`).

--

Specifically, `\(\mathop{\mathbb{E}}\left[ \varepsilon | \text{Male} = 1 \right] = \beta_2 + \mathop{\mathbb{E}}\left[ u | \text{Male} = 1 \right] \neq 0\)`.

--

**Now OLS is biased.**

---
# Omitted Variable Bias

Let's try to see this result graphically.

The true population model:

$$ \text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i $$

The regression model that suffers from omitted-variable bias:

$$ \text{Pay}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \text{School}_i + e_i $$

Finally, imagine that women, on average, receive more schooling than men.
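---
# Omitted Variable Bias

The bias can also be simulated directly. A sketch under the setup just described (true model `\(\text{Pay}_i = 20 + 0.5\,\text{School}_i + 10\,\text{Male}_i + u_i\)`, with women averaging more schooling); all object names are illustrative:

```r
set.seed(320)
n      <- 1000
male   <- rep(c(0, 1), each = n / 2)
school <- 12 + 2 * (1 - male) + rnorm(n)  # women average two more years of school
pay    <- 20 + 0.5 * school + 10 * male + rnorm(n)

b_short <- coef(lm(pay ~ school))[["school"]]         # omits male: biased
b_long  <- coef(lm(pay ~ school + male))[["school"]]  # close to the true 0.5
```

Because `\(\text{Male}\)` raises pay but correlates negatively with schooling, the short regression's slope is pulled well below 0.5 (here it even turns negative), while the long regression recovers the true coefficient.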
---
# Omitted Variable Bias

True model: `\(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)`

<img src="13-Categorical_Variables_files/figure-html/plot ovb 1-1.svg" style="display: block; margin: auto;" />

---
count: false
# Omitted Variable Bias

Biased regression: `\(\widehat{\text{Pay}}_i = 31.3 - 0.9 \times \text{School}_i\)`

<img src="13-Categorical_Variables_files/figure-html/plot ovb 2-1.svg" style="display: block; margin: auto;" />

---
count: false
# Omitted Variable Bias

Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**)

<img src="13-Categorical_Variables_files/figure-html/plot ovb 3-1.svg" style="display: block; margin: auto;" />

---
count: false
# Omitted Variable Bias

Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**)

<img src="13-Categorical_Variables_files/figure-html/plot ovb 4-1.svg" style="display: block; margin: auto;" />

---
count: false
# Omitted Variable Bias

Unbiased regression: `\(\widehat{\text{Pay}}_i = 20.9 + 0.4 \times \text{School}_i + 9.1 \times \text{Male}_i\)`

<img src="13-Categorical_Variables_files/figure-html/plot ovb 5-1.svg" style="display: block; margin: auto;" />

---
# Categorical Variables
## Example: Weekly Wages

```r
lm(wage ~ south, data = wage_data) %>% tidy()
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)     632.      6.00     105.   0
#> 2 south          -137.      9.45     -14.5  6.21e-46
```

--

**Q.sub[1]:** What is the reference category?

**Q.sub[2]:** Interpret the coefficients.

**Q.sub[3]:** Suppose you ran `lm(wage ~ nonsouth, data = wage_data)` instead. What is the coefficient estimate on `nonsouth`? What is the intercept estimate?
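---
# Categorical Variables
## Example: Weekly Wages

The flip-the-dummy logic behind Q.sub[3] can be sketched with toy numbers (not the actual `wage_data`):

```r
wage     <- c(500, 520, 540, 600, 620, 640)  # first three observations: south
south    <- c(1, 1, 1, 0, 0, 0)
nonsouth <- 1 - south

fit_south    <- lm(wage ~ south)
fit_nonsouth <- lm(wage ~ nonsouth)
coef(fit_south)     # intercept 620 (non-south mean), slope -100
coef(fit_nonsouth)  # intercept 520 (south mean), slope 100
```

Recoding the dummy swaps the reference group: the intercept becomes the other group's mean and the slope flips sign, while the fitted values are unchanged.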
---
# Categorical Variables
## Example: Weekly Wages

```r
lm(wage ~ south + black, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    647.       6.02     107.   0
#> 2 south          -98.6      9.84     -10.0  2.89e-23
#> 3 black         -129.      11.4      -11.3  3.43e-29
```

--

**Q.sub[1]:** What is the reference category?

**Q.sub[2]:** Interpret the coefficients.

**Q.sub[3]:** Suppose you ran `lm(wage ~ south + nonblack, data = wage_data)` instead. What is the coefficient estimate on `nonblack`? What is the coefficient estimate on `south`? What is the intercept estimate?

---
# Categorical Variables
## Example: Weekly Wages

**Answer to Q.sub[3]:**

```r
lm(wage ~ south + nonblack, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    518.      11.7       44.3  0
#> 2 south          -98.6      9.84     -10.0  2.89e-23
#> 3 nonblack       129.      11.4       11.3  3.43e-29
```

---
exclude: true