class: center, middle, inverse, title-slide

# Categorical Variables
## EC 320: Introduction to Econometrics
### Philip Economides
### Winter 2022

---
class: inverse, middle

# Prologue

---
class: inverse, middle

# Categorical Variables

---
# Categorical Variables

**Goal:** Make quantitative statements about .pink[qualitative information].

- *e.g.,* race, gender, being employed, living in Oregon, *etc.*

--

**Approach:** Construct .pink[binary variables].

- _a.k.a._ .pink[dummy variables] or .pink[indicator variables].
- Equal to 1 if the observation is in the category and 0 otherwise.

--

**Regression implications**

1. Binary variables change the interpretation of the intercept.
2. Coefficients on binary variables have different interpretations than those on continuous variables.

---
# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

where

- `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay
- `\(\text{School}_i\)` is a continuous variable that measures years of education

--

**Interpretation**

- `\(\beta_0\)`: `\(y\)`-intercept, _i.e._, `\(\text{Pay}\)` when `\(\text{School} = 0\)`
- `\(\beta_1\)`: expected increase in `\(\text{Pay}\)` for a one-unit increase in `\(\text{School}\)`

---
# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

**Derive the slope's interpretation:**

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + u \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 \ell + u \right]\)`

--

<br> `\(\quad = \left[ \beta_0 + \beta_1 (\ell + 1) \right] - \left[ \beta_0 + \beta_1 \ell \right]\)`

--

<br> `\(\quad = \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1\)`

--

<br> `\(\quad = \beta_1\)`.
--

The slope gives the expected increase in pay for an additional year of schooling.

---
# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

**Alternative derivation**

Differentiate the model with respect to schooling:

$$ \dfrac{d\text{Pay}}{d\text{School}} = \beta_1 $$

The slope gives the expected increase in pay for an additional year of schooling.

---
# Continuous Variables

If we have multiple explanatory variables, _e.g._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Ability}_i + u_i $$

then the interpretation changes slightly.

--

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \land \text{Ability} = \alpha \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell \land \text{Ability} = \alpha \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha + u \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha + u \right]\)`

--

<br> `\(\quad = \left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha \right] - \left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha \right]\)`

--

<br> `\(\quad = \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 + \beta_2 \alpha - \beta_2 \alpha\)`

--

<br> `\(\quad = \beta_1\)`

--

The slope gives the expected increase in pay for an additional year of schooling, **holding ability constant**.

---
# Continuous Variables

If we have multiple explanatory variables, _e.g._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Ability}_i + u_i $$

then the interpretation changes slightly.

--

**Alternative derivation**

Differentiate the model with respect to schooling:

$$ \dfrac{\partial\text{Pay}}{\partial\text{School}} = \beta_1 $$

The slope gives the expected increase in pay for an additional year of schooling, **holding ability constant**.
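---
# Continuous Variables

The partial-effect interpretation can be checked numerically. A minimal sketch with hypothetical, noiseless data (the objects `school`, `ability`, and `pay` are made up for illustration, so OLS recovers the coefficients exactly):

```r
# Noiseless data generated from pay = 20 + 0.5*school + 2*ability
school  <- c(10, 11, 12, 13, 14, 15)
ability <- c( 1,  3,  2,  5,  4,  6)
pay     <- 20 + 0.5 * school + 2 * ability

fit <- lm(pay ~ school + ability)
coef(fit)[["school"]]  # 0.5: pay increase per extra year of school, ability held fixed
```

Because the fitted slope on `school` matches the 0.5 used to build the data, it is exactly the *ceteris paribus* effect derived above.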
---
# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.

**Interpretation**

`\(\beta_0\)` is the expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`):

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Male} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + 0 + u_i \right]\)`

--

<br> `\(\quad = \beta_0\)`

---
# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.

**Interpretation**

`\(\beta_1\)` is the expected difference in `\(\text{Pay}\)` between females and males:

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Female} \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{Male} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 + u_i \right] - \mathop{\mathbb{E}}\left[ \beta_0 + 0 + u_i \right]\)`

--

<br> `\(\quad = \beta_0 + \beta_1 - \beta_0\)`

--

<br> `\(\quad = \beta_1\)`

---
# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.
**Interpretation**

`\(\beta_0 + \beta_1\)` is the expected `\(\text{Pay}\)` for females:

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Female} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 + u_i \right]\)`

--

<br> `\(\quad = \beta_0 + \beta_1\)`

---
# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

**Interpretation**

- `\(\beta_0\)`: expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`)
- `\(\beta_1\)`: expected difference in `\(\text{Pay}\)` between females and males
- `\(\beta_0 + \beta_1\)`: expected `\(\text{Pay}\)` for females
- Males are the **reference group**

--

**Note:** If there are no other variables to condition on, then `\(\hat{\beta}_1\)` equals the difference in group means, _i.e._, `\(\overline{\text{Pay}}_\text{Female} - \overline{\text{Pay}}_\text{Male}\)`.

--

**Note<sub>2</sub>:** The *holding all other variables constant* interpretation also applies for categorical variables in multiple regression settings.
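---
# Categorical Variables

The group-means claim is easy to verify in .mono[R]. A small sketch with made-up numbers (not the course's `wage_data`):

```r
# First three observations are male (female = 0), last three are female
pay    <- c(10, 12, 14, 20, 22, 24)
female <- c(0, 0, 0, 1, 1, 1)

fit <- lm(pay ~ female)
coef(fit)[["(Intercept)"]]  # 12 = mean pay for males, the reference group
coef(fit)[["female"]]       # 10 = mean pay for females (22) minus mean for males (12)
```

With a single binary regressor, OLS simply reproduces the two group means: the intercept is the reference-group mean and the slope is the gap between groups.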
---
# Categorical Variables

`\(Y_i = \beta_0 + \beta_1 X_i + u_i\)` for binary variable `\(X_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)`

<img src="13-Categorical_Variables_files/figure-html/dat plot 1-1.svg" style="display: block; margin: auto;" />

---
# Categorical Variables

`\(Y_i = \beta_0 + \beta_1 X_i + u_i\)` for binary variable `\(X_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)`

<img src="13-Categorical_Variables_files/figure-html/dat plot 2-1.svg" style="display: block; margin: auto;" />

---
# Multiple Regression

`\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \quad\)` `\(X_1\)` is continuous `\(\quad X_2\)` is categorical

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 1-1.svg" style="display: block; margin: auto;" />

---
count: false
# Multiple Regression

The intercept and categorical variable `\(X_2\)` control for the groups' means.

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 2-1.svg" style="display: block; margin: auto;" />

---
count: false
# Multiple Regression

With groups' means removed:

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 3-1.svg" style="display: block; margin: auto;" />

---
count: false
# Multiple Regression

`\(\hat{\beta}_1\)` estimates the relationship between `\(Y\)` and `\(X_1\)` after controlling for `\(X_2\)`.

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 4-1.svg" style="display: block; margin: auto;" />

---
count: false
# Multiple Regression

Another way to think about it:

<img src="13-Categorical_Variables_files/figure-html/mult reg plot 5-1.svg" style="display: block; margin: auto;" />

---
class: white-slide

**Question:** Why not estimate `\(\text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Male}_i + u_i\)`?

--

**Answer:** The intercept is a perfect linear combination of `\(\text{Male}_i\)` and `\(\text{Female}_i\)`.

- Violates .pink[no perfect collinearity] assumption.
- OLS can't estimate all three parameters simultaneously.
- This is known as the .hi[dummy variable trap].

**Practical solution:** Select a reference category and drop its indicator.

---
# Dummy Variable _Trap?_

Don't worry, .mono[R] will bail you out if you include perfectly collinear indicators.

**Example**

```r
lm(wage ~ black + nonblack, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)     617.      5.27     117.   0
#> 2 black          -168.     10.9      -15.4  7.78e-52
#> 3 nonblack         NA       NA        NA   NA
```

--

Thanks, .mono[R].

---
# Multiple Categories

So far we have only discussed **binary** categorical variables represented by dummies.

In many cases, there is a wide variety of categories by which we can characterize a set of observations.

**For example**

* Transport Modes: Rail, Highway, Air, Water
* Income Range: 1st quartile, 2nd quartile, 3rd quartile, 4th quartile
* Geographic Regions: Alabama, Idaho, Oregon, *etc.*

When addressing product diversification and trade, we can end up with an incredible number of categories to consider.
[Trade Statistics by Product (HS 6-digit)](wits.worldbank.org/trade/country-byhs6product/aspx?lang=en)

---
class: white-slide

.center[**Categorical Variable Types**]

<table>
<thead>
<tr>
<th style="text-align:left;"> Type of Variable </th>
<th style="text-align:left;"> Represents </th>
<th style="text-align:left;"> Examples </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;"> Binary Variables </td>
<td style="text-align:left;font-style: italic;color: black !important;"> Yes/no outcomes </td>
<td style="text-align:left;"> Heads/tails in a coin flip <br> Win/lose in a football game </td>
</tr>
<tr>
<td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;"> Nominal Variables </td>
<td style="text-align:left;font-style: italic;color: black !important;"> Groups with no rank or order between them </td>
<td style="text-align:left;"> Specific names <br> Colors <br> Brands </td>
</tr>
<tr>
<td style="text-align:left;font-weight: bold;font-style: italic;color: black !important;"> Ordinal Variables </td>
<td style="text-align:left;font-style: italic;color: black !important;"> Groups that are ranked in a specific order </td>
<td style="text-align:left;"> Rankings in a competition <br> Rating-scale responses in a survey </td>
</tr>
</tbody>
</table>

---
# Beyond Binary

How do we deal with heaps of categories? .hi-pink[It depends].

--

Are these categories your .hi-pink[outcome variable]?

--

**Binary:** *Logistic Regression Model*, where we are determining the probability of an event, given individual characteristics of `\(i\)`.

--

**Ordinal:** *Cumulative/Ordered Logit Model* for categorical variables with an implied order and `\(J\)` choices.

--

**Nominal:** *Generalized Logit Model*, which holds characteristics fixed across choices, and *Multinomial/Conditional Logit Model*, which allows characteristics to differ across choices.
--

These items .hi-pink[will not be covered] in this class, nor will you be tested on their descriptions. This is guidance for those interested in reading further and understanding what future econometrics classes deliver.

---
# Beyond Binary

Are these categories part of an .hi-pink[explanatory variable]?

--

__Approach I:__ Apply a unique dummy variable for each category

--

For example, consider

`\(\text{earn}_i = \alpha + \beta_1 \text{HS}_i + \beta_2\text{UG}_i + \beta_3\text{MS}_i + \beta_4\text{PhD}_i + u_i\)`

--

In this case I may have a single categorical variable, `\(DEG_i\)`, that lists degree types of individual `\(i\)` across my sample.

```r
educ_df %<>%
  mutate(HS  = if_else(DEG == "Highschool", 1, 0),
         UG  = if_else(DEG == "Undergraduate", 1, 0),
         MS  = if_else(DEG == "Masters of Science", 1, 0),
         PhD = if_else(DEG == "Doctorate", 1, 0))

educ_reg <- lm(data = educ_df, earn ~ HS + UG + MS + PhD)
```

Individuals without any degree form the reference group: for each of them, `\(HS + UG + MS + PhD = 0\)`.

---
# Beyond Binary

What if there are __too many__ categories but I want to create individual dummies?
--

Jacob Kaplan (Princeton) created the `fastDummies` package, which provides a useful function `dummy_cols()` [LINK](https://jacobkap.github.io/fastDummies/)

--

Consider the following example

<table>
<thead>
<tr>
<th style="text-align:right;"> numbers </th>
<th style="text-align:left;"> gender </th>
<th style="text-align:left;"> animals </th>
<th style="text-align:left;"> dates </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 1 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2012-01-01 </td>
</tr>
<tr>
<td style="text-align:right;"> 2 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2011-12-31 </td>
</tr>
<tr>
<td style="text-align:right;"> 3 </td>
<td style="text-align:left;"> female </td>
<td style="text-align:left;"> cat </td>
<td style="text-align:left;"> 2012-01-01 </td>
</tr>
</tbody>
</table>

---
# Beyond Binary

```r
results <- fastDummies::dummy_cols(fastDummies_example)
knitr::kable(results) %>% kable_styling(font_size = 10)
```

<table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> numbers </th>
<th style="text-align:left;"> gender </th>
<th style="text-align:left;"> animals </th>
<th style="text-align:left;"> dates </th>
<th style="text-align:right;"> gender_female </th>
<th style="text-align:right;"> gender_male </th>
<th style="text-align:right;"> animals_cat </th>
<th style="text-align:right;"> animals_dog </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 1 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2012-01-01 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 2 </td>
<td
style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2011-12-31 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 3 </td>
<td style="text-align:left;"> female </td>
<td style="text-align:left;"> cat </td>
<td style="text-align:left;"> 2012-01-01 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
</tr>
</tbody>
</table>

--

```r
results <- fastDummies::dummy_cols(fastDummies_example,
                                   select_columns = c("animals", "gender"))
knitr::kable(results) %>% kable_styling(font_size = 10)
```

<table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:right;"> numbers </th>
<th style="text-align:left;"> gender </th>
<th style="text-align:left;"> animals </th>
<th style="text-align:left;"> dates </th>
<th style="text-align:right;"> animals_cat </th>
<th style="text-align:right;"> animals_dog </th>
<th style="text-align:right;"> gender_female </th>
<th style="text-align:right;"> gender_male </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 1 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2012-01-01 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 2 </td>
<td style="text-align:left;"> male </td>
<td style="text-align:left;"> dog </td>
<td style="text-align:left;"> 2011-12-31 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;">
3 </td>
<td style="text-align:left;"> female </td>
<td style="text-align:left;"> cat </td>
<td style="text-align:left;"> 2012-01-01 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td>
<td style="text-align:right;"> 0 </td>
</tr>
</tbody>
</table>

---
# Beyond Binary

Are these categories part of an .hi-pink[explanatory variable]?

--

__Approach II:__ Apply a fixed effect to your model

--

Consider the following model

`$$\text{earn}_{ij} = \alpha + \beta_1 \text{Age}_i + \beta_2\text{AgeSq}_i + \beta_3\text{Educ}_i + \beta_4\text{Female}_i + u_{ij}$$`

--

There may be .hi-pink[unobservable] aspects related to groups defined by `\(j\)` that are **fixed** across individuals in each group `\(j\in\{1,2,\dots,J\}\)`.

--

For example, if we were regressing the earnings of service staff across `\(J\)` countries, the USA may see unobserved `\(\text{tips}_{ij}\)` contributing more significantly towards income due to underlying cultural/professional norms.

--

In this case a **country fixed effect** in our regression would do wonders.

---
# Beyond Binary

Where `\(u_{ij} = \phi_j + \nu_{ij}\)`, our new regression would look like

`$$\text{earn}_{ij} = \alpha + \beta_1 \text{Age}_i + \beta_2\text{AgeSq}_i + \beta_3\text{Educ}_i + \beta_4\text{Female}_i + \phi_j + \nu_{ij},$$`

--

Any __unobserved__ contribution towards earnings that .hi-pink[varies across] the `\(J\)` groups __but__ is .hi-pink[constant within] each `\(j\)` is controlled for.
--

**How do we run regressions with fixed effects?**

```r
fixed.dum <- lm(data = dataset, Y ~ X + factor(category_variable))
```

Turning your character variables into factors automatically makes .mono[R] treat each `\(j\)`th category as if it had its own dummy variable. [Example](https://rstudio-pubs-static.s3.amazonaws.com/372492_3e05f38dd3f248e89cdedd317d603b9a.html)

`plm`, maintained by Yves Croissant: [Example Code](https://www.econometrics-with-r.org/10-3-fixed-effects-regression.html)

`fixest`, maintained by Laurent Berge and Grant McDermott: [Example Code](https://cran.r-project.org/web/packages/fixest/vignettes/fixest_walkthrough.html)

---
# Beyond Binary

I estimate whether productivity rankings across different categories of firm types, represented by .hi-pink[dummy variables], are consistent with .hi-blue[Melitz (2003)]. Unobserved .hi-pink[fixed effects] within industries, years, and countries are controlled for.

<img src="research_results.png" width="80%" style="display: block; margin: auto;" />

---
# Omitted Variable Bias

**Omitted variable bias** (OVB) arises when we omit a variable that

1. Affects the outcome variable `\(Y\)`
2. Correlates with an explanatory variable `\(X_j\)`

This biases the OLS estimator of `\(\beta_j\)`.

---
# Omitted Variable Bias

**Example**

Let's imagine a simple population model for the amount individual `\(i\)` gets paid

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$

where `\(\text{School}_i\)` gives `\(i\)`'s years of schooling and `\(\text{Male}_i\)` denotes an indicator variable for whether individual `\(i\)` is male.

**Interpretation**

- `\(\beta_1\)`: returns to an additional year of schooling (*ceteris paribus*)
- `\(\beta_2\)`: premium for being male (*ceteris paribus*)

--

<br>If `\(\beta_2 > 0\)`, then there is discrimination against women.
---
# Omitted Variable Bias

**Example, continued**

From the population model

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$

an analyst focuses on the relationship between pay and schooling, _i.e._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \left(\beta_2 \text{Male}_i + u_i\right) $$

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \varepsilon_i $$

where `\(\varepsilon_i = \beta_2 \text{Male}_i + u_i\)`.

--

We assumed exogeneity to show that OLS is unbiased. But even if `\(\mathop{\mathbb{E}}\left[ u | X \right] = 0\)`, it is not necessarily true that `\(\mathop{\mathbb{E}}\left[ \varepsilon | X \right] = 0\)` (false if `\(\beta_2 \neq 0\)`).

--

Specifically, `\(\mathop{\mathbb{E}}\left[ \varepsilon | \text{Male} = 1 \right] = \beta_2 + \mathop{\mathbb{E}}\left[ u | \text{Male} = 1 \right] \neq 0\)`.

--

**Now OLS is biased.**

---
# Omitted Variable Bias

Let's try to see this result graphically.

The true population model:

$$ \text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i $$

The regression model that suffers from omitted-variable bias:

$$ \text{Pay}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \text{School}_i + e_i $$

Finally, imagine that women, on average, receive more schooling than men.
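---
# Omitted Variable Bias

The bias can also be simulated directly. A sketch under the setup just described (true model `\(\text{Pay}_i = 20 + 0.5\,\text{School}_i + 10\,\text{Male}_i + u_i\)`, with women averaging more schooling); all object names are illustrative:

```r
set.seed(320)
n      <- 1000
male   <- rep(c(0, 1), each = n / 2)
school <- 12 + 2 * (1 - male) + rnorm(n)  # women average two more years of school
pay    <- 20 + 0.5 * school + 10 * male + rnorm(n)

b_short <- coef(lm(pay ~ school))[["school"]]         # omits male: biased
b_long  <- coef(lm(pay ~ school + male))[["school"]]  # close to the true 0.5
```

Because `\(\text{Male}\)` raises pay but correlates negatively with schooling, the short regression's slope is pulled well below 0.5 (here it even turns negative), while the long regression recovers the true coefficient.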
---
# Omitted Variable Bias

True model: `\(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)`

<img src="13-Categorical_Variables_files/figure-html/plot ovb 1-1.svg" style="display: block; margin: auto;" />

---
count: false
# Omitted Variable Bias

Biased regression: `\(\widehat{\text{Pay}}_i = 31.3 - 0.9 \times \text{School}_i\)`

<img src="13-Categorical_Variables_files/figure-html/plot ovb 2-1.svg" style="display: block; margin: auto;" />

---
count: false
# Omitted Variable Bias

Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**)

<img src="13-Categorical_Variables_files/figure-html/plot ovb 3-1.svg" style="display: block; margin: auto;" />

---
count: false
# Omitted Variable Bias

Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**)

<img src="13-Categorical_Variables_files/figure-html/plot ovb 4-1.svg" style="display: block; margin: auto;" />

---
count: false
# Omitted Variable Bias

Unbiased regression: `\(\widehat{\text{Pay}}_i = 20.9 + 0.4 \times \text{School}_i + 9.1 \times \text{Male}_i\)`

<img src="13-Categorical_Variables_files/figure-html/plot ovb 5-1.svg" style="display: block; margin: auto;" />

---
# Categorical Variables
## Example: Weekly Wages

```r
lm(wage ~ south, data = wage_data) %>% tidy()
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)     632.      6.00     105.   0
#> 2 south          -137.      9.45     -14.5  6.21e-46
```

--

**Q.sub[1]:** What is the reference category?

**Q.sub[2]:** Interpret the coefficients.

**Q.sub[3]:** Suppose you ran `lm(wage ~ nonsouth, data = wage_data)` instead. What is the coefficient estimate on `nonsouth`? What is the intercept estimate?
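---
# Categorical Variables
## Example: Weekly Wages

The flip-the-dummy logic behind Q.sub[3] can be sketched with toy numbers (not the actual `wage_data`):

```r
wage     <- c(500, 520, 540, 600, 620, 640)  # first three observations: south
south    <- c(1, 1, 1, 0, 0, 0)
nonsouth <- 1 - south

fit_south    <- lm(wage ~ south)
fit_nonsouth <- lm(wage ~ nonsouth)
coef(fit_south)     # intercept 620 (non-south mean), slope -100
coef(fit_nonsouth)  # intercept 520 (south mean), slope 100
```

Recoding the dummy swaps the reference group: the intercept becomes the other group's mean and the slope flips sign, while the fitted values are unchanged.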
---
# Categorical Variables
## Example: Weekly Wages

```r
lm(wage ~ south + black, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    647.       6.02     107.   0
#> 2 south          -98.6      9.84     -10.0  2.89e-23
#> 3 black         -129.      11.4      -11.3  3.43e-29
```

--

**Q.sub[1]:** What is the reference category?

**Q.sub[2]:** Interpret the coefficients.

**Q.sub[3]:** Suppose you ran `lm(wage ~ south + nonblack, data = wage_data)` instead. What is the coefficient estimate on `nonblack`? What is the coefficient estimate on `south`? What is the intercept estimate?

---
# Categorical Variables
## Example: Weekly Wages

**Answer to Q.sub[3]:**

```r
lm(wage ~ south + nonblack, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    518.      11.7       44.3  0
#> 2 south          -98.6      9.84     -10.0  2.89e-23
#> 3 nonblack       129.      11.4       11.3  3.43e-29
```

---
exclude: true