class: center, middle, inverse, title-slide

# Categorical Variables
## EC 320: Introduction to Econometrics
### Winter 2022

---
class: inverse, middle

# Prologue

---

# Housekeeping

1. .pink[**Problem Set 3**] grade posted
2. .pink[**Midterm**] grade to be posted by Wednesday
3. .pink[**Problem Set 4**] to be posted by tomorrow, stay tuned
  - Due next Monday
4. .pink[**Lab**]
  - **Lab** held on Wednesday
  - **Lab material** available on Github, **Ex7** available on Canvas
  - **Ex7** due Wednesday

---
class: inverse, middle

# Categorical Variables

---

# Categorical Variables

**Goal:** Make quantitative statements about .pink[qualitative information].

- *e.g.,* race, gender, being employed, living in Oregon, *etc.*

--

**Approach:** Construct .pink[binary variables].

- _a.k.a._ .pink[dummy variables] or .pink[indicator variables].
- Value equals 1 if the observation is in the category and 0 otherwise.

--

**Regression implications**

1. Binary variables change the interpretation of the intercept.
2. Coefficients on binary variables have different interpretations than those on continuous variables.
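In .mono[R], an indicator can be constructed directly from the qualitative variable. A minimal sketch, assuming a hypothetical data frame with a character column `state`:

```r
# hypothetical data: four individuals and the state they live in
df <- data.frame(state = c("Oregon", "Washington", "Oregon", "California"))

# indicator variable: 1 if the observation lives in Oregon, 0 otherwise
df$oregon <- as.numeric(df$state == "Oregon")

df$oregon
#> [1] 1 0 1 0
```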
---

# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

where

- `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay
- `\(\text{School}_i\)` is a continuous variable that measures years of education

--

**Interpretation**

- `\(\beta_0\)`: `\(y\)`-intercept, _i.e._, `\(\text{Pay}\)` when `\(\text{School} = 0\)`
- `\(\beta_1\)`: expected increase in `\(\text{Pay}\)` for a one-unit increase in `\(\text{School}\)`

---

# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

**Derive the slope's interpretation:**

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + u \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 \ell + u \right]\)`

--

<br> `\(\quad = \left[ \beta_0 + \beta_1 (\ell + 1) \right] - \left[ \beta_0 + \beta_1 \ell \right]\)`

--

<br> `\(\quad = \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1\)`

--

<br> `\(\quad = \beta_1\)`.

--

The slope gives the expected increase in pay for an additional year of schooling.

---

# Continuous Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + u_i $$

**Alternative derivation**

Differentiate the model with respect to schooling:

$$ \dfrac{d\text{Pay}}{d\text{School}} = \beta_1 $$

The slope gives the expected increase in pay for an additional year of schooling.

---

# Continuous Variables

If we have multiple explanatory variables, _e.g._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Ability}_i + u_i $$

then the interpretation changes slightly.
--

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell + 1 \land \text{Ability} = \alpha \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{School} = \ell \land \text{Ability} = \alpha \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha + u \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha + u \right]\)`

--

<br> `\(\quad = \left[ \beta_0 + \beta_1 (\ell + 1) + \beta_2 \alpha \right] - \left[ \beta_0 + \beta_1 \ell + \beta_2 \alpha \right]\)`

--

<br> `\(\quad = \beta_0 - \beta_0 + \beta_1 \ell - \beta_1 \ell + \beta_1 + \beta_2 \alpha - \beta_2 \alpha\)`

--

<br> `\(\quad = \beta_1\)`

--

The slope gives the expected increase in pay for an additional year of schooling, **holding ability constant**.

---

# Continuous Variables

If we have multiple explanatory variables, _e.g._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Ability}_i + u_i $$

then the interpretation changes slightly.

--

**Alternative derivation**

Differentiate the model with respect to schooling:

$$ \dfrac{\partial\text{Pay}}{\partial\text{School}} = \beta_1 $$

The slope gives the expected increase in pay for an additional year of schooling, **holding ability constant**.

---

# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.
**Interpretation**

`\(\beta_0\)` is the expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`):

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Male} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + 0 + u_i \right]\)`

--

<br> `\(\quad = \beta_0\)`

---

# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.

**Interpretation**

`\(\beta_1\)` is the expected difference in `\(\text{Pay}\)` between females and males:

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Female} \right] - \mathop{\mathbb{E}}\left[ \text{Pay} | \text{Male} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right] - \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 0 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 + u_i \right] - \mathop{\mathbb{E}}\left[ \beta_0 + 0 + u_i \right]\)`

--

<br> `\(\quad = \beta_0 + \beta_1 - \beta_0\)`

--

<br> `\(\quad = \beta_1\)`

---

# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

where `\(\text{Pay}_i\)` is a continuous variable measuring an individual's pay and `\(\text{Female}_i\)` is a binary variable equal to `\(1\)` when `\(i\)` is female.
**Interpretation**

`\(\beta_0 + \beta_1\)` is the expected `\(\text{Pay}\)` for females:

`\(\mathop{\mathbb{E}}\left[ \text{Pay} | \text{Female} \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1\times 1 + u_i \right]\)`

--

<br> `\(\quad = \mathop{\mathbb{E}}\left[ \beta_0 + \beta_1 + u_i \right]\)`

--

<br> `\(\quad = \beta_0 + \beta_1\)`

---

# Categorical Variables

Consider the relationship

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + u_i $$

**Interpretation**

- `\(\beta_0\)`: expected `\(\text{Pay}\)` for males (_i.e._, when `\(\text{Female} = 0\)`)
- `\(\beta_1\)`: expected difference in `\(\text{Pay}\)` between females and males
- `\(\beta_0 + \beta_1\)`: expected `\(\text{Pay}\)` for females
- Males are the **reference group**

--

**Note:** If there are no other variables to condition on, then `\(\hat{\beta}_1\)` equals the difference in group means of the outcome, _i.e._, `\(\bar{Y}_\text{Female} - \bar{Y}_\text{Male}\)`.

--

**Note<sub>2</sub>:** The *holding all other variables constant* interpretation also applies to categorical variables in multiple regression settings.
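The first note, that `\(\hat{\beta}_1\)` equals the difference in group means, is easy to check in .mono[R]. A quick sketch with simulated data (the variable names and numbers below are made up for illustration):

```r
set.seed(42)

# simulated data: 100 males (female = 0) and 100 females (female = 1)
female <- rep(c(0, 1), each = 100)
pay    <- 600 - 150 * female + rnorm(200, sd = 50)

# slope from the simple regression of pay on the indicator
b1 <- unname(coef(lm(pay ~ female))["female"])

# difference in group means of the outcome
diff_means <- mean(pay[female == 1]) - mean(pay[female == 0])

all.equal(b1, diff_means)
#> [1] TRUE
```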
---

# Categorical Variables

`\(Y_i = \beta_0 + \beta_1 X_i + u_i\)` for binary variable `\(X_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)`

<img src="14-Categorical_Variables_files/figure-html/dat plot 1-1.svg" style="display: block; margin: auto;" />

---

# Categorical Variables

`\(Y_i = \beta_0 + \beta_1 X_i + u_i\)` for binary variable `\(X_i = \{\color{#314f4f}{0}, \, \color{#e64173}{1}\}\)`

<img src="14-Categorical_Variables_files/figure-html/dat plot 2-1.svg" style="display: block; margin: auto;" />

---

# Multiple Regression

`\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \quad\)` `\(X_1\)` is continuous `\(\quad X_2\)` is categorical

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 1-1.svg" style="display: block; margin: auto;" />

---
count: false

# Multiple Regression

The intercept and categorical variable `\(X_2\)` control for the groups' means.

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 2-1.svg" style="display: block; margin: auto;" />

---
count: false

# Multiple Regression

With groups' means removed:

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 3-1.svg" style="display: block; margin: auto;" />

---
count: false

# Multiple Regression

`\(\hat{\beta}_1\)` estimates the relationship between `\(Y\)` and `\(X_1\)` after controlling for `\(X_2\)`.

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 4-1.svg" style="display: block; margin: auto;" />

---
count: false

# Multiple Regression

Another way to think about it:

<img src="14-Categorical_Variables_files/figure-html/mult reg plot 5-1.svg" style="display: block; margin: auto;" />

---
class: white-slide

**Question:** Why not estimate `\(\text{Pay}_i = \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Male}_i + u_i\)`?

--

**Answer:** The intercept is a perfect linear combination of `\(\text{Male}_i\)` and `\(\text{Female}_i\)`.

- Violates the .pink[no perfect collinearity] assumption.
- OLS can't estimate all three parameters simultaneously.
- Known as the .hi[dummy variable trap].

**Practical solution:** Select a reference category and drop its indicator.

---

# Dummy Variable _Trap?_

Don't worry, .mono[R] will bail you out if you include perfectly collinear indicators.

**Example**

```r
lm(wage ~ black + nonblack, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)     617.      5.27     117.  0
#> 2 black          -168.     10.9      -15.4 7.78e-52
#> 3 nonblack          NA      NA        NA   NA
```

--

Thanks, .mono[R].

---

# Omitted Variable Bias

**Omitted variable bias** (OVB) arises when we omit a variable that

1. Affects the outcome variable `\(Y\)`
2. Correlates with an explanatory variable `\(X_j\)`

OVB biases the OLS estimator of `\(\beta_j\)`.

---

# Omitted Variable Bias

**Example**

Let's imagine a simple population model for the amount individual `\(i\)` gets paid

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$

where `\(\text{School}_i\)` gives `\(i\)`'s years of schooling and `\(\text{Male}_i\)` denotes an indicator variable for whether individual `\(i\)` is male.

**Interpretation**

- `\(\beta_1\)`: returns to an additional year of schooling (*ceteris paribus*)
- `\(\beta_2\)`: premium for being male (*ceteris paribus*)

--

<br> If `\(\beta_2 > 0\)`, then there is discrimination against women.

---

# Omitted Variable Bias

**Example, continued**

From the population model

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \beta_2 \text{Male}_i + u_i $$

an analyst focuses on the relationship between pay and schooling, _i.e._,

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \left(\beta_2 \text{Male}_i + u_i\right) $$

$$ \text{Pay}_i = \beta_0 + \beta_1 \text{School}_i + \varepsilon_i $$

where `\(\varepsilon_i = \beta_2 \text{Male}_i + u_i\)`.

--

We assumed exogeneity to show that OLS is unbiased.
But even if `\(\mathop{\mathbb{E}}\left[ u | X \right] = 0\)`, it is not necessarily true that `\(\mathop{\mathbb{E}}\left[ \varepsilon | X \right] = 0\)` (false if `\(\beta_2 \neq 0\)`).

--

Specifically, `\(\mathop{\mathbb{E}}\left[ \varepsilon | \text{Male} = 1 \right] = \beta_2 + \mathop{\mathbb{E}}\left[ u | \text{Male} = 1 \right] \neq 0\)`.

--

**Now OLS is biased.**

---

# Omitted Variable Bias

Let's try to see this result graphically.

The true population model:

$$ \text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i $$

The regression model that suffers from omitted-variable bias:

$$ \text{Pay}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \text{School}_i + e_i $$

Finally, imagine that women, on average, receive more schooling than men.

---

# Omitted Variable Bias

True model: `\(\text{Pay}_i = 20 + 0.5 \times \text{School}_i + 10 \times \text{Male}_i + u_i\)`

<img src="14-Categorical_Variables_files/figure-html/plot ovb 1-1.svg" style="display: block; margin: auto;" />

---
count: false

# Omitted Variable Bias

Biased regression: `\(\widehat{\text{Pay}}_i = 31.3 - 0.9 \times \text{School}_i\)`

<img src="14-Categorical_Variables_files/figure-html/plot ovb 2-1.svg" style="display: block; margin: auto;" />

---
count: false

# Omitted Variable Bias

Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**)

<img src="14-Categorical_Variables_files/figure-html/plot ovb 3-1.svg" style="display: block; margin: auto;" />

---
count: false

# Omitted Variable Bias

Recalling the omitted variable: Gender (**<font color="#e64173">female</font>** and **<font color="#314f4f">male</font>**)

<img src="14-Categorical_Variables_files/figure-html/plot ovb 4-1.svg" style="display: block; margin: auto;" />

---
count: false

# Omitted Variable Bias

Unbiased regression: `\(\widehat{\text{Pay}}_i = 20.9 + 0.4 \times \text{School}_i + 9.1 \times \text{Male}_i\)`

<img
src="14-Categorical_Variables_files/figure-html/plot ovb 5-1.svg" style="display: block; margin: auto;" />

---

# Categorical Variables
## Example: Weekly Wages

```r
lm(wage ~ south, data = wage_data) %>% tidy()
```

```
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)     632.      6.00     105.  0
#> 2 south          -137.      9.45     -14.5 6.21e-46
```

--

**Q.sub[1]:** What is the reference category?

**Q.sub[2]:** Interpret the coefficients.

**Q.sub[3]:** Suppose you ran `lm(wage ~ nonsouth, data = wage_data)` instead. What is the coefficient estimate on `nonsouth`? What is the intercept estimate?

---

# Categorical Variables
## Example: Weekly Wages

```r
lm(wage ~ south + black, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    647.       6.02     107.  0
#> 2 south          -98.6      9.84     -10.0 2.89e-23
#> 3 black         -129.      11.4      -11.3 3.43e-29
```

--

**Q.sub[1]:** What is the reference category?

**Q.sub[2]:** Interpret the coefficients.

**Q.sub[3]:** Suppose you ran `lm(wage ~ south + nonblack, data = wage_data)` instead. What is the coefficient estimate on `nonblack`? What is the coefficient estimate on `south`? What is the intercept estimate?

---

# Categorical Variables
## Example: Weekly Wages

**Answer to Q.sub[3]:**

```r
lm(wage ~ south + nonblack, data = wage_data) %>% tidy()
```

```
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    518.      11.7       44.3 0
#> 2 south          -98.6      9.84     -10.0 2.89e-23
#> 3 nonblack       129.      11.4       11.3 3.43e-29
```
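
---

# Omitted Variable Bias
## Simulation sketch

The biased and unbiased OVB regressions from earlier slides can be reproduced in .mono[R]. A minimal sketch, simulating the true model with women receiving more schooling on average (the data here are freshly simulated, so the estimates will differ from the numbers plotted on the slides):

```r
set.seed(1)
n <- 1000

# true model: Pay = 20 + 0.5 School + 10 Male + u,
# where women receive about two more years of schooling on average
male   <- rep(c(0, 1), each = n / 2)
school <- 12 + 2 * (1 - male) + rnorm(n)
pay    <- 20 + 0.5 * school + 10 * male + rnorm(n)

# biased regression: omits the gender indicator
b_short <- coef(lm(pay ~ school))["school"]

# unbiased regression: controls for gender
b_long  <- coef(lm(pay ~ school + male))["school"]

c(biased = unname(b_short), unbiased = unname(b_long))
```

The short regression's slope lands well below the true 0.5, because `school` is negatively correlated with the omitted, positively paid `male` indicator.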