Non-Experimental Data

Current Population Survey

Today you will investigate racial disparities in the labor market using data from the Current Population Survey (CPS), a large survey administered by the US Census Bureau and the Bureau of Labor Statistics. The federal government uses the CPS to estimate the unemployment rate. Economists use the CPS to study a variety of topics in labor economics, including the effect of binding minimum wages, the gender pay gap, and returns to schooling. You will use a CPS sample of workers from Boston and Chicago to study employment patterns by race.

Preliminaries

Load packages

Today you will use the packages broom and stargazer in addition to the tidyverse. broom has a useful function that summarizes regression results. stargazer produces formatted regression tables.

Use the p_load function to install and load the tidyverse, broom, and stargazer:

library(pacman)
p_load(tidyverse, broom, stargazer)

Import data

The data file is 03-Non_Experimental_Data_data.csv. You can find it on Canvas under Files then Labs. Download the file and save it in your EC 320 folder.

Import the data using read_csv:

cps <- read_csv("03-Non_Experimental_Data_data.csv")

By looking at the dataset, you can see that three of the five variables (employed, black, and female) are binary: they take values of either 1 or 0.

cps

## # A tibble: 8,891 x 5
##    employed black female educ              exper
##       <dbl> <dbl>  <dbl> <chr>             <dbl>
##  1        1     0      1 HS Graduate          20
##  2        0     0      0 HS Graduate          20
##  3        1     1      1 HS Dropout           12
##  4        1     1      0 HS Dropout           17
##  5        0     1      1 HS Graduate          21
##  6        1     0      1 College or Higher    13
##  7        1     0      0 Some College         15
##  8        1     0      1 College or Higher     5
##  9        1     0      1 College or Higher     3
## 10        0     0      1 College or Higher     4
## # ... with 8,881 more rows

For example, individuals with employed == 1 have a job while those with employed == 0 do not.

Employment Rates

What percentage of individuals in the sample are employed?

The mean of a binary variable gives the fraction of observations with values equal to 1.

mean(cps$employed)

## [1] NA

Something went wrong. If you use the summary function, you’ll see that employed has 18 missing values (NAs).

summary(cps$employed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  1.0000  1.0000  0.7811  1.0000  1.0000      18

When there are missing values, some functions, like mean, will return a missing value as output. To circumvent this, you can specify na.rm = TRUE in the mean function:

mean(cps$employed, na.rm = TRUE)

## [1] 0.7811338

The employment rate is 78 percent.
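To see why this works, it may help to try the same calls on a tiny vector. The sketch below uses made-up values (the name x is hypothetical and is not part of the CPS file) to show that the mean of a binary variable is the share of ones among the non-missing values.

```r
# Toy binary vector with one missing value (made-up data, not from the CPS)
x <- c(1, 0, 1, 1, NA)

# A single NA makes the default mean missing
mean_with_na <- mean(x)
mean_with_na            # NA

# Dropping the NA leaves four values, three of which equal 1
mean_no_na <- mean(x, na.rm = TRUE)
mean_no_na              # 0.75

# Count the missing values directly
n_missing <- sum(is.na(x))
n_missing               # 1
```

Note that na.rm = TRUE changes the denominator: the mean is taken over the 4 non-missing values, not all 5.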

What are the employment rates by race and gender?

cps %>% 
  group_by(black, female) %>% 
  summarize(employed = mean(employed, na.rm = TRUE))

## # A tibble: 4 x 3
## # Groups:   black [2]
##   black female employed
##   <dbl>  <dbl>    <dbl>
## 1     0      0    0.868
## 2     0      1    0.723
## 3     1      0    0.718
## 4     1      1    0.693

You can see that the employment rate is

87 percent for white males (black == 0 and female == 0)
72 percent for white females (black == 0 and female == 1)
72 percent for black males (black == 1 and female == 0)
69 percent for black females (black == 1 and female == 1).

Racial disparities

What is the average difference in employment status between black individuals and white individuals?

To find the difference,

Use the filter function to restrict the sample to one group (black or white)
Use mean to calculate the group mean
Repeat for the other group
Take the difference in means.

black_emp <- filter(cps, black == 1)$employed %>% 
  mean(., na.rm = TRUE)

white_emp <- filter(cps, black == 0)$employed %>% 
  mean(., na.rm = TRUE)

black_emp - white_emp

## [1] -0.09128578

The employment rate is 9 percentage points lower for black individuals than for white individuals.
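Percentage points are not the same as percent. As a rough illustration, the sketch below takes the rounded gap from above together with a white employment rate of about 0.795 (the value white_emp works out to in this sample, and the same number that reappears as the regression intercept later in the lab). The object names are hypothetical.

```r
# Rounded values taken from the lab output
gap   <- -0.0913   # black employment rate minus white employment rate
white <- 0.795     # white employment rate (rounded)

# Absolute gap, in percentage points
gap_pp <- round(100 * gap, 1)
gap_pp    # -9.1

# Relative gap, as a percent of the white employment rate
gap_pct <- round(100 * gap / white, 1)
gap_pct   # -11.5
```

The gap is about 9 percentage points in absolute terms, or roughly 11.5 percent of the white employment rate.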

Does this mean that there is a racial disparity?

Not yet. We still don’t know if the difference is statistically significant. You can find out by conducting a \(t\)-test of the null hypothesis that the true difference-in-means is zero against the alternative hypothesis that the difference is nonzero.

To conduct the test, you need to calculate the \(t\)-statistic for the difference-in-means, which is given by

\[t = \dfrac{\overline{\text{Employed}}_\text{Black} - \overline{\text{Employed}}_\text{White}}{\sqrt{S^2_\text{Black}/n_\text{Black} + S^2_\text{White}/n_\text{White}}}.\]

You can calculate the quantities you need for the \(t\)-stat using mean, var, and nrow:

# black mean
mean_b <- filter(cps, black == 1)$employed %>% 
  mean(., na.rm = TRUE)

# white mean
mean_w <- filter(cps, black == 0)$employed %>% 
  mean(., na.rm = TRUE)

# black variance
var_b <- filter(cps, black == 1)$employed %>% 
  var(., na.rm = TRUE)

# white variance
var_w <- filter(cps, black == 0)$employed %>% 
  var(., na.rm = TRUE)

# number of black observations 
# NOTE: !is.na(employed) removes the missing observations of employed
n_b <- filter(cps, black == 1 & !is.na(employed)) %>% 
  nrow()

# number of white observations
n_w <- filter(cps, black == 0 & !is.na(employed)) %>% 
  nrow()

# t-stat
t_stat <- (mean_b - mean_w) / sqrt(var_b/n_b + var_w/n_w)
t_stat

## [1] -6.844988

To conclude your test, compare the absolute value of your \(t\)-stat to 2 (an approximation of the critical value of \(t\) in a 5 percent test).¹ If \(|t| > 2\), then you can reject the null hypothesis. Your \(t\)-stat of -6.84 is well beyond that threshold in absolute value, so you can reject the null hypothesis. This means that the difference in employment rates is statistically significant: there is a racial disparity in employment.
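One way to check a hand-computed \(t\)-stat is to compare it with R's built-in t.test function, which by default uses the same unequal-variance formula as above. The sketch below does this on simulated data: the data frame toy and the employment probabilities are assumptions made for illustration, not the CPS sample.

```r
set.seed(320)

# Simulated stand-in for the CPS sample (hypothetical data)
toy <- data.frame(black = rep(c(0, 1), times = c(400, 100)))
toy$employed <- rbinom(nrow(toy), size = 1,
                       prob = ifelse(toy$black == 1, 0.70, 0.80))

emp_b <- toy$employed[toy$black == 1]
emp_w <- toy$employed[toy$black == 0]

# Manual t-stat: difference in means over its standard error
t_manual <- (mean(emp_b) - mean(emp_w)) /
  sqrt(var(emp_b) / length(emp_b) + var(emp_w) / length(emp_w))

# Built-in test; var.equal = FALSE (unequal variances) is the default
t_builtin <- unname(t.test(emp_b, emp_w)$statistic)

c(manual = t_manual, builtin = t_builtin)

# The rule-of-thumb cutoff of 2 approximates the large-sample critical value
crit <- qnorm(0.975)
crit   # about 1.96
```

With the real data, the analogous call would pass the two filtered employed vectors to t.test; missing values are dropped by t.test automatically.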

Discrimination?

Does the disparity in employment rates provide causal evidence of racial discrimination in hiring? Does the comparison of employment rates by race hold all else constant? What else could explain the gap?

Regression-Adjusted Differences

Linear regression models provide another way to investigate racial disparities. You can implement regressions in R with the lm function (lm stands for “linear model”). The first argument of the lm function is a formula separated by ~, which you can think of as an “equals” sign. The outcome variable goes on the left side of ~ and the treatment variable goes on the right. The second argument of lm is the data you’re using to estimate the model.

reg1 <- lm(employed ~ black, data = cps)

# view results
tidy(reg1)

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   0.795    0.00475    167.   0.      
## 2 black        -0.0913   0.0122      -7.46 9.48e-14

What does the coefficient on black tell you? How does it compare to the difference-in-means you calculated earlier?
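As a hint, you can explore the link between a regression on a binary variable and group means using simulated data. In the sketch below, the data frame toy and its probabilities are made up for illustration. It also suggests one reason the regression \(t\)-stat need not match the one you computed by hand: lm's default standard errors assume equal variances across groups, while the formula above does not.

```r
set.seed(320)

# Simulated stand-in for the CPS sample (hypothetical data)
toy <- data.frame(black = rep(c(0, 1), times = c(400, 100)))
toy$employed <- rbinom(nrow(toy), size = 1,
                       prob = ifelse(toy$black == 1, 0.70, 0.80))

emp_b <- toy$employed[toy$black == 1]
emp_w <- toy$employed[toy$black == 0]

toy_reg <- lm(employed ~ black, data = toy)
b <- coef(toy_reg)

# Intercept = mean of the black == 0 group
intercept_check <- all.equal(unname(b["(Intercept)"]), mean(emp_w))

# Slope = difference in group means
slope_check <- all.equal(unname(b["black"]), mean(emp_b) - mean(emp_w))

# lm's t-stat equals the equal-variance (pooled) two-sample t-stat
t_lm <- summary(toy_reg)$coefficients["black", "t value"]
t_pooled <- unname(t.test(emp_b, emp_w, var.equal = TRUE)$statistic)

c(t_lm = t_lm, t_pooled = t_pooled)
```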

Can the gap be explained by differences in educational attainment across racial groups?

To address this question, you can estimate a new regression that includes controls for educational attainment.

reg2 <- lm(employed ~ black + educ, data = cps)

# view results
tidy(reg2)

## # A tibble: 5 x 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        0.862    0.00739    117.   0.      
## 2 black             -0.0672   0.0122      -5.51 3.74e- 8
## 3 educHS Dropout    -0.226    0.0150     -15.1  1.28e-50
## 4 educHS Graduate   -0.0948   0.0112      -8.47 2.82e-17
## 5 educSome College  -0.0735   0.0111      -6.64 3.35e-11

Adding the educ variable results in three new coefficients. Why? The lm function recognizes that educ is a categorical variable and responds by turning it into several dummy variables, one for each level of education. There are four levels of education in the data. For technical reasons we’ll discuss later in the quarter, one of these levels gets absorbed by the intercept. Here the absorbed level is College or Higher, so each educ coefficient measures a difference relative to that group.
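You can peek at the dummy variables that R builds by calling model.matrix, the base R function that expands a formula into the columns used in estimation. The sketch below uses a hypothetical four-row data frame with one row per education level, and it also shows one possible way to pick a different baseline level with relevel.

```r
# One row per education level (hypothetical data frame, not the CPS file)
toy <- data.frame(
  educ = c("HS Dropout", "HS Graduate", "Some College", "College or Higher")
)

# Expand educ into the columns a regression would use
X <- model.matrix(~ educ, data = toy)
dummy_names <- colnames(X)
dummy_names
# "(Intercept)" "educHS Dropout" "educHS Graduate" "educSome College"

# The alphabetically first level (College or Higher) has no column of its own.
# To use a different baseline, convert to a factor and relevel it.
toy$educ2 <- relevel(factor(toy$educ), ref = "HS Dropout")
dummy_names2 <- colnames(model.matrix(~ educ2, data = toy))
dummy_names2
```

Changing the baseline changes what each educ coefficient is compared against, but not the fitted values of the regression.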

How does the coefficient on black change when you add controls for education? To facilitate an easy-on-the-eyes comparison of your estimates, you can use stargazer to report both regressions side-by-side:

stargazer(reg1, reg2, type = "text")

## 
## =====================================================================
##                                    Dependent variable:               
##                     -------------------------------------------------
##                                         employed                     
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## black                      -0.091***                -0.067***        
##                             (0.012)                  (0.012)         
##                                                                      
## educHS Dropout                                      -0.226***        
##                                                      (0.015)         
##                                                                      
## educHS Graduate                                     -0.095***        
##                                                      (0.011)         
##                                                                      
## educSome College                                    -0.073***        
##                                                      (0.011)         
##                                                                      
## Constant                    0.795***                 0.862***        
##                             (0.005)                  (0.007)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                 8,873                    8,873          
## R2                           0.006                    0.032          
## Adjusted R2                  0.006                    0.032          
## Residual Std. Error    0.412 (df = 8871)        0.407 (df = 8868)    
## F Statistic         55.649*** (df = 1; 8871) 74.437*** (df = 4; 8868)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Note: If you omit type = "text", then stargazer will output LaTeX code, which will look nasty.²

Does the new coefficient on black provide causal evidence of racial discrimination? What are some potential remaining sources of omitted-variable bias?
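To build intuition for omitted-variable bias, you can simulate data in which you know the true coefficients. Everything in the sketch below is an assumption made for illustration (the variable names, the sample size, and the coefficients of -0.05 and 0.10); it is not an estimate from the CPS. The point is that leaving out a control that is correlated with both the group variable and the outcome shifts the group coefficient by a predictable amount.

```r
set.seed(320)
n <- 5000

# Hypothetical group indicator and a control that is less common in group 1
group   <- rbinom(n, size = 1, prob = 0.2)
college <- rbinom(n, size = 1, prob = ifelse(group == 1, 0.3, 0.6))

# Outcome with a known group effect (-0.05) and control effect (0.10)
outcome <- 0.8 - 0.05 * group + 0.10 * college + rnorm(n, sd = 0.4)

short <- lm(outcome ~ group)            # omits the control
long  <- lm(outcome ~ group + college)  # includes the control
aux   <- lm(college ~ group)            # how the control differs by group

b_short <- unname(coef(short)["group"])
b_long  <- unname(coef(long)["group"])

# Bias = (effect of the control) x (group difference in the control)
bias <- unname(coef(long)["college"] * coef(aux)["group"])

c(short = b_short, long = b_long, bias = bias)

# In-sample identity: short coefficient = long coefficient + bias
ovb_check <- all.equal(b_short, b_long + bias)
```

The same logic applies to any control you have not included: if it is correlated with black and affects employment, the coefficient on black absorbs part of its effect.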

In Problem Set 2, you will analyze an experimental dataset from an influential study that provides causal evidence of racial discrimination in the labor market. The authors of the study control for many potential sources of omitted-variable bias by finding a way to randomize race.

¹ We’ll unpack the details about hypothesis testing later in the course. For now, we’ll rely on rules of thumb.
² LaTeX (pronounced “lay-tech”) is an open-source typesetting language that many economists (including your instructor and GEs) use instead of MS Word. It is especially useful for typing math.

Kyle Raze

Kyle Raze, Youssef Ait Benasser, & Saurabh Gupta
EC 320: Introduction to Econometrics
University of Oregon

Fall 2019
