Today you will investigate racial disparities in the labor market using data from the Current Population Survey (CPS), a large survey administered by the US Census Bureau and the Bureau of Labor Statistics. The federal government uses the CPS to estimate the unemployment rate. Economists use the CPS to study a variety of topics in labor economics, including the effect of binding minimum wages, the gender pay gap, and returns to schooling. You will use a CPS sample of workers from Boston and Chicago to study employment patterns by race.
Today you will use the packages broom and stargazer in addition to the tidyverse. The broom package has a useful function that summarizes regression results, and stargazer produces formatted regression tables.
Use the p_load function (from the pacman package) to install and load the tidyverse, broom, and stargazer:
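The code chunk for this step does not appear in this copy of the lab. A minimal sketch of what it could look like, assuming pacman itself may not be installed yet. The CRAN mirror URL is an assumption; any mirror works.

```r
# Install pacman once if it is not already available.
# The repos URL is an assumed CRAN mirror, not something the lab specifies.
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}
library(pacman)

# p_load installs any of these packages that are missing, then loads them all
p_load(tidyverse, broom, stargazer)
```

If you run this in a fresh R session, the first call may take a few minutes while the packages download.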
The data file is 03-Non_Experimental_Data_data.csv. You can find it on Canvas under Files then Labs. Download the file and save it in your EC 320 folder.
Import the data using read_csv:
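The import chunk is also missing from this copy. A hedged sketch, assuming the file sits in your current working directory and that the data frame is called cps (the name used in the code later in this lab). The csv_path variable and the file-existence check are additions for illustration.

```r
library(readr)  # read_csv comes from readr, which is part of the tidyverse

# Assumes the working directory is the folder where you saved the file
csv_path <- "03-Non_Experimental_Data_data.csv"

if (file.exists(csv_path)) {
  cps <- read_csv(csv_path)
  print(cps)  # shows the first ten rows and the column types
} else {
  message("File not found in ", getwd(), ". Check where you saved it.")
}
```

The check is there because a wrong working directory is the most common reason an import fails.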
By looking at the dataset, you can see that most of the variables are binary: they take values of either 1 or 0.
## # A tibble: 8,891 x 5
##    employed black female educ              exper
##       <dbl> <dbl>  <dbl> <chr>             <dbl>
##  1        1     0      1 HS Graduate          20
##  2        0     0      0 HS Graduate          20
##  3        1     1      1 HS Dropout           12
##  4        1     1      0 HS Dropout           17
##  5        0     1      1 HS Graduate          21
##  6        1     0      1 College or Higher    13
##  7        1     0      0 Some College         15
##  8        1     0      1 College or Higher     5
##  9        1     0      1 College or Higher     3
## 10        0     0      1 College or Higher     4
## # ... with 8,881 more rows
For example, individuals with employed == 1 have a job while those with employed == 0 do not.
What percentage of individuals in the sample are employed?
The mean of a binary variable gives the fraction of observations with values equal to 1.
## [1] NA
Something went wrong. If you use the summary function, you’ll see that there are missing values (NAs) of employed.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.0000  1.0000  1.0000  0.7811  1.0000  1.0000      18
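The behavior above can be reproduced on a small made-up vector. This is only an illustration: employed_toy is a hypothetical stand-in, not the lab data.

```r
# Made-up employment indicator with one missing value (not the lab data)
employed_toy <- c(1, 0, 1, NA, 1)

mean(employed_toy)        # NA: a single missing value makes the mean missing
summary(employed_toy)     # the NA's entry counts the missing values
sum(is.na(employed_toy))  # 1

# On the lab data, the analogous calls would be
# mean(cps$employed) and summary(cps$employed)
```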
When there are missing values, some functions, like mean, will return a missing value as output. To circumvent this, you can specify na.rm = TRUE in the mean function:
## [1] 0.7811338
The employment rate is 78 percent.
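To see what na.rm = TRUE does, here is a sketch on a made-up vector (employed_toy is hypothetical, not the lab data). It also checks the earlier claim that the mean of a binary variable is the fraction of ones, here among the non-missing values.

```r
# Made-up employment indicator with one missing value (not the lab data)
employed_toy <- c(1, 0, 1, NA, 1)

# na.rm = TRUE drops missing values before averaging
share_employed <- mean(employed_toy, na.rm = TRUE)
share_employed  # 0.75: three of the four non-missing values equal 1

# The same number, computed by counting ones among the non-missing values
share_by_count <- sum(employed_toy == 1, na.rm = TRUE) / sum(!is.na(employed_toy))
share_by_count

# On the lab data: mean(cps$employed, na.rm = TRUE)
```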
What are the employment rates by race and gender?
## # A tibble: 4 x 3
## # Groups:   black [2]
##   black female employed
##   <dbl>  <dbl>    <dbl>
## 1     0      0    0.868
## 2     0      1    0.723
## 3     1      0    0.718
## 4     1      1    0.693
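A table of this shape can be produced with group_by and summarize from dplyr. The chunk that generated the table above is not shown in this copy, so the following is a sketch on made-up data; toy and rates are hypothetical names.

```r
library(dplyr)

# Made-up data with the same column names as the lab data
toy <- tibble(
  black    = c(0, 0, 0, 0, 1, 1, 1, 1),
  female   = c(0, 0, 1, 1, 0, 0, 1, 1),
  employed = c(1, 1, 1, 0, 1, 0, NA, 0)
)

# One row per race-by-gender cell, with the cell's employment rate
rates <- toy %>%
  group_by(black, female) %>%
  summarize(employed = mean(employed, na.rm = TRUE), .groups = "drop")
rates

# On the lab data, replace toy with cps
```

Setting .groups = "drop" returns an ungrouped table; leaving it out keeps the result grouped by black, as in the output above.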
You can see that the employment rate is
- 86.8 percent for white men (black == 0 and female == 0)
- 72.3 percent for white women (black == 0 and female == 1)
- 71.8 percent for black men (black == 1 and female == 0)
- 69.3 percent for black women (black == 1 and female == 1).
What is the average difference in employment status between black individuals and white individuals?
To find the difference,
- use the filter function to restrict the sample to one group (black or white)
- use mean to calculate the group mean
black_emp <- filter(cps, black == 1)$employed %>%
  mean(., na.rm = TRUE)
white_emp <- filter(cps, black == 0)$employed %>%
  mean(., na.rm = TRUE)
black_emp - white_emp
## [1] -0.09128578
The employment rate is 9 percentage points lower for black individuals than for white individuals.
Does this mean that there is racial disparity?
Not yet. We still don’t know if the difference is statistically significant. You can find out by conducting a \(t\)-test of the null hypothesis that the true difference-in-means is zero against the alternative hypothesis that the difference is nonzero.
To conduct the test, you need to calculate the \(t\)-statistic for the difference-in-means, which is given by
\[t = \dfrac{\overline{\text{Employed}}_\text{Black} - \overline{\text{Employed}}_\text{White}}{\sqrt{S^2_\text{Black}/n_\text{Black} + S^2_\text{White}/n_\text{White}}}.\]
You can calculate the quantities you need for the \(t\)-stat using mean, var, and nrow:
# black mean
mean_b <- filter(cps, black == 1)$employed %>%
  mean(., na.rm = TRUE)
# white mean
mean_w <- filter(cps, black == 0)$employed %>%
  mean(., na.rm = TRUE)
# black variance
var_b <- filter(cps, black == 1)$employed %>%
  var(., na.rm = TRUE)
# white variance
var_w <- filter(cps, black == 0)$employed %>%
  var(., na.rm = TRUE)
# number of black observations
# NOTE: !is.na(employed) removes the missing observations of employed
n_b <- filter(cps, black == 1 & !is.na(employed)) %>%
  nrow()
# number of white observations
n_w <- filter(cps, black == 0 & !is.na(employed)) %>%
  nrow()
# t-stat
t_stat <- (mean_b - mean_w) / sqrt(var_b/n_b + var_w/n_w)
t_stat
## [1] -6.844988
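As a cross-check on the by-hand calculation, base R's t.test function computes the same statistic when the two groups are allowed to have different variances (the Welch test, which is its default). The lab itself does not use t.test, so treat this as an optional sketch on made-up data; toy, emp_b, emp_w, t_hand, and welch are hypothetical names.

```r
# Made-up data (not the lab data); NA marks a missing employment status
toy <- data.frame(
  black    = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  employed = c(1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, NA)
)

# Non-missing employment indicators for each group
emp_b <- toy$employed[toy$black == 1 & !is.na(toy$employed)]
emp_w <- toy$employed[toy$black == 0 & !is.na(toy$employed)]

# t-stat by hand, using the formula from the text
t_hand <- (mean(emp_b) - mean(emp_w)) /
  sqrt(var(emp_b) / length(emp_b) + var(emp_w) / length(emp_w))

# Welch two-sample t-test (var.equal = FALSE is the default)
welch <- t.test(employed ~ black, data = toy)

t_hand
unname(welch$statistic)  # same magnitude, opposite sign
```

The sign flips because t.test subtracts the black == 1 mean from the black == 0 mean, while the formula in the text subtracts in the other order.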
To conclude your test, compare your \(t\)-stat to 2 (an approximation for the critical value of \(t\) in a 5 percent test).[1] If \(|t| > 2\), then you can reject the null hypothesis. Your \(t\)-stat of -6.84 is well beyond that threshold in absolute value (\(|-6.84| > 2\)), so you can reject the null hypothesis. This means that the difference in employment rates is statistically significant. There is a racial disparity in employment.
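The rule of thumb can be checked against a computed critical value. The sketch below assumes a normal approximation to the \(t\) distribution, which is reasonable with thousands of observations but is an assumption of this example, not something the lab states. The variable crit is a hypothetical name.

```r
t_stat <- -6.844988   # the t-stat reported in the text

# Two-sided 5 percent critical value under a normal approximation
crit <- qnorm(0.975)
crit                  # about 1.96, which the rule of thumb rounds to 2

# TRUE means the null hypothesis is rejected at the 5 percent level
abs(t_stat) > crit
```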
Does the disparity in employment rates provide causal evidence of racial discrimination in hiring? Does the comparison of employment rates by race hold all else constant? What else could explain the gap?
Linear regression models provide another way to investigate racial disparities. You can implement regressions in R with the lm function (lm stands for “linear model”). The first argument of the lm function is a formula whose two sides are separated by ~, which you can think of as an “equals” sign. The outcome variable goes on the left side of ~ and the treatment variable goes on the right. The second argument of lm is the data you’re using to estimate the model.
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   0.795    0.00475    167.   0.
## 2 black        -0.0913   0.0122      -7.46 9.48e-14
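Output with these columns (term, estimate, std.error, statistic, p.value) is what broom's tidy function returns for an lm fit. The chunk behind the table above is not shown in this copy, so here is a sketch on made-up data; toy and reg_toy are hypothetical names.

```r
library(broom)

# Made-up data (not the lab data); lm drops the row with a missing outcome
toy <- data.frame(
  black    = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  employed = c(1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, NA)
)

# Outcome on the left of ~, explanatory variable on the right
reg_toy <- lm(employed ~ black, data = toy)

# tidy returns one row per coefficient
tidy(reg_toy)

# On the lab data, the analogous call would be
# lm(employed ~ black, data = cps) %>% tidy()
```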
What does the coefficient on black tell you? How does it compare to the difference-in-means you calculated earlier?
Can the gap be explained by differences in educational attainment across racial groups?
To address this question, you can estimate a new regression that includes controls for educational attainment.
## # A tibble: 5 x 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        0.862    0.00739    117.   0.
## 2 black             -0.0672   0.0122      -5.51 3.74e- 8
## 3 educHS Dropout    -0.226    0.0150     -15.1  1.28e-50
## 4 educHS Graduate   -0.0948   0.0112      -8.47 2.82e-17
## 5 educSome College  -0.0735   0.0111      -6.64 3.35e-11
Adding the educ variable results in three new coefficients. Why? The lm function recognizes that educ is a categorical variable and responds by turning it into several dummy variables, one for each level of education. There are four levels of education in the data. For technical reasons we’ll discuss later in the quarter, one of these levels (here, College or Higher) gets absorbed by the intercept.
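The dummy-variable expansion described above can be seen by inspecting the coefficient names of a fit. This is a sketch on made-up data with the same four education labels; toy and reg_educ are hypothetical names.

```r
# Made-up data (not the lab data) with the lab's four education labels
toy <- data.frame(
  employed = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0),
  black    = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1),
  educ     = rep(c("College or Higher", "Some College",
                   "HS Graduate", "HS Dropout"), times = 3)
)

# Levels are sorted alphabetically; by default the first is the reference level
levels(factor(toy$educ))

reg_educ <- lm(employed ~ black + educ, data = toy)

# Three educ coefficients for four levels: none for the reference level
names(coef(reg_educ))

# On the lab data: lm(employed ~ black + educ, data = cps)
```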
How does the coefficient on black change when you add controls for education? To facilitate an easy-on-the-eyes comparison of your estimates, you can use stargazer to report both regressions side-by-side:
##
## =====================================================================
##                                    Dependent variable:
##                     -------------------------------------------------
##                                         employed
##                               (1)                      (2)
## ---------------------------------------------------------------------
## black                      -0.091***                -0.067***
##                             (0.012)                  (0.012)
##
## educHS Dropout                                      -0.226***
##                                                      (0.015)
##
## educHS Graduate                                     -0.095***
##                                                      (0.011)
##
## educSome College                                    -0.073***
##                                                      (0.011)
##
## Constant                    0.795***                 0.862***
##                             (0.005)                  (0.007)
##
## ---------------------------------------------------------------------
## Observations                 8,873                    8,873
## R2                           0.006                    0.032
## Adjusted R2                  0.006                    0.032
## Residual Std. Error    0.412 (df = 8871)        0.407 (df = 8868)
## F Statistic         55.649*** (df = 1; 8871) 74.437*** (df = 4; 8868)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01
Note: If you omit type = "text", then stargazer will output LaTeX code, which will look nasty.[2]
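A side-by-side table like the one above comes from passing both fitted models to stargazer. The original chunk is not shown in this copy, so this is a sketch on made-up data; toy, reg1, and reg2 are hypothetical names.

```r
library(stargazer)

# Made-up data (not the lab data) with the lab's four education labels
toy <- data.frame(
  employed = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0),
  black    = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1),
  educ     = rep(c("College or Higher", "Some College",
                   "HS Graduate", "HS Dropout"), times = 3)
)

reg1 <- lm(employed ~ black, data = toy)
reg2 <- lm(employed ~ black + educ, data = toy)

# type = "text" prints a plain-text table; the default output is LaTeX code
stargazer(reg1, reg2, type = "text")
```

Each model gets its own column, and a coefficient's row is left blank in any column whose model does not include that variable.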
Does the new coefficient on black provide causal evidence of racial discrimination? What are some potential remaining sources of omitted-variable bias?
In Problem Set 2, you will analyze an experimental dataset from an influential study that provides causal evidence of racial discrimination in the labor market. The authors of the study control for many potential sources of omitted-variable bias by finding a way to randomize race.
[1] We’ll unpack the details about hypothesis testing later in the course. For now, we’ll rely on rules of thumb.
[2] LaTeX (pronounced “lay-tech”) is an open-source typesetting language that many economists (including your instructor and GEs) use instead of MS Word. It is especially useful for typing math.