Go to Canvas and download the file `EAWE01.csv`. It will be saved to your Downloads folder automatically. If you are a Mac user, right-click the file in the Downloads folder, then hold the Option (alt) key. You'll see Copy "EAWE01.csv" as Pathname. Click that and paste the path inside `read_csv()`. If you are a Windows user, right-click the file, choose Copy as path, and paste the path inside `read_csv()`.
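A minimal sketch of the loading step, assuming the `readr` package is installed. The CSV written here is a tiny stand-in with made-up values, saved to a temporary folder so the call can be tried before pasting in your own path; `csv_path` is a name chosen for this example.

```r
# Stand-in CSV with made-up values, written to a temporary folder so the
# example runs anywhere; in the lab, use the path you copied instead.
csv_path <- file.path(tempdir(), "EAWE01.csv")
write.csv(
  data.frame(EARNINGS = c(12.5, 20.0, 15.75), S = c(12, 16, 14),
             EXP = c(6.5, 4.0, 7.25), FEMALE = c(1, 0, 1)),
  csv_path, row.names = FALSE
)

library(readr)

# Paste the copied path between the quotes. On Windows, "Copy as path" gives
# backslashes; replace them with forward slashes or double them.
wages <- read_csv(csv_path)

dim(wages)  # number of rows, then number of columns
```

With the real file, `dim(wages)` is what produces the `500 96` line shown next.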
## [1] 500 96
As can be seen, the dataset has 500 observations and 96 columns (variables). In today's lab, we'll use a subset of these variables: `EARNINGS`, `S`, `EXP`, and `FEMALE`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.13 11.47 15.41 18.35 22.60 100.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 12.00 15.00 14.55 16.00 20.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4808 4.6106 6.5865 6.6613 8.6923 14.2692
\[ \text{EARNINGS}_i=\beta_0 + \beta_1 \text{S}_i+ u_i \]
| | (1) |
|---|---|
| (Intercept) | 3.090 |
| | (2.464) |
| S | 1.049 *** |
| | (0.166) |
| N | 500 |
| R2 | 0.074 |
| logLik | -1875.992 |
| AIC | 3757.983 |

Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05.

`reg1` assumes a linear relationship between years of schooling and earnings, which is probably unrealistic: every extra year of schooling is forced to add the same dollar amount. The effect of going from 12 to 13 years of schooling wouldn't necessarily equal the effect of going from 15 to 16. It is more plausible that an extra year changes earnings by a constant percentage (a constant semi-elasticity) than by a constant marginal effect in dollars.
In this case, we typically take the log of the outcome variable, so that \(100 \times \beta_1\) can be interpreted as the approximate expected \(\%\) change in earnings for a one-year increase in schooling. \[ \text{log(EARNINGS)}_i=\beta_0 + \beta_1 \text{S}_i+ u_i \]
| | (1) |
|---|---|
| (Intercept) | 1.810 *** |
| | (0.125) |
| S | 0.065 *** |
| | (0.008) |
| N | 500 |
| R2 | 0.107 |
| logLik | -386.696 |
| AIC | 779.391 |

Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05.

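To make the contrast between the two specifications concrete, here is a small sketch that plugs the reported coefficients into each model. The object names are chosen for this example, and fitted values from the log model are simply exponentiated, which ignores any retransformation adjustment.

```r
# Coefficients copied from the two regression tables above.
b_level <- c(intercept = 3.090, S = 1.049)   # EARNINGS ~ S
b_log   <- c(intercept = 1.810, S = 0.065)   # log(EARNINGS) ~ S

schooling <- c(12, 13, 15, 16)

# Fitted earnings from each model at the chosen schooling levels.
fit_level <- b_level[["intercept"]] + b_level[["S"]] * schooling
fit_log   <- exp(b_log[["intercept"]] + b_log[["S"]] * schooling)

# Level model: going 12 -> 13 and 15 -> 16 adds the same amount.
step_level <- c(fit_level[2] - fit_level[1], fit_level[4] - fit_level[3])

# Log model: the same two steps multiply earnings by the same factor,
# so the dollar gain is larger at higher schooling.
ratio_log <- c(fit_log[2] / fit_log[1], fit_log[4] / fit_log[3])
step_log  <- c(fit_log[2] - fit_log[1], fit_log[4] - fit_log[3])

# Exact percentage change implied by the log coefficient.
pct_exact <- 100 * (exp(b_log[["S"]]) - 1)

print(step_level)
print(ratio_log)
print(step_log)
print(pct_exact)
```

The exact figure from `pct_exact` is slightly above the 6.5% read directly off the coefficient; the two stay close as long as the coefficient is small.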
A binary variable is a variable that takes either 0 or 1 as its value. The `wages` data already contains some binary variables. Take a look at the `FEMALE` and `MALE` variables, and notice what happens if we add these two columns together.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.5 0.5 1.0 1.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.5 0.5 1.0 1.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
That is, the two columns always sum to one, so `FEMALE` can be written as a linear combination of a column vector of ones and `MALE` (`FEMALE = 1 - MALE`), and `MALE` can be written as a linear combination of a column vector of ones and `FEMALE` (`MALE = 1 - FEMALE`).
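The linear dependence can be checked directly. The sketch below uses a small hand-made pair of dummies (not the lab data) and the rank of the design matrix from base R's `qr()`.

```r
# Hand-made stand-in for the two dummies (not the lab data).
FEMALE <- rep(c(0, 1), times = 5)
MALE   <- 1 - FEMALE

# Every row sums to one, i.e. FEMALE + MALE equals the intercept column.
summary(FEMALE + MALE)

# Design matrix with an intercept and both dummies: three columns,
# but only two of them are linearly independent.
X <- cbind(Intercept = 1, FEMALE = FEMALE, MALE = MALE)
qr(X)$rank
```

A rank below the number of columns is exactly the situation in which `lm()` cannot estimate a separate coefficient for every column.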
##
## Call:
## lm(formula = EARNINGS ~ FEMALE, data = wages, subset = FEMALE ==
## 1)
##
## Coefficients:
## (Intercept) FEMALE
## 17.52 NA
##
## Call:
## lm(formula = EARNINGS ~ FEMALE, data = wages)
##
## Coefficients:
## (Intercept) FEMALE
## 19.173 -1.655
##
## Call:
## lm(formula = EARNINGS ~ FEMALE + MALE, data = wages)
##
## Coefficients:
## (Intercept) FEMALE MALE
## 19.173 -1.655 NA
##
## Call:
## lm(formula = EARNINGS ~ FEMALE + MALE - 1, data = wages)
##
## Coefficients:
## FEMALE MALE
## 17.52 19.17
For the first regression, we subset the data to include only female observations and then regress `EARNINGS` on `FEMALE`. Since every observation in the subset has `FEMALE` equal to 1, the `FEMALE` column is identical to the intercept column, so the model cannot estimate a separate coefficient for `FEMALE` (it is reported as `NA`) and just estimates the intercept. For the second regression, we do not subset the data and regress `EARNINGS` on `FEMALE`. Now we get an estimate for the coefficient of `FEMALE`, because the male observations (`FEMALE=0`) serve as the reference group.
Interpreting the second regression, the average yearly earnings of males in the sample are estimated to be 19,173 dollars. The average yearly earnings of females are about 1,655 dollars lower than those of males. In other words, the average yearly earnings of females are estimated to be 19,173 - 1,655 = 17,518 dollars, which matches the intercept of the first regression (17.52).
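The link between the coefficients and the group averages can be verified on simulated data. This is only a sketch: `sim` and the numbers used to generate it are made up and are not the lab data.

```r
set.seed(42)

# Simulated stand-in data (hypothetical values, not EAWE01.csv).
n <- 200
sim <- data.frame(FEMALE = rep(c(0, 1), each = n / 2))
sim$EARNINGS <- 19 - 1.5 * sim$FEMALE + rnorm(n, sd = 5)

reg <- lm(EARNINGS ~ FEMALE, data = sim)

# Sample mean of EARNINGS within each group (names are "0" and "1").
group_means <- tapply(sim$EARNINGS, sim$FEMALE, mean)

# Intercept = mean of the FEMALE == 0 group;
# intercept + slope = mean of the FEMALE == 1 group.
coef(reg)
group_means
unname(coef(reg)["(Intercept)"] + coef(reg)["FEMALE"])
```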
The third regression collapses to the second one, since `MALE` can be represented exactly as a linear combination of the intercept and `FEMALE`; R drops `MALE` and reports `NA` for its coefficient. The fourth regression runs because we drop the intercept. Without the intercept, the coefficient estimates for `FEMALE` and `MALE` are simply the average yearly earnings of each group.
Notice that with the intercept, where `FEMALE=0` serves as the reference group, the regression model is \[
\text{EARNINGS}_i=\beta_0 + \beta_1 \text{FEMALE}_i+ u_i,
\] and without the intercept it is \[
\text{EARNINGS}_i=\beta_1 \text{FEMALE}_i+\beta_2\text{MALE}_i + u_i.
\]
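A sketch of the two specifications on simulated data (made-up values, not the lab data), showing what `lm()` does when both dummies are included with and without the intercept:

```r
set.seed(42)

# Simulated stand-in data (hypothetical values, not EAWE01.csv).
n <- 200
sim <- data.frame(FEMALE = rep(c(0, 1), each = n / 2))
sim$MALE <- 1 - sim$FEMALE
sim$EARNINGS <- 19 - 1.5 * sim$FEMALE + rnorm(n, sd = 5)

# Intercept plus both dummies: perfectly collinear, so lm() reports NA
# for the redundant column that comes last (MALE).
with_intercept <- lm(EARNINGS ~ FEMALE + MALE, data = sim)
coef(with_intercept)

# Dropping the intercept with "- 1" removes the redundancy.
no_intercept <- lm(EARNINGS ~ FEMALE + MALE - 1, data = sim)
coef(no_intercept)

# Without the intercept, each coefficient is that group's sample mean.
group_means <- tapply(sim$EARNINGS, sim$FEMALE, mean)
group_means
```

Both fits describe the same two group averages; only the parameterization differs.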
Consider the following regression model: \[ \text{log(EARNINGS)}_i=\beta_0 + \beta_1 \text{S}_i+\beta_2\text{EXP}_i+ u_i. \]
If we suspect a difference in earnings between females and males with the same level of experience and schooling, then the regression model would look as follows: \[ \text{log(EARNINGS)}_i=\beta_0 + \delta_0\text{FEMALE}_i+ \beta_1 \text{S}_i+\beta_2\text{EXP}_i+ u_i. \]
For a male, `FEMALE` equals zero, so the regression model becomes \[
\text{log(EARNINGS)}_i=\beta_0 + \beta_1 \text{S}_i+\beta_2\text{EXP}_i+ u_i.
\] For a female, `FEMALE` equals 1, so the regression model becomes \[
\text{log(EARNINGS)}_i=(\beta_0 + \delta_0)+\beta_1 \text{S}_i+\beta_2\text{EXP}_i+ u_i.
\] The two groups share the same slopes and differ only in the intercept, by \(\delta_0\).
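The intercept shift can be illustrated with simulated data. In this sketch the data frame `sim`, the parameter values used to generate it, and the two hypothetical people in `pair` are all made up for illustration.

```r
set.seed(7)

# Simulated stand-in data (not EAWE01.csv).
n <- 500
sim <- data.frame(
  S      = sample(8:20, n, replace = TRUE),
  EXP    = runif(n, min = 0, max = 15),
  FEMALE = rbinom(n, size = 1, prob = 0.5)
)
# Made-up parameter values used only to generate the data.
sim$EARNINGS <- exp(1 + 0.10 * sim$S + 0.05 * sim$EXP -
                    0.15 * sim$FEMALE + rnorm(n, sd = 0.4))

reg <- lm(log(EARNINGS) ~ S + EXP + FEMALE, data = sim)

# Two hypothetical people with identical schooling and experience.
pair <- data.frame(S = c(14, 14), EXP = c(6, 6), FEMALE = c(0, 1))
pred <- predict(reg, newdata = pair)

# The gap in predicted log earnings equals the FEMALE coefficient,
# whatever values of S and EXP the pair shares.
unname(pred[2] - pred[1])
unname(coef(reg)["FEMALE"])
```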
| | (1) |
|---|---|
| (Intercept) | 1.065 *** |
| | (0.195) |
| S | 0.100 *** |
| | (0.010) |
| EXP | 0.048 *** |
| | (0.010) |
| FEMALE | -0.171 *** |
| | (0.046) |
| N | 500 |
| R2 | 0.169 |
| logLik | -368.846 |
| AIC | 747.692 |

Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05.

All else equal, the above regression results show that switching from `FEMALE=0` to `FEMALE=1` is associated with about a 17% decrease in expected earnings.
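The 17% figure reads the coefficient directly as a proportion, which is an approximation. A short derivation of the exact figure, holding \(\text{S}\) and \(\text{EXP}\) fixed and writing \(E^{F}\) and \(E^{M}\) (notation introduced here) for the predicted earnings of a female and a male:
\[
\log E^{F} - \log E^{M} = \hat\delta_0
\quad\Longrightarrow\quad
\frac{E^{F} - E^{M}}{E^{M}} = e^{\hat\delta_0} - 1 = e^{-0.171} - 1 \approx -0.157 .
\]
So the estimate implies a gap of roughly 15.7%, slightly smaller in magnitude than the approximation.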
Here we allow the effect of schooling to differ by gender. In other words, the regression model is now
\[
\text{log(EARNINGS)}_i=\beta_0 + \delta_0\text{FEMALE}_i+ \beta_1 \text{S}_i+\delta_1\text{FEMALE}_i \times \text{S}_i + \beta_2\text{EXP}_i+ u_i.
\]
What is the regression model for males, i.e. the `FEMALE=0` group? \[
\text{log(EARNINGS)}_i=\beta_0 + \beta_1 \text{S}_i+ \beta_2\text{EXP}_i+ u_i.
\] The `FEMALE=0` group is the reference group, meaning that the estimates of \(\beta_0\) and \(\beta_1\) are the intercept and the schooling slope for the `FEMALE=0` group.
What is the regression model for females, i.e. the `FEMALE=1` group? \[
\text{log(EARNINGS)}_i=(\beta_0 + \delta_0) + (\beta_1 +\delta_1) \text{S}_i + \beta_2\text{EXP}_i+ u_i.
\] This means that the estimates of \(\delta_0\) and \(\delta_1\) are interpreted as differences relative to the reference group: \(\delta_0\) is the difference in the intercept and \(\delta_1\) is the difference in the schooling slope.
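A sketch of how the group-specific slopes are recovered from an interaction model, using simulated data. The data frame `sim` and all parameter values are made up for illustration; the formula spells out the interaction with `S:FEMALE`.

```r
set.seed(11)

# Simulated stand-in data (not EAWE01.csv).
n <- 2000
sim <- data.frame(
  S      = sample(8:20, n, replace = TRUE),
  EXP    = runif(n, min = 0, max = 15),
  FEMALE = rbinom(n, size = 1, prob = 0.5)
)
# Made-up parameters: schooling slope 0.08 for FEMALE == 0 and
# 0.08 + 0.05 = 0.13 for FEMALE == 1.
sim$EARNINGS <- exp(1.4 + 0.08 * sim$S + 0.05 * sim$EXP -
                    0.9 * sim$FEMALE + 0.05 * sim$S * sim$FEMALE +
                    rnorm(n, sd = 0.3))

reg <- lm(log(EARNINGS) ~ S + EXP + FEMALE + S:FEMALE, data = sim)
b <- coef(reg)

slope_male   <- unname(b["S"])                  # estimate of beta_1
slope_female <- unname(b["S"] + b["S:FEMALE"])  # estimate of beta_1 + delta_1

c(male = slope_male, female = slope_female)
```

The coefficient on `S:FEMALE` is never a slope on its own; it is the amount added to the reference group's slope.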
| | (1) |
|---|---|
| (Intercept) | 1.420 *** |
| | (0.220) |
| S | 0.075 *** |
| | (0.013) |
| EXP | 0.048 *** |
| | (0.010) |
| FEMALE | -0.979 *** |
| | (0.247) |
| S:FEMALE | 0.055 *** |
| | (0.017) |
| N | 500 |
| R2 | 0.187 |
| logLik | -363.287 |
| AIC | 738.575 |

Standard errors in parentheses. *** p < 0.001; ** p < 0.01; * p < 0.05.

In other words, the above regression results suggest that the effect of schooling differs by gender: one additional year of schooling is associated with a larger increase in earnings for females than for males. The estimated return to a year of schooling is about 7.5% for males and about 7.5 + 5.5 = 13% for females, a difference of roughly 5.5 percentage points. Put differently, schooling seems to matter even more for females, since it narrows the earnings penalty associated with being female.
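Because of the interaction, the female-male gap is no longer a single number: the coefficient on `FEMALE` alone is the gap at zero years of schooling. A small sketch using the reported coefficients (the names `d0`, `d1`, and `gap` are chosen for this example):

```r
# Coefficients copied from the interaction table above.
d0 <- -0.979   # FEMALE
d1 <-  0.055   # S:FEMALE

# Female-male gap in log earnings at S years of schooling, same EXP.
gap <- function(S) d0 + d1 * S

gap(c(12, 16))   # gap at 12 and at 16 years of schooling

# Schooling level at which the estimated gap would reach zero.
-d0 / d1
```

The gap shrinks as schooling rises, which is the sense in which schooling matters more for females in this specification.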
Please open `07-Exercise.R` and fill in your answer for each question.