Task 1: Standardized regression

Let’s go back to our grades dataset. Remember that to load the data you need to use the read_dta() function from the haven package. These are the estimates we got from regressing average verbal test scores (avgverb) on the full set of regressors.

library(haven)
grades = read_dta(file = "https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1")

reg_full <- lm(avgverb ~ classize + disadvantaged + school_enrollment + female + religious, grades)
reg_full$coefficients
##       (Intercept)          classize     disadvantaged school_enrollment 
##      78.560725298       0.003320773      -0.389333008       0.000758258 
##            female         religious 
##       0.923710499       2.876146701

1. Create a new variable avgmath_stand equal to the standardized math score. You can use the scale() function (combined with mutate) or do it by hand with base R.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# using scale()
grades <- grades %>%
    mutate(avgmath_stand = scale(avgmath))

# using base R
grades$avgmath_stand_2 <- (grades$avgmath - mean(grades$avgmath)) / sd(grades$avgmath)

identical(grades$avgmath_stand, grades$avgmath_stand_2)
## [1] FALSE
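
The two columns contain the same values: identical() returns FALSE only because scale() returns a one-column matrix with centering and scaling attributes, whereas the base R version is a plain numeric vector. As a quick check (a sketch), drop the matrix structure before comparing:

# drop the matrix structure and attributes created by scale() before comparing
all.equal(as.numeric(grades$avgmath_stand), grades$avgmath_stand_2)  # should now report TRUE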

2. Run the full regression using the standardized math test score as the dependent variable. Interpret the coefficients and their magnitude.

reg_full_stand <- lm(avgmath_stand ~ classize + disadvantaged + school_enrollment + female + religious, grades)
reg_full_stand$coefficients
##       (Intercept)          classize     disadvantaged school_enrollment 
##       0.314986346       0.002630553      -0.036187210       0.001911697 
##            female         religious 
##      -0.125894281       0.120617503

Interpretation for the class size coefficient: holding all the other regressors included in the regression constant, on average, a 1-student increase in class size is associated with a 0.003 standard deviation increase in average math scores. This is a very small association: 0.3% of a standard deviation.

3. Create the standardized variables for each continuous regressor as <regressor>_stand. Would it make sense to standardize the religious variable?

grades <- grades %>%
    mutate(classize_stand = scale(classize),
           disadvantaged_stand = scale(disadvantaged),
           school_enrollment_stand = scale(school_enrollment),
           female_stand = scale(female))
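
An equivalent, more compact option (a sketch using dplyr’s across(); the as.numeric() wrapper drops the matrix structure that scale() returns) is:

# standardize several regressors at once; the "_stand" suffix is added via .names
grades <- grades %>%
    mutate(across(c(classize, disadvantaged, school_enrollment, female),
                  ~ as.numeric(scale(.x)),
                  .names = "{.col}_stand"))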

It would not make much sense to standardize the religious variable because it is a dummy variable, equal to 1 if the school is religious and 0 otherwise: a one-standard-deviation increase in a 0/1 variable has no natural interpretation, and the unstandardized coefficient already measures the average difference between religious and non-religious schools.

4. Regress avgmath_stand on the full set of standardized regressors and religious. Discuss the relative influence of the regressors.

reg_full_stand_2 <- lm(avgmath_stand ~ classize_stand + disadvantaged_stand + school_enrollment_stand + female_stand + religious, grades)
reg_full_stand_2$coefficients
##             (Intercept)          classize_stand     disadvantaged_stand 
##             -0.02969138              0.01726695             -0.48857857 
## school_enrollment_stand            female_stand               religious 
##              0.07421537             -0.01236283              0.12061750

Interpretation for the class size coefficient: holding all the other regressors included in the regression constant, on average, a 1-standard-deviation increase in class size (approximately 6.5 students) is associated with a 0.02 standard deviation increase in average math scores. This is quite a small association: 2% of a standard deviation. Note that this coefficient can be obtained by multiplying the class size coefficient from the previous question’s regression by approximately 6.5 (i.e. 1 standard deviation of the class size variable). Comparing magnitudes, the share of disadvantaged students is by far the most influential regressor: a 1-standard-deviation increase is associated with a 0.49 standard deviation decrease in average math scores. Attending a religious school is associated with scores that are 0.12 standard deviations higher, and a 1-standard-deviation increase in school enrollment with a 0.07 standard deviation increase, while class size and the share of female students have very small associations.
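
As a quick numerical check of that rescaling relationship (a sketch reusing reg_full_stand, reg_full_stand_2 and the grades data):

# standardizing a regressor multiplies its coefficient by its standard deviation,
# so both lines should return roughly 0.0173
coef(reg_full_stand)["classize"] * sd(grades$classize)
coef(reg_full_stand_2)["classize_stand"]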

Task 2: Non-linear relationships

1. Load the data here. This dataset contains information about tuition and estimated incomes of graduates for universities in the US. More details can be found here.

college <- read.csv("https://www.dropbox.com/s/2v5mb04nzw2u7bd/college_tuition_income.csv?dl=1")

2. Create a scatter plot of estimated mid career pay (mid_career_pay) on the y-axis as a function of out of state tuition (out_of_state_tuition) on the x-axis. Would you say the relationship is broadly linear or rather non-linear? Use geom_smooth(method = "lm", se = F) + geom_smooth(method = "lm", se = F, formula = y ~ poly(x, 2, raw = T)) to fit both a linear and a 2nd order polynomial regression line. This time, which seems more appropriate?

base_graph <- college %>%
    ggplot(aes(x = out_of_state_tuition, y = mid_career_pay)) +
    geom_point() +
    labs(x = "Tuition for out-of-state residents in USD",
         y = "Estimated mid career pay in USD",
         title = "Relationship between out of state tuition and estimate mid career pay") +
    scale_x_continuous(labels = scales::dollar) +
    scale_y_continuous(labels = scales::dollar) +
    theme_bw(base_size = 14)
base_graph
## Warning: Removed 2456 rows containing missing values (geom_point).

The relationship appears to be relatively linear, though there is some non-linearity at the upper end of the out of state tuition distribution, i.e. on the right-hand side of the graph. A 2nd or 3rd order polynomial regression may be appropriate to capture it.

base_graph +
    geom_smooth(method='lm', se = FALSE, formula = y~poly(x,2, raw = T), aes(colour="y = b0 + b1x + b2x^2 + e")) +
    geom_smooth(method='lm', se = FALSE, aes(colour="y = b0 + b1x + e")) +
    scale_colour_manual(name = NULL, values=c("#d90502", "#DE9854")) +
    theme(legend.position = c(0,1),
          legend.justification = c(0,1))
## Warning: Removed 2456 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2456 rows containing non-finite values (stat_smooth).
## Warning: Removed 2456 rows containing missing values (geom_point).

Based on a first visual impression, the 2nd order polynomial fit seems a bit better at capturing the non-linearities at the left and right ends of the graph.
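
To go a bit beyond visual impressions (not required here, just a sketch), one could compare the adjusted R-squared of linear, quadratic and cubic fits, using I() to add the polynomial terms:

fit1 <- lm(mid_career_pay ~ out_of_state_tuition, college)
fit2 <- lm(mid_career_pay ~ out_of_state_tuition + I(out_of_state_tuition^2), college)
fit3 <- lm(mid_career_pay ~ out_of_state_tuition + I(out_of_state_tuition^2) + I(out_of_state_tuition^3), college)
# a (slightly) higher adjusted R-squared indicates a better in-sample fit
sapply(list(linear = fit1, quadratic = fit2, cubic = fit3),
       function(m) summary(m)$adj.r.squared)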

3. Create a variable equal to out of state tuition divided by 1000. Regress mid career pay on out of state tuition divided by 1000. Interpret the coefficient.

college <- college %>%
    mutate(out_state = out_of_state_tuition/1000)

lm(mid_career_pay ~ out_state, college)
## 
## Call:
## lm(formula = mid_career_pay ~ out_state, data = college)
## 
## Coefficients:
## (Intercept)    out_state  
##     68462.4        772.3

On average, a $1,000 increase in out of state tuition is associated with a $772 increase in estimated mid career pay.
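
Note that this is just a rescaling of the regressor: dividing tuition by 1,000 multiplies its coefficient by 1,000. A quick sketch to confirm:

# the slope on raw tuition, times 1,000, should reproduce the 772.3 above
coef(lm(mid_career_pay ~ out_of_state_tuition, college))["out_of_state_tuition"] * 1000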

4. Regress mid career pay on out of state tuition divided by 1000 and its square. Hint: you can use either poly(x, 2, raw = T) or x + I(x^2), where x is your regressor. What does the positive sign on the squared term imply?

lm(mid_career_pay ~ poly(out_state, 2, raw = T), college)
## 
## Call:
## lm(formula = mid_career_pay ~ poly(out_state, 2, raw = T), data = college)
## 
## Coefficients:
##                  (Intercept)  poly(out_state, 2, raw = T)1  
##                     83769.88                       -338.10  
## poly(out_state, 2, raw = T)2  
##                        16.92

The positive sign on the squared term (poly(out_state, 2, raw = T)2) implies that the relationship is convex: as out of state tuition increases, estimated mid career pay increases at an increasing rate. In other words, the slope of the fitted curve, equal to -338.10 + 2 × 16.92 × out_state, increases with out of state tuition, so the curve becomes steeper as you move to the right of the graph.
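
To make this concrete, here is a small sketch (the slope_at helper is just for illustration) computing the implied effect of an extra $1,000 of tuition at different tuition levels:

b <- coef(lm(mid_career_pay ~ poly(out_state, 2, raw = TRUE), college))
# marginal effect of an extra $1,000 of tuition: b1 + 2 * b2 * out_state
slope_at <- function(x) b[2] + 2 * b[3] * x
slope_at(c(10, 30, 50)) # roughly $0, $680 and $1,350 at $10k, $30k and $50k tuition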

Task 3: Wages, education and gender in 1985

1. Load the data CPS1985 from the AER package.

library(AER)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
data("CPS1985")
cps = CPS1985

2. Look at the help to get the definition of each variable: ?CPS1985

?AER::CPS1985

3. We don’t know if people are working part-time or full-time, does it matter here?

It doesn’t matter much here because wage is an hourly wage: unlike weekly or annual earnings, it is not mechanically affected by the number of hours worked.

4. Create the log_wage variable equal to the log of wage.

cps <- cps %>%
    mutate(log_wage = log(wage))

5. Regress log_wage on gender and education, and save it as reg1. Interpret each coefficient.

reg1 <- lm(log_wage ~ gender + education, cps)
reg1
## 
## Call:
## lm(formula = log_wage ~ gender + education, data = cps)
## 
## Coefficients:
##  (Intercept)  genderfemale     education  
##      1.16519      -0.23207       0.07685

This is a log-level regression, so a one-unit increase in \(x\) is associated with approximately a 100 × coefficient percent change in \(y\). On average, holding education constant, women earn exp(-0.23207) = 0.79 times what men earn, i.e. about 21% less than men. Note that since the coefficient is relatively small, you can use the approximation directly and say that, on average, women earn roughly 23% less than men with the same level of education.

Holding gender constant, a 1-year increase in years of education multiplies hourly wage, on average, by exp(0.07685) = 1.08. In other words, controlling for gender, a 1-year increase in years of education is associated, on average, with an 8% increase in hourly wage. Again, since the coefficient is quite small, the approximation we saw earlier (100 × 0.07685 ≈ 7.7%) is close to the exact figure.
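
As a quick sketch, both the exact and the approximate percentage effects can be read off reg1 directly:

coefs <- coef(reg1)
100 * (exp(coefs["genderfemale"]) - 1) # exact gender gap: about -20.7%
100 * coefs["genderfemale"]            # approximation: about -23.2%
100 * (exp(coefs["education"]) - 1)    # exact return to education: about 8%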

6. Regress the log_wage on gender, education and their interaction gender*education, save it as reg2. Interpret each coefficient. Does the gender wage gap decrease with education?

reg2 <- lm(log_wage ~ gender*education, cps)
reg2
## 
## Call:
## lm(formula = log_wage ~ gender * education, data = cps)
## 
## Coefficients:
##            (Intercept)            genderfemale               education  
##                1.32352                -0.63315                 0.06468  
## genderfemale:education  
##                0.03080

On average, men with no education earn exp(1.32352) = 3.76 dollars an hour. Women with no education earn, on average, exp(-0.63315) = 0.53 times that amount, i.e. about 47% less than men with no education. For men, a 1-year increase in years of education multiplies hourly wage, on average, by exp(0.06468) = 1.07, i.e. a 6.7% increase. The positive interaction coefficient means the return to education is higher for women: each additional year of education multiplies women’s hourly wage by about exp(0.06468 + 0.03080) = 1.10, i.e. a 10% increase, roughly 3 percentage points more per year than for men. Graphically, this means that the slope for women is steeper than that for men. The gender pay gap therefore shrinks with years of education.
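
As a sketch, the gender-specific returns to education implied by reg2 can be computed directly from its coefficients:

coefs2 <- coef(reg2)
100 * (exp(coefs2["education"]) - 1)                                    # men: about 6.7% per year
100 * (exp(coefs2["education"] + coefs2["genderfemale:education"]) - 1) # women: about 10% per year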

7. Create a plot showing this interaction. (Hint: use the color = gender argument in aes and geom_smooth(method = "lm", se = F) to obtain a regression line per gender.)

cps %>%
    ggplot(aes(x = education, y = log_wage, col = gender)) +
    geom_point() +
    geom_smooth(method= "lm", se = F) +
    scale_color_viridis_d() +
    labs(x = "Years of education", y = "Log hourly wage", title = "Relationship between hourly wage and years of education by gender", color = NULL) +
    theme_bw(base_size = 14) +
    theme(legend.position = c(0,1),
          legend.justification = c(0,1))
## `geom_smooth()` using formula 'y ~ x'