Regression Inference

Task 1

1. Copy the loading and cleaning code from slide 3 and run it.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.5

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.0.5

## Warning: package 'tibble' was built under R version 4.0.5

## Warning: package 'tidyr' was built under R version 4.0.5

## Warning: package 'readr' was built under R version 4.0.5

## Warning: package 'purrr' was built under R version 4.0.3

## Warning: package 'dplyr' was built under R version 4.0.5

## Warning: package 'stringr' was built under R version 4.0.3

## Warning: package 'forcats' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

star_df = read.csv("https://www.dropbox.com/s/bf1fog8yasw3wjj/star_data.csv?dl=1")
star_df = star_df[complete.cases(star_df),]
star_df = star_df %>%
    filter(star %in% c("small","regular") &
               grade == "k") %>% 
    mutate(small = (star == "small"))

2. Generate the bootstrap distribution of \(b_\textrm{small}\) based on 1,000 samples drawn from star_df.
Hint 1: Use mutate to change the type of the small variable, and make it numeric.
Hint 2: use the appropriate functions and arguments from the infer package so use the help pages.
Hint 3: in calculate set stat to slope.

library(infer)

## Warning: package 'infer' was built under R version 4.0.5

bootstrap_distrib <- star_df %>% 
    mutate(small=as.numeric(small)) %>% 
    specify(response = math, explanatory = small) %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "slope")

3. Plot this simulated sampling distribution and compute mean and the standard error of \(b_small\).

bootstrap_distrib %>%
    ggplot(aes(x = stat)) +
    geom_histogram(boundary = 9, binwidth = 0.5, col = "white", fill = "#d90502") +
    labs(
        x = "Bootstrap sample slope estimate",
        y = "Frequency"
    ) +
    theme_bw(base_size = 14)

mean(bootstrap_distrib$stat)

## [1] 8.814085

sd(bootstrap_distrib$stat)

## [1] 1.670975

Task 2

1. Using the bootstrap distribution you generated in Task 1, compute the 95% confidence interval using the percentile method.

ci_pctile = bootstrap_distrib %>%
    summarise(
        lower = quantile(stat, 0.025),
        upper = quantile(stat, 0.975)
    )
ci_pctile

## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1  5.67  12.2

2. How similar is it to the confidence intervals obtained in the previous slide?

# standard error method
ci_stderror <- bootstrap_distrib %>%
  summarise(
    lower = 8.895 - 1.96*sd(stat),
    upper = 8.895 + 1.96*sd(stat))
ci_stderror

## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1  5.62  12.2

# theory
library(broom)

## Warning: package 'broom' was built under R version 4.0.5

ci_theory <- tidy(lm(math ~ small, star_df),
     conf.int = TRUE, conf.level = 0.95) %>%
  filter(term == "smallTRUE") %>%
  select(term, conf.low, conf.high)
ci_theory

## # A tibble: 1 x 3
##   term      conf.low conf.high
##   <chr>        <dbl>     <dbl>
## 1 smallTRUE     5.60      12.2

bootstrap_distrib %>%
    ggplot(aes(x = stat)) +
    geom_histogram(boundary = 9, binwidth = 0.5, col = "white", fill = "#d90502") +
    labs(
        x = "Bootstrap sample slope estimate",
        y = "Frequency",
        title = "95% confidence interval computed with different methods",
        subtitle = "percentile (dashed), standard error (longdashed) and theory (solid)"
    ) +
  geom_vline(xintercept = c(ci_pctile$lower, ci_pctile$upper), linetype = "dashed", show.legend = TRUE) +
  geom_vline(xintercept = c(ci_stderror$lower, ci_stderror$upper), linetype = "longdash", show.legend = TRUE) +
  geom_vline(xintercept = c(ci_theory$conf.low, ci_theory$conf.high)) +
  theme_bw(base_size = 14)

Task 3.1

1. Load the data CPS1985 from the AER package and look back at the help to get the definition of each variable: ?CPS1985

data(CPS1985, package = "AER")
cps = CPS1985
# ?AER::CPS1985

2. Create the log_wage variable equal to the log of wage.

cps <- cps %>%
    mutate(log_wage = log(wage))

3. Regress log_wage on gender and education, and save it as reg1.

reg1 <- lm(log_wage ~ gender + education, cps)
tidy(reg1)

## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    1.17     0.106       11.0  2.06e-25
## 2 genderfemale  -0.232    0.0413      -5.63 3.00e- 8
## 3 education      0.0768   0.00787      9.77 7.75e-21

Interpret each coefficient.

Intercept: Expected log wage of men without any education.
genderfemale: On average, women earn about 23% less than men controlling for education level.
education: An additional year of education is associated, on average, with a 7.68% increase in wages controlling for gender.

One way to see this regression is as follows:

reg1_aug <- augment(reg1)

cps %>%
  ggplot(aes(x = education, y = log_wage, color = gender)) +
  geom_point() +
  geom_line(data = reg1_aug %>% filter(gender == "male"), aes(x = education, y = .fitted)) +
  geom_line(data = reg1_aug %>% filter(gender == "female"), aes(x = education, y = .fitted)) +
  labs(
    x = "Years of education",
    y = "Log wage",
    color = "",
    title = "Relationship between income and education, by gender",
    subtitle = "Lines correspond to fitted values from a regression of log wage on a gender dummy and education level."
  ) +
  theme_bw(base_size = 14) +
  theme(legend.position = "top")

Both lines have the same slope. This slope is equal to the education coefficient from the previous regression. The distance between the two lines is equal to the coefficient on gender. Given a certain level of education, on average, the difference between men and women’s wages will be 23%. The intercept is equal to the value of the y-axis for the red line when years of education is equal to 0.

Are the coefficients statistically significant? At which significance level?

All the coefficients are statistically significant at the 1% level since their p-values are all very close to 0.

4. Regress the log_wage on gender, education and their interaction gender*education, save it as reg2.

reg2 <- lm(log_wage ~ gender*education, cps)
tidy(reg2)

## # A tibble: 4 x 5
##   term                   estimate std.error statistic  p.value
##   <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              1.32      0.134       9.86 3.58e-21
## 2 genderfemale            -0.633     0.213      -2.97 3.09e- 3
## 3 education                0.0647    0.0101      6.41 3.19e-10
## 4 genderfemale:education   0.0308    0.0161      1.92 5.55e- 2

How do you interpret the coefficient associated to female*education?

One additional year of education is associated, on average, with an additional 3% increase in wages for women relative to men. In other words, one additional year of education is associated for women, on average, with a 9.5% increase in wages, while it is only 6.5% for men.

One way to see this regression is as follows:

cps %>%
  ggplot(aes(x = education, y = log_wage, color = gender)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Years of education",
    y = "Log wage",
    color = "",
    title = "Relationship between income and education, by gender",
    subtitle = "Lines correspond to fitted values from a regression of log wage on a gender dummy, education level and\ntheir interaction"
  ) +
  theme_bw(base_size = 14) +
  theme(legend.position = "top")

## `geom_smooth()` using formula 'y ~ x'

Each gender has its own slope: for men it is equal to the coefficient on education while for women it is equal to education + gender*female. An interaction term precisely allows for different slopes (if one of the two variables is continuous). The intercept is equal to the expected wage for men without any education while the coefficient on gender corresponds to the expected difference in incomes between men and women without any education.

Can we reject the nullity of this coefficient at the 5% level? At 10%?

The interaction term coefficient’s p-value is equal to \(0.055\) so it just fails to be significant at the 5% level but is significant at the 10% level.

Task 3.2

1. Produce a scatterplot of the relationship between the log wage and the level of education, by gender.

cps %>%
  ggplot(aes(x = education, y = log_wage, color = gender)) +
  geom_point() +
  labs(
    x = "Years of education",
    y = "Log wage",
    color = "",
    title = "Relationship between income and education, by gender") +
  theme_bw(base_size = 14) +
  theme(legend.position = "top")

2. Add the regression line with geom_smooth. What does this line represents?

cps %>%
  ggplot(aes(x = education, y = log_wage, color = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Years of education",
    y = "Log wage",
    color = "",
    title = "Relationship between income and education, by gender") +
  theme_bw(base_size = 14) +
  theme(legend.position = "top")

## `geom_smooth()` using formula 'y ~ x'

The lines represent the fitted values from the regression of log wage on a gender dummy, education and their interaction (the one we ran in the previous task). The shaded area corresponds to the 95% confident interval for the fitted line itself (not the coefficients).

3. Let’s illustrate what the shaded area stands for.

1. Draw one bootstrap sample from our cps data.

cps_boot <- cps %>%
  rep_sample_n(reps = 1, size = nrow(cps), replace = TRUE)

2. Regress the log_wage on gender, education and their interaction gender*education, save it as reg_bootstrap.

reg_bootsrap <- lm(log_wage ~ gender*education, cps_boot)
tidy(reg_bootsrap)

## # A tibble: 4 x 5
##   term                   estimate std.error statistic  p.value
##   <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              1.41     0.132       10.7  2.25e-24
## 2 genderfemale            -0.528    0.223       -2.37 1.80e- 2
## 3 education                0.0605   0.00994      6.09 2.23e- 9
## 4 genderfemale:education   0.0234   0.0170       1.38 1.69e- 1

3. From reg_bootstrap extract and save the value of the intercept for men as intercept_men_bootstrap and the value of the slope for men as slope_men_bootstrap. Do the same for women.

intercept_men_bootstrap = reg_bootsrap$coefficients[1]
slope_men_bootstrap = reg_bootsrap$coefficients[3]

intercept_women_bootstrap = reg_bootsrap$coefficients[1] + reg_bootsrap$coefficients[2]
slope_women_bootstrap = reg_bootsrap$coefficients[3] + reg_bootsrap$coefficients[4]

4. Add both predicted lines from this bootstrap sample to the previous plot (Hint: use geom_abline (x2))

cps %>%
  ggplot(aes(x = education, y = log_wage, color = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_abline(slope = slope_men_bootstrap, intercept = intercept_men_bootstrap, linetype = "dashed", color = "#F8766D") +
  geom_abline(slope = slope_women_bootstrap, intercept = intercept_women_bootstrap, linetype = "dashed", color = "#00BFC4") +
  labs(
    x = "Years of education",
    y = "Log wage",
    color = "",
    title = "Relationship between income and education, by gender",
    subtitle = "Sample regression lines (solid), bootstrap sample regression lines (dashed)") +
  theme_bw(base_size = 14) +
  theme(legend.position = "top")

## `geom_smooth()` using formula 'y ~ x'

If we were the repeat this procedure 100 times, at each location of the shaded area, on average, 95% of the lines would lie within the shaded area and 5% would lie outside.

Regression Inference - Tasks

Florian Oswald, Gustave Kenedi and Pierre Villedieu

2021-11-09

Task 1

Task 2

Task 3.1

Task 3.2