Simple Regression Extensions - Tasks

Author

Florian Oswald

Published

November 11, 2025

1 Task: Wages or Log Wages?

  1. Install the “wooldridge” package via install.packages("wooldridge"). Check out Jeffrey Wooldridge’s website.

  2. Make a plot of wage ~ educ, and plot the distribution of wage alone. What do you observe?

library(tinyplot)
data("wage1", package = "wooldridge")

plt(wage ~ educ, data = wage1,
    col = "red", pch = 21, bg = "orange",
    cex = 1.25, grid = TRUE,
    main = "Wages vs. Education, 1976",
    xlab = "years of education",
    ylab = "Hourly wages")
plt_add(type = "rug", side = 2)   # add a rug along the y-axis

plt(~wage, data = wage1, type = "density", lwd = 3, col = "red", xlab = "hourly wage")

  3. Make the same plot, but use log(wage) ~ educ as your formula instead. What do you observe?
plt(log(wage) ~ educ, data = wage1,
    col = "red", pch = 21, bg = "orange",
    cex = 1.25, grid = TRUE,
    main = "Log Wages vs. Education, 1976",
    xlab = "years of education",
    ylab = "log hourly wages")
plt_add(type = "rug", side = 2)   # add a rug along the y-axis

plt(~log(wage), data = wage1, type = "density", lwd = 3, col = "red", xlab = "log hourly wage")

  4. Run 3 regression models: wage ~ educ, log(wage) ~ educ, log(wage) ~ log(educ). Take care with the last model!
mods = list()
mods[["level-level"]] = lm(wage ~ educ, data = wage1)
mods[["log-level"]]   = lm(log(wage) ~ educ, data = wage1)
# log(0) is -Inf, so drop observations with zero years of education
mods[["log-log"]]     = lm(log(wage) ~ log(educ), data = wage1, subset = educ > 0)

gm = c("nobs", "adj.r.squared")   # goodness-of-fit rows to display
modelsummary::modelsummary(mods, gof_map = gm)
                level-level   log-level   log-log
(Intercept)        -0.905       0.584     -0.445
                   (0.685)     (0.097)    (0.218)
educ                0.541       0.083
                   (0.053)     (0.008)
log(educ)                                  0.825
                                          (0.086)
Num.Obs.              526         526        524
R2 Adj.             0.163       0.184      0.147

Interpretation:

  • level-level: a 1-year increase in education is associated with a 0.541 dollar increase in the hourly wage.
  • log-level: a 1-year increase in education is associated with approximately a (100 x 0.083)% = 8.3 percent increase in the hourly wage.
  • log-log: a 1 percent increase in education is associated with approximately a 0.825 percent increase in the hourly wage.
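The log-level reading above uses the approximation "100 × coefficient ≈ percent change"; the exact figure can be recovered with exp(). A quick check, assuming the wooldridge package is installed:

```r
# Approximate vs. exact percent effect implied by the log-level coefficient
data("wage1", package = "wooldridge")

b <- unname(coef(lm(log(wage) ~ educ, data = wage1))["educ"])

# 100 * b is the approximation; 100 * (exp(b) - 1) is the exact percent change
round(c(approx = 100 * b, exact = 100 * (exp(b) - 1)), 2)
```

The two figures are close here because the coefficient is small; the approximation deteriorates as coefficients grow.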

2 Task: Standardized regression

Let’s go back to our grades dataset. Remember that to load the data you need to use the read_dta() function from the haven package. These are the estimates we got from regressing average verbal test scores on the full set of regressors.

library(haven)
grades = read_dta(file = "https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1")

reg_full <- lm(avgverb ~ classize + disadvantaged + school_enrollment + female + religious, grades)
reg_full$coefficients
      (Intercept)          classize     disadvantaged school_enrollment 
     78.560725298       0.003320773      -0.389333008       0.000758258 
           female         religious 
      0.923710499       2.876146701 

1. Create a new variable avgmath_stand equal to the standardized math score. You can use the scale() function (combined with mutate) or do it by hand with base R.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# using scale()
grades <- grades %>%
    mutate(avgmath_stand = scale(avgmath))

# using base R
grades$avgmath_stand_2 <- (grades$avgmath - mean(grades$avgmath)) / sd(grades$avgmath)

identical(grades$avgmath_stand, grades$avgmath_stand_2)
[1] FALSE
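identical() returns FALSE not because the values differ, but because scale() returns a one-column matrix carrying "scaled:center" and "scaled:scale" attributes, while the by-hand version is a plain numeric vector. A minimal self-contained illustration:

```r
x <- c(1, 2, 3, 4)
a <- scale(x)                  # 1-column matrix with centering/scaling attributes
b <- (x - mean(x)) / sd(x)     # plain numeric vector

identical(a, b)                # FALSE: classes and attributes differ
all.equal(as.numeric(a), b)    # TRUE: the values themselves match
```

Dropping the attributes with as.numeric() (or as.vector()) before comparing confirms the two computations agree.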

2. Run the full regression using the standardized math test score as the dependent variable. Interpret the coefficients and their magnitude.

reg_full_stand <- lm(avgmath_stand ~ classize + disadvantaged + school_enrollment + female + religious, grades)
reg_full_stand$coefficients
      (Intercept)          classize     disadvantaged school_enrollment 
      0.314986346       0.002630553      -0.036187210       0.001911697 
           female         religious 
     -0.125894281       0.120617503 

Interpretation for the class size coefficient: holding all the other regressors included in the regression constant, on average, a 1-student increase in class size is associated with a 0.003 standard deviation increase in average math scores. This is a very small association: 0.3% of a standard deviation.
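Standardizing only the dependent variable rescales every slope by 1/sd(y), which is why the coefficients shrink relative to the unstandardized regression. A self-contained check on simulated data (not the grades file):

```r
# With only y standardized, each OLS slope equals the original slope / sd(y)
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 0.5 * d$x + rnorm(100)

b_raw <- coef(lm(y ~ x, d))["x"]
b_std <- coef(lm(scale(y) ~ x, d))["x"]

all.equal(unname(b_std), unname(b_raw) / sd(d$y))   # TRUE
```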

3. Create the standardized variables for each continuous regressor as <regressor>_stand. Would it make sense to standardize the religious variable?

grades <- grades %>%
    mutate(classize_stand = scale(classize),
           disadvantaged_stand = scale(disadvantaged),
           school_enrollment_stand = scale(school_enrollment),
           female_stand = scale(female))

It would not make much sense to standardize the religious variable because it is a dummy variable, equal to 1 if the school is religious, and 0 otherwise.

4. Regress avgmath_stand on the full set of standardized regressors and religious. Discuss the relative influence of the regressors.

reg_full_stand_2 <- lm(avgmath_stand ~ classize_stand + disadvantaged_stand + school_enrollment_stand + female_stand + religious, grades)
reg_full_stand_2$coefficients
            (Intercept)          classize_stand     disadvantaged_stand 
            -0.02969138              0.01726695             -0.48857857 
school_enrollment_stand            female_stand               religious 
             0.07421537             -0.01236283              0.12061750 

Interpretation for the class size coefficient: holding all the other regressors included in the regression constant, on average, a 1-standard-deviation increase in class size (approx 6.5 students) is associated with a 0.02 standard deviation increase in average math scores. This is quite a small association: 2% of a standard deviation. Note that this coefficient can be obtained by multiplying the class size coefficient from the previous question’s regression by approximately 6.5 (i.e. 1 standard deviation of the class size variable).
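Conversely, standardizing a regressor multiplies its slope by sd(x), which is exactly the relationship noted above. A self-contained check on simulated data with a class-size-like spread:

```r
# Standardizing x multiplies the slope by sd(x): b_std = b_raw * sd(x)
set.seed(1)
d <- data.frame(x = 6.5 * rnorm(100) + 30)   # sd roughly like class size
d$y <- 70 + 0.003 * d$x + rnorm(100)

b_raw <- coef(lm(scale(y) ~ x, d))["x"]
b_std <- coef(lm(scale(y) ~ scale(x), d))[2]

all.equal(unname(b_std), unname(b_raw) * sd(d$x))   # TRUE
```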

3 Task: Non-linear relationships

1. Load the data here. This dataset contains information about tuition and estimated incomes of graduates for universities in the US. More details can be found here.

college <- read.csv("https://www.dropbox.com/s/2v5mb04nzw2u7bd/college_tuition_income.csv?dl=1")

2. Create a scatter plot of estimated mid career pay (mid_career_pay) (y-axis) as a function of out of state tuition (out_of_state_tuition) (x-axis). Would you say the relationship is broadly linear or rather non-linear? Use geom_smooth(method = "lm", se = F) + geom_smooth(method = "lm", se = F, formula = y ~ poly(x, 2, raw = T)) to fit both a linear and a 2nd order regression line. Which of the two seems more appropriate?

base_graph <- college %>%
    ggplot(aes(x = out_of_state_tuition, y = mid_career_pay)) +
    geom_point() +
    labs(x = "Tuition for out-of-state residents in USD",
         y = "Estimated mid career pay in USD",
         title = "Relationship between out of state tuition and estimated mid career pay") +
    scale_x_continuous(labels = scales::dollar) +
    scale_y_continuous(labels = scales::dollar) +
    theme_bw(base_size = 14)
base_graph
Warning: Removed 2456 rows containing missing values or values outside the scale range
(`geom_point()`).

The relationship appears relatively linear, though there is some non-linearity at the upper end of the out of state tuition distribution, i.e. on the right-hand side of the graph. A 2nd or 3rd order polynomial regression may be appropriate to capture it.

base_graph +
    geom_smooth(method = 'lm', se = FALSE, formula = y ~ poly(x, 2, raw = T), aes(colour = "y = b0 + b1x + b2x^2 + e")) +
    geom_smooth(method = 'lm', se = FALSE, aes(colour = "y = b0 + b1x + e")) +
    scale_colour_manual(name = NULL, values = c("#d90502", "#DE9854")) +
    theme(legend.position = "inside",     # ggplot2 >= 3.5.0 syntax
          legend.position.inside = c(0, 1),
          legend.justification = c(0, 1))
Warning: Removed 2456 rows containing non-finite outside the scale range
(`stat_smooth()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2456 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2456 rows containing missing values or values outside the scale range
(`geom_point()`).

At first glance, the 2nd order polynomial fit seems a bit better at capturing the non-linearities at both ends of the graph.

3. Create a variable equal to out of state tuition divided by 1000. Regress mid career pay on out of state tuition divided by 1000. Interpret the coefficient.

college <- college %>%
    mutate(out_state = out_of_state_tuition/1000)

lm(mid_career_pay ~ out_state, college)

Call:
lm(formula = mid_career_pay ~ out_state, data = college)

Coefficients:
(Intercept)    out_state  
    68462.4        772.3  

On average, a $1,000 increase in out of state tuition is associated with a $772 increase in estimated mid career pay.
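Dividing a regressor by 1,000 multiplies its coefficient by 1,000 while leaving the fit itself unchanged, which is what makes the slope read as dollars of pay per $1,000 of tuition. A self-contained check on simulated data (not the college file):

```r
# Rescaling x by 1/1000 scales its OLS slope by 1000; the fitted values are identical
set.seed(42)
d <- data.frame(tuition = runif(200, 5000, 60000))
d$pay <- 68000 + 0.77 * d$tuition + rnorm(200, sd = 5000)

b_raw <- coef(lm(pay ~ tuition, d))["tuition"]
b_thd <- coef(lm(pay ~ I(tuition / 1000), d))[2]

all.equal(unname(b_thd), unname(b_raw) * 1000)   # TRUE
```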

4. Regress mid career pay on out of state tuition divided by 1000 and its square. Hint: you can use either poly(x, 2, raw = T) or x + I(x^2), where x is your regressor. What does the positive sign on the squared term imply?

lm(mid_career_pay ~ poly(out_state, 2, raw = T), college)

Call:
lm(formula = mid_career_pay ~ poly(out_state, 2, raw = T), data = college)

Coefficients:
                 (Intercept)  poly(out_state, 2, raw = T)1  
                    83769.88                       -338.10  
poly(out_state, 2, raw = T)2  
                       16.92  

The positive sign on the squared term (poly(out_state, 2, raw = T)2) implies that estimated mid career pay increases at an increasing rate as out of state tuition rises. In other words, the slope of the regression line is increasing in out of state tuition: it becomes steeper as you move to the right of the graph.
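With a quadratic fit y = b0 + b1*x + b2*x^2, the slope at a given x is b1 + 2*b2*x. Plugging in the estimates above (tuition in $1,000s) shows the slope turning from flat to steep:

```r
# Marginal effect of tuition (in $1,000s) implied by the quadratic fit above
b1 <- -338.10
b2 <- 16.92
slope <- function(x) b1 + 2 * b2 * x

slope(c(10, 30, 50))   # dollars of mid career pay per extra $1,000 of tuition
```

At $10,000 of tuition the slope is essentially zero (about 0.3); at $50,000 it is roughly $1,354 per extra $1,000 of tuition.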

3.1 Task: Wages, education and gender in 1985

1. Load the data CPS1985 from the AER package.

library(AER)
Loading required package: car
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
data("CPS1985")
cps = CPS1985

2. Look at the help to get the definition of each variable: ?CPS1985

?AER::CPS1985

3. We don’t know whether people are working part-time or full-time. Does it matter here?

It doesn’t matter much since we are analysing hourly wages.

4. Create the log_wage variable equal to the log of wage.

cps <- cps %>%
    mutate(log_wage = log(wage))

5. Regress log_wage on gender and education, and save it as reg1. Interpret each coefficient.

reg1 <- lm(log_wage ~ gender + education, cps)
reg1

Call:
lm(formula = log_wage ~ gender + education, data = cps)

Coefficients:
 (Intercept)  genderfemale     education  
     1.16519      -0.23207       0.07685  

This is a log-level regression, so each coefficient is (approximately) the percent change in \(y\) associated with a one-unit increase in \(x\). Holding education constant, women on average earn exp(-0.23207) = 0.79 times what men earn, that is, about 21% less. Since the coefficient is relatively small, you can also use the approximation directly: on average, women earn roughly 23% less than men with the same level of education.

Holding gender constant, a 1-year increase in years of education is associated, on average, with hourly earnings being multiplied by exp(0.07685) = 1.08, i.e. an increase of about 8%. Again, since the coefficient is quite small, the approximation we saw earlier (a 7.7% increase) works well.
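The exact percent effects behind both approximations can be computed directly from the reg1 estimates reported above:

```r
# Exact percent effects implied by the log-level coefficients in reg1
b <- c(genderfemale = -0.23207, education = 0.07685)
round(100 * (exp(b) - 1), 1)   # genderfemale ≈ -20.7, education ≈ 8.0
```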

6. Regress log_wage on gender, education and their interaction gender*education, and save it as reg2. Interpret each coefficient. Does the gender wage gap decrease with education?

reg2 <- lm(log_wage ~ gender*education, cps)
reg2

Call:
lm(formula = log_wage ~ gender * education, data = cps)

Coefficients:
           (Intercept)            genderfemale               education  
               1.32352                -0.63315                 0.06468  
genderfemale:education  
               0.03080  

On average, men with no education earn exp(1.32352) = 3.76 dollars an hour. Women with no education earn, on average, exp(-0.63315) = 0.53 times that amount, i.e. about 47% less than men with no education. For men, a 1-year increase in years of education is associated, on average, with hourly wages being multiplied by exp(0.06468) = 1.07, i.e. a 6.7% increase. The return to education is higher for women by about 3 percentage points (the interaction coefficient, 0.03080): for each additional year of education, women’s hourly wage is multiplied by exp(0.06468 + 0.03080) = 1.10, i.e. it increases by about 10%. Graphically, this means that the slope for women is steeper than that for men. The gender pay gap therefore shrinks with years of education.
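Using the reg2 estimates reported above, the log-wage gender gap at a given education level is the female dummy plus the interaction times years of education. A quick sketch:

```r
# Gender log-wage gap as a function of education, from the reg2 coefficients
gap <- function(educ) -0.63315 + 0.03080 * educ

round(100 * (exp(gap(c(8, 12, 16))) - 1), 1)   # percent gap at 8, 12, 16 years
```

The gap shrinks as education rises, which is exactly the steeper female slope visible in the plot below.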

7. Create a plot showing this interaction. (Hint: use the color = gender argument in aes and geom_smooth(method = "lm", se = F) to obtain a regression line per gender.)

cps %>%
    ggplot(aes(x = education, y = log_wage, col = gender)) +
    geom_point() +
    geom_smooth(method= "lm", se = F) +
    scale_color_viridis_d() +
    labs(x = "Years of education", y = "Log hourly wage", title = "Relationship between hourly wage and years of education by gender", color = NULL) +
    theme_bw(base_size = 14) +
    theme(legend.position = "inside",     # ggplot2 >= 3.5.0 syntax
          legend.position.inside = c(0, 1),
          legend.justification = c(0, 1))
`geom_smooth()` using formula = 'y ~ x'