Task 1

Let’s analyse the regression results using reading score as the dependent variable.

1. Load the data from here using the read_dta() function from the haven package. Assign it to an object grades.

library(haven)
grades <- read_dta("https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1")

2. Regress avgverb on classize and disadvantaged and assign the output to a new object reg. Interpret the coefficients. How do they compare with the math score regression coefficients?

reg <- lm(avgverb ~ classize + disadvantaged, grades)
reg
## 
## Call:
## lm(formula = avgverb ~ classize + disadvantaged, data = grades)
## 
## Coefficients:
##   (Intercept)       classize  disadvantaged  
##      80.23878       -0.03083       -0.34967

The intercept, which measures the expected reading score when class size and the percentage of disadvantaged students in the class are both 0, equals 80. As we’ve discussed in class, this doesn’t make any “real-world” sense: if there are no students, there cannot be a test score. The coefficient on class size is -0.03, which means that a one-student increase in class size, controlling for the percentage of disadvantaged students, is associated, on average, with a 0.03 point decrease in average reading score. The coefficient is negative, whereas it was positive when analysing math scores as the outcome. However, the coefficient is very small considering reading is scored on a scale from 0 to 100. Lastly, the coefficient on disadvantaged is equal to -0.35, which means that keeping class size constant, a 1 percentage point increase in disadvantaged students is associated, on average, with a 0.35 point decrease in average reading score. The magnitude of this coefficient is very similar to that for math scores. Again, it is very small: a 10 percentage point increase in disadvantaged students (quite a large increase) would only be associated, on average, with a 3.5 point decrease in reading score, keeping class size constant.
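
As a quick sanity check of these magnitudes, we can read them off the fitted coefficients directly (a minimal sketch, assuming the reg object estimated above):

coef(reg)["classize"]            # ≈ -0.03 points per additional student
coef(reg)["disadvantaged"] * 10  # ≈ -3.5 points for a 10 pp increase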

3. What are the other available variables that we may add in the regression?
* Run the regression with all these variables and assign it to reg_full.
* Look at the coefficients.
* Discuss all coefficients: sign and magnitude.

reg_full <- lm(avgverb ~ classize + disadvantaged + school_enrollment + female + religious, grades)
reg_full
## 
## Call:
## lm(formula = avgverb ~ classize + disadvantaged + school_enrollment + 
##     female + religious, data = grades)
## 
## Coefficients:
##       (Intercept)           classize      disadvantaged  school_enrollment  
##        78.5607253          0.0033208         -0.3893330          0.0007583  
##            female          religious  
##         0.9237105          2.8761467
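
Briefly, reading the signs and magnitudes off the output above: once these controls are added, the coefficient on classize flips sign and becomes essentially zero (0.003), while the coefficient on disadvantaged remains negative and of similar magnitude as before (-0.39). The coefficient on school_enrollment is also essentially zero, and the coefficients on female and religious are both positive, with religious schools associated, on average, with reading scores about 2.9 points higher, holding the other variables constant.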

Task 2: Dummy Variable Trap

Still need to convince yourself about the dummy variable trap? Let’s run another regression where there is perfect linear dependence between the regressors.

1. Load the STAR data from here, using read.csv, and assign it to an object called star_df. Keep only cases with no NAs with the following code:
star_df <- star_df[complete.cases(star_df),]

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
star_df <- read.csv("https://www.dropbox.com/s/bf1fog8yasw3wjj/star_data.csv?dl=1")
star_df <- star_df[complete.cases(star_df),]

We also restrict the sample to second graders, the subsample used throughout the rest of this task:

star_df_2 <- star_df %>%
    filter(grade == "2")

2. Create three dummy variables: (i) small equal to TRUE if students are in a small class and FALSE otherwise; (ii) regular equal to TRUE if students are in a regular class and FALSE otherwise; (iii) regular_plus equal to TRUE if students are in a regular+aide class and FALSE otherwise. Create a last variable, sum, equal to the sum of small, regular and regular_plus. What is sum equal to? What does this mean?

star_df_2 <- star_df_2 %>%
    mutate(small = (star == "small"),
           regular = (star == "regular"),
           regular_plus = (star == "regular+aide"),
           sum = small + regular + regular_plus)

star_df_2 %>% count(sum)
##   sum    n
## 1   1 5672

The sum variable is always equal to 1, which means that small, regular and regular_plus are perfectly collinear. In other words, each of them can be expressed as a linear combination of the other two, e.g. small = 1 - (regular + regular_plus).
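
We can verify this linear dependence directly in the data (a minimal check, using the dummies created above):

all(star_df_2$small == 1 - (star_df_2$regular + star_df_2$regular_plus))
## [1] TRUE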

3. Regress math on small. What is the average predicted math score of students in a small class? Do the same for regular and regular_plus.

reg_small <- lm(math ~ small, star_df_2)
reg_small
## 
## Call:
## lm(formula = math ~ small, data = star_df_2)
## 
## Coefficients:
## (Intercept)    smallTRUE  
##     579.074        7.564

Note that our regressor is a dummy variable. Remember from the lecture on causality that when the regressor is a dummy variable, the intercept corresponds to the conditional mean of the outcome variable for the FALSE category, while the slope coefficient corresponds to the difference in expected outcomes between the TRUE and the FALSE categories. In this case, the regression tells us that the expected math score of students not in a small class is around 579.07 points, and that the difference in expected math scores between students in a small class and those not in a small class is 7.56 points. As such, the expected math score of students in a small class is approximately 579.07 + 7.56 ≈ 586.64.

For those of you who prefer seeing this formally, this is how you can (and should!) proceed: \[ \mathbb{E}(\text{math score } | \text{ small} = 0) = b_0 + b_1 \times 0 = b_0 \approx 579.07 \\ \mathbb{E}(\text{math score } | \text{ small} = 1) = b_0 + b_1 \times 1 = b_0 + b_1 \approx 586.64 \]
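
If you want to double-check this equivalence in the data, the two conditional means can be computed directly (a quick sketch; the values should be approximately 579.07 and 586.64):

star_df_2 %>%
    group_by(small) %>%
    summarise(mean_math = mean(math))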

reg_regular <- lm(math ~ regular, star_df_2)
reg_regular
## 
## Call:
## lm(formula = math ~ regular, data = star_df_2)
## 
## Coefficients:
## (Intercept)  regularTRUE  
##     583.231       -5.535
reg_plus <- lm(math ~ regular_plus, star_df_2)
reg_plus
## 
## Call:
## lm(formula = math ~ regular_plus, data = star_df_2)
## 
## Coefficients:
##      (Intercept)  regular_plusTRUE  
##          581.859            -1.466

Using the exact same logic as above, we obtain: \[ \mathbb{E}(\text{math score } | \text{ regular} = 0) = b_0 + b_1 \times 0 = b_0 \approx 583.23 \\ \mathbb{E}(\text{math score } | \text{ regular} = 1) = b_0 + b_1 \times 1 = b_0 + b_1 \approx 577.7 \\ \mathbb{E}(\text{math score } | \text{ regular+aide} = 0) = b_0 + b_1 \times 0 = b_0 \approx 581.86 \\ \mathbb{E}(\text{math score } | \text{ regular+aide} = 1) = b_0 + b_1 \times 1 = b_0 + b_1 \approx 580.39 \]
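
Again, these are just group means, which we can cross-check in one go (a quick sketch; the means should be approximately 577.70, 580.39 and 586.64 for regular, regular+aide and small classes, respectively):

star_df_2 %>%
    group_by(star) %>%
    summarise(mean_math = mean(math))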

4. Regress math on small, regular and regular_plus. What do you notice? What’s the omitted (reference) category? Does this match the previous questions?

reg_full <- lm(math ~ small + regular + regular_plus, star_df_2)
reg_full
## 
## Call:
## lm(formula = math ~ small + regular + regular_plus, data = star_df_2)
## 
## Coefficients:
##      (Intercept)         smallTRUE       regularTRUE  regular_plusTRUE  
##          580.393             6.246            -2.696                NA

We notice that the regular_plus variable has been dropped, i.e. the coefficient for this variable cannot be estimated because it is collinear with small and regular. This means that the reference category is regular+aide. In other words, the coefficients on small and regular are to be interpreted relative to the outcomes of the regular+aide students. To be specific, and using the same formal interpretation as above, denoting the model as \(\text{math score} = b_0 + b_1 \text{small} + b_2 \text{regular} + e\), we obtain:

\[\begin{align} b_0 &= \mathbb{E}(\text{math score } | \text{ small} = 0 \text{ & } \text{ regular} = 0) \\ &= \mathbb{E}(\text{math score } | \text{ regular + aide} = 1) \\ &\approx 580.39 \end{align}\]

You can compare this result with the answer to the previous question and see that we obtain exactly the same number.

In the same way, we get:

\[\begin{align} b_1 &= \mathbb{E}(\text{math score } | \text{ small} = 1 \text{ & } \text{ regular} = 0) - \mathbb{E}(\text{math score } | \text{ small} = 0 \text{ & } \text{ regular} = 0) \\ &= (b_0 + b_1 \times 1 + b_2 \times 0) - (b_0 + b_1 \times 0 + b_2 \times 0) \\ &\approx 586.64 - 580.39 \approx 6.25 \end{align}\]

\[\begin{align} b_2 &= \mathbb{E}(\text{math score } | \text{ small} = 0 \text{ & } \text{ regular} = 1) - \mathbb{E}(\text{math score } | \text{ small} = 0 \text{ & } \text{ regular} = 0) \\ &= (b_0 + b_1 \times 0 + b_2 \times 1) - (b_0 + b_1 \times 0 + b_2 \times 0) \\ &\approx 577.70 - 580.39 \approx -2.7 \end{align}\]

In plain English: relative to the regular+aide group, students in small classes score, on average, 6.25 points higher on the math exam, while students in regular classes score 2.7 points lower.
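
An equivalent way out of the trap, shown here as a small sketch, is to drop the intercept: R then estimates one coefficient per class type, i.e. the three conditional means directly, and no dummy needs to be dropped:

# without an intercept, each level of star gets its own coefficient
# (≈ 577.70, 580.39 and 586.64, matching the means computed above)
lm(math ~ 0 + star, star_df_2)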

5. Regress math on star. What do you notice? What’s the omitted category? Interpret the coefficient.

lm(math ~ star, star_df_2)
## 
## Call:
## lm(formula = math ~ star, data = star_df_2)
## 
## Coefficients:
##      (Intercept)  starregular+aide         starsmall  
##          577.696             2.696             8.942

R actually creates the dummy variables for us: there is no need to do the tedious work above of creating an individual dummy variable for each category! The omitted category is regular, so the coefficients are to be interpreted as differences in average math scores relative to students in regular classes.
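
If we wanted a different reference group, we could relevel the factor ourselves, for instance making small classes the omitted category (a minimal sketch):

# relative to small classes, both coefficients should now be negative
# (regular ≈ -8.94, regular+aide ≈ -6.25)
lm(math ~ relevel(factor(star), ref = "small"), star_df_2)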

Task 3: Recap

Let’s use the STAR data (used in the previous task) to review the main concepts covered.

1. Using the filtered data from the previous task, regress math on school (tabulate the variable to know what it contains). Interpret the coefficients. What’s the omitted category? Do you find them surprising? Why? What might be an omitted variable?

star_df_2 %>% count(school)
##       school    n
## 1 inner-city 1199
## 2      rural 2665
## 3   suburban 1476
## 4      urban  332
lm(math ~ school, star_df_2)
## 
## Call:
## lm(formula = math ~ school, data = star_df_2)
## 
## Coefficients:
##    (Intercept)     schoolrural  schoolsuburban     schoolurban  
##         561.56           30.18           16.93           20.33

The school variable is a categorical variable indicating the school’s location type. There are four categories: inner-city, rural, suburban and urban.

The omitted category in the regression is inner-city, meaning that the intercept represents the average math score of students in inner-city schools and that the other coefficients should be interpreted relative to that expected score. In particular, on average, students in rural schools score 30 points higher than those in inner-city schools, those in suburban schools score almost 17 points higher, and those in urban schools score about 20 points higher. Inner cities tend to be poorer areas, which might explain why students there score so much lower than students in other locations. An omitted variable might be some student- or school-level measure of disadvantage.
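
As before, intercept plus coefficient reproduces each group’s mean, which we can verify directly (a quick sketch):

# e.g. rural: 561.56 + 30.18 ≈ 591.74
star_df_2 %>%
    group_by(school) %>%
    summarise(mean_math = mean(math))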

2. Compute the share of students qualifying for free lunch (i.e. lunch equals “free”) by school location category. What do you observe? Add lunch to the previous question’s regression. How do the coefficients change?

star_df_2 %>%
    group_by(school) %>%
    summarise(mean(lunch == "free"))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   school     `mean(lunch == "free")`
##   <chr>                        <dbl>
## 1 inner-city                   0.897
## 2 rural                        0.389
## 3 suburban                     0.335
## 4 urban                        0.443

As expected, the share of students qualifying for free lunches, a common indicator of disadvantage, is significantly higher in inner-city schools (almost 90% of students) than elsewhere, where it ranges between roughly 34% and 44%.

lm(math ~ school + lunch, star_df_2)
## 
## Call:
## lm(formula = math ~ school + lunch, data = star_df_2)
## 
## Coefficients:
##    (Intercept)     schoolrural  schoolsuburban     schoolurban   lunchnon-free  
##        559.228          18.750           4.289          10.114          22.517

We observe that the coefficients on the rural, suburban and urban categories decrease dramatically once the free lunch status of students is taken into account. Note also that students not qualifying for free lunch score significantly higher, on average, than those who do, controlling for the school’s location category.
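
To see the change at a glance, we can put the school coefficients from the two models side by side (a minimal sketch; reg_school and reg_school_lunch are hypothetical names for the regressions above):

reg_school       <- lm(math ~ school, star_df_2)
reg_school_lunch <- lm(math ~ school + lunch, star_df_2)
# the school coefficients shrink once free lunch status is controlled for
cbind(without_lunch = coef(reg_school)[2:4],
      with_lunch    = coef(reg_school_lunch)[2:4])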

3. Regress math on gender and experience (the teacher’s experience). Interpret the coefficients. What would these regression results look like visually?

lm(math ~ gender + experience, star_df_2)
## 
## Call:
## lm(formula = math ~ gender + experience, data = star_df_2)
## 
## Coefficients:
## (Intercept)   gendermale   experience  
##   581.35255     -0.91383      0.03453

The intercept corresponds to the expected math score of female students whose teacher has zero years of experience. Recall that because we have both a numeric variable and a dummy variable as regressors, the intercept can be interpreted as the intercept of the regression line for the omitted category, in this case female students. The coefficient on gender represents the expected difference in math scores between male and female students, holding teacher experience constant. It is negative but small, implying that, on average, male students score slightly lower than their female peers. Graphically, this means that the line for male students lies below the line for female students, with the same slope. The coefficient on experience corresponds to the expected change in math score associated, on average, with a one-year increase in the teacher’s experience, holding student gender constant. As such, an additional year of teacher experience is associated, on average, with a 0.03 point increase in math scores. This is a tiny effect: a teacher would need 100 more years of experience for expected math scores to increase by about 3 points.
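
One way to visualize this, sketched below, is to plot each group’s fitted line: two parallel lines with slope 0.035, with the male line sitting about 0.9 points below the female one (reg_gender is a hypothetical name for the model estimated above):

reg_gender <- lm(math ~ gender + experience, star_df_2)

star_df_2 %>%
    mutate(pred = predict(reg_gender)) %>%
    ggplot(aes(x = experience, y = pred, colour = gender)) +
    geom_line() +  # two parallel fitted lines, one per gender
    labs(x = "teacher experience (years)", y = "predicted math score")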

4. Regress math on star. After interpreting the coefficients, regress math on star, gender, ethnicity, lunch, degree, experience and school. Recalling that this is a randomized experiment, does it look like the randomization was well done?

lm(math ~ star, star_df_2)
## 
## Call:
## lm(formula = math ~ star, data = star_df_2)
## 
## Coefficients:
##      (Intercept)  starregular+aide         starsmall  
##          577.696             2.696             8.942

The interpretation of the coefficients is, as usual, a comparison of conditional means: students in small classes score, on average, 8.94 points higher than those in regular classes (the omitted category), while students in regular+aide classes score only 2.70 points higher.

reg_all <- lm(math ~ star + gender + ethnicity + lunch + degree + experience + school, star_df_2)
reg_all
## 
## Call:
## lm(formula = math ~ star + gender + ethnicity + lunch + degree + 
##     experience + school, data = star_df_2)
## 
## Coefficients:
##       (Intercept)   starregular+aide          starsmall         gendermale  
##         557.64898            2.01899            7.89579           -1.07263  
## ethnicityamindian     ethnicityasian      ethnicitycauc  ethnicityhispanic  
##         -21.91483           42.16429           18.61906           15.04510  
##    ethnicityother      lunchnon-free       degreemaster          degreephd  
##          54.67051           18.40555           -1.29047            0.97823  
##  degreespecialist         experience        schoolrural     schoolsuburban  
##          12.11634           -0.05146            3.99491           -4.65503  
##       schoolurban  
##          -3.97039

The coefficients on the treatment variable star change only slightly compared with the previous simple regression. This is what we would expect from a randomized experiment, suggesting that the randomization worked well. If the randomization had been very poor, accounting for all these student, teacher and school characteristics would alter the coefficients on star much more substantially.
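
A compact way to check this, as a quick sketch, is to line up the treatment coefficients from the two models:

# the star coefficients barely move once controls are added
cbind(simple        = coef(lm(math ~ star, star_df_2))[2:3],
      with_controls = coef(reg_all)[2:3])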

5. What’s the adjusted \(R^2\) from the previous multiple regression? How do you interpret it? What might you deduce about the importance of observable individual, teacher and school characteristics in explaining educational outcomes?

summary(reg_all)
## 
## Call:
## lm(formula = math ~ star + gender + ethnicity + lunch + degree + 
##     experience + school, data = star_df_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -137.882  -27.033   -1.892   25.174  144.834 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       557.64898    1.68995 329.979  < 2e-16 ***
## starregular+aide    2.01899    1.30760   1.544 0.122635    
## starsmall           7.89579    1.36425   5.788 7.52e-09 ***
## gendermale         -1.07263    1.08674  -0.987 0.323675    
## ethnicityamindian -21.91483   28.97355  -0.756 0.449457    
## ethnicityasian     42.16429   11.90591   3.541 0.000401 ***
## ethnicitycauc      18.61906    1.76609  10.543  < 2e-16 ***
## ethnicityhispanic  15.04510   14.53924   1.035 0.300810    
## ethnicityother     54.67051   13.71246   3.987 6.78e-05 ***
## lunchnon-free      18.40555    1.25646  14.649  < 2e-16 ***
## degreemaster       -1.29047    1.17370  -1.099 0.271603    
## degreephd           0.97823    7.08286   0.138 0.890157    
## degreespecialist   12.11634    5.50130   2.202 0.027674 *  
## experience         -0.05146    0.06398  -0.804 0.421259    
## schoolrural         3.99491    2.09764   1.904 0.056898 .  
## schoolsuburban     -4.65503    1.90889  -2.439 0.014775 *  
## schoolurban        -3.97039    2.91817  -1.361 0.173703    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.86 on 5655 degrees of freedom
## Multiple R-squared:  0.1489, Adjusted R-squared:  0.1465 
## F-statistic: 61.85 on 16 and 5655 DF,  p-value: < 2.2e-16

The adjusted \(R^2\) of the previous regression is about 0.15, which means that the model explains about 15% of the variance in students’ math scores. This implies that 85% of that variance remains unaccounted for. In simple terms, class size, gender, ethnicity, free lunch status, teacher experience and degree, and school location explain very little of the differences in math scores between the students in this dataset. Extrapolating a bit, one may argue that most of the variation in educational outcomes is not explained by these observable factors. This DOES NOT mean, however, that they cannot have important causal effects on educational outcomes.
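
For reference, the adjusted \(R^2\) can also be extracted directly from the summary object rather than read off the printout:

summary(reg_all)$adj.r.squared  # ≈ 0.146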