Let’s analyse the regression results using reading score as the dependent variable.
1. Load the data from here using the `read_dta()` function from the `haven` package. Assign it to an object `grades`.
library(haven)
grades <- read_dta("https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1")
2. Regress `avgverb` on `classize` and `disadvantaged` and assign the output to a new object `reg`. Interpret the coefficients. How do they compare with the math score regression coefficients?
reg <- lm(avgverb ~ classize + disadvantaged, grades)
reg
##
## Call:
## lm(formula = avgverb ~ classize + disadvantaged, data = grades)
##
## Coefficients:
## (Intercept) classize disadvantaged
## 80.23878 -0.03083 -0.34967
The intercept, which measures the expected reading score when class size and the percentage of disadvantaged students in the class are both 0, equals 80. As we’ve discussed in class, this doesn’t make any “real-world” sense: if there are no students, there cannot be a test score. The coefficient on class size is -0.03, which means that a 1-student increase in class size, controlling for the percentage of disadvantaged students, is, on average, associated with a 0.03 point decrease in average reading score. The coefficient is negative, whereas it was positive when analysing math score as the outcome. However, the coefficient is very small considering reading is scored on a scale from 0 to 100. Lastly, the coefficient on disadvantaged is equal to -0.35, which means that keeping class size constant, a 1 percentage point increase in the share of disadvantaged students is associated, on average, with a 0.35 point decrease in average reading score. The magnitude of this coefficient is very similar to that found for math scores. Again, it is very small: a 10 percentage point increase in disadvantaged students (quite a large increase) would only be associated, on average, with a 3.5 point decrease in reading score, keeping class size constant.
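As a quick sanity check on this interpretation, we can compute a predicted score directly from the fitted coefficients. A minimal sketch, using the `reg` object above (the covariate values 30 and 20 are just illustrative):

```r
# Predicted average reading score for a class of 30 students
# where 20% of students are disadvantaged, computed by hand...
b <- coef(reg)
b["(Intercept)"] + b["classize"] * 30 + b["disadvantaged"] * 20

# ...and the same prediction via predict() on a one-row data frame
predict(reg, newdata = data.frame(classize = 30, disadvantaged = 20))
```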
3. What are the other available variables that we may add in the regression?
    * Run the regression with all these variables and assign it to `reg_full`.
    * Look at the coefficients.
    * Discuss all coefficients: sign and magnitude.
reg_full <- lm(avgverb ~ classize + disadvantaged + school_enrollment + female + religious, grades)
reg_full
##
## Call:
## lm(formula = avgverb ~ classize + disadvantaged + school_enrollment +
## female + religious, data = grades)
##
## Coefficients:
## (Intercept) classize disadvantaged school_enrollment
## 78.5607253 0.0033208 -0.3893330 0.0007583
## female religious
## 0.9237105 2.8761467
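To make the comparison with the short regression easier, we can line up the coefficients the two models have in common. A small sketch, using the `reg` and `reg_full` objects defined above:

```r
# Shared coefficients of the short and full regressions, side by side
round(cbind(short = coef(reg),
            full  = coef(reg_full)[names(coef(reg))]), 4)
```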
The class size coefficient flips sign but remains tiny (0.003), the coefficient on disadvantaged is essentially unchanged (about -0.39), and the coefficients on female and religious are both positive. Still struggling with the dummy variable trap? Let’s run another regression where there is perfect linear dependence between the regressors.
1. Load the STAR data from here, using `read.csv`, and assign it to an object called `star_df`. Keep only cases with no `NA`s with the following code:
star_df <- star_df[complete.cases(star_df),]
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
star_df <- read.csv("https://www.dropbox.com/s/bf1fog8yasw3wjj/star_data.csv?dl=1")
star_df <- star_df[complete.cases(star_df),]
star_df_2 <- star_df %>%
  filter(grade == "2")
2. Create three dummy variables: (i) `small` equal to `TRUE` if students are in a small class and `FALSE` otherwise; (ii) `regular` equal to `TRUE` if students are in a regular class and `FALSE` otherwise; (iii) `regular_plus` equal to `TRUE` if students are in a regular+aide class and `FALSE` otherwise. Create a last variable, `sum`, equal to the sum of `small`, `regular` and `regular_plus`. What is `sum` equal to? What does this mean?
star_df_2 <- star_df_2 %>%
  mutate(small = (star == "small"),
         regular = (star == "regular"),
         regular_plus = (star == "regular+aide"),
         sum = small + regular + regular_plus)
star_df_2 %>% count(sum)
## sum n
## 1 1 5672
The `sum` variable is always equal to 1, which means that `small`, `regular` and `regular_plus` are perfectly collinear. In other words, each can be written as a linear combination of the other two, e.g. `small` = 1 - (`regular` + `regular_plus`).
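A one-line check of this claim (a sketch; logicals are coerced to 0/1 in arithmetic comparisons):

```r
# Verify that small = 1 - (regular + regular_plus) holds for every row
with(star_df_2, all(small == 1 - (regular + regular_plus)))
```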
3. Regress `math` on `small`. What is the average predicted `math` score of students in a small class? Do the same for `regular` and `regular_plus`.
reg_small <- lm(math ~ small, star_df_2)
reg_small
##
## Call:
## lm(formula = math ~ small, data = star_df_2)
##
## Coefficients:
## (Intercept) smallTRUE
## 579.074 7.564
Note that our regressor is a dummy variable. Remember from the lecture on causality that when the regressor is a dummy variable, the intercept corresponds to the conditional mean of the outcome variable for the `FALSE` category, while the slope coefficient corresponds to the difference in expected outcomes between the `TRUE` and the `FALSE` category. In this case, the regression tells us that the expected math score of students not in a small class is around 579.07 points, and that the difference in expected math score between students in a small class and those not in a small class is 7.56 points. As such, the expected math score of students in a small class is approximately 579.07 + 7.56 ≈ 586.64.
For those of you who prefer seeing this formally, this is how you can (and should) proceed: \[ \mathbb{E}(\text{math score } | \text{ small} = 0) = b_0 + b_1 \times 0 = b_0 \approx 579.07 \\ \mathbb{E}(\text{math score } | \text{ small} = 1) = b_0 + b_1 \times 1 = b_0 + b_1 \approx 586.64 \]
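We can double-check that the regression really does reproduce these conditional means by computing them directly, for example with the `dplyr` verbs loaded earlier:

```r
# Mean math score by small-class status; the two group means
# should match b0 and b0 + b1 from reg_small
star_df_2 %>%
  group_by(small) %>%
  summarise(mean_math = mean(math))
```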
reg_regular <- lm(math ~ regular, star_df_2)
reg_regular
##
## Call:
## lm(formula = math ~ regular, data = star_df_2)
##
## Coefficients:
## (Intercept) regularTRUE
## 583.231 -5.535
reg_plus <- lm(math ~ regular_plus, star_df_2)
reg_plus
##
## Call:
## lm(formula = math ~ regular_plus, data = star_df_2)
##
## Coefficients:
## (Intercept) regular_plusTRUE
## 581.859 -1.466
Using the exact same logic as above, we obtain: \[ \mathbb{E}(\text{math score } | \text{ regular} = 0) = b_0 + b_1 \times 0 = b_0 \approx 583.23 \\ \mathbb{E}(\text{math score } | \text{ regular} = 1) = b_0 + b_1 \times 1 = b_0 + b_1 \approx 577.7 \\ \mathbb{E}(\text{math score } | \text{ regular+aide} = 0) = b_0 + b_1 \times 0 = b_0 \approx 581.86 \\ \mathbb{E}(\text{math score } | \text{ regular+aide} = 1) = b_0 + b_1 \times 1 = b_0 + b_1 \approx 580.39 \]
4. Regress `math` on `small`, `regular` and `regular_plus`. What do you notice? What’s the omitted (reference) category? Does this match the previous questions?
reg_full <- lm(math ~ small + regular + regular_plus, star_df_2)
reg_full
##
## Call:
## lm(formula = math ~ small + regular + regular_plus, data = star_df_2)
##
## Coefficients:
## (Intercept) smallTRUE regularTRUE regular_plusTRUE
## 580.393 6.246 -2.696 NA
We notice that the `regular_plus` variable has been dropped, i.e. its coefficient cannot be estimated because it is collinear with `small` and `regular`. This means that the reference category is regular+aide. In other words, the coefficients on `small` and `regular` are to be interpreted relative to the outcomes of the regular+aide students. To be specific, and using the same formal interpretation as above, denoting the model as \(\text{math score} = b_0 + b_1 \text{small} + b_2 \text{regular} + e\), we obtain:
\[\begin{align} b_0 &= \mathbb{E}(\text{math score } | \text{ small} = 0 \text{ & } \text{ regular} = 0) \\ &= \mathbb{E}(\text{math score } | \text{ regular + aide} = 1) \\ &\approx 580.39 \end{align}\]
You can compare this result with the answer to the previous question and see that we obtain exactly the same answer.
In the same way, we get:
\[\begin{align} b_1 &= \mathbb{E}(\text{math score } | \text{ small} = 1 \text{ & } \text{ regular} \in \{0,1\}) - \mathbb{E}(\text{math score } | \text{ small} = 0 \text{ & } \text{ regular} \in \{0,1\}) \\ &= b_0 + b_1 \times 1 + b_2 \times \text{regular} - (b_0 + b_1 \times 0 + b_2 \times \text{regular}) \\ &\approx 6.25 \end{align}\]
\[\begin{align} b_2 &= \mathbb{E}(\text{math score } | \text{ small} \in \{0,1\} \text{ & } \text{ regular} = 1) - \mathbb{E}(\text{math score } | \text{ small} \in \{0,1\} \text{ & } \text{ regular} = 0) \\ &= b_0 + b_1 \times \text{small} + b_2 \times 1 - (b_0 + b_1 \times \text{small} + b_2 \times 0) \\ &\approx -2.7 \end{align}\]
In plain English: relative to the regular+aide group, students in small classes score on average 6.25 points higher on the math exam, while students in regular classes score on average 2.7 points lower.
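To confirm that this matches the previous questions, we can ask the model for its fitted value for each class type (a sketch; `R` will warn about prediction from a rank-deficient fit because of the dropped dummy, which is exactly the point here):

```r
# Fitted values for the three class types; these are the three
# conditional means implied by reg_full and should match the
# predictions from the separate regressions above
predict(reg_full,
        newdata = data.frame(small        = c(TRUE, FALSE, FALSE),
                             regular      = c(FALSE, TRUE, FALSE),
                             regular_plus = c(FALSE, FALSE, TRUE)))
```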
5. Regress `math` on `star`. What do you notice? What’s the omitted category? Interpret the coefficients.
lm(math ~ star, star_df_2)
##
## Call:
## lm(formula = math ~ star, data = star_df_2)
##
## Coefficients:
## (Intercept) starregular+aide starsmall
## 577.696 2.696 8.942
`R` actually creates the dummy variables for us, so we don’t need to do the tedious work above of creating an individual dummy variable for each category! The omitted category is `regular`, therefore the coefficients are to be interpreted as differences in average math scores relative to students in `regular` classes.
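If you ever want a different reference category, you can set it yourself before running the regression. A sketch, assuming `star` was read in as a character column (the default with `read.csv` here), so we first convert it to a factor; `star_small_ref` is a hypothetical name:

```r
# Make "small" the reference category instead of "regular"
star_small_ref <- star_df_2 %>%
  mutate(star = relevel(factor(star), ref = "small"))
lm(math ~ star, star_small_ref)
```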
Let’s use the STAR data (used in the previous task) to review the main concepts covered.
1. Using the filtered data from the previous task, regress `math` on `school` (tabulate the variable to know what it contains). Interpret the coefficients. What’s the omitted category? Do you find them surprising? Why? What might be an omitted variable?
star_df_2 %>% count(school)
## school n
## 1 inner-city 1199
## 2 rural 2665
## 3 suburban 1476
## 4 urban 332
lm(math ~ school, star_df_2)
##
## Call:
## lm(formula = math ~ school, data = star_df_2)
##
## Coefficients:
## (Intercept) schoolrural schoolsuburban schoolurban
## 561.56 30.18 16.93 20.33
The `school` variable is a categorical variable indicating the school’s location type. There are four categories: inner-city, rural, suburban and urban.
The omitted category in the regression is inner-city, meaning that the intercept represents the average math score of students in inner-city schools and that the other coefficients should be interpreted relative to the expected math score of students in inner-city schools. In particular, on average, students in rural schools score 30 points higher than those in inner-city schools, those in suburban schools score almost 17 points higher, and those in urban schools score about 20 points higher. Inner cities tend to be poorer areas, which might explain why inner-city students score so much lower than students in other locations. An omitted variable might be some student- or school-level measure of disadvantage.
2. Compute the share of students qualifying for free lunch (i.e. `lunch` equals “free”) by school location category. What do you observe? Add `lunch` to the previous question’s regression. How do the coefficients change?
star_df_2 %>%
  group_by(school) %>%
  summarise(mean(lunch == "free"))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## school `mean(lunch == "free")`
## <chr> <dbl>
## 1 inner-city 0.897
## 2 rural 0.389
## 3 suburban 0.335
## 4 urban 0.443
As expected, the share of students qualifying for free lunch, a common indicator of disadvantage, is significantly higher in inner-city schools (almost 90%) than in other locations, where it ranges between 34% and 44%.
lm(math ~ school + lunch, star_df_2)
##
## Call:
## lm(formula = math ~ school + lunch, data = star_df_2)
##
## Coefficients:
## (Intercept) schoolrural schoolsuburban schoolurban lunchnon-free
## 559.228 18.750 4.289 10.114 22.517
We observe that the coefficients on the rural, suburban and urban categories decrease dramatically once the students’ free lunch status is taken into account. Note also that students not qualifying for free lunch score significantly higher, on average, than those who do, controlling for the school’s location category.
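To see the change at a glance, we can refit both models and line up the location coefficients (a small sketch; `m1` and `m2` are hypothetical names since the models above were not stored):

```r
# School-location coefficients before and after controlling for
# free lunch status; the shrinkage hints at omitted variable bias
m1 <- lm(math ~ school, star_df_2)
m2 <- lm(math ~ school + lunch, star_df_2)
round(cbind(without_lunch = coef(m1),
            with_lunch    = coef(m2)[names(coef(m1))]), 2)
```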
3. Regress `math` on `gender` and `experience` (the teacher’s experience). Interpret the coefficients. What would these regression results look like visually?
lm(math ~ gender + experience, star_df_2)
##
## Call:
## lm(formula = math ~ gender + experience, data = star_df_2)
##
## Coefficients:
## (Intercept) gendermale experience
## 581.35255 -0.91383 0.03453
The intercept corresponds to the expected math score of female students taught by a teacher with no experience. Recall that because we have both a numeric variable and a dummy variable as regressors, the intercept can be interpreted as the intercept of the regression line for the omitted category, in this case female. The coefficient on gender represents the expected difference in math scores between male and female students, holding teacher experience constant. The coefficient is negative but small, implying that on average male students score slightly lower than their female peers with equally experienced teachers. Graphically, this implies that the line for male students lies below that for female students. The coefficient on experience corresponds to the expected change in math score associated, on average, with a 1-year increase in a teacher’s experience, accounting for student gender. As such, an additional year of teacher experience is associated, on average, with a 0.03 point increase in math scores, keeping gender constant. This is a very small effect: it would take 100 additional years of experience for expected math scores to increase by roughly 3 points.
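Visually, then, the model corresponds to two parallel lines in the (experience, math) plane, one per gender, with the male line shifted down by 0.91 points. A minimal `ggplot2` sketch (the model is refitted and stored under the hypothetical name `reg_ge`):

```r
# Two parallel regression lines implied by a dummy + numeric model
reg_ge <- lm(math ~ gender + experience, star_df_2)
b <- coef(reg_ge)

ggplot(star_df_2, aes(x = experience, y = math)) +
  geom_point(alpha = 0.1) +
  # female line: intercept b0, slope b_experience
  geom_abline(intercept = b["(Intercept)"],
              slope = b["experience"], colour = "red") +
  # male line: same slope, intercept shifted by the gender coefficient
  geom_abline(intercept = b["(Intercept)"] + b["gendermale"],
              slope = b["experience"], colour = "blue")
```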
4. Regress `math` on `star`. After interpreting the coefficients, regress `math` on `star`, `gender`, `ethnicity`, `lunch`, `degree`, `experience` and `school`. Recalling that this is a randomized experiment, does it look like the randomization was well done?
lm(math ~ star, star_df_2)
##
## Call:
## lm(formula = math ~ star, data = star_df_2)
##
## Coefficients:
## (Intercept) starregular+aide starsmall
## 577.696 2.696 8.942
The interpretation of the coefficients is, as usual, a comparison of conditional means. That is, students in small classes score, on average, 8.94 points higher than those in regular classes (the omitted category), while students in regular+aide classes score only 2.70 points higher.
reg_all <- lm(math ~ star + gender + ethnicity + lunch + degree + experience + school, star_df_2)
reg_all
##
## Call:
## lm(formula = math ~ star + gender + ethnicity + lunch + degree +
## experience + school, data = star_df_2)
##
## Coefficients:
## (Intercept) starregular+aide starsmall gendermale
## 557.64898 2.01899 7.89579 -1.07263
## ethnicityamindian ethnicityasian ethnicitycauc ethnicityhispanic
## -21.91483 42.16429 18.61906 15.04510
## ethnicityother lunchnon-free degreemaster degreephd
## 54.67051 18.40555 -1.29047 0.97823
## degreespecialist experience schoolrural schoolsuburban
## 12.11634 -0.05146 3.99491 -4.65503
## schoolurban
## -3.97039
The coefficients on the treatment variable `star` decrease only slightly compared to the previous simple regression. This is what we expect from a randomized experiment, so the randomization seems to have worked. If the randomization had been very poor, accounting for all these factors would alter the coefficients on `star` much more substantially.
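A complementary, informal way to probe the randomization is to check whether pre-determined student characteristics are balanced across the three class types. A quick sketch using variables already in the data:

```r
# If assignment was random, these shares should be similar
# across the three star groups
star_df_2 %>%
  group_by(star) %>%
  summarise(share_free_lunch = mean(lunch == "free"),
            share_male = mean(gender == "male"))
```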
5. What’s the adjusted \(R^2\) from the previous multiple regression? How do you interpret it? What might you deduce about the importance of observable individual, teacher and school characteristics in explaining educational outcomes?
summary(reg_all)
##
## Call:
## lm(formula = math ~ star + gender + ethnicity + lunch + degree +
## experience + school, data = star_df_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137.882 -27.033 -1.892 25.174 144.834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 557.64898 1.68995 329.979 < 2e-16 ***
## starregular+aide 2.01899 1.30760 1.544 0.122635
## starsmall 7.89579 1.36425 5.788 7.52e-09 ***
## gendermale -1.07263 1.08674 -0.987 0.323675
## ethnicityamindian -21.91483 28.97355 -0.756 0.449457
## ethnicityasian 42.16429 11.90591 3.541 0.000401 ***
## ethnicitycauc 18.61906 1.76609 10.543 < 2e-16 ***
## ethnicityhispanic 15.04510 14.53924 1.035 0.300810
## ethnicityother 54.67051 13.71246 3.987 6.78e-05 ***
## lunchnon-free 18.40555 1.25646 14.649 < 2e-16 ***
## degreemaster -1.29047 1.17370 -1.099 0.271603
## degreephd 0.97823 7.08286 0.138 0.890157
## degreespecialist 12.11634 5.50130 2.202 0.027674 *
## experience -0.05146 0.06398 -0.804 0.421259
## schoolrural 3.99491 2.09764 1.904 0.056898 .
## schoolsuburban -4.65503 1.90889 -2.439 0.014775 *
## schoolurban -3.97039 2.91817 -1.361 0.173703
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.86 on 5655 degrees of freedom
## Multiple R-squared: 0.1489, Adjusted R-squared: 0.1465
## F-statistic: 61.85 on 16 and 5655 DF, p-value: < 2.2e-16
The adjusted \(R^2\) of the previous regression is about 0.15, which means that the model explains about 15% of the variance in students’ math scores. This implies that 85% of the variance in math scores remains unaccounted for. In simple terms, class size, gender, ethnicity, free lunch status, teacher experience and degree, and school location explain very little of the differences in math scores between the students in this dataset. Extrapolating a bit, one may argue that most of the differences in educational outcomes do not appear to be explained by these observable factors. This DOES NOT mean that they cannot have important causal effects on educational outcomes.
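If you only need the number itself rather than the full regression table, you can extract it directly from the summary object:

```r
# Pull the adjusted R-squared out of summary(reg_all)
summary(reg_all)$adj.r.squared
```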