1. Load the data from here as grades
. You need to find the function that enables importing .dta files. (FYI: .dta is the extension for data files used in Stata)
library(haven)
link <- "https://www.dropbox.com/s/wwp2cs9f0dubmhr/grade5.dta?dl=1"
grades <- read_dta(link)
2. Describe the dataset:
The unit of observation is the class.
nrow(grades)
## [1] 2019
names(grades)
## [1] "schlcode" "school_enrollment" "grade"
## [4] "classize" "avgmath" "avgverb"
## [7] "disadvantaged" "female" "religious"
Note that if you view the data, under each column name you have the variable’s label, which is very convenient.
avgmath
and avgverb
correspond to?The avgmath
and avgverb
variables correspond to the class’ average math and average verb scores.
skim
function from the skimr
package to obtain common summary statistics for the variables classize
, avgmath
and avgverb
. (Hint: use dplyr
to select
the variables and then simply pipe (%>%
) skim()
.)library(skimr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
grades %>%
select(classize, avgmath, avgverb) %>%
skim()
Name | Piped data |
Number of rows | 2019 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
classize | 0 | 1 | 29.93 | 6.56 | 5.00 | 26.00 | 31.00 | 35.00 | 44.00 | ▁▁▆▇▃ |
avgmath | 0 | 1 | 67.30 | 9.60 | 27.69 | 61.12 | 67.82 | 74.08 | 93.93 | ▁▂▇▇▁ |
avgverb | 0 | 1 | 74.38 | 7.68 | 34.80 | 69.83 | 75.38 | 79.84 | 93.86 | ▁▁▃▇▂ |
The skim
function provides some useful statistics about a variable. There are no missing values for the class size and average scores variables. Class sizes range from 5 students to 44 students, with the average class around 30 students. There are relatively few small classes (25th percentile is 26 students). Also note that average math scores were slightly lower than average verb scores and that the former’s distribution is more dispersed than the latter’s.
3. Do you have any priors about the actual (linear) relationship between class size and student achievement? What would you do to get a first insight?
On the one hand, smaller classes may offer the opportunity for more individualised learning and teachers may more easily monitor low-performing students. On the other hand, if teachers are not trained to properly take advantage of smaller groups then perhaps smaller classes might not be that beneficial. Moreover, depending on the country, smaller classes may be more prevalent in poorer areas, which would yield a positive relationship between class size and student achievement. Conversely, more conscientious parents may choose to locate in areas with smaller class sizes, thinking their children would benefit from them, in which case a negative relationship is likely to be found between class size and student performance.
A scatter plot would provide a first insight into the relationship between class size and student achievement.
library(cowplot)
scatter_verb <- grades %>%
ggplot() +
aes(x = classize, y = avgverb) +
geom_point() +
scale_x_continuous(limits = c(0, 45), breaks = seq(0,45,5)) +
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 20)) +
labs(x = "Class size",
y = "Average reading score")
scatter_math <- grades %>%
ggplot() +
aes(x = classize, y = avgmath) +
geom_point() +
scale_x_continuous(limits = c(0, 45), breaks = seq(0,45,5)) +
scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 20)) +
labs(x = "Class size",
y = "Average math score")
plot_grid(scatter_verb, scatter_math, labels = c("Reading", "Mathematics"))
4. Compute the correlation between class size and math and verbal scores. Is the relationship positive/negative, strong/weak?
grades %>%
summarise(cor_verb = cor(classize, avgverb),
cor_math = cor(classize, avgmath))
## # A tibble: 1 x 2
## cor_verb cor_math
## <dbl> <dbl>
## 1 0.190 0.217
Both correlations are positive, consistent with the positive association suggested by the previous scatter plots. However, they are quite weak, around 0.2, suggesting the relationship isn’t very strong. Recall that a correlation of 1 implies a perfect positive linear relationship, i.e. all points are on the same line, while a correlation of -1 implies a perfect negative linear relationship. A correlation of 0 implies no linear association whatsoever.
Run the following code to aggregate the data at the class size level:
grades_avg_cs <- grades %>%
group_by(classize) %>%
summarise(avgmath_cs = mean(avgmath),
avgverb_cs = mean(avgverb))
## `summarise()` ungrouping output (override with `.groups` argument)
1. Compute the OLS coefficients \(b_0\) and \(b_1\) of the previous regression using the formulas on slide 25. (Hint: you need to use the cov
, var
, and mean
functions.)
cov_x_y = grades_avg_cs %>%
summarise(cov(classize, avgmath_cs))
var_x = var(grades_avg_cs$classize)
b_1 = cov_x_y / var_x
b_1
## cov(classize, avgmath_cs)
## 1 0.191271
y_bar = mean(grades_avg_cs$avgmath_cs)
x_bar = mean(grades_avg_cs$classize)
b_0 = y_bar - b_1 * x_bar
b_0
## cov(classize, avgmath_cs)
## 1 61.10917
2. Regress average verbal score (dependent variable) on class size (independant variable). Interpret the coefficients.
lm(avgverb_cs ~ classize, grades_avg_cs)
##
## Call:
## lm(formula = avgverb_cs ~ classize, data = grades_avg_cs)
##
## Coefficients:
## (Intercept) classize
## 69.1861 0.1332
Interpretation for \(b_0\): The predicted verbal score when class size has no students is 69.19. Note that this makes little sense because if a class has no students then there cannot be any scores! You should be warry of out-of-sample predictions, i.e. predictions for values that are outside the range of values in the data sample.
Interpretation for \(b_1\): On average, a class size increase of one student is associated with a 0.13 increase in average verbal score.
3. Is the slope coefficient similar to the one found for average math score? Was it expected based on the graphical evidence?
The slope coefficient, \(b_1\), is very similar to the one found for average math sore. This was expected considering the scatter plots were pretty close for both scores.
4. What is the predicted average verbal score when class size is equal to 0? (Does that even make sense?!)
As noted above, the predicted average verbal score when class size is equal to 0 is 69.19, and that makes no sense!
5. What is the predicted average verbal score when the class size is equal to 30 students?
The predicted average verbal score when the class size is equal to 30 students is: 69.19 + 0.13 x 30 = 73.09.
1. Regress avgmath_cs
on classize
. Assign to an object math_reg
.
math_reg <- lm(avgmath_cs ~ classize, grades_avg_cs)
2. Pass math_reg
in the summary()
function. What is the (multiple) \(R^2\) for this regression? How can you interpret it?
summary(math_reg)
##
## Call:
## lm(formula = avgmath_cs ~ classize, data = grades_avg_cs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7543 -1.7985 0.3109 1.4768 11.9345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.10917 1.43598 42.556 < 2e-16 ***
## classize 0.19127 0.05175 3.696 0.000725 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.528 on 36 degrees of freedom
## Multiple R-squared: 0.2751, Adjusted R-squared: 0.2549
## F-statistic: 13.66 on 1 and 36 DF, p-value: 0.0007249
The \(R^2\) is equal to approx. 0.28, which implies that around 28% of the variation in average math score is explained by class size.
3. Compute the squared correlation between classize
and avgmath_cs
. What does this tell you about the relationship between \(R^2\) and the correlation in a regression with only one regressor?
grades_avg_cs %>%
summarise(cor_sq = cor(classize, avgmath_cs)^2)
## # A tibble: 1 x 1
## cor_sq
## <dbl>
## 1 0.275
In a regression with only one regressor the \(R^2\) is equal to the square of the correlation between the dependent and independent variables.
4. Install and load the broom
package. Pass math_reg
in the augment()
function and assign it to a new object. Use the variance in avgmath_cs
(SST) and the variance in .fitted
(predicted values; SSE) to find the \(R^2\) using the formula on the previous slide.
library(broom)
math_reg_aug <- augment(math_reg)
SST = var(grades_avg_cs$avgmath_cs)
SSE = var(math_reg_aug$.fitted)
SSE/SST
## [1] 0.275055
5. Repeat steps 1 and 2 for avgverb_cs
. For which exam does the variance in class size explain more of the variance in students’ scores?
verb_reg <- lm(avgverb_cs ~ classize, grades_avg_cs)
summary(verb_reg)
##
## Call:
## lm(formula = avgverb_cs ~ classize, data = grades_avg_cs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0152 -0.2451 0.8818 1.8443 8.8323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.18610 1.69180 40.895 <2e-16 ***
## classize 0.13316 0.06097 2.184 0.0356 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.157 on 36 degrees of freedom
## Multiple R-squared: 0.117, Adjusted R-squared: 0.09246
## F-statistic: 4.77 on 1 and 36 DF, p-value: 0.03557
The \(R^2\) is greater for maths than for reading, therefore class size explains more of the variation in math scores than in reading scores.