lm()
# load pacman
library(pacman)
# use p_load to load multiple packages at once
p_load(AER, tidyverse, huxtable, car)
# data() loads specified data set, in this case "CASchools"
data("CASchools")
As usual, we load pacman first and use p_load
function to attach several packages that we will be using today; AER
package contains the dataset that we will do our analysis, 2) tidyverse
contains functions such as pipe operator that are useful for data analysis, 3) huxtable
contains functions such as huxreg
which produces nice formatted regression results. ### Learn about the dataset
## district school county grades
## Length:420 Length:420 Sonoma : 29 KK-06: 61
## Class :character Class :character Kern : 27 KK-08:359
## Mode :character Mode :character Los Angeles: 27
## Tulare : 24
## San Diego : 21
## Santa Clara: 20
## (Other) :272
## students teachers calworks lunch
## Min. : 81.0 Min. : 4.85 Min. : 0.000 Min. : 0.00
## 1st Qu.: 379.0 1st Qu.: 19.66 1st Qu.: 4.395 1st Qu.: 23.28
## Median : 950.5 Median : 48.56 Median :10.520 Median : 41.75
## Mean : 2628.8 Mean : 129.07 Mean :13.246 Mean : 44.71
## 3rd Qu.: 3008.0 3rd Qu.: 146.35 3rd Qu.:18.981 3rd Qu.: 66.86
## Max. :27176.0 Max. :1429.00 Max. :78.994 Max. :100.00
##
## computer expenditure income english
## Min. : 0.0 Min. :3926 Min. : 5.335 Min. : 0.000
## 1st Qu.: 46.0 1st Qu.:4906 1st Qu.:10.639 1st Qu.: 1.941
## Median : 117.5 Median :5215 Median :13.728 Median : 8.778
## Mean : 303.4 Mean :5312 Mean :15.317 Mean :15.768
## 3rd Qu.: 375.2 3rd Qu.:5601 3rd Qu.:17.629 3rd Qu.:22.970
## Max. :3324.0 Max. :7712 Max. :55.328 Max. :85.540
##
## read math
## Min. :604.5 Min. :605.4
## 1st Qu.:640.4 1st Qu.:639.4
## Median :655.8 Median :652.5
## Mean :655.0 Mean :653.3
## 3rd Qu.:668.7 3rd Qu.:665.9
## Max. :704.0 Max. :709.5
##
## Rows: 420
## Columns: 14
## $ district <chr> "75119", "61499", "61549", "61457", "61523", "62042", "685…
## $ school <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito …
## $ county <fct> Alameda, Butte, Butte, Butte, Butte, Fresno, San Joaquin, …
## $ grades <fct> KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK…
## $ students <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, …
## $ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.…
## $ calworks <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.90…
## $ lunch <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.62…
## $ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 3…
## $ expenditure <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988, 5580.147…
## $ income <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333, 10.4150…
## $ english <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.857677, 12.408…
## $ read <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 60…
## $ math <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 61…
This CASchools is a dataset contained in AER package. It is California Test Score Data that contains “data on test performance, school characteristics and student demographic backgrounds for school districts in California,” according to R documentation that you see on your help pane if you invoke ?CASchools
.
Next, we will create two new variables, 1) student to teacher ratio and 2) the average reading and math score for each school district.
For the simple linear regression, we will have score as the dependent variable and student-to-teacher ratio as the explanatory variable and store its result as reg1
. Notice that we defined STR as student to teacher ratio where we divided the number of students by the number of teacher.
##
## Call:
## lm(formula = score ~ STR, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.727 -14.251 0.483 12.822 48.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
## STR -2.2798 0.4798 -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
The first regression result, which does not account for any control variables, shows that one unit increase in student-to-teacher ratio is associated with -2 unit decrease in scores. In other words, the reduction in the student-to-teacher ratio by one student per teacher is associated with approximately 2 point increase in average test scores without holding students characteristics constant.
Here I want to quickly go over how to calculate the confidence interval manually and using tidy()
function.
Step 1: Find t-critical value for the two-sided test with 5% significance level In a simple regression, \(k\) equals to 2, as there are two parameters being estimated, \(\beta_0\) and \(\beta_1\).
## [1] -1.965655
Step 2: Store the coefficient estimate and its std.error
tidy(reg1) %>% filter(term=="STR") %>% select(std.error) %>% .[[1]] -> se_STR
tidy(reg1) %>% filter(term=="STR") %>% select(estimate) %>% .[[1]] -> est_STR
Step 3: Use information from Step 1 and 2 to calculate the lower/upper bound of the confidence interval
## [1] -3.22298
## [1] -1.336636
tidy()
to report confidence intervalterm | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 699 | 9.47 | 73.8 | 6.57e-242 | 680 | 718 |
STR | -2.28 | 0.48 | -4.75 | 2.78e-06 | -3.22 | -1.34 |
Manual calculation should coincide with what is returned from using tidy()
function.
# You can also specify the confidence level using conf.level argument
tidy(reg1, conf.int = T, conf.level = .9)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 699 | 9.47 | 73.8 | 6.57e-242 | 683 | 715 |
STR | -2.28 | 0.48 | -4.75 | 2.78e-06 | -3.07 | -1.49 |
# mean of avg. test-scores
y_mean <- mean(CASchools$score)
# sum of squared residuals
SSR <- sum(residuals(reg1)^2)
# total sum of squares
TSS <- sum((CASchools$score - y_mean )^2)
# explained sum of squares
ESS <- sum((fitted(reg1) - y_mean)^2)
# R^2
Rsq <- 1 - (SSR / TSS)
Rsq
## [1] 0.05124009
## [1] 0.05124009
lm()
Until now, we only considered simply regressing the test score on student-to-teacher ratio without controlling for other factors.
Here, we will consider three variables to control for the background characteristics of students in each school that might affect test scores: english, which is the percent of students who are still learning English, lunch, which denotes the percent of students qualifying for reduced-price lunch, and calworks, which denotes the percent of students qualifying for CalWorks (California income assistance program).
The latter two variables could serve as indicators for students’ economic background and thus be measures for the fraction of the economically disadvantaged children in the district.
The first regression model is as follows:
\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i + u_i\)
The second regression model is as follows:
\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i + u_i\)
The third regression model is as follows:
\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i +\beta_3\textrm{Lunch}_i+ u_i\)
The fourth regression model is as follows:
\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i +\beta_3\textrm{Calworks}_i+ u_i\)
The fifth regression model is as follows:
\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i +\beta_3\textrm{Lunch}_i+ \beta_4\textrm{Calworks}_i+ u_i\)
reg5 <- lm(score ~ STR + english + lunch + calworks, data = CASchools)
# We use `huxreg()` function in `huxtable` package to show five regression results
# side-by-side in one table.
huxreg(reg1,reg2,reg3,reg4,reg5)
(1) | (2) | (3) | (4) | (5) | |
---|---|---|---|---|---|
(Intercept) | 698.933 *** | 686.032 *** | 700.150 *** | 697.999 *** | 700.392 *** |
(9.467) | (7.411) | (4.686) | (6.024) | (4.698) | |
STR | -2.280 *** | -1.101 ** | -0.998 *** | -1.308 *** | -1.014 *** |
(0.480) | (0.380) | (0.239) | (0.307) | (0.240) | |
english | -0.650 *** | -0.122 *** | -0.488 *** | -0.130 *** | |
(0.039) | (0.032) | (0.033) | (0.034) | ||
lunch | -0.547 *** | -0.529 *** | |||
(0.022) | (0.032) | ||||
calworks | -0.790 *** | -0.048 | |||
(0.053) | (0.061) | ||||
N | 420 | 420 | 420 | 420 | 420 |
R2 | 0.051 | 0.426 | 0.775 | 0.629 | 0.775 |
logLik | -1822.250 | -1716.561 | -1520.499 | -1625.328 | -1520.188 |
AIC | 3650.499 | 3441.122 | 3050.999 | 3260.656 | 3052.376 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
calworks
as redundant since adding calworks
as an additional control to the base specification (3) does affect only marginally on the estimated coefficient/standard error for the student-to-teacher ratio and it is reported to be not statistically significant in determining the variation of the test score. It could be that its(calworks
) effect on the test score is likely been already addressed by other control, in this case lunch
, since both variables are indicators for students’ economic background.Recall that under the following two models, Model 1: \(Y_i = \beta_0 + \beta_1 X_{1i} + u_i\). Model 2: \(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + v_i\),
the formula for calculating omitted variable bias is as follows: \[{\text{Bias} = \beta_2 \frac{\mathop{\text{Cov}}(X_{1i}, X_{2i})}{\mathop{\text{Var}}(X_{1i})}}.\]
Consider the specification (1) and (2) from the CASchools
example above, and consider the specification (1) as model 1 and (2) as model 2. Then we can calculate the omitted variable bias for \(\beta_1\) as follows:
## [1] 6.491212
## [1] -1.178512
There’s a negative omitted variable bias, meaning that omitting english
has amplified the effect of STR
negatively on the test core. Comparing (reg1 STR coef) with (reg2 STR coef + omv bias)
near(reg2$coef[2][[1]]+reg2$coef[3][[1]]*cov(CASchools$STR, CASchools$english)/var(CASchools$STR), reg1$coef[2][[1]])
## [1] TRUE
Let’s suppose we want to test to see if the coefficients on the percentage of students receiving lunch at a reduced price AND the percentage of students receiving income assistance are equal to zero at the 5% level. We will use the second and the fifth model specification to conduct an F-test. Recall that the second model specification was as follows:
\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i + u_i\) —–(2) ,
and the fifth model specification was as follows:
\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i +\beta_3\textrm{Lunch}_i+ \beta_4\textrm{Calworks}_i+ u_i\) —–(5)
First we set the null hypothesis, i.e. set up some guess about the unknown population parameters,
in this case \(\beta_3 = \beta_4 = 0\).
We now know what our restricted and unrestricted models are: the restricted model is (2) while the unrestricted model is (5). Notice that we are restricting the model such that \(\beta_3\) and \(\beta_4\) are both equal 0. If you plug in 0 for \(\beta_3\) and \(\beta_4\) in model (5), it becomes model (2) and thus, the restricted model is model (2) while the unrestricted one is model (5). Use following formula to calculate F-statistic: \[\begin{align} F = \dfrac{\left(R^2_u - R^2_r\right)/q}{ (1 - R^2_u)/(n-k)} \end{align}\]
## [1] 0.77485
## [1] 0.4264315
## [1] 415
# degrees of freedom for numerator (number of restrictions)
q = 2
# F statistic
((summary(reg5)$r.squared-summary(reg2)$r.squared)/q)/
((1-summary(reg5)$r.squared)/(nrow(CASchools)-nrow(summary(reg5)$coef)))
## [1] 321.1054
## [1] 3.017462
# Is F-stat greater than F-critical value?
((summary(reg5)$r.squared-summary(reg2)$r.squared)/q)/
((1-summary(reg5)$r.squared)/(nrow(CASchools)-nrow(summary(reg5)$coef)))>
qf(0.95, q, nrow(CASchools)-nrow(summary(reg5)$coef))
## [1] TRUE
Conclusion: Reject the null that \(\beta_3\) and \(\beta_4\) are all equal to 0 Also we could use packaged function linearHypothesis()
for hypothesis testing.
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
417 | 8.72e+04 | ||||
415 | 3.42e+04 | 2 | 5.3e+04 | 321 | 5.39e-85 |
Please open up the 06-Exercise.R
and fill out your answer for each question.