Overview

  • Calculate confidence interval
  • Calculate omitted variable bias
  • Conduct multiple regression using lm()
  • Conduct F - tests

Start with simple regression

Load packages

# load pacman
library(pacman)
# use p_load to load multiple packages at once
p_load(AER, tidyverse, huxtable, car)
# data() loads specified data set, in this case "CASchools"
data("CASchools")

As usual, we load pacman first and use p_load function to attach several packages that we will be using today; AER package contains the dataset that we will do our analysis, 2) tidyverse contains functions such as pipe operator that are useful for data analysis, 3) huxtable contains functions such as huxreg which produces nice formatted regression results. ### Learn about the dataset

summary(CASchools)
##    district            school                  county      grades   
##  Length:420         Length:420         Sonoma     : 29   KK-06: 61  
##  Class :character   Class :character   Kern       : 27   KK-08:359  
##  Mode  :character   Mode  :character   Los Angeles: 27              
##                                        Tulare     : 24              
##                                        San Diego  : 21              
##                                        Santa Clara: 20              
##                                        (Other)    :272              
##     students          teachers          calworks          lunch       
##  Min.   :   81.0   Min.   :   4.85   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:  379.0   1st Qu.:  19.66   1st Qu.: 4.395   1st Qu.: 23.28  
##  Median :  950.5   Median :  48.56   Median :10.520   Median : 41.75  
##  Mean   : 2628.8   Mean   : 129.07   Mean   :13.246   Mean   : 44.71  
##  3rd Qu.: 3008.0   3rd Qu.: 146.35   3rd Qu.:18.981   3rd Qu.: 66.86  
##  Max.   :27176.0   Max.   :1429.00   Max.   :78.994   Max.   :100.00  
##                                                                       
##     computer       expenditure       income          english      
##  Min.   :   0.0   Min.   :3926   Min.   : 5.335   Min.   : 0.000  
##  1st Qu.:  46.0   1st Qu.:4906   1st Qu.:10.639   1st Qu.: 1.941  
##  Median : 117.5   Median :5215   Median :13.728   Median : 8.778  
##  Mean   : 303.4   Mean   :5312   Mean   :15.317   Mean   :15.768  
##  3rd Qu.: 375.2   3rd Qu.:5601   3rd Qu.:17.629   3rd Qu.:22.970  
##  Max.   :3324.0   Max.   :7712   Max.   :55.328   Max.   :85.540  
##                                                                   
##       read            math      
##  Min.   :604.5   Min.   :605.4  
##  1st Qu.:640.4   1st Qu.:639.4  
##  Median :655.8   Median :652.5  
##  Mean   :655.0   Mean   :653.3  
##  3rd Qu.:668.7   3rd Qu.:665.9  
##  Max.   :704.0   Max.   :709.5  
## 
glimpse(CASchools)
## Rows: 420
## Columns: 14
## $ district    <chr> "75119", "61499", "61549", "61457", "61523", "62042", "685…
## $ school      <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito …
## $ county      <fct> Alameda, Butte, Butte, Butte, Butte, Fresno, San Joaquin, …
## $ grades      <fct> KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK…
## $ students    <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, …
## $ teachers    <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.…
## $ calworks    <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.90…
## $ lunch       <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.62…
## $ computer    <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 3…
## $ expenditure <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988, 5580.147…
## $ income      <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333, 10.4150…
## $ english     <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.857677, 12.408…
## $ read        <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 60…
## $ math        <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 61…
?CASchools

This CASchools is a dataset contained in AER package. It is California Test Score Data that contains “data on test performance, school characteristics and student demographic backgrounds for school districts in California,” according to R documentation that you see on your help pane if you invoke ?CASchools.

Create two new variables

Next, we will create two new variables, 1) student to teacher ratio and 2) the average reading and math score for each school district.

CASchools$STR <- CASchools$students/CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math)/2

Do a simple regression

For the simple linear regression, we will have score as the dependent variable and student-to-teacher ratio as the explanatory variable and store its result as reg1. Notice that we defined STR as student to teacher ratio where we divided the number of students by the number of teacher.

reg1 <- lm(score ~ STR, data = CASchools)
summary(reg1)
## 
## Call:
## lm(formula = score ~ STR, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
## STR          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

The first regression result, which does not account for any control variables, shows that one unit increase in student-to-teacher ratio is associated with -2 unit decrease in scores. In other words, the reduction in the student-to-teacher ratio by one student per teacher is associated with approximately 2 point increase in average test scores without holding students characteristics constant.

Confidence Interval

Here I want to quickly go over how to calculate the confidence interval manually and using tidy() function.

  1. Calculating confidence interval manually:

Step 1: Find t-critical value for the two-sided test with 5% significance level In a simple regression, \(k\) equals to 2, as there are two parameters being estimated, \(\beta_0\) and \(\beta_1\).

n=nrow(CASchools)
k=2
qt(.05/2, n-k)
## [1] -1.965655

Step 2: Store the coefficient estimate and its std.error

tidy(reg1) %>% filter(term=="STR") %>% select(std.error) %>% .[[1]] -> se_STR
tidy(reg1) %>% filter(term=="STR") %>% select(estimate) %>% .[[1]] -> est_STR

Step 3: Use information from Step 1 and 2 to calculate the lower/upper bound of the confidence interval

est_STR + se_STR*qt(.05/2, n-k)
## [1] -3.22298
est_STR - se_STR*qt(.05/2, n-k)
## [1] -1.336636
  1. Use tidy() to report confidence interval
tidy(reg1, conf.int = T)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)699   9.4773.8 6.57e-242680   718   
STR-2.280.48-4.752.78e-06 -3.22-1.34

Manual calculation should coincide with what is returned from using tidy() function.

# You can also specify the confidence level using conf.level argument
tidy(reg1, conf.int = T, conf.level = .9)
termestimatestd.errorstatisticp.valueconf.lowconf.high
(Intercept)699   9.4773.8 6.57e-242683   715   
STR-2.280.48-4.752.78e-06 -3.07-1.49

Goodness of Fit

# mean of avg. test-scores
y_mean <- mean(CASchools$score)
# sum of squared residuals
SSR <- sum(residuals(reg1)^2)
# total sum of squares
TSS <- sum((CASchools$score - y_mean )^2)
# explained sum of squares
ESS <- sum((fitted(reg1) - y_mean)^2)
# R^2
Rsq <- 1 - (SSR / TSS)
Rsq
## [1] 0.05124009
summary(reg1)$r.squared
## [1] 0.05124009

Multiple Regression Using lm()

Until now, we only considered simply regressing the test score on student-to-teacher ratio without controlling for other factors.

Here, we will consider three variables to control for the background characteristics of students in each school that might affect test scores: english, which is the percent of students who are still learning English, lunch, which denotes the percent of students qualifying for reduced-price lunch, and calworks, which denotes the percent of students qualifying for CalWorks (California income assistance program).

The latter two variables could serve as indicators for students’ economic background and thus be measures for the fraction of the economically disadvantaged children in the district.

Model specification

The first regression model is as follows:

\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i + u_i\)

reg1 <- lm(score~STR, data = CASchools)

The second regression model is as follows:

\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i + u_i\)

reg2 <- lm(score ~ STR + english, data = CASchools)

The third regression model is as follows:

\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i +\beta_3\textrm{Lunch}_i+ u_i\)

reg3 <- lm(score ~ STR + english + lunch, data = CASchools)

The fourth regression model is as follows:

\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i +\beta_3\textrm{Calworks}_i+ u_i\)

reg4 <- lm(score ~ STR + english + calworks, data = CASchools)

The fifth regression model is as follows:

\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i +\beta_3\textrm{Lunch}_i+ \beta_4\textrm{Calworks}_i+ u_i\)

reg5 <- lm(score ~ STR + english + lunch + calworks, data = CASchools)
# We use `huxreg()` function in `huxtable` package to show five regression results
# side-by-side in one table.
huxreg(reg1,reg2,reg3,reg4,reg5)
(1)(2)(3)(4)(5)
(Intercept)698.933 ***686.032 ***700.150 ***697.999 ***700.392 ***
(9.467)   (7.411)   (4.686)   (6.024)   (4.698)   
STR-2.280 ***-1.101 ** -0.998 ***-1.308 ***-1.014 ***
(0.480)   (0.380)   (0.239)   (0.307)   (0.240)   
english        -0.650 ***-0.122 ***-0.488 ***-0.130 ***
        (0.039)   (0.032)   (0.033)   (0.034)   
lunch                -0.547 ***        -0.529 ***
                (0.022)           (0.032)   
calworks                        -0.790 ***-0.048    
                        (0.053)   (0.061)   
N420        420        420        420        420        
R20.051    0.426    0.775    0.629    0.775    
logLik-1822.250    -1716.561    -1520.499    -1625.328    -1520.188    
AIC3650.499    3441.122    3050.999    3260.656    3052.376    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Discussion of the empirical results:

  1. Controlling for student characteristics reduces the effect of the student-teacher ratio on test scores approximately in half.
  2. The variables that control for student characteristics are useful in predicting test cores. Notice that the goodness of fit (R-squared) from the first regression is only 0.05. This means that the student-to-teacher ratio alone only explains a small fraction of the variation in test score. However, the goodness of fit increases significantly when student characteristic variables are added as controls.
  3. The individual control variable can be not statistically significant. For example, in specification (5), the percentage qualifying for income assistance is not statistically significant, meaning that it fails to reject the null hypothesis that the effect of the variable on the test score equals zero. We may consider calworks as redundant since adding calworks as an additional control to the base specification (3) does affect only marginally on the estimated coefficient/standard error for the student-to-teacher ratio and it is reported to be not statistically significant in determining the variation of the test score. It could be that its(calworks) effect on the test score is likely been already addressed by other control, in this case lunch, since both variables are indicators for students’ economic background.

Omitted Variable Bias

Recall that under the following two models, Model 1: \(Y_i = \beta_0 + \beta_1 X_{1i} + u_i\). Model 2: \(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + v_i\),

the formula for calculating omitted variable bias is as follows: \[{\text{Bias} = \beta_2 \frac{\mathop{\text{Cov}}(X_{1i}, X_{2i})}{\mathop{\text{Var}}(X_{1i})}}.\]

Consider the specification (1) and (2) from the CASchools example above, and consider the specification (1) as model 1 and (2) as model 2. Then we can calculate the omitted variable bias for \(\beta_1\) as follows:

cov(CASchools$STR, CASchools$english)
## [1] 6.491212
reg2$coef[3][[1]]*cov(CASchools$STR, CASchools$english)/var(CASchools$STR)
## [1] -1.178512

There’s a negative omitted variable bias, meaning that omitting english has amplified the effect of STR negatively on the test core. Comparing (reg1 STR coef) with (reg2 STR coef + omv bias)

near(reg2$coef[2][[1]]+reg2$coef[3][[1]]*cov(CASchools$STR, CASchools$english)/var(CASchools$STR), reg1$coef[2][[1]])
## [1] TRUE

F-tests

Let’s suppose we want to test to see if the coefficients on the percentage of students receiving lunch at a reduced price AND the percentage of students receiving income assistance are equal to zero at the 5% level. We will use the second and the fifth model specification to conduct an F-test. Recall that the second model specification was as follows:

\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i + u_i\) —–(2) ,

and the fifth model specification was as follows:

\(\textrm{Score}_i = \beta_0+\beta_1\textrm{STR}_i+\beta_2\textrm{English}_i +\beta_3\textrm{Lunch}_i+ \beta_4\textrm{Calworks}_i+ u_i\) —–(5)

First we set the null hypothesis, i.e. set up some guess about the unknown population parameters,

in this case \(\beta_3 = \beta_4 = 0\).

We now know what our restricted and unrestricted models are: the restricted model is (2) while the unrestricted model is (5). Notice that we are restricting the model such that \(\beta_3\) and \(\beta_4\) are both equal 0. If you plug in 0 for \(\beta_3\) and \(\beta_4\) in model (5), it becomes model (2) and thus, the restricted model is model (2) while the unrestricted one is model (5). Use following formula to calculate F-statistic: \[\begin{align} F = \dfrac{\left(R^2_u - R^2_r\right)/q}{ (1 - R^2_u)/(n-k)} \end{align}\]

## H0: beta3=beta4=0
# unrestricted model's R-squared
summary(reg5)$r.squared
## [1] 0.77485
# restricted model's R-squared
summary(reg2)$r.squared
## [1] 0.4264315
# degrees of freedom for denominator
nrow(CASchools)-nrow(summary(reg5)$coef)
## [1] 415
# degrees of freedom for numerator (number of restrictions)
q = 2
# F statistic
((summary(reg5)$r.squared-summary(reg2)$r.squared)/q)/
  ((1-summary(reg5)$r.squared)/(nrow(CASchools)-nrow(summary(reg5)$coef)))
## [1] 321.1054
# F critical value
qf(0.95, q, nrow(CASchools)-nrow(summary(reg5)$coef))
## [1] 3.017462
# Is F-stat greater than F-critical value?
((summary(reg5)$r.squared-summary(reg2)$r.squared)/q)/
  ((1-summary(reg5)$r.squared)/(nrow(CASchools)-nrow(summary(reg5)$coef)))>
  qf(0.95, q, nrow(CASchools)-nrow(summary(reg5)$coef))
## [1] TRUE

Conclusion: Reject the null that \(\beta_3\) and \(\beta_4\) are all equal to 0 Also we could use packaged function linearHypothesis() for hypothesis testing.

linearHypothesis(reg5, c("lunch=0", "calworks=0"))
Res.DfRSSDfSum of SqFPr(>F)
4178.72e+04             
4153.42e+0425.3e+043215.39e-85
## Now your turn

Please open up the 06-Exercise.R and fill out your answer for each question.