Probability, Statistics & Modeling II
Overall aim: make inference from sample to population.
Dataset 1: Terror data (“Trial and Terror dataset”)
load('./data/terror_data.RData')
names(terror_data)
## [1] "firstName" "lastName" "gender" "case_informant"
## [5] "case_sting" "sentence"
head(terror_data)
## firstName lastName gender case_informant case_sting sentence
## 3 Mubarak Hamed male false false 57
## 11 Tarek Makki male false false 26
## 20 Jalal Sadat Moheisen male true true 69
## 21 Thirunavukkarasu Varatharasa male true true 57
## 22 Reinhard Rusli male true true 13
## 23 Syed Mustajab Shah male true true 225
dim(terror_data)
## [1] 471 6
Dataset 2: Mass Shootings in detail (Stanford Mass Shootings in America dataset)
load('./data/mass_shootings_detailed.RData')
names(smsd)
## [1] "caseid" "n_fatal" "n_injured" "date"
## [5] "day" "age" "gender" "n_guns"
## [9] "school_related" "mental_illness"
head(smsd)
## caseid n_fatal n_injured date day age gender n_guns
## 1 1 16 32 8/1/1966 Monday 20 Male 8
## 2 2 5 1 11/12/1966 Saturday 11 Male 1
## 3 3 9 13 12/31/72 Sunday 17 Male 3
## 4 4 1 3 1/17/74 Thursday 3 Male 3
## 5 5 3 7 12/30/74 Monday 8 Male 3
## 7 7 7 2 7/12/76 Monday 34 Male 1
## school_related mental_illness
## 1 Yes Yes
## 2 Yes Yes
## 3 No Yes
## 4 Yes Yes
## 5 Yes No
## 7 Yes Yes
dim(smsd)
## [1] 182 10
Aim: find a line that simplifies the data
Why linear?
Y = a + b*X + E
Y
X
a
(= the value of Y
if X
is 0
)b
(= the change in Y
for every unit change in X
)E
(= the difference between the predicted value and the observed value)Y = a + b*X + E
Note: linear relationship
E ~ i.i.d. N(0, sd)
Modelling the no. of fatalities
victims = intercept + slope*number_of_guns
pred.victims = 3 + 1.5*smsd$n_guns
head(smsd, 1)
## caseid n_fatal n_injured date day age gender n_guns
## 1 1 16 32 8/1/1966 Monday 20 Male 8
## school_related mental_illness
## 1 Yes Yes
case_1 = 3+1.5*8
case_1
## [1] 15
plot(smsd$n_fatal, pred.victims, ylim=c(0,30))
pred.victims_adj = 5 + 3*smsd$n_guns
plot(smsd$n_fatal, pred.victims_adj, ylim=c(0,30))
An empirical approach:
Linearity in parameters
Y = a + b*X + E
Linear because: Y = a + b
OK, let’s model the data then…
R syntax for modelling:
~
to say “explained through…”shooter_model_1 = lm(formula = smsd$n_fatal ~ smsd$n_guns)
shooter_model_1
##
## Call:
## lm(formula = smsd$n_fatal ~ smsd$n_guns)
##
## Coefficients:
## (Intercept) smsd$n_guns
## 2.087 1.105
shooter_model_1
##
## Call:
## lm(formula = smsd$n_fatal ~ smsd$n_guns)
##
## Coefficients:
## (Intercept) smsd$n_guns
## 2.087 1.105
The model equation therefore is:
n_fatal = 2.087 + 1.105*n_guns
summary(shooter_model_1)
##
## Call:
## lm(formula = smsd$n_fatal ~ smsd$n_guns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9250 -2.4012 -0.4012 1.2440 26.5988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0870 0.5268 3.962 0.000107 ***
## smsd$n_guns 1.1047 0.1981 5.577 8.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.148 on 180 degrees of freedom
## Multiple R-squared: 0.1473, Adjusted R-squared: 0.1426
## F-statistic: 31.1 on 1 and 180 DF, p-value: 8.847e-08
n_guns
n_fatal = 2.087 + 1.105*n_guns
So we can make predictions, right?
Have a look at he model object:
shooter_model_1$fitted.values
## 1 2 3 4 5 6 7
## 10.924961 3.191753 5.401241 5.401241 5.401241 3.191753 3.191753
## 8 9 10 11 12 13 14
## 3.191753 3.191753 3.191753 6.505985 3.191753 6.505985 5.401241
## 15 16 17 18 19 20 21
## 3.191753 5.401241 6.505985 3.191753 6.505985 5.401241 10.924961
## 22 23 24 25 26 27 28
## 5.401241 3.191753 3.191753 5.401241 3.191753 9.820217 5.401241
## 29 30 31 32 33 34 35
## 3.191753 5.401241 5.401241 3.191753 5.401241 3.191753 3.191753
## 36 37 38 39 40 41 42
## 6.505985 3.191753 3.191753 6.505985 5.401241 3.191753 3.191753
## 43 44 45 46 47 48 49
## 3.191753 3.191753 3.191753 3.191753 3.191753 3.191753 5.401241
## 50 51 52 53 54 55 56
## 3.191753 6.505985 3.191753 3.191753 3.191753 3.191753 3.191753
## 57 58 59 60 61 62 63
## 6.505985 3.191753 3.191753 3.191753 4.296497 3.191753 6.505985
## 64 65 66 67 68 69 70
## 3.191753 7.610729 5.401241 7.610729 5.401241 3.191753 3.191753
## 71 72 73 74 75 76 77
## 5.401241 6.505985 7.610729 3.191753 3.191753 8.715473 5.401241
## 78 79 80 81 82 83 84
## 5.401241 3.191753 3.191753 3.191753 3.191753 6.505985 3.191753
## 85 86 87 88 89 90 91
## 5.401241 3.191753 5.401241 3.191753 5.401241 3.191753 5.401241
## 92 93 94 95 96 97 98
## 3.191753 3.191753 6.505985 5.401241 5.401241 5.401241 7.610729
## 99 100 101 102 103 104 105
## 3.191753 3.191753 3.191753 7.610729 7.610729 5.401241 6.505985
## 106 107 108 109 110 111 112
## 5.401241 6.505985 5.401241 3.191753 3.191753 3.191753 3.191753
## 113 114 115 116 117 118 119
## 3.191753 6.505985 3.191753 3.191753 3.191753 3.191753 5.401241
## 120 121 122 123 124 125 126
## 7.610729 3.191753 3.191753 3.191753 3.191753 6.505985 7.610729
## 127 128 129 130 131 132 133
## 10.924961 3.191753 5.401241 3.191753 5.401241 3.191753 5.401241
## 134 135 136 137 138 139 140
## 3.191753 5.401241 3.191753 3.191753 5.401241 3.191753 3.191753
## 141 142 143 144 145 146 147
## 3.191753 6.505985 3.191753 5.401241 5.401241 3.191753 3.191753
## 148 149 150 151 152 153 154
## 3.191753 3.191753 3.191753 6.505985 5.401241 5.401241 3.191753
## 155 156 157 158 159 160 161
## 3.191753 3.191753 5.401241 3.191753 3.191753 5.401241 3.191753
## 162 163 164 165 166 167 168
## 3.191753 3.191753 3.191753 3.191753 5.401241 3.191753 3.191753
## 169 170 171 172 173 174 175
## 3.191753 3.191753 3.191753 5.401241 3.191753 5.401241 3.191753
## 176 177 178 179 180 181 182
## 3.191753 3.191753 3.191753 5.401241 3.191753 3.191753 5.401241
{plot(smsd$n_fatal, main="Fitted and observed values", ylab="", ylim=c(-5, 30))
points(shooter_model_1$fitted.values, col='red')}
head(shooter_model_1$residuals, 10)
## 1 2 3 4 5 6 7
## 5.075039 1.808247 3.598759 -4.401241 -2.401241 3.808247 -1.191753
## 8 9 10
## -2.191753 4.808247 -2.191753
Relationships between observed values, fitted values and errors?
Observed values:
{plot(smsd$n_fatal, main="Fitted and observed values, and errors", ylab="", ylim=c(-5,30))}
Observed + fitted values
{plot(smsd$n_fatal, main="Fitted and observed values, and errors", ylab="", ylim=c(-5, 30))
points(shooter_model_1$fitted.values, col='red')}
{plot(smsd$n_fatal, main="Fitted and observed values, and errors", ylab="", ylim=c(-5,30))
points(shooter_model_1$fitted.values, col='red')
points(shooter_model_1$residuals, col='blue')}
{plot(smsd$n_fatal[1:5], main="Fitted and observed values, and errors", ylab="", ylim=c(-5,30), xlim=c(1,5))
points(shooter_model_1$fitted.values[1:5], col='red')
points(shooter_model_1$residuals[1:5], col='blue')}
If:
residual = observed - predicted
… then: What is the sum of residuals?
Aim: find best fitting line
{plot(smsd$n_guns,smsd$n_fatal)
abline(shooter_model_1)}
smsd$fitted_values = shooter_model_1$fitted.values
smsd$residuals = shooter_model_1$residuals
smsd[smsd$n_guns == 6, ]
## caseid n_fatal n_injured date day age gender n_guns
## 82 82 3 0 10/28/02 Monday 38 Male 6
## school_related mental_illness fitted_values residuals
## 82 Yes No 8.715473 -5.715473
What is the sum of residuals?
sum(smsd$residuals)
## [1] -2.292611e-14
So how to tell how good the model is?
sum(shooter_model_1$residuals^2)
## [1] 3096.339
Hence the name: OLS regression –> Ordinary Least Squares!
… this is a shitty model!
victims = intercept + slope*number_of_guns
General formula:
Y = b_0 + b_1*X1 + b_2*X2 + b_3*X3 ... b_i*Xi
Conceptual:
victims = b_0 + b_1*number_of_guns + b_2*mental_illness
shooter_model_2 = lm(formula = smsd$n_fatal ~ smsd$n_guns + smsd$mental_illness)
shooter_model_2
##
## Call:
## lm(formula = smsd$n_fatal ~ smsd$n_guns + smsd$mental_illness)
##
## Coefficients:
## (Intercept) smsd$n_guns smsd$mental_illnessYes
## 1.480 1.034 1.471
–>
n_fatal = 1.48 + 1.034*n_guns + 1.471*mentall_illness
{plot(smsd$n_fatal, main="Model 1 and model 2", ylab="", ylim=c(0, 30))
points(shooter_model_1$fitted.values, col='red')
points(shooter_model_2$fitted.values, col='green')}
Shooter model 1:
summary(shooter_model_1)
##
## Call:
## lm(formula = smsd$n_fatal ~ smsd$n_guns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9250 -2.4012 -0.4012 1.2440 26.5988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0870 0.5268 3.962 0.000107 ***
## smsd$n_guns 1.1047 0.1981 5.577 8.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.148 on 180 degrees of freedom
## Multiple R-squared: 0.1473, Adjusted R-squared: 0.1426
## F-statistic: 31.1 on 1 and 180 DF, p-value: 8.847e-08
Shooter model 2:
summary(shooter_model_2)
##
## Call:
## lm(formula = smsd$n_fatal ~ smsd$n_guns + smsd$mental_illness)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.224 -2.514 -0.514 1.486 25.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4797 0.5785 2.558 0.0114 *
## smsd$n_guns 1.0342 0.1977 5.230 4.69e-07 ***
## smsd$mental_illnessYes 1.4706 0.6141 2.395 0.0177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.094 on 179 degrees of freedom
## Multiple R-squared: 0.1738, Adjusted R-squared: 0.1646
## F-statistic: 18.83 on 2 and 179 DF, p-value: 3.793e-08
If all residuals sum to zero?
sum(shooter_model_1$residuals^2)
## [1] 3096.339
sum(shooter_model_2$residuals^2)
## [1] 3000.22
smsd = smsd[smsd$school_related != 'Killed', ]
smsd = droplevels(smsd)
shooter_model_3 = lm(smsd$n_fatal ~ smsd$mental_illness + smsd$school_related)
summary(shooter_model_3)
##
## Call:
## lm(formula = smsd$n_fatal ~ smsd$mental_illness + smsd$school_related)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0676 -2.5475 -0.8805 1.6396 27.4525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8805 0.4998 7.764 6.24e-13 ***
## smsd$mental_illnessYes 2.1871 0.6543 3.343 0.00101 **
## smsd$school_relatedYes -1.5201 0.6865 -2.214 0.02808 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.347 on 178 degrees of freedom
## Multiple R-squared: 0.07356, Adjusted R-squared: 0.06315
## F-statistic: 7.067 on 2 and 178 DF, p-value: 0.001114
interaction.plot(smsd$mental_illness, smsd$school_related, smsd$n_fatal)
Main effects: effeect of one predictor variable on the outcome variable.
names(terror_data)
## [1] "firstName" "lastName" "gender" "case_informant"
## [5] "case_sting" "sentence"
baseline_model = lm(terror_data$sentence ~ terror_data$gender)
summary(baseline_model)
##
## Call:
## lm(formula = terror_data$sentence ~ terror_data$gender)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.96 -97.96 -37.96 43.04 1007.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 151.963 6.762 22.474 <2e-16 ***
## terror_data$genderfemale -41.463 24.457 -1.695 0.0907 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141 on 469 degrees of freedom
## Multiple R-squared: 0.006091, Adjusted R-squared: 0.003972
## F-statistic: 2.874 on 1 and 469 DF, p-value: 0.09068
No effect!
Extended model 1:
extended_model_1 = lm(terror_data$sentence ~ terror_data$gender + terror_data$case_informant)
summary(extended_model_1)
##
## Call:
## lm(formula = terror_data$sentence ~ terror_data$gender + terror_data$case_informant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.72 -100.72 -39.72 62.78 1019.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 131.72 10.33 12.747 <2e-16 ***
## terror_data$genderfemale -33.50 24.51 -1.367 0.1723
## terror_data$case_informanttrue 34.00 13.18 2.579 0.0102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140.2 on 468 degrees of freedom
## Multiple R-squared: 0.02002, Adjusted R-squared: 0.01583
## F-statistic: 4.78 on 2 and 468 DF, p-value: 0.008807
interaction.plot(terror_data$gender, terror_data$case_informant, terror_data$sentence)
What’s going on??????
Statistical interaction: effect of one predictor variable on the outcome variable depends on another predictor variable.
extended_model_2 = lm(terror_data$sentence ~ terror_data$gender + terror_data$case_informant + terror_data$gender:terror_data$case_informant)
summary(extended_model_2)
##
## Call:
## lm(formula = terror_data$sentence ~ terror_data$gender + terror_data$case_informant +
## terror_data$gender:terror_data$case_informant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -156.08 -100.77 -39.08 64.88 981.87
##
## Coefficients:
## Estimate
## (Intercept) 126.767
## terror_data$genderfemale 9.363
## terror_data$case_informanttrue 42.318
## terror_data$genderfemale:terror_data$case_informanttrue -113.294
## Std. Error t value
## (Intercept) 10.521 12.049
## terror_data$genderfemale 30.947 0.303
## terror_data$case_informanttrue 13.635 3.104
## terror_data$genderfemale:terror_data$case_informanttrue 50.315 -2.252
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## terror_data$genderfemale 0.76236
## terror_data$case_informanttrue 0.00203 **
## terror_data$genderfemale:terror_data$case_informanttrue 0.02480 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 139.6 on 467 degrees of freedom
## Multiple R-squared: 0.03055, Adjusted R-squared: 0.02432
## F-statistic: 4.905 on 3 and 467 DF, p-value: 0.002295
Main effect of case_informant
:
tapply(terror_data$sentence, list(terror_data$case_informant), mean)
## false true
## 127.8492 164.1176
Interpretation?
Main effect of gender
:
tapply(terror_data$sentence, list(terror_data$gender), mean)
## male female
## 151.9632 110.5000
Interaction between case_informant
and gender
:
tapply(terror_data$sentence, list(terror_data$gender, terror_data$case_informant), mean)
## false true
## male 126.7670 169.08494
## female 136.1304 65.15385
Specify the full model with *
lm(terror_data$sentence ~ terror_data$gender + terror_data$case_informant + terror_data$gender:terror_data$case_informant)
##
## Call:
## lm(formula = terror_data$sentence ~ terror_data$gender + terror_data$case_informant +
## terror_data$gender:terror_data$case_informant)
##
## Coefficients:
## (Intercept)
## 126.767
## terror_data$genderfemale
## 9.363
## terror_data$case_informanttrue
## 42.318
## terror_data$genderfemale:terror_data$case_informanttrue
## -113.294
#identical to:
lm(terror_data$sentence ~ terror_data$gender*terror_data$case_informant)
##
## Call:
## lm(formula = terror_data$sentence ~ terror_data$gender * terror_data$case_informant)
##
## Coefficients:
## (Intercept)
## 126.767
## terror_data$genderfemale
## 9.363
## terror_data$case_informanttrue
## 42.318
## terror_data$genderfemale:terror_data$case_informanttrue
## -113.294
What if you don’t know what the ‘ideal’ model is?
Especially neat for predictive modelling
**Back to the shooting data:**
names(smsd)
## [1] "caseid" "n_fatal" "n_injured" "date"
## [5] "day" "age" "gender" "n_guns"
## [9] "school_related" "mental_illness" "fitted_values" "residuals"
complete_model = lm(n_fatal ~ gender*n_guns*mental_illness*school_related, data = smsd)
4 predictor variables: how many terms in the model?
complete_model = lm(n_fatal ~ n_guns*mental_illness*school_related, data = smsd)
null_model = lm(n_fatal ~ 1, data = smsd)
3 predictor variables: how many terms in the model?
summary(complete_model)
##
## Call:
## lm(formula = n_fatal ~ n_guns * mental_illness * school_related,
## data = smsd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9592 -2.1233 -0.6777 1.2421 26.2074
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.30041 0.93210 2.468
## n_guns 0.86436 0.39577 2.184
## mental_illnessYes 1.47991 1.28991 1.147
## school_relatedYes -0.01274 1.70127 -0.007
## n_guns:mental_illnessYes 0.03300 0.49495 0.067
## n_guns:school_relatedYes -1.02874 0.77367 -1.330
## mental_illnessYes:school_relatedYes -3.41734 2.24208 -1.524
## n_guns:mental_illnessYes:school_relatedYes 1.94549 0.91794 2.119
## Pr(>|t|)
## (Intercept) 0.0146 *
## n_guns 0.0303 *
## mental_illnessYes 0.2528
## school_relatedYes 0.9940
## n_guns:mental_illnessYes 0.9469
## n_guns:school_relatedYes 0.1854
## mental_illnessYes:school_relatedYes 0.1293
## n_guns:mental_illnessYes:school_relatedYes 0.0355 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.016 on 173 degrees of freedom
## Multiple R-squared: 0.2315, Adjusted R-squared: 0.2004
## F-statistic: 7.444 on 7 and 173 DF, p-value: 8.021e-08
summary(null_model)
##
## Call:
## lm(formula = n_fatal ~ 1, data = smsd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4751 -2.4751 -0.4751 1.5249 27.5249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4751 0.3338 13.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.491 on 180 degrees of freedom
step(complete_model, direction = 'backward')
## Start: AIC=511.13
## n_fatal ~ n_guns * mental_illness * school_related
##
## Df Sum of Sq RSS AIC
## <none> 2790.6 511.13
## - n_guns:mental_illness:school_related 1 72.456 2863.0 513.77
##
## Call:
## lm(formula = n_fatal ~ n_guns * mental_illness * school_related,
## data = smsd)
##
## Coefficients:
## (Intercept)
## 2.30041
## n_guns
## 0.86436
## mental_illnessYes
## 1.47991
## school_relatedYes
## -0.01274
## n_guns:mental_illnessYes
## 0.03300
## n_guns:school_relatedYes
## -1.02874
## mental_illnessYes:school_relatedYes
## -3.41734
## n_guns:mental_illnessYes:school_relatedYes
## 1.94549
step(null_model, direction = 'forward'
, scope=list(lower=null_model, upper=complete_model))
## Start: AIC=544.78
## n_fatal ~ 1
##
## Df Sum of Sq RSS AIC
## + n_guns 1 535.46 3095.7 517.91
## + mental_illness 1 174.44 3456.7 537.87
## + school_related 1 55.94 3575.2 543.97
## <none> 3631.1 544.78
##
## Step: AIC=517.91
## n_fatal ~ n_guns
##
## Df Sum of Sq RSS AIC
## + mental_illness 1 95.460 3000.2 514.24
## + school_related 1 57.394 3038.3 516.52
## <none> 3095.7 517.91
##
## Step: AIC=514.24
## n_fatal ~ n_guns + mental_illness
##
## Df Sum of Sq RSS AIC
## + school_related 1 84.677 2915.5 511.06
## <none> 3000.2 514.24
## + n_guns:mental_illness 1 30.943 2969.3 514.36
##
## Step: AIC=511.06
## n_fatal ~ n_guns + mental_illness + school_related
##
## Df Sum of Sq RSS AIC
## + n_guns:mental_illness 1 36.664 2878.9 510.77
## <none> 2915.5 511.06
## + n_guns:school_related 1 18.227 2897.3 511.92
## + mental_illness:school_related 1 3.208 2912.3 512.86
##
## Step: AIC=510.77
## n_fatal ~ n_guns + mental_illness + school_related + n_guns:mental_illness
##
## Df Sum of Sq RSS AIC
## <none> 2878.9 510.77
## + n_guns:school_related 1 13.979 2864.9 511.88
## + mental_illness:school_related 1 4.243 2874.6 512.50
##
## Call:
## lm(formula = n_fatal ~ n_guns + mental_illness + school_related +
## n_guns:mental_illness, data = smsd)
##
## Coefficients:
## (Intercept) n_guns
## 2.7122 0.6060
## mental_illnessYes school_relatedYes
## 0.3975 -1.5038
## n_guns:mental_illnessYes
## 0.6247
step(null_model, direction = 'both'
, scope=list(upper=complete_model))
## Start: AIC=544.78
## n_fatal ~ 1
##
## Df Sum of Sq RSS AIC
## + n_guns 1 535.46 3095.7 517.91
## + mental_illness 1 174.44 3456.7 537.87
## + school_related 1 55.94 3575.2 543.97
## <none> 3631.1 544.78
##
## Step: AIC=517.91
## n_fatal ~ n_guns
##
## Df Sum of Sq RSS AIC
## + mental_illness 1 95.46 3000.2 514.24
## + school_related 1 57.39 3038.3 516.52
## <none> 3095.7 517.91
## - n_guns 1 535.46 3631.1 544.78
##
## Step: AIC=514.24
## n_fatal ~ n_guns + mental_illness
##
## Df Sum of Sq RSS AIC
## + school_related 1 84.68 2915.5 511.06
## <none> 3000.2 514.24
## + n_guns:mental_illness 1 30.94 2969.3 514.36
## - mental_illness 1 95.46 3095.7 517.91
## - n_guns 1 456.47 3456.7 537.87
##
## Step: AIC=511.06
## n_fatal ~ n_guns + mental_illness + school_related
##
## Df Sum of Sq RSS AIC
## + n_guns:mental_illness 1 36.66 2878.9 510.77
## <none> 2915.5 511.06
## + n_guns:school_related 1 18.23 2897.3 511.92
## + mental_illness:school_related 1 3.21 2912.3 512.86
## - school_related 1 84.68 3000.2 514.24
## - mental_illness 1 122.74 3038.3 516.52
## - n_guns 1 448.49 3364.0 534.95
##
## Step: AIC=510.77
## n_fatal ~ n_guns + mental_illness + school_related + n_guns:mental_illness
##
## Df Sum of Sq RSS AIC
## <none> 2878.9 510.77
## - n_guns:mental_illness 1 36.664 2915.5 511.06
## + n_guns:school_related 1 13.979 2864.9 511.88
## + mental_illness:school_related 1 4.243 2874.6 512.50
## - school_related 1 90.399 2969.3 514.36
##
## Call:
## lm(formula = n_fatal ~ n_guns + mental_illness + school_related +
## n_guns:mental_illness, data = smsd)
##
## Coefficients:
## (Intercept) n_guns
## 2.7122 0.6060
## mental_illnessYes school_relatedYes
## 0.3975 -1.5038
## n_guns:mental_illnessYes
## 0.6247
set.seed(123)
a = rnorm(1000, 30, 10)
b = a + rnorm(1000, 2, 8)
plot(a, b, main = round(cor(a, b), 4))
a_scaled = scale(a)
b_scaled = scale(b)
{plot(a_scaled, b_scaled)
abline(lm(a_scaled ~ b_scaled))}
lm(a_scaled ~ b_scaled - 1)
##
## Call:
## lm(formula = a_scaled ~ b_scaled - 1)
##
## Coefficients:
## b_scaled
## 0.7969
Next week
Homework