SOLUTIONS
Homework week 2: Regression modelling in R
We will use a dataset that includes variables of 180,000 terrorist attacks between 1970-2017. Note that we excluded some variables for clarity and did some preprocessing on the data.
Suppose you are given that dataset and you’re asked to inform policy-makers about the relationship between terrorist attack details and the number of victims. This notebook asks you to perform initial descriptive statistics and then build and evaluate models of the data.
You can load the dataset as follows:
This command loads a dataframe called gtd
in your notebook. You can query that dataframe as usual.
Have a look at the names (columns) in the data.frame:
names(gtd)
[1] "eventid" "iyear" "imonth" "iday" "nperps" "suicide" "ransom" "nkill"
[9] "nwound"
… and show the first 10 rows:
head(gtd, 10)
eventid iyear imonth iday nperps suicide ransom nkill nwound
1 9.733093e-313 1970 0 0 7 0 1 0 0
2 9.733144e-313 1970 1 2 3 0 0 0 0
3 9.733144e-313 1970 1 2 1 0 0 0 0
4 9.733144e-313 1970 1 3 1 0 0 0 0
5 9.733147e-313 1970 1 8 1 0 0 0 0
6 9.733148e-313 1970 1 11 1 0 0 1 0
7 9.733150e-313 1970 1 15 5 0 0 0 0
8 9.733152e-313 1970 1 19 3 0 0 0 0
9 9.733152e-313 1970 1 19 2 0 0 0 0
10 9.733153e-313 1970 1 20 1 0 0 1 0
Note that we excluded variables (for clarity) and observations (to avoid missing values), so the actual dimensions of this dataframe are:
dim(gtd)
[1] 9147 9
Let’s start with understanding the data bit better. You’d want to do this to avoid modelling relationships that are not meaningful.
Look at the frequencies of the number of perpetrators and subset these frequency counts by the suicide and ransom variable.
table(gtd$nperps)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
25 1346 1780 1272 955 530 365 150 154 63 230 25 112 20
14 15 16 17 18 19 20 21 22 23 24 25 26 28
18 192 5 9 19 4 290 7 5 1 18 72 3 1
29 30 33 34 35 36 39 40 42 44 45 47 48 49
1 267 1 1 15 8 2 128 1 1 3 1 1 1
50 51 55 56 60 68 70 72 75 80 90 100 101 120
219 3 1 1 91 1 35 1 5 42 3 211 1 8
130 138 140 150 160 180 190 200 204 250 290 300 350 400
7 1 2 82 2 3 1 137 1 7 1 64 5 18
500 600 700 750 1000 1200 1500 2500 3200 10000 20000
20 8 8 1 12 2 2 1 1 36 1
table(gtd$nperps, gtd$suicide, gtd$ransom)
, , = 0
0 1
0 25 0
1 1294 31
2 1731 16
3 1191 17
4 868 9
5 475 9
6 336 4
7 133 2
8 139 1
9 51 0
10 215 0
11 18 0
12 103 0
13 20 0
14 17 0
15 187 0
16 5 0
17 8 0
18 12 0
19 4 0
20 279 0
21 7 0
22 5 0
23 1 0
24 17 0
25 70 0
26 3 0
28 1 0
29 1 0
30 257 2
33 1 0
34 1 0
35 15 0
36 7 1
39 2 0
40 125 0
42 1 0
44 1 0
45 3 0
47 1 0
48 1 0
49 1 0
50 218 0
51 3 0
55 1 0
56 1 0
60 90 0
68 1 0
70 35 0
72 1 0
75 5 0
80 42 0
90 3 0
100 208 0
101 1 0
120 8 0
130 7 0
138 1 0
140 2 0
150 81 1
160 2 0
180 3 0
190 1 0
200 136 0
204 1 0
250 7 0
290 1 0
300 63 0
350 5 0
400 18 0
500 20 0
600 8 0
700 8 0
750 1 0
1000 12 0
1200 2 0
1500 2 0
2500 1 0
3200 1 0
10000 36 0
20000 1 0
, , = 1
0 1
0 0 0
1 20 1
2 33 0
3 64 0
4 78 0
5 46 0
6 25 0
7 15 0
8 14 0
9 12 0
10 15 0
11 7 0
12 9 0
13 0 0
14 1 0
15 5 0
16 0 0
17 1 0
18 7 0
19 0 0
20 11 0
21 0 0
22 0 0
23 0 0
24 1 0
25 2 0
26 0 0
28 0 0
29 0 0
30 8 0
33 0 0
34 0 0
35 0 0
36 0 0
39 0 0
40 3 0
42 0 0
44 0 0
45 0 0
47 0 0
48 0 0
49 0 0
50 1 0
51 0 0
55 0 0
56 0 0
60 1 0
68 0 0
70 0 0
72 0 0
75 0 0
80 0 0
90 0 0
100 3 0
101 0 0
120 0 0
130 0 0
138 0 0
140 0 0
150 0 0
160 0 0
180 0 0
190 0 0
200 1 0
204 0 0
250 0 0
290 0 0
300 1 0
350 0 0
400 0 0
500 0 0
600 0 0
700 0 0
750 0 0
1000 0 0
1200 0 0
1500 0 0
2500 0 0
3200 0 0
10000 0 0
20000 0 0
In which year was the number of perpetrators (on average) the highest?
perps_year[which.max(perps_year$nperps),] # 1996
Group.1 nperps
26 1996 1110.928
What is the most common value of the number of persons killed and wounded?
Hint: ?hist
and ?table
table(gtd$nwound)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
6512 1145 419 253 127 97 69 66 42 34 38 27 28 12 16 21 9
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
16 11 11 22 6 6 1 2 10 7 4 5 2 14 3 5 1
34 35 36 37 38 39 40 41 42 45 48 50 53 54 55 56 60
3 1 1 2 4 2 9 2 2 2 5 11 2 1 2 1 6
65 68 69 70 72 74 76 78 80 81 85 90 96 98 100 104 106
1 2 1 7 1 1 3 1 2 1 1 2 1 1 9 1 1
120 121 127 138 150 161 170 192 217 236 650 671 8190 8191
1 1 1 1 1 1 1 1 1 1 1 1 1 1
Display the relationship between the number of killed victims and the number of wounded victims in a figure.
What kind of a relationship do you expect?
Hint: ?plot
What is the mean number of perpetrators when the attack was a suicide attack?
Hint: ?tapply
#your code comes here
tapply(gtd$nperps, gtd$suicide, mean)
0 1
65.476306 5.191489
We’ll start with simple regression (i.e. one predictor variable):
Build a simple regression model that models the number of wounded victims through the number of perpetrators.
summary(lm1)
Call:
lm(formula = gtd$nwound ~ gtd$nperps)
Residuals:
Min 1Q Median 3Q Max
-3.8 -3.8 -3.8 -2.8 8187.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.8327993 1.2796622 2.995 0.00275 **
gtd$nperps -0.0002303 0.0019183 -0.120 0.90445
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 121.8 on 9145 degrees of freedom
Multiple R-squared: 1.576e-06, Adjusted R-squared: -0.0001078
F-statistic: 0.01441 on 1 and 9145 DF, p-value: 0.9045
How satisfied are you with your model? One way to assess the “model fit” is to calculate the root mean square error - RMSE - (residuals). Calculate that metric and think about the meaning of this fit index. What does it tell you and how satisfied are you with it?
sqrt(mean(lm1$residuals^2))
[1] 121.7937
Plot the fitted values (in green) and the observed values (in blue) to assess the model fit visually.
Now you might want to use multiple variables in your model:
Build a multiple regression model that models the number of killed victims through the variables suicide and ransom. Include only the two main effects (and let the intercept in the model).
summary(lm2)
Call:
lm(formula = gtd$nkill ~ gtd$suicide + gtd$ransom)
Residuals:
Min 1Q Median 3Q Max
-50.11 -2.43 -1.43 -0.51 1333.89
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.4273 0.2385 10.177 <2e-16 ***
gtd$suicide 47.6782 2.3028 20.705 <2e-16 ***
gtd$ransom -1.9174 1.1566 -1.658 0.0974 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.21 on 9144 degrees of freedom
Multiple R-squared: 0.04518, Adjusted R-squared: 0.04497
F-statistic: 216.3 on 2 and 9144 DF, p-value: < 2.2e-16
Have a look at a potential interaction between these two predictor variables. Use the interaction.plot
function to look at the joint relationship of these two variables on the number of killed victims.
What does this graph tell you? Can you identify the main effects and (potential) interaction?
Now look at the interaction in a numerical manner.
Hint: ?tapply
tapply(gtd$nkill, list(gtd$suicide, gtd$ransom), mean)
0 1
0 2.421848 0.6328125
1 50.612903 1.0000000
Suppose you want to expand the model by adding the interaction term to it. Build that model.
summary(lm3)
Call:
lm(formula = gtd$nkill ~ gtd$suicide * gtd$ransom)
Residuals:
Min 1Q Median 3Q Max
-50.61 -2.42 -1.42 -0.63 1333.39
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.4218 0.2385 10.155 <2e-16 ***
gtd$suicide 48.1911 2.3148 20.819 <2e-16 ***
gtd$ransom -1.7890 1.1579 -1.545 0.1224
gtd$suicide:gtd$ransom -47.8239 22.3531 -2.139 0.0324 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.2 on 9143 degrees of freedom
Multiple R-squared: 0.04565, Adjusted R-squared: 0.04534
F-statistic: 145.8 on 3 and 9143 DF, p-value: < 2.2e-16
Based on the RMSE of each of the two models above (2 main effects vs 2 main effects + 1 interaction), which one do you prefer?
sqrt(mean(lm3$residuals^2))
[1] 22.19921
Have a look at the distribution of the nperps
and nkill
column. Are there some potential outliers in there?
Hint: ?plot
Re-run the best fitting regression model again after exluding the potential outliers. The decision for the “best” model can be made based on the RMSE:
sqrt(mean(lm_clean$residuals^2))
[1] 9.78957
Now suppose you want to let the model building process be decided by a stepwise model selection procedure.
Build a “null model” that only contains an intercept.
summary(null_model)
Call:
lm(formula = gtd_clean$nwound ~ 1)
Residuals:
Min 1Q Median 3Q Max
-2.03 -2.03 -2.03 -1.03 668.97
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0274 0.1388 14.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.27 on 9144 degrees of freedom
Build a full model for the number of wounded victims modeled through the number of perpetrators, “suicide” and “ransom”.
summary(lm4)
Call:
lm(formula = gtd_clean$nwound ~ gtd_clean$nperps * gtd_clean$suicide *
gtd_clean$ransom)
Residuals:
Min 1Q Median 3Q Max
-32.87 -1.77 -1.77 -0.77 669.23
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value
(Intercept) 1.771e+00 1.395e-01 12.701
gtd_clean$nperps 4.745e-05 2.035e-04 0.233
gtd_clean$suicide 3.120e+01 1.430e+00 21.820
gtd_clean$ransom -1.380e+00 7.303e-01 -1.890
gtd_clean$nperps:gtd_clean$suicide -1.044e-01 8.345e-02 -1.251
gtd_clean$nperps:gtd_clean$ransom 1.684e-02 3.153e-02 0.534
gtd_clean$suicide:gtd_clean$ransom -3.151e+01 1.302e+01 -2.421
gtd_clean$nperps:gtd_clean$suicide:gtd_clean$ransom NA NA NA
Pr(>|t|)
(Intercept) <2e-16 ***
gtd_clean$nperps 0.8157
gtd_clean$suicide <2e-16 ***
gtd_clean$ransom 0.0588 .
gtd_clean$nperps:gtd_clean$suicide 0.2111
gtd_clean$nperps:gtd_clean$ransom 0.5932
gtd_clean$suicide:gtd_clean$ransom 0.0155 *
gtd_clean$nperps:gtd_clean$suicide:gtd_clean$ransom NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.92 on 9138 degrees of freedom
Multiple R-squared: 0.05327, Adjusted R-squared: 0.05265
F-statistic: 85.7 on 6 and 9138 DF, p-value: < 2.2e-16
Determine the best fitting model in a backward model selection procedure.
#your code comes here
step(lm4, direction = 'backward')
Start: AIC=46807.07
gtd_clean$nwound ~ gtd_clean$nperps * gtd_clean$suicide * gtd_clean$ransom
Step: AIC=46807.07
gtd_clean$nwound ~ gtd_clean$nperps + gtd_clean$suicide + gtd_clean$ransom +
gtd_clean$nperps:gtd_clean$suicide + gtd_clean$nperps:gtd_clean$ransom +
gtd_clean$suicide:gtd_clean$ransom
Df Sum of Sq RSS AIC
- gtd_clean$nperps:gtd_clean$ransom 1 47.64 1525430 46805
- gtd_clean$nperps:gtd_clean$suicide 1 261.08 1525644 46807
<none> 1525382 46807
- gtd_clean$suicide:gtd_clean$ransom 1 978.23 1526361 46811
Step: AIC=46805.36
gtd_clean$nwound ~ gtd_clean$nperps + gtd_clean$suicide + gtd_clean$ransom +
gtd_clean$nperps:gtd_clean$suicide + gtd_clean$suicide:gtd_clean$ransom
Df Sum of Sq RSS AIC
- gtd_clean$nperps:gtd_clean$suicide 1 261.09 1525691 46805
<none> 1525430 46805
- gtd_clean$suicide:gtd_clean$ransom 1 986.90 1526417 46809
Step: AIC=46804.92
gtd_clean$nwound ~ gtd_clean$nperps + gtd_clean$suicide + gtd_clean$ransom +
gtd_clean$suicide:gtd_clean$ransom
Df Sum of Sq RSS AIC
- gtd_clean$nperps 1 9.11 1525700 46803
<none> 1525691 46805
- gtd_clean$suicide:gtd_clean$ransom 1 960.18 1526651 46809
Step: AIC=46802.98
gtd_clean$nwound ~ gtd_clean$suicide + gtd_clean$ransom + gtd_clean$suicide:gtd_clean$ransom
Df Sum of Sq RSS AIC
<none> 1525700 46803
- gtd_clean$suicide:gtd_clean$ransom 1 960.02 1526660 46807
Call:
lm(formula = gtd_clean$nwound ~ gtd_clean$suicide + gtd_clean$ransom +
gtd_clean$suicide:gtd_clean$ransom)
Coefficients:
(Intercept) gtd_clean$suicide
1.774 30.654
gtd_clean$ransom gtd_clean$suicide:gtd_clean$ransom
-1.233 -31.196
Run the model selection again but this time using the forward model selection procedure.
step(lm4, direction = 'forward'
, scope = list(lower = null_model, upper = lm4))
Start: AIC=46807.07
gtd_clean$nwound ~ gtd_clean$nperps * gtd_clean$suicide * gtd_clean$ransom
Call:
lm(formula = gtd_clean$nwound ~ gtd_clean$nperps * gtd_clean$suicide *
gtd_clean$ransom)
Coefficients:
(Intercept)
1.771e+00
gtd_clean$nperps
4.745e-05
gtd_clean$suicide
3.120e+01
gtd_clean$ransom
-1.380e+00
gtd_clean$nperps:gtd_clean$suicide
-1.044e-01
gtd_clean$nperps:gtd_clean$ransom
1.684e-02
gtd_clean$suicide:gtd_clean$ransom
-3.151e+01
gtd_clean$nperps:gtd_clean$suicide:gtd_clean$ransom
NA
Compare the RMSE of the null model, the full model and the best fitting model (if different from the full model).
Display the residuals in a graph using the colours green, black, and blue for the null model, observed values, and the full model, respectively.
Hint: ?points
Did you expect the observations for the null model? See whether you can discover the reason for that relationship in the model outcome (i.e. the coefficients).