SOLUTIONS

Homework week 2: Regression modelling in R

## The Global Terrorism Database

We will use a dataset that includes variables of 180,000 terrorist attacks between 1970-2017. Note that we excluded some variables for clarity and did some preprocessing on the data.

Suppose you are given that dataset and you’re asked to inform policy-makers about the relationship between terrorist attack details and the number of victims. This notebook asks you to perform initial descriptive statistics and then build and evaluate models of the data.

You can load the dataset as follows:

• eventid: unique event identifier
• iyear: year
• imonth: month
• iday: day
• nperps: number of perpetrators
• suicide: whether the attack was a suicide attach
• ransom: whether ransom was demanded
• nkill: number of killed victims
• nwound: number of wounded victims

This command loads a dataframe called `gtd` in your notebook. You can query that dataframe as usual.

Have a look at the names (columns) in the data.frame:

``````names(gtd)
``````
``````[1] "eventid" "iyear"   "imonth"  "iday"    "nperps"  "suicide" "ransom"  "nkill"
[9] "nwound" ``````

… and show the first 10 rows:

``````head(gtd, 10)
``````
``````         eventid iyear imonth iday nperps suicide ransom nkill nwound
1  9.733093e-313  1970      0    0      7       0      1     0      0
2  9.733144e-313  1970      1    2      3       0      0     0      0
3  9.733144e-313  1970      1    2      1       0      0     0      0
4  9.733144e-313  1970      1    3      1       0      0     0      0
5  9.733147e-313  1970      1    8      1       0      0     0      0
6  9.733148e-313  1970      1   11      1       0      0     1      0
7  9.733150e-313  1970      1   15      5       0      0     0      0
8  9.733152e-313  1970      1   19      3       0      0     0      0
9  9.733152e-313  1970      1   19      2       0      0     0      0
10 9.733153e-313  1970      1   20      1       0      0     1      0``````

Note that we excluded variables (for clarity) and observations (to avoid missing values), so the actual dimensions of this dataframe are:

``````dim(gtd)
``````
``[1] 9147    9``

## Understanding the data

Let’s start with understanding the data bit better. You’d want to do this to avoid modelling relationships that are not meaningful.

Look at the frequencies of the number of perpetrators and subset these frequency counts by the suicide and ransom variable.

``````table(gtd\$nperps)
``````
``````
0     1     2     3     4     5     6     7     8     9    10    11    12    13
25  1346  1780  1272   955   530   365   150   154    63   230    25   112    20
14    15    16    17    18    19    20    21    22    23    24    25    26    28
18   192     5     9    19     4   290     7     5     1    18    72     3     1
29    30    33    34    35    36    39    40    42    44    45    47    48    49
1   267     1     1    15     8     2   128     1     1     3     1     1     1
50    51    55    56    60    68    70    72    75    80    90   100   101   120
219     3     1     1    91     1    35     1     5    42     3   211     1     8
130   138   140   150   160   180   190   200   204   250   290   300   350   400
7     1     2    82     2     3     1   137     1     7     1    64     5    18
500   600   700   750  1000  1200  1500  2500  3200 10000 20000
20     8     8     1    12     2     2     1     1    36     1 ``````
``table(gtd\$nperps, gtd\$suicide, gtd\$ransom)``
``````, ,  = 0

0    1
0       25    0
1     1294   31
2     1731   16
3     1191   17
4      868    9
5      475    9
6      336    4
7      133    2
8      139    1
9       51    0
10     215    0
11      18    0
12     103    0
13      20    0
14      17    0
15     187    0
16       5    0
17       8    0
18      12    0
19       4    0
20     279    0
21       7    0
22       5    0
23       1    0
24      17    0
25      70    0
26       3    0
28       1    0
29       1    0
30     257    2
33       1    0
34       1    0
35      15    0
36       7    1
39       2    0
40     125    0
42       1    0
44       1    0
45       3    0
47       1    0
48       1    0
49       1    0
50     218    0
51       3    0
55       1    0
56       1    0
60      90    0
68       1    0
70      35    0
72       1    0
75       5    0
80      42    0
90       3    0
100    208    0
101      1    0
120      8    0
130      7    0
138      1    0
140      2    0
150     81    1
160      2    0
180      3    0
190      1    0
200    136    0
204      1    0
250      7    0
290      1    0
300     63    0
350      5    0
400     18    0
500     20    0
600      8    0
700      8    0
750      1    0
1000    12    0
1200     2    0
1500     2    0
2500     1    0
3200     1    0
10000   36    0
20000    1    0

, ,  = 1

0    1
0        0    0
1       20    1
2       33    0
3       64    0
4       78    0
5       46    0
6       25    0
7       15    0
8       14    0
9       12    0
10      15    0
11       7    0
12       9    0
13       0    0
14       1    0
15       5    0
16       0    0
17       1    0
18       7    0
19       0    0
20      11    0
21       0    0
22       0    0
23       0    0
24       1    0
25       2    0
26       0    0
28       0    0
29       0    0
30       8    0
33       0    0
34       0    0
35       0    0
36       0    0
39       0    0
40       3    0
42       0    0
44       0    0
45       0    0
47       0    0
48       0    0
49       0    0
50       1    0
51       0    0
55       0    0
56       0    0
60       1    0
68       0    0
70       0    0
72       0    0
75       0    0
80       0    0
90       0    0
100      3    0
101      0    0
120      0    0
130      0    0
138      0    0
140      0    0
150      0    0
160      0    0
180      0    0
190      0    0
200      1    0
204      0    0
250      0    0
290      0    0
300      1    0
350      0    0
400      0    0
500      0    0
600      0    0
700      0    0
750      0    0
1000     0    0
1200     0    0
1500     0    0
2500     0    0
3200     0    0
10000    0    0
20000    0    0``````

In which year was the number of perpetrators (on average) the highest?

``````perps_year[which.max(perps_year\$nperps),] # 1996
``````
``````   Group.1   nperps
26    1996 1110.928``````

What is the most common value of the number of persons killed and wounded?

Hint: `?hist` and `?table`

``````table(gtd\$nwound)
``````
``````
0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
6512 1145  419  253  127   97   69   66   42   34   38   27   28   12   16   21    9
17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33
16   11   11   22    6    6    1    2   10    7    4    5    2   14    3    5    1
34   35   36   37   38   39   40   41   42   45   48   50   53   54   55   56   60
3    1    1    2    4    2    9    2    2    2    5   11    2    1    2    1    6
65   68   69   70   72   74   76   78   80   81   85   90   96   98  100  104  106
1    2    1    7    1    1    3    1    2    1    1    2    1    1    9    1    1
120  121  127  138  150  161  170  192  217  236  650  671 8190 8191
1    1    1    1    1    1    1    1    1    1    1    1    1    1 ``````

Display the relationship between the number of killed victims and the number of wounded victims in a figure.

What kind of a relationship do you expect?

Hint: `?plot`

What is the mean number of perpetrators when the attack was a suicide attack?

Hint: `?tapply`

``````#your code comes here
tapply(gtd\$nperps, gtd\$suicide, mean)
``````
``````        0         1
65.476306  5.191489 ``````

## Simple regression

Build a simple regression model that models the number of wounded victims through the number of perpetrators.

``````summary(lm1)
``````
``````
Call:
lm(formula = gtd\$nwound ~ gtd\$nperps)

Residuals:
Min     1Q Median     3Q    Max
-3.8   -3.8   -3.8   -2.8 8187.2

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.8327993  1.2796622   2.995  0.00275 **
gtd\$nperps  -0.0002303  0.0019183  -0.120  0.90445
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 121.8 on 9145 degrees of freedom
Multiple R-squared:  1.576e-06, Adjusted R-squared:  -0.0001078
F-statistic: 0.01441 on 1 and 9145 DF,  p-value: 0.9045``````

How satisfied are you with your model? One way to assess the “model fit” is to calculate the root mean square error - RMSE - (residuals). Calculate that metric and think about the meaning of this fit index. What does it tell you and how satisfied are you with it?

``````sqrt(mean(lm1\$residuals^2))
``````
``[1] 121.7937``

Plot the fitted values (in green) and the observed values (in blue) to assess the model fit visually.

## Multiple regression

Now you might want to use multiple variables in your model:

Build a multiple regression model that models the number of killed victims through the variables suicide and ransom. Include only the two main effects (and let the intercept in the model).

``````summary(lm2)
``````
``````
Call:
lm(formula = gtd\$nkill ~ gtd\$suicide + gtd\$ransom)

Residuals:
Min      1Q  Median      3Q     Max
-50.11   -2.43   -1.43   -0.51 1333.89

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.4273     0.2385  10.177   <2e-16 ***
gtd\$suicide  47.6782     2.3028  20.705   <2e-16 ***
gtd\$ransom   -1.9174     1.1566  -1.658   0.0974 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.21 on 9144 degrees of freedom
Multiple R-squared:  0.04518,   Adjusted R-squared:  0.04497
F-statistic: 216.3 on 2 and 9144 DF,  p-value: < 2.2e-16``````

Have a look at a potential interaction between these two predictor variables. Use the `interaction.plot` function to look at the joint relationship of these two variables on the number of killed victims.

What does this graph tell you? Can you identify the main effects and (potential) interaction?

Now look at the interaction in a numerical manner.

Hint: `?tapply`

``````tapply(gtd\$nkill, list(gtd\$suicide, gtd\$ransom), mean)
``````
``````          0         1
0  2.421848 0.6328125
1 50.612903 1.0000000``````

Suppose you want to expand the model by adding the interaction term to it. Build that model.

``````summary(lm3)
``````
``````
Call:
lm(formula = gtd\$nkill ~ gtd\$suicide * gtd\$ransom)

Residuals:
Min      1Q  Median      3Q     Max
-50.61   -2.42   -1.42   -0.63 1333.39

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)              2.4218     0.2385  10.155   <2e-16 ***
gtd\$suicide             48.1911     2.3148  20.819   <2e-16 ***
gtd\$ransom              -1.7890     1.1579  -1.545   0.1224
gtd\$suicide:gtd\$ransom -47.8239    22.3531  -2.139   0.0324 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.2 on 9143 degrees of freedom
Multiple R-squared:  0.04565,   Adjusted R-squared:  0.04534
F-statistic: 145.8 on 3 and 9143 DF,  p-value: < 2.2e-16``````

Based on the RMSE of each of the two models above (2 main effects vs 2 main effects + 1 interaction), which one do you prefer?

``````sqrt(mean(lm3\$residuals^2))
``````
``[1] 22.19921``

Have a look at the distribution of the `nperps` and `nkill` column. Are there some potential outliers in there?

Hint: `?plot`

Re-run the best fitting regression model again after exluding the potential outliers. The decision for the “best” model can be made based on the RMSE:

``````sqrt(mean(lm_clean\$residuals^2))
``````
``[1] 9.78957``

## Model selection

Now suppose you want to let the model building process be decided by a stepwise model selection procedure.

Build a “null model” that only contains an intercept.

``````summary(null_model)
``````
``````
Call:
lm(formula = gtd_clean\$nwound ~ 1)

Residuals:
Min     1Q Median     3Q    Max
-2.03  -2.03  -2.03  -1.03 668.97

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.0274     0.1388   14.61   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.27 on 9144 degrees of freedom``````

Build a full model for the number of wounded victims modeled through the number of perpetrators, “suicide” and “ransom”.

``````summary(lm4)
``````
``````
Call:
lm(formula = gtd_clean\$nwound ~ gtd_clean\$nperps * gtd_clean\$suicide *
gtd_clean\$ransom)

Residuals:
Min     1Q Median     3Q    Max
-32.87  -1.77  -1.77  -0.77 669.23

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value
(Intercept)                                          1.771e+00  1.395e-01  12.701
gtd_clean\$nperps                                     4.745e-05  2.035e-04   0.233
gtd_clean\$suicide                                    3.120e+01  1.430e+00  21.820
gtd_clean\$ransom                                    -1.380e+00  7.303e-01  -1.890
gtd_clean\$nperps:gtd_clean\$suicide                  -1.044e-01  8.345e-02  -1.251
gtd_clean\$nperps:gtd_clean\$ransom                    1.684e-02  3.153e-02   0.534
gtd_clean\$suicide:gtd_clean\$ransom                  -3.151e+01  1.302e+01  -2.421
gtd_clean\$nperps:gtd_clean\$suicide:gtd_clean\$ransom         NA         NA      NA
Pr(>|t|)
(Intercept)                                           <2e-16 ***
gtd_clean\$nperps                                      0.8157
gtd_clean\$suicide                                     <2e-16 ***
gtd_clean\$ransom                                      0.0588 .
gtd_clean\$nperps:gtd_clean\$suicide                    0.2111
gtd_clean\$nperps:gtd_clean\$ransom                     0.5932
gtd_clean\$suicide:gtd_clean\$ransom                    0.0155 *
gtd_clean\$nperps:gtd_clean\$suicide:gtd_clean\$ransom       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.92 on 9138 degrees of freedom
Multiple R-squared:  0.05327,   Adjusted R-squared:  0.05265
F-statistic:  85.7 on 6 and 9138 DF,  p-value: < 2.2e-16``````

Determine the best fitting model in a backward model selection procedure.

``````#your code comes here
step(lm4, direction = 'backward')
``````
``````Start:  AIC=46807.07
gtd_clean\$nwound ~ gtd_clean\$nperps * gtd_clean\$suicide * gtd_clean\$ransom

Step:  AIC=46807.07
gtd_clean\$nwound ~ gtd_clean\$nperps + gtd_clean\$suicide + gtd_clean\$ransom +
gtd_clean\$nperps:gtd_clean\$suicide + gtd_clean\$nperps:gtd_clean\$ransom +
gtd_clean\$suicide:gtd_clean\$ransom

Df Sum of Sq     RSS   AIC
- gtd_clean\$nperps:gtd_clean\$ransom   1     47.64 1525430 46805
- gtd_clean\$nperps:gtd_clean\$suicide  1    261.08 1525644 46807
<none>                                            1525382 46807
- gtd_clean\$suicide:gtd_clean\$ransom  1    978.23 1526361 46811

Step:  AIC=46805.36
gtd_clean\$nwound ~ gtd_clean\$nperps + gtd_clean\$suicide + gtd_clean\$ransom +
gtd_clean\$nperps:gtd_clean\$suicide + gtd_clean\$suicide:gtd_clean\$ransom

Df Sum of Sq     RSS   AIC
- gtd_clean\$nperps:gtd_clean\$suicide  1    261.09 1525691 46805
<none>                                            1525430 46805
- gtd_clean\$suicide:gtd_clean\$ransom  1    986.90 1526417 46809

Step:  AIC=46804.92
gtd_clean\$nwound ~ gtd_clean\$nperps + gtd_clean\$suicide + gtd_clean\$ransom +
gtd_clean\$suicide:gtd_clean\$ransom

Df Sum of Sq     RSS   AIC
- gtd_clean\$nperps                    1      9.11 1525700 46803
<none>                                            1525691 46805
- gtd_clean\$suicide:gtd_clean\$ransom  1    960.18 1526651 46809

Step:  AIC=46802.98
gtd_clean\$nwound ~ gtd_clean\$suicide + gtd_clean\$ransom + gtd_clean\$suicide:gtd_clean\$ransom

Df Sum of Sq     RSS   AIC
<none>                                            1525700 46803
- gtd_clean\$suicide:gtd_clean\$ransom  1    960.02 1526660 46807

Call:
lm(formula = gtd_clean\$nwound ~ gtd_clean\$suicide + gtd_clean\$ransom +
gtd_clean\$suicide:gtd_clean\$ransom)

Coefficients:
(Intercept)                   gtd_clean\$suicide
1.774                              30.654
gtd_clean\$ransom  gtd_clean\$suicide:gtd_clean\$ransom
-1.233                             -31.196  ``````

Run the model selection again but this time using the forward model selection procedure.

``````step(lm4, direction = 'forward'
, scope = list(lower = null_model, upper = lm4))
``````
``````Start:  AIC=46807.07
gtd_clean\$nwound ~ gtd_clean\$nperps * gtd_clean\$suicide * gtd_clean\$ransom

Call:
lm(formula = gtd_clean\$nwound ~ gtd_clean\$nperps * gtd_clean\$suicide *
gtd_clean\$ransom)

Coefficients:
(Intercept)
1.771e+00
gtd_clean\$nperps
4.745e-05
gtd_clean\$suicide
3.120e+01
gtd_clean\$ransom
-1.380e+00
gtd_clean\$nperps:gtd_clean\$suicide
-1.044e-01
gtd_clean\$nperps:gtd_clean\$ransom
1.684e-02
gtd_clean\$suicide:gtd_clean\$ransom
-3.151e+01
gtd_clean\$nperps:gtd_clean\$suicide:gtd_clean\$ransom
NA  ``````

Hint: `?points`