SOLUTIONS

Homework week 2: Regression modelling in R

The Global Terrorism Database

We will use a dataset that includes variables of 180,000 terrorist attacks between 1970-2017. Note that we excluded some variables for clarity and did some preprocessing on the data.

Suppose you are given that dataset and you’re asked to inform policy-makers about the relationship between terrorist attack details and the number of victims. This notebook asks you to perform initial descriptive statistics and then build and evaluate models of the data.

You can load the dataset as follows:

Variable codebook:

This command loads a dataframe called gtd in your notebook. You can query that dataframe as usual.

Have a look at the names (columns) in the data.frame:

names(gtd)
[1] "eventid" "iyear"   "imonth"  "iday"    "nperps"  "suicide" "ransom"  "nkill"  
[9] "nwound" 

… and show the first 10 rows:

head(gtd, 10)
         eventid iyear imonth iday nperps suicide ransom nkill nwound
1  9.733093e-313  1970      0    0      7       0      1     0      0
2  9.733144e-313  1970      1    2      3       0      0     0      0
3  9.733144e-313  1970      1    2      1       0      0     0      0
4  9.733144e-313  1970      1    3      1       0      0     0      0
5  9.733147e-313  1970      1    8      1       0      0     0      0
6  9.733148e-313  1970      1   11      1       0      0     1      0
7  9.733150e-313  1970      1   15      5       0      0     0      0
8  9.733152e-313  1970      1   19      3       0      0     0      0
9  9.733152e-313  1970      1   19      2       0      0     0      0
10 9.733153e-313  1970      1   20      1       0      0     1      0

Note that we excluded variables (for clarity) and observations (to avoid missing values), so the actual dimensions of this dataframe are:

dim(gtd)
[1] 9147    9

Understanding the data

Let’s start with understanding the data bit better. You’d want to do this to avoid modelling relationships that are not meaningful.

Task

Look at the frequencies of the number of perpetrators and subset these frequency counts by the suicide and ransom variable.

table(gtd$nperps)

    0     1     2     3     4     5     6     7     8     9    10    11    12    13 
   25  1346  1780  1272   955   530   365   150   154    63   230    25   112    20 
   14    15    16    17    18    19    20    21    22    23    24    25    26    28 
   18   192     5     9    19     4   290     7     5     1    18    72     3     1 
   29    30    33    34    35    36    39    40    42    44    45    47    48    49 
    1   267     1     1    15     8     2   128     1     1     3     1     1     1 
   50    51    55    56    60    68    70    72    75    80    90   100   101   120 
  219     3     1     1    91     1    35     1     5    42     3   211     1     8 
  130   138   140   150   160   180   190   200   204   250   290   300   350   400 
    7     1     2    82     2     3     1   137     1     7     1    64     5    18 
  500   600   700   750  1000  1200  1500  2500  3200 10000 20000 
   20     8     8     1    12     2     2     1     1    36     1 
table(gtd$nperps, gtd$suicide, gtd$ransom)
, ,  = 0

       
           0    1
  0       25    0
  1     1294   31
  2     1731   16
  3     1191   17
  4      868    9
  5      475    9
  6      336    4
  7      133    2
  8      139    1
  9       51    0
  10     215    0
  11      18    0
  12     103    0
  13      20    0
  14      17    0
  15     187    0
  16       5    0
  17       8    0
  18      12    0
  19       4    0
  20     279    0
  21       7    0
  22       5    0
  23       1    0
  24      17    0
  25      70    0
  26       3    0
  28       1    0
  29       1    0
  30     257    2
  33       1    0
  34       1    0
  35      15    0
  36       7    1
  39       2    0
  40     125    0
  42       1    0
  44       1    0
  45       3    0
  47       1    0
  48       1    0
  49       1    0
  50     218    0
  51       3    0
  55       1    0
  56       1    0
  60      90    0
  68       1    0
  70      35    0
  72       1    0
  75       5    0
  80      42    0
  90       3    0
  100    208    0
  101      1    0
  120      8    0
  130      7    0
  138      1    0
  140      2    0
  150     81    1
  160      2    0
  180      3    0
  190      1    0
  200    136    0
  204      1    0
  250      7    0
  290      1    0
  300     63    0
  350      5    0
  400     18    0
  500     20    0
  600      8    0
  700      8    0
  750      1    0
  1000    12    0
  1200     2    0
  1500     2    0
  2500     1    0
  3200     1    0
  10000   36    0
  20000    1    0

, ,  = 1

       
           0    1
  0        0    0
  1       20    1
  2       33    0
  3       64    0
  4       78    0
  5       46    0
  6       25    0
  7       15    0
  8       14    0
  9       12    0
  10      15    0
  11       7    0
  12       9    0
  13       0    0
  14       1    0
  15       5    0
  16       0    0
  17       1    0
  18       7    0
  19       0    0
  20      11    0
  21       0    0
  22       0    0
  23       0    0
  24       1    0
  25       2    0
  26       0    0
  28       0    0
  29       0    0
  30       8    0
  33       0    0
  34       0    0
  35       0    0
  36       0    0
  39       0    0
  40       3    0
  42       0    0
  44       0    0
  45       0    0
  47       0    0
  48       0    0
  49       0    0
  50       1    0
  51       0    0
  55       0    0
  56       0    0
  60       1    0
  68       0    0
  70       0    0
  72       0    0
  75       0    0
  80       0    0
  90       0    0
  100      3    0
  101      0    0
  120      0    0
  130      0    0
  138      0    0
  140      0    0
  150      0    0
  160      0    0
  180      0    0
  190      0    0
  200      1    0
  204      0    0
  250      0    0
  290      0    0
  300      1    0
  350      0    0
  400      0    0
  500      0    0
  600      0    0
  700      0    0
  750      0    0
  1000     0    0
  1200     0    0
  1500     0    0
  2500     0    0
  3200     0    0
  10000    0    0
  20000    0    0

Task

In which year was the number of perpetrators (on average) the highest?

perps_year[which.max(perps_year$nperps),] # 1996
   Group.1   nperps
26    1996 1110.928

Task

What is the most common value of the number of persons killed and wounded?

Hint: ?hist and ?table

table(gtd$nwound)

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
6512 1145  419  253  127   97   69   66   42   34   38   27   28   12   16   21    9 
  17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
  16   11   11   22    6    6    1    2   10    7    4    5    2   14    3    5    1 
  34   35   36   37   38   39   40   41   42   45   48   50   53   54   55   56   60 
   3    1    1    2    4    2    9    2    2    2    5   11    2    1    2    1    6 
  65   68   69   70   72   74   76   78   80   81   85   90   96   98  100  104  106 
   1    2    1    7    1    1    3    1    2    1    1    2    1    1    9    1    1 
 120  121  127  138  150  161  170  192  217  236  650  671 8190 8191 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1 

Task

Display the relationship between the number of killed victims and the number of wounded victims in a figure.

What kind of a relationship do you expect?

Hint: ?plot

Task

What is the mean number of perpetrators when the attack was a suicide attack?

Hint: ?tapply

#your code comes here
tapply(gtd$nperps, gtd$suicide, mean)
        0         1 
65.476306  5.191489 

Simple regression

We’ll start with simple regression (i.e. one predictor variable):

Task

Build a simple regression model that models the number of wounded victims through the number of perpetrators.

summary(lm1)

Call:
lm(formula = gtd$nwound ~ gtd$nperps)

Residuals:
   Min     1Q Median     3Q    Max 
  -3.8   -3.8   -3.8   -2.8 8187.2 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  3.8327993  1.2796622   2.995  0.00275 **
gtd$nperps  -0.0002303  0.0019183  -0.120  0.90445   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 121.8 on 9145 degrees of freedom
Multiple R-squared:  1.576e-06, Adjusted R-squared:  -0.0001078 
F-statistic: 0.01441 on 1 and 9145 DF,  p-value: 0.9045

Task

How satisfied are you with your model? One way to assess the “model fit” is to calculate the root mean square error - RMSE - (residuals). Calculate that metric and think about the meaning of this fit index. What does it tell you and how satisfied are you with it?

sqrt(mean(lm1$residuals^2))
[1] 121.7937

Task

Plot the fitted values (in green) and the observed values (in blue) to assess the model fit visually.

Multiple regression

Now you might want to use multiple variables in your model:

Task

Build a multiple regression model that models the number of killed victims through the variables suicide and ransom. Include only the two main effects (and let the intercept in the model).

summary(lm2)

Call:
lm(formula = gtd$nkill ~ gtd$suicide + gtd$ransom)

Residuals:
    Min      1Q  Median      3Q     Max 
 -50.11   -2.43   -1.43   -0.51 1333.89 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.4273     0.2385  10.177   <2e-16 ***
gtd$suicide  47.6782     2.3028  20.705   <2e-16 ***
gtd$ransom   -1.9174     1.1566  -1.658   0.0974 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.21 on 9144 degrees of freedom
Multiple R-squared:  0.04518,   Adjusted R-squared:  0.04497 
F-statistic: 216.3 on 2 and 9144 DF,  p-value: < 2.2e-16

Task

Have a look at a potential interaction between these two predictor variables. Use the interaction.plot function to look at the joint relationship of these two variables on the number of killed victims.

What does this graph tell you? Can you identify the main effects and (potential) interaction?

Task

Now look at the interaction in a numerical manner.

Hint: ?tapply

tapply(gtd$nkill, list(gtd$suicide, gtd$ransom), mean)
          0         1
0  2.421848 0.6328125
1 50.612903 1.0000000

Task

Suppose you want to expand the model by adding the interaction term to it. Build that model.

summary(lm3)

Call:
lm(formula = gtd$nkill ~ gtd$suicide * gtd$ransom)

Residuals:
    Min      1Q  Median      3Q     Max 
 -50.61   -2.42   -1.42   -0.63 1333.39 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)              2.4218     0.2385  10.155   <2e-16 ***
gtd$suicide             48.1911     2.3148  20.819   <2e-16 ***
gtd$ransom              -1.7890     1.1579  -1.545   0.1224    
gtd$suicide:gtd$ransom -47.8239    22.3531  -2.139   0.0324 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.2 on 9143 degrees of freedom
Multiple R-squared:  0.04565,   Adjusted R-squared:  0.04534 
F-statistic: 145.8 on 3 and 9143 DF,  p-value: < 2.2e-16

Task

Based on the RMSE of each of the two models above (2 main effects vs 2 main effects + 1 interaction), which one do you prefer?

sqrt(mean(lm3$residuals^2))
[1] 22.19921

Task

Have a look at the distribution of the nperps and nkill column. Are there some potential outliers in there?

Hint: ?plot

Task

Re-run the best fitting regression model again after exluding the potential outliers. The decision for the “best” model can be made based on the RMSE:

sqrt(mean(lm_clean$residuals^2))
[1] 9.78957

Model selection

Now suppose you want to let the model building process be decided by a stepwise model selection procedure.

Task

Build a “null model” that only contains an intercept.

summary(null_model)

Call:
lm(formula = gtd_clean$nwound ~ 1)

Residuals:
   Min     1Q Median     3Q    Max 
 -2.03  -2.03  -2.03  -1.03 668.97 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.0274     0.1388   14.61   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.27 on 9144 degrees of freedom

Task

Build a full model for the number of wounded victims modeled through the number of perpetrators, “suicide” and “ransom”.

summary(lm4)

Call:
lm(formula = gtd_clean$nwound ~ gtd_clean$nperps * gtd_clean$suicide * 
    gtd_clean$ransom)

Residuals:
   Min     1Q Median     3Q    Max 
-32.87  -1.77  -1.77  -0.77 669.23 

Coefficients: (1 not defined because of singularities)
                                                      Estimate Std. Error t value
(Intercept)                                          1.771e+00  1.395e-01  12.701
gtd_clean$nperps                                     4.745e-05  2.035e-04   0.233
gtd_clean$suicide                                    3.120e+01  1.430e+00  21.820
gtd_clean$ransom                                    -1.380e+00  7.303e-01  -1.890
gtd_clean$nperps:gtd_clean$suicide                  -1.044e-01  8.345e-02  -1.251
gtd_clean$nperps:gtd_clean$ransom                    1.684e-02  3.153e-02   0.534
gtd_clean$suicide:gtd_clean$ransom                  -3.151e+01  1.302e+01  -2.421
gtd_clean$nperps:gtd_clean$suicide:gtd_clean$ransom         NA         NA      NA
                                                    Pr(>|t|)    
(Intercept)                                           <2e-16 ***
gtd_clean$nperps                                      0.8157    
gtd_clean$suicide                                     <2e-16 ***
gtd_clean$ransom                                      0.0588 .  
gtd_clean$nperps:gtd_clean$suicide                    0.2111    
gtd_clean$nperps:gtd_clean$ransom                     0.5932    
gtd_clean$suicide:gtd_clean$ransom                    0.0155 *  
gtd_clean$nperps:gtd_clean$suicide:gtd_clean$ransom       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.92 on 9138 degrees of freedom
Multiple R-squared:  0.05327,   Adjusted R-squared:  0.05265 
F-statistic:  85.7 on 6 and 9138 DF,  p-value: < 2.2e-16

Task

Determine the best fitting model in a backward model selection procedure.

#your code comes here
step(lm4, direction = 'backward')
Start:  AIC=46807.07
gtd_clean$nwound ~ gtd_clean$nperps * gtd_clean$suicide * gtd_clean$ransom


Step:  AIC=46807.07
gtd_clean$nwound ~ gtd_clean$nperps + gtd_clean$suicide + gtd_clean$ransom + 
    gtd_clean$nperps:gtd_clean$suicide + gtd_clean$nperps:gtd_clean$ransom + 
    gtd_clean$suicide:gtd_clean$ransom

                                     Df Sum of Sq     RSS   AIC
- gtd_clean$nperps:gtd_clean$ransom   1     47.64 1525430 46805
- gtd_clean$nperps:gtd_clean$suicide  1    261.08 1525644 46807
<none>                                            1525382 46807
- gtd_clean$suicide:gtd_clean$ransom  1    978.23 1526361 46811

Step:  AIC=46805.36
gtd_clean$nwound ~ gtd_clean$nperps + gtd_clean$suicide + gtd_clean$ransom + 
    gtd_clean$nperps:gtd_clean$suicide + gtd_clean$suicide:gtd_clean$ransom

                                     Df Sum of Sq     RSS   AIC
- gtd_clean$nperps:gtd_clean$suicide  1    261.09 1525691 46805
<none>                                            1525430 46805
- gtd_clean$suicide:gtd_clean$ransom  1    986.90 1526417 46809

Step:  AIC=46804.92
gtd_clean$nwound ~ gtd_clean$nperps + gtd_clean$suicide + gtd_clean$ransom + 
    gtd_clean$suicide:gtd_clean$ransom

                                     Df Sum of Sq     RSS   AIC
- gtd_clean$nperps                    1      9.11 1525700 46803
<none>                                            1525691 46805
- gtd_clean$suicide:gtd_clean$ransom  1    960.18 1526651 46809

Step:  AIC=46802.98
gtd_clean$nwound ~ gtd_clean$suicide + gtd_clean$ransom + gtd_clean$suicide:gtd_clean$ransom

                                     Df Sum of Sq     RSS   AIC
<none>                                            1525700 46803
- gtd_clean$suicide:gtd_clean$ransom  1    960.02 1526660 46807

Call:
lm(formula = gtd_clean$nwound ~ gtd_clean$suicide + gtd_clean$ransom + 
    gtd_clean$suicide:gtd_clean$ransom)

Coefficients:
                       (Intercept)                   gtd_clean$suicide  
                             1.774                              30.654  
                  gtd_clean$ransom  gtd_clean$suicide:gtd_clean$ransom  
                            -1.233                             -31.196  

Task

Run the model selection again but this time using the forward model selection procedure.

step(lm4, direction = 'forward'
     , scope = list(lower = null_model, upper = lm4))
Start:  AIC=46807.07
gtd_clean$nwound ~ gtd_clean$nperps * gtd_clean$suicide * gtd_clean$ransom


Call:
lm(formula = gtd_clean$nwound ~ gtd_clean$nperps * gtd_clean$suicide * 
    gtd_clean$ransom)

Coefficients:
                                        (Intercept)  
                                          1.771e+00  
                                   gtd_clean$nperps  
                                          4.745e-05  
                                  gtd_clean$suicide  
                                          3.120e+01  
                                   gtd_clean$ransom  
                                         -1.380e+00  
                 gtd_clean$nperps:gtd_clean$suicide  
                                         -1.044e-01  
                  gtd_clean$nperps:gtd_clean$ransom  
                                          1.684e-02  
                 gtd_clean$suicide:gtd_clean$ransom  
                                         -3.151e+01  
gtd_clean$nperps:gtd_clean$suicide:gtd_clean$ransom  
                                                 NA  

Task

Compare the RMSE of the null model, the full model and the best fitting model (if different from the full model).

Display the residuals in a graph using the colours green, black, and blue for the null model, observed values, and the full model, respectively.

Hint: ?points