## The Global Terrorism Database

We will use a dataset that describes 180,000 terrorist attacks that occurred between 1970 and 2017. Note that we excluded some variables for clarity and preprocessed the data.

Suppose you are given that dataset and you’re asked to inform policy-makers about the relationship between terrorist attack details and the number of victims. This notebook asks you to perform initial descriptive statistics and then build and evaluate models of the data.

You can load the dataset as follows:

• eventid: unique event identifier
• iyear: year
• imonth: month
• iday: day
• nperps: number of perpetrators
• suicide: whether the attack was a suicide attack
• ransom: whether ransom was demanded
• nkill: number of killed victims
• nwound: number of wounded victims

This command loads a dataframe called gtd in your notebook. You can query that dataframe as usual.

Have a look at the column names of the dataframe:

names(gtd)

[1] "eventid" "iyear"   "imonth"  "iday"    "nperps"  "suicide" "ransom"  "nkill"
[9] "nwound" 

… and show the first 10 rows:

head(gtd, 10)

         eventid iyear imonth iday nperps suicide ransom nkill nwound
1  9.733093e-313  1970      0    0      7       0      1     0      0
2  9.733144e-313  1970      1    2      3       0      0     0      0
3  9.733144e-313  1970      1    2      1       0      0     0      0
4  9.733144e-313  1970      1    3      1       0      0     0      0
5  9.733147e-313  1970      1    8      1       0      0     0      0
6  9.733148e-313  1970      1   11      1       0      0     1      0
7  9.733150e-313  1970      1   15      5       0      0     0      0
8  9.733152e-313  1970      1   19      3       0      0     0      0
9  9.733152e-313  1970      1   19      2       0      0     0      0
10 9.733153e-313  1970      1   20      1       0      0     1      0

Note that we excluded variables (for clarity) and observations (to avoid missing values), so the actual dimensions of this dataframe are:

dim(gtd)

[1] 9147    9

## Understanding the data

Let’s start by understanding the data a bit better. You’d want to do this to avoid modelling relationships that are not meaningful.

Look at the frequencies of the number of perpetrators and subset these frequency counts by the suicide and ransom variables.
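
As a reminder of how `table()` counts and cross-tabulates values, here is a minimal sketch on a small made-up data frame. The name `toy` and its columns `group_size` and `flag` are invented for illustration and are not part of `gtd`.

```r
# Hypothetical toy data (NOT the GTD data)
toy <- data.frame(
  group_size = c(1, 1, 2, 3, 1, 2, 1, 3),
  flag       = c(0, 0, 0, 1, 1, 0, 0, 1)
)

# Frequency of each distinct group size
freq_all <- table(toy$group_size)
print(freq_all)

# The same counts, split by the binary flag (one column per flag value)
freq_by_flag <- table(toy$group_size, toy$flag, dnn = c("group_size", "flag"))
print(freq_by_flag)
```

The same pattern should carry over to a dataframe with one count column and one or more binary columns.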

#your code comes here

In which year was the number of perpetrators (on average) the highest?
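
One possible way to find the group with the highest average is to combine `tapply()` with `which.max()`. The sketch below uses invented data (`toy`, `year`, `size` are illustrative names only).

```r
# Hypothetical toy data (NOT the GTD data)
toy <- data.frame(
  year = c(2000, 2000, 2001, 2001, 2002),
  size = c(1, 3, 4, 6, 2)
)

# Mean of `size` within each year
mean_by_year <- tapply(toy$size, toy$year, mean)
print(mean_by_year)

# Name of the year whose mean is largest
top_year <- names(mean_by_year)[which.max(mean_by_year)]
print(top_year)
```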

#your code comes here

What is the most common value of the number of persons killed, and of the number of persons wounded?

Hint: ?hist and ?table

#your code comes here

Display the relationship between the number of killed victims and the number of wounded victims in a figure.

What kind of a relationship do you expect?

Hint: ?plot

#your code comes here

What is the mean number of perpetrators when the attack was a suicide attack?

Hint: ?tapply

#your code comes here

## Simple regression

Build a simple regression model that models the number of wounded victims through the number of perpetrators.

#your code comes here

How satisfied are you with your model? One way to assess the “model fit” is to calculate the root mean square error (RMSE) of the residuals. Calculate that metric and think about the meaning of this fit index. What does it tell you and how satisfied are you with it?
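
The RMSE is the square root of the mean of the squared residuals, so it is expressed in the same units as the outcome variable. A minimal sketch, assuming a small invented dataset (`toy`) and a helper function `rmse()` that is defined here and is not a built-in:

```r
# Hypothetical toy data (NOT the GTD data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit <- lm(y ~ x, data = toy)

# Illustrative helper: root mean square error of a fitted model's residuals
rmse <- function(model) {
  sqrt(mean(residuals(model)^2))
}

print(rmse(fit))
```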

#your code comes here

Plot the fitted values (in green) and the observed values (in blue) to assess the model fit visually.
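
A sketch of one way to overlay fitted and observed values, using base graphics and an invented dataset (`toy` is an illustrative name, not part of the notebook):

```r
# Hypothetical toy data (NOT the GTD data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit <- lm(y ~ x, data = toy)

# Observed values in blue
plot(toy$x, toy$y, col = "blue", pch = 16, xlab = "x", ylab = "y")

# Fitted values in green, added to the existing plot
points(toy$x, fitted(fit), col = "green", pch = 16)

legend("topleft", legend = c("observed", "fitted"),
       col = c("blue", "green"), pch = 16)
```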

#your code comes here

## Multiple regression

Now you might want to use multiple variables in your model:

Build a multiple regression model that models the number of killed victims through the variables suicide and ransom. Include only the two main effects (and leave the intercept in the model).

#your code comes here

Have a look at a potential interaction between these two predictor variables. Use the interaction.plot function to look at the joint effect of these two variables on the number of killed victims.
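
For orientation, `interaction.plot()` takes a factor for the x axis, a factor for the separate lines, and a response, and plots the mean response per combination. The sketch below uses invented data (`toy`, `a`, `b`, `y` are illustrative names).

```r
# Hypothetical toy data (NOT the GTD data): two binary predictors, one outcome
toy <- data.frame(
  a = c(0, 0, 0, 0, 1, 1, 1, 1),
  b = c(0, 0, 1, 1, 0, 0, 1, 1),
  y = c(1, 3, 2, 4, 5, 7, 12, 14)
)

# One line per level of b, mean of y on the vertical axis
interaction.plot(
  x.factor     = factor(toy$a),
  trace.factor = factor(toy$b),
  response     = toy$y,
  fun          = mean,
  xlab         = "a",
  trace.label  = "b",
  ylab         = "mean of y"
)
```

Lines that are roughly parallel suggest little interaction, while lines that diverge or cross suggest that the effect of one variable depends on the other.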

#your code comes here

What does this graph tell you? Can you identify the main effects and (potential) interaction?

Now look at the interaction in a numerical manner.

Hint: ?tapply
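
Passing a list of two grouping variables to `tapply()` returns a table of cell means, which is one way to inspect an interaction numerically. A minimal sketch on invented data (all names are illustrative):

```r
# Hypothetical toy data (NOT the GTD data)
toy <- data.frame(
  a = c(0, 0, 0, 0, 1, 1, 1, 1),
  b = c(0, 0, 1, 1, 0, 0, 1, 1),
  y = c(1, 3, 2, 4, 5, 7, 12, 14)
)

# Mean of y for every combination of a (rows) and b (columns)
cell_means <- tapply(toy$y, list(a = toy$a, b = toy$b), mean)
print(cell_means)

# Effect of b at each level of a: unequal differences hint at an interaction
print(cell_means[, "1"] - cell_means[, "0"])
```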

#your code comes here

Suppose you want to expand the model by adding the interaction term to it. Build that model.
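
In R's formula syntax, `a + b` specifies two main effects, `a:b` is the interaction term alone, and `a * b` is shorthand for `a + b + a:b`. A small sketch on invented data (`toy` and its columns are illustrative names):

```r
# Hypothetical toy data (NOT the GTD data)
toy <- data.frame(
  a = c(0, 0, 0, 0, 1, 1, 1, 1),
  b = c(0, 0, 1, 1, 0, 0, 1, 1),
  y = c(1, 3, 2, 4, 5, 7, 12, 14)
)

# Main effects only
fit_main <- lm(y ~ a + b, data = toy)

# Main effects plus interaction (equivalent to y ~ a + b + a:b)
fit_int <- lm(y ~ a * b, data = toy)

print(coef(fit_main))
print(coef(fit_int))
```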

#your code comes here

Based on the RMSE of each of the two models above (2 main effects vs 2 main effects + 1 interaction), which one do you prefer?
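
When comparing the two models, keep in mind that for least-squares fits on the same data, adding a term cannot increase the in-sample RMSE, so the size of the improvement matters more than its direction. The sketch below illustrates the comparison on invented data, with `rmse()` being a helper defined here rather than a built-in function.

```r
# Hypothetical toy data (NOT the GTD data)
toy <- data.frame(
  a = c(0, 0, 0, 0, 1, 1, 1, 1),
  b = c(0, 0, 1, 1, 0, 0, 1, 1),
  y = c(1, 3, 2, 4, 5, 7, 12, 14)
)

# Illustrative helper: root mean square error of a fitted model's residuals
rmse <- function(model) sqrt(mean(residuals(model)^2))

fit_main <- lm(y ~ a + b, data = toy)  # two main effects
fit_int  <- lm(y ~ a * b, data = toy)  # two main effects + interaction

# Side-by-side comparison of the two fit indices
rmse_values <- c(main = rmse(fit_main), interaction = rmse(fit_int))
print(rmse_values)
```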

#your code comes here

Have a look at the distribution of the nperps and nkill columns. Are there some potential outliers in there?

Hint: ?plot
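
Besides inspecting a plot, one common convention (an assumption here, not a rule prescribed by this notebook) is to flag values lying more than 1.5 times the interquartile range beyond the box of a boxplot. `boxplot.stats()` reports such values in its `out` element. A sketch on an invented vector:

```r
# Hypothetical values (NOT the GTD data) with one obviously extreme entry
x <- c(1, 2, 2, 3, 3, 3, 4, 50)

# Visual check
boxplot(x, main = "Toy values")

# Values beyond the whiskers (default coef = 1.5)
flagged <- boxplot.stats(x)$out
print(flagged)

# Keep only the values that were not flagged
x_clean <- x[!(x %in% flagged)]
print(x_clean)
```

Whether a flagged value should actually be excluded is a substantive decision that depends on the data, not only on the rule.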

#your code comes here

Re-run the best-fitting regression model after excluding the potential outliers. The decision for the “best” model can be made based on the RMSE:

#your code comes here

## Model selection

Now suppose you want to let the model building process be decided by a stepwise model selection procedure.

Build a “null model” that only contains an intercept.

#your code comes here

Build a full model for the number of wounded victims modeled through the number of perpetrators, “suicide” and “ransom”.

#your code comes here

Determine the best fitting model in a backward model selection procedure.

#your code comes here

Run the model selection again but this time using the forward model selection procedure.
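
Both directions can be sketched with `step()` from the stats package, which by default compares candidate models by AIC rather than by RMSE. The example below runs on invented data, and the names `toy`, `x1`, `x2`, `x3` and `noise` are illustrative only.

```r
# Hypothetical toy data (NOT the GTD data)
x1    <- 1:12
x2    <- rep(c(0, 1), times = 6)
x3    <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8)
noise <- c(0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.2, 0.1)
toy   <- data.frame(x1 = x1, x2 = x2, x3 = x3, y = 2 * x1 + 5 * x2 + noise)

null_model <- lm(y ~ 1, data = toy)             # intercept only
full_model <- lm(y ~ x1 + x2 + x3, data = toy)  # all candidate predictors

# Backward: start from the full model and drop terms
backward <- step(full_model, direction = "backward", trace = 0)

# Forward: start from the null model and add terms up to the full model
forward <- step(
  null_model,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  trace     = 0
)

print(formula(backward))
print(formula(forward))
```

Setting `trace = 0` only suppresses the step-by-step log, so leaving it at its default shows which terms were considered at each step.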

#your code comes here

Compare the RMSE of the null model, the full model and the best fitting model (if different from the full model).

Display the fitted values of the null model (green), the observed values (black), and the fitted values of the full model (blue) in one graph.

Hint: ?points

#your code comes here

Did you expect the pattern you see for the null model? See whether you can discover the reason for it in the model output (i.e. the coefficients).