Tutorial 1, Probability, Statistics and Modelling 2, BSc Security and Crime Science, UCL

This tutorial has two goals:

- Doing a quick refresher of PSM1 concepts with R.
- Getting started with the Generalized Linear Model approach in R.

You are expected to work through this R Notebook in the tutorial; we will assist you and, if needed, outline concepts for the whole class.

If you run into programming problems, you may find the guide on How to solve data science problems in R useful. It is targeted at the third Data Science module (Advanced Crime Analysis), but its advice on getting help applies to other R problems too.

First, let’s use a dataset that is close to those that you might encounter in the SCS programme. The dataset was retrieved from Data World and contains details on all mass shootings in the US between 1982 and 2018.

```
#Read the data
us_mass_shootings = read.csv(file = './data/tutorial1/USMassShootings.csv'
, header = T)
#Look at the data
head(us_mass_shootings)
```

You have now created a data.frame called `us_mass_shootings` that you can query nicely with R. You can use the concepts from the R in 12 Steps and Getting ready for R on that data.frame.

In addition, if you feel you need a bit more guidance on how to deal with data.frames, have a look at the 17 Steps to investigate R dataframes tutorial.

`#If you want to run the data.frames tutorial, you can type your code here.`

Now let’s get some basic descriptive statistics: the first step when analysing a dataset is to describe how the data were generated/retrieved (here: using an existing dataset) and to describe the dataset through descriptive statistics.

```
#the mean number of fatalities per shooting
mean(us_mass_shootings$FATALITIES)
```

`## [1] 8.042254`

```
#standard deviation of fatalities
sd(us_mass_shootings$FATALITIES)
```

`## [1] 5.349063`

```
#variance
var(us_mass_shootings$FATALITIES)
```

`## [1] 28.61247`

```
#range
range(us_mass_shootings$FATALITIES)
```

`## [1] 4 33`

```
#frequency counts of the shooters' gender
table(us_mass_shootings$GENDER)
```

```
##
## Female Male
## 2 69
```

```
#frequency counts of gender by race
table(us_mass_shootings$GENDER, us_mass_shootings$RACE)
```

```
##
## Asian Black Latino Middle Eastern Native American Other White
## Female 0 0 0 0 1 0 1
## Male 6 11 4 1 2 1 44
```

**TASK**

- To get a glimpse of your whole dataset at once, you can use the `summary()` and `str()` functions. Use these on the `us_mass_shootings` dataset:

`#type+run your code here`

- Calculate the frequency counts of the shooter’s “race”:

`#type+run your code here`

- Retrieve the frequency table of the number of weapons used by the shooter’s gender.

`#type+run your code here`

- Suppose you’re interested in going a little deeper and are interested not only in the gender and race but also in the year of the shooting: retrieve the frequency table of gender by race and year.

Tip: you can arrange the variables in desired order to change the display (not the data).

In which year did females commit mass shootings and of what race were they?

`#type+run your code here`
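As a sketch of the tip above — using invented toy data here, not the shooting dataset — the order of the arguments to `table()` controls how the three-way table is printed (one sub-table per level of the last variable), while the counts themselves stay the same:

```
#toy data (invented for illustration)
gender <- c("Female", "Male", "Male", "Female")
race   <- c("White", "Black", "White", "Black")
year   <- c(2001, 2001, 2002, 2002)

#year last: prints one gender-by-race sub-table per year
table(gender, race, year)

#race last: prints one year-by-gender sub-table per race
table(year, gender, race)
```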

**Advanced task**

Just as you can ‘subset’ frequency counts by additional variables, you can also calculate descriptive statistics “by” another one. Suppose you wanted to calculate the mean number of wounded victims by the gender of the shooter:

```
#we'd use the tapply function
?tapply
tapply(X = us_mass_shootings$WOUNDED
, INDEX = us_mass_shootings$GENDER
, FUN = mean
)
```

```
## Female Male
## 1.000000 7.405797
```

What is the mean (and standard deviation) number of fatalities for those who had prior mental health issues vs those who didn’t have prior mental health issues?

`#type+run your code here`

When you have two categorical variables (e.g. scored as yes/no, ill/healthy, offended/not offended, …) you might be interested in the association of these variables. In PSM1 you have covered the Chi-Square test of independence of two variables. In R, you’d run the test as follows:

```
#let's look at the association between whether the weapon was obtained legally and whether there were signs of prior mental illness
table(us_mass_shootings$WEAPONSOBTAINEDLEGALLY, us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))
```

```
## PRIORSIGNSOFMENTALILLNESS
## WEAPONSOBTAINEDLEGALLY No Yes
## 2 0
## No 5 8
## Yes 15 41
```

You notice that there are two cases where there is no label for the WEAPONSOBTAINEDLEGALLY variable. We can assess this further:

`levels(us_mass_shootings$WEAPONSOBTAINEDLEGALLY)`

`## [1] "" "No" "Yes"`

We see that there are three categorical levels in the data here. Since we do not want to impose any speculative values on the data, we exclude these cases:

```
us_mass_shootings_clean = us_mass_shootings[us_mass_shootings$WEAPONSOBTAINEDLEGALLY == 'Yes' | us_mass_shootings$WEAPONSOBTAINEDLEGALLY == 'No', ]
us_mass_shootings_clean = droplevels(us_mass_shootings_clean)
#check again:
levels(us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY)
```

`## [1] "No" "Yes"`

```
#and:
table(us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY, us_mass_shootings_clean$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))
```

```
## PRIORSIGNSOFMENTALILLNESS
## WEAPONSOBTAINEDLEGALLY No Yes
## No 5 8
## Yes 15 41
```

Now, to test for an association, we run the `chisq.test(...)` function **on the table** of frequency counts:

```
freq_table = table(us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY, us_mass_shootings_clean$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))
chisq.test(freq_table)
```

```
## Warning in chisq.test(freq_table): Chi-squared approximation may be
## incorrect
```

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_table
## X-squared = 0.24665, df = 1, p-value = 0.6194
```

*Note that this is an illustration only here and that you need to check the assumptions (e.g. a minimum count per cell) first.*

**TASK**

Assess with a 99% confidence interval whether there’s an association between how the shooter obtained the weapon and gender.

`#type+run your code here`

From your Chi-square test, retrieve the expected counts under the null hypothesis.

Hint: `?chisq.test`.

`#type+run your code here`
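As a hint of what to look for — using invented counts here, not the shooting data — `chisq.test()` returns a list, and its `$expected` element holds the cell counts expected under the null hypothesis of independence:

```
#invented 2x2 frequency table for illustration
toy_table <- matrix(c(20, 30, 25, 25), nrow = 2)
toy_test <- chisq.test(toy_table)
#expected counts under H0: row total * column total / grand total
toy_test$expected
```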

Another aspect covered in the PSM1 module was hypothesis testing with t-tests. R has most statistical functions available by default and you can run a t-test using the `t.test(...)` function:

```
#t-test for differences in fatalities between shooters with and without prior mental health problems
t.test(us_mass_shootings$FATALITIES ~ us_mass_shootings$PRIORSIGNSOFMENTALILLNESS)
```

```
##
## Welch Two Sample t-test
##
## data: us_mass_shootings$FATALITIES by us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
## t = -1.5593, df = 53.281, p-value = 0.1248
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.3560172 0.5452565
## sample estimates:
## mean in group No mean in group Yes
## 6.727273 8.632653
```

Note that there are additional arguments in the `t.test(...)` function (e.g. for running a paired-samples t-test):

```
#have a look at the help file to see all arguments
?t.test
```

**TASK**

Run a t-test to assess if there’s a significant difference between the number of wounded depending on whether the weapon was obtained legally or illegally.

`#type+run your code here`

Now run that test with a 99% confidence interval and equal variances assumed. Check `?t.test` for the specification of these parameters.

`#type+run your code here`
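To sketch those two parameters with invented toy data (not the shooting dataset): `conf.level` sets the width of the confidence interval, and `var.equal = TRUE` switches from Welch's t-test to the classic Student's t-test:

```
#invented toy samples
x <- c(5, 7, 8, 6, 9)
y <- c(4, 6, 5, 7, 5)
#99% CI, equal variances assumed
toy_t <- t.test(x, y, conf.level = 0.99, var.equal = TRUE)
toy_t$conf.int  #the 99% confidence interval
```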

Finally, sometimes you are interested in looking at the correlation between two continuous variables. In R, there are three ways to do this:

```
#just the correlation
cor(us_mass_shootings$FATALITIES, us_mass_shootings$WOUNDED)
```

`## [1] 0.331917`

```
#the significance testing of the correlation
cor.test(us_mass_shootings$FATALITIES, us_mass_shootings$WOUNDED)
```

```
##
## Pearson's product-moment correlation
##
## data: us_mass_shootings$FATALITIES and us_mass_shootings$WOUNDED
## t = 2.9228, df = 69, p-value = 0.004687
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1068906 0.5245972
## sample estimates:
## cor
## 0.331917
```

```
#plotting the data in a scatterplot
plot(us_mass_shootings$FATALITIES, us_mass_shootings$WOUNDED)
```

As with all R functions, there are several parameters that allow you to adjust the function (e.g. for special cases of correlation):

`?cor.test`
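For instance — with invented toy data — you can change the confidence level or switch to a rank-based (Spearman) correlation:

```
#invented toy data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)
#Pearson correlation with a 99% confidence interval
toy_cor <- cor.test(x, y, conf.level = 0.99)
toy_cor$conf.int
#Spearman's rank correlation (no CI is produced)
cor.test(x, y, method = "spearman")$estimate
```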

**TASK**

What is the 99% confidence interval of the correlation between the number of weapons used and the total number of victims?

`#type+run your code here`

The generalized linear model will form the basis of the first five weeks of this module.

The GLM is a family of models that extend (= generalize) the linear regression model to allow for more flexibility in the response variable (= dependent variable).

In the simplest sense, a linear regression model aims to predict a continuous outcome (e.g. income) through a predictor variable (e.g. age or gender). While the dependent variable (also called: outcome variable, response variable, target variable) is strictly continuous, the predictors can take both continuous (age) and categorical (e.g. female vs male) form.

The ingredients of a regression model:

- The dependent variable `Y`
- The predictor variable `X`
- The intercept `a` (= the value of `Y` if `X` is `0`)
- The slope `b` (= the change in `Y` for every unit change in `X`)
- The error term `E` (= the difference between the predicted value and the observed value)

*We will cover each of the above in detail in the lecture*

The model can be written down as: `Y = a + b*X + E`

Read this model as: `Y` is explained by the intercept `a`, plus the predictor `X` multiplied by the slope `b`, plus the error term `E`.

Suppose you are predicting the income of a person through their age:

`income = a + b*age + E`

Once fitted to the data, the regression model could look like this:

`income = 10000 + 2500*age`

Note that the error term `E` is not explicitly mentioned here but can be inferred (see below).

This model allows you to replace `age` with the actual age of someone from your sample to estimate their income:

`income = 10000 + 2500*50 = 10000 + 125000 = 135000`
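You can check this arithmetic directly in the R console:

```
#predicted income for a 50-year-old, given a = 10000 and b = 2500
10000 + 2500 * 50
```

`## [1] 135000`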

You can use R to calculate the model parameters (intercept `a` and slope `b`) by specifying only the model formula:

`income ~ age`

Read this as: income explained through age. Note that the `=` sign is replaced by a tilde `~`. R uses this notation to build models.
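As a quick sketch with simulated data (invented here, not part of the tutorial dataset), `lm()` can recover known parameters from the formula notation alone:

```
#simulate incomes from the model income = 10000 + 2500*age + noise
set.seed(1)
age <- 20:60
income <- 10000 + 2500 * age + rnorm(length(age), sd = 1000)
#fit the model via the formula notation
toy_income_model <- lm(income ~ age)
coef(toy_income_model)  #estimates close to a = 10000 and b = 2500
```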

**Example with mass shootings**

Suppose you wanted to run a regression model on the shooting data. Specifically, suppose you wanted to predict the number of victims through the number of weapons used.

Conceptually, you’d want this model:

`victims ~ number_of_weapons`

We can easily take this intuitive notion to R:

```
my_model = lm(formula = TOTALVICTIMS ~ NUMWEAPONS
, data = us_mass_shootings)
```

Now R has fitted the model you specified on the data. You can inspect that model:

```
#the full model and its parameters
my_model
```

```
##
## Call:
## lm(formula = TOTALVICTIMS ~ NUMWEAPONS, data = us_mass_shootings)
##
## Coefficients:
## (Intercept) NUMWEAPONS
## 10.683 2.087
```

Based on these values, you can construct the model equation:

`victims = 10.683 + 2.087*number_of_weapons`

Thus, if the shooter used 3 weapons, your model would predict almost 17 victims:

`10.683 + 2.087*3`

`## [1] 16.944`

The model has other information available too. The most important ones are:

- the residuals: these are the estimation errors for each case (i.e. the error term `E`, defined as: `outcome variable - fitted value`).
- the fitted values: these are the values predicted by the model.
- model statistics: here you can assess whether a predictor was statistically significant or not and how well the model fits the data.

```
#residuals
my_model$residuals
```

```
## 1 2 3 4 5 6
## -9.9426853 -2.7696798 -6.7696798 -8.8561826 7.2303202 -5.7696798
## 7 8 9 10 11 12
## -5.8561826 -9.8561826 -5.7696798 10.9708120 -4.7696798 -2.7696798
## 13 14 15 16 17 18
## 50.9708120 -7.8561826 -2.7696798 -7.7696798 -7.9426853 -4.9426853
## 19 20 21 22 23 24
## 6.2303202 -3.8561826 -7.7696798 30.2303202 3.1438174 -3.8561826
## 25 26 27 28 29 30
## -5.7696798 7.9708120 -6.8561826 0.2303202 -5.7696798 41.1438174
## 31 32 33 34 35 36
## -4.8561826 -5.9426853 -10.0291880 -4.7696798 -1.9426853 -1.7696798
## 37 38 39 40 41 42
## -0.7696798 -6.1156908 -10.0291880 -9.9426853 -6.8561826 -5.7696798
## 43 44 45 46 47 48
## 0.1438174 2.9708120 19.9708120 12.0573147 -14.4617018 -6.7696798
## 49 50 51 52 53 54
## -5.7696798 -5.7696798 -7.8561826 -8.8561826 15.2303202 -7.7696798
## 55 56 57 58 59 60
## 12.2303202 -4.9426853 -1.9426853 -7.7696798 -0.8561826 -2.7696798
## 61 62 63 64 65 66
## -5.7696798 29.1438174 -0.8561826 -0.1156908 20.1438174 -14.2886963
## 67 68 69 70 71
## 3.0573147 4.0573147 24.0573147 -5.7696798 -1.7696798
```

The residuals show you that, for example, for the first observation, the model overestimated the number of victims by 9.94.

```
#fitted values
my_model$fitted.values
```

```
## 1 2 3 4 5 6 7 8
## 16.94269 12.76968 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
## 9 10 11 12 13 14 15 16
## 12.76968 19.02919 12.76968 12.76968 19.02919 14.85618 12.76968 12.76968
## 17 18 19 20 21 22 23 24
## 16.94269 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
## 25 26 27 28 29 30 31 32
## 12.76968 19.02919 14.85618 12.76968 12.76968 14.85618 14.85618 16.94269
## 33 34 35 36 37 38 39 40
## 19.02919 12.76968 16.94269 12.76968 12.76968 21.11569 19.02919 16.94269
## 41 42 43 44 45 46 47 48
## 14.85618 12.76968 14.85618 19.02919 19.02919 16.94269 29.46170 12.76968
## 49 50 51 52 53 54 55 56
## 12.76968 12.76968 14.85618 14.85618 12.76968 12.76968 12.76968 16.94269
## 57 58 59 60 61 62 63 64
## 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618 21.11569
## 65 66 67 68 69 70 71
## 14.85618 25.28870 16.94269 16.94269 16.94269 12.76968 12.76968
```

Here you see that the fitted value for the first observation was 16.94, so we can infer that the observed (actual) value for the first observation must be: `fitted value + residual`

`my_model$fitted.values[1] + my_model$residuals[1]`

```
## 1
## 7
```

```
#checked against the actual data
my_model$model[1, ]
```

If we want to get information on the significance of predictors and the fit of the model, we’ll look at the `summary()` of `my_model`:

```
#model statistics
summary(my_model)
```

```
##
## Call:
## lm(formula = TOTALVICTIMS ~ NUMWEAPONS, data = us_mass_shootings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.462 -6.813 -4.770 1.601 50.971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6832 2.6074 4.097 0.000112 ***
## NUMWEAPONS 2.0865 0.9806 2.128 0.036925 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 69 degrees of freedom
## Multiple R-squared: 0.06158, Adjusted R-squared: 0.04798
## F-statistic: 4.528 on 1 and 69 DF, p-value: 0.03692
```

You can see:

- that both the intercept and the predictor number_of_weapons are significant at the *p* < .05 level
- the parameter estimates (for the intercept and the slope)
- the R-squared statistic: this expresses the proportion of the variance in the outcome variable `Y` that is explained by the model.

In our case, we see that the model only explained 6.158% of the variance in the number of victims.
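You can verify what R-squared means with a small invented example: it is one minus the ratio of the residual sum of squares to the total sum of squares of the outcome:

```
#invented toy data
set.seed(2)
x <- 1:30
y <- 3 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)
#R-squared by hand: 1 - SS_residual / SS_total
r2_manual <- 1 - sum(fit$residuals^2) / sum((y - mean(y))^2)
r2_manual
summary(fit)$r.squared  #identical
```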

**Multiple predictors**

Of course, a single predictor rarely explains a lot of variation in an outcome variable. So we might want to include a new variable (e.g. the categorical variable indicating whether there were signs of prior mental health issues or not). The conceptual model would then be:

`victims ~ number_of_weapons + mental_health`

In R:

```
my_model_2 = lm(formula = TOTALVICTIMS ~ NUMWEAPONS + PRIORSIGNSOFMENTALILLNESS
, data = us_mass_shootings)
summary(my_model_2)
```

```
##
## Call:
## lm(formula = TOTALVICTIMS ~ NUMWEAPONS + PRIORSIGNSOFMENTALILLNESS,
## data = us_mass_shootings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.766 -7.474 -2.816 1.966 48.709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7765 3.3592 1.720 0.0901 .
## NUMWEAPONS 2.1583 0.9541 2.262 0.0269 *
## PRIORSIGNSOFMENTALILLNESSYes 6.8810 3.0900 2.227 0.0293 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.03 on 68 degrees of freedom
## Multiple R-squared: 0.1254, Adjusted R-squared: 0.09964
## F-statistic: 4.873 on 2 and 68 DF, p-value: 0.01052
```

You can see that the model fit improved considerably (from 6.158% to 12.54%) and that the new model equation is:

`victims = 5.7765 + 2.1583*number_of_weapons + 6.881*mental_health`

Note that we now have two predictors which implies that we also have two slopes (one for each predictor).

You can see that the model summary has appended the word “Yes” to the `PRIORSIGNSOFMENTALILLNESS` predictor. This reflects that the slope applies when the variable `PRIORSIGNSOFMENTALILLNESS` is equal to `Yes`. In other words, the predicted number of victims is 6.8810 higher if there were signs of mental health issues than if there were none.

When there are multiple predictors, the simple regression model becomes a *multiple regression model*.
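Under the hood — sketched here with an invented factor — R converts a two-level categorical predictor into a 0/1 “dummy” column, which is why a single slope suffices for it:

```
#invented two-level factor
mental_health <- factor(c("No", "Yes", "No", "Yes"))
#the design matrix R builds internally: "Yes" becomes 1, "No" becomes 0
model.matrix(~ mental_health)
```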

**Task:**

Build your own model that predicts the number of wounded victims based on the shooter’s gender and whether the weapons were obtained legally.

`#type+run your code here`

Now that we covered some basics of linear regression (more in the lecture and homework), we can see why the generalized linear model is so useful. Essentially, it is a means to unify similar models in one.

Suppose you’d want to predict a categorical variable (e.g. “predict” whether the weapon was obtained legally based on the shooter’s gender): the problem is that this conflicts with the base linear regression model which assumes continuous data.

The GLM can handle different outcome variables through so-called “link” functions. A link function links the outcome variable to the linear predictor. An overview of link functions in R can be found here. This unification allows you to build the model in the same way, changing only the link function if the outcome variable is of a different type. It also means that a linear combination of predictors can be used regardless of the nature of the response variable.
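Each `family` object in R bundles a default link function, which you can inspect directly:

```
#default link function of each family
gaussian()$link  #"identity"
binomial()$link  #"logit"
poisson()$link   #"log"
```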

The modelling process of the GLM in R is very similar to the linear regression example with the addition of the link function:

```
my_glm = glm(formula = TOTALVICTIMS ~ NUMWEAPONS
, data = us_mass_shootings
, family = gaussian)
```

This fits the same model as the one stored in `my_model` above, but this time with the “Gaussian” family explicitly specified. You can see that the model is identical to the one obtained with `lm(...)` above. This is because a GLM with the Gaussian family (and its identity link) is equivalent to the linear model fitted with `lm(...)`. More in-depth relationships are out of the scope of this module but this tutorial gives a nice starting point.

`my_glm`

```
##
## Call: glm(formula = TOTALVICTIMS ~ NUMWEAPONS, family = gaussian, data = us_mass_shootings)
##
## Coefficients:
## (Intercept) NUMWEAPONS
## 10.683 2.087
##
## Degrees of Freedom: 70 Total (i.e. Null); 69 Residual
## Null Deviance: 11260
## Residual Deviance: 10560 AIC: 562.7
```
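You can confirm this equivalence on invented toy data: the coefficients from `lm()` and from `glm(..., family = gaussian)` match exactly:

```
#invented toy data
set.seed(3)
x <- 1:20
y <- 2 + 0.7 * x + rnorm(20)
coef_lm  <- coef(lm(y ~ x))
coef_glm <- coef(glm(y ~ x, family = gaussian))
all.equal(coef_lm, coef_glm)  #TRUE
```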

Similar to models built with the `lm()` function, `glm()` offers additional model information:

```
#residuals
my_glm$residuals
```

```
## 1 2 3 4 5 6
## -9.9426853 -2.7696798 -6.7696798 -8.8561826 7.2303202 -5.7696798
## 7 8 9 10 11 12
## -5.8561826 -9.8561826 -5.7696798 10.9708120 -4.7696798 -2.7696798
## 13 14 15 16 17 18
## 50.9708120 -7.8561826 -2.7696798 -7.7696798 -7.9426853 -4.9426853
## 19 20 21 22 23 24
## 6.2303202 -3.8561826 -7.7696798 30.2303202 3.1438174 -3.8561826
## 25 26 27 28 29 30
## -5.7696798 7.9708120 -6.8561826 0.2303202 -5.7696798 41.1438174
## 31 32 33 34 35 36
## -4.8561826 -5.9426853 -10.0291880 -4.7696798 -1.9426853 -1.7696798
## 37 38 39 40 41 42
## -0.7696798 -6.1156908 -10.0291880 -9.9426853 -6.8561826 -5.7696798
## 43 44 45 46 47 48
## 0.1438174 2.9708120 19.9708120 12.0573147 -14.4617018 -6.7696798
## 49 50 51 52 53 54
## -5.7696798 -5.7696798 -7.8561826 -8.8561826 15.2303202 -7.7696798
## 55 56 57 58 59 60
## 12.2303202 -4.9426853 -1.9426853 -7.7696798 -0.8561826 -2.7696798
## 61 62 63 64 65 66
## -5.7696798 29.1438174 -0.8561826 -0.1156908 20.1438174 -14.2886963
## 67 68 69 70 71
## 3.0573147 4.0573147 24.0573147 -5.7696798 -1.7696798
```

```
#fitted values
my_glm$fitted.values
```

```
## 1 2 3 4 5 6 7 8
## 16.94269 12.76968 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
## 9 10 11 12 13 14 15 16
## 12.76968 19.02919 12.76968 12.76968 19.02919 14.85618 12.76968 12.76968
## 17 18 19 20 21 22 23 24
## 16.94269 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
## 25 26 27 28 29 30 31 32
## 12.76968 19.02919 14.85618 12.76968 12.76968 14.85618 14.85618 16.94269
## 33 34 35 36 37 38 39 40
## 19.02919 12.76968 16.94269 12.76968 12.76968 21.11569 19.02919 16.94269
## 41 42 43 44 45 46 47 48
## 14.85618 12.76968 14.85618 19.02919 19.02919 16.94269 29.46170 12.76968
## 49 50 51 52 53 54 55 56
## 12.76968 12.76968 14.85618 14.85618 12.76968 12.76968 12.76968 16.94269
## 57 58 59 60 61 62 63 64
## 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618 21.11569
## 65 66 67 68 69 70 71
## 14.85618 25.28870 16.94269 16.94269 16.94269 12.76968 12.76968
```

```
#model statistics
summary(my_glm)
```

```
##
## Call:
## glm(formula = TOTALVICTIMS ~ NUMWEAPONS, family = gaussian, data = us_mass_shootings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -14.462 -6.813 -4.770 1.601 50.971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6832 2.6074 4.097 0.000112 ***
## NUMWEAPONS 2.0865 0.9806 2.128 0.036925 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 153.1111)
##
## Null deviance: 11258 on 70 degrees of freedom
## Residual deviance: 10565 on 69 degrees of freedom
## AIC: 562.67
##
## Number of Fisher Scoring iterations: 2
```

**Using the GLM for different outcome variables**

We can extend the GLM to cases where the outcome variable is binary (e.g. coded as 0 or 1) or represents counts (e.g. counts of victims). For the latter, for example, a Poisson regression might be more apt than a Gaussian linear regression. We will cover these extensions in the next lecture and tutorial.
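As a preview with simulated count data (invented here, not part of the tutorial dataset), a Poisson regression only changes the `family` argument of `glm()`:

```
#simulate counts whose log-mean rises with the number of weapons
set.seed(4)
weapons <- rep(1:4, each = 10)
victims <- rpois(40, lambda = exp(0.5 + 0.3 * weapons))
#Poisson regression: same glm() call, different family
pois_fit <- glm(victims ~ weapons, family = poisson)
coef(pois_fit)  #coefficients are on the log scale
```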