Tutorial 1, Probability, Statistics and Modelling 2, BSc Security and Crime Science, UCL

This tutorial has two goals:

- Doing a quick refresher of PSM1 concepts with R.
- Getting started with the Generalized Linear Model approach in R.

You are expected to work through this R Notebook in the tutorial; we will assist you and, if needed, outline concepts for the whole class.

If you run into programming problems, you may find the guide on How to solve data science problems in R useful. It is targeted at the third Data Science module (Advanced Crime Analysis), but its advice on getting help applies to other R problems too.

First, let’s use a dataset that is close to those that you might encounter in the SCS programme. The dataset was retrieved from Data World and contains details on all mass shootings in the US between 1982 and 2018.

```
#Read the data
us_mass_shootings = read.csv(file = './data/tutorial1/USMassShootings.csv'
, header = T)
#Look at the data
head(us_mass_shootings)
```

You have now created a data.frame called `us_mass_shootings` that you can query nicely with R. You can use the concepts from the R in 12 Steps and Getting ready for R on that data.frame.

In addition, if you feel you need a bit more guidance on how to deal with data.frames, have a look at the 17 Steps to investigate R dataframes tutorial.

`#If you want to run the data.frames tutorial, you can type your code here.`

Now let’s get some basic descriptive statistics: the first step when analysing a dataset is to describe how the data were generated/retrieved (here: using an existing dataset) and to describe the dataset through descriptive statistics.

```
#the mean number of fatalities per shooting
mean(us_mass_shootings$FATALITIES)
```

`## [1] 8.042254`

```
#standard deviation of fatalities
sd(us_mass_shootings$FATALITIES)
```

`## [1] 5.349063`

```
#variance
var(us_mass_shootings$FATALITIES)
```

`## [1] 28.61247`

```
#range
range(us_mass_shootings$FATALITIES)
```

`## [1] 4 33`

```
#frequency counts of the shooters' gender
table(us_mass_shootings$GENDER)
```

```
##
## Female Male
## 2 69
```

```
#frequency counts of gender by race
table(us_mass_shootings$GENDER, us_mass_shootings$RACE)
```

```
##
## Asian Black Latino Middle Eastern Native American Other White
## Female 0 0 0 0 1 0 1
## Male 6 11 4 1 2 1 44
```

**TASK**

- To get a glimpse of your whole dataset at once, you can use the `summary()` and `str()` functions. Use these on the `us_mass_shootings` dataset:

`#type+run your code here`

- Calculate the frequency counts of the shooter’s “race”:

`#type+run your code here`

- Retrieve the frequency table of the number of weapons used by the shooter’s gender.

`#type+run your code here`

- Suppose you’re interested in going a little deeper and are interested not only in the gender and race but also in the year of the shooting: retrieve the frequency table of gender by race and year.

Tip: you can arrange the variables in desired order to change the display (not the data).

In which year did females commit mass shootings and of what race were they?

`#type+run your code here`
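As a sketch of the tip above — using invented toy data here, not the shooting dataset — the order of the arguments to `table()` controls how the three-way table is printed (one sub-table per level of the last variable), while the counts themselves stay the same:

```
#toy data (invented for illustration)
gender <- c("Female", "Male", "Male", "Female")
race   <- c("White", "Black", "White", "Black")
year   <- c(2001, 2001, 2002, 2002)

#year last: prints one gender-by-race sub-table per year
table(gender, race, year)

#race last: prints one year-by-gender sub-table per race
table(year, gender, race)
```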

**Advanced task**

Just as you can ‘subset’ frequency counts by additional variables, you can also calculate descriptive statistics “by” another one. Suppose you wanted to calculate the mean number of wounded victims by the gender of the shooter:

```
#we'd use the tapply function
?tapply
tapply(X = us_mass_shootings$WOUNDED
, INDEX = us_mass_shootings$GENDER
, FUN = mean
)
```

```
## Female Male
## 1.000000 7.405797
```

What is the mean (and standard deviation) number of fatalities for those who had prior mental health issues vs those who didn’t have prior mental health issues?

`#type+run your code here`

When you have two categorical variables (e.g. scored as yes/no, ill/healthy, offended/not offended, …) you might be interested in the association of these variables. In PSM1 you have covered the Chi-Square test of independence of two variables. In R, you’d run the test as follows:

```
#let's look at the association between whether the weapon was obtained legally and whether there were signs of prior mental illness
table(us_mass_shootings$WEAPONSOBTAINEDLEGALLY, us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))
```

```
## PRIORSIGNSOFMENTALILLNESS
## WEAPONSOBTAINEDLEGALLY No Yes
## 2 0
## No 5 8
## Yes 15 41
```

You notice that there are two cases where there is no label for the WEAPONSOBTAINEDLEGALLY variable. We can assess this further:

`levels(us_mass_shootings$WEAPONSOBTAINEDLEGALLY)`

`## [1] "" "No" "Yes"`

We see that there are three categorical levels in the data here. Since we do not want to impose any speculative values on the data, we exclude these cases:

```
us_mass_shootings_clean = us_mass_shootings[us_mass_shootings$WEAPONSOBTAINEDLEGALLY == 'Yes' | us_mass_shootings$WEAPONSOBTAINEDLEGALLY == 'No', ]
us_mass_shootings_clean = droplevels(us_mass_shootings_clean)
#check again:
levels(us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY)
```

`## [1] "No" "Yes"`

```
#and:
table(us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY, us_mass_shootings_clean$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))
```

```
## PRIORSIGNSOFMENTALILLNESS
## WEAPONSOBTAINEDLEGALLY No Yes
## No 5 8
## Yes 15 41
```

Now, to test for an association, we run the `chisq.test(...)` function **on the table** of frequency counts:

```
freq_table = table(us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY, us_mass_shootings_clean$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))
chisq.test(freq_table)
```

```
## Warning in chisq.test(freq_table): Chi-squared approximation may be
## incorrect
```

```
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_table
## X-squared = 0.24665, df = 1, p-value = 0.6194
```

*Note that this is an illustration only here and that you need to check the assumptions (e.g. a minimum count per cell) first.*

**TASK**

Assess with a 99% confidence interval whether there’s an association between how the shooter obtained the weapon and gender.

`#type+run your code here`

From your Chi-square test, retrieve the expected counts under the null hypothesis.

Hint: `?chisq.test`.

`#type+run your code here`
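As a hint of what to look for — using invented counts here, not the shooting data — `chisq.test()` returns a list, and its `$expected` element holds the cell counts expected under the null hypothesis of independence:

```
#invented 2x2 frequency table for illustration
toy_table <- matrix(c(20, 30, 25, 25), nrow = 2)
toy_test <- chisq.test(toy_table)
#expected counts under H0: row total * column total / grand total
toy_test$expected
```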

Another aspect covered in the PSM1 module was hypothesis testing with t-tests. R has most statistical functions available by default and you can run a t-test using the `t.test(...)` function:

```
#t-test for differences in fatalities between shooters with and without prior mental health problems
t.test(us_mass_shootings$FATALITIES ~ us_mass_shootings$PRIORSIGNSOFMENTALILLNESS)
```

```
##
## Welch Two Sample t-test
##
## data: us_mass_shootings$FATALITIES by us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
## t = -1.5593, df = 53.281, p-value = 0.1248
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.3560172 0.5452565
## sample estimates:
## mean in group No mean in group Yes
## 6.727273 8.632653
```

Note that there are additional arguments in the `t.test(...)` function (e.g. for running a paired-samples t-test):

```
#have a look at the help file to see all arguments
?t.test
```

**TASK**

Run a t-test to assess if there’s a significant difference between the number of wounded depending on whether the weapon was obtained legally or illegally.

`#type+run your code here`

Now run that test with a 99% confidence interval and equal variances assumed. Check `?t.test` for the specification of these parameters.

`#type+run your code here`
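To sketch those two parameters with invented toy data (not the shooting dataset): `conf.level` sets the width of the confidence interval, and `var.equal = TRUE` switches from Welch's t-test to the classic Student's t-test:

```
#invented toy samples
x <- c(5, 7, 8, 6, 9)
y <- c(4, 6, 5, 7, 5)
#99% CI, equal variances assumed
toy_t <- t.test(x, y, conf.level = 0.99, var.equal = TRUE)
toy_t$conf.int  #the 99% confidence interval
```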

Finally, sometimes you are interested in looking at the correlation between two continuous variables. In R, there are three ways to do this:

```
#just the correlation
cor(us_mass_shootings$FATALITIES, us_mass_shootings$WOUNDED)
```

`## [1] 0.331917`

```
#the significance testing of the correlation
cor.test(us_mass_shootings$FATALITIES, us_mass_shootings$WOUNDED)
```

```
##
## Pearson's product-moment correlation
##
## data: us_mass_shootings$FATALITIES and us_mass_shootings$WOUNDED
## t = 2.9228, df = 69, p-value = 0.004687
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1068906 0.5245972
## sample estimates:
## cor
## 0.331917
```

```
#plotting the data in a scatterplot
plot(us_mass_shootings$FATALITIES, us_mass_shootings$WOUNDED)
```

As with all R functions, there are several parameters that allow you to adjust the function (e.g. for special cases of correlation):

`?cor.test`
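For instance — with invented toy data — you can change the confidence level or switch to a rank-based (Spearman) correlation:

```
#invented toy data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)
#Pearson correlation with a 99% confidence interval
toy_cor <- cor.test(x, y, conf.level = 0.99)
toy_cor$conf.int
#Spearman's rank correlation (no CI is produced)
cor.test(x, y, method = "spearman")$estimate
```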

**TASK**

What is the 99% confidence interval of the correlation between the number of weapons used and the total number of victims?

`#type+run your code here`

The generalized linear model will form the basis of the first five weeks of this module.

The GLM is a family of models that extend (= generalize) the linear regression model to allow for more flexibility in the response variable (= dependent variable).

In the simplest sense, a linear regression model aims to predict a continuous outcome (e.g. income) through a predictor variable (e.g. age or gender). While the dependent variable (also called: outcome variable, response variable, target variable) is strictly continuous, the predictors can take both continuous (age) and categorical (e.g. female vs male) form.

The ingredients of a regression model:

- The dependent variable `Y`
- The predictor variable `X`
- The intercept `a` (= the value of `Y` if `X` is `0`)
- The slope `b` (= the change in `Y` for every unit change in `X`)
- The error term `E` (= the difference between the predicted value and the observed value)

*We will cover each of the above in detail in the lecture*

The model can be written down as: `Y = a + b*X + E`

Read this model as: `Y` is explained by the intercept `a`, plus the predictor `X` multiplied by the slope `b`, plus the error term `E`.

Suppose you are predicting the income of a person through their age:

`income = a + b*age + E`

Once fitted to the data, the regression model could look like this:

`income = 10000 + 2500*age`

Note that the error term `E` is not explicitly mentioned here but can be inferred (see below).

This model allows you to replace `age` with the actual age of someone from your sample to estimate their income:

`income = 10000 + 2500*50 = 10000 + 125000 = 135000`
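You can check this arithmetic directly in the R console:

```
#predicted income for a 50-year-old, given a = 10000 and b = 2500
10000 + 2500 * 50
```

`## [1] 135000`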

You can use R to calculate the model parameters (intercept `a` and slope `b`) by specifying only the model formula:

`income ~ age`

Read this as: income explained through age. Note that the `=` sign is replaced by a tilde `~`. R uses this notation to build models.
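As a quick sketch with simulated data (invented here, not part of the tutorial dataset), `lm()` can recover known parameters from the formula notation alone:

```
#simulate incomes from the model income = 10000 + 2500*age + noise
set.seed(1)
age <- 20:60
income <- 10000 + 2500 * age + rnorm(length(age), sd = 1000)
#fit the model via the formula notation
toy_income_model <- lm(income ~ age)
coef(toy_income_model)  #estimates close to a = 10000 and b = 2500
```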

**Example with mass shootings**

Suppose you wanted to run a regression model on the shooting data. Specifically, suppose you wanted to predict the number of victims through the number of weapons used.

Conceptually, you’d want this model:

`victims ~ number_of_weapons`

We can easily take this intuitive notion to R:

```
my_model = lm(formula = TOTALVICTIMS ~ NUMWEAPONS
, data = us_mass_shootings)
```

Now R has fitted the model you specified on the data. You can inspect that model:

```
#the full model and its parameters
my_model
```

```
##
## Call:
## lm(formula = TOTALVICTIMS ~ NUMWEAPONS, data = us_mass_shootings)
##
## Coefficients:
## (Intercept) NUMWEAPONS
## 10.683 2.087
```

Based on these values, you can construct the model equation:

`victims = 10.683 + 2.087*number_of_weapons`

Thus, if the shooter used 3 weapons, your model would predict almost 17 victims:

`10.683 + 2.087*3`

`## [1] 16.944`

The model has other information available too. The most important ones are:

- the residuals: these are the estimation errors for each case (i.e. the error term `E`, defined as: `outcome variable - fitted value`).
- the fitted values: these are the values predicted by the model.
- model statistics: here you can assess whether a predictor was statistically significant or not and how well the model fits the data.

```
#residuals
my_model$residuals
```

```
## 1 2 3 4 5 6
## -9.9426853 -2.7696798 -6.7696798 -8.8561826 7.2303202 -5.7696798
## 7 8 9 10 11 12
## -5.8561826 -9.8561826 -5.7696798 10.9708120 -4.7696798 -2.7696798
## 13 14 15 16 17 18
## 50.9708120 -7.8561826 -2.7696798 -7.7696798 -7.9426853 -4.9426853
## 19 20 21 22 23 24
## 6.2303202 -3.8561826 -7.7696798 30.2303202 3.1438174 -3.8561826
## 25 26 27 28 29 30
## -5.7696798 7.9708120 -6.8561826 0.2303202 -5.7696798 41.1438174
## 31 32 33 34 35 36
## -4.8561826 -5.9426853 -10.0291880 -4.7696798 -1.9426853 -1.7696798
## 37 38 39 40 41 42
## -0.7696798 -6.1156908 -10.0291880 -9.9426853 -6.8561826 -5.7696798
## 43 44 45 46 47 48
## 0.1438174 2.9708120 19.9708120 12.0573147 -14.4617018 -6.7696798
## 49 50 51 52 53 54
## -5.7696798 -5.7696798 -7.8561826 -8.8561826 15.2303202 -7.7696798
## 55 56 57 58 59 60
## 12.2303202 -4.9426853 -1.9426853 -7.7696798 -0.8561826 -2.7696798
## 61 62 63 64 65 66
## -5.7696798 29.1438174 -0.8561826 -0.1156908 20.1438174 -14.2886963
## 67 68 69 70 71
## 3.0573147 4.0573147 24.0573147 -5.7696798 -1.7696798
```

The residuals show you that, for example, for the first observation, the model overestimated the number of victims by 9.94.

```
#fitted values
my_model$fitted.values
```

```
## 1 2 3 4 5 6 7 8
## 16.94269 12.76968 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
## 9 10 11 12 13 14 15 16
## 12.76968 19.02919 12.76968 12.76968 19.02919 14.85618 12.76968 12.76968
## 17 18 19 20 21 22 23 24
## 16.94269 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
## 25 26 27 28 29 30 31 32
## 12.76968 19.02919 14.85618 12.76968 12.76968 14.85618 14.85618 16.94269
## 33 34 35 36 37 38 39 40
## 19.02919 12.76968 16.94269 12.76968 12.76968 21.11569 19.02919 16.94269
## 41 42 43 44 45 46 47 48
## 14.85618 12.76968 14.85618 19.02919 19.02919 16.94269 29.46170 12.76968
## 49 50 51 52 53 54 55 56
## 12.76968 12.76968 14.85618 14.85618 12.76968 12.76968 12.76968 16.94269
## 57 58 59 60 61 62 63 64
## 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618 21.11569
## 65 66 67 68 69 70 71
## 14.85618 25.28870 16.94269 16.94269 16.94269 12.76968 12.76968
```

Here you see that the fitted value for the first observation was 16.94, so we can infer that the observed (actual) value for the first observation must be: `fitted value + residual`

`my_model$fitted.values[1] + my_model$residuals[1]`

```
## 1
## 7
```

```
#checked against the actual data
my_model$model[1, ]
```

If we want to get information on the significance of predictors and the fit of the model, we’ll look at the `summary()` of `my_model`:

```
#model statistics
summary(my_model)
```

```
##
## Call:
## lm(formula = TOTALVICTIMS ~ NUMWEAPONS, data = us_mass_shootings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.462 -6.813 -4.770 1.601 50.971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6832 2.6074 4.097 0.000112 ***
## NUMWEAPONS 2.0865 0.9806 2.128 0.036925 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 69 degrees of freedom
## Multiple R-squared: 0.06158, Adjusted R-squared: 0.04798
## F-statistic: 4.528 on 1 and 69 DF, p-value: 0.03692
```

You can see:

- that both the intercept and the predictor number_of_weapons are significant at the *p* < .05 level
- the parameter estimates (for the intercept and the slope)
- the R-squared statistic: this expresses the proportion of the variance in the outcome variable `Y` that is explained by the model.

In our case, we see that the model only explained 6.158% of the variance in the number of victims.
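You can verify what R-squared means with a small invented example: it is one minus the ratio of the residual sum of squares to the total sum of squares of the outcome:

```
#invented toy data
set.seed(2)
x <- 1:30
y <- 3 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)
#R-squared by hand: 1 - SS_residual / SS_total
r2_manual <- 1 - sum(fit$residuals^2) / sum((y - mean(y))^2)
r2_manual
summary(fit)$r.squared  #identical
```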

**Multiple predictors**

Of course, a single predictor rarely explains a lot of variation in an outcome variable. So we might want to include a new variable (e.g. the categorical variable indicating whether there were signs of prior mental health issues or not). The conceptual model would then be:

`victims ~ number_of_weapons + mental_health`

In R:

```
my_model_2 = lm(formula = TOTALVICTIMS ~ NUMWEAPONS + PRIORSIGNSOFMENTALILLNESS
, data = us_mass_shootings)
summary(my_model_2)
```

```
##
## Call:
## lm(formula = TOTALVICTIMS ~ NUMWEAPONS + PRIORSIGNSOFMENTALILLNESS,
## data = us_mass_shootings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.766 -7.474 -2.816 1.966 48.709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7765 3.3592 1.720 0.0901 .
## NUMWEAPONS 2.1583 0.9541 2.262 0.0269 *
## PRIORSIGNSOFMENTALILLNESSYes 6.8810 3.0900 2.227 0.0293 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.03 on 68 degrees of freedom
## Multiple R-squared: 0.1254, Adjusted R-squared: 0.09964
## F-statistic: 4.873 on 2 and 68 DF, p-value: 0.01052
```

You can see that the model fit improved considerably (from 6.158% to 12.54%) and that the new model equation is:

`victims = 5.7765 + 2.1583*number_of_weapons + 6.881*mental_health`

Note that we now have two predictors which implies that we also have two slopes (one for each predictor).

You can see that the model summary has appended the word “Yes” to the `PRIORSIGNSOFMENTALILLNESS` predictor. This reflects that the slope applies when the variable `PRIORSIGNSOFMENTALILLNESS` is equal to `Yes`. In other words, the predicted number of victims is 6.8810 higher if there were signs of mental health issues than if there were none.

When there are multiple predictors, the simple regression model becomes a *multiple regression model*.
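Under the hood — sketched here with an invented factor — R converts a two-level categorical predictor into a 0/1 “dummy” column, which is why a single slope suffices for it:

```
#invented two-level factor
mental_health <- factor(c("No", "Yes", "No", "Yes"))
#the design matrix R builds internally: "Yes" becomes 1, "No" becomes 0
model.matrix(~ mental_health)
```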

**Task:**

Build your own model that predicts the number of wounded victims based on the shooter’s gender and whether the weapons were obtained legally.

`#type+run your code here`

Now that we covered some basics of linear regression (more in the lecture and homework), we can see why the generalized linear model is so useful. Essentially, it is a means to unify similar models in one.

Suppose you’d want to predict a categorical variable (e.g. “predict” whether the weapon was obtained legally based on the shooter’s gender): the problem is that this conflicts with the base linear regression model which assumes continuous data.

The GLM can handle different outcome variables through so-called “link” functions. A link function links the outcome variable to the linear predictor. An overview of link functions in R can be found here. This unification allows you to build the model in the same way, changing only the link function if the outcome variable is of a different type. It also means that a linear combination of predictors can be used regardless of the nature of the response variable.
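Each `family` object in R bundles a default link function, which you can inspect directly:

```
#default link function of each family
gaussian()$link  #"identity"
binomial()$link  #"logit"
poisson()$link   #"log"
```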

The modelling process of the GLM in R is very similar to the linear regression example with the addition of the link function:

```
my_glm = glm(formula = TOTALVICTIMS ~ NUMWEAPONS
, data = us_mass_shootings
, family = gaussian)
```

This fits the same model as the one stored in `my_model` above, but this time with the “Gaussian” family explicitly specified. You can see that the model is identical to the one obtained with `lm(...)` above. This is because a GLM with the Gaussian family (and its identity link) is equivalent to the linear model fitted with `lm(...)`. More in-depth relationships are out of the scope of this module but this tutorial gives a nice starting point.

`my_glm`

```
##
## Call: glm(formula = TOTALVICTIMS ~ NUMWEAPONS, family = gaussian, data = us_mass_shootings)
##
## Coefficients:
## (Intercept) NUMWEAPONS
## 10.683 2.087
##
## Degrees of Freedom: 70 Total (i.e. Null); 69 Residual
## Null Deviance: 11260
## Residual Deviance: 10560 AIC: 562.7
```
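You can confirm this equivalence on invented toy data: the coefficients from `lm()` and from `glm(..., family = gaussian)` match exactly:

```
#invented toy data
set.seed(3)
x <- 1:20
y <- 2 + 0.7 * x + rnorm(20)
coef_lm  <- coef(lm(y ~ x))
coef_glm <- coef(glm(y ~ x, family = gaussian))
all.equal(coef_lm, coef_glm)  #TRUE
```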

Similar to models built with the `lm()` function, `glm()` offers additional model information:

```
#residuals
my_glm$residuals
```

```
## 1 2 3 4 5 6
## -9.9426853 -2.7696798 -6.7696798 -8.8561826 7.2303202 -5.7696798
## 7 8 9 10 11 12
## -5.8561826 -9.8561826 -5.7696798 10.9708120 -4.7696798 -2.7696798
## 13 14 15 16 17 18
## 50.9708120 -7.8561826 -2.7696798 -7.7696798 -7.9426853 -4.9426853
## 19 20 21 22 23 24
## 6.2303202 -3.8561826 -7.7696798 30.2303202 3.1438174 -3.8561826
## 25 26 27 28 29 30
## -5.7696798 7.9708120 -6.8561826 0.2303202 -5.7696798 41.1438174
## 31 32 33 34 35 36
## -4.8561826 -5.9426853 -10.0291880 -4.7696798 -1.9426853 -1.7696798
## 37 38 39 40 41 42
## -0.7696798 -6.1156908 -10.0291880 -9.9426853 -6.8561826 -5.7696798
## 43 44 45 46 47 48
## 0.1438174 2.9708120 19.9708120 12.0573147 -14.4617018 -6.7696798
## 49 50 51 52 53 54
## -5.7696798 -5.7696798 -7.8561826 -8.8561826 15.2303202 -7.7696798
## 55 56 57 58 59 60
## 12.2303202 -4.9426853 -1.9426853 -7.7696798 -0.8561826 -2.7696798
## 61 62 63 64 65 66
## -5.7696798 29.1438174 -0.8561826 -0.1156908 20.1438174 -14.2886963
## 67 68 69 70 71
## 3.0573147 4.0573147 24.0573147 -5.7696798 -1.7696798
```

```
#fitted values
my_glm$fitted.values
```

```
## 1 2 3 4 5 6 7 8
## 16.94269 12.76968 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
## 9 10 11 12 13 14 15 16
## 12.76968 19.02919 12.76968 12.76968 19.02919 14.85618 12.76968 12.76968
## 17 18 19 20 21 22 23 24
## 16.94269 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
## 25 26 27 28 29 30 31 32
## 12.76968 19.02919 14.85618 12.76968 12.76968 14.85618 14.85618 16.94269
## 33 34 35 36 37 38 39 40
## 19.02919 12.76968 16.94269 12.76968 12.76968 21.11569 19.02919 16.94269
## 41 42 43 44 45 46 47 48
## 14.85618 12.76968 14.85618 19.02919 19.02919 16.94269 29.46170 12.76968
## 49 50 51 52 53 54 55 56
## 12.76968 12.76968 14.85618 14.85618 12.76968 12.76968 12.76968 16.94269
## 57 58 59 60 61 62 63 64
## 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618 21.11569
## 65 66 67 68 69 70 71
## 14.85618 25.28870 16.94269 16.94269 16.94269 12.76968 12.76968
```

```
#model statistics
summary(my_glm)
```

```
##
## Call:
## glm(formula = TOTALVICTIMS ~ NUMWEAPONS, family = gaussian, data = us_mass_shootings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -14.462 -6.813 -4.770 1.601 50.971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6832 2.6074 4.097 0.000112 ***
## NUMWEAPONS 2.0865 0.9806 2.128 0.036925 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 153.1111)
##
## Null deviance: 11258 on 70 degrees of freedom
## Residual deviance: 10565 on 69 degrees of freedom
## AIC: 562.67
##
## Number of Fisher Scoring iterations: 2
```

**Using the GLM for different outcome variables**

We can extend the GLM to cases where the outcome variable is binary (e.g. coded as 0 or 1) or represents counts (e.g. counts of victims). For the latter, for example, a Poisson regression might be more apt than a Gaussian linear regression. We will cover these extensions in the next lecture and tutorial.
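As a preview with simulated count data (invented here, not part of the tutorial dataset), a Poisson regression only changes the `family` argument of `glm()`:

```
#simulate counts whose log-mean rises with the number of weapons
set.seed(4)
weapons <- rep(1:4, each = 10)
victims <- rpois(40, lambda = exp(0.5 + 0.3 * weapons))
#Poisson regression: same glm() call, different family
pois_fit <- glm(victims ~ weapons, family = poisson)
coef(pois_fit)  #coefficients are on the log scale
```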