Tutorial 1, Probability, Statistics and Modelling 2, BSc Security and Crime Science, UCL

## Outcomes of this tutorial

This tutorial has two goals:

1. Doing a quick refresher of PSM1 concepts with R.
2. Getting started with the Generalized Linear Model approach in R.

## Structure of this tutorial

You are expected to work through this R Notebook in the tutorial and we will assist you and outline concepts for the whole class if needed.

## How to get help for programming problems?

If you are encountering problems with programming problems you may find this guide on How to solve data science problems in R useful. This is targeted at the 3rd Data Science module (Advanced Crime Analysis) but the guide for getting help applies to other R problems too.

## Recap PSM 1 in R

### Descriptive statistics

First, let’s use a dataset that is close to those that you might encounter in the SCS programme. The dataset was retieved from Data World and contains details on all mass shootings in the US between 1982 - 2018.

#Look at the data

                                      CASE GENDER SHOOTINGTYPE            RACE
1  Chattanooga military recruitment center   Male         Mass  Middle Eastern
2               Charleston Church Shooting   Male         Mass           White
3 Marysville-Pilchuck High School shooting   Male         Mass Native American
4                  Alturas tribal shooting Female         Mass Native American
5            Washington Navy Yard shooting   Male         Mass           Black
6               Hialeah apartment shooting   Male         Mass          Latino
LOCATION          STATE       DATE YEAR
1     Chattanooga, Tennessee      Tennessee  7/16/2015 2015
2 Charleston, South Carolina South Carolina  6/17/2015 2015
3     Marysville, Washington     Washington 10/24/2014 2014
4        Alturas, California     California  2/20/2014 2014
5           Washington, D.C.           D.C.  9/16/2013 2013
6           Hialeah, Florida        Florida  7/26/2013 2013
SUMMARY
1  Kuwaiti-born Mohammod Youssuf Abdulazeez, 24, a naturalized US citizen, opened fire at a Naval reserve center, and then drove to a military recruitment office where he shot and killed four Marines and a Navy service member, and wounded a police officer and another military service member. He was then fatally shot in an exchange of gunfire with law enforcement officers responding to the attack.
2                                                                                                                                                                                Dylann Storm Roof, 21, shot and killed 9 people after opening fire at the Emanuel AME Church in Charleston, South Carolina. According to a roommate, he had allegedly been ¬â√õ√èplanning something like that for six months."
3 Jaylen Fryberg, 15, using a .40-caliber Berretta, shot five \nstudents at Marysville High School, including two of his cousins and three friends, killing all but one. Fryberg arranged to meet them for lunch in the school cafeteria by text. Fryberg was reportedly well-liked at the school and there was not believed to be any ill-will between him and his victims. He committed suicide at the scene.
4                                                                                                                                                                              Cherie Lash Rhoades, 44, opened fire at the Cedarville Rancheria Tribal Office and Community Center, killing four and wounding two. After running out of ammunition, Rhoades grabbed a butcher knife and stabbed another person.
5                                                                                                                                                                                                                                      Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.
6                                                                                                                                                                                                                        Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.
FATALITIES WOUNDED TOTALVICTIMS LOCATIONTYPE PRIORSIGNSOFMENTALILLNESS
1          5       2            7     Military                       Yes
2          9       1           10    Religious                        No
3          5       1            6       School                        No
4          4       2            6        Other                        No
5         12       8           20     Military                       Yes
6          7       0            7        Other                       Yes
MENTALHEALTHNOTES
1                   Abdulazeez "had suffered for years from depression and possibly from bipolar disorder," according to a representative of the family. (NYT, July 20 2015)
2
3 Fryberg was well-liked and allegedly happy, but was also upset about a girl and had posted emotional social media messages. No definitive signs of mental health problems.
4
5                Had told Rhode Island police the prior month that he'd "heard voices"; had been undergoing mental health treatment with Veterans Affairs since August 2013.
6                                                                         His mother told authorities her son had been acting strangely and needed a psychiatric evaluation.
WEAPONSOBTAINEDLEGALLY                                         WHEREWEAPONOBTAINED
1                    Yes                           On the internet, via Armslist.com
2                    Yes Shooter's Choice gun store in West Columbia, South Carolina
3                     No                                       Gun was his father's.
4                                                                            Unknown
5                    Yes                              Sharpshooters Small Arms Range
6                    Yes                                          Florida Gun Center
TYPEOFWEAPONS
1                                                                            2 assault rifles; semiautomatic handgun
2                                                                                                            Handgun
3                                                                                                            Handgun
4                                                                                   Two handguns and a butcher knife
5 Sawed-off shotgun, 2 boxes of shells; also a .45-caliber handgun taken from a security guard he shot at the scene.
6                                                                                         9mm semi-automatic handgun
NUMWEAPONS ASSAULT                                                 WEAPONDETAILS
1          3     Yes             AK-47, AR-15, and 30-round magazines; 9mm handgun
2          1      No .45-caliber Glock (model 41, with 13-round capacity magazine)
3          1      No                                   Beretta .40-caliber handgun
4          2      No                                    9mm semi-automatic handgun
5          1      No       Remington 870 Express 12-gauge shotgun; Beretta handgun
6          1      No                                                      Glock 17
SOURCES
1 http://www.reuters.com/article/2015/07/16/us-usa-shooting-tennessee-idUSKCN0PQ1WY20150716; http://www.nytimes.com/2015/07/18/us/chattanooga-gunmans-past-scoured-for-extremist-ties.html; http://www.wsj.com/articles/chattanooga-shooting-highlights-online-gun-sales-1437435518http://www.nytimes.com/2015/07/19/us/chattanooga-attacks-claim-a-5th-military-service-members-life.html; http://www.cnn.com/2015/07/17/us/tennessee-naval-reserve-shooting/; https://twitter.com/markfollman/status/622105627473477632; http://www.nytimes.com/2015/07/21/us/chattanooga-gunman-wrote-of-suicide-and-martyrdom-official-says.html
2                                                                                                                                                                          http://www.motherjones.com/politics/2015/06/dylann-roofs-attorney; http://www.newsweek.com/report-nine-shot-charleston-south-carolina-church-shooting-344235; http://www.cnn.com/2015/06/19/us/charleston-church-shooting-suspect/; http://www.motherjones.com/politics/2015/06/9-people-dead-mass-shooting-south-carolina-church; http://www.nbcnews.com/storyline/charleston-church-shooting/fbi-says-dyann-roof-should-not-have-been-sold-gun-n390056
3                                                                                                                                                                                                                                                                                                                                                                                                http://www.seattletimes.com/seattle-news/fourth-marysville-shooting-victim-dies-as-another-is-laid-to-rest/ http://www.mercurynews.com/crime-courts/ci_26814211/jaylen-fryberg-included-tributes-washington-school-shooting-victims
MENTALHEALTHSOURCES
1
2
4
5            http://bigstory.ap.org/article/13-killed-washington-navy-yard-shooting-rampage
6 http://www.miamiherald.com/2013/08/03/v-print/3539629/hialeah-killer-showed-signs-of.html
LATITUDE  LONGITUDE
1 35.04716  -85.31182
2 32.78839  -79.93314
3 48.05082 -122.17692
4 41.48710 -120.54224
5 38.87498  -76.99453
6 25.86701  -80.29147

You have now created a data.frame called us_mass_shootings that you can query nicely with R. You can use the concepts from the R in 12 Steps and Getting ready for R on that data.frame.

In addition, if you feel you need a bit more guidance on how to deal with data.frames, have a look at the 17 Steps to investigate R dataframes tutorial.

#If you want to run the data.frames tutorial, you can type your code here.


Now let’s get some basic descriptive statastics: the first step when analysing a dataset is to describe how the data were generated/retrieved (here: using an existing dataset) and describing the dataset through descriptive statistics.

#frequency counts of gender by race
table(us_mass_shootings$GENDER, us_mass_shootings$RACE)


Asian Black Latino Middle Eastern Native American Other White
Female     0     0      0              0               1     0     1
Male       6    11      4              1               2     1    44

1. To get a glimpse at your whole dataset at once, you can use use the summary() and str() function. Use these on the us_mass_shootings dataset:
summary(us_mass_shootings)

                                 CASE       GENDER   SHOOTINGTYPE              RACE
101 California Street shootings   : 1   Female: 2   Mass :63     Asian          : 6
Accent Signage Systems shooting   : 1   Male  :69   Spree: 8     Black          :11
Air Force base shooting           : 1                            Latino         : 4
Alturas tribal shooting           : 1                            Middle Eastern : 1
Amish school shooting             : 1                            Native American: 3
Atlanta day trading spree killings: 1                            Other          : 1
(Other)                           :65                            White          :45
LOCATION         STATE            DATE         YEAR
Aurora, Colorado     : 2   California:11   1/17/1989 : 1   Min.   :1982
Seattle, Washington  : 2   Florida   : 6   1/30/2006 : 1   1st Qu.:1994
Aiken, South Carolina: 1   Washington: 6   1/8/2011  : 1   Median :2005
Alturas, California  : 1   Texas     : 5   10/14/2011: 1   Mean   :2002
Atlanta, Georgia     : 1   New York  : 4   10/15/1992: 1   3rd Qu.:2011
Binghamton, New York : 1   Colorado  : 3   10/16/1991: 1   Max.   :2015
(Other)              :63   (Other)   :36   (Other)   :65
SUMMARY
Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.                                                                                             : 1
Abdelkrim Belachheb, 39, opened fire at an upscale nightclub after a woman rejected his advances. He was later arrested.                                                                                                                                             : 1
Adam Lanza, 20, shot his mother dead at their home then drove to Sandy Hook Elementary school. He forced his way inside and opened fire, killing 20 children and six adults before committing suicide.                                                               : 1
After he was expelled for having a gun in his locker, Kipland P. Kinkel, 15, a freshman at Thurston High, went on a shooting spree, killing his parents at home and two students at school. Five classmates wrestled Kipland to the ground before he was arrested.   : 1
Andrew Engeldinger, 36, upon learning he was being fired, went on a shooting rampage, killing the business owner, three fellow employees, and a UPS driver. He then killed himself.                                                                                  : 1
Army psychiatrist Nidal Malik Hasan, 39, opened fire on an Army base in an attack linked to Islamist extremism. Hasan was injured during the attack and later arrested.                                                                                              : 1
(Other)                                                                                                                                                                                                                                                              :65
FATALITIES        WOUNDED        TOTALVICTIMS      LOCATIONTYPE
Min.   : 4.000   Min.   : 0.000   Min.   : 5.00   Military : 4
1st Qu.: 5.000   1st Qu.: 1.000   1st Qu.: 7.00   Other    :30
Median : 6.000   Median : 3.000   Median :11.00   Religious: 4
Mean   : 8.042   Mean   : 7.225   Mean   :15.27   School   :13
3rd Qu.: 8.500   3rd Qu.: 8.000   3rd Qu.:18.50   Workplace:20
Max.   :33.000   Max.   :58.000   Max.   :70.00

PRIORSIGNSOFMENTALILLNESS
No :22
Yes:49

MENTALHEALTHNOTES
: 4
A district court ruled Cho was "an imminent danger" to himself and others as a result of mental illness two years earlier, and directed Cho to seek treatment. : 1
A former instructor at Oikos described him as "mentally unstable" and "paranoid."                                                                              : 1
A psychiatrist, testifying for the prosecution,said he suffered from schizophrenia.                                                                            : 1
Abdulazeez "had suffered for years from depression and possibly from bipolar disorder," according to a representative of the family. (NYT, July 20 2015)       : 1
According to one relative, he was violent and had the mental capacity of a child. (But accounts from others did not indicate this about the shooter.)          : 1
(Other)                                                                                                                                                        :62
WEAPONSOBTAINEDLEGALLY
: 2
No :13
Yes:56

WHEREWEAPONOBTAINED
Unknown                                                     :16
Purchased from an individual                                : 2
AK-47 purchased from Tilford's Gun Sales in Louisville, Ky. : 1
Assembled a rifle out of component parts.                   : 1
B&B Gun Sales in Orange County, Calif.                      : 1
Bull's Eye Shooter Supply in Tacoma, Wash.                  : 1
(Other)                                                     :49
TYPEOFWEAPONS   NUMWEAPONS    ASSAULT
One semiautomatic handgun              :18    Min.   :1.000   No :55
Two semiautomatic handguns             : 6    1st Qu.:1.000   Yes:16
One rifle (assault)                    : 4    Median :2.000
One semiautomatic handgun, one revolver: 4    Mean   :2.197
Handgun                                : 2    3rd Qu.:3.000
One revolver, one shotgun              : 2    Max.   :9.000
(Other)                                :35
WEAPONDETAILS
.45-caliber semiautomatic handgun                                     : 2
9mm semiautomatic handgun                                             : 2
.22-caliber rifle; two 12-gauge shotguns                              : 1
.22-caliber Ruger sawed-off semiautomatic rifle                       : 1
.22-caliber sawed-off rifle; 12-gauge pump-action shotgun             : 1
.22-caliber, two .45-caliber Colt Model 1911-A1 semiautomatic handguns: 1
(Other)                                                               :63
SOURCES
http://archives.starbulletin.com/2000/06/02/news/story2.html; http://www.vpc.org/studies/wgun991102.htm                                                                                                                                                                                                                                                      : 1
http://articles.chicagotribune.com/2001-02-07/news/0102070122_1_navistar-gun-law-hunting-rifle; http://www.vpc.org/studies/wgun010205.htm                                                                                                                                                                                                                    : 1
http://articles.latimes.com/1987-04-25/news/mn-990_1_palm-bay-police                                                                                                                                                                                                                                                                                         : 1
(Other)                                                                                                                                                                                                                                                                                                                                                      :65
MENTALHEALTHSOURCES
: 4
(Supreme Court of Florida Document) http://www.murderpedia.org/male.C/images/cruse_william_b/op-74656.pdf               : 1
http://abcnews.go.com/US/story?id=3052278&page=1                                                                        : 1
http://archives.starbulletin.com/2000/06/02/news/story2.html                                                            : 1
http://articles.chicagotribune.com/2001-02-07/news/0102070122_1_navistar-gun-law-hunting-rifle                          : 1
http://articles.cnn.com/2000-01-21/us/kinkel.revisited_1_kip-kinkel-thurston-high-school-oregon-school-shooting?_s=PM:US: 1
(Other)                                                                                                                 :62
LATITUDE       LONGITUDE
Min.   :21.33   Min.   :-157.85
1st Qu.:33.75   1st Qu.:-117.75
Median :38.58   Median : -90.67
Mean   :37.82   Mean   : -96.95
3rd Qu.:41.92   3rd Qu.: -81.15
Max.   :48.05   Max.   : -71.07

1. Calculate the frequency counts of the shooter’s “race”:
table(us_mass_shootings$RACE)   Asian Black Latino Middle Eastern Native American 6 11 4 1 3 Other White 1 45  1. Retrieve the frequnecy table of the number of weapons used by the shooter’s gender. table(us_mass_shootings$NUMWEAPONS, us_mass_shootings$GENDER)   Female Male 1 1 29 2 1 17 3 0 12 4 0 7 5 0 2 7 0 1 9 0 1 1. Suppose you’re interested in going a little deeper and are not only interested in the gender and race but - in addition - also in the year of the shooting: retrieve the frequnecy table of gender by race and year. Tip: you can arrange the variables in desired order to change the display (not the data). In which year did females commit mass shootings and of what race where they? table(us_mass_shootings$YEAR, us_mass_shootings$GENDER, us_mass_shootings$RACE)

, ,  = Asian

Female Male
1982      0    0
1984      0    0
1986      0    0
1987      0    0
1988      0    0
1989      0    0
1990      0    0
1991      0    1
1992      0    0
1993      0    0
1994      0    0
1995      0    0
1996      0    0
1997      0    0
1998      0    0
1999      0    1
2000      0    0
2001      0    0
2003      0    0
2004      0    0
2005      0    0
2006      0    0
2007      0    1
2008      0    0
2009      0    1
2010      0    0
2011      0    0
2012      0    2
2013      0    0
2014      0    0
2015      0    0

, ,  = Black

Female Male
1982      0    0
1984      0    0
1986      0    0
1987      0    0
1988      0    0
1989      0    0
1990      0    1
1991      0    0
1992      0    0
1993      0    2
1994      0    0
1995      0    0
1996      0    1
1997      0    1
1998      0    0
1999      0    0
2000      0    0
2001      0    1
2003      0    0
2004      0    0
2005      0    0
2006      0    0
2007      0    0
2008      0    1
2009      0    1
2010      0    1
2011      0    0
2012      0    0
2013      0    2
2014      0    0
2015      0    0

, ,  = Latino

Female Male
1982      0    0
1984      0    0
1986      0    0
1987      0    0
1988      0    0
1989      0    0
1990      0    0
1991      0    0
1992      0    0
1993      0    0
1994      0    0
1995      0    0
1996      0    0
1997      0    1
1998      0    0
1999      0    1
2000      0    0
2001      0    0
2003      0    0
2004      0    0
2005      0    0
2006      0    0
2007      0    0
2008      0    0
2009      0    0
2010      0    0
2011      0    1
2012      0    0
2013      0    1
2014      0    0
2015      0    0

, ,  = Middle Eastern

Female Male
1982      0    0
1984      0    0
1986      0    0
1987      0    0
1988      0    0
1989      0    0
1990      0    0
1991      0    0
1992      0    0
1993      0    0
1994      0    0
1995      0    0
1996      0    0
1997      0    0
1998      0    0
1999      0    0
2000      0    0
2001      0    0
2003      0    0
2004      0    0
2005      0    0
2006      0    0
2007      0    0
2008      0    0
2009      0    0
2010      0    0
2011      0    0
2012      0    0
2013      0    0
2014      0    0
2015      0    1

, ,  = Native American

Female Male
1982      0    0
1984      0    0
1986      0    0
1987      0    0
1988      0    0
1989      0    0
1990      0    0
1991      0    0
1992      0    0
1993      0    0
1994      0    0
1995      0    0
1996      0    0
1997      0    0
1998      0    0
1999      0    0
2000      0    0
2001      0    0
2003      0    0
2004      0    0
2005      0    1
2006      0    0
2007      0    0
2008      0    0
2009      0    0
2010      0    0
2011      0    0
2012      0    0
2013      0    0
2014      1    1
2015      0    0

, ,  = Other

Female Male
1982      0    0
1984      0    0
1986      0    0
1987      0    0
1988      0    0
1989      0    0
1990      0    0
1991      0    0
1992      0    0
1993      0    0
1994      0    0
1995      0    0
1996      0    0
1997      0    0
1998      0    0
1999      0    0
2000      0    0
2001      0    0
2003      0    0
2004      0    0
2005      0    0
2006      0    0
2007      0    0
2008      0    0
2009      0    1
2010      0    0
2011      0    0
2012      0    0
2013      0    0
2014      0    0
2015      0    0

, ,  = White

Female Male
1982      0    1
1984      0    2
1986      0    1
1987      0    1
1988      0    1
1989      0    2
1990      0    0
1991      0    2
1992      0    2
1993      0    2
1994      0    1
1995      0    1
1996      0    0
1997      0    0
1998      0    3
1999      0    3
2000      0    1
2001      0    0
2003      0    1
2004      0    1
2005      0    1
2006      1    2
2007      0    3
2008      0    2
2009      0    1
2010      0    0
2011      0    2
2012      0    5
2013      0    2
2014      0    0
2015      0    1

Just as you can ‘subset’ frequency counts by additional variables, you can also calculate descriptive statistics “by” another one. Suppose you wanted to calculate the mean number of wounded victims by the gender of the shooter:

tapply(X = us_mass_shootings$WOUNDED , INDEX = us_mass_shootings$GENDER
, FUN = mean
)

  Female     Male
1.000000 7.405797 

What is the mean (and standard deviation) number of fatalities for those who had prior mental health issues vs those who didn’t have prior mental health issues?

tapply(X = us_mass_shootings$FATALITIES , INDEX = us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
, FUN = sd
)

      No      Yes
4.267059 5.710713 

### Chi-square association

When you have two categorical variables (e.g. scored as yes/no, ill/healthy, offended/not offended, …) you might be interested in the association of these variables. In PSM1 you have covered the Chi-Square test of independence of two variables. In R, you’d run the test as follows:

#let's look at the association between whether the weapon was obtained legally and whether ther were signs of prior mental illness
table(us_mass_shootings$WEAPONSOBTAINEDLEGALLY, us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))

                      PRIORSIGNSOFMENTALILLNESS
WEAPONSOBTAINEDLEGALLY No Yes
2   0
No   5   8
Yes 15  41

You notice that there are two cases where there is no label for the WEAPONSOBTAINEDLEGALLY variable. We can assess this further:

levels(us_mass_shootings$WEAPONSOBTAINEDLEGALLY)  [1] "" "No" "Yes" We see that there are three categorical levels in the data here. Since we do not want to impose any speculative values on the data, we exclude these cases: table(us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY, us_mass_shootings_clean$PRIORSIGNSOFMENTALILLNESS , dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))   PRIORSIGNSOFMENTALILLNESS WEAPONSOBTAINEDLEGALLY No Yes No 5 8 Yes 15 41 Now, to test for an association, we run the chisq.test(...) function on the table of frequency counts: chisq.test(freq_table)  Chi-squared approximation may be incorrect  Pearson's Chi-squared test with Yates' continuity correction data: freq_table X-squared = 0.24665, df = 1, p-value = 0.6194 Note that this is an illustration only here and that you need to check the assumptions (e.g. a minimum count per cell) first. TASK Assess with a 99% confidence interval whether there’s an association between how the shooter obtained the weapon and gender. chisq.test(freq_table)  Chi-squared approximation may be incorrect  Pearson's Chi-squared test with Yates' continuity correction data: freq_table X-squared = 2.4506e-31, df = 1, p-value = 1 From your Chisquare-Test, retrieve the expected counts under the null hypothesis. Hint: ?chisq.test. chisq$expected # index from the object

                      GENDER
WEAPONSOBTAINEDLEGALLY    Female     Male
No  0.1884058 12.81159
Yes 0.8115942 55.18841

### Hypothesis testing

Another aspect covered in the PSM1 module was hypothesis testing with t-tests. R has most statistical functions available by default and you can run a t-test by using the t.test(...) function:

t.test(us_mass_shootings$FATALITIES ~ us_mass_shootings$PRIORSIGNSOFMENTALILLNESS)


Welch Two Sample t-test

data:  us_mass_shootings$FATALITIES by us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
t = -1.5593, df = 53.281, p-value = 0.1248
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.3560172  0.5452565
sample estimates:
mean in group No mean in group Yes
6.727273          8.632653 

Note that there are additional arguments in the t.test(...) function (e.g. for running a paired-samples t-test):

#have a look at the help file to see all arguments
?t.test

Run a t-test to assess if there’s a significant difference between the number of wounded depending on whether the weapon was obtained legally or illegally.

t.test(us_mass_shootings_clean$WOUNDED ~ us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY)


Welch Two Sample t-test

data:  us_mass_shootings_clean$WOUNDED by us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY
t = -0.28635, df = 21.753, p-value = 0.7773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.321386  4.788419
sample estimates:
mean in group No mean in group Yes
6.769231          7.535714 

No run that code with a 99% confidence interval and equal variance assumed. Check ?t.test for the specification of these parameters.

t.test(us_mass_shootings_clean$WOUNDED ~ us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY
, var.equal = T
, conf.level = 0.99)


Two Sample t-test

data:  us_mass_shootings_clean$WOUNDED by us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY
t = -0.24841, df = 67, p-value = 0.8046
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
-8.947030  7.414063
sample estimates:
mean in group No mean in group Yes
6.769231          7.535714 

### Correlation

Finally, sometimes you are interested in looking at the correlation between two continuous variables. In R, there are three ways to do this:

As with all R functions, there are several parameters that allow you to adjust the function (e.g. for special cases of correlation):

?cor.test

What is the 99% confidence interval of the correlation between the number of weapons used and the total number of victims.

cor.test(us_mass_shootings$NUMWEAPONS, us_mass_shootings$TOTALVICTIMS, conf.level = 0.99)


Pearson's product-moment correlation

data:  us_mass_shootings$NUMWEAPONS and us_mass_shootings$TOTALVICTIMS
t = 2.1278, df = 69, p-value = 0.03692
alternative hypothesis: true correlation is not equal to 0
99 percent confidence interval:
-0.05885602  0.51227271
sample estimates:
cor
0.2481506 

## The Generalized Linear Model (GLM)

The generalized linear model will form the basis of the first five weeks of this module.

### What is the GLM?

The GLM is a family of models that extend (= generalize) the linear regression model to allow for more flexibility in the response variables (= dependent variable).

### What is a linear regression model anyway?

In the simplest sense, a linear regression model aims to predict a continuous outcome (e.g. income) through a predictor variable (e.g. age or gender). While the dependent variable (also called: outcome variable, response variable, target variable) is strictly continuous, the predictors can take both contionuous (age) and categorical (e.g. female vs male) form.

The ingredients of a regression model:

• The dependent variable Y
• The predictor variable X
• The intercept a (= the value of Y if X is 0)
• The slope b (= the change in Y for every unit change in X)
• The error term E (= the difference between the predicted value and the observed value)

We will cover each of the above in detail in the lecture

The model can be written down as: Y = a + b*X + E

Read this model as: Y explained the intercept a plus the predictor X multiplied with the slope b plus the error term E.

Suppose you are predicting the income of a person through their age:

income = a + b*age + E

Once fitted to the data, the regression model could look like this:

income = 10000 + 2500*age Note that the error term E is not explicitly mentioned here but can be inferred (see below).

This model allows you to replace age with the actual age of someone from your sample to estimate their income:

income = 10000 + 2500*50 = 10000 + 12500 = 135000

You can use R to calculate the model parameters (intercept a and slope b) by specifying only the model formula:

income ~ age

Read this as: income explained through age. Note that the = sign is replaced by a tilde ~. R uses this notation to build models.

Example with mass shootings

Suppose you wanted to run a regression model on the shooting data. Specifically, suppose you wanted to predict the no. of victims through the number of weapons used.

Conceptually, you’d want this model:

victims ~ number_of_weapons

We can easily take this intuitive notion to R:

Now R has fitted the model you specified on the data. You can inspect that model:

#the full model and its parameters
my_model


Call:
lm(formula = TOTALVICTIMS ~ NUMWEAPONS, data = us_mass_shootings)

Coefficients:
(Intercept)   NUMWEAPONS
10.683        2.087  

Based on these values, you can construct the model equation:

victims = 10.683 + 2.087*number_of_weapons

Thus, if the shooter used 3 weapons, your model would predict almost 17 victims:

10.683 + 2.087*3

[1] 16.944

The model has other information available too. The most important ones are:

• the residuals: these are the estimation errors for each case (i.e. the error term E, defined as: outcome variable - fitted value).
• the fitted values: these are the values predicted by the model
• model statistics: here you can assess whether a predictor was statistically significant or not and how well the model fits the data.
#residuals
my_model$residuals   1 2 3 4 5 6 7 -9.9426853 -2.7696798 -6.7696798 -8.8561826 7.2303202 -5.7696798 -5.8561826 8 9 10 11 12 13 14 -9.8561826 -5.7696798 10.9708120 -4.7696798 -2.7696798 50.9708120 -7.8561826 15 16 17 18 19 20 21 -2.7696798 -7.7696798 -7.9426853 -4.9426853 6.2303202 -3.8561826 -7.7696798 22 23 24 25 26 27 28 30.2303202 3.1438174 -3.8561826 -5.7696798 7.9708120 -6.8561826 0.2303202 29 30 31 32 33 34 35 -5.7696798 41.1438174 -4.8561826 -5.9426853 -10.0291880 -4.7696798 -1.9426853 36 37 38 39 40 41 42 -1.7696798 -0.7696798 -6.1156908 -10.0291880 -9.9426853 -6.8561826 -5.7696798 43 44 45 46 47 48 49 0.1438174 2.9708120 19.9708120 12.0573147 -14.4617018 -6.7696798 -5.7696798 50 51 52 53 54 55 56 -5.7696798 -7.8561826 -8.8561826 15.2303202 -7.7696798 12.2303202 -4.9426853 57 58 59 60 61 62 63 -1.9426853 -7.7696798 -0.8561826 -2.7696798 -5.7696798 29.1438174 -0.8561826 64 65 66 67 68 69 70 -0.1156908 20.1438174 -14.2886963 3.0573147 4.0573147 24.0573147 -5.7696798 71 -1.7696798  The residuals show you that, for example, for the first observation, the model overestimated the number of victims by 9.94. #fitted values my_model$fitted.values

       1        2        3        4        5        6        7        8        9
16.94269 12.76968 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618 12.76968
10       11       12       13       14       15       16       17       18
19.02919 12.76968 12.76968 19.02919 14.85618 12.76968 12.76968 16.94269 16.94269
19       20       21       22       23       24       25       26       27
12.76968 14.85618 12.76968 12.76968 14.85618 14.85618 12.76968 19.02919 14.85618
28       29       30       31       32       33       34       35       36
12.76968 12.76968 14.85618 14.85618 16.94269 19.02919 12.76968 16.94269 12.76968
37       38       39       40       41       42       43       44       45
12.76968 21.11569 19.02919 16.94269 14.85618 12.76968 14.85618 19.02919 19.02919
46       47       48       49       50       51       52       53       54
16.94269 29.46170 12.76968 12.76968 12.76968 14.85618 14.85618 12.76968 12.76968
55       56       57       58       59       60       61       62       63
12.76968 16.94269 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
64       65       66       67       68       69       70       71
21.11569 14.85618 25.28870 16.94269 16.94269 16.94269 12.76968 12.76968 

Here you see that the fitted value for the first observation was 16.94, so we can infer that the observed (actual) value for the first observation must be: fitted value + residual

#checked against the actual data
my_model\$model[1, ]

  TOTALVICTIMS NUMWEAPONS
1            7          3

If we want to get information on the significance of predictors and the fit of the model, we’ll look at the summary() of my_model:

#model statistics
summary(my_model)


Call:
lm(formula = TOTALVICTIMS ~ NUMWEAPONS, data = us_mass_shootings)

Residuals:
Min      1Q  Median      3Q     Max
-14.462  -6.813  -4.770   1.601  50.971

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.6832     2.6074   4.097 0.000112 ***
NUMWEAPONS    2.0865     0.9806   2.128 0.036925 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.37 on 69 degrees of freedom
Multiple R-squared:  0.06158,   Adjusted R-squared:  0.04798
F-statistic: 4.528 on 1 and 69 DF,  p-value: 0.03692

You can see:

• that both the intercept and the predictor number_of_weapons are significant at the _p_ < .05 level
• the parameter estimates (for the intercept and the slope)
• the R-Squared statistic: this expresses the proportion of the variance in the outcome variable Y that is explained by the model.

In our case, we see that the model only explained 6.158% of the variance in the number of victims.

Multiple predictors

Of course, a single predictor rarely explains a lot of variation in an outcome variable. So we might want to include a new variable (e.g. the categorical variable whether there were signs of prior mental health issues or not). The conceptual model would the be:

victims ~ number_of_weapons + mental_health

In R:

my_model_2 = lm(formula = TOTALVICTIMS ~ NUMWEAPONS + PRIORSIGNSOFMENTALILLNESS
, data = us_mass_shootings)
summary(my_model_2)


Call:
lm(formula = TOTALVICTIMS ~ NUMWEAPONS + PRIORSIGNSOFMENTALILLNESS,
data = us_mass_shootings)

Residuals:
Min      1Q  Median      3Q     Max
-16.766  -7.474  -2.816   1.966  48.709

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)                    5.7765     3.3592   1.720   0.0901 .
NUMWEAPONS                     2.1583     0.9541   2.262   0.0269 *
PRIORSIGNSOFMENTALILLNESSYes   6.8810     3.0900   2.227   0.0293 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.03 on 68 degrees of freedom
Multiple R-squared:  0.1254,    Adjusted R-squared:  0.09964
F-statistic: 4.873 on 2 and 68 DF,  p-value: 0.01052

You can see that the model fit improved considerably (from 6.158% to 12.54%) and that the new model equation is:

victims ~ 5.7765 + 2.1583*number_of_weapons + 6.881*mental_health

Note that we now have two predictors which implies that we also have two slopes (one for each predictor).

You can see that the model summary has appended the word “Yes” to the PRIORSIGNSOFMENTALILLNESS predictor. This is to reflect that the slope applies to the case that the variable PRIORSIGNSOFMENTALILLNESS is equal to Yes. This means that the predicted number of victims is 6.8810 more if there were signs of mental health issues tha if there were no previous mental health issues.

When there are mutliple predictors, the simple regression model becomes a multiple regression model.

Build your own model that predicts the number if wounded victims based on the shooter’s gender and whether the weapons were obtained legally.

my_model_3 = lm(formula = WOUNDED ~ GENDER + WEAPONSOBTAINEDLEGALLY
, data = us_mass_shootings_clean)
summary(my_model_3)


Call:
lm(formula = WOUNDED ~ GENDER + WEAPONSOBTAINEDLEGALLY, data = us_mass_shootings_clean)

Residuals:
Min     1Q Median     3Q    Max
-7.673 -5.769 -3.673  0.327 50.327

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)                -0.9035    10.5220  -0.086    0.932
GENDERMale                  7.6727    10.1457   0.756    0.452
WEAPONSOBTAINEDLEGALLYYes   0.9035     3.1008   0.291    0.772

Residual standard error: 10.05 on 66 degrees of freedom
Multiple R-squared:  0.009503,  Adjusted R-squared:  -0.02051
F-statistic: 0.3166 on 2 and 66 DF,  p-value: 0.7297

### Back to the GLM….

Now that we covered some basics of linear regression (more in the lecture and homework), we can see why the generalized linear model is so useful. Essentially, it is a means to unify similar models in one.

Suppose you’d want to predict a categorical variable (e.g. “predict” whether the weapon was obtained legally based on the shooter’s gender): the problem is that this conflicts with the base linear regression model which assumes continuous data.

The GLM can handle different outcome variables through so-called “link” functions. A link function links the outcome variable to the linear predictor. An overview of link functions in R can be found here. This model unification allows you to build the model in the same way but changing the link function if the outcome variable is of a different type. It also shows that linear relationships are capable of expressing the data regardless of the nature of the response variable.

The modelling process of the GLM in R is very similar to the linear regression example with the addition of the link function:

This fits the same model as above stored in my_model but this time with the “Gaussian” link function. You can see that the model is identical to the one obtained with lm(...) above. This is because the gaussian link function of the GLM is identical to the linear model retrieved with lm(...). More in-depth relationships are out of the scope of this module but this tutorial gives a nice starting point.

my_glm


Call:  glm(formula = TOTALVICTIMS ~ NUMWEAPONS, family = gaussian, data = us_mass_shootings)

Coefficients:
(Intercept)   NUMWEAPONS
10.683        2.087

Degrees of Freedom: 70 Total (i.e. Null);  69 Residual
Null Deviance:      11260
Residual Deviance: 10560    AIC: 562.7

Similar to models built with the lm() function, the glm() offers additional model information:

#model statistics
summary(my_glm)


Call:
glm(formula = TOTALVICTIMS ~ NUMWEAPONS, family = gaussian, data = us_mass_shootings)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-14.462   -6.813   -4.770    1.601   50.971

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.6832     2.6074   4.097 0.000112 ***
NUMWEAPONS    2.0865     0.9806   2.128 0.036925 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 153.1111)

Null deviance: 11258  on 70  degrees of freedom
Residual deviance: 10565  on 69  degrees of freedom
AIC: 562.67

Number of Fisher Scoring iterations: 2

Using the GLM for different predictors

We can extend the GLM to cases where the outcome variable is binary (e.g. coded to 0 or 1) or represents counts (e.g. counts of victims). For the latter, for example, a Poisson regression might be more apt than a gaussian linear regression. We will cover these extensions in the next lecture and tutorial.