Tutorial 1, Probability, Statistics and Modelling 2, BSc Security and Crime Science, UCL
This tutorial has two goals:
You are expected to work through this R Notebook in the tutorial and we will assist you and outline concepts for the whole class if needed.
If you are encountering problems with programming problems you may find this guide on How to solve data science problems in R useful. This is targeted at the 3rd Data Science module (Advanced Crime Analysis) but the guide for getting help applies to other R problems too.
First, let’s use a dataset that is close to those that you might encounter in the SCS programme. The dataset was retieved from Data World and contains details on all mass shootings in the US between 1982 - 2018.
#Look at the data
head(us_mass_shootings)
CASE GENDER SHOOTINGTYPE RACE
1 Chattanooga military recruitment center Male Mass Middle Eastern
2 Charleston Church Shooting Male Mass White
3 Marysville-Pilchuck High School shooting Male Mass Native American
4 Alturas tribal shooting Female Mass Native American
5 Washington Navy Yard shooting Male Mass Black
6 Hialeah apartment shooting Male Mass Latino
LOCATION STATE DATE YEAR
1 Chattanooga, Tennessee Tennessee 7/16/2015 2015
2 Charleston, South Carolina South Carolina 6/17/2015 2015
3 Marysville, Washington Washington 10/24/2014 2014
4 Alturas, California California 2/20/2014 2014
5 Washington, D.C. D.C. 9/16/2013 2013
6 Hialeah, Florida Florida 7/26/2013 2013
SUMMARY
1 Kuwaiti-born Mohammod Youssuf Abdulazeez, 24, a naturalized US citizen, opened fire at a Naval reserve center, and then drove to a military recruitment office where he shot and killed four Marines and a Navy service member, and wounded a police officer and another military service member. He was then fatally shot in an exchange of gunfire with law enforcement officers responding to the attack.
2 Dylann Storm Roof, 21, shot and killed 9 people after opening fire at the Emanuel AME Church in Charleston, South Carolina. According to a roommate, he had allegedly been ‰ÛÏplanning something like that for six months."
3 Jaylen Fryberg, 15, using a .40-caliber Berretta, shot five \nstudents at Marysville High School, including two of his cousins and three friends, killing all but one. Fryberg arranged to meet them for lunch in the school cafeteria by text. Fryberg was reportedly well-liked at the school and there was not believed to be any ill-will between him and his victims. He committed suicide at the scene.
4 Cherie Lash Rhoades, 44, opened fire at the Cedarville Rancheria Tribal Office and Community Center, killing four and wounding two. After running out of ammunition, Rhoades grabbed a butcher knife and stabbed another person.
5 Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.
6 Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.
FATALITIES WOUNDED TOTALVICTIMS LOCATIONTYPE PRIORSIGNSOFMENTALILLNESS
1 5 2 7 Military Yes
2 9 1 10 Religious No
3 5 1 6 School No
4 4 2 6 Other No
5 12 8 20 Military Yes
6 7 0 7 Other Yes
MENTALHEALTHNOTES
1 Abdulazeez "had suffered for years from depression and possibly from bipolar disorder," according to a representative of the family. (NYT, July 20 2015)
2
3 Fryberg was well-liked and allegedly happy, but was also upset about a girl and had posted emotional social media messages. No definitive signs of mental health problems.
4
5 Had told Rhode Island police the prior month that he'd "heard voices"; had been undergoing mental health treatment with Veterans Affairs since August 2013.
6 His mother told authorities her son had been acting strangely and needed a psychiatric evaluation.
WEAPONSOBTAINEDLEGALLY WHEREWEAPONOBTAINED
1 Yes On the internet, via Armslist.com
2 Yes Shooter's Choice gun store in West Columbia, South Carolina
3 No Gun was his father's.
4 Unknown
5 Yes Sharpshooters Small Arms Range
6 Yes Florida Gun Center
TYPEOFWEAPONS
1 2 assault rifles; semiautomatic handgun
2 Handgun
3 Handgun
4 Two handguns and a butcher knife
5 Sawed-off shotgun, 2 boxes of shells; also a .45-caliber handgun taken from a security guard he shot at the scene.
6 9mm semi-automatic handgun
NUMWEAPONS ASSAULT WEAPONDETAILS
1 3 Yes AK-47, AR-15, and 30-round magazines; 9mm handgun
2 1 No .45-caliber Glock (model 41, with 13-round capacity magazine)
3 1 No Beretta .40-caliber handgun
4 2 No 9mm semi-automatic handgun
5 1 No Remington 870 Express 12-gauge shotgun; Beretta handgun
6 1 No Glock 17
SOURCES
1 http://www.reuters.com/article/2015/07/16/us-usa-shooting-tennessee-idUSKCN0PQ1WY20150716; http://www.nytimes.com/2015/07/18/us/chattanooga-gunmans-past-scoured-for-extremist-ties.html; http://www.wsj.com/articles/chattanooga-shooting-highlights-online-gun-sales-1437435518http://www.nytimes.com/2015/07/19/us/chattanooga-attacks-claim-a-5th-military-service-members-life.html; http://www.cnn.com/2015/07/17/us/tennessee-naval-reserve-shooting/; https://twitter.com/markfollman/status/622105627473477632; http://www.nytimes.com/2015/07/21/us/chattanooga-gunman-wrote-of-suicide-and-martyrdom-official-says.html
2 http://www.motherjones.com/politics/2015/06/dylann-roofs-attorney; http://www.newsweek.com/report-nine-shot-charleston-south-carolina-church-shooting-344235; http://www.cnn.com/2015/06/19/us/charleston-church-shooting-suspect/; http://www.motherjones.com/politics/2015/06/9-people-dead-mass-shooting-south-carolina-church; http://www.nbcnews.com/storyline/charleston-church-shooting/fbi-says-dyann-roof-should-not-have-been-sold-gun-n390056
3 http://www.seattletimes.com/seattle-news/fourth-marysville-shooting-victim-dies-as-another-is-laid-to-rest/ http://www.mercurynews.com/crime-courts/ci_26814211/jaylen-fryberg-included-tributes-washington-school-shooting-victims
4 http://www.sacbee.com/news/local/crime/article2603350.html; http://www.csmonitor.com/USA/2014/0223/Alturas-tribal-shooting-Was-embezzlement-eviction-behind-family-revenge-video; http://www.cbsnews.com/news/4-dead-2-injured-at-american-indian-tribal-office-in-northern-california/; http://nativenewsonline.net/currents/former-tribal-chairperson-custody-mass-murder-cedarville-rancheria-tribal-headquarters/
5 http://www.nytimes.com/2013/09/18/us/state-law-stopped-gunman-from-buying-rifle-officials-say.html; http://edition.cnn.com/2013/09/17/us/navy-yard-shooting-military-contractors; http://www.fbi.gov/washingtondc/press-releases/2013/law-enforcement-shares-findings-of-the-investigation-into-the-washington-navy-yard-shootings/index.html
6 http://www.miamiherald.com/2013/07/27/v-print/3526078/a-look-at-the-victims-in-the-hialeah.html http://www.cbsnews.com/8301-201_162-57595796/pedro-vargas-idd-as-gunman-behind-deadly-rampage-in-hialeah-florida/ http://www.miamiherald.com/2013/07/28/3528362/little-about-pedro-vargas-life.html
MENTALHEALTHSOURCES
1
2
3 http://www.newyorker.com/science/maria-konnikova/almost-link-mental-health-gun-violence
4
5 http://bigstory.ap.org/article/13-killed-washington-navy-yard-shooting-rampage
6 http://www.miamiherald.com/2013/08/03/v-print/3539629/hialeah-killer-showed-signs-of.html
LATITUDE LONGITUDE
1 35.04716 -85.31182
2 32.78839 -79.93314
3 48.05082 -122.17692
4 41.48710 -120.54224
5 38.87498 -76.99453
6 25.86701 -80.29147
You have now created a data.frame called us_mass_shootings
that you can query nicely with R. You can use the concepts from the R in 12 Steps and Getting ready for R on that data.frame.
In addition, if you feel you need a bit more guidance on how to deal with data.frames, have a look at the 17 Steps to investigate R dataframes tutorial.
#If you want to run the data.frames tutorial, you can type your code here.
Now let’s get some basic descriptive statastics: the first step when analysing a dataset is to describe how the data were generated/retrieved (here: using an existing dataset) and describing the dataset through descriptive statistics.
#frequency counts of gender by race
table(us_mass_shootings$GENDER, us_mass_shootings$RACE)
Asian Black Latino Middle Eastern Native American Other White
Female 0 0 0 0 1 0 1
Male 6 11 4 1 2 1 44
TASK
summary()
and str()
function. Use these on the us_mass_shootings
dataset:summary(us_mass_shootings)
CASE GENDER SHOOTINGTYPE RACE
101 California Street shootings : 1 Female: 2 Mass :63 Asian : 6
Accent Signage Systems shooting : 1 Male :69 Spree: 8 Black :11
Air Force base shooting : 1 Latino : 4
Alturas tribal shooting : 1 Middle Eastern : 1
Amish school shooting : 1 Native American: 3
Atlanta day trading spree killings: 1 Other : 1
(Other) :65 White :45
LOCATION STATE DATE YEAR
Aurora, Colorado : 2 California:11 1/17/1989 : 1 Min. :1982
Seattle, Washington : 2 Florida : 6 1/30/2006 : 1 1st Qu.:1994
Aiken, South Carolina: 1 Washington: 6 1/8/2011 : 1 Median :2005
Alturas, California : 1 Texas : 5 10/14/2011: 1 Mean :2002
Atlanta, Georgia : 1 New York : 4 10/15/1992: 1 3rd Qu.:2011
Binghamton, New York : 1 Colorado : 3 10/16/1991: 1 Max. :2015
(Other) :63 (Other) :36 (Other) :65
SUMMARY
Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police. : 1
Abdelkrim Belachheb, 39, opened fire at an upscale nightclub after a woman rejected his advances. He was later arrested. : 1
Adam Lanza, 20, shot his mother dead at their home then drove to Sandy Hook Elementary school. He forced his way inside and opened fire, killing 20 children and six adults before committing suicide. : 1
After he was expelled for having a gun in his locker, Kipland P. Kinkel, 15, a freshman at Thurston High, went on a shooting spree, killing his parents at home and two students at school. Five classmates wrestled Kipland to the ground before he was arrested. : 1
Andrew Engeldinger, 36, upon learning he was being fired, went on a shooting rampage, killing the business owner, three fellow employees, and a UPS driver. He then killed himself. : 1
Army psychiatrist Nidal Malik Hasan, 39, opened fire on an Army base in an attack linked to Islamist extremism. Hasan was injured during the attack and later arrested. : 1
(Other) :65
FATALITIES WOUNDED TOTALVICTIMS LOCATIONTYPE
Min. : 4.000 Min. : 0.000 Min. : 5.00 Military : 4
1st Qu.: 5.000 1st Qu.: 1.000 1st Qu.: 7.00 Other :30
Median : 6.000 Median : 3.000 Median :11.00 Religious: 4
Mean : 8.042 Mean : 7.225 Mean :15.27 School :13
3rd Qu.: 8.500 3rd Qu.: 8.000 3rd Qu.:18.50 Workplace:20
Max. :33.000 Max. :58.000 Max. :70.00
PRIORSIGNSOFMENTALILLNESS
No :22
Yes:49
MENTALHEALTHNOTES
: 4
A district court ruled Cho was "an imminent danger" to himself and others as a result of mental illness two years earlier, and directed Cho to seek treatment. : 1
A former instructor at Oikos described him as "mentally unstable" and "paranoid." : 1
A psychiatrist, testifying for the prosecution,said he suffered from schizophrenia. : 1
Abdulazeez "had suffered for years from depression and possibly from bipolar disorder," according to a representative of the family. (NYT, July 20 2015) : 1
According to one relative, he was violent and had the mental capacity of a child. (But accounts from others did not indicate this about the shooter.) : 1
(Other) :62
WEAPONSOBTAINEDLEGALLY
: 2
No :13
Yes:56
WHEREWEAPONOBTAINED
Unknown :16
Purchased from an individual : 2
AK-47 purchased from Tilford's Gun Sales in Louisville, Ky. : 1
Assembled a rifle out of component parts. : 1
B&B Gun Sales in Orange County, Calif. : 1
Bull's Eye Shooter Supply in Tacoma, Wash. : 1
(Other) :49
TYPEOFWEAPONS NUMWEAPONS ASSAULT
One semiautomatic handgun :18 Min. :1.000 No :55
Two semiautomatic handguns : 6 1st Qu.:1.000 Yes:16
One rifle (assault) : 4 Median :2.000
One semiautomatic handgun, one revolver: 4 Mean :2.197
Handgun : 2 3rd Qu.:3.000
One revolver, one shotgun : 2 Max. :9.000
(Other) :35
WEAPONDETAILS
.45-caliber semiautomatic handgun : 2
9mm semiautomatic handgun : 2
.22-caliber rifle; two 12-gauge shotguns : 1
.22-caliber Ruger sawed-off semiautomatic rifle : 1
.22-caliber sawed-off rifle; 12-gauge pump-action shotgun : 1
.22-caliber, two .45-caliber Colt Model 1911-A1 semiautomatic handguns: 1
(Other) :63
SOURCES
http://archives.starbulletin.com/2000/06/02/news/story2.html; http://www.vpc.org/studies/wgun991102.htm : 1
http://articles.chicagotribune.com/2001-02-07/news/0102070122_1_navistar-gun-law-hunting-rifle; http://www.vpc.org/studies/wgun010205.htm : 1
http://articles.latimes.com/1987-04-25/news/mn-990_1_palm-bay-police : 1
http://articles.latimes.com/1988-02-18/news/mn-43514_1_mr-farley-richard-farley-sunnyvale-public-safety-department; http://news.google.com/newspapers?id=uqxAAAAAIBAJ&sjid=sDIHAAAAIBAJ&pg=2425,5898911&dq=richard+farley+shooting&hl=en; http://news.google.com/newspapers?id=FmYzAAAAIBAJ&sjid=WzIHAAAAIBAJ&pg=7028,576811&dq=richard+farley+shooting&hl=en: 1
http://articles.latimes.com/1993-07-03/news/mn-10731_1_mortgage-business/2; http://www.motherjones.com/print/16316; http://www.vpc.org/studies/wgun930701.htm : 1
http://articles.latimes.com/1993-08-08/news/mn-21847_1_kills-army-french; http://news.google.com/newspapers?id=0AhPAAAAIBAJ&sjid=jhUEAAAAIBAJ&pg=6505,2482529&dq=kenneth+junior+french&hl=en : 1
(Other) :65
MENTALHEALTHSOURCES
: 4
(Supreme Court of Florida Document) http://www.murderpedia.org/male.C/images/cruse_william_b/op-74656.pdf : 1
http://abcnews.go.com/US/story?id=3052278&page=1 : 1
http://archives.starbulletin.com/2000/06/02/news/story2.html : 1
http://articles.chicagotribune.com/2001-02-07/news/0102070122_1_navistar-gun-law-hunting-rifle : 1
http://articles.cnn.com/2000-01-21/us/kinkel.revisited_1_kip-kinkel-thurston-high-school-oregon-school-shooting?_s=PM:US: 1
(Other) :62
LATITUDE LONGITUDE
Min. :21.33 Min. :-157.85
1st Qu.:33.75 1st Qu.:-117.75
Median :38.58 Median : -90.67
Mean :37.82 Mean : -96.95
3rd Qu.:41.92 3rd Qu.: -81.15
Max. :48.05 Max. : -71.07
table(us_mass_shootings$RACE)
Asian Black Latino Middle Eastern Native American
6 11 4 1 3
Other White
1 45
table(us_mass_shootings$NUMWEAPONS, us_mass_shootings$GENDER)
Female Male
1 1 29
2 1 17
3 0 12
4 0 7
5 0 2
7 0 1
9 0 1
Tip: you can arrange the variables in desired order to change the display (not the data).
In which year did females commit mass shootings and of what race where they?
table(us_mass_shootings$YEAR, us_mass_shootings$GENDER, us_mass_shootings$RACE)
, , = Asian
Female Male
1982 0 0
1984 0 0
1986 0 0
1987 0 0
1988 0 0
1989 0 0
1990 0 0
1991 0 1
1992 0 0
1993 0 0
1994 0 0
1995 0 0
1996 0 0
1997 0 0
1998 0 0
1999 0 1
2000 0 0
2001 0 0
2003 0 0
2004 0 0
2005 0 0
2006 0 0
2007 0 1
2008 0 0
2009 0 1
2010 0 0
2011 0 0
2012 0 2
2013 0 0
2014 0 0
2015 0 0
, , = Black
Female Male
1982 0 0
1984 0 0
1986 0 0
1987 0 0
1988 0 0
1989 0 0
1990 0 1
1991 0 0
1992 0 0
1993 0 2
1994 0 0
1995 0 0
1996 0 1
1997 0 1
1998 0 0
1999 0 0
2000 0 0
2001 0 1
2003 0 0
2004 0 0
2005 0 0
2006 0 0
2007 0 0
2008 0 1
2009 0 1
2010 0 1
2011 0 0
2012 0 0
2013 0 2
2014 0 0
2015 0 0
, , = Latino
Female Male
1982 0 0
1984 0 0
1986 0 0
1987 0 0
1988 0 0
1989 0 0
1990 0 0
1991 0 0
1992 0 0
1993 0 0
1994 0 0
1995 0 0
1996 0 0
1997 0 1
1998 0 0
1999 0 1
2000 0 0
2001 0 0
2003 0 0
2004 0 0
2005 0 0
2006 0 0
2007 0 0
2008 0 0
2009 0 0
2010 0 0
2011 0 1
2012 0 0
2013 0 1
2014 0 0
2015 0 0
, , = Middle Eastern
Female Male
1982 0 0
1984 0 0
1986 0 0
1987 0 0
1988 0 0
1989 0 0
1990 0 0
1991 0 0
1992 0 0
1993 0 0
1994 0 0
1995 0 0
1996 0 0
1997 0 0
1998 0 0
1999 0 0
2000 0 0
2001 0 0
2003 0 0
2004 0 0
2005 0 0
2006 0 0
2007 0 0
2008 0 0
2009 0 0
2010 0 0
2011 0 0
2012 0 0
2013 0 0
2014 0 0
2015 0 1
, , = Native American
Female Male
1982 0 0
1984 0 0
1986 0 0
1987 0 0
1988 0 0
1989 0 0
1990 0 0
1991 0 0
1992 0 0
1993 0 0
1994 0 0
1995 0 0
1996 0 0
1997 0 0
1998 0 0
1999 0 0
2000 0 0
2001 0 0
2003 0 0
2004 0 0
2005 0 1
2006 0 0
2007 0 0
2008 0 0
2009 0 0
2010 0 0
2011 0 0
2012 0 0
2013 0 0
2014 1 1
2015 0 0
, , = Other
Female Male
1982 0 0
1984 0 0
1986 0 0
1987 0 0
1988 0 0
1989 0 0
1990 0 0
1991 0 0
1992 0 0
1993 0 0
1994 0 0
1995 0 0
1996 0 0
1997 0 0
1998 0 0
1999 0 0
2000 0 0
2001 0 0
2003 0 0
2004 0 0
2005 0 0
2006 0 0
2007 0 0
2008 0 0
2009 0 1
2010 0 0
2011 0 0
2012 0 0
2013 0 0
2014 0 0
2015 0 0
, , = White
Female Male
1982 0 1
1984 0 2
1986 0 1
1987 0 1
1988 0 1
1989 0 2
1990 0 0
1991 0 2
1992 0 2
1993 0 2
1994 0 1
1995 0 1
1996 0 0
1997 0 0
1998 0 3
1999 0 3
2000 0 1
2001 0 0
2003 0 1
2004 0 1
2005 0 1
2006 1 2
2007 0 3
2008 0 2
2009 0 1
2010 0 0
2011 0 2
2012 0 5
2013 0 2
2014 0 0
2015 0 1
Advanced task
Just as you can ‘subset’ frequency counts by additional variables, you can also calculate descriptive statistics “by” another one. Suppose you wanted to calculate the mean number of wounded victims by the gender of the shooter:
tapply(X = us_mass_shootings$WOUNDED
, INDEX = us_mass_shootings$GENDER
, FUN = mean
)
Female Male
1.000000 7.405797
What is the mean (and standard deviation) number of fatalities for those who had prior mental health issues vs those who didn’t have prior mental health issues?
tapply(X = us_mass_shootings$FATALITIES
, INDEX = us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
, FUN = sd
)
No Yes
4.267059 5.710713
When you have two categorical variables (e.g. scored as yes/no, ill/healthy, offended/not offended, …) you might be interested in the association of these variables. In PSM1 you have covered the Chi-Square test of independence of two variables. In R, you’d run the test as follows:
#let's look at the association between whether the weapon was obtained legally and whether ther were signs of prior mental illness
table(us_mass_shootings$WEAPONSOBTAINEDLEGALLY, us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))
PRIORSIGNSOFMENTALILLNESS
WEAPONSOBTAINEDLEGALLY No Yes
2 0
No 5 8
Yes 15 41
You notice that there are two cases where there is no label for the WEAPONSOBTAINEDLEGALLY variable. We can assess this further:
levels(us_mass_shootings$WEAPONSOBTAINEDLEGALLY)
[1] "" "No" "Yes"
We see that there are three categorical levels in the data here. Since we do not want to impose any speculative values on the data, we exclude these cases:
table(us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY, us_mass_shootings_clean$PRIORSIGNSOFMENTALILLNESS
, dnn = c('WEAPONSOBTAINEDLEGALLY', 'PRIORSIGNSOFMENTALILLNESS'))
PRIORSIGNSOFMENTALILLNESS
WEAPONSOBTAINEDLEGALLY No Yes
No 5 8
Yes 15 41
Now, to test for an association, we run the chisq.test(...)
function on the table of frequency counts:
chisq.test(freq_table)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: freq_table
X-squared = 0.24665, df = 1, p-value = 0.6194
Note that this is an illustration only here and that you need to check the assumptions (e.g. a minimum count per cell) first.
TASK
Assess with a 99% confidence interval whether there’s an association between how the shooter obtained the weapon and gender.
chisq.test(freq_table)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: freq_table
X-squared = 2.4506e-31, df = 1, p-value = 1
From your Chisquare-Test, retrieve the expected counts under the null hypothesis.
Hint: ?chisq.test
.
chisq$expected # index from the object
GENDER
WEAPONSOBTAINEDLEGALLY Female Male
No 0.1884058 12.81159
Yes 0.8115942 55.18841
Another aspect covered in the PSM1 module was hypothesis testing with t-tests. R has most statistical functions available by default and you can run a t-test by using the t.test(...)
function:
t.test(us_mass_shootings$FATALITIES ~ us_mass_shootings$PRIORSIGNSOFMENTALILLNESS)
Welch Two Sample t-test
data: us_mass_shootings$FATALITIES by us_mass_shootings$PRIORSIGNSOFMENTALILLNESS
t = -1.5593, df = 53.281, p-value = 0.1248
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.3560172 0.5452565
sample estimates:
mean in group No mean in group Yes
6.727273 8.632653
Note that there are additional arguments in the t.test(...)
function (e.g. for running a paired-samples t-test):
#have a look at the help file to see all arguments
?t.test
TASK
Run a t-test to assess if there’s a significant difference between the number of wounded depending on whether the weapon was obtained legally or illegally.
t.test(us_mass_shootings_clean$WOUNDED ~ us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY)
Welch Two Sample t-test
data: us_mass_shootings_clean$WOUNDED by us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY
t = -0.28635, df = 21.753, p-value = 0.7773
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.321386 4.788419
sample estimates:
mean in group No mean in group Yes
6.769231 7.535714
No run that code with a 99% confidence interval and equal variance assumed. Check ?t.test
for the specification of these parameters.
t.test(us_mass_shootings_clean$WOUNDED ~ us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY
, var.equal = T
, conf.level = 0.99)
Two Sample t-test
data: us_mass_shootings_clean$WOUNDED by us_mass_shootings_clean$WEAPONSOBTAINEDLEGALLY
t = -0.24841, df = 67, p-value = 0.8046
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
-8.947030 7.414063
sample estimates:
mean in group No mean in group Yes
6.769231 7.535714
Finally, sometimes you are interested in looking at the correlation between two continuous variables. In R, there are three ways to do this:
As with all R functions, there are several parameters that allow you to adjust the function (e.g. for special cases of correlation):
?cor.test
TASK
What is the 99% confidence interval of the correlation between the number of weapons used and the total number of victims.
cor.test(us_mass_shootings$NUMWEAPONS, us_mass_shootings$TOTALVICTIMS, conf.level = 0.99)
Pearson's product-moment correlation
data: us_mass_shootings$NUMWEAPONS and us_mass_shootings$TOTALVICTIMS
t = 2.1278, df = 69, p-value = 0.03692
alternative hypothesis: true correlation is not equal to 0
99 percent confidence interval:
-0.05885602 0.51227271
sample estimates:
cor
0.2481506
The generalized linear model will form the basis of the first five weeks of this module.
The GLM is a family of models that extend (= generalize) the linear regression model to allow for more flexibility in the response variables (= dependent variable).
In the simplest sense, a linear regression model aims to predict a continuous outcome (e.g. income) through a predictor variable (e.g. age or gender). While the dependent variable (also called: outcome variable, response variable, target variable) is strictly continuous, the predictors can take both contionuous (age) and categorical (e.g. female vs male) form.
The ingredients of a regression model:
Y
X
a
(= the value of Y
if X
is 0
)b
(= the change in Y
for every unit change in X
)E
(= the difference between the predicted value and the observed value)We will cover each of the above in detail in the lecture
The model can be written down as: Y = a + b*X + E
Read this model as: Y
explained the intercept a
plus the predictor X
multiplied with the slope b
plus the error term E
.
Suppose you are predicting the income of a person through their age:
income = a + b*age + E
Once fitted to the data, the regression model could look like this:
income = 10000 + 2500*age
Note that the error term E
is not explicitly mentioned here but can be inferred (see below).
This model allows you to replace age
with the actual age of someone from your sample to estimate their income:
income = 10000 + 2500*50 = 10000 + 12500 = 135000
You can use R to calculate the model parameters (intercept a
and slope b
) by specifying only the model formula:
income ~ age
Read this as: income explained through age. Note that the =
sign is replaced by a tilde ~
. R uses this notation to build models.
Example with mass shootings
Suppose you wanted to run a regression model on the shooting data. Specifically, suppose you wanted to predict the no. of victims through the number of weapons used.
Conceptually, you’d want this model:
victims ~ number_of_weapons
We can easily take this intuitive notion to R:
Now R has fitted the model you specified on the data. You can inspect that model:
#the full model and its parameters
my_model
Call:
lm(formula = TOTALVICTIMS ~ NUMWEAPONS, data = us_mass_shootings)
Coefficients:
(Intercept) NUMWEAPONS
10.683 2.087
Based on these values, you can construct the model equation:
victims = 10.683 + 2.087*number_of_weapons
Thus, if the shooter used 3 weapons, your model would predict almost 17 victims:
10.683 + 2.087*3
[1] 16.944
The model has other information available too. The most important ones are:
E
, defined as: outcome variable - fitted value
).#residuals
my_model$residuals
1 2 3 4 5 6 7
-9.9426853 -2.7696798 -6.7696798 -8.8561826 7.2303202 -5.7696798 -5.8561826
8 9 10 11 12 13 14
-9.8561826 -5.7696798 10.9708120 -4.7696798 -2.7696798 50.9708120 -7.8561826
15 16 17 18 19 20 21
-2.7696798 -7.7696798 -7.9426853 -4.9426853 6.2303202 -3.8561826 -7.7696798
22 23 24 25 26 27 28
30.2303202 3.1438174 -3.8561826 -5.7696798 7.9708120 -6.8561826 0.2303202
29 30 31 32 33 34 35
-5.7696798 41.1438174 -4.8561826 -5.9426853 -10.0291880 -4.7696798 -1.9426853
36 37 38 39 40 41 42
-1.7696798 -0.7696798 -6.1156908 -10.0291880 -9.9426853 -6.8561826 -5.7696798
43 44 45 46 47 48 49
0.1438174 2.9708120 19.9708120 12.0573147 -14.4617018 -6.7696798 -5.7696798
50 51 52 53 54 55 56
-5.7696798 -7.8561826 -8.8561826 15.2303202 -7.7696798 12.2303202 -4.9426853
57 58 59 60 61 62 63
-1.9426853 -7.7696798 -0.8561826 -2.7696798 -5.7696798 29.1438174 -0.8561826
64 65 66 67 68 69 70
-0.1156908 20.1438174 -14.2886963 3.0573147 4.0573147 24.0573147 -5.7696798
71
-1.7696798
The residuals show you that, for example, for the first observation, the model overestimated the number of victims by 9.94.
#fitted values
my_model$fitted.values
1 2 3 4 5 6 7 8 9
16.94269 12.76968 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618 12.76968
10 11 12 13 14 15 16 17 18
19.02919 12.76968 12.76968 19.02919 14.85618 12.76968 12.76968 16.94269 16.94269
19 20 21 22 23 24 25 26 27
12.76968 14.85618 12.76968 12.76968 14.85618 14.85618 12.76968 19.02919 14.85618
28 29 30 31 32 33 34 35 36
12.76968 12.76968 14.85618 14.85618 16.94269 19.02919 12.76968 16.94269 12.76968
37 38 39 40 41 42 43 44 45
12.76968 21.11569 19.02919 16.94269 14.85618 12.76968 14.85618 19.02919 19.02919
46 47 48 49 50 51 52 53 54
16.94269 29.46170 12.76968 12.76968 12.76968 14.85618 14.85618 12.76968 12.76968
55 56 57 58 59 60 61 62 63
12.76968 16.94269 16.94269 12.76968 14.85618 12.76968 12.76968 14.85618 14.85618
64 65 66 67 68 69 70 71
21.11569 14.85618 25.28870 16.94269 16.94269 16.94269 12.76968 12.76968
Here you see that the fitted value for the first observation was 16.94, so we can infer that the observed (actual) value for the first observation must be: fitted value + residual
#checked against the actual data
my_model$model[1, ]
TOTALVICTIMS NUMWEAPONS
1 7 3
If we want to get information on the significance of predictors and the fit of the model, we’ll look at the summary()
of my_model
:
#model statistics
summary(my_model)
Call:
lm(formula = TOTALVICTIMS ~ NUMWEAPONS, data = us_mass_shootings)
Residuals:
Min 1Q Median 3Q Max
-14.462 -6.813 -4.770 1.601 50.971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.6832 2.6074 4.097 0.000112 ***
NUMWEAPONS 2.0865 0.9806 2.128 0.036925 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.37 on 69 degrees of freedom
Multiple R-squared: 0.06158, Adjusted R-squared: 0.04798
F-statistic: 4.528 on 1 and 69 DF, p-value: 0.03692
You can see:
_p_ < .05
levelY
that is explained by the model.In our case, we see that the model only explained 6.158% of the variance in the number of victims.
Multiple predictors
Of course, a single predictor rarely explains a lot of variation in an outcome variable. So we might want to include a new variable (e.g. the categorical variable whether there were signs of prior mental health issues or not). The conceptual model would the be:
victims ~ number_of_weapons + mental_health
In R:
my_model_2 = lm(formula = TOTALVICTIMS ~ NUMWEAPONS + PRIORSIGNSOFMENTALILLNESS
, data = us_mass_shootings)
summary(my_model_2)
Call:
lm(formula = TOTALVICTIMS ~ NUMWEAPONS + PRIORSIGNSOFMENTALILLNESS,
data = us_mass_shootings)
Residuals:
Min 1Q Median 3Q Max
-16.766 -7.474 -2.816 1.966 48.709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7765 3.3592 1.720 0.0901 .
NUMWEAPONS 2.1583 0.9541 2.262 0.0269 *
PRIORSIGNSOFMENTALILLNESSYes 6.8810 3.0900 2.227 0.0293 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.03 on 68 degrees of freedom
Multiple R-squared: 0.1254, Adjusted R-squared: 0.09964
F-statistic: 4.873 on 2 and 68 DF, p-value: 0.01052
You can see that the model fit improved considerably (from 6.158% to 12.54%) and that the new model equation is:
victims ~ 5.7765 + 2.1583*number_of_weapons + 6.881*mental_health
Note that we now have two predictors which implies that we also have two slopes (one for each predictor).
You can see that the model summary has appended the word “Yes” to the PRIORSIGNSOFMENTALILLNESS
predictor. This is to reflect that the slope applies to the case that the variable PRIORSIGNSOFMENTALILLNESS
is equal to Yes
. This means that the predicted number of victims is 6.8810 more if there were signs of mental health issues tha if there were no previous mental health issues.
When there are mutliple predictors, the simple regression model becomes a multiple regression model.
Task:
Build your own model that predicts the number if wounded victims based on the shooter’s gender and whether the weapons were obtained legally.
my_model_3 = lm(formula = WOUNDED ~ GENDER + WEAPONSOBTAINEDLEGALLY
, data = us_mass_shootings_clean)
summary(my_model_3)
Call:
lm(formula = WOUNDED ~ GENDER + WEAPONSOBTAINEDLEGALLY, data = us_mass_shootings_clean)
Residuals:
Min 1Q Median 3Q Max
-7.673 -5.769 -3.673 0.327 50.327
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.9035 10.5220 -0.086 0.932
GENDERMale 7.6727 10.1457 0.756 0.452
WEAPONSOBTAINEDLEGALLYYes 0.9035 3.1008 0.291 0.772
Residual standard error: 10.05 on 66 degrees of freedom
Multiple R-squared: 0.009503, Adjusted R-squared: -0.02051
F-statistic: 0.3166 on 2 and 66 DF, p-value: 0.7297
Now that we covered some basics of linear regression (more in the lecture and homework), we can see why the generalized linear model is so useful. Essentially, it is a means to unify similar models in one.
Suppose you’d want to predict a categorical variable (e.g. “predict” whether the weapon was obtained legally based on the shooter’s gender): the problem is that this conflicts with the base linear regression model which assumes continuous data.
The GLM can handle different outcome variables through so-called “link” functions. A link function links the outcome variable to the linear predictor. An overview of link functions in R can be found here. This model unification allows you to build the model in the same way but changing the link function if the outcome variable is of a different type. It also shows that linear relationships are capable of expressing the data regardless of the nature of the response variable.
The modelling process of the GLM in R is very similar to the linear regression example with the addition of the link function:
This fits the same model as above stored in my_model
but this time with the “Gaussian” link function. You can see that the model is identical to the one obtained with lm(...)
above. This is because the gaussian link function of the GLM is identical to the linear model retrieved with lm(...)
. More in-depth relationships are out of the scope of this module but this tutorial gives a nice starting point.
my_glm
Call: glm(formula = TOTALVICTIMS ~ NUMWEAPONS, family = gaussian, data = us_mass_shootings)
Coefficients:
(Intercept) NUMWEAPONS
10.683 2.087
Degrees of Freedom: 70 Total (i.e. Null); 69 Residual
Null Deviance: 11260
Residual Deviance: 10560 AIC: 562.7
Similar to models built with the lm()
function, the glm()
offers additional model information:
#model statistics
summary(my_glm)
Call:
glm(formula = TOTALVICTIMS ~ NUMWEAPONS, family = gaussian, data = us_mass_shootings)
Deviance Residuals:
Min 1Q Median 3Q Max
-14.462 -6.813 -4.770 1.601 50.971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.6832 2.6074 4.097 0.000112 ***
NUMWEAPONS 2.0865 0.9806 2.128 0.036925 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 153.1111)
Null deviance: 11258 on 70 degrees of freedom
Residual deviance: 10565 on 69 degrees of freedom
AIC: 562.67
Number of Fisher Scoring iterations: 2
Using the GLM for different predictors
We can extend the GLM to cases where the outcome variable is binary (e.g. coded to 0 or 1) or represents counts (e.g. counts of victims). For the latter, for example, a Poisson regression might be more apt than a gaussian linear regression. We will cover these extensions in the next lecture and tutorial.