SOLUTIONS
Tutorial 2, Probability, Statistics and Modelling 2, BSc Security and Crime Science, UCL
In this tutorial, you will:
Read the csv file located at data/tutorial2/black_friday/sample_black_friday.csv
as a dataframe, and name it blackfriday
(Details on the data).
r
r blackfriday = read.csv(‘data/tutorial2/blackfriday/sample_black_friday.csv’ , header=T)
cannot open file 'data/tutorial2/blackfriday/sample_black_friday.csv': No such file or directoryError in file(file, \rt\) : cannot open the connection
Familiarise yourself with the dataset:
r
r dim(blackfriday)
[1] 8000 5
Build a linear model that predicts the purchase size based on gender. Use the glm
function to do this.
#your code
lin_model = glm(Purchase ~ Gender
, data=blackfriday
, family = gaussian)
Assess the model fit using a “fit” metric that tells you how far on average the model mis-estimated the purchase size.
r
r summary(lin_model)
Call:
glm(formula = Purchase ~ Gender, family = gaussian, data = blackfriday)
Deviance Residuals:
Min 1Q Median 3Q Max
-9138 -3550 -1236 2757 14996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8958.6 113.6 78.85 < 2e-16 ***
GenderM 552.2 130.2 4.24 2.26e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 24657621)
Null deviance: 1.9766e+11 on 7999 degrees of freedom
Residual deviance: 1.9721e+11 on 7998 degrees of freedom
AIC: 158872
Number of Fisher Scoring iterations: 2
r
r #your code mean(abs(lin_model$residuals))
[1] 4032.133
Now build two competing models:
Marital_Status
Marital_Status
and Gender
#your code
lin_model2 = glm(Purchase ~ Gender + Marital_Status
, data=blackfriday
, family = gaussian)
lin_model3 = glm(Purchase ~ Gender*Marital_Status
, data=blackfriday
, family = gaussian)
Now compare the sum of squared residuals of all three models:
r
r sum(lin_model2$residuals^2)
[1] 197203256070
Now compare the three models statistically. What are your conclusions?
Hint: make sure to check the documentation for the ?anova function. You want to use the “F” test.
r
r anova(lin_model, lin_model3, test = ‘F’)
Analysis of Deviance Table
Model 1: Purchase ~ Gender
Model 2: Purchase ~ Gender * Marital_Status
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 7998 1.9721e+11
2 7996 1.9718e+11 2 31755201 0.6439 0.5253
Load the dataframe called stop_search_met
from the file data/tutorial2/stop_and_search/stop_and_search_sample.RData
.
#your code
load('./data/tutorial2/stop_and_search/stop_and_search_sample.RData')
Familiarise yourself with the dataset:
r
r table(stop_search_met$Outcome)
Local resolution Nothing found - no further action
4 246
Offender cautioned Offender given drugs possession warning
0 30
Offender given penalty notice Suspect arrested
7 55
Suspect summonsed to court A no further action disposal
1 3331
Arrest Caution (simple or conditional)
766 2
Community resolution Khat or Cannabis warning
272 150
Penalty Notice for Disorder Summons / charged by post
84 52
Subset the data to contain only the two levels of the variable Outcome
“Arrest” and “A no further action disposal”.
#your code
stop_search_met_subset = stop_search_met[stop_search_met$Outcome == 'Arrest' | stop_search_met$Outcome == 'A no further action disposal', ]
Now build a logistic regression model on the Outcome
variable as outcome variable modelled through the gender and age of the suspect.
#your code
log_model = glm(Outcome ~ Gender + Age.range
, data=stop_search_met_subset
, family = binomial)
What are your conclusions about the effects of either predictor variable?
r
r summary(log_model)
Call:
glm(formula = Outcome ~ Gender + Age.range, family = binomial,
data = stop_search_met_subset)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3168 -0.7083 -0.6453 -0.2635 2.5993
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7666 0.4242 -4.164 3.12e-05 ***
GenderFemale -1.6497 0.5220 -3.160 0.00158 **
GenderMale -1.5768 0.4907 -3.214 0.00131 **
GenderOther -12.8880 324.7441 -0.040 0.96834
Age.range10-17 1.6773 0.3413 4.915 8.89e-07 ***
Age.range18-24 1.8802 0.3341 5.627 1.83e-08 ***
Age.range25-34 2.0886 0.3378 6.184 6.26e-10 ***
Age.rangeover 34 2.2689 0.3412 6.650 2.92e-11 ***
Age.rangeunder 10 -9.2226 324.7439 -0.028 0.97734
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3243.7 on 3368 degrees of freedom
Residual deviance: 3160.5 on 3360 degrees of freedom
(728 observations deleted due to missingness)
AIC: 3178.5
Number of Fisher Scoring iterations: 11
r
r exp(coefficients(log_model))
(Intercept) GenderFemale GenderMale GenderOther
1.709049e-01 1.921094e-01 2.066348e-01 2.528104e-06
Age.range10-17 Age.range18-24 Age.range25-34 Age.rangeover 34
5.350969e+00 6.554912e+00 8.073816e+00 9.668312e+00
Age.rangeunder 10
9.878027e-05
Expand the model and include the predictor variable Officer.defined.ethnicity
.
#your code
log_model2 = glm(Outcome ~ Gender + Age.range + Officer.defined.ethnicity
, data=stop_search_met_subset
, family = binomial)
How does this affect the model fit?
r
r #your code summary(log_model2)
Call:
glm(formula = Outcome ~ Gender + Age.range + Officer.defined.ethnicity,
family = binomial, data = stop_search_met_subset)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3676 -0.6945 -0.6365 -0.2921 2.6724
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.77400 0.44453 -3.991 6.59e-05 ***
GenderFemale -1.66580 0.63292 -2.632 0.00849 **
GenderMale -1.52112 0.60974 -2.495 0.01261 *
GenderOther -12.59363 324.74430 -0.039 0.96907
Age.range10-17 1.65168 0.34526 4.784 1.72e-06 ***
Age.range18-24 1.87722 0.33796 5.555 2.78e-08 ***
Age.range25-34 2.04890 0.34231 5.986 2.16e-09 ***
Age.rangeover 34 2.19814 0.34616 6.350 2.15e-10 ***
Age.rangeunder 10 -9.21793 324.74387 -0.028 0.97735
Officer.defined.ethnicityAsian -0.24733 0.55241 -0.448 0.65434
Officer.defined.ethnicityBlack -0.05301 0.54234 -0.098 0.92213
Officer.defined.ethnicityOther -0.42799 0.61205 -0.699 0.48438
Officer.defined.ethnicityWhite 0.16182 0.54351 0.298 0.76590
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3243.7 on 3368 degrees of freedom
Residual deviance: 3148.4 on 3356 degrees of freedom
(728 observations deleted due to missingness)
AIC: 3174.4
Number of Fisher Scoring iterations: 11
Examine the variable Type
of the original dataset:
r
r #your code table(stop_search_met$Type)
Person and Vehicle search Person search Vehicle search
1228 3702 70
Now exclude the level Vehicle search
and build a logistic regression model on the two remaining levels. Try to answer the question whether race or gender affected the type of stop and search.
#your code
stop_search_met_subset_2 = stop_search_met[stop_search_met$Type != 'Vehicle search', ]
r
r log_model3 = glm(Type ~ Gender*Officer.defined.ethnicity , data=stop_search_met_subset_2 , family = binomial)
glm.fit: fitted probabilities numerically 0 or 1 occurred
r
r summary(log_model3)
Call:
glm(formula = Type ~ Gender + Officer.defined.ethnicity, family = binomial,
data = stop_search_met_subset_2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9215 0.5862 0.6469 0.7512 0.9938
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.1193 229.6293 0.057 0.9544
GenderFemale -11.4466 229.6285 -0.050 0.9602
GenderMale -11.4451 229.6285 -0.050 0.9602
GenderOther -25.0138 318.9617 -0.078 0.9375
Officer.defined.ethnicityAsian -1.2242 0.6334 -1.933 0.0533 .
Officer.defined.ethnicityBlack -0.5532 0.6313 -0.876 0.3809
Officer.defined.ethnicityOther -0.4835 0.6562 -0.737 0.4613
Officer.defined.ethnicityWhite -0.2163 0.6321 -0.342 0.7322
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5534.7 on 4929 degrees of freedom
Residual deviance: 5411.7 on 4922 degrees of freedom
AIC: 5427.7
Number of Fisher Scoring iterations: 11
Use the blackfriday dataset.
Check the documentation in R for the aov(...)
function.
Note that an outstanding online help is https://www.rdocumentation.org/.
Does the amount of money spent on purchaes on black friday differ between female and male shoppers?
Show your results with a t-test, an ANOVA and the GLM.
r
r t.test(Purchase ~ Gender , data=blackfriday , var.equal = T)
Two Sample t-test
data: Purchase by Gender
t = -4.2404, df = 7998, p-value = 2.256e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-807.4805 -296.9296
sample estimates:
mean in group F mean in group M
8958.570 9510.775
r
r summary(aov(Purchase ~ Gender , data=blackfriday))
Df Sum Sq Mean Sq F value Pr(>F)
Gender 1 4.434e+08 443365062 17.98 2.26e-05 ***
Residuals 7998 1.972e+11 24657621
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
r
r anova(glm(Purchase ~ Gender , data=blackfriday , family = gaussian) , test = ‘F’)
Analysis of Deviance Table
Model: gaussian, link: identity
Response: Purchase
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev F Pr(>F)
NULL 7999 1.9766e+11
Gender 1 443365062 7998 1.9721e+11 17.981 2.256e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Show that
Note that for the GLM implementation, you will have to run a model comparison against the model itself using the F test, to obtain the F-value.
Note also that the t.test
corrects for unequal variances, which the other two don’t.
r
r sqrt(17.981)
[1] 4.240401
Use an ANOVA and the GLM to test the hypothesis that age affects the purchase size.
r
r summary(anova_1)
Df Sum Sq Mean Sq F value Pr(>F)
Age 6 2.051e+08 34189775 1.384 0.217
Residuals 7993 1.974e+11 24702850
r
r summary(glm_1)
Call:
glm(formula = Purchase ~ Age, family = gaussian, data = blackfriday)
Deviance Residuals:
Min 1Q Median 3Q Max
-9389 -3527 -1265 2779 14797
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9503.657 342.976 27.709 <2e-16 ***
Age18-25 -372.210 367.150 -1.014 0.311
Age26-35 -151.434 353.730 -0.428 0.669
Age36-45 -103.001 365.786 -0.282 0.778
Age46-50 160.832 390.229 0.412 0.680
Age51-55 6.036 406.597 0.015 0.988
Age55+ 260.004 448.744 0.579 0.562
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 24702850)
Null deviance: 1.9766e+11 on 7999 degrees of freedom
Residual deviance: 1.9745e+11 on 7993 degrees of freedom
AIC: 158891
Number of Fisher Scoring iterations: 2
What is your conclusion?
Bonus question
Use the blackfriday
dataset and try to adopt the role of a business analyst. Suppose you are interested in customer profiling and want to target married customers differently than single customers.
Build a model that models the customers’ marital status.
r
r #your code