SOLUTIONS

Tutorial 2, Probability, Statistics and Modelling 2, BSc Security and Crime Science, UCL

## Outcomes of this tutorial

In this tutorial, you will:

• build and evaluate your own multiple regression models
• build and evaluate your own logistic regression models
• conduct hypothesis testing
• t-test
• ANOVA
• perform generalized ANOVA/t-test testing

## Linear regression

Read the csv file located at data/tutorial2/black_friday/sample_black_friday.csv as a dataframe, and name it blackfriday (Details on the data).

rr blackfriday = read.csv(‘data/tutorial2/blackfriday/sample_black_friday.csv’ , header=T)

cannot open file 'data/tutorial2/blackfriday/sample_black_friday.csv': No such file or directoryError in file(file, \rt\) : cannot open the connection

Familiarise yourself with the dataset:

rr dim(blackfriday)

[1] 8000    5

Build a linear model that predicts the purchase size based on gender. Use the glm function to do this.

#your code
lin_model = glm(Purchase ~ Gender
, data=blackfriday
, family = gaussian)

Assess the model fit using a “fit” metric that tells you how far on average the model mis-estimated the purchase size.

rr summary(lin_model)


Call:
glm(formula = Purchase ~ Gender, family = gaussian, data = blackfriday)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-9138   -3550   -1236    2757   14996

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   8958.6      113.6   78.85  < 2e-16 ***
GenderM        552.2      130.2    4.24 2.26e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 24657621)

Null deviance: 1.9766e+11  on 7999  degrees of freedom
Residual deviance: 1.9721e+11  on 7998  degrees of freedom
AIC: 158872

Number of Fisher Scoring iterations: 2

rr #your code mean(abs(lin_model$residuals)) [1] 4032.133 ### Task 3: Now build two competing models: • one model with the added variable Marital_Status • and one model with all terms possible from the two variables Marital_Status and Gender #your code lin_model2 = glm(Purchase ~ Gender + Marital_Status , data=blackfriday , family = gaussian) lin_model3 = glm(Purchase ~ Gender*Marital_Status , data=blackfriday , family = gaussian) Now compare the sum of squared residuals of all three models: rr sum(lin_model2$residuals^2)

[1] 197203256070

Now compare the three models statistically. What are your conclusions?

Hint: make sure to check the documentation for the ?anova function. You want to use the “F” test.

rr anova(lin_model, lin_model3, test = ‘F’)

Analysis of Deviance Table

Model 1: Purchase ~ Gender
Model 2: Purchase ~ Gender * Marital_Status
Resid. Df Resid. Dev Df Deviance      F Pr(>F)
1      7998 1.9721e+11
2      7996 1.9718e+11  2 31755201 0.6439 0.5253

## Logistic regression

Load the dataframe called stop_search_met from the file data/tutorial2/stop_and_search/stop_and_search_sample.RData.

#your code
load('./data/tutorial2/stop_and_search/stop_and_search_sample.RData')

Familiarise yourself with the dataset:

rr table(stop_search_met$Outcome)  Local resolution Nothing found - no further action 4 246 Offender cautioned Offender given drugs possession warning 0 30 Offender given penalty notice Suspect arrested 7 55 Suspect summonsed to court A no further action disposal 1 3331 Arrest Caution (simple or conditional) 766 2 Community resolution Khat or Cannabis warning 272 150 Penalty Notice for Disorder Summons / charged by post 84 52  ### Task 4: Subset the data to contain only the two levels of the variable Outcome “Arrest” and “A no further action disposal”. #your code stop_search_met_subset = stop_search_met[stop_search_met$Outcome == 'Arrest' | stop_search_met$Outcome == 'A no further action disposal', ] Now build a logistic regression model on the Outcome variable as outcome variable modelled through the gender and age of the suspect. #your code log_model = glm(Outcome ~ Gender + Age.range , data=stop_search_met_subset , family = binomial) What are your conclusions about the effects of either predictor variable? rr summary(log_model)  Call: glm(formula = Outcome ~ Gender + Age.range, family = binomial, data = stop_search_met_subset) Deviance Residuals: Min 1Q Median 3Q Max -1.3168 -0.7083 -0.6453 -0.2635 2.5993 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.7666 0.4242 -4.164 3.12e-05 *** GenderFemale -1.6497 0.5220 -3.160 0.00158 ** GenderMale -1.5768 0.4907 -3.214 0.00131 ** GenderOther -12.8880 324.7441 -0.040 0.96834 Age.range10-17 1.6773 0.3413 4.915 8.89e-07 *** Age.range18-24 1.8802 0.3341 5.627 1.83e-08 *** Age.range25-34 2.0886 0.3378 6.184 6.26e-10 *** Age.rangeover 34 2.2689 0.3412 6.650 2.92e-11 *** Age.rangeunder 10 -9.2226 324.7439 -0.028 0.97734 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 3243.7 on 3368 degrees of freedom Residual deviance: 3160.5 on 3360 degrees of freedom (728 observations deleted due to missingness) AIC: 3178.5 Number of Fisher Scoring iterations: 11 rr exp(coefficients(log_model))  (Intercept) GenderFemale GenderMale GenderOther 1.709049e-01 1.921094e-01 2.066348e-01 2.528104e-06 Age.range10-17 Age.range18-24 Age.range25-34 Age.rangeover 34 5.350969e+00 6.554912e+00 8.073816e+00 9.668312e+00 Age.rangeunder 10 9.878027e-05  ### Task 5: Expand the model and include the predictor variable Officer.defined.ethnicity. #your code log_model2 = glm(Outcome ~ Gender + Age.range + Officer.defined.ethnicity , data=stop_search_met_subset , family = binomial) How does this affect the model fit? rr #your code summary(log_model2)  Call: glm(formula = Outcome ~ Gender + Age.range + Officer.defined.ethnicity, family = binomial, data = stop_search_met_subset) Deviance Residuals: Min 1Q Median 3Q Max -1.3676 -0.6945 -0.6365 -0.2921 2.6724 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.77400 0.44453 -3.991 6.59e-05 *** GenderFemale -1.66580 0.63292 -2.632 0.00849 ** GenderMale -1.52112 0.60974 -2.495 0.01261 * GenderOther -12.59363 324.74430 -0.039 0.96907 Age.range10-17 1.65168 0.34526 4.784 1.72e-06 *** Age.range18-24 1.87722 0.33796 5.555 2.78e-08 *** Age.range25-34 2.04890 0.34231 5.986 2.16e-09 *** Age.rangeover 34 2.19814 0.34616 6.350 2.15e-10 *** Age.rangeunder 10 -9.21793 324.74387 -0.028 0.97735 Officer.defined.ethnicityAsian -0.24733 0.55241 -0.448 0.65434 Officer.defined.ethnicityBlack -0.05301 0.54234 -0.098 0.92213 Officer.defined.ethnicityOther -0.42799 0.61205 -0.699 0.48438 Officer.defined.ethnicityWhite 0.16182 0.54351 0.298 0.76590 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 3243.7 on 3368 degrees of freedom Residual deviance: 3148.4 on 3356 degrees of freedom (728 observations deleted due to missingness) AIC: 3174.4 Number of Fisher Scoring iterations: 11 ### Task 6: Examine the variable Type of the original dataset: rr #your code table(stop_search_met$Type)


Person and Vehicle search             Person search            Vehicle search
1228                      3702                        70 

Now exclude the level Vehicle search and build a logistic regression model on the two remaining levels. Try to answer the question whether race or gender affected the type of stop and search.

#your code
stop_search_met_subset_2 = stop_search_met[stop_search_met\$Type != 'Vehicle search', ]

rr log_model3 = glm(Type ~ Gender*Officer.defined.ethnicity , data=stop_search_met_subset_2 , family = binomial)

glm.fit: fitted probabilities numerically 0 or 1 occurred

rr summary(log_model3)


Call:
glm(formula = Type ~ Gender + Officer.defined.ethnicity, family = binomial,
data = stop_search_met_subset_2)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.9215   0.5862   0.6469   0.7512   0.9938

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                     13.1193   229.6293   0.057   0.9544
GenderFemale                   -11.4466   229.6285  -0.050   0.9602
GenderMale                     -11.4451   229.6285  -0.050   0.9602
GenderOther                    -25.0138   318.9617  -0.078   0.9375
Officer.defined.ethnicityAsian  -1.2242     0.6334  -1.933   0.0533 .
Officer.defined.ethnicityBlack  -0.5532     0.6313  -0.876   0.3809
Officer.defined.ethnicityOther  -0.4835     0.6562  -0.737   0.4613
Officer.defined.ethnicityWhite  -0.2163     0.6321  -0.342   0.7322
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 5534.7  on 4929  degrees of freedom
Residual deviance: 5411.7  on 4922  degrees of freedom
AIC: 5427.7

Number of Fisher Scoring iterations: 11

## GLM for group comparisons

Use the blackfriday dataset.

Check the documentation in R for the aov(...) function.

Does the amount of money spent on purchaes on black friday differ between female and male shoppers?

Show your results with a t-test, an ANOVA and the GLM.

rr t.test(Purchase ~ Gender , data=blackfriday , var.equal = T)


Two Sample t-test

data:  Purchase by Gender
t = -4.2404, df = 7998, p-value = 2.256e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-807.4805 -296.9296
sample estimates:
mean in group F mean in group M
8958.570        9510.775 

rr summary(aov(Purchase ~ Gender , data=blackfriday))

              Df    Sum Sq   Mean Sq F value   Pr(>F)
Gender         1 4.434e+08 443365062   17.98 2.26e-05 ***
Residuals   7998 1.972e+11  24657621
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

rr anova(glm(Purchase ~ Gender , data=blackfriday , family = gaussian) , test = ‘F’)

Analysis of Deviance Table

Response: Purchase

Terms added sequentially (first to last)

Df  Deviance Resid. Df Resid. Dev      F    Pr(>F)
NULL                     7999 1.9766e+11
Gender  1 443365062      7998 1.9721e+11 17.981 2.256e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Show that

Note that for the GLM implementation, you will have to run a model comparison against the model itself using the F test, to obtain the F-value.
Note also that the t.test corrects for unequal variances, which the other two don’t.

rr sqrt(17.981)

[1] 4.240401

Use an ANOVA and the GLM to test the hypothesis that age affects the purchase size.

rr summary(anova_1)

              Df    Sum Sq  Mean Sq F value Pr(>F)
Age            6 2.051e+08 34189775   1.384  0.217
Residuals   7993 1.974e+11 24702850               

rr summary(glm_1)


Call:
glm(formula = Purchase ~ Age, family = gaussian, data = blackfriday)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-9389   -3527   -1265    2779   14797

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9503.657    342.976  27.709   <2e-16 ***
Age18-25    -372.210    367.150  -1.014    0.311
Age26-35    -151.434    353.730  -0.428    0.669
Age36-45    -103.001    365.786  -0.282    0.778
Age46-50     160.832    390.229   0.412    0.680
Age51-55       6.036    406.597   0.015    0.988
Age55+       260.004    448.744   0.579    0.562
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 24702850)

Null deviance: 1.9766e+11  on 7999  degrees of freedom
Residual deviance: 1.9745e+11  on 7993  degrees of freedom
AIC: 158891

Number of Fisher Scoring iterations: 2

Use the blackfriday dataset and try to adopt the role of a business analyst. Suppose you are interested in customer profiling and want to target married customers differently than single customers.
rr #your code