---
title: Solutions GLM
author: B Kleinberg
date: 29 January 2019
subtitle: Dept of Security and Crime Science, UCL
output: html_notebook
---

*SOLUTIONS*


Tutorial 2, Probability, Statistics and Modelling 2, BSc Security and Crime Science, UCL


## Outcomes of this tutorial

In this tutorial, you will:

- build and evaluate your own multiple regression models
- build and evaluate your own logistic regression models
- conduct hypothesis testing
    - t-test
    - ANOVA
- perform generalized ANOVA/t-test testing

## Linear regression

Read the csv file located at `data/tutorial2/black_friday/sample_black_friday.csv` as a dataframe, and name it `blackfriday` [(Details on the data)](https://www.kaggle.com/mehdidag/black-friday/version/1).

```{r}
#your code
blackfriday = read.csv('data/tutorial2/black_friday/sample_black_friday.csv'
                       , header=T)
```

Familiarise yourself with the dataset:

```{r}
dim(blackfriday)
```

```
[1] 8000    5
```
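Beyond the dimensions, functions such as `names()`, `head()`, and `str()` give a quick overview of the variables (a minimal sketch; output not shown):

```{r}
names(blackfriday)
head(blackfriday)
str(blackfriday)
```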

### Task 1:

Build a linear model that predicts the purchase size based on gender. Use the `glm` function to do this.

```{r}
#your code
lin_model = glm(Purchase ~ Gender
                , data=blackfriday
                , family = gaussian)
```

### Task 2:

Assess the model fit using a “fit” metric that tells you how far on average the model mis-estimated the purchase size.

```{r}
summary(lin_model)
```

```
Call:
glm(formula = Purchase ~ Gender, family = gaussian, data = blackfriday)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
 -9138   -3550   -1236    2757   14996  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8958.6      113.6   78.85  < 2e-16 ***
GenderM        552.2      130.2    4.24 2.26e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 24657621)

    Null deviance: 1.9766e+11  on 7999  degrees of freedom
Residual deviance: 1.9721e+11  on 7998  degrees of freedom
AIC: 158872

Number of Fisher Scoring iterations: 2
```

```{r}
#your code
mean(abs(lin_model$residuals))
```

```
[1] 4032.133
```
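The mean absolute residual says the model is off by about 4032 on average. A closely related alternative is the root mean squared error, which penalises large misses more heavily (a sketch; output not shown):

```{r}
# root mean squared error of the gender-only model
sqrt(mean(lin_model$residuals^2))
```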

### Task 3:

Now build two competing models:

- one model with the added variable `Marital_Status`
- and one model with all terms possible from the two variables `Marital_Status` and `Gender`

```{r}
#your code
lin_model2 = glm(Purchase ~ Gender + Marital_Status
                , data=blackfriday
                , family = gaussian)

lin_model3 = glm(Purchase ~ Gender*Marital_Status
                , data=blackfriday
                , family = gaussian)
```

Now compare the sum of squared residuals of all three models:

```{r}
sum(lin_model2$residuals^2)
```

```
[1] 197203256070
```
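The same quantity can be computed for all three models to compare them directly (a sketch; only the value for `lin_model2` is shown above):

```{r}
sum(lin_model$residuals^2)
sum(lin_model2$residuals^2)
sum(lin_model3$residuals^2)
```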

Now compare the three models statistically. What are your conclusions?

_Hint:_ make sure to check the documentation for the `?anova` function. You want to use the "F" test.

```{r}
anova(lin_model, lin_model3, test = 'F')
```

```
Analysis of Deviance Table

Model 1: Purchase ~ Gender
Model 2: Purchase ~ Gender * Marital_Status
  Resid. Df Resid. Dev Df Deviance      F Pr(>F)
1      7998 1.9721e+11                          
2      7996 1.9718e+11  2 31755201 0.6439 0.5253
```
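The comparison shown above suggests that adding `Marital_Status` and its interaction with `Gender` does not significantly improve on the gender-only model. The other pairs of nested models can be compared with the same call (a sketch; output not shown):

```{r}
anova(lin_model, lin_model2, test = 'F')
anova(lin_model2, lin_model3, test = 'F')
```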

## Logistic regression

Load the dataframe called `stop_search_met` from the file `data/tutorial2/stop_and_search/stop_and_search_sample.RData`.

```{r}
#your code
load('./data/tutorial2/stop_and_search/stop_and_search_sample.RData')
```

Familiarise yourself with the dataset:

```{r}
table(stop_search_met$Outcome)
```

```
                       Local resolution       Nothing found - no further action 
                                      4                                     246 
                     Offender cautioned Offender given drugs possession warning 
                                      0                                      30 
          Offender given penalty notice                        Suspect arrested 
                                      7                                      55 
             Suspect summonsed to court            A no further action disposal 
                                      1                                    3331 
                                 Arrest         Caution (simple or conditional) 
                                    766                                       2 
                   Community resolution                Khat or Cannabis warning 
                                    272                                     150 
            Penalty Notice for Disorder               Summons / charged by post 
                                     84                                      52 
```
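As before, `names()` and `head()` give a quick feel for the available columns (a sketch; output not shown):

```{r}
names(stop_search_met)
head(stop_search_met)
```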

### Task 4:

Subset the data to contain only the two levels of the variable `Outcome`: "Arrest" and "A no further action disposal".

```{r}
#your code
stop_search_met_subset = stop_search_met[stop_search_met$Outcome == 'Arrest' | stop_search_met$Outcome == 'A no further action disposal', ]
```
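Depending on how `Outcome` is stored, the subset may still carry the now-empty factor levels; re-coding it as a two-level factor is a sensible defensive step before fitting a binomial model (a sketch, not part of the solution above):

```{r}
# keep only the two levels actually present in the subset
stop_search_met_subset$Outcome = factor(stop_search_met_subset$Outcome)
table(stop_search_met_subset$Outcome)
```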

Now build a logistic regression model with the `Outcome` variable as the outcome, modelled through the gender and age of the suspect.

```{r}
#your code
log_model = glm(Outcome ~ Gender + Age.range
                , data=stop_search_met_subset
                , family = binomial)
```

What are your conclusions about the effects of either predictor variable?

```{r}
summary(log_model)
```

```
Call:
glm(formula = Outcome ~ Gender + Age.range, family = binomial, 
    data = stop_search_met_subset)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3168  -0.7083  -0.6453  -0.2635   2.5993  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.7666     0.4242  -4.164 3.12e-05 ***
GenderFemale       -1.6497     0.5220  -3.160  0.00158 ** 
GenderMale         -1.5768     0.4907  -3.214  0.00131 ** 
GenderOther       -12.8880   324.7441  -0.040  0.96834    
Age.range10-17      1.6773     0.3413   4.915 8.89e-07 ***
Age.range18-24      1.8802     0.3341   5.627 1.83e-08 ***
Age.range25-34      2.0886     0.3378   6.184 6.26e-10 ***
Age.rangeover 34    2.2689     0.3412   6.650 2.92e-11 ***
Age.rangeunder 10  -9.2226   324.7439  -0.028  0.97734    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3243.7  on 3368  degrees of freedom
Residual deviance: 3160.5  on 3360  degrees of freedom
  (728 observations deleted due to missingness)
AIC: 3178.5

Number of Fisher Scoring iterations: 11
```

```{r}
exp(coefficients(log_model))
```

```
      (Intercept)      GenderFemale        GenderMale       GenderOther 
     1.709049e-01      1.921094e-01      2.066348e-01      2.528104e-06 
   Age.range10-17    Age.range18-24    Age.range25-34  Age.rangeover 34 
     5.350969e+00      6.554912e+00      8.073816e+00      9.668312e+00 
Age.rangeunder 10 
     9.878027e-05 
```
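The exponentiated coefficients are odds ratios. Pairing them with Wald confidence intervals gives a sense of their precision (a sketch using `confint.default()`; output not shown):

```{r}
# odds ratios with 95% Wald confidence intervals
exp(cbind(OR = coef(log_model), confint.default(log_model)))
```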

### Task 5:

Expand the model and include the predictor variable `Officer.defined.ethnicity`.

```{r}
#your code
log_model2 = glm(Outcome ~ Gender + Age.range + Officer.defined.ethnicity
                , data=stop_search_met_subset
                , family = binomial)
```

How does this affect the model fit?

```{r}
#your code
summary(log_model2)
```

```
Call:
glm(formula = Outcome ~ Gender + Age.range + Officer.defined.ethnicity, 
    family = binomial, data = stop_search_met_subset)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3676  -0.6945  -0.6365  -0.2921   2.6724  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -1.77400    0.44453  -3.991 6.59e-05 ***
GenderFemale                    -1.66580    0.63292  -2.632  0.00849 ** 
GenderMale                      -1.52112    0.60974  -2.495  0.01261 *  
GenderOther                    -12.59363  324.74430  -0.039  0.96907    
Age.range10-17                   1.65168    0.34526   4.784 1.72e-06 ***
Age.range18-24                   1.87722    0.33796   5.555 2.78e-08 ***
Age.range25-34                   2.04890    0.34231   5.986 2.16e-09 ***
Age.rangeover 34                 2.19814    0.34616   6.350 2.15e-10 ***
Age.rangeunder 10               -9.21793  324.74387  -0.028  0.97735    
Officer.defined.ethnicityAsian  -0.24733    0.55241  -0.448  0.65434    
Officer.defined.ethnicityBlack  -0.05301    0.54234  -0.098  0.92213    
Officer.defined.ethnicityOther  -0.42799    0.61205  -0.699  0.48438    
Officer.defined.ethnicityWhite   0.16182    0.54351   0.298  0.76590    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3243.7  on 3368  degrees of freedom
Residual deviance: 3148.4  on 3356  degrees of freedom
  (728 observations deleted due to missingness)
AIC: 3174.4

Number of Fisher Scoring iterations: 11
```
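Since the two models are nested and fitted on the same observations, their fit can be compared directly with a likelihood-ratio test or via AIC (a sketch; output not shown). The drop in AIC from 3178.5 to 3174.4 is modest, and none of the ethnicity coefficients is individually significant.

```{r}
anova(log_model, log_model2, test = 'Chisq')
AIC(log_model, log_model2)
```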

### Task 6:

Examine the variable `Type` of the original dataset:

```{r}
#your code
table(stop_search_met$Type)
```

```
Person and Vehicle search             Person search            Vehicle search 
                     1228                      3702                        70 
```

Now exclude the level `Vehicle search` and build a logistic regression model on the two remaining levels. Try to answer the question whether race or gender affected the type of stop and search.

```{r}
#your code
stop_search_met_subset_2 = stop_search_met[stop_search_met$Type != 'Vehicle search', ]
```

```{r}
log_model3 = glm(Type ~ Gender + Officer.defined.ethnicity
                , data=stop_search_met_subset_2
                , family = binomial)
```

```
glm.fit: fitted probabilities numerically 0 or 1 occurred
```

```{r}
summary(log_model3)
```

```
Call:
glm(formula = Type ~ Gender + Officer.defined.ethnicity, family = binomial, 
    data = stop_search_met_subset_2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9215   0.5862   0.6469   0.7512   0.9938  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)  
(Intercept)                     13.1193   229.6293   0.057   0.9544  
GenderFemale                   -11.4466   229.6285  -0.050   0.9602  
GenderMale                     -11.4451   229.6285  -0.050   0.9602  
GenderOther                    -25.0138   318.9617  -0.078   0.9375  
Officer.defined.ethnicityAsian  -1.2242     0.6334  -1.933   0.0533 .
Officer.defined.ethnicityBlack  -0.5532     0.6313  -0.876   0.3809  
Officer.defined.ethnicityOther  -0.4835     0.6562  -0.737   0.4613  
Officer.defined.ethnicityWhite  -0.2163     0.6321  -0.342   0.7322  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5534.7  on 4929  degrees of freedom
Residual deviance: 5411.7  on 4922  degrees of freedom
AIC: 5427.7

Number of Fisher Scoring iterations: 11
```
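For a term-level answer to the race/gender question, likelihood-ratio tests for dropping each predictor can be obtained with `drop1()` (a sketch; output not shown). Note the very large standard errors on the Gender coefficients, which is what the warning about fitted probabilities of 0 or 1 points to, so the Gender comparison should be read with caution.

```{r}
drop1(log_model3, test = 'Chisq')
```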

## GLM for group comparisons

Use the blackfriday dataset.

Check the documentation in R for the `aov(...)` function.

Note that [https://www.rdocumentation.org/](https://www.rdocumentation.org/) is an excellent online reference.

### Task 7:

Does the amount of money spent on purchases on Black Friday differ between female and male shoppers?

Show your results with a t-test, an ANOVA and the GLM.

```{r}
t.test(Purchase ~ Gender
       , data=blackfriday
       , var.equal = T)
```

```
	Two Sample t-test

data:  Purchase by Gender
t = -4.2404, df = 7998, p-value = 2.256e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -807.4805 -296.9296
sample estimates:
mean in group F mean in group M 
       8958.570        9510.775 
```

```{r}
summary(aov(Purchase ~ Gender
            , data=blackfriday))
```

```
              Df    Sum Sq   Mean Sq F value   Pr(>F)    
Gender         1 4.434e+08 443365062   17.98 2.26e-05 ***
Residuals   7998 1.972e+11  24657621                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

```{r}
anova(glm(Purchase ~ Gender
          , data=blackfriday
          , family = gaussian)
      , test = 'F')
```

```
Analysis of Deviance Table

Model: gaussian, link: identity

Response: Purchase

Terms added sequentially (first to last)

       Df  Deviance Resid. Df Resid. Dev      F    Pr(>F)    
NULL                     7999 1.9766e+11                     
Gender  1 443365062      7998 1.9721e+11 17.981 2.256e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Show that all three approaches lead to the same result: the squared t statistic equals the F statistic.

Note that for the GLM implementation, you have to run the analysis-of-deviance comparison with the F test (`anova(..., test = 'F')`) to obtain the F value.  
Note also that by default `t.test` corrects for unequal variances (Welch's test), which the other two don't; setting `var.equal = T` above makes the three tests equivalent.

```{r}
#your code
sqrt(17.981)
```

```
[1] 4.240401
```
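Equivalently, squaring the reported t statistic recovers the F statistic (a quick numerical check; output not shown):

```{r}
(-4.2404)^2
```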

### Task 8:

Use an ANOVA and the GLM to test the hypothesis that age affects the purchase size.

```{r}
#your code
anova_1 = aov(Purchase ~ Age
              , data = blackfriday)

summary(anova_1)
```

```
              Df    Sum Sq  Mean Sq F value Pr(>F)
Age            6 2.051e+08 34189775   1.384  0.217
Residuals   7993 1.974e+11 24702850               
```

```{r}
#your code
glm_1 = glm(Purchase ~ Age
            , data = blackfriday
            , family = gaussian)

summary(glm_1)
```

```
Call:
glm(formula = Purchase ~ Age, family = gaussian, data = blackfriday)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
 -9389   -3527   -1265    2779   14797  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 9503.657    342.976  27.709   <2e-16 ***
Age18-25    -372.210    367.150  -1.014    0.311    
Age26-35    -151.434    353.730  -0.428    0.669    
Age36-45    -103.001    365.786  -0.282    0.778    
Age46-50     160.832    390.229   0.412    0.680    
Age51-55       6.036    406.597   0.015    0.988    
Age55+       260.004    448.744   0.579    0.562    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 24702850)

    Null deviance: 1.9766e+11  on 7999  degrees of freedom
Residual deviance: 1.9745e+11  on 7993  degrees of freedom
AIC: 158891

Number of Fisher Scoring iterations: 2
```

What is your conclusion?

### Task 9:

*Bonus question*

Use the blackfriday dataset and try to adopt the role of a business analyst. Suppose you are interested in customer profiling and want to target married customers differently than single customers.

Build a model that models the customers’ marital status.

```{r}
#your code
```
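One possible approach, sketched under the assumption that `Marital_Status` is coded 0/1 in this dataset: model marital status with a logistic regression on customer characteristics such as `Gender`, `Age`, and `Purchase`. Which predictors to include is a modelling choice, not a prescribed solution.

```{r}
# hypothetical profiling model: predictors chosen for illustration only
marital_model = glm(Marital_Status ~ Gender + Age + Purchase
                    , data = blackfriday
                    , family = binomial)
summary(marital_model)
exp(coefficients(marital_model))
```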

## END

