Machine learning 2

Advanced Crime Analysis UCL

Bennett Kleinberg

25 Feb 2019

Today

  • Recap supervised machine learning
  • UNsupervised ML
    • Step-by-step example
  • Performance metrics
  • Validation and generalisation

Recap supervised ML

  • supervised = labelled data
    • classification (e.g. death/alive, fake/real)
    • regression (e.g. income, number of deaths)
  • step-wise procedure

Steps in supervised ML

  • clarify what outcome and features are
  • determine which classification algorithm to use
  • train the model
    • train/test split
    • cross-validation
  • fit the model

Unsupervised ML

  • often we don’t have labelled data
  • sometimes there are no labels at all
  • core idea: finding clusters in the data

Examples

  • grouping of online ads
  • clusters in crime descriptions

Practically everywhere.

Clustering reduces your data!

The unsupervised case

You know nothing about groups inherent to the data.

The k-means idea

  • separate the data into a set number of clusters
  • find the best cluster assignment for the observations

Stepwise

  1. set the number of clusters
  2. find the best cluster assignment

1. no. of clusters

Let’s take 4.

unsup_model_1 = kmeans(data4             # the data to cluster
                       , centers = 4     # number of clusters k
                       , nstart = 10     # number of random starting configurations
                       , iter.max = 10)  # maximum number of iterations

What’s inside?
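
The fitted object is a list; its main components can be inspected directly (these are standard components of an R kmeans object):

unsup_model_1$cluster        # cluster assignment (1-4) for each observation
unsup_model_1$centers        # coordinates of the 4 cluster centers
unsup_model_1$tot.withinss   # total within-cluster sum of squares (WSS)
unsup_model_1$size           # number of observations per cluster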

The k-means algorithm

  • find random centers
  • assign each observation to its closest center
  • recompute the centers and repeat, optimising for the WSS (within-cluster sum of squares; see the sketch below)
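
A minimal sketch of one assignment pass and the WSS, assuming data4 contains only numeric variables (salary and height, as shown later); kmeans() does all of this, including the iteration, for you:

set.seed(123)
centers = data4[sample(nrow(data4), 4), ]   # start from 4 random observations as centers

# distance from every observation to every center
dist_to_centers = as.matrix(dist(rbind(centers, data4)))[-(1:4), 1:4]

# assign each observation to its closest center
assignment = apply(dist_to_centers, 1, which.min)

# the within-cluster sum of squares (WSS) that k-means minimises
wss_manual = sum((data4 - centers[assignment, ])^2)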

What’s problematic here?

But how do we know how many centers?

Possible approach:

  • run it for several values of k
  • assess the WSS for each
  • determine k based on a scree plot

Cluster determination

# fit k-means for k = 1, ..., 20 and store the total within-cluster sum of squares
wss = numeric()
for(i in 1:20){
  kmeans_model = kmeans(data4, centers = i, iter.max = 20, nstart = 10)
  wss[i] = kmeans_model$tot.withinss
}

Scree plot (elbow method)

Look for the inflexion point (the “elbow”) in the WSS as the number of centers k increases.
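
One way to draw the scree plot from the wss values computed above (a base-R sketch):

plot(1:20, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")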

Other methods to establish k

  • Silhouette method (cluster fit)
  • Gap statistic

See also this tutorial.
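
One possible way to obtain both diagnostics is fviz_nbclust() from the factoextra package (the same package that provides fviz_cluster(), used below); a sketch:

library(factoextra)

# average silhouette width across a range of k (higher = better-separated clusters)
fviz_nbclust(data4, kmeans, method = "silhouette")

# gap statistic: compares the observed WSS against a reference with no cluster structure
fviz_nbclust(data4, kmeans, method = "gap_stat")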

Silhouette method

Gap statistic

Choosing k

We settle for \(k = 2\)

unsup_model_final = kmeans(data4
                           , centers = 2
                           , nstart = 10
                           , iter.max = 10)

Plot the cluster assignment
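
A sketch of how this plot could be produced, using fviz_cluster() as for k = 3 below:

# scatterplot of the data, coloured by the final 2-cluster assignment
fviz_cluster(unsup_model_final, geom = "point", data = data4)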

Other unsupervised methods

  • k-means (today)
  • hierarchical clustering
  • density clustering

Issues with unsupervised learning

What’s lacking?

What can you (not) say?

Caveats of unsup. ML

  • there is no “ground truth”
  • interpretation/subjectivity
  • cluster choice

Interpretation of findings

unsup_model_final$centers
##       salary     height
## 1 -0.8395549 -0.7457021
## 2  0.6869085  0.6101199
  • Cluster 1: low salary, small
  • Cluster 2: high salary, tall

Note: we cannot say anything about accuracy.

See the k-NN model.
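
One way to describe the clusters further is to attach the assignment back to the data; a purely descriptive sketch (the object name data4_clustered is illustrative):

# add the cluster assignment as a factor to the original data
data4_clustered = data.frame(data4, cluster = factor(unsup_model_final$cluster))

# per-cluster means (reproduces the centers) and cluster sizes
aggregate(. ~ cluster, data = data4_clustered, FUN = mean)
table(data4_clustered$cluster)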

Interpretation of findings

  • subjective
  • labelling is tricky
  • researcher’s choice!
  • be open about this

Cluster choice

What if we chose \(k=3\)?

km_3 = kmeans(data4, centers = 3, nstart = 10, iter.max = 10)
fviz_cluster(km_3, geom = "point", data = data4)

Cluster choice

What if we chose \(k=3\)?

km_3$centers
##       salary     height
## 1  0.7063757  0.8795474
## 2 -1.4058046 -0.5668204
## 3  0.1876933 -0.7256515
  • Cluster 1: high salary, very tall
  • Cluster 2: very low salary, small
  • Cluster 3: avg salary, small

Cluster choice

  • be open about it
  • make all choices transparent
  • always share code and data (“least vulnerable” principle)

Performance metrics for classification tasks

Fake news problem

Step 1: Splitting the data

library(caret)

set.seed(2019)
# 70/30 split, stratified on the outcome variable
in_training = createDataPartition(y = fake_news_data$outcome
                                  , p = .7
                                  , list = FALSE
                                  )
training_data = fake_news_data[in_training, ]
test_data = fake_news_data[-in_training, ]

Step 2: Define training controls

training_controls = trainControl(method="cv"       # k-fold cross-validation
                                 , number = 5      # with 5 folds
                                 , classProbs = T  # store class probabilities
                                 )

Step 3: Train the model

fakenews_model = train(outcome ~ .                       # predict outcome from all features
                       , data = training_data
                       , trControl = training_controls
                       , method = "svmLinear"            # linear support vector machine
                       )

Step 4: Fit the model

# predicted classes for the held-out test data
model.predictions = predict(fakenews_model, test_data)

Your task:

Evaluate the model.

What do you do?

Model evaluation

         prediction
reality   fake   real
   fake    252     48
   real     80    220

Accuracy: (252+220)/600 = 0.79

Intermezzo

The confusion matrix

Confusion matrix

              prediction
reality        Fake                Real
   Fake        True positives      False negatives
   Real        False positives     True negatives

Confusion matrix

  • true positives (TP): correctly identified fake ones
  • true negatives (TN): correctly identified real ones
  • false positives (FP): false accusations
  • false negatives (FN): missed fakes

OKAY: let’s use accuracies

\(acc=\frac{(TP+TN)}{N}\)

Any problems with that?

Accuracy

Model 1

         prediction
reality   Fake   Real
   Fake    252     48
   Real     80    220

Model 2

         prediction
reality   Fake   Real
   Fake    290     10
   Real    118    182

Problem with accuracy

  • same accuracy, different confusion matrix
  • relies on a single classification threshold
  • not suitable for comparing models (don’t be fooled by the literature!!)

Needed: more nuanced metrics

Beyond accuracy

Model 1:

##        prediction
## reality Fake Real Sum
##    Fake  252   48 300
##    Real   80  220 300
##    Sum   332  268 600

Model 2:

##        prediction
## reality Fake Real Sum
##    Fake  290   10 300
##    Real  118  182 300
##    Sum   408  192 600

Precision

i.e. how often the prediction is correct when the model predicts class X

Note: we have two classes, so we get two precision values

Formally:

  • \(Pr_{fake} = \frac{TP}{(TP+FP)}\)
  • \(Pr_{real} = \frac{TN}{(TN+FN)}\)

Precision

##        prediction
## reality Fake Real Sum
##    Fake  252   48 300
##    Real   80  220 300
##    Sum   332  268 600
  • \(Pr_{fake} = \frac{252}{332} = 0.76\)
  • \(Pr_{real} = \frac{220}{268} = 0.82\)

Comparing the models

               Model 1   Model 2
\(acc\)           0.79      0.79
\(Pr_{fake}\)     0.76      0.71
\(Pr_{real}\)     0.82      0.95

Recall

i.e. how many of the actual cases of class X are detected

Note: we have two classes, so we get two recall values

Also called sensitivity (recall for “fake”, the positive class) and specificity (recall for “real”)!

Formally:

  • \(R_{fake} = \frac{TP}{(TP+FN)}\)
  • \(R_{real} = \frac{TN}{(TN+FP)}\)

Recall

##        prediction
## reality Fake Real Sum
##    Fake  252   48 300
##    Real   80  220 300
##    Sum   332  268 600
  • \(R_{fake} = \frac{252}{300} = 0.84\)
  • \(R_{real} = \frac{220}{300} = 0.73\)

Comparing the models

               Model 1   Model 2
\(acc\)           0.79      0.79
\(Pr_{fake}\)     0.76      0.71
\(Pr_{real}\)     0.82      0.95
\(R_{fake}\)      0.84      0.97
\(R_{real}\)      0.73      0.61

Combining Pr and R

The F1 measure: the harmonic mean of precision and recall.

Note: we combine Pr and R for each class, so we get two F1 measures.

Formally:

  • \(F1_{fake} = 2*\frac{Pr_{fake} * R_{fake}}{Pr_{fake} + R_{fake}}\)
  • \(F1_{real} = 2*\frac{Pr_{real} * R_{real}}{Pr_{real} + R_{real}}\)

F1 measure

##        prediction
## reality Fake Real Sum
##    Fake  252   48 300
##    Real   80  220 300
##    Sum   332  268 600
  • \(F1_{fake} = 2*\frac{0.76 * 0.84}{0.76 + 0.84} = 2*\frac{0.64}{1.60} = 0.80\)
  • \(F1_{real} = 2*\frac{0.82 * 0.73}{0.82 + 0.73} = 0.77\)
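
A small self-contained sketch that reproduces the precision, recall, and F1 values for Model 1 from its confusion matrix (the variable names are illustrative):

# Model 1 confusion matrix: rows = reality, columns = prediction
conf_m1 = matrix(c(252, 80, 48, 220), nrow = 2,
                 dimnames = list(reality = c("fake", "real"),
                                 prediction = c("fake", "real")))

precision_fake = conf_m1["fake", "fake"] / sum(conf_m1[, "fake"])  # 252/332 = 0.76
precision_real = conf_m1["real", "real"] / sum(conf_m1[, "real"])  # 220/268 = 0.82

recall_fake = conf_m1["fake", "fake"] / sum(conf_m1["fake", ])     # 252/300 = 0.84
recall_real = conf_m1["real", "real"] / sum(conf_m1["real", ])     # 220/300 = 0.73

f1_fake = 2 * (precision_fake * recall_fake) / (precision_fake + recall_fake)  # 0.80
f1_real = 2 * (precision_real * recall_real) / (precision_real + recall_real)  # 0.77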

Comparing the models

                Model 1   Model 2
\(acc\)            0.79      0.79
\(Pr_{fake}\)      0.76      0.71
\(Pr_{real}\)      0.82      0.95
\(R_{fake}\)       0.84      0.97
\(R_{real}\)       0.73      0.61
\(F1_{fake}\)      0.80      0.82
\(F1_{real}\)      0.77      0.74

In caret

confusionMatrix(model.predictions, as.factor(test_data$outcome))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fake real
##       fake  252   80
##       real   48  220
##                                           
##                Accuracy : 0.7867          
##                  95% CI : (0.7517, 0.8188)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5733          
##  Mcnemar's Test P-Value : 0.006143        
##                                           
##             Sensitivity : 0.8400          
##             Specificity : 0.7333          
##          Pos Pred Value : 0.7590          
##          Neg Pred Value : 0.8209          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4200          
##    Detection Prevalence : 0.5533          
##       Balanced Accuracy : 0.7867          
##                                           
##        'Positive' Class : fake            
## 
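
Recent versions of caret can also report precision, recall, and F1 directly via the mode argument of confusionMatrix():

# precision, recall, and F1 for the positive class ('fake')
confusionMatrix(model.predictions, as.factor(test_data$outcome),
                mode = "prec_recall")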

There’s more

What’s actually behind the model’s predictions?

Any ideas?

Class probabilities
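
With classProbs = T set in the training controls, the model can return class probabilities rather than hard labels. A sketch of how the probs vector used below might be obtained (the column name fake is assumed from the class labels):

# predicted class probabilities for the test set: one column per class
class_probs = predict(fakenews_model, test_data, type = "prob")
head(class_probs)

# probability of the 'fake' class, used as probs in what follows
probs = class_probs$fake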

Notice anything?

The threshold problem

Issue!

  • a single classification threshold is not very informative
  • it obscures how certain the model is about each prediction

Needed: a representation across all possible values

The Area Under the Curve (AUC)

Idea:

  • use every observed value (here: the class probabilities) as a threshold
  • y-axis: sensitivity
  • x-axis: 1-specificity

AUC step-wise

# use the first observed class probability as a threshold
threshold_1 = probs[1]
threshold_1
## [1] 0.6280156

# classify as 'fake' whenever the predicted probability is at least the threshold
pred_threshold_1 = ifelse(probs >= threshold_1, 'fake', 'real')
knitr::kable(table(test_data$outcome, pred_threshold_1))
         prediction
reality   fake   real
   fake    221     79
   real     52    248

Sensitivity and 1-Specificity

         prediction
reality   fake   real
   fake    221     79
   real     52    248

\(Sens. = 221/300 = 0.74\)

\(Spec. = 248/300 = 0.83\)

Threshold   Sens.   1-Spec.
     0.63    0.74      0.17

Do this for every threshold observed.

… and plot the results:
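
A minimal sketch of this step, assuming probs holds the predicted probability of “fake” as above (the roc() function used next does all of this for you):

# every observed probability becomes a candidate threshold
thresholds = sort(unique(probs))

# sensitivity: proportion of fake items classified as fake at each threshold
sens = sapply(thresholds, function(t) mean(probs[test_data$outcome == "fake"] >= t))

# 1 - specificity: proportion of real items (wrongly) classified as fake
one_minus_spec = sapply(thresholds, function(t) mean(probs[test_data$outcome == "real"] >= t))

plot(one_minus_spec, sens, type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity")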

Quantify this plot

library(pROC)

# area under the ROC curve (AUC) with a 95% confidence interval
auc1 = roc(response = test_data$outcome
           , predictor = probs
           , ci=T)

What if we compare our two models?

plot.roc(auc1, xlim=c(1, 0), legacy.axes = T)

auc2 = roc(response = test_data$outcome
               , predictor = probs2
               , ci=T)

plot.roc(auc2, xlim=c(1, 0), legacy.axes = T)
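
To overlay both curves in one plot for a direct comparison, plot.roc()’s add argument can be used; a sketch:

plot.roc(auc1, legacy.axes = T)           # model 1
plot.roc(auc2, add = TRUE, col = "grey")  # model 2 on the same axes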

AUCs numerically

#model 1
roc(response = test_data$outcome , predictor = probs, ci=T)
## 
## Call:
## roc.default(response = test_data$outcome, predictor = probs,     ci = T)
## 
## Data: probs in 300 controls (test_data$outcome fake) > 300 cases (test_data$outcome real).
## Area under the curve: 0.8521
## 95% CI: 0.8216-0.8827 (DeLong)
#model 2
roc(response = test_data$outcome , predictor = probs2, ci=T)
## 
## Call:
## roc.default(response = test_data$outcome, predictor = probs2,     ci = T)
## 
## Data: probs2 in 300 controls (test_data$outcome fake) > 300 cases (test_data$outcome real).
## Area under the curve: 0.9573
## 95% CI: 0.9437-0.971 (DeLong)

RECAP

  • Unsupervised ML
  • Performance metrics
    • confusion matrix
    • AUC
  • Validation and generalisation

Outlook

Tutorial tomorrow

Week 8: Applied predictive modelling + R Notebooks

END