Machine learning 2

Advanced Crime Analysis UCL

Bennett Kleinberg

25 Feb 2019

Today

  • Recap supervised machine learning
  • UNsupervised ML
    • Step-by-step example
  • Performance metrics
  • Validation and generalisation

Recap supervised ML

  • supervised = labelled data
    • classification (e.g. death/alive, fake/real)
    • regression (e.g. income, number of deaths)
  • step-wise procedure

Steps in supervised ML

  • clarify what outcome and features are
  • determine which classification algorithm to use
  • train the model
    • train/test split
    • cross-validation
  • fit the model

Unsupervised ML

  • often we don’t have labelled data
  • sometimes there are no labels at all
  • core idea: finding clusters in the data

Examples

  • grouping of online ads
  • clusters in crime descriptions

Practically everywhere.

Clustering reduces your data!

The unsupervised case

You know nothing about groups inherent to the data.

The k-means idea

  • separate the data into a set number of clusters
  • find the best cluster assignment for the observations

Stepwise

  1. set the number of clusters
  2. find the best cluster assignment

1. no. of clusters

Let’s take 4.

unsup_model_1 = kmeans(data4             # the data to cluster
                       , centers = 4     # number of clusters k
                       , nstart = 10     # number of random starting configurations
                       , iter.max = 10)  # maximum number of iterations

What’s inside?
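
The fitted object is a list; its main components can be inspected directly (these are standard components of an R kmeans object):

unsup_model_1$cluster        # cluster assignment (1-4) for each observation
unsup_model_1$centers        # coordinates of the 4 cluster centers
unsup_model_1$tot.withinss   # total within-cluster sum of squares (WSS)
unsup_model_1$size           # number of observations per cluster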

The k-means algorithm

  • find random centers
  • assign each observation to its closest center
  • recompute the centers and repeat, optimising for the WSS (within-cluster sum of squares; see the sketch below)
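
A minimal sketch of one assignment pass and the WSS, assuming data4 contains only numeric variables (salary and height, as shown later); kmeans() does all of this, including the iteration, for you:

set.seed(123)
centers = data4[sample(nrow(data4), 4), ]   # start from 4 random observations as centers

# distance from every observation to every center
dist_to_centers = as.matrix(dist(rbind(centers, data4)))[-(1:4), 1:4]

# assign each observation to its closest center
assignment = apply(dist_to_centers, 1, which.min)

# the within-cluster sum of squares (WSS) that k-means minimises
wss_manual = sum((data4 - centers[assignment, ])^2)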

What’s problematic here?

But how do we know how many centers?

Possible approach:

  • run it for several values of k
  • assess the WSS for each
  • determine k based on a scree plot

Cluster determination

# fit k-means for k = 1, ..., 20 and store the total within-cluster sum of squares
wss = numeric()
for(i in 1:20){
  kmeans_model = kmeans(data4, centers = i, iter.max = 20, nstart = 10)
  wss[i] = kmeans_model$tot.withinss
}

Scree plot (elbow method)

Look for the inflexion point (the “elbow”) in the WSS as the number of centers k increases.
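
One way to draw the scree plot from the wss values computed above (a base-R sketch):

plot(1:20, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")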

Other methods to establish k

  • Silhouette method (cluster fit)
  • Gap statistic

See also this tutorial.
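
One possible way to obtain both diagnostics is fviz_nbclust() from the factoextra package (the same package that provides fviz_cluster(), used below); a sketch:

library(factoextra)

# average silhouette width across a range of k (higher = better-separated clusters)
fviz_nbclust(data4, kmeans, method = "silhouette")

# gap statistic: compares the observed WSS against a reference with no cluster structure
fviz_nbclust(data4, kmeans, method = "gap_stat")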

Silhouette method

Gap statistic

Choosing k

We settle for \(k = 2\)

unsup_model_final = kmeans(data4
                           , centers = 2
                           , nstart = 10
                           , iter.max = 10)

Plot the cluster assignment
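
A sketch of how this plot could be produced, using fviz_cluster() as for k = 3 below:

# scatterplot of the data, coloured by the final 2-cluster assignment
fviz_cluster(unsup_model_final, geom = "point", data = data4)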

Other unsupervised methods

  • k-means (today)
  • hierarchical clustering
  • density clustering

Issues with unsupervised learning

What’s lacking?

What can you (not) say?

Caveats of unsup. ML

  • there is no “ground truth”
  • interpretation/subjectivity
  • cluster choice

Interpretation of findings

unsup_model_final$centers
##       salary     height
## 1 -0.8395549 -0.7457021
## 2  0.6869085  0.6101199
  • Cluster 1: low salary, small
  • Cluster 2: high salary, tall

Note: we cannot say anything about accuracy.

See the k-NN model.
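
One way to describe the clusters further is to attach the assignment back to the data; a purely descriptive sketch (the object name data4_clustered is illustrative):

# add the cluster assignment as a factor to the original data
data4_clustered = data.frame(data4, cluster = factor(unsup_model_final$cluster))

# per-cluster means (reproduces the centers) and cluster sizes
aggregate(. ~ cluster, data = data4_clustered, FUN = mean)
table(data4_clustered$cluster)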

Interpretation of findings

  • subjective
  • labelling is tricky
  • researcher’s choice!
  • be open about this

Cluster choice

What if we chose \(k=3\)?

km_3 = kmeans(data4, centers = 3, nstart = 10, iter.max = 10)
fviz_cluster(km_3, geom = "point", data = data4)

Cluster choice

What if we chose \(k=3\)?

km_3$centers
##       salary     height
## 1  0.7063757  0.8795474
## 2 -1.4058046 -0.5668204
## 3  0.1876933 -0.7256515
  • Cluster 1: high salary, very tall
  • Cluster 2: very low salary, small
  • Cluster 3: avg salary, small

Cluster choice

  • be open about it
  • make all choices transparent
  • always share code and data (“least vulnerable” principle)

Performance metrics for classification tasks

Fake news problem

Step 1: Splitting the data

library(caret)

set.seed(2019)
# 70/30 split, stratified on the outcome variable
in_training = createDataPartition(y = fake_news_data$outcome
                                  , p = .7
                                  , list = FALSE
                                  )
training_data = fake_news_data[in_training, ]
test_data = fake_news_data[-in_training, ]

Step 2: Define training controls

training_controls = trainControl(method="cv"       # k-fold cross-validation
                                 , number = 5      # with 5 folds
                                 , classProbs = T  # store class probabilities
                                 )

Step 3: Train the model

fakenews_model = train(outcome ~ .                       # predict outcome from all features
                       , data = training_data
                       , trControl = training_controls
                       , method = "svmLinear"            # linear support vector machine
                       )

Step 4: Fit the model

# predicted classes for the held-out test data
model.predictions = predict(fakenews_model, test_data)

Your task:

Evaluate the model.

What do you do?

Model evaluation

         prediction
reality   fake   real
   fake    252     48
   real     80    220

Accuracy: (252+220)/600 = 0.79

Intermezzo

The confusion matrix

Confusion matrix

              prediction
reality        Fake                Real
   Fake        True positives      False negatives
   Real        False positives     True negatives

Confusion matrix

  • true positives (TP): correctly identified fake ones
  • true negatives (TN): correctly identified real ones
  • false positives (FP): false accusations
  • false negatives (FN): missed fakes

OKAY: let’s use accuracies

\(acc=\frac{(TP+TN)}{N}\)

Any problems with that?

Accuracy

Model 1

         prediction
reality   Fake   Real
   Fake    252     48
   Real     80    220

Model 2

         prediction
reality   Fake   Real
   Fake    290     10
   Real    118    182

Problem with accuracy

  • same accuracy, different confusion matrix
  • relies on a single classification threshold
  • not suitable for comparing models (don’t be fooled by the literature!!)

Needed: more nuanced metrics

Beyond accuracy

Model 1:

##        prediction
## reality Fake Real Sum
##    Fake  252   48 300
##    Real   80  220 300
##    Sum   332  268 600

Model 2:

##        prediction
## reality Fake Real Sum
##    Fake  290   10 300
##    Real  118  182 300
##    Sum   408  192 600

Precision

i.e. how often the prediction is correct when the model predicts class X

Note: we have two classes, so we get two precision values

Formally:

  • \(Pr_{fake} = \frac{TP}{(TP+FP)}\)
  • \(Pr_{real} = \frac{TN}{(TN+FN)}\)

Precision

##        prediction
## reality Fake Real Sum
##    Fake  252   48 300
##    Real   80  220 300
##    Sum   332  268 600
  • \(Pr_{fake} = \frac{252}{332} = 0.76\)
  • \(Pr_{real} = \frac{220}{268} = 0.82\)

Comparing the models

               Model 1   Model 2
\(acc\)           0.79      0.79
\(Pr_{fake}\)     0.76      0.71
\(Pr_{real}\)     0.82      0.95

Recall

i.e. how many of the actual cases of class X are detected

Note: we have two classes, so we get two recall values

Also called sensitivity (recall for “fake”, the positive class) and specificity (recall for “real”)!

Formally:

  • \(R_{fake} = \frac{TP}{(TP+FN)}\)
  • \(R_{real} = \frac{TN}{(TN+FP)}\)

Recall

##        prediction
## reality Fake Real Sum
##    Fake  252   48 300
##    Real   80  220 300
##    Sum   332  268 600
  • \(R_{fake} = \frac{252}{300} = 0.84\)
  • \(R_{real} = \frac{220}{300} = 0.73\)

Comparing the models

               Model 1   Model 2
\(acc\)           0.79      0.79
\(Pr_{fake}\)     0.76      0.71
\(Pr_{real}\)     0.82      0.95
\(R_{fake}\)      0.84      0.97
\(R_{real}\)      0.73      0.61

Combining Pr and R

The F1 measure: the harmonic mean of precision and recall.

Note: we combine Pr and R for each class, so we get two F1 measures.

Formally:

  • \(F1_{fake} = 2*\frac{Pr_{fake} * R_{fake}}{Pr_{fake} + R_{fake}}\)
  • \(F1_{real} = 2*\frac{Pr_{real} * R_{real}}{Pr_{real} + R_{real}}\)

F1 measure

##        prediction
## reality Fake Real Sum
##    Fake  252   48 300
##    Real   80  220 300
##    Sum   332  268 600
  • \(F1_{fake} = 2*\frac{0.76 * 0.84}{0.76 + 0.84} = 2*\frac{0.64}{1.60} = 0.80\)
  • \(F1_{real} = 2*\frac{0.82 * 0.73}{0.82 + 0.73} = 0.77\)
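
A small self-contained sketch that reproduces the precision, recall, and F1 values for Model 1 from its confusion matrix (the variable names are illustrative):

# Model 1 confusion matrix: rows = reality, columns = prediction
conf_m1 = matrix(c(252, 80, 48, 220), nrow = 2,
                 dimnames = list(reality = c("fake", "real"),
                                 prediction = c("fake", "real")))

precision_fake = conf_m1["fake", "fake"] / sum(conf_m1[, "fake"])  # 252/332 = 0.76
precision_real = conf_m1["real", "real"] / sum(conf_m1[, "real"])  # 220/268 = 0.82

recall_fake = conf_m1["fake", "fake"] / sum(conf_m1["fake", ])     # 252/300 = 0.84
recall_real = conf_m1["real", "real"] / sum(conf_m1["real", ])     # 220/300 = 0.73

f1_fake = 2 * (precision_fake * recall_fake) / (precision_fake + recall_fake)  # 0.80
f1_real = 2 * (precision_real * recall_real) / (precision_real + recall_real)  # 0.77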

Comparing the models

                Model 1   Model 2
\(acc\)            0.79      0.79
\(Pr_{fake}\)      0.76      0.71
\(Pr_{real}\)      0.82      0.95
\(R_{fake}\)       0.84      0.97
\(R_{real}\)       0.73      0.61
\(F1_{fake}\)      0.80      0.82
\(F1_{real}\)      0.77      0.74

In caret

confusionMatrix(model.predictions, as.factor(test_data$outcome))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fake real
##       fake  252   80
##       real   48  220
##                                           
##                Accuracy : 0.7867          
##                  95% CI : (0.7517, 0.8188)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5733          
##  Mcnemar's Test P-Value : 0.006143        
##                                           
##             Sensitivity : 0.8400          
##             Specificity : 0.7333          
##          Pos Pred Value : 0.7590          
##          Neg Pred Value : 0.8209          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4200          
##    Detection Prevalence : 0.5533          
##       Balanced Accuracy : 0.7867          
##                                           
##        'Positive' Class : fake            
## 
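
Recent versions of caret can also report precision, recall, and F1 directly via the mode argument of confusionMatrix():

# precision, recall, and F1 for the positive class ('fake')
confusionMatrix(model.predictions, as.factor(test_data$outcome),
                mode = "prec_recall")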

There’s more

What’s actually behind the model’s predictions?

Any ideas?

Class probabilities
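
With classProbs = T set in the training controls, the model can return class probabilities rather than hard labels. A sketch of how the probs vector used below might be obtained (the column name fake is assumed from the class labels):

# predicted class probabilities for the test set: one column per class
class_probs = predict(fakenews_model, test_data, type = "prob")
head(class_probs)

# probability of the 'fake' class, used as probs in what follows
probs = class_probs$fake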

Notice anything?

The threshold problem

Issue!

  • a single classification threshold is not very informative
  • it obscures how certain the model is about each prediction

Needed: a representation across all possible values

The Area Under the Curve (AUC)

Idea:

  • use every observed value (here: the class probabilities) as a threshold
  • y-axis: sensitivity
  • x-axis: 1-specificity

AUC step-wise

# use the first observed class probability as a threshold
threshold_1 = probs[1]
threshold_1
## [1] 0.6280156

# classify as 'fake' whenever the predicted probability is at least the threshold
pred_threshold_1 = ifelse(probs >= threshold_1, 'fake', 'real')
knitr::kable(table(test_data$outcome, pred_threshold_1))
         prediction
reality   fake   real
   fake    221     79
   real     52    248

Sensitivity and 1-Specificity

         prediction
reality   fake   real
   fake    221     79
   real     52    248

\(Sens. = 221/300 = 0.74\)

\(Spec. = 248/300 = 0.83\)

Threshold   Sens.   1-Spec.
     0.63    0.74      0.17

Do this for every threshold observed.

… and plot the results:
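
A minimal sketch of this step, assuming probs holds the predicted probability of “fake” as above (the roc() function used next does all of this for you):

# every observed probability becomes a candidate threshold
thresholds = sort(unique(probs))

# sensitivity: proportion of fake items classified as fake at each threshold
sens = sapply(thresholds, function(t) mean(probs[test_data$outcome == "fake"] >= t))

# 1 - specificity: proportion of real items (wrongly) classified as fake
one_minus_spec = sapply(thresholds, function(t) mean(probs[test_data$outcome == "real"] >= t))

plot(one_minus_spec, sens, type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity")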

Quantify this plot

library(pROC)

# area under the ROC curve (AUC) with a 95% confidence interval
auc1 = roc(response = test_data$outcome
           , predictor = probs
           , ci=T)

What if we compare our two models?

plot.roc(auc1, xlim=c(1, 0), legacy.axes = T)

auc2 = roc(response = test_data$outcome
               , predictor = probs2
               , ci=T)

plot.roc(auc2, xlim=c(1, 0), legacy.axes = T)
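
To overlay both curves in one plot for a direct comparison, plot.roc()’s add argument can be used; a sketch:

plot.roc(auc1, legacy.axes = T)           # model 1
plot.roc(auc2, add = TRUE, col = "grey")  # model 2 on the same axes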

AUCs numerically

#model 1
roc(response = test_data$outcome , predictor = probs, ci=T)
## 
## Call:
## roc.default(response = test_data$outcome, predictor = probs,     ci = T)
## 
## Data: probs in 300 controls (test_data$outcome fake) > 300 cases (test_data$outcome real).
## Area under the curve: 0.8521
## 95% CI: 0.8216-0.8827 (DeLong)
#model 2
roc(response = test_data$outcome , predictor = probs2, ci=T)
## 
## Call:
## roc.default(response = test_data$outcome, predictor = probs2,     ci = T)
## 
## Data: probs2 in 300 controls (test_data$outcome fake) > 300 cases (test_data$outcome real).
## Area under the curve: 0.9573
## 95% CI: 0.9437-0.971 (DeLong)

RECAP

  • Unsupervised ML
  • Performance metrics
    • confusion matrix
    • AUC
  • Validation and generalisation

Outlook

Tutorial tomorrow

Week 8: Applied predictive modelling + R Notebooks

END