MACHINE LEARNING 2
In supervised learning, the outcome and the features are known.
library(caret)  # classification and regression training
Practically everywhere.
Clustering reduces your data!
You know nothing about groups inherent to the data.
Let’s take \(k = 4\).
unsup_model_1 = kmeans(data4
, centers = 4    # number of clusters k
, nstart = 10    # random starts; the best solution is kept
, iter.max = 10) # maximum iterations per run
Possible approach: fit k-means for a range of k and track the total within-cluster sum of squares.
wss = numeric()
for(i in 1:20){
  kmeans_model = kmeans(data4, centers = i, iter.max = 20, nstart = 10)
  wss[i] = kmeans_model$tot.withinss  # total within-cluster sum of squares for k = i
}
Plot wss against the number of centers and look for the inflexion point (the “elbow”).
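For example, a quick base-R plot of the curve computed above makes the elbow visible:

plot(1:20, wss, type = "b",
     xlab = "number of centers k",
     ylab = "total within-cluster sum of squares")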
See also this tutorial.
We settle for \(k = 2\).
unsup_model_final = kmeans(data4
, centers = 2    # k chosen via the elbow plot
, nstart = 10
, iter.max = 10)
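To inspect the solution, the cluster assignments stored in the fitted object can be tabulated, e.g.:

table(unsup_model_final$cluster)   # number of observations per cluster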
What’s lacking?
What can you (not) say?
unsup_model_final$centers
## salary height
## 1 -0.8395549 -0.7457021
## 2 0.6869085 0.6101199
Note: without ground-truth labels, we cannot say anything about accuracy.
See the k-NN model.
What if we chose \(k=3\)?
library(factoextra)  # provides fviz_cluster()
km_3 = kmeans(data4, centers = 3, nstart = 10, iter.max = 10)
fviz_cluster(km_3, geom = "point", data = data4)  # scatter plot coloured by cluster
km_3$centers
## salary height
## 1 0.7063757 0.8795474
## 2 -1.4058046 -0.5668204
## 3 0.1876933 -0.7256515
set.seed(2019)  # make the random split reproducible
in_training = createDataPartition(y = fake_news_data$outcome
, p = .7          # 70% of the rows go to training
, list = FALSE    # return a matrix of indices, not a list
)
training_data = fake_news_data[ in_training,]
test_data = fake_news_data[-in_training,]
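Note that createDataPartition samples within each level of the outcome, so the 70/30 split preserves the class balance; a quick sanity check:

nrow(training_data)                        # ~70% of the rows
nrow(test_data)                            # the remaining ~30%
prop.table(table(training_data$outcome))   # class proportions are preserved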
training_controls = trainControl(method = "cv"  # k-fold cross-validation
, number = 5            # 5 folds
, classProbs = TRUE     # keep class probabilities (needed for ROC later)
)
fakenews_model = train(outcome ~ .   # predict outcome from all other features
, data = training_data
, trControl = training_controls
, method = "svmLinear"  # linear support vector machine
)
model.predictions = predict(fakenews_model, test_data)  # predicted classes on the test set
Evaluate the model.
What do you do?
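One option: cross-tabulate the true outcomes against the predictions (reality in rows, prediction in columns):

cm = table(reality = test_data$outcome, prediction = model.predictions)
knitr::kable(cm)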
reality \ prediction | fake | real |
---|---|---|
fake | 252 | 48 |
real | 80 | 220 |
\(acc = (252+220)/600 \approx 0.79\)
reality \ prediction | Fake | Real |
---|---|---|
Fake | True positives | False negatives |
Real | False positives | True negatives |
\(acc=\frac{(TP+TN)}{N}\)
Any problems with that?
Model 1:

reality \ prediction | Fake | Real |
---|---|---|
Fake | 252 | 48 |
Real | 80 | 220 |

Model 2:

reality \ prediction | Fake | Real |
---|---|---|
Fake | 290 | 10 |
Real | 118 | 182 |

Model 2 also reaches \((290+182)/600 \approx 0.79\): identical accuracy, but a very different error profile.
Needed: more nuanced metrics
## prediction
## reality Fake Real Sum
## Fake 252 48 300
## Real 80 220 300
## Sum 332 268 600
## prediction
## reality Fake Real Sum
## Fake 290 10 300
## Real 118 182 300
## Sum 408 192 600
i.e. → how often the prediction is correct when predicting class X (the precision)
Note: we have two classes, so we get two precision values
Formally: \(Pr_X = \frac{TP_X}{TP_X + FP_X}\), i.e. the correct predictions of class X divided by all predictions of class X.
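As a sketch, both precision values follow from the column sums of the cm table defined earlier:

precision_fake = cm["fake", "fake"] / sum(cm[, "fake"])   # 252/332 ≈ 0.76
precision_real = cm["real", "real"] / sum(cm[, "real"])   # 220/268 ≈ 0.82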
Metric | Model 1 | Model 2 |
---|---|---|
\(acc\) | 0.79 | 0.79 |
\(Pr_{fake}\) | 0.76 | 0.71 |
\(Pr_{real}\) | 0.82 | 0.95 |
i.e. → how much of class X is actually detected (the recall)
Note: we have two classes, so we get two recall values
Also called sensitivity (recall of the positive class) and specificity (recall of the negative class)!
Formally: \(R_X = \frac{TP_X}{TP_X + FN_X}\), i.e. the correctly detected instances of class X divided by all true instances of class X.
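The recall values divide by the row sums instead (reusing cm):

recall_fake = cm["fake", "fake"] / sum(cm["fake", ])   # 252/300 = 0.84
recall_real = cm["real", "real"] / sum(cm["real", ])   # 220/300 ≈ 0.73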
Metric | Model 1 | Model 2 |
---|---|---|
\(acc\) | 0.79 | 0.79 |
\(Pr_{fake}\) | 0.76 | 0.71 |
\(Pr_{real}\) | 0.82 | 0.95 |
\(R_{fake}\) | 0.84 | 0.97 |
\(R_{real}\) | 0.73 | 0.61 |
The F1 measure: the harmonic mean of precision and recall.
Note: we combine Pr and R for each class, so we get two F1 measures.
Formally: \(F1_X = \frac{2 \cdot Pr_X \cdot R_X}{Pr_X + R_X}\)
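And the F1 values combine the two (reusing the precision and recall values computed above):

f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)   # ≈ 0.80
f1_real = 2 * precision_real * recall_real / (precision_real + recall_real)   # ≈ 0.78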
Metric | Model 1 | Model 2 |
---|---|---|
\(acc\) | 0.79 | 0.79 |
\(Pr_{fake}\) | 0.76 | 0.71 |
\(Pr_{real}\) | 0.82 | 0.95 |
\(R_{fake}\) | 0.84 | 0.97 |
\(R_{real}\) | 0.73 | 0.61 |
\(F1_{fake}\) | 0.80 | 0.82 |
\(F1_{real}\) | 0.78 | 0.74 |
confusionMatrix(model.predictions, as.factor(test_data$outcome))  # caret reports most of these metrics directly
## Confusion Matrix and Statistics
##
## Reference
## Prediction fake real
## fake 252 80
## real 48 220
##
## Accuracy : 0.7867
## 95% CI : (0.7517, 0.8188)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5733
## Mcnemar's Test P-Value : 0.006143
##
## Sensitivity : 0.8400
## Specificity : 0.7333
## Pos Pred Value : 0.7590
## Neg Pred Value : 0.8209
## Prevalence : 0.5000
## Detection Rate : 0.4200
## Detection Prevalence : 0.5533
## Balanced Accuracy : 0.7867
##
## 'Positive' Class : fake
##
What’s actually behind the model’s predictions?
Any ideas?
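The chunks below work with a vector probs of predicted class probabilities. Its construction is not shown here, but a minimal sketch would be (assuming we track the predicted probability of the 'fake' class):

probs = predict(fakenews_model, test_data, type = "prob")[, "fake"]   # P(fake) per test row
head(probs)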
Notice anything?
Needed: a representation of performance across all possible threshold values
Idea: pick a probability threshold and derive class predictions from it.
threshold_1 = probs[1]  # take the first observed probability as an example threshold
threshold_1
## [1] 0.6280156
pred_threshold_1 = ifelse(probs >= threshold_1, 'fake', 'real')  # classify by threshold
knitr::kable(table(test_data$outcome, pred_threshold_1))
reality \ prediction | fake | real |
---|---|---|
fake | 221 | 79 |
real | 52 | 248 |
\(Sens. = 221/300 = 0.74\)
\(Spec. = 248/300 = 0.83\)
Threshold | Sens. | 1-Spec. |
---|---|---|
0.63 | 0.74 | 0.17 |
Do this for every observed threshold and plot Sens. against 1-Spec.: the result is the ROC curve.
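A minimal manual sketch of that procedure (assuming probs holds the predicted probability of 'fake', as above):

thresholds = sort(unique(probs), decreasing = TRUE)
sens = sapply(thresholds, function(t) mean(probs[test_data$outcome == "fake"] >= t))
fpr  = sapply(thresholds, function(t) mean(probs[test_data$outcome == "real"] >= t))
plot(fpr, sens, type = "l", xlab = "1 - Spec.", ylab = "Sens.")   # the ROC curve

In practice, the pROC package does this for us and also computes the area under the curve (AUC):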
library(pROC)  # provides roc() and plot.roc()
# model 1
auc1 = roc(response = test_data$outcome
, predictor = probs
, ci = T)
plot.roc(auc1, xlim = c(1, 0), legacy.axes = T)
# model 2 (probs2: predicted P(fake) from a second model, obtained analogously to probs)
auc2 = roc(response = test_data$outcome
, predictor = probs2
, ci = T)
plot.roc(auc2, xlim = c(1, 0), legacy.axes = T)
#model 1
roc(response = test_data$outcome , predictor = probs, ci=T)
##
## Call:
## roc.default(response = test_data$outcome, predictor = probs, ci = T)
##
## Data: probs in 300 controls (test_data$outcome fake) > 300 cases (test_data$outcome real).
## Area under the curve: 0.8521
## 95% CI: 0.8216-0.8827 (DeLong)
#model 2
roc(response = test_data$outcome , predictor = probs2, ci=T)
##
## Call:
## roc.default(response = test_data$outcome, predictor = probs2, ci = T)
##
## Data: probs2 in 300 controls (test_data$outcome fake) > 300 cases (test_data$outcome real).
## Area under the curve: 0.9573
## 95% CI: 0.9437-0.971 (DeLong)
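Whether the difference between the two AUCs is statistically significant can be checked with pROC's DeLong test (a sketch, using the auc1 and auc2 objects from above):

roc.test(auc1, auc2)   # DeLong's test for two correlated ROC curves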
Tutorial tomorrow
Week 8: Applied predictive modelling + R Notebooks