MACHINE LEARNING 1
Applications?
Step back…
How did you perform regression analysis in PSM2?
Often we have no idea about the relationships.
Inspired by the human brain.
In contrast: unsupervised learning.
Supervised classification
based on salary
 | gender | salary |
---|---|---|
1 | male | 39169 |
2 | male | 33620 |
3 | male | 33225 |
4 | male | 35437 |
11 | female | 15039 |
12 | female | 13861 |
13 | female | 24443 |
14 | female | 36744 |
How to best separate the data into two groups?
if A then B
minimum_male = min(data1$salary[data1$gender == 'male']) # lowest male salary: 32869
data1$my_prediction = ifelse(data1$salary >= minimum_male, 'male', 'female')
maximum_female = max(data1$salary[data1$gender == 'female']) # highest female salary: 41682
data1$my_prediction2 = ifelse(data1$salary <= maximum_female, 'female', 'male')
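How well do these hand-made rules do? A quick check (a minimal sketch, assuming the columns created above):

table(data1$gender, data1$my_prediction) # actual vs. predicted gender
mean(data1$gender == data1$my_prediction) # proportion classified correctly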
But this is not learning!
All we need are an outcome and features. For this, we use caret:
library(caret)
caret in practice

my_first_model = train(gender ~ .
, data = data2
, method = "svmLinear"
)
You have taught an algorithm to predict gender from salary and height.
But now what?
Make predictions:
data2$model_predictions = predict(my_first_model, data2)
 | female | male |
---|---|---|
female | 8 | 2 |
male | 0 | 10 |
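The cross-tab above can be produced with table(), or with caret's confusionMatrix(), which also reports accuracy and other statistics (a minimal sketch, assuming the objects created above):

table(data2$gender, data2$model_predictions) # rows: actual, columns: predicted
confusionMatrix(data2$model_predictions, as.factor(data2$gender))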
Think about what we did: we evaluated the model on the same data we used to train it.

How can we solve this?
Train/test split
caret helps!

set.seed(1)
in_training = createDataPartition(y = data1$gender
, p = .8 # <-- 80% of the rows go to training
, list = FALSE
)
in_training
## Resample1
## [1,] 1
## [2,] 2
## [3,] 4
## [4,] 5
## [5,] 6
## [6,] 7
## [7,] 8
## [8,] 10
## [9,] 12
## [10,] 13
## [11,] 14
## [12,] 15
## [13,] 16
## [14,] 17
## [15,] 18
## [16,] 19
training_data = data2[ in_training,]
test_data = data2[-in_training,]
 | gender | salary | height |
---|---|---|---|
3 | male | 33225 | 179 |
9 | male | 40841 | 193 |
11 | female | 15039 | 152 |
20 | female | 30597 | 148 |
Teach the SVM:
my_second_model = train(gender ~ .
, data = training_data
, method = "svmLinear"
)
Test the SVM on the held-out data:
model_predictions = predict(my_second_model, test_data)
 | female | male |
---|---|---|
female | 2 | 0 |
male | 0 | 2 |
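All four held-out cases are classified correctly: (2+2)/4 = 1.0. With only four test cases, though, this estimate is very noisy, which motivates the next step.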
K-fold cross-validation: split the training data into k folds, train on k-1 of them and validate on the held-out fold, rotating until every fold has served as validation once.

In caret:
training_controls = trainControl(method="cv"
, number = 4 # <-- number of folds
)
my_third_model = train(gender ~ .
, data = training_data
, trControl = training_controls
, method = "svmLinear"
)
my_third_model
## Support Vector Machines with Linear Kernel
##
## 16 samples
## 2 predictor
## 2 classes: 'female', 'male'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 12, 12, 12, 12
## Resampling results:
##
## Accuracy Kappa
## 0.75 0.5
##
## Tuning parameter 'C' was held constant at a value of 1
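A note on Kappa: it corrects accuracy for chance agreement. With two balanced classes (and roughly balanced predictions), chance-level accuracy is about 0.5, so Kappa ≈ (0.75 - 0.5) / (1 - 0.5) = 0.5.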
model_predictions = predict(my_third_model, test_data)
 | female | male |
---|---|---|
female | 2 | 0 |
male | 0 | 2 |
Fake news corpus: 1000 fake and 1000 real articles (data)
 | including | ones | information | house | show | security | outcome |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | fake |
2 | 0 | 0 | 0 | 0 | 0 | 0 | fake |
3 | 0 | 0 | 1 | 2 | 1 | 0 | fake |
1000 | 0 | 0 | 0 | 0 | 1 | 0 | fake |
1001 | 1 | 0 | 0 | 0 | 0 | 0 | real |
1002 | 0 | 0 | 0 | 0 | 0 | 0 | real |
1003 | 0 | 0 | 0 | 0 | 0 | 0 | real |
dim(fake_news_data)
## [1] 2000 799
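The document-term matrix above has been prepared for you. As an illustration only, such a matrix could be built with, for example, the quanteda package; in this sketch the news object, its column names, and the trimming threshold are all assumptions, not the course's actual preprocessing:

library(quanteda)

# Hypothetical input: a data frame 'news' with the raw article text and a fake/real label
corp = corpus(news, text_field = "text")
word_counts = dfm(tokens(corp, remove_punct = TRUE))
word_counts = dfm_trim(word_counts, min_docfreq = 0.05, docfreq_type = "prop") # keep frequent words only

fake_news_data = convert(word_counts, to = "data.frame")
fake_news_data$doc_id = NULL # drop the document id column
fake_news_data$outcome = news$outcome # attach the fake/real label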
 | Model 1 |
---|---|
outcome | fake vs real |
features | ngram freqs. |
algorithm | Linear SVM |
train/test | 80/20 |
Cross-val. | 10-fold |
Partition the data
set.seed(2019)
in_training = createDataPartition(y = fake_news_data$outcome
, p = .8 # <-- split value
, list = FALSE
)
training_data = fake_news_data[ in_training,]
test_data = fake_news_data[-in_training,]
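We can check the class balance of the training set (e.g. with table(); the counts are shown below):

table(training_data$outcome)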
Var1 | Freq |
---|---|
fake | 800 |
real | 800 |
Define training controls
training_controls = trainControl(method="cv"
, number = 10
)
Train the model
fakenews_model_1 = train(outcome ~ .
, data = training_data
, trControl = training_controls
, method = "svmLinear"
)
Test the model
model_1.predictions = predict(fakenews_model_1, test_data)
 | fake | real |
---|---|---|
fake | 159 | 41 |
real | 42 | 158 |
(159+158)/400 = 0.79
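Instead of computing the accuracy by hand, we can let R do it (a minimal sketch, assuming the objects created above):

mean(model_1.predictions == test_data$outcome) # proportion of test articles classified correctly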
Let's see whether we can do better with caret…
 | Model 1 | Model 2 |
---|---|---|
outcome | fake vs real | ~ |
features | ngram freqs. | ~ |
algorithm | Linear SVM | ~ |
train/test | 80/20 | 60/40 |
Cross-val. | 10-fold | 5-fold |
Step 1: Splitting the data
set.seed(2019)
in_training = createDataPartition(y = fake_news_data$outcome
, p = .6 # <-- split value
, list = FALSE
)
training_data = fake_news_data[ in_training,]
test_data = fake_news_data[-in_training,]
Step 2: Define training controls
training_controls = trainControl(method="cv"
, number = 5
)
Step 3: Train the model
fakenews_model_2 = train(outcome ~ .
, data = training_data
, trControl = training_controls
, method = "svmLinear"
)
Step 4: Test the model
model_2.predictions = predict(fakenews_model_2, test_data)
 | fake | real |
---|---|---|
fake | 329 | 71 |
real | 91 | 309 |
(329+309)/800 = 0.80
What’s driving the classification?
varImp(fakenews_model_1)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 798)
##
## Importance
## said 100.00
## first 82.70
## last 77.06
## two 73.30
## year 67.09
## years 63.15
## still 58.76
## also 57.69
## three 53.95
## thats 53.91
## one 53.67
## made 53.66
## new 52.80
## good 46.36
## time 45.59
## much 45.48
## now 45.14
## since 44.99
## four 44.52
## back 44.09
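The importance scores can also be plotted directly (a minimal sketch):

plot(varImp(fakenews_model_1), top = 20) # plot the 20 most important features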
“said”
tapply(training_data$said, training_data$outcome, mean)
## 1 0
## 0.93875 2.97625
“first”
tapply(training_data$first, training_data$outcome, mean)
## 1 0
## 0.48875 1.16000
caret offers a large selection of models: https://topepo.github.io/caret/available-models.html
 | Model 1 | Model 2 | Model 3 |
---|---|---|---|
outcome | fake vs real | ~ | ~ |
features | ngram freqs. | ~ | ~ |
algorithm | Linear SVM | ~ | Random Forest |
train/test | 80/20 | 60/40 | 70/30 |
Cross-val. | 10-fold | 5-fold | 2x Repeated 5-fold |
(skipping data splitting here)
training_controls = trainControl(method="repeatedcv"
, number = 5 # <-- number of folds
, repeats = 2 # <-- number of repeats
)
fakenews_model_3 = train(outcome ~ .
, data = training_data
, trControl = training_controls
, method = "ranger"
)
## Random Forest
##
## 560 samples
## 798 predictors
## 2 classes: 'fake', 'real'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 448, 448, 448, 448, 448, 448, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.7464286 0.4928571
## 2 extratrees 0.7366071 0.4732143
## 39 gini 0.8250000 0.6500000
## 39 extratrees 0.7901786 0.5803571
## 797 gini 0.8187500 0.6375000
## 797 extratrees 0.8392857 0.6785714
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 797, splitrule
## = extratrees and min.node.size = 1.
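caret tuned mtry and splitrule automatically here. We could also specify the candidate values ourselves via tuneGrid (a minimal sketch; the chosen values simply mirror the grid shown above):

tune_grid = expand.grid(mtry = c(2, 39, 797)
, splitrule = c("gini", "extratrees")
, min.node.size = 1
, stringsAsFactors = FALSE
)

fakenews_model_3b = train(outcome ~ .
, data = training_data
, trControl = training_controls
, method = "ranger"
, tuneGrid = tune_grid
)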
Make predictions
model_3.predictions = predict(fakenews_model_3, test_data)
 | fake | real |
---|---|---|
fake | 94 | 26 |
real | 20 | 100 |
(94+100)/240 = 0.81
Tutorial tomorrow
Homework: Replication of fake news classification
Week 7: Machine learning 2
Next week: Unsupervised learning + performance metrics