Machine learning 1

Advanced Crime Analysis UCL

Bennett Kleinberg

18 Feb 2019

MACHINE LEARNING 1

Today

  • Recap week 1-5
  • Intro to machine learning
    • Types of ML
    • Supervised machine learning
    • Step-by-step example
    • Important algorithms

Recap week 2

APIs

Recap week 3

Webscraping

Recap week 4

Text mining 1

Recap week 5

Text mining 2

Machine learning?

  • core idea: a system learns from experience
  • no precise instructions

Applications?

Why do we want this?

Step back…

How did you perform regression analysis in PSM2?

Okay …

  • you’ve got one outcome variable (e.g. number of shooting victims)
  • and two predictors (e.g. gender of shooter, age)
  • typical approach: victims ~ gender + age
  • regression equation with intercept, beta coefficients and an inferred error term (a minimal sketch follows below)
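
A minimal sketch of that familiar approach, assuming a hypothetical data frame shootings with columns victims, gender and age (not part of the slides):

# ordinary least-squares regression: intercept + one beta per predictor (sketch)
fit = lm(victims ~ gender + age, data = shootings)
summary(fit)  # coefficients plus residual (error) term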

But!

Often we have no idea about the relationships.

  • too many predictors
  • too diverse a problem
  • simply unknown

ML in general

  • concerned with patterns in data
  • learning from data
  • more experience results typically in better models
  • data, data, data

Types of machine learning

Broad categories

  • Supervised learning (today)
  • Unsupervised learning (next week)
  • Hybrid models
  • Deep learning
  • Reinforcement learning

Deep learning

Inspired by the human brain.

Reinforcement learning

Demo

SUPERVISED LEARNING

WTF is supervised?

  • supervised = labeled data
  • i.e. you know the outcome
  • flipped logic

Contrary: unsupervised.

Classes of supervised learning

  • classification (e.g. death/alive, fake/real)
  • regression (e.g. income, number of deaths)

Mini example

Supervised classification

Simple example

  • gender prediction
  • based on salary

           gender  salary
    1        male   39169
    2        male   33620
    3        male   33225
    4        male   35437
    11     female   15039
    12     female   13861
    13     female   24443
    14     female   36744

How to best separate the data into two groups?

Core idea

  • learn relationship between
    • outcome (target) variable
    • features (predictors)
  • “learning” is done through an algorithm
    • simplest algorithm: if A then B

Idea 1: male salary threshold

minimum_male = min(data1$salary[data1$gender == 'male']) #32869
data1$my_prediction = ifelse(data1$salary >= minimum_male, 'male', 'female')

Idea 2: female salary threshold

maximum_female = max(data1$salary[data1$gender == 'female']) #41682
data1$my_prediction2 = ifelse(data1$salary <= maximum_female, 'female', 'male')
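
A quick sanity check (a sketch, not shown in the slides) of how well these hand-made rules do, using only the columns created above:

# proportion of rows where each rule matches the true gender (sketch)
mean(data1$my_prediction == data1$gender)
mean(data1$my_prediction2 == data1$gender)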

But this is not learning!

Stepwise supervised ML

  • clarify what outcome and features are
  • determine which classification algorithm to use
  • train the model

Enter: caret

library(caret)

caret in practice

my_first_model = train(gender ~ .
                       , data = data2
                       , method = "svmLinear"
                       )
Now you have trained a model!

you have taught an algorithm to learn to predict gender from salary & height

But now what?

Put your model to use

Make predictions:

data2$model_predictions = predict(my_first_model, data2)
         female  male
female        8     2
male          0    10
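
The cross-tabulation above can be reproduced with base R's table(); the exact call is not shown in the slides, so the row/column order here is an assumption:

# rows: observed gender, columns: model predictions (assumed order, sketch)
table(data2$gender, data2$model_predictions)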

The key challenge?

Think about what we did…

Problem of inductive bias

  • remember: we learn from the data
  • but what we really want to know is: how does it work on “unseen” data

How to solve this?

Keep some data for yourself

Train/test split

  • split the data (e.g. 80%/20%, 60%/40%)
  • use one part as TRAINING SET
  • use the other as TEST SET

caret helps!

set.seed(1)
in_training = createDataPartition(y = data1$gender
                                  , p = .8
                                  , list = FALSE
                                  )
in_training
##       Resample1
##  [1,]         1
##  [2,]         2
##  [3,]         4
##  [4,]         5
##  [5,]         6
##  [6,]         7
##  [7,]         8
##  [8,]        10
##  [9,]        12
## [10,]        13
## [11,]        14
## [12,]        15
## [13,]        16
## [14,]        17
## [15,]        18
## [16,]        19

Splitting the data

training_data = data2[ in_training,]
test_data = data2[-in_training,]
    gender  salary  height
3     male   33225     179
9     male   40841     193
11  female   15039     152
20  female   30597     148

Pipeline again

  • define outcome (DONE)
  • define features (DONE)
  • build model (DONE)
    • but this time: on the TRAINING SET
  • evaluate model
    • this time: on the TEST SET

Teach the SVM:

my_second_model = train(gender ~ .
                       , data = training_data
                       , method = "svmLinear"
                       )

Fit/test the SVM:

model_predictions = predict(my_second_model, test_data)
         female  male
female        2     0
male          0     2

But!

  • our model might be really dependent on the training data
  • we want to be more careful
  • Can we do some kind of safeguarding in the training data?

Cross-validation

K-fold cross-validation

Split the training data into k folds; train on k-1 of them and validate on the held-out fold, rotating so that every fold serves as the validation set exactly once.

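As a rough illustration (a sketch, not in the slides), caret's createFolds() shows which rows end up in each fold:

# 4-fold split of the (gender) training data; each fold is held out once (sketch)
folds = createFolds(training_data$gender, k = 4, list = TRUE)
str(folds)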

Specifying CV in caret

training_controls = trainControl(method="cv"
                                 , number = 4
                                 )

my_third_model = train(gender ~ .
                       , data = training_data
                       , trControl = training_controls
                       , method = "svmLinear"
                       )

my_third_model
## Support Vector Machines with Linear Kernel 
## 
## 16 samples
##  2 predictor
##  2 classes: 'female', 'male' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold) 
## Summary of sample sizes: 12, 12, 12, 12 
## Resampling results:
## 
##   Accuracy  Kappa
##   0.75      0.5  
## 
## Tuning parameter 'C' was held constant at a value of 1

Assess the CVed model

model_predictions = predict(my_third_model, test_data)
         female  male
female        2     0
male          0     2

Let’s apply this!

Fakenews corpus: 1000 fake, 1000 real (data)

      including  ones  information  house  show  security  outcome
1             1     1            1      1     1         1     fake
2             0     0            0      0     0         0     fake
3             0     0            1      2     1         0     fake
1000          0     0            0      0     1         0     fake
1001          1     0            0      0     0         0     real
1002          0     0            0      0     0         0     real
1003          0     0            0      0     0         0     real

Problem

  • 1000 fake and 1000 real news items
  • only source of information: text
  • often fact-checking not available (yet)
  • idea: linguistic traces help differentiate fake and real news
dim(fake_news_data)
## [1] 2000  799
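
Before modelling, a quick look at the outcome distribution (a sketch; the column name outcome is taken from the table above):

# should show 1000 fake and 1000 real items (sketch)
table(fake_news_data$outcome)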

Stepwise ML approach

  • the outcome variable?
  • the features?
  • the algorithm?
  • the train/test split?
  • the training set cross-validation?

Model 1

            Model 1
outcome     fake vs real
features    ngram freqs.
algorithm   Linear SVM
train/test  80/20
Cross-val.  10-fold

Step 1

Partition the data

set.seed(2019)
in_training = createDataPartition(y = fake_news_data$outcome
                                  , p = .8 # <-- split value
                                  , list = FALSE
                                  )
training_data = fake_news_data[ in_training,]
test_data = fake_news_data[-in_training,]
Training data

Var1  Freq
fake   800
real   800

Step 2

Define training controls

training_controls = trainControl(method="cv"
                                 , number = 10
                                 )

Step 3

Train the model

fakenews_model_1 = train(outcome ~ .
                       , data = training_data
                       , trControl = training_controls
                       , method = "svmLinear"
                       )

Step 4

Evaluate the model (on the test set)

model_1.predictions = predict(fakenews_model_1, test_data)
       fake  real
fake    159    41
real     42   158

(159+158)/400 = 0.79
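
Rather than computing accuracy by hand, caret's confusionMatrix() reports it (plus sensitivity, specificity, etc.) directly; a sketch, assuming the outcome is stored as a factor:

# confusion matrix with accuracy and related metrics (sketch)
confusionMatrix(model_1.predictions, test_data$outcome)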

The strength of caret

Let’s see whether we can do better

            Model 1       Model 2
outcome     fake vs real  ~
features    ngram freqs.  ~
algorithm   Linear SVM    ~
train/test  80/20         60/40
Cross-val.  10-fold       5-fold

Model 2

Step 1: Splitting the data

set.seed(2019)
in_training = createDataPartition(y = fake_news_data$outcome
                                  , p = .6 # <-- split value
                                  , list = FALSE
                                  )
training_data = fake_news_data[ in_training,]
test_data = fake_news_data[-in_training,]

Step 2: Define training controls

training_controls = trainControl(method="cv"
                                 , number = 5
                                 )

Step 3: Train the model

fakenews_model_2 = train(outcome ~ .
                       , data = training_data
                       , trControl = training_controls
                       , method = "svmLinear"
                       )

Step 4: Evaluate the model (on the test set)

model_2.predictions = predict(fakenews_model_2, test_data)
       fake  real
fake    329    71
real     91   309

(329+309)/800 = 0.80

Looking a step further

What’s driving the classification?

varImp(fakenews_model_1)
## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 798)
## 
##       Importance
## said      100.00
## first      82.70
## last       77.06
## two        73.30
## year       67.09
## years      63.15
## still      58.76
## also       57.69
## three      53.95
## thats      53.91
## one        53.67
## made       53.66
## new        52.80
## good       46.36
## time       45.59
## much       45.48
## now        45.14
## since      44.99
## four       44.52
## back       44.09

Important features

“said”

tapply(training_data$said, training_data$outcome, mean)
##       1       0 
## 0.93875 2.97625

“first”

tapply(training_data$first, training_data$outcome, mean)
##       1       0 
## 0.48875 1.16000

Making full use of caret

  • what if we want to use a different classification algorithm?

Selection of models –> https://topepo.github.io/caret/available-models.html
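
Switching algorithms in caret usually only means changing the method argument; a sketch using k-nearest neighbours (method name taken from the caret model list, everything else as before):

fakenews_model_knn = train(outcome ~ .
                           , data = training_data
                           , trControl = training_controls
                           , method = "knn"  # <-- only this line changes (sketch)
                           )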

Intermezzo to different algorithms

  • Support Vector Machine video
  • Decision Trees
  • Random Forests
  • worth knowing:
    • Naive Bayes
    • Logistic regression
    • kNN

Decision Trees

Random Forests

  • selects a random subset of the training data
    • builds a decision tree on it
  • many trees = a forest
  • many random trees = a random forest
  • combines the trees' predictions by voting (see the sketch below)
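
A toy sketch of the bagging-and-voting idea (assuming the rpart package; this illustrates the principle only, not how ranger works internally):

library(rpart)
set.seed(123)
# grow a few trees, each on a bootstrap sample of the training data (sketch)
trees = lapply(1:5, function(i) {
  boot_rows = sample(nrow(training_data), replace = TRUE)
  rpart(outcome ~ ., data = training_data[boot_rows, ], method = "class")
})
# let each tree vote on the test data; the majority class wins
votes = sapply(trees, function(tree) as.character(predict(tree, test_data, type = "class")))
majority_vote = apply(votes, 1, function(row) names(which.max(table(row))))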

Model 3

            Model 1       Model 2  Model 3
outcome     fake vs real  ~        ~
features    ngram freqs.  ~        ~
algorithm   Linear SVM    ~        Random Forest
train/test  80/20         60/40    70/30
Cross-val.  10-fold       5-fold   2x repeated 5-fold

Model 3

(skipping data splitting here)

training_controls = trainControl(method="repeatedcv"
                                 , number = 5
                                 , repeats = 2
                                 )

Model 3

fakenews_model_3 = train(outcome ~ .
                       , data = training_data
                       , trControl = training_controls
                       , method = "ranger"
                       )

Model 3

## Random Forest 
## 
## 560 samples
## 798 predictors
##   2 classes: 'fake', 'real' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 448, 448, 448, 448, 448, 448, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##     2   gini        0.7464286  0.4928571
##     2   extratrees  0.7366071  0.4732143
##    39   gini        0.8250000  0.6500000
##    39   extratrees  0.7901786  0.5803571
##   797   gini        0.8187500  0.6375000
##   797   extratrees  0.8392857  0.6785714
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 797, splitrule
##  = extratrees and min.node.size = 1.
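
The tuning values reported above (mtry, splitrule, min.node.size) can also be fixed explicitly via tuneGrid; a sketch reusing values from the output rather than letting caret search over them:

fakenews_model_3b = train(outcome ~ .
                          , data = training_data
                          , trControl = training_controls
                          , method = "ranger"
                          , tuneGrid = expand.grid(mtry = 39
                                                   , splitrule = "gini"
                                                   , min.node.size = 1)
                          )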

Model 3

Make predictions

model_3.predictions = predict(fakenews_model_3, test_data)
       fake  real
fake     94    26
real     20   100

(94+100)/240 = 0.81

RECAP

  • Types of machine learning
  • Supervised ML
  • Cross-validation
  • Using caret

Outlook

Tutorial tomorrow

Homework: Replication of fake news classification

Week 7: Machine learning 2

Next week: Unsupervised learning + performance metrics

END