Data science for security
Bennett Kleinberg
25 Oct 2019
Today
- The big promise: A primer of data science
- The pitfalls and problems
- Data Science for security
(if we have time: The do-or-die problem of data science)
Your thoughts?
- More data = better problem solving.
- Every problem will become a “data problem”.
- The big challenge for data science is a technological one.
Aaah: so we’re talking Big Data!
Problems with “Big Data”
- what is “big”?
- data = data?
- complexity of data?
- sexiness of small data
Before we start
What are the chances that this man is a terrorist?
The big promise: A primer of data science
Machine learning?
- core idea: a system learns from experience
- no precise instructions
Applications?
Why do we want this?
Step back…
How did you perform regression analysis in school?
Okay …
- you’ve got one outcome variable (e.g. number of shooting victims)
- and two predictors (e.g. gender of shooter, age)
- typical approach \(victims = gender + age\)
- regression equation with: intercept, beta coefficients and inferred error term
But!
Often we have no idea about the relationships.
- too many predictors
- too diverse a problem
- simply unknown
ML in general
- concerned with patterns in data
- learning from data
- typically, more experience results in better models
- data, data, data
Types of machine learning
Broad categories
- Supervised learning
- Unsupervised learning
- Hybrid models
- Deep learning
- Reinforcement learning
Deep learning
Inspired by the human brain.
Reinforcement learning
Demo
What is supervised?
- who is the supervisor?
- supervised = labelled data
- i.e. you know the outcome
- flipped logic
The opposite: unsupervised learning.
Classes of supervised learning
- classification (e.g. dead/alive, fake/real)
- regression (e.g. income, number of deaths)
Mini example
Supervised classification
Simple example
- gender prediction
- based on salary
| gender | salary |
|--------|--------|
| male   | 33796  |
| male   | 34597  |
| male   | 34296  |
| male   | 32262  |
| female | 19190  |
| female | 14424  |
| female | 37614  |
| female | 29079  |
How to best separate the data into two groups?
Core idea
- learn relationship between
- outcome (target) variable
- features (predictors)
- “learning” is done through an algorithm
- simplest algorithm:
if A then B
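As a toy illustration, such a rule might look like this in R (the 30,000 cut-off is an invented value for this sketch, not one learned from data):

# a hand-made "if A then B" rule: guess gender from a salary threshold
# (the 30000 cut-off is an assumption for illustration, not learned from data)
threshold_classifier = function(salary){
  if(salary > 30000){
    return('male')
  } else {
    return('female')
  }
}

threshold_classifier(33796)  # 'male'
threshold_classifier(19190)  # 'female'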
Idea 1: male salary threshold

Idea 2: female salary threshold
But this is not learning!
Stepwise supervised ML
- clarify what the outcome and the features are
- determine which classification algorithm to use
- train the model

caret in practice
library(caret)

# train a linear SVM to predict gender from all other variables in data2
my_first_model = train(gender ~ .,
                       data = data2,
                       method = "svmLinear")
Now you have trained a model!
= you have taught an algorithm to learn to predict gender from salary & height

But now what?
Put your model to use
Make predictions:
data2$model_predictions = predict(my_first_model, data2)
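Each row now has a predicted label; a quick look (base R):

# the predictions are a factor with one predicted label per row
head(data2$model_predictions)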
The key challenge?
Think about what we did…
Problem of inductive bias
- remember: we learn from the data
- but what we really want to know is: how does it work on “unseen” data
How to solve this?
Keep some data for yourself
Train/test split
- split the data (e.g. 80%/20%, 60%/40%)
- use one part as TRAINING SET
- use the other as TEST SET
# create the split indices first (a sketch using caret's createDataPartition;
# p = 0.8 gives an 80%/20% split)
in_training = createDataPartition(data2$gender, p = 0.8, list = FALSE)
training_data = data2[ in_training, ]
test_data = data2[-in_training, ]
|    | gender | salary | height |
|----|--------|--------|--------|
| 3  | male   | 33225  | 179    |
| 9  | male   | 40841  | 193    |
| 11 | female | 15039  | 152    |
| 20 | female | 30597  | 148    |
Pipeline again
- define outcome (DONE)
- define features (DONE)
- build model (DONE)
- but this time: on the TRAINING SET
- evaluate model
- this time: on the TEST SET
Teach the SVM:
my_second_model = train(gender ~ .,
                        data = training_data,
                        method = "svmLinear")
Fit/test the SVM:
model_predictions = predict(my_second_model, test_data)
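To see how the model does on the held-out data, we can cross-tabulate truth and predictions (a base-R sketch):

# rows: true labels in the test set; columns: predicted labels
table(test_data$gender, model_predictions)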
But!
- our model might be really dependent on the training data
- we want to be more careful
- Can we do some kind of safeguarding in the training data?
Cross-validation
K-fold cross-validation

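In caret, k-fold cross-validation is requested via trainControl; a minimal sketch with 10 folds (a common choice), reusing the training data from before:

# 10-fold cross-validation within the training set
cv_settings = trainControl(method = "cv", number = 10)

my_cv_model = train(gender ~ .,
                    data = training_data,
                    method = "svmLinear",
                    trControl = cv_settings)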
How do we know whether a model is good?
Intermezzo
The confusion matrix
Confusion matrix
| reality | predicted fake  | predicted real  |
|---------|-----------------|-----------------|
| Fake    | True positives  | False negatives |
| Real    | False positives | True negatives  |
Confusion matrix
- true positives (TP): correctly identified fake ones
- true negatives (TN): correctly identified real ones
- false positives (FP): false accusations
- false negatives (FN): missed fakes
OKAY: let’s use accuracy
\(acc=\frac{(TP+TN)}{N}\)
Any problems with that?
Accuracy
Model 1

| reality | predicted fake | predicted real |
|---------|-----|-----|
| Fake    | 252 | 48  |
| Real    | 80  | 220 |

Model 2

| reality | predicted fake | predicted real |
|---------|-----|-----|
| Fake    | 290 | 10  |
| Real    | 118 | 182 |
Problem with accuracy
- same accuracy, different confusion matrix
- relies on thresholding idea
- not suitable for comparing models (don’t be fooled by the literature!!)
Needed: more nuanced metrics
The problem from the beginning:
What are the chances that this man is a terrorist?
Solving the problem
|           | alarm | no alarm | total   |
|-----------|-------|----------|---------|
| Terrorist | 950   | 50       | 1,000   |
| Passenger | 4,950 | 94,050   | 99,000  |
| total     | 5,900 | 94,100   | 100,000 |
P(terrorist|alarm) = 950/5900 = 16.10%
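The same arithmetic as a quick check in R:

# P(terrorist | alarm): true alarms divided by all alarms
950 / (950 + 4950)  # 0.161, i.e. ~16.1%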
Beyond accuracy
## prediction
## reality Fake Real Sum
## Fake 252 48 300
## Real 80 220 300
## Sum 332 268 600
## prediction
## reality Fake Real Sum
## Fake 290 10 300
## Real 118 182 300
## Sum 408 192 600
Precision
How often is the prediction correct when predicting class X?
Note: we have two classes, so we get two precision values
Formally:
- \(Pr_{fake} = \frac{TP}{(TP+FP)}\)
- \(Pr_{real} = \frac{TN}{(TN+FN)}\)
Precision
## prediction
## reality Fake Real Sum
## Fake 252 48 300
## Real 80 220 300
## Sum 332 268 600
- \(Pr_{fake} = \frac{252}{332} = 0.76\)
- \(Pr_{real} = \frac{220}{268} = 0.82\)
Comparing the models
|               | Model 1 | Model 2 |
|---------------|---------|---------|
| \(acc\)       | 0.79    | 0.79    |
| \(Pr_{fake}\) | 0.76    | 0.71    |
| \(Pr_{real}\) | 0.82    | 0.95    |
Recall
How many cases of class X are detected?
Note: we have two classes, so we get two recall values
Also called sensitivity and specificity!
Formally:
- \(R_{fake} = \frac{TP}{(TP+FN)}\)
- \(R_{real} = \frac{TN}{(TN+FP)}\)
Recall
## prediction
## reality Fake Real Sum
## Fake 252 48 300
## Real 80 220 300
## Sum 332 268 600
- \(R_{fake} = \frac{252}{300} = 0.84\)
- \(R_{real} = \frac{220}{300} = 0.73\)
Comparing the models
|               | Model 1 | Model 2 |
|---------------|---------|---------|
| \(acc\)       | 0.79    | 0.79    |
| \(Pr_{fake}\) | 0.76    | 0.71    |
| \(Pr_{real}\) | 0.82    | 0.95    |
| \(R_{fake}\)  | 0.84    | 0.97    |
| \(R_{real}\)  | 0.73    | 0.61    |
Combining Pr and R
The F1 measure.
Note: we combine Pr and R for each class, so we get two F1 measures.
Formally:
- \(F1_{fake} = 2*\frac{Pr_{fake} * R_{fake}}{Pr_{fake} + R_{fake}}\)
- \(F1_{real} = 2*\frac{Pr_{real} * R_{real}}{Pr_{real} + R_{real}}\)
F1 measure
## prediction
## reality Fake Real Sum
## Fake 252 48 300
## Real 80 220 300
## Sum 332 268 600
- \(F1_{fake} = 2*\frac{0.76 * 0.84}{0.76 + 0.84} = 2*\frac{0.64}{1.60} = 0.80\)
- \(F1_{real} = 2*\frac{0.82 * 0.73}{0.82 + 0.73} = 0.78\)
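All of these metrics follow from the four cells of the confusion matrix; a small helper sketch in R (the function name is made up for illustration):

# accuracy, plus precision, recall and F1 for the "fake" class
classification_metrics = function(tp, fn, fp, tn){
  accuracy  = (tp + tn) / (tp + fn + fp + tn)
  precision = tp / (tp + fp)
  recall    = tp / (tp + fn)
  f1        = 2 * (precision * recall) / (precision + recall)
  c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}

classification_metrics(tp = 252, fn = 48, fp = 80, tn = 220)  # Model 1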
Comparing the models
|               | Model 1 | Model 2 |
|---------------|---------|---------|
| \(acc\)       | 0.79    | 0.79    |
| \(Pr_{fake}\) | 0.76    | 0.71    |
| \(Pr_{real}\) | 0.82    | 0.95    |
| \(R_{fake}\)  | 0.84    | 0.97    |
| \(R_{real}\)  | 0.73    | 0.61    |
| \(F1_{fake}\) | 0.80    | 0.82    |
| \(F1_{real}\) | 0.78    | 0.74    |
Unsupervised learning
- often we don’t have labelled data
- sometimes there are no labels at all
- core idea: finding clusters in the data
Examples
- grouping of online ads
- clusters in crime descriptions
- …
Practically everywhere.
Clustering reduces your data!
The unsupervised case
You know nothing about groups inherent to the data.

The k-means idea
- separate data in set number of clusters
- find best cluster assignment of observations
Stepwise
- set the number of clusters
- find best cluster assignment
1. Number of clusters
Let’s take 4.
# fit k-means with 4 cluster centres and 10 random starts
unsup_model_1 = kmeans(data4,
                       centers = 4,
                       nstart = 10,
                       iter.max = 10)
What’s inside?
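The fitted object can be inspected directly (base R):

# observations per cluster, and the total within-cluster sum of squares
unsup_model_1$size
unsup_model_1$tot.withinss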

The k-means algorithm
- find random centers
- assign each observation to its closest center
- optimise for the within-cluster sum of squares (WSS)
But how do we know how many centers?
Possible approach:
- run it for several combinations
- assess the WSS
- determine based on scree-plot
Cluster determination
# run k-means for k = 1 to 20 and store the total within-cluster sum of squares
wss = numeric()
for(i in 1:20){
  kmeans_model = kmeans(data4, centers = i, iter.max = 20, nstart = 10)
  wss[i] = kmeans_model$tot.withinss
}
Scree plot (elbow method)
Look for the inflexion point (the “elbow”) at cluster number i.
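A minimal base-R way to draw the scree plot from the wss vector above:

# total WSS against the number of clusters; look for the "elbow"
plot(1:20, wss, type = "b",
     xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")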

Other methods to establish k
- Silhouette method (cluster fit)
- Gap statistic
See also this tutorial.
Choosing k
We settle for \(k = 2\)
unsup_model_final = kmeans(data4,
                           centers = 2,
                           nstart = 10,
                           iter.max = 10)
Plot the cluster assignment
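One way to produce such a plot in base R (assuming data4 holds the scaled salary and height variables used earlier):

# colour observations by assigned cluster; mark the centres with crosses
plot(data4$salary, data4$height,
     col = unsup_model_final$cluster,
     xlab = "salary (scaled)", ylab = "height (scaled)")
points(unsup_model_final$centers, pch = 4, cex = 3)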

Other unsupervised methods
- k-means (today)
- hierarchical clustering
- density clustering
Issues with unsupervised learning
What’s lacking?
What can you (not) say?
Caveats of unsup. ML
- there is no “ground truth”
- interpretation/subjectivity
- cluster choice
Interpretation of findings

Interpretation of findings
unsup_model_final$centers
## salary height
## 1 0.6869085 0.6101199
## 2 -0.8395549 -0.7457021
- Cluster 1: high salary, tall
- Cluster 2: low salary, short
Note: we cannot say anything about accuracy.
See the k-NN model.
Interpretation of findings
- subjective
- labelling tricky
- researcher’s choice!
- be open about this
Bias
Remember supervised learning?
What is the essential characteristic of it?
Suppose …
… you have to predict the quality of song lyrics.
How would you do it?
Examples
- quality of a football match
- attractiveness of an area
- quality of your degree
Bias through labelled data
- machine learning is only the tool!
- supervised learning will always predict something
- you need the researcher’s/analyst’s mindset to interpret it
Basic principle: BS in = BS out.
Problematic trends
Assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions, assumptions. Everywhere assumptions.
The naïveté fallacy
Put simply: you can sell anything.
Here’s an idea
# a 'classifier' that labels everyone as not a terrorist
ai_terrorism_detection = function(person){
  person_classification = 'no terrorist'
  return(person_classification)
}
“UCL RESEARCHERS USE AI TO FIGHT TERRORISM!”
“AI 99.9999% ACCURATE IN SPOTTING TERRORISTS!”
Category mistake
- So we are getting there with self-driving cars.
- Hence: we can also address the other challenges.
!!!!
“I would not be at all surprised if earthquakes are just practically, inherently unpredictable.”
(Ned Field)
Category mistake
- Building a sophisticated visual recognition system != predicting everything
- Static phenomena vs. complex systems
Human behaviour might be the ultimate frontier in prediction.
Ethical issues
- data sources
- (machine) learning systems
- reinforcing systems
- responsible practices
Ethics & data science
Your turn: do you see problems with these aspects?
- data sources
- (machine) learning systems
Ethics & data science
What about “reinforcing systems”?
Ethics & data science
Choose 1:
- the FP/FN issue in the hands of practitioners
- academics’ responsibility
An outlook
What would an ideal Data Science look like?
Be specific…
Academic data science
vs
“Industry” data science
Question: which one is leading?
Extreme view:
current academic data science is catering to hype to compensate for its Google envy.
Academic data science
- creating “cool” studies vs testing assumptions
- pumping out non-reproducible papers vs investing in fundamental data science research
- hiring people to do cool things with our data vs starting with the problem
- getting on the data science train vs focus on methods of data science
Outlook
- we need boring studies!
- longitudinal studies
- assumption checks
- replications
- we need to accept that Google & Co. are in a different league when it comes to applying things
- we need to focus on the “ACADEMIC” part
- we need unis as control mechanism, not as a player
Data science for security
Where can data science help?
Using data science
- Automating human work
- Exceeding human capacity
- Augmenting human decision-making
Automating human work
Examples:
- scanning images for guns
- moderating content on social media
- access control to buildings
Automating human work
Why?
- reliability
- costs per unit
- scalability
Exceeding human capacity
Examples:
- remote sensing applications
- deception detection
- tumor detection
Exceeding human capacity
Why?
- processing capacity problem
- complex relationships
- limited attention of humans
Augmenting human decision-making
Human-in-the-loop systems:
- ML system makes a decision
- Human revises the decision
- Final decision reached
Augmenting human decision-making
Why?
- uses best of both worlds
- context (human) + scale (machine)
- allows your system to gain traction
The biggest problem for data science
Everything can be measured
Everything can be represented in data
Pep the super data scientist
- knows everything about football
- knows what happens if you hit a ball from angle X from distance Y at speed Z, etc.
- knows everything about the physiology of the players, about the physical properties of the ball, about the rules
- has got access to all the data that you can possibly collect from a football game
But:
- Pep experiences the world from an isolated room…
- … through his Python editor…
- … and only has access to the data
… and never saw a football match.
One day …
One day, Pep goes out to the ”real” world and watches a match between Juventus (C. Ronaldo) and Barca (Messi).
Will Pep learn anything?
What does this mean?
- Qualia problem
- Originates from the philosophy of mind (consciousness problem)
- But reaches far beyond that
Pep’s problem & security
- perception of security
- experience of security
- perception of fairness
Recap
- Fastest intro to ML
- Some problems and pitfalls of data science
- Possible applications in security problems
- The hard problem of data science
If you only read one book in 2019…
Read: “The Signal and the Noise”, Nate Silver
