Text data and text mining 1

Advanced Crime Analysis UCL

Bennett Kleinberg

28 Jan 2019

Text data 1

Briefly about the module

  • 0.5 UCL credits = 7.5 ECTS
  • 150 learning hours
  • 11 weeks with 14 hours/week
  • 3 contact hours per week
  • leaves 11 hours of self-study per week

Expected self-study

  • Revise the lecture (your responsibility)
  • Replicate the code/examples
  • Read the required literature (read, annotate, summarise)
  • Read additional literature if necessary
  • Design own code examples to understand the concept
  • Find tutorials/guides online
  • If still unclear: attend the code clinics: Weds 10-11 am
  • or: post it on Moodle or ask us

Today

  • Why text data?
  • Applications to crime and security problems
  • Levels of text data
  • Quantifying text data
  • Considerations in text cleaning

Text is everywhere …

  • Practically all websites
  • Emails
  • Messaging
  • Government reports
  • Laws
  • Police reports
  • Uni coursework
  • Newspapers

… and everything is text

  • videos –> transcripts
  • music –> lyrics
  • conversations –> transcripts
  • speeches –> transcripts

Core idea

Text is a unique documentation of human activity.

We are obsessed with documenting.

Text & Crime Science

Text & Crime Science

  • hate speech
  • police reports
  • crimestoppers
  • fake reviews
  • fear of crime
  • cryptofraud

Obtaining text data

Quantifying text data

Challenge of quantification

  • a text is not a numerical representation
  • compare this to trading data
  • a text is just that, “a text”
  • but: for quantitative analyses, we need numbers

Text –> numerical representation?

Example

All I ever wanted was to love women, and in turn to be loved by them back. Their behavior towards me has only earned my hatred, and rightfully so! I am the true victim in all of this. I am the good guy. Humanity struck at me first by condemning me to experience so much suffering. I didn’t ask for this. I didn’t want this. I didn’t start this war. I wasn’t the one who struck first. But I will finish it by striking back. I will punish everyone. And it will be beautiful. Finally, at long last, I can show the world my true worth.

How would you quantify the example?

Features of text data

  • meta dimension
    • no. of words
    • no. of sentences
  • syntactic dimension
    • word frequencies
    • verbs, nouns, persons, locations, ..
    • structure of a sentence
  • semantic dimension
    • sentiment
    • psycholinguistic features
  • text metrics
    • readability
    • lexical diversity
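The text-metric dimension can be computed directly with the quanteda package introduced below. A minimal sketch, assuming a quanteda 1.x installation (in quanteda 3.x these functions moved to the quanteda.textstats package) and a made-up example string:

library(quanteda)

example_text = "I didn't ask for this. I didn't want this. I didn't start this war."

# readability (here: Flesch reading ease)
textstat_readability(example_text, measure = "Flesch")

# lexical diversity (type-token ratio) on a document-feature matrix
textstat_lexdiv(dfm(example_text), measure = "TTR")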

Approaches to text data

  1. Modelling text data
  2. Comparing text data
  3. Text data for predictive models

The quanteda package

library(quanteda)
## Package version: 1.2.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

Levels of text data

  • characters c('h', 'a', 't', 'r', 'e', 'd')
  • words hatred
  • sentences I didn’t ask for this.
  • documents: individual text files
  • corpora: collection of documents

Counting meta features in R

  text level  R function
  characters  nchar()
  words       quanteda::ntoken()
  sentences   quanteda::nsentence()

Homework: read about the type/token distinction here and here.

R examples

#sentences
no_of_sentences = nsentence(er)
no_of_sentences
## text1 
##    13
#words 1: all tokens
no_of_words_1 = ntoken(er)
no_of_words_1
## text1 
##   123
#words 2: unique tokens (types)
no_of_words_2 = ntype(er)
no_of_words_2
## text1 
##    72

Type-token ratio

Note: an often-used metric for “lexical diversity” is the TTR (type-token ratio).

string_a = "I didn’t ask for this. I didn’t want this."
string_b = "But I will finish it by striking back."

What are the type-token ratios of each string?

Type-token ratio

ntype(string_a)/ntoken(string_a)
##     text1 
## 0.6363636
ntype(string_b)/ntoken(string_b)
## text1 
##     1

Nuanced meta features

  • Characters per word
nchar(er)/ntoken(er)
##    text1 
## 4.317073
  • Words per sentence
ntoken(er)/nsentence(er)
##    text1 
## 9.461538

Text representations

Text representations

  • represent a text by its tokens (terms)
  • each text consists of a frequency of its tokens
"I think I believe him"
  • create a column for each token
  • count the frequency
  text_id  I  think  believe  him
  text1    2  1      1        1

Term frequency

  • frequency of tokens in each document
  • represented in a table (matrix)
  • tokens are features of a document
  • voilà: fancy name –> Document Feature Matrix (= DFM)
example_string_tok = tokens("I think I believe him")

DFM

  • from ‘tokens’ object, create a DFM table
dfm(example_string_tok)
## Document-feature matrix of: 1 document, 4 features (0% sparse).
## 1 x 4 sparse Matrix of class "dfm"
##        features
## docs    i think believe him
##   text1 2     1       1   1
  • Sparsity: % of zero-cells
    • why is sparsity = 0% here?
    • what would you expect if we take additional documents?

DFM with multiple documents

Document-term frequency matrix

multiple_docs_tok = tokens(c("I think I believe him", "This is a cool function"))
dfm(multiple_docs_tok)
## Document-feature matrix of: 2 documents, 9 features (50% sparse).
## 2 x 9 sparse Matrix of class "dfm"
##        features
## docs    i think believe him this is a cool function
##   text1 2     1       1   1    0  0 0    0        0
##   text2 0     0       0   0    1  1 1    1        1
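Where does the 50% come from? A minimal sketch, assuming the multiple_docs_tok object from above: count the zero cells yourself, or use quanteda’s sparsity() helper.

multi_dfm = dfm(multiple_docs_tok)

# proportion of zero cells, computed by hand: 9 of 18 cells are zero
example_matrix = as.matrix(multi_dfm)
sum(example_matrix == 0) / length(example_matrix)

# quanteda's built-in helper returns the same proportion
sparsity(multi_dfm)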

DFM with two lone-actors

“All I ever wanted was to love women, and in turn to be loved by them back. Their behavior towards me has only earned my hatred, and rightfully so! I am the true victim in all of this. I am the good guy. Humanity struck at me first by condemning me to experience so much suffering. I didn’t ask for this. I didn’t want this. I didn’t start this war. I wasn’t the one who struck first. But I will finish it by striking back. I will punish everyone. And it will be beautiful. Finally, at long last, I can show the world my true worth.”

DFM with two texts

The Industrial Revolution and its consequences have been a disaster for the human race. They have greatly increased the life-expectancy of those of us who live in “advanced” countries, but they have destabilized society, have made life unfulfilling, have subjected human beings to indignities, have led to widespread psychological suffering (in the Third World to physical suffering as well) and have inflicted severe damage on the natural world. The continued development of technology will worsen the situation.

DFM representation

  • Create a “mini corpus” for convenience
  • makes using the quanteda pipeline easier
# er and ub contain the two texts from the previous slides
mini_corpus = corpus(c(er, ub))
summary(mini_corpus)
## Corpus consisting of 2 documents:
## 
##   Text Types Tokens Sentences
##  text1    72    123        13
##  text2    63     88         3
## 
## Source: /Users/bennettkleinberg/GitHub/ucl_aca_20182019/slides/* on x86_64 by bennettkleinberg
## Created: Sun Feb  3 18:37:47 2019
## Notes:

DFM representation

corpus_tokenised = tokens(mini_corpus)
corpus_dfm = dfm(corpus_tokenised)
knitr::kable(corpus_dfm[, 1:8])
  document  all  i   ever  wanted  was  to  love  women
  text1     2    10  1     1       1    3   1     1
  text2     0    0   0     0       0    3   0     0

knitr::kable(corpus_dfm[, 31:38])
  document  am  the  true  victim  of  this  good  guy
  text1     2   4    2     1       1   4     1     1
  text2     0   7    0     0       3   0     0     0

Is this ideal?

What are the most frequent “terms”?

topfeatures(corpus_dfm[1])
##      .      i      ,    the   this     to    and     by     me didn't 
##     12     10      4      4      4      3      3      3      3      3
topfeatures(corpus_dfm[2])
##       the      have         ,        to         .        of       and 
##         7         7         4         3         3         3         2 
##        in suffering     world 
##         2         2         2

Highly recommended: Vsauce on Zipf’s Law

Word hierarchies

  • some words add more meaning than others
  • stopwords = meaningless (?)
  • in any case: words that are too frequent don’t tell us much about the documents
  • ideally: we want to get an “importance score” for each word

But how to get the important words?

Word importance

  document  and  in  turn  be  loved  by
  text1     3    2   1     2   1      3
  text2     2    2   0     0   0      0

Ideally, we want to “reward” words that are:

  • important locally
  • but not ‘inflated’ globally

Metric for word importance

  • Term frequency: occurrences / total number of words in a document

  document  and    in     turn   be     loved  by
  text1     0.024  0.016  0.008  0.016  0.008  0.024
  text2     0.023  0.023  0.000  0.000  0.000  0.000
3/ntoken(mini_corpus[1])
##      text1 
## 0.02439024

Term frequency: reward for words that occur often in a document.
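A sketch of how to obtain these proportions directly, assuming the corpus_dfm object built earlier: dfm_weight() with the “prop” scheme divides every count by the document’s total number of tokens.

# relative term frequencies: counts divided by document length
tf_prop = dfm_weight(corpus_dfm, scheme = "prop")
tf_prop[, c("and", "in", "turn")]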

Metric for word importance

Problem: some words just occur a lot anyway (e.g. “stopwords”).

Correct for global occurrence:

  token  doc. frequency
  and    2
  in     2
  turn   1
  be     1
  loved  1
  by     1

Document frequency: number of documents with each token.
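A minimal sketch, again using corpus_dfm: quanteda’s docfreq() returns exactly this count for every feature.

# number of documents in which each token occurs
df_counts = docfreq(corpus_dfm)
df_counts[c("and", "in", "turn", "be", "loved", "by")]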

Combining term frequency and document frequency

  • take the local importance
  document  and    in     turn
  text1     0.024  0.016  0.008
  text2     0.023  0.023  0.000
  • correct for global occurrences

    token  doc. frequency
    and    2
    in     2
    turn   1

TF/DF

#text1: "and"
0.024/2

#text2: "and"
0.023/2
#text1: "turn"
0.008/1

#text2: "turn"
0.000/1

TF-IDF

  • Term frequency
  • INVERSE document frequency

\(TFIDF = TF/DF = TF \times IDF\), since

\(IDF = 1/DF\)

In practice, the IDF is often modelled as \(IDF = \log\left(\frac{N}{DF}\right)\), where \(N\) is the number of documents.
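A small worked sketch of that formula, using the relative term frequencies and document frequencies from the previous slides (N = 2 documents; quanteda’s dfm_tfidf() uses base-10 logarithms by default):

N  = 2      # number of documents in the mini corpus
tf = 0.024  # relative frequency of "and" in text1
df = 2      # "and" occurs in both documents

tf * log10(N / df)    # log10(2/2) = 0, so "and" gets weight 0

# a term that occurs in only one document keeps some weight
0.008 * log10(N / 1)  # "turn" in text1, ca. 0.002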

TF-IDF

Note: for the exact formula for the inverse DF refer to the quanteda docs.

knitr::kable(round(dfm_tfidf(corpus_dfm, scheme_tf = 'prop', scheme_df = 'inverse'), 3)[, 10:15])
  document  and  in  turn   be     loved  by
  text1     0    0   0.002  0.005  0.002  0.007
  text2     0    0   0.000  0.000  0.000  0.000

TF-IDF

  • TF: rewards local importance
  • IDF: punishes for global occurrence
  • TFIDF value as metric for the importance of words per document

There’s more to words

  • you can count them [DONE]
  • but they also have a function
    • each word has a grammatical function
    • nouns, verbs, pronouns
  • called: parts-of-speech

Syntactic dimension

library(qdap)
  token
  All
  I
  ever
  wanted
  was
  to
  love
  women

Part-of-speech tagging

POS tags depend on the tagging framework used.
Commonly used: the Penn Treebank Project tag set.

  token   POS
  All     determiner
  I       pronoun
  ever    adverb
  wanted  verb
  was     verb
  to      ?
  love    verb
  women   noun

POS types

Tag Description
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass

POS tagging with qdap

er_ = "All I ever wanted was to love women"
pos_tagged = pos(er_)
pos_tagged$POStagged$POStagged
## [1] all/DT i/FW ever/RB wanted/VBD was/VBD to/TO love/VB women/NNS
## Levels: all/DT i/FW ever/RB wanted/VBD was/VBD to/TO love/VB women/NNS

POS tagging

pos(er, percent = F, progress.bar = F)$POSfreq
##   wrd.cnt CC DT FW IN JJ MD NN NNS PRP PRP$ RB TO VB VBD VBG VBN VBP VBZ
## 1     106  4 10  1 14  8  4 17   1   7    3 10  3  9   6   2   2   3   1
##   WP
## 1  1
pos(ub, percent = F, progress.bar = F)$POSfreq
##   wrd.cnt CC DT IN JJ MD NN NNS PRP PRP$ RB TO VB VBN VBP WP
## 1      77  3  9  8 11  1 15   4   3    1  2  3  2   7   7  1
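The two texts differ in length (106 vs. 77 tagged words), so the raw counts are hard to compare. A sketch of one way to normalise them, assuming the er and ub objects from the slides above:

# POS tag counts as proportions of each text's word count
er_pos = pos(er, percent = F, progress.bar = F)$POSfreq
ub_pos = pos(ub, percent = F, progress.bar = F)$POSfreq

round(er_pos[, -1] / er_pos$wrd.cnt, 3)
round(ub_pos[, -1] / ub_pos$wrd.cnt, 3)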

Considerations in text cleaning

Researcher’s degrees of freedom

  • stopword removal
  • stemming

Stopword removal

  • We know many words are “low in meaning”
  • So-called stopwords
i
me
my
myself
we
hers
herself
it
its
itself
they

You could decide to remove these…
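A sketch of one way to do that in quanteda, assuming the corpus_tokenised and corpus_dfm objects from earlier: drop the built-in English stopword list either at the tokens stage or from the existing DFM.

# remove English stopwords before building the DFM ...
tokens_nostop = tokens_remove(corpus_tokenised, stopwords("english"))
dfm(tokens_nostop)

# ... or remove them from the DFM itself
dfm_remove(corpus_dfm, stopwords("english"))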

Stopword removal

With stopwords:

  document  all  i   ever  wanted  was  to  love  women
  text1     2    10  1     1       1    3   1     1
  text2     0    0   0     0       0    3   0     0

Without stopwords:

  document  ever  wanted  love  women  ,  turn  loved  back
  text1     1     1       1     1      4  1     1      2
  text2     0     0       0     0      4  0     0      0

Stemming

  • some words originate from the same “stem”
  • e.g. “love”, “loved”, “loving”, “lovely”
  • but you might want to reduce all these to the stem

Word stems

love_stem = c("love", "loved", "loving", "lovely")
love_stem_tok = tokens(love_stem)

  document  love  loved  loving  lovely
  text1     1     0      0       0
  text2     0     1      0       0
  text3     0     0      1       0
  text4     0     0      0       1

… after stemming

knitr::kable(dfm(love_stem_tok, stem = T))
  document  love
  text1     1
  text2     1
  text3     1
  text4     1
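The stem = T argument belongs to the older quanteda 1.x dfm() interface; in newer versions stemming is an explicit step. A sketch, assuming the love_stem_tok object defined above:

# stem at the tokens stage ...
tokens_wordstem(love_stem_tok, language = "english")

# ... or stem an existing DFM
dfm_wordstem(dfm(love_stem_tok), language = "english")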

Our mini corpus

From: (incl. stopwords and without stemming)

  document  all  i   ever  wanted  was  to  love  women
  text1     2    10  1     1       1    3   1     1
  text2     0    0   0     0       0    3   0     0

… to (without stopwords and stemmed)

  document  ever  want  love  women  ,  turn  back  .
  text1     1     2     2     1      4  1     2     12
  text2     0     0     0     0      4  0     0     3

Limitations of text data

  • a lot of assumptions
  • text == behaviour?
  • produced text == displayed text?
  • linguistic “profiles”
  • many decisions in your hand
    • stemming
    • stopwords
    • custom dictionary
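For the last point, a minimal sketch of a custom dictionary, with hypothetical categories and search terms chosen purely for illustration:

# hypothetical categories and glob patterns (not from the lecture material)
my_dict = dictionary(list(
  violence  = c("hatred", "punish*", "strik*", "war"),
  suffering = c("suffering", "victim*", "disaster")
))

# count the dictionary categories per document
dfm_lookup(corpus_dfm, dictionary = my_dict)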

RECAP

  • levels of text data
  • meta features
  • syntactic features
  • word frequencies
  • TFIDF
  • parts-of-speech

Outlook

No tutorial.

Homework: Text data 1 (to come)

Next week: Text data 2

END