Text data 1
Text is a unique documentation of human activity.
We are obsessed with documenting.
Text → numerical representation?
All I ever wanted was to love women, and in turn to be loved by them back. Their behavior towards me has only earned my hatred, and rightfully so! I am the true victim in all of this. I am the good guy. Humanity struck at me first by condemning me to experience so much suffering. I didn’t ask for this. I didn’t want this. I didn’t start this war. I wasn’t the one who struck first. But I will finish it by striking back. I will punish everyone. And it will be beautiful. Finally, at long last, I can show the world my true worth.
quanteda package
library(quanteda)
## Package version: 1.2.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
c('h', 'a', 't', 'r', 'e', 'd')
hatred
I didn’t ask for this.
text level | R function |
---|---|
characters | nchar() |
words | quanteda::ntoken() |
sentences | quanteda::nsentence() |
Homework: read about the type/token distinction here and here.
#sentences
no_of_sentences = nsentence(er)
no_of_sentences
## text1
## 13
#words 1
no_of_words_1 = ntoken(er)
no_of_words_1
## text1
## 123
#words 2
no_of_words_2 = ntype(er)
no_of_words_2
## text1
## 72
Note: an often-used metric for “lexical diversity” is the TTR (type-token ratio).
string_a = "I didn’t ask for this. I didn’t want this."
string_b = "But I will finish it by striking back."
What are the type-token ratios of each string?
ntype(string_a)/ntoken(string_a)
## text1
## 0.6363636
ntype(string_b)/ntoken(string_b)
## text1
## 1
#average word length (characters per token)
nchar(er)/ntoken(er)
## text1
## 4.317073
#average sentence length (tokens per sentence)
ntoken(er)/nsentence(er)
## text1
## 9.461538
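The same kinds of counts can be reproduced outside quanteda; a minimal Python sketch with a naive regex tokeniser (an approximation of quanteda's tokeniser, not a match for it) gives the identical TTR for `string_a`:

```python
import re

def tokenize(text):
    # naive tokeniser: words (keeping internal apostrophes) plus punctuation marks
    return re.findall(r"[\w']+|[.,!?]", text.lower())

string_a = "I didn't ask for this. I didn't want this."
tokens = tokenize(string_a)
types = set(tokens)

ttr = len(types) / len(tokens)            # type-token ratio
avg_word_len = sum(len(t) for t in tokens) / len(tokens)
n_sentences = string_a.count(".")         # crude sentence count
avg_sent_len = len(tokens) / n_sentences  # tokens per sentence

print(len(tokens), len(types), round(ttr, 4))  # 11 tokens, 7 types
```

Note that punctuation marks count as tokens here, as they do in quanteda's default tokenisation.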
"I think I believe him"
text_id | I | think | believe | him |
---|---|---|---|---|
text1 | 2 | 1 | 1 | 1 |
example_string_tok = tokens("I think I believe him")
dfm(example_string_tok)
## Document-feature matrix of: 1 document, 4 features (0% sparse).
## 1 x 4 sparse Matrix of class "dfm"
## features
## docs i think believe him
## text1 2 1 1 1
Document-feature matrix with multiple documents
multiple_docs_tok = tokens(c("I think I believe him", "This is a cool function"))
dfm(multiple_docs_tok)
## Document-feature matrix of: 2 documents, 9 features (50% sparse).
## 2 x 9 sparse Matrix of class "dfm"
## features
## docs i think believe him this is a cool function
## text1 2 1 1 1 0 0 0 0 0
## text2 0 0 0 0 1 1 1 1 1
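Under the hood a dfm is just a table of per-document token counts; a language-agnostic sketch in Python (`collections.Counter` standing in for quanteda's `tokens()` + `dfm()`):

```python
from collections import Counter

docs = ["I think I believe him", "This is a cool function"]
tokenised = [doc.lower().split() for doc in docs]

# features: the union of all tokens, in order of first appearance
features = list(dict.fromkeys(t for toks in tokenised for t in toks))

# one row of counts per document -> the document-feature matrix
dfm = [[Counter(toks)[f] for f in features] for toks in tokenised]

sparsity = sum(row.count(0) for row in dfm) / (len(dfm) * len(features))
print(features)
print(dfm, sparsity)  # half the cells are zero: 50% sparse
```

The sparsity matches quanteda's report above: each document uses none of the other document's words, so half of the 2 × 9 cells are zero.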
“All I ever wanted was to love women, and in turn to be loved by them back. Their behavior towards me has only earned my hatred, and rightfully so! I am the true victim in all of this. I am the good guy. Humanity struck at me first by condemning me to experience so much suffering. I didn’t ask for this. I didn’t want this. I didn’t start this war. I wasn’t the one who struck first. But I will finish it by striking back. I will punish everyone. And it will be beautiful. Finally, at long last, I can show the world my true worth.”
The Industrial Revolution and its consequences have been a disaster for the human race. They have greatly increased the life-expectancy of those of us who live in “advanced” countries, but they have destabilized society, have made life unfulfilling, have subjected human beings to indignities, have led to widespread psychological suffering (in the Third World to physical suffering as well) and have inflicted severe damage on the natural world. The continued development of technology will worsen the situation.
mini_corpus = corpus(c(er, ub))
summary(mini_corpus)
## Corpus consisting of 2 documents:
##
## Text Types Tokens Sentences
## text1 72 123 13
## text2 63 88 3
##
## Source: /Users/bennettkleinberg/GitHub/ucl_aca_20182019/slides/* on x86_64 by bennettkleinberg
## Created: Sun Feb 3 18:37:47 2019
## Notes:
corpus_tokenised = tokens(mini_corpus)
corpus_dfm = dfm(corpus_tokenised)
knitr::kable(corpus_dfm[, 1:8])
document | all | i | ever | wanted | was | to | love | women |
---|---|---|---|---|---|---|---|---|
text1 | 2 | 10 | 1 | 1 | 1 | 3 | 1 | 1 |
text2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
knitr::kable(corpus_dfm[, 31:38])
document | am | the | true | victim | of | this | good | guy |
---|---|---|---|---|---|---|---|---|
text1 | 2 | 4 | 2 | 1 | 1 | 4 | 1 | 1 |
text2 | 0 | 7 | 0 | 0 | 3 | 0 | 0 | 0 |
Is this ideal?
topfeatures(corpus_dfm[1])
## . i , the this to and by me didn't
## 12 10 4 4 4 3 3 3 3 3
topfeatures(corpus_dfm[2])
## the have , to . of and
## 7 7 4 3 3 3 2
## in suffering world
## 2 2 2
Highly recommended: Vsauce on Zipf’s Law
document | and | in | turn | be | loved | by |
---|---|---|---|---|---|---|
text1 | 3 | 2 | 1 | 2 | 1 | 3 |
text2 | 2 | 2 | 0 | 0 | 0 | 0 |
Ideally, we want to “reward” words that occur often within a document but rarely across documents.
document | and | in | turn | be | loved | by |
---|---|---|---|---|---|---|
text1 | 0.024 | 0.016 | 0.008 | 0.016 | 0.008 | 0.024 |
text2 | 0.023 | 0.023 | 0.000 | 0.000 | 0.000 | 0.000 |
3/ntoken(mini_corpus[1])
## text1
## 0.02439024
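The proportions in the table above are simply each word's count divided by the document's token total; sketched in Python with the text1 counts taken from the dfm:

```python
# raw counts for text1 (123 tokens in total), taken from the dfm above
counts = {"and": 3, "in": 2, "turn": 1, "be": 2, "loved": 1, "by": 3}
total_tokens = 123

# relative term frequency: count / total tokens in the document
tf = {word: round(count / total_tokens, 3) for word, count in counts.items()}
print(tf)
```

This reproduces the text1 row of the proportional table (0.024, 0.016, 0.008, …).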
Term frequency: reward for words that occur often in a document.
Problem: some words just occur a lot anyway (e.g. “stopwords”).
Correct for global occurrence:
token | DF |
---|---|
and | 2 |
in | 2 |
turn | 1 |
be | 1 |
loved | 1 |
by | 1 |
Document frequency: number of documents with each token.
document | and | in | turn |
---|---|---|---|
text1 | 0.024 | 0.016 | 0.008 |
text2 | 0.023 | 0.023 | 0.000 |
Correct for global occurrences:
token | DF |
---|---|
and | 2 |
in | 2 |
turn | 1 |
#text1: "and"
0.024/2
#text2: "and"
0.023/2
#text1: "turn"
0.008/1
#text2: "turn"
0.000/1
\(TFIDF = TF/DF\) is equivalent to \(TFIDF = TF \cdot IDF\), since
\(IDF = 1/DF\)
The IDF is often modelled as \(IDF = \log(\frac{N}{DF})\), with \(N\) the number of documents.
Note: for the exact formula for the inverse DF refer to the quanteda docs.
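Both IDF variants can be checked by hand; a Python sketch using the text1 proportions and document frequencies from the tables above (quanteda's `dfm_tfidf` uses the log variant, with base 10 by default, per its docs):

```python
import math

# proportional term frequencies for text1 and document frequencies, from the slides
tf_text1 = {"and": 0.024, "in": 0.016, "turn": 0.008}
df = {"and": 2, "in": 2, "turn": 1}
n_docs = 2

# simple inverse: TF-IDF = TF / DF
tfidf_simple = {w: tf_text1[w] / df[w] for w in tf_text1}

# log-scaled inverse: TF-IDF = TF * log10(N / DF)
tfidf_log = {w: round(tf_text1[w] * math.log10(n_docs / df[w]), 3) for w in tf_text1}
print(tfidf_log)  # "and"/"in" (in every document) drop to 0; "turn" survives
```

With the log scheme, words that appear in every document get an IDF of \(\log(N/N) = 0\), which is exactly why “and” and “in” are zeroed out in the `dfm_tfidf` output below.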
knitr::kable(round(dfm_tfidf(corpus_dfm, scheme_tf = 'prop', scheme_df = 'inverse'), 3)[, 10:15])
document | and | in | turn | be | loved | by |
---|---|---|---|---|---|---|
text1 | 0 | 0 | 0.002 | 0.005 | 0.002 | 0.007 |
text2 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 |
library(qdap)
token |
---|
All |
I |
ever |
wanted |
was |
to |
love |
women |
POS tags depend on the POS framework used.
Commonly used: Penn Treebank Project
x | POS |
---|---|
All | determiner |
I | noun |
ever | adverb |
wanted | verb |
was | verb |
to | ? |
love | verb |
women | noun |
Tag | Description |
---|---|
CC | Coordinating conjunction |
CD | Cardinal number |
DT | Determiner |
EX | Existential there |
FW | Foreign word |
IN | Preposition or subordinating conjunction |
JJ | Adjective |
JJR | Adjective, comparative |
JJS | Adjective, superlative |
LS | List item marker |
MD | Modal |
NN | Noun, singular or mass |
qdap
er_ = "All I ever wanted was to love women"
pos_tagged = pos(er_)
pos_tagged$POStagged$POStagged
## [1] all/DT i/FW ever/RB wanted/VBD was/VBD to/TO love/VB women/NNS
## Levels: all/DT i/FW ever/RB wanted/VBD was/VBD to/TO love/VB women/NNS
pos(er, percent = F, progress.bar = F)$POSfreq
## wrd.cnt CC DT FW IN JJ MD NN NNS PRP PRP$ RB TO VB VBD VBG VBN VBP VBZ
## 1 106 4 10 1 14 8 4 17 1 7 3 10 3 9 6 2 2 3 1
## WP
## 1 1
pos(ub, percent = F, progress.bar = F)$POSfreq
## wrd.cnt CC DT IN JJ MD NN NNS PRP PRP$ RB TO VB VBN VBP WP
## 1 77 3 9 8 11 1 15 4 3 1 2 3 2 7 7 1
stopword |
---|
i |
me |
my |
myself |
we |
hers |
herself |
it |
its |
itself |
they |
You could decide to remove these…
With stopwords:
document | all | i | ever | wanted | was | to | love | women |
---|---|---|---|---|---|---|---|---|
text1 | 2 | 10 | 1 | 1 | 1 | 3 | 1 | 1 |
text2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
Without stopwords:
document | ever | wanted | love | women | , | turn | loved | back |
---|---|---|---|---|---|---|---|---|
text1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 2 |
text2 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
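Stopword removal itself is just a filter against a word list; a Python sketch with a tiny hand-picked list (quanteda's `stopwords('english')` list is far longer):

```python
# a tiny illustrative stopword list; quanteda's English list is far longer
stopwords = {"all", "i", "was", "to", "and", "in", "be", "by", "them"}

tokens = "All I ever wanted was to love women".lower().split()
content_tokens = [t for t in tokens if t not in stopwords]
print(content_tokens)  # the function words drop out
```

Only the content-bearing words survive, matching the “without stopwords” dfm columns above.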
love_stem = c("love", "loved", "loving", "lovely")
document | love | loved | loving | lovely |
---|---|---|---|---|
text1 | 1 | 0 | 0 | 0 |
text2 | 0 | 1 | 0 | 0 |
text3 | 0 | 0 | 1 | 0 |
text4 | 0 | 0 | 0 | 1 |
love_stem_tok = tokens(love_stem)
knitr::kable(dfm(love_stem_tok, stem = T))
document | love |
---|---|
text1 | 1 |
text2 | 1 |
text3 | 1 |
text4 | 1 |
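quanteda's `stem = T` wraps a real Porter-style stemmer (via the SnowballC package); the idea can be sketched with naive suffix stripping, keeping in mind that a stem need not be a dictionary word (the suffix list below is illustrative, not Porter's actual rules):

```python
def naive_stem(word):
    # strip the first matching suffix; a crude stand-in for a real stemmer
    for suffix in ("ing", "ely", "ed", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

love_stem = ["love", "loved", "loving", "lovely"]
stems = [naive_stem(w) for w in love_stem]
print(stems)  # all four forms collapse onto the single stem 'lov'
```

The point is the collapse: four distinct dfm columns become one, which is what the stemmed dfm above shows.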
From: (incl. stopwords and without stemming)
document | all | i | ever | wanted | was | to | love | women |
---|---|---|---|---|---|---|---|---|
text1 | 2 | 10 | 1 | 1 | 1 | 3 | 1 | 1 |
text2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
… to (without stopwords and stemmed)
document | ever | want | love | women | , | turn | back | . |
---|---|---|---|---|---|---|---|---|
text1 | 1 | 2 | 2 | 1 | 4 | 1 | 2 | 12 |
text2 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 3 |
No tutorial.
Homework: Text data 1 (to come)
Next week: Text data 2