Tutorial 4, Advanced Crime Analysis, BSc Security and Crime Science, UCL
Aim of this tutorial
You will use concepts learned in the lectures to:
- prepare data for supervised classification in R
- run supervised machine learning models in R
- assess the performance of the models
- apply unsupervised machine learning models in R
Task 1: Prepare data for supervised classification - predicting political polarity of news channels
(you’ll need the quanteda package for this one)
For the supervised ML part, you will use the dataset from the last tutorial on YouTube transcripts extracted from left-leaning and right-leaning news channels. The provided dataset contains the transcripts of 2000 YouTube videos each from FoxNews (a right-leaning US news channel) and from The Young Turks (a left-leaning US news outlet).
Load the original dataframe called media_data from data/media_data.RData.
#your code here
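As a minimal sketch of how `load()` works: it restores saved objects under their original names and returns those names. The toy data frame, its column names, and the temporary file below are placeholders invented for illustration; in the tutorial you would point `load()` at data/media_data.RData instead.

```r
# Toy stand-in for the real data (column names are hypothetical)
media_data <- data.frame(
  text = c("first transcript", "second transcript"),
  channel = c("fox", "tyt"),
  stringsAsFactors = FALSE
)

# Save to a temporary .RData file, remove the object, then load it back
tmp_file <- tempfile(fileext = ".RData")
save(media_data, file = tmp_file)
rm(media_data)

# load() restores the object and returns the names of what it restored
loaded_names <- load(tmp_file)
print(loaded_names)
str(media_data)
```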
Build an n-gram model that contains unigrams and bigrams, and reduce sparsity by keeping only tokens that occur in at least 10% of the documents. Make sure to remove all punctuation, numbers and symbols.
#your code here
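One possible way to approach this with quanteda is sketched below on three toy sentences. The toy texts and object names are assumptions made for illustration; with the real data you would pass the transcript column of media_data to `tokens()`. With only three documents the 10% threshold removes nothing, but the same call applies to the full corpus.

```r
library(quanteda)

# Toy transcripts (hypothetical; the real texts live in media_data)
toy_text <- c(
  d1 = "Taxes went up in 2018, the senator said.",
  d2 = "The senator said taxes are too high!",
  d3 = "Healthcare costs more now; the senator said so."
)

# Tokenise and drop punctuation, numbers and symbols
toks <- tokens(toy_text,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# Unigrams and bigrams in one step
toks_ngrams <- tokens_ngrams(toks, n = 1:2)

# Document-feature matrix (features are lowercased by default)
ngram_dfm <- dfm(toks_ngrams)

# Keep only features that occur in at least 10% of the documents
ngram_dfm <- dfm_trim(ngram_dfm, min_docfreq = 0.1, docfreq_type = "prop")

dim(ngram_dfm)
head(featnames(ngram_dfm))
```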
Task 2: Predict whether a transcript comes from a right or left-leaning YouTube channel
Step 1: split the data
(you need the caret package for this one)
#your code here
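A minimal sketch of a stratified split with caret's `createDataPartition()`, shown on simulated data. The toy data frame, the column names and the 80/20 ratio are assumptions for illustration, not values prescribed by the tutorial.

```r
library(caret)

set.seed(123)

# Toy feature table with a two-class outcome (hypothetical stand-in for the
# n-gram features plus a channel label)
toy_data <- data.frame(
  feat_a = rnorm(100),
  feat_b = rnorm(100),
  outcome = factor(rep(c("left", "right"), each = 50))
)

# Stratified split: class proportions are preserved in both sets
in_train <- createDataPartition(y = toy_data$outcome, p = 0.8, list = FALSE)

training_data <- toy_data[in_train, ]
test_data <- toy_data[-in_train, ]

# Rows are documents; columns are the features plus the outcome column
dim(training_data)
dim(test_data)
table(training_data$outcome)
```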
Have a look at the dimensions of your data. How many features are there?
#your code here
Step 2: set your training controls
Here, you can use k-fold cross-validation with a high k (e.g. 20).
Make sure to specify classProbs = T since we need this for later Area Under the Curve calculations.
#your code here
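A sketch of what such training controls could look like with caret's `trainControl()`. The object name is an arbitrary choice. Note that with class probabilities switched on, caret expects the outcome's factor levels to be valid R variable names (e.g. "left" and "right" rather than 0 and 1).

```r
library(caret)

# 20-fold cross-validation with class probabilities switched on
training_controls <- trainControl(
  method = "cv",
  number = 20,
  classProbs = TRUE
)

training_controls$method
training_controls$number
training_controls$classProbs
```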
Step 3: train the model
You can use a Linear SVM, for example.
#your code here
Step 4: fit the model
#your code here
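Steps 3 and 4 can be sketched together on simulated data. This sketch reads "fit the model" as applying the trained model to the held-out test set, which is an assumption. The toy data, the object names and the reduced number of folds are also illustrative choices, and caret's "svmLinear" method assumes the kernlab package is installed.

```r
library(caret)

set.seed(123)

# Toy two-class data with well-separated groups (hypothetical)
n <- 100
toy_data <- data.frame(
  feat_a = c(rnorm(n, mean = -2), rnorm(n, mean = 2)),
  feat_b = c(rnorm(n, mean = -2), rnorm(n, mean = 2)),
  outcome = factor(rep(c("left", "right"), each = n))
)

in_train <- createDataPartition(toy_data$outcome, p = 0.8, list = FALSE)
training_data <- toy_data[in_train, ]
test_data <- toy_data[-in_train, ]

# 5 folds keep the toy example quick; the tutorial suggests a higher k
training_controls <- trainControl(method = "cv", number = 5, classProbs = TRUE)

# Linear SVM via caret (method "svmLinear" relies on kernlab)
svm_model <- train(
  outcome ~ .,
  data = training_data,
  method = "svmLinear",
  trControl = training_controls
)

# Apply the trained model to the held-out test set
test_predictions <- predict(svm_model, newdata = test_data)

table(predicted = test_predictions, actual = test_data$outcome)
```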
Task 4: Unsupervised learning on tech titles
Load the data.frame tech_titles from the tech_titles.RData object located in the ./data directory. These data are all titles of articles written on the two major tech websites VentureBeat and TechCrunch in 2017 (dataset details on Kaggle).
Your task is to represent these titles as unigrams and find out whether there are clusters in the data.
Step 1: Load the data
#your code here
Step 2: Create the unigrams
(apply preprocessing where you think this is necessary)
#your code here
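A minimal sketch of unigram preprocessing with quanteda, shown on three made-up titles. The toy titles and the choice to remove English stopwords are assumptions; which preprocessing steps are appropriate for the real titles is for you to decide.

```r
library(quanteda)

# Toy titles (hypothetical; the real titles live in tech_titles)
toy_titles <- c(
  "Apple unveils the new iPhone",
  "Google and Apple invest in AI startups",
  "The startup that raised millions for AI chips"
)

# Tokenise and drop punctuation, numbers and symbols
toks <- tokens(toy_titles,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# Drop common English function words (an optional preprocessing choice)
toks <- tokens_remove(toks, stopwords("en"))

# Unigram document-feature matrix; dfm() lowercases by default
unigram_dfm <- dfm(toks)

dim(unigram_dfm)
topfeatures(unigram_dfm, 5)
```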
Step 3: Determine the number of clusters
Use the elbow method:
(note: you will get an error here, try to figure out why and solve it!)
#your code here
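The elbow method itself can be sketched in base R on a plain numeric matrix. The simulated matrix below is an assumption standing in for the unigram counts, so this sketch does not reproduce the error mentioned in the note above.

```r
set.seed(123)

# Toy numeric matrix with three separated groups (stand-in for unigram counts)
toy_matrix <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(toy_matrix, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where adding clusters stops reducing the WSS substantially
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```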
Step 4: Build the final model
#your code here
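Once a value of k has been chosen, the final model is a single `kmeans()` call. The sketch below uses simulated data and k = 2 purely for illustration; the appropriate k for the titles is whatever your elbow plot suggests.

```r
set.seed(123)

# Toy numeric matrix with two separated groups (hypothetical)
toy_matrix <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# Several random starts make the solution less dependent on initialisation
final_model <- kmeans(toy_matrix, centers = 2, nstart = 10, iter.max = 50)

final_model$size          # number of rows per cluster
final_model$centers       # cluster centroids
head(final_model$cluster) # cluster membership of the first rows
```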
Step 5: Interpret the cluster membership
Tip:
- assign the cluster membership to a column in the original dataframe
- then aggregate the unigram frequencies by cluster
- this returns the average frequency per unigram per cluster
- now sort the frequencies per cluster separately to see what the clusters are about
#your code here
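The steps in the tip can be sketched in base R as follows. The toy count table, its unigram columns and the choice of two clusters are invented for illustration; here the cluster column is added to the count table itself rather than to tech_titles.

```r
# Toy unigram counts: one row per title, one column per unigram (hypothetical)
unigram_df <- data.frame(
  apple   = c(2, 1, 0, 0, 3, 0),
  iphone  = c(1, 2, 0, 0, 1, 0),
  funding = c(0, 0, 2, 1, 0, 3),
  startup = c(0, 0, 1, 2, 0, 1)
)

set.seed(123)
final_model <- kmeans(unigram_df, centers = 2, nstart = 10)

# 1. Store the cluster membership alongside the original rows
unigram_df$cluster <- final_model$cluster

# 2. Average frequency of each unigram within each cluster
feature_cols <- setdiff(names(unigram_df), "cluster")
cluster_means <- aggregate(unigram_df[feature_cols],
                           by = list(cluster = unigram_df$cluster),
                           FUN = mean)

# 3. Sort the unigrams of each cluster by average frequency, highest first
top_terms <- lapply(seq_len(nrow(cluster_means)), function(i) {
  sort(unlist(cluster_means[i, feature_cols]), decreasing = TRUE)
})
names(top_terms) <- paste0("cluster_", cluster_means$cluster)
top_terms
```

Reading the highest-ranked unigrams of each cluster side by side gives a first impression of what each cluster is about.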
END