Aims of this homework
- learn about the basics of supervised machine learning in R
- reproduce the fake news classification example
- use the caret interface
Task 1: Preparing the data
You will use the fake news dataset from the lecture. Load that dataset (located in ./data/fakenews_corpus.RData).
#your code here
Now you will need to extract features from the text. To do this, load the quanteda package and run the code below, which extracts unigrams, applies DFM trimming and binds the ‘outcome’ variable to the dataframe:
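library(quanteda)
# tokenise the corpus, then build a document-feature matrix of unigrams
corpus_tokenised = tokens(fakenews_corpus)
ngrams_extract_1 = dfm(x = corpus_tokenised
                       , ngrams = 1
                       , verbose = FALSE
                       , remove_punct = TRUE
                       , stem = FALSE
                       , remove = stopwords()
                       )
# trim rare features (sparsity threshold of 0.95)
ngrams_extract_1 = dfm_trim(ngrams_extract_1, sparsity = 0.95)
# convert to a dataframe and attach the class label as 'outcome'
fake_news_data = as.data.frame(ngrams_extract_1)
fake_news_data$outcome = fakenews_corpus$documents$veracity
# drop the first column (the document identifier)
fake_news_data = fake_news_data[, -1]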
Have a look at the first 10 rows and 10 columns of the dataframe fake_news_data:
#your code here
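As a reminder of how row and column indexing works, here is a minimal sketch on a made-up dataframe. The object `toy_data` and its values are purely illustrative stand-ins and are not the fake news data.

```r
# Toy dataframe standing in for fake_news_data (hypothetical values)
set.seed(1)
toy_data = as.data.frame(matrix(rpois(15 * 12, lambda = 1), nrow = 15, ncol = 12))
toy_data$outcome = rep(c("fake", "real", "fake"), 5)

# Rows are selected before the comma, columns after it
toy_subset = toy_data[1:10, 1:10]
dim(toy_subset)
```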
Task 2: Splitting the data
We covered in the lecture that you need to split your data into a training set and a test set. Load the caret package and use the createDataPartition function to split the data into a test set of 65% of the data and a training set of 35% of the data.
#your code here
Note: since the partitioning of the data involves random number generation, you will get different results every time you run this code. To avoid this (especially in your assignment), you can set a “seed” that ensures that the random number generation is identical if you run the code again. To do this, use the set.seed function BEFORE running the createDataPartition function.
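The interplay of set.seed and createDataPartition can be sketched on a toy dataset. The example below assumes the built-in iris data reduced to two classes as a stand-in for the fake news data; the seed value 2019 and all object names are arbitrary choices for illustration.

```r
library(caret)

# Toy two-class data: iris without the 'setosa' species (not the fake news data)
toy_data = droplevels(iris[iris$Species != "setosa", ])

# Fix the seed BEFORE partitioning so that the split is reproducible
set.seed(2019)
in_training = createDataPartition(y = toy_data$Species, p = 0.35, list = FALSE)

# Rows selected by the partition form the training set, the rest the test set
training_data = toy_data[in_training, ]
test_data = toy_data[-in_training, ]

nrow(training_data)
nrow(test_data)
```

Because createDataPartition samples within each class, the class proportions in both sets stay close to those of the full data.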
Task 3: Training your model
Use a linear Support Vector Machine algorithm and train your model on the training data.
#your code here
Task 4: Assessing your model
We discussed why you would want to assess your model on the held-out test data.
To illustrate the differences, evaluate your model on the TRAINING data using the predict function:
#your code here
Now do the same on the TEST data. Remember that the test data was never seen by the model and therefore did not occur in the training phase:
#your code here
What does this show you?
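One way to compare the two evaluations is sketched below on toy data. It assumes the two-class iris subset as a stand-in for the fake news data and that the kernlab package is installed (caret uses it for method = "svmLinear"); the object names are illustrative only.

```r
library(caret)

# Toy two-class data and a 35/65 training/test split (not the fake news data)
toy_data = droplevels(iris[iris$Species != "setosa", ])
set.seed(2019)
in_training = createDataPartition(y = toy_data$Species, p = 0.35, list = FALSE)
training_data = toy_data[in_training, ]
test_data = toy_data[-in_training, ]

# Linear SVM trained on the training set only
svm_model = train(Species ~ ., data = training_data, method = "svmLinear")

# Predictions on data the model has already seen
pred_training = predict(svm_model, newdata = training_data)
cm_training = confusionMatrix(data = pred_training, reference = training_data$Species)

# Predictions on the held-out test data
pred_test = predict(svm_model, newdata = test_data)
cm_test = confusionMatrix(data = pred_test, reference = test_data$Species)

cm_training$overall["Accuracy"]
cm_test$overall["Accuracy"]
```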
Task 5: Including training control parameters
We also covered why you would want to apply cross-validation to your model. Include a training control object with 10-fold cross-validation in your model training phase. Use the same training and test set but build a new model that includes the 10-fold cross-validation:
#your code here
Do these results differ from the model without cross-validation?
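A training control object of this kind might look as follows. This is a sketch on the two-class iris subset rather than the fake news data, again assuming kernlab is installed; `cv_control` and `svm_model_cv` are illustrative names.

```r
library(caret)

# Toy two-class data and a 35/65 training/test split (not the fake news data)
toy_data = droplevels(iris[iris$Species != "setosa", ])
set.seed(2019)
in_training = createDataPartition(y = toy_data$Species, p = 0.35, list = FALSE)
training_data = toy_data[in_training, ]

# 10-fold cross-validation, applied within the training set only
cv_control = trainControl(method = "cv", number = 10)

set.seed(2019)
svm_model_cv = train(Species ~ .
                     , data = training_data
                     , method = "svmLinear"
                     , trControl = cv_control
                     )

# One row of performance metrics per held-out fold
svm_model_cv$resample
```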
Task 6: Using k-fold cross-validation on the full dataset
You can also use k-fold cross-validation on the full dataset and iteratively use one fold as the test set (have a look at this SO post). Try to implement this in R with 10 folds (hint: you do not need the train/test partition).
#your code here
Now, if you try to evaluate the model, you will note that the predict function is of little help. This is because it would just apply the model to its own training data. Have a look at this:
#your code here
Instead, you want to retrieve the average performance across the held-out test sets of the 10 folds. You can do this by passing the model to the confusionMatrix function.
How do these results compare to the previous ones?
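The idea can be sketched on toy data as below: the model is trained with 10-fold cross-validation on all rows, and confusionMatrix is called on the fitted model itself. The iris subset and the object names are assumptions for illustration, and kernlab is assumed to be installed.

```r
library(caret)

# Toy two-class data used in full, without a train/test partition
toy_data = droplevels(iris[iris$Species != "setosa", ])

cv_control = trainControl(method = "cv", number = 10)

set.seed(2019)
svm_model_full = train(Species ~ .
                       , data = toy_data
                       , method = "svmLinear"
                       , trControl = cv_control
                       )

# Confusion matrix aggregated over the 10 held-out folds
cv_confusion = confusionMatrix(svm_model_full)
cv_confusion
```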
Task 7: Using different classification algorithms
Keep the meta parameters as in your second model (65/35 split, 10-fold CV on the training set) and use the k-nearest neighbour (kNN) classifier. Read about kNN here, here and here.
#your code here
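Switching the algorithm in caret mostly comes down to changing the method argument. The sketch below uses method = "knn" on the two-class iris subset as a stand-in for the fake news data; the object names are illustrative.

```r
library(caret)

# Toy two-class data and a 35/65 training/test split (not the fake news data)
toy_data = droplevels(iris[iris$Species != "setosa", ])
set.seed(2019)
in_training = createDataPartition(y = toy_data$Species, p = 0.35, list = FALSE)
training_data = toy_data[in_training, ]
test_data = toy_data[-in_training, ]

cv_control = trainControl(method = "cv", number = 10)

# Same setup as before, only the method changes
set.seed(2019)
knn_model = train(Species ~ .
                  , data = training_data
                  , method = "knn"
                  , trControl = cv_control
                  )

# Number of neighbours selected by the cross-validation
knn_model$bestTune

# Performance on the held-out test data
pred_test_knn = predict(knn_model, newdata = test_data)
confusionMatrix(data = pred_test_knn, reference = test_data$Species)
```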
Task 8: The data, data, data argument
You will often hear that more data is preferable to less data and that, especially in ML applications, you ideally need vast amounts of training data.
Look at the effect of the size of the training set by building a model with a 10/90, 20/80, 30/70, and 40/60 training/test set split.
#your code here
What do you observe?
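A loop over the training shares is one way to organise this comparison. The sketch below assumes the two-class iris subset instead of the fake news data, fits a linear SVM with a fixed cost C = 1 and no resampling to keep it short, and assumes kernlab is installed; all names are illustrative.

```r
library(caret)

# Toy two-class data (not the fake news data)
toy_data = droplevels(iris[iris$Species != "setosa", ])

# Share of the data used for training in each run
training_shares = c(0.10, 0.20, 0.30, 0.40)

test_accuracy = sapply(training_shares, function(share) {
  set.seed(2019)
  in_training = createDataPartition(y = toy_data$Species, p = share, list = FALSE)
  training_data = toy_data[in_training, ]
  test_data = toy_data[-in_training, ]

  # Single model per split: fixed cost parameter, no resampling
  model = train(Species ~ .
                , data = training_data
                , method = "svmLinear"
                , trControl = trainControl(method = "none")
                , tuneGrid = data.frame(C = 1)
                )

  # Proportion of correctly classified test cases
  predictions = predict(model, newdata = test_data)
  mean(predictions == test_data$Species)
})

names(test_accuracy) = c("10/90", "20/80", "30/70", "40/60")
test_accuracy
```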
END