## Aims of this homework

• learn about the basics of supervised machine learning in R
• reproduce the fake news classification example
• use the caret interface

## Task 1: Preparing the data

You will use the fake news dataset from the lecture. Load that dataset (located in ./data/fakenews_corpus.RData).
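Loading an .RData file restores the objects stored in it into your workspace under their original names; a minimal sketch, assuming the file sits in ./data/ relative to your working directory:

```r
# Restores the object(s) saved in the file (here: fakenews_corpus)
# into the current workspace under their original names.
load("./data/fakenews_corpus.RData")
```

Note that load() does not return the data as a value; after the call, the object fakenews_corpus is simply available by name.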

Now you will need to extract features from the text. To do this, load the quanteda package and run the code below, which extracts unigrams, applies DFM trimming, and binds the ‘outcome’ variable to the dataframe:

library(quanteda)
corpus_tokenised = tokens(fakenews_corpus)
ngrams_extract_1 = dfm(x = corpus_tokenised
                       , ngrams = 1
                       , verbose = F
                       , remove_punct = T
                       , stem = F
                       , remove = stopwords()
                       )
ngrams_extract_1 = dfm_trim(ngrams_extract_1, sparsity = 0.95)
fake_news_data = as.data.frame(ngrams_extract_1)
fake_news_data$outcome = fakenews_corpus$documents$veracity
fake_news_data = fake_news_data[, -1]

Have a look at the first 10 rows and 10 columns of the dataframe fake_news_data:

fake_news_data[1:10, 1:10]

## Task 2: Splitting the data

We covered in the lecture that you need to split your data into a training set and a test set. Load the caret package and use the createDataPartition function to split the data into a training set of 65% of the data and a test set of 35% of the data.

#your code here
library(caret)

Loading required package: lattice
Loading required package: ggplot2

Note: since the partitioning of the data involves a random number initialisation, you will get different results every time you run this code. To avoid this (especially in your assignment), you can set a “seed” that ensures that the random number generation is identical each time you run the code. To do this, call the set.seed function BEFORE running the createDataPartition function.

## Task 3: Training your model

Use a linear Support Vector Machine algorithm and train your model on the training data.

fakenews_model_1 = train(outcome ~ .
                         , data = training_data
                         , method = "svmLinear"
                         )

## Task 4: Assessing your model

We discussed why you would want to assess your model on the held-out test data. To illustrate the difference, first evaluate your model on the TRAINING data using the predict function:

model_1.predictions_trainingset = predict(fakenews_model_1, training_data)
table(training_data$outcome, model_1.predictions_trainingset)

       model_1.predictions_trainingset
        fake real
  fake   650    0
  real     0  650
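To condense such a confusion table into a single number, divide the correctly classified cases (the diagonal) by the total. A small base-R sketch, where accuracy_from_table is our own helper name, not part of caret:

```r
# Hypothetical helper: overall accuracy from a confusion table
# (correct predictions sit on the diagonal).
accuracy_from_table = function(tab) sum(diag(tab)) / sum(tab)

# The training-set table from above, rebuilt by hand:
tab = matrix(c(650, 0, 0, 650), nrow = 2,
             dimnames = list(c("fake", "real"), c("fake", "real")))
accuracy_from_table(tab)  # 1: the model classifies its own training data perfectly
```

An accuracy of 1 on the training set is exactly the kind of overly optimistic estimate that motivates evaluating on held-out data instead.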

Now do the same on the TEST data. Remember that the test data was never seen by the model and was therefore not used in the training phase:

model_1.predictions_testset = predict(fakenews_model_1, test_data)
table(test_data$outcome, model_1.predictions_testset)

       model_1.predictions_testset
        fake real
  fake   293   57
  real    64  286

What does this show you?

## Task 5: Including training control parameters

We also covered why you would want to apply cross-validation to your model. Include a training control object in the model training phase with 10-fold cross-validation. Use the same training and test set, but build a new model that includes the 10-fold cross-validation:

training_controls = trainControl(method = "cv"
                                 , number = 10
                                 )
fakenews_model_2 = train(outcome ~ .
                         , data = training_data
                         , trControl = training_controls
                         , method = "svmLinear"
                         )
model_2.predictions_testset = predict(fakenews_model_2, test_data)
table(test_data$outcome, model_2.predictions_testset)

       model_2.predictions_testset
        fake real
  fake   293   57
  real    64  286

Do these results differ from the model without cross-validation?

table(test_data$outcome, model_2.predictions_testset) == table(test_data$outcome, model_1.predictions_testset)

       model_2.predictions_testset
        fake real
  fake  TRUE TRUE
  real  TRUE TRUE

## Task 6: Using k-fold cross-validation on the full dataset

You can also use k-fold cross-validation on the full dataset and iteratively use one fold as the test set (have a look at this SO post). Try to implement this in R with 10 folds (hint: you do not need the train/test partition).

#your code here
fakenews_model_3 = train(outcome ~ .
                         , data = fake_news_data
                         , trControl = training_controls  # reuses the 10-fold CV object from Task 5
                         , method = "svmLinear"
                         )


Now if you try to evaluate this model, you will note that the predict function is of little help: it would just apply the model to its own training data. Have a look at this:

model_3.predictions = predict(fakenews_model_3, fake_news_data)
table(fake_news_data$outcome, model_3.predictions)

       model_3.predictions
        fake real
  fake  1000    0
  real     1  999

Instead, you want to retrieve the average performance across the test sets of the 10 folds. You can do this by calling the confusionMatrix function with the model as its argument.

confusionMatrix(fakenews_model_3)

Cross-Validated (10 fold) Confusion Matrix

(entries are percentual average cell counts across resamples)

          Reference
Prediction fake real
      fake 40.5  9.8
      real  9.4 40.2

 Accuracy (average) : 0.8075

How do these results compare to the previous ones?

## Task 7: Using different classification algorithms

Keep the meta parameters as in your second model (65/35 split, 10-fold CV on the training set) and use the k-nearest neighbour (kNN) classifier. Read about kNN here, here and here.

#your code here
table(test_data$outcome, model_4.predictions_testset)

       model_4.predictions_testset
        fake real
  fake   332   18
  real   230  120
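One way such a model_4 could be built, sketched under the assumption that training_data, test_data, and the training_controls object from Task 5 are still in the workspace (caret's "knn" method; only the use of kNN here is the exercise's requirement, the exact call is our sketch):

```r
# Same 65/35 split and 10-fold CV as model_2, but with a kNN classifier.
fakenews_model_4 = train(outcome ~ .
                         , data = training_data
                         , trControl = training_controls
                         , method = "knn"
                         )
model_4.predictions_testset = predict(fakenews_model_4, test_data)
```

By default, caret tunes the number of neighbours k over a small grid; you can set it explicitly via the tuneGrid argument, e.g. tuneGrid = data.frame(k = 5).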

## Task 8: The data, data, data argument

You will often hear that more data is preferable to less and that, especially in ML applications, you ideally need vast amounts of training data.

Look at the effect of the size of the training set by building a model for each of a 10/90, 20/80, 30/70, and 40/60 training/test set split.

set.seed(1234)
in_training = createDataPartition(y = fake_news_data$outcome
                                  , p = .30
                                  , list = FALSE
                                  )
training_data = fake_news_data[ in_training,]
test_data = fake_news_data[-in_training,]

training_controls = trainControl(method = "cv"
                                 , number = 5
                                 )

fakenews_model_3070 = train(outcome ~ .
                            , data = training_data
                            , trControl = training_controls
                            , method = "svmLinear"
                            )

model_3070.predictions_testset = predict(fakenews_model_3070, test_data)
table(test_data$outcome, model_3070.predictions_testset)

       model_3070.predictions_testset
        fake real
  fake   561  139
  real   189  511

What do you observe?
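The four splits above can also be run in one loop; a sketch assuming caret is loaded and fake_news_data exists (split_results is our own name, and the 30/70 case reproduces the model shown above):

```r
# Train one SVM per training proportion and record test-set accuracy.
split_results = list()
for (p in c(.10, .20, .30, .40)) {
  set.seed(1234)  # same seed per split, so results are reproducible
  in_training = createDataPartition(y = fake_news_data$outcome, p = p, list = FALSE)
  training_data = fake_news_data[ in_training,]
  test_data    = fake_news_data[-in_training,]
  model = train(outcome ~ .
                , data = training_data
                , trControl = trainControl(method = "cv", number = 5)
                , method = "svmLinear"
                )
  preds = predict(model, test_data)
  split_results[[as.character(p)]] = mean(preds == test_data$outcome)
}
split_results  # test-set accuracy per training proportion
```

Comparing the four accuracies shows directly how much (or how little) the extra training data buys you on this dataset.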