## Aims of this homework

• learn about the basics of supervised machine learning in R
• reproduce the fake news classification example
• use the caret interface

## Task 1: Preparing the data

You will use the fake news dataset from the lecture. Load that dataset (located in ./data/fakenews_corpus.RData.

#your code here


Now you will need to extract features from the text. To do this, load the quanteda package, and run the code below that will extract unigrams, apply DFM trimming and bind the ‘outcome’ variable to the datafram:

library(quanteda)
corpus_tokenised = tokens(fakenews_corpus)
ngrams_extract_1 = dfm(x = corpus_tokenised
, ngrams = 1
, verbose = F
, remove_punct = T
, stem = F
, remove = stopwords()
)
ngrams_extract_1 = dfm_trim(ngrams_extract_1, sparsity = 0.95)
fake_news_data = as.data.frame(ngrams_extract_1)
fake_news_data$outcome = fakenews_corpus$documents\$veracity
fake_news_data = fake_news_data[, -1]

Have a look at the first 10 rows and 10 columns of the dataframe fake_news_data:

#your code here


## Task 2: Splitting the data

We covered in the lecture, that you would need to split your data into a training set and a test set. Load the caret package and use the createDataPartition function to split the data into a test set of 65% of the data and a training set of 35% of the data.

#your code here


Note: since the partitioning of the data involves a random number initialisation, you will get different results every time you run this code. To avoid this (especially in your assignment), you can set a “seed” that ensures that the random number generation is identical if you run the code again. To do this, use the set.seed function BEFORE running the createDataPartition function.

Use a linear Support Vector Machine algorithm and train your model on the training data.

#your code here


We discussed why you would want to assess your model on the held-out test data.

To illustrate the differences, evaluate your model on the TRAINING data using the predict function:

#your code here


Now do the same on the TEST data. Remember that the test data was never seen by the model and did therefore not occur in the training phase:

#your code here


What does this show you?

## Task 5: Including training control parameters

We also covered why you would want to apply cross-validation on your model. Include a training control object in your model training phase with a 10 fold cross-validation. Use the same training and test set but build a new model that includes the 10-fold cross-validation:

#your code here


Do these results differ from the model without cross-validation?

## Task 6: Using k-fold cross-validation on the full dataset

You can also use k-fold cross-validation on the full dataset and iteratively use one fold as the test set (have a look at this SO post. Try to implement this in R with 10 folds (hint: you do not need the train/test partition).

#your code here


Now if you try to evaluate the model, you will not that the predict function is of little help. This is because it would just fit a model to its own training data. Have a look at this:

#your code here


Instead, you want to retrieve the average performance of each of the test sets of the 10 folds. You can do this by using the confusionMatrix function with the model as parameter.

How do these results compare to the previous ones?

## Task 7: Using different classification algorithms

Keep the meta parameters as in your second model (65/35 split, 10 fold CV on the training set) and use the k-nearest neighbour (kNN) classifier. Read about kNN here, here and here.

#your code here


## Task 8: The data, data, data argument

You will often hear that more data is preferable over fewer data and that especially in ML applications, you need ideally vast amounts of training data.

Look at the effect of the size of the training set by building a model with a 10/90, 20/80, 30/70, and 40/60 training/test set split.

#your code here


What do you observe?