Abstract

Predicting spam messages is a machine learning algorithm that most email providers need to include to protect their users. Five classification methods are used to identify spam emails from non-spam using word frequency. Logistic lasso regression proves to be the most reliable and accurate method.

Data

The “Spam Mails Dataset” is publicly available on Kaggle and contains the modified text of 500 spam emails and 2,500 non-spam emails (Garne, 2019). Data cleaning was conducted using a guide by Shreyas Khades (n.d.), and the steps included stemming (retrieving the root of a word), removing common stop words (such as “the” and “to”), and creating predictor variables of word frequencies. One disadvantage of the data was the number of spam emails that were either empty or in a different language, which changed the ratio between spam and non-spam emails we were able to use.

# Loading email data set: consists of "text data" and "dummy variable"
email_df <- data.table::fread( "spam_or_not_spam.csv", header = T )

# Randomizing data
set.seed(46234)
email_df <- email_df[ sample( 1:nrow( email_df ) ),  ]

# Factoring dummy variable
email_df$label <- factor( email_df$label )

Raw Data

library(DT)
email_df2 = data.frame(
  Type = ifelse(email_df$label == 1, "Spam", "Non-Spam"),
  "Raw Email Text" = email_df$email
  )
datatable(email_df2, rownames = TRUE, filter="top", options = list(pageLength = 10, scrollX=T) )

# datatable(..., class = 'white-space: nowrap')

Packages used

library(pacman)
p_load(
  
  # Basics
  here, skimr, dplyr, stringr, fastverse, disk.frame,
  
  # Visualizing
  ggplot2, ggthemes, ggthemes, wordcloud, RColorBrewer,
  
  # Text Processing
  tm, SnowballC, 
  
  # Modelling
  e1071,naivebayes, tidymodels, gridExtra, caret, ranger, 
  
  # Knitting
  knitr, kableExtra, DT, shiny, equatiomatic
  
 )

Preprocessing & cleaning (stemming)

To retrieve the root of a word (eg, doing -> do), options are “stemming” &“lemmatization”.

STEMMING: faster but maybe not as effective
LEMMATIZATION: slower but more effective
More on this here

VectorSource(): creates one document for each email
Vcorpus(): creates a volatile corpus from these individual emails

email_corpus <- VCorpus(
                  VectorSource(
                      email_df$email
                      )
                  )
# Using `tm` package to stem email content
email_corpus <- tm::tm_map( email_corpus, 
                            tm::stemDocument )

# Removing puctuations
email_corpus = tm_map( email_corpus, 
                       removePunctuation )

# Removing stopwords
email_corpus <- tm_map( email_corpus, 
                       removeWords, stopwords( "en" ) )

# Removing two most frequent stopwords: "NUMBER", "URL"
 email_corpus <- tm_map( email_corpus,
 removeWords, c("NUMBER", "number", "url", "URL") )
 
# DocumentTermMatrix(): tokenize the email corpus.
email_dtm <- tm::DocumentTermMatrix( sample( email_corpus, length( email_corpus ) ) )

Visualizations

Using wordclouds, visualize text data after cleaning and pre-processing

# Preprocessed data for visualizations
reverse_email <- data.frame(
                text = sapply( email_corpus, as.character ), 
                stringsAsFactors = FALSE, 
                type = email_df$label
                )

Most frequent words in all emails

# All emails
wordcloud( reverse_email$text, 
           max.words = 150, 
           colors = brewer.pal( 7, "Dark2" ), 
           random.order = FALSE )

Most frequent words in spam

# Subsetting to spam == 1
spam <- reverse_email %>% filter( type == 1 )
# layout(matrix(c(1, 2), nrow=2), heights=c(1, 8))
# par(mar=rep(0, 4))
# plot.new()
# text(x=0.5, y=0.5, cex = 1.5, offset = 0.5, "Most frequent words in spam")
wordcloud( spam$text, 
           max.words = 150, 
           colors = brewer.pal( 7, "Dark2" ), 
           random.order = FALSE,
           main = "Spam")

Most frequent words in non-spam

# Subsetting to spam == 0
ham <- reverse_email %>% filter( type == 0 )
wordcloud( ham$text, 
           max.words = 150, 
           colors = brewer.pal( 7, "Dark2" ), 
           random.order = FALSE,
           main = "Non-Spam")

Splitting Data

Earlier we randomly sorted the data, so we can use the indices to split 80% training and 20% testing.

# Split 80% training, 20% testing
email_dtm_train <- email_dtm[   1:2400, ]      
email_dtm_test  <- email_dtm[2401:3000, ]       

# Add labels for convenience
email_train_label <- email_df[   1:2400, ]$label
email_test_label  <- email_df[2401:3000, ]$label

Check split proportions

# Create table
prop_table = data.frame(c(prop.table( table( email_train_label ) )[[2]], #Train
                          prop.table( table( email_train_label ) )[[1]]),
                        c(prop.table( table( email_test_label ) )[[2]], # Test
                          prop.table( table( email_test_label ) )[[1]])
                        )

# Add table headings
rownames(prop_table) = c("Spam", "Non-Spam")
names(prop_table) = c("Train", "Test")

# View table
kable(prop_table, digits = 3) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Train	Test
Spam	0.17	0.152
Non-Spam	0.83	0.848

Trimming: reducing the number of features

There are currently 25,050 variables, which is most likely way too many! So, define a threshold (eg. 1 == 1%) and reduce the number of features used.

Goal: Eliminate words that appear in only 10% of records in the training data

min_freq <- round( email_dtm$nrow * ( ( threshold = 10.0 ) / 100 ), 0 )        # using 10%

Filter to frequent words

# Create vector of most frequent words
freq_words <- findFreqTerms( x = email_dtm, 
                             lowfreq = min_freq )

# Filter the DTM
email_dtm_freq_train <- email_dtm_train[ , freq_words]
email_dtm_freq_test  <- email_dtm_test[ , freq_words]

The list of our most frequent words

# Create table
freq_words_plus_1 =  c(freq_words, "")
freq_words_dt = data.frame(
  c(freq_words_plus_1[seq( 1+0*1,16*1)]),
  c(freq_words_plus_1[seq(1*16+1,16*2)]),
  c(freq_words_plus_1[seq(2*16+1,16*3)]),
  c(freq_words_plus_1[seq(3*16+1,16*4)]),
  c(freq_words_plus_1[seq(4*16+1,16*5)]),
  c(freq_words_plus_1[seq(5*16+1,16*6)]),
  c(freq_words_plus_1[seq(6*16+1,16*7)]),
  c(freq_words_plus_1[seq(7*16+1,16*8)]),
  c(freq_words_plus_1[seq(8*16+1,16*9)]),
  c(freq_words_plus_1[seq(9*16+1,16*10)]),
  c(freq_words_plus_1[seq(10*16+1,16*11)]),
  c(freq_words_plus_1[seq(11*16+1,16*12)]),
  c(freq_words_plus_1[seq(12*16+1,16*13)])
)
names(freq_words_dt)[] = ""
# View table
kable(freq_words_dt, digits = 3) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  column_spec (1:13,border_left = T, border_right = T) %>% 
  row_spec(1:16, extra_css = c("border-bottom: 1px solid", 
                               "border-top: 1px solid")) %>% 
  add_header_above()


actual	can	differ	found	instal	long	much	onli	problem	right	site	test	want
address	case	doe	free	interest	look	must	order	process	rpm	softwar	thank	way
also	chang	doesn	get	internet	lot	name	origin	product	run	someth	thing	web
american	check	don	give	invest	made	nation	packag	program	said	spam	think	week
ani	click	email	good	issu	mail	need	part	provid	say	spamassassin	time	well
anoth	code	end	got	just	make	net	peopl	put	secur	sponsor	today	whi
anyon	com	error	govern	keep	manag	network	perl	rate	see	start	trade	will
back	come	even	great	know	mani	never	person	razor	seem	state	tri	window
base	compani	everi	group	last	market	new	phone	read	send	still	two	within
becaus	comput	exmh	help	life	may	next	place	real	sent	subject	type	without
befor	countri	file	high	like	mean	now	pleas	realli	sep	support	unit	work
best	current	find	home	line	messag	numbertnumber	point	receiv	server	sure	unsubscrib	world
better	data	first	hyperlink	link	might	offer	post	releas	servic	system	use	write
build	date	follow	idea	linux	million	old	power	remov	set	take	user	wrote
busi	day	fork	includ	list	money	onc	price	report	show	talk	veri	year
call	develop	form	inform	live	month	one	probabl	requir	sinc	technolog	version

Cleaned dataset

# Simple dummy transformation fn.
convert_values <- function(x){
                    x = ifelse( x > 0, "Yes", "No" )
}

# Declaring final `train` and `test` datasets
email_train <- apply( email_dtm_freq_train, MARGIN = 2,
                      convert_values )
email_test  <- apply( email_dtm_freq_test, MARGIN = 2,
                      convert_values )

Preview data

# View data
datatable(email_train, rownames = FALSE, filter="none", options = list(pageLength = 5, scrollX=T) )

Methods

The five machine learning methods used for predictions were the Naive Bayes classifier, lasso regression, logistic regression, logistic lasso regression, and a random forest. In lasso, logistic lasso, and logistic regressions, the penalty was tuned to minimize the mean squared error (MSE). For the random forest, we used 200 trees.

# Function to create confusion matrix
## Source https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package

# Draw confusion matric
draw_confusion_matrix <- function(cm) {
  total <- sum(cm$table)
  res <- as.numeric(cm$table)
  
  # Generate color gradients. Palettes come from RColorBrewer.
  greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
  redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#67000D")
  getColor <- function (greenOrRed = "green", amount = 0) {
    if (amount == 0)
      return("#FFFFFF")
    palette <- greenPalette
    if (greenOrRed == "red")
      palette <- redPalette
    colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
  }
  
  # Set the basic layout
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('Confusion Matrix', cex.main=2)
  
  # Create the matrix 
  classes = colnames(cm$table)
  rect(150, 430, 240, 370, col=getColor("green", res[1]))
  text(195, 435, "Non-Spam", cex=1.2)
  rect(250, 430, 340, 370, col=getColor("red", res[3]))
  text(295, 435, "Spam", cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col=getColor("red", res[2]))
  rect(250, 305, 340, 365, col=getColor("green", res[4]))
  text(140, 400, "Non-Spam", cex=1.2, srt=90)
  text(140, 335, "Spam", cex=1.2, srt=90)
  
  # Add in the cm results
  text(195, 400, res[1], cex=1.6, font=2, col='black')
  text(195, 335, res[2], cex=1.6, font=2, col='black')
  text(295, 400, res[3], cex=1.6, font=2, col='black')
  text(295, 335, res[4], cex=1.6, font=2, col='black')
  
  # Add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "Metrics", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
  
  # Add in accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}

Naive Bayes

# Create model from the training dataset
bayes_classifier <- e1071::naiveBayes( email_train, 
                                       email_train_label )

# Predicting on test data
email_test_pred <- predict( bayes_classifier, 
                            email_test )

Naive Bayes Results

# Display results
naive_bayes_results = confusionMatrix( data = email_test_pred, 
                                       reference = email_test_label,
                                       positive = "1", 
                                       dnn = c("Prediction", "Actual") )

draw_confusion_matrix(naive_bayes_results)

Lasso

# Splitting for 5-fold cross-validation
folds <- email_train %>% vfold_cv(v = 5)

# Defining Lambdas (from Lecture 005)
lambdas <- data.frame( penalty = c( 0, 10^seq( from = 5, to = -2, length = 100 ) ) )

# Defining the recipe
data_recipe <- recipe(
                email_train_label ~ ., 
                data = email_train 
                ) %>% 
              update_role(V1, new_role = 'id variable')  %>% 
              step_dummy(all_nominal(), - all_outcomes())

# Defining Lasso Model
lasso <- linear_reg(
          penalty = tune(), 
          mixture = 1) %>% 
          set_engine("glmnet")

# Setting up workflow
workflow_lasso <- workflow() %>%
                    add_recipe( data_recipe ) %>%
                    add_model( lasso )

# Parallelize 
doParallel::registerDoParallel(cores = 4)

# CV
lasso_cv <- workflow_lasso %>% 
              tune_grid(
                resamples = folds,
                grid = lambdas,
                metrics = metric_set(rmse, mae)
              )

Lasso Results

# Find best models            
## Source: juliasilge.com/blog/lasso-the-office/

# Graph results
lasso_cv %>% collect_metrics() %>%
            ggplot(aes(penalty, mean, color = .metric)) +
            geom_errorbar(aes(
              ymin = mean - std_err,
              ymax = mean + std_err
            ),
            alpha = 0.5
            ) +
            geom_line(size = 1.5) +
            facet_wrap(~.metric, scales = "free", nrow = 2) +
            theme(legend.position = "none") + theme_base() + 
            scale_x_log10()

# Get best penalties
best_lasso_mae = lasso_cv %>% show_best(metric = "mae") %>% filter(penalty == min(penalty))
best_lasso_rmse = lasso_cv %>% show_best(metric = "rmse") %>% filter(penalty == min(penalty))
best_lasso = rbind(best_lasso_mae, best_lasso_rmse)

# View in table
names(best_lasso) = c("Penalty", "Metric", "Estimator", "Mean", "n", "Standard Error", ".config")
kable(best_lasso, digits = 3) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Penalty	Metric	Estimator	Mean	n	Standard Error	.config
0.031	mae	standard	0.283	5	0.002	Preprocessor1_Model009
0.023	rmse	standard	0.376	5	0.004	Preprocessor1_Model007

# Get test metrics

## Define best model - according to MAE
best_lasso_workflow = workflow_lasso %>%
  finalize_workflow(select_best(lasso_cv, metric = "mae")) %>%
  fit(data = email_train)
best_lasso = best_lasso_workflow %>% extract_fit_parsnip()

## Clean test data
email_test_clean = recipe(
                email_test_label ~ ., 
                data = email_test 
                ) %>% 
              update_role(V1, new_role = 'id variable')  %>% 
              step_dummy(all_nominal(), - all_outcomes()) %>%
              prep() %>% juice()

## Make predictions
lasso_predictions = predict(best_lasso, email_test_clean)
lasso_predictions = lasso_predictions %>% mutate(prediction = ifelse(.pred < 0.5, 0, 1))
email_test_clean$predictions = lasso_predictions$prediction

## Calculate accuracy
email_test_clean = email_test_clean %>% mutate(accurate = ifelse(predictions == email_test_label, 1, 0))
acc = sum(email_test_clean$accurate) / nrow(email_test_clean)

## Calculate sensitivity
tp = 0                                 # our model predicts not-spam for all, so no true positives
fn = email_test_clean %>% filter(email_test_label == 1) %>% nrow()
sens = tp / (tp + fn)

## Calculate specificity
fp = 0                               # our model predicts not-spam for all, so no false positives
tn = email_test_clean %>% filter(email_test_label == 0) %>% nrow()
spec = tn / (tn + fp)

## Make table
lasso_table = tibble(
 metric = c("accuracy", "sensitivity", "specificity"),
 value = c(acc, sens, spec)
 )
lasso_table %>% kable(digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

metric	value
accuracy	0.848
sensitivity	0.000
specificity	1.000

Logistic Lasso

# Set seed
set.seed(9753)

# Converting outcome variable into character vector
email_train <- email_train %>% 
                  mutate(
                    outcome_as_vector = ifelse(email_train_label == 1, "Yes", "No")
                  ) 

# Split for 5-fold cross-validation
folds <- vfold_cv(email_train, v = 5)

# Defining the recipe
data_recipe <- recipe(
                  outcome_as_vector ~ ., 
                  data = email_train 
                ) %>% 
                  step_rm(email_train_label) %>% 
                  update_role(V1, new_role = 'id variable') %>%
                  step_dummy(all_nominal(), - all_outcomes()) %>%
                  step_zv(all_predictors()) %>%
                  step_normalize(all_predictors())

# Defining Lambdas (from Lecture 005)
lambdas <- data.frame( penalty = c( 0, 10^seq( from = 5, to = -2, length = 100 ) ) )

# Defining Lasso Model
log_lasso <- logistic_reg(
            mode = 'classification',
            penalty = tune(), 
            mixture = 1) %>% 
            set_engine("glmnet")

# Setting up workflow
workflow_lasso <- workflow() %>%
                    add_recipe( data_recipe ) %>%
                    add_model( log_lasso )

# CV
log_lasso_cv <- workflow_lasso %>% 
                tune_grid(
                  resamples = folds,
                  metrics = metric_set(yardstick::accuracy, 
                                       yardstick::roc_auc, 
                                       yardstick::sens, 
                                       yardstick::spec, 
                                       yardstick::precision),
                  grid = grid_latin_hypercube(penalty(), size = 5),
                  control = control_grid(parallel_over = 'resamples')
                )

Logistic Lasso Results

# Find test metrics          
log_lasso_cv_results = log_lasso_cv %>% collect_metrics() %>% group_by(.metric) %>% summarize(mean_accuracy = mean(mean, na.rm = T))
log_lasso_cv_results[1] = c("Accuracy", "Precision", "Area Under the Curve", "Sensitivity", "Specificity")
names(log_lasso_cv_results) = c("Metric", "Mean")

## View table
kable(log_lasso_cv_results, digits = 3) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Metric	Mean
Accuracy	0.809
Precision	0.831
Area Under the Curve	0.508
Sensitivity	0.967
Specificity	0.040

Logistic regression

# Set seed
set.seed(9753)

# Split for 5-fold cross-validation
folds <- vfold_cv(email_train, v = 5)

# Defining the recipe
data_recipe <- recipe(
                  outcome_as_vector ~ ., 
                  data = email_train 
                    ) %>% 
                  step_rm(email_train_label) %>% 
                  update_role(V1, new_role = 'id variable') %>%
                  step_dummy(all_nominal(), - all_outcomes()) %>%
                  step_zv(all_predictors()) %>%
                  step_normalize(all_predictors())

# Define the model - simple LogReg because no penalty
model_logistic <- logistic_reg(
                    mode = 'classification') %>%   
                    set_engine('glm')

# Define the workflow
workflow_logistic <- workflow() %>% 
                      add_recipe(data_recipe) %>% 
                      add_model(model_logistic)

# CV
cv_logistic <- workflow_logistic %>%
                fit_resamples(
                  resamples = folds,
                  metrics = metric_set(yardstick::accuracy, 
                                       yardstick::roc_auc, 
                                       yardstick::sens, 
                                       yardstick::spec, 
                                       yardstick::precision)
                )

Logistic Regression Results

# Get test metrics
log_reg_results = cv_logistic %>% collect_metrics() %>% select(.metric, mean)
log_reg_results[1] = c("Accuracy", "Precision", "Area Under the Curve", "Sensitivity", "Specificity")
names(log_reg_results) = c("Metric", "Mean")

# View in table
kable(log_reg_results, digits = 3) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Metric	Mean
Accuracy	0.797
Precision	0.832
Area Under the Curve	0.511
Sensitivity	0.947
Specificity	0.071

Random Forests

# Train the model - using 200 trees
random_forest <- train(
                  x = email_train_rf,
                  y = email_train_label_rf,
                  method = "ranger", 
                  num.trees = 200,
                  importance = "impurity",
                  trControl = trainControl(method = "cv", 
                                           number = 3,
                                           verboseIter = TRUE
                  )
                )

## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 2, splitrule = gini, min.node.size = 1 on full training set

Random Forests Results

# Check variable importances 
top_25words = varImp(random_forest, scale = TRUE)$importance %>% 
                rownames_to_column() %>% 
                arrange(-Overall) %>% 
                top_n(25) 

# Plot variable importance
ggplot(data = top_25words, 
   aes(x=reorder(rowname, Overall),
       y = Overall)) +
        geom_bar(stat = "identity") +
        theme_base() +
        theme(axis.text.x=element_text(angle=50, hjust=1))+
        xlab("Top 25 Predictive Words (stemmed)")+
        ylab("Frequency of Word") +
        labs(title = "Most Predictive Words") +
        theme(plot.title = element_text(hjust = 0.5))

# Re-declaring test data
email_test  <- apply( email_dtm_freq_test, MARGIN = 2,
                      convert_values )
email_test <- cbind( email_test, 
                     email_test_label )

# Predict on test data
predictions <- predict(random_forest, email_test)

# View test metrics in confusion matrix
random_forest_results = confusionMatrix( data = predictions, 
                         reference = email_test_label,
                         positive = "1", 
                         dnn = c("Prediction", "Actual") )
draw_confusion_matrix(random_forest_results)

Results

The crucial metrics in the spam email context are test accuracy (ACC) and sensitivity (SENS). ACC can help determine whether a model is performing well, but it is not the only measure of a good predictor. SENS is key because clicking on spam is dangerous, so a spam email that is predicted not-spam (a false negative) has consequences. Naive Bayes produced a test ACC of 0.788 and a SENS of 0.04. Lasso regression produces an ACC of 0.848 and a SENS of 0. Logistic lasso returns 0.809 ACC and 0.967 SENS. Logistic regression returns 0.797 ACC and 0.947 SENS. Finally, the random forest returns 0.847 ACC and 0 SENS. While each model has its advantages and disadvantages, balancing test accuracy and sensitivity, logistic lasso is the best predictor of spam.

# Creating df with metrics of all models
comparing_acc_table = data.frame(
  
  c(
    "Naive Bayes",
    "Lasso",
    "Logistic Lasso",
    "Logistic",
    "Random Forest"
    ),
  
  c(
    naive_bayes_results[["overall"]][["Accuracy"]],
    acc,
    log_lasso_cv_results$Mean[1],
    log_reg_results$Mean[1],
    random_forest_results[["overall"]][["Accuracy"]]
    ),
  
  c(
    naive_bayes_results[["byClass"]][["Sensitivity"]],
    sens,
    log_lasso_cv_results$Mean[4],
    log_reg_results$Mean[4],
    random_forest_results[["byClass"]][["Sensitivity"]]
  ),
  
  c(
    naive_bayes_results[["byClass"]][["Specificity"]],
    spec,
    log_lasso_cv_results$Mean[5],
    log_reg_results$Mean[5],
    random_forest_results[["byClass"]][["Specificity"]]
  ),
  
  c(
    naive_bayes_results[["byClass"]][["Precision"]],
    # ifelse((tp/(fp+tp))) == NaN, 0, (tp/(fp+tp)),
    0, #hardcoded
    log_lasso_cv_results$Mean[2],
    log_reg_results$Mean[2],
    random_forest_results[["byClass"]][["Precision"]]
  )
  
)

names(comparing_acc_table) = c("Method", "Accuracy", "Sensitivity", "Specificity", "Precision")

# View table
kable(comparing_acc_table, digits = 3) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Method	Accuracy	Sensitivity	Specificity	Precision
Naive Bayes	0.788	0.044	0.921	0.091
Lasso	0.848	0.000	1.000	0.000
Logistic Lasso	0.809	0.967	0.040	0.831
Logistic	0.797	0.947	0.071	0.832
Random Forest	0.847	0.000	0.998	0.000

Sources Cited

Chu, M. (n.d.) “What happens if you open spam email? 4 dangers to your account.” DataOverhaulers. Retrieved from
    https://dataoverhaulers.com/open-spam-email/.

Cybernetic. [username] (2017 Mar. 22). “R how to visualize confusion matrix using the caret package.” StackOverflow. Retrieved from
    https://stackoverflow.com/questions/23891140/r-how-to-visualize-confusion-matrix-using-the-caret-package.

Garne, V. (2019). Spam Mails Dataset [Data set]. Kaggle. Retrieved from https://www.kaggle.com/venky73/spam-mails-dataset/.

Khadse, S. (n.d.) “Spam/ham test SMS classification using Naive Bayes in R.” RPubs by RStudio.
    https://rpubs.com/shreyaskhadse/spam_ham_naive_bayes.

Silge, J. (2020 Mar. 17). “LASSO regression using tidymodels and #TidyTuesday data for The Office.” Julia Silge. Retrieved from
    https://juliasilge.com/blog/lasso-the-office/.

Predicting Spam Emails

Cyrus Tadjiki, Zoë Arnaut, Octavio Lima

March 9th 2022