Load packages
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr)
library(data.table) # Good format to work with large datasets
library(skimr) # Nice descriptives
With so much freedom with respect to feature selection, functional form etc., models are prone to over-fitting. And without constraints from asymptotic properties, causality and so forth, how can we generalize anything?
In ML, generalization is not achieved by statistical derivations and theoretical argumentation, but rather by answering the practical question: how well would my model perform on new data? To answer this question, out-of-sample testing is the usual solution: hold back a part of your data when fitting the model (the training sample), then evaluate the predictions on the held-back part (the test sample).
An advanced version is N-fold cross-validation, where this process is repeated several times during the hyperparameter-tuning phase (more on that later).
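To make this concrete, here is a minimal sketch of a single out-of-sample test on simulated data (the data frame df and its variables are made up purely for illustration):
set.seed(123)
df <- tibble(x = rnorm(200), y = 1 + 2 * x + rnorm(200)) # simulated data, for illustration only
train_idx <- sample(seq_len(nrow(df)), size = 0.75 * nrow(df)) # random 75/25 split
df_train <- df[train_idx, ]
df_test <- df[-train_idx, ]
fit <- lm(y ~ x, data = df_train) # fit on the training sample only
pred <- predict(fit, newdata = df_test) # predict on data the model has never seen
sqrt(mean((df_test$y - pred)^2)) # out-of-sample error (RMSE)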
As a rule of thumb: richer and more complex functional forms and algorithms tend to be better at predicting complex real-world patterns. This is particularly true for high-dimensional (big) data.
However, at some point flexible algorithms become so good at mimicking the patterns in our data that they overfit, meaning they are tuned too tightly towards a specific dataset and might not reproduce the same accuracy on new data. Therefore, we aim at finding the sweet spot of high model complexity yet high-performing out-of-sample predictions.
We usually do that in a process of hyperparameter tuning, where we gear the different options the models offer towards high out-of-sample predictive performance. Mathematically speaking, we try to minimize a loss function \(L(.)\) (e.g. RMSE) in the following problem:
\[\text{minimize} \underbrace{\sum_{i=1}^{n} L(f(x_i), y_i)}_{\text{in-sample loss}} \quad \text{over} \quad \overbrace{f \in \mathcal{F}}^{\text{function class}} \quad \text{subject to} \quad \underbrace{R(f) \leq c}_{\text{complexity restriction}}\]
Let's recap for a second linear regression techniques, foremost the common all-rounder and workhorse of statistical research for the last hundred-or-so years.
OLS = Ordinary Least Squares
Basic Properties
Functional form
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon \]
where:
And that is what happens: imagine we plot some feature against an outcome we want to predict. OLS will simply fit a straight line through your data.
We do so by minimizing the sum of (squared) errors between our prediction line and the observed outcomes.
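A quick sketch of that idea on simulated data (the true intercept and slope below are made up, we just want to see the straight line):
set.seed(123)
sim <- tibble(x = runif(100), y = 2 + 3 * x + rnorm(100, sd = 0.5)) # true intercept 2, slope 3
fit.sim <- lm(y ~ x, data = sim)
coef(fit.sim) # estimates should be close to the true 2 and 3
ggplot(sim, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # the fitted straight line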
So, let's get our hands dirty. We will load a standard dataset from mlbench, the BostonHousing dataset. It comes as a data frame with 506 observations on 14 features, the last one, medv, being the outcome:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitric oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per USD 10,000
ptratio: pupil-teacher ratio by town
b: 1000(B - 0.63)^2 where B is the proportion of blacks by town
lstat: lower status of the population
medv: median value of owner-occupied homes in USD 1000's
Source: Harrison, D. and Rubinfeld, D.L. "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978.
These data have been taken from the UCI Repository Of Machine Learning Databases
library(mlbench) # Library including many ML benchmark datasets
data(BostonHousing)
data <- BostonHousing %>% as_data_frame()
rm(BostonHousing)
data %<>%
select(medv, everything()) %>%
mutate(chas = as.numeric(chas))
Let's inspect the data for a moment.
head(data)
glimpse(data)
## Observations: 506
## Variables: 14
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.090, 4.967, 4.967, 6.062, 6.062, 6.062, 5.561, 5.950...
## $ rad <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ b <dbl> 396.9, 396.9, 392.8, 394.6, 396.9, 394.1, 395.6, 396.9...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
skim(data) %>% skimr::kable()
## Skim summary statistics
## n obs: 506
## n variables: 14
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ---------- --------- ---------- ----- -------- -------- -------- -------- -------- -------- ------- ----------
## age 0 506 506 68.57 28.15 2.9 45.02 77.5 94.07 100 ▁▂▂▂▂▂▃▇
## b 0 506 506 356.67 91.29 0.32 375.38 391.44 396.23 396.9 ▁▁▁▁▁▁▁▇
## chas 0 506 506 1.07 0.25 1 1 1 1 2 ▇▁▁▁▁▁▁▁
## crim 0 506 506 3.61 8.6 0.0063 0.082 0.26 3.68 88.98 ▇▁▁▁▁▁▁▁
## dis 0 506 506 3.8 2.11 1.13 2.1 3.21 5.19 12.13 ▇▅▃▃▂▁▁▁
## indus 0 506 506 11.14 6.86 0.46 5.19 9.69 18.1 27.74 ▃▆▅▁▁▇▁▁
## lstat 0 506 506 12.65 7.14 1.73 6.95 11.36 16.96 37.97 ▆▇▆▅▂▁▁▁
## medv 0 506 506 22.53 9.2 5 17.02 21.2 25 50 ▂▅▇▆▂▂▁▁
## nox 0 506 506 0.55 0.12 0.38 0.45 0.54 0.62 0.87 ▇▆▇▆▃▅▁▁
## ptratio 0 506 506 18.46 2.16 12.6 17.4 19.05 20.2 22 ▁▂▂▂▅▅▇▃
## rad 0 506 506 9.55 8.71 1 4 5 24 24 ▂▇▁▁▁▁▁▅
## rm 0 506 506 6.28 0.7 3.56 5.89 6.21 6.62 8.78 ▁▁▂▇▇▂▁▁
## tax 0 506 506 408.24 168.54 187 279 330 666 711 ▃▇▂▅▁▁▁▆
## zn 0 506 506 11.36 23.32 0 0 0 12.5 100 ▇▁▁▁▁▁▁▁
Ok, let's straight away run our first linear model with the lm() function:
fit.lm <- lm(medv ~ ., data = data)
summary(fit.lm)
##
## Call:
## lm(formula = medv ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.594 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.772755 5.198710 6.50 0.0000000002018 ***
## crim -0.108011 0.032865 -3.29 0.00109 **
## zn 0.046420 0.013727 3.38 0.00078 ***
## indus 0.020559 0.061496 0.33 0.73829
## chas 2.686734 0.861580 3.12 0.00193 **
## nox -17.766611 3.819744 -4.65 0.0000042456438 ***
## rm 3.809865 0.417925 9.12 < 0.0000000000000002 ***
## age 0.000692 0.013210 0.05 0.95823
## dis -1.475567 0.199455 -7.40 0.0000000000006 ***
## rad 0.306049 0.066346 4.61 0.0000050705290 ***
## tax -0.012335 0.003761 -3.28 0.00111 **
## ptratio -0.952747 0.130827 -7.28 0.0000000000013 ***
## b 0.009312 0.002686 3.47 0.00057 ***
## lstat -0.524758 0.050715 -10.35 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.75 on 492 degrees of freedom
## Multiple R-squared: 0.741, Adjusted R-squared: 0.734
## F-statistic: 108 on 13 and 492 DF, p-value: <0.0000000000000002
So, how would you interpret the results? Estimate? P-Value? \(R^2\)? If you don’t know, consider taking the SDS track on Datacamp for statistics basics.
However, since this is a predictive exercise, we are foremost interested in how well the model predicts. So, let's use the predict() function.
pred.lm <- predict(fit.lm)
head(pred.lm, 10)
## 1 2 3 4 5 6 7 8 9 10
## 30.00 25.03 30.57 28.61 27.94 25.26 23.00 19.54 11.52 18.92
A common measure of the predictive power of regression models is the Root Mean Squared Error (RMSE), calculated as follows:
\[ RMSE = \sqrt{\frac{1}{n}\Sigma_{i=1}^{n}{\Big(y_i - \hat{y_i} \Big)^2}}\]
Keep in mind, this root-and-square operation does nothing to the error term except transforming negative into positive values. Let's calculate:
error <- pull(data, medv) - pred.lm
# Calculate RMSE
sqrt(mean(error ^ 2))
## [1] 4.679
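Since we will compute this quantity a couple of times, one could also wrap it in a small helper function (just a convenience sketch; the rest of this section keeps writing the formula out explicitly):
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(pull(data, medv), pred.lm) # same value as above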
We can also visualize the error term:
# Some visualization
error %>% as_data_frame() %>%
ggplot(aes(value)) +
geom_histogram()
However, how well does our OLS model predict out-of-sample? To find out, we have to set up an out-of-sample testing workflow. While that could be done manually, we will use the (my very favorite) caret (Classification And REgression Training) package, which has by now grown to be the most popular R package for ML, including a standardized ML workflow over more than 100 model classes.
First, we will split our data into a train and a test sample. We could create an index for subsetting manually. However, caret provides the handy createDataPartition() function, which does that for you. A nice feature is that it takes care that the outcomes (y) are proportionally distributed between the subsets. That is particularly important in case the outcomes are very unbalanced.
library(caret)
index <- createDataPartition(y = data$medv, p = 0.75, list = FALSE) # 75% to 25% split
training <- data[index,]
test <- data[-index,]
Before tuning the models, there is still some final preprocessing to do. For typical ML preprocessing tasks, I like to use the recipes package. It lets you conveniently define a recipe of standard ML preprocessing steps. Afterwards, we can use this recipe to "bake" our data, meaning performing all the steps in the recipe. Here, we do only some simple transformations. We normalize all numeric data by centering (subtracting the mean) and scaling (dividing by the standard deviation). Finally, we remove features with zero variance (we don't have any here, but it's good practice with unknown data).
library(recipes)
reci <- recipe(medv ~ ., data = training) %>%
step_center(all_numeric(), -all_outcomes()) %>%
step_scale(all_numeric(), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
prep(data = training)
reci
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 13
##
## Training data contained 381 data points and no missing data.
##
## Operations:
##
## Centering for crim, zn, indus, chas, nox, rm, age, ... [trained]
## Scaling for crim, zn, indus, chas, nox, rm, age, ... [trained]
## Zero variance filter removed no terms [trained]
Now we can "bake" our data, executing the recipe, and split it into outcomes and features (not strictly necessary, but usually an easier workflow later on).
# Training: split in y and x
x_train <- bake(reci, newdata = training) %>% select(-medv)
y_train <- training %>% pull(medv)
# test: split in y and x
x_test <- bake(reci, newdata = test) %>% select(-medv)
y_test <- test %>% pull(medv)
We now get seriously started with the hyperparameter tuning. We here do an n-fold cross-validation, meaning that we split the training sample again into n subsamples, on which we both fit and test every hyperparameter configuration. That minimizes the chance that the results are driven by a particular sampling. Sounds cumbersome, but caret allows us to automate this workflow by defining a trainControl() object, which we pass to every model we fit later on.
Here, we have to define a set of parameters. First, the number of folds. We here use a 10-fold cross-validation (method = "cv", number = 10); with limited computing power, RAM and/or time, 5 folds are often a pragmatic compromise. To be even more safe, this whole process can be repeated several times (method = "repeatedcv", repeats = x). Be aware, though, that computation time grows quickly with folds and repeats, which might make a difference when working with large data.
ctrl <- trainControl(method = "cv",
number = 10)
So, we finally start fitting models. The plain linear model we fit without a tuning grid, since it has no tunable hyperparameters.
fit.lm <- train(x = x_train,
y = y_train,
trControl = ctrl,
method = "lm")
summary(fit.lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.614 -2.764 -0.646 1.686 26.181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.511 0.238 94.71 < 0.0000000000000002 ***
## crim -1.010 0.317 -3.19 0.00154 **
## zn 1.198 0.358 3.35 0.00090 ***
## indus 0.164 0.478 0.34 0.73237
## chas 0.924 0.248 3.72 0.00023 ***
## nox -1.858 0.497 -3.74 0.00021 ***
## rm 2.914 0.328 8.89 < 0.0000000000000002 ***
## age -0.525 0.414 -1.27 0.20535
## dis -3.108 0.467 -6.66 0.00000000010068 ***
## rad 2.561 0.624 4.11 0.00004968953389 ***
## tax -2.116 0.691 -3.06 0.00236 **
## ptratio -1.964 0.319 -6.16 0.00000000188036 ***
## b 0.790 0.276 2.86 0.00443 **
## lstat -3.024 0.405 -7.47 0.00000000000059 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.64 on 367 degrees of freedom
## Multiple R-squared: 0.743, Adjusted R-squared: 0.734
## F-statistic: 81.8 on 13 and 367 DF, p-value: <0.0000000000000002
fit.lm
## Linear Regression
##
## 381 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 343, 344, 342, 342, 345, 342, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 4.723 0.7321 3.341
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Notice the higher RMSE compared to the initial in-sample testing example.
The elastic net has the functional form of a generalized linear model, plus an additional penalty term, scaled by a parameter \(\lambda\), which penalizes the coefficients according to their magnitude, in the form of:
\[\lambda \sum_{p=1}^{P} \big[ (1-\alpha)\, \beta_p^2 + \alpha\, |\beta_p| \big]\]
Here, we have 2 tunable parameters, \(\lambda\) and \(\alpha\). If \(\alpha = 1\), only the \(|\beta_p|\) terms remain, turning it into the (lately among econometricians very popular) Least Absolute Shrinkage and Selection Operator (LASSO) regression, while \(\alpha = 0\) leaves the pure squared penalty of a ridge regression. Obviously, when \(\lambda = 0\), the whole penalty vanishes, and we are again left with a generalized linear model. The first thing we do now is to set up a tuneGrid, a matrix of value combinations of the tunable hyperparameters we aim at optimizing. This is easily done with the base-R expand.grid() function.
tune.glmnet = expand.grid(alpha = seq(0, 1, length = 5),
lambda = seq(0, 1, length = 5))
tune.glmnet
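As a side note, the same model class can also be fit directly with the glmnet package outside of caret; a minimal sketch (the alpha and lambda values here are arbitrary, not tuned):
library(glmnet)
fit.direct <- glmnet(x = as.matrix(x_train), y = y_train,
                     family = "gaussian", alpha = 0.5, lambda = 0.1)
coef(fit.direct) # penalized coefficients; some may be shrunken towards (or exactly to) zero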
Now, we utilize the caret train() function to carry out the grid search as well as the cross-validation of our model. train() is a wrapper around a huge variety (150+) of models from other packages, and carries out the whole model fitting workflow. Here, we pass it our feature and outcome objects (x, y), the formerly defined trainControl object, the model class (here "glmnet" from the corresponding package), and our formerly defined tuneGrid. Let's go!
fit.glmnet <- train(x = x_train,
y = y_train,
trControl = ctrl,
method = "glmnet", family = "gaussian",
tuneGrid = tune.glmnet)
fit.glmnet
## glmnet
##
## 381 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 342, 344, 344, 343, 342, 343, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.00 0.00 4.728 0.7291 3.316
## 0.00 0.25 4.728 0.7291 3.316
## 0.00 0.50 4.728 0.7291 3.316
## 0.00 0.75 4.735 0.7285 3.315
## 0.00 1.00 4.752 0.7272 3.316
## 0.25 0.00 4.709 0.7311 3.350
## 0.25 0.25 4.744 0.7272 3.333
## 0.25 0.50 4.823 0.7193 3.355
## 0.25 0.75 4.906 0.7111 3.403
## 0.25 1.00 4.955 0.7072 3.440
## 0.50 0.00 4.709 0.7312 3.350
## 0.50 0.25 4.809 0.7204 3.355
## 0.50 0.50 4.934 0.7082 3.419
## 0.50 0.75 5.037 0.6981 3.507
## 0.50 1.00 5.116 0.6909 3.579
## 0.75 0.00 4.709 0.7311 3.350
## 0.75 0.25 4.889 0.7119 3.396
## 0.75 0.50 5.029 0.6982 3.501
## 0.75 0.75 5.120 0.6902 3.584
## 0.75 1.00 5.184 0.6875 3.639
## 1.00 0.00 4.709 0.7311 3.350
## 1.00 0.25 4.923 0.7090 3.413
## 1.00 0.50 5.099 0.6906 3.565
## 1.00 0.75 5.171 0.6873 3.629
## 1.00 1.00 5.264 0.6833 3.705
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.
We can also plot the results of the hyperparameter tuning.
ggplot(fit.glmnet)
So, taking it all together, let's see which model performs best out-of-sample:
pred.lm <- predict(fit.lm, newdata = x_test)
pred.glm <- predict(fit.glmnet, newdata = x_test)
# RMSE OLS
sqrt(mean( (y_test - predict(fit.lm, newdata = x_test) ) ^ 2))
## [1] 5.162
sqrt(mean( (y_test - predict(fit.glmnet, newdata = x_test) ) ^ 2))
## [1] 5.163
Notice again the higher RMSE for the final out-of-sample test. Here, the results are pretty similar, so in this case we in fact see no advantage of variable selection via glmnet.
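As a side note, caret also ships the convenience function postResample(), which returns RMSE, R-squared and (in recent versions) MAE in one go and should reproduce the manual RMSE calculation above:
postResample(pred = predict(fit.glmnet, newdata = x_test), obs = y_test)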
Remember that the most commonly used performance measure for regression problems is the RMSE. However, how do we assess models aimed at solving classification problems? Here, it is not that straightforward, and we could (depending on the task) use different measures.
The Confusion Matrix (in inferential statistics you would call it a classification table, so don't get confused) is the main source of classification performance measures.
It is a 2x2 matrix with the following cells: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
Just remember: we describe predicted values as Positive and Negative, and actual values as True and False. Out of combinations of these values, we can derive a set of different quality measures.
Accuracy (ACC) \[ {ACC} ={\frac {\mathrm {TP} +\mathrm {TN} }{P+N}}={\frac {\mathrm {TP} +\mathrm {TN} }{\mathrm {TP} +\mathrm {TN} +\mathrm {FP} +\mathrm {FN} }} \]
Sensitivity, also called recall, hit rate, or true positive rate (TPR) \[ {TPR} ={\frac {\mathrm {TP} }{P}}={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FN} }}=1-\mathrm {FNR} \]
Specificity, also called selectivity or true negative rate (TNR) \[ {TNR} ={\frac {\mathrm {TN} }{N}}={\frac {\mathrm {TN} }{\mathrm {TN} +\mathrm {FP} }}=1-\mathrm {FPR} \]
Precision, also called positive predictive value (PPV) \[ {PPV} ={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FP} }} \]
F1 score: is the harmonic mean of precision and sensitivity, meaning a weighted average of the true positive rate (recall) and precision. \[ F_{1}=2\cdot {\frac {\mathrm {PPV} \cdot \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }} \]
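As a small numerical illustration, here is a sketch that computes these measures from made-up cell counts:
TP <- 40; FP <- 10; FN <- 20; TN <- 130 # hypothetical confusion-matrix cells, for illustration only
acc <- (TP + TN) / (TP + TN + FP + FN) # accuracy
tpr <- TP / (TP + FN) # sensitivity / recall
tnr <- TN / (TN + FP) # specificity
ppv <- TP / (TP + FP) # precision
f1 <- 2 * ppv * tpr / (ppv + tpr) # harmonic mean of precision and recall
c(ACC = acc, TPR = tpr, TNR = tnr, PPV = ppv, F1 = f1)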
So, which one should we use? The answer is: it depends on the problem. Are you more sensitive towards false negatives or false positives? Do you have a balanced or unbalanced distribution of classes? Think about it for a moment.
An ROC curve (receiver operating characteristic curve; weird name, I know, it originally comes from signal processing) is a derivative of the confusion matrix and the predicted class probabilities.
So, what does it tell us? The ROC is a graph showing the performance of a classification model at all classification thresholds. It plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). It provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
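As a tiny sketch, the caTools::colAUC() function (which we will also use later in this section) computes the AUC and can draw the ROC curve; the probabilities and labels below are made up for illustration:
library(caTools)
probs <- c(0.9, 0.8, 0.65, 0.45, 0.3, 0.2) # hypothetical predicted probabilities of the positive class
labels <- factor(c("Yes", "Yes", "No", "Yes", "No", "No")) # hypothetical true classes
colAUC(probs, labels, plotROC = TRUE) # AUC value plus the ROC curve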
AUC is desirable for the following two reasons:
However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:
Customer churn refers to the situation when a customer ends their relationship with a company, and it’s a costly problem. Customers are the fuel that powers a business. Loss of customers impacts sales. Further, it’s much more difficult and costly to gain new customers than it is to retain existing customers. As a result, organizations need to focus on reducing customer churn.
The good news is that machine learning can help. For many businesses that offer subscription based services, it’s critical to both predict customer churn and explain what features relate to customer churn.
We now dive into the IBM Watson Telco Dataset. According to IBM, the business challenge is:
A telecommunications company [Telco] is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.
The dataset includes information about customer demographics, the services each customer has signed up for, account and billing details, and the outcome column Churn, which indicates whether the customer has left.
rm(list=ls()); graphics.off() # get rid of everything in the workspace
data <- readRDS("data/telco_churn.rds")
Let's inspect our data.
head(data)
glimpse(data)
## Observations: 7,043
## Variables: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "77...
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "...
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "N...
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "N...
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5...
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"...
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone ser...
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "F...
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", ...
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", ...
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", ...
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "N...
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "...
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "N...
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month...
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"...
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed c...
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89....
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820....
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", ...
skim(data) %>% skimr::kable()
## Skim summary statistics
## n obs: 7043
## n variables: 21
##
## Variable type: character
##
## variable missing complete n min max empty n_unique
## ------------------ --------- ---------- ------ ----- ----- ------- ----------
## Churn 0 7043 7043 2 3 0 2
## Contract 0 7043 7043 8 14 0 3
## customerID 0 7043 7043 10 10 0 7043
## Dependents 0 7043 7043 2 3 0 2
## DeviceProtection 0 7043 7043 2 19 0 3
## gender 0 7043 7043 4 6 0 2
## InternetService 0 7043 7043 2 11 0 3
## MultipleLines 0 7043 7043 2 16 0 3
## OnlineBackup 0 7043 7043 2 19 0 3
## OnlineSecurity 0 7043 7043 2 19 0 3
## PaperlessBilling 0 7043 7043 2 3 0 2
## Partner 0 7043 7043 2 3 0 2
## PaymentMethod 0 7043 7043 12 25 0 4
## PhoneService 0 7043 7043 2 3 0 2
## StreamingMovies 0 7043 7043 2 19 0 3
## StreamingTV 0 7043 7043 2 19 0 3
## TechSupport 0 7043 7043 2 19 0 3
##
## Variable type: integer
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## --------------- --------- ---------- ------ ------- ------- ---- ----- ----- ----- ------ ----------
## SeniorCitizen 0 7043 7043 0.16 0.37 0 0 0 0 1 ▇▁▁▁▁▁▁▂
## tenure 0 7043 7043 32.37 24.56 0 9 29 55 72 ▇▃▃▂▂▃▃▅
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ---------------- --------- ---------- ------ -------- --------- ------- -------- --------- --------- -------- ----------
## MonthlyCharges 0 7043 7043 64.76 30.09 18.25 35.5 70.35 89.85 118.75 ▇▁▃▂▆▅▅▂
## TotalCharges 11 7032 7043 2283.3 2266.77 18.8 401.45 1397.47 3794.74 8684.8 ▇▃▂▂▁▁▁▁
Let's right away do some preprocessing and drop the customer ID, which we do not need. Let's also say we're lazy today, so we just drop observations with missing features (not good practice).
data %<>%
select(-customerID) %>%
drop_na() %>%
select(Churn, everything())
Next, let's have a first visual inspection. Many models in the prediction exercise to follow require the conditional distribution of the features to differ between the outcome states to be predicted. So, let's take a look. Here, ggplot2 plus the ggridges package is my favorite.
require(ggridges)
data %>%
gather(variable, value, -Churn) %>%
ggplot(aes(y = as.factor(variable),
fill = as.factor(Churn),
x = percent_rank(value)) ) +
geom_density_ridges(alpha = 0.75)
Again, we split in a training and test dataset.
library(caret)
index <- createDataPartition(y = data$Churn, p = 0.75, list = FALSE)
training <- data[index,]
test <- data[-index,]
We already see that the numeric variable TotalCharges appears to be right-skewed. That is not a problem for predictive modeling per se, yet some transformation might still increase its predictive power. Let's see. The easiest approximation is to just check the correlation.
library(corrr)
training %>%
select(Churn, TotalCharges) %>%
mutate(
Churn = Churn %>% as.factor() %>% as.numeric(),
LogTotalCharges = log(TotalCharges)
) %>%
correlate() %>%
focus(Churn) %>%
fashion()
Alright, seems as if it could be worth it. Now we can already write down our recipe.
library(recipes)
reci <- recipe(Churn ~ ., data = training) %>%
step_discretize(tenure, options = list(cuts = 6)) %>%
step_log(TotalCharges) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_center(all_predictors(), -all_outcomes()) %>%
step_scale(all_predictors(), -all_outcomes()) %>%
prep(data = training)
reci
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 19
##
## Training data contained 5275 data points and no missing data.
##
## Operations:
##
## Dummy variables from tenure [trained]
## Log transformation on TotalCharges [trained]
## Dummy variables from gender, Partner, Dependents, tenure, ... [trained]
## Centering for SeniorCitizen, ... [trained]
## Scaling for SeniorCitizen, ... [trained]
Now we just split again in predictors and outcomes, bake it all, and we are good to go.
# Predictors
x_train <- bake(reci, newdata = training) %>% select(-Churn)
x_test <- bake(reci, newdata = test) %>% select(-Churn)
# Response variables for training and testing sets
y_train <- pull(training, Churn) %>% as.factor()
y_test <- pull(test, Churn) %>% as.factor()
We again define a trainControl() object. However, now we need some additional parameters.
ctrl <- trainControl(method = "repeatedcv", # repeatedcv, boot, cv, LOOCV, timeslice OR adaptive etc.
number = 5, # Number of CV's
repeats = 3, # Number or repeats -> repeats * CV's
classProbs = TRUE, # Include probability of class prediction
summaryFunction = twoClassSummary, # Which type of summary statistics to deliver
returnData = FALSE, # Don't include the original data in the train() output
returnResamp = "final", # Important for resampling later
savePredictions = "final", # The predictions of every tune or only the final to be saved?
allowParallel = TRUE, # For parallel processing
verboseIter = FALSE)
metric <- "ROC" # Which metric should be optimized (more on that later)
Let's run the common all-rounder first:
fit.logit <- train(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
method = "glm", family = "binomial")
fit.logit
## Generalized Linear Model
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4220, 4221, 4219, 4220, 4220, 4221, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8541 0.9038 0.535
summary(fit.logit)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.150 -0.656 -0.272 0.558 3.308
##
## Coefficients: (8 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -1.698538 0.057470 -29.56
## SeniorCitizen 0.090776 0.036715 2.47
## MonthlyCharges -0.775133 1.104180 -0.70
## TotalCharges -1.008294 0.176949 -5.70
## gender_Male -0.005493 0.038348 -0.14
## Partner_Yes 0.047467 0.045963 1.03
## Dependents_Yes -0.079191 0.048790 -1.62
## tenure_bin1 0.061725 0.175406 0.35
## tenure_bin2 0.068972 0.114565 0.60
## tenure_bin3 0.064122 0.088934 0.72
## tenure_bin4 0.103126 0.080920 1.27
## tenure_bin5 0.015227 0.077975 0.20
## tenure_bin6 NA NA NA
## PhoneService_Yes 0.032503 0.220752 0.15
## MultipleLines_No.phone.service NA NA NA
## MultipleLines_Yes 0.280398 0.102508 2.74
## InternetService_Fiber.optic 0.953801 0.459183 2.08
## InternetService_No -0.829467 0.384322 -2.16
## OnlineSecurity_No.internet.service NA NA NA
## OnlineSecurity_Yes -0.055316 0.093315 -0.59
## OnlineBackup_No.internet.service NA NA NA
## OnlineBackup_Yes 0.035099 0.096702 0.36
## DeviceProtection_No.internet.service NA NA NA
## DeviceProtection_Yes 0.086077 0.097719 0.88
## TechSupport_No.internet.service NA NA NA
## TechSupport_Yes -0.062840 0.095480 -0.66
## StreamingTV_No.internet.service NA NA NA
## StreamingTV_Yes 0.291924 0.184630 1.58
## StreamingMovies_No.internet.service NA NA NA
## StreamingMovies_Yes 0.350737 0.185129 1.89
## Contract_One.year -0.284772 0.051512 -5.53
## Contract_Two.year -0.711598 0.095049 -7.49
## PaperlessBilling_Yes 0.160013 0.043313 3.69
## PaymentMethod_Credit.card..automatic. 0.007981 0.055125 0.14
## PaymentMethod_Electronic.check 0.150479 0.052916 2.84
## PaymentMethod_Mailed.check 0.000103 0.057171 0.00
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## SeniorCitizen 0.01342 *
## MonthlyCharges 0.48268
## TotalCharges 0.000000012107471 ***
## gender_Male 0.88610
## Partner_Yes 0.30174
## Dependents_Yes 0.10457
## tenure_bin1 0.72492
## tenure_bin2 0.54715
## tenure_bin3 0.47090
## tenure_bin4 0.20252
## tenure_bin5 0.84517
## tenure_bin6 NA
## PhoneService_Yes 0.88294
## MultipleLines_No.phone.service NA
## MultipleLines_Yes 0.00623 **
## InternetService_Fiber.optic 0.03779 *
## InternetService_No 0.03091 *
## OnlineSecurity_No.internet.service NA
## OnlineSecurity_Yes 0.55332
## OnlineBackup_No.internet.service NA
## OnlineBackup_Yes 0.71663
## DeviceProtection_No.internet.service NA
## DeviceProtection_Yes 0.37840
## TechSupport_No.internet.service NA
## TechSupport_Yes 0.51044
## StreamingTV_No.internet.service NA
## StreamingTV_Yes 0.11385
## StreamingMovies_No.internet.service NA
## StreamingMovies_Yes 0.05815 .
## Contract_One.year 0.000000032338433 ***
## Contract_Two.year 0.000000000000071 ***
## PaperlessBilling_Yes 0.00022 ***
## PaymentMethod_Credit.card..automatic. 0.88489
## PaymentMethod_Electronic.check 0.00446 **
## PaymentMethod_Mailed.check 0.99856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6108.6 on 5274 degrees of freedom
## Residual deviance: 4220.4 on 5247 degrees of freedom
## AIC: 4276
##
## Number of Fisher Scoring iterations: 6
We can surely also run an elastic net with a logit link (family = "binomial"). As before, we define a tuneGrid and let caret's train() function carry out the grid search as well as the cross-validation, this time optimizing the ROC metric. Let's go!
tune.glmnet = expand.grid(alpha = seq(0, 1, length = 3),
lambda = seq(0, 0.3, length = 7))
fit.glmnet <- train(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
method = "glmnet", family = "binomial",
tuneGrid = tune.glmnet)
fit.glmnet
## glmnet
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4220, ...
## Resampling results across tuning parameters:
##
## alpha lambda ROC Sens Spec
## 0.0 0.00 0.8540 0.9108 0.5221
## 0.0 0.05 0.8526 0.9180 0.4874
## 0.0 0.10 0.8510 0.9280 0.4401
## 0.0 0.15 0.8496 0.9385 0.4080
## 0.0 0.20 0.8486 0.9484 0.3773
## 0.0 0.25 0.8478 0.9564 0.3454
## 0.0 0.30 0.8471 0.9630 0.3129
## 0.5 0.00 0.8545 0.9058 0.5338
## 0.5 0.05 0.8479 0.9608 0.3012
## 0.5 0.10 0.8387 0.9842 0.1671
## 0.5 0.15 0.8254 1.0000 0.0000
## 0.5 0.20 0.8126 1.0000 0.0000
## 0.5 0.25 0.8024 1.0000 0.0000
## 0.5 0.30 0.5000 1.0000 0.0000
## 1.0 0.00 0.8546 0.9053 0.5340
## 1.0 0.05 0.8366 0.9779 0.2121
## 1.0 0.10 0.8098 1.0000 0.0000
## 1.0 0.15 0.5000 1.0000 0.0000
## 1.0 0.20 0.5000 1.0000 0.0000
## 1.0 0.25 0.5000 1.0000 0.0000
## 1.0 0.30 0.5000 1.0000 0.0000
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.
We can also plot the results.
ggplot(fit.glmnet)
Ok, next we will do the same exercise for a classification tree. A brief reminder of what this interesting family of models is about:
This model class has become increasingly popular in business and other applications. Some reasons are:
Some tree terminology:
The decision of making strategic splits heavily affects a tree's accuracy. So, how does the tree decide where to split? This differs across the large family of tree-like models; common approaches choose, at every node, the split that most reduces an impurity measure such as the Gini impurity or the entropy (information gain), or, for regression trees, the variance of the outcome.
Some common complexity restrictions are a maximum tree depth, a minimum number of observations per node, and a complexity cost per split (such as the cp parameter we tune below).
Likewise, there is a variety of tunable hyperparameters across different implementations of this model family.
Ok, let's run it! We here use rpart, but there is a huge variety of implementations available. This one has only one tunable parameter: we can restrict the complexity via the hyperparameter cp. This parameter represents the complexity cost of every split, and allows a further split only if it decreases the model loss by more than this threshold.
library(rpart)
tune.dt = expand.grid(cp = c(0.001, 0.005 ,0.010, 0.020, 0.040))
fit.dt <- train(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
method = "rpart", tuneGrid = tune.dt)
fit.dt
## CART
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4220, 4221, 4219, 4219, 4221, 4220, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.001 0.8179 0.8845 0.5024
## 0.005 0.7908 0.9021 0.4871
## 0.010 0.7606 0.9130 0.4451
## 0.020 0.7442 0.9191 0.4139
## 0.040 0.6643 0.9457 0.3129
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
ggplot(fit.dt)
We directly see that in this case, increasing complexity costs lead to decreasing model performance. Such results are somewhat typical for large datasets, where high complexity costs prevent the tree from fully exploiting the richness of information. Therefore, we settle for a minimal cp of 0.001.
We can also plot the final tree structure.
require(rpart.plot)
rpart.plot(fit.dt$finalModel)
This plot is usually informative, and gives us a nice intuition of how the prediction works in classification trees. However, with that many dummies, it's a bit messy.
Finally, we fit another class of models which has gained popularity in the last decade, and has proven to be a powerful and versatile prediction technique that performs well in almost every setting: the random forest. It is among the most popular non-neural-network ML algorithms, and by some considered a panacea for all data science problems. Some say: "when you can't think of any algorithm (irrespective of situation), use random forest!"
As a continuation of tree-based classification methods, random forests aim at reducing overfitting by introducing randomness via bootstrapping and ensemble techniques (bagging). It is a type of ensemble learning method, where a group of weak models combines to form a powerful model. The idea is to create an "ensemble of classification trees", all grown from a different bootstrap sample. Having grown a forest of trees, every tree makes a prediction, and the final model prediction is formed by a majority vote of all trees. This idea is close to Monte Carlo simulation approaches, tapping into the power of randomness.
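A toy illustration of the majority vote, with made-up predictions from three hypothetical trees:
votes <- tibble(tree1 = c("Yes", "No", "No"), # made-up predictions of three trees
                tree2 = c("Yes", "Yes", "No"),
                tree3 = c("No", "Yes", "No"))
apply(votes, 1, function(x) names(which.max(table(x)))) # ensemble prediction: the majority class per row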
In this case, we draw on a number of tunable hyperparameters. First, we tune the number of randomly selected features which are available candidates for every split, on a range \([1, k-1]\), where lower values introduce a higher level of randomness into every split. Our second hyperparameter is the minimal number of observations which have to fall into every node, where lower numbers increase the potential precision of splits, but also the risk of overfitting. Finally, we also use the general split rule as a hyperparameter, where the choice is between i.) a traditional split according to the optimization of the Gini impurity of the class distribution in every split, and ii.) the "Extremely randomized trees" (extratrees) procedure by Geurts et al. (2006), where additional randomness is introduced into the selection of split points.
Note: We here use the ranger instead of the more popular randomForest package. For those who wonder: it basically does the same, only faster.
library(ranger)
tune.rf <- expand.grid(mtry = round(seq(1, ncol(x_train)-1, length = 3)),
min.node.size = c(10, 50, 100),
splitrule = c("gini", "extratrees") )
fit.rf <- train(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
method = "ranger",
importance = "impurity", # To define how to measure variable importantce (later more)
num.trees = 25,
tuneGrid = tune.rf)
fit.rf
## Random Forest
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4220, 4220, 4219, 4221, 4220, 4221, ...
## Resampling results across tuning parameters:
##
## mtry min.node.size splitrule ROC Sens Spec
## 1 10 gini 0.8310 0.9954 0.06490
## 1 10 extratrees 0.8291 0.9975 0.03425
## 1 50 gini 0.8299 0.9967 0.04920
## 1 50 extratrees 0.8276 0.9966 0.03874
## 1 100 gini 0.8290 0.9972 0.04208
## 1 100 extratrees 0.8279 0.9980 0.03471
## 18 10 gini 0.8304 0.8966 0.51377
## 18 10 extratrees 0.8307 0.8947 0.52352
## 18 50 gini 0.8443 0.9083 0.50379
## 18 50 extratrees 0.8499 0.9066 0.52376
## 18 100 gini 0.8466 0.9073 0.50949
## 18 100 extratrees 0.8507 0.9071 0.51282
## 34 10 gini 0.8241 0.8922 0.51234
## 34 10 extratrees 0.8278 0.8928 0.51498
## 34 50 gini 0.8394 0.9049 0.50593
## 34 50 extratrees 0.8462 0.9075 0.50950
## 34 100 gini 0.8430 0.9077 0.50594
## 34 100 extratrees 0.8484 0.9071 0.51425
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 18, splitrule =
## extratrees and min.node.size = 100.
ggplot(fit.rf)
We see that a number of randomly selected features per split of roughly half of all available features maximizes model performance in all cases. The same goes for a high minimal number of observations (100) per node. Finally, the extratrees procedure underperforms at a minimal number of randomly selected features, but outperforms the traditional Gini-based split rule when the number of available features increases. Such results are typical for large samples, where a high amount of injected randomness tends to make model predictions more robust.
rm(fit.logit,fit.glmnet,fit.dt,fit.rf)
The additional caretEnsemble package also has the caretList() function, with which one can fit a set of different models jointly. In addition, it takes care of the indexing for the resamples, so that they are the same for all models. Neat, isn't it? We could have done just that from the very beginning; we only fitted separate models before for illustration. Note: Be prepared to wait some time.
Since we already know the optimal tuning parameters, we can now define them and skip the tuning loop.
tune.glmnet <- expand.grid(alpha = 0.5, lambda = 0)
tune.dt <- expand.grid(cp = 0.001)
tune.rf <- expand.grid(mtry = 18, min.node.size = 100, splitrule = "extratrees" )
And let's run the ensemble:
library(caretEnsemble)
models <- caretList(x = x_train,
y = y_train,
trControl = ctrl,
metric = metric,
continue_on_fail = TRUE,
tuneList = list(logit = caretModelSpec(method = "glm",
family = "binomial"),
elasticnet = caretModelSpec(method = "glmnet",
family = "binomial",
tuneGrid = tune.glmnet),
dt = caretModelSpec(method = "rpart",
tuneGrid = tune.dt),
rf = caretModelSpec(method = "ranger",
importance = "impurity",
# num.trees = 25,
tuneGrid = tune.rf)
)
)
And done. Now, let's evaluate the results. This evaluation is based on the average prediction of the different folds on the validation fold within the training data (the final evaluation on the test set still comes later). Let's take a look. We can also plot the model comparison.
models
## $logit
## Generalized Linear Model
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4221, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8546 0.9045 0.5388
##
##
## $elasticnet
## glmnet
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4221, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8549 0.9064 0.5357
##
## Tuning parameter 'alpha' was held constant at a value of 0.5
##
## Tuning parameter 'lambda' was held constant at a value of 0
##
## $dt
## CART
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4221, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8166 0.8878 0.4998
##
## Tuning parameter 'cp' was held constant at a value of 0.001
##
## $rf
## Random Forest
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4221, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8519 0.9069 0.5197
##
## Tuning parameter 'mtry' was held constant at a value of 18
## extratrees
## Tuning parameter 'min.node.size' was held constant at a
## value of 100
##
## attr(,"class")
## [1] "caretList"
The last step is to collect the results from each model and compare them. We here use the resamples() function of caret. This can be done because we ran all models within a caretList, which keeps track of all resamples within the different models and enables us to compare them. Let's check how the models perform.
results <- resamples(models)
results$values %>%
select(-Resample) %>%
tidy()
bwplot(results)
Let's see how similarly the models predict on the validation sample.
modelCor(results)
## logit elasticnet dt rf
## logit 1.0000 0.9994 0.6975 0.9665
## elasticnet 0.9994 1.0000 0.7066 0.9681
## dt 0.6975 0.7066 1.0000 0.7322
## rf 0.9665 0.9681 0.7322 1.0000
Ok, to be as confident as possible, let's do the final evaluation. Generally, many ML exercises can be done without it, and it is common practice to go with the performance in the k-fold cross-validation. However, I believe this final step should be done if the data permits. Only with this final prediction do we expose our model to data it has not been exposed to yet, neither directly nor indirectly.
We first let all models do their prediction on the test data. We create 2 objects, one for the probability prediction (type = "prob"), and one for the predicted class of the outcome (where the positive class is predicted when \(P \geq 0.5\)). We do so because we need both for different summaries to come.
models.preds <- data.frame(lapply(models, predict, newdata = x_test))
models.preds.prob <- data.frame(lapply(models, predict, newdata = x_test, type = "prob"))
glimpse(models.preds)
## Observations: 1,757
## Variables: 4
## $ logit <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, ...
## $ elasticnet <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, ...
## $ dt <fct> No, No, Yes, Yes, No, No, No, No, Yes, No, No, Yes,...
## $ rf <fct> No, No, Yes, No, No, No, No, No, No, No, No, Yes, N...
glimpse(models.preds.prob)
## Observations: 1,757
## Variables: 8
## $ logit.No <dbl> 0.9661, 0.9609, 0.2187, 0.7574, 0.9858, 0.5990,...
## $ logit.Yes <dbl> 0.033924, 0.039050, 0.781258, 0.242560, 0.01420...
## $ elasticnet.No <dbl> 0.9635, 0.9577, 0.2260, 0.7580, 0.9843, 0.6072,...
## $ elasticnet.Yes <dbl> 0.036523, 0.042295, 0.774012, 0.241964, 0.01573...
## $ dt.No <dbl> 0.9362, 0.9362, 0.1702, 0.3043, 0.9362, 0.8500,...
## $ dt.Yes <dbl> 0.06377, 0.06377, 0.82979, 0.69565, 0.06377, 0....
## $ rf.No <dbl> 0.9350, 0.9256, 0.1913, 0.7430, 0.9659, 0.7038,...
## $ rf.Yes <dbl> 0.064972, 0.074414, 0.808670, 0.256981, 0.03412...
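Before going to the confusion matrices, note that the class predictions above correspond to the default 0.5 probability cutoff; a sketch of how one could derive class labels from the predicted probabilities at a custom threshold:
cutoff <- 0.5 # the default; lowering it flags more customers as churners
pred.custom <- factor(ifelse(models.preds.prob$logit.Yes >= cutoff, "Yes", "No"),
                      levels = levels(y_test))
table(Predicted = pred.custom, Actual = y_test)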
Let's plot the corresponding confusion matrices for all models:
cm <- list()
cm$logit <- confusionMatrix(factor(models.preds$logit), y_test, positive = "Yes")
cm$elasticnet <- confusionMatrix(factor(models.preds$elasticnet), y_test, positive = "Yes")
cm$dt <- confusionMatrix(factor(models.preds$dt), y_test, positive = "Yes")
cm$rf <- confusionMatrix(factor(models.preds$rf), y_test, positive = "Yes")
par(mfrow=c(2,2)) # Note: One day find a neater way to make it prettier
for(i in 1:length(cm)) {fourfoldplot(cm[[i]]$table, color = c("darkred", "darkgreen") ); title(main = names(cm[i]), cex.main = 1.5)}
And also the corresponding ROC curve…
library(caTools)
# Note: Is kind of ugly. At one point I will try making it prettier with the library "plotROC"
ROC <- list()
ROC$logit <- colAUC(models.preds.prob$logit.Yes, y_test, plotROC = F)
ROC$elasticnet <- colAUC(models.preds.prob$elasticnet.Yes, y_test, plotROC = F)
ROC$dt <- colAUC(models.preds.prob$dt.Yes, y_test, plotROC = F)
ROC$rf <- colAUC(models.preds.prob$rf.Yes, y_test, plotROC = F)
par(mfrow=c(2,2))
for(i in 1:length(ROC)) {colAUC(models.preds.prob[[i*2]], y_test, plotROC = T); title(main = names(ROC[i]), cex.main = 1.5)}
Finally, let's put everything into one nice summary table for the sake of overview.
model_eval <- tidy(cm$logit$overall) %>%
select(names) %>%
mutate(Logit = round(cm$logit$overall,3) ) %>%
mutate(ElasticNet = round(cm$elasticnet$overall,3)) %>%
mutate(ClassTree = round(cm$dt$overall,3)) %>%
mutate(RandForest = round(cm$rf$overall,3))
model_eval2 <- tidy(cm$logit$byClass) %>%
select(names) %>%
mutate(Logit = round(cm$logit$byClass,3) ) %>%
mutate(ElasticNet = round(cm$elasticnet$byClass,3)) %>%
mutate(ClassTree = round(cm$dt$byClass,3)) %>%
mutate(RandForest = round(cm$rf$byClass,3)) %>%
filter( !(names %in% c("AccuracyPValue", "McnemarPValue")) )
model_eval_all <- model_eval %>%
rbind(model_eval2) %>%
rbind(c("AUC", round(ROC$logit,3), round(ROC$elasticnet,3), round(ROC$dt,3), round(ROC$rf,3) ))
model_eval_all
rm(model_eval, model_eval2)