Load packages

library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr)
library(data.table) # Good format to work with large datasets
library(skimr) # Nice descriptives

1 Introduction

1.1 Contrasting ML&AI with inferential statistics

1.1.1 Inferential Statistics

  • Mostly interested in producing good parameter estimates: Construct models with unbiased estimates of \(\beta\), capturing the relationship between \(x\) and \(y\).
  • Such models are supposed to capture: a causal effect with directionality \(x \rightarrow y\), robust across a variety of observed as well as so far unobserved settings.
  • How: Carefully draw from theories and empirical findings, apply logical reasoning to formulate hypotheses.
  • Typically, multivariate testing, ceteris paribus.
  • Main concern: Minimize standard errors \(\epsilon\) of the \(\beta\) estimates.
  • Not overly concerned with the overall predictive power (e.g., \(R^2\)) of those models, but with various types of endogeneity issues, leading us to develop sophisticated identification strategies.

1.1.2 ML&AI Approach

  • To a large extent driven by the needs of the private sector \(\rightarrow\) data analysis is geared towards producing good predictions of outcomes \(\rightarrow\) fits for \(\hat{y}\), not \(\hat{\beta}\)
    • Recommender systems: Amazon, Netflix, Spotify etc.
    • Risk scores: E.g., the likelihood that a particular person has an accident, falls sick, or defaults on their credit.
    • Image classification: Finding cats & dogs online
  • Often rely on big data (large \(N\), many features \(x_i\))
  • Not overly concerned with the properties of parameter estimates, but very rigorous in optimizing the overall prediction accuracy.
  • Often more flexibility wrt. the functional form, and non-parametric approaches.
  • No “built-in” causality guarantee \(\rightarrow\) verification techniques.
  • Often sporadically used in econometric procedures, but seen as “son of a lesser god”.

1.2 Issues with ML

1.2.1 Generalization via “Out-of-Sample-Testing”

With so much freedom wrt. feature selection, functional form etc., models are prone to over-fitting. And without constraints from asymptotic properties, causality and so forth, how can we generalize anything?

In ML, generalization is not achieved by statistical derivations and theoretical argumentation, but rather by answering the practical question: How well would my model perform on new data? To answer this question, out-of-sample testing is the usual solution. Here, you do the following:

  1. Split the dataset into a training and a test sample.
  2. Fit your regression (train your model) on the training sample.
    • Optional: Tune hyperparameters by minimizing the loss on a validation set.
    • Optional: Retrain the final model configuration on the whole training set.
  3. Finally, evaluate predictive power on the test sample, on which the model was not fitted.

An advanced version is an N-fold cross-validation, where this process is repeated several times during the hyperparameter-tuning phase (more on that later).
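
To make this concrete, here is a minimal sketch of a manual train/test split in base R, using the built-in mtcars data purely for illustration (all object names are hypothetical):

set.seed(123)
train_idx <- sample(nrow(mtcars), size = round(0.75 * nrow(mtcars))) # 75% / 25% split
train_demo <- mtcars[train_idx, ]
test_demo  <- mtcars[-train_idx, ]

fit  <- lm(mpg ~ wt + hp, data = train_demo)   # 2. fit on the training sample
pred <- predict(fit, newdata = test_demo)      # 3. predict on the unseen test sample
sqrt(mean((test_demo$mpg - pred)^2))           # out-of-sample RMSE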

1.2.2 Model complexity & Hyperparameter Tuning

As a rule of thumb: richer and more complex functional forms and algorithms tend to be better at predicting complex real-world patterns. This is particularly true for high-dimensional (big) data.

However, at some point flexible algorithms become so good at mimicking the patterns in our data that they overfit, meaning they are tuned too closely towards a specific dataset and might not reproduce the same accuracy on new data. Therefore, we aim at finding the sweet spot of high model complexity yet high-performing out-of-sample predictions.

We usually do this in a process of hyperparameter tuning, where we adjust the different options a model offers towards high out-of-sample predictive performance. Mathematically speaking, we try to minimize a loss function \(L(.)\) (e.g., RMSE) by solving the following problem:

\[\text{minimize} \underbrace{\sum_{i=1}^{n}L(f(x_i),y_i)}_{\text{in-sample loss}} ~ \text{over} \overbrace{~ f \in F ~}^{\text{function class}} ~ \text{subject to} \underbrace{~ R(f) \leq c.}_{\text{complexity restriction}}\]
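
To get a feel for this trade-off, here is a small illustrative sketch (simulated toy data, all object names hypothetical): polynomials of increasing degree keep improving the in-sample fit, while the out-of-sample error at some point starts to rise again.

set.seed(1)
toy <- data.frame(x = runif(100, -2, 2))
toy$y <- sin(toy$x) + rnorm(100, sd = 0.3)
idx <- sample(100, 70) # 70 observations for training, 30 held out

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

sapply(1:8, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = toy[idx, ])
  c(degree     = deg,
    in_sample  = rmse(toy$y[idx], predict(fit)),
    out_sample = rmse(toy$y[-idx], predict(fit, newdata = toy[-idx, ])))
})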

2 Regression

2.1 Introduction (a brief reminder)

Let’s recap for a second linear regression techniques, foremost the common all-rounder and workhorse of statistical research for roughly the last 100 years.

OLS = Ordinary Least Squares

Basic Properties

  • Outcome: continuous
  • Predictors: continuous, dichotomous, categorical
  • When to use: Predicting a phenomenon that scales and can be measured continuously

Functional form

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon \]

where:

  • \(y\) = Outcome, \(x_i\) = observed value of feature \(i\)
  • \(\beta_0\) = Constant
  • \(\beta_i\) = Estimated effect of \(x_i\) on \(y\) , slope of the linear function
  • \(\epsilon\) = Error term

And that is what happens: imagine we plot some feature against an outcome we want to predict. OLS will just fit a straight line through the data.

We do so by minimizing the sum of (squared) errors between our prediction line and the observed outcomes.
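
As a quick illustration (here on the built-in mtcars data, purely for demonstration; ggplot2 is already loaded via the tidyverse), this is literally all that happens:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                         # the observed data
  geom_smooth(method = "lm", se = FALSE) # the fitted OLS line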

2.1.1 Executing a regression

So, let’s get our hands dirty. We will load a standard dataset from mlbench, the BostonHousing dataset. It comes as a dataframe with 506 observations on 14 features, the last one medv being the outcome:

  • crim per capita crime rate by town
  • zn proportion of residential land zoned for lots over 25,000 sq.ft
  • indus proportion of non-retail business acres per town
  • chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox nitric oxides concentration (parts per 10 million)
  • rm average number of rooms per dwelling
  • age proportion of owner-occupied units built prior to 1940
  • dis weighted distances to five Boston employment centres
  • rad index of accessibility to radial highways
  • tax full-value property-tax rate per USD 10,000
  • ptratio pupil-teacher ratio by town
  • b 1000(B - 0.63)^2 where B is the proportion of blacks by town
  • lstat percentage of lower status of the population
  • medv median value of owner-occupied homes in USD 1000’s

Source: Harrison, D. and Rubinfeld, D.L. “Hedonic prices and the demand for clean air”, J. Environ. Economics & Management, vol.5, 81-102, 1978.

These data have been taken from the UCI Repository Of Machine Learning Databases

library(mlbench) # Library including many ML benchmark datasets
data(BostonHousing) 
data <- BostonHousing %>% as_data_frame()
rm(BostonHousing)

data %<>%
  select(medv, everything()) %>%
  mutate(chas = as.numeric(chas))

Let’s inspect the data for a moment.

head(data)
glimpse(data)
## Observations: 506
## Variables: 14
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.090, 4.967, 4.967, 6.062, 6.062, 6.062, 5.561, 5.950...
## $ rad     <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ b       <dbl> 396.9, 396.9, 392.8, 394.6, 396.9, 394.1, 395.6, 396.9...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
skim(data) %>% skimr::kable()
## Skim summary statistics  
##  n obs: 506    
##  n variables: 14    
## 
## Variable type: numeric
## 
##  variable    missing    complete     n      mean       sd        p0       p25       p50       p75      p100       hist   
## ----------  ---------  ----------  -----  --------  --------  --------  --------  --------  --------  -------  ----------
##    age          0         506       506    68.57     28.15      2.9      45.02      77.5     94.07      100     <U+2581><U+2582><U+2582><U+2582><U+2582><U+2582><U+2583><U+2587> 
##     b           0         506       506    356.67    91.29      0.32     375.38    391.44    396.23    396.9    <U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587> 
##    chas         0         506       506     1.07      0.25       1         1         1         1         2      <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
##    crim         0         506       506     3.61      8.6      0.0063    0.082      0.26      3.68     88.98    <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581> 
##    dis          0         506       506     3.8       2.11      1.13      2.1       3.21      5.19     12.13    <U+2587><U+2585><U+2583><U+2583><U+2582><U+2581><U+2581><U+2581> 
##   indus         0         506       506    11.14      6.86      0.46      5.19      9.69      18.1     27.74    <U+2583><U+2586><U+2585><U+2581><U+2581><U+2587><U+2581><U+2581> 
##   lstat         0         506       506    12.65      7.14      1.73      6.95     11.36     16.96     37.97    <U+2586><U+2587><U+2586><U+2585><U+2582><U+2581><U+2581><U+2581> 
##    medv         0         506       506    22.53      9.2        5       17.02      21.2       25       50      <U+2582><U+2585><U+2587><U+2586><U+2582><U+2582><U+2581><U+2581> 
##    nox          0         506       506     0.55      0.12      0.38      0.45      0.54      0.62     0.87     <U+2587><U+2586><U+2587><U+2586><U+2583><U+2585><U+2581><U+2581> 
##  ptratio        0         506       506    18.46      2.16      12.6      17.4     19.05      20.2      22      <U+2581><U+2582><U+2582><U+2582><U+2585><U+2585><U+2587><U+2583> 
##    rad          0         506       506     9.55      8.71       1         4         5         24       24      <U+2582><U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2585> 
##     rm          0         506       506     6.28      0.7       3.56      5.89      6.21      6.62     8.78     <U+2581><U+2581><U+2582><U+2587><U+2587><U+2582><U+2581><U+2581> 
##    tax          0         506       506    408.24    168.54     187       279       330       666       711     <U+2583><U+2587><U+2582><U+2585><U+2581><U+2581><U+2581><U+2586> 
##     zn          0         506       506    11.36     23.32       0         0         0        12.5      100     <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>

OK, let’s straight away run our first linear model with the lm() function:

fit.lm <- lm(medv ~ ., data = data)
summary(fit.lm)  
## 
## Call:
## lm(formula = medv ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.594  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  33.772755   5.198710    6.50      0.0000000002018 ***
## crim         -0.108011   0.032865   -3.29              0.00109 ** 
## zn            0.046420   0.013727    3.38              0.00078 ***
## indus         0.020559   0.061496    0.33              0.73829    
## chas          2.686734   0.861580    3.12              0.00193 ** 
## nox         -17.766611   3.819744   -4.65      0.0000042456438 ***
## rm            3.809865   0.417925    9.12 < 0.0000000000000002 ***
## age           0.000692   0.013210    0.05              0.95823    
## dis          -1.475567   0.199455   -7.40      0.0000000000006 ***
## rad           0.306049   0.066346    4.61      0.0000050705290 ***
## tax          -0.012335   0.003761   -3.28              0.00111 ** 
## ptratio      -0.952747   0.130827   -7.28      0.0000000000013 ***
## b             0.009312   0.002686    3.47              0.00057 ***
## lstat        -0.524758   0.050715  -10.35 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.75 on 492 degrees of freedom
## Multiple R-squared:  0.741,  Adjusted R-squared:  0.734 
## F-statistic:  108 on 13 and 492 DF,  p-value: <0.0000000000000002

So, how would you interpret the results? Estimate? P-Value? \(R^2\)? If you don’t know, consider taking the SDS track on Datacamp for statistics basics.

However, since this is a predictive exercise, we are foremost interested in how well the model predicts. So, let’s use the predict() function.

pred.lm <- predict(fit.lm) 
head(pred.lm, 10)
##     1     2     3     4     5     6     7     8     9    10 
## 30.00 25.03 30.57 28.61 27.94 25.26 23.00 19.54 11.52 18.92

A common measure of the predictive power of regression models is the Root Mean Squared Error (RMSE), calculated as follows:

\[ RMSE = \sqrt{\frac{1}{n}\Sigma_{i=1}^{n}{\Big(y_i - \hat{y_i} \Big)^2}}\]

Keep in mind, this root-and-square business does little to the error term besides making sure that negative and positive deviations do not cancel each other out. Let’s calculate:

error <-  pull(data, medv) -  pred.lm

# Calculate RMSE
sqrt(mean(error ^ 2))
## [1] 4.679

We can also visualize the error term:

# Some visualization
error %>% as_data_frame() %>%
  ggplot(aes(value)) + 
  geom_histogram()

2.2 ML workflows for regression analysis

2.2.1 Define a subset for the hyperparameter tuning

However, how well does our OLS model predict out-of-sample? To find out, we have to set up an out-of-sample testing workflow. While that could be done manually, we will use the (my very favorite) caret (= Classification And REgression Training) package, which has by now grown to be the most popular R package for ML, providing a standardized ML workflow for more than 100 model classes.

First, we will now split our data into a train and a test sample. We could create an index for subsetting manually. However, caret provides the nice createDataPartition() function, which does that for us. A nice feature is that it takes care that the outcomes (y) are proportionally distributed across the subsets. That is particularly important in case the outcomes are very unbalanced.

library(caret)
index <- createDataPartition(y = data$medv, p = 0.75, list = FALSE) # 75% to 25% split

training <- data[index,] 
test <- data[-index,] 

2.2.2 Preprocessing

Before tuning the models, there is still some final preprocessing to do. For typical ML preprocessing tasks, I like to use the recipes package. It lets you conveniently define a recipe of standard ML preprocessing steps. Afterwards, we can just use this recipe to “bake” our data, meaning performing all the steps in the recipe. Here, we do only some simple transformations: we normalize all numeric data by centering (subtracting the mean) and scaling (dividing by the standard deviation). Finally, we remove features with zero variance (we don’t have any here, but it’s good practice with unknown data).

library(recipes)
reci <- recipe(medv ~ ., data = training) %>%
  step_center(all_numeric(), -all_outcomes()) %>% # center: subtract the mean
  step_scale(all_numeric(), -all_outcomes()) %>%  # scale: divide by the standard deviation
  step_zv(all_predictors()) %>%                   # drop zero-variance features
  prep(data = training)
reci
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         13
## 
## Training data contained 381 data points and no missing data.
## 
## Operations:
## 
## Centering for crim, zn, indus, chas, nox, rm, age, ... [trained]
## Scaling for crim, zn, indus, chas, nox, rm, age, ... [trained]
## Zero variance filter removed no terms [trained]

Now we can “bake” our data, executing the recipe, and split it into outcomes and features (not strictly necessary, but it usually makes the workflow easier later on).

# Training: split in y and x
x_train <-bake(reci, newdata = training) %>% select(-medv) 
y_train <- training %>% pull(medv)

# test: split in y and x
x_test <-bake(reci, newdata = test) %>% select(-medv) 
y_test <- test %>% pull(medv)

2.2.3 Defining the training controls

We now get seriously started with the hyperparameter tuning. We will do an n-fold cross-validation, meaning that we split the training sample again into n subsamples, on which we both fit and test every hyperparameter configuration. That minimizes the chance that the results are driven by a particular sampling. Sounds cumbersome. However, caret allows us to automate this workflow by defining a trainControl() object, which we pass to every model we fit later on.

Here, we have to define a set of parameters. First, the number of cross-validation folds: given enough computing power, RAM and/or time, I would recommend 5 or 10 folds for a real-life calibration; we use 10 here (method = "cv", number = 10). To be even more robust, this whole process can be repeated several times (method = "repeatedcv", repeats = x). Be aware, though, that the computation time increases accordingly, which might make a difference when working with large data.

ctrl <- trainControl(method = "cv", 
                     number = 10)
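
For reference, the repeated variant mentioned above might look like this (the object name ctrl_repeated is just illustrative; expect a correspondingly longer runtime):

ctrl_repeated <- trainControl(method = "repeatedcv",
                              number = 5,  # 5 folds ...
                              repeats = 3) # ... repeated 3 times = 15 resamples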

2.3 Hyperparameter Tuning

So, we finally start fitting models. The plain linear model has no tunable hyperparameters, so there is nothing to tune for it; we still run it through caret to obtain its cross-validated out-of-sample performance.

2.3.1 Linear Regression

fit.lm <- train(x = x_train, 
                y = y_train, 
                trControl = ctrl, 
                method = "lm")
summary(fit.lm)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.614  -2.764  -0.646   1.686  26.181 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   22.511      0.238   94.71 < 0.0000000000000002 ***
## crim          -1.010      0.317   -3.19              0.00154 ** 
## zn             1.198      0.358    3.35              0.00090 ***
## indus          0.164      0.478    0.34              0.73237    
## chas           0.924      0.248    3.72              0.00023 ***
## nox           -1.858      0.497   -3.74              0.00021 ***
## rm             2.914      0.328    8.89 < 0.0000000000000002 ***
## age           -0.525      0.414   -1.27              0.20535    
## dis           -3.108      0.467   -6.66     0.00000000010068 ***
## rad            2.561      0.624    4.11     0.00004968953389 ***
## tax           -2.116      0.691   -3.06              0.00236 ** 
## ptratio       -1.964      0.319   -6.16     0.00000000188036 ***
## b              0.790      0.276    2.86              0.00443 ** 
## lstat         -3.024      0.405   -7.47     0.00000000000059 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.64 on 367 degrees of freedom
## Multiple R-squared:  0.743,  Adjusted R-squared:  0.734 
## F-statistic: 81.8 on 13 and 367 DF,  p-value: <0.0000000000000002
fit.lm
## Linear Regression 
## 
## 381 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 343, 344, 342, 342, 345, 342, ... 
## Resampling results:
## 
##   RMSE   Rsquared  MAE  
##   4.723  0.7321    3.341
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Notice the higher RMSE compared to the initial in-sample testing example.

2.3.2 Elastic Net

The elastic net has the functional form of a generalized linear model, plus an additional penalty term scaled by a parameter \(\lambda\), which penalizes every coefficient by its contribution to the model’s complexity, in the form of:

\[\lambda \sum_{p=1}^{P} \left[ \alpha |\beta_p| + (1 - \alpha) \beta_p^2 \right]\]

Here, we have two tunable parameters, \(\lambda\) and \(\alpha\). If \(\alpha = 1\), only the absolute-value term \(|\beta_p|\) remains, turning it into the lately (also among econometricians) very popular Least Absolute Shrinkage and Selection Operator (LASSO) regression, while \(\alpha = 0\) leaves a pure ridge (squared) penalty. Obviously, when \(\lambda = 0\), the whole penalty vanishes, and we are again left with a generalized linear model. The first thing we do now is to set up a tuneGrid, a matrix of value combinations of the tunable hyperparameters we aim at optimizing. This is easily done with the base-R expand.grid() function.

tune.glmnet = expand.grid(alpha = seq(0, 1, length = 5), 
                          lambda = seq(0, 1, length = 5))
tune.glmnet
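
To see what the penalty actually does, we can also fit glmnet directly and inspect the coefficient paths: as \(\lambda\) grows, coefficients are shrunken towards zero and, for the LASSO (\(\alpha = 1\)), dropped entirely. A quick sketch (assuming the glmnet package is installed; the object name lasso_path is illustrative):

library(glmnet)
lasso_path <- glmnet(as.matrix(x_train), y_train, alpha = 1) # pure LASSO penalty
plot(lasso_path, xvar = "lambda", label = TRUE)              # coefficient paths over lambda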

Now, we use the caret train() function to carry out the grid search as well as the cross-validation of our model. The train() function is a wrapper around a huge variety (150+) of models from other packages and carries out the model-fitting workflow. Here, we pass it our feature and outcome vectors (x, y), the parameters in the formerly defined trainControl object, the model class (here “glmnet” from the corresponding package), and our formerly defined tuneGrid. Let’s go!

fit.glmnet <- train(x = x_train, 
                    y = y_train, 
                    trControl = ctrl, 
                    method = "glmnet", family = "gaussian", 
                    tuneGrid = tune.glmnet)
fit.glmnet
## glmnet 
## 
## 381 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 342, 344, 344, 343, 342, 343, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  RMSE   Rsquared  MAE  
##   0.00   0.00    4.728  0.7291    3.316
##   0.00   0.25    4.728  0.7291    3.316
##   0.00   0.50    4.728  0.7291    3.316
##   0.00   0.75    4.735  0.7285    3.315
##   0.00   1.00    4.752  0.7272    3.316
##   0.25   0.00    4.709  0.7311    3.350
##   0.25   0.25    4.744  0.7272    3.333
##   0.25   0.50    4.823  0.7193    3.355
##   0.25   0.75    4.906  0.7111    3.403
##   0.25   1.00    4.955  0.7072    3.440
##   0.50   0.00    4.709  0.7312    3.350
##   0.50   0.25    4.809  0.7204    3.355
##   0.50   0.50    4.934  0.7082    3.419
##   0.50   0.75    5.037  0.6981    3.507
##   0.50   1.00    5.116  0.6909    3.579
##   0.75   0.00    4.709  0.7311    3.350
##   0.75   0.25    4.889  0.7119    3.396
##   0.75   0.50    5.029  0.6982    3.501
##   0.75   0.75    5.120  0.6902    3.584
##   0.75   1.00    5.184  0.6875    3.639
##   1.00   0.00    4.709  0.7311    3.350
##   1.00   0.25    4.923  0.7090    3.413
##   1.00   0.50    5.099  0.6906    3.565
##   1.00   0.75    5.171  0.6873    3.629
##   1.00   1.00    5.264  0.6833    3.705
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.

We can also plot the results of the hyperparameter tuning.

ggplot(fit.glmnet)

2.3.3 Model evaluation

So, taking all together, lets see which model performs best out-of-sample:

pred.lm <- predict(fit.lm, newdata = x_test)
pred.glmnet <- predict(fit.glmnet, newdata = x_test)

# RMSE OLS
sqrt(mean( (y_test - pred.lm) ^ 2))
## [1] 5.162
# RMSE elastic net
sqrt(mean( (y_test - pred.glmnet) ^ 2))
## [1] 5.163

Notice again the higher RMSE for the true out-of-sample testing. Here, the results are pretty similar, so we in fact see no advantage of variable selection via glmnet.

3 ML for Classification

3.1 Introduction

3.1.1 Reminder: Model assessments and metrics for classification problems

We remember that the most commonly used performance measure for regression problems is the RMSE. However, how do we assess models aimed at solving classification problems? Here, it is not that straightforward, and we could (depending on the task) use different measures.

3.1.1.1 The Confusion matrix and its metrics

The Confusion Matrix (in inferential statistics you would call it a classification table, so don’t get confused) is the main source of most classification metrics.

It is a 2x2 matrix with the following cells:

  • True Positive: (TP)
    • Interpretation: You predicted positive and it’s true.
    • You predicted that a woman is pregnant and she actually is.
  • True Negative: (TN)
    • Interpretation: You predicted negative and it’s true.
    • You predicted that a man is not pregnant and he actually is not.
  • False Positive: (FP) - (Type 1 Error)
    • Interpretation: You predicted positive and it’s false.
    • You predicted that a man is pregnant but he actually is not.
  • False Negative: (FN) - (Type 2 Error)
    • Interpretation: You predicted negative and it’s false.
    • You predicted that a woman is not pregnant but she actually is.

Just remember: we describe predicted values as Positive and Negative and actual values as True and False. Out of combinations of these values, we can derive a set of different quality measures.

Accuracy (ACC) \[ {ACC} ={\frac {\mathrm {TP} +\mathrm {TN} }{P+N}}={\frac {\mathrm {TP} +\mathrm {TN} }{\mathrm {TP} +\mathrm {TN} +\mathrm {FP} +\mathrm {FN} }} \]

Sensitivity, also called recall, hit rate, or true positive rate (TPR) \[ {TPR} ={\frac {\mathrm {TP} }{P}}={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FN} }}=1-\mathrm {FNR} \]

Specificity, also called selectivity or true negative rate (TNR) \[ {TNR} ={\frac {\mathrm {TN} }{N}}={\frac {\mathrm {TN} }{\mathrm {TN} +\mathrm {FP} }}=1-\mathrm {FPR} \]

Precision, also called positive predictive value (PPV) \[ {PPV} ={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FP} }} \]

F1 score: is the harmonic mean of precision and sensitivity, meaning a weighted average of the true positive rate (recall) and precision. \[ F_{1}=2\cdot {\frac {\mathrm {PPV} \cdot \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }} \]

So, which one should we use? The answer is: it depends on the problem. Are you more sensitive towards false negatives or false positives? Do you have a balanced or unbalanced distribution of classes? Think about it for a moment.
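
To make the formulas above concrete, here is a tiny sketch with purely hypothetical cell counts:

TP <- 40; FN <- 10; FP <- 20; TN <- 130 # hypothetical confusion-matrix counts

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)           # recall / true positive rate
specificity <- TN / (TN + FP)           # true negative rate
precision   <- TP / (TP + FP)           # positive predictive value
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision, F1 = f1), 3)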

3.1.1.2 ROC and AUC

An ROC curve (receiver operating characteristic curve; weird name, I know, it originally comes from signal processing) is a derivative of the confusion matrix and the predicted class probabilities.

So, what does it tell us? The ROC is a graph showing the performance of a classification model at all classification thresholds. It plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
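
A rough sketch of how such a curve comes about (simulated toy predictions, all object names hypothetical): sweep the threshold and record TPR and FPR at each step.

set.seed(1)
truth    <- factor(sample(c("Yes", "No"), 200, replace = TRUE, prob = c(0.3, 0.7)))
prob_yes <- ifelse(truth == "Yes", rbeta(200, 3, 2), rbeta(200, 2, 3)) # fake class probabilities

roc_points <- sapply(seq(0, 1, by = 0.05), function(thr) {
  pred <- ifelse(prob_yes >= thr, "Yes", "No")
  c(TPR = mean(pred[truth == "Yes"] == "Yes"), # sensitivity at this threshold
    FPR = mean(pred[truth == "No"]  == "Yes")) # 1 - specificity at this threshold
})
plot(roc_points["FPR", ], roc_points["TPR", ], type = "b",
     xlab = "False positive rate", ylab = "True positive rate")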

AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). It provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

AUC is desirable for the following two reasons:

  1. AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
  2. AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.

However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:

  • Scale invariance is not always desirable. For example, sometimes we really do need well calibrated probability outputs, and AUC won’t tell us about that.
  • Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn’t a useful metric for this type of optimization.

3.2 Introduction to the case

3.2.1 Digression: Customer Churn: Hurts Sales, Hurts Company

Customer churn refers to the situation when a customer ends their relationship with a company, and it’s a costly problem. Customers are the fuel that powers a business. Loss of customers impacts sales. Further, it’s much more difficult and costly to gain new customers than it is to retain existing customers. As a result, organizations need to focus on reducing customer churn.

The good news is that machine learning can help. For many businesses that offer subscription based services, it’s critical to both predict customer churn and explain what features relate to customer churn.

3.2.2 IBM Watson Dataset

We now dive into the IBM Watson Telco Dataset. According to IBM, the business challenge is:

A telecommunications company [Telco] is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.

The dataset includes information about:

  • Customers who left within the last month: Churn
  • Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information: how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers: gender, age range, and if they have partners and dependents

Let’s load it:

rm(list=ls()); graphics.off() # get rid of everything in the workspace
data <- readRDS("data/telco_churn.rds")

3.3 Data Inspection, exploration, preprocessing

Let’s inspect our data.

head(data)
glimpse(data)
## Observations: 7,043
## Variables: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "77...
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "...
## $ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "N...
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "N...
## $ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5...
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"...
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone ser...
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "F...
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", ...
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", ...
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", ...
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "N...
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "...
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "N...
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month...
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"...
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed c...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89....
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820....
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", ...
skim(data) %>% skimr::kable()
## Skim summary statistics  
##  n obs: 7043    
##  n variables: 21    
## 
## Variable type: character
## 
##      variable        missing    complete     n      min    max    empty    n_unique 
## ------------------  ---------  ----------  ------  -----  -----  -------  ----------
##       Churn             0         7043      7043     2      3       0         2     
##      Contract           0         7043      7043     8     14       0         3     
##     customerID          0         7043      7043    10     10       0        7043   
##     Dependents          0         7043      7043     2      3       0         2     
##  DeviceProtection       0         7043      7043     2     19       0         3     
##       gender            0         7043      7043     4      6       0         2     
##  InternetService        0         7043      7043     2     11       0         3     
##   MultipleLines         0         7043      7043     2     16       0         3     
##    OnlineBackup         0         7043      7043     2     19       0         3     
##   OnlineSecurity        0         7043      7043     2     19       0         3     
##  PaperlessBilling       0         7043      7043     2      3       0         2     
##      Partner            0         7043      7043     2      3       0         2     
##   PaymentMethod         0         7043      7043    12     25       0         4     
##    PhoneService         0         7043      7043     2      3       0         2     
##  StreamingMovies        0         7043      7043     2     19       0         3     
##    StreamingTV          0         7043      7043     2     19       0         3     
##    TechSupport          0         7043      7043     2     19       0         3     
## 
## Variable type: integer
## 
##    variable       missing    complete     n      mean      sd      p0    p25    p50    p75    p100      hist   
## ---------------  ---------  ----------  ------  -------  -------  ----  -----  -----  -----  ------  ----------
##  SeniorCitizen       0         7043      7043    0.16     0.37     0      0      0      0      1      <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2582> 
##     tenure           0         7043      7043    32.37    24.56    0      9     29     55      72     <U+2587><U+2583><U+2583><U+2582><U+2582><U+2583><U+2583><U+2585> 
## 
## Variable type: numeric
## 
##     variable       missing    complete     n       mean       sd        p0       p25        p50        p75       p100       hist   
## ----------------  ---------  ----------  ------  --------  ---------  -------  --------  ---------  ---------  --------  ----------
##  MonthlyCharges       0         7043      7043    64.76      30.09     18.25     35.5      70.35      89.85     118.75    <U+2587><U+2581><U+2583><U+2582><U+2586><U+2585><U+2585><U+2582> 
##   TotalCharges       11         7032      7043    2283.3    2266.77    18.8     401.45    1397.47    3794.74    8684.8    <U+2587><U+2583><U+2582><U+2582><U+2581><U+2581><U+2581><U+2581>

Let’s right away do some preprocessing and drop the customer ID, which we do not need. Let’s also say we’re lazy today, so we just drop observations with missing features (not good practice).

data %<>%
  select(-customerID) %>%
  drop_na() %>%
  select(Churn, everything())

Next, let’s have a first visual inspection. Many models in the prediction exercise to follow benefit from features whose conditional distributions differ between the outcome states to be predicted. So, let’s take a look. Here, ggplot2 plus the ggridges package is my favorite.

require(ggridges)
data %>%
  gather(variable, value, -Churn) %>%
  ggplot(aes(y = as.factor(variable), 
             fill =  as.factor(Churn), 
             x = percent_rank(value)) ) +
  geom_density_ridges(alpha = 0.75)

Again, we split into a training and a test dataset.

library(caret)
index <- createDataPartition(y = data$Churn, p = 0.75, list = FALSE)

training <- data[index,] 
test <- data[-index,] 

We already see that the numeric variable TotalCharges appears to be right-skewed. That is not a problem for predictive modeling per se, yet some transformation might still increase its predictive power. Let’s see. The easiest approximation is to just check the correlation.

library(corrr)
training %>%
  select(Churn, TotalCharges) %>%
  mutate(
    Churn = Churn %>% as.factor() %>% as.numeric(),
    LogTotalCharges = log(TotalCharges)
  ) %>%
  correlate() %>%
  focus(Churn) %>%
  fashion()

Alright, seems as if it could be worth it. Now we can already write down our recipe.

library(recipes)
reci <- recipe(Churn ~ ., data = training) %>%
  step_discretize(tenure, options = list(cuts = 6)) %>% # bin tenure into 6 groups
  step_log(TotalCharges) %>%                            # log-transform TotalCharges
  step_dummy(all_nominal(), -all_outcomes()) %>%        # dummy-code all categorical features
  step_center(all_predictors(), -all_outcomes()) %>%    # center (subtract the mean)
  step_scale(all_predictors(), -all_outcomes()) %>%     # scale (divide by the standard deviation)
  prep(data = training)

reci
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         19
## 
## Training data contained 5275 data points and no missing data.
## 
## Operations:
## 
## Dummy variables from tenure [trained]
## Log transformation on TotalCharges [trained]
## Dummy variables from gender, Partner, Dependents, tenure, ... [trained]
## Centering for SeniorCitizen, ... [trained]
## Scaling for SeniorCitizen, ... [trained]

Now we just split again into predictors and outcomes, bake it all, and we are good to go.

# Predictors
x_train <- bake(reci, newdata = training) %>% select(-Churn)
x_test  <- bake(reci, newdata = test) %>% select(-Churn)

# Response variables for training and testing sets
y_train <- pull(training, Churn) %>% as.factor()
y_test  <- pull(test, Churn) %>% as.factor()

We again define a trainControl() object. However, now we need some additional parameters.

ctrl <- trainControl(method = "repeatedcv", # repeatedcv, boot, cv, LOOCV, timeslice OR adaptive etc.
                     number = 5, # Number of CV's
                     repeats = 3, # Number or repeats -> repeats * CV's
                     classProbs = TRUE, # Include probability of class prediction
                     summaryFunction = twoClassSummary, # Which type of summary statistics to deliver
                     returnData = FALSE, # Don't include the original data in the train() output
                     returnResamp = "final", # Important for resampling later
                     savePredictions = "final", # The predictions of every tune or only the final to be saved?
                     allowParallel = TRUE, # For parallel processing
                     verboseIter = FALSE)

metric <- "ROC" # Which metric should be optimized (more on that later)

3.3.1 Standard logistic regression

Let’s run the common all-rounder first:

fit.logit <- train(x = x_train, 
                   y = y_train, 
                   trControl = ctrl, 
                   metric = metric,
                   method = "glm", family = "binomial") 
fit.logit
## Generalized Linear Model 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4220, 4221, 4219, 4220, 4220, 4221, ... 
## Resampling results:
## 
##   ROC     Sens    Spec 
##   0.8541  0.9038  0.535
summary(fit.logit)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.150  -0.656  -0.272   0.558   3.308  
## 
## Coefficients: (8 not defined because of singularities)
##                                        Estimate Std. Error z value
## (Intercept)                           -1.698538   0.057470  -29.56
## SeniorCitizen                          0.090776   0.036715    2.47
## MonthlyCharges                        -0.775133   1.104180   -0.70
## TotalCharges                          -1.008294   0.176949   -5.70
## gender_Male                           -0.005493   0.038348   -0.14
## Partner_Yes                            0.047467   0.045963    1.03
## Dependents_Yes                        -0.079191   0.048790   -1.62
## tenure_bin1                            0.061725   0.175406    0.35
## tenure_bin2                            0.068972   0.114565    0.60
## tenure_bin3                            0.064122   0.088934    0.72
## tenure_bin4                            0.103126   0.080920    1.27
## tenure_bin5                            0.015227   0.077975    0.20
## tenure_bin6                                  NA         NA      NA
## PhoneService_Yes                       0.032503   0.220752    0.15
## MultipleLines_No.phone.service               NA         NA      NA
## MultipleLines_Yes                      0.280398   0.102508    2.74
## InternetService_Fiber.optic            0.953801   0.459183    2.08
## InternetService_No                    -0.829467   0.384322   -2.16
## OnlineSecurity_No.internet.service           NA         NA      NA
## OnlineSecurity_Yes                    -0.055316   0.093315   -0.59
## OnlineBackup_No.internet.service             NA         NA      NA
## OnlineBackup_Yes                       0.035099   0.096702    0.36
## DeviceProtection_No.internet.service         NA         NA      NA
## DeviceProtection_Yes                   0.086077   0.097719    0.88
## TechSupport_No.internet.service              NA         NA      NA
## TechSupport_Yes                       -0.062840   0.095480   -0.66
## StreamingTV_No.internet.service              NA         NA      NA
## StreamingTV_Yes                        0.291924   0.184630    1.58
## StreamingMovies_No.internet.service          NA         NA      NA
## StreamingMovies_Yes                    0.350737   0.185129    1.89
## Contract_One.year                     -0.284772   0.051512   -5.53
## Contract_Two.year                     -0.711598   0.095049   -7.49
## PaperlessBilling_Yes                   0.160013   0.043313    3.69
## PaymentMethod_Credit.card..automatic.  0.007981   0.055125    0.14
## PaymentMethod_Electronic.check         0.150479   0.052916    2.84
## PaymentMethod_Mailed.check             0.000103   0.057171    0.00
##                                                   Pr(>|z|)    
## (Intercept)                           < 0.0000000000000002 ***
## SeniorCitizen                                      0.01342 *  
## MonthlyCharges                                     0.48268    
## TotalCharges                             0.000000012107471 ***
## gender_Male                                        0.88610    
## Partner_Yes                                        0.30174    
## Dependents_Yes                                     0.10457    
## tenure_bin1                                        0.72492    
## tenure_bin2                                        0.54715    
## tenure_bin3                                        0.47090    
## tenure_bin4                                        0.20252    
## tenure_bin5                                        0.84517    
## tenure_bin6                                             NA    
## PhoneService_Yes                                   0.88294    
## MultipleLines_No.phone.service                          NA    
## MultipleLines_Yes                                  0.00623 ** 
## InternetService_Fiber.optic                        0.03779 *  
## InternetService_No                                 0.03091 *  
## OnlineSecurity_No.internet.service                      NA    
## OnlineSecurity_Yes                                 0.55332    
## OnlineBackup_No.internet.service                        NA    
## OnlineBackup_Yes                                   0.71663    
## DeviceProtection_No.internet.service                    NA    
## DeviceProtection_Yes                               0.37840    
## TechSupport_No.internet.service                         NA    
## TechSupport_Yes                                    0.51044    
## StreamingTV_No.internet.service                         NA    
## StreamingTV_Yes                                    0.11385    
## StreamingMovies_No.internet.service                     NA    
## StreamingMovies_Yes                                0.05815 .  
## Contract_One.year                        0.000000032338433 ***
## Contract_Two.year                        0.000000000000071 ***
## PaperlessBilling_Yes                               0.00022 ***
## PaymentMethod_Credit.card..automatic.              0.88489    
## PaymentMethod_Electronic.check                     0.00446 ** 
## PaymentMethod_Mailed.check                         0.99856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6108.6  on 5274  degrees of freedom
## Residual deviance: 4220.4  on 5247  degrees of freedom
## AIC: 4276
## 
## Number of Fisher Scoring iterations: 6

3.3.2 Elastic net

We can surely also run an elastic net with a logit link (family = "binomial"). As before, we define a tuneGrid of hyperparameter combinations and pass it, together with the trainControl object and the metric to be optimized, to caret’s train() function. Let’s go!

tune.glmnet = expand.grid(alpha = seq(0, 1, length = 3), 
                          lambda = seq(0, 0.3, length = 7))

fit.glmnet <- train(x = x_train, 
                    y = y_train, 
                    trControl = ctrl, 
                    metric = metric, 
                    method = "glmnet", family = "binomial", 
                    tuneGrid = tune.glmnet)
fit.glmnet
## glmnet 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4220, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  ROC     Sens    Spec  
##   0.0    0.00    0.8540  0.9108  0.5221
##   0.0    0.05    0.8526  0.9180  0.4874
##   0.0    0.10    0.8510  0.9280  0.4401
##   0.0    0.15    0.8496  0.9385  0.4080
##   0.0    0.20    0.8486  0.9484  0.3773
##   0.0    0.25    0.8478  0.9564  0.3454
##   0.0    0.30    0.8471  0.9630  0.3129
##   0.5    0.00    0.8545  0.9058  0.5338
##   0.5    0.05    0.8479  0.9608  0.3012
##   0.5    0.10    0.8387  0.9842  0.1671
##   0.5    0.15    0.8254  1.0000  0.0000
##   0.5    0.20    0.8126  1.0000  0.0000
##   0.5    0.25    0.8024  1.0000  0.0000
##   0.5    0.30    0.5000  1.0000  0.0000
##   1.0    0.00    0.8546  0.9053  0.5340
##   1.0    0.05    0.8366  0.9779  0.2121
##   1.0    0.10    0.8098  1.0000  0.0000
##   1.0    0.15    0.5000  1.0000  0.0000
##   1.0    0.20    0.5000  1.0000  0.0000
##   1.0    0.25    0.5000  1.0000  0.0000
##   1.0    0.30    0.5000  1.0000  0.0000
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.

We can also plot the results.

ggplot(fit.glmnet)

3.3.3 Decision tree

3.3.3.1 Introduction

OK, next we will do the same exercise for a classification tree. A brief reminder of what this interesting family of models is about:

  • Mostly used in classification problems on continuous or categorical variables.
  • Idea: split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
  • Repeat until a stopping criterion is reached; this leads to a tree-like structure.

This class became increasingly popular in business and other applications. Some reasons are:

  • Easy to Understand: Decision tree output is very easy to understand even for people from non-analytical background.
  • Useful in data exploration: Decision trees are one of the fastest ways to identify the most significant variables and the relation between two or more variables.
  • Data type is not a constraint: It can handle both numerical and categorical variables.
  • Non Parametric Method: Decision tree is considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and the classifier structure.

Some tree terminology:

  • Root Node: Entire population or sample and this further gets divided into two or more homogeneous sets.
  • Splitting: It is a process of dividing a node into two or more sub-nodes.
  • Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  • Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.

The decision of making strategic splits heavily affects a tree’s accuracy. So, how does the tree decide where to split? This differs across the large family of tree-like models. Common approaches are (see the small sketch after the list below):

  • Gini Index
  • \(\chi^2\)
  • Reduction in \(\sigma^2\)
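
As a small sketch of the first criterion (with purely hypothetical class proportions): a candidate split is attractive when it produces child nodes with a lower weighted Gini impurity than the parent node.

gini <- function(p) 1 - sum(p^2)     # p = class proportions within a node

parent <- c(0.5, 0.5)                # 50/50 mix -> maximal impurity for two classes
left   <- c(0.9, 0.1)                # purer child nodes after a candidate split
right  <- c(0.2, 0.8)

gini(parent)                         # 0.5
0.5 * gini(left) + 0.5 * gini(right) # weighted impurity after the split (lower = better), assuming equally sized children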

Some common complexity restrictions are:

  • Minimum samples for a node split
  • Minimum samples for a terminal node (leaf)
  • Maximum depth of tree (vertical depth)
  • Maximum number of terminal nodes
  • Maximum features to consider for split

Likewise, there is a variety of tunable hyperparameters across different applications of this model family.

3.3.3.2 Application

OK, let’s run it! We here use rpart, but a huge variety of implementations is available. This one has only one tunable parameter: we can restrict the complexity via the hyperparameter cp. This parameter represents the complexity cost of every split, and it allows a further split only if the split decreases the model’s loss by at least this threshold.

library(rpart)
tune.dt = expand.grid(cp = c(0.001, 0.005 ,0.010, 0.020, 0.040))

fit.dt <- train(x = x_train, 
                y = y_train, 
                trControl = ctrl, 
                metric = metric,
                method = "rpart", tuneGrid =  tune.dt) 
fit.dt
## CART 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4220, 4221, 4219, 4219, 4221, 4220, ... 
## Resampling results across tuning parameters:
## 
##   cp     ROC     Sens    Spec  
##   0.001  0.8179  0.8845  0.5024
##   0.005  0.7908  0.9021  0.4871
##   0.010  0.7606  0.9130  0.4451
##   0.020  0.7442  0.9191  0.4139
##   0.040  0.6643  0.9457  0.3129
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
ggplot(fit.dt)

We directly see that in this case, increasing complexity costs lead to decreasing model performance. Such results are somewhat typical for large datasets, where high complexity costs prevent the tree from fully exploiting the richness of information. Therefore, we settle for a minimal cp of 0.001.

We can also plot the final tree structure.

require(rpart.plot)
rpart.plot(fit.dt$finalModel)

This plot is usually informative and gives us a nice intuition of how the prediction works in classification trees. However, with that many dummies, it’s a bit messy.

3.3.4 Random Forest

3.3.4.1 Introduction

Finally, we fit another class of models which has gained popularity in the last decade and proven to be a powerful and versatile prediction technique that performs well in almost every setting: the random forest. It is among the most popular non-neural-network ML algorithms, and by some considered to be a panacea for all data science problems. Some say: “when you can’t think of any algorithm (irrespective of situation), use random forest!”

As a continuation of tree-based classification methods, random forests aim at reducing overfitting by introducing randomness via bootstrapping and ensemble techniques. A random forest is a type of ensemble learning method, where a group of weak models combines to form a powerful model. The idea is to create an “ensemble of classification trees”, all grown on a different bootstrap sample. Having grown a forest of trees, every tree makes a prediction, and the final model prediction is formed by a majority vote of all trees. This idea is close to Monte Carlo simulation approaches, tapping into the power of randomness.
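
The sketch below stylizes that bagging-and-voting idea with a handful of rpart trees grown on bootstrap samples; it is not how the ranger package used later works internally, and the helper objects (train_df, test_df, votes) are purely illustrative.

library(rpart)
set.seed(123)
train_df <- training %>% mutate_if(is.character, as.factor) # helper copies with factor columns
test_df  <- test %>% mutate_if(is.character, as.factor)

votes <- sapply(1:5, function(i) {
  boot_idx <- sample(nrow(train_df), replace = TRUE)             # a bootstrap sample
  tree <- rpart(Churn ~ ., data = train_df[boot_idx, ])          # grow one tree on it
  as.character(predict(tree, newdata = test_df, type = "class")) # its vote on the test set
})

majority_vote <- apply(votes, 1, function(v) names(which.max(table(v)))) # majority vote per observation
head(majority_vote)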

3.3.4.2 Application

In this case, we draw from a number of tunable hyperparameters. First, we tune the number of randomly selected features that are available candidates for every split, on a range \([1,k-1]\), where lower values introduce a higher level of randomness into every split. Our second hyperparameter is the minimal number of observations that have to fall into every node, where lower numbers increase the potential precision of splits, but also the risk of overfitting. Finally, we also use the general splitrule as a hyperparameter, where the choice is between i.) a traditional split according to the optimization of the Gini impurity of the class distribution in every split, and ii.) the “Extremely randomized trees” (ExtraTrees) procedure by Geurts et al. (2006), where additional randomness is introduced into the selection of split points.

Note: We here use the ranger instead of the more popular randomForest package. For those who wonder: it basically does the same, only faster.

library(ranger)
tune.rf <- expand.grid(mtry = round(seq(1, ncol(x_train)-1, length = 3)),
                       min.node.size = c(10, 50, 100),
                       splitrule = c("gini", "extratrees") )
                       
fit.rf <- train(x = x_train, 
                y = y_train, 
                trControl = ctrl,  
                metric = metric,
                method = "ranger", 
                importance = "impurity", # To define how to measure variable importantce (later more)
                num.trees = 25,
                tuneGrid =  tune.rf) 
fit.rf
## Random Forest 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4220, 4220, 4219, 4221, 4220, 4221, ... 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  splitrule   ROC     Sens    Spec   
##    1     10            gini        0.8310  0.9954  0.06490
##    1     10            extratrees  0.8291  0.9975  0.03425
##    1     50            gini        0.8299  0.9967  0.04920
##    1     50            extratrees  0.8276  0.9966  0.03874
##    1    100            gini        0.8290  0.9972  0.04208
##    1    100            extratrees  0.8279  0.9980  0.03471
##   18     10            gini        0.8304  0.8966  0.51377
##   18     10            extratrees  0.8307  0.8947  0.52352
##   18     50            gini        0.8443  0.9083  0.50379
##   18     50            extratrees  0.8499  0.9066  0.52376
##   18    100            gini        0.8466  0.9073  0.50949
##   18    100            extratrees  0.8507  0.9071  0.51282
##   34     10            gini        0.8241  0.8922  0.51234
##   34     10            extratrees  0.8278  0.8928  0.51498
##   34     50            gini        0.8394  0.9049  0.50593
##   34     50            extratrees  0.8462  0.9075  0.50950
##   34    100            gini        0.8430  0.9077  0.50594
##   34    100            extratrees  0.8484  0.9071  0.51425
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 18, splitrule =
##  extratrees and min.node.size = 100.
ggplot(fit.rf)

We see that a number of randomly selected features per split of roughly half of all available features maximizes model performance in all cases. The same goes for a high minimal number of observations (100) per node. Finally, the ExtraTrees procedure first underperforms with a minimal number of randomly selected features, but outperforms the traditional Gini-based splitrule when the number of available features increases. Such results are typical for large samples, where a high amount of injected randomness tends to make model predictions more robust.

rm(fit.logit,fit.glmnet,fit.dt,fit.rf)

3.4 Final prediction

The additional caretEnsemble package also has the caretList() function, with which one can fit a set of different models jointly. In addition, it takes care of the indexing for the resamples, so that they are the same for all models. Neat, isn’t it? We could have done just that from the very beginning; we only fitted separate models before for illustration. Note: Be prepared to wait some time.

Since we already know the optimal tuning parameters, we can now define them and skip the tuning loop.

tune.glmnet <- expand.grid(alpha = 0.5, lambda = 0)
tune.dt <- expand.grid(cp = 0.001)
tune.rf <- expand.grid(mtry = 18, min.node.size = 100, splitrule = "extratrees" )

And let’s run the ensemble:

library(caretEnsemble)
models <- caretList(x = x_train, 
                    y = y_train, 
                    trControl = ctrl, 
                    metric = metric, 
                    continue_on_fail = TRUE,
                    tuneList = list(logit = caretModelSpec(method = "glm", 
                                                           family = "binomial"),
                                   elasticnet = caretModelSpec(method = "glmnet", 
                                                               family = "binomial",
                                                               tuneGrid =  tune.glmnet),
                                   dt = caretModelSpec(method = "rpart",
                                                       tuneGrid =  tune.dt),
                                   rf = caretModelSpec(method = "ranger", 
                                                       importance = "impurity", 
                                                       # num.trees = 25, 
                                                       tuneGrid =  tune.rf)   
                                   ) 
                    )

3.5 Performance Evaluation

And done. Now, let’s evaluate the results. This evaluation is based on the average prediction of the different folds on the respective validation fold within the train data (the final evaluation on the test set still comes later). Let’s take a look. We can also plot the model comparison.

models
## $logit
## Generalized Linear Model 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4221, ... 
## Resampling results:
## 
##   ROC     Sens    Spec  
##   0.8546  0.9045  0.5388
## 
## 
## $elasticnet
## glmnet 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4221, ... 
## Resampling results:
## 
##   ROC     Sens    Spec  
##   0.8549  0.9064  0.5357
## 
## Tuning parameter 'alpha' was held constant at a value of 0.5
## 
## Tuning parameter 'lambda' was held constant at a value of 0
## 
## $dt
## CART 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4221, ... 
## Resampling results:
## 
##   ROC     Sens    Spec  
##   0.8166  0.8878  0.4998
## 
## Tuning parameter 'cp' was held constant at a value of 0.001
## 
## $rf
## Random Forest 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4219, 4221, 4220, 4219, 4221, 4221, ... 
## Resampling results:
## 
##   ROC     Sens    Spec  
##   0.8519  0.9069  0.5197
## 
## Tuning parameter 'mtry' was held constant at a value of 18
##  extratrees
## Tuning parameter 'min.node.size' was held constant at a
##  value of 100
## 
## attr(,"class")
## [1] "caretList"

The last step is to recollect the results from each model and compare them. We here use the resamples() function of caret. This can be done because we ran all models within a caretList, which keeps track of all resamples across the different models and enables us to compare them. Let’s check how the models perform.

results <- resamples(models) 

results$values %>%
  select(-Resample) %>%
  tidy()
bwplot(results)

Let’s see how similarly the models predict on the validation samples.

modelCor(results) 
##             logit elasticnet     dt     rf
## logit      1.0000     0.9994 0.6975 0.9665
## elasticnet 0.9994     1.0000 0.7066 0.9681
## dt         0.6975     0.7066 1.0000 0.7322
## rf         0.9665     0.9681 0.7322 1.0000

3.6 Evaluation via final out-of-sample prediction

OK, to be as confident as possible, let’s do the final evaluation. Generally, many ML exercises can be done without it, and it is often common practice to go with the performance in the k-fold cross-validation. However, I believe this final step should be done if the data permits. Only with this final prediction do we expose our model to data it has not yet been exposed to, neither directly nor indirectly.

We first let all models do their prediction on the test set. We create two objects, one for the probability prediction (type = "prob"), and one for the predicted class of the outcome (where the positive class is predicted if \(P \geq 0.5\)). We do so because we need both for the different summaries to come.

models.preds <- data.frame(lapply(models, predict, newdata = x_test)) 
models.preds.prob <- data.frame(lapply(models, predict, newdata = x_test, type = "prob"))
glimpse(models.preds)
## Observations: 1,757
## Variables: 4
## $ logit      <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, ...
## $ elasticnet <fct> No, No, Yes, No, No, No, No, No, Yes, No, No, Yes, ...
## $ dt         <fct> No, No, Yes, Yes, No, No, No, No, Yes, No, No, Yes,...
## $ rf         <fct> No, No, Yes, No, No, No, No, No, No, No, No, Yes, N...
glimpse(models.preds.prob)
## Observations: 1,757
## Variables: 8
## $ logit.No       <dbl> 0.9661, 0.9609, 0.2187, 0.7574, 0.9858, 0.5990,...
## $ logit.Yes      <dbl> 0.033924, 0.039050, 0.781258, 0.242560, 0.01420...
## $ elasticnet.No  <dbl> 0.9635, 0.9577, 0.2260, 0.7580, 0.9843, 0.6072,...
## $ elasticnet.Yes <dbl> 0.036523, 0.042295, 0.774012, 0.241964, 0.01573...
## $ dt.No          <dbl> 0.9362, 0.9362, 0.1702, 0.3043, 0.9362, 0.8500,...
## $ dt.Yes         <dbl> 0.06377, 0.06377, 0.82979, 0.69565, 0.06377, 0....
## $ rf.No          <dbl> 0.9350, 0.9256, 0.1913, 0.7430, 0.9659, 0.7038,...
## $ rf.Yes         <dbl> 0.064972, 0.074414, 0.808670, 0.256981, 0.03412...

Let’s plot the corresponding confusion matrix for all models

cm <- list()
cm$logit <- confusionMatrix(factor(models.preds$logit), y_test, positive = "Yes")
cm$elasticnet <- confusionMatrix(factor(models.preds$elasticnet), y_test, positive = "Yes")
cm$dt <- confusionMatrix(factor(models.preds$dt), y_test, positive = "Yes")
cm$rf <- confusionMatrix(factor(models.preds$rf), y_test, positive = "Yes")

par(mfrow=c(2,2)) # Note: One day find a neater way to make this prettier
for(i in 1:length(cm)) {fourfoldplot(cm[[i]]$table, color = c("darkred", "darkgreen") ); title(main = names(cm[i]), cex.main = 1.5)}

And also the corresponding ROC curve…

library(caTools)
# Note: This is kind of ugly. At one point I will try making it prettier with the "plotROC" library
ROC <- list()
ROC$logit <- colAUC(models.preds.prob$logit.Yes, y_test, plotROC = F)
ROC$elasticnet <- colAUC(models.preds.prob$elasticnet.Yes, y_test, plotROC = F)
ROC$dt <- colAUC(models.preds.prob$dt.Yes, y_test, plotROC = F)
ROC$rf <- colAUC(models.preds.prob$rf.Yes, y_test, plotROC = F)

par(mfrow=c(2,2))
for(i in 1:length(ROC)) {colAUC(models.preds.prob[[i*2]], y_test, plotROC = T); title(main = names(ROC[i]), cex.main = 1.5)}

So, now only one nice final table for the sake of overview.

model_eval <- tidy(cm$logit$overall) %>%
  select(names) %>%
  mutate(Logit = round(cm$logit$overall,3) ) %>%
  mutate(ElasticNet = round(cm$elasticnet$overall,3)) %>%
  mutate(ClassTree = round(cm$dt$overall,3)) %>%
  mutate(RandForest = round(cm$rf$overall,3)) 

model_eval2 <- tidy(cm$logit$byClass) %>%
  select(names) %>%
  mutate(Logit = round(cm$logit$byClass,3) ) %>%
  mutate(ElasticNet = round(cm$elasticnet$byClass,3)) %>%
  mutate(ClassTree = round(cm$dt$byClass,3)) %>%
  mutate(RandForest = round(cm$rf$byClass,3)) %>%
  filter( !(names %in% c("AccuracyPValue", "McnemarPValue")) ) 

model_eval_all <- model_eval %>%
  rbind(model_eval2) %>%
  rbind(c("AUC", round(ROC$logit,3), round(ROC$elasticnet,3), round(ROC$dt,3), round(ROC$rf,3) ))

model_eval_all
rm(model_eval, model_eval2)