1 Introduction

Machine learning models are developed in different tools and different languages. On one hand this is great, as everyone can choose the tool that suits their needs; on the other hand, it makes it hard to compare such models in any way other than simply comparing accuracies.

In this vignette we show how DALEX can be used to compare models built in different languages.

We trained four models: gbm and CatBoost in R, the h2o implementation of gbm in Java, and the scikit-learn implementation of gbm in Python. Then we visually explore their similarities and differences through DALEX explainers.

2 Data

We use the titanic dataset. It is divided into titanic_train and titanic_test parts, stored in csv files. For this dataset we will train binary classifiers that predict the probability of surviving the Titanic disaster.
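The chunks below assume that titanic_train_X, titanic_train_Y, titanic_test_X, titanic_test_Y, and a label vector y are already in the workspace. A minimal loading sketch follows; the file names and the column split are assumptions based on the description above, not part of the original code.

# Sketch only: assumed file names, matching the description above.
titanic_train <- read.csv("titanic_train.csv")
titanic_test  <- read.csv("titanic_test.csv")

# Split the features (X) from the target (Y).
titanic_train_X <- titanic_train[, setdiff(names(titanic_train), "survived")]
titanic_train_Y <- titanic_train[, "survived", drop = FALSE]
titanic_test_X  <- titanic_test[, setdiff(names(titanic_test), "survived")]
titanic_test_Y  <- titanic_test[, "survived", drop = FALSE]

# Plain vector of test labels, used later for AUC and for the explainers.
y <- titanic_test_Y$survived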

kable(titanic_test_X %>% head(), "html") %>% kable_styling("striped") %>%
  scroll_box(width = "100%")
| gender.female | gender.male | age | class.1st | class.2nd | class.3rd | class.deck.crew | class.engineering.crew | class.restaurant.staff | class.victualling.crew | embarked.Belfast | embarked.Cherbourg | embarked.Queenstown | embarked.Southampton | fare | sibsp | parch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 16 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7.1300 | 0 | 0 |
| 0 | 1 | 25 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7.1300 | 0 | 0 |
| 1 | 0 | 28 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 24.0000 | 1 | 0 |
| 0 | 1 | 20 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7.1806 | 0 | 0 |
| 0 | 1 | 30 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7.0500 | 0 | 0 |
| 1 | 0 | 19 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7.1701 | 1 | 0 |
kable(titanic_train_X %>% head(), "html") %>% kable_styling("striped") %>%
  scroll_box(width = "100%")
| gender.female | gender.male | age | class.1st | class.2nd | class.3rd | class.deck.crew | class.engineering.crew | class.restaurant.staff | class.victualling.crew | embarked.Belfast | embarked.Cherbourg | embarked.Queenstown | embarked.Southampton | fare | sibsp | parch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 42 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7.1100 | 0 | 0 |
| 0 | 1 | 13 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 20.0500 | 0 | 2 |
| 0 | 1 | 16 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 20.0500 | 1 | 1 |
| 1 | 0 | 39 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 20.0500 | 1 | 1 |
| 0 | 1 | 30 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 24.0000 | 1 | 0 |
| 0 | 1 | 27 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 18.1509 | 0 | 0 |

3 Models in different languages

3.1 gbm in R

First, we train the R implementation of gbm, which we access through the mlr package as a wrapper. Specifying most of the parameters explicitly helps us fit similar models across languages, at least in theory.

library("mlr")
set.seed(123, "L'Ecuyer")
task <- makeClassifTask(
            id = "R",
            data = cbind(titanic_train_X, titanic_train_Y),
            target = "survived"
          )
learner <- makeLearner(
            "classif.gbm",
            par.vals = list(
              distribution = "bernoulli",
              n.trees = 5000,
              interaction.depth = 4,
              n.minobsinnode = 12,
              shrinkage = 0.001,
              bag.fraction = 0.5,
              train.fraction = 1
            ),
            predict.type = "prob"
          )
r_gbm <- train(learner, task)
performance(predict(r_gbm, 
                    newdata = cbind(titanic_test_X, titanic_test_Y)), 
            measures = auc)
##      auc 
## 0.826275

3.2 CatBoost in R

CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. It is similar in spirit to gbm, so we will see how similar the trained models are.

library("catboost")
pool_train <- catboost.load_pool(titanic_train_X, titanic_train_Y$survived)
pool_test  <- catboost.load_pool(titanic_test_X)
r_catboost <- catboost.train(pool_train, 
               test_pool = NULL, 
               params = list(
                 custom_loss = "AUC",
                 iterations = 5000,
                 learing_rate = 0.001,
                 depth = 4,
                 logging_level = "Silent"
               )
               )
preds <- catboost.predict(r_catboost, pool_test, "Probability")
mltools::auc_roc(preds, y)
## [1] 0.8154836

3.3 gbm in h2o (java)

We will access H2O models via the h2o R package. Using the h2o documentation, we are able to match as many parameters as possible, which helps us compare the models objectively.
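The chunk below uses titanic_h2o and titanic_test_h2o, H2OFrames built from the same training and test data. A minimal setup sketch, with the conversion code being an assumption rather than part of the original vignette:

# Sketch only: start h2o and convert the data frames to H2OFrames,
# making sure the target is a factor so h2o treats it as a class label.
library("h2o")
h2o.init()
titanic_h2o <- as.h2o(cbind(titanic_train_X, titanic_train_Y))
titanic_h2o$survived <- as.factor(titanic_h2o$survived)
titanic_test_h2o <- as.h2o(cbind(titanic_test_X, titanic_test_Y))
titanic_test_h2o$survived <- as.factor(titanic_test_h2o$survived)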

set.seed(123, "L'Ecuyer")
java_h2o_gbm <- h2o.gbm(
                  training_frame = titanic_h2o,
                  y = "survived",
                  distribution = "bernoulli",
                  ntrees = 5000,
                  max_depth = 4,
                  min_rows =  12,
                  learn_rate = 0.001
                )
h2o.auc(h2o.performance(java_h2o_gbm, newdata = titanic_test_h2o))
## [1] 0.8146509

3.4 gbm in scikit-learn (python)

Inspecting models created in Python from R is not as hard as it may seem. It is possible thanks to two packages: reticulate, available in R, and pickle, from Python.

# Python: train a scikit-learn gbm and serialize it with pickle.
# Assumes the same training data is available as pandas DataFrames.
from pandas import DataFrame, read_csv
import pandas as pd 
import pickle
import sklearn.ensemble
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error
model = sklearn.ensemble.GradientBoostingClassifier(
  n_estimators= 5000,
  learning_rate=0.001, 
  max_depth=4, 
  min_samples_split = 12
)
model = model.fit(titanic_train_X, titanic_train_Y)
pickle.dump(model, open("gbm.pkl", "wb"))
# R: load the pickled model back through reticulate.
library("reticulate")
python_scikitlearn_gbm <- py_load_object("gbm.pkl", pickle = "pickle")
preds <- python_scikitlearn_gbm$predict_proba(titanic_test_X)[, 2]
mltools::auc_roc(preds, y)
## [1] 0.8244396

scikit-learn turned out to be better than h2o and slightly better than the R model. But is it a big difference? Let's explore these models in detail.

4 Model cross-language comparison

Because all four packages return slightly different objects, we have to create DALEX wrappers around them. This requires a different predict function for each, so that we can get a plain vector of predicted probabilities out of every trained model.

# h2o wrapper
h2o_predict <- function(model, newdata) {
  newdata_h2o <- as.h2o(newdata)
  res <- as.data.frame(h2o.predict(model, newdata_h2o))
  return(as.numeric(res$p1))
}
# catboost wrapper 
catboost_predict <- function(object, newdata) {
  newdata_pool <- catboost.load_pool(newdata)
  catboost.predict(object, newdata_pool, "Probability")
}
# mlr wrapper 
mlr_predict <- function(object, newdata) {
  pred <- predict(object, newdata = newdata)
  response <- pred$data[, 1]
  return(response)
}
# python wrapper
py_predict <- function(model, newdata) {
  model$predict_proba(newdata)[, 2]
}
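Before building the explainers, a quick sanity check (a sketch, not in the original): every wrapper shares the same signature, f(model, newdata), and returns a plain numeric vector of survival probabilities.

# All four wrappers can be called uniformly on the raw test data.
head(mlr_predict(r_gbm, titanic_test_X))
head(catboost_predict(r_catboost, titanic_test_X))
head(h2o_predict(java_h2o_gbm, titanic_test_X))
head(py_predict(python_scikitlearn_gbm, titanic_test_X))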

Now we can create DALEX wrappers around our models.

library("DALEX")
r_explain <- DALEX::explain(r_gbm,
  data = titanic_test_X, y = y, label = "gmb (R)",
  predict_function = mlr_predict
)
catboost_explain <- DALEX::explain(r_catboost,
  data = titanic_test_X, y = y, label = "CatBoost (R)",
  predict_function = catboost_predict
)
h2o_explain <- DALEX::explain(java_h2o_gbm,
  data = titanic_test_X, y = y, label = "gbm (h2o/java)",
  predict_function = h2o_predict
)
py_explain <- explain(python_scikitlearn_gbm,
  data = titanic_test_X, y = y, label = "gbm (python/sklearn)",
  predict_function = py_predict
)

4.1 Model performance

With the explainers ready, we can compare the models and look for possible differences. Model performance and residual distributions give us a first look.

plot(
  model_performance(r_explain),
  model_performance(h2o_explain),
  model_performance(py_explain),
  model_performance(catboost_explain)
  )

As we can see, the models are quite similar. The biggest difference is in the distribution of residuals over the [0, 0.13] interval, where the R gbm has the largest share of errors. The R model is also worse over the [0.13, 0.25] interval, although the difference there is not as easy to spot as in the previous one. CatBoost, on the other hand, is the best model for residuals in [0, 0.13].

4.2 Variable importance

Here the drop in 1 - AUC is used to compare variable importance. Keep in mind that when defining a custom loss function you have to provide the arguments in the correct order: the real values of y first and the predictions second.

custom_loss_function <- function(y, yhat) {
  1 - mltools::auc_roc(yhat, y)
}
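As a quick check, applying this loss to the scikit-learn model's predictions on the test set gives one minus the AUC reported earlier:

custom_loss_function(y, py_predict(python_scikitlearn_gbm, titanic_test_X))
## [1] 0.1755604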
plot(
  variable_importance(r_explain, type = "difference", loss_function = custom_loss_function),
  variable_importance(h2o_explain, type = "difference", loss_function = custom_loss_function),
  variable_importance(py_explain, type = "difference", loss_function = custom_loss_function),
  variable_importance(catboost_explain, type = "difference", loss_function = custom_loss_function)
)

This time we can see a significant difference. The h2o model figured out the correlation between gender.male and gender.female and dropped one of them, while the other models use both of these columns. Interestingly, the next four most important variables are the same across the models.
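Indeed, gender.female and gender.male are one-hot encodings of a single variable, so either column fully determines the other. A quick check (a sketch, not in the original):

# The two gender indicator columns always sum to one.
all(titanic_test_X$gender.female + titanic_test_X$gender.male == 1)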

4.3 PDP plots for fare

library("ingredients")
pdp_r   <- partial_dependency(r_explain, 
                          variable_splits = list(fare = seq(0,100,0.1)))
pdp_h2o  <- partial_dependency(h2o_explain, 
                          variable_splits = list(fare = seq(0,100,0.1)))
pdp_py <- partial_dependency(py_explain,
                         variable_splits = list(fare = seq(0,100,0.1)))
pdp_catboost <- partial_dependency(catboost_explain, 
                         variable_splits = list(fare = seq(0,100,0.1)))
plot(pdp_r, pdp_h2o, pdp_py, pdp_catboost) + xlim(0, 150) + ylim(0, 0.5)

We can see differences in how the models behave for different values of fare; the responses are similar only over the [0, 25] interval. Note that we passed the same variable_splits grid to every explainer, so all four curves are evaluated at exactly the same points.

4.4 PDP plots for age

pdp_r   <- partial_dependency(r_explain, 
                          variable_splits = list(age = seq(0,80,0.1)))
pdp_h2o  <- partial_dependency(h2o_explain, 
                          variable_splits = list(age = seq(0,80,0.1)))
pdp_py <- partial_dependency(py_explain,
                         variable_splits = list(age = seq(0,80,0.1)))
pdp_catboost <- partial_dependency(catboost_explain, 
                         variable_splits = list(age = seq(0,80,0.1)))
plot(pdp_r, pdp_h2o, pdp_py, pdp_catboost) + xlim(0, 80) + ylim(0, 1)