# system.file() locates files that ship with an installed package; here we read
# the example data bundled with DALEXtra.
titanic_test <- read.csv(system.file("extdata", "titanic_test.csv", package = "DALEXtra"))
titanic_test_X <- titanic_test[, 1:17]
titanic_test_y <- titanic_test[, 18]
titanic_train <- read.csv(system.file("extdata", "titanic_train.csv", package = "DALEXtra"))
pima_indians_diabets <- read.csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", sep = ",")
pima_indians_diabets_X <- pima_indians_diabets[, 1:8]
pima_indians_diabets_y <- pima_indians_diabets[, 9]

yml <- system.file("extdata", "scikitlearn.yml", package = "DALEXtra")
pkl_gbm <- system.file("extdata", "scikitlearn.pkl", package = "DALEXtra")
pkl_SGDC <- "SGDC.pkl"
pkl_keras <- system.file("extdata", "keras.pkl", package = "DALEXtra")

1 Introduction

Everyone knows DALEX. It is a tool designed to work with various black-box models such as tree ensembles, linear models, and neural networks. However, not everyone is familiar with R; many people prefer to build models using Python's scikit-learn or Keras. That is one of the reasons why DALEXtra was created. Our goal is to help integrate DALEX with other environments and libraries (e.g. mlr), and even other programming languages (Python and Java).

To illustrate applications of DALEXtra with scikit-learn we use a preprocessed titanic dataset. The original version is available in the base DALEX package, while the modified one ships with DALEXtra (see the first chunk). For the Keras examples we use the Pima Indians diabetes dataset. Our goal is to make predictions using scikit-learn and Keras Python models and explain them using DALEX. Most of the common errors are covered later in this vignette.

2 Creating Python model

In order to save a created Python model you have to use the pickle library. Below is an example with the scikit-learn package.

import pandas as pd
import pickle
import sklearn.ensemble

titanic_train = pd.read_csv('https://raw.githubusercontent.com/ModelOriented/DALEXtra/master/inst/extdata/titanic_train.csv', delimiter=",")
titanic_test = pd.read_csv('https://raw.githubusercontent.com/ModelOriented/DALEXtra/master/inst/extdata/titanic_test.csv', delimiter=",")

# Use one feature list so train and test columns stay in the same order.
features = ['gender.female', 'gender.male', 'age', 'class.1st',
            'class.2nd', 'class.3rd', 'class.deck.crew', 'class.engineering.crew',
            'class.restaurant.staff', 'class.victualling.crew', 'embarked.Belfast',
            'embarked.Cherbourg', 'embarked.Queenstown', 'embarked.Southampton',
            'sibsp', 'parch', 'fare']
titanic_train_Y = titanic_train['survived']
titanic_train_X = titanic_train[features]
titanic_test_Y = titanic_test['survived']
titanic_test_X = titanic_test[features]

model = sklearn.ensemble.GradientBoostingClassifier(
    n_estimators=5000,
    learning_rate=0.001,
    max_depth=4,
    min_samples_split=12
)
model = model.fit(titanic_train_X, titanic_train_Y)
# Serialize the fitted model to a .pkl file.
pickle.dump(model, open("scikitlearn.pkl", "wb"))

Note that a .pkl file is tied to the specific libraries and Python version it was created with. That is why it is highly recommended to save your environment; otherwise, a friend of yours who would like to run your model may have difficulties doing so. In order to extract the crucial information about your environment, launch your Anaconda prompt, activate the environment by executing activate name_of_the_env, and export a .yml file with conda env export > environment.yml. Later chapters explain how to use that file.
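From the R side you can check, before loading any model, which Python interpreter and environment reticulate has picked up. A minimal check, not part of the original workflow:

# Show the Python interpreter and environment discovered by reticulate
reticulate::py_config()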

3 Model created and explained using the same machine

3.1 Default env

This is one of the most common use cases. R tools for XAI are much more developed, so it is quite natural to explain a model using, for example, DALEX even if it was created with scikit-learn. When the creation and explanation processes happen on the same machine, you will probably not encounter environment problems, so usage is simple.

explainer_scikit_1 <- explain_scikitlearn(pkl_gbm, data = titanic_test_X, y = titanic_test_y, 
                                          colorize = FALSE, label = "GBM")
## Preparation of a new explainer is initiated
##   -> model label       :  GBM 
##   -> data              :  524  rows  17  cols 
##   -> target variable   :  524  values 
##   -> model_info        :  package reticulate , ver. 1.14 , task classification (  default  ) 
##   -> predict function  :  yhat.scikitlearn_model  will be used (  default  )
##   -> predicted values  :  numerical, min =  0.02086126 , mean =  0.288584 , max =  0.9119996  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.8669431 , mean =  0.02248468 , max =  0.9791387  
##   A new explainer has been created!
explainer_keras_1 <- explain_keras(pkl_keras, data = pima_indians_diabets_X, y = pima_indians_diabets_y, 
                                   colorize = FALSE)
## Preparation of a new explainer is initiated
##   -> model label       :  keras  (  default  )
##   -> data              :  767  rows  8  cols 
##   -> target variable   :  767  values 
##   -> model_info        :  package reticulate , ver. 1.14 , task classification (  default  ) 
##   -> predict function  :  yhat.keras  will be used (  default  )
##   -> predicted values  :  numerical, min =  0.0006659329 , mean =  0.4633349 , max =  0.9935235  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.9620994 , mean =  -0.1152253 , max =  0.9141646  
##   A new explainer has been created!
explainer_SGDC <- explain_scikitlearn(pkl_SGDC, data = titanic_test_X, y = titanic_test_y, 
                                      colorize = FALSE, label = "SGDC")
## Preparation of a new explainer is initiated
##   -> model label       :  SGDC 
##   -> data              :  524  rows  17  cols 
##   -> target variable   :  524  values 
##   -> model_info        :  package reticulate , ver. 1.14 , task classification (  default  ) 
##   -> predict function  :  yhat.scikitlearn_model  will be used (  default  )
##   -> predicted values  :  numerical, min =  0 , mean =  0.1481639 , max =  0.8470998  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.8418441 , mean =  0.1629048 , max =  1  
##   A new explainer has been created!

Keep in mind that scikit-learn models do not store training data, so you are required to pass a dataset (here the test data) to create an explainer.

library(DALEX)
library(iBreakDown)
plot(model_performance(explainer_scikit_1))

plot(model_performance(explainer_scikit_1), model_performance(explainer_SGDC))

plot(model_performance(explainer_keras_1))

plot(break_down(explainer_scikit_1, new_observation = explainer_scikit_1$data[1,]))

plot(break_down(explainer_keras_1, new_observation = explainer_keras_1$data[1,]))

As we can see, our explainer works perfectly fine and can be used by any package from the DrWhy.ai universe.
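As an extra sanity check (a minimal sketch, not from the original vignette), you can call predict() on the explainer directly; DALEX dispatches the call to the stored Python model:

# Predicted survival probabilities for the first five test passengers
predict(explainer_scikit_1, titanic_test_X[1:5, ])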

3.2 Specifying environment

Sometimes we compute things inside a virtual environment; that is why the explain_scikitlearn() function allows users to specify which Python environment should be used. condaenv and env are mutually exclusive arguments. The first is a string naming a conda virtual environment. The second is a path to a (non-conda) virtual environment. The env path may cause problems on Windows because reticulate does not support that way of specifying an environment on Windows.
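For illustration, a hedged sketch of both options (the environment name and path below are examples, not shipped with the package):

# conda environment, referenced by name
explainer_conda <- explain_scikitlearn(pkl_gbm, condaenv = "myenv",
                                       data = titanic_test_X, y = titanic_test_y)
# non-conda virtual environment, referenced by path (may fail on Windows)
explainer_venv <- explain_scikitlearn(pkl_gbm, env = "/home/user/python_envs/myenv",
                                      data = titanic_test_X, y = titanic_test_y)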

library(ingredients)
plotD3(feature_importance(explainer_SGDC))
plotD3(feature_importance(explainer_scikit_1))
plot(partial_dependency(explainer_keras_1, variables = "X33.6"))

plot(partial_dependency(explainer_scikit_1, variables = "age"), 
     partial_dependency(explainer_SGDC, variables = "age"))

4 Model created and explained using different machines

4.1 Environment creation

Python, along with its libraries, is version sensitive. That is why we need a tool that lets us recreate a launch environment for a model we have received and want to explain. Unfortunately, due to system limitations, the use case works slightly differently on Windows and on Unix-like OSes.

4.1.1 Windows

The first requirement is that Anaconda has to be on the system PATH. Otherwise Python libraries will not download due to SSL verification problems. Once that issue is solved, creating the explainer is straightforward: just specify a path to the .yml file. Do not worry about the virtual environment name; it is defined in the header of the .yml file. The condaenv argument is ignored when yml is specified and the OS is Windows.

explainer_scikit_2 <- explain_scikitlearn(pkl_gbm, yml = yml, data = titanic_test_X, 
                                          y = titanic_test_y, colorize = FALSE)
explainer_keras_2 <- explain_keras(pkl_keras, condaenv = "myenv", data = pima_indians_diabets_X, 
                                   y = pima_indians_diabets_y, colorize = FALSE)

4.1.2 Unix

It is recommended to pass the path where Anaconda can be found (e.g. ./anaconda3) along with the yml file path; the condaenv argument should be used for this purpose. Therefore, when the yml argument is specified and the OS is Unix, condaenv means the path to conda. If condaenv is left as NULL, DALEXtra will try to locate Anaconda on its own.

explainer_scikit_3 <- explain_scikitlearn(pkl_gbm, yml = yml, condaenv = "/home/user/anaconda", 
                                          data = titanic_test_X, y = titanic_test_y)

5 Common problems

5.1 ResolvePackageNotFound

This is probably the most common problem a user can encounter while using DALEXtra. It is thrown when packages specified in the .yml file are not available from the channels you have specified. There are a few ways of fixing it.

5.1.1 Version

When a .yml file is exported, it stores not only the main package version (e.g. 1.16.4) but also the OS-specific build string (e.g. py36h19fb1c0_0). Because of this, rebuilding the environment on a different machine can cause a lot of problems. The easiest fix is to remove that part of the version specification.

R/win-library/3.6/DALEXtra/extdata/scikitlearn.yml

name: myenv
channels:
  - defaults
dependencies:
  - blas=1.0=mkl
  - certifi=2019.6.16=py36_0
  - icc_rt=2019.0.0=h0cc432a_1
  - intel-openmp=2019.4=245
  - joblib=0.13.2=py36_0
  - mkl=2019.4=245
  - mkl-service=2.0.2=py36he774522_0
  - mkl_fft=1.0.12=py36h14836fe_0
  - mkl_random=1.0.2=py36h343c172_0
  - numpy=1.16.4=py36h19fb1c0_0
  - numpy-base=1.16.4=py36hc3f5095_0
  - pip=19.1.1=py36_0
  - python=3.6.8=h9f7ef89_7
  - scikit-learn=0.21.2=py36h6288b17_0
  - scipy=1.2.1=py36h29ff71c_0
  - setuptools=41.0.1=py36_0
  - six=1.12.0=py36_0
  - sqlite=3.28.0=he774522_0
  - vc=14.1=h0510ff6_4
  - vs2015_runtime=14.15.26706=h3a45250_4
  - wheel=0.33.4=py36_0
  - wincertstore=0.2=py36h7fe50ca_0
  - pandas=0.24.2
  - keras=2.2.4
R/win-library/3.6/DALEXtra/extdata/scikitlearn_unix.yml

name: myenv
channels:
  - defaults
dependencies:
  - blas=1.0=mkl
  - certifi=2019.6.16
  - intel-openmp
  - joblib=0.13.2
  - mkl=2019.4
  - mkl-service=2.0.2
  - mkl_fft=1.0.12
  - mkl_random=1.0.2
  - numpy=1.16.4
  - numpy-base=1.16.4
  - pip=19.1.1
  - python=3.6.8
  - scikit-learn=0.21.2
  - scipy=1.2.1
  - setuptools=41.0.1
  - six=1.12.0
  - sqlite=3.28.0
  - wheel=0.33.4
  - pandas=0.24.2
  - keras=2.2.4

As can be seen, not only the specific build strings but also some packages disappeared. That is because some libraries are needed only on Windows and others only on Unix. To determine which packages you should remove, check the error message to see which packages cause problems.
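If you prefer to automate this step from R, here is a minimal sketch (file names are examples) that drops the build-string suffix, i.e. everything after the second "=", from each dependency line:

# Read the exported environment file, strip build strings, write a portable copy
yml_lines <- readLines("environment.yml")
yml_lines <- sub("^(\\s*- [^=]+=[^=]+)=.*$", "\\1", yml_lines)
writeLines(yml_lines, "environment_portable.yml")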

5.1.2 Package not available from standard conda repo

conda has a lot of repositories where packages are stored. Some outdated packages are moved to older repositories. To get them, just add the repository links in the .yml header.

name: deeplearning
channels:
- defaults
- https://repo.continuum.io/pkgs/free/win-64/
- conda-forge
- ericmjl
dependencies:
- python=3.6
- matplotlib=2.0.2
- jupyter=1.0.0
- numpy=1.13.1
- seaborn=0.8
- pymc3=3.1
- pandas=0.20.3
- scipy=0.19.1
- biopython=1.69

5.1.3 pip statement

If nothing above works, there is also a way to use the pip statement. Keep in mind that it is not a recommended solution, but sometimes it is the last thing you can do.

name: PATS
channels:
  - conda-forge
dependencies:
  - python=3
  - vtk>=8.2.0
  - numpy
  - matplotlib
  - scipy
  - pip:
    - pynrrd
    - pyqt5
    - pydicom
    - colorama
    - nibabel
    - matplotlib
    - imageio
    - pywt
    - polarTransform
    - pyqt5ac

5.2 Environment already exists

If you provide a .yml file whose header contains a name identical to that of an environment that already exists, the existing environment will be activated without being changed.

You have two ways of solving that issue, both involving the Anaconda prompt. The first is removing the conda environment with the command:

conda env remove --name myenv

and then executing the function once again. The second is updating the environment via:

conda env update -f environment.yml

Keep in mind that a conda environment may be removed from R as well:

reticulate::conda_remove("name_of_the_env")

5.3 Create conda env manually

Creating environments is a delicate use case, so if nothing above works, try to create the environment yourself via the conda prompt or using the reticulate package.

conda prompt

conda create -n name_of_env python=3.4 
conda install -n name_of_env name_of_package=0.20 
reticulate

reticulate::conda_install(envname = "myenv", packages = c("name_of_package"))
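The whole environment can also be built from R. A hedged sketch using reticulate (the environment and package names below are examples):

# Create a fresh conda environment with a pinned Python version...
reticulate::conda_create(envname = "myenv", packages = "python=3.6")
# ...then install the packages the pickled model was built with
reticulate::conda_install(envname = "myenv", packages = c("scikit-learn=0.21.2", "pandas=0.24.2"))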