# system.file() extracts files that are shipped with the package and are used in the examples below.
titanic_test_X <- read.csv(system.file("extdata", "titanic_test.csv", package = "DALEXtra"))[,1:17]
titanic_test_y <- read.csv(system.file("extdata", "titanic_test.csv", package = "DALEXtra"))[,18]
titanic_train <- read.csv(system.file("extdata", "titanic_train.csv", package = "DALEXtra"))
pima_indians_diabets_X <- read.csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", sep = ",")[,1:8]
pima_indians_diabets_y <- read.csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", sep = ",")[,9]
yml <- system.file("extdata", "scikitlearn.yml", package = "DALEXtra")
pkl_gbm <- system.file("extdata", "scikitlearn.pkl", package = "DALEXtra")
pkl_SGDC <- "SGDC.pkl"
pkl_keras <- system.file("extdata", "keras.pkl", package = "DALEXtra")
Everyone knows DALEX. It is a tool designed to work with various black-box models like tree ensembles, linear models, neural networks, etc. However, not everyone is familiar with R, and many prefer to build models using Python's scikit-learn or Keras. That's one of the reasons why DALEXtra was created. Our goal is to help integrate DALEX with other environments, libraries (e.g. mlr), or even programming languages (Python and Java).
To illustrate applications of DALEXtra with scikit-learn we use a preprocessed titanic dataset. The original version is available in the base DALEX package, while the modified one ships with DALEXtra (see the first chunk). For the Keras examples we use the Pima Indians diabetes dataset. Our goal is to make predictions using scikit-learn and Keras Python models and explain them using DALEX. Most of the common errors will be covered in this vignette.
In order to save a created Python model you have to use the pickle library. Below is an example with the scikit-learn package.
import pandas as pd
import pickle
import sklearn.ensemble
titanic_train = pd.read_csv('https://raw.githubusercontent.com/ModelOriented/DALEXtra/master/inst/extdata/titanic_train.csv', delimiter=",")
titanic_test = pd.read_csv('https://raw.githubusercontent.com/ModelOriented/DALEXtra/master/inst/extdata/titanic_test.csv', delimiter=",")
titanic_train_Y = titanic_train[['survived']]
titanic_train_X = titanic_train[['gender.female', 'gender.male', 'age', 'class.1st',
                                 'class.2nd', 'class.3rd', 'class.deck.crew', 'class.engineering.crew',
                                 'class.restaurant.staff', 'class.victualling.crew', 'embarked.Belfast',
                                 'embarked.Cherbourg', 'embarked.Queenstown', 'embarked.Southampton',
                                 'sibsp', 'parch', 'fare']]
titanic_test_Y = titanic_test[['survived']]
# Keep the test columns in the same order as the training columns
titanic_test_X = titanic_test[['gender.female', 'gender.male', 'age', 'class.1st',
                               'class.2nd', 'class.3rd', 'class.deck.crew', 'class.engineering.crew',
                               'class.restaurant.staff', 'class.victualling.crew', 'embarked.Belfast',
                               'embarked.Cherbourg', 'embarked.Queenstown', 'embarked.Southampton',
                               'sibsp', 'parch', 'fare']]
model = sklearn.ensemble.GradientBoostingClassifier(
    n_estimators=5000,
    learning_rate=0.001,
    max_depth=4,
    min_samples_split=12
)
model = model.fit(titanic_train_X, titanic_train_Y)
# Serialize the fitted model so it can later be loaded from R via DALEXtra
pickle.dump(model, open("scikitlearn.pkl", "wb"))
Note that a .pkl file is tied to the specific libraries and Python version it was created with. That's why it is highly recommended to save your environment as well; otherwise a colleague who would like to run your model may have difficulties doing so. In order to extract the crucial information about your env, launch your Anaconda prompt, activate the environment by executing activate name_of_the_env
and export the .yml file with conda env export > environment.yml
. In the further chapters it will be explained how to use that file.
This is one of the most common use cases. R tools for XAI are much more developed, so it is quite natural to want to explain a model using, for example, DALEX, even if it was created using scikit-learn. When the creation and the explanation happen on the same machine, you will probably not encounter environment problems, so usage is simple.
explainer_scikit_1 <- explain_scikitlearn(pkl_gbm, data = titanic_test_X, y = titanic_test_y,
                                          colorize = FALSE, label = "GBM")
## Preparation of a new explainer is initiated
## -> model label : GBM
## -> data : 524 rows 17 cols
## -> target variable : 524 values
## -> model_info : package reticulate , ver. 1.14 , task classification ( default )
## -> predict function : yhat.scikitlearn_model will be used ( default )
## -> predicted values : numerical, min = 0.02086126 , mean = 0.288584 , max = 0.9119996
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.8669431 , mean = 0.02248468 , max = 0.9791387
## A new explainer has been created!
explainer_keras_1 <- explain_keras(pkl_keras, data = pima_indians_diabets_X, y = pima_indians_diabets_y,
                                   colorize = FALSE)
## Preparation of a new explainer is initiated
## -> model label : keras ( default )
## -> data : 767 rows 8 cols
## -> target variable : 767 values
## -> model_info : package reticulate , ver. 1.14 , task classification ( default )
## -> predict function : yhat.keras will be used ( default )
## -> predicted values : numerical, min = 0.0006659329 , mean = 0.4633349 , max = 0.9935235
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.9620994 , mean = -0.1152253 , max = 0.9141646
## A new explainer has been created!
explainer_SGDC <- explain_scikitlearn(pkl_SGDC, data = titanic_test_X, y = titanic_test_y,
                                      colorize = FALSE, label = "SGDC")
## Preparation of a new explainer is initiated
## -> model label : SGDC
## -> data : 524 rows 17 cols
## -> target variable : 524 values
## -> model_info : package reticulate , ver. 1.14 , task classification ( default )
## -> predict function : yhat.scikitlearn_model will be used ( default )
## -> predicted values : numerical, min = 0 , mean = 0.1481639 , max = 0.8470998
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.8418441 , mean = 0.1629048 , max = 1
## A new explainer has been created!
Keep in mind that scikit-learn models do not store training data, so it is required to pass test data to create an explainer.
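Before plotting, a quick sanity check: a DALEX explainer can be queried with predict(), which dispatches to the stored predict_function. A minimal sketch using the objects created above:
# Predicted survival scores for the first five test observations;
# these should fall within the "predicted values" range reported above
predict(explainer_scikit_1, titanic_test_X[1:5, ])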
library(DALEX)
library(iBreakDown)
plot(model_performance(explainer_scikit_1))
plot(model_performance(explainer_scikit_1), model_performance(explainer_SGDC))
plot(model_performance(explainer_keras_1))
plot(break_down(explainer_scikit_1, new_observation = explainer_scikit_1$data[1,]))
plot(break_down(explainer_keras_1, new_observation = explainer_keras_1$data[1,]))
As we can see, our explainers work perfectly fine and can be used by any package from the DrWhy.ai universe.
Sometimes we compute things inside a virtual environment; that's why the explain_scikitlearn()
function allows users to specify which Python environment should be used. condaenv
and env
are mutually exclusive arguments. The first is a string naming a conda virtual environment; we pass the name of the virtual env. The second is a path to a (non-conda) virtual environment. The latter may cause problems when the OS is Windows, because reticulate
does not support that way of specifying an env on Windows.
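For instance, either of the following could be used (a sketch; "myenv" and the path are placeholders for your own environment):
# Conda environment, referenced by name
explainer <- explain_scikitlearn(pkl_gbm, condaenv = "myenv",
                                 data = titanic_test_X, y = titanic_test_y)
# Non-conda virtualenv, referenced by path (not supported on Windows)
explainer <- explain_scikitlearn(pkl_gbm, env = "/home/user/venvs/myenv",
                                 data = titanic_test_X, y = titanic_test_y)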
library(ingredients)
plotD3(feature_importance(explainer_SGDC))
plotD3(feature_importance(explainer_scikit_1))
plot(partial_dependency(explainer_keras_1, variables = "X33.6"))
plot(partial_dependency(explainer_scikit_1, variables = "age"),
     partial_dependency(explainer_SGDC, variables = "age"))
Python, along with its libraries, is version sensitive. That's why we need a tool that lets us recreate a suitable launch environment for the model we have been given and want to explain. Unfortunately, due to system limitations, this use case works slightly differently on Windows and on Unix-like OSes.
The first requirement is that Anaconda has to be on the system PATH; otherwise Python libraries will not download due to SSL verification problems. Once that is taken care of, creating an explainer is straightforward: just specify a path to the .yml file. Do not worry about the virtual env name; it is defined in the header of the .yml file. The condaenv
argument will be ignored when yml
is specified and the OS is Windows.
explainer_scikit_2 <- explain_scikitlearn(pkl_gbm, yml = yml, data = titanic_test_X,
                                          y = titanic_test_y, colorize = FALSE)
explainer_keras_2 <- explain_keras(pkl_keras, condaenv = "myenv", data = pima_indians_diabets_X,
                                   y = pima_indians_diabets_y, colorize = FALSE)
On Unix it is recommended to pass the path where Anaconda may be found (e.g. ./anaconda3) along with the yml file path; the condaenv
argument should be used for this purpose. Therefore, when the yml
argument is specified and the OS is Unix, condaenv
means the path to conda. If the user leaves condaenv
as NULL, DALEXtra will try to find Anaconda on its own.
explainer_scikit_3 <- explain_scikitlearn(pkl_gbm, yml = yml, condaenv = "/home/user/anaconda", data = titanic_test_X, y = titanic_test_y)
This is probably the most common problem a user can encounter while using DALEXtra
. The error is thrown when packages specified in the .yml file are not available from the channels you have specified. There are a few ways of fixing it.
When a .yml file is exported, conda saves not only the main version of each package (e.g. 1.16.4) but the OS-specific build string as well (e.g. py36h19fb1c0_0). Because of this, a lot of problems may occur when you try to rebuild the env on a different machine. The easiest way to fix it is to remove that build-string suffix.
R/win-library/3.6/DALEXtra/extdata/scikitlearn.yml
name: myenv
channels:
- defaults
dependencies:
- blas=1.0=mkl
- certifi=2019.6.16=py36_0
- icc_rt=2019.0.0=h0cc432a_1
- intel-openmp=2019.4=245
- joblib=0.13.2=py36_0
- mkl=2019.4=245
- mkl-service=2.0.2=py36he774522_0
- mkl_fft=1.0.12=py36h14836fe_0
- mkl_random=1.0.2=py36h343c172_0
- numpy=1.16.4=py36h19fb1c0_0
- numpy-base=1.16.4=py36hc3f5095_0
- pip=19.1.1=py36_0
- python=3.6.8=h9f7ef89_7
- scikit-learn=0.21.2=py36h6288b17_0
- scipy=1.2.1=py36h29ff71c_0
- setuptools=41.0.1=py36_0
- six=1.12.0=py36_0
- sqlite=3.28.0=he774522_0
- vc=14.1=h0510ff6_4
- vs2015_runtime=14.15.26706=h3a45250_4
- wheel=0.33.4=py36_0
- wincertstore=0.2=py36h7fe50ca_0
- pandas=0.24.2
- keras=2.2.4
R/win-library/3.6/DALEXtra/extdata/scikitlearn_unix.yml
name: myenv
channels:
- defaults
dependencies:
- blas=1.0=mkl
- certifi=2019.6.16
- intel-openmp
- joblib=0.13.2
- mkl=2019.4
- mkl-service=2.0.2
- mkl_fft=1.0.12
- mkl_random=1.0.2
- numpy=1.16.4
- numpy-base=1.16.4
- pip=19.1.1
- python=3.6.8
- scikit-learn=0.21.2
- scipy=1.2.1
- setuptools=41.0.1
- six=1.12.0
- sqlite=3.28.0
- wheel=0.33.4
- pandas=0.24.2
- keras=2.2.4
As you can see, not only the specific build strings but also some packages disappeared. This is because some libraries are necessary only on Windows and some only on Unix. To determine which packages you should remove, look at the error message: it lists the problematic packages.
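If you prefer to automate the build-string removal, here is a minimal R sketch (the file names are placeholders, and whether to keep entries such as blas=1.0=mkl is a judgment call you may still need to make by hand):
# Read the exported environment file
lines <- readLines("environment.yml")
# Find dependency lines that carry a build string (two '=' signs)
deps <- grepl("^\\s*- .+=.+=", lines)
# Keep "package=version" and drop the trailing "=build" part
lines[deps] <- sub("^(\\s*- [^=]+=[^=]+)=.*$", "\\1", lines[deps])
writeLines(lines, "environment_portable.yml")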
Conda has a lot of repositories where packages are stored, and some outdated packages are moved to older repositories. To get them, just add the corresponding links in the .yml header.
name: deeplearning
channels:
- defaults
- https://repo.continuum.io/pkgs/free/win-64/
- conda-forge
- ericmjl
dependencies:
- python=3.6
- matplotlib=2.0.2
- jupyter=1.0.0
- numpy=1.13.1
- seaborn=0.8
- pymc3=3.1
- pandas=0.20.3
- scipy=0.19.1
- biopython=1.69
If nothing above works, there is also a way to use a pip section. Keep in mind that it is not a recommended solution, but sometimes it is the last thing you can do.
name: PATS
channels:
- conda-forge
dependencies:
- python=3
- vtk>=8.2.0
- numpy
- matplotlib
- scipy
- pip:
- pynrrd
- pyqt5
- pydicom
- colorama
- nibabel
- matplotlib
- imageio
- pywt
- polarTransform
- pyqt5ac
If you provide a .yml file whose header contains a name identical to that of an environment that already exists, the existing environment will be activated without being modified.
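You can check from R whether an environment with a given name already exists; a small sketch using reticulate ("myenv" is a placeholder):
# conda_list() returns a data frame with the names and python paths of known envs
envs <- reticulate::conda_list()
"myenv" %in% envs$name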
You have two ways of solving that issue, both connected with the Anaconda prompt. The first is removing the conda env with the command:
conda env remove --name myenv
and executing the function once again. The second is updating the env via:
conda env update -f environment.yml
Keep in mind that a conda env may be removed from R as well:
reticulate::conda_remove("name_of_the_env")
Creating environments is a very delicate use case, so when nothing above works, try to create the environment yourself via the conda prompt or using the reticulate
library.
Using the conda prompt:
conda create -n name_of_env python=3.4
conda install -n name_of_env name_of_package=0.20
Using reticulate:
reticulate::conda_install(envname = "myenv", packages = c("name_of_package"))
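The environment itself can also be created from R; a sketch with placeholder names and versions:
# Create a fresh conda env with a pinned Python, then install packages into it
reticulate::conda_create(envname = "myenv", packages = "python=3.6")
reticulate::conda_install(envname = "myenv", packages = c("scikit-learn=0.21.2"))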