8

I have a pandas DataFrame X. I would like to get prediction explanations for a particular model.

My model is given below:

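from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
        ('imputer', imputer_function()),
        ('classifier', RandomForestClassifier())
    ])
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
y_pred = pipeline.fit(x_train, y_train).predict(x_test)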

For the prediction explainer, I use KernelExplainer from SHAP.

The code is as follows:

# use Kernel SHAP to explain test set predictions
shap.initjs()

explainer = shap.KernelExplainer(pipeline.predict_proba, x_train, link="logit")

shap_values = explainer.shap_values(x_test, nsamples=10)

# plot the SHAP values for the Setosa output of the first instance
shap.force_plot(explainer.expected_value[0], shap_values[0][0,:], x_test.iloc[0,:], link="logit")

When I run the code, I get the error:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames.

Provided model function fails when applied to the provided data set.

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Can anyone please help me? I'm really stuck with this. Both x_train and x_test are pandas data frames.

Ethan
Nayana Madhu

2 Answers

11

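The reason is that Kernel SHAP passes the data to the model as a NumPy array, which has no column names, so the pipeline can no longer select columns by name. The fix is to wrap the prediction function so that it rebuilds a DataFrame first: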

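# feature_names: the training column names, e.g. list(x_train.columns)
# estimator: the fitted model (presumably the pipeline from the question)
def model_predict(data_asarray):
    data_asframe = pd.DataFrame(data_asarray, columns=feature_names)
    return estimator.predict(data_asframe)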

Then,

shap_kernel_explainer = shap.KernelExplainer(model_predict, x_train, link='logit')
shap_values_single = shap_kernel_explainer.shap_values(x_test.iloc[0,:])
shap.force_plot(shap_kernel_explainer.expected_value[0],np.array(shap_values_single[0]), x_test.iloc[0,:],link='logit')
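For readers who want to try the wrapper idea in isolation, here is a minimal sketch. It rests on assumptions that are not in the question: the toy data, the column names `a`, `b`, `c`, and a `ColumnTransformer` standing in for `imputer_function()` (a transformer that selects columns by name is one common source of this error). The hypothetical helper `predict_proba_from_array` returns probabilities rather than labels, since probabilities are what a `logit` link expects.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy data standing in for X and y (hypothetical column names).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
X.iloc[::7, 1] = np.nan  # a few missing values for the imputer to fill
y = (X["a"] > 0).astype(int)

# Assumed stand-in for imputer_function(): it selects columns by name,
# so it needs a DataFrame rather than a bare array.
imputer = ColumnTransformer([("impute", SimpleImputer(), ["a", "b", "c"])])
pipeline = Pipeline(steps=[
    ("imputer", imputer),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X, y)

feature_names = list(X.columns)

def predict_proba_from_array(data_asarray):
    """Rebuild a DataFrame with column names, then call the pipeline."""
    data_asarray = np.asarray(data_asarray)
    if data_asarray.ndim == 1:
        # A single row arrives as a 1-D array; make it 2-D.
        data_asarray = data_asarray.reshape(1, -1)
    data_asframe = pd.DataFrame(data_asarray, columns=feature_names)
    return pipeline.predict_proba(data_asframe)

# The wrapper accepts a plain NumPy array, which is what Kernel SHAP sends.
probs = predict_proba_from_array(X.to_numpy()[:5])
print(probs.shape)
```

A function like this can then be passed to `shap.KernelExplainer` in place of `pipeline.predict_proba`. Because the wrapper calls the whole pipeline, the imputation step also runs on every array SHAP sends.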
Nayana Madhu
  • `shap_values_single = shap_kernel_explainer.shap_values(x_test.iloc[0,:])` fails with `ValueError: Input contains NaN, infinity or a value too large for dtype('float64').` I believe this is because the test set is not being preprocessed in your code sample. Do you know how to fix this issue? – Josh Zwiebel Mar 01 '22 at 15:47
  • This didn't work for me until I changed `estimator.predict` to `estimator.predict_proba`. – Josh Zwiebel Mar 01 '22 at 19:09
3

I tried to create a function as suggested, but it didn't work for my code. However, following an example on Kaggle, I found the solution below:

import shap

#load JS vis in the notebook
shap.initjs() 

#set the tree explainer as the model of the pipeline
explainer = shap.TreeExplainer(pipeline['classifier'])

#apply the preprocessing to x_test
observations = pipeline['imputer'].transform(x_test)

#get Shap values from preprocessed data
shap_values = explainer.shap_values(observations)

#plot the feature importance
shap.summary_plot(shap_values, x_test, plot_type="bar")
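The approach above relies on the fact that a pipeline's prediction is just the classifier applied to the preprocessed data, so explaining the classifier on the transformed rows explains the pipeline's predictions in terms of the imputed feature values. A small sketch of that equivalence follows; the toy data, the column names, and the use of `SimpleImputer` in place of `imputer_function()` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy data standing in for X and y (hypothetical column names).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(80, 3)), columns=["a", "b", "c"])
X.iloc[::5, 2] = np.nan  # missing values that the imputer will fill
y = (X["a"] + X["b"] > 0).astype(int)

pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # assumed imputer
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X, y)

# Step 1: run only the preprocessing step; the result is a NumPy array.
observations = pipeline["imputer"].transform(X)

# Step 2: put the column names back so plots can label the features.
observations = pd.DataFrame(observations, columns=X.columns, index=X.index)

# Step 3: the classifier on preprocessed rows matches the full pipeline.
direct = pipeline["classifier"].predict_proba(observations.to_numpy())
end_to_end = pipeline.predict_proba(X)
same = np.allclose(direct, end_to_end)
print(same)
```

One design note: passing the preprocessed `observations` frame (rather than the raw `x_test`) to a plot that displays feature values would show the imputed values the classifier actually saw.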
ntnq