
I have a heavily imbalanced dataset in a classification problem. I am trying to plot the calibration curve using the sklearn.calibration package. Specifically, I try the following models:

import xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

rft = RandomForestClassifier(n_estimators=1000)
svc = SVC(probability=True, gamma="auto")
gnb = MultinomialNB(alpha=0.5)
xgb = xgboost.XGBClassifier(n_estimators=1000, learning_rate=0.08, gamma=0, subsample=0.75, colsample_bytree=1, max_depth=7)
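
For reference, a minimal sketch of how such a plot can be produced with sklearn.calibration.calibration_curve; the held-out set names X_test and y_test are assumptions, and the models are assumed to be already fitted:

from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Assumes the four models above have already been fit on training data,
# and that X_test / y_test are a held-out evaluation set (hypothetical names).
plt.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for name, model in [("Random Forest", rft), ("SVC", svc),
                    ("Naive Bayes", gnb), ("XGBoost", xgb)]:
    proba = model.predict_proba(X_test)[:, 1]  # positive-class probabilities
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
    plt.plot(mean_pred, frac_pos, "s-", label=name)
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()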

The resulting calibration curves are the following:

[Calibration plot: fraction of positives vs. mean predicted probability for the four models]

As you can see in the plot, Random Forest and XGBoost stay close to the perfectly calibrated line. However, Naive Bayes and SVM perform terribly.

How can I explain/describe the behaviour of those two models?

Tasos
  • Do the NB and SVM just have bad performance? In the NB bins in particular, the true response rate never rises above 20%; the tree models, by contrast, seem to reach some strong conclusions. The SVM never predicts anything above 50%. SVC with `probability=True` does Platt scaling under the hood, so it's surprising that it isn't better calibrated. One last thing to look into: sklearn's calibration plots use bins of fixed width (10% in your plot); for your unbalanced dataset, it might be worth doing one by hand with bins holding equal numbers of samples (see the sketch after these comments)? – Ben Reiniger Apr 25 '19 at 14:11
  • Could you also add the score of each classifier, so we can tell whether it is a problem of bad classification rather than bad calibration? – Samos Sep 17 '19 at 16:26
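
Regarding the equal-count bins suggested in the first comment: a minimal sketch of a hand-rolled quantile-binned calibration curve. The function name quantile_calibration_curve is hypothetical, and y_true / y_prob are assumed to be 1-D NumPy arrays of true labels and predicted positive-class probabilities. Newer scikit-learn releases also offer this directly via calibration_curve(..., strategy="quantile").

import numpy as np

def quantile_calibration_curve(y_true, y_prob, n_bins=10):
    # Place bin edges at quantiles of the predicted probabilities, so every
    # bin holds roughly the same number of samples instead of a fixed 10%
    # width; on an imbalanced dataset most predictions pile up near 0,
    # leaving fixed-width bins nearly empty.
    edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    bins = np.searchsorted(edges[1:-1], y_prob, side="right")
    # Ties in y_prob can leave a bin empty, yielding NaN for that bin.
    frac_pos = np.array([y_true[bins == b].mean() for b in range(n_bins)])
    mean_pred = np.array([y_prob[bins == b].mean() for b in range(n_bins)])
    return frac_pos, mean_pred

Plotting frac_pos against mean_pred then mirrors the sklearn plot above, but with equally populated bins.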

0 Answers