
When I use SMOTE-NC to oversample the three minority classes of a 4-class classification problem, the precision, recall, and F1 metrics for the minority classes are still very low (~3%). My dataset has 32 categorical and 30 continuous variables; all categorical variables have been converted to binary columns using one-hot encoding. Also, before oversampling, I impute all missing values with IterativeImputer.

For classifiers, I am using logistic regression, random forest, and XGBoost. May I have your thoughts on this? Any suggestions for oversampling a multiclass, highly imbalanced dataset?
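For reference, here is a minimal sketch of the setup described above, run on made-up toy data. The column layout (30 continuous columns followed by 32 categorical ones), the class proportions, and the `sampling_strategy` choice are all assumptions for illustration; note also that SMOTENC's `categorical_features` expects the categorical columns themselves, so this sketch keeps them as integer codes rather than one-hot dummies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from imblearn.over_sampling import SMOTENC

# Toy stand-in for the dataset: 30 continuous features from make_classification
# plus 32 integer-coded categorical columns, 4 classes, heavy imbalance.
X_num, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                               n_classes=4, weights=[0.85, 0.07, 0.05, 0.03],
                               random_state=0)
rng = np.random.default_rng(0)
X_cat = rng.integers(0, 5, size=(1000, 32)).astype(float)
X = np.hstack([X_num, X_cat])
X[:, :30][rng.random((1000, 30)) < 0.05] = np.nan  # punch in missing values

cat_idx = list(range(30, 62))  # positions of the categorical columns

# 1. Impute missing values first; SMOTE-NC cannot handle NaNs.
X_imp = IterativeImputer(random_state=0).fit_transform(X)
# Imputed categorical cells may be non-integers; snap them back to codes.
X_imp[:, cat_idx] = np.round(X_imp[:, cat_idx])

# 2. Oversample every class except the majority one.
smote_nc = SMOTENC(categorical_features=cat_idx,
                   sampling_strategy="not majority", random_state=0)
X_res, y_res = smote_nc.fit_resample(X_imp, y)
print(np.bincount(y_res))  # all classes resampled up to the majority count
```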

  • First of all, [one-hot encoding is generally not recommended for tree-based methods](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769); I would use OrdinalEncoder from sklearn instead. Secondly, what is your class distribution (what % of your data does each class make up)? As @georg-un pointed out, scaling weights can sometimes help. What are you setting `class_weight` to? – jared3412341 Oct 03 '20 at 15:07
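To make the comment's first suggestion concrete, here is a small sketch of integer-coding categoricals with `OrdinalEncoder` instead of one-hot encoding; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical columns; tree models split on the integer codes.
df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                   "size": ["S", "M", "L", "M"]})

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
df[["color", "size"]] = enc.fit_transform(df[["color", "size"]])
print(df)  # each category replaced by a single integer code per column
```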

1 Answer


Before going through the process of oversampling, always check whether the implementation of your algorithm supports assigning different weights to individual classes. sklearn's RandomForestClassifier, for example, has a class_weight parameter that does exactly this; LogisticRegression accepts the same argument. I have found this method to work better than over- or undersampling.
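A minimal sketch of this on toy data (the explicit per-class weights in the second call are made-up illustrations, not recommended values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_classes=4, n_informative=8,
                           weights=[0.85, 0.07, 0.05, 0.03], random_state=0)

# "balanced" weights each class by n_samples / (n_classes * class_count).
clf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)

# Explicit per-class weights also work, e.g. upweighting the rarest classes:
clf = RandomForestClassifier(class_weight={0: 1.0, 1: 5.0, 2: 8.0, 3: 12.0},
                             random_state=0).fit(X, y)
```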

Also, I have to add an obligatory caveat: if your minority classes have so few samples that the characteristics of the respective classes are not well captured, there is little you can do except collect more data.

  • Thank you for your suggestion; I tried the class_weight parameter, but there was no improvement. The final AUC numbers are high enough (~90%) to say the models have discriminative power, but there is no point in a high AUC when precision and recall are very low. – Sarah Aug 10 '19 at 17:02
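For completeness, both views of performance can be checked side by side; here is a toy sketch (made-up data) printing a one-vs-rest macro AUC next to the per-class precision/recall report:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           weights=[0.9, 0.05, 0.03, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# One-vs-rest macro AUC is computed from predicted probabilities...
print("macro AUC:", roc_auc_score(y_te, clf.predict_proba(X_te),
                                  multi_class="ovr"))
# ...while per-class precision/recall come from hard predictions.
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```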