
I have an unbalanced dataset with 11 classes, where one class is 30% and the rest are between 5-12%. I am not a hardcore programmer, so I am using the product from https://www.h2o.ai/. I used GBM and DRF with the option to balance the classes, and the training results are great (98-99% precision and recall) as per the confusion matrix. However, when I test on the validation set, the only class where I get decent accuracy is the 30% class; all the others have classification errors close to 100%. I am not sure what approach to take. The 11 classes are 11 market segments, and even an accuracy of ~70% for each class would be enough for my purposes.
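A quick sanity check before tuning anything is to compare the class distributions of the training and validation sets; if they drift apart, great training metrics will not transfer. A minimal pure-Python sketch (the class labels below are made up for illustration):

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: fraction} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical label lists, standing in for the real train/validation columns
train_labels = ["A"] * 30 + ["B"] * 12 + ["C"] * 8
valid_labels = ["A"] * 15 + ["B"] * 6 + ["C"] * 4

train_dist = class_distribution(train_labels)
valid_dist = class_distribution(valid_labels)

# Flag any class whose share differs by more than 5 points between the splits
drift = {cls: abs(train_dist.get(cls, 0) - valid_dist.get(cls, 0))
         for cls in set(train_dist) | set(valid_dist)}
suspicious = [cls for cls, d in drift.items() if d > 0.05]
```

If `suspicious` is non-empty for the real data, the validation set is not representative and no amount of class balancing during training will fix that.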

Edit 1: Additional info: on validation the model predicts almost every sample as the 30% class, which is why accuracy is close to 30%...similar to a credit-fraud detector gone wrong...
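This is the classic degenerate outcome: a model that always predicts the majority class scores exactly that class's prevalence in overall accuracy. A tiny illustration with made-up class shares (one 30% segment, ten segments sharing the rest):

```python
# Hypothetical class shares: one 30% class, ten classes at 7% each
class_shares = {"seg_0": 0.30}
class_shares.update({f"seg_{i}": 0.07 for i in range(1, 11)})

# A degenerate model that always predicts the majority class
majority = max(class_shares, key=class_shares.get)
baseline_accuracy = class_shares[majority]  # equals the majority class's prevalence
```

Any real model should be compared against this baseline per class, not just on overall accuracy.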

Update 1: So I tried 2 more approaches

1) Tried to turn it into a 2-class classification by labelling everything other than the 30% class as "OTHER", and the results were still poor
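The relabelling in (1) can be sketched like this (the segment names are illustrative, not from the actual data):

```python
def to_binary(labels, majority_class="seg_0"):
    """Collapse a multi-class label column into majority vs. OTHER."""
    return [lbl if lbl == majority_class else "OTHER" for lbl in labels]

binary = to_binary(["seg_0", "seg_3", "seg_7"])  # ["seg_0", "OTHER", "OTHER"]
```

If even this 30/70 binary problem scores poorly on validation, that points at the features (or a train/validation mismatch) rather than at the 11-way imbalance itself.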

2) I removed the 30% class and kept the remaining classes as is, then trained a GBM; the results are scarily accurate with an 85%-15% train/validation split. However, as soon as I do cross-validation, the classification is really poor.

Not sure what's going on...maybe I need to rethink my entire approach, redefine the problem, and come up with an entirely different hypothesis to begin with.
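One thing worth ruling out before redefining the problem: if the cross-validation folds are not stratified, a fold can end up with very few samples of a minority class, which would explain a good single split but poor CV. A minimal pure-Python sketch of a stratified split, so every class keeps its share in each partition (H2O's estimators also expose a stratified fold-assignment option, which would be the easier route inside the product):

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_frac=0.15, seed=0):
    """Split row indices into train/validation, preserving class shares."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lbl in enumerate(labels):
        by_class[lbl].append(idx)
    train_idx, valid_idx = [], []
    for lbl, idxs in by_class.items():
        rng.shuffle(idxs)
        cut = max(1, int(len(idxs) * valid_frac))  # at least one sample per class
        valid_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, valid_idx

# Hypothetical labels for illustration
labels = ["A"] * 60 + ["B"] * 25 + ["C"] * 15
train_idx, valid_idx = stratified_split(labels)
```

Repeating this per fold gives stratified cross-validation; if CV results are still poor after stratifying, the problem is more likely in the features than in the sampling.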

Brian Spiering
Swap
  • The first thing that comes to mind: have you checked that the class distributions in training and validation are the same? It is OK if they are unbalanced; in that case you may need to oversample the minority classes using one of the available methods, together with stratified sampling, but their distributions should still be similar or close. And you had better make sure what that "option to balance the classes" actually does! – TwinPenguins Aug 20 '18 at 18:40
  • @MajidMortazavi the validation sample distribution is close to the test sample distribution...not exact, but the proportions are similar and the 30% class is still close to 30% in the validation sample. As for the "option to balance classes" parameter, you can specify the ratio of under/over-sampling, so I tried both oversampling the other classes and undersampling the 30% class; both lead to the same result: great test accuracy, poor validation results. – Swap Aug 20 '18 at 20:01

0 Answers