Let's suppose that my dataset in a classification problem looks like this:
- class A: 50000 observations
- class B: 2000 observations
- class C: 800 observations
- class D: 200 observations
Here are the methods I have considered for dealing with this imbalanced dataset:

1. Oversampling: I reject this straight away because it usually makes the model overfit the minority classes by a lot.
2. Running the classifier on the data as-is: the model would end up over-predicting class A, so I reject this too.
3. Undersampling: reduce class A to, say, 4,000 documents. In my tests this has given the best results so far, but I am losing quite a lot of information this way (a sketch of what I mean follows this list).
4. Multiple undersampled classifiers: build several classifiers, each trained on a different sample of 4,000 class A documents, and combine their predictions. This might recover some of the information lost in (3), although I suspect it resembles the oversampling approach I already rejected (see the second sketch below).
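To make method (3) concrete, here is a minimal sketch of the undersampling step, assuming scikit-learn and pandas are available and the data sits in a DataFrame with a `label` column (the function and column names are just my placeholders):

```python
import pandas as pd
from sklearn.utils import resample

def undersample_majority(df, label_col="label", majority="A", n_keep=4000, seed=0):
    """Keep a random subset of the majority class and all minority rows."""
    majority_rows = df[df[label_col] == majority]
    minority_rows = df[df[label_col] != majority]
    kept = resample(majority_rows, replace=False, n_samples=n_keep, random_state=seed)
    # Shuffle so the class A rows are not all grouped together
    return pd.concat([kept, minority_rows]).sample(frac=1, random_state=seed)
```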
What do you think of method (4) compared to method (3)?
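And this is roughly what I have in mind for method (4): one classifier per independently drawn 4,000-document sample of class A, combined by majority vote. Again a sketch under the same assumptions, reusing the `undersample_majority` helper above; `LogisticRegression` is just a stand-in base model:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def fit_undersampled_ensemble(df, feature_cols, label_col="label", n_models=10, base=None):
    """Fit one model per independently undersampled copy of the training set."""
    base = base if base is not None else LogisticRegression(max_iter=1000)
    models = []
    for seed in range(n_models):
        # Each seed draws a different 4,000-document sample of class A
        sample = undersample_majority(df, label_col=label_col, seed=seed)
        models.append(clone(base).fit(sample[feature_cols], sample[label_col]))
    return models

def predict_majority_vote(models, X):
    """Predict with each model, then take the most common label per row."""
    preds = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_rows)
    vote = []
    for col in preds.T:
        labels, counts = np.unique(col, return_counts=True)
        vote.append(labels[counts.argmax()])
    return np.array(vote)
```

As far as I can tell, imbalanced-learn's `BalancedBaggingClassifier` packages essentially this idea, so that might be worth comparing against.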