Let's suppose that my dataset in a classification problem looks like this:
- class A: 50000 observations
- class B: 2000 observations
- class C: 800 observations
- class D: 200 observations
Here are the methods I have considered for dealing with this imbalanced dataset:

1. Oversampling: I reject this straight away because it usually makes the model overfit the minority classes by a lot.
2. Running the classifier on the data as-is: the model would end up over-predicting class A, so I reject this too.
3. Undersampling: reduce class A to, say, 4,000 documents. In my tests this has given the best results so far, but I am losing quite a lot of information this way (a sketch of what I mean follows this list).
4. Multiple undersampled classifiers: build several classifiers, each trained on a different sample of 4,000 class A documents, and combine their predictions. This might recover some of the information lost in (3), although I suspect it resembles the oversampling approach I already rejected (see the second sketch below).
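To make method (3) concrete, here is a minimal sketch of the undersampling step, assuming scikit-learn and pandas are available and the data sits in a DataFrame with a `label` column (the function and column names are just my placeholders):

```python
import pandas as pd
from sklearn.utils import resample

def undersample_majority(df, label_col="label", majority="A", n_keep=4000, seed=0):
    """Keep a random subset of the majority class and all minority rows."""
    majority_rows = df[df[label_col] == majority]
    minority_rows = df[df[label_col] != majority]
    kept = resample(majority_rows, replace=False, n_samples=n_keep, random_state=seed)
    # Shuffle so the class A rows are not all grouped together
    return pd.concat([kept, minority_rows]).sample(frac=1, random_state=seed)
```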
What do you think of method (4) compared to method (3)?
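And this is roughly what I have in mind for method (4): one classifier per independently drawn 4,000-document sample of class A, combined by majority vote. Again a sketch under the same assumptions, reusing the `undersample_majority` helper above; `LogisticRegression` is just a stand-in base model:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def fit_undersampled_ensemble(df, feature_cols, label_col="label", n_models=10, base=None):
    """Fit one model per independently undersampled copy of the training set."""
    base = base if base is not None else LogisticRegression(max_iter=1000)
    models = []
    for seed in range(n_models):
        # Each seed draws a different 4,000-document sample of class A
        sample = undersample_majority(df, label_col=label_col, seed=seed)
        models.append(clone(base).fit(sample[feature_cols], sample[label_col]))
    return models

def predict_majority_vote(models, X):
    """Predict with each model, then take the most common label per row."""
    preds = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_rows)
    vote = []
    for col in preds.T:
        labels, counts = np.unique(col, return_counts=True)
        vote.append(labels[counts.argmax()])
    return np.array(vote)
```

As far as I can tell, imbalanced-learn's `BalancedBaggingClassifier` packages essentially this idea, so that might be worth comparing against.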