
I'm building a binary text classifier; the ratio of positives to negatives is 1:100 (100 / 10,000).

By using back translation as an augmentation technique, I was able to generate 400 more positives. I then decided to upsample to balance the data. Should I upsample only the original positive data points (100), or should I also include the 400 that I generated?

I will definitely try both, but I wanted to know whether there is a rule of thumb for such a case.
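For anyone unfamiliar, back translation round-trips each example through a pivot language and keeps the paraphrase as a new example. A minimal sketch follows; the Hugging Face transformers pipeline and the Helsinki-NLP MarianMT models are one illustrative choice, not necessarily the setup used here:

```python
# Rough sketch of back translation: English -> German -> English.
# The pivot language and models are illustrative choices.
from transformers import pipeline

en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text: str) -> str:
    # The round trip through German yields a paraphrase of the input.
    german = en_to_de(text)[0]["translation_text"]
    return de_to_en(german)[0]["translation_text"]

positives = ["the service was outstanding", "I absolutely loved this product"]
augmented = [back_translate(t) for t in positives]  # new positive examples
```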

Thanks.

  • I would argue that oversampling is not needed. Why do you advocate for doing so? https://www.fharrell.com/post/classification/ https://www.fharrell.com/post/class-damage/ https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he – Dave Mar 30 '22 at 16:57

1 Answer


Class imbalance is mostly a problem when you have little data. With your ratio of 100:10,000, for your model to do well you should increase the number of records in the minority class. There is no rule of thumb here (see the No Free Lunch theorem in ML). Unfortunately, you will have to try three scenarios and see what works best for you (a rough sketch of each follows the list):

  1. Upsampling using only the actual data
  2. Upsampling by creating new synthetic data with techniques like SMOTE
  3. A combination of both: generate some synthetic data and also oversample.
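As a rough sketch of the three options, assuming (for illustration) a numeric feature matrix such as TF-IDF vectors; `resample` is from scikit-learn and `SMOTE` from the imbalanced-learn package:

```python
import numpy as np
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

rng = np.random.RandomState(42)

# Toy stand-ins for your data: 100 positives, 10,000 negatives.
X = rng.randn(10_100, 50)
y = np.array([1] * 100 + [0] * 10_000)

# 1. Upsample only the actual positives (sampling with replacement).
pos, neg = X[y == 1], X[y == 0]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=42)
X1 = np.vstack([pos_up, neg])
y1 = np.array([1] * len(pos_up) + [0] * len(neg))

# 2. Create synthetic positives with SMOTE until the classes are balanced.
X2, y2 = SMOTE(random_state=42).fit_resample(X, y)

# 3. Combination: first add synthetic positives up to a 1:10 ratio,
#    then randomly oversample the enlarged positive set the rest of the way.
X_mid, y_mid = SMOTE(sampling_strategy=0.1, random_state=42).fit_resample(X, y)
pos_mid, neg_mid = X_mid[y_mid == 1], X_mid[y_mid == 0]
pos_up2 = resample(pos_mid, replace=True, n_samples=len(neg_mid), random_state=42)
X3 = np.vstack([pos_up2, neg_mid])
y3 = np.array([1] * len(pos_up2) + [0] * len(neg_mid))
```

Note that SMOTE interpolates in feature space, so for text it is applied to the vectorized representation, not the raw strings.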
– Ashwiniku918