
If the training set is unbalanced, chances are the model will be biased. But if the data distribution in the test set is the same as in the training set, this kind of bias is not going to hurt validation accuracy. My question is whether this is the right thing to do. Isn't it cheating? What if we want to use the model in a commercial setting where we have no idea what the data distribution will be? In that case, what is the right thing to do?

asked by Marzi Heidari (edited by Carlos Mougan)

1 Answer


If the training set is unbalanced, chances are the model will be biased.

Not necessarily. It depends on the loss function you use. Also, note that data is usually not considered seriously imbalanced until the class proportion is on the order of 1/100.
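To illustrate the loss-function point, here is a minimal sketch. The library (scikit-learn) and the synthetic 99:1 toy dataset are my own assumptions, not part of the question; it just contrasts a plain log-loss with a class-weighted one on the same data:

    # Minimal sketch: the same model with and without a class-weighted loss.
    # Assumes scikit-learn; the dataset is a synthetic toy example with a ~99:1 class ratio.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import balanced_accuracy_score

    X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Plain log-loss: the majority class dominates the objective.
    plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Class-weighted log-loss: errors on the minority class cost more.
    weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

    print("plain   :", balanced_accuracy_score(y_test, plain.predict(X_test)))
    print("weighted:", balanced_accuracy_score(y_test, weighted.predict(X_test)))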

As for the rest of the questions:

ML is based on the hypothesis that train and test data look alike. Oversampling methods can help at training time, but in validation and test you should not oversample; validate with the real data.

Compute your evaluation metrics on the real test set, with its original distribution, and don't oversample there.
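A rough sketch of that workflow, assuming scikit-learn and imbalanced-learn, with a synthetic dataset standing in for your real data:

    # Sketch: oversample only the training split; keep the test set at the real distribution.
    # Assumes scikit-learn and imbalanced-learn; X, y stand in for your real data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import RandomOverSampler

    X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

    # Split first, so the test set keeps the real class distribution.
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Oversample the training split only.
    X_train_res, y_train_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

    model = RandomForestClassifier(random_state=0).fit(X_train_res, y_train_res)

    # Evaluate on the untouched test set, i.e. on the real distribution.
    print(classification_report(y_test, model.predict(X_test)))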

What if we want to use the model in a commercial setting where we have no idea what the data distribution will be?

If you have no idea what the distribution will be, you have a problem. The usual working hypothesis is that the future distribution resembles the current real distribution (last week, last month, one year ago, ...).

answered by Carlos Mougan
  • How does the loss function I use affect the bias? – Marzi Heidari Dec 09 '20 at 09:05
  • It depends on the algorithm; it's different for DL, decision trees, linear models, and so on. But the first step is using the right performance metric: https://towardsdatascience.com/how-to-deal-with-imbalanced-data-34ab7db9b100 – Carlos Mougan Dec 09 '20 at 09:27
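To make the point in the last comment concrete, here is a small self-contained sketch (assuming scikit-learn; the "always predict the majority class" model is just an illustration) of why plain accuracy misleads on imbalanced data while other metrics do not:

    # Sketch: metrics that stay informative under class imbalance, where plain accuracy misleads.
    # Assumes scikit-learn; the majority-class predictor below is only an illustration.
    import numpy as np
    from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

    y_true = np.array([0] * 99 + [1])   # 99:1 imbalance
    y_pred = np.zeros(100, dtype=int)   # a model that always predicts the majority class

    print("accuracy          :", accuracy_score(y_true, y_pred))           # 0.99, looks great
    print("balanced accuracy :", balanced_accuracy_score(y_true, y_pred))  # 0.5, reveals the problem
    print("F1 (minority)     :", f1_score(y_true, y_pred, zero_division=0))  # 0.0, reveals the problem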