I'm trying to understand what prior probability shift (label drift) in data means.
If I understand it correctly, it means that the distribution of labels in the training dataset differs from the distribution of labels in the production environment, while the class-conditional distribution of the features stays the same. This difference causes an ML model trained on such data and deployed to production to make poor predictions.
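In symbols, my understanding of prior shift is (writing $P_{\text{train}}$ and $P_{\text{prod}}$ for the training and production distributions):

$$P_{\text{train}}(y) \neq P_{\text{prod}}(y), \qquad P_{\text{train}}(x \mid y) = P_{\text{prod}}(x \mid y)$$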
That makes sense.
But then I remembered that one of the techniques for training an ML model on an imbalanced dataset is oversampling the minority class or undersampling the majority class (i.e., changing the label distribution in the training dataset). But these techniques cause the distribution of labels in the training dataset to differ from the one in the production environment (which stays imbalanced). That sounds exactly like a label drift setting!
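To make the tension concrete, here is a minimal sketch (the 1:9 class ratio and the plain-Python resampling are just for illustration) of how oversampling moves the training label prior away from the production one:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical production-like labels: 10% positives (1), 90% negatives (0)
labels = [1] * 100 + [0] * 900
print(Counter(labels), "P(y=1) =", sum(labels) / len(labels))  # P(y=1) = 0.1

# Oversample the minority class until the two classes are balanced
minority = [y for y in labels if y == 1]
train = labels + random.choices(minority, k=800)
print(Counter(train), "P(y=1) =", sum(train) / len(train))  # P(y=1) = 0.5
```

So after oversampling, the model is trained with P(y=1) = 0.5 even though production still has P(y=1) = 0.1, which looks like exactly the mismatch I described above.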
So is my understanding of prior shift in data wrong (most probably yes)?
Are undersampling/oversampling techniques for imbalanced datasets flawed (I don't think so)?
Am I missing something else?
Thank you for the explanation.
Tomas