I am currently working on a multi-class classification problem on test log data.
Basically, I have the context data from test executions saved, and I want to automate the analysis of the failed tests. The two targets I have are the page responsible for the error (which corresponds to the team in charge) and the type of error.
So, for this purpose, I took the data available at a time t and worked on it locally to get the best results possible.
I trained a few models, tuned their hyper-parameters with grid search, and got good results.
The F1 scores are above 90% for the page target and above 80% for the type target.
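For context, here is a minimal sketch of the kind of tuning I ran, assuming a scikit-learn-style setup (the stand-in data, the RandomForestClassifier and the parameter grid are placeholders for my actual features, models and grids):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data: in my case X is the encoded test-execution context
# and y is the "page" target (same idea for the "type" target).
X, y = make_classification(n_samples=2162, n_features=40, n_informative=20,
                           n_classes=11, random_state=0)

# Placeholder estimator and grid: I tuned several models this way,
# each with its own grid.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",   # F1 averaged over the (imbalanced) classes
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```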
These results seem good, so I started working on ensembling methods to get even better results by combining my different models.
The following figures show the F1 scores and variances of the different models for the page target.
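To show the direction I'm exploring, here is a minimal sketch of the two ensembling options I'm considering, soft voting and stacking, again assuming scikit-learn (the base estimators are placeholders for my already-tuned models):

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

# Placeholder base models: in practice these are my tuned estimators.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Option 1: soft voting averages the predicted class probabilities.
voting = VotingClassifier(estimators=base_models, voting="soft")

# Option 2: stacking trains a meta-model on the base models' predictions.
stacking = StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression(max_iter=1000))

# Both are fitted and scored like any single estimator,
# e.g. with cross_val_score and scoring="f1_macro".
```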

But while I was working on ensembling, I got the idea of fetching the latest test data (remember, I had been working with the same snapshot from time t).
I went from having only 2162 samples to 4367, so a bit more than double the amount of data.
This gives my data a different distribution from before (the following figures show the number of samples per class for the page target).
On the first figure (old data), we can see 11 highly imbalanced classes.
On the second figure (new data), we now have 14 classes, also heavily imbalanced.
(The same differences with the new data also hold for the type target.) As a result, my models lost between 6 and 10% in F1 score.
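For completeness, this is roughly how I compare the class distributions of the two extracts, assuming pandas DataFrames (the column name "page" and the tiny example values are just placeholders):

```python
import pandas as pd

# old_df / new_df stand for the snapshot at time t and the refreshed extract.
# Here they are tiny made-up examples just to show the comparison.
old_df = pd.DataFrame({"page": ["login", "checkout", "login", "search"]})
new_df = pd.DataFrame({"page": ["login", "checkout", "profile", "profile", "search"]})

old_counts = old_df["page"].value_counts(normalize=True)
new_counts = new_df["page"].value_counts(normalize=True)

# Side-by-side share of each class; NaN means the class only exists in one extract.
comparison = pd.concat([old_counts, new_counts], axis=1, keys=["old", "new"])
print(comparison)
```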
So, my questions are the following:
- How do I handle the data in this case so that the models keep working well on new data even when the distribution changes? Will I need to monitor how the scores evolve and redo the hyper-parameter optimization when they drop too low?
- For now, I am working with each target independently. Should I try predicting both targets jointly (multi-output / multi-label style)? If so, do you have any recommendations on algorithms or ways to go about it? (A rough sketch of what I have in mind is just after this list.)
- Any suggestions on how to improve this work, different models to try, or ensembling methods would be greatly appreciated ☺
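Regarding the second question, this is the kind of joint setup I have in mind, sketched with scikit-learn's MultiOutputClassifier (the stand-in data and the base estimator are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Stand-in feature matrix and the two targets stacked as columns (page, type).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y_page = rng.integers(0, 11, size=200)   # 11 page classes
y_type = rng.integers(0, 5, size=200)    # some number of error types
Y = np.column_stack([y_page, y_type])

# One estimator per target, sharing the same features.
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=200, random_state=0))
clf.fit(X, Y)
pred_page, pred_type = clf.predict(X[:5]).T
print(pred_page, pred_type)
```

As far as I know, some estimators (e.g. scikit-learn's tree-based models) also accept a 2-D y directly, without the wrapper.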
Thanks for reading this long post!

