
I have a dataset of about 300k records. The classes are highly imbalanced (one class may have 30k records while another has only 100). Unfortunately, about 5% of the records are incorrectly labeled.

Is there any way to find out which records are mislabeled, so that I can discard them?

severin

2 Answers


Yes! This could be an excellent test case for your classification algorithm. With only 5% mislabeling, a good algorithm should easily identify "outliers": its predictions for the mislabeled records will be much worse than for the rest.

If you can identify at least some known-correct records to build a training set, that would be even better, but with only 5% mislabeling it is not strictly necessary. This leads to the second part of the answer: while it might be better to remove or correct the mislabeled records, it might also not matter much.

This obviously assumes that the 5% of errors are roughly randomly distributed across the classes.

Finally, you did not mention any hints/data/info that could help identify mislabels. If you do have information about how those errors arise, pre-processing to identify and remove them based on analysis or rule generation would be best.
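As a minimal sketch of the idea above (the model, dataset sizes, and noise rate are all illustrative): score every record with out-of-fold cross-validated predictions, so that no record is judged by a model that saw it during training, and flag records where the prediction disagrees with the given label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced data, with 5% of labels flipped on purpose
# to stand in for the mislabeled records.
X, y = make_classification(n_samples=2000, n_classes=2,
                           weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)
flip = rng.choice(len(y), size=int(0.05 * len(y)), replace=False)
y_noisy = y.copy()
y_noisy[flip] = 1 - y_noisy[flip]

# Out-of-fold predictions: each record is scored by a model
# trained on the other folds only.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
pred = cross_val_predict(clf, X, y_noisy, cv=5)

# Records where the model disagrees with the given label are
# candidates for manual review or removal.
suspect = np.flatnonzero(pred != y_noisy)
print(f"{len(suspect)} suspect records out of {len(y_noisy)}")
```

Disagreement alone will also flag some genuinely hard-but-correct records, so treating the flagged set as "review candidates" rather than deleting it outright is usually safer.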

Fnguyen

A good method for identifying mislabeled data is Confident Learning (CL). It can use predictions from any trained classifier to automatically identify which data is incorrectly labeled. Since Confident Learning directly estimates which datapoints have label errors, it can also estimate the overall fraction of labels that were incorrect. This method has previously been used to discover numerous label errors in many major ML benchmark datasets.

Intuitively, a baseline solution could be flagging any example where the classifier's prediction differs from the given label. However, this baseline performs poorly when the classifier makes mistakes (typically inevitable in practice). Confident Learning also accounts for the classifier's confidence level in each prediction and its propensity to predict certain classes (e.g. some classifiers may over-predict class A due to a training shortcoming, especially in imbalanced settings like yours), in a theoretically principled way that ensures one can still identify most label errors even with an imperfect classifier.
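The thresholding idea can be sketched in a few lines of NumPy. This is a deliberately simplified version, not the full CL algorithm, and `pred_probs` is assumed to come from out-of-sample predictions of any classifier: each class gets a threshold equal to its average self-confidence, and a record is flagged when its probability for its given label falls below that class's threshold.

```python
import numpy as np

def flag_label_issues(labels, pred_probs):
    """Simplified confident-learning-style filter (illustrative only).

    labels: (n,) array of given class indices.
    pred_probs: (n, k) array of out-of-sample predicted probabilities.
    Returns indices of records whose confidence in their given label
    falls below that class's average self-confidence.
    """
    n, k = pred_probs.shape
    # Per-class threshold: mean predicted probability of class j
    # among records actually labeled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(k)])
    self_conf = pred_probs[np.arange(n), labels]
    return np.flatnonzero(self_conf < thresholds[labels])

# Toy example: record 2 is labeled 0, but the model is confident
# it belongs to class 1.
labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.2, 0.8],   # likely mislabeled
                       [0.1, 0.9],
                       [0.1, 0.9]])
print(flag_label_issues(labels, pred_probs))  # → [2]
```

Because the thresholds are per-class, a class the model systematically under-predicts gets a lower bar, which is what makes this family of methods more robust than the plain disagreement baseline in imbalanced settings.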

Here is a Python implementation of the CL algorithm that I helped develop, which is easy to run on most types of classification data (image, text, tabular, audio, etc., as well as binary, multi-class, or multi-label settings).