How should I construct a binary classifier from a small set of positive data and millions of unlabeled data points?

Question

Does anyone have suggestions for a specific algorithm or implementation for the case where the labeled data comes from only one class and the unlabeled data can come from either class? I don't know the proportion of Class A to Class B within the unlabeled data, and my labeled data is not randomly chosen.

"my labeled data is not randomly chosen." Can you explain this further? — Bert Kellerman, May 28 '21 at 22:54
@BertKellerman I mean I didn't label the data myself. I'm using a well-known source that has labels for only one class. — Deli, May 28 '21 at 23:26
@BertKellerman You can ignore that part. I think I should use a one-class classifier, but I'm not sure it's an appropriate method for my case, where I don't know the proportion of Class A to Class B within the unlabeled data. — Deli, May 28 '21 at 23:28

Bert Kellerman · Answer 1 · 2021-05-29T14:37:03.607

This is called PU (positive-unlabeled) learning. It applies when you use a probabilistic classifier and certain assumptions about how the data was labeled are met.

If the assumptions are met, you:

1. Label the already-labeled positive instances as positive.
2. Label the unlabeled instances as negative.
3. Train a probabilistic classifier.
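
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the data is synthetic (two Gaussian blobs I made up), scikit-learn's `LogisticRegression` stands in for "a probabilistic classifier", and the labeled positives are drawn uniformly at random from the true positives, which may not hold for your source.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): two Gaussian blobs in 2-D.
n_pos, n_neg = 2000, 8000
X = np.vstack([
    rng.normal(loc=2.0, scale=1.0, size=(n_pos, 2)),   # true positives
    rng.normal(loc=-1.0, scale=1.0, size=(n_neg, 2)),  # true negatives
])
y_true = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])

# Assumption: only a uniformly random 20% of the true positives are labeled (s = 1).
# Everything else, positive or negative, is unlabeled (s = 0).
s = np.zeros(len(y_true), dtype=int)
labeled_idx = rng.choice(np.flatnonzero(y_true == 1), size=n_pos // 5, replace=False)
s[labeled_idx] = 1

# Labeled -> positive, unlabeled -> negative, then fit a probabilistic classifier.
clf = LogisticRegression(max_iter=1000).fit(X, s)
scores = clf.predict_proba(X)[:, 1]  # estimates P(labeled | x), not P(positive | x)

# Check the ranking against the true labels. This is possible here only
# because the data is synthetic and the true labels are known.
unlabeled = s == 0
auc_unlabeled = roc_auc_score(y_true[unlabeled], scores[unlabeled])
print(f"AUC on unlabeled points vs. true labels: {auc_unlabeled:.3f}")
```

Because the model is fit to "labeled vs. unlabeled", its outputs are best used for ranking the unlabeled points; a 0.5 threshold on them is not meaningful without further calibration.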

Training this way produces the same ranking of instances by predicted probability as a classifier trained on a dataset with true positive/negative labels.
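
A short sketch of why the ranking carries over, assuming the "selected completely at random" condition from the Elkan paper (the labeled positives are a uniform random sample of all positives). The symbols are notation I'm introducing here: $s = 1$ means an instance is labeled, $y = 1$ means it is truly positive, and $c$ is the fraction of positives that got labeled. Since only positives are ever labeled:

$$
p(s = 1 \mid x) = p(y = 1 \mid x)\, p(s = 1 \mid y = 1, x) = c \cdot p(y = 1 \mid x), \qquad c = p(s = 1 \mid y = 1)
$$

Because $c$ is a positive constant that does not depend on $x$, sorting instances by $p(s = 1 \mid x)$ gives the same order as sorting by $p(y = 1 \mid x)$. If the labeled positives are not a random sample of the positives, $c$ depends on $x$ and this argument no longer holds.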

This video covers the assumptions pretty well and the Elkan paper is pretty accessible.
