
I’m working on a project where I try to distinguish bots from legitimate users on social media. The data I collected is unlabeled, but I have labeled about 17% of it (22k users) through different techniques. Finding bots was easy since they all resemble each other, but it's different for legit users.

In my labeled data, I have most if not all bots labeled, but I still have a ton of legit users to label, which is really hard without doing it manually (and even manually, it sucks).

From labeling users randomly and manually at the beginning, I found that this is a very imbalanced dataset (86/14 legit/bot). Since it was easier to spot bots than legit users during labeling, my labeled data is now closer to balanced (60/40).

One of the steps of the labeling process was to build a model to help me label the data, and it's pretty good today: 99% accuracy, 97% precision, and 98% recall.

For the rest of the data, I thought about running the model over the whole dataset and looking at the users whose predicted probability for the dominant class is below some threshold (70/80/90%). I can then review and manually label those, but this might take quite some time depending on the threshold I choose.
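For reference, the selection step I have in mind looks roughly like this (a minimal sketch with synthetic data and a logistic regression as a stand-in; the real model, features, and class balance are obviously different):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for my labeled (60/40) and unlabeled (~86/14) data.
X_labeled, y_labeled = make_classification(
    n_samples=1000, n_features=10, weights=[0.6, 0.4], random_state=0
)
X_unlabeled, _ = make_classification(
    n_samples=5000, n_features=10, weights=[0.86, 0.14], random_state=1
)

# Placeholder classifier; my actual model is different.
clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Confidence = predicted probability of the dominant class per user.
proba = clf.predict_proba(X_unlabeled)
confidence = proba.max(axis=1)

# Users below the threshold go into the manual-review queue.
threshold = 0.80  # try 0.70 / 0.80 / 0.90 and compare queue sizes
uncertain_idx = np.where(confidence < threshold)[0]
print(f"{len(uncertain_idx)} of {len(X_unlabeled)} users need manual review")
```

The threshold directly trades off queue size against label quality, which is why I'm unsure which value to pick.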

Any advice/help?

Marc
  • I think semi-supervised learning makes sense here, see [this question](https://datascience.stackexchange.com/q/74955/64377) and [that one](https://datascience.stackexchange.com/a/109337/64377). – Erwan Jun 04 '22 at 21:19
