too much data to label

Question

I'm working on a data science project to flag bots on Instagram. I collected a lot of data (80k+ users) and now I have to label them as bot or legit users. I already flagged 20k users with different techniques, but now I feel like I'm going to have to flag the rest one by one, which will likely take months.

Can I just stop and say "I'm fine with what I have", or is this bad practice? Stopping now would also mean that the labeled data does NOT have the same distribution as the full dataset, because my labeling techniques were designed to find bots, not legit users.

What are my options?

score 4 · Answer 1 · answered Apr 01 '22 at 17:17

You could look into semi-supervised learning, which is useful for training models when you have both labeled and unlabeled data. Semi-supervised methods consider the distribution of unlabeled data to improve the performance of your model. The following picture should give you some intuition regarding how unlabeled data can be useful.

https://upload.wikimedia.org/wikipedia/commons/d/d0/Example_of_unlabeled_data_in_semisupervised_learning.png
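To make the idea concrete, here is a minimal sketch of self-training, one simple semi-supervised approach. Everything in it is an illustrative assumption rather than a recommended setup: the toy nearest-centroid classifier, the two-feature user vectors, the `threshold` value, and all function names are made up for the example. In practice you would more likely reach for a library implementation, such as scikit-learn's `SelfTrainingClassifier`, wrapped around a real model.

```python
import math

def fit_centroids(X, y):
    """Toy classifier: one centroid (mean feature vector) per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return centroids

def predict_with_confidence(centroids, x):
    """Return (nearest class, confidence). Assumes two classes, so the
    confidence lies between 0.5 (equidistant) and 1.0 (on a centroid)."""
    dists = {label: math.dist(x, c) for label, c in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    conf = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, conf

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled points and refit."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        centroids = fit_centroids(X_lab, y_lab)
        remaining, added = [], 0
        for x in pool:
            label, conf = predict_with_confidence(centroids, x)
            if conf >= threshold:
                # Confident prediction: treat it as a (pseudo) label.
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:  # nothing new learned, stop early
            break
    return fit_centroids(X_lab, y_lab), pool

# Hypothetical data: two made-up features per user.
X_labeled = [(0.0, 0.0), (10.0, 10.0)]
y_labeled = ["legit", "bot"]
X_unlabeled = [(1.0, 1.0), (9.0, 9.0), (5.0, 5.0), (0.5, 0.2), (9.5, 10.2)]

centroids, still_unlabeled = self_train(X_labeled, y_labeled, X_unlabeled)
print(centroids)
print(still_unlabeled)  # points the model never became confident about
```

The points that stay in `still_unlabeled` are the ones no round could label confidently, which makes them natural candidates for manual review.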

Alternatively, you can train a classifier on the labels you have so far, then use it to predict the probability of each label for your unlabeled data. Sort the unlabeled examples by predicted probability, and manually label a small sample from each of the low (p < 0.25), medium (0.25 < p < 0.75) and high (p > 0.75) probability ranges. Then try to estimate in which probability range your model struggles most. In theory, it should be a better investment of your time to manually label the cases that fall in the medium probability range, as these are the ones your current model is most uncertain about. This and similar approaches belong to the category of active learning.
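The selection step described above can be sketched in a few lines. This is only an illustration under assumptions: `probs` is a hypothetical mapping from user ID to the predicted bot probability produced by whatever classifier you trained, the function names are made up, and the 0.25/0.75 cut-offs are the ones suggested in this answer.

```python
import random

def bucket(p, low=0.25, high=0.75):
    """Assign a predicted probability to the low/medium/high range."""
    if p < low:
        return "low"
    if p > high:
        return "high"
    return "medium"

def sample_per_bucket(probs, n_per_bucket, seed=0):
    """Draw a small random sample from each range for manual checking."""
    rng = random.Random(seed)
    buckets = {"low": [], "medium": [], "high": []}
    for user_id, p in probs.items():
        buckets[bucket(p)].append(user_id)
    return {
        name: rng.sample(ids, min(n_per_bucket, len(ids)))
        for name, ids in buckets.items()
    }

def most_uncertain(probs, k):
    """Uncertainty sampling: the k users whose probability is closest to 0.5."""
    return sorted(probs, key=lambda uid: abs(probs[uid] - 0.5))[:k]

# Hypothetical predictions from a classifier trained on the labeled 20k.
probs = {"u1": 0.02, "u2": 0.48, "u3": 0.91, "u4": 0.55, "u5": 0.30, "u6": 0.77}

audit_sample = sample_per_bucket(probs, n_per_bucket=2)
to_label_next = most_uncertain(probs, k=3)
print(audit_sample)
print(to_label_next)
```

After labeling the users in `to_label_next`, you would retrain the classifier and repeat, so each round of manual work goes to the examples the current model finds hardest.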

In short, look into semi-supervised or active learning.

You mean it should be a better investment of your time to manually label the cases that fall in the low probability range? — nammerkage, Apr 25 '22 at 07:51
The model is less certain in the mid probability range. For instance, for examples predicted with probability p = 0.5, the model is fully uncertain about the right label. In theory, it is a better investment to label these. — Enk9456, Apr 26 '22 at 13:58
