
I am building a supervised model using sensitive and scarce data. For the sake of discussion, I've simplified the problem statement by assuming that I'm building a model for identifying dogs.

Let's say I am building a model to identify dogs in pictures. I trained it on only a few positive and negative examples, because the data is scarce, so the model's accuracy is not good (say F-score = 0.64). I deployed this model in production. Whenever the model predicts that a picture is a dog, I label that prediction as a "True Positive" or a "False Positive", and then I retrain the model using these labels.
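
Here is a minimal sketch of the loop I am describing, assuming a scikit-learn style classifier; the feature arrays, labels, and dataset sizes are placeholders, not my real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small initial training set (scarce data); features are placeholders.
X_train = np.random.rand(200, 64)       # stand-in for image features
y_train = np.random.randint(0, 2, 200)  # 1 = dog, 0 = not dog

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Production batch: the model flags candidate dog pictures.
X_prod = np.random.rand(5000, 64)
pred = model.predict(X_prod)
flagged = np.where(pred == 1)[0]        # only these get human review

# A reviewer labels each flagged picture as TP (really a dog) or FP.
# Simulated here with random labels; in practice this is manual review.
human_labels = np.random.randint(0, 2, len(flagged))

# Retrain on the original data plus the newly labelled predictions.
X_train = np.vstack([X_train, X_prod[flagged]])
y_train = np.concatenate([y_train, human_labels])
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Pictures the model predicted as "not dog" never reach the reviewer,
# so any false negatives among them are never added back to training.
```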

The problem I see with this approach is that I never find out when the model misses a dog picture, i.e. a "False Negative", so I cannot retrain the model on such examples. Therefore, the current approach can only improve my model's Precision (TP/(TP+FP)) and not its Recall (TP/(TP+FN)).
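
To make this concrete, here is a tiny sketch with made-up counts (they are not from my real data): precision can be computed from the reviewed predictions, but recall cannot, because FN is never observed:

```python
# Counts from reviewing only the pictures the model flagged as dogs.
tp, fp = 70, 30   # known: every flagged picture is labelled TP or FP
fn = None         # unknown: missed dogs never surface for review

precision = tp / (tp + fp)             # computable from the review queue
print(f"precision = {precision:.2f}")  # 0.70

# recall = tp / (tp + fn) cannot be computed because fn is unknown,
# so retraining only on TP/FP feedback cannot target recall.
```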

Please suggest:

  1. How can I improve the model's Recall?
  2. Do you see any other problems with my approach?
  • Welcome to DataScienceSE. As far as I know, the only way would be to check and relabel all the instances, including the negative ones. – Erwan Mar 13 '22 at 21:05
  • Thanks for the welcome, @Erwan. Could you suggest how I can find the negative ones? Here's an example: I pass in 100,000 pictures and the model predicts that only 100 of them are dog pictures. I would label these 100 pictures as TP or FP. The FNs are hiding somewhere in the other 99,900 (= 100,000 − 100) pictures, and it would be impossible to find them manually. – learnlifelong Mar 13 '22 at 23:53
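
For concreteness, here is the arithmetic from that last comment as a sketch; the 0.5% true dog rate is a made-up assumption added purely to show how many false negatives could be hiding in the unreviewed pictures:

```python
n_total = 100_000
n_predicted_dog = 100                               # reviewed, labelled TP or FP
n_predicted_not_dog = n_total - n_predicted_dog
print(n_predicted_not_dog)                          # 99900 pictures nobody reviews

# Assumed true dog rate, for illustration only.
true_dog_rate = 0.005
expected_dogs = int(true_dog_rate * n_total)        # ~500 dogs in the batch
min_false_negatives = expected_dogs - n_predicted_dog
print(expected_dogs, min_false_negatives)           # 500, 400

# Even if all 100 flagged pictures are true positives, at least ~400
# dogs sit unreviewed among the 99,900 predicted negatives.
```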

0 Answers