I have a huge collection of objects, of which only a tiny fraction belong to a class of interest. The collection is initially unlabelled, but labels can be added using an expensive operation (for example, by a human).
Currently I use a simple, generic machine learning strategy:
Use hand-crafted rules to select a smaller subset of objects (thus leaving out some of the interesting ones).
Label part of the smaller subset, and use it to train and select a classification algorithm and its parameters.
Classify the remaining objects in the smaller subset (and perhaps also in the full collection).
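To make the three steps concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the one-dimensional feature, the `cutoff` rule, the labelling `budget`, and the threshold classifier are hypothetical stand-ins for my real objects, rules, and models, and "labelling" is simulated by reading a hidden label.

```python
import random

def make_collection(n=20000, rare_rate=0.01, seed=0):
    """Synthetic stand-in for the collection: (feature, hidden_label) pairs."""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        interesting = rng.random() < rare_rate
        # Assumption: interesting objects tend to have larger feature values.
        x = rng.gauss(3.0 if interesting else 0.0, 1.0)
        items.append((x, interesting))
    return items

def rule_filter(items, cutoff=1.0):
    """Step 1: hand-crafted rule; it may also drop some interesting objects."""
    return [item for item in items if item[0] > cutoff]

def fit_threshold(labelled):
    """Step 2: choose the threshold with the fewest training errors."""
    candidates = sorted(x for x, _ in labelled)
    def errors(t):
        return sum((x > t) != y for x, y in labelled)
    return min(candidates, key=errors)

items = make_collection()
subset = rule_filter(items)

# Label a random part of the subset (simulated by reading the hidden label).
random.Random(1).shuffle(subset)
budget = 300
labelled, rest = subset[:budget], subset[budget:]

t = fit_threshold(labelled)

# Step 3: classify the remaining objects in the smaller subset.
predicted = [x > t for x, _ in rest]
```

Note that the objects removed by `rule_filter` never reach `fit_threshold`, and that most of the `budget` is spent on uninteresting objects.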
This has two drawbacks:
The labeller still needs to see a huge number of uninteresting objects, and is therefore able to label only a very small fraction of the interesting ones.
The objects outside the smaller subset are completely ignored in the learning phase, so some information is lost (the classification algorithm might not work well on this complement).
It seems that it would be better to use online learning, i.e., to select the objects shown to the labeller based on the labels obtained so far. But then the labelled objects are no longer a random sample, so it is no longer obvious that the resulting classifier retains its nice theoretical properties (such as statistical consistency).
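A minimal sketch of what I mean by selecting objects based on previous labels, under the same kind of assumptions as before: synthetic one-dimensional data, a simulated labeller, and a threshold classifier. The selection rule used here (query the object closest to the current decision boundary, often called uncertainty sampling) is just one possible heuristic chosen for illustration, and all names and parameters are hypothetical.

```python
import random

def make_pool(n=5000, rare_rate=0.02, seed=0):
    """Synthetic pool of (feature, hidden_label) pairs."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n):
        y = rng.random() < rare_rate
        pool.append((rng.gauss(3.0 if y else 0.0, 1.0), y))
    return pool

def fit_threshold(labelled):
    """Threshold with the fewest errors on the labelled objects."""
    candidates = sorted(x for x, _ in labelled)
    return min(candidates,
               key=lambda t: sum((x > t) != y for x, y in labelled))

def adaptive_labelling(pool, budget=100, n_seed=20, seed=1):
    """Label n_seed random objects, then repeatedly query the unlabelled
    object whose feature is closest to the current threshold."""
    rng = random.Random(seed)
    unlabelled = list(range(len(pool)))
    rng.shuffle(unlabelled)
    labelled_idx = [unlabelled.pop() for _ in range(n_seed)]
    while len(labelled_idx) < budget:
        t = fit_threshold([pool[i] for i in labelled_idx])
        # The next query depends on all labels seen so far.
        j = min(unlabelled, key=lambda i: abs(pool[i][0] - t))
        unlabelled.remove(j)
        labelled_idx.append(j)
    return labelled_idx, fit_threshold([pool[i] for i in labelled_idx])

pool = make_pool()
queried, t = adaptive_labelling(pool)
```

After the first `n_seed` objects, the queried objects cluster around the boundary, so the labelled set is a biased sample of the collection. This is exactly why I am unsure whether the usual guarantees still apply.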
Is there a general framework for active object detection that works theoretically, practically, or both? I could not get the complete picture from the Wikipedia article on active learning.