Regarding your question:
However, they do not mention how this could then be used in the inference stage, because the required threshold for selecting the correct labels is not clear.
Does anyone know how this would work?
While this may be unsatisfying, I believe the answer is: you don't use it for inference.
The paper uses the multi-label softmax during pre-training only, where the loss is computed against ground-truth hashtags that are already known. The Facebook paper then uses either the features learned during pre-training on the hashtag data, or the hashtag-trained network purely as a weight initialization for fine-tuning, not for actual inference on "live data."
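To make the pre-training loss concrete, here is a minimal sketch of a per-image multi-label softmax cross-entropy in PyTorch. The function name and tensor layout are my own; my understanding of the paper is that each of an image's k ground-truth hashtags receives target probability 1/k:

    import torch
    import torch.nn.functional as F

    def multilabel_softmax_loss(logits, target_mask):
        # logits: (batch, num_classes) raw scores from the network
        # target_mask: (batch, num_classes) floats, 1.0 where a
        # ground-truth hashtag is present (assumption: at least one per image)
        target = target_mask / target_mask.sum(dim=1, keepdim=True)  # each of k hashtags -> 1/k
        log_probs = F.log_softmax(logits, dim=1)
        return -(target * log_probs).sum(dim=1).mean()  # cross-entropy, averaged over the batch

Note that this only requires known labels to compute a loss; nothing here tells you how to threshold the softmax outputs on unseen data.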
The softmax function only gives a relative level of confidence in the labels; its probability values are more "ordinal" than "cardinal." So to use the softmax values during inference, you would need a separate way to decide how many labels to extract. That could be a pre-determined constant n (the paper notes that each image had about 2 canonical hashtags on average, so n = 2 is a plausible choice), a separate algorithm/model that decides how many labels an image should have, etc. A sketch of the first option follows.
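For illustration, the fixed-n decision rule might look like this (a hypothetical helper, not something from the paper; it simply ranks classes by softmax score and keeps the top n):

    def predict_top_n_labels(logits, n=2):
        # rank classes by softmax probability and keep the n highest;
        # n = 2 echoes the ~2 canonical hashtags per image noted in the paper
        probs = torch.softmax(logits, dim=1)
        return probs.topk(n, dim=1).indices  # (batch, n) predicted label indices

Because the ranking is what matters, this works even though the absolute probability values are not calibrated; a learned or per-image n would require the extra model mentioned above.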
Sources:
D. Mahajan et al., “Exploring the Limits of Weakly Supervised Pretraining,” Sep. 2018.