Regarding your question:
However, they do not mention how this could then be used in the inference stage, because the required threshold for selecting the correct labels is not clear.
Does anyone know how this would work?
While this may be unsatisfying, I believe the answer is: you don't use it for inference.
The paper uses the multi-label softmax during pre-training only, where the loss is computed against ground-truth hashtags that are already known. The Facebook paper then uses either the features learned during pre-training on the hashtag data, or the hashtag-trained network purely as a weight initialization for fine-tuning, not for actual inference on "live data."
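To make the pre-training loss concrete, here is a minimal sketch of a per-image multi-label softmax cross-entropy in PyTorch. The function name and tensor layout are my own; my understanding of the paper is that each of an image's k ground-truth hashtags receives target probability 1/k:

    import torch
    import torch.nn.functional as F

    def multilabel_softmax_loss(logits, target_mask):
        # logits: (batch, num_classes) raw scores from the network
        # target_mask: (batch, num_classes) floats, 1.0 where a
        # ground-truth hashtag is present (assumption: at least one per image)
        target = target_mask / target_mask.sum(dim=1, keepdim=True)  # each of k hashtags -> 1/k
        log_probs = F.log_softmax(logits, dim=1)
        return -(target * log_probs).sum(dim=1).mean()  # cross-entropy, averaged over the batch

Note that this only requires known labels to compute a loss; nothing here tells you how to threshold the softmax outputs on unseen data.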
The softmax function only gives a relative level of confidence in the labels; its probability values are more "ordinal" than "cardinal." So to use the softmax values during inference, you would need a separate way to decide how many labels to extract. That could be a pre-determined constant n (the paper notes that each image had about 2 canonical hashtags on average, so n = 2 is a plausible choice), a separate algorithm/model that decides how many labels an image should have, etc. A sketch of the first option follows.
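For illustration, the fixed-n decision rule might look like this (a hypothetical helper, not something from the paper; it simply ranks classes by softmax score and keeps the top n):

    def predict_top_n_labels(logits, n=2):
        # rank classes by softmax probability and keep the n highest;
        # n = 2 echoes the ~2 canonical hashtags per image noted in the paper
        probs = torch.softmax(logits, dim=1)
        return probs.topk(n, dim=1).indices  # (batch, n) predicted label indices

Because the ranking is what matters, this works even though the absolute probability values are not calibrated; a learned or per-image n would require the extra model mentioned above.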
Sources:
D. Mahajan et al., “Exploring the Limits of Weakly Supervised Pretraining,” Sep. 2018.