BERT pre-trains the special [CLS] token on the NSP task: for every pair A-B, it predicts whether sentence B follows sentence A in the corpus or not.
When fine-tuning BERT for single-sentence classification (e.g. spam or not spam), the recommended practice is to use a degenerate pair A-null (just sentence A, with no second segment) and feed the [CLS] token output into the classifier for our task.
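To make concrete what I mean by A-null fine-tuning, here is a rough sketch using the HuggingFace transformers API (the example sentence and the label convention are just placeholders I made up):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Only sentence A is passed in; there is no segment B at all,
# so the input is just [CLS] A [SEP] with all token_type_ids = 0.
inputs = tokenizer("Win a free iPhone now!!!", return_tensors="pt")
labels = torch.tensor([1])  # placeholder convention: 1 = spam

outputs = model(**inputs, labels=labels)
loss = outputs.loss      # cross-entropy over the classification head on top of [CLS]
logits = outputs.logits  # shape (1, 2): spam vs. not-spam scores
loss.backward()          # an optimizer step would follow in an actual fine-tuning loop
```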
How does that setup make sense? During pre-training, BERT never saw such pairs, so how can it handle them just fine and "know" that, since there is no sentence B, it should extract the meaning of sentence A alone instead of the relation between A and B?
Is there another practice of fine-tuning the model with the pairs A-"spam" and A-"not spam" for every sentence A, and seeing which pair gets the higher NSP score? Or is that effectively equivalent to fine-tuning with A-null?
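For clarity, the alternative I have in mind would look roughly like this: keep the pre-trained NSP head and compare the scores of the two label sentences (again sketched with HuggingFace; the label sentences are placeholders, not an established recipe):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "Win a free iPhone now!!!"
candidates = {"spam": "This message is spam.",
              "not spam": "This message is not spam."}

scores = {}
for label, sentence_b in candidates.items():
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    logits = model(**inputs).logits  # shape (1, 2)
    # index 0 corresponds to "B follows A" in the HF convention,
    # so use its probability as the score for this label sentence
    scores[label] = torch.softmax(logits, dim=-1)[0, 0].item()

print(max(scores, key=scores.get))  # pick the label whose pair scores higher
```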
