BERT pre-trains the special [CLS] token on the NSP task: for every pair A-B, it predicts whether sentence B follows sentence A in the corpus or not.
When fine-tuning BERT for single-sentence classification (e.g. spam or not spam), the recommended practice is to use a degenerate pair A-null (just sentence A, with no second segment) and feed the [CLS] token output into the classifier for our task.
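To make concrete what I mean by A-null fine-tuning, here is a rough sketch using the HuggingFace transformers API (the example sentence and the label convention are just placeholders I made up):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Only sentence A is passed in; there is no segment B at all,
# so the input is just [CLS] A [SEP] with all token_type_ids = 0.
inputs = tokenizer("Win a free iPhone now!!!", return_tensors="pt")
labels = torch.tensor([1])  # placeholder convention: 1 = spam

outputs = model(**inputs, labels=labels)
loss = outputs.loss      # cross-entropy over the classification head on top of [CLS]
logits = outputs.logits  # shape (1, 2): spam vs. not-spam scores
loss.backward()          # an optimizer step would follow in an actual fine-tuning loop
```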
How does that setup make sense? During pre-training, BERT never saw such pairs, so how can it handle them just fine and "know" that, since there is no sentence B, it should extract the meaning of sentence A alone instead of the relation between A and B?
Is there another practice of fine-tuning the model with the pairs A-"spam" and A-"not spam" for every sentence A, and seeing which pair gets the higher NSP score? Or is that effectively equivalent to fine-tuning with A-null?
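For clarity, the alternative I have in mind would look roughly like this: keep the pre-trained NSP head and compare the scores of the two label sentences (again sketched with HuggingFace; the label sentences are placeholders, not an established recipe):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "Win a free iPhone now!!!"
candidates = {"spam": "This message is spam.",
              "not spam": "This message is not spam."}

scores = {}
for label, sentence_b in candidates.items():
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
    logits = model(**inputs).logits  # shape (1, 2)
    # index 0 corresponds to "B follows A" in the HF convention,
    # so use its probability as the score for this label sentence
    scores[label] = torch.softmax(logits, dim=-1)[0, 0].item()

print(max(scores, key=scores.get))  # pick the label whose pair scores higher
```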
