We are training a BERT model (using the Hugging Face Transformers library) for a sequence labeling task with six labels: five labels mark tokens belonging to one of the classes we are interested in, and one label marks tokens that belong to no class.
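For context, our setup looks roughly like the following. This is a minimal sketch: the checkpoint name and the label names are placeholders, not our actual configuration.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set: five classes of interest plus one "no class" label.
LABELS = ["CLASS_A", "CLASS_B", "CLASS_C", "CLASS_D", "CLASS_E", "O"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)
```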
Generally speaking, this works well: the loss decreases with each epoch, and the results are good enough. However, when we compute precision, recall, and F-score on a test set after each epoch, we see that they oscillate quite a bit (roughly as in the evaluation sketch after the list below). We train for 1,000 epochs, and performance appears to plateau after about 100 epochs. Over the remaining 900 epochs, precision keeps jumping to seemingly random values between 0.677 and 0.709, and recall between 0.729 and 0.798. The model does not seem to stabilize. To mitigate the problem, we have already tried the following:
- We increased the size of our test set.
- We experimented with different learning rates and batch sizes.
- We tried different transformer models from the Hugging Face library, e.g. RoBERTa, GPT-2, etc. None of this has helped.
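For reference, the per-epoch evaluation works roughly like this. It is a sketch, not our exact code: `eval_loader` is assumed to yield tokenized batches with a `labels` tensor, the "no class" label ID and the `-100` masking convention are assumptions, and special-token handling is simplified.

```python
import torch
from sklearn.metrics import precision_recall_fscore_support

O_LABEL_ID = 5       # assumed ID of the "no class" label
IGNORE_INDEX = -100  # assumed ID used to mask padding/special tokens

def evaluate(model, eval_loader, device):
    model.eval()
    all_preds, all_golds = [], []
    with torch.no_grad():
        for batch in eval_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits  # (batch, seq_len, num_labels)
            preds = logits.argmax(dim=-1)
            # Keep only real tokens; drop padding/special positions.
            mask = batch["labels"] != IGNORE_INDEX
            all_preds.extend(preds[mask].tolist())
            all_golds.extend(batch["labels"][mask].tolist())
    # Micro-averaged precision/recall/F-score over the five classes
    # of interest, excluding the "no class" label.
    interesting = [i for i in range(6) if i != O_LABEL_ID]
    return precision_recall_fscore_support(
        all_golds, all_preds, labels=interesting, average="micro", zero_division=0
    )[:3]
```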
Does anyone have any recommendations on what we could do here? How can we pick the “best model”? Currently, we pick the one that performs best on the test set, but we are unsure about this approach.
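To make that last point concrete, our current selection logic is essentially the following. Again a sketch: `train_one_epoch` stands in for our training loop, `evaluate` is the function sketched above, and the output path is a placeholder.

```python
best_f1 = 0.0
for epoch in range(1000):
    train_one_epoch(model, train_loader, optimizer, device)  # placeholder for our training loop
    precision, recall, f1 = evaluate(model, test_loader, device)
    print(f"epoch {epoch}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
    # Keep the checkpoint with the best test-set F-score so far.
    # This is exactly the step we are unsure about.
    if f1 > best_f1:
        best_f1 = f1
        model.save_pretrained("best-model")
```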