We are training a BERT model (using the Hugging Face Transformers library) for a sequence labeling task with six labels: five labels mark tokens belonging to one of the classes we are interested in, and one label marks tokens that belong to no class.
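For context, our setup looks roughly like the following. This is a minimal sketch: the checkpoint name and the label names are placeholders, not our actual configuration.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set: five classes of interest plus one "no class" label.
LABELS = ["CLASS_A", "CLASS_B", "CLASS_C", "CLASS_D", "CLASS_E", "O"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)
```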
Generally speaking, this works well: the loss decreases with each epoch, and the results are good enough. However, when we compute precision, recall, and F-score on a test set after each epoch, we see that they oscillate quite a bit (roughly as in the evaluation sketch after the list below). We train for 1,000 epochs, and performance appears to plateau after about 100 epochs. Over the remaining 900 epochs, precision keeps jumping to seemingly random values between 0.677 and 0.709, and recall between 0.729 and 0.798. The model does not seem to stabilize. To mitigate the problem, we have already tried the following:
- We increased the size of our test set.
- We experimented with different learning rates and batch sizes.
- We tried different transformer models from the Hugging Face library, e.g. RoBERTa, GPT-2, etc. None of this has helped.
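For reference, the per-epoch evaluation works roughly like this. It is a sketch, not our exact code: `eval_loader` is assumed to yield tokenized batches with a `labels` tensor, the "no class" label ID and the `-100` masking convention are assumptions, and special-token handling is simplified.

```python
import torch
from sklearn.metrics import precision_recall_fscore_support

O_LABEL_ID = 5       # assumed ID of the "no class" label
IGNORE_INDEX = -100  # assumed ID used to mask padding/special tokens

def evaluate(model, eval_loader, device):
    model.eval()
    all_preds, all_golds = [], []
    with torch.no_grad():
        for batch in eval_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits  # (batch, seq_len, num_labels)
            preds = logits.argmax(dim=-1)
            # Keep only real tokens; drop padding/special positions.
            mask = batch["labels"] != IGNORE_INDEX
            all_preds.extend(preds[mask].tolist())
            all_golds.extend(batch["labels"][mask].tolist())
    # Micro-averaged precision/recall/F-score over the five classes
    # of interest, excluding the "no class" label.
    interesting = [i for i in range(6) if i != O_LABEL_ID]
    return precision_recall_fscore_support(
        all_golds, all_preds, labels=interesting, average="micro", zero_division=0
    )[:3]
```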
Does anyone have any recommendations on what we could do here? How can we pick the “best model”? Currently, we pick the one that performs best on the test set, but we are unsure about this approach.
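To make that last point concrete, our current selection logic is essentially the following. Again a sketch: `train_one_epoch` stands in for our training loop, `evaluate` is the function sketched above, and the output path is a placeholder.

```python
best_f1 = 0.0
for epoch in range(1000):
    train_one_epoch(model, train_loader, optimizer, device)  # placeholder for our training loop
    precision, recall, f1 = evaluate(model, test_loader, device)
    print(f"epoch {epoch}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
    # Keep the checkpoint with the best test-set F-score so far.
    # This is exactly the step we are unsure about.
    if f1 > best_f1:
        best_f1 = f1
        model.save_pretrained("best-model")
```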