
Currently learning and reading about transformer models, I understand that during the pretraining stage BERT is trained on a large corpus via MLM and NSP. But during fine-tuning, for example when classifying the sentiment of some other text, are all of the BERT parameters (110M+ parameters plus the final classification layer) updated, or only the final classification layer? I couldn't find a concrete answer to this in the resources I've been looking at.

Thank you in advance.

spnc

2 Answers


Both approaches are reasonable. Updating the BERT weights will take longer to train, but should give more accurate results.
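A minimal sketch of the two options using the Hugging Face transformers library (the model name, label count, and variable names are illustrative assumptions, not something stated in the question):

```python
# Sketch: full fine-tuning vs. training only the classification head.
# Assumes `transformers` and `torch` are installed; "bert-base-uncased"
# and num_labels=2 are illustrative choices.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Option A: full fine-tuning -- all ~110M BERT parameters plus the new
# classification head receive gradients (slower, usually more accurate).
full_params = [p for p in model.parameters() if p.requires_grad]

# Option B: freeze the encoder so only the classification head is trained
# (faster and cheaper, but typically less accurate).
for param in model.bert.parameters():
    param.requires_grad = False
head_params = [p for p in model.parameters() if p.requires_grad]

print(f"trainable (full fine-tuning): {sum(p.numel() for p in full_params):,}")
print(f"trainable (head only):        {sum(p.numel() for p in head_params):,}")
```

Whichever set of parameters ends up with `requires_grad = True` is what the optimizer will update during fine-tuning.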

Akavall

By default, BERT fine-tuning involves learning a task-specific layer (for a classification task, a small neural network on top of the [CLS] token) as well as updating the existing parameters of the model to adapt it to the task. So it's both: the new layer plus the BERT model weights. However, you can instead use just the embedding of the [CLS] token and train only the layer on top of it to reduce the training cost. It's a matter of trade-off between performance and compute cost, as in the sketch below.
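A rough sketch of the lighter-weight option, keeping BERT frozen as a feature extractor and training only a linear layer on the [CLS] embedding (model name, label count, and the example sentence are assumptions for illustration):

```python
# Frozen BERT as a feature extractor; only `classifier` is trained.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False          # no updates to the 110M BERT weights

# The only trainable parameters: a linear layer on top of the [CLS] embedding.
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)

inputs = tokenizer("great movie, would watch again", return_tensors="pt")
with torch.no_grad():                # BERT forward pass only produces features
    cls_embedding = encoder(**inputs).last_hidden_state[:, 0, :]  # [CLS] token

logits = classifier(cls_embedding)   # gradients flow only through `classifier`
```

In full fine-tuning you would skip the freezing and the `torch.no_grad()` block, so gradients also flow back through the encoder.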

Ashwin Geet D'Sa