
I am wondering which data cleaning steps should be performed if you want to fine-tune a BERT model on custom text data.

Which steps should be performed?

Does it make sense to perform stemming or lemmatization if neither was applied during the initial training of the BERT Base/Large model?

Predicted Life
  • Yes, it answered 50% of my question. Thanks. I would like to know whether the same pre-processing functions that were applied to the training data of the BERT base model also need to be applied to the data used for fine-tuning. Asking the other way round: if no preprocessing was performed on the training data of the base model, would it be a good idea to apply preprocessing (for example, stemming) to the data used for fine-tuning the BERT model? – Predicted Life Nov 26 '20 at 23:20
  • No, it would not be a good idea to apply stemming on the fine-tuning data. For transfer learning to be effective, the fine-tuning data should resemble the original data used for pre-training. – noe Nov 26 '20 at 23:34
  • Thanks. This is what I wanted to know. – Predicted Life Nov 27 '20 at 07:59
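Following the advice in the comments, a light-touch cleaning pass would normalize whitespace and strip unrepresentable control characters while deliberately leaving the text otherwise raw (no stemming, lemmatization, or forced lowercasing). This is a hypothetical sketch, not an official BERT preprocessing routine; the helper name `clean_for_bert` is made up for illustration:

```python
import re
import unicodedata

def clean_for_bert(text: str) -> str:
    """Minimal cleaning for BERT fine-tuning data (illustrative sketch).

    BERT's WordPiece tokenizer was trained on largely raw text, so we
    avoid stemming/lemmatization here; the cased vs. uncased decision
    belongs to the tokenizer choice, not to this cleaning step.
    """
    # Drop control characters (Unicode category "C*") except common whitespace.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\t\n "
    )
    # Collapse whitespace runs; the tokenizer treats them all as separators.
    return re.sub(r"\s+", " ", text).strip()

print(clean_for_bert("Running  dogs\x00 ran\tfaster!"))  # Running dogs ran faster!
```

The point of keeping the function this small is exactly the comment above: for transfer learning to work well, the fine-tuning text should resemble the raw text the model was pre-trained on.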

0 Answers