
My goal is to take the general knowledge and language understanding of a pre-trained LLM and continue training on a smaller, domain-specific corpus to improve the model's knowledge of that domain. What is the best-practice approach here without running into issues (e.g. catastrophic forgetting)? Here are some points I am considering, but I am not completely sure about them:

  • use last checkpoint of pre-trained LLM and continue training on custom corpus
  • training policy and procedure is the same as used for pre-training (MLM etc.)
  • use a very small learning rate
  • is it possible to load the model in int8 (bitsandbytes) and continue training without breaking it?
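
To make the second point concrete, here is a toy pure-Python sketch of the BERT-style MLM masking I have in mind (80/10/10 split; the `MASK_ID` and `VOCAB_SIZE` values are illustrative assumptions, and this is not a real data collator):

```python
import random

MASK_ID = 103          # [MASK] token id in the BERT vocabulary (assumption)
VOCAB_SIZE = 30522     # BERT-base vocabulary size (assumption)

def mlm_mask(token_ids, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (inputs, labels); labels is -100 at unselected positions."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)
        # else: leave the input token unchanged
    return inputs, labels
```

As far as I understand, this is the collation that `DataCollatorForLanguageModeling(mlm=True)` from `transformers` would handle in a real training loop.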

Does this approach make sense? Has anyone done this before and has some insights?

Any hints are highly appreciated!

Arthuro

2 Answers


Yes, you are on the right track. What you are describing is called fine-tuning the model. I have personally done this with the same approach; the LLM I used was GPT-J 6B, fine-tuned to generate MCQs. Some tips when fine-tuning large LLMs:

  1. Do not feed all the data to the model at once. First create a small dataset and fine-tune the model on it for a couple of epochs to check that fine-tuning is working properly. This may save you hours down the road.
  2. Make sure you understand all the hyperparameters and their effects before fine-tuning. Since the model is large, fine-tuning can take a long time and significant resources, and you do not want to fine-tune multiple times with different parameters. So save yourself time and money by understanding the effects of the parameters first.
  3. Yes, it is possible to load the model in lower bits (also known as quantisation) to reduce the model size and, in turn, resource utilisation. Make sure to quantise before training and to de-quantise the model back to its original bits before inference. This is important: I was getting nonsense results simply because I forgot to de-quantise the model after training and before inference.
  4. On your first point about using only the last layer: I am not sure, as I did not try that method.
  5. Regarding your third point about a very small learning rate: it is a hyperparameter, so you may want to tune it. But usually, if the model is huge (in my case 6 billion parameters, 100 GB), there is some leeway in tuning it, and it will not affect the results much, as the model is robust enough to counter it. But again, it depends on your model size!
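
To illustrate point 3, here is a toy sketch of a symmetric int8 quantise/de-quantise round trip on a weight vector. This only shows the idea; it is not what bitsandbytes actually does internally (which uses per-block scales and mixed-precision decomposition):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantisation: scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [qi * scale for qi in q]

# Round-trip demo: the reconstruction error is at most scale / 2 per weight.
w = [0.5, -1.0, 0.25]
q, s = quantize_int8(w)
w_approx = dequantize_int8(q, s)
```

The approximation error per weight is bounded by half the scale, which is why quantisation is usually harmless for inference but why training on the quantised weights directly needs care.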

Cheers!

spectre
  • Appreciate your answer, but it seems you misunderstood my question. I didn't mean fine-tuning the model for a specific downstream task, but further pre-training a generic model on a domain-specific corpus (e.g. feeding the model heaps of biomedical data before fine-tuning for NER). Furthermore, by "last checkpoint" I mean the training checkpoint, not freezing all layers but the last one. – Arthuro Jun 13 '23 at 13:33

I find this link very relevant. Basically, you need to find the right AutoModel type for your model-specific word-prediction task.
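
To expand on that: the Auto class you load has to match the model's pretraining head, roughly along the lines of the following mapping (class names as documented in the `transformers` library; the helper function itself is hypothetical):

```python
# Map pretraining objective to the transformers Auto class one would load.
# Class names are from the transformers documentation; the mapping and
# helper are illustrative, not part of the library.
OBJECTIVE_TO_AUTOCLASS = {
    "mlm": "AutoModelForMaskedLM",       # BERT/RoBERTa-style masked LM
    "clm": "AutoModelForCausalLM",       # GPT-style next-token prediction
    "seq2seq": "AutoModelForSeq2SeqLM",  # T5/BART-style denoising objectives
}

def autoclass_for(objective: str) -> str:
    """Return the Auto class name for a given pretraining objective."""
    try:
        return OBJECTIVE_TO_AUTOCLASS[objective]
    except KeyError:
        raise ValueError(f"unknown pretraining objective: {objective!r}")
```

So for the MLM-style continued pretraining asked about above, one would load the checkpoint with `AutoModelForMaskedLM.from_pretrained(...)` rather than a task-specific head.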