
Suppose I am pre-training a Masked Language Model on a specific dataset. In that dataset, most sequences contain a particular token at high frequency.

Sample sequence:
<tok1>, <tok1>, <tok4>, <tok7>, <tok4>, <tok4>  (here <tok4> is very frequent in this sequence)

So if I mask some tokens and train the model to predict those masked tokens, it will obviously become biased toward predicting <tok4> because of its statistical frequency.
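
For reference, this is the masking scheme I have in mind: a minimal PyTorch sketch of BERT-style masking (15% of positions selected as targets, with the 80/10/10 replacement rule from the BERT paper). Here `mask_token_id` and `vocab_size` stand in for my tokenizer's values.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """BERT-style masking: select ~15% of positions as prediction targets,
    replace 80% of those with [MASK], 10% with a random token, and leave
    10% unchanged."""
    input_ids = input_ids.clone()  # avoid mutating the caller's batch
    labels = input_ids.clone()

    # Pick which positions the model must predict.
    target = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~target] = -100  # -100 is ignored by torch.nn.CrossEntropyLoss

    # 80% of the targets are replaced with the [MASK] token.
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & target
    input_ids[masked] = mask_token_id

    # Half of the remaining targets (10% overall) get a random token.
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & target & ~masked
    input_ids[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]

    # The last 10% of targets keep their original token; the model
    # still has to predict them via the labels tensor.
    return input_ids, labels
```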

Since <tok4> represents important information, 'downsampling' (i.e., removing those frequent tokens) is not preferred; I would like to keep my sequences as intact as possible.

How best should I deal with this? Is there an established method that counters this problem?

neel g

1 Answer


The goal of language modeling is to build a statistical model of how language is used in a specific context. One of the important components of that is token frequency.

Bias can mean many things in machine learning. I think you mean bias in the sense of a high chance of prediction. That kind of bias is useful in language modeling: if <tok4> appears frequently, a useful language model will capture that property.
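
To make that concrete, here is a small illustrative sketch using the toy sequence from your question; the empirical unigram frequencies it prints are the marginal probabilities a well-fit model should reproduce:

```python
from collections import Counter

# Toy sequence from the question: <tok4> makes up 3 of the 6 tokens.
corpus = ["tok1", "tok1", "tok4", "tok7", "tok4", "tok4"]

counts = Counter(corpus)
total = sum(counts.values())
unigram = {tok: count / total for tok, count in counts.items()}

print(unigram)  # {'tok1': 0.33, 'tok4': 0.5, 'tok7': 0.17} (approximately)
```

A model that assigns <tok4> roughly a 50% marginal probability here is matching the data, which is exactly what we want it to learn.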

Brian Spiering