
Suppose I am pre-training a Masked Language Model on a specific dataset. In that dataset, most sequences contain a particular token at high frequency.

Sample sequence:
<tok1>, <tok1>, <tok4>, <tok7>, <tok4>, <tok4>  (here <tok4> is very frequent in this sequence)

So if I mask some tokens and train the model to predict those masked tokens, it will obviously become biased toward predicting <tok4> because of its statistical frequency.
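
For reference, this is the masking scheme I have in mind: a minimal PyTorch sketch of BERT-style masking (15% of positions selected as targets, with the 80/10/10 replacement rule from the BERT paper). Here `mask_token_id` and `vocab_size` stand in for my tokenizer's values.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """BERT-style masking: select ~15% of positions as prediction targets,
    replace 80% of those with [MASK], 10% with a random token, and leave
    10% unchanged."""
    input_ids = input_ids.clone()  # avoid mutating the caller's batch
    labels = input_ids.clone()

    # Pick which positions the model must predict.
    target = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~target] = -100  # -100 is ignored by torch.nn.CrossEntropyLoss

    # 80% of the targets are replaced with the [MASK] token.
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & target
    input_ids[masked] = mask_token_id

    # Half of the remaining targets (10% overall) get a random token.
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & target & ~masked
    input_ids[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]

    # The last 10% of targets keep their original token; the model
    # still has to predict them via the labels tensor.
    return input_ids, labels
```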

Since <tok4> represents important information, 'downsampling' (i.e., removing those frequent tokens) is not preferred; I would like to keep my sequences as intact as possible.

How best should I deal with this? Is there an established method that counters this problem?

neel g

1 Answer


The goal of language modeling is to build a statistical model of how language is used in a specific context. One of the important components of that is token frequency.

Bias can mean many things in machine learning. I think you mean bias in the sense of a high chance of prediction. That kind of bias is useful in language modeling: if <tok4> appears frequently, a useful language model will capture that property.
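
To make that concrete, here is a small illustrative sketch using the toy sequence from your question; the empirical unigram frequencies it prints are the marginal probabilities a well-fit model should reproduce:

```python
from collections import Counter

# Toy sequence from the question: <tok4> makes up 3 of the 6 tokens.
corpus = ["tok1", "tok1", "tok4", "tok7", "tok4", "tok4"]

counts = Counter(corpus)
total = sum(counts.values())
unigram = {tok: count / total for tok, count in counts.items()}

print(unigram)  # {'tok1': 0.33, 'tok4': 0.5, 'tok7': 0.17} (approximately)
```

A model that assigns <tok4> roughly a 50% marginal probability here is matching the data, which is exactly what we want it to learn.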

Brian Spiering