
I was checking the BERT GitHub page and noticed that there are new models built with a new training technique called "whole word masking". Here is a snippet describing it:

In the original pre-processing code, we randomly select WordPiece tokens to mask. For example:

Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head

Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same.

Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

I can't understand "we always mask all of the tokens corresponding to a word at once". "jumped", "phil", "##am", and "##mon" are masked, and I am not sure how these tokens are related.

clarity123
kee

1 Answer


phil ##am ##mon is a subword encoding of the single word "philammon" into three tokens. The comment just means that they mask words as opposed to tokens, taking the subword encoding into account.
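To make this concrete, here is a minimal sketch in Python (not the actual BERT pre-processing code) of whole word masking over WordPiece tokens: any piece starting with "##" is grouped with the piece before it, and the mask/no-mask decision is made once per word group, so a word like "philammon" is either fully masked or left alone. The per-word coin flip and the [MASK]-only replacement are simplifying assumptions; the real code targets a fixed ~15% token budget and uses BERT's 80/10/10 replacement scheme.

    import random

    def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
        """Mask whole words: every WordPiece of a selected word is masked together."""
        # Group token indices into words; a "##" prefix marks a continuation piece.
        word_groups = []
        for i, tok in enumerate(tokens):
            if tok.startswith("##") and word_groups:
                word_groups[-1].append(i)   # continuation of the previous word
            else:
                word_groups.append([i])     # start of a new word
        masked = list(tokens)
        for group in word_groups:
            if random.random() < mask_prob:
                for i in group:             # mask all pieces of this word at once
                    masked[i] = mask_token
        return masked

    tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
    print(" ".join(whole_word_mask(tokens)))
    # e.g. the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head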

For more on subword encodings, take a look at the slides from CS224n, especially Byte Pair Encoding, from the Feb 14 subwords lecture at http://web.stanford.edu/class/cs224n/index.html#schedule.
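Since that link has since broken (see the comment below), here is a minimal sketch of the byte pair encoding merge loop from Sennrich et al. (2016), which those slides cover: start from a character-level vocabulary and repeatedly merge the most frequent adjacent symbol pair. The toy vocabulary and the number of merges are illustrative only.

    import re
    import collections

    def get_stats(vocab):
        """Count frequencies of adjacent symbol pairs across the vocabulary."""
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_vocab(pair, vocab):
        """Replace every occurrence of the pair with its merged symbol."""
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Toy corpus: words pre-split into characters, with an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_vocab(best, vocab)
        print(best)                        # the merge learned at this step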

BookYourLuck
  • I think it's not available anymore. Where to see it? – orkenstein Apr 12 '22 at 13:46
  • That link is broken but here is a good explanation: https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0 – dashesy Aug 07 '22 at 00:22