
I was checking the BERT GitHub page and noticed that there are new models built with a new training technique called "whole word masking". Here is a snippet describing it:

In the original pre-processing code, we randomly select WordPiece tokens to mask. For example:

Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head

Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same.

Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

I can't understand "we always mask all of the tokens corresponding to a word at once". "jumped", "phil", "##am", and "##mon" are masked, and I am not sure how these tokens are related.

clarity123
kee

1 Answer


phil ##am ##mon is a subword encoding of the single word "philammon" into three tokens. The comment just means that they mask words as opposed to tokens, taking the subword encoding into account.
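To make this concrete, here is a minimal sketch in Python (not the actual BERT pre-processing code) of whole word masking over WordPiece tokens: any piece starting with "##" is grouped with the piece before it, and the mask/no-mask decision is made once per word group, so a word like "philammon" is either fully masked or left alone. The per-word coin flip and the [MASK]-only replacement are simplifying assumptions; the real code targets a fixed ~15% token budget and uses BERT's 80/10/10 replacement scheme.

    import random

    def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
        """Mask whole words: every WordPiece of a selected word is masked together."""
        # Group token indices into words; a "##" prefix marks a continuation piece.
        word_groups = []
        for i, tok in enumerate(tokens):
            if tok.startswith("##") and word_groups:
                word_groups[-1].append(i)   # continuation of the previous word
            else:
                word_groups.append([i])     # start of a new word
        masked = list(tokens)
        for group in word_groups:
            if random.random() < mask_prob:
                for i in group:             # mask all pieces of this word at once
                    masked[i] = mask_token
        return masked

    tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
    print(" ".join(whole_word_mask(tokens)))
    # e.g. the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head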

For more on subword encodings, take a look at the slides from CS224n, especially Byte Pair Encoding, from the Feb 14 subwords lecture at http://web.stanford.edu/class/cs224n/index.html#schedule.
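Since that link has since broken (see the comment below), here is a minimal sketch of the byte pair encoding merge loop from Sennrich et al. (2016), which those slides cover: start from a character-level vocabulary and repeatedly merge the most frequent adjacent symbol pair. The toy vocabulary and the number of merges are illustrative only.

    import re
    import collections

    def get_stats(vocab):
        """Count frequencies of adjacent symbol pairs across the vocabulary."""
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_vocab(pair, vocab):
        """Replace every occurrence of the pair with its merged symbol."""
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Toy corpus: words pre-split into characters, with an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_vocab(best, vocab)
        print(best)                        # the merge learned at this step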

BookYourLuck
  • I think it's not available anymore. Where to see it? – orkenstein Apr 12 '22 at 13:46
  • That link is broken but here is a good explanation: https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0 – dashesy Aug 07 '22 at 00:22