
When BERT is used for masked language modeling, it masks a token and then tries to predict it.

What are the candidate tokens BERT can choose from? Does it just predict an integer (like a regression problem) and then use that token? Or does it do a softmax over all possible word tokens? For the latter, isn't there an enormous number of possible tokens? I have a hard time imagining that BERT treats it as a classification problem where the number of classes equals the number of possible word tokens.

From where does BERT get the token it predicts?

Nick Koprowicz

1 Answer


There is a token vocabulary, that is, the set of all possible tokens that can be handled by BERT. You can find the vocabulary used by one of the variants of BERT (BERT-base-uncased) here.

You can see that it contains one token per line, with a total of 30522 tokens. The softmax is computed over them.
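To make this concrete, here is a minimal sketch using the Hugging Face `transformers` library (not mentioned in the answer, but a common way to run BERT): the MLM head outputs one score per vocabulary entry, and a softmax over those 30522 scores gives the probability distribution from which the predicted token is taken.

```python
# Minimal sketch: BERT's MLM head scores every entry of its 30522-token
# vocabulary, and the prediction is the argmax of the softmax over them.
# Assumes the Hugging Face `transformers` library and PyTorch.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

print(len(tokenizer.vocab))  # 30522

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, 30522)

# Position of the [MASK] token and the distribution over the whole vocabulary.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
probs = torch.softmax(logits[0, mask_pos], dim=-1)
print(tokenizer.decode([probs.argmax().item()]))  # e.g. "paris"
```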

The token granularity in the BERT vocabulary is subwords. This means that a token does not necessarily represent a complete word; it may be just a piece of a word. Before text is fed into BERT, it has to be segmented into subwords according to the subword vocabulary mentioned above. Having a subword vocabulary instead of a word-level vocabulary is what allows BERT (and any other subword-based model) to represent any string (within the character set seen in the training data) with only a "small" vocabulary.
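As an illustration of that segmentation step, here is a small sketch, again assuming the `transformers` tokenizer; the exact split of any given word depends on the vocabulary:

```python
# Sketch of WordPiece segmentation: a word missing from the vocabulary is
# broken into subword pieces, where "##" marks the continuation of a word.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unbelievability"))
# Prints a list of subword pieces (the exact split depends on the vocabulary),
# each of which is one of the 30522 entries the softmax is computed over.
```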

noe
  • I'm surprised that it's possible to do a softmax over so many possible outcomes with any level of accuracy. But I suppose they're able to accomplish it because the set of training data is so large? – Nick Koprowicz Nov 17 '20 at 03:57
  • Yes, as long as you have enough data, 30k elements in the softmax work well. For larger vocabularies (e.g. 100k), people often use [adaptive softmax](https://arxiv.org/abs/1609.04309). Smaller training datasets lead to data sparsity, and this affects both the input embeddings and the output projection+softmax. – noe Nov 17 '20 at 09:23
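For reference, a minimal sketch of the adaptive softmax idea using PyTorch's built-in `nn.AdaptiveLogSoftmaxWithLoss`; the hidden size and cluster cutoffs below are illustrative values only, not anything BERT itself uses:

```python
# Sketch of adaptive softmax for a large (e.g. 100k) vocabulary: frequent
# tokens go through a full-size head, rare tokens through smaller clusters.
# The sizes below are illustrative, not taken from BERT.
import torch
import torch.nn as nn

vocab_size = 100_000
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=768,              # hidden size of the model
    n_classes=vocab_size,
    cutoffs=[2_000, 20_000],      # boundaries between frequency clusters
)

hidden = torch.randn(32, 768)                  # a batch of hidden states
targets = torch.randint(0, vocab_size, (32,))  # gold token ids
out = adaptive(hidden, targets)
print(out.loss)  # negative log-likelihood computed without a full 100k softmax
```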