
What is the general tradeoff when choosing between BPE and WordPiece tokenization? When is one preferable to the other? Are there any differences in model performance between the two? I'm looking for a general overall answer, backed up with specific examples.

Shayan Shafiq
vgoklani
    You can find the algorithmic difference [here](https://stackoverflow.com/a/55416944/674487). In practical terms, their main difference is that BPE places the `@@` at the end of tokens while wordpieces place the `##` at the beginning. The main performance difference usually comes not from the algorithm, but the specific implementation, e.g. [sentencepiece](https://github.com/google/sentencepiece) offers a very fast C++ implementation of BPE. You can find fast Rust implementations of both in Hugginface's [tokenizers](https://github.com/huggingface/tokenizers). – noe Jun 02 '20 at 15:13
  • I made an answer out of my previous comment. – noe Mar 25 '21 at 02:09

3 Answers


(This answer was originally a comment)

You can find the algorithmic difference [here](https://stackoverflow.com/a/55416944/674487). In practical terms, their main difference is that BPE places the @@ at the end of tokens while WordPiece places the ## at the beginning. The main performance difference usually comes not from the algorithm, but from the specific implementation, e.g. [sentencepiece](https://github.com/google/sentencepiece) offers a very fast C++ implementation of BPE. You can find fast Rust implementations of both in Hugging Face's [tokenizers](https://github.com/huggingface/tokenizers).

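As a quick illustration of the two marker conventions mentioned above (the subword splits below are made up for the example, not produced by a real trained tokenizer):

```python
# Illustrative segmentations only; a real tokenizer's splits will differ.

# BPE as in subword-nmt: a trailing "@@" means the token continues into the next piece.
bpe_tokens = ["un@@", "believ@@", "able"]

# WordPiece as in BERT: a leading "##" means the token continues the previous piece.
wordpiece_tokens = ["un", "##believ", "##able"]

def detokenize_bpe(tokens):
    """Undo subword-nmt style BPE splitting: strip '@@' and glue to the next token."""
    return "".join(t[:-2] if t.endswith("@@") else t + " " for t in tokens).strip()

def detokenize_wordpiece(tokens):
    """Undo WordPiece splitting: '##' pieces are glued onto the previous token."""
    words = []
    for t in tokens:
        if t.startswith("##") and words:
            words[-1] += t[2:]
        else:
            words.append(t)
    return " ".join(words)

print(detokenize_bpe(bpe_tokens))              # unbelievable
print(detokenize_wordpiece(wordpiece_tokens))  # unbelievable
```
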
noe
  • I don't disagree with your response, but the point of the question is how to choose between the two when building a model. More specifically, my models get better performance with WordPiece than with BPE, and my colleagues have similar results. – vgoklani Nov 30 '21 at 16:26

Adding more info to noe's answer:

The difference between BPE and WordPiece lies in how the symbol pairs are chosen for addition to the vocabulary. Instead of relying on the raw frequency of the pairs, WordPiece chooses the pair that maximises the likelihood of the training data. In other words, it evaluates a language model over the current vocabulary and, at each step, merges the pair of existing symbols whose merge increases the likelihood of the training data the most. The merged symbol is added to the vocabulary, the model is updated on the new vocabulary, and these steps are repeated until the desired vocabulary size is reached.

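To make the selection rule concrete, here is a toy sketch of a single merge step under each criterion. The word frequencies are the illustrative "hug/pug/pun/bun/hugs" counts from the Hugging Face tokenizer summary, and the scoring is simplified; a real trainer would apply the merge, update the splits, and repeat.

```python
from collections import Counter

# Illustrative word frequencies (the "hug/pug/pun/bun/hugs" toy corpus from the
# Hugging Face tokenizer summary); every word starts split into single characters.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {w: list(w) for w in word_freqs}

pair_counts, symbol_counts = Counter(), Counter()
for word, freq in word_freqs.items():
    symbols = splits[word]
    for s in symbols:
        symbol_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# BPE criterion: merge the most frequent adjacent pair.
bpe_pick = max(pair_counts, key=pair_counts.get)

# WordPiece criterion: merge the pair that most increases the likelihood of the
# training data, i.e. the pair maximising count(ab) / (count(a) * count(b)).
wordpiece_pick = max(
    pair_counts,
    key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]),
)

print("BPE merges:      ", bpe_pick)        # ('u', 'g')  -- highest raw count
print("WordPiece merges:", wordpiece_pick)  # ('g', 's')  -- rarer constituent symbols
```

The raw-frequency criterion favours pairs containing very common symbols like "u", whereas the likelihood ratio discounts pairs whose parts are already frequent on their own.
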
Abhi25t

In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

So what does this mean exactly? Referring to the example in the linked summary, maximizing the likelihood of the training data is equivalent to finding the symbol pair whose probability, divided by the product of the probabilities of its first and second symbols, is the greatest among all symbol pairs. E.g. "u" followed by "g" would only have been merged if the probability of "ug" divided by the probabilities of "u" and "g" had been greater than for any other symbol pair. Intuitively, WordPiece differs slightly from BPE in that it evaluates what it loses by merging two symbols, to ensure it's worth it.

From: https://huggingface.co/docs/transformers/tokenizer_summary

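Stated as a formula (added here as a summary of the criterion above, not quoted from the linked page), WordPiece merges the pair $(a, b)$ with the highest score

$$\mathrm{score}(a, b) \;=\; \frac{P(ab)}{P(a)\,P(b)} \;\approx\; \frac{\mathrm{count}(ab)}{\mathrm{count}(a)\cdot\mathrm{count}(b)},$$

so "u" and "g" are merged only if this ratio for "ug" beats that of every other candidate pair, whereas BPE simply maximises $\mathrm{count}(ab)$.
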
Yash