9

Is BERT a language model in the sense of a function that gets a sentence and returns a probability? I know its main usage is sentence embedding, but can it also provide this functionality?

Amit Keinan

2 Answers

15

No, BERT is not a traditional language model. It is trained with a masked language modeling loss, and it cannot be used to compute the probability of a sentence the way a normal LM can.

A normal LM uses an autoregressive factorization of the probability of the sentence:

$P(s) = \prod_t P(w_t \mid w_{<t})$
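For instance, with an autoregressive model such as GPT-2, this product can be computed directly from the model's per-token log-probabilities. A minimal sketch using the Hugging Face transformers library (the model choice and the helper function are just illustrative):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence):
    # Score the sentence autoregressively: each token is predicted
    # from the tokens to its left, as in the factorization above.
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # over the predicted tokens (all tokens except the first).
        loss = model(input_ids, labels=input_ids).loss
    # Undo the averaging to get the summed log-probability of the sentence.
    return -loss.item() * (input_ids.shape[1] - 1)

print(sentence_log_prob("The cat sat on the mat."))
```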

On the other hand, BERT's masked LM loss focuses on the probability of the (masked) token at a specific position given the rest of the unmasked tokens in the sentence.

Therefore, it makes no sense to take the token probabilities generated by BERT and multiply them to obtain a sentence-level probability.


A secondary issue is that BERT's tokenization is subword-level, so even if it made sense to compute a sentence-level probability with BERT, such a probability would not be comparable with that of a word-level LM, as we would not be taking into account all possible segmentations of words into subwords.


UPDATE: there is a newer technique called Masked Language Model Scoring (Salazar et al., ACL 2020) that enables precisely what the OP asked for. From the article:

To score a sentence, one creates copies with each token masked out. The log probability for each missing token is summed over copies to give the pseudo-log-likelihood score (PLL).

So the answer is now YES. It is possible to score a sentence using BERT, by means of the described pseudo-log-likelihood score.
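For illustration, a minimal sketch of this pseudo-log-likelihood scoring with the Hugging Face transformers library might look as follows (bert-base-uncased and the helper function name are my own choices, not prescribed by the paper):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    with torch.no_grad():
        # Mask one token at a time (skipping [CLS] and [SEP]) and sum the
        # log-probability BERT assigns to the original token at that position.
        for i in range(1, input_ids.shape[0] - 1):
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

print(pseudo_log_likelihood("The cat sat on the mat."))
```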

noe
1

Although the previous answer is a good reference for how to measure the probability of a sentence using BERT, for a meaningful cross-model comparison (e.g., BERT vs. RoBERTa) the models should use the same tokenization.

Amir