9

Is BERT a language model in the sense of a function that gets a sentence and returns a probability? I know its main usage is sentence embedding, but can it also provide this functionality?

Amit Keinan

2 Answers

15

No, BERT is not a traditional language model. It is trained with a masked language modeling loss, and it cannot be used to compute the probability of a sentence the way a normal LM can.

A normal LM uses an autoregressive factorization of the probability of the sentence:

$P(s) = \prod_t P(w_t \mid w_{<t})$
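For instance, with an autoregressive model such as GPT-2, this product can be computed directly from the model's per-token log-probabilities. A minimal sketch using the Hugging Face transformers library (the model choice and the helper function are just illustrative):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence):
    # Score the sentence autoregressively: each token is predicted
    # from the tokens to its left, as in the factorization above.
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # over the predicted tokens (all tokens except the first).
        loss = model(input_ids, labels=input_ids).loss
    # Undo the averaging to get the summed log-probability of the sentence.
    return -loss.item() * (input_ids.shape[1] - 1)

print(sentence_log_prob("The cat sat on the mat."))
```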

On the other hand, BERT's masked LM loss focuses on the probability of the (masked) token at a specific position given the rest of the unmasked tokens in the sentence.

Therefore, it makes no sense to take the token probabilities generated by BERT and multiply them to obtain a sentence-level probability.


A secondary issue is that BERT's tokenization is subword-level, so even if it made sense to compute a sentence-level probability with BERT, such a probability would not be comparable with that of a word-level LM, as we would not be taking into account all possible segmentations of words into subwords.


UPDATE: there is a newer technique called Masked Language Model Scoring (Salazar et al., ACL 2020) that enables precisely what the OP asked for. From the article:

To score a sentence, one creates copies with each token masked out. The log probability for each missing token is summed over copies to give the pseudo-log-likelihood score (PLL).

So the answer is now YES. It is possible to score a sentence using BERT, by means of the described pseudo-log-likelihood score.
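For illustration, a minimal sketch of this pseudo-log-likelihood scoring with the Hugging Face transformers library might look as follows (bert-base-uncased and the helper function name are my own choices, not prescribed by the paper):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    with torch.no_grad():
        # Mask one token at a time (skipping [CLS] and [SEP]) and sum the
        # log-probability BERT assigns to the original token at that position.
        for i in range(1, input_ids.shape[0] - 1):
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

print(pseudo_log_likelihood("The cat sat on the mat."))
```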

noe
1

Although the previous answer is a good reference for how to measure the probability of a sentence using BERT, for a meaningful cross-model comparison (e.g., BERT vs. RoBERTa) the models should use the same tokenization.

Amir