I am trying to understand perplexity as a metric in Natural Language Processing more fully, and I am doing so by building manual examples to understand all of the component parts. Is the following correctly understood?
Given a list $W$ of words $w_1 \ldots w_N$, where we know the probability of each individual word, a model will still have to compute the intersections (joint probabilities) between words for the following formula to be useful:
$$ P(W)=P\left(w_1\right) P\left(w_2 \mid w_1\right) P\left(w_3 \mid w_2, w_1\right) \ldots P\left(w_N \mid w_{N-1}, \ldots, w_1\right) $$
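For concreteness, here is the kind of manual example I have been working through, with entirely made-up conditional probabilities for a three-word sequence (the specific numbers mean nothing, they are just assumptions for illustration):

```python
# Made-up probabilities for the three-word sequence "the cat sat"
# (every number here is assumed, purely for illustration)
p_w1 = 0.2              # P("the")
p_w2_given_w1 = 0.1     # P("cat" | "the")
p_w3_given_w1_w2 = 0.3  # P("sat" | "the", "cat")

# Chain rule: joint probability of the whole sequence
p_W = p_w1 * p_w2_given_w1 * p_w3_given_w1_w2

# Perplexity of the sequence: P(W) ** (-1/N) for N words
N = 3
perplexity = p_W ** (-1 / N)

print(f"P(W) = {p_W:.4f}")               # 0.0060
print(f"Perplexity = {perplexity:.2f}")  # ~5.50
```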
Since the formula for conditional probability is given by:
$P(A \mid B)=\frac{P(A \cap B)}{P(B)}$
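As a sanity check on that definition, here is the same idea with made-up numbers, reusing the $P(\text{cat} \mid \text{the}) = 0.1$ value from the example above:

```python
# Tiny numeric check of the conditional-probability definition
# A = "cat" occurring right after B = "the"
p_A_and_B = 0.02  # P("the cat") as a joint probability (assumed)
p_B = 0.2         # P("the") (assumed)

p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)  # ~0.1, the value I used for P("cat" | "the") above
```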
And the intersection $P(A \cap B)$ would, in the NLP setting, be calculated by the model via its cross-entropy loss.
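This last part is the piece I am least sure about and would most like confirmed or corrected. My current (possibly mistaken) picture of how cross-entropy and perplexity fit together numerically is the sketch below, using the same made-up per-token probabilities as above:

```python
import math

# Per-token probabilities the model assigns to each observed next word.
# In a real model these would come from the softmax output; here they are
# the same made-up numbers as in the chain-rule example above.
token_probs = [0.2, 0.1, 0.3]

# Cross-entropy loss = average negative log-probability of the observed tokens
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is exp(cross-entropy), which works out to the same value as
# P(W) ** (-1/N) from the chain-rule product above
perplexity = math.exp(cross_entropy)
print(round(cross_entropy, 3), round(perplexity, 2))  # ~1.705, ~5.5
```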