I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA.

from gensim.models.phrases import Phrases, Phraser

# 7k documents, ~500-1k tokens each. Already ran cleanup, stop-word removal, lemmatization, etc.
docs = get_docs()

phrases = Phrases(docs)           # learn bigram statistics from the corpus
bigram = Phraser(phrases)         # freeze into a lighter, faster applicator
docs = [bigram[d] for d in docs]  # merge detected bigrams into single tokens

Phrases has min_count=5, threshold=10. I don't quite understand how they interact; they seem related. In any case, I see threshold taking values ranging from 1 to 1000 across different tutorials, and it's described as important in determining the number of bigrams generated. I can't find an explanation of how to arrive at a decent value for one's purposes beyond "fiddle with it and see what works best for you". Is there any intuition or formula for choosing this value, something like "if you want x% more tokens added to your dictionary, use y" or "if your corpus size is x, try y"? I also see that scoring='default' can be set to 'npmi' instead. The linked paper says "t is a chosen threshold, typically around 10e-5". Might that be a decent approach if I just want this to work "good enough" without needing to fiddle much? That is, phrases = Phrases(docs, scoring='npmi', threshold=10e-5).
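For reference, my reading of the two scorers from gensim's source (paraphrased by me, so treat the exact formulas as an assumption on my part):

import math

# Default scorer (gensim's original_scorer): a candidate bigram (a, b)
# is kept when this score exceeds `threshold`. The len_vocab factor and
# rare words (small count_a * count_b) push the score up, which may be
# why tutorial values vary so much.
def default_score(count_a, count_b, count_ab, len_vocab, min_count):
    return (count_ab - min_count) / (count_a * count_b) * len_vocab

# NPMI scorer (gensim's npmi_scorer): normalized PMI, bounded to [-1, 1],
# so `threshold` sits on a fixed scale regardless of corpus size.
def npmi_score(count_a, count_b, count_ab, corpus_word_count):
    pa = count_a / corpus_word_count
    pb = count_b / corpus_word_count
    pab = count_ab / corpus_word_count
    return math.log(pab / (pa * pb)) / -math.log(pab)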

TL;DR: is there a simple or intuitive way to choose a decent threshold (e.g., based on corpus size)? Alternatively, would scoring='npmi', threshold=10e-5 be simpler?
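In case it helps, this is the kind of fiddling I can already do: sweep threshold and count how many distinct bigrams actually get merged (illustrative sketch; the '_' check assumes the default delimiter and that my original tokens contain no underscores):

from gensim.models.phrases import Phrases, Phraser

docs = get_docs()  # same preprocessed corpus as above

for threshold in [1, 10, 100, 1000]:
    bigram = Phraser(Phrases(docs, min_count=5, threshold=threshold))
    merged = {tok for d in docs for tok in bigram[d] if '_' in tok}
    print(f'threshold={threshold}: {len(merged)} distinct bigrams merged')

What I'm missing is a principled way to pick among those outputs.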

lefnire

1 Answer


Since min_count and threshold are hyperparameters, better values can be found through cross-validation: evaluate a range of values and empirically pick the combination that gives the highest performance on a validation set.
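For example, here is a rough sketch of such a search, using topic coherence of a downstream LDA model as the validation metric (the metric, the value grids, and num_topics are illustrative assumptions; substitute whatever score matters for your pipeline):

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel
from gensim.models.phrases import Phrases, Phraser

docs = get_docs()
best = None

for min_count in [5, 10, 20]:
    for threshold in [1, 10, 100]:
        # Build bigrams with this candidate setting and apply them
        bigram = Phraser(Phrases(docs, min_count=min_count, threshold=threshold))
        phrased = [bigram[d] for d in docs]

        # Fit the downstream model and score it
        dictionary = Dictionary(phrased)
        corpus = [dictionary.doc2bow(d) for d in phrased]
        lda = LdaModel(corpus, id2word=dictionary, num_topics=20)
        score = CoherenceModel(model=lda, texts=phrased, dictionary=dictionary,
                               coherence='c_v').get_coherence()

        if best is None or score > best[0]:
            best = (score, min_count, threshold)

print(f'best coherence={best[0]:.3f} at min_count={best[1]}, threshold={best[2]}')

Ideally you would score on a held-out validation split rather than the full corpus, so the chosen values generalize.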

Brian Spiering