I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA.
from gensim.models.phrases import Phrases, Phraser
# 7k documents, ~500-1k tokens each. Already ran cleanup, stop-word removal, lemmatization, etc.
docs = get_docs()
phrases = Phrases(docs)  # defaults: min_count=5, threshold=10, scoring='default'
bigram = Phraser(phrases)  # lighter-weight object for applying the detected phrases
docs = [bigram[d] for d in docs]
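
For context, here's my (possibly wrong) reading of the default scorer from the docs, which is why the two parameters seem entangled to me: min_count is subtracted from the raw bigram count before the score is compared against threshold. A toy re-implementation, my own sketch rather than gensim code:

# My sketch of the default (Mikolov-style) scorer as I read the docs:
# score = (bigram_count - min_count) * vocab_size / (worda_count * wordb_count)
# and the bigram is kept when score > threshold.
def default_score(worda_count, wordb_count, bigram_count,
                  vocab_size, min_count=5):
    return (bigram_count - min_count) / (worda_count * wordb_count) * vocab_size

# Toy numbers: "new" appears 300x, "york" 150x, "new york" 40x, 20k-word vocab.
print(default_score(300, 150, 40, 20_000))  # ~15.6: kept at threshold=10, dropped at 20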
Phrases defaults to min_count=5, threshold=10. I don't quite understand how these two interact; they seem related. Tutorials use threshold values anywhere from 1 to 1000 and describe it as the main knob controlling how many bigrams get generated, but I can't find an explanation of how to arrive at a decent value for one's purposes beyond "fiddle with it and see what works best for you". Is there any intuition or formula for choosing this value, something like "if you want x% more tokens added to your dictionary, use y", or "if your corpus size is x, try y"?

I also see that scoring='default' can be switched to 'npmi'. The linked paper says "t is a chosen threshold, typically around 10e-5". Might that be a decent approach if I just want this to work "good enough" without needing to fiddle much? That is, phrases = Phrases(docs, scoring='npmi', threshold=10e-5).
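
For what it's worth, my current "fiddling" looks like the sweep below: refit at a few candidate thresholds and count how many distinct bigram tokens come out, as a rough proxy for the "x% more tokens" figure I'm after. This assumes gensim 4.x-ish behavior where tokens are str and the default phrase delimiter is '_'; docs is the same token-list corpus as above.

from gensim.models.phrases import Phrases, Phraser

# Rough sweep: how many distinct bigram tokens does each threshold produce?
# Caveat: tokens that already contain '_' from preprocessing would inflate counts.
for t in (1, 5, 10, 50, 100, 500, 1000):
    bigram = Phraser(Phrases(docs, min_count=5, threshold=t))
    new_tokens = {tok for d in docs for tok in bigram[d] if '_' in tok}
    print(f"threshold={t}: {len(new_tokens)} distinct bigrams")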
TL;DR: is there a simple or intuitive way to choose a decent threshold (e.g., based on corpus size)? Alternatively, would scoring='npmi', threshold=10e-5 be the simpler route?