I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA.

from gensim.models.phrases import Phrases, Phraser

# 7k documents, ~500-1k tokens each. Already ran cleanup, stop-word removal, lemmatization, etc.
docs = get_docs()

phrases = Phrases(docs)           # learn bigram statistics from the corpus
bigram = Phraser(phrases)         # freeze into a lighter, faster applicator
docs = [bigram[d] for d in docs]  # merge detected bigrams into single tokens

Phrases has min_count=5, threshold=10. I don't quite understand how they interact; they seem related. In any case, I see threshold taking values ranging from 1 to 1000 across different tutorials, and it's described as important in determining the number of bigrams generated. I can't find an explanation of how to arrive at a decent value for one's purposes beyond "fiddle with it and see what works best for you". Is there any intuition or formula for choosing this value, something like "if you want x% more tokens added to your dictionary, use y" or "if your corpus size is x, try y"? I also see that scoring='default' can be set to 'npmi' instead. The linked paper says "t is a chosen threshold, typically around 10e-5". Might that be a decent approach if I just want this to work "good enough" without needing to fiddle much? That is, phrases = Phrases(docs, scoring='npmi', threshold=10e-5).
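For reference, my reading of the two scorers from gensim's source (paraphrased by me, so treat the exact formulas as an assumption on my part):

import math

# Default scorer (gensim's original_scorer): a candidate bigram (a, b)
# is kept when this score exceeds `threshold`. The len_vocab factor and
# rare words (small count_a * count_b) push the score up, which may be
# why tutorial values vary so much.
def default_score(count_a, count_b, count_ab, len_vocab, min_count):
    return (count_ab - min_count) / (count_a * count_b) * len_vocab

# NPMI scorer (gensim's npmi_scorer): normalized PMI, bounded to [-1, 1],
# so `threshold` sits on a fixed scale regardless of corpus size.
def npmi_score(count_a, count_b, count_ab, corpus_word_count):
    pa = count_a / corpus_word_count
    pb = count_b / corpus_word_count
    pab = count_ab / corpus_word_count
    return math.log(pab / (pa * pb)) / -math.log(pab)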

TL;DR: is there a simple or intuitive way to choose a decent threshold (e.g., based on corpus size)? Alternatively, would scoring='npmi', threshold=10e-5 be simpler?
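In case it helps, this is the kind of fiddling I can already do: sweep threshold and count how many distinct bigrams actually get merged (illustrative sketch; the '_' check assumes the default delimiter and that my original tokens contain no underscores):

from gensim.models.phrases import Phrases, Phraser

docs = get_docs()  # same preprocessed corpus as above

for threshold in [1, 10, 100, 1000]:
    bigram = Phraser(Phrases(docs, min_count=5, threshold=threshold))
    merged = {tok for d in docs for tok in bigram[d] if '_' in tok}
    print(f'threshold={threshold}: {len(merged)} distinct bigrams merged')

What I'm missing is a principled way to pick among those outputs.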

lefnire

1 Answer


Since min_count and threshold are hyperparameters, better values can be found through cross-validation: evaluate a range of values and empirically pick the combination that gives the highest performance on a validation set.
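For example, here is a rough sketch of such a search, using topic coherence of a downstream LDA model as the validation metric (the metric, the value grids, and num_topics are illustrative assumptions; substitute whatever score matters for your pipeline):

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel
from gensim.models.phrases import Phrases, Phraser

docs = get_docs()
best = None

for min_count in [5, 10, 20]:
    for threshold in [1, 10, 100]:
        # Build bigrams with this candidate setting and apply them
        bigram = Phraser(Phrases(docs, min_count=min_count, threshold=threshold))
        phrased = [bigram[d] for d in docs]

        # Fit the downstream model and score it
        dictionary = Dictionary(phrased)
        corpus = [dictionary.doc2bow(d) for d in phrased]
        lda = LdaModel(corpus, id2word=dictionary, num_topics=20)
        score = CoherenceModel(model=lda, texts=phrased, dictionary=dictionary,
                               coherence='c_v').get_coherence()

        if best is None or score > best[0]:
            best = (score, min_count, threshold)

print(f'best coherence={best[0]:.3f} at min_count={best[1]}, threshold={best[2]}')

Ideally you would score on a held-out validation split rather than the full corpus, so the chosen values generalize.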

Brian Spiering