
Are there any pre-trained models for finding similar word n-grams, where n>1?

FastText, for instance, seems to work only on unigrams:

from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)

[('dogs', 0.8463464975357056),
 ('puppy', 0.7873005270957947),
 ('pup', 0.7692237496376038),
 ('canine', 0.7435278296470642),
 ...

but it fails on longer n-grams:

model.nearest_neighbors('Gone with the Wind', k=2000)

[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
  0.71047443151474),

or

model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
 ('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo',
  0.5197194218635559),
  • Out of curiosity, what is your use case? What are you trying to achieve? – Valentin Calomme Jan 09 '21 at 10:52
  • @ValentinCalomme My use case was to find aliases for a given movie or similar entities. I was expecting "Star Wars" to result in terms like "Star Wars trilogy", "Star Wars New Hope", but also completely different movies, because I thought movie titles in certain corpora can appear in similar contexts, e.g., "I love watching (movie title)!". – dzieciou Jan 10 '21 at 21:08
  • One more note: I wanted the model to be the source of aliases, the vocabulary for a given movie title, rather than a way to say that two phrases are similar. – dzieciou Jan 10 '21 at 21:10

1 Answer


First off, there aren't, to my knowledge, any models trained specifically to generate n-gram embeddings. That said, it would be fairly easy to modify the word2vec training pipeline to accommodate n-grams, for example by merging frequent n-grams into single tokens before training, as in the sketch below.
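A minimal sketch of that idea (my own illustration, not from the question): use gensim's Phrases to merge frequent bigrams into single tokens, then train word2vec on the merged corpus so that n-grams like "star_wars" get their own vectors. The toy corpus and parameter values are placeholders.

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Toy corpus of tokenized sentences (placeholder data)
sentences = [
    ['star', 'wars', 'is', 'a', 'classic'],
    ['i', 'love', 'watching', 'star', 'wars'],
    ['gone', 'with', 'the', 'wind', 'is', 'a', 'classic'],
]

# Detect frequent bigrams and rewrite the corpus: 'star', 'wars' -> 'star_wars'
bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
merged = [bigram[sentence] for sentence in sentences]

# Train word2vec on the merged corpus; the merged bigrams now get their own embeddings
model = Word2Vec(merged, vector_size=100, window=5, min_count=1)
print(model.wv.most_similar('star_wars', topn=5))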

Now, what can you do?

You could compute the n-gram embedding by summing the individual word embeddings. Optionally, you can weight the words, for instance by tf-idf, but that isn't required. Once you have a single embedding, simply find nearest neighbors using cosine distance.
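A hedged sketch of that approach (the helper names phrase_vector and most_similar_phrases are my own, and the vector file path is only an example): sum the unigram vectors of a phrase, optionally weighted, and rank candidate phrases you supply yourself by cosine similarity.

import numpy as np
from gensim.models import KeyedVectors

# Load pretrained unigram vectors (example path; any word2vec-format file works,
# e.g. the .vec text release of the fastText cc.en.300 vectors)
kv = KeyedVectors.load_word2vec_format('cc.en.300.vec')

def phrase_vector(phrase, weights=None):
    # Sum the word vectors of a phrase; `weights` could hold tf-idf weights
    words = [w for w in phrase.lower().split() if w in kv]
    if not words:
        return None
    if weights is None:
        weights = {}
    return np.sum([weights.get(w, 1.0) * kv[w] for w in words], axis=0)

def most_similar_phrases(query, candidate_phrases, k=5):
    # Rank caller-supplied candidate phrases by cosine similarity to the query
    q = phrase_vector(query)
    scored = []
    for cand in candidate_phrases:
        v = phrase_vector(cand)
        if v is None:
            continue
        cos = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((cand, cos))
    return sorted(scored, key=lambda pair: -pair[1])[:k]

print(most_similar_phrases('Star Wars', ['Star Wars trilogy', 'Gone with the Wind', 'dog']))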

Another approach, though more computationally expensive, would be to compute the Earth Mover's Distance (also called the Wasserstein distance) between n-grams and find nearest neighbors that way.
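gensim exposes this as Word Mover's Distance on word vectors via wmdistance (it needs an optional solver dependency such as POT/pyemd installed). A small sketch reusing the kv vectors and the hypothetical candidate list from above; lower distance means more similar:

def nearest_by_wmd(query, candidate_phrases, k=5):
    # Word Mover's Distance between token lists; smaller distance = more similar
    q_tokens = query.lower().split()
    scored = [(cand, kv.wmdistance(q_tokens, cand.lower().split()))
              for cand in candidate_phrases]
    return sorted(scored, key=lambda pair: pair[1])[:k]

print(nearest_by_wmd('Star Wars', ['Star Wars trilogy', 'Gone with the Wind']))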

    Thank you. I will look into that, although my use case was a bit different. I wanted to extract aliases from a model for a given phrase, rather than use the model to assess whether two phrases are similar. The latter required acquiring both phrases from another source. Anyway, your answer helped me word my problem better :-) – dzieciou Jan 10 '21 at 21:13