Questions tagged [gensim]

Gensim is a Python library for topic modelling. It computes multi-dimensional vector representations of words or sentences that preserve semantic meaning, via its word2vec and doc2vec models.

103 questions
35 votes • 6 answers

How do I load FastText pretrained model with Gensim?

I tried to load fastText pretrained model from here Fasttext model. I am using wiki.simple.en from gensim.models.keyedvectors import KeyedVectors word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True) But, it shows the…
Sabbiu Shah • 733
21 votes • 4 answers

How to initialize a new word2vec model with pre-trained model weights?

I am using Gensim Library in python for using and training word2vector model. Recently, I was looking at initializing my model weights with some pre-trained word2vec model such as (GoogleNewDataset pretrained model). I have been struggling with it…
Nomiluks • 461
17 votes • 3 answers

Word2Vec how to choose the embedding size parameter

I'm running word2vec over collection of documents. I understand that the size of the model is the number of dimensions of the vector space that the word is embedded into. And that different dimensions are somewhat related to different, independent…
Neil • 257
16 votes • 5 answers

Number of epochs in Gensim Word2Vec implementation

There's an iter parameter in the gensim Word2Vec implementation class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001, sg=1, hs=1,…
alvas • 2,340
16 votes • 3 answers

Doc2vec(gensim) - How can I infer unseen sentences’ label?

https://radimrehurek.com/gensim/models/doc2vec.html For example, if we have trained doc2vec with "aaaaaAAAAAaaaaaa" - "label 1" “bbbbbbBBBBBbbbb" - "label 2" can we infer “aaaaAAAAaaaaAA” is label 1 using Doc2vec? I know Doc2vec can train word…
Seongho • 163
8 votes • 1 answer

Difference between Gensim word2vec and keras Embedding layer

I used the gensim word2vec package and Keras Embedding layer for various different projects. Then I realize they seem to do the same thing, they all try to convert a word into a feature vector. Am I understanding this properly? What exactly is the…
Edamame • 2,705
8 votes • 1 answer

Gensim LDA model: return keywords based on relevance (λ - lambda) value

I am using the gensim library for topic modeling, more specifically LDA. I created my corpus, my dictionary, and my LDA model. With the help of the pyLDAvis library I visualized the results. When I print the words with the highest probability on…
5 votes • 2 answers

Why is averaging the vectors required in word2vec?

While implementing word2vec using gensim by following few tutorials online, one thing that I couldn't understand is the reason why word vectors are averaged once the model is trained. Few example links…
mockash • 163
5 votes • 1 answer

How to choose threshold for gensim Phrases when generating bigrams?

I'm generating bigrams with from gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim.LDA from gensim.models.phrases import Phrases, Phraser # 7k documents, ~500-1k tokens each. Already ran cleanup, stop_words, lemmatization,…
lefnire • 151
5 votes • 4 answers

How to train an existing word2vec gensim model on new words?

According to gensim docs, you can take an existing word2vec model and further train it on new words. The training is streamed, meaning sentences can be a generator, reading input data from disk on the fly, without loading the entire corpus into…
tim_xyz • 177
5 votes • 2 answers

can I use public pretrained word2vec, and continue train it for domain specific text?

I have a set of reviews from the apparel domain, about 100K reviews (2M words), and I want to train word2vec to do some cool NLP stuff with it. However, that size is not enough to create an adequate word2vec model, which requires billions of words. So the…
5 votes • 1 answer

Doc2vec to calculate cosine similarity - absolutely inaccurate

I'm trying to modify the Doc2vec tutorial to calculate cosine similarity and take Pandas dataframes instead of .txt documents. I want to find the most similar sentence to a new sentence I put in from my data. However, after training, even if I give…
lte__ • 1,310
4 votes • 2 answers

Does spaCy support multiple GPUs?

I was wondering if spaCy supports multi-GPU via mpi4py? I am currently using spaCy's nlp.pipe for Named Entity Recognition on a high-performance-computing cluster that supports the MPI protocol and has many GPUs. It says here that I would need to…
Jinhua Wang • 163
4 votes • 1 answer

Predicting the missing word using fasttext pretrained word embedding models (CBOW vs skipgram)

I am trying to implement a simple word prediction algorithm for filling a gap in a sentence by choosing from several options: Driving a ---- is not fun in London streets. Apple Car Book King With the right model in place: Question 1. What…
Kingstar • 53
4 votes • 1 answer

word2vec word embeddings creates very distant vectors, closest cosine similarity is still very far, only 0.7

I started using gensim's FastText to create word embeddings on a large corpus of a specialized domain (after finding that existing open source embeddings are not performing well on this domain), although I'm not using its character level n-grams, so…