Gensim is a Python library for topic modelling. Its word2vec and doc2vec models compute multi-dimensional vector representations of words and documents that preserve semantic meaning.
Questions tagged [gensim]
103 questions
35
votes
6 answers
How do I load FastText pretrained model with Gensim?
I tried to load a pretrained fastText model from here: Fasttext model. I am using wiki.simple.en
from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('wiki.simple.bin', binary=True)
But, it shows the…
Sabbiu Shah
- 733
- 1
- 6
- 9
21
votes
4 answers
How to initialize a new word2vec model with pre-trained model weights?
I am using the Gensim library in Python to use and train word2vec models. Recently, I was looking at initializing my model weights with a pre-trained word2vec model, such as the GoogleNews pretrained model. I have been struggling with it…
Nomiluks
- 461
- 1
- 4
- 9
17
votes
3 answers
Word2Vec how to choose the embedding size parameter
I'm running word2vec over a collection of documents. I understand that the size parameter is the number of dimensions of the vector space that words are embedded into, and that different dimensions are somewhat related to different, independent…
Neil
- 257
- 1
- 2
- 8
16
votes
5 answers
Number of epochs in Gensim Word2Vec implementation
There's an iter parameter in the gensim Word2Vec implementation
class gensim.models.word2vec.Word2Vec(sentences=None, size=100,
alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0,
seed=1, workers=1, min_alpha=0.0001, sg=1, hs=1,…
alvas
- 2,340
- 6
- 25
- 38
16
votes
3 answers
Doc2vec(gensim) - How can I infer unseen sentences’ label?
https://radimrehurek.com/gensim/models/doc2vec.html
For example, if we have trained doc2vec with
"aaaaaAAAAAaaaaaa" - "label 1"
"bbbbbbBBBBBbbbb" - "label 2"
can we infer "aaaaAAAAaaaaAA" is label 1 using Doc2vec?
I know Doc2vec can train word…
Seongho
- 163
- 1
- 6
8
votes
1 answer
Difference between Gensim word2vec and keras Embedding layer
I have used the gensim word2vec package and the Keras Embedding layer in various projects. Then I realized they seem to do the same thing: they both try to convert a word into a feature vector.
Am I understanding this properly? What exactly is the…
Edamame
- 2,705
- 5
- 23
- 32
8
votes
1 answer
Gensim LDA model: return keywords based on relevance (λ - lambda) value
I am using the gensim library for topic modeling, more specifically LDA. I created my corpus, my dictionary, and my LDA model. With the help of the pyLDAvis library I visualized the results. When I print the words with the highest probability on…
Tasos Lytos
- 81
- 3
5
votes
2 answers
Why is averaging the vectors required in word2vec?
While implementing word2vec with gensim by following a few tutorials online, one thing I couldn't understand is why word vectors are averaged once the model is trained. A few example links…
mockash
- 163
- 5
5
votes
1 answer
How to choose threshold for gensim Phrases when generating bigrams?
I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA
from gensim.models.phrases import Phrases, Phraser
# 7k documents, ~500-1k tokens each. Already ran cleanup, stop_words, lemmatization,…
lefnire
- 151
- 4
5
votes
4 answers
How to train an existing word2vec gensim model on new words?
According to gensim docs, you can take an existing word2vec model and further train it on new words.
The training is streamed, meaning sentences can be a generator,
reading input data from disk on the fly, without loading the entire
corpus into…
tim_xyz
- 177
- 1
- 1
- 11
5
votes
2 answers
Can I use a public pretrained word2vec model, and continue training it on domain-specific text?
I have a set of reviews from the apparel domain, about 100K reviews (2M words),
and I want to train word2vec to do some cool NLP stuff with it.
However, the size is not enough for creating an adequate word2vec model; it requires billions of words.
So the…
Ilia Kandrashou
- 51
- 1
- 3
5
votes
1 answer
Doc2vec to calculate cosine similarity - absolutely inaccurate
I'm trying to modify the Doc2vec tutorial to calculate cosine similarity and to take Pandas dataframes instead of .txt documents. I want to find the sentence from my data most similar to a new sentence I put in. However, after training, even if I give…
lte__
- 1,310
- 5
- 18
- 26
4
votes
2 answers
Does spaCy support multiple GPUs?
I was wondering if spaCy supports multi-GPU via mpi4py?
I am currently using spaCy's nlp.pipe for Named Entity Recognition on a high-performance-computing cluster that supports the MPI protocol and has many GPUs. It says here that I would need to…
Jinhua Wang
- 163
- 8
4
votes
1 answer
Predicting the missing word using fasttext pretrained word embedding models (CBOW vs skipgram)
I am trying to implement a simple word prediction algorithm for filling a gap in a sentence by choosing from several options:
Driving a ---- is not fun in London streets.
Apple
Car
Book
King
With the right model in place:
Question 1. What…
Kingstar
- 53
- 5
4
votes
1 answer
word2vec word embeddings creates very distant vectors, closest cosine similarity is still very far, only 0.7
I started using gensim's FastText to create word embeddings on a large corpus of a specialized domain (after finding that existing open source embeddings are not performing well on this domain), although I'm not using its character level n-grams, so…
Oren Matar
- 221
- 1
- 7