Questions tagged [doc2vec]
22 questions
3
votes
2 answers
Gensim doc2vec error: KeyError: "word 'senseless' not in vocabulary"
I am new to machine learning and tried doc2vec on the Quora duplicate dataset. new_dfx has columns 'question1' and 'question2', which hold the preprocessed questions in each row. The following is a tagged document sample:
input:
q_arr =…
Ankit Rohilla
- 31
- 2
2
votes
2 answers
Classification of similar text input features with text output label
I hope somebody can provide guidance/input/advice on my project, where I believe AI can help.
I have a general understanding of AI, but I lack formal training.
I've never built a neural net from scratch on my own.
Task
Build a classification model…
andrea
- 73
- 6
2
votes
0 answers
Preprocessing for Document Similarity Using Doc2Vec
I'm trying to determine document similarity using Doc2Vec on a large series of legal opinions, which can contain some highly jargonistic language and phrases (e.g. en banc, de novo, etc.). I'm wondering if anyone has any thoughts about the criteria…
user118648
- 21
- 1
2
votes
0 answers
What is the meaning of, or explanation for, having multiple tags in a Doc2Vec model's TaggedDocuments?
I've tried reading the other answers on this topic but I'm unsure if I understand completely.
For my dataset, I have a series of tagged documents, "good" or "bad." Each document belongs to an entity, and each entity has a different number of…
Jayke
- 21
- 1
2
votes
1 answer
Word2Vec vs. Doc2Vec Word Vectors
I am doing some analysis on document similarity and was also interested in word similarity. I know that doc2vec inherits from word2vec and by default trains using word vectors which we can access.
My question is:
Should we expect these word vectors…
Tylerr
- 146
- 3
2
votes
1 answer
DBSCAN on textual and numerical columns
I have a dataset which has two columns:
title price
sentence1 12
sentence2 13
I have used doc2vec to convert the sentences into vectors of size 100 as below:
LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
all_content = []
j=0
for…
Jazz
- 420
- 1
- 5
- 15
2
votes
1 answer
How to implement LSTM using Doc2Vec vectors to get representation?
Hi all. I'm a newbie in ML. I found and read a paper, "A Multi-Level Plagiarism Detection System Based on Deep Learning Algorithms", and want to implement this model, but I can't find a step-by-step guide to building it. How can LSTM make…
Omasaka Opacha Revok
- 21
- 3
2
votes
1 answer
Approach to semantic similarity between documents
I was wondering what approach people would take, or whether someone could point me in the right direction, on this challenge I have set myself. I am pretty new at this; I have covered some areas but want to expand my skill set.
Say you have an abstract from a research…
user5067291
- 151
- 2
2
votes
2 answers
How to examine if a Doc2Vec model is sufficiently trained?
I started experimenting with gensim's Doc2Vec for sentiment analysis. For the training of the embedding itself, I have seen examples using a reduced learning rate with a few tens or even a few hundred epochs. However, there does not seem to be a…
Shan Dou
- 131
- 2
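One sanity check sometimes used for this kind of question is to re-infer a vector for each training document and see how often it lands closest to that document's own trained vector. The sketch below is a minimal, library-free illustration of that idea, not an answer from the thread: `self_recall` and the toy vectors are hypothetical stand-ins, and in practice the two lists would come from a trained model (for example, the stored document vectors and freshly inferred ones).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def self_recall(trained, inferred):
    """Fraction of documents whose re-inferred vector is nearest to its own trained vector."""
    hits = 0
    for i, query in enumerate(inferred):
        # Index of the trained vector most similar to this re-inferred vector.
        best = max(range(len(trained)), key=lambda j: cosine(query, trained[j]))
        if best == i:
            hits += 1
    return hits / len(trained)

# Hypothetical toy vectors standing in for real model output.
trained = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
inferred = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
score = self_recall(trained, inferred)
print(score)
```

A score close to 1.0 suggests the model reproduces its own training documents consistently; a low score may point to too few epochs or an unsuitable learning rate, though it says nothing about downstream sentiment accuracy.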
1
vote
1 answer
Embedding from a Transformer-based model for a paragraph or document (like Doc2Vec)
I have a dataset that contains sequences of different lengths. On average, the sequence length is 600. The dataset looks like this:
S1 = ['Walk','Eat','Going school','Eat','Watching movie','Walk'......,'Sleep']
S2 = ['Eat','Eat','Going…
Bloodstone Programmer
- 300
- 2
- 3
- 9
1
vote
1 answer
Clustering using both text and numerical features
I have a dataset that contains 2 types of features: one generated from doc2vec and one numerical. I would like to perform clustering analysis on them. However, due to the size of doc2vec features, if I simply combine them into one…
E.TTT
- 11
- 1
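A possible approach to mixing a wide embedding block with a single numerical column, assuming the concern is that the embedding dimensions dominate distance computations, is to standardize every column and then shrink the embedding block by the square root of its width. The sketch below is a hedged, library-free illustration; `standardize`, `combine`, `text_weight`, and the toy data are all hypothetical names and values, not taken from the question.

```python
import math

def standardize(column):
    """Rescale a list of numbers to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # a constant column maps to all zeros
    return [(x - mean) / std for x in column]

def combine(text_vectors, numeric_values, text_weight=1.0):
    """Concatenate a down-weighted embedding block with one numeric column."""
    n_dims = len(text_vectors[0])
    # Standardize each embedding dimension separately.
    columns = [standardize([vec[d] for vec in text_vectors]) for d in range(n_dims)]
    numeric = standardize(numeric_values)
    # Dividing by sqrt(n_dims) gives the whole embedding block the same
    # total variance as one standardized column (when text_weight is 1).
    scale = text_weight / math.sqrt(n_dims)
    return [
        [columns[d][i] * scale for d in range(n_dims)] + [numeric[i]]
        for i in range(len(text_vectors))
    ]

# Hypothetical toy data: three documents with 4-dimensional embeddings.
text_vectors = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.0, 0.5, 0.5, 0.0]]
prices = [12.0, 13.0, 20.0]
features = combine(text_vectors, prices)
```

The resulting rows can be passed to any distance-based clustering algorithm; raising or lowering `text_weight` shifts the balance toward the text or the numerical feature.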
1
vote
0 answers
doc2vec - paragraph or article as document
I'm trying to train a doc2vec model on the German wiki corpus. While looking for best practices, I've found different ways to create the training data.
Should I split every Wikipedia article by each natural paragraph into several…
jonas
- 143
- 4
1
vote
0 answers
Document Similarity to List of Words in Sentiment Analysis
How would you go about finding document similarity to a list of words in Sentiment Analysis?
I'm looking to find document similarity to multiple lists of words in sentiment analysis. I had been working on this with my intern, but he is sorting by sentiment…
JohnT
- 111
- 5
1
vote
1 answer
Topic alignment / topic modelling
What is the most efficient method for detecting whether an article is mostly about a specific topic, but without lots of data for training? My task is to determine how much a document is about, e.g., the weather or holidays or several other specific…
piernik
- 51
- 2
1
vote
0 answers
T-SNE good clustering but SVM classification poor
I am trying to classify paragraph embedding vectors computed with doc2vec into 4 different classes using a non-linear SVM.
When I visualize the embeddings using TensorBoard's t-SNE, I can see that they are clustered quite well as in the…
Luca Massarelli
- 11
- 1