I am new to machine learning and tried doc2vec on the Quora duplicate-questions dataset. new_dfx has columns 'question1' and 'question2', which hold a preprocessed question in each row. Here is a tagged document sample:

Input:

q_arr = np.append(new_dfx['question1'].values, new_dfx['question2'].values)
tagged_data1 = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(q_arr)]
tagged_data1[50001]

Output:

TaggedDocument(words=['senseless', 'movi', 'like', 'dilwal', 'happi', 'new', 'year', 'earn', 'easi', '100', 'crore', 'india'], tags=['50001'])

Input:

model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, workers=cores)
model_dbow1.build_vocab([x for x in tqdm(tagged_data1)])
train_documents1  = utils.shuffle(tagged_data1)
model_dbow1.train(tagged_data1,total_examples=len(train_documents1), epochs=30)

To check whether the model trained correctly:

model_dbow1.most_similar('senseless')

Error:

KeyError: "word 'senseless' not in vocabulary"

The data I gave the model for training contains the word "senseless", so why this error? Could anyone please help? Other words give output.

2 Answers

This is Doc2Vec, not Word2Vec, so I don't think you give a word to most_similar(). So instead of:

model_dbow1.most_similar('senseless')

I think you would do:

model_dbow1.most_similar('50001')

Alternatively, if you did want to search for a one-word sentence:

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.most_similar([vector])

(The above is a bit of guesswork, based on the online docs for Doc2Vec and some tutorials (e.g. Sentence similarity using Doc2vec). Where possible, give a fully reproducible example that we can test against.)
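
A version caveat, since the question doesn't show which gensim release is in use: gensim 4.x removed the forwarding methods from the model object itself, and document-tag lookups go through model.dv (called model.docvecs in 3.x). A minimal sketch under that assumption:

# gensim 4.x: document vectors live under model.dv
model_dbow1.dv.most_similar('50001')
# gensim 3.x equivalent:
# model_dbow1.docvecs.most_similar('50001')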

Darren Cook
  • For other words like 'new', it's working – Ankit Rohilla Jan 17 '23 at 06:31
  • What did my suggestions do? BTW, looking at your updated question, what is `word_tokenize()`? If you run `most_similar( word_tokenize('senseless') )` does it work? – Darren Cook Jan 17 '23 at 09:13
  • I think if model_dbow1.most_similar('new') is working, then model_dbow1.most_similar('senseless') should also work. I checked: the word "senseless" is not in model_dbow1.wv.vocab after training. I also checked the trim_rule parameter of doc2vec; I think the vocab is being trimmed. – Ankit Rohilla Jan 19 '23 at 11:05
  • @AnkitRohilla It would be very normal for `word_tokenize()` to be splitting 'senseless' into two tokens, which would explain all the behaviour you see. (I.e. why else would you be using a tokenize function!) But without more information about the modules you imported, or a fully reproducible example (which always shows all the import commands), or more information, we are reduced to guessing. – Darren Cook Jan 19 '23 at 13:47
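
Following up on the trim_rule comment above: gensim drops words that appear fewer than min_count times (the default is 5), which would explain why a frequent word like 'new' works while a rare one like 'senseless' raises a KeyError. A minimal check along those lines, reusing the question's imports and tagged_data1 (model_kept is a hypothetical name for a retrained model):

# Membership tests work on gensim 3.x and 4.x alike
print('new' in model_dbow1.wv)        # True if the word survived trimming
print('senseless' in model_dbow1.wv)  # False if it was trimmed by min_count

# Hypothetical retrain keeping rare words, so 'senseless' stays in the vocab
model_kept = Doc2Vec(dm=1, vector_size=300, negative=5, min_count=1, workers=cores)
model_kept.build_vocab(tagged_data1)
model_kept.train(tagged_data1, total_examples=len(tagged_data1), epochs=30)
print(model_kept.wv.most_similar('senseless'))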

The Doc2Vec API should be called like this:

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.wv.most_similar([vector])

Here is a complete working example:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from tqdm import tqdm

q_arr = " ".join(['senseless', 'movi', 'like', 'dilwal', 'happi', 'new', 'year', 'earn', 'easi', '100', 'crore', 'india'])
cores = 4
train_documents1 = q_arr
tagged_data1 = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(q_arr)]
model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, workers=cores)
model_dbow1.build_vocab([x for x in tqdm(tagged_data1)])
# train_documents1  = utils.shuffle(tagged_data1)
train_documents1  = tagged_data1
model_dbow1.train(tagged_data1,total_examples=len(train_documents1), epochs=30)

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.wv.most_similar([vector])
Brian Spiering