
I am using Word2Vec for text vectorization. It does a good job, but in some cases it fails. For example, "turn the computer off and on" and "restart the computer" do not get a very good similarity score, even though they mean the same thing. Doc2Vec does not do a good job either, as my inputs are usually a couple of sentences rather than a document.

Can anyone please suggest an approach that would give a good similarity score between "turn on and off" and "restart", and for other combinations like that?

Shamy
  • Use pre-trained embeddings, and form the document embeddings through [averaging the tf-idf scores](https://openreview.net/forum?id=SyK00v5xx) or concatenation of summary statistics (min, max, mean, std); a rough sketch of the tf-idf-weighted averaging follows after these comments. – Emre Aug 22 '17 at 06:04
  • This question is related to this other question: https://datascience.stackexchange.com/questions/22536/detect-related-sentences/22948#22948 – Brian Spiering Sep 20 '17 at 16:54
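As suggested in the first comment, one way to get sentence vectors from pre-trained word embeddings is to weight each word vector by its tf-idf score before averaging. The sketch below assumes gensim 4.x and scikit-learn; the model name "glove-wiki-gigaword-100" and all variable names are illustrative, not from the comment itself.

```python
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["turn the computer off and on", "restart the computer"]

wv = api.load("glove-wiki-gigaword-100")      # pre-trained word vectors
tfidf = TfidfVectorizer().fit(sentences)       # idf weights from the corpus
vocab = tfidf.vocabulary_

def sentence_vector(text):
    """Average the word vectors, weighting each word by its tf-idf score."""
    weights = tfidf.transform([text]).toarray()[0]
    tokens = [t for t in text.lower().split() if t in wv and t in vocab]
    if not tokens:
        return np.zeros(wv.vector_size)
    vecs = np.array([wv[t] * weights[vocab[t]] for t in tokens])
    return vecs.mean(axis=0)

a, b = sentence_vector(sentences[0]), sentence_vector(sentences[1])
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))   # cosine similarity
```

The concatenation of summary statistics (min, max, mean, std) mentioned in the same comment works analogously: stack the word vectors of a sentence and concatenate those statistics along the embedding dimension.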

2 Answers


If you are training your word2vec model yourself, then you should increase your training dataset; you can easily get a Wikipedia dump for that. If you are using a pre-trained model, you can always fine-tune it with additional data.
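A minimal sketch of continuing training with extra data, assuming gensim 4.x; the two corpora here are tiny placeholders, not the asker's actual data, so the resulting similarities are only illustrative.

```python
from gensim.models import Word2Vec

base_corpus = [["restart", "the", "computer"],
               ["turn", "the", "computer", "off", "and", "on"]]
extra_corpus = [["reboot", "the", "machine"],
                ["power", "cycle", "the", "device"]]

# Train on the original data (min_count=1 only because the corpus is tiny).
model = Word2Vec(base_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Add the new sentences to the vocabulary and keep training on them.
model.build_vocab(extra_corpus, update=True)
model.train(extra_corpus, total_examples=len(extra_corpus), epochs=model.epochs)

print(model.wv.similarity("restart", "reboot"))
```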

HatemB

One approach you could take is to build sentence vectors from the vectors generated for the individual words.

This post covers the different techniques you could use to achieve this.
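A minimal sketch of the simplest such technique: averaging the word vectors of each sentence and comparing the means, here via gensim's `KeyedVectors.n_similarity`. The pre-trained model name "glove-wiki-gigaword-100" is an assumption for illustration, not something named in the answer.

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

def tokens(text):
    """Keep only the tokens that are in the embedding vocabulary."""
    return [t for t in text.lower().split() if t in wv]

s1 = tokens("turn the computer off and on")
s2 = tokens("restart the computer")
print(wv.n_similarity(s1, s2))   # cosine similarity of the averaged vectors
```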

Ethan
Nischal Hp