
I'd like to compare how the same word is used across different sources, that is, how authors differ in their usage of ill-defined words such as "democracy".

A brief plan:

  1. Take the books mentioning the term "democracy" as plain text
  2. In each book, replace democracy with democracy_%AuthorName%
  3. Train a word2vec model on these books
  4. Calculate the distance between democracy_AuthorA, democracy_AuthorB, and other relabeled mentions of "democracy"

So each author's "democracy" gets its own vector, which is used for comparison.
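
To make the plan concrete, here is a rough sketch of steps 2-4 with gensim's word2vec (assuming the gensim 4.x API; the `books` dict, file paths, and hyperparameters are placeholders I made up, not tested choices):

```python
import re
from gensim.models import Word2Vec

# Hypothetical mapping from author name to plain-text file
books = {
    "AuthorA": "books/author_a.txt",
    "AuthorB": "books/author_b.txt",
}

sentences = []
for author, path in books.items():
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    # Step 2: relabel every mention of "democracy" with the author's name
    text = re.sub(r"\bdemocracy\b", f"democracy_{author.lower()}", text)
    # Crude tokenization, one "sentence" per line; underscores are kept
    sentences += [re.findall(r"[a-z_]+", line) for line in text.splitlines() if line.strip()]

# Step 3: train word2vec on the relabeled corpus
# (min_count lowered so the rare relabeled tokens are not dropped)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

# Step 4: cosine similarity between the per-author vectors (distance = 1 - similarity)
print(model.wv.similarity("democracy_authora", "democracy_authorb"))
```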

But it seems that word2vec needs far more text than a few books to train reliable vectors, especially since each relabeled word occurs only in the subset of books by one author. The official page recommends training sets of billions of words.

I just wanted to ask how large one author's subset of books should be to make such an inference with word2vec, or whether alternative tools are available.

Anton Tarasenko
  • Are the books you are using solely on the topic of democracy? If not, might not your distance metric get swamped by larger differences between the books' contents? This is a side effect of your problem being in a very high-dimensional space and being touched by the hand of the curse of dimensionality. Perhaps taking only a small region of text around the word of interest would help, but it is still a problem with significant dimension. – image_doctor Jul 23 '15 at 06:41
  • @image_doctor That's an excellent question. Technically, I expected "democracy" to be mentioned in different contexts across books. But you mean that the other words will "move away" the mentions of democracy because they themselves are not well placed relative to the content of other books? – Anton Tarasenko Jul 23 '15 at 08:08
  • Yes, that's the essence of it. Here goes a probably ill-thought-out metaphor: imagine chapters of books being represented by colours, and a book as a whole represented as the mixture of all the colours of its chapters. A book on democracy in western Europe would likely end up with an overall reddish hue as the sum of its chapters. If we represent tourism by blue, a book on tourism in Cuba, with a sole chapter on democracy and its influence on economic development, would have a strong blue hue. So the two books would appear very different when viewed as a whole. – image_doctor Jul 23 '15 at 08:38
  • That's the more accessible way of saying what a data scientist would phrase as: the vectors for the two books will be a long way apart in feature space and so will appear quite dissimilar. It's really hard to quantify beforehand how many examples you will need without playing with the data, but language is subtle and layered, so you will probably want as many as you can get ... and maybe more. Ultimately you won't know until you try. It's not a concrete answer, but unless someone has direct experience of doing a similar thing, it's probably the best you will get. – image_doctor Jul 23 '15 at 08:43
  • word2vec already only uses "a small region of text around the word of interest." The `window` parameter sets how many words in the context are used to train the model for your word _w_. – jamesmf Sep 09 '15 at 15:03
  • @AntonTarasenko So how much data did your project actually need? :) Would be nice to see the results if you published them. – help-ukraine-now Jan 08 '20 at 00:08
  • @politicalscientist I had not finished this project. – Anton Tarasenko Jan 09 '20 at 09:10

1 Answer


It sounds like Doc2Vec (or paragraph/context vectors) might be the right fit for this problem.

In a nutshell, in addition to the word vectors, you add a "context vector" (in your case, an embedding for the author) that is used to predict the center or context words.

This means that you benefit from all the data about "democracy" while also extracting an embedding for each author; combined, these should let you analyze each author's bias even with limited data per author.

You can use gensim's implementation; its documentation includes links to the source papers.
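
As a rough sketch of what that could look like with gensim's Doc2Vec (assuming the gensim 4.x API; the `books` dict and file paths are hypothetical placeholders, and the hyperparameters are untuned defaults):

```python
import re
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical mapping from author name to plain-text file
books = {
    "AuthorA": "books/author_a.txt",
    "AuthorB": "books/author_b.txt",
}

documents = []
for author, path in books.items():
    with open(path, encoding="utf-8") as f:
        for line in f.read().lower().splitlines():
            tokens = re.findall(r"[a-z]+", line)
            if tokens:
                # Every passage shares its author's tag, so the model learns
                # one "context vector" (paragraph vector) per author
                documents.append(TaggedDocument(words=tokens, tags=[author]))

# dm=1 (PV-DM): the author vector is combined with the surrounding words
# to predict the centre word
model = Doc2Vec(documents, vector_size=100, window=5, min_count=5, dm=1, epochs=20)

# How similar two authors are overall
print(model.dv.similarity("AuthorA", "AuthorB"))

# One heuristic for "author A's view of democracy": query with the sum of
# the author vector and the shared word vector for "democracy"
query = model.dv["AuthorA"] + model.wv["democracy"]
print(model.wv.most_similar(positive=[query], topn=10))
```

Whether the author vectors capture the kind of bias you are after will still depend on how much text you have per author, but at least every occurrence of "democracy" contributes to a single shared word vector rather than being split across authors.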

halflings