I'd like to compare how the same word is used across different sources, i.e. how authors differ in their usage of ill-defined words such as "democracy".
A brief plan:

- Take the books mentioning the term "democracy" as plain text.
- In each book, replace `democracy` with `democracy_%AuthorName%`.
- Train a `word2vec` model on these books.
- Calculate the distance between `democracy_AuthorA`, `democracy_AuthorB`, and other relabeled mentions of "democracy".
So each author's "democracy" gets its own vector, which is used for comparison.
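For concreteness, here is a minimal sketch of that relabel-train-compare pipeline using gensim's `Word2Vec` (the file names and author labels are hypothetical placeholders; any word2vec implementation would do):

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical input files mapped to author labels (placeholders).
books = {
    "smith_book.txt": "AuthorA",
    "jones_book.txt": "AuthorB",
}

sentences = []
for path, author in books.items():
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = simple_preprocess(line)  # lowercase + tokenize
            # Relabel every mention of "democracy" with the author tag.
            tokens = [f"democracy_{author}" if t == "democracy" else t
                      for t in tokens]
            if tokens:
                sentences.append(tokens)

# Train a single word2vec model over all relabeled books together,
# so every author-tagged token gets its own vector in one shared space.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# Cosine similarity between the two authors' "democracy" vectors.
print(model.wv.similarity("democracy_AuthorA", "democracy_AuthorB"))
```

One caveat with this sketch: the relabeled tokens are rare by construction, so `min_count` has to be low enough that they aren't silently dropped from the vocabulary before training.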
But it seems that word2vec requires far more data than several books to train reliable vectors, and each relabeled word occurs only in the subset of books written by one author. The official page recommends training datasets of billions of words.
So my question is: how large does the subset of one author's books need to be to support this kind of inference with word2vec, or with alternative tools, if any are available?