Are Word2Vec and Doc2Vec both distributional representation or distributed representation?

Question

I have read that a distributional representation is based on the distributional hypothesis: words occurring in similar contexts tend to have similar meanings.

Word2Vec and Doc2Vec are both modeled on this hypothesis. However, the original papers are titled "Distributed Representations of Words and Phrases and their Compositionality" and "Distributed Representations of Sentences and Documents". So, are these algorithms based on distributional representations or distributed representations?

What about other models such as LDA and LSA?

Tu N. · Answer 1 · 2016-03-21T21:50:28.860

5

Effectively, Word2Vec/Doc2Vec is based on the distributional hypothesis, where the context for each word is its nearby words. Similarly, LSA takes the entire document as the context. Both techniques solve the word embedding problem: embedding words into a continuous vector space while keeping semantically related words close together.

LDA, on the other hand, is not designed to solve the same problem. It addresses a different problem called topic modeling, which means finding latent topics in a set of documents.
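To make the two notions of context concrete, here is a minimal sketch in plain Python. The function names, the window size, and the two toy sentences are assumptions for illustration; this is not how Word2Vec or LSA is actually implemented, only how their raw contexts differ.

```python
from collections import Counter

def window_contexts(tokens, window=2):
    """Yield (target, context) pairs from a symmetric window (word2vec-style context)."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

def document_contexts(docs):
    """Count (word, document id) pairs: the whole document is the context (LSA-style)."""
    counts = Counter()
    for doc_id, tokens in enumerate(docs):
        for word in tokens:
            counts[(word, doc_id)] += 1
    return counts

# Toy corpus (hypothetical).
docs = [
    "the king rules the land".split(),
    "the queen rules the land".split(),
]

pairs = list(window_contexts(docs[0], window=1))  # nearby-word contexts
term_doc = document_contexts(docs)                # whole-document contexts
```

In the first case a word is described by its neighbours; in the second it is described by the documents it appears in.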

edited Mar 21 '16 at 21:50

answered Mar 21 '16 at 21:28


1

I received a reply on Google Groups stating that it's both distributed and distributional, from different perspectives: distributional in terms of the hypothesis used, and distributed in terms of the features being distributed across the vector space. – chmodsss Mar 21 '16 at 22:27
Yeah, the representation is distributed in the sense that a word vector captures multiple concepts, and each concept is itself a vector. For example, $v_{king}$ might capture two concepts, `male` in gender and `royal`, while $v_{queen}$ captures `female` in gender and `royal`. That's why $v_{king} - v_{queen} \sim v_{man} - v_{woman}$. – Tu N. Mar 22 '16 at 00:40

score 3 · Accepted Answer · edited May 02 '16 at 11:53

The reply from Andrey Kutuzov via Google Groups felt satisfactory:

I would say that word2vec algorithms are based on both.

When people say distributional representation, they usually mean the linguistic aspect: meaning is context, know the word by its company and other famous quotes.

But when people say distributed representation, it mostly doesn't have anything to do with linguistics. It is more about the computer science aspect. If I understand Mikolov and others correctly, the word distributed in their papers means that each single component of a vector representation does not have any meaning of its own. The interpretable features (for example, word contexts in the case of word2vec) are hidden and distributed among uninterpretable vector components: each component is responsible for several interpretable features, and each interpretable feature is bound to several components.

So, word2vec (and doc2vec) uses distributed representations technically, as a way to represent lexical semantics. At the same time, it is conceptually based on the distributional hypothesis: it works only because the distributional hypothesis is true (word meanings do correlate with their typical contexts).
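A toy sketch of what "distributed" means here, in plain Python. The three interpretable features and the mixing matrix `MIX` are invented for illustration and are not taken from word2vec. Real embeddings are learned, and the analogy relation holds only approximately; in this linear toy it is exact by construction.

```python
# Hypothetical interpretable features, in the order (male, female, royal).
interpretable = {
    "king":  [1, 0, 1],
    "queen": [0, 1, 1],
    "man":   [1, 0, 0],
    "woman": [0, 1, 0],
}

# Arbitrary invertible mixing matrix (an assumption for illustration).
# Row i says how interpretable feature i is spread over the 3 components,
# so every feature touches several components and every component mixes
# several features.
MIX = [
    [2, 1, -1],
    [1, -1, 2],
    [-1, 2, 1],
]

def mix(features):
    """Map interpretable features to a 'distributed' vector: features @ MIX."""
    return [sum(f * MIX[i][j] for i, f in enumerate(features)) for j in range(3)]

def sub(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

distributed = {word: mix(feats) for word, feats in interpretable.items()}

gender_offset_royal = sub(distributed["king"], distributed["queen"])
gender_offset_common = sub(distributed["man"], distributed["woman"])
royal_offset = sub(distributed["king"], distributed["man"])
```

No single component of `distributed["king"]` reads as "male" or "royal", yet the two gender offsets are the same vector, so the relational structure survives the mixing.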

But of course often the terms distributed and distributional are used interchangeably, increasing misunderstanding :)

score 2 · Answer 3 · edited Apr 13 '17 at 12:54

Turian, Joseph, Lev Ratinov, and Yoshua Bengio, "Word representations: a simple and general method for semi-supervised learning" (Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010), define distributional representations and distributed representations as follows:

A distributional word representation is based upon a cooccurrence matrix $F$ of size $W \times C$, where $W$ is the vocabulary size, each row $F_w$ is the initial representation of word $w$, and each column $F_c$ is some context. Sahlgren (2006) and Turney and Pantel (2010) describe a handful of possible design decisions in constructing $F$, including choice of context types (left window? right window? size of window?) and type of frequency count (raw? binary? tf-idf?). $F_w$ has dimensionality $W$, which can be too large to use $F_w$ as features for word $w$ in a supervised model. One can map $F$ to matrix $f$ of size $W \times d$, where $d \ll C$, using some function $g$, where $f = g(F)$. $f_w$ represents word $w$ as a vector with $d$ dimensions. The choice of $g$ is another design decision, although perhaps not as important as the statistics used to initially construct $F$.

A distributed representation is dense, low-dimensional, and real-valued. Distributed word representations are called word embeddings. Each dimension of the embedding represents a latent feature of the word, hopefully capturing useful syntactic and semantic properties. A distributed representation is compact, in the sense that it can represent an exponential number of clusters in the number of dimensions.
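The distributional construction in the first quoted definition can be sketched with NumPy. The choices below are assumptions for illustration, not something the paper prescribes: contexts are neighbouring words (so $C = W$ here), counts are raw, and $g$ is a truncated SVD, which is only one possible choice of $g$.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=1):
    """Build F (W x C) of raw counts; contexts are vocabulary words, so C == W."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    F = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    F[index[w], index[s[j]]] += 1
    return vocab, F

def reduce_svd(F, d):
    """One possible g: keep the top-d singular directions, f = U_d * S_d (W x d)."""
    U, S, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :d] * S[:d]

# Toy corpus (hypothetical).
sentences = [
    "the king rules the land".split(),
    "the queen rules the land".split(),
]

vocab, F = cooccurrence_matrix(sentences, window=1)
f = reduce_svd(F, d=2)  # dense, low-dimensional rows f_w
```

In this toy corpus `king` and `queen` occur in identical contexts, so their rows in `F`, and therefore in `f`, coincide. That is the distributional hypothesis in miniature, while the rows of `f` are dense and low-dimensional in the sense of the second definition.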

FYI: What's the difference between word vectors, word representations and vector embeddings?

The same confusion remains in this answer too: word2vec has properties of both representations. Let's see what it has in common with each. `Distributional`: it has a matrix of size $W \times C$ that is then reduced to $W \times d$, where $d$ is the embedding vector size, and it uses window sizes to determine the context. `Distributed`: dense, low-dimensional vectors that preserve latent features (semantic properties) in those dimensions. – chmodsss, Mar 25 '16 at 10:48
