I have read the code of ELMo.
Based on my understanding, ELMo first initializes a word embedding matrix A for all the words in the vocabulary, then stacks an LSTM B on top of it, and finally uses the LSTM B's outputs to predict the next word at each position (a rough sketch of what I mean is below).
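This is just my own simplified illustration of that setup, not the actual ELMo code; the names (`EmbedLSTMLM`, `embedding_A`, `lstm_B`, the dimensions) are placeholders I made up:

```python
import torch
import torch.nn as nn

class EmbedLSTMLM(nn.Module):
    """Simplified forward language model: embedding A -> LSTM B -> next-word logits."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding_A = nn.Embedding(vocab_size, embed_dim)            # the matrix A I refer to
        self.lstm_B = nn.LSTM(embed_dim, hidden_dim, batch_first=True)    # the LSTM B
        self.decoder = nn.Linear(hidden_dim, vocab_size)                  # predicts each position's next word

    def forward(self, token_ids):
        x = self.embedding_A(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.lstm_B(x)             # (batch, seq_len, hidden_dim)
        return self.decoder(h)            # logits over the vocabulary at each position
```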
What I don't understand is why, after training, we can look up each word in the vocabulary in the embedding matrix A and take that row as the final word representation.
It seems that we throw away the information learned by the LSTM B.
Why does the embedding contain the information we want from the language model?
Why does the training process inject the information needed for a good word representation into the embedding matrix A? (A small snippet of what I mean by "look up" is below.)
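By "look up the representation from A after training" I mean something like the following (again only my own illustration, reusing the placeholder class above):

```python
# After training, take the row of matrix A for a given word id as its representation,
# ignoring the LSTM B entirely.
model = EmbedLSTMLM(vocab_size=10000)     # vocab_size is just an example value
# ... train the model as a language model ...

word_id = torch.tensor([42])               # some word's index in the vocabulary
word_vector = model.embedding_A(word_id)   # representation taken only from matrix A
```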