I have read the code of ELMo.
Based on my understanding, ELMo first initializes a word embedding matrix A for all the words in the vocabulary, then stacks an LSTM B on top of it, and finally uses the LSTM B's outputs to predict the next word at each position (a rough sketch of what I mean is below).
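This is just my own simplified illustration of that setup, not the actual ELMo code; the names (`EmbedLSTMLM`, `embedding_A`, `lstm_B`, the dimensions) are placeholders I made up:

```python
import torch
import torch.nn as nn

class EmbedLSTMLM(nn.Module):
    """Simplified forward language model: embedding A -> LSTM B -> next-word logits."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding_A = nn.Embedding(vocab_size, embed_dim)            # the matrix A I refer to
        self.lstm_B = nn.LSTM(embed_dim, hidden_dim, batch_first=True)    # the LSTM B
        self.decoder = nn.Linear(hidden_dim, vocab_size)                  # predicts each position's next word

    def forward(self, token_ids):
        x = self.embedding_A(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.lstm_B(x)             # (batch, seq_len, hidden_dim)
        return self.decoder(h)            # logits over the vocabulary at each position
```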
What I don't understand is why, after training, we can look up each word in the vocabulary in the embedding matrix A and take that row as the final word representation.
It seems that we throw away the information learned by the LSTM B.
Why does the embedding contain the information we want from the language model?
Why does the training process inject the information needed for a good word representation into the embedding matrix A? (A small snippet of what I mean by "look up" is below.)
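By "look up the representation from A after training" I mean something like the following (again only my own illustration, reusing the placeholder class above):

```python
# After training, take the row of matrix A for a given word id as its representation,
# ignoring the LSTM B entirely.
model = EmbedLSTMLM(vocab_size=10000)     # vocab_size is just an example value
# ... train the model as a language model ...

word_id = torch.tensor([42])               # some word's index in the vocabulary
word_vector = model.embedding_A(word_id)   # representation taken only from matrix A
```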